* [HMM 00/15] HMM (Heterogeneous Memory Management) v23
@ 2017-05-24 17:20 Jérôme Glisse
  2017-05-24 17:20 ` [HMM 01/15] hmm: heterogeneous memory management documentation Jérôme Glisse
                   ` (16 more replies)
  0 siblings, 17 replies; 66+ messages in thread
From: Jérôme Glisse @ 2017-05-24 17:20 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Dan Williams, Kirill A . Shutemov, John Hubbard, Jérôme Glisse

This patchset is on top of git://git.cmpxchg.org/linux-mmotm.git so that I
test the same kernel as the kbuild system. Git branch:

https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v23

Changes since v22 are the use of a static key for the special ZONE_DEVICE
case in put_page() and a build fix for architectures with no MMU.
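
For those unfamiliar with it, the static-key pattern used for that put_page()
special case looks roughly like the minimal sketch below. This is only an
illustration of the generic static-branch API, not the actual patch; the key
name and the device-page helper are placeholders.

 /* Minimal sketch only: key and helper names below are placeholders. */
 #include <linux/jump_label.h>
 #include <linux/mm.h>

 DEFINE_STATIC_KEY_FALSE(device_private_key);
 /* Enabled (static_branch_enable()) once the first un-addressable device
  * memory is hotplugged, so the common case pays almost nothing. */

 static inline void my_put_page(struct page *page)
 {
         if (static_branch_unlikely(&device_private_key) &&
             is_zone_device_page(page)) {
                 my_put_zone_device_page(page);  /* placeholder helper */
                 return;
         }

         if (put_page_testzero(page))
                 __put_page(page);
 }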

Everything else is the same. Below is the long description of what HMM is
about and why it is needed. At the end of this email I briefly describe each
patch and suggest reviewers for each of them.


Heterogeneous Memory Management (HMM) (description and justification)

Today device drivers expose a dedicated memory allocation API through their
device file, often relying on a combination of ioctl and mmap calls. The
device can only access and use memory allocated through this API. This
effectively splits the program address space into objects allocated for, and
usable by, the device and other regular memory (malloc, mmap of a file,
shared memory, ...) that is only accessible by the CPU (or accessible by the
device only in a very limited way, by pinning memory).

Allowing different isolated components of a program to use a device thus
requires duplicating the input data structures using the device memory
allocator. This is reasonable for simple data structures (arrays, grids,
images, ...) but it gets extremely complex with advanced data structures
(lists, trees, graphs, ...) that rely on a web of memory pointers. This is
becoming a serious limitation on the kind of workloads that can be offloaded
to devices like GPUs.

New industry standards like C++, OpenCL or CUDA are pushing to remove this
barrier. This requires a shared address space between the GPU device and the
CPU so that the GPU can access any memory of a process (while still obeying
memory protections such as read-only). This kind of feature is also appearing
in various other operating systems.

HMM is a set of helpers to facilitate several aspects of address space
sharing and device memory management. Unlike existing sharing mechanisms that
rely on pinning the pages used by a device, HMM relies on mmu_notifier to
propagate CPU page table updates to the device page table.
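
To make the mirroring side concrete, a device driver using the hmm_mirror API
introduced later in this series would look roughly like the sketch below
(hmm_mirror_register() and hmm_mirror_ops come from the mirror patch; every
my_* name is a hypothetical driver-side identifier):

 static void my_sync_cpu_device_pagetables(struct hmm_mirror *mirror,
                                           enum hmm_update_type update_type,
                                           unsigned long start,
                                           unsigned long end)
 {
         /* Invalidate the device page table for [start, end) and flush the
          * device TLB before returning; the callback is synchronous. */
 }

 static const struct hmm_mirror_ops my_mirror_ops = {
         .sync_cpu_device_pagetables = my_sync_cpu_device_pagetables,
 };

 /* Must be called with mm->mmap_sem held in write mode. */
 static int my_bind_address_space(struct my_context *ctx, struct mm_struct *mm)
 {
         ctx->mirror.ops = &my_mirror_ops;
         return hmm_mirror_register(&ctx->mirror, mm);
 }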

Duplicating the CPU page table is only one aspect necessary for efficiently
using a device like a GPU. GPU local memory has bandwidth in the terabytes-
per-second range, but GPUs are connected to main memory through a system bus
like PCIe that is limited to 32 GB/s (PCIe 4.0 x16). Thus it is necessary to
allow migration of process memory from main system memory to device memory.
The issue is that on platforms that only have PCIe, the device memory is not
accessible by the CPU with the same properties as main memory (cache
coherency, atomic operations, ...).

To allow migration from main memory to device memory, HMM provides a set of
helpers to hotplug device memory as a new type of ZONE_DEVICE memory which is
un-addressable by the CPU but still has struct pages representing it. This
allows most of the core kernel logic that deals with process memory to stay
oblivious to the peculiarities of device memory.
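
As a rough illustration, hotplugging device memory with those helpers looks
like the sketch below (hmm_devmem_add() and hmm_devmem_ops are described in
the documentation patch; the my_* callbacks, the error handling and the
fields of struct my_device are hypothetical):

 static const struct hmm_devmem_ops my_devmem_ops = {
         /* Called when the last reference on a device page is dropped. */
         .free   = my_devmem_free,
         /* Called when the CPU faults on an un-addressable device page;
          * must migrate the page back to system memory. */
         .fault  = my_devmem_fault,
 };

 static int my_hotplug_device_memory(struct my_device *mydev)
 {
         struct hmm_devmem *devmem;

         devmem = hmm_devmem_add(&my_devmem_ops, &mydev->dev,
                                 mydev->vram_size);
         if (IS_ERR_OR_NULL(devmem))
                 return -ENOMEM;
         mydev->devmem = devmem;
         return 0;
 }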

When a page backing an address of a process is migrated to device memory, the
CPU page table entry is set to a new, special swap entry. CPU access to such
an address triggers a migration back to system memory, just as if the page
had been swapped out to disk. HMM also blocks anyone from pinning a
ZONE_DEVICE page so that it can always be migrated back to system memory if
the CPU accesses it. Conversely, HMM does not migrate to device memory any
page that is pinned in system memory.

To allow efficient migration between device memory and main memory, a new
migrate_vma() helper is added with this patchset. It allows the device DMA
engine to be leveraged to perform the copy operation.
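
Based on the migrate_vma() interface documented in patch 1, a driver-initiated
migration of a small range could be sketched as follows (the my_* callbacks
and the fixed-size batch are hypothetical; a real driver would size and fill
the src/dst arrays to match its needs):

 static const struct migrate_vma_ops my_migrate_ops = {
         /* Allocate destination pages and program the device DMA engine to
          * copy the source pages; leave a dst entry empty to skip a page. */
         .alloc_and_copy         = my_alloc_and_copy,
         /* Update the device page table and clean up once the CPU page
          * table points to the new pages, or catch pages that did not move. */
         .finalize_and_map       = my_finalize_and_map,
 };

 static int my_migrate_range(struct vm_area_struct *vma,
                             unsigned long start, unsigned long end)
 {
         unsigned long npages = (end - start) >> PAGE_SHIFT;
         unsigned long src[64] = {}, dst[64] = {};

         if (npages > 64)        /* sketch only: a real driver batches */
                 return -EINVAL;

         return migrate_vma(&my_migrate_ops, vma, npages, start, end,
                            src, dst, NULL);
 }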

This feature will be used by upstream drivers like nouveau and mlx5, and
probably others in the future (amdgpu is the next suspect in line). We are
actively working on nouveau and mlx5 support. To test this patchset we also
worked with NVIDIA's closed-source driver team; they have more resources than
we do to test this kind of infrastructure, and also a bigger and better
userspace ecosystem with various real industry workloads that can be used to
test and profile HMM.

The expected workload is a program that builds a data set on the CPU (from
disk, from the network, from sensors, ...). The program uses a GPU API
(OpenCL, CUDA, ...) to give hints on memory placement for the input data and
also for the output buffer. The program calls the GPU API to schedule a GPU
job; this happens through a device-driver-specific ioctl. All of this is
hidden from the programmer's point of view in the case of a C++ compiler that
transparently offloads some parts of a program to the GPU. The program can
keep doing other work on the CPU while the GPU is crunching numbers.

It is expected that the CPU will not access the same data set as the GPU
while the GPU is working on it, but this is not mandatory. In fact we expect
some small memory objects to be actively accessed by both the GPU and the CPU
concurrently, as a synchronization channel and/or for monitoring purposes.
Such objects will stay in system memory and should not be bottlenecked by
system bus bandwidth (rare write and read accesses from both CPU and GPU).

As we are relying on the device driver API, HMM does not introduce any new
syscall nor does it modify any existing ones. It does not change any POSIX
semantics or behaviors. For instance, the child after a fork of a process
that is using HMM will not be impacted in any way, nor is there any data
hazard between child COW or parent COW of memory that was migrated to the
device prior to the fork.

HMM assumes a number of hardware features. The device must allow its page
table to be updated at any time (i.e. device jobs must be preemptible). The
device page table must provide memory protection such as read-only. The
device must track write accesses (dirty bit). The device must have a minimum
granularity that matches PAGE_SIZE (i.e. 4k).


Reviewers (just hints):
Patch 1  HMM documentation
Patch 2  introduces core infrastructure and definitions for HMM; a pretty
         small patch and easy to review
Patch 3  introduces the mirror functionality of HMM; it relies on
         mmu_notifier and thus someone familiar with that part would be
         in a better position to review
Patch 4  is a helper to snapshot the CPU page table while synchronizing
         with concurrent page table updates. Understanding mmu_notifier
         makes review easier.
Patch 5  is mostly a wrapper around handle_mm_fault()
Patch 6  adds a new add_pages() helper to avoid modifying each arch memory
         hotplug function
Patch 7  adds a new memory type for ZONE_DEVICE and also adds all the logic
         in various core mm paths to support this new type. Dan Williams and
         any core mm contributor are the best people to review each half of
         this patch
Patch 8  special-cases HMM ZONE_DEVICE pages inside put_page(). Kirill and
         Dan Williams are the best people to review this
Patch 9  adds helpers to hotplug un-addressable device memory as a new type
         of ZONE_DEVICE memory (the new type introduced in patch 3 of this
         series). This is boilerplate code around memory hotplug, and it
         also picks a free range of physical addresses for the device memory.
         Note that the physical addresses do not point to anything (at least
         as far as the kernel knows).
Patch 10 introduces a new hmm_device class as a helper for device drivers
         that want to expose multiple device memories under a common fake
         device driver. This is useful for multi-GPU configurations.
         Anyone familiar with device driver infrastructure can review this.
         Boilerplate code, really.
Patch 11 adds a new migrate mode. Anyone familiar with page migration is
         welcome to review.
Patch 12 introduces a new migration helper (migrate_vma()) that allows
         migrating a range of virtual addresses of a process using the
         device DMA engine to perform the copy. It is not limited to copies
         from and to the device but can also copy between any kind of source
         and destination memory. Again, anyone familiar with the migration
         code should be able to verify the logic.
Patch 13 optimizes the new migrate_vma() by unmapping pages while we are
         collecting them. This can be reviewed by any mm folks.
Patch 14 adds un-addressable memory migration to the helper introduced in
         patch 7; this can be reviewed by anyone familiar with the migration
         code.
Patch 15 adds a feature that allows the device to allocate non-present pages
         on the GPU when migrating a range of addresses to device memory.
         This is a helper for device drivers to avoid having to first
         allocate system memory before migrating to device memory.


Previous patchset posting :
v1 http://lwn.net/Articles/597289/
v2 https://lkml.org/lkml/2014/6/12/559
v3 https://lkml.org/lkml/2014/6/13/633
v4 https://lkml.org/lkml/2014/8/29/423
v5 https://lkml.org/lkml/2014/11/3/759
v6 http://lwn.net/Articles/619737/
v7 http://lwn.net/Articles/627316/
v8 https://lwn.net/Articles/645515/
v9 https://lwn.net/Articles/651553/
v10 https://lwn.net/Articles/654430/
v11 http://www.gossamer-threads.com/lists/linux/kernel/2286424
v12 http://www.kernelhub.org/?msg=972982&p=2
v13 https://lwn.net/Articles/706856/
v14 https://lkml.org/lkml/2016/12/8/344
v15 http://www.mail-archive.com/linux-kernel@xxxxxxxxxxxxxxx/msg1304107.html
v16 http://www.spinics.net/lists/linux-mm/msg119814.html
v17 https://lkml.org/lkml/2017/1/27/847
v18 https://lkml.org/lkml/2017/3/16/596
v19 https://lkml.org/lkml/2017/4/5/831
v20 https://lwn.net/Articles/720715/
v21 https://lkml.org/lkml/2017/4/24/747
v22 http://lkml.iu.edu/hypermail/linux/kernel/1705.2/05176.html

Jérôme Glisse (14):
  hmm: heterogeneous memory management documentation
  mm/hmm: heterogeneous memory management (HMM for short) v4
  mm/hmm/mirror: mirror process address space on device with HMM helpers
    v3
  mm/hmm/mirror: helper to snapshot CPU page table v3
  mm/hmm/mirror: device page fault handler
  mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory v3
  mm/ZONE_DEVICE: special case put_page() for device private pages v2
  mm/hmm/devmem: device memory hotplug using ZONE_DEVICE v5
  mm/hmm/devmem: dummy HMM device for ZONE_DEVICE memory v3
  mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY
  mm/migrate: new memory migration helper for use with device memory v4
  mm/migrate: migrate_vma() unmap page from vma while collecting pages
  mm/migrate: support un-addressable ZONE_DEVICE page in migration v2
  mm/migrate: allow migrate_vma() to alloc new page on empty entry v2

Michal Hocko (1):
  mm/memory_hotplug: introduce add_pages

 Documentation/vm/hmm.txt       |  362 ++++++++++++
 MAINTAINERS                    |    7 +
 arch/x86/Kconfig               |    4 +
 arch/x86/mm/init_64.c          |   22 +-
 fs/aio.c                       |    8 +
 fs/f2fs/data.c                 |    5 +-
 fs/hugetlbfs/inode.c           |    5 +-
 fs/proc/task_mmu.c             |    7 +
 fs/ubifs/file.c                |    5 +-
 include/linux/hmm.h            |  468 +++++++++++++++
 include/linux/ioport.h         |    1 +
 include/linux/memory_hotplug.h |   11 +
 include/linux/memremap.h       |   85 +++
 include/linux/migrate.h        |  115 ++++
 include/linux/migrate_mode.h   |    5 +
 include/linux/mm.h             |   25 +
 include/linux/mm_types.h       |    5 +
 include/linux/swap.h           |   24 +-
 include/linux/swapops.h        |   68 +++
 kernel/fork.c                  |    2 +
 kernel/memremap.c              |   53 +-
 mm/Kconfig                     |   47 ++
 mm/Makefile                    |    2 +-
 mm/balloon_compaction.c        |    8 +
 mm/hmm.c                       | 1234 ++++++++++++++++++++++++++++++++++++++++
 mm/memory.c                    |   61 ++
 mm/memory_hotplug.c            |   10 +-
 mm/migrate.c                   |  787 ++++++++++++++++++++++++-
 mm/mprotect.c                  |   14 +
 mm/page_vma_mapped.c           |   10 +
 mm/rmap.c                      |   25 +
 mm/zsmalloc.c                  |    8 +
 32 files changed, 3463 insertions(+), 30 deletions(-)
 create mode 100644 Documentation/vm/hmm.txt
 create mode 100644 include/linux/hmm.h
 create mode 100644 mm/hmm.c

-- 
2.9.4



* [HMM 01/15] hmm: heterogeneous memory management documentation
  2017-05-24 17:20 [HMM 00/15] HMM (Heterogeneous Memory Management) v23 Jérôme Glisse
@ 2017-05-24 17:20 ` Jérôme Glisse
  2017-06-24  6:15   ` John Hubbard
  2017-05-24 17:20 ` [HMM 02/15] mm/hmm: heterogeneous memory management (HMM for short) v4 Jérôme Glisse
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 66+ messages in thread
From: Jérôme Glisse @ 2017-05-24 17:20 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Dan Williams, Kirill A . Shutemov, John Hubbard, Jérôme Glisse

This adds documentation for HMM (Heterogeneous Memory Management). It
presents the motivation behind it, the features necessary for it to be
useful, and gives an overview of how this is implemented.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 Documentation/vm/hmm.txt | 362 +++++++++++++++++++++++++++++++++++++++++++++++
 MAINTAINERS              |   7 +
 2 files changed, 369 insertions(+)
 create mode 100644 Documentation/vm/hmm.txt

diff --git a/Documentation/vm/hmm.txt b/Documentation/vm/hmm.txt
new file mode 100644
index 0000000..a18ffc0
--- /dev/null
+++ b/Documentation/vm/hmm.txt
@@ -0,0 +1,362 @@
+Heterogeneous Memory Management (HMM)
+
+Transparently allow any component of a program to use any memory region of said
+program with a device, without using a device-specific memory allocator. This
+is becoming a requirement to simplify the use of advanced heterogeneous
+computing, where GPUs, DSPs or FPGAs are used to perform various computations.
+
+This document is divided as follows: in the first section I expose the problems
+related to the use of a device-specific allocator. In the second section I
+expose the hardware limitations that are inherent to many platforms. The third
+section gives an overview of the HMM design. The fourth section explains how
+CPU page table mirroring works and what HMM's purpose is in this context. The
+fifth section deals with how device memory is represented inside the kernel.
+Finally, the last section presents the new migration helper that allows the
+device DMA engine to be leveraged.
+
+
+-------------------------------------------------------------------------------
+
+1) Problems of using a device-specific memory allocator:
+
+Devices with a large amount of on-board memory (several gigabytes), like GPUs,
+have historically managed their memory through a dedicated driver-specific API.
+This creates a disconnect between memory allocated and managed by the device
+driver and regular application memory (private anonymous, shared memory or
+regular file-backed memory). From here on I will refer to this aspect as split
+address space. I use shared address space to refer to the opposite situation,
+i.e. one in which any memory region can be used by the device transparently.
+
+The address space is split because the device can only access memory allocated
+through the device-specific API. This implies that all memory objects in a
+program are not equal from the device point of view, which complicates large
+programs that rely on a wide set of libraries.
+
+Concretely, this means that code that wants to leverage a device like a GPU
+needs to copy objects between generically allocated memory (malloc, mmap
+private/shared) and memory allocated through the device driver API (this still
+ends up with an mmap, but of the device file).
+
+For flat data sets (array, grid, image, ...) this isn't too hard to achieve,
+but complex data sets (list, tree, ...) are hard to get right. Duplicating a
+complex data set requires re-mapping all the pointer relations between each of
+its elements. This is error prone, and programs get harder to debug because of
+the duplicated data set.
+
+Split address space also means that libraries cannot transparently use data
+they are getting from the core program or another library, and thus each
+library might have to duplicate its input data set using the device-specific
+allocator. Large projects suffer from this and waste resources because of the
+various memory copies.
+
+Duplicating each library API to accept as input or output memory allocated by
+each device-specific allocator is not a viable option. It would lead to a
+combinatorial explosion in the library entry points.
+
+Finally, with the advance of high-level language constructs (in C++ but in
+other languages too) it is now possible for the compiler to leverage GPUs or
+other devices without the programmer's knowledge. Some compiler-identified
+patterns are only doable with a shared address space. It is also more
+reasonable to use a shared address space for all the other patterns.
+
+
+-------------------------------------------------------------------------------
+
+2) System bus, device memory characteristics
+
+System buses cripple shared address spaces due to a few limitations. Most
+system buses only allow basic memory access from the device to main memory;
+even cache coherency is often optional. Access to device memory from the CPU
+is even more limited; more often than not it is not cache coherent.
+
+If we only consider the PCIe bus, then a device can access main memory (often
+through an IOMMU) and be cache coherent with the CPUs. However, it only allows
+a limited set of atomic operations from the device on main memory. This is
+worse in the other direction: the CPUs can only access a limited range of the
+device memory and cannot perform atomic operations on it. Thus device memory
+cannot be considered like regular memory from the kernel point of view.
+
+Another crippling factor is the limited bandwidth (~32 GB/s with PCIe 4.0 and
+16 lanes). This is 33 times less than the fastest GPU memory (1 TB/s). The
+final limitation is latency: access to main memory from the device has an
+order of magnitude higher latency than when the device accesses its own memory.
+
+Some platforms are developing new system buses or additions/modifications to
+PCIe to address some of these limitations (OpenCAPI, CCIX). They mainly allow
+two-way cache coherency between CPU and device, and allow all the atomic
+operations the architecture supports. Sadly, not all platforms are following
+this trend, and some major architectures are left without hardware solutions
+to these problems.
+
+So for a shared address space to make sense, not only must we allow the device
+to access any memory, but we must also permit any memory to be migrated to
+device memory while the device is using it (blocking CPU access while it
+happens).
+
+
+-------------------------------------------------------------------------------
+
+3) Shared address space and migration
+
+HMM intends to provide two main features. The first one is to share the
+address space by duplicating the CPU page table into the device page table, so
+that the same address points to the same memory, and this for any valid main
+memory address in the process address space.
+
+To achieve this, HMM offers a set of helpers to populate the device page table
+while keeping track of CPU page table updates. Device page table updates are
+not as easy as CPU page table updates. To update the device page table, you
+must allocate a buffer (or use a pool of pre-allocated buffers) and write
+GPU-specific commands into it to perform the update (unmap, cache invalidation
+and flush, ...). This cannot be done through common code for all devices.
+Hence why HMM provides helpers to factor out everything that can be factored
+out, while leaving the gory details to the device driver.
+
+The second mechanism HMM provides is a new kind of ZONE_DEVICE memory that
+allows allocating a struct page for each page of the device memory. Those
+pages are special because the CPU cannot map them. They do however allow
+migrating main memory to device memory using the existing migration mechanism,
+and from the CPU point of view everything looks as if the page had been
+swapped out to disk. Using a struct page gives the easiest and cleanest
+integration with existing mm mechanisms. Again here HMM only provides helpers,
+first to hotplug new ZONE_DEVICE memory for the device memory and second to
+perform the migration. Policy decisions on what and when to migrate are left
+to the device driver.
+
+Note that any CPU access to a device page triggers a page fault and a
+migration back to main memory, i.e. when a page backing a given address A is
+migrated from a main memory page to a device page, then any CPU access to
+address A triggers a page fault and initiates a migration back to main memory.
+
+
+With these two features, HMM not only allows a device to mirror a process
+address space, keeping both the CPU and device page tables synchronized, but
+also allows leveraging device memory by migrating the parts of a data set that
+are actively used by a device.
+
+
+-------------------------------------------------------------------------------
+
+4) Address space mirroring implementation and API
+
+Address space mirroring's main objective is to allow duplication of a range of
+the CPU page table into a device page table, and HMM helps keep both
+synchronized. A device driver that wants to mirror a process address space
+must start with the registration of an hmm_mirror struct:
+
+ int hmm_mirror_register(struct hmm_mirror *mirror,
+                         struct mm_struct *mm);
+ int hmm_mirror_register_locked(struct hmm_mirror *mirror,
+                                struct mm_struct *mm);
+
+The locked variant is to be used when the driver is already holding the
+mmap_sem of the mm in write mode. The mirror struct has a set of callbacks
+that are used to propagate CPU page table updates:
+
+ struct hmm_mirror_ops {
+     /* sync_cpu_device_pagetables() - synchronize page tables
+      *
+      * @mirror: pointer to struct hmm_mirror
+      * @update_type: type of update that occurred to the CPU page table
+      * @start: virtual start address of the range to update
+      * @end: virtual end address of the range to update
+      *
+      * This callback ultimately originates from mmu_notifiers when the CPU
+      * page table is updated. The device driver must update its page table
+      * in response to this callback. The update argument tells what action
+      * to perform.
+      *
+      * The device driver must not return from this callback until the device
+      * page tables are completely updated (TLBs flushed, etc); this is a
+      * synchronous call.
+      */
+      void (*update)(struct hmm_mirror *mirror,
+                     enum hmm_update action,
+                     unsigned long start,
+                     unsigned long end);
+ };
+
+The device driver must perform the update to the range following the action
+(turn the range read-only, or fully unmap it, ...). Once the driver callback
+returns, the device must be done with the update.
+
+
+When the device driver wants to populate a range of virtual addresses it can
+use either:
+ int hmm_vma_get_pfns(struct vm_area_struct *vma,
+                      struct hmm_range *range,
+                      unsigned long start,
+                      unsigned long end,
+                      hmm_pfn_t *pfns);
+ int hmm_vma_fault(struct vm_area_struct *vma,
+                   struct hmm_range *range,
+                   unsigned long start,
+                   unsigned long end,
+                   hmm_pfn_t *pfns,
+                   bool write,
+                   bool block);
+
+The first one (hmm_vma_get_pfns()) will only fetch present CPU page table
+entries and will not trigger a page fault on missing or non-present entries.
+The second one does trigger a page fault on missing or read-only entries if
+the write parameter is true. Page faults use the generic mm page fault code
+path, just like a CPU page fault.
+
+Both functions copy the CPU page table into their pfns array argument. Each
+entry in that array corresponds to an address in the virtual range. HMM
+provides a set of flags to help the driver identify special CPU page table
+entries.
+
+Locking within the update() callback is the most important aspect the driver
+must respect in order to keep things properly synchronized. The usage pattern
+is:
+
+ int driver_populate_range(...)
+ {
+      struct hmm_range range;
+      ...
+ again:
+      ret = hmm_vma_get_pfns(vma, &range, start, end, pfns);
+      if (ret)
+          return ret;
+      take_lock(driver->update);
+      if (!hmm_vma_range_done(vma, &range)) {
+          release_lock(driver->update);
+          goto again;
+      }
+
+      // Use pfns array content to update device page table
+
+      release_lock(driver->update);
+      return 0;
+ }
+
+The driver->update lock is the same lock that the driver takes inside its
+update() callback. That lock must be held before calling hmm_vma_range_done()
+to avoid any race with a concurrent CPU page table update.
+
+HMM implements all of this on top of the mmu_notifier API because we wanted a
+simpler API, and also to be able to perform optimizations later on, like doing
+concurrent device updates in a multi-device scenario.
+
+HMM also serves as an impedance-mismatch adapter between how CPU page table
+updates are done (by the CPU writing to the page table and flushing TLBs) and
+how devices update their own page tables. A device update is a multi-step
+process: first, appropriate commands are written to a buffer, then this buffer
+is scheduled for execution on the device. It is only once the device has
+executed the commands in the buffer that the update is done. Creating and
+scheduling update command buffers can happen concurrently for multiple
+devices. Waiting for each device to report commands as executed is serialized
+(there is no point in doing this concurrently).
+
+
+-------------------------------------------------------------------------------
+
+5) Represent and manage device memory from core kernel point of view
+
+Several different designs were tried to support device memory. The first one
+used a device-specific data structure to keep information about migrated
+memory, and HMM hooked itself into various places of the mm code to handle any
+access to addresses that were backed by device memory. It turns out that this
+ended up replicating most of the fields of struct page and also needed many
+kernel code paths to be updated to understand this new kind of memory.
+
+The thing is, most kernel code paths never try to access the memory behind a
+page but only care about struct page contents. Because of this, HMM switched
+to directly using struct page for device memory, which left most kernel code
+paths unaware of the difference. We only need to make sure that no one ever
+tries to map those pages from the CPU side.
+
+HMM provides a set of helpers to register and hotplug device memory as a new
+region needing struct pages. This is offered through a very simple API:
+
+ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
+                                   struct device *device,
+                                   unsigned long size);
+ void hmm_devmem_remove(struct hmm_devmem *devmem);
+
+The hmm_devmem_ops is where most of the important things are:
+
+ struct hmm_devmem_ops {
+     void (*free)(struct hmm_devmem *devmem, struct page *page);
+     int (*fault)(struct hmm_devmem *devmem,
+                  struct vm_area_struct *vma,
+                  unsigned long addr,
+                  struct page *page,
+                  unsigned flags,
+                  pmd_t *pmdp);
+ };
+
+The first callback (free()) happens when the last reference on a device page
+is dropped. This means the device page is now free and no longer used by
+anyone. The second callback happens whenever the CPU tries to access a device
+page, which it cannot do. This second callback must trigger a migration back
+to system memory; HMM provides a helper to do just that:
+
+ int hmm_devmem_fault_range(struct hmm_devmem *devmem,
+                            struct vm_area_struct *vma,
+                            const struct migrate_vma_ops *ops,
+                            unsigned long mentry,
+                            unsigned long *src,
+                            unsigned long *dst,
+                            unsigned long start,
+                            unsigned long addr,
+                            unsigned long end,
+                            void *private);
+
+It relies on the new migrate_vma() helper, which is a generic page migration
+helper that works on a range of virtual addresses instead of working on
+individual pages. It also allows the device DMA engine to be leveraged to
+perform the copy from device to main memory (or in the other direction). The
+next section goes over this new helper.
+
+
+-------------------------------------------------------------------------------
+
+6) Migrate to and from device memory
+
+Because the CPU cannot access device memory, migration must use the device
+DMA engine to perform the copy from and to device memory. For this we need a
+new migration helper:
+
+ int migrate_vma(const struct migrate_vma_ops *ops,
+                 struct vm_area_struct *vma,
+                 unsigned long mentries,
+                 unsigned long start,
+                 unsigned long end,
+                 unsigned long *src,
+                 unsigned long *dst,
+                 void *private);
+
+Unlike other migration functions, it works on a range of virtual addresses;
+there are two reasons for that. First, device DMA copy has a high setup
+overhead cost, and thus batching multiple pages is needed, as otherwise the
+migration overhead makes the whole exercise pointless. The second reason is
+that the driver triggers such migrations based on the range of addresses the
+device is actively accessing.
+
+The migrate_vma_ops struct defines two callbacks. The first one
+(alloc_and_copy()) controls destination memory allocation and the copy
+operation. The second one is there to allow the device driver to perform
+cleanup operations after migration.
+
+ struct migrate_vma_ops {
+     void (*alloc_and_copy)(struct vm_area_struct *vma,
+                            const unsigned long *src,
+                            unsigned long *dst,
+                            unsigned long start,
+                            unsigned long end,
+                            void *private);
+     void (*finalize_and_map)(struct vm_area_struct *vma,
+                              const unsigned long *src,
+                              const unsigned long *dst,
+                              unsigned long start,
+                              unsigned long end,
+                              void *private);
+ };
+
+It is important to stress that these migration helpers allow for holes in the
+virtual address range. Some pages in the range might not be migrated for all
+the usual reasons (page is pinned, page is locked, ...). The helper does not
+fail but just skips over those pages.
+
+The alloc_and_copy() callback might also decide not to migrate all pages in
+the range (for reasons under the callback's control). For those, the callback
+just has to leave the corresponding dst entry empty.
+
+Finally, the migration of the struct page might fail (for file-backed pages)
+for various reasons (failure to freeze the reference, or to update the page
+cache, ...). If that happens, then finalize_and_map() can catch any pages that
+were not migrated. Note that those pages were still copied to a new page and
+thus we wasted bandwidth, but this is considered a rare event and a price that
+we are willing to pay to keep all the code simpler.
diff --git a/MAINTAINERS b/MAINTAINERS
index 2daea20..91129bd 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7570,6 +7570,13 @@ M:	Sasha Levin <sasha.levin@oracle.com>
 S:	Maintained
 F:	tools/lib/lockdep/
 
+HMM - Heterogeneous Memory Management
+M:	Jérôme Glisse <jglisse@redhat.com>
+L:	linux-mm@kvack.org
+S:	Maintained
+F:	mm/hmm*
+F:	include/linux/hmm*
+
 LIBNVDIMM: NON-VOLATILE MEMORY DEVICE SUBSYSTEM
 M:	Dan Williams <dan.j.williams@intel.com>
 L:	linux-nvdimm@lists.01.org
-- 
2.9.4



* [HMM 02/15] mm/hmm: heterogeneous memory management (HMM for short) v4
  2017-05-24 17:20 [HMM 00/15] HMM (Heterogeneous Memory Management) v23 Jérôme Glisse
  2017-05-24 17:20 ` [HMM 01/15] hmm: heterogeneous memory management documentation Jérôme Glisse
@ 2017-05-24 17:20 ` Jérôme Glisse
  2017-05-31  2:10   ` Balbir Singh
  2017-05-24 17:20 ` [HMM 03/15] mm/hmm/mirror: mirror process address space on device with HMM helpers v3 Jérôme Glisse
                   ` (14 subsequent siblings)
  16 siblings, 1 reply; 66+ messages in thread
From: Jérôme Glisse @ 2017-05-24 17:20 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Dan Williams, Kirill A . Shutemov, John Hubbard,
	Jérôme Glisse, Evgeny Baskakov, Mark Hairgrove,
	Sherry Cheung, Subhash Gutti

HMM provides 3 separate types of functionality:
    - Mirroring: synchronize CPU page table and device page table
    - Device memory: allocating struct page for device memory
    - Migration: migrating regular memory to device memory

This patch introduces some common helpers and definitions for all 3 of
those functionalities.
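
As a quick illustration of the hmm_pfn_t encoding added below (the helpers
are the ones defined in include/linux/hmm.h by this patch; the wrapper
function is hypothetical):

 static void my_show_hmm_pfn_encoding(struct page *page)
 {
         hmm_pfn_t entry = hmm_pfn_t_from_page(page); /* VALID flag set */
         struct page *p = hmm_pfn_t_to_page(entry);   /* back to the page */
         unsigned long pfn = hmm_pfn_t_to_pfn(entry); /* raw pfn, -1UL if invalid */

         WARN_ON(p != page || pfn != page_to_pfn(page));
 }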

Changed since v3:
  - Unconditionally build hmm.c for static keys
Changed since v2:
  - s/device unaddressable/device private
Changed since v1:
  - Kconfig logic (depend on x86-64 and use ARCH_HAS pattern)

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/hmm.h      | 146 +++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mm_types.h |   5 ++
 kernel/fork.c            |   2 +
 mm/Kconfig               |  13 +++++
 mm/Makefile              |   2 +-
 mm/hmm.c                 |  74 ++++++++++++++++++++++++
 6 files changed, 241 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/hmm.h
 create mode 100644 mm/hmm.c

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
new file mode 100644
index 0000000..e24c7a7
--- /dev/null
+++ b/include/linux/hmm.h
@@ -0,0 +1,146 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/*
+ * Heterogeneous Memory Management (HMM)
+ *
+ * See Documentation/vm/hmm.txt for reasons and overview of what HMM is and
+ * what it is for. Here we focus on the HMM API description, with some
+ * explanation of the underlying implementation.
+ *
+ * Short description: HMM provides a set of helpers to share a virtual address
+ * space between CPU and a device, so that the device can access any valid
+ * address of the process (while still obeying memory protection). HMM also
+ * provides helpers to migrate process memory to device memory, and back. Each
+ * set of functionality (address space mirroring, and migration to and from
+ * device memory) can be used independently of the other.
+ *
+ *
+ * HMM address space mirroring API:
+ *
+ * Use HMM address space mirroring if you want to mirror range of the CPU page
+ * table of a process into a device page table. Here, "mirror" means "keep
+ * synchronized". Prerequisites: the device must provide the ability to write-
+ * protect its page tables (at PAGE_SIZE granularity), and must be able to
+ * recover from the resulting potential page faults.
+ *
+ * HMM guarantees that at any point in time, a given virtual address points to
+ * either the same memory in both CPU and device page tables (that is: CPU and
+ * device page tables each point to the same pages), or that one page table (CPU
+ * or device) points to no entry, while the other still points to the old page
+ * for the address. The latter case happens when the CPU page table update
+ * happens first, and then the update is mirrored over to the device page table.
+ * This does not cause any issue, because the CPU page table cannot start
+ * pointing to a new page until the device page table is invalidated.
+ *
+ * HMM uses mmu_notifiers to monitor the CPU page tables, and forwards any
+ * updates to each device driver that has registered a mirror. It also provides
+ * some API calls to help with taking a snapshot of the CPU page table, and to
+ * synchronize with any updates that might happen concurrently.
+ *
+ *
+ * HMM migration to and from device memory:
+ *
+ * HMM provides a set of helpers to hotplug device memory as ZONE_DEVICE, with
+ * a new MEMORY_DEVICE_PRIVATE type. This provides a struct page for each page
+ * of the device memory, and allows the device driver to manage its memory
+ * using those struct pages. Having struct pages for device memory makes
+ * migration easier. Because that memory is not addressable by the CPU it must
+ * never be pinned to the device; in other words, any CPU page fault can always
+ * cause the device memory to be migrated (copied/moved) back to regular memory.
+ *
+ * A new migrate helper (migrate_vma()) has been added (see mm/migrate.c) that
+ * allows use of a device DMA engine to perform the copy operation between
+ * regular system memory and device memory.
+ */
+#ifndef LINUX_HMM_H
+#define LINUX_HMM_H
+
+#include <linux/kconfig.h>
+
+#if IS_ENABLED(CONFIG_HMM)
+
+
+/*
+ * hmm_pfn_t - HMM uses its own pfn type to keep several flags per page
+ *
+ * Flags:
+ * HMM_PFN_VALID: pfn is valid
+ * HMM_PFN_WRITE: CPU page table has write permission set
+ */
+typedef unsigned long hmm_pfn_t;
+
+#define HMM_PFN_VALID (1 << 0)
+#define HMM_PFN_WRITE (1 << 1)
+#define HMM_PFN_SHIFT 2
+
+/*
+ * hmm_pfn_t_to_page() - return struct page pointed to by a valid hmm_pfn_t
+ * @pfn: hmm_pfn_t to convert to struct page
+ * Returns: struct page pointer if pfn is a valid hmm_pfn_t, NULL otherwise
+ *
+ * If the hmm_pfn_t is valid (ie valid flag set) then return the struct page
+ * matching the pfn value stored in the hmm_pfn_t. Otherwise return NULL.
+ */
+static inline struct page *hmm_pfn_t_to_page(hmm_pfn_t pfn)
+{
+	if (!(pfn & HMM_PFN_VALID))
+		return NULL;
+	return pfn_to_page(pfn >> HMM_PFN_SHIFT);
+}
+
+/*
+ * hmm_pfn_t_to_pfn() - return pfn value stored in a hmm_pfn_t
+ * @pfn: hmm_pfn_t to extract pfn from
+ * Returns: pfn value if hmm_pfn_t is valid, -1UL otherwise
+ */
+static inline unsigned long hmm_pfn_t_to_pfn(hmm_pfn_t pfn)
+{
+	if (!(pfn & HMM_PFN_VALID))
+		return -1UL;
+	return (pfn >> HMM_PFN_SHIFT);
+}
+
+/*
+ * hmm_pfn_t_from_page() - create a valid hmm_pfn_t value from struct page
+ * @page: struct page pointer for which to create the hmm_pfn_t
+ * Returns: valid hmm_pfn_t for the page
+ */
+static inline hmm_pfn_t hmm_pfn_t_from_page(struct page *page)
+{
+	return (page_to_pfn(page) << HMM_PFN_SHIFT) | HMM_PFN_VALID;
+}
+
+/*
+ * hmm_pfn_t_from_pfn() - create a valid hmm_pfn_t value from pfn
+ * @pfn: pfn value for which to create the hmm_pfn_t
+ * Returns: valid hmm_pfn_t for the pfn
+ */
+static inline hmm_pfn_t hmm_pfn_t_from_pfn(unsigned long pfn)
+{
+	return (pfn << HMM_PFN_SHIFT) | HMM_PFN_VALID;
+}
+
+
+/* Below are for HMM internal use only! Not to be used by device driver! */
+void hmm_mm_destroy(struct mm_struct *mm);
+
+#else /* IS_ENABLED(CONFIG_HMM) */
+
+/* Below are for HMM internal use only! Not to be used by device driver! */
+static inline void hmm_mm_destroy(struct mm_struct *mm) {}
+
+#endif /* IS_ENABLED(CONFIG_HMM) */
+#endif /* LINUX_HMM_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 45cdb27..5ddeacc7 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -23,6 +23,7 @@
 
 struct address_space;
 struct mem_cgroup;
+struct hmm;
 
 /*
  * Each physical page in the system has a struct page associated with
@@ -500,6 +501,10 @@ struct mm_struct {
 	atomic_long_t hugetlb_usage;
 #endif
 	struct work_struct async_put_work;
+#if IS_ENABLED(CONFIG_HMM)
+	/* HMM needs to track a few things per mm */
+	struct hmm *hmm;
+#endif
 };
 
 extern struct mm_struct init_mm;
diff --git a/kernel/fork.c b/kernel/fork.c
index 6210bb2..9843b16 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -37,6 +37,7 @@
 #include <linux/binfmts.h>
 #include <linux/mman.h>
 #include <linux/mmu_notifier.h>
+#include <linux/hmm.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/vmacache.h>
@@ -885,6 +886,7 @@ void __mmdrop(struct mm_struct *mm)
 	BUG_ON(mm == &init_mm);
 	mm_free_pgd(mm);
 	destroy_context(mm);
+	hmm_mm_destroy(mm);
 	mmu_notifier_mm_destroy(mm);
 	check_mm(mm);
 	put_user_ns(mm->user_ns);
diff --git a/mm/Kconfig b/mm/Kconfig
index b2356fe..8059ad6 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -289,6 +289,19 @@ config MIGRATION
 config ARCH_ENABLE_HUGEPAGE_MIGRATION
 	bool
 
+config ARCH_HAS_HMM
+	bool
+	default y
+	depends on X86_64
+	depends on ZONE_DEVICE
+	depends on MMU && 64BIT
+	depends on MEMORY_HOTPLUG
+	depends on MEMORY_HOTREMOVE
+	depends on SPARSEMEM_VMEMMAP
+
+config HMM
+	bool
+
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
 
diff --git a/mm/Makefile b/mm/Makefile
index 026f6a8..b964848 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -39,7 +39,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o \
 			   mm_init.o mmu_context.o percpu.o slab_common.o \
 			   compaction.o vmacache.o swap_slots.o \
 			   interval_tree.o list_lru.o workingset.o \
-			   debug.o $(mmu-y)
+			   debug.o hmm.o $(mmu-y)
 
 obj-y += init-mm.o
 
diff --git a/mm/hmm.c b/mm/hmm.c
new file mode 100644
index 0000000..88a7e10
--- /dev/null
+++ b/mm/hmm.c
@@ -0,0 +1,74 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/*
+ * Refer to include/linux/hmm.h for information about heterogeneous memory
+ * management or HMM for short.
+ */
+#include <linux/mm.h>
+#include <linux/hmm.h>
+#include <linux/slab.h>
+#include <linux/sched.h>
+
+
+#ifdef CONFIG_HMM
+/*
+ * struct hmm - HMM per mm struct
+ *
+ * @mm: mm struct this HMM struct is bound to
+ */
+struct hmm {
+	struct mm_struct	*mm;
+};
+
+/*
+ * hmm_register - register HMM against an mm (HMM internal)
+ *
+ * @mm: mm struct to attach to
+ *
+ * This is not intended to be used directly by device drivers. It allocates an
+ * HMM struct if mm does not have one, and initializes it.
+ */
+static struct hmm *hmm_register(struct mm_struct *mm)
+{
+	if (!mm->hmm) {
+		struct hmm *hmm = NULL;
+
+		hmm = kmalloc(sizeof(*hmm), GFP_KERNEL);
+		if (!hmm)
+			return NULL;
+		hmm->mm = mm;
+
+		spin_lock(&mm->page_table_lock);
+		if (!mm->hmm)
+			mm->hmm = hmm;
+		else
+			kfree(hmm);
+		spin_unlock(&mm->page_table_lock);
+	}
+
+	/*
+	 * The hmm struct can only be freed once the mm_struct goes away,
+	 * hence we should always have pre-allocated an new hmm struct
+	 * above.
+	 */
+	return mm->hmm;
+}
+
+void hmm_mm_destroy(struct mm_struct *mm)
+{
+	kfree(mm->hmm);
+}
+#endif /* CONFIG_HMM */
-- 
2.9.4



* [HMM 03/15] mm/hmm/mirror: mirror process address space on device with HMM helpers v3
  2017-05-24 17:20 [HMM 00/15] HMM (Heterogeneous Memory Management) v23 Jérôme Glisse
  2017-05-24 17:20 ` [HMM 01/15] hmm: heterogeneous memory management documentation Jérôme Glisse
  2017-05-24 17:20 ` [HMM 02/15] mm/hmm: heterogeneous memory management (HMM for short) v4 Jérôme Glisse
@ 2017-05-24 17:20 ` Jérôme Glisse
  2017-05-24 17:20 ` [HMM 04/15] mm/hmm/mirror: helper to snapshot CPU page table v3 Jérôme Glisse
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 66+ messages in thread
From: Jérôme Glisse @ 2017-05-24 17:20 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Dan Williams, Kirill A . Shutemov, John Hubbard,
	Jérôme Glisse, Evgeny Baskakov, Mark Hairgrove,
	Sherry Cheung, Subhash Gutti

This patch adds heterogeneous memory management (HMM) process address space
mirroring. In a nutshell, it provides an API to mirror a process address
space on a device. This boils down to keeping CPU and device page tables
synchronized (we assume that both device and CPU are cache coherent, as PCIe
devices can be).

This patch provides a simple API for device drivers to achieve address space
mirroring, thus avoiding the need for each device driver to grow its own CPU
page table walker and its own CPU page table synchronization mechanism.

This is useful for NVIDIA GPUs (Pascal and later), Mellanox IB (mlx5 and
later), and more hardware in the future.

Changed since v2:
  - s/device unaddressable/device private/
Changed since v1:
  - Kconfig logic (depend on x86-64 and use ARCH_HAS pattern)

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/hmm.h | 110 ++++++++++++++++++++++++++++++++++
 mm/Kconfig          |  12 ++++
 mm/hmm.c            | 168 +++++++++++++++++++++++++++++++++++++++++++++++-----
 3 files changed, 275 insertions(+), 15 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index e24c7a7..f72ce59 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -72,6 +72,7 @@
 
 #if IS_ENABLED(CONFIG_HMM)
 
+struct hmm;
 
 /*
  * hmm_pfn_t - HMM uses its own pfn type to keep several flags per page
@@ -134,6 +135,115 @@ static inline hmm_pfn_t hmm_pfn_t_from_pfn(unsigned long pfn)
 }
 
 
+#if IS_ENABLED(CONFIG_HMM_MIRROR)
+/*
+ * Mirroring: how to synchronize device page table with CPU page table.
+ *
+ * A device driver that is participating in HMM mirroring must always
+ * synchronize with CPU page table updates. For this, device drivers can either
+ * directly use mmu_notifier APIs or they can use the hmm_mirror API. Device
+ * drivers can decide to register one mirror per device per process, or just
+ * one mirror per process for a group of devices. The pattern is:
+ *
+ *      int device_bind_address_space(..., struct mm_struct *mm, ...)
+ *      {
+ *          struct device_address_space *das;
+ *
+ *          // Device driver specific initialization, and allocation of das
+ *          // which contains an hmm_mirror struct as one of its fields.
+ *          ...
+ *
+ *          ret = hmm_mirror_register(&das->mirror, mm, &device_mirror_ops);
+ *          if (ret) {
+ *              // Cleanup on error
+ *              return ret;
+ *          }
+ *
+ *          // Other device driver specific initialization
+ *          ...
+ *      }
+ *
+ * Once an hmm_mirror is registered for an address space, the device driver
+ * will get callbacks through sync_cpu_device_pagetables() operation (see
+ * hmm_mirror_ops struct).
+ *
+ * Device driver must not free the struct containing the hmm_mirror struct
+ * before calling hmm_mirror_unregister(). The expected usage is to do that when
+ * the device driver is unbinding from an address space.
+ *
+ *
+ *      void device_unbind_address_space(struct device_address_space *das)
+ *      {
+ *          // Device driver specific cleanup
+ *          ...
+ *
+ *          hmm_mirror_unregister(&das->mirror);
+ *
+ *          // Other device driver specific cleanup, and now das can be freed
+ *          ...
+ *      }
+ */
+
+struct hmm_mirror;
+
+/*
+ * enum hmm_update_type - type of update
+ * @HMM_UPDATE_INVALIDATE: invalidate range (no indication as to why)
+ */
+enum hmm_update_type {
+	HMM_UPDATE_INVALIDATE,
+};
+
+/*
+ * struct hmm_mirror_ops - HMM mirror device operations callback
+ *
+ * @update: callback to update range on a device
+ */
+struct hmm_mirror_ops {
+	/* sync_cpu_device_pagetables() - synchronize page tables
+	 *
+	 * @mirror: pointer to struct hmm_mirror
+	 * @update_type: type of update that occurred to the CPU page table
+	 * @start: virtual start address of the range to update
+	 * @end: virtual end address of the range to update
+	 *
+	 * This callback ultimately originates from mmu_notifiers when the CPU
+	 * page table is updated. The device driver must update its page table
+	 * in response to this callback. The update argument tells what action
+	 * to perform.
+	 *
+	 * The device driver must not return from this callback until the device
+	 * page tables are completely updated (TLBs flushed, etc); this is a
+	 * synchronous call.
+	 */
+	void (*sync_cpu_device_pagetables)(struct hmm_mirror *mirror,
+					   enum hmm_update_type update_type,
+					   unsigned long start,
+					   unsigned long end);
+};
+
+/*
+ * struct hmm_mirror - mirror struct for a device driver
+ *
+ * @hmm: pointer to struct hmm (which is unique per mm_struct)
+ * @ops: device driver callback for HMM mirror operations
+ * @list: for list of mirrors of a given mm
+ *
+ * Each address space (mm_struct) being mirrored by a device must register one
+ * instance of an hmm_mirror struct with HMM. HMM will track the list of all
+ * mirrors for each mm_struct.
+ */
+struct hmm_mirror {
+	struct hmm			*hmm;
+	const struct hmm_mirror_ops	*ops;
+	struct list_head		list;
+};
+
+int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
+void hmm_mirror_unregister(struct hmm_mirror *mirror);
+#endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
+
+
 /* Below are for HMM internal use only! Not to be used by device driver! */
 void hmm_mm_destroy(struct mm_struct *mm);
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 8059ad6..d744cff 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -302,6 +302,18 @@ config ARCH_HAS_HMM
 config HMM
 	bool
 
+config HMM_MIRROR
+	bool "HMM mirror CPU page table into a device page table"
+	depends on ARCH_HAS_HMM
+	select MMU_NOTIFIER
+	select HMM
+	help
+	  Select HMM_MIRROR if you want to mirror range of the CPU page table of a
+	  process into a device page table. Here, mirror means "keep synchronized".
+	  Prerequisites: the device must provide the ability to write-protect its
+	  page tables (at PAGE_SIZE granularity), and must be able to recover from
+	  the resulting potential page faults.
+
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 88a7e10..ff78c92 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -21,16 +21,27 @@
 #include <linux/hmm.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
+#include <linux/mmu_notifier.h>
 
 
 #ifdef CONFIG_HMM
+static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
+
 /*
  * struct hmm - HMM per mm struct
  *
  * @mm: mm struct this HMM struct is bound to
+ * @sequence: we track updates to the CPU page table with a sequence number
+ * @mirrors: list of mirrors for this mm
+ * @mmu_notifier: mmu notifier to track updates to CPU page table
+ * @mirrors_sem: read/write semaphore protecting the mirrors list
  */
 struct hmm {
 	struct mm_struct	*mm;
+	atomic_t		sequence;
+	struct list_head	mirrors;
+	struct mmu_notifier	mmu_notifier;
+	struct rw_semaphore	mirrors_sem;
 };
 
 /*
@@ -43,27 +54,48 @@ struct hmm {
  */
 static struct hmm *hmm_register(struct mm_struct *mm)
 {
-	if (!mm->hmm) {
-		struct hmm *hmm = NULL;
-
-		hmm = kmalloc(sizeof(*hmm), GFP_KERNEL);
-		if (!hmm)
-			return NULL;
-		hmm->mm = mm;
-
-		spin_lock(&mm->page_table_lock);
-		if (!mm->hmm)
-			mm->hmm = hmm;
-		else
-			kfree(hmm);
-		spin_unlock(&mm->page_table_lock);
-	}
+	struct hmm *hmm = READ_ONCE(mm->hmm);
+	bool cleanup = false;
 
 	/*
 	 * The hmm struct can only be freed once the mm_struct goes away,
 	 * hence we should always have pre-allocated an new hmm struct
 	 * above.
 	 */
+	if (hmm)
+		return hmm;
+
+	hmm = kmalloc(sizeof(*hmm), GFP_KERNEL);
+	if (!hmm)
+		return NULL;
+	INIT_LIST_HEAD(&hmm->mirrors);
+	init_rwsem(&hmm->mirrors_sem);
+	atomic_set(&hmm->sequence, 0);
+	hmm->mmu_notifier.ops = NULL;
+	hmm->mm = mm;
+
+	/*
+	 * We should only get here if hold the mmap_sem in write mode ie on
+	 * registration of first mirror through hmm_mirror_register()
+	 */
+	hmm->mmu_notifier.ops = &hmm_mmu_notifier_ops;
+	if (__mmu_notifier_register(&hmm->mmu_notifier, mm)) {
+		kfree(hmm);
+		return NULL;
+	}
+
+	spin_lock(&mm->page_table_lock);
+	if (!mm->hmm)
+		mm->hmm = hmm;
+	else
+		cleanup = true;
+	spin_unlock(&mm->page_table_lock);
+
+	if (cleanup) {
+		mmu_notifier_unregister(&hmm->mmu_notifier, mm);
+		kfree(hmm);
+	}
+
 	return mm->hmm;
 }
 
@@ -72,3 +104,109 @@ void hmm_mm_destroy(struct mm_struct *mm)
 	kfree(mm->hmm);
 }
 #endif /* CONFIG_HMM */
+
+#if IS_ENABLED(CONFIG_HMM_MIRROR)
+static void hmm_invalidate_range(struct hmm *hmm,
+				 enum hmm_update_type action,
+				 unsigned long start,
+				 unsigned long end)
+{
+	struct hmm_mirror *mirror;
+
+	down_read(&hmm->mirrors_sem);
+	list_for_each_entry(mirror, &hmm->mirrors, list)
+		mirror->ops->sync_cpu_device_pagetables(mirror, action,
+							start, end);
+	up_read(&hmm->mirrors_sem);
+}
+
+static void hmm_invalidate_page(struct mmu_notifier *mn,
+				struct mm_struct *mm,
+				unsigned long addr)
+{
+	unsigned long start = addr & PAGE_MASK;
+	unsigned long end = start + PAGE_SIZE;
+	struct hmm *hmm = mm->hmm;
+
+	VM_BUG_ON(!hmm);
+
+	atomic_inc(&hmm->sequence);
+	hmm_invalidate_range(mm->hmm, HMM_UPDATE_INVALIDATE, start, end);
+}
+
+static void hmm_invalidate_range_start(struct mmu_notifier *mn,
+				       struct mm_struct *mm,
+				       unsigned long start,
+				       unsigned long end)
+{
+	struct hmm *hmm = mm->hmm;
+
+	VM_BUG_ON(!hmm);
+
+	atomic_inc(&hmm->sequence);
+}
+
+static void hmm_invalidate_range_end(struct mmu_notifier *mn,
+				     struct mm_struct *mm,
+				     unsigned long start,
+				     unsigned long end)
+{
+	struct hmm *hmm = mm->hmm;
+
+	VM_BUG_ON(!hmm);
+
+	hmm_invalidate_range(mm->hmm, HMM_UPDATE_INVALIDATE, start, end);
+}
+
+static const struct mmu_notifier_ops hmm_mmu_notifier_ops = {
+	.invalidate_page	= hmm_invalidate_page,
+	.invalidate_range_start	= hmm_invalidate_range_start,
+	.invalidate_range_end	= hmm_invalidate_range_end,
+};
+
+/*
+ * hmm_mirror_register() - register a mirror against an mm
+ *
+ * @mirror: new mirror struct to register
+ * @mm: mm to register against
+ *
+ * To start mirroring a process address space, the device driver must register
+ * an HMM mirror struct.
+ *
+ * THE mm->mmap_sem MUST BE HELD IN WRITE MODE !
+ */
+int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
+{
+	/* Sanity check */
+	if (!mm || !mirror || !mirror->ops)
+		return -EINVAL;
+
+	mirror->hmm = hmm_register(mm);
+	if (!mirror->hmm)
+		return -ENOMEM;
+
+	down_write(&mirror->hmm->mirrors_sem);
+	list_add(&mirror->list, &mirror->hmm->mirrors);
+	up_write(&mirror->hmm->mirrors_sem);
+
+	return 0;
+}
+EXPORT_SYMBOL(hmm_mirror_register);
+
+/*
+ * hmm_mirror_unregister() - unregister a mirror
+ *
+ * @mirror: mirror struct to unregister
+ *
+ * Stop mirroring a process address space, and cleanup.
+ */
+void hmm_mirror_unregister(struct hmm_mirror *mirror)
+{
+	struct hmm *hmm = mirror->hmm;
+
+	down_write(&hmm->mirrors_sem);
+	list_del(&mirror->list);
+	up_write(&hmm->mirrors_sem);
+}
+EXPORT_SYMBOL(hmm_mirror_unregister);
+#endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
-- 
2.9.4


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [HMM 04/15] mm/hmm/mirror: helper to snapshot CPU page table v3
  2017-05-24 17:20 [HMM 00/15] HMM (Heterogeneous Memory Management) v23 Jérôme Glisse
                   ` (2 preceding siblings ...)
  2017-05-24 17:20 ` [HMM 03/15] mm/hmm/mirror: mirror process address space on device with HMM helpers v3 Jérôme Glisse
@ 2017-05-24 17:20 ` Jérôme Glisse
  2017-05-24 17:20 ` [HMM 05/15] mm/hmm/mirror: device page fault handler Jérôme Glisse
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 66+ messages in thread
From: Jérôme Glisse @ 2017-05-24 17:20 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Dan Williams, Kirill A . Shutemov, John Hubbard,
	Jérôme Glisse, Evgeny Baskakov, Mark Hairgrove,
	Sherry Cheung, Subhash Gutti

This does not use the existing page table walker because we want to share
the same code with our page fault handler.
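
To illustrate the intended use of the two helpers added by this patch, here
is a minimal driver-side sketch of the snapshot pattern documented with
hmm_vma_range_done() below. The dummy_* names and the pt_lock mutex are
hypothetical; only hmm_vma_get_pfns() and hmm_vma_range_done() are real API.
It assumes hmm_mirror_register() was already called and that the caller
holds mmap_sem in read mode:

static int dummy_mirror_snapshot(struct dummy_device *ddev,
				 struct vm_area_struct *vma,
				 unsigned long start, unsigned long end,
				 hmm_pfn_t *pfns)
{
	struct hmm_range range;
	int ret;

again:
	ret = hmm_vma_get_pfns(vma, &range, start, end, pfns);
	if (ret)
		return ret;

	/* Same lock the sync_cpu_device_pagetables() callback takes. */
	mutex_lock(&ddev->pt_lock);
	if (!hmm_vma_range_done(vma, &range)) {
		/* CPU page table changed under us, the snapshot is stale. */
		mutex_unlock(&ddev->pt_lock);
		goto again;
	}
	dummy_device_update_page_table(ddev, start, end, pfns);
	mutex_unlock(&ddev->pt_lock);
	return 0;
}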

Changed since v2:
  - s/device unaddressable/device private/
Changes since v1:
  - Use spinlock instead of rcu synchronized list traversal

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/hmm.h |  55 ++++++++++-
 mm/hmm.c            | 280 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 333 insertions(+), 2 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index f72ce59..f254856 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -79,13 +79,26 @@ struct hmm;
  *
  * Flags:
  * HMM_PFN_VALID: pfn is valid
+ * HMM_PFN_READ:  CPU page table has read permission set
  * HMM_PFN_WRITE: CPU page table has write permission set
+ * HMM_PFN_ERROR: corresponding CPU page table entry points to poisoned memory
+ * HMM_PFN_EMPTY: corresponding CPU page table entry is pte_none()
+ * HMM_PFN_SPECIAL: corresponding CPU page table entry is special; i.e., the
+ *      result of vm_insert_pfn() or vm_insert_page(). Therefore, it should not
+ *      be mirrored by a device, because the entry will never have HMM_PFN_VALID
+ *      set and the pfn value is undefined.
+ * HMM_PFN_DEVICE_UNADDRESSABLE: unaddressable device memory (ZONE_DEVICE)
  */
 typedef unsigned long hmm_pfn_t;
 
 #define HMM_PFN_VALID (1 << 0)
-#define HMM_PFN_WRITE (1 << 1)
-#define HMM_PFN_SHIFT 2
+#define HMM_PFN_READ (1 << 1)
+#define HMM_PFN_WRITE (1 << 2)
+#define HMM_PFN_ERROR (1 << 3)
+#define HMM_PFN_EMPTY (1 << 4)
+#define HMM_PFN_SPECIAL (1 << 5)
+#define HMM_PFN_DEVICE_UNADDRESSABLE (1 << 6)
+#define HMM_PFN_SHIFT 7
 
 /*
  * hmm_pfn_t_to_page() - return struct page pointed to by a valid hmm_pfn_t
@@ -241,6 +254,44 @@ struct hmm_mirror {
 
 int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
 void hmm_mirror_unregister(struct hmm_mirror *mirror);
+
+
+/*
+ * struct hmm_range - track invalidation lock on virtual address range
+ *
+ * @list: all range locks are on a list
+ * @start: range virtual start address (inclusive)
+ * @end: range virtual end address (exclusive)
+ * @pfns: array of pfns (big enough for the range)
+ * @valid: pfns array did not change since it has been filled by an HMM function
+ */
+struct hmm_range {
+	struct list_head	list;
+	unsigned long		start;
+	unsigned long		end;
+	hmm_pfn_t		*pfns;
+	bool			valid;
+};
+
+/*
+ * To snapshot the CPU page table, call hmm_vma_get_pfns(), then take a device
+ * driver lock that serializes device page table updates, then call
+ * hmm_vma_range_done(), to check if the snapshot is still valid. The same
+ * device driver page table update lock must also be used in the
+ * hmm_mirror_ops.sync_cpu_device_pagetables() callback, so that CPU page
+ * table invalidation serializes on it.
+ *
+ * YOU MUST CALL hmm_vma_range_done() ONCE AND ONLY ONCE EACH TIME YOU CALL
+ * hmm_vma_get_pfns() WITHOUT ERROR !
+ *
+ * IF YOU DO NOT FOLLOW THE ABOVE RULE THE SNAPSHOT CONTENT MIGHT BE INVALID !
+ */
+int hmm_vma_get_pfns(struct vm_area_struct *vma,
+		     struct hmm_range *range,
+		     unsigned long start,
+		     unsigned long end,
+		     hmm_pfn_t *pfns);
+bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
 
diff --git a/mm/hmm.c b/mm/hmm.c
index ff78c92..472d237 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -19,8 +19,12 @@
  */
 #include <linux/mm.h>
 #include <linux/hmm.h>
+#include <linux/rmap.h>
+#include <linux/swap.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
+#include <linux/swapops.h>
+#include <linux/hugetlb.h>
 #include <linux/mmu_notifier.h>
 
 
@@ -31,14 +35,18 @@ static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
  * struct hmm - HMM per mm struct
  *
  * @mm: mm struct this HMM struct is bound to
+ * @lock: lock protecting ranges list
  * @sequence: we track updates to the CPU page table with a sequence number
+ * @ranges: list of range being snapshotted
  * @mirrors: list of mirrors for this mm
  * @mmu_notifier: mmu notifier to track updates to CPU page table
  * @mirrors_sem: read/write semaphore protecting the mirrors list
  */
 struct hmm {
 	struct mm_struct	*mm;
+	spinlock_t		lock;
 	atomic_t		sequence;
+	struct list_head	ranges;
 	struct list_head	mirrors;
 	struct mmu_notifier	mmu_notifier;
 	struct rw_semaphore	mirrors_sem;
@@ -72,6 +80,8 @@ static struct hmm *hmm_register(struct mm_struct *mm)
 	init_rwsem(&hmm->mirrors_sem);
 	atomic_set(&hmm->sequence, 0);
 	hmm->mmu_notifier.ops = NULL;
+	INIT_LIST_HEAD(&hmm->ranges);
+	spin_lock_init(&hmm->lock);
 	hmm->mm = mm;
 
 	/*
@@ -112,6 +122,22 @@ static void hmm_invalidate_range(struct hmm *hmm,
 				 unsigned long end)
 {
 	struct hmm_mirror *mirror;
+	struct hmm_range *range;
+
+	spin_lock(&hmm->lock);
+	list_for_each_entry(range, &hmm->ranges, list) {
+		unsigned long addr, idx, npages;
+
+		if (end < range->start || start >= range->end)
+			continue;
+
+		range->valid = false;
+		addr = max(start, range->start);
+		idx = (addr - range->start) >> PAGE_SHIFT;
+		npages = (min(range->end, end) - addr) >> PAGE_SHIFT;
+		memset(&range->pfns[idx], 0, sizeof(*range->pfns) * npages);
+	}
+	spin_unlock(&hmm->lock);
 
 	down_read(&hmm->mirrors_sem);
 	list_for_each_entry(mirror, &hmm->mirrors, list)
@@ -209,4 +235,258 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror)
 	up_write(&hmm->mirrors_sem);
 }
 EXPORT_SYMBOL(hmm_mirror_unregister);
+
+static void hmm_pfns_special(hmm_pfn_t *pfns,
+			     unsigned long addr,
+			     unsigned long end)
+{
+	for (; addr < end; addr += PAGE_SIZE, pfns++)
+		*pfns = HMM_PFN_SPECIAL;
+}
+
+static int hmm_pfns_bad(unsigned long addr,
+			unsigned long end,
+			struct mm_walk *walk)
+{
+	struct hmm_range *range = walk->private;
+	hmm_pfn_t *pfns = range->pfns;
+	unsigned long i;
+
+	i = (addr - range->start) >> PAGE_SHIFT;
+	for (; addr < end; addr += PAGE_SIZE, i++)
+		pfns[i] = HMM_PFN_ERROR;
+
+	return 0;
+}
+
+static int hmm_vma_walk_hole(unsigned long addr,
+			     unsigned long end,
+			     struct mm_walk *walk)
+{
+	struct hmm_range *range = walk->private;
+	hmm_pfn_t *pfns = range->pfns;
+	unsigned long i;
+
+	i = (addr - range->start) >> PAGE_SHIFT;
+	for (; addr < end; addr += PAGE_SIZE, i++)
+		pfns[i] = HMM_PFN_EMPTY;
+
+	return 0;
+}
+
+static int hmm_vma_walk_clear(unsigned long addr,
+			      unsigned long end,
+			      struct mm_walk *walk)
+{
+	struct hmm_range *range = walk->private;
+	hmm_pfn_t *pfns = range->pfns;
+	unsigned long i;
+
+	i = (addr - range->start) >> PAGE_SHIFT;
+	for (; addr < end; addr += PAGE_SIZE, i++)
+		pfns[i] = 0;
+
+	return 0;
+}
+
+static int hmm_vma_walk_pmd(pmd_t *pmdp,
+			    unsigned long start,
+			    unsigned long end,
+			    struct mm_walk *walk)
+{
+	struct hmm_range *range = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	hmm_pfn_t *pfns = range->pfns;
+	unsigned long addr = start, i;
+	hmm_pfn_t flag;
+	pte_t *ptep;
+
+	i = (addr - range->start) >> PAGE_SHIFT;
+	flag = vma->vm_flags & VM_READ ? HMM_PFN_READ : 0;
+
+	if (pmd_none(*pmdp) || pmd_trans_unstable(pmdp)) {
+		pmd_t pmd;
+
+		pmd = pmd_read_atomic(pmdp);
+		barrier();
+		if (pmd_none(pmd))
+			return hmm_vma_walk_hole(start, end, walk);
+
+		if (pmd_bad(pmd))
+			return hmm_pfns_bad(start, end, walk);
+
+		if (pmd_protnone(pmd))
+			return hmm_vma_walk_clear(start, end, walk);
+
+		if (pmd_trans_huge(pmd) || pmd_devmap(pmd)) {
+			unsigned long pfn = pmd_pfn(pmd) + pte_index(addr);
+
+			flag |= pmd_write(pmd) ? HMM_PFN_WRITE : 0;
+			for (; addr < end; addr += PAGE_SIZE, i++, pfn++)
+				pfns[i] = hmm_pfn_t_from_pfn(pfn) | flag;
+			return 0;
+		} else {
+			/*
+			 * Something unusual is going on. Better to have the
+			 * driver assume there is nothing for this range and
+			 * let the fault code path sort out proper pages for the
+			 * range.
+			 */
+			return hmm_vma_walk_clear(start, end, walk);
+		}
+	}
+
+	ptep = pte_offset_map(pmdp, addr);
+	for (; addr < end; addr += PAGE_SIZE, ptep++, i++) {
+		pte_t pte = *ptep;
+
+		pfns[i] = 0;
+
+		if (pte_none(pte) || !pte_present(pte)) {
+			pfns[i] = HMM_PFN_EMPTY;
+			continue;
+		}
+
+		pfns[i] = hmm_pfn_t_from_pfn(pte_pfn(pte)) | flag;
+		pfns[i] |= pte_write(pte) ? HMM_PFN_WRITE : 0;
+	}
+	pte_unmap(ptep - 1);
+
+	return 0;
+}
+
+/*
+ * hmm_vma_get_pfns() - snapshot CPU page table for a range of virtual addresses
+ * @vma: virtual memory area containing the virtual address range
+ * @range: used to track snapshot validity
+ * @start: range virtual start address (inclusive)
+ * @end: range virtual end address (exclusive)
+ * @pfns: array of hmm_pfn_t: provided by the caller, filled in by the function
+ * Returns: -EINVAL if invalid argument, -ENOMEM out of memory, 0 success
+ *
+ * This snapshots the CPU page table for a range of virtual addresses. Snapshot
+ * validity is tracked by range struct. See hmm_vma_range_done() for further
+ * information.
+ *
+ * The range struct is initialized here. It tracks the CPU page table, but only
+ * if the function returns success (0), in which case the caller must then call
+ * hmm_vma_range_done() to stop CPU page table update tracking on this range.
+ *
+ * NOT CALLING hmm_vma_range_done() IF FUNCTION RETURNS 0 WILL LEAD TO SERIOUS
+ * MEMORY CORRUPTION ! YOU HAVE BEEN WARNED !
+ */
+int hmm_vma_get_pfns(struct vm_area_struct *vma,
+		     struct hmm_range *range,
+		     unsigned long start,
+		     unsigned long end,
+		     hmm_pfn_t *pfns)
+{
+	struct mm_walk mm_walk;
+	struct hmm *hmm;
+
+	/* FIXME support hugetlb fs */
+	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
+		hmm_pfns_special(pfns, start, end);
+		return -EINVAL;
+	}
+
+	/* Sanity check, this really should not happen ! */
+	if (start < vma->vm_start || start >= vma->vm_end)
+		return -EINVAL;
+	if (end < vma->vm_start || end > vma->vm_end)
+		return -EINVAL;
+
+	hmm = hmm_register(vma->vm_mm);
+	if (!hmm)
+		return -ENOMEM;
+	/* Caller must have registered a mirror, via hmm_mirror_register() ! */
+	if (!hmm->mmu_notifier.ops)
+		return -EINVAL;
+
+	/* Initialize range to track CPU page table update */
+	range->start = start;
+	range->pfns = pfns;
+	range->end = end;
+	spin_lock(&hmm->lock);
+	range->valid = true;
+	list_add_rcu(&range->list, &hmm->ranges);
+	spin_unlock(&hmm->lock);
+
+	mm_walk.vma = vma;
+	mm_walk.mm = vma->vm_mm;
+	mm_walk.private = range;
+	mm_walk.pte_entry = NULL;
+	mm_walk.test_walk = NULL;
+	mm_walk.hugetlb_entry = NULL;
+	mm_walk.pmd_entry = hmm_vma_walk_pmd;
+	mm_walk.pte_hole = hmm_vma_walk_hole;
+
+	walk_page_range(start, end, &mm_walk);
+
+	return 0;
+}
+EXPORT_SYMBOL(hmm_vma_get_pfns);
+
+/*
+ * hmm_vma_range_done() - stop tracking change to CPU page table over a range
+ * @vma: virtual memory area containing the virtual address range
+ * @range: range being tracked
+ * Returns: false if range data has been invalidated, true otherwise
+ *
+ * Range struct is used to track updates to the CPU page table after a call to
+ * either hmm_vma_get_pfns() or hmm_vma_fault(). Once the device driver is done
+ * using the data,  or wants to lock updates to the data it got from those
+ * functions, it must call the hmm_vma_range_done() function, which will then
+ * stop tracking CPU page table updates.
+ *
+ * Note that device driver must still implement general CPU page table update
+ * tracking either by using hmm_mirror (see hmm_mirror_register()) or by using
+ * the mmu_notifier API directly.
+ *
+ * CPU page table update tracking done through hmm_range is only temporary and
+ * to be used while trying to duplicate CPU page table contents for a range of
+ * virtual addresses.
+ *
+ * There are two ways to use this :
+ * again:
+ *   hmm_vma_get_pfns(vma, range, start, end, pfns);
+ *   trans = device_build_page_table_update_transaction(pfns);
+ *   device_page_table_lock();
+ *   if (!hmm_vma_range_done(vma, range)) {
+ *     device_page_table_unlock();
+ *     goto again;
+ *   }
+ *   device_commit_transaction(trans);
+ *   device_page_table_unlock();
+ *
+ * Or:
+ *   hmm_vma_get_pfns(vma, range, start, end, pfns);
+ *   device_page_table_lock();
+ *   hmm_vma_range_done(vma, range);
+ *   device_update_page_table(pfns);
+ *   device_page_table_unlock();
+ */
+bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range)
+{
+	unsigned long npages = (range->end - range->start) >> PAGE_SHIFT;
+	struct hmm *hmm;
+
+	if (range->end <= range->start) {
+		BUG();
+		return false;
+	}
+
+	hmm = hmm_register(vma->vm_mm);
+	if (!hmm) {
+		memset(range->pfns, 0, sizeof(*range->pfns) * npages);
+		return false;
+	}
+
+	spin_lock(&hmm->lock);
+	list_del_rcu(&range->list);
+	spin_unlock(&hmm->lock);
+
+	return range->valid;
+}
+EXPORT_SYMBOL(hmm_vma_range_done);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
-- 
2.9.4


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [HMM 05/15] mm/hmm/mirror: device page fault handler
  2017-05-24 17:20 [HMM 00/15] HMM (Heterogeneous Memory Management) v23 Jérôme Glisse
                   ` (3 preceding siblings ...)
  2017-05-24 17:20 ` [HMM 04/15] mm/hmm/mirror: helper to snapshot CPU page table v3 Jérôme Glisse
@ 2017-05-24 17:20 ` Jérôme Glisse
  2017-05-24 17:20 ` [HMM 06/15] mm/memory_hotplug: introduce add_pages Jérôme Glisse
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 66+ messages in thread
From: Jérôme Glisse @ 2017-05-24 17:20 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Dan Williams, Kirill A . Shutemov, John Hubbard,
	Jérôme Glisse, Evgeny Baskakov, Mark Hairgrove,
	Sherry Cheung, Subhash Gutti

This handles page faults on behalf of a device driver; unlike handle_mm_fault(),
it does not trigger migration of device memory back to system memory.
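
As with the snapshot helper, here is a minimal, hypothetical sketch of how a
driver might drive hmm_vma_fault(); it follows the usage pattern documented
in the patch, the dummy_* names are made up, and a real driver would
re-look-up the vma after mmap_sem has been re-acquired:

static int dummy_handle_device_fault(struct dummy_device *ddev,
				     struct vm_area_struct *vma,
				     struct hmm_range *range,
				     unsigned long start, unsigned long end,
				     hmm_pfn_t *pfns, bool write)
{
	struct mm_struct *mm = vma->vm_mm;
	int ret;

retry:
	down_read(&mm->mmap_sem);
	ret = hmm_vma_fault(vma, range, start, end, pfns, write, false);
	if (ret == -EAGAIN) {
		/* mmap_sem was dropped for us and range tracking stopped. */
		goto retry;
	}
	if (ret) {
		up_read(&mm->mmap_sem);
		return ret;
	}

	/* Same lock the sync_cpu_device_pagetables() callback takes. */
	mutex_lock(&ddev->pt_lock);
	hmm_vma_range_done(vma, range);
	dummy_device_update_page_table(ddev, start, end, pfns);
	mutex_unlock(&ddev->pt_lock);
	up_read(&mm->mmap_sem);
	return 0;
}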

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/hmm.h |  27 ++++++
 mm/hmm.c            | 256 +++++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 271 insertions(+), 12 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index f254856..248a6e0 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -292,6 +292,33 @@ int hmm_vma_get_pfns(struct vm_area_struct *vma,
 		     unsigned long end,
 		     hmm_pfn_t *pfns);
 bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range);
+
+
+/*
+ * Fault memory on behalf of device driver. Unlike handle_mm_fault(), this will
+ * not migrate any device memory back to system memory. The hmm_pfn_t array will
+ * be updated with the fault result and current snapshot of the CPU page table
+ * for the range.
+ *
+ * The mmap_sem must be taken in read mode before entering and it might be
+ * dropped by the function if the block argument is false. In that case, the
+ * function returns -EAGAIN.
+ *
+ * Return value does not reflect if the fault was successful for every single
+ * address or not. Therefore, the caller must inspect the hmm_pfn_t array to
+ * determine fault status for each address.
+ *
+ * Trying to fault inside an invalid vma will result in -EINVAL.
+ *
+ * See the function description in mm/hmm.c for further documentation.
+ */
+int hmm_vma_fault(struct vm_area_struct *vma,
+		  struct hmm_range *range,
+		  unsigned long start,
+		  unsigned long end,
+		  hmm_pfn_t *pfns,
+		  bool write,
+		  bool block);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 472d237..e7d5a36 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -236,6 +236,36 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror)
 }
 EXPORT_SYMBOL(hmm_mirror_unregister);
 
+struct hmm_vma_walk {
+	struct hmm_range	*range;
+	unsigned long		last;
+	bool			fault;
+	bool			block;
+	bool			write;
+};
+
+static int hmm_vma_do_fault(struct mm_walk *walk,
+			    unsigned long addr,
+			    hmm_pfn_t *pfn)
+{
+	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_REMOTE;
+	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	int r;
+
+	flags |= hmm_vma_walk->block ? 0 : FAULT_FLAG_ALLOW_RETRY;
+	flags |= hmm_vma_walk->write ? FAULT_FLAG_WRITE : 0;
+	r = handle_mm_fault(vma, addr, flags);
+	if (r & VM_FAULT_RETRY)
+		return -EBUSY;
+	if (r & VM_FAULT_ERROR) {
+		*pfn = HMM_PFN_ERROR;
+		return -EFAULT;
+	}
+
+	return -EAGAIN;
+}
+
 static void hmm_pfns_special(hmm_pfn_t *pfns,
 			     unsigned long addr,
 			     unsigned long end)
@@ -259,34 +289,62 @@ static int hmm_pfns_bad(unsigned long addr,
 	return 0;
 }
 
+static void hmm_pfns_clear(hmm_pfn_t *pfns,
+			   unsigned long addr,
+			   unsigned long end)
+{
+	for (; addr < end; addr += PAGE_SIZE, pfns++)
+		*pfns = 0;
+}
+
 static int hmm_vma_walk_hole(unsigned long addr,
 			     unsigned long end,
 			     struct mm_walk *walk)
 {
-	struct hmm_range *range = walk->private;
+	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct hmm_range *range = hmm_vma_walk->range;
 	hmm_pfn_t *pfns = range->pfns;
 	unsigned long i;
 
+	hmm_vma_walk->last = addr;
 	i = (addr - range->start) >> PAGE_SHIFT;
-	for (; addr < end; addr += PAGE_SIZE, i++)
+	for (; addr < end; addr += PAGE_SIZE, i++) {
 		pfns[i] = HMM_PFN_EMPTY;
+		if (hmm_vma_walk->fault) {
+			int ret;
 
-	return 0;
+			ret = hmm_vma_do_fault(walk, addr, &pfns[i]);
+			if (ret != -EAGAIN)
+				return ret;
+		}
+	}
+
+	return hmm_vma_walk->fault ? -EAGAIN : 0;
 }
 
 static int hmm_vma_walk_clear(unsigned long addr,
 			      unsigned long end,
 			      struct mm_walk *walk)
 {
-	struct hmm_range *range = walk->private;
+	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct hmm_range *range = hmm_vma_walk->range;
 	hmm_pfn_t *pfns = range->pfns;
 	unsigned long i;
 
+	hmm_vma_walk->last = addr;
 	i = (addr - range->start) >> PAGE_SHIFT;
-	for (; addr < end; addr += PAGE_SIZE, i++)
+	for (; addr < end; addr += PAGE_SIZE, i++) {
 		pfns[i] = 0;
+		if (hmm_vma_walk->fault) {
+			int ret;
 
-	return 0;
+			ret = hmm_vma_do_fault(walk, addr, &pfns[i]);
+			if (ret != -EAGAIN)
+				return ret;
+		}
+	}
+
+	return hmm_vma_walk->fault ? -EAGAIN : 0;
 }
 
 static int hmm_vma_walk_pmd(pmd_t *pmdp,
@@ -294,15 +352,18 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 			    unsigned long end,
 			    struct mm_walk *walk)
 {
-	struct hmm_range *range = walk->private;
+	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct hmm_range *range = hmm_vma_walk->range;
 	struct vm_area_struct *vma = walk->vma;
 	hmm_pfn_t *pfns = range->pfns;
 	unsigned long addr = start, i;
+	bool write_fault;
 	hmm_pfn_t flag;
 	pte_t *ptep;
 
 	i = (addr - range->start) >> PAGE_SHIFT;
 	flag = vma->vm_flags & VM_READ ? HMM_PFN_READ : 0;
+	write_fault = hmm_vma_walk->fault & hmm_vma_walk->write;
 
 	if (pmd_none(*pmdp) || pmd_trans_unstable(pmdp)) {
 		pmd_t pmd;
@@ -321,6 +382,9 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 		if (pmd_trans_huge(pmd) || pmd_devmap(pmd)) {
 			unsigned long pfn = pmd_pfn(pmd) + pte_index(addr);
 
+			if (write_fault && !pmd_write(pmd))
+				return hmm_vma_walk_clear(start, end, walk);
+
 			flag |= pmd_write(pmd) ? HMM_PFN_WRITE : 0;
 			for (; addr < end; addr += PAGE_SIZE, i++, pfn++)
 				pfns[i] = hmm_pfn_t_from_pfn(pfn) | flag;
@@ -342,13 +406,55 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 
 		pfns[i] = 0;
 
-		if (pte_none(pte) || !pte_present(pte)) {
+		if (pte_none(pte)) {
 			pfns[i] = HMM_PFN_EMPTY;
+			if (hmm_vma_walk->fault)
+				goto fault;
+			continue;
+		}
+
+		if (!pte_present(pte)) {
+			swp_entry_t entry;
+
+			entry = pte_to_swp_entry(pte);
+
+			if (!non_swap_entry(entry)) {
+				if (hmm_vma_walk->fault)
+					goto fault;
+				continue;
+			}
+
+			/*
+			 * This is a special swap entry, ignore migration, use
+			 * device and report anything else as error.
+			 */
+			if (is_migration_entry(entry)) {
+				if (hmm_vma_walk->fault) {
+					pte_unmap(ptep);
+					hmm_vma_walk->last = addr;
+					migration_entry_wait(vma->vm_mm,
+							     pmdp, addr);
+					return -EAGAIN;
+				}
+				continue;
+			} else {
+				/* Report error for everything else */
+				pfns[i] = HMM_PFN_ERROR;
+			}
 			continue;
 		}
 
+		if (write_fault && !pte_write(pte))
+			goto fault;
+
 		pfns[i] = hmm_pfn_t_from_pfn(pte_pfn(pte)) | flag;
 		pfns[i] |= pte_write(pte) ? HMM_PFN_WRITE : 0;
+		continue;
+
+fault:
+		pte_unmap(ptep);
+		/* Fault all pages in range */
+		return hmm_vma_walk_clear(start, end, walk);
 	}
 	pte_unmap(ptep - 1);
 
@@ -381,6 +487,7 @@ int hmm_vma_get_pfns(struct vm_area_struct *vma,
 		     unsigned long end,
 		     hmm_pfn_t *pfns)
 {
+	struct hmm_vma_walk hmm_vma_walk;
 	struct mm_walk mm_walk;
 	struct hmm *hmm;
 
@@ -412,9 +519,12 @@ int hmm_vma_get_pfns(struct vm_area_struct *vma,
 	list_add_rcu(&range->list, &hmm->ranges);
 	spin_unlock(&hmm->lock);
 
+	hmm_vma_walk.fault = false;
+	hmm_vma_walk.range = range;
+	mm_walk.private = &hmm_vma_walk;
+
 	mm_walk.vma = vma;
 	mm_walk.mm = vma->vm_mm;
-	mm_walk.private = range;
 	mm_walk.pte_entry = NULL;
 	mm_walk.test_walk = NULL;
 	mm_walk.hugetlb_entry = NULL;
@@ -422,7 +532,6 @@ int hmm_vma_get_pfns(struct vm_area_struct *vma,
 	mm_walk.pte_hole = hmm_vma_walk_hole;
 
 	walk_page_range(start, end, &mm_walk);
-
 	return 0;
 }
 EXPORT_SYMBOL(hmm_vma_get_pfns);
@@ -449,7 +558,7 @@ EXPORT_SYMBOL(hmm_vma_get_pfns);
  *
  * There are two ways to use this :
  * again:
- *   hmm_vma_get_pfns(vma, range, start, end, pfns);
+ *   hmm_vma_get_pfns(vma, range, start, end, pfns); or hmm_vma_fault(...);
  *   trans = device_build_page_table_update_transaction(pfns);
  *   device_page_table_lock();
  *   if (!hmm_vma_range_done(vma, range)) {
@@ -460,7 +569,7 @@ EXPORT_SYMBOL(hmm_vma_get_pfns);
  *   device_page_table_unlock();
  *
  * Or:
- *   hmm_vma_get_pfns(vma, range, start, end, pfns);
+ *   hmm_vma_get_pfns(vma, range, start, end, pfns); or hmm_vma_fault(...);
  *   device_page_table_lock();
  *   hmm_vma_range_done(vma, range);
  *   device_update_page_table(pfns);
@@ -489,4 +598,127 @@ bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range)
 	return range->valid;
 }
 EXPORT_SYMBOL(hmm_vma_range_done);
+
+/*
+ * hmm_vma_fault() - try to fault some address in a virtual address range
+ * @vma: virtual memory area containing the virtual address range
+ * @range: used to track pfns array content validity
+ * @start: fault range virtual start address (inclusive)
+ * @end: fault range virtual end address (exclusive)
+ * @pfns: array of hmm_pfn_t, only entry with fault flag set will be faulted
+ * @write: is it a write fault
+ * @block: allow blocking on fault (if true it sleeps and does not drop mmap_sem)
+ * Returns: 0 on success, error otherwise (-EAGAIN means mmap_sem has been dropped)
+ *
+ * This is similar to a regular CPU page fault except that it will not trigger
+ * any memory migration if the memory being faulted is not accessible by CPUs.
+ *
+ * On error, for one virtual address in the range, the function will set the
+ * hmm_pfn_t error flag for the corresponding pfn entry.
+ *
+ * Expected use pattern:
+ * retry:
+ *   down_read(&mm->mmap_sem);
+ *   // Find vma and address device wants to fault, initialize hmm_pfn_t
+ *   // array accordingly
+ *   ret = hmm_vma_fault(vma, range, start, end, pfns, write, block);
+ *   switch (ret) {
+ *   case -EAGAIN:
+ *     hmm_vma_range_done(vma, range);
+ *     // You might want to rate limit or yield to play nicely, you may
+ *     // also commit any valid pfn in the array assuming that you are
+ *     // getting true from hmm_vma_range_done()
+ *     goto retry;
+ *   case 0:
+ *     break;
+ *   default:
+ *     // Handle error !
+ *     up_read(&mm->mmap_sem)
+ *     return;
+ *   }
+ *   // Take device driver lock that serialize device page table update
+ *   driver_lock_device_page_table_update();
+ *   hmm_vma_range_done(vma, range);
+ *   // Commit pfns we got from hmm_vma_fault()
+ *   driver_unlock_device_page_table_update();
+ *   up_read(&mm->mmap_sem)
+ *
+ * YOU MUST CALL hmm_vma_range_done() AFTER THIS FUNCTION RETURN SUCCESS (0)
+ * BEFORE FREEING THE range struct OR YOU WILL HAVE SERIOUS MEMORY CORRUPTION !
+ *
+ * YOU HAVE BEEN WARNED !
+ */
+int hmm_vma_fault(struct vm_area_struct *vma,
+		  struct hmm_range *range,
+		  unsigned long start,
+		  unsigned long end,
+		  hmm_pfn_t *pfns,
+		  bool write,
+		  bool block)
+{
+	struct hmm_vma_walk hmm_vma_walk;
+	struct mm_walk mm_walk;
+	struct hmm *hmm;
+	int ret;
+
+	/* Sanity check, this really should not happen ! */
+	if (start < vma->vm_start || start >= vma->vm_end)
+		return -EINVAL;
+	if (end < vma->vm_start || end > vma->vm_end)
+		return -EINVAL;
+
+	hmm = hmm_register(vma->vm_mm);
+	if (!hmm) {
+		hmm_pfns_clear(pfns, start, end);
+		return -ENOMEM;
+	}
+	/* Caller must have registered a mirror using hmm_mirror_register() */
+	if (!hmm->mmu_notifier.ops)
+		return -EINVAL;
+
+	/* Initialize range to track CPU page table update */
+	range->start = start;
+	range->pfns = pfns;
+	range->end = end;
+	spin_lock(&hmm->lock);
+	range->valid = true;
+	list_add_rcu(&range->list, &hmm->ranges);
+	spin_unlock(&hmm->lock);
+
+	/* FIXME support hugetlb fs */
+	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
+		hmm_pfns_special(pfns, start, end);
+		return 0;
+	}
+
+	hmm_vma_walk.fault = true;
+	hmm_vma_walk.write = write;
+	hmm_vma_walk.block = block;
+	hmm_vma_walk.range = range;
+	mm_walk.private = &hmm_vma_walk;
+	hmm_vma_walk.last = range->start;
+
+	mm_walk.vma = vma;
+	mm_walk.mm = vma->vm_mm;
+	mm_walk.pte_entry = NULL;
+	mm_walk.test_walk = NULL;
+	mm_walk.hugetlb_entry = NULL;
+	mm_walk.pmd_entry = hmm_vma_walk_pmd;
+	mm_walk.pte_hole = hmm_vma_walk_hole;
+
+	do {
+		ret = walk_page_range(start, end, &mm_walk);
+		start = hmm_vma_walk.last;
+	} while (ret == -EAGAIN);
+
+	if (ret) {
+		unsigned long i;
+
+		i = (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
+		hmm_pfns_clear(&pfns[i], hmm_vma_walk.last, end);
+		hmm_vma_range_done(vma, range);
+	}
+	return ret;
+}
+EXPORT_SYMBOL(hmm_vma_fault);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
-- 
2.9.4


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [HMM 06/15] mm/memory_hotplug: introduce add_pages
  2017-05-24 17:20 [HMM 00/15] HMM (Heterogeneous Memory Management) v23 Jérôme Glisse
                   ` (4 preceding siblings ...)
  2017-05-24 17:20 ` [HMM 05/15] mm/hmm/mirror: device page fault handler Jérôme Glisse
@ 2017-05-24 17:20 ` Jérôme Glisse
  2017-05-31  1:31   ` Balbir Singh
  2017-05-24 17:20 ` [HMM 07/15] mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory v3 Jérôme Glisse
                   ` (10 subsequent siblings)
  16 siblings, 1 reply; 66+ messages in thread
From: Jérôme Glisse @ 2017-05-24 17:20 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Dan Williams, Kirill A . Shutemov, John Hubbard, Michal Hocko,
	Jérôme Glisse

From: Michal Hocko <mhocko@suse.com>

There are new users of memory hotplug emerging. Some of them require a
different subset of arch_add_memory. There are some which only require
allocation of struct pages without mapping those pages into the kernel
address space. We currently have __add_pages for that purpose. But this
is rather low level and not very suitable for code outside of the memory
hotplug core. E.g. x86_64 wants to update max_pfn, which should be done
by the caller. Introduce add_pages(), which takes care of those details
where they are needed. Each architecture should define its own
implementation and select CONFIG_ARCH_HAS_ADD_PAGES. All others use the
currently existing __add_pages.
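
A minimal, hypothetical caller looks like the sketch below; it wants struct
pages for a physical range but no kernel linear mapping of it, which is the
case add_pages() is meant to serve (dummy_hotplug_unmapped_pages() is not
part of this patch):

static int dummy_hotplug_unmapped_pages(int nid, u64 start, u64 size)
{
	unsigned long start_pfn = start >> PAGE_SHIFT;
	unsigned long nr_pages = size >> PAGE_SHIFT;

	/*
	 * Unlike arch_add_memory() there is no init_memory_mapping() here:
	 * only struct pages are created, the range stays unmapped from the
	 * kernel linear mapping.
	 */
	return add_pages(nid, start_pfn, nr_pages, false);
}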

Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 arch/x86/Kconfig               |  4 ++++
 arch/x86/mm/init_64.c          | 22 +++++++++++++++-------
 include/linux/memory_hotplug.h | 11 +++++++++++
 3 files changed, 30 insertions(+), 7 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a42ac0f..1baa525 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2264,6 +2264,10 @@ source "kernel/livepatch/Kconfig"
 
 endmenu
 
+config ARCH_HAS_ADD_PAGES
+	def_bool y
+	depends on X86_64 && ARCH_ENABLE_MEMORY_HOTPLUG
+
 config ARCH_ENABLE_MEMORY_HOTPLUG
 	def_bool y
 	depends on X86_64 || (X86_32 && HIGHMEM)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 7332930..a8a9972 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -671,7 +671,7 @@ void __init paging_init(void)
  * After memory hotplug the variables max_pfn, max_low_pfn and high_memory need
  * updating.
  */
-static void  update_end_of_memory_vars(u64 start, u64 size)
+static void update_end_of_memory_vars(u64 start, u64 size)
 {
 	unsigned long end_pfn = PFN_UP(start + size);
 
@@ -682,22 +682,30 @@ static void  update_end_of_memory_vars(u64 start, u64 size)
 	}
 }
 
-int arch_add_memory(int nid, u64 start, u64 size, bool want_memblock)
+int add_pages(int nid, unsigned long start_pfn,
+	      unsigned long nr_pages, bool want_memblock)
 {
-	unsigned long start_pfn = start >> PAGE_SHIFT;
-	unsigned long nr_pages = size >> PAGE_SHIFT;
 	int ret;
 
-	init_memory_mapping(start, start + size);
-
 	ret = __add_pages(nid, start_pfn, nr_pages, want_memblock);
 	WARN_ON_ONCE(ret);
 
 	/* update max_pfn, max_low_pfn and high_memory */
-	update_end_of_memory_vars(start, size);
+	update_end_of_memory_vars(start_pfn << PAGE_SHIFT,
+				  nr_pages << PAGE_SHIFT);
 
 	return ret;
 }
+
+int arch_add_memory(int nid, u64 start, u64 size, bool want_memblock)
+{
+	unsigned long start_pfn = start >> PAGE_SHIFT;
+	unsigned long nr_pages = size >> PAGE_SHIFT;
+
+	init_memory_mapping(start, start + size);
+
+	return add_pages(nid, start_pfn, nr_pages, want_memblock);
+}
 EXPORT_SYMBOL_GPL(arch_add_memory);
 
 #define PAGE_INUSE 0xFD
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 9e0249d..abb9395d 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -127,6 +127,17 @@ extern int __remove_pages(struct zone *zone, unsigned long start_pfn,
 extern int __add_pages(int nid, unsigned long start_pfn,
 	unsigned long nr_pages, bool want_memblock);
 
+#ifndef CONFIG_ARCH_HAS_ADD_PAGES
+static inline int add_pages(int nid, unsigned long start_pfn,
+			    unsigned long nr_pages, bool want_memblock)
+{
+	return __add_pages(nid, start_pfn, nr_pages, want_memblock);
+}
+#else /* ARCH_HAS_ADD_PAGES */
+int add_pages(int nid, unsigned long start_pfn,
+	      unsigned long nr_pages, bool want_memblock);
+#endif /* ARCH_HAS_ADD_PAGES */
+
 #ifdef CONFIG_NUMA
 extern int memory_add_physaddr_to_nid(u64 start);
 #else
-- 
2.9.4


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [HMM 07/15] mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory v3
  2017-05-24 17:20 [HMM 00/15] HMM (Heterogeneous Memory Management) v23 Jérôme Glisse
                   ` (5 preceding siblings ...)
  2017-05-24 17:20 ` [HMM 06/15] mm/memory_hotplug: introduce add_pages Jérôme Glisse
@ 2017-05-24 17:20 ` Jérôme Glisse
  2017-05-30 16:43   ` Ross Zwisler
                     ` (3 more replies)
  2017-05-24 17:20 ` [HMM 08/15] mm/ZONE_DEVICE: special case put_page() for device private pages v2 Jérôme Glisse
                   ` (9 subsequent siblings)
  16 siblings, 4 replies; 66+ messages in thread
From: Jérôme Glisse @ 2017-05-24 17:20 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Dan Williams, Kirill A . Shutemov, John Hubbard,
	Jérôme Glisse, Ross Zwisler

HMM (heterogeneous memory management) needs struct page to support migration
from system main memory to device memory. The reasons for HMM and for
migrating to device memory are explained in the HMM core patch.

This patch deals with device memory that is un-addressable (i.e., the CPU
cannot access it). Hence we do not want those struct pages to be managed
like regular memory. That is why we extend ZONE_DEVICE to support different
types of memory.

A persistent memory type is defined for existing users of ZONE_DEVICE and a
new device un-addressable type is added for the un-addressable memory.
There is a clear separation between what is expected from each memory type,
and existing users of ZONE_DEVICE are unaffected by the new requirements and
the new use of the un-addressable type. All type-specific code paths are
protected by a test against the memory type.

Because the memory is un-addressable, we use a new special swap type for when
a page is migrated to device memory (this reduces the maximum number of
swap files).
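
For illustration, here is a sketch of what installing such an entry looks
like once a page has been migrated to device memory; dpage, writable, mm,
addr and ptep are assumed context, only make_device_private_entry() is
added by this patch:

	swp_entry_t entry = make_device_private_entry(dpage, writable);
	pte_t swp_pte = swp_entry_to_pte(entry);

	set_pte_at(mm, addr, ptep, swp_pte);
	/*
	 * Any later CPU access to addr now goes through do_swap_page(),
	 * which calls device_private_entry_fault() to migrate the page
	 * back to system memory.
	 */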

The two main additions to ZONE_DEVICE, besides the memory type, are two
callbacks. The first one, page_free(), is called whenever the page refcount
reaches 1 (which means the page is free, as ZONE_DEVICE pages never reach a
refcount of 0). This allows the device driver to manage its memory and the
associated struct pages.

The second callback, page_fault(), is invoked when the CPU accesses an
address that is backed by a device page (which is un-addressable by the
CPU). This callback is responsible for migrating the page back to system
main memory. A device driver cannot block migration back to system memory;
HMM makes sure that such pages cannot be pinned in device memory.

If the device is in an error condition and cannot migrate the memory back,
then a CPU page fault to device memory should end with SIGBUS.
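
Putting the pieces together, a driver for such memory is expected to fill
the new dev_pagemap fields roughly as sketched below; the dummy_* callbacks
and types are hypothetical, only the field names, the callback signatures
and MEMORY_DEVICE_PRIVATE come from this patch:

static int dummy_devmem_fault(struct vm_area_struct *vma, unsigned long addr,
			      struct page *page, unsigned int flags,
			      pmd_t *pmdp)
{
	/* Migrate the page back to system memory or kill the access. */
	if (dummy_migrate_back_to_ram(vma, addr, page, pmdp))
		return VM_FAULT_SIGBUS;
	return 0;
}

static void dummy_devmem_free(struct page *page, void *data)
{
	struct dummy_devmem *devmem = data;

	/* Refcount dropped back to 1: hand the page back to the driver. */
	dummy_free_device_page(devmem, page);
}

	/* ... and when the device memory is set up: */
	pgmap->type = MEMORY_DEVICE_PRIVATE;
	pgmap->page_fault = dummy_devmem_fault;
	pgmap->page_free = dummy_devmem_free;
	pgmap->data = devmem;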

Changed since v2:
  - s/DEVICE_UNADDRESSABLE/DEVICE_PRIVATE
Changed since v1:
  - rename to device private memory (from device unaddressable)

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/proc/task_mmu.c       |  7 +++++
 include/linux/ioport.h   |  1 +
 include/linux/memremap.h | 72 ++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mm.h       | 12 ++++++++
 include/linux/swap.h     | 24 ++++++++++++++--
 include/linux/swapops.h  | 68 +++++++++++++++++++++++++++++++++++++++++++++
 kernel/memremap.c        | 34 +++++++++++++++++++++++
 mm/Kconfig               | 13 +++++++++
 mm/memory.c              | 61 ++++++++++++++++++++++++++++++++++++++++
 mm/memory_hotplug.c      | 10 +++++--
 mm/mprotect.c            | 14 ++++++++++
 11 files changed, 311 insertions(+), 5 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f0c8b33..90b2fa4 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -542,6 +542,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 			}
 		} else if (is_migration_entry(swpent))
 			page = migration_entry_to_page(swpent);
+		else if (is_device_private_entry(swpent))
+			page = device_private_entry_to_page(swpent);
 	} else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap
 							&& pte_none(*pte))) {
 		page = find_get_entry(vma->vm_file->f_mapping,
@@ -704,6 +706,8 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
 
 		if (is_migration_entry(swpent))
 			page = migration_entry_to_page(swpent);
+		else if (is_device_private_entry(swpent))
+			page = device_private_entry_to_page(swpent);
 	}
 	if (page) {
 		int mapcount = page_mapcount(page);
@@ -1196,6 +1200,9 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
 		flags |= PM_SWAP;
 		if (is_migration_entry(entry))
 			page = migration_entry_to_page(entry);
+
+		if (is_device_private_entry(entry))
+			page = device_private_entry_to_page(entry);
 	}
 
 	if (page && !PageAnon(page))
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index 6230064..3a4f691 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -130,6 +130,7 @@ enum {
 	IORES_DESC_ACPI_NV_STORAGE		= 3,
 	IORES_DESC_PERSISTENT_MEMORY		= 4,
 	IORES_DESC_PERSISTENT_MEMORY_LEGACY	= 5,
+	IORES_DESC_DEVICE_PRIVATE_MEMORY	= 6,
 };
 
 /* helpers to define resources */
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 9341619..0fcf840 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -4,6 +4,8 @@
 #include <linux/ioport.h>
 #include <linux/percpu-refcount.h>
 
+#include <asm/pgtable.h>
+
 struct resource;
 struct device;
 
@@ -35,18 +37,88 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
 }
 #endif
 
+/*
+ * Specialize ZONE_DEVICE memory into multiple types, each having a different
+ * usage.
+ *
+ * MEMORY_DEVICE_PUBLIC:
+ * Persistent device memory (pmem): struct page might be allocated in different
+ * memory and architecture might want to perform special actions. It is similar
+ * to regular memory, in that the CPU can access it transparently. However,
+ * it is likely to have different bandwidth and latency than regular memory.
+ * See Documentation/nvdimm/nvdimm.txt for more information.
+ *
+ * MEMORY_DEVICE_PRIVATE:
+ * Device memory that is not directly addressable by the CPU: CPU can neither
+ * read nor write _UNADDRESSABLE memory. In this case, we do still have struct
+ * pages backing the device memory. Doing so simplifies the implementation, but
+ * it is important to remember that there are certain points at which the struct
+ * page must be treated as an opaque object, rather than a "normal" struct page.
+ * A more complete discussion of unaddressable memory may be found in
+ * include/linux/hmm.h and Documentation/vm/hmm.txt.
+ */
+enum memory_type {
+	MEMORY_DEVICE_PUBLIC = 0,
+	MEMORY_DEVICE_PRIVATE,
+};
+
+/*
+ * For MEMORY_DEVICE_PRIVATE we use ZONE_DEVICE and extend it with two
+ * callbacks:
+ *   page_fault()
+ *   page_free()
+ *
+ * Additional notes about MEMORY_DEVICE_PRIVATE may be found in
+ * include/linux/hmm.h and Documentation/vm/hmm.txt. There is also a brief
+ * explanation in include/linux/memory_hotplug.h.
+ *
+ * The page_fault() callback must migrate page back, from device memory to
+ * system memory, so that the CPU can access it. This might fail for various
+ * reasons (device issues, the device has been unplugged, ...). When such error
+ * conditions happen, the page_fault() callback must return VM_FAULT_SIGBUS and
+ * set the CPU page table entry to "poisoned".
+ *
+ * Note that because memory cgroup charges are transferred to the device memory,
+ * this should never fail due to memory restrictions. However, allocation
+ * of a regular system page might still fail because we are out of memory. If
+ * that happens, the page_fault() callback must return VM_FAULT_OOM.
+ *
+ * The page_fault() callback can also try to migrate back multiple pages in one
+ * chunk, as an optimization. It must, however, prioritize the faulting address
+ * over all the others.
+ *
+ *
+ * The page_free() callback is called once the page refcount reaches 1
+ * (ZONE_DEVICE pages never reach 0 refcount unless there is a refcount bug.
+ * This allows the device driver to implement its own memory management.)
+ */
+typedef int (*dev_page_fault_t)(struct vm_area_struct *vma,
+				unsigned long addr,
+				struct page *page,
+				unsigned int flags,
+				pmd_t *pmdp);
+typedef void (*dev_page_free_t)(struct page *page, void *data);
+
 /**
  * struct dev_pagemap - metadata for ZONE_DEVICE mappings
+ * @page_fault: callback when CPU fault on an unaddressable device page
+ * @page_free: free page callback when page refcount reaches 1
  * @altmap: pre-allocated/reserved memory for vmemmap allocations
  * @res: physical address range covered by @ref
  * @ref: reference count that pins the devm_memremap_pages() mapping
  * @dev: host device of the mapping for debug
+ * @data: private data pointer for page_free()
+ * @type: memory type: see MEMORY_* in memory_hotplug.h
  */
 struct dev_pagemap {
+	dev_page_fault_t page_fault;
+	dev_page_free_t page_free;
 	struct vmem_altmap *altmap;
 	const struct resource *res;
 	struct percpu_ref *ref;
 	struct device *dev;
+	void *data;
+	enum memory_type type;
 };
 
 #ifdef CONFIG_ZONE_DEVICE
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7cb17c6..a825dab 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -788,11 +788,23 @@ static inline bool is_zone_device_page(const struct page *page)
 {
 	return page_zonenum(page) == ZONE_DEVICE;
 }
+
+static inline bool is_device_private_page(const struct page *page)
+{
+	/* See MEMORY_DEVICE_PRIVATE in include/linux/memory_hotplug.h */
+	return ((page_zonenum(page) == ZONE_DEVICE) &&
+		(page->pgmap->type == MEMORY_DEVICE_PRIVATE));
+}
 #else
 static inline bool is_zone_device_page(const struct page *page)
 {
 	return false;
 }
+
+static inline bool is_device_private_page(const struct page *page)
+{
+	return false;
+}
 #endif
 
 static inline void get_page(struct page *page)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 5ab1c98..ab6c20b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -51,6 +51,23 @@ static inline int current_is_kswapd(void)
  */
 
 /*
+ * Unaddressable device memory support. See include/linux/hmm.h and
+ * Documentation/vm/hmm.txt. Short description is we need struct pages for
+ * device memory that is unaddressable (inaccessible) by CPU, so that we can
+ * migrate part of a process memory to device memory.
+ *
+ * When a page is migrated from CPU to device, we set the CPU page table entry
+ * to a special SWP_DEVICE_* entry.
+ */
+#ifdef CONFIG_DEVICE_PRIVATE
+#define SWP_DEVICE_NUM 2
+#define SWP_DEVICE_WRITE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM)
+#define SWP_DEVICE_READ (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+1)
+#else
+#define SWP_DEVICE_NUM 0
+#endif
+
+/*
  * NUMA node memory migration support
  */
 #ifdef CONFIG_MIGRATION
@@ -72,7 +89,8 @@ static inline int current_is_kswapd(void)
 #endif
 
 #define MAX_SWAPFILES \
-	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
+	((1 << MAX_SWAPFILES_SHIFT) - SWP_DEVICE_NUM - \
+	SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
 
 /*
  * Magic header for a swap area. The first part of the union is
@@ -432,8 +450,8 @@ static inline void show_swap_cache_info(void)
 {
 }
 
-#define free_swap_and_cache(swp)	is_migration_entry(swp)
-#define swapcache_prepare(swp)		is_migration_entry(swp)
+#define free_swap_and_cache(e) (is_migration_entry(e) || is_device_private_entry(e))
+#define swapcache_prepare(e) (is_migration_entry(e) || is_device_private_entry(e))
 
 static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
 {
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 5c3a5f3..361090c 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -100,6 +100,74 @@ static inline void *swp_to_radix_entry(swp_entry_t entry)
 	return (void *)(value | RADIX_TREE_EXCEPTIONAL_ENTRY);
 }
 
+#if IS_ENABLED(CONFIG_DEVICE_PRIVATE)
+static inline swp_entry_t make_device_private_entry(struct page *page, bool write)
+{
+	return swp_entry(write ? SWP_DEVICE_WRITE : SWP_DEVICE_READ,
+			 page_to_pfn(page));
+}
+
+static inline bool is_device_private_entry(swp_entry_t entry)
+{
+	int type = swp_type(entry);
+	return type == SWP_DEVICE_READ || type == SWP_DEVICE_WRITE;
+}
+
+static inline void make_device_private_entry_read(swp_entry_t *entry)
+{
+	*entry = swp_entry(SWP_DEVICE_READ, swp_offset(*entry));
+}
+
+static inline bool is_write_device_private_entry(swp_entry_t entry)
+{
+	return unlikely(swp_type(entry) == SWP_DEVICE_WRITE);
+}
+
+static inline struct page *device_private_entry_to_page(swp_entry_t entry)
+{
+	return pfn_to_page(swp_offset(entry));
+}
+
+int device_private_entry_fault(struct vm_area_struct *vma,
+		       unsigned long addr,
+		       swp_entry_t entry,
+		       unsigned int flags,
+		       pmd_t *pmdp);
+#else /* CONFIG_DEVICE_PRIVATE */
+static inline swp_entry_t make_device_private_entry(struct page *page, bool write)
+{
+	return swp_entry(0, 0);
+}
+
+static inline void make_device_private_entry_read(swp_entry_t *entry)
+{
+}
+
+static inline bool is_device_private_entry(swp_entry_t entry)
+{
+	return false;
+}
+
+static inline bool is_write_device_private_entry(swp_entry_t entry)
+{
+	return false;
+}
+
+static inline struct page *device_private_entry_to_page(swp_entry_t entry)
+{
+	return NULL;
+}
+
+static inline int device_private_entry_fault(struct vm_area_struct *vma,
+				     unsigned long addr,
+				     swp_entry_t entry,
+				     unsigned int flags,
+				     pmd_t *pmdp)
+{
+	return VM_FAULT_SIGBUS;
+}
+#endif /* CONFIG_DEVICE_PRIVATE */
+
 #ifdef CONFIG_MIGRATION
 static inline swp_entry_t make_migration_entry(struct page *page, int write)
 {
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 124bed7..cd596d4 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -18,6 +18,8 @@
 #include <linux/io.h>
 #include <linux/mm.h>
 #include <linux/memory_hotplug.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
 
 #ifndef ioremap_cache
 /* temporary while we convert existing ioremap_cache users to memremap */
@@ -182,6 +184,34 @@ struct page_map {
 	struct vmem_altmap altmap;
 };
 
+#if IS_ENABLED(CONFIG_DEVICE_PRIVATE)
+int device_private_entry_fault(struct vm_area_struct *vma,
+		       unsigned long addr,
+		       swp_entry_t entry,
+		       unsigned int flags,
+		       pmd_t *pmdp)
+{
+	struct page *page = device_private_entry_to_page(entry);
+
+	/*
+	 * The page_fault() callback must migrate page back to system memory
+	 * so that CPU can access it. This might fail for various reasons
+	 * (device issue, device was unsafely unplugged, ...). When such
+	 * error conditions happen, the callback must return VM_FAULT_SIGBUS.
+	 *
+	 * Note that because memory cgroup charges are accounted to the device
+	 * memory, this should never fail because of memory restrictions (but
+	 * allocation of regular system page might still fail because we are
+	 * out of memory).
+	 *
+	 * There is a more in-depth description of what that callback can and
+	 * cannot do, in include/linux/memremap.h
+	 */
+	return page->pgmap->page_fault(vma, addr, page, flags, pmdp);
+}
+EXPORT_SYMBOL(device_private_entry_fault);
+#endif /* CONFIG_DEVICE_PRIVATE */
+
 static void pgmap_radix_release(struct resource *res)
 {
 	resource_size_t key, align_start, align_size, align_end;
@@ -321,6 +351,10 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 	}
 	pgmap->ref = ref;
 	pgmap->res = &page_map->res;
+	pgmap->type = MEMORY_DEVICE_PUBLIC;
+	pgmap->page_fault = NULL;
+	pgmap->page_free = NULL;
+	pgmap->data = NULL;
 
 	mutex_lock(&pgmap_lock);
 	error = 0;
diff --git a/mm/Kconfig b/mm/Kconfig
index d744cff..f5357ff 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -736,6 +736,19 @@ config ZONE_DEVICE
 
 	  If FS_DAX is enabled, then say Y.
 
+config DEVICE_PRIVATE
+	bool "Unaddressable device memory (GPU memory, ...)"
+	depends on X86_64
+	depends on ZONE_DEVICE
+	depends on MEMORY_HOTPLUG
+	depends on MEMORY_HOTREMOVE
+	depends on SPARSEMEM_VMEMMAP
+
+	help
+	  Allows creation of struct pages to represent unaddressable device
+	  memory; i.e., memory that is only accessible from the device (or
+	  group of devices).
+
 config FRAME_VECTOR
 	bool
 
diff --git a/mm/memory.c b/mm/memory.c
index d320b4e..eba61dd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -49,6 +49,7 @@
 #include <linux/swap.h>
 #include <linux/highmem.h>
 #include <linux/pagemap.h>
+#include <linux/memremap.h>
 #include <linux/ksm.h>
 #include <linux/rmap.h>
 #include <linux/export.h>
@@ -927,6 +928,35 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 					pte = pte_swp_mksoft_dirty(pte);
 				set_pte_at(src_mm, addr, src_pte, pte);
 			}
+		} else if (is_device_private_entry(entry)) {
+			page = device_private_entry_to_page(entry);
+
+			/*
+			 * Update rss count even for unaddressable pages, as
+			 * they should be treated just like normal pages in this
+			 * respect.
+			 *
+			 * We will likely want to have some new rss counters
+			 * for unaddressable pages, at some point. But for now
+			 * keep things as they are.
+			 */
+			get_page(page);
+			rss[mm_counter(page)]++;
+			page_dup_rmap(page, false);
+
+			/*
+			 * We do not preserve soft-dirty information, because so
+			 * far, checkpoint/restore is the only feature that
+			 * requires that. And checkpoint/restore does not work
+			 * when a device driver is involved (you cannot easily
+			 * save and restore device driver state).
+			 */
+			if (is_write_device_private_entry(entry) &&
+			    is_cow_mapping(vm_flags)) {
+				make_device_private_entry_read(&entry);
+				pte = swp_entry_to_pte(entry);
+				set_pte_at(src_mm, addr, src_pte, pte);
+			}
 		}
 		goto out_set_pte;
 	}
@@ -1243,6 +1273,29 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			}
 			continue;
 		}
+
+		entry = pte_to_swp_entry(ptent);
+		if (non_swap_entry(entry) && is_device_private_entry(entry)) {
+			struct page *page = device_private_entry_to_page(entry);
+
+			if (unlikely(details && details->check_mapping)) {
+				/*
+				 * unmap_shared_mapping_pages() wants to
+				 * invalidate cache without truncating:
+				 * unmap shared but keep private pages.
+				 */
+				if (details->check_mapping !=
+				    page_rmapping(page))
+					continue;
+			}
+
+			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+			rss[mm_counter(page)]--;
+			page_remove_rmap(page, false);
+			put_page(page);
+			continue;
+		}
+
 		/* If details->check_mapping, we leave swap entries. */
 		if (unlikely(details))
 			continue;
@@ -2690,6 +2743,14 @@ int do_swap_page(struct vm_fault *vmf)
 		if (is_migration_entry(entry)) {
 			migration_entry_wait(vma->vm_mm, vmf->pmd,
 					     vmf->address);
+		} else if (is_device_private_entry(entry)) {
+			/*
+			 * For un-addressable device memory we call the pgmap
+			 * fault handler callback. The callback must migrate
+			 * the page back to some CPU accessible page.
+			 */
+			ret = device_private_entry_fault(vma, vmf->address, entry,
+						 vmf->flags, vmf->pmd);
 		} else if (is_hwpoison_entry(entry)) {
 			ret = VM_FAULT_HWPOISON;
 		} else {
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 599c675..0a9f690 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -156,7 +156,7 @@ void mem_hotplug_done(void)
 /* add this memory to iomem resource */
 static struct resource *register_memory_resource(u64 start, u64 size)
 {
-	struct resource *res;
+	struct resource *res, *conflict;
 	res = kzalloc(sizeof(struct resource), GFP_KERNEL);
 	if (!res)
 		return ERR_PTR(-ENOMEM);
@@ -165,7 +165,13 @@ static struct resource *register_memory_resource(u64 start, u64 size)
 	res->start = start;
 	res->end = start + size - 1;
 	res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
-	if (request_resource(&iomem_resource, res) < 0) {
+	conflict =  request_resource_conflict(&iomem_resource, res);
+	if (conflict) {
+		if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
+			pr_debug("Device unaddressable memory block "
+				 "memory hotplug at %#010llx !\n",
+				 (unsigned long long)start);
+		}
 		pr_debug("System RAM resource %pR cannot be added\n", res);
 		kfree(res);
 		return ERR_PTR(-EEXIST);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 1a8c9ca..868d0ed 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -124,6 +124,20 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 
 				pages++;
 			}
+
+			if (is_write_device_private_entry(entry)) {
+				pte_t newpte;
+
+				/*
+				 * We do not preserve soft-dirtiness. See
+				 * copy_one_pte() for explanation.
+				 */
+				make_device_private_entry_read(&entry);
+				newpte = swp_entry_to_pte(entry);
+				set_pte_at(mm, addr, pte, newpte);
+
+				pages++;
+			}
 		}
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	arch_leave_lazy_mmu_mode();
-- 
2.9.4


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [HMM 08/15] mm/ZONE_DEVICE: special case put_page() for device private pages v2
  2017-05-24 17:20 [HMM 00/15] HMM (Heterogeneous Memory Management) v23 Jérôme Glisse
                   ` (6 preceding siblings ...)
  2017-05-24 17:20 ` [HMM 07/15] mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory v3 Jérôme Glisse
@ 2017-05-24 17:20 ` Jérôme Glisse
  2017-05-24 17:20 ` [HMM 09/15] mm/hmm/devmem: device memory hotplug using ZONE_DEVICE v5 Jérôme Glisse
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 66+ messages in thread
From: Jérôme Glisse @ 2017-05-24 17:20 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Dan Williams, Kirill A . Shutemov, John Hubbard,
	Jérôme Glisse, Ross Zwisler

A ZONE_DEVICE page that reaches a refcount of 1 is free, i.e., it no longer
has any user. For device private pages it is important to catch this, and
thus we need to special-case put_page() for it.
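
To make the new behaviour concrete, here is a small hypothetical sequence
from a driver's point of view; when (and by whom) device_private_key is
enabled is left to the rest of the series, static_branch_enable() is shown
only to underline that the special put_page() path stays disabled until the
key is turned on:

	static_branch_enable(&device_private_key);

	/* ZONE_DEVICE pages sit at refcount 1 when free. */
	get_page(dpage);	/* hand the page out, refcount 1 -> 2 */
	/* ... dpage is mapped into a process and used by the device ... */
	put_page(dpage);	/* refcount 2 -> 1: pgmap->page_free() runs */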

Changed since v1:
  - use static key to disable special code path in put_page() by
    default
  - uninline put_zone_device_private_page()
  - fix build issues with some kernel config related to header
    inter-dependency

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 include/linux/memremap.h | 13 +++++++++++++
 include/linux/mm.h       | 31 ++++++++++++++++++++++---------
 kernel/memremap.c        | 19 ++++++++++++++++++-
 mm/hmm.c                 |  8 ++++++++
 4 files changed, 61 insertions(+), 10 deletions(-)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 0fcf840..0e0d2e6 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -125,6 +125,14 @@ struct dev_pagemap {
 void *devm_memremap_pages(struct device *dev, struct resource *res,
 		struct percpu_ref *ref, struct vmem_altmap *altmap);
 struct dev_pagemap *find_dev_pagemap(resource_size_t phys);
+
+static inline bool is_zone_device_page(const struct page *page);
+
+static inline bool is_device_private_page(const struct page *page)
+{
+	return is_zone_device_page(page) &&
+		page->pgmap->type == MEMORY_DEVICE_PRIVATE;
+}
 #else
 static inline void *devm_memremap_pages(struct device *dev,
 		struct resource *res, struct percpu_ref *ref,
@@ -143,6 +151,11 @@ static inline struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
 {
 	return NULL;
 }
+
+static inline bool is_device_private_page(const struct page *page)
+{
+	return false;
+}
 #endif
 
 /**
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a825dab..7f0656f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -23,6 +23,7 @@
 #include <linux/page_ext.h>
 #include <linux/err.h>
 #include <linux/page_ref.h>
+#include <linux/memremap.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -788,25 +789,25 @@ static inline bool is_zone_device_page(const struct page *page)
 {
 	return page_zonenum(page) == ZONE_DEVICE;
 }
-
-static inline bool is_device_private_page(const struct page *page)
-{
-	/* See MEMORY_DEVICE_PRIVATE in include/linux/memory_hotplug.h */
-	return ((page_zonenum(page) == ZONE_DEVICE) &&
-		(page->pgmap->type == MEMORY_DEVICE_PRIVATE));
-}
 #else
 static inline bool is_zone_device_page(const struct page *page)
 {
 	return false;
 }
+#endif
 
-static inline bool is_device_private_page(const struct page *page)
+#ifdef CONFIG_DEVICE_PRIVATE
+void put_zone_device_private_page(struct page *page);
+#else
+static inline void put_zone_device_private_page(struct page *page)
 {
-	return false;
 }
 #endif
 
+static inline bool is_device_private_page(const struct page *page);
+
+DECLARE_STATIC_KEY_FALSE(device_private_key);
+
 static inline void get_page(struct page *page)
 {
 	page = compound_head(page);
@@ -822,6 +823,18 @@ static inline void put_page(struct page *page)
 {
 	page = compound_head(page);
 
+	/*
+	 * For private device pages we need to catch refcount transition from
+	 * 2 to 1, when refcount reach one it means the private device page is
+	 * free and we need to inform the device driver through callback. See
+	 * include/linux/memremap.h and HMM for details.
+	 */
+	if (static_branch_unlikely(&device_private_key) &&
+	    unlikely(is_device_private_page(page))) {
+		put_zone_device_private_page(page);
+		return;
+	}
+
 	if (put_page_testzero(page))
 		__put_page(page);
 }
diff --git a/kernel/memremap.c b/kernel/memremap.c
index cd596d4..b9baa6c 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -11,7 +11,6 @@
  * General Public License for more details.
  */
 #include <linux/radix-tree.h>
-#include <linux/memremap.h>
 #include <linux/device.h>
 #include <linux/types.h>
 #include <linux/pfn_t.h>
@@ -464,3 +463,21 @@ struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
 	return pgmap ? pgmap->altmap : NULL;
 }
 #endif /* CONFIG_ZONE_DEVICE */
+
+
+#ifdef CONFIG_DEVICE_PRIVATE
+void put_zone_device_private_page(struct page *page)
+{
+	int count = page_ref_dec_return(page);
+
+	/*
+	 * If refcount is 1 then page is freed and refcount is stable as nobody
+	 * holds a reference on the page.
+	 */
+	if (count == 1)
+		page->pgmap->page_free(page, page->pgmap->data);
+	else if (!count)
+		__put_page(page);
+}
+EXPORT_SYMBOL(put_zone_device_private_page);
+#endif /* CONFIG_DEVICE_PRIVATE */
diff --git a/mm/hmm.c b/mm/hmm.c
index e7d5a36..ff9011e 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -25,9 +25,17 @@
 #include <linux/sched.h>
 #include <linux/swapops.h>
 #include <linux/hugetlb.h>
+#include <linux/jump_label.h>
 #include <linux/mmu_notifier.h>
 
 
+/*
+ * Device private memory see HMM (Documentation/vm/hmm.txt) or hmm.h
+ */
+DEFINE_STATIC_KEY_FALSE(device_private_key);
+EXPORT_SYMBOL(device_private_key);
+
+
 #ifdef CONFIG_HMM
 static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
 
-- 
2.9.4

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [HMM 09/15] mm/hmm/devmem: device memory hotplug using ZONE_DEVICE v5
  2017-05-24 17:20 [HMM 00/15] HMM (Heterogeneous Memory Management) v23 Jérôme Glisse
                   ` (7 preceding siblings ...)
  2017-05-24 17:20 ` [HMM 08/15] mm/ZONE_DEVICE: special case put_page() for device private pages v2 Jérôme Glisse
@ 2017-05-24 17:20 ` Jérôme Glisse
  2017-06-24  3:54   ` John Hubbard
  2017-05-24 17:20 ` [HMM 10/15] mm/hmm/devmem: dummy HMM device for ZONE_DEVICE memory v3 Jérôme Glisse
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 66+ messages in thread
From: Jérôme Glisse @ 2017-05-24 17:20 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Dan Williams, Kirill A . Shutemov, John Hubbard,
	Jérôme Glisse, Evgeny Baskakov, Mark Hairgrove,
	Sherry Cheung, Subhash Gutti

This introduces a simple struct and associated helpers for device
drivers to use when hotplugging un-addressable device memory as
ZONE_DEVICE. It finds an unused physical address range and triggers
memory hotplug for it, which allocates and initializes struct pages
for the device memory.
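
As a hedged usage sketch (not part of the patch), a driver would
typically hotplug its memory at probe time; dummy_devmem_ops,
dummy_devmem_free, dummy_devmem_fault and DUMMY_MEM_SIZE are
hypothetical names, the callback signatures come from struct
hmm_devmem_ops below:

  static const struct hmm_devmem_ops dummy_devmem_ops = {
  	.free	= dummy_devmem_free,
  	.fault	= dummy_devmem_fault,
  };

  static int dummy_probe(struct device *device)
  {
  	struct hmm_devmem *devmem;

  	/* reserve an unused physical range and create struct pages for it */
  	devmem = hmm_devmem_add(&dummy_devmem_ops, device, DUMMY_MEM_SIZE);
  	if (IS_ERR(devmem))
  		return PTR_ERR(devmem);

  	/* device pages now cover pfns [devmem->pfn_first, devmem->pfn_last) */
  	return 0;
  }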

Changed since v4:
  - enable device_private_key static key when adding device memory
Changed since v3:
  - s/device unaddressable/device private/
Changed since v2:
  - s/SECTION_SIZE/PA_SECTION_SIZE
Changed since v1:
  - change to adapt to new add_pages() helper
  - make this x86-64 only for now

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/hmm.h | 114 ++++++++++++++
 mm/Kconfig          |   9 ++
 mm/hmm.c            | 416 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 538 insertions(+), 1 deletion(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 248a6e0..0865afd 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -72,6 +72,11 @@
 
 #if IS_ENABLED(CONFIG_HMM)
 
+#include <linux/migrate.h>
+#include <linux/memremap.h>
+#include <linux/completion.h>
+
+
 struct hmm;
 
 /*
@@ -322,6 +327,115 @@ int hmm_vma_fault(struct vm_area_struct *vma,
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
 
+#if IS_ENABLED(CONFIG_HMM_DEVMEM)
+struct hmm_devmem;
+
+struct page *hmm_vma_alloc_locked_page(struct vm_area_struct *vma,
+				       unsigned long addr);
+
+/*
+ * struct hmm_devmem_ops - callback for ZONE_DEVICE memory events
+ *
+ * @free: called when the page refcount reaches 1, ie it is no longer in use
+ * @fault: called when there is a page fault to unaddressable memory
+ */
+struct hmm_devmem_ops {
+	void (*free)(struct hmm_devmem *devmem, struct page *page);
+	int (*fault)(struct hmm_devmem *devmem,
+		     struct vm_area_struct *vma,
+		     unsigned long addr,
+		     struct page *page,
+		     unsigned int flags,
+		     pmd_t *pmdp);
+};
+
+/*
+ * struct hmm_devmem - track device memory
+ *
+ * @completion: completion object for device memory
+ * @pfn_first: first pfn for this resource (set by hmm_devmem_add())
+ * @pfn_last: last pfn for this resource (set by hmm_devmem_add())
+ * @resource: IO resource reserved for this chunk of memory
+ * @pagemap: device page map for that chunk
+ * @device: device to bind resource to
+ * @ops: memory operations callback
+ * @ref: per CPU refcount
+ *
+ * This is a helper structure for device drivers that do not wish to implement
+ * the gory details related to hotplugging new memory and allocating struct
+ * pages.
+ *
+ * Device drivers can directly use ZONE_DEVICE memory on their own if they
+ * wish to do so.
+ */
+struct hmm_devmem {
+	struct completion		completion;
+	unsigned long			pfn_first;
+	unsigned long			pfn_last;
+	struct resource			*resource;
+	struct device			*device;
+	struct dev_pagemap		pagemap;
+	const struct hmm_devmem_ops	*ops;
+	struct percpu_ref		ref;
+};
+
+/*
+ * To add (hotplug) device memory, HMM assumes that there is no real resource
+ * that reserves a range in the physical address space (this is intended to be
+ * used by unaddressable device memory). It will reserve a physical range big
+ * enough and allocate struct page for it.
+ *
+ * The device driver can wrap the hmm_devmem struct inside a private device
+ * driver struct. The device driver must call hmm_devmem_remove() before the
+ * device goes away and before freeing the hmm_devmem struct memory.
+ */
+struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
+				  struct device *device,
+				  unsigned long size);
+void hmm_devmem_remove(struct hmm_devmem *devmem);
+
+int hmm_devmem_fault_range(struct hmm_devmem *devmem,
+			   struct vm_area_struct *vma,
+			   const struct migrate_vma_ops *ops,
+			   unsigned long *src,
+			   unsigned long *dst,
+			   unsigned long start,
+			   unsigned long addr,
+			   unsigned long end,
+			   void *private);
+
+/*
+ * hmm_devmem_page_set_drvdata - set per-page driver data field
+ *
+ * @page: pointer to struct page
+ * @data: driver data value to set
+ *
+ * Because the page can not be on the lru we have an unsigned long that the
+ * driver can use to store a per-page field. This is just a simple helper.
+ */
+static inline void hmm_devmem_page_set_drvdata(struct page *page,
+					       unsigned long data)
+{
+	unsigned long *drvdata = (unsigned long *)&page->pgmap;
+
+	drvdata[1] = data;
+}
+
+/*
+ * hmm_devmem_page_get_drvdata - get per page driver data field
+ *
+ * @page: pointer to struct page
+ * Return: driver data value
+ */
+static inline unsigned long hmm_devmem_page_get_drvdata(struct page *page)
+{
+	unsigned long *drvdata = (unsigned long *)&page->pgmap;
+
+	return drvdata[1];
+}
+#endif /* IS_ENABLED(CONFIG_HMM_DEVMEM) */
+
+
 /* Below are for HMM internal use only! Not to be used by device driver! */
 void hmm_mm_destroy(struct mm_struct *mm);
 
diff --git a/mm/Kconfig b/mm/Kconfig
index f5357ff..46296d5 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -314,6 +314,15 @@ config HMM_MIRROR
 	  page tables (at PAGE_SIZE granularity), and must be able to recover from
 	  the resulting potential page faults.
 
+config HMM_DEVMEM
+	bool "HMM device memory helpers (to leverage ZONE_DEVICE)"
+	depends on ARCH_HAS_HMM
+	select HMM
+	help
+	  HMM devmem is a set of helper routines to leverage the ZONE_DEVICE
+	  feature. This is just to avoid having device drivers replicate a lot
+	  of boilerplate code.  See Documentation/vm/hmm.txt.
+
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
 
diff --git a/mm/hmm.c b/mm/hmm.c
index ff9011e..675a558 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -23,10 +23,16 @@
 #include <linux/swap.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
+#include <linux/mmzone.h>
+#include <linux/pagemap.h>
 #include <linux/swapops.h>
 #include <linux/hugetlb.h>
+#include <linux/memremap.h>
 #include <linux/jump_label.h>
 #include <linux/mmu_notifier.h>
+#include <linux/memory_hotplug.h>
+
+#define PA_SECTION_SIZE (1UL << PA_SECTION_SHIFT)
 
 
 /*
@@ -436,7 +442,15 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 			 * This is a special swap entry, ignore migration, use
 			 * device and report anything else as error.
 			 */
-			if (is_migration_entry(entry)) {
+			if (is_device_private_entry(entry)) {
+				pfns[i] = hmm_pfn_t_from_pfn(swp_offset(entry));
+				if (is_write_device_private_entry(entry)) {
+					pfns[i] |= HMM_PFN_WRITE;
+				} else if (write_fault)
+					goto fault;
+				pfns[i] |= HMM_PFN_DEVICE_UNADDRESSABLE;
+				pfns[i] |= flag;
+			} else if (is_migration_entry(entry)) {
 				if (hmm_vma_walk->fault) {
 					pte_unmap(ptep);
 					hmm_vma_walk->last = addr;
@@ -730,3 +744,403 @@ int hmm_vma_fault(struct vm_area_struct *vma,
 }
 EXPORT_SYMBOL(hmm_vma_fault);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
+
+
+#if IS_ENABLED(CONFIG_HMM_DEVMEM)
+struct page *hmm_vma_alloc_locked_page(struct vm_area_struct *vma,
+				       unsigned long addr)
+{
+	struct page *page;
+
+	page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+	if (!page)
+		return NULL;
+	lock_page(page);
+	return page;
+}
+EXPORT_SYMBOL(hmm_vma_alloc_locked_page);
+
+
+static void hmm_devmem_ref_release(struct percpu_ref *ref)
+{
+	struct hmm_devmem *devmem;
+
+	devmem = container_of(ref, struct hmm_devmem, ref);
+	complete(&devmem->completion);
+}
+
+static void hmm_devmem_ref_exit(void *data)
+{
+	struct percpu_ref *ref = data;
+	struct hmm_devmem *devmem;
+
+	devmem = container_of(ref, struct hmm_devmem, ref);
+	percpu_ref_exit(ref);
+	devm_remove_action(devmem->device, &hmm_devmem_ref_exit, data);
+}
+
+static void hmm_devmem_ref_kill(void *data)
+{
+	struct percpu_ref *ref = data;
+	struct hmm_devmem *devmem;
+
+	devmem = container_of(ref, struct hmm_devmem, ref);
+	percpu_ref_kill(ref);
+	wait_for_completion(&devmem->completion);
+	devm_remove_action(devmem->device, &hmm_devmem_ref_kill, data);
+}
+
+static int hmm_devmem_fault(struct vm_area_struct *vma,
+			    unsigned long addr,
+			    struct page *page,
+			    unsigned int flags,
+			    pmd_t *pmdp)
+{
+	struct hmm_devmem *devmem = page->pgmap->data;
+
+	return devmem->ops->fault(devmem, vma, addr, page, flags, pmdp);
+}
+
+static void hmm_devmem_free(struct page *page, void *data)
+{
+	struct hmm_devmem *devmem = data;
+
+	devmem->ops->free(devmem, page);
+}
+
+static DEFINE_MUTEX(hmm_devmem_lock);
+static RADIX_TREE(hmm_devmem_radix, GFP_KERNEL);
+
+static void hmm_devmem_radix_release(struct resource *resource)
+{
+	resource_size_t key, align_start, align_size, align_end;
+
+	align_start = resource->start & ~(PA_SECTION_SIZE - 1);
+	align_size = ALIGN(resource_size(resource), PA_SECTION_SIZE);
+	align_end = align_start + align_size - 1;
+
+	mutex_lock(&hmm_devmem_lock);
+	for (key = resource->start;
+	     key <= resource->end;
+	     key += PA_SECTION_SIZE)
+		radix_tree_delete(&hmm_devmem_radix, key >> PA_SECTION_SHIFT);
+	mutex_unlock(&hmm_devmem_lock);
+}
+
+static void hmm_devmem_release(struct device *dev, void *data)
+{
+	struct hmm_devmem *devmem = data;
+	struct resource *resource = devmem->resource;
+	unsigned long start_pfn, npages;
+	struct zone *zone;
+	struct page *page;
+
+	if (percpu_ref_tryget_live(&devmem->ref)) {
+		dev_WARN(dev, "%s: page mapping is still live!\n", __func__);
+		percpu_ref_put(&devmem->ref);
+	}
+
+	/* pages are dead and unused, undo the arch mapping */
+	start_pfn = (resource->start & ~(PA_SECTION_SIZE - 1)) >> PAGE_SHIFT;
+	npages = ALIGN(resource_size(resource), PA_SECTION_SIZE) >> PAGE_SHIFT;
+
+	page = pfn_to_page(start_pfn);
+	zone = page_zone(page);
+
+	mem_hotplug_begin();
+	__remove_pages(zone, start_pfn, npages);
+	mem_hotplug_done();
+
+	hmm_devmem_radix_release(resource);
+}
+
+static struct hmm_devmem *hmm_devmem_find(resource_size_t phys)
+{
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	return radix_tree_lookup(&hmm_devmem_radix, phys >> PA_SECTION_SHIFT);
+}
+
+static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
+{
+	resource_size_t key, align_start, align_size, align_end;
+	struct device *device = devmem->device;
+	int ret, nid, is_ram;
+	unsigned long pfn;
+
+	align_start = devmem->resource->start & ~(PA_SECTION_SIZE - 1);
+	align_size = ALIGN(devmem->resource->start +
+			   resource_size(devmem->resource),
+			   PA_SECTION_SIZE) - align_start;
+
+	is_ram = region_intersects(align_start, align_size,
+				   IORESOURCE_SYSTEM_RAM,
+				   IORES_DESC_NONE);
+	if (is_ram == REGION_MIXED) {
+		WARN_ONCE(1, "%s attempted on mixed region %pr\n",
+				__func__, devmem->resource);
+		return -ENXIO;
+	}
+	if (is_ram == REGION_INTERSECTS)
+		return -ENXIO;
+
+	devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
+	devmem->pagemap.res = devmem->resource;
+	devmem->pagemap.page_fault = hmm_devmem_fault;
+	devmem->pagemap.page_free = hmm_devmem_free;
+	devmem->pagemap.dev = devmem->device;
+	devmem->pagemap.ref = &devmem->ref;
+	devmem->pagemap.data = devmem;
+
+	mutex_lock(&hmm_devmem_lock);
+	align_end = align_start + align_size - 1;
+	for (key = align_start; key <= align_end; key += PA_SECTION_SIZE) {
+		struct hmm_devmem *dup;
+
+		rcu_read_lock();
+		dup = hmm_devmem_find(key);
+		rcu_read_unlock();
+		if (dup) {
+			dev_err(device, "%s: collides with mapping for %s\n",
+				__func__, dev_name(dup->device));
+			mutex_unlock(&hmm_devmem_lock);
+			ret = -EBUSY;
+			goto error;
+		}
+		ret = radix_tree_insert(&hmm_devmem_radix,
+					key >> PA_SECTION_SHIFT,
+					devmem);
+		if (ret) {
+			dev_err(device, "%s: failed: %d\n", __func__, ret);
+			mutex_unlock(&hmm_devmem_lock);
+			goto error_radix;
+		}
+	}
+	mutex_unlock(&hmm_devmem_lock);
+
+	nid = dev_to_node(device);
+	if (nid < 0)
+		nid = numa_mem_id();
+
+	mem_hotplug_begin();
+	ret = add_pages(nid, align_start >> PAGE_SHIFT,
+			align_size >> PAGE_SHIFT, false);
+	if (ret) {
+		mem_hotplug_done();
+		goto error_add_memory;
+	}
+	move_pfn_range_to_zone(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
+				align_start >> PAGE_SHIFT,
+				align_size >> PAGE_SHIFT);
+	mem_hotplug_done();
+
+	for (pfn = devmem->pfn_first; pfn < devmem->pfn_last; pfn++) {
+		struct page *page = pfn_to_page(pfn);
+
+		/*
+		 * ZONE_DEVICE pages union ->lru with a ->pgmap back
+		 * pointer.  It is a bug if a ZONE_DEVICE page is ever
+		 * freed or placed on a driver-private list. Therefore,
+		 * seed the storage with LIST_POISON* values.
+		 */
+		list_del(&page->lru);
+		page->pgmap = &devmem->pagemap;
+	}
+	return 0;
+
+error_add_memory:
+	untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
+error_radix:
+	hmm_devmem_radix_release(devmem->resource);
+error:
+	return ret;
+}
+
+static int hmm_devmem_match(struct device *dev, void *data, void *match_data)
+{
+	struct hmm_devmem *devmem = data;
+
+	return devmem->resource == match_data;
+}
+
+static void hmm_devmem_pages_remove(struct hmm_devmem *devmem)
+{
+	devres_release(devmem->device, &hmm_devmem_release,
+		       &hmm_devmem_match, devmem->resource);
+}
+
+/*
+ * hmm_devmem_add() - hotplug ZONE_DEVICE memory for device memory
+ *
+ * @ops: memory event device driver callback (see struct hmm_devmem_ops)
+ * @device: device struct to bind the resource to
+ * @size: size in bytes of the device memory to add
+ * Returns: pointer to new hmm_devmem struct, ERR_PTR otherwise
+ *
+ * This first finds an empty range of physical address big enough to contain the
+ * new resource, and then hotplugs it as ZONE_DEVICE memory, which in turn
+ * allocates struct pages. It does not do anything beyond that; all events
+ * affecting the memory will go through the various callbacks provided by
+ * hmm_devmem_ops struct.
+ */
+struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
+				  struct device *device,
+				  unsigned long size)
+{
+	struct hmm_devmem *devmem;
+	resource_size_t addr;
+	int ret;
+
+	static_branch_enable(&device_private_key);
+
+	devmem = devres_alloc_node(&hmm_devmem_release, sizeof(*devmem),
+				   GFP_KERNEL, dev_to_node(device));
+	if (!devmem)
+		return ERR_PTR(-ENOMEM);
+
+	init_completion(&devmem->completion);
+	devmem->pfn_first = -1UL;
+	devmem->pfn_last = -1UL;
+	devmem->resource = NULL;
+	devmem->device = device;
+	devmem->ops = ops;
+
+	ret = percpu_ref_init(&devmem->ref, &hmm_devmem_ref_release,
+			      0, GFP_KERNEL);
+	if (ret)
+		goto error_percpu_ref;
+
+	ret = devm_add_action(device, hmm_devmem_ref_exit, &devmem->ref);
+	if (ret)
+		goto error_devm_add_action;
+
+	size = ALIGN(size, PA_SECTION_SIZE);
+	addr = min((unsigned long)iomem_resource.end, 1UL << MAX_PHYSMEM_BITS);
+	addr = addr - size + 1UL;
+
+	/*
+	 * FIXME add a new helper to quickly walk resource tree and find free
+	 * range
+	 *
+	 * FIXME what about ioport_resource resource ?
+	 */
+	for (; addr > size && addr >= iomem_resource.start; addr -= size) {
+		ret = region_intersects(addr, size, 0, IORES_DESC_NONE);
+		if (ret != REGION_DISJOINT)
+			continue;
+
+		devmem->resource = devm_request_mem_region(device, addr, size,
+							   dev_name(device));
+		if (!devmem->resource) {
+			ret = -ENOMEM;
+			goto error_no_resource;
+		}
+		break;
+	}
+	if (!devmem->resource) {
+		ret = -ERANGE;
+		goto error_no_resource;
+	}
+
+	devmem->resource->desc = IORES_DESC_DEVICE_PRIVATE_MEMORY;
+	devmem->pfn_first = devmem->resource->start >> PAGE_SHIFT;
+	devmem->pfn_last = devmem->pfn_first +
+			   (resource_size(devmem->resource) >> PAGE_SHIFT);
+
+	ret = hmm_devmem_pages_create(devmem);
+	if (ret)
+		goto error_pages;
+
+	devres_add(device, devmem);
+
+	ret = devm_add_action(device, hmm_devmem_ref_kill, &devmem->ref);
+	if (ret) {
+		hmm_devmem_remove(devmem);
+		return ERR_PTR(ret);
+	}
+
+	return devmem;
+
+error_pages:
+	devm_release_mem_region(device, devmem->resource->start,
+				resource_size(devmem->resource));
+error_no_resource:
+error_devm_add_action:
+	hmm_devmem_ref_kill(&devmem->ref);
+	hmm_devmem_ref_exit(&devmem->ref);
+error_percpu_ref:
+	devres_free(devmem);
+	return ERR_PTR(ret);
+}
+EXPORT_SYMBOL(hmm_devmem_add);
+
+/*
+ * hmm_devmem_remove() - remove device memory (kill and free ZONE_DEVICE)
+ *
+ * @devmem: hmm_devmem struct used to track and manage the ZONE_DEVICE memory
+ *
+ * This will hot-unplug memory that was hotplugged by hmm_devmem_add on behalf
+ * of the device driver. It will free struct page and remove the resource that
+ * reserved the physical address range for this device memory.
+ */
+void hmm_devmem_remove(struct hmm_devmem *devmem)
+{
+	resource_size_t start, size;
+	struct device *device;
+
+	if (!devmem)
+		return;
+
+	device = devmem->device;
+	start = devmem->resource->start;
+	size = resource_size(devmem->resource);
+
+	hmm_devmem_ref_kill(&devmem->ref);
+	hmm_devmem_ref_exit(&devmem->ref);
+	hmm_devmem_pages_remove(devmem);
+
+	devm_release_mem_region(device, start, size);
+}
+EXPORT_SYMBOL(hmm_devmem_remove);
+
+/*
+ * hmm_devmem_fault_range() - migrate back a virtual range of memory
+ *
+ * @devmem: hmm_devmem struct used to track and manage the ZONE_DEVICE memory
+ * @vma: virtual memory area containing the range to be migrated
+ * @ops: migration callback for allocating destination memory and copying
+ * @src: array of unsigned long containing source pfns
+ * @dst: array of unsigned long containing destination pfns
+ * @start: start address of the range to migrate (inclusive)
+ * @addr: fault address (must be inside the range)
+ * @end: end address of the range to migrate (exclusive)
+ * @private: pointer passed back to each of the callback
+ * Returns: 0 on success, VM_FAULT_SIGBUS on error
+ *
+ * This is a wrapper around migrate_vma() which checks the migration status
+ * for a given fault address and returns the corresponding page fault handler
+ * status. That will be 0 on success, or VM_FAULT_SIGBUS if migration failed
+ * for the faulting address.
+ *
+ * This is a helper intended to be used by the ZONE_DEVICE fault handler.
+ */
+int hmm_devmem_fault_range(struct hmm_devmem *devmem,
+			   struct vm_area_struct *vma,
+			   const struct migrate_vma_ops *ops,
+			   unsigned long *src,
+			   unsigned long *dst,
+			   unsigned long start,
+			   unsigned long addr,
+			   unsigned long end,
+			   void *private)
+{
+	if (migrate_vma(ops, vma, start, end, src, dst, private))
+		return VM_FAULT_SIGBUS;
+
+	if (dst[(addr - start) >> PAGE_SHIFT] & MIGRATE_PFN_ERROR)
+		return VM_FAULT_SIGBUS;
+
+	return 0;
+}
+EXPORT_SYMBOL(hmm_devmem_fault_range);
+#endif /* IS_ENABLED(CONFIG_HMM_DEVMEM) */
-- 
2.9.4

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [HMM 10/15] mm/hmm/devmem: dummy HMM device for ZONE_DEVICE memory v3
  2017-05-24 17:20 [HMM 00/15] HMM (Heterogeneous Memory Management) v23 Jérôme Glisse
                   ` (8 preceding siblings ...)
  2017-05-24 17:20 ` [HMM 09/15] mm/hmm/devmem: device memory hotplug using ZONE_DEVICE v5 Jérôme Glisse
@ 2017-05-24 17:20 ` Jérôme Glisse
  2017-05-24 17:20 ` [HMM 11/15] mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY Jérôme Glisse
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 66+ messages in thread
From: Jérôme Glisse @ 2017-05-24 17:20 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Dan Williams, Kirill A . Shutemov, John Hubbard,
	Jérôme Glisse, Evgeny Baskakov, Mark Hairgrove,
	Sherry Cheung, Subhash Gutti

This introduces a dummy HMM device class so a device driver can use it
to create an hmm_device for the sole purpose of registering device
memory. It is useful to device drivers that want to manage multiple
physical device memories under the same struct device umbrella.
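
Purely as an illustrative sketch (dummy_devmem_ops and the surrounding
driver code are hypothetical), a driver that lacks a suitable struct
device of its own would do something like:

  	struct hmm_device *hmm_device;
  	struct hmm_devmem *devmem;

  	hmm_device = hmm_device_new(drvdata);
  	if (IS_ERR_OR_NULL(hmm_device))
  		return hmm_device ? PTR_ERR(hmm_device) : -ENOMEM;

  	/* register the device memory against the fake device */
  	devmem = hmm_devmem_add(&dummy_devmem_ops, &hmm_device->device, size);

  	/* ... */

  	/* on teardown, drop the reference taken by hmm_device_new() */
  	hmm_device_put(hmm_device);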

Changed since v2:
  - use device_initcall() and drop everything that is module specific
Changed since v1:
  - Improve commit message
  - Add drvdata parameter to set on struct device

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/hmm.h | 22 +++++++++++++-
 mm/hmm.c            | 88 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 109 insertions(+), 1 deletion(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 0865afd..2f2a6ff 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -72,11 +72,11 @@
 
 #if IS_ENABLED(CONFIG_HMM)
 
+#include <linux/device.h>
 #include <linux/migrate.h>
 #include <linux/memremap.h>
 #include <linux/completion.h>
 
-
 struct hmm;
 
 /*
@@ -433,6 +433,26 @@ static inline unsigned long hmm_devmem_page_get_drvdata(struct page *page)
 
 	return drvdata[1];
 }
+
+
+/*
+ * struct hmm_device - fake device to hang device memory onto
+ *
+ * @device: device struct
+ * @minor: device minor number
+ */
+struct hmm_device {
+	struct device		device;
+	unsigned int		minor;
+};
+
+/*
+ * A device driver that wants to handle multiple device memories through a
+ * single fake device can use hmm_device to do so. This is purely a helper and
+ * it is not strictly needed in order to make use of any HMM functionality.
+ */
+struct hmm_device *hmm_device_new(void *drvdata);
+void hmm_device_put(struct hmm_device *hmm_device);
 #endif /* IS_ENABLED(CONFIG_HMM_DEVMEM) */
 
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 675a558..4ecf618 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -19,6 +19,7 @@
  */
 #include <linux/mm.h>
 #include <linux/hmm.h>
+#include <linux/init.h>
 #include <linux/rmap.h>
 #include <linux/swap.h>
 #include <linux/slab.h>
@@ -1143,4 +1144,91 @@ int hmm_devmem_fault_range(struct hmm_devmem *devmem,
 	return 0;
 }
 EXPORT_SYMBOL(hmm_devmem_fault_range);
+
+/*
+ * A device driver that wants to handle multiple device memories through a
+ * single fake device can use hmm_device to do so. This is purely a helper
+ * and it is not needed to make use of any HMM functionality.
+ */
+#define HMM_DEVICE_MAX 256
+
+static DECLARE_BITMAP(hmm_device_mask, HMM_DEVICE_MAX);
+static DEFINE_SPINLOCK(hmm_device_lock);
+static struct class *hmm_device_class;
+static dev_t hmm_device_devt;
+
+static void hmm_device_release(struct device *device)
+{
+	struct hmm_device *hmm_device;
+
+	hmm_device = container_of(device, struct hmm_device, device);
+	spin_lock(&hmm_device_lock);
+	clear_bit(hmm_device->minor, hmm_device_mask);
+	spin_unlock(&hmm_device_lock);
+
+	kfree(hmm_device);
+}
+
+struct hmm_device *hmm_device_new(void *drvdata)
+{
+	struct hmm_device *hmm_device;
+	int ret;
+
+	hmm_device = kzalloc(sizeof(*hmm_device), GFP_KERNEL);
+	if (!hmm_device)
+		return ERR_PTR(-ENOMEM);
+
+	ret = alloc_chrdev_region(&hmm_device->device.devt, 0, 1, "hmm_device");
+	if (ret < 0) {
+		kfree(hmm_device);
+		return NULL;
+	}
+
+	spin_lock(&hmm_device_lock);
+	hmm_device->minor = find_first_zero_bit(hmm_device_mask, HMM_DEVICE_MAX);
+	if (hmm_device->minor >= HMM_DEVICE_MAX) {
+		spin_unlock(&hmm_device_lock);
+		kfree(hmm_device);
+		return NULL;
+	}
+	set_bit(hmm_device->minor, hmm_device_mask);
+	spin_unlock(&hmm_device_lock);
+
+	dev_set_name(&hmm_device->device, "hmm_device%d", hmm_device->minor);
+	hmm_device->device.devt = MKDEV(MAJOR(hmm_device_devt),
+					hmm_device->minor);
+	hmm_device->device.release = hmm_device_release;
+	dev_set_drvdata(&hmm_device->device, drvdata);
+	hmm_device->device.class = hmm_device_class;
+	device_initialize(&hmm_device->device);
+
+	return hmm_device;
+}
+EXPORT_SYMBOL(hmm_device_new);
+
+void hmm_device_put(struct hmm_device *hmm_device)
+{
+	put_device(&hmm_device->device);
+}
+EXPORT_SYMBOL(hmm_device_put);
+
+static int __init hmm_init(void)
+{
+	int ret;
+
+	ret = alloc_chrdev_region(&hmm_device_devt, 0,
+				  HMM_DEVICE_MAX,
+				  "hmm_device");
+	if (ret)
+		return ret;
+
+	hmm_device_class = class_create(THIS_MODULE, "hmm_device");
+	if (IS_ERR(hmm_device_class)) {
+		unregister_chrdev_region(hmm_device_devt, HMM_DEVICE_MAX);
+		return PTR_ERR(hmm_device_class);
+	}
+	return 0;
+}
+
+device_initcall(hmm_init);
 #endif /* IS_ENABLED(CONFIG_HMM_DEVMEM) */
-- 
2.9.4

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [HMM 11/15] mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY
  2017-05-24 17:20 [HMM 00/15] HMM (Heterogeneous Memory Management) v23 Jérôme Glisse
                   ` (9 preceding siblings ...)
  2017-05-24 17:20 ` [HMM 10/15] mm/hmm/devmem: dummy HMM device for ZONE_DEVICE memory v3 Jérôme Glisse
@ 2017-05-24 17:20 ` Jérôme Glisse
  2017-05-24 17:20 ` [HMM 12/15] mm/migrate: new memory migration helper for use with device memory v4 Jérôme Glisse
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 66+ messages in thread
From: Jérôme Glisse @ 2017-05-24 17:20 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Dan Williams, Kirill A . Shutemov, John Hubbard, Jérôme Glisse

Introduce a new migration mode that allows offloading the copy to a
device DMA engine. This changes the migration workflow, and not all
address_space migratepage callbacks can support it, so each such
callback needs to check for the new mode and reject it when it cannot
be supported.

This is intended to be used by migrate_vma(), which itself is used for
things like HMM (see include/linux/hmm.h).
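
The pattern each converted migratepage callback follows is roughly the
sketch below (dummy_migrate_page is a hypothetical callback, modeled
on the migrate_page() change in this patch):

  static int dummy_migrate_page(struct address_space *mapping,
  			      struct page *newpage, struct page *page,
  			      enum migrate_mode mode)
  {
  	int rc;

  	rc = migrate_page_move_mapping(mapping, newpage, page, NULL, mode, 0);
  	if (rc != MIGRATEPAGE_SUCCESS)
  		return rc;

  	if (mode != MIGRATE_SYNC_NO_COPY)
  		migrate_page_copy(newpage, page);   /* CPU copies the data */
  	else
  		migrate_page_states(newpage, page); /* caller copies, eg via DMA */

  	return MIGRATEPAGE_SUCCESS;
  }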

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 fs/aio.c                     |  8 +++++++
 fs/f2fs/data.c               |  5 ++++-
 fs/hugetlbfs/inode.c         |  5 ++++-
 fs/ubifs/file.c              |  5 ++++-
 include/linux/migrate.h      |  5 +++++
 include/linux/migrate_mode.h |  5 +++++
 mm/balloon_compaction.c      |  8 +++++++
 mm/migrate.c                 | 52 ++++++++++++++++++++++++++++++++++----------
 mm/zsmalloc.c                |  8 +++++++
 9 files changed, 86 insertions(+), 15 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index f52d925..e51351e 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -373,6 +373,14 @@ static int aio_migratepage(struct address_space *mapping, struct page *new,
 	pgoff_t idx;
 	int rc;
 
+	/*
+	 * We cannot support the _NO_COPY case here, because copy needs to
+	 * happen under the ctx->completion_lock. That does not work with the
+	 * migration workflow of MIGRATE_SYNC_NO_COPY.
+	 */
+	if (mode == MIGRATE_SYNC_NO_COPY)
+		return -EINVAL;
+
 	rc = 0;
 
 	/* mapping->private_lock here protects against the kioctx teardown.  */
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 7c0f6bd..b824d05 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -2185,7 +2185,10 @@ int f2fs_migrate_page(struct address_space *mapping,
 		SetPagePrivate(newpage);
 	set_page_private(newpage, page_private(page));
 
-	migrate_page_copy(newpage, page);
+	if (mode != MIGRATE_SYNC_NO_COPY)
+		migrate_page_copy(newpage, page);
+	else
+		migrate_page_states(newpage, page);
 
 	return MIGRATEPAGE_SUCCESS;
 }
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index dde8613..c02ff56 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -846,7 +846,10 @@ static int hugetlbfs_migrate_page(struct address_space *mapping,
 	rc = migrate_huge_page_move_mapping(mapping, newpage, page);
 	if (rc != MIGRATEPAGE_SUCCESS)
 		return rc;
-	migrate_page_copy(newpage, page);
+	if (mode != MIGRATE_SYNC_NO_COPY)
+		migrate_page_copy(newpage, page);
+	else
+		migrate_page_states(newpage, page);
 
 	return MIGRATEPAGE_SUCCESS;
 }
diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
index 2cda3d6..b2292be 100644
--- a/fs/ubifs/file.c
+++ b/fs/ubifs/file.c
@@ -1482,7 +1482,10 @@ static int ubifs_migrate_page(struct address_space *mapping,
 		SetPagePrivate(newpage);
 	}
 
-	migrate_page_copy(newpage, page);
+	if (mode != MIGRATE_SYNC_NO_COPY)
+		migrate_page_copy(newpage, page);
+	else
+		migrate_page_states(newpage, page);
 	return MIGRATEPAGE_SUCCESS;
 }
 #endif
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 48e2484..78a0fdc 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -43,6 +43,7 @@ extern void putback_movable_page(struct page *page);
 
 extern int migrate_prep(void);
 extern int migrate_prep_local(void);
+extern void migrate_page_states(struct page *newpage, struct page *page);
 extern void migrate_page_copy(struct page *newpage, struct page *page);
 extern int migrate_huge_page_move_mapping(struct address_space *mapping,
 				  struct page *newpage, struct page *page);
@@ -63,6 +64,10 @@ static inline int isolate_movable_page(struct page *page, isolate_mode_t mode)
 static inline int migrate_prep(void) { return -ENOSYS; }
 static inline int migrate_prep_local(void) { return -ENOSYS; }
 
+static inline void migrate_page_states(struct page *newpage, struct page *page)
+{
+}
+
 static inline void migrate_page_copy(struct page *newpage,
 				     struct page *page) {}
 
diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h
index ebf3d89..bdf66af 100644
--- a/include/linux/migrate_mode.h
+++ b/include/linux/migrate_mode.h
@@ -6,11 +6,16 @@
  *	on most operations but not ->writepage as the potential stall time
  *	is too significant
  * MIGRATE_SYNC will block when migrating pages
+ * MIGRATE_SYNC_NO_COPY will block when migrating pages but will not copy pages
+ *	with the CPU. Instead, page copy happens outside the migratepage()
+ *	callback and is likely using a DMA engine. See migrate_vma() and HMM
+ *	(mm/hmm.c) for users of this mode.
  */
 enum migrate_mode {
 	MIGRATE_ASYNC,
 	MIGRATE_SYNC_LIGHT,
 	MIGRATE_SYNC,
+	MIGRATE_SYNC_NO_COPY,
 };
 
 #endif		/* MIGRATE_MODE_H_INCLUDED */
diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
index da91df5..145b903 100644
--- a/mm/balloon_compaction.c
+++ b/mm/balloon_compaction.c
@@ -139,6 +139,14 @@ int balloon_page_migrate(struct address_space *mapping,
 {
 	struct balloon_dev_info *balloon = balloon_page_device(page);
 
+	/*
+	 * We can not easily support the no copy case here so ignore it as it
+	 * is unlikely to be used with balloon pages. See include/linux/hmm.h for
+	 * users of the MIGRATE_SYNC_NO_COPY mode.
+	 */
+	if (mode == MIGRATE_SYNC_NO_COPY)
+		return -EINVAL;
+
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 051cc15..66410fc 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -603,15 +603,10 @@ static void copy_huge_page(struct page *dst, struct page *src)
 /*
  * Copy the page to its new location
  */
-void migrate_page_copy(struct page *newpage, struct page *page)
+void migrate_page_states(struct page *newpage, struct page *page)
 {
 	int cpupid;
 
-	if (PageHuge(page) || PageTransHuge(page))
-		copy_huge_page(newpage, page);
-	else
-		copy_highpage(newpage, page);
-
 	if (PageError(page))
 		SetPageError(newpage);
 	if (PageReferenced(page))
@@ -665,6 +660,17 @@ void migrate_page_copy(struct page *newpage, struct page *page)
 
 	mem_cgroup_migrate(page, newpage);
 }
+EXPORT_SYMBOL(migrate_page_states);
+
+void migrate_page_copy(struct page *newpage, struct page *page)
+{
+	if (PageHuge(page) || PageTransHuge(page))
+		copy_huge_page(newpage, page);
+	else
+		copy_highpage(newpage, page);
+
+	migrate_page_states(newpage, page);
+}
 EXPORT_SYMBOL(migrate_page_copy);
 
 /************************************************************
@@ -690,7 +696,10 @@ int migrate_page(struct address_space *mapping,
 	if (rc != MIGRATEPAGE_SUCCESS)
 		return rc;
 
-	migrate_page_copy(newpage, page);
+	if (mode != MIGRATE_SYNC_NO_COPY)
+		migrate_page_copy(newpage, page);
+	else
+		migrate_page_states(newpage, page);
 	return MIGRATEPAGE_SUCCESS;
 }
 EXPORT_SYMBOL(migrate_page);
@@ -740,12 +749,15 @@ int buffer_migrate_page(struct address_space *mapping,
 
 	SetPagePrivate(newpage);
 
-	migrate_page_copy(newpage, page);
+	if (mode != MIGRATE_SYNC_NO_COPY)
+		migrate_page_copy(newpage, page);
+	else
+		migrate_page_states(newpage, page);
 
 	bh = head;
 	do {
 		unlock_buffer(bh);
- 		put_bh(bh);
+		put_bh(bh);
 		bh = bh->b_this_page;
 
 	} while (bh != head);
@@ -804,8 +816,13 @@ static int fallback_migrate_page(struct address_space *mapping,
 {
 	if (PageDirty(page)) {
 		/* Only writeback pages in full synchronous migration */
-		if (mode != MIGRATE_SYNC)
+		switch (mode) {
+		case MIGRATE_SYNC:
+		case MIGRATE_SYNC_NO_COPY:
+			break;
+		default:
 			return -EBUSY;
+		}
 		return writeout(mapping, page);
 	}
 
@@ -942,7 +959,11 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 		 * the retry loop is too short and in the sync-light case,
 		 * the overhead of stalling is too much
 		 */
-		if (mode != MIGRATE_SYNC) {
+		switch (mode) {
+		case MIGRATE_SYNC:
+		case MIGRATE_SYNC_NO_COPY:
+			break;
+		default:
 			rc = -EBUSY;
 			goto out_unlock;
 		}
@@ -1212,8 +1233,15 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 		return -ENOMEM;
 
 	if (!trylock_page(hpage)) {
-		if (!force || mode != MIGRATE_SYNC)
+		if (!force)
 			goto out;
+		switch (mode) {
+		case MIGRATE_SYNC:
+		case MIGRATE_SYNC_NO_COPY:
+			break;
+		default:
+			goto out;
+		}
 		lock_page(hpage);
 	}
 
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index d41edd2..aeea3a5 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -1983,6 +1983,14 @@ int zs_page_migrate(struct address_space *mapping, struct page *newpage,
 	unsigned int obj_idx;
 	int ret = -EAGAIN;
 
+	/*
+	 * We cannot support the _NO_COPY case here, because copy needs to
+	 * happen under the zs lock, which does not work with
+	 * MIGRATE_SYNC_NO_COPY workflow.
+	 */
+	if (mode == MIGRATE_SYNC_NO_COPY)
+		return -EINVAL;
+
 	VM_BUG_ON_PAGE(!PageMovable(page), page);
 	VM_BUG_ON_PAGE(!PageIsolated(page), page);
 
-- 
2.9.4

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [HMM 12/15] mm/migrate: new memory migration helper for use with device memory v4
  2017-05-24 17:20 [HMM 00/15] HMM (Heterogeneous Memory Management) v23 Jérôme Glisse
                   ` (10 preceding siblings ...)
  2017-05-24 17:20 ` [HMM 11/15] mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY Jérôme Glisse
@ 2017-05-24 17:20 ` Jérôme Glisse
  2017-05-31  3:59   ` Balbir Singh
  2017-05-24 17:20 ` [HMM 13/15] mm/migrate: migrate_vma() unmap page from vma while collecting pages Jérôme Glisse
                   ` (4 subsequent siblings)
  16 siblings, 1 reply; 66+ messages in thread
From: Jérôme Glisse @ 2017-05-24 17:20 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Dan Williams, Kirill A . Shutemov, John Hubbard,
	Jérôme Glisse, Evgeny Baskakov, Mark Hairgrove,
	Sherry Cheung, Subhash Gutti

This patch adds a new memory migration helper, which migrates the
memory backing a range of virtual addresses of a process to different
memory (which can be allocated through a special allocator). It
differs from numa migration by working on a range of virtual addresses
and thus by doing the migration in chunks that can be large enough to
use a DMA engine or a special copy offloading engine.

Expected users are anyone with heterogeneous memory where different
memories have different characteristics (latency, bandwidth, ...). As
an example, IBM platforms with the CAPI bus can make use of this
feature to migrate between regular memory and CAPI device memory. New
CPU architectures with a pool of high performance memory not managed
as cache but presented as regular memory (while being faster and with
lower latency than DDR) will also be prime users of this patch.

Migration to private device memory will be useful for devices that
have a large pool of such memory, like GPUs; NVidia plans to use HMM
for that.
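
A rough usage sketch follows (all dummy_* names are hypothetical; the
exact contract for the two callbacks is documented with struct
migrate_vma_ops and migrate_vma() below):

  static void dummy_alloc_and_copy(struct vm_area_struct *vma,
  				 const unsigned long *src, unsigned long *dst,
  				 unsigned long start, unsigned long end,
  				 void *private)
  {
  	/*
  	 * For each src[i] with MIGRATE_PFN_MIGRATE set: allocate a
  	 * destination page, copy the data (eg with a DMA engine), lock the
  	 * new page and set dst[i] = migrate_pfn(new_pfn) | MIGRATE_PFN_LOCKED.
  	 */
  }

  static void dummy_finalize_and_map(struct vm_area_struct *vma,
  				   const unsigned long *src,
  				   const unsigned long *dst,
  				   unsigned long start, unsigned long end,
  				   void *private)
  {
  	/* update device page table for entries still flagged MIGRATE_PFN_MIGRATE */
  }

  static const struct migrate_vma_ops dummy_migrate_ops = {
  	.alloc_and_copy	  = dummy_alloc_and_copy,
  	.finalize_and_map = dummy_finalize_and_map,
  };

  	/* src and dst must each hold (end - start) >> PAGE_SHIFT entries */
  	ret = migrate_vma(&dummy_migrate_ops, vma, start, end, src, dst, NULL);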

Changes since v3:
  - Rebase

Changes since v2:
  - dropped HMM prefix and HMM specific code
Changes since v1:
  - typos fix
  - split early unmap optimization for page with single mapping

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/migrate.h | 104 ++++++++++++
 mm/migrate.c            | 444 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 548 insertions(+)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 78a0fdc..576b3f5 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -127,4 +127,108 @@ static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 }
 #endif /* CONFIG_NUMA_BALANCING && CONFIG_TRANSPARENT_HUGEPAGE*/
 
+
+#ifdef CONFIG_MIGRATION
+
+#define MIGRATE_PFN_VALID	(1UL << 0)
+#define MIGRATE_PFN_MIGRATE	(1UL << 1)
+#define MIGRATE_PFN_LOCKED	(1UL << 2)
+#define MIGRATE_PFN_WRITE	(1UL << 3)
+#define MIGRATE_PFN_ERROR	(1UL << 4)
+#define MIGRATE_PFN_SHIFT	5
+
+static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
+{
+	if (!(mpfn & MIGRATE_PFN_VALID))
+		return NULL;
+	return pfn_to_page(mpfn >> MIGRATE_PFN_SHIFT);
+}
+
+static inline unsigned long migrate_pfn(unsigned long pfn)
+{
+	return (pfn << MIGRATE_PFN_SHIFT) | MIGRATE_PFN_VALID;
+}
+
+/*
+ * struct migrate_vma_ops - migrate operation callback
+ *
+ * @alloc_and_copy: alloc destination memory and copy source memory to it
+ * @finalize_and_map: allow caller to map the successfully migrated pages
+ *
+ *
+ * The alloc_and_copy() callback happens once all source pages have been locked,
+ * unmapped and checked (checked whether pinned or not). All pages that can be
+ * migrated will have an entry in the src array set with the pfn value of the
+ * page and with the MIGRATE_PFN_VALID and MIGRATE_PFN_MIGRATE flag set (other
+ * flags might be set but should be ignored by the callback).
+ *
+ * The alloc_and_copy() callback can then allocate destination memory and copy
+ * source memory to it for all those entries (ie with MIGRATE_PFN_VALID and
+ * MIGRATE_PFN_MIGRATE flag set). Once these are allocated and copied, the
+ * callback must update each corresponding entry in the dst array with the pfn
+ * value of the destination page and with the MIGRATE_PFN_VALID and
+ * MIGRATE_PFN_LOCKED flags set (destination pages must have their struct pages
+ * locked, via lock_page()).
+ *
+ * At this point the alloc_and_copy() callback is done and returns.
+ *
+ * Note that the callback does not have to migrate all the pages that are
+ * marked with MIGRATE_PFN_MIGRATE flag in src array unless this is a migration
+ * from device memory to system memory (ie the MIGRATE_PFN_DEVICE flag is also
+ * set in the src array entry). If the device driver cannot migrate a device
+ * page back to system memory, then it must set the corresponding dst array
+ * entry to MIGRATE_PFN_ERROR. This will trigger a SIGBUS if CPU tries to
+ * access any of the virtual addresses originally backed by this page. Because
+ * a SIGBUS is such a severe result for the userspace process, the device
+ * driver should avoid setting MIGRATE_PFN_ERROR unless it is really in an
+ * unrecoverable state.
+ *
+ * THE alloc_and_copy() CALLBACK MUST NOT CHANGE ANY OF THE SRC ARRAY ENTRIES
+ * OR BAD THINGS WILL HAPPEN !
+ *
+ *
+ * The finalize_and_map() callback happens after struct page migration from
+ * source to destination (destination struct pages are the struct pages for the
+ * memory allocated by the alloc_and_copy() callback).  Migration can fail, and
+ * thus the finalize_and_map() allows the driver to inspect which pages were
+ * successfully migrated, and which were not. Successfully migrated pages will
+ * have the MIGRATE_PFN_MIGRATE flag set for their src array entry.
+ *
+ * It is safe to update device page table from within the finalize_and_map()
+ * callback because both destination and source page are still locked, and the
+ * mmap_sem is held in read mode (hence no one can unmap the range being
+ * migrated).
+ *
+ * Once callback is done cleaning up things and updating its page table (if it
+ * chose to do so, this is not an obligation) then it returns. At this point,
+ * the HMM core will finish up the final steps, and the migration is complete.
+ *
+ * THE finalize_and_map() CALLBACK MUST NOT CHANGE ANY OF THE SRC OR DST ARRAY
+ * ENTRIES OR BAD THINGS WILL HAPPEN !
+ */
+struct migrate_vma_ops {
+	void (*alloc_and_copy)(struct vm_area_struct *vma,
+			       const unsigned long *src,
+			       unsigned long *dst,
+			       unsigned long start,
+			       unsigned long end,
+			       void *private);
+	void (*finalize_and_map)(struct vm_area_struct *vma,
+				 const unsigned long *src,
+				 const unsigned long *dst,
+				 unsigned long start,
+				 unsigned long end,
+				 void *private);
+};
+
+int migrate_vma(const struct migrate_vma_ops *ops,
+		struct vm_area_struct *vma,
+		unsigned long start,
+		unsigned long end,
+		unsigned long *src,
+		unsigned long *dst,
+		void *private);
+
+#endif /* CONFIG_MIGRATION */
+
 #endif /* _LINUX_MIGRATE_H */
diff --git a/mm/migrate.c b/mm/migrate.c
index 66410fc..12063f3 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -397,6 +397,14 @@ int migrate_page_move_mapping(struct address_space *mapping,
 	int expected_count = 1 + extra_count;
 	void **pslot;
 
+	/*
+	 * ZONE_DEVICE pages have 1 refcount always held by their device
+	 *
+	 * Note that DAX memory will never reach that point as it does not have
+	 * the MEMORY_DEVICE_ALLOW_MIGRATE flag set (see memory_hotplug.h).
+	 */
+	expected_count += is_zone_device_page(page);
+
 	if (!mapping) {
 		/* Anonymous page without mapping */
 		if (page_count(page) != expected_count)
@@ -2077,3 +2085,439 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 #endif /* CONFIG_NUMA_BALANCING */
 
 #endif /* CONFIG_NUMA */
+
+
+struct migrate_vma {
+	struct vm_area_struct	*vma;
+	unsigned long		*dst;
+	unsigned long		*src;
+	unsigned long		cpages;
+	unsigned long		npages;
+	unsigned long		start;
+	unsigned long		end;
+};
+
+static int migrate_vma_collect_hole(unsigned long start,
+				    unsigned long end,
+				    struct mm_walk *walk)
+{
+	struct migrate_vma *migrate = walk->private;
+	unsigned long addr, next;
+
+	for (addr = start & PAGE_MASK; addr < end; addr += PAGE_SIZE) {
+		migrate->dst[migrate->npages] = 0;
+		migrate->src[migrate->npages++] = 0;
+	}
+
+	return 0;
+}
+
+static int migrate_vma_collect_pmd(pmd_t *pmdp,
+				   unsigned long start,
+				   unsigned long end,
+				   struct mm_walk *walk)
+{
+	struct migrate_vma *migrate = walk->private;
+	struct mm_struct *mm = walk->vma->vm_mm;
+	unsigned long addr = start;
+	spinlock_t *ptl;
+	pte_t *ptep;
+
+	if (pmd_none(*pmdp) || pmd_trans_unstable(pmdp)) {
+		/* FIXME support THP */
+		return migrate_vma_collect_hole(start, end, walk);
+	}
+
+	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+	for (; addr < end; addr += PAGE_SIZE, ptep++) {
+		unsigned long mpfn, pfn;
+		struct page *page;
+		pte_t pte;
+
+		pte = *ptep;
+		pfn = pte_pfn(pte);
+
+		if (!pte_present(pte)) {
+			mpfn = pfn = 0;
+			goto next;
+		}
+
+		/* FIXME support THP */
+		page = vm_normal_page(migrate->vma, addr, pte);
+		if (!page || !page->mapping || PageTransCompound(page)) {
+			mpfn = pfn = 0;
+			goto next;
+		}
+
+		/*
+		 * By getting a reference on the page we pin it and that blocks
+		 * any kind of migration. Side effect is that it "freezes" the
+		 * pte.
+		 *
+		 * We drop this reference after isolating the page from the lru
+		 * for non device page (device page are not on the lru and thus
+		 * can't be dropped from it).
+		 */
+		get_page(page);
+		migrate->cpages++;
+		mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
+		mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
+
+next:
+		migrate->src[migrate->npages++] = mpfn;
+	}
+	pte_unmap_unlock(ptep - 1, ptl);
+
+	return 0;
+}
+
+/*
+ * migrate_vma_collect() - collect pages over a range of virtual addresses
+ * @migrate: migrate struct containing all migration information
+ *
+ * This will walk the CPU page table. For each virtual address backed by a
+ * valid page, it updates the src array and takes a reference on the page, in
+ * order to pin the page until we lock it and unmap it.
+ */
+static void migrate_vma_collect(struct migrate_vma *migrate)
+{
+	struct mm_walk mm_walk;
+
+	mm_walk.pmd_entry = migrate_vma_collect_pmd;
+	mm_walk.pte_entry = NULL;
+	mm_walk.pte_hole = migrate_vma_collect_hole;
+	mm_walk.hugetlb_entry = NULL;
+	mm_walk.test_walk = NULL;
+	mm_walk.vma = migrate->vma;
+	mm_walk.mm = migrate->vma->vm_mm;
+	mm_walk.private = migrate;
+
+	walk_page_range(migrate->start, migrate->end, &mm_walk);
+
+	migrate->end = migrate->start + (migrate->npages << PAGE_SHIFT);
+}
+
+/*
+ * migrate_vma_check_page() - check if page is pinned or not
+ * @page: struct page to check
+ *
+ * Pinned pages cannot be migrated. This is the same test as in
+ * migrate_page_move_mapping(), except that here we allow migration of a
+ * ZONE_DEVICE page.
+ */
+static bool migrate_vma_check_page(struct page *page)
+{
+	/*
+	 * One extra ref because caller holds an extra reference, either from
+	 * isolate_lru_page() for a regular page, or migrate_vma_collect() for
+	 * a device page.
+	 */
+	int extra = 1;
+
+	/*
+	 * FIXME support THP (transparent huge page), it is bit more complex to
+	 * check them than regular pages, because they can be mapped with a pmd
+	 * or with a pte (split pte mapping).
+	 */
+	if (PageCompound(page))
+		return false;
+
+	if ((page_count(page) - extra) > page_mapcount(page))
+		return false;
+
+	return true;
+}
+
+/*
+ * migrate_vma_prepare() - lock pages and isolate them from the lru
+ * @migrate: migrate struct containing all migration information
+ *
+ * This locks pages that have been collected by migrate_vma_collect(). Once each
+ * page is locked it is isolated from the lru (for non-device pages). Finally,
+ * the ref taken by migrate_vma_collect() is dropped, as locked pages cannot be
+ * migrated by concurrent kernel threads.
+ */
+static void migrate_vma_prepare(struct migrate_vma *migrate)
+{
+	const unsigned long npages = migrate->npages;
+	const unsigned long start = migrate->start;
+	unsigned long addr, i, restore = 0;
+	bool allow_drain = true;
+
+	lru_add_drain();
+
+	for (i = 0; i < npages; i++) {
+		struct page *page = migrate_pfn_to_page(migrate->src[i]);
+
+		if (!page)
+			continue;
+
+		lock_page(page);
+		migrate->src[i] |= MIGRATE_PFN_LOCKED;
+
+		if (!PageLRU(page) && allow_drain) {
+			/* Drain CPU's pagevec */
+			lru_add_drain_all();
+			allow_drain = false;
+		}
+
+		if (isolate_lru_page(page)) {
+			migrate->src[i] = 0;
+			unlock_page(page);
+			migrate->cpages--;
+			put_page(page);
+			continue;
+		}
+
+		if (!migrate_vma_check_page(page)) {
+			migrate->src[i] = 0;
+			unlock_page(page);
+			migrate->cpages--;
+
+			putback_lru_page(page);
+		}
+	}
+}
+
+/*
+ * migrate_vma_unmap() - replace page mapping with special migration pte entry
+ * @migrate: migrate struct containing all migration information
+ *
+ * Replace page mapping (CPU page table pte) with a special migration pte entry
+ * and check again if it has been pinned. Pinned pages are restored because we
+ * cannot migrate them.
+ *
+ * This is the last step before we call the device driver callback to allocate
+ * destination memory and copy contents of original page over to new page.
+ */
+static void migrate_vma_unmap(struct migrate_vma *migrate)
+{
+	int flags = TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
+	const unsigned long npages = migrate->npages;
+	const unsigned long start = migrate->start;
+	unsigned long addr, i, restore = 0;
+
+	for (i = 0; i < npages; i++) {
+		struct page *page = migrate_pfn_to_page(migrate->src[i]);
+
+		if (!page || !(migrate->src[i] & MIGRATE_PFN_MIGRATE))
+			continue;
+
+		try_to_unmap(page, flags);
+		if (page_mapped(page) || !migrate_vma_check_page(page)) {
+			migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+			migrate->cpages--;
+			restore++;
+		}
+	}
+
+	for (addr = start, i = 0; i < npages && restore; addr += PAGE_SIZE, i++) {
+		struct page *page = migrate_pfn_to_page(migrate->src[i]);
+
+		if (!page || (migrate->src[i] & MIGRATE_PFN_MIGRATE))
+			continue;
+
+		remove_migration_ptes(page, page, false);
+
+		migrate->src[i] = 0;
+		unlock_page(page);
+		restore--;
+
+		putback_lru_page(page);
+	}
+}
+
+/*
+ * migrate_vma_pages() - migrate meta-data from src page to dst page
+ * @migrate: migrate struct containing all migration information
+ *
+ * This migrates struct page meta-data from source struct page to destination
+ * struct page. This effectively finishes the migration from source page to the
+ * destination page.
+ */
+static void migrate_vma_pages(struct migrate_vma *migrate)
+{
+	const unsigned long npages = migrate->npages;
+	const unsigned long start = migrate->start;
+	unsigned long addr, i;
+
+	for (i = 0, addr = start; i < npages; addr += PAGE_SIZE, i++) {
+		struct page *newpage = migrate_pfn_to_page(migrate->dst[i]);
+		struct page *page = migrate_pfn_to_page(migrate->src[i]);
+		struct address_space *mapping;
+		int r;
+
+		if (!page || !newpage)
+			continue;
+		if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE))
+			continue;
+
+		mapping = page_mapping(page);
+
+		r = migrate_page(mapping, newpage, page, MIGRATE_SYNC_NO_COPY);
+		if (r != MIGRATEPAGE_SUCCESS)
+			migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+	}
+}
+
+/*
+ * migrate_vma_finalize() - restore CPU page table entry
+ * @migrate: migrate struct containing all migration information
+ *
+ * This replaces the special migration pte entry with either a mapping to the
+ * new page if migration was successful for that page, or to the original page
+ * otherwise.
+ *
+ * This also unlocks the pages and puts them back on the lru, or drops the extra
+ * refcount, for device pages.
+ */
+static void migrate_vma_finalize(struct migrate_vma *migrate)
+{
+	const unsigned long npages = migrate->npages;
+	unsigned long i;
+
+	for (i = 0; i < npages; i++) {
+		struct page *newpage = migrate_pfn_to_page(migrate->dst[i]);
+		struct page *page = migrate_pfn_to_page(migrate->src[i]);
+
+		if (!page)
+			continue;
+		if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE) || !newpage) {
+			if (newpage) {
+				unlock_page(newpage);
+				put_page(newpage);
+			}
+			newpage = page;
+		}
+
+		remove_migration_ptes(page, newpage, false);
+		unlock_page(page);
+		migrate->cpages--;
+
+		putback_lru_page(page);
+
+		if (newpage != page) {
+			unlock_page(newpage);
+			putback_lru_page(newpage);
+		}
+	}
+}
+
+/*
+ * migrate_vma() - migrate a range of memory inside vma
+ *
+ * @ops: migration callback for allocating destination memory and copying
+ * @vma: virtual memory area containing the range to be migrated
+ * @start: start address of the range to migrate (inclusive)
+ * @end: end address of the range to migrate (exclusive)
+ * @src: array of hmm_pfn_t containing source pfns
+ * @dst: array of hmm_pfn_t containing destination pfns
+ * @private: pointer passed back to each of the callback
+ * Returns: 0 on success, error code otherwise
+ *
+ * This function tries to migrate memory backing a range of virtual addresses,
+ * using callbacks to allocate and copy memory from source to destination. First it
+ * collects all the pages backing each virtual address in the range, saving this
+ * inside the src array. Then it locks those pages and unmaps them. Once the pages
+ * are locked and unmapped, it checks whether each page is pinned or not. Pages
+ * that aren't pinned have the MIGRATE_PFN_MIGRATE flag set (by this function)
+ * in the corresponding src array entry. It then restores any pages that are
+ * pinned, by remapping and unlocking those pages.
+ *
+ * At this point it calls the alloc_and_copy() callback. For documentation on
+ * what is expected from that callback, see struct migrate_vma_ops comments in
+ * include/linux/migrate.h
+ *
+ * After the alloc_and_copy() callback, this function goes over each entry in
+ * the src array that has the MIGRATE_PFN_VALID and MIGRATE_PFN_MIGRATE flag
+ * set. If the corresponding entry in dst array has MIGRATE_PFN_VALID flag set,
+ * then the function tries to migrate struct page information from the source
+ * struct page to the destination struct page. If it fails to migrate the struct
+ * page information, then it clears the MIGRATE_PFN_MIGRATE flag in the src
+ * array.
+ *
+ * At this point all successfully migrated pages have an entry in the src
+ * array with MIGRATE_PFN_VALID and MIGRATE_PFN_MIGRATE flag set and the dst
+ * array entry with MIGRATE_PFN_VALID flag set.
+ *
+ * It then calls the finalize_and_map() callback. See comments for "struct
+ * migrate_vma_ops", in include/linux/migrate.h for details about
+ * finalize_and_map() behavior.
+ *
+ * After the finalize_and_map() callback, for successfully migrated pages, this
+ * function updates the CPU page table to point to new pages, otherwise it
+ * restores the CPU page table to point to the original source pages.
+ *
+ * The function returns 0 after the above steps, even if no pages were migrated
+ * (it only returns an error if any of the arguments are invalid).
+ *
+ * Both src and dst arrays must be big enough for (end - start) >> PAGE_SHIFT
+ * unsigned long entries.
+ */
+int migrate_vma(const struct migrate_vma_ops *ops,
+		struct vm_area_struct *vma,
+		unsigned long start,
+		unsigned long end,
+		unsigned long *src,
+		unsigned long *dst,
+		void *private)
+{
+	struct migrate_vma migrate;
+
+	/* Sanity check the arguments */
+	start &= PAGE_MASK;
+	end &= PAGE_MASK;
+	if (!vma || is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL))
+		return -EINVAL;
+	if (start < vma->vm_start || start >= vma->vm_end)
+		return -EINVAL;
+	if (end <= vma->vm_start || end > vma->vm_end)
+		return -EINVAL;
+	if (!ops || !src || !dst || start >= end)
+		return -EINVAL;
+
+	memset(src, 0, sizeof(*src) * ((end - start) >> PAGE_SHIFT));
+	migrate.src = src;
+	migrate.dst = dst;
+	migrate.start = start;
+	migrate.npages = 0;
+	migrate.cpages = 0;
+	migrate.end = end;
+	migrate.vma = vma;
+
+	/* Collect, and try to unmap source pages */
+	migrate_vma_collect(&migrate);
+	if (!migrate.cpages)
+		return 0;
+
+	/* Lock and isolate page */
+	migrate_vma_prepare(&migrate);
+	if (!migrate.cpages)
+		return 0;
+
+	/* Unmap pages */
+	migrate_vma_unmap(&migrate);
+	if (!migrate.cpages)
+		return 0;
+
+	/*
+	 * At this point pages are locked and unmapped, and thus they have
+	 * stable content and can safely be copied to destination memory that
+	 * is allocated by the callback.
+	 *
+	 * Note that migration can fail in migrate_vma_pages() for each
+	 * individual page.
+	 */
+	ops->alloc_and_copy(vma, src, dst, start, end, private);
+
+	/* This does the real migration of struct page */
+	migrate_vma_pages(&migrate);
+
+	ops->finalize_and_map(vma, src, dst, start, end, private);
+
+	/* Unlock and remap pages */
+	migrate_vma_finalize(&migrate);
+
+	return 0;
+}
+EXPORT_SYMBOL(migrate_vma);
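
To make the ops-based flow above easier to picture, here is a minimal,
hypothetical driver-side sketch. Everything prefixed my_ (my_dev_alloc_page(),
my_dev_copy_page(), the private cookie) is an assumption made purely for
illustration; the authoritative contract for the callbacks is the
struct migrate_vma_ops comment in include/linux/migrate.h added earlier in
this series.

/* Illustrative sketch only -- my_dev_*() helpers are assumptions. */
static void my_alloc_and_copy(struct vm_area_struct *vma,
			      const unsigned long *src, unsigned long *dst,
			      unsigned long start, unsigned long end,
			      void *private)
{
	unsigned long addr, i;

	for (i = 0, addr = start; addr < end; addr += PAGE_SIZE, i++) {
		struct page *spage = migrate_pfn_to_page(src[i]);
		struct page *dpage;

		/* Skip entries the core code decided not to migrate */
		if (!spage || !(src[i] & MIGRATE_PFN_MIGRATE))
			continue;

		/* Assumption: returns a locked destination page, or NULL */
		dpage = my_dev_alloc_page(private);
		if (!dpage)
			continue;	/* dst[i] stays 0, entry is skipped */

		/* Assumption: device specific copy of one page of data */
		my_dev_copy_page(dpage, spage, private);

		dst[i] = migrate_pfn(page_to_pfn(dpage)) |
			 MIGRATE_PFN_VALID | MIGRATE_PFN_LOCKED;
	}
}

static void my_finalize_and_map(struct vm_area_struct *vma,
				const unsigned long *src,
				const unsigned long *dst,
				unsigned long start, unsigned long end,
				void *private)
{
	/*
	 * Entries that still have MIGRATE_PFN_MIGRATE set in src were
	 * migrated; this is where a driver would point its device page
	 * table at the corresponding dst pages (device specific, omitted).
	 */
}

static const struct migrate_vma_ops my_migrate_ops = {
	.alloc_and_copy		= my_alloc_and_copy,
	.finalize_and_map	= my_finalize_and_map,
};

/* Caller is assumed to hold mmap_sem for read and to have found @vma. */
static int my_migrate_range(struct vm_area_struct *vma,
			    unsigned long start, unsigned long end,
			    void *private)
{
	unsigned long npages = (end - start) >> PAGE_SHIFT;
	unsigned long *src, *dst;
	int ret = -ENOMEM;

	src = kcalloc(npages, sizeof(*src), GFP_KERNEL);
	dst = kcalloc(npages, sizeof(*dst), GFP_KERNEL);
	if (src && dst)
		ret = migrate_vma(&my_migrate_ops, vma, start, end,
				  src, dst, private);
	kfree(dst);
	kfree(src);
	return ret;
}

Note that the destination pages are handed over locked (migrate_vma_finalize()
unlocks them), and a zero dst entry simply means "do not migrate this one";
the corresponding CPU page table entry is then restored to the original page.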
-- 
2.9.4


* [HMM 13/15] mm/migrate: migrate_vma() unmap page from vma while collecting pages
  2017-05-24 17:20 [HMM 00/15] HMM (Heterogeneous Memory Management) v23 Jérôme Glisse
                   ` (11 preceding siblings ...)
  2017-05-24 17:20 ` [HMM 12/15] mm/migrate: new memory migration helper for use with device memory v4 Jérôme Glisse
@ 2017-05-24 17:20 ` Jérôme Glisse
  2017-05-24 17:20 ` [HMM 14/15] mm/migrate: support un-addressable ZONE_DEVICE page in migration v2 Jérôme Glisse
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 66+ messages in thread
From: Jérôme Glisse @ 2017-05-24 17:20 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Dan Williams, Kirill A . Shutemov, John Hubbard,
	Jérôme Glisse, Evgeny Baskakov, Mark Hairgrove,
	Sherry Cheung, Subhash Gutti

The common case for migration of a virtual address range is that pages are
mapped only once, inside the vma in which the migration is taking place.
Because we already walk the CPU page table for that range, we can do the
unmap directly there and set up the special migration swap entry.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 mm/migrate.c | 114 ++++++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 98 insertions(+), 16 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 12063f3..1f2bc61 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2119,7 +2119,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 {
 	struct migrate_vma *migrate = walk->private;
 	struct mm_struct *mm = walk->vma->vm_mm;
-	unsigned long addr = start;
+	unsigned long addr = start, unmapped = 0;
 	spinlock_t *ptl;
 	pte_t *ptep;
 
@@ -2129,9 +2129,12 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 	}
 
 	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+	arch_enter_lazy_mmu_mode();
+
 	for (; addr < end; addr += PAGE_SIZE, ptep++) {
 		unsigned long mpfn, pfn;
 		struct page *page;
+		swp_entry_t entry;
 		pte_t pte;
 
 		pte = *ptep;
@@ -2163,11 +2166,44 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 		mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
 		mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
 
+		/*
+		 * Optimize for the common case where page is only mapped once
+		 * in one process. If we can lock the page, then we can safely
+		 * set up a special migration page table entry now.
+		 */
+		if (trylock_page(page)) {
+			pte_t swp_pte;
+
+			mpfn |= MIGRATE_PFN_LOCKED;
+			ptep_get_and_clear(mm, addr, ptep);
+
+			/* Setup special migration page table entry */
+			entry = make_migration_entry(page, pte_write(pte));
+			swp_pte = swp_entry_to_pte(entry);
+			if (pte_soft_dirty(pte))
+				swp_pte = pte_swp_mksoft_dirty(swp_pte);
+			set_pte_at(mm, addr, ptep, swp_pte);
+
+			/*
+			 * This is like regular unmap: we remove the rmap and
+			 * drop page refcount. Page won't be freed, as we took
+			 * a reference just above.
+			 */
+			page_remove_rmap(page, false);
+			put_page(page);
+			unmapped++;
+		}
+
 next:
 		migrate->src[migrate->npages++] = mpfn;
 	}
+	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(ptep - 1, ptl);
 
+	/* Only flush the TLB if we actually modified any entries */
+	if (unmapped)
+		flush_tlb_range(walk->vma, start, end);
+
 	return 0;
 }
 
@@ -2192,7 +2228,13 @@ static void migrate_vma_collect(struct migrate_vma *migrate)
 	mm_walk.mm = migrate->vma->vm_mm;
 	mm_walk.private = migrate;
 
+	mmu_notifier_invalidate_range_start(mm_walk.mm,
+					    migrate->start,
+					    migrate->end);
 	walk_page_range(migrate->start, migrate->end, &mm_walk);
+	mmu_notifier_invalidate_range_end(mm_walk.mm,
+					  migrate->start,
+					  migrate->end);
 
 	migrate->end = migrate->start + (migrate->npages << PAGE_SHIFT);
 }
@@ -2248,12 +2290,16 @@ static void migrate_vma_prepare(struct migrate_vma *migrate)
 
 	for (i = 0; i < npages; i++) {
 		struct page *page = migrate_pfn_to_page(migrate->src[i]);
+		bool remap = true;
 
 		if (!page)
 			continue;
 
-		lock_page(page);
-		migrate->src[i] |= MIGRATE_PFN_LOCKED;
+		if (!(migrate->src[i] & MIGRATE_PFN_LOCKED)) {
+			remap = false;
+			lock_page(page);
+			migrate->src[i] |= MIGRATE_PFN_LOCKED;
+		}
 
 		if (!PageLRU(page) && allow_drain) {
 			/* Drain CPU's pagevec */
@@ -2262,21 +2308,50 @@ static void migrate_vma_prepare(struct migrate_vma *migrate)
 		}
 
 		if (isolate_lru_page(page)) {
-			migrate->src[i] = 0;
-			unlock_page(page);
-			migrate->cpages--;
-			put_page(page);
+			if (remap) {
+				migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+				migrate->cpages--;
+				restore++;
+			} else {
+				migrate->src[i] = 0;
+				unlock_page(page);
+				migrate->cpages--;
+				put_page(page);
+			}
 			continue;
 		}
 
 		if (!migrate_vma_check_page(page)) {
-			migrate->src[i] = 0;
-			unlock_page(page);
-			migrate->cpages--;
+			if (remap) {
+				migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+				migrate->cpages--;
+				restore++;
 
-			putback_lru_page(page);
+				get_page(page);
+				putback_lru_page(page);
+			} else {
+				migrate->src[i] = 0;
+				unlock_page(page);
+				migrate->cpages--;
+
+				putback_lru_page(page);
+			}
 		}
 	}
+
+	for (i = 0, addr = start; i < npages && restore; i++, addr += PAGE_SIZE) {
+		struct page *page = migrate_pfn_to_page(migrate->src[i]);
+
+		if (!page || (migrate->src[i] & MIGRATE_PFN_MIGRATE))
+			continue;
+
+		remove_migration_pte(page, migrate->vma, addr, page);
+
+		migrate->src[i] = 0;
+		unlock_page(page);
+		put_page(page);
+		restore--;
+	}
 }
 
 /*
@@ -2303,12 +2378,19 @@ static void migrate_vma_unmap(struct migrate_vma *migrate)
 		if (!page || !(migrate->src[i] & MIGRATE_PFN_MIGRATE))
 			continue;
 
-		try_to_unmap(page, flags);
-		if (page_mapped(page) || !migrate_vma_check_page(page)) {
-			migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
-			migrate->cpages--;
-			restore++;
+		if (page_mapped(page)) {
+			try_to_unmap(page, flags);
+			if (page_mapped(page))
+				goto restore;
 		}
+
+		if (migrate_vma_check_page(page))
+			continue;
+
+restore:
+		migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+		migrate->cpages--;
+		restore++;
 	}
 
 	for (addr = start, i = 0; i < npages && restore; addr += PAGE_SIZE, i++) {
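
One thing worth spelling out about the early unmap above: a thread that
faults on one of these addresses while the migration is in flight is not
harmed by the special migration entry. The existing fault path already knows
how to wait on it, roughly as in the hypothetical helper below (the helper
name is made up for illustration; the real logic lives in do_swap_page()):

/* Hypothetical condensation of the relevant do_swap_page() logic. */
static void my_wait_on_migration_pte(struct vm_fault *vmf)
{
	swp_entry_t entry = pte_to_swp_entry(vmf->orig_pte);

	if (non_swap_entry(entry) && is_migration_entry(entry))
		/* Sleep until the migration PTE is removed, then retry */
		migration_entry_wait(vmf->vma->vm_mm, vmf->pmd, vmf->address);
}

Hence the trylock_page() fast path can install the migration entry while
collecting, before try_to_unmap() ever runs, without opening a window in
which a concurrent CPU fault could observe a half-migrated page.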
-- 
2.9.4


* [HMM 14/15] mm/migrate: support un-addressable ZONE_DEVICE page in migration v2
  2017-05-24 17:20 [HMM 00/15] HMM (Heterogeneous Memory Management) v23 Jérôme Glisse
                   ` (12 preceding siblings ...)
  2017-05-24 17:20 ` [HMM 13/15] mm/migrate: migrate_vma() unmap page from vma while collecting pages Jérôme Glisse
@ 2017-05-24 17:20 ` Jérôme Glisse
  2017-05-31  4:09   ` Balbir Singh
  2017-05-24 17:20 ` [HMM 15/15] mm/migrate: allow migrate_vma() to alloc new page on empty entry v2 Jérôme Glisse
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 66+ messages in thread
From: Jérôme Glisse @ 2017-05-24 17:20 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Dan Williams, Kirill A . Shutemov, John Hubbard, Jérôme Glisse

Allow unmapping and restoring the special swap entries used for
un-addressable ZONE_DEVICE memory.

Changed since v1:
  - s/device unaddressable/device private/

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/migrate.h |  10 +++-
 mm/migrate.c            | 134 ++++++++++++++++++++++++++++++++++++++----------
 mm/page_vma_mapped.c    |  10 ++++
 mm/rmap.c               |  25 +++++++++
 4 files changed, 150 insertions(+), 29 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 576b3f5..7dd875a 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -130,12 +130,18 @@ static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 
 #ifdef CONFIG_MIGRATION
 
+/*
+ * Watch out for 32-bit PAE architectures, where an unsigned long might not
+ * have enough bits to store both the physical page number and all the flags.
+ * So far we have enough room for all our flags.
+ */
 #define MIGRATE_PFN_VALID	(1UL << 0)
 #define MIGRATE_PFN_MIGRATE	(1UL << 1)
 #define MIGRATE_PFN_LOCKED	(1UL << 2)
 #define MIGRATE_PFN_WRITE	(1UL << 3)
-#define MIGRATE_PFN_ERROR	(1UL << 4)
-#define MIGRATE_PFN_SHIFT	5
+#define MIGRATE_PFN_DEVICE	(1UL << 4)
+#define MIGRATE_PFN_ERROR	(1UL << 5)
+#define MIGRATE_PFN_SHIFT	6
 
 static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
 {
diff --git a/mm/migrate.c b/mm/migrate.c
index 1f2bc61..9e68399 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -36,6 +36,7 @@
 #include <linux/hugetlb.h>
 #include <linux/hugetlb_cgroup.h>
 #include <linux/gfp.h>
+#include <linux/memremap.h>
 #include <linux/balloon_compaction.h>
 #include <linux/mmu_notifier.h>
 #include <linux/page_idle.h>
@@ -227,7 +228,15 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
 		if (is_write_migration_entry(entry))
 			pte = maybe_mkwrite(pte, vma);
 
-		flush_dcache_page(new);
+		if (unlikely(is_zone_device_page(new)) &&
+		    is_device_private_page(new)) {
+			entry = make_device_private_entry(new, pte_write(pte));
+			pte = swp_entry_to_pte(entry);
+			if (pte_swp_soft_dirty(*pvmw.pte))
+				pte = pte_mksoft_dirty(pte);
+		} else
+			flush_dcache_page(new);
+
 #ifdef CONFIG_HUGETLB_PAGE
 		if (PageHuge(new)) {
 			pte = pte_mkhuge(pte);
@@ -2140,17 +2149,40 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 		pte = *ptep;
 		pfn = pte_pfn(pte);
 
-		if (!pte_present(pte)) {
+		if (pte_none(pte)) {
 			mpfn = pfn = 0;
 			goto next;
 		}
 
+		if (!pte_present(pte)) {
+			mpfn = pfn = 0;
+
+			/*
+			 * Only care about unaddressable device page special
+			 * page table entry. Other special swap entries are not
+			 * migratable, and we ignore regular swapped page.
+			 */
+			entry = pte_to_swp_entry(pte);
+			if (!is_device_private_entry(entry))
+				goto next;
+
+			page = device_private_entry_to_page(entry);
+			mpfn = migrate_pfn(page_to_pfn(page))|
+				MIGRATE_PFN_DEVICE | MIGRATE_PFN_MIGRATE;
+			if (is_write_device_private_entry(entry))
+				mpfn |= MIGRATE_PFN_WRITE;
+		} else {
+			page = vm_normal_page(migrate->vma, addr, pte);
+			mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
+			mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
+		}
+
 		/* FIXME support THP */
-		page = vm_normal_page(migrate->vma, addr, pte);
 		if (!page || !page->mapping || PageTransCompound(page)) {
 			mpfn = pfn = 0;
 			goto next;
 		}
+		pfn = page_to_pfn(page);
 
 		/*
 		 * By getting a reference on the page we pin it and that blocks
@@ -2163,8 +2195,6 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 		 */
 		get_page(page);
 		migrate->cpages++;
-		mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
-		mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
 
 		/*
 		 * Optimize for the common case where page is only mapped once
@@ -2195,6 +2225,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 		}
 
 next:
+		migrate->dst[migrate->npages] = 0;
 		migrate->src[migrate->npages++] = mpfn;
 	}
 	arch_leave_lazy_mmu_mode();
@@ -2264,6 +2295,15 @@ static bool migrate_vma_check_page(struct page *page)
 	if (PageCompound(page))
 		return false;
 
+	/* Pages from ZONE_DEVICE have one extra reference */
+	if (is_zone_device_page(page)) {
+		if (is_device_private_page(page)) {
+			extra++;
+		} else
+			/* Other ZONE_DEVICE memory types are not supported */
+			return false;
+	}
+
 	if ((page_count(page) - extra) > page_mapcount(page))
 		return false;
 
@@ -2301,24 +2341,30 @@ static void migrate_vma_prepare(struct migrate_vma *migrate)
 			migrate->src[i] |= MIGRATE_PFN_LOCKED;
 		}
 
-		if (!PageLRU(page) && allow_drain) {
-			/* Drain CPU's pagevec */
-			lru_add_drain_all();
-			allow_drain = false;
-		}
+		/* ZONE_DEVICE pages are not on LRU */
+		if (!is_zone_device_page(page)) {
+			if (!PageLRU(page) && allow_drain) {
+				/* Drain CPU's pagevec */
+				lru_add_drain_all();
+				allow_drain = false;
+			}
 
-		if (isolate_lru_page(page)) {
-			if (remap) {
-				migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
-				migrate->cpages--;
-				restore++;
-			} else {
-				migrate->src[i] = 0;
-				unlock_page(page);
-				migrate->cpages--;
-				put_page(page);
+			if (isolate_lru_page(page)) {
+				if (remap) {
+					migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+					migrate->cpages--;
+					restore++;
+				} else {
+					migrate->src[i] = 0;
+					unlock_page(page);
+					migrate->cpages--;
+					put_page(page);
+				}
+				continue;
 			}
-			continue;
+
+			/* Drop the reference we took in collect */
+			put_page(page);
 		}
 
 		if (!migrate_vma_check_page(page)) {
@@ -2327,14 +2373,19 @@ static void migrate_vma_prepare(struct migrate_vma *migrate)
 				migrate->cpages--;
 				restore++;
 
-				get_page(page);
-				putback_lru_page(page);
+				if (!is_zone_device_page(page)) {
+					get_page(page);
+					putback_lru_page(page);
+				}
 			} else {
 				migrate->src[i] = 0;
 				unlock_page(page);
 				migrate->cpages--;
 
-				putback_lru_page(page);
+				if (!is_zone_device_page(page))
+					putback_lru_page(page);
+				else
+					put_page(page);
 			}
 		}
 	}
@@ -2405,7 +2456,10 @@ static void migrate_vma_unmap(struct migrate_vma *migrate)
 		unlock_page(page);
 		restore--;
 
-		putback_lru_page(page);
+		if (is_zone_device_page(page))
+			put_page(page);
+		else
+			putback_lru_page(page);
 	}
 }
 
@@ -2436,6 +2490,26 @@ static void migrate_vma_pages(struct migrate_vma *migrate)
 
 		mapping = page_mapping(page);
 
+		if (is_zone_device_page(newpage)) {
+			if (is_device_private_page(newpage)) {
+				/*
+				 * For now only support private anonymous when
+				 * migrating to un-addressable device memory.
+				 */
+				if (mapping) {
+					migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+					continue;
+				}
+			} else {
+				/*
+				 * Other types of ZONE_DEVICE page are not
+				 * supported.
+				 */
+				migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+				continue;
+			}
+		}
+
 		r = migrate_page(mapping, newpage, page, MIGRATE_SYNC_NO_COPY);
 		if (r != MIGRATEPAGE_SUCCESS)
 			migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
@@ -2476,11 +2550,17 @@ static void migrate_vma_finalize(struct migrate_vma *migrate)
 		unlock_page(page);
 		migrate->cpages--;
 
-		putback_lru_page(page);
+		if (is_zone_device_page(page))
+			put_page(page);
+		else
+			putback_lru_page(page);
 
 		if (newpage != page) {
 			unlock_page(newpage);
-			putback_lru_page(newpage);
+			if (is_zone_device_page(newpage))
+				put_page(newpage);
+			else
+				putback_lru_page(newpage);
 		}
 	}
 }
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index de9c40d..f95765c 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -48,6 +48,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
 		if (!is_swap_pte(*pvmw->pte))
 			return false;
 		entry = pte_to_swp_entry(*pvmw->pte);
+
 		if (!is_migration_entry(entry))
 			return false;
 		if (migration_entry_to_page(entry) - pvmw->page >=
@@ -60,6 +61,15 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
 		WARN_ON_ONCE(1);
 #endif
 	} else {
+		if (is_swap_pte(*pvmw->pte)) {
+			swp_entry_t entry;
+
+			entry = pte_to_swp_entry(*pvmw->pte);
+			if (is_device_private_entry(entry) &&
+			    device_private_entry_to_page(entry) == pvmw->page)
+				return true;
+		}
+
 		if (!pte_present(*pvmw->pte))
 			return false;
 
diff --git a/mm/rmap.c b/mm/rmap.c
index d405f0e..515cea6 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -63,6 +63,7 @@
 #include <linux/hugetlb.h>
 #include <linux/backing-dev.h>
 #include <linux/page_idle.h>
+#include <linux/memremap.h>
 
 #include <asm/tlbflush.h>
 
@@ -1308,6 +1309,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
 		return true;
 
+	if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&
+	    is_zone_device_page(page) && !is_device_private_page(page))
+		return true;
+
 	if (flags & TTU_SPLIT_HUGE_PMD) {
 		split_huge_pmd_address(vma, address,
 				flags & TTU_MIGRATION, page);
@@ -1343,6 +1348,26 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
 		address = pvmw.address;
 
+		if (IS_ENABLED(CONFIG_MIGRATION) &&
+		    (flags & TTU_MIGRATION) &&
+		    is_zone_device_page(page)) {
+			swp_entry_t entry;
+			pte_t swp_pte;
+
+			pteval = ptep_get_and_clear(mm, address, pvmw.pte);
+
+			/*
+			 * Store the pfn of the page in a special migration
+			 * pte. do_swap_page() will wait until the migration
+			 * pte is removed and then restart fault handling.
+			 */
+			entry = make_migration_entry(page, 0);
+			swp_pte = swp_entry_to_pte(entry);
+			if (pte_soft_dirty(pteval))
+				swp_pte = pte_swp_mksoft_dirty(swp_pte);
+			set_pte_at(mm, address, pvmw.pte, swp_pte);
+			goto discard;
+		}
 
 		if (!(flags & TTU_IGNORE_ACCESS)) {
 			if (ptep_clear_flush_young_notify(vma, address,
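
As a back-of-the-envelope check of the PAE comment added to
include/linux/migrate.h above (the macro names below are made up, only the
arithmetic matters): with MIGRATE_PFN_SHIFT now 6, a 32-bit unsigned long
still leaves 26 bits for the pfn, which with 4KB pages covers physical
addresses up to 2^38 bytes (256GB), comfortably above the 64GB (2^36) that
classic x86 PAE can address.

/* Illustrative only -- these macros are not part of the patch. */
#define MIGRATE_PFN_BITS	(BITS_PER_LONG - MIGRATE_PFN_SHIFT)
#define MIGRATE_PFN_MAX_PHYS	(1ULL << (MIGRATE_PFN_BITS + PAGE_SHIFT))
/* 32-bit: (32 - 6) + 12 = 38 -> 256GB; 64-bit: effectively unlimited. */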
-- 
2.9.4


* [HMM 15/15] mm/migrate: allow migrate_vma() to alloc new page on empty entry v2
  2017-05-24 17:20 [HMM 00/15] HMM (Heterogeneous Memory Management) v23 Jérôme Glisse
                   ` (13 preceding siblings ...)
  2017-05-24 17:20 ` [HMM 14/15] mm/migrate: support un-addressable ZONE_DEVICE page in migration v2 Jérôme Glisse
@ 2017-05-24 17:20 ` Jérôme Glisse
  2017-06-16  7:22 ` [HMM 00/15] HMM (Heterogeneous Memory Management) v23 Bridgman, John
  2017-06-23 15:00 ` Bob Liu
  16 siblings, 0 replies; 66+ messages in thread
From: Jérôme Glisse @ 2017-05-24 17:20 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Dan Williams, Kirill A . Shutemov, John Hubbard, Jérôme Glisse

This allows the caller of migrate_vma() to allocate a new page for an empty
CPU page table entry. It only supports anonymous memory, and no new page is
instantiated if userfaultfd is armed.

This is useful for device drivers that want to migrate a range of virtual
addresses and would rather allocate new memory up front than fault later on.

Changed since v1:
  - 5 level page table fix

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 mm/migrate.c | 135 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 131 insertions(+), 4 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 9e68399..d7c4db6 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -37,6 +37,7 @@
 #include <linux/hugetlb_cgroup.h>
 #include <linux/gfp.h>
 #include <linux/memremap.h>
+#include <linux/userfaultfd_k.h>
 #include <linux/balloon_compaction.h>
 #include <linux/mmu_notifier.h>
 #include <linux/page_idle.h>
@@ -2111,9 +2112,10 @@ static int migrate_vma_collect_hole(unsigned long start,
 				    struct mm_walk *walk)
 {
 	struct migrate_vma *migrate = walk->private;
-	unsigned long addr, next;
+	unsigned long addr;
 
 	for (addr = start & PAGE_MASK; addr < end; addr += PAGE_SIZE) {
+		migrate->cpages++;
 		migrate->dst[migrate->npages] = 0;
 		migrate->src[migrate->npages++] = 0;
 	}
@@ -2150,6 +2152,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 		pfn = pte_pfn(pte);
 
 		if (pte_none(pte)) {
+			migrate->cpages++;
 			mpfn = pfn = 0;
 			goto next;
 		}
@@ -2463,6 +2466,118 @@ static void migrate_vma_unmap(struct migrate_vma *migrate)
 	}
 }
 
+static void migrate_vma_insert_page(struct migrate_vma *migrate,
+				    unsigned long addr,
+				    struct page *page,
+				    unsigned long *src,
+				    unsigned long *dst)
+{
+	struct vm_area_struct *vma = migrate->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	struct mem_cgroup *memcg;
+	spinlock_t *ptl;
+	pgd_t *pgdp;
+	p4d_t *p4dp;
+	pud_t *pudp;
+	pmd_t *pmdp;
+	pte_t *ptep;
+	pte_t entry;
+
+	/* Only allow populating anonymous memory */
+	if (!vma_is_anonymous(vma))
+		goto abort;
+
+	pgdp = pgd_offset(mm, addr);
+	p4dp = p4d_alloc(mm, pgdp, addr);
+	if (!p4dp)
+		goto abort;
+	pudp = pud_alloc(mm, p4dp, addr);
+	if (!pudp)
+		goto abort;
+	pmdp = pmd_alloc(mm, pudp, addr);
+	if (!pmdp)
+		goto abort;
+
+	if (pmd_trans_unstable(pmdp) || pmd_devmap(*pmdp))
+		goto abort;
+
+	/*
+	 * Use pte_alloc() instead of pte_alloc_map().  We can't run
+	 * pte_offset_map() on pmds where a huge pmd might be created
+	 * from a different thread.
+	 *
+	 * pte_alloc_map() is safe to use under down_write(mmap_sem) or when
+	 * parallel threads are excluded by other means.
+	 *
+	 * Here we only have down_read(mmap_sem).
+	 */
+	if (pte_alloc(mm, pmdp, addr))
+		goto abort;
+
+	/* See the comment in pte_alloc_one_map() */
+	if (unlikely(pmd_trans_unstable(pmdp)))
+		goto abort;
+
+	if (unlikely(anon_vma_prepare(vma)))
+		goto abort;
+	if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL, &memcg, false))
+		goto abort;
+
+	/*
+	 * The memory barrier inside __SetPageUptodate makes sure that
+	 * preceding stores to the page contents become visible before
+	 * the set_pte_at() write.
+	 */
+	__SetPageUptodate(page);
+
+	if (is_zone_device_page(page) && is_device_private_page(page)) {
+		swp_entry_t swp_entry;
+
+		swp_entry = make_device_private_entry(page, vma->vm_flags & VM_WRITE);
+		entry = swp_entry_to_pte(swp_entry);
+	} else {
+		entry = mk_pte(page, vma->vm_page_prot);
+		if (vma->vm_flags & VM_WRITE)
+			entry = pte_mkwrite(pte_mkdirty(entry));
+	}
+
+	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+	if (!pte_none(*ptep)) {
+		pte_unmap_unlock(ptep, ptl);
+		mem_cgroup_cancel_charge(page, memcg, false);
+		goto abort;
+	}
+
+	/*
+	 * Check for userfaultfd but do not deliver the fault. Instead,
+	 * just back off.
+	 */
+	if (userfaultfd_missing(vma)) {
+		pte_unmap_unlock(ptep, ptl);
+		mem_cgroup_cancel_charge(page, memcg, false);
+		goto abort;
+	}
+
+	inc_mm_counter(mm, MM_ANONPAGES);
+	page_add_new_anon_rmap(page, vma, addr, false);
+	mem_cgroup_commit_charge(page, memcg, false, false);
+	if (!is_zone_device_page(page))
+		lru_cache_add_active_or_unevictable(page, vma);
+	set_pte_at(mm, addr, ptep, entry);
+
+	/* Take a reference on the page */
+	get_page(page);
+
+	/* No need to invalidate - it was non-present before */
+	update_mmu_cache(vma, addr, ptep);
+	pte_unmap_unlock(ptep, ptl);
+	*src = MIGRATE_PFN_MIGRATE;
+	return;
+
+abort:
+	*src &= ~MIGRATE_PFN_MIGRATE;
+}
+
 /*
  * migrate_vma_pages() - migrate meta-data from src page to dst page
  * @migrate: migrate struct containing all migration information
@@ -2483,10 +2598,16 @@ static void migrate_vma_pages(struct migrate_vma *migrate)
 		struct address_space *mapping;
 		int r;
 
-		if (!page || !newpage)
+		if (!newpage) {
+			migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
 			continue;
-		if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE))
+		} else if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE)) {
+			if (!page)
+				migrate_vma_insert_page(migrate, addr, newpage,
+							&migrate->src[i],
+							&migrate->dst[i]);
 			continue;
+		}
 
 		mapping = page_mapping(page);
 
@@ -2536,8 +2657,14 @@ static void migrate_vma_finalize(struct migrate_vma *migrate)
 		struct page *newpage = migrate_pfn_to_page(migrate->dst[i]);
 		struct page *page = migrate_pfn_to_page(migrate->src[i]);
 
-		if (!page)
+		if (!page) {
+			if (newpage) {
+				unlock_page(newpage);
+				put_page(newpage);
+			}
 			continue;
+		}
+
 		if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE) || !newpage) {
 			if (newpage) {
 				unlock_page(newpage);
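
Continuing the hypothetical driver sketch from patch 12/15: the visible
change for the alloc_and_copy() callback is that empty CPU page table
entries are now counted, showing up with src[i] == 0 (neither
MIGRATE_PFN_VALID nor MIGRATE_PFN_MIGRATE set), and a destination page
supplied for them in dst[] is actually used -- migrate_vma_pages() feeds it
to the new migrate_vma_insert_page(). A sketch, with
my_dev_alloc_zeroed_page() again an assumption:

/* Hypothetical helper, called from my_alloc_and_copy()'s per-page loop
 * whenever !(src[i] & MIGRATE_PFN_VALID), i.e. for an empty entry. */
static void my_fill_hole(unsigned long *dst_entry, void *private)
{
	/* Assumption: returns a locked, zeroed device page, or NULL */
	struct page *dpage = my_dev_alloc_zeroed_page(private);

	if (!dpage)
		return;		/* dst entry stays 0, hole is left alone */

	*dst_entry = migrate_pfn(page_to_pfn(dpage)) |
		     MIGRATE_PFN_VALID | MIGRATE_PFN_LOCKED;
}

The driver does not need to worry about racing faults or userfaultfd here:
migrate_vma_insert_page() re-checks pte_none() under the page table lock and
backs off (clearing MIGRATE_PFN_MIGRATE) if the entry has been populated in
the meantime, if the vma is not anonymous, or if userfaultfd is armed.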
-- 
2.9.4


* Re: [HMM 07/15] mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory v3
  2017-05-24 17:20 ` [HMM 07/15] mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory v3 Jérôme Glisse
@ 2017-05-30 16:43   ` Ross Zwisler
  2017-05-30 21:43     ` Jerome Glisse
  2017-05-31  1:23   ` Balbir Singh
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 66+ messages in thread
From: Ross Zwisler @ 2017-05-30 16:43 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: akpm, linux-kernel, linux-mm, Dan Williams, Kirill A . Shutemov,
	John Hubbard, Ross Zwisler

On Wed, May 24, 2017 at 01:20:16PM -0400, Jerome Glisse wrote:
> HMM (heterogeneous memory management) need struct page to support migration
> from system main memory to device memory.  Reasons for HMM and migration to
> device memory is explained with HMM core patch.
> 
> This patch deals with device memory that is un-addressable memory (ie CPU
> can not access it). Hence we do not want those struct page to be manage
> like regular memory. That is why we extend ZONE_DEVICE to support different
> types of memory.
> 
> A persistent memory type is define for existing user of ZONE_DEVICE and a
> new device un-addressable type is added for the un-addressable memory type.
> There is a clear separation between what is expected from each memory type
> and existing user of ZONE_DEVICE are un-affected by new requirement and new
> use of the un-addressable type. All specific code path are protect with
> test against the memory type.
> 
> Because memory is un-addressable we use a new special swap type for when
> a page is migrated to device memory (this reduces the number of maximum
> swap file).
> 
> The main two additions beside memory type to ZONE_DEVICE is two callbacks.
> First one, page_free() is call whenever page refcount reach 1 (which means
> the page is free as ZONE_DEVICE page never reach a refcount of 0). This
> allow device driver to manage its memory and associated struct page.
> 
> The second callback page_fault() happens when there is a CPU access to
> an address that is back by a device page (which are un-addressable by the
> CPU). This callback is responsible to migrate the page back to system
> main memory. Device driver can not block migration back to system memory,
> HMM make sure that such page can not be pin into device memory.
> 
> If device is in some error condition and can not migrate memory back then
> a CPU page fault to device memory should end with SIGBUS.
> 
> Changed since v2:
>   - s/DEVICE_UNADDRESSABLE/DEVICE_PRIVATE
> Changed since v1:
>   - rename to device private memory (from device unaddressable)
> 
> Signed-off-by: Jerome Glisse <jglisse@redhat.com>
> Acked-by: Dan Williams <dan.j.williams@intel.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
<>
> @@ -35,18 +37,88 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
>  }
>  #endif
>  
> +/*
> + * Specialize ZONE_DEVICE memory into multiple types each having differents
> + * usage.
> + *
> + * MEMORY_DEVICE_PUBLIC:
> + * Persistent device memory (pmem): struct page might be allocated in different
> + * memory and architecture might want to perform special actions. It is similar
> + * to regular memory, in that the CPU can access it transparently. However,
> + * it is likely to have different bandwidth and latency than regular memory.
> + * See Documentation/nvdimm/nvdimm.txt for more information.
> + *
> + * MEMORY_DEVICE_PRIVATE:
> + * Device memory that is not directly addressable by the CPU: CPU can neither
> + * read nor write _UNADDRESSABLE memory. In this case, we do still have struct
		     _PRIVATE

Just noticed that one holdover from the DEVICE_UNADDRESSABLE naming.

> + * pages backing the device memory. Doing so simplifies the implementation, but
> + * it is important to remember that there are certain points at which the struct
> + * page must be treated as an opaque object, rather than a "normal" struct page.
> + * A more complete discussion of unaddressable memory may be found in
> + * include/linux/hmm.h and Documentation/vm/hmm.txt.
> + */
> +enum memory_type {
> +	MEMORY_DEVICE_PUBLIC = 0,
> +	MEMORY_DEVICE_PRIVATE,
> +};


* Re: [HMM 07/15] mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory v3
  2017-05-30 16:43   ` Ross Zwisler
@ 2017-05-30 21:43     ` Jerome Glisse
  0 siblings, 0 replies; 66+ messages in thread
From: Jerome Glisse @ 2017-05-30 21:43 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: akpm, linux-kernel, linux-mm, Dan Williams, Kirill A . Shutemov,
	John Hubbard

On Tue, May 30, 2017 at 10:43:55AM -0600, Ross Zwisler wrote:
> On Wed, May 24, 2017 at 01:20:16PM -0400, Jerome Glisse wrote:
> > HMM (heterogeneous memory management) need struct page to support migration
> > from system main memory to device memory.  Reasons for HMM and migration to
> > device memory is explained with HMM core patch.
> > 
> > This patch deals with device memory that is un-addressable memory (ie CPU
> > can not access it). Hence we do not want those struct page to be manage
> > like regular memory. That is why we extend ZONE_DEVICE to support different
> > types of memory.
> > 
> > A persistent memory type is define for existing user of ZONE_DEVICE and a
> > new device un-addressable type is added for the un-addressable memory type.
> > There is a clear separation between what is expected from each memory type
> > and existing user of ZONE_DEVICE are un-affected by new requirement and new
> > use of the un-addressable type. All specific code path are protect with
> > test against the memory type.
> > 
> > Because memory is un-addressable we use a new special swap type for when
> > a page is migrated to device memory (this reduces the number of maximum
> > swap file).
> > 
> > The main two additions beside memory type to ZONE_DEVICE is two callbacks.
> > First one, page_free() is call whenever page refcount reach 1 (which means
> > the page is free as ZONE_DEVICE page never reach a refcount of 0). This
> > allow device driver to manage its memory and associated struct page.
> > 
> > The second callback page_fault() happens when there is a CPU access to
> > an address that is back by a device page (which are un-addressable by the
> > CPU). This callback is responsible to migrate the page back to system
> > main memory. Device driver can not block migration back to system memory,
> > HMM make sure that such page can not be pin into device memory.
> > 
> > If device is in some error condition and can not migrate memory back then
> > a CPU page fault to device memory should end with SIGBUS.
> > 
> > Changed since v2:
> >   - s/DEVICE_UNADDRESSABLE/DEVICE_PRIVATE
> > Changed since v1:
> >   - rename to device private memory (from device unaddressable)
> > 
> > Signed-off-by: Jerome Glisse <jglisse@redhat.com>
> > Acked-by: Dan Williams <dan.j.williams@intel.com>
> > Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> > ---
> <>
> > @@ -35,18 +37,88 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
> >  }
> >  #endif
> >  
> > +/*
> > + * Specialize ZONE_DEVICE memory into multiple types each having differents
> > + * usage.
> > + *
> > + * MEMORY_DEVICE_PUBLIC:
> > + * Persistent device memory (pmem): struct page might be allocated in different
> > + * memory and architecture might want to perform special actions. It is similar
> > + * to regular memory, in that the CPU can access it transparently. However,
> > + * it is likely to have different bandwidth and latency than regular memory.
> > + * See Documentation/nvdimm/nvdimm.txt for more information.
> > + *
> > + * MEMORY_DEVICE_PRIVATE:
> > + * Device memory that is not directly addressable by the CPU: CPU can neither
> > + * read nor write _UNADDRESSABLE memory. In this case, we do still have struct
> 		     _PRIVATE
> 
> Just noticed that one holdover from the DEVICE_UNADDRESSABLE naming.
> 

Thanks for catching that. Andrew, can you change it yourself to _PRIVATE:
s/_UNADDRESSABLE/_PRIVATE

Or should I repost a fixed patch?

Cheers,
Jerome


* Re: [HMM 07/15] mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory v3
  2017-05-24 17:20 ` [HMM 07/15] mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory v3 Jérôme Glisse
  2017-05-30 16:43   ` Ross Zwisler
@ 2017-05-31  1:23   ` Balbir Singh
  2017-06-09  3:55   ` John Hubbard
  2017-06-15  3:41   ` zhong jiang
  3 siblings, 0 replies; 66+ messages in thread
From: Balbir Singh @ 2017-05-31  1:23 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: akpm, linux-kernel, linux-mm, Dan Williams, Kirill A . Shutemov,
	John Hubbard, Ross Zwisler

On Wed, 24 May 2017 13:20:16 -0400
Jérôme Glisse <jglisse@redhat.com> wrote:

> HMM (heterogeneous memory management) need struct page to support migration
> from system main memory to device memory.  Reasons for HMM and migration to
> device memory is explained with HMM core patch.
> 
> This patch deals with device memory that is un-addressable memory (ie CPU
> can not access it). Hence we do not want those struct page to be manage
> like regular memory. That is why we extend ZONE_DEVICE to support different
> types of memory.
> 
> A persistent memory type is define for existing user of ZONE_DEVICE and a
> new device un-addressable type is added for the un-addressable memory type.
> There is a clear separation between what is expected from each memory type
> and existing user of ZONE_DEVICE are un-affected by new requirement and new
> use of the un-addressable type. All specific code path are protect with
> test against the memory type.
> 
> Because memory is un-addressable we use a new special swap type for when
> a page is migrated to device memory (this reduces the number of maximum
> swap file).
> 
> The main two additions beside memory type to ZONE_DEVICE is two callbacks.
> First one, page_free() is call whenever page refcount reach 1 (which means
> the page is free as ZONE_DEVICE page never reach a refcount of 0). This
> allow device driver to manage its memory and associated struct page.
> 
> The second callback page_fault() happens when there is a CPU access to
> an address that is back by a device page (which are un-addressable by the
> CPU). This callback is responsible to migrate the page back to system
> main memory. Device driver can not block migration back to system memory,
> HMM make sure that such page can not be pin into device memory.
> 
> If device is in some error condition and can not migrate memory back then
> a CPU page fault to device memory should end with SIGBUS.
> 
> Changed since v2:
>   - s/DEVICE_UNADDRESSABLE/DEVICE_PRIVATE
> Changed since v1:
>   - rename to device private memory (from device unaddressable)
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Acked-by: Dan Williams <dan.j.williams@intel.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  fs/proc/task_mmu.c       |  7 +++++
>  include/linux/ioport.h   |  1 +
>  include/linux/memremap.h | 72 ++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/mm.h       | 12 ++++++++
>  include/linux/swap.h     | 24 ++++++++++++++--
>  include/linux/swapops.h  | 68 +++++++++++++++++++++++++++++++++++++++++++++
>  kernel/memremap.c        | 34 +++++++++++++++++++++++
>  mm/Kconfig               | 13 +++++++++
>  mm/memory.c              | 61 ++++++++++++++++++++++++++++++++++++++++
>  mm/memory_hotplug.c      | 10 +++++--
>  mm/mprotect.c            | 14 ++++++++++
>  11 files changed, 311 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index f0c8b33..90b2fa4 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -542,6 +542,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
>  			}
>  		} else if (is_migration_entry(swpent))
>  			page = migration_entry_to_page(swpent);
> +		else if (is_device_private_entry(swpent))
> +			page = device_private_entry_to_page(swpent);
>  	} else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap
>  							&& pte_none(*pte))) {
>  		page = find_get_entry(vma->vm_file->f_mapping,
> @@ -704,6 +706,8 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
>  
>  		if (is_migration_entry(swpent))
>  			page = migration_entry_to_page(swpent);
> +		else if (is_device_private_entry(swpent))
> +			page = device_private_entry_to_page(swpent);
>  	}
>  	if (page) {
>  		int mapcount = page_mapcount(page);
> @@ -1196,6 +1200,9 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
>  		flags |= PM_SWAP;
>  		if (is_migration_entry(entry))
>  			page = migration_entry_to_page(entry);
> +
> +		if (is_device_private_entry(entry))
> +			page = device_private_entry_to_page(entry);
>  	}
>  
>  	if (page && !PageAnon(page))
> diff --git a/include/linux/ioport.h b/include/linux/ioport.h
> index 6230064..3a4f691 100644
> --- a/include/linux/ioport.h
> +++ b/include/linux/ioport.h
> @@ -130,6 +130,7 @@ enum {
>  	IORES_DESC_ACPI_NV_STORAGE		= 3,
>  	IORES_DESC_PERSISTENT_MEMORY		= 4,
>  	IORES_DESC_PERSISTENT_MEMORY_LEGACY	= 5,
> +	IORES_DESC_DEVICE_PRIVATE_MEMORY	= 6,
>  };
>  
>  /* helpers to define resources */
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 9341619..0fcf840 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -4,6 +4,8 @@
>  #include <linux/ioport.h>
>  #include <linux/percpu-refcount.h>
>  
> +#include <asm/pgtable.h>
> +
>  struct resource;
>  struct device;
>  
> @@ -35,18 +37,88 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
>  }
>  #endif
>  
> +/*
> + * Specialize ZONE_DEVICE memory into multiple types each having differents
> + * usage.
> + *
> + * MEMORY_DEVICE_PUBLIC:
> + * Persistent device memory (pmem): struct page might be allocated in different
> + * memory and architecture might want to perform special actions. It is similar
> + * to regular memory, in that the CPU can access it transparently. However,
> + * it is likely to have different bandwidth and latency than regular memory.
> + * See Documentation/nvdimm/nvdimm.txt for more information.
> + *
> + * MEMORY_DEVICE_PRIVATE:
> + * Device memory that is not directly addressable by the CPU: CPU can neither
> + * read nor write _UNADDRESSABLE memory. In this case, we do still have struct
> + * pages backing the device memory. Doing so simplifies the implementation, but
> + * it is important to remember that there are certain points at which the struct
> + * page must be treated as an opaque object, rather than a "normal" struct page.
> + * A more complete discussion of unaddressable memory may be found in
> + * include/linux/hmm.h and Documentation/vm/hmm.txt.
> + */
> +enum memory_type {
> +	MEMORY_DEVICE_PUBLIC = 0,
> +	MEMORY_DEVICE_PRIVATE,

I guess CDM falls somewhere in between PUBLIC and PRIVATE, I wonder how we'll represent
it?

> +};
> +
> +/*
> + * For MEMORY_DEVICE_PRIVATE we use ZONE_DEVICE and extend it with two
> + * callbacks:
> + *   page_fault()
> + *   page_free()
> + *
> + * Additional notes about MEMORY_DEVICE_PRIVATE may be found in
> + * include/linux/hmm.h and Documentation/vm/hmm.txt. There is also a brief
> + * explanation in include/linux/memory_hotplug.h.
> + *
> + * The page_fault() callback must migrate page back, from device memory to
> + * system memory, so that the CPU can access it. This might fail for various
> + * reasons (device issues,  device have been unplugged, ...). When such error
> + * conditions happen, the page_fault() callback must return VM_FAULT_SIGBUS and
> + * set the CPU page table entry to "poisoned".

I would expect this would become optional with HMM-CDM.

> + *
> + * Note that because memory cgroup charges are transferred to the device memory,

What does transferred mean? I presume you mean since device memory is not visible
and private, it implies that we charge on the system and migrate to the device?

> + * this should never fail due to memory restrictions. However, allocation
> + * of a regular system page might still fail because we are out of memory. If
> + * that happens, the page_fault() callback must return VM_FAULT_OOM.
> + *
> + * The page_fault() callback can also try to migrate back multiple pages in one
> + * chunk, as an optimization. It must, however, prioritize the faulting address
> + * over all the others.
> + *

I think the page_fault() is good to have for CDM, it should not be made mandatory

> + *
> + * The page_free() callback is called once the page refcount reaches 1
> + * (ZONE_DEVICE pages never reach 0 refcount unless there is a refcount bug.
> + * This allows the device driver to implement its own memory management.)
> + */
> +typedef int (*dev_page_fault_t)(struct vm_area_struct *vma,
> +				unsigned long addr,
> +				struct page *page,
> +				unsigned int flags,
> +				pmd_t *pmdp);
> +typedef void (*dev_page_free_t)(struct page *page, void *data);
> +
>  /**
>   * struct dev_pagemap - metadata for ZONE_DEVICE mappings
> + * @page_fault: callback when CPU fault on an unaddressable device page
> + * @page_free: free page callback when page refcount reaches 1
>   * @altmap: pre-allocated/reserved memory for vmemmap allocations
>   * @res: physical address range covered by @ref
>   * @ref: reference count that pins the devm_memremap_pages() mapping
>   * @dev: host device of the mapping for debug
> + * @data: private data pointer for page_free()
> + * @type: memory type: see MEMORY_* in memory_hotplug.h
>   */
>  struct dev_pagemap {
> +	dev_page_fault_t page_fault;
> +	dev_page_free_t page_free;
>  	struct vmem_altmap *altmap;
>  	const struct resource *res;
>  	struct percpu_ref *ref;
>  	struct device *dev;
> +	void *data;
> +	enum memory_type type;
>  };
>  
>  #ifdef CONFIG_ZONE_DEVICE
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 7cb17c6..a825dab 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -788,11 +788,23 @@ static inline bool is_zone_device_page(const struct page *page)
>  {
>  	return page_zonenum(page) == ZONE_DEVICE;
>  }
> +
> +static inline bool is_device_private_page(const struct page *page)
> +{
> +	/* See MEMORY_DEVICE_PRIVATE in include/linux/memory_hotplug.h */
> +	return ((page_zonenum(page) == ZONE_DEVICE) &&
> +		(page->pgmap->type == MEMORY_DEVICE_PRIVATE));
> +}
>  #else
>  static inline bool is_zone_device_page(const struct page *page)
>  {
>  	return false;
>  }
> +
> +static inline bool is_device_private_page(const struct page *page)
> +{
> +	return false;
> +}
>  #endif
>  
>  static inline void get_page(struct page *page)
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 5ab1c98..ab6c20b 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -51,6 +51,23 @@ static inline int current_is_kswapd(void)
>   */
>  
>  /*
> + * Unaddressable device memory support. See include/linux/hmm.h and
> + * Documentation/vm/hmm.txt. Short description is we need struct pages for
> + * device memory that is unaddressable (inaccessible) by CPU, so that we can
> + * migrate part of a process memory to device memory.
> + *
> + * When a page is migrated from CPU to device, we set the CPU page table entry
> + * to a special SWP_DEVICE_* entry.
> + */
> +#ifdef CONFIG_DEVICE_PRIVATE
> +#define SWP_DEVICE_NUM 2
> +#define SWP_DEVICE_WRITE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM)
> +#define SWP_DEVICE_READ (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+1)
> +#else
> +#define SWP_DEVICE_NUM 0
> +#endif
> +
> +/*
>   * NUMA node memory migration support
>   */
>  #ifdef CONFIG_MIGRATION
> @@ -72,7 +89,8 @@ static inline int current_is_kswapd(void)
>  #endif
>  
>  #define MAX_SWAPFILES \
> -	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
> +	((1 << MAX_SWAPFILES_SHIFT) - SWP_DEVICE_NUM - \
> +	SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
>  
>  /*
>   * Magic header for a swap area. The first part of the union is
> @@ -432,8 +450,8 @@ static inline void show_swap_cache_info(void)
>  {
>  }
>  
> -#define free_swap_and_cache(swp)	is_migration_entry(swp)
> -#define swapcache_prepare(swp)		is_migration_entry(swp)
> +#define free_swap_and_cache(e) (is_migration_entry(e) || is_device_private_entry(e))
> +#define swapcache_prepare(e) (is_migration_entry(e) || is_device_private_entry(e))
>  
>  static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
>  {
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 5c3a5f3..361090c 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -100,6 +100,74 @@ static inline void *swp_to_radix_entry(swp_entry_t entry)
>  	return (void *)(value | RADIX_TREE_EXCEPTIONAL_ENTRY);
>  }
>  
> +#if IS_ENABLED(CONFIG_DEVICE_PRIVATE)
> +static inline swp_entry_t make_device_private_entry(struct page *page, bool write)
> +{
> +	return swp_entry(write ? SWP_DEVICE_WRITE : SWP_DEVICE_READ,
> +			 page_to_pfn(page));
> +}
> +
> +static inline bool is_device_private_entry(swp_entry_t entry)
> +{
> +	int type = swp_type(entry);
> +	return type == SWP_DEVICE_READ || type == SWP_DEVICE_WRITE;
> +}
> +
> +static inline void make_device_private_entry_read(swp_entry_t *entry)
> +{
> +	*entry = swp_entry(SWP_DEVICE_READ, swp_offset(*entry));
> +}
> +
> +static inline bool is_write_device_private_entry(swp_entry_t entry)
> +{
> +	return unlikely(swp_type(entry) == SWP_DEVICE_WRITE);
> +}
> +
> +static inline struct page *device_private_entry_to_page(swp_entry_t entry)
> +{
> +	return pfn_to_page(swp_offset(entry));
> +}
> +
> +int device_private_entry_fault(struct vm_area_struct *vma,
> +		       unsigned long addr,
> +		       swp_entry_t entry,
> +		       unsigned int flags,
> +		       pmd_t *pmdp);
> +#else /* CONFIG_DEVICE_PRIVATE */
> +static inline swp_entry_t make_device_private_entry(struct page *page, bool write)
> +{
> +	return swp_entry(0, 0);
> +}
> +
> +static inline void make_device_private_entry_read(swp_entry_t *entry)
> +{
> +}
> +
> +static inline bool is_device_private_entry(swp_entry_t entry)
> +{
> +	return false;
> +}
> +
> +static inline bool is_write_device_private_entry(swp_entry_t entry)
> +{
> +	return false;
> +}
> +
> +static inline struct page *device_private_entry_to_page(swp_entry_t entry)
> +{
> +	return NULL;
> +}
> +
> +static inline int device_private_entry_fault(struct vm_area_struct *vma,
> +				     unsigned long addr,
> +				     swp_entry_t entry,
> +				     unsigned int flags,
> +				     pmd_t *pmdp)
> +{
> +	return VM_FAULT_SIGBUS;
> +}
> +#endif /* CONFIG_DEVICE_PRIVATE */
> +
>  #ifdef CONFIG_MIGRATION
>  static inline swp_entry_t make_migration_entry(struct page *page, int write)
>  {
> diff --git a/kernel/memremap.c b/kernel/memremap.c
> index 124bed7..cd596d4 100644
> --- a/kernel/memremap.c
> +++ b/kernel/memremap.c
> @@ -18,6 +18,8 @@
>  #include <linux/io.h>
>  #include <linux/mm.h>
>  #include <linux/memory_hotplug.h>
> +#include <linux/swap.h>
> +#include <linux/swapops.h>
>  
>  #ifndef ioremap_cache
>  /* temporary while we convert existing ioremap_cache users to memremap */
> @@ -182,6 +184,34 @@ struct page_map {
>  	struct vmem_altmap altmap;
>  };
>  
> +#if IS_ENABLED(CONFIG_DEVICE_PRIVATE)
> +int device_private_entry_fault(struct vm_area_struct *vma,
> +		       unsigned long addr,
> +		       swp_entry_t entry,
> +		       unsigned int flags,
> +		       pmd_t *pmdp)
> +{
> +	struct page *page = device_private_entry_to_page(entry);
> +
> +	/*
> +	 * The page_fault() callback must migrate page back to system memory
> +	 * so that CPU can access it. This might fail for various reasons
> +	 * (device issue, device was unsafely unplugged, ...). When such
> +	 * error conditions happen, the callback must return VM_FAULT_SIGBUS.
> +	 *
> +	 * Note that because memory cgroup charges are accounted to the device
> +	 * memory, this should never fail because of memory restrictions (but
> +	 * allocation of regular system page might still fail because we are
> +	 * out of memory).
> +	 *
> +	 * There is a more in-depth description of what that callback can and
> +	 * cannot do, in include/linux/memremap.h
> +	 */
> +	return page->pgmap->page_fault(vma, addr, page, flags, pmdp);
> +}
> +EXPORT_SYMBOL(device_private_entry_fault);
> +#endif /* CONFIG_DEVICE_PRIVATE */
> +
>  static void pgmap_radix_release(struct resource *res)
>  {
>  	resource_size_t key, align_start, align_size, align_end;
> @@ -321,6 +351,10 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
>  	}
>  	pgmap->ref = ref;
>  	pgmap->res = &page_map->res;
> +	pgmap->type = MEMORY_DEVICE_PUBLIC;
> +	pgmap->page_fault = NULL;
> +	pgmap->page_free = NULL;
> +	pgmap->data = NULL;
>  
>  	mutex_lock(&pgmap_lock);
>  	error = 0;
> diff --git a/mm/Kconfig b/mm/Kconfig
> index d744cff..f5357ff 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -736,6 +736,19 @@ config ZONE_DEVICE
>  
>  	  If FS_DAX is enabled, then say Y.
>  
> +config DEVICE_PRIVATE
> +	bool "Unaddressable device memory (GPU memory, ...)"
> +	depends on X86_64
> +	depends on ZONE_DEVICE
> +	depends on MEMORY_HOTPLUG
> +	depends on MEMORY_HOTREMOVE
> +	depends on SPARSEMEM_VMEMMAP
> +
> +	help
> +	  Allows creation of struct pages to represent unaddressable device
> +	  memory; i.e., memory that is only accessible from the device (or
> +	  group of devices).
> +
>  config FRAME_VECTOR
>  	bool
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index d320b4e..eba61dd 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -49,6 +49,7 @@
>  #include <linux/swap.h>
>  #include <linux/highmem.h>
>  #include <linux/pagemap.h>
> +#include <linux/memremap.h>
>  #include <linux/ksm.h>
>  #include <linux/rmap.h>
>  #include <linux/export.h>
> @@ -927,6 +928,35 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  					pte = pte_swp_mksoft_dirty(pte);
>  				set_pte_at(src_mm, addr, src_pte, pte);
>  			}
> +		} else if (is_device_private_entry(entry)) {
> +			page = device_private_entry_to_page(entry);
> +
> +			/*
> +			 * Update rss count even for unaddressable pages, as
> +			 * they should treated just like normal pages in this
> +			 * respect.
> +			 *
> +			 * We will likely want to have some new rss counters
> +			 * for unaddressable pages, at some point. But for now
> +			 * keep things as they are.
> +			 */
> +			get_page(page);
> +			rss[mm_counter(page)]++;
> +			page_dup_rmap(page, false);
> +
> +			/*
> +			 * We do not preserve soft-dirty information, because so
> +			 * far, checkpoint/restore is the only feature that
> +			 * requires that. And checkpoint/restore does not work
> +			 * when a device driver is involved (you cannot easily
> +			 * save and restore device driver state).
> +			 */
> +			if (is_write_device_private_entry(entry) &&
> +			    is_cow_mapping(vm_flags)) {
> +				make_device_private_entry_read(&entry);
> +				pte = swp_entry_to_pte(entry);
> +				set_pte_at(src_mm, addr, src_pte, pte);
> +			}
>  		}
>  		goto out_set_pte;
>  	}
> @@ -1243,6 +1273,29 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  			}
>  			continue;
>  		}
> +
> +		entry = pte_to_swp_entry(ptent);
> +		if (non_swap_entry(entry) && is_device_private_entry(entry)) {
> +			struct page *page = device_private_entry_to_page(entry);
> +
> +			if (unlikely(details && details->check_mapping)) {
> +				/*
> +				 * unmap_shared_mapping_pages() wants to
> +				 * invalidate cache without truncating:
> +				 * unmap shared but keep private pages.
> +				 */
> +				if (details->check_mapping !=
> +				    page_rmapping(page))
> +					continue;
> +			}
> +
> +			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
> +			rss[mm_counter(page)]--;
> +			page_remove_rmap(page, false);
> +			put_page(page);
> +			continue;
> +		}
> +
>  		/* If details->check_mapping, we leave swap entries. */
>  		if (unlikely(details))
>  			continue;
> @@ -2690,6 +2743,14 @@ int do_swap_page(struct vm_fault *vmf)
>  		if (is_migration_entry(entry)) {
>  			migration_entry_wait(vma->vm_mm, vmf->pmd,
>  					     vmf->address);
> +		} else if (is_device_private_entry(entry)) {
> +			/*
> +			 * For un-addressable device memory we call the pgmap
> +			 * fault handler callback. The callback must migrate
> +			 * the page back to some CPU accessible page.
> +			 */
> +			ret = device_private_entry_fault(vma, vmf->address, entry,
> +						 vmf->flags, vmf->pmd);
>  		} else if (is_hwpoison_entry(entry)) {
>  			ret = VM_FAULT_HWPOISON;
>  		} else {
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 599c675..0a9f690 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -156,7 +156,7 @@ void mem_hotplug_done(void)
>  /* add this memory to iomem resource */
>  static struct resource *register_memory_resource(u64 start, u64 size)
>  {
> -	struct resource *res;
> +	struct resource *res, *conflict;
>  	res = kzalloc(sizeof(struct resource), GFP_KERNEL);
>  	if (!res)
>  		return ERR_PTR(-ENOMEM);
> @@ -165,7 +165,13 @@ static struct resource *register_memory_resource(u64 start, u64 size)
>  	res->start = start;
>  	res->end = start + size - 1;
>  	res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
> -	if (request_resource(&iomem_resource, res) < 0) {
> +	conflict =  request_resource_conflict(&iomem_resource, res);
> +	if (conflict) {
> +		if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
> +			pr_debug("Device unaddressable memory block "
> +				 "memory hotplug at %#010llx !\n",
> +				 (unsigned long long)start);
> +		}
>  		pr_debug("System RAM resource %pR cannot be added\n", res);
>  		kfree(res);
>  		return ERR_PTR(-EEXIST);
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 1a8c9ca..868d0ed 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -124,6 +124,20 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  
>  				pages++;
>  			}
> +
> +			if (is_write_device_private_entry(entry)) {
> +				pte_t newpte;
> +
> +				/*
> +				 * We do not preserve soft-dirtiness. See
> +				 * copy_one_pte() for explanation.
> +				 */
> +				make_device_private_entry_read(&entry);
> +				newpte = swp_entry_to_pte(entry);
> +				set_pte_at(mm, addr, pte, newpte);
> +
> +				pages++;
> +			}
>  		}
>  	} while (pte++, addr += PAGE_SIZE, addr != end);
>  	arch_leave_lazy_mmu_mode();

The patch looks good overall, but I'm trying to evaluate it from an HMM-CDM perspective
to make sure we are not missing any bits needed for that extension.
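
For readers following the new pte paths quoted above, here is a minimal
sketch of how the device-private swap-entry helpers are meant to compose.
This is illustrative only (not from the patch) and the wrapper name is
made up; it only uses helpers this series introduces:

/*
 * Illustration only: given a pte known to hold a device-private swap
 * entry, return the backing struct page and whether it was mapped
 * writable.
 */
static struct page *device_private_pte_to_page(pte_t pte, bool *writable)
{
        swp_entry_t entry = pte_to_swp_entry(pte);

        if (!is_device_private_entry(entry))
                return NULL;

        *writable = is_write_device_private_entry(entry);
        return device_private_entry_to_page(entry);
}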


Balbir Singh.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM 06/15] mm/memory_hotplug: introduce add_pages
  2017-05-24 17:20 ` [HMM 06/15] mm/memory_hotplug: introduce add_pages Jérôme Glisse
@ 2017-05-31  1:31   ` Balbir Singh
  0 siblings, 0 replies; 66+ messages in thread
From: Balbir Singh @ 2017-05-31  1:31 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: akpm, linux-kernel, linux-mm, Dan Williams, Kirill A . Shutemov,
	John Hubbard, Michal Hocko

On Wed, 24 May 2017 13:20:15 -0400
Jérôme Glisse <jglisse@redhat.com> wrote:

> From: Michal Hocko <mhocko@suse.com>
> 
> There are new users of memory hotplug emerging. Some of them require
> different subset of arch_add_memory. There are some which only require
> allocation of struct pages without mapping those pages to the kernel
> address space. We currently have __add_pages for that purpose. But this
> is rather lowlevel and not very suitable for the code outside of the
> memory hotplug. E.g. x86_64 wants to update max_pfn which should be
> done by the caller. Introduce add_pages() which should care about those
> details if they are needed. Each architecture should define its
> implementation and select CONFIG_ARCH_HAS_ADD_PAGES. All others use
> the currently existing __add_pages.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> ---

Acked-by: Balbir Singh <bsingharora@gmail.com>

Looks good. From a CDM perspective, this means that HMM-CDM would continue
to use arch_add_memory().
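
For reference, a rough sketch of the generic fallback when an arch does not
select CONFIG_ARCH_HAS_ADD_PAGES; the exact parameter list here is an
assumption (mirroring the current __add_pages() signature), so treat it as
illustrative rather than the literal patch:

#ifndef CONFIG_ARCH_HAS_ADD_PAGES
/*
 * Generic fallback: no arch-specific bookkeeping (such as the max_pfn
 * update x86_64 needs), just create the struct pages.
 */
static inline int add_pages(int nid, unsigned long start_pfn,
                            unsigned long nr_pages, bool want_memblock)
{
        return __add_pages(nid, start_pfn, nr_pages, want_memblock);
}
#endif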

Balbir Singh.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM 02/15] mm/hmm: heterogeneous memory management (HMM for short) v4
  2017-05-24 17:20 ` [HMM 02/15] mm/hmm: heterogeneous memory management (HMM for short) v4 Jérôme Glisse
@ 2017-05-31  2:10   ` Balbir Singh
  2017-06-01 22:35     ` Jerome Glisse
  0 siblings, 1 reply; 66+ messages in thread
From: Balbir Singh @ 2017-05-31  2:10 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: akpm, linux-kernel, linux-mm, Dan Williams, Kirill A . Shutemov,
	John Hubbard, Evgeny Baskakov, Mark Hairgrove, Sherry Cheung,
	Subhash Gutti

On Wed, 24 May 2017 13:20:11 -0400
Jérôme Glisse <jglisse@redhat.com> wrote:

> HMM provides 3 separate types of functionality:
>     - Mirroring: synchronize CPU page table and device page table
>     - Device memory: allocating struct page for device memory
>     - Migration: migrating regular memory to device memory
> 
> This patch introduces some common helpers and definitions to all of
> those 3 functionality.
> 
> Changed since v3:
>   - Unconditionaly build hmm.c for static keys
> Changed since v2:
>   - s/device unaddressable/device private
> Changed since v1:
>   - Kconfig logic (depend on x86-64 and use ARCH_HAS pattern)
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
> Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
> Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
> ---

It would be nice to explain a bit about how the hmm_pfn_t bits work with the pfn,
and to spell out what we need from an arch to support HMM.
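
For context, the packing in question follows the same idea as the
MIGRATE_PFN_* layout quoted later in this thread: the pfn sits in the high
bits and the low bits carry flags. A tiny sketch, illustrative only (the
exact hmm_pfn_t flag names are an assumption and live in include/linux/hmm.h
in the series), built on the migrate_pfn() helpers from patch 12:

static inline unsigned long pack_migrate_pfn(unsigned long pfn, bool write)
{
        /* migrate_pfn() shifts the pfn up and sets MIGRATE_PFN_VALID */
        unsigned long mpfn = migrate_pfn(pfn);

        if (write)
                mpfn |= MIGRATE_PFN_WRITE;
        return mpfn;
}

static inline struct page *unpack_migrate_pfn(unsigned long mpfn)
{
        /* returns NULL when MIGRATE_PFN_VALID is not set */
        return migrate_pfn_to_page(mpfn);
}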


Balbir Singh.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM 12/15] mm/migrate: new memory migration helper for use with device memory v4
  2017-05-24 17:20 ` [HMM 12/15] mm/migrate: new memory migration helper for use with device memory v4 Jérôme Glisse
@ 2017-05-31  3:59   ` Balbir Singh
  2017-06-01 22:35     ` Jerome Glisse
  0 siblings, 1 reply; 66+ messages in thread
From: Balbir Singh @ 2017-05-31  3:59 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: akpm, linux-kernel, linux-mm, Dan Williams, Kirill A . Shutemov,
	John Hubbard, Evgeny Baskakov, Mark Hairgrove, Sherry Cheung,
	Subhash Gutti

On Wed, 24 May 2017 13:20:21 -0400
Jérôme Glisse <jglisse@redhat.com> wrote:

> This patch add a new memory migration helpers, which migrate memory
> backing a range of virtual address of a process to different memory
> (which can be allocated through special allocator). It differs from
> numa migration by working on a range of virtual address and thus by
> doing migration in chunk that can be large enough to use DMA engine
> or special copy offloading engine.
> 
> Expected users are any one with heterogeneous memory where different
> memory have different characteristics (latency, bandwidth, ...). As
> an example IBM platform with CAPI bus can make use of this feature
> to migrate between regular memory and CAPI device memory. New CPU
> architecture with a pool of high performance memory not manage as
> cache but presented as regular memory (while being faster and with
> lower latency than DDR) will also be prime user of this patch.
> 
> Migration to private device memory will be useful for device that
> have large pool of such like GPU, NVidia plans to use HMM for that.
> 

It is helpful. For HMM-CDM, however, we would like to avoid the downsides
of MIGRATE_SYNC_NO_COPY.

> Changes since v3:
>   - Rebase
> 
> Changes since v2:
>   - droped HMM prefix and HMM specific code
> Changes since v1:
>   - typos fix
>   - split early unmap optimization for page with single mapping
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
> Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
> Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
> ---
>  include/linux/migrate.h | 104 ++++++++++++
>  mm/migrate.c            | 444 ++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 548 insertions(+)
> 
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 78a0fdc..576b3f5 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -127,4 +127,108 @@ static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>  }
>  #endif /* CONFIG_NUMA_BALANCING && CONFIG_TRANSPARENT_HUGEPAGE*/
>  
> +
> +#ifdef CONFIG_MIGRATION
> +
> +#define MIGRATE_PFN_VALID	(1UL << 0)
> +#define MIGRATE_PFN_MIGRATE	(1UL << 1)
> +#define MIGRATE_PFN_LOCKED	(1UL << 2)
> +#define MIGRATE_PFN_WRITE	(1UL << 3)
> +#define MIGRATE_PFN_ERROR	(1UL << 4)
> +#define MIGRATE_PFN_SHIFT	5
> +
> +static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
> +{
> +	if (!(mpfn & MIGRATE_PFN_VALID))
> +		return NULL;
> +	return pfn_to_page(mpfn >> MIGRATE_PFN_SHIFT);
> +}
> +
> +static inline unsigned long migrate_pfn(unsigned long pfn)
> +{
> +	return (pfn << MIGRATE_PFN_SHIFT) | MIGRATE_PFN_VALID;
> +}
> +
> +/*
> + * struct migrate_vma_ops - migrate operation callback
> + *
> + * @alloc_and_copy: alloc destination memory and copy source memory to it
> + * @finalize_and_map: allow caller to map the successfully migrated pages
> + *
> + *
> + * The alloc_and_copy() callback happens once all source pages have been locked,
> + * unmapped and checked (checked whether pinned or not). All pages that can be
> + * migrated will have an entry in the src array set with the pfn value of the
> + * page and with the MIGRATE_PFN_VALID and MIGRATE_PFN_MIGRATE flag set (other
> + * flags might be set but should be ignored by the callback).
> + *
> + * The alloc_and_copy() callback can then allocate destination memory and copy
> + * source memory to it for all those entries (ie with MIGRATE_PFN_VALID and
> + * MIGRATE_PFN_MIGRATE flag set). Once these are allocated and copied, the
> + * callback must update each corresponding entry in the dst array with the pfn
> + * value of the destination page and with the MIGRATE_PFN_VALID and
> + * MIGRATE_PFN_LOCKED flags set (destination pages must have their struct pages
> + * locked, via lock_page()).
> + *
> + * At this point the alloc_and_copy() callback is done and returns.
> + *
> + * Note that the callback does not have to migrate all the pages that are
> + * marked with MIGRATE_PFN_MIGRATE flag in src array unless this is a migration
> + * from device memory to system memory (ie the MIGRATE_PFN_DEVICE flag is also
> + * set in the src array entry). If the device driver cannot migrate a device
> + * page back to system memory, then it must set the corresponding dst array
> + * entry to MIGRATE_PFN_ERROR. This will trigger a SIGBUS if CPU tries to
> + * access any of the virtual addresses originally backed by this page. Because
> + * a SIGBUS is such a severe result for the userspace process, the device
> + * driver should avoid setting MIGRATE_PFN_ERROR unless it is really in an
> + * unrecoverable state.
> + *
> + * THE alloc_and_copy() CALLBACK MUST NOT CHANGE ANY OF THE SRC ARRAY ENTRIES
> + * OR BAD THINGS WILL HAPPEN !
> + *
> + *
> + * The finalize_and_map() callback happens after struct page migration from
> + * source to destination (destination struct pages are the struct pages for the
> + * memory allocated by the alloc_and_copy() callback).  Migration can fail, and
> + * thus the finalize_and_map() allows the driver to inspect which pages were
> + * successfully migrated, and which were not. Successfully migrated pages will
> + * have the MIGRATE_PFN_MIGRATE flag set for their src array entry.
> + *
> + * It is safe to update device page table from within the finalize_and_map()
> + * callback because both destination and source page are still locked, and the
> + * mmap_sem is held in read mode (hence no one can unmap the range being
> + * migrated).
> + *
> + * Once callback is done cleaning up things and updating its page table (if it
> + * chose to do so, this is not an obligation) then it returns. At this point,
> + * the HMM core will finish up the final steps, and the migration is complete.
> + *
> + * THE finalize_and_map() CALLBACK MUST NOT CHANGE ANY OF THE SRC OR DST ARRAY
> + * ENTRIES OR BAD THINGS WILL HAPPEN !
> + */
> +struct migrate_vma_ops {
> +	void (*alloc_and_copy)(struct vm_area_struct *vma,
> +			       const unsigned long *src,
> +			       unsigned long *dst,
> +			       unsigned long start,
> +			       unsigned long end,
> +			       void *private);
> +	void (*finalize_and_map)(struct vm_area_struct *vma,
> +				 const unsigned long *src,
> +				 const unsigned long *dst,
> +				 unsigned long start,
> +				 unsigned long end,
> +				 void *private);
> +};
> +
> +int migrate_vma(const struct migrate_vma_ops *ops,
> +		struct vm_area_struct *vma,
> +		unsigned long start,
> +		unsigned long end,
> +		unsigned long *src,
> +		unsigned long *dst,
> +		void *private);
> +
> +#endif /* CONFIG_MIGRATION */
> +
>  #endif /* _LINUX_MIGRATE_H */
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 66410fc..12063f3 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -397,6 +397,14 @@ int migrate_page_move_mapping(struct address_space *mapping,
>  	int expected_count = 1 + extra_count;
>  	void **pslot;
>  
> +	/*
> +	 * ZONE_DEVICE pages have 1 refcount always held by their device
> +	 *
> +	 * Note that DAX memory will never reach that point as it does not have
> +	 * the MEMORY_DEVICE_ALLOW_MIGRATE flag set (see memory_hotplug.h).

I couldn't find this flag in memory_hotplug.h; is this a stale comment?

> +	 */
> +	expected_count += is_zone_device_page(page);
> +
>  	if (!mapping) {
>  		/* Anonymous page without mapping */
>  		if (page_count(page) != expected_count)
> @@ -2077,3 +2085,439 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>  #endif /* CONFIG_NUMA_BALANCING */
>  
>  #endif /* CONFIG_NUMA */
> +
> +
> +struct migrate_vma {
> +	struct vm_area_struct	*vma;
> +	unsigned long		*dst;
> +	unsigned long		*src;
> +	unsigned long		cpages;
> +	unsigned long		npages;
> +	unsigned long		start;
> +	unsigned long		end;

Could we add a flags field that specifies whether the migration should be MIGRATE_SYNC_NO_COPY or not?
I think the generic routine is helpful outside of the specific HMM use case as well.

> +};
> +
> +static int migrate_vma_collect_hole(unsigned long start,
> +				    unsigned long end,
> +				    struct mm_walk *walk)
> +{
> +	struct migrate_vma *migrate = walk->private;
> +	unsigned long addr, next;
> +
> +	for (addr = start & PAGE_MASK; addr < end; addr += PAGE_SIZE) {
> +		migrate->dst[migrate->npages] = 0;
> +		migrate->src[migrate->npages++] = 0;
> +	}
> +
> +	return 0;
> +}
> +
> +static int migrate_vma_collect_pmd(pmd_t *pmdp,
> +				   unsigned long start,
> +				   unsigned long end,
> +				   struct mm_walk *walk)
> +{
> +	struct migrate_vma *migrate = walk->private;
> +	struct mm_struct *mm = walk->vma->vm_mm;
> +	unsigned long addr = start;
> +	spinlock_t *ptl;
> +	pte_t *ptep;
> +
> +	if (pmd_none(*pmdp) || pmd_trans_unstable(pmdp)) {
> +		/* FIXME support THP */
> +		return migrate_vma_collect_hole(start, end, walk);
> +	}
> +
> +	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> +	for (; addr < end; addr += PAGE_SIZE, ptep++) {
> +		unsigned long mpfn, pfn;
> +		struct page *page;
> +		pte_t pte;
> +
> +		pte = *ptep;
> +		pfn = pte_pfn(pte);
> +
> +		if (!pte_present(pte)) {
> +			mpfn = pfn = 0;
> +			goto next;
> +		}
> +
> +		/* FIXME support THP */
> +		page = vm_normal_page(migrate->vma, addr, pte);
> +		if (!page || !page->mapping || PageTransCompound(page)) {
> +			mpfn = pfn = 0;
> +			goto next;
> +		}
> +
> +		/*
> +		 * By getting a reference on the page we pin it and that blocks
> +		 * any kind of migration. Side effect is that it "freezes" the
> +		 * pte.
> +		 *
> +		 * We drop this reference after isolating the page from the lru
> +		 * for non device page (device page are not on the lru and thus
> +		 * can't be dropped from it).
> +		 */
> +		get_page(page);
> +		migrate->cpages++;
> +		mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
> +		mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
> +
> +next:
> +		migrate->src[migrate->npages++] = mpfn;
> +	}
> +	pte_unmap_unlock(ptep - 1, ptl);
> +
> +	return 0;
> +}
> +
> +/*
> + * migrate_vma_collect() - collect pages over a range of virtual addresses
> + * @migrate: migrate struct containing all migration information
> + *
> + * This will walk the CPU page table. For each virtual address backed by a
> + * valid page, it updates the src array and takes a reference on the page, in
> + * order to pin the page until we lock it and unmap it.
> + */
> +static void migrate_vma_collect(struct migrate_vma *migrate)
> +{
> +	struct mm_walk mm_walk;
> +
> +	mm_walk.pmd_entry = migrate_vma_collect_pmd;
> +	mm_walk.pte_entry = NULL;
> +	mm_walk.pte_hole = migrate_vma_collect_hole;
> +	mm_walk.hugetlb_entry = NULL;
> +	mm_walk.test_walk = NULL;
> +	mm_walk.vma = migrate->vma;
> +	mm_walk.mm = migrate->vma->vm_mm;
> +	mm_walk.private = migrate;
> +
> +	walk_page_range(migrate->start, migrate->end, &mm_walk);
> +
> +	migrate->end = migrate->start + (migrate->npages << PAGE_SHIFT);
> +}
> +
> +/*
> + * migrate_vma_check_page() - check if page is pinned or not
> + * @page: struct page to check
> + *
> + * Pinned pages cannot be migrated. This is the same test as in
> + * migrate_page_move_mapping(), except that here we allow migration of a
> + * ZONE_DEVICE page.
> + */
> +static bool migrate_vma_check_page(struct page *page)
> +{
> +	/*
> +	 * One extra ref because caller holds an extra reference, either from
> +	 * isolate_lru_page() for a regular page, or migrate_vma_collect() for
> +	 * a device page.
> +	 */
> +	int extra = 1;
> +
> +	/*
> +	 * FIXME support THP (transparent huge page), it is bit more complex to
> +	 * check them than regular pages, because they can be mapped with a pmd
> +	 * or with a pte (split pte mapping).
> +	 */
> +	if (PageCompound(page))
> +		return false;
> +
> +	if ((page_count(page) - extra) > page_mapcount(page))
> +		return false;
> +
> +	return true;
> +}
> +
> +/*
> + * migrate_vma_prepare() - lock pages and isolate them from the lru
> + * @migrate: migrate struct containing all migration information
> + *
> + * This locks pages that have been collected by migrate_vma_collect(). Once each
> + * page is locked it is isolated from the lru (for non-device pages). Finally,
> + * the ref taken by migrate_vma_collect() is dropped, as locked pages cannot be
> + * migrated by concurrent kernel threads.
> + */
> +static void migrate_vma_prepare(struct migrate_vma *migrate)
> +{
> +	const unsigned long npages = migrate->npages;
> +	const unsigned long start = migrate->start;
> +	unsigned long addr, i, restore = 0;
> +	bool allow_drain = true;
> +
> +	lru_add_drain();
> +
> +	for (i = 0; i < npages; i++) {
> +		struct page *page = migrate_pfn_to_page(migrate->src[i]);
> +
> +		if (!page)
> +			continue;
> +
> +		lock_page(page);
> +		migrate->src[i] |= MIGRATE_PFN_LOCKED;
> +
> +		if (!PageLRU(page) && allow_drain) {
> +			/* Drain CPU's pagevec */
> +			lru_add_drain_all();
> +			allow_drain = false;
> +		}
> +
> +		if (isolate_lru_page(page)) {
> +			migrate->src[i] = 0;
> +			unlock_page(page);
> +			migrate->cpages--;
> +			put_page(page);
> +			continue;
> +		}
> +
> +		if (!migrate_vma_check_page(page)) {
> +			migrate->src[i] = 0;
> +			unlock_page(page);
> +			migrate->cpages--;
> +
> +			putback_lru_page(page);
> +		}
> +	}
> +}
> +
> +/*
> + * migrate_vma_unmap() - replace page mapping with special migration pte entry
> + * @migrate: migrate struct containing all migration information
> + *
> + * Replace page mapping (CPU page table pte) with a special migration pte entry
> + * and check again if it has been pinned. Pinned pages are restored because we
> + * cannot migrate them.
> + *
> + * This is the last step before we call the device driver callback to allocate
> + * destination memory and copy contents of original page over to new page.
> + */
> +static void migrate_vma_unmap(struct migrate_vma *migrate)
> +{
> +	int flags = TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
> +	const unsigned long npages = migrate->npages;
> +	const unsigned long start = migrate->start;
> +	unsigned long addr, i, restore = 0;
> +
> +	for (i = 0; i < npages; i++) {
> +		struct page *page = migrate_pfn_to_page(migrate->src[i]);
> +
> +		if (!page || !(migrate->src[i] & MIGRATE_PFN_MIGRATE))
> +			continue;
> +
> +		try_to_unmap(page, flags);
> +		if (page_mapped(page) || !migrate_vma_check_page(page)) {
> +			migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
> +			migrate->cpages--;
> +			restore++;
> +		}
> +	}
> +
> +	for (addr = start, i = 0; i < npages && restore; addr += PAGE_SIZE, i++) {
> +		struct page *page = migrate_pfn_to_page(migrate->src[i]);
> +
> +		if (!page || (migrate->src[i] & MIGRATE_PFN_MIGRATE))
> +			continue;
> +
> +		remove_migration_ptes(page, page, false);
> +
> +		migrate->src[i] = 0;
> +		unlock_page(page);
> +		restore--;
> +
> +		putback_lru_page(page);
> +	}
> +}
> +
> +/*
> + * migrate_vma_pages() - migrate meta-data from src page to dst page
> + * @migrate: migrate struct containing all migration information
> + *
> + * This migrates struct page meta-data from source struct page to destination
> + * struct page. This effectively finishes the migration from source page to the
> + * destination page.
> + */
> +static void migrate_vma_pages(struct migrate_vma *migrate)
> +{
> +	const unsigned long npages = migrate->npages;
> +	const unsigned long start = migrate->start;
> +	unsigned long addr, i;
> +
> +	for (i = 0, addr = start; i < npages; addr += PAGE_SIZE, i++) {
> +		struct page *newpage = migrate_pfn_to_page(migrate->dst[i]);
> +		struct page *page = migrate_pfn_to_page(migrate->src[i]);
> +		struct address_space *mapping;
> +		int r;
> +
> +		if (!page || !newpage)
> +			continue;
> +		if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE))
> +			continue;
> +
> +		mapping = page_mapping(page);
> +
> +		r = migrate_page(mapping, newpage, page, MIGRATE_SYNC_NO_COPY);

Could we use a flags field to determine if we should use MIGRATE_SYNC_NO_COPY or not?

> +		if (r != MIGRATEPAGE_SUCCESS)
> +			migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
> +	}
> +}
> +
> +/*
> + * migrate_vma_finalize() - restore CPU page table entry
> + * @migrate: migrate struct containing all migration information
> + *
> + * This replaces the special migration pte entry with either a mapping to the
> + * new page if migration was successful for that page, or to the original page
> + * otherwise.
> + *
> + * This also unlocks the pages and puts them back on the lru, or drops the extra
> + * refcount, for device pages.
> + */
> +static void migrate_vma_finalize(struct migrate_vma *migrate)
> +{
> +	const unsigned long npages = migrate->npages;
> +	unsigned long i;
> +
> +	for (i = 0; i < npages; i++) {
> +		struct page *newpage = migrate_pfn_to_page(migrate->dst[i]);
> +		struct page *page = migrate_pfn_to_page(migrate->src[i]);
> +
> +		if (!page)
> +			continue;
> +		if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE) || !newpage) {
> +			if (newpage) {
> +				unlock_page(newpage);
> +				put_page(newpage);
> +			}
> +			newpage = page;
> +		}
> +
> +		remove_migration_ptes(page, newpage, false);
> +		unlock_page(page);
> +		migrate->cpages--;
> +
> +		putback_lru_page(page);
> +
> +		if (newpage != page) {
> +			unlock_page(newpage);
> +			putback_lru_page(newpage);
> +		}
> +	}
> +}
> +
> +/*
> + * migrate_vma() - migrate a range of memory inside vma
> + *
> + * @ops: migration callback for allocating destination memory and copying
> + * @vma: virtual memory area containing the range to be migrated
> + * @start: start address of the range to migrate (inclusive)
> + * @end: end address of the range to migrate (exclusive)
> + * @src: array of hmm_pfn_t containing source pfns
> + * @dst: array of hmm_pfn_t containing destination pfns
> + * @private: pointer passed back to each of the callback
> + * Returns: 0 on success, error code otherwise
> + *
> + * This function tries to migrate a range of memory virtual address range, using
> + * callbacks to allocate and copy memory from source to destination. First it
> + * collects all the pages backing each virtual address in the range, saving this
> + * inside the src array. Then it locks those pages and unmaps them. Once the pages
> + * are locked and unmapped, it checks whether each page is pinned or not. Pages
> + * that aren't pinned have the MIGRATE_PFN_MIGRATE flag set (by this function)
> + * in the corresponding src array entry. It then restores any pages that are
> + * pinned, by remapping and unlocking those pages.
> + *
> + * At this point it calls the alloc_and_copy() callback. For documentation on
> + * what is expected from that callback, see struct migrate_vma_ops comments in
> + * include/linux/migrate.h
> + *
> + * After the alloc_and_copy() callback, this function goes over each entry in
> + * the src array that has the MIGRATE_PFN_VALID and MIGRATE_PFN_MIGRATE flag
> + * set. If the corresponding entry in dst array has MIGRATE_PFN_VALID flag set,
> + * then the function tries to migrate struct page information from the source
> + * struct page to the destination struct page. If it fails to migrate the struct
> + * page information, then it clears the MIGRATE_PFN_MIGRATE flag in the src
> + * array.
> + *
> + * At this point all successfully migrated pages have an entry in the src
> + * array with MIGRATE_PFN_VALID and MIGRATE_PFN_MIGRATE flag set and the dst
> + * array entry with MIGRATE_PFN_VALID flag set.
> + *
> + * It then calls the finalize_and_map() callback. See comments for "struct
> + * migrate_vma_ops", in include/linux/migrate.h for details about
> + * finalize_and_map() behavior.
> + *
> + * After the finalize_and_map() callback, for successfully migrated pages, this
> + * function updates the CPU page table to point to new pages, otherwise it
> + * restores the CPU page table to point to the original source pages.
> + *
> + * Function returns 0 after the above steps, even if no pages were migrated
> + * (The function only returns an error if any of the arguments are invalid.)
> + *
> + * Both src and dst array must be big enough for (end - start) >> PAGE_SHIFT
> + * unsigned long entries.
> + */
> +int migrate_vma(const struct migrate_vma_ops *ops,
> +		struct vm_area_struct *vma,
> +		unsigned long start,
> +		unsigned long end,
> +		unsigned long *src,
> +		unsigned long *dst,
> +		void *private)
> +{
> +	struct migrate_vma migrate;
> +
> +	/* Sanity check the arguments */
> +	start &= PAGE_MASK;
> +	end &= PAGE_MASK;
> +	if (!vma || is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL))
> +		return -EINVAL;
> +	if (start < vma->vm_start || start >= vma->vm_end)
> +		return -EINVAL;
> +	if (end <= vma->vm_start || end > vma->vm_end)
> +		return -EINVAL;
> +	if (!ops || !src || !dst || start >= end)
> +		return -EINVAL;
> +
> +	memset(src, 0, sizeof(*src) * ((end - start) >> PAGE_SHIFT));
> +	migrate.src = src;
> +	migrate.dst = dst;
> +	migrate.start = start;
> +	migrate.npages = 0;
> +	migrate.cpages = 0;
> +	migrate.end = end;
> +	migrate.vma = vma;
> +
> +	/* Collect, and try to unmap source pages */
> +	migrate_vma_collect(&migrate);
> +	if (!migrate.cpages)
> +		return 0;
> +
> +	/* Lock and isolate page */
> +	migrate_vma_prepare(&migrate);
> +	if (!migrate.cpages)
> +		return 0;
> +
> +	/* Unmap pages */
> +	migrate_vma_unmap(&migrate);
> +	if (!migrate.cpages)
> +		return 0;
> +
> +	/*
> +	 * At this point pages are locked and unmapped, and thus they have
> +	 * stable content and can safely be copied to destination memory that
> +	 * is allocated by the callback.
> +	 *
> +	 * Note that migration can fail in migrate_vma_struct_page() for each
> +	 * individual page.
> +	 */
> +	ops->alloc_and_copy(vma, src, dst, start, end, private);
> +
> +	/* This does the real migration of struct page */
> +	migrate_vma_pages(&migrate);
> +
> +	ops->finalize_and_map(vma, src, dst, start, end, private);
> +
> +	/* Unlock and remap pages */
> +	migrate_vma_finalize(&migrate);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(migrate_vma);

In general, it's helpful to have

Acked-by: Balbir Singh <bsingharora@gmail.com>

Balbir Singh.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM 14/15] mm/migrate: support un-addressable ZONE_DEVICE page in migration v2
  2017-05-24 17:20 ` [HMM 14/15] mm/migrate: support un-addressable ZONE_DEVICE page in migration v2 Jérôme Glisse
@ 2017-05-31  4:09   ` Balbir Singh
  2017-05-31  8:39     ` Balbir Singh
  0 siblings, 1 reply; 66+ messages in thread
From: Balbir Singh @ 2017-05-31  4:09 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: akpm, linux-kernel, linux-mm, Dan Williams, Kirill A . Shutemov,
	John Hubbard

On Wed, 24 May 2017 13:20:23 -0400
Jérôme Glisse <jglisse@redhat.com> wrote:

> Allow to unmap and restore special swap entry of un-addressable
> ZONE_DEVICE memory.
> 
> Changed since v1:
>   - s/device unaddressable/device private/
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  include/linux/migrate.h |  10 +++-
>  mm/migrate.c            | 134 ++++++++++++++++++++++++++++++++++++++----------
>  mm/page_vma_mapped.c    |  10 ++++
>  mm/rmap.c               |  25 +++++++++
>  4 files changed, 150 insertions(+), 29 deletions(-)
> 
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 576b3f5..7dd875a 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -130,12 +130,18 @@ static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>  
>  #ifdef CONFIG_MIGRATION
>  
> +/*
> + * Watch out for PAE architecture, which has an unsigned long, and might not
> + * have enough bits to store all physical address and flags. So far we have
> + * enough room for all our flags.
> + */
>  #define MIGRATE_PFN_VALID	(1UL << 0)
>  #define MIGRATE_PFN_MIGRATE	(1UL << 1)
>  #define MIGRATE_PFN_LOCKED	(1UL << 2)
>  #define MIGRATE_PFN_WRITE	(1UL << 3)
> -#define MIGRATE_PFN_ERROR	(1UL << 4)
> -#define MIGRATE_PFN_SHIFT	5
> +#define MIGRATE_PFN_DEVICE	(1UL << 4)
> +#define MIGRATE_PFN_ERROR	(1UL << 5)
> +#define MIGRATE_PFN_SHIFT	6
>  
>  static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
>  {
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 1f2bc61..9e68399 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -36,6 +36,7 @@
>  #include <linux/hugetlb.h>
>  #include <linux/hugetlb_cgroup.h>
>  #include <linux/gfp.h>
> +#include <linux/memremap.h>
>  #include <linux/balloon_compaction.h>
>  #include <linux/mmu_notifier.h>
>  #include <linux/page_idle.h>
> @@ -227,7 +228,15 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
>  		if (is_write_migration_entry(entry))
>  			pte = maybe_mkwrite(pte, vma);
>  
> -		flush_dcache_page(new);
> +		if (unlikely(is_zone_device_page(new)) &&
> +		    is_device_private_page(new)) {

I would expect HMM-CDM to never hit this pattern, given that
we should not be creating migration entries for CDM memory.
Is that a fair assumption?

> +			entry = make_device_private_entry(new, pte_write(pte));
> +			pte = swp_entry_to_pte(entry);
> +			if (pte_swp_soft_dirty(*pvmw.pte))
> +				pte = pte_mksoft_dirty(pte);
> +		} else
> +			flush_dcache_page(new);
> +
>  #ifdef CONFIG_HUGETLB_PAGE
>  		if (PageHuge(new)) {
>  			pte = pte_mkhuge(pte);
> @@ -2140,17 +2149,40 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  		pte = *ptep;
>  		pfn = pte_pfn(pte);
>  
> -		if (!pte_present(pte)) {
> +		if (pte_none(pte)) {
>  			mpfn = pfn = 0;
>  			goto next;
>  		}
>  
> +		if (!pte_present(pte)) {
> +			mpfn = pfn = 0;
> +
> +			/*
> +			 * Only care about unaddressable device page special
> +			 * page table entry. Other special swap entries are not
> +			 * migratable, and we ignore regular swapped page.
> +			 */
> +			entry = pte_to_swp_entry(pte);
> +			if (!is_device_private_entry(entry))
> +				goto next;
> +
> +			page = device_private_entry_to_page(entry);
> +			mpfn = migrate_pfn(page_to_pfn(page))|
> +				MIGRATE_PFN_DEVICE | MIGRATE_PFN_MIGRATE;
> +			if (is_write_device_private_entry(entry))
> +				mpfn |= MIGRATE_PFN_WRITE;
> +		} else {
> +			page = vm_normal_page(migrate->vma, addr, pte);
> +			mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
> +			mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
> +		}
> +
>  		/* FIXME support THP */
> -		page = vm_normal_page(migrate->vma, addr, pte);
>  		if (!page || !page->mapping || PageTransCompound(page)) {
>  			mpfn = pfn = 0;
>  			goto next;
>  		}
> +		pfn = page_to_pfn(page);
>  
>  		/*
>  		 * By getting a reference on the page we pin it and that blocks
> @@ -2163,8 +2195,6 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  		 */
>  		get_page(page);
>  		migrate->cpages++;
> -		mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
> -		mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
>  
>  		/*
>  		 * Optimize for the common case where page is only mapped once
> @@ -2195,6 +2225,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  		}
>  
>  next:
> +		migrate->dst[migrate->npages] = 0;
>  		migrate->src[migrate->npages++] = mpfn;
>  	}
>  	arch_leave_lazy_mmu_mode();
> @@ -2264,6 +2295,15 @@ static bool migrate_vma_check_page(struct page *page)
>  	if (PageCompound(page))
>  		return false;
>  
> +	/* Page from ZONE_DEVICE have one extra reference */
> +	if (is_zone_device_page(page)) {
> +		if (is_device_private_page(page)) {
> +			extra++;
> +		} else
> +			/* Other ZONE_DEVICE memory type are not supported */
> +			return false;
> +	}
> +
>  	if ((page_count(page) - extra) > page_mapcount(page))
>  		return false;
>  
> @@ -2301,24 +2341,30 @@ static void migrate_vma_prepare(struct migrate_vma *migrate)
>  			migrate->src[i] |= MIGRATE_PFN_LOCKED;
>  		}
>  
> -		if (!PageLRU(page) && allow_drain) {
> -			/* Drain CPU's pagevec */
> -			lru_add_drain_all();
> -			allow_drain = false;
> -		}
> +		/* ZONE_DEVICE pages are not on LRU */
> +		if (!is_zone_device_page(page)) {
> +			if (!PageLRU(page) && allow_drain) {
> +				/* Drain CPU's pagevec */
> +				lru_add_drain_all();
> +				allow_drain = false;
> +			}
>  
> -		if (isolate_lru_page(page)) {
> -			if (remap) {
> -				migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
> -				migrate->cpages--;
> -				restore++;
> -			} else {
> -				migrate->src[i] = 0;
> -				unlock_page(page);
> -				migrate->cpages--;
> -				put_page(page);
> +			if (isolate_lru_page(page)) {
> +				if (remap) {
> +					migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
> +					migrate->cpages--;
> +					restore++;
> +				} else {
> +					migrate->src[i] = 0;
> +					unlock_page(page);
> +					migrate->cpages--;
> +					put_page(page);
> +				}
> +				continue;
>  			}
> -			continue;
> +
> +			/* Drop the reference we took in collect */
> +			put_page(page);
>  		}
>  
>  		if (!migrate_vma_check_page(page)) {
> @@ -2327,14 +2373,19 @@ static void migrate_vma_prepare(struct migrate_vma *migrate)
>  				migrate->cpages--;
>  				restore++;
>  
> -				get_page(page);
> -				putback_lru_page(page);
> +				if (!is_zone_device_page(page)) {
> +					get_page(page);
> +					putback_lru_page(page);
> +				}
>  			} else {
>  				migrate->src[i] = 0;
>  				unlock_page(page);
>  				migrate->cpages--;
>  
> -				putback_lru_page(page);
> +				if (!is_zone_device_page(page))
> +					putback_lru_page(page);
> +				else
> +					put_page(page);
>  			}
>  		}
>  	}
> @@ -2405,7 +2456,10 @@ static void migrate_vma_unmap(struct migrate_vma *migrate)
>  		unlock_page(page);
>  		restore--;
>  
> -		putback_lru_page(page);
> +		if (is_zone_device_page(page))
> +			put_page(page);
> +		else
> +			putback_lru_page(page);
>  	}
>  }
>  
> @@ -2436,6 +2490,26 @@ static void migrate_vma_pages(struct migrate_vma *migrate)
>  
>  		mapping = page_mapping(page);
>  
> +		if (is_zone_device_page(newpage)) {
> +			if (is_device_private_page(newpage)) {
> +				/*
> +				 * For now only support private anonymous when
> +				 * migrating to un-addressable device memory.
> +				 */
> +				if (mapping) {
> +					migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
> +					continue;
> +				}
> +			} else {
> +				/*
> +				 * Other types of ZONE_DEVICE page are not
> +				 * supported.
> +				 */
> +				migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
> +				continue;
> +			}
> +		}
> +
>  		r = migrate_page(mapping, newpage, page, MIGRATE_SYNC_NO_COPY);
>  		if (r != MIGRATEPAGE_SUCCESS)
>  			migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
> @@ -2476,11 +2550,17 @@ static void migrate_vma_finalize(struct migrate_vma *migrate)
>  		unlock_page(page);
>  		migrate->cpages--;
>  
> -		putback_lru_page(page);
> +		if (is_zone_device_page(page))
> +			put_page(page);
> +		else
> +			putback_lru_page(page);
>  
>  		if (newpage != page) {
>  			unlock_page(newpage);
> -			putback_lru_page(newpage);
> +			if (is_zone_device_page(newpage))
> +				put_page(newpage);
> +			else
> +				putback_lru_page(newpage);
>  		}
>  	}
>  }
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index de9c40d..f95765c 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -48,6 +48,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
>  		if (!is_swap_pte(*pvmw->pte))
>  			return false;
>  		entry = pte_to_swp_entry(*pvmw->pte);
> +
>  		if (!is_migration_entry(entry))
>  			return false;
>  		if (migration_entry_to_page(entry) - pvmw->page >=
> @@ -60,6 +61,15 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
>  		WARN_ON_ONCE(1);
>  #endif
>  	} else {
> +		if (is_swap_pte(*pvmw->pte)) {
> +			swp_entry_t entry;
> +
> +			entry = pte_to_swp_entry(*pvmw->pte);
> +			if (is_device_private_entry(entry) &&
> +			    device_private_entry_to_page(entry) == pvmw->page)
> +				return true;
> +		}
> +
>  		if (!pte_present(*pvmw->pte))
>  			return false;
>  
> diff --git a/mm/rmap.c b/mm/rmap.c
> index d405f0e..515cea6 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -63,6 +63,7 @@
>  #include <linux/hugetlb.h>
>  #include <linux/backing-dev.h>
>  #include <linux/page_idle.h>
> +#include <linux/memremap.h>
>  
>  #include <asm/tlbflush.h>
>  
> @@ -1308,6 +1309,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  	if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
>  		return true;
>  
> +	if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&
> +	    is_zone_device_page(page) && !is_device_private_page(page))
> +		return true;
> +

I wonder how CDM would ever work with this?

>  	if (flags & TTU_SPLIT_HUGE_PMD) {
>  		split_huge_pmd_address(vma, address,
>  				flags & TTU_MIGRATION, page);
> @@ -1343,6 +1348,26 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
>  		address = pvmw.address;
>  
> +		if (IS_ENABLED(CONFIG_MIGRATION) &&
> +		    (flags & TTU_MIGRATION) &&
> +		    is_zone_device_page(page)) {
> +			swp_entry_t entry;
> +			pte_t swp_pte;
> +
> +			pteval = ptep_get_and_clear(mm, address, pvmw.pte);
> +
> +			/*
> +			 * Store the pfn of the page in a special migration
> +			 * pte. do_swap_page() will wait until the migration
> +			 * pte is removed and then restart fault handling.
> +			 */
> +			entry = make_migration_entry(page, 0);
> +			swp_pte = swp_entry_to_pte(entry);
> +			if (pte_soft_dirty(pteval))
> +				swp_pte = pte_swp_mksoft_dirty(swp_pte);
> +			set_pte_at(mm, address, pvmw.pte, swp_pte);
> +			goto discard;
> +		}
>  
>  		if (!(flags & TTU_IGNORE_ACCESS)) {
>  			if (ptep_clear_flush_young_notify(vma, address,


Balbir Singh


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM 14/15] mm/migrate: support un-addressable ZONE_DEVICE page in migration v2
  2017-05-31  4:09   ` Balbir Singh
@ 2017-05-31  8:39     ` Balbir Singh
  0 siblings, 0 replies; 66+ messages in thread
From: Balbir Singh @ 2017-05-31  8:39 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: akpm, linux-kernel, linux-mm, Dan Williams, Kirill A . Shutemov,
	John Hubbard

On Wed, May 31, 2017 at 2:09 PM, Balbir Singh <bsingharora@gmail.com> wrote:
> On Wed, 24 May 2017 13:20:23 -0400
> Jérôme Glisse <jglisse@redhat.com> wrote:
>
>> Allow to unmap and restore special swap entry of un-addressable
>> ZONE_DEVICE memory.
>>
>> Changed since v1:
>>   - s/device unaddressable/device private/
>>
>> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> ---

Sorry! Please ignore my comments; this is only for un-addressable memory.


Balbir


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM 12/15] mm/migrate: new memory migration helper for use with device memory v4
  2017-05-31  3:59   ` Balbir Singh
@ 2017-06-01 22:35     ` Jerome Glisse
  2017-06-07  9:02       ` Balbir Singh
  0 siblings, 1 reply; 66+ messages in thread
From: Jerome Glisse @ 2017-06-01 22:35 UTC (permalink / raw)
  To: Balbir Singh
  Cc: akpm, linux-kernel, linux-mm, Dan Williams, Kirill A . Shutemov,
	John Hubbard, Evgeny Baskakov, Mark Hairgrove, Sherry Cheung,
	Subhash Gutti

On Wed, May 31, 2017 at 01:59:54PM +1000, Balbir Singh wrote:
> On Wed, 24 May 2017 13:20:21 -0400
> Jerome Glisse <jglisse@redhat.com> wrote:
> 
> > This patch add a new memory migration helpers, which migrate memory
> > backing a range of virtual address of a process to different memory
> > (which can be allocated through special allocator). It differs from
> > numa migration by working on a range of virtual address and thus by
> > doing migration in chunk that can be large enough to use DMA engine
> > or special copy offloading engine.
> > 
> > Expected users are any one with heterogeneous memory where different
> > memory have different characteristics (latency, bandwidth, ...). As
> > an example IBM platform with CAPI bus can make use of this feature
> > to migrate between regular memory and CAPI device memory. New CPU
> > architecture with a pool of high performance memory not manage as
> > cache but presented as regular memory (while being faster and with
> > lower latency than DDR) will also be prime user of this patch.
> > 
> > Migration to private device memory will be useful for device that
> > have large pool of such like GPU, NVidia plans to use HMM for that.
> > 
> 
> It is helpful, for HMM-CDM however we would like to avoid the downsides
> of MIGRATE_SYNC_NOCOPY

What are the downsides you are referring to?

Cheers,
Jerome


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM 02/15] mm/hmm: heterogeneous memory management (HMM for short) v4
  2017-05-31  2:10   ` Balbir Singh
@ 2017-06-01 22:35     ` Jerome Glisse
  0 siblings, 0 replies; 66+ messages in thread
From: Jerome Glisse @ 2017-06-01 22:35 UTC (permalink / raw)
  To: Balbir Singh
  Cc: akpm, linux-kernel, linux-mm, Dan Williams, Kirill A . Shutemov,
	John Hubbard, Evgeny Baskakov, Mark Hairgrove, Sherry Cheung,
	Subhash Gutti

On Wed, May 31, 2017 at 12:10:24PM +1000, Balbir Singh wrote:
> On Wed, 24 May 2017 13:20:11 -0400
> Jerome Glisse <jglisse@redhat.com> wrote:
> 
> > HMM provides 3 separate types of functionality:
> >     - Mirroring: synchronize CPU page table and device page table
> >     - Device memory: allocating struct page for device memory
> >     - Migration: migrating regular memory to device memory
> > 
> > This patch introduces some common helpers and definitions to all of
> > those 3 functionality.
> > 
> > Changed since v3:
> >   - Unconditionaly build hmm.c for static keys
> > Changed since v2:
> >   - s/device unaddressable/device private
> > Changed since v1:
> >   - Kconfig logic (depend on x86-64 and use ARCH_HAS pattern)
> > 
> > Signed-off-by: Jerome Glisse <jglisse@redhat.com>
> > Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
> > Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> > Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
> > Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
> > Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
> > ---
> 
> It would be nice to explain a bit of how hmm_pfn_t bits work with pfn
> and find out what we need from an arch to support HMM.
> 

This is only needed for the HMM_MIRROR feature, so you do not need to care about it
for powerpc.

Jerome


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM 12/15] mm/migrate: new memory migration helper for use with device memory v4
  2017-06-01 22:35     ` Jerome Glisse
@ 2017-06-07  9:02       ` Balbir Singh
  2017-06-07 14:06         ` Jerome Glisse
  0 siblings, 1 reply; 66+ messages in thread
From: Balbir Singh @ 2017-06-07  9:02 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, Dan Williams, Kirill A . Shutemov,
	John Hubbard, Evgeny Baskakov, Mark Hairgrove, Sherry Cheung,
	Subhash Gutti

On Fri, Jun 2, 2017 at 8:35 AM, Jerome Glisse <jglisse@redhat.com> wrote:
> On Wed, May 31, 2017 at 01:59:54PM +1000, Balbir Singh wrote:
>> On Wed, 24 May 2017 13:20:21 -0400
>> Jérôme Glisse <jglisse@redhat.com> wrote:
>>
>> > This patch add a new memory migration helpers, which migrate memory
>> > backing a range of virtual address of a process to different memory
>> > (which can be allocated through special allocator). It differs from
>> > numa migration by working on a range of virtual address and thus by
>> > doing migration in chunk that can be large enough to use DMA engine
>> > or special copy offloading engine.
>> >
>> > Expected users are any one with heterogeneous memory where different
>> > memory have different characteristics (latency, bandwidth, ...). As
>> > an example IBM platform with CAPI bus can make use of this feature
>> > to migrate between regular memory and CAPI device memory. New CPU
>> > architecture with a pool of high performance memory not manage as
>> > cache but presented as regular memory (while being faster and with
>> > lower latency than DDR) will also be prime user of this patch.
>> >
>> > Migration to private device memory will be useful for device that
>> > have large pool of such like GPU, NVidia plans to use HMM for that.
>> >
>>
>> It is helpful, for HMM-CDM however we would like to avoid the downsides
>> of MIGRATE_SYNC_NOCOPY
>
> What are the downside you are referring too ?

IIUC, MIGRATE_SYNC_NO_COPY is for anonymous memory only.

Balbir Singh.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM 12/15] mm/migrate: new memory migration helper for use with device memory v4
  2017-06-07  9:02       ` Balbir Singh
@ 2017-06-07 14:06         ` Jerome Glisse
  0 siblings, 0 replies; 66+ messages in thread
From: Jerome Glisse @ 2017-06-07 14:06 UTC (permalink / raw)
  To: Balbir Singh
  Cc: akpm, linux-kernel, linux-mm, Dan Williams, Kirill A . Shutemov,
	John Hubbard, Evgeny Baskakov, Mark Hairgrove, Sherry Cheung,
	Subhash Gutti

> On Fri, Jun 2, 2017 at 8:35 AM, Jerome Glisse <jglisse@redhat.com> wrote:
> > On Wed, May 31, 2017 at 01:59:54PM +1000, Balbir Singh wrote:
> >> On Wed, 24 May 2017 13:20:21 -0400
> >> Jérôme Glisse <jglisse@redhat.com> wrote:
> >>
> >> > This patch add a new memory migration helpers, which migrate memory
> >> > backing a range of virtual address of a process to different memory
> >> > (which can be allocated through special allocator). It differs from
> >> > numa migration by working on a range of virtual address and thus by
> >> > doing migration in chunk that can be large enough to use DMA engine
> >> > or special copy offloading engine.
> >> >
> >> > Expected users are any one with heterogeneous memory where different
> >> > memory have different characteristics (latency, bandwidth, ...). As
> >> > an example IBM platform with CAPI bus can make use of this feature
> >> > to migrate between regular memory and CAPI device memory. New CPU
> >> > architecture with a pool of high performance memory not manage as
> >> > cache but presented as regular memory (while being faster and with
> >> > lower latency than DDR) will also be prime user of this patch.
> >> >
> >> > Migration to private device memory will be useful for device that
> >> > have large pool of such like GPU, NVidia plans to use HMM for that.
> >> >
> >>
> >> It is helpful, for HMM-CDM however we would like to avoid the downsides
> >> of MIGRATE_SYNC_NOCOPY
> >
> > What are the downside you are referring too ?
> 
> IIUC, MIGRATE_SYNC_NO_COPY is for anonymous memory only.

It can migrate anything, file-backed pages too. It just forbids that latter
case if it is ZONE_DEVICE HMM. I should have time now to finish the CDM
patchset and I will post it; previous patches already enabled file-backed
page migration for HMM-CDM.

The NO_COPY part means no CPU copy; I couldn't think of a better name.
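
To make that concrete, here is a rough sketch (not from the patchset) of a
driver alloc_and_copy() callback: with MIGRATE_SYNC_NO_COPY the core
migrate_page() only transfers struct page metadata, so the driver copies the
data itself, for instance with its DMA engine. my_dev_alloc_page() and
my_dev_dma_copy() are made-up names:

static void my_alloc_and_copy(struct vm_area_struct *vma,
                              const unsigned long *src, unsigned long *dst,
                              unsigned long start, unsigned long end,
                              void *private)
{
        unsigned long i, npages = (end - start) >> PAGE_SHIFT;

        for (i = 0; i < npages; i++) {
                struct page *spage = migrate_pfn_to_page(src[i]);
                struct page *dpage;

                if (!spage || !(src[i] & MIGRATE_PFN_MIGRATE))
                        continue;

                dpage = my_dev_alloc_page(private);       /* hypothetical */
                if (!dpage)
                        continue;

                lock_page(dpage);
                my_dev_dma_copy(dpage, spage, PAGE_SIZE); /* hypothetical */

                /* migrate_pfn() sets MIGRATE_PFN_VALID for us */
                dst[i] = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
        }
}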

Cheers,
Jérôme


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM 07/15] mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory v3
  2017-05-24 17:20 ` [HMM 07/15] mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory v3 Jérôme Glisse
  2017-05-30 16:43   ` Ross Zwisler
  2017-05-31  1:23   ` Balbir Singh
@ 2017-06-09  3:55   ` John Hubbard
  2017-06-12 17:57     ` Jerome Glisse
  2017-06-15  3:41   ` zhong jiang
  3 siblings, 1 reply; 66+ messages in thread
From: John Hubbard @ 2017-06-09  3:55 UTC (permalink / raw)
  To: Jérôme Glisse, akpm, linux-kernel, linux-mm, Evgeny Baskakov
  Cc: Dan Williams, Kirill A . Shutemov, Ross Zwisler

On 05/24/2017 10:20 AM, Jérôme Glisse wrote:
[...8<...]
> +#if IS_ENABLED(CONFIG_DEVICE_PRIVATE)
> +int device_private_entry_fault(struct vm_area_struct *vma,
> +		       unsigned long addr,
> +		       swp_entry_t entry,
> +		       unsigned int flags,
> +		       pmd_t *pmdp)
> +{
> +	struct page *page = device_private_entry_to_page(entry);
> +
> +	/*
> +	 * The page_fault() callback must migrate page back to system memory
> +	 * so that CPU can access it. This might fail for various reasons
> +	 * (device issue, device was unsafely unplugged, ...). When such
> +	 * error conditions happen, the callback must return VM_FAULT_SIGBUS.
> +	 *
> +	 * Note that because memory cgroup charges are accounted to the device
> +	 * memory, this should never fail because of memory restrictions (but
> +	 * allocation of regular system page might still fail because we are
> +	 * out of memory).
> +	 *
> +	 * There is a more in-depth description of what that callback can and
> +	 * cannot do, in include/linux/memremap.h
> +	 */
> +	return page->pgmap->page_fault(vma, addr, page, flags, pmdp);
> +}
> +EXPORT_SYMBOL(device_private_entry_fault);
> +#endif /* CONFIG_DEVICE_PRIVATE */
> +
>   static void pgmap_radix_release(struct resource *res)
>   {
>   	resource_size_t key, align_start, align_size, align_end;
> @@ -321,6 +351,10 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
>   	}
>   	pgmap->ref = ref;
>   	pgmap->res = &page_map->res;
> +	pgmap->type = MEMORY_DEVICE_PUBLIC;
> +	pgmap->page_fault = NULL;
> +	pgmap->page_free = NULL;
> +	pgmap->data = NULL;
>   
>   	mutex_lock(&pgmap_lock);
>   	error = 0;
> diff --git a/mm/Kconfig b/mm/Kconfig
> index d744cff..f5357ff 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -736,6 +736,19 @@ config ZONE_DEVICE
>   
>   	  If FS_DAX is enabled, then say Y.
>   
> +config DEVICE_PRIVATE
> +	bool "Unaddressable device memory (GPU memory, ...)"
> +	depends on X86_64
> +	depends on ZONE_DEVICE
> +	depends on MEMORY_HOTPLUG
> +	depends on MEMORY_HOTREMOVE
> +	depends on SPARSEMEM_VMEMMAP
> +
> +	help
> +	  Allows creation of struct pages to represent unaddressable device
> +	  memory; i.e., memory that is only accessible from the device (or
> +	  group of devices).
> +

Hi Jerome,

CONFIG_DEVICE_PRIVATE has caused me some problems, because it's not coupled to HMM_DEVMEM.

To fix this, my first choice would be to just s/DEVICE_PRIVATE/HMM_DEVMEM/g, because I don't see
any value in DEVICE_PRIVATE as an independent Kconfig choice. It's complicating the Kconfig choices
and adding problems. However, if DEVICE_PRIVATE must be kept, then something like this also fixes my
HMM tests:

From: John Hubbard <jhubbard@nvidia.com>
Date: Thu, 8 Jun 2017 20:13:13 -0700
Subject: [PATCH] hmm: select CONFIG_DEVICE_PRIVATE with HMM_DEVMEM

The HMM_DEVMEM feature is useless without the various
features that are guarded with CONFIG_DEVICE_PRIVATE.
Therefore, auto-select DEVICE_PRIVATE when selecting
HMM_DEVMEM.

Otherwise, you can easily end up with a partially
working HMM installation: if you select HMM_DEVMEM,
but do not select DEVICE_PRIVATE, then faulting and
migrating to a device (such as a GPU) works, but CPU
page faults are ignored, so the page never migrates
back to the CPU.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
  mm/Kconfig | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index 46296d5d7570..23d2f5ec865e 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -318,6 +318,8 @@ config HMM_DEVMEM
  	bool "HMM device memory helpers (to leverage ZONE_DEVICE)"
  	depends on ARCH_HAS_HMM
  	select HMM
+	select DEVICE_PRIVATE
+
  	help
  	  HMM devmem is a set of helper routines to leverage the ZONE_DEVICE
  	  feature. This is just to avoid having device drivers to replicating a lot
-- 
2.13.1

This is a minor thing, and I don't think this needs to hold up merging HMM v23 into -mm, IMHO. But I 
would like it fixed at some point.

thanks,
--
John Hubbard
NVIDIA


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [HMM 07/15] mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory v3
  2017-06-09  3:55   ` John Hubbard
@ 2017-06-12 17:57     ` Jerome Glisse
  0 siblings, 0 replies; 66+ messages in thread
From: Jerome Glisse @ 2017-06-12 17:57 UTC (permalink / raw)
  To: John Hubbard
  Cc: akpm, linux-kernel, linux-mm, Evgeny Baskakov, Dan Williams,
	Kirill A . Shutemov, Ross Zwisler

On Thu, Jun 08, 2017 at 08:55:05PM -0700, John Hubbard wrote:
> On 05/24/2017 10:20 AM, Jerome Glisse wrote:
> [...8<...]
> > +#if IS_ENABLED(CONFIG_DEVICE_PRIVATE)
> > +int device_private_entry_fault(struct vm_area_struct *vma,
> > +		       unsigned long addr,
> > +		       swp_entry_t entry,
> > +		       unsigned int flags,
> > +		       pmd_t *pmdp)
> > +{
> > +	struct page *page = device_private_entry_to_page(entry);
> > +
> > +	/*
> > +	 * The page_fault() callback must migrate page back to system memory
> > +	 * so that CPU can access it. This might fail for various reasons
> > +	 * (device issue, device was unsafely unplugged, ...). When such
> > +	 * error conditions happen, the callback must return VM_FAULT_SIGBUS.
> > +	 *
> > +	 * Note that because memory cgroup charges are accounted to the device
> > +	 * memory, this should never fail because of memory restrictions (but
> > +	 * allocation of regular system page might still fail because we are
> > +	 * out of memory).
> > +	 *
> > +	 * There is a more in-depth description of what that callback can and
> > +	 * cannot do, in include/linux/memremap.h
> > +	 */
> > +	return page->pgmap->page_fault(vma, addr, page, flags, pmdp);
> > +}
> > +EXPORT_SYMBOL(device_private_entry_fault);
> > +#endif /* CONFIG_DEVICE_PRIVATE */
> > +
> >   static void pgmap_radix_release(struct resource *res)
> >   {
> >   	resource_size_t key, align_start, align_size, align_end;
> > @@ -321,6 +351,10 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
> >   	}
> >   	pgmap->ref = ref;
> >   	pgmap->res = &page_map->res;
> > +	pgmap->type = MEMORY_DEVICE_PUBLIC;
> > +	pgmap->page_fault = NULL;
> > +	pgmap->page_free = NULL;
> > +	pgmap->data = NULL;
> >   	mutex_lock(&pgmap_lock);
> >   	error = 0;
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index d744cff..f5357ff 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -736,6 +736,19 @@ config ZONE_DEVICE
> >   	  If FS_DAX is enabled, then say Y.
> > +config DEVICE_PRIVATE
> > +	bool "Unaddressable device memory (GPU memory, ...)"
> > +	depends on X86_64
> > +	depends on ZONE_DEVICE
> > +	depends on MEMORY_HOTPLUG
> > +	depends on MEMORY_HOTREMOVE
> > +	depends on SPARSEMEM_VMEMMAP
> > +
> > +	help
> > +	  Allows creation of struct pages to represent unaddressable device
> > +	  memory; i.e., memory that is only accessible from the device (or
> > +	  group of devices).
> > +
> 
> Hi Jerome,
> 
> CONFIG_DEVICE_PRIVATE has caused me some problems, because it's not coupled to HMM_DEVMEM.
> 
> To fix this, my first choice would be to just s/DEVICE_PRIVATE/HMM_DEVMEM/g
> , because I don't see any value to DEVICE_PRIVATE as an independent Kconfig
> choice. It's complicating the Kconfig choices, and adding problems. However,
> if DEVICE_PRIVATE must be kept, then something like this also fixes my HMM
> tests:


Better is "depends on", so that you can not select HMM_DEVMEM if you do not have
DEVICE_PRIVATE. But maybe this can be merged under one config option; I do not
have any strong preference personally. HMM_DEVMEM just enables helper code
that makes using CONFIG_DEVICE_PRIVATE easier for device drivers but is not
strictly needed, i.e. a device driver can reimplement what HMM_DEVMEM provides.

I might just merge this kernel option as part of the CDM patchset that I am about
to send.

Cheers,
Jerome

> 
> From: John Hubbard <jhubbard@nvidia.com>
> Date: Thu, 8 Jun 2017 20:13:13 -0700
> Subject: [PATCH] hmm: select CONFIG_DEVICE_PRIVATE with HMM_DEVMEM
> 
> The HMM_DEVMEM feature is useless without the various
> features that are guarded with CONFIG_DEVICE_PRIVATE.
> Therefore, auto-select DEVICE_PRIVATE when selecting
> HMM_DEVMEM.
> 
> Otherwise, you can easily end up with a partially
> working HMM installation: if you select HMM_DEVMEM,
> but do not select DEVICE_PRIVATE, then faulting and
> migrating to a device (such as a GPU) works, but CPU
> page faults are ignored, so the page never migrates
> back to the CPU.
> 
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
>  mm/Kconfig | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 46296d5d7570..23d2f5ec865e 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -318,6 +318,8 @@ config HMM_DEVMEM
>  	bool "HMM device memory helpers (to leverage ZONE_DEVICE)"
>  	depends on ARCH_HAS_HMM
>  	select HMM
> +	select DEVICE_PRIVATE
> +
>  	help
>  	  HMM devmem is a set of helper routines to leverage the ZONE_DEVICE
>  	  feature. This is just to avoid having device drivers to replicating a lot
> -- 
> 2.13.1
> 
> This is a minor thing, and I don't think this needs to hold up merging HMM
> v23 into -mm, IMHO. But I would like it fixed at some point.
> 
> thanks,
> --
> John Hubbard
> NVIDIA
> 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM 07/15] mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory v3
  2017-05-24 17:20 ` [HMM 07/15] mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory v3 Jérôme Glisse
                     ` (2 preceding siblings ...)
  2017-06-09  3:55   ` John Hubbard
@ 2017-06-15  3:41   ` zhong jiang
  2017-06-15 17:43     ` Jerome Glisse
  3 siblings, 1 reply; 66+ messages in thread
From: zhong jiang @ 2017-06-15  3:41 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: akpm, linux-kernel, linux-mm, Dan Williams, Kirill A . Shutemov,
	John Hubbard, Ross Zwisler

On 2017/5/25 1:20, Jérôme Glisse wrote:
> HMM (heterogeneous memory management) needs struct page to support migration
> from system main memory to device memory.  The reasons for HMM and for
> migration to device memory are explained in the HMM core patch.
>
> This patch deals with device memory that is un-addressable (i.e. the CPU
> can not access it). Hence we do not want those struct pages to be managed
> like regular memory. That is why we extend ZONE_DEVICE to support different
> types of memory.
>
> A persistent memory type is defined for existing users of ZONE_DEVICE, and a
> new device un-addressable type is added for the un-addressable memory type.
> There is a clear separation between what is expected from each memory type,
> and existing users of ZONE_DEVICE are un-affected by the new requirements and
> the new use of the un-addressable type. All specific code paths are protected
> by tests against the memory type.
>
> Because the memory is un-addressable we use a new special swap type for when
> a page is migrated to device memory (this reduces the maximum number of
> swap files).
>
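For illustration, a minimal sketch of how such a special entry would be built and
installed in place of a present PTE, using the helpers this patch adds below (the
'dpage', 'ptep' and 'addr' locals are assumed to come from the migration path):

	/*
	 * Encode the device page as a device-private swap entry and write it
	 * into the CPU page table; the next CPU access will fault and go
	 * through device_private_entry_fault().
	 */
	swp_entry_t entry = make_device_private_entry(dpage, vma->vm_flags & VM_WRITE);
	pte_t swp_pte = swp_entry_to_pte(entry);

	set_pte_at(vma->vm_mm, addr, ptep, swp_pte);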
> Besides the memory type, the two main additions to ZONE_DEVICE are two
> callbacks. The first one, page_free(), is called whenever the page refcount
> reaches 1 (which means the page is free, as ZONE_DEVICE pages never reach a
> refcount of 0). This allows the device driver to manage its memory and the
> associated struct pages.
>
> The second callback, page_fault(), happens when there is a CPU access to an
> address that is backed by a device page (which is un-addressable by the
> CPU). This callback is responsible for migrating the page back to system
> main memory. The device driver can not block migration back to system
> memory; HMM makes sure that such a page can not be pinned into device
> memory.
>
> If the device is in some error condition and can not migrate memory back,
> then a CPU page fault to device memory should end with SIGBUS.
>
> Changed since v2:
>   - s/DEVICE_UNADDRESSABLE/DEVICE_PRIVATE
> Changed since v1:
>   - rename to device private memory (from device unaddressable)
>
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Acked-by: Dan Williams <dan.j.williams@intel.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  fs/proc/task_mmu.c       |  7 +++++
>  include/linux/ioport.h   |  1 +
>  include/linux/memremap.h | 72 ++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/mm.h       | 12 ++++++++
>  include/linux/swap.h     | 24 ++++++++++++++--
>  include/linux/swapops.h  | 68 +++++++++++++++++++++++++++++++++++++++++++++
>  kernel/memremap.c        | 34 +++++++++++++++++++++++
>  mm/Kconfig               | 13 +++++++++
>  mm/memory.c              | 61 ++++++++++++++++++++++++++++++++++++++++
>  mm/memory_hotplug.c      | 10 +++++--
>  mm/mprotect.c            | 14 ++++++++++
>  11 files changed, 311 insertions(+), 5 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index f0c8b33..90b2fa4 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -542,6 +542,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
>  			}
>  		} else if (is_migration_entry(swpent))
>  			page = migration_entry_to_page(swpent);
> +		else if (is_device_private_entry(swpent))
> +			page = device_private_entry_to_page(swpent);
>  	} else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap
>  							&& pte_none(*pte))) {
>  		page = find_get_entry(vma->vm_file->f_mapping,
> @@ -704,6 +706,8 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
>  
>  		if (is_migration_entry(swpent))
>  			page = migration_entry_to_page(swpent);
> +		else if (is_device_private_entry(swpent))
> +			page = device_private_entry_to_page(swpent);
>  	}
>  	if (page) {
>  		int mapcount = page_mapcount(page);
> @@ -1196,6 +1200,9 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
>  		flags |= PM_SWAP;
>  		if (is_migration_entry(entry))
>  			page = migration_entry_to_page(entry);
> +
> +		if (is_device_private_entry(entry))
> +			page = device_private_entry_to_page(entry);
>  	}
>  
>  	if (page && !PageAnon(page))
> diff --git a/include/linux/ioport.h b/include/linux/ioport.h
> index 6230064..3a4f691 100644
> --- a/include/linux/ioport.h
> +++ b/include/linux/ioport.h
> @@ -130,6 +130,7 @@ enum {
>  	IORES_DESC_ACPI_NV_STORAGE		= 3,
>  	IORES_DESC_PERSISTENT_MEMORY		= 4,
>  	IORES_DESC_PERSISTENT_MEMORY_LEGACY	= 5,
> +	IORES_DESC_DEVICE_PRIVATE_MEMORY	= 6,
>  };
>  
>  /* helpers to define resources */
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 9341619..0fcf840 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -4,6 +4,8 @@
>  #include <linux/ioport.h>
>  #include <linux/percpu-refcount.h>
>  
> +#include <asm/pgtable.h>
> +
>  struct resource;
>  struct device;
>  
> @@ -35,18 +37,88 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
>  }
>  #endif
>  
> +/*
> + * Specialize ZONE_DEVICE memory into multiple types each having different
> + * usage.
> + *
> + * MEMORY_DEVICE_PUBLIC:
> + * Persistent device memory (pmem): struct page might be allocated in different
> + * memory and architecture might want to perform special actions. It is similar
> + * to regular memory, in that the CPU can access it transparently. However,
> + * it is likely to have different bandwidth and latency than regular memory.
> + * See Documentation/nvdimm/nvdimm.txt for more information.
> + *
> + * MEMORY_DEVICE_PRIVATE:
> + * Device memory that is not directly addressable by the CPU: CPU can neither
> + * read nor write _UNADDRESSABLE memory. In this case, we do still have struct
> + * pages backing the device memory. Doing so simplifies the implementation, but
> + * it is important to remember that there are certain points at which the struct
> + * page must be treated as an opaque object, rather than a "normal" struct page.
> + * A more complete discussion of unaddressable memory may be found in
> + * include/linux/hmm.h and Documentation/vm/hmm.txt.
> + */
> +enum memory_type {
> +	MEMORY_DEVICE_PUBLIC = 0,
> +	MEMORY_DEVICE_PRIVATE,
> +};
> +
> +/*
> + * For MEMORY_DEVICE_PRIVATE we use ZONE_DEVICE and extend it with two
> + * callbacks:
> + *   page_fault()
> + *   page_free()
> + *
> + * Additional notes about MEMORY_DEVICE_PRIVATE may be found in
> + * include/linux/hmm.h and Documentation/vm/hmm.txt. There is also a brief
> + * explanation in include/linux/memory_hotplug.h.
> + *
> + * The page_fault() callback must migrate page back, from device memory to
> + * system memory, so that the CPU can access it. This might fail for various
> + * reasons (device issues,  device have been unplugged, ...). When such error
> + * conditions happen, the page_fault() callback must return VM_FAULT_SIGBUS and
> + * set the CPU page table entry to "poisoned".
> + *
> + * Note that because memory cgroup charges are transferred to the device memory,
> + * this should never fail due to memory restrictions. However, allocation
> + * of a regular system page might still fail because we are out of memory. If
> + * that happens, the page_fault() callback must return VM_FAULT_OOM.
> + *
> + * The page_fault() callback can also try to migrate back multiple pages in one
> + * chunk, as an optimization. It must, however, prioritize the faulting address
> + * over all the others.
> + *
> + *
> + * The page_free() callback is called once the page refcount reaches 1
> + * (ZONE_DEVICE pages never reach 0 refcount unless there is a refcount bug.
> + * This allows the device driver to implement its own memory management.)
> + */
> +typedef int (*dev_page_fault_t)(struct vm_area_struct *vma,
> +				unsigned long addr,
> +				struct page *page,
> +				unsigned int flags,
> +				pmd_t *pmdp);
> +typedef void (*dev_page_free_t)(struct page *page, void *data);
> +
>  /**
>   * struct dev_pagemap - metadata for ZONE_DEVICE mappings
> + * @page_fault: callback when CPU fault on an unaddressable device page
> + * @page_free: free page callback when page refcount reaches 1
>   * @altmap: pre-allocated/reserved memory for vmemmap allocations
>   * @res: physical address range covered by @ref
>   * @ref: reference count that pins the devm_memremap_pages() mapping
>   * @dev: host device of the mapping for debug
> + * @data: private data pointer for page_free()
> + * @type: memory type: see MEMORY_* in memory_hotplug.h
>   */
>  struct dev_pagemap {
> +	dev_page_fault_t page_fault;
> +	dev_page_free_t page_free;
>  	struct vmem_altmap *altmap;
>  	const struct resource *res;
>  	struct percpu_ref *ref;
>  	struct device *dev;
> +	void *data;
> +	enum memory_type type;
>  };
>  
>  #ifdef CONFIG_ZONE_DEVICE
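For illustration, a rough sketch of what a driver-side page_fault() callback
matching dev_page_fault_t above could look like; the foo_* names are hypothetical
and the actual copy back to system memory is entirely driver specific:

	static int foo_devmem_page_fault(struct vm_area_struct *vma,
					 unsigned long addr,
					 struct page *page,
					 unsigned int flags,
					 pmd_t *pmdp)
	{
		/*
		 * Migrate the device page back to a regular system page so
		 * the CPU can access it; per the contract above, failure must
		 * be reported as VM_FAULT_SIGBUS.
		 */
		if (foo_migrate_back_to_ram(vma, addr, page))
			return VM_FAULT_SIGBUS;

		return 0;
	}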
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 7cb17c6..a825dab 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -788,11 +788,23 @@ static inline bool is_zone_device_page(const struct page *page)
>  {
>  	return page_zonenum(page) == ZONE_DEVICE;
>  }
> +
> +static inline bool is_device_private_page(const struct page *page)
> +{
> +	/* See MEMORY_DEVICE_PRIVATE in include/linux/memory_hotplug.h */
> +	return ((page_zonenum(page) == ZONE_DEVICE) &&
> +		(page->pgmap->type == MEMORY_DEVICE_PRIVATE));
> +}
>  #else
>  static inline bool is_zone_device_page(const struct page *page)
>  {
>  	return false;
>  }
> +
> +static inline bool is_device_private_page(const struct page *page)
> +{
> +	return false;
> +}
>  #endif
>  
>  static inline void get_page(struct page *page)
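For illustration, a minimal sketch of how a core-mm path can use this test to
stay away from unaddressable pages (foo_handle_device_page() is a hypothetical
stand-in for whatever pgmap/driver handling applies):

	if (is_device_private_page(page)) {
		/* Contents live in memory the CPU cannot address; do not
		 * touch the data directly, defer to the page's owner. */
		return foo_handle_device_page(page);
	}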
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 5ab1c98..ab6c20b 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -51,6 +51,23 @@ static inline int current_is_kswapd(void)
>   */
>  
>  /*
> + * Unaddressable device memory support. See include/linux/hmm.h and
> + * Documentation/vm/hmm.txt. Short description is we need struct pages for
> + * device memory that is unaddressable (inaccessible) by CPU, so that we can
> + * migrate part of a process memory to device memory.
> + *
> + * When a page is migrated from CPU to device, we set the CPU page table entry
> + * to a special SWP_DEVICE_* entry.
> + */
> +#ifdef CONFIG_DEVICE_PRIVATE
> +#define SWP_DEVICE_NUM 2
> +#define SWP_DEVICE_WRITE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM)
> +#define SWP_DEVICE_READ (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+1)
> +#else
> +#define SWP_DEVICE_NUM 0
> +#endif
> +
> +/*
>   * NUMA node memory migration support
>   */
>  #ifdef CONFIG_MIGRATION
> @@ -72,7 +89,8 @@ static inline int current_is_kswapd(void)
>  #endif
>  
>  #define MAX_SWAPFILES \
> -	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
> +	((1 << MAX_SWAPFILES_SHIFT) - SWP_DEVICE_NUM - \
> +	SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
>  
>  /*
>   * Magic header for a swap area. The first part of the union is
> @@ -432,8 +450,8 @@ static inline void show_swap_cache_info(void)
>  {
>  }
>  
> -#define free_swap_and_cache(swp)	is_migration_entry(swp)
> -#define swapcache_prepare(swp)		is_migration_entry(swp)
> +#define free_swap_and_cache(e) (is_migration_entry(e) || is_device_private_entry(e))
> +#define swapcache_prepare(e) (is_migration_entry(e) || is_device_private_entry(e))
>  
>  static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
>  {
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 5c3a5f3..361090c 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -100,6 +100,74 @@ static inline void *swp_to_radix_entry(swp_entry_t entry)
>  	return (void *)(value | RADIX_TREE_EXCEPTIONAL_ENTRY);
>  }
>  
> +#if IS_ENABLED(CONFIG_DEVICE_PRIVATE)
> +static inline swp_entry_t make_device_private_entry(struct page *page, bool write)
> +{
> +	return swp_entry(write ? SWP_DEVICE_WRITE : SWP_DEVICE_READ,
> +			 page_to_pfn(page));
> +}
> +
> +static inline bool is_device_private_entry(swp_entry_t entry)
> +{
> +	int type = swp_type(entry);
> +	return type == SWP_DEVICE_READ || type == SWP_DEVICE_WRITE;
> +}
> +
> +static inline void make_device_private_entry_read(swp_entry_t *entry)
> +{
> +	*entry = swp_entry(SWP_DEVICE_READ, swp_offset(*entry));
> +}
> +
> +static inline bool is_write_device_private_entry(swp_entry_t entry)
> +{
> +	return unlikely(swp_type(entry) == SWP_DEVICE_WRITE);
> +}
> +
> +static inline struct page *device_private_entry_to_page(swp_entry_t entry)
> +{
> +	return pfn_to_page(swp_offset(entry));
> +}
> +
> +int device_private_entry_fault(struct vm_area_struct *vma,
> +		       unsigned long addr,
> +		       swp_entry_t entry,
> +		       unsigned int flags,
> +		       pmd_t *pmdp);
> +#else /* CONFIG_DEVICE_PRIVATE */
> +static inline swp_entry_t make_device_private_entry(struct page *page, bool write)
> +{
> +	return swp_entry(0, 0);
> +}
> +
> +static inline void make_device_private_entry_read(swp_entry_t *entry)
> +{
> +}
> +
> +static inline bool is_device_private_entry(swp_entry_t entry)
> +{
> +	return false;
> +}
> +
> +static inline bool is_write_device_private_entry(swp_entry_t entry)
> +{
> +	return false;
> +}
> +
> +static inline struct page *device_private_entry_to_page(swp_entry_t entry)
> +{
> +	return NULL;
> +}
> +
> +static inline int device_private_entry_fault(struct vm_area_struct *vma,
> +				     unsigned long addr,
> +				     swp_entry_t entry,
> +				     unsigned int flags,
> +				     pmd_t *pmdp)
> +{
> +	return VM_FAULT_SIGBUS;
> +}
> +#endif /* CONFIG_DEVICE_PRIVATE */
> +
>  #ifdef CONFIG_MIGRATION
>  static inline swp_entry_t make_migration_entry(struct page *page, int write)
>  {
> diff --git a/kernel/memremap.c b/kernel/memremap.c
> index 124bed7..cd596d4 100644
> --- a/kernel/memremap.c
> +++ b/kernel/memremap.c
> @@ -18,6 +18,8 @@
>  #include <linux/io.h>
>  #include <linux/mm.h>
>  #include <linux/memory_hotplug.h>
> +#include <linux/swap.h>
> +#include <linux/swapops.h>
>  
>  #ifndef ioremap_cache
>  /* temporary while we convert existing ioremap_cache users to memremap */
> @@ -182,6 +184,34 @@ struct page_map {
>  	struct vmem_altmap altmap;
>  };
>  
> +#if IS_ENABLED(CONFIG_DEVICE_PRIVATE)
> +int device_private_entry_fault(struct vm_area_struct *vma,
> +		       unsigned long addr,
> +		       swp_entry_t entry,
> +		       unsigned int flags,
> +		       pmd_t *pmdp)
> +{
> +	struct page *page = device_private_entry_to_page(entry);
> +
> +	/*
> +	 * The page_fault() callback must migrate page back to system memory
> +	 * so that CPU can access it. This might fail for various reasons
> +	 * (device issue, device was unsafely unplugged, ...). When such
> +	 * error conditions happen, the callback must return VM_FAULT_SIGBUS.
> +	 *
> +	 * Note that because memory cgroup charges are accounted to the device
> +	 * memory, this should never fail because of memory restrictions (but
> +	 * allocation of regular system page might still fail because we are
> +	 * out of memory).
> +	 *
> +	 * There is a more in-depth description of what that callback can and
> +	 * cannot do, in include/linux/memremap.h
> +	 */
> +	return page->pgmap->page_fault(vma, addr, page, flags, pmdp);
> +}
> +EXPORT_SYMBOL(device_private_entry_fault);
> +#endif /* CONFIG_DEVICE_PRIVATE */
> +
>  static void pgmap_radix_release(struct resource *res)
>  {
>  	resource_size_t key, align_start, align_size, align_end;
> @@ -321,6 +351,10 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
>  	}
>  	pgmap->ref = ref;
>  	pgmap->res = &page_map->res;
> +	pgmap->type = MEMORY_DEVICE_PUBLIC;
> +	pgmap->page_fault = NULL;
> +	pgmap->page_free = NULL;
> +	pgmap->data = NULL;
>  
>  	mutex_lock(&pgmap_lock);
>  	error = 0;
> diff --git a/mm/Kconfig b/mm/Kconfig
> index d744cff..f5357ff 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -736,6 +736,19 @@ config ZONE_DEVICE
>  
>  	  If FS_DAX is enabled, then say Y.
>  
> +config DEVICE_PRIVATE
> +	bool "Unaddressable device memory (GPU memory, ...)"
> +	depends on X86_64
> +	depends on ZONE_DEVICE
> +	depends on MEMORY_HOTPLUG
> +	depends on MEMORY_HOTREMOVE
> +	depends on SPARSEMEM_VMEMMAP
> +
 Maybe just "depends on ARCH_HAS_HMM" is enough.
> +	help
> +	  Allows creation of struct pages to represent unaddressable device
> +	  memory; i.e., memory that is only accessible from the device (or
> +	  group of devices).
> +
>  config FRAME_VECTOR
>  	bool
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index d320b4e..eba61dd 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -49,6 +49,7 @@
>  #include <linux/swap.h>
>  #include <linux/highmem.h>
>  #include <linux/pagemap.h>
> +#include <linux/memremap.h>
>  #include <linux/ksm.h>
>  #include <linux/rmap.h>
>  #include <linux/export.h>
> @@ -927,6 +928,35 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  					pte = pte_swp_mksoft_dirty(pte);
>  				set_pte_at(src_mm, addr, src_pte, pte);
>  			}
> +		} else if (is_device_private_entry(entry)) {
> +			page = device_private_entry_to_page(entry);
> +
> +			/*
> +			 * Update rss count even for unaddressable pages, as
> +			 * they should be treated just like normal pages in this
> +			 * respect.
> +			 *
> +			 * We will likely want to have some new rss counters
> +			 * for unaddressable pages, at some point. But for now
> +			 * keep things as they are.
> +			 */
> +			get_page(page);
> +			rss[mm_counter(page)]++;
> +			page_dup_rmap(page, false);
> +
> +			/*
> +			 * We do not preserve soft-dirty information, because so
> +			 * far, checkpoint/restore is the only feature that
> +			 * requires that. And checkpoint/restore does not work
> +			 * when a device driver is involved (you cannot easily
> +			 * save and restore device driver state).
> +			 */
> +			if (is_write_device_private_entry(entry) &&
> +			    is_cow_mapping(vm_flags)) {
> +				make_device_private_entry_read(&entry);
> +				pte = swp_entry_to_pte(entry);
> +				set_pte_at(src_mm, addr, src_pte, pte);
> +			}
>  		}
>  		goto out_set_pte;
>  	}
> @@ -1243,6 +1273,29 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  			}
>  			continue;
>  		}
> +
> +		entry = pte_to_swp_entry(ptent);
> +		if (non_swap_entry(entry) && is_device_private_entry(entry)) {
> +			struct page *page = device_private_entry_to_page(entry);
> +
> +			if (unlikely(details && details->check_mapping)) {
> +				/*
> +				 * unmap_shared_mapping_pages() wants to
> +				 * invalidate cache without truncating:
> +				 * unmap shared but keep private pages.
> +				 */
> +				if (details->check_mapping !=
> +				    page_rmapping(page))
> +					continue;
> +			}
> +
> +			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
> +			rss[mm_counter(page)]--;
> +			page_remove_rmap(page, false);
> +			put_page(page);
> +			continue;
> +		}
> +
>  		/* If details->check_mapping, we leave swap entries. */
>  		if (unlikely(details))
>  			continue;
> @@ -2690,6 +2743,14 @@ int do_swap_page(struct vm_fault *vmf)
>  		if (is_migration_entry(entry)) {
>  			migration_entry_wait(vma->vm_mm, vmf->pmd,
>  					     vmf->address);
> +		} else if (is_device_private_entry(entry)) {
> +			/*
> +			 * For un-addressable device memory we call the pgmap
> +			 * fault handler callback. The callback must migrate
> +			 * the page back to some CPU accessible page.
> +			 */
> +			ret = device_private_entry_fault(vma, vmf->address, entry,
> +						 vmf->flags, vmf->pmd);
>  		} else if (is_hwpoison_entry(entry)) {
>  			ret = VM_FAULT_HWPOISON;
>  		} else {
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 599c675..0a9f690 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -156,7 +156,7 @@ void mem_hotplug_done(void)
>  /* add this memory to iomem resource */
>  static struct resource *register_memory_resource(u64 start, u64 size)
>  {
> -	struct resource *res;
> +	struct resource *res, *conflict;
>  	res = kzalloc(sizeof(struct resource), GFP_KERNEL);
>  	if (!res)
>  		return ERR_PTR(-ENOMEM);
> @@ -165,7 +165,13 @@ static struct resource *register_memory_resource(u64 start, u64 size)
>  	res->start = start;
>  	res->end = start + size - 1;
>  	res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
> -	if (request_resource(&iomem_resource, res) < 0) {
> +	conflict =  request_resource_conflict(&iomem_resource, res);
> +	if (conflict) {
> +		if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
> +			pr_debug("Device unaddressable memory block "
> +				 "memory hotplug at %#010llx !\n",
> +				 (unsigned long long)start);
> +		}
>  		pr_debug("System RAM resource %pR cannot be added\n", res);
>  		kfree(res);
>  		return ERR_PTR(-EEXIST);
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 1a8c9ca..868d0ed 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -124,6 +124,20 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  
>  				pages++;
>  			}
> +
> +			if (is_write_device_private_entry(entry)) {
> +				pte_t newpte;
> +
> +				/*
> +				 * We do not preserve soft-dirtiness. See
> +				 * copy_one_pte() for explanation.
> +				 */
> +				make_device_private_entry_read(&entry);
> +				newpte = swp_entry_to_pte(entry);
> +				set_pte_at(mm, addr, pte, newpte);
> +
> +				pages++;
> +			}
>  		}
>  	} while (pte++, addr += PAGE_SIZE, addr != end);
>  	arch_leave_lazy_mmu_mode();


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM 07/15] mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory v3
  2017-06-15  3:41   ` zhong jiang
@ 2017-06-15 17:43     ` Jerome Glisse
  0 siblings, 0 replies; 66+ messages in thread
From: Jerome Glisse @ 2017-06-15 17:43 UTC (permalink / raw)
  To: zhong jiang
  Cc: akpm, linux-kernel, linux-mm, Dan Williams, Kirill A . Shutemov,
	John Hubbard, Ross Zwisler

> On 2017/5/25 1:20, Jérôme Glisse wrote:

[...]

> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index d744cff..f5357ff 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -736,6 +736,19 @@ config ZONE_DEVICE
> >  
> >  	  If FS_DAX is enabled, then say Y.
> >  
> > +config DEVICE_PRIVATE
> > +	bool "Unaddressable device memory (GPU memory, ...)"
> > +	depends on X86_64
> > +	depends on ZONE_DEVICE
> > +	depends on MEMORY_HOTPLUG
> > +	depends on MEMORY_HOTREMOVE
> > +	depends on SPARSEMEM_VMEMMAP
> > +
 Maybe just "depends on ARCH_HAS_HMM" is enough.

I have updated that as part of the HMM CDM patchset.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: [HMM 00/15] HMM (Heterogeneous Memory Management) v23
  2017-05-24 17:20 [HMM 00/15] HMM (Heterogeneous Memory Management) v23 Jérôme Glisse
                   ` (14 preceding siblings ...)
  2017-05-24 17:20 ` [HMM 15/15] mm/migrate: allow migrate_vma() to alloc new page on empty entry v2 Jérôme Glisse
@ 2017-06-16  7:22 ` Bridgman, John
  2017-06-16 14:47   ` Jerome Glisse
  2017-06-23 15:00 ` Bob Liu
  16 siblings, 1 reply; 66+ messages in thread
From: Bridgman, John @ 2017-06-16  7:22 UTC (permalink / raw)
  To: Jérôme Glisse, akpm, linux-kernel, linux-mm
  Cc: Dan Williams, Kirill A . Shutemov, John Hubbard, Sander, Ben,
	Kuehling, Felix

Hi Jerome, 

I'm just getting back to this; sorry for the late responses. 

Your description of HMM talks about blocking CPU accesses when a page has been migrated to device memory, and you treat that as a "given" in the HMM design. Other than BAR limits, coherency between CPU and device caches, and performance on read-intensive CPU accesses to device memory, are there any other reasons for this?

The reason I'm asking is that we make fairly heavy use of large BAR support which allows the CPU to directly access all of the device memory on each of the GPUs, albeit without cache coherency, and there are some cases where it appears that allowing CPU access to the page in device memory would be more efficient than constantly migrating back and forth.

Migrating the page back and forth between device and system memory appears at first glance to provide three benefits (albeit at a cost):

1. BAR limit - this is kind of a no-brainer, in the sense that if the CPU can not access the VRAM then you have to migrate it

2. coherency - having the CPU fault when the page is in device memory or vice versa gives you an event which can be used to allow cache flushing on one device before handing ownership (from a cache perspective) to the other device - but at first glance you don't actually have to move the page to get that benefit

3. performance - CPU writes to device memory can be pretty fast since the transfers can be "fire and forget" but reads are always going to be slow because of the round-trip nature... but the tradeoff between access performance and migration overhead is more of a heuristic thing than a black-and-white thing

Do you see any HMM-related problems in principle with optionally leaving a page in device memory while the CPU is accessing it, assuming that only one CPU/device "owns" the page from a cache POV at any given time?

Thanks,
John

(btw, apologies for what looks like top-posting - I tried inserting the questions in a few different places in your patches but each time it ended up messy)

>-----Original Message-----
>From: owner-linux-mm@kvack.org [mailto:owner-linux-mm@kvack.org] On
>Behalf Of Jérôme Glisse
>Sent: Wednesday, May 24, 2017 1:20 PM
>To: akpm@linux-foundation.org; linux-kernel@vger.kernel.org; linux-
>mm@kvack.org
>Cc: Dan Williams; Kirill A . Shutemov; John Hubbard; Jérôme Glisse
>Subject: [HMM 00/15] HMM (Heterogeneous Memory Management) v23
>
>Patchset is on top of git://git.cmpxchg.org/linux-mmotm.git so i test same
>kernel as kbuild system, git branch:
>
>https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v23
>
>Change since v22 is use of static key for special ZONE_DEVICE case in
>put_page() and build fix for architecture with no mmu.
>
>Everything else is the same. Below is the long description of what HMM is
>about and why. At the end of this email i describe briefly each patch and
>suggest reviewers for each of them.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM 00/15] HMM (Heterogeneous Memory Management) v23
  2017-06-16  7:22 ` [HMM 00/15] HMM (Heterogeneous Memory Management) v23 Bridgman, John
@ 2017-06-16 14:47   ` Jerome Glisse
  2017-06-16 17:55     ` Bridgman, John
  0 siblings, 1 reply; 66+ messages in thread
From: Jerome Glisse @ 2017-06-16 14:47 UTC (permalink / raw)
  To: Bridgman, John
  Cc: akpm, linux-kernel, linux-mm, Dan Williams, Kirill A . Shutemov,
	John Hubbard, Sander, Ben, Kuehling, Felix

On Fri, Jun 16, 2017 at 07:22:05AM +0000, Bridgman, John wrote:
> Hi Jerome, 
> 
> I'm just getting back to this; sorry for the late responses. 
> 
> Your description of HMM talks about blocking CPU accesses when a page
> has been migrated to device memory, and you treat that as a "given" in
> the HMM design. Other than BAR limits, coherency between CPU and device
> caches and performance on read-intensive CPU accesses to device memory
> are there any other reasons for this ?

Correct, this is the list of reasons for it. Note that HMM is more of
a toolbox than one monolithic thing. For instance you also have the
HMM-CDM patchset, which does allow GPU memory to be mapped to the CPU,
but this relies on CAPI or CCIX to keep the same memory model guarantees.


> The reason I'm asking is that we make fairly heavy use of large BAR
> support which allows the CPU to directly access all of the device
> memory on each of the GPUs, albeit without cache coherency, and there
> are some cases where it appears that allowing CPU access to the page
> in device memory would be more efficient than constantly migrating
> back and forth.

The thing is we are designing for arbitrary programs and we can not
make any assumptions about what kind of instructions a program might
run on such memory. So if a program tries to do an atomic operation on
it, IIRC it is undefined what is supposed to happen.

So if you want to keep such memory mapped to userspace I would
suggest doing it through a device-specific vma, and thus through an
API-specific contract that is well understood by the developer.
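For illustration, a rough sketch of that kind of device-specific mapping: a
hypothetical driver exposing BAR-backed memory through its own device file,
entirely outside of HMM (the foo_* names and the write-combine attribute are
assumptions):

	static int foo_mmap(struct file *filp, struct vm_area_struct *vma)
	{
		struct foo_device *fdev = filp->private_data;
		unsigned long pfn = (fdev->vram_bar_base >> PAGE_SHIFT) + vma->vm_pgoff;

		/* Device-specific contract: write-combined, no CPU atomics. */
		vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
		return io_remap_pfn_range(vma, vma->vm_start, pfn,
					  vma->vm_end - vma->vm_start,
					  vma->vm_page_prot);
	}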

> 
> Migrating the page back and forth between device system memory
> appears at first glance to provide three benefits (albeit at a
> cost):
> 
> 1. BAR limit - this is kind of a no-brainer, in the sense that if
>    the CPU can not access the VRAM then you have to migrate it
> 
> 2. coherency - having the CPU fault when page is in device memory
>    or vice versa gives you an event which can be used to allow cache
>    flushing on one device before handing ownership (from a cache
>    perspective) to the other device - but at first glance you don't
>    actually have to move the page to get that benefit
> 
> 3. performance - CPU writes to device memory can be pretty fast
>    since the transfers can be "fire and forget" but reads are always
>    going to be slow because of the round-trip nature... but the
>    tradeoff between access performance and migration overhead is
>    more of a heuristic thing than a black-and-white thing

You are missing CPU atomic operations: AFAIK it is undefined how they
behave on BAR/IO memory.


> Do you see any HMM-related problems in principle with optionally
> leaving a page in device memory while the CPU is accessing it
> assuming that only one CPU/device "owns" the page from a cache POV
> at any given time ?

The problem I see is with breaking assumptions with respect to the memory
model the programmer has. So let's say you have a program A that uses a
library L, that library is clever enough to use the GPU, and that GPU
driver uses HMM. Now if L migrates some memory behind the back of the
program to perform some computation, you do not want to break any of
the assumptions made by the programmer of A.

So like I said above, if you want to keep a live mapping of some memory
I would do it through a device-specific API. The whole point of HMM is
to make memory migration transparent without breaking any of the
expectations you have about how memory accesses work from the CPU point
of view.

Cheers,
Jerome

^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: [HMM 00/15] HMM (Heterogeneous Memory Management) v23
  2017-06-16 14:47   ` Jerome Glisse
@ 2017-06-16 17:55     ` Bridgman, John
  2017-06-16 18:04       ` Jerome Glisse
  0 siblings, 1 reply; 66+ messages in thread
From: Bridgman, John @ 2017-06-16 17:55 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, Dan Williams, Kirill A . Shutemov,
	John Hubbard, Sander, Ben, Kuehling, Felix

>-----Original Message-----
>From: Jerome Glisse [mailto:jglisse@redhat.com]
>Sent: Friday, June 16, 2017 10:48 AM
>To: Bridgman, John
>Cc: akpm@linux-foundation.org; linux-kernel@vger.kernel.org; linux-
>mm@kvack.org; Dan Williams; Kirill A . Shutemov; John Hubbard; Sander, Ben;
>Kuehling, Felix
>Subject: Re: [HMM 00/15] HMM (Heterogeneous Memory Management) v23
>
>On Fri, Jun 16, 2017 at 07:22:05AM +0000, Bridgman, John wrote:
>> Hi Jerome,
>>
>> I'm just getting back to this; sorry for the late responses.
>>
>> Your description of HMM talks about blocking CPU accesses when a page
>> has been migrated to device memory, and you treat that as a "given" in
>> the HMM design. Other than BAR limits, coherency between CPU and
>> device caches and performance on read-intensive CPU accesses to device
>> memory are there any other reasons for this ?
>
>Correct this is the list of reasons for it. Note that HMM is more of a toolboox
>that one monolithic thing. For instance you also have the HMM-CDM patchset
>that does allow to have GPU memory map to the CPU but this rely on CAPI or
>CCIX to keep same memory model garanty.
>
>
>> The reason I'm asking is that we make fairly heavy use of large BAR
>> support which allows the CPU to directly access all of the device
>> memory on each of the GPUs, albeit without cache coherency, and there
>> are some cases where it appears that allowing CPU access to the page
>> in device memory would be more efficient than constantly migrating
>> back and forth.
>
>The thing is we are designing toward random program and we can not make
>any assumption on what kind of instruction a program might run on such
>memory. So if program try to do atomic on it iirc it is un- define what is
>suppose to happen.

Thanks... thought I was missing something from the list. Agree that we need to provide consistent behaviour, and we definitely care about atomics. If we could get consistent behaviour with the page still in device memory, are you aware of any other problems related to HMM itself?

>
>So if you want to keep such memory mapped to userspace i would suggest
>doing it through device specific vma and thus through API specific contract
>that is well understood by the developer.
>
>>
>> Migrating the page back and forth between device system memory appears
>> at first glance to provide three benefits (albeit at a
>> cost):
>>
>> 1. BAR limit - this is kind of a no-brainer, in the sense that if
>>    the CPU can not access the VRAM then you have to migrate it
>>
>> 2. coherency - having the CPU fault when page is in device memory
>>    or vice versa gives you an event which can be used to allow cache
>>    flushing on one device before handing ownership (from a cache
>>    perspective) to the other device - but at first glance you don't
>>    actually have to move the page to get that benefit
>>
>> 3. performance - CPU writes to device memory can be pretty fast
>>    since the transfers can be "fire and forget" but reads are always
>>    going to be slow because of the round-trip nature... but the
>>    tradeoff between access performance and migration overhead is
>>    more of a heuristic thing than a black-and-white thing
>
>You are missing CPU atomic operation AFAIK it is undefine how they behave
>on BAR/IO memory.
>
>
>> Do you see any HMM-related problems in principle with optionally
>> leaving a page in device memory while the CPU is accessing it assuming
>> that only one CPU/device "owns" the page from a cache POV at any given
>> time ?
>
>The problem i see is with breaking assumption in respect to the memory
>model the programmer have. So let say you have program A that use a library
>L and that library is clever enough to use the GPU and that GPU driver use
>HMM. Now if L migrate some memory behind the back of the program to
>perform some computation you do not want to break any of the assumption
>made by the programmer of A.
>
>So like i said above if you want to keep a live mapping of some memory i
>would do it through device specific API. The whole point of HMM is to make
>memory migration transparent without breaking any of the expectation you
>have about how memory access works from CPU point of view.
>
>Cheers,
>Jérôme

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM 00/15] HMM (Heterogeneous Memory Management) v23
  2017-06-16 17:55     ` Bridgman, John
@ 2017-06-16 18:04       ` Jerome Glisse
  0 siblings, 0 replies; 66+ messages in thread
From: Jerome Glisse @ 2017-06-16 18:04 UTC (permalink / raw)
  To: Bridgman, John
  Cc: akpm, linux-kernel, linux-mm, Dan Williams, Kirill A . Shutemov,
	John Hubbard, Sander, Ben, Kuehling, Felix

On Fri, Jun 16, 2017 at 05:55:52PM +0000, Bridgman, John wrote:
> >-----Original Message-----
> >From: Jerome Glisse [mailto:jglisse@redhat.com]
> >Sent: Friday, June 16, 2017 10:48 AM
> >To: Bridgman, John
> >Cc: akpm@linux-foundation.org; linux-kernel@vger.kernel.org; linux-
> >mm@kvack.org; Dan Williams; Kirill A . Shutemov; John Hubbard; Sander, Ben;
> >Kuehling, Felix
> >Subject: Re: [HMM 00/15] HMM (Heterogeneous Memory Management) v23
> >
> >On Fri, Jun 16, 2017 at 07:22:05AM +0000, Bridgman, John wrote:
> >> Hi Jerome,
> >>
> >> I'm just getting back to this; sorry for the late responses.
> >>
> >> Your description of HMM talks about blocking CPU accesses when a page
> >> has been migrated to device memory, and you treat that as a "given" in
> >> the HMM design. Other than BAR limits, coherency between CPU and
> >> device caches and performance on read-intensive CPU accesses to device
> >> memory are there any other reasons for this ?
> >
> >Correct this is the list of reasons for it. Note that HMM is more of a toolboox
> >that one monolithic thing. For instance you also have the HMM-CDM patchset
> >that does allow to have GPU memory map to the CPU but this rely on CAPI or
> >CCIX to keep same memory model garanty.
> >
> >
> >> The reason I'm asking is that we make fairly heavy use of large BAR
> >> support which allows the CPU to directly access all of the device
> >> memory on each of the GPUs, albeit without cache coherency, and there
> >> are some cases where it appears that allowing CPU access to the page
> >> in device memory would be more efficient than constantly migrating
> >> back and forth.
> >
> >The thing is we are designing toward random program and we can not make
> >any assumption on what kind of instruction a program might run on such
> >memory. So if program try to do atomic on it iirc it is un- define what is
> >suppose to happen.
> 
> Thanks... thought I was missing something from the list. Agree that we
> need to provide consistent behaviour, and we definitely care about atomics.
> If we could get consistent behaviour with the page still in device memory
> are you aware of any other problems related to HMM itself ?

Well, the only way to get consistency is with a CCIX or CAPI bus. I would
need to do an in-depth reading of PCIe, but from my memory this isn't
doable with any of the existing PCIe standards.

Note that I have HMM-CDM especially for the case where you have cache coherent
device memory that behaves just like regular memory. When you use HMM-CDM
and you migrate to GPU memory, the page is still mapped into the CPU address
space. HMM-CDM is a separate patchset that I posted a couple of days ago.

So if you have cache coherent device memory that behaves like regular memory,
what you want is HMM-CDM, and when pages are migrated they are still mapped
into the CPU page table.

Jerome

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM 00/15] HMM (Heterogeneous Memory Management) v23
  2017-05-24 17:20 [HMM 00/15] HMM (Heterogeneous Memory Management) v23 Jérôme Glisse
                   ` (15 preceding siblings ...)
  2017-06-16  7:22 ` [HMM 00/15] HMM (Heterogeneous Memory Management) v23 Bridgman, John
@ 2017-06-23 15:00 ` Bob Liu
  2017-06-23 15:28   ` Jerome Glisse
  16 siblings, 1 reply; 66+ messages in thread
From: Bob Liu @ 2017-06-23 15:00 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: Andrew Morton, Linux-Kernel, Linux-MM, Dan Williams,
	Kirill A . Shutemov, John Hubbard

Hi,

On Thu, May 25, 2017 at 1:20 AM, Jérôme Glisse <jglisse@redhat.com> wrote:
> Patchset is on top of git://git.cmpxchg.org/linux-mmotm.git so i
> test same kernel as kbuild system, git branch:
>
> https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v23
>
> Change since v22 is use of static key for special ZONE_DEVICE case in
> put_page() and build fix for architecture with no mmu.
>
> Everything else is the same. Below is the long description of what HMM
> is about and why. At the end of this email i describe briefly each patch
> and suggest reviewers for each of them.
>
>
> Heterogeneous Memory Management (HMM) (description and justification)
>
> Today device driver expose dedicated memory allocation API through their
> device file, often relying on a combination of IOCTL and mmap calls. The
> device can only access and use memory allocated through this API. This
> effectively split the program address space into object allocated for the
> device and useable by the device and other regular memory (malloc, mmap
> of a file, share memory, ...) only accessible by CPU (or in a very limited
> way by a device by pinning memory).
>
> Allowing different isolated component of a program to use a device thus
> require duplication of the input data structure using device memory
> allocator. This is reasonable for simple data structure (array, grid,
> image, ...) but this get extremely complex with advance data structure
> (list, tree, graph, ...) that rely on a web of memory pointers. This is
> becoming a serious limitation on the kind of work load that can be
> offloaded to device like GPU.
>
> New industry standard like C++, OpenCL or CUDA are pushing to remove this
> barrier. This require a shared address space between GPU device and CPU so
> that GPU can access any memory of a process (while still obeying memory
> protection like read only). This kind of feature is also appearing in
> various other operating systems.
>
> HMM is a set of helpers to facilitate several aspects of address space
> sharing and device memory management. Unlike existing sharing mechanism

It looks like the address space sharing and device memory management
are two different things. They don't depend on each other and HMM has
helpers for both.

Is it possible to separate these two things into two patchsets?
That would make it easier to review and also follow the "Do one
thing, and do it well" philosophy.

Thanks,
Bob Liu

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM 00/15] HMM (Heterogeneous Memory Management) v23
  2017-06-23 15:00 ` Bob Liu
@ 2017-06-23 15:28   ` Jerome Glisse
  0 siblings, 0 replies; 66+ messages in thread
From: Jerome Glisse @ 2017-06-23 15:28 UTC (permalink / raw)
  To: Bob Liu
  Cc: Andrew Morton, Linux-Kernel, Linux-MM, Dan Williams,
	Kirill A . Shutemov, John Hubbard

On Fri, Jun 23, 2017 at 11:00:37PM +0800, Bob Liu wrote:
> Hi,
> 
> On Thu, May 25, 2017 at 1:20 AM, Jerome Glisse <jglisse@redhat.com> wrote:
> > Patchset is on top of git://git.cmpxchg.org/linux-mmotm.git so i
> > test same kernel as kbuild system, git branch:
> >
> > https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v23
> >
> > Change since v22 is use of static key for special ZONE_DEVICE case in
> > put_page() and build fix for architecture with no mmu.
> >
> > Everything else is the same. Below is the long description of what HMM
> > is about and why. At the end of this email i describe briefly each patch
> > and suggest reviewers for each of them.
> >
> >
> > Heterogeneous Memory Management (HMM) (description and justification)
> >
> > Today device driver expose dedicated memory allocation API through their
> > device file, often relying on a combination of IOCTL and mmap calls. The
> > device can only access and use memory allocated through this API. This
> > effectively split the program address space into object allocated for the
> > device and useable by the device and other regular memory (malloc, mmap
> > of a file, share memory, ...) only accessible by CPU (or in a very limited
> > way by a device by pinning memory).
> >
> > Allowing different isolated component of a program to use a device thus
> > require duplication of the input data structure using device memory
> > allocator. This is reasonable for simple data structure (array, grid,
> > image, ...) but this get extremely complex with advance data structure
> > (list, tree, graph, ...) that rely on a web of memory pointers. This is
> > becoming a serious limitation on the kind of work load that can be
> > offloaded to device like GPU.
> >
> > New industry standard like C++, OpenCL or CUDA are pushing to remove this
> > barrier. This require a shared address space between GPU device and CPU so
> > that GPU can access any memory of a process (while still obeying memory
> > protection like read only). This kind of feature is also appearing in
> > various other operating systems.
> >
> > HMM is a set of helpers to facilitate several aspects of address space
> > sharing and device memory management. Unlike existing sharing mechanism
> 
> It looks like the address space sharing and device memory management
> are two different things. They don't depend on each other and HMM has
> helpers for both.
> 
> Is it possible to separate these two things into two patchsets?
> Which will make it's more easy to review and also follow the "Do one
> thing, and do it well" philosophy.
> 

They are already separate. Patches 3-5 are for address space mirroring.
Patches 6-10 are for device memory using struct page and ZONE_DEVICE. Finally,
patches 11-15 add a new page migration helper capable of using the
device DMA engine to perform the memory copy operation.

Patch 1 is just common documentation and patch 2 is common helpers and
definitions.

Also, they are separate at the kernel configuration level. So for all intents
and purposes these are already 2 separate things, just in one posting
because the first user will use both. You can use one without the other and
it will work properly.

Cheers,
Jerome

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM 09/15] mm/hmm/devmem: device memory hotplug using ZONE_DEVICE v5
  2017-05-24 17:20 ` [HMM 09/15] mm/hmm/devmem: device memory hotplug using ZONE_DEVICE v5 Jérôme Glisse
@ 2017-06-24  3:54   ` John Hubbard
  0 siblings, 0 replies; 66+ messages in thread
From: John Hubbard @ 2017-06-24  3:54 UTC (permalink / raw)
  To: Jérôme Glisse, akpm, linux-kernel, linux-mm
  Cc: Dan Williams, Kirill A . Shutemov, Evgeny Baskakov,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

On 05/24/2017 10:20 AM, Jérôme Glisse wrote:
[...]
> +/*
> + * hmm_devmem_fault_range() - migrate back a virtual range of memory
> + *
> + * @devmem: hmm_devmem struct use to track and manage the ZONE_DEVICE memory
> + * @vma: virtual memory area containing the range to be migrated
> + * @ops: migration callback for allocating destination memory and copying
> + * @src: array of unsigned long containing source pfns
> + * @dst: array of unsigned long containing destination pfns
> + * @start: start address of the range to migrate (inclusive)
> + * @addr: fault address (must be inside the range)
> + * @end: end address of the range to migrate (exclusive)
> + * @private: pointer passed back to each of the callback
> + * Returns: 0 on success, VM_FAULT_SIGBUS on error
> + *
> + * This is a wrapper around migrate_vma() which checks the migration status
> + * for a given fault address and returns the corresponding page fault handler
> + * status. That will be 0 on success, or VM_FAULT_SIGBUS if migration failed
> + * for the faulting address.
> + *
> + * This is a helper intended to be used by the ZONE_DEVICE fault handler.
> + */
> +int hmm_devmem_fault_range(struct hmm_devmem *devmem,
> +			   struct vm_area_struct *vma,
> +			   const struct migrate_vma_ops *ops,
> +			   unsigned long *src,
> +			   unsigned long *dst,
> +			   unsigned long start,
> +			   unsigned long addr,
> +			   unsigned long end,
> +			   void *private)
> +{
> +	if (migrate_vma(ops, vma, start, end, src, dst, private))
> +		return VM_FAULT_SIGBUS;
> +
> +	if (dst[(addr - start) >> PAGE_SHIFT] & MIGRATE_PFN_ERROR)
> +		return VM_FAULT_SIGBUS;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(hmm_devmem_fault_range);
> +#endif /* IS_ENABLED(CONFIG_HMM_DEVMEM) */
> 

Hi Jerome (+Evgeny),

After some time and testing, I'd like to recommend that we delete the above
hmm_devmem_fault_range() function from the patchset. Reasons:

1. Our driver code is actually easier to follow if we call migrate_vma() directly, for CPU 
faults. That's because there are a lot of hmm_* calls in both directions (driver <--> 
core), already, and it takes some time to remember which direction each one goes.

2. The helper is a little confusing to use, what with a start, end, *and* an addr argument.

3. ...and it doesn't add anything that the driver can't trivially do itself.

So, let's just remove it. Less is more this time. :)
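For reference, a minimal sketch of the equivalent direct call, mirroring the body
of the helper quoted above (the migrate ops and pfn arrays are assumed to be
driver supplied):

	if (migrate_vma(&foo_migrate_ops, vma, start, end, src_pfns, dst_pfns, private))
		return VM_FAULT_SIGBUS;

	if (dst_pfns[(addr - start) >> PAGE_SHIFT] & MIGRATE_PFN_ERROR)
		return VM_FAULT_SIGBUS;

	return 0;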

thanks,
--
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM 01/15] hmm: heterogeneous memory management documentation
  2017-05-24 17:20 ` [HMM 01/15] hmm: heterogeneous memory management documentation Jérôme Glisse
@ 2017-06-24  6:15   ` John Hubbard
  0 siblings, 0 replies; 66+ messages in thread
From: John Hubbard @ 2017-06-24  6:15 UTC (permalink / raw)
  To: Jérôme Glisse, akpm, linux-kernel, linux-mm
  Cc: Dan Williams, Kirill A . Shutemov

On 05/24/2017 10:20 AM, Jérôme Glisse wrote:
> This adds documentation for HMM (Heterogeneous Memory Management). It
> presents the motivation behind it, the features necessary for it to
> be useful, and gives an overview of how this is implemented.
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> ---
>   Documentation/vm/hmm.txt | 362 +++++++++++++++++++++++++++++++++++++++++++++++
>   MAINTAINERS              |   7 +
>   2 files changed, 369 insertions(+)
>   create mode 100644 Documentation/vm/hmm.txt
> 
> diff --git a/Documentation/vm/hmm.txt b/Documentation/vm/hmm.txt
> new file mode 100644
> index 0000000..a18ffc0
> --- /dev/null
> +++ b/Documentation/vm/hmm.txt
> @@ -0,0 +1,362 @@
> +Heterogeneous Memory Management (HMM)
> +

Some months ago, I made a rash promise to give this document some editing love. I am still 
willing to do that if anyone sees the need, but I put it on the back burner because I 
suspect that the document is already good enough. This is based on not seeing any "I am 
having trouble understanding HMM" complaints.

If that's not the case, please speak up. Otherwise, I'm assuming that all is well in the 
HMM Documentation department.

thanks,
--
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM 12/15] mm/migrate: new memory migration helper for use with device memory v4
  2017-07-25 22:45                             ` Evgeny Baskakov
@ 2017-07-26 19:14                               ` Jerome Glisse
  0 siblings, 0 replies; 66+ messages in thread
From: Jerome Glisse @ 2017-07-26 19:14 UTC (permalink / raw)
  To: Evgeny Baskakov
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, David Nellans,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

On Tue, Jul 25, 2017 at 03:45:14PM -0700, Evgeny Baskakov wrote:
> On 7/20/17 6:33 PM, Jerome Glisse wrote:
> 
> > So I pushed an updated hmm-next branch; it should have all fixes so far, including
> > something that should fix this issue. I still want to go over all the emails again
> > to make sure I am not forgetting anything.
> > 
> > Cheers,
> > Jerome
> 
> Hi Jerome,
> 
> Thanks for updating the documentation for hmm_devmem_ops.
> 
> I have an inquiry about the "fault" callback, though. The documentation says
> "Returns: 0 on success", but can the driver set any VM_FAULT_* flags? For
> instance, the driver might want to set the VM_FAULT_MAJOR flag to indicate
> that a heavy-weight page migration has happened on the page fault.
> 
> If that is possible, can you please update the documentation and list the
> flags that are permitted in the callback's return value?

Yes, you can.
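
Something along these lines, for instance (just a sketch: my_devmem_fault() and
my_drv_migrate_back() are made-up driver-side names, and the parameter list here is
illustrative, so follow whatever the hmm_devmem_ops fault prototype in
include/linux/hmm.h actually says):

static int my_devmem_fault(struct hmm_devmem *devmem,
			   struct vm_area_struct *vma,
			   unsigned long addr,
			   const struct page *page,
			   unsigned int flags,
			   pmd_t *pmdp)
{
	/* migrate the page back to system memory (the heavy-weight work) */
	if (my_drv_migrate_back(devmem, vma, addr, page))
		return VM_FAULT_SIGBUS;

	/* report a major fault instead of returning plain 0 */
	return VM_FAULT_MAJOR;
}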

Jerome

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM 12/15] mm/migrate: new memory migration helper for use with device memory v4
  2017-07-21  1:33                           ` Jerome Glisse
  2017-07-21 22:01                             ` Evgeny Baskakov
@ 2017-07-25 22:45                             ` Evgeny Baskakov
  2017-07-26 19:14                               ` Jerome Glisse
  1 sibling, 1 reply; 66+ messages in thread
From: Evgeny Baskakov @ 2017-07-25 22:45 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, David Nellans,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

On 7/20/17 6:33 PM, Jerome Glisse wrote:

> So I pushed an updated hmm-next branch; it should have all fixes so far, including
> something that should fix this issue. I still want to go over all the emails again
> to make sure I am not forgetting anything.
>
> Cheers,
> Jérôme

Hi Jerome,

Thanks for updating the documentation for hmm_devmem_ops.

I have an inquiry about the "fault" callback, though. The documentation 
says "Returns: 0 on success", but can the driver set any VM_FAULT_* 
flags? For instance, the driver might want to set the VM_FAULT_MAJOR 
flag to indicate that a heavy-weight page migration has happened on the 
page fault.

If that is possible, can you please update the documentation and list 
the flags that are permitted in the callback's return value?

Thanks!

-- 
Evgeny Baskakov
NVIDIA

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM 12/15] mm/migrate: new memory migration helper for use with device memory v4
  2017-07-21  1:33                           ` Jerome Glisse
@ 2017-07-21 22:01                             ` Evgeny Baskakov
  2017-07-25 22:45                             ` Evgeny Baskakov
  1 sibling, 0 replies; 66+ messages in thread
From: Evgeny Baskakov @ 2017-07-21 22:01 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, David Nellans,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

On 7/20/17 6:33 PM, Jerome Glisse wrote:

> So I pushed an updated hmm-next branch; it should have all fixes so far, including
> something that should fix this issue. I still want to go over all the emails again
> to make sure I am not forgetting anything.
>
> Cheers,
> Jérôme

Hi Jerome,

The issues I observed seem to be gone!

I am still running my stress tests, though. I will let you know if there 
is anything else that needs to be addressed.

Thanks!

-- 
Evgeny Baskakov
NVIDIA

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM 12/15] mm/migrate: new memory migration helper for use with device memory v4
  2017-07-21  1:00                         ` Evgeny Baskakov
@ 2017-07-21  1:33                           ` Jerome Glisse
  2017-07-21 22:01                             ` Evgeny Baskakov
  2017-07-25 22:45                             ` Evgeny Baskakov
  0 siblings, 2 replies; 66+ messages in thread
From: Jerome Glisse @ 2017-07-21  1:33 UTC (permalink / raw)
  To: Evgeny Baskakov
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, David Nellans,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

On Thu, Jul 20, 2017 at 06:00:08PM -0700, Evgeny Baskakov wrote:
> On 7/14/17 5:55 PM, Jerome Glisse wrote:
> Hi Jerome,
> 
> I think I just found a couple of new issues, now related to fork/execve.
> 
> 1) With a fork() followed by execve(), the child process makes a copy of the
> parent mm_struct object, including the "hmm" pointer. Later on, an execve()
> syscall in the child process frees the old mm_struct, and destroys the "hmm"
> object - which apparently it shouldn't do, because the "hmm" object is
> shared between the parent and child processes:
> 
> (gdb) bt
> #0  hmm_mm_destroy (mm=0xffff88080757aa40) at mm/hmm.c:134
> #1  0xffffffff81058567 in __mmdrop (mm=0xffff88080757aa40) at
> kernel/fork.c:889
> #2  0xffffffff8105904f in mmdrop (mm=<optimized out>) at
> ./include/linux/sched/mm.h:42
> #3  __mmput (mm=<optimized out>) at kernel/fork.c:916
> #4  mmput (mm=0xffff88080757aa40) at kernel/fork.c:927
> #5  0xffffffff811c5a68 in exec_mmap (mm=<optimized out>) at fs/exec.c:1057
> #6  flush_old_exec (bprm=<optimized out>) at fs/exec.c:1284
> #7  0xffffffff81214460 in load_elf_binary (bprm=0xffff8808133b1978) at
> fs/binfmt_elf.c:855
> #8  0xffffffff811c4fce in search_binary_handler (bprm=0xffff88081b40cb78) at
> fs/exec.c:1625
> #9  0xffffffff811c6bbf in exec_binprm (bprm=<optimized out>) at
> fs/exec.c:1667
> #10 do_execveat_common (fd=<optimized out>, filename=0xffff88080a101200,
> flags=0x0, argv=..., envp=...) at fs/exec.c:1789
> #11 0xffffffff811c6fda in do_execve (__envp=<optimized out>,
> __argv=<optimized out>, filename=<optimized out>) at fs/exec.c:1833
> #12 SYSC_execve (envp=<optimized out>, argv=<optimized out>,
> filename=<optimized out>) at fs/exec.c:1914
> #13 SyS_execve (filename=<optimized out>, argv=0x7f4e5c2aced0,
> envp=0x7f4e5c2aceb0) at fs/exec.c:1909
> #14 0xffffffff810018dd in do_syscall_64 (regs=0xffff88081b40cb78) at
> arch/x86/entry/common.c:284
> #15 0xffffffff819e2c06 in entry_SYSCALL_64 () at
> arch/x86/entry/entry_64.S:245
> 
> This leads to a sporadic memory corruption in the parent process:
> 
> Thread 200 received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 3685]
> 0xffffffff811a3efe in __mmu_notifier_invalidate_range_start
> (mm=0xffff880807579000, start=0x7f4e5c62f000, end=0x7f4e5c66f000) at
> mm/mmu_notifier.c:199
> 199            if (mn->ops->invalidate_range_start)
> (gdb) bt
> #0  0xffffffff811a3efe in __mmu_notifier_invalidate_range_start
> (mm=0xffff880807579000, start=0x7f4e5c62f000, end=0x7f4e5c66f000) at
> mm/mmu_notifier.c:199
> #1  0xffffffff811ae471 in mmu_notifier_invalidate_range_start
> (end=<optimized out>, start=<optimized out>, mm=<optimized out>) at
> ./include/linux/mmu_notifier.h:282
> #2  migrate_vma_collect (migrate=0xffffc90003ca3940) at mm/migrate.c:2280
> #3  0xffffffff811b04a7 in migrate_vma (ops=<optimized out>,
> vma=0x7f4e5c62f000, start=0x7f4e5c62f000, end=0x7f4e5c66f000,
> src=0xffffc90003ca39d0, dst=0xffffc90003ca39d0, private=0xffffc90003ca39c0)
> at mm/migrate.c:2819
> (gdb) p mn->ops
> $2 = (const struct mmu_notifier_ops *) 0x6b6b6b6b6b6b6b6b
> 
> Please see attached a reproducer (sanity_rmem004_fork.tgz). Use "./build.sh;
> sudo ./kload.sh; ./run.sh" to recreate the issue on your end.
> 
> 
> 2) A slight modification of the affected application does not use fork().
> Instead, an execve() call from a parallel thread replaces the original
> process. This is a particularly interesting case, because at that point the
> process is busy migrating pages to/from device.
> 
> Here's what happens:
> 
> 0xffffffff811b9879 in commit_charge (page=<optimized out>,
> lrucare=<optimized out>, memcg=<optimized out>) at mm/memcontrol.c:2060
> 2060        VM_BUG_ON_PAGE(page->mem_cgroup, page);
> (gdb) bt
> #0  0xffffffff811b9879 in commit_charge (page=<optimized out>,
> lrucare=<optimized out>, memcg=<optimized out>) at mm/memcontrol.c:2060
> #1  0xffffffff811b93d6 in commit_charge (lrucare=<optimized out>,
> memcg=<optimized out>, page=<optimized out>) at
> ./include/linux/page-flags.h:149
> #2  mem_cgroup_commit_charge (page=0xffff88081b68cb70,
> memcg=0xffff88081b051548, lrucare=<optimized out>, compound=<optimized out>)
> at mm/memcontrol.c:5468
> #3  0xffffffff811b10d4 in migrate_vma_insert_page (migrate=<optimized out>,
> dst=<optimized out>, src=<optimized out>, page=<optimized out>,
> addr=<optimized out>) at mm/migrate.c:2605
> #4  migrate_vma_pages (migrate=<optimized out>) at mm/migrate.c:2647
> #5  migrate_vma (ops=<optimized out>, vma=<optimized out>, start=<optimized
> out>, end=<optimized out>, src=<optimized out>, dst=<optimized out>,
> private=0xffffc900037439c0) at mm/migrate.c:2844
> 
> 
> Please find another reproducer attached (sanity_rmem004_execve.tgz) for this
> issue.
> 

So I pushed an updated hmm-next branch; it should have all fixes so far, including
something that should fix this issue. I still want to go over all the emails again
to make sure I am not forgetting anything.

Cheers,
Jerome

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM 12/15] mm/migrate: new memory migration helper for use with device memory v4
  2017-07-15  0:55                       ` Jerome Glisse
  2017-07-15  5:04                         ` Evgeny Baskakov
@ 2017-07-21  1:00                         ` Evgeny Baskakov
  2017-07-21  1:33                           ` Jerome Glisse
  1 sibling, 1 reply; 66+ messages in thread
From: Evgeny Baskakov @ 2017-07-21  1:00 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, David Nellans,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

[-- Attachment #1: Type: text/plain, Size: 4607 bytes --]

On 7/14/17 5:55 PM, Jerome Glisse wrote:

> ...
>
> Cheers,
> Jérôme

Hi Jerome,

I think I just found a couple of new issues, now related to fork/execve.

1) With a fork() followed by execve(), the child process makes a copy of 
the parent mm_struct object, including the "hmm" pointer. Later on, an 
execve() syscall in the child process frees the old mm_struct, and 
destroys the "hmm" object - which apparently it shouldn't do, because 
the "hmm" object is shared between the parent and child processes:

(gdb) bt
#0  hmm_mm_destroy (mm=0xffff88080757aa40) at mm/hmm.c:134
#1  0xffffffff81058567 in __mmdrop (mm=0xffff88080757aa40) at 
kernel/fork.c:889
#2  0xffffffff8105904f in mmdrop (mm=<optimized out>) at 
./include/linux/sched/mm.h:42
#3  __mmput (mm=<optimized out>) at kernel/fork.c:916
#4  mmput (mm=0xffff88080757aa40) at kernel/fork.c:927
#5  0xffffffff811c5a68 in exec_mmap (mm=<optimized out>) at fs/exec.c:1057
#6  flush_old_exec (bprm=<optimized out>) at fs/exec.c:1284
#7  0xffffffff81214460 in load_elf_binary (bprm=0xffff8808133b1978) at 
fs/binfmt_elf.c:855
#8  0xffffffff811c4fce in search_binary_handler 
(bprm=0xffff88081b40cb78) at fs/exec.c:1625
#9  0xffffffff811c6bbf in exec_binprm (bprm=<optimized out>) at 
fs/exec.c:1667
#10 do_execveat_common (fd=<optimized out>, filename=0xffff88080a101200, 
flags=0x0, argv=..., envp=...) at fs/exec.c:1789
#11 0xffffffff811c6fda in do_execve (__envp=<optimized out>, 
__argv=<optimized out>, filename=<optimized out>) at fs/exec.c:1833
#12 SYSC_execve (envp=<optimized out>, argv=<optimized out>, 
filename=<optimized out>) at fs/exec.c:1914
#13 SyS_execve (filename=<optimized out>, argv=0x7f4e5c2aced0, 
envp=0x7f4e5c2aceb0) at fs/exec.c:1909
#14 0xffffffff810018dd in do_syscall_64 (regs=0xffff88081b40cb78) at 
arch/x86/entry/common.c:284
#15 0xffffffff819e2c06 in entry_SYSCALL_64 () at 
arch/x86/entry/entry_64.S:245

This leads to a sporadic memory corruption in the parent process:

Thread 200 received signal SIGSEGV, Segmentation fault.
[Switching to Thread 3685]
0xffffffff811a3efe in __mmu_notifier_invalidate_range_start 
(mm=0xffff880807579000, start=0x7f4e5c62f000, end=0x7f4e5c66f000) at 
mm/mmu_notifier.c:199
199            if (mn->ops->invalidate_range_start)
(gdb) bt
#0  0xffffffff811a3efe in __mmu_notifier_invalidate_range_start 
(mm=0xffff880807579000, start=0x7f4e5c62f000, end=0x7f4e5c66f000) at 
mm/mmu_notifier.c:199
#1  0xffffffff811ae471 in mmu_notifier_invalidate_range_start 
(end=<optimized out>, start=<optimized out>, mm=<optimized out>) at 
./include/linux/mmu_notifier.h:282
#2  migrate_vma_collect (migrate=0xffffc90003ca3940) at mm/migrate.c:2280
#3  0xffffffff811b04a7 in migrate_vma (ops=<optimized out>, 
vma=0x7f4e5c62f000, start=0x7f4e5c62f000, end=0x7f4e5c66f000, 
src=0xffffc90003ca39d0, dst=0xffffc90003ca39d0, 
private=0xffffc90003ca39c0) at mm/migrate.c:2819
(gdb) p mn->ops
$2 = (const struct mmu_notifier_ops *) 0x6b6b6b6b6b6b6b6b

Please see attached a reproducer (sanity_rmem004_fork.tgz). Use 
"./build.sh; sudo ./kload.sh; ./run.sh" to recreate the issue on your end.


2) A slight modification of the affected application does not use 
fork(). Instead, an execve() call from a parallel thread replaces the 
original process. This is a particularly interesting case, because at 
that point the process is busy migrating pages to/from device.

Here's what happens:

0xffffffff811b9879 in commit_charge (page=<optimized out>, 
lrucare=<optimized out>, memcg=<optimized out>) at mm/memcontrol.c:2060
2060        VM_BUG_ON_PAGE(page->mem_cgroup, page);
(gdb) bt
#0  0xffffffff811b9879 in commit_charge (page=<optimized out>, 
lrucare=<optimized out>, memcg=<optimized out>) at mm/memcontrol.c:2060
#1  0xffffffff811b93d6 in commit_charge (lrucare=<optimized out>, 
memcg=<optimized out>, page=<optimized out>) at 
./include/linux/page-flags.h:149
#2  mem_cgroup_commit_charge (page=0xffff88081b68cb70, 
memcg=0xffff88081b051548, lrucare=<optimized out>, compound=<optimized 
out>) at mm/memcontrol.c:5468
#3  0xffffffff811b10d4 in migrate_vma_insert_page (migrate=<optimized 
out>, dst=<optimized out>, src=<optimized out>, page=<optimized out>, 
addr=<optimized out>) at mm/migrate.c:2605
#4  migrate_vma_pages (migrate=<optimized out>) at mm/migrate.c:2647
#5  migrate_vma (ops=<optimized out>, vma=<optimized out>, 
start=<optimized out>, end=<optimized out>, src=<optimized out>, 
dst=<optimized out>, private=0xffffc900037439c0) at mm/migrate.c:2844


Please find another reproducer attached (sanity_rmem004_execve.tgz) for 
this issue.

Thanks!

-- 
Evgeny Baskakov
NVIDIA


[-- Attachment #2: sanity_rmem004_execve.tgz --]
[-- Type: application/x-gzip, Size: 5728 bytes --]

[-- Attachment #3: sanity_rmem004_fork.tgz --]
[-- Type: application/x-gzip, Size: 5736 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM 12/15] mm/migrate: new memory migration helper for use with device memory v4
  2017-07-11  0:54               ` Jerome Glisse
@ 2017-07-20 21:05                 ` Evgeny Baskakov
  0 siblings, 0 replies; 66+ messages in thread
From: Evgeny Baskakov @ 2017-07-20 21:05 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, David Nellans,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

On 7/10/17 5:54 PM, Jerome Glisse wrote:

> On Mon, Jul 10, 2017 at 05:17:23PM -0700, Evgeny Baskakov wrote:
>> On 7/10/17 4:43 PM, Jerome Glisse wrote:
>>
>>> On Mon, Jul 10, 2017 at 03:59:37PM -0700, Evgeny Baskakov wrote:
>>> ...
>>> Horrible stupid bug in the code, most likely from cut and paste. Attached
>>> patch should fix it. I don't know how long it took for you to trigger it.
>>>
>>> Jérôme
>> Thanks, this indeed fixes the problem! Yes, it took a nightly run before it
>> triggered.
>>
>> On a side note, should this "return NULL" be replaced with "return
>> ERR_PTR(-ENOMEM)"?
> Or -EBUSY, but yes, sure.
>
> Jérôme

Hi Jerome,

Are these fixes in already (for the alloc_chrdev_region and "return
NULL" issues)? I don't see them in hmm-next or in hmm-v24.

Can you please double-check?
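
For clarity, the "return NULL" change I am asking about amounts to something like
this (just a sketch, not the actual patch; the names are only examples):

	/* in the failing path, return an error pointer instead of NULL */
	if (!resource)
		return ERR_PTR(-ENOMEM);	/* or ERR_PTR(-EBUSY), as discussed */

	/* ...so that a caller can tell why it failed, via the <linux/err.h> helpers */
	devmem = my_devmem_setup(dev);		/* made-up caller-side name */
	if (IS_ERR(devmem))
		return PTR_ERR(devmem);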

Thanks!

-- 
Evgeny Baskakov
NVIDIA

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM 12/15] mm/migrate: new memory migration helper for use with device memory v4
  2017-07-15  0:55                       ` Jerome Glisse
@ 2017-07-15  5:04                         ` Evgeny Baskakov
  2017-07-21  1:00                         ` Evgeny Baskakov
  1 sibling, 0 replies; 66+ messages in thread
From: Evgeny Baskakov @ 2017-07-15  5:04 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, David Nellans,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

On 7/14/17 5:55 PM, Jerome Glisse wrote:
> So I pushed an updated hmm-next branch; this should fix all the issues you had.
> Though I am not sure about the test in this mail: all I see is that it
> continuously spits error messages, but it does not hang (I let it run for 20
> minutes or so). I don't know if that is what's expected. Let me know if this is
> still an issue, and if so, what the expected output of this test program should be.
>
> Cheers,
> Jérôme

Thanks, Jerome. The kernel hang indeed seems to be fixed.

Regarding the last issue I reported: it still persists. The test program
should eventually exit. Instead, it loops indefinitely (sorry, I was not
clear when I called it an 'app-side hang').

This is what's expected. The number of error messages can be random, but 
must be finite; the program must print "OK" at the end:

$ ./run.sh
&&& 1 migrate threads: STARTING
iteration 0
thread 0 is migrating 10000 pages starting from 0x7f6abce79000
migrate_thread_func:87: failed to migrate pages at 0x7f6abce79000 
(migrate.npages (tid 0): 9725 != npages: 10000)
thread 0 is migrating 10000 pages starting from 0x7f6abce79000
thread 0 is migrating 10000 pages starting from 0x7f6abce79000
&&& 1 migrate threads: PASSED
(OK)[./sanity_rmem004] anon migration read test

Thanks,

-- 
Evgeny Baskakov
NVIDIA

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM 12/15] mm/migrate: new memory migration helper for use with device memory v4
  2017-07-14 19:43                     ` Evgeny Baskakov
@ 2017-07-15  0:55                       ` Jerome Glisse
  2017-07-15  5:04                         ` Evgeny Baskakov
  2017-07-21  1:00                         ` Evgeny Baskakov
  0 siblings, 2 replies; 66+ messages in thread
From: Jerome Glisse @ 2017-07-15  0:55 UTC (permalink / raw)
  To: Evgeny Baskakov
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, David Nellans,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

On Fri, Jul 14, 2017 at 12:43:51PM -0700, Evgeny Baskakov wrote:
> On 7/13/17 1:16 PM, Jerome Glisse wrote:
> Hi Jerome,
> 
> I have hit another kind of hang. Briefly, if a not yet allocated page faults
> on the CPU during migration to device memory, any subsequent migration will fail
> for such a page. Such a situation can be triggered if a CPU page fault happens
> immediately after migrate_vma() starts unmapping the pages to migrate.
> 
> Please find attached a reproducer based on the sample driver. In the
> hmm_test() function, an HMM_DMIRROR_MIGRATE request is triggered from a
> separate thread for not-yet-allocated pages (coming from malloc). At the
> same time, a HMM_DMIRROR_READ request is made for the same pages. This
> results in a sporadic app-side hang, because a random number of pages never
> migrate to device memory.
> 
> Note that if the pages are touched (initialized with data) prior to that,
> everything works as expected: all HMM_DMIRROR_READ and HMM_DMIRROR_MIGRATE
> requests eventually succeed. See comments in the hmm_test() function.
> 

So I pushed an updated hmm-next branch; this should fix all the issues you had.
Though I am not sure about the test in this mail: all I see is that it
continuously spits error messages, but it does not hang (I let it run for 20
minutes or so). I don't know if that is what's expected. Let me know if this is
still an issue, and if so, what the expected output of this test program should be.

Cheers,
Jerome

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM 12/15] mm/migrate: new memory migration helper for use with device memory v4
  2017-07-13 20:16                   ` Jerome Glisse
  2017-07-14  5:32                     ` Evgeny Baskakov
@ 2017-07-14 19:43                     ` Evgeny Baskakov
  2017-07-15  0:55                       ` Jerome Glisse
  1 sibling, 1 reply; 66+ messages in thread
From: Evgeny Baskakov @ 2017-07-14 19:43 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, David Nellans,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

[-- Attachment #1: Type: text/plain, Size: 1017 bytes --]

On 7/13/17 1:16 PM, Jerome Glisse wrote:

> ...
>

Hi Jerome,

I have hit another kind of hang. Briefly, if a not yet allocated page
faults on the CPU during migration to device memory, any subsequent
migration will fail for such a page. Such a situation can be triggered if a
CPU page fault happens immediately after migrate_vma() starts unmapping the
pages to migrate.

Please find attached a reproducer based on the sample driver. In the
hmm_test() function, an HMM_DMIRROR_MIGRATE request is triggered from a
separate thread for not-yet-allocated pages (coming from malloc). At the
same time, a HMM_DMIRROR_READ request is made for the same pages. This
results in a sporadic app-side hang, because a random number of pages
never migrate to device memory (see the sketch below).

Note that if the pages are touched (initialized with data) prior to 
that, everything works as expected: all HMM_DMIRROR_READ and 
HMM_DMIRROR_MIGRATE requests eventually succeed. See comments in the 
hmm_test() function.
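
The interleaving boils down to something like this (a stripped-down sketch, not
the attached reproducer; dmirror_migrate() and dmirror_read() stand in for
whatever wrappers the sample driver's HMM_DMIRROR_MIGRATE and HMM_DMIRROR_READ
ioctls need, so they are hypothetical names):

#include <pthread.h>
#include <stdlib.h>

#define NPAGES	10000UL
#define PAGE_SZ	4096UL

/* hypothetical wrappers around the sample driver's ioctls */
extern int dmirror_migrate(void *addr, unsigned long npages);
extern int dmirror_read(void *addr, unsigned long npages);

static void *migrate_thread(void *buf)
{
	/* the pages come straight from malloc() and were never touched by the CPU */
	dmirror_migrate(buf, NPAGES);
	return NULL;
}

int main(void)
{
	void *buf = malloc(NPAGES * PAGE_SZ);	/* not yet faulted in */
	pthread_t t;

	pthread_create(&t, NULL, migrate_thread, buf);
	dmirror_read(buf, NPAGES);	/* CPU-side access racing with the migration */
	pthread_join(t, NULL);
	return 0;
}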

Thanks!

-- 
Evgeny Baskakov
NVIDIA


[-- Attachment #2: sanity_rmem004_repeated_faults_threaded_notallocated.tgz --]
[-- Type: application/x-gzip, Size: 5786 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM 12/15] mm/migrate: new memory migration helper for use with device memory v4
  2017-07-13 20:16                   ` Jerome Glisse
@ 2017-07-14  5:32                     ` Evgeny Baskakov
  2017-07-14 19:43                     ` Evgeny Baskakov
  1 sibling, 0 replies; 66+ messages in thread
From: Evgeny Baskakov @ 2017-07-14  5:32 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, David Nellans,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

[-- Attachment #1: Type: text/plain, Size: 1285 bytes --]

On 7/13/17 1:16 PM, Jerome Glisse wrote:

> I updated hmm-next with a patch that might fix some other issues, but I am
> still trying to track down this deadlock you are seeing. Does it happen quickly
> with the test program?
>
> I can't see how it could deadlock on the page lock bit. Going over and over all
> the code paths, we always unlock the page once we are done or when we back off
> from migration. So far I haven't been able to reproduce it, though I haven't had
> much time to test, as other things kept me busy. I should be back looking into
> that tomorrow.
>
> https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-next
>
> Cheers,
> Jérôme

Hi Jerome,

The issue persists in the updated hmm-next. The test program hangs on 
the first run on my 12-core SMT system:

$ sudo ./run.sh
&&& 2 migrate threads, 2 read threads: STARTING
&&& 2 migrate threads, 2 read threads: PASSED
&&& 2 migrate threads, 3 read threads: STARTING
&&& 2 migrate threads, 3 read threads: PASSED
&&& 2 migrate threads, 4 read threads: STARTING
&&& 2 migrate threads, 4 read threads: PASSED
&&& 3 migrate threads, 2 read threads: STARTING
....
[no progress being made]

Please find attached a new kernel log with blocked tasks shown and my 
kernel config. I hope that is helpful.

Thanks,

-- 
Evgeny Baskakov
NVIDIA



[-- Attachment #2: config --]
[-- Type: text/plain, Size: 117855 bytes --]

#
# Automatically generated file; DO NOT EDIT.
# Linux/x86 4.12.0-rc6-mm1 Kernel Configuration
#
CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_MMU=y
CONFIG_ARCH_MMAP_RND_BITS_MIN=28
CONFIG_ARCH_MMAP_RND_BITS_MAX=32
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN=8
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX=16
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
CONFIG_ZONE_DMA32=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_HAVE_INTEL_TXT=y
CONFIG_X86_64_SMP=y
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_PGTABLE_LEVELS=4
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y
CONFIG_THREAD_INFO_IN_TASK=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=""
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
CONFIG_CROSS_MEMORY_ATTACH=y
CONFIG_FHANDLE=y
CONFIG_USELIB=y
CONFIG_AUDIT=y
CONFIG_HAVE_ARCH_AUDITSYSCALL=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_WATCH=y
CONFIG_AUDIT_TREE=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_GENERIC_MSI_IRQ=y
CONFIG_GENERIC_MSI_IRQ_DOMAIN=y
# CONFIG_IRQ_DOMAIN_DEBUG is not set
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_DATA=y
CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y

#
# CPU/Task time and stats accounting
#
CONFIG_TICK_CPU_ACCOUNTING=y
# CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set
# CONFIG_IRQ_TIME_ACCOUNTING is not set
CONFIG_BSD_PROCESS_ACCT=y
# CONFIG_BSD_PROCESS_ACCT_V3 is not set
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y

#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_SRCU=y
CONFIG_TREE_SRCU=y
# CONFIG_TASKS_RCU is not set
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
# CONFIG_BUILD_BIN2C is not set
# CONFIG_IKCONFIG is not set
CONFIG_LOG_BUF_SHIFT=23
CONFIG_LOG_CPU_MAX_BUF_SHIFT=13
CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT=14
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH=y
CONFIG_ARCH_SUPPORTS_INT128=y
# CONFIG_NUMA_BALANCING is not set
CONFIG_CGROUPS=y
# CONFIG_MEMCG is not set
# CONFIG_BLK_CGROUP is not set
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
# CONFIG_CFS_BANDWIDTH is not set
# CONFIG_RT_GROUP_SCHED is not set
# CONFIG_CGROUP_PIDS is not set
# CONFIG_CGROUP_RDMA is not set
CONFIG_CGROUP_FREEZER=y
# CONFIG_CGROUP_HUGETLB is not set
CONFIG_CPUSETS=y
CONFIG_PROC_PID_CPUSET=y
# CONFIG_CGROUP_DEVICE is not set
CONFIG_CGROUP_CPUACCT=y
# CONFIG_CGROUP_PERF is not set
# CONFIG_CGROUP_DEBUG is not set
# CONFIG_SOCK_CGROUP_DATA is not set
# CONFIG_CHECKPOINT_RESTORE is not set
CONFIG_NAMESPACES=y
CONFIG_UTS_NS=y
CONFIG_IPC_NS=y
# CONFIG_USER_NS is not set
CONFIG_PID_NS=y
CONFIG_NET_NS=y
# CONFIG_SCHED_AUTOGROUP is not set
# CONFIG_SYSFS_DEPRECATED is not set
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_RD_GZIP=y
CONFIG_RD_BZIP2=y
CONFIG_RD_LZMA=y
CONFIG_RD_XZ=y
CONFIG_RD_LZO=y
CONFIG_RD_LZ4=y
CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE=y
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
CONFIG_ANON_INODES=y
CONFIG_HAVE_UID16=y
CONFIG_SYSCTL_EXCEPTION_TRACE=y
CONFIG_HAVE_PCSPKR_PLATFORM=y
CONFIG_BPF=y
CONFIG_EXPERT=y
CONFIG_UID16=y
CONFIG_MULTIUSER=y
CONFIG_SGETMASK_SYSCALL=y
CONFIG_SYSFS_SYSCALL=y
# CONFIG_SYSCTL_SYSCALL is not set
CONFIG_POSIX_TIMERS=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
CONFIG_KALLSYMS_ABSOLUTE_PERCPU=y
CONFIG_KALLSYMS_BASE_RELATIVE=y
CONFIG_PRINTK=y
CONFIG_PRINTK_NMI=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_PCSPKR_PLATFORM=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
# CONFIG_BPF_SYSCALL is not set
CONFIG_SHMEM=y
CONFIG_AIO=y
CONFIG_ADVISE_SYSCALLS=y
# CONFIG_USERFAULTFD is not set
CONFIG_PCI_QUIRKS=y
CONFIG_MEMBARRIER=y
# CONFIG_EMBEDDED is not set
CONFIG_HAVE_PERF_EVENTS=y
# CONFIG_PC104 is not set

#
# Kernel Performance Events And Counters
#
CONFIG_PERF_EVENTS=y
# CONFIG_DEBUG_PERF_USE_VMALLOC is not set
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_SLUB_DEBUG=y
# CONFIG_COMPAT_BRK is not set
# CONFIG_SLAB is not set
CONFIG_SLUB=y
# CONFIG_SLOB is not set
CONFIG_SLAB_MERGE_DEFAULT=y
# CONFIG_SLAB_FREELIST_RANDOM is not set
CONFIG_SLUB_CPU_PARTIAL=y
# CONFIG_SYSTEM_DATA_VERIFICATION is not set
CONFIG_PROFILING=y
CONFIG_TRACEPOINTS=y
CONFIG_CRASH_CORE=y
CONFIG_KEXEC_CORE=y
# CONFIG_OPROFILE is not set
CONFIG_HAVE_OPROFILE=y
CONFIG_OPROFILE_NMI_TIMER=y
CONFIG_KPROBES=y
CONFIG_JUMP_LABEL=y
# CONFIG_STATIC_KEYS_SELFTEST is not set
CONFIG_OPTPROBES=y
CONFIG_KPROBES_ON_FTRACE=y
CONFIG_UPROBES=y
# CONFIG_HAVE_64BIT_ALIGNED_ACCESS is not set
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y
CONFIG_ARCH_USE_BUILTIN_BSWAP=y
CONFIG_KRETPROBES=y
CONFIG_HAVE_IOREMAP_PROT=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_OPTPROBES=y
CONFIG_HAVE_KPROBES_ON_FTRACE=y
CONFIG_HAVE_NMI=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
CONFIG_HAVE_DMA_CONTIGUOUS=y
CONFIG_GENERIC_SMP_IDLE_THREAD=y
CONFIG_ARCH_HAS_FORTIFY_SOURCE=y
CONFIG_ARCH_HAS_SET_MEMORY=y
CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT=y
CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
CONFIG_HAVE_CLK=y
CONFIG_HAVE_DMA_API_DEBUG=y
CONFIG_HAVE_HW_BREAKPOINT=y
CONFIG_HAVE_MIXED_BREAKPOINTS_REGS=y
CONFIG_HAVE_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_PERF_EVENTS_NMI=y
CONFIG_HAVE_HARDLOCKUP_DETECTOR_PERF=y
CONFIG_HAVE_PERF_REGS=y
CONFIG_HAVE_PERF_USER_STACK_DUMP=y
CONFIG_HAVE_ARCH_JUMP_LABEL=y
CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG=y
CONFIG_HAVE_ALIGNED_STRUCT_PAGE=y
CONFIG_HAVE_CMPXCHG_LOCAL=y
CONFIG_HAVE_CMPXCHG_DOUBLE=y
CONFIG_ARCH_WANT_COMPAT_IPC_PARSE_VERSION=y
CONFIG_ARCH_WANT_OLD_COMPAT_IPC=y
CONFIG_HAVE_ARCH_SECCOMP_FILTER=y
CONFIG_SECCOMP_FILTER=y
CONFIG_HAVE_GCC_PLUGINS=y
# CONFIG_GCC_PLUGINS is not set
CONFIG_HAVE_CC_STACKPROTECTOR=y
# CONFIG_CC_STACKPROTECTOR is not set
CONFIG_CC_STACKPROTECTOR_NONE=y
# CONFIG_CC_STACKPROTECTOR_REGULAR is not set
# CONFIG_CC_STACKPROTECTOR_STRONG is not set
CONFIG_THIN_ARCHIVES=y
CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES=y
CONFIG_HAVE_CONTEXT_TRACKING=y
CONFIG_HAVE_VIRT_CPU_ACCOUNTING_GEN=y
CONFIG_HAVE_IRQ_TIME_ACCOUNTING=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD=y
CONFIG_HAVE_ARCH_HUGE_VMAP=y
CONFIG_HAVE_ARCH_SOFT_DIRTY=y
CONFIG_MODULES_USE_ELF_RELA=y
CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK=y
CONFIG_ARCH_HAS_ELF_RANDOMIZE=y
CONFIG_HAVE_ARCH_MMAP_RND_BITS=y
CONFIG_HAVE_EXIT_THREAD=y
CONFIG_ARCH_MMAP_RND_BITS=28
CONFIG_HAVE_ARCH_MMAP_RND_COMPAT_BITS=y
CONFIG_ARCH_MMAP_RND_COMPAT_BITS=8
CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES=y
CONFIG_HAVE_COPY_THREAD_TLS=y
CONFIG_HAVE_STACK_VALIDATION=y
# CONFIG_HAVE_ARCH_HASH is not set
# CONFIG_ISA_BUS_API is not set
CONFIG_OLD_SIGSUSPEND3=y
CONFIG_COMPAT_OLD_SIGACTION=y
# CONFIG_CPU_NO_EFFICIENT_FFS is not set
CONFIG_HAVE_ARCH_VMAP_STACK=y
CONFIG_VMAP_STACK=y
# CONFIG_ARCH_OPTIONAL_KERNEL_RWX is not set
# CONFIG_ARCH_OPTIONAL_KERNEL_RWX_DEFAULT is not set
CONFIG_ARCH_HAS_STRICT_KERNEL_RWX=y
CONFIG_STRICT_KERNEL_RWX=y
CONFIG_ARCH_HAS_STRICT_MODULE_RWX=y
CONFIG_STRICT_MODULE_RWX=y
# CONFIG_REFCOUNT_FULL is not set

#
# GCOV-based kernel profiling
#
# CONFIG_GCOV_KERNEL is not set
CONFIG_ARCH_HAS_GCOV_PROFILE_ALL=y
# CONFIG_HAVE_GENERIC_DMA_COHERENT is not set
CONFIG_SLABINFO=y
CONFIG_RT_MUTEXES=y
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
# CONFIG_MODULE_FORCE_LOAD is not set
CONFIG_MODULE_UNLOAD=y
CONFIG_MODULE_FORCE_UNLOAD=y
# CONFIG_MODVERSIONS is not set
# CONFIG_MODULE_SRCVERSION_ALL is not set
# CONFIG_MODULE_SIG is not set
# CONFIG_MODULE_COMPRESS is not set
# CONFIG_TRIM_UNUSED_KSYMS is not set
CONFIG_MODULES_TREE_LOOKUP=y
CONFIG_BLOCK=y
CONFIG_BLK_SCSI_REQUEST=y
CONFIG_BLK_DEV_BSG=y
# CONFIG_BLK_DEV_BSGLIB is not set
# CONFIG_BLK_DEV_INTEGRITY is not set
# CONFIG_BLK_DEV_ZONED is not set
# CONFIG_BLK_CMDLINE_PARSER is not set
# CONFIG_BLK_WBT is not set
CONFIG_BLK_DEBUG_FS=y
# CONFIG_BLK_SED_OPAL is not set

#
# Partition Types
#
CONFIG_PARTITION_ADVANCED=y
# CONFIG_ACORN_PARTITION is not set
# CONFIG_AIX_PARTITION is not set
CONFIG_OSF_PARTITION=y
CONFIG_AMIGA_PARTITION=y
# CONFIG_ATARI_PARTITION is not set
CONFIG_MAC_PARTITION=y
CONFIG_MSDOS_PARTITION=y
CONFIG_BSD_DISKLABEL=y
CONFIG_MINIX_SUBPARTITION=y
CONFIG_SOLARIS_X86_PARTITION=y
CONFIG_UNIXWARE_DISKLABEL=y
# CONFIG_LDM_PARTITION is not set
CONFIG_SGI_PARTITION=y
# CONFIG_ULTRIX_PARTITION is not set
CONFIG_SUN_PARTITION=y
CONFIG_KARMA_PARTITION=y
CONFIG_EFI_PARTITION=y
# CONFIG_SYSV68_PARTITION is not set
# CONFIG_CMDLINE_PARTITION is not set
CONFIG_BLOCK_COMPAT=y
CONFIG_BLK_MQ_PCI=y

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"
CONFIG_MQ_IOSCHED_DEADLINE=y
CONFIG_MQ_IOSCHED_KYBER=y
# CONFIG_IOSCHED_BFQ is not set
CONFIG_INLINE_SPIN_UNLOCK_IRQ=y
CONFIG_INLINE_READ_UNLOCK=y
CONFIG_INLINE_READ_UNLOCK_IRQ=y
CONFIG_INLINE_WRITE_UNLOCK=y
CONFIG_INLINE_WRITE_UNLOCK_IRQ=y
CONFIG_ARCH_SUPPORTS_ATOMIC_RMW=y
CONFIG_MUTEX_SPIN_ON_OWNER=y
CONFIG_RWSEM_SPIN_ON_OWNER=y
CONFIG_LOCK_SPIN_ON_OWNER=y
CONFIG_ARCH_USE_QUEUED_SPINLOCKS=y
CONFIG_QUEUED_SPINLOCKS=y
CONFIG_ARCH_USE_QUEUED_RWLOCKS=y
CONFIG_QUEUED_RWLOCKS=y
CONFIG_FREEZER=y

#
# Processor type and features
#
CONFIG_ZONE_DMA=y
CONFIG_SMP=y
CONFIG_X86_FEATURE_NAMES=y
CONFIG_X86_FAST_FEATURE_TESTS=y
CONFIG_X86_MPPARSE=y
# CONFIG_GOLDFISH is not set
# CONFIG_INTEL_RDT_A is not set
CONFIG_X86_EXTENDED_PLATFORM=y
# CONFIG_X86_VSMP is not set
# CONFIG_X86_GOLDFISH is not set
# CONFIG_X86_INTEL_MID is not set
# CONFIG_X86_INTEL_LPSS is not set
# CONFIG_X86_AMD_PLATFORM_DEVICE is not set
CONFIG_IOSF_MBI=y
# CONFIG_IOSF_MBI_DEBUG is not set
CONFIG_X86_SUPPORTS_MEMORY_FAILURE=y
CONFIG_SCHED_OMIT_FRAME_POINTER=y
# CONFIG_HYPERVISOR_GUEST is not set
CONFIG_NO_BOOTMEM=y
# CONFIG_MK8 is not set
# CONFIG_MPSC is not set
# CONFIG_MCORE2 is not set
# CONFIG_MATOM is not set
CONFIG_GENERIC_CPU=y
CONFIG_X86_INTERNODE_CACHE_SHIFT=6
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_TSC=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=64
CONFIG_X86_DEBUGCTLMSR=y
# CONFIG_PROCESSOR_SELECT is not set
CONFIG_CPU_SUP_INTEL=y
CONFIG_CPU_SUP_AMD=y
CONFIG_CPU_SUP_CENTAUR=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_DMI=y
# CONFIG_GART_IOMMU is not set
CONFIG_CALGARY_IOMMU=y
CONFIG_CALGARY_IOMMU_ENABLED_BY_DEFAULT=y
CONFIG_SWIOTLB=y
CONFIG_IOMMU_HELPER=y
# CONFIG_MAXSMP is not set
CONFIG_NR_CPUS=64
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
CONFIG_SCHED_MC_PRIO=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS=y
CONFIG_X86_MCE=y
# CONFIG_X86_MCELOG_LEGACY is not set
CONFIG_X86_MCE_INTEL=y
CONFIG_X86_MCE_AMD=y
CONFIG_X86_MCE_THRESHOLD=y
# CONFIG_X86_MCE_INJECT is not set
CONFIG_X86_THERMAL_VECTOR=y

#
# Performance monitoring
#
CONFIG_PERF_EVENTS_INTEL_UNCORE=y
CONFIG_PERF_EVENTS_INTEL_RAPL=y
CONFIG_PERF_EVENTS_INTEL_CSTATE=y
# CONFIG_PERF_EVENTS_AMD_POWER is not set
# CONFIG_VM86 is not set
CONFIG_X86_16BIT=y
CONFIG_X86_ESPFIX64=y
CONFIG_X86_VSYSCALL_EMULATION=y
# CONFIG_I8K is not set
CONFIG_MICROCODE=y
CONFIG_MICROCODE_INTEL=y
CONFIG_MICROCODE_AMD=y
CONFIG_MICROCODE_OLD_INTERFACE=y
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=y
CONFIG_ARCH_PHYS_ADDR_T_64BIT=y
CONFIG_ARCH_DMA_ADDR_T_64BIT=y
CONFIG_NUMA=y
CONFIG_AMD_NUMA=y
CONFIG_X86_64_ACPI_NUMA=y
CONFIG_NODES_SPAN_OTHER_NODES=y
# CONFIG_NUMA_EMU is not set
CONFIG_NODES_SHIFT=6
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SPARSEMEM_DEFAULT=y
CONFIG_ARCH_SELECT_MEMORY_MODEL=y
# CONFIG_ARCH_MEMORY_PROBE is not set
CONFIG_ARCH_PROC_KCORE_TEXT=y
CONFIG_ILLEGAL_POINTER_VALUE=0xdead000000000000
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_SPARSEMEM_MANUAL=y
CONFIG_SPARSEMEM=y
CONFIG_NEED_MULTIPLE_NODES=y
CONFIG_HAVE_MEMORY_PRESENT=y
CONFIG_SPARSEMEM_EXTREME=y
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER=y
CONFIG_SPARSEMEM_VMEMMAP=y
CONFIG_HAVE_MEMBLOCK=y
CONFIG_HAVE_MEMBLOCK_NODE_MAP=y
CONFIG_HAVE_GENERIC_GUP=y
CONFIG_ARCH_DISCARD_MEMBLOCK=y
CONFIG_MEMORY_ISOLATION=y
CONFIG_HAVE_BOOTMEM_INFO_NODE=y
CONFIG_MEMORY_HOTPLUG=y
CONFIG_MEMORY_HOTPLUG_SPARSE=y
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y
CONFIG_MEMORY_HOTREMOVE=y
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK=y
CONFIG_COMPACTION=y
CONFIG_MIGRATION=y
CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION=y
CONFIG_ARCH_HAS_HMM=y
CONFIG_HMM=y
CONFIG_HMM_MIRROR=y
CONFIG_PHYS_ADDR_T_64BIT=y
CONFIG_BOUNCE=y
CONFIG_VIRT_TO_BUS=y
CONFIG_MMU_NOTIFIER=y
# CONFIG_KSM is not set
CONFIG_DEFAULT_MMAP_MIN_ADDR=4096
CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE=y
# CONFIG_MEMORY_FAILURE is not set
# CONFIG_TRANSPARENT_HUGEPAGE is not set
CONFIG_ARCH_WANTS_THP_SWAP=y
# CONFIG_CLEANCACHE is not set
# CONFIG_FRONTSWAP is not set
# CONFIG_CMA is not set
# CONFIG_ZPOOL is not set
# CONFIG_ZBUD is not set
# CONFIG_ZSMALLOC is not set
CONFIG_GENERIC_EARLY_IOREMAP=y
CONFIG_ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT=y
# CONFIG_DEFERRED_STRUCT_PAGE_INIT is not set
# CONFIG_IDLE_PAGE_TRACKING is not set
CONFIG_ZONE_DEVICE=y
CONFIG_DEVICE_PRIVATE=y
CONFIG_ARCH_USES_HIGH_VMA_FLAGS=y
CONFIG_ARCH_HAS_PKEYS=y
# CONFIG_PERCPU_STATS is not set
# CONFIG_X86_PMEM_LEGACY is not set
CONFIG_X86_CHECK_BIOS_CORRUPTION=y
CONFIG_X86_BOOTPARAM_MEMORY_CORRUPTION_CHECK=y
CONFIG_X86_RESERVE_LOW=64
CONFIG_MTRR=y
# CONFIG_MTRR_SANITIZER is not set
CONFIG_X86_PAT=y
CONFIG_ARCH_USES_PG_UNCACHED=y
CONFIG_ARCH_RANDOM=y
CONFIG_X86_SMAP=y
# CONFIG_X86_INTEL_MPX is not set
CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS=y
CONFIG_EFI=y
# CONFIG_EFI_STUB is not set
CONFIG_SECCOMP=y
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
CONFIG_SCHED_HRTICK=y
CONFIG_KEXEC=y
# CONFIG_KEXEC_FILE is not set
CONFIG_CRASH_DUMP=y
# CONFIG_KEXEC_JUMP is not set
CONFIG_PHYSICAL_START=0x1000000
CONFIG_RELOCATABLE=y
# CONFIG_RANDOMIZE_BASE is not set
CONFIG_PHYSICAL_ALIGN=0x200000
CONFIG_HOTPLUG_CPU=y
# CONFIG_BOOTPARAM_HOTPLUG_CPU0 is not set
# CONFIG_DEBUG_HOTPLUG_CPU0 is not set
# CONFIG_COMPAT_VDSO is not set
# CONFIG_LEGACY_VSYSCALL_NATIVE is not set
CONFIG_LEGACY_VSYSCALL_EMULATE=y
# CONFIG_LEGACY_VSYSCALL_NONE is not set
# CONFIG_CMDLINE_BOOL is not set
CONFIG_MODIFY_LDT_SYSCALL=y
CONFIG_HAVE_LIVEPATCH=y
# CONFIG_LIVEPATCH is not set
CONFIG_ARCH_HAS_ADD_PAGES=y
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE=y
CONFIG_USE_PERCPU_NUMA_NODE_ID=y

#
# Power management and ACPI options
#
CONFIG_ARCH_HIBERNATION_HEADER=y
CONFIG_SUSPEND=y
CONFIG_SUSPEND_FREEZER=y
# CONFIG_SUSPEND_SKIP_SYNC is not set
CONFIG_HIBERNATE_CALLBACKS=y
CONFIG_HIBERNATION=y
CONFIG_PM_STD_PARTITION=""
CONFIG_PM_SLEEP=y
CONFIG_PM_SLEEP_SMP=y
# CONFIG_PM_AUTOSLEEP is not set
# CONFIG_PM_WAKELOCKS is not set
CONFIG_PM=y
CONFIG_PM_DEBUG=y
# CONFIG_PM_ADVANCED_DEBUG is not set
# CONFIG_PM_TEST_SUSPEND is not set
CONFIG_PM_SLEEP_DEBUG=y
CONFIG_PM_TRACE=y
CONFIG_PM_TRACE_RTC=y
CONFIG_PM_CLK=y
# CONFIG_WQ_POWER_EFFICIENT_DEFAULT is not set
CONFIG_ACPI=y
CONFIG_ACPI_LEGACY_TABLES_LOOKUP=y
CONFIG_ARCH_MIGHT_HAVE_ACPI_PDC=y
CONFIG_ACPI_SYSTEM_POWER_STATES_SUPPORT=y
# CONFIG_ACPI_DEBUGGER is not set
CONFIG_ACPI_SLEEP=y
# CONFIG_ACPI_PROCFS_POWER is not set
CONFIG_ACPI_REV_OVERRIDE_POSSIBLE=y
# CONFIG_ACPI_EC_DEBUGFS is not set
CONFIG_ACPI_AC=y
CONFIG_ACPI_BATTERY=y
CONFIG_ACPI_BUTTON=y
CONFIG_ACPI_VIDEO=y
CONFIG_ACPI_FAN=y
CONFIG_ACPI_DOCK=y
CONFIG_ACPI_CPU_FREQ_PSS=y
CONFIG_ACPI_PROCESSOR_CSTATE=y
CONFIG_ACPI_PROCESSOR_IDLE=y
CONFIG_ACPI_CPPC_LIB=y
CONFIG_ACPI_PROCESSOR=y
CONFIG_ACPI_HOTPLUG_CPU=y
# CONFIG_ACPI_PROCESSOR_AGGREGATOR is not set
CONFIG_ACPI_THERMAL=y
CONFIG_ACPI_NUMA=y
# CONFIG_ACPI_CUSTOM_DSDT is not set
CONFIG_ARCH_HAS_ACPI_TABLE_UPGRADE=y
CONFIG_ACPI_TABLE_UPGRADE=y
# CONFIG_ACPI_DEBUG is not set
# CONFIG_ACPI_PCI_SLOT is not set
CONFIG_X86_PM_TIMER=y
CONFIG_ACPI_CONTAINER=y
# CONFIG_ACPI_HOTPLUG_MEMORY is not set
CONFIG_ACPI_HOTPLUG_IOAPIC=y
# CONFIG_ACPI_SBS is not set
# CONFIG_ACPI_HED is not set
# CONFIG_ACPI_CUSTOM_METHOD is not set
# CONFIG_ACPI_BGRT is not set
# CONFIG_ACPI_REDUCED_HARDWARE_ONLY is not set
# CONFIG_ACPI_NFIT is not set
CONFIG_HAVE_ACPI_APEI=y
CONFIG_HAVE_ACPI_APEI_NMI=y
# CONFIG_ACPI_APEI is not set
# CONFIG_DPTF_POWER is not set
# CONFIG_ACPI_EXTLOG is not set
# CONFIG_PMIC_OPREGION is not set
# CONFIG_ACPI_CONFIGFS is not set
# CONFIG_SFI is not set

#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_GOV_ATTR_SET=y
CONFIG_CPU_FREQ_GOV_COMMON=y
# CONFIG_CPU_FREQ_STAT is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
# CONFIG_CPU_FREQ_GOV_POWERSAVE is not set
CONFIG_CPU_FREQ_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
# CONFIG_CPU_FREQ_GOV_CONSERVATIVE is not set
# CONFIG_CPU_FREQ_GOV_SCHEDUTIL is not set

#
# CPU frequency scaling drivers
#
CONFIG_X86_INTEL_PSTATE=y
# CONFIG_X86_PCC_CPUFREQ is not set
CONFIG_X86_ACPI_CPUFREQ=y
CONFIG_X86_ACPI_CPUFREQ_CPB=y
# CONFIG_X86_POWERNOW_K8 is not set
# CONFIG_X86_AMD_FREQ_SENSITIVITY is not set
# CONFIG_X86_SPEEDSTEP_CENTRINO is not set
# CONFIG_X86_P4_CLOCKMOD is not set

#
# shared options
#
# CONFIG_X86_SPEEDSTEP_LIB is not set

#
# CPU Idle
#
CONFIG_CPU_IDLE=y
# CONFIG_CPU_IDLE_GOV_LADDER is not set
CONFIG_CPU_IDLE_GOV_MENU=y
# CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED is not set
# CONFIG_INTEL_IDLE is not set

#
# Bus options (PCI etc.)
#
CONFIG_PCI=y
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
CONFIG_PCI_DOMAINS=y
# CONFIG_PCI_CNB20LE_QUIRK is not set
CONFIG_PCIEPORTBUS=y
# CONFIG_HOTPLUG_PCI_PCIE is not set
CONFIG_PCIEAER=y
# CONFIG_PCIE_ECRC is not set
# CONFIG_PCIEAER_INJECT is not set
CONFIG_PCIEASPM=y
# CONFIG_PCIEASPM_DEBUG is not set
CONFIG_PCIEASPM_DEFAULT=y
# CONFIG_PCIEASPM_POWERSAVE is not set
# CONFIG_PCIEASPM_POWER_SUPERSAVE is not set
# CONFIG_PCIEASPM_PERFORMANCE is not set
CONFIG_PCIE_PME=y
# CONFIG_PCIE_DPC is not set
# CONFIG_PCIE_PTM is not set
CONFIG_PCI_BUS_ADDR_T_64BIT=y
CONFIG_PCI_MSI=y
CONFIG_PCI_MSI_IRQ_DOMAIN=y
# CONFIG_PCI_DEBUG is not set
# CONFIG_PCI_REALLOC_ENABLE_AUTO is not set
# CONFIG_PCI_STUB is not set
CONFIG_HT_IRQ=y
CONFIG_PCI_ATS=y
# CONFIG_PCI_IOV is not set
CONFIG_PCI_PRI=y
CONFIG_PCI_PASID=y
CONFIG_PCI_LABEL=y
CONFIG_HOTPLUG_PCI=y
# CONFIG_HOTPLUG_PCI_ACPI is not set
# CONFIG_HOTPLUG_PCI_CPCI is not set
# CONFIG_HOTPLUG_PCI_SHPC is not set

#
# DesignWare PCI Core Support
#
# CONFIG_PCIE_DW_PLAT is not set

#
# PCI host controller drivers
#
# CONFIG_VMD is not set

#
# PCI Endpoint
#
# CONFIG_PCI_ENDPOINT is not set

#
# PCI switch controller drivers
#
# CONFIG_PCI_SW_SWITCHTEC is not set
# CONFIG_ISA_BUS is not set
CONFIG_ISA_DMA_API=y
CONFIG_AMD_NB=y
CONFIG_PCCARD=y
CONFIG_PCMCIA=y
CONFIG_PCMCIA_LOAD_CIS=y
CONFIG_CARDBUS=y

#
# PC-card bridges
#
CONFIG_YENTA=y
CONFIG_YENTA_O2=y
CONFIG_YENTA_RICOH=y
CONFIG_YENTA_TI=y
CONFIG_YENTA_ENE_TUNE=y
CONFIG_YENTA_TOSHIBA=y
# CONFIG_PD6729 is not set
# CONFIG_I82092 is not set
CONFIG_PCCARD_NONSTATIC=y
# CONFIG_RAPIDIO is not set
# CONFIG_X86_SYSFB is not set

#
# Executable file formats / Emulations
#
CONFIG_BINFMT_ELF=y
CONFIG_COMPAT_BINFMT_ELF=y
CONFIG_ELFCORE=y
CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS=y
CONFIG_BINFMT_SCRIPT=y
# CONFIG_HAVE_AOUT is not set
CONFIG_BINFMT_MISC=y
CONFIG_COREDUMP=y
CONFIG_IA32_EMULATION=y
# CONFIG_IA32_AOUT is not set
# CONFIG_X86_X32 is not set
CONFIG_COMPAT_32=y
CONFIG_COMPAT=y
CONFIG_COMPAT_FOR_U64_ALIGNMENT=y
CONFIG_SYSVIPC_COMPAT=y
CONFIG_X86_DEV_DMA_OPS=y
CONFIG_NET=y
CONFIG_NET_INGRESS=y

#
# Networking options
#
CONFIG_PACKET=y
# CONFIG_PACKET_DIAG is not set
CONFIG_UNIX=y
# CONFIG_UNIX_DIAG is not set
# CONFIG_TLS is not set
CONFIG_XFRM=y
CONFIG_XFRM_ALGO=y
CONFIG_XFRM_USER=y
# CONFIG_XFRM_SUB_POLICY is not set
# CONFIG_XFRM_MIGRATE is not set
# CONFIG_XFRM_STATISTICS is not set
# CONFIG_NET_KEY is not set
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
# CONFIG_IP_FIB_TRIE_STATS is not set
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_IP_ROUTE_MULTIPATH=y
CONFIG_IP_ROUTE_VERBOSE=y
CONFIG_IP_PNP=y
CONFIG_IP_PNP_DHCP=y
CONFIG_IP_PNP_BOOTP=y
CONFIG_IP_PNP_RARP=y
# CONFIG_NET_IPIP is not set
# CONFIG_NET_IPGRE_DEMUX is not set
CONFIG_NET_IP_TUNNEL=y
CONFIG_IP_MROUTE=y
# CONFIG_IP_MROUTE_MULTIPLE_TABLES is not set
CONFIG_IP_PIMSM_V1=y
CONFIG_IP_PIMSM_V2=y
CONFIG_SYN_COOKIES=y
# CONFIG_NET_UDP_TUNNEL is not set
# CONFIG_NET_FOU is not set
# CONFIG_NET_FOU_IP_TUNNELS is not set
# CONFIG_INET_AH is not set
# CONFIG_INET_ESP is not set
# CONFIG_INET_IPCOMP is not set
# CONFIG_INET_XFRM_TUNNEL is not set
CONFIG_INET_TUNNEL=y
# CONFIG_INET_XFRM_MODE_TRANSPORT is not set
# CONFIG_INET_XFRM_MODE_TUNNEL is not set
# CONFIG_INET_XFRM_MODE_BEET is not set
# CONFIG_INET_DIAG is not set
CONFIG_TCP_CONG_ADVANCED=y
# CONFIG_TCP_CONG_BIC is not set
CONFIG_TCP_CONG_CUBIC=y
# CONFIG_TCP_CONG_WESTWOOD is not set
# CONFIG_TCP_CONG_HTCP is not set
# CONFIG_TCP_CONG_HSTCP is not set
# CONFIG_TCP_CONG_HYBLA is not set
# CONFIG_TCP_CONG_VEGAS is not set
# CONFIG_TCP_CONG_NV is not set
# CONFIG_TCP_CONG_SCALABLE is not set
# CONFIG_TCP_CONG_LP is not set
# CONFIG_TCP_CONG_VENO is not set
# CONFIG_TCP_CONG_YEAH is not set
# CONFIG_TCP_CONG_ILLINOIS is not set
# CONFIG_TCP_CONG_DCTCP is not set
# CONFIG_TCP_CONG_CDG is not set
# CONFIG_TCP_CONG_BBR is not set
CONFIG_DEFAULT_CUBIC=y
# CONFIG_DEFAULT_RENO is not set
CONFIG_DEFAULT_TCP_CONG="cubic"
CONFIG_TCP_MD5SIG=y
CONFIG_IPV6=y
# CONFIG_IPV6_ROUTER_PREF is not set
# CONFIG_IPV6_OPTIMISTIC_DAD is not set
CONFIG_INET6_AH=y
CONFIG_INET6_ESP=y
# CONFIG_INET6_ESP_OFFLOAD is not set
# CONFIG_INET6_IPCOMP is not set
# CONFIG_IPV6_MIP6 is not set
# CONFIG_IPV6_ILA is not set
# CONFIG_INET6_XFRM_TUNNEL is not set
# CONFIG_INET6_TUNNEL is not set
CONFIG_INET6_XFRM_MODE_TRANSPORT=y
CONFIG_INET6_XFRM_MODE_TUNNEL=y
CONFIG_INET6_XFRM_MODE_BEET=y
# CONFIG_INET6_XFRM_MODE_ROUTEOPTIMIZATION is not set
# CONFIG_IPV6_VTI is not set
CONFIG_IPV6_SIT=y
# CONFIG_IPV6_SIT_6RD is not set
CONFIG_IPV6_NDISC_NODETYPE=y
# CONFIG_IPV6_TUNNEL is not set
# CONFIG_IPV6_FOU is not set
# CONFIG_IPV6_FOU_TUNNEL is not set
# CONFIG_IPV6_MULTIPLE_TABLES is not set
# CONFIG_IPV6_MROUTE is not set
# CONFIG_IPV6_SEG6_LWTUNNEL is not set
# CONFIG_IPV6_SEG6_HMAC is not set
CONFIG_NETLABEL=y
CONFIG_NETWORK_SECMARK=y
CONFIG_NET_PTP_CLASSIFY=y
# CONFIG_NETWORK_PHY_TIMESTAMPING is not set
CONFIG_NETFILTER=y
# CONFIG_NETFILTER_DEBUG is not set
# CONFIG_NETFILTER_ADVANCED is not set

#
# Core Netfilter Configuration
#
CONFIG_NETFILTER_INGRESS=y
CONFIG_NETFILTER_NETLINK=y
CONFIG_NETFILTER_NETLINK_LOG=y
CONFIG_NF_CONNTRACK=y
CONFIG_NF_LOG_COMMON=m
# CONFIG_NF_LOG_NETDEV is not set
CONFIG_NF_CONNTRACK_SECMARK=y
CONFIG_NF_CONNTRACK_PROCFS=y
CONFIG_NF_CONNTRACK_FTP=y
CONFIG_NF_CONNTRACK_IRC=y
# CONFIG_NF_CONNTRACK_NETBIOS_NS is not set
CONFIG_NF_CONNTRACK_SIP=y
CONFIG_NF_CT_NETLINK=y
# CONFIG_NETFILTER_NETLINK_GLUE_CT is not set
CONFIG_NF_NAT=m
CONFIG_NF_NAT_NEEDED=y
# CONFIG_NF_NAT_AMANDA is not set
CONFIG_NF_NAT_FTP=m
CONFIG_NF_NAT_IRC=m
CONFIG_NF_NAT_SIP=m
# CONFIG_NF_NAT_TFTP is not set
# CONFIG_NF_NAT_REDIRECT is not set
# CONFIG_NF_TABLES is not set
CONFIG_NETFILTER_XTABLES=y

#
# Xtables combined modules
#
CONFIG_NETFILTER_XT_MARK=m

#
# Xtables targets
#
CONFIG_NETFILTER_XT_TARGET_CONNSECMARK=y
CONFIG_NETFILTER_XT_TARGET_LOG=m
CONFIG_NETFILTER_XT_NAT=m
# CONFIG_NETFILTER_XT_TARGET_NETMAP is not set
CONFIG_NETFILTER_XT_TARGET_NFLOG=y
# CONFIG_NETFILTER_XT_TARGET_REDIRECT is not set
CONFIG_NETFILTER_XT_TARGET_SECMARK=y
CONFIG_NETFILTER_XT_TARGET_TCPMSS=y

#
# Xtables matches
#
CONFIG_NETFILTER_XT_MATCH_ADDRTYPE=m
CONFIG_NETFILTER_XT_MATCH_CONNTRACK=y
CONFIG_NETFILTER_XT_MATCH_POLICY=y
CONFIG_NETFILTER_XT_MATCH_STATE=y
# CONFIG_IP_SET is not set
# CONFIG_IP_VS is not set

#
# IP: Netfilter Configuration
#
CONFIG_NF_DEFRAG_IPV4=y
CONFIG_NF_CONNTRACK_IPV4=y
# CONFIG_NF_SOCKET_IPV4 is not set
# CONFIG_NF_DUP_IPV4 is not set
CONFIG_NF_LOG_ARP=m
CONFIG_NF_LOG_IPV4=m
CONFIG_NF_REJECT_IPV4=y
CONFIG_NF_NAT_IPV4=m
CONFIG_NF_NAT_MASQUERADE_IPV4=m
# CONFIG_NF_NAT_PPTP is not set
# CONFIG_NF_NAT_H323 is not set
CONFIG_IP_NF_IPTABLES=y
CONFIG_IP_NF_FILTER=y
CONFIG_IP_NF_TARGET_REJECT=y
CONFIG_IP_NF_NAT=m
CONFIG_IP_NF_TARGET_MASQUERADE=m
CONFIG_IP_NF_MANGLE=y
# CONFIG_IP_NF_RAW is not set

#
# IPv6: Netfilter Configuration
#
CONFIG_NF_DEFRAG_IPV6=y
CONFIG_NF_CONNTRACK_IPV6=y
# CONFIG_NF_SOCKET_IPV6 is not set
# CONFIG_NF_DUP_IPV6 is not set
CONFIG_NF_REJECT_IPV6=y
CONFIG_NF_LOG_IPV6=m
CONFIG_IP6_NF_IPTABLES=y
CONFIG_IP6_NF_MATCH_IPV6HEADER=y
CONFIG_IP6_NF_FILTER=y
CONFIG_IP6_NF_TARGET_REJECT=y
CONFIG_IP6_NF_MANGLE=y
# CONFIG_IP6_NF_RAW is not set
# CONFIG_IP_DCCP is not set
# CONFIG_IP_SCTP is not set
# CONFIG_RDS is not set
# CONFIG_TIPC is not set
# CONFIG_ATM is not set
# CONFIG_L2TP is not set
# CONFIG_BRIDGE is not set
CONFIG_HAVE_NET_DSA=y
# CONFIG_NET_DSA is not set
# CONFIG_VLAN_8021Q is not set
# CONFIG_DECNET is not set
# CONFIG_LLC2 is not set
# CONFIG_IPX is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_PHONET is not set
# CONFIG_6LOWPAN is not set
# CONFIG_IEEE802154 is not set
CONFIG_NET_SCHED=y

#
# Queueing/Scheduling
#
# CONFIG_NET_SCH_CBQ is not set
# CONFIG_NET_SCH_HTB is not set
# CONFIG_NET_SCH_HFSC is not set
# CONFIG_NET_SCH_PRIO is not set
# CONFIG_NET_SCH_MULTIQ is not set
# CONFIG_NET_SCH_RED is not set
# CONFIG_NET_SCH_SFB is not set
# CONFIG_NET_SCH_SFQ is not set
# CONFIG_NET_SCH_TEQL is not set
# CONFIG_NET_SCH_TBF is not set
# CONFIG_NET_SCH_GRED is not set
# CONFIG_NET_SCH_DSMARK is not set
# CONFIG_NET_SCH_NETEM is not set
# CONFIG_NET_SCH_DRR is not set
# CONFIG_NET_SCH_MQPRIO is not set
# CONFIG_NET_SCH_CHOKE is not set
# CONFIG_NET_SCH_QFQ is not set
# CONFIG_NET_SCH_CODEL is not set
# CONFIG_NET_SCH_FQ_CODEL is not set
# CONFIG_NET_SCH_FQ is not set
# CONFIG_NET_SCH_HHF is not set
# CONFIG_NET_SCH_PIE is not set
# CONFIG_NET_SCH_INGRESS is not set
# CONFIG_NET_SCH_PLUG is not set
# CONFIG_NET_SCH_DEFAULT is not set

#
# Classification
#
CONFIG_NET_CLS=y
# CONFIG_NET_CLS_BASIC is not set
# CONFIG_NET_CLS_TCINDEX is not set
# CONFIG_NET_CLS_ROUTE4 is not set
# CONFIG_NET_CLS_FW is not set
# CONFIG_NET_CLS_U32 is not set
# CONFIG_NET_CLS_RSVP is not set
# CONFIG_NET_CLS_RSVP6 is not set
# CONFIG_NET_CLS_FLOW is not set
# CONFIG_NET_CLS_CGROUP is not set
# CONFIG_NET_CLS_BPF is not set
# CONFIG_NET_CLS_FLOWER is not set
# CONFIG_NET_CLS_MATCHALL is not set
CONFIG_NET_EMATCH=y
CONFIG_NET_EMATCH_STACK=32
# CONFIG_NET_EMATCH_CMP is not set
# CONFIG_NET_EMATCH_NBYTE is not set
# CONFIG_NET_EMATCH_U32 is not set
# CONFIG_NET_EMATCH_META is not set
# CONFIG_NET_EMATCH_TEXT is not set
CONFIG_NET_CLS_ACT=y
# CONFIG_NET_ACT_POLICE is not set
# CONFIG_NET_ACT_GACT is not set
# CONFIG_NET_ACT_MIRRED is not set
# CONFIG_NET_ACT_SAMPLE is not set
# CONFIG_NET_ACT_IPT is not set
# CONFIG_NET_ACT_NAT is not set
# CONFIG_NET_ACT_PEDIT is not set
# CONFIG_NET_ACT_SIMP is not set
# CONFIG_NET_ACT_SKBEDIT is not set
# CONFIG_NET_ACT_CSUM is not set
# CONFIG_NET_ACT_VLAN is not set
# CONFIG_NET_ACT_BPF is not set
# CONFIG_NET_ACT_SKBMOD is not set
# CONFIG_NET_ACT_IFE is not set
# CONFIG_NET_ACT_TUNNEL_KEY is not set
CONFIG_NET_SCH_FIFO=y
# CONFIG_DCB is not set
CONFIG_DNS_RESOLVER=y
# CONFIG_BATMAN_ADV is not set
# CONFIG_OPENVSWITCH is not set
# CONFIG_VSOCKETS is not set
# CONFIG_NETLINK_DIAG is not set
# CONFIG_MPLS is not set
# CONFIG_HSR is not set
# CONFIG_NET_SWITCHDEV is not set
# CONFIG_NET_L3_MASTER_DEV is not set
# CONFIG_NET_NCSI is not set
CONFIG_RPS=y
CONFIG_RFS_ACCEL=y
CONFIG_XPS=y
# CONFIG_CGROUP_NET_PRIO is not set
# CONFIG_CGROUP_NET_CLASSID is not set
CONFIG_NET_RX_BUSY_POLL=y
CONFIG_BQL=y
# CONFIG_BPF_JIT is not set
CONFIG_NET_FLOW_LIMIT=y

#
# Network testing
#
# CONFIG_NET_PKTGEN is not set
# CONFIG_NET_TCPPROBE is not set
# CONFIG_NET_DROP_MONITOR is not set
CONFIG_HAMRADIO=y

#
# Packet Radio protocols
#
# CONFIG_AX25 is not set
# CONFIG_CAN is not set
# CONFIG_IRDA is not set
# CONFIG_BT is not set
# CONFIG_AF_RXRPC is not set
# CONFIG_AF_KCM is not set
# CONFIG_STREAM_PARSER is not set
CONFIG_FIB_RULES=y
CONFIG_WIRELESS=y
CONFIG_CFG80211=y
# CONFIG_NL80211_TESTMODE is not set
# CONFIG_CFG80211_DEVELOPER_WARNINGS is not set
# CONFIG_CFG80211_CERTIFICATION_ONUS is not set
CONFIG_CFG80211_DEFAULT_PS=y
# CONFIG_CFG80211_DEBUGFS is not set
# CONFIG_CFG80211_INTERNAL_REGDB is not set
CONFIG_CFG80211_CRDA_SUPPORT=y
# CONFIG_CFG80211_WEXT is not set
# CONFIG_LIB80211 is not set
CONFIG_MAC80211=y
CONFIG_MAC80211_HAS_RC=y
CONFIG_MAC80211_RC_MINSTREL=y
CONFIG_MAC80211_RC_MINSTREL_HT=y
# CONFIG_MAC80211_RC_MINSTREL_VHT is not set
CONFIG_MAC80211_RC_DEFAULT_MINSTREL=y
CONFIG_MAC80211_RC_DEFAULT="minstrel_ht"
# CONFIG_MAC80211_MESH is not set
CONFIG_MAC80211_LEDS=y
# CONFIG_MAC80211_DEBUGFS is not set
# CONFIG_MAC80211_MESSAGE_TRACING is not set
# CONFIG_MAC80211_DEBUG_MENU is not set
CONFIG_MAC80211_STA_HASH_MAX_SIZE=0
# CONFIG_WIMAX is not set
CONFIG_RFKILL=y
CONFIG_RFKILL_LEDS=y
# CONFIG_RFKILL_INPUT is not set
# CONFIG_NET_9P is not set
# CONFIG_CAIF is not set
# CONFIG_CEPH_LIB is not set
# CONFIG_NFC is not set
# CONFIG_PSAMPLE is not set
# CONFIG_NET_IFE is not set
# CONFIG_LWTUNNEL is not set
CONFIG_DST_CACHE=y
CONFIG_GRO_CELLS=y
# CONFIG_NET_DEVLINK is not set
CONFIG_MAY_USE_DEVLINK=y
CONFIG_HAVE_EBPF_JIT=y

#
# Device Drivers
#

#
# Generic Driver Options
#
CONFIG_UEVENT_HELPER=y
CONFIG_UEVENT_HELPER_PATH="/sbin/hotplug"
CONFIG_DEVTMPFS=y
# CONFIG_DEVTMPFS_MOUNT is not set
CONFIG_STANDALONE=y
CONFIG_PREVENT_FIRMWARE_BUILD=y
CONFIG_FW_LOADER=y
CONFIG_FIRMWARE_IN_KERNEL=y
CONFIG_EXTRA_FIRMWARE=""
# CONFIG_FW_LOADER_USER_HELPER_FALLBACK is not set
CONFIG_ALLOW_DEV_COREDUMP=y
# CONFIG_DEBUG_DRIVER is not set
CONFIG_DEBUG_DEVRES=y
# CONFIG_DEBUG_TEST_DRIVER_REMOVE is not set
# CONFIG_TEST_ASYNC_DRIVER_PROBE is not set
# CONFIG_SYS_HYPERVISOR is not set
# CONFIG_GENERIC_CPU_DEVICES is not set
CONFIG_GENERIC_CPU_AUTOPROBE=y
CONFIG_REGMAP=y
CONFIG_REGMAP_I2C=y
CONFIG_DMA_SHARED_BUFFER=y
# CONFIG_DMA_FENCE_TRACE is not set

#
# Bus devices
#
CONFIG_CONNECTOR=y
CONFIG_PROC_EVENTS=y
# CONFIG_MTD is not set
# CONFIG_OF is not set
CONFIG_ARCH_MIGHT_HAVE_PC_PARPORT=y
# CONFIG_PARPORT is not set
CONFIG_PNP=y
CONFIG_PNP_DEBUG_MESSAGES=y

#
# Protocols
#
CONFIG_PNPACPI=y
CONFIG_BLK_DEV=y
# CONFIG_BLK_DEV_NULL_BLK is not set
# CONFIG_BLK_DEV_FD is not set
# CONFIG_BLK_DEV_PCIESSD_MTIP32XX is not set
# CONFIG_BLK_CPQ_CISS_DA is not set
# CONFIG_BLK_DEV_DAC960 is not set
# CONFIG_BLK_DEV_UMEM is not set
# CONFIG_BLK_DEV_COW_COMMON is not set
CONFIG_BLK_DEV_LOOP=y
CONFIG_BLK_DEV_LOOP_MIN_COUNT=8
# CONFIG_BLK_DEV_CRYPTOLOOP is not set
# CONFIG_BLK_DEV_DRBD is not set
# CONFIG_BLK_DEV_NBD is not set
# CONFIG_BLK_DEV_SKD is not set
# CONFIG_BLK_DEV_SX8 is not set
# CONFIG_BLK_DEV_RAM is not set
# CONFIG_CDROM_PKTCDVD is not set
# CONFIG_ATA_OVER_ETH is not set
# CONFIG_BLK_DEV_RBD is not set
# CONFIG_BLK_DEV_RSXX is not set
# CONFIG_BLK_DEV_NVME is not set
# CONFIG_NVME_FC is not set

#
# Misc devices
#
# CONFIG_SENSORS_LIS3LV02D is not set
# CONFIG_AD525X_DPOT is not set
# CONFIG_DUMMY_IRQ is not set
# CONFIG_IBM_ASM is not set
# CONFIG_PHANTOM is not set
# CONFIG_SGI_IOC4 is not set
# CONFIG_TIFM_CORE is not set
# CONFIG_ICS932S401 is not set
# CONFIG_ENCLOSURE_SERVICES is not set
# CONFIG_HP_ILO is not set
# CONFIG_APDS9802ALS is not set
# CONFIG_ISL29003 is not set
# CONFIG_ISL29020 is not set
# CONFIG_SENSORS_TSL2550 is not set
# CONFIG_SENSORS_BH1770 is not set
# CONFIG_SENSORS_APDS990X is not set
# CONFIG_HMC6352 is not set
# CONFIG_DS1682 is not set
# CONFIG_USB_SWITCH_FSA9480 is not set
# CONFIG_SRAM is not set
# CONFIG_PCI_ENDPOINT_TEST is not set
# CONFIG_C2PORT is not set

#
# EEPROM support
#
# CONFIG_EEPROM_AT24 is not set
# CONFIG_EEPROM_LEGACY is not set
# CONFIG_EEPROM_MAX6875 is not set
# CONFIG_EEPROM_93CX6 is not set
# CONFIG_EEPROM_IDT_89HPESX is not set
# CONFIG_CB710_CORE is not set

#
# Texas Instruments shared transport line discipline
#
# CONFIG_SENSORS_LIS3_I2C is not set

#
# Altera FPGA firmware download module
#
# CONFIG_ALTERA_STAPL is not set
# CONFIG_INTEL_MEI is not set
# CONFIG_INTEL_MEI_ME is not set
# CONFIG_INTEL_MEI_TXE is not set
# CONFIG_VMWARE_VMCI is not set

#
# Intel MIC Bus Driver
#
# CONFIG_INTEL_MIC_BUS is not set

#
# SCIF Bus Driver
#
# CONFIG_SCIF_BUS is not set

#
# VOP Bus Driver
#
# CONFIG_VOP_BUS is not set

#
# Intel MIC Host Driver
#

#
# Intel MIC Card Driver
#

#
# SCIF Driver
#

#
# Intel MIC Coprocessor State Management (COSM) Drivers
#

#
# VOP Driver
#
# CONFIG_GENWQE is not set
# CONFIG_ECHO is not set
# CONFIG_CXL_BASE is not set
# CONFIG_CXL_AFU_DRIVER_OPS is not set
CONFIG_HAVE_IDE=y
# CONFIG_IDE is not set

#
# SCSI device support
#
CONFIG_SCSI_MOD=y
# CONFIG_RAID_ATTRS is not set
CONFIG_SCSI=y
CONFIG_SCSI_DMA=y
# CONFIG_SCSI_NETLINK is not set
# CONFIG_SCSI_MQ_DEFAULT is not set
CONFIG_SCSI_PROC_FS=y

#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
# CONFIG_CHR_DEV_ST is not set
# CONFIG_CHR_DEV_OSST is not set
CONFIG_BLK_DEV_SR=y
CONFIG_BLK_DEV_SR_VENDOR=y
CONFIG_CHR_DEV_SG=y
# CONFIG_CHR_DEV_SCH is not set
CONFIG_SCSI_CONSTANTS=y
# CONFIG_SCSI_LOGGING is not set
# CONFIG_SCSI_SCAN_ASYNC is not set

#
# SCSI Transports
#
CONFIG_SCSI_SPI_ATTRS=y
# CONFIG_SCSI_FC_ATTRS is not set
# CONFIG_SCSI_ISCSI_ATTRS is not set
# CONFIG_SCSI_SAS_ATTRS is not set
# CONFIG_SCSI_SAS_LIBSAS is not set
# CONFIG_SCSI_SRP_ATTRS is not set
# CONFIG_SCSI_LOWLEVEL is not set
# CONFIG_SCSI_LOWLEVEL_PCMCIA is not set
# CONFIG_SCSI_DH is not set
# CONFIG_SCSI_OSD_INITIATOR is not set
CONFIG_ATA=y
# CONFIG_ATA_NONSTANDARD is not set
CONFIG_ATA_VERBOSE_ERROR=y
CONFIG_ATA_ACPI=y
# CONFIG_SATA_ZPODD is not set
CONFIG_SATA_PMP=y

#
# Controllers with non-SFF native interface
#
CONFIG_SATA_AHCI=y
# CONFIG_SATA_AHCI_PLATFORM is not set
# CONFIG_SATA_INIC162X is not set
# CONFIG_SATA_ACARD_AHCI is not set
# CONFIG_SATA_SIL24 is not set
CONFIG_ATA_SFF=y

#
# SFF controllers with custom DMA interface
#
# CONFIG_PDC_ADMA is not set
# CONFIG_SATA_QSTOR is not set
# CONFIG_SATA_SX4 is not set
CONFIG_ATA_BMDMA=y

#
# SATA SFF controllers with BMDMA
#
CONFIG_ATA_PIIX=y
# CONFIG_SATA_DWC is not set
# CONFIG_SATA_MV is not set
# CONFIG_SATA_NV is not set
# CONFIG_SATA_PROMISE is not set
# CONFIG_SATA_SIL is not set
# CONFIG_SATA_SIS is not set
# CONFIG_SATA_SVW is not set
# CONFIG_SATA_ULI is not set
# CONFIG_SATA_VIA is not set
# CONFIG_SATA_VITESSE is not set

#
# PATA SFF controllers with BMDMA
#
# CONFIG_PATA_ALI is not set
CONFIG_PATA_AMD=y
# CONFIG_PATA_ARTOP is not set
# CONFIG_PATA_ATIIXP is not set
# CONFIG_PATA_ATP867X is not set
# CONFIG_PATA_CMD64X is not set
# CONFIG_PATA_CYPRESS is not set
# CONFIG_PATA_EFAR is not set
# CONFIG_PATA_HPT366 is not set
# CONFIG_PATA_HPT37X is not set
# CONFIG_PATA_HPT3X2N is not set
# CONFIG_PATA_HPT3X3 is not set
# CONFIG_PATA_IT8213 is not set
# CONFIG_PATA_IT821X is not set
# CONFIG_PATA_JMICRON is not set
# CONFIG_PATA_MARVELL is not set
# CONFIG_PATA_NETCELL is not set
# CONFIG_PATA_NINJA32 is not set
# CONFIG_PATA_NS87415 is not set
CONFIG_PATA_OLDPIIX=y
# CONFIG_PATA_OPTIDMA is not set
# CONFIG_PATA_PDC2027X is not set
# CONFIG_PATA_PDC_OLD is not set
# CONFIG_PATA_RADISYS is not set
# CONFIG_PATA_RDC is not set
CONFIG_PATA_SCH=y
# CONFIG_PATA_SERVERWORKS is not set
# CONFIG_PATA_SIL680 is not set
# CONFIG_PATA_SIS is not set
# CONFIG_PATA_TOSHIBA is not set
# CONFIG_PATA_TRIFLEX is not set
# CONFIG_PATA_VIA is not set
# CONFIG_PATA_WINBOND is not set

#
# PIO-only SFF controllers
#
# CONFIG_PATA_CMD640_PCI is not set
# CONFIG_PATA_MPIIX is not set
# CONFIG_PATA_NS87410 is not set
# CONFIG_PATA_OPTI is not set
# CONFIG_PATA_PCMCIA is not set
# CONFIG_PATA_PLATFORM is not set
# CONFIG_PATA_RZ1000 is not set

#
# Generic fallback / legacy drivers
#
# CONFIG_PATA_ACPI is not set
# CONFIG_ATA_GENERIC is not set
# CONFIG_PATA_LEGACY is not set
CONFIG_MD=y
CONFIG_BLK_DEV_MD=y
CONFIG_MD_AUTODETECT=y
# CONFIG_MD_LINEAR is not set
# CONFIG_MD_RAID0 is not set
# CONFIG_MD_RAID1 is not set
# CONFIG_MD_RAID10 is not set
# CONFIG_MD_RAID456 is not set
# CONFIG_MD_MULTIPATH is not set
# CONFIG_MD_FAULTY is not set
# CONFIG_BCACHE is not set
CONFIG_BLK_DEV_DM_BUILTIN=y
CONFIG_BLK_DEV_DM=y
# CONFIG_DM_MQ_DEFAULT is not set
# CONFIG_DM_DEBUG is not set
# CONFIG_DM_CRYPT is not set
# CONFIG_DM_SNAPSHOT is not set
# CONFIG_DM_THIN_PROVISIONING is not set
# CONFIG_DM_CACHE is not set
# CONFIG_DM_ERA is not set
CONFIG_DM_MIRROR=y
# CONFIG_DM_LOG_USERSPACE is not set
# CONFIG_DM_RAID is not set
CONFIG_DM_ZERO=y
# CONFIG_DM_MULTIPATH is not set
# CONFIG_DM_DELAY is not set
# CONFIG_DM_UEVENT is not set
# CONFIG_DM_FLAKEY is not set
# CONFIG_DM_VERITY is not set
# CONFIG_DM_SWITCH is not set
# CONFIG_DM_LOG_WRITES is not set
# CONFIG_DM_INTEGRITY is not set
# CONFIG_TARGET_CORE is not set
# CONFIG_FUSION is not set

#
# IEEE 1394 (FireWire) support
#
# CONFIG_FIREWIRE is not set
# CONFIG_FIREWIRE_NOSY is not set
CONFIG_MACINTOSH_DRIVERS=y
CONFIG_MAC_EMUMOUSEBTN=y
CONFIG_NETDEVICES=y
CONFIG_MII=y
CONFIG_NET_CORE=y
# CONFIG_BONDING is not set
# CONFIG_DUMMY is not set
# CONFIG_EQUALIZER is not set
# CONFIG_NET_FC is not set
# CONFIG_IFB is not set
# CONFIG_NET_TEAM is not set
# CONFIG_MACVLAN is not set
# CONFIG_VXLAN is not set
# CONFIG_MACSEC is not set
CONFIG_NETCONSOLE=y
CONFIG_NETPOLL=y
CONFIG_NET_POLL_CONTROLLER=y
# CONFIG_TUN is not set
# CONFIG_TUN_VNET_CROSS_LE is not set
# CONFIG_VETH is not set
# CONFIG_NLMON is not set
# CONFIG_ARCNET is not set

#
# CAIF transport drivers
#

#
# Distributed Switch Architecture drivers
#
CONFIG_ETHERNET=y
CONFIG_NET_VENDOR_3COM=y
# CONFIG_PCMCIA_3C574 is not set
# CONFIG_PCMCIA_3C589 is not set
# CONFIG_VORTEX is not set
# CONFIG_TYPHOON is not set
CONFIG_NET_VENDOR_ADAPTEC=y
# CONFIG_ADAPTEC_STARFIRE is not set
CONFIG_NET_VENDOR_AGERE=y
# CONFIG_ET131X is not set
CONFIG_NET_VENDOR_ALACRITECH=y
# CONFIG_SLICOSS is not set
CONFIG_NET_VENDOR_ALTEON=y
# CONFIG_ACENIC is not set
# CONFIG_ALTERA_TSE is not set
CONFIG_NET_VENDOR_AMAZON=y
# CONFIG_ENA_ETHERNET is not set
CONFIG_NET_VENDOR_AMD=y
# CONFIG_AMD8111_ETH is not set
# CONFIG_PCNET32 is not set
# CONFIG_PCMCIA_NMCLAN is not set
# CONFIG_AMD_XGBE is not set
# CONFIG_AMD_XGBE_HAVE_ECC is not set
CONFIG_NET_VENDOR_AQUANTIA=y
# CONFIG_AQTION is not set
CONFIG_NET_VENDOR_ARC=y
CONFIG_NET_VENDOR_ATHEROS=y
# CONFIG_ATL2 is not set
# CONFIG_ATL1 is not set
# CONFIG_ATL1E is not set
# CONFIG_ATL1C is not set
# CONFIG_ALX is not set
# CONFIG_NET_VENDOR_AURORA is not set
CONFIG_NET_CADENCE=y
# CONFIG_MACB is not set
CONFIG_NET_VENDOR_BROADCOM=y
# CONFIG_B44 is not set
# CONFIG_BCMGENET is not set
# CONFIG_BNX2 is not set
# CONFIG_CNIC is not set
CONFIG_TIGON3=y
CONFIG_TIGON3_HWMON=y
# CONFIG_BNX2X is not set
# CONFIG_BNXT is not set
CONFIG_NET_VENDOR_BROCADE=y
# CONFIG_BNA is not set
CONFIG_NET_VENDOR_CAVIUM=y
# CONFIG_THUNDER_NIC_PF is not set
# CONFIG_THUNDER_NIC_VF is not set
# CONFIG_THUNDER_NIC_BGX is not set
# CONFIG_THUNDER_NIC_RGX is not set
# CONFIG_LIQUIDIO is not set
# CONFIG_LIQUIDIO_VF is not set
CONFIG_NET_VENDOR_CHELSIO=y
# CONFIG_CHELSIO_T1 is not set
# CONFIG_CHELSIO_T3 is not set
# CONFIG_CHELSIO_T4 is not set
# CONFIG_CHELSIO_T4VF is not set
CONFIG_NET_VENDOR_CISCO=y
# CONFIG_ENIC is not set
# CONFIG_CX_ECAT is not set
# CONFIG_DNET is not set
CONFIG_NET_VENDOR_DEC=y
CONFIG_NET_TULIP=y
# CONFIG_DE2104X is not set
# CONFIG_TULIP is not set
# CONFIG_DE4X5 is not set
# CONFIG_WINBOND_840 is not set
# CONFIG_DM9102 is not set
# CONFIG_ULI526X is not set
# CONFIG_PCMCIA_XIRCOM is not set
CONFIG_NET_VENDOR_DLINK=y
# CONFIG_DL2K is not set
# CONFIG_SUNDANCE is not set
CONFIG_NET_VENDOR_EMULEX=y
# CONFIG_BE2NET is not set
CONFIG_NET_VENDOR_EZCHIP=y
CONFIG_NET_VENDOR_EXAR=y
# CONFIG_S2IO is not set
# CONFIG_VXGE is not set
CONFIG_NET_VENDOR_FUJITSU=y
# CONFIG_PCMCIA_FMVJ18X is not set
CONFIG_NET_VENDOR_HP=y
# CONFIG_HP100 is not set
CONFIG_NET_VENDOR_INTEL=y
CONFIG_E100=y
CONFIG_E1000=y
CONFIG_E1000E=y
CONFIG_E1000E_HWTS=y
# CONFIG_IGB is not set
# CONFIG_IGBVF is not set
# CONFIG_IXGB is not set
# CONFIG_IXGBE is not set
# CONFIG_IXGBEVF is not set
# CONFIG_I40E is not set
# CONFIG_I40EVF is not set
# CONFIG_FM10K is not set
CONFIG_NET_VENDOR_I825XX=y
# CONFIG_JME is not set
CONFIG_NET_VENDOR_MARVELL=y
# CONFIG_MVMDIO is not set
# CONFIG_SKGE is not set
CONFIG_SKY2=y
# CONFIG_SKY2_DEBUG is not set
CONFIG_NET_VENDOR_MELLANOX=y
# CONFIG_MLX4_EN is not set
# CONFIG_MLX4_CORE is not set
# CONFIG_MLX5_CORE is not set
# CONFIG_MLXSW_CORE is not set
CONFIG_NET_VENDOR_MICREL=y
# CONFIG_KS8842 is not set
# CONFIG_KS8851_MLL is not set
# CONFIG_KSZ884X_PCI is not set
CONFIG_NET_VENDOR_MYRI=y
# CONFIG_MYRI10GE is not set
# CONFIG_FEALNX is not set
CONFIG_NET_VENDOR_NATSEMI=y
# CONFIG_NATSEMI is not set
# CONFIG_NS83820 is not set
CONFIG_NET_VENDOR_NETRONOME=y
# CONFIG_NFP is not set
CONFIG_NET_VENDOR_8390=y
# CONFIG_PCMCIA_AXNET is not set
# CONFIG_NE2K_PCI is not set
# CONFIG_PCMCIA_PCNET is not set
CONFIG_NET_VENDOR_NVIDIA=y
CONFIG_FORCEDETH=y
CONFIG_NET_VENDOR_OKI=y
# CONFIG_ETHOC is not set
CONFIG_NET_PACKET_ENGINE=y
# CONFIG_HAMACHI is not set
# CONFIG_YELLOWFIN is not set
CONFIG_NET_VENDOR_QLOGIC=y
# CONFIG_QLA3XXX is not set
# CONFIG_QLCNIC is not set
# CONFIG_QLGE is not set
# CONFIG_NETXEN_NIC is not set
# CONFIG_QED is not set
CONFIG_NET_VENDOR_QUALCOMM=y
# CONFIG_QCOM_EMAC is not set
CONFIG_NET_VENDOR_REALTEK=y
# CONFIG_8139CP is not set
CONFIG_8139TOO=y
CONFIG_8139TOO_PIO=y
# CONFIG_8139TOO_TUNE_TWISTER is not set
# CONFIG_8139TOO_8129 is not set
# CONFIG_8139_OLD_RX_RESET is not set
# CONFIG_R8169 is not set
CONFIG_NET_VENDOR_RENESAS=y
CONFIG_NET_VENDOR_RDC=y
# CONFIG_R6040 is not set
CONFIG_NET_VENDOR_ROCKER=y
CONFIG_NET_VENDOR_SAMSUNG=y
# CONFIG_SXGBE_ETH is not set
CONFIG_NET_VENDOR_SEEQ=y
CONFIG_NET_VENDOR_SILAN=y
# CONFIG_SC92031 is not set
CONFIG_NET_VENDOR_SIS=y
# CONFIG_SIS900 is not set
# CONFIG_SIS190 is not set
CONFIG_NET_VENDOR_SOLARFLARE=y
# CONFIG_SFC is not set
# CONFIG_SFC_FALCON is not set
CONFIG_NET_VENDOR_SMSC=y
# CONFIG_PCMCIA_SMC91C92 is not set
# CONFIG_EPIC100 is not set
# CONFIG_SMSC911X is not set
# CONFIG_SMSC9420 is not set
CONFIG_NET_VENDOR_STMICRO=y
# CONFIG_STMMAC_ETH is not set
CONFIG_NET_VENDOR_SUN=y
# CONFIG_HAPPYMEAL is not set
# CONFIG_SUNGEM is not set
# CONFIG_CASSINI is not set
# CONFIG_NIU is not set
CONFIG_NET_VENDOR_TEHUTI=y
# CONFIG_TEHUTI is not set
CONFIG_NET_VENDOR_TI=y
# CONFIG_TI_CPSW_ALE is not set
# CONFIG_TLAN is not set
CONFIG_NET_VENDOR_VIA=y
# CONFIG_VIA_RHINE is not set
# CONFIG_VIA_VELOCITY is not set
CONFIG_NET_VENDOR_WIZNET=y
# CONFIG_WIZNET_W5100 is not set
# CONFIG_WIZNET_W5300 is not set
CONFIG_NET_VENDOR_XIRCOM=y
# CONFIG_PCMCIA_XIRC2PS is not set
CONFIG_NET_VENDOR_SYNOPSYS=y
# CONFIG_DWC_XLGMAC is not set
CONFIG_FDDI=y
# CONFIG_DEFXX is not set
# CONFIG_SKFP is not set
# CONFIG_HIPPI is not set
# CONFIG_NET_SB1000 is not set
CONFIG_MDIO_DEVICE=y
# CONFIG_MDIO_BITBANG is not set
# CONFIG_MDIO_THUNDER is not set
CONFIG_PHYLIB=y
# CONFIG_LED_TRIGGER_PHY is not set

#
# MII PHY device drivers
#
# CONFIG_AMD_PHY is not set
# CONFIG_AQUANTIA_PHY is not set
# CONFIG_AT803X_PHY is not set
# CONFIG_BCM7XXX_PHY is not set
# CONFIG_BCM87XX_PHY is not set
# CONFIG_BROADCOM_PHY is not set
# CONFIG_CICADA_PHY is not set
# CONFIG_CORTINA_PHY is not set
# CONFIG_DAVICOM_PHY is not set
# CONFIG_DP83848_PHY is not set
# CONFIG_DP83867_PHY is not set
# CONFIG_FIXED_PHY is not set
# CONFIG_ICPLUS_PHY is not set
# CONFIG_INTEL_XWAY_PHY is not set
# CONFIG_LSI_ET1011C_PHY is not set
# CONFIG_LXT_PHY is not set
# CONFIG_MARVELL_PHY is not set
# CONFIG_MARVELL_10G_PHY is not set
# CONFIG_MICREL_PHY is not set
# CONFIG_MICROCHIP_PHY is not set
# CONFIG_MICROSEMI_PHY is not set
# CONFIG_NATIONAL_PHY is not set
# CONFIG_QSEMI_PHY is not set
# CONFIG_REALTEK_PHY is not set
# CONFIG_SMSC_PHY is not set
# CONFIG_STE10XP is not set
# CONFIG_TERANETICS_PHY is not set
# CONFIG_VITESSE_PHY is not set
# CONFIG_XILINX_GMII2RGMII is not set
# CONFIG_PPP is not set
# CONFIG_SLIP is not set
CONFIG_USB_NET_DRIVERS=y
# CONFIG_USB_CATC is not set
# CONFIG_USB_KAWETH is not set
# CONFIG_USB_PEGASUS is not set
# CONFIG_USB_RTL8150 is not set
# CONFIG_USB_RTL8152 is not set
# CONFIG_USB_LAN78XX is not set
# CONFIG_USB_USBNET is not set
# CONFIG_USB_HSO is not set
# CONFIG_USB_IPHETH is not set
CONFIG_WLAN=y
# CONFIG_WIRELESS_WDS is not set
CONFIG_WLAN_VENDOR_ADMTEK=y
# CONFIG_ADM8211 is not set
CONFIG_WLAN_VENDOR_ATH=y
# CONFIG_ATH_DEBUG is not set
# CONFIG_ATH5K is not set
# CONFIG_ATH5K_PCI is not set
# CONFIG_ATH9K is not set
# CONFIG_ATH9K_HTC is not set
# CONFIG_CARL9170 is not set
# CONFIG_ATH6KL is not set
# CONFIG_AR5523 is not set
# CONFIG_WIL6210 is not set
# CONFIG_ATH10K is not set
# CONFIG_WCN36XX is not set
CONFIG_WLAN_VENDOR_ATMEL=y
# CONFIG_ATMEL is not set
# CONFIG_AT76C50X_USB is not set
CONFIG_WLAN_VENDOR_BROADCOM=y
# CONFIG_B43 is not set
# CONFIG_B43LEGACY is not set
# CONFIG_BRCMSMAC is not set
# CONFIG_BRCMFMAC is not set
CONFIG_WLAN_VENDOR_CISCO=y
# CONFIG_AIRO is not set
# CONFIG_AIRO_CS is not set
CONFIG_WLAN_VENDOR_INTEL=y
# CONFIG_IPW2100 is not set
# CONFIG_IPW2200 is not set
# CONFIG_IWL4965 is not set
# CONFIG_IWL3945 is not set
# CONFIG_IWLWIFI is not set
CONFIG_WLAN_VENDOR_INTERSIL=y
# CONFIG_HOSTAP is not set
# CONFIG_HERMES is not set
# CONFIG_P54_COMMON is not set
# CONFIG_PRISM54 is not set
CONFIG_WLAN_VENDOR_MARVELL=y
# CONFIG_LIBERTAS is not set
# CONFIG_LIBERTAS_THINFIRM is not set
# CONFIG_MWIFIEX is not set
# CONFIG_MWL8K is not set
CONFIG_WLAN_VENDOR_MEDIATEK=y
# CONFIG_MT7601U is not set
CONFIG_WLAN_VENDOR_RALINK=y
# CONFIG_RT2X00 is not set
CONFIG_WLAN_VENDOR_REALTEK=y
# CONFIG_RTL8180 is not set
# CONFIG_RTL8187 is not set
CONFIG_RTL_CARDS=y
# CONFIG_RTL8192CE is not set
# CONFIG_RTL8192SE is not set
# CONFIG_RTL8192DE is not set
# CONFIG_RTL8723AE is not set
# CONFIG_RTL8723BE is not set
# CONFIG_RTL8188EE is not set
# CONFIG_RTL8192EE is not set
# CONFIG_RTL8821AE is not set
# CONFIG_RTL8192CU is not set
# CONFIG_RTL8XXXU is not set
CONFIG_WLAN_VENDOR_RSI=y
# CONFIG_RSI_91X is not set
CONFIG_WLAN_VENDOR_ST=y
# CONFIG_CW1200 is not set
CONFIG_WLAN_VENDOR_TI=y
# CONFIG_WL1251 is not set
# CONFIG_WL12XX is not set
# CONFIG_WL18XX is not set
# CONFIG_WLCORE is not set
CONFIG_WLAN_VENDOR_ZYDAS=y
# CONFIG_USB_ZD1201 is not set
# CONFIG_ZD1211RW is not set
CONFIG_WLAN_VENDOR_QUANTENNA=y
# CONFIG_QTNFMAC_PEARL_PCIE is not set
# CONFIG_PCMCIA_RAYCS is not set
# CONFIG_PCMCIA_WL3501 is not set
# CONFIG_MAC80211_HWSIM is not set
# CONFIG_USB_NET_RNDIS_WLAN is not set

#
# Enable WiMAX (Networking options) to see the WiMAX drivers
#
# CONFIG_WAN is not set
# CONFIG_VMXNET3 is not set
# CONFIG_FUJITSU_ES is not set
# CONFIG_ISDN is not set
# CONFIG_NVM is not set

#
# Input device support
#
CONFIG_INPUT=y
CONFIG_INPUT_LEDS=y
CONFIG_INPUT_FF_MEMLESS=y
CONFIG_INPUT_POLLDEV=y
CONFIG_INPUT_SPARSEKMAP=y
# CONFIG_INPUT_MATRIXKMAP is not set

#
# Userland interfaces
#
CONFIG_INPUT_MOUSEDEV=y
# CONFIG_INPUT_MOUSEDEV_PSAUX is not set
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
# CONFIG_INPUT_JOYDEV is not set
CONFIG_INPUT_EVDEV=y
# CONFIG_INPUT_EVBUG is not set

#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
# CONFIG_KEYBOARD_ADP5588 is not set
# CONFIG_KEYBOARD_ADP5589 is not set
CONFIG_KEYBOARD_ATKBD=y
# CONFIG_KEYBOARD_QT1070 is not set
# CONFIG_KEYBOARD_QT2160 is not set
# CONFIG_KEYBOARD_DLINK_DIR685 is not set
# CONFIG_KEYBOARD_LKKBD is not set
# CONFIG_KEYBOARD_TCA6416 is not set
# CONFIG_KEYBOARD_TCA8418 is not set
# CONFIG_KEYBOARD_LM8323 is not set
# CONFIG_KEYBOARD_LM8333 is not set
# CONFIG_KEYBOARD_MAX7359 is not set
# CONFIG_KEYBOARD_MCS is not set
# CONFIG_KEYBOARD_MPR121 is not set
# CONFIG_KEYBOARD_NEWTON is not set
# CONFIG_KEYBOARD_OPENCORES is not set
# CONFIG_KEYBOARD_SAMSUNG is not set
# CONFIG_KEYBOARD_STOWAWAY is not set
# CONFIG_KEYBOARD_SUNKBD is not set
# CONFIG_KEYBOARD_TM2_TOUCHKEY is not set
# CONFIG_KEYBOARD_XTKBD is not set
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=y
CONFIG_MOUSE_PS2_ALPS=y
CONFIG_MOUSE_PS2_BYD=y
CONFIG_MOUSE_PS2_LOGIPS2PP=y
CONFIG_MOUSE_PS2_SYNAPTICS=y
CONFIG_MOUSE_PS2_SYNAPTICS_SMBUS=y
CONFIG_MOUSE_PS2_CYPRESS=y
CONFIG_MOUSE_PS2_LIFEBOOK=y
CONFIG_MOUSE_PS2_TRACKPOINT=y
# CONFIG_MOUSE_PS2_ELANTECH is not set
# CONFIG_MOUSE_PS2_SENTELIC is not set
# CONFIG_MOUSE_PS2_TOUCHKIT is not set
CONFIG_MOUSE_PS2_FOCALTECH=y
CONFIG_MOUSE_PS2_SMBUS=y
# CONFIG_MOUSE_SERIAL is not set
# CONFIG_MOUSE_APPLETOUCH is not set
# CONFIG_MOUSE_BCM5974 is not set
# CONFIG_MOUSE_CYAPA is not set
# CONFIG_MOUSE_ELAN_I2C is not set
# CONFIG_MOUSE_VSXXXAA is not set
# CONFIG_MOUSE_SYNAPTICS_I2C is not set
# CONFIG_MOUSE_SYNAPTICS_USB is not set
CONFIG_INPUT_JOYSTICK=y
# CONFIG_JOYSTICK_ANALOG is not set
# CONFIG_JOYSTICK_A3D is not set
# CONFIG_JOYSTICK_ADI is not set
# CONFIG_JOYSTICK_COBRA is not set
# CONFIG_JOYSTICK_GF2K is not set
# CONFIG_JOYSTICK_GRIP is not set
# CONFIG_JOYSTICK_GRIP_MP is not set
# CONFIG_JOYSTICK_GUILLEMOT is not set
# CONFIG_JOYSTICK_INTERACT is not set
# CONFIG_JOYSTICK_SIDEWINDER is not set
# CONFIG_JOYSTICK_TMDC is not set
# CONFIG_JOYSTICK_IFORCE is not set
# CONFIG_JOYSTICK_WARRIOR is not set
# CONFIG_JOYSTICK_MAGELLAN is not set
# CONFIG_JOYSTICK_SPACEORB is not set
# CONFIG_JOYSTICK_SPACEBALL is not set
# CONFIG_JOYSTICK_STINGER is not set
# CONFIG_JOYSTICK_TWIDJOY is not set
# CONFIG_JOYSTICK_ZHENHUA is not set
# CONFIG_JOYSTICK_AS5011 is not set
# CONFIG_JOYSTICK_JOYDUMP is not set
# CONFIG_JOYSTICK_XPAD is not set
CONFIG_INPUT_TABLET=y
# CONFIG_TABLET_USB_ACECAD is not set
# CONFIG_TABLET_USB_AIPTEK is not set
# CONFIG_TABLET_USB_GTCO is not set
# CONFIG_TABLET_USB_HANWANG is not set
# CONFIG_TABLET_USB_KBTAB is not set
# CONFIG_TABLET_USB_PEGASUS is not set
# CONFIG_TABLET_SERIAL_WACOM4 is not set
CONFIG_INPUT_TOUCHSCREEN=y
CONFIG_TOUCHSCREEN_PROPERTIES=y
# CONFIG_TOUCHSCREEN_AD7879 is not set
# CONFIG_TOUCHSCREEN_ATMEL_MXT is not set
# CONFIG_TOUCHSCREEN_BU21013 is not set
# CONFIG_TOUCHSCREEN_CYTTSP_CORE is not set
# CONFIG_TOUCHSCREEN_CYTTSP4_CORE is not set
# CONFIG_TOUCHSCREEN_DYNAPRO is not set
# CONFIG_TOUCHSCREEN_HAMPSHIRE is not set
# CONFIG_TOUCHSCREEN_EETI is not set
# CONFIG_TOUCHSCREEN_EGALAX_SERIAL is not set
# CONFIG_TOUCHSCREEN_FUJITSU is not set
# CONFIG_TOUCHSCREEN_ILI210X is not set
# CONFIG_TOUCHSCREEN_GUNZE is not set
# CONFIG_TOUCHSCREEN_EKTF2127 is not set
# CONFIG_TOUCHSCREEN_ELAN is not set
# CONFIG_TOUCHSCREEN_ELO is not set
# CONFIG_TOUCHSCREEN_WACOM_W8001 is not set
# CONFIG_TOUCHSCREEN_WACOM_I2C is not set
# CONFIG_TOUCHSCREEN_MAX11801 is not set
# CONFIG_TOUCHSCREEN_MCS5000 is not set
# CONFIG_TOUCHSCREEN_MMS114 is not set
# CONFIG_TOUCHSCREEN_MELFAS_MIP4 is not set
# CONFIG_TOUCHSCREEN_MTOUCH is not set
# CONFIG_TOUCHSCREEN_INEXIO is not set
# CONFIG_TOUCHSCREEN_MK712 is not set
# CONFIG_TOUCHSCREEN_PENMOUNT is not set
# CONFIG_TOUCHSCREEN_EDT_FT5X06 is not set
# CONFIG_TOUCHSCREEN_TOUCHRIGHT is not set
# CONFIG_TOUCHSCREEN_TOUCHWIN is not set
# CONFIG_TOUCHSCREEN_PIXCIR is not set
# CONFIG_TOUCHSCREEN_WDT87XX_I2C is not set
# CONFIG_TOUCHSCREEN_USB_COMPOSITE is not set
# CONFIG_TOUCHSCREEN_TOUCHIT213 is not set
# CONFIG_TOUCHSCREEN_TSC_SERIO is not set
# CONFIG_TOUCHSCREEN_TSC2004 is not set
# CONFIG_TOUCHSCREEN_TSC2007 is not set
# CONFIG_TOUCHSCREEN_SILEAD is not set
# CONFIG_TOUCHSCREEN_ST1232 is not set
# CONFIG_TOUCHSCREEN_STMFTS is not set
# CONFIG_TOUCHSCREEN_SX8654 is not set
# CONFIG_TOUCHSCREEN_TPS6507X is not set
# CONFIG_TOUCHSCREEN_ZET6223 is not set
# CONFIG_TOUCHSCREEN_ROHM_BU21023 is not set
CONFIG_INPUT_MISC=y
# CONFIG_INPUT_AD714X is not set
# CONFIG_INPUT_BMA150 is not set
# CONFIG_INPUT_E3X0_BUTTON is not set
# CONFIG_INPUT_PCSPKR is not set
# CONFIG_INPUT_MMA8450 is not set
# CONFIG_INPUT_APANEL is not set
# CONFIG_INPUT_ATLAS_BTNS is not set
# CONFIG_INPUT_ATI_REMOTE2 is not set
# CONFIG_INPUT_KEYSPAN_REMOTE is not set
# CONFIG_INPUT_KXTJ9 is not set
# CONFIG_INPUT_POWERMATE is not set
# CONFIG_INPUT_YEALINK is not set
# CONFIG_INPUT_CM109 is not set
# CONFIG_INPUT_UINPUT is not set
# CONFIG_INPUT_PCF8574 is not set
# CONFIG_INPUT_ADXL34X is not set
# CONFIG_INPUT_IMS_PCU is not set
# CONFIG_INPUT_CMA3000 is not set
# CONFIG_INPUT_IDEAPAD_SLIDEBAR is not set
# CONFIG_INPUT_DRV2665_HAPTICS is not set
# CONFIG_INPUT_DRV2667_HAPTICS is not set
# CONFIG_RMI4_CORE is not set

#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_ARCH_MIGHT_HAVE_PC_SERIO=y
CONFIG_SERIO_I8042=y
CONFIG_SERIO_SERPORT=y
# CONFIG_SERIO_CT82C710 is not set
# CONFIG_SERIO_PCIPS2 is not set
CONFIG_SERIO_LIBPS2=y
# CONFIG_SERIO_RAW is not set
# CONFIG_SERIO_ALTERA_PS2 is not set
# CONFIG_SERIO_PS2MULT is not set
# CONFIG_SERIO_ARC_PS2 is not set
# CONFIG_USERIO is not set
# CONFIG_GAMEPORT is not set

#
# Character devices
#
CONFIG_TTY=y
CONFIG_VT=y
CONFIG_CONSOLE_TRANSLATIONS=y
CONFIG_VT_CONSOLE=y
CONFIG_VT_CONSOLE_SLEEP=y
CONFIG_HW_CONSOLE=y
CONFIG_VT_HW_CONSOLE_BINDING=y
CONFIG_UNIX98_PTYS=y
# CONFIG_LEGACY_PTYS is not set
CONFIG_SERIAL_NONSTANDARD=y
# CONFIG_ROCKETPORT is not set
# CONFIG_CYCLADES is not set
# CONFIG_MOXA_INTELLIO is not set
# CONFIG_MOXA_SMARTIO is not set
# CONFIG_SYNCLINK is not set
# CONFIG_SYNCLINKMP is not set
# CONFIG_SYNCLINK_GT is not set
# CONFIG_NOZOMI is not set
# CONFIG_ISI is not set
# CONFIG_N_HDLC is not set
# CONFIG_N_GSM is not set
# CONFIG_TRACE_SINK is not set
CONFIG_DEVMEM=y
CONFIG_DEVKMEM=y

#
# Serial drivers
#
CONFIG_SERIAL_EARLYCON=y
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_DEPRECATED_OPTIONS=y
CONFIG_SERIAL_8250_PNP=y
# CONFIG_SERIAL_8250_FINTEK is not set
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_SERIAL_8250_DMA=y
CONFIG_SERIAL_8250_PCI=y
CONFIG_SERIAL_8250_EXAR=y
# CONFIG_SERIAL_8250_CS is not set
CONFIG_SERIAL_8250_NR_UARTS=32
CONFIG_SERIAL_8250_RUNTIME_UARTS=4
CONFIG_SERIAL_8250_EXTENDED=y
CONFIG_SERIAL_8250_MANY_PORTS=y
CONFIG_SERIAL_8250_SHARE_IRQ=y
CONFIG_SERIAL_8250_DETECT_IRQ=y
CONFIG_SERIAL_8250_RSA=y
# CONFIG_SERIAL_8250_FSL is not set
# CONFIG_SERIAL_8250_DW is not set
# CONFIG_SERIAL_8250_RT288X is not set
CONFIG_SERIAL_8250_LPSS=y
CONFIG_SERIAL_8250_MID=y
# CONFIG_SERIAL_8250_MOXA is not set

#
# Non-8250 serial port support
#
# CONFIG_SERIAL_KGDB_NMI is not set
# CONFIG_SERIAL_UARTLITE is not set
CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y
CONFIG_CONSOLE_POLL=y
# CONFIG_SERIAL_JSM is not set
# CONFIG_SERIAL_SCCNXP is not set
# CONFIG_SERIAL_SC16IS7XX is not set
# CONFIG_SERIAL_ALTERA_JTAGUART is not set
# CONFIG_SERIAL_ALTERA_UART is not set
# CONFIG_SERIAL_ARC is not set
# CONFIG_SERIAL_RP2 is not set
# CONFIG_SERIAL_FSL_LPUART is not set
# CONFIG_SERIAL_DEV_BUS is not set
# CONFIG_TTY_PRINTK is not set
# CONFIG_IPMI_HANDLER is not set
CONFIG_HW_RANDOM=y
# CONFIG_HW_RANDOM_TIMERIOMEM is not set
# CONFIG_HW_RANDOM_INTEL is not set
# CONFIG_HW_RANDOM_AMD is not set
CONFIG_HW_RANDOM_VIA=y
CONFIG_NVRAM=y
# CONFIG_R3964 is not set
# CONFIG_APPLICOM is not set

#
# PCMCIA character devices
#
# CONFIG_SYNCLINK_CS is not set
# CONFIG_CARDMAN_4000 is not set
# CONFIG_CARDMAN_4040 is not set
# CONFIG_SCR24X is not set
# CONFIG_IPWIRELESS is not set
# CONFIG_MWAVE is not set
# CONFIG_RAW_DRIVER is not set
CONFIG_HPET=y
# CONFIG_HPET_MMAP is not set
# CONFIG_HANGCHECK_TIMER is not set
# CONFIG_TCG_TPM is not set
# CONFIG_TELCLOCK is not set
CONFIG_DEVPORT=y
CONFIG_HMM_DMIRROR=m
# CONFIG_XILLYBUS is not set

#
# I2C support
#
CONFIG_I2C=y
CONFIG_ACPI_I2C_OPREGION=y
CONFIG_I2C_BOARDINFO=y
CONFIG_I2C_COMPAT=y
# CONFIG_I2C_CHARDEV is not set
# CONFIG_I2C_MUX is not set
CONFIG_I2C_HELPER_AUTO=y
CONFIG_I2C_SMBUS=y
CONFIG_I2C_ALGOBIT=y

#
# I2C Hardware Bus support
#

#
# PC SMBus host controller drivers
#
# CONFIG_I2C_ALI1535 is not set
# CONFIG_I2C_ALI1563 is not set
# CONFIG_I2C_ALI15X3 is not set
# CONFIG_I2C_AMD756 is not set
# CONFIG_I2C_AMD8111 is not set
CONFIG_I2C_I801=y
# CONFIG_I2C_ISCH is not set
# CONFIG_I2C_ISMT is not set
# CONFIG_I2C_PIIX4 is not set
# CONFIG_I2C_NFORCE2 is not set
# CONFIG_I2C_SIS5595 is not set
# CONFIG_I2C_SIS630 is not set
# CONFIG_I2C_SIS96X is not set
# CONFIG_I2C_VIA is not set
# CONFIG_I2C_VIAPRO is not set

#
# ACPI drivers
#
# CONFIG_I2C_SCMI is not set

#
# I2C system bus drivers (mostly embedded / system-on-chip)
#
# CONFIG_I2C_DESIGNWARE_PLATFORM is not set
# CONFIG_I2C_DESIGNWARE_PCI is not set
# CONFIG_I2C_EMEV2 is not set
# CONFIG_I2C_OCORES is not set
# CONFIG_I2C_PCA_PLATFORM is not set
# CONFIG_I2C_PXA_PCI is not set
# CONFIG_I2C_SIMTEC is not set
# CONFIG_I2C_XILINX is not set

#
# External I2C/SMBus adapter drivers
#
# CONFIG_I2C_DIOLAN_U2C is not set
# CONFIG_I2C_PARPORT_LIGHT is not set
# CONFIG_I2C_ROBOTFUZZ_OSIF is not set
# CONFIG_I2C_TAOS_EVM is not set
# CONFIG_I2C_TINY_USB is not set

#
# Other I2C/SMBus bus drivers
#
# CONFIG_I2C_MLXCPLD is not set
# CONFIG_I2C_STUB is not set
# CONFIG_I2C_SLAVE is not set
# CONFIG_I2C_DEBUG_CORE is not set
# CONFIG_I2C_DEBUG_ALGO is not set
# CONFIG_I2C_DEBUG_BUS is not set
# CONFIG_SPI is not set
# CONFIG_SPMI is not set
# CONFIG_HSI is not set
CONFIG_PPS=y
# CONFIG_PPS_DEBUG is not set

#
# PPS clients support
#
# CONFIG_PPS_CLIENT_KTIMER is not set
# CONFIG_PPS_CLIENT_LDISC is not set
# CONFIG_PPS_CLIENT_GPIO is not set

#
# PPS generators support
#

#
# PTP clock support
#
CONFIG_PTP_1588_CLOCK=y

#
# Enable PHYLIB and NETWORK_PHY_TIMESTAMPING to see the additional clocks.
#
# CONFIG_GPIOLIB is not set
# CONFIG_W1 is not set
# CONFIG_POWER_AVS is not set
# CONFIG_POWER_RESET is not set
CONFIG_POWER_SUPPLY=y
# CONFIG_POWER_SUPPLY_DEBUG is not set
# CONFIG_PDA_POWER is not set
# CONFIG_TEST_POWER is not set
# CONFIG_BATTERY_DS2780 is not set
# CONFIG_BATTERY_DS2781 is not set
# CONFIG_BATTERY_DS2782 is not set
# CONFIG_BATTERY_SBS is not set
# CONFIG_CHARGER_SBS is not set
# CONFIG_BATTERY_BQ27XXX is not set
# CONFIG_BATTERY_MAX17040 is not set
# CONFIG_BATTERY_MAX17042 is not set
# CONFIG_CHARGER_MAX8903 is not set
# CONFIG_CHARGER_LP8727 is not set
# CONFIG_CHARGER_BQ2415X is not set
# CONFIG_CHARGER_SMB347 is not set
# CONFIG_BATTERY_GAUGE_LTC2941 is not set
CONFIG_HWMON=y
# CONFIG_HWMON_VID is not set
# CONFIG_HWMON_DEBUG_CHIP is not set

#
# Native drivers
#
# CONFIG_SENSORS_ABITUGURU is not set
# CONFIG_SENSORS_ABITUGURU3 is not set
# CONFIG_SENSORS_AD7414 is not set
# CONFIG_SENSORS_AD7418 is not set
# CONFIG_SENSORS_ADM1021 is not set
# CONFIG_SENSORS_ADM1025 is not set
# CONFIG_SENSORS_ADM1026 is not set
# CONFIG_SENSORS_ADM1029 is not set
# CONFIG_SENSORS_ADM1031 is not set
# CONFIG_SENSORS_ADM9240 is not set
# CONFIG_SENSORS_ADT7410 is not set
# CONFIG_SENSORS_ADT7411 is not set
# CONFIG_SENSORS_ADT7462 is not set
# CONFIG_SENSORS_ADT7470 is not set
# CONFIG_SENSORS_ADT7475 is not set
# CONFIG_SENSORS_ASC7621 is not set
# CONFIG_SENSORS_K8TEMP is not set
# CONFIG_SENSORS_K10TEMP is not set
# CONFIG_SENSORS_FAM15H_POWER is not set
# CONFIG_SENSORS_APPLESMC is not set
# CONFIG_SENSORS_ASB100 is not set
# CONFIG_SENSORS_ASPEED is not set
# CONFIG_SENSORS_ATXP1 is not set
# CONFIG_SENSORS_DS620 is not set
# CONFIG_SENSORS_DS1621 is not set
# CONFIG_SENSORS_DELL_SMM is not set
# CONFIG_SENSORS_I5K_AMB is not set
# CONFIG_SENSORS_F71805F is not set
# CONFIG_SENSORS_F71882FG is not set
# CONFIG_SENSORS_F75375S is not set
# CONFIG_SENSORS_FSCHMD is not set
# CONFIG_SENSORS_FTSTEUTATES is not set
# CONFIG_SENSORS_GL518SM is not set
# CONFIG_SENSORS_GL520SM is not set
# CONFIG_SENSORS_G760A is not set
# CONFIG_SENSORS_G762 is not set
# CONFIG_SENSORS_HIH6130 is not set
# CONFIG_SENSORS_I5500 is not set
# CONFIG_SENSORS_CORETEMP is not set
# CONFIG_SENSORS_IT87 is not set
# CONFIG_SENSORS_JC42 is not set
# CONFIG_SENSORS_POWR1220 is not set
# CONFIG_SENSORS_LINEAGE is not set
# CONFIG_SENSORS_LTC2945 is not set
# CONFIG_SENSORS_LTC2990 is not set
# CONFIG_SENSORS_LTC4151 is not set
# CONFIG_SENSORS_LTC4215 is not set
# CONFIG_SENSORS_LTC4222 is not set
# CONFIG_SENSORS_LTC4245 is not set
# CONFIG_SENSORS_LTC4260 is not set
# CONFIG_SENSORS_LTC4261 is not set
# CONFIG_SENSORS_MAX16065 is not set
# CONFIG_SENSORS_MAX1619 is not set
# CONFIG_SENSORS_MAX1668 is not set
# CONFIG_SENSORS_MAX197 is not set
# CONFIG_SENSORS_MAX6639 is not set
# CONFIG_SENSORS_MAX6642 is not set
# CONFIG_SENSORS_MAX6650 is not set
# CONFIG_SENSORS_MAX6697 is not set
# CONFIG_SENSORS_MAX31790 is not set
# CONFIG_SENSORS_MCP3021 is not set
# CONFIG_SENSORS_TC654 is not set
# CONFIG_SENSORS_LM63 is not set
# CONFIG_SENSORS_LM73 is not set
# CONFIG_SENSORS_LM75 is not set
# CONFIG_SENSORS_LM77 is not set
# CONFIG_SENSORS_LM78 is not set
# CONFIG_SENSORS_LM80 is not set
# CONFIG_SENSORS_LM83 is not set
# CONFIG_SENSORS_LM85 is not set
# CONFIG_SENSORS_LM87 is not set
# CONFIG_SENSORS_LM90 is not set
# CONFIG_SENSORS_LM92 is not set
# CONFIG_SENSORS_LM93 is not set
# CONFIG_SENSORS_LM95234 is not set
# CONFIG_SENSORS_LM95241 is not set
# CONFIG_SENSORS_LM95245 is not set
# CONFIG_SENSORS_PC87360 is not set
# CONFIG_SENSORS_PC87427 is not set
# CONFIG_SENSORS_NTC_THERMISTOR is not set
# CONFIG_SENSORS_NCT6683 is not set
# CONFIG_SENSORS_NCT6775 is not set
# CONFIG_SENSORS_NCT7802 is not set
# CONFIG_SENSORS_NCT7904 is not set
# CONFIG_SENSORS_PCF8591 is not set
# CONFIG_PMBUS is not set
# CONFIG_SENSORS_SHT21 is not set
# CONFIG_SENSORS_SHT3x is not set
# CONFIG_SENSORS_SHTC1 is not set
# CONFIG_SENSORS_SIS5595 is not set
# CONFIG_SENSORS_DME1737 is not set
# CONFIG_SENSORS_EMC1403 is not set
# CONFIG_SENSORS_EMC2103 is not set
# CONFIG_SENSORS_EMC6W201 is not set
# CONFIG_SENSORS_SMSC47M1 is not set
# CONFIG_SENSORS_SMSC47M192 is not set
# CONFIG_SENSORS_SMSC47B397 is not set
# CONFIG_SENSORS_SCH56XX_COMMON is not set
# CONFIG_SENSORS_SCH5627 is not set
# CONFIG_SENSORS_SCH5636 is not set
# CONFIG_SENSORS_STTS751 is not set
# CONFIG_SENSORS_SMM665 is not set
# CONFIG_SENSORS_ADC128D818 is not set
# CONFIG_SENSORS_ADS1015 is not set
# CONFIG_SENSORS_ADS7828 is not set
# CONFIG_SENSORS_AMC6821 is not set
# CONFIG_SENSORS_INA209 is not set
# CONFIG_SENSORS_INA2XX is not set
# CONFIG_SENSORS_INA3221 is not set
# CONFIG_SENSORS_TC74 is not set
# CONFIG_SENSORS_THMC50 is not set
# CONFIG_SENSORS_TMP102 is not set
# CONFIG_SENSORS_TMP103 is not set
# CONFIG_SENSORS_TMP108 is not set
# CONFIG_SENSORS_TMP401 is not set
# CONFIG_SENSORS_TMP421 is not set
# CONFIG_SENSORS_VIA_CPUTEMP is not set
# CONFIG_SENSORS_VIA686A is not set
# CONFIG_SENSORS_VT1211 is not set
# CONFIG_SENSORS_VT8231 is not set
# CONFIG_SENSORS_W83781D is not set
# CONFIG_SENSORS_W83791D is not set
# CONFIG_SENSORS_W83792D is not set
# CONFIG_SENSORS_W83793 is not set
# CONFIG_SENSORS_W83795 is not set
# CONFIG_SENSORS_W83L785TS is not set
# CONFIG_SENSORS_W83L786NG is not set
# CONFIG_SENSORS_W83627HF is not set
# CONFIG_SENSORS_W83627EHF is not set
# CONFIG_SENSORS_XGENE is not set

#
# ACPI drivers
#
# CONFIG_SENSORS_ACPI_POWER is not set
# CONFIG_SENSORS_ATK0110 is not set
CONFIG_THERMAL=y
CONFIG_THERMAL_EMERGENCY_POWEROFF_DELAY_MS=0
CONFIG_THERMAL_HWMON=y
CONFIG_THERMAL_WRITABLE_TRIPS=y
CONFIG_THERMAL_DEFAULT_GOV_STEP_WISE=y
# CONFIG_THERMAL_DEFAULT_GOV_FAIR_SHARE is not set
# CONFIG_THERMAL_DEFAULT_GOV_USER_SPACE is not set
# CONFIG_THERMAL_DEFAULT_GOV_POWER_ALLOCATOR is not set
# CONFIG_THERMAL_GOV_FAIR_SHARE is not set
CONFIG_THERMAL_GOV_STEP_WISE=y
# CONFIG_THERMAL_GOV_BANG_BANG is not set
CONFIG_THERMAL_GOV_USER_SPACE=y
# CONFIG_THERMAL_GOV_POWER_ALLOCATOR is not set
# CONFIG_THERMAL_EMULATION is not set
# CONFIG_INTEL_POWERCLAMP is not set
CONFIG_X86_PKG_TEMP_THERMAL=m
# CONFIG_INTEL_SOC_DTS_THERMAL is not set

#
# ACPI INT340X thermal drivers
#
# CONFIG_INT340X_THERMAL is not set
# CONFIG_INTEL_PCH_THERMAL is not set
CONFIG_WATCHDOG=y
# CONFIG_WATCHDOG_CORE is not set
# CONFIG_WATCHDOG_NOWAYOUT is not set
CONFIG_WATCHDOG_HANDLE_BOOT_ENABLED=y
# CONFIG_WATCHDOG_SYSFS is not set

#
# Watchdog Device Drivers
#
# CONFIG_SOFT_WATCHDOG is not set
# CONFIG_WDAT_WDT is not set
# CONFIG_XILINX_WATCHDOG is not set
# CONFIG_ZIIRAVE_WATCHDOG is not set
# CONFIG_CADENCE_WATCHDOG is not set
# CONFIG_DW_WATCHDOG is not set
# CONFIG_MAX63XX_WATCHDOG is not set
# CONFIG_ACQUIRE_WDT is not set
# CONFIG_ADVANTECH_WDT is not set
# CONFIG_ALIM1535_WDT is not set
# CONFIG_ALIM7101_WDT is not set
# CONFIG_F71808E_WDT is not set
# CONFIG_SP5100_TCO is not set
# CONFIG_SBC_FITPC2_WATCHDOG is not set
# CONFIG_EUROTECH_WDT is not set
# CONFIG_IB700_WDT is not set
# CONFIG_IBMASR is not set
# CONFIG_WAFER_WDT is not set
# CONFIG_I6300ESB_WDT is not set
# CONFIG_IE6XX_WDT is not set
# CONFIG_ITCO_WDT is not set
# CONFIG_IT8712F_WDT is not set
# CONFIG_IT87_WDT is not set
# CONFIG_HP_WATCHDOG is not set
# CONFIG_SC1200_WDT is not set
# CONFIG_PC87413_WDT is not set
# CONFIG_NV_TCO is not set
# CONFIG_60XX_WDT is not set
# CONFIG_CPU5_WDT is not set
# CONFIG_SMSC_SCH311X_WDT is not set
# CONFIG_SMSC37B787_WDT is not set
# CONFIG_VIA_WDT is not set
# CONFIG_W83627HF_WDT is not set
# CONFIG_W83877F_WDT is not set
# CONFIG_W83977F_WDT is not set
# CONFIG_MACHZ_WDT is not set
# CONFIG_SBC_EPX_C3_WATCHDOG is not set
# CONFIG_NI903X_WDT is not set
# CONFIG_NIC7018_WDT is not set

#
# PCI-based Watchdog Cards
#
# CONFIG_PCIPCWATCHDOG is not set
# CONFIG_WDTPCI is not set

#
# USB-based Watchdog Cards
#
# CONFIG_USBPCWATCHDOG is not set

#
# Watchdog Pretimeout Governors
#
# CONFIG_WATCHDOG_PRETIMEOUT_GOV is not set
CONFIG_SSB_POSSIBLE=y

#
# Sonics Silicon Backplane
#
# CONFIG_SSB is not set
CONFIG_BCMA_POSSIBLE=y

#
# Broadcom specific AMBA
#
# CONFIG_BCMA is not set

#
# Multifunction device drivers
#
# CONFIG_MFD_CORE is not set
# CONFIG_MFD_AS3711 is not set
# CONFIG_PMIC_ADP5520 is not set
# CONFIG_MFD_BCM590XX is not set
# CONFIG_MFD_AXP20X_I2C is not set
# CONFIG_MFD_CROS_EC is not set
# CONFIG_PMIC_DA903X is not set
# CONFIG_MFD_DA9052_I2C is not set
# CONFIG_MFD_DA9055 is not set
# CONFIG_MFD_DA9062 is not set
# CONFIG_MFD_DA9063 is not set
# CONFIG_MFD_DA9150 is not set
# CONFIG_MFD_DLN2 is not set
# CONFIG_MFD_MC13XXX_I2C is not set
# CONFIG_HTC_PASIC3 is not set
# CONFIG_MFD_INTEL_QUARK_I2C_GPIO is not set
# CONFIG_LPC_ICH is not set
# CONFIG_LPC_SCH is not set
# CONFIG_INTEL_SOC_PMIC_CHTWC is not set
# CONFIG_MFD_INTEL_LPSS_ACPI is not set
# CONFIG_MFD_INTEL_LPSS_PCI is not set
# CONFIG_MFD_JANZ_CMODIO is not set
# CONFIG_MFD_KEMPLD is not set
# CONFIG_MFD_88PM800 is not set
# CONFIG_MFD_88PM805 is not set
# CONFIG_MFD_88PM860X is not set
# CONFIG_MFD_MAX14577 is not set
# CONFIG_MFD_MAX77693 is not set
# CONFIG_MFD_MAX77843 is not set
# CONFIG_MFD_MAX8907 is not set
# CONFIG_MFD_MAX8925 is not set
# CONFIG_MFD_MAX8997 is not set
# CONFIG_MFD_MAX8998 is not set
# CONFIG_MFD_MT6397 is not set
# CONFIG_MFD_MENF21BMC is not set
# CONFIG_MFD_VIPERBOARD is not set
# CONFIG_MFD_RETU is not set
# CONFIG_MFD_PCF50633 is not set
# CONFIG_MFD_RDC321X is not set
# CONFIG_MFD_RTSX_PCI is not set
# CONFIG_MFD_RT5033 is not set
# CONFIG_MFD_RTSX_USB is not set
# CONFIG_MFD_RC5T583 is not set
# CONFIG_MFD_SEC_CORE is not set
# CONFIG_MFD_SI476X_CORE is not set
# CONFIG_MFD_SM501 is not set
# CONFIG_MFD_SKY81452 is not set
# CONFIG_MFD_SMSC is not set
# CONFIG_ABX500_CORE is not set
# CONFIG_MFD_SYSCON is not set
# CONFIG_MFD_TI_AM335X_TSCADC is not set
# CONFIG_MFD_LP3943 is not set
# CONFIG_MFD_LP8788 is not set
# CONFIG_MFD_TI_LMU is not set
# CONFIG_MFD_PALMAS is not set
# CONFIG_TPS6105X is not set
# CONFIG_TPS6507X is not set
# CONFIG_MFD_TPS65086 is not set
# CONFIG_MFD_TPS65090 is not set
# CONFIG_MFD_TPS65217 is not set
# CONFIG_MFD_TI_LP873X is not set
# CONFIG_MFD_TPS65218 is not set
# CONFIG_MFD_TPS6586X is not set
# CONFIG_MFD_TPS65912_I2C is not set
# CONFIG_MFD_TPS80031 is not set
# CONFIG_TWL4030_CORE is not set
# CONFIG_TWL6040_CORE is not set
# CONFIG_MFD_WL1273_CORE is not set
# CONFIG_MFD_LM3533 is not set
# CONFIG_MFD_TMIO is not set
# CONFIG_MFD_VX855 is not set
# CONFIG_MFD_ARIZONA_I2C is not set
# CONFIG_MFD_WM8400 is not set
# CONFIG_MFD_WM831X_I2C is not set
# CONFIG_MFD_WM8350_I2C is not set
# CONFIG_MFD_WM8994 is not set
# CONFIG_REGULATOR is not set
# CONFIG_MEDIA_SUPPORT is not set

#
# Graphics support
#
CONFIG_AGP=y
CONFIG_AGP_AMD64=y
CONFIG_AGP_INTEL=y
# CONFIG_AGP_SIS is not set
# CONFIG_AGP_VIA is not set
CONFIG_INTEL_GTT=y
CONFIG_VGA_ARB=y
CONFIG_VGA_ARB_MAX_GPUS=16
# CONFIG_VGA_SWITCHEROO is not set
CONFIG_DRM=y
CONFIG_DRM_MIPI_DSI=y
# CONFIG_DRM_DP_AUX_CHARDEV is not set
# CONFIG_DRM_DEBUG_MM is not set
# CONFIG_DRM_DEBUG_MM_SELFTEST is not set
CONFIG_DRM_KMS_HELPER=y
CONFIG_DRM_KMS_FB_HELPER=y
CONFIG_DRM_FBDEV_EMULATION=y
CONFIG_DRM_FBDEV_OVERALLOC=100
# CONFIG_DRM_LOAD_EDID_FIRMWARE is not set

#
# I2C encoder or helper chips
#
# CONFIG_DRM_I2C_CH7006 is not set
# CONFIG_DRM_I2C_SIL164 is not set
# CONFIG_DRM_I2C_NXP_TDA998X is not set
# CONFIG_DRM_RADEON is not set
# CONFIG_DRM_AMDGPU is not set

#
# ACP (Audio CoProcessor) Configuration
#
# CONFIG_DRM_NOUVEAU is not set
CONFIG_DRM_I915=y
# CONFIG_DRM_I915_ALPHA_SUPPORT is not set
CONFIG_DRM_I915_CAPTURE_ERROR=y
CONFIG_DRM_I915_COMPRESS_ERROR=y
CONFIG_DRM_I915_USERPTR=y
# CONFIG_DRM_I915_GVT is not set

#
# drm/i915 Debugging
#
# CONFIG_DRM_I915_WERROR is not set
# CONFIG_DRM_I915_DEBUG is not set
# CONFIG_DRM_I915_SW_FENCE_DEBUG_OBJECTS is not set
# CONFIG_DRM_I915_SW_FENCE_CHECK_DAG is not set
# CONFIG_DRM_I915_SELFTEST is not set
# CONFIG_DRM_I915_LOW_LEVEL_TRACEPOINTS is not set
# CONFIG_DRM_I915_DEBUG_VBLANK_EVADE is not set
# CONFIG_DRM_VGEM is not set
# CONFIG_DRM_VMWGFX is not set
# CONFIG_DRM_GMA500 is not set
# CONFIG_DRM_UDL is not set
# CONFIG_DRM_AST is not set
# CONFIG_DRM_MGAG200 is not set
# CONFIG_DRM_CIRRUS_QEMU is not set
# CONFIG_DRM_QXL is not set
# CONFIG_DRM_BOCHS is not set
CONFIG_DRM_PANEL=y

#
# Display Panels
#
CONFIG_DRM_BRIDGE=y
CONFIG_DRM_PANEL_BRIDGE=y

#
# Display Interface Bridges
#
# CONFIG_DRM_ANALOGIX_ANX78XX is not set
# CONFIG_DRM_HISI_HIBMC is not set
# CONFIG_DRM_TINYDRM is not set
# CONFIG_DRM_LEGACY is not set
# CONFIG_DRM_LIB_RANDOM is not set

#
# Frame buffer Devices
#
CONFIG_FB=y
# CONFIG_FIRMWARE_EDID is not set
CONFIG_FB_CMDLINE=y
CONFIG_FB_NOTIFY=y
# CONFIG_FB_DDC is not set
# CONFIG_FB_BOOT_VESA_SUPPORT is not set
CONFIG_FB_CFB_FILLRECT=y
CONFIG_FB_CFB_COPYAREA=y
CONFIG_FB_CFB_IMAGEBLIT=y
# CONFIG_FB_CFB_REV_PIXELS_IN_BYTE is not set
CONFIG_FB_SYS_FILLRECT=y
CONFIG_FB_SYS_COPYAREA=y
CONFIG_FB_SYS_IMAGEBLIT=y
# CONFIG_FB_PROVIDE_GET_FB_UNMAPPED_AREA is not set
# CONFIG_FB_FOREIGN_ENDIAN is not set
CONFIG_FB_SYS_FOPS=y
CONFIG_FB_DEFERRED_IO=y
# CONFIG_FB_SVGALIB is not set
# CONFIG_FB_MACMODES is not set
# CONFIG_FB_BACKLIGHT is not set
CONFIG_FB_MODE_HELPERS=y
CONFIG_FB_TILEBLITTING=y

#
# Frame buffer hardware drivers
#
# CONFIG_FB_CIRRUS is not set
# CONFIG_FB_PM2 is not set
# CONFIG_FB_CYBER2000 is not set
# CONFIG_FB_ARC is not set
# CONFIG_FB_ASILIANT is not set
# CONFIG_FB_IMSTT is not set
# CONFIG_FB_VGA16 is not set
# CONFIG_FB_UVESA is not set
# CONFIG_FB_VESA is not set
CONFIG_FB_EFI=y
# CONFIG_FB_N411 is not set
# CONFIG_FB_HGA is not set
# CONFIG_FB_OPENCORES is not set
# CONFIG_FB_S1D13XXX is not set
# CONFIG_FB_NVIDIA is not set
# CONFIG_FB_RIVA is not set
# CONFIG_FB_I740 is not set
# CONFIG_FB_LE80578 is not set
# CONFIG_FB_MATROX is not set
# CONFIG_FB_RADEON is not set
# CONFIG_FB_ATY128 is not set
# CONFIG_FB_ATY is not set
# CONFIG_FB_S3 is not set
# CONFIG_FB_SAVAGE is not set
# CONFIG_FB_SIS is not set
# CONFIG_FB_NEOMAGIC is not set
# CONFIG_FB_KYRO is not set
# CONFIG_FB_3DFX is not set
# CONFIG_FB_VOODOO1 is not set
# CONFIG_FB_VT8623 is not set
# CONFIG_FB_TRIDENT is not set
# CONFIG_FB_ARK is not set
# CONFIG_FB_PM3 is not set
# CONFIG_FB_CARMINE is not set
# CONFIG_FB_SMSCUFX is not set
# CONFIG_FB_UDL is not set
# CONFIG_FB_IBM_GXT4500 is not set
# CONFIG_FB_VIRTUAL is not set
# CONFIG_FB_METRONOME is not set
# CONFIG_FB_MB862XX is not set
# CONFIG_FB_BROADSHEET is not set
# CONFIG_FB_AUO_K190X is not set
# CONFIG_FB_SIMPLE is not set
# CONFIG_FB_SM712 is not set
CONFIG_BACKLIGHT_LCD_SUPPORT=y
# CONFIG_LCD_CLASS_DEVICE is not set
CONFIG_BACKLIGHT_CLASS_DEVICE=y
CONFIG_BACKLIGHT_GENERIC=y
# CONFIG_BACKLIGHT_APPLE is not set
# CONFIG_BACKLIGHT_PM8941_WLED is not set
# CONFIG_BACKLIGHT_SAHARA is not set
# CONFIG_BACKLIGHT_ADP8860 is not set
# CONFIG_BACKLIGHT_ADP8870 is not set
# CONFIG_BACKLIGHT_LM3639 is not set
# CONFIG_BACKLIGHT_LV5207LP is not set
# CONFIG_BACKLIGHT_BD6107 is not set
# CONFIG_BACKLIGHT_ARCXCNN is not set
# CONFIG_VGASTATE is not set
CONFIG_HDMI=y

#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
CONFIG_VGACON_SOFT_SCROLLBACK=y
CONFIG_VGACON_SOFT_SCROLLBACK_SIZE=64
# CONFIG_VGACON_SOFT_SCROLLBACK_PERSISTENT_ENABLE_BY_DEFAULT is not set
CONFIG_DUMMY_CONSOLE=y
CONFIG_DUMMY_CONSOLE_COLUMNS=80
CONFIG_DUMMY_CONSOLE_ROWS=25
CONFIG_FRAMEBUFFER_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE_DETECT_PRIMARY=y
# CONFIG_FRAMEBUFFER_CONSOLE_ROTATION is not set
CONFIG_LOGO=y
# CONFIG_LOGO_LINUX_MONO is not set
# CONFIG_LOGO_LINUX_VGA16 is not set
CONFIG_LOGO_LINUX_CLUT224=y
CONFIG_SOUND=y
CONFIG_SOUND_OSS_CORE=y
CONFIG_SOUND_OSS_CORE_PRECLAIM=y
CONFIG_SND=y
CONFIG_SND_TIMER=y
CONFIG_SND_PCM=y
CONFIG_SND_SEQ_DEVICE=y
CONFIG_SND_OSSEMUL=y
CONFIG_SND_MIXER_OSS=y
CONFIG_SND_PCM_OSS=y
CONFIG_SND_PCM_OSS_PLUGINS=y
CONFIG_SND_PCM_TIMER=y
CONFIG_SND_HRTIMER=y
# CONFIG_SND_DYNAMIC_MINORS is not set
CONFIG_SND_SUPPORT_OLD_API=y
CONFIG_SND_PROC_FS=y
CONFIG_SND_VERBOSE_PROCFS=y
# CONFIG_SND_VERBOSE_PRINTK is not set
# CONFIG_SND_DEBUG is not set
CONFIG_SND_DMA_SGBUF=y
CONFIG_SND_SEQUENCER=y
CONFIG_SND_SEQ_DUMMY=y
CONFIG_SND_SEQUENCER_OSS=y
CONFIG_SND_SEQ_HRTIMER_DEFAULT=y
CONFIG_SND_SEQ_MIDI_EVENT=y
# CONFIG_SND_OPL3_LIB_SEQ is not set
# CONFIG_SND_OPL4_LIB_SEQ is not set
CONFIG_SND_DRIVERS=y
# CONFIG_SND_PCSP is not set
# CONFIG_SND_DUMMY is not set
# CONFIG_SND_ALOOP is not set
# CONFIG_SND_VIRMIDI is not set
# CONFIG_SND_MTPAV is not set
# CONFIG_SND_SERIAL_U16550 is not set
# CONFIG_SND_MPU401 is not set
CONFIG_SND_PCI=y
# CONFIG_SND_AD1889 is not set
# CONFIG_SND_ALS300 is not set
# CONFIG_SND_ALS4000 is not set
# CONFIG_SND_ALI5451 is not set
# CONFIG_SND_ASIHPI is not set
# CONFIG_SND_ATIIXP is not set
# CONFIG_SND_ATIIXP_MODEM is not set
# CONFIG_SND_AU8810 is not set
# CONFIG_SND_AU8820 is not set
# CONFIG_SND_AU8830 is not set
# CONFIG_SND_AW2 is not set
# CONFIG_SND_AZT3328 is not set
# CONFIG_SND_BT87X is not set
# CONFIG_SND_CA0106 is not set
# CONFIG_SND_CMIPCI is not set
# CONFIG_SND_OXYGEN is not set
# CONFIG_SND_CS4281 is not set
# CONFIG_SND_CS46XX is not set
# CONFIG_SND_CTXFI is not set
# CONFIG_SND_DARLA20 is not set
# CONFIG_SND_GINA20 is not set
# CONFIG_SND_LAYLA20 is not set
# CONFIG_SND_DARLA24 is not set
# CONFIG_SND_GINA24 is not set
# CONFIG_SND_LAYLA24 is not set
# CONFIG_SND_MONA is not set
# CONFIG_SND_MIA is not set
# CONFIG_SND_ECHO3G is not set
# CONFIG_SND_INDIGO is not set
# CONFIG_SND_INDIGOIO is not set
# CONFIG_SND_INDIGODJ is not set
# CONFIG_SND_INDIGOIOX is not set
# CONFIG_SND_INDIGODJX is not set
# CONFIG_SND_EMU10K1 is not set
# CONFIG_SND_EMU10K1_SEQ is not set
# CONFIG_SND_EMU10K1X is not set
# CONFIG_SND_ENS1370 is not set
# CONFIG_SND_ENS1371 is not set
# CONFIG_SND_ES1938 is not set
# CONFIG_SND_ES1968 is not set
# CONFIG_SND_FM801 is not set
# CONFIG_SND_HDSP is not set
# CONFIG_SND_HDSPM is not set
# CONFIG_SND_ICE1712 is not set
# CONFIG_SND_ICE1724 is not set
# CONFIG_SND_INTEL8X0 is not set
# CONFIG_SND_INTEL8X0M is not set
# CONFIG_SND_KORG1212 is not set
# CONFIG_SND_LOLA is not set
# CONFIG_SND_LX6464ES is not set
# CONFIG_SND_MAESTRO3 is not set
# CONFIG_SND_MIXART is not set
# CONFIG_SND_NM256 is not set
# CONFIG_SND_PCXHR is not set
# CONFIG_SND_RIPTIDE is not set
# CONFIG_SND_RME32 is not set
# CONFIG_SND_RME96 is not set
# CONFIG_SND_RME9652 is not set
# CONFIG_SND_SE6X is not set
# CONFIG_SND_SONICVIBES is not set
# CONFIG_SND_TRIDENT is not set
# CONFIG_SND_VIA82XX is not set
# CONFIG_SND_VIA82XX_MODEM is not set
# CONFIG_SND_VIRTUOSO is not set
# CONFIG_SND_VX222 is not set
# CONFIG_SND_YMFPCI is not set

#
# HD-Audio
#
# CONFIG_SND_HDA_INTEL is not set
CONFIG_SND_HDA_PREALLOC_SIZE=64
CONFIG_SND_USB=y
# CONFIG_SND_USB_AUDIO is not set
# CONFIG_SND_USB_UA101 is not set
# CONFIG_SND_USB_USX2Y is not set
# CONFIG_SND_USB_CAIAQ is not set
# CONFIG_SND_USB_US122L is not set
# CONFIG_SND_USB_6FIRE is not set
# CONFIG_SND_USB_HIFACE is not set
# CONFIG_SND_BCD2000 is not set
# CONFIG_SND_USB_POD is not set
# CONFIG_SND_USB_PODHD is not set
# CONFIG_SND_USB_TONEPORT is not set
# CONFIG_SND_USB_VARIAX is not set
CONFIG_SND_PCMCIA=y
# CONFIG_SND_VXPOCKET is not set
# CONFIG_SND_PDAUDIOCF is not set
# CONFIG_SND_SOC is not set
CONFIG_SND_X86=y
# CONFIG_HDMI_LPE_AUDIO is not set

#
# HID support
#
CONFIG_HID=y
# CONFIG_HID_BATTERY_STRENGTH is not set
CONFIG_HIDRAW=y
# CONFIG_UHID is not set
CONFIG_HID_GENERIC=y

#
# Special HID drivers
#
CONFIG_HID_A4TECH=y
# CONFIG_HID_ACCUTOUCH is not set
# CONFIG_HID_ACRUX is not set
CONFIG_HID_APPLE=y
# CONFIG_HID_APPLEIR is not set
# CONFIG_HID_ASUS is not set
# CONFIG_HID_AUREAL is not set
CONFIG_HID_BELKIN=y
# CONFIG_HID_BETOP_FF is not set
CONFIG_HID_CHERRY=y
CONFIG_HID_CHICONY=y
# CONFIG_HID_CORSAIR is not set
# CONFIG_HID_PRODIKEYS is not set
# CONFIG_HID_CMEDIA is not set
CONFIG_HID_CYPRESS=y
# CONFIG_HID_DRAGONRISE is not set
# CONFIG_HID_EMS_FF is not set
# CONFIG_HID_ELECOM is not set
# CONFIG_HID_ELO is not set
CONFIG_HID_EZKEY=y
# CONFIG_HID_GEMBIRD is not set
# CONFIG_HID_GFRM is not set
# CONFIG_HID_HOLTEK is not set
# CONFIG_HID_GT683R is not set
# CONFIG_HID_KEYTOUCH is not set
# CONFIG_HID_KYE is not set
# CONFIG_HID_UCLOGIC is not set
# CONFIG_HID_WALTOP is not set
CONFIG_HID_GYRATION=y
# CONFIG_HID_ICADE is not set
# CONFIG_HID_ITE is not set
# CONFIG_HID_TWINHAN is not set
CONFIG_HID_KENSINGTON=y
# CONFIG_HID_LCPOWER is not set
# CONFIG_HID_LED is not set
# CONFIG_HID_LENOVO is not set
CONFIG_HID_LOGITECH=y
# CONFIG_HID_LOGITECH_DJ is not set
# CONFIG_HID_LOGITECH_HIDPP is not set
CONFIG_LOGITECH_FF=y
# CONFIG_LOGIRUMBLEPAD2_FF is not set
# CONFIG_LOGIG940_FF is not set
CONFIG_LOGIWHEELS_FF=y
# CONFIG_HID_MAGICMOUSE is not set
# CONFIG_HID_MAYFLASH is not set
CONFIG_HID_MICROSOFT=y
CONFIG_HID_MONTEREY=y
# CONFIG_HID_MULTITOUCH is not set
# CONFIG_HID_NTI is not set
CONFIG_HID_NTRIG=y
# CONFIG_HID_ORTEK is not set
CONFIG_HID_PANTHERLORD=y
CONFIG_PANTHERLORD_FF=y
# CONFIG_HID_PENMOUNT is not set
CONFIG_HID_PETALYNX=y
# CONFIG_HID_PICOLCD is not set
# CONFIG_HID_PLANTRONICS is not set
# CONFIG_HID_PRIMAX is not set
# CONFIG_HID_RETRODE is not set
# CONFIG_HID_ROCCAT is not set
# CONFIG_HID_SAITEK is not set
CONFIG_HID_SAMSUNG=y
CONFIG_HID_SONY=y
# CONFIG_SONY_FF is not set
# CONFIG_HID_SPEEDLINK is not set
# CONFIG_HID_STEELSERIES is not set
CONFIG_HID_SUNPLUS=y
# CONFIG_HID_RMI is not set
# CONFIG_HID_GREENASIA is not set
# CONFIG_HID_SMARTJOYPLUS is not set
# CONFIG_HID_TIVO is not set
CONFIG_HID_TOPSEED=y
# CONFIG_HID_THINGM is not set
# CONFIG_HID_THRUSTMASTER is not set
# CONFIG_HID_UDRAW_PS3 is not set
# CONFIG_HID_WACOM is not set
# CONFIG_HID_WIIMOTE is not set
# CONFIG_HID_XINMO is not set
# CONFIG_HID_ZEROPLUS is not set
# CONFIG_HID_ZYDACRON is not set
# CONFIG_HID_SENSOR_HUB is not set
# CONFIG_HID_ALPS is not set

#
# USB HID support
#
CONFIG_USB_HID=y
CONFIG_HID_PID=y
CONFIG_USB_HIDDEV=y

#
# I2C HID support
#
# CONFIG_I2C_HID is not set

#
# Intel ISH HID support
#
# CONFIG_INTEL_ISH_HID is not set
CONFIG_USB_OHCI_LITTLE_ENDIAN=y
CONFIG_USB_SUPPORT=y
CONFIG_USB_COMMON=y
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB=y
CONFIG_USB_PCI=y
CONFIG_USB_ANNOUNCE_NEW_DEVICES=y

#
# Miscellaneous USB options
#
CONFIG_USB_DEFAULT_PERSIST=y
# CONFIG_USB_DYNAMIC_MINORS is not set
# CONFIG_USB_OTG is not set
# CONFIG_USB_OTG_WHITELIST is not set
# CONFIG_USB_OTG_BLACKLIST_HUB is not set
# CONFIG_USB_LEDS_TRIGGER_USBPORT is not set
CONFIG_USB_MON=y
# CONFIG_USB_WUSB_CBAF is not set

#
# USB Host Controller Drivers
#
# CONFIG_USB_C67X00_HCD is not set
CONFIG_USB_XHCI_HCD=y
CONFIG_USB_XHCI_PCI=y
# CONFIG_USB_XHCI_PLATFORM is not set
CONFIG_USB_EHCI_HCD=y
# CONFIG_USB_EHCI_ROOT_HUB_TT is not set
CONFIG_USB_EHCI_TT_NEWSCHED=y
CONFIG_USB_EHCI_PCI=y
# CONFIG_USB_EHCI_HCD_PLATFORM is not set
# CONFIG_USB_OXU210HP_HCD is not set
# CONFIG_USB_ISP116X_HCD is not set
# CONFIG_USB_ISP1362_HCD is not set
# CONFIG_USB_FOTG210_HCD is not set
CONFIG_USB_OHCI_HCD=y
CONFIG_USB_OHCI_HCD_PCI=y
# CONFIG_USB_OHCI_HCD_PLATFORM is not set
CONFIG_USB_UHCI_HCD=y
# CONFIG_USB_SL811_HCD is not set
# CONFIG_USB_R8A66597_HCD is not set
# CONFIG_USB_HCD_TEST_MODE is not set

#
# USB Device Class drivers
#
# CONFIG_USB_ACM is not set
CONFIG_USB_PRINTER=y
# CONFIG_USB_WDM is not set
# CONFIG_USB_TMC is not set

#
# NOTE: USB_STORAGE depends on SCSI but BLK_DEV_SD may
#

#
# also be needed; see USB_STORAGE Help for more info
#
CONFIG_USB_STORAGE=y
# CONFIG_USB_STORAGE_DEBUG is not set
# CONFIG_USB_STORAGE_REALTEK is not set
# CONFIG_USB_STORAGE_DATAFAB is not set
# CONFIG_USB_STORAGE_FREECOM is not set
# CONFIG_USB_STORAGE_ISD200 is not set
# CONFIG_USB_STORAGE_USBAT is not set
# CONFIG_USB_STORAGE_SDDR09 is not set
# CONFIG_USB_STORAGE_SDDR55 is not set
# CONFIG_USB_STORAGE_JUMPSHOT is not set
# CONFIG_USB_STORAGE_ALAUDA is not set
# CONFIG_USB_STORAGE_ONETOUCH is not set
# CONFIG_USB_STORAGE_KARMA is not set
# CONFIG_USB_STORAGE_CYPRESS_ATACB is not set
# CONFIG_USB_STORAGE_ENE_UB6250 is not set
# CONFIG_USB_UAS is not set

#
# USB Imaging devices
#
# CONFIG_USB_MDC800 is not set
# CONFIG_USB_MICROTEK is not set
# CONFIG_USBIP_CORE is not set
# CONFIG_USB_MUSB_HDRC is not set
# CONFIG_USB_DWC3 is not set
# CONFIG_USB_DWC2 is not set
# CONFIG_USB_CHIPIDEA is not set
# CONFIG_USB_ISP1760 is not set

#
# USB port drivers
#
CONFIG_USB_SERIAL=y
# CONFIG_USB_SERIAL_CONSOLE is not set
# CONFIG_USB_SERIAL_GENERIC is not set
# CONFIG_USB_SERIAL_SIMPLE is not set
# CONFIG_USB_SERIAL_AIRCABLE is not set
# CONFIG_USB_SERIAL_ARK3116 is not set
# CONFIG_USB_SERIAL_BELKIN is not set
# CONFIG_USB_SERIAL_CH341 is not set
# CONFIG_USB_SERIAL_WHITEHEAT is not set
# CONFIG_USB_SERIAL_DIGI_ACCELEPORT is not set
# CONFIG_USB_SERIAL_CP210X is not set
# CONFIG_USB_SERIAL_CYPRESS_M8 is not set
# CONFIG_USB_SERIAL_EMPEG is not set
# CONFIG_USB_SERIAL_FTDI_SIO is not set
# CONFIG_USB_SERIAL_VISOR is not set
# CONFIG_USB_SERIAL_IPAQ is not set
# CONFIG_USB_SERIAL_IR is not set
# CONFIG_USB_SERIAL_EDGEPORT is not set
# CONFIG_USB_SERIAL_EDGEPORT_TI is not set
# CONFIG_USB_SERIAL_F81232 is not set
# CONFIG_USB_SERIAL_F8153X is not set
# CONFIG_USB_SERIAL_GARMIN is not set
# CONFIG_USB_SERIAL_IPW is not set
# CONFIG_USB_SERIAL_IUU is not set
# CONFIG_USB_SERIAL_KEYSPAN_PDA is not set
# CONFIG_USB_SERIAL_KEYSPAN is not set
# CONFIG_USB_SERIAL_KLSI is not set
# CONFIG_USB_SERIAL_KOBIL_SCT is not set
# CONFIG_USB_SERIAL_MCT_U232 is not set
# CONFIG_USB_SERIAL_METRO is not set
# CONFIG_USB_SERIAL_MOS7720 is not set
# CONFIG_USB_SERIAL_MOS7840 is not set
# CONFIG_USB_SERIAL_MXUPORT is not set
# CONFIG_USB_SERIAL_NAVMAN is not set
# CONFIG_USB_SERIAL_PL2303 is not set
# CONFIG_USB_SERIAL_OTI6858 is not set
# CONFIG_USB_SERIAL_QCAUX is not set
# CONFIG_USB_SERIAL_QUALCOMM is not set
# CONFIG_USB_SERIAL_SPCP8X5 is not set
# CONFIG_USB_SERIAL_SAFE is not set
# CONFIG_USB_SERIAL_SIERRAWIRELESS is not set
# CONFIG_USB_SERIAL_SYMBOL is not set
# CONFIG_USB_SERIAL_TI is not set
# CONFIG_USB_SERIAL_CYBERJACK is not set
# CONFIG_USB_SERIAL_XIRCOM is not set
# CONFIG_USB_SERIAL_OPTION is not set
# CONFIG_USB_SERIAL_OMNINET is not set
# CONFIG_USB_SERIAL_OPTICON is not set
# CONFIG_USB_SERIAL_XSENS_MT is not set
# CONFIG_USB_SERIAL_WISHBONE is not set
# CONFIG_USB_SERIAL_SSU100 is not set
# CONFIG_USB_SERIAL_QT2 is not set
# CONFIG_USB_SERIAL_UPD78F0730 is not set
# CONFIG_USB_SERIAL_DEBUG is not set

#
# USB Miscellaneous drivers
#
# CONFIG_USB_EMI62 is not set
# CONFIG_USB_EMI26 is not set
# CONFIG_USB_ADUTUX is not set
# CONFIG_USB_SEVSEG is not set
# CONFIG_USB_RIO500 is not set
# CONFIG_USB_LEGOTOWER is not set
# CONFIG_USB_LCD is not set
# CONFIG_USB_CYPRESS_CY7C63 is not set
# CONFIG_USB_CYTHERM is not set
# CONFIG_USB_IDMOUSE is not set
# CONFIG_USB_FTDI_ELAN is not set
# CONFIG_USB_APPLEDISPLAY is not set
# CONFIG_USB_SISUSBVGA is not set
# CONFIG_USB_LD is not set
# CONFIG_USB_TRANCEVIBRATOR is not set
# CONFIG_USB_IOWARRIOR is not set
# CONFIG_USB_TEST is not set
# CONFIG_USB_EHSET_TEST_FIXTURE is not set
# CONFIG_USB_ISIGHTFW is not set
# CONFIG_USB_YUREX is not set
# CONFIG_USB_EZUSB_FX2 is not set
# CONFIG_USB_HUB_USB251XB is not set
# CONFIG_USB_HSIC_USB3503 is not set
# CONFIG_USB_HSIC_USB4604 is not set
# CONFIG_USB_LINK_LAYER_TEST is not set
# CONFIG_USB_CHAOSKEY is not set
# CONFIG_UCSI is not set

#
# USB Physical Layer drivers
#
# CONFIG_USB_PHY is not set
# CONFIG_NOP_USB_XCEIV is not set
# CONFIG_USB_ISP1301 is not set
# CONFIG_USB_GADGET is not set

#
# USB Power Delivery and Type-C drivers
#
# CONFIG_USB_LED_TRIG is not set
# CONFIG_USB_ULPI_BUS is not set
# CONFIG_UWB is not set
# CONFIG_MMC is not set
# CONFIG_MEMSTICK is not set
CONFIG_NEW_LEDS=y
CONFIG_LEDS_CLASS=y
# CONFIG_LEDS_CLASS_FLASH is not set
# CONFIG_LEDS_BRIGHTNESS_HW_CHANGED is not set

#
# LED drivers
#
# CONFIG_LEDS_LM3530 is not set
# CONFIG_LEDS_LM3642 is not set
# CONFIG_LEDS_PCA9532 is not set
# CONFIG_LEDS_LP3944 is not set
# CONFIG_LEDS_LP5521 is not set
# CONFIG_LEDS_LP5523 is not set
# CONFIG_LEDS_LP5562 is not set
# CONFIG_LEDS_LP8501 is not set
# CONFIG_LEDS_LP8860 is not set
# CONFIG_LEDS_CLEVO_MAIL is not set
# CONFIG_LEDS_PCA955X is not set
# CONFIG_LEDS_PCA963X is not set
# CONFIG_LEDS_BD2802 is not set
# CONFIG_LEDS_INTEL_SS4200 is not set
# CONFIG_LEDS_TCA6507 is not set
# CONFIG_LEDS_TLC591XX is not set
# CONFIG_LEDS_LM355x is not set

#
# LED driver for blink(1) USB RGB LED is under Special HID drivers (HID_THINGM)
#
# CONFIG_LEDS_BLINKM is not set
# CONFIG_LEDS_MLXCPLD is not set
# CONFIG_LEDS_USER is not set
# CONFIG_LEDS_NIC78BX is not set

#
# LED Triggers
#
CONFIG_LEDS_TRIGGERS=y
# CONFIG_LEDS_TRIGGER_TIMER is not set
# CONFIG_LEDS_TRIGGER_ONESHOT is not set
# CONFIG_LEDS_TRIGGER_DISK is not set
# CONFIG_LEDS_TRIGGER_HEARTBEAT is not set
# CONFIG_LEDS_TRIGGER_BACKLIGHT is not set
# CONFIG_LEDS_TRIGGER_CPU is not set
# CONFIG_LEDS_TRIGGER_DEFAULT_ON is not set

#
# iptables trigger is under Netfilter config (LED target)
#
# CONFIG_LEDS_TRIGGER_TRANSIENT is not set
# CONFIG_LEDS_TRIGGER_CAMERA is not set
# CONFIG_LEDS_TRIGGER_PANIC is not set
# CONFIG_ACCESSIBILITY is not set
# CONFIG_INFINIBAND is not set
CONFIG_EDAC_ATOMIC_SCRUB=y
CONFIG_EDAC_SUPPORT=y
CONFIG_EDAC=y
CONFIG_EDAC_LEGACY_SYSFS=y
# CONFIG_EDAC_DEBUG is not set
CONFIG_EDAC_DECODE_MCE=y
# CONFIG_EDAC_AMD64 is not set
# CONFIG_EDAC_E752X is not set
# CONFIG_EDAC_I82975X is not set
# CONFIG_EDAC_I3000 is not set
# CONFIG_EDAC_I3200 is not set
# CONFIG_EDAC_IE31200 is not set
# CONFIG_EDAC_X38 is not set
# CONFIG_EDAC_I5400 is not set
# CONFIG_EDAC_I7CORE is not set
# CONFIG_EDAC_I5000 is not set
# CONFIG_EDAC_I5100 is not set
# CONFIG_EDAC_I7300 is not set
# CONFIG_EDAC_SBRIDGE is not set
# CONFIG_EDAC_SKX is not set
# CONFIG_EDAC_PND2 is not set
CONFIG_RTC_LIB=y
CONFIG_RTC_MC146818_LIB=y
CONFIG_RTC_CLASS=y
# CONFIG_RTC_HCTOSYS is not set
CONFIG_RTC_SYSTOHC=y
CONFIG_RTC_SYSTOHC_DEVICE="rtc0"
# CONFIG_RTC_DEBUG is not set

#
# RTC interfaces
#
CONFIG_RTC_INTF_SYSFS=y
CONFIG_RTC_INTF_PROC=y
CONFIG_RTC_INTF_DEV=y
# CONFIG_RTC_INTF_DEV_UIE_EMUL is not set
# CONFIG_RTC_DRV_TEST is not set

#
# I2C RTC drivers
#
# CONFIG_RTC_DRV_ABB5ZES3 is not set
# CONFIG_RTC_DRV_ABX80X is not set
# CONFIG_RTC_DRV_DS1307 is not set
# CONFIG_RTC_DRV_DS1374 is not set
# CONFIG_RTC_DRV_DS1672 is not set
# CONFIG_RTC_DRV_MAX6900 is not set
# CONFIG_RTC_DRV_RS5C372 is not set
# CONFIG_RTC_DRV_ISL1208 is not set
# CONFIG_RTC_DRV_ISL12022 is not set
# CONFIG_RTC_DRV_X1205 is not set
# CONFIG_RTC_DRV_PCF8523 is not set
# CONFIG_RTC_DRV_PCF85063 is not set
# CONFIG_RTC_DRV_PCF8563 is not set
# CONFIG_RTC_DRV_PCF8583 is not set
# CONFIG_RTC_DRV_M41T80 is not set
# CONFIG_RTC_DRV_BQ32K is not set
# CONFIG_RTC_DRV_S35390A is not set
# CONFIG_RTC_DRV_FM3130 is not set
# CONFIG_RTC_DRV_RX8010 is not set
# CONFIG_RTC_DRV_RX8581 is not set
# CONFIG_RTC_DRV_RX8025 is not set
# CONFIG_RTC_DRV_EM3027 is not set
# CONFIG_RTC_DRV_RV8803 is not set

#
# SPI RTC drivers
#
CONFIG_RTC_I2C_AND_SPI=y

#
# SPI and I2C RTC drivers
#
# CONFIG_RTC_DRV_DS3232 is not set
# CONFIG_RTC_DRV_PCF2127 is not set
# CONFIG_RTC_DRV_RV3029C2 is not set

#
# Platform RTC drivers
#
CONFIG_RTC_DRV_CMOS=y
# CONFIG_RTC_DRV_DS1286 is not set
# CONFIG_RTC_DRV_DS1511 is not set
# CONFIG_RTC_DRV_DS1553 is not set
# CONFIG_RTC_DRV_DS1685_FAMILY is not set
# CONFIG_RTC_DRV_DS1742 is not set
# CONFIG_RTC_DRV_DS2404 is not set
# CONFIG_RTC_DRV_STK17TA8 is not set
# CONFIG_RTC_DRV_M48T86 is not set
# CONFIG_RTC_DRV_M48T35 is not set
# CONFIG_RTC_DRV_M48T59 is not set
# CONFIG_RTC_DRV_MSM6242 is not set
# CONFIG_RTC_DRV_BQ4802 is not set
# CONFIG_RTC_DRV_RP5C01 is not set
# CONFIG_RTC_DRV_V3020 is not set

#
# on-CPU RTC drivers
#

#
# HID Sensor RTC drivers
#
# CONFIG_RTC_DRV_HID_SENSOR_TIME is not set
CONFIG_DMADEVICES=y
# CONFIG_DMADEVICES_DEBUG is not set

#
# DMA Devices
#
CONFIG_DMA_ENGINE=y
CONFIG_DMA_VIRTUAL_CHANNELS=y
CONFIG_DMA_ACPI=y
# CONFIG_INTEL_IDMA64 is not set
# CONFIG_INTEL_IOATDMA is not set
# CONFIG_QCOM_HIDMA_MGMT is not set
# CONFIG_QCOM_HIDMA is not set
CONFIG_DW_DMAC_CORE=y
# CONFIG_DW_DMAC is not set
# CONFIG_DW_DMAC_PCI is not set
CONFIG_HSU_DMA=y

#
# DMA Clients
#
# CONFIG_ASYNC_TX_DMA is not set
# CONFIG_DMATEST is not set

#
# DMABUF options
#
CONFIG_SYNC_FILE=y
# CONFIG_SW_SYNC is not set
# CONFIG_AUXDISPLAY is not set
# CONFIG_UIO is not set
# CONFIG_VFIO is not set
# CONFIG_VIRT_DRIVERS is not set

#
# Virtio drivers
#
# CONFIG_VIRTIO_PCI is not set
# CONFIG_VIRTIO_MMIO is not set

#
# Microsoft Hyper-V guest support
#
# CONFIG_HYPERV_TSCPAGE is not set
# CONFIG_STAGING is not set
CONFIG_X86_PLATFORM_DEVICES=y
# CONFIG_ACERHDF is not set
# CONFIG_ASUS_LAPTOP is not set
# CONFIG_DELL_LAPTOP is not set
# CONFIG_DELL_SMO8800 is not set
# CONFIG_DELL_RBTN is not set
# CONFIG_FUJITSU_LAPTOP is not set
# CONFIG_FUJITSU_TABLET is not set
# CONFIG_AMILO_RFKILL is not set
# CONFIG_HP_ACCEL is not set
# CONFIG_HP_WIRELESS is not set
# CONFIG_MSI_LAPTOP is not set
# CONFIG_PANASONIC_LAPTOP is not set
# CONFIG_COMPAL_LAPTOP is not set
# CONFIG_SONY_LAPTOP is not set
# CONFIG_IDEAPAD_LAPTOP is not set
# CONFIG_THINKPAD_ACPI is not set
# CONFIG_SENSORS_HDAPS is not set
# CONFIG_INTEL_MENLOW is not set
CONFIG_EEEPC_LAPTOP=y
# CONFIG_ASUS_WIRELESS is not set
# CONFIG_ACPI_WMI is not set
# CONFIG_TOPSTAR_LAPTOP is not set
# CONFIG_TOSHIBA_BT_RFKILL is not set
# CONFIG_TOSHIBA_HAPS is not set
# CONFIG_ACPI_CMPC is not set
# CONFIG_INTEL_CHT_INT33FE is not set
# CONFIG_INTEL_HID_EVENT is not set
# CONFIG_INTEL_VBTN is not set
# CONFIG_INTEL_IPS is not set
# CONFIG_INTEL_PMC_CORE is not set
# CONFIG_IBM_RTL is not set
# CONFIG_SAMSUNG_LAPTOP is not set
# CONFIG_INTEL_OAKTRAIL is not set
# CONFIG_SAMSUNG_Q10 is not set
# CONFIG_APPLE_GMUX is not set
# CONFIG_INTEL_RST is not set
# CONFIG_INTEL_SMARTCONNECT is not set
# CONFIG_PVPANIC is not set
# CONFIG_INTEL_PMC_IPC is not set
# CONFIG_SURFACE_PRO3_BUTTON is not set
# CONFIG_INTEL_PUNIT_IPC is not set
# CONFIG_MLX_PLATFORM is not set
# CONFIG_MLX_CPLD_PLATFORM is not set
# CONFIG_INTEL_TURBO_MAX_3 is not set
CONFIG_PMC_ATOM=y
# CONFIG_CHROME_PLATFORMS is not set
CONFIG_CLKDEV_LOOKUP=y
CONFIG_HAVE_CLK_PREPARE=y
CONFIG_COMMON_CLK=y

#
# Common Clock Framework
#
# CONFIG_COMMON_CLK_SI5351 is not set
# CONFIG_COMMON_CLK_CDCE706 is not set
# CONFIG_COMMON_CLK_CS2000_CP is not set
# CONFIG_COMMON_CLK_NXP is not set
# CONFIG_COMMON_CLK_PXA is not set
# CONFIG_COMMON_CLK_PIC32 is not set
# CONFIG_HWSPINLOCK is not set

#
# Clock Source drivers
#
CONFIG_CLKEVT_I8253=y
CONFIG_I8253_LOCK=y
CONFIG_CLKBLD_I8253=y
# CONFIG_ATMEL_PIT is not set
# CONFIG_SH_TIMER_CMT is not set
# CONFIG_SH_TIMER_MTU2 is not set
# CONFIG_SH_TIMER_TMU is not set
# CONFIG_EM_TIMER_STI is not set
CONFIG_MAILBOX=y
CONFIG_PCC=y
# CONFIG_ALTERA_MBOX is not set
CONFIG_IOMMU_API=y
CONFIG_IOMMU_SUPPORT=y

#
# Generic IOMMU Pagetable Support
#
CONFIG_IOMMU_IOVA=y
CONFIG_AMD_IOMMU=y
# CONFIG_AMD_IOMMU_V2 is not set
CONFIG_DMAR_TABLE=y
CONFIG_INTEL_IOMMU=y
# CONFIG_INTEL_IOMMU_SVM is not set
# CONFIG_INTEL_IOMMU_DEFAULT_ON is not set
CONFIG_INTEL_IOMMU_FLOPPY_WA=y
# CONFIG_IRQ_REMAP is not set

#
# Remoteproc drivers
#
# CONFIG_REMOTEPROC is not set

#
# Rpmsg drivers
#
# CONFIG_RPMSG_QCOM_GLINK_RPM is not set

#
# SOC (System On Chip) specific Drivers
#

#
# Broadcom SoC drivers
#

#
# i.MX SoC drivers
#
# CONFIG_SUNXI_SRAM is not set
# CONFIG_SOC_TI is not set
# CONFIG_SOC_ZTE is not set
# CONFIG_PM_DEVFREQ is not set
# CONFIG_EXTCON is not set
# CONFIG_MEMORY is not set
# CONFIG_IIO is not set
# CONFIG_NTB is not set
# CONFIG_VME_BUS is not set
# CONFIG_PWM is not set
CONFIG_ARM_GIC_MAX_NR=1
# CONFIG_IPACK_BUS is not set
# CONFIG_RESET_CONTROLLER is not set
# CONFIG_FMC is not set

#
# PHY Subsystem
#
# CONFIG_GENERIC_PHY is not set
# CONFIG_BCM_KONA_USB2_PHY is not set
# CONFIG_PHY_PXA_28NM_HSIC is not set
# CONFIG_PHY_PXA_28NM_USB2 is not set
# CONFIG_POWERCAP is not set
# CONFIG_MCB is not set

#
# Performance monitor support
#
CONFIG_RAS=y
# CONFIG_THUNDERBOLT is not set

#
# Android
#
# CONFIG_ANDROID is not set
# CONFIG_LIBNVDIMM is not set
CONFIG_DAX=y
# CONFIG_NVMEM is not set
# CONFIG_STM is not set
# CONFIG_INTEL_TH is not set

#
# FPGA Configuration Support
#
# CONFIG_FPGA is not set

#
# FSI support
#
# CONFIG_FSI is not set
# CONFIG_MULTIPLEXER is not set

#
# Firmware Drivers
#
# CONFIG_EDD is not set
CONFIG_FIRMWARE_MEMMAP=y
# CONFIG_DELL_RBU is not set
# CONFIG_DCDBAS is not set
CONFIG_DMIID=y
# CONFIG_DMI_SYSFS is not set
CONFIG_DMI_SCAN_MACHINE_NON_EFI_FALLBACK=y
# CONFIG_ISCSI_IBFT_FIND is not set
# CONFIG_FW_CFG_SYSFS is not set
# CONFIG_GOOGLE_FIRMWARE is not set

#
# EFI (Extensible Firmware Interface) Support
#
CONFIG_EFI_VARS=y
CONFIG_EFI_ESRT=y
CONFIG_EFI_RUNTIME_MAP=y
# CONFIG_EFI_FAKE_MEMMAP is not set
CONFIG_EFI_RUNTIME_WRAPPERS=y
# CONFIG_EFI_BOOTLOADER_CONTROL is not set
# CONFIG_EFI_CAPSULE_LOADER is not set
# CONFIG_EFI_TEST is not set
# CONFIG_EFI_DEV_PATH_PARSER is not set

#
# Tegra firmware driver
#

#
# File systems
#
CONFIG_DCACHE_WORD_ACCESS=y
# CONFIG_EXT2_FS is not set
# CONFIG_EXT3_FS is not set
CONFIG_EXT4_FS=y
CONFIG_EXT4_USE_FOR_EXT2=y
CONFIG_EXT4_FS_POSIX_ACL=y
CONFIG_EXT4_FS_SECURITY=y
# CONFIG_EXT4_ENCRYPTION is not set
# CONFIG_EXT4_DEBUG is not set
CONFIG_JBD2=y
# CONFIG_JBD2_DEBUG is not set
CONFIG_FS_MBCACHE=y
# CONFIG_REISERFS_FS is not set
# CONFIG_JFS_FS is not set
# CONFIG_XFS_FS is not set
# CONFIG_GFS2_FS is not set
# CONFIG_BTRFS_FS is not set
# CONFIG_NILFS2_FS is not set
# CONFIG_F2FS_FS is not set
# CONFIG_FS_DAX is not set
CONFIG_FS_POSIX_ACL=y
CONFIG_EXPORTFS=y
# CONFIG_EXPORTFS_BLOCK_OPS is not set
CONFIG_FILE_LOCKING=y
CONFIG_MANDATORY_FILE_LOCKING=y
# CONFIG_FS_ENCRYPTION is not set
CONFIG_FSNOTIFY=y
CONFIG_DNOTIFY=y
CONFIG_INOTIFY_USER=y
# CONFIG_FANOTIFY is not set
CONFIG_QUOTA=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
# CONFIG_PRINT_QUOTA_WARNING is not set
# CONFIG_QUOTA_DEBUG is not set
CONFIG_QUOTA_TREE=y
# CONFIG_QFMT_V1 is not set
CONFIG_QFMT_V2=y
CONFIG_QUOTACTL=y
CONFIG_QUOTACTL_COMPAT=y
CONFIG_AUTOFS4_FS=y
# CONFIG_FUSE_FS is not set
# CONFIG_OVERLAY_FS is not set

#
# Caches
#
# CONFIG_FSCACHE is not set

#
# CD-ROM/DVD Filesystems
#
CONFIG_ISO9660_FS=y
CONFIG_JOLIET=y
CONFIG_ZISOFS=y
# CONFIG_UDF_FS is not set

#
# DOS/FAT/NT Filesystems
#
CONFIG_FAT_FS=y
CONFIG_MSDOS_FS=y
CONFIG_VFAT_FS=y
CONFIG_FAT_DEFAULT_CODEPAGE=437
CONFIG_FAT_DEFAULT_IOCHARSET="iso8859-1"
# CONFIG_FAT_DEFAULT_UTF8 is not set
# CONFIG_NTFS_FS is not set

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_VMCORE=y
CONFIG_PROC_SYSCTL=y
CONFIG_PROC_PAGE_MONITOR=y
# CONFIG_PROC_CHILDREN is not set
CONFIG_KERNFS=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_TMPFS_XATTR=y
CONFIG_HUGETLBFS=y
CONFIG_HUGETLB_PAGE=y
CONFIG_ARCH_HAS_GIGANTIC_PAGE=y
# CONFIG_CONFIGFS_FS is not set
CONFIG_EFIVAR_FS=m
CONFIG_MISC_FILESYSTEMS=y
# CONFIG_ORANGEFS_FS is not set
# CONFIG_ADFS_FS is not set
# CONFIG_AFFS_FS is not set
# CONFIG_ECRYPT_FS is not set
# CONFIG_HFS_FS is not set
# CONFIG_HFSPLUS_FS is not set
# CONFIG_BEFS_FS is not set
# CONFIG_BFS_FS is not set
# CONFIG_EFS_FS is not set
# CONFIG_CRAMFS is not set
# CONFIG_SQUASHFS is not set
# CONFIG_VXFS_FS is not set
# CONFIG_MINIX_FS is not set
# CONFIG_OMFS_FS is not set
# CONFIG_HPFS_FS is not set
# CONFIG_QNX4FS_FS is not set
# CONFIG_QNX6FS_FS is not set
# CONFIG_ROMFS_FS is not set
# CONFIG_PSTORE is not set
# CONFIG_SYSV_FS is not set
# CONFIG_UFS_FS is not set
CONFIG_NETWORK_FILESYSTEMS=y
CONFIG_NFS_FS=y
CONFIG_NFS_V2=y
CONFIG_NFS_V3=y
CONFIG_NFS_V3_ACL=y
CONFIG_NFS_V4=y
# CONFIG_NFS_SWAP is not set
# CONFIG_NFS_V4_1 is not set
CONFIG_ROOT_NFS=y
# CONFIG_NFS_USE_LEGACY_DNS is not set
CONFIG_NFS_USE_KERNEL_DNS=y
# CONFIG_NFSD is not set
CONFIG_GRACE_PERIOD=y
CONFIG_LOCKD=y
CONFIG_LOCKD_V4=y
CONFIG_NFS_ACL_SUPPORT=y
CONFIG_NFS_COMMON=y
CONFIG_SUNRPC=y
CONFIG_SUNRPC_GSS=y
# CONFIG_SUNRPC_DEBUG is not set
# CONFIG_CEPH_FS is not set
CONFIG_CIFS=y
# CONFIG_CIFS_STATS is not set
# CONFIG_CIFS_WEAK_PW_HASH is not set
# CONFIG_CIFS_UPCALL is not set
# CONFIG_CIFS_XATTR is not set
CONFIG_CIFS_DEBUG=y
# CONFIG_CIFS_DEBUG2 is not set
# CONFIG_CIFS_DFS_UPCALL is not set
# CONFIG_CIFS_SMB2 is not set
# CONFIG_NCP_FS is not set
# CONFIG_CODA_FS is not set
# CONFIG_AFS_FS is not set
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="utf8"
CONFIG_NLS_CODEPAGE_437=y
# CONFIG_NLS_CODEPAGE_737 is not set
# CONFIG_NLS_CODEPAGE_775 is not set
# CONFIG_NLS_CODEPAGE_850 is not set
# CONFIG_NLS_CODEPAGE_852 is not set
# CONFIG_NLS_CODEPAGE_855 is not set
# CONFIG_NLS_CODEPAGE_857 is not set
# CONFIG_NLS_CODEPAGE_860 is not set
# CONFIG_NLS_CODEPAGE_861 is not set
# CONFIG_NLS_CODEPAGE_862 is not set
# CONFIG_NLS_CODEPAGE_863 is not set
# CONFIG_NLS_CODEPAGE_864 is not set
# CONFIG_NLS_CODEPAGE_865 is not set
# CONFIG_NLS_CODEPAGE_866 is not set
# CONFIG_NLS_CODEPAGE_869 is not set
# CONFIG_NLS_CODEPAGE_936 is not set
# CONFIG_NLS_CODEPAGE_950 is not set
# CONFIG_NLS_CODEPAGE_932 is not set
# CONFIG_NLS_CODEPAGE_949 is not set
# CONFIG_NLS_CODEPAGE_874 is not set
# CONFIG_NLS_ISO8859_8 is not set
# CONFIG_NLS_CODEPAGE_1250 is not set
# CONFIG_NLS_CODEPAGE_1251 is not set
CONFIG_NLS_ASCII=y
CONFIG_NLS_ISO8859_1=y
# CONFIG_NLS_ISO8859_2 is not set
# CONFIG_NLS_ISO8859_3 is not set
# CONFIG_NLS_ISO8859_4 is not set
# CONFIG_NLS_ISO8859_5 is not set
# CONFIG_NLS_ISO8859_6 is not set
# CONFIG_NLS_ISO8859_7 is not set
# CONFIG_NLS_ISO8859_9 is not set
# CONFIG_NLS_ISO8859_13 is not set
# CONFIG_NLS_ISO8859_14 is not set
# CONFIG_NLS_ISO8859_15 is not set
# CONFIG_NLS_KOI8_R is not set
# CONFIG_NLS_KOI8_U is not set
# CONFIG_NLS_MAC_ROMAN is not set
# CONFIG_NLS_MAC_CELTIC is not set
# CONFIG_NLS_MAC_CENTEURO is not set
# CONFIG_NLS_MAC_CROATIAN is not set
# CONFIG_NLS_MAC_CYRILLIC is not set
# CONFIG_NLS_MAC_GAELIC is not set
# CONFIG_NLS_MAC_GREEK is not set
# CONFIG_NLS_MAC_ICELAND is not set
# CONFIG_NLS_MAC_INUIT is not set
# CONFIG_NLS_MAC_ROMANIAN is not set
# CONFIG_NLS_MAC_TURKISH is not set
CONFIG_NLS_UTF8=y

#
# Kernel hacking
#
CONFIG_TRACE_IRQFLAGS_SUPPORT=y

#
# printk and dmesg options
#
CONFIG_PRINTK_TIME=y
CONFIG_CONSOLE_LOGLEVEL_DEFAULT=7
CONFIG_MESSAGE_LOGLEVEL_DEFAULT=4
# CONFIG_DEBUG_SYNCHRO_TEST is not set
# CONFIG_BOOT_PRINTK_DELAY is not set
# CONFIG_DYNAMIC_DEBUG is not set

#
# Compile-time checks and compiler options
#
CONFIG_DEBUG_INFO=y
# CONFIG_DEBUG_INFO_REDUCED is not set
# CONFIG_DEBUG_INFO_SPLIT is not set
# CONFIG_DEBUG_INFO_DWARF4 is not set
# CONFIG_GDB_SCRIPTS is not set
# CONFIG_ENABLE_WARN_DEPRECATED is not set
CONFIG_ENABLE_MUST_CHECK=y
CONFIG_FRAME_WARN=2048
# CONFIG_STRIP_ASM_SYMS is not set
# CONFIG_READABLE_ASM is not set
# CONFIG_UNUSED_SYMBOLS is not set
# CONFIG_PAGE_OWNER is not set
CONFIG_DEBUG_FS=y
# CONFIG_HEADERS_CHECK is not set
# CONFIG_DEBUG_SECTION_MISMATCH is not set
CONFIG_SECTION_MISMATCH_WARN_ONLY=y
CONFIG_ARCH_WANT_FRAME_POINTERS=y
CONFIG_FRAME_POINTER=y
# CONFIG_STACK_VALIDATION is not set
# CONFIG_DEBUG_FORCE_WEAK_PER_CPU is not set
CONFIG_MAGIC_SYSRQ=y
CONFIG_MAGIC_SYSRQ_DEFAULT_ENABLE=0x1
CONFIG_MAGIC_SYSRQ_SERIAL=y
CONFIG_DEBUG_KERNEL=y

#
# Memory Debugging
#
CONFIG_PAGE_EXTENSION=y
CONFIG_DEBUG_PAGEALLOC=y
CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT=y
CONFIG_PAGE_POISONING=y
CONFIG_PAGE_POISONING_NO_SANITY=y
# CONFIG_PAGE_POISONING_ZERO is not set
# CONFIG_DEBUG_PAGE_REF is not set
# CONFIG_DEBUG_RODATA_TEST is not set
CONFIG_DEBUG_OBJECTS=y
# CONFIG_DEBUG_OBJECTS_SELFTEST is not set
# CONFIG_DEBUG_OBJECTS_FREE is not set
# CONFIG_DEBUG_OBJECTS_TIMERS is not set
# CONFIG_DEBUG_OBJECTS_WORK is not set
# CONFIG_DEBUG_OBJECTS_RCU_HEAD is not set
# CONFIG_DEBUG_OBJECTS_PERCPU_COUNTER is not set
CONFIG_DEBUG_OBJECTS_ENABLE_DEFAULT=1
CONFIG_SLUB_DEBUG_ON=y
# CONFIG_SLUB_STATS is not set
CONFIG_HAVE_DEBUG_KMEMLEAK=y
# CONFIG_DEBUG_KMEMLEAK is not set
CONFIG_DEBUG_STACK_USAGE=y
CONFIG_DEBUG_VM=y
# CONFIG_DEBUG_VM_VMACACHE is not set
# CONFIG_DEBUG_VM_RB is not set
# CONFIG_DEBUG_VM_PGFLAGS is not set
CONFIG_ARCH_HAS_DEBUG_VIRTUAL=y
CONFIG_DEBUG_VIRTUAL=y
# CONFIG_DEBUG_MEMORY_INIT is not set
# CONFIG_DEBUG_PER_CPU_MAPS is not set
CONFIG_HAVE_DEBUG_STACKOVERFLOW=y
CONFIG_DEBUG_STACKOVERFLOW=y
CONFIG_HAVE_ARCH_KMEMCHECK=y
CONFIG_HAVE_ARCH_KASAN=y
# CONFIG_KASAN is not set
CONFIG_ARCH_HAS_KCOV=y
# CONFIG_KCOV is not set
# CONFIG_DEBUG_SHIRQ is not set

#
# Debug Lockups and Hangs
#
# CONFIG_SOFTLOCKUP_DETECTOR is not set
# CONFIG_HARDLOCKUP_DETECTOR is not set
# CONFIG_DETECT_HUNG_TASK is not set
# CONFIG_WQ_WATCHDOG is not set
# CONFIG_PANIC_ON_OOPS is not set
CONFIG_PANIC_ON_OOPS_VALUE=0
CONFIG_PANIC_TIMEOUT=0
# CONFIG_SCHED_DEBUG is not set
CONFIG_SCHED_INFO=y
CONFIG_SCHEDSTATS=y
# CONFIG_SCHED_STACK_END_CHECK is not set
# CONFIG_DEBUG_TIMEKEEPING is not set

#
# Lock Debugging (spinlocks, mutexes, etc...)
#
# CONFIG_DEBUG_RT_MUTEXES is not set
# CONFIG_DEBUG_SPINLOCK is not set
# CONFIG_DEBUG_MUTEXES is not set
# CONFIG_DEBUG_WW_MUTEX_SLOWPATH is not set
# CONFIG_DEBUG_LOCK_ALLOC is not set
# CONFIG_PROVE_LOCKING is not set
# CONFIG_LOCK_STAT is not set
# CONFIG_DEBUG_ATOMIC_SLEEP is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
# CONFIG_LOCK_TORTURE_TEST is not set
# CONFIG_WW_MUTEX_SELFTEST is not set
CONFIG_STACKTRACE=y
# CONFIG_WARN_ALL_UNSEEDED_RANDOM is not set
# CONFIG_DEBUG_KOBJECT is not set
CONFIG_DEBUG_BUGVERBOSE=y
CONFIG_DEBUG_LIST=y
# CONFIG_DEBUG_PI_LIST is not set
# CONFIG_DEBUG_SG is not set
# CONFIG_DEBUG_NOTIFIERS is not set
# CONFIG_DEBUG_CREDENTIALS is not set

#
# RCU Debugging
#
# CONFIG_PROVE_RCU is not set
# CONFIG_TORTURE_TEST is not set
# CONFIG_RCU_PERF_TEST is not set
# CONFIG_RCU_TORTURE_TEST is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=21
# CONFIG_RCU_TRACE is not set
# CONFIG_RCU_EQS_DEBUG is not set
# CONFIG_DEBUG_WQ_FORCE_RR_CPU is not set
# CONFIG_DEBUG_BLOCK_EXT_DEVT is not set
# CONFIG_CPU_HOTPLUG_STATE_CONTROL is not set
# CONFIG_NOTIFIER_ERROR_INJECTION is not set
# CONFIG_FAULT_INJECTION is not set
# CONFIG_LATENCYTOP is not set
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_NOP_TRACER=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_HAVE_SYSCALL_TRACEPOINTS=y
CONFIG_HAVE_FENTRY=y
CONFIG_HAVE_C_RECORDMCOUNT=y
CONFIG_TRACE_CLOCK=y
CONFIG_RING_BUFFER=y
CONFIG_EVENT_TRACING=y
CONFIG_CONTEXT_SWITCH_TRACER=y
CONFIG_TRACING=y
CONFIG_GENERIC_TRACER=y
CONFIG_TRACING_SUPPORT=y
CONFIG_FTRACE=y
CONFIG_FUNCTION_TRACER=y
CONFIG_FUNCTION_GRAPH_TRACER=y
# CONFIG_IRQSOFF_TRACER is not set
# CONFIG_SCHED_TRACER is not set
# CONFIG_HWLAT_TRACER is not set
# CONFIG_FTRACE_SYSCALLS is not set
# CONFIG_TRACER_SNAPSHOT is not set
CONFIG_BRANCH_PROFILE_NONE=y
# CONFIG_PROFILE_ANNOTATED_BRANCHES is not set
# CONFIG_PROFILE_ALL_BRANCHES is not set
# CONFIG_STACK_TRACER is not set
CONFIG_BLK_DEV_IO_TRACE=y
# CONFIG_KPROBE_EVENTS is not set
CONFIG_UPROBE_EVENTS=y
CONFIG_PROBE_EVENTS=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_FUNCTION_PROFILER=y
CONFIG_FTRACE_MCOUNT_RECORD=y
# CONFIG_FTRACE_STARTUP_TEST is not set
# CONFIG_MMIOTRACE is not set
# CONFIG_HIST_TRIGGERS is not set
# CONFIG_TRACEPOINT_BENCHMARK is not set
# CONFIG_RING_BUFFER_BENCHMARK is not set
# CONFIG_RING_BUFFER_STARTUP_TEST is not set
# CONFIG_TRACE_EVAL_MAP_FILE is not set

#
# Runtime Testing
#
# CONFIG_LKDTM is not set
# CONFIG_TEST_LIST_SORT is not set
# CONFIG_TEST_SORT is not set
# CONFIG_KPROBES_SANITY_TEST is not set
# CONFIG_BACKTRACE_SELF_TEST is not set
# CONFIG_RBTREE_TEST is not set
# CONFIG_INTERVAL_TREE_TEST is not set
# CONFIG_PERCPU_TEST is not set
# CONFIG_ATOMIC64_SELFTEST is not set
# CONFIG_TEST_HEXDUMP is not set
# CONFIG_TEST_STRING_HELPERS is not set
# CONFIG_TEST_KSTRTOX is not set
# CONFIG_TEST_PRINTF is not set
# CONFIG_TEST_BITMAP is not set
# CONFIG_TEST_UUID is not set
# CONFIG_TEST_RHASHTABLE is not set
# CONFIG_TEST_HASH is not set
CONFIG_PROVIDE_OHCI1394_DMA_INIT=y
# CONFIG_DMA_API_DEBUG is not set
# CONFIG_TEST_LKM is not set
# CONFIG_TEST_USER_COPY is not set
# CONFIG_TEST_BPF is not set
# CONFIG_TEST_FIRMWARE is not set
# CONFIG_TEST_SYSCTL is not set
# CONFIG_TEST_UDELAY is not set
# CONFIG_MEMTEST is not set
# CONFIG_TEST_STATIC_KEYS is not set
# CONFIG_BUG_ON_DATA_CORRUPTION is not set
# CONFIG_SAMPLES is not set
CONFIG_HAVE_ARCH_KGDB=y
CONFIG_KGDB=y
CONFIG_KGDB_SERIAL_CONSOLE=m
CONFIG_KGDB_TESTS=y
# CONFIG_KGDB_TESTS_ON_BOOT is not set
CONFIG_KGDB_LOW_LEVEL_TRAP=y
# CONFIG_KGDB_KDB is not set
CONFIG_ARCH_HAS_UBSAN_SANITIZE_ALL=y
# CONFIG_ARCH_WANTS_UBSAN_NO_NULL is not set
# CONFIG_UBSAN is not set
CONFIG_ARCH_HAS_DEVMEM_IS_ALLOWED=y
# CONFIG_STRICT_DEVMEM is not set
CONFIG_EARLY_PRINTK_USB=y
CONFIG_X86_VERBOSE_BOOTUP=y
CONFIG_EARLY_PRINTK=y
CONFIG_EARLY_PRINTK_DBGP=y
# CONFIG_EARLY_PRINTK_EFI is not set
# CONFIG_EARLY_PRINTK_USB_XDBC is not set
# CONFIG_X86_PTDUMP_CORE is not set
# CONFIG_X86_PTDUMP is not set
# CONFIG_EFI_PGT_DUMP is not set
# CONFIG_DEBUG_WX is not set
CONFIG_DOUBLEFAULT=y
# CONFIG_DEBUG_TLBFLUSH is not set
# CONFIG_IOMMU_STRESS is not set
CONFIG_HAVE_MMIOTRACE_SUPPORT=y
# CONFIG_X86_DECODER_SELFTEST is not set
CONFIG_IO_DELAY_TYPE_0X80=0
CONFIG_IO_DELAY_TYPE_0XED=1
CONFIG_IO_DELAY_TYPE_UDELAY=2
CONFIG_IO_DELAY_TYPE_NONE=3
CONFIG_IO_DELAY_0X80=y
# CONFIG_IO_DELAY_0XED is not set
# CONFIG_IO_DELAY_UDELAY is not set
# CONFIG_IO_DELAY_NONE is not set
CONFIG_DEFAULT_IO_DELAY_TYPE=0
CONFIG_DEBUG_BOOT_PARAMS=y
# CONFIG_CPA_DEBUG is not set
CONFIG_OPTIMIZE_INLINING=y
# CONFIG_DEBUG_ENTRY is not set
# CONFIG_DEBUG_NMI_SELFTEST is not set
CONFIG_X86_DEBUG_FPU=y
# CONFIG_PUNIT_ATOM_DEBUG is not set

#
# Security options
#
CONFIG_KEYS=y
CONFIG_KEYS_COMPAT=y
# CONFIG_PERSISTENT_KEYRINGS is not set
# CONFIG_BIG_KEYS is not set
# CONFIG_ENCRYPTED_KEYS is not set
# CONFIG_KEY_DH_OPERATIONS is not set
# CONFIG_SECURITY_DMESG_RESTRICT is not set
CONFIG_SECURITY=y
CONFIG_SECURITY_WRITABLE_HOOKS=y
# CONFIG_SECURITYFS is not set
CONFIG_SECURITY_NETWORK=y
# CONFIG_SECURITY_NETWORK_XFRM is not set
# CONFIG_SECURITY_PATH is not set
# CONFIG_INTEL_TXT is not set
CONFIG_LSM_MMAP_MIN_ADDR=65536
CONFIG_HAVE_HARDENED_USERCOPY_ALLOCATOR=y
# CONFIG_HARDENED_USERCOPY is not set
# CONFIG_FORTIFY_SOURCE is not set
# CONFIG_STATIC_USERMODEHELPER is not set
CONFIG_SECURITY_SELINUX=y
CONFIG_SECURITY_SELINUX_BOOTPARAM=y
CONFIG_SECURITY_SELINUX_BOOTPARAM_VALUE=1
CONFIG_SECURITY_SELINUX_DISABLE=y
CONFIG_SECURITY_SELINUX_DEVELOP=y
CONFIG_SECURITY_SELINUX_AVC_STATS=y
CONFIG_SECURITY_SELINUX_CHECKREQPROT_VALUE=0
# CONFIG_SECURITY_SMACK is not set
# CONFIG_SECURITY_TOMOYO is not set
# CONFIG_SECURITY_APPARMOR is not set
# CONFIG_SECURITY_LOADPIN is not set
# CONFIG_SECURITY_YAMA is not set
CONFIG_INTEGRITY=y
# CONFIG_INTEGRITY_SIGNATURE is not set
CONFIG_INTEGRITY_AUDIT=y
# CONFIG_IMA is not set
# CONFIG_EVM is not set
CONFIG_DEFAULT_SECURITY_SELINUX=y
# CONFIG_DEFAULT_SECURITY_DAC is not set
CONFIG_DEFAULT_SECURITY="selinux"
CONFIG_CRYPTO=y

#
# Crypto core or helper
#
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_ALGAPI2=y
CONFIG_CRYPTO_AEAD=y
CONFIG_CRYPTO_AEAD2=y
CONFIG_CRYPTO_BLKCIPHER=y
CONFIG_CRYPTO_BLKCIPHER2=y
CONFIG_CRYPTO_HASH=y
CONFIG_CRYPTO_HASH2=y
CONFIG_CRYPTO_RNG=y
CONFIG_CRYPTO_RNG2=y
CONFIG_CRYPTO_RNG_DEFAULT=y
CONFIG_CRYPTO_AKCIPHER2=y
CONFIG_CRYPTO_KPP2=y
CONFIG_CRYPTO_ACOMP2=y
# CONFIG_CRYPTO_RSA is not set
# CONFIG_CRYPTO_DH is not set
# CONFIG_CRYPTO_ECDH is not set
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_MANAGER2=y
# CONFIG_CRYPTO_USER is not set
CONFIG_CRYPTO_MANAGER_DISABLE_TESTS=y
CONFIG_CRYPTO_GF128MUL=y
CONFIG_CRYPTO_NULL=y
CONFIG_CRYPTO_NULL2=y
# CONFIG_CRYPTO_PCRYPT is not set
CONFIG_CRYPTO_WORKQUEUE=y
# CONFIG_CRYPTO_CRYPTD is not set
# CONFIG_CRYPTO_MCRYPTD is not set
CONFIG_CRYPTO_AUTHENC=y
# CONFIG_CRYPTO_TEST is not set

#
# Authenticated Encryption with Associated Data
#
CONFIG_CRYPTO_CCM=y
CONFIG_CRYPTO_GCM=y
# CONFIG_CRYPTO_CHACHA20POLY1305 is not set
CONFIG_CRYPTO_SEQIV=y
CONFIG_CRYPTO_ECHAINIV=y

#
# Block modes
#
CONFIG_CRYPTO_CBC=y
CONFIG_CRYPTO_CTR=y
# CONFIG_CRYPTO_CTS is not set
CONFIG_CRYPTO_ECB=y
# CONFIG_CRYPTO_LRW is not set
# CONFIG_CRYPTO_PCBC is not set
# CONFIG_CRYPTO_XTS is not set
# CONFIG_CRYPTO_KEYWRAP is not set

#
# Hash modes
#
CONFIG_CRYPTO_CMAC=y
CONFIG_CRYPTO_HMAC=y
# CONFIG_CRYPTO_XCBC is not set
# CONFIG_CRYPTO_VMAC is not set

#
# Digest
#
CONFIG_CRYPTO_CRC32C=y
# CONFIG_CRYPTO_CRC32C_INTEL is not set
# CONFIG_CRYPTO_CRC32 is not set
# CONFIG_CRYPTO_CRC32_PCLMUL is not set
# CONFIG_CRYPTO_CRCT10DIF is not set
CONFIG_CRYPTO_GHASH=y
# CONFIG_CRYPTO_POLY1305 is not set
# CONFIG_CRYPTO_POLY1305_X86_64 is not set
CONFIG_CRYPTO_MD4=y
CONFIG_CRYPTO_MD5=y
# CONFIG_CRYPTO_MICHAEL_MIC is not set
# CONFIG_CRYPTO_RMD128 is not set
# CONFIG_CRYPTO_RMD160 is not set
# CONFIG_CRYPTO_RMD256 is not set
# CONFIG_CRYPTO_RMD320 is not set
CONFIG_CRYPTO_SHA1=y
# CONFIG_CRYPTO_SHA1_SSSE3 is not set
# CONFIG_CRYPTO_SHA256_SSSE3 is not set
# CONFIG_CRYPTO_SHA512_SSSE3 is not set
# CONFIG_CRYPTO_SHA1_MB is not set
# CONFIG_CRYPTO_SHA256_MB is not set
# CONFIG_CRYPTO_SHA512_MB is not set
CONFIG_CRYPTO_SHA256=y
# CONFIG_CRYPTO_SHA512 is not set
# CONFIG_CRYPTO_SHA3 is not set
# CONFIG_CRYPTO_TGR192 is not set
# CONFIG_CRYPTO_WP512 is not set
# CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL is not set

#
# Ciphers
#
CONFIG_CRYPTO_AES=y
# CONFIG_CRYPTO_AES_TI is not set
# CONFIG_CRYPTO_AES_X86_64 is not set
# CONFIG_CRYPTO_AES_NI_INTEL is not set
# CONFIG_CRYPTO_ANUBIS is not set
CONFIG_CRYPTO_ARC4=y
# CONFIG_CRYPTO_BLOWFISH is not set
# CONFIG_CRYPTO_BLOWFISH_X86_64 is not set
# CONFIG_CRYPTO_CAMELLIA is not set
# CONFIG_CRYPTO_CAMELLIA_X86_64 is not set
# CONFIG_CRYPTO_CAMELLIA_AESNI_AVX_X86_64 is not set
# CONFIG_CRYPTO_CAMELLIA_AESNI_AVX2_X86_64 is not set
# CONFIG_CRYPTO_CAST5 is not set
# CONFIG_CRYPTO_CAST5_AVX_X86_64 is not set
# CONFIG_CRYPTO_CAST6 is not set
# CONFIG_CRYPTO_CAST6_AVX_X86_64 is not set
CONFIG_CRYPTO_DES=y
# CONFIG_CRYPTO_DES3_EDE_X86_64 is not set
# CONFIG_CRYPTO_FCRYPT is not set
# CONFIG_CRYPTO_KHAZAD is not set
# CONFIG_CRYPTO_SALSA20 is not set
# CONFIG_CRYPTO_SALSA20_X86_64 is not set
# CONFIG_CRYPTO_CHACHA20 is not set
# CONFIG_CRYPTO_CHACHA20_X86_64 is not set
# CONFIG_CRYPTO_SEED is not set
# CONFIG_CRYPTO_SERPENT is not set
# CONFIG_CRYPTO_SERPENT_SSE2_X86_64 is not set
# CONFIG_CRYPTO_SERPENT_AVX_X86_64 is not set
# CONFIG_CRYPTO_SERPENT_AVX2_X86_64 is not set
# CONFIG_CRYPTO_TEA is not set
# CONFIG_CRYPTO_TWOFISH is not set
# CONFIG_CRYPTO_TWOFISH_X86_64 is not set
# CONFIG_CRYPTO_TWOFISH_X86_64_3WAY is not set
# CONFIG_CRYPTO_TWOFISH_AVX_X86_64 is not set

#
# Compression
#
# CONFIG_CRYPTO_DEFLATE is not set
# CONFIG_CRYPTO_LZO is not set
# CONFIG_CRYPTO_842 is not set
# CONFIG_CRYPTO_LZ4 is not set
# CONFIG_CRYPTO_LZ4HC is not set

#
# Random Number Generation
#
# CONFIG_CRYPTO_ANSI_CPRNG is not set
CONFIG_CRYPTO_DRBG_MENU=y
CONFIG_CRYPTO_DRBG_HMAC=y
# CONFIG_CRYPTO_DRBG_HASH is not set
# CONFIG_CRYPTO_DRBG_CTR is not set
CONFIG_CRYPTO_DRBG=y
CONFIG_CRYPTO_JITTERENTROPY=y
# CONFIG_CRYPTO_USER_API_HASH is not set
# CONFIG_CRYPTO_USER_API_SKCIPHER is not set
# CONFIG_CRYPTO_USER_API_RNG is not set
# CONFIG_CRYPTO_USER_API_AEAD is not set
CONFIG_CRYPTO_HW=y
# CONFIG_CRYPTO_DEV_PADLOCK is not set
# CONFIG_CRYPTO_DEV_FSL_CAAM_CRYPTO_API_DESC is not set
# CONFIG_CRYPTO_DEV_CCP is not set
# CONFIG_CRYPTO_DEV_QAT_DH895xCC is not set
# CONFIG_CRYPTO_DEV_QAT_C3XXX is not set
# CONFIG_CRYPTO_DEV_QAT_C62X is not set
# CONFIG_CRYPTO_DEV_QAT_DH895xCCVF is not set
# CONFIG_CRYPTO_DEV_QAT_C3XXXVF is not set
# CONFIG_CRYPTO_DEV_QAT_C62XVF is not set
CONFIG_CRYPTO_DEV_NITROX=m
CONFIG_CRYPTO_DEV_NITROX_CNN55XX=m
# CONFIG_ASYMMETRIC_KEY_TYPE is not set

#
# Certificates for signature checking
#
# CONFIG_SYSTEM_BLACKLIST_KEYRING is not set
CONFIG_HAVE_KVM=y
CONFIG_VIRTUALIZATION=y
# CONFIG_KVM is not set
# CONFIG_VHOST_NET is not set
# CONFIG_VHOST_CROSS_ENDIAN_LEGACY is not set
CONFIG_BINARY_PRINTF=y

#
# Library routines
#
CONFIG_BITREVERSE=y
# CONFIG_HAVE_ARCH_BITREVERSE is not set
CONFIG_RATIONAL=y
CONFIG_GENERIC_STRNCPY_FROM_USER=y
CONFIG_GENERIC_STRNLEN_USER=y
CONFIG_GENERIC_NET_UTILS=y
CONFIG_GENERIC_FIND_FIRST_BIT=y
CONFIG_GENERIC_PCI_IOMAP=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_IO=y
CONFIG_ARCH_USE_CMPXCHG_LOCKREF=y
CONFIG_ARCH_HAS_FAST_MULTIPLIER=y
CONFIG_CRC_CCITT=y
CONFIG_CRC16=y
# CONFIG_CRC_T10DIF is not set
# CONFIG_CRC_ITU_T is not set
CONFIG_CRC32=y
# CONFIG_CRC32_SELFTEST is not set
CONFIG_CRC32_SLICEBY8=y
# CONFIG_CRC32_SLICEBY4 is not set
# CONFIG_CRC32_SARWATE is not set
# CONFIG_CRC32_BIT is not set
# CONFIG_CRC4 is not set
# CONFIG_CRC7 is not set
# CONFIG_LIBCRC32C is not set
# CONFIG_CRC8 is not set
# CONFIG_AUDIT_ARCH_COMPAT_GENERIC is not set
# CONFIG_RANDOM32_SELFTEST is not set
CONFIG_ZLIB_INFLATE=y
CONFIG_ZLIB_DEFLATE=y
CONFIG_LZO_COMPRESS=y
CONFIG_LZO_DECOMPRESS=y
CONFIG_LZ4_DECOMPRESS=y
CONFIG_XZ_DEC=y
CONFIG_XZ_DEC_X86=y
CONFIG_XZ_DEC_POWERPC=y
CONFIG_XZ_DEC_IA64=y
CONFIG_XZ_DEC_ARM=y
CONFIG_XZ_DEC_ARMTHUMB=y
CONFIG_XZ_DEC_SPARC=y
CONFIG_XZ_DEC_BCJ=y
# CONFIG_XZ_DEC_TEST is not set
CONFIG_DECOMPRESS_GZIP=y
CONFIG_DECOMPRESS_BZIP2=y
CONFIG_DECOMPRESS_LZMA=y
CONFIG_DECOMPRESS_XZ=y
CONFIG_DECOMPRESS_LZO=y
CONFIG_DECOMPRESS_LZ4=y
CONFIG_GENERIC_ALLOCATOR=y
CONFIG_INTERVAL_TREE=y
CONFIG_ASSOCIATIVE_ARRAY=y
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT_MAP=y
CONFIG_HAS_DMA=y
# CONFIG_DMA_NOOP_OPS is not set
# CONFIG_DMA_VIRT_OPS is not set
CONFIG_CHECK_SIGNATURE=y
CONFIG_CPU_RMAP=y
CONFIG_DQL=y
CONFIG_GLOB=y
# CONFIG_GLOB_SELFTEST is not set
CONFIG_NLATTR=y
# CONFIG_CORDIC is not set
# CONFIG_DDR is not set
# CONFIG_IRQ_POLL is not set
CONFIG_OID_REGISTRY=y
CONFIG_UCS2_STRING=y
CONFIG_FONT_SUPPORT=y
# CONFIG_FONTS is not set
CONFIG_FONT_8x8=y
CONFIG_FONT_8x16=y
# CONFIG_SG_SPLIT is not set
CONFIG_SG_POOL=y
CONFIG_ARCH_HAS_SG_CHAIN=y
CONFIG_ARCH_HAS_PMEM_API=y
CONFIG_ARCH_HAS_MMIO_FLUSH=y
CONFIG_SBITMAP=y

[-- Attachment #3: kernel.log --]
[-- Type: text/plain, Size: 8602 bytes --]

[  189.489920] hmm_dmirror loaded THIS IS A DANGEROUS MODULE !!!
[  193.506260] DEVICE PAGE 27422 27422 (0)
[  194.132757] DEVICE PAGE 76509 76509 (0)
[  194.755164] DEVICE PAGE 104522 104522 (0)
[  220.658518] sysrq: SysRq : Show Blocked State
[  220.658598]   task                        PC stack   pid father
[  220.658601] rcu_sched       D15024     8      2 0x00000000
[  220.658609] Call Trace:
[  220.658616]  __schedule+0x20b/0x6c0
[  220.658619]  schedule+0x36/0x80
[  220.658625]  rcu_gp_kthread+0x74/0x770
[  220.658630]  kthread+0x109/0x140
[  220.658633]  ? force_qs_rnp+0x180/0x180
[  220.658636]  ? kthread_park+0x60/0x60
[  220.658640]  ret_from_fork+0x22/0x30
[  220.658642] rcu_bh          D15424     9      2 0x00000000
[  220.658648] Call Trace:
[  220.658651]  __schedule+0x20b/0x6c0
[  220.658653]  schedule+0x36/0x80
[  220.658657]  rcu_gp_kthread+0x74/0x770
[  220.658661]  kthread+0x109/0x140
[  220.658664]  ? force_qs_rnp+0x180/0x180
[  220.658667]  ? kthread_park+0x60/0x60
[  220.658670]  ret_from_fork+0x22/0x30
[  220.658698] sanity_rmem004  D13528  3948   3761 0x00000000
[  220.658705] Call Trace:
[  220.658707]  __schedule+0x20b/0x6c0
[  220.658709]  schedule+0x36/0x80
[  220.658714]  io_schedule+0x16/0x40
[  220.658719]  __lock_page+0xf2/0x130
[  220.658722]  ? page_cache_tree_insert+0x90/0x90
[  220.658726]  migrate_vma+0x48a/0xee0
[  220.658733]  dummy_migrate.isra.10+0xd9/0x110 [hmm_dmirror]
[  220.658741]  ? avc_has_extended_perms+0x8e/0x480
[  220.658745]  dummy_fops_unlocked_ioctl+0x1e8/0x330 [hmm_dmirror]
[  220.658747]  ? _cond_resched+0x19/0x30
[  220.658752]  ? selinux_file_ioctl+0x114/0x1e0
[  220.658755]  do_vfs_ioctl+0x96/0x5a0
[  220.658758]  SyS_ioctl+0x79/0x90
[  220.658762]  entry_SYSCALL_64_fastpath+0x13/0x94
[  220.658764] RIP: 0033:0x7f098cf4f1e7
[  220.658766] RSP: 002b:00007f098ae58d78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  220.658768] RAX: ffffffffffffffda RBX: 00007f098ae58df0 RCX: 00007f098cf4f1e7
[  220.658770] RDX: 00007f098ae58df0 RSI: 00000000c0104802 RDI: 0000000000000003
[  220.658771] RBP: 00007fffd23e52d0 R08: 0000000000000000 R09: 00007f098ae59700
[  220.658772] R10: 00007f098ae599d0 R11: 0000000000000246 R12: 00007f098ae58df8
[  220.658773] R13: 0000000000000000 R14: 0000000000000010 R15: 00007f098ae59700
[  220.658776] sanity_rmem004  D13528  3949   3761 0x00000000
[  220.658781] Call Trace:
[  220.658784]  __schedule+0x20b/0x6c0
[  220.658786]  schedule+0x36/0x80
[  220.658789]  io_schedule+0x16/0x40
[  220.658793]  __lock_page+0xf2/0x130
[  220.658796]  ? page_cache_tree_insert+0x90/0x90
[  220.658799]  migrate_vma+0x48a/0xee0
[  220.658803]  dummy_migrate.isra.10+0xd9/0x110 [hmm_dmirror]
[  220.658811]  dummy_fops_unlocked_ioctl+0x1e8/0x330 [hmm_dmirror]
[  220.658813]  ? _cond_resched+0x19/0x30
[  220.658817]  ? selinux_file_ioctl+0x114/0x1e0
[  220.658819]  do_vfs_ioctl+0x96/0x5a0
[  220.658823]  SyS_ioctl+0x79/0x90
[  220.658826]  entry_SYSCALL_64_fastpath+0x13/0x94
[  220.658827] RIP: 0033:0x7f098cf4f1e7
[  220.658828] RSP: 002b:00007f098b659d78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  220.658831] RAX: ffffffffffffffda RBX: 00007f098b659df0 RCX: 00007f098cf4f1e7
[  220.658832] RDX: 00007f098b659df0 RSI: 00000000c0104802 RDI: 0000000000000003
[  220.658833] RBP: 00007fffd23e52d0 R08: 0000000000000000 R09: 00007f098b65a700
[  220.658834] R10: 00007f098b65a9d0 R11: 0000000000000246 R12: 00007f098b659df8
[  220.658835] R13: 0000000000000000 R14: 0000000000000010 R15: 00007f098b65a700
[  220.658837] sanity_rmem004  D13528  3950   3761 0x00000000
[  220.658843] Call Trace:
[  220.658845]  __schedule+0x20b/0x6c0
[  220.658847]  schedule+0x36/0x80
[  220.658850]  io_schedule+0x16/0x40
[  220.658854]  __lock_page+0xf2/0x130
[  220.658857]  ? page_cache_tree_insert+0x90/0x90
[  220.658860]  migrate_vma+0x48a/0xee0
[  220.658863]  dummy_migrate.isra.10+0xd9/0x110 [hmm_dmirror]
[  220.658871]  dummy_fops_unlocked_ioctl+0x1e8/0x330 [hmm_dmirror]
[  220.658873]  ? _cond_resched+0x19/0x30
[  220.658877]  ? selinux_file_ioctl+0x114/0x1e0
[  220.658880]  do_vfs_ioctl+0x96/0x5a0
[  220.658883]  SyS_ioctl+0x79/0x90
[  220.658886]  entry_SYSCALL_64_fastpath+0x13/0x94
[  220.658887] RIP: 0033:0x7f098cf4f1e7
[  220.658888] RSP: 002b:00007f098be5ad78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  220.658890] RAX: ffffffffffffffda RBX: 0000000000000039 RCX: 00007f098cf4f1e7
[  220.658891] RDX: 00007f098be5adf0 RSI: 00000000c0104802 RDI: 0000000000000003
[  220.658892] RBP: 00007fffd23e52d0 R08: 0000000000000000 R09: 00007f098be5b700
[  220.658893] R10: 00007f098be5b9d0 R11: 0000000000000246 R12: 0000000000000000
[  220.658895] R13: 0000000000000000 R14: 00007f098be5b9c0 R15: 00007f098be5b700
[  220.658897] sanity_rmem004  D13000  3951   3761 0x00000000
[  220.658903] Call Trace:
[  220.658905]  __schedule+0x20b/0x6c0
[  220.658910]  ? release_pages+0x279/0x380
[  220.658911]  schedule+0x36/0x80
[  220.658915]  io_schedule+0x16/0x40
[  220.658919]  __lock_page+0xf2/0x130
[  220.658922]  ? page_cache_tree_insert+0x90/0x90
[  220.658924]  migrate_vma+0x48a/0xee0
[  220.658929]  ? __kernel_map_pages+0x70/0xe0
[  220.658933]  hmm_devmem_fault_range+0x33/0x70 [hmm_dmirror]
[  220.658936]  dummy_devmem_fault+0x54/0x60 [hmm_dmirror]
[  220.658941]  hmm_devmem_fault+0x2e/0x30
[  220.658945]  device_private_entry_fault+0x2d/0x30
[  220.658947]  do_swap_page+0x490/0x4e0
[  220.658951]  ? __change_page_attr_set_clr+0xa80/0xc90
[  220.658953]  __handle_mm_fault+0x347/0x9d0
[  220.658957]  handle_mm_fault+0x88/0x150
[  220.658959]  hmm_vma_walk_pmd+0x2fe/0x3e0
[  220.658963]  __walk_page_range+0x1e8/0x420
[  220.658967]  walk_page_range+0x73/0xf0
[  220.658969]  hmm_vma_fault+0x180/0x260
[  220.658971]  ? hmm_vma_walk_hole+0xd0/0xd0
[  220.658973]  ? hmm_vma_get_pfns+0x1b0/0x1b0
[  220.658976]  dummy_fault+0xda/0x1f0 [hmm_dmirror]
[  220.658980]  ? __kernel_map_pages+0x70/0xe0
[  220.658985]  ? __alloc_pages_nodemask+0x11b/0x240
[  220.658988]  ? dummy_pt_walk+0x209/0x2f0 [hmm_dmirror]
[  220.658991]  ? dummy_update+0x60/0x60 [hmm_dmirror]
[  220.658994]  dummy_fops_unlocked_ioctl+0x12c/0x330 [hmm_dmirror]
[  220.658997]  do_vfs_ioctl+0x96/0x5a0
[  220.659001]  SyS_ioctl+0x79/0x90
[  220.659004]  entry_SYSCALL_64_fastpath+0x13/0x94
[  220.659005] RIP: 0033:0x7f098cf4f1e7
[  220.659006] RSP: 002b:00007f098ce5cc38 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  220.659008] RAX: ffffffffffffffda RBX: 0000000000000010 RCX: 00007f098cf4f1e7
[  220.659009] RDX: 00007f098ce5ccd0 RSI: 00000000c0284800 RDI: 0000000000000003
[  220.659010] RBP: 0000000000000000 R08: 00007f098ce5cef0 R09: 0000000000000000
[  220.659011] R10: 00007fffd23e52d0 R11: 0000000000000246 R12: 00000000000000c8
[  220.659012] R13: 0000000000000000 R14: 00007f098ce5ccd0 R15: 0000000000000000
[  220.659014] sanity_rmem004  D13048  3952   3761 0x00000000
[  220.659019] Call Trace:
[  220.659022]  __schedule+0x20b/0x6c0
[  220.659023]  schedule+0x36/0x80
[  220.659027]  io_schedule+0x16/0x40
[  220.659031]  wait_on_page_bit+0xee/0x120
[  220.659034]  ? page_cache_tree_insert+0x90/0x90
[  220.659037]  __migration_entry_wait+0xe8/0x190
[  220.659040]  migration_entry_wait+0x5f/0x70
[  220.659042]  do_swap_page+0x4c7/0x4e0
[  220.659044]  __handle_mm_fault+0x347/0x9d0
[  220.659048]  handle_mm_fault+0x88/0x150
[  220.659049]  hmm_vma_walk_pmd+0x2fe/0x3e0
[  220.659053]  __walk_page_range+0x1e8/0x420
[  220.659057]  walk_page_range+0x73/0xf0
[  220.659059]  hmm_vma_fault+0x180/0x260
[  220.659061]  ? hmm_vma_walk_hole+0xd0/0xd0
[  220.659063]  ? hmm_vma_get_pfns+0x1b0/0x1b0
[  220.659066]  dummy_fault+0xda/0x1f0 [hmm_dmirror]
[  220.659070]  ? __kernel_map_pages+0x70/0xe0
[  220.659074]  ? __alloc_pages_nodemask+0x11b/0x240
[  220.659078]  ? dummy_pt_walk+0x209/0x2f0 [hmm_dmirror]
[  220.659080]  ? dummy_update+0x60/0x60 [hmm_dmirror]
[  220.659083]  dummy_fops_unlocked_ioctl+0x12c/0x330 [hmm_dmirror]
[  220.659087]  do_vfs_ioctl+0x96/0x5a0
[  220.659090]  SyS_ioctl+0x79/0x90
[  220.659093]  entry_SYSCALL_64_fastpath+0x13/0x94
[  220.659094] RIP: 0033:0x7f098cf4f1e7
[  220.659095] RSP: 002b:00007f098c65bc38 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  220.659097] RAX: ffffffffffffffda RBX: 00007f098c65c700 RCX: 00007f098cf4f1e7
[  220.659098] RDX: 00007f098c65bcd0 RSI: 00000000c0284800 RDI: 0000000000000003
[  220.659099] RBP: 00007fffd23e5210 R08: 00007f098c65bef0 R09: 00007f098c65c700
[  220.659100] R10: 00007fffd23e52d0 R11: 0000000000000246 R12: 0000000000000000
[  220.659102] R13: 0000000000000000 R14: 00007f098c65c9c0 R15: 00007f098c65c700

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM 12/15] mm/migrate: new memory migration helper for use with device memory v4
  2017-07-11 19:35                 ` Evgeny Baskakov
@ 2017-07-13 20:16                   ` Jerome Glisse
  2017-07-14  5:32                     ` Evgeny Baskakov
  2017-07-14 19:43                     ` Evgeny Baskakov
  0 siblings, 2 replies; 66+ messages in thread
From: Jerome Glisse @ 2017-07-13 20:16 UTC (permalink / raw)
  To: Evgeny Baskakov
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, David Nellans,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

On Tue, Jul 11, 2017 at 12:35:03PM -0700, Evgeny Baskakov wrote:
> On 7/11/17 11:49 AM, Jerome Glisse wrote:
> 
> > 
> > What are the symptoms? Does the program just stop making any progress, and you
> > trigger a sysrq to dump the current state of each thread? In this
> > log I don't see migration_entry_wait() anymore, but it seems to be waiting
> > on the page lock, so there might be two issues here.
> > 
> > Jerome
> 
> That is correct, the program is not making any progress.
> 
> The stack traces in the kernel log are produced by a "sysrq w" (blocked
> tasks) command.
> 

I updated hmm-next with a patch that might fix some other issues, but I am
still trying to track down this deadlock you are seeing. Does it happen quickly
with the test program?

I can't see how it could deadlock on the page lock bit. Going over all the
code paths again and again, we always unlock the page once we are done or when
we back off from migration. So far I haven't been able to reproduce it, though
I haven't had much time to test as other things kept me busy. I should be back
looking into that tomorrow.

https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-next

Cheers,
Jerome


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM 12/15] mm/migrate: new memory migration helper for use with device memory v4
  2017-07-11 18:49               ` Jerome Glisse
@ 2017-07-11 19:35                 ` Evgeny Baskakov
  2017-07-13 20:16                   ` Jerome Glisse
  0 siblings, 1 reply; 66+ messages in thread
From: Evgeny Baskakov @ 2017-07-11 19:35 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, David Nellans,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

On 7/11/17 11:49 AM, Jerome Glisse wrote:

>
> What are the symptoms? Does the program just stop making any progress, and you
> trigger a sysrq to dump the current state of each thread? In this
> log I don't see migration_entry_wait() anymore, but it seems to be waiting
> on the page lock, so there might be two issues here.
>
> Jérôme

That is correct, the program is not making any progress.

The stack traces in the kernel log are produced by a "sysrq w" (blocked 
tasks) command.
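
For reference, a minimal sketch of how the same dump can be requested
programmatically -- plain C, assuming it is run as root and that magic SysRq
is built into the kernel (as in the attached config); the "Show Blocked State"
report then appears in the kernel log:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Writing 'w' to /proc/sysrq-trigger asks the kernel to print the
     * "Show Blocked State" report for all uninterruptible (D-state) tasks,
     * which is what the attached kernel.log traces were captured with. */
    int fd = open("/proc/sysrq-trigger", O_WRONLY);

    if (fd < 0) {
        perror("open /proc/sysrq-trigger");
        return 1;
    }
    if (write(fd, "w", 1) != 1)
        perror("write");
    close(fd);
    return 0;
}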

Thanks,

-- 
Evgeny Baskakov
NVIDIA


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM 12/15] mm/migrate: new memory migration helper for use with device memory v4
  2017-07-11 18:42             ` Evgeny Baskakov
@ 2017-07-11 18:49               ` Jerome Glisse
  2017-07-11 19:35                 ` Evgeny Baskakov
  0 siblings, 1 reply; 66+ messages in thread
From: Jerome Glisse @ 2017-07-11 18:49 UTC (permalink / raw)
  To: Evgeny Baskakov
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, David Nellans,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

On Tue, Jul 11, 2017 at 11:42:20AM -0700, Evgeny Baskakov wrote:
> On 7/11/17 11:29 AM, Jerome Glisse wrote:
> > Can you test whether the attached patch helps? I am having trouble
> > reproducing this from inside a VM.
> > 
> > My theory is that two concurrent CPU page faults happen. The first one
> > manages to start the migration back to system memory, but the second one
> > sees the migration special entry and calls migration_entry_wait(), which
> > increases the page refcount; this happens before the first one checks
> > that the page refcount is ok for migration.
> > 
> > For regular migration such a scenario is ok: the migration bails out,
> > and because the page is CPU accessible there is no need to kick the
> > migration again for the other thread that CPU-faulted to migrate.
> > 
> > I am looking into how I can change migration_entry_wait() so that it
> > does not take a refcount on pages. Let me know if the attached patch helps.
> > 
> > Thank you
> > Jerome
> 
> Hi Jerome,
> 
> Thanks for the update.
> 
> Unfortunately, the patch does not help. I just applied it and recompiled the
> kernel. Please find attached a new kernel log and an app log.
> 

What are the symptoms? Does the program just stop making any progress, and you
trigger a sysrq to dump the current state of each thread? In this
log I don't see migration_entry_wait() anymore, but it seems to be waiting
on the page lock, so there might be two issues here.

Jerome


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM 12/15] mm/migrate: new memory migration helper for use with device memory v4
  2017-07-11 18:29           ` Jerome Glisse
@ 2017-07-11 18:42             ` Evgeny Baskakov
  2017-07-11 18:49               ` Jerome Glisse
  0 siblings, 1 reply; 66+ messages in thread
From: Evgeny Baskakov @ 2017-07-11 18:42 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, David Nellans,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

[-- Attachment #1: Type: text/plain, Size: 1011 bytes --]

On 7/11/17 11:29 AM, Jerome Glisse wrote:
> Can you test whether the attached patch helps? I am having trouble
> reproducing this from inside a VM.
>
> My theory is that two concurrent CPU page faults happen. The first one
> manages to start the migration back to system memory, but the second one
> sees the migration special entry and calls migration_entry_wait(), which
> increases the page refcount; this happens before the first one checks that
> the page refcount is ok for migration.
>
> For regular migration such a scenario is ok: the migration bails out,
> and because the page is CPU accessible there is no need to kick the
> migration again for the other thread that CPU-faulted to migrate.
>
> I am looking into how I can change migration_entry_wait() so that it does
> not take a refcount on pages. Let me know if the attached patch helps.
>
> Thank you
> Jerome

Hi Jerome,

Thanks for the update.

Unfortunately, the patch does not help. I just applied it and recompiled 
the kernel. Please find attached a new kernel log and an app log.

-- 
Evgeny Baskakov
NVIDIA


[-- Attachment #2: test.log --]
[-- Type: text/plain, Size: 5339 bytes --]

sanity_rmem004_repeated_faults_threaded$ ./run.sh
&&& 2 migrate threads, 2 read threads: STARTING
iteration 0
iteration 1
iteration 2
iteration 3
iteration 4
iteration 5
iteration 6
iteration 7
iteration 8
(EE:84) hmm_buffer_mirror_read error -1
iteration 9
iteration 10
iteration 11
iteration 12
iteration 13
iteration 14
iteration 15
iteration 16
iteration 17
(EE:84) hmm_buffer_mirror_read error -1
iteration 18
iteration 19
iteration 20
iteration 21
(EE:84) hmm_buffer_mirror_read error -1
iteration 22
iteration 23
(EE:84) hmm_buffer_mirror_read error -1
iteration 24
iteration 25
iteration 26
iteration 27
iteration 28
iteration 29
iteration 30
iteration 31
iteration 32
iteration 33
iteration 34
iteration 35
iteration 36
iteration 37
iteration 38
iteration 39
iteration 40
iteration 41
iteration 42
iteration 43
iteration 44
iteration 45
iteration 46
iteration 47
iteration 48
iteration 49
iteration 50
iteration 51
(EE:84) hmm_buffer_mirror_read error -1
iteration 52
iteration 53
iteration 54
iteration 55
iteration 56
iteration 57
iteration 58
iteration 59
iteration 60
iteration 61
iteration 62
iteration 63
iteration 64
iteration 65
iteration 66
iteration 67
iteration 68
iteration 69
iteration 70
iteration 71
iteration 72
iteration 73
iteration 74
iteration 75
iteration 76
iteration 77
(EE:84) hmm_buffer_mirror_read error -1
iteration 78
iteration 79
iteration 80
iteration 81
iteration 82
iteration 83
(EE:84) hmm_buffer_mirror_read error -1
iteration 84
iteration 85
iteration 86
iteration 87
iteration 88
iteration 89
(EE:84) hmm_buffer_mirror_read error -1
iteration 90
iteration 91
(EE:84) hmm_buffer_mirror_read error -1
iteration 92
iteration 93
iteration 94
iteration 95
iteration 96
iteration 97
iteration 98
iteration 99
&&& 2 migrate threads, 2 read threads: PASSED
&&& 2 migrate threads, 3 read threads: STARTING
iteration 0
iteration 1
iteration 2
(EE:84) hmm_buffer_mirror_read error -1
iteration 3
iteration 4
iteration 5
iteration 6
iteration 7
iteration 8
iteration 9
iteration 10
iteration 11
iteration 12
iteration 13
iteration 14
iteration 15
iteration 16
iteration 17
iteration 18
iteration 19
iteration 20
iteration 21
iteration 22
iteration 23
iteration 24
iteration 25
iteration 26
iteration 27
iteration 28
iteration 29
(EE:84) hmm_buffer_mirror_read error -1
iteration 30
iteration 31
iteration 32
iteration 33
iteration 34
iteration 35
iteration 36
iteration 37
iteration 38
iteration 39
iteration 40
iteration 41
iteration 42
iteration 43
iteration 44
iteration 45
iteration 46
iteration 47
iteration 48
iteration 49
iteration 50
iteration 51
iteration 52
iteration 53
iteration 54
iteration 55
iteration 56
iteration 57
iteration 58
iteration 59
iteration 60
iteration 61
iteration 62
iteration 63
iteration 64
iteration 65
iteration 66
iteration 67
iteration 68
iteration 69
iteration 70
iteration 71
iteration 72
iteration 73
iteration 74
iteration 75
iteration 76
iteration 77
iteration 78
iteration 79
iteration 80
iteration 81
iteration 82
iteration 83
iteration 84
(EE:84) hmm_buffer_mirror_read error -1
iteration 85
iteration 86
(EE:84) hmm_buffer_mirror_read error -1
iteration 87
iteration 88
iteration 89
iteration 90
iteration 91
(EE:84) hmm_buffer_mirror_read error -1
iteration 92
iteration 93
iteration 94
iteration 95
(EE:84) hmm_buffer_mirror_read error -1
iteration 96
iteration 97
iteration 98
iteration 99
&&& 2 migrate threads, 3 read threads: PASSED
&&& 2 migrate threads, 4 read threads: STARTING
iteration 0
iteration 1
iteration 2
iteration 3
iteration 4
iteration 5
iteration 6
iteration 7
iteration 8
iteration 9
iteration 10
iteration 11
iteration 12
iteration 13
iteration 14
iteration 15
iteration 16
iteration 17
iteration 18
iteration 19
iteration 20
iteration 21
iteration 22
iteration 23
iteration 24
iteration 25
iteration 26
iteration 27
iteration 28
iteration 29
iteration 30
iteration 31
iteration 32
iteration 33
(EE:84) hmm_buffer_mirror_read error -1
iteration 34
iteration 35
iteration 36
iteration 37
iteration 38
iteration 39
iteration 40
iteration 41
iteration 42
iteration 43
iteration 44
iteration 45
iteration 46
iteration 47
iteration 48
iteration 49
iteration 50
iteration 51
iteration 52
iteration 53
iteration 54
iteration 55
(EE:84) hmm_buffer_mirror_read error -1
iteration 56
iteration 57
iteration 58
iteration 59
iteration 60
iteration 61
iteration 62
iteration 63
iteration 64
iteration 65
iteration 66
iteration 67
iteration 68
iteration 69
iteration 70
iteration 71
iteration 72
iteration 73
iteration 74
iteration 75
iteration 76
iteration 77
iteration 78
iteration 79
iteration 80
iteration 81
iteration 82
iteration 83
iteration 84
iteration 85
iteration 86
iteration 87
iteration 88
iteration 89
iteration 90
iteration 91
iteration 92
iteration 93
iteration 94
iteration 95
iteration 96
iteration 97
iteration 98
iteration 99
&&& 2 migrate threads, 4 read threads: PASSED
&&& 3 migrate threads, 2 read threads: STARTING
iteration 0
iteration 1
iteration 2
iteration 3
iteration 4
iteration 5
iteration 6
iteration 7
iteration 8
iteration 9
iteration 10
iteration 11
iteration 12
(EE:84) hmm_buffer_mirror_read error -1
iteration 13
iteration 14
iteration 15
iteration 16
iteration 17
iteration 18
(EE:84) hmm_buffer_mirror_read error -1
iteration 19
(EE:84) hmm_buffer_mirror_read error -1
iteration 20
iteration 21
iteration 22

[-- Attachment #3: kernel.log --]
[-- Type: text/plain, Size: 7145 bytes --]

[   52.928642] hmm_dmirror loaded THIS IS A DANGEROUS MODULE !!!
[   64.367872] DEVICE PAGE 449938 449938 (0)
[   74.757473] DEVICE PAGE 923480 923480 (0)
[   85.617138] DEVICE PAGE 1460301 1460301 (0)
[  118.816301] sysrq: SysRq : Show Blocked State
[  118.816379]   task                        PC stack   pid father
[  118.816382] rcu_sched       D15024     8      2 0x00000000
[  118.816391] Call Trace:
[  118.816398]  __schedule+0x20b/0x6c0
[  118.816400]  schedule+0x36/0x80
[  118.816406]  rcu_gp_kthread+0x74/0x770
[  118.816411]  kthread+0x109/0x140
[  118.816415]  ? force_qs_rnp+0x180/0x180
[  118.816418]  ? kthread_park+0x60/0x60
[  118.816421]  ret_from_fork+0x22/0x30
[  118.816424] rcu_bh          D15424     9      2 0x00000000
[  118.816430] Call Trace:
[  118.816432]  __schedule+0x20b/0x6c0
[  118.816434]  schedule+0x36/0x80
[  118.816438]  rcu_gp_kthread+0x74/0x770
[  118.816441]  kthread+0x109/0x140
[  118.816445]  ? force_qs_rnp+0x180/0x180
[  118.816448]  ? kthread_park+0x60/0x60
[  118.816451]  ret_from_fork+0x22/0x30
[  118.816479] sanity_rmem004  D13904  3898   3897 0x00000000
[  118.816486] Call Trace:
[  118.816488]  __schedule+0x20b/0x6c0
[  118.816490]  schedule+0x36/0x80
[  118.816493]  rwsem_down_write_failed_killable+0x1f5/0x3f0
[  118.816497]  ? account_entity_enqueue+0x9d/0xc0
[  118.816501]  call_rwsem_down_write_failed_killable+0x17/0x30
[  118.816506]  ? selinux_file_mprotect+0x140/0x140
[  118.816509]  down_write_killable+0x2d/0x50
[  118.816513]  vm_mmap_pgoff+0x78/0xf0
[  118.816518]  SyS_mmap_pgoff+0x103/0x270
[  118.816522]  SyS_mmap+0x22/0x30
[  118.816525]  entry_SYSCALL_64_fastpath+0x13/0x94
[  118.816527] RIP: 0033:0x7f42b69a187a
[  118.816529] RSP: 002b:00007fff2b4cdb48 EFLAGS: 00000246 ORIG_RAX: 0000000000000009
[  118.816532] RAX: ffffffffffffffda RBX: 00007f42b4cc1700 RCX: 00007f42b69a187a
[  118.816533] RDX: 0000000000000003 RSI: 0000000000801000 RDI: 0000000000000000
[  118.816534] RBP: 00007fff2b4cdba0 R08: 00000000ffffffff R09: 0000000000000000
[  118.816536] R10: 0000000000020022 R11: 0000000000000246 R12: 0000000000000000
[  118.816537] R13: 0000000000000000 R14: 00007f42b4cc19c0 R15: 00007f42b4cc1700
[  118.816539] sanity_rmem004  D13640  5509   3897 0x00000000
[  118.816545] Call Trace:
[  118.816547]  __schedule+0x20b/0x6c0
[  118.816549]  schedule+0x36/0x80
[  118.816552]  rwsem_down_read_failed+0x112/0x180
[  118.816555]  call_rwsem_down_read_failed+0x18/0x30
[  118.816558]  down_read+0x20/0x40
[  118.816563]  dummy_migrate.isra.10+0x43/0x110 [hmm_dmirror]
[  118.816571]  dummy_fops_unlocked_ioctl+0x1e8/0x330 [hmm_dmirror]
[  118.816573]  ? _cond_resched+0x19/0x30
[  118.816577]  ? selinux_file_ioctl+0x114/0x1e0
[  118.816581]  do_vfs_ioctl+0x96/0x5a0
[  118.816584]  SyS_ioctl+0x79/0x90
[  118.816587]  entry_SYSCALL_64_fastpath+0x13/0x94
[  118.816588] RIP: 0033:0x7f42b699e1e7
[  118.816590] RSP: 002b:00007f42b5cc2d78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  118.816592] RAX: ffffffffffffffda RBX: 00007f42b5cc3700 RCX: 00007f42b699e1e7
[  118.816593] RDX: 00007f42b5cc2e00 RSI: 00000000c0104802 RDI: 0000000000000003
[  118.816594] RBP: 00007fff2b4cdba0 R08: 00007f42b5cc3700 R09: 00007f42b5cc3700
[  118.816595] R10: 00007f42b5cc39d0 R11: 0000000000000246 R12: 0000000000000000
[  118.816597] R13: 0000000000000000 R14: 00007f42b5cc39c0 R15: 00007f42b5cc3700
[  118.816599] sanity_rmem004  D13696  5510   3897 0x00000000
[  118.816605] Call Trace:
[  118.816607]  __schedule+0x20b/0x6c0
[  118.816609]  schedule+0x36/0x80
[  118.816613]  io_schedule+0x16/0x40
[  118.816618]  __lock_page+0xf2/0x130
[  118.816622]  ? page_cache_tree_insert+0x90/0x90
[  118.816625]  migrate_vma+0x48a/0xee0
[  118.816629]  dummy_migrate.isra.10+0xd9/0x110 [hmm_dmirror]
[  118.816638]  dummy_fops_unlocked_ioctl+0x1e8/0x330 [hmm_dmirror]
[  118.816640]  ? _cond_resched+0x19/0x30
[  118.816643]  ? selinux_file_ioctl+0x114/0x1e0
[  118.816646]  do_vfs_ioctl+0x96/0x5a0
[  118.816649]  SyS_ioctl+0x79/0x90
[  118.816652]  entry_SYSCALL_64_fastpath+0x13/0x94
[  118.816654] RIP: 0033:0x7f42b699e1e7
[  118.816655] RSP: 002b:00007f42b64c3d78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  118.816657] RAX: ffffffffffffffda RBX: 00007f42b64c4700 RCX: 00007f42b699e1e7
[  118.816658] RDX: 00007f42b64c3df0 RSI: 00000000c0104802 RDI: 0000000000000003
[  118.816659] RBP: 00007fff2b4cdba0 R08: 0000000000000000 R09: 00007f42b64c4700
[  118.816660] R10: 00007f42b64c49d0 R11: 0000000000000246 R12: 0000000000000000
[  118.816661] R13: 0000000000000000 R14: 00007f42b64c49c0 R15: 00007f42b64c4700
[  118.816663] sanity_rmem004  D13696  5511   3897 0x00000000
[  118.816670] Call Trace:
[  118.816672]  __schedule+0x20b/0x6c0
[  118.816674]  schedule+0x36/0x80
[  118.816677]  io_schedule+0x16/0x40
[  118.816681]  __lock_page+0xf2/0x130
[  118.816684]  ? page_cache_tree_insert+0x90/0x90
[  118.816687]  migrate_vma+0x48a/0xee0
[  118.816691]  dummy_migrate.isra.10+0xd9/0x110 [hmm_dmirror]
[  118.816699]  dummy_fops_unlocked_ioctl+0x1e8/0x330 [hmm_dmirror]
[  118.816701]  ? _cond_resched+0x19/0x30
[  118.816704]  ? selinux_file_ioctl+0x114/0x1e0
[  118.816707]  do_vfs_ioctl+0x96/0x5a0
[  118.816710]  SyS_ioctl+0x79/0x90
[  118.816713]  entry_SYSCALL_64_fastpath+0x13/0x94
[  118.816715] RIP: 0033:0x7f42b699e1e7
[  118.816716] RSP: 002b:00007f42b54c1d78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  118.816718] RAX: ffffffffffffffda RBX: 00007f42b54c2700 RCX: 00007f42b699e1e7
[  118.816719] RDX: 00007f42b54c1df0 RSI: 00000000c0104802 RDI: 0000000000000003
[  118.816720] RBP: 00007fff2b4cdba0 R08: 0000000000000000 R09: 00007f42b54c2700
[  118.816721] R10: 00007f42b54c29d0 R11: 0000000000000246 R12: 0000000000000000
[  118.816722] R13: 0000000000000000 R14: 00007f42b54c29c0 R15: 00007f42b54c2700
[  118.816724] sanity_rmem004  D14624  5512   3897 0x00000000
[  118.816730] Call Trace:
[  118.816732]  __schedule+0x20b/0x6c0
[  118.816734]  schedule+0x36/0x80
[  118.816737]  rwsem_down_read_failed+0x112/0x180
[  118.816740]  call_rwsem_down_read_failed+0x18/0x30
[  118.816742]  down_read+0x20/0x40
[  118.816745]  dummy_fault+0x48/0x1f0 [hmm_dmirror]
[  118.816750]  ? __kernel_map_pages+0x70/0xe0
[  118.816754]  ? get_page_from_freelist+0x655/0xb40
[  118.816757]  ? __alloc_pages_nodemask+0x11b/0x240
[  118.816761]  ? dummy_pt_walk+0x209/0x2f0 [hmm_dmirror]
[  118.816764]  ? dummy_update+0x60/0x60 [hmm_dmirror]
[  118.816767]  dummy_fops_unlocked_ioctl+0x12c/0x330 [hmm_dmirror]
[  118.816770]  do_vfs_ioctl+0x96/0x5a0
[  118.816773]  SyS_ioctl+0x79/0x90
[  118.816776]  entry_SYSCALL_64_fastpath+0x13/0x94
[  118.816777] RIP: 0033:0x7f42b699e1e7
[  118.816778] RSP: 002b:00007f42b4cc0c38 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  118.816780] RAX: ffffffffffffffda RBX: 00007f42b4cc1700 RCX: 00007f42b699e1e7
[  118.816781] RDX: 00007f42b4cc0cd0 RSI: 00000000c0284800 RDI: 0000000000000003
[  118.816782] RBP: 00007fff2b4cdba0 R08: 00007f42b4cc0ef0 R09: 00007f42b4cc1700
[  118.816784] R10: 00007fff2b4cdc60 R11: 0000000000000246 R12: 0000000000000000
[  118.816785] R13: 0000000000000000 R14: 00007f42b4cc19c0 R15: 00007f42b4cc1700

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM 12/15] mm/migrate: new memory migration helper for use with device memory v4
  2017-07-10 23:44         ` Evgeny Baskakov
@ 2017-07-11 18:29           ` Jerome Glisse
  2017-07-11 18:42             ` Evgeny Baskakov
  0 siblings, 1 reply; 66+ messages in thread
From: Jerome Glisse @ 2017-07-11 18:29 UTC (permalink / raw)
  To: Evgeny Baskakov
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, David Nellans,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

[-- Attachment #1: Type: text/plain, Size: 1252 bytes --]

On Mon, Jul 10, 2017 at 04:44:38PM -0700, Evgeny Baskakov wrote:
> On 6/30/17 5:57 PM, Jerome Glisse wrote:
> 
> ...
> 
> Hi Jerome,
> 
> I am working on a sporadic data corruption seen in highly contended use
> cases. So far, I've been able to re-create a sporadic hang that happens when
> multiple threads compete to migrate the same page to and from device memory.
> The reproducer uses only the dummy driver from hmm-next.
> 
> Please find attached. This is how it hangs on my 12-core Intel i7-5930K SMT
> system:
> 

Can you test if the attached patch helps? I am having trouble reproducing this
from inside a VM.

My theory is that two concurrent CPU page faults happen. The first one manages
to start the migration back to system memory, but the second one sees the
migration special entry and calls migration_entry_wait(), which increases the
page refcount, and this happens before the first one checks that the page
refcounts are ok for migration.

For regular migration such a scenario is ok: the migration bails out, and
because the page is CPU accessible there is no need to kick the migration off
again for the other thread that took the CPU fault.
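
For reference, a rough sketch of that suspected interleaving (an illustration
of the theory above only; the call sites are inferred from the blocked-task
traces earlier in this thread, not quoted from the code):

  1. Thread A faults on an address backed by device memory; migrate_vma()
     starts moving the page back to system memory and installs a migration
     special entry in the CPU page table.
  2. Thread B faults on the same address; do_swap_page() sees the special
     entry and calls migration_entry_wait(), which takes a reference on the
     page and sleeps.
  3. Thread A then runs its refcount check (as in migrate_vma_check_page()),
     sees the extra reference taken in step 2, treats the page as pinned and
     aborts the migration of that page.
  4. The page is still in device memory, so the CPU fault has to kick the
     migration again, and steps 1-3 can repeat, which would match the threads
     stuck in __lock_page()/migration_entry_wait() above.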

I am looking into how I can change migration_entry_wait() so that it does not
take a refcount on pages. Let me know if the attached patch helps.

Thank you
Jerome

[-- Attachment #2: 0001-TEST-THEORY-ABOUT-MIGRATION-AND-DEVICE.patch --]
[-- Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM 12/15] mm/migrate: new memory migration helper for use with device memory v4
  2017-07-11  0:17             ` Evgeny Baskakov
@ 2017-07-11  0:54               ` Jerome Glisse
  2017-07-20 21:05                 ` Evgeny Baskakov
  0 siblings, 1 reply; 66+ messages in thread
From: Jerome Glisse @ 2017-07-11  0:54 UTC (permalink / raw)
  To: Evgeny Baskakov
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, David Nellans,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

On Mon, Jul 10, 2017 at 05:17:23PM -0700, Evgeny Baskakov wrote:
> On 7/10/17 4:43 PM, Jerome Glisse wrote:
> 
> > On Mon, Jul 10, 2017 at 03:59:37PM -0700, Evgeny Baskakov wrote:
> > ...
> > Horrible stupid bug in the code, most likely from cut and paste. Attached
> > patch should fix it. I don't know how long it took for you to trigger it.
> > 
> > Jerome
> Thanks, this indeed fixes the problem! Yes, it took a nightly run before it
> triggered.
> 
> On a side note, should this "return NULL" be replaced with "return
> ERR_PTR(-ENOMEM)"?

Or -EBUSY but yes sure.

Jerome


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM 12/15] mm/migrate: new memory migration helper for use with device memory v4
  2017-07-10 23:43           ` Jerome Glisse
@ 2017-07-11  0:17             ` Evgeny Baskakov
  2017-07-11  0:54               ` Jerome Glisse
  0 siblings, 1 reply; 66+ messages in thread
From: Evgeny Baskakov @ 2017-07-11  0:17 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, David Nellans,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

On 7/10/17 4:43 PM, Jerome Glisse wrote:

> On Mon, Jul 10, 2017 at 03:59:37PM -0700, Evgeny Baskakov wrote:
> ...
> Horrible stupid bug in the code, most likely from cut and paste. Attached
> patch should fix it. I don't know how long it took for you to trigger it.
>
> Jérôme
Thanks, this indeed fixes the problem! Yes, it took a nightly run before 
it triggered.

On a side note, should this "return NULL" be replaced with "return
ERR_PTR(-ENOMEM)"?

struct hmm_device *hmm_device_new(void *drvdata)
{
...
     if (hmm_device->minor >= HMM_DEVICE_MAX) {
         spin_unlock(&hmm_device_lock);
         kfree(hmm_device);
->      return NULL;
     }
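
For illustration, a minimal sketch of what that error path could look like
with an ERR_PTR return (using -EBUSY, as suggested in the reply shown earlier
in this thread view; both the error value and the caller-side check are
assumptions, not the actual upstream change):

     if (hmm_device->minor >= HMM_DEVICE_MAX) {
         spin_unlock(&hmm_device_lock);
         kfree(hmm_device);
         return ERR_PTR(-EBUSY);   /* hypothetical: no device minor left */
     }

A caller would then test with IS_ERR()/PTR_ERR() instead of checking for NULL:

     hmm_device = hmm_device_new(drvdata);
     if (IS_ERR(hmm_device))
         return PTR_ERR(hmm_device);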

Thanks!

Evgeny Baskakov
NVIDIA


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM 12/15] mm/migrate: new memory migration helper for use with device memory v4
  2017-07-01  0:57       ` Jerome Glisse
  2017-07-01  2:06         ` Evgeny Baskakov
  2017-07-10 22:59         ` Evgeny Baskakov
@ 2017-07-10 23:44         ` Evgeny Baskakov
  2017-07-11 18:29           ` Jerome Glisse
  2 siblings, 1 reply; 66+ messages in thread
From: Evgeny Baskakov @ 2017-07-10 23:44 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, David Nellans,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

[-- Attachment #1: Type: text/plain, Size: 2526 bytes --]

On 6/30/17 5:57 PM, Jerome Glisse wrote:

...

Hi Jerome,

I am working on a sporadic data corruption seen in highly contended use 
cases. So far, I've been able to re-create a sporadic hang that happens 
when multiple threads compete to migrate the same page to and from 
device memory. The reproducer uses only the dummy driver from hmm-next.

Please find attached. This is how it hangs on my 12-core Intel i7-5930K 
SMT system:

&&& 2 migrate threads, 2 read threads: STARTING
(EE:84) hmm_buffer_mirror_read error -1
&&& 2 migrate threads, 2 read threads: PASSED
&&& 2 migrate threads, 3 read threads: STARTING
&&& 2 migrate threads, 3 read threads: PASSED
&&& 2 migrate threads, 4 read threads: STARTING
&&& 2 migrate threads, 4 read threads: PASSED
&&& 3 migrate threads, 2 read threads: STARTING

The kernel log (also attached) shows multiple threads blocked in 
hmm_vma_fault() and migrate_vma():

[  139.054907] sanity_rmem004  D13528  3997   3818 0x00000000
[  139.054912] Call Trace:
[  139.054914]  __schedule+0x20b/0x6c0
[  139.054916]  schedule+0x36/0x80
[  139.054920]  io_schedule+0x16/0x40
[  139.054923]  __lock_page+0xf2/0x130
[  139.054929]  migrate_vma+0x48a/0xee0
[  139.054933]  dummy_migrate.isra.10+0xd9/0x110 [hmm_dmirror]
[  139.054945]  dummy_fops_unlocked_ioctl+0x1e8/0x330 [hmm_dmirror]
[  139.054954]  do_vfs_ioctl+0x96/0x5a0
[  139.054957]  SyS_ioctl+0x79/0x90
[  139.054960]  entry_SYSCALL_64_fastpath+0x13/0x94
...
[  139.055067] sanity_rmem004  D13136  3999   3818 0x00000000
[  139.055072] Call Trace:
[  139.055074]  __schedule+0x20b/0x6c0
[  139.055076]  schedule+0x36/0x80
[  139.055079]  io_schedule+0x16/0x40
[  139.055083]  wait_on_page_bit+0xee/0x120
[  139.055089]  __migration_entry_wait+0xe8/0x190
[  139.055091]  migration_entry_wait+0x5f/0x70
[  139.055094]  do_swap_page+0x4c7/0x4e0
[  139.055096]  __handle_mm_fault+0x347/0x9d0
[  139.055099]  handle_mm_fault+0x88/0x150
[  139.055103]  hmm_vma_walk_clear+0x8f/0xd0
[  139.055105]  hmm_vma_walk_pmd+0x1ba/0x250
[  139.055109]  __walk_page_range+0x1e8/0x420
[  139.055112]  walk_page_range+0x73/0xf0
[  139.055114]  hmm_vma_fault+0x180/0x260
[  139.055121]  dummy_fault+0xda/0x1f0 [hmm_dmirror]
[  139.055138]  dummy_fops_unlocked_ioctl+0x12c/0x330 [hmm_dmirror]
[  139.055142]  do_vfs_ioctl+0x96/0x5a0
[  139.055145]  SyS_ioctl+0x79/0x90
[  139.055148]  entry_SYSCALL_64_fastpath+0x13/0x94

Please compile and run the attached program this way:

$ ./build.sh
$ sudo ./kload.sh
$ sudo ./run.sh

Thanks!

Evgeny Baskakov
NVIDIA


[-- Attachment #2: sanity_rmem004_repeated_faults_threaded.tgz --]
[-- Type: application/x-gzip, Size: 5265 bytes --]

[-- Attachment #3: kernel.log --]
[-- Type: text/plain, Size: 8482 bytes --]

[  107.703099] hmm_dmirror loaded THIS IS A DANGEROUS MODULE !!!
[  114.236862] DEVICE PAGE 20400 20400 (0)
[  114.845967] DEVICE PAGE 53323 53323 (0)
[  115.536446] DEVICE PAGE 87401 87401 (0)
[  139.054579] sysrq: SysRq : Show Blocked State
[  139.054658]   task                        PC stack   pid father
[  139.054661] rcu_sched       D15024     8      2 0x00000000
[  139.054669] Call Trace:
[  139.054676]  __schedule+0x20b/0x6c0
[  139.054679]  schedule+0x36/0x80
[  139.054687]  rcu_gp_kthread+0x74/0x770
[  139.054693]  kthread+0x109/0x140
[  139.054697]  ? force_qs_rnp+0x180/0x180
[  139.054700]  ? kthread_park+0x60/0x60
[  139.054705]  ret_from_fork+0x22/0x30
[  139.054707] rcu_bh          D15424     9      2 0x00000000
[  139.054713] Call Trace:
[  139.054716]  __schedule+0x20b/0x6c0
[  139.054718]  schedule+0x36/0x80
[  139.054721]  rcu_gp_kthread+0x74/0x770
[  139.054725]  kthread+0x109/0x140
[  139.054728]  ? force_qs_rnp+0x180/0x180
[  139.054731]  ? kthread_park+0x60/0x60
[  139.054734]  ret_from_fork+0x22/0x30
[  139.054762] sanity_rmem004  D13528  3995   3818 0x00000000
[  139.054767] Call Trace:
[  139.054769]  __schedule+0x20b/0x6c0
[  139.054776]  ? wake_up_q+0x80/0x80
[  139.054778]  schedule+0x36/0x80
[  139.054782]  io_schedule+0x16/0x40
[  139.054789]  __lock_page+0xf2/0x130
[  139.054792]  ? page_cache_tree_insert+0x90/0x90
[  139.054798]  migrate_vma+0x48a/0xee0
[  139.054803]  dummy_migrate.isra.10+0xd9/0x110 [hmm_dmirror]
[  139.054812]  dummy_fops_unlocked_ioctl+0x1e8/0x330 [hmm_dmirror]
[  139.054814]  ? _cond_resched+0x19/0x30
[  139.054819]  ? selinux_file_ioctl+0x114/0x1e0
[  139.054823]  do_vfs_ioctl+0x96/0x5a0
[  139.054826]  SyS_ioctl+0x79/0x90
[  139.054830]  entry_SYSCALL_64_fastpath+0x13/0x94
[  139.054832] RIP: 0033:0x7fc07add61e7
[  139.054834] RSP: 002b:00007fc078cdfd78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  139.054836] RAX: ffffffffffffffda RBX: 00007fc078cdfdf0 RCX: 00007fc07add61e7
[  139.054837] RDX: 00007fc078cdfdf0 RSI: 00000000c0104802 RDI: 0000000000000003
[  139.054839] RBP: 00007ffde7ffd470 R08: 0000000000000000 R09: 00007fc078ce0700
[  139.054840] R10: 00007fc078ce09d0 R11: 0000000000000246 R12: 00007fc078cdfdf8
[  139.054841] R13: 0000000000000000 R14: 0000000000000010 R15: 00007fc078ce0700
[  139.054843] sanity_rmem004  D13304  3996   3818 0x00000000
[  139.054848] Call Trace:
[  139.054851]  __schedule+0x20b/0x6c0
[  139.054853]  schedule+0x36/0x80
[  139.054856]  io_schedule+0x16/0x40
[  139.054860]  __lock_page+0xf2/0x130
[  139.054863]  ? page_cache_tree_insert+0x90/0x90
[  139.054866]  migrate_vma+0x48a/0xee0
[  139.054870]  dummy_migrate.isra.10+0xd9/0x110 [hmm_dmirror]
[  139.054877]  ? avc_has_extended_perms+0xda/0x480
[  139.054881]  dummy_fops_unlocked_ioctl+0x1e8/0x330 [hmm_dmirror]
[  139.054883]  ? _cond_resched+0x19/0x30
[  139.054887]  ? selinux_file_ioctl+0x114/0x1e0
[  139.054890]  do_vfs_ioctl+0x96/0x5a0
[  139.054893]  SyS_ioctl+0x79/0x90
[  139.054896]  entry_SYSCALL_64_fastpath+0x13/0x94
[  139.054897] RIP: 0033:0x7fc07add61e7
[  139.054898] RSP: 002b:00007fc0794e0d78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  139.054901] RAX: ffffffffffffffda RBX: 00007fc0794e0df0 RCX: 00007fc07add61e7
[  139.054902] RDX: 00007fc0794e0df0 RSI: 00000000c0104802 RDI: 0000000000000003
[  139.054903] RBP: 00007ffde7ffd470 R08: 0000000000000000 R09: 00007fc0794e1700
[  139.054904] R10: 00007fc0794e19d0 R11: 0000000000000246 R12: 00007fc0794e0df8
[  139.054905] R13: 0000000000000000 R14: 0000000000000010 R15: 00007fc0794e1700
[  139.054907] sanity_rmem004  D13528  3997   3818 0x00000000
[  139.054912] Call Trace:
[  139.054914]  __schedule+0x20b/0x6c0
[  139.054916]  schedule+0x36/0x80
[  139.054920]  io_schedule+0x16/0x40
[  139.054923]  __lock_page+0xf2/0x130
[  139.054927]  ? page_cache_tree_insert+0x90/0x90
[  139.054929]  migrate_vma+0x48a/0xee0
[  139.054933]  dummy_migrate.isra.10+0xd9/0x110 [hmm_dmirror]
[  139.054942]  ? copy_user_enhanced_fast_string+0x7/0x10
[  139.054945]  dummy_fops_unlocked_ioctl+0x1e8/0x330 [hmm_dmirror]
[  139.054947]  ? _cond_resched+0x19/0x30
[  139.054951]  ? selinux_file_ioctl+0x114/0x1e0
[  139.054954]  do_vfs_ioctl+0x96/0x5a0
[  139.054957]  SyS_ioctl+0x79/0x90
[  139.054960]  entry_SYSCALL_64_fastpath+0x13/0x94
[  139.054961] RIP: 0033:0x7fc07add61e7
[  139.054962] RSP: 002b:00007fc079ce1d78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  139.054964] RAX: ffffffffffffffda RBX: 00007fc079ce1df0 RCX: 00007fc07add61e7
[  139.054965] RDX: 00007fc079ce1df0 RSI: 00000000c0104802 RDI: 0000000000000003
[  139.054966] RBP: 00007ffde7ffd470 R08: 0000000000000000 R09: 00007fc079ce2700
[  139.054968] R10: 00007fc079ce29d0 R11: 0000000000000246 R12: 00007fc079ce1df8
[  139.054969] R13: 0000000000000000 R14: 0000000000000010 R15: 00007fc079ce2700
[  139.054971] sanity_rmem004  D13136  3998   3818 0x00000000
[  139.054975] Call Trace:
[  139.054977]  __schedule+0x20b/0x6c0
[  139.054979]  schedule+0x36/0x80
[  139.054983]  io_schedule+0x16/0x40
[  139.054986]  wait_on_page_bit+0xee/0x120
[  139.054990]  ? page_cache_tree_insert+0x90/0x90
[  139.054993]  __migration_entry_wait+0xe8/0x190
[  139.054995]  migration_entry_wait+0x5f/0x70
[  139.054998]  do_swap_page+0x4c7/0x4e0
[  139.055001]  __handle_mm_fault+0x347/0x9d0
[  139.055004]  handle_mm_fault+0x88/0x150
[  139.055008]  hmm_vma_walk_clear+0x8f/0xd0
[  139.055010]  hmm_vma_walk_pmd+0x1ba/0x250
[  139.055015]  __walk_page_range+0x1e8/0x420
[  139.055018]  walk_page_range+0x73/0xf0
[  139.055020]  hmm_vma_fault+0x180/0x260
[  139.055023]  ? hmm_vma_walk_hole+0xd0/0xd0
[  139.055024]  ? hmm_vma_get_pfns+0x1b0/0x1b0
[  139.055028]  dummy_fault+0xda/0x1f0 [hmm_dmirror]
[  139.055033]  ? __kernel_map_pages+0x70/0xe0
[  139.055038]  ? __alloc_pages_nodemask+0x11b/0x240
[  139.055041]  ? dummy_pt_walk+0x209/0x2f0 [hmm_dmirror]
[  139.055044]  ? dummy_update+0x60/0x60 [hmm_dmirror]
[  139.055047]  dummy_fops_unlocked_ioctl+0x12c/0x330 [hmm_dmirror]
[  139.055050]  do_vfs_ioctl+0x96/0x5a0
[  139.055054]  SyS_ioctl+0x79/0x90
[  139.055057]  entry_SYSCALL_64_fastpath+0x13/0x94
[  139.055058] RIP: 0033:0x7fc07add61e7
[  139.055059] RSP: 002b:00007fc07ace3c38 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  139.055061] RAX: ffffffffffffffda RBX: 00007fc07ace4700 RCX: 00007fc07add61e7
[  139.055062] RDX: 00007fc07ace3cd0 RSI: 00000000c0284800 RDI: 0000000000000003
[  139.055063] RBP: 00007ffde7ffd3b0 R08: 00007fc07ace3ef0 R09: 00007fc07ace4700
[  139.055064] R10: 00007ffde7ffd470 R11: 0000000000000246 R12: 0000000000000000
[  139.055066] R13: 0000000000000000 R14: 00007fc07ace49c0 R15: 00007fc07ace4700
[  139.055067] sanity_rmem004  D13136  3999   3818 0x00000000
[  139.055072] Call Trace:
[  139.055074]  __schedule+0x20b/0x6c0
[  139.055076]  schedule+0x36/0x80
[  139.055079]  io_schedule+0x16/0x40
[  139.055083]  wait_on_page_bit+0xee/0x120
[  139.055086]  ? page_cache_tree_insert+0x90/0x90
[  139.055089]  __migration_entry_wait+0xe8/0x190
[  139.055091]  migration_entry_wait+0x5f/0x70
[  139.055094]  do_swap_page+0x4c7/0x4e0
[  139.055096]  __handle_mm_fault+0x347/0x9d0
[  139.055099]  handle_mm_fault+0x88/0x150
[  139.055103]  hmm_vma_walk_clear+0x8f/0xd0
[  139.055105]  hmm_vma_walk_pmd+0x1ba/0x250
[  139.055109]  __walk_page_range+0x1e8/0x420
[  139.055112]  walk_page_range+0x73/0xf0
[  139.055114]  hmm_vma_fault+0x180/0x260
[  139.055116]  ? hmm_vma_walk_hole+0xd0/0xd0
[  139.055118]  ? hmm_vma_get_pfns+0x1b0/0x1b0
[  139.055121]  dummy_fault+0xda/0x1f0 [hmm_dmirror]
[  139.055125]  ? __kernel_map_pages+0x70/0xe0
[  139.055129]  ? __alloc_pages_nodemask+0x11b/0x240
[  139.055133]  ? dummy_pt_walk+0x209/0x2f0 [hmm_dmirror]
[  139.055135]  ? dummy_update+0x60/0x60 [hmm_dmirror]
[  139.055138]  dummy_fops_unlocked_ioctl+0x12c/0x330 [hmm_dmirror]
[  139.055142]  do_vfs_ioctl+0x96/0x5a0
[  139.055145]  SyS_ioctl+0x79/0x90
[  139.055148]  entry_SYSCALL_64_fastpath+0x13/0x94
[  139.055149] RIP: 0033:0x7fc07add61e7
[  139.055150] RSP: 002b:00007fc07a4e2c38 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  139.055152] RAX: ffffffffffffffda RBX: 00007fc07a4e3700 RCX: 00007fc07add61e7
[  139.055153] RDX: 00007fc07a4e2cd0 RSI: 00000000c0284800 RDI: 0000000000000003
[  139.055154] RBP: 00007ffde7ffd3b0 R08: 00007fc07a4e2ef0 R09: 00007fc07a4e3700
[  139.055155] R10: 00007ffde7ffd470 R11: 0000000000000246 R12: 0000000000000000
[  139.055156] R13: 0000000000000000 R14: 00007fc07a4e39c0 R15: 00007fc07a4e3700

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM 12/15] mm/migrate: new memory migration helper for use with device memory v4
  2017-07-10 22:59         ` Evgeny Baskakov
@ 2017-07-10 23:43           ` Jerome Glisse
  2017-07-11  0:17             ` Evgeny Baskakov
  0 siblings, 1 reply; 66+ messages in thread
From: Jerome Glisse @ 2017-07-10 23:43 UTC (permalink / raw)
  To: Evgeny Baskakov
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, David Nellans,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

[-- Attachment #1: Type: text/plain, Size: 618 bytes --]

On Mon, Jul 10, 2017 at 03:59:37PM -0700, Evgeny Baskakov wrote:
> On 6/30/17 5:57 PM, Jerome Glisse wrote:
> ...
> 
> Hi Jerome,
> 
> I am seeing a strange crash in our code that uses the hmm_device_new()
> helper. After the driver is repeatedly loaded/unloaded, hmm_device_new()
> suddenly returns NULL.
> 
> I have reproduced this with the dummy driver from the hmm-next branch:
> 
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000208

Horrible stupid bug in the code, most likely from cut and paste. Attached
patch should fix it. I don't know how long it took for you to trigger it.

Jerome

[-- Attachment #2: 0001-mm-hmm-fix-major-device-driver-exhaustion-dumb-cut-a.patch --]
[-- Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM 12/15] mm/migrate: new memory migration helper for use with device memory v4
  2017-07-01  0:57       ` Jerome Glisse
  2017-07-01  2:06         ` Evgeny Baskakov
@ 2017-07-10 22:59         ` Evgeny Baskakov
  2017-07-10 23:43           ` Jerome Glisse
  2017-07-10 23:44         ` Evgeny Baskakov
  2 siblings, 1 reply; 66+ messages in thread
From: Evgeny Baskakov @ 2017-07-10 22:59 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, David Nellans,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

[-- Attachment #1: Type: text/plain, Size: 1813 bytes --]

On 6/30/17 5:57 PM, Jerome Glisse wrote:
...

Hi Jerome,

I am seeing a strange crash in our code that uses the hmm_device_new() 
helper. After the driver is repeatedly loaded/unloaded, hmm_device_new() 
suddenly returns NULL.

I have reproduced this with the dummy driver from the hmm-next branch:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000208

(gdb) bt
#0  hmm_devmem_add (ops=0xffffffffa003a140, device=0x0 
<irq_stack_union>, size=0x4000000) at mm/hmm.c:997
#1  0xffffffffa0038236 in dmirror_probe (pdev=<optimized out>) at 
drivers/char/hmm_dmirror.c:1106
#2  0xffffffff815acfcb in platform_drv_probe (_dev=0xffff88081368ca78) 
at drivers/base/platform.c:578
#3  0xffffffff815ab0a4 in really_probe (drv=<optimized out>, 
dev=<optimized out>) at drivers/base/dd.c:385
#4  driver_probe_device (drv=0xffffffffa003b028, dev=0xffff88081368ca78) 
at drivers/base/dd.c:529
#5  0xffffffff815ab1d4 in __driver_attach (dev=0xffff88081368ca78, 
data=0xffffffffa003b028) at drivers/base/dd.c:763
#6  0xffffffff815a911d in bus_for_each_dev (bus=<optimized out>, 
start=<optimized out>, data=0x4000000, fn=0x18 <irq_stack_union+24>) at 
drivers/base/bus.c:313
#7  0xffffffff815aa98e in driver_attach (drv=<optimized out>) at 
drivers/base/dd.c:782
#8  0xffffffff815aa585 in bus_add_driver (drv=0xffffffffa003b028) at 
drivers/base/bus.c:669
#9  0xffffffff815abc10 in driver_register (drv=0xffffffffa003b028) at 
drivers/base/driver.c:168
#10 0xffffffff815acf46 in __platform_driver_register (drv=<optimized 
out>, owner=<optimized out>) at drivers/base/platform.c:636


Can you please look into this?

Here's a command to reproduce, using the kload.sh script (taken from a 
sanity suite you provided earlier, attached):

$ while true; do sudo ./kload.sh; done

Thanks!

Evgeny Baskakov
NVIDIA

[-- Attachment #2: kload.sh --]
[-- Type: application/x-sh, Size: 397 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM 12/15] mm/migrate: new memory migration helper for use with device memory v4
  2017-07-01  0:57       ` Jerome Glisse
@ 2017-07-01  2:06         ` Evgeny Baskakov
  2017-07-10 22:59         ` Evgeny Baskakov
  2017-07-10 23:44         ` Evgeny Baskakov
  2 siblings, 0 replies; 66+ messages in thread
From: Evgeny Baskakov @ 2017-07-01  2:06 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, David Nellans,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

On 6/30/17 5:57 PM, Jerome Glisse wrote:

> On Fri, Jun 30, 2017 at 04:19:25PM -0700, Evgeny Baskakov wrote:
>> Hi Jerome,
>>
>> It seems that the kernel can pass 0 in src_pfns for pages that it cannot
>> migrate (i.e. the kernel knows that they cannot migrate prior to calling
>> alloc_and_copy).
>>
>> So, a zero in src_pfns can mean either "the page is not allocated yet" or
>> "the page cannot migrate".
>>
>> Can migrate_vma set the MIGRATE_PFN_MIGRATE flag for not allocated pages? On
>> the driver side it is difficult to differentiate between the cases.
> So this is what is happening in v24. For thing that can not be migrated you
> get 0 and for things that are not allocated you get MIGRATE_PFN_MIGRATE like
> the updated comments in migrate.h explain.
>
> Cheers,
> Jérôme

Yes, I see the updated documentation in migrate.h. The issue seems to be gone now in v24.

Thanks!

Evgeny Baskakov
NVIDIA


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM 12/15] mm/migrate: new memory migration helper for use with device memory v4
  2017-06-30 23:19     ` Evgeny Baskakov
@ 2017-07-01  0:57       ` Jerome Glisse
  2017-07-01  2:06         ` Evgeny Baskakov
                           ` (2 more replies)
  0 siblings, 3 replies; 66+ messages in thread
From: Jerome Glisse @ 2017-07-01  0:57 UTC (permalink / raw)
  To: Evgeny Baskakov
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, David Nellans,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

On Fri, Jun 30, 2017 at 04:19:25PM -0700, Evgeny Baskakov wrote:
> Hi Jerome,
> 
> It seems that the kernel can pass 0 in src_pfns for pages that it cannot
> migrate (i.e. the kernel knows that they cannot migrate prior to calling
> alloc_and_copy).
> 
> So, a zero in src_pfns can mean either "the page is not allocated yet" or
> "the page cannot migrate".
> 
> Can migrate_vma set the MIGRATE_PFN_MIGRATE flag for not allocated pages? On
> the driver side it is difficult to differentiate between the cases.

So this is what is happening in v24. For things that cannot be migrated you
get 0, and for things that are not allocated you get MIGRATE_PFN_MIGRATE, as
the updated comments in migrate.h explain.

Cheers,
Jerome


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM 12/15] mm/migrate: new memory migration helper for use with device memory v4
  2017-06-27  0:07   ` Evgeny Baskakov
@ 2017-06-30 23:19     ` Evgeny Baskakov
  2017-07-01  0:57       ` Jerome Glisse
  0 siblings, 1 reply; 66+ messages in thread
From: Evgeny Baskakov @ 2017-06-30 23:19 UTC (permalink / raw)
  To: Jérôme Glisse, akpm, linux-kernel, linux-mm
  Cc: John Hubbard, David Nellans, Mark Hairgrove, Sherry Cheung,
	Subhash Gutti

On 6/26/17 5:07 PM, Evgeny Baskakov wrote:

 > Hi Jerome,
 >
 > The documentation shown above doesn't tell what the alloc_and_copy 
callback should do for source pages that have not been allocated yet. 
Instead, it unconditionally suggests checking if the MIGRATE_PFN_VALID 
and MIGRATE_PFN_MIGRATE flags are set.
 >
 > Based on my testing and looking in the source code, I see that for 
such pages the respective 'src' PFN entries are always set to 0 without 
any flags.
 >
 > The sample driver specifically handles that by checking if there's no 
page in the 'src' entry, and ignores any flags in such case:
 >
 >     struct page *spage = migrate_pfn_to_page(*src_pfns);
 >     ...
 >     if (spage && !(*src_pfns & MIGRATE_PFN_MIGRATE))
 >         continue;
 >
 >     if (spage && (*src_pfns & MIGRATE_PFN_DEVICE)) {
 >
 > I would like to suggest reflecting that in the documentation. Or, 
which would be more logical, migrate_vma could keep the zero in the PFN 
entries for not allocated pages, but set the MIGRATE_PFN_MIGRATE flag 
anyway.
 >
 > Thanks!
 >
 > Evgeny Baskakov
 > NVIDIA
 >

Hi Jerome,

It seems that the kernel can pass 0 in src_pfns for pages that it cannot 
migrate (i.e. the kernel knows that they cannot migrate prior to calling 
alloc_and_copy).

So, a zero in src_pfns can mean either "the page is not allocated yet" 
or "the page cannot migrate".

Can migrate_vma set the MIGRATE_PFN_MIGRATE flag for not allocated 
pages? On the driver side it is difficult to differentiate between the 
cases.

Thanks!

Evgeny Baskakov
NVIDIA


^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: [HMM 12/15] mm/migrate: new memory migration helper for use with device memory v4
  2017-05-22 16:52 ` [HMM 12/15] mm/migrate: new memory migration helper for use with device memory v4 Jérôme Glisse
  2017-05-23 18:07   ` Reza Arbab
@ 2017-06-27  0:07   ` Evgeny Baskakov
  2017-06-30 23:19     ` Evgeny Baskakov
  1 sibling, 1 reply; 66+ messages in thread
From: Evgeny Baskakov @ 2017-06-27  0:07 UTC (permalink / raw)
  To: Jérôme Glisse, akpm, linux-kernel, linux-mm
  Cc: John Hubbard, David Nellans, Mark Hairgrove, Sherry Cheung,
	Subhash Gutti

On Monday, May 22, 2017 9:52 AM, Jérôme Glisse wrote:
[...]

+ * The alloc_and_copy() callback happens once all source pages have been locked,
+ * unmapped and checked (checked whether pinned or not). All pages that can be
+ * migrated will have an entry in the src array set with the pfn value of the
+ * page and with the MIGRATE_PFN_VALID and MIGRATE_PFN_MIGRATE flag set (other
+ * flags might be set but should be ignored by the callback).
+ *
+ * The alloc_and_copy() callback can then allocate destination memory and copy
+ * source memory to it for all those entries (ie with MIGRATE_PFN_VALID and
+ * MIGRATE_PFN_MIGRATE flag set). Once these are allocated and copied, the
+ * callback must update each corresponding entry in the dst array with the pfn
+ * value of the destination page and with the MIGRATE_PFN_VALID and
+ * MIGRATE_PFN_LOCKED flags set (destination pages must have their struct pages
+ * locked, via lock_page()).
+ *
+ * At this point the alloc_and_copy() callback is done and returns.
+ *
+ * Note that the callback does not have to migrate all the pages that are
+ * marked with MIGRATE_PFN_MIGRATE flag in src array unless this is a migration
+ * from device memory to system memory (ie the MIGRATE_PFN_DEVICE flag is also
+ * set in the src array entry). If the device driver cannot migrate a device
+ * page back to system memory, then it must set the corresponding dst array
+ * entry to MIGRATE_PFN_ERROR. This will trigger a SIGBUS if CPU tries to
+ * access any of the virtual addresses originally backed by this page. Because
+ * a SIGBUS is such a severe result for the userspace process, the device
+ * driver should avoid setting MIGRATE_PFN_ERROR unless it is really in an
+ * unrecoverable state.
+ *
+ * THE alloc_and_copy() CALLBACK MUST NOT CHANGE ANY OF THE SRC ARRAY ENTRIES
+ * OR BAD THINGS WILL HAPPEN !
+ *

Hi Jerome,

The documentation shown above doesn't tell what the alloc_and_copy callback should do for source pages that have not been allocated yet. Instead, it unconditionally suggests checking if the MIGRATE_PFN_VALID and MIGRATE_PFN_MIGRATE flags are set.

Based on my testing and looking in the source code, I see that for such pages the respective 'src' PFN entries are always set to 0 without any flags.

The sample driver specifically handles that by checking if there's no page in the 'src' entry, and ignores any flags in such case:

	struct page *spage = migrate_pfn_to_page(*src_pfns);
	...
	if (spage && !(*src_pfns & MIGRATE_PFN_MIGRATE))
		continue;

	if (spage && (*src_pfns & MIGRATE_PFN_DEVICE)) {

I would like to suggest reflecting that in the documentation. Or, which would be more logical, migrate_vma could keep the zero in the PFN entries for not allocated pages, but set the MIGRATE_PFN_MIGRATE flag anyway.
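
A minimal sketch of how a driver's alloc_and_copy() loop could branch on a
src entry under the current (v23) behaviour described above. The variable
names and what each branch actually does are illustrative assumptions; only
the flag tests mirror the snippets quoted here:

	unsigned long entry = src_pfns[i];
	struct page *spage = migrate_pfn_to_page(entry);

	if (!spage) {
		/*
		 * Entry is 0: either the address was never populated, or the
		 * core decided up front that the page cannot migrate. With
		 * the v23 behaviour the driver cannot tell these apart, so
		 * this sketch simply skips the entry.
		 */
		continue;
	} else if (!(entry & MIGRATE_PFN_MIGRATE)) {
		continue;	/* backed by a page, but not migratable */
	} else if (entry & MIGRATE_PFN_DEVICE) {
		/* already in device memory, migrating back to system RAM */
	} else {
		/* regular system page that may be migrated to the device */
	}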

Thanks!

Evgeny Baskakov
NVIDIA


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM 12/15] mm/migrate: new memory migration helper for use with device memory v4
  2017-05-22 16:52 ` [HMM 12/15] mm/migrate: new memory migration helper for use with device memory v4 Jérôme Glisse
@ 2017-05-23 18:07   ` Reza Arbab
  2017-06-27  0:07   ` Evgeny Baskakov
  1 sibling, 0 replies; 66+ messages in thread
From: Reza Arbab @ 2017-05-23 18:07 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, David Nellans,
	Evgeny Baskakov, Mark Hairgrove, Sherry Cheung, Subhash Gutti

On Mon, May 22, 2017 at 12:52:03PM -0400, Jerome Glisse wrote:
>This patch add a new memory migration helpers, which migrate memory
>backing a range of virtual address of a process to different memory
>(which can be allocated through special allocator). It differs from
>numa migration by working on a range of virtual address and thus by
>doing migration in chunk that can be large enough to use DMA engine
>or special copy offloading engine.
>
>Expected users are any one with heterogeneous memory where different
>memory have different characteristics (latency, bandwidth, ...). As
>an example IBM platform with CAPI bus can make use of this feature
>to migrate between regular memory and CAPI device memory. New CPU
>architecture with a pool of high performance memory not manage as
>cache but presented as regular memory (while being faster and with
>lower latency than DDR) will also be prime user of this patch.
>
>Migration to private device memory will be useful for device that
>have large pool of such like GPU, NVidia plans to use HMM for that.

Acked-by: Reza Arbab <arbab@linux.vnet.ibm.com>

-- 
Reza Arbab


^ permalink raw reply	[flat|nested] 66+ messages in thread

* [HMM 12/15] mm/migrate: new memory migration helper for use with device memory v4
  2017-05-22 16:51 [HMM 00/15] HMM (Heterogeneous Memory Management) v22 Jérôme Glisse
@ 2017-05-22 16:52 ` Jérôme Glisse
  2017-05-23 18:07   ` Reza Arbab
  2017-06-27  0:07   ` Evgeny Baskakov
  0 siblings, 2 replies; 66+ messages in thread
From: Jérôme Glisse @ 2017-05-22 16:52 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, David Nellans, Jérôme Glisse,
	Evgeny Baskakov, Mark Hairgrove, Sherry Cheung, Subhash Gutti

This patch adds new memory migration helpers, which migrate the memory
backing a range of virtual addresses of a process to different memory
(which can be allocated through a special allocator). It differs from
NUMA migration by working on a range of virtual addresses and thus by
doing the migration in chunks that can be large enough to use a DMA
engine or a special copy offloading engine.

Expected users are anyone with heterogeneous memory where different
memories have different characteristics (latency, bandwidth, ...). As
an example, an IBM platform with a CAPI bus can make use of this feature
to migrate between regular memory and CAPI device memory. New CPU
architectures with a pool of high performance memory not managed as a
cache but presented as regular memory (while being faster and with
lower latency than DDR) will also be prime users of this patch.

Migration to private device memory will be useful for devices that have
a large pool of such memory, like GPUs; NVidia plans to use HMM for that.

Changes since v3:
  - Rebase

Changes since v2:
  - dropped HMM prefix and HMM specific code

Changes since v1:
  - typo fixes
  - split early unmap optimization for page with single mapping

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/migrate.h | 104 ++++++++++++
 mm/migrate.c            | 444 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 548 insertions(+)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 78a0fdc..576b3f5 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -127,4 +127,108 @@ static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 }
 #endif /* CONFIG_NUMA_BALANCING && CONFIG_TRANSPARENT_HUGEPAGE*/
 
+
+#ifdef CONFIG_MIGRATION
+
+#define MIGRATE_PFN_VALID	(1UL << 0)
+#define MIGRATE_PFN_MIGRATE	(1UL << 1)
+#define MIGRATE_PFN_LOCKED	(1UL << 2)
+#define MIGRATE_PFN_WRITE	(1UL << 3)
+#define MIGRATE_PFN_ERROR	(1UL << 4)
+#define MIGRATE_PFN_SHIFT	5
+
+static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
+{
+	if (!(mpfn & MIGRATE_PFN_VALID))
+		return NULL;
+	return pfn_to_page(mpfn >> MIGRATE_PFN_SHIFT);
+}
+
+static inline unsigned long migrate_pfn(unsigned long pfn)
+{
+	return (pfn << MIGRATE_PFN_SHIFT) | MIGRATE_PFN_VALID;
+}
+
+/*
+ * struct migrate_vma_ops - migrate operation callback
+ *
+ * @alloc_and_copy: alloc destination memory and copy source memory to it
+ * @finalize_and_map: allow caller to map the successfully migrated pages
+ *
+ *
+ * The alloc_and_copy() callback happens once all source pages have been locked,
+ * unmapped and checked (checked whether pinned or not). All pages that can be
+ * migrated will have an entry in the src array set with the pfn value of the
+ * page and with the MIGRATE_PFN_VALID and MIGRATE_PFN_MIGRATE flag set (other
+ * flags might be set but should be ignored by the callback).
+ *
+ * The alloc_and_copy() callback can then allocate destination memory and copy
+ * source memory to it for all those entries (ie with MIGRATE_PFN_VALID and
+ * MIGRATE_PFN_MIGRATE flag set). Once these are allocated and copied, the
+ * callback must update each corresponding entry in the dst array with the pfn
+ * value of the destination page and with the MIGRATE_PFN_VALID and
+ * MIGRATE_PFN_LOCKED flags set (destination pages must have their struct pages
+ * locked, via lock_page()).
+ *
+ * At this point the alloc_and_copy() callback is done and returns.
+ *
+ * Note that the callback does not have to migrate all the pages that are
+ * marked with MIGRATE_PFN_MIGRATE flag in src array unless this is a migration
+ * from device memory to system memory (ie the MIGRATE_PFN_DEVICE flag is also
+ * set in the src array entry). If the device driver cannot migrate a device
+ * page back to system memory, then it must set the corresponding dst array
+ * entry to MIGRATE_PFN_ERROR. This will trigger a SIGBUS if CPU tries to
+ * access any of the virtual addresses originally backed by this page. Because
+ * a SIGBUS is such a severe result for the userspace process, the device
+ * driver should avoid setting MIGRATE_PFN_ERROR unless it is really in an
+ * unrecoverable state.
+ *
+ * THE alloc_and_copy() CALLBACK MUST NOT CHANGE ANY OF THE SRC ARRAY ENTRIES
+ * OR BAD THINGS WILL HAPPEN !
+ *
+ *
+ * The finalize_and_map() callback happens after struct page migration from
+ * source to destination (destination struct pages are the struct pages for the
+ * memory allocated by the alloc_and_copy() callback).  Migration can fail, and
+ * thus the finalize_and_map() allows the driver to inspect which pages were
+ * successfully migrated, and which were not. Successfully migrated pages will
+ * have the MIGRATE_PFN_MIGRATE flag set for their src array entry.
+ *
+ * It is safe to update device page table from within the finalize_and_map()
+ * callback because both destination and source page are still locked, and the
+ * mmap_sem is held in read mode (hence no one can unmap the range being
+ * migrated).
+ *
+ * Once callback is done cleaning up things and updating its page table (if it
+ * chose to do so, this is not an obligation) then it returns. At this point,
+ * the HMM core will finish up the final steps, and the migration is complete.
+ *
+ * THE finalize_and_map() CALLBACK MUST NOT CHANGE ANY OF THE SRC OR DST ARRAY
+ * ENTRIES OR BAD THINGS WILL HAPPEN !
+ */
+struct migrate_vma_ops {
+	void (*alloc_and_copy)(struct vm_area_struct *vma,
+			       const unsigned long *src,
+			       unsigned long *dst,
+			       unsigned long start,
+			       unsigned long end,
+			       void *private);
+	void (*finalize_and_map)(struct vm_area_struct *vma,
+				 const unsigned long *src,
+				 const unsigned long *dst,
+				 unsigned long start,
+				 unsigned long end,
+				 void *private);
+};
+
+int migrate_vma(const struct migrate_vma_ops *ops,
+		struct vm_area_struct *vma,
+		unsigned long start,
+		unsigned long end,
+		unsigned long *src,
+		unsigned long *dst,
+		void *private);
+
+#endif /* CONFIG_MIGRATION */
+
 #endif /* _LINUX_MIGRATE_H */
diff --git a/mm/migrate.c b/mm/migrate.c
index 66410fc..12063f3 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -397,6 +397,14 @@ int migrate_page_move_mapping(struct address_space *mapping,
 	int expected_count = 1 + extra_count;
 	void **pslot;
 
+	/*
+	 * ZONE_DEVICE pages have 1 refcount always held by their device
+	 *
+	 * Note that DAX memory will never reach that point as it does not have
+	 * the MEMORY_DEVICE_ALLOW_MIGRATE flag set (see memory_hotplug.h).
+	 */
+	expected_count += is_zone_device_page(page);
+
 	if (!mapping) {
 		/* Anonymous page without mapping */
 		if (page_count(page) != expected_count)
@@ -2077,3 +2085,439 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 #endif /* CONFIG_NUMA_BALANCING */
 
 #endif /* CONFIG_NUMA */
+
+
+struct migrate_vma {
+	struct vm_area_struct	*vma;
+	unsigned long		*dst;
+	unsigned long		*src;
+	unsigned long		cpages;
+	unsigned long		npages;
+	unsigned long		start;
+	unsigned long		end;
+};
+
+static int migrate_vma_collect_hole(unsigned long start,
+				    unsigned long end,
+				    struct mm_walk *walk)
+{
+	struct migrate_vma *migrate = walk->private;
+	unsigned long addr, next;
+
+	for (addr = start & PAGE_MASK; addr < end; addr += PAGE_SIZE) {
+		migrate->dst[migrate->npages] = 0;
+		migrate->src[migrate->npages++] = 0;
+	}
+
+	return 0;
+}
+
+static int migrate_vma_collect_pmd(pmd_t *pmdp,
+				   unsigned long start,
+				   unsigned long end,
+				   struct mm_walk *walk)
+{
+	struct migrate_vma *migrate = walk->private;
+	struct mm_struct *mm = walk->vma->vm_mm;
+	unsigned long addr = start;
+	spinlock_t *ptl;
+	pte_t *ptep;
+
+	if (pmd_none(*pmdp) || pmd_trans_unstable(pmdp)) {
+		/* FIXME support THP */
+		return migrate_vma_collect_hole(start, end, walk);
+	}
+
+	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+	for (; addr < end; addr += PAGE_SIZE, ptep++) {
+		unsigned long mpfn, pfn;
+		struct page *page;
+		pte_t pte;
+
+		pte = *ptep;
+		pfn = pte_pfn(pte);
+
+		if (!pte_present(pte)) {
+			mpfn = pfn = 0;
+			goto next;
+		}
+
+		/* FIXME support THP */
+		page = vm_normal_page(migrate->vma, addr, pte);
+		if (!page || !page->mapping || PageTransCompound(page)) {
+			mpfn = pfn = 0;
+			goto next;
+		}
+
+		/*
+		 * By getting a reference on the page we pin it and that blocks
+		 * any kind of migration. Side effect is that it "freezes" the
+		 * pte.
+		 *
+		 * We drop this reference after isolating the page from the lru
+		 * for non device page (device page are not on the lru and thus
+		 * can't be dropped from it).
+		 */
+		get_page(page);
+		migrate->cpages++;
+		mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
+		mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
+
+next:
+		migrate->src[migrate->npages++] = mpfn;
+	}
+	pte_unmap_unlock(ptep - 1, ptl);
+
+	return 0;
+}
+
+/*
+ * migrate_vma_collect() - collect pages over a range of virtual addresses
+ * @migrate: migrate struct containing all migration information
+ *
+ * This will walk the CPU page table. For each virtual address backed by a
+ * valid page, it updates the src array and takes a reference on the page, in
+ * order to pin the page until we lock it and unmap it.
+ */
+static void migrate_vma_collect(struct migrate_vma *migrate)
+{
+	struct mm_walk mm_walk;
+
+	mm_walk.pmd_entry = migrate_vma_collect_pmd;
+	mm_walk.pte_entry = NULL;
+	mm_walk.pte_hole = migrate_vma_collect_hole;
+	mm_walk.hugetlb_entry = NULL;
+	mm_walk.test_walk = NULL;
+	mm_walk.vma = migrate->vma;
+	mm_walk.mm = migrate->vma->vm_mm;
+	mm_walk.private = migrate;
+
+	walk_page_range(migrate->start, migrate->end, &mm_walk);
+
+	migrate->end = migrate->start + (migrate->npages << PAGE_SHIFT);
+}
+
+/*
+ * migrate_vma_check_page() - check if page is pinned or not
+ * @page: struct page to check
+ *
+ * Pinned pages cannot be migrated. This is the same test as in
+ * migrate_page_move_mapping(), except that here we allow migration of a
+ * ZONE_DEVICE page.
+ */
+static bool migrate_vma_check_page(struct page *page)
+{
+	/*
+	 * One extra ref because caller holds an extra reference, either from
+	 * isolate_lru_page() for a regular page, or migrate_vma_collect() for
+	 * a device page.
+	 */
+	int extra = 1;
+
+	/*
+	 * FIXME support THP (transparent huge page), it is bit more complex to
+	 * check them than regular pages, because they can be mapped with a pmd
+	 * or with a pte (split pte mapping).
+	 */
+	if (PageCompound(page))
+		return false;
+
+	if ((page_count(page) - extra) > page_mapcount(page))
+		return false;
+
+	return true;
+}
+
+/*
+ * migrate_vma_prepare() - lock pages and isolate them from the lru
+ * @migrate: migrate struct containing all migration information
+ *
+ * This locks pages that have been collected by migrate_vma_collect(). Once each
+ * page is locked it is isolated from the lru (for non-device pages). Finally,
+ * the ref taken by migrate_vma_collect() is dropped, as locked pages cannot be
+ * migrated by concurrent kernel threads.
+ */
+static void migrate_vma_prepare(struct migrate_vma *migrate)
+{
+	const unsigned long npages = migrate->npages;
+	const unsigned long start = migrate->start;
+	unsigned long addr, i, restore = 0;
+	bool allow_drain = true;
+
+	lru_add_drain();
+
+	for (i = 0; i < npages; i++) {
+		struct page *page = migrate_pfn_to_page(migrate->src[i]);
+
+		if (!page)
+			continue;
+
+		lock_page(page);
+		migrate->src[i] |= MIGRATE_PFN_LOCKED;
+
+		if (!PageLRU(page) && allow_drain) {
+			/* Drain CPU's pagevec */
+			lru_add_drain_all();
+			allow_drain = false;
+		}
+
+		if (isolate_lru_page(page)) {
+			migrate->src[i] = 0;
+			unlock_page(page);
+			migrate->cpages--;
+			put_page(page);
+			continue;
+		}
+
+		if (!migrate_vma_check_page(page)) {
+			migrate->src[i] = 0;
+			unlock_page(page);
+			migrate->cpages--;
+
+			putback_lru_page(page);
+		}
+	}
+}
+
+/*
+ * migrate_vma_unmap() - replace page mapping with special migration pte entry
+ * @migrate: migrate struct containing all migration information
+ *
+ * Replace page mapping (CPU page table pte) with a special migration pte entry
+ * and check again if it has been pinned. Pinned pages are restored because we
+ * cannot migrate them.
+ *
+ * This is the last step before we call the device driver callback to allocate
+ * destination memory and copy contents of original page over to new page.
+ */
+static void migrate_vma_unmap(struct migrate_vma *migrate)
+{
+	int flags = TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
+	const unsigned long npages = migrate->npages;
+	const unsigned long start = migrate->start;
+	unsigned long addr, i, restore = 0;
+
+	for (i = 0; i < npages; i++) {
+		struct page *page = migrate_pfn_to_page(migrate->src[i]);
+
+		if (!page || !(migrate->src[i] & MIGRATE_PFN_MIGRATE))
+			continue;
+
+		try_to_unmap(page, flags);
+		if (page_mapped(page) || !migrate_vma_check_page(page)) {
+			migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+			migrate->cpages--;
+			restore++;
+		}
+	}
+
+	for (addr = start, i = 0; i < npages && restore; addr += PAGE_SIZE, i++) {
+		struct page *page = migrate_pfn_to_page(migrate->src[i]);
+
+		if (!page || (migrate->src[i] & MIGRATE_PFN_MIGRATE))
+			continue;
+
+		remove_migration_ptes(page, page, false);
+
+		migrate->src[i] = 0;
+		unlock_page(page);
+		restore--;
+
+		putback_lru_page(page);
+	}
+}
+
+/*
+ * migrate_vma_pages() - migrate meta-data from src page to dst page
+ * @migrate: migrate struct containing all migration information
+ *
+ * This migrates struct page meta-data from source struct page to destination
+ * struct page. This effectively finishes the migration from source page to the
+ * destination page.
+ */
+static void migrate_vma_pages(struct migrate_vma *migrate)
+{
+	const unsigned long npages = migrate->npages;
+	const unsigned long start = migrate->start;
+	unsigned long addr, i;
+
+	for (i = 0, addr = start; i < npages; addr += PAGE_SIZE, i++) {
+		struct page *newpage = migrate_pfn_to_page(migrate->dst[i]);
+		struct page *page = migrate_pfn_to_page(migrate->src[i]);
+		struct address_space *mapping;
+		int r;
+
+		if (!page || !newpage)
+			continue;
+		if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE))
+			continue;
+
+		mapping = page_mapping(page);
+
+		r = migrate_page(mapping, newpage, page, MIGRATE_SYNC_NO_COPY);
+		if (r != MIGRATEPAGE_SUCCESS)
+			migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+	}
+}
+
+/*
+ * migrate_vma_finalize() - restore CPU page table entry
+ * @migrate: migrate struct containing all migration information
+ *
+ * This replaces the special migration pte entry with either a mapping to the
+ * new page if migration was successful for that page, or to the original page
+ * otherwise.
+ *
+ * This also unlocks the pages and puts them back on the lru, or drops the extra
+ * refcount, for device pages.
+ */
+static void migrate_vma_finalize(struct migrate_vma *migrate)
+{
+	const unsigned long npages = migrate->npages;
+	unsigned long i;
+
+	for (i = 0; i < npages; i++) {
+		struct page *newpage = migrate_pfn_to_page(migrate->dst[i]);
+		struct page *page = migrate_pfn_to_page(migrate->src[i]);
+
+		if (!page)
+			continue;
+		if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE) || !newpage) {
+			if (newpage) {
+				unlock_page(newpage);
+				put_page(newpage);
+			}
+			newpage = page;
+		}
+
+		remove_migration_ptes(page, newpage, false);
+		unlock_page(page);
+		migrate->cpages--;
+
+		putback_lru_page(page);
+
+		if (newpage != page) {
+			unlock_page(newpage);
+			putback_lru_page(newpage);
+		}
+	}
+}
+
+/*
+ * migrate_vma() - migrate a range of memory inside vma
+ *
+ * @ops: migration callback for allocating destination memory and copying
+ * @vma: virtual memory area containing the range to be migrated
+ * @start: start address of the range to migrate (inclusive)
+ * @end: end address of the range to migrate (exclusive)
+ * @src: array of hmm_pfn_t containing source pfns
+ * @dst: array of hmm_pfn_t containing destination pfns
+ * @private: pointer passed back to each of the callback
+ * Returns: 0 on success, error code otherwise
+ *
+ * This function tries to migrate a range of memory virtual address range, using
+ * callbacks to allocate and copy memory from source to destination. First it
+ * collects all the pages backing each virtual address in the range, saving this
+ * inside the src array. Then it locks those pages and unmaps them. Once the pages
+ * are locked and unmapped, it checks whether each page is pinned or not. Pages
+ * that aren't pinned have the MIGRATE_PFN_MIGRATE flag set (by this function)
+ * in the corresponding src array entry. It then restores any pages that are
+ * pinned, by remapping and unlocking those pages.
+ *
+ * At this point it calls the alloc_and_copy() callback. For documentation on
+ * what is expected from that callback, see struct migrate_vma_ops comments in
+ * include/linux/migrate.h
+ *
+ * After the alloc_and_copy() callback, this function goes over each entry in
+ * the src array that has the MIGRATE_PFN_VALID and MIGRATE_PFN_MIGRATE flag
+ * set. If the corresponding entry in dst array has MIGRATE_PFN_VALID flag set,
+ * then the function tries to migrate struct page information from the source
+ * struct page to the destination struct page. If it fails to migrate the struct
+ * page information, then it clears the MIGRATE_PFN_MIGRATE flag in the src
+ * array.
+ *
+ * At this point all successfully migrated pages have an entry in the src
+ * array with MIGRATE_PFN_VALID and MIGRATE_PFN_MIGRATE flag set and the dst
+ * array entry with MIGRATE_PFN_VALID flag set.
+ *
+ * It then calls the finalize_and_map() callback. See comments for "struct
+ * migrate_vma_ops", in include/linux/migrate.h for details about
+ * finalize_and_map() behavior.
+ *
+ * After the finalize_and_map() callback, for successfully migrated pages, this
+ * function updates the CPU page table to point to new pages, otherwise it
+ * restores the CPU page table to point to the original source pages.
+ *
+ * Function returns 0 after the above steps, even if no pages were migrated
+ * (The function only returns an error if any of the arguments are invalid.)
+ *
+ * Both src and dst array must be big enough for (end - start) >> PAGE_SHIFT
+ * unsigned long entries.
+ */
+int migrate_vma(const struct migrate_vma_ops *ops,
+		struct vm_area_struct *vma,
+		unsigned long start,
+		unsigned long end,
+		unsigned long *src,
+		unsigned long *dst,
+		void *private)
+{
+	struct migrate_vma migrate;
+
+	/* Sanity check the arguments */
+	start &= PAGE_MASK;
+	end &= PAGE_MASK;
+	if (!vma || is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL))
+		return -EINVAL;
+	if (start < vma->vm_start || start >= vma->vm_end)
+		return -EINVAL;
+	if (end <= vma->vm_start || end > vma->vm_end)
+		return -EINVAL;
+	if (!ops || !src || !dst || start >= end)
+		return -EINVAL;
+
+	memset(src, 0, sizeof(*src) * ((end - start) >> PAGE_SHIFT));
+	migrate.src = src;
+	migrate.dst = dst;
+	migrate.start = start;
+	migrate.npages = 0;
+	migrate.cpages = 0;
+	migrate.end = end;
+	migrate.vma = vma;
+
+	/* Collect, and try to unmap source pages */
+	migrate_vma_collect(&migrate);
+	if (!migrate.cpages)
+		return 0;
+
+	/* Lock and isolate page */
+	migrate_vma_prepare(&migrate);
+	if (!migrate.cpages)
+		return 0;
+
+	/* Unmap pages */
+	migrate_vma_unmap(&migrate);
+	if (!migrate.cpages)
+		return 0;
+
+	/*
+	 * At this point pages are locked and unmapped, and thus they have
+	 * stable content and can safely be copied to destination memory that
+	 * is allocated by the callback.
+	 *
+	 * Note that migration can fail in migrate_vma_struct_page() for each
+	 * individual page.
+	 */
+	ops->alloc_and_copy(vma, src, dst, start, end, private);
+
+	/* This does the real migration of struct page */
+	migrate_vma_pages(&migrate);
+
+	ops->finalize_and_map(vma, src, dst, start, end, private);
+
+	/* Unlock and remap pages */
+	migrate_vma_finalize(&migrate);
+
+	return 0;
+}
+EXPORT_SYMBOL(migrate_vma);
-- 
2.9.3
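
To make the callback contract above easier to follow, here is a bare-bones
sketch of a driver-side caller of the new helper. It is illustrative only:
the demo_* names, the GFP flags and the choice to allocate ordinary system
pages as destinations are assumptions rather than part of this patch, and
error handling is reduced to the minimum.

#include <linux/highmem.h>
#include <linux/migrate.h>
#include <linux/mm.h>

static void demo_alloc_and_copy(struct vm_area_struct *vma,
				const unsigned long *src,
				unsigned long *dst,
				unsigned long start,
				unsigned long end,
				void *private)
{
	unsigned long i, npages = (end - start) >> PAGE_SHIFT;

	for (i = 0; i < npages; i++) {
		struct page *spage = migrate_pfn_to_page(src[i]);
		struct page *dpage;

		/* Only act on entries the core flagged as migratable. */
		if (!(src[i] & MIGRATE_PFN_MIGRATE))
			continue;

		dpage = alloc_page(GFP_HIGHUSER);
		if (!dpage)
			continue;	/* skipping an entry is allowed */

		/* spage may be NULL for an unpopulated address. */
		if (spage)
			copy_highpage(dpage, spage);

		/* Destination pages must be locked and marked valid. */
		lock_page(dpage);
		dst[i] = migrate_pfn(page_to_pfn(dpage)) |
			 MIGRATE_PFN_LOCKED;
	}
}

static void demo_finalize_and_map(struct vm_area_struct *vma,
				  const unsigned long *src,
				  const unsigned long *dst,
				  unsigned long start,
				  unsigned long end,
				  void *private)
{
	/*
	 * Entries that still have MIGRATE_PFN_MIGRATE set in src were
	 * migrated; a real driver would update its own page tables here
	 * while source and destination pages are still locked.
	 */
}

static const struct migrate_vma_ops demo_migrate_ops = {
	.alloc_and_copy		= demo_alloc_and_copy,
	.finalize_and_map	= demo_finalize_and_map,
};

/* Caller must hold mmap_sem in read mode for [start, end) within vma. */
static int demo_migrate_range(struct vm_area_struct *vma,
			      unsigned long start, unsigned long end,
			      unsigned long *src, unsigned long *dst)
{
	return migrate_vma(&demo_migrate_ops, vma, start, end,
			   src, dst, NULL);
}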


^ permalink raw reply related	[flat|nested] 66+ messages in thread

end of thread, other threads:[~2017-07-26 19:14 UTC | newest]

Thread overview: 66+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-24 17:20 [HMM 00/15] HMM (Heterogeneous Memory Management) v23 Jérôme Glisse
2017-05-24 17:20 ` [HMM 01/15] hmm: heterogeneous memory management documentation Jérôme Glisse
2017-06-24  6:15   ` John Hubbard
2017-05-24 17:20 ` [HMM 02/15] mm/hmm: heterogeneous memory management (HMM for short) v4 Jérôme Glisse
2017-05-31  2:10   ` Balbir Singh
2017-06-01 22:35     ` Jerome Glisse
2017-05-24 17:20 ` [HMM 03/15] mm/hmm/mirror: mirror process address space on device with HMM helpers v3 Jérôme Glisse
2017-05-24 17:20 ` [HMM 04/15] mm/hmm/mirror: helper to snapshot CPU page table v3 Jérôme Glisse
2017-05-24 17:20 ` [HMM 05/15] mm/hmm/mirror: device page fault handler Jérôme Glisse
2017-05-24 17:20 ` [HMM 06/15] mm/memory_hotplug: introduce add_pages Jérôme Glisse
2017-05-31  1:31   ` Balbir Singh
2017-05-24 17:20 ` [HMM 07/15] mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory v3 Jérôme Glisse
2017-05-30 16:43   ` Ross Zwisler
2017-05-30 21:43     ` Jerome Glisse
2017-05-31  1:23   ` Balbir Singh
2017-06-09  3:55   ` John Hubbard
2017-06-12 17:57     ` Jerome Glisse
2017-06-15  3:41   ` zhong jiang
2017-06-15 17:43     ` Jerome Glisse
2017-05-24 17:20 ` [HMM 08/15] mm/ZONE_DEVICE: special case put_page() for device private pages v2 Jérôme Glisse
2017-05-24 17:20 ` [HMM 09/15] mm/hmm/devmem: device memory hotplug using ZONE_DEVICE v5 Jérôme Glisse
2017-06-24  3:54   ` John Hubbard
2017-05-24 17:20 ` [HMM 10/15] mm/hmm/devmem: dummy HMM device for ZONE_DEVICE memory v3 Jérôme Glisse
2017-05-24 17:20 ` [HMM 11/15] mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY Jérôme Glisse
2017-05-24 17:20 ` [HMM 12/15] mm/migrate: new memory migration helper for use with device memory v4 Jérôme Glisse
2017-05-31  3:59   ` Balbir Singh
2017-06-01 22:35     ` Jerome Glisse
2017-06-07  9:02       ` Balbir Singh
2017-06-07 14:06         ` Jerome Glisse
2017-05-24 17:20 ` [HMM 13/15] mm/migrate: migrate_vma() unmap page from vma while collecting pages Jérôme Glisse
2017-05-24 17:20 ` [HMM 14/15] mm/migrate: support un-addressable ZONE_DEVICE page in migration v2 Jérôme Glisse
2017-05-31  4:09   ` Balbir Singh
2017-05-31  8:39     ` Balbir Singh
2017-05-24 17:20 ` [HMM 15/15] mm/migrate: allow migrate_vma() to alloc new page on empty entry v2 Jérôme Glisse
2017-06-16  7:22 ` [HMM 00/15] HMM (Heterogeneous Memory Management) v23 Bridgman, John
2017-06-16 14:47   ` Jerome Glisse
2017-06-16 17:55     ` Bridgman, John
2017-06-16 18:04       ` Jerome Glisse
2017-06-23 15:00 ` Bob Liu
2017-06-23 15:28   ` Jerome Glisse
  -- strict thread matches above, loose matches on Subject: below --
2017-05-22 16:51 [HMM 00/15] HMM (Heterogeneous Memory Management) v22 Jérôme Glisse
2017-05-22 16:52 ` [HMM 12/15] mm/migrate: new memory migration helper for use with device memory v4 Jérôme Glisse
2017-05-23 18:07   ` Reza Arbab
2017-06-27  0:07   ` Evgeny Baskakov
2017-06-30 23:19     ` Evgeny Baskakov
2017-07-01  0:57       ` Jerome Glisse
2017-07-01  2:06         ` Evgeny Baskakov
2017-07-10 22:59         ` Evgeny Baskakov
2017-07-10 23:43           ` Jerome Glisse
2017-07-11  0:17             ` Evgeny Baskakov
2017-07-11  0:54               ` Jerome Glisse
2017-07-20 21:05                 ` Evgeny Baskakov
2017-07-10 23:44         ` Evgeny Baskakov
2017-07-11 18:29           ` Jerome Glisse
2017-07-11 18:42             ` Evgeny Baskakov
2017-07-11 18:49               ` Jerome Glisse
2017-07-11 19:35                 ` Evgeny Baskakov
2017-07-13 20:16                   ` Jerome Glisse
2017-07-14  5:32                     ` Evgeny Baskakov
2017-07-14 19:43                     ` Evgeny Baskakov
2017-07-15  0:55                       ` Jerome Glisse
2017-07-15  5:04                         ` Evgeny Baskakov
2017-07-21  1:00                         ` Evgeny Baskakov
2017-07-21  1:33                           ` Jerome Glisse
2017-07-21 22:01                             ` Evgeny Baskakov
2017-07-25 22:45                             ` Evgeny Baskakov
2017-07-26 19:14                               ` Jerome Glisse

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).