* [HMM 00/16] HMM (Heterogeneous Memory Management) v19
@ 2017-04-05 20:40 ` Jérôme Glisse
  0 siblings, 0 replies; 81+ messages in thread
From: Jérôme Glisse @ 2017-04-05 20:40 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, Naoya Horiguchi, David Nellans,
	Jérôme Glisse

This patchset is on top of mmotm-2017-04-04-15-00; it would conflict
with Michal's memory hotplug patchset (the first patch in this series
would be the conflicting one). There is also a build issue against
4.11-rc*, where some definitions have moved to include/linux/sched/mm.h;
to fix this, that header needs to be included in migrate.c and hmm.c.
The patchset has otherwise been build tested on different architectures
without any issue. It was also tested with real hardware on x86-64.


Changes since v18:
- Use an enum for the memory type instead of a set of flags; this makes
  a clearer separation between the different types of ZONE_DEVICE memory
  (ie persistent or HMM un-addressable memory)
- Don't preserve soft-dirtiness, as checkpoint and restore cannot be used
  with an active device driver. This could be revisited if we are ever
  able to save device state
- Drop the extra flag to the migratepage callback of address_space and
  use a new migrate mode instead of adding a new parameter
- Improve comments in various code paths
- Use an rw_sem to protect the mirrors list
- Improve the Kconfig help description
- Drop overly cautious BUG_ON()
- Add a documentation file
- Build fixes
- Typo fixes


Heterogeneous Memory Management (HMM) (description and justification)

Today device drivers expose a dedicated memory allocation API through
their device file, often relying on a combination of ioctl and mmap
calls. The device can only access and use memory allocated through this
API. This effectively splits the program address space into objects
allocated for, and usable by, the device, and other regular memory
(malloc, mmap of a file, shared memory, ...) accessible only by the CPU
(or in a very limited way by a device, by pinning the memory).

Allowing different isolated components of a program to use a device thus
requires duplicating the input data structures using the device memory
allocator. This is reasonable for simple data structures (arrays, grids,
images, ...) but it gets extremely complex with advanced data structures
(lists, trees, graphs, ...) that rely on a web of memory pointers. This
is becoming a serious limitation on the kind of workload that can be
offloaded to devices like GPUs.

New industry standards like C++, OpenCL or CUDA are pushing to remove
this barrier. This requires a shared address space between the GPU and
the CPU so that the GPU can access any memory of a process (while still
obeying memory protections such as read-only). This kind of feature is
also appearing in various other operating systems.

HMM is a set of helpers to facilitate several aspects of address space
sharing and device memory management. Unlike existing sharing mechanisms
that rely on pinning the pages used by a device, HMM relies on
mmu_notifier to propagate CPU page table updates to the device page
table.
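
To make this more concrete, here is a minimal, illustrative sketch of how
a driver would hook into the mirror functionality introduced in patch 9.
The names (struct hmm_mirror, hmm_mirror_ops, hmm_mirror_register())
follow the HMM mirror API as posted, but the exact signatures may differ
in this revision of the series, and the dummy_* helpers are hypothetical:

#include <linux/hmm.h>

/*
 * dummy_mirror is hypothetical driver state; the hmm_mirror field is
 * the handle HMM uses to deliver CPU page table invalidations.
 */
struct dummy_mirror {
        struct hmm_mirror mirror;
        /* ... device page table root, locks, ... */
};

/*
 * Called by HMM (from an mmu_notifier callback) whenever the CPU page
 * table changes for [start, end); the driver must invalidate the same
 * range in its device page table before returning.
 */
static void dummy_sync_cpu_device_pagetables(struct hmm_mirror *mirror,
                                             enum hmm_update_type update,
                                             unsigned long start,
                                             unsigned long end)
{
        struct dummy_mirror *dmirror;

        dmirror = container_of(mirror, struct dummy_mirror, mirror);
        /* Invalidate [start, end) in the device page table, flush device TLB. */
}

static const struct hmm_mirror_ops dummy_mirror_ops = {
        .sync_cpu_device_pagetables = dummy_sync_cpu_device_pagetables,
};

/* Register the mirror against the process address space (mm). */
static int dummy_mirror_register(struct dummy_mirror *dmirror,
                                 struct mm_struct *mm)
{
        dmirror->mirror.ops = &dummy_mirror_ops;
        return hmm_mirror_register(&dmirror->mirror, mm);
}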

Duplicating the CPU page table is only one aspect necessary for
efficiently using a device like a GPU. GPU local memory has bandwidth in
the terabytes/second range, but it is connected to main memory through a
system bus like PCIe, which is limited to 32 gigabytes/second (PCIe 4.0
x16). Thus it is necessary to allow migration of process memory from
main system memory to device memory. The issue is that on platforms that
only have PCIe, the device memory is not accessible by the CPU with the
same properties as main memory (cache coherency, atomic operations, ...).

To allow migration from main memory to device memory, HMM provides a set
of helpers to hotplug device memory as a new type of ZONE_DEVICE memory
which is un-addressable by the CPU but still has struct page representing
it. This allows most of the core kernel logic that deals with process
memory to stay oblivious to the peculiarities of device memory.
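
As a rough illustration of these hotplug helpers (patch 14), a driver
could do something along the following lines. The hmm_devmem_add() and
hmm_devmem_ops names follow the series, but the exact callback signatures
shown here are an assumption and the dummy_* functions are hypothetical;
the fault callback is what migrates a page back to system memory when the
CPU touches it (see the next paragraph):

#include <linux/hmm.h>

/*
 * CPU fault on an address backed by un-addressable device memory: the
 * driver is expected to migrate the page back to system memory.
 */
static int dummy_devmem_fault(struct hmm_devmem *devmem,
                              struct vm_area_struct *vma,
                              unsigned long addr,
                              const struct page *page,
                              unsigned int flags,
                              pmd_t *pmdp)
{
        /* Allocate a system page, DMA the data back, fix up the CPU PTE. */
        return 0;
}

/* Called when a device page is freed; return it to the device allocator. */
static void dummy_devmem_free(struct hmm_devmem *devmem, struct page *page)
{
}

static const struct hmm_devmem_ops dummy_devmem_ops = {
        .free  = dummy_devmem_free,
        .fault = dummy_devmem_fault,
};

/*
 * Hotplug "size" bytes of device memory as un-addressable ZONE_DEVICE
 * memory; HMM picks an unused physical address range and creates the
 * struct pages backing it.
 */
static struct hmm_devmem *dummy_devmem_probe(struct device *device,
                                             unsigned long size)
{
        return hmm_devmem_add(&dummy_devmem_ops, device, size);
}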

When the page backing an address of a process is migrated to device
memory, the CPU page table entry is set to a new, specific swap entry.
CPU access to such an address triggers a migration back to system
memory, just as if the page had been swapped out to disk. HMM also
blocks anyone from pinning a ZONE_DEVICE page so that it can always be
migrated back to system memory if the CPU accesses it. Conversely, HMM
does not migrate to device memory any page that is pinned in system
memory.

To allow efficient migration between device memory and main memory, a
new migrate_vma() helper is added with this patchset. It makes it
possible to leverage the device DMA engine to perform the copy operation.
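
Below is a hedged sketch of how a driver might drive migrate_vma(). The
two-callback scheme (allocate-and-copy, then finalize-and-map) reflects
the helper introduced in patch 6, but the exact argument layout is an
assumption and the dummy_* functions are hypothetical:

#include <linux/migrate.h>

/*
 * The collect and unmap phases are done by migrate_vma() itself; this
 * callback gets a src[] array describing which pages in [start, end)
 * can migrate. The driver allocates destination device pages, fills
 * dst[] and kicks off the DMA copies.
 */
static void dummy_alloc_and_copy(struct vm_area_struct *vma,
                                 const unsigned long *src,
                                 unsigned long *dst,
                                 unsigned long start,
                                 unsigned long end,
                                 void *private)
{
}

/*
 * Called once the CPU page tables reference the new pages; wait for the
 * DMA copies to finish and update the device page table.
 */
static void dummy_finalize_and_map(struct vm_area_struct *vma,
                                   const unsigned long *src,
                                   const unsigned long *dst,
                                   unsigned long start,
                                   unsigned long end,
                                   void *private)
{
}

static const struct migrate_vma_ops dummy_migrate_ops = {
        .alloc_and_copy   = dummy_alloc_and_copy,
        .finalize_and_map = dummy_finalize_and_map,
};

static int dummy_migrate_range(struct vm_area_struct *vma,
                               unsigned long start, unsigned long end,
                               unsigned long *src, unsigned long *dst)
{
        return migrate_vma(&dummy_migrate_ops, vma, start, end,
                           src, dst, NULL);
}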

This feature will be used by upstream drivers like nouveau and mlx5, and
probably others in the future (amdgpu is the next suspect in line). We
are actively working on nouveau and mlx5 support. To test this patchset
we also worked with the NVidia closed source driver team; they have more
resources than we do to test this kind of infrastructure, and also a
bigger and better userspace eco-system with various real industry
workloads they can use to test and profile HMM.

The expected workload is a program that builds a data set on the CPU
(from disk, from the network, from sensors, ...). The program uses a GPU
API (OpenCL, CUDA, ...) to give hints on memory placement for the input
data and also for the output buffer. The program then calls the GPU API
to schedule a GPU job; this happens through a device driver specific
ioctl. All of this is hidden from the programmer's point of view in the
case of a C++ compiler that transparently offloads some parts of a
program to the GPU. The program can keep doing other work on the CPU
while the GPU is crunching numbers.

It is expected that the CPU will not access the same data set as the GPU
while the GPU is working on it, but this is not mandatory. In fact we
expect some small memory objects to be actively accessed by both the GPU
and the CPU concurrently, as synchronization channels and/or for
monitoring purposes. Such objects will stay in system memory and should
not be bottlenecked by system bus bandwidth (rare write and read accesses
from both CPU and GPU).

As we are relying on the device driver API, HMM does not introduce any
new syscall nor does it modify any existing one. It does not change any
POSIX semantics or behaviors. For instance, the child of a process that
is using HMM will not be impacted in any way by a fork, nor is there any
data hazard between child COW or parent COW of memory that was migrated
to the device prior to the fork.

HMM assumes a number of hardware features. The device must allow its
page table to be updated at any time (ie the device job must be
preemptable). The device page table must provide memory protection such
as read-only. The device must track write accesses (dirty bit). The
device must have a minimum granularity that matches PAGE_SIZE (ie 4k).


Reviewers (just a hint):
Patch 1    adds the concept of memory type and passes it down to arch
           memory hotplug (adding a new arg). Dan Williams is the best
           person to review this change.
Patch 2    moves the page reference decrement from put_page() to
           put_zone_device_page(). Dan Williams is the best person to
           review this change.
Patch 3    adds a new memory type for ZONE_DEVICE and also adds all the
           logic in various core mm code to support this new type. Dan
           Williams and any core mm contributor are the best people to
           review each half of this patch.
Patch 4    adds support for the new un-addressable type added in patch 3
           to x86-64. This can be reviewed by an x86 contributor, but
           there is nothing x86 specific about it, so I think anyone
           with mm experience is fine.
Patch 5    adds a new migrate mode. Anyone familiar with page migration
           is welcome to review.
Patch 6    introduces a new migration helper (migrate_vma()) that allows
           migrating a range of virtual addresses of a process, using a
           device DMA engine to perform the copy. It is not limited to
           copying from and to the device, but can also copy between any
           kind of source and destination memory. Again, anyone familiar
           with the migration code should be able to verify the logic.
Patch 7    optimizes the new migrate_vma() by unmapping pages while we
           are collecting them. This can be reviewed by any mm folks.
Patch 8    introduces the core infrastructure and definitions of HMM; a
           pretty small patch and easy to review.
Patch 9    introduces the mirror functionality of HMM. It relies on
           mmu_notifier, so someone familiar with that part would be in
           a better position to review.
Patch 10   is a helper to snapshot the CPU page table while synchronizing
           with concurrent page table updates. Understanding mmu_notifier
           makes review easier.
Patch 11   is mostly a wrapper around handle_mm_fault().
Patch 12   adds un-addressable memory migration to the helper introduced
           in patch 6; this can be reviewed by anyone familiar with the
           migration code.
Patch 13   adds a feature that allows the device to allocate non-present
           pages on the GPU when migrating a range of addresses to device
           memory. This is a helper for device drivers to avoid having to
           first allocate system memory before migrating to device memory.
Patch 14   adds a helper to hotplug un-addressable device memory as the
           new type of ZONE_DEVICE memory (the new type introduced in
           patch 3 of this series). This is boilerplate code around
           memory hotplug, and it also picks a free range of physical
           addresses for the device memory. Note that the physical
           addresses do not point to anything (at least as far as the
           kernel knows).
Patch 15   introduces a new hmm_device class as a helper for device
           drivers that want to expose multiple device memories under a
           common fake device driver. This is useful for multi-GPU
           configurations. Anyone familiar with the device driver
           infrastructure can review this. Boilerplate code, really.
Patch 16   is the documentation for everything.


Previous patchset postings:
    v1 http://lwn.net/Articles/597289/
    v2 https://lkml.org/lkml/2014/6/12/559
    v3 https://lkml.org/lkml/2014/6/13/633
    v4 https://lkml.org/lkml/2014/8/29/423
    v5 https://lkml.org/lkml/2014/11/3/759
    v6 http://lwn.net/Articles/619737/
    v7 http://lwn.net/Articles/627316/
    v8 https://lwn.net/Articles/645515/
    v9 https://lwn.net/Articles/651553/
    v10 https://lwn.net/Articles/654430/
    v11 http://www.gossamer-threads.com/lists/linux/kernel/2286424
    v12 http://www.kernelhub.org/?msg=972982&p=2
    v13 https://lwn.net/Articles/706856/
    v14 https://lkml.org/lkml/2016/12/8/344
    v15 http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1304107.html
    v16 http://www.spinics.net/lists/linux-mm/msg119814.html
    v17 https://lkml.org/lkml/2017/1/27/847
    v18 https://lkml.org/lkml/2017/3/16/596


Jérôme Glisse (16):
  mm/memory/hotplug: add memory type parameter to arch_add/remove_memory
  mm/put_page: move ZONE_DEVICE page reference decrement v2
  mm/unaddressable-memory: new type of ZONE_DEVICE for unaddressable
    memory
  mm/ZONE_DEVICE/x86: add support for un-addressable device memory
  mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY
  mm/migrate: new memory migration helper for use with device memory v4
  mm/migrate: migrate_vma() unmap page from vma while collecting pages
  mm/hmm: heterogeneous memory management (HMM for short)
  mm/hmm/mirror: mirror process address space on device with HMM helpers
  mm/hmm/mirror: helper to snapshot CPU page table v2
  mm/hmm/mirror: device page fault handler
  mm/migrate: support un-addressable ZONE_DEVICE page in migration
  mm/migrate: allow migrate_vma() to alloc new page on empty entry
  mm/hmm/devmem: device memory hotplug using ZONE_DEVICE
  mm/hmm/devmem: dummy HMM device for ZONE_DEVICE memory v2
  hmm: heterogeneous memory management documentation

 Documentation/vm/hmm.txt       |  362 ++++++++++++
 MAINTAINERS                    |    7 +
 arch/ia64/mm/init.c            |   36 +-
 arch/powerpc/mm/mem.c          |   37 +-
 arch/s390/mm/init.c            |   16 +-
 arch/sh/mm/init.c              |   35 +-
 arch/x86/mm/init_32.c          |   41 +-
 arch/x86/mm/init_64.c          |   57 +-
 fs/aio.c                       |    8 +
 fs/f2fs/data.c                 |    5 +-
 fs/hugetlbfs/inode.c           |    5 +-
 fs/proc/task_mmu.c             |    7 +
 fs/ubifs/file.c                |    5 +-
 include/linux/hmm.h            |  468 ++++++++++++++++
 include/linux/ioport.h         |    1 +
 include/linux/memory_hotplug.h |   34 +-
 include/linux/memremap.h       |   57 ++
 include/linux/migrate.h        |  115 ++++
 include/linux/migrate_mode.h   |    5 +
 include/linux/mm.h             |   14 +-
 include/linux/mm_types.h       |    5 +
 include/linux/swap.h           |   24 +-
 include/linux/swapops.h        |   68 +++
 kernel/fork.c                  |    2 +
 kernel/memremap.c              |   51 +-
 mm/Kconfig                     |   44 ++
 mm/Makefile                    |    1 +
 mm/balloon_compaction.c        |    8 +
 mm/hmm.c                       | 1205 ++++++++++++++++++++++++++++++++++++++++
 mm/memory.c                    |   61 ++
 mm/memory_hotplug.c            |   14 +-
 mm/migrate.c                   |  785 +++++++++++++++++++++++++-
 mm/mprotect.c                  |   14 +
 mm/page_vma_mapped.c           |   10 +
 mm/rmap.c                      |   25 +
 mm/zsmalloc.c                  |    8 +
 36 files changed, 3590 insertions(+), 50 deletions(-)
 create mode 100644 Documentation/vm/hmm.txt
 create mode 100644 include/linux/hmm.h
 create mode 100644 mm/hmm.c

-- 
2.9.3

* [HMM 01/16] mm/memory/hotplug: add memory type parameter to arch_add/remove_memory
  2017-04-05 20:40 ` Jérôme Glisse
@ 2017-04-05 20:40   ` Jérôme Glisse
  -1 siblings, 0 replies; 81+ messages in thread
From: Jérôme Glisse @ 2017-04-05 20:40 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, Naoya Horiguchi, David Nellans,
	Jérôme Glisse, Russell King, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker, Chris Metcalf,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin

When hotplugging memory we want more information on the type of memory
being added. This is to extend ZONE_DEVICE to support new types of
memory other than persistent memory. The existing user of ZONE_DEVICE
(persistent memory) is left unmodified.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---
 arch/ia64/mm/init.c            | 36 +++++++++++++++++++++++++++++++++---
 arch/powerpc/mm/mem.c          | 37 ++++++++++++++++++++++++++++++++++---
 arch/s390/mm/init.c            | 16 ++++++++++++++--
 arch/sh/mm/init.c              | 35 +++++++++++++++++++++++++++++++++--
 arch/x86/mm/init_32.c          | 41 +++++++++++++++++++++++++++++++++++++----
 arch/x86/mm/init_64.c          | 39 +++++++++++++++++++++++++++++++++++----
 include/linux/memory_hotplug.h | 24 ++++++++++++++++++++++--
 include/linux/memremap.h       |  2 ++
 kernel/memremap.c              |  5 +++--
 mm/memory_hotplug.c            |  4 ++--
 10 files changed, 215 insertions(+), 24 deletions(-)

diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index 06cdaef..c910b3f 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -645,20 +645,36 @@ mem_init (void)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size, enum memory_type type)
 {
 	pg_data_t *pgdat;
 	struct zone *zone;
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
+	bool for_device = false;
 	int ret;
 
+	/*
+	 * Each memory_type needs special handling, so error out on an
+	 * unsupported type. In particular, MEMORY_DEVICE_UNADDRESSABLE
+	 * is not supported on this architecture.
+	 */
+	switch (type) {
+	case MEMORY_NORMAL:
+		break;
+	case MEMORY_DEVICE_PERSISTENT:
+		for_device = true;
+		break;
+	default:
+		pr_err("hotplug unsupported memory type %d\n", type);
+		return -EINVAL;
+	}
+
 	pgdat = NODE_DATA(nid);
 
 	zone = pgdat->node_zones +
 		zone_for_memory(nid, start, size, ZONE_NORMAL, for_device);
 	ret = __add_pages(nid, zone, start_pfn, nr_pages);
-
 	if (ret)
 		printk("%s: Problem encountered in __add_pages() as ret=%d\n",
 		       __func__,  ret);
@@ -667,13 +683,27 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size)
+int arch_remove_memory(u64 start, u64 size, enum memory_type type)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	struct zone *zone;
 	int ret;
 
+	/*
+	 * Each memory_type needs special handling, so error out on an
+	 * unsupported type. In particular, MEMORY_DEVICE_UNADDRESSABLE
+	 * is not supported on this architecture.
+	 */
+	switch (type) {
+	case MEMORY_NORMAL:
+	case MEMORY_DEVICE_PERSISTENT:
+		break;
+	default:
+		pr_err("hotplug unsupported memory type %d\n", type);
+		return -EINVAL;
+	}
+
 	zone = page_zone(pfn_to_page(start_pfn));
 	ret = __remove_pages(zone, start_pfn, nr_pages);
 	if (ret)
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 5f84433..0933261 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -126,14 +126,31 @@ int __weak remove_section_mapping(unsigned long start, unsigned long end)
 	return -ENODEV;
 }
 
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size, enum memory_type type)
 {
 	struct pglist_data *pgdata;
-	struct zone *zone;
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
+	bool for_device = false;
+	struct zone *zone;
 	int rc;
 
+	/*
+	 * Each memory_type needs special handling, so error out on an
+	 * unsupported type. In particular, MEMORY_DEVICE_UNADDRESSABLE
+	 * is not supported on this architecture.
+	 */
+	switch (type) {
+	case MEMORY_NORMAL:
+		break;
+	case MEMORY_DEVICE_PERSISTENT:
+		for_device = true;
+		break;
+	default:
+		pr_err("hotplug unsupported memory type %d\n", type);
+		return -EINVAL;
+	}
+
 	pgdata = NODE_DATA(nid);
 
 	start = (unsigned long)__va(start);
@@ -153,13 +170,27 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size)
+int arch_remove_memory(u64 start, u64 size, enum memory_type type)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	struct zone *zone;
 	int ret;
 
+	/*
+	 * Each memory_type needs special handling, so error out on an
+	 * unsupported type. In particular, MEMORY_DEVICE_UNADDRESSABLE
+	 * is not supported on this architecture.
+	 */
+	switch (type) {
+	case MEMORY_NORMAL:
+	case MEMORY_DEVICE_PERSISTENT:
+		break;
+	default:
+		pr_err("hotplug unsupported memory type %d\n", type);
+		return -EINVAL;
+	}
+
 	zone = page_zone(pfn_to_page(start_pfn));
 	ret = __remove_pages(zone, start_pfn, nr_pages);
 	if (ret)
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index bf5b8a0..20d7714 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -153,7 +153,7 @@ void __init free_initrd_mem(unsigned long start, unsigned long end)
 #endif
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size, enum memory_type type)
 {
 	unsigned long zone_start_pfn, zone_end_pfn, nr_pages;
 	unsigned long start_pfn = PFN_DOWN(start);
@@ -162,6 +162,18 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
 	struct zone *zone;
 	int rc, i;
 
+	/*
+	 * Each memory_type needs special handling, so error out on an
+	 * unsupported type.
+	 */
+	switch (type) {
+	case MEMORY_NORMAL:
+		break;
+	default:
+		pr_err("hotplug unsupported memory type %d\n", type);
+		return -EINVAL;
+	}
+
 	rc = vmem_add_mapping(start, size);
 	if (rc)
 		return rc;
@@ -205,7 +217,7 @@ unsigned long memory_block_size_bytes(void)
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size)
+int arch_remove_memory(u64 start, u64 size, enum memory_type type)
 {
 	/*
 	 * There is no hardware or firmware interface which could trigger a
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index 7549186..f37e7a6 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -485,13 +485,30 @@ void free_initrd_mem(unsigned long start, unsigned long end)
 #endif
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size, enum memory_type type)
 {
 	pg_data_t *pgdat;
 	unsigned long start_pfn = PFN_DOWN(start);
 	unsigned long nr_pages = size >> PAGE_SHIFT;
+	bool for_device = false;
 	int ret;
 
+	/*
+	 * Each memory_type needs special handling, so error out on an
+	 * unsupported type. In particular, MEMORY_DEVICE_UNADDRESSABLE
+	 * is not supported on this architecture.
+	 */
+	switch (type) {
+	case MEMORY_NORMAL:
+		break;
+	case MEMORY_DEVICE_PERSISTENT:
+		for_device = true;
+		break;
+	default:
+		pr_err("hotplug unsupported memory type %d\n", type);
+		return -EINVAL;
+	}
+
 	pgdat = NODE_DATA(nid);
 
 	/* We only have ZONE_NORMAL, so this is easy.. */
@@ -516,13 +533,27 @@ EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
 #endif
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size)
+int arch_remove_memory(u64 start, u64 size, enum memory_type type)
 {
 	unsigned long start_pfn = PFN_DOWN(start);
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	struct zone *zone;
 	int ret;
 
+	/*
+	 * Each memory_type needs special handling, so error out on an
+	 * unsupported type. In particular, MEMORY_DEVICE_UNADDRESSABLE
+	 * is not supported on this architecture.
+	 */
+	switch (type) {
+	case MEMORY_NORMAL:
+	case MEMORY_DEVICE_PERSISTENT:
+		break;
+	default:
+		pr_err("hotplug unsupported memory type %d\n", type);
+		return -EINVAL;
+	}
+
 	zone = page_zone(pfn_to_page(start_pfn));
 	ret = __remove_pages(zone, start_pfn, nr_pages);
 	if (unlikely(ret))
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index c68078f..811d631 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -826,24 +826,57 @@ void __init mem_init(void)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size, enum memory_type type)
 {
 	struct pglist_data *pgdata = NODE_DATA(nid);
-	struct zone *zone = pgdata->node_zones +
-		zone_for_memory(nid, start, size, ZONE_HIGHMEM, for_device);
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
+	bool for_device = false;
+	struct zone *zone;
+
+	/*
+	 * Each memory_type needs special handling, so error out on an
+	 * unsupported type. In particular, MEMORY_DEVICE_UNADDRESSABLE
+	 * is not supported on this architecture.
+	 */
+	switch (type) {
+	case MEMORY_NORMAL:
+		break;
+	case MEMORY_DEVICE_PERSISTENT:
+		for_device = true;
+		break;
+	default:
+		pr_err("hotplug unsupported memory type %d\n", type);
+		return -EINVAL;
+	}
+
+	zone = pgdata->node_zones +
+		zone_for_memory(nid, start, size, ZONE_HIGHMEM, for_device);
 
 	return __add_pages(nid, zone, start_pfn, nr_pages);
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size)
+int arch_remove_memory(u64 start, u64 size, enum memory_type type)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	struct zone *zone;
 
+	/*
+	 * Each memory_type needs special handling, so error out on an
+	 * unsupported type. In particular, MEMORY_DEVICE_UNADDRESSABLE
+	 * is not supported on this architecture.
+	 */
+	switch (type) {
+	case MEMORY_NORMAL:
+	case MEMORY_DEVICE_PERSISTENT:
+		break;
+	default:
+		pr_err("hotplug unsupported memory type %d\n", type);
+		return -EINVAL;
+	}
+
 	zone = page_zone(pfn_to_page(start_pfn));
 	return __remove_pages(zone, start_pfn, nr_pages);
 }
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 7eef172..6c0b24e 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -641,15 +641,33 @@ static void  update_end_of_memory_vars(u64 start, u64 size)
  * Memory is added always to NORMAL zone. This means you will never get
  * additional DMA/DMA32 memory.
  */
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size, enum memory_type type)
 {
 	struct pglist_data *pgdat = NODE_DATA(nid);
-	struct zone *zone = pgdat->node_zones +
-		zone_for_memory(nid, start, size, ZONE_NORMAL, for_device);
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
+	bool for_device = false;
+	struct zone *zone;
 	int ret;
 
+	/*
+	 * Each memory_type needs special handling, so error out on an
+	 * unsupported type.
+	 */
+	switch (type) {
+	case MEMORY_NORMAL:
+		break;
+	case MEMORY_DEVICE_PERSISTENT:
+		for_device = true;
+		break;
+	default:
+		pr_err("hotplug unsupported memory type %d\n", type);
+		return -EINVAL;
+	}
+
+	zone = pgdat->node_zones +
+		zone_for_memory(nid, start, size, ZONE_NORMAL, for_device);
+
 	init_memory_mapping(start, start + size);
 
 	ret = __add_pages(nid, zone, start_pfn, nr_pages);
@@ -946,7 +964,7 @@ kernel_physical_mapping_remove(unsigned long start, unsigned long end)
 	remove_pagetable(start, end, true);
 }
 
-int __ref arch_remove_memory(u64 start, u64 size)
+int __ref arch_remove_memory(u64 start, u64 size, enum memory_type type)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
@@ -955,6 +973,19 @@ int __ref arch_remove_memory(u64 start, u64 size)
 	struct zone *zone;
 	int ret;
 
+	/*
+	 * Each memory_type needs special handling, so error out on an
+	 * unsupported type.
+	 */
+	switch (type) {
+	case MEMORY_NORMAL:
+	case MEMORY_DEVICE_PERSISTENT:
+		break;
+	default:
+		pr_err("hotplug unsupported memory type %d\n", type);
+		return -EINVAL;
+	}
+
 	/* With altmap the first mapped page is offset from @start */
 	altmap = to_vmem_altmap((unsigned long) page);
 	if (altmap)
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 134a2f6..c3999f2 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -13,6 +13,26 @@ struct mem_section;
 struct memory_block;
 struct resource;
 
+/*
+ * When hotplugging memory with arch_add_memory(), we want more information on
+ * the type of memory we are hotplugging, because depending on the type of
+ * architecture, the code might want to take different paths.
+ *
+ * MEMORY_NORMAL:
+ * Your regular system memory. Default common case.
+ *
+ * MEMORY_DEVICE_PERSISTENT:
+ * Persistent device memory (pmem): struct page might be allocated in different
+ * memory and architecture might want to perform special actions. It is similar
+ * to regular memory, in that the CPU can access it transparently. However,
+ * it is likely to have different bandwidth and latency than regular memory.
+ * See Documentation/nvdimm/nvdimm.txt for more information.
+ */
+enum memory_type {
+	MEMORY_NORMAL = 0,
+	MEMORY_DEVICE_PERSISTENT,
+};
+
 #ifdef CONFIG_MEMORY_HOTPLUG
 
 /*
@@ -104,7 +124,7 @@ extern bool memhp_auto_online;
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
 extern bool is_pageblock_removable_nolock(struct page *page);
-extern int arch_remove_memory(u64 start, u64 size);
+extern int arch_remove_memory(u64 start, u64 size, enum memory_type type);
 extern int __remove_pages(struct zone *zone, unsigned long start_pfn,
 	unsigned long nr_pages);
 #endif /* CONFIG_MEMORY_HOTREMOVE */
@@ -276,7 +296,7 @@ extern int add_memory(int nid, u64 start, u64 size);
 extern int add_memory_resource(int nid, struct resource *resource, bool online);
 extern int zone_for_memory(int nid, u64 start, u64 size, int zone_default,
 		bool for_device);
-extern int arch_add_memory(int nid, u64 start, u64 size, bool for_device);
+extern int arch_add_memory(int nid, u64 start, u64 size, enum memory_type type);
 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
 extern bool is_memblock_offlined(struct memory_block *mem);
 extern void remove_memory(int nid, u64 start, u64 size);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 9341619..1f720f7 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -41,12 +41,14 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
  * @res: physical address range covered by @ref
  * @ref: reference count that pins the devm_memremap_pages() mapping
  * @dev: host device of the mapping for debug
+ * @type: memory type see MEMORY_* in memory_hotplug.h
  */
 struct dev_pagemap {
 	struct vmem_altmap *altmap;
 	const struct resource *res;
 	struct percpu_ref *ref;
 	struct device *dev;
+	enum memory_type type;
 };
 
 #ifdef CONFIG_ZONE_DEVICE
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 07e85e5..6b4505d 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -248,7 +248,7 @@ static void devm_memremap_pages_release(struct device *dev, void *data)
 	align_size = ALIGN(resource_size(res), SECTION_SIZE);
 
 	mem_hotplug_begin();
-	arch_remove_memory(align_start, align_size);
+	arch_remove_memory(align_start, align_size, pgmap->type);
 	mem_hotplug_done();
 
 	untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
@@ -326,6 +326,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 	}
 	pgmap->ref = ref;
 	pgmap->res = &page_map->res;
+	pgmap->type = MEMORY_DEVICE_PERSISTENT;
 
 	mutex_lock(&pgmap_lock);
 	error = 0;
@@ -363,7 +364,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 		goto err_pfn_remap;
 
 	mem_hotplug_begin();
-	error = arch_add_memory(nid, align_start, align_size, true);
+	error = arch_add_memory(nid, align_start, align_size, pgmap->type);
 	mem_hotplug_done();
 	if (error)
 		goto err_add_memory;
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index a07a07c..d1a4326 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1384,7 +1384,7 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online)
 	}
 
 	/* call arch's memory hotadd */
-	ret = arch_add_memory(nid, start, size, false);
+	ret = arch_add_memory(nid, start, size, MEMORY_NORMAL);
 
 	if (ret < 0)
 		goto error;
@@ -2188,7 +2188,7 @@ void __ref remove_memory(int nid, u64 start, u64 size)
 	memblock_free(start, size);
 	memblock_remove(start, size);
 
-	arch_remove_memory(start, size);
+	arch_remove_memory(start, size, MEMORY_NORMAL);
 
 	try_offline_node(nid);
 
-- 
2.9.3

 extern int zone_for_memory(int nid, u64 start, u64 size, int zone_default,
 		bool for_device);
-extern int arch_add_memory(int nid, u64 start, u64 size, bool for_device);
+extern int arch_add_memory(int nid, u64 start, u64 size, enum memory_type type);
 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
 extern bool is_memblock_offlined(struct memory_block *mem);
 extern void remove_memory(int nid, u64 start, u64 size);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 9341619..1f720f7 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -41,12 +41,14 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
  * @res: physical address range covered by @ref
  * @ref: reference count that pins the devm_memremap_pages() mapping
  * @dev: host device of the mapping for debug
+ * @type: memory type see MEMORY_* in memory_hotplug.h
  */
 struct dev_pagemap {
 	struct vmem_altmap *altmap;
 	const struct resource *res;
 	struct percpu_ref *ref;
 	struct device *dev;
+	enum memory_type type;
 };
 
 #ifdef CONFIG_ZONE_DEVICE
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 07e85e5..6b4505d 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -248,7 +248,7 @@ static void devm_memremap_pages_release(struct device *dev, void *data)
 	align_size = ALIGN(resource_size(res), SECTION_SIZE);
 
 	mem_hotplug_begin();
-	arch_remove_memory(align_start, align_size);
+	arch_remove_memory(align_start, align_size, pgmap->type);
 	mem_hotplug_done();
 
 	untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
@@ -326,6 +326,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 	}
 	pgmap->ref = ref;
 	pgmap->res = &page_map->res;
+	pgmap->type = MEMORY_DEVICE_PERSISTENT;
 
 	mutex_lock(&pgmap_lock);
 	error = 0;
@@ -363,7 +364,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 		goto err_pfn_remap;
 
 	mem_hotplug_begin();
-	error = arch_add_memory(nid, align_start, align_size, true);
+	error = arch_add_memory(nid, align_start, align_size, pgmap->type);
 	mem_hotplug_done();
 	if (error)
 		goto err_add_memory;
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index a07a07c..d1a4326 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1384,7 +1384,7 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online)
 	}
 
 	/* call arch's memory hotadd */
-	ret = arch_add_memory(nid, start, size, false);
+	ret = arch_add_memory(nid, start, size, MEMORY_NORMAL);
 
 	if (ret < 0)
 		goto error;
@@ -2188,7 +2188,7 @@ void __ref remove_memory(int nid, u64 start, u64 size)
 	memblock_free(start, size);
 	memblock_remove(start, size);
 
-	arch_remove_memory(start, size);
+	arch_remove_memory(start, size, MEMORY_NORMAL);
 
 	try_offline_node(nid);
 
-- 
2.9.3
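
For reference, a condensed (and slightly simplified) view of the two existing
call sites after this patch; both now pass the new enum instead of a bare
boolean (variable names taken from the hunks above):

	/* mm/memory_hotplug.c: regular RAM hotplug */
	ret = arch_add_memory(nid, start, size, MEMORY_NORMAL);

	/* kernel/memremap.c: persistent (pmem) device memory */
	pgmap->type = MEMORY_DEVICE_PERSISTENT;
	error = arch_add_memory(nid, align_start, align_size, pgmap->type);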


^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [HMM 02/16] mm/put_page: move ZONE_DEVICE page reference decrement v2
  2017-04-05 20:40 ` Jérôme Glisse
@ 2017-04-05 20:40   ` Jérôme Glisse
  -1 siblings, 0 replies; 81+ messages in thread
From: Jérôme Glisse @ 2017-04-05 20:40 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, Naoya Horiguchi, David Nellans,
	Jérôme Glisse, Ross Zwisler

Move the page reference decrement of ZONE_DEVICE pages from put_page()
to put_zone_device_page(); this does not affect non-ZONE_DEVICE pages.

Doing this allows us to catch the point where a ZONE_DEVICE page refcount
reaches 1, which means the page is no longer referenced by anyone (unlike
pages from other zones, a ZONE_DEVICE page refcount never reaches 0).

This patch is just a preparatory patch for HMM.
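
Put differently, the refcount convention for ZONE_DEVICE pages becomes:
refcount > 1 means the page is in use, refcount == 1 means it is free. A
hypothetical helper (not added by this patch, shown only to illustrate the
convention) could look like:

static inline bool zone_device_page_is_free(struct page *page)
{
	/* ZONE_DEVICE pages are "free" at refcount 1, they never reach 0 */
	return is_zone_device_page(page) && page_ref_count(page) == 1;
}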

Changes since v1:
  - commit message

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 include/linux/mm.h | 14 +++++++++++---
 kernel/memremap.c  |  6 ++++++
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0860a2b..92db0fb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -813,11 +813,19 @@ static inline void put_page(struct page *page)
 {
 	page = compound_head(page);
 
+	/*
+	 * ZONE_DEVICE pages should never have their refcount reach 0 (this
+	 * would be a bug), so call page_ref_dec() in put_zone_device_page()
+	 * to decrement page refcount and skip __put_page() here, as this
+	 * would worsen things if a ZONE_DEVICE had a refcount bug.
+	 */
+	if (unlikely(is_zone_device_page(page))) {
+		put_zone_device_page(page);
+		return;
+	}
+
 	if (put_page_testzero(page))
 		__put_page(page);
-
-	if (unlikely(is_zone_device_page(page)))
-		put_zone_device_page(page);
 }
 
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 6b4505d..0228a01 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -190,6 +190,12 @@ EXPORT_SYMBOL(get_zone_device_page);
 
 void put_zone_device_page(struct page *page)
 {
+	/*
+	 * ZONE_DEVICE page refcount should never reach 0 and never be freed
+	 * to kernel memory allocator.
+	 */
+	page_ref_dec(page);
+
 	put_dev_pagemap(page->pgmap);
 }
 EXPORT_SYMBOL(put_zone_device_page);
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [HMM 03/16] mm/unaddressable-memory: new type of ZONE_DEVICE for unaddressable memory
  2017-04-05 20:40 ` Jérôme Glisse
@ 2017-04-05 20:40   ` Jérôme Glisse
  -1 siblings, 0 replies; 81+ messages in thread
From: Jérôme Glisse @ 2017-04-05 20:40 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, Naoya Horiguchi, David Nellans,
	Jérôme Glisse, Ross Zwisler

HMM (heterogeneous memory management) needs struct page to support migration
from system main memory to device memory. The reasons for HMM, and for
migration to device memory, are explained in the HMM core patch.

This patch deals with device memory that is un-addressable (ie the CPU can
not access it). Hence we do not want those struct pages to be managed like
regular memory. That is why we extend ZONE_DEVICE to support different
types of memory.

A persistent memory type is defined for the existing users of ZONE_DEVICE,
and a new device un-addressable type is added for the un-addressable memory.
There is a clear separation between what is expected from each memory type,
existing users of ZONE_DEVICE are unaffected by the new requirements and by
the new use of the un-addressable type. All type-specific code paths are
guarded by a test against the memory type.

Because the memory is un-addressable, we use a new special swap type for
when a page is migrated to device memory (this reduces the maximum number
of swap files).
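
For illustration only (the code that actually installs these entries lands
with the migrate_vma() patches later in the series), a page migrated to
device memory leaves behind a CPU page table entry along these lines,
assuming dpage, vma, addr and ptep come from the migration path:

	swp_entry_t entry;
	pte_t swp_pte;

	/* dpage is the ZONE_DEVICE page that now holds the data */
	entry = make_device_entry(dpage, vma->vm_flags & VM_WRITE);
	swp_pte = swp_entry_to_pte(entry);
	set_pte_at(vma->vm_mm, addr, ptep, swp_pte);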

The two main additions to ZONE_DEVICE, besides the memory type, are two
callbacks. The first one, page_free(), is called whenever the page refcount
reaches 1 (which means the page is free, as a ZONE_DEVICE page never reaches
a refcount of 0). This allows the device driver to manage its memory and the
associated struct pages.

The second callback, page_fault(), is invoked when the CPU accesses an
address that is backed by a device page (which is un-addressable by the
CPU). This callback is responsible for migrating the page back to system
main memory. The device driver can not block migration back to system
memory; HMM makes sure that such a page can not be pinned in device memory.

If the device is in an error condition and can not migrate the memory back,
then a CPU page fault on device memory should end with SIGBUS.
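
A minimal sketch of the driver side, assuming a hypothetical my_* driver
(none of these helpers exist in this patch; they only illustrate how the
new dev_pagemap fields are meant to be filled in):

static int my_page_fault(struct vm_area_struct *vma, unsigned long addr,
			 struct page *page, unsigned int flags, pmd_t *pmdp)
{
	/* migrate the device page back to system memory, or fail hard */
	if (my_migrate_to_ram(page, vma, addr))
		return VM_FAULT_SIGBUS;
	return 0;
}

static void my_page_free(struct page *page, void *data)
{
	/* refcount reached 1: the driver owns this device page again */
	my_free_device_page(data, page);
}

static void my_init_pagemap(struct dev_pagemap *pgmap, struct my_device *mdev)
{
	pgmap->type = MEMORY_DEVICE_UNADDRESSABLE;
	pgmap->page_fault = my_page_fault;
	pgmap->page_free = my_page_free;
	pgmap->data = mdev;
}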

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/proc/task_mmu.c             |  7 +++++
 include/linux/ioport.h         |  1 +
 include/linux/memory_hotplug.h | 10 +++++++
 include/linux/memremap.h       | 57 ++++++++++++++++++++++++++++++++++-
 include/linux/swap.h           | 24 +++++++++++++--
 include/linux/swapops.h        | 68 ++++++++++++++++++++++++++++++++++++++++++
 kernel/memremap.c              | 42 +++++++++++++++++++++++++-
 mm/Kconfig                     |  9 ++++++
 mm/memory.c                    | 61 +++++++++++++++++++++++++++++++++++++
 mm/memory_hotplug.c            | 10 +++++--
 mm/mprotect.c                  | 14 +++++++++
 11 files changed, 296 insertions(+), 7 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 5c83597..a7f1ae4 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -541,6 +541,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 			}
 		} else if (is_migration_entry(swpent))
 			page = migration_entry_to_page(swpent);
+		else if (is_device_entry(swpent))
+			page = device_entry_to_page(swpent);
 	} else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap
 							&& pte_none(*pte))) {
 		page = find_get_entry(vma->vm_file->f_mapping,
@@ -703,6 +705,8 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
 
 		if (is_migration_entry(swpent))
 			page = migration_entry_to_page(swpent);
+		else if (is_device_entry(swpent))
+			page = device_entry_to_page(swpent);
 	}
 	if (page) {
 		int mapcount = page_mapcount(page);
@@ -1195,6 +1199,9 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
 		flags |= PM_SWAP;
 		if (is_migration_entry(entry))
 			page = migration_entry_to_page(entry);
+
+		if (is_device_entry(entry))
+			page = device_entry_to_page(entry);
 	}
 
 	if (page && !PageAnon(page))
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index 6230064..ec619dc 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -130,6 +130,7 @@ enum {
 	IORES_DESC_ACPI_NV_STORAGE		= 3,
 	IORES_DESC_PERSISTENT_MEMORY		= 4,
 	IORES_DESC_PERSISTENT_MEMORY_LEGACY	= 5,
+	IORES_DESC_DEVICE_MEMORY_UNADDRESSABLE	= 6,
 };
 
 /* helpers to define resources */
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index c3999f2..e60f203 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -27,10 +27,20 @@ struct resource;
  * to regular memory, in that the CPU can access it transparently. However,
  * it is likely to have different bandwidth and latency than regular memory.
  * See Documentation/nvdimm/nvdimm.txt for more information.
+ *
+ * MEMORY_DEVICE_UNADDRESSABLE:
+ * Device memory that is not directly addressable by the CPU: CPU can neither
+ * read nor write _UNADDRESSABLE memory. In this case, we do still have struct
+ * pages backing the device memory. Doing so simplifies the implementation, but
+ * it is important to remember that there are certain points at which the struct
+ * page must be treated as an opaque object, rather than a "normal" struct page.
+ * A more complete discussion of unaddressable memory may be found in
+ * include/linux/hmm.h and Documentation/vm/hmm.txt.
  */
 enum memory_type {
 	MEMORY_NORMAL = 0,
 	MEMORY_DEVICE_PERSISTENT,
+	MEMORY_DEVICE_UNADDRESSABLE,
 };
 
 #ifdef CONFIG_MEMORY_HOTPLUG
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 1f720f7..3a9494e 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -35,19 +35,62 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
 }
 #endif
 
+/*
+ * For MEMORY_DEVICE_UNADDRESSABLE we use ZONE_DEVICE and extend it with two
+ * callbacks:
+ *   page_fault()
+ *   page_free()
+ *
+ * Additional notes about MEMORY_DEVICE_UNADDRESSABLE may be found in
+ * include/linux/hmm.h and Documentation/vm/hmm.txt. There is also a brief
+ * explanation in include/linux/memory_hotplug.h.
+ *
+ * The page_fault() callback must migrate page back, from device memory to
+ * system memory, so that the CPU can access it. This might fail for various
+ * reasons (device issues, the device has been unplugged, ...). When such error
+ * conditions happen, the page_fault() callback must return VM_FAULT_SIGBUS and
+ * set the CPU page table entry to "poisoned".
+ *
+ * Note that because memory cgroup charges are transferred to the device memory,
+ * this should never fail due to memory restrictions. However, allocation
+ * of a regular system page might still fail because we are out of memory. If
+ * that happens, the page_fault() callback must return VM_FAULT_OOM.
+ *
+ * The page_fault() callback can also try to migrate back multiple pages in one
+ * chunk, as an optimization. It must, however, prioritize the faulting address
+ * over all the others.
+ *
+ *
+ * The page_free() callback is called once the page refcount reaches 1
+ * (ZONE_DEVICE pages never reach 0 refcount unless there is a refcount bug.
+ * This allows the device driver to implement its own memory management.)
+ */
+typedef int (*dev_page_fault_t)(struct vm_area_struct *vma,
+				unsigned long addr,
+				struct page *page,
+				unsigned int flags,
+				pmd_t *pmdp);
+typedef void (*dev_page_free_t)(struct page *page, void *data);
+
 /**
  * struct dev_pagemap - metadata for ZONE_DEVICE mappings
+ * @page_fault: callback when CPU fault on an unaddressable device page
+ * @page_free: free page callback when page refcount reaches 1
  * @altmap: pre-allocated/reserved memory for vmemmap allocations
  * @res: physical address range covered by @ref
  * @ref: reference count that pins the devm_memremap_pages() mapping
  * @dev: host device of the mapping for debug
- * @type: memory type see MEMORY_* in memory_hotplug.h
+ * @data: private data pointer for page_free()
+ * @type: memory type: see MEMORY_* in memory_hotplug.h
  */
 struct dev_pagemap {
+	dev_page_fault_t page_fault;
+	dev_page_free_t page_free;
 	struct vmem_altmap *altmap;
 	const struct resource *res;
 	struct percpu_ref *ref;
 	struct device *dev;
+	void *data;
 	enum memory_type type;
 };
 
@@ -55,6 +98,13 @@ struct dev_pagemap {
 void *devm_memremap_pages(struct device *dev, struct resource *res,
 		struct percpu_ref *ref, struct vmem_altmap *altmap);
 struct dev_pagemap *find_dev_pagemap(resource_size_t phys);
+
+static inline bool is_device_unaddressable_page(const struct page *page)
+{
+	/* See MEMORY_DEVICE_UNADDRESSABLE in include/linux/memory_hotplug.h */
+	return ((page_zonenum(page) == ZONE_DEVICE) &&
+		(page->pgmap->type == MEMORY_DEVICE_UNADDRESSABLE));
+}
 #else
 static inline void *devm_memremap_pages(struct device *dev,
 		struct resource *res, struct percpu_ref *ref,
@@ -73,6 +123,11 @@ static inline struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
 {
 	return NULL;
 }
+
+static inline bool is_device_unaddressable_page(const struct page *page)
+{
+	return false;
+}
 #endif
 
 /**
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 486494e..9174e67 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -51,6 +51,23 @@ static inline int current_is_kswapd(void)
  */
 
 /*
+ * Unaddressable device memory support. See include/linux/hmm.h and
+ * Documentation/vm/hmm.txt. Short description is we need struct pages for
+ * device memory that is unaddressable (inaccessible) by CPU, so that we can
+ * migrate part of a process memory to device memory.
+ *
+ * When a page is migrated from CPU to device, we set the CPU page table entry
+ * to a special SWP_DEVICE_* entry.
+ */
+#ifdef CONFIG_DEVICE_UNADDRESSABLE
+#define SWP_DEVICE_NUM 2
+#define SWP_DEVICE_WRITE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM)
+#define SWP_DEVICE_READ (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+1)
+#else
+#define SWP_DEVICE_NUM 0
+#endif
+
+/*
  * NUMA node memory migration support
  */
 #ifdef CONFIG_MIGRATION
@@ -72,7 +89,8 @@ static inline int current_is_kswapd(void)
 #endif
 
 #define MAX_SWAPFILES \
-	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
+	((1 << MAX_SWAPFILES_SHIFT) - SWP_DEVICE_NUM - \
+	SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
 
 /*
  * Magic header for a swap area. The first part of the union is
@@ -435,8 +453,8 @@ static inline void show_swap_cache_info(void)
 {
 }
 
-#define free_swap_and_cache(swp)	is_migration_entry(swp)
-#define swapcache_prepare(swp)		is_migration_entry(swp)
+#define free_swap_and_cache(e) (is_migration_entry(e) || is_device_entry(e))
+#define swapcache_prepare(e) (is_migration_entry(e) || is_device_entry(e))
 
 static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
 {
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 5c3a5f3..960eb6b 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -100,6 +100,74 @@ static inline void *swp_to_radix_entry(swp_entry_t entry)
 	return (void *)(value | RADIX_TREE_EXCEPTIONAL_ENTRY);
 }
 
+#if IS_ENABLED(CONFIG_DEVICE_UNADDRESSABLE)
+static inline swp_entry_t make_device_entry(struct page *page, bool write)
+{
+	return swp_entry(write ? SWP_DEVICE_WRITE : SWP_DEVICE_READ,
+			 page_to_pfn(page));
+}
+
+static inline bool is_device_entry(swp_entry_t entry)
+{
+	int type = swp_type(entry);
+	return type == SWP_DEVICE_READ || type == SWP_DEVICE_WRITE;
+}
+
+static inline void make_device_entry_read(swp_entry_t *entry)
+{
+	*entry = swp_entry(SWP_DEVICE_READ, swp_offset(*entry));
+}
+
+static inline bool is_write_device_entry(swp_entry_t entry)
+{
+	return unlikely(swp_type(entry) == SWP_DEVICE_WRITE);
+}
+
+static inline struct page *device_entry_to_page(swp_entry_t entry)
+{
+	return pfn_to_page(swp_offset(entry));
+}
+
+int device_entry_fault(struct vm_area_struct *vma,
+		       unsigned long addr,
+		       swp_entry_t entry,
+		       unsigned int flags,
+		       pmd_t *pmdp);
+#else /* CONFIG_DEVICE_UNADDRESSABLE */
+static inline swp_entry_t make_device_entry(struct page *page, bool write)
+{
+	return swp_entry(0, 0);
+}
+
+static inline void make_device_entry_read(swp_entry_t *entry)
+{
+}
+
+static inline bool is_device_entry(swp_entry_t entry)
+{
+	return false;
+}
+
+static inline bool is_write_device_entry(swp_entry_t entry)
+{
+	return false;
+}
+
+static inline struct page *device_entry_to_page(swp_entry_t entry)
+{
+	return NULL;
+}
+
+static inline int device_entry_fault(struct vm_area_struct *vma,
+				     unsigned long addr,
+				     swp_entry_t entry,
+				     unsigned int flags,
+				     pmd_t *pmdp)
+{
+	return VM_FAULT_SIGBUS;
+}
+#endif /* CONFIG_DEVICE_UNADDRESSABLE */
+
 #ifdef CONFIG_MIGRATION
 static inline swp_entry_t make_migration_entry(struct page *page, int write)
 {
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 0228a01..d98ba1f 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -18,6 +18,8 @@
 #include <linux/io.h>
 #include <linux/mm.h>
 #include <linux/memory_hotplug.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
 
 #ifndef ioremap_cache
 /* temporary while we convert existing ioremap_cache users to memremap */
@@ -194,12 +196,47 @@ void put_zone_device_page(struct page *page)
 	 * ZONE_DEVICE page refcount should never reach 0 and never be freed
 	 * to kernel memory allocator.
 	 */
-	page_ref_dec(page);
+	int count = page_ref_dec_return(page);
+
+	/*
+	 * If refcount is 1 then page is freed and refcount is stable as nobody
+	 * holds a reference on the page.
+	 */
+	if (page->pgmap->page_free && count == 1)
+		page->pgmap->page_free(page, page->pgmap->data);
 
 	put_dev_pagemap(page->pgmap);
 }
 EXPORT_SYMBOL(put_zone_device_page);
 
+#if IS_ENABLED(CONFIG_DEVICE_UNADDRESSABLE)
+int device_entry_fault(struct vm_area_struct *vma,
+		       unsigned long addr,
+		       swp_entry_t entry,
+		       unsigned int flags,
+		       pmd_t *pmdp)
+{
+	struct page *page = device_entry_to_page(entry);
+
+	/*
+	 * The page_fault() callback must migrate page back to system memory
+	 * so that CPU can access it. This might fail for various reasons
+	 * (device issue, device was unsafely unplugged, ...). When such
+	 * error conditions happen, the callback must return VM_FAULT_SIGBUS.
+	 *
+	 * Note that because memory cgroup charges are accounted to the device
+	 * memory, this should never fail because of memory restrictions (but
+	 * allocation of regular system page might still fail because we are
+	 * out of memory).
+	 *
+	 * There is a more in-depth description of what that callback can and
+	 * cannot do, in include/linux/memremap.h
+	 */
+	return page->pgmap->page_fault(vma, addr, page, flags, pmdp);
+}
+EXPORT_SYMBOL(device_entry_fault);
+#endif /* CONFIG_DEVICE_UNADDRESSABLE */
+
 static void pgmap_radix_release(struct resource *res)
 {
 	resource_size_t key, align_start, align_size, align_end;
@@ -333,6 +370,9 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 	pgmap->ref = ref;
 	pgmap->res = &page_map->res;
 	pgmap->type = MEMORY_DEVICE_PERSISTENT;
+	pgmap->page_fault = NULL;
+	pgmap->page_free = NULL;
+	pgmap->data = NULL;
 
 	mutex_lock(&pgmap_lock);
 	error = 0;
diff --git a/mm/Kconfig b/mm/Kconfig
index 9b8fccb..6208963 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -700,6 +700,15 @@ config ZONE_DEVICE
 
 	  If FS_DAX is enabled, then say Y.
 
+config DEVICE_UNADDRESSABLE
+	bool "Unaddressable device memory (GPU memory, ...)"
+	depends on ZONE_DEVICE
+
+	help
+	  Allows creation of struct pages to represent unaddressable device memory;
+	  i.e., memory that is only accessible from the device (or group of
+	  devices).
+
 config FRAME_VECTOR
 	bool
 
diff --git a/mm/memory.c b/mm/memory.c
index 9c82e25..d68c653 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -45,6 +45,7 @@
 #include <linux/swap.h>
 #include <linux/highmem.h>
 #include <linux/pagemap.h>
+#include <linux/memremap.h>
 #include <linux/ksm.h>
 #include <linux/rmap.h>
 #include <linux/export.h>
@@ -923,6 +924,35 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 					pte = pte_swp_mksoft_dirty(pte);
 				set_pte_at(src_mm, addr, src_pte, pte);
 			}
+		} else if (is_device_entry(entry)) {
+			page = device_entry_to_page(entry);
+
+			/*
+			 * Update rss count even for unaddressable pages, as
+			 * they should be treated just like normal pages in this
+			 * respect.
+			 *
+			 * We will likely want to have some new rss counters
+			 * for unaddressable pages, at some point. But for now
+			 * keep things as they are.
+			 */
+			get_page(page);
+			rss[mm_counter(page)]++;
+			page_dup_rmap(page, false);
+
+			/*
+			 * We do not preserve soft-dirty information, because so
+			 * far, checkpoint/restore is the only feature that
+			 * requires that. And checkpoint/restore does not work
+			 * when a device driver is involved (you cannot easily
+			 * save and restore device driver state).
+			 */
+			if (is_write_device_entry(entry) &&
+			    is_cow_mapping(vm_flags)) {
+				make_device_entry_read(&entry);
+				pte = swp_entry_to_pte(entry);
+				set_pte_at(src_mm, addr, src_pte, pte);
+			}
 		}
 		goto out_set_pte;
 	}
@@ -1239,6 +1269,29 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			}
 			continue;
 		}
+
+		entry = pte_to_swp_entry(ptent);
+		if (non_swap_entry(entry) && is_device_entry(entry)) {
+			struct page *page = device_entry_to_page(entry);
+
+			if (unlikely(details && details->check_mapping)) {
+				/*
+				 * unmap_shared_mapping_pages() wants to
+				 * invalidate cache without truncating:
+				 * unmap shared but keep private pages.
+				 */
+				if (details->check_mapping !=
+				    page_rmapping(page))
+					continue;
+			}
+
+			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+			rss[mm_counter(page)]--;
+			page_remove_rmap(page, false);
+			put_page(page);
+			continue;
+		}
+
 		/* If details->check_mapping, we leave swap entries. */
 		if (unlikely(details))
 			continue;
@@ -2686,6 +2739,14 @@ int do_swap_page(struct vm_fault *vmf)
 		if (is_migration_entry(entry)) {
 			migration_entry_wait(vma->vm_mm, vmf->pmd,
 					     vmf->address);
+		} else if (is_device_entry(entry)) {
+			/*
+			 * For un-addressable device memory we call the pgmap
+			 * fault handler callback. The callback must migrate
+			 * the page back to some CPU accessible page.
+			 */
+			ret = device_entry_fault(vma, vmf->address, entry,
+						 vmf->flags, vmf->pmd);
 		} else if (is_hwpoison_entry(entry)) {
 			ret = VM_FAULT_HWPOISON;
 		} else {
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index d1a4326..62856c7 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -155,7 +155,7 @@ void mem_hotplug_done(void)
 /* add this memory to iomem resource */
 static struct resource *register_memory_resource(u64 start, u64 size)
 {
-	struct resource *res;
+	struct resource *res, *conflict;
 	res = kzalloc(sizeof(struct resource), GFP_KERNEL);
 	if (!res)
 		return ERR_PTR(-ENOMEM);
@@ -164,7 +164,13 @@ static struct resource *register_memory_resource(u64 start, u64 size)
 	res->start = start;
 	res->end = start + size - 1;
 	res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
-	if (request_resource(&iomem_resource, res) < 0) {
+	conflict =  request_resource_conflict(&iomem_resource, res);
+	if (conflict) {
+		if (conflict->desc == IORES_DESC_DEVICE_MEMORY_UNADDRESSABLE) {
+			pr_debug("Device unaddressable memory block "
+				 "memory hotplug at %#010llx !\n",
+				 (unsigned long long)start);
+		}
 		pr_debug("System RAM resource %pR cannot be added\n", res);
 		kfree(res);
 		return ERR_PTR(-EEXIST);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 3e1a901..ccb45cd 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -126,6 +126,20 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 
 				pages++;
 			}
+
+			if (is_write_device_entry(entry)) {
+				pte_t newpte;
+
+				/*
+				 * We do not preserve soft-dirtiness. See
+				 * copy_one_pte() for explanation.
+				 */
+				make_device_entry_read(&entry);
+				newpte = swp_entry_to_pte(entry);
+				set_pte_at(mm, addr, pte, newpte);
+
+				pages++;
+			}
 		}
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	arch_leave_lazy_mmu_mode();
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [HMM 04/16] mm/ZONE_DEVICE/x86: add support for un-addressable device memory
  2017-04-05 20:40 ` Jérôme Glisse
@ 2017-04-05 20:40   ` Jérôme Glisse
  -1 siblings, 0 replies; 81+ messages in thread
From: Jérôme Glisse @ 2017-04-05 20:40 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, Naoya Horiguchi, David Nellans,
	Jérôme Glisse, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin

Not much is needed here: just skip populating the kernel linear mapping for
ranges of un-addressable device memory (the range is picked so that no
physical memory resource overlaps it). All the logic is in shared mm code.

Only x86-64 is supported, as this feature does not make much sense with the
constrained virtual address space of 32-bit architectures.
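
How a driver finds such a non-overlapping range is outside the scope of this
patch. As a rough illustration only (an assumption, not an API added here),
a driver could check that a candidate range does not intersect any known
iomem resource before registering it:

	/* candidate "fake" physical range for un-addressable device memory */
	if (region_intersects(addr, size, IORESOURCE_MEM, IORES_DESC_NONE)
			!= REGION_DISJOINT)
		return -EBUSY;	/* overlaps a real resource, pick another */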

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---
 arch/x86/mm/init_64.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 6c0b24e..b635636 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -658,6 +658,7 @@ int arch_add_memory(int nid, u64 start, u64 size, enum memory_type type)
 	case MEMORY_NORMAL:
 		break;
 	case MEMORY_DEVICE_PERSISTENT:
+	case MEMORY_DEVICE_UNADDRESSABLE:
 		for_device = true;
 		break;
 	default:
@@ -668,7 +669,17 @@ int arch_add_memory(int nid, u64 start, u64 size, enum memory_type type)
 	zone = pgdat->node_zones +
 		zone_for_memory(nid, start, size, ZONE_NORMAL, for_device);
 
-	init_memory_mapping(start, start + size);
+	/*
+	 * We get un-addressable memory when some one is adding a ZONE_DEVICE
+	 * to have struct page for a device memory which is not accessible by
+	 * the CPU so it is pointless to have a linear kernel mapping of such
+	 * memory.
+	 *
+	 * Core mm should make sure it never set a pte pointing to such fake
+	 * physical range.
+	 */
+	if (type != MEMORY_DEVICE_UNADDRESSABLE)
+		init_memory_mapping(start, start + size);
 
 	ret = __add_pages(nid, zone, start_pfn, nr_pages);
 	WARN_ON_ONCE(ret);
@@ -980,6 +991,7 @@ int __ref arch_remove_memory(u64 start, u64 size, enum memory_type type)
 	switch (type) {
 	case MEMORY_NORMAL:
 	case MEMORY_DEVICE_PERSISTENT:
+	case MEMORY_DEVICE_UNADDRESSABLE:
 		break;
 	default:
 		pr_err("hotplug unsupported memory type %d\n", type);
@@ -993,7 +1005,9 @@ int __ref arch_remove_memory(u64 start, u64 size, enum memory_type type)
 	zone = page_zone(page);
 	ret = __remove_pages(zone, start_pfn, nr_pages);
 	WARN_ON_ONCE(ret);
-	kernel_physical_mapping_remove(start, start + size);
+
+	if (type != MEMORY_DEVICE_UNADDRESSABLE)
+		kernel_physical_mapping_remove(start, start + size);
 
 	return ret;
 }
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [HMM 05/16] mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY
  2017-04-05 20:40 ` Jérôme Glisse
@ 2017-04-05 20:40   ` Jérôme Glisse
  -1 siblings, 0 replies; 81+ messages in thread
From: Jérôme Glisse @ 2017-04-05 20:40 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, Naoya Horiguchi, David Nellans,
	Jérôme Glisse

Introduce a new migration mode that allows offloading the copy to
a device DMA engine. This changes the migration workflow, and not
every address_space migratepage callback can support it, so the
mode has to be checked for and rejected in those cases.

This is intended to be used by migrate_vma(), which itself is used
for things like HMM (see include/linux/hmm.h).
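
As a sketch of the pattern this patch applies to the migratepage
callbacks below (the callback name is hypothetical, the two helpers are
the ones touched by this patch):

	static int example_migratepage(struct address_space *mapping,
				       struct page *newpage, struct page *page,
				       enum migrate_mode mode)
	{
		/* ... move the mapping first, as migrate_page() does ... */

		if (mode != MIGRATE_SYNC_NO_COPY)
			/* CPU copies both the data and the page state */
			migrate_page_copy(newpage, page);
		else
			/* state only, the data copy is left to the caller (DMA) */
			migrate_page_states(newpage, page);

		return MIGRATEPAGE_SUCCESS;
	}

Callbacks that cannot follow this workflow (aio, balloon and zsmalloc
below) simply return -EINVAL when mode == MIGRATE_SYNC_NO_COPY.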

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 fs/aio.c                     |  8 +++++++
 fs/f2fs/data.c               |  5 ++++-
 fs/hugetlbfs/inode.c         |  5 ++++-
 fs/ubifs/file.c              |  5 ++++-
 include/linux/migrate.h      |  5 +++++
 include/linux/migrate_mode.h |  5 +++++
 mm/balloon_compaction.c      |  8 +++++++
 mm/migrate.c                 | 52 ++++++++++++++++++++++++++++++++++----------
 mm/zsmalloc.c                |  8 +++++++
 9 files changed, 86 insertions(+), 15 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 7e2ab9c..be21c49 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -373,6 +373,14 @@ static int aio_migratepage(struct address_space *mapping, struct page *new,
 	pgoff_t idx;
 	int rc;
 
+	/*
+	 * We cannot support the _NO_COPY case here, because copy needs to
+	 * happen under the ctx->completion_lock. That does not work with the
+	 * migration workflow of MIGRATE_SYNC_NO_COPY.
+	 */
+	if (mode == MIGRATE_SYNC_NO_COPY)
+		return -EINVAL;
+
 	rc = 0;
 
 	/* mapping->private_lock here protects against the kioctx teardown.  */
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 9ac2625..7fc08a5 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -1997,7 +1997,10 @@ int f2fs_migrate_page(struct address_space *mapping,
 		SetPagePrivate(newpage);
 	set_page_private(newpage, page_private(page));
 
-	migrate_page_copy(newpage, page);
+	if (mode != MIGRATE_SYNC_NO_COPY)
+		migrate_page_copy(newpage, page);
+	else
+		migrate_page_states(newpage, page);
 
 	return MIGRATEPAGE_SUCCESS;
 }
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index cf3669d..b2e0fdb 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -837,7 +837,10 @@ static int hugetlbfs_migrate_page(struct address_space *mapping,
 	rc = migrate_huge_page_move_mapping(mapping, newpage, page);
 	if (rc != MIGRATEPAGE_SUCCESS)
 		return rc;
-	migrate_page_copy(newpage, page);
+	if (mode != MIGRATE_SYNC_NO_COPY)
+		migrate_page_copy(newpage, page);
+	else
+		migrate_page_states(newpage, page);
 
 	return MIGRATEPAGE_SUCCESS;
 }
diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
index d9ae86f..c08cbcc 100644
--- a/fs/ubifs/file.c
+++ b/fs/ubifs/file.c
@@ -1482,7 +1482,10 @@ static int ubifs_migrate_page(struct address_space *mapping,
 		SetPagePrivate(newpage);
 	}
 
-	migrate_page_copy(newpage, page);
+	if (mode != MIGRATE_SYNC_NO_COPY)
+		migrate_page_copy(newpage, page);
+	else
+		migrate_page_states(newpage, page);
 	return MIGRATEPAGE_SUCCESS;
 }
 #endif
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 48e2484..78a0fdc 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -43,6 +43,7 @@ extern void putback_movable_page(struct page *page);
 
 extern int migrate_prep(void);
 extern int migrate_prep_local(void);
+extern void migrate_page_states(struct page *newpage, struct page *page);
 extern void migrate_page_copy(struct page *newpage, struct page *page);
 extern int migrate_huge_page_move_mapping(struct address_space *mapping,
 				  struct page *newpage, struct page *page);
@@ -63,6 +64,10 @@ static inline int isolate_movable_page(struct page *page, isolate_mode_t mode)
 static inline int migrate_prep(void) { return -ENOSYS; }
 static inline int migrate_prep_local(void) { return -ENOSYS; }
 
+static inline void migrate_page_states(struct page *newpage, struct page *page)
+{
+}
+
 static inline void migrate_page_copy(struct page *newpage,
 				     struct page *page) {}
 
diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h
index ebf3d89..bdf66af 100644
--- a/include/linux/migrate_mode.h
+++ b/include/linux/migrate_mode.h
@@ -6,11 +6,16 @@
  *	on most operations but not ->writepage as the potential stall time
  *	is too significant
  * MIGRATE_SYNC will block when migrating pages
+ * MIGRATE_SYNC_NO_COPY will block when migrating pages but will not copy pages
+ *	with the CPU. Instead, page copy happens outside the migratepage()
+ *	callback and is likely using a DMA engine. See migrate_vma() and HMM
+ *	(mm/hmm.c) for users of this mode.
  */
 enum migrate_mode {
 	MIGRATE_ASYNC,
 	MIGRATE_SYNC_LIGHT,
 	MIGRATE_SYNC,
+	MIGRATE_SYNC_NO_COPY,
 };
 
 #endif		/* MIGRATE_MODE_H_INCLUDED */
diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
index da91df5..145b903 100644
--- a/mm/balloon_compaction.c
+++ b/mm/balloon_compaction.c
@@ -139,6 +139,14 @@ int balloon_page_migrate(struct address_space *mapping,
 {
 	struct balloon_dev_info *balloon = balloon_page_device(page);
 
+	/*
+	 * We cannot easily support the no-copy case here, so reject it; it is
+	 * unlikely to be used with balloon pages. See include/linux/hmm.h for
+	 * users of the MIGRATE_SYNC_NO_COPY mode.
+	 */
+	if (mode == MIGRATE_SYNC_NO_COPY)
+		return -EINVAL;
+
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 5cfe3c2..5176772 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -601,15 +601,10 @@ static void copy_huge_page(struct page *dst, struct page *src)
 /*
  * Copy the page to its new location
  */
-void migrate_page_copy(struct page *newpage, struct page *page)
+void migrate_page_states(struct page *newpage, struct page *page)
 {
 	int cpupid;
 
-	if (PageHuge(page) || PageTransHuge(page))
-		copy_huge_page(newpage, page);
-	else
-		copy_highpage(newpage, page);
-
 	if (PageError(page))
 		SetPageError(newpage);
 	if (PageReferenced(page))
@@ -663,6 +658,17 @@ void migrate_page_copy(struct page *newpage, struct page *page)
 
 	mem_cgroup_migrate(page, newpage);
 }
+EXPORT_SYMBOL(migrate_page_states);
+
+void migrate_page_copy(struct page *newpage, struct page *page)
+{
+	if (PageHuge(page) || PageTransHuge(page))
+		copy_huge_page(newpage, page);
+	else
+		copy_highpage(newpage, page);
+
+	migrate_page_states(newpage, page);
+}
 EXPORT_SYMBOL(migrate_page_copy);
 
 /************************************************************
@@ -688,7 +694,10 @@ int migrate_page(struct address_space *mapping,
 	if (rc != MIGRATEPAGE_SUCCESS)
 		return rc;
 
-	migrate_page_copy(newpage, page);
+	if (mode != MIGRATE_SYNC_NO_COPY)
+		migrate_page_copy(newpage, page);
+	else
+		migrate_page_states(newpage, page);
 	return MIGRATEPAGE_SUCCESS;
 }
 EXPORT_SYMBOL(migrate_page);
@@ -738,12 +747,15 @@ int buffer_migrate_page(struct address_space *mapping,
 
 	SetPagePrivate(newpage);
 
-	migrate_page_copy(newpage, page);
+	if (mode != MIGRATE_SYNC_NO_COPY)
+		migrate_page_copy(newpage, page);
+	else
+		migrate_page_states(newpage, page);
 
 	bh = head;
 	do {
 		unlock_buffer(bh);
- 		put_bh(bh);
+		put_bh(bh);
 		bh = bh->b_this_page;
 
 	} while (bh != head);
@@ -802,8 +814,13 @@ static int fallback_migrate_page(struct address_space *mapping,
 {
 	if (PageDirty(page)) {
 		/* Only writeback pages in full synchronous migration */
-		if (mode != MIGRATE_SYNC)
+		switch (mode) {
+		case MIGRATE_SYNC:
+		case MIGRATE_SYNC_NO_COPY:
+			break;
+		default:
 			return -EBUSY;
+		}
 		return writeout(mapping, page);
 	}
 
@@ -940,7 +957,11 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 		 * the retry loop is too short and in the sync-light case,
 		 * the overhead of stalling is too much
 		 */
-		if (mode != MIGRATE_SYNC) {
+		switch (mode) {
+		case MIGRATE_SYNC:
+		case MIGRATE_SYNC_NO_COPY:
+			break;
+		default:
 			rc = -EBUSY;
 			goto out_unlock;
 		}
@@ -1210,8 +1231,15 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 		return -ENOMEM;
 
 	if (!trylock_page(hpage)) {
-		if (!force || mode != MIGRATE_SYNC)
+		if (!force)
 			goto out;
+		switch (mode) {
+		case MIGRATE_SYNC:
+		case MIGRATE_SYNC_NO_COPY:
+			break;
+		default:
+			goto out;
+		}
 		lock_page(hpage);
 	}
 
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index b7b1fb6..37afd65 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -1982,6 +1982,14 @@ int zs_page_migrate(struct address_space *mapping, struct page *newpage,
 	unsigned int obj_idx;
 	int ret = -EAGAIN;
 
+	/*
+	 * We cannot support the _NO_COPY case here, because copy needs to
+	 * happen under the zs lock, which does not work with
+	 * MIGRATE_SYNC_NO_COPY workflow.
+	 */
+	if (mode == MIGRATE_SYNC_NO_COPY)
+		return -EINVAL;
+
 	VM_BUG_ON_PAGE(!PageMovable(page), page);
 	VM_BUG_ON_PAGE(!PageIsolated(page), page);
 
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [HMM 06/16] mm/migrate: new memory migration helper for use with device memory v4
  2017-04-05 20:40 ` Jérôme Glisse
@ 2017-04-05 20:40   ` Jérôme Glisse
  -1 siblings, 0 replies; 81+ messages in thread
From: Jérôme Glisse @ 2017-04-05 20:40 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, Naoya Horiguchi, David Nellans,
	Jérôme Glisse, Evgeny Baskakov, Mark Hairgrove,
	Sherry Cheung, Subhash Gutti

This patch adds a new memory migration helper, which migrates the memory
backing a range of virtual addresses of a process to different memory
(which can be allocated through a special allocator). It differs from
NUMA migration by working on a range of virtual addresses, and thus by
doing the migration in chunks that can be large enough to use a DMA
engine or a special copy-offload engine.

Expected users are anyone with heterogeneous memory where different
memories have different characteristics (latency, bandwidth, ...). As
an example, IBM platforms with a CAPI bus can make use of this feature
to migrate between regular memory and CAPI device memory. New CPU
architectures with a pool of high-performance memory that is not managed
as a cache but presented as regular memory (while being faster and having
lower latency than DDR) will also be prime users of this patch.

Migration to private device memory will be useful for devices that have
a large pool of such memory, like GPUs; NVidia plans to use HMM for that.
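
As a driver-side sketch, modeled on the migrate_vma_ops documentation
added below and not part of this patch: example_alloc_device_page(),
example_device_copy() and drv are placeholders, not real APIs.

	static void example_alloc_and_copy(struct vm_area_struct *vma,
					   const unsigned long *src,
					   unsigned long *dst,
					   unsigned long start,
					   unsigned long end,
					   void *private)
	{
		unsigned long i, npages = (end - start) >> PAGE_SHIFT;

		for (i = 0; i < npages; i++) {
			struct page *spage = migrate_pfn_to_page(src[i]);
			struct page *dpage;

			if (!spage || !(src[i] & MIGRATE_PFN_MIGRATE))
				continue;
			dpage = example_alloc_device_page(private);	/* placeholder */
			if (!dpage)
				continue;
			lock_page(dpage);
			example_device_copy(dpage, spage, private);	/* placeholder */
			/* migrate_pfn() already sets MIGRATE_PFN_VALID */
			dst[i] = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
		}
	}

	static void example_finalize_and_map(struct vm_area_struct *vma,
					     const unsigned long *src,
					     const unsigned long *dst,
					     unsigned long start,
					     unsigned long end,
					     void *private)
	{
		/*
		 * Entries with MIGRATE_PFN_MIGRATE still set in src[] did
		 * migrate; this is the place to update the device page table.
		 */
	}

	static const struct migrate_vma_ops example_ops = {
		.alloc_and_copy		= example_alloc_and_copy,
		.finalize_and_map	= example_finalize_and_map,
	};

	/*
	 * With mmap_sem held for read, and src/dst each able to hold
	 * (end - start) >> PAGE_SHIFT entries:
	 */
	int ret = migrate_vma(&example_ops, vma, start, end, src, dst, drv);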

Changes since v3:
  - Rebase

Changes since v2:
  - dropped the HMM prefix and HMM-specific code

Changes since v1:
  - typo fixes
  - split out the early unmap optimization for pages with a single mapping

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/migrate.h | 104 ++++++++++++
 mm/migrate.c            | 444 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 548 insertions(+)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 78a0fdc..576b3f5 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -127,4 +127,108 @@ static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 }
 #endif /* CONFIG_NUMA_BALANCING && CONFIG_TRANSPARENT_HUGEPAGE*/
 
+
+#ifdef CONFIG_MIGRATION
+
+#define MIGRATE_PFN_VALID	(1UL << 0)
+#define MIGRATE_PFN_MIGRATE	(1UL << 1)
+#define MIGRATE_PFN_LOCKED	(1UL << 2)
+#define MIGRATE_PFN_WRITE	(1UL << 3)
+#define MIGRATE_PFN_ERROR	(1UL << 4)
+#define MIGRATE_PFN_SHIFT	5
+
+static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
+{
+	if (!(mpfn & MIGRATE_PFN_VALID))
+		return NULL;
+	return pfn_to_page(mpfn >> MIGRATE_PFN_SHIFT);
+}
+
+static inline unsigned long migrate_pfn(unsigned long pfn)
+{
+	return (pfn << MIGRATE_PFN_SHIFT) | MIGRATE_PFN_VALID;
+}
+
+/*
+ * struct migrate_vma_ops - migrate operation callback
+ *
+ * @alloc_and_copy: alloc destination memory and copy source memory to it
+ * @finalize_and_map: allow caller to map the successfully migrated pages
+ *
+ *
+ * The alloc_and_copy() callback happens once all source pages have been locked,
+ * unmapped and checked (checked whether pinned or not). All pages that can be
+ * migrated will have an entry in the src array set with the pfn value of the
+ * page and with the MIGRATE_PFN_VALID and MIGRATE_PFN_MIGRATE flag set (other
+ * flags might be set but should be ignored by the callback).
+ *
+ * The alloc_and_copy() callback can then allocate destination memory and copy
+ * source memory to it for all those entries (ie with MIGRATE_PFN_VALID and
+ * MIGRATE_PFN_MIGRATE flag set). Once these are allocated and copied, the
+ * callback must update each corresponding entry in the dst array with the pfn
+ * value of the destination page and with the MIGRATE_PFN_VALID and
+ * MIGRATE_PFN_LOCKED flags set (destination pages must have their struct pages
+ * locked, via lock_page()).
+ *
+ * At this point the alloc_and_copy() callback is done and returns.
+ *
+ * Note that the callback does not have to migrate all the pages that are
+ * marked with MIGRATE_PFN_MIGRATE flag in src array unless this is a migration
+ * from device memory to system memory (ie the MIGRATE_PFN_DEVICE flag is also
+ * set in the src array entry). If the device driver cannot migrate a device
+ * page back to system memory, then it must set the corresponding dst array
+ * entry to MIGRATE_PFN_ERROR. This will trigger a SIGBUS if CPU tries to
+ * access any of the virtual addresses originally backed by this page. Because
+ * a SIGBUS is such a severe result for the userspace process, the device
+ * driver should avoid setting MIGRATE_PFN_ERROR unless it is really in an
+ * unrecoverable state.
+ *
+ * THE alloc_and_copy() CALLBACK MUST NOT CHANGE ANY OF THE SRC ARRAY ENTRIES
+ * OR BAD THINGS WILL HAPPEN !
+ *
+ *
+ * The finalize_and_map() callback happens after struct page migration from
+ * source to destination (destination struct pages are the struct pages for the
+ * memory allocated by the alloc_and_copy() callback).  Migration can fail, and
+ * thus the finalize_and_map() allows the driver to inspect which pages were
+ * successfully migrated, and which were not. Successfully migrated pages will
+ * have the MIGRATE_PFN_MIGRATE flag set for their src array entry.
+ *
+ * It is safe to update device page table from within the finalize_and_map()
+ * callback because both destination and source page are still locked, and the
+ * mmap_sem is held in read mode (hence no one can unmap the range being
+ * migrated).
+ *
+ * Once callback is done cleaning up things and updating its page table (if it
+ * chose to do so, this is not an obligation) then it returns. At this point,
+ * the HMM core will finish up the final steps, and the migration is complete.
+ *
+ * THE finalize_and_map() CALLBACK MUST NOT CHANGE ANY OF THE SRC OR DST ARRAY
+ * ENTRIES OR BAD THINGS WILL HAPPEN !
+ */
+struct migrate_vma_ops {
+	void (*alloc_and_copy)(struct vm_area_struct *vma,
+			       const unsigned long *src,
+			       unsigned long *dst,
+			       unsigned long start,
+			       unsigned long end,
+			       void *private);
+	void (*finalize_and_map)(struct vm_area_struct *vma,
+				 const unsigned long *src,
+				 const unsigned long *dst,
+				 unsigned long start,
+				 unsigned long end,
+				 void *private);
+};
+
+int migrate_vma(const struct migrate_vma_ops *ops,
+		struct vm_area_struct *vma,
+		unsigned long start,
+		unsigned long end,
+		unsigned long *src,
+		unsigned long *dst,
+		void *private);
+
+#endif /* CONFIG_MIGRATION */
+
 #endif /* _LINUX_MIGRATE_H */
diff --git a/mm/migrate.c b/mm/migrate.c
index 5176772..b2ce541 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -395,6 +395,14 @@ int migrate_page_move_mapping(struct address_space *mapping,
 	int expected_count = 1 + extra_count;
 	void **pslot;
 
+	/*
+	 * ZONE_DEVICE pages have 1 refcount always held by their device
+	 *
+	 * Note that DAX memory will never reach that point as it does not have
+	 * the MEMORY_DEVICE_ALLOW_MIGRATE flag set (see memory_hotplug.h).
+	 */
+	expected_count += is_zone_device_page(page);
+
 	if (!mapping) {
 		/* Anonymous page without mapping */
 		if (page_count(page) != expected_count)
@@ -2075,3 +2083,439 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 #endif /* CONFIG_NUMA_BALANCING */
 
 #endif /* CONFIG_NUMA */
+
+
+struct migrate_vma {
+	struct vm_area_struct	*vma;
+	unsigned long		*dst;
+	unsigned long		*src;
+	unsigned long		cpages;
+	unsigned long		npages;
+	unsigned long		start;
+	unsigned long		end;
+};
+
+static int migrate_vma_collect_hole(unsigned long start,
+				    unsigned long end,
+				    struct mm_walk *walk)
+{
+	struct migrate_vma *migrate = walk->private;
+	unsigned long addr, next;
+
+	for (addr = start & PAGE_MASK; addr < end; addr += PAGE_SIZE) {
+		migrate->dst[migrate->npages] = 0;
+		migrate->src[migrate->npages++] = 0;
+	}
+
+	return 0;
+}
+
+static int migrate_vma_collect_pmd(pmd_t *pmdp,
+				   unsigned long start,
+				   unsigned long end,
+				   struct mm_walk *walk)
+{
+	struct migrate_vma *migrate = walk->private;
+	struct mm_struct *mm = walk->vma->vm_mm;
+	unsigned long addr = start;
+	spinlock_t *ptl;
+	pte_t *ptep;
+
+	if (pmd_none(*pmdp) || pmd_trans_unstable(pmdp)) {
+		/* FIXME support THP */
+		return migrate_vma_collect_hole(start, end, walk);
+	}
+
+	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+	for (; addr < end; addr += PAGE_SIZE, ptep++) {
+		unsigned long mpfn, pfn;
+		struct page *page;
+		pte_t pte;
+
+		pte = *ptep;
+		pfn = pte_pfn(pte);
+
+		if (!pte_present(pte)) {
+			mpfn = pfn = 0;
+			goto next;
+		}
+
+		/* FIXME support THP */
+		page = vm_normal_page(migrate->vma, addr, pte);
+		if (!page || !page->mapping || PageTransCompound(page)) {
+			mpfn = pfn = 0;
+			goto next;
+		}
+
+		/*
+		 * By getting a reference on the page we pin it and that blocks
+		 * any kind of migration. Side effect is that it "freezes" the
+		 * pte.
+		 *
+		 * We drop this reference after isolating the page from the lru
+		 * for non device page (device page are not on the lru and thus
+		 * can't be dropped from it).
+		 */
+		get_page(page);
+		migrate->cpages++;
+		mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
+		mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
+
+next:
+		migrate->src[migrate->npages++] = mpfn;
+	}
+	pte_unmap_unlock(ptep - 1, ptl);
+
+	return 0;
+}
+
+/*
+ * migrate_vma_collect() - collect pages over a range of virtual addresses
+ * @migrate: migrate struct containing all migration information
+ *
+ * This will walk the CPU page table. For each virtual address backed by a
+ * valid page, it updates the src array and takes a reference on the page, in
+ * order to pin the page until we lock it and unmap it.
+ */
+static void migrate_vma_collect(struct migrate_vma *migrate)
+{
+	struct mm_walk mm_walk;
+
+	mm_walk.pmd_entry = migrate_vma_collect_pmd;
+	mm_walk.pte_entry = NULL;
+	mm_walk.pte_hole = migrate_vma_collect_hole;
+	mm_walk.hugetlb_entry = NULL;
+	mm_walk.test_walk = NULL;
+	mm_walk.vma = migrate->vma;
+	mm_walk.mm = migrate->vma->vm_mm;
+	mm_walk.private = migrate;
+
+	walk_page_range(migrate->start, migrate->end, &mm_walk);
+
+	migrate->end = migrate->start + (migrate->npages << PAGE_SHIFT);
+}
+
+/*
+ * migrate_vma_check_page() - check if page is pinned or not
+ * @page: struct page to check
+ *
+ * Pinned pages cannot be migrated. This is the same test as in
+ * migrate_page_move_mapping(), except that here we allow migration of a
+ * ZONE_DEVICE page.
+ */
+static bool migrate_vma_check_page(struct page *page)
+{
+	/*
+	 * One extra ref because caller holds an extra reference, either from
+	 * isolate_lru_page() for a regular page, or migrate_vma_collect() for
+	 * a device page.
+	 */
+	int extra = 1;
+
+	/*
+	 * FIXME support THP (transparent huge page), it is bit more complex to
+	 * check them than regular pages, because they can be mapped with a pmd
+	 * or with a pte (split pte mapping).
+	 */
+	if (PageCompound(page))
+		return false;
+
+	if ((page_count(page) - extra) > page_mapcount(page))
+		return false;
+
+	return true;
+}
+
+/*
+ * migrate_vma_prepare() - lock pages and isolate them from the lru
+ * @migrate: migrate struct containing all migration information
+ *
+ * This locks pages that have been collected by migrate_vma_collect(). Once each
+ * page is locked it is isolated from the lru (for non-device pages). Finally,
+ * the ref taken by migrate_vma_collect() is dropped, as locked pages cannot be
+ * migrated by concurrent kernel threads.
+ */
+static void migrate_vma_prepare(struct migrate_vma *migrate)
+{
+	const unsigned long npages = migrate->npages;
+	const unsigned long start = migrate->start;
+	unsigned long addr, i, restore = 0;
+	bool allow_drain = true;
+
+	lru_add_drain();
+
+	for (i = 0; i < npages; i++) {
+		struct page *page = migrate_pfn_to_page(migrate->src[i]);
+
+		if (!page)
+			continue;
+
+		lock_page(page);
+		migrate->src[i] |= MIGRATE_PFN_LOCKED;
+
+		if (!PageLRU(page) && allow_drain) {
+			/* Drain CPU's pagevec */
+			lru_add_drain_all();
+			allow_drain = false;
+		}
+
+		if (isolate_lru_page(page)) {
+			migrate->src[i] = 0;
+			unlock_page(page);
+			migrate->cpages--;
+			put_page(page);
+			continue;
+		}
+
+		if (!migrate_vma_check_page(page)) {
+			migrate->src[i] = 0;
+			unlock_page(page);
+			migrate->cpages--;
+
+			putback_lru_page(page);
+		}
+	}
+}
+
+/*
+ * migrate_vma_unmap() - replace page mapping with special migration pte entry
+ * @migrate: migrate struct containing all migration information
+ *
+ * Replace page mapping (CPU page table pte) with a special migration pte entry
+ * and check again if it has been pinned. Pinned pages are restored because we
+ * cannot migrate them.
+ *
+ * This is the last step before we call the device driver callback to allocate
+ * destination memory and copy contents of original page over to new page.
+ */
+static void migrate_vma_unmap(struct migrate_vma *migrate)
+{
+	int flags = TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
+	const unsigned long npages = migrate->npages;
+	const unsigned long start = migrate->start;
+	unsigned long addr, i, restore = 0;
+
+	for (i = 0; i < npages; i++) {
+		struct page *page = migrate_pfn_to_page(migrate->src[i]);
+
+		if (!page || !(migrate->src[i] & MIGRATE_PFN_MIGRATE))
+			continue;
+
+		try_to_unmap(page, flags);
+		if (page_mapped(page) || !migrate_vma_check_page(page)) {
+			migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+			migrate->cpages--;
+			restore++;
+		}
+	}
+
+	for (addr = start, i = 0; i < npages && restore; addr += PAGE_SIZE, i++) {
+		struct page *page = migrate_pfn_to_page(migrate->src[i]);
+
+		if (!page || (migrate->src[i] & MIGRATE_PFN_MIGRATE))
+			continue;
+
+		remove_migration_ptes(page, page, false);
+
+		migrate->src[i] = 0;
+		unlock_page(page);
+		restore--;
+
+		putback_lru_page(page);
+	}
+}
+
+/*
+ * migrate_vma_pages() - migrate meta-data from src page to dst page
+ * @migrate: migrate struct containing all migration information
+ *
+ * This migrates struct page meta-data from source struct page to destination
+ * struct page. This effectively finishes the migration from source page to the
+ * destination page.
+ */
+static void migrate_vma_pages(struct migrate_vma *migrate)
+{
+	const unsigned long npages = migrate->npages;
+	const unsigned long start = migrate->start;
+	unsigned long addr, i;
+
+	for (i = 0, addr = start; i < npages; addr += PAGE_SIZE, i++) {
+		struct page *newpage = migrate_pfn_to_page(migrate->dst[i]);
+		struct page *page = migrate_pfn_to_page(migrate->src[i]);
+		struct address_space *mapping;
+		int r;
+
+		if (!page || !newpage)
+			continue;
+		if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE))
+			continue;
+
+		mapping = page_mapping(page);
+
+		r = migrate_page(mapping, newpage, page, MIGRATE_SYNC_NO_COPY);
+		if (r != MIGRATEPAGE_SUCCESS)
+			migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+	}
+}
+
+/*
+ * migrate_vma_finalize() - restore CPU page table entry
+ * @migrate: migrate struct containing all migration information
+ *
+ * This replaces the special migration pte entry with either a mapping to the
+ * new page if migration was successful for that page, or to the original page
+ * otherwise.
+ *
+ * This also unlocks the pages and puts them back on the lru, or drops the extra
+ * refcount, for device pages.
+ */
+static void migrate_vma_finalize(struct migrate_vma *migrate)
+{
+	const unsigned long npages = migrate->npages;
+	unsigned long i;
+
+	for (i = 0; i < npages; i++) {
+		struct page *newpage = migrate_pfn_to_page(migrate->dst[i]);
+		struct page *page = migrate_pfn_to_page(migrate->src[i]);
+
+		if (!page)
+			continue;
+		if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE) || !newpage) {
+			if (newpage) {
+				unlock_page(newpage);
+				put_page(newpage);
+			}
+			newpage = page;
+		}
+
+		remove_migration_ptes(page, newpage, false);
+		unlock_page(page);
+		migrate->cpages--;
+
+		putback_lru_page(page);
+
+		if (newpage != page) {
+			unlock_page(newpage);
+			putback_lru_page(newpage);
+		}
+	}
+}
+
+/*
+ * migrate_vma() - migrate a range of memory inside vma
+ *
+ * @ops: migration callback for allocating destination memory and copying
+ * @vma: virtual memory area containing the range to be migrated
+ * @start: start address of the range to migrate (inclusive)
+ * @end: end address of the range to migrate (exclusive)
+ * @src: array of unsigned long containing the source pfns
+ * @dst: array of unsigned long containing the destination pfns
+ * @private: pointer passed back to each of the callback
+ * Returns: 0 on success, error code otherwise
+ *
+ * This function tries to migrate memory backing a range of virtual addresses, using
+ * callbacks to allocate and copy memory from source to destination. First it
+ * collects all the pages backing each virtual address in the range, saving this
+ * inside the src array. Then it locks those pages and unmaps them. Once the pages
+ * are locked and unmapped, it checks whether each page is pinned or not. Pages
+ * that aren't pinned have the MIGRATE_PFN_MIGRATE flag set (by this function)
+ * in the corresponding src array entry. It then restores any pages that are
+ * pinned, by remapping and unlocking those pages.
+ *
+ * At this point it calls the alloc_and_copy() callback. For documentation on
+ * what is expected from that callback, see struct migrate_vma_ops comments in
+ * include/linux/migrate.h
+ *
+ * After the alloc_and_copy() callback, this function goes over each entry in
+ * the src array that has the MIGRATE_PFN_VALID and MIGRATE_PFN_MIGRATE flag
+ * set. If the corresponding entry in dst array has MIGRATE_PFN_VALID flag set,
+ * then the function tries to migrate struct page information from the source
+ * struct page to the destination struct page. If it fails to migrate the struct
+ * page information, then it clears the MIGRATE_PFN_MIGRATE flag in the src
+ * array.
+ *
+ * At this point all successfully migrated pages have an entry in the src
+ * array with MIGRATE_PFN_VALID and MIGRATE_PFN_MIGRATE flag set and the dst
+ * array entry with MIGRATE_PFN_VALID flag set.
+ *
+ * It then calls the finalize_and_map() callback. See comments for "struct
+ * migrate_vma_ops", in include/linux/migrate.h for details about
+ * finalize_and_map() behavior.
+ *
+ * After the finalize_and_map() callback, for successfully migrated pages, this
+ * function updates the CPU page table to point to new pages, otherwise it
+ * restores the CPU page table to point to the original source pages.
+ *
+ * Function returns 0 after the above steps, even if no pages were migrated
+ * (The function only returns an error if any of the arguments are invalid.)
+ *
+ * Both src and dst array must be big enough for (end - start) >> PAGE_SHIFT
+ * unsigned long entries.
+ */
+int migrate_vma(const struct migrate_vma_ops *ops,
+		struct vm_area_struct *vma,
+		unsigned long start,
+		unsigned long end,
+		unsigned long *src,
+		unsigned long *dst,
+		void *private)
+{
+	struct migrate_vma migrate;
+
+	/* Sanity check the arguments */
+	start &= PAGE_MASK;
+	end &= PAGE_MASK;
+	if (!vma || !ops || !src || !dst || start >= end)
+		return -EINVAL;
+	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL))
+		return -EINVAL;
+	if (start < vma->vm_start || start >= vma->vm_end)
+		return -EINVAL;
+	if (end <= vma->vm_start || end > vma->vm_end)
+		return -EINVAL;
+
+	memset(src, 0, sizeof(*src) * ((end - start) >> PAGE_SHIFT));
+	migrate.src = src;
+	migrate.dst = dst;
+	migrate.start = start;
+	migrate.npages = 0;
+	migrate.cpages = 0;
+	migrate.end = end;
+	migrate.vma = vma;
+
+	/* Collect, and try to unmap source pages */
+	migrate_vma_collect(&migrate);
+	if (!migrate.cpages)
+		return 0;
+
+	/* Lock and isolate page */
+	migrate_vma_prepare(&migrate);
+	if (!migrate.cpages)
+		return 0;
+
+	/* Unmap pages */
+	migrate_vma_unmap(&migrate);
+	if (!migrate.cpages)
+		return 0;
+
+	/*
+	 * At this point pages are locked and unmapped, and thus they have
+	 * stable content and can safely be copied to destination memory that
+	 * is allocated by the callback.
+	 *
+	 * Note that migration can fail in migrate_vma_struct_page() for each
+	 * individual page.
+	 */
+	ops->alloc_and_copy(vma, src, dst, start, end, private);
+
+	/* This does the real migration of struct page */
+	migrate_vma_pages(&migrate);
+
+	ops->finalize_and_map(vma, src, dst, start, end, private);
+
+	/* Unlock and remap pages */
+	migrate_vma_finalize(&migrate);
+
+	return 0;
+}
+EXPORT_SYMBOL(migrate_vma);
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [HMM 06/16] mm/migrate: new memory migration helper for use with device memory v4
@ 2017-04-05 20:40   ` Jérôme Glisse
  0 siblings, 0 replies; 81+ messages in thread
From: Jérôme Glisse @ 2017-04-05 20:40 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, Naoya Horiguchi, David Nellans,
	Jérôme Glisse, Evgeny Baskakov, Mark Hairgrove,
	Sherry Cheung, Subhash Gutti

This patch adds a new memory migration helper, which migrates the memory
backing a range of virtual addresses of a process to different memory
(which can be allocated through a special allocator). It differs from
NUMA migration by working on a range of virtual addresses, and thus by
doing the migration in chunks that can be large enough to use a DMA
engine or a special copy-offload engine.

Expected users are anyone with heterogeneous memory where different
memories have different characteristics (latency, bandwidth, ...). As
an example, IBM platforms with a CAPI bus can make use of this feature
to migrate between regular memory and CAPI device memory. New CPU
architectures with a pool of high-performance memory that is not managed
as a cache but presented as regular memory (while being faster and having
lower latency than DDR) will also be prime users of this patch.

Migration to private device memory will be useful for devices that have
a large pool of such memory, like GPUs; NVidia plans to use HMM for that.

Changes since v3:
  - Rebase

Changes since v2:
  - dropped the HMM prefix and HMM-specific code

Changes since v1:
  - typo fixes
  - split out the early unmap optimization for pages with a single mapping

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/migrate.h | 104 ++++++++++++
 mm/migrate.c            | 444 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 548 insertions(+)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 78a0fdc..576b3f5 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -127,4 +127,108 @@ static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 }
 #endif /* CONFIG_NUMA_BALANCING && CONFIG_TRANSPARENT_HUGEPAGE*/
 
+
+#ifdef CONFIG_MIGRATION
+
+#define MIGRATE_PFN_VALID	(1UL << 0)
+#define MIGRATE_PFN_MIGRATE	(1UL << 1)
+#define MIGRATE_PFN_LOCKED	(1UL << 2)
+#define MIGRATE_PFN_WRITE	(1UL << 3)
+#define MIGRATE_PFN_ERROR	(1UL << 4)
+#define MIGRATE_PFN_SHIFT	5
+
+static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
+{
+	if (!(mpfn & MIGRATE_PFN_VALID))
+		return NULL;
+	return pfn_to_page(mpfn >> MIGRATE_PFN_SHIFT);
+}
+
+static inline unsigned long migrate_pfn(unsigned long pfn)
+{
+	return (pfn << MIGRATE_PFN_SHIFT) | MIGRATE_PFN_VALID;
+}
+
+/*
+ * struct migrate_vma_ops - migrate operation callback
+ *
+ * @alloc_and_copy: alloc destination memory and copy source memory to it
+ * @finalize_and_map: allow caller to map the successfully migrated pages
+ *
+ *
+ * The alloc_and_copy() callback happens once all source pages have been locked,
+ * unmapped and checked (checked whether pinned or not). All pages that can be
+ * migrated will have an entry in the src array set with the pfn value of the
+ * page and with the MIGRATE_PFN_VALID and MIGRATE_PFN_MIGRATE flag set (other
+ * flags might be set but should be ignored by the callback).
+ *
+ * The alloc_and_copy() callback can then allocate destination memory and copy
+ * source memory to it for all those entries (ie with MIGRATE_PFN_VALID and
+ * MIGRATE_PFN_MIGRATE flag set). Once these are allocated and copied, the
+ * callback must update each corresponding entry in the dst array with the pfn
+ * value of the destination page and with the MIGRATE_PFN_VALID and
+ * MIGRATE_PFN_LOCKED flags set (destination pages must have their struct pages
+ * locked, via lock_page()).
+ *
+ * At this point the alloc_and_copy() callback is done and returns.
+ *
+ * Note that the callback does not have to migrate all the pages that are
+ * marked with MIGRATE_PFN_MIGRATE flag in src array unless this is a migration
+ * from device memory to system memory (ie the MIGRATE_PFN_DEVICE flag is also
+ * set in the src array entry). If the device driver cannot migrate a device
+ * page back to system memory, then it must set the corresponding dst array
+ * entry to MIGRATE_PFN_ERROR. This will trigger a SIGBUS if CPU tries to
+ * access any of the virtual addresses originally backed by this page. Because
+ * a SIGBUS is such a severe result for the userspace process, the device
+ * driver should avoid setting MIGRATE_PFN_ERROR unless it is really in an
+ * unrecoverable state.
+ *
+ * THE alloc_and_copy() CALLBACK MUST NOT CHANGE ANY OF THE SRC ARRAY ENTRIES
+ * OR BAD THINGS WILL HAPPEN !
+ *
+ *
+ * The finalize_and_map() callback happens after struct page migration from
+ * source to destination (destination struct pages are the struct pages for the
+ * memory allocated by the alloc_and_copy() callback).  Migration can fail, and
+ * thus the finalize_and_map() allows the driver to inspect which pages were
+ * successfully migrated, and which were not. Successfully migrated pages will
+ * have the MIGRATE_PFN_MIGRATE flag set for their src array entry.
+ *
+ * It is safe to update device page table from within the finalize_and_map()
+ * callback because both destination and source page are still locked, and the
+ * mmap_sem is held in read mode (hence no one can unmap the range being
+ * migrated).
+ *
+ * Once callback is done cleaning up things and updating its page table (if it
+ * chose to do so, this is not an obligation) then it returns. At this point,
+ * the HMM core will finish up the final steps, and the migration is complete.
+ *
+ * THE finalize_and_map() CALLBACK MUST NOT CHANGE ANY OF THE SRC OR DST ARRAY
+ * ENTRIES OR BAD THINGS WILL HAPPEN !
+ */
+struct migrate_vma_ops {
+	void (*alloc_and_copy)(struct vm_area_struct *vma,
+			       const unsigned long *src,
+			       unsigned long *dst,
+			       unsigned long start,
+			       unsigned long end,
+			       void *private);
+	void (*finalize_and_map)(struct vm_area_struct *vma,
+				 const unsigned long *src,
+				 const unsigned long *dst,
+				 unsigned long start,
+				 unsigned long end,
+				 void *private);
+};
+
+int migrate_vma(const struct migrate_vma_ops *ops,
+		struct vm_area_struct *vma,
+		unsigned long start,
+		unsigned long end,
+		unsigned long *src,
+		unsigned long *dst,
+		void *private);
+
+#endif /* CONFIG_MIGRATION */
+
 #endif /* _LINUX_MIGRATE_H */
diff --git a/mm/migrate.c b/mm/migrate.c
index 5176772..b2ce541 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -395,6 +395,14 @@ int migrate_page_move_mapping(struct address_space *mapping,
 	int expected_count = 1 + extra_count;
 	void **pslot;
 
+	/*
+	 * ZONE_DEVICE pages have 1 refcount always held by their device
+	 *
+	 * Note that DAX memory will never reach that point as it does not have
+	 * the MEMORY_DEVICE_ALLOW_MIGRATE flag set (see memory_hotplug.h).
+	 */
+	expected_count += is_zone_device_page(page);
+
 	if (!mapping) {
 		/* Anonymous page without mapping */
 		if (page_count(page) != expected_count)
@@ -2075,3 +2083,439 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 #endif /* CONFIG_NUMA_BALANCING */
 
 #endif /* CONFIG_NUMA */
+
+
+struct migrate_vma {
+	struct vm_area_struct	*vma;
+	unsigned long		*dst;
+	unsigned long		*src;
+	unsigned long		cpages;
+	unsigned long		npages;
+	unsigned long		start;
+	unsigned long		end;
+};
+
+static int migrate_vma_collect_hole(unsigned long start,
+				    unsigned long end,
+				    struct mm_walk *walk)
+{
+	struct migrate_vma *migrate = walk->private;
+	unsigned long addr, next;
+
+	for (addr = start & PAGE_MASK; addr < end; addr += PAGE_SIZE) {
+		migrate->dst[migrate->npages] = 0;
+		migrate->src[migrate->npages++] = 0;
+	}
+
+	return 0;
+}
+
+static int migrate_vma_collect_pmd(pmd_t *pmdp,
+				   unsigned long start,
+				   unsigned long end,
+				   struct mm_walk *walk)
+{
+	struct migrate_vma *migrate = walk->private;
+	struct mm_struct *mm = walk->vma->vm_mm;
+	unsigned long addr = start;
+	spinlock_t *ptl;
+	pte_t *ptep;
+
+	if (pmd_none(*pmdp) || pmd_trans_unstable(pmdp)) {
+		/* FIXME support THP */
+		return migrate_vma_collect_hole(start, end, walk);
+	}
+
+	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+	for (; addr < end; addr += PAGE_SIZE, ptep++) {
+		unsigned long mpfn, pfn;
+		struct page *page;
+		pte_t pte;
+
+		pte = *ptep;
+		pfn = pte_pfn(pte);
+
+		if (!pte_present(pte)) {
+			mpfn = pfn = 0;
+			goto next;
+		}
+
+		/* FIXME support THP */
+		page = vm_normal_page(migrate->vma, addr, pte);
+		if (!page || !page->mapping || PageTransCompound(page)) {
+			mpfn = pfn = 0;
+			goto next;
+		}
+
+		/*
+		 * By getting a reference on the page we pin it and that blocks
+		 * any kind of migration. Side effect is that it "freezes" the
+		 * pte.
+		 *
+		 * We drop this reference after isolating the page from the lru
+		 * for non device page (device page are not on the lru and thus
+		 * can't be dropped from it).
+		 */
+		get_page(page);
+		migrate->cpages++;
+		mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
+		mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
+
+next:
+		migrate->src[migrate->npages++] = mpfn;
+	}
+	pte_unmap_unlock(ptep - 1, ptl);
+
+	return 0;
+}
+
+/*
+ * migrate_vma_collect() - collect pages over a range of virtual addresses
+ * @migrate: migrate struct containing all migration information
+ *
+ * This will walk the CPU page table. For each virtual address backed by a
+ * valid page, it updates the src array and takes a reference on the page, in
+ * order to pin the page until we lock it and unmap it.
+ */
+static void migrate_vma_collect(struct migrate_vma *migrate)
+{
+	struct mm_walk mm_walk;
+
+	mm_walk.pmd_entry = migrate_vma_collect_pmd;
+	mm_walk.pte_entry = NULL;
+	mm_walk.pte_hole = migrate_vma_collect_hole;
+	mm_walk.hugetlb_entry = NULL;
+	mm_walk.test_walk = NULL;
+	mm_walk.vma = migrate->vma;
+	mm_walk.mm = migrate->vma->vm_mm;
+	mm_walk.private = migrate;
+
+	walk_page_range(migrate->start, migrate->end, &mm_walk);
+
+	migrate->end = migrate->start + (migrate->npages << PAGE_SHIFT);
+}
+
+/*
+ * migrate_vma_check_page() - check if page is pinned or not
+ * @page: struct page to check
+ *
+ * Pinned pages cannot be migrated. This is the same test as in
+ * migrate_page_move_mapping(), except that here we allow migration of a
+ * ZONE_DEVICE page.
+ */
+static bool migrate_vma_check_page(struct page *page)
+{
+	/*
+	 * One extra ref because caller holds an extra reference, either from
+	 * isolate_lru_page() for a regular page, or migrate_vma_collect() for
+	 * a device page.
+	 */
+	int extra = 1;
+
+	/*
+	 * FIXME support THP (transparent huge page), it is bit more complex to
+	 * check them than regular pages, because they can be mapped with a pmd
+	 * or with a pte (split pte mapping).
+	 */
+	if (PageCompound(page))
+		return false;
+
+	if ((page_count(page) - extra) > page_mapcount(page))
+		return false;
+
+	return true;
+}
+
+/*
+ * migrate_vma_prepare() - lock pages and isolate them from the lru
+ * @migrate: migrate struct containing all migration information
+ *
+ * This locks pages that have been collected by migrate_vma_collect(). Once each
+ * page is locked it is isolated from the lru (for non-device pages). Finally,
+ * the ref taken by migrate_vma_collect() is dropped, as locked pages cannot be
+ * migrated by concurrent kernel threads.
+ */
+static void migrate_vma_prepare(struct migrate_vma *migrate)
+{
+	const unsigned long npages = migrate->npages;
+	const unsigned long start = migrate->start;
+	unsigned long addr, i, restore = 0;
+	bool allow_drain = true;
+
+	lru_add_drain();
+
+	for (i = 0; i < npages; i++) {
+		struct page *page = migrate_pfn_to_page(migrate->src[i]);
+
+		if (!page)
+			continue;
+
+		lock_page(page);
+		migrate->src[i] |= MIGRATE_PFN_LOCKED;
+
+		if (!PageLRU(page) && allow_drain) {
+			/* Drain CPU's pagevec */
+			lru_add_drain_all();
+			allow_drain = false;
+		}
+
+		if (isolate_lru_page(page)) {
+			migrate->src[i] = 0;
+			unlock_page(page);
+			migrate->cpages--;
+			put_page(page);
+			continue;
+		}
+
+		if (!migrate_vma_check_page(page)) {
+			migrate->src[i] = 0;
+			unlock_page(page);
+			migrate->cpages--;
+
+			putback_lru_page(page);
+		}
+	}
+}
+
+/*
+ * migrate_vma_unmap() - replace page mapping with special migration pte entry
+ * @migrate: migrate struct containing all migration information
+ *
+ * Replace page mapping (CPU page table pte) with a special migration pte entry
+ * and check again if it has been pinned. Pinned pages are restored because we
+ * cannot migrate them.
+ *
+ * This is the last step before we call the device driver callback to allocate
+ * destination memory and copy contents of original page over to new page.
+ */
+static void migrate_vma_unmap(struct migrate_vma *migrate)
+{
+	int flags = TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
+	const unsigned long npages = migrate->npages;
+	const unsigned long start = migrate->start;
+	unsigned long addr, i, restore = 0;
+
+	for (i = 0; i < npages; i++) {
+		struct page *page = migrate_pfn_to_page(migrate->src[i]);
+
+		if (!page || !(migrate->src[i] & MIGRATE_PFN_MIGRATE))
+			continue;
+
+		try_to_unmap(page, flags);
+		if (page_mapped(page) || !migrate_vma_check_page(page)) {
+			migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+			migrate->cpages--;
+			restore++;
+		}
+	}
+
+	for (addr = start, i = 0; i < npages && restore; addr += PAGE_SIZE, i++) {
+		struct page *page = migrate_pfn_to_page(migrate->src[i]);
+
+		if (!page || (migrate->src[i] & MIGRATE_PFN_MIGRATE))
+			continue;
+
+		remove_migration_ptes(page, page, false);
+
+		migrate->src[i] = 0;
+		unlock_page(page);
+		restore--;
+
+		putback_lru_page(page);
+	}
+}
+
+/*
+ * migrate_vma_pages() - migrate meta-data from src page to dst page
+ * @migrate: migrate struct containing all migration information
+ *
+ * This migrates struct page meta-data from source struct page to destination
+ * struct page. This effectively finishes the migration from source page to the
+ * destination page.
+ */
+static void migrate_vma_pages(struct migrate_vma *migrate)
+{
+	const unsigned long npages = migrate->npages;
+	const unsigned long start = migrate->start;
+	unsigned long addr, i;
+
+	for (i = 0, addr = start; i < npages; addr += PAGE_SIZE, i++) {
+		struct page *newpage = migrate_pfn_to_page(migrate->dst[i]);
+		struct page *page = migrate_pfn_to_page(migrate->src[i]);
+		struct address_space *mapping;
+		int r;
+
+		if (!page || !newpage)
+			continue;
+		if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE))
+			continue;
+
+		mapping = page_mapping(page);
+
+		r = migrate_page(mapping, newpage, page, MIGRATE_SYNC_NO_COPY);
+		if (r != MIGRATEPAGE_SUCCESS)
+			migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+	}
+}
+
+/*
+ * migrate_vma_finalize() - restore CPU page table entry
+ * @migrate: migrate struct containing all migration information
+ *
+ * This replaces the special migration pte entry with either a mapping to the
+ * new page if migration was successful for that page, or to the original page
+ * otherwise.
+ *
+ * This also unlocks the pages and puts them back on the lru, or drops the extra
+ * refcount, for device pages.
+ */
+static void migrate_vma_finalize(struct migrate_vma *migrate)
+{
+	const unsigned long npages = migrate->npages;
+	unsigned long i;
+
+	for (i = 0; i < npages; i++) {
+		struct page *newpage = migrate_pfn_to_page(migrate->dst[i]);
+		struct page *page = migrate_pfn_to_page(migrate->src[i]);
+
+		if (!page)
+			continue;
+		if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE) || !newpage) {
+			if (newpage) {
+				unlock_page(newpage);
+				put_page(newpage);
+			}
+			newpage = page;
+		}
+
+		remove_migration_ptes(page, newpage, false);
+		unlock_page(page);
+		migrate->cpages--;
+
+		putback_lru_page(page);
+
+		if (newpage != page) {
+			unlock_page(newpage);
+			putback_lru_page(newpage);
+		}
+	}
+}
+
+/*
+ * migrate_vma() - migrate a range of memory inside vma
+ *
+ * @ops: migration callback for allocating destination memory and copying
+ * @vma: virtual memory area containing the range to be migrated
+ * @start: start address of the range to migrate (inclusive)
+ * @end: end address of the range to migrate (exclusive)
+ * @src: array of unsigned long holding the source pfns (MIGRATE_PFN_* encoded)
+ * @dst: array of unsigned long holding the destination pfns (MIGRATE_PFN_* encoded)
+ * @private: pointer passed back to each of the callbacks
+ * Returns: 0 on success, error code otherwise
+ *
+ * This function tries to migrate a virtual address range, using
+ * callbacks to allocate and copy memory from source to destination. First it
+ * collects all the pages backing each virtual address in the range, saving this
+ * inside the src array. Then it locks those pages and unmaps them. Once the pages
+ * are locked and unmapped, it checks whether each page is pinned or not. Pages
+ * that aren't pinned have the MIGRATE_PFN_MIGRATE flag set (by this function)
+ * in the corresponding src array entry. It then restores any pages that are
+ * pinned, by remapping and unlocking those pages.
+ *
+ * At this point it calls the alloc_and_copy() callback. For documentation on
+ * what is expected from that callback, see struct migrate_vma_ops comments in
+ * include/linux/migrate.h
+ *
+ * After the alloc_and_copy() callback, this function goes over each entry in
+ * the src array that has the MIGRATE_PFN_VALID and MIGRATE_PFN_MIGRATE flag
+ * set. If the corresponding entry in dst array has MIGRATE_PFN_VALID flag set,
+ * then the function tries to migrate struct page information from the source
+ * struct page to the destination struct page. If it fails to migrate the struct
+ * page information, then it clears the MIGRATE_PFN_MIGRATE flag in the src
+ * array.
+ *
+ * At this point all successfully migrated pages have an entry in the src
+ * array with MIGRATE_PFN_VALID and MIGRATE_PFN_MIGRATE flag set and the dst
+ * array entry with MIGRATE_PFN_VALID flag set.
+ *
+ * It then calls the finalize_and_map() callback. See comments for "struct
+ * migrate_vma_ops", in include/linux/migrate.h for details about
+ * finalize_and_map() behavior.
+ *
+ * After the finalize_and_map() callback, for successfully migrated pages, this
+ * function updates the CPU page table to point to new pages, otherwise it
+ * restores the CPU page table to point to the original source pages.
+ *
+ * The function returns 0 after the above steps, even if no pages were migrated
+ * (it only returns an error if any of the arguments are invalid).
+ *
+ * Both src and dst array must be big enough for (end - start) >> PAGE_SHIFT
+ * unsigned long entries.
+ */
+int migrate_vma(const struct migrate_vma_ops *ops,
+		struct vm_area_struct *vma,
+		unsigned long start,
+		unsigned long end,
+		unsigned long *src,
+		unsigned long *dst,
+		void *private)
+{
+	struct migrate_vma migrate;
+
+	/* Sanity check the arguments */
+	start &= PAGE_MASK;
+	end &= PAGE_MASK;
+	if (!vma || !ops || !src || !dst || start >= end)
+		return -EINVAL;
+	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL))
+		return -EINVAL;
+	if (start < vma->vm_start || start >= vma->vm_end)
+		return -EINVAL;
+	if (end <= vma->vm_start || end > vma->vm_end)
+		return -EINVAL;
+
+	memset(src, 0, sizeof(*src) * ((end - start) >> PAGE_SHIFT));
+	migrate.src = src;
+	migrate.dst = dst;
+	migrate.start = start;
+	migrate.npages = 0;
+	migrate.cpages = 0;
+	migrate.end = end;
+	migrate.vma = vma;
+
+	/* Collect, and try to unmap source pages */
+	migrate_vma_collect(&migrate);
+	if (!migrate.cpages)
+		return 0;
+
+	/* Lock and isolate page */
+	migrate_vma_prepare(&migrate);
+	if (!migrate.cpages)
+		return 0;
+
+	/* Unmap pages */
+	migrate_vma_unmap(&migrate);
+	if (!migrate.cpages)
+		return 0;
+
+	/*
+	 * At this point pages are locked and unmapped, and thus they have
+	 * stable content and can safely be copied to destination memory that
+	 * is allocated by the callback.
+	 *
+	 * Note that migration can fail in migrate_vma_pages() for each
+	 * individual page.
+	 */
+	ops->alloc_and_copy(vma, src, dst, start, end, private);
+
+	/* This does the real migration of struct page */
+	migrate_vma_pages(&migrate);
+
+	ops->finalize_and_map(vma, src, dst, start, end, private);
+
+	/* Unlock and remap pages */
+	migrate_vma_finalize(&migrate);
+
+	return 0;
+}
+EXPORT_SYMBOL(migrate_vma);
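
A rough driver-side sketch of how this API is meant to be consumed (for
illustration only: the dummy_* helpers are hypothetical, and the exact
callback prototypes are the ones documented for struct migrate_vma_ops in
include/linux/migrate.h):

	static void dummy_alloc_and_copy(struct vm_area_struct *vma,
					 unsigned long *src,
					 unsigned long *dst,
					 unsigned long start,
					 unsigned long end,
					 void *private)
	{
		unsigned long npages = (end - start) >> PAGE_SHIFT;
		unsigned long i;

		for (i = 0; i < npages; i++) {
			struct page *spage = migrate_pfn_to_page(src[i]);
			struct page *dpage;

			if (!spage || !(src[i] & MIGRATE_PFN_MIGRATE))
				continue;

			/* Allocate destination memory and copy the source
			 * page into it, for instance with the device DMA
			 * engine. */
			dpage = dummy_device_alloc_page(private);
			if (!dpage)
				continue;
			dummy_device_copy_page(private, dpage, spage);

			lock_page(dpage);
			dst[i] = migrate_pfn(page_to_pfn(dpage)) |
				 MIGRATE_PFN_VALID | MIGRATE_PFN_LOCKED;
		}
	}

	static void dummy_finalize_and_map(struct vm_area_struct *vma,
					   unsigned long *src,
					   unsigned long *dst,
					   unsigned long start,
					   unsigned long end,
					   void *private)
	{
		/* Entries that still have MIGRATE_PFN_MIGRATE set in src[]
		 * did migrate; update the device page table for those. */
	}

	static const struct migrate_vma_ops dummy_migrate_ops = {
		.alloc_and_copy		= dummy_alloc_and_copy,
		.finalize_and_map	= dummy_finalize_and_map,
	};

	/* Caller provides src[] and dst[] with (end - start) >> PAGE_SHIFT
	 * entries each, then: */
	ret = migrate_vma(&dummy_migrate_ops, vma, start, end, src, dst,
			  driver_private);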
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [HMM 07/16] mm/migrate: migrate_vma() unmap page from vma while collecting pages
  2017-04-05 20:40 ` Jérôme Glisse
@ 2017-04-05 20:40   ` Jérôme Glisse
  -1 siblings, 0 replies; 81+ messages in thread
From: Jérôme Glisse @ 2017-04-05 20:40 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, Naoya Horiguchi, David Nellans,
	Jérôme Glisse, Evgeny Baskakov, Mark Hairgrove,
	Sherry Cheung, Subhash Gutti

The common case for migration of a virtual address range is that each page
is mapped only once, inside the vma in which the migration is taking place.
Because we already walk the CPU page table for that range, we can directly
do the unmap there and set up the special migration swap entry.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 mm/migrate.c | 114 ++++++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 98 insertions(+), 16 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index b2ce541..4486e30 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2117,7 +2117,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 {
 	struct migrate_vma *migrate = walk->private;
 	struct mm_struct *mm = walk->vma->vm_mm;
-	unsigned long addr = start;
+	unsigned long addr = start, unmapped = 0;
 	spinlock_t *ptl;
 	pte_t *ptep;
 
@@ -2127,9 +2127,12 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 	}
 
 	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+	arch_enter_lazy_mmu_mode();
+
 	for (; addr < end; addr += PAGE_SIZE, ptep++) {
 		unsigned long mpfn, pfn;
 		struct page *page;
+		swp_entry_t entry;
 		pte_t pte;
 
 		pte = *ptep;
@@ -2161,11 +2164,44 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 		mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
 		mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
 
+		/*
+		 * Optimize for the common case where page is only mapped once
+		 * in one process. If we can lock the page, then we can safely
+		 * set up a special migration page table entry now.
+		 */
+		if (trylock_page(page)) {
+			pte_t swp_pte;
+
+			mpfn |= MIGRATE_PFN_LOCKED;
+			ptep_get_and_clear(mm, addr, ptep);
+
+			/* Setup special migration page table entry */
+			entry = make_migration_entry(page, pte_write(pte));
+			swp_pte = swp_entry_to_pte(entry);
+			if (pte_soft_dirty(pte))
+				swp_pte = pte_swp_mksoft_dirty(swp_pte);
+			set_pte_at(mm, addr, ptep, swp_pte);
+
+			/*
+			 * This is like regular unmap: we remove the rmap and
+			 * drop page refcount. Page won't be freed, as we took
+			 * a reference just above.
+			 */
+			page_remove_rmap(page, false);
+			put_page(page);
+			unmapped++;
+		}
+
 next:
 		migrate->src[migrate->npages++] = mpfn;
 	}
+	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(ptep - 1, ptl);
 
+	/* Only flush the TLB if we actually modified any entries */
+	if (unmapped)
+		flush_tlb_range(walk->vma, start, end);
+
 	return 0;
 }
 
@@ -2190,7 +2226,13 @@ static void migrate_vma_collect(struct migrate_vma *migrate)
 	mm_walk.mm = migrate->vma->vm_mm;
 	mm_walk.private = migrate;
 
+	mmu_notifier_invalidate_range_start(mm_walk.mm,
+					    migrate->start,
+					    migrate->end);
 	walk_page_range(migrate->start, migrate->end, &mm_walk);
+	mmu_notifier_invalidate_range_end(mm_walk.mm,
+					  migrate->start,
+					  migrate->end);
 
 	migrate->end = migrate->start + (migrate->npages << PAGE_SHIFT);
 }
@@ -2246,12 +2288,16 @@ static void migrate_vma_prepare(struct migrate_vma *migrate)
 
 	for (i = 0; i < npages; i++) {
 		struct page *page = migrate_pfn_to_page(migrate->src[i]);
+		bool remap = true;
 
 		if (!page)
 			continue;
 
-		lock_page(page);
-		migrate->src[i] |= MIGRATE_PFN_LOCKED;
+		if (!(migrate->src[i] & MIGRATE_PFN_LOCKED)) {
+			remap = false;
+			lock_page(page);
+			migrate->src[i] |= MIGRATE_PFN_LOCKED;
+		}
 
 		if (!PageLRU(page) && allow_drain) {
 			/* Drain CPU's pagevec */
@@ -2260,21 +2306,50 @@ static void migrate_vma_prepare(struct migrate_vma *migrate)
 		}
 
 		if (isolate_lru_page(page)) {
-			migrate->src[i] = 0;
-			unlock_page(page);
-			migrate->cpages--;
-			put_page(page);
+			if (remap) {
+				migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+				migrate->cpages--;
+				restore++;
+			} else {
+				migrate->src[i] = 0;
+				unlock_page(page);
+				migrate->cpages--;
+				put_page(page);
+			}
 			continue;
 		}
 
 		if (!migrate_vma_check_page(page)) {
-			migrate->src[i] = 0;
-			unlock_page(page);
-			migrate->cpages--;
+			if (remap) {
+				migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+				migrate->cpages--;
+				restore++;
 
-			putback_lru_page(page);
+				get_page(page);
+				putback_lru_page(page);
+			} else {
+				migrate->src[i] = 0;
+				unlock_page(page);
+				migrate->cpages--;
+
+				putback_lru_page(page);
+			}
 		}
 	}
+
+	for (i = 0, addr = start; i < npages && restore; i++, addr += PAGE_SIZE) {
+		struct page *page = migrate_pfn_to_page(migrate->src[i]);
+
+		if (!page || (migrate->src[i] & MIGRATE_PFN_MIGRATE))
+			continue;
+
+		remove_migration_pte(page, migrate->vma, addr, page);
+
+		migrate->src[i] = 0;
+		unlock_page(page);
+		put_page(page);
+		restore--;
+	}
 }
 
 /*
@@ -2301,12 +2376,19 @@ static void migrate_vma_unmap(struct migrate_vma *migrate)
 		if (!page || !(migrate->src[i] & MIGRATE_PFN_MIGRATE))
 			continue;
 
-		try_to_unmap(page, flags);
-		if (page_mapped(page) || !migrate_vma_check_page(page)) {
-			migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
-			migrate->cpages--;
-			restore++;
+		if (page_mapped(page)) {
+			try_to_unmap(page, flags);
+			if (page_mapped(page))
+				goto restore;
 		}
+
+		if (migrate_vma_check_page(page))
+			continue;
+
+restore:
+		migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+		migrate->cpages--;
+		restore++;
 	}
 
 	for (addr = start, i = 0; i < npages && restore; addr += PAGE_SIZE, i++) {
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [HMM 08/16] mm/hmm: heterogeneous memory management (HMM for short)
  2017-04-05 20:40 ` Jérôme Glisse
@ 2017-04-05 20:40   ` Jérôme Glisse
  -1 siblings, 0 replies; 81+ messages in thread
From: Jérôme Glisse @ 2017-04-05 20:40 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, Naoya Horiguchi, David Nellans,
	Jérôme Glisse, Evgeny Baskakov, Mark Hairgrove,
	Sherry Cheung, Subhash Gutti

HMM provides 3 separate types of functionality:
    - Mirroring: synchronize CPU page table and device page table
    - Device memory: allocating struct page for device memory
    - Migration: migrating regular memory to device memory

This patch introduces some common helpers and definitions shared by all
three of those pieces of functionality.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 MAINTAINERS              |   7 +++
 include/linux/hmm.h      | 146 +++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mm_types.h |   5 ++
 kernel/fork.c            |   2 +
 mm/Kconfig               |  14 +++++
 mm/Makefile              |   1 +
 mm/hmm.c                 |  71 +++++++++++++++++++++++
 7 files changed, 246 insertions(+)
 create mode 100644 include/linux/hmm.h
 create mode 100644 mm/hmm.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 01394b0..4d2bddc 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5891,6 +5891,13 @@ S:	Supported
 F:	drivers/scsi/hisi_sas/
 F:	Documentation/devicetree/bindings/scsi/hisilicon-sas.txt
 
+HMM - Heterogeneous Memory Management
+M:	Jérôme Glisse <jglisse@redhat.com>
+L:	linux-mm@kvack.org
+S:	Maintained
+F:	mm/hmm*
+F:	include/linux/hmm*
+
 HOST AP DRIVER
 M:	Jouni Malinen <j@w1.fi>
 L:	linux-wireless@vger.kernel.org
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
new file mode 100644
index 0000000..93b363d
--- /dev/null
+++ b/include/linux/hmm.h
@@ -0,0 +1,146 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/*
+ * Heterogeneous Memory Management (HMM)
+ *
+ * See Documentation/vm/hmm.txt for an overview of what HMM is and what it is
+ * for. Here we focus on the HMM API, with some explanation of
+ * the underlying implementation.
+ *
+ * Short description: HMM provides a set of helpers to share a virtual address
+ * space between CPU and a device, so that the device can access any valid
+ * address of the process (while still obeying memory protection). HMM also
+ * provides helpers to migrate process memory to device memory, and back. Each
+ * set of functionality (address space mirroring, and migration to and from
+ * device memory) can be used independently of the other.
+ *
+ *
+ * HMM address space mirroring API:
+ *
+ * Use HMM address space mirroring if you want to mirror a range of the CPU page
+ * table of a process into a device page table. Here, "mirror" means "keep
+ * synchronized". Prerequisites: the device must provide the ability to write-
+ * protect its page tables (at PAGE_SIZE granularity), and must be able to
+ * recover from the resulting potential page faults.
+ *
+ * HMM guarantees that at any point in time, a given virtual address points to
+ * either the same memory in both CPU and device page tables (that is: CPU and
+ * device page tables each point to the same pages), or that one page table (CPU
+ * or device) points to no entry, while the other still points to the old page
+ * for the address. The latter case happens when the CPU page table update
+ * happens first, and then the update is mirrored over to the device page table.
+ * This does not cause any issue, because the CPU page table cannot start
+ * pointing to a new page until the device page table is invalidated.
+ *
+ * HMM uses mmu_notifiers to monitor the CPU page tables, and forwards any
+ * updates to each device driver that has registered a mirror. It also provides
+ * some API calls to help with taking a snapshot of the CPU page table, and to
+ * synchronize with any updates that might happen concurrently.
+ *
+ *
+ * HMM migration to and from device memory:
+ *
+ * HMM provides a set of helpers to hotplug device memory as ZONE_DEVICE, with
+ * a new MEMORY_DEVICE_UNADDRESSABLE type. This provides a struct page for
+ * each page of the device memory, and allows the device driver to manage its
+ * memory using those struct pages. Having struct pages for device memory makes
+ * migration easier. Because that memory is not addressable by the CPU it must
+ * never be pinned to the device; in other words, any CPU page fault can always
+ * cause the device memory to be migrated (copied/moved) back to regular memory.
+ *
+ * A new migrate helper (migrate_vma()) has been added (see mm/migrate.c) that
+ * allows use of a device DMA engine to perform the copy operation between
+ * regular system memory and device memory.
+ */
+#ifndef LINUX_HMM_H
+#define LINUX_HMM_H
+
+#include <linux/kconfig.h>
+
+#if IS_ENABLED(CONFIG_HMM)
+
+
+/*
+ * hmm_pfn_t - HMM uses its own pfn type to keep several flags per page
+ *
+ * Flags:
+ * HMM_PFN_VALID: pfn is valid
+ * HMM_PFN_WRITE: CPU page table has write permission set
+ */
+typedef unsigned long hmm_pfn_t;
+
+#define HMM_PFN_VALID (1 << 0)
+#define HMM_PFN_WRITE (1 << 1)
+#define HMM_PFN_SHIFT 2
+
+/*
+ * hmm_pfn_t_to_page() - return struct page pointed to by a valid hmm_pfn_t
+ * @pfn: hmm_pfn_t to convert to struct page
+ * Returns: struct page pointer if pfn is a valid hmm_pfn_t, NULL otherwise
+ *
+ * If the hmm_pfn_t is valid (ie valid flag set) then return the struct page
+ * matching the pfn value stored in the hmm_pfn_t. Otherwise return NULL.
+ */
+static inline struct page *hmm_pfn_t_to_page(hmm_pfn_t pfn)
+{
+	if (!(pfn & HMM_PFN_VALID))
+		return NULL;
+	return pfn_to_page(pfn >> HMM_PFN_SHIFT);
+}
+
+/*
+ * hmm_pfn_t_to_pfn() - return the pfn value stored in an hmm_pfn_t
+ * @pfn: hmm_pfn_t to extract pfn from
+ * Returns: pfn value if hmm_pfn_t is valid, -1UL otherwise
+ */
+static inline unsigned long hmm_pfn_t_to_pfn(hmm_pfn_t pfn)
+{
+	if (!(pfn & HMM_PFN_VALID))
+		return -1UL;
+	return (pfn >> HMM_PFN_SHIFT);
+}
+
+/*
+ * hmm_pfn_t_from_page() - create a valid hmm_pfn_t value from struct page
+ * @page: struct page pointer for which to create the hmm_pfn_t
+ * Returns: valid hmm_pfn_t for the page
+ */
+static inline hmm_pfn_t hmm_pfn_t_from_page(struct page *page)
+{
+	return (page_to_pfn(page) << HMM_PFN_SHIFT) | HMM_PFN_VALID;
+}
+
+/*
+ * hmm_pfn_t_from_pfn() - create a valid hmm_pfn_t value from pfn
+ * @pfn: pfn value for which to create the hmm_pfn_t
+ * Returns: valid hmm_pfn_t for the pfn
+ */
+static inline hmm_pfn_t hmm_pfn_t_from_pfn(unsigned long pfn)
+{
+	return (pfn << HMM_PFN_SHIFT) | HMM_PFN_VALID;
+}
+
+
+/* Below are for HMM internal use only! Not to be used by device driver! */
+void hmm_mm_destroy(struct mm_struct *mm);
+
+#else /* IS_ENABLED(CONFIG_HMM) */
+
+/* Below are for HMM internal use only! Not to be used by device driver! */
+static inline void hmm_mm_destroy(struct mm_struct *mm) {}
+
+#endif /* IS_ENABLED(CONFIG_HMM) */
+#endif /* LINUX_HMM_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4f6d440..31da81f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -23,6 +23,7 @@
 
 struct address_space;
 struct mem_cgroup;
+struct hmm;
 
 #define USE_SPLIT_PTE_PTLOCKS	(NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)
 #define USE_SPLIT_PMD_PTLOCKS	(USE_SPLIT_PTE_PTLOCKS && \
@@ -532,6 +533,10 @@ struct mm_struct {
 	atomic_long_t hugetlb_usage;
 #endif
 	struct work_struct async_put_work;
+#if IS_ENABLED(CONFIG_HMM)
+	/* HMM needs to track a few things per mm */
+	struct hmm *hmm;
+#endif
 };
 
 static inline void mm_init_cpumask(struct mm_struct *mm)
diff --git a/kernel/fork.c b/kernel/fork.c
index 14017b1..37a0961 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -27,6 +27,7 @@
 #include <linux/binfmts.h>
 #include <linux/mman.h>
 #include <linux/mmu_notifier.h>
+#include <linux/hmm.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/vmacache.h>
@@ -851,6 +852,7 @@ void __mmdrop(struct mm_struct *mm)
 	BUG_ON(mm == &init_mm);
 	mm_free_pgd(mm);
 	destroy_context(mm);
+	hmm_mm_destroy(mm);
 	mmu_notifier_mm_destroy(mm);
 	check_mm(mm);
 	put_user_ns(mm->user_ns);
diff --git a/mm/Kconfig b/mm/Kconfig
index 6208963..6bae95f 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -289,6 +289,20 @@ config MIGRATION
 config ARCH_ENABLE_HUGEPAGE_MIGRATION
 	bool
 
+config HMM
+	bool
+	depends on MMU && 64BIT
+	help
+	  HMM provides a set of helpers to share a virtual address
+	  space between CPU and a device, so that the device can access any valid
+	  address of the process (while still obeying memory protection). HMM also
+	  provides helpers to migrate process memory to device memory, and back.
+	  Each set of functionality (address space mirroring, and migration to and
+	  from device memory) can be used independently of the other.
+
+	  This is primarily useful for devices like GPU, for GPGPU compute workload,
+	  with APIs such as OpenCL or CUDA. See Documentation/vm/hmm.txt.
+
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
 
diff --git a/mm/Makefile b/mm/Makefile
index 026f6a8..9eb4121 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -75,6 +75,7 @@ obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
 obj-$(CONFIG_MEMTEST)		+= memtest.o
 obj-$(CONFIG_MIGRATION) += migrate.o
+obj-$(CONFIG_HMM) += hmm.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
diff --git a/mm/hmm.c b/mm/hmm.c
new file mode 100644
index 0000000..acadb49
--- /dev/null
+++ b/mm/hmm.c
@@ -0,0 +1,71 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/*
+ * Refer to include/linux/hmm.h for information about heterogeneous memory
+ * management or HMM for short.
+ */
+#include <linux/mm.h>
+#include <linux/hmm.h>
+#include <linux/slab.h>
+#include <linux/sched.h>
+
+/*
+ * struct hmm - HMM per mm struct
+ *
+ * @mm: mm struct this HMM struct is bound to
+ */
+struct hmm {
+	struct mm_struct	*mm;
+};
+
+/*
+ * hmm_register - register HMM against an mm (HMM internal)
+ *
+ * @mm: mm struct to attach to
+ *
+ * This is not intended to be used directly by device drivers. It allocates an
+ * HMM struct if mm does not have one, and initializes it.
+ */
+static struct hmm *hmm_register(struct mm_struct *mm)
+{
+	if (!mm->hmm) {
+		struct hmm *hmm = NULL;
+
+		hmm = kmalloc(sizeof(*hmm), GFP_KERNEL);
+		if (!hmm)
+			return NULL;
+		hmm->mm = mm;
+
+		spin_lock(&mm->page_table_lock);
+		if (!mm->hmm)
+			mm->hmm = hmm;
+		else
+			kfree(hmm);
+		spin_unlock(&mm->page_table_lock);
+	}
+
+	/*
+	 * The hmm struct can only be freed once the mm_struct goes away,
+	 * hence we should always have pre-allocated a new hmm struct
+	 * above.
+	 */
+	return mm->hmm;
+}
+
+void hmm_mm_destroy(struct mm_struct *mm)
+{
+	kfree(mm->hmm);
+}
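
As a small usage sketch of the hmm_pfn_t encoding introduced in
include/linux/hmm.h above (assuming "page" is a regular system page the
caller already holds a reference on):

	hmm_pfn_t entry;

	entry = hmm_pfn_t_from_page(page);	/* sets HMM_PFN_VALID */
	entry |= HMM_PFN_WRITE;			/* CPU pte allows writes */

	/* Decoding gives back the same page and pfn ... */
	BUG_ON(hmm_pfn_t_to_page(entry) != page);
	BUG_ON(hmm_pfn_t_to_pfn(entry) != page_to_pfn(page));

	/* ... while an entry without HMM_PFN_VALID decodes to nothing. */
	BUG_ON(hmm_pfn_t_to_page(0) != NULL);
	BUG_ON(hmm_pfn_t_to_pfn(0) != -1UL);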
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [HMM 09/16] mm/hmm/mirror: mirror process address space on device with HMM helpers
  2017-04-05 20:40 ` Jérôme Glisse
@ 2017-04-05 20:40   ` Jérôme Glisse
  -1 siblings, 0 replies; 81+ messages in thread
From: Jérôme Glisse @ 2017-04-05 20:40 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, Naoya Horiguchi, David Nellans,
	Jérôme Glisse, Evgeny Baskakov, Mark Hairgrove,
	Sherry Cheung, Subhash Gutti

This adds heterogeneous memory management (HMM) process address space
mirroring. In a nutshell, it provides an API to mirror a process address
space on a device. This boils down to keeping the CPU and device page
tables synchronized (we assume that both the device and the CPU are cache
coherent, as PCIe devices can be).

This patch provides a simple API for device drivers to achieve address
space mirroring, avoiding the need for each device driver to grow its own
CPU page table walker and its own CPU page table synchronization mechanism.

This is useful for NVidia GPUs >= Pascal, Mellanox IB >= mlx5, and more
hardware in the future.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/hmm.h | 110 ++++++++++++++++++++++++++++++++++
 mm/Kconfig          |  12 ++++
 mm/hmm.c            | 170 +++++++++++++++++++++++++++++++++++++++++++++++-----
 3 files changed, 277 insertions(+), 15 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 93b363d..6668a1b 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -72,6 +72,7 @@
 
 #if IS_ENABLED(CONFIG_HMM)
 
+struct hmm;
 
 /*
  * hmm_pfn_t - HMM uses its own pfn type to keep several flags per page
@@ -134,6 +135,115 @@ static inline hmm_pfn_t hmm_pfn_t_from_pfn(unsigned long pfn)
 }
 
 
+#if IS_ENABLED(CONFIG_HMM_MIRROR)
+/*
+ * Mirroring: how to synchronize device page table with CPU page table.
+ *
+ * A device driver that is participating in HMM mirroring must always
+ * synchronize with CPU page table updates. For this, device drivers can either
+ * directly use mmu_notifier APIs or they can use the hmm_mirror API. Device
+ * drivers can decide to register one mirror per device per process, or just
+ * one mirror per process for a group of devices. The pattern is:
+ *
+ *      int device_bind_address_space(..., struct mm_struct *mm, ...)
+ *      {
+ *          struct device_address_space *das;
+ *
+ *          // Device driver specific initialization, and allocation of das
+ *          // which contains an hmm_mirror struct as one of its fields.
+ *          ...
+ *          das->mirror.ops = &device_mirror_ops;
+ *          ret = hmm_mirror_register(&das->mirror, mm);
+ *          if (ret) {
+ *              // Cleanup on error
+ *              return ret;
+ *          }
+ *
+ *          // Other device driver specific initialization
+ *          ...
+ *      }
+ *
+ * Once an hmm_mirror is registered for an address space, the device driver
+ * will get callbacks through sync_cpu_device_pagetables() operation (see
+ * hmm_mirror_ops struct).
+ *
+ * Device driver must not free the struct containing the hmm_mirror struct
+ * before calling hmm_mirror_unregister(). The expected usage is to do that when
+ * the device driver is unbinding from an address space.
+ *
+ *
+ *      void device_unbind_address_space(struct device_address_space *das)
+ *      {
+ *          // Device driver specific cleanup
+ *          ...
+ *
+ *          hmm_mirror_unregister(&das->mirror);
+ *
+ *          // Other device driver specific cleanup, and now das can be freed
+ *          ...
+ *      }
+ */
+
+struct hmm_mirror;
+
+/*
+ * enum hmm_update_type - type of update
+ * @HMM_UPDATE_INVALIDATE: invalidate range (no indication as to why)
+ */
+enum hmm_update_type {
+	HMM_UPDATE_INVALIDATE,
+};
+
+/*
+ * struct hmm_mirror_ops - HMM mirror device operations callback
+ *
+ * @sync_cpu_device_pagetables: callback to synchronize a range on a device
+ */
+struct hmm_mirror_ops {
+	/* sync_cpu_device_pagetables() - synchronize page tables
+	 *
+	 * @mirror: pointer to struct hmm_mirror
+	 * @update_type: type of update that occurred to the CPU page table
+	 * @start: virtual start address of the range to update
+	 * @end: virtual end address of the range to update
+	 *
+	 * This callback ultimately originates from mmu_notifiers when the CPU
+	 * page table is updated. The device driver must update its page table
+	 * in response to this callback. The update argument tells what action
+	 * to perform.
+	 *
+	 * The device driver must not return from this callback until the device
+	 * page tables are completely updated (TLBs flushed, etc); this is a
+	 * synchronous call.
+	 */
+	void (*sync_cpu_device_pagetables)(struct hmm_mirror *mirror,
+					   enum hmm_update_type update_type,
+					   unsigned long start,
+					   unsigned long end);
+};
+
+/*
+ * struct hmm_mirror - mirror struct for a device driver
+ *
+ * @hmm: pointer to struct hmm (which is unique per mm_struct)
+ * @ops: device driver callback for HMM mirror operations
+ * @list: for list of mirrors of a given mm
+ *
+ * Each address space (mm_struct) being mirrored by a device must register one
+ * instance of an hmm_mirror struct with HMM. HMM will track the list of all
+ * mirrors for each mm_struct.
+ */
+struct hmm_mirror {
+	struct hmm			*hmm;
+	const struct hmm_mirror_ops	*ops;
+	struct list_head		list;
+};
+
+int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
+void hmm_mirror_unregister(struct hmm_mirror *mirror);
+#endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
+
+
 /* Below are for HMM internal use only! Not to be used by device driver! */
 void hmm_mm_destroy(struct mm_struct *mm);
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 6bae95f..134e300 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -303,6 +303,18 @@ config HMM
 	  This is primarily useful for devices like GPU, for GPGPU compute workload,
 	  with APIs such as OpenCL or CUDA. See Documentation/vm/hmm.txt.
 
+config HMM_MIRROR
+	bool "HMM mirror CPU page table into a device page table"
+	depends on MMU && 64BIT
+	select HMM
+	select MMU_NOTIFIER
+	help
+	  Select HMM_MIRROR if you want to mirror a range of the CPU page table of a
+	  process into a device page table. Here, mirror means "keep synchronized".
+	  Prerequisites: the device must provide the ability to write-protect its
+	  page tables (at PAGE_SIZE granularity), and must be able to recover from
+	  the resulting potential page faults.
+
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
 
diff --git a/mm/hmm.c b/mm/hmm.c
index acadb49..7ed4b4c 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -21,14 +21,26 @@
 #include <linux/hmm.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
+#include <linux/mmu_notifier.h>
+
+static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
+
 
 /*
  * struct hmm - HMM per mm struct
  *
  * @mm: mm struct this HMM struct is bound to
+ * @sequence: we track updates to the CPU page table with a sequence number
+ * @mirrors: list of mirrors for this mm
+ * @mmu_notifier: mmu notifier to track updates to CPU page table
+ * @mirrors_sem: read/write semaphore protecting the mirrors list
  */
 struct hmm {
 	struct mm_struct	*mm;
+	atomic_t		sequence;
+	struct list_head	mirrors;
+	struct mmu_notifier	mmu_notifier;
+	struct rw_semaphore	mirrors_sem;
 };
 
 /*
@@ -41,27 +53,48 @@ struct hmm {
  */
 static struct hmm *hmm_register(struct mm_struct *mm)
 {
-	if (!mm->hmm) {
-		struct hmm *hmm = NULL;
-
-		hmm = kmalloc(sizeof(*hmm), GFP_KERNEL);
-		if (!hmm)
-			return NULL;
-		hmm->mm = mm;
-
-		spin_lock(&mm->page_table_lock);
-		if (!mm->hmm)
-			mm->hmm = hmm;
-		else
-			kfree(hmm);
-		spin_unlock(&mm->page_table_lock);
-	}
+	struct hmm *hmm = READ_ONCE(mm->hmm);
+	bool cleanup = false;
 
 	/*
 	 * The hmm struct can only be freed once the mm_struct goes away,
 	 * hence we should always have pre-allocated a new hmm struct
 	 * above.
 	 */
+	if (hmm)
+		return hmm;
+
+	hmm = kmalloc(sizeof(*hmm), GFP_KERNEL);
+	if (!hmm)
+		return NULL;
+	INIT_LIST_HEAD(&hmm->mirrors);
+	init_rwsem(&hmm->mirrors_sem);
+	atomic_set(&hmm->sequence, 0);
+	hmm->mmu_notifier.ops = NULL;
+	hmm->mm = mm;
+
+	/*
+	 * We should only get here if hold the mmap_sem in write mode ie on
+	 * registration of first mirror through hmm_mirror_register()
+	 */
+	hmm->mmu_notifier.ops = &hmm_mmu_notifier_ops;
+	if (__mmu_notifier_register(&hmm->mmu_notifier, mm)) {
+		kfree(hmm);
+		return NULL;
+	}
+
+	spin_lock(&mm->page_table_lock);
+	if (!mm->hmm)
+		mm->hmm = hmm;
+	else
+		cleanup = true;
+	spin_unlock(&mm->page_table_lock);
+
+	if (cleanup) {
+		mmu_notifier_unregister(&hmm->mmu_notifier, mm);
+		kfree(hmm);
+	}
+
 	return mm->hmm;
 }
 
@@ -69,3 +102,110 @@ void hmm_mm_destroy(struct mm_struct *mm)
 {
 	kfree(mm->hmm);
 }
+
+
+#if IS_ENABLED(CONFIG_HMM_MIRROR)
+static void hmm_invalidate_range(struct hmm *hmm,
+				 enum hmm_update_type action,
+				 unsigned long start,
+				 unsigned long end)
+{
+	struct hmm_mirror *mirror;
+
+	down_read(&hmm->mirrors_sem);
+	list_for_each_entry(mirror, &hmm->mirrors, list)
+		mirror->ops->sync_cpu_device_pagetables(mirror, action,
+							start, end);
+	up_read(&hmm->mirrors_sem);
+}
+
+static void hmm_invalidate_page(struct mmu_notifier *mn,
+				struct mm_struct *mm,
+				unsigned long addr)
+{
+	unsigned long start = addr & PAGE_MASK;
+	unsigned long end = start + PAGE_SIZE;
+	struct hmm *hmm = mm->hmm;
+
+	VM_BUG_ON(!hmm);
+
+	atomic_inc(&hmm->sequence);
+	hmm_invalidate_range(mm->hmm, HMM_UPDATE_INVALIDATE, start, end);
+}
+
+static void hmm_invalidate_range_start(struct mmu_notifier *mn,
+				       struct mm_struct *mm,
+				       unsigned long start,
+				       unsigned long end)
+{
+	struct hmm *hmm = mm->hmm;
+
+	VM_BUG_ON(!hmm);
+
+	atomic_inc(&hmm->sequence);
+}
+
+static void hmm_invalidate_range_end(struct mmu_notifier *mn,
+				     struct mm_struct *mm,
+				     unsigned long start,
+				     unsigned long end)
+{
+	struct hmm *hmm = mm->hmm;
+
+	VM_BUG_ON(!hmm);
+
+	hmm_invalidate_range(mm->hmm, HMM_UPDATE_INVALIDATE, start, end);
+}
+
+static const struct mmu_notifier_ops hmm_mmu_notifier_ops = {
+	.invalidate_page	= hmm_invalidate_page,
+	.invalidate_range_start	= hmm_invalidate_range_start,
+	.invalidate_range_end	= hmm_invalidate_range_end,
+};
+
+/*
+ * hmm_mirror_register() - register a mirror against an mm
+ *
+ * @mirror: new mirror struct to register
+ * @mm: mm to register against
+ *
+ * To start mirroring a process address space, the device driver must register
+ * an HMM mirror struct.
+ *
+ * THE mm->mmap_sem MUST BE HELD IN WRITE MODE !
+ */
+int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
+{
+	/* Sanity check */
+	if (!mm || !mirror || !mirror->ops)
+		return -EINVAL;
+
+	mirror->hmm = hmm_register(mm);
+	if (!mirror->hmm)
+		return -ENOMEM;
+
+	down_write(&mirror->hmm->mirrors_sem);
+	list_add(&mirror->list, &mirror->hmm->mirrors);
+	up_write(&mirror->hmm->mirrors_sem);
+
+	return 0;
+}
+EXPORT_SYMBOL(hmm_mirror_register);
+
+/*
+ * hmm_mirror_unregister() - unregister a mirror
+ *
+ * @mirror: mirror struct to unregister
+ *
+ * Stop mirroring a process address space, and clean up.
+ */
+void hmm_mirror_unregister(struct hmm_mirror *mirror)
+{
+	struct hmm *hmm = mirror->hmm;
+
+	down_write(&hmm->mirrors_sem);
+	list_del(&mirror->list);
+	up_write(&hmm->mirrors_sem);
+}
+EXPORT_SYMBOL(hmm_mirror_unregister);
+#endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
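
A hypothetical driver-side sketch of the mirror API added here (the
dummy_* names are invented for illustration; a real driver would also
select HMM_MIRROR from its Kconfig):

	static void dummy_sync_cpu_device_pagetables(struct hmm_mirror *mirror,
						     enum hmm_update_type update_type,
						     unsigned long start,
						     unsigned long end)
	{
		struct dummy_device *ddev;

		ddev = container_of(mirror, struct dummy_device, mirror);

		/* Tear down the device page table entries covering
		 * [start, end) and flush the device TLB before returning,
		 * as this callback must be synchronous. */
		dummy_device_invalidate_range(ddev, start, end);
	}

	static const struct hmm_mirror_ops dummy_mirror_ops = {
		.sync_cpu_device_pagetables = dummy_sync_cpu_device_pagetables,
	};

	static int dummy_device_bind_mm(struct dummy_device *ddev,
					struct mm_struct *mm)
	{
		int ret;

		ddev->mirror.ops = &dummy_mirror_ops;

		down_write(&mm->mmap_sem);	/* required by hmm_mirror_register() */
		ret = hmm_mirror_register(&ddev->mirror, mm);
		up_write(&mm->mmap_sem);

		return ret;
	}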
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [HMM 09/16] mm/hmm/mirror: mirror process address space on device with HMM helpers
@ 2017-04-05 20:40   ` Jérôme Glisse
  0 siblings, 0 replies; 81+ messages in thread
From: Jérôme Glisse @ 2017-04-05 20:40 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, Naoya Horiguchi, David Nellans,
	Jérôme Glisse, Evgeny Baskakov, Mark Hairgrove,
	Sherry Cheung, Subhash Gutti

This is a heterogeneous memory management (HMM) process address space
mirroring. In a nutshell this provide an API to mirror process address
space on a device. This boils down to keeping CPU and device page table
synchronize (we assume that both device and CPU are cache coherent like
PCIe device can be).

This patch provide a simple API for device driver to achieve address
space mirroring thus avoiding each device driver to grow its own CPU
page table walker and its own CPU page table synchronization mechanism.

This is useful for NVidia GPU >= Pascal, Mellanox IB >= mlx5 and more
hardware in the future.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/hmm.h | 110 ++++++++++++++++++++++++++++++++++
 mm/Kconfig          |  12 ++++
 mm/hmm.c            | 170 +++++++++++++++++++++++++++++++++++++++++++++++-----
 3 files changed, 277 insertions(+), 15 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 93b363d..6668a1b 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -72,6 +72,7 @@
 
 #if IS_ENABLED(CONFIG_HMM)
 
+struct hmm;
 
 /*
  * hmm_pfn_t - HMM uses its own pfn type to keep several flags per page
@@ -134,6 +135,115 @@ static inline hmm_pfn_t hmm_pfn_t_from_pfn(unsigned long pfn)
 }
 
 
+#if IS_ENABLED(CONFIG_HMM_MIRROR)
+/*
+ * Mirroring: how to synchronize device page table with CPU page table.
+ *
+ * A device driver that is participating in HMM mirroring must always
+ * synchronize with CPU page table updates. For this, device drivers can either
+ * directly use mmu_notifier APIs or they can use the hmm_mirror API. Device
+ * drivers can decide to register one mirror per device per process, or just
+ * one mirror per process for a group of devices. The pattern is:
+ *
+ *      int device_bind_address_space(..., struct mm_struct *mm, ...)
+ *      {
+ *          struct device_address_space *das;
+ *
+ *          // Device driver specific initialization, and allocation of das
+ *          // which contains an hmm_mirror struct as one of its fields, with
+ *          // das->mirror.ops set to the driver's hmm_mirror_ops.
+ *          ...
+ *          ret = hmm_mirror_register(&das->mirror, mm);
+ *          if (ret) {
+ *              // Cleanup on error
+ *              return ret;
+ *          }
+ *
+ *          // Other device driver specific initialization
+ *          ...
+ *      }
+ *
+ * Once an hmm_mirror is registered for an address space, the device driver
+ * will get callbacks through sync_cpu_device_pagetables() operation (see
+ * hmm_mirror_ops struct).
+ *
+ * Device driver must not free the struct containing the hmm_mirror struct
+ * before calling hmm_mirror_unregister(). The expected usage is to do that when
+ * the device driver is unbinding from an address space.
+ *
+ *
+ *      void device_unbind_address_space(struct device_address_space *das)
+ *      {
+ *          // Device driver specific cleanup
+ *          ...
+ *
+ *          hmm_mirror_unregister(&das->mirror);
+ *
+ *          // Other device driver specific cleanup, and now das can be freed
+ *          ...
+ *      }
+ */
+
+struct hmm_mirror;
+
+/*
+ * enum hmm_update_type - type of update
+ * @HMM_UPDATE_INVALIDATE: invalidate range (no indication as to why)
+ */
+enum hmm_update_type {
+	HMM_UPDATE_INVALIDATE,
+};
+
+/*
+ * struct hmm_mirror_ops - HMM mirror device operations callback
+ *
+ * @update: callback to update range on a device
+ */
+struct hmm_mirror_ops {
+	/* sync_cpu_device_pagetables() - synchronize page tables
+	 *
+	 * @mirror: pointer to struct hmm_mirror
+	 * @update_type: type of update that occurred to the CPU page table
+	 * @start: virtual start address of the range to update
+	 * @end: virtual end address of the range to update
+	 *
+	 * This callback ultimately originates from mmu_notifiers when the CPU
+	 * page table is updated. The device driver must update its page table
+	 * in response to this callback. The update argument tells what action
+	 * to perform.
+	 *
+	 * The device driver must not return from this callback until the device
+	 * page tables are completely updated (TLBs flushed, etc); this is a
+	 * synchronous call.
+	 */
+	void (*sync_cpu_device_pagetables)(struct hmm_mirror *mirror,
+					   enum hmm_update_type update_type,
+					   unsigned long start,
+					   unsigned long end);
+};
+
+/*
+ * struct hmm_mirror - mirror struct for a device driver
+ *
+ * @hmm: pointer to struct hmm (which is unique per mm_struct)
+ * @ops: device driver callback for HMM mirror operations
+ * @list: for list of mirrors of a given mm
+ *
+ * Each address space (mm_struct) being mirrored by a device must register one
+ * instance of an hmm_mirror struct with HMM. HMM will track the list of all
+ * mirrors for each mm_struct.
+ */
+struct hmm_mirror {
+	struct hmm			*hmm;
+	const struct hmm_mirror_ops	*ops;
+	struct list_head		list;
+};
+
+int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
+void hmm_mirror_unregister(struct hmm_mirror *mirror);
+#endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
+
+
 /* Below are for HMM internal use only! Not to be used by device driver! */
 void hmm_mm_destroy(struct mm_struct *mm);
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 6bae95f..134e300 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -303,6 +303,18 @@ config HMM
 	  This is primarily useful for devices like GPU, for GPGPU compute workload,
 	  with APIs such as OpenCL or CUDA. See Documentation/vm/hmm.txt.
 
+config HMM_MIRROR
+	bool "HMM mirror CPU page table into a device page table"
+	depends on MMU && 64BIT
+	select HMM
+	select MMU_NOTIFIER
+	help
+	  Select HMM_MIRROR if you want to mirror a range of the CPU page table of a
+	  process into a device page table. Here, mirror means "keep synchronized".
+	  Prerequisites: the device must provide the ability to write-protect its
+	  page tables (at PAGE_SIZE granularity), and must be able to recover from
+	  the resulting potential page faults.
+
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
 
diff --git a/mm/hmm.c b/mm/hmm.c
index acadb49..7ed4b4c 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -21,14 +21,26 @@
 #include <linux/hmm.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
+#include <linux/mmu_notifier.h>
+
+static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
+
 
 /*
  * struct hmm - HMM per mm struct
  *
  * @mm: mm struct this HMM struct is bound to
+ * @sequence: we track updates to the CPU page table with a sequence number
+ * @mirrors: list of mirrors for this mm
+ * @mmu_notifier: mmu notifier to track updates to CPU page table
+ * @mirrors_sem: read/write semaphore protecting the mirrors list
  */
 struct hmm {
 	struct mm_struct	*mm;
+	atomic_t		sequence;
+	struct list_head	mirrors;
+	struct mmu_notifier	mmu_notifier;
+	struct rw_semaphore	mirrors_sem;
 };
 
 /*
@@ -41,27 +53,48 @@ struct hmm {
  */
 static struct hmm *hmm_register(struct mm_struct *mm)
 {
-	if (!mm->hmm) {
-		struct hmm *hmm = NULL;
-
-		hmm = kmalloc(sizeof(*hmm), GFP_KERNEL);
-		if (!hmm)
-			return NULL;
-		hmm->mm = mm;
-
-		spin_lock(&mm->page_table_lock);
-		if (!mm->hmm)
-			mm->hmm = hmm;
-		else
-			kfree(hmm);
-		spin_unlock(&mm->page_table_lock);
-	}
+	struct hmm *hmm = READ_ONCE(mm->hmm);
+	bool cleanup = false;
 
 	/*
 	 * The hmm struct can only be freed once the mm_struct goes away,
 	 * hence we should always have pre-allocated an new hmm struct
 	 * above.
 	 */
+	if (hmm)
+		return hmm;
+
+	hmm = kmalloc(sizeof(*hmm), GFP_KERNEL);
+	if (!hmm)
+		return NULL;
+	INIT_LIST_HEAD(&hmm->mirrors);
+	init_rwsem(&hmm->mirrors_sem);
+	atomic_set(&hmm->sequence, 0);
+	hmm->mmu_notifier.ops = NULL;
+	hmm->mm = mm;
+
+	/*
+	 * We should only get here if we hold the mmap_sem in write mode, i.e. on
+	 * registration of the first mirror through hmm_mirror_register().
+	 */
+	hmm->mmu_notifier.ops = &hmm_mmu_notifier_ops;
+	if (__mmu_notifier_register(&hmm->mmu_notifier, mm)) {
+		kfree(hmm);
+		return NULL;
+	}
+
+	spin_lock(&mm->page_table_lock);
+	if (!mm->hmm)
+		mm->hmm = hmm;
+	else
+		cleanup = true;
+	spin_unlock(&mm->page_table_lock);
+
+	if (cleanup) {
+		mmu_notifier_unregister(&hmm->mmu_notifier, mm);
+		kfree(hmm);
+	}
+
 	return mm->hmm;
 }
 
@@ -69,3 +102,110 @@ void hmm_mm_destroy(struct mm_struct *mm)
 {
 	kfree(mm->hmm);
 }
+
+
+#if IS_ENABLED(CONFIG_HMM_MIRROR)
+static void hmm_invalidate_range(struct hmm *hmm,
+				 enum hmm_update_type action,
+				 unsigned long start,
+				 unsigned long end)
+{
+	struct hmm_mirror *mirror;
+
+	down_read(&hmm->mirrors_sem);
+	list_for_each_entry(mirror, &hmm->mirrors, list)
+		mirror->ops->sync_cpu_device_pagetables(mirror, action,
+							start, end);
+	up_read(&hmm->mirrors_sem);
+}
+
+static void hmm_invalidate_page(struct mmu_notifier *mn,
+				struct mm_struct *mm,
+				unsigned long addr)
+{
+	unsigned long start = addr & PAGE_MASK;
+	unsigned long end = start + PAGE_SIZE;
+	struct hmm *hmm = mm->hmm;
+
+	VM_BUG_ON(!hmm);
+
+	atomic_inc(&hmm->sequence);
+	hmm_invalidate_range(mm->hmm, HMM_UPDATE_INVALIDATE, start, end);
+}
+
+static void hmm_invalidate_range_start(struct mmu_notifier *mn,
+				       struct mm_struct *mm,
+				       unsigned long start,
+				       unsigned long end)
+{
+	struct hmm *hmm = mm->hmm;
+
+	VM_BUG_ON(!hmm);
+
+	atomic_inc(&hmm->sequence);
+}
+
+static void hmm_invalidate_range_end(struct mmu_notifier *mn,
+				     struct mm_struct *mm,
+				     unsigned long start,
+				     unsigned long end)
+{
+	struct hmm *hmm = mm->hmm;
+
+	VM_BUG_ON(!hmm);
+
+	hmm_invalidate_range(mm->hmm, HMM_UPDATE_INVALIDATE, start, end);
+}
+
+static const struct mmu_notifier_ops hmm_mmu_notifier_ops = {
+	.invalidate_page	= hmm_invalidate_page,
+	.invalidate_range_start	= hmm_invalidate_range_start,
+	.invalidate_range_end	= hmm_invalidate_range_end,
+};
+
+/*
+ * hmm_mirror_register() - register a mirror against an mm
+ *
+ * @mirror: new mirror struct to register
+ * @mm: mm to register against
+ *
+ * To start mirroring a process address space, the device driver must register
+ * an HMM mirror struct.
+ *
+ * THE mm->mmap_sem MUST BE HELD IN WRITE MODE !
+ */
+int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
+{
+	/* Sanity check */
+	if (!mm || !mirror || !mirror->ops)
+		return -EINVAL;
+
+	mirror->hmm = hmm_register(mm);
+	if (!mirror->hmm)
+		return -ENOMEM;
+
+	down_write(&mirror->hmm->mirrors_sem);
+	list_add(&mirror->list, &mirror->hmm->mirrors);
+	up_write(&mirror->hmm->mirrors_sem);
+
+	return 0;
+}
+EXPORT_SYMBOL(hmm_mirror_register);
+
+/*
+ * hmm_mirror_unregister() - unregister a mirror
+ *
+ * @mirror: mirror struct to unregister
+ *
+ * Stop mirroring a process address space, and clean up.
+ */
+void hmm_mirror_unregister(struct hmm_mirror *mirror)
+{
+	struct hmm *hmm = mirror->hmm;
+
+	down_write(&hmm->mirrors_sem);
+	list_del(&mirror->list);
+	up_write(&hmm->mirrors_sem);
+}
+EXPORT_SYMBOL(hmm_mirror_unregister);
+#endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
-- 
2.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [HMM 10/16] mm/hmm/mirror: helper to snapshot CPU page table v2
  2017-04-05 20:40 ` Jérôme Glisse
@ 2017-04-05 20:40   ` Jérôme Glisse
  -1 siblings, 0 replies; 81+ messages in thread
From: Jérôme Glisse @ 2017-04-05 20:40 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, Naoya Horiguchi, David Nellans,
	Jérôme Glisse, Evgeny Baskakov, Mark Hairgrove,
	Sherry Cheung, Subhash Gutti

This does not use the existing page table walker because we want to
share the same code with our page fault handler.
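
For illustration only, a rough sketch of the snapshot pattern this patch
documents (my_mirror_range() and the my_device_lock()/my_device_unlock()
helpers are hypothetical placeholders for the driver lock that also
serializes the sync_cpu_device_pagetables() callback):

  #include <linux/hmm.h>
  #include <linux/mm.h>

  static void my_device_lock(void)   { /* placeholder: device PT lock */ }
  static void my_device_unlock(void) { /* placeholder: device PT unlock */ }

  /* Snapshot [start, end) inside one vma; mmap_sem is held for read. */
  static int my_mirror_range(struct vm_area_struct *vma,
                             unsigned long start, unsigned long end,
                             hmm_pfn_t *pfns)
  {
          struct hmm_range range;
          int ret;

  again:
          ret = hmm_vma_get_pfns(vma, &range, start, end, pfns);
          if (ret)
                  return ret;

          /* Build the device page table update from pfns[] here ... */

          my_device_lock();
          if (!hmm_vma_range_done(vma, &range)) {
                  /* CPU page table changed under us; snapshot is stale. */
                  my_device_unlock();
                  goto again;
          }
          /* Commit the update to the device page table. */
          my_device_unlock();

          return 0;
  }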

Changes since v1:
  - Use spinlock instead of rcu synchronized list traversal

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/hmm.h |  55 +++++++++-
 mm/hmm.c            | 285 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 338 insertions(+), 2 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 6668a1b..defa7cd 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -79,13 +79,26 @@ struct hmm;
  *
  * Flags:
  * HMM_PFN_VALID: pfn is valid
+ * HMM_PFN_READ:  CPU page table has read permission set
  * HMM_PFN_WRITE: CPU page table has write permission set
+ * HMM_PFN_ERROR: corresponding CPU page table entry points to poisoned memory
+ * HMM_PFN_EMPTY: corresponding CPU page table entry is pte_none()
+ * HMM_PFN_SPECIAL: corresponding CPU page table entry is special; i.e., the
+ *      result of vm_insert_pfn() or vm_insert_page(). Therefore, it should not
+ *      be mirrored by a device, because the entry will never have HMM_PFN_VALID
+ *      set and the pfn value is undefined.
+ * HMM_PFN_DEVICE_UNADDRESSABLE: unaddressable device memory (ZONE_DEVICE)
  */
 typedef unsigned long hmm_pfn_t;
 
 #define HMM_PFN_VALID (1 << 0)
-#define HMM_PFN_WRITE (1 << 1)
-#define HMM_PFN_SHIFT 2
+#define HMM_PFN_READ (1 << 1)
+#define HMM_PFN_WRITE (1 << 2)
+#define HMM_PFN_ERROR (1 << 3)
+#define HMM_PFN_EMPTY (1 << 4)
+#define HMM_PFN_SPECIAL (1 << 5)
+#define HMM_PFN_DEVICE_UNADDRESSABLE (1 << 6)
+#define HMM_PFN_SHIFT 7
 
 /*
  * hmm_pfn_t_to_page() - return struct page pointed to by a valid hmm_pfn_t
@@ -241,6 +254,44 @@ struct hmm_mirror {
 
 int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
 void hmm_mirror_unregister(struct hmm_mirror *mirror);
+
+
+/*
+ * struct hmm_range - track invalidation lock on virtual address range
+ *
+ * @list: all range locks are on a list
+ * @start: range virtual start address (inclusive)
+ * @end: range virtual end address (exclusive)
+ * @pfns: array of pfns (big enough for the range)
+ * @valid: pfns array did not change since it has been filled by an HMM function
+ */
+struct hmm_range {
+	struct list_head	list;
+	unsigned long		start;
+	unsigned long		end;
+	hmm_pfn_t		*pfns;
+	bool			valid;
+};
+
+/*
+ * To snapshot the CPU page table, call hmm_vma_get_pfns(), then take a device
+ * driver lock that serializes device page table updates, then call
+ * hmm_vma_range_done(), to check if the snapshot is still valid. The same
+ * device driver page table update lock must also be used in the
+ * hmm_mirror_ops.sync_cpu_device_pagetables() callback, so that CPU page
+ * table invalidation serializes on it.
+ *
+ * YOU MUST CALL hmm_vma_range_done() ONCE AND ONLY ONCE EACH TIME YOU CALL
+ * hmm_vma_get_pfns() WITHOUT ERROR !
+ *
+ * IF YOU DO NOT FOLLOW THE ABOVE RULE THE SNAPSHOT CONTENT MIGHT BE INVALID !
+ */
+int hmm_vma_get_pfns(struct vm_area_struct *vma,
+		     struct hmm_range *range,
+		     unsigned long start,
+		     unsigned long end,
+		     hmm_pfn_t *pfns);
+bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 7ed4b4c..4828b97 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -19,8 +19,12 @@
  */
 #include <linux/mm.h>
 #include <linux/hmm.h>
+#include <linux/rmap.h>
+#include <linux/swap.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
+#include <linux/swapops.h>
+#include <linux/hugetlb.h>
 #include <linux/mmu_notifier.h>
 
 static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
@@ -30,14 +34,18 @@ static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
  * struct hmm - HMM per mm struct
  *
  * @mm: mm struct this HMM struct is bound to
+ * @lock: lock protecting ranges list
  * @sequence: we track updates to the CPU page table with a sequence number
+ * @ranges: list of ranges being snapshotted
  * @mirrors: list of mirrors for this mm
  * @mmu_notifier: mmu notifier to track updates to CPU page table
  * @mirrors_sem: read/write semaphore protecting the mirrors list
  */
 struct hmm {
 	struct mm_struct	*mm;
+	spinlock_t		lock;
 	atomic_t		sequence;
+	struct list_head	ranges;
 	struct list_head	mirrors;
 	struct mmu_notifier	mmu_notifier;
 	struct rw_semaphore	mirrors_sem;
@@ -71,6 +79,8 @@ static struct hmm *hmm_register(struct mm_struct *mm)
 	init_rwsem(&hmm->mirrors_sem);
 	atomic_set(&hmm->sequence, 0);
 	hmm->mmu_notifier.ops = NULL;
+	INIT_LIST_HEAD(&hmm->ranges);
+	spin_lock_init(&hmm->lock);
 	hmm->mm = mm;
 
 	/*
@@ -111,6 +121,22 @@ static void hmm_invalidate_range(struct hmm *hmm,
 				 unsigned long end)
 {
 	struct hmm_mirror *mirror;
+	struct hmm_range *range;
+
+	spin_lock(&hmm->lock);
+	list_for_each_entry(range, &hmm->ranges, list) {
+		unsigned long addr, idx, npages;
+
+		if (end < range->start || start >= range->end)
+			continue;
+
+		range->valid = false;
+		addr = max(start, range->start);
+		idx = (addr - range->start) >> PAGE_SHIFT;
+		npages = (min(range->end, end) - addr) >> PAGE_SHIFT;
+		memset(&range->pfns[idx], 0, sizeof(*range->pfns) * npages);
+	}
+	spin_unlock(&hmm->lock);
 
 	down_read(&hmm->mirrors_sem);
 	list_for_each_entry(mirror, &hmm->mirrors, list)
@@ -208,4 +234,263 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror)
 	up_write(&hmm->mirrors_sem);
 }
 EXPORT_SYMBOL(hmm_mirror_unregister);
+
+static void hmm_pfns_special(hmm_pfn_t *pfns,
+			     unsigned long addr,
+			     unsigned long end)
+{
+	for (; addr < end; addr += PAGE_SIZE, pfns++)
+		*pfns = HMM_PFN_SPECIAL;
+}
+
+static int hmm_vma_walk_hole(unsigned long addr,
+			     unsigned long end,
+			     struct mm_walk *walk)
+{
+	struct hmm_range *range = walk->private;
+	hmm_pfn_t *pfns = range->pfns;
+	unsigned long i;
+
+	i = (addr - range->start) >> PAGE_SHIFT;
+	for (; addr < end; addr += PAGE_SIZE, i++)
+		pfns[i] = HMM_PFN_EMPTY;
+
+	return 0;
+}
+
+static int hmm_vma_walk_clear(unsigned long addr,
+			      unsigned long end,
+			      struct mm_walk *walk)
+{
+	struct hmm_range *range = walk->private;
+	hmm_pfn_t *pfns = range->pfns;
+	unsigned long i;
+
+	i = (addr - range->start) >> PAGE_SHIFT;
+	for (; addr < end; addr += PAGE_SIZE, i++)
+		pfns[i] = 0;
+
+	return 0;
+}
+
+static int hmm_vma_walk_pmd(pmd_t *pmdp,
+			    unsigned long start,
+			    unsigned long end,
+			    struct mm_walk *walk)
+{
+	struct hmm_range *range = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	hmm_pfn_t *pfns = range->pfns;
+	unsigned long addr = start, i;
+	hmm_pfn_t flag;
+	pte_t *ptep;
+
+	i = (addr - range->start) >> PAGE_SHIFT;
+	flag = vma->vm_flags & VM_READ ? HMM_PFN_READ : 0;
+
+	if (pmd_none(*pmdp) || pmd_trans_unstable(pmdp)) {
+		pmd_t pmd;
+
+		pmd = pmd_read_atomic(pmdp);
+		barrier();
+		if (pmd_none(pmd))
+			return hmm_vma_walk_hole(start, end, walk);
+
+		if (pmd_bad(pmd) || pmd_protnone(pmd))
+			return hmm_vma_walk_clear(start, end, walk);
+
+		if (pmd_trans_huge(pmd) || pmd_devmap(pmd)) {
+			unsigned long pfn = pmd_pfn(pmd) + pte_index(addr);
+
+			flag |= pmd_write(pmd) ? HMM_PFN_WRITE : 0;
+			for (; addr < end; addr += PAGE_SIZE, i++, pfn++)
+				pfns[i] = hmm_pfn_t_from_pfn(pfn) | flag;
+			return 0;
+		} else {
+			/*
+			 * Something unusual is going on. Better to have the
+			 * driver assume there is nothing for this range and
+			 * let the fault code path sort out proper pages for the
+			 * range.
+			 */
+			return hmm_vma_walk_clear(start, end, walk);
+		}
+	}
+
+	ptep = pte_offset_map(pmdp, addr);
+	for (; addr < end; addr += PAGE_SIZE, ptep++, i++) {
+		pte_t pte = *ptep;
+
+		pfns[i] = 0;
+
+		if (pte_none(pte)) {
+			pfns[i] = HMM_PFN_EMPTY;
+			continue;
+		}
+
+		if (!pte_present(pte)) {
+			swp_entry_t entry;
+
+			if (!non_swap_entry(entry))
+				continue;
+			entry = pte_to_swp_entry(pte);
+
+			/*
+			 * This is a special swap entry, ignore migration, use
+			 * device and report anything else as error.
+			 */
+			if (is_device_entry(entry)) {
+				pfns[i] = hmm_pfn_t_from_pfn(swp_offset(entry));
+				if (is_write_device_entry(entry))
+					pfns[i] |= HMM_PFN_WRITE;
+				pfns[i] |= HMM_PFN_DEVICE_UNADDRESSABLE;
+				pfns[i] |= flag;
+			} else if (!is_migration_entry(entry)) {
+				pfns[i] = HMM_PFN_ERROR;
+			}
+			continue;
+		}
+
+		pfns[i] = hmm_pfn_t_from_pfn(pte_pfn(pte)) | flag;
+		pfns[i] |= pte_write(pte) ? HMM_PFN_WRITE : 0;
+	}
+	pte_unmap(ptep - 1);
+
+	return 0;
+}
+
+/*
+ * hmm_vma_get_pfns() - snapshot CPU page table for a range of virtual addresses
+ * @vma: virtual memory area containing the virtual address range
+ * @range: used to track snapshot validity
+ * @start: range virtual start address (inclusive)
+ * @end: range virtual end address (exclusive)
+ * @pfns: array of hmm_pfn_t, provided by the caller, filled in by the function
+ * Returns: -EINVAL if invalid argument, -ENOMEM out of memory, 0 success
+ *
+ * This snapshots the CPU page table for a range of virtual addresses. Snapshot
+ * validity is tracked by range struct. See hmm_vma_range_done() for further
+ * information.
+ *
+ * The range struct is initialized here. It tracks the CPU page table, but only
+ * if the function returns success (0), in which case the caller must then call
+ * hmm_vma_range_done() to stop CPU page table update tracking on this range.
+ *
+ * NOT CALLING hmm_vma_range_done() IF FUNCTION RETURNS 0 WILL LEAD TO SERIOUS
+ * MEMORY CORRUPTION ! YOU HAVE BEEN WARNED !
+ */
+int hmm_vma_get_pfns(struct vm_area_struct *vma,
+		     struct hmm_range *range,
+		     unsigned long start,
+		     unsigned long end,
+		     hmm_pfn_t *pfns)
+{
+	struct mm_walk mm_walk;
+	struct hmm *hmm;
+
+	/* FIXME support hugetlb fs */
+	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
+		hmm_pfns_special(pfns, start, end);
+		return -EINVAL;
+	}
+
+	/* Sanity check, this really should not happen ! */
+	if (start < vma->vm_start || start >= vma->vm_end)
+		return -EINVAL;
+	if (end < vma->vm_start || end > vma->vm_end)
+		return -EINVAL;
+
+	hmm = hmm_register(vma->vm_mm);
+	if (!hmm)
+		return -ENOMEM;
+	/* Caller must have registered a mirror, via hmm_mirror_register() ! */
+	if (!hmm->mmu_notifier.ops)
+		return -EINVAL;
+
+	/* Initialize range to track CPU page table update */
+	range->start = start;
+	range->pfns = pfns;
+	range->end = end;
+	spin_lock(&hmm->lock);
+	range->valid = true;
+	list_add_rcu(&range->list, &hmm->ranges);
+	spin_unlock(&hmm->lock);
+
+	mm_walk.vma = vma;
+	mm_walk.mm = vma->vm_mm;
+	mm_walk.private = range;
+	mm_walk.pte_entry = NULL;
+	mm_walk.test_walk = NULL;
+	mm_walk.hugetlb_entry = NULL;
+	mm_walk.pmd_entry = hmm_vma_walk_pmd;
+	mm_walk.pte_hole = hmm_vma_walk_hole;
+
+	walk_page_range(start, end, &mm_walk);
+
+	return 0;
+}
+EXPORT_SYMBOL(hmm_vma_get_pfns);
+
+/*
+ * hmm_vma_range_done() - stop tracking change to CPU page table over a range
+ * @vma: virtual memory area containing the virtual address range
+ * @range: range being tracked
+ * Returns: false if range data has been invalidated, true otherwise
+ *
+ * Range struct is used to track updates to the CPU page table after a call to
+ * either hmm_vma_get_pfns() or hmm_vma_fault(). Once the device driver is done
+ * using the data,  or wants to lock updates to the data it got from those
+ * functions, it must call the hmm_vma_range_done() function, which will then
+ * stop tracking CPU page table updates.
+ *
+ * Note that device driver must still implement general CPU page table update
+ * tracking either by using hmm_mirror (see hmm_mirror_register()) or by using
+ * the mmu_notifier API directly.
+ *
+ * CPU page table update tracking done through hmm_range is only temporary and
+ * to be used while trying to duplicate CPU page table contents for a range of
+ * virtual addresses.
+ *
+ * There are two ways to use this :
+ * again:
+ *   hmm_vma_get_pfns(vma, range, start, end, pfns);
+ *   trans = device_build_page_table_update_transaction(pfns);
+ *   device_page_table_lock();
+ *   if (!hmm_vma_range_done(vma, range)) {
+ *     device_page_table_unlock();
+ *     goto again;
+ *   }
+ *   device_commit_transaction(trans);
+ *   device_page_table_unlock();
+ *
+ * Or:
+ *   hmm_vma_get_pfns(vma, range, start, end, pfns);
+ *   device_page_table_lock();
+ *   hmm_vma_range_done(vma, range);
+ *   device_update_page_table(pfns);
+ *   device_page_table_unlock();
+ */
+bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range)
+{
+	unsigned long npages = (range->end - range->start) >> PAGE_SHIFT;
+	struct hmm *hmm;
+
+	if (range->end <= range->start) {
+		BUG();
+		return false;
+	}
+
+	hmm = hmm_register(vma->vm_mm);
+	if (!hmm) {
+		memset(range->pfns, 0, sizeof(*range->pfns) * npages);
+		return false;
+	}
+
+	spin_lock(&hmm->lock);
+	list_del_rcu(&range->list);
+	spin_unlock(&hmm->lock);
+
+	return range->valid;
+}
+EXPORT_SYMBOL(hmm_vma_range_done);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [HMM 10/16] mm/hmm/mirror: helper to snapshot CPU page table v2
@ 2017-04-05 20:40   ` Jérôme Glisse
  0 siblings, 0 replies; 81+ messages in thread
From: Jérôme Glisse @ 2017-04-05 20:40 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, Naoya Horiguchi, David Nellans,
	Jérôme Glisse, Evgeny Baskakov, Mark Hairgrove,
	Sherry Cheung, Subhash Gutti

This does not use the existing page table walker because we want to
share the same code with our page fault handler.

Changes since v1:
  - Use spinlock instead of rcu synchronized list traversal

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/hmm.h |  55 +++++++++-
 mm/hmm.c            | 285 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 338 insertions(+), 2 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 6668a1b..defa7cd 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -79,13 +79,26 @@ struct hmm;
  *
  * Flags:
  * HMM_PFN_VALID: pfn is valid
+ * HMM_PFN_READ:  CPU page table has read permission set
  * HMM_PFN_WRITE: CPU page table has write permission set
+ * HMM_PFN_ERROR: corresponding CPU page table entry points to poisoned memory
+ * HMM_PFN_EMPTY: corresponding CPU page table entry is pte_none()
+ * HMM_PFN_SPECIAL: corresponding CPU page table entry is special; i.e., the
+ *      result of vm_insert_pfn() or vm_insert_page(). Therefore, it should not
+ *      be mirrored by a device, because the entry will never have HMM_PFN_VALID
+ *      set and the pfn value is undefined.
+ * HMM_PFN_DEVICE_UNADDRESSABLE: unaddressable device memory (ZONE_DEVICE)
  */
 typedef unsigned long hmm_pfn_t;
 
 #define HMM_PFN_VALID (1 << 0)
-#define HMM_PFN_WRITE (1 << 1)
-#define HMM_PFN_SHIFT 2
+#define HMM_PFN_READ (1 << 1)
+#define HMM_PFN_WRITE (1 << 2)
+#define HMM_PFN_ERROR (1 << 3)
+#define HMM_PFN_EMPTY (1 << 4)
+#define HMM_PFN_SPECIAL (1 << 5)
+#define HMM_PFN_DEVICE_UNADDRESSABLE (1 << 6)
+#define HMM_PFN_SHIFT 7
 
 /*
  * hmm_pfn_t_to_page() - return struct page pointed to by a valid hmm_pfn_t
@@ -241,6 +254,44 @@ struct hmm_mirror {
 
 int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
 void hmm_mirror_unregister(struct hmm_mirror *mirror);
+
+
+/*
+ * struct hmm_range - track invalidation lock on virtual address range
+ *
+ * @list: all range locks are on a list
+ * @start: range virtual start address (inclusive)
+ * @end: range virtual end address (exclusive)
+ * @pfns: array of pfns (big enough for the range)
+ * @valid: pfns array did not change since it has been filled by an HMM function
+ */
+struct hmm_range {
+	struct list_head	list;
+	unsigned long		start;
+	unsigned long		end;
+	hmm_pfn_t		*pfns;
+	bool			valid;
+};
+
+/*
+ * To snapshot the CPU page table, call hmm_vma_get_pfns(), then take a device
+ * driver lock that serializes device page table updates, then call
+ * hmm_vma_range_done(), to check if the snapshot is still valid. The same
+ * device driver page table update lock must also be used in the
+ * hmm_mirror_ops.sync_cpu_device_pagetables() callback, so that CPU page
+ * table invalidation serializes on it.
+ *
+ * YOU MUST CALL hmm_vma_range_done() ONCE AND ONLY ONCE EACH TIME YOU CALL
+ * hmm_vma_get_pfns() WITHOUT ERROR !
+ *
+ * IF YOU DO NOT FOLLOW THE ABOVE RULE THE SNAPSHOT CONTENT MIGHT BE INVALID !
+ */
+int hmm_vma_get_pfns(struct vm_area_struct *vma,
+		     struct hmm_range *range,
+		     unsigned long start,
+		     unsigned long end,
+		     hmm_pfn_t *pfns);
+bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 7ed4b4c..4828b97 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -19,8 +19,12 @@
  */
 #include <linux/mm.h>
 #include <linux/hmm.h>
+#include <linux/rmap.h>
+#include <linux/swap.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
+#include <linux/swapops.h>
+#include <linux/hugetlb.h>
 #include <linux/mmu_notifier.h>
 
 static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
@@ -30,14 +34,18 @@ static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
  * struct hmm - HMM per mm struct
  *
  * @mm: mm struct this HMM struct is bound to
+ * @lock: lock protecting ranges list
  * @sequence: we track updates to the CPU page table with a sequence number
+ * @ranges: list of ranges being snapshotted
  * @mirrors: list of mirrors for this mm
  * @mmu_notifier: mmu notifier to track updates to CPU page table
  * @mirrors_sem: read/write semaphore protecting the mirrors list
  */
 struct hmm {
 	struct mm_struct	*mm;
+	spinlock_t		lock;
 	atomic_t		sequence;
+	struct list_head	ranges;
 	struct list_head	mirrors;
 	struct mmu_notifier	mmu_notifier;
 	struct rw_semaphore	mirrors_sem;
@@ -71,6 +79,8 @@ static struct hmm *hmm_register(struct mm_struct *mm)
 	init_rwsem(&hmm->mirrors_sem);
 	atomic_set(&hmm->sequence, 0);
 	hmm->mmu_notifier.ops = NULL;
+	INIT_LIST_HEAD(&hmm->ranges);
+	spin_lock_init(&hmm->lock);
 	hmm->mm = mm;
 
 	/*
@@ -111,6 +121,22 @@ static void hmm_invalidate_range(struct hmm *hmm,
 				 unsigned long end)
 {
 	struct hmm_mirror *mirror;
+	struct hmm_range *range;
+
+	spin_lock(&hmm->lock);
+	list_for_each_entry(range, &hmm->ranges, list) {
+		unsigned long addr, idx, npages;
+
+		if (end < range->start || start >= range->end)
+			continue;
+
+		range->valid = false;
+		addr = max(start, range->start);
+		idx = (addr - range->start) >> PAGE_SHIFT;
+		npages = (min(range->end, end) - addr) >> PAGE_SHIFT;
+		memset(&range->pfns[idx], 0, sizeof(*range->pfns) * npages);
+	}
+	spin_unlock(&hmm->lock);
 
 	down_read(&hmm->mirrors_sem);
 	list_for_each_entry(mirror, &hmm->mirrors, list)
@@ -208,4 +234,263 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror)
 	up_write(&hmm->mirrors_sem);
 }
 EXPORT_SYMBOL(hmm_mirror_unregister);
+
+static void hmm_pfns_special(hmm_pfn_t *pfns,
+			     unsigned long addr,
+			     unsigned long end)
+{
+	for (; addr < end; addr += PAGE_SIZE, pfns++)
+		*pfns = HMM_PFN_SPECIAL;
+}
+
+static int hmm_vma_walk_hole(unsigned long addr,
+			     unsigned long end,
+			     struct mm_walk *walk)
+{
+	struct hmm_range *range = walk->private;
+	hmm_pfn_t *pfns = range->pfns;
+	unsigned long i;
+
+	i = (addr - range->start) >> PAGE_SHIFT;
+	for (; addr < end; addr += PAGE_SIZE, i++)
+		pfns[i] = HMM_PFN_EMPTY;
+
+	return 0;
+}
+
+static int hmm_vma_walk_clear(unsigned long addr,
+			      unsigned long end,
+			      struct mm_walk *walk)
+{
+	struct hmm_range *range = walk->private;
+	hmm_pfn_t *pfns = range->pfns;
+	unsigned long i;
+
+	i = (addr - range->start) >> PAGE_SHIFT;
+	for (; addr < end; addr += PAGE_SIZE, i++)
+		pfns[i] = 0;
+
+	return 0;
+}
+
+static int hmm_vma_walk_pmd(pmd_t *pmdp,
+			    unsigned long start,
+			    unsigned long end,
+			    struct mm_walk *walk)
+{
+	struct hmm_range *range = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	hmm_pfn_t *pfns = range->pfns;
+	unsigned long addr = start, i;
+	hmm_pfn_t flag;
+	pte_t *ptep;
+
+	i = (addr - range->start) >> PAGE_SHIFT;
+	flag = vma->vm_flags & VM_READ ? HMM_PFN_READ : 0;
+
+	if (pmd_none(*pmdp) || pmd_trans_unstable(pmdp)) {
+		pmd_t pmd;
+
+		pmd = pmd_read_atomic(pmdp);
+		barrier();
+		if (pmd_none(pmd))
+			return hmm_vma_walk_hole(start, end, walk);
+
+		if (pmd_bad(pmd) || pmd_protnone(pmd))
+			return hmm_vma_walk_clear(start, end, walk);
+
+		if (pmd_trans_huge(pmd) || pmd_devmap(pmd)) {
+			unsigned long pfn = pmd_pfn(pmd) + pte_index(addr);
+
+			flag |= pmd_write(pmd) ? HMM_PFN_WRITE : 0;
+			for (; addr < end; addr += PAGE_SIZE, i++, pfn++)
+				pfns[i] = hmm_pfn_t_from_pfn(pfn) | flag;
+			return 0;
+		} else {
+			/*
+			 * Something unusual is going on. Better to have the
+			 * driver assume there is nothing for this range and
+			 * let the fault code path sort out proper pages for the
+			 * range.
+			 */
+			return hmm_vma_walk_clear(start, end, walk);
+		}
+	}
+
+	ptep = pte_offset_map(pmdp, addr);
+	for (; addr < end; addr += PAGE_SIZE, ptep++, i++) {
+		pte_t pte = *ptep;
+
+		pfns[i] = 0;
+
+		if (pte_none(pte)) {
+			pfns[i] = HMM_PFN_EMPTY;
+			continue;
+		}
+
+		if (!pte_present(pte)) {
+			swp_entry_t entry;
+
+			if (!non_swap_entry(entry))
+				continue;
+			entry = pte_to_swp_entry(pte);
+
+			/*
+			 * This is a special swap entry, ignore migration, use
+			 * device and report anything else as error.
+			 */
+			if (is_device_entry(entry)) {
+				pfns[i] = hmm_pfn_t_from_pfn(swp_offset(entry));
+				if (is_write_device_entry(entry))
+					pfns[i] |= HMM_PFN_WRITE;
+				pfns[i] |= HMM_PFN_DEVICE_UNADDRESSABLE;
+				pfns[i] |= flag;
+			} else if (!is_migration_entry(entry)) {
+				pfns[i] = HMM_PFN_ERROR;
+			}
+			continue;
+		}
+
+		pfns[i] = hmm_pfn_t_from_pfn(pte_pfn(pte)) | flag;
+		pfns[i] |= pte_write(pte) ? HMM_PFN_WRITE : 0;
+	}
+	pte_unmap(ptep - 1);
+
+	return 0;
+}
+
+/*
+ * hmm_vma_get_pfns() - snapshot CPU page table for a range of virtual addresses
+ * @vma: virtual memory area containing the virtual address range
+ * @range: used to track snapshot validity
+ * @start: range virtual start address (inclusive)
+ * @end: range virtual end address (exclusive)
+ * @pfns: array of hmm_pfn_t, provided by the caller, filled in by the function
+ * Returns: -EINVAL if invalid argument, -ENOMEM out of memory, 0 success
+ *
+ * This snapshots the CPU page table for a range of virtual addresses. Snapshot
+ * validity is tracked by range struct. See hmm_vma_range_done() for further
+ * information.
+ *
+ * The range struct is initialized here. It tracks the CPU page table, but only
+ * if the function returns success (0), in which case the caller must then call
+ * hmm_vma_range_done() to stop CPU page table update tracking on this range.
+ *
+ * NOT CALLING hmm_vma_range_done() IF FUNCTION RETURNS 0 WILL LEAD TO SERIOUS
+ * MEMORY CORRUPTION ! YOU HAVE BEEN WARNED !
+ */
+int hmm_vma_get_pfns(struct vm_area_struct *vma,
+		     struct hmm_range *range,
+		     unsigned long start,
+		     unsigned long end,
+		     hmm_pfn_t *pfns)
+{
+	struct mm_walk mm_walk;
+	struct hmm *hmm;
+
+	/* FIXME support hugetlb fs */
+	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
+		hmm_pfns_special(pfns, start, end);
+		return -EINVAL;
+	}
+
+	/* Sanity check, this really should not happen ! */
+	if (start < vma->vm_start || start >= vma->vm_end)
+		return -EINVAL;
+	if (end < vma->vm_start || end > vma->vm_end)
+		return -EINVAL;
+
+	hmm = hmm_register(vma->vm_mm);
+	if (!hmm)
+		return -ENOMEM;
+	/* Caller must have registered a mirror, via hmm_mirror_register() ! */
+	if (!hmm->mmu_notifier.ops)
+		return -EINVAL;
+
+	/* Initialize range to track CPU page table update */
+	range->start = start;
+	range->pfns = pfns;
+	range->end = end;
+	spin_lock(&hmm->lock);
+	range->valid = true;
+	list_add_rcu(&range->list, &hmm->ranges);
+	spin_unlock(&hmm->lock);
+
+	mm_walk.vma = vma;
+	mm_walk.mm = vma->vm_mm;
+	mm_walk.private = range;
+	mm_walk.pte_entry = NULL;
+	mm_walk.test_walk = NULL;
+	mm_walk.hugetlb_entry = NULL;
+	mm_walk.pmd_entry = hmm_vma_walk_pmd;
+	mm_walk.pte_hole = hmm_vma_walk_hole;
+
+	walk_page_range(start, end, &mm_walk);
+
+	return 0;
+}
+EXPORT_SYMBOL(hmm_vma_get_pfns);
+
+/*
+ * hmm_vma_range_done() - stop tracking change to CPU page table over a range
+ * @vma: virtual memory area containing the virtual address range
+ * @range: range being tracked
+ * Returns: false if range data has been invalidated, true otherwise
+ *
+ * Range struct is used to track updates to the CPU page table after a call to
+ * either hmm_vma_get_pfns() or hmm_vma_fault(). Once the device driver is done
+ * using the data,  or wants to lock updates to the data it got from those
+ * functions, it must call the hmm_vma_range_done() function, which will then
+ * stop tracking CPU page table updates.
+ *
+ * Note that device driver must still implement general CPU page table update
+ * tracking either by using hmm_mirror (see hmm_mirror_register()) or by using
+ * the mmu_notifier API directly.
+ *
+ * CPU page table update tracking done through hmm_range is only temporary and
+ * to be used while trying to duplicate CPU page table contents for a range of
+ * virtual addresses.
+ *
+ * There are two ways to use this :
+ * again:
+ *   hmm_vma_get_pfns(vma, range, start, end, pfns);
+ *   trans = device_build_page_table_update_transaction(pfns);
+ *   device_page_table_lock();
+ *   if (!hmm_vma_range_done(vma, range)) {
+ *     device_page_table_unlock();
+ *     goto again;
+ *   }
+ *   device_commit_transaction(trans);
+ *   device_page_table_unlock();
+ *
+ * Or:
+ *   hmm_vma_get_pfns(vma, range, start, end, pfns);
+ *   device_page_table_lock();
+ *   hmm_vma_range_done(vma, range);
+ *   device_update_page_table(pfns);
+ *   device_page_table_unlock();
+ */
+bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range)
+{
+	unsigned long npages = (range->end - range->start) >> PAGE_SHIFT;
+	struct hmm *hmm;
+
+	if (range->end <= range->start) {
+		BUG();
+		return false;
+	}
+
+	hmm = hmm_register(vma->vm_mm);
+	if (!hmm) {
+		memset(range->pfns, 0, sizeof(*range->pfns) * npages);
+		return false;
+	}
+
+	spin_lock(&hmm->lock);
+	list_del_rcu(&range->list);
+	spin_unlock(&hmm->lock);
+
+	return range->valid;
+}
+EXPORT_SYMBOL(hmm_vma_range_done);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
-- 
2.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [HMM 11/16] mm/hmm/mirror: device page fault handler
  2017-04-05 20:40 ` Jérôme Glisse
@ 2017-04-05 20:40   ` Jérôme Glisse
  -1 siblings, 0 replies; 81+ messages in thread
From: Jérôme Glisse @ 2017-04-05 20:40 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, Naoya Horiguchi, David Nellans,
	Jérôme Glisse, Evgeny Baskakov, Mark Hairgrove,
	Sherry Cheung, Subhash Gutti

This handles page faults on behalf of a device driver; unlike
handle_mm_fault(), it does not trigger migration of device memory back
to system memory.
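
For illustration only, a rough sketch of the expected use pattern (see
also the hmm_vma_fault() documentation below); my_device_fault() and the
driver_lock/unlock helpers are hypothetical placeholders:

  #include <linux/hmm.h>
  #include <linux/mm.h>

  static void driver_lock_device_page_table_update(void)   { /* placeholder */ }
  static void driver_unlock_device_page_table_update(void) { /* placeholder */ }

  /* Fault [start, end) on behalf of the device; pfns[] covers the range. */
  static int my_device_fault(struct mm_struct *mm, unsigned long start,
                             unsigned long end, hmm_pfn_t *pfns, bool write)
  {
          struct vm_area_struct *vma;
          struct hmm_range range;
          int ret;

  retry:
          down_read(&mm->mmap_sem);
          /* mmap_sem may have been dropped, so look the vma up each time. */
          vma = find_vma(mm, start);
          if (!vma || start < vma->vm_start || end > vma->vm_end) {
                  up_read(&mm->mmap_sem);
                  return -EFAULT;
          }

          ret = hmm_vma_fault(vma, &range, start, end, pfns, write, false);
          if (ret == -EAGAIN)
                  goto retry;     /* hmm_vma_fault() dropped mmap_sem for us */
          if (ret) {
                  up_read(&mm->mmap_sem);
                  return ret;
          }

          driver_lock_device_page_table_update();
          hmm_vma_range_done(vma, &range);
          /* Commit the pfns[] entries that have HMM_PFN_VALID set. */
          driver_unlock_device_page_table_update();
          up_read(&mm->mmap_sem);

          return 0;
  }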

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/hmm.h |  27 ++++++
 mm/hmm.c            | 243 +++++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 256 insertions(+), 14 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index defa7cd..d267989 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -292,6 +292,33 @@ int hmm_vma_get_pfns(struct vm_area_struct *vma,
 		     unsigned long end,
 		     hmm_pfn_t *pfns);
 bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range);
+
+
+/*
+ * Fault memory on behalf of device driver. Unlike handle_mm_fault(), this will
+ * not migrate any device memory back to system memory. The hmm_pfn_t array will
+ * be updated with the fault result and current snapshot of the CPU page table
+ * for the range.
+ *
+ * The mmap_sem must be taken in read mode before entering and it might be
+ * dropped by the function if the block argument is false. In that case, the
+ * function returns -EAGAIN.
+ *
+ * Return value does not reflect if the fault was successful for every single
+ * address or not. Therefore, the caller must inspect the hmm_pfn_t array to
+ * determine fault status for each address.
+ *
+ * Trying to fault inside an invalid vma will result in -EINVAL.
+ *
+ * See the function description in mm/hmm.c for further documentation.
+ */
+int hmm_vma_fault(struct vm_area_struct *vma,
+		  struct hmm_range *range,
+		  unsigned long start,
+		  unsigned long end,
+		  hmm_pfn_t *pfns,
+		  bool write,
+		  bool block);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 4828b97..be88807 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -235,6 +235,36 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror)
 }
 EXPORT_SYMBOL(hmm_mirror_unregister);
 
+struct hmm_vma_walk {
+	struct hmm_range	*range;
+	unsigned long		last;
+	bool			fault;
+	bool			block;
+	bool			write;
+};
+
+static int hmm_vma_do_fault(struct mm_walk *walk,
+			    unsigned long addr,
+			    hmm_pfn_t *pfn)
+{
+	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_REMOTE;
+	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	int r;
+
+	flags |= hmm_vma_walk->block ? 0 : FAULT_FLAG_ALLOW_RETRY;
+	flags |= hmm_vma_walk->write ? FAULT_FLAG_WRITE : 0;
+	r = handle_mm_fault(vma, addr, flags);
+	if (r & VM_FAULT_RETRY)
+		return -EBUSY;
+	if (r & VM_FAULT_ERROR) {
+		*pfn = HMM_PFN_ERROR;
+		return -EFAULT;
+	}
+
+	return -EAGAIN;
+}
+
 static void hmm_pfns_special(hmm_pfn_t *pfns,
 			     unsigned long addr,
 			     unsigned long end)
@@ -243,34 +273,62 @@ static void hmm_pfns_special(hmm_pfn_t *pfns,
 		*pfns = HMM_PFN_SPECIAL;
 }
 
+static void hmm_pfns_clear(hmm_pfn_t *pfns,
+			   unsigned long addr,
+			   unsigned long end)
+{
+	for (; addr < end; addr += PAGE_SIZE, pfns++)
+		*pfns = 0;
+}
+
 static int hmm_vma_walk_hole(unsigned long addr,
 			     unsigned long end,
 			     struct mm_walk *walk)
 {
-	struct hmm_range *range = walk->private;
+	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct hmm_range *range = hmm_vma_walk->range;
 	hmm_pfn_t *pfns = range->pfns;
 	unsigned long i;
 
+	hmm_vma_walk->last = addr;
 	i = (addr - range->start) >> PAGE_SHIFT;
-	for (; addr < end; addr += PAGE_SIZE, i++)
+	for (; addr < end; addr += PAGE_SIZE, i++) {
 		pfns[i] = HMM_PFN_EMPTY;
+		if (hmm_vma_walk->fault) {
+			int ret;
 
-	return 0;
+			ret = hmm_vma_do_fault(walk, addr, &pfns[i]);
+			if (ret != -EAGAIN)
+				return ret;
+		}
+	}
+
+	return hmm_vma_walk->fault ? -EAGAIN : 0;
 }
 
 static int hmm_vma_walk_clear(unsigned long addr,
 			      unsigned long end,
 			      struct mm_walk *walk)
 {
-	struct hmm_range *range = walk->private;
+	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct hmm_range *range = hmm_vma_walk->range;
 	hmm_pfn_t *pfns = range->pfns;
 	unsigned long i;
 
+	hmm_vma_walk->last = addr;
 	i = (addr - range->start) >> PAGE_SHIFT;
-	for (; addr < end; addr += PAGE_SIZE, i++)
+	for (; addr < end; addr += PAGE_SIZE, i++) {
 		pfns[i] = 0;
+		if (hmm_vma_walk->fault) {
+			int ret;
 
-	return 0;
+			ret = hmm_vma_do_fault(walk, addr, &pfns[i]);
+			if (ret != -EAGAIN)
+				return ret;
+		}
+	}
+
+	return hmm_vma_walk->fault ? -EAGAIN : 0;
 }
 
 static int hmm_vma_walk_pmd(pmd_t *pmdp,
@@ -278,15 +336,18 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 			    unsigned long end,
 			    struct mm_walk *walk)
 {
-	struct hmm_range *range = walk->private;
+	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct hmm_range *range = hmm_vma_walk->range;
 	struct vm_area_struct *vma = walk->vma;
 	hmm_pfn_t *pfns = range->pfns;
 	unsigned long addr = start, i;
+	bool write_fault;
 	hmm_pfn_t flag;
 	pte_t *ptep;
 
 	i = (addr - range->start) >> PAGE_SHIFT;
 	flag = vma->vm_flags & VM_READ ? HMM_PFN_READ : 0;
+	write_fault = hmm_vma_walk->fault & hmm_vma_walk->write;
 
 	if (pmd_none(*pmdp) || pmd_trans_unstable(pmdp)) {
 		pmd_t pmd;
@@ -302,6 +363,9 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 		if (pmd_trans_huge(pmd) || pmd_devmap(pmd)) {
 			unsigned long pfn = pmd_pfn(pmd) + pte_index(addr);
 
+			if (write_fault && !pmd_write(pmd))
+				return hmm_vma_walk_clear(start, end, walk);
+
 			flag |= pmd_write(pmd) ? HMM_PFN_WRITE : 0;
 			for (; addr < end; addr += PAGE_SIZE, i++, pfn++)
 				pfns[i] = hmm_pfn_t_from_pfn(pfn) | flag;
@@ -325,14 +389,19 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 
 		if (pte_none(pte)) {
 			pfns[i] = HMM_PFN_EMPTY;
+			if (hmm_vma_walk->fault)
+				goto fault;
 			continue;
 		}
 
 		if (!pte_present(pte)) {
 			swp_entry_t entry;
 
-			if (!non_swap_entry(entry))
+			if (!non_swap_entry(entry)) {
+				if (hmm_vma_walk->fault)
+					goto fault;
 				continue;
+			}
 			entry = pte_to_swp_entry(pte);
 
 			/*
@@ -341,18 +410,38 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 			 */
 			if (is_device_entry(entry)) {
 				pfns[i] = hmm_pfn_t_from_pfn(swp_offset(entry));
-				if (is_write_device_entry(entry))
+				if (is_write_device_entry(entry)) {
 					pfns[i] |= HMM_PFN_WRITE;
+				} else if (write_fault)
+					goto fault;
 				pfns[i] |= HMM_PFN_DEVICE_UNADDRESSABLE;
 				pfns[i] |= flag;
-			} else if (!is_migration_entry(entry)) {
+			} else if (is_migration_entry(entry)) {
+				if (hmm_vma_walk->fault) {
+					pte_unmap(ptep - 1);
+					hmm_vma_walk->last = addr;
+					migration_entry_wait(vma->vm_mm,
+							     pmdp, addr);
+					return -EAGAIN;
+				}
+				continue;
+			} else {
+				/* Report error for everything else */
 				pfns[i] = HMM_PFN_ERROR;
 			}
 			continue;
 		}
 
+		if (write_fault && !pte_write(pte))
+			goto fault;
 		pfns[i] = hmm_pfn_t_from_pfn(pte_pfn(pte)) | flag;
 		pfns[i] |= pte_write(pte) ? HMM_PFN_WRITE : 0;
+		continue;
+
+fault:
+		pte_unmap(ptep - 1);
+		/* Fault all pages in range */
+		return hmm_vma_walk_clear(start, end, walk);
 	}
 	pte_unmap(ptep - 1);
 
@@ -385,6 +474,7 @@ int hmm_vma_get_pfns(struct vm_area_struct *vma,
 		     unsigned long end,
 		     hmm_pfn_t *pfns)
 {
+	struct hmm_vma_walk hmm_vma_walk;
 	struct mm_walk mm_walk;
 	struct hmm *hmm;
 
@@ -416,9 +506,12 @@ int hmm_vma_get_pfns(struct vm_area_struct *vma,
 	list_add_rcu(&range->list, &hmm->ranges);
 	spin_unlock(&hmm->lock);
 
+	hmm_vma_walk.fault = false;
+	hmm_vma_walk.range = range;
+	mm_walk.private = &hmm_vma_walk;
+
 	mm_walk.vma = vma;
 	mm_walk.mm = vma->vm_mm;
-	mm_walk.private = range;
 	mm_walk.pte_entry = NULL;
 	mm_walk.test_walk = NULL;
 	mm_walk.hugetlb_entry = NULL;
@@ -426,7 +519,6 @@ int hmm_vma_get_pfns(struct vm_area_struct *vma,
 	mm_walk.pte_hole = hmm_vma_walk_hole;
 
 	walk_page_range(start, end, &mm_walk);
-
 	return 0;
 }
 EXPORT_SYMBOL(hmm_vma_get_pfns);
@@ -453,7 +545,7 @@ EXPORT_SYMBOL(hmm_vma_get_pfns);
  *
  * There are two ways to use this :
  * again:
- *   hmm_vma_get_pfns(vma, range, start, end, pfns);
+ *   hmm_vma_get_pfns(vma, range, start, end, pfns); or hmm_vma_fault(...);
  *   trans = device_build_page_table_update_transaction(pfns);
  *   device_page_table_lock();
  *   if (!hmm_vma_range_done(vma, range)) {
@@ -464,7 +556,7 @@ EXPORT_SYMBOL(hmm_vma_get_pfns);
  *   device_page_table_unlock();
  *
  * Or:
- *   hmm_vma_get_pfns(vma, range, start, end, pfns);
+ *   hmm_vma_get_pfns(vma, range, start, end, pfns); or hmm_vma_fault(...);
  *   device_page_table_lock();
  *   hmm_vma_range_done(vma, range);
  *   device_update_page_table(pfns);
@@ -493,4 +585,127 @@ bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range)
 	return range->valid;
 }
 EXPORT_SYMBOL(hmm_vma_range_done);
+
+/*
+ * hmm_vma_fault() - try to fault some address in a virtual address range
+ * @vma: virtual memory area containing the virtual address range
+ * @range: used to track pfns array content validity
+ * @start: fault range virtual start address (inclusive)
+ * @end: fault range virtual end address (exclusive)
+ * @pfns: array of hmm_pfn_t, only entry with fault flag set will be faulted
+ * @write: is it a write fault
+ * @block: allow blocking on fault (if true it sleeps and does not drop mmap_sem)
+ * Returns: 0 on success, error otherwise (-EAGAIN means mmap_sem has been dropped)
+ *
+ * This is similar to a regular CPU page fault except that it will not trigger
+ * any memory migration if the memory being faulted is not accessible by CPUs.
+ *
+ * On error, for one virtual address in the range, the function will set the
+ * hmm_pfn_t error flag for the corresponding pfn entry.
+ *
+ * Expected use pattern:
+ * retry:
+ *   down_read(&mm->mmap_sem);
+ *   // Find vma and address device wants to fault, initialize hmm_pfn_t
+ *   // array accordingly
+ *   ret = hmm_vma_fault(vma, range, start, end, pfns, write, block);
+ *   switch (ret) {
+ *   case -EAGAIN:
+ *     hmm_vma_range_done(vma, range);
+ *     // You might want to rate limit or yield to play nicely, you may
+ *     // also commit any valid pfn in the array assuming that you are
+ *     // getting true from hmm_vma_range_done()
+ *     goto retry;
+ *   case 0:
+ *     break;
+ *   default:
+ *     // Handle error !
+ *     up_read(&mm->mmap_sem)
+ *     return;
+ *   }
+ *   // Take device driver lock that serialize device page table update
+ *   driver_lock_device_page_table_update();
+ *   hmm_vma_range_done(vma, range);
+ *   // Commit pfns we got from hmm_vma_fault()
+ *   driver_unlock_device_page_table_update();
+ *   up_read(&mm->mmap_sem)
+ *
+ * YOU MUST CALL hmm_vma_range_done() AFTER THIS FUNCTION RETURN SUCCESS (0)
+ * BEFORE FREEING THE range struct OR YOU WILL HAVE SERIOUS MEMORY CORRUPTION !
+ *
+ * YOU HAVE BEEN WARNED !
+ */
+int hmm_vma_fault(struct vm_area_struct *vma,
+		  struct hmm_range *range,
+		  unsigned long start,
+		  unsigned long end,
+		  hmm_pfn_t *pfns,
+		  bool write,
+		  bool block)
+{
+	struct hmm_vma_walk hmm_vma_walk;
+	struct mm_walk mm_walk;
+	struct hmm *hmm;
+	int ret;
+
+	/* Sanity check, this really should not happen ! */
+	if (start < vma->vm_start || start >= vma->vm_end)
+		return -EINVAL;
+	if (end < vma->vm_start || end > vma->vm_end)
+		return -EINVAL;
+
+	hmm = hmm_register(vma->vm_mm);
+	if (!hmm) {
+		hmm_pfns_clear(pfns, start, end);
+		return -ENOMEM;
+	}
+	/* Caller must have registered a mirror using hmm_mirror_register() */
+	if (!hmm->mmu_notifier.ops)
+		return -EINVAL;
+
+	/* Initialize range to track CPU page table update */
+	range->start = start;
+	range->pfns = pfns;
+	range->end = end;
+	spin_lock(&hmm->lock);
+	range->valid = true;
+	list_add_rcu(&range->list, &hmm->ranges);
+	spin_unlock(&hmm->lock);
+
+	/* FIXME support hugetlb fs */
+	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
+		hmm_pfns_special(pfns, start, end);
+		return 0;
+	}
+
+	hmm_vma_walk.fault = true;
+	hmm_vma_walk.write = write;
+	hmm_vma_walk.block = block;
+	hmm_vma_walk.range = range;
+	mm_walk.private = &hmm_vma_walk;
+	hmm_vma_walk.last = range->start;
+
+	mm_walk.vma = vma;
+	mm_walk.mm = vma->vm_mm;
+	mm_walk.pte_entry = NULL;
+	mm_walk.test_walk = NULL;
+	mm_walk.hugetlb_entry = NULL;
+	mm_walk.pmd_entry = hmm_vma_walk_pmd;
+	mm_walk.pte_hole = hmm_vma_walk_hole;
+
+	do {
+		ret = walk_page_range(start, end, &mm_walk);
+		start = hmm_vma_walk.last;
+	} while (ret == -EAGAIN);
+
+	if (ret) {
+		unsigned long i;
+
+		i = (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
+		hmm_pfns_clear(&pfns[i], hmm_vma_walk.last, end);
+		hmm_vma_range_done(vma, range);
+	}
+	return ret;
+}
+EXPORT_SYMBOL(hmm_vma_fault);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [HMM 11/16] mm/hmm/mirror: device page fault handler
@ 2017-04-05 20:40   ` Jérôme Glisse
  0 siblings, 0 replies; 81+ messages in thread
From: Jérôme Glisse @ 2017-04-05 20:40 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, Naoya Horiguchi, David Nellans,
	Jérôme Glisse, Evgeny Baskakov, Mark Hairgrove,
	Sherry Cheung, Subhash Gutti

This handles page faults on behalf of a device driver; unlike
handle_mm_fault(), it does not trigger migration of device memory back
to system memory.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/hmm.h |  27 ++++++
 mm/hmm.c            | 243 +++++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 256 insertions(+), 14 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index defa7cd..d267989 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -292,6 +292,33 @@ int hmm_vma_get_pfns(struct vm_area_struct *vma,
 		     unsigned long end,
 		     hmm_pfn_t *pfns);
 bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range);
+
+
+/*
+ * Fault memory on behalf of device driver. Unlike handle_mm_fault(), this will
+ * not migrate any device memory back to system memory. The hmm_pfn_t array will
+ * be updated with the fault result and current snapshot of the CPU page table
+ * for the range.
+ *
+ * The mmap_sem must be taken in read mode before entering and it might be
+ * dropped by the function if the block argument is false. In that case, the
+ * function returns -EAGAIN.
+ *
+ * Return value does not reflect if the fault was successful for every single
+ * address or not. Therefore, the caller must inspect the hmm_pfn_t array to
+ * determine fault status for each address.
+ *
+ * Trying to fault inside an invalid vma will result in -EINVAL.
+ *
+ * See the function description in mm/hmm.c for further documentation.
+ */
+int hmm_vma_fault(struct vm_area_struct *vma,
+		  struct hmm_range *range,
+		  unsigned long start,
+		  unsigned long end,
+		  hmm_pfn_t *pfns,
+		  bool write,
+		  bool block);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 4828b97..be88807 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -235,6 +235,36 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror)
 }
 EXPORT_SYMBOL(hmm_mirror_unregister);
 
+struct hmm_vma_walk {
+	struct hmm_range	*range;
+	unsigned long		last;
+	bool			fault;
+	bool			block;
+	bool			write;
+};
+
+static int hmm_vma_do_fault(struct mm_walk *walk,
+			    unsigned long addr,
+			    hmm_pfn_t *pfn)
+{
+	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_REMOTE;
+	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	int r;
+
+	flags |= hmm_vma_walk->block ? 0 : FAULT_FLAG_ALLOW_RETRY;
+	flags |= hmm_vma_walk->write ? FAULT_FLAG_WRITE : 0;
+	r = handle_mm_fault(vma, addr, flags);
+	if (r & VM_FAULT_RETRY)
+		return -EBUSY;
+	if (r & VM_FAULT_ERROR) {
+		*pfn = HMM_PFN_ERROR;
+		return -EFAULT;
+	}
+
+	return -EAGAIN;
+}
+
 static void hmm_pfns_special(hmm_pfn_t *pfns,
 			     unsigned long addr,
 			     unsigned long end)
@@ -243,34 +273,62 @@ static void hmm_pfns_special(hmm_pfn_t *pfns,
 		*pfns = HMM_PFN_SPECIAL;
 }
 
+static void hmm_pfns_clear(hmm_pfn_t *pfns,
+			   unsigned long addr,
+			   unsigned long end)
+{
+	for (; addr < end; addr += PAGE_SIZE, pfns++)
+		*pfns = 0;
+}
+
 static int hmm_vma_walk_hole(unsigned long addr,
 			     unsigned long end,
 			     struct mm_walk *walk)
 {
-	struct hmm_range *range = walk->private;
+	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct hmm_range *range = hmm_vma_walk->range;
 	hmm_pfn_t *pfns = range->pfns;
 	unsigned long i;
 
+	hmm_vma_walk->last = addr;
 	i = (addr - range->start) >> PAGE_SHIFT;
-	for (; addr < end; addr += PAGE_SIZE, i++)
+	for (; addr < end; addr += PAGE_SIZE, i++) {
 		pfns[i] = HMM_PFN_EMPTY;
+		if (hmm_vma_walk->fault) {
+			int ret;
 
-	return 0;
+			ret = hmm_vma_do_fault(walk, addr, &pfns[i]);
+			if (ret != -EAGAIN)
+				return ret;
+		}
+	}
+
+	return hmm_vma_walk->fault ? -EAGAIN : 0;
 }
 
 static int hmm_vma_walk_clear(unsigned long addr,
 			      unsigned long end,
 			      struct mm_walk *walk)
 {
-	struct hmm_range *range = walk->private;
+	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct hmm_range *range = hmm_vma_walk->range;
 	hmm_pfn_t *pfns = range->pfns;
 	unsigned long i;
 
+	hmm_vma_walk->last = addr;
 	i = (addr - range->start) >> PAGE_SHIFT;
-	for (; addr < end; addr += PAGE_SIZE, i++)
+	for (; addr < end; addr += PAGE_SIZE, i++) {
 		pfns[i] = 0;
+		if (hmm_vma_walk->fault) {
+			int ret;
 
-	return 0;
+			ret = hmm_vma_do_fault(walk, addr, &pfns[i]);
+			if (ret != -EAGAIN)
+				return ret;
+		}
+	}
+
+	return hmm_vma_walk->fault ? -EAGAIN : 0;
 }
 
 static int hmm_vma_walk_pmd(pmd_t *pmdp,
@@ -278,15 +336,18 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 			    unsigned long end,
 			    struct mm_walk *walk)
 {
-	struct hmm_range *range = walk->private;
+	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct hmm_range *range = hmm_vma_walk->range;
 	struct vm_area_struct *vma = walk->vma;
 	hmm_pfn_t *pfns = range->pfns;
 	unsigned long addr = start, i;
+	bool write_fault;
 	hmm_pfn_t flag;
 	pte_t *ptep;
 
 	i = (addr - range->start) >> PAGE_SHIFT;
 	flag = vma->vm_flags & VM_READ ? HMM_PFN_READ : 0;
+	write_fault = hmm_vma_walk->fault & hmm_vma_walk->write;
 
 	if (pmd_none(*pmdp) || pmd_trans_unstable(pmdp)) {
 		pmd_t pmd;
@@ -302,6 +363,9 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 		if (pmd_trans_huge(pmd) || pmd_devmap(pmd)) {
 			unsigned long pfn = pmd_pfn(pmd) + pte_index(addr);
 
+			if (write_fault && !pmd_write(pmd))
+				return hmm_vma_walk_clear(start, end, walk);
+
 			flag |= pmd_write(pmd) ? HMM_PFN_WRITE : 0;
 			for (; addr < end; addr += PAGE_SIZE, i++, pfn++)
 				pfns[i] = hmm_pfn_t_from_pfn(pfn) | flag;
@@ -325,14 +389,19 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 
 		if (pte_none(pte)) {
 			pfns[i] = HMM_PFN_EMPTY;
+			if (hmm_vma_walk->fault)
+				goto fault;
 			continue;
 		}
 
 		if (!pte_present(pte)) {
 			swp_entry_t entry;
 
 			entry = pte_to_swp_entry(pte);
-			if (!non_swap_entry(entry))
+			if (!non_swap_entry(entry)) {
+				if (hmm_vma_walk->fault)
+					goto fault;
 				continue;
+			}
 
 			/*
@@ -341,18 +410,38 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 			 */
 			if (is_device_entry(entry)) {
 				pfns[i] = hmm_pfn_t_from_pfn(swp_offset(entry));
-				if (is_write_device_entry(entry))
+				if (is_write_device_entry(entry)) {
 					pfns[i] |= HMM_PFN_WRITE;
+				} else if (write_fault)
+					goto fault;
 				pfns[i] |= HMM_PFN_DEVICE_UNADDRESSABLE;
 				pfns[i] |= flag;
-			} else if (!is_migration_entry(entry)) {
+			} else if (is_migration_entry(entry)) {
+				if (hmm_vma_walk->fault) {
+					pte_unmap(ptep - 1);
+					hmm_vma_walk->last = addr;
+					migration_entry_wait(vma->vm_mm,
+							     pmdp, addr);
+					return -EAGAIN;
+				}
+				continue;
+			} else {
+				/* Report error for everything else */
 				pfns[i] = HMM_PFN_ERROR;
 			}
 			continue;
 		}
 
+		if (write_fault && !pte_write(pte))
+			goto fault;
 		pfns[i] = hmm_pfn_t_from_pfn(pte_pfn(pte)) | flag;
 		pfns[i] |= pte_write(pte) ? HMM_PFN_WRITE : 0;
+		continue;
+
+fault:
+		pte_unmap(ptep - 1);
+		/* Fault all pages in range */
+		return hmm_vma_walk_clear(start, end, walk);
 	}
 	pte_unmap(ptep - 1);
 
@@ -385,6 +474,7 @@ int hmm_vma_get_pfns(struct vm_area_struct *vma,
 		     unsigned long end,
 		     hmm_pfn_t *pfns)
 {
+	struct hmm_vma_walk hmm_vma_walk;
 	struct mm_walk mm_walk;
 	struct hmm *hmm;
 
@@ -416,9 +506,12 @@ int hmm_vma_get_pfns(struct vm_area_struct *vma,
 	list_add_rcu(&range->list, &hmm->ranges);
 	spin_unlock(&hmm->lock);
 
+	hmm_vma_walk.fault = false;
+	hmm_vma_walk.range = range;
+	mm_walk.private = &hmm_vma_walk;
+
 	mm_walk.vma = vma;
 	mm_walk.mm = vma->vm_mm;
-	mm_walk.private = range;
 	mm_walk.pte_entry = NULL;
 	mm_walk.test_walk = NULL;
 	mm_walk.hugetlb_entry = NULL;
@@ -426,7 +519,6 @@ int hmm_vma_get_pfns(struct vm_area_struct *vma,
 	mm_walk.pte_hole = hmm_vma_walk_hole;
 
 	walk_page_range(start, end, &mm_walk);
-
 	return 0;
 }
 EXPORT_SYMBOL(hmm_vma_get_pfns);
@@ -453,7 +545,7 @@ EXPORT_SYMBOL(hmm_vma_get_pfns);
  *
  * There are two ways to use this :
  * again:
- *   hmm_vma_get_pfns(vma, range, start, end, pfns);
+ *   hmm_vma_get_pfns(vma, range, start, end, pfns); or hmm_vma_fault(...);
  *   trans = device_build_page_table_update_transaction(pfns);
  *   device_page_table_lock();
  *   if (!hmm_vma_range_done(vma, range)) {
@@ -464,7 +556,7 @@ EXPORT_SYMBOL(hmm_vma_get_pfns);
  *   device_page_table_unlock();
  *
  * Or:
- *   hmm_vma_get_pfns(vma, range, start, end, pfns);
+ *   hmm_vma_get_pfns(vma, range, start, end, pfns); or hmm_vma_fault(...);
  *   device_page_table_lock();
  *   hmm_vma_range_done(vma, range);
  *   device_update_page_table(pfns);
@@ -493,4 +585,127 @@ bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range)
 	return range->valid;
 }
 EXPORT_SYMBOL(hmm_vma_range_done);
+
+/*
+ * hmm_vma_fault() - try to fault some address in a virtual address range
+ * @vma: virtual memory area containing the virtual address range
+ * @range: used to track the validity of the pfns array content
+ * @start: fault range virtual start address (inclusive)
+ * @end: fault range virtual end address (exclusive)
+ * @pfns: array of hmm_pfn_t, only entry with fault flag set will be faulted
+ * @write: is it a write fault
+ * @block: allow blocking on fault (if true it sleeps and does not drop mmap_sem)
+ * Returns: 0 on success, error otherwise (-EAGAIN means mmap_sem has been dropped)
+ *
+ * This is similar to a regular CPU page fault except that it will not trigger
+ * any memory migration if the memory being faulted is not accessible by CPUs.
+ *
+ * On error, for the virtual address on which the fault failed, the function
+ * sets the HMM_PFN_ERROR flag in the corresponding pfn entry.
+ *
+ * Expected use pattern:
+ * retry:
+ *   down_read(&mm->mmap_sem);
+ *   // Find vma and address device wants to fault, initialize hmm_pfn_t
+ *   // array accordingly
+ *   ret = hmm_vma_fault(vma, range, start, end, pfns, write, block);
+ *   switch (ret) {
+ *   case -EAGAIN:
+ *     hmm_vma_range_done(vma, range);
+ *     // You might want to rate limit or yield to play nicely, you may
+ *     // also commit any valid pfn in the array assuming that you are
+ *     // getting true from hmm_vma_range_monitor_end()
+ *     goto retry;
+ *   case 0:
+ *     break;
+ *   default:
+ *     // Handle error !
+ *     up_read(&mm->mmap_sem)
+ *     return;
+ *   }
+ *   // Take device driver lock that serialize device page table update
+ *   driver_lock_device_page_table_update();
+ *   hmm_vma_range_done(vma, range);
+ *   // Commit pfns we got from hmm_vma_fault()
+ *   driver_unlock_device_page_table_update();
+ *   up_read(&mm->mmap_sem)
+ *
+ * YOU MUST CALL hmm_vma_range_done() AFTER THIS FUNCTION RETURNS SUCCESS (0)
+ * BEFORE FREEING THE range struct OR YOU WILL HAVE SERIOUS MEMORY CORRUPTION !
+ *
+ * YOU HAVE BEEN WARNED !
+ */
+int hmm_vma_fault(struct vm_area_struct *vma,
+		  struct hmm_range *range,
+		  unsigned long start,
+		  unsigned long end,
+		  hmm_pfn_t *pfns,
+		  bool write,
+		  bool block)
+{
+	struct hmm_vma_walk hmm_vma_walk;
+	struct mm_walk mm_walk;
+	struct hmm *hmm;
+	int ret;
+
+	/* Sanity check, this really should not happen ! */
+	if (start < vma->vm_start || start >= vma->vm_end)
+		return -EINVAL;
+	if (end < vma->vm_start || end > vma->vm_end)
+		return -EINVAL;
+
+	hmm = hmm_register(vma->vm_mm);
+	if (!hmm) {
+		hmm_pfns_clear(pfns, start, end);
+		return -ENOMEM;
+	}
+	/* Caller must have registered a mirror using hmm_mirror_register() */
+	if (!hmm->mmu_notifier.ops)
+		return -EINVAL;
+
+	/* Initialize range to track CPU page table update */
+	range->start = start;
+	range->pfns = pfns;
+	range->end = end;
+	spin_lock(&hmm->lock);
+	range->valid = true;
+	list_add_rcu(&range->list, &hmm->ranges);
+	spin_unlock(&hmm->lock);
+
+	/* FIXME support hugetlb fs */
+	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
+		hmm_pfns_special(pfns, start, end);
+		return 0;
+	}
+
+	hmm_vma_walk.fault = true;
+	hmm_vma_walk.write = write;
+	hmm_vma_walk.block = block;
+	hmm_vma_walk.range = range;
+	mm_walk.private = &hmm_vma_walk;
+	hmm_vma_walk.last = range->start;
+
+	mm_walk.vma = vma;
+	mm_walk.mm = vma->vm_mm;
+	mm_walk.pte_entry = NULL;
+	mm_walk.test_walk = NULL;
+	mm_walk.hugetlb_entry = NULL;
+	mm_walk.pmd_entry = hmm_vma_walk_pmd;
+	mm_walk.pte_hole = hmm_vma_walk_hole;
+
+	do {
+		ret = walk_page_range(start, end, &mm_walk);
+		start = hmm_vma_walk.last;
+	} while (ret == -EAGAIN);
+
+	if (ret) {
+		unsigned long i;
+
+		i = (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
+		hmm_pfns_clear(&pfns[i], hmm_vma_walk.last, end);
+		hmm_vma_range_done(vma, range);
+	}
+	return ret;
+}
+EXPORT_SYMBOL(hmm_vma_fault);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [HMM 12/16] mm/migrate: support un-addressable ZONE_DEVICE page in migration
  2017-04-05 20:40 ` Jérôme Glisse
@ 2017-04-05 20:40   ` Jérôme Glisse
  -1 siblings, 0 replies; 81+ messages in thread
From: Jérôme Glisse @ 2017-04-05 20:40 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, Naoya Horiguchi, David Nellans,
	Jérôme Glisse, Kirill A . Shutemov

Allow unmapping and restoring of the special swap entries used for
un-addressable ZONE_DEVICE memory.
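
For reference, below is a rough sketch (not part of this patch) of how such
a special entry round-trips, using the device entry helpers this series
relies on (make_device_entry, is_device_entry, device_entry_to_page); the
sketch_* names are made up:

/* Hide an un-addressable device page behind a non-present pte. */
static pte_t sketch_device_entry_pte(struct page *page, bool writable)
{
	swp_entry_t entry = make_device_entry(page, writable);

	/* Any CPU access to this pte now faults into the driver. */
	return swp_entry_to_pte(entry);
}

/* Recover the device page from such a pte, or NULL if it is not one. */
static struct page *sketch_device_entry_page(pte_t pte)
{
	swp_entry_t entry;

	if (!is_swap_pte(pte))
		return NULL;
	entry = pte_to_swp_entry(pte);
	if (!is_device_entry(entry))
		return NULL;
	return device_entry_to_page(entry);
}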

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/migrate.h |  10 +++-
 mm/migrate.c            | 136 ++++++++++++++++++++++++++++++++++++++----------
 mm/page_vma_mapped.c    |  10 ++++
 mm/rmap.c               |  25 +++++++++
 4 files changed, 152 insertions(+), 29 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 576b3f5..7dd875a 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -130,12 +130,18 @@ static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 
 #ifdef CONFIG_MIGRATION
 
+/*
+ * Watch out for 32-bit architectures with PAE: there an unsigned long might
+ * not have enough bits to store both the physical address and all the flags.
+ * So far we have enough room for all our flags.
+ */
 #define MIGRATE_PFN_VALID	(1UL << 0)
 #define MIGRATE_PFN_MIGRATE	(1UL << 1)
 #define MIGRATE_PFN_LOCKED	(1UL << 2)
 #define MIGRATE_PFN_WRITE	(1UL << 3)
-#define MIGRATE_PFN_ERROR	(1UL << 4)
-#define MIGRATE_PFN_SHIFT	5
+#define MIGRATE_PFN_DEVICE	(1UL << 4)
+#define MIGRATE_PFN_ERROR	(1UL << 5)
+#define MIGRATE_PFN_SHIFT	6
 
 static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
 {
diff --git a/mm/migrate.c b/mm/migrate.c
index 4486e30..f7a7661 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -40,6 +40,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/page_idle.h>
 #include <linux/page_owner.h>
+#include <linux/memremap.h>
 
 #include <asm/tlbflush.h>
 
@@ -232,7 +233,15 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
 			pte = arch_make_huge_pte(pte, vma, new, 0);
 		}
 #endif
-		flush_dcache_page(new);
+
+		if (unlikely(is_zone_device_page(new)) &&
+		    is_device_unaddressable_page(new)) {
+			entry = make_device_entry(new, pte_write(pte));
+			pte = swp_entry_to_pte(entry);
+			if (pte_swp_soft_dirty(*pvmw.pte))
+				pte = pte_mksoft_dirty(pte);
+		} else
+			flush_dcache_page(new);
 		set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
 
 		if (PageHuge(new)) {
@@ -304,6 +313,8 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
 	 */
 	if (!get_page_unless_zero(page))
 		goto out;
+	if (is_zone_device_page(page))
+		get_zone_device_page(page);
 	pte_unmap_unlock(ptep, ptl);
 	wait_on_page_locked(page);
 	put_page(page);
@@ -2138,17 +2149,40 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 		pte = *ptep;
 		pfn = pte_pfn(pte);
 
-		if (!pte_present(pte)) {
+		if (pte_none(pte)) {
 			mpfn = pfn = 0;
 			goto next;
 		}
 
+		if (!pte_present(pte)) {
+			mpfn = pfn = 0;
+
+			/*
+			 * Only care about the special page table entries of
+			 * unaddressable device pages. Other special swap entries
+			 * are not migratable, and we ignore regular swapped pages.
+			 */
+			entry = pte_to_swp_entry(pte);
+			if (!is_device_entry(entry))
+				goto next;
+
+			page = device_entry_to_page(entry);
+			mpfn = migrate_pfn(page_to_pfn(page))|
+				MIGRATE_PFN_DEVICE | MIGRATE_PFN_MIGRATE;
+			if (is_write_device_entry(entry))
+				mpfn |= MIGRATE_PFN_WRITE;
+		} else {
+			page = vm_normal_page(migrate->vma, addr, pte);
+			mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
+			mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
+		}
+
 		/* FIXME support THP */
-		page = vm_normal_page(migrate->vma, addr, pte);
 		if (!page || !page->mapping || PageTransCompound(page)) {
 			mpfn = pfn = 0;
 			goto next;
 		}
+		pfn = page_to_pfn(page);
 
 		/*
 		 * By getting a reference on the page we pin it and that blocks
@@ -2161,8 +2195,6 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 		 */
 		get_page(page);
 		migrate->cpages++;
-		mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
-		mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
 
 		/*
 		 * Optimize for the common case where page is only mapped once
@@ -2193,6 +2225,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 		}
 
 next:
+		migrate->dst[migrate->npages] = 0;
 		migrate->src[migrate->npages++] = mpfn;
 	}
 	arch_leave_lazy_mmu_mode();
@@ -2262,6 +2295,15 @@ static bool migrate_vma_check_page(struct page *page)
 	if (PageCompound(page))
 		return false;
 
+	/* Pages from ZONE_DEVICE have one extra reference */
+	if (is_zone_device_page(page)) {
+		if (is_device_unaddressable_page(page)) {
+			extra++;
+		} else
+			/* Other ZONE_DEVICE memory types are not supported */
+			return false;
+	}
+
 	if ((page_count(page) - extra) > page_mapcount(page))
 		return false;
 
@@ -2299,24 +2341,30 @@ static void migrate_vma_prepare(struct migrate_vma *migrate)
 			migrate->src[i] |= MIGRATE_PFN_LOCKED;
 		}
 
-		if (!PageLRU(page) && allow_drain) {
-			/* Drain CPU's pagevec */
-			lru_add_drain_all();
-			allow_drain = false;
-		}
+		/* ZONE_DEVICE pages are not on LRU */
+		if (!is_zone_device_page(page)) {
+			if (!PageLRU(page) && allow_drain) {
+				/* Drain CPU's pagevec */
+				lru_add_drain_all();
+				allow_drain = false;
+			}
 
-		if (isolate_lru_page(page)) {
-			if (remap) {
-				migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
-				migrate->cpages--;
-				restore++;
-			} else {
-				migrate->src[i] = 0;
-				unlock_page(page);
-				migrate->cpages--;
-				put_page(page);
+			if (isolate_lru_page(page)) {
+				if (remap) {
+					migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+					migrate->cpages--;
+					restore++;
+				} else {
+					migrate->src[i] = 0;
+					unlock_page(page);
+					migrate->cpages--;
+					put_page(page);
+				}
+				continue;
 			}
-			continue;
+
+			/* Drop the reference we took in collect */
+			put_page(page);
 		}
 
 		if (!migrate_vma_check_page(page)) {
@@ -2325,14 +2373,19 @@ static void migrate_vma_prepare(struct migrate_vma *migrate)
 				migrate->cpages--;
 				restore++;
 
-				get_page(page);
-				putback_lru_page(page);
+				if (!is_zone_device_page(page)) {
+					get_page(page);
+					putback_lru_page(page);
+				}
 			} else {
 				migrate->src[i] = 0;
 				unlock_page(page);
 				migrate->cpages--;
 
-				putback_lru_page(page);
+				if (!is_zone_device_page(page))
+					putback_lru_page(page);
+				else
+					put_page(page);
 			}
 		}
 	}
@@ -2403,7 +2456,10 @@ static void migrate_vma_unmap(struct migrate_vma *migrate)
 		unlock_page(page);
 		restore--;
 
-		putback_lru_page(page);
+		if (is_zone_device_page(page))
+			put_page(page);
+		else
+			putback_lru_page(page);
 	}
 }
 
@@ -2434,6 +2490,26 @@ static void migrate_vma_pages(struct migrate_vma *migrate)
 
 		mapping = page_mapping(page);
 
+		if (is_zone_device_page(newpage)) {
+			if (is_device_unaddressable_page(newpage)) {
+				/*
+				 * For now we only support private anonymous memory
+				 * when migrating to un-addressable device memory.
+				 */
+				if (mapping) {
+					migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+					continue;
+				}
+			} else {
+				/*
+				 * Other types of ZONE_DEVICE page are not
+				 * supported.
+				 */
+				migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+				continue;
+			}
+		}
+
 		r = migrate_page(mapping, newpage, page, MIGRATE_SYNC_NO_COPY);
 		if (r != MIGRATEPAGE_SUCCESS)
 			migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
@@ -2474,11 +2550,17 @@ static void migrate_vma_finalize(struct migrate_vma *migrate)
 		unlock_page(page);
 		migrate->cpages--;
 
-		putback_lru_page(page);
+		if (is_zone_device_page(page))
+			put_page(page);
+		else
+			putback_lru_page(page);
 
 		if (newpage != page) {
 			unlock_page(newpage);
-			putback_lru_page(newpage);
+			if (is_zone_device_page(newpage))
+				put_page(newpage);
+			else
+				putback_lru_page(newpage);
 		}
 	}
 }
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index de9c40d..12b8c1a 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -48,6 +48,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
 		if (!is_swap_pte(*pvmw->pte))
 			return false;
 		entry = pte_to_swp_entry(*pvmw->pte);
+
 		if (!is_migration_entry(entry))
 			return false;
 		if (migration_entry_to_page(entry) - pvmw->page >=
@@ -60,6 +61,15 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
 		WARN_ON_ONCE(1);
 #endif
 	} else {
+		if (is_swap_pte(*pvmw->pte)) {
+			swp_entry_t entry;
+
+			entry = pte_to_swp_entry(*pvmw->pte);
+			if (is_device_entry(entry) &&
+			    device_entry_to_page(entry) == pvmw->page)
+				return true;
+		}
+
 		if (!pte_present(*pvmw->pte))
 			return false;
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 81065b2..1583517 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -61,6 +61,7 @@
 #include <linux/hugetlb.h>
 #include <linux/backing-dev.h>
 #include <linux/page_idle.h>
+#include <linux/memremap.h>
 
 #include <asm/tlbflush.h>
 
@@ -1306,6 +1307,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
 		return true;
 
+	if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&
+	    is_zone_device_page(page) && !is_device_unaddressable_page(page))
+		return true;
+
 	if (flags & TTU_SPLIT_HUGE_PMD) {
 		split_huge_pmd_address(vma, address,
 				flags & TTU_MIGRATION, page);
@@ -1341,6 +1346,26 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
 		address = pvmw.address;
 
+		if (IS_ENABLED(CONFIG_MIGRATION) &&
+		    (flags & TTU_MIGRATION) &&
+		    is_zone_device_page(page)) {
+			swp_entry_t entry;
+			pte_t swp_pte;
+
+			pteval = ptep_get_and_clear(mm, address, pvmw.pte);
+
+			/*
+			 * Store the pfn of the page in a special migration
+			 * pte. do_swap_page() will wait until the migration
+			 * pte is removed and then restart fault handling.
+			 */
+			entry = make_migration_entry(page, 0);
+			swp_pte = swp_entry_to_pte(entry);
+			if (pte_soft_dirty(pteval))
+				swp_pte = pte_swp_mksoft_dirty(swp_pte);
+			set_pte_at(mm, address, pvmw.pte, swp_pte);
+			goto discard;
+		}
 
 		if (!(flags & TTU_IGNORE_ACCESS)) {
 			if (ptep_clear_flush_young_notify(vma, address,
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [HMM 13/16] mm/migrate: allow migrate_vma() to alloc new page on empty entry
  2017-04-05 20:40 ` Jérôme Glisse
@ 2017-04-05 20:40   ` Jérôme Glisse
  -1 siblings, 0 replies; 81+ messages in thread
From: Jérôme Glisse @ 2017-04-05 20:40 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, Naoya Horiguchi, David Nellans,
	Jérôme Glisse

This allows the caller of migrate_vma() to allocate new pages for empty CPU
page table entries. Only anonymous memory is supported, and no new page will
be instantiated if userfaultfd is armed.

This is useful for device drivers that want to migrate a range of virtual
addresses and would rather allocate new memory than fault later on.
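
As a rough sketch only (not part of this patch, and assuming the
migrate_vma_ops style of callbacks introduced earlier in this series), a
driver's allocation callback could now also provide destination pages for
empty entries. sketch_dev_alloc_page() and sketch_dev_copy_page() are
made-up placeholders for driver-specific code, and the allocated page is
assumed to be returned locked:

static void sketch_alloc_and_copy(struct vm_area_struct *vma,
				  const unsigned long *src,
				  unsigned long *dst,
				  unsigned long start,
				  unsigned long end,
				  void *private)
{
	unsigned long addr, i;

	for (addr = start, i = 0; addr < end; addr += PAGE_SIZE, i++) {
		struct page *spage, *dpage;

		/*
		 * A zero src entry is an empty CPU page table entry; with
		 * this patch the driver can still provide a destination
		 * page and migrate_vma() will instantiate it. Non-zero
		 * entries that lost MIGRATE_PFN_MIGRATE can not be migrated.
		 */
		if (src[i] && !(src[i] & MIGRATE_PFN_MIGRATE))
			continue;

		dpage = sketch_dev_alloc_page(private, addr);
		if (!dpage)
			continue;

		spage = migrate_pfn_to_page(src[i]);
		if (spage)
			sketch_dev_copy_page(private, dpage, spage);
		/* With no source page, the new page content is up to the driver. */

		dst[i] = migrate_pfn(page_to_pfn(dpage)) |
			 MIGRATE_PFN_VALID | MIGRATE_PFN_LOCKED;
	}
}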

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 mm/migrate.c | 131 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 127 insertions(+), 4 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index f7a7661..cbaa4f2 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -41,6 +41,7 @@
 #include <linux/page_idle.h>
 #include <linux/page_owner.h>
 #include <linux/memremap.h>
+#include <linux/userfaultfd_k.h>
 
 #include <asm/tlbflush.h>
 
@@ -2111,9 +2112,10 @@ static int migrate_vma_collect_hole(unsigned long start,
 				    struct mm_walk *walk)
 {
 	struct migrate_vma *migrate = walk->private;
-	unsigned long addr, next;
+	unsigned long addr;
 
 	for (addr = start & PAGE_MASK; addr < end; addr += PAGE_SIZE) {
+		migrate->cpages++;
 		migrate->dst[migrate->npages] = 0;
 		migrate->src[migrate->npages++] = 0;
 	}
@@ -2150,6 +2152,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 		pfn = pte_pfn(pte);
 
 		if (pte_none(pte)) {
+			migrate->cpages++;
 			mpfn = pfn = 0;
 			goto next;
 		}
@@ -2463,6 +2466,114 @@ static void migrate_vma_unmap(struct migrate_vma *migrate)
 	}
 }
 
+static void migrate_vma_insert_page(struct migrate_vma *migrate,
+				    unsigned long addr,
+				    struct page *page,
+				    unsigned long *src,
+				    unsigned long *dst)
+{
+	struct vm_area_struct *vma = migrate->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	struct mem_cgroup *memcg;
+	spinlock_t *ptl;
+	pgd_t *pgdp;
+	pud_t *pudp;
+	pmd_t *pmdp;
+	pte_t *ptep;
+	pte_t entry;
+
+	/* Only allow populating anonymous memory */
+	if (!vma_is_anonymous(vma))
+		goto abort;
+
+	pgdp = pgd_offset(mm, addr);
+	pudp = pud_alloc(mm, pgdp, addr);
+	if (!pudp)
+		goto abort;
+	pmdp = pmd_alloc(mm, pudp, addr);
+	if (!pmdp)
+		goto abort;
+
+	if (pmd_trans_unstable(pmdp) || pmd_devmap(*pmdp))
+		goto abort;
+
+	/*
+	 * Use pte_alloc() instead of pte_alloc_map().  We can't run
+	 * pte_offset_map() on pmds where a huge pmd might be created
+	 * from a different thread.
+	 *
+	 * pte_alloc_map() is safe to use under down_write(mmap_sem) or when
+	 * parallel threads are excluded by other means.
+	 *
+	 * Here we only have down_read(mmap_sem).
+	 */
+	if (pte_alloc(mm, pmdp, addr))
+		goto abort;
+
+	/* See the comment in pte_alloc_one_map() */
+	if (unlikely(pmd_trans_unstable(pmdp)))
+		goto abort;
+
+	if (unlikely(anon_vma_prepare(vma)))
+		goto abort;
+	if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL, &memcg, false))
+		goto abort;
+
+	/*
+	 * The memory barrier inside __SetPageUptodate makes sure that
+	 * preceding stores to the page contents become visible before
+	 * the set_pte_at() write.
+	 */
+	__SetPageUptodate(page);
+
+	if (is_zone_device_page(page) && is_device_unaddressable_page(page)) {
+		swp_entry_t swp_entry;
+
+		swp_entry = make_device_entry(page, vma->vm_flags & VM_WRITE);
+		entry = swp_entry_to_pte(swp_entry);
+	} else {
+		entry = mk_pte(page, vma->vm_page_prot);
+		if (vma->vm_flags & VM_WRITE)
+			entry = pte_mkwrite(pte_mkdirty(entry));
+	}
+
+	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+	if (!pte_none(*ptep)) {
+		pte_unmap_unlock(ptep, ptl);
+		mem_cgroup_cancel_charge(page, memcg, false);
+		goto abort;
+	}
+
+	/*
+	 * Check for userfaultfd but do not deliver the fault. Instead,
+	 * just back off.
+	 */
+	if (userfaultfd_missing(vma)) {
+		pte_unmap_unlock(ptep, ptl);
+		mem_cgroup_cancel_charge(page, memcg, false);
+		goto abort;
+	}
+
+	inc_mm_counter(mm, MM_ANONPAGES);
+	page_add_new_anon_rmap(page, vma, addr, false);
+	mem_cgroup_commit_charge(page, memcg, false, false);
+	if (!is_zone_device_page(page))
+		lru_cache_add_active_or_unevictable(page, vma);
+	set_pte_at(mm, addr, ptep, entry);
+
+	/* Take a reference on the page */
+	get_page(page);
+
+	/* No need to invalidate - it was non-present before */
+	update_mmu_cache(vma, addr, ptep);
+	pte_unmap_unlock(ptep, ptl);
+	*src = MIGRATE_PFN_MIGRATE;
+	return;
+
+abort:
+	*src &= ~MIGRATE_PFN_MIGRATE;
+}
+
 /*
  * migrate_vma_pages() - migrate meta-data from src page to dst page
  * @migrate: migrate struct containing all migration information
@@ -2483,10 +2594,16 @@ static void migrate_vma_pages(struct migrate_vma *migrate)
 		struct address_space *mapping;
 		int r;
 
-		if (!page || !newpage)
+		if (!newpage) {
+			migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
 			continue;
-		if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE))
+		} else if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE)) {
+			if (!page)
+				migrate_vma_insert_page(migrate, addr, newpage,
+							&migrate->src[i],
+							&migrate->dst[i]);
 			continue;
+		}
 
 		mapping = page_mapping(page);
 
@@ -2536,8 +2653,14 @@ static void migrate_vma_finalize(struct migrate_vma *migrate)
 		struct page *newpage = migrate_pfn_to_page(migrate->dst[i]);
 		struct page *page = migrate_pfn_to_page(migrate->src[i]);
 
-		if (!page)
+		if (!page) {
+			if (newpage) {
+				unlock_page(newpage);
+				put_page(newpage);
+			}
 			continue;
+		}
+
 		if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE) || !newpage) {
 			if (newpage) {
 				unlock_page(newpage);
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [HMM 14/16] mm/hmm/devmem: device memory hotplug using ZONE_DEVICE
  2017-04-05 20:40 ` Jérôme Glisse
@ 2017-04-05 20:40   ` Jérôme Glisse
  -1 siblings, 0 replies; 81+ messages in thread
From: Jérôme Glisse @ 2017-04-05 20:40 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, Naoya Horiguchi, David Nellans,
	Jérôme Glisse, Evgeny Baskakov, Mark Hairgrove,
	Sherry Cheung, Subhash Gutti

This introduces a simple struct and associated helpers for device drivers
to use when hotplugging un-addressable device memory as ZONE_DEVICE. It
finds an unused physical address range and triggers memory hotplug for it,
which allocates and initializes struct pages for the device memory.
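
A minimal driver-side sketch of the intended usage (not part of this patch;
the sketch_* callbacks are made up and error handling is trimmed):

/* Return the device page to the driver's own allocator once it is free. */
static void sketch_devmem_free(struct hmm_devmem *devmem, struct page *page)
{
}

/*
 * CPU fault on un-addressable device memory: the driver is expected to
 * migrate the data back to system memory, typically through the
 * hmm_devmem_fault_range() helper added by this patch.
 */
static int sketch_devmem_fault(struct hmm_devmem *devmem,
			       struct vm_area_struct *vma,
			       unsigned long addr,
			       struct page *page,
			       unsigned int flags,
			       pmd_t *pmdp)
{
	return VM_FAULT_SIGBUS;
}

static const struct hmm_devmem_ops sketch_devmem_ops = {
	.free	= sketch_devmem_free,
	.fault	= sketch_devmem_fault,
};

static struct hmm_devmem *sketch_probe_device_memory(struct device *dev,
						     unsigned long size)
{
	struct hmm_devmem *devmem;

	devmem = hmm_devmem_add(&sketch_devmem_ops, dev, size);
	if (IS_ERR(devmem))
		return devmem;

	/* struct pages now exist for pfns in [pfn_first, pfn_last). */
	dev_info(dev, "device memory pfns [0x%lx, 0x%lx)\n",
		 devmem->pfn_first, devmem->pfn_last);
	return devmem;
}

The driver is expected to pair this with hmm_devmem_remove() before the
device goes away, as the helper documentation below spells out.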

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/hmm.h | 114 +++++++++++++++
 mm/Kconfig          |   9 ++
 mm/hmm.c            | 398 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 521 insertions(+)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index d267989..50a1115 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -72,6 +72,11 @@
 
 #if IS_ENABLED(CONFIG_HMM)
 
+#include <linux/migrate.h>
+#include <linux/memremap.h>
+#include <linux/completion.h>
+
+
 struct hmm;
 
 /*
@@ -322,6 +327,115 @@ int hmm_vma_fault(struct vm_area_struct *vma,
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
 
+#if IS_ENABLED(CONFIG_HMM_DEVMEM)
+struct hmm_devmem;
+
+struct page *hmm_vma_alloc_locked_page(struct vm_area_struct *vma,
+				       unsigned long addr);
+
+/*
+ * struct hmm_devmem_ops - callback for ZONE_DEVICE memory events
+ *
+ * @free: called when the refcount on a page reaches 1 and it is thus no longer used
+ * @fault: called when there is a CPU page fault on un-addressable device memory
+ */
+struct hmm_devmem_ops {
+	void (*free)(struct hmm_devmem *devmem, struct page *page);
+	int (*fault)(struct hmm_devmem *devmem,
+		     struct vm_area_struct *vma,
+		     unsigned long addr,
+		     struct page *page,
+		     unsigned int flags,
+		     pmd_t *pmdp);
+};
+
+/*
+ * struct hmm_devmem - track device memory
+ *
+ * @completion: completion object for device memory
+ * @pfn_first: first pfn for this resource (set by hmm_devmem_add())
+ * @pfn_last: last pfn for this resource (set by hmm_devmem_add())
+ * @resource: IO resource reserved for this chunk of memory
+ * @pagemap: device page map for that chunk
+ * @device: device to bind resource to
+ * @ops: memory operations callback
+ * @ref: per CPU refcount
+ *
+ * This is a helper structure for device drivers that do not wish to implement
+ * the gory details related to hotplugging new memory and allocating struct
+ * pages.
+ *
+ * Device drivers can directly use ZONE_DEVICE memory on their own if they
+ * wish to do so.
+ */
+struct hmm_devmem {
+	struct completion		completion;
+	unsigned long			pfn_first;
+	unsigned long			pfn_last;
+	struct resource			*resource;
+	struct device			*device;
+	struct dev_pagemap		pagemap;
+	const struct hmm_devmem_ops	*ops;
+	struct percpu_ref		ref;
+};
+
+/*
+ * To add (hotplug) device memory, HMM assumes that there is no real resource
+ * that reserves a range in the physical address space (this is intended to be
+ * used for unaddressable device memory). It will reserve a physical range big
+ * enough and allocate struct pages for it.
+ *
+ * The device driver can wrap the hmm_devmem struct inside a private device
+ * driver struct. The device driver must call hmm_devmem_remove() before the
+ * device goes away and before freeing the hmm_devmem struct memory.
+ */
+struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
+				  struct device *device,
+				  unsigned long size);
+void hmm_devmem_remove(struct hmm_devmem *devmem);
+
+int hmm_devmem_fault_range(struct hmm_devmem *devmem,
+			   struct vm_area_struct *vma,
+			   const struct migrate_vma_ops *ops,
+			   unsigned long *src,
+			   unsigned long *dst,
+			   unsigned long start,
+			   unsigned long addr,
+			   unsigned long end,
+			   void *private);
+
+/*
+ * hmm_devmem_page_set_drvdata - set per-page driver data field
+ *
+ * @page: pointer to struct page
+ * @data: driver data value to set
+ *
+ * Because the page can not be on an lru, we have an unsigned long that the
+ * driver can use to store a per-page field. This is just a simple helper for
+ * doing that.
+ */
+static inline void hmm_devmem_page_set_drvdata(struct page *page,
+					       unsigned long data)
+{
+	unsigned long *drvdata = (unsigned long *)&page->pgmap;
+
+	drvdata[1] = data;
+}
+
+/*
+ * hmm_devmem_page_get_drvdata - get per page driver data field
+ *
+ * @page: pointer to struct page
+ * Return: driver data value
+ */
+static inline unsigned long hmm_devmem_page_get_drvdata(struct page *page)
+{
+	unsigned long *drvdata = (unsigned long *)&page->pgmap;
+
+	return drvdata[1];
+}
+#endif /* IS_ENABLED(CONFIG_HMM_DEVMEM) */
+
+
 /* Below are for HMM internal use only! Not to be used by device driver! */
 void hmm_mm_destroy(struct mm_struct *mm);
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 134e300..96dcf61 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -315,6 +315,15 @@ config HMM_MIRROR
 	  page tables (at PAGE_SIZE granularity), and must be able to recover from
 	  the resulting potential page faults.
 
+config HMM_DEVMEM
+	bool "HMM device memory helpers (to leverage ZONE_DEVICE)"
+	depends on MMU && 64BIT
+	select HMM
+	help
+	  HMM devmem is a set of helper routines to leverage the ZONE_DEVICE
+	  feature. This avoids having each device driver replicate a lot of
+	  boilerplate code.
+
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
 
diff --git a/mm/hmm.c b/mm/hmm.c
index be88807..ec3eb49 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -23,10 +23,15 @@
 #include <linux/swap.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
+#include <linux/mmzone.h>
+#include <linux/pagemap.h>
 #include <linux/swapops.h>
 #include <linux/hugetlb.h>
+#include <linux/memremap.h>
 #include <linux/mmu_notifier.h>
 
+#define SECTION_SIZE (1UL << PA_SECTION_SHIFT)
+
 static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
 
 
@@ -709,3 +714,396 @@ int hmm_vma_fault(struct vm_area_struct *vma,
 }
 EXPORT_SYMBOL(hmm_vma_fault);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
+
+
+#if IS_ENABLED(CONFIG_HMM_DEVMEM)
+struct page *hmm_vma_alloc_locked_page(struct vm_area_struct *vma,
+				       unsigned long addr)
+{
+	struct page *page;
+
+	page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+	if (!page)
+		return NULL;
+	lock_page(page);
+	return page;
+}
+EXPORT_SYMBOL(hmm_vma_alloc_locked_page);
+
+
+static void hmm_devmem_ref_release(struct percpu_ref *ref)
+{
+	struct hmm_devmem *devmem;
+
+	devmem = container_of(ref, struct hmm_devmem, ref);
+	complete(&devmem->completion);
+}
+
+static void hmm_devmem_ref_exit(void *data)
+{
+	struct percpu_ref *ref = data;
+	struct hmm_devmem *devmem;
+
+	devmem = container_of(ref, struct hmm_devmem, ref);
+	percpu_ref_exit(ref);
+	devm_remove_action(devmem->device, &hmm_devmem_ref_exit, data);
+}
+
+static void hmm_devmem_ref_kill(void *data)
+{
+	struct percpu_ref *ref = data;
+	struct hmm_devmem *devmem;
+
+	devmem = container_of(ref, struct hmm_devmem, ref);
+	percpu_ref_kill(ref);
+	wait_for_completion(&devmem->completion);
+	devm_remove_action(devmem->device, &hmm_devmem_ref_kill, data);
+}
+
+static int hmm_devmem_fault(struct vm_area_struct *vma,
+			    unsigned long addr,
+			    struct page *page,
+			    unsigned int flags,
+			    pmd_t *pmdp)
+{
+	struct hmm_devmem *devmem = page->pgmap->data;
+
+	return devmem->ops->fault(devmem, vma, addr, page, flags, pmdp);
+}
+
+static void hmm_devmem_free(struct page *page, void *data)
+{
+	struct hmm_devmem *devmem = data;
+
+	devmem->ops->free(devmem, page);
+}
+
+static DEFINE_MUTEX(hmm_devmem_lock);
+static RADIX_TREE(hmm_devmem_radix, GFP_KERNEL);
+#define SECTION_SIZE (1UL << PA_SECTION_SHIFT)
+
+static void hmm_devmem_radix_release(struct resource *resource)
+{
+	resource_size_t key, align_start, align_size, align_end;
+
+	align_start = resource->start & ~(SECTION_SIZE - 1);
+	align_size = ALIGN(resource_size(resource), SECTION_SIZE);
+	align_end = align_start + align_size - 1;
+
+	mutex_lock(&hmm_devmem_lock);
+	for (key = resource->start; key <= resource->end; key += SECTION_SIZE)
+		radix_tree_delete(&hmm_devmem_radix, key >> PA_SECTION_SHIFT);
+	mutex_unlock(&hmm_devmem_lock);
+}
+
+static void hmm_devmem_release(struct device *dev, void *data)
+{
+	struct hmm_devmem *devmem = data;
+	resource_size_t align_start, align_size;
+	struct resource *resource = devmem->resource;
+
+	if (percpu_ref_tryget_live(&devmem->ref)) {
+		dev_WARN(dev, "%s: page mapping is still live!\n", __func__);
+		percpu_ref_put(&devmem->ref);
+	}
+
+	/* pages are dead and unused, undo the arch mapping */
+	align_start = resource->start & ~(SECTION_SIZE - 1);
+	align_size = ALIGN(resource_size(resource), SECTION_SIZE);
+
+	mem_hotplug_begin();
+	arch_remove_memory(align_start, align_size, devmem->pagemap.type);
+	mem_hotplug_done();
+
+	untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
+	hmm_devmem_radix_release(resource);
+}
+
+static struct hmm_devmem *hmm_devmem_find(resource_size_t phys)
+{
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	return radix_tree_lookup(&hmm_devmem_radix, phys >> PA_SECTION_SHIFT);
+}
+
+static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
+{
+	resource_size_t key, align_start, align_size, align_end;
+	struct device *device = devmem->device;
+	pgprot_t pgprot = PAGE_KERNEL;
+	int ret, nid, is_ram;
+	unsigned long pfn;
+
+	align_start = devmem->resource->start & ~(SECTION_SIZE - 1);
+	align_size = ALIGN(devmem->resource->start +
+			   resource_size(devmem->resource),
+			   SECTION_SIZE) - align_start;
+
+	is_ram = region_intersects(align_start, align_size,
+				   IORESOURCE_SYSTEM_RAM,
+				   IORES_DESC_NONE);
+	if (is_ram == REGION_MIXED) {
+		WARN_ONCE(1, "%s attempted on mixed region %pr\n",
+				__func__, devmem->resource);
+		return -ENXIO;
+	}
+	if (is_ram == REGION_INTERSECTS)
+		return -ENXIO;
+
+	devmem->pagemap.type = MEMORY_DEVICE_UNADDRESSABLE;
+	devmem->pagemap.res = devmem->resource;
+	devmem->pagemap.page_fault = hmm_devmem_fault;
+	devmem->pagemap.page_free = hmm_devmem_free;
+	devmem->pagemap.dev = devmem->device;
+	devmem->pagemap.ref = &devmem->ref;
+	devmem->pagemap.data = devmem;
+
+	mutex_lock(&hmm_devmem_lock);
+	align_end = align_start + align_size - 1;
+	for (key = align_start; key <= align_end; key += SECTION_SIZE) {
+		struct hmm_devmem *dup;
+
+		rcu_read_lock();
+		dup = hmm_devmem_find(key);
+		rcu_read_unlock();
+		if (dup) {
+			dev_err(device, "%s: collides with mapping for %s\n",
+				__func__, dev_name(dup->device));
+			mutex_unlock(&hmm_devmem_lock);
+			ret = -EBUSY;
+			goto error;
+		}
+		ret = radix_tree_insert(&hmm_devmem_radix,
+					key >> PA_SECTION_SHIFT,
+					devmem);
+		if (ret) {
+			dev_err(device, "%s: failed: %d\n", __func__, ret);
+			mutex_unlock(&hmm_devmem_lock);
+			goto error_radix;
+		}
+	}
+	mutex_unlock(&hmm_devmem_lock);
+
+	nid = dev_to_node(device);
+	if (nid < 0)
+		nid = numa_mem_id();
+
+	ret = track_pfn_remap(NULL, &pgprot, PHYS_PFN(align_start),
+			      0, align_size);
+	if (ret)
+		goto error_radix;
+
+	mem_hotplug_begin();
+	ret = arch_add_memory(nid, align_start, align_size,
+			      devmem->pagemap.type);
+	mem_hotplug_done();
+	if (ret)
+		goto error_add_memory;
+
+	for (pfn = devmem->pfn_first; pfn < devmem->pfn_last; pfn++) {
+		struct page *page = pfn_to_page(pfn);
+
+		/*
+		 * ZONE_DEVICE pages union ->lru with a ->pgmap back
+		 * pointer.  It is a bug if a ZONE_DEVICE page is ever
+		 * freed or placed on a driver-private list. Therefore,
+		 * seed the storage with LIST_POISON* values.
+		 */
+		list_del(&page->lru);
+		page->pgmap = &devmem->pagemap;
+	}
+	return 0;
+
+error_add_memory:
+	untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
+error_radix:
+	hmm_devmem_radix_release(devmem->resource);
+error:
+	return ret;
+}
+
+static int hmm_devmem_match(struct device *dev, void *data, void *match_data)
+{
+	struct hmm_devmem *devmem = data;
+
+	return devmem->resource == match_data;
+}
+
+static void hmm_devmem_pages_remove(struct hmm_devmem *devmem)
+{
+	devres_release(devmem->device, &hmm_devmem_release,
+		       &hmm_devmem_match, devmem->resource);
+}
+
+/*
+ * hmm_devmem_add() - hotplug ZONE_DEVICE memory for device memory
+ *
+ * @ops: memory event device driver callback (see struct hmm_devmem_ops)
+ * @device: device struct to bind the resource to
+ * @size: size in bytes of the device memory to add
+ * Returns: pointer to new hmm_devmem struct, ERR_PTR otherwise
+ *
+ * This first finds an empty range of physical addresses big enough to contain
+ * the new resource, and then hotplugs it as ZONE_DEVICE memory, which in turn
+ * allocates struct pages. It does not do anything beyond that; all events
+ * affecting the memory will go through the various callbacks provided by
+ * hmm_devmem_ops struct.
+ */
+struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
+				  struct device *device,
+				  unsigned long size)
+{
+	struct hmm_devmem *devmem;
+	resource_size_t addr;
+	int ret;
+
+	devmem = devres_alloc_node(&hmm_devmem_release, sizeof(*devmem),
+				   GFP_KERNEL, dev_to_node(device));
+	if (!devmem)
+		return ERR_PTR(-ENOMEM);
+
+	init_completion(&devmem->completion);
+	devmem->pfn_first = -1UL;
+	devmem->pfn_last = -1UL;
+	devmem->resource = NULL;
+	devmem->device = device;
+	devmem->ops = ops;
+
+	ret = percpu_ref_init(&devmem->ref, &hmm_devmem_ref_release,
+			      0, GFP_KERNEL);
+	if (ret)
+		goto error_percpu_ref;
+
+	ret = devm_add_action(device, hmm_devmem_ref_exit, &devmem->ref);
+	if (ret)
+		goto error_devm_add_action;
+
+	size = ALIGN(size, SECTION_SIZE);
+	addr = (iomem_resource.end + 1ULL) - size;
+
+	/*
+	 * FIXME add a new helper to quickly walk resource tree and find free
+	 * range
+	 *
+	 * FIXME what about ioport_resource resource ?
+	 */
+	for (; addr > size && addr >= iomem_resource.start; addr -= size) {
+		ret = region_intersects(addr, size, 0, IORES_DESC_NONE);
+		if (ret != REGION_DISJOINT)
+			continue;
+
+		devmem->resource = devm_request_mem_region(device, addr, size,
+							   dev_name(device));
+		if (!devmem->resource) {
+			ret = -ENOMEM;
+			goto error_no_resource;
+		}
+		break;
+	}
+	if (!devmem->resource) {
+		ret = -ERANGE;
+		goto error_no_resource;
+	}
+
+	devmem->resource->desc = IORES_DESC_DEVICE_MEMORY_UNADDRESSABLE;
+	devmem->pfn_first = devmem->resource->start >> PAGE_SHIFT;
+	devmem->pfn_last = devmem->pfn_first +
+			   (resource_size(devmem->resource) >> PAGE_SHIFT);
+
+	ret = hmm_devmem_pages_create(devmem);
+	if (ret)
+		goto error_pages;
+
+	devres_add(device, devmem);
+
+	ret = devm_add_action(device, hmm_devmem_ref_kill, &devmem->ref);
+	if (ret) {
+		hmm_devmem_remove(devmem);
+		return ERR_PTR(ret);
+	}
+
+	return devmem;
+
+error_pages:
+	devm_release_mem_region(device, devmem->resource->start,
+				resource_size(devmem->resource));
+error_no_resource:
+error_devm_add_action:
+	hmm_devmem_ref_kill(&devmem->ref);
+	hmm_devmem_ref_exit(&devmem->ref);
+error_percpu_ref:
+	devres_free(devmem);
+	return ERR_PTR(ret);
+}
+EXPORT_SYMBOL(hmm_devmem_add);
+
+/*
+ * hmm_devmem_remove() - remove device memory (kill and free ZONE_DEVICE)
+ *
+ * @devmem: hmm_devmem struct used to track and manage the ZONE_DEVICE memory
+ *
+ * This will hot-unplug memory that was hotplugged by hmm_devmem_add on behalf
+ * of the device driver. It will free the struct pages and remove the resource that
+ * reserved the physical address range for this device memory.
+ */
+void hmm_devmem_remove(struct hmm_devmem *devmem)
+{
+	resource_size_t start, size;
+	struct device *device;
+
+	if (!devmem)
+		return;
+
+	device = devmem->device;
+	start = devmem->resource->start;
+	size = resource_size(devmem->resource);
+
+	hmm_devmem_ref_kill(&devmem->ref);
+	hmm_devmem_ref_exit(&devmem->ref);
+	hmm_devmem_pages_remove(devmem);
+
+	devm_release_mem_region(device, start, size);
+}
+EXPORT_SYMBOL(hmm_devmem_remove);
+
+/*
+ * hmm_devmem_fault_range() - migrate back a virtual range of memory
+ *
+ * @devmem: hmm_devmem struct used to track and manage the ZONE_DEVICE memory
+ * @vma: virtual memory area containing the range to be migrated
+ * @ops: migration callback for allocating destination memory and copying
+ * @src: array of unsigned long containing source pfns
+ * @dst: array of unsigned long containing destination pfns
+ * @start: start address of the range to migrate (inclusive)
+ * @addr: fault address (must be inside the range)
+ * @end: end address of the range to migrate (exclusive)
+ * @private: pointer passed back to each of the callbacks
+ * Returns: 0 on success, VM_FAULT_SIGBUS on error
+ *
+ * This is a wrapper around migrate_vma() which checks the migration status
+ * for a given fault address and returns the corresponding page fault handler
+ * status. That will be 0 on success, or VM_FAULT_SIGBUS if migration failed
+ * for the faulting address.
+ *
+ * This is a helper intended to be used by the ZONE_DEVICE fault handler.
+ */
+int hmm_devmem_fault_range(struct hmm_devmem *devmem,
+			   struct vm_area_struct *vma,
+			   const struct migrate_vma_ops *ops,
+			   unsigned long *src,
+			   unsigned long *dst,
+			   unsigned long start,
+			   unsigned long addr,
+			   unsigned long end,
+			   void *private)
+{
+	if (migrate_vma(ops, vma, start, end, src, dst, private))
+		return VM_FAULT_SIGBUS;
+
+	if (dst[(addr - start) >> PAGE_SHIFT] & MIGRATE_PFN_ERROR)
+		return VM_FAULT_SIGBUS;
+
+	return 0;
+}
+EXPORT_SYMBOL(hmm_devmem_fault_range);
+#endif /* IS_ENABLED(CONFIG_HMM_DEVMEM) */
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [HMM 15/16] mm/hmm/devmem: dummy HMM device for ZONE_DEVICE memory v2
  2017-04-05 20:40 ` Jérôme Glisse
@ 2017-04-05 20:40   ` Jérôme Glisse
  -1 siblings, 0 replies; 81+ messages in thread
From: Jérôme Glisse @ 2017-04-05 20:40 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, Naoya Horiguchi, David Nellans,
	Jérôme Glisse, Evgeny Baskakov, Mark Hairgrove,
	Sherry Cheung, Subhash Gutti

This introduces a dummy HMM device class so a device driver can use it
to create an hmm_device for the sole purpose of registering device
memory. It is useful to device drivers that want to manage multiple
physical device memories under the same struct device umbrella.

Changed since v1:
  - Improve commit message
  - Add drvdata parameter to set on struct device

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/hmm.h | 22 +++++++++++-
 mm/hmm.c            | 96 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 117 insertions(+), 1 deletion(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 50a1115..374e5fd 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -72,11 +72,11 @@
 
 #if IS_ENABLED(CONFIG_HMM)
 
+#include <linux/device.h>
 #include <linux/migrate.h>
 #include <linux/memremap.h>
 #include <linux/completion.h>
 
-
 struct hmm;
 
 /*
@@ -433,6 +433,26 @@ static inline unsigned long hmm_devmem_page_get_drvdata(struct page *page)
 
 	return drvdata[1];
 }
+
+
+/*
+ * struct hmm_device - fake device to hang device memory onto
+ *
+ * @device: device struct
+ * @minor: device minor number
+ */
+struct hmm_device {
+	struct device		device;
+	unsigned int		minor;
+};
+
+/*
+ * A device driver that wants to handle multiple devices' memory through a
+ * single fake device can use hmm_device to do so. This is purely a helper;
+ * it is not strictly needed in order to make use of any HMM functionality.
+ */
+struct hmm_device *hmm_device_new(void *drvdata);
+void hmm_device_put(struct hmm_device *hmm_device);
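+
+/*
+ * Illustrative sketch only (my_devmem_ops and my_drvdata below are
+ * hypothetical driver-side names, not part of this patch): a driver that
+ * manages several physical devices' memory from one module could do:
+ *
+ *	hmm_device = hmm_device_new(my_drvdata);
+ *	if (IS_ERR(hmm_device))
+ *		return PTR_ERR(hmm_device);
+ *	devmem = hmm_devmem_add(&my_devmem_ops, &hmm_device->device, size);
+ *	...
+ *	hmm_devmem_remove(devmem);
+ *	hmm_device_put(hmm_device);
+ */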
 #endif /* IS_ENABLED(CONFIG_HMM_DEVMEM) */
 
 
diff --git a/mm/hmm.c b/mm/hmm.c
index ec3eb49..ff8ec59 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -24,6 +24,7 @@
 #include <linux/slab.h>
 #include <linux/sched.h>
 #include <linux/mmzone.h>
+#include <linux/module.h>
 #include <linux/pagemap.h>
 #include <linux/swapops.h>
 #include <linux/hugetlb.h>
@@ -1106,4 +1107,99 @@ int hmm_devmem_fault_range(struct hmm_devmem *devmem,
 	return 0;
 }
 EXPORT_SYMBOL(hmm_devmem_fault_range);
+
+/*
+ * A device driver that wants to handle multiple devices' memory through a
+ * single fake device can use hmm_device to do so. This is purely a helper;
+ * it is not needed in order to make use of any HMM functionality.
+ */
+#define HMM_DEVICE_MAX 256
+
+static DECLARE_BITMAP(hmm_device_mask, HMM_DEVICE_MAX);
+static DEFINE_SPINLOCK(hmm_device_lock);
+static struct class *hmm_device_class;
+static dev_t hmm_device_devt;
+
+static void hmm_device_release(struct device *device)
+{
+	struct hmm_device *hmm_device;
+
+	hmm_device = container_of(device, struct hmm_device, device);
+	spin_lock(&hmm_device_lock);
+	clear_bit(hmm_device->minor, hmm_device_mask);
+	spin_unlock(&hmm_device_lock);
+
+	kfree(hmm_device);
+}
+
+struct hmm_device *hmm_device_new(void *drvdata)
+{
+	struct hmm_device *hmm_device;
+	int ret;
+
+	hmm_device = kzalloc(sizeof(*hmm_device), GFP_KERNEL);
+	if (!hmm_device)
+		return ERR_PTR(-ENOMEM);
+
+	ret = alloc_chrdev_region(&hmm_device->device.devt, 0, 1, "hmm_device");
+	if (ret < 0) {
+		kfree(hmm_device);
+		return NULL;
+	}
+
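+	/* Reserve an unused minor number for this fake device. */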
+	spin_lock(&hmm_device_lock);
+	hmm_device->minor = find_first_zero_bit(hmm_device_mask, HMM_DEVICE_MAX);
+	if (hmm_device->minor >= HMM_DEVICE_MAX) {
+		spin_unlock(&hmm_device_lock);
+		kfree(hmm_device);
+		return ERR_PTR(-EBUSY);
+	}
+	set_bit(hmm_device->minor, hmm_device_mask);
+	spin_unlock(&hmm_device_lock);
+
+	dev_set_name(&hmm_device->device, "hmm_device%d", hmm_device->minor);
+	hmm_device->device.devt = MKDEV(MAJOR(hmm_device_devt),
+					hmm_device->minor);
+	hmm_device->device.release = hmm_device_release;
+	dev_set_drvdata(&hmm_device->device, drvdata);
+	hmm_device->device.class = hmm_device_class;
+	device_initialize(&hmm_device->device);
+
+	return hmm_device;
+}
+EXPORT_SYMBOL(hmm_device_new);
+
+void hmm_device_put(struct hmm_device *hmm_device)
+{
+	put_device(&hmm_device->device);
+}
+EXPORT_SYMBOL(hmm_device_put);
+
+static int __init hmm_init(void)
+{
+	int ret;
+
+	ret = alloc_chrdev_region(&hmm_device_devt, 0,
+				  HMM_DEVICE_MAX,
+				  "hmm_device");
+	if (ret)
+		return ret;
+
+	hmm_device_class = class_create(THIS_MODULE, "hmm_device");
+	if (IS_ERR(hmm_device_class)) {
+		unregister_chrdev_region(hmm_device_devt, HMM_DEVICE_MAX);
+		return PTR_ERR(hmm_device_class);
+	}
+	return 0;
+}
+
+static void __exit hmm_exit(void)
+{
+	unregister_chrdev_region(hmm_device_devt, HMM_DEVICE_MAX);
+	class_destroy(hmm_device_class);
+}
+
+module_init(hmm_init);
+module_exit(hmm_exit);
+MODULE_LICENSE("GPL");
 #endif /* IS_ENABLED(CONFIG_HMM_DEVMEM) */
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [HMM 16/16] hmm: heterogeneous memory management documentation
  2017-04-05 20:40 ` Jérôme Glisse
@ 2017-04-05 20:40   ` Jérôme Glisse
  -1 siblings, 0 replies; 81+ messages in thread
From: Jérôme Glisse @ 2017-04-05 20:40 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, Naoya Horiguchi, David Nellans,
	Jérôme Glisse

This adds documentation for HMM (Heterogeneous Memory Management). It
presents the motivation behind it and the features necessary for it to
be useful, and gives an overview of how this is implemented.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 Documentation/vm/hmm.txt | 362 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 362 insertions(+)
 create mode 100644 Documentation/vm/hmm.txt

diff --git a/Documentation/vm/hmm.txt b/Documentation/vm/hmm.txt
new file mode 100644
index 0000000..a18ffc0
--- /dev/null
+++ b/Documentation/vm/hmm.txt
@@ -0,0 +1,362 @@
+Heterogeneous Memory Management (HMM)
+
+HMM transparently allows any component of a program to use any memory region
+of said program with a device, without going through a device specific memory
+allocator. This is becoming a requirement to simplify the use of advanced
+heterogeneous computing, where GPUs, DSPs or FPGAs are used to perform various
+computations.
+
+This document is divided as follows: in the first section I expose the problems
+related to the use of a device specific allocator. The second section exposes
+the hardware limitations that are inherent to many platforms. The third section
+gives an overview of the HMM design. The fourth section explains how CPU page
+table mirroring works and what HMM's purpose is in this context. The fifth
+section deals with how device memory is represented inside the kernel. Finally
+the last section presents the new migration helper that allows leveraging the
+device DMA engine.
+
+
+-------------------------------------------------------------------------------
+
+1) Problems of using a device specific memory allocator:
+
+Devices with a large amount of on-board memory (several gigabytes), like GPUs,
+have historically managed their memory through a dedicated, driver specific
+API. This creates a disconnect between memory allocated and managed by the
+device driver and regular application memory (private anonymous memory, shared
+memory or regular file backed memory). From here on I will refer to this aspect
+as split address space. I use shared address space to refer to the opposite
+situation, ie one in which any memory region can be used by the device
+transparently.
+
+The address space is split because the device can only access memory allocated
+through the device specific API. This implies that all memory objects in a
+program are not equal from the device point of view, which complicates large
+programs that rely on a wide set of libraries.
+
+Concretely this means that code that wants to leverage a device like a GPU
+needs to copy objects between generically allocated memory (malloc, private or
+shared mmap) and memory allocated through the device driver API (this still
+ends up being an mmap, but of the device file).
+
+For flat data sets (array, grid, image, ...) this isn't too hard to achieve,
+but complex data sets (list, tree, ...) are hard to get right. Duplicating a
+complex data set requires re-mapping all the pointer relations between each of
+its elements. This is error prone and programs get harder to debug because of
+the duplicated data set.
+
+Split address space also means that libraries can not transparently use data
+they get from the core program or from another library, and thus each library
+might have to duplicate its input data set using the device specific allocator.
+Large projects suffer from this and waste resources because of the various
+memory copies.
+
+Duplicating each library API to accept as input or output memory allocated by
+each device specific allocator is not a viable option. It would lead to a
+combinatorial explosion in the library entry points.
+
+Finally, with the advance of high level language constructs (in C++ but in
+other languages too) it is now possible for the compiler to leverage GPUs or
+other devices without the programmer even knowing about it. Some compiler
+identified patterns are only doable with a shared address space. It is also
+more reasonable to use a shared address space for all the other patterns.
+
+
+-------------------------------------------------------------------------------
+
+2) System bus, device memory characteristics
+
+System buses cripple shared address spaces due to a few limitations. Most
+system buses only allow basic memory access from the device to main memory;
+even cache coherency is often optional. Access to device memory from the CPU
+is even more limited; more often than not it is not cache coherent.
+
+If we only consider the PCIE bus, then a device can access main memory (often
+through an IOMMU) and be cache coherent with the CPUs. However, it only allows
+a limited set of atomic operations from the device on main memory. This is
+worse in the other direction: the CPUs can only access a limited range of the
+device memory and can not perform atomic operations on it. Thus device memory
+can not be considered the same as regular memory from the kernel point of view.
+
+Another crippling factor is the limited bandwidth (~32GBytes/s with PCIE 4.0
+and 16 lanes). This is 33 times less than the fastest GPU memory (1 TBytes/s).
+The final limitation is latency: access to main memory from the device has an
+order of magnitude higher latency than when the device accesses its own memory.
+
+Some platforms are developing new system buses or additions/modifications to
+PCIE to address some of these limitations (OpenCAPI, CCIX). They mainly allow
+two way cache coherency between CPU and device and allow all the atomic
+operations the architecture supports. Sadly, not all platforms are following
+this trend and some major architectures are left without hardware solutions
+to these problems.
+
+So for a shared address space to make sense, not only must we allow the device
+to access any memory, but we must also permit any memory to be migrated to
+device memory while the device is using it (blocking CPU access while it
+happens).
+
+
+-------------------------------------------------------------------------------
+
+3) Shared address space and migration
+
+HMM intends to provide two main features. The first one is to share the address
+space by duplicating the CPU page table into the device page table, so the same
+address points to the same memory, and this for any valid main memory address
+in the process address space.
+
+To achieve this, HMM offers a set of helpers to populate the device page table
+while keeping track of CPU page table updates. Device page table updates are
+not as easy as CPU page table updates. To update the device page table you must
+allocate a buffer (or use a pool of pre-allocated buffers) and write GPU
+specific commands into it to perform the update (unmap, cache invalidations and
+flush, ...). This can not be done through common code for all devices. Hence
+why HMM provides helpers to factor out everything that can be, while leaving
+the gory details to the device driver.
+
+The second mechanism HMM provides is a new kind of ZONE_DEVICE memory that
+allows allocating a struct page for each page of the device memory. Those pages
+are special because the CPU can not map them. They however allow migrating
+main memory to device memory using the existing migration mechanism, and from
+the CPU point of view everything looks as if the page was swapped out to disk.
+Using a struct page gives the easiest and cleanest integration with existing mm
+mechanisms. Here again HMM only provides helpers, first to hotplug new
+ZONE_DEVICE memory for the device memory and second to perform the migration.
+Policy decisions of what and when to migrate are left to the device driver.
+
+Note that any CPU access to a device page triggers a page fault and a migration
+back to main memory, ie when a page backing a given address A is migrated from
+a main memory page to a device page, then any CPU access to address A triggers
+a page fault and initiates a migration back to main memory.
+
+
+With these two features, HMM not only allows a device to mirror a process
+address space and keep both CPU and device page tables synchronized, but also
+allows leveraging device memory by migrating the part of a data set that is
+actively used by a device.
+
+
+-------------------------------------------------------------------------------
+
+4) Address space mirroring implementation and API
+
+Address space mirroring's main objective is to allow duplication of a range of
+CPU page table into a device page table; HMM helps keep both synchronized. A
+device driver that wants to mirror a process address space must start with the
+registration of an hmm_mirror struct:
+
+ int hmm_mirror_register(struct hmm_mirror *mirror,
+                         struct mm_struct *mm);
+ int hmm_mirror_register_locked(struct hmm_mirror *mirror,
+                                struct mm_struct *mm);
+
+The locked variant is to be used when the driver is already holding the
+mmap_sem of the mm in write mode. The mirror struct has a set of callbacks that
+are used to propagate CPU page table updates:
+
+ struct hmm_mirror_ops {
+     /* sync_cpu_device_pagetables() - synchronize page tables
+      *
+      * @mirror: pointer to struct hmm_mirror
+      * @update_type: type of update that occurred to the CPU page table
+      * @start: virtual start address of the range to update
+      * @end: virtual end address of the range to update
+      *
+      * This callback ultimately originates from mmu_notifiers when the CPU
+      * page table is updated. The device driver must update its page table
+      * in response to this callback. The update argument tells what action
+      * to perform.
+      *
+      * The device driver must not return from this callback until the device
+      * page tables are completely updated (TLBs flushed, etc); this is a
+      * synchronous call.
+      */
+      void (*update)(struct hmm_mirror *mirror,
+                     enum hmm_update action,
+                     unsigned long start,
+                     unsigned long end);
+ };
+
+The device driver must update the range following the action (turn the range
+read only, fully unmap it, ...). Once the driver callback returns, the device
+must be done with the update.
+
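+As a purely illustrative sketch (struct my_mirror, my_device_unmap_range() and
+my_device_flush_tlb() are hypothetical driver internals, not HMM API), an
+update() callback could look like:
+
+ static void driver_update(struct hmm_mirror *mirror,
+                           enum hmm_update action,
+                           unsigned long start,
+                           unsigned long end)
+ {
+     struct my_mirror *m = container_of(mirror, struct my_mirror, mirror);
+
+     /* Queue device commands invalidating [start, end) for this action... */
+     my_device_unmap_range(m, action, start, end);
+     /* ...and only return once the device page tables and TLBs are updated. */
+     my_device_flush_tlb(m);
+ }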
+
+When the device driver wants to populate a range of virtual addresses it can
+use either:
+ int hmm_vma_get_pfns(struct vm_area_struct *vma,
+                      struct hmm_range *range,
+                      unsigned long start,
+                      unsigned long end,
+                      hmm_pfn_t *pfns);
+ int hmm_vma_fault(struct vm_area_struct *vma,
+                   struct hmm_range *range,
+                   unsigned long start,
+                   unsigned long end,
+                   hmm_pfn_t *pfns,
+                   bool write,
+                   bool block);
+
+The first one (hmm_vma_get_pfns()) will only fetch present CPU page table
+entries and will not trigger a page fault on missing or non present entries.
+The second one does trigger a page fault on missing or read only entries if the
+write parameter is true. Page faults use the generic mm page fault code path,
+just like a CPU page fault.
+
+Both functions copy CPU page table entries into their pfns array argument. Each
+entry in that array corresponds to an address in the virtual range. HMM
+provides a set of flags to help the driver identify special CPU page table
+entries.
+
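+As a purely illustrative sketch (assuming the hmm_pfn_t flags and the
+hmm_pfn_t_to_page() helper from hmm.h earlier in this series, and a
+hypothetical my_device_pte() encoder; pfns, dev_pte and npages are driver
+locals), a driver could turn the array into device page table entries with:
+
+ for (i = 0; i < npages; i++) {
+     if (!(pfns[i] & HMM_PFN_VALID))
+         continue;
+     dev_pte[i] = my_device_pte(hmm_pfn_t_to_page(pfns[i]),
+                                pfns[i] & HMM_PFN_WRITE);
+ }
+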
+Locking with the update() callback is the most important aspect the driver must
+respect in order to keep things properly synchronized. The usage pattern is:
+
+ int driver_populate_range(...)
+ {
+      struct hmm_range range;
+      ...
+ again:
+      ret = hmm_vma_get_pfns(vma, &range, start, end, pfns);
+      if (ret)
+          return ret;
+      take_lock(driver->update);
+      if (!hmm_vma_range_done(vma, &range)) {
+          release_lock(driver->update);
+          goto again;
+      }
+
+      // Use pfns array content to update device page table
+
+      release_lock(driver->update);
+      return 0;
+ }
+
+The driver->update lock is the same lock that the driver takes inside its
+update() callback. That lock must be held before hmm_vma_range_done() is called
+to avoid any race with a concurrent CPU page table update.
+
+HMM implements all of this on top of the mmu_notifier API because we wanted a
+simpler API and also to be able to perform optimizations later on, like doing
+concurrent device updates in multi-device scenarios.
+
+HMM also serves as an impedance mismatch between how CPU page table updates are
+done (by the CPU writing to the page table and flushing TLBs) and how devices
+update their own page table. A device update is a multi-step process: first the
+appropriate commands are written to a buffer, then this buffer is scheduled for
+execution on the device. It is only once the device has executed the commands
+in the buffer that the update is done. Creating and scheduling the update
+command buffers can happen concurrently for multiple devices. Waiting for each
+device to report commands as executed is serialized (there is no point in doing
+this concurrently).
+
+
+-------------------------------------------------------------------------------
+
+5) Represent and manage device memory from core kernel point of view
+
+Several different designs were tried to support device memory. The first one
+used a device specific data structure to keep information about migrated memory
+and HMM hooked itself in various places of mm code to handle any access to
+addresses that were backed by device memory. It turns out that this ended up
+replicating most of the fields of struct page and also needed many kernel code
+paths to be updated to understand this new kind of memory.
+
+The thing is, most kernel code paths never try to access the memory behind a
+page but only care about struct page contents. Because of this, HMM switched to
+directly using struct page for device memory, which left most kernel code paths
+unaware of the difference. We only need to make sure that no one ever tries to
+map those pages from the CPU side.
+
+HMM provides a set of helpers to register and hotplug device memory as a new
+region needing struct pages. This is offered through a very simple API:
+
+ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
+                                   struct device *device,
+                                   unsigned long size);
+ void hmm_devmem_remove(struct hmm_devmem *devmem);
+
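+For instance (an illustrative sketch only; my_devmem_ops, MY_DEVMEM_SIZE and
+struct my_device are driver-defined placeholders), a driver would typically
+register its memory at probe time and keep the returned struct around:
+
+ static int my_probe_memory(struct my_device *mydev, struct device *device)
+ {
+     mydev->devmem = hmm_devmem_add(&my_devmem_ops, device, MY_DEVMEM_SIZE);
+     if (IS_ERR(mydev->devmem))
+         return PTR_ERR(mydev->devmem);
+     /* devmem->pfn_first .. devmem->pfn_last now have struct pages. */
+     return 0;
+ }
+
+The matching hmm_devmem_remove() call must happen before the device goes away.
+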
+The hmm_devmem_ops is where most of the important things are:
+
+ struct hmm_devmem_ops {
+     void (*free)(struct hmm_devmem *devmem, struct page *page);
+     int (*fault)(struct hmm_devmem *devmem,
+                  struct vm_area_struct *vma,
+                  unsigned long addr,
+                  struct page *page,
+                  unsigned flags,
+                  pmd_t *pmdp);
+ };
+
+The first callback (free()) happens when the last reference on a device page is
+dropped. This means the device page is now free and no longer used by anyone.
+The second callback happens whenever the CPU tries to access a device page,
+which it can not do. This second callback must trigger a migration back to
+system memory. HMM provides a helper to do just that:
+
+ int hmm_devmem_fault_range(struct hmm_devmem *devmem,
+                            struct vm_area_struct *vma,
+                            const struct migrate_vma_ops *ops,
+                            unsigned long *src,
+                            unsigned long *dst,
+                            unsigned long start,
+                            unsigned long addr,
+                            unsigned long end,
+                            void *private);
+
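+As an illustration only (my_migrate_ops is a driver-provided struct
+migrate_vma_ops; the single-page arrays are just the simplest possible
+choice), a fault() callback can bounce the faulting page back to main memory
+like this:
+
+ static int my_devmem_fault(struct hmm_devmem *devmem,
+                            struct vm_area_struct *vma,
+                            unsigned long addr,
+                            struct page *page,
+                            unsigned int flags,
+                            pmd_t *pmdp)
+ {
+     unsigned long src[1] = {0}, dst[1] = {0};
+     unsigned long start = addr & PAGE_MASK;
+
+     /* Migrate the faulting page and report success or SIGBUS. */
+     return hmm_devmem_fault_range(devmem, vma, &my_migrate_ops, src, dst,
+                                   start, addr, start + PAGE_SIZE, NULL);
+ }
+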
+It relies on the new migrate_vma() helper, which is a generic page migration
+helper that works on a range of virtual addresses instead of working on
+individual pages. It also allows leveraging the device DMA engine to perform
+the copy from device to main memory (or in the other direction). The next
+section goes over this new helper.
+
+
+-------------------------------------------------------------------------------
+
+6) Migrate to and from device memory
+
+Because the CPU can not access device memory, migration must use the device DMA
+engine to perform the copy from and to device memory. For this we need a new
+migration helper:
+
+ int migrate_vma(const struct migrate_vma_ops *ops,
+                 struct vm_area_struct *vma,
+                 unsigned long start,
+                 unsigned long end,
+                 unsigned long *src,
+                 unsigned long *dst,
+                 void *private);
+
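+For example (sketch only; my_migrate_ops, vma, start and the arrays are
+provided by the driver), migrating one 64 page chunk that the device is
+actively using could be requested with:
+
+ unsigned long src[64], dst[64];
+
+ ret = migrate_vma(&my_migrate_ops, vma, start, start + (64UL << PAGE_SHIFT),
+                   src, dst, my_private);
+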
+Unlike other migration functions it works on a range of virtual addresses.
+There are two reasons for that. First, device DMA copy has a high setup
+overhead cost and thus batching multiple pages is needed, as otherwise the
+migration overhead makes the whole exercise pointless. The second reason is
+that drivers trigger such migrations based on the range of addresses the device
+is actively accessing.
+
+The migrate_vma_ops struct defines two callbacks. The first one
+(alloc_and_copy()) controls destination memory allocation and the copy
+operation. The second one is there to allow the device driver to perform
+cleanup operations after migration.
+
+ struct migrate_vma_ops {
+     void (*alloc_and_copy)(struct vm_area_struct *vma,
+                            const unsigned long *src,
+                            unsigned long *dst,
+                            unsigned long start,
+                            unsigned long end,
+                            void *private);
+     void (*finalize_and_map)(struct vm_area_struct *vma,
+                              const unsigned long *src,
+                              const unsigned long *dst,
+                              unsigned long start,
+                              unsigned long end,
+                              void *private);
+ };
+
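+As a purely illustrative skeleton (my_devmem_copy_page() and my_dma_wait() are
+hypothetical driver internals that must fill dst[] in the encoding migrate_vma()
+expects), an alloc_and_copy() callback is essentially a loop over the range:
+
+ static void my_alloc_and_copy(struct vm_area_struct *vma,
+                               const unsigned long *src,
+                               unsigned long *dst,
+                               unsigned long start,
+                               unsigned long end,
+                               void *private)
+ {
+     unsigned long addr, i;
+
+     for (addr = start, i = 0; addr < end; addr += PAGE_SIZE, i++) {
+         /*
+          * Allocate a device page and schedule a DMA copy of src[i];
+          * leave dst[i] empty to skip the page or set MIGRATE_PFN_ERROR
+          * on failure.
+          */
+         dst[i] = my_devmem_copy_page(private, src[i]);
+     }
+     /* Wait for all scheduled DMA copies before returning. */
+     my_dma_wait(private);
+ }
+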
+It is important to stress that these migration helpers allow for holes in the
+virtual address range. Some pages in the range might not be migrated for all
+the usual reasons (page is pinned, page is locked, ...). This helper does not
+fail but just skips over those pages.
+
+The alloc_and_copy() callback might as well decide not to migrate all pages in
+the range (for reasons under the callback's control). For those, the callback
+just has to leave the corresponding dst entry empty.
+
+Finally, the migration of the struct page might fail (for file backed pages)
+for various reasons (failure to freeze the reference, or to update the page
+cache, ...). If that happens then finalize_and_map() can catch any pages that
+were not migrated. Note that those pages were still copied to a new page and
+thus we wasted bandwidth, but this is considered a rare event and a price that
+we are willing to pay to keep all the code simpler.
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [HMM 16/16] hmm: heterogeneous memory management documentation
@ 2017-04-05 20:40   ` Jérôme Glisse
  0 siblings, 0 replies; 81+ messages in thread
From: Jérôme Glisse @ 2017-04-05 20:40 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, Naoya Horiguchi, David Nellans,
	Jérôme Glisse

This adds documentation for HMM (Heterogeneous Memory Management). It
presents the motivation behind it and the features necessary for it to
be useful, and gives an overview of how this is implemented.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 Documentation/vm/hmm.txt | 362 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 362 insertions(+)
 create mode 100644 Documentation/vm/hmm.txt

diff --git a/Documentation/vm/hmm.txt b/Documentation/vm/hmm.txt
new file mode 100644
index 0000000..a18ffc0
--- /dev/null
+++ b/Documentation/vm/hmm.txt
@@ -0,0 +1,362 @@
+Heterogeneous Memory Management (HMM)
+
+Transparently allow any component of a program to use any memory region of said
+program with a device without using device specific memory allocator. This is
+becoming a requirement to simplify the use of advance heterogeneous computing
+where GPU, DSP or FPGA are use to perform various computations.
+
+This document is divided as follow, in the first section i expose the problems
+related to the use of a device specific allocator. The second section i expose
+the hardware limitations that are inherent to many platforms. The third section
+gives an overview of HMM designs. The fourth section explains how CPU page-
+table mirroring works and what is HMM purpose in this context. Fifth section
+deals with how device memory is represented inside the kernel. Finaly the last
+section present the new migration helper that allow to leverage the device DMA
+engine.
+
+
+-------------------------------------------------------------------------------
+
+1) Problems of using device specific memory allocator:
+
+Device with large amount of on board memory (several giga bytes) like GPU have
+historically manage their memory through dedicated driver specific API. This
+creates a disconnect between memory allocated and managed by device driver and
+regular application memory (private anonymous, share memory or regular file
+back memory). From here on i will refer to this aspect as split address space.
+I use share address space to refer to the opposite situation ie one in which
+any memory region can be use by device transparently.
+
+Split address space because device can only access memory allocated through the
+device specific API. This imply that all memory object in a program are not
+equal from device point of view which complicate large program that rely on a
+wide set of libraries.
+
+Concretly this means that code that wants to leverage device like GPU need to
+copy object between genericly allocated memory (malloc, mmap private/share/)
+and memory allocated through the device driver API (this still end up with an
+mmap but of the device file).
+
+For flat dataset (array, grid, image, ...) this isn't too hard to achieve but
+complex data-set (list, tree, ...) are hard to get right. Duplicating a complex
+data-set need to re-map all the pointer relations between each of its elements.
+This is error prone and program gets harder to debug because of the duplicate
+data-set.
+
+Split address space also means that library can not transparently use data they
+are getting from core program or other library and thus each library might have
+to duplicate its input data-set using specific memory allocator. Large project
+suffer from this and waste resources because of the various memory copy.
+
+Duplicating each library API to accept as input or output memory allocted by
+each device specific allocator is not a viable option. It would lead to a
+combinatorial explosions in the library entry points.
+
+Finaly with the advance of high level language constructs (in C++ but in other
+language too) it is now possible for compiler to leverage GPU or other devices
+without even the programmer knowledge. Some of compiler identified patterns are
+only do-able with a share address. It is as well more reasonable to use a share
+address space for all the other patterns.
+
+
+-------------------------------------------------------------------------------
+
+2) System bus, device memory characteristics
+
+System bus cripple share address due to few limitations. Most system bus only
+allow basic memory access from device to main memory, even cache coherency is
+often optional. Access to device memory from CPU is even more limited, most
+often than not it is not cache coherent.
+
+If we only consider the PCIE bus than device can access main memory (often
+through an IOMMU) and be cache coherent with the CPUs. However it only allows
+a limited set of atomic operation from device on main memory. This is worse
+in the other direction the CPUs can only access a limited range of the device
+memory and can not perform atomic operations on it. Thus device memory can not
+be consider like regular memory from kernel point of view.
+
+Another crippling factor is the limited bandwidth (~32GBytes/s with PCIE 4.0
+and 16 lanes). This is 33 times less that fastest GPU memory (1 TBytes/s).
+The final limitation is latency, access to main memory from the device has an
+order of magnitude higher latency than when the device access its own memory.
+
+Some platform are developing new system bus or additions/modifications to PCIE
+to address some of those limitations (OpenCAPI, CCIX). They mainly allow two
+way cache coherency between CPU and device and allow all atomic operations the
+architecture supports. Saddly not all platform are following this trends and
+some major architecture are left without hardware solutions to those problems.
+
+So for share address space to make sense not only we must allow device to
+access any memory memory but we must also permit any memory to be migrated to
+device memory while device is using it (blocking CPU access while it happens).
+
+
+-------------------------------------------------------------------------------
+
+3) Share address space and migration
+
+HMM intends to provide two main features. First one is to share the address
+space by duplication the CPU page table into the device page table so same
+address point to same memory and this for any valid main memory address in
+the process address space.
+
+To achieve this, HMM offers a set of helpers to populate the device page table
+while keeping track of CPU page table updates. Device page table updates are
+not as easy as CPU page table updates. To update the device page table you must
+allocate a buffer (or use a pool of pre-allocated buffers) and write GPU
+specific commands into it to perform the update (unmap, cache invalidation,
+flush, ...). This can not be done through common code for all devices. This is
+why HMM provides helpers to factor out everything that can be factored, while
+leaving the gory details to the device driver.
+
+The second mechanism HMM provides is a new kind of ZONE_DEVICE memory that
+allows allocating a struct page for each page of the device memory. Those pages
+are special because the CPU can not map them. They however allow migrating
+main memory to device memory using the existing migration mechanism, and from
+the CPU point of view everything looks as if the page was swapped out to disk.
+Using a struct page gives the easiest and cleanest integration with existing
+mm mechanisms. Here again HMM only provides helpers, first to hotplug new
+ZONE_DEVICE memory for the device memory and second to perform the migration.
+Policy decisions of what and when to migrate are left to the device driver.
+
+Note that any CPU access to a device page triggers a page fault and a migration
+back to main memory, i.e. when a page backing a given address A is migrated
+from a main memory page to a device page, then any CPU access to address A
+triggers a page fault and initiates a migration back to main memory.
+
+
+With these two features, HMM not only allows a device to mirror a process
+address space and keep both the CPU and device page tables synchronized, but
+also allows leveraging device memory by migrating the part of the data-set
+that is actively being used by a device.
+
+
+-------------------------------------------------------------------------------
+
+4) Address space mirroring implementation and API
+
+Address space mirroring's main objective is to allow duplication of a range of
+the CPU page table into a device page table; HMM helps keep both synchronized.
+A device driver that wants to mirror a process address space must start with
+the registration of an hmm_mirror struct:
+
+ int hmm_mirror_register(struct hmm_mirror *mirror,
+                         struct mm_struct *mm);
+ int hmm_mirror_register_locked(struct hmm_mirror *mirror,
+                                struct mm_struct *mm);
+
+The locked variant is to be used when the driver is already holding the
+mmap_sem of the mm in write mode. The mirror struct has a set of callbacks that
+are used to propagate CPU page table updates:
+
+ struct hmm_mirror_ops {
+     /* sync_cpu_device_pagetables() - synchronize page tables
+      *
+      * @mirror: pointer to struct hmm_mirror
+      * @update_type: type of update that occurred to the CPU page table
+      * @start: virtual start address of the range to update
+      * @end: virtual end address of the range to update
+      *
+      * This callback ultimately originates from mmu_notifiers when the CPU
+      * page table is updated. The device driver must update its page table
+      * in response to this callback. The update argument tells what action
+      * to perform.
+      *
+      * The device driver must not return from this callback until the device
+      * page tables are completely updated (TLBs flushed, etc); this is a
+      * synchronous call.
+      */
+      void (*update)(struct hmm_mirror *mirror,
+                     enum hmm_update action,
+                     unsigned long start,
+                     unsigned long end);
+ };
+
+The device driver must perform the update action on the range (mark the range
+read only, or fully unmap it, ...). Once the driver callback returns, the
+device must be done with the update.
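+
+As an illustration only (this is not part of the HMM API), a driver side
+registration could be sketched as below. The driver_* names are hypothetical
+and it is assumed that the driver fills mirror.ops before registering:
+
+ static void driver_update(struct hmm_mirror *mirror,
+                           enum hmm_update action,
+                           unsigned long start,
+                           unsigned long end);
+
+ struct driver_mirror {
+     struct hmm_mirror mirror;
+     /* ... device page table state, the driver->update lock, ... */
+ };
+
+ static const struct hmm_mirror_ops driver_mirror_ops = {
+     .update = driver_update,   /* see the locking sketch further below */
+ };
+
+ static int driver_mirror_process(struct driver_mirror *dmirror,
+                                  struct mm_struct *mm)
+ {
+     dmirror->mirror.ops = &driver_mirror_ops;
+     return hmm_mirror_register(&dmirror->mirror, mm);
+ }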
+
+
+When the device driver wants to populate a range of virtual addresses it can
+use either:
+ int hmm_vma_get_pfns(struct vm_area_struct *vma,
+                      struct hmm_range *range,
+                      unsigned long start,
+                      unsigned long end,
+                      hmm_pfn_t *pfns);
+ int hmm_vma_fault(struct vm_area_struct *vma,
+                   struct hmm_range *range,
+                   unsigned long start,
+                   unsigned long end,
+                   hmm_pfn_t *pfns,
+                   bool write,
+                   bool block);
+
+The first one (hmm_vma_get_pfns()) will only fetch present CPU page table
+entries and will not trigger a page fault on missing or non present entries.
+The second one does trigger a page fault on missing or read only entries if
+the write parameter is true. Page faults use the generic mm page fault code
+path just like a CPU page fault.
+
+Both functions copy CPU page table entries into their pfns array argument. Each
+entry in that array corresponds to an address in the virtual range. HMM
+provides a set of flags to help the driver identify special CPU page table
+entries.
+
+Locking with the update() callback is the most important aspect the driver must
+respect in order to keep things properly synchronized. The usage pattern is:
+
+ int driver_populate_range(...)
+ {
+      struct hmm_range range;
+      ...
+ again:
+      ret = hmm_vma_get_pfns(vma, &range, start, end, pfns);
+      if (ret)
+          return ret;
+      take_lock(driver->update);
+      if (!hmm_vma_range_done(vma, &range)) {
+          release_lock(driver->update);
+          goto again;
+      }
+
+      // Use pfns array content to update device page table
+
+      release_lock(driver->update);
+      return 0;
+ }
+
+The driver->update lock is the same lock that the driver takes inside its
+update() callback. That lock must be held before calling hmm_vma_range_done()
+to avoid any race with a concurrent CPU page table update.
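+
+The callback side of this pattern, sketched with the same pseudo locking
+primitives as above (driver_invalidate_range() is a hypothetical driver
+helper):
+
+ void driver_update(struct hmm_mirror *mirror,
+                    enum hmm_update action,
+                    unsigned long start,
+                    unsigned long end)
+ {
+      take_lock(driver->update);
+      /*
+       * Unmap or write protect [start, end) in the device page table
+       * according to action and wait for the device TLB flush to
+       * complete before releasing the lock.
+       */
+      driver_invalidate_range(mirror, action, start, end);
+      release_lock(driver->update);
+ }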
+
+HMM implements all of this on top of the mmu_notifier API because we wanted a
+simpler API and also to be able to perform optimizations later on, like doing
+concurrent device updates in a multi-device scenario.
+
+HMM also serves as an impedance mismatch layer between how CPU page table
+updates are done (by the CPU writing to the page table and flushing TLBs) and
+how devices update their own page table. A device update is a multi-step
+process: first, appropriate commands are written to a buffer, then this buffer
+is scheduled for execution on the device. It is only once the device has
+executed the commands in the buffer that the update is done. Creating and
+scheduling the update command buffers can happen concurrently for multiple
+devices. Waiting for each device to report commands as executed is serialized
+(there is no point in doing this concurrently).
+
+
+-------------------------------------------------------------------------------
+
+5) Represent and manage device memory from core kernel point of view
+
+Several different designs were tried to support device memory. The first one
+used a device specific data structure to keep information about migrated memory
+and HMM hooked itself in various places of mm code to handle any access to
+addresses that were backed by device memory. It turns out that this ended up
+replicating most of the fields of struct page and also needed many kernel code
+paths to be updated to understand this new kind of memory.
+
+The thing is, most kernel code paths never try to access the memory behind a
+page but only care about the struct page contents. Because of this, HMM
+switched to directly using struct page for device memory, which left most
+kernel code paths unaware of the difference. We only need to make sure that
+no one ever tries to map those pages from the CPU side.
+
+HMM provides a set of helpers to register and hotplug device memory as a new
+region needing struct page. This is offered through a very simple API:
+
+ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
+                                   struct device *device,
+                                   unsigned long size);
+ void hmm_devmem_remove(struct hmm_devmem *devmem);
+
+The hmm_devmem_ops is where most of the important things are:
+
+ struct hmm_devmem_ops {
+     void (*free)(struct hmm_devmem *devmem, struct page *page);
+     int (*fault)(struct hmm_devmem *devmem,
+                  struct vm_area_struct *vma,
+                  unsigned long addr,
+                  struct page *page,
+                  unsigned flags,
+                  pmd_t *pmdp);
+ };
+
+The first callback (free()) happens when the last reference on a device page is
+dropped. This means the device page is now free and no longer used by anyone.
+The second callback happens whenever the CPU tries to access a device page,
+which it can not do. This second callback must trigger a migration back to
+system memory; HMM provides a helper to do just that:
+
+ int hmm_devmem_fault_range(struct hmm_devmem *devmem,
+                            struct vm_area_struct *vma,
+                            const struct migrate_vma_ops *ops,
+                            unsigned long mentry,
+                            unsigned long *src,
+                            unsigned long *dst,
+                            unsigned long start,
+                            unsigned long addr,
+                            unsigned long end,
+                            void *private);
+
+It relies on the new migrate_vma() helper, which is a generic page migration
+helper that works on a range of virtual addresses instead of working on
+individual pages. It also allows leveraging the device DMA engine to perform
+the copy from device to main memory (or in the other direction). The next
+section goes over this new helper.
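+
+To tie this together, registering device memory might be sketched as below.
+This is an illustration only: the driver_* names are hypothetical, error
+handling is omitted, and it is assumed that hmm_devmem_add() returns an
+ERR_PTR() on failure:
+
+ static void driver_devmem_free(struct hmm_devmem *devmem, struct page *page)
+ {
+     /* Return the device page to the driver's device memory allocator. */
+ }
+
+ static int driver_devmem_fault(struct hmm_devmem *devmem,
+                                struct vm_area_struct *vma,
+                                unsigned long addr,
+                                struct page *page,
+                                unsigned flags,
+                                pmd_t *pmdp)
+ {
+     /*
+      * Migrate the faulting page back to system memory, for instance
+      * through hmm_devmem_fault_range() (see above). The return value
+      * follows the hmm_devmem_ops contract.
+      */
+     return driver_migrate_back(devmem, vma, addr, page, flags, pmdp);
+ }
+
+ static const struct hmm_devmem_ops driver_devmem_ops = {
+     .free = driver_devmem_free,
+     .fault = driver_devmem_fault,
+ };
+
+ int driver_register_memory(struct device *device, unsigned long size)
+ {
+     struct hmm_devmem *devmem;
+
+     devmem = hmm_devmem_add(&driver_devmem_ops, device, size);
+     if (IS_ERR(devmem))
+         return PTR_ERR(devmem);
+     /* Device pages are devmem->pfn_first up to devmem->pfn_last. */
+     return 0;
+ }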
+
+
+-------------------------------------------------------------------------------
+
+6) Migrate to and from device memory
+
+Because the CPU can not access device memory, migration must use the device DMA
+engine to perform the copy from and to device memory. For this we need a new
+migration helper:
+
+ int migrate_vma(const struct migrate_vma_ops *ops,
+                 struct vm_area_struct *vma,
+                 unsigned long mentries,
+                 unsigned long start,
+                 unsigned long end,
+                 unsigned long *src,
+                 unsigned long *dst,
+                 void *private);
+
+Unlike other migration functions it works on a range of virtual addresses;
+there are two reasons for that. First, device DMA copies have a high setup
+overhead cost and thus batching multiple pages is needed, as otherwise the
+migration overhead makes the whole exercise pointless. The second reason is
+that the driver triggers such migrations based on the range of addresses the
+device is actively accessing.
+
+The migrate_vma_ops struct defines two callbacks. The first one
+(alloc_and_copy()) controls destination memory allocation and the copy
+operation. The second one is there to allow the device driver to perform
+cleanup operations after migration.
+
+ struct migrate_vma_ops {
+     void (*alloc_and_copy)(struct vm_area_struct *vma,
+                            const unsigned long *src,
+                            unsigned long *dst,
+                            unsigned long start,
+                            unsigned long end,
+                            void *private);
+     void (*finalize_and_map)(struct vm_area_struct *vma,
+                              const unsigned long *src,
+                              const unsigned long *dst,
+                              unsigned long start,
+                              unsigned long end,
+                              void *private);
+ };
+
+It is important to stress that this migration helper allows for holes in the
+virtual address range. Some pages in the range might not be migrated for all
+the usual reasons (page is pinned, page is locked, ...). This helper does not
+fail but just skips over those pages.
+
+The alloc_and_copy() callback might also decide not to migrate all pages in
+the range (for reasons under the callback's control). For those, the callback
+just has to leave the corresponding dst entries empty.
+
+Finally, the migration of the struct page might fail (for file backed pages)
+for various reasons (failure to freeze the reference count, or to update the
+page cache, ...). If that happens, then finalize_and_map() can catch any pages
+that were not migrated. Note that those pages were still copied to a new page
+and thus we wasted bandwidth, but this is considered a rare event and a price
+we are willing to pay to keep all the code simpler.
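+
+To illustrate, a driver triggered migration could be sketched as below. This
+is not part of the API: driver_alloc_page_and_copy(), driver_finish_migrate()
+and the private cookie are hypothetical, and the mentries argument is assumed
+to be the number of src/dst entries:
+
+ static void driver_alloc_and_copy(struct vm_area_struct *vma,
+                                   const unsigned long *src,
+                                   unsigned long *dst,
+                                   unsigned long start,
+                                   unsigned long end,
+                                   void *private)
+ {
+     unsigned long addr, i;
+
+     for (addr = start, i = 0; addr < end; addr += PAGE_SIZE, i++) {
+         /* Skip entries the core helper decided not to migrate. */
+         if (!src[i])
+             continue;
+         /*
+          * Allocate a device page, schedule a DMA copy to it and
+          * report the destination in dst[i]. Leaving dst[i] empty
+          * skips the migration of that page.
+          */
+         dst[i] = driver_alloc_page_and_copy(private, src[i]);
+     }
+     /* Wait for all DMA copies to complete before returning. */
+ }
+
+ static void driver_finalize_and_map(struct vm_area_struct *vma,
+                                     const unsigned long *src,
+                                     const unsigned long *dst,
+                                     unsigned long start,
+                                     unsigned long end,
+                                     void *private)
+ {
+     /*
+      * Update the device page table for pages that did migrate and
+      * free the device pages allocated for those that did not.
+      */
+     driver_finish_migrate(private, src, dst, start, end);
+ }
+
+ static const struct migrate_vma_ops driver_migrate_ops = {
+     .alloc_and_copy = driver_alloc_and_copy,
+     .finalize_and_map = driver_finalize_and_map,
+ };
+
+ int driver_migrate_range(struct vm_area_struct *vma,
+                          unsigned long start,
+                          unsigned long end,
+                          void *private)
+ {
+     unsigned long npages = (end - start) >> PAGE_SHIFT;
+     unsigned long src[16] = {0}, dst[16] = {0};
+
+     if (npages > ARRAY_SIZE(src))
+         return -EINVAL;
+     return migrate_vma(&driver_migrate_ops, vma, npages, start, end,
+                        src, dst, private);
+ }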
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [HMM 00/16] HMM (Heterogeneous Memory Management) v19
  2017-04-05 20:40 ` Jérôme Glisse
                   ` (16 preceding siblings ...)
  (?)
@ 2017-04-06  3:22 ` Figo.zhang
  2017-04-06  4:59     ` Jerome Glisse
  -1 siblings, 1 reply; 81+ messages in thread
From: Figo.zhang @ 2017-04-06  3:22 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: Andrew Morton, LKML, Linux MM, John Hubbard, Dan Williams,
	Naoya Horiguchi, David Nellans

[-- Attachment #1: Type: text/plain, Size: 6191 bytes --]

>
>
>
>
> Heterogeneous Memory Management (HMM) (description and justification)
>
> Today device driver expose dedicated memory allocation API through their
> device file, often relying on a combination of IOCTL and mmap calls. The
> device can only access and use memory allocated through this API. This
> effectively split the program address space into object allocated for the
> device and useable by the device and other regular memory (malloc, mmap
> of a file, share memory, …) only accessible by CPU (or in a very limited
> way by a device by pinning memory).
>
> Allowing different isolated component of a program to use a device thus
> require duplication of the input data structure using device memory
> allocator. This is reasonable for simple data structure (array, grid,
> image, …) but this get extremely complex with advance data structure
> (list, tree, graph, …) that rely on a web of memory pointers. This is
> becoming a serious limitation on the kind of work load that can be
> offloaded to device like GPU.
>

How is this handled by the current GPU software stack? By maintaining a complex
middleware framework/HAL?


>
> New industry standard like C++, OpenCL or CUDA are pushing to remove this
> barrier. This require a shared address space between GPU device and CPU so
> that GPU can access any memory of a process (while still obeying memory
> protection like read only).


Can the GPU access the whole process's VMAs, or only the VMAs whose backing
system memory has been migrated into the GPU page table?



> This kind of feature is also appearing in
> various other operating systems.
>
> HMM is a set of helpers to facilitate several aspects of address space
> sharing and device memory management. Unlike existing sharing mechanism
> that rely on pining pages use by a device, HMM relies on mmu_notifier to
> propagate CPU page table update to device page table.
>
> Duplicating CPU page table is only one aspect necessary for efficiently
> using device like GPU. GPU local memory have bandwidth in the TeraBytes/
> second range but they are connected to main memory through a system bus
> like PCIE that is limited to 32GigaBytes/second (PCIE 4.0 16x). Thus it
> is necessary to allow migration of process memory from main system memory
> to device memory. Issue is that on platform that only have PCIE the device
> memory is not accessible by the CPU with the same properties as main
> memory (cache coherency, atomic operations, …).
>
> To allow migration from main memory to device memory HMM provides a set
> of helper to hotplug device memory as a new type of ZONE_DEVICE memory
> which is un-addressable by CPU but still has struct page representing it.
> This allow most of the core kernel logic that deals with a process memory
> to stay oblivious of the peculiarity of device memory.
>
> When page backing an address of a process is migrated to device memory
> the CPU page table entry is set to a new specific swap entry. CPU access
> to such address triggers a migration back to system memory, just like if
> the page was swap on disk. HMM also blocks any one from pinning a
> ZONE_DEVICE page so that it can always be migrated back to system memory
> if CPU access it. Conversely HMM does not migrate to device memory any
> page that is pin in system memory.
>

Is the purpose of migrating system pages to the device so that the device can
read the system memory?
If the CPU/programs want to read the device data, do they need to pin/map the
device memory into the process address space?
If multiple applications want to read the same device memory region
concurrently, how is that done?

It would be better to have a graph showing how the CPU and GPU share the
address space.


>
> To allow efficient migration between device memory and main memory a new
> migrate_vma() helpers is added with this patchset. It allows to leverage
> device DMA engine to perform the copy operation.
>
> This feature will be use by upstream driver like nouveau mlx5 and probably
> other in the future (amdgpu is next suspect  in line). We are actively
> working on nouveau and mlx5 support. To test this patchset we also worked
> with NVidia close source driver team, they have more resources than us to
> test this kind of infrastructure and also a bigger and better userspace
> eco-system with various real industry workload they can be use to test and
> profile HMM.
>
> The expected workload is a program builds a data set on the CPU (from disk,
> from network, from sensors, …). Program uses GPU API (OpenCL, CUDA, ...)
> to give hint on memory placement for the input data and also for the output
> buffer. Program call GPU API to schedule a GPU job, this happens using
> device driver specific ioctl. All this is hidden from programmer point of
> view in case of C++ compiler that transparently offload some part of a
> program to GPU. Program can keep doing other stuff on the CPU while the
> GPU is crunching numbers.
>
> It is expected that CPU will not access the same data set as the GPU while
> GPU is working on it, but this is not mandatory. In fact we expect some
> small memory object to be actively access by both GPU and CPU concurrently
> as synchronization channel and/or for monitoring purposes. Such object will
> stay in system memory and should not be bottlenecked by system bus
> bandwidth (rare write and read access from both CPU and GPU).
>
> As we are relying on device driver API, HMM does not introduce any new
> syscall nor does it modify any existing ones. It does not change any POSIX
> semantics or behaviors. For instance the child after a fork of a process
> that is using HMM will not be impacted in anyway, nor is there any data
> hazard between child COW or parent COW of memory that was migrated to
> device prior to fork.
>
> HMM assume a numbers of hardware features. Device must allow device page
> table to be updated at any time (ie device job must be preemptable). Device
> page table must provides memory protection such as read only. Device must
> track write access (dirty bit). Device must have a minimum granularity that
> match PAGE_SIZE (ie 4k).
>
>
>
>

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [HMM 00/16] HMM (Heterogeneous Memory Management) v19
  2017-04-06  3:22 ` [HMM 00/16] HMM (Heterogeneous Memory Management) v19 Figo.zhang
@ 2017-04-06  4:59     ` Jerome Glisse
  0 siblings, 0 replies; 81+ messages in thread
From: Jerome Glisse @ 2017-04-06  4:59 UTC (permalink / raw)
  To: Figo.zhang
  Cc: Andrew Morton, LKML, Linux MM, John Hubbard, Dan Williams,
	Naoya Horiguchi, David Nellans

On Thu, Apr 06, 2017 at 11:22:12AM +0800, Figo.zhang wrote:

[...]

> > Heterogeneous Memory Management (HMM) (description and justification)
> >
> > Today device driver expose dedicated memory allocation API through their
> > device file, often relying on a combination of IOCTL and mmap calls. The
> > device can only access and use memory allocated through this API. This
> > effectively split the program address space into object allocated for the
> > device and useable by the device and other regular memory (malloc, mmap
> > of a file, share memory, …) only accessible by CPU (or in a very limited
> > way by a device by pinning memory).
> >
> > Allowing different isolated component of a program to use a device thus
> > require duplication of the input data structure using device memory
> > allocator. This is reasonable for simple data structure (array, grid,
> > image, …) but this get extremely complex with advance data structure
> > (list, tree, graph, …) that rely on a web of memory pointers. This is
> > becoming a serious limitation on the kind of work load that can be
> > offloaded to device like GPU.
> >
> 
> how handle it by current  GPU software stack? maintain a complex middle
> firmwork/HAL?

Yes, you still need a framework like OpenCL or CUDA. There is work under way
to leverage GPUs directly from languages like C++, so I expect the HAL will be
hidden more and more for a larger group of programmers. Note I still expect
some programmers will want to program closer to the hardware to extract every
bit of performance they can.

For OpenCL you need HMM to implement what is described as the fine-grained
system SVM memory model (see the OpenCL 2.0 or later specification).

> > New industry standard like C++, OpenCL or CUDA are pushing to remove this
> > barrier. This require a shared address space between GPU device and CPU so
> > that GPU can access any memory of a process (while still obeying memory
> > protection like read only).
> 
> GPU can access the whole process VMAs or any VMAs which backing system
> memory has migrate to GPU page table?

The whole process's VMAs; they do not need to be migrated to device memory.
The migration is an optional feature that is necessary for performance, but
the GPU can access system memory just fine.

[...]

> > When page backing an address of a process is migrated to device memory
> > the CPU page table entry is set to a new specific swap entry. CPU access
> > to such address triggers a migration back to system memory, just like if
> > the page was swap on disk. HMM also blocks any one from pinning a
> > ZONE_DEVICE page so that it can always be migrated back to system memory
> > if CPU access it. Conversely HMM does not migrate to device memory any
> > page that is pin in system memory.
> >
> 
> the purpose of  migrate the system pages to device is that device can read
> the system memory?
> if the CPU/programs want read the device data, it need pin/mapping the
> device memory to the process address space?
> if multiple applications want to read the same device memory region
> concurrently, how to do it?

The purpose of migrating to device memory is to leverage device memory
bandwidth. PCIE bandwidth is 32GB/s while device memory bandwidth is between
256GB/s and 1TB/s; device memory also has lower latency.

The CPU can not access device memory. It can, but only in a limited way on
PCIE, and doing so would violate the memory model programmers get for regular
system memory; hence for all intents and purposes it is better to say that the
CPU can not access any of the device memory.

Shared VMAs will just work, so if a VMA is shared between 2 processes then both
processes can access the same memory. All the semantics that are valid on the
CPU are also valid on the GPU. Nothing changes there.


> it is better a graph to show how CPU and GPU share the address space.

I am not good at making ASCII graphs, nor would I know how to graph this. Any
valid address on the CPU is valid on the GPU, that's it really. The migration
to device memory is orthogonal to the shared address space.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [HMM 01/16] mm/memory/hotplug: add memory type parameter to arch_add/remove_memory
  2017-04-05 20:40   ` Jérôme Glisse
@ 2017-04-06  9:45     ` Anshuman Khandual
  -1 siblings, 0 replies; 81+ messages in thread
From: Anshuman Khandual @ 2017-04-06  9:45 UTC (permalink / raw)
  To: Jérôme Glisse, akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, Naoya Horiguchi, David Nellans,
	Russell King, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Martin Schwidefsky, Heiko Carstens,
	Yoshinori Sato, Rich Felker, Chris Metcalf, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin

> diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
> index 5f84433..0933261 100644
> --- a/arch/powerpc/mm/mem.c
> +++ b/arch/powerpc/mm/mem.c
> @@ -126,14 +126,31 @@ int __weak remove_section_mapping(unsigned long start, unsigned long end)
>  	return -ENODEV;
>  }
>  
> -int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
> +int arch_add_memory(int nid, u64 start, u64 size, enum memory_type type)
>  {
>  	struct pglist_data *pgdata;
> -	struct zone *zone;
>  	unsigned long start_pfn = start >> PAGE_SHIFT;
>  	unsigned long nr_pages = size >> PAGE_SHIFT;
> +	bool for_device = false;
> +	struct zone *zone;
>  	int rc;
>  
> +	/*
> +	 * Each memory_type needs special handling, so error out on an
> +	 * unsupported type. In particular, MEMORY_DEVICE_UNADDRESSABLE
> +	 * is not supported on this architecture.

The concept of MEMORY_DEVICE_UNADDRESSABLE has not been
introduced yet in this patch if I read correctly.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [HMM 01/16] mm/memory/hotplug: add memory type parameter to arch_add/remove_memory
  2017-04-06  9:45     ` Anshuman Khandual
@ 2017-04-06 13:58       ` Jerome Glisse
  -1 siblings, 0 replies; 81+ messages in thread
From: Jerome Glisse @ 2017-04-06 13:58 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Dan Williams,
	Naoya Horiguchi, David Nellans, Russell King,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Martin Schwidefsky, Heiko Carstens, Yoshinori Sato, Rich Felker,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin

> > diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
> > index 5f84433..0933261 100644
> > --- a/arch/powerpc/mm/mem.c
> > +++ b/arch/powerpc/mm/mem.c
> > @@ -126,14 +126,31 @@ int __weak remove_section_mapping(unsigned long
> > start, unsigned long end)
> >  	return -ENODEV;
> >  }
> >  
> > -int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
> > +int arch_add_memory(int nid, u64 start, u64 size, enum memory_type type)
> >  {
> >  	struct pglist_data *pgdata;
> > -	struct zone *zone;
> >  	unsigned long start_pfn = start >> PAGE_SHIFT;
> >  	unsigned long nr_pages = size >> PAGE_SHIFT;
> > +	bool for_device = false;
> > +	struct zone *zone;
> >  	int rc;
> >  
> > +	/*
> > +	 * Each memory_type needs special handling, so error out on an
> > +	 * unsupported type. In particular, MEMORY_DEVICE_UNADDRESSABLE
> > +	 * is not supported on this architecture.
> 
> The concept of MEMORY_DEVICE_UNADDRESSABLE has not been
> introduced yet in this patch if I read correctly.

Correct. I did not want to add the comment to all the arch files in the patch
that adds it, because this is one of the most painful patches to rebase. So
instead of having more patches that are problematic to rebase, I just added
the proper comment ahead of time to make my constant rebasing easier.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [HMM 14/16] mm/hmm/devmem: device memory hotplug using ZONE_DEVICE
  2017-04-05 20:40   ` Jérôme Glisse
@ 2017-04-06 21:22     ` Jerome Glisse
  -1 siblings, 0 replies; 81+ messages in thread
From: Jerome Glisse @ 2017-04-06 21:22 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, Naoya Horiguchi, David Nellans,
	Evgeny Baskakov, Mark Hairgrove, Sherry Cheung, Subhash Gutti

[-- Attachment #1: Type: text/plain, Size: 112 bytes --]


So during the rebase on the latest mmotm, the logic of one if branch got
inverted. Attached is a fixup patch.

Cheers,
Jérôme

[-- Attachment #2: 0001-fixup-mm-hmm-devmem-device-memory-hotplug-using-ZONE.patch --]
[-- Type: text/plain, Size: 734 bytes --]

>From 374bca39b19a88da1d1c6d38c0a4c49c1af31c18 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?J=C3=A9r=C3=B4me=20Glisse?= <jglisse@redhat.com>
Date: Thu, 6 Apr 2017 17:16:56 -0400
Subject: [PATCH] fixup! mm/hmm/devmem: device memory hotplug using ZONE_DEVICE

---
 mm/hmm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index ff8ec59..f567a8b 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -898,7 +898,7 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
 	ret = arch_add_memory(nid, align_start, align_size,
 			      devmem->pagemap.type);
 	mem_hotplug_done();
-	if (!ret)
+	if (ret)
 		goto error_add_memory;
 
 	for (pfn = devmem->pfn_first; pfn < devmem->pfn_last; pfn++) {
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [HMM 14/16] mm/hmm/devmem: device memory hotplug using ZONE_DEVICE
  2017-04-05 20:40   ` Jérôme Glisse
@ 2017-04-07  1:37     ` Balbir Singh
  -1 siblings, 0 replies; 81+ messages in thread
From: Balbir Singh @ 2017-04-07  1:37 UTC (permalink / raw)
  To: Jérôme Glisse, akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, Naoya Horiguchi, David Nellans,
	Evgeny Baskakov, Mark Hairgrove, Sherry Cheung, Subhash Gutti

On Wed, 2017-04-05 at 16:40 -0400, Jérôme Glisse wrote:
> This introduce a simple struct and associated helpers for device driver
> to use when hotpluging un-addressable device memory as ZONE_DEVICE. It
> will find a unuse physical address range and trigger memory hotplug for
> it which allocates and initialize struct page for the device memory.
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
> Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
> Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
> ---
>  include/linux/hmm.h | 114 +++++++++++++++
>  mm/Kconfig          |   9 ++
>  mm/hmm.c            | 398 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 521 insertions(+)
> 
> +/*
> + * To add (hotplug) device memory, HMM assumes that there is no real resource
> + * that reserves a range in the physical address space (this is intended to be
> + * use by unaddressable device memory). It will reserve a physical range big
> + * enough and allocate struct page for it.

I've found that the implementation of this is quite non-portable, in that
starting from iomem_resource.end+1-size (which is effectively -size) on
my platform (powerpc) does not give expected results. It could be that
additional changes are needed to arch_add_memory() to support this
use case.

> +
> +	size = ALIGN(size, SECTION_SIZE);
> +	addr = (iomem_resource.end + 1ULL) - size;


Why don't we allocate_resource() with the right constraints and get a new
unused region?

Thanks,
Balbir

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [HMM 14/16] mm/hmm/devmem: device memory hotplug using ZONE_DEVICE
  2017-04-07  1:37     ` Balbir Singh
@ 2017-04-07  2:02       ` Jerome Glisse
  -1 siblings, 0 replies; 81+ messages in thread
From: Jerome Glisse @ 2017-04-07  2:02 UTC (permalink / raw)
  To: Balbir Singh
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Dan Williams,
	Naoya Horiguchi, David Nellans, Evgeny Baskakov, Mark Hairgrove,
	Sherry Cheung, Subhash Gutti

On Fri, Apr 07, 2017 at 11:37:34AM +1000, Balbir Singh wrote:
> On Wed, 2017-04-05 at 16:40 -0400, Jérôme Glisse wrote:
> > This introduce a simple struct and associated helpers for device driver
> > to use when hotpluging un-addressable device memory as ZONE_DEVICE. It
> > will find a unuse physical address range and trigger memory hotplug for
> > it which allocates and initialize struct page for the device memory.
> > 
> > Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> > Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
> > Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> > Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
> > Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
> > Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
> > ---
> >  include/linux/hmm.h | 114 +++++++++++++++
> >  mm/Kconfig          |   9 ++
> >  mm/hmm.c            | 398 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 521 insertions(+)
> > 
> > +/*
> > + * To add (hotplug) device memory, HMM assumes that there is no real resource
> > + * that reserves a range in the physical address space (this is intended to be
> > + * use by unaddressable device memory). It will reserve a physical range big
> > + * enough and allocate struct page for it.
> 
> I've found that the implementation of this is quite non-portable, in that
> starting from iomem_resource.end+1-size (which is effectively -size) on
> my platform (powerpc) does not give expected results. It could be that
> additional changes are needed to arch_add_memory() to support this
> use case.

The CDM version does not use that part. That being said, isn't -size a valid
value since we only care about unsigned arithmetic here? What is the end value
on powerpc? In any case this sounds more like an unsigned/signed arithmetic
issue; I will look into it.

> 
> > +
> > +	size = ALIGN(size, SECTION_SIZE);
> > +	addr = (iomem_resource.end + 1ULL) - size;
> 
> 
> Why don't we allocate_resource() with the right constraints and get a new
> unused region?

The issue with allocate_resource() is that it scans the resource tree from
lower addresses to higher ones. I was told that a hotplug conflict is less
likely if I pick the highest physical addresses for the device memory, which
is why I do my own scan from the end toward the start.

Again, none of this applies to PPC; it can be hidden behind an x86 config if
you prefer.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [HMM 01/16] mm/memory/hotplug: add memory type parameter to arch_add/remove_memory
  2017-04-05 20:40   ` Jérôme Glisse
@ 2017-04-07 12:13     ` Michal Hocko
  -1 siblings, 0 replies; 81+ messages in thread
From: Michal Hocko @ 2017-04-07 12:13 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Dan Williams,
	Naoya Horiguchi, David Nellans, Russell King,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Martin Schwidefsky, Heiko Carstens, Yoshinori Sato, Rich Felker,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin

On Wed 05-04-17 16:40:11, Jérôme Glisse wrote:
> When hotpluging memory we want more information on the type of memory.
> This is to extend ZONE_DEVICE to support new type of memory other than
> the persistent memory. Existing user of ZONE_DEVICE (persistent memory)
> will be left un-modified.

My current hotplug rework [1] is touching this path as well. It is not
really clear from the change why you are changing this and what the further
expectations of MEMORY_DEVICE_PERSISTENT are. In fact I have replaced
for_device with want_memblock [2]. I plan to repost shortly but I would
like to understand your modifications more to reduce potential conflicts
in the code. Why do you need to distinguish different types of memory
anyway?

[1] http://lkml.kernel.org/r/20170330115454.32154-1-mhocko@kernel.org
[2] the current patchset is in git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git
    branch attempts/rewrite-mem_hotplug-WIP
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Russell King <linux@armlinux.org.uk>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
> Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
> Cc: Rich Felker <dalias@libc.org>
> Cc: Chris Metcalf <cmetcalf@mellanox.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> ---
>  arch/ia64/mm/init.c            | 36 +++++++++++++++++++++++++++++++++---
>  arch/powerpc/mm/mem.c          | 37 ++++++++++++++++++++++++++++++++++---
>  arch/s390/mm/init.c            | 16 ++++++++++++++--
>  arch/sh/mm/init.c              | 35 +++++++++++++++++++++++++++++++++--
>  arch/x86/mm/init_32.c          | 41 +++++++++++++++++++++++++++++++++++++----
>  arch/x86/mm/init_64.c          | 39 +++++++++++++++++++++++++++++++++++----
>  include/linux/memory_hotplug.h | 24 ++++++++++++++++++++++--
>  include/linux/memremap.h       |  2 ++
>  kernel/memremap.c              |  5 +++--
>  mm/memory_hotplug.c            |  4 ++--
>  10 files changed, 215 insertions(+), 24 deletions(-)
> 
> diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
> index 06cdaef..c910b3f 100644
> --- a/arch/ia64/mm/init.c
> +++ b/arch/ia64/mm/init.c
> @@ -645,20 +645,36 @@ mem_init (void)
>  }
>  
>  #ifdef CONFIG_MEMORY_HOTPLUG
> -int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
> +int arch_add_memory(int nid, u64 start, u64 size, enum memory_type type)
>  {
>  	pg_data_t *pgdat;
>  	struct zone *zone;
>  	unsigned long start_pfn = start >> PAGE_SHIFT;
>  	unsigned long nr_pages = size >> PAGE_SHIFT;
> +	bool for_device = false;
>  	int ret;
>  
> +	/*
> +	 * Each memory_type needs special handling, so error out on an
> +	 * unsupported type. In particular, MEMORY_DEVICE_UNADDRESSABLE
> +	 * is not supported on this architecture.
> +	 */
> +	switch (type) {
> +	case MEMORY_NORMAL:
> +		break;
> +	case MEMORY_DEVICE_PERSISTENT:
> +		for_device = true;
> +		break;
> +	default:
> +		pr_err("hotplug unsupported memory type %d\n", type);
> +		return -EINVAL;
> +	}
> +
>  	pgdat = NODE_DATA(nid);
>  
>  	zone = pgdat->node_zones +
>  		zone_for_memory(nid, start, size, ZONE_NORMAL, for_device);
>  	ret = __add_pages(nid, zone, start_pfn, nr_pages);
> -
>  	if (ret)
>  		printk("%s: Problem encountered in __add_pages() as ret=%d\n",
>  		       __func__,  ret);
> @@ -667,13 +683,27 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
>  }
>  
>  #ifdef CONFIG_MEMORY_HOTREMOVE
> -int arch_remove_memory(u64 start, u64 size)
> +int arch_remove_memory(u64 start, u64 size, enum memory_type type)
>  {
>  	unsigned long start_pfn = start >> PAGE_SHIFT;
>  	unsigned long nr_pages = size >> PAGE_SHIFT;
>  	struct zone *zone;
>  	int ret;
>  
> +	/*
> +	 * Each memory_type needs special handling, so error out on an
> +	 * unsupported type. In particular, MEMORY_DEVICE_UNADDRESSABLE
> +	 * is not supported on this architecture.
> +	 */
> +	switch (type) {
> +	case MEMORY_NORMAL:
> +	case MEMORY_DEVICE_PERSISTENT:
> +		break;
> +	default:
> +		pr_err("hotplug unsupported memory type %d\n", type);
> +		return -EINVAL;
> +	}
> +
>  	zone = page_zone(pfn_to_page(start_pfn));
>  	ret = __remove_pages(zone, start_pfn, nr_pages);
>  	if (ret)
> diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
> index 5f84433..0933261 100644
> --- a/arch/powerpc/mm/mem.c
> +++ b/arch/powerpc/mm/mem.c
> @@ -126,14 +126,31 @@ int __weak remove_section_mapping(unsigned long start, unsigned long end)
>  	return -ENODEV;
>  }
>  
> -int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
> +int arch_add_memory(int nid, u64 start, u64 size, enum memory_type type)
>  {
>  	struct pglist_data *pgdata;
> -	struct zone *zone;
>  	unsigned long start_pfn = start >> PAGE_SHIFT;
>  	unsigned long nr_pages = size >> PAGE_SHIFT;
> +	bool for_device = false;
> +	struct zone *zone;
>  	int rc;
>  
> +	/*
> +	 * Each memory_type needs special handling, so error out on an
> +	 * unsupported type. In particular, MEMORY_DEVICE_UNADDRESSABLE
> +	 * is not supported on this architecture.
> +	 */
> +	switch (type) {
> +	case MEMORY_NORMAL:
> +		break;
> +	case MEMORY_DEVICE_PERSISTENT:
> +		for_device = true;
> +		break;
> +	default:
> +		pr_err("hotplug unsupported memory type %d\n", type);
> +		return -EINVAL;
> +	}
> +
>  	pgdata = NODE_DATA(nid);
>  
>  	start = (unsigned long)__va(start);
> @@ -153,13 +170,27 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
>  }
>  
>  #ifdef CONFIG_MEMORY_HOTREMOVE
> -int arch_remove_memory(u64 start, u64 size)
> +int arch_remove_memory(u64 start, u64 size, enum memory_type type)
>  {
>  	unsigned long start_pfn = start >> PAGE_SHIFT;
>  	unsigned long nr_pages = size >> PAGE_SHIFT;
>  	struct zone *zone;
>  	int ret;
>  
> +	/*
> +	 * Each memory_type needs special handling, so error out on an
> +	 * unsupported type. In particular, MEMORY_DEVICE_UNADDRESSABLE
> +	 * is not supported on this architecture.
> +	 */
> +	switch (type) {
> +	case MEMORY_NORMAL:
> +	case MEMORY_DEVICE_PERSISTENT:
> +		break;
> +	default:
> +		pr_err("hotplug unsupported memory type %d\n", type);
> +		return -EINVAL;
> +	}
> +
>  	zone = page_zone(pfn_to_page(start_pfn));
>  	ret = __remove_pages(zone, start_pfn, nr_pages);
>  	if (ret)
> diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
> index bf5b8a0..20d7714 100644
> --- a/arch/s390/mm/init.c
> +++ b/arch/s390/mm/init.c
> @@ -153,7 +153,7 @@ void __init free_initrd_mem(unsigned long start, unsigned long end)
>  #endif
>  
>  #ifdef CONFIG_MEMORY_HOTPLUG
> -int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
> +int arch_add_memory(int nid, u64 start, u64 size, enum memory_type type)
>  {
>  	unsigned long zone_start_pfn, zone_end_pfn, nr_pages;
>  	unsigned long start_pfn = PFN_DOWN(start);
> @@ -162,6 +162,18 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
>  	struct zone *zone;
>  	int rc, i;
>  
> +	/*
> +	 * Each memory_type needs special handling, so error out on an
> +	 * unsupported type.
> +	 */
> +	switch (type) {
> +	case MEMORY_NORMAL:
> +		break;
> +	default:
> +		pr_err("hotplug unsupported memory type %d\n", type);
> +		return -EINVAL;
> +	}
> +
>  	rc = vmem_add_mapping(start, size);
>  	if (rc)
>  		return rc;
> @@ -205,7 +217,7 @@ unsigned long memory_block_size_bytes(void)
>  }
>  
>  #ifdef CONFIG_MEMORY_HOTREMOVE
> -int arch_remove_memory(u64 start, u64 size)
> +int arch_remove_memory(u64 start, u64 size, enum memory_type type)
>  {
>  	/*
>  	 * There is no hardware or firmware interface which could trigger a
> diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
> index 7549186..f37e7a6 100644
> --- a/arch/sh/mm/init.c
> +++ b/arch/sh/mm/init.c
> @@ -485,13 +485,30 @@ void free_initrd_mem(unsigned long start, unsigned long end)
>  #endif
>  
>  #ifdef CONFIG_MEMORY_HOTPLUG
> -int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
> +int arch_add_memory(int nid, u64 start, u64 size, enum memory_type type)
>  {
>  	pg_data_t *pgdat;
>  	unsigned long start_pfn = PFN_DOWN(start);
>  	unsigned long nr_pages = size >> PAGE_SHIFT;
> +	bool for_device = false;
>  	int ret;
>  
> +	/*
> +	 * Each memory_type needs special handling, so error out on an
> +	 * unsupported type. In particular, MEMORY_DEVICE_UNADDRESSABLE
> +	 * is not supported on this architecture.
> +	 */
> +	switch (type) {
> +	case MEMORY_NORMAL:
> +		break;
> +	case MEMORY_DEVICE_PERSISTENT:
> +		for_device = true;
> +		break;
> +	default:
> +		pr_err("hotplug unsupported memory type %d\n", type);
> +		return -EINVAL;
> +	}
> +
>  	pgdat = NODE_DATA(nid);
>  
>  	/* We only have ZONE_NORMAL, so this is easy.. */
> @@ -516,13 +533,27 @@ EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
>  #endif
>  
>  #ifdef CONFIG_MEMORY_HOTREMOVE
> -int arch_remove_memory(u64 start, u64 size)
> +int arch_remove_memory(u64 start, u64 size, enum memory_type type)
>  {
>  	unsigned long start_pfn = PFN_DOWN(start);
>  	unsigned long nr_pages = size >> PAGE_SHIFT;
>  	struct zone *zone;
>  	int ret;
>  
> +	/*
> +	 * Each memory_type needs special handling, so error out on an
> +	 * unsupported type. In particular, MEMORY_DEVICE_UNADDRESSABLE
> +	 * is not supported on this architecture.
> +	 */
> +	switch (type) {
> +	case MEMORY_NORMAL:
> +	case MEMORY_DEVICE_PERSISTENT:
> +		break;
> +	default:
> +		pr_err("hotplug unsupported memory type %d\n", type);
> +		return -EINVAL;
> +	}
> +
>  	zone = page_zone(pfn_to_page(start_pfn));
>  	ret = __remove_pages(zone, start_pfn, nr_pages);
>  	if (unlikely(ret))
> diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
> index c68078f..811d631 100644
> --- a/arch/x86/mm/init_32.c
> +++ b/arch/x86/mm/init_32.c
> @@ -826,24 +826,57 @@ void __init mem_init(void)
>  }
>  
>  #ifdef CONFIG_MEMORY_HOTPLUG
> -int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
> +int arch_add_memory(int nid, u64 start, u64 size, enum memory_type type)
>  {
>  	struct pglist_data *pgdata = NODE_DATA(nid);
> -	struct zone *zone = pgdata->node_zones +
> -		zone_for_memory(nid, start, size, ZONE_HIGHMEM, for_device);
>  	unsigned long start_pfn = start >> PAGE_SHIFT;
>  	unsigned long nr_pages = size >> PAGE_SHIFT;
> +	bool for_device = false;
> +	struct zone *zone;
> +
> +	/*
> +	 * Each memory_type needs special handling, so error out on an
> +	 * unsupported type. In particular, MEMORY_DEVICE_UNADDRESSABLE
> +	 * is not supported on this architecture.
> +	 */
> +	switch (type) {
> +	case MEMORY_NORMAL:
> +		break;
> +	case MEMORY_DEVICE_PERSISTENT:
> +		for_device = true;
> +		break;
> +	default:
> +		pr_err("hotplug unsupported memory type %d\n", type);
> +		return -EINVAL;
> +	}
> +
> +	zone = pgdata->node_zones +
> +		zone_for_memory(nid, start, size, ZONE_HIGHMEM, for_device);
>  
>  	return __add_pages(nid, zone, start_pfn, nr_pages);
>  }
>  
>  #ifdef CONFIG_MEMORY_HOTREMOVE
> -int arch_remove_memory(u64 start, u64 size)
> +int arch_remove_memory(u64 start, u64 size, enum memory_type type)
>  {
>  	unsigned long start_pfn = start >> PAGE_SHIFT;
>  	unsigned long nr_pages = size >> PAGE_SHIFT;
>  	struct zone *zone;
>  
> +	/*
> +	 * Each memory_type needs special handling, so error out on an
> +	 * unsupported type. In particular, MEMORY_DEVICE_UNADDRESSABLE
> +	 * is not supported on this architecture.
> +	 */
> +	switch (type) {
> +	case MEMORY_NORMAL:
> +	case MEMORY_DEVICE_PERSISTENT:
> +		break;
> +	default:
> +		pr_err("hotplug unsupported memory type %d\n", type);
> +		return -EINVAL;
> +	}
> +
>  	zone = page_zone(pfn_to_page(start_pfn));
>  	return __remove_pages(zone, start_pfn, nr_pages);
>  }
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 7eef172..6c0b24e 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -641,15 +641,33 @@ static void  update_end_of_memory_vars(u64 start, u64 size)
>   * Memory is added always to NORMAL zone. This means you will never get
>   * additional DMA/DMA32 memory.
>   */
> -int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
> +int arch_add_memory(int nid, u64 start, u64 size, enum memory_type type)
>  {
>  	struct pglist_data *pgdat = NODE_DATA(nid);
> -	struct zone *zone = pgdat->node_zones +
> -		zone_for_memory(nid, start, size, ZONE_NORMAL, for_device);
>  	unsigned long start_pfn = start >> PAGE_SHIFT;
>  	unsigned long nr_pages = size >> PAGE_SHIFT;
> +	bool for_device = false;
> +	struct zone *zone;
>  	int ret;
>  
> +	/*
> +	 * Each memory_type needs special handling, so error out on an
> +	 * unsupported type.
> +	 */
> +	switch (type) {
> +	case MEMORY_NORMAL:
> +		break;
> +	case MEMORY_DEVICE_PERSISTENT:
> +		for_device = true;
> +		break;
> +	default:
> +		pr_err("hotplug unsupported memory type %d\n", type);
> +		return -EINVAL;
> +	}
> +
> +	zone = pgdat->node_zones +
> +		zone_for_memory(nid, start, size, ZONE_NORMAL, for_device);
> +
>  	init_memory_mapping(start, start + size);
>  
>  	ret = __add_pages(nid, zone, start_pfn, nr_pages);
> @@ -946,7 +964,7 @@ kernel_physical_mapping_remove(unsigned long start, unsigned long end)
>  	remove_pagetable(start, end, true);
>  }
>  
> -int __ref arch_remove_memory(u64 start, u64 size)
> +int __ref arch_remove_memory(u64 start, u64 size, enum memory_type type)
>  {
>  	unsigned long start_pfn = start >> PAGE_SHIFT;
>  	unsigned long nr_pages = size >> PAGE_SHIFT;
> @@ -955,6 +973,19 @@ int __ref arch_remove_memory(u64 start, u64 size)
>  	struct zone *zone;
>  	int ret;
>  
> +	/*
> +	 * Each memory_type needs special handling, so error out on an
> +	 * unsupported type.
> +	 */
> +	switch (type) {
> +	case MEMORY_NORMAL:
> +	case MEMORY_DEVICE_PERSISTENT:
> +		break;
> +	default:
> +		pr_err("hotplug unsupported memory type %d\n", type);
> +		return -EINVAL;
> +	}
> +
>  	/* With altmap the first mapped page is offset from @start */
>  	altmap = to_vmem_altmap((unsigned long) page);
>  	if (altmap)
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 134a2f6..c3999f2 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -13,6 +13,26 @@ struct mem_section;
>  struct memory_block;
>  struct resource;
>  
> +/*
> + * When hotplugging memory with arch_add_memory(), we want more information on
> + * the type of memory we are hotplugging, because depending on the type of
> + * architecture, the code might want to take different paths.
> + *
> + * MEMORY_NORMAL:
> + * Your regular system memory. Default common case.
> + *
> + * MEMORY_DEVICE_PERSISTENT:
> + * Persistent device memory (pmem): struct page might be allocated in different
> + * memory and architecture might want to perform special actions. It is similar
> + * to regular memory, in that the CPU can access it transparently. However,
> + * it is likely to have different bandwidth and latency than regular memory.
> + * See Documentation/nvdimm/nvdimm.txt for more information.
> + */
> +enum memory_type {
> +	MEMORY_NORMAL = 0,
> +	MEMORY_DEVICE_PERSISTENT,
> +};
> +
>  #ifdef CONFIG_MEMORY_HOTPLUG
>  
>  /*
> @@ -104,7 +124,7 @@ extern bool memhp_auto_online;
>  
>  #ifdef CONFIG_MEMORY_HOTREMOVE
>  extern bool is_pageblock_removable_nolock(struct page *page);
> -extern int arch_remove_memory(u64 start, u64 size);
> +extern int arch_remove_memory(u64 start, u64 size, enum memory_type type);
>  extern int __remove_pages(struct zone *zone, unsigned long start_pfn,
>  	unsigned long nr_pages);
>  #endif /* CONFIG_MEMORY_HOTREMOVE */
> @@ -276,7 +296,7 @@ extern int add_memory(int nid, u64 start, u64 size);
>  extern int add_memory_resource(int nid, struct resource *resource, bool online);
>  extern int zone_for_memory(int nid, u64 start, u64 size, int zone_default,
>  		bool for_device);
> -extern int arch_add_memory(int nid, u64 start, u64 size, bool for_device);
> +extern int arch_add_memory(int nid, u64 start, u64 size, enum memory_type type);
>  extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
>  extern bool is_memblock_offlined(struct memory_block *mem);
>  extern void remove_memory(int nid, u64 start, u64 size);
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 9341619..1f720f7 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -41,12 +41,14 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
>   * @res: physical address range covered by @ref
>   * @ref: reference count that pins the devm_memremap_pages() mapping
>   * @dev: host device of the mapping for debug
> + * @type: memory type see MEMORY_* in memory_hotplug.h
>   */
>  struct dev_pagemap {
>  	struct vmem_altmap *altmap;
>  	const struct resource *res;
>  	struct percpu_ref *ref;
>  	struct device *dev;
> +	enum memory_type type;
>  };
>  
>  #ifdef CONFIG_ZONE_DEVICE
> diff --git a/kernel/memremap.c b/kernel/memremap.c
> index 07e85e5..6b4505d 100644
> --- a/kernel/memremap.c
> +++ b/kernel/memremap.c
> @@ -248,7 +248,7 @@ static void devm_memremap_pages_release(struct device *dev, void *data)
>  	align_size = ALIGN(resource_size(res), SECTION_SIZE);
>  
>  	mem_hotplug_begin();
> -	arch_remove_memory(align_start, align_size);
> +	arch_remove_memory(align_start, align_size, pgmap->type);
>  	mem_hotplug_done();
>  
>  	untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
> @@ -326,6 +326,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
>  	}
>  	pgmap->ref = ref;
>  	pgmap->res = &page_map->res;
> +	pgmap->type = MEMORY_DEVICE_PERSISTENT;
>  
>  	mutex_lock(&pgmap_lock);
>  	error = 0;
> @@ -363,7 +364,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
>  		goto err_pfn_remap;
>  
>  	mem_hotplug_begin();
> -	error = arch_add_memory(nid, align_start, align_size, true);
> +	error = arch_add_memory(nid, align_start, align_size, pgmap->type);
>  	mem_hotplug_done();
>  	if (error)
>  		goto err_add_memory;
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index a07a07c..d1a4326 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1384,7 +1384,7 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online)
>  	}
>  
>  	/* call arch's memory hotadd */
> -	ret = arch_add_memory(nid, start, size, false);
> +	ret = arch_add_memory(nid, start, size, MEMORY_NORMAL);
>  
>  	if (ret < 0)
>  		goto error;
> @@ -2188,7 +2188,7 @@ void __ref remove_memory(int nid, u64 start, u64 size)
>  	memblock_free(start, size);
>  	memblock_remove(start, size);
>  
> -	arch_remove_memory(start, size);
> +	arch_remove_memory(start, size, MEMORY_NORMAL);
>  
>  	try_offline_node(nid);
>  
> -- 
> 2.9.3
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [HMM 01/16] mm/memory/hotplug: add memory type parameter to arch_add/remove_memory
  2017-04-07 12:13     ` Michal Hocko
@ 2017-04-07 14:32       ` Jerome Glisse
  -1 siblings, 0 replies; 81+ messages in thread
From: Jerome Glisse @ 2017-04-07 14:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Dan Williams,
	Naoya Horiguchi, David Nellans, Russell King,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Martin Schwidefsky, Heiko Carstens, Yoshinori Sato, Rich Felker,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin

On Fri, Apr 07, 2017 at 02:13:49PM +0200, Michal Hocko wrote:
> On Wed 05-04-17 16:40:11, Jérôme Glisse wrote:
> > When hotpluging memory we want more information on the type of memory.
> > This is to extend ZONE_DEVICE to support new type of memory other than
> > the persistent memory. Existing user of ZONE_DEVICE (persistent memory)
> > will be left un-modified.
> 
> My current hotplug rework [1] is touching this path as well. It is not
> really clear from the chage why you are changing this and what are the
> further expectations of MEMORY_DEVICE_PERSISTENT. Infact I have replaced
> for_device with want__memblock [2]. I plan to repost shortly but I would
> like to understand your modifications more to reduce potential conflicts
> in the code. Why do you need to distinguish different types of memory
> anyway.
> 
> [1] http://lkml.kernel.org/r/20170330115454.32154-1-mhocko@kernel.org
> [2] the current patchset is in git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git
>     branch attempts/rewrite-mem_hotplug-WIP

This is needed for the UNADDRESSABLE memory type introduced in patch 3; the
arch-specific bits are in patch 4. Basically, for UNADDRESSABLE memory I do
not want the arch code to create a linear mapping for the range being
hotplugged. Adding memory_type in this patch allows us to distinguish
between the different types of ZONE_DEVICE memory.
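
A rough sketch of what this enables on the x86-64 side; this is not the
exact code from patch 4, and MEMORY_DEVICE_UNADDRESSABLE only exists from
patch 3 on, so treat the names as assumptions:

int arch_add_memory(int nid, u64 start, u64 size, enum memory_type type)
{
        unsigned long start_pfn = start >> PAGE_SHIFT;
        unsigned long nr_pages = size >> PAGE_SHIFT;
        bool for_device = (type != MEMORY_NORMAL);
        struct zone *zone;

        switch (type) {
        case MEMORY_NORMAL:
        case MEMORY_DEVICE_PERSISTENT:
                /* CPU-addressable memory gets a kernel linear mapping. */
                init_memory_mapping(start, start + size);
                break;
        case MEMORY_DEVICE_UNADDRESSABLE:
                /* Device-only memory: struct pages, but no linear mapping. */
                break;
        default:
                return -EINVAL;
        }

        zone = NODE_DATA(nid)->node_zones +
                zone_for_memory(nid, start, size, ZONE_NORMAL, for_device);
        return __add_pages(nid, zone, start_pfn, nr_pages);
}

The only difference from the existing paths is the skipped
init_memory_mapping() call in the unaddressable case.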

After your patchset we no longer need for_device, but I still need to know
whether the memory is UNADDRESSABLE. You can check my branch on top of your
previous patchset (again, patches 1, 3 and 4):

https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v20

1:
https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v20&id=a85a895615e4812d3c68869cfeef92a4924b4946
3:
https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v20&id=539b6d12429a7166f3690944d6bf164930a59def
4:
https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v20&id=d5338b868e801acabb96c7166c1e802d730511e3

I will check your new branch and see what want_memblock is for.
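
(My guess from your description, sketch only and your branch may well
differ, is that the hook simply becomes

        int arch_add_memory(int nid, u64 start, u64 size, bool want_memblock);

with the zone decision left to the caller.)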

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [HMM 01/16] mm/memory/hotplug: add memory type parameter to arch_add/remove_memory
  2017-04-07 14:32       ` Jerome Glisse
@ 2017-04-07 14:45         ` Michal Hocko
  -1 siblings, 0 replies; 81+ messages in thread
From: Michal Hocko @ 2017-04-07 14:45 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Dan Williams,
	Naoya Horiguchi, David Nellans, Russell King,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Martin Schwidefsky, Heiko Carstens, Yoshinori Sato, Rich Felker,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin

On Fri 07-04-17 10:32:49, Jerome Glisse wrote:
> On Fri, Apr 07, 2017 at 02:13:49PM +0200, Michal Hocko wrote:
> > On Wed 05-04-17 16:40:11, Jérôme Glisse wrote:
> > > When hotpluging memory we want more information on the type of memory.
> > > This is to extend ZONE_DEVICE to support new type of memory other than
> > > the persistent memory. Existing user of ZONE_DEVICE (persistent memory)
> > > will be left un-modified.
> > 
> > My current hotplug rework [1] is touching this path as well. It is not
> > really clear from the chage why you are changing this and what are the
> > further expectations of MEMORY_DEVICE_PERSISTENT. Infact I have replaced
> > for_device with want__memblock [2]. I plan to repost shortly but I would
> > like to understand your modifications more to reduce potential conflicts
> > in the code. Why do you need to distinguish different types of memory
> > anyway.
> > 
> > [1] http://lkml.kernel.org/r/20170330115454.32154-1-mhocko@kernel.org
> > [2] the current patchset is in git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git
> >     branch attempts/rewrite-mem_hotplug-WIP
> 
> This is needed for UNADDRESSABLE memory type introduced in patch 3 and
> the arch specific bits are in patch 4. Basicly for UNADDRESSABLE memory
> i do not want the arch code to create a linear mapping for the range
> being hotpluged. Adding memory_type in this patch allow to distinguish
> between different type of ZONE_DEVICE.

Why don't you use __add_pages directly then?
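
Something along these lines, as a sketch only; the ZONE_DEVICE caller
already knows its node and zone, so it could bypass the arch hook, and the
exact details would depend on your patch 3:

        /* in devm_memremap_pages(), instead of arch_add_memory() */
        struct zone *zone = &NODE_DATA(nid)->node_zones[ZONE_DEVICE];

        mem_hotplug_begin();
        error = __add_pages(nid, zone, align_start >> PAGE_SHIFT,
                            align_size >> PAGE_SHIFT);
        mem_hotplug_done();

That way the arch never gets a chance to create a linear mapping for the
range.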
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [HMM 01/16] mm/memory/hotplug: add memory type parameter to arch_add/remove_memory
  2017-04-07 14:45         ` Michal Hocko
@ 2017-04-07 14:57           ` Jerome Glisse
  -1 siblings, 0 replies; 81+ messages in thread
From: Jerome Glisse @ 2017-04-07 14:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Dan Williams,
	Naoya Horiguchi, David Nellans, Russell King,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Martin Schwidefsky, Heiko Carstens, Yoshinori Sato, Rich Felker,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin

On Fri, Apr 07, 2017 at 04:45:04PM +0200, Michal Hocko wrote:
> On Fri 07-04-17 10:32:49, Jerome Glisse wrote:
> > On Fri, Apr 07, 2017 at 02:13:49PM +0200, Michal Hocko wrote:
> > > On Wed 05-04-17 16:40:11, Jérôme Glisse wrote:
> > > > When hotpluging memory we want more information on the type of memory.
> > > > This is to extend ZONE_DEVICE to support new type of memory other than
> > > > the persistent memory. Existing user of ZONE_DEVICE (persistent memory)
> > > > will be left un-modified.
> > > 
> > > My current hotplug rework [1] is touching this path as well. It is not
> > > really clear from the chage why you are changing this and what are the
> > > further expectations of MEMORY_DEVICE_PERSISTENT. Infact I have replaced
> > > for_device with want__memblock [2]. I plan to repost shortly but I would
> > > like to understand your modifications more to reduce potential conflicts
> > > in the code. Why do you need to distinguish different types of memory
> > > anyway.
> > > 
> > > [1] http://lkml.kernel.org/r/20170330115454.32154-1-mhocko@kernel.org
> > > [2] the current patchset is in git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git
> > >     branch attempts/rewrite-mem_hotplug-WIP
> > 
> > This is needed for UNADDRESSABLE memory type introduced in patch 3 and
> > the arch specific bits are in patch 4. Basicly for UNADDRESSABLE memory
> > i do not want the arch code to create a linear mapping for the range
> > being hotpluged. Adding memory_type in this patch allow to distinguish
> > between different type of ZONE_DEVICE.
> 
> Why don't you use __add_pages directly then?

That's a possibility; I wanted to keep the arch code in the loop in case
some arch wanted to do something specific. But it is unlikely to ever be
used outside x86, and I don't think we will want to do anything more than
skip the linear mapping.

Note however that for CDM, when doing it with ZONE_DEVICE (I am going to
post an RFC for that), the powerpc folks may want to know the memory type,
i.e. what kind of ZONE_DEVICE this is. I don't think they need it, but I am
not sure whether anything specific is needed for their next-gen POWER9 with
respect to device memory.
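
(Purely illustrative, with a hypothetical enumerator name that exists
nowhere in the series: if powerpc ever did need to tell CDM apart, it
would just be one more value plus a case in their arch hook, e.g.

        enum memory_type {
                MEMORY_NORMAL = 0,
                MEMORY_DEVICE_PERSISTENT,
                MEMORY_DEVICE_UNADDRESSABLE,    /* patch 3 */
                MEMORY_DEVICE_CACHE_COHERENT,   /* hypothetical, for CDM */
        };

handled much like MEMORY_DEVICE_PERSISTENT in arch/powerpc/mm/mem.c.)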


Andrew, if Michal thinks it is better not to use arch_add_memory directly
for my case, then I can respin the HMM patchset. Let me know.

(Patches 1 and 4 would be dropped; patches 3 and 14 would need updates, off
the top of my head.)

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [HMM 01/16] mm/memory/hotplug: add memory type parameter to arch_add/remove_memory
  2017-04-07 14:57           ` Jerome Glisse
@ 2017-04-07 15:11             ` Michal Hocko
  -1 siblings, 0 replies; 81+ messages in thread
From: Michal Hocko @ 2017-04-07 15:11 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Dan Williams,
	Naoya Horiguchi, David Nellans, Russell King,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Martin Schwidefsky, Heiko Carstens, Yoshinori Sato, Rich Felker,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin

On Fri 07-04-17 10:57:43, Jerome Glisse wrote:
> On Fri, Apr 07, 2017 at 04:45:04PM +0200, Michal Hocko wrote:
> > On Fri 07-04-17 10:32:49, Jerome Glisse wrote:
> > > On Fri, Apr 07, 2017 at 02:13:49PM +0200, Michal Hocko wrote:
> > > > On Wed 05-04-17 16:40:11, Jérôme Glisse wrote:
> > > > > When hotpluging memory we want more information on the type of memory.
> > > > > This is to extend ZONE_DEVICE to support new type of memory other than
> > > > > the persistent memory. Existing user of ZONE_DEVICE (persistent memory)
> > > > > will be left un-modified.
> > > > 
> > > > My current hotplug rework [1] is touching this path as well. It is not
> > > > really clear from the chage why you are changing this and what are the
> > > > further expectations of MEMORY_DEVICE_PERSISTENT. Infact I have replaced
> > > > for_device with want__memblock [2]. I plan to repost shortly but I would
> > > > like to understand your modifications more to reduce potential conflicts
> > > > in the code. Why do you need to distinguish different types of memory
> > > > anyway.
> > > > 
> > > > [1] http://lkml.kernel.org/r/20170330115454.32154-1-mhocko@kernel.org
> > > > [2] the current patchset is in git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git
> > > >     branch attempts/rewrite-mem_hotplug-WIP
> > > 
> > > This is needed for UNADDRESSABLE memory type introduced in patch 3 and
> > > the arch specific bits are in patch 4. Basicly for UNADDRESSABLE memory
> > > i do not want the arch code to create a linear mapping for the range
> > > being hotpluged. Adding memory_type in this patch allow to distinguish
> > > between different type of ZONE_DEVICE.
> > 
> > Why don't you use __add_pages directly then?
> 
> That's a possibility, i wanted to keep the arch code in the loop in case
> some arch wanted to do something specific. But it is unlikely to ever be
> use outside x86 and i don't think we will want to do anything more than
> skipping linear mapping.

Hmm, I am looking closer and x86 still updates max_pfn. Is this needed, or
are you guaranteed not to cross max_pfn?
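
For reference, the helper in question is the one visible in the diff
context above; roughly (quoting the 4.11-era x86-64 code from memory, so
double-check the details) it does:

        static void update_end_of_memory_vars(u64 start, u64 size)
        {
                unsigned long end_pfn = PFN_UP(start + size);

                if (end_pfn > max_pfn) {
                        max_pfn = end_pfn;
                        max_low_pfn = end_pfn;
                        high_memory = (void *)__va(end_pfn * PAGE_SIZE - 1) + 1;
                }
        }

so an unaddressable range above the current max_pfn would move it.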
 
> Note that however for CDM when doing it with ZONE_DEVICE (i am gonna
> post an RFC for that) maybe the powerpc folks will want to know the
> memory type ie what kind of ZONE_DEVICE this is. I don't think they need
> it but i am not sure if there is anything specific needed for their
> next gen power 9 in respect of device memory.

Well, I really want to get rid of anything zone-specific down in
arch_add_memory. So whatever they want to do with the zone, they will have
to do it after arch_add_memory.

> Andrew if Michal think it is better to not use arch_add_memory directly
> for my case than i can respin HMM patchset. Let me know.

Well, I can drop
https://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git/commit/?h=attempts/rewrite-mem_hotplug-WIP&id=bb1657d823a85bca045712467f517980650652ca
and arch_add_memory can take a memory type argument as you suggest. I am
just trying to understand what the expectations are here. If you only care
about x86, then it sounds like a bit too much to tweak all arches.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [HMM 01/16] mm/memory/hotplug: add memory type parameter to arch_add/remove_memory
  2017-04-07 15:11             ` Michal Hocko
@ 2017-04-07 16:10               ` Jerome Glisse
  -1 siblings, 0 replies; 81+ messages in thread
From: Jerome Glisse @ 2017-04-07 16:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Dan Williams,
	Naoya Horiguchi, David Nellans, Russell King,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Martin Schwidefsky, Heiko Carstens, Yoshinori Sato, Rich Felker,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin

On Fri, Apr 07, 2017 at 05:11:05PM +0200, Michal Hocko wrote:
> On Fri 07-04-17 10:57:43, Jerome Glisse wrote:
> > On Fri, Apr 07, 2017 at 04:45:04PM +0200, Michal Hocko wrote:
> > > On Fri 07-04-17 10:32:49, Jerome Glisse wrote:
> > > > On Fri, Apr 07, 2017 at 02:13:49PM +0200, Michal Hocko wrote:
> > > > > On Wed 05-04-17 16:40:11, Jérôme Glisse wrote:
> > > > > > When hotpluging memory we want more information on the type of memory.
> > > > > > This is to extend ZONE_DEVICE to support new type of memory other than
> > > > > > the persistent memory. Existing user of ZONE_DEVICE (persistent memory)
> > > > > > will be left un-modified.
> > > > > 
> > > > > My current hotplug rework [1] is touching this path as well. It is not
> > > > > really clear from the chage why you are changing this and what are the
> > > > > further expectations of MEMORY_DEVICE_PERSISTENT. Infact I have replaced
> > > > > for_device with want__memblock [2]. I plan to repost shortly but I would
> > > > > like to understand your modifications more to reduce potential conflicts
> > > > > in the code. Why do you need to distinguish different types of memory
> > > > > anyway.
> > > > > 
> > > > > [1] http://lkml.kernel.org/r/20170330115454.32154-1-mhocko@kernel.org
> > > > > [2] the current patchset is in git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git
> > > > >     branch attempts/rewrite-mem_hotplug-WIP
> > > > 
> > > > This is needed for UNADDRESSABLE memory type introduced in patch 3 and
> > > > the arch specific bits are in patch 4. Basicly for UNADDRESSABLE memory
> > > > i do not want the arch code to create a linear mapping for the range
> > > > being hotpluged. Adding memory_type in this patch allow to distinguish
> > > > between different type of ZONE_DEVICE.
> > > 
> > > Why don't you use __add_pages directly then?
> > 
> > That's a possibility, i wanted to keep the arch code in the loop in case
> > some arch wanted to do something specific. But it is unlikely to ever be
> > use outside x86 and i don't think we will want to do anything more than
> > skipping linear mapping.
> 
> Hmm, I am looking closer and x86 stil updates max_pfn. Is this needed
> or you are guaranteed to not cross the max_pfn?

No guarantee, so yes, I somewhat care about max_pfn. I do not care about
any of its existing users, last time I checked, but it might matter for
some new user.

>  
> > Note that however for CDM when doing it with ZONE_DEVICE (i am gonna
> > post an RFC for that) maybe the powerpc folks will want to know the
> > memory type ie what kind of ZONE_DEVICE this is. I don't think they need
> > it but i am not sure if there is anything specific needed for their
> > next gen power 9 in respect of device memory.
> 
> Well, I really want to get rid of anything zone specific down the
> arch_add_memory. So whatever they want to do with zone they will have to
> do it after arch_add_memory.

I am OK with that, except that the linear mapping setup is in
arch_add_memory and that is the thing I want to skip.

> 
> > Andrew if Michal think it is better to not use arch_add_memory directly
> > for my case than i can respin HMM patchset. Let me know.
> 
> Well, I can drop
> https://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git/commit/?h=attempts/rewrite-mem_hotplug-WIP&id=bb1657d823a85bca045712467f517980650652ca
> and arch_add_memory can have memory type argument as you suggest. I am
> just trying to understand what are the expectations here. If you only care
> about x86 then it sounds a bit too much to tweak all arches.

Well, I care about powerpc, and I know some arm folks are interested in HMM,
but it is hard to know when this will materialize. So in the long run I
expect more arches, but it might not happen soon.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [HMM 14/16] mm/hmm/devmem: device memory hotplug using ZONE_DEVICE
  2017-04-07  2:02       ` Jerome Glisse
@ 2017-04-07 16:26         ` Jerome Glisse
  -1 siblings, 0 replies; 81+ messages in thread
From: Jerome Glisse @ 2017-04-07 16:26 UTC (permalink / raw)
  To: Balbir Singh
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Dan Williams,
	Naoya Horiguchi, David Nellans, Evgeny Baskakov, Mark Hairgrove,
	Sherry Cheung, Subhash Gutti

On Thu, Apr 06, 2017 at 10:02:55PM -0400, Jerome Glisse wrote:
> On Fri, Apr 07, 2017 at 11:37:34AM +1000, Balbir Singh wrote:
> > On Wed, 2017-04-05 at 16:40 -0400, Jérôme Glisse wrote:
> > > This introduce a simple struct and associated helpers for device driver
> > > to use when hotpluging un-addressable device memory as ZONE_DEVICE. It
> > > will find a unuse physical address range and trigger memory hotplug for
> > > it which allocates and initialize struct page for the device memory.
> > > 
> > > Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> > > Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
> > > Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> > > Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
> > > Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
> > > Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
> > > ---
> > >  include/linux/hmm.h | 114 +++++++++++++++
> > >  mm/Kconfig          |   9 ++
> > >  mm/hmm.c            | 398 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> > >  3 files changed, 521 insertions(+)
> > > 
> > > +/*
> > > + * To add (hotplug) device memory, HMM assumes that there is no real resource
> > > + * that reserves a range in the physical address space (this is intended to be
> > > + * use by unaddressable device memory). It will reserve a physical range big
> > > + * enough and allocate struct page for it.
> > 
> > I've found that the implementation of this is quite non-portable, in that
> > starting from iomem_resource.end+1-size (which is effectively -size) on
> > my platform (powerpc) does not give expected results. It could be that
> > additional changes are needed to arch_add_memory() to support this
> > use case.
> 
> The CDM version does not use that part, that being said isn't -size a valid
> value we care only about unsigned here ? What is the end value on powerpc ?
> In any case this sounds more like a unsigned/signed arithmetic issue, i will
> look into it.
> 
> > 
> > > +
> > > +	size = ALIGN(size, SECTION_SIZE);
> > > +	addr = (iomem_resource.end + 1ULL) - size;
> > 
> > 
> > Why don't we allocate_resource() with the right constraints and get a new
> > unused region?
> 
> The issue with allocate_resource() is that it does scan the resource tree
> from lower address to higher ones. I was told that it was less likely to
> have hotplug issue conflict if i pick highest physicall address for the
> device memory hence why i do my own scan from the end toward the start.
> 
> Again all this function does not apply to PPC, it can be hidden behind
> x86 config if you prefer it.

OK, so I have looked into it and there is no arithmetic bug in my code; the
issue is simpler than that. It seems only x86 clamps iomem_resource.end to
MAX_PHYSMEM_BITS, so using allocate_resource() would just hide the issue.

It is fine not to clamp if you know that you won't get a resource with a
funky physical address, but in the UNADDRESSABLE case I do not get any
physical address, so I have to pick one, and I want to pick one that is
unlikely to cause trouble later on when someone hotplugs memory.

If we care about the UNADDRESSABLE case on powerpc, I see two ways to fix
this: clamp iomem_resource.end to MAX_PHYSMEM_BITS, or restrict my scan in
HMM to MIN(iomem_resource.end, 1UL << MAX_PHYSMEM_BITS). The latter is
probably safer and more bulletproof with respect to other arches getting
interested in this.
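
As a purely illustrative sketch of that clamped, top-down scan (not the HMM
patch itself): hmm_devmem_find_hole() is a hypothetical name, and it assumes
the same mm/hmm.c context as the hunk quoted above (CONFIG_SPARSEMEM, so
MAX_PHYSMEM_BITS and SECTION_SIZE are available).

#include <linux/ioport.h>
#include <linux/mm.h>

/* Sketch only: find an unused physical range below the clamped end. */
static resource_size_t hmm_devmem_find_hole(resource_size_t size)
{
	resource_size_t end, addr;

	/* Clamp to what the architecture can actually represent. */
	end = min_t(resource_size_t, iomem_resource.end,
		    (1ULL << MAX_PHYSMEM_BITS) - 1);

	size = ALIGN(size, SECTION_SIZE);
	if (end < size)
		return 0;

	for (addr = end + 1 - size; addr > iomem_resource.start; addr -= size) {
		/* Skip any range that already backs a real resource. */
		if (region_intersects(addr, size, IORESOURCE_MEM,
				      IORES_DESC_NONE) == REGION_DISJOINT)
			return addr;
		if (addr < size)
			break;	/* do not wrap below zero */
	}
	return 0;
}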

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [HMM 01/16] mm/memory/hotplug: add memory type parameter to arch_add/remove_memory
  2017-04-07 16:10               ` Jerome Glisse
@ 2017-04-07 16:37                 ` Michal Hocko
  -1 siblings, 0 replies; 81+ messages in thread
From: Michal Hocko @ 2017-04-07 16:37 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Dan Williams,
	Naoya Horiguchi, David Nellans, Russell King,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Martin Schwidefsky, Heiko Carstens, Yoshinori Sato, Rich Felker,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin

On Fri 07-04-17 12:10:00, Jerome Glisse wrote:
> On Fri, Apr 07, 2017 at 05:11:05PM +0200, Michal Hocko wrote:
> > On Fri 07-04-17 10:57:43, Jerome Glisse wrote:
> > > On Fri, Apr 07, 2017 at 04:45:04PM +0200, Michal Hocko wrote:
> > > > On Fri 07-04-17 10:32:49, Jerome Glisse wrote:
> > > > > On Fri, Apr 07, 2017 at 02:13:49PM +0200, Michal Hocko wrote:
> > > > > > On Wed 05-04-17 16:40:11, Jérôme Glisse wrote:
> > > > > > > When hotpluging memory we want more information on the type of memory.
> > > > > > > This is to extend ZONE_DEVICE to support new type of memory other than
> > > > > > > the persistent memory. Existing user of ZONE_DEVICE (persistent memory)
> > > > > > > will be left un-modified.
> > > > > > 
> > > > > > My current hotplug rework [1] is touching this path as well. It is not
> > > > > > really clear from the chage why you are changing this and what are the
> > > > > > further expectations of MEMORY_DEVICE_PERSISTENT. Infact I have replaced
> > > > > > for_device with want__memblock [2]. I plan to repost shortly but I would
> > > > > > like to understand your modifications more to reduce potential conflicts
> > > > > > in the code. Why do you need to distinguish different types of memory
> > > > > > anyway.
> > > > > > 
> > > > > > [1] http://lkml.kernel.org/r/20170330115454.32154-1-mhocko@kernel.org
> > > > > > [2] the current patchset is in git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git
> > > > > >     branch attempts/rewrite-mem_hotplug-WIP
> > > > > 
> > > > > This is needed for UNADDRESSABLE memory type introduced in patch 3 and
> > > > > the arch specific bits are in patch 4. Basicly for UNADDRESSABLE memory
> > > > > i do not want the arch code to create a linear mapping for the range
> > > > > being hotpluged. Adding memory_type in this patch allow to distinguish
> > > > > between different type of ZONE_DEVICE.
> > > > 
> > > > Why don't you use __add_pages directly then?
> > > 
> > > That's a possibility, i wanted to keep the arch code in the loop in case
> > > some arch wanted to do something specific. But it is unlikely to ever be
> > > use outside x86 and i don't think we will want to do anything more than
> > > skipping linear mapping.
> > 
> > Hmm, I am looking closer and x86 stil updates max_pfn. Is this needed
> > or you are guaranteed to not cross the max_pfn?
> 
> No guaranteed so yes i somewhat care about max_pfn, i do not care about
> any of its existing user last time i check but it might matter for some
> new user.

OK, then we can add add_pages() which would do __add_pages by default
(#ifndef ARCH_HAS_ADD_PAGES) and x86 would override it to also call
update_end_of_memory_vars. This sounds easier to me than updating all
the archs and adding something that most of them do not really care about.

But I will not insist. If you think that your approach is better I will
not object.

Btw. is your series reviewed and ready to be applied to the mm tree? I
planned to post mine on Monday, so I would like to know how we coordinate:
I rebase on top of yours, or vice versa.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [HMM 01/16] mm/memory/hotplug: add memory type parameter to arch_add/remove_memory
  2017-04-07 16:37                 ` Michal Hocko
@ 2017-04-07 17:10                   ` Jerome Glisse
  -1 siblings, 0 replies; 81+ messages in thread
From: Jerome Glisse @ 2017-04-07 17:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Dan Williams,
	Naoya Horiguchi, David Nellans, Russell King,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Martin Schwidefsky, Heiko Carstens, Yoshinori Sato, Rich Felker,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin

[-- Attachment #1: Type: text/plain, Size: 3789 bytes --]

On Fri, Apr 07, 2017 at 06:37:37PM +0200, Michal Hocko wrote:
> On Fri 07-04-17 12:10:00, Jerome Glisse wrote:
> > On Fri, Apr 07, 2017 at 05:11:05PM +0200, Michal Hocko wrote:
> > > On Fri 07-04-17 10:57:43, Jerome Glisse wrote:
> > > > On Fri, Apr 07, 2017 at 04:45:04PM +0200, Michal Hocko wrote:
> > > > > On Fri 07-04-17 10:32:49, Jerome Glisse wrote:
> > > > > > On Fri, Apr 07, 2017 at 02:13:49PM +0200, Michal Hocko wrote:
> > > > > > > On Wed 05-04-17 16:40:11, Jérôme Glisse wrote:
> > > > > > > > When hotpluging memory we want more information on the type of memory.
> > > > > > > > This is to extend ZONE_DEVICE to support new type of memory other than
> > > > > > > > the persistent memory. Existing user of ZONE_DEVICE (persistent memory)
> > > > > > > > will be left un-modified.
> > > > > > > 
> > > > > > > My current hotplug rework [1] is touching this path as well. It is not
> > > > > > > really clear from the chage why you are changing this and what are the
> > > > > > > further expectations of MEMORY_DEVICE_PERSISTENT. Infact I have replaced
> > > > > > > for_device with want__memblock [2]. I plan to repost shortly but I would
> > > > > > > like to understand your modifications more to reduce potential conflicts
> > > > > > > in the code. Why do you need to distinguish different types of memory
> > > > > > > anyway.
> > > > > > > 
> > > > > > > [1] http://lkml.kernel.org/r/20170330115454.32154-1-mhocko@kernel.org
> > > > > > > [2] the current patchset is in git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git
> > > > > > >     branch attempts/rewrite-mem_hotplug-WIP
> > > > > > 
> > > > > > This is needed for UNADDRESSABLE memory type introduced in patch 3 and
> > > > > > the arch specific bits are in patch 4. Basicly for UNADDRESSABLE memory
> > > > > > i do not want the arch code to create a linear mapping for the range
> > > > > > being hotpluged. Adding memory_type in this patch allow to distinguish
> > > > > > between different type of ZONE_DEVICE.
> > > > > 
> > > > > Why don't you use __add_pages directly then?
> > > > 
> > > > That's a possibility, i wanted to keep the arch code in the loop in case
> > > > some arch wanted to do something specific. But it is unlikely to ever be
> > > > use outside x86 and i don't think we will want to do anything more than
> > > > skipping linear mapping.
> > > 
> > > Hmm, I am looking closer and x86 stil updates max_pfn. Is this needed
> > > or you are guaranteed to not cross the max_pfn?
> > 
> > No guaranteed so yes i somewhat care about max_pfn, i do not care about
> > any of its existing user last time i check but it might matter for some
> > new user.
> 
> OK, then we can add add_pages() which would do __add_pages by default
> (#ifndef ARCH_HAS_ADD_PAGES) and x86 would override it do also call
> update_end_of_memory_vars. This sounds easier to me than updating all
> the archs and add something that most of them do not really care about.
> 
> But I will not insist. If you think that your approach is better I will
> not object.

Something like the attached patch?

> 
> Btw. is your series reviewed and ready to be applied to the mm tree? I
> planed to post mine on Monday so I would like to know how do we
> coordinate. I rebase on topo of yours or vice versa.

Well, the v18 core patches were reviewed by Mel, and I did include all of
his comments in v19 (I don't think I missed any). I think Dan still wants
to look at patches 1 and 3 for ZONE_DEVICE.

But I always welcome more review. I know Anshuman replied to this patch
to improve a comment. Balbir had an issue on powerpc because iomem_resource.end
isn't clamped to MAX_PHYSMEM_BITS, but that is all the review I got so far on v19.

I don't mind rebasing on top of your patchset. Whatever is easier for
Andrew, I guess.

Cheers,
Jérôme

[-- Attachment #2: 0001-mm-memory_hotplug-add-add_pages-hotplug-without-line.patch --]
[-- Type: text/plain, Size: 3278 bytes --]

From 7f414aef1e84c8ff65102e571f808f6362212350 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?J=C3=A9r=C3=B4me=20Glisse?= <jglisse@redhat.com>
Date: Fri, 7 Apr 2017 12:51:20 -0400
Subject: [PATCH] mm/memory_hotplug: add add_pages() hotplug without linear
 mapping
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

For some memory hotplug uses we do not want a linear mapping for the
hotplugged physical range. Add a new helper that just does __add_pages()
and other arch-specific bits if necessary.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 arch/x86/Kconfig      |  1 +
 arch/x86/mm/init_64.c | 17 ++++++++++++++++-
 mm/Kconfig            |  2 ++
 mm/memory_hotplug.c   | 18 ++++++++++++++++++
 4 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cc98d5a..4024fee 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -25,6 +25,7 @@ config X86_64
 	select ARCH_HAS_GIGANTIC_PAGE
 	select ARCH_SUPPORTS_INT128
 	select ARCH_USE_CMPXCHG_LOCKREF
+	select ARCH_HAS_ADD_PAGES
 	select HAVE_ARCH_SOFT_DIRTY
 	select MODULES_USE_ELF_RELA
 	select X86_DEV_DMA_OPS
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 15173d3..933032c 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -626,7 +626,7 @@ void __init paging_init(void)
  * After memory hotplug the variables max_pfn, max_low_pfn and high_memory need
  * updating.
  */
-static void  update_end_of_memory_vars(u64 start, u64 size)
+static void update_end_of_memory_vars(u64 start, u64 size)
 {
 	unsigned long end_pfn = PFN_UP(start + size);
 
@@ -662,6 +662,21 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
 }
 EXPORT_SYMBOL_GPL(arch_add_memory);
 
+int add_pages(int nid, struct zone *zone, unsigned long phys_start_pfn,
+		unsigned long nr_pages)
+{
+	int ret;
+
+	ret = __add_pages(nid, zone, phys_start_pfn, nr_pages);
+
+	if (!ret)
+		update_end_of_memory_vars(phys_start_pfn << PAGE_SHIFT,
+					  nr_pages << PAGE_SHIFT);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(add_pages);
+
 #define PAGE_INUSE 0xFD
 
 static void __meminit free_pagetable(struct page *page, int order)
diff --git a/mm/Kconfig b/mm/Kconfig
index 9b8fccb..d052ec1 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -707,3 +707,5 @@ config ARCH_USES_HIGH_VMA_FLAGS
 	bool
 config ARCH_HAS_PKEYS
 	bool
+config ARCH_HAS_ADD_PAGES
+	bool
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 295479b..bef772c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -576,6 +576,24 @@ int __ref __add_pages(int nid, struct zone *zone, unsigned long phys_start_pfn,
 }
 EXPORT_SYMBOL_GPL(__add_pages);
 
+#ifndef ARCH_HAS_ADD_PAGES
+int add_pages(int nid, struct zone *zone, unsigned long phys_start_pfn,
+		unsigned long nr_pages)
+{
+	int ret;
+
+	ret = __add_pages(nid, zone, phys_start_pfn, nr_pages);
+
+#ifdef CONFIG_X86_64
+	if (!ret)
+		update_end_of_memory_vars(phys_start_pfn << PAGE_SHIFT,
+					  nr_pages << PAGE_SHIFT);
+#endif
+	return ret;
+}
+EXPORT_SYMBOL_GPL(add_pages);
+#endif /* ARCH_HAS_ADD_PAGES */
+
 #ifdef CONFIG_MEMORY_HOTREMOVE
 /* find the smallest valid pfn in the range [start_pfn, end_pfn) */
 static int find_smallest_section_pfn(int nid, struct zone *zone,
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [HMM 01/16] mm/memory/hotplug: add memory type parameter to arch_add/remove_memory
  2017-04-07 17:10                   ` Jerome Glisse
@ 2017-04-07 17:59                     ` Michal Hocko
  -1 siblings, 0 replies; 81+ messages in thread
From: Michal Hocko @ 2017-04-07 17:59 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Dan Williams,
	Naoya Horiguchi, David Nellans, Russell King,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Martin Schwidefsky, Heiko Carstens, Yoshinori Sato, Rich Felker,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin

On Fri 07-04-17 13:10:59, Jerome Glisse wrote:
> On Fri, Apr 07, 2017 at 06:37:37PM +0200, Michal Hocko wrote:
> > On Fri 07-04-17 12:10:00, Jerome Glisse wrote:
[...]
> > > No guaranteed so yes i somewhat care about max_pfn, i do not care about
> > > any of its existing user last time i check but it might matter for some
> > > new user.
> > 
> > OK, then we can add add_pages() which would do __add_pages by default
> > (#ifndef ARCH_HAS_ADD_PAGES) and x86 would override it do also call
> > update_end_of_memory_vars. This sounds easier to me than updating all
> > the archs and add something that most of them do not really care about.
> > 
> > But I will not insist. If you think that your approach is better I will
> > not object.
> 
> Something like attached patch ?

No, I meant something like the diff below, but maybe even that is too
excessive.
 
> > 
> > Btw. is your series reviewed and ready to be applied to the mm tree? I
> > planed to post mine on Monday so I would like to know how do we
> > coordinate. I rebase on topo of yours or vice versa.
> 
> Well v18 core patches were review by Mel, i did include all of his comment
> in v19 (i don't think i did miss any). I think Dan still want to look at
> patch 1 and 3 for ZONE_DEVICE.
> 
> But i always welcome more review. I know Anshuman replied to this patch
> to improve a comments. Balbir had issue on powerpc because iomem_resource.end
> isn't clamped to MAX_PHYSMEM_BITS But that is all review i got so far on v19.
> 
> I don't mind rebasing on top of your patchset. What ever is easier for
> Andrew i guess.

Well, considering that my patchset is changing the behavior of the memory
hotplug core, I would prefer if it could go first and new users were added
on top. But I realize that you have been maintaining your series for a
_long_ time, so I would completely understand if you weren't impressed by
another rebase...

If you are OK with rebasing (and I will help you with that as much as I
can) I would be really grateful.

---
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 69188841717a..66e74928c2f0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2260,6 +2260,10 @@ config ARCH_ENABLE_MEMORY_HOTPLUG
 	def_bool y
 	depends on X86_64 || (X86_32 && HIGHMEM)
 
+config ARCH_HAS_ADD_PAGES
+	def_bool y
+	depends on X86_64 && ARCH_ENABLE_MEMORY_HOTPLUG
+
 config ARCH_ENABLE_MEMORY_HOTREMOVE
 	def_bool y
 	depends on MEMORY_HOTPLUG
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 754d47cb2847..ed1bb63d8f90 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -626,9 +626,9 @@ void __init paging_init(void)
  * After memory hotplug the variables max_pfn, max_low_pfn and high_memory need
  * updating.
  */
-static void  update_end_of_memory_vars(u64 start, u64 size)
+static void  update_end_of_memory_vars(u64 start_pfn, u64 nr_pages)
 {
-	unsigned long end_pfn = PFN_UP(start + size);
+	unsigned long end_pfn = start_pfn + nr_pages;
 
 	if (end_pfn > max_pfn) {
 		max_pfn = end_pfn;
@@ -637,22 +637,29 @@ static void  update_end_of_memory_vars(u64 start, u64 size)
 	}
 }
 
-int arch_add_memory(int nid, u64 start, u64 size, bool want_memblock)
+int add_pages(int nid, unsigned long start_pfn,
+	unsigned long nr_pages, bool want_memblock)
 {
-	unsigned long start_pfn = start >> PAGE_SHIFT;
-	unsigned long nr_pages = size >> PAGE_SHIFT;
 	int ret;
 
-	init_memory_mapping(start, start + size);
-
 	ret = __add_pages(nid, start_pfn, nr_pages, want_memblock);
 	WARN_ON_ONCE(ret);
 
 	/* update max_pfn, max_low_pfn and high_memory */
-	update_end_of_memory_vars(start, size);
+	update_end_of_memory_vars(start_pfn, nr_pages);
 
 	return ret;
 }
+
+int arch_add_memory(int nid, u64 start, u64 size, bool want_memblock)
+{
+	unsigned long start_pfn = start >> PAGE_SHIFT;
+	unsigned long nr_pages = size >> PAGE_SHIFT;
+
+	init_memory_mapping(start, start + size);
+
+	return add_pages(nid, start_pfn, nr_pages, want_memblock);
+}
 EXPORT_SYMBOL_GPL(arch_add_memory);
 
 #define PAGE_INUSE 0xFD
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index a9985f6c460a..a0973fc80e60 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -113,6 +113,14 @@ extern int __remove_pages(struct zone *zone, unsigned long start_pfn,
 extern int __add_pages(int nid, unsigned long start_pfn,
 	unsigned long nr_pages, bool want_memblock);
 
+#ifndef CONFIG_ARCH_HAS_ADD_PAGES
+static inline int add_pages(int nid, unsigned long start_pfn,
+	unsigned long nr_pages, bool want_memblock)
+{
+	return __add_pages(nid, start_pfn, nr_pages, want_memblock);
+}
+#endif
+
 #ifdef CONFIG_NUMA
 extern int memory_add_physaddr_to_nid(u64 start);
 #else
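
As a hedged sketch (not part of the diff above) of how a ZONE_DEVICE user
that must not get a kernel linear mapping could then hotplug its struct
pages via the proposed add_pages() instead of arch_add_memory():
devmem_add_pages() is a hypothetical wrapper, and it assumes the
include/linux/memory_hotplug.h change shown above plus the 4.11-era
hotplug locking helpers.

#include <linux/memory_hotplug.h>
#include <linux/mm.h>

/* Hypothetical caller-side wrapper, shown only to illustrate the split. */
static int devmem_add_pages(int nid, u64 start, u64 size)
{
	unsigned long start_pfn = start >> PAGE_SHIFT;
	unsigned long nr_pages = size >> PAGE_SHIFT;
	int ret;

	mem_hotplug_begin();
	/* No init_memory_mapping(), so no kernel linear mapping is created. */
	ret = add_pages(nid, start_pfn, nr_pages, false);
	mem_hotplug_done();

	return ret;
}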
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [HMM 01/16] mm/memory/hotplug: add memory type parameter to arch_add/remove_memory
  2017-04-07 17:59                     ` Michal Hocko
@ 2017-04-07 18:27                       ` Jerome Glisse
  -1 siblings, 0 replies; 81+ messages in thread
From: Jerome Glisse @ 2017-04-07 18:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Dan Williams,
	Naoya Horiguchi, David Nellans, Russell King,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Martin Schwidefsky, Heiko Carstens, Yoshinori Sato, Rich Felker,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin

On Fri, Apr 07, 2017 at 07:59:12PM +0200, Michal Hocko wrote:
> On Fri 07-04-17 13:10:59, Jerome Glisse wrote:
> > On Fri, Apr 07, 2017 at 06:37:37PM +0200, Michal Hocko wrote:
> > > On Fri 07-04-17 12:10:00, Jerome Glisse wrote:
> [...]
> > > > No guaranteed so yes i somewhat care about max_pfn, i do not care about
> > > > any of its existing user last time i check but it might matter for some
> > > > new user.
> > > 
> > > OK, then we can add add_pages() which would do __add_pages by default
> > > (#ifndef ARCH_HAS_ADD_PAGES) and x86 would override it do also call
> > > update_end_of_memory_vars. This sounds easier to me than updating all
> > > the archs and add something that most of them do not really care about.
> > > 
> > > But I will not insist. If you think that your approach is better I will
> > > not object.
> > 
> > Something like attached patch ?
> 
> No I meant something like the diff below but maybe even that is too
> excessive.

No, it looks good to me at least. But I am no authority there.


> > > Btw. is your series reviewed and ready to be applied to the mm tree? I
> > > planed to post mine on Monday so I would like to know how do we
> > > coordinate. I rebase on topo of yours or vice versa.
> > 
> > Well v18 core patches were review by Mel, i did include all of his comment
> > in v19 (i don't think i did miss any). I think Dan still want to look at
> > patch 1 and 3 for ZONE_DEVICE.
> > 
> > But i always welcome more review. I know Anshuman replied to this patch
> > to improve a comments. Balbir had issue on powerpc because iomem_resource.end
> > isn't clamped to MAX_PHYSMEM_BITS But that is all review i got so far on v19.
> > 
> > I don't mind rebasing on top of your patchset. What ever is easier for
> > Andrew i guess.
> 
> Well, considering that my patchset is changing the behavior of the core
> of the memory hotplug I would prefer if it could go first and add new
> user on top. But I realize that you are maintaining your series for a
> _long_ time so I would completely understand if you wouldn't be
> impressed by another rebase...
> 
> If you are OK with rebasing and I will help you with that as much as I
> can I would be really grateful.


I don't mind rebasing on top of your patchset after you post. This is a
minor change for me.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [HMM 14/16] mm/hmm/devmem: device memory hotplug using ZONE_DEVICE
  2017-04-07 16:26         ` Jerome Glisse
@ 2017-04-10  4:31           ` Balbir Singh
  -1 siblings, 0 replies; 81+ messages in thread
From: Balbir Singh @ 2017-04-10  4:31 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Dan Williams,
	Naoya Horiguchi, David Nellans, Evgeny Baskakov, Mark Hairgrove,
	Sherry Cheung, Subhash Gutti

On Fri, 2017-04-07 at 12:26 -0400, Jerome Glisse wrote:
> On Thu, Apr 06, 2017 at 10:02:55PM -0400, Jerome Glisse wrote:
> > On Fri, Apr 07, 2017 at 11:37:34AM +1000, Balbir Singh wrote:
> > > On Wed, 2017-04-05 at 16:40 -0400, Jérôme Glisse wrote:
> > > > This introduce a simple struct and associated helpers for device driver
> > > > to use when hotpluging un-addressable device memory as ZONE_DEVICE. It
> > > > will find a unuse physical address range and trigger memory hotplug for
> > > > it which allocates and initialize struct page for the device memory.
> > > > 
> > > > Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> > > > Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
> > > > Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> > > > Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
> > > > Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
> > > > Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
> > > > ---
> > > >  include/linux/hmm.h | 114 +++++++++++++++
> > > >  mm/Kconfig          |   9 ++
> > > >  mm/hmm.c            | 398 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > >  3 files changed, 521 insertions(+)
> > > > 
> > > > +/*
> > > > + * To add (hotplug) device memory, HMM assumes that there is no real resource
> > > > + * that reserves a range in the physical address space (this is intended to be
> > > > + * use by unaddressable device memory). It will reserve a physical range big
> > > > + * enough and allocate struct page for it.
> > > 
> > > I've found that the implementation of this is quite non-portable, in that
> > > starting from iomem_resource.end+1-size (which is effectively -size) on
> > > my platform (powerpc) does not give expected results. It could be that
> > > additional changes are needed to arch_add_memory() to support this
> > > use case.
> > 
> > The CDM version does not use that part, that being said isn't -size a valid
> > value we care only about unsigned here ? What is the end value on powerpc ?
> > In any case this sounds more like a unsigned/signed arithmetic issue, i will
> > look into it.
> > 

Thanks!

> > > 
> > > > +
> > > > +	size = ALIGN(size, SECTION_SIZE);
> > > > +	addr = (iomem_resource.end + 1ULL) - size;
> > > 
> > > 
> > > Why don't we allocate_resource() with the right constraints and get a new
> > > unused region?
> > 
> > The issue with allocate_resource() is that it does scan the resource tree
> > from lower address to higher ones. I was told that it was less likely to
> > have hotplug issue conflict if i pick highest physicall address for the
> > device memory hence why i do my own scan from the end toward the start.
> > 
> > Again all this function does not apply to PPC, it can be hidden behind
> > x86 config if you prefer it.
> 
> OK, so I have looked into it and there is no arithmetic bug in my code; the
> issue is simpler than that. It seems only x86 clamps iomem_resource.end to
> MAX_PHYSMEM_BITS, so using allocate_resource() would just hide the issue.
> 
> It is fine not to clamp if you know that you won't get a resource with a
> funky physical address, but in the UNADDRESSABLE case I do not get any
> physical address, so I have to pick one, and I want to pick one that is
> unlikely to cause trouble later on when someone hotplugs memory.
> 
> If we care about the UNADDRESSABLE case on powerpc I see two ways to fix
> this: clamp iomem_resource.end to MAX_PHYSMEM_BITS, or restrict my scan
> in hmm to MIN(iomem_resource.end, 1UL << MAX_PHYSMEM_BITS). The latter
> is probably safer and more bullet-proof with respect to other arches
> getting interested in this.
>

We do care about UNADDRESSABLE for certain platforms on powerpc.

I think MAX_PHYSMEM_BITS sounds good, or we can make it an arch hook. I spoke
to Michael Ellerman and he recommended we do either. We can't clamp down
iomem_resource.end in the arch, as we have other things beyond MAX_PHYSMEM_BITS,
but doing the walk in HMM from the end of MAX_PHYSMEM_BITS is a good starting
point (a rough sketch of that follows below).
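
A minimal sketch of that clamped downward scan (the helper name, the
region_intersects() check and the error handling are illustrative
assumptions, not the code from the patch; SECTION_SIZE here means the
1UL << PA_SECTION_SHIFT definition hmm.c uses):

/*
 * Sketch only: pick a SECTION_SIZE aligned, currently unused physical
 * range for unaddressable device memory.  Scan downward from whichever
 * is lower, iomem_resource.end + 1 or 1 << MAX_PHYSMEM_BITS, so that
 * resources living above MAX_PHYSMEM_BITS (as on powerpc) are never
 * considered.
 */
static resource_size_t devmem_pick_addr(unsigned long size)
{
	resource_size_t top, addr;

	size = ALIGN(size, SECTION_SIZE);
	top = min_t(resource_size_t, iomem_resource.end + 1ULL,
		    1ULL << MAX_PHYSMEM_BITS);

	for (addr = top - size; addr >= size && addr >= iomem_resource.start;
	     addr -= size) {
		/* REGION_DISJOINT means nothing else claims this range. */
		if (region_intersects(addr, size, 0, IORES_DESC_NONE) ==
		    REGION_DISJOINT)
			return addr;
	}
	return 0;	/* no suitable range found */
}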

Balbir Singh.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [HMM 10/16] mm/hmm/mirror: helper to snapshot CPU page table v2
  2017-04-05 20:40   ` Jérôme Glisse
@ 2017-04-10  8:35     ` Michal Hocko
  -1 siblings, 0 replies; 81+ messages in thread
From: Michal Hocko @ 2017-04-10  8:35 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Dan Williams,
	Naoya Horiguchi, David Nellans, Evgeny Baskakov, Mark Hairgrove,
	Sherry Cheung, Subhash Gutti

On Wed 05-04-17 16:40:20, Jérôme Glisse wrote:
> This does not use the existing page table walker because we want to share
> the same code with our page fault handler.

I am getting the following compilation errors with sparc32
allmodconfig. I haven't looked into them more closely yet.

mm/hmm.c: In function 'hmm_vma_walk_pmd':
mm/hmm.c:370:53: error: macro "pte_index" requires 2 arguments, but only 1 given
    unsigned long pfn = pmd_pfn(pmd) + pte_index(addr);
                                                     ^
mm/hmm.c:370:39: error: 'pte_index' undeclared (first use in this function)
    unsigned long pfn = pmd_pfn(pmd) + pte_index(addr);
                                       ^
mm/hmm.c:370:39: note: each undeclared identifier is reported only once for each function it appears in
mm/hmm.c: In function 'hmm_devmem_release':
mm/hmm.c:816:2: error: implicit declaration of function 'arch_remove_memory' [-Werror=implicit-function-declaration]
  arch_remove_memory(align_start, align_size, devmem->pagemap.type);
  ^
cc1: some warnings being treated as errors
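
If the pte_index() part is just a signature mismatch (the sparc32 macro takes
two arguments), one portable way to get the same value would be to derive the
page offset within the pmd from the address; a sketch under that assumption,
not a tested fix (the hmm_huge_pmd_pfn name is made up here):

/*
 * Sketch: same result as pmd_pfn(pmd) + pte_index(addr) for a huge pmd,
 * without calling pte_index().  (addr & ~PMD_MASK) >> PAGE_SHIFT is the
 * page offset of addr inside its pmd.
 */
static inline unsigned long hmm_huge_pmd_pfn(pmd_t pmd, unsigned long addr)
{
	return pmd_pfn(pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
}

The arch_remove_memory() error looks more like a configuration dependency:
presumably the declaration is only available with memory hotplug/hot-remove
enabled, which sparc32 does not provide.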
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [HMM 10/16] mm/hmm/mirror: helper to snapshot CPU page table v2
  2017-04-05 20:40   ` Jérôme Glisse
@ 2017-04-10  8:43     ` Michal Hocko
  -1 siblings, 0 replies; 81+ messages in thread
From: Michal Hocko @ 2017-04-10  8:43 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Dan Williams,
	Naoya Horiguchi, David Nellans, Evgeny Baskakov, Mark Hairgrove,
	Sherry Cheung, Subhash Gutti

There are more for alpha allmodconfig

=== Config /home/mhocko/work/build-test/configs/alpha/allmodconfig
mm/hmm.c: In function 'hmm_vma_walk_pmd':
mm/hmm.c:370:4: error: implicit declaration of function 'pmd_pfn' [-Werror=implicit-function-declaration]
    unsigned long pfn = pmd_pfn(pmd) + pte_index(addr);
    ^
mm/hmm.c:370:4: error: implicit declaration of function 'pte_index' [-Werror=implicit-function-declaration]
mm/hmm.c: In function 'hmm_devmem_radix_release':
mm/hmm.c:784:30: error: 'PA_SECTION_SHIFT' undeclared (first use in this function)
 #define SECTION_SIZE (1UL << PA_SECTION_SHIFT)
                              ^
mm/hmm.c:790:36: note: in expansion of macro 'SECTION_SIZE'
  align_start = resource->start & ~(SECTION_SIZE - 1);
                                    ^
mm/hmm.c:784:30: note: each undeclared identifier is reported only once for each function it appears in
 #define SECTION_SIZE (1UL << PA_SECTION_SHIFT)
                              ^
mm/hmm.c:790:36: note: in expansion of macro 'SECTION_SIZE'
  align_start = resource->start & ~(SECTION_SIZE - 1);
                                    ^
mm/hmm.c: In function 'hmm_devmem_release':
mm/hmm.c:784:30: error: 'PA_SECTION_SHIFT' undeclared (first use in this function)
 #define SECTION_SIZE (1UL << PA_SECTION_SHIFT)
                              ^
mm/hmm.c:812:36: note: in expansion of macro 'SECTION_SIZE'
  align_start = resource->start & ~(SECTION_SIZE - 1);
                                    ^
mm/hmm.c:816:2: error: implicit declaration of function 'arch_remove_memory' [-Werror=implicit-function-declaration]
  arch_remove_memory(align_start, align_size, devmem->pagemap.type);
  ^
mm/hmm.c: In function 'hmm_devmem_find':
mm/hmm.c:827:54: error: 'PA_SECTION_SHIFT' undeclared (first use in this function)
  return radix_tree_lookup(&hmm_devmem_radix, phys >> PA_SECTION_SHIFT);
                                                      ^
mm/hmm.c: In function 'hmm_devmem_pages_create':
mm/hmm.c:784:30: error: 'PA_SECTION_SHIFT' undeclared (first use in this function)
 #define SECTION_SIZE (1UL << PA_SECTION_SHIFT)
                              ^
mm/hmm.c:838:44: note: in expansion of macro 'SECTION_SIZE'
  align_start = devmem->resource->start & ~(SECTION_SIZE - 1);
                                            ^
In file included from ./include/linux/cache.h:4:0,
                 from ./include/linux/printk.h:8,
                 from ./include/linux/kernel.h:13,
                 from ./include/asm-generic/bug.h:13,
                 from ./arch/alpha/include/asm/bug.h:22,
                 from ./include/linux/bug.h:4,
                 from ./include/linux/mmdebug.h:4,
                 from ./include/linux/mm.h:8,
                 from mm/hmm.c:20:
mm/hmm.c: In function 'hmm_devmem_add':
mm/hmm.c:784:30: error: 'PA_SECTION_SHIFT' undeclared (first use in this function)
 #define SECTION_SIZE (1UL << PA_SECTION_SHIFT)
                              ^
./include/uapi/linux/kernel.h:10:47: note: in definition of macro '__ALIGN_KERNEL_MASK'
 #define __ALIGN_KERNEL_MASK(x, mask) (((x) + (mask)) & ~(mask))
                                               ^
./include/linux/kernel.h:49:22: note: in expansion of macro '__ALIGN_KERNEL'
 #define ALIGN(x, a)  __ALIGN_KERNEL((x), (a))
                      ^
mm/hmm.c:982:9: note: in expansion of macro 'ALIGN'
  size = ALIGN(size, SECTION_SIZE);
         ^
mm/hmm.c:982:21: note: in expansion of macro 'SECTION_SIZE'
  size = ALIGN(size, SECTION_SIZE);
                     ^
mm/hmm.c: In function 'hmm_devmem_find':
mm/hmm.c:828:1: warning: control reaches end of non-void function [-Wreturn-type]
 }
 ^
cc1: some warnings being treated as errors
make[1]: *** [mm/hmm.o] Error 1
make[1]: *** Waiting for unfinished jobs....
make: *** [mm] Error 2
make: *** Waiting for unfinished jobs....
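
Most of these look like missing build dependencies rather than new bugs:
PA_SECTION_SHIFT only exists under sparsemem, and arch_remove_memory() is
only declared with memory hot-remove support. One way to make that dependency
explicit, sketched as a compile-time check (normally this would be a Kconfig
"depends on"; CONFIG_HMM_DEVMEM is a hypothetical option name, not one from
the series):

/*
 * Sketch: fail the build early if the device memory hotplug helpers are
 * enabled on a configuration that lacks the facilities they rely on.
 */
#if IS_ENABLED(CONFIG_HMM_DEVMEM) && \
	(!defined(CONFIG_SPARSEMEM) || !defined(CONFIG_MEMORY_HOTREMOVE))
#error "HMM device memory hotplug requires SPARSEMEM and MEMORY_HOTREMOVE"
#endif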
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [HMM 10/16] mm/hmm/mirror: helper to snapshot CPU page table v2
  2017-04-10  8:43     ` Michal Hocko
@ 2017-04-10 22:10       ` Andrew Morton
  -1 siblings, 0 replies; 81+ messages in thread
From: Andrew Morton @ 2017-04-10 22:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jérôme Glisse, linux-kernel, linux-mm, John Hubbard,
	Dan Williams, Naoya Horiguchi, David Nellans, Evgeny Baskakov,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

On Mon, 10 Apr 2017 10:43:26 +0200 Michal Hocko <mhocko@kernel.org> wrote:

> There are more for alpha allmodconfig

HMM is rather a compile catastrophe, as was the earlier version I
merged.

Jerome, I'm thinking you need to install some cross-compilers!

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [HMM 10/16] mm/hmm/mirror: helper to snapshot CPU page table v2
  2017-04-10 22:10       ` Andrew Morton
@ 2017-04-11  1:33         ` Jerome Glisse
  -1 siblings, 0 replies; 81+ messages in thread
From: Jerome Glisse @ 2017-04-11  1:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, linux-kernel, linux-mm, John Hubbard, Dan Williams,
	Naoya Horiguchi, David Nellans, Evgeny Baskakov, Mark Hairgrove,
	Sherry Cheung, Subhash Gutti

> On Mon, 10 Apr 2017 10:43:26 +0200 Michal Hocko <mhocko@kernel.org> wrote:
> 
> > There are more for alpha allmodconfig
> 
> HMM is rather a compile catastrophe, as was the earlier version I
> merged.
> 
> Jerome, I'm thinking you need to install some cross-compilers!

Sorry about that.

I tested some architectures but obviously not all. In the v20 I did on top of
Michal's patchset I simply made everything x86-64 only. So if you revert
v19 and wait for Michal to finish his v3, I will post v20, which is x86-64
only and which I do build and use. At least from my discussion with Michal
I thought you were dropping v19 until Michal could finish his memory
hotplug rework.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [HMM 10/16] mm/hmm/mirror: helper to snapshot CPU page table v2
  2017-04-11  1:33         ` Jerome Glisse
@ 2017-04-11 20:33           ` Andrew Morton
  -1 siblings, 0 replies; 81+ messages in thread
From: Andrew Morton @ 2017-04-11 20:33 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Michal Hocko, linux-kernel, linux-mm, John Hubbard, Dan Williams,
	Naoya Horiguchi, David Nellans, Evgeny Baskakov, Mark Hairgrove,
	Sherry Cheung, Subhash Gutti

On Mon, 10 Apr 2017 21:33:51 -0400 (EDT) Jerome Glisse <jglisse@redhat.com> wrote:

> > On Mon, 10 Apr 2017 10:43:26 +0200 Michal Hocko <mhocko@kernel.org> wrote:
> > 
> > > There are more for alpha allmodconfig
> > 
> > HMM is rather a compile catastrophe, as was the earlier version I
> > merged.
> > 
> > Jerome, I'm thinking you need to install some cross-compilers!
> 
> Sorry about that.
> 
> I tested some architectures but obviously not all. In the v20 I did on top of
> Michal's patchset I simply made everything x86-64 only. So if you revert
> v19 and wait for Michal to finish his v3, I will post v20, which is x86-64
> only and which I do build and use. At least from my discussion with Michal
> I thought you were dropping v19 until Michal could finish his memory
> hotplug rework.

OK, I'll quietly drop the hmm series again for now.

^ permalink raw reply	[flat|nested] 81+ messages in thread

end of thread, other threads:[~2017-04-11 20:33 UTC | newest]

Thread overview: 81+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-05 20:40 [HMM 00/16] HMM (Heterogeneous Memory Management) v19 Jérôme Glisse
2017-04-05 20:40 ` Jérôme Glisse
2017-04-05 20:40 ` [HMM 01/16] mm/memory/hotplug: add memory type parameter to arch_add/remove_memory Jérôme Glisse
2017-04-05 20:40   ` Jérôme Glisse
2017-04-06  9:45   ` Anshuman Khandual
2017-04-06  9:45     ` Anshuman Khandual
2017-04-06 13:58     ` Jerome Glisse
2017-04-06 13:58       ` Jerome Glisse
2017-04-07 12:13   ` Michal Hocko
2017-04-07 12:13     ` Michal Hocko
2017-04-07 14:32     ` Jerome Glisse
2017-04-07 14:32       ` Jerome Glisse
2017-04-07 14:45       ` Michal Hocko
2017-04-07 14:45         ` Michal Hocko
2017-04-07 14:57         ` Jerome Glisse
2017-04-07 14:57           ` Jerome Glisse
2017-04-07 15:11           ` Michal Hocko
2017-04-07 15:11             ` Michal Hocko
2017-04-07 16:10             ` Jerome Glisse
2017-04-07 16:10               ` Jerome Glisse
2017-04-07 16:37               ` Michal Hocko
2017-04-07 16:37                 ` Michal Hocko
2017-04-07 17:10                 ` Jerome Glisse
2017-04-07 17:10                   ` Jerome Glisse
2017-04-07 17:59                   ` Michal Hocko
2017-04-07 17:59                     ` Michal Hocko
2017-04-07 18:27                     ` Jerome Glisse
2017-04-07 18:27                       ` Jerome Glisse
2017-04-05 20:40 ` [HMM 02/16] mm/put_page: move ZONE_DEVICE page reference decrement v2 Jérôme Glisse
2017-04-05 20:40   ` Jérôme Glisse
2017-04-05 20:40 ` [HMM 03/16] mm/unaddressable-memory: new type of ZONE_DEVICE for unaddressable memory Jérôme Glisse
2017-04-05 20:40   ` Jérôme Glisse
2017-04-05 20:40 ` [HMM 04/16] mm/ZONE_DEVICE/x86: add support for un-addressable device memory Jérôme Glisse
2017-04-05 20:40   ` Jérôme Glisse
2017-04-05 20:40 ` [HMM 05/16] mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY Jérôme Glisse
2017-04-05 20:40   ` Jérôme Glisse
2017-04-05 20:40 ` [HMM 06/16] mm/migrate: new memory migration helper for use with device memory v4 Jérôme Glisse
2017-04-05 20:40   ` Jérôme Glisse
2017-04-05 20:40 ` [HMM 07/16] mm/migrate: migrate_vma() unmap page from vma while collecting pages Jérôme Glisse
2017-04-05 20:40   ` Jérôme Glisse
2017-04-05 20:40 ` [HMM 08/16] mm/hmm: heterogeneous memory management (HMM for short) Jérôme Glisse
2017-04-05 20:40   ` Jérôme Glisse
2017-04-05 20:40 ` [HMM 09/16] mm/hmm/mirror: mirror process address space on device with HMM helpers Jérôme Glisse
2017-04-05 20:40   ` Jérôme Glisse
2017-04-05 20:40 ` [HMM 10/16] mm/hmm/mirror: helper to snapshot CPU page table v2 Jérôme Glisse
2017-04-05 20:40   ` Jérôme Glisse
2017-04-10  8:35   ` Michal Hocko
2017-04-10  8:35     ` Michal Hocko
2017-04-10  8:43   ` Michal Hocko
2017-04-10  8:43     ` Michal Hocko
2017-04-10 22:10     ` Andrew Morton
2017-04-10 22:10       ` Andrew Morton
2017-04-11  1:33       ` Jerome Glisse
2017-04-11  1:33         ` Jerome Glisse
2017-04-11 20:33         ` Andrew Morton
2017-04-11 20:33           ` Andrew Morton
2017-04-05 20:40 ` [HMM 11/16] mm/hmm/mirror: device page fault handler Jérôme Glisse
2017-04-05 20:40   ` Jérôme Glisse
2017-04-05 20:40 ` [HMM 12/16] mm/migrate: support un-addressable ZONE_DEVICE page in migration Jérôme Glisse
2017-04-05 20:40   ` Jérôme Glisse
2017-04-05 20:40 ` [HMM 13/16] mm/migrate: allow migrate_vma() to alloc new page on empty entry Jérôme Glisse
2017-04-05 20:40   ` Jérôme Glisse
2017-04-05 20:40 ` [HMM 14/16] mm/hmm/devmem: device memory hotplug using ZONE_DEVICE Jérôme Glisse
2017-04-05 20:40   ` Jérôme Glisse
2017-04-06 21:22   ` Jerome Glisse
2017-04-06 21:22     ` Jerome Glisse
2017-04-07  1:37   ` Balbir Singh
2017-04-07  1:37     ` Balbir Singh
2017-04-07  2:02     ` Jerome Glisse
2017-04-07  2:02       ` Jerome Glisse
2017-04-07 16:26       ` Jerome Glisse
2017-04-07 16:26         ` Jerome Glisse
2017-04-10  4:31         ` Balbir Singh
2017-04-10  4:31           ` Balbir Singh
2017-04-05 20:40 ` [HMM 15/16] mm/hmm/devmem: dummy HMM device for ZONE_DEVICE memory v2 Jérôme Glisse
2017-04-05 20:40   ` Jérôme Glisse
2017-04-05 20:40 ` [HMM 16/16] hmm: heterogeneous memory management documentation Jérôme Glisse
2017-04-05 20:40   ` Jérôme Glisse
2017-04-06  3:22 ` [HMM 00/16] HMM (Heterogeneous Memory Management) v19 Figo.zhang
2017-04-06  4:59   ` Jerome Glisse
2017-04-06  4:59     ` Jerome Glisse
