* [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
@ 2017-04-19  7:52 Balbir Singh
  2017-04-19  7:52 ` [RFC 1/4] mm: create N_COHERENT_MEMORY Balbir Singh
                   ` (6 more replies)
  0 siblings, 7 replies; 45+ messages in thread
From: Balbir Singh @ 2017-04-19  7:52 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: khandual, benh, aneesh.kumar, paulmck, srikar, haren, jglisse,
	mgorman, mhocko, arbab, vbabka, cl, Balbir Singh

This is a request for comments on the approaches for coherent
memory discussed at mm-summit (some of the details are at
https://lwn.net/Articles/717601/). The latest posted patch
series is at https://lwn.net/Articles/713035/. I am reposting
this as an RFC. Michal Hocko suggested using HMM for CDM, but
we believe there are stronger reasons to use the NUMA approach.
The earlier patches for coherent device memory were designed
and implemented by Anshuman Khandual.

Jerome posted HMM-CDM at https://lwn.net/Articles/713035/.
The patches do a great deal to enable CDM with HMM, but we
still believe that HMM with CDM is not a natural way to
represent coherent device memory and the mm will need
to be audited and enhanced for it to even work.

With HMM we'll see ZONE_DEVICE pages mapped into
user space, and that would mean a thorough audit of all code
paths to make sure we are ready for such a use case and can
enable it, as HMM-CDM patch 1 does by changing move_pages()
and the migration paths. I've done a quick evaluation of the
feature set and found limitations around migration (page cache
migration), fault handling to the right location (direct page
cache allocation in the coherent memory), mlock handling, RSS
accounting, memcg enforcement for pages not on the LRU, etc.

This series consists of 4 patches.

The first patch defines N_COHERENT_MEMORY and supports onlining of
N_COHERENT_MEMORY.  The second one enables marking of coherent
memory nodes in architecture-specific code. The third patch
changes mempolicy (MPOL_BIND and MPOL_PREFERRED) so that a node
can be explicitly specified for allocation. The fourth patch adds
documentation explaining the design and motivation behind
coherent memory. The primary motivation of these patches
is to avoid the allocator overhead that Mel Gorman had concerns
about, but mempolicy changes are required to allow explicit
specification of a node in the nodemask.

Introduction and design (taken from patch 4)

Introduction

CDM device memory is cache coherent with system memory, and we would like
this to show up as a NUMA node; however, there are certain algorithms
that might not currently be suitable for N_COHERENT_MEMORY:

1. AutoNUMA balancing
2. kswapd reclaim

The reason for exposing this device memory as NUMA is to simplify
the programming model, where memory allocation via malloc() or
mmap() for example would seamlessly work across both kinds of
memory. Since we expect the size of device memory to be smaller
than system RAM, we would like to control the allocation of such
memory. The proposed mechanism reuses nodemasks and explicit
specification of the coherent node in the nodemask for allocation
from device memory. This implementation also allows for kernel
level allocation via __GFP_THISNODE and existing techniques
such as page migration to work.
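
As an illustration of the user-space side (not part of this series), a
minimal sketch that explicitly places an anonymous mapping on the
coherent node; the node number (3) is an assumption for the example,
error handling is trimmed, and it needs numaif.h / -lnuma:

#include <numaif.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 1UL << 20;
	unsigned long mask = 1UL << 3;	/* assumed coherent node 3 */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;
	/* MPOL_BIND with only the coherent node in the nodemask */
	if (mbind(buf, len, MPOL_BIND, &mask, sizeof(mask) * 8, 0))
		return 1;
	memset(buf, 0, len);	/* first touch allocates on node 3 */
	return 0;
}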

Assumptions:

1. Nodes with N_COHERENT_MEMORY don't have CPUs on them, so
effectively they are CPUless memory nodes
2. Nodes with N_COHERENT_MEMORY are marked as movable_nodes.
Slub allocations from these nodes will fail otherwise.

Implementation Details

A new node state N_COHERENT_MEMORY is created. Each architecture
can then mark devices as being N_COHERENT_MEMORY and the implementation
makes sure this node set is disjoint from the N_MEMORY node state
nodes. A typical node zonelist (FALLBACK) with N_COHERENT_MEMORY would
be:

Assuming we have 2 nodes and 1 coherent memory node

Node1:	Node 1 --> Node 2

Node2:	Node 2 --> Node 1

Node3:	Node 3 --> Node 2 --> Node 1

This effectively means that allocations that have Node 1 and Node 2
in the nodemask will not allocate from Node 3. Allocations with
__GFP_THISNODE use the NOFALLBACK list and should allocate from Node 3,
if it is specified.  Since Node 3 has no CPUs, we don't expect any
default allocations occurring from it.

However, to support allocation from the coherent node, changes have been
made to mempolicy, specifically policy_nodemask() and policy_zonelist(),
such that (a small user-space sketch follows the list):

1. MPOL_BIND with the coherent node (Node 3 in the above example) will
not filter out N_COHERENT_MEMORY if any of the nodes in the nodemask
is in N_COHERENT_MEMORY
2. MPOL_PREFERRED will use the FALLBACK list of the coherent node (Node 3)
if a policy that specifies a preference to it is used.
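
A minimal user-space sketch of (2), assuming node 3 is the coherent
node (the node number is illustrative only; set_mempolicy() comes from
numaif.h / libnuma):

#include <numaif.h>

/* Prefer the (assumed) coherent node for this task's allocations */
static int prefer_coherent_node(int nid)
{
	unsigned long mask = 1UL << nid;

	return set_mempolicy(MPOL_PREFERRED, &mask, sizeof(mask) * 8);
}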

Limitations

A limitation of this approach is that in the future we may want more
granular control over which algorithms a node participates in: for
example, could we have N_COHERENT_MEMORY devices that want to participate
in AutoNUMA balancing, but not in kswapd reclaim, or vice versa? One way
to solve the problem would be to have tunables or to extend the notion of
N_COHERENT_MEMORY.

Using coherent memory is not compatible with cpusets, since cpusets
would enforce mems_allowed and mems_allowed will not contain the
coherent node. With numactl, for example, the user would have to use
"-a" to allow allocation from all nodes.

Coherent memory relies on the node being a movable_node, which is a
requirement for device memory anyway due to the need to hotplug it.

Review Recommendations

Michal Hocko/Mel Gorman for the approach and allocator bits
Vlastimil Babka/Christoph Lameter for the mempolicy changes.

Testing

I tested these patches in a virtual machine where I was able to simulate
coherent device memory. I had 3 normal NUMA nodes and one N_COHERENT_MEMORY
node. I ran mmtests with the config-global-dhp__pagealloc-performance config
and noted the numbers for the following tests in particular:
page_test, brk_test, exec_test and fork_test. Observations from these
tests:

1. page_test shows similar rates with and without coherent memory,
given the same number of nodes
2. brk_test was faster with coherent memory (3 NUMA, 1 COHERENT) compared
to 4 NUMA nodes, but had rates similar to the system with (3 NUMA, 0 COHERENT)
3. exec_test was a bit slower on the system with coherent memory compared
to a system with no coherent memory
4. fork_test was a bit slower on the system with coherent memory compared
to a system with no coherent memory

I also did some basic tests with numactl -a memhog with various membind
and preferred policies. I wrote a small kernel module to allocate
memory with __GFP_THISNODE and GFP_HIGHUSER_MOVABLE (for memory on the
coherent node).
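
For reference, a sketch along the lines of that test module; this is
not the actual module used, and the node id (3) is an assumption:

#include <linux/module.h>
#include <linux/gfp.h>
#include <linux/mm.h>

static struct page *page;

static int __init cdm_alloc_init(void)
{
	int nid = 3;	/* assumed coherent device memory node */

	/* __GFP_THISNODE: fail rather than fall back to other nodes */
	page = alloc_pages_node(nid, GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);
	if (!page)
		return -ENOMEM;

	pr_info("allocated a page on node %d\n", page_to_nid(page));
	return 0;
}

static void __exit cdm_alloc_exit(void)
{
	if (page)
		__free_pages(page, 0);
}

module_init(cdm_alloc_init);
module_exit(cdm_alloc_exit);
MODULE_LICENSE("GPL");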

Balbir Singh (4):
  mm: create N_COHERENT_MEMORY
  arch/powerpc/mm: add support for coherent memory
  mm: Integrate N_COHERENT_MEMORY with mempolicy and the rest of the
    system
  linux/mm: Add documentation for coherent memory

 Documentation/memory-hotplug.txt     | 11 +++++++
 Documentation/vm/00-INDEX            |  2 ++
 Documentation/vm/coherent-memory.txt | 59 ++++++++++++++++++++++++++++++++++++
 arch/powerpc/mm/numa.c               |  8 +++++
 drivers/base/memory.c                |  3 ++
 drivers/base/node.c                  |  2 ++
 include/linux/memory_hotplug.h       |  1 +
 include/linux/nodemask.h             |  1 +
 mm/memory_hotplug.c                  |  8 +++--
 mm/mempolicy.c                       | 30 ++++++++++++++++--
 mm/page_alloc.c                      | 20 +++++++++---
 11 files changed, 136 insertions(+), 9 deletions(-)
 create mode 100644 Documentation/vm/coherent-memory.txt

-- 
2.9.3


* [RFC 1/4] mm: create N_COHERENT_MEMORY
  2017-04-19  7:52 [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion) Balbir Singh
@ 2017-04-19  7:52 ` Balbir Singh
  2017-04-27 18:42   ` Reza Arbab
  2017-04-19  7:52 ` [RFC 2/4] arch/powerpc/mm: add support for coherent memory Balbir Singh
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 45+ messages in thread
From: Balbir Singh @ 2017-04-19  7:52 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: khandual, benh, aneesh.kumar, paulmck, srikar, haren, jglisse,
	mgorman, mhocko, arbab, vbabka, cl, Balbir Singh

The idea of coherent memory has been defined in earlier
RFCs and patchsets; in particular, https://lwn.net/Articles/704403/
has the details. This patch has a summary of the intentions
and implementation. The earlier patches were designed
and implemented by Anshuman Khandual.

A coherent memory device is a NUMA node: yes, it's non-uniform
memory access and also non-uniform memory attributes :) New hardware
has the capability to allow for coherency between device memory
and CPU memory. This memory is visible as a part of system memory,
but its attributes are different. The debate is about how we expose
this memory so that the programming model stays simple. HMM provides
a similar approach, but since it targets hardware without coherence
it cannot make things as simple as exposing the memory as a NUMA node.

In this patch we create N_COHERENT_MEMORY, which is different
from N_MEMORY. A node hotplugged as coherent memory will have
this state set. The expectation then is that this memory gets
onlined like regular nodes. Memory allocation from such nodes
occurs only when the node is contained explicitly in the
mask.

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
---
 Documentation/memory-hotplug.txt | 13 +++++++++++++
 drivers/base/memory.c            |  3 +++
 drivers/base/node.c              |  2 ++
 include/linux/memory_hotplug.h   |  1 +
 include/linux/nodemask.h         |  1 +
 mm/memory_hotplug.c              |  5 ++++-
 6 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt
index 670f3de..26736d8 100644
--- a/Documentation/memory-hotplug.txt
+++ b/Documentation/memory-hotplug.txt
@@ -298,6 +298,19 @@ available memory will be increased.
 Currently, newly added memory is added as ZONE_NORMAL (for powerpc, ZONE_DMA).
 This may be changed in future.
 
+% echo online_coherent > /sys/devices/system/memory/memoryXXX/state
+
+After this memory is onlined, same as "echo online" above, except that the node
+is marked as N_COHERENT_MEMORY and it is not a part of N_MEMORY. Effectively
+it means that this node is not a part of any node zonelist, except itself.
+Ideally N_COHERENT_MEMORY nodes have no cpus on them.
+
+A user space program can use numactl with -a to allocate on this node with
+an explicit node specification. From the kernel, one may use __GFP_THISNODE
+with the node specified and alloc_pages_node() to allocate.
+
+NOTE: This node will not show up in mems_allowed and will not work with
+cpusets in general.
 
 
 ------------------------
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index cc4f1d0..9a96c6e 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -323,6 +323,8 @@ store_mem_state(struct device *dev,
 		online_type = MMOP_ONLINE_KERNEL;
 	else if (sysfs_streq(buf, "online_movable"))
 		online_type = MMOP_ONLINE_MOVABLE;
+	else if (sysfs_streq(buf, "online_coherent"))
+		online_type = MMOP_ONLINE_COHERENT;
 	else if (sysfs_streq(buf, "online"))
 		online_type = MMOP_ONLINE_KEEP;
 	else if (sysfs_streq(buf, "offline"))
@@ -345,6 +347,7 @@ store_mem_state(struct device *dev,
 	case MMOP_ONLINE_KERNEL:
 	case MMOP_ONLINE_MOVABLE:
 	case MMOP_ONLINE_KEEP:
+	case MMOP_ONLINE_COHERENT:
 		mem->online_type = online_type;
 		ret = device_online(&mem->dev);
 		break;
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 5548f96..6bfdfd6 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -660,6 +660,7 @@ static struct node_attr node_state_attr[] = {
 #ifdef CONFIG_MOVABLE_NODE
 	[N_MEMORY] = _NODE_ATTR(has_memory, N_MEMORY),
 #endif
+	[N_COHERENT_MEMORY] = _NODE_ATTR(has_coherent_memory, N_COHERENT_MEMORY),
 	[N_CPU] = _NODE_ATTR(has_cpu, N_CPU),
 };
 
@@ -673,6 +674,7 @@ static struct attribute *node_state_attrs[] = {
 #ifdef CONFIG_MOVABLE_NODE
 	&node_state_attr[N_MEMORY].attr.attr,
 #endif
+	&node_state_attr[N_COHERENT_MEMORY].attr.attr,
 	&node_state_attr[N_CPU].attr.attr,
 	NULL
 };
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 134a2f6..aa927aa 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -33,6 +33,7 @@ enum {
 	MMOP_ONLINE_KEEP,
 	MMOP_ONLINE_KERNEL,
 	MMOP_ONLINE_MOVABLE,
+	MMOP_ONLINE_COHERENT,
 };
 
 /*
diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index f746e44..037e34a 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -393,6 +393,7 @@ enum node_states {
 	N_MEMORY = N_HIGH_MEMORY,
 #endif
 	N_CPU,		/* The node has one or more cpus */
+	N_COHERENT_MEMORY,	/* The node has cache coherent device memory */
 	NR_NODE_STATES
 };
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b63d7d1..ebeb3af 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1149,7 +1149,10 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 	pgdat_resize_unlock(zone->zone_pgdat, &flags);
 
 	if (onlined_pages) {
-		node_states_set_node(nid, &arg);
+		if (online_type == MMOP_ONLINE_COHERENT)
+			node_set_state(nid, N_COHERENT_MEMORY);
+		else
+			node_states_set_node(nid, &arg);
 		if (need_zonelists_rebuild)
 			build_all_zonelists(NULL, NULL);
 		else
-- 
2.9.3


* [RFC 2/4] arch/powerpc/mm: add support for coherent memory
  2017-04-19  7:52 [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion) Balbir Singh
  2017-04-19  7:52 ` [RFC 1/4] mm: create N_COHERENT_MEMORY Balbir Singh
@ 2017-04-19  7:52 ` Balbir Singh
  2017-04-19  7:52 ` [RFC 3/4] mm: Integrate N_COHERENT_MEMORY with mempolicy and the rest of the system Balbir Singh
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 45+ messages in thread
From: Balbir Singh @ 2017-04-19  7:52 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: khandual, benh, aneesh.kumar, paulmck, srikar, haren, jglisse,
	mgorman, mhocko, arbab, vbabka, cl, Balbir Singh

Add support for N_COHERENT_MEMORY by marking nodes compatible
with ibm,coherent-device-memory as coherent nodes. The code
sets N_COHERENT_MEMORY before the system has had a chance to
set N_MEMORY.

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
---
 arch/powerpc/mm/numa.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 371792e..c977de8 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -652,6 +652,7 @@ static void __init parse_drconf_memory(struct device_node *memory)
 	unsigned long lmb_size, base, size, sz;
 	int nid;
 	struct assoc_arrays aa = { .arrays = NULL };
+	int coherent = 0;
 
 	n = of_get_drconf_memory(memory, &dm);
 	if (!n)
@@ -696,6 +697,10 @@ static void __init parse_drconf_memory(struct device_node *memory)
 				size = read_n_cells(n_mem_size_cells, &usm);
 			}
 			nid = of_drconf_to_nid_single(&drmem, &aa);
+			coherent = of_device_is_compatible(memory,
+					"ibm,coherent-device-memory");
+			if (coherent)
+				node_set_state(nid, N_COHERENT_MEMORY);
 			fake_numa_create_new_node(
 				((base + size) >> PAGE_SHIFT),
 					   &nid);
@@ -713,6 +718,7 @@ static int __init parse_numa_properties(void)
 	struct device_node *memory;
 	int default_nid = 0;
 	unsigned long i;
+	int coherent = 0;
 
 	if (numa_enabled == 0) {
 		printk(KERN_WARNING "NUMA disabled by user\n");
@@ -785,6 +791,10 @@ static int __init parse_numa_properties(void)
 
 		fake_numa_create_new_node(((start + size) >> PAGE_SHIFT), &nid);
 		node_set_online(nid);
+		coherent = of_device_is_compatible(memory,
+				"ibm,coherent-device-memory");
+		if (coherent)
+			node_set_state(nid, N_COHERENT_MEMORY);
 
 		size = numa_enforce_memory_limit(start, size);
 		if (size)
-- 
2.9.3


* [RFC 3/4] mm: Integrate N_COHERENT_MEMORY with mempolicy and the rest of the system
  2017-04-19  7:52 [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion) Balbir Singh
  2017-04-19  7:52 ` [RFC 1/4] mm: create N_COHERENT_MEMORY Balbir Singh
  2017-04-19  7:52 ` [RFC 2/4] arch/powerpc/mm: add support for coherent memory Balbir Singh
@ 2017-04-19  7:52 ` Balbir Singh
  2017-04-19  7:52 ` [RFC 4/4] mm: Add documentation for coherent memory Balbir Singh
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 45+ messages in thread
From: Balbir Singh @ 2017-04-19  7:52 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: khandual, benh, aneesh.kumar, paulmck, srikar, haren, jglisse,
	mgorman, mhocko, arbab, vbabka, cl, Balbir Singh

This patch integrates N_COHERENT_MEMORY more deeply with the rest of
the system. It does the following:

1. Modifies mempolicy so as to
	a. Allow policy_zonelist() and policy_nodemask() to
	   understand N_COHERENT_MEMORY nodes and allow the
	   right mask/list to be built when the policy contains
	   those nodes
	b. Checks for N_COHERENT_MEMORY in mpol_new_nodemask()
	   and other places with hard-coded checks for N_MEMORY
2. Modifies mm/page_alloc.c, so that nodes marked as N_COHERENT_MEMORY
   are not marked as N_MEMORY
3. Changes node zonelist creation, so that coherent memory is
   present in the fallback in case multiple such nodes are
   present.

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
---
 mm/memory_hotplug.c |  3 ++-
 mm/mempolicy.c      | 31 ++++++++++++++++++++++++++++---
 mm/page_alloc.c     | 21 +++++++++++++++++----
 3 files changed, 47 insertions(+), 8 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index ebeb3af..12d5431 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1037,7 +1037,8 @@ static void node_states_set_node(int node, struct memory_notify *arg)
 	if (arg->status_change_nid_high >= 0)
 		node_set_state(node, N_HIGH_MEMORY);
 
-	node_set_state(node, N_MEMORY);
+	if (!node_state(node, N_COHERENT_MEMORY))
+		node_set_state(node, N_MEMORY);
 }
 
 bool zone_can_shift(unsigned long pfn, unsigned long nr_pages,
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 37d0b33..141398e 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -217,6 +217,8 @@ static int mpol_set_nodemask(struct mempolicy *pol,
 		     const nodemask_t *nodes, struct nodemask_scratch *nsc)
 {
 	int ret;
+	int n;
+	nodemask_t tmp;
 
 	/* if mode is MPOL_DEFAULT, pol is NULL. This is right. */
 	if (pol == NULL)
@@ -226,6 +228,14 @@ static int mpol_set_nodemask(struct mempolicy *pol,
 		  cpuset_current_mems_allowed, node_states[N_MEMORY]);
 
 	VM_BUG_ON(!nodes);
+
+	for_each_node_mask(n, *nodes) {
+		if (node_state(n, N_COHERENT_MEMORY)) {
+			tmp = nodemask_of_node(n);
+			nodes_or(nsc->mask1, nsc->mask1, tmp);
+		}
+	}
+
 	if (pol->mode == MPOL_PREFERRED && nodes_empty(*nodes))
 		nodes = NULL;	/* explicit local allocation */
 	else {
@@ -1435,7 +1445,8 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, maxnode,
 		goto out_put;
 	}
 
-	if (!nodes_subset(*new, node_states[N_MEMORY])) {
+	if (!nodes_subset(*new, node_states[N_MEMORY]) &&
+		!nodes_subset(*new, node_states[N_COHERENT_MEMORY])) {
 		err = -EINVAL;
 		goto out_put;
 	}
@@ -1670,7 +1681,9 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
 	/* Lower zones don't get a nodemask applied for MPOL_BIND */
 	if (unlikely(policy->mode == MPOL_BIND) &&
 			apply_policy_zone(policy, gfp_zone(gfp)) &&
-			cpuset_nodemask_valid_mems_allowed(&policy->v.nodes))
+			(cpuset_nodemask_valid_mems_allowed(&policy->v.nodes) ||
+			nodes_intersects(policy->v.nodes,
+				node_states[N_COHERENT_MEMORY])))
 		return &policy->v.nodes;
 
 	return NULL;
@@ -1691,6 +1704,17 @@ static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy,
 		WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
 	}
 
+	/*
+	 * It is not sufficient to have the right nodemask, we need the
+	 * correct zonelist for N_COHERENT_MEMORY
+	 */
+	if (node_state(nd, N_COHERENT_MEMORY))
+		/*
+		 * Ideally we should pick the best node, but for now use
+		 * any one
+		 */
+		nd = first_node(node_states[N_COHERENT_MEMORY]);
+
 	return node_zonelist(nd, gfp);
 }
 
@@ -2689,7 +2713,8 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
 		*nodelist++ = '\0';
 		if (nodelist_parse(nodelist, nodes))
 			goto out;
-		if (!nodes_subset(nodes, node_states[N_MEMORY]))
+		if (!nodes_subset(nodes, node_states[N_MEMORY]) &&
+			!nodes_subset(nodes, node_states[N_COHERENT_MEMORY]))
 			goto out;
 	} else
 		nodes_clear(nodes);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e2c687d..59e4d30 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4856,6 +4856,7 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask)
 	int min_val = INT_MAX;
 	int best_node = NUMA_NO_NODE;
 	const struct cpumask *tmp = cpumask_of_node(0);
+	nodemask_t tmp_mask, tmp_mask2;
 
 	/* Use the local node if we haven't already */
 	if (!node_isset(node, *used_node_mask)) {
@@ -4863,7 +4864,17 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask)
 		return node;
 	}
 
-	for_each_node_state(n, N_MEMORY) {
+	tmp_mask = node_states[N_MEMORY];
+	tmp_mask2 = node_states[N_COHERENT_MEMORY];
+
+	/*
+	 * If the nodemask has one coherent node, add others
+	 * as well
+	 */
+	if (node_state(node, N_COHERENT_MEMORY))
+		nodes_or(tmp_mask, tmp_mask2, tmp_mask);
+
+	for_each_node_mask(n, tmp_mask) {
 
 		/* Don't want a node to appear more than once */
 		if (node_isset(n, *used_node_mask))
@@ -6288,7 +6299,7 @@ static unsigned long __init early_calculate_totalpages(void)
 		unsigned long pages = end_pfn - start_pfn;
 
 		totalpages += pages;
-		if (pages)
+		if (pages && !node_state(nid, N_COHERENT_MEMORY))
 			node_set_state(nid, N_MEMORY);
 	}
 	return totalpages;
@@ -6598,9 +6609,11 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
 				find_min_pfn_for_node(nid), NULL);
 
 		/* Any memory on that node */
-		if (pgdat->node_present_pages)
+		if (pgdat->node_present_pages &&
+			!node_state(nid, N_COHERENT_MEMORY)) {
 			node_set_state(nid, N_MEMORY);
-		check_for_memory(pgdat, nid);
+			check_for_memory(pgdat, nid);
+		}
 	}
 }
 
-- 
2.9.3


* [RFC 4/4] mm: Add documentation for coherent memory
  2017-04-19  7:52 [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion) Balbir Singh
                   ` (2 preceding siblings ...)
  2017-04-19  7:52 ` [RFC 3/4] mm: Integrate N_COHERENT_MEMORY with mempolicy and the rest of the system Balbir Singh
@ 2017-04-19  7:52 ` Balbir Singh
  2017-04-19 19:02 ` [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion) Christoph Lameter
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 45+ messages in thread
From: Balbir Singh @ 2017-04-19  7:52 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: khandual, benh, aneesh.kumar, paulmck, srikar, haren, jglisse,
	mgorman, mhocko, arbab, vbabka, cl, Balbir Singh

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
---
 Documentation/vm/00-INDEX            |  2 ++
 Documentation/vm/coherent-memory.txt | 59 ++++++++++++++++++++++++++++++++++++
 2 files changed, 61 insertions(+)
 create mode 100644 Documentation/vm/coherent-memory.txt

diff --git a/Documentation/vm/00-INDEX b/Documentation/vm/00-INDEX
index 11d3d8d..99175e9 100644
--- a/Documentation/vm/00-INDEX
+++ b/Documentation/vm/00-INDEX
@@ -6,6 +6,8 @@ balance
 	- various information on memory balancing.
 cleancache.txt
 	- Intro to cleancache and page-granularity victim cache.
+coherent-memory.txt
+	- Introduction to coherent memory handling (N_COHERENT_MEMORY)
 frontswap.txt
 	- Outline frontswap, part of the transcendent memory frontend.
 highmem.txt
diff --git a/Documentation/vm/coherent-memory.txt b/Documentation/vm/coherent-memory.txt
new file mode 100644
index 0000000..bd60e5b
--- /dev/null
+++ b/Documentation/vm/coherent-memory.txt
@@ -0,0 +1,59 @@
+Introduction
+
+This document describes a new type of node called N_COHERENT_MEMORY.
+This memory is cache coherent with system memory and we would like
+this to show up as a NUMA node, however there are certain algorithms
+that might not be currently suitable for N_COHERENT_MEMORY
+
+1. AutoNUMA balancing
+2. kswapd reclaim
+
+The reason for exposing this device memory as NUMA is to simplify
+the programming model, where memory allocation via malloc() or
+mmap() for example would seamlessly work across both kinds of
+memory. Since we expect the size of device memory to be smaller
+than system RAM, we would like to control the allocation of such
+memory. The proposed mechanism reuses nodemasks and explicit
+specification of the coherent node in the nodemask for allocation
+from device memory. This implementation also allows for kernel
+level allocation via __GFP_THISNODE and existing techniques
+such as page migration to work.
+
+Assumptions:
+
+1. Nodes with N_COHERENT_MEMORY don't have CPUs on them, so
+effectively they are CPUless memory nodes
+2. Nodes with N_COHERENT_MEMORY are marked as movable_nodes.
+Slub allocations from these nodes will fail otherwise.
+
+Implementation Details
+
+A new node state N_COHERENT_MEMORY is created. Each architecture
+can then mark devices as being N_COHERENT_MEMORY and the implementation
+makes sure this node set is disjoint from the N_MEMORY node state
+nodes. A typical node zonelist (FALLBACK) with N_COHERENT_MEMORY would
+be:
+
+Assuming we have 2 nodes and 1 coherent memory node
+
+Node1:	Node 1 --> Node 2
+
+Node2:	Node 2 --> Node 1
+
+Node3:	Node 3 --> Node 2 --> Node 1
+
+This effectively means that allocations that have Node 1 and Node 2
+in the nodemask will not allocate from Node 3. Allocations with __GFP_THISNODE
+use the NOFALLBACK list and should allocate from Node 3, if it
+is specified.  Since Node 3 has no CPUs, we don't expect any default
+allocations occurring from it.
+
+However to support allocation from the coherent node, changes have been
+made to mempolicy, specifically policy_nodemask() and policy_zonelist()
+such that
+
+1. MPOL_BIND with the coherent node (Node 3 in the above example) will
+not filter out N_COHERENT_MEMORY if any of the nodes in the nodemask
+is in N_COHERENT_MEMORY
+2. MPOL_PREFERRED will use the FALLBACK list of the coherent node (Node 3)
+if a policy that specifies a preference to it is used.
-- 
2.9.3


* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-04-19  7:52 [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion) Balbir Singh
                   ` (3 preceding siblings ...)
  2017-04-19  7:52 ` [RFC 4/4] mm: Add documentation for coherent memory Balbir Singh
@ 2017-04-19 19:02 ` Christoph Lameter
  2017-04-20  1:25   ` Balbir Singh
  2017-05-01 20:41 ` John Hubbard
  2017-05-02 14:36 ` Michal Hocko
  6 siblings, 1 reply; 45+ messages in thread
From: Christoph Lameter @ 2017-04-19 19:02 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, akpm, khandual, benh, aneesh.kumar, paulmck, srikar,
	haren, jglisse, mgorman, mhocko, arbab, vbabka

On Wed, 19 Apr 2017, Balbir Singh wrote:

> The first patch defines N_COHERENT_MEMORY and supports onlining of
> N_COHERENT_MEMORY.  The second one enables marking of coherent

The name is confusing. All other NUMA nodes are coherent. Can we name this
in some way that describes what is special about these nodes?

And we already have support for memory only nodes. Why is that not sufficient?
If you can answer that question then we may get to the term to be used to
name these nodes. We also have support for hotplug memory. How does the
memory here differ from hotplug?

 > memory nodes in architecture specific code, the third patch
> enables mempolicy MPOL_BIND and MPOL_PREFERRED changes to
> explicitly specify a node for allocation. The fourth patch adds

Huh? MPOL_PREFERRED already allows specifying a node.
MPOL_BIND requires a set of nodes. ??

> 1. Nodes with N_COHERENT_MEMORY don't have CPUs on them, so
> effectively they are CPUless memory nodes
> 2. Nodes with N_COHERENT_MEMORY are marked as movable_nodes.
> Slub allocations from these nodes will fail otherwise.

Isn't that what hotpluggable nodes do already?

> 1. MPOL_BIND with the coherent node (Node 3 in the above example) will
> not filter out N_COHERENT_MEMORY if any of the nodes in the nodemask
> is in N_COHERENT_MEMORY
> 2. MPOL_PREFERRED will use the FALLBACK list of the coherent node (Node 3)
> if a policy that specifies a preference to it is used.

So this means that "coherent" nodes need a different fallback
mechanism? Something like an ISOLATED_NODE or something?

The approach sounds pretty invasive to me. Can we first clarify what
features you need and develop terminology that describes things in terms
of a view from the Linux MM perspective? Coherent memory is nothing
special from there. It is special from the perspective of offload devices
that have heretofore not offered that. So it's mainly a marketing term. We
need something descriptive here.


* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-04-19 19:02 ` [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion) Christoph Lameter
@ 2017-04-20  1:25   ` Balbir Singh
  2017-04-20 15:29     ` Christoph Lameter
  0 siblings, 1 reply; 45+ messages in thread
From: Balbir Singh @ 2017-04-20  1:25 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, akpm, khandual, benh, aneesh.kumar, paulmck, srikar,
	haren, jglisse, mgorman, mhocko, arbab, vbabka

On Wed, 2017-04-19 at 14:02 -0500, Christoph Lameter wrote:
> On Wed, 19 Apr 2017, Balbir Singh wrote:
> 
> > The first patch defines N_COHERENT_MEMORY and supports onlining of
> > N_COHERENT_MEMORY.  The second one enables marking of coherent
> 
> The name is confusing. All other NUMA nodes are coherent. Can we name this
> in some way that describes what is special about these nodes?
> 
> And we already have support for memory only nodes. Why is that not sufficient?
> If you can answer that question then we may get to the term to be used to
> name these nodes. We also have support for hotplug memory. How does the
> memory here differ from hotplug?
> 
>  > memory nodes in architecture specific code, the third patch
> > enables mempolicy MPOL_BIND and MPOL_PREFERRED changes to
> > explicitly specify a node for allocation. The fourth patch adds
> 
> Huh? MPOL_PREFERRED already allows specifying a node.
> MPOL_BIND requires a set of nodes. ??

Wording issue; I meant that the changes support explicit
specification of the coherent memory node for allocation.

> 
> > 1. Nodes with N_COHERENT_MEMORY don't have CPUs on them, so
> > effectively they are CPUless memory nodes
> > 2. Nodes with N_COHERENT_MEMORY are marked as movable_nodes.
> > Slub allocations from these nodes will fail otherwise.
> 
> Isnt that what hotpluggable nodes do already?

Yes, and we need that for coherent device memory as well.

> 
> > 1. MPOL_BIND with the coherent node (Node 3 in the above example) will
> > not filter out N_COHERENT_MEMORY if any of the nodes in the nodemask
> > is in N_COHERENT_MEMORY
> > 2. MPOL_PREFERRED will use the FALLBACK list of the coherent node (Node 3)
> > if a policy that specifies a preference to it is used.
> 
> So this means that "Coherent" nodes means that you need a different
> fallback mechanism? Something like a ISOLATED_NODE or something?

A couple of things are needed:

1. Isolation of allocation
2. Isolation of certain algorithms like kswapd/auto-numa balancing

There are some notes on (2) in the limitations section as well.

> 
> The approach sounds pretty invasive to me.

Could you please elaborate? Do you mean the user space programming bits?


> Can we first clarify what
> features you need and develop terminology that describes things in terms
> of a view from the Linux MM perspective?

Ideally we need the following:

1. Transparency about being able to allocate memory anywhere and the ability
to migrate memory between coherent device memory and normal system memory
2. The ability to explicitly allocate memory from coherent device memory
3. Isolation of normal allocations from coherent device memory unless
explicitly stated, same as (2) above
4. The ability to hotplug in and out the memory at run-time
5. Exchange pointers between coherent device memory and normal memory
for the compute on the coherent device memory to use

I could list further things, but largely coherent device memory is like
system memory except that we believe that things like auto-numa balancing
and kswapd will not work well due to lack of information about references
and faults.

Some of the mm-summit notes are at https://lwn.net/Articles/717601/
The goals align with HMM, except that the device memory is coherent. HMM
has a CDM variation as well.

> Coherent memory is nothing
> special from there. It is special from the perspective of offload devices
> that have heretofore not offered that. So its mainly a marketing term. We
> need something descriptive here.
> 

We've been using the term coherent device memory (CDM). I could rephrase the
text and documentation for consistency. Would you prefer a different term?

Thanks for the review!
Balbir Singh.


* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-04-20  1:25   ` Balbir Singh
@ 2017-04-20 15:29     ` Christoph Lameter
  2017-04-20 21:26       ` Benjamin Herrenschmidt
  2017-04-24  0:20       ` Balbir Singh
  0 siblings, 2 replies; 45+ messages in thread
From: Christoph Lameter @ 2017-04-20 15:29 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, akpm, khandual, benh, aneesh.kumar, paulmck, srikar,
	haren, jglisse, mgorman, mhocko, arbab, vbabka

On Thu, 20 Apr 2017, Balbir Singh wrote:
> Couple of things are needed
>
> 1. Isolation of allocation

cgroups, memory policy and cpuset provide that

> 2. Isolation of certain algorithms like kswapd/auto-numa balancing

Ok that may mean adding some generic functionality to limit those

> > The approach sounds pretty invasive to me.
>
> Could you please elaborate, you mean the user space programming bits?

No I mean the modification of the memory policies in particular. We are
adding more exceptions to an already complex and fragile system.

Can we do this in a generic way just using hotplug nodes and some of the
existing isolation mechanisms?


> Ideally we need the following:
>
> 1. Transparency about being able to allocate memory anywhere and the ability
> to migrate memory between coherent device memory and normal system memory

If it is a memory node then you have that already.

> 2. The ability to explictly allocate memory from coherent device memory

Ditto

> 3. Isolation of normal allocations from coherent device memory unless
> explictly stated, same as (2) above

memory policies etc do that.

> 4. The ability to hotplug in and out the memory at run-time

hotplug code does that.


> 5. Exchange pointers between coherent device memory and normal memory
> for the compute on the coherent device memory to use

I don't see anything preventing that from occurring right now. That's a
device issue with doing proper virtual to physical mapping, right?

> I could list further things, but largely coherent device memory is like
> system memory except that we believe that things like auto-numa balancing
> and kswapd will not work well due to lack of information about references
> and faults.

Ok so far I do not see that we need coherent nodes at all.

> Some of the mm-summit notes are at https://lwn.net/Articles/717601/
> The goals align with HMM, except that the device memory is coherent. HMM
> has a CDM variation as well.

I was at the presentation but at that point you were interested in a
different approach it seems.

> We've been using the term coherent device memory (CDM). I could rephrase the
> text and documentation for consistency. Would you prefer a different term?

Hotplug memory node?


* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-04-20 15:29     ` Christoph Lameter
@ 2017-04-20 21:26       ` Benjamin Herrenschmidt
  2017-04-21 16:13         ` Christoph Lameter
  2017-04-24  0:20       ` Balbir Singh
  1 sibling, 1 reply; 45+ messages in thread
From: Benjamin Herrenschmidt @ 2017-04-20 21:26 UTC (permalink / raw)
  To: Christoph Lameter, Balbir Singh
  Cc: linux-mm, akpm, khandual, aneesh.kumar, paulmck, srikar, haren,
	jglisse, mgorman, mhocko, arbab, vbabka

On Thu, 2017-04-20 at 10:29 -0500, Christoph Lameter wrote:
> On Thu, 20 Apr 2017, Balbir Singh wrote:
> > Couple of things are needed
> > 
> > 1. Isolation of allocation
> 
> cgroups, memory policy and cpuset provide that

Can these be configured appropriately by the accelerator or GPU driver
at the point where it hot-plugs the memory?

The problem is we need to ensure there is no window in which the kernel
will start putting things like skb's etc... in there.

My original idea was to cover the whole thing with a CMA, which helps
with the case where the user wants to use the "legacy" APIs of manually
controlling the allocations on the GPU since in that case, the
user/driver might need to do fairly large contiguous allocations.

I was told there are some plumbing issues with having a bunch of CMAs
around though.

Basically the whole debate at the moment revolves around whether to use
HMM/CDM/ZONE_DEVICE vs. making it just a NUMA nodes with a sprinkle of
added foo.

The former approach pretty clearly puts that device into a separate
category and keeps most of the VM stuff at bay. However, it has a
number of disadvantage. ZONE_DEVICE was meant for providing struct
pages & DAX etc... for things like flash storage, "new memory" etc....

What we have here is effectively a bit more like a NUMA node, whose
processing unit is just not a CPU but a GPU or some kind of
accelerator.

The difference boils down to how we want to use it. We want any page,
anonymous memory, mapped file, you name it... to be able to migrate
back and forth depending on which piece of HW is most actively
accessing it. This is helped by a bunch of things such as very fast DMA
engines to facilitate migration, and HW counter to detect when parts of
that memory are accessed "remotely" (and thus request migrations).

So the NUMA model fits reasonably well, with that memory being overall
treated normally. The ZONE_DEVICE model on the other hand creates those
"special" pages which require a pile of special casing in all sort of
places as Balbir has mentioned, with still a bunch of rather standard
stuff not working with them.

However, we do need to address a few quirks, which is what this is
about.

Mostly we want to keep kernel allocations away from it, in part because
the memory is more prone to fail and not terribly fast for direct CPU
access, in part because we want to maximize the availability of it for
dedicated applications.

I find it clumsy (and racy) to require establishing policies from
userspace after it's been instantiated. At least for that isolation
mechanism.

Other things are possibly more realistic to do that way, such as taking
KSM and AutoNuma off the picture for it.

Cheers,
Ben.


* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-04-20 21:26       ` Benjamin Herrenschmidt
@ 2017-04-21 16:13         ` Christoph Lameter
  2017-04-21 21:15           ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 45+ messages in thread
From: Christoph Lameter @ 2017-04-21 16:13 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Balbir Singh, linux-mm, akpm, khandual, aneesh.kumar, paulmck,
	srikar, haren, jglisse, mgorman, mhocko, arbab, vbabka

On Fri, 21 Apr 2017, Benjamin Herrenschmidt wrote:

> On Thu, 2017-04-20 at 10:29 -0500, Christoph Lameter wrote:
> > On Thu, 20 Apr 2017, Balbir Singh wrote:
> > > Couple of things are needed
> > >
> > > 1. Isolation of allocation
> >
> > cgroups, memory policy and cpuset provide that
>
> Can these be configured appropriately by the accelerator or GPU driver
> at the point where it hot plugs the memory ?

A driver could be able to setup a memory policy. Sure.

> The problem is we need to ensure there is no window in which the kernel
> will start putting things like skb's etc... in there.

skbs are not put into user space pages. They are unmovable and thus
hotplugged memory will not be used.

> Basically the whole debate at the moment revolves around whether to use
> HMM/CDM/ZONE_DEVICE vs. making it just a NUMA nodes with a sprinkle of
> added foo.

I think the memory hotplug idea should be making this easy to do. Not
much rigging around needed.

> What we have here is effectively a bit more like a NUMA node, whose
> processing unit is just not a CPU but a GPU or some kind of
> accelerator.

It's like a memory-only node. That is a common use case for NUMA nodes
(HP has made use of memory-only nodes at a large scale).

> The difference boils down to how we want to use is. We want any page,
> anonymous memory, mapped file, you name it... to be able to migrate
> back and forth depending on which piece of HW is most actively
> accessing it. This is helped by a bunch of things such as very fast DMA
> engines to facilitate migration, and HW counter to detect when parts of
> that memory are accessed "remotely" (and thus request migrations).

Well that migration can even be done from userspace. See the
migrate_pages() syscall.
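
(As an aside, a minimal sketch of driving such a migration from user
space with migrate_pages(2); the node numbers 0 and 3 are assumptions
for the example, and it needs numaif.h / -lnuma:)

#include <numaif.h>
#include <sys/types.h>

/* Move a task's pages from (assumed) node 0 to the coherent node 3 */
static long move_to_coherent(pid_t pid)
{
	unsigned long from = 1UL << 0;
	unsigned long to = 1UL << 3;

	return migrate_pages(pid, sizeof(from) * 8, &from, &to);
}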

> So the NUMA model fits reasonably well, with that memory being overall
> treated normally. The ZONE_DEVICE model on the other hand creates those
> "special" pages which require a pile of special casing in all sort of
> places as Balbir has mentioned, with still a bunch of rather standard
> stuff not working with them.

Right.

> However, we do need to address a few quirks, which is what this is
> about.
>
> Mostly we want to keep kernel allocations away from it, in part because
> the memory is more prone to fail and not terribly fast for direct CPU
> access, in part because we want to maximize the availability of it for
> dedicated applications.

Hotplugged memory contains only movable pages. This means kernel
allocations do not occur there. You are fine.

> Other things are possibly more realistic to do that way, such as taking
> KSM and AutoNuma off the picture for it.

Well just pinning those pages or mlocking those will stop these scans.


* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-04-21 16:13         ` Christoph Lameter
@ 2017-04-21 21:15           ` Benjamin Herrenschmidt
  2017-04-24 13:57             ` Christoph Lameter
  0 siblings, 1 reply; 45+ messages in thread
From: Benjamin Herrenschmidt @ 2017-04-21 21:15 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Balbir Singh, linux-mm, akpm, khandual, aneesh.kumar, paulmck,
	srikar, haren, jglisse, mgorman, mhocko, arbab, vbabka

On Fri, 2017-04-21 at 11:13 -0500, Christoph Lameter wrote:
> > Other things are possibly more realistic to do that way, such as
> > taking
> > KSM and AutoNuma off the picture for it.
> 
> Well just pinning those pages or mlocking those will stop these
> scans.

But that will stop migration too :-) These are mostly policy
adjustments; we need to look at other options here.

Cheers,
Ben.


* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-04-20 15:29     ` Christoph Lameter
  2017-04-20 21:26       ` Benjamin Herrenschmidt
@ 2017-04-24  0:20       ` Balbir Singh
  2017-04-24 14:00         ` Christoph Lameter
  1 sibling, 1 reply; 45+ messages in thread
From: Balbir Singh @ 2017-04-24  0:20 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, akpm, khandual, benh, aneesh.kumar, paulmck, srikar,
	haren, jglisse, mgorman, mhocko, arbab, vbabka

On Thu, 2017-04-20 at 10:29 -0500, Christoph Lameter wrote:
> On Thu, 20 Apr 2017, Balbir Singh wrote:
> > Couple of things are needed
> > 
> > 1. Isolation of allocation
> 
> cgroups, memory policy and cpuset provide that
> 

Yes, and we are building on top of mempolicies. The problem becomes a little
worse when the coherent device memory node is seen as a CPUless node. I
was trying to solve 1 and 2 with the same approach.

> > 2. Isolation of certain algorithms like kswapd/auto-numa balancing
> 
> Ok that may mean adding some generic functionality to limit those

As in per-algorithm tunables? I think it would be definitely good to have
that. I do not know how well that would scale?

> 
> > > The approach sounds pretty invasive to me.
> > 
> > Could you please elaborate, you mean the user space programming bits?
> 
> No I mean the modification of the memory policies in particular. We are
> adding more exceptions to an already complex and fragile system.
> 
> Can we do this in a generic way just using hotplug nodes and some of the
> existing isolation mechanisms?
>

Yes, that was the first approach we tried and we are reusing whatever
we can -- HMM for driver driven migration, mempolicies for allocation
control and N_COHERENT_MEMORY for isolation because of 1 and 2 above
combined.
 
> 
> > Ideally we need the following:
> > 
> > 1. Transparency about being able to allocate memory anywhere and the ability
> > to migrate memory between coherent device memory and normal system memory
> 
> If it is a memory node then you have that already.
> 
> > 2. The ability to explictly allocate memory from coherent device memory
> 
> Ditto
> 
> > 3. Isolation of normal allocations from coherent device memory unless
> > explictly stated, same as (2) above
> 
> memory policies etc do that.
> 
> > 4. The ability to hotplug in and out the memory at run-time
> 
> hotplug code does that.
> 
> 
> > 5. Exchange pointers between coherent device memory and normal memory
> > for the compute on the coherent device memory to use

> 
> I dont see anything preventing that from occurring right now. Thats a
> device issue with doing proper virtual to physical mapping right?
> 

Some of these requirements come from whether we use NUMA or HMM-CDM.
We prefer NUMA and it meets the above requirements quite well.

> > I could list further things, but largely coherent device memory is like
> > system memory except that we believe that things like auto-numa balancing
> > and kswapd will not work well due to lack of information about references
> > and faults.
> 
> Ok so far I do not see that we need coherent nodes at all.
>

I presume you are suggesting this based on the fact that we add additional
infrastructure for auto-numa/kswapd/etc isolation?
 
> > Some of the mm-summit notes are at https://lwn.net/Articles/717601/
> > The goals align with HMM, except that the device memory is coherent. HMM
> > has a CDM variation as well.
> 
> I was at the presentation but at that point you were interested in a
> different approach it seems.

I do remember you were present; I don't think things have changed since then.

> 
> > We've been using the term coherent device memory (CDM). I could rephrase the
> > text and documentation for consistency. Would you prefer a different term?
> 
> Hotplug memory node?
> 

Normal memory is hotpluggable too, but I'd be fine as long as everyone agrees.

Balbir


* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-04-21 21:15           ` Benjamin Herrenschmidt
@ 2017-04-24 13:57             ` Christoph Lameter
  0 siblings, 0 replies; 45+ messages in thread
From: Christoph Lameter @ 2017-04-24 13:57 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Balbir Singh, linux-mm, akpm, khandual, aneesh.kumar, paulmck,
	srikar, haren, jglisse, mgorman, mhocko, arbab, vbabka

On Sat, 22 Apr 2017, Benjamin Herrenschmidt wrote:

> On Fri, 2017-04-21 at 11:13 -0500, Christoph Lameter wrote:
> > > Other things are possibly more realistic to do that way, such as
> > > taking
> > > KSM and AutoNuma off the picture for it.
> >
> > Well just pinning those pages or mlocking those will stop these
> > scans.
>
> But that will stop migration too :-) These are mostly policy
> adjustement, we need to look at other options here.

Well yes that probably means some sort of policy layer that allows the
exclusion of certain nodes from KSM and AutoNUMA.


* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-04-24  0:20       ` Balbir Singh
@ 2017-04-24 14:00         ` Christoph Lameter
  2017-04-25  0:52           ` Balbir Singh
  0 siblings, 1 reply; 45+ messages in thread
From: Christoph Lameter @ 2017-04-24 14:00 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, akpm, khandual, benh, aneesh.kumar, paulmck, srikar,
	haren, jglisse, mgorman, mhocko, arbab, vbabka

On Mon, 24 Apr 2017, Balbir Singh wrote:

> > cgroups, memory policy and cpuset provide that
> >
>
> Yes and we are building on top of mempolicies. The problem becomes a little
> worse when the coherent device memory node is seen as CPUless node. I
> was trying to solve 1 and 2 with the same approach.

Well, I think having the ability to restrict autonuma/ksm per node may also
be useful for other things, like running regular processes on node 0 and
running low-latency stuff on node 1 that should not be interrupted. Right
now you cannot do that.

> > > 2. Isolation of certain algorithms like kswapd/auto-numa balancing
> >
> > Ok that may mean adding some generic functionality to limit those
>
> As in per-algorithm tunables? I think it would be definitely good to have
> that. I do not know how well that would scale?

From what I can see, it should not be too difficult to implement a node
mask constraining those activities.

> Some of these requirements come from whether we use NUMA or HMM-CDM.
> We prefer NUMA and it meets the above requirements quite well.

Great.


* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-04-24 14:00         ` Christoph Lameter
@ 2017-04-25  0:52           ` Balbir Singh
  0 siblings, 0 replies; 45+ messages in thread
From: Balbir Singh @ 2017-04-25  0:52 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, akpm, khandual, benh, aneesh.kumar, paulmck, srikar,
	haren, jglisse, mgorman, mhocko, arbab, vbabka

On Mon, 2017-04-24 at 09:00 -0500, Christoph Lameter wrote:
> On Mon, 24 Apr 2017, Balbir Singh wrote:
> 
> > > cgroups, memory policy and cpuset provide that
> > > 
> > 
> > Yes and we are building on top of mempolicies. The problem becomes a little
> > worse when the coherent device memory node is seen as CPUless node. I
> > was trying to solve 1 and 2 with the same approach.
> 
> Well I think having the ability to restrict autonuma/ksm per node may also
> be useful for other things. Like running regular processes on node 0 and
> running low latency stuff on  node 1 that should not be interrupted. Right
> now you cannot do that.
> 

I presume it also means differential allocation (applications allocating
on this node will be different) and isolation of allocation. Would you like
to restrict allocations from nodes? The one difference we have is that
coherent device memory

a. probably has compute on it which is not visible directly to the system
b. shows up as a CPUless node

From a solution perspective, today all these daemons work off of
N_MEMORY. Without going too deep and speculating, one approach could
be to create N_ISOLATED_MEMORY with tunables for each set of algorithms.

I did a quick grep and got the following list of N_MEMORY-dependent
code paths:

1. kcompactd
2. bootmem huge pages
3. memcg reclaim (soft limit)
4. mempolicy
5. migrate
6. kswapd

Which reminds me that I should fix 5 in my patchset :). For KSM I found
merge_across_nodes; I presume some of the isolation across nodes can be
achieved using it, and then by applications not using madvise MADV_MERGEABLE?

Would N_COHERENT_MEMORY meet your needs? Maybe we could call it
N_ISOLATED_MEMORY and then add tunables per algorithm?



> > > > 2. Isolation of certain algorithms like kswapd/auto-numa balancing
> > > 
> > > Ok that may mean adding some generic functionality to limit those
> > 
> > As in per-algorithm tunables? I think it would be definitely good to have
> > that. I do not know how well that would scale?
> 
> From what I can see it should not be too difficult to implement a node
> mask constraining those activities.
> 
> > Some of these requirements come from whether we use NUMA or HMM-CDM.
> > We prefer NUMA and it meets the above requirements quite well.
> 
> Great.
>

Thanks

Balbir Singh. 


* Re: [RFC 1/4] mm: create N_COHERENT_MEMORY
  2017-04-19  7:52 ` [RFC 1/4] mm: create N_COHERENT_MEMORY Balbir Singh
@ 2017-04-27 18:42   ` Reza Arbab
  2017-04-28  5:07     ` Balbir Singh
  0 siblings, 1 reply; 45+ messages in thread
From: Reza Arbab @ 2017-04-27 18:42 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, akpm, khandual, benh, aneesh.kumar, paulmck, srikar,
	haren, jglisse, mgorman, mhocko, vbabka, cl

On Wed, Apr 19, 2017 at 05:52:39PM +1000, Balbir Singh wrote:
>In this patch we create N_COHERENT_MEMORY, which is different
>from N_MEMORY. A node hotplugged as coherent memory will have
>this state set. The expectation then is that this memory gets
>onlined like regular nodes. Memory allocation from such nodes
>occurs only when the node is contained explicitly in the
>mask.

Finally got around to test-driving this. From what I can see, as expected,
both kernel and userspace seem to ignore these nodes unless you
allocate specifically from them. Very convenient.

Is "online_coherent"/MMOP_ONLINE_COHERENT the right way to trigger this?  
That mechanism is used to specify zone, and only for a single block of 
memory. This concept applies to the node as a whole. I think it should 
be independent of memory onlining.

I mean, let's say online_kernel N blocks, some of them get allocated, 
and then you online_coherent block N+1, flipping the entire node into 
N_COHERENT_MEMORY. That doesn't seem right.

That said, this set as it stands needs an adjustment when based on top 
of Michal's onlining revamp [1]. As-is, allow_online_pfn_range() is 
returning false. The patch below fixed it for me.

[1] http://lkml.kernel.org/r/20170421120512.23960-1-mhocko@kernel.org

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 4a535f1..ccb7a84 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -869,16 +869,20 @@ bool allow_online_pfn_range(int nid, unsigned long pfn, unsigned long nr_pages,
 	 * though so let's stick with it for simplicity for now.
 	 * TODO make sure we do not overlap with ZONE_DEVICE
 	 */
-	if (online_type == MMOP_ONLINE_KERNEL) {
+	switch (online_type) {
+	case MMOP_ONLINE_KERNEL:
 		if (zone_is_empty(movable_zone))
 			return true;
 		return movable_zone->zone_start_pfn >= pfn + nr_pages;
-	} else if (online_type == MMOP_ONLINE_MOVABLE) {
+	case MMOP_ONLINE_MOVABLE:
 		return zone_end_pfn(normal_zone) <= pfn;
+	case MMOP_ONLINE_KEEP:
+	case MMOP_ONLINE_COHERENT:
+		/* These will always succeed and inherit the current zone */
+		return true;
 	}
 
-	/* MMOP_ONLINE_KEEP will always succeed and inherits the current zone */
-	return online_type == MMOP_ONLINE_KEEP;
+	return false;
 }
 
 static void __meminit resize_zone_range(struct zone *zone, unsigned long start_pfn,


-- 
Reza Arbab


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [RFC 1/4] mm: create N_COHERENT_MEMORY
  2017-04-27 18:42   ` Reza Arbab
@ 2017-04-28  5:07     ` Balbir Singh
  0 siblings, 0 replies; 45+ messages in thread
From: Balbir Singh @ 2017-04-28  5:07 UTC (permalink / raw)
  To: Reza Arbab
  Cc: linux-mm, akpm, khandual, benh, aneesh.kumar, paulmck, srikar,
	haren, jglisse, mgorman, mhocko, vbabka, cl

On Thu, 2017-04-27 at 13:42 -0500, Reza Arbab wrote:
> On Wed, Apr 19, 2017 at 05:52:39PM +1000, Balbir Singh wrote:
> > In this patch we create N_COHERENT_MEMORY, which is different
> > from N_MEMORY. A node hotplugged as coherent memory will have
> > this state set. The expectation then is that this memory gets
> > onlined like regular nodes. Memory allocation from such nodes
> > occurs only when the node is contained explicitly in the
> > mask.
> 
> Finally got around to test drive this. From what I can see, as expected,
> both kernel and userspace seem to ignore these nodes, unless you 
> allocate specifically from them. Very convenient.

Thanks for testing them!

> 
> Is "online_coherent"/MMOP_ONLINE_COHERENT the right way to trigger this?  

Now that we mark the node state at boot/hotplug time, I think we can ignore
these changes.

> That mechanism is used to specify zone, and only for a single block of 
> memory. This concept applies to the node as a whole. I think it should 
> be independent of memory onlining.
> 
> I mean, let's say online_kernel N blocks, some of them get allocated, 
> and then you online_coherent block N+1, flipping the entire node into 
> N_COHERENT_MEMORY. That doesn't seem right.
> 

Agreed, I'll remove these bits in the next posting.

Thanks for the review!
Balbir Singh.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-04-19  7:52 [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion) Balbir Singh
                   ` (4 preceding siblings ...)
  2017-04-19 19:02 ` [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion) Christoph Lameter
@ 2017-05-01 20:41 ` John Hubbard
  2017-05-01 21:04   ` Reza Arbab
  2017-05-02  1:29   ` Balbir Singh
  2017-05-02 14:36 ` Michal Hocko
  6 siblings, 2 replies; 45+ messages in thread
From: John Hubbard @ 2017-05-01 20:41 UTC (permalink / raw)
  To: Balbir Singh, linux-mm, akpm
  Cc: khandual, benh, aneesh.kumar, paulmck, srikar, haren, jglisse,
	mgorman, mhocko, arbab, vbabka, cl

On 04/19/2017 12:52 AM, Balbir Singh wrote:
> This is a request for comments on the discussed approaches
> for coherent memory at mm-summit (some of the details are at
> https://lwn.net/Articles/717601/). The latest posted patch
> series is at https://lwn.net/Articles/713035/. I am reposting
> this as RFC, Michal Hocko suggested using HMM for CDM, but
> we believe there are stronger reasons to use the NUMA approach.
> The earlier patches for Coherent Device memory were implemented
> and designed by Anshuman Khandual.
> 

Hi Balbir,

Although I think everyone agrees that in the [very] long term, these 
hardware-coherent nodes probably want to be NUMA nodes, in order to decide what to 
code up over the next few years, we need to get a clear idea of what has to be done 
for each possible approach.

Here, the CDM discussion is falling just a bit short, because it does not yet 
include the whole story of what we would need to do. Earlier threads pointed this 
out: the idea started as a large patchset RFC, but then, "for ease of review", it 
got turned into a smaller RFC, which loses too much context.

So, I'd suggest putting together something more complete, so that it can be fairly 
compared against the HMM-for-hardware-coherent-nodes approach.


> Jerome posted HMM-CDM at https://lwn.net/Articles/713035/.
> The patches do a great deal to enable CDM with HMM, but we
> still believe that HMM with CDM is not a natural way to
> represent coherent device memory and the mm will need
> to be audited and enhanced for it to even work.

That is also true for the CDM approach. Specifically, in order for this to be of any 
use to device drivers, we'll need the following:

1. A way to move pages between NUMA nodes, both virtual address and physical 
address-based, from kernel mode.

2. A way to provide reverse mapping information to device drivers, even if 
indirectly. (I'm not proposing exposing rmap, but this has to be thought through, 
because at some point, a device will need to do something with a physical page.)

This strikes me as the hardest part of the problem.

3. Detection and mitigation of page thrashing between NUMA nodes (shared 
responsibility between core -mm and device driver, but probably missing some APIs 
today).

4. Handling of oversubscription (allocating more memory than is physically on a NUMA 
node, by evicting "LRU-like" pages, rather than the current fallback to other NUMA 
nodes). Similar to (3) with respect to where we're at today.

5. Something to handle the story of bringing NUMA nodes online and putting them back 
offline, given that they require a device driver that may not yet have been loaded. 
There are a few minor missing bits there.

thanks,

--
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-01 20:41 ` John Hubbard
@ 2017-05-01 21:04   ` Reza Arbab
  2017-05-01 21:56     ` John Hubbard
  2017-05-02  1:29   ` Balbir Singh
  1 sibling, 1 reply; 45+ messages in thread
From: Reza Arbab @ 2017-05-01 21:04 UTC (permalink / raw)
  To: John Hubbard
  Cc: Balbir Singh, linux-mm, akpm, khandual, benh, aneesh.kumar,
	paulmck, srikar, haren, jglisse, mgorman, mhocko, vbabka, cl

On Mon, May 01, 2017 at 01:41:55PM -0700, John Hubbard wrote:
>1. A way to move pages between NUMA nodes, both virtual address and 
>physical address-based, from kernel mode.

Jerome's migrate_vma() and migrate_dma() should have this covered, 
including DMA-accelerated copy.

>5. Something to handle the story of bringing NUMA nodes online and 
>putting them back offline, given that they require a device driver that 
>may not yet have been loaded. There are a few minor missing bits there.

This has been prototyped with the driver doing memory hotplug/hotremove.  
Could you elaborate a little on what you feel is missing?

-- 
Reza Arbab


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-01 21:04   ` Reza Arbab
@ 2017-05-01 21:56     ` John Hubbard
  2017-05-01 23:51       ` Reza Arbab
  0 siblings, 1 reply; 45+ messages in thread
From: John Hubbard @ 2017-05-01 21:56 UTC (permalink / raw)
  To: Reza Arbab
  Cc: Balbir Singh, linux-mm, akpm, khandual, benh, aneesh.kumar,
	paulmck, srikar, haren, jglisse, mgorman, mhocko, vbabka, cl

On 05/01/2017 02:04 PM, Reza Arbab wrote:
> On Mon, May 01, 2017 at 01:41:55PM -0700, John Hubbard wrote:
>> 1. A way to move pages between NUMA nodes, both virtual address and physical 
>> address-based, from kernel mode.
> 
> Jérôme's migrate_vma() and migrate_dma() should have this covered, including 
> DMA-accelerated copy.

Yes, that's good. I wasn't sure from this discussion here if either or both of those 
would be used, but now I see.

Are those APIs ready for moving pages between NUMA nodes? As there is no NUMA node 
id in the API, are we relying on the pages' membership (using each page and updating 
which node it is on)?

> 
>> 5. Something to handle the story of bringing NUMA nodes online and putting them 
>> back offline, given that they require a device driver that may not yet have been 
>> loaded. There are a few minor missing bits there.
> 
> This has been prototyped with the driver doing memory hotplug/hotremove. Could you 
> elaborate a little on what you feel is missing?
> 

We just worked through how to deal with this in our driver, and I remember feeling 
worried about the way NUMA nodes can only be put online via a user space action 
(through sysfs). It seemed like you'd want to do that from kernel as well, when a 
device driver gets loaded.

I was also uneasy about user space trying to bring a node online before the 
associated device driver was loaded, and I think it would be nice to be sure that 
that whole story is looked at.

The theme here is that driver load/unload is, today, independent from the NUMA node 
online/offline, and that's a problem. Not a huge one, though, just worth enumerating 
here.

thanks
john h


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-01 21:56     ` John Hubbard
@ 2017-05-01 23:51       ` Reza Arbab
  2017-05-01 23:58         ` John Hubbard
  0 siblings, 1 reply; 45+ messages in thread
From: Reza Arbab @ 2017-05-01 23:51 UTC (permalink / raw)
  To: John Hubbard
  Cc: Balbir Singh, linux-mm, akpm, khandual, benh, aneesh.kumar,
	paulmck, srikar, haren, jglisse, mgorman, mhocko, vbabka, cl

On Mon, May 01, 2017 at 02:56:34PM -0700, John Hubbard wrote:
>On 05/01/2017 02:04 PM, Reza Arbab wrote:
>>On Mon, May 01, 2017 at 01:41:55PM -0700, John Hubbard wrote:
>>>1. A way to move pages between NUMA nodes, both virtual address 
>>>and physical address-based, from kernel mode.
>>
>>Jerome's migrate_vma() and migrate_dma() should have this covered, 
>>including DMA-accelerated copy.
>
>Yes, that's good. I wasn't sure from this discussion here if either or 
>both of those would be used, but now I see.
>
>Are those APIs ready for moving pages between NUMA nodes? As there is 
>no NUMA node id in the API, are we relying on the pages' membership 
>(using each page and updating which node it is on)?

Yes. Those APIs work by callback. The alloc_and_copy() function you 
provide will be called at the appropriate point in the migration. Yours 
would allocate from a specific destination node, and copy using DMA.
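
Very roughly, and with the caveat that the ops and MIGRATE_PFN_* names
follow Jerome's posting and my_dev_dma_copy() is only a placeholder for
the driver's DMA engine, the callback could look like this untested sketch:

/*
 * Untested sketch of an alloc_and_copy() callback for the callback-based
 * migrate_vma() interface from Jerome's series.  Flag and helper names
 * follow that posting; my_dev_dma_copy() stands in for driver code.
 */
static void cdm_alloc_and_copy(struct vm_area_struct *vma,
			       const unsigned long *src,
			       unsigned long *dst,
			       unsigned long start,
			       unsigned long end,
			       void *private)
{
	int target_nid = *(int *)private;	/* destination (CDM) node */
	unsigned long addr;
	unsigned long i;

	for (addr = start, i = 0; addr < end; addr += PAGE_SIZE, i++) {
		struct page *newpage;

		/* Skip entries core mm decided not to migrate. */
		if (!(src[i] & MIGRATE_PFN_MIGRATE))
			continue;

		/* Allocate the destination page on the chosen node. */
		newpage = alloc_pages_node(target_nid,
					   GFP_HIGHUSER_MOVABLE, 0);
		if (!newpage)
			continue;

		/* DMA-accelerated copy, entirely driver specific. */
		my_dev_dma_copy(newpage, migrate_pfn_to_page(src[i]));

		dst[i] = migrate_pfn(page_to_pfn(newpage)) |
			 MIGRATE_PFN_LOCKED;
	}
}

So the destination node really is a driver decision; core mm just walks
the range and collects/installs the pages around the callback.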

>>>5. Something to handle the story of bringing NUMA nodes online and 
>>>putting them back offline, given that they require a device driver 
>>>that may not yet have been loaded. There are a few minor missing bits 
>>>there.
>>
>>This has been prototyped with the driver doing memory 
>>hotplug/hotremove. Could you elaborate a little on what you feel is 
>>missing?
>>
>
>We just worked through how to deal with this in our driver, and I 
>remember feeling worried about the way NUMA nodes can only be put 
>online via a user space action (through sysfs). It seemed like you'd 
>want to do that from kernel as well, when a device driver gets loaded.

That's true. I don't think we have a way to online/offline from a 
driver. To online, the alternatives are memhp_auto_online (incapable of 
doing online_movable), or udev rules (not ideal in this driver 
controlled memory use case). To offline, nothing that I know of.
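
For reference, the hotplug half is conceptually as small as the untested
sketch below (cdm_hotplug() is a made-up name, add_memory() per the
current API); the onlining step after it is exactly what has no in-kernel
interface today:

#include <linux/memory_hotplug.h>

static int cdm_hotplug(int nid, u64 start, u64 size)
{
	int ret;

	/* Creates and registers the memory blocks for the device memory. */
	ret = add_memory(nid, start, size);
	if (ret)
		return ret;

	/*
	 * Missing piece: there is no in-kernel equivalent of
	 * "echo online_movable > /sys/devices/system/memory/memoryN/state",
	 * so onlining still depends on user space.
	 */
	return 0;
}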

>I was also uneasy about user space trying to bring a node online before 
>the associated device driver was loaded, and I think it would be nice 
>to be sure that that whole story is looked at.
>
>The theme here is that driver load/unload is, today, independent from 
>the NUMA node online/offline, and that's a problem. Not a huge one, 
>though, just worth enumerating here.

-- 
Reza Arbab


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-01 23:51       ` Reza Arbab
@ 2017-05-01 23:58         ` John Hubbard
  2017-05-02  0:04           ` Reza Arbab
  0 siblings, 1 reply; 45+ messages in thread
From: John Hubbard @ 2017-05-01 23:58 UTC (permalink / raw)
  To: Reza Arbab
  Cc: Balbir Singh, linux-mm, akpm, khandual, benh, aneesh.kumar,
	paulmck, srikar, haren, jglisse, mgorman, mhocko, vbabka, cl

On 05/01/2017 04:51 PM, Reza Arbab wrote:
> On Mon, May 01, 2017 at 02:56:34PM -0700, John Hubbard wrote:
>> On 05/01/2017 02:04 PM, Reza Arbab wrote:
>>> On Mon, May 01, 2017 at 01:41:55PM -0700, John Hubbard wrote:
>>>> 1. A way to move pages between NUMA nodes, both virtual address and physical 
>>>> address-based, from kernel mode.
>>>
>>> Jérôme's migrate_vma() and migrate_dma() should have this covered, including 
>>> DMA-accelerated copy.
>>
>> Yes, that's good. I wasn't sure from this discussion here if either or both of 
>> those would be used, but now I see.
>>
>> Are those APIs ready for moving pages between NUMA nodes? As there is no NUMA node 
>> id in the API, are we relying on the pages' membership (using each page and 
>> updating which node it is on)?
> 
> Yes. Those APIs work by callback. The alloc_and_copy() function you provide will be 
> called at the appropriate point in the migration. Yours would allocate from a 
> specific destination node, and copy using DMA.
> 

hmmm, that reminds me: the whole story of "which device is this, and which NUMA node 
does it correlate to?" will have to be wired up. That is *probably* all in the 
device driver, but since I haven't worked through it, I'd be inclined to list it as 
an item on the checklist, just in case it requires some little hook in the upstream 
kernel.

thanks,
john h


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-01 23:58         ` John Hubbard
@ 2017-05-02  0:04           ` Reza Arbab
  0 siblings, 0 replies; 45+ messages in thread
From: Reza Arbab @ 2017-05-02  0:04 UTC (permalink / raw)
  To: John Hubbard
  Cc: Balbir Singh, linux-mm, akpm, khandual, benh, aneesh.kumar,
	paulmck, srikar, haren, jglisse, mgorman, mhocko, vbabka, cl

On Mon, May 01, 2017 at 04:58:14PM -0700, John Hubbard wrote:
>On 05/01/2017 04:51 PM, Reza Arbab wrote:
>>On Mon, May 01, 2017 at 02:56:34PM -0700, John Hubbard wrote:
>>>On 05/01/2017 02:04 PM, Reza Arbab wrote:
>>>>On Mon, May 01, 2017 at 01:41:55PM -0700, John Hubbard wrote:
>>>>>1. A way to move pages between NUMA nodes, both virtual 
>>>>>address and physical address-based, from kernel mode.
>>>>
>>>>Jerome's migrate_vma() and migrate_dma() should have this 
>>>>covered, including DMA-accelerated copy.
>>>
>>>Yes, that's good. I wasn't sure from this discussion here if 
>>>either or both of those would be used, but now I see.
>>>
>>>Are those APIs ready for moving pages between NUMA nodes? As there 
>>>is no NUMA node id in the API, are we relying on the pages' 
>>>membership (using each page and updating which node it is on)?
>>
>>Yes. Those APIs work by callback. The alloc_and_copy() function you 
>>provide will be called at the appropriate point in the migration. 
>>Yours would allocate from a specific destination node, and copy 
>>using DMA.
>>
>
>hmmm, that reminds me: the whole story of "which device is this, and 
>which NUMA node does it correlate to?" will have to be wired up. That 
>is *probably* all in the device driver, but since I haven't worked 
>through it, I'd be inclined to list it as an item on the checklist, 
>just in case it requires some little hook in the upstream kernel.

I've worked this out. It can be contained to the driver itself.

-- 
Reza Arbab


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-01 20:41 ` John Hubbard
  2017-05-01 21:04   ` Reza Arbab
@ 2017-05-02  1:29   ` Balbir Singh
  2017-05-02  5:47     ` John Hubbard
  1 sibling, 1 reply; 45+ messages in thread
From: Balbir Singh @ 2017-05-02  1:29 UTC (permalink / raw)
  To: John Hubbard, linux-mm, akpm
  Cc: khandual, benh, aneesh.kumar, paulmck, srikar, haren, jglisse,
	mgorman, mhocko, arbab, vbabka, cl

On Mon, 2017-05-01 at 13:41 -0700, John Hubbard wrote:
> On 04/19/2017 12:52 AM, Balbir Singh wrote:
> > This is a request for comments on the discussed approaches
> > for coherent memory at mm-summit (some of the details are at
> > https://lwn.net/Articles/717601/). The latest posted patch
> > series is at https://lwn.net/Articles/713035/. I am reposting
> > this as RFC, Michal Hocko suggested using HMM for CDM, but
> > we believe there are stronger reasons to use the NUMA approach.
> > The earlier patches for Coherent Device memory were implemented
> > and designed by Anshuman Khandual.
> > 
> 
> Hi Balbir,
> 
> Although I think everyone agrees that in the [very] long term, these 
> hardware-coherent nodes probably want to be NUMA nodes, in order to decide what to 
> code up over the next few years, we need to get a clear idea of what has to be done 
> for each possible approach.
> 
> Here, the CDM discussion is falling just a bit short, because it does not yet 
> include the whole story of what we would need to do. Earlier threads pointed this 
> out: the idea started as a large patchset RFC, but then, "for ease of review", it 
> got turned into a smaller RFC, which loses too much context.

Hi, John

I thought I explained the context, but I'll try again. I see the whole solution
as a composite of the following primitives:

1. Enable hotplug of CDM nodes
2. Isolation of CDM memory
3. Migration to/from CDM memory
4. Performance enhancements for migration

The RFC here is for (2) above. (3) is handled by HMM and (4) is being discussed
in the community. I think the larger goals are the same as HMM's, except that we
don't need unaddressable memory, since the memory is cache coherent.

> 
> So, I'd suggest putting together something more complete, so that it can be fairly 
> compared against the HMM-for-hardware-coherent-nodes approach.
>

Since I intend to reuse bits of HMM, I am not sure if I want to repost those
patches as a part of my RFC. I hope my answers make sense; the goal is to
reuse as much of what is already available. From a user perspective:

1. We see no new interface being added in either case, though the programming
model would differ
2. We expect the programming model to be abstracted behind a user space
framework, potentially like CUDA or CXL

 
> 
> > Jerome posted HMM-CDM at https://lwn.net/Articles/713035/.
> > The patches do a great deal to enable CDM with HMM, but we
> > still believe that HMM with CDM is not a natural way to
> > represent coherent device memory and the mm will need
> > to be audited and enhanced for it to even work.
> 
> That is also true for the CDM approach. Specifically, in order for this to be of any 
> use to device drivers, we'll need the following:
>

Since Reza answered these questions, I'll skip them in this email

Thanks for the review!
Balbir Singh 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-02  1:29   ` Balbir Singh
@ 2017-05-02  5:47     ` John Hubbard
  2017-05-02  7:23       ` Balbir Singh
  0 siblings, 1 reply; 45+ messages in thread
From: John Hubbard @ 2017-05-02  5:47 UTC (permalink / raw)
  To: Balbir Singh, linux-mm, akpm
  Cc: khandual, benh, aneesh.kumar, paulmck, srikar, haren, jglisse,
	mgorman, mhocko, arbab, vbabka, cl



On 05/01/2017 06:29 PM, Balbir Singh wrote:
> On Mon, 2017-05-01 at 13:41 -0700, John Hubbard wrote:
>> On 04/19/2017 12:52 AM, Balbir Singh wrote:
>>> This is a request for comments on the discussed approaches
>>> for coherent memory at mm-summit (some of the details are at
>>> https://lwn.net/Articles/717601/). The latest posted patch
>>> series is at https://lwn.net/Articles/713035/. I am reposting
>>> this as RFC, Michal Hocko suggested using HMM for CDM, but
>>> we believe there are stronger reasons to use the NUMA approach.
>>> The earlier patches for Coherent Device memory were implemented
>>> and designed by Anshuman Khandual.
>>>
>>
>> Hi Balbir,
>>
>> Although I think everyone agrees that in the [very] long term, these
>> hardware-coherent nodes probably want to be NUMA nodes, in order to decide what to
>> code up over the next few years, we need to get a clear idea of what has to be done
>> for each possible approach.
>>
>> Here, the CDM discussion is falling just a bit short, because it does not yet
>> include the whole story of what we would need to do. Earlier threads pointed this
>> out: the idea started as a large patchset RFC, but then, "for ease of review", it
>> got turned into a smaller RFC, which loses too much context.
> 
> Hi, John
> 
> I thought I explained the context, but I'll try again. I see the whole solution
> as a composite of the following primitives:
> 
> 1. Enable hotplug of CDM nodes
> 2. Isolation of CDM memory
> 3. Migration to/from CDM memory
> 4. Performance enhancements for migration
> 

So, there is a little more required than the above, which is why I made that short
list. I'm in particular concerned about the various system calls that userspace can
make to control NUMA memory; the device drivers will need notification (probably
mmu_notifiers, I guess), and once they get notification, in many cases they'll need
some way to deal with reverse mapping.

HMM provides all of that support, so it needs to happen here, too.
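
To be concrete about the notification half, the sort of thing I have in
mind is a bare-bones, untested sketch like the one below (only the start
callback is shown; its body is obviously device specific):

/*
 * Untested sketch: a CDM driver registering an mmu_notifier so it learns
 * about invalidations done by core mm against the address space it is
 * tracking.  Everything except the start callback is elided.
 */
#include <linux/mmu_notifier.h>

static void cdm_invalidate_range_start(struct mmu_notifier *mn,
				       struct mm_struct *mm,
				       unsigned long start,
				       unsigned long end)
{
	/* Tear down device TLB / page-table entries covering [start, end). */
}

static const struct mmu_notifier_ops cdm_mn_ops = {
	.invalidate_range_start	= cdm_invalidate_range_start,
};

static struct mmu_notifier cdm_mn = {
	.ops = &cdm_mn_ops,
};

static int cdm_track_mm(struct mm_struct *mm)
{
	return mmu_notifier_register(&cdm_mn, mm);
}

The rmap side is the part that has no equivalent helper today, which is
why I keep bringing it up.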



> The RFC here is for (2) above. (3) is handled by HMM and (4) is being discussed
> in the community. I think the larger goals are same as HMM, except that we
> don't need unaddressable memory, since the memory is cache coherent.
> 
>>
>> So, I'd suggest putting together something more complete, so that it can be fairly
>> compared against the HMM-for-hardware-coherent-nodes approach.
>>
> 
> Since I intend to reuse bits of HMM, I am not sure if I want to repost those
> patches as a part of my RFC. I hope my answers make sense, the goal is to
> reuse as much of what is available. From a user perspective

It's hard to keep track of what the plan is, so explaining exactly what you're doing 
helps.

> 
> 1. We see no new interface being added in either case, the programming model
> would differ though
> 2. We expect the programming model to be abstracted behind a user space
> framework, potentially like CUDA or CXL
> 
>   
>>
>>> Jerome posted HMM-CDM at https://lwn.net/Articles/713035/.
>>> The patches do a great deal to enable CDM with HMM, but we
>>> still believe that HMM with CDM is not a natural way to
>>> represent coherent device memory and the mm will need
>>> to be audited and enhanced for it to even work.
>>
>> That is also true for the CDM approach. Specifically, in order for this to be of any
>> use to device drivers, we'll need the following:
>>
> 
> Since Reza answered these questions, I'll skip them in this email

Yes, but he skipped over the rmap question, which I think is an important one.

thanks
john h

> 
> Thanks for the review!
> Balbir Singh
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-02  5:47     ` John Hubbard
@ 2017-05-02  7:23       ` Balbir Singh
  2017-05-02 17:50         ` John Hubbard
  0 siblings, 1 reply; 45+ messages in thread
From: Balbir Singh @ 2017-05-02  7:23 UTC (permalink / raw)
  To: John Hubbard, linux-mm, akpm
  Cc: khandual, benh, aneesh.kumar, paulmck, srikar, haren, jglisse,
	mgorman, mhocko, arbab, vbabka, cl

On Mon, 2017-05-01 at 22:47 -0700, John Hubbard wrote:
> 
> On 05/01/2017 06:29 PM, Balbir Singh wrote:
> > On Mon, 2017-05-01 at 13:41 -0700, John Hubbard wrote:
> > > On 04/19/2017 12:52 AM, Balbir Singh wrote:
> > > > This is a request for comments on the discussed approaches
> > > > for coherent memory at mm-summit (some of the details are at
> > > > https://lwn.net/Articles/717601/). The latest posted patch
> > > > series is at https://lwn.net/Articles/713035/. I am reposting
> > > > this as RFC, Michal Hocko suggested using HMM for CDM, but
> > > > we believe there are stronger reasons to use the NUMA approach.
> > > > The earlier patches for Coherent Device memory were implemented
> > > > and designed by Anshuman Khandual.
> > > > 
> > > 
> > > Hi Balbir,
> > > 
> > > Although I think everyone agrees that in the [very] long term, these
> > > hardware-coherent nodes probably want to be NUMA nodes, in order to decide what to
> > > code up over the next few years, we need to get a clear idea of what has to be done
> > > for each possible approach.
> > > 
> > > Here, the CDM discussion is falling just a bit short, because it does not yet
> > > include the whole story of what we would need to do. Earlier threads pointed this
> > > out: the idea started as a large patchset RFC, but then, "for ease of review", it
> > > got turned into a smaller RFC, which loses too much context.
> > 
> > Hi, John
> > 
> > I thought I explained the context, but I'll try again. I see the whole solution
> > as a composite of the following primitives:
> > 
> > 1. Enable hotplug of CDM nodes
> > 2. Isolation of CDM memory
> > 3. Migration to/from CDM memory
> > 4. Performance enhancements for migration
> > 
> 
> So, there is a little more than the above required, which is why I made that short 
> list. I'm in particular concerned about the various system calls that userspace can 
> make to control NUMA memory, and the device drivers will need notification (probably 
> mmu_notifiers, I guess), and once they get notification, in many cases they'll need 
> some way to deal with reverse mapping.

Are you suggesting that the system calls user space makes should be audited to
check whether they should be used with a CDM device? I would
think a whole lot of this should be transparent to user space, unless it opts
in to using CDM and explicitly wants to allocate and free memory -- the whole
isolation premise. W.r.t. device drivers, are you suggesting that the device
driver needs to know the state of each page -- free/in-use? Reverse mapping
for migration?

> 
> HMM provides all of that support, so it needs to happen here, too.
> 
> 
> 
> > The RFC here is for (2) above. (3) is handled by HMM and (4) is being discussed
> > in the community. I think the larger goals are same as HMM, except that we
> > don't need unaddressable memory, since the memory is cache coherent.
> > 
> > > 
> > > So, I'd suggest putting together something more complete, so that it can be fairly
> > > compared against the HMM-for-hardware-coherent-nodes approach.
> > > 
> > 
> > Since I intend to reuse bits of HMM, I am not sure if I want to repost those
> > patches as a part of my RFC. I hope my answers make sense, the goal is to
> > reuse as much of what is available. From a user perspective
> 
> It's hard to keep track of what the plan is, so explaining exactly what you're doing 
> helps.
> 

Fair enough, I hope I answered the questions?

> > 
> > 1. We see no new interface being added in either case, the programming model
> > would differ though
> > 2. We expect the programming model to be abstracted behind a user space
> > framework, potentially like CUDA or CXL
> > 
> >   
> > > 
> > > > Jerome posted HMM-CDM at https://lwn.net/Articles/713035/.
> > > > The patches do a great deal to enable CDM with HMM, but we
> > > > still believe that HMM with CDM is not a natural way to
> > > > represent coherent device memory and the mm will need
> > > > to be audited and enhanced for it to even work.
> > > 
> > > That is also true for the CDM approach. Specifically, in order for this to be of any
> > > use to device drivers, we'll need the following:
> > > 
> > 
> > Since Reza answered these questions, I'll skip them in this email
> 
> Yes, but he skipped over the rmap question, which I think is an important one.
>

If it is for migration, then we are going to rely on changes from HMM-CDM.
How does HMM deal with the rmap case? I presume it is not required for
unaddressable memory?

Balbir Singh. 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-04-19  7:52 [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion) Balbir Singh
                   ` (5 preceding siblings ...)
  2017-05-01 20:41 ` John Hubbard
@ 2017-05-02 14:36 ` Michal Hocko
  2017-05-04  5:26   ` Balbir Singh
  6 siblings, 1 reply; 45+ messages in thread
From: Michal Hocko @ 2017-05-02 14:36 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, akpm, khandual, benh, aneesh.kumar, paulmck, srikar,
	haren, jglisse, mgorman, arbab, vbabka, cl

On Wed 19-04-17 17:52:38, Balbir Singh wrote:
> This is a request for comments on the discussed approaches
> for coherent memory at mm-summit (some of the details are at
> https://lwn.net/Articles/717601/). The latest posted patch
> series is at https://lwn.net/Articles/713035/. I am reposting
> this as RFC, Michal Hocko suggested using HMM for CDM, but
> we believe there are stronger reasons to use the NUMA approach.
> The earlier patches for Coherent Device memory were implemented
> and designed by Anshuman Khandual.
> 
> Jerome posted HMM-CDM at https://lwn.net/Articles/713035/.
> The patches do a great deal to enable CDM with HMM, but we
> still believe that HMM with CDM is not a natural way to
> represent coherent device memory and the mm will need
> to be audited and enhanced for it to even work.
> 
> With HMM we'll see ZONE_DEVICE pages mapped into
> user space and that would mean a thorough audit of all code
> paths to make sure we are ready for such a use case and enabling
> those use cases, like with HMM CDM patch 1, which changes
> move_pages() and migration paths. I've done a quick
> evaluation to check for features and found limitationd around
> features like migration (page cache
> migration), fault handling to the right location
> (direct page cache allocation in the coherent memory), mlock
> handling, RSS accounting, memcg enforcement for pages not on LRU, etc.

Are those problems not viable to solve?

[...]
> Introduction
> 
> CDM device memory is cache coherent with system memory and we would like
> this to show up as a NUMA node, however there are certain algorithms
> that might not be currently suitable for N_COHERENT_MEMORY
> 
> 1. AutoNUMA balancing

OK, I can see a reason for that but theoretically the same applies to
cpuless numa nodes in general, no?

> 2. kswapd reclaim

How is the memory reclaim handled then? How are users expected to handle
OOM situation?

> The reason for exposing this device memory as NUMA is to simplify
> the programming model, where memory allocation via malloc() or
> mmap() for example would seamlessly work across both kinds of
> memory. Since we expect the size of device memory to be smaller
> than system RAM, we would like to control the allocation of such
> memory. The proposed mechanism reuses nodemasks and explicit
> specification of the coherent node in the nodemask for allocation
> from device memory. This implementation also allows for kernel
> level allocation via __GFP_THISNODE and existing techniques
> such as page migration to work.

so it basically resembles isol_cpus except for memory, right. I believe
scheduler people are more than unhappy about this interface...

Anyway, I consider CPUless nodes a dirty hack (especially when I see
them mostly used with poorly configured LPARs where no CPUs are left for
a particular memory).  Now this is trying to extend this concept even
further to a memory which is not reclaimable by the kernel and requires
an explicit and cooperative memory reclaim from userspace. How is this
going to work? The memory also has different reliability properties
from RAM which user space doesn't have any clue about from the NUMA
properties exported. Or am I misunderstanding it? That all sounds quite
scary to me.

I very much agree with the last email from Mel and I would really like
to see how would a real application benefit from these nodes.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-02  7:23       ` Balbir Singh
@ 2017-05-02 17:50         ` John Hubbard
  0 siblings, 0 replies; 45+ messages in thread
From: John Hubbard @ 2017-05-02 17:50 UTC (permalink / raw)
  To: Balbir Singh, linux-mm, akpm
  Cc: khandual, benh, aneesh.kumar, paulmck, srikar, haren, jglisse,
	mgorman, mhocko, arbab, vbabka, cl

On 05/02/2017 12:23 AM, Balbir Singh wrote:
> On Mon, 2017-05-01 at 22:47 -0700, John Hubbard wrote:
>>
>> On 05/01/2017 06:29 PM, Balbir Singh wrote:
>>> On Mon, 2017-05-01 at 13:41 -0700, John Hubbard wrote:
>>>> On 04/19/2017 12:52 AM, Balbir Singh wrote:
[...]
>>> 1. Enable hotplug of CDM nodes
>>> 2. Isolation of CDM memory
>>> 3. Migration to/from CDM memory
>>> 4. Performance enhancements for migration
>>>
>>
>> So, there is a little more than the above required, which is why I made that short
>> list. I'm in particular concerned about the various system calls that userspace can
>> make to control NUMA memory, and the device drivers will need notification (probably
>> mmu_notifiers, I guess), and once they get notification, in many cases they'll need
>> some way to deal with reverse mapping.
> 
> Are you suggesting that the system calls user space makes should be audited to
> check whether they should be used with a CDM device? I would
> think a whole lot of this should be transparent to user space, unless it opts
> in to using CDM and explicitly wants to allocate and free memory -- the whole
> isolation premise. W.r.t. device drivers, are you suggesting that the device
> driver needs to know the state of each page -- free/in-use? Reverse mapping
> for migration?
> 

Interesting question. No, I was not going in that direction (auditing the various system calls...) at
all, actually. Rather, I was expecting this system to interact as normally as possible with all
of the system calls, and that is what led me to expect that some combination of "device driver + 
enhanced NUMA subsystem" would need to do rmap lookups.

Going through and special-casing CDM for various system calls would probably not be well-received, 
because it would be an indication of force-fitting this into the NUMA model before it's ready, right?

>>
>> HMM provides all of that support, so it needs to happen here, too.
>>
>>
>>
>>> The RFC here is for (2) above. (3) is handled by HMM and (4) is being discussed
>>> in the community. I think the larger goals are same as HMM, except that we
>>> don't need unaddressable memory, since the memory is cache coherent.
>>>
>>>>
>>>> So, I'd suggest putting together something more complete, so that it can be fairly
>>>> compared against the HMM-for-hardware-coherent-nodes approach.
>>>>
>>>
>>> Since I intend to reuse bits of HMM, I am not sure if I want to repost those
>>> patches as a part of my RFC. I hope my answers make sense, the goal is to
>>> reuse as much of what is available. From a user perspective
>>
>> It's hard to keep track of what the plan is, so explaining exactly what you're doing
>> helps.
>>
> 
> Fair enough, I hope I answered the questions?

Yes, thanks.

>>>
>>> 1. We see no new interface being added in either case, the programming model
>>> would differ though
>>> 2. We expect the programming model to be abstracted behind a user space
>>> framework, potentially like CUDA or CXL
>>>
>>>    
>>>>
>>>>> Jerome posted HMM-CDM at https://lwn.net/Articles/713035/.
>>>>> The patches do a great deal to enable CDM with HMM, but we
>>>>> still believe that HMM with CDM is not a natural way to
>>>>> represent coherent device memory and the mm will need
>>>>> to be audited and enhanced for it to even work.
>>>>
>>>> That is also true for the CDM approach. Specifically, in order for this to be of any
>>>> use to device drivers, we'll need the following:
>>>>
>>>
>>> Since Reza answered these questions, I'll skip them in this email
>>
>> Yes, but he skipped over the rmap question, which I think is an important one.
>>
> 
> If it is for migration, then we are going to rely on changes from HMM-CDM.
> How does HMM deal with the rmap case? I presume it is not required for
> unaddressable memory?
> 
> Balbir Singh.
> 

That's correct, we don't need rmap access for device drivers in the "pure HMM" case, because the HMM 
core handles it.

thanks
john h


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-02 14:36 ` Michal Hocko
@ 2017-05-04  5:26   ` Balbir Singh
  2017-05-04 12:52     ` Michal Hocko
  0 siblings, 1 reply; 45+ messages in thread
From: Balbir Singh @ 2017-05-04  5:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, akpm, khandual, benh, aneesh.kumar, paulmck, srikar,
	haren, jglisse, mgorman, arbab, vbabka, cl

On Tue, 2017-05-02 at 16:36 +0200, Michal Hocko wrote:
> On Wed 19-04-17 17:52:38, Balbir Singh wrote:
> > This is a request for comments on the discussed approaches
> > for coherent memory at mm-summit (some of the details are at
> > https://lwn.net/Articles/717601/). The latest posted patch
> > series is at https://lwn.net/Articles/713035/. I am reposting
> > this as RFC, Michal Hocko suggested using HMM for CDM, but
> > we believe there are stronger reasons to use the NUMA approach.
> > The earlier patches for Coherent Device memory were implemented
> > and designed by Anshuman Khandual.
> > 
> > Jerome posted HMM-CDM at https://lwn.net/Articles/713035/.
> > The patches do a great deal to enable CDM with HMM, but we
> > still believe that HMM with CDM is not a natural way to
> > represent coherent device memory and the mm will need
> > to be audited and enhanced for it to even work.
> > 
> > With HMM we'll see ZONE_DEVICE pages mapped into
> > user space and that would mean a thorough audit of all code
> > paths to make sure we are ready for such a use case and enabling
> > those use cases, like with HMM CDM patch 1, which changes
> > move_pages() and migration paths. I've done a quick
> > evaluation to check for features and found limitationd around
> > features like migration (page cache
> > migration), fault handling to the right location
> > (direct page cache allocation in the coherent memory), mlock
> > handling, RSS accounting, memcg enforcement for pages not on LRU, etc.
> 
> Are those problems not viable to solve?

Yes, except IIUC the direct page cache allocation one. The reason for calling
them out is to make it clear that HMM CDM would require new mm changes/audits
to support ZONE_DEVICE pages across several parts of the mm subsystem.

> 
> [...]
> > Introduction
> > 
> > CDM device memory is cache coherent with system memory and we would like
> > this to show up as a NUMA node, however there are certain algorithms
> > that might not be currently suitable for N_COHERENT_MEMORY
> > 
> > 1. AutoNUMA balancing
> 
> OK, I can see a reason for that but theoretically the same applies to
> cpuless numa nodes in general, no?


That is correct. Christoph has shown some interest in isolating some
algorithms as well. I have some ideas that I can send out later.

> 
> > 2. kswapd reclaim
> 
> How is the memory reclaim handled then? How are users expected to handle
> OOM situation?
> 

1. The fallback node list for coherent memory includes regular memory
   nodes
2. Direct reclaim works, I've tested it

> > The reason for exposing this device memory as NUMA is to simplify
> > the programming model, where memory allocation via malloc() or
> > mmap() for example would seamlessly work across both kinds of
> > memory. Since we expect the size of device memory to be smaller
> > than system RAM, we would like to control the allocation of such
> > memory. The proposed mechanism reuses nodemasks and explicit
> > specification of the coherent node in the nodemask for allocation
> > from device memory. This implementation also allows for kernel
> > level allocation via __GFP_THISNODE and existing techniques
> > such as page migration to work.
> 
> so it basically resembles isol_cpus except for memory, right. I believe
> scheduler people are more than unhappy about this interface...
>

isol_cpus were for an era when timer/interrupts and other scheduler
infrastructure present today was not around, but I don't mean to digress.
 
> Anyway, I consider CPUless nodes a dirty hack (especially when I see
> them mostly used with poorly configured LPARs where no CPUs are left for
> a particular memory).  Now this is trying to extend this concept even
> further to a memory which is not reclaimable by the kernel and requires

Direct reclaim still works

> an explicit and cooperative memory reclaim from userspace. How is this
> going to work? The memory also has a different reliability properties
> from RAM which user space doesn't have any clue about from the NUMA
> properties exported. Or am I misunderstanding it? That all sounds quite
> scary to me.
> 
> I very much agree with the last email from Mel and I would really like
> to see how would a real application benefit from these nodes.
>

I see two use cases:

1. Aware application/library - allocates from this node and uses this memory
2. Unaware application/library - allocates memory anywhere, but does not use
CDM memory by default, since it is isolated.

Both 1 and 2 can work together, and an aware application can use an unaware
library and, if required, migrate pages between the two. Both 1 and 2
can access each other's memory due to coherency, so the final application
level use case is similar to HMM. That is why HMM-CDM and NUMA-CDM are
equivalent from an application programming model perspective,
except for the limitations mentioned above.
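
For (1), the aware case is just mempolicy with an explicit nodemask. A
minimal, untested userspace sketch (the CDM node id is passed in by the
caller; link with -lnuma for the mbind() wrapper):

/*
 * Untested sketch of the "aware" case: bind an anonymous mapping to the
 * coherent device node via MPOL_BIND with an explicit nodemask.
 */
#define _GNU_SOURCE
#include <numaif.h>
#include <sys/mman.h>
#include <stddef.h>

static void *alloc_on_cdm_node(size_t len, int cdm_node)
{
	unsigned long nodemask = 1UL << cdm_node;
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return NULL;

	/* The explicit nodemask is what makes the CDM node eligible at all. */
	if (mbind(buf, len, MPOL_BIND, &nodemask,
		  sizeof(nodemask) * 8, 0) != 0) {
		munmap(buf, len);
		return NULL;
	}
	return buf;
}

The unaware case needs no code at all, which is the point of keeping the
node out of the default fallback lists.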

Balbir Singh.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-04  5:26   ` Balbir Singh
@ 2017-05-04 12:52     ` Michal Hocko
  2017-05-04 15:49       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 45+ messages in thread
From: Michal Hocko @ 2017-05-04 12:52 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, akpm, khandual, benh, aneesh.kumar, paulmck, srikar,
	haren, jglisse, mgorman, arbab, vbabka, cl

On Thu 04-05-17 15:26:55, Balbir Singh wrote:
> On Tue, 2017-05-02 at 16:36 +0200, Michal Hocko wrote:
> > On Wed 19-04-17 17:52:38, Balbir Singh wrote:
[...]
> > > 2. kswapd reclaim
> > 
> > How is the memory reclaim handled then? How are users expected to handle
> > OOM situation?
> > 
> 
> 1. The fallback node list for coherent memory includes regular memory
>    nodes
> 2. Direct reclaim works, I've tested it

But the direct reclaim would be effective only _after_ all other nodes
are full.

I thought that kswapd reclaim is a problem because the HW doesn't
support aging properly but as the direct reclaim works then what is the
actual problem?
 
> > > The reason for exposing this device memory as NUMA is to simplify
> > > the programming model, where memory allocation via malloc() or
> > > mmap() for example would seamlessly work across both kinds of
> > > memory. Since we expect the size of device memory to be smaller
> > > than system RAM, we would like to control the allocation of such
> > > memory. The proposed mechanism reuses nodemasks and explicit
> > > specification of the coherent node in the nodemask for allocation
> > > from device memory. This implementation also allows for kernel
> > > level allocation via __GFP_THISNODE and existing techniques
> > > such as page migration to work.
> > 
> > so it basically resembles isol_cpus except for memory, right. I believe
> > scheduler people are more than unhappy about this interface...
> >
> 
> isol_cpus were for an era when timer/interrupts and other scheduler
> infrastructure present today was not around, but I don't mean to digress.

AFAIU, it has been added to _isolate_ some cpus from the scheduling domain
and have them available for explicit affinity usage. You are
effectively proposing the same thing.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-04 12:52     ` Michal Hocko
@ 2017-05-04 15:49       ` Benjamin Herrenschmidt
  2017-05-04 17:33         ` Dave Hansen
  2017-05-05 14:52         ` Michal Hocko
  0 siblings, 2 replies; 45+ messages in thread
From: Benjamin Herrenschmidt @ 2017-05-04 15:49 UTC (permalink / raw)
  To: Michal Hocko, Balbir Singh
  Cc: linux-mm, akpm, khandual, aneesh.kumar, paulmck, srikar, haren,
	jglisse, mgorman, arbab, vbabka, cl

On Thu, 2017-05-04 at 14:52 +0200, Michal Hocko wrote:
> But the direct reclaim would be effective only _after_ all other nodes
> are full.
> 
> I thought that kswapd reclaim is a problem because the HW doesn't
> support aging properly but as the direct reclaim works then what is the
> actual problem?

Ageing isn't completely broken. The ATS MMU supports
dirty/accessed just fine.

However the TLB invalidations are quite expensive with a GPU so too
much harvesting is detrimental, and the GPU tends to check pages out
using a special "read with intent to write" mode, which means it almost
always sets the dirty bit if the page is writable to begin with.

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-04 15:49       ` Benjamin Herrenschmidt
@ 2017-05-04 17:33         ` Dave Hansen
  2017-05-05  3:17           ` Balbir Singh
  2017-05-05  7:49           ` Benjamin Herrenschmidt
  2017-05-05 14:52         ` Michal Hocko
  1 sibling, 2 replies; 45+ messages in thread
From: Dave Hansen @ 2017-05-04 17:33 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Michal Hocko, Balbir Singh
  Cc: linux-mm, akpm, khandual, aneesh.kumar, paulmck, srikar, haren,
	jglisse, mgorman, arbab, vbabka, cl

On 05/04/2017 08:49 AM, Benjamin Herrenschmidt wrote:
> On Thu, 2017-05-04 at 14:52 +0200, Michal Hocko wrote:
>> But the direct reclaim would be effective only _after_ all other nodes
>> are full.
>>
>> I thought that kswapd reclaim is a problem because the HW doesn't
>> support aging properly but as the direct reclaim works then what is the
>> actual problem?
> 
> Ageing isn't completely broken. The ATS MMU supports
> dirty/accessed just fine.
> 
> However the TLB invalidations are quite expensive with a GPU so too
> much harvesting is detrimental, and the GPU tends to check pages out
> using a special "read with intent to write" mode, which means it almost
> always sets the dirty bit if the page is writable to begin with.

Why do you have to invalidate the TLB?  Does the GPU have a TLB so large
that it can keep things in the TLB for super-long periods of time?

We don't flush the TLB on clearing Accessed on x86 normally.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-04 17:33         ` Dave Hansen
@ 2017-05-05  3:17           ` Balbir Singh
  2017-05-05 14:51             ` Dave Hansen
  2017-05-05  7:49           ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 45+ messages in thread
From: Balbir Singh @ 2017-05-05  3:17 UTC (permalink / raw)
  To: Dave Hansen, Benjamin Herrenschmidt, Michal Hocko
  Cc: linux-mm, akpm, khandual, aneesh.kumar, paulmck, srikar, haren,
	jglisse, mgorman, arbab, vbabka, cl

On Thu, 2017-05-04 at 10:33 -0700, Dave Hansen wrote:
> On 05/04/2017 08:49 AM, Benjamin Herrenschmidt wrote:
> > On Thu, 2017-05-04 at 14:52 +0200, Michal Hocko wrote:
> > > But the direct reclaim would be effective only _after_ all other nodes
> > > are full.
> > > 
> > > I thought that kswapd reclaim is a problem because the HW doesn't
> > > support aging properly but as the direct reclaim works then what is the
> > > actual problem?
> > 
> > Ageing isn't completely broken. The ATS MMU supports
> > dirty/accessed just fine.
> > 
> > However the TLB invalidations are quite expensive with a GPU so too
> > much harvesting is detrimental, and the GPU tends to check pages out
> > using a special "read with intent to write" mode, which means it almost
> > always sets the dirty bit if the page is writable to begin with.
> 
> Why do you have to invalidate the TLB?  Does the GPU have a TLB so large
> that it can keep things in the TLB for super-long periods of time?
> 
> We don't flush the TLB on clearing Accessed on x86 normally.

Isn't that mostly because x86 relies on non-global pages to be flushed
on context switch?

Balbir Singh.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-04 17:33         ` Dave Hansen
  2017-05-05  3:17           ` Balbir Singh
@ 2017-05-05  7:49           ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 45+ messages in thread
From: Benjamin Herrenschmidt @ 2017-05-05  7:49 UTC (permalink / raw)
  To: Dave Hansen, Michal Hocko, Balbir Singh
  Cc: linux-mm, akpm, khandual, aneesh.kumar, paulmck, srikar, haren,
	jglisse, mgorman, arbab, vbabka, cl

On Thu, 2017-05-04 at 10:33 -0700, Dave Hansen wrote:
> > However the TLB invalidations are quite expensive with a GPU so too
> > much harvesting is detrimental, and the GPU tends to check pages out
> > using a special "read with intent to write" mode, which means it almost
> > always sets the dirty bit if the page is writable to begin with.
> 
> Why do you have to invalidate the TLB?  Does the GPU have a TLB so large
> that it can keep things in the TLB for super-long periods of time?
> 
> We don't flush the TLB on clearing Accessed on x86 normally.

We don't *have* to but there is no telling when it will get set again.

I always found the non-invalidation of the TLB for harvesting
"Accessed" on x86 chancy ... if a process pounds on a handful of pages
heavily, they never get seen as accessed, which is just plain weird.

But yes, we can do the same thing.

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-05  3:17           ` Balbir Singh
@ 2017-05-05 14:51             ` Dave Hansen
  0 siblings, 0 replies; 45+ messages in thread
From: Dave Hansen @ 2017-05-05 14:51 UTC (permalink / raw)
  To: Balbir Singh, Benjamin Herrenschmidt, Michal Hocko
  Cc: linux-mm, akpm, khandual, aneesh.kumar, paulmck, srikar, haren,
	jglisse, mgorman, arbab, vbabka, cl

On 05/04/2017 08:17 PM, Balbir Singh wrote:
>>> However the TLB invalidations are quite expensive with a GPU so too
>>> much harvesting is detrimental, and the GPU tends to check pages out
>>> using a special "read with intent to write" mode, which means it almost
>>> always sets the dirty bit if the page is writable to begin with.
>> Why do you have to invalidate the TLB?  Does the GPU have a TLB so large
>> that it can keep things in the TLB for super-long periods of time?
>>
>> We don't flush the TLB on clearing Accessed on x86 normally.
> Isn't that mostly because x86 relies on non-global pages to be flushed
> on context switch?

Well, that's not the case with Process Context Identifiers.  Somebody
will enable those some day.  It also isn't true for a long-lived process
camping on a CPU core.

I don't know about "mostly", but it's certainly a combination of stuff
having to be reloaded in the TLB and flushed at context switch today.
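
For reference, this is roughly how the "no TLB flush when clearing Accessed"
behaviour looks on the kernel side (a simplified sketch modelled on x86's
arch/x86/mm/pgtable.c; the exact code differs between kernel versions):

int ptep_clear_flush_young(struct vm_area_struct *vma,
                           unsigned long address, pte_t *ptep)
{
        /*
         * Clear the Accessed bit without flushing the TLB.  Worst case,
         * a page that is still hot in the TLB is not marked young again
         * until its entry is evicted or flushed at context switch.
         */
        return ptep_test_and_clear_young(vma, address, ptep);
}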


* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-04 15:49       ` Benjamin Herrenschmidt
  2017-05-04 17:33         ` Dave Hansen
@ 2017-05-05 14:52         ` Michal Hocko
  2017-05-05 15:57           ` Benjamin Herrenschmidt
  2017-05-09  7:51           ` Balbir Singh
  1 sibling, 2 replies; 45+ messages in thread
From: Michal Hocko @ 2017-05-05 14:52 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Balbir Singh, linux-mm, akpm, khandual, aneesh.kumar, paulmck,
	srikar, haren, jglisse, mgorman, arbab, vbabka, cl

On Thu 04-05-17 17:49:21, Benjamin Herrenschmidt wrote:
> On Thu, 2017-05-04 at 14:52 +0200, Michal Hocko wrote:
> > But the direct reclaim would be effective only _after_ all other nodes
> > are full.
> > 
> > I thought that kswapd reclaim is a problem because the HW doesn't
> > support aging properly but as the direct reclaim works then what is the
> > actual problem?
> 
> > Ageing isn't completely broken. The ATS MMU supports
> dirty/accessed just fine.
> 
> However the TLB invalidations are quite expensive with a GPU so too
> much harvesting is detrimental, and the GPU tends to check pages out
> > using a special "read with intent to write" mode, which means it almost
> > always sets the dirty bit if the page is writable to begin with.

This sounds pretty much like a HW-specific detail, which is not the
right criterion to design a general CDM around.

So let me repeat the fundamental question. Is the only difference from
cpuless nodes the fact that the node should be invisible to processes
unless they specify an explicit node mask? If yes then we are talking
about policy in the kernel and that sounds like a big no-no to me.
Moreover cpusets already support exclusive numa nodes AFAIR.

I am either missing something important here, and the discussion so far
hasn't helped to be honest, or this whole CDM effort tries to build a
generic interface around a _specific_ piece of HW. The matter is made worse
by the fact that the described usecases are so vague that it is hard to
build a good picture whether this is generic enough that a new/different
HW will still fit into this picture.
-- 
Michal Hocko
SUSE Labs


* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-05 14:52         ` Michal Hocko
@ 2017-05-05 15:57           ` Benjamin Herrenschmidt
  2017-05-05 17:48             ` Jerome Glisse
  2017-05-09 11:36             ` Michal Hocko
  2017-05-09  7:51           ` Balbir Singh
  1 sibling, 2 replies; 45+ messages in thread
From: Benjamin Herrenschmidt @ 2017-05-05 15:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Balbir Singh, linux-mm, akpm, khandual, aneesh.kumar, paulmck,
	srikar, haren, jglisse, mgorman, arbab, vbabka, cl

On Fri, 2017-05-05 at 16:52 +0200, Michal Hocko wrote:
> 
> This sounds pretty much like a HW-specific detail, which is not the
> right criterion to design a general CDM around.

Which is why I don't see what's the problem with simply making this
a hot-plugged NUMA node, since it's basically what it is with a
"different" kind of CPU, possibly covered with a CMA, which provides
both some isolation and the ability to do large physical allocations
for applications that choose to use the legacy programming interfaces and
manually control the memory.

Then, the "issues" with things like reclaim, autonuma can be handled
with policy tunables. Possibly node attributes.

It seems to me that such a model fits well in the picture where we are
heading not just with GPUs, but with OpenCAPI based memory, CCIX or
other similar technologies that can provide memory possibly with co-
located acceleration devices.

It also mostly already just works.
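
As a rough illustration of that model (a sketch only: the function name and
error handling are hypothetical, add_memory() is the real hotplug entry
point, and the new blocks still need to be onlined, typically as movable):

/* Hypothetical driver hook exposing coherent device memory as a
 * hot-plugged, CPU-less NUMA node. */
static int cdm_online_device_memory(int nid, u64 start, u64 size)
{
        int ret;

        /* Register the range with the memory hotplug core; this
         * creates/extends node `nid` and its memory blocks. */
        ret = add_memory(nid, start, size);
        if (ret)
                return ret;

        /* Once onlined (e.g. into ZONE_MOVABLE), the node behaves like
         * any other CPU-less memory node: reachable via nodemasks,
         * __GFP_THISNODE, page migration, and so on. */
        return 0;
}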

> So let me repeat the fundamental question. Is the only difference from
> cpuless nodes the fact that the node should be invisible to processes
> unless they specify an explicit node mask?

It would be *preferable* that it is.

It's not necessarily an absolute requirement as long as what lands
there can be kicked out. However the system would potentially be
performing poorly if too much unrelated stuff lands on the GPU memory
as it has a much higher latency.

Due to the nature of GPUs (and possibly other such accelerators but not
necessarily all of them), that memory is also more likely to fail. GPUs
crash often. However that isn't necessarily true of OpenCAPI devices or
CCIX.

This is the kind of attributes of the memory (quality ?) that can be
provided by the driver that is putting it online. We can then
orthogonally decide how we chose (or not) to take those into account,
either in the default mm algorithms or from explicit policy mechanisms
set from userspace, but the latter is often awkward and never done
right.

>  If yes then we are talking
> about policy in the kernel and that sounds like a big no-no to me.

It makes sense to expose a concept of "characteristics" of a given
memory node that affect the various policies the user can set.

It makes sense to have "default" policy models selected.

Policies aren't always decided in the kernel indeed (though they are
more often than not, face it, most of the time, leaving it to userspace
results in things simply not working). However the mechanisms by which
the policy is applied are in the kernel.

> Moreover cpusets already support exclusive numa nodes AFAIR.

Which implies that the user would have to do explicit cpuset
manipulations for the system to work right? Most users wouldn't, and the
result is that most users would have badly working systems. That's almost
always what happens when we choose to bounce *all* policy decisions to
the user without the kernel attempting to have some kind of semi-sane
default.

> I am either missing something important here, and the discussion so far
> hasn't helped to be honest, or this whole CDM effort tries to build a
> generic interface around a _specific_ piece of HW.

No. You guys have just been sticking your heads in the sand for months
for reasons I can't quite understand completely :-)

There is a definite direction out there for devices to participate in
cache coherency and to operate within user process MMU contexts. This
is what the GPUs on P9 will be doing via nvlink, but this will also be
possible with technologies like OpenCAPI, I believe CCIX, etc...

This is by no means a special case.

> The matter is made worse
> by the fact that the described usecases are so vague that it is hard to
> build a good picture whether this is generic enough that a new/different
> HW will still fit into this picture.

The GPU use case is rather trivial.

The end goal is to simply have accelerators transparently operate in
userspace context, along with the ability to migrate pages to the memory
that is the most efficient for a given operation.

Thus for example, mmap a large file (page cache) and have the program
pass a pointer to that mmap to a GPU program that starts churning on
it.

In the specific GPU case, we have HW on the link telling us the pages
are pounded on remotely, allowing us to trigger migration toward GPU
memory (but the other way works too).
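
(To make the mechanism concrete: once the device memory is a node, even the
existing move_pages(2) interface can express this placement from userspace.
A minimal sketch, where cdm_node is a hypothetical node id for the device;
the driver-directed path described above would use the in-kernel migration
machinery instead.)

#include <numaif.h>

/* Move the page backing `addr` to the (hypothetical) CDM node. */
static long move_to_cdm(void *addr, int cdm_node)
{
        void *pages[1]  = { addr };
        int   nodes[1]  = { cdm_node };
        int   status[1];

        /* pid 0 == current process; status[0] reports the resulting
         * node or a negative errno for that page. */
        return move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE);
}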

The problem with the HMM based approach is that it is based on
ZONE_DEVICE. This means "special" struct pages that aren't in LRU and
implies, at least that's my understanding, piles of special cases all
over the place to deal with them, along with various APIs etc... that
don't work with such pages.

So it makes it difficult to be able to pick up anything mapped into a
process address space, whether it is page cache pages, shared memory,
etc... and migrate it to GPU pages.

At least, that's my understanding and Jerome somewhat confirmed it,
we'd end up fighting an uphill battle dealing with all those special
cases. HMM is well suited for non-coherent systems with a distinct MMU
translation on the remote device.

This is why we think a NUMA based approach is a lot simpler. We start
by having the GPU memory be "normal" memory, and then we look at what
needs to be done to improve the default system behaviour and policies
to take into account its slightly different characteristics.

Ben.


* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-05 15:57           ` Benjamin Herrenschmidt
@ 2017-05-05 17:48             ` Jerome Glisse
  2017-05-05 17:59               ` Benjamin Herrenschmidt
  2017-05-09 11:36             ` Michal Hocko
  1 sibling, 1 reply; 45+ messages in thread
From: Jerome Glisse @ 2017-05-05 17:48 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Michal Hocko, Balbir Singh, linux-mm, akpm, khandual,
	aneesh.kumar, paulmck, srikar, haren, mgorman, arbab, vbabka, cl

On Fri, May 05, 2017 at 05:57:02PM +0200, Benjamin Herrenschmidt wrote:
> On Fri, 2017-05-05 at 16:52 +0200, Michal Hocko wrote:
> > 
> > This sounds pretty much like a HW-specific detail, which is not the
> > right criterion to design a general CDM around.
> 
> Which is why I don't see what's the problem with simply making this
> a hot-plugged NUMA node, since it's basically what it is with a
> "different" kind of CPU, possibly covered with a CMA, which provides
> both some isolation and the ability to do large physical allocations
> for applications that choose to use the legacy programming interfaces and
> manually control the memory.
> 
> Then, the "issues" with things like reclaim, autonuma can be handled
> with policy tunables. Possibly node attributes.
> 
> It seems to me that such a model fits well in the picture where we are
> heading not just with GPUs, but with OpenCAPI based memory, CCIX or
> other similar technologies that can provide memory possibly with co-
> located acceleration devices.
> 
> It also mostly already just works.
> 
> > So let me repeat the fundamental question. Is the only difference from
> > cpuless nodes the fact that the node should be invisible to processes
> > unless they specify an explicit node mask?
> 
> It would be *preferable* that it is.
> 
> It's not necessarily an absolute requirement as long as what lands
> there can be kicked out. However the system would potentially be
> performing poorly if too much unrelated stuff lands on the GPU memory
> as it has a much higher latency.
> 
> Due to the nature of GPUs (and possibly other such accelerators but not
> necessarily all of them), that memory is also more likely to fail. GPUs
> crash often. However that isn't necessarily true of OpenCAPI devices or
> CCIX.
> 
> This is the kind of attributes of the memory (quality ?) that can be
> provided by the driver that is putting it online. We can then
> orthogonally decide how we chose (or not) to take those into account,
> either in the default mm algorithms or from explicit policy mechanisms
> set from userspace, but the latter is often awkward and never done
> right.
> 
> >  If yes then we are talking
> > about policy in the kernel and that sounds like a big no-no to me.
> 
> It makes sense to expose a concept of "characteristics" of a given
> memory node that affect the various policies the user can set.
> 
> It makes sense to have "default" policy models selected.
> 
> Policies aren't always decided in the kernel indeed (though they are
> more often than not, face it, most of the time, leaving it to userspace
> results in things simply not working). However the mechanisms by which
> the policy is applied are in the kernel.
> 
> > Moreover cpusets already support exclusive numa nodes AFAIR.
> 
> Which implies that the user would have to do explicit cpuset
> manipulations for the system to work right? Most users wouldn't, and the
> result is that most users would have badly working systems. That's almost
> always what happens when we choose to bounce *all* policy decisions to
> the user without the kernel attempting to have some kind of semi-sane
> default.
> 
> > I am either missing something important here, and the discussion so far
> > hasn't helped to be honest, or this whole CDM effort tries to build a
> > generic interface around a _specific_ piece of HW. 
> 
> No. You guys have just been sticking your heads in the sand for months
> for reasons I can't quite understand completely :-)
> 
> There is a definite direction out there for devices to participate in
> cache coherency and to operate within user process MMU contexts. This
> is what the GPUs on P9 will be doing via nvlink, but this will also be
> possible with technologies like OpenCAPI, I believe CCIX, etc...
> 
> This is by no means a special case.
> 
> > The matter is made worse
> > by the fact that the described usecases are so vague that it is hard to
> > build a good picture whether this is generic enough that a new/different
> > HW will still fit into this picture.
> 
> The GPU use case is rather trivial.
> 
> The end goal is to simply have accelerators transparently operate in
> userspace context, along with the ability to migrate pages to the memory
> that is the most efficient for a given operation.
> 
> Thus for example, mmap a large file (page cache) and have the program
> pass a pointer to that mmap to a GPU program that starts churning on
> it.
> 
> In the specific GPU case, we have HW on the link telling us the pages
> are pounded on remotely, allowing us to trigger migration toward GPU
> memory (but the other way works too).
> 
> The problem with the HMM based approach is that it is based on
> ZONE_DEVICE. This means "special" struct pages that aren't in LRU and
> implies, at least that's my understanding, piles of special cases all
> over the place to deal with them, along with various APIs etc... that
> don't work with such pages.
> 
> So it makes it difficult to be able to pick up anything mapped into a
> process address space, whether it is page cache pages, shared memory,
> etc... and migrate it to GPU pages.
> 
> At least, that's my understanding and Jerome somewhat confirmed it,
> we'd end up fighting an uphill battle dealing with all those special
> cases. HMM is well suited for non-coherent systems with a distinct MMU
> translation on the remote device.

Well there are _no_ migration issues with HMM (anonymous or file-backed
pages). What you don't get is things like LRU or NUMA balancing, but I
believe you do not want either of those anyway.

Sure, a careful audit of all code paths is needed to make sure that
such pages do not end up in places they shouldn't (like being put back
on the LRU).

Now you are also excluded from a lot of things; for instance, readahead
would never use a ZONE_DEVICE page to read ahead a file. But many of
those things are easy to add back if they are important to you.
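
(The kind of guard such an audit adds is usually small. Roughly, using the
existing is_zone_device_page() helper, the putback/LRU paths would grow
checks along the lines of:

        /* Never put device pages back on the LRU. */
        if (is_zone_device_page(page))
                return;

though exactly where those checks are needed is what the audit is for.)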

Jerome


* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-05 17:48             ` Jerome Glisse
@ 2017-05-05 17:59               ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 45+ messages in thread
From: Benjamin Herrenschmidt @ 2017-05-05 17:59 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Michal Hocko, Balbir Singh, linux-mm, akpm, khandual,
	aneesh.kumar, paulmck, srikar, haren, mgorman, arbab, vbabka, cl

On Fri, 2017-05-05 at 13:48 -0400, Jerome Glisse wrote:
> Well there are _no_ migration issues with HMM (anonymous or file-backed
> pages). What you don't get is things like LRU or NUMA balancing, but I
> believe you do not want either of those anyway.

We don't want them in the specific case of GPUs today, for various
reasons related more to how they are used and to specific implementation
shortcomings, so it's a matter of policy.

However, I don't think they are necessarily to be excluded in the grand
scheme of things of coherent accelerators with local memory.

So my gut feeling (but we can agree to disagree; in the end, what we
need is *a* workable solution to enable these things, whichever it is
that wins) is that we are better off simply treating them as normal
NUMA nodes, and adding more policy tunables where needed, if possible
with some of these being set to reasonable defaults by the driver
itself to account for implementation shortcomings.

Now, if Michal and Mel strongly prefer the approach based on HMM, we
can make it work as well I believe. It feels less "natural" and more
convoluted. That's it.

This is by no means a criticism of HMM btw :-) HMM is still a critical
part of getting the non-coherent devices working properly, and whichever
representation we use for the memory on the coherent ones, we will
also use parts of the HMM infrastructure for driver-directed migration
anyway.

Cheers,
Ben.


* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-05 14:52         ` Michal Hocko
  2017-05-05 15:57           ` Benjamin Herrenschmidt
@ 2017-05-09  7:51           ` Balbir Singh
  1 sibling, 0 replies; 45+ messages in thread
From: Balbir Singh @ 2017-05-09  7:51 UTC (permalink / raw)
  To: Michal Hocko, Benjamin Herrenschmidt
  Cc: linux-mm, akpm, khandual, aneesh.kumar, paulmck, srikar, haren,
	jglisse, mgorman, arbab, vbabka, cl

On Fri, 2017-05-05 at 16:52 +0200, Michal Hocko wrote:
> On Thu 04-05-17 17:49:21, Benjamin Herrenschmidt wrote:
> > On Thu, 2017-05-04 at 14:52 +0200, Michal Hocko wrote:
> > > But the direct reclaim would be effective only _after_ all other nodes
> > > are full.
> > > 
> > > I thought that kswapd reclaim is a problem because the HW doesn't
> > > support aging properly but as the direct reclaim works then what is the
> > > actual problem?
> > 
> > Ageing isn't completely broken. The ATS MMU supports
> > dirty/accessed just fine.
> > 
> > However the TLB invalidations are quite expensive with a GPU so too
> > much harvesting is detrimental, and the GPU tends to check pages out
> > using a special "read with intent to write" mode, which means it almost
> > always sets the dirty bit if the page is writable to begin with.
> 
> This sounds pretty much like a HW-specific detail, which is not the
> right criterion to design a general CDM around.

I think Ben answered several of these questions. We felt NUMA was the best
representation of such memory, but it has limitations in that we'd like
to isolate these nodes from some default algorithms that run on all nodes
marked N_MEMORY. Do you see that as a concern? Would you like to see a
generic policy, like Ben said, to handle node attributes like reclaim,
autonuma, etc.?

> 
> So let me repeat the fundamental question. Is the only difference from
> cpuless nodes the fact that the node should be invisible to processes
> unless they specify an explicit node mask? If yes then we are talking
> about policy in the kernel and that sounds like a big no-no to me.
> Moreover cpusets already support exclusive numa nodes AFAIR.

Why do you see this as policy? It's a mechanism for isolating nodes;
the nodes themselves are then used via mempolicy.
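
For example, an application (or a library on its behalf) could target the
CDM node with something like the sketch below, where cdm_node is a
hypothetical node id obtained from the device driver (error handling
omitted, and cdm_node assumed to be smaller than BITS_PER_LONG):

#include <numaif.h>
#include <sys/mman.h>

void *alloc_on_cdm(size_t len, int cdm_node)
{
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        unsigned long nodemask = 1UL << cdm_node;

        /* MPOL_BIND with an explicit nodemask: faults on this range are
         * satisfied only from the coherent device node. */
        mbind(buf, len, MPOL_BIND, &nodemask, cdm_node + 2, 0);
        return buf;
}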

> 
> I am either missing something important here, and the discussion so far
> hasn't helped to be honest, or this whole CDM effort tries to build a
> generic interface around a _specific_ piece of HW. The matter is made worse
> by the fact that the described usecases are so vague that it is hard to
> build a good picture whether this is generic enough that a new/different
> HW will still fit into this picture.

The use case is similar to HMM, except that we've got coherent memory.
We treat it as important and want to keep normal allocations off it,
unless the allocation explicitly specifies the node. Cpusets provide an
isolation mechanism, but we see autonuma, for example, moving pages away
when there is an access from the system side. With reclaim, it would be
better to use the fallback list first, then swap. Again, the use case is:

I'm trying to do a FAQ version here:

Isolate memory - why?
 - CDM memory is not meant for normal usage; applications request it
   explicitly and offload their compute to the device where the memory is
   (the offload is via a user space API like CUDA/OpenCL/...).
How do we isolate - NUMA or HMM?
 - Since the memory is coherent, NUMA provides the mechanism to isolate to
   a large extent via mempolicy. With NUMA we also get autonuma/kswapd/etc.
   running, something we would like to avoid. NUMA gives the application
   a transparent view of memory, in the sense that all mm features work,
   like direct page cache allocation in coherent device memory, limiting
   memory via cgroups if required, etc. With cpusets, it's
   possible for us to isolate allocation. One challenge is that the
   admin on the system may use them differently, and applications need to
   be aware of running in the right cpuset to allocate memory from the
   CDM node. Putting all applications in the cpuset with the CDM node is
   not the right thing to do, which means the application needs to move itself
   to the right cpuset before requesting CDM memory. It's not impossible
   to use cpusets, just hard to configure correctly.
 - With HMM, we would need an HMM variant, HMM-CDM, so that we are not marking
   the pages as unavailable; the page cache cannot go directly to coherent
   memory. An audit of mm paths is required. Most of the other things should
   work. User access to HMM-CDM memory behind ZONE_DEVICE is via a device
   driver.
Why do we need migration?
 - Depending on where the memory is being accessed from, we would like to
   migrate pages between system and coherent device memory. HMM provides
   DMA offload capability that is useful in both cases.
What is the larger picture - end to end?
 - Applications can allocate memory on the device or in system memory,
   offload the compute via user space API. Migration can be used for performance
   if required since it helps to keep the memory local to the compute.

Ben/Jerome/John/others did I get the FAQ right?

From my side, I want to ensure that the decision between HMM-CDM and NUMA-CDM
is based on our design and understanding, as opposed to the use case being
unclear or insufficient. I'd be happy if we said: we understand the use case
and believe that HMM-CDM is better from the mm's perspective because..., as
opposed to isolating NUMA attributes because..., or vice-versa.

Thanks for the review,
Balbir Singh.


* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-05 15:57           ` Benjamin Herrenschmidt
  2017-05-05 17:48             ` Jerome Glisse
@ 2017-05-09 11:36             ` Michal Hocko
  2017-05-09 13:43               ` Benjamin Herrenschmidt
  2017-05-10 23:04               ` Balbir Singh
  1 sibling, 2 replies; 45+ messages in thread
From: Michal Hocko @ 2017-05-09 11:36 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Balbir Singh, linux-mm, akpm, khandual, aneesh.kumar, paulmck,
	srikar, haren, jglisse, mgorman, arbab, vbabka, cl

On Fri 05-05-17 17:57:02, Benjamin Herrenschmidt wrote:
> On Fri, 2017-05-05 at 16:52 +0200, Michal Hocko wrote:
> > 
> > This sounds pretty much like a HW-specific detail, which is not the
> > right criterion to design a general CDM around.
> 
> Which is why I don't see what's the problem with simply making this
> a hot-plugged NUMA node, since it's basically what it is with a
> "different" kind of CPU, possibly covered with a CMA, which provides
> both some isolation and the ability to do large physical allocations
> > for applications that choose to use the legacy programming interfaces and
> manually control the memory.
> 
> Then, the "issues" with things like reclaim, autonuma can be handled
> with policy tunables. Possibly node attributes.
> 
> It seems to me that such a model fits well in the picture where we are
> heading not just with GPUs, but with OpenCAPI based memory, CCIX or
> other similar technologies that can provide memory possibly with co-
> located acceleration devices.
> 
> It also mostly already just works.

But this is not what the CDM as proposed here is about AFAIU. It is
argued this is not a _normal_ cpuless node and it needs tweaks here and
there. And that is my main objection. I do not mind if the memory
is presented as a hot-pluggable cpuless memory node. I just do not want it
to be any more special than cpuless nodes are already.

> > So let me repeat the fundamental question. Is the only difference from
> > cpuless nodes the fact that the node should be invisible to processes
> > unless they specify an explicit node mask?
> 
> It would be *preferable* that it is.
> 
> It's not necessarily an absolute requirement as long as what lands
> there can be kicked out. However the system would potentially be
> performing poorly if too much unrelated stuff lands on the GPU memory
> as it has a much higher latency.

This is a general concern for many cpuless NUMA node systems. You have
to pay for the suboptimal performance when accessing that memory. And
you have means to cope with that.

> Due to the nature of GPUs (and possibly other such accelerators but not
> necessarily all of them), that memory is also more likely to fail. GPUs
> crash often. However that isn't necessarily true of OpenCAPI devices or
> CCIX.
> 
> This is the kind of attributes of the memory (quality ?) that can be
> provided by the driver that is putting it online. We can then
> orthogonally decide how we chose (or not) to take those into account,
> either in the default mm algorithms or from explicit policy mechanisms
> set from userspace, but the latter is often awkward and never done
> right.

The first adds maintenance costs all over the place, and just looking at
what became of memory policies and cpusets makes me cry. I definitely do
not want more special casing on top (and just to make it clear, a special
N_MEMORY_$FOO falls into the same category).

[...]
> > Moreover cpusets already support exclusive numa nodes AFAIR.
> 
> > Which implies that the user would have to do explicit cpuset
> > manipulations for the system to work right? Most users wouldn't, and the
> > result is that most users would have badly working systems. That's almost
> > always what happens when we choose to bounce *all* policy decisions to
> > the user without the kernel attempting to have some kind of semi-sane
> > default.

I would argue that this is the case for cpuless NUMA nodes already.
Users had better know what they are doing when using such specialized
HW. And that includes a specialized configuration.

[...]
-- 
Michal Hocko
SUSE Labs


* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-09 11:36             ` Michal Hocko
@ 2017-05-09 13:43               ` Benjamin Herrenschmidt
  2017-05-15 12:55                 ` Michal Hocko
  2017-05-10 23:04               ` Balbir Singh
  1 sibling, 1 reply; 45+ messages in thread
From: Benjamin Herrenschmidt @ 2017-05-09 13:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Balbir Singh, linux-mm, akpm, khandual, aneesh.kumar, paulmck,
	srikar, haren, jglisse, mgorman, arbab, vbabka, cl

On Tue, 2017-05-09 at 13:36 +0200, Michal Hocko wrote:
> But this is not what the CDM as proposed here is about AFAIU. It is
> argued this is not a _normal_ cpuless node and it needs tweaks here and
> there. And that is my main objection. I do not mind if the memory
> is presented as a hot-pluggable cpuless memory node. I just do not want it
> to be any more special than cpuless nodes are already.

But if you look at where things are going with the new kinds of memory
technologies appearing etc... I think the concept of "normal" for
memory is rather fragile.

So I think it makes sense to grow the idea that nodes have "attributes"
that affect the memory policies.

That said, one thing we do need to clarify, especially in the context
of our short-term GPU usage model, is which of those attributes are
inherent to the way the HW works and which are more related to the actual
userspace usage model, the latter possibly being better dealt with using
existing policy mechanisms.

Also, maybe understand how many of these things are likely to be shared
with other types of devices such as OpenCAPI or CCIX.

> > > So let me repeat the fundamental question. Is the only difference from
> > > cpuless nodes the fact that the node should be invisible to processes
> > > unless they specify an explicit node mask?
> > 
> > It would be *preferable* that it is.
> > 
> > It's not necessarily an absolute requirement as long as what lands
> > there can be kicked out. However the system would potentially be
> > performing poorly if too much unrelated stuff lands on the GPU memory
> > as it has a much higher latency.
> 
> This is a general concern for many cpuless NUMA node systems. You have
> to pay for the suboptimal performance when accessing that memory. And
> you have means to cope with that.

Yup. However in this case, GPU memory is really bad, so that's one
reason why we want to push the idea of effectively not allowing non-
explicit allocations from it.

Thus, memory would be allocated from that node only if either the
application (or driver) uses explicit APIs to grab some of it, or if the
driver migrates pages to it. (Or possibly, if we can make that work,
the memory is provisioned as the result of a page fault by the GPU
itself).
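
(On the driver side, "explicit APIs to grab some of it" can be as simple as a
node-pinned allocation; a sketch only, with cdm_nid standing in for the
device's node id:)

static struct page *cdm_alloc_page(int cdm_nid)
{
        /* __GFP_THISNODE pins the allocation to the device node and
         * disables fallback to other nodes, so an explicit request
         * either lands on the CDM node or fails. */
        return alloc_pages_node(cdm_nid,
                                GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);
}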

> > Due to the nature of GPUs (and possibly other such accelerators but not
> > necessarily all of them), that memory is also more likely to fail. GPUs
> > crash often. However that isn't necessarily true of OpenCAPI devices or
> > CCIX.
> > 
> > This is the kind of attributes of the memory (quality ?) that can be
> > provided by the driver that is putting it online. We can then
> > orthogonally decide how we chose (or not) to take those into account,
> > either in the default mm algorithms or from explicit policy mechanisms
> > set from userspace, but the latter is often awkward and never done
> > right.
> 
> The first adds maintenance costs all over the place, and just looking at
> what became of memory policies and cpusets makes me cry. I definitely do
> not want more special casing on top (and just to make it clear, a special
> N_MEMORY_$FOO falls into the same category).
> 
> [...]
> > > Moreover cpusets already support exclusive numa nodes AFAIR.
> > 
> > Which implies that the user would have to do explicit cpuset
> > manipulations for the system to work right? Most users wouldn't, and the
> > result is that most users would have badly working systems. That's almost
> > always what happens when we choose to bounce *all* policy decisions to
> > the user without the kernel attempting to have some kind of semi-sane
> > default.
> 
> I would argue that this is the case for cpuless NUMA nodes already.
> Users had better know what they are doing when using such specialized
> HW. And that includes a specialized configuration.

So what you are saying is that users who want to use GPUs or FPGAs or
accelerated devices will need to have intimate knowledge of Linux CPU
and memory policy management at a low level.

That's where I disagree.

People want to throw these things at all sorts of problems out there,
hide them behind libraries, and have things "just work".

The user will just use applications normally. Those will use
more/less standard libraries to perform various computations; these
libraries will know how to take advantage of accelerators; nothing in
that chain knows about memory policies & placement, cpusets etc... and
nothing *should*.

Of course, the special case of the HPC user trying to milk the last
cycle out of the system is probably going to do what you suggest. But
most users won't.

Cheers,
Ben.


* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-09 11:36             ` Michal Hocko
  2017-05-09 13:43               ` Benjamin Herrenschmidt
@ 2017-05-10 23:04               ` Balbir Singh
  1 sibling, 0 replies; 45+ messages in thread
From: Balbir Singh @ 2017-05-10 23:04 UTC (permalink / raw)
  To: Michal Hocko, Benjamin Herrenschmidt
  Cc: linux-mm, akpm, khandual, aneesh.kumar, paulmck, srikar, haren,
	jglisse, mgorman, arbab, vbabka, cl

On Tue, 2017-05-09 at 13:36 +0200, Michal Hocko wrote:
> On Fri 05-05-17 17:57:02, Benjamin Herrenschmidt wrote:
> > On Fri, 2017-05-05 at 16:52 +0200, Michal Hocko wrote:
> > > 
> > This sounds pretty much like a HW-specific detail, which is not the
> > right criterion to design a general CDM around.
> > 
> > Which is why I don't see what's the problem with simply making this
> > a hot-plugged NUMA node, since it's basically what it is with a
> > "different" kind of CPU, possibly covered with a CMA, which provides
> > both some isolation and the ability to do large physical allocations
> > for applications that choose to use the legacy programming interfaces and
> > manually control the memory.
> > 
> > Then, the "issues" with things like reclaim, autonuma can be handled
> > with policy tunables. Possibly node attributes.
> > 
> > It seems to me that such a model fits well in the picture where we are
> > heading not just with GPUs, but with OpenCAPI based memory, CCIX or
> > other similar technologies that can provide memory possibly with co-
> > located acceleration devices.
> > 
> > It also mostly already just works.
> 
> But this is not what the CDM as proposed here is about AFAIU.

The main reason for the patches was to address "issues" with things like
reclaim, autonuma isolation, etc., under the constraint of not being
willing to make allocator changes.

Do we see node attributes as something we need generically? Is there
consensus that we need this, or do we see all new algorithms working
across all of N_MEMORY all the time?
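
(To illustrate what "not working across all of N_MEMORY" would mean: the kind
of check this series has in mind looks roughly like the sketch below, where
N_COHERENT_MEMORY is the node state proposed by the series and the loop body
is purely illustrative.)

int nid;

for_each_node_state(nid, N_MEMORY) {
        /* Leave coherent device nodes to explicit users only. */
        if (node_state(nid, N_COHERENT_MEMORY))
                continue;

        /* ... default work: kswapd scanning, autonuma, etc. ... */
}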

> It is
> argued this is not a _normal_ cpuless node and it needs tweaks here and
> there. And that is my main objection. I do not mind if the memory
> is presented as a hot-pluggable cpuless memory node. I just do not want it
> to be any more special than cpuless nodes are already.

The downsides being code complexity/run-time overhead? Like Ben stated,
there are several devices that will also have coherent memory; do you see
all of them abstracted as HMM-CDM?

> 
> > > So let me repeat the fundamental question. Is the only difference from
> > > cpuless nodes the fact that the node should be invisible to processes
> > > unless they specify an explicit node mask?
> > 
> > It would be *preferable* that it is.
> > 
> > It's not necessarily an absolute requirement as long as what lands
> > there can be kicked out. However the system would potentially be
> > performing poorly if too much unrelated stuff lands on the GPU memory
> > as it has a much higher latency.
> 
> This is a general concern for many cpuless NUMA node systems. You have
> to pay for the suboptimal performance when accessing that memory. And
> you have means to cope with that.
> 

How do we evolve the NUMA subsystem to deal with additional requirements?
Do we not enhance NUMA and move to ZONE_DEVICE?

> > Due to the nature of GPUs (and possibly other such accelerators but not
> > necessarily all of them), that memory is also more likely to fail. GPUs
> > crash often. However that isn't necessarily true of OpenCAPI devices or
> > CCIX.
> > 
> > This is the kind of attributes of the memory (quality ?) that can be
> > provided by the driver that is putting it online. We can then
> > orthogonally decide how we chose (or not) to take those into account,
> > either in the default mm algorithms or from explicit policy mechanisms
> > set from userspace, but the latter is often awkward and never done
> > right.
> 
> The first adds maintenance costs all over the place, and just looking at
> what became of memory policies and cpusets makes me cry. I definitely do
> not want more special casing on top (and just to make it clear, a special
> N_MEMORY_$FOO falls into the same category).
> 

And I thought it was a cleaner design; yes, we have been special-casing some
of the N_COHERENT_MEMORY bits in mm/mempolicy.c.

> [...]
> > > Moreover cpusets already support exclusive numa nodes AFAIR.
> > 
> > Which implies that the user would have to do explicit cpuset
> > manipulations for the system to work right? Most users wouldn't, and the
> > result is that most users would have badly working systems. That's almost
> > always what happens when we choose to bounce *all* policy decisions to
> > the user without the kernel attempting to have some kind of semi-sane
> > default.
> 
> I would argue that this is the case for cpuless NUMA nodes already.
> Users had better know what they are doing when using such specialized
> HW. And that includes a specialized configuration.
>

Like Ben said, assuming intimate knowledge of specialized hardware
is unfair. It sounds like the decision then is that we either do
HMM-CDM or live with cpuless nodes without enhancements?

Balbir


* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-09 13:43               ` Benjamin Herrenschmidt
@ 2017-05-15 12:55                 ` Michal Hocko
  2017-05-15 15:53                   ` Christoph Lameter
  0 siblings, 1 reply; 45+ messages in thread
From: Michal Hocko @ 2017-05-15 12:55 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Balbir Singh, linux-mm, akpm, khandual, aneesh.kumar, paulmck,
	srikar, haren, jglisse, mgorman, arbab, vbabka, cl

[Oops, for some reason this got stuck in my drafts folder and didn't get
sent out]

On Tue 09-05-17 15:43:12, Benjamin Herrenschmidt wrote:
> On Tue, 2017-05-09 at 13:36 +0200, Michal Hocko wrote:
> > But this is not what the CDM as proposed here is about AFAIU. It is
> > argued this is not a _normal_ cpuless node and it needs tweaks here and
> > there. And that is my main objection. I do not mind if the memory
> > is presented as a hot-pluggable cpuless memory node. I just do not want it
> > to be any more special than cpuless nodes are already.
> 
> But if you look at where things are going with the new kind of memory
> technologies appearing etc... I think the concept of "normal" for
> memory is rather fragile.
> 
> So I think it makes sense to grow the idea that nodes have "attributes"
> that affect the memory policies.

I am not really sure our current API fits into such a world and a change
would require much deeper consideration.

[...]
> > This is a general concern for many cpuless NUMA node systems. You have
> > to pay for the suboptimal performance when accessing that memory. And
> > you have means to cope with that.
> 
> Yup. However in this case, GPU memory is really bad, so that's one
> reason why we want to push the idea of effectively not allowing non-
> explicit allocations from it.

I would argue that a cpuless node with a NUMA distance larger than a
certain threshold falls pretty much into the same category.
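
(Roughly the kind of distance check the allocator already applies when
deciding whether to reclaim from a remote node; a sketch, not the exact
code:

static bool too_remote(int local_nid, int remote_nid)
{
        /* Nodes beyond RECLAIM_DISTANCE are considered too remote to
         * reclaim from on behalf of a local allocation. */
        return node_distance(local_nid, remote_nid) > RECLAIM_DISTANCE;
}

A very distant CDM node would therefore already get some of this treatment
for free.)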

> Thus, memory would be allocated from that node only if either the
> application (or driver) use explicit APIs to grab some of it, or if the
> driver migrates pages to it. (Or possibly, if we can make that work,
> the memory is provisioned as the result of a page fault by the GPU
> itself).

That sounds like HMM to me.
 
[...]
> > I would argue that this is the case for cpuless numa nodes already.
> > Users should better know what they are doing when using such a
> > specialized HW. And that includes a specialized configuration.
> 
> So what you are saying is that users who want to use GPUs or FPGAs or
> accelerated devices will need to have intimate knowledge of Linux CPU
> and memory policy management at a low level.

No, I am not saying that. I am saying that if you want to use GPUs/FPGAs
and what-not effectively, you will most likely have to take additional
steps anyway.

> That's where I disagree.
> 
> People want to throw these things at all sort of problems out there,
> hide them behind libraries, and have things "just work".
> 
> The user will just use applications normally. Those will be use
> more/less standard libraries to perform various computations, these
> libraries will know how to take advantage of accelerators, nothing in
> that chains knows about memory policies & placement, cpusets etc... and
> nothing *should*.

With the proposed solution, they would need to set up mempolicy/cpusets,
so I must be missing something here...

> Of course, the special case of the HPC user trying to milk the last
> cycle out of the system is probably going to do what you suggest. But
> most users won't.

-- 
Michal Hocko
SUSE Labs


* Re: [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion)
  2017-05-15 12:55                 ` Michal Hocko
@ 2017-05-15 15:53                   ` Christoph Lameter
  0 siblings, 0 replies; 45+ messages in thread
From: Christoph Lameter @ 2017-05-15 15:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Benjamin Herrenschmidt, Balbir Singh, linux-mm, akpm, khandual,
	aneesh.kumar, paulmck, srikar, haren, jglisse, mgorman, arbab,
	vbabka

On Mon, 15 May 2017, Michal Hocko wrote:

> With the proposed solution, they would need to set up mempolicy/cpuset
> so I must be missing something here...
>
> > Of course, the special case of the HPC user trying to milk the last
> > cycle out of the system is probably going to do what you suggest. But
> > most users won't.

It's going to be the HPC users who will be trying to take advantage of it
anyway. I doubt that enterprise-class users will even be buying the
accelerators. If it goes that way (after a couple of years) we will hopefully
have matured things a bit and have experience in how to configure the special
NUMA nodes in the system to behave properly with an accelerator.

I think the simplest way is to just go ahead and create the NUMA node
approach and see how much can be covered with the existing NUMA features.
Then work from there to simplify and enhance.



Thread overview: 45+ messages
2017-04-19  7:52 [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion) Balbir Singh
2017-04-19  7:52 ` [RFC 1/4] mm: create N_COHERENT_MEMORY Balbir Singh
2017-04-27 18:42   ` Reza Arbab
2017-04-28  5:07     ` Balbir Singh
2017-04-19  7:52 ` [RFC 2/4] arch/powerpc/mm: add support for coherent memory Balbir Singh
2017-04-19  7:52 ` [RFC 3/4] mm: Integrate N_COHERENT_MEMORY with mempolicy and the rest of the system Balbir Singh
2017-04-19  7:52 ` [RFC 4/4] mm: Add documentation for coherent memory Balbir Singh
2017-04-19 19:02 ` [RFC 0/4] RFC - Coherent Device Memory (Not for inclusion) Christoph Lameter
2017-04-20  1:25   ` Balbir Singh
2017-04-20 15:29     ` Christoph Lameter
2017-04-20 21:26       ` Benjamin Herrenschmidt
2017-04-21 16:13         ` Christoph Lameter
2017-04-21 21:15           ` Benjamin Herrenschmidt
2017-04-24 13:57             ` Christoph Lameter
2017-04-24  0:20       ` Balbir Singh
2017-04-24 14:00         ` Christoph Lameter
2017-04-25  0:52           ` Balbir Singh
2017-05-01 20:41 ` John Hubbard
2017-05-01 21:04   ` Reza Arbab
2017-05-01 21:56     ` John Hubbard
2017-05-01 23:51       ` Reza Arbab
2017-05-01 23:58         ` John Hubbard
2017-05-02  0:04           ` Reza Arbab
2017-05-02  1:29   ` Balbir Singh
2017-05-02  5:47     ` John Hubbard
2017-05-02  7:23       ` Balbir Singh
2017-05-02 17:50         ` John Hubbard
2017-05-02 14:36 ` Michal Hocko
2017-05-04  5:26   ` Balbir Singh
2017-05-04 12:52     ` Michal Hocko
2017-05-04 15:49       ` Benjamin Herrenschmidt
2017-05-04 17:33         ` Dave Hansen
2017-05-05  3:17           ` Balbir Singh
2017-05-05 14:51             ` Dave Hansen
2017-05-05  7:49           ` Benjamin Herrenschmidt
2017-05-05 14:52         ` Michal Hocko
2017-05-05 15:57           ` Benjamin Herrenschmidt
2017-05-05 17:48             ` Jerome Glisse
2017-05-05 17:59               ` Benjamin Herrenschmidt
2017-05-09 11:36             ` Michal Hocko
2017-05-09 13:43               ` Benjamin Herrenschmidt
2017-05-15 12:55                 ` Michal Hocko
2017-05-15 15:53                   ` Christoph Lameter
2017-05-10 23:04               ` Balbir Singh
2017-05-09  7:51           ` Balbir Singh
