* [LSF/MM TOPIC] Rethinking NUMA
@ 2018-01-28 7:18 Anshuman Khandual
From: Anshuman Khandual @ 2018-01-28 7:18 UTC (permalink / raw)
To: lsf-pc, linux-mm
Cc: Vlastimil Babka, Michal Hocko, Mel Gorman,
Andrew Morton, Jerome Glisse, John Hubbard, Dave Hansen,
Dan Williams, Ross Zwisler, Kogut, Jaroslaw
In the last couple of years we have gone through various patch series
related to HMM, HMM-CDM, NUMA CDM, ACPI HMAT representation in sysfs
etc. It is now the right time to take a closer look at the existing
NUMA representation and how it can evolve in the long term to
accommodate coherent memory with multiple attributes. There are various
possible directions which need to be discussed and evaluated in order
to build a consensus among all stakeholders in the community. This is
an attempt to kick start that discussion.
People:
Mel Gorman <mgorman@suse.de>
Michal Hocko <mhocko@kernel.org>
Vlastimil Babka <vbabka@suse.cz>
Jerome Glisse <jglisse@redhat.com>
John Hubbard <jhubbard@nvidia.com>
Dave Hansen <dave.hansen@intel.com>
Ross Zwisler <ross.zwisler@linux.intel.com>
Process Address Space Evolution
===============================
Memory with different attributes, mapped into the process address
space, will give new capabilities and opportunities which were never
possible before.
1. Explore new programming and problem solving capabilities
2. Save energy with a big working set which stays resident longer but
is accessed rarely
3. Optimal placement of data structures depending upon various user
space requirements like access speed (latency or bandwidth) and
residency time span etc
With the advent of new attribute based memory, this is inevitable in
the long run.
Mapping Attribute Memory Into Process Address Space
===================================================
Attribute memory can be mapped into any process address space through
its page table in two distinct ways, each with its own advantages and
disadvantages.
1. Device Driver
a. Driver is required, kernel is not aware of its presence at all
b. Driver, not the kernel, manages allocation/free for attribute memory
c. Driver loading and initialization of attribute memory is required
d. User specifies the required attributes through ioctl flags
e. Lower level of integration into MM, hence less features available
2. Core MM system calls
a. No driver is required, it is integrated into the kernel
b. Kernel manages allocation/free for the attribute memory
c. Driver loading and initialization is not required
d. User specifies the attributes through system call flags
e. Higher integration into MM, hence more features applicable
A. Driver IOCTL Mapping
=======================
If we go in this direction, where the device driver manages everything:
1. Nothing else needs to be done in the kernel
2. Moreover, the HMM and HMM-CDM solutions provide more functionality
like migration etc. along with better integration with core MM
through ZONE_DEVICE
Why this is not a long term solution:
1. Passing over different attribute memory representation to drivers
2. Kernel relinquishing its responsibilities to device drivers
3. Multiple attribute memory provided by multiple device vendors will
have their own drivers and the user space will have to deal with all
these drivers to get different memory which is neither optimal nor
elegant
4. Interoperability between these memories, or with system RAM (for
example migration), will be complicated as all of the drivers need
to export supporting functions
5. HMM, HMM-CDM or any traditional driver based solution had a
complication: a device driver, sometimes a closed source one, was
needed to manage the device itself. So the proposition that the
driver should take care of the memory as well was somewhat logical
and justified
But going forward, when these devices are managed by open source
drivers and their memory is available for representation in the kernel,
that argument just goes away. Like any other memory, the kernel will
have to represent this attribute memory and can no longer hand over
the responsibility to device drivers.
B. MM System Calls
==================
B.1 Attribute Memory as distinct NUMA nodes:
--------------------------------------------
User space can access any attribute memory by simply calling mbind
(MPOL_BIND...) after identifying the right node. There will be a sysfs
interface to help with that. The view of memory attributes will be two
dimensional. Broadly, each memory will have these kinds of attribute
values. Accuracy and completeness of this list can be debated and
agreed upon later.
1. Bandwidth
2. Latency
3. Reliability
4. Power
5. Density
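The mbind(MPOL_BIND...) step mentioned at the start of this section
could look roughly like the sketch below. It is only an illustration:
the helper names nodemask_single() and bind_range_to_node() and the
MASK_WORDS cap are made up for this example, and the node id of the
attribute memory is assumed to be already known to the caller. The raw
syscall is used so that libnuma is not needed.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/mempolicy.h>    /* MPOL_BIND */

/* Hypothetical cap on the node ids this sketch can express. */
#define MASK_WORDS    4
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define MASK_BITS     (MASK_WORDS * BITS_PER_WORD)

/* Build a nodemask with only `node` set. Returns 0, or -1 if out of range. */
int nodemask_single(unsigned long mask[MASK_WORDS], unsigned int node)
{
    assert(mask != NULL);
    if (node >= MASK_BITS)
        return -1;
    memset(mask, 0, MASK_WORDS * sizeof(unsigned long));
    mask[node / BITS_PER_WORD] |= 1UL << (node % BITS_PER_WORD);
    return 0;
}

/*
 * Bind the page aligned range [addr, addr + len) to a single node.
 * Returns 0 on success, -1 with errno set on failure.
 */
long bind_range_to_node(void *addr, size_t len, unsigned int node)
{
    unsigned long mask[MASK_WORDS];

    if (nodemask_single(mask, node) != 0) {
        errno = EINVAL;
        return -1;
    }
    /* maxnode is passed as the bit count plus one, as libnuma does. */
    return syscall(SYS_mbind, addr, (unsigned long)len, (long)MPOL_BIND,
                   mask, (unsigned long)MASK_BITS + 1, 0UL);
}
```

A caller would mmap() a region and then call bind_range_to_node(addr,
len, node). The call can fail, so the return value and errno should be
checked.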
Moreover, these attributes can be 'as seen from' different compute
nodes having CPUs. This will require a two dimensional structure of
attribute values to be exported to user space. IIUC, the HMAT export
prompted by the new ACPI standard was one such attempt.
https://lkml.org/lkml/2017/12/13/968
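To make the two dimensional view concrete, here is a minimal sketch of
an attribute table indexed by (compute node, memory node). Everything
in it is an assumption made for illustration: the struct mem_attr
layout, the units, the node counts and the numbers themselves. It is
not the HMAT sysfs format.

```c
#include <assert.h>
#include <stddef.h>

#define NR_INITIATORS 2   /* compute nodes having CPUs (hypothetical) */
#define NR_TARGETS    3   /* memory nodes (hypothetical) */

/* Attributes of one memory node as seen from one compute node. */
struct mem_attr {
    unsigned int bandwidth_mbps;  /* higher is better */
    unsigned int latency_ns;      /* lower is better */
};

/* Invented values: targets 0 and 1 are system RAM, target 2 is device memory. */
static const struct mem_attr attr_table[NR_INITIATORS][NR_TARGETS] = {
    { { 20000,  80 }, { 12000, 140 }, { 90000, 300 } },
    { { 12000, 140 }, { 20000,  80 }, { 60000, 350 } },
};

/* Memory node with the lowest latency as seen from `initiator`, or -1. */
int best_target_by_latency(size_t initiator)
{
    int best = 0;

    if (initiator >= NR_INITIATORS)
        return -1;
    for (int t = 1; t < NR_TARGETS; t++)
        if (attr_table[initiator][t].latency_ns <
            attr_table[initiator][best].latency_ns)
            best = t;
    return best;
}

/* Memory node with the highest bandwidth as seen from `initiator`, or -1. */
int best_target_by_bandwidth(size_t initiator)
{
    int best = 0;

    if (initiator >= NR_INITIATORS)
        return -1;
    for (int t = 1; t < NR_TARGETS; t++)
        if (attr_table[initiator][t].bandwidth_mbps >
            attr_table[initiator][best].bandwidth_mbps)
            best = t;
    return best;
}
```

The same memory node can rank differently per attribute and per compute
node, which is why a single distance value per node pair is not enough
to describe it.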
But lack of clarity on the direction of NUMA will prevent us from
deliberating on how the user interface for attributes should look
going forward.
Distinct NUMA representation can be achieved with or without changing
the core MM.
B.1.1 Without changing core MM
Just plug the attribute memory in as a distinct NUMA node with
ZONE_MOVABLE (just to prevent kernel allocations into it) and with a
higher NUMA distance, reducing the chances of implicit allocation leaks
into it. This is the simplest solution in the category where attribute
memory is represented as NUMA nodes. But it has a single fundamental
drawback.
* Allocation leaks which can not be prevented with just high NUMA
distance
All other complexities like memory fallback options can be handled in
user space. But if the attribute values are on an 'as seen from' basis,
then user space needs to rebind appropriately as and when the tasks
move around the system, which might be overwhelming.
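The allocation leak drawback can be illustrated with a toy model. This
is not kernel code: struct toy_node, toy_alloc_page() and the distance
values are all invented for the example. The only point it makes is
that a distance ordered fallback still reaches the attribute node once
the nearer nodes run out of free pages.

```c
#include <assert.h>
#include <stddef.h>

#define NR_NODES 3

struct toy_node {
    unsigned long free_pages;
    int is_attribute;           /* 1 if the node holds attribute memory */
};

/* distance[from][to]; node 2 is the far away attribute node. */
static const int distance[NR_NODES][NR_NODES] = {
    { 10, 20, 80 },
    { 20, 10, 80 },
    { 80, 80, 10 },
};

/*
 * Allocate one page for a task running on node `from`: the nearest
 * node that still has free pages wins. Returns the node id used, or
 * -1 if every node is exhausted.
 */
int toy_alloc_page(struct toy_node nodes[NR_NODES], size_t from)
{
    int best = -1;

    assert(from < NR_NODES);
    for (size_t n = 0; n < NR_NODES; n++) {
        if (nodes[n].free_pages == 0)
            continue;
        if (best < 0 || distance[from][n] < distance[from][best])
            best = (int)n;
    }
    if (best >= 0)
        nodes[best].free_pages--;
    return best;
}
```

With one free page on each system RAM node, the third allocation from
node 0 lands on node 2 even though nobody asked for attribute memory.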
B.1.2 With changing core MM
Representing attribute memory as NUMA nodes but with some changes in the
core MM will have the following benefits.
1. There won't be implicit allocation leaks into the attribute memory
2. Allocation fallback options can be handled precisely in the kernel
3. Enforcement of the memory policy in kernel even when the tasks move
around
The CDM implementation posted last year demonstrated how implicit
allocation leaks into device memory can be prevented by changing
zonelist creation.
https://lkml.org/lkml/2017/2/15/224
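The idea of filtering at fallback list creation time can be sketched as
follows. This is a simplified user space model and not the code from
the CDM series: build_fallback_list(), node_is_attribute[] and the
distance table are hypothetical names and values chosen for the
example.

```c
#include <assert.h>
#include <stddef.h>

#define NR_NODES 3

/* Node 2 is the attribute node in this made-up topology. */
static const int node_is_attribute[NR_NODES] = { 0, 0, 1 };

static const int distance[NR_NODES][NR_NODES] = {
    { 10, 20, 80 },
    { 20, 10, 80 },
    { 80, 80, 10 },
};

/*
 * Fill `out` with node ids ordered by distance from `from`. Attribute
 * nodes are skipped unless the caller explicitly allows them.
 * Returns the number of entries written.
 */
size_t build_fallback_list(size_t from, int allow_attribute,
                           int out[NR_NODES])
{
    size_t count = 0;

    assert(from < NR_NODES);
    for (size_t n = 0; n < NR_NODES; n++) {
        size_t pos = count;

        if (node_is_attribute[n] && !allow_attribute)
            continue;
        /* Insertion sort keeps the list ordered by distance. */
        while (pos > 0 &&
               distance[from][out[pos - 1]] > distance[from][n]) {
            out[pos] = out[pos - 1];
            pos--;
        }
        out[pos] = (int)n;
        count++;
    }
    return count;
}
```

An implicit allocation would walk the list built with allow_attribute
set to 0 and so can never reach node 2, while an explicit request could
use a list built with allow_attribute set to 1.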
B.2 Attribute Memory Inside Existing NUMA nodes:
------------------------------------------------
Some attribute memory might be connected directly to the compute nodes,
lacking its own NUMA distance. A separate NUMA node representation will
not make sense in those situations. Even otherwise, such attribute
memory can be represented inside the compute nodes having CPUs. The
NUMA view of the buddy allocator then needs to contain all of this
memory, either as
1. Separate zones for attribute memory
2. Separate MIGRATE_TYPE page blocks for attribute memory
3. Separate free_area[] for attribute memory
One such very high level proposal, which changes free_area[] to
accommodate attribute memory, can be found here.
http://linuxplumbersconf.org/2017/ocw//system/presentations/4656/original/Hierarchical_NUMA_Design_Plumbers_2017.pdf
Any of the changes stated above will require significant changes to
core MM. There are also drawbacks with these kinds of representations.
1. In absence of node info, struct page will lack identity as attribute
memory
2. struct page will need a single bit specifying it as attribute
memory, though specific differentiation can be handled once this bit
is set
3. User cannot specify attribute memory through mbind(MPOL_BIND...)
any more. It will need new flags with madvise() or new system calls
altogether
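The single bit mentioned in point 2 could work along these lines. The
sketch uses a stand-in struct toy_page rather than the real struct
page, and TOY_PG_ATTRIBUTE and the attr_class field are hypothetical:
no such page flag exists in the kernel today, and where the finer
grained class would actually be stored is an open question.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bit, not an existing kernel page flag. */
#define TOY_PG_ATTRIBUTE (1UL << 0)

/* Stand-in for struct page, reduced to what the example needs. */
struct toy_page {
    unsigned long flags;
    unsigned int attr_class;    /* only meaningful when the bit is set */
};

/* Mark a page as attribute memory of the given class. */
void toy_page_mark_attribute(struct toy_page *page, unsigned int attr_class)
{
    assert(page != NULL);
    page->flags |= TOY_PG_ATTRIBUTE;
    page->attr_class = attr_class;
}

/* The cheap first level test: is this attribute memory at all? */
bool toy_page_is_attribute(const struct toy_page *page)
{
    return (page->flags & TOY_PG_ATTRIBUTE) != 0;
}

/*
 * Second level differentiation, looked at only once the bit is set.
 * Returns the class, or -1 for ordinary system RAM.
 */
int toy_page_attr_class(const struct toy_page *page)
{
    if (!toy_page_is_attribute(page))
        return -1;
    return (int)page->attr_class;
}
```

Common paths would only test the bit, and the more expensive class
lookup would be confined to code that already knows it is dealing with
attribute memory.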
But these changes will also have the following benefits (similar to
method B.1.2 With changing core MM)
1. There won't be implicit allocation leaks into the attribute memory
2. Allocation fallback options can be handled precisely in the kernel
3. Enforcement of the memory policy in kernel even when the tasks move
around