* [RFC 0/8] Define coherent device memory node
@ 2016-10-24  4:31 ` Anshuman Khandual
  0 siblings, 0 replies; 135+ messages in thread
From: Anshuman Khandual @ 2016-10-24  4:31 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mhocko, js1304, vbabka, mgorman, minchan, akpm, aneesh.kumar, bsingharora

There are certain devices like accelerators, GPU cards, network cards,
FPGA cards, PLD cards etc. which might contain onboard memory. This
onboard memory can be coherent with system RAM and may be accessible
from either the CPU or the device. The coherency is usually achieved by
synchronizing cache accesses from both sides. This makes the device
memory appear in the same address space as the system RAM. The onboard
device memory and the system RAM are coherent but differ in their
properties, as elaborated below. The following diagram shows how the
coherent device memory appears in the memory address space.

 +-----------------+          +-----------------+
 |                 |          |                 |
 |       CPU       |          |     DEVICE      |
 |                 |          |                 |
 +-----------------+          +-----------------+
          |                            |
          |    Shared Address Space    |
 +---------------------------------------------------------------------+
 |                                 |                                   |
 |                                 |                                   |
 |           System RAM            |          Coherent Memory          |
 |                                 |                                   |
 |                                 |                                   |
 +---------------------------------------------------------------------+

User space applications might be interested in using the coherent device
memory either explicitly or implicitly, along with the system RAM, using
the basic semantics for memory allocation, access and release. Basically
the applications should be able to allocate memory anywhere (system RAM
or coherent memory) and then have it accessed either from the CPU or from
the coherent device for various computation or data transformation
purposes. User space should not need to be concerned about memory
placement or the subsequent allocations that happen when the memory
actually faults on access.
To achieve seamless integration between system RAM and coherent device
memory, the kernel must be able to use core memory features on it: anon
mapping, file mapping, page cache, driver managed pages, HW poisoning,
migration, reclaim, compaction, etc. Making the coherent device memory
appear as a distinct memory-only NUMA node, initialized like any other
node with memory, creates this integration with the currently available
system RAM. At the same time there should be a distinguishing mark which
indicates that this node is a coherent device memory node, not just
another memory-only system RAM node.

Coherent device memory invariably isn't available until the driver for
the device has been initialized. It is desirable, but not required, for
the device to support memory offlining for purposes such as power
management, link management and hardware errors. Kernel allocations
should not land here, as they cannot be moved out. Hence coherent device
memory should go into the ZONE_MOVABLE zone instead. This guarantees
that kernel allocations will never be satisfied from this memory, and
that any process holding unmovable pages on this coherent device memory
(likely through pinning after the initial allocation) can be killed to
free the memory from its page tables and eventually allow the node to be
hot plugged out.

Even when represented as a NUMA node, the coherent memory might still
need some special consideration inside the kernel. There can be a
variety of coherent device memory nodes with different expectations of,
and special considerations from, the core kernel. This RFC discusses
only one such scenario, where the coherent device memory requires just
isolation.

Now let us consider in detail the case of a coherent device memory node
which requires isolation. This kind of coherent device memory is onboard
an external device attached to the system through a link where there is
a chance of link errors taking the entire memory node out with it.
Moreover, the memory might also have a higher chance of ECC errors
compared to system RAM. These are just some possibilities; the fact
remains that coherent device memory can have other differing properties
which might not be desirable for some user space applications. An
application should not be exposed to the risks of a device if it is not
taking advantage of the special features of that device and its memory.

For the reasons explained above, allocations into an isolation-based
coherent device memory node should be further regulated, beyond the
earlier requirement that kernel allocations not land there: user space
allocations should not land there implicitly, without the application
explicitly knowing about it. This summarizes the isolation requirement
of one kind of coherent device memory node, as an example. Some coherent
memory devices may not require isolation at all, and there might be
other coherent memory devices which require some other special treatment
after becoming part of the core memory representation in the kernel.
Though the framework suggested by this RFC makes provisions for them, it
has not considered any requirement other than isolation for now. Though
this RFC series currently implements one such isolation-seeking coherent
device memory example, the framework can be extended to accommodate any
present or future coherent memory device that fits the description
above, even with new requirements other than isolation.

For an isolation-seeking coherent device memory node, there are other
core VM code paths which need to be taken care of before it can be
completely isolated as required. Core kernel memory features like
reclaim, eviction etc. might need to be restricted or modified on the
coherent device memory node as they can be performance limiting. The RFC
does not propose anything on this yet, but it can be looked into later.
For now it just disables Auto NUMA for any VMA which has coherent device
memory.

Seamless integration of coherent device memory with system memory will
enable various other features, some of which are:

a. Seamless migration between system RAM and the coherent memory
b. Asynchronous and high throughput migrations
c. Ability to allocate huge order pages from these memory regions
d. Restricting allocations, to a large extent, to the tasks using the
   device for workload acceleration

Before concluding, let us look at the reasons why the existing solutions
do not work. There are two basic requirements which have to be satisfied
before the coherent device memory can be integrated with the core kernel
seamlessly:

a. The PFN must have a struct page
b. The struct page must be able to sit on the standard LRU lists

These two basic requirements rule out the existing device memory
representation approaches below, which is why a new framework is needed.

(1) Traditional ioremap

a. Memory is mapped into kernel (linear and virtual) and user space
b. These PFNs do not have struct pages associated with them
c. These special PFNs are marked with special flags inside the PTE
d. Because of this they cannot participate much in core VM functions
e. Cannot do easy user space migrations

(2) Zone ZONE_DEVICE

a. Memory is mapped into kernel and user space
b. The PFNs do have struct pages associated with them
c. These struct pages are allocated inside its own memory range
d. Unfortunately the struct page union containing the LRU linkage is
   used for the struct dev_pagemap pointer
e. Hence the pages cannot be part of any LRU (like the page cache)
f. Hence file cached mappings cannot reside on these PFNs
g. Cannot do easy migrations

I had also explored a non-LRU representation of this coherent device
memory, where the integration with system RAM in the core VM is limited
to only the following functions.
Not being on the LRU is definitely going to reduce the scope of tight
integration with system RAM.

(1) Migration support between system RAM and coherent memory
(2) Migration support between various coherent memory nodes
(3) Isolation of the coherent memory
(4) Mapping the coherent memory into user space through the driver's
    struct vm_operations
(5) HW poisoning of the coherent memory

Allocating the entire memory of the coherent device node right after hot
plug into ZONE_MOVABLE (where the memory is already inside the buddy
system) still exposes a time window in which other user space
allocations can land on the coherent device memory node and defeat the
intended isolation. So traditional hot plug is not the solution. Hence I
started looking into a CMA based non-LRU solution, but hit the following
roadblocks.

(1) CMA does not support hot plugging of a new memory node
    a. The CMA area needs to be marked during boot, before the buddy
       allocator is initialized
    b. cma_alloc()/cma_release() can then happen on the marked area
    c. We would need to be able to mark CMA areas just after memory
       hot plug
    d. cma_alloc()/cma_release() could then happen after the hot plug
    e. This is not currently supported

(2) Mapped non-LRU migration of pages
    a. Recent work from Minchan Kim makes non-LRU pages migratable
    b. But it still does not support migration of mapped non-LRU pages
    c. With a non-LRU CMA reservation, again there are some additional
       challenges

With hot pluggable CMA and non-LRU mapped migration support, there may
be an alternate approach to represent coherent device memory.

Please do review this RFC proposal and let me know your comments or
suggestions. Thank you.
Anshuman Khandual (8):
  mm: Define coherent device memory node
  mm: Add specialized fallback zonelist for coherent device memory nodes
  mm: Isolate coherent device memory nodes from HugeTLB allocation paths
  mm: Accommodate coherent device memory nodes in MPOL_BIND implementation
  mm: Add new flag VM_CDM for coherent device memory
  mm: Make VM_CDM marked VMAs non migratable
  mm: Add a new migration function migrate_virtual_range()
  mm: Add N_COHERENT_DEVICE node type into node_states[]

 Documentation/ABI/stable/sysfs-devices-node |  7 +++
 drivers/base/node.c                         |  6 +++
 include/linux/mempolicy.h                   | 24 +++++++++
 include/linux/migrate.h                     |  3 ++
 include/linux/mm.h                          |  5 ++
 include/linux/mmzone.h                      | 29 ++++++++++
 include/linux/nodemask.h                    |  3 ++
 mm/Kconfig                                  | 13 +++++
 mm/hugetlb.c                                | 38 ++++++++++++-
 mm/memory_hotplug.c                         | 10 ++++
 mm/mempolicy.c                              | 70 ++++++++++++++++++++++--
 mm/migrate.c                                | 84 +++++++++++++++++++++++++++++
 mm/page_alloc.c                             | 10 ++++
 13 files changed, 295 insertions(+), 7 deletions(-)

-- 
2.1.0

^ permalink raw reply	[flat|nested] 135+ messages in thread
* [RFC 1/8] mm: Define coherent device memory node
  2016-10-24  4:31 ` Anshuman Khandual
@ 2016-10-24  4:31 ` Anshuman Khandual
  -1 siblings, 0 replies; 135+ messages in thread
From: Anshuman Khandual @ 2016-10-24  4:31 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mhocko, js1304, vbabka, mgorman, minchan, akpm, aneesh.kumar, bsingharora

There are certain devices like specialized accelerators, GPU cards,
network cards, FPGA cards etc. which contain onboard memory that is
coherent with the existing system RAM while being accessed either from
the CPU or from the device. They share some properties with normal
system RAM but can also differ from it in other respects. User
applications might be interested in using this kind of coherent device
memory, explicitly or implicitly, alongside the system RAM, utilizing
all possible core memory functions like anon mapping (LRU), file mapping
(LRU), page cache (LRU), driver managed (non-LRU), HW poisoning, NUMA
migrations etc.

To achieve this kind of tight integration with the core memory
subsystem, the device onboard coherent memory must be represented as a
memory only NUMA node. At the same time the pglist_data structure (the
node's memory representation) of this NUMA node must also be marked,
indicating that it is coherent device memory, not regular system RAM.

After achieving integration with the core memory subsystem through a
marked pglist_data structure, coherent device memory might still need
some special consideration inside the kernel. There can be a variety of
coherent memory nodes with different expectations from the core kernel.
Right now only one kind of special treatment is considered, which
requires a certain isolation.

Now consider the case of a coherent device memory node type which
requires isolation. This kind of coherent memory is onboard an external
device attached to the system through a link where there is always a
chance of a link failure taking down the entire memory node with it.
Moreover the memory might also have a higher chance of ECC failure
compared to the system RAM. Hence allocations into this kind of coherent
memory node should be regulated. Kernel allocations must not come here.
Normal user space allocations too should not come here implicitly
(without the user application knowing about it). This summarizes the
isolation requirement of a certain kind of coherent device memory node,
as an example. There can be different kinds of isolation requirements as
well. Some coherent memory devices might not require isolation at all,
and there might be other coherent memory devices which require some
other special treatment after becoming part of the core memory
representation. For now, this looks only into the isolation seeking
coherent device memory node, not the other ones.

This adds a new 'bool coherent_device' element to the pglist_data
structure which can identify any coherent device node. This could
instead be a u64, which could then hold a set of property bits for
various types of coherent devices in the future.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 include/linux/mmzone.h | 29 +++++++++++++++++++++++++++++
 mm/Kconfig             | 13 +++++++++++++
 2 files changed, 42 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7f2ae99..821dffb 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -722,8 +722,37 @@ typedef struct pglist_data {
 	/* Per-node vmstats */
 	struct per_cpu_nodestat __percpu *per_cpu_nodestats;
 	atomic_long_t		vm_stat[NR_VM_NODE_STAT_ITEMS];
+
+#ifdef CONFIG_COHERENT_DEVICE
+	/*
+	 * Coherent device memory node
+	 *
+	 * Devices containing coherent memory is represented as a
+	 * special coherent memory NUMA node, should be identified
+	 * differently compared to normal memory nodes. Though it
+	 * shares lot of common properties with system memory, it
+	 * also has some differentiating factors as well.
+	 *
+	 * XXX: Though this is a bool which identifies the isolation
+	 * requiring coherent device memory node right now, it can be
+	 * extended as a bit mask to represent different properties
+	 * for future coherent device memory nodes.
+	 */
+	bool coherent_device;
+#endif
 } pg_data_t;
 
+#ifdef CONFIG_COHERENT_DEVICE
+#define node_cdm(nid)		(NODE_DATA(nid)->coherent_device)
+#define set_cdm_isolation(nid)	(node_cdm(nid) = 1)
+#define clr_cdm_isolation(nid)	(node_cdm(nid) = 0)
+#define isolated_cdm_node(nid)	(node_cdm(nid) == 1)
+#else
+#define set_cdm_isolation(nid)	()
+#define clr_cdm_isolation(nid)	()
+#define isolated_cdm_node(nid)	(0)
+#endif
+
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
 #define node_spanned_pages(nid)	(NODE_DATA(nid)->node_spanned_pages)
 #ifdef CONFIG_FLAT_NODE_MEM_MAP
diff --git a/mm/Kconfig b/mm/Kconfig
index be0ee11..cb50468 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -704,6 +704,19 @@ config ZONE_DEVICE
 
 	  If FS_DAX is enabled, then say Y.
 
+config COHERENT_DEVICE
+	bool "Coherent device memory support"
+	depends on MEMORY_HOTPLUG
+	depends on MEMORY_HOTREMOVE
+	depends on PPC64
+	default y
+	help
+	  Coherent device memory node support enables the system to hotplug
+	  a device with coherent memory as a normal system memory node. FPGA,
+	  network, GPU cards etc might contain coherent memory.
+
+	  If not sure, then say N.
+
 config FRAME_VECTOR
 	bool
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 135+ messages in thread
* Re: [RFC 1/8] mm: Define coherent device memory node 2016-10-24 4:31 ` Anshuman Khandual @ 2016-10-24 17:09 ` Dave Hansen -1 siblings, 0 replies; 135+ messages in thread From: Dave Hansen @ 2016-10-24 17:09 UTC (permalink / raw) To: Anshuman Khandual, linux-kernel, linux-mm Cc: mhocko, js1304, vbabka, mgorman, minchan, akpm, aneesh.kumar, bsingharora > +#ifdef CONFIG_COHERENT_DEVICE > +#define node_cdm(nid) (NODE_DATA(nid)->coherent_device) > +#define set_cdm_isolation(nid) (node_cdm(nid) = 1) > +#define clr_cdm_isolation(nid) (node_cdm(nid) = 0) > +#define isolated_cdm_node(nid) (node_cdm(nid) == 1) > +#else > +#define set_cdm_isolation(nid) () > +#define clr_cdm_isolation(nid) () > +#define isolated_cdm_node(nid) (0) > +#endif FWIW, I think adding all this "cdm" gunk in the names is probably a bad thing. I can think of other memory types that are coherent, but non-device-based that might want behavior like this. ^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [RFC 1/8] mm: Define coherent device memory node 2016-10-24 17:09 ` Dave Hansen @ 2016-10-25 1:22 ` Anshuman Khandual -1 siblings, 0 replies; 135+ messages in thread From: Anshuman Khandual @ 2016-10-25 1:22 UTC (permalink / raw) To: Dave Hansen, Anshuman Khandual, linux-kernel, linux-mm Cc: mhocko, js1304, vbabka, mgorman, minchan, akpm, aneesh.kumar, bsingharora On 10/24/2016 10:39 PM, Dave Hansen wrote: >> +#ifdef CONFIG_COHERENT_DEVICE >> > +#define node_cdm(nid) (NODE_DATA(nid)->coherent_device) >> > +#define set_cdm_isolation(nid) (node_cdm(nid) = 1) >> > +#define clr_cdm_isolation(nid) (node_cdm(nid) = 0) >> > +#define isolated_cdm_node(nid) (node_cdm(nid) == 1) >> > +#else >> > +#define set_cdm_isolation(nid) () >> > +#define clr_cdm_isolation(nid) () >> > +#define isolated_cdm_node(nid) (0) >> > +#endif > FWIW, I think adding all this "cdm" gunk in the names is probably a bad > thing. > > I can think of other memory types that are coherent, but > non-device-based that might want behavior like this. Hmm, I was not aware of non-device-based coherent memory. Could you please name some of them? If that's the case we need to change CDM to something which can accommodate both device and non-device based coherent memory. Maybe something like "Differentiated/special coherent memory". But it needs to communicate that it's not system RAM. That's the idea. ^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [RFC 1/8] mm: Define coherent device memory node 2016-10-25 1:22 ` Anshuman Khandual @ 2016-10-25 15:47 ` Dave Hansen -1 siblings, 0 replies; 135+ messages in thread From: Dave Hansen @ 2016-10-25 15:47 UTC (permalink / raw) To: Anshuman Khandual, linux-kernel, linux-mm Cc: mhocko, js1304, vbabka, mgorman, minchan, akpm, aneesh.kumar, bsingharora On 10/24/2016 06:22 PM, Anshuman Khandual wrote: > On 10/24/2016 10:39 PM, Dave Hansen wrote: >>> +#ifdef CONFIG_COHERENT_DEVICE >>>> +#define node_cdm(nid) (NODE_DATA(nid)->coherent_device) >>>> +#define set_cdm_isolation(nid) (node_cdm(nid) = 1) >>>> +#define clr_cdm_isolation(nid) (node_cdm(nid) = 0) >>>> +#define isolated_cdm_node(nid) (node_cdm(nid) == 1) >>>> +#else >>>> +#define set_cdm_isolation(nid) () >>>> +#define clr_cdm_isolation(nid) () >>>> +#define isolated_cdm_node(nid) (0) >>>> +#endif >> FWIW, I think adding all this "cdm" gunk in the names is probably a bad >> thing. >> >> I can think of other memory types that are coherent, but >> non-device-based that might want behavior like this. > > Hmm, I was not aware about non-device-based coherent memory. Could you > please name some of them ? If thats the case we need to change CDM to > some thing which can accommodate both device and non device based > coherent memory. May be like "Differentiated/special coherent memory". > But it needs to communicate that its not system RAM. Thats the idea. Intel has some stuff called MCDRAM. It's described in detail here: > https://software.intel.com/en-us/articles/mcdram-high-bandwidth-memory-on-knights-landing-analysis-methods-tools You can also Google around for more information. I believe Samsung has a technology called High Bandwidth Memory (HBM) that's already a couple of generations old that sounds similar. ^ permalink raw reply [flat|nested] 135+ messages in thread
* [RFC 2/8] mm: Add specialized fallback zonelist for coherent device memory nodes 2016-10-24 4:31 ` Anshuman Khandual @ 2016-10-24 4:31 ` Anshuman Khandual -1 siblings, 0 replies; 135+ messages in thread From: Anshuman Khandual @ 2016-10-24 4:31 UTC (permalink / raw) To: linux-kernel, linux-mm Cc: mhocko, js1304, vbabka, mgorman, minchan, akpm, aneesh.kumar, bsingharora This change is part of the implementation of the isolation requiring coherent device memory node. An isolation seeking coherent memory node requires isolation from implicit memory allocations from user space, but at the same time there should also be an explicit way to do the allocation. Kernel allocation to this memory can be prevented by putting the entire memory in ZONE_MOVABLE, for example. Both of a node's zonelists are fundamental to where memory comes from when there is an allocation request. In order to achieve the two objectives stated above, the zonelist building process has to change, as both zonelists (FALLBACK and NOFALLBACK) give access to the node's memory zones during any kind of memory allocation. The following changes are implemented in this regard. (1) Coherent node's zones are not part of any other node's FALLBACK list (2) Coherent node's FALLBACK list contains its own memory zones followed by all system RAM zones in normal order (3) Coherent node's zones are part of its own NOFALLBACK list The above changes ensure the following, which in turn isolates the coherent memory node as desired. 
(1) There won't be any implicit allocation ending up in the coherent node (2) __GFP_THISNODE marked allocations will come from the coherent node (3) Coherent memory can also be allocated through the MPOL_BIND interface Sample zonelist configuration: [NODE (0)] System RAM node ZONELIST_FALLBACK (0xc00000000140da00) (0) (node 0) (DMA 0xc00000000140c000) (1) (node 1) (DMA 0xc000000100000000) ZONELIST_NOFALLBACK (0xc000000001411a10) (0) (node 0) (DMA 0xc00000000140c000) [NODE (1)] System RAM node ZONELIST_FALLBACK (0xc000000100001a00) (0) (node 1) (DMA 0xc000000100000000) (1) (node 0) (DMA 0xc00000000140c000) ZONELIST_NOFALLBACK (0xc000000100005a10) (0) (node 1) (DMA 0xc000000100000000) [NODE (2)] Coherent memory ZONELIST_FALLBACK (0xc000000001427700) (0) (node 2) (Movable 0xc000000001427080) (1) (node 0) (DMA 0xc00000000140c000) (2) (node 1) (DMA 0xc000000100000000) ZONELIST_NOFALLBACK (0xc00000000142b710) (0) (node 2) (Movable 0xc000000001427080) [NODE (3)] Coherent memory ZONELIST_FALLBACK (0xc000000001431400) (0) (node 3) (Movable 0xc000000001430d80) (1) (node 0) (DMA 0xc00000000140c000) (2) (node 1) (DMA 0xc000000100000000) ZONELIST_NOFALLBACK (0xc000000001435410) (0) (node 3) (Movable 0xc000000001430d80) [NODE (4)] Coherent memory ZONELIST_FALLBACK (0xc00000000143b100) (0) (node 4) (Movable 0xc00000000143aa80) (1) (node 0) (DMA 0xc00000000140c000) (2) (node 1) (DMA 0xc000000100000000) ZONELIST_NOFALLBACK (0xc00000000143f110) (0) (node 4) (Movable 0xc00000000143aa80) Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> --- mm/page_alloc.c | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 2b3bf67..a2536b4 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4753,6 +4753,16 @@ static void build_zonelists(pg_data_t *pgdat) i = 0; while ((node = find_next_best_node(local_node, &used_mask)) >= 0) { +#ifdef CONFIG_COHERENT_DEVICE + /* + * Isolation requiring coherent device memory node's zones + * should not be 
part of any other node's fallback zonelist + * but it's own fallback list. + */ + if (isolated_cdm_node(node) && (pgdat->node_id != node)) + continue; +#endif + /* * We don't want to pressure a particular node. * So adding penalty to the first node in same -- 2.1.0 ^ permalink raw reply related [flat|nested] 135+ messages in thread
* Re: [RFC 2/8] mm: Add specialized fallback zonelist for coherent device memory nodes 2016-10-24 4:31 ` Anshuman Khandual @ 2016-10-24 17:10 ` Dave Hansen -1 siblings, 0 replies; 135+ messages in thread From: Dave Hansen @ 2016-10-24 17:10 UTC (permalink / raw) To: Anshuman Khandual, linux-kernel, linux-mm Cc: mhocko, js1304, vbabka, mgorman, minchan, akpm, aneesh.kumar, bsingharora On 10/23/2016 09:31 PM, Anshuman Khandual wrote: > +#ifdef CONFIG_COHERENT_DEVICE > + /* > + * Isolation requiring coherent device memory node's zones > + * should not be part of any other node's fallback zonelist > + * but it's own fallback list. > + */ > + if (isolated_cdm_node(node) && (pgdat->node_id != node)) > + continue; > +#endif Total nit: Why do you need an #ifdef here when you had +#ifdef CONFIG_COHERENT_DEVICE +#define node_cdm(nid) (NODE_DATA(nid)->coherent_device) +#define set_cdm_isolation(nid) (node_cdm(nid) = 1) +#define clr_cdm_isolation(nid) (node_cdm(nid) = 0) +#define isolated_cdm_node(nid) (node_cdm(nid) == 1) +#else +#define set_cdm_isolation(nid) () +#define clr_cdm_isolation(nid) () +#define isolated_cdm_node(nid) (0) +#endif in your last patch? ^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [RFC 2/8] mm: Add specialized fallback zonelist for coherent device memory nodes 2016-10-24 17:10 ` Dave Hansen @ 2016-10-25 1:27 ` Anshuman Khandual -1 siblings, 0 replies; 135+ messages in thread From: Anshuman Khandual @ 2016-10-25 1:27 UTC (permalink / raw) To: Dave Hansen, Anshuman Khandual, linux-kernel, linux-mm Cc: mhocko, js1304, vbabka, mgorman, minchan, akpm, aneesh.kumar, bsingharora On 10/24/2016 10:40 PM, Dave Hansen wrote: > On 10/23/2016 09:31 PM, Anshuman Khandual wrote: >> +#ifdef CONFIG_COHERENT_DEVICE >> + /* >> + * Isolation requiring coherent device memory node's zones >> + * should not be part of any other node's fallback zonelist >> + * but it's own fallback list. >> + */ >> + if (isolated_cdm_node(node) && (pgdat->node_id != node)) >> + continue; >> +#endif > > Total nit: Why do you need an #ifdef here when you had > > +#ifdef CONFIG_COHERENT_DEVICE > +#define node_cdm(nid) (NODE_DATA(nid)->coherent_device) > +#define set_cdm_isolation(nid) (node_cdm(nid) = 1) > +#define clr_cdm_isolation(nid) (node_cdm(nid) = 0) > +#define isolated_cdm_node(nid) (node_cdm(nid) == 1) > +#else > +#define set_cdm_isolation(nid) () > +#define clr_cdm_isolation(nid) () > +#define isolated_cdm_node(nid) (0) > +#endif > > in your last patch? Right, the "if" condition with an "&&" as a whole would have evaluated to be false. Hence the "ifdef" is not required. Will change it next time around. Thanks for pointing out. ^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [RFC 2/8] mm: Add specialized fallback zonelist for coherent device memory nodes 2016-10-24 4:31 ` Anshuman Khandual @ 2016-11-17 7:40 ` Anshuman Khandual -1 siblings, 0 replies; 135+ messages in thread From: Anshuman Khandual @ 2016-11-17 7:40 UTC (permalink / raw) To: Anshuman Khandual, linux-kernel, linux-mm Cc: mhocko, js1304, vbabka, mgorman, minchan, akpm, aneesh.kumar, bsingharora On 10/24/2016 10:01 AM, Anshuman Khandual wrote: > This change is part of the isolation requiring coherent device memory > node's implementation. > > Isolation seeking coherent memory node requires isolation from implicit > memory allocations from user space but at the same time there should also > have an explicit way to do the allocation. Kernel allocation to this memory > can be prevented by putting the entire memory in ZONE_MOVABLE for example. > > Platform node's both zonelists are fundamental to where the memory comes > when there is an allocation request. In order to achieve the two objectives > stated above, zonelists building process has to change as both zonelists > (FALLBACK and NOFALLBACK) gives access to the node's memory zones during > any kind of memory allocation. The following changes are implemented in > this regard. > > (1) Coherent node's zones are not part of any other node's FALLBACK list > (2) Coherent node's FALLBACK list contains it's own memory zones followed > by all system RAM zones in normal order > (3) Coherent node's zones are part of it's own NOFALLBACK list > > The above changes which will ensure the following which in turn isolates > the coherent memory node as desired. 
> > (1) There wont be any implicit allocation ending up in the coherent node > (2) __GFP_THISNODE marked allocations will come from the coherent node > (3) Coherent memory can also be allocated through MPOL_BIND interface > > Sample zonelist configuration: > > [NODE (0)] System RAM node > ZONELIST_FALLBACK (0xc00000000140da00) > (0) (node 0) (DMA 0xc00000000140c000) > (1) (node 1) (DMA 0xc000000100000000) > ZONELIST_NOFALLBACK (0xc000000001411a10) > (0) (node 0) (DMA 0xc00000000140c000) > [NODE (1)] System RAM node > ZONELIST_FALLBACK (0xc000000100001a00) > (0) (node 1) (DMA 0xc000000100000000) > (1) (node 0) (DMA 0xc00000000140c000) > ZONELIST_NOFALLBACK (0xc000000100005a10) > (0) (node 1) (DMA 0xc000000100000000) > [NODE (2)] Coherent memory > ZONELIST_FALLBACK (0xc000000001427700) > (0) (node 2) (Movable 0xc000000001427080) > (1) (node 0) (DMA 0xc00000000140c000) > (2) (node 1) (DMA 0xc000000100000000) > ZONELIST_NOFALLBACK (0xc00000000142b710) > (0) (node 2) (Movable 0xc000000001427080) > [NODE (3)] Coherent memory > ZONELIST_FALLBACK (0xc000000001431400) > (0) (node 3) (Movable 0xc000000001430d80) > (1) (node 0) (DMA 0xc00000000140c000) > (2) (node 1) (DMA 0xc000000100000000) > ZONELIST_NOFALLBACK (0xc000000001435410) > (0) (node 3) (Movable 0xc000000001430d80) > [NODE (4)] Coherent memory > ZONELIST_FALLBACK (0xc00000000143b100) > (0) (node 4) (Movable 0xc00000000143aa80) > (1) (node 0) (DMA 0xc00000000140c000) > (2) (node 1) (DMA 0xc000000100000000) > ZONELIST_NOFALLBACK (0xc00000000143f110) > (0) (node 4) (Movable 0xc00000000143aa80) > > Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> > --- Another way of achieving isolation of the CDM nodes from user space allocations would be through cpuset changes. Will be sending out couple of draft patches in this direction. Then we can look into whether the current method or the cpuset method is a better way to go forward. ^ permalink raw reply [flat|nested] 135+ messages in thread
* [DRAFT 1/2] mm/cpuset: Exclude CDM nodes from each task's mems_allowed node mask 2016-11-17 7:40 ` Anshuman Khandual @ 2016-11-17 7:59 ` Anshuman Khandual -1 siblings, 0 replies; 135+ messages in thread From: Anshuman Khandual @ 2016-11-17 7:59 UTC (permalink / raw) To: linux-kernel, linux-mm Cc: mhocko, js1304, vbabka, mgorman, minchan, akpm, aneesh.kumar, bsingharora task->mems_allowed decides the final node mask of nodes from which memory can be allocated irrespective of process or VMA based memory policy. CDM nodes should not be used for any user space memory allocation, hence they should not be part of any mems_allowed mask in user space to begin with. This adds a function system_ram() which computes system RAM only nodes and excludes all the CDM nodes on the platform. This resultant system RAM nodemask is used instead of N_MEMORY mask during cpuset and mems_allowed initialization. This achieves isolation of the coherent device memory from userspace allocations. Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> --- This completely isolates CDM nodes from user space allocations. Hence explicit allocation to the CDM nodes would not be possible any more. To again enable explicit allocation capability from user space, cpuset needs to be changed to accommodate CDM nodes into task's mems_allowed. include/linux/mm.h | 9 +++++++++ kernel/cpuset.c | 12 +++++++----- 2 files changed, 16 insertions(+), 5 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index a92c8d7..f338492 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -446,6 +446,15 @@ static inline int put_page_testzero(struct page *page) return page_ref_dec_and_test(page); } +static inline nodemask_t system_ram(void) +{ + nodemask_t ram_nodes; + + nodes_clear(ram_nodes); + nodes_andnot(ram_nodes, node_states[N_MEMORY], node_states[N_COHERENT_DEVICE]); + return ram_nodes; +} + /* * Try to grab a ref unless the page has a refcount of zero, return false if * that is the case. 
diff --git a/kernel/cpuset.c b/kernel/cpuset.c index 29f815d..78c6fa3 100644 --- a/kernel/cpuset.c +++ b/kernel/cpuset.c @@ -364,9 +364,11 @@ static void guarantee_online_cpus(struct cpuset *cs, struct cpumask *pmask) */ static void guarantee_online_mems(struct cpuset *cs, nodemask_t *pmask) { - while (!nodes_intersects(cs->effective_mems, node_states[N_MEMORY])) + nodemask_t nodes = system_ram(); + + while (!nodes_intersects(cs->effective_mems, nodes)) cs = parent_cs(cs); - nodes_and(*pmask, cs->effective_mems, node_states[N_MEMORY]); + nodes_and(*pmask, cs->effective_mems, nodes); } /* @@ -2301,7 +2303,7 @@ static void cpuset_hotplug_workfn(struct work_struct *work) /* fetch the available cpus/mems and find out which changed how */ cpumask_copy(&new_cpus, cpu_active_mask); - new_mems = node_states[N_MEMORY]; + new_mems = system_ram(); cpus_updated = !cpumask_equal(top_cpuset.effective_cpus, &new_cpus); mems_updated = !nodes_equal(top_cpuset.effective_mems, new_mems); @@ -2393,11 +2395,11 @@ static int cpuset_track_online_nodes(struct notifier_block *self, void __init cpuset_init_smp(void) { cpumask_copy(top_cpuset.cpus_allowed, cpu_active_mask); - top_cpuset.mems_allowed = node_states[N_MEMORY]; + top_cpuset.mems_allowed = system_ram(); top_cpuset.old_mems_allowed = top_cpuset.mems_allowed; cpumask_copy(top_cpuset.effective_cpus, cpu_active_mask); - top_cpuset.effective_mems = node_states[N_MEMORY]; + top_cpuset.effective_mems = system_ram(); register_hotmemory_notifier(&cpuset_track_online_nodes_nb); -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 135+ messages in thread
* [DRAFT 2/2] mm/hugetlb: Restrict HugeTLB allocations only to the system RAM nodes 2016-11-17 7:59 ` Anshuman Khandual @ 2016-11-17 7:59 ` Anshuman Khandual -1 siblings, 0 replies; 135+ messages in thread From: Anshuman Khandual @ 2016-11-17 7:59 UTC (permalink / raw) To: linux-kernel, linux-mm Cc: mhocko, js1304, vbabka, mgorman, minchan, akpm, aneesh.kumar, bsingharora HugeTLB allocation/release/accounting currently spans across all the nodes under the N_MEMORY mask. CDM nodes should not be part of these. So use the system_ram() call to fetch the system RAM only nodes on the platform, which can then be used for HugeTLB purposes instead of N_MEMORY. This isolates CDM nodes from HugeTLB allocation. Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> --- This also completely isolates CDM nodes from user space HugeTLB allocations. Hence explicit allocation to the CDM nodes would not be possible any more. To again enable explicit HugeTLB allocation capability from user space, the HugeTLB subsystem needs to be changed. mm/hugetlb.c | 32 +++++++++++++++++++++++--------- 1 file changed, 23 insertions(+), 9 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 418bf01..1936c5a 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1782,6 +1782,9 @@ static void return_unused_surplus_pages(struct hstate *h, unsigned long unused_resv_pages) { unsigned long nr_pages; + nodemask_t nodes; + + nodes = system_ram(); /* Uncommit the reservation */ h->resv_huge_pages -= unused_resv_pages; @@ -1801,7 +1804,7 @@ static void return_unused_surplus_pages(struct hstate *h, * on-line nodes with memory and will handle the hstate accounting. 
*/ while (nr_pages--) { - if (!free_pool_huge_page(h, &node_states[N_MEMORY], 1)) + if (!free_pool_huge_page(h, &nodes, 1)) break; cond_resched_lock(&hugetlb_lock); } @@ -2088,8 +2091,10 @@ int __weak alloc_bootmem_huge_page(struct hstate *h) { struct huge_bootmem_page *m; int nr_nodes, node; + nodemask_t nodes; - for_each_node_mask_to_alloc(h, nr_nodes, node, &node_states[N_MEMORY]) { + nodes = system_ram(); + for_each_node_mask_to_alloc(h, nr_nodes, node, &nodes) { void *addr; addr = memblock_virt_alloc_try_nid_nopanic( @@ -2158,13 +2163,15 @@ static void __init gather_bootmem_prealloc(void) static void __init hugetlb_hstate_alloc_pages(struct hstate *h) { unsigned long i; + nodemask_t nodes; + + nodes = system_ram(); for (i = 0; i < h->max_huge_pages; ++i) { if (hstate_is_gigantic(h)) { if (!alloc_bootmem_huge_page(h)) break; - } else if (!alloc_fresh_huge_page(h, - &node_states[N_MEMORY])) + } else if (!alloc_fresh_huge_page(h, &nodes)) break; } h->max_huge_pages = i; @@ -2401,8 +2408,11 @@ static ssize_t __nr_hugepages_store_common(bool obey_mempolicy, unsigned long count, size_t len) { int err; + nodemask_t ram_nodes; + NODEMASK_ALLOC(nodemask_t, nodes_allowed, GFP_KERNEL | __GFP_NORETRY); + ram_nodes = system_ram(); if (hstate_is_gigantic(h) && !gigantic_page_supported()) { err = -EINVAL; goto out; @@ -2415,7 +2425,7 @@ static ssize_t __nr_hugepages_store_common(bool obey_mempolicy, if (!(obey_mempolicy && init_nodemask_of_mempolicy(nodes_allowed))) { NODEMASK_FREE(nodes_allowed); - nodes_allowed = &node_states[N_MEMORY]; + nodes_allowed = &ram_nodes; } } else if (nodes_allowed) { /* @@ -2425,11 +2435,11 @@ static ssize_t __nr_hugepages_store_common(bool obey_mempolicy, count += h->nr_huge_pages - h->nr_huge_pages_node[nid]; init_nodemask_of_node(nodes_allowed, nid); } else - nodes_allowed = &node_states[N_MEMORY]; + nodes_allowed = &ram_nodes; h->max_huge_pages = set_max_huge_pages(h, count, nodes_allowed); - if (nodes_allowed != &node_states[N_MEMORY]) + 
if (nodes_allowed != &ram_nodes) NODEMASK_FREE(nodes_allowed); return len; @@ -2726,9 +2736,11 @@ static void hugetlb_register_node(struct node *node) */ static void __init hugetlb_register_all_nodes(void) { + nodemask_t nodes; int nid; - for_each_node_state(nid, N_MEMORY) { + nodes = system_ram(); + for_each_node_mask(nid, nodes) { struct node *node = node_devices[nid]; if (node->dev.id == nid) hugetlb_register_node(node); @@ -2998,13 +3010,15 @@ int hugetlb_report_node_meminfo(int nid, char *buf) void hugetlb_show_meminfo(void) { + nodemask_t nodes; struct hstate *h; int nid; if (!hugepages_supported()) return; - for_each_node_state(nid, N_MEMORY) + nodes = system_ram(); + for_each_node_mask(nid, nodes) for_each_hstate(h) pr_info("Node %d hugepages_total=%u hugepages_free=%u hugepages_surp=%u hugepages_size=%lukB\n", nid, -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 135+ messages in thread
* Re: [DRAFT 1/2] mm/cpuset: Exclude CDM nodes from each task's mems_allowed node mask 2016-11-17 7:59 ` Anshuman Khandual (?) (?) @ 2016-11-17 8:28 ` kbuild test robot -1 siblings, 0 replies; 135+ messages in thread From: kbuild test robot @ 2016-11-17 8:28 UTC (permalink / raw) To: Anshuman Khandual Cc: kbuild-all, linux-kernel, linux-mm, mhocko, js1304, vbabka, mgorman, minchan, akpm, aneesh.kumar, bsingharora [-- Attachment #1: Type: text/plain, Size: 2555 bytes --] Hi Anshuman, [auto build test ERROR on mmotm/master] [also build test ERROR on v4.9-rc5 next-20161117] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Anshuman-Khandual/mm-cpuset-Exclude-CDM-nodes-from-each-task-s-mems_allowed-node-mask/20161117-160736 base: git://git.cmpxchg.org/linux-mmotm.git master config: i386-tinyconfig (attached as .config) compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901 reproduce: # save the attached .config to linux build tree make ARCH=i386 All errors (new ones prefixed by >>): In file included from include/linux/mmzone.h:16:0, from include/linux/gfp.h:5, from include/linux/slab.h:14, from include/linux/crypto.h:24, from arch/x86/kernel/asm-offsets.c:8: include/linux/mm.h: In function 'system_ram': >> include/linux/mm.h:454:61: error: 'N_COHERENT_DEVICE' undeclared (first use in this function) nodes_andnot(ram_nodes, node_states[N_MEMORY], node_states[N_COHERENT_DEVICE]); ^ include/linux/nodemask.h:176:38: note: in definition of macro 'nodes_andnot' __nodes_andnot(&(dst), &(src1), &(src2), MAX_NUMNODES) ^~~~ include/linux/mm.h:454:61: note: each undeclared identifier is reported only once for each function it appears in nodes_andnot(ram_nodes, node_states[N_MEMORY], node_states[N_COHERENT_DEVICE]); ^ include/linux/nodemask.h:176:38: note: in definition of macro 'nodes_andnot' __nodes_andnot(&(dst), &(src1), &(src2), MAX_NUMNODES) ^~~~ make[2]: *** 
[arch/x86/kernel/asm-offsets.s] Error 1 make[2]: Target '__build' not remade because of errors. make[1]: *** [prepare0] Error 2 make[1]: Target 'prepare' not remade because of errors. make: *** [sub-make] Error 2 vim +/N_COHERENT_DEVICE +454 include/linux/mm.h 448 449 static inline nodemask_t system_ram(void) 450 { 451 nodemask_t ram_nodes; 452 453 nodes_clear(ram_nodes); > 454 nodes_andnot(ram_nodes, node_states[N_MEMORY], node_states[N_COHERENT_DEVICE]); 455 return ram_nodes; 456 } 457 --- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation [-- Attachment #2: .config.gz --] [-- Type: application/gzip, Size: 6384 bytes --] ^ permalink raw reply [flat|nested] 135+ messages in thread
* [RFC 3/8] mm: Isolate coherent device memory nodes from HugeTLB allocation paths 2016-10-24 4:31 ` Anshuman Khandual @ 2016-10-24 4:31 ` Anshuman Khandual -1 siblings, 0 replies; 135+ messages in thread From: Anshuman Khandual @ 2016-10-24 4:31 UTC (permalink / raw) To: linux-kernel, linux-mm Cc: mhocko, js1304, vbabka, mgorman, minchan, akpm, aneesh.kumar, bsingharora This change is part of the isolation requiring coherent device memory nodes implementation. Isolation seeking coherent device memory node requires allocation isolation from implicit memory allocations from user space. Towards that effect, the memory should not be used for generic HugeTLB page pool allocations. This modifies relevant functions to skip all coherent memory nodes present on the system during allocation, freeing and auditing for HugeTLB pages. Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> --- mm/hugetlb.c | 38 ++++++++++++++++++++++++++++++++++++-- 1 file changed, 36 insertions(+), 2 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index ec49d9e..466a44c 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1147,6 +1147,9 @@ static int alloc_fresh_gigantic_page(struct hstate *h, int nr_nodes, node; for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) { + if (isolated_cdm_node(node)) + continue; + page = alloc_fresh_gigantic_page_node(h, node); if (page) return 1; @@ -1382,6 +1385,9 @@ static int alloc_fresh_huge_page(struct hstate *h, nodemask_t *nodes_allowed) int ret = 0; for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) { + if (isolated_cdm_node(node)) + continue; + page = alloc_fresh_huge_page_node(h, node); if (page) { ret = 1; @@ -1410,6 +1416,9 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed, int ret = 0; for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) { + if (isolated_cdm_node(node)) + continue; + /* * If we're returning unused surplus pages, only examine * nodes with surplus pages. 
@@ -2028,6 +2037,9 @@ int __weak alloc_bootmem_huge_page(struct hstate *h) for_each_node_mask_to_alloc(h, nr_nodes, node, &node_states[N_MEMORY]) { void *addr; + if (isolated_cdm_node(node)) + continue; + addr = memblock_virt_alloc_try_nid_nopanic( huge_page_size(h), huge_page_size(h), 0, BOOTMEM_ALLOC_ACCESSIBLE, node); @@ -2156,6 +2168,10 @@ static void try_to_free_low(struct hstate *h, unsigned long count, for_each_node_mask(i, *nodes_allowed) { struct page *page, *next; struct list_head *freel = &h->hugepage_freelists[i]; + + if (isolated_cdm_node(i)) + continue; + list_for_each_entry_safe(page, next, freel, lru) { if (count >= h->nr_huge_pages) return; @@ -2189,11 +2205,17 @@ static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed, if (delta < 0) { for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) { + if (isolated_cdm_node(node)) + continue; + if (h->surplus_huge_pages_node[node]) goto found; } } else { for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) { + if (isolated_cdm_node(node)) + continue; + if (h->surplus_huge_pages_node[node] < h->nr_huge_pages_node[node]) goto found; @@ -2666,6 +2688,10 @@ static void __init hugetlb_register_all_nodes(void) for_each_node_state(nid, N_MEMORY) { struct node *node = node_devices[nid]; + + if (isolated_cdm_node(nid)) + continue; + if (node->dev.id == nid) hugetlb_register_node(node); } @@ -2819,8 +2845,12 @@ static unsigned int cpuset_mems_nr(unsigned int *array) int node; unsigned int nr = 0; - for_each_node_mask(node, cpuset_current_mems_allowed) + for_each_node_mask(node, cpuset_current_mems_allowed) { + if (isolated_cdm_node(node)) + continue; + nr += array[node]; + } return nr; } @@ -2940,7 +2970,10 @@ void hugetlb_show_meminfo(void) if (!hugepages_supported()) return; - for_each_node_state(nid, N_MEMORY) + for_each_node_state(nid, N_MEMORY) { + if (isolated_cdm_node(nid)) + continue; + for_each_hstate(h) pr_info("Node %d hugepages_total=%u hugepages_free=%u 
hugepages_surp=%u hugepages_size=%lukB\n", nid, @@ -2948,6 +2981,7 @@ void hugetlb_show_meminfo(void) h->free_huge_pages_node[nid], h->surplus_huge_pages_node[nid], 1UL << (huge_page_order(h) + PAGE_SHIFT - 10)); + } } void hugetlb_report_usage(struct seq_file *m, struct mm_struct *mm) -- 2.1.0 ^ permalink raw reply related [flat|nested] 135+ messages in thread
* Re: [RFC 3/8] mm: Isolate coherent device memory nodes from HugeTLB allocation paths 2016-10-24 4:31 ` Anshuman Khandual @ 2016-10-24 17:16 ` Dave Hansen -1 siblings, 0 replies; 135+ messages in thread From: Dave Hansen @ 2016-10-24 17:16 UTC (permalink / raw) To: Anshuman Khandual, linux-kernel, linux-mm Cc: mhocko, js1304, vbabka, mgorman, minchan, akpm, aneesh.kumar, bsingharora On 10/23/2016 09:31 PM, Anshuman Khandual wrote: > This change is part of the isolation requiring coherent device memory nodes > implementation. > > Isolation seeking coherent device memory node requires allocation isolation > from implicit memory allocations from user space. Towards that effect, the > memory should not be used for generic HugeTLB page pool allocations. This > modifies relevant functions to skip all coherent memory nodes present on > the system during allocation, freeing and auditing for HugeTLB pages. This seems really fragile. You had to hit, what, 18 call sites? What are the odds that this is going to stay working? > @@ -2666,6 +2688,10 @@ static void __init hugetlb_register_all_nodes(void) > > for_each_node_state(nid, N_MEMORY) { > struct node *node = node_devices[nid]; > + > + if (isolated_cdm_node(nid)) > + continue; > + > if (node->dev.id == nid) > hugetlb_register_node(node); > } This looks to be completely kneecapping hugetlbfs on these cdm nodes. Is that really what you want? 
> @@ -2819,8 +2845,12 @@ static unsigned int cpuset_mems_nr(unsigned int *array) > int node; > unsigned int nr = 0; > > - for_each_node_mask(node, cpuset_current_mems_allowed) > + for_each_node_mask(node, cpuset_current_mems_allowed) { > + if (isolated_cdm_node(node)) > + continue; > + > nr += array[node]; > + } > > return nr; > } > @@ -2940,7 +2970,10 @@ void hugetlb_show_meminfo(void) > if (!hugepages_supported()) > return; > > - for_each_node_state(nid, N_MEMORY) > + for_each_node_state(nid, N_MEMORY) { > + if (isolated_cdm_node(nid)) > + continue; > + > for_each_hstate(h) > pr_info("Node %d hugepages_total=%u hugepages_free=%u hugepages_surp=%u hugepages_size=%lukB\n", > nid, > @@ -2948,6 +2981,7 @@ void hugetlb_show_meminfo(void) > h->free_huge_pages_node[nid], > h->surplus_huge_pages_node[nid], > 1UL << (huge_page_order(h) + PAGE_SHIFT - 10)); > + } > } Your patch description talks about removing *implicit* memory allocations. But, this removes even the ability to gather *stats* about huge pages sitting on one of these nodes. That's a lot more drastic than just changing implicit policies. Is that patch description accurate? It looks to me like you just went through all the for_each_node*() loops in hugetlb.c and hacked your node check into them indiscriminately. This totally removes the ability to *do* hugetlb on this nodes. Isn't there some simpler way to do all this, like maybe changing the root cpuset to disallow allocations to these nodes? ^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [RFC 3/8] mm: Isolate coherent device memory nodes from HugeTLB allocation paths 2016-10-24 17:16 ` Dave Hansen @ 2016-10-25 4:15 ` Aneesh Kumar K.V -1 siblings, 0 replies; 135+ messages in thread From: Aneesh Kumar K.V @ 2016-10-25 4:15 UTC (permalink / raw) To: Dave Hansen, Anshuman Khandual, linux-kernel, linux-mm Cc: mhocko, js1304, vbabka, mgorman, minchan, akpm, bsingharora Dave Hansen <dave.hansen@intel.com> writes: > On 10/23/2016 09:31 PM, Anshuman Khandual wrote: >> This change is part of the isolation requiring coherent device memory nodes >> implementation. >> >> Isolation seeking coherent device memory node requires allocation isolation >> from implicit memory allocations from user space. Towards that effect, the >> memory should not be used for generic HugeTLB page pool allocations. This >> modifies relevant functions to skip all coherent memory nodes present on >> the system during allocation, freeing and auditing for HugeTLB pages. > > This seems really fragile. You had to hit, what, 18 call sites? What > are the odds that this is going to stay working? I guess a better approach is to introduce new node_states entry such that we have one that excludes coherent device memory numa nodes. One possibility is to add N_SYSTEM_MEMORY and N_MEMORY. Current N_MEMORY becomes N_SYSTEM_MEMORY and N_MEMORY includes system and device/any other memory which is coherent. All the isolation can then be achieved based on the nodemask_t used for allocation. So for allocations we want to avoid from coherent device we use N_SYSTEM_MEMORY mask or a derivative of that and where we are ok to allocate from CDM with fallbacks we use N_MEMORY. All nodes zonelist will have zones from the coherent device nodes but we will not end up allocating from coherent device node zone due to the node mask used. 
This will also make sure we end up allocating from the correct coherent device numa node in the presence of multiple of them based on the distance of the coherent device node from the current executing numa node. > >> @@ -2666,6 +2688,10 @@ static void __init hugetlb_register_all_nodes(void) >> >> for_each_node_state(nid, N_MEMORY) { >> struct node *node = node_devices[nid]; >> + >> + if (isolated_cdm_node(nid)) >> + continue; >> + >> if (node->dev.id == nid) >> hugetlb_register_node(node); >> } > > This looks to be completely kneecapping hugetlbfs on these cdm nodes. > Is that really what you want? > >> @@ -2819,8 +2845,12 @@ static unsigned int cpuset_mems_nr(unsigned int *array) >> int node; >> unsigned int nr = 0; >> >> - for_each_node_mask(node, cpuset_current_mems_allowed) >> + for_each_node_mask(node, cpuset_current_mems_allowed) { >> + if (isolated_cdm_node(node)) >> + continue; >> + >> nr += array[node]; >> + } >> >> return nr; >> } >> @@ -2940,7 +2970,10 @@ void hugetlb_show_meminfo(void) >> if (!hugepages_supported()) >> return; >> >> - for_each_node_state(nid, N_MEMORY) >> + for_each_node_state(nid, N_MEMORY) { >> + if (isolated_cdm_node(nid)) >> + continue; >> + >> for_each_hstate(h) >> pr_info("Node %d hugepages_total=%u hugepages_free=%u hugepages_surp=%u hugepages_size=%lukB\n", >> nid, >> @@ -2948,6 +2981,7 @@ void hugetlb_show_meminfo(void) >> h->free_huge_pages_node[nid], >> h->surplus_huge_pages_node[nid], >> 1UL << (huge_page_order(h) + PAGE_SHIFT - 10)); >> + } >> } > > Your patch description talks about removing *implicit* memory > allocations. But, this removes even the ability to gather *stats* about > huge pages sitting on one of these nodes. That's a lot more drastic > than just changing implicit policies. > > Is that patch description accurate? > > It looks to me like you just went through all the for_each_node*() loops > in hugetlb.c and hacked your node check into them indiscriminately. 
> This totally removes the ability to *do* hugetlb on this nodes. > > Isn't there some simpler way to do all this, like maybe changing the > root cpuset to disallow allocations to these nodes? -aneesh ^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [RFC 3/8] mm: Isolate coherent device memory nodes from HugeTLB allocation paths 2016-10-25 4:15 ` Aneesh Kumar K.V @ 2016-10-25 7:17 ` Balbir Singh -1 siblings, 0 replies; 135+ messages in thread From: Balbir Singh @ 2016-10-25 7:17 UTC (permalink / raw) To: Aneesh Kumar K.V, Dave Hansen, Anshuman Khandual, linux-kernel, linux-mm Cc: mhocko, js1304, vbabka, mgorman, minchan, akpm On 25/10/16 15:15, Aneesh Kumar K.V wrote: > Dave Hansen <dave.hansen@intel.com> writes: > >> On 10/23/2016 09:31 PM, Anshuman Khandual wrote: >>> This change is part of the isolation requiring coherent device memory nodes >>> implementation. >>> >>> Isolation seeking coherent device memory node requires allocation isolation >>> from implicit memory allocations from user space. Towards that effect, the >>> memory should not be used for generic HugeTLB page pool allocations. This >>> modifies relevant functions to skip all coherent memory nodes present on >>> the system during allocation, freeing and auditing for HugeTLB pages. >> >> This seems really fragile. You had to hit, what, 18 call sites? What >> are the odds that this is going to stay working? > > > I guess a better approach is to introduce new node_states entry such > that we have one that excludes coherent device memory numa nodes. One > possibility is to add N_SYSTEM_MEMORY and N_MEMORY. > > Current N_MEMORY becomes N_SYSTEM_MEMORY and N_MEMORY includes > system and device/any other memory which is coherent. > I thought of this as well, but I would rather see N_COHERENT_MEMORY as a flag. The idea being that some device memory is a part of N_MEMORY, but N_COHERENT_MEMORY gives it additional attributes > All the isolation can then be achieved based on the nodemask_t used for > allocation. So for allocations we want to avoid from coherent device we > use N_SYSTEM_MEMORY mask or a derivative of that and where we are ok to > allocate from CDM with fallbacks we use N_MEMORY. 
> I suspect it's going to be easier to exclude N_COHERENT_MEMORY. > All nodes zonelist will have zones from the coherent device nodes but we > will not end up allocating from coherent device node zone due to the > node mask used. > > > This will also make sure we end up allocating from the correct coherent > device numa node in the presence of multiple of them based on the > distance of the coherent device node from the current executing numa > node. > The idea is good overall, but I think it's going to be good to document > the exclusions with the flags. Balbir Singh. ^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [RFC 3/8] mm: Isolate coherent device memory nodes from HugeTLB allocation paths 2016-10-25 7:17 ` Balbir Singh @ 2016-10-25 7:25 ` Balbir Singh -1 siblings, 0 replies; 135+ messages in thread From: Balbir Singh @ 2016-10-25 7:25 UTC (permalink / raw) To: Aneesh Kumar K.V, Dave Hansen, Anshuman Khandual, linux-kernel, linux-mm Cc: mhocko, js1304, vbabka, mgorman, minchan, akpm On 25/10/16 18:17, Balbir Singh wrote: > > > On 25/10/16 15:15, Aneesh Kumar K.V wrote: >> Dave Hansen <dave.hansen@intel.com> writes: >> >>> On 10/23/2016 09:31 PM, Anshuman Khandual wrote: >>>> This change is part of the isolation requiring coherent device memory nodes >>>> implementation. >>>> >>>> Isolation seeking coherent device memory node requires allocation isolation >>>> from implicit memory allocations from user space. Towards that effect, the >>>> memory should not be used for generic HugeTLB page pool allocations. This >>>> modifies relevant functions to skip all coherent memory nodes present on >>>> the system during allocation, freeing and auditing for HugeTLB pages. >>> >>> This seems really fragile. You had to hit, what, 18 call sites? What >>> are the odds that this is going to stay working? >> >> >> I guess a better approach is to introduce new node_states entry such >> that we have one that excludes coherent device memory numa nodes. One >> possibility is to add N_SYSTEM_MEMORY and N_MEMORY. >> >> Current N_MEMORY becomes N_SYSTEM_MEMORY and N_MEMORY includes >> system and device/any other memory which is coherent. >> > > I thought of this as well, but I would rather see N_COHERENT_MEMORY > as a flag. The idea being that some device memory is a part of > N_MEMORY, but N_COHERENT_MEMORY gives it additional attributes > >> All the isolation can then be achieved based on the nodemask_t used for >> allocation. 
So for allocations we want to avoid from coherent device we >> use N_SYSTEM_MEMORY mask or a derivative of that and where we are ok to >> allocate from CDM with fallbacks we use N_MEMORY. >> > > I suspect its going to be easier to exclude N_COHERENT_MEMORY. > >> All nodes zonelist will have zones from the coherent device nodes but we >> will not end up allocating from coherent device node zone due to the >> node mask used. >> >> >> This will also make sure we end up allocating from the correct coherent >> device numa node in the presence of multiple of them based on the >> distance of the coherent device node from the current executing numa >> node. >> > > The idea is good overall, but I think its going to be good to document > the exclusions with the flags > FWIW, some of this is present in 8/8 Balbir ^ permalink raw reply [flat|nested] 135+ messages in thread
* [RFC 4/8] mm: Accommodate coherent device memory nodes in MPOL_BIND implementation 2016-10-24 4:31 ` Anshuman Khandual @ 2016-10-24 4:31 ` Anshuman Khandual -1 siblings, 0 replies; 135+ messages in thread From: Anshuman Khandual @ 2016-10-24 4:31 UTC (permalink / raw) To: linux-kernel, linux-mm Cc: mhocko, js1304, vbabka, mgorman, minchan, akpm, aneesh.kumar, bsingharora This change is part of the isolation requiring coherent device memory nodes implementation. Currently MPOL_MBIND interface simply fails on a coherent device memory node after the zonelist changes introduced earlier. Without __GFP_THISNODE flag, the first node of the nodemask will not be selected in the case where the local node (where the application is executing) is not part of the user provided nodemask for MPOL_MBIND. This will be the case for coherent memory nodes which are always CPU less. This changes the mbind() system call implementation so that memory can be allocated from coherent memory node through MPOL_MBIND interface. Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> --- mm/mempolicy.c | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+) diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 0b859af..cb1ba01 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1694,6 +1694,26 @@ static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy, if (unlikely(gfp & __GFP_THISNODE) && unlikely(!node_isset(nd, policy->v.nodes))) nd = first_node(policy->v.nodes); + +#ifdef CONFIG_COHERENT_DEVICE + /* + * Coherent device memory + * + * In case the local node is not part of the nodemask, test if + * the first node in the nodemask is a coherent device memory + * node in which case select it. + * + * FIXME: The check will be restricted to the first node of the + * nodemask or scan through the nodemask to select any present + * coherent device memory node on it or select the first one if + * all of the nodes in the nodemask are coherent device memory. 
+ * These are various approaches possible. + */ + if (unlikely(!node_isset(nd, policy->v.nodes))) { + if (isolated_cdm_node(first_node(policy->v.nodes))) + nd = first_node(policy->v.nodes); + } +#endif break; default: BUG(); -- 2.1.0 ^ permalink raw reply related [flat|nested] 135+ messages in thread
* [RFC 5/8] mm: Add new flag VM_CDM for coherent device memory 2016-10-24 4:31 ` Anshuman Khandual @ 2016-10-24 4:31 ` Anshuman Khandual -1 siblings, 0 replies; 135+ messages in thread From: Anshuman Khandual @ 2016-10-24 4:31 UTC (permalink / raw) To: linux-kernel, linux-mm Cc: mhocko, js1304, vbabka, mgorman, minchan, akpm, aneesh.kumar, bsingharora VMAs containing coherent device memory should be marked with VM_CDM. These VMAs need to be identified in various core kernel paths and this new flag will help in this regard. Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> --- include/linux/mm.h | 5 +++++ mm/mempolicy.c | 43 +++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 48 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index 3a19185..acee4d1 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -182,6 +182,11 @@ extern unsigned int kobjsize(const void *objp); #define VM_ACCOUNT 0x00100000 /* Is a VM accounted object */ #define VM_NORESERVE 0x00200000 /* should the VM suppress accounting */ #define VM_HUGETLB 0x00400000 /* Huge TLB Page VM */ + +#ifdef CONFIG_COHERENT_DEVICE +#define VM_CDM 0x00800000 /* Contains coherent device memory */ +#endif + #define VM_ARCH_1 0x01000000 /* Architecture-specific flag */ #define VM_ARCH_2 0x02000000 #define VM_DONTDUMP 0x04000000 /* Do not include in the core dump */ diff --git a/mm/mempolicy.c b/mm/mempolicy.c index cb1ba01..b983cea 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -174,6 +174,47 @@ static void mpol_relative_nodemask(nodemask_t *ret, const nodemask_t *orig, nodes_onto(*ret, tmp, *rel); } +#ifdef CONFIG_COHERENT_DEVICE +static bool nodemask_contains_cdm(nodemask_t *nodes) +{ + int weight, nid, i; + nodemask_t mask; + + + if (!nodes) + return false; + + mask = *nodes; + weight = nodes_weight(mask); + nid = first_node(mask); + for (i = 0; i < weight; i++) { + if (isolated_cdm_node(nid)) + return true; + nid = next_node(nid, mask); + } + return false; +} + 
+static void update_coherent_vma_flag(nodemask_t *nmask, + struct page *page, struct vm_area_struct *vma) +{ + if (!page) + return; + + if (nodemask_contains_cdm(nmask)) { + if (!(vma->vm_flags & VM_CDM)) { + if (isolated_cdm_node(page_to_nid(page))) + vma->vm_flags |= VM_CDM; + } + } +} +#else +static void update_coherent_vma_flag(nodemask_t *nmask, + struct page *page, struct vm_area_struct *vma) +{ +} +#endif + static int mpol_new_interleave(struct mempolicy *pol, const nodemask_t *nodes) { if (nodes_empty(*nodes)) @@ -2045,6 +2086,8 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, zl = policy_zonelist(gfp, pol, node); mpol_cond_put(pol); page = __alloc_pages_nodemask(gfp, order, zl, nmask); + update_coherent_vma_flag(nmask, page, vma); + out: if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie))) goto retry_cpuset; -- 2.1.0 ^ permalink raw reply related [flat|nested] 135+ messages in thread
* Re: [RFC 5/8] mm: Add new flag VM_CDM for coherent device memory 2016-10-24 4:31 ` Anshuman Khandual @ 2016-10-24 17:38 ` Dave Hansen -1 siblings, 0 replies; 135+ messages in thread From: Dave Hansen @ 2016-10-24 17:38 UTC (permalink / raw) To: Anshuman Khandual, linux-kernel, linux-mm Cc: mhocko, js1304, vbabka, mgorman, minchan, akpm, aneesh.kumar, bsingharora On 10/23/2016 09:31 PM, Anshuman Khandual wrote: > VMAs containing coherent device memory should be marked with VM_CDM. These > VMAs need to be identified in various core kernel paths and this new flag > will help in this regard. ... and it's sticky? So if a VMA *ever* has one of these funky pages in it, it's stuck being VM_CDM forever? Never to be merged with other VMAs? Never to see the light of autonuma ever again? What if a 100TB VMA has one page of fancy pants device memory, and the rest normal vanilla memory? Do we really want to consider the whole thing fancy? This whole patch set is looking really hackish. If you want things to be isolated from the VM, them it should probably *actually* be isolated from the VM. As Jerome mentioned, ZONE_DEVICE is probably a better thing to use here than to try what you're attempting. ^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [RFC 5/8] mm: Add new flag VM_CDM for coherent device memory
  2016-10-24 17:38 ` Dave Hansen
@ 2016-10-24 18:00   ` Dave Hansen
  0 siblings, 0 replies; 135+ messages in thread
From: Dave Hansen @ 2016-10-24 18:00 UTC (permalink / raw)
  To: Anshuman Khandual, linux-kernel, linux-mm
  Cc: mhocko, js1304, vbabka, mgorman, minchan, akpm, aneesh.kumar, bsingharora

On 10/24/2016 10:38 AM, Dave Hansen wrote:
> On 10/23/2016 09:31 PM, Anshuman Khandual wrote:
>> VMAs containing coherent device memory should be marked with VM_CDM. These
>> VMAs need to be identified in various core kernel paths and this new flag
>> will help in this regard.
>
> ... and it's sticky?  So if a VMA *ever* has one of these funky pages in
> it, it's stuck being VM_CDM forever?  Never to be merged with other
> VMAs?  Never to see the light of autonuma ever again?

Urg, this is even worse than I suspected.  Does this handle shared pages
(like the page cache mode you call out as a requirement) where the "cdm"
page is faulted into one process VMA, but it was allocated against
another?

Can't that give you a "cdm" page mapped into a non-VM_CDM VMA?  Or, a
VM_CDM VMA with no "cdm" pages in it?

^ permalink raw reply	[flat|nested] 135+ messages in thread
* Re: [RFC 5/8] mm: Add new flag VM_CDM for coherent device memory
  2016-10-24 17:38 ` Dave Hansen
@ 2016-10-25 12:36   ` Balbir Singh
  0 siblings, 0 replies; 135+ messages in thread
From: Balbir Singh @ 2016-10-25 12:36 UTC (permalink / raw)
  To: Dave Hansen, Anshuman Khandual, linux-kernel, linux-mm
  Cc: mhocko, js1304, vbabka, mgorman, minchan, akpm, aneesh.kumar

On 25/10/16 04:38, Dave Hansen wrote:
> On 10/23/2016 09:31 PM, Anshuman Khandual wrote:
>> VMAs containing coherent device memory should be marked with VM_CDM. These
>> VMAs need to be identified in various core kernel paths and this new flag
>> will help in this regard.
>
> ... and it's sticky?  So if a VMA *ever* has one of these funky pages in
> it, it's stuck being VM_CDM forever?  Never to be merged with other
> VMAs?  Never to see the light of autonuma ever again?
>
> What if a 100TB VMA has one page of fancy pants device memory, and the
> rest normal vanilla memory?  Do we really want to consider the whole
> thing fancy?

Those are good review comments to improve the patch set.

> This whole patch set is looking really hackish.  If you want things to
> be isolated from the VM, then it should probably *actually* be isolated
> from the VM.  As Jerome mentioned, ZONE_DEVICE is probably a better
> thing to use here than to try what you're attempting.

The RFC explains the motivation. This is not fancy pants; it is regular
memory from the system's perspective, with some changes as described.

Thanks for the review!
Balbir Singh

^ permalink raw reply	[flat|nested] 135+ messages in thread
* Re: [RFC 5/8] mm: Add new flag VM_CDM for coherent device memory
  2016-10-24 17:38 ` Dave Hansen
@ 2016-10-25 19:20   ` Aneesh Kumar K.V
  0 siblings, 0 replies; 135+ messages in thread
From: Aneesh Kumar K.V @ 2016-10-25 19:20 UTC (permalink / raw)
  To: Dave Hansen, Anshuman Khandual, linux-kernel, linux-mm
  Cc: mhocko, js1304, vbabka, mgorman, minchan, akpm, bsingharora

Dave Hansen <dave.hansen@intel.com> writes:
> On 10/23/2016 09:31 PM, Anshuman Khandual wrote:
>> VMAs containing coherent device memory should be marked with VM_CDM. These
>> VMAs need to be identified in various core kernel paths and this new flag
>> will help in this regard.
>
> ... and it's sticky?  So if a VMA *ever* has one of these funky pages in
> it, it's stuck being VM_CDM forever?  Never to be merged with other
> VMAs?  Never to see the light of autonuma ever again?
>
> What if a 100TB VMA has one page of fancy pants device memory, and the
> rest normal vanilla memory?  Do we really want to consider the whole
> thing fancy?

This definitely needs fine tuning. I guess we should look at this as
possibly stating that a coherent device would like not to participate in
auto NUMA balancing, because it is difficult to update the core kernel
about access patterns within the coherent device. Otherwise the core
kernel can end up always trying to migrate pages from the coherent
device to system RAM even though we have a large number of accesses
within the coherent device.

One possible option is to use a software pte bit (maybe steal
_PAGE_DEVMAP) and prevent a NUMA pte setup from change_prot_numa(),
i.e. if the pfn backing the pte is from a coherent device, we don't
allow it to be converted to a prot-none pte for NUMA faults.

-aneesh

^ permalink raw reply	[flat|nested] 135+ messages in thread
* Re: [RFC 5/8] mm: Add new flag VM_CDM for coherent device memory
  2016-10-25 19:20 ` Aneesh Kumar K.V
@ 2016-10-25 20:01   ` Dave Hansen
  0 siblings, 0 replies; 135+ messages in thread
From: Dave Hansen @ 2016-10-25 20:01 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Anshuman Khandual, linux-kernel, linux-mm
  Cc: mhocko, js1304, vbabka, mgorman, minchan, akpm, bsingharora

On 10/25/2016 12:20 PM, Aneesh Kumar K.V wrote:
> Dave Hansen <dave.hansen@intel.com> writes:
>> On 10/23/2016 09:31 PM, Anshuman Khandual wrote:
>>> VMAs containing coherent device memory should be marked with VM_CDM. These
>>> VMAs need to be identified in various core kernel paths and this new flag
>>> will help in this regard.
>>
>> ... and it's sticky?  So if a VMA *ever* has one of these funky pages in
>> it, it's stuck being VM_CDM forever?  Never to be merged with other
>> VMAs?  Never to see the light of autonuma ever again?
>>
>> What if a 100TB VMA has one page of fancy pants device memory, and the
>> rest normal vanilla memory?  Do we really want to consider the whole
>> thing fancy?
>
> This definitely needs fine tuning. I guess we should look at this as
> possibly stating that, coherent device would like to not participate in
> auto numa balancing ...

Right, in this one, particular case you don't want NUMA balancing.  But,
if you have to take an _explicit_ action to even get access to this
coherent memory (setting a NUMA policy), what keeps that explicit action
from also explicitly disabling NUMA migration?

I really don't think we should tie together the isolation aspect with
anything else, including NUMA balancing.  For instance, on x86, we have
the ability for devices to grok the CPU's page tables, including doing
faults.  There's very little to stop us from doing things like autonuma.

> One possible option is to use a software pte bit (may be steal
> _PAGE_DEVMAP) and prevent a numa pte setup from change_prot_numa().
> ie, if the pfn backing the pte is from coherent device we don't allow
> that to be converted to a prot none pte for numa faults ?

Why would you need to tag individual pages, especially if the VMA has a
policy set on it that disallows migration?

But, even if you did need to identify individual pages from the PTE, you
can easily do:

	page_to_nid(pfn_to_page(pte_pfn(pte)))

and then tell if the node is a fancy-pants device node.

^ permalink raw reply	[flat|nested] 135+ messages in thread
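The per-page check Dave sketches (resolve the pfn behind a pte to its node, then test a per-node property) can be modeled in miniature outside the kernel. This is a hypothetical user-space sketch, not kernel code: `pfn_to_nid[]` stands in for `page_to_nid(pfn_to_page(...))` and `node_is_cdm[]` stands in for a per-node coherent-device flag such as `isolated_cdm_node()`.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: every pfn belongs to a NUMA node; node 2 is a CDM node.
 * Both tables are invented for this sketch. */
#define MAX_PFN 8
static const int  pfn_to_nid[MAX_PFN] = { 0, 0, 0, 0, 1, 1, 2, 2 };
static const bool node_is_cdm[3]      = { false, false, true };

/* Stand-in for page_to_nid(pfn_to_page(pte_pfn(pte))) followed by a
 * device-node test: map pfn -> node, then test the node's CDM bit. */
static bool pfn_backed_by_cdm(unsigned long pfn)
{
	int nid = pfn_to_nid[pfn];	/* "page_to_nid()" */

	return node_is_cdm[nid];	/* "is this a device node?" */
}
```

The point of the suggestion is that no extra per-pte software bit is needed: the node id recoverable from the pfn already identifies device memory.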
* [RFC 6/8] mm: Make VM_CDM marked VMAs non migratable
  2016-10-24  4:31 ` Anshuman Khandual
@ 2016-10-24  4:31   ` Anshuman Khandual
  0 siblings, 0 replies; 135+ messages in thread
From: Anshuman Khandual @ 2016-10-24 4:31 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mhocko, js1304, vbabka, mgorman, minchan, akpm, aneesh.kumar, bsingharora

Auto NUMA does a migratability check on any given VMA before scanning it
for marking purposes. For now, if coherent device memory has been faulted
in or migrated into a process VMA, that VMA should not be part of the auto
NUMA migration scheme. The check is based on the VM_CDM flag.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 include/linux/mempolicy.h | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 5e5b296..09d4b70 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -171,9 +171,26 @@ extern int mpol_parse_str(char *str, struct mempolicy **mpol);
 
 extern void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol);
 
+#ifdef CONFIG_COHERENT_DEVICE
+static bool is_cdm_vma(struct vm_area_struct *vma)
+{
+	if (vma->vm_flags & VM_CDM)
+		return true;
+	return false;
+}
+#else
+static bool is_cdm_vma(struct vm_area_struct *vma)
+{
+	return false;
+}
+#endif
+
 /* Check if a vma is migratable */
 static inline bool vma_migratable(struct vm_area_struct *vma)
 {
+	if (is_cdm_vma(vma))
+		return false;
+
 	if (vma->vm_flags & (VM_IO | VM_PFNMAP))
 		return false;
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 135+ messages in thread
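The effect of the hunk above — an early flag test that short-circuits `vma_migratable()` before the existing `VM_IO`/`VM_PFNMAP` test — can be exercised with plain flag arithmetic. A minimal user-space sketch; the `VM_CDM` bit value is made up for illustration (the RFC does not fix one here), while `VM_IO` and `VM_PFNMAP` use their usual values:

```c
#include <assert.h>
#include <stdbool.h>

#define VM_PFNMAP 0x00000400UL
#define VM_IO     0x00004000UL
#define VM_CDM    0x00800000UL	/* hypothetical bit for this sketch */

struct vm_area_struct { unsigned long vm_flags; };

static bool is_cdm_vma(const struct vm_area_struct *vma)
{
	return (vma->vm_flags & VM_CDM) != 0;
}

/* Mirrors the patched vma_migratable(): a CDM VMA is rejected first,
 * so auto NUMA never queues its pages for migration. */
static bool vma_migratable(const struct vm_area_struct *vma)
{
	if (is_cdm_vma(vma))
		return false;
	if (vma->vm_flags & (VM_IO | VM_PFNMAP))
		return false;
	return true;
}
```

This is also exactly the "stickiness" Dave objects to earlier in the thread: once `VM_CDM` is set, the whole VMA is opted out, regardless of how many of its pages actually live on the device node.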
* [RFC 7/8] mm: Add a new migration function migrate_virtual_range()
  2016-10-24  4:31 ` Anshuman Khandual
@ 2016-10-24  4:31   ` Anshuman Khandual
  0 siblings, 0 replies; 135+ messages in thread
From: Anshuman Khandual @ 2016-10-24 4:31 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mhocko, js1304, vbabka, mgorman, minchan, akpm, aneesh.kumar, bsingharora

This adds a new virtual address range based migration interface which can
migrate all the mapped pages from a virtual range of a process to a
destination node. This also exports this new function symbol.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 include/linux/mempolicy.h |  7 ++++
 include/linux/migrate.h   |  3 ++
 mm/mempolicy.c            |  7 ++--
 mm/migrate.c              | 84 +++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 96 insertions(+), 5 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 09d4b70..f18c0ea 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -152,6 +152,9 @@ extern bool init_nodemask_of_mempolicy(nodemask_t *mask);
 extern bool mempolicy_nodemask_intersects(struct task_struct *tsk,
				const nodemask_t *mask);
 extern unsigned int mempolicy_slab_node(void);
+extern int queue_pages_range(struct mm_struct *mm, unsigned long start,
+			unsigned long end, nodemask_t *nodes,
+			unsigned long flags, struct list_head *pagelist);
 
 extern enum zone_type policy_zone;
 
@@ -319,4 +322,8 @@ static inline void mpol_put_task_policy(struct task_struct *task)
 {
 }
 #endif /* CONFIG_NUMA */
+
+#define MPOL_MF_DISCONTIG_OK (MPOL_MF_INTERNAL << 0)	/* Skip checks for continuous vmas */
+#define MPOL_MF_INVERT (MPOL_MF_INTERNAL << 1)		/* Invert check for nodemask */
+
 #endif
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index ae8d475..e2a1af5 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -49,6 +49,9 @@ extern int migrate_page_move_mapping(struct address_space *mapping,
		struct page *newpage, struct page *page,
		struct buffer_head *head, enum migrate_mode mode,
		int extra_count);
+
+extern int migrate_virtual_range(int pid, unsigned long vaddr,
+		unsigned long size, int nid);
 #else
 
 static inline void putback_movable_pages(struct list_head *l) {}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index b983cea..aa8479b 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -100,10 +100,6 @@
 
 #include "internal.h"
 
-/* Internal flags */
-#define MPOL_MF_DISCONTIG_OK (MPOL_MF_INTERNAL << 0)	/* Skip checks for continuous vmas */
-#define MPOL_MF_INVERT (MPOL_MF_INTERNAL << 1)		/* Invert check for nodemask */
-
 static struct kmem_cache *policy_cache;
 static struct kmem_cache *sn_cache;
 
@@ -703,7 +699,7 @@ static int queue_pages_test_walk(unsigned long start, unsigned long end,
 * @nodes and @flags,) it's isolated and queued to the pagelist which is
 * passed via @private.)
 */
-static int
+int
 queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end,
		nodemask_t *nodes, unsigned long flags,
		struct list_head *pagelist)
@@ -724,6 +720,7 @@ queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end,
 
	return walk_page_range(start, end, &queue_pages_walk);
 }
+EXPORT_SYMBOL(queue_pages_range);
 
 /*
 * Apply policy to a single VMA
diff --git a/mm/migrate.c b/mm/migrate.c
index 99250ae..06300bb 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1367,6 +1367,90 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
	return rc;
 }
 
+static struct page *new_node_page(struct page *page,
+			unsigned long node, int **x)
+{
+	return __alloc_pages_node(node, GFP_HIGHUSER_MOVABLE
+			| __GFP_THISNODE, 0);
+}
+
+#ifdef CONFIG_COHERENT_DEVICE
+static void mark_vma_cdm(struct vm_area_struct *vma)
+{
+	vma->vm_flags |= VM_CDM;
+}
+#else
+static void mark_vma_cdm(struct vm_area_struct *vma) {}
+#endif
+
+/*
+ * migrate_virtual_range - migrate all the pages faulted within a virtual
+ * address range to a specified node.
+ *
+ * @pid:	PID of the task
+ * @start:	Virtual address range beginning
+ * @end:	Virtual address range end
+ * @nid:	Target migration node
+ *
+ * The function first scans the process VMA list to find out the VMA which
+ * contains the given virtual range. Then validates that the virtual range
+ * is within the given VMA's limits.
+ *
+ * Returns the number of pages that were not migrated or an error code.
+ */
+int migrate_virtual_range(int pid, unsigned long start,
+		unsigned long end, int nid)
+{
+	struct mm_struct *mm;
+	struct vm_area_struct *vma;
+	nodemask_t nmask;
+	int ret = -EINVAL;
+
+	LIST_HEAD(mlist);
+
+	nodes_clear(nmask);
+	nodes_setall(nmask);
+
+	if ((!start) || (!end))
+		return -EINVAL;
+
+	rcu_read_lock();
+	mm = find_task_by_vpid(pid)->mm;
+	rcu_read_unlock();
+
+	start &= PAGE_MASK;
+	end &= PAGE_MASK;
+
+	down_write(&mm->mmap_sem);
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		if ((start < vma->vm_start) || (end > vma->vm_end))
+			continue;
+
+		ret = queue_pages_range(mm, start, end, &nmask, MPOL_MF_MOVE_ALL
+				| MPOL_MF_DISCONTIG_OK, &mlist);
+		if (ret) {
+			putback_movable_pages(&mlist);
+			break;
+		}
+
+		if (list_empty(&mlist)) {
+			ret = -ENOMEM;
+			break;
+		}
+
+		ret = migrate_pages(&mlist, new_node_page, NULL, nid,
+				MIGRATE_SYNC, MR_COMPACTION);
+		if (ret) {
+			putback_movable_pages(&mlist);
+		} else {
+			if (isolated_cdm_node(nid))
+				mark_vma_cdm(vma);
+		}
+	}
+	up_write(&mm->mmap_sem);
+	return ret;
+}
+EXPORT_SYMBOL(migrate_virtual_range);
+
 #ifdef CONFIG_NUMA
 /*
 * Move a list of individual pages
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 135+ messages in thread
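The core control flow of migrate_virtual_range() is a VMA-list walk: skip every VMA that does not fully contain [start, end), and act on the first one that does. That containment test can be exercised on its own with a toy singly linked VMA list (all names here are stand-ins for this sketch, not kernel API):

```c
#include <assert.h>
#include <stddef.h>

struct toy_vma {
	unsigned long vm_start, vm_end;	/* [vm_start, vm_end) */
	struct toy_vma *vm_next;
};

/* Return the first VMA whose range fully contains [start, end),
 * mirroring the test in the patch: a VMA is skipped whenever
 * start < vm_start or end > vm_end. */
static struct toy_vma *find_containing_vma(struct toy_vma *head,
					   unsigned long start,
					   unsigned long end)
{
	struct toy_vma *vma;

	for (vma = head; vma; vma = vma->vm_next) {
		if (start < vma->vm_start || end > vma->vm_end)
			continue;
		return vma;
	}
	return NULL;
}
```

One consequence worth noting: a range that straddles two adjacent VMAs matches neither, so the function falls through with -EINVAL rather than migrating a partial range.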
* [RFC 8/8] mm: Add N_COHERENT_DEVICE node type into node_states[]
  2016-10-24  4:31 ` Anshuman Khandual
@ 2016-10-24  4:31   ` Anshuman Khandual
  0 siblings, 0 replies; 135+ messages in thread
From: Anshuman Khandual @ 2016-10-24 4:31 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mhocko, js1304, vbabka, mgorman, minchan, akpm, aneesh.kumar, bsingharora

Add a new member N_COHERENT_DEVICE to the node_states[] nodemask array to
enlist all those nodes which contain only coherent device memory. Also
create a new sysfs interface /sys/devices/system/node/is_coherent_device
to list all those nodes which have coherent device memory.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 Documentation/ABI/stable/sysfs-devices-node |  7 +++++++
 drivers/base/node.c                         |  6 ++++++
 include/linux/nodemask.h                    |  3 +++
 mm/memory_hotplug.c                         | 10 ++++++++++
 4 files changed, 26 insertions(+)

diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
index 5b2d0f0..5538791 100644
--- a/Documentation/ABI/stable/sysfs-devices-node
+++ b/Documentation/ABI/stable/sysfs-devices-node
@@ -29,6 +29,13 @@ Description:
		Nodes that have regular or high memory.
		Depends on CONFIG_HIGHMEM.
 
+What:		/sys/devices/system/node/is_coherent_device
+Date:		October 2016
+Contact:	Linux Memory Management list <linux-mm@kvack.org>
+Description:
+		Lists the nodemask of nodes that have coherent memory.
+		Depends on CONFIG_COHERENT_DEVICE.
+
 What:		/sys/devices/system/node/nodeX
 Date:		October 2002
 Contact:	Linux Memory Management list <linux-mm@kvack.org>
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 5548f96..5b5dd89 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -661,6 +661,9 @@ static struct node_attr node_state_attr[] = {
	[N_MEMORY] = _NODE_ATTR(has_memory, N_MEMORY),
 #endif
	[N_CPU] = _NODE_ATTR(has_cpu, N_CPU),
+#ifdef CONFIG_COHERENT_DEVICE
+	[N_COHERENT_DEVICE] = _NODE_ATTR(is_coherent_device, N_COHERENT_DEVICE),
+#endif
 };
 
 static struct attribute *node_state_attrs[] = {
@@ -674,6 +677,9 @@ static struct attribute *node_state_attrs[] = {
	&node_state_attr[N_MEMORY].attr.attr,
 #endif
	&node_state_attr[N_CPU].attr.attr,
+#ifdef CONFIG_COHERENT_DEVICE
+	&node_state_attr[N_COHERENT_DEVICE].attr.attr,
+#endif
	NULL
 };
diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index f746e44..605cb0d 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -393,6 +393,9 @@ enum node_states {
	N_MEMORY = N_HIGH_MEMORY,
 #endif
	N_CPU,		/* The node has one or more cpus */
+#ifdef CONFIG_COHERENT_DEVICE
+	N_COHERENT_DEVICE,	/* The node has coherent device memory */
+#endif
	NR_NODE_STATES
 };
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 9629273..8f03962 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1044,6 +1044,11 @@ static void node_states_set_node(int node, struct memory_notify *arg)
	if (arg->status_change_nid_high >= 0)
		node_set_state(node, N_HIGH_MEMORY);
 
+#ifdef CONFIG_COHERENT_DEVICE
+	if (isolated_cdm_node(node))
+		node_set_state(node, N_COHERENT_DEVICE);
+#endif
+
	node_set_state(node, N_MEMORY);
 }
 
@@ -1858,6 +1863,11 @@ static void node_states_clear_node(int node, struct memory_notify *arg)
	if ((N_MEMORY != N_HIGH_MEMORY) &&
	    (arg->status_change_nid >= 0))
		node_clear_state(node, N_MEMORY);
+
+#ifdef CONFIG_COHERENT_DEVICE
+	if (isolated_cdm_node(node))
+		node_clear_state(node, N_COHERENT_DEVICE);
+#endif
 }
 
 static int __ref __offline_pages(unsigned long start_pfn,
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 135+ messages in thread
* Re: [RFC 8/8] mm: Add N_COHERENT_DEVICE node type into node_states[]
  2016-10-24  4:31 ` Anshuman Khandual
  @ 2016-10-25  7:22 ` Balbir Singh  (0 siblings, 0 replies; 135+ messages in thread)
From: Balbir Singh @ 2016-10-25 7:22 UTC (permalink / raw)
To: Anshuman Khandual, linux-kernel, linux-mm
Cc: mhocko, js1304, vbabka, mgorman, minchan, akpm, aneesh.kumar

On 24/10/16 15:31, Anshuman Khandual wrote:
> Add a new member, N_COHERENT_DEVICE, to the node_states[] nodemask array to
> track all those nodes which contain only coherent device memory. Also create
> a new sysfs interface, /sys/devices/system/node/is_coherent_device, to list
> all those nodes which have coherent device memory.
>
> Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
> ---
>  Documentation/ABI/stable/sysfs-devices-node |  7 +++++++
>  drivers/base/node.c                         |  6 ++++++
>  include/linux/nodemask.h                    |  3 +++
>  mm/memory_hotplug.c                         | 10 ++++++++++
>  4 files changed, 26 insertions(+)
>
> diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
> index 5b2d0f0..5538791 100644
> --- a/Documentation/ABI/stable/sysfs-devices-node
> +++ b/Documentation/ABI/stable/sysfs-devices-node
> @@ -29,6 +29,13 @@ Description:
>  		Nodes that have regular or high memory.
>  		Depends on CONFIG_HIGHMEM.
>
> +What:		/sys/devices/system/node/is_coherent_device
> +Date:		October 2016
> +Contact:	Linux Memory Management list <linux-mm@kvack.org>
> +Description:
> +		Lists the nodemask of nodes that have coherent memory.
> +		Depends on CONFIG_COHERENT_DEVICE.
> +
>  What:		/sys/devices/system/node/nodeX
>  Date:		October 2002
>  Contact:	Linux Memory Management list <linux-mm@kvack.org>
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 5548f96..5b5dd89 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -661,6 +661,9 @@ static struct node_attr node_state_attr[] = {
>  	[N_MEMORY] = _NODE_ATTR(has_memory, N_MEMORY),
>  #endif
>  	[N_CPU] = _NODE_ATTR(has_cpu, N_CPU),
> +#ifdef CONFIG_COHERENT_DEVICE
> +	[N_COHERENT_DEVICE] = _NODE_ATTR(is_coherent_device, N_COHERENT_DEVICE),
> +#endif
>  };
>
>  static struct attribute *node_state_attrs[] = {
> @@ -674,6 +677,9 @@ static struct attribute *node_state_attrs[] = {
>  	&node_state_attr[N_MEMORY].attr.attr,
>  #endif
>  	&node_state_attr[N_CPU].attr.attr,
> +#ifdef CONFIG_COHERENT_DEVICE
> +	&node_state_attr[N_COHERENT_DEVICE].attr.attr,
> +#endif
>  	NULL
>  };
>
> diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
> index f746e44..605cb0d 100644
> --- a/include/linux/nodemask.h
> +++ b/include/linux/nodemask.h
> @@ -393,6 +393,9 @@ enum node_states {
>  	N_MEMORY = N_HIGH_MEMORY,
>  #endif
>  	N_CPU,		/* The node has one or more cpus */
> +#ifdef CONFIG_COHERENT_DEVICE
> +	N_COHERENT_DEVICE,	/* The node has coherent device memory */
> +#endif
>  	NR_NODE_STATES
>  };
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 9629273..8f03962 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1044,6 +1044,11 @@ static void node_states_set_node(int node, struct memory_notify *arg)
>  	if (arg->status_change_nid_high >= 0)
>  		node_set_state(node, N_HIGH_MEMORY);
>
> +#ifdef CONFIG_COHERENT_DEVICE
> +	if (isolated_cdm_node(node))
> +		node_set_state(node, N_COHERENT_DEVICE);
> +#endif
> +

#ifdef not required, see below

>  	node_set_state(node, N_MEMORY);
>  }
>
> @@ -1858,6 +1863,11 @@ static void node_states_clear_node(int node, struct memory_notify *arg)
>  	if ((N_MEMORY != N_HIGH_MEMORY) &&
>  	    (arg->status_change_nid >= 0))
>  		node_clear_state(node, N_MEMORY);
> +
> +#ifdef CONFIG_COHERENT_DEVICE
> +	if (isolated_cdm_node(node))
> +		node_clear_state(node, N_COHERENT_DEVICE);
> +#endif
>  }
>

I think the #ifdefs are not needed if isolated_cdm_node
is defined for both with and without CONFIG_COHERENT_DEVICE.

I think this patch needs to move up in the series so that
node state can be examined by other core algorithms

Balbir

^ permalink raw reply	[flat|nested] 135+ messages in thread
* Re: [RFC 8/8] mm: Add N_COHERENT_DEVICE node type into node_states[]
  2016-10-25  7:22 ` Balbir Singh
  @ 2016-10-26  4:52 ` Anshuman Khandual  (0 siblings, 0 replies; 135+ messages in thread)
From: Anshuman Khandual @ 2016-10-26 4:52 UTC (permalink / raw)
To: Balbir Singh, linux-kernel, linux-mm
Cc: mhocko, js1304, vbabka, mgorman, minchan, akpm, aneesh.kumar

On 10/25/2016 12:52 PM, Balbir Singh wrote:
>
> On 24/10/16 15:31, Anshuman Khandual wrote:
>> [... patch and ABI documentation quoted in full above ...]
>>
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 9629273..8f03962 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -1044,6 +1044,11 @@ static void node_states_set_node(int node, struct memory_notify *arg)
>>  	if (arg->status_change_nid_high >= 0)
>>  		node_set_state(node, N_HIGH_MEMORY);
>>
>> +#ifdef CONFIG_COHERENT_DEVICE
>> +	if (isolated_cdm_node(node))
>> +		node_set_state(node, N_COHERENT_DEVICE);
>> +#endif
>> +
>
> #ifdef not required, see below
>

Right, will change.

>>  	node_set_state(node, N_MEMORY);
>>  }
>>
>> @@ -1858,6 +1863,11 @@ static void node_states_clear_node(int node, struct memory_notify *arg)
>>  	if ((N_MEMORY != N_HIGH_MEMORY) &&
>>  	    (arg->status_change_nid >= 0))
>>  		node_clear_state(node, N_MEMORY);
>> +
>> +#ifdef CONFIG_COHERENT_DEVICE
>> +	if (isolated_cdm_node(node))
>> +		node_clear_state(node, N_COHERENT_DEVICE);
>> +#endif
>>  }
>>
>
> I think the #ifdefs are not needed if isolated_cdm_node
> is defined for both with and without CONFIG_COHERENT_DEVICE.
>
> I think this patch needs to move up in the series so that
> node state can be examined by other core algorithms

Okay, will move up.

^ permalink raw reply	[flat|nested] 135+ messages in thread
* [DEBUG 00/10] Test and debug patches for coherent device memory
  2016-10-24  4:31 ` Anshuman Khandual
  @ 2016-10-24  4:42 ` Anshuman Khandual  (0 siblings, 0 replies; 135+ messages in thread)
From: Anshuman Khandual @ 2016-10-24 4:42 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: mhocko, js1304, vbabka, mgorman, minchan, akpm, aneesh.kumar, bsingharora

Coherent device memory support has been experimented with on the POWER
platform using simulations and QEMU changes. This series contains patches
which can be classified into four categories:

(1) Memory-less node hot plug support
(2) Identifying coherent device nodes during NUMA init
(3) Debug patches to observe zonelist information
(4) Test drivers and scripts

Patch (2) could have been part of the RFC series, but because of the
dependency on patch (1) it goes here. Now let's look at how all these
components work.

Before Hotplug
==============

NUMACTL Information:
--------------------
available: 5 nodes (0-4)
node 0 cpus: 0 1 2 5 6 20 21 23 27 28 31 32 37 38 39 43 44 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62
node 0 size: 4059 MB
node 0 free: 2956 MB
node 1 cpus: 3 4 7 8 9 10 11 12 13 14 15 16 17 18 19 22 24 25 26 29 30 33 34 35 36 40 41 42 45 46 47 63
node 1 size: 4091 MB
node 1 free: 3920 MB
node 2 cpus:
node 2 size: 0 MB
node 2 free: 0 MB
node 3 cpus:
node 3 size: 0 MB
node 3 free: 0 MB
node 4 cpus:
node 4 size: 0 MB
node 4 free: 0 MB
node distances:
node   0   1   2   3   4
  0:  10  40  40  40  40
  1:  40  10  40  40  40
  2:  40  40  10  40  40
  3:  40  40  40  10  40
  4:  40  40  40  40  10

ZONELIST Information:
---------------------
[NODE (0)]
	ZONELIST_FALLBACK (0xc00000000140da00)
		(0) (node 0) (DMA 0xc00000000140c000)
		(1) (node 1) (DMA 0xc000000100000000)
	ZONELIST_NOFALLBACK (0xc000000001411a10)
		(0) (node 0) (DMA 0xc00000000140c000)
[NODE (1)]
	ZONELIST_FALLBACK (0xc000000100001a00)
		(0) (node 1) (DMA 0xc000000100000000)
		(1) (node 0) (DMA 0xc00000000140c000)
	ZONELIST_NOFALLBACK (0xc000000100005a10)
		(0) (node 1) (DMA 0xc000000100000000)
[NODE (2)]
	ZONELIST_FALLBACK (0xc000000001427700)
		(0) (node 0) (DMA 0xc00000000140c000)
		(1) (node 1) (DMA 0xc000000100000000)
	ZONELIST_NOFALLBACK (0xc00000000142b710)
[NODE (3)]
	ZONELIST_FALLBACK (0xc000000001431400)
		(0) (node 0) (DMA 0xc00000000140c000)
		(1) (node 1) (DMA 0xc000000100000000)
	ZONELIST_NOFALLBACK (0xc000000001435410)
[NODE (4)]
	ZONELIST_FALLBACK (0xc00000000143b100)
		(0) (node 0) (DMA 0xc00000000140c000)
		(1) (node 1) (DMA 0xc000000100000000)
	ZONELIST_NOFALLBACK (0xc00000000143f110)

After Hotplug
=============

NUMACTL Information:
--------------------
available: 5 nodes (0-4)
node 0 cpus: 0 1 2 5 6 20 21 23 27 28 31 32 37 38 39 43 44 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62
node 0 size: 4059 MB
node 0 free: 2804 MB
node 1 cpus: 3 4 7 8 9 10 11 12 13 14 15 16 17 18 19 22 24 25 26 29 30 33 34 35 36 40 41 42 45 46 47 63
node 1 size: 4091 MB
node 1 free: 3860 MB
node 2 cpus:
node 2 size: 4096 MB
node 2 free: 4095 MB
node 3 cpus:
node 3 size: 4096 MB
node 3 free: 4095 MB
node 4 cpus:
node 4 size: 4096 MB
node 4 free: 4095 MB
node distances:
node   0   1   2   3   4
  0:  10  40  40  40  40
  1:  40  10  40  40  40
  2:  40  40  10  40  40
  3:  40  40  40  10  40
  4:  40  40  40  40  10

ZONELIST Information:
---------------------
[NODE (0)]
	ZONELIST_FALLBACK (0xc00000000140da00)
		(0) (node 0) (DMA 0xc00000000140c000)
		(1) (node 1) (DMA 0xc000000100000000)
	ZONELIST_NOFALLBACK (0xc000000001411a10)
		(0) (node 0) (DMA 0xc00000000140c000)
[NODE (1)]
	ZONELIST_FALLBACK (0xc000000100001a00)
		(0) (node 1) (DMA 0xc000000100000000)
		(1) (node 0) (DMA 0xc00000000140c000)
	ZONELIST_NOFALLBACK (0xc000000100005a10)
		(0) (node 1) (DMA 0xc000000100000000)
[NODE (2)]
	ZONELIST_FALLBACK (0xc000000001427700)
		(0) (node 2) (Movable 0xc000000001427080)
		(1) (node 0) (DMA 0xc00000000140c000)
		(2) (node 1) (DMA 0xc000000100000000)
	ZONELIST_NOFALLBACK (0xc00000000142b710)
		(0) (node 2) (Movable 0xc000000001427080)
[NODE (3)]
	ZONELIST_FALLBACK (0xc000000001431400)
		(0) (node 3) (Movable 0xc000000001430d80)
		(1) (node 0) (DMA 0xc00000000140c000)
		(2) (node 1) (DMA 0xc000000100000000)
	ZONELIST_NOFALLBACK (0xc000000001435410)
		(0) (node 3) (Movable 0xc000000001430d80)
[NODE (4)]
	ZONELIST_FALLBACK (0xc00000000143b100)
		(0) (node 4) (Movable 0xc00000000143aa80)
		(1) (node 0) (DMA 0xc00000000140c000)
		(2) (node 1) (DMA 0xc000000100000000)
	ZONELIST_NOFALLBACK (0xc00000000143f110)
		(0) (node 4) (Movable 0xc00000000143aa80)

After the coherent device memory nodes had been hot plugged into the kernel,
some simple VMA migration tests were run to verify stability. cdm_migration.sh
does the actual test of moving VMAs of the ebizzy workload, which results in
the following stats and traces.

Results:
--------
passed 13
failed 0
queuef 0
empty 3
missing 0

Traces:
-------
migrate_virtual_range: 55094 10000000 10010000 0: migration_passed
migrate_virtual_range: 55094 10010000 10020000 0: migration_passed
migrate_virtual_range: 55094 10020000 10030000 3: migration_passed
migrate_virtual_range: 55094 3fff3b6a0000 3fff8b3c0000 0: list_empty
migrate_virtual_range: 55094 3fff8b3c0000 3fff8b580000 1: migration_passed
migrate_virtual_range: 55094 3fff8b580000 3fff8b590000 2: migration_passed
migrate_virtual_range: 55094 3fff8b590000 3fff8b5a0000 0: migration_passed
migrate_virtual_range: 55094 3fff8b5a0000 3fff8b5c0000 2: migration_passed
migrate_virtual_range: 55094 3fff8b5c0000 3fff8b5d0000 2: migration_passed
migrate_virtual_range: 55094 3fff8b5d0000 3fff8b5e0000 0: migration_passed
migrate_virtual_range: 55094 3fff8b5e0000 3fff8b5f0000 0: list_empty
migrate_virtual_range: 55094 3fff8b5f0000 3fff8b610000 3: list_empty
migrate_virtual_range: 55094 3fff8b610000 3fff8b640000 3: migration_passed
migrate_virtual_range: 55094 3fff8b640000 3fff8b650000 2: migration_passed
migrate_virtual_range: 55094 3fff8b650000 3fff8b660000 1: migration_passed
migrate_virtual_range: 55094 3ffff25e0000 3ffff2610000 1: migration_passed

Anshuman Khandual (6):
  powerpc/mm: Identify isolation seeking coherent memory nodes during boot
  mm: Export definition of 'zone_names' array through mmzone.h
  mm: Add debugfs interface to dump each node's zonelist information
  powerpc: Enable CONFIG_MOVABLE_NODE for PPC64 platform
  drivers: Add two drivers for coherent device memory tests
  test: Add a script to perform random VMA migrations across nodes

Reza Arbab (4):
  dt-bindings: Add doc for ibm,hotplug-aperture
  powerpc/mm: Create numa nodes for hotplug memory
  powerpc/mm: Allow memory hotplug into a memory less node
  mm: Enable CONFIG_MOVABLE_NODE on powerpc

 .../bindings/powerpc/opal/hotplug-aperture.txt |  26 ++
 Documentation/kernel-parameters.txt            |   2 +-
 arch/powerpc/Kconfig                           |   4 +
 arch/powerpc/mm/numa.c                         |  43 ++-
 drivers/char/Kconfig                           |  23 ++
 drivers/char/Makefile                          |   2 +
 drivers/char/coherent_hotplug_demo.c           | 133 ++++++++
 drivers/char/coherent_memory_demo.c            | 337 +++++++++++++++++++++
 drivers/char/memory_online_sysfs.h             | 148 +++++++++
 include/linux/mmzone.h                         |   1 +
 mm/Kconfig                                     |   2 +-
 mm/memory.c                                    |  63 ++++
 mm/migrate.c                                   |  10 +
 mm/page_alloc.c                                |   2 +-
 tools/testing/selftests/vm/cdm_migration.sh    |  76 +++++
 15 files changed, 855 insertions(+), 17 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/powerpc/opal/hotplug-aperture.txt
 create mode 100644 drivers/char/coherent_hotplug_demo.c
 create mode 100644 drivers/char/coherent_memory_demo.c
 create mode 100644 drivers/char/memory_online_sysfs.h
 create mode 100755 tools/testing/selftests/vm/cdm_migration.sh
-- 
2.1.0

^ permalink raw reply	[flat|nested] 135+ messages in thread
* [DEBUG 00/10] Test and debug patches for coherent device memory @ 2016-10-24 4:42 ` Anshuman Khandual 0 siblings, 0 replies; 135+ messages in thread From: Anshuman Khandual @ 2016-10-24 4:42 UTC (permalink / raw) To: linux-kernel, linux-mm Cc: mhocko, js1304, vbabka, mgorman, minchan, akpm, aneesh.kumar, bsingharora Coherent device memory support has been experimented around on POWER platform with simulations and QEMU changes. This series contains patches which can be classified into three categories. (1) Memory less node hot plug support (2) Identifying coherent device nodes during NUMA init (3) Debug patches to observe zonelists information (4) Test drivers and scripts Patch (2) could have been part of the RFC series but because of the dependency on patch (1), it goes here. Now lets look at the how all these components work. Before Hotplug ============== NUMACTL Information: -------------------- available: 5 nodes (0-4) node 0 cpus: 0 1 2 5 6 20 21 23 27 28 31 32 37 38 39 43 44 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 node 0 size: 4059 MB node 0 free: 2956 MB node 1 cpus: 3 4 7 8 9 10 11 12 13 14 15 16 17 18 19 22 24 25 26 29 30 33 34 35 36 40 41 42 45 46 47 63 node 1 size: 4091 MB node 1 free: 3920 MB node 2 cpus: node 2 size: 0 MB node 2 free: 0 MB node 3 cpus: node 3 size: 0 MB node 3 free: 0 MB node 4 cpus: node 4 size: 0 MB node 4 free: 0 MB node distances: node 0 1 2 3 4 0: 10 40 40 40 40 1: 40 10 40 40 40 2: 40 40 10 40 40 3: 40 40 40 10 40 4: 40 40 40 40 10 ZONELIST Information -------------------- [NODE (0)] ZONELIST_FALLBACK (0xc00000000140da00) (0) (node 0) (DMA 0xc00000000140c000) (1) (node 1) (DMA 0xc000000100000000) ZONELIST_NOFALLBACK (0xc000000001411a10) (0) (node 0) (DMA 0xc00000000140c000) [NODE (1)] ZONELIST_FALLBACK (0xc000000100001a00) (0) (node 1) (DMA 0xc000000100000000) (1) (node 0) (DMA 0xc00000000140c000) ZONELIST_NOFALLBACK (0xc000000100005a10) (0) (node 1) (DMA 0xc000000100000000) [NODE (2)] ZONELIST_FALLBACK (0xc000000001427700) 
(0) (node 0) (DMA 0xc00000000140c000) (1) (node 1) (DMA 0xc000000100000000) ZONELIST_NOFALLBACK (0xc00000000142b710) [NODE (3)] ZONELIST_FALLBACK (0xc000000001431400) (0) (node 0) (DMA 0xc00000000140c000) (1) (node 1) (DMA 0xc000000100000000) ZONELIST_NOFALLBACK (0xc000000001435410) [NODE (4)] ZONELIST_FALLBACK (0xc00000000143b100) (0) (node 0) (DMA 0xc00000000140c000) (1) (node 1) (DMA 0xc000000100000000) ZONELIST_NOFALLBACK (0xc00000000143f110) After Hotplug ============= NUMACTL Information: -------------------- available: 5 nodes (0-4) node 0 cpus: 0 1 2 5 6 20 21 23 27 28 31 32 37 38 39 43 44 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 node 0 size: 4059 MB node 0 free: 2804 MB node 1 cpus: 3 4 7 8 9 10 11 12 13 14 15 16 17 18 19 22 24 25 26 29 30 33 34 35 36 40 41 42 45 46 47 63 node 1 size: 4091 MB node 1 free: 3860 MB node 2 cpus: node 2 size: 4096 MB node 2 free: 4095 MB node 3 cpus: node 3 size: 4096 MB node 3 free: 4095 MB node 4 cpus: node 4 size: 4096 MB node 4 free: 4095 MB node distances: node 0 1 2 3 4 0: 10 40 40 40 40 1: 40 10 40 40 40 2: 40 40 10 40 40 3: 40 40 40 10 40 4: 40 40 40 40 10 ZONELIST Information: --------------------- [NODE (0)] ZONELIST_FALLBACK (0xc00000000140da00) (0) (node 0) (DMA 0xc00000000140c000) (1) (node 1) (DMA 0xc000000100000000) ZONELIST_NOFALLBACK (0xc000000001411a10) (0) (node 0) (DMA 0xc00000000140c000) [NODE (1)] ZONELIST_FALLBACK (0xc000000100001a00) (0) (node 1) (DMA 0xc000000100000000) (1) (node 0) (DMA 0xc00000000140c000) ZONELIST_NOFALLBACK (0xc000000100005a10) (0) (node 1) (DMA 0xc000000100000000) [NODE (2)] ZONELIST_FALLBACK (0xc000000001427700) (0) (node 2) (Movable 0xc000000001427080) (1) (node 0) (DMA 0xc00000000140c000) (2) (node 1) (DMA 0xc000000100000000) ZONELIST_NOFALLBACK (0xc00000000142b710) (0) (node 2) (Movable 0xc000000001427080) [NODE (3)] ZONELIST_FALLBACK (0xc000000001431400) (0) (node 3) (Movable 0xc000000001430d80) (1) (node 0) (DMA 0xc00000000140c000) (2) (node 1) (DMA 0xc000000100000000) 
	ZONELIST_NOFALLBACK (0xc000000001435410)
		(0) (node 3) (Movable 0xc000000001430d80)
[NODE (4)]
	ZONELIST_FALLBACK (0xc00000000143b100)
		(0) (node 4) (Movable 0xc00000000143aa80)
		(1) (node 0) (DMA 0xc00000000140c000)
		(2) (node 1) (DMA 0xc000000100000000)
	ZONELIST_NOFALLBACK (0xc00000000143f110)
		(0) (node 4) (Movable 0xc00000000143aa80)

After the coherent device memory nodes had been hot plugged into the
kernel, we ran some simple VMA migration tests to verify its stability.
cdm_migration.sh does the actual test of moving the VMAs of an ebizzy
workload, which results in the following stats and traces.

Results:
-------
passed 13
failed 0
queuef 0
empty 3
missing 0

Traces:
-------
migrate_virtual_range: 55094 10000000 10010000 0: migration_passed
migrate_virtual_range: 55094 10010000 10020000 0: migration_passed
migrate_virtual_range: 55094 10020000 10030000 3: migration_passed
migrate_virtual_range: 55094 3fff3b6a0000 3fff8b3c0000 0: list_empty
migrate_virtual_range: 55094 3fff8b3c0000 3fff8b580000 1: migration_passed
migrate_virtual_range: 55094 3fff8b580000 3fff8b590000 2: migration_passed
migrate_virtual_range: 55094 3fff8b590000 3fff8b5a0000 0: migration_passed
migrate_virtual_range: 55094 3fff8b5a0000 3fff8b5c0000 2: migration_passed
migrate_virtual_range: 55094 3fff8b5c0000 3fff8b5d0000 2: migration_passed
migrate_virtual_range: 55094 3fff8b5d0000 3fff8b5e0000 0: migration_passed
migrate_virtual_range: 55094 3fff8b5e0000 3fff8b5f0000 0: list_empty
migrate_virtual_range: 55094 3fff8b5f0000 3fff8b610000 3: list_empty
migrate_virtual_range: 55094 3fff8b610000 3fff8b640000 3: migration_passed
migrate_virtual_range: 55094 3fff8b640000 3fff8b650000 2: migration_passed
migrate_virtual_range: 55094 3fff8b650000 3fff8b660000 1: migration_passed
migrate_virtual_range: 55094 3ffff25e0000 3ffff2610000 1: migration_passed

Anshuman Khandual (6):
  powerpc/mm: Identify isolation seeking coherent memory nodes during boot
  mm: Export definition of 'zone_names' array through mmzone.h
  mm: Add
debugfs interface to dump each node's zonelist information powerpc: Enable CONFIG_MOVABLE_NODE for PPC64 platform drivers: Add two drivers for coherent device memory tests test: Add a script to perform random VMA migrations across nodes Reza Arbab (4): dt-bindings: Add doc for ibm,hotplug-aperture powerpc/mm: Create numa nodes for hotplug memory powerpc/mm: Allow memory hotplug into a memory less node mm: Enable CONFIG_MOVABLE_NODE on powerpc .../bindings/powerpc/opal/hotplug-aperture.txt | 26 ++ Documentation/kernel-parameters.txt | 2 +- arch/powerpc/Kconfig | 4 + arch/powerpc/mm/numa.c | 43 ++- drivers/char/Kconfig | 23 ++ drivers/char/Makefile | 2 + drivers/char/coherent_hotplug_demo.c | 133 ++++++++ drivers/char/coherent_memory_demo.c | 337 +++++++++++++++++++++ drivers/char/memory_online_sysfs.h | 148 +++++++++ include/linux/mmzone.h | 1 + mm/Kconfig | 2 +- mm/memory.c | 63 ++++ mm/migrate.c | 10 + mm/page_alloc.c | 2 +- tools/testing/selftests/vm/cdm_migration.sh | 76 +++++ 15 files changed, 855 insertions(+), 17 deletions(-) create mode 100644 Documentation/devicetree/bindings/powerpc/opal/hotplug-aperture.txt create mode 100644 drivers/char/coherent_hotplug_demo.c create mode 100644 drivers/char/coherent_memory_demo.c create mode 100644 drivers/char/memory_online_sysfs.h create mode 100755 tools/testing/selftests/vm/cdm_migration.sh -- 2.1.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 135+ messages in thread
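As a side note (an illustration, not part of the series), the Results summary above can be re-derived from saved `migrate_virtual_range` trace lines. The shell sketch below assumes the status is the text after the final ": " separator, as in the quoted trace output:

```shell
# Tally migrate_virtual_range statuses from a saved trace file. The line
# format "migrate_virtual_range: <pid> <start> <end> <node>: <status>" is
# inferred from the trace output quoted above.
tally() {
    awk -F': ' '{ count[$NF]++ } END { for (s in count) print s, count[s] }' "$1" | sort
}

# A few sample lines taken from the traces above.
cat > /tmp/cdm_trace.txt <<'EOF'
migrate_virtual_range: 55094 10000000 10010000 0: migration_passed
migrate_virtual_range: 55094 3fff3b6a0000 3fff8b3c0000 0: list_empty
migrate_virtual_range: 55094 3fff8b3c0000 3fff8b580000 1: migration_passed
EOF

tally /tmp/cdm_trace.txt
```

The same tallying idea presumably underlies the passed/failed/empty counters printed by cdm_migration.sh.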
* [DEBUG 01/10] dt-bindings: Add doc for ibm,hotplug-aperture 2016-10-24 4:42 ` Anshuman Khandual @ 2016-10-24 4:42 ` Anshuman Khandual -1 siblings, 0 replies; 135+ messages in thread From: Anshuman Khandual @ 2016-10-24 4:42 UTC (permalink / raw) To: linux-kernel, linux-mm Cc: mhocko, js1304, vbabka, mgorman, minchan, akpm, aneesh.kumar, bsingharora From: Reza Arbab <arbab@linux.vnet.ibm.com> Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com> Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> --- .../bindings/powerpc/opal/hotplug-aperture.txt | 26 ++++++++++++++++++++++ 1 file changed, 26 insertions(+) create mode 100644 Documentation/devicetree/bindings/powerpc/opal/hotplug-aperture.txt diff --git a/Documentation/devicetree/bindings/powerpc/opal/hotplug-aperture.txt b/Documentation/devicetree/bindings/powerpc/opal/hotplug-aperture.txt new file mode 100644 index 0000000..04dde03 --- /dev/null +++ b/Documentation/devicetree/bindings/powerpc/opal/hotplug-aperture.txt @@ -0,0 +1,26 @@ +Designated hotplug memory +------------------------- + +This binding describes a region of hotplug memory which is not present +at boot, allowing its eventual NUMA associativity to be prespecified. + +Required properties: + +- compatible + "ibm,hotplug-aperture" + +- reg + base address and size of the region (standard definition) + +- ibm,associativity + NUMA associativity (standard definition) + +Example: + +A 2 GiB aperture at 0x100000000, to be part of nid 3 when hotplugged: + + hotplug-memory@100000000 { + compatible = "ibm,hotplug-aperture"; + reg = <0x0 0x100000000 0x0 0x80000000>; + ibm,associativity = <0x4 0x0 0x0 0x0 0x3>; + }; -- 2.1.0 ^ permalink raw reply related [flat|nested] 135+ messages in thread
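As a quick sanity check of the binding example above (an illustration, not part of the patch), shell arithmetic confirms that the size cells encode 2 GiB and that, by the stated convention, the final ibm,associativity cell names the target nid:

```shell
# The example aperture: the size cells encode 0x80000000 bytes, and the
# last ibm,associativity cell (0x3) is taken to be the node id per the
# binding text above. Requires a shell with 64-bit arithmetic (e.g. bash).
size_bytes=$(( 0x80000000 ))
printf '%d GiB at 0x%x, nid %d\n' $(( size_bytes >> 30 )) $(( 0x100000000 )) 3
```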
* [DEBUG 02/10] powerpc/mm: Create numa nodes for hotplug memory 2016-10-24 4:42 ` Anshuman Khandual @ 2016-10-24 4:42 ` Anshuman Khandual -1 siblings, 0 replies; 135+ messages in thread From: Anshuman Khandual @ 2016-10-24 4:42 UTC (permalink / raw) To: linux-kernel, linux-mm Cc: mhocko, js1304, vbabka, mgorman, minchan, akpm, aneesh.kumar, bsingharora From: Reza Arbab <arbab@linux.vnet.ibm.com> When scanning the device tree to initialize the system NUMA topology, process dt elements with compatible id "ibm,hotplug-aperture" to create memoryless numa nodes. These nodes will be filled when hotplug occurs within the associated address range. Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com> Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> --- arch/powerpc/mm/numa.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index a51c188..42fcc8e 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -708,6 +708,12 @@ static void __init parse_drconf_memory(struct device_node *memory) } } +static const struct of_device_id memory_match[] = { + { .type = "memory" }, + { .compatible = "ibm,hotplug-aperture" }, + { /* sentinel */ } +}; + static int __init parse_numa_properties(void) { struct device_node *memory; @@ -752,7 +758,7 @@ static int __init parse_numa_properties(void) get_n_mem_cells(&n_mem_addr_cells, &n_mem_size_cells); - for_each_node_by_type(memory, "memory") { + for_each_matching_node(memory, memory_match) { unsigned long start; unsigned long size; int nid; @@ -1044,7 +1050,7 @@ static int hot_add_node_scn_to_nid(unsigned long scn_addr) struct device_node *memory; int nid = -1; - for_each_node_by_type(memory, "memory") { + for_each_matching_node(memory, memory_match) { unsigned long start, size; int ranges; const __be32 *memcell_buf; -- 2.1.0 ^ permalink raw reply related [flat|nested] 135+ messages in thread
* [DEBUG 03/10] powerpc/mm: Allow memory hotplug into a memory less node 2016-10-24 4:42 ` Anshuman Khandual @ 2016-10-24 4:42 ` Anshuman Khandual -1 siblings, 0 replies; 135+ messages in thread From: Anshuman Khandual @ 2016-10-24 4:42 UTC (permalink / raw) To: linux-kernel, linux-mm Cc: mhocko, js1304, vbabka, mgorman, minchan, akpm, aneesh.kumar, bsingharora From: Reza Arbab <arbab@linux.vnet.ibm.com> Remove the check which prevents us from hotplugging into an empty node. Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com> Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> --- arch/powerpc/mm/numa.c | 13 +------------ 1 file changed, 1 insertion(+), 12 deletions(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 42fcc8e..5010181 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -1091,7 +1091,7 @@ static int hot_add_node_scn_to_nid(unsigned long scn_addr) int hot_add_scn_to_nid(unsigned long scn_addr) { struct device_node *memory = NULL; - int nid, found = 0; + int nid; if (!numa_enabled || (min_common_depth < 0)) return first_online_node; @@ -1107,17 +1107,6 @@ int hot_add_scn_to_nid(unsigned long scn_addr) if (nid < 0 || !node_online(nid)) nid = first_online_node; - if (NODE_DATA(nid)->node_spanned_pages) - return nid; - - for_each_online_node(nid) { - if (NODE_DATA(nid)->node_spanned_pages) { - found = 1; - break; - } - } - - BUG_ON(!found); return nid; } -- 2.1.0 ^ permalink raw reply related [flat|nested] 135+ messages in thread
* [DEBUG 04/10] mm: Enable CONFIG_MOVABLE_NODE on powerpc 2016-10-24 4:42 ` Anshuman Khandual @ 2016-10-24 4:42 ` Anshuman Khandual -1 siblings, 0 replies; 135+ messages in thread From: Anshuman Khandual @ 2016-10-24 4:42 UTC (permalink / raw) To: linux-kernel, linux-mm Cc: mhocko, js1304, vbabka, mgorman, minchan, akpm, aneesh.kumar, bsingharora From: Reza Arbab <arbab@linux.vnet.ibm.com> Onlining memory into ZONE_MOVABLE requires CONFIG_MOVABLE_NODE. Enable the use of this config option on PPC64 platforms. Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com> Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> --- Documentation/kernel-parameters.txt | 2 +- mm/Kconfig | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 37babf9..61cfa0b 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -2401,7 +2401,7 @@ bytes respectively. Such letter suffixes can also be entirely omitted. that the amount of memory usable for all allocations is not too small. - movable_node [KNL,X86] Boot-time switch to enable the effects + movable_node [KNL,X86,PPC] Boot-time switch to enable the effects of CONFIG_MOVABLE_NODE=y. See mm/Kconfig for details. MTD_Partition= [MTD] diff --git a/mm/Kconfig b/mm/Kconfig index cb50468..a4727fa 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -153,7 +153,7 @@ config MOVABLE_NODE bool "Enable to assign a node which has only movable memory" depends on HAVE_MEMBLOCK depends on NO_BOOTMEM - depends on X86_64 + depends on X86_64 || PPC64 depends on NUMA default n help -- 2.1.0 ^ permalink raw reply related [flat|nested] 135+ messages in thread
* [DEBUG 05/10] powerpc/mm: Identify isolation seeking coherent memory nodes during boot 2016-10-24 4:42 ` Anshuman Khandual @ 2016-10-24 4:42 ` Anshuman Khandual -1 siblings, 0 replies; 135+ messages in thread From: Anshuman Khandual @ 2016-10-24 4:42 UTC (permalink / raw) To: linux-kernel, linux-mm Cc: mhocko, js1304, vbabka, mgorman, minchan, akpm, aneesh.kumar, bsingharora Isolation seeking coherent memory nodes which wish to be MNODE_ISOLATION in core VM will have "ibm,hotplug-aperture" as one of the compatible properties in their respective device nodes in device tree. Detect them during platform NUMA initialization and mark their respective coherent mask in pglist_data structure as MNODE_ISOLATION. Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> --- arch/powerpc/mm/numa.c | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 5010181..89ae64c 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -64,6 +64,7 @@ static int form1_affinity; static int distance_ref_points_depth; static const __be32 *distance_ref_points; static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS]; +static int node_to_phys_device_map[MAX_NUMNODES]; /* * Allocate node_to_cpumask_map based on number of available nodes @@ -714,6 +715,17 @@ static const struct of_device_id memory_match[] = { { /* sentinel */ } }; +int arch_get_memory_phys_device(unsigned long start_pfn) +{ + return node_to_phys_device_map[pfn_to_nid(start_pfn)]; +} + +int special_mem_node(int nid) +{ + return node_to_phys_device_map[nid]; +} +EXPORT_SYMBOL(special_mem_node); + static int __init parse_numa_properties(void) { struct device_node *memory; @@ -789,6 +801,9 @@ static int __init parse_numa_properties(void) if (nid < 0) nid = default_nid; + if (of_device_is_compatible(memory, "ibm,hotplug-aperture")) + node_to_phys_device_map[nid] = 1; + fake_numa_create_new_node(((start + size) >> PAGE_SHIFT), 
&nid); node_set_online(nid); @@ -908,6 +923,11 @@ static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn) NODE_DATA(nid)->node_id = nid; NODE_DATA(nid)->node_start_pfn = start_pfn; NODE_DATA(nid)->node_spanned_pages = spanned_pages; + +#ifdef CONFIG_COHERENT_DEVICE + if (special_mem_node(nid)) + set_cdm_isolation(nid); +#endif } void __init initmem_init(void) -- 2.1.0 ^ permalink raw reply related [flat|nested] 135+ messages in thread
* [DEBUG 06/10] mm: Export definition of 'zone_names' array through mmzone.h 2016-10-24 4:42 ` Anshuman Khandual @ 2016-10-24 4:42 ` Anshuman Khandual -1 siblings, 0 replies; 135+ messages in thread From: Anshuman Khandual @ 2016-10-24 4:42 UTC (permalink / raw) To: linux-kernel, linux-mm Cc: mhocko, js1304, vbabka, mgorman, minchan, akpm, aneesh.kumar, bsingharora zone_names[] is used to identify any zone given its index and can be used in many other places. So export the definition through the include/linux/mmzone.h header for broader access. Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> --- include/linux/mmzone.h | 1 + mm/page_alloc.c | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 821dffb..560bbcd 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -341,6 +341,7 @@ enum zone_type { }; +extern char * const zone_names[]; #ifndef __GENERATING_BOUNDS_H struct zone { diff --git a/mm/page_alloc.c b/mm/page_alloc.c index a2536b4..35c6d2a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -212,7 +212,7 @@ int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1] = { EXPORT_SYMBOL(totalram_pages); -static char * const zone_names[MAX_NR_ZONES] = { +char * const zone_names[MAX_NR_ZONES] = { #ifdef CONFIG_ZONE_DMA "DMA", #endif -- 2.1.0 ^ permalink raw reply related [flat|nested] 135+ messages in thread
* [DEBUG 07/10] mm: Add debugfs interface to dump each node's zonelist information 2016-10-24 4:42 ` Anshuman Khandual @ 2016-10-24 4:42 ` Anshuman Khandual -1 siblings, 0 replies; 135+ messages in thread From: Anshuman Khandual @ 2016-10-24 4:42 UTC (permalink / raw) To: linux-kernel, linux-mm Cc: mhocko, js1304, vbabka, mgorman, minchan, akpm, aneesh.kumar, bsingharora Each individual node in the system has a ZONELIST_FALLBACK zonelist and a ZONELIST_NOFALLBACK zonelist. These zonelists decide the fallback order of zones during memory allocations. Sometimes it helps to dump these zonelists to see the priority order of the various zones in them. Particularly on platforms which support memory hotplug into zones that did not exist at boot, this interface helps visualize which zonelists, and at what priority level, the newly hot-added memory ends up in. POWER is such a platform, where all the memory detected at boot time remains in ZONE_DMA for good, but the hotplug process can bring new memory into ZONE_MOVABLE. So having a way to get a snapshot of the zonelists on the system after a memory or node hot[un]plug is desirable. This change adds a new debugfs interface (/sys/kernel/debug/zonelists) which fetches and dumps this information. Example zonelist information from a KVM guest with four NUMA nodes on a POWER8 platform.
[NODE (0)] ZONELIST_FALLBACK (0) (Node 0) (DMA) (1) (Node 1) (DMA) (2) (Node 2) (DMA) (3) (Node 3) (DMA) ZONELIST_NOFALLBACK (0) (Node 0) (DMA) [NODE (1)] ZONELIST_FALLBACK (0) (Node 1) (DMA) (1) (Node 2) (DMA) (2) (Node 3) (DMA) (3) (Node 0) (DMA) ZONELIST_NOFALLBACK (0) (Node 1) (DMA) [NODE (2)] ZONELIST_FALLBACK (0) (Node 2) (DMA) (1) (Node 3) (DMA) (2) (Node 0) (DMA) (3) (Node 1) (DMA) ZONELIST_NOFALLBACK (0) (Node 2) (DMA) [NODE (3)] ZONELIST_FALLBACK (0) (Node 3) (DMA) (1) (Node 0) (DMA) (2) (Node 1) (DMA) (3) (Node 2) (DMA) ZONELIST_NOFALLBACK (0) (Node 3) (DMA) Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> --- mm/memory.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 63 insertions(+) diff --git a/mm/memory.c b/mm/memory.c index e18c57b..3be1753 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -64,6 +64,7 @@ #include <linux/debugfs.h> #include <linux/userfaultfd_k.h> #include <linux/dax.h> +#include <linux/mmzone.h> #include <asm/io.h> #include <asm/mmu_context.h> @@ -3087,6 +3088,68 @@ static int __init fault_around_debugfs(void) pr_warn("Failed to create fault_around_bytes in debugfs"); return 0; } + +#ifdef CONFIG_NUMA +static void show_zonelist(struct seq_file *m, struct zonelist *zonelist) +{ + unsigned int i; + + for (i = 0; zonelist->_zonerefs[i].zone; i++) { + seq_printf(m, "\t\t(%d) (Node %d) (%-7s 0x%pK)\n", i, + zonelist->_zonerefs[i].zone->zone_pgdat->node_id, + zone_names[zonelist->_zonerefs[i].zone_idx], + (void *) zonelist->_zonerefs[i].zone); + } +} + +static int zonelists_show(struct seq_file *m, void *v) +{ + struct zonelist *zonelist; + unsigned int node; + + for_each_online_node(node) { + zonelist = &(NODE_DATA(node)-> + node_zonelists[ZONELIST_FALLBACK]); + seq_printf(m, "[NODE (%d)]\n", node); + seq_puts(m, "\tZONELIST_FALLBACK "); + seq_printf(m, "(0x%pK)\n", zonelist); + show_zonelist(m, zonelist); + + zonelist = &(NODE_DATA(node)-> + node_zonelists[ZONELIST_NOFALLBACK]); + 
seq_puts(m, "\tZONELIST_NOFALLBACK "); + seq_printf(m, "(0x%pK)\n", zonelist); + show_zonelist(m, zonelist); + } + return 0; +} + +static int zonelists_open(struct inode *inode, struct file *filp) +{ + return single_open(filp, zonelists_show, NULL); +} + +static const struct file_operations zonelists_fops = { + .open = zonelists_open, + .read = seq_read, + .llseek = seq_lseek, + .release = single_release, +}; + +static int __init zonelists_debugfs(void) +{ + void *ret; + + ret = debugfs_create_file("zonelists", 0444, NULL, NULL, + &zonelists_fops); + if (!ret) + pr_warn("Failed to create zonelists in debugfs"); + return 0; +} + +late_initcall(zonelists_debugfs); +#endif /* CONFIG_NUMA */ + late_initcall(fault_around_debugfs); #endif -- 2.1.0 ^ permalink raw reply related [flat|nested] 135+ messages in thread
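Once the patch is applied, the dump can simply be read with `cat /sys/kernel/debug/zonelists`. As a post-processing illustration (a sketch, not part of the patch), the per-node fallback order can be summarized from a saved copy of that output; the sample below follows the simplified format shown in the commit message above:

```shell
# Summarize the per-node fallback order from a saved zonelists dump.
# Reduce each ZONELIST_FALLBACK section to "node N: nodeA -> nodeB ...".
summarize() {
    awk '
    /^\[NODE/ { if (order != "") print node ": " order
                match($0, /[0-9]+/)
                node = "node " substr($0, RSTART, RLENGTH)
                infb = 0; order = "" }
    /ZONELIST_NOFALLBACK/ { infb = 0 }
    /ZONELIST_FALLBACK/   { infb = 1 }
    infb && /\(Node/ {
        match($0, /Node [0-9]+/)
        order = order (order ? " -> " : "") "node" substr($0, RSTART + 5, RLENGTH - 5)
    }
    END { if (order != "") print node ": " order }' "$1"
}

# Sample dump in the simplified format from the commit message.
cat > /tmp/zonelists.txt <<'EOF'
[NODE (0)]
    ZONELIST_FALLBACK
        (0) (Node 0) (DMA)
        (1) (Node 1) (DMA)
    ZONELIST_NOFALLBACK
        (0) (Node 0) (DMA)
[NODE (1)]
    ZONELIST_FALLBACK
        (0) (Node 1) (DMA)
        (1) (Node 0) (DMA)
    ZONELIST_NOFALLBACK
        (0) (Node 1) (DMA)
EOF

summarize /tmp/zonelists.txt
```

On a live system the same function would be pointed at /sys/kernel/debug/zonelists (debugfs mounted), making it easy to eyeball where hot-added Movable zones land in each node's fallback chain.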
* [DEBUG 07/10] mm: Add debugfs interface to dump each node's zonelist information @ 2016-10-24 4:42 ` Anshuman Khandual 0 siblings, 0 replies; 135+ messages in thread From: Anshuman Khandual @ 2016-10-24 4:42 UTC (permalink / raw) To: linux-kernel, linux-mm Cc: mhocko, js1304, vbabka, mgorman, minchan, akpm, aneesh.kumar, bsingharora Each individual node in the system has a ZONELIST_FALLBACK zonelist and a ZONELIST_NOFALLBACK zonelist. These zonelists decide fallback order of zones during memory allocations. Sometimes it helps to dump these zonelists to see the priority order of various zones in them. Particularly platforms which support memory hotplug into previously non existing zones (at boot), this interface helps in visualizing which all zonelists of the system at what priority level, the new hot added memory ends up in. POWER is such a platform where all the memory detected during boot time remains with ZONE_DMA for good but then hot plug process can actually get new memory into ZONE_MOVABLE. So having a way to get the snapshot of the zonelists on the system after memory or node hot[un]plug is desirable. This change adds one new debugfs interface (/sys/kernel/debug/zonelists) which will fetch and dump this information. Example zonelist information from a KVM guest with four NUMA nodes on a POWER8 platform. 
[NODE (0)] ZONELIST_FALLBACK (0) (Node 0) (DMA) (1) (Node 1) (DMA) (2) (Node 2) (DMA) (3) (Node 3) (DMA) ZONELIST_NOFALLBACK (0) (Node 0) (DMA) [NODE (1)] ZONELIST_FALLBACK (0) (Node 1) (DMA) (1) (Node 2) (DMA) (2) (Node 3) (DMA) (3) (Node 0) (DMA) ZONELIST_NOFALLBACK (0) (Node 1) (DMA) [NODE (2)] ZONELIST_FALLBACK (0) (Node 2) (DMA) (1) (Node 3) (DMA) (2) (Node 0) (DMA) (3) (Node 1) (DMA) ZONELIST_NOFALLBACK (0) (Node 2) (DMA) [NODE (3)] ZONELIST_FALLBACK (0) (Node 3) (DMA) (1) (Node 0) (DMA) (2) (Node 1) (DMA) (3) (Node 2) (DMA) ZONELIST_NOFALLBACK (0) (Node 3) (DMA) Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> --- mm/memory.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 63 insertions(+) diff --git a/mm/memory.c b/mm/memory.c index e18c57b..3be1753 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -64,6 +64,7 @@ #include <linux/debugfs.h> #include <linux/userfaultfd_k.h> #include <linux/dax.h> +#include <linux/mmzone.h> #include <asm/io.h> #include <asm/mmu_context.h> @@ -3087,6 +3088,68 @@ static int __init fault_around_debugfs(void) pr_warn("Failed to create fault_around_bytes in debugfs"); return 0; } + +#ifdef CONFIG_NUMA +static void show_zonelist(struct seq_file *m, struct zonelist *zonelist) +{ + unsigned int i; + + for (i = 0; zonelist->_zonerefs[i].zone; i++) { + seq_printf(m, "\t\t(%d) (Node %d) (%-7s 0x%pK)\n", i, + zonelist->_zonerefs[i].zone->zone_pgdat->node_id, + zone_names[zonelist->_zonerefs[i].zone_idx], + (void *) zonelist->_zonerefs[i].zone); + } +} + +static int zonelists_show(struct seq_file *m, void *v) +{ + struct zonelist *zonelist; + unsigned int node; + + for_each_online_node(node) { + zonelist = &(NODE_DATA(node)-> + node_zonelists[ZONELIST_FALLBACK]); + seq_printf(m, "[NODE (%d)]\n", node); + seq_puts(m, "\tZONELIST_FALLBACK "); + seq_printf(m, "(0x%pK)\n", zonelist); + show_zonelist(m, zonelist); + + zonelist = &(NODE_DATA(node)-> + node_zonelists[ZONELIST_NOFALLBACK]); + 
seq_puts(m, "\tZONELIST_NOFALLBACK "); + seq_printf(m, "(0x%pK)\n", zonelist); + show_zonelist(m, zonelist); + } + return 0; +} + +static int zonelists_open(struct inode *inode, struct file *filp) +{ + return single_open(filp, zonelists_show, NULL); +} + +static const struct file_operations zonelists_fops = { + .open = zonelists_open, + .read = seq_read, + .llseek = seq_lseek, + .release = single_release, +}; + +static int __init zonelists_debugfs(void) +{ + void *ret; + + ret = debugfs_create_file("zonelists", 0444, NULL, NULL, + &zonelists_fops); + if (!ret) + pr_warn("Failed to create zonelists in debugfs"); + return 0; +} + +late_initcall(zonelists_debugfs); +#endif /* CONFIG_NUMA */ + late_initcall(fault_around_debugfs); #endif -- 2.1.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . ^ permalink raw reply related [flat|nested] 135+ messages in thread
* [DEBUG 08/10] powerpc: Enable CONFIG_MOVABLE_NODE for PPC64 platform 2016-10-24 4:42 ` Anshuman Khandual @ 2016-10-24 4:42 ` Anshuman Khandual -1 siblings, 0 replies; 135+ messages in thread From: Anshuman Khandual @ 2016-10-24 4:42 UTC (permalink / raw) To: linux-kernel, linux-mm Cc: mhocko, js1304, vbabka, mgorman, minchan, akpm, aneesh.kumar, bsingharora Just enable MOVABLE_NODE config option for PPC64 platform by default. This prevents accidentally building the kernel without the required config option. Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> --- arch/powerpc/Kconfig | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 65fba4c..3989d89 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -310,6 +310,10 @@ config PGTABLE_LEVELS default 3 if PPC_64K_PAGES && !PPC_BOOK3S_64 default 4 +config MOVABLE_NODE + bool + default y if PPC64 + source "init/Kconfig" source "kernel/Kconfig.freezer" -- 2.1.0 ^ permalink raw reply related [flat|nested] 135+ messages in thread
* [DEBUG 09/10] drivers: Add two drivers for coherent device memory tests 2016-10-24 4:42 ` Anshuman Khandual @ 2016-10-24 4:42 ` Anshuman Khandual -1 siblings, 0 replies; 135+ messages in thread From: Anshuman Khandual @ 2016-10-24 4:42 UTC (permalink / raw) To: linux-kernel, linux-mm Cc: mhocko, js1304, vbabka, mgorman, minchan, akpm, aneesh.kumar, bsingharora This adds two different drivers inside drivers/char/ directory under two new kernel config options COHERENT_HOTPLUG_DEMO and COHERENT_MEMORY_DEMO. 1) coherent_hotplug_demo: Detects and hotplugs the coherent device memory 2) coherent_memory_demo: Exports debugfs interface for VMA migrations Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> --- drivers/char/Kconfig | 23 +++ drivers/char/Makefile | 2 + drivers/char/coherent_hotplug_demo.c | 133 ++++++++++++++ drivers/char/coherent_memory_demo.c | 337 +++++++++++++++++++++++++++++++++++ drivers/char/memory_online_sysfs.h | 148 +++++++++++++++ mm/migrate.c | 10 ++ 6 files changed, 653 insertions(+) create mode 100644 drivers/char/coherent_hotplug_demo.c create mode 100644 drivers/char/coherent_memory_demo.c create mode 100644 drivers/char/memory_online_sysfs.h diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig index dcc0973..22c538d 100644 --- a/drivers/char/Kconfig +++ b/drivers/char/Kconfig @@ -588,6 +588,29 @@ config TILE_SROM device appear much like a simple EEPROM, and knows how to partition a single ROM for multiple purposes. +config COHERENT_HOTPLUG_DEMO + tristate "Demo driver to test coherent memory node hotplug" + depends on PPC64 || COHERENT_DEVICE + default n + help + Say yes when you want to build a test driver to hotplug all + the coherent memory nodes present on the system. This driver + scans through the device tree, checks on "ibm,memory-device" + property device nodes and onlines their memory. When unloaded, + it goes through the list of memory ranges it onlined before + and offlines them one by one. If not sure, select N. 
+ +config COHERENT_MEMORY_DEMO + tristate "Demo driver to test coherent memory node functionality" + depends on PPC64 || COHERENT_DEVICE + default n + help + Say yes when you want to build a test driver to demonstrate + the coherent memory functionalities, capabilities and probable + utilization. It also exports a debugfs file to accept inputs for + virtual address range migration for any process. If not sure, + select N. + source "drivers/char/xillybus/Kconfig" endmenu diff --git a/drivers/char/Makefile b/drivers/char/Makefile index 6e6c244..92fa338 100644 --- a/drivers/char/Makefile +++ b/drivers/char/Makefile @@ -60,3 +60,5 @@ js-rtc-y = rtc.o obj-$(CONFIG_TILE_SROM) += tile-srom.o obj-$(CONFIG_XILLYBUS) += xillybus/ obj-$(CONFIG_POWERNV_OP_PANEL) += powernv-op-panel.o +obj-$(CONFIG_COHERENT_HOTPLUG_DEMO) += coherent_hotplug_demo.o +obj-$(CONFIG_COHERENT_MEMORY_DEMO) += coherent_memory_demo.o diff --git a/drivers/char/coherent_hotplug_demo.c b/drivers/char/coherent_hotplug_demo.c new file mode 100644 index 0000000..3670081 --- /dev/null +++ b/drivers/char/coherent_hotplug_demo.c @@ -0,0 +1,133 @@ +/* + * Memory hotplug support for coherent memory nodes in runtime. + * + * Copyright (C) 2016, Reza Arbab, IBM Corporation. + * Copyright (C) 2016, Anshuman Khandual, IBM Corporation. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ */ +#include <linux/of.h> +#include <linux/export.h> +#include <linux/spinlock.h> +#include <linux/init.h> +#include <linux/memblock.h> +#include <linux/module.h> +#include <linux/memory.h> +#include <linux/sizes.h> +#include <linux/bitops.h> +#include <linux/device.h> +#include <linux/fs.h> +#include <linux/slab.h> +#include <linux/mm.h> +#include <linux/pagemap.h> +#include <linux/migrate.h> +#include <linux/memblock.h> +#include <linux/uaccess.h> + +#include <asm/mmu.h> +#include <asm/pgalloc.h> +#include "memory_online_sysfs.h" + +#define MAX_HOTADD_NODES 100 +phys_addr_t addr[MAX_HOTADD_NODES][2]; +int nr_addr; + +/* + * extern int memory_failure(unsigned long pfn, int trapno, int flags); + * extern int min_free_kbytes; + * extern int user_min_free_kbytes; + * + * extern unsigned long nr_kernel_pages; + * extern unsigned long nr_all_pages; + * extern unsigned long dma_reserve; + */ + +static void dump_core_vm_tunables(void) +{ +/* + * printk(":::::::: VM TUNABLES :::::::\n"); + * printk("[min_free_kbytes] %d\n", min_free_kbytes); + * printk("[user_min_free_kbytes] %d\n", user_min_free_kbytes); + * printk("[nr_kernel_pages] %ld\n", nr_kernel_pages); + * printk("[nr_all_pages] %ld\n", nr_all_pages); + * printk("[dma_reserve] %ld\n", dma_reserve); + */ +} + + + +static int online_coherent_memory(void) +{ + struct device_node *memory; + + nr_addr = 0; + disable_auto_online(); + dump_core_vm_tunables(); + for_each_compatible_node(memory, NULL, "ibm,memory-device") { + struct device_node *mem; + const __be64 *reg; + unsigned int len, ret; + phys_addr_t start, size; + + mem = of_parse_phandle(memory, "memory-region", 0); + if (!mem) { + pr_info("memory-region property not found\n"); + return -1; + } + + reg = of_get_property(mem, "reg", &len); + if (!reg || len <= 0) { + pr_info("memory-region property not found\n"); + return -1; + } + start = be64_to_cpu(*reg); + size = be64_to_cpu(*(reg + 1)); + pr_info("Coherent memory start %llx size %llx\n", start, size); + 
ret = memory_probe_store(start, size); + if (ret) + pr_info("probe failed\n"); + + ret = store_mem_state(start, size, "online_movable"); + if (ret) + pr_info("online_movable failed\n"); + + addr[nr_addr][0] = start; + addr[nr_addr][1] = size; + nr_addr++; + } + dump_core_vm_tunables(); + enable_auto_online(); + return 0; +} + +static int offline_coherent_memory(void) +{ + int i; + + for (i = 0; i < nr_addr; i++) + store_mem_state(addr[i][0], addr[i][1], "offline"); + return 0; +} + +static void __exit coherent_hotplug_exit(void) +{ + pr_info("%s\n", __func__); + offline_coherent_memory(); +} + +static int __init coherent_hotplug_init(void) +{ + pr_info("%s\n", __func__); + return online_coherent_memory(); +} +module_init(coherent_hotplug_init); +module_exit(coherent_hotplug_exit); +MODULE_LICENSE("GPL"); diff --git a/drivers/char/coherent_memory_demo.c b/drivers/char/coherent_memory_demo.c new file mode 100644 index 0000000..1dcd9f7 --- /dev/null +++ b/drivers/char/coherent_memory_demo.c @@ -0,0 +1,337 @@ +/* + * Demonstrating various aspects of the coherent memory. + * + * Copyright (C) 2016, Anshuman Khandual, IBM Corporation. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ */ +#include <linux/of.h> +#include <linux/export.h> +#include <linux/spinlock.h> +#include <linux/init.h> +#include <linux/memblock.h> +#include <linux/module.h> +#include <linux/memory.h> +#include <linux/sizes.h> +#include <linux/bitops.h> +#include <linux/device.h> +#include <linux/fs.h> +#include <linux/slab.h> +#include <linux/mm.h> +#include <linux/pagemap.h> +#include <linux/migrate.h> +#include <linux/memblock.h> +#include <linux/debugfs.h> +#include <linux/uaccess.h> + +#include <asm/mmu.h> +#include <asm/pgalloc.h> + +#define COHERENT_DEV_MAJOR 89 +#define COHERENT_DEV_NAME "coherent_memory" + +#define CRNT_NODE_NID1 1 +#define CRNT_NODE_NID2 2 +#define CRNT_NODE_NID3 3 + +#define RAM_CRNT_MIGRATE 1 +#define CRNT_RAM_MIGRATE 2 + +struct vma_map_info { + struct list_head list; + unsigned long nr_pages; + spinlock_t lock; +}; + +static void vma_map_info_init(struct vm_area_struct *vma) +{ + struct vma_map_info *info = kmalloc(sizeof(struct vma_map_info), + GFP_KERNEL); + + BUG_ON(!info); + INIT_LIST_HEAD(&info->list); + spin_lock_init(&info->lock); + vma->vm_private_data = info; + info->nr_pages = 0; +} + +static void coherent_vmops_open(struct vm_area_struct *vma) +{ + vma_map_info_init(vma); +} + +static void coherent_vmops_close(struct vm_area_struct *vma) +{ + struct vma_map_info *info = vma->vm_private_data; + + BUG_ON(!info); +again: + cond_resched(); + spin_lock(&info->lock); + while (info->nr_pages) { + struct page *page, *page2; + + list_for_each_entry_safe(page, page2, &info->list, lru) { + if (!trylock_page(page)) { + spin_unlock(&info->lock); + goto again; + } + + list_del_init(&page->lru); + info->nr_pages--; + unlock_page(page); + SetPageReclaim(page); + put_page(page); + } + spin_unlock(&info->lock); + cond_resched(); + spin_lock(&info->lock); + } + spin_unlock(&info->lock); + kfree(info); + vma->vm_private_data = NULL; +} + +static int coherent_vmops_fault(struct vm_area_struct *vma, + struct vm_fault *vmf) +{ + struct vma_map_info *info; 
+ struct page *page; + static int coherent_node = CRNT_NODE_NID1; + + if (coherent_node == CRNT_NODE_NID1) + coherent_node = CRNT_NODE_NID2; + else + coherent_node = CRNT_NODE_NID1; + + page = alloc_pages_node(coherent_node, + GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0); + if (!page) + return VM_FAULT_SIGBUS; + + info = (struct vma_map_info *) vma->vm_private_data; + BUG_ON(!info); + spin_lock(&info->lock); + list_add(&page->lru, &info->list); + info->nr_pages++; + spin_unlock(&info->lock); + + page->index = vmf->pgoff; + get_page(page); + vmf->page = page; + return 0; +} + +static const struct vm_operations_struct coherent_memory_vmops = { + .open = coherent_vmops_open, + .close = coherent_vmops_close, + .fault = coherent_vmops_fault, +}; + +static int coherent_memory_mmap(struct file *file, struct vm_area_struct *vma) +{ + pr_info("Mmap opened (file: %lx vma: %lx)\n", + (unsigned long) file, (unsigned long) vma); + vma->vm_ops = &coherent_memory_vmops; + coherent_vmops_open(vma); + return 0; +} + +static int coherent_memory_open(struct inode *inode, struct file *file) +{ + pr_info("Device opened (inode: %lx file: %lx)\n", + (unsigned long) inode, (unsigned long) file); + return 0; +} + +static int coherent_memory_close(struct inode *inode, struct file *file) +{ + pr_info("Device closed (inode: %lx file: %lx)\n", + (unsigned long) inode, (unsigned long) file); + return 0; +} + +static void lru_ram_coherent_migrate(unsigned long addr) +{ + struct mm_struct *mm = current->mm; + struct vm_area_struct *vma; + nodemask_t nmask; + LIST_HEAD(mlist); + + nodes_clear(nmask); + nodes_setall(nmask); + down_write(&mm->mmap_sem); + for (vma = mm->mmap; vma; vma = vma->vm_next) { + if ((addr < vma->vm_start) || (addr > vma->vm_end)) + continue; + break; + } + up_write(&mm->mmap_sem); + if (!vma) { + pr_info("%s: No VMA found\n", __func__); + return; + } + migrate_virtual_range(current->pid, vma->vm_start, vma->vm_end, 2); +} + +static void lru_coherent_ram_migrate(unsigned long 
addr) +{ + struct mm_struct *mm = current->mm; + struct vm_area_struct *vma; + nodemask_t nmask; + LIST_HEAD(mlist); + + nodes_clear(nmask); + nodes_setall(nmask); + down_write(&mm->mmap_sem); + for (vma = mm->mmap; vma; vma = vma->vm_next) { + if ((addr < vma->vm_start) || (addr > vma->vm_end)) + continue; + break; + } + up_write(&mm->mmap_sem); + if (!vma) { + pr_info("%s: No VMA found\n", __func__); + return; + } + migrate_virtual_range(current->pid, vma->vm_start, vma->vm_end, 0); +} + +static long coherent_memory_ioctl(struct file *file, + unsigned int cmd, unsigned long arg) +{ + switch (cmd) { + case RAM_CRNT_MIGRATE: + lru_ram_coherent_migrate(arg); + break; + + case CRNT_RAM_MIGRATE: + lru_coherent_ram_migrate(arg); + break; + + default: + pr_info("%s Invalid ioctl() command: %d\n", __func__, cmd); + return -EINVAL; + } + return 0; +} + +static const struct file_operations fops = { + .mmap = coherent_memory_mmap, + .open = coherent_memory_open, + .release = coherent_memory_close, + .unlocked_ioctl = &coherent_memory_ioctl +}; + +static char kbuf[100]; /* Will store original user passed buffer */ +static char str[100]; /* Working copy for individual substring */ + +static u64 args[4]; +static u64 index; +static void convert_substring(const char *buf) +{ + u64 val = 0; + + if (kstrtou64(buf, 0, &val)) + pr_info("String conversion failed\n"); + + args[index] = val; + index++; +} + +static ssize_t coherent_debug_write(struct file *file, + const char __user *user_buf, + size_t count, loff_t *ppos) +{ + char *tmp, *tmp1; + size_t ret; + + memset(args, 0, sizeof(args)); + index = 0; + + ret = simple_write_to_buffer(kbuf, sizeof(kbuf), ppos, user_buf, count); + if (ret < 0) + return ret; + + kbuf[ret] = '\0'; + tmp = kbuf; + do { + tmp1 = strchr(tmp, ','); + if (tmp1) { + *tmp1 = '\0'; + strncpy(str, (const char *)tmp, strlen(tmp)); + convert_substring(str); + } else { + strncpy(str, (const char *)tmp, strlen(tmp)); + convert_substring(str); + break; + } + tmp = 
tmp1 + 1; + memset(str, 0, sizeof(str)); + } while (true); + migrate_virtual_range(args[0], args[1], args[2], args[3]); + return ret; +} + +static int coherent_debug_show(struct seq_file *m, void *v) +{ + seq_puts(m, "Expected Value: <pid,vaddr,size,nid>\n"); + return 0; +} + +static int coherent_debug_open(struct inode *inode, struct file *filp) +{ + return single_open(filp, coherent_debug_show, NULL); +} + +static const struct file_operations coherent_debug_fops = { + .open = coherent_debug_open, + .write = coherent_debug_write, + .read = seq_read, + .llseek = seq_lseek, + .release = single_release, +}; + +static struct dentry *debugfile; + +static void coherent_memory_debugfs(void) +{ + + debugfile = debugfs_create_file("coherent_debug", 0644, NULL, NULL, + &coherent_debug_fops); + if (!debugfile) + pr_warn("Failed to create coherent_memory in debugfs"); +} + +static void __exit coherent_memory_exit(void) +{ + pr_info("%s\n", __func__); + debugfs_remove(debugfile); + unregister_chrdev(COHERENT_DEV_MAJOR, COHERENT_DEV_NAME); +} + +static int __init coherent_memory_init(void) +{ + int ret; + + pr_info("%s\n", __func__); + ret = register_chrdev(COHERENT_DEV_MAJOR, COHERENT_DEV_NAME, &fops); + if (ret < 0) { + pr_info("%s register_chrdev() failed\n", __func__); + return -1; + } + coherent_memory_debugfs(); + return 0; +} + +module_init(coherent_memory_init); +module_exit(coherent_memory_exit); +MODULE_LICENSE("GPL"); diff --git a/drivers/char/memory_online_sysfs.h b/drivers/char/memory_online_sysfs.h new file mode 100644 index 0000000..a5f022d --- /dev/null +++ b/drivers/char/memory_online_sysfs.h @@ -0,0 +1,148 @@ +/* + * Accessing sysfs interface for memory hotplug operation from + * inside the kernel. 
+ * + * Licensed under GPL V2 + */ +#ifndef __SYSFS_H +#define __SYSFS_H + +#include <linux/fs.h> +#include <linux/uaccess.h> + +#define AUTO_ONLINE_BLOCKS "/sys/devices/system/memory/auto_online_blocks" +#define BLOCK_SIZE_BYTES "/sys/devices/system/memory/block_size_bytes" +#define MEMORY_PROBE "/sys/devices/system/memory/probe" + +static ssize_t read_buf(char *filename, char *buf, ssize_t count) +{ + mm_segment_t old_fs; + struct file *filp; + loff_t pos = 0; + + if (!count) + return 0; + + old_fs = get_fs(); + set_fs(KERNEL_DS); + + filp = filp_open(filename, O_RDONLY, 0); + if (IS_ERR(filp)) { + count = PTR_ERR(filp); + goto err_open; + } + + count = vfs_read(filp, buf, count - 1, &pos); + buf[count] = '\0'; + + filp_close(filp, NULL); + +err_open: + set_fs(old_fs); + + return count; +} + +static unsigned long long read_0x(char *filename) +{ + unsigned long long ret; + char buf[32]; + + if (read_buf(filename, buf, 32) <= 0) + return 0; + + if (kstrtoull(buf, 16, &ret)) + return 0; + + return ret; +} + +static ssize_t write_buf(char *filename, char *buf) +{ + int ret; + mm_segment_t old_fs; + struct file *filp; + loff_t pos = 0; + + old_fs = get_fs(); + set_fs(KERNEL_DS); + + filp = filp_open(filename, O_WRONLY, 0); + if (IS_ERR(filp)) { + ret = PTR_ERR(filp); + goto err_open; + } + + ret = vfs_write(filp, buf, strlen(buf), &pos); + + filp_close(filp, NULL); + +err_open: + set_fs(old_fs); + + return ret; +} + +int memory_probe_store(phys_addr_t addr, phys_addr_t size) +{ + phys_addr_t block_sz = + read_0x(BLOCK_SIZE_BYTES); + long i; + + for (i = 0; i < size / block_sz; i++, addr += block_sz) { + char s[32]; + ssize_t count; + + snprintf(s, 32, "0x%llx", addr); + + count = write_buf(MEMORY_PROBE, s); + if (count < 0) + return count; + } + + return 0; +} + +int store_mem_state(phys_addr_t addr, phys_addr_t size, char *state) +{ + phys_addr_t block_sz = read_0x(BLOCK_SIZE_BYTES); + unsigned long start_block, end_block, i; + + start_block = addr / block_sz; + 
end_block = start_block + size / block_sz; + + for (i = end_block - 1; i >= start_block; i--) { + char filename[64]; + ssize_t count; + + snprintf(filename, 64, + "/sys/devices/system/memory/memory%ld/state", i); + + count = write_buf(filename, state); + if (count < 0) + return count; + } + + return 0; +} + +int disable_auto_online(void) +{ + int ret; + + ret = write_buf(AUTO_ONLINE_BLOCKS, "offline"); + if (ret) + return ret; + return 0; +} + +int enable_auto_online(void) +{ + int ret; + + ret = write_buf(AUTO_ONLINE_BLOCKS, "online"); + if (ret) + return ret; + return 0; +} +#endif diff --git a/mm/migrate.c b/mm/migrate.c index 06300bb..1fb2b19 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1405,6 +1405,7 @@ int migrate_virtual_range(int pid, unsigned long start, struct vm_area_struct *vma; nodemask_t nmask; int ret = -EINVAL; + bool found = false; LIST_HEAD(mlist); @@ -1414,6 +1415,7 @@ int migrate_virtual_range(int pid, unsigned long start, if ((!start) || (!end)) return -EINVAL; + pr_info("%s: %d %lx %lx %d: ", __func__, pid, start, end, nid); rcu_read_lock(); mm = find_task_by_vpid(pid)->mm; rcu_read_unlock(); @@ -1425,14 +1427,17 @@ int migrate_virtual_range(int pid, unsigned long start, if ((start < vma->vm_start) || (end > vma->vm_end)) continue; + found = true; ret = queue_pages_range(mm, start, end, &nmask, MPOL_MF_MOVE_ALL | MPOL_MF_DISCONTIG_OK, &mlist); if (ret) { + pr_info("queue_pages_range_failed\n"); putback_movable_pages(&mlist); break; } if (list_empty(&mlist)) { + pr_info("list_empty\n"); ret = -ENOMEM; break; } @@ -1440,12 +1445,17 @@ int migrate_virtual_range(int pid, unsigned long start, ret = migrate_pages(&mlist, new_node_page, NULL, nid, MIGRATE_SYNC, MR_COMPACTION); if (ret) { + pr_info("migration_failed\n"); putback_movable_pages(&mlist); } else { + pr_info("migration_passed\n"); if (isolated_cdm_node(nid)) mark_vma_cdm(vma); } } + if (!found) + pr_info("vma_missing\n"); + up_write(&mm->mmap_sem); return ret; } -- 2.1.0 ^ permalink 
raw reply related [flat|nested] 135+ messages in thread
tmp1 + 1; + memset(str, 0, sizeof(str)); + } while (true); + migrate_virtual_range(args[0], args[1], args[2], args[3]); + return ret; +} + +static int coherent_debug_show(struct seq_file *m, void *v) +{ + seq_puts(m, "Expected Value: <pid,vaddr,size,nid>\n"); + return 0; +} + +static int coherent_debug_open(struct inode *inode, struct file *filp) +{ + return single_open(filp, coherent_debug_show, NULL); +} + +static const struct file_operations coherent_debug_fops = { + .open = coherent_debug_open, + .write = coherent_debug_write, + .read = seq_read, + .llseek = seq_lseek, + .release = single_release, +}; + +static struct dentry *debugfile; + +static void coherent_memory_debugfs(void) +{ + + debugfile = debugfs_create_file("coherent_debug", 0644, NULL, NULL, + &coherent_debug_fops); + if (!debugfile) + pr_warn("Failed to create coherent_memory in debugfs"); +} + +static void __exit coherent_memory_exit(void) +{ + pr_info("%s\n", __func__); + debugfs_remove(debugfile); + unregister_chrdev(COHERENT_DEV_MAJOR, COHERENT_DEV_NAME); +} + +static int __init coherent_memory_init(void) +{ + int ret; + + pr_info("%s\n", __func__); + ret = register_chrdev(COHERENT_DEV_MAJOR, COHERENT_DEV_NAME, &fops); + if (ret < 0) { + pr_info("%s register_chrdev() failed\n", __func__); + return -1; + } + coherent_memory_debugfs(); + return 0; +} + +module_init(coherent_memory_init); +module_exit(coherent_memory_exit); +MODULE_LICENSE("GPL"); diff --git a/drivers/char/memory_online_sysfs.h b/drivers/char/memory_online_sysfs.h new file mode 100644 index 0000000..a5f022d --- /dev/null +++ b/drivers/char/memory_online_sysfs.h @@ -0,0 +1,148 @@ +/* + * Accessing sysfs interface for memory hotplug operation from + * inside the kernel. 
+ * + * Licensed under GPL V2 + */ +#ifndef __SYSFS_H +#define __SYSFS_H + +#include <linux/fs.h> +#include <linux/uaccess.h> + +#define AUTO_ONLINE_BLOCKS "/sys/devices/system/memory/auto_online_blocks" +#define BLOCK_SIZE_BYTES "/sys/devices/system/memory/block_size_bytes" +#define MEMORY_PROBE "/sys/devices/system/memory/probe" + +static ssize_t read_buf(char *filename, char *buf, ssize_t count) +{ + mm_segment_t old_fs; + struct file *filp; + loff_t pos = 0; + + if (!count) + return 0; + + old_fs = get_fs(); + set_fs(KERNEL_DS); + + filp = filp_open(filename, O_RDONLY, 0); + if (IS_ERR(filp)) { + count = PTR_ERR(filp); + goto err_open; + } + + count = vfs_read(filp, buf, count - 1, &pos); + buf[count] = '\0'; + + filp_close(filp, NULL); + +err_open: + set_fs(old_fs); + + return count; +} + +static unsigned long long read_0x(char *filename) +{ + unsigned long long ret; + char buf[32]; + + if (read_buf(filename, buf, 32) <= 0) + return 0; + + if (kstrtoull(buf, 16, &ret)) + return 0; + + return ret; +} + +static ssize_t write_buf(char *filename, char *buf) +{ + int ret; + mm_segment_t old_fs; + struct file *filp; + loff_t pos = 0; + + old_fs = get_fs(); + set_fs(KERNEL_DS); + + filp = filp_open(filename, O_WRONLY, 0); + if (IS_ERR(filp)) { + ret = PTR_ERR(filp); + goto err_open; + } + + ret = vfs_write(filp, buf, strlen(buf), &pos); + + filp_close(filp, NULL); + +err_open: + set_fs(old_fs); + + return ret; +} + +int memory_probe_store(phys_addr_t addr, phys_addr_t size) +{ + phys_addr_t block_sz = + read_0x(BLOCK_SIZE_BYTES); + long i; + + for (i = 0; i < size / block_sz; i++, addr += block_sz) { + char s[32]; + ssize_t count; + + snprintf(s, 32, "0x%llx", addr); + + count = write_buf(MEMORY_PROBE, s); + if (count < 0) + return count; + } + + return 0; +} + +int store_mem_state(phys_addr_t addr, phys_addr_t size, char *state) +{ + phys_addr_t block_sz = read_0x(BLOCK_SIZE_BYTES); + unsigned long start_block, end_block, i; + + start_block = addr / block_sz; + 
end_block = start_block + size / block_sz; + + for (i = end_block - 1; i >= start_block; i--) { + char filename[64]; + ssize_t count; + + snprintf(filename, 64, + "/sys/devices/system/memory/memory%ld/state", i); + + count = write_buf(filename, state); + if (count < 0) + return count; + } + + return 0; +} + +int disable_auto_online(void) +{ + int ret; + + ret = write_buf(AUTO_ONLINE_BLOCKS, "offline"); + if (ret) + return ret; + return 0; +} + +int enable_auto_online(void) +{ + int ret; + + ret = write_buf(AUTO_ONLINE_BLOCKS, "online"); + if (ret) + return ret; + return 0; +} +#endif diff --git a/mm/migrate.c b/mm/migrate.c index 06300bb..1fb2b19 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1405,6 +1405,7 @@ int migrate_virtual_range(int pid, unsigned long start, struct vm_area_struct *vma; nodemask_t nmask; int ret = -EINVAL; + bool found = false; LIST_HEAD(mlist); @@ -1414,6 +1415,7 @@ int migrate_virtual_range(int pid, unsigned long start, if ((!start) || (!end)) return -EINVAL; + pr_info("%s: %d %lx %lx %d: ", __func__, pid, start, end, nid); rcu_read_lock(); mm = find_task_by_vpid(pid)->mm; rcu_read_unlock(); @@ -1425,14 +1427,17 @@ int migrate_virtual_range(int pid, unsigned long start, if ((start < vma->vm_start) || (end > vma->vm_end)) continue; + found = true; ret = queue_pages_range(mm, start, end, &nmask, MPOL_MF_MOVE_ALL | MPOL_MF_DISCONTIG_OK, &mlist); if (ret) { + pr_info("queue_pages_range_failed\n"); putback_movable_pages(&mlist); break; } if (list_empty(&mlist)) { + pr_info("list_empty\n"); ret = -ENOMEM; break; } @@ -1440,12 +1445,17 @@ int migrate_virtual_range(int pid, unsigned long start, ret = migrate_pages(&mlist, new_node_page, NULL, nid, MIGRATE_SYNC, MR_COMPACTION); if (ret) { + pr_info("migration_failed\n"); putback_movable_pages(&mlist); } else { + pr_info("migration_passed\n"); if (isolated_cdm_node(nid)) mark_vma_cdm(vma); } } + if (!found) + pr_info("vma_missing\n"); + up_write(&mm->mmap_sem); return ret; } -- 2.1.0 -- To 
* [DEBUG 10/10] test: Add a script to perform random VMA migrations across nodes 2016-10-24 4:42 ` Anshuman Khandual @ 2016-10-24 4:42 ` Anshuman Khandual 0 siblings, 0 replies; 135+ messages in thread From: Anshuman Khandual @ 2016-10-24 4:42 UTC (permalink / raw) To: linux-kernel, linux-mm Cc: mhocko, js1304, vbabka, mgorman, minchan, akpm, aneesh.kumar, bsingharora

This is a test script which creates a workload (e.g. ebizzy), goes through its VMAs (/proc/pid/maps) and initiates migration to random nodes, which can be either system memory nodes or coherent memory nodes.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 tools/testing/selftests/vm/cdm_migration.sh | 76 +++++++++++++++++++++++++++++
 1 file changed, 76 insertions(+)
 create mode 100755 tools/testing/selftests/vm/cdm_migration.sh

diff --git a/tools/testing/selftests/vm/cdm_migration.sh b/tools/testing/selftests/vm/cdm_migration.sh
new file mode 100755
index 0000000..3ab7230
--- /dev/null
+++ b/tools/testing/selftests/vm/cdm_migration.sh
@@ -0,0 +1,76 @@
+#!/usr/bin/bash
+#
+# Should work with any workload and workload command line.
+# But for now ebizzy should be installed. Please run it
+# as root.
+#
+# Copyright (C) Anshuman Khandual 2016, IBM Corporation
+#
+# Licensed under GPL V2
+
+# Unload, build and reload modules
+if [ "$1" = "reload" ]
+then
+	rmmod coherent_memory_demo
+	rmmod coherent_hotplug_demo
+	cd ../../../../
+	make -s -j 64 modules
+	insmod drivers/char/coherent_hotplug_demo.ko
+	insmod drivers/char/coherent_memory_demo.ko
+	cd -
+fi
+
+# Workload
+workload=ebizzy
+work_cmd="ebizzy -T -z -m -t 128 -n 100000 -s 32768 -S 10000"
+
+pkill $workload
+$work_cmd &
+
+# File
+if [ -e input_file.txt ]
+then
+	rm input_file.txt
+fi
+
+# Inputs
+pid=`pidof ebizzy`
+cp /proc/$pid/maps input_file.txt
+if [ ! -e input_file.txt ]
+then
+	echo "Input file was not created"
+	exit
+fi
+input=input_file.txt
+
+# Migrations
+dmesg -C
+while read line
+do
+	addr_start=$(echo $line | cut -d '-' -f1)
+	addr_end=$(echo $line | cut -d '-' -f2 | cut -d ' ' -f1)
+	node=`expr $RANDOM % 4`
+
+	echo $pid,0x$addr_start,0x$addr_end,$node > /sys/kernel/debug/coherent_debug
+done < "$input"
+
+# Analyze dmesg output
+passed=`dmesg | grep "migration_passed" | wc -l`
+failed=`dmesg | grep "migration_failed" | wc -l`
+queuef=`dmesg | grep "queue_pages_range_failed" | wc -l`
+empty=`dmesg | grep "list_empty" | wc -l`
+missing=`dmesg | grep "vma_missing" | wc -l`
+
+# Stats
+echo passed $passed
+echo failed $failed
+echo queuef $queuef
+echo empty $empty
+echo missing $missing
+
+# Cleanup
+rm input_file.txt
+if pgrep -x $workload > /dev/null
+then
+	pkill $workload
+fi
-- 
2.1.0
* Re: [RFC 0/8] Define coherent device memory node 2016-10-24 4:31 ` Anshuman Khandual @ 2016-10-24 17:09 ` Jerome Glisse -1 siblings, 0 replies; 135+ messages in thread From: Jerome Glisse @ 2016-10-24 17:09 UTC (permalink / raw) To: Anshuman Khandual Cc: linux-kernel, linux-mm, mhocko, js1304, vbabka, mgorman, minchan, akpm, aneesh.kumar, bsingharora On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote: > [...] > Core kernel memory features like reclamation, evictions etc. might > need to be restricted or modified on the coherent device memory node as > they can be performance limiting. The RFC does not propose anything on this > yet but it can be looked into later on. For now it just disables Auto NUMA > for any VMA which has coherent device memory. > > Seamless integration of coherent device memory with system memory > will enable various other features, some of which can be listed as follows. > > a. Seamless migrations between system RAM and the coherent memory > b. Will have asynchronous and high throughput migrations > c. Be able to allocate huge order pages from these memory regions > d. Restrict allocations to a large extent to the tasks using the > device for workload acceleration > > Before concluding, will look into the reasons why the existing > solutions don't work. There are two basic requirements which have to be > satisfies before the coherent device memory can be integrated with core > kernel seamlessly. > > a. PFN must have struct page > b. Struct page must able to be inside standard LRU lists > > The above two basic requirements discard the existing method of > device memory representation approaches like these which then requires the > need of creating a new framework. I do not believe the LRU list is a hard requirement, yes when faulting in a page inside the page cache it assumes it needs to be added to lru list. But i think this can easily be work around. 
In HMM i am using ZONE_DEVICE and because memory is not accessible from CPU (not everyone is bless with decent system bus like CAPI, CCIX, Gen-Z, ...) so in my case a file back page must always be spawn first from a regular page and once read from disk then i can migrate to GPU page. So if you accept this intermediary step you can easily use ZONE_DEVICE for device memory. This way no lru, no complex dance to make the memory out of reach from regular memory allocator. I think we would have much to gain if we pool our effort on a single common solution for device memory. In my case the device memory is not accessible by the CPU (because PCIE restrictions), in your case it is. Thus the only difference is that in my case it can not be map inside the CPU page table while in yours it can. > > (1) Traditional ioremap > > a. Memory is mapped into kernel (linear and virtual) and user space > b. These PFNs do not have struct pages associated with it > c. These special PFNs are marked with special flags inside the PTE > d. Cannot participate in core VM functions much because of this > e. Cannot do easy user space migrations > > (2) Zone ZONE_DEVICE > > a. Memory is mapped into kernel and user space > b. PFNs do have struct pages associated with it > c. These struct pages are allocated inside it's own memory range > d. Unfortunately the struct page's union containing LRU has been > used for struct dev_pagemap pointer > e. Hence it cannot be part of any LRU (like Page cache) > f. Hence file cached mapping cannot reside on these PFNs > g. Cannot do easy migrations > > I had also explored non LRU representation of this coherent device > memory where the integration with system RAM in the core VM is limited only > to the following functions. Not being inside LRU is definitely going to > reduce the scope of tight integration with system RAM. 
> > (1) Migration support between system RAM and coherent memory > (2) Migration support between various coherent memory nodes > (3) Isolation of the coherent memory > (4) Mapping the coherent memory into user space through driver's > struct vm_operations > (5) HW poisoning of the coherent memory > > Allocating the entire memory of the coherent device node right > after hot plug into ZONE_MOVABLE (where the memory is already inside the > buddy system) will still expose a time window where other user space > allocations can come into the coherent device memory node and prevent the > intended isolation. So traditional hot plug is not the solution. Hence > started looking into CMA based non LRU solution but then hit the following > roadblocks. > > (1) CMA does not support hot plugging of new memory node > a. CMA area needs to be marked during boot before buddy is > initialized > b. cma_alloc()/cma_release() can happen on the marked area > c. Should be able to mark the CMA areas just after memory hot plug > d. cma_alloc()/cma_release() can happen later after the hot plug > e. This is not currently supported right now > > (2) Mapped non LRU migration of pages > a. Recent work from Michan Kim makes non LRU page migratable > b. But it still does not support migration of mapped non LRU pages > c. With non LRU CMA reserved, again there are some additional > challenges > > With hot pluggable CMA and non LRU mapped migration support there > may be an alternate approach to represent coherent device memory. Please > do review this RFC proposal and let me know your comments or suggestions. > Thank you. You can take a look at hmm-v13 if you want to see how i do non LRU page migration. While i put most of the migration code inside hmm_migrate.c it could easily be move to migrate.c without hmm_ prefix. There is 2 missing piece with existing migrate code. First is to put memory allocation for destination under control of who call the migrate code. 
Second is to allow offloading the copy operation to the device (i.e. not using the CPU to copy data). I believe the same requirements also make sense for the platform you are targeting. Thus the same code can be used. hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13 I haven't posted this patchset yet because we are doing some modifications to the device driver API to accommodate some new features. But the ZONE_DEVICE changes and the overall migration code will stay the same more or less (I have patches that move it to migrate.c and share more code with the existing migrate code). If you think I missed anything about lru and page cache please point it out to me. Because when I audited the code for that I didn't see any roadblock with the few filesystems I was looking at (ext4, xfs and core page cache code). > [...] Cheers, Jérôme
* Re: [RFC 0/8] Define coherent device memory node 2016-10-24 17:09 ` Jerome Glisse @ 2016-10-25 4:26 ` Aneesh Kumar K.V -1 siblings, 0 replies; 135+ messages in thread From: Aneesh Kumar K.V @ 2016-10-25 4:26 UTC (permalink / raw) To: Jerome Glisse, Anshuman Khandual Cc: linux-kernel, linux-mm, mhocko, js1304, vbabka, mgorman, minchan, akpm, bsingharora Jerome Glisse <j.glisse@gmail.com> writes: > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote: > >> [...] > >> Core kernel memory features like reclamation, evictions etc. might >> need to be restricted or modified on the coherent device memory node as >> they can be performance limiting. The RFC does not propose anything on this >> yet but it can be looked into later on. For now it just disables Auto NUMA >> for any VMA which has coherent device memory. >> >> Seamless integration of coherent device memory with system memory >> will enable various other features, some of which can be listed as follows. >> >> a. Seamless migrations between system RAM and the coherent memory >> b. Will have asynchronous and high throughput migrations >> c. Be able to allocate huge order pages from these memory regions >> d. Restrict allocations to a large extent to the tasks using the >> device for workload acceleration >> >> Before concluding, will look into the reasons why the existing >> solutions don't work. There are two basic requirements which have to be >> satisfies before the coherent device memory can be integrated with core >> kernel seamlessly. >> >> a. PFN must have struct page >> b. Struct page must able to be inside standard LRU lists >> >> The above two basic requirements discard the existing method of >> device memory representation approaches like these which then requires the >> need of creating a new framework. > > I do not believe the LRU list is a hard requirement, yes when faulting in > a page inside the page cache it assumes it needs to be added to lru list. 
> But i think this can easily be work around. > > In HMM i am using ZONE_DEVICE and because memory is not accessible from CPU > (not everyone is bless with decent system bus like CAPI, CCIX, Gen-Z, ...) > so in my case a file back page must always be spawn first from a regular > page and once read from disk then i can migrate to GPU page. > > So if you accept this intermediary step you can easily use ZONE_DEVICE for > device memory. This way no lru, no complex dance to make the memory out of > reach from regular memory allocator. One of the reasons to look at this as a NUMA node is to allow things like over-commit of coherent device memory. The pages backing CDM being part of lru and considering the coherent device as a numa node makes that really simpler (we can run kswapd for that node). > > I think we would have much to gain if we pool our effort on a single common > solution for device memory. In my case the device memory is not accessible > by the CPU (because PCIE restrictions), in your case it is. Thus the only > difference is that in my case it can not be map inside the CPU page table > while in yours it can. IMHO, we should be able to share the HMM migration approach. We definitely won't need the mirror page table part. That is one of the reasons I requested the HMM mirror page table to be a separate patchset. > >> >> (1) Traditional ioremap >> >> a. Memory is mapped into kernel (linear and virtual) and user space >> b. These PFNs do not have struct pages associated with it >> c. These special PFNs are marked with special flags inside the PTE >> d. Cannot participate in core VM functions much because of this >> e. Cannot do easy user space migrations >> >> (2) Zone ZONE_DEVICE >> >> a. Memory is mapped into kernel and user space >> b. PFNs do have struct pages associated with it >> c. These struct pages are allocated inside it's own memory range >> d. Unfortunately the struct page's union containing LRU has been >> used for struct dev_pagemap pointer >> e. 
Hence it cannot be part of any LRU (like the page cache) >> f. Hence file cached mappings cannot reside on these PFNs >> g. Cannot do easy migrations >> >> I had also explored a non LRU representation of this coherent device >> memory where the integration with system RAM in the core VM is limited only >> to the following functions. Not being inside the LRU is definitely going to >> reduce the scope of tight integration with system RAM. >> >> (1) Migration support between system RAM and coherent memory >> (2) Migration support between various coherent memory nodes >> (3) Isolation of the coherent memory >> (4) Mapping the coherent memory into user space through the driver's >> struct vm_operations >> (5) HW poisoning of the coherent memory >> >> Allocating the entire memory of the coherent device node right >> after hot plug into ZONE_MOVABLE (where the memory is already inside the >> buddy system) will still expose a time window where other user space >> allocations can come into the coherent device memory node and prevent the >> intended isolation. So traditional hot plug is not the solution. Hence I >> started looking into a CMA based non LRU solution but then hit the following >> roadblocks. >> >> (1) CMA does not support hot plugging of a new memory node >> a. The CMA area needs to be marked during boot before the buddy is >> initialized >> b. cma_alloc()/cma_release() can happen on the marked area >> c. Should be able to mark the CMA areas just after memory hot plug >> d. cma_alloc()/cma_release() can happen later after the hot plug >> e. This is not currently supported >> >> (2) Mapped non LRU migration of pages >> a. Recent work from Minchan Kim makes non LRU pages migratable >> b. But it still does not support migration of mapped non LRU pages >> c. With a non LRU CMA reserve, again there are some additional >> challenges >> >> With hot pluggable CMA and non LRU mapped migration support there >> may be an alternate approach to represent coherent device memory.
Please >> do review this RFC proposal and let me know your comments or suggestions. >> Thank you. > > You can take a look at hmm-v13 if you want to see how i do non LRU page > migration. While i put most of the migration code inside hmm_migrate.c it > could easily be moved to migrate.c without the hmm_ prefix. > > There are two missing pieces in the existing migrate code. The first is to put memory > allocation for the destination under the control of whoever calls the migrate code. The second > is to allow offloading the copy operation to the device (ie not using the CPU to > copy data). > > I believe the same requirements also make sense for the platform you are targeting. > Thus the same code can be used. > > hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13 > > I haven't posted this patchset yet because we are doing some modifications > to the device driver API to accommodate some new features. But the ZONE_DEVICE > changes and the overall migration code will stay the same more or less (i have > patches that move it to migrate.c and share more code with the existing migrate > code). > > If you think i missed anything about the lru and page cache please point it out to > me. Because when i audited the code for that i didn't see any road block with > the few fs i was looking at (ext4, xfs and core page cache code). I looked at hmm-v13 w.r.t. migration and I guess some form of device callback/acceleration during migration is something we should definitely have. I still haven't figured out how non addressable and coherent device memory can fit together there. I was waiting for the page cache migration support to be pushed to the repository before I start looking at this closely. -aneesh ^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [RFC 0/8] Define coherent device memory node 2016-10-25 4:26 ` Aneesh Kumar K.V @ 2016-10-25 15:16 ` Jerome Glisse -1 siblings, 0 replies; 135+ messages in thread From: Jerome Glisse @ 2016-10-25 15:16 UTC (permalink / raw) To: Aneesh Kumar K.V Cc: Anshuman Khandual, linux-kernel, linux-mm, mhocko, js1304, vbabka, mgorman, minchan, akpm, bsingharora On Tue, Oct 25, 2016 at 09:56:35AM +0530, Aneesh Kumar K.V wrote: > Jerome Glisse <j.glisse@gmail.com> writes: > > > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote: > > > >> [...] > > > >> Core kernel memory features like reclamation, evictions etc. might > >> need to be restricted or modified on the coherent device memory node as > >> they can be performance limiting. The RFC does not propose anything on this > >> yet but it can be looked into later on. For now it just disables Auto NUMA > >> for any VMA which has coherent device memory. > >> > >> Seamless integration of coherent device memory with system memory > >> will enable various other features, some of which can be listed as follows. > >> > >> a. Seamless migrations between system RAM and the coherent memory > >> b. Will have asynchronous and high throughput migrations > >> c. Be able to allocate huge order pages from these memory regions > >> d. Restrict allocations to a large extent to the tasks using the > >> device for workload acceleration > >> > >> Before concluding, I will look into the reasons why the existing > >> solutions don't work. There are two basic requirements which have to be > >> satisfied before the coherent device memory can be integrated with the core > >> kernel seamlessly. > >> > >> a. The PFN must have a struct page > >> b. The struct page must be able to sit on the standard LRU lists > >> > >> The above two basic requirements rule out the existing device memory > >> representation approaches listed below, which is what creates the need for > >> a new framework.
> > > > I do not believe the LRU list is a hard requirement, yes when faulting in > > a page inside the page cache it assumes the page needs to be added to the lru list. > > But i think this can easily be worked around. > > > > In HMM i am using ZONE_DEVICE and because the memory is not accessible from the CPU > > (not everyone is blessed with a decent system bus like CAPI, CCIX, Gen-Z, ...) > > in my case a file-backed page must always be spawned first from a regular > > page, and once it is read from disk i can then migrate it to a GPU page. > > > > So if you accept this intermediary step you can easily use ZONE_DEVICE for > > device memory. This way no lru, no complex dance to keep the memory out of > > reach of the regular memory allocator. > > One of the reasons to look at this as a NUMA node is to allow things like > over-commit of coherent device memory. The pages backing CDM being part of > the lru, and considering the coherent device as a numa node, makes that really > simple (we can run kswapd for that node). I am not convinced that kswapd is what you want for overcommit, for HMM i leave overcommit to the device driver and they seem quite happy about handling that themselves. Only the device driver has enough information on what is worth evicting or what needs to be evicted. > > I think we would have much to gain if we pool our effort on a single common > > solution for device memory. In my case the device memory is not accessible > > by the CPU (because of PCIE restrictions), in your case it is. Thus the only > > difference is that in my case it can not be mapped inside the CPU page table > > while in yours it can. > > IMHO, we should be able to share the HMM migration approach. We > definitely won't need the mirror page table part. That is one of the > reasons I requested the HMM mirror page table to be a separate patchset. They will need to share one thing, that is hmm_pfn_t, which is a special pfn type in which i store HMM and migrate specific flags for migration.
Because i can not use the struct list_head lru of struct page i have to do the migration using arrays of pfns, and i need to keep some flags per page during migration. So i share the same hmm_pfn_t type between the mirror and migrate code. But that's pretty small and it can be factored out of HMM; i can also just use pfn_t and add the flags i need there. > > > > >> > >> (1) Traditional ioremap > >> > >> a. Memory is mapped into kernel (linear and virtual) and user space > >> b. These PFNs do not have struct pages associated with them > >> c. These special PFNs are marked with special flags inside the PTE > >> d. Cannot participate in core VM functions much because of this > >> e. Cannot do easy user space migrations > >> > >> (2) Zone ZONE_DEVICE > >> > >> a. Memory is mapped into kernel and user space > >> b. PFNs do have struct pages associated with them > >> c. These struct pages are allocated inside their own memory range > >> d. Unfortunately the struct page's union containing the LRU has been > >> used for the struct dev_pagemap pointer > >> e. Hence it cannot be part of any LRU (like the page cache) > >> f. Hence file cached mappings cannot reside on these PFNs > >> g. Cannot do easy migrations > >> > >> I had also explored a non LRU representation of this coherent device > >> memory where the integration with system RAM in the core VM is limited only > >> to the following functions. Not being inside the LRU is definitely going to > >> reduce the scope of tight integration with system RAM.
> >> > >> (1) Migration support between system RAM and coherent memory > >> (2) Migration support between various coherent memory nodes > >> (3) Isolation of the coherent memory > >> (4) Mapping the coherent memory into user space through the driver's > >> struct vm_operations > >> (5) HW poisoning of the coherent memory > >> > >> Allocating the entire memory of the coherent device node right > >> after hot plug into ZONE_MOVABLE (where the memory is already inside the > >> buddy system) will still expose a time window where other user space > >> allocations can come into the coherent device memory node and prevent the > >> intended isolation. So traditional hot plug is not the solution. Hence I > >> started looking into a CMA based non LRU solution but then hit the following > >> roadblocks. > >> > >> (1) CMA does not support hot plugging of a new memory node > >> a. The CMA area needs to be marked during boot before the buddy is > >> initialized > >> b. cma_alloc()/cma_release() can happen on the marked area > >> c. Should be able to mark the CMA areas just after memory hot plug > >> d. cma_alloc()/cma_release() can happen later after the hot plug > >> e. This is not currently supported > >> > >> (2) Mapped non LRU migration of pages > >> a. Recent work from Minchan Kim makes non LRU pages migratable > >> b. But it still does not support migration of mapped non LRU pages > >> c. With a non LRU CMA reserve, again there are some additional > >> challenges > >> > >> With hot pluggable CMA and non LRU mapped migration support there > >> may be an alternate approach to represent coherent device memory. Please > >> do review this RFC proposal and let me know your comments or suggestions. > >> Thank you. > > > > You can take a look at hmm-v13 if you want to see how i do non LRU page > > migration. While i put most of the migration code inside hmm_migrate.c it > > could easily be moved to migrate.c without the hmm_ prefix. > > > > There are two missing pieces in the existing migrate code.
The first is to put memory > > allocation for the destination under the control of whoever calls the migrate code. The second > > is to allow offloading the copy operation to the device (ie not using the CPU to > > copy data). > > > > I believe the same requirements also make sense for the platform you are targeting. > > Thus the same code can be used. > > > > hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13 > > > > I haven't posted this patchset yet because we are doing some modifications > > to the device driver API to accommodate some new features. But the ZONE_DEVICE > > changes and the overall migration code will stay the same more or less (i have > > patches that move it to migrate.c and share more code with the existing migrate > > code). > > > > If you think i missed anything about the lru and page cache please point it out to > > me. Because when i audited the code for that i didn't see any road block with > > the few fs i was looking at (ext4, xfs and core page cache code). > > I looked at hmm-v13 w.r.t. migration and I guess some form of device > callback/acceleration during migration is something we should definitely > have. I still haven't figured out how non addressable and coherent device > memory can fit together there. I was waiting for the page cache > migration support to be pushed to the repository before I start looking > at this closely. > The page cache migration does not touch the migrate code path. My issue with the page cache is writeback. The only difference from the existing migrate code is the refcount check for ZONE_DEVICE pages. Everything else is the same. For writeback i need to use a bounce page, so basically i am trying to hook myself into the ISA bounce infrastructure for bio, and i think it is the easiest path to solve this in my case. In your case, where the block device can also access the device memory, you don't even need to use a bounce page for writeback. Cheers, Jérôme ^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [RFC 0/8] Define coherent device memory node 2016-10-25 15:16 ` Jerome Glisse @ 2016-10-26 11:09 ` Aneesh Kumar K.V -1 siblings, 0 replies; 135+ messages in thread From: Aneesh Kumar K.V @ 2016-10-26 11:09 UTC (permalink / raw) To: Jerome Glisse Cc: Anshuman Khandual, linux-kernel, linux-mm, mhocko, js1304, vbabka, mgorman, minchan, akpm, bsingharora Jerome Glisse <j.glisse@gmail.com> writes: > On Tue, Oct 25, 2016 at 09:56:35AM +0530, Aneesh Kumar K.V wrote: >> Jerome Glisse <j.glisse@gmail.com> writes: >> >> > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote: >> > >> I looked at hmm-v13 w.r.t. migration and I guess some form of device >> callback/acceleration during migration is something we should definitely >> have. I still haven't figured out how non addressable and coherent device >> memory can fit together there. I was waiting for the page cache >> migration support to be pushed to the repository before I start looking >> at this closely. >> > > The page cache migration does not touch the migrate code path. My issue with > the page cache is writeback. The only difference from the existing migrate code is > the refcount check for ZONE_DEVICE pages. Everything else is the same. What about the radix tree? Does the file system's migrate_page callback handle replacing a normal page with a ZONE_DEVICE page/exceptional entries? > > For writeback i need to use a bounce page so basically i am trying to hook myself > into the ISA bounce infrastructure for bio and i think it is the easiest path > to solve this in my case. > > In your case where the block device can also access the device memory you don't > even need to use a bounce page for writeback. > -aneesh ^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [RFC 0/8] Define coherent device memory node
  2016-10-26 11:09     ` Aneesh Kumar K.V
@ 2016-10-26 16:07       ` Jerome Glisse
  0 siblings, 0 replies; 135+ messages in thread
From: Jerome Glisse @ 2016-10-26 16:07 UTC (permalink / raw)
To: Aneesh Kumar K.V
Cc: Anshuman Khandual, linux-kernel, linux-mm, mhocko, js1304, vbabka,
	mgorman, minchan, akpm, bsingharora

On Wed, Oct 26, 2016 at 04:39:19PM +0530, Aneesh Kumar K.V wrote:
> Jerome Glisse <j.glisse@gmail.com> writes:
>
> > On Tue, Oct 25, 2016 at 09:56:35AM +0530, Aneesh Kumar K.V wrote:
> >> Jerome Glisse <j.glisse@gmail.com> writes:
> >>
> >> > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
> >> >
> >> I looked at the hmm-v13 w.r.t migration and I guess some form of device
> >> callback/acceleration during migration is something we should definitely
> >> have. I still haven't figured out how non addressable and coherent device
> >> memory can fit together there. I was waiting for the page cache
> >> migration support to be pushed to the repository before I start looking
> >> at this closely.
> >>
> >
> > The page cache migration does not touch the migrate code path. My issue with
> > page cache is writeback. The only difference with existing migrate code is
> > refcount check for ZONE_DEVICE page. Everything else is the same.
>
> What about the radix tree ? does file system migrate_page callback handle
> replacing normal page with ZONE_DEVICE page/exceptional entries ?
>

It uses the exact same existing code (from mm/migrate.c), so yes, the radix
tree is updated and buffer_heads are migrated.

Jérôme
* Re: [RFC 0/8] Define coherent device memory node
  2016-10-26 16:07       ` Jerome Glisse
@ 2016-10-28  5:29         ` Aneesh Kumar K.V
  0 siblings, 0 replies; 135+ messages in thread
From: Aneesh Kumar K.V @ 2016-10-28 5:29 UTC (permalink / raw)
To: Jerome Glisse
Cc: Anshuman Khandual, linux-kernel, linux-mm, mhocko, js1304, vbabka,
	mgorman, minchan, akpm, bsingharora

Jerome Glisse <j.glisse@gmail.com> writes:

> On Wed, Oct 26, 2016 at 04:39:19PM +0530, Aneesh Kumar K.V wrote:
>> Jerome Glisse <j.glisse@gmail.com> writes:
>>
>> > On Tue, Oct 25, 2016 at 09:56:35AM +0530, Aneesh Kumar K.V wrote:
>> >> Jerome Glisse <j.glisse@gmail.com> writes:
>> >>
>> >> > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
>> >> >
>> >> I looked at the hmm-v13 w.r.t migration and I guess some form of device
>> >> callback/acceleration during migration is something we should definitely
>> >> have. I still haven't figured out how non addressable and coherent device
>> >> memory can fit together there. I was waiting for the page cache
>> >> migration support to be pushed to the repository before I start looking
>> >> at this closely.
>> >>
>> >
>> > The page cache migration does not touch the migrate code path. My issue with
>> > page cache is writeback. The only difference with existing migrate code is
>> > refcount check for ZONE_DEVICE page. Everything else is the same.
>>
>> What about the radix tree ? does file system migrate_page callback handle
>> replacing normal page with ZONE_DEVICE page/exceptional entries ?
>>
>
> It use the exact same existing code (from mm/migrate.c) so yes the radix tree
> is updated and buffer_head are migrated.
>

I looked at the page cache migration patches shared and I find that you are
not using exceptional entries when we migrate a page cache page to device
memory. But I am now not sure how a read from the page cache will work with
that. ie, a file system read will now find the page in the page cache. But we
cannot do a copy_to_user of that page because it is now backed by
unaddressable memory, right?

do_generic_file_read() does

	page = find_get_page(mapping, index);
	....
	ret = copy_page_to_iter(page, offset, nr, iter);

which does

	void *kaddr = kmap_atomic(page);
	size_t wanted = copy_to_iter(kaddr + offset, bytes, i);
	kunmap_atomic(kaddr);

-aneesh
* Re: [RFC 0/8] Define coherent device memory node
  2016-10-28  5:29         ` Aneesh Kumar K.V
@ 2016-10-28 16:16           ` Jerome Glisse
  0 siblings, 0 replies; 135+ messages in thread
From: Jerome Glisse @ 2016-10-28 16:16 UTC (permalink / raw)
To: Aneesh Kumar K.V
Cc: Anshuman Khandual, linux-kernel, linux-mm, mhocko, js1304, vbabka,
	mgorman, minchan, akpm, bsingharora

On Fri, Oct 28, 2016 at 10:59:52AM +0530, Aneesh Kumar K.V wrote:
> Jerome Glisse <j.glisse@gmail.com> writes:
>
> > On Wed, Oct 26, 2016 at 04:39:19PM +0530, Aneesh Kumar K.V wrote:
> >> Jerome Glisse <j.glisse@gmail.com> writes:
> >>
> >> > On Tue, Oct 25, 2016 at 09:56:35AM +0530, Aneesh Kumar K.V wrote:
> >> >> Jerome Glisse <j.glisse@gmail.com> writes:
> >> >>
> >> >> > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
> >> >> >
> >> >> I looked at the hmm-v13 w.r.t migration and I guess some form of device
> >> >> callback/acceleration during migration is something we should definitely
> >> >> have. I still haven't figured out how non addressable and coherent device
> >> >> memory can fit together there. I was waiting for the page cache
> >> >> migration support to be pushed to the repository before I start looking
> >> >> at this closely.
> >> >>
> >> >
> >> > The page cache migration does not touch the migrate code path. My issue with
> >> > page cache is writeback. The only difference with existing migrate code is
> >> > refcount check for ZONE_DEVICE page. Everything else is the same.
> >>
> >> What about the radix tree ? does file system migrate_page callback handle
> >> replacing normal page with ZONE_DEVICE page/exceptional entries ?
> >>
> >
> > It use the exact same existing code (from mm/migrate.c) so yes the radix tree
> > is updated and buffer_head are migrated.
> >
>
> I looked at the the page cache migration patches shared and I find that
> you are not using exceptional entries when we migrate a page cache page to
> device memory. But I am now not sure how a read from page cache will
> work with that.
>
> ie, a file system read will now find the page in page cache. But we
> cannot do a copy_to_user of that page because that is now backed by an
> unaddressable memory right ?
>
> do_generic_file_read() does
> 	page = find_get_page(mapping, index);
> 	....
> 	ret = copy_page_to_iter(page, offset, nr, iter);
>
> which does
> 	void *kaddr = kmap_atomic(page);
> 	size_t wanted = copy_to_iter(kaddr + offset, bytes, i);
> 	kunmap_atomic(kaddr);

Like i said, right now for un-addressable memory my patches are mostly broken
for read and write. I am focusing on page writeback for now as it seemed to
be the more problematic case.

For read/write the intention is to trigger a migration back to system memory
inside the read/write path of the filesystem. This is also why i will need a
flag to indicate if a filesystem supports migration to un-addressable memory.

But in your case, where the device memory is accessible, it should just work.
Or do you need to do special things when kmaping a device page?

Cheers,
Jérôme
* Re: [RFC 0/8] Define coherent device memory node
  2016-10-25  4:26   ` Aneesh Kumar K.V
@ 2016-11-05  5:21     ` Anshuman Khandual
  0 siblings, 0 replies; 135+ messages in thread
From: Anshuman Khandual @ 2016-11-05 5:21 UTC (permalink / raw)
To: Aneesh Kumar K.V, Jerome Glisse
Cc: linux-kernel, linux-mm, mhocko, js1304, vbabka, mgorman, minchan,
	akpm, bsingharora

On 10/25/2016 09:56 AM, Aneesh Kumar K.V wrote:
> I looked at the hmm-v13 w.r.t migration and I guess some form of device
> callback/acceleration during migration is something we should definitely
> have. I still haven't figured out how non addressable and coherent device
> memory can fit together there. I was waiting for the page cache
> migration support to be pushed to the repository before I start looking
> at this closely.

Aneesh, did not get that. Currently basic page cache migration is supported,
right? The device callbacks during migration, fault etc. are supported through
the page->pgmap pointer and by extending the dev_pagemap structure to
accommodate new members. IIUC that is the reason ZONE_DEVICE is being
modified, so that page->pgmap overloading can be used for various
driver/device specific callbacks while inside core VM functions or HMM
functions.

HMM V13 has introduced non-addressable ZONE_DEVICE based device memory which
can have its struct pages in system RAM but cannot be accessed from the CPU.
Now coherent device memory is kind of similar to persistent memory like
NVDIMM, which is already supported through ZONE_DEVICE (though we might not
want to use vmem_altmap and instead have the struct pages in system RAM). Now
HMM has to learn to work with 'dev_pagemap->addressable' type device memory
and then support all possible migrations through its API. So in a nutshell,
these are the changes we need to do to make HMM work with coherent device
memory.

(0) Support all possible migrations between system RAM and device memory
    for current un-addressable device memory and make the HMM migration
    API layer comprehensive and complete.

(1) Create coherent device memory representation in ZONE_DEVICE
    (a) Make it exactly the same as that of persistent memory/NVDIMM

    or

    (b) Create a new type for coherent device memory representation

(2) Support all possible migrations between system RAM and device memory
    for new addressable coherent device memory represented in ZONE_DEVICE,
    extending the HMM migration API layer.

Right now, the HMM V13 patch series supports migration for a subset of
private anonymous pages for un-addressable device memory. I am wondering how
difficult it is to implement all possible anon and file mapping migration
support for both un-addressable and addressable coherent device memory
through ZONE_DEVICE.
* Re: [RFC 0/8] Define coherent device memory node
  2016-11-05  5:21     ` Anshuman Khandual
@ 2016-11-05 18:02       ` Jerome Glisse
  0 siblings, 0 replies; 135+ messages in thread
From: Jerome Glisse @ 2016-11-05 18:02 UTC (permalink / raw)
To: Anshuman Khandual
Cc: Aneesh Kumar K.V, linux-kernel, linux-mm, mhocko, js1304, vbabka,
	mgorman, minchan, akpm, bsingharora

On Sat, Nov 05, 2016 at 10:51:21AM +0530, Anshuman Khandual wrote:
> On 10/25/2016 09:56 AM, Aneesh Kumar K.V wrote:
> > I looked at the hmm-v13 w.r.t migration and I guess some form of device
> > callback/acceleration during migration is something we should definitely
> > have. I still haven't figured out how non addressable and coherent device
> > memory can fit together there. I was waiting for the page cache
> > migration support to be pushed to the repository before I start looking
> > at this closely.
>
> Aneesh, did not get that. Currently basic page cache migration is supported,
> right ? The device callback during migration, fault etc are supported through
> page->pgmap pointer and extending dev_pagemap structure to accommodate new
> members. IIUC that is the reason ZONE_DEVICE is being modified so that page
> ->pgmap overloading can be used for various driver/device specific callbacks
> while inside core VM functions or HMM functions.
>
> HMM V13 has introduced non-addressable ZONE_DEVICE based device memory which
> can have it's struct pages in system RAM but they cannot be accessed from the
> CPU. Now coherent device memory is kind of similar to persistent memory like
> NVDIMM which is already supported through ZONE_DEVICE (though we might not
> want to use vmem_altmap instead have the struct pages in the system RAM).
> Now HMM has to learn working with 'dev_pagemap->addressable' type of device
> memory and then support all possible migrations through it's API. So in a
> nutshell, these are the changes we need to do to make HMM work with coherent
> device memory.
>
> (0) Support all possible migrations between system RAM and device memory
>     for current un-addressable device memory and make the HMM migration
>     API layer comprehensive and complete.

What is not comprehensive or complete in the API layer? I think the API is
pretty clear: the migrate function does not rely on anything except the HMM
pfn.

>
> (1) Create coherent device memory representation in ZONE_DEVICE
>     (a) Make it exactly the same as that of persistent memory/NVDIMM
>
>     or
>
>     (b) Create a new type for coherent device memory representation

So i will soon push an updated tree with modifications to the HMM API (from
the device driver point of view, but the migrate stuff is virtually the
same). I split the addressable and movable concepts, and thus it is now easy
to support both coherent addressable memory and non addressable memory.

>
> (2) Support all possible migrations between system RAM and device memory
>     for new addressable coherent device memory represented in ZONE_DEVICE
>     extending the HMM migration API layer.
>
> Right now, HMM V13 patch series supports migration for a subset of private
> anonymous pages for un-addressable device memory. I am wondering how difficult
> is it to implement all possible anon, file mapping migration support for both
> un-addressable and addressable coherent device memory through ZONE_DEVICE.
>

There is no need to extend the API to support file-backed pages. As a matter
of fact, the 2 patches i sent you do support migration of file-backed pages
(page->mapping) to and from ZONE_DEVICE, as long as this ZONE_DEVICE memory
is accessible by the CPU and coherent. What i am still working on is the non
addressable case, which is way more tedious (handle direct IO, read, write
and writeback).

So the difficulty for coherent memory is nil; it is the non addressable
memory that is hard to support with respect to file-backed pages.

Cheers,
Jérôme
* Re: [RFC 0/8] Define coherent device memory node
  2016-10-24 17:09 ` Jerome Glisse
@ 2016-10-25  4:59   ` Aneesh Kumar K.V
  0 siblings, 0 replies; 135+ messages in thread
From: Aneesh Kumar K.V @ 2016-10-25 4:59 UTC (permalink / raw)
To: Jerome Glisse, Anshuman Khandual
Cc: linux-kernel, linux-mm, mhocko, js1304, vbabka, mgorman, minchan, akpm, bsingharora

Jerome Glisse <j.glisse@gmail.com> writes:

> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
>
>> [...]
>>
>> Core kernel memory features like reclamation, eviction, etc. might
>> need to be restricted or modified on the coherent device memory node as
>> they can be performance limiting. The RFC does not propose anything on this
>> yet but it can be looked into later on. For now it just disables Auto NUMA
>> for any VMA which has coherent device memory.
>>
>> Seamless integration of coherent device memory with system memory
>> will enable various other features, some of which can be listed as follows.
>>
>> a. Seamless migrations between system RAM and the coherent memory
>> b. Asynchronous and high throughput migrations
>> c. The ability to allocate huge order pages from these memory regions
>> d. Restricting allocations to a large extent to the tasks using the
>>    device for workload acceleration
>>
>> Before concluding, let us look into the reasons why the existing
>> solutions do not work. There are two basic requirements which have to be
>> satisfied before the coherent device memory can be integrated with the
>> core kernel seamlessly.
>>
>> a. The PFN must have a struct page
>> b. The struct page must be able to be on the standard LRU lists
>>
>> The above two basic requirements rule out the existing device memory
>> representation approaches listed below, which creates the need for a new
>> framework.
>
> I do not believe the LRU list is a hard requirement; yes, when faulting in
> a page inside the page cache it assumes it needs to be added to the LRU
> list, but I think this can easily be worked around.
>
> In HMM I am using ZONE_DEVICE, and because the memory is not accessible
> from the CPU (not everyone is blessed with a decent system bus like CAPI,
> CCIX, Gen-Z, ...), in my case a file-backed page must always be spawned
> first as a regular page, and once read from disk it can be migrated to a
> GPU page.
>
> So if you accept this intermediary step you can easily use ZONE_DEVICE for
> device memory. This way there is no LRU and no complex dance to keep the
> memory out of reach of the regular memory allocator.
>
> I think we would have much to gain if we pooled our effort on a single
> common solution for device memory. In my case the device memory is not
> accessible by the CPU (because of PCIe restrictions); in your case it is.
> Thus the only difference is that in my case it cannot be mapped inside the
> CPU page table while in yours it can.
>
>>
>> (1) Traditional ioremap
>>
>> a. Memory is mapped into kernel (linear and virtual) and user space
>> b. These PFNs do not have struct pages associated with them
>> c. These special PFNs are marked with special flags inside the PTE
>> d. Cannot participate in core VM functions much because of this
>> e. Cannot do easy user space migrations
>>
>> (2) Zone ZONE_DEVICE
>>
>> a. Memory is mapped into kernel and user space
>> b. PFNs do have struct pages associated with them
>> c. These struct pages are allocated inside its own memory range
>> d. Unfortunately the struct page's union containing the LRU has been
>>    used for the struct dev_pagemap pointer
>> e. Hence it cannot be part of any LRU (like the page cache)
>> f. Hence file cached mappings cannot reside on these PFNs
>> g. Cannot do easy migrations
>>
>> I had also explored a non-LRU representation of this coherent device
>> memory where the integration with system RAM in the core VM is limited
>> only to the following functions. Not being inside the LRU is definitely
>> going to reduce the scope of tight integration with system RAM.
>>
>> (1) Migration support between system RAM and coherent memory
>> (2) Migration support between various coherent memory nodes
>> (3) Isolation of the coherent memory
>> (4) Mapping the coherent memory into user space through the driver's
>>     struct vm_operations
>> (5) HW poisoning of the coherent memory
>>
>> Allocating the entire memory of the coherent device node right
>> after hot plug into ZONE_MOVABLE (where the memory is already inside the
>> buddy system) will still expose a time window where other user space
>> allocations can come into the coherent device memory node and prevent the
>> intended isolation. So traditional hot plug is not the solution. Hence I
>> started looking into a CMA based non-LRU solution but then hit the
>> following roadblocks.
>>
>> (1) CMA does not support hot plugging of a new memory node
>>     a. The CMA area needs to be marked during boot before the buddy is
>>        initialized
>>     b. cma_alloc()/cma_release() can happen on the marked area
>>     c. Should be able to mark the CMA areas just after memory hot plug
>>     d. cma_alloc()/cma_release() can happen later after the hot plug
>>     e. This is not currently supported
>>
>> (2) Mapped non-LRU migration of pages
>>     a. Recent work from Minchan Kim makes non-LRU pages migratable
>>     b. But it still does not support migration of mapped non-LRU pages
>>     c. With non-LRU CMA reserved, again there are some additional
>>        challenges
>>
>> With hot pluggable CMA and non-LRU mapped migration support there
>> may be an alternate approach to represent coherent device memory. Please
>> do review this RFC proposal and let me know your comments or suggestions.
>> Thank you.
>
> You can take a look at hmm-v13 if you want to see how I do non-LRU page
> migration. While I put most of the migration code inside hmm_migrate.c it
> could easily be moved to migrate.c without the hmm_ prefix.
>
> There are two missing pieces in the existing migrate code. First is to put
> memory allocation for the destination under the control of whoever calls
> the migrate code. Second is to allow offloading the copy operation to the
> device (i.e. not use the CPU to copy data).
>
> I believe the same requirements also make sense for the platform you are
> targeting. Thus the same code can be used.
>
> hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13
>
> I haven't posted this patchset yet because we are doing some modifications
> to the device driver API to accommodate some new features. But the
> ZONE_DEVICE changes and the overall migration code will stay the same more
> or less (I have patches that move it to migrate.c and share more code with
> the existing migrate code).
>
> If you think I missed anything about the LRU and page cache please point
> it out to me. Because when I audited the code for that I didn't see any
> road block with the few filesystems I was looking at (ext4, xfs and the
> core page cache code).

The other restriction around ZONE_DEVICE is that it is not a managed zone.
That prevents any direct allocation from the coherent device by an
application; i.e., we would like to force allocation from the coherent
device using an interface like mbind(MPOL_BIND...). Is that possible with
ZONE_DEVICE?

-aneesh
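[The mbind(MPOL_BIND...) interface Aneesh refers to is the standard NUMA
memory policy syscall. As an illustration only, the sketch below shows what
"force allocation from a given node" looks like through that interface. The
device node number is a placeholder (node 0 is used so the sketch runs on any
NUMA-enabled kernel, since CDM nodes were never merged in this form), and raw
syscall invocation avoids a libnuma dependency.]

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* From <linux/mempolicy.h>; redefined here to avoid libnuma headers. */
#define MPOL_BIND 2

/* Bind [addr, addr+len) so that page faults in the range must be
 * satisfied from 'node'. Returns 0 on success, -errno on failure.
 * Call after mmap() and before first touch; the binding takes effect
 * when the pages are actually faulted in. */
static long bind_range_to_node(void *addr, size_t len, int node)
{
    unsigned long nodemask = 1UL << node;

    /* maxnode is a bit count; one unsigned long covers nodes 0-63. */
    if (syscall(SYS_mbind, addr, len, MPOL_BIND, &nodemask,
                8 * sizeof(nodemask), 0) != 0)
        return -errno;
    return 0;
}
```

A CDM-aware variant of this call is exactly what the question above asks
about: whether such a binding could ever name a ZONE_DEVICE-backed node.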
* Re: [RFC 0/8] Define coherent device memory node
  2016-10-25  4:59 ` Aneesh Kumar K.V
@ 2016-10-25 15:32   ` Jerome Glisse
  0 siblings, 0 replies; 135+ messages in thread
From: Jerome Glisse @ 2016-10-25 15:32 UTC (permalink / raw)
To: Aneesh Kumar K.V
Cc: Anshuman Khandual, linux-kernel, linux-mm, mhocko, js1304, vbabka, mgorman, minchan, akpm, bsingharora

On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote:
> Jerome Glisse <j.glisse@gmail.com> writes:
> > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:

[...]

> > You can take a look at hmm-v13 if you want to see how I do non-LRU page
> > migration. While I put most of the migration code inside hmm_migrate.c
> > it could easily be moved to migrate.c without the hmm_ prefix.
> >
> > There are two missing pieces in the existing migrate code. First is to
> > put memory allocation for the destination under the control of whoever
> > calls the migrate code. Second is to allow offloading the copy operation
> > to the device (i.e. not use the CPU to copy data).
> >
> > I believe the same requirements also make sense for the platform you are
> > targeting. Thus the same code can be used.
> >
> > hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13
> >
> > I haven't posted this patchset yet because we are doing some
> > modifications to the device driver API to accommodate some new features.
> > But the ZONE_DEVICE changes and the overall migration code will stay the
> > same more or less (I have patches that move it to migrate.c and share
> > more code with the existing migrate code).
> >
> > If you think I missed anything about the LRU and page cache please point
> > it out to me. Because when I audited the code for that I didn't see any
> > road block with the few filesystems I was looking at (ext4, xfs and the
> > core page cache code).
>
> The other restriction around ZONE_DEVICE is that it is not a managed zone.
> That prevents any direct allocation from the coherent device by an
> application; i.e., we would like to force allocation from the coherent
> device using an interface like mbind(MPOL_BIND...). Is that possible with
> ZONE_DEVICE?

To achieve this we rely on the device fault code path: when the device takes
a page fault, with the help of HMM it will use existing memory, if any, for
the fault address; but if the CPU page table is empty (and it is not a
file-backed VMA, because of read-back) then the device can directly allocate
device memory and HMM will update the CPU page table to point to the newly
allocated device memory.

So in fact I am not using an existing kernel API to achieve this; the whole
policy of where to allocate and what to allocate is under the device
driver's responsibility, and the device driver leverages its existing
userspace API to get proper hints/direction from the application.

Device memory is really a special case in my view; it only makes sense to
use it if the memory is actively accessed by the device, and the only way
the device accesses memory is when it is programmed to do so through the
device driver API. There is no such thing as GPU threads in the kernel and
there is no way to spawn or move a work thread to the GPU. These are
specialized devices and they require special per-device APIs.

So in my view using an existing kernel API such as mbind() is
counterproductive. You might have buggy software that mbinds its memory to
the device and never uses the device, which leads to device memory being
wasted on a process that never uses the device. So my opinion is that you
should not try to use existing kernel APIs to get policy information from
userspace but let the device driver gather such policy through its own
private API.

Cheers,
Jérôme
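[The first-touch policy Jerome describes (existing CPU memory wins; a
file-backed VMA must go through the page cache; only an empty anonymous
mapping lets the device allocate its own memory) reduces to a small decision
function. The sketch below is plain C with entirely hypothetical types; it
mirrors the logic of the mail, not any real HMM interface.]

```c
#include <stdbool.h>

/* Hypothetical device-fault context; not a real HMM structure. */
struct dev_fault {
    bool cpu_pte_present;  /* CPU page table already maps this address */
    bool file_backed;      /* VMA is file backed (needs page-cache read) */
};

enum backing {
    BACKING_EXISTING_RAM,  /* reuse the page the CPU already has */
    BACKING_NEW_RAM,       /* spawn a regular page via the page cache */
    BACKING_DEVICE_MEM,    /* device allocates its own memory directly */
};

/* Allocation policy from the mail: the device driver, not a generic
 * syscall, decides where the faulting page comes from. */
enum backing resolve_device_fault(const struct dev_fault *f)
{
    if (f->cpu_pte_present)
        return BACKING_EXISTING_RAM;
    if (f->file_backed)
        return BACKING_NEW_RAM;
    return BACKING_DEVICE_MEM;
}
```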
* Re: [RFC 0/8] Define coherent device memory node
  2016-10-25 15:32 ` Jerome Glisse
@ 2016-10-25 17:31   ` Aneesh Kumar K.V
  0 siblings, 0 replies; 135+ messages in thread
From: Aneesh Kumar K.V @ 2016-10-25 17:31 UTC (permalink / raw)
To: Jerome Glisse
Cc: Anshuman Khandual, linux-kernel, linux-mm, mhocko, js1304, vbabka, mgorman, minchan, akpm, bsingharora

Jerome Glisse <j.glisse@gmail.com> writes:

> On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote:
>> Jerome Glisse <j.glisse@gmail.com> writes:
>> > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
>
> [...]
>
>> > You can take a look at hmm-v13 if you want to see how I do non-LRU page
>> > migration. While I put most of the migration code inside hmm_migrate.c
>> > it could easily be moved to migrate.c without the hmm_ prefix.
>> >
>> > There are two missing pieces in the existing migrate code. First is to
>> > put memory allocation for the destination under the control of whoever
>> > calls the migrate code. Second is to allow offloading the copy
>> > operation to the device (i.e. not use the CPU to copy data).
>> >
>> > I believe the same requirements also make sense for the platform you
>> > are targeting. Thus the same code can be used.
>> >
>> > hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13
>> >
>> > I haven't posted this patchset yet because we are doing some
>> > modifications to the device driver API to accommodate some new
>> > features. But the ZONE_DEVICE changes and the overall migration code
>> > will stay the same more or less (I have patches that move it to
>> > migrate.c and share more code with the existing migrate code).
>> >
>> > If you think I missed anything about the LRU and page cache please
>> > point it out to me. Because when I audited the code for that I didn't
>> > see any road block with the few filesystems I was looking at (ext4,
>> > xfs and the core page cache code).
>>
>> The other restriction around ZONE_DEVICE is that it is not a managed
>> zone. That prevents any direct allocation from the coherent device by an
>> application; i.e., we would like to force allocation from the coherent
>> device using an interface like mbind(MPOL_BIND...). Is that possible
>> with ZONE_DEVICE?
>
> To achieve this we rely on the device fault code path: when the device
> takes a page fault, with the help of HMM it will use existing memory, if
> any, for the fault address; but if the CPU page table is empty (and it is
> not a file-backed VMA, because of read-back) then the device can directly
> allocate device memory and HMM will update the CPU page table to point to
> the newly allocated device memory.

That is OK if the device touches the page first. What if we want an
allocation touched first by the CPU to come from GPU memory? Should we
always depend on the GPU driver to migrate such pages later from system RAM
to GPU memory?

-aneesh
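[The "migrate later" alternative Aneesh mentions corresponds, in stock
kernels, to the move_pages(2) syscall, which migrates already-touched pages
of a process between NUMA nodes. A minimal sketch follows, again with node 0
standing in for a device memory node and raw syscall invocation to avoid
libnuma; it is an illustration of the mechanism, not of any CDM or HMM code.]

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* From <linux/mempolicy.h>; only private pages are moved. */
#define MPOL_MF_MOVE (1 << 1)

/* Ask the kernel to migrate the page containing 'addr' (which must
 * already be faulted in) to 'node'. Returns the node the page ends up
 * on (>= 0) on success, or a negative errno. */
static long migrate_page_to_node(void *addr, int node)
{
    void *pages[1] = { addr };
    int nodes[1]   = { node };
    int status[1]  = { -1 };

    if (syscall(SYS_move_pages, 0 /* this process */, 1UL, pages,
                nodes, status, MPOL_MF_MOVE) != 0)
        return -errno;
    return status[0];
}
```

Whether applications should drive such migrations themselves or leave them
entirely to the device driver is exactly the disagreement in this subthread.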
* Re: [RFC 0/8] Define coherent device memory node
  2016-10-25 17:31 ` Aneesh Kumar K.V
@ 2016-10-25 18:52   ` Jerome Glisse
  0 siblings, 0 replies; 135+ messages in thread
From: Jerome Glisse @ 2016-10-25 18:52 UTC (permalink / raw)
To: Aneesh Kumar K.V
Cc: Anshuman Khandual, linux-kernel, linux-mm, mhocko, js1304, vbabka, mgorman, minchan, akpm, bsingharora

On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote:
> Jerome Glisse <j.glisse@gmail.com> writes:
> > On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote:
> >> Jerome Glisse <j.glisse@gmail.com> writes:
> >> > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
> >
> > [...]
> >
> >> > You can take a look at hmm-v13 if you want to see how I do non-LRU
> >> > page migration. While I put most of the migration code inside
> >> > hmm_migrate.c it could easily be moved to migrate.c without the hmm_
> >> > prefix.
> >> >
> >> > There are two missing pieces in the existing migrate code. First is
> >> > to put memory allocation for the destination under the control of
> >> > whoever calls the migrate code. Second is to allow offloading the
> >> > copy operation to the device (i.e. not use the CPU to copy data).
> >> >
> >> > I believe the same requirements also make sense for the platform you
> >> > are targeting. Thus the same code can be used.
> >> >
> >> > hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13
> >> >
> >> > I haven't posted this patchset yet because we are doing some
> >> > modifications to the device driver API to accommodate some new
> >> > features. But the ZONE_DEVICE changes and the overall migration code
> >> > will stay the same more or less (I have patches that move it to
> >> > migrate.c and share more code with the existing migrate code).
> >> >
> >> > If you think I missed anything about the LRU and page cache please
> >> > point it out to me. Because when I audited the code for that I didn't
> >> > see any road block with the few filesystems I was looking at (ext4,
> >> > xfs and the core page cache code).
> >>
> >> The other restriction around ZONE_DEVICE is that it is not a managed
> >> zone. That prevents any direct allocation from the coherent device by
> >> an application; i.e., we would like to force allocation from the
> >> coherent device using an interface like mbind(MPOL_BIND...). Is that
> >> possible with ZONE_DEVICE?
> >
> > To achieve this we rely on the device fault code path: when the device
> > takes a page fault, with the help of HMM it will use existing memory, if
> > any, for the fault address; but if the CPU page table is empty (and it
> > is not a file-backed VMA, because of read-back) then the device can
> > directly allocate device memory and HMM will update the CPU page table
> > to point to the newly allocated device memory.
>
> That is OK if the device touches the page first. What if we want an
> allocation touched first by the CPU to come from GPU memory? Should we
> always depend on the GPU driver to migrate such pages later from system
> RAM to GPU memory?

I am not sure what kind of workload would rather have every first CPU
access for a range use device memory. So no, my code does not handle that,
and it is pointless for me as the CPU cannot access device memory in my
case.

That said, nothing forbids adding ZONE_DEVICE support to an mbind()-like
syscall. Though my personal preference would still be to avoid use of such
a generic syscall and have the device driver set the allocation policy
through its own userspace API (the device driver could reuse the internals
of mbind() to achieve the end result).

I am not saying that everything you want to do is doable now with HMM, but
nothing precludes achieving what you want to achieve using ZONE_DEVICE. I
really don't think any of the existing mm mechanisms (kswapd, LRU, NUMA,
...) are a nice fit or can be reused with device memory. Each device is so
different from the others that I don't believe in a one-API-fits-all. The
drm GPU subsystem of the kernel is a testimony to how little can be shared
when it comes to GPUs. The only common code is modesetting. Everything that
deals with how to use the GPU to compute stuff is per device, and most of
the logic is in userspace.

So I do not see any commonality that could be abstracted at the syscall
level. I would rather let the device driver stack (kernel and userspace)
take such decisions and have the higher level APIs (OpenCL, CUDA, C++17,
...) expose something that makes sense for each of them. Programmers target
those high level APIs and they intend to use the mechanisms each offers to
manage memory and memory placement. I would say forcing them to use a
second Linux-specific API to achieve the latter is wrong, at least for now.

So in the end, if the mbind() syscall is done by the userspace side of the
device driver, then why not just have the device driver communicate this
through its own kernel API (which can be much more expressive than what a
standardized syscall offers)?

I would rather avoid making changes to any syscall for now. If later, down
the road, once the userspace ecosystem stabilizes, we see that there is a
good level at which we can abstract memory policy for enough devices, then
and only then would it make sense to either introduce a new syscall or
grow/modify an existing one. Right now I fear we could only make bad
decisions that we would regret down the road.

I think we can achieve memory device support with the minimum amount of
changes to mm code and existing mm mechanisms. Using ZONE_DEVICE already
makes sure that such memory is kept out of most mm mechanisms and hence
avoids all the changes you had to make for the CDM node. It just looks a
better fit from my point of view, and I think it is worth considering for
your use case too. I am sure folks writing the device driver would rather
share more code between platforms with grown-up bus systems (CAPI, CCIX,
...) and platforms with kid bus systems (PCIe; let's forget about PCI and
ISA :)).

Cheers,
Jérôme
* Re: [RFC 0/8] Define coherent device memory node 2016-10-25 18:52 ` Jerome Glisse @ 2016-10-26 11:13 ` Anshuman Khandual -1 siblings, 0 replies; 135+ messages in thread From: Anshuman Khandual @ 2016-10-26 11:13 UTC (permalink / raw) To: Jerome Glisse, Aneesh Kumar K.V Cc: linux-kernel, linux-mm, mhocko, js1304, vbabka, mgorman, minchan, akpm, bsingharora On 10/26/2016 12:22 AM, Jerome Glisse wrote: > On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote: >> Jerome Glisse <j.glisse@gmail.com> writes: >> >>> On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote: >>>> Jerome Glisse <j.glisse@gmail.com> writes: >>>>> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote: >>> >>> [...] >>> >>>>> You can take a look at hmm-v13 if you want to see how i do non LRU page >>>>> migration. While i put most of the migration code inside hmm_migrate.c it >>>>> could easily be move to migrate.c without hmm_ prefix. >>>>> >>>>> There is 2 missing piece with existing migrate code. First is to put memory >>>>> allocation for destination under control of who call the migrate code. Second >>>>> is to allow offloading the copy operation to device (ie not use the CPU to >>>>> copy data). >>>>> >>>>> I believe same requirement also make sense for platform you are targeting. >>>>> Thus same code can be use. >>>>> >>>>> hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13 >>>>> >>>>> I haven't posted this patchset yet because we are doing some modifications >>>>> to the device driver API to accomodate some new features. But the ZONE_DEVICE >>>>> changes and the overall migration code will stay the same more or less (i have >>>>> patches that move it to migrate.c and share more code with existing migrate >>>>> code). >>>>> >>>>> If you think i missed anything about lru and page cache please point it to >>>>> me. 
Because when i audited code for that i didn't see any road block with >>>>> the few fs i was looking at (ext4, xfs and core page cache code). >>>>> >>>> >>>> The other restriction around ZONE_DEVICE is, it is not a managed zone. >>>> That prevents any direct allocation from coherent device by application. >>>> ie, we would like to force allocation from coherent device using >>>> interface like mbind(MPOL_BIND..) . Is that possible with ZONE_DEVICE ? >>> >>> To achieve this we rely on device fault code path ie when device take a page fault >>> with help of HMM it will use existing memory if any for fault address but if CPU >>> page table is empty (and it is not file back vma because of readback) then device >>> can directly allocate device memory and HMM will update CPU page table to point to >>> newly allocated device memory. >>> >> >> That is ok if the device touch the page first. What if we want the >> allocation touched first by cpu to come from GPU ?. Should we always >> depend on GPU driver to migrate such pages later from system RAM to GPU >> memory ? >> > > I am not sure what kind of workload would rather have every first CPU access for > a range to use device memory. So no my code does not handle that and it is pointless > for it as CPU can not access device memory for me. > > That said nothing forbid to add support for ZONE_DEVICE with mbind() like syscall. > Thought my personnal preference would still be to avoid use of such generic syscall > but have device driver set allocation policy through its own userspace API (device > driver could reuse internal of mbind() to achieve the end result). > > I am not saying that eveything you want to do is doable now with HMM but, nothing > preclude achieving what you want to achieve using ZONE_DEVICE. I really don't think > any of the existing mm mechanism (kswapd, lru, numa, ...) are nice fit and can be reuse > with device memory. 
> > Each device is so different from the other that i don't believe in a one API fit all. > The drm GPU subsystem of the kernel is a testimony of how little can be share when it > comes to GPU. The only common code is modesetting. Everything that deals with how to > use GPU to compute stuff is per device and most of the logic is in userspace. So i do > not see any commonality that could be abstracted at syscall level. I would rather let > device driver stack (kernel and userspace) take such decision and have the higher level > API (OpenCL, Cuda, C++17, ...) expose something that make sense for each of them. > Programmer target those high level API and they intend to use the mechanism each offer > to manage memory and memory placement. I would say forcing them to use a second linux > specific API to achieve the latter is wrong, at lest for now. > > So in the end if the mbind() syscall is done by the userspace side of the device driver > then why not just having the device driver communicate this through its own kernel > API (which can be much more expressive than what standardize syscall offers). I would > rather avoid making change to any syscall for now. > > If latter, down the road, once the userspace ecosystem stabilize, we see that there > is a good level at which we can abstract memory policy for enough devices then and > only then it would make sense to either introduce new syscall or grow/modify existing > one. Right now i fear we could only make bad decision that we would regret down the > road. > > I think we can achieve memory device support with the minimum amount of changes to mm > code and existing mm mechanism. Using ZONE_DEVICE already make sure that such memory > is kept out of most mm mechanism and hence avoid all the changes you had to make for > CDM node. It just looks a better fit from my point of view. I think it is worth > considering for your use case too. 
I am sure folks writting the device driver would > rather share more code between platform with grown up bus system (CAPI, CCIX, ...) > vs platform with kid bus system (PCIE let's forget about PCI and ISA :)) Because of the coherent access between the CPU and the device, the intention is to use the same buffer (VMA) interchangeably from the CPU and the device throughout the runtime of the application, depending upon which side is accessing it more and how much performance benefit the migration will provide. Now driver managed memory is non-LRU (whether we use ZONE_DEVICE or not), and we had issues migrating non-LRU pages mapped in user space. I am not sure whether Minchan has changed the basic non-LRU migration enablement code to support mapped non-LRU pages well. So in that case, how are we going to migrate back and forth between system RAM and device memory?
* Re: [RFC 0/8] Define coherent device memory node 2016-10-26 11:13 ` Anshuman Khandual @ 2016-10-26 16:02 ` Jerome Glisse -1 siblings, 0 replies; 135+ messages in thread From: Jerome Glisse @ 2016-10-26 16:02 UTC (permalink / raw) To: Anshuman Khandual Cc: Aneesh Kumar K.V, linux-kernel, linux-mm, mhocko, js1304, vbabka, mgorman, minchan, akpm, bsingharora On Wed, Oct 26, 2016 at 04:43:10PM +0530, Anshuman Khandual wrote: > On 10/26/2016 12:22 AM, Jerome Glisse wrote: > > On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote: > >> Jerome Glisse <j.glisse@gmail.com> writes: > >> > >>> On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote: > >>>> Jerome Glisse <j.glisse@gmail.com> writes: > >>>>> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote: > >>> > >>> [...] > >>> > >>>>> You can take a look at hmm-v13 if you want to see how i do non LRU page > >>>>> migration. While i put most of the migration code inside hmm_migrate.c it > >>>>> could easily be move to migrate.c without hmm_ prefix. > >>>>> > >>>>> There is 2 missing piece with existing migrate code. First is to put memory > >>>>> allocation for destination under control of who call the migrate code. Second > >>>>> is to allow offloading the copy operation to device (ie not use the CPU to > >>>>> copy data). > >>>>> > >>>>> I believe same requirement also make sense for platform you are targeting. > >>>>> Thus same code can be use. > >>>>> > >>>>> hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13 > >>>>> > >>>>> I haven't posted this patchset yet because we are doing some modifications > >>>>> to the device driver API to accomodate some new features. But the ZONE_DEVICE > >>>>> changes and the overall migration code will stay the same more or less (i have > >>>>> patches that move it to migrate.c and share more code with existing migrate > >>>>> code). > >>>>> > >>>>> If you think i missed anything about lru and page cache please point it to > >>>>> me. 
Because when i audited code for that i didn't see any road block with > >>>>> the few fs i was looking at (ext4, xfs and core page cache code). > >>>>> > >>>> > >>>> The other restriction around ZONE_DEVICE is, it is not a managed zone. > >>>> That prevents any direct allocation from coherent device by application. > >>>> ie, we would like to force allocation from coherent device using > >>>> interface like mbind(MPOL_BIND..) . Is that possible with ZONE_DEVICE ? > >>> > >>> To achieve this we rely on device fault code path ie when device take a page fault > >>> with help of HMM it will use existing memory if any for fault address but if CPU > >>> page table is empty (and it is not file back vma because of readback) then device > >>> can directly allocate device memory and HMM will update CPU page table to point to > >>> newly allocated device memory. > >>> > >> > >> That is ok if the device touch the page first. What if we want the > >> allocation touched first by cpu to come from GPU ?. Should we always > >> depend on GPU driver to migrate such pages later from system RAM to GPU > >> memory ? > >> > > > > I am not sure what kind of workload would rather have every first CPU access for > > a range to use device memory. So no my code does not handle that and it is pointless > > for it as CPU can not access device memory for me. > > > > That said nothing forbid to add support for ZONE_DEVICE with mbind() like syscall. > > Thought my personnal preference would still be to avoid use of such generic syscall > > but have device driver set allocation policy through its own userspace API (device > > driver could reuse internal of mbind() to achieve the end result). > > > > I am not saying that eveything you want to do is doable now with HMM but, nothing > > preclude achieving what you want to achieve using ZONE_DEVICE. I really don't think > > any of the existing mm mechanism (kswapd, lru, numa, ...) are nice fit and can be reuse > > with device memory. 
> > > > Each device is so different from the other that i don't believe in a one API fit all. > > The drm GPU subsystem of the kernel is a testimony of how little can be share when it > > comes to GPU. The only common code is modesetting. Everything that deals with how to > > use GPU to compute stuff is per device and most of the logic is in userspace. So i do > > not see any commonality that could be abstracted at syscall level. I would rather let > > device driver stack (kernel and userspace) take such decision and have the higher level > > API (OpenCL, Cuda, C++17, ...) expose something that make sense for each of them. > > Programmer target those high level API and they intend to use the mechanism each offer > > to manage memory and memory placement. I would say forcing them to use a second linux > > specific API to achieve the latter is wrong, at lest for now. > > > > So in the end if the mbind() syscall is done by the userspace side of the device driver > > then why not just having the device driver communicate this through its own kernel > > API (which can be much more expressive than what standardize syscall offers). I would > > rather avoid making change to any syscall for now. > > > > If latter, down the road, once the userspace ecosystem stabilize, we see that there > > is a good level at which we can abstract memory policy for enough devices then and > > only then it would make sense to either introduce new syscall or grow/modify existing > > one. Right now i fear we could only make bad decision that we would regret down the > > road. > > > > I think we can achieve memory device support with the minimum amount of changes to mm > > code and existing mm mechanism. Using ZONE_DEVICE already make sure that such memory > > is kept out of most mm mechanism and hence avoid all the changes you had to make for > > CDM node. It just looks a better fit from my point of view. I think it is worth > > considering for your use case too. 
I am sure folks writting the device driver would > > rather share more code between platform with grown up bus system (CAPI, CCIX, ...) > > vs platform with kid bus system (PCIE let's forget about PCI and ISA :)) > > Because of coherent access between the CPU and the device, the intention is to use > the same buffer (VMA) accessed between CPU and device interchangeably through out > the run time of the application depending upon which side is accessing more and > how much of performance benefit it will provide after the migration. Now driver > managed memory is non LRU (whether we use ZONE_DEVICE or not) and we had issues > migrating non LRU pages mapped in user space. I am not sure whether Minchan had > changed the basic non LRU migration enablement code to support mapped non LRU > pages well. So in that case how we are going to migrate back and forth between > system RAM and device memory ? In my patchset there is no policy; it is all under device driver control, which decides what range of memory is migrated and when. I think only the device driver has the proper knowledge to make such decisions, by coalescing data from GPU counters and requests from the application made through the upper-level programming API like CUDA. Note that even on PCIe the GPU can access system memory coherently; it is the reverse that is not doable (and there are limitations on the kinds of atomic ops the device can perform on system memory). So the hmm_mirror also allows that. Cheers, Jérôme
* Re: [RFC 0/8] Define coherent device memory node 2016-10-26 16:02 ` Jerome Glisse @ 2016-10-27 4:38 ` Anshuman Khandual -1 siblings, 0 replies; 135+ messages in thread From: Anshuman Khandual @ 2016-10-27 4:38 UTC (permalink / raw) To: Jerome Glisse Cc: Aneesh Kumar K.V, linux-kernel, linux-mm, mhocko, js1304, vbabka, mgorman, minchan, akpm, bsingharora On 10/26/2016 09:32 PM, Jerome Glisse wrote: > On Wed, Oct 26, 2016 at 04:43:10PM +0530, Anshuman Khandual wrote: >> On 10/26/2016 12:22 AM, Jerome Glisse wrote: >>> On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote: >>>> Jerome Glisse <j.glisse@gmail.com> writes: >>>> >>>>> On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote: >>>>>> Jerome Glisse <j.glisse@gmail.com> writes: >>>>>>> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote: >>>>> >>>>> [...] >>>>> >>>>>>> You can take a look at hmm-v13 if you want to see how i do non LRU page >>>>>>> migration. While i put most of the migration code inside hmm_migrate.c it >>>>>>> could easily be move to migrate.c without hmm_ prefix. >>>>>>> >>>>>>> There is 2 missing piece with existing migrate code. First is to put memory >>>>>>> allocation for destination under control of who call the migrate code. Second >>>>>>> is to allow offloading the copy operation to device (ie not use the CPU to >>>>>>> copy data). >>>>>>> >>>>>>> I believe same requirement also make sense for platform you are targeting. >>>>>>> Thus same code can be use. >>>>>>> >>>>>>> hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13 >>>>>>> >>>>>>> I haven't posted this patchset yet because we are doing some modifications >>>>>>> to the device driver API to accomodate some new features. But the ZONE_DEVICE >>>>>>> changes and the overall migration code will stay the same more or less (i have >>>>>>> patches that move it to migrate.c and share more code with existing migrate >>>>>>> code). 
>>>>>>> >>>>>>> If you think i missed anything about lru and page cache please point it to >>>>>>> me. Because when i audited code for that i didn't see any road block with >>>>>>> the few fs i was looking at (ext4, xfs and core page cache code). >>>>>>> >>>>>> >>>>>> The other restriction around ZONE_DEVICE is, it is not a managed zone. >>>>>> That prevents any direct allocation from coherent device by application. >>>>>> ie, we would like to force allocation from coherent device using >>>>>> interface like mbind(MPOL_BIND..) . Is that possible with ZONE_DEVICE ? >>>>> >>>>> To achieve this we rely on device fault code path ie when device take a page fault >>>>> with help of HMM it will use existing memory if any for fault address but if CPU >>>>> page table is empty (and it is not file back vma because of readback) then device >>>>> can directly allocate device memory and HMM will update CPU page table to point to >>>>> newly allocated device memory. >>>>> >>>> >>>> That is ok if the device touch the page first. What if we want the >>>> allocation touched first by cpu to come from GPU ?. Should we always >>>> depend on GPU driver to migrate such pages later from system RAM to GPU >>>> memory ? >>>> >>> >>> I am not sure what kind of workload would rather have every first CPU access for >>> a range to use device memory. So no my code does not handle that and it is pointless >>> for it as CPU can not access device memory for me. >>> >>> That said nothing forbid to add support for ZONE_DEVICE with mbind() like syscall. >>> Thought my personnal preference would still be to avoid use of such generic syscall >>> but have device driver set allocation policy through its own userspace API (device >>> driver could reuse internal of mbind() to achieve the end result). >>> >>> I am not saying that eveything you want to do is doable now with HMM but, nothing >>> preclude achieving what you want to achieve using ZONE_DEVICE. 
I really don't think >>> any of the existing mm mechanism (kswapd, lru, numa, ...) are nice fit and can be reuse >>> with device memory. >>> >>> Each device is so different from the other that i don't believe in a one API fit all. >>> The drm GPU subsystem of the kernel is a testimony of how little can be share when it >>> comes to GPU. The only common code is modesetting. Everything that deals with how to >>> use GPU to compute stuff is per device and most of the logic is in userspace. So i do >>> not see any commonality that could be abstracted at syscall level. I would rather let >>> device driver stack (kernel and userspace) take such decision and have the higher level >>> API (OpenCL, Cuda, C++17, ...) expose something that make sense for each of them. >>> Programmer target those high level API and they intend to use the mechanism each offer >>> to manage memory and memory placement. I would say forcing them to use a second linux >>> specific API to achieve the latter is wrong, at lest for now. >>> >>> So in the end if the mbind() syscall is done by the userspace side of the device driver >>> then why not just having the device driver communicate this through its own kernel >>> API (which can be much more expressive than what standardize syscall offers). I would >>> rather avoid making change to any syscall for now. >>> >>> If latter, down the road, once the userspace ecosystem stabilize, we see that there >>> is a good level at which we can abstract memory policy for enough devices then and >>> only then it would make sense to either introduce new syscall or grow/modify existing >>> one. Right now i fear we could only make bad decision that we would regret down the >>> road. >>> >>> I think we can achieve memory device support with the minimum amount of changes to mm >>> code and existing mm mechanism. Using ZONE_DEVICE already make sure that such memory >>> is kept out of most mm mechanism and hence avoid all the changes you had to make for >>> CDM node. 
It just looks a better fit from my point of view. I think it is worth >>> considering for your use case too. I am sure folks writting the device driver would >>> rather share more code between platform with grown up bus system (CAPI, CCIX, ...) >>> vs platform with kid bus system (PCIE let's forget about PCI and ISA :)) >> >> Because of coherent access between the CPU and the device, the intention is to use >> the same buffer (VMA) accessed between CPU and device interchangeably through out >> the run time of the application depending upon which side is accessing more and >> how much of performance benefit it will provide after the migration. Now driver >> managed memory is non LRU (whether we use ZONE_DEVICE or not) and we had issues >> migrating non LRU pages mapped in user space. I am not sure whether Minchan had >> changed the basic non LRU migration enablement code to support mapped non LRU >> pages well. So in that case how we are going to migrate back and forth between >> system RAM and device memory ? > > In my patchset there is no policy, it is all under device driver control which > decide what range of memory is migrated and when. I think only device driver as > proper knowledge to make such decision. By coalescing data from GPU counters and > request from application made through the uppler level programming API like > Cuda. > Right, I understand that. But what I pointed out here is that there are problems now migrating user mapped pages back and forth between LRU system RAM memory and non LRU device memory which are yet to be solved. Because you are proposing a non LRU based design with ZONE_DEVICE, how are we solving/working around these problems for bi-directional migration? > Note that even on PCIE the GPU can access the system memory coherently, it is the > reverse that is not doable (and there are limitation on the kind of atomic op the > device can do on system memory). So the hmm_mirror also allow that. Okay. 
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [RFC 0/8] Define coherent device memory node 2016-10-27 4:38 ` Anshuman Khandual @ 2016-10-27 7:03 ` Anshuman Khandual -1 siblings, 0 replies; 135+ messages in thread From: Anshuman Khandual @ 2016-10-27 7:03 UTC (permalink / raw) To: Anshuman Khandual, Jerome Glisse Cc: Aneesh Kumar K.V, linux-kernel, linux-mm, mhocko, js1304, vbabka, mgorman, minchan, akpm, bsingharora On 10/27/2016 10:08 AM, Anshuman Khandual wrote: > On 10/26/2016 09:32 PM, Jerome Glisse wrote: >> On Wed, Oct 26, 2016 at 04:43:10PM +0530, Anshuman Khandual wrote: >>> On 10/26/2016 12:22 AM, Jerome Glisse wrote: >>>> On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote: >>>>> Jerome Glisse <j.glisse@gmail.com> writes: >>>>> >>>>>> On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote: >>>>>>> Jerome Glisse <j.glisse@gmail.com> writes: >>>>>>>> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote: >>>>>> >>>>>> [...] >>>>>> >>>>>>>> You can take a look at hmm-v13 if you want to see how i do non LRU page >>>>>>>> migration. While i put most of the migration code inside hmm_migrate.c it >>>>>>>> could easily be move to migrate.c without hmm_ prefix. >>>>>>>> >>>>>>>> There is 2 missing piece with existing migrate code. First is to put memory >>>>>>>> allocation for destination under control of who call the migrate code. Second >>>>>>>> is to allow offloading the copy operation to device (ie not use the CPU to >>>>>>>> copy data). >>>>>>>> >>>>>>>> I believe same requirement also make sense for platform you are targeting. >>>>>>>> Thus same code can be use. >>>>>>>> >>>>>>>> hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13 >>>>>>>> >>>>>>>> I haven't posted this patchset yet because we are doing some modifications >>>>>>>> to the device driver API to accomodate some new features. 
But the ZONE_DEVICE >>>>>>>> changes and the overall migration code will stay the same more or less (i have >>>>>>>> patches that move it to migrate.c and share more code with existing migrate >>>>>>>> code). >>>>>>>> >>>>>>>> If you think i missed anything about lru and page cache please point it to >>>>>>>> me. Because when i audited code for that i didn't see any road block with >>>>>>>> the few fs i was looking at (ext4, xfs and core page cache code). >>>>>>>> >>>>>>> >>>>>>> The other restriction around ZONE_DEVICE is, it is not a managed zone. >>>>>>> That prevents any direct allocation from coherent device by application. >>>>>>> ie, we would like to force allocation from coherent device using >>>>>>> interface like mbind(MPOL_BIND..) . Is that possible with ZONE_DEVICE ? >>>>>> >>>>>> To achieve this we rely on device fault code path ie when device take a page fault >>>>>> with help of HMM it will use existing memory if any for fault address but if CPU >>>>>> page table is empty (and it is not file back vma because of readback) then device >>>>>> can directly allocate device memory and HMM will update CPU page table to point to >>>>>> newly allocated device memory. >>>>>> >>>>> >>>>> That is ok if the device touch the page first. What if we want the >>>>> allocation touched first by cpu to come from GPU ?. Should we always >>>>> depend on GPU driver to migrate such pages later from system RAM to GPU >>>>> memory ? >>>>> >>>> >>>> I am not sure what kind of workload would rather have every first CPU access for >>>> a range to use device memory. So no my code does not handle that and it is pointless >>>> for it as CPU can not access device memory for me. >>>> >>>> That said nothing forbid to add support for ZONE_DEVICE with mbind() like syscall. 
>>>> Thought my personnal preference would still be to avoid use of such generic syscall >>>> but have device driver set allocation policy through its own userspace API (device >>>> driver could reuse internal of mbind() to achieve the end result). >>>> >>>> I am not saying that eveything you want to do is doable now with HMM but, nothing >>>> preclude achieving what you want to achieve using ZONE_DEVICE. I really don't think >>>> any of the existing mm mechanism (kswapd, lru, numa, ...) are nice fit and can be reuse >>>> with device memory. >>>> >>>> Each device is so different from the other that i don't believe in a one API fit all. >>>> The drm GPU subsystem of the kernel is a testimony of how little can be share when it >>>> comes to GPU. The only common code is modesetting. Everything that deals with how to >>>> use GPU to compute stuff is per device and most of the logic is in userspace. So i do >>>> not see any commonality that could be abstracted at syscall level. I would rather let >>>> device driver stack (kernel and userspace) take such decision and have the higher level >>>> API (OpenCL, Cuda, C++17, ...) expose something that make sense for each of them. >>>> Programmer target those high level API and they intend to use the mechanism each offer >>>> to manage memory and memory placement. I would say forcing them to use a second linux >>>> specific API to achieve the latter is wrong, at lest for now. >>>> >>>> So in the end if the mbind() syscall is done by the userspace side of the device driver >>>> then why not just having the device driver communicate this through its own kernel >>>> API (which can be much more expressive than what standardize syscall offers). I would >>>> rather avoid making change to any syscall for now. 
>>>> >>>> If latter, down the road, once the userspace ecosystem stabilize, we see that there >>>> is a good level at which we can abstract memory policy for enough devices then and >>>> only then it would make sense to either introduce new syscall or grow/modify existing >>>> one. Right now i fear we could only make bad decision that we would regret down the >>>> road. >>>> >>>> I think we can achieve memory device support with the minimum amount of changes to mm >>>> code and existing mm mechanism. Using ZONE_DEVICE already make sure that such memory >>>> is kept out of most mm mechanism and hence avoid all the changes you had to make for >>>> CDM node. It just looks a better fit from my point of view. I think it is worth >>>> considering for your use case too. I am sure folks writting the device driver would >>>> rather share more code between platform with grown up bus system (CAPI, CCIX, ...) >>>> vs platform with kid bus system (PCIE let's forget about PCI and ISA :)) >>> >>> Because of coherent access between the CPU and the device, the intention is to use >>> the same buffer (VMA) accessed between CPU and device interchangeably through out >>> the run time of the application depending upon which side is accessing more and >>> how much of performance benefit it will provide after the migration. Now driver >>> managed memory is non LRU (whether we use ZONE_DEVICE or not) and we had issues >>> migrating non LRU pages mapped in user space. I am not sure whether Minchan had >>> changed the basic non LRU migration enablement code to support mapped non LRU >>> pages well. So in that case how we are going to migrate back and forth between >>> system RAM and device memory ? >> >> In my patchset there is no policy, it is all under device driver control which >> decide what range of memory is migrated and when. I think only device driver as >> proper knowledge to make such decision. 
By coalescing data from GPU counters and >> request from application made through the uppler level programming API like >> Cuda. >> > > Right, I understand that. But what I pointed out here is that there are problems > now migrating user mapped pages back and forth between LRU system RAM memory and > non LRU device memory which is yet to be solved. Because you are proposing a non > LRU based design with ZONE_DEVICE, how we are solving/working around these > problems for bi-directional migration ? Let me elaborate on this a bit more. Before the non LRU migration support patch series from Minchan, it was not possible to migrate non LRU pages (which are generally driver managed) through the migrate_pages interface. This was affecting the ability to do compaction on platforms which have a large share of non LRU pages. That series solved the migration problem and allowed compaction, but it still did not solve the migration problem for non LRU *user mapped* pages. So if non LRU pages are mapped into a process's page table and being accessed from user space, they cannot be moved using the migrate_pages interface. Minchan had a draft solution for that problem which is still hosted here; on his suggestion I tried it but still faced some other problems during mapped page migration. (NOTE: IIRC this was not posted in the community)

git://git.kernel.org/pub/scm/linux/kernel/git/minchan/linux.git with the following branch (non-lru-mapped-v1r2-v4.7-rc4-mmotm-2016-06-24-15-53)

As I had mentioned earlier, we intend to support all possible migrations between system RAM (LRU) and device memory (non LRU) for user space mapped pages:

(1) System RAM (Anon mapping) --> Device memory, back and forth many times
(2) System RAM (File mapping) --> Device memory, back and forth many times

This is not happening now with non LRU pages. Here are some of the reasons, but first some notes.
* Driver initiates all the migrations
* Driver does the isolation of pages
* Driver puts the isolated pages in a linked list
* Driver passes the linked list to the migrate_pages interface for migration
* IIRC isolation of non LRU pages happens through the page->as->aops->isolate_page call
* If migration fails, call page->as->aops->putback_page to give the page back to the device driver

1. queue_pages_range() currently does not work with non LRU pages; it needs to be fixed.

2. After a successful migration from non LRU device memory to LRU system RAM, the non LRU pages will be freed back. Right now migrate_pages releases these pages to buddy, but in this situation we need the pages to be given back to the driver instead. Hence migrate_pages needs to be changed to accommodate this.

3. After LRU system RAM to non LRU device migration for a mapped page, will the new page (which came from device memory) be part of the core MM LRU for either Anon or File mapping?

4. After LRU (Anon mapped) system RAM to non LRU device migration for a mapped page, how are we going to store both the "address_space->address_space_operations" and the "Anon VMA Chain" reverse mapping information on the page->mapping element?

5. After LRU (File mapped) system RAM to non LRU device migration for a mapped page, how are we going to store both the "address_space->address_space_operations" of the device driver and the radix tree based reverse mapping information for the existing file mapping on the same page->mapping element?

6. IIRC, it was not possible to retain the non LRU identity (page->as->aops, which will be defined inside the device driver) and the reverse mapping information (either anon or file mapping) together after the first round of migration. This non LRU identity needs to be retained continuously if we ever need to return this page to the device driver after successful migration to system RAM, or for isolation/putback purposes or something else.
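For reference, the driver-side hooks referred to in the notes above look roughly as follows. This is a sketch against the ~v4.8 non LRU movable page API from Minchan's series; the function bodies are illustrative placeholders, not working driver code, and will not build outside a kernel tree.

```c
/* Sketch of the address_space_operations a driver exposes so that
 * migrate_pages() can handle its non LRU pages.  Bodies are placeholders. */
static bool dev_isolate_page(struct page *page, isolate_mode_t mode)
{
	/* Take the page off the driver's internal lists so migration
	 * can safely operate on it; return false if it cannot be isolated. */
	return true;
}

static int dev_migratepage(struct address_space *mapping,
			   struct page *newpage, struct page *page,
			   enum migrate_mode mode)
{
	/* Copy contents (possibly via device DMA) and transfer page state. */
	return MIGRATEPAGE_SUCCESS;
}

static void dev_putback_page(struct page *page)
{
	/* Migration failed: hand the page back to the driver rather than
	 * to the buddy allocator (issue 2 in the list above). */
}

static const struct address_space_operations dev_aops = {
	.isolate_page = dev_isolate_page,
	.migratepage  = dev_migratepage,
	.putback_page = dev_putback_page,
};
```

A driver marks each such page movable with __SetPageMovable() so core mm routes it through these callbacks; issues 4-6 above stem from page->mapping being overloaded to carry both this aops pointer and the reverse mapping information.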
All the reasons explained above were preventing a continuous ping-pong scheme of migration between system RAM LRU buddy pages and device memory non LRU pages, which is one of the primary requirements for exploiting coherent device memory. Do you think we can solve these problems with the ZONE_DEVICE and HMM framework? ^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [RFC 0/8] Define coherent device memory node @ 2016-10-27 7:03 ` Anshuman Khandual 0 siblings, 0 replies; 135+ messages in thread From: Anshuman Khandual @ 2016-10-27 7:03 UTC (permalink / raw) To: Anshuman Khandual, Jerome Glisse Cc: Aneesh Kumar K.V, linux-kernel, linux-mm, mhocko, js1304, vbabka, mgorman, minchan, akpm, bsingharora On 10/27/2016 10:08 AM, Anshuman Khandual wrote: > On 10/26/2016 09:32 PM, Jerome Glisse wrote: >> On Wed, Oct 26, 2016 at 04:43:10PM +0530, Anshuman Khandual wrote: >>> On 10/26/2016 12:22 AM, Jerome Glisse wrote: >>>> On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote: >>>>> Jerome Glisse <j.glisse@gmail.com> writes: >>>>> >>>>>> On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote: >>>>>>> Jerome Glisse <j.glisse@gmail.com> writes: >>>>>>>> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote: >>>>>> >>>>>> [...] >>>>>> >>>>>>>> You can take a look at hmm-v13 if you want to see how i do non LRU page >>>>>>>> migration. While i put most of the migration code inside hmm_migrate.c it >>>>>>>> could easily be move to migrate.c without hmm_ prefix. >>>>>>>> >>>>>>>> There is 2 missing piece with existing migrate code. First is to put memory >>>>>>>> allocation for destination under control of who call the migrate code. Second >>>>>>>> is to allow offloading the copy operation to device (ie not use the CPU to >>>>>>>> copy data). >>>>>>>> >>>>>>>> I believe same requirement also make sense for platform you are targeting. >>>>>>>> Thus same code can be use. >>>>>>>> >>>>>>>> hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13 >>>>>>>> >>>>>>>> I haven't posted this patchset yet because we are doing some modifications >>>>>>>> to the device driver API to accomodate some new features. 
But the ZONE_DEVICE >>>>>>>> changes and the overall migration code will stay the same more or less (i have >>>>>>>> patches that move it to migrate.c and share more code with existing migrate >>>>>>>> code). >>>>>>>> >>>>>>>> If you think i missed anything about lru and page cache please point it to >>>>>>>> me. Because when i audited code for that i didn't see any road block with >>>>>>>> the few fs i was looking at (ext4, xfs and core page cache code). >>>>>>>> >>>>>>> >>>>>>> The other restriction around ZONE_DEVICE is, it is not a managed zone. >>>>>>> That prevents any direct allocation from coherent device by application. >>>>>>> ie, we would like to force allocation from coherent device using >>>>>>> interface like mbind(MPOL_BIND..) . Is that possible with ZONE_DEVICE ? >>>>>> >>>>>> To achieve this we rely on device fault code path ie when device take a page fault >>>>>> with help of HMM it will use existing memory if any for fault address but if CPU >>>>>> page table is empty (and it is not file back vma because of readback) then device >>>>>> can directly allocate device memory and HMM will update CPU page table to point to >>>>>> newly allocated device memory. >>>>>> >>>>> >>>>> That is ok if the device touch the page first. What if we want the >>>>> allocation touched first by cpu to come from GPU ?. Should we always >>>>> depend on GPU driver to migrate such pages later from system RAM to GPU >>>>> memory ? >>>>> >>>> >>>> I am not sure what kind of workload would rather have every first CPU access for >>>> a range to use device memory. So no my code does not handle that and it is pointless >>>> for it as CPU can not access device memory for me. >>>> >>>> That said nothing forbid to add support for ZONE_DEVICE with mbind() like syscall. 
>>>> Thought my personnal preference would still be to avoid use of such generic syscall >>>> but have device driver set allocation policy through its own userspace API (device >>>> driver could reuse internal of mbind() to achieve the end result). >>>> >>>> I am not saying that eveything you want to do is doable now with HMM but, nothing >>>> preclude achieving what you want to achieve using ZONE_DEVICE. I really don't think >>>> any of the existing mm mechanism (kswapd, lru, numa, ...) are nice fit and can be reuse >>>> with device memory. >>>> >>>> Each device is so different from the other that i don't believe in a one API fit all. >>>> The drm GPU subsystem of the kernel is a testimony of how little can be share when it >>>> comes to GPU. The only common code is modesetting. Everything that deals with how to >>>> use GPU to compute stuff is per device and most of the logic is in userspace. So i do >>>> not see any commonality that could be abstracted at syscall level. I would rather let >>>> device driver stack (kernel and userspace) take such decision and have the higher level >>>> API (OpenCL, Cuda, C++17, ...) expose something that make sense for each of them. >>>> Programmer target those high level API and they intend to use the mechanism each offer >>>> to manage memory and memory placement. I would say forcing them to use a second linux >>>> specific API to achieve the latter is wrong, at lest for now. >>>> >>>> So in the end if the mbind() syscall is done by the userspace side of the device driver >>>> then why not just having the device driver communicate this through its own kernel >>>> API (which can be much more expressive than what standardize syscall offers). I would >>>> rather avoid making change to any syscall for now. 
>>>> >>>> If later, down the road, once the userspace ecosystem stabilizes and we see that there >>>> is a good level at which we can abstract memory policy for enough devices, then and >>>> only then would it make sense to either introduce a new syscall or grow/modify an existing >>>> one. Right now I fear we could only make bad decisions that we would regret down the >>>> road. >>>> >>>> I think we can achieve device memory support with the minimum amount of changes to mm >>>> code and existing mm mechanisms. Using ZONE_DEVICE already makes sure that such memory >>>> is kept out of most mm mechanisms and hence avoids all the changes you had to make for the >>>> CDM node. It just looks a better fit from my point of view. I think it is worth >>>> considering for your use case too. I am sure folks writing the device drivers would >>>> rather share more code between platforms with grown-up bus systems (CAPI, CCIX, ...) >>>> vs platforms with kid bus systems (PCIE; let's forget about PCI and ISA :)) >>> >>> Because of coherent access between the CPU and the device, the intention is to use >>> the same buffer (VMA) accessed by the CPU and the device interchangeably throughout >>> the run time of the application, depending upon which side is accessing more and >>> how much performance benefit it will provide after the migration. Now driver >>> managed memory is non LRU (whether we use ZONE_DEVICE or not) and we had issues >>> migrating non LRU pages mapped in user space. I am not sure whether Minchan had >>> changed the basic non LRU migration enablement code to support mapped non LRU >>> pages well. So in that case how are we going to migrate back and forth between >>> system RAM and device memory ? >> >> In my patchset there is no policy; it is all under device driver control, which >> decides what range of memory is migrated and when. I think only the device driver has >> proper knowledge to make such decisions. 
By coalescing data from GPU counters and >> requests from the application made through the upper level programming API like >> Cuda. >> > > Right, I understand that. But what I pointed out here is that there are problems > now migrating user mapped pages back and forth between LRU system RAM memory and > non LRU device memory which are yet to be solved. Because you are proposing a non > LRU based design with ZONE_DEVICE, how are we solving/working around these > problems for bi-directional migration ?

Let me elaborate on this a bit more. Before the non LRU migration support patch series from Minchan, it was not possible to migrate non LRU pages, which are generally driver managed, through the migrate_pages interface. This was affecting the ability to do compaction on platforms which have a large share of non LRU pages. That series actually solved the migration problem and allowed compaction. But it still did not solve the migration problem for non LRU *user mapped* pages. So if non LRU pages are mapped into a process's page table and being accessed from user space, they can not be moved using the migrate_pages interface.

Minchan had a draft solution for that problem which is still hosted here. On his suggestion I had tried this solution but still faced some other problems during mapped page migration. (NOTE: IIRC this was not posted in the community)

git://git.kernel.org/pub/scm/linux/kernel/git/minchan/linux.git with the following branch (non-lru-mapped-v1r2-v4.7-rc4-mmotm-2016-06-24-15-53)

As I had mentioned earlier, we intend to support all possible migrations between system RAM (LRU) and device memory (non LRU) for user space mapped pages.

(1) System RAM (Anon mapping) --> Device memory, back and forth many times
(2) System RAM (File mapping) --> Device memory, back and forth many times

This is not happening now with non LRU pages. Here are some of the reasons, but before that some notes. 
* Driver initiates all the migrations
* Driver does the isolation of pages
* Driver puts the isolated pages in a linked list
* Driver passes the linked list to the migrate_pages interface for migration
* IIRC isolation of non LRU pages happens through the page->as->aops->isolate_page call
* If migration fails, call page->as->aops->putback_page to give the page back to the device driver

1. queue_pages_range() currently does not work with non LRU pages; this needs to be fixed

2. After a successful migration from non LRU device memory to LRU system RAM, the non LRU page will be freed back. Right now migrate_pages releases these pages to buddy, but in this situation we need the pages to be given back to the driver instead. Hence migrate_pages needs to be changed to accommodate this.

3. After LRU system RAM to non LRU device migration for a mapped page, will the new page (which came from device memory) be part of the core MM LRU for either Anon or File mapping ?

4. After LRU (Anon mapped) system RAM to non LRU device migration for a mapped page, how are we going to store "address_space->address_space_operations" and "Anon VMA Chain" reverse mapping information both in the page->mapping element ?

5. After LRU (File mapped) system RAM to non LRU device migration for a mapped page, how are we going to store the "address_space->address_space_operations" of the device driver and the radix tree based reverse mapping information for the existing file mapping both in the same page->mapping element ?

6. IIRC, it was not possible to retain the non LRU identity (page->as->aops, which will be defined inside the device driver) and the reverse mapping information (either anon or file mapping) together after the first round of migration. This non LRU identity needs to be retained continuously if we ever need to return this page to the device driver after successful migration to system RAM, or for isolation/putback purposes or something else. 
All the reasons explained above were preventing a continuous ping-pong scheme of migration between system RAM LRU buddy pages and device memory non LRU pages, which is one of the primary requirements for exploiting coherent device memory. Do you think we can solve these problems with the ZONE_DEVICE and HMM framework ? ^ permalink raw reply [flat|nested] 135+ messages in thread
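The driver-managed flow listed in the notes above (driver isolates its pages, hands a list to migrate_pages, and takes pages back through putback_page on failure) can be modeled roughly in user-space C. This is only an illustrative sketch: the structs and callbacks below are simplified stand-ins inspired by the page->as->aops->isolate_page / putback_page shorthand used in this thread, not the kernel's real definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures discussed above. */
struct page;

struct address_space_operations {
    bool (*isolate_page)(struct page *);  /* driver pins the page for migration */
    void (*putback_page)(struct page *);  /* driver takes the page back on failure */
};

struct page {
    const struct address_space_operations *a_ops; /* page->as->aops shorthand */
    bool isolated;
    bool migrated;
};

/* Toy driver-side callbacks: non LRU pages are isolated/putback by their driver. */
static bool drv_isolate(struct page *p) { p->isolated = true; return true; }
static void drv_putback(struct page *p) { p->isolated = false; }

/*
 * Model of migrate_pages() over a driver-built list: isolate each non LRU
 * page through its aops, attempt the copy, and on failure hand the page
 * back to the driver (rather than to the buddy allocator, which is the
 * behavior problem 2 above asks to change for device pages).
 */
static int migrate_list(struct page **list, size_t n, bool copy_succeeds)
{
    int moved = 0;
    for (size_t i = 0; i < n; i++) {
        struct page *p = list[i];
        if (!p->a_ops->isolate_page(p))
            continue;                    /* driver refused isolation */
        if (copy_succeeds) {
            p->migrated = true;          /* data copied to the destination */
            moved++;
        } else {
            p->a_ops->putback_page(p);   /* failure path: return to driver */
        }
    }
    return moved;
}
```

Note the sketch leaves out exactly the hard parts the numbered questions raise: it has no page->mapping field at all, so it cannot show where the anon/file reverse mapping and the driver identity would both live.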
* Re: [RFC 0/8] Define coherent device memory node 2016-10-27 7:03 ` Anshuman Khandual @ 2016-10-27 15:05 ` Jerome Glisse -1 siblings, 0 replies; 135+ messages in thread From: Jerome Glisse @ 2016-10-27 15:05 UTC (permalink / raw) To: Anshuman Khandual Cc: Aneesh Kumar K.V, linux-kernel, linux-mm, mhocko, js1304, vbabka, mgorman, minchan, akpm, bsingharora On Thu, Oct 27, 2016 at 12:33:05PM +0530, Anshuman Khandual wrote: > On 10/27/2016 10:08 AM, Anshuman Khandual wrote: > > On 10/26/2016 09:32 PM, Jerome Glisse wrote: > >> On Wed, Oct 26, 2016 at 04:43:10PM +0530, Anshuman Khandual wrote: > >>> On 10/26/2016 12:22 AM, Jerome Glisse wrote: > >>>> On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote: > >>>>> Jerome Glisse <j.glisse@gmail.com> writes: > >>>>>> On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote: > >>>>>>> Jerome Glisse <j.glisse@gmail.com> writes: > >>>>>>>> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote: [...] > >> In my patchset there is no policy, it is all under device driver control which > >> decide what range of memory is migrated and when. I think only device driver as > >> proper knowledge to make such decision. By coalescing data from GPU counters and > >> request from application made through the uppler level programming API like > >> Cuda. > >> > > > > Right, I understand that. But what I pointed out here is that there are problems > > now migrating user mapped pages back and forth between LRU system RAM memory and > > non LRU device memory which is yet to be solved. Because you are proposing a non > > LRU based design with ZONE_DEVICE, how we are solving/working around these > > problems for bi-directional migration ? > > Let me elaborate on this bit more. Before non LRU migration support patch series > from Minchan, it was not possible to migrate non LRU pages which are generally > driver managed through migrate_pages interface. 
This was affecting the ability > to do compaction on platforms which has a large share of non LRU pages. That series > actually solved the migration problem and allowed compaction. But it still did not > solve the migration problem for non LRU *user mapped* pages. So if the non LRU pages > are mapped into a process's page table and being accessed from user space, it can > not be moved using migrate_pages interface. > > Minchan had a draft solution for that problem which is still hosted here. On his > suggestion I had tried this solution but still faced some other problems during > mapped pages migration. (NOTE: IIRC this was not posted in the community) > > git://git.kernel.org/pub/scm/linux/kernel/git/minchan/linux.git with the following > branch (non-lru-mapped-v1r2-v4.7-rc4-mmotm-2016-06-24-15-53) > > As I had mentioned earlier, we intend to support all possible migrations between > system RAM (LRU) and device memory (Non LRU) for user space mapped pages. > > (1) System RAM (Anon mapping) --> Device memory, back and forth many times > (2) System RAM (File mapping) --> Device memory, back and forth many times

I achieve these 2 objectives in HMM; I sent you the additional patches for file-backed page migration. I am not done working on them but they are small.

> This is not happening now with non LRU pages. Here are some of reasons but before > that some notes. > > * Driver initiates all the migrations > * Driver does the isolation of pages > * Driver puts the isolated pages in a linked list > * Driver passes the linked list to migrate_pages interface for migration > * IIRC isolation of non LRU pages happens through page->as->aops->isolate_page call > * If migration fails, call page->as->aops->putback_page to give the page back to the > device driver > > 1. queue_pages_range() currently does not work with non LRU pages, needs to be fixed > > 2. After a successful migration from non LRU device memory to LRU system RAM, the non > LRU will be freed back. 
Right now migrate_pages releases these pages to buddy, but > in this situation we need the pages to be given back to the driver instead. Hence > migrate_pages needs to be changed to accommodate this. > > 3. After LRU system RAM to non LRU device migration for a mapped page, does the new > page (which came from device memory) will be part of core MM LRU either for Anon > or File mapping ? > > 4. After LRU (Anon mapped) system RAM to non LRU device migration for a mapped page, > how we are going to store "address_space->address_space_operations" and "Anon VMA > Chain" reverse mapping information both on the page->mapping element ? > > 5. After LRU (File mapped) system RAM to non LRU device migration for a mapped page, > how we are going to store "address_space->address_space_operations" of the device > driver and radix tree based reverse mapping information for the existing file > mapping both on the same page->mapping element ? > > 6. IIRC, it was not possible to retain the non LRU identify (page->as->aops which will > defined inside the device driver) and the reverse mapping information (either anon > or file mapping) together after first round of migration. This non LRU identity needs > to be retained continuously if we ever need to return this page to device driver after > successful migration to system RAM or for isolation/putback purpose or something else. > > All the reasons explained above was preventing a continuous ping-pong scheme of migration > between system RAM LRU buddy pages and device memory non LRU pages which is one of the > primary requirements for exploiting coherent device memory. Do you think we can solve these > problems with ZONE_DEVICE and HMM framework ?

Well, HMM already achieves migration but the design is slightly different:

* The device driver initiates migration by calling hmm_migrate(mm, start, end, pfn_array). It must provide a pfn_array that is big enough to have one entry per page for the range (so ((end - start) >> PAGE_SHIFT) entries). 
With this array, no list of pages is needed.

* hmm_migrate() collects source pages from the process. Right now it will only migrate things that have been faulted in, ie with a valid CPU page table entry, and will ignore swap entries or any other special CPU page table entries. Those source pages are stored in the pfn array (using their pfn value with flags like write permission).

* hmm_migrate() isolates all lru pages collected in the previous step. For ZONE_DEVICE pages it does nothing. A non-lru page can be migrated only if it is a ZONE_DEVICE page; any non-lru page that is not ZONE_DEVICE is ignored.

* hmm_migrate() unmaps all the pages and checks the refcount. If a page is pinned then it restores the CPU page table, puts the page back on the lru (if it is not a ZONE_DEVICE page) and clears the associated entry inside the pfn_array.

* hmm_migrate() uses the device driver callback alloc_and_copy(); this callback will allocate a destination device page and copy from the source page. It uses the pfn array to know which pages can be migrated in the range (there is a flag). The callback must also update the pfn_array and replace any entry that was successfully allocated and copied with the pfn of the device page (and flag).

* hmm_migrate() does the final struct page meta-data migration, which might fail in the case of a file-backed page (buffer head migration fails or radix tree fails ...).

* hmm_migrate() updates the CPU page table, ie removes the migration special entry to point to the new page if migration was successful or restores it to the old page otherwise. It also unlocks the pages and calls put_page() on them, either through lru putback or directly for ZONE_DEVICE pages.

* hmm_migrate() calls cleanup(); only now can the device driver update its page table.

I am slightly changing the last 2 steps: it would call the device driver callback first and then restore the CPU page table, and the device driver callback would be renamed to finalize_and_map().

So with this design:

1. is a non-issue (use of a pfn array and not a list of pages).

2. is a non-issue: a successful migration from ZONE_DEVICE (GPU memory) to system memory calls put_page(), which in turn will call into the device driver to inform it that the page is free (assuming the refcount on the page reaches 1).

3. The new page is not part of the LRU if it is a device page. The assumption is that the device driver wants to manage its memory by itself and the LRU would interfere with that. Moreover, this is a device page and thus not something that should be used for emergency memory allocation or any regular allocation. So it is pointless for the kernel to keep aging those pages to see when they can be reclaimed.

4. I do not store the address_space operations of a device; I extended struct dev_pagemap to have more callbacks, and these can be accessed through struct page->pgmap. So for a ZONE_DEVICE page, page->mapping points to the expected mapping, ie for an anonymous page it points to the anon vma chain and for a file-backed page it points to the address space of the filesystem on which the file is.

5. See 4 above.

6. I do not store any device driver specific address space operations inside struct page. I do not see the need for that, and doing so would require major changes to kernel mm code. All the device driver cares about is being told when a page is free (as I am assuming the device does the allocation in the first place). It seems you want to rely on the following struct address_space_operations callbacks: void (*putback_page)(struct page *); bool (*isolate_page)(struct page *, isolate_mode_t); int (*migratepage) (...); For putback_page I added a free_page() to struct dev_pagemap which does the job. I do not see the need for isolate_page(), and it would be bad as some filesystems do special things in that callback. If you update the CPU page table the device should see that, and I do not think you would need any special handling inside the device driver code. For migratepage() again I do not see the use for it. Some fs have a special callback and that should be the one used.

So I really don't think we need to have an address_space for pages that are coming from a device. I think we can add things to struct dev_pagemap if needed. Did I miss something ? :) Cheers, Jérôme ^ permalink raw reply [flat|nested] 135+ messages in thread
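Jerome's pfn-array protocol can be sketched as a small user-space model. Everything here is an assumption made for illustration: the flag bits, the encoding (pfn shifted above two flag bits) and the helper names are invented for the sketch and are not HMM's actual API; only the shape of the protocol (one array slot per page, pinned entries cleared, the driver callback rewriting source pfns to destination pfns) follows the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flag bits; the real HMM encoding may differ. */
#define PFN_VALID   (1UL << 0)   /* entry holds a migratable source pfn */
#define PFN_PINNED  (1UL << 1)   /* page is pinned: restore and skip */
#define PFN_SHIFT   2            /* pfn is stored above the flag bits */

/* Driver callback: allocate a destination page, copy, return its pfn. */
typedef unsigned long (*alloc_and_copy_t)(unsigned long src_pfn);

/*
 * Model of the pfn-array protocol described above: one slot per page in
 * [start, end), pinned/invalid entries are cleared (CPU page table kept),
 * and the driver callback replaces each remaining source pfn with the pfn
 * of the freshly allocated-and-copied destination page.
 */
static size_t model_hmm_migrate(unsigned long *pfns, size_t npages,
                                alloc_and_copy_t alloc_and_copy)
{
    size_t migrated = 0;
    for (size_t i = 0; i < npages; i++) {
        if (!(pfns[i] & PFN_VALID) || (pfns[i] & PFN_PINNED)) {
            pfns[i] = 0;                 /* cleared: entry not migrated */
            continue;
        }
        unsigned long src = pfns[i] >> PFN_SHIFT;
        unsigned long dst = alloc_and_copy(src);
        pfns[i] = (dst << PFN_SHIFT) | PFN_VALID;  /* rewritten in place */
        migrated++;
    }
    return migrated;
}

/* Toy driver callback: "allocates" device pfns starting at 0x1000. */
static unsigned long toy_alloc_and_copy(unsigned long src_pfn)
{
    static unsigned long next_dev_pfn = 0x1000;
    (void)src_pfn;                        /* a real driver copies src -> dst */
    return next_dev_pfn++;
}
```

The same array shape also answers Anshuman's question about direction: nothing in the protocol cares whether the source pfns are system RAM and the destinations device memory or the reverse; only the callback changes.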
* Re: [RFC 0/8] Define coherent device memory node 2016-10-27 15:05 ` Jerome Glisse @ 2016-10-28 5:47 ` Anshuman Khandual -1 siblings, 0 replies; 135+ messages in thread From: Anshuman Khandual @ 2016-10-28 5:47 UTC (permalink / raw) To: Jerome Glisse Cc: Aneesh Kumar K.V, linux-kernel, linux-mm, mhocko, js1304, vbabka, mgorman, minchan, akpm, bsingharora On 10/27/2016 08:35 PM, Jerome Glisse wrote: > On Thu, Oct 27, 2016 at 12:33:05PM +0530, Anshuman Khandual wrote: >> On 10/27/2016 10:08 AM, Anshuman Khandual wrote: >>> On 10/26/2016 09:32 PM, Jerome Glisse wrote: >>>> On Wed, Oct 26, 2016 at 04:43:10PM +0530, Anshuman Khandual wrote: >>>>> On 10/26/2016 12:22 AM, Jerome Glisse wrote: >>>>>> On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote: >>>>>>> Jerome Glisse <j.glisse@gmail.com> writes: >>>>>>>> On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote: >>>>>>>>> Jerome Glisse <j.glisse@gmail.com> writes: >>>>>>>>>> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote: > > [...] > >>>> In my patchset there is no policy, it is all under device driver control which >>>> decide what range of memory is migrated and when. I think only device driver as >>>> proper knowledge to make such decision. By coalescing data from GPU counters and >>>> request from application made through the uppler level programming API like >>>> Cuda. >>>> >>> >>> Right, I understand that. But what I pointed out here is that there are problems >>> now migrating user mapped pages back and forth between LRU system RAM memory and >>> non LRU device memory which is yet to be solved. Because you are proposing a non >>> LRU based design with ZONE_DEVICE, how we are solving/working around these >>> problems for bi-directional migration ? >> >> Let me elaborate on this bit more. Before non LRU migration support patch series >> from Minchan, it was not possible to migrate non LRU pages which are generally >> driver managed through migrate_pages interface. 
* Re: [RFC 0/8] Define coherent device memory node @ 2016-10-28 5:47 ` Anshuman Khandual 0 siblings, 0 replies; 135+ messages in thread From: Anshuman Khandual @ 2016-10-28 5:47 UTC (permalink / raw) To: Jerome Glisse Cc: Aneesh Kumar K.V, linux-kernel, linux-mm, mhocko, js1304, vbabka, mgorman, minchan, akpm, bsingharora On 10/27/2016 08:35 PM, Jerome Glisse wrote: > On Thu, Oct 27, 2016 at 12:33:05PM +0530, Anshuman Khandual wrote: >> On 10/27/2016 10:08 AM, Anshuman Khandual wrote: >>> On 10/26/2016 09:32 PM, Jerome Glisse wrote: >>>> On Wed, Oct 26, 2016 at 04:43:10PM +0530, Anshuman Khandual wrote: >>>>> On 10/26/2016 12:22 AM, Jerome Glisse wrote: >>>>>> On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote: >>>>>>> Jerome Glisse <j.glisse@gmail.com> writes: >>>>>>>> On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote: >>>>>>>>> Jerome Glisse <j.glisse@gmail.com> writes: >>>>>>>>>> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote: > > [...] > >>>> In my patchset there is no policy, it is all under device driver control which >>>> decide what range of memory is migrated and when. I think only device driver as >>>> proper knowledge to make such decision. By coalescing data from GPU counters and >>>> request from application made through the uppler level programming API like >>>> Cuda. >>>> >>> >>> Right, I understand that. But what I pointed out here is that there are problems >>> now migrating user mapped pages back and forth between LRU system RAM memory and >>> non LRU device memory which is yet to be solved. Because you are proposing a non >>> LRU based design with ZONE_DEVICE, how we are solving/working around these >>> problems for bi-directional migration ? >> >> Let me elaborate on this bit more. Before non LRU migration support patch series >> from Minchan, it was not possible to migrate non LRU pages which are generally >> driver managed through migrate_pages interface. 
This was affecting the ability >> to do compaction on platforms which have a large share of non LRU pages. That series >> actually solved the migration problem and allowed compaction. But it still did not >> solve the migration problem for non LRU *user mapped* pages. So if the non LRU pages >> are mapped into a process's page table and being accessed from user space, they can >> not be moved using the migrate_pages interface. >> >> Minchan had a draft solution for that problem which is still hosted here. On his >> suggestion I had tried this solution but still faced some other problems during >> mapped pages migration. (NOTE: IIRC this was not posted in the community) >> >> git://git.kernel.org/pub/scm/linux/kernel/git/minchan/linux.git with the following >> branch (non-lru-mapped-v1r2-v4.7-rc4-mmotm-2016-06-24-15-53) >> >> As I had mentioned earlier, we intend to support all possible migrations between >> system RAM (LRU) and device memory (Non LRU) for user space mapped pages. >> >> (1) System RAM (Anon mapping) --> Device memory, back and forth many times >> (2) System RAM (File mapping) --> Device memory, back and forth many times > > I achieve these 2 objectives in HMM, i sent you the additional patches for file > backed page migration. I am not done working on them but they are small. Sure, will go through them. Thanks! > > >> This is not happening now with non LRU pages. Here are some of the reasons, but before >> that some notes. >> >> * Driver initiates all the migrations >> * Driver does the isolation of pages >> * Driver puts the isolated pages in a linked list >> * Driver passes the linked list to the migrate_pages interface for migration >> * IIRC isolation of non LRU pages happens through the page->as->aops->isolate_page call >> * If migration fails, call page->as->aops->putback_page to give the page back to the >> device driver >> >> 1. queue_pages_range() currently does not work with non LRU pages, needs to be fixed >> >> 2.
After a successful migration from non LRU device memory to LRU system RAM, the non >> LRU page will be freed back. Right now migrate_pages releases these pages to buddy, but >> in this situation we need the pages to be given back to the driver instead. Hence >> migrate_pages needs to be changed to accommodate this. >> >> 3. After LRU system RAM to non LRU device migration for a mapped page, will the new >> page (which came from device memory) be part of the core MM LRU for either Anon >> or File mapping ? >> >> 4. After LRU (Anon mapped) system RAM to non LRU device migration for a mapped page, >> how are we going to store both "address_space->address_space_operations" and "Anon VMA >> Chain" reverse mapping information on the page->mapping element ? >> >> 5. After LRU (File mapped) system RAM to non LRU device migration for a mapped page, >> how are we going to store both the "address_space->address_space_operations" of the device >> driver and the radix tree based reverse mapping information for the existing file >> mapping on the same page->mapping element ? >> >> 6. IIRC, it was not possible to retain the non LRU identity (page->as->aops which will >> be defined inside the device driver) and the reverse mapping information (either anon >> or file mapping) together after the first round of migration. This non LRU identity needs >> to be retained continuously if we ever need to return this page to the device driver after >> successful migration to system RAM, or for isolation/putback purposes or something else. >> >> All the reasons explained above were preventing a continuous ping-pong scheme of migration >> between system RAM LRU buddy pages and device memory non LRU pages, which is one of the >> primary requirements for exploiting coherent device memory. Do you think we can solve these >> problems with the ZONE_DEVICE and HMM framework ?
> > Well HMM already achieves migration but the design is slightly different : > * Device driver initiates migration by calling hmm_migrate(mm, start, end, pfn_array); > It must provide a pfn_array that is big enough to have one entry per page for the > range (so ((end - start) >> PAGE_SHIFT) entries). With this array, no list of pages. If we are not going to use the standard core migrate_pages() interface, there is no need to build a linked list of isolated source pages for migration. Though I see a different hmm_migrate() function in the V13 tree which involves a hmm_migrate structure, let's focus on the hmm_migrate(mm, start, end, pfn_array) format. I guess (mm, start, end) describes the virtual range of a process which needs to be migrated and pfn_array[] is the destination array of PFNs for the migration ? * I assume pfn_array[] can contain either system RAM PFNs or device memory PFNs ? Will it support migration in both directions ? * Device memory PFNs can have struct pages (if ZONE_DEVICE based) or may not have struct pages ? > > * hmm_migrate() collects source pages from the process. Right now it will only migrate > things that have been faulted, ie with a valid CPU page table entry, and will ignore > swap entries or any other special CPU page table entry. Those source pages are stored > in the pfn array (using their pfn value with flags like write permission) So source PFNs go into pfn_array[]; I was thinking it contains destination PFNs. > > * hmm_migrate() isolates all lru pages collected in the previous step. For ZONE_DEVICE pages > it does nothing. A non lru page can be migrated only if it is a ZONE_DEVICE page. Any > non lru page that is not ZONE_DEVICE is ignored. Hmm, maybe because it has neither page->pgmap (which you have extended to contain some driver specific callbacks) nor page->as->aops (Minchan Kim's framework). Therefore any other kind of non LRU page cannot migrate. > > * hmm_migrate() unmaps all the pages and checks the refcount.
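As an aside, the pfn-with-flags entries described above can be modeled in a few lines of C. This is a userspace sketch only; the bit layout and the HMM_PFN_* flag names are assumptions for illustration, not the actual encoding used by the HMM patches:

```c
#include <assert.h>

/* Hypothetical pfn_array[] entry encoding: the low bits carry per-page
 * flags, the remaining bits carry the pfn itself. Flag names and bit
 * positions are assumptions for illustration only.
 */
#define HMM_PFN_VALID  (1UL << 0) /* entry holds a migratable pfn  */
#define HMM_PFN_WRITE  (1UL << 1) /* source mapping was writable   */
#define HMM_PFN_DEVICE (1UL << 2) /* pfn points into device memory */
#define HMM_PFN_SHIFT  3

static inline unsigned long hmm_pfn_pack(unsigned long pfn, unsigned long flags)
{
        return (pfn << HMM_PFN_SHIFT) | flags;
}

static inline unsigned long hmm_pfn_unpack(unsigned long entry)
{
        return entry >> HMM_PFN_SHIFT;
}

static inline int hmm_pfn_valid(unsigned long entry)
{
        return !!(entry & HMM_PFN_VALID);
}
```

With this kind of encoding a cleared (zero) entry naturally means "do not migrate this page", which matches the description of pinned pages being dropped from the array.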
If a page is pinned then > it restores the CPU page table, puts the page back on the lru (if it is not a ZONE_DEVICE page) > and clears the associated entry inside the pfn_array. Got it. pfn_array[] at the end will contain all PFNs which need to be migrated. > > * hmm_migrate() uses the device driver callback alloc_and_copy(); this device driver callback > will allocate destination device pages and copy from the source pages. It uses the pfn So if the migration is from device to system RAM, alloc_and_copy() will allocate the destination system RAM pages and at that time pfn_array[] contains source device memory PFNs ? I am just trying to see if it works both ways. > array to know which pages can be migrated in the range (there is a flag). The callback > must also update the pfn_array and replace any entry that was successfully allocated > and copied with the pfn of the device page (and flag). > > * hmm_migrate() does the final struct page meta-data migration, which might fail in the case of > a file backed page (buffer head migration fails or radix tree fails ...) > > * hmm_migrate() updates the CPU page table, ie removes the migration special entry to point > to the new page if migration was successful or restores the old page otherwise. It also unlocks the > pages and calls put_page() on them, either through lru put back or directly for > ZONE_DEVICE pages. If it's a ZONE_DEVICE page, does the registered device driver also get notified about it ? So that it can update its own accounting regarding the allocated and free memory pages that it owns through a hot plugged ZONE_DEVICE zone ? > > * hmm_migrate() calls cleanup(); only now can the device driver update its page table Though I still need to understand the page table mirroring part, I can clearly see that hmm_migrate() attempts to implement a parallel migrate_pages() kind of interface which can work with non LRU pages (right now ZONE_DEVICE based only) and a device driver.
We will have to see whether this hmm_migrate() interface can accommodate all kinds and directions of migration. Minchan Kim's framework enabled non LRU page migration in a different way. The device driver is supposed to create a stand alone struct address_space_operations and struct address_space and load them into each struct page with a call. Now all non LRU pages contain the stand alone struct address_space_operations as page->as->aops based callbacks. Now we have a different way of enabling non LRU device page migration by extending the ZONE_DEVICE framework; does it overlap with the functionality already supported by the previous framework ? I am just curious. > > > I am slightly changing the last 2 steps: it would call the device driver callback first > and then restore the CPU page table, and the device driver callback would be renamed to > finalize_and_map(). > > So with this design: > 1. is a non-issue (use of pfn array and not list of pages). Right. > > 2. is a non-issue; successful migration from ZONE_DEVICE (GPU memory) to system > memory calls put_page() which in turn will call inside the device driver > to inform the device driver that the page is free (assuming refcount on page > reaches 1) Right. > > 3. The new page is not part of the LRU if it is a device page. The assumption is that the > device driver wants to manage its memory by itself and the LRU would interfere with > that. Moreover this is a device page and thus it is not something that should be > used for emergency memory allocation or any regular allocation. So it is pointless > for the kernel to try to keep aging those pages to see when they can be reclaimed. If the driver manages everything, these device memory pages need not be on the LRU after migration. But not being on any LRU makes it difficult for other core MM features to work on these pages any more. Almost all core mm interfaces expect the pages to be on an LRU, IIUC. Though they all can be changed to accommodate non LRU pages, don't you think that can be a lot of work ?
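For reference, the shape of Minchan's callback-based scheme discussed here can be sketched as a small userspace model (the types are simplified stand-ins for the kernel structures, and the demo_* driver is hypothetical):

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model of the non-LRU movable page scheme: the driver supplies
 * stand-alone callbacks and every non-LRU page reaches them through its
 * mapping. Simplified stand-in types, not the kernel's real structures.
 */
struct page;

struct address_space_operations {
        int  (*isolate_page)(struct page *page);
        int  (*migratepage)(struct page *newpage, struct page *page);
        void (*putback_page)(struct page *page);
};

struct address_space {
        const struct address_space_operations *a_ops;
};

struct page {
        struct address_space *mapping;
        int isolated;
};

/* core mm side: isolate, try to migrate, hand the page back on failure */
static int migrate_movable_page(struct page *newpage, struct page *page)
{
        const struct address_space_operations *aops = page->mapping->a_ops;

        if (!aops->isolate_page(page))
                return -1;                /* driver refused isolation */
        if (aops->migratepage(newpage, page) == 0)
                return 0;                 /* migration succeeded */
        aops->putback_page(page);         /* failure: return page to driver */
        return -1;
}

/* hypothetical demo driver implementing the three callbacks */
static int demo_isolate(struct page *p) { p->isolated = 1; return 1; }
static int demo_migrate(struct page *np, struct page *p) { (void)np; (void)p; return 0; }
static void demo_putback(struct page *p) { p->isolated = 0; }

static const struct address_space_operations demo_aops = {
        .isolate_page = demo_isolate,
        .migratepage  = demo_migrate,
        .putback_page = demo_putback,
};
```

The question raised in the thread is exactly whether page->mapping can carry both this driver aops pointer and the normal anon/file reverse-mapping information at once.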
Just curious. > > 4. I do not store the address_space operations of a device; i extended struct dev_pagemap > to have more callbacks and these can be accessed through struct page->pgmap > So for a ZONE_DEVICE page, page->mapping points to the expected page->mapping, > ie for an anonymous page it points to the anon vma chain and for a file backed page it > points to the address space of the filesystem on which the file is. Right. > > 5. See 4 above Right. > > 6. I do not store any device driver specific address space operations inside struct > page. I do not see the need for that and doing so would require major changes to > kernel mm code. All the device driver cares about is being told when a page is > free (as i am assuming the device does the allocation in the first place). > Minchan's work introduced the idea of PageMovable (IIUC, it just says it's a movable non LRU page with page->mapping->aops and some struct page flags) and changed parts of the core MM migration and compaction functions to accommodate movable pages. > It seems you want to rely on the following struct address_space_operations callbacks: > void (*putback_page)(struct page *); > bool (*isolate_page)(struct page *, isolate_mode_t); > int (*migratepage) (...); > > For putback_page i added a free_page() to struct dev_pagemap which does the job. Right, sounds correct for this ZONE_DEVICE based framework. > I do not see the need for isolate_page() and it would be bad as some filesystems do > special things in that callback. If you update the CPU page table the device should It was a dummy device driver specific address_space_operations, hence it's not related to any file system as such. > see that and i do not think you would need any special handling inside the device > driver code. I need to understand this part, ie how a callback from a CPU page table update reaches the device driver; will go through HMM V13 for that. > > For migratepage() again i do not see the use for it.
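The free_page()-via-dev_pagemap idea can be illustrated with a minimal userspace model. Only the callback-on-last-put behaviour is the point; the struct layout is a simplification, not the kernel's real dev_pagemap:

```c
#include <assert.h>
#include <stddef.h>

/* Model of the idea above: a ZONE_DEVICE-style page carries a pointer to
 * its pagemap, and dropping the last reference hands the page back to the
 * driver through a free_page() callback instead of the buddy allocator.
 */
struct dev_page;

struct dev_pagemap {
        void (*free_page)(struct dev_page *page); /* driver reclaims the page */
};

struct dev_page {
        int refcount;
        struct dev_pagemap *pgmap;
};

static void put_dev_page(struct dev_page *page)
{
        if (--page->refcount == 0)
                page->pgmap->free_page(page); /* back to the driver, not buddy */
}

/* tiny hypothetical driver: counts pages it gets back */
static int demo_freed;
static void demo_free(struct dev_page *page) { (void)page; demo_freed++; }

static int demo_put_twice(void)
{
        struct dev_pagemap pgmap = { .free_page = demo_free };
        struct dev_page page = { .refcount = 2, .pgmap = &pgmap };

        put_dev_page(&page);   /* refcount 2 -> 1, nothing happens   */
        put_dev_page(&page);   /* refcount 1 -> 0, driver notified   */
        return demo_freed;
}
```

This is why, in the design above, putback_page() becomes unnecessary: the ordinary put_page() path already gives the driver its notification.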
Some fs have special callbacks > and those should be the ones used. > > > So i really don't think we need to have an address_space for pages that are coming > from a device. I think we can add things to struct dev_pagemap if needed. Right, sounds correct for this ZONE_DEVICE based framework. > > Did i miss something ? :) Will have more questions after looking deeper into HMM :) ^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [RFC 0/8] Define coherent device memory node 2016-10-28 5:47 ` Anshuman Khandual @ 2016-10-28 16:08 ` Jerome Glisse -1 siblings, 0 replies; 135+ messages in thread From: Jerome Glisse @ 2016-10-28 16:08 UTC (permalink / raw) To: Anshuman Khandual Cc: Aneesh Kumar K.V, linux-kernel, linux-mm, mhocko, js1304, vbabka, mgorman, minchan, akpm, bsingharora On Fri, Oct 28, 2016 at 11:17:31AM +0530, Anshuman Khandual wrote: > On 10/27/2016 08:35 PM, Jerome Glisse wrote: > > On Thu, Oct 27, 2016 at 12:33:05PM +0530, Anshuman Khandual wrote: > >> On 10/27/2016 10:08 AM, Anshuman Khandual wrote: > >>> On 10/26/2016 09:32 PM, Jerome Glisse wrote: > >>>> On Wed, Oct 26, 2016 at 04:43:10PM +0530, Anshuman Khandual wrote: > >>>>> On 10/26/2016 12:22 AM, Jerome Glisse wrote: > >>>>>> On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote: > >>>>>>> Jerome Glisse <j.glisse@gmail.com> writes: > >>>>>>>> On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote: > >>>>>>>>> Jerome Glisse <j.glisse@gmail.com> writes: > >>>>>>>>>> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote: > > > > [...] > > > >>>> In my patchset there is no policy, it is all under device driver control which > >>>> decide what range of memory is migrated and when. I think only device driver as > >>>> proper knowledge to make such decision. By coalescing data from GPU counters and > >>>> request from application made through the uppler level programming API like > >>>> Cuda. > >>>> > >>> > >>> Right, I understand that. But what I pointed out here is that there are problems > >>> now migrating user mapped pages back and forth between LRU system RAM memory and > >>> non LRU device memory which is yet to be solved. Because you are proposing a non > >>> LRU based design with ZONE_DEVICE, how we are solving/working around these > >>> problems for bi-directional migration ? > >> > >> Let me elaborate on this bit more. 
Before non LRU migration support patch series > >> from Minchan, it was not possible to migrate non LRU pages which are generally > >> driver managed through migrate_pages interface. This was affecting the ability > >> to do compaction on platforms which has a large share of non LRU pages. That series > >> actually solved the migration problem and allowed compaction. But it still did not > >> solve the migration problem for non LRU *user mapped* pages. So if the non LRU pages > >> are mapped into a process's page table and being accessed from user space, it can > >> not be moved using migrate_pages interface. > >> > >> Minchan had a draft solution for that problem which is still hosted here. On his > >> suggestion I had tried this solution but still faced some other problems during > >> mapped pages migration. (NOTE: IIRC this was not posted in the community) > >> > >> git://git.kernel.org/pub/scm/linux/kernel/git/minchan/linux.git with the following > >> branch (non-lru-mapped-v1r2-v4.7-rc4-mmotm-2016-06-24-15-53) > >> > >> As I had mentioned earlier, we intend to support all possible migrations between > >> system RAM (LRU) and device memory (Non LRU) for user space mapped pages. > >> > >> (1) System RAM (Anon mapping) --> Device memory, back and forth many times > >> (2) System RAM (File mapping) --> Device memory, back and forth many times > > > > I achieve this 2 objective in HMM, i sent you the additional patches for file > > back page migration. I am not done working on them but they are small. > > Sure, will go through them. Thanks ! > > > > > > >> This is not happening now with non LRU pages. Here are some of reasons but before > >> that some notes. 
> >> > >> * Driver initiates all the migrations > >> * Driver does the isolation of pages > >> * Driver puts the isolated pages in a linked list > >> * Driver passes the linked list to migrate_pages interface for migration > >> * IIRC isolation of non LRU pages happens through page->as->aops->isolate_page call > >> * If migration fails, call page->as->aops->putback_page to give the page back to the > >> device driver > >> > >> 1. queue_pages_range() currently does not work with non LRU pages, needs to be fixed > >> > >> 2. After a successful migration from non LRU device memory to LRU system RAM, the non > >> LRU will be freed back. Right now migrate_pages releases these pages to buddy, but > >> in this situation we need the pages to be given back to the driver instead. Hence > >> migrate_pages needs to be changed to accommodate this. > >> > >> 3. After LRU system RAM to non LRU device migration for a mapped page, does the new > >> page (which came from device memory) will be part of core MM LRU either for Anon > >> or File mapping ? > >> > >> 4. After LRU (Anon mapped) system RAM to non LRU device migration for a mapped page, > >> how we are going to store "address_space->address_space_operations" and "Anon VMA > >> Chain" reverse mapping information both on the page->mapping element ? > >> > >> 5. After LRU (File mapped) system RAM to non LRU device migration for a mapped page, > >> how we are going to store "address_space->address_space_operations" of the device > >> driver and radix tree based reverse mapping information for the existing file > >> mapping both on the same page->mapping element ? > >> > >> 6. IIRC, it was not possible to retain the non LRU identify (page->as->aops which will > >> defined inside the device driver) and the reverse mapping information (either anon > >> or file mapping) together after first round of migration. 
This non LRU identity needs > >> to be retained continuously if we ever need to return this page to device driver after > >> successful migration to system RAM or for isolation/putback purpose or something else. > >> > >> All the reasons explained above was preventing a continuous ping-pong scheme of migration > >> between system RAM LRU buddy pages and device memory non LRU pages which is one of the > >> primary requirements for exploiting coherent device memory. Do you think we can solve these > >> problems with ZONE_DEVICE and HMM framework ? > > > > Well HMM already achieve migration but design is slightly different : > > * Device driver initiate migration by calling hmm_migrate(mm, start, end, pfn_array); > > It must provide a pfn_array that is big enough to have one entry per page for the > > range (so ((end - start) >> PAGE_SHIFT) entries). With this array no list of page. > > If we are not going to use standard core migrate_pages() interface, there is no need > of building a linked list of isolated source pages for migration. Though I see a > different hmm_migrate() function in the V13 tree which involves hmm_migrate structure, > lets focus on hmm_migrate(mm, start, end, pfn_array) format. I guess (mm, start, end) > describes the virtual range of a process which needs to be migrated and pfn_array[] > is the destination array of PFNs for the migration ? The hmm_migrate struct is just a placeholder for all the arguments (vma, start, end, pfn_array ptr, flags, ...). I can hide it inside the migrate function; it is easier to pass around to sub-functions than always having a long list of args. > > * I assume pfn_array[] can contain either system RAM PFN or device memory PFN ? It > will support migration in both directions ? Correct, both directions are supported. > > * Device memory PFN can have struct pages (If ZONE_DEVICE based) or it may not have > struct pages ?
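The argument-holder pattern Jerome describes might look roughly like this (field names are assumed from the discussion, and PAGE_SHIFT is hard-coded to 12 for the sketch):

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of bundling hmm_migrate()'s long parameter list into one struct
 * so sub-functions take a single pointer. Field names are assumptions
 * based on the discussion, not the actual patch.
 */
struct hmm_migrate {
        void *mm;               /* address space being migrated (stand-in) */
        unsigned long start;    /* virtual range start */
        unsigned long end;      /* virtual range end   */
        unsigned long *pfns;    /* one entry per page in [start, end)      */
};

static unsigned long hmm_migrate_npages(const struct hmm_migrate *args)
{
        return (args->end - args->start) >> 12; /* PAGE_SHIFT assumed 12 */
}
```

The npages helper shows why the caller must size pfn_array as ((end - start) >> PAGE_SHIFT) entries.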
Memory must have a struct page; this is needed so that the anon_vma and the mapping for file backed pages are properly tracked. > > > > * hmm_migrate() collect source pages from the process. Right now it will only migrate > > thing that have been faulted ie with a valid CPU page table entry and will ignore > > swap entry, or any other special CPU page table entry. Those source pages are store > > in the pfn array (using their pfn value with flag like write permission) > > So source PFNs go into pfn_array[], I was thinking it contains destination PFNs. In the first pass it contains source pfns so the driver doesn't have to walk the CPU page table; it can be ignored by drivers that use the CPU page table directly. It is only after the device driver callback that the device populates the array with destination memory. > > > > * hmm_migrate() isolate all lru pages collected in previous step. For ZONE_DEVICE pages > > it does nothing. Non lru page can be migrated only if it is a ZONE_DEVICE page. Any > > non lru page that is not ZONE_DEVICE is ignored. > > Hmm, may be because it does not have either page->pgmap (which you have extended to > contain some driver specific callbacks) or page->as->aops (Minchan Kim's framework). > Therefore any other kind of non LRU pages cannot migrate. > > > > > * hmm_migrate() unmap all the pages and check the refcount. If there a page is pin then > > it restore CPU page table, put back the page on lru (if it is not a ZONE_DEVICE page) > > and clear the associated entry inside the pfn_array. > > Got it. pfn_array[] at the end will contain all PFNs which need to be migrated. Yup > > > > > > * hmm_migrate() use device driver callback alloc_and_copy() this device driver callback > > will allocate destination device page and copy from the source page. It uses the pfn > > So if the migration is from device to system RAM, alloc_and_copy() will allocate the > destination system RAM pages and at that time pfn_array[] contains source device memory > PFNs ?
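The two-pass contract clarified here (the core fills the array with source pfns, then the driver's alloc_and_copy() replaces usable entries with destinations) can be sketched as follows; all names are hypothetical and a zero entry stands for "skip this page":

```c
#include <assert.h>
#include <stddef.h>

/* Pass 2 of the hypothetical contract: the driver callback fills dst[]
 * only where src[] holds a usable source pfn; cleared entries (pinned or
 * absent pages) are skipped. A real callback would also copy the data.
 */
static void demo_alloc_and_copy(const unsigned long *src,
                                unsigned long *dst, size_t n)
{
        unsigned long next_dev_pfn = 0x8000; /* made-up device pfn pool */
        size_t i;

        for (i = 0; i < n; i++) {
                if (src[i] == 0) {       /* entry cleared in pass 1 */
                        dst[i] = 0;
                        continue;
                }
                dst[i] = next_dev_pfn++; /* "allocate" a destination page */
        }
}
```

After this pass the core code can finish the struct page metadata migration and update the CPU page table only for entries where a destination was provided.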
I am just trying to see if it works both ways. Yes, inside hmm_devmem* there is actually a helper that does just that, so the device driver doesn't have to worry about the device to system RAM direction. But device drivers can choose not to use hmm_devmem* and handle things on their own (i would rather have device drivers use common helpers to avoid each device driver making different mistakes). > > array to know which page can be migrated in the range (there is a flag). The callback > > must also update the pfn_array and replace any entry that was successfully allocated > > and copied with the pfn of the device page (and flag). > > > > * hmm_migrate() do the final struct page meta-data migration which might fail in case of > > file back page (buffer head migration fails or radix tree fails ...) > > > > * hmm_migrate() update the CPU page table ie remove migration special entry to point > > to new page if migration successfull or restore to old page otherwise. It also unlock > > page and call put_page() on them either through lru put back or directly for > > ZONE_DEVICE pages. > > If it's a ZONE_DEVICE page, the registered device driver also gets notified about it ? > So that it can update it's own accounting regarding the allocated and free memory pages > that it owns through a hot plugged ZONE_DEVICE zone ? > > > > > * hmm_migrate() call cleanup() only now device driver can update its page table > > Though I still need to understand the page table mirroring part, I can clearly see > that hmm_migrate() attempts to implement a parallel migrate_pages() kind of interface > which can work with non LRU pages (right now ZONE_DEVICE based only) and a device > driver.
The device > driver is suppose to create a stand alone struct address_space_operation and struct > address_space and load them into each struct page with a call. Now all non LRU pages > contains the stand alone struct address_space_operations as page->as->aops based > callbacks. > > Now we have a different way of enabling non LRU device page migration by extending > ZONE_DEVICE framework, does it overlap with the functionality already supported > by the previous framework ? I am just curious. I think Minchan is trying to allow migration for device driver kernel allocated memory ie not memory that end inside a regular vma (non special vma) but only inside a device driver file vma if at all. So we are targetting different problem. Me i only care about "regular" process memory is private anonymous, or share memory (either back by regular file or pure share memory). I do not want to mess with any of the device driver vma or any special vma that are under control of an unknown device driver. Trying to migrate any such special memory is just not going to work. Moreover i believe it is not something we care in the first place. GPU will work on either the regular process memory or some GPU specific memory but won't try to mess with other device vma. > > > > > > > I slightly changing the last 2 step, it would be call device driver callback first > > and then restore CPU page table and device driver callback would be rename to > > finalize_and_map(). > > > > So with this design: > > 1. is a non-issue (use of pfn array and not list of page). > > Right. > > > > > 2. is a non-issue successfull migration from ZONE_DEVICE (GPU memory) to system > > memory call put_page() which in turn will call inside the device driver > > to inform the device driver that page is free (assuming refcount on page > > reach 1) > > Right. > > > > > 3. New page is not part of the LRU if it is a device page. 
Assumption is that the > > device driver wants to manage its memory by itself and LRU would interfer with > > that. Moreover this is a device page and thus it is not something that should be > > use for emergency memory allocation or any regular allocation. So it is pointless > > for kernel to try to keep aging those pages to see when they can be reclaim. > > If the driver manages everything, these device memory pages need not be on the LRU after > migration. But not being on any LRU makes it difficult for other core MM features to work > on these pages any more. Almost all core mm interfaces expect the pages to be on LRU, IIUC. > Though they all can be changed to accommodate non LRU pages but dont you think that can be > a lot of work ? Just curious. There is no code that assume that page is on lru. There is code that assume new page must go on lru (the file system page read for instance). But all code path i went through (i try to go over all of it but i might have miss thing) will gracefully handle a page that is not on the lru. In all cases i have been through this just meant ignore the page. Which is what i wanted in the first place :) for device memory to be left alone. My hope is that at one point hardware will have enough commonality that implementing a generic per device lru might make sense. Same for other kernel mm functionality. > > > > > 4. I do not store address_space operation of a device, i extended struct dev_pagemap > > to have more callback and this can be access through struct page->pgmap > > So the for ZONE_DEVICE page the page->mapping point to the expected page->mapping > > ie for anonymous page it points to the anon vma chain and for file back page it > > points to the address space of the filesystem on which the file is. > > Right. > > > > > 5. See 4 above > > Right. > > > > > 6. I do not store any device driver specific address space operation inside struct > > page. 
I do not see the need for that and doing so would require major changes to > > kernel mm code. All the device driver cares about is being told when a page is > > free (as i am assuming device does the allocation in the first place). > > > > Minchan's work introduced the idea of PageMovable (IIUC, it just says its a movable > non LRU page with page->mapping->aops and some struct page flags) and changed parts > of the core MM migration and compaction functions to accommodate MovablePage. Like i said above i think he is targeting device driver allocated page that are not part of regular vma (private anonymous or share file) but are use by device driver. > > > It seems you want to rely on following struct address_space_operations callback: > > void (*putback_page)(struct page *); > > bool (*isolate_page)(struct page *, isolate_mode_t); > > int (*migratepage) (...); > > > > For putback_page i added a free_page() to struct dev_pagemap which does the job. > > Right, sounds correct from this ZONE_DEVICE based framework. > > > I do not see need for isolate_page() and it would be bad as some filesystem do > > special thing in that callback. If you update the CPU page table the device should > > It was a dummy device driver specific address_space_operations, hence its not related > to any file system as such. > > > see that and i do not think you would need any special handling inside the device > > driver code. > > I need to understand this part. How a call back from CPU page table update comes to > the device driver, will go through HMM V13 for that. It goes through the update() callback of hmm_mirror_ops which is part of hmm_mirror struct. So device driver register an hmm_mirror against an mm which bind the device to the given mm. Any update to CPU page table calls mmu_notifier and hmm forward those call to device driver through hmm_mirror_ops.update(). 
Device driver does not use mmu_notifier directly because HMM provide a way to snapshot CPU page table safely without worry from concurrent CPU page table update and without locking CPU page table directory. But all this is separate from migration or devmem, so i doubt it could be usefull with CAPI bus. hmm_mirror is just really mmu_notifier with some sugar coating :) Cheers, Jérôme ^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [RFC 0/8] Define coherent device memory node @ 2016-10-28 16:08 ` Jerome Glisse 0 siblings, 0 replies; 135+ messages in thread From: Jerome Glisse @ 2016-10-28 16:08 UTC (permalink / raw) To: Anshuman Khandual Cc: Aneesh Kumar K.V, linux-kernel, linux-mm, mhocko, js1304, vbabka, mgorman, minchan, akpm, bsingharora On Fri, Oct 28, 2016 at 11:17:31AM +0530, Anshuman Khandual wrote: > On 10/27/2016 08:35 PM, Jerome Glisse wrote: > > On Thu, Oct 27, 2016 at 12:33:05PM +0530, Anshuman Khandual wrote: > >> On 10/27/2016 10:08 AM, Anshuman Khandual wrote: > >>> On 10/26/2016 09:32 PM, Jerome Glisse wrote: > >>>> On Wed, Oct 26, 2016 at 04:43:10PM +0530, Anshuman Khandual wrote: > >>>>> On 10/26/2016 12:22 AM, Jerome Glisse wrote: > >>>>>> On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote: > >>>>>>> Jerome Glisse <j.glisse@gmail.com> writes: > >>>>>>>> On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote: > >>>>>>>>> Jerome Glisse <j.glisse@gmail.com> writes: > >>>>>>>>>> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote: > > > > [...] > > > >>>> In my patchset there is no policy, it is all under device driver control which > >>>> decide what range of memory is migrated and when. I think only device driver as > >>>> proper knowledge to make such decision. By coalescing data from GPU counters and > >>>> request from application made through the uppler level programming API like > >>>> Cuda. > >>>> > >>> > >>> Right, I understand that. But what I pointed out here is that there are problems > >>> now migrating user mapped pages back and forth between LRU system RAM memory and > >>> non LRU device memory which is yet to be solved. Because you are proposing a non > >>> LRU based design with ZONE_DEVICE, how we are solving/working around these > >>> problems for bi-directional migration ? > >> > >> Let me elaborate on this bit more. 
Before non LRU migration support patch series > >> from Minchan, it was not possible to migrate non LRU pages which are generally > >> driver managed through migrate_pages interface. This was affecting the ability > >> to do compaction on platforms which has a large share of non LRU pages. That series > >> actually solved the migration problem and allowed compaction. But it still did not > >> solve the migration problem for non LRU *user mapped* pages. So if the non LRU pages > >> are mapped into a process's page table and being accessed from user space, it can > >> not be moved using migrate_pages interface. > >> > >> Minchan had a draft solution for that problem which is still hosted here. On his > >> suggestion I had tried this solution but still faced some other problems during > >> mapped pages migration. (NOTE: IIRC this was not posted in the community) > >> > >> git://git.kernel.org/pub/scm/linux/kernel/git/minchan/linux.git with the following > >> branch (non-lru-mapped-v1r2-v4.7-rc4-mmotm-2016-06-24-15-53) > >> > >> As I had mentioned earlier, we intend to support all possible migrations between > >> system RAM (LRU) and device memory (Non LRU) for user space mapped pages. > >> > >> (1) System RAM (Anon mapping) --> Device memory, back and forth many times > >> (2) System RAM (File mapping) --> Device memory, back and forth many times > > > > I achieve this 2 objective in HMM, i sent you the additional patches for file > > back page migration. I am not done working on them but they are small. > > Sure, will go through them. Thanks ! > > > > > > >> This is not happening now with non LRU pages. Here are some of reasons but before > >> that some notes. 
> >> > >> * Driver initiates all the migrations > >> * Driver does the isolation of pages > >> * Driver puts the isolated pages in a linked list > >> * Driver passes the linked list to migrate_pages interface for migration > >> * IIRC isolation of non LRU pages happens through page->as->aops->isolate_page call > >> * If migration fails, call page->as->aops->putback_page to give the page back to the > >> device driver > >> > >> 1. queue_pages_range() currently does not work with non LRU pages, needs to be fixed > >> > >> 2. After a successful migration from non LRU device memory to LRU system RAM, the non > >> LRU will be freed back. Right now migrate_pages releases these pages to buddy, but > >> in this situation we need the pages to be given back to the driver instead. Hence > >> migrate_pages needs to be changed to accommodate this. > >> > >> 3. After LRU system RAM to non LRU device migration for a mapped page, does the new > >> page (which came from device memory) will be part of core MM LRU either for Anon > >> or File mapping ? > >> > >> 4. After LRU (Anon mapped) system RAM to non LRU device migration for a mapped page, > >> how we are going to store "address_space->address_space_operations" and "Anon VMA > >> Chain" reverse mapping information both on the page->mapping element ? > >> > >> 5. After LRU (File mapped) system RAM to non LRU device migration for a mapped page, > >> how we are going to store "address_space->address_space_operations" of the device > >> driver and radix tree based reverse mapping information for the existing file > >> mapping both on the same page->mapping element ? > >> > >> 6. IIRC, it was not possible to retain the non LRU identify (page->as->aops which will > >> defined inside the device driver) and the reverse mapping information (either anon > >> or file mapping) together after first round of migration. 
> >> This non LRU identity needs
> >> to be retained continuously if we ever need to return this page to the device driver
> >> after successful migration to system RAM, or for isolation/putback purposes, or
> >> something else.
> >>
> >> All the reasons explained above were preventing a continuous ping-pong scheme of
> >> migration between system RAM LRU buddy pages and device memory non LRU pages, which is
> >> one of the primary requirements for exploiting coherent device memory. Do you think we
> >> can solve these problems with ZONE_DEVICE and the HMM framework ?
>
> > Well HMM already achieves migration but the design is slightly different :
> >
> > * Device driver initiates migration by calling hmm_migrate(mm, start, end, pfn_array);
> >   It must provide a pfn_array that is big enough to have one entry per page for the
> >   range (so ((end - start) >> PAGE_SHIFT) entries). With this array there is no list
> >   of pages.
>
> If we are not going to use the standard core migrate_pages() interface, there is no need
> to build a linked list of isolated source pages for migration. Though I see a different
> hmm_migrate() function in the V13 tree which involves an hmm_migrate structure, let's
> focus on the hmm_migrate(mm, start, end, pfn_array) format. I guess (mm, start, end)
> describes the virtual range of a process which needs to be migrated and pfn_array[] is
> the destination array of PFNs for the migration ?

The hmm_migrate struct is just a place holder for all the arguments (vma, start, end,
pfn_array ptr, flags, ...). I can hide it inside the migrate function; it is easier to
pass around for sub-functions than always having a long list of arguments.

> * I assume pfn_array[] can contain either system RAM PFNs or device memory PFNs ? It
>   will support migration in both directions ?

Correct, both directions are supported.

> * Device memory PFNs can have struct pages (if ZONE_DEVICE based) or they may not have
>   struct pages ?
Memory must have a struct page; this is needed so that the anon_vma and the mapping for
file-backed pages are properly tracked.

> > * hmm_migrate() collects source pages from the process. Right now it will only migrate
> >   things that have been faulted, ie with a valid CPU page table entry, and will ignore
> >   swap entries or any other special CPU page table entries. Those source pages are
> >   stored in the pfn array (using their pfn value with flags like write permission).
>
> So source PFNs go into pfn_array[], I was thinking it contains destination PFNs.

In the first pass it contains source pfns, so the driver does not have to walk the CPU
page table; it can be ignored by a driver that uses the CPU page table directly. It is
only after the device driver callback that the driver populates the array with
destination memory.

> > * hmm_migrate() isolates all lru pages collected in the previous step. For ZONE_DEVICE
> >   pages it does nothing. A non lru page can be migrated only if it is a ZONE_DEVICE
> >   page. Any non lru page that is not ZONE_DEVICE is ignored.
>
> Hmm, maybe because it has neither page->pgmap (which you have extended to contain some
> driver specific callbacks) nor page->as->aops (Minchan Kim's framework). Therefore any
> other kind of non LRU page cannot migrate.
>
> > * hmm_migrate() unmaps all the pages and checks the refcount. If a page is pinned then
> >   it restores the CPU page table, puts the page back on the lru (if it is not a
> >   ZONE_DEVICE page) and clears the associated entry inside the pfn_array.
>
> Got it. pfn_array[] at the end will contain all the PFNs which need to be migrated.

Yup

> > * hmm_migrate() uses the device driver callback alloc_and_copy(); this device driver
> >   callback will allocate destination device pages and copy from the source pages. It
> >   uses the pfn
>
> So if the migration is from device to system RAM, alloc_and_copy() will allocate the
> destination system RAM pages, and at that time pfn_array[] contains source device memory
> PFNs ?
> I am just trying to see if it works both ways.

Yes, inside hmm_devmem* there is actually a helper that does just that, so a device driver
does not have to worry about the device to system RAM direction. But a device driver can
choose not to use hmm_devmem* and handle things on its own (i would rather have device
drivers use common helpers to avoid each device driver making different mistakes).

> > array to know which pages can be migrated in the range (there is a flag). The callback
> >   must also update the pfn_array and replace any entry that was successfully allocated
> >   and copied with the pfn of the device page (and flag).
> >
> > * hmm_migrate() does the final struct page meta-data migration, which might fail in
> >   the case of a file-backed page (buffer head migration fails or radix tree fails ...)
> >
> > * hmm_migrate() updates the CPU page table, ie removes the migration special entry to
> >   point to the new page if the migration was successful, or restores the old page
> >   otherwise. It also unlocks the pages and calls put_page() on them, either through
> >   lru put back or directly for ZONE_DEVICE pages.
>
> If it's a ZONE_DEVICE page, the registered device driver also gets notified about it ?
> So that it can update its own accounting regarding the allocated and free memory pages
> that it owns through a hot plugged ZONE_DEVICE zone ?
>
> > * hmm_migrate() calls cleanup(); only now can the device driver update its page table
>
> Though I still need to understand the page table mirroring part, I can clearly see
> that hmm_migrate() attempts to implement a parallel migrate_pages() kind of interface
> which can work with non LRU pages (right now ZONE_DEVICE based only) and a device
> driver. We will have to see whether this hmm_migrate() interface can accommodate all
> kinds and directions of migration.
>
> Minchan Kim's framework enabled non LRU page migration in a different way.
> The device
> driver is supposed to create a stand alone struct address_space_operations and struct
> address_space and load them into each struct page with a call. Now all non LRU pages
> contain the stand alone struct address_space_operations as page->as->aops based
> callbacks.
>
> Now we have a different way of enabling non LRU device page migration by extending the
> ZONE_DEVICE framework; does it overlap with the functionality already supported by the
> previous framework ? I am just curious.

I think Minchan is trying to allow migration for device driver kernel allocated memory,
ie not memory that ends up inside a regular vma (non special vma) but only inside a
device driver file vma, if at all. So we are targeting different problems. Me, i only
care about "regular" process memory, that is private anonymous or shared memory (either
backed by a regular file or pure shared memory). I do not want to mess with any of the
device driver vmas or any special vma that is under the control of an unknown device
driver. Trying to migrate any such special memory is just not going to work. Moreover, i
believe it is not something we care about in the first place. A GPU will work on either
the regular process memory or some GPU specific memory but won't try to mess with
another device's vma.

> > I am slightly changing the last 2 steps: it would be to call the device driver
> > callback first and then restore the CPU page table, and the device driver callback
> > would be renamed to finalize_and_map().
> >
> > So with this design:
> >
> > 1. is a non-issue (use of a pfn array and not a list of pages).
>
> Right.
>
> > 2. is a non-issue: a successful migration from ZONE_DEVICE (GPU memory) to system
> >    memory calls put_page(), which in turn will call inside the device driver to inform
> >    the device driver that the page is free (assuming the refcount on the page
> >    reaches 1)
>
> Right.
>
> > 3. New page is not part of the LRU if it is a device page.
> > Assumption is that the
> > device driver wants to manage its memory by itself and the LRU would interfere with
> > that. Moreover this is a device page, and thus it is not something that should be used
> > for emergency memory allocation or any regular allocation. So it is pointless for the
> > kernel to try to keep aging those pages to see when they can be reclaimed.
>
> If the driver manages everything, these device memory pages need not be on the LRU
> after migration. But not being on any LRU makes it difficult for other core MM features
> to work on these pages any more. Almost all core mm interfaces expect the pages to be
> on the LRU, IIUC. Though they all can be changed to accommodate non LRU pages, don't
> you think that can be a lot of work ? Just curious.

There is no code that assumes a page is on the lru. There is code that assumes a new page
must go on the lru (the file system page read for instance). But all the code paths i
went through (i tried to go over all of them but i might have missed things) will
gracefully handle a page that is not on the lru. In all the cases i have been through,
this just meant ignoring the page. Which is what i wanted in the first place :) for
device memory to be left alone.

My hope is that at one point hardware will have enough commonality that implementing a
generic per device lru might make sense. Same for other kernel mm functionality.

> > 4. I do not store the address_space operations of a device; i extended struct
> >    dev_pagemap to have more callbacks and this can be accessed through struct
> >    page->pgmap. So for a ZONE_DEVICE page the page->mapping points to the expected
> >    page->mapping, ie for an anonymous page it points to the anon vma chain and for a
> >    file-backed page it points to the address space of the filesystem on which the
> >    file is.
>
> Right.
>
> > 5. See 4 above
>
> Right.
>
> > 6. I do not store any device driver specific address space operations inside struct
> >    page.
> > I do not see the need for that and doing so would require major changes to
> >    kernel mm code. All the device driver cares about is being told when a page is
> >    free (as i am assuming the device does the allocation in the first place).
>
> Minchan's work introduced the idea of PageMovable (IIUC, it just says it is a movable
> non LRU page with page->mapping->aops and some struct page flags) and changed parts
> of the core MM migration and compaction functions to accommodate movable pages.

Like i said above, i think he is targeting device driver allocated pages that are not
part of a regular vma (private anonymous or shared file) but are used by the device
driver.

> It seems you want to rely on the following struct address_space_operations callbacks:
> void (*putback_page)(struct page *);
> bool (*isolate_page)(struct page *, isolate_mode_t);
> int (*migratepage) (...);

For putback_page i added a free_page() to struct dev_pagemap which does the job.

> Right, sounds correct from this ZONE_DEVICE based framework.

I do not see the need for isolate_page() and it would be bad, as some filesystems do
special things in that callback. If you update the CPU page table the device should

> It was a dummy device driver specific address_space_operations, hence it is not related
> to any file system as such.

see that, and i do not think you would need any special handling inside the device
driver code.

> I need to understand this part. How a callback from a CPU page table update comes to
> the device driver; will go through HMM V13 for that.

It goes through the update() callback of hmm_mirror_ops, which is part of the hmm_mirror
struct. So the device driver registers an hmm_mirror against an mm, which binds the
device to the given mm. Any update to the CPU page table calls mmu_notifier and hmm
forwards those calls to the device driver through hmm_mirror_ops.update().
Device driver does not use mmu_notifier directly because HMM provides a way to snapshot
the CPU page table safely, without worrying about concurrent CPU page table updates and
without locking the CPU page table directory. But all this is separate from migration or
devmem, so i doubt it could be useful with a CAPI bus.

hmm_mirror is just really mmu_notifier with some sugar coating :)

Cheers,
Jerome

^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [RFC 0/8] Define coherent device memory node 2016-10-25 18:52 ` Jerome Glisse @ 2016-10-26 12:56 ` Anshuman Khandual -1 siblings, 0 replies; 135+ messages in thread From: Anshuman Khandual @ 2016-10-26 12:56 UTC (permalink / raw) To: Jerome Glisse, Aneesh Kumar K.V Cc: linux-kernel, linux-mm, mhocko, js1304, vbabka, mgorman, minchan, akpm, bsingharora On 10/26/2016 12:22 AM, Jerome Glisse wrote: > On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote: >> Jerome Glisse <j.glisse@gmail.com> writes: >> >>> On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote: >>>> Jerome Glisse <j.glisse@gmail.com> writes: >>>>> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote: >>> >>> [...] >>> >>>>> You can take a look at hmm-v13 if you want to see how i do non LRU page >>>>> migration. While i put most of the migration code inside hmm_migrate.c it >>>>> could easily be move to migrate.c without hmm_ prefix. >>>>> >>>>> There is 2 missing piece with existing migrate code. First is to put memory >>>>> allocation for destination under control of who call the migrate code. Second >>>>> is to allow offloading the copy operation to device (ie not use the CPU to >>>>> copy data). >>>>> >>>>> I believe same requirement also make sense for platform you are targeting. >>>>> Thus same code can be use. >>>>> >>>>> hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13 >>>>> >>>>> I haven't posted this patchset yet because we are doing some modifications >>>>> to the device driver API to accomodate some new features. But the ZONE_DEVICE >>>>> changes and the overall migration code will stay the same more or less (i have >>>>> patches that move it to migrate.c and share more code with existing migrate >>>>> code). >>>>> >>>>> If you think i missed anything about lru and page cache please point it to >>>>> me. 
>>>>> Because when i audited code for that i didn't see any road block with
>>>>> the few fs i was looking at (ext4, xfs and core page cache code).
>>>>
>>>> The other restriction around ZONE_DEVICE is, it is not a managed zone.
>>>> That prevents any direct allocation from coherent device by application.
>>>> ie, we would like to force allocation from coherent device using
>>>> interface like mbind(MPOL_BIND..) . Is that possible with ZONE_DEVICE ?
>>>
>>> To achieve this we rely on device fault code path ie when device take a page fault
>>> with help of HMM it will use existing memory if any for fault address but if CPU
>>> page table is empty (and it is not file back vma because of readback) then device
>>> can directly allocate device memory and HMM will update CPU page table to point to
>>> newly allocated device memory.
>>
>> That is ok if the device touch the page first. What if we want the
>> allocation touched first by cpu to come from GPU ?. Should we always
>> depend on GPU driver to migrate such pages later from system RAM to GPU
>> memory ?
>
> I am not sure what kind of workload would rather have every first CPU access for
> a range to use device memory. So no my code does not handle that and it is pointless
> for it as CPU can not access device memory for me.

If the user space application can explicitly allocate device memory directly, we can
save one round of migration when the device starts accessing it. But then one can argue
what problem statement the device would work on with freshly allocated memory which has
not yet been accessed by the CPU for loading the data. Will look into this scenario in
more detail.

> That said nothing forbid to add support for ZONE_DEVICE with mbind() like syscall.
> Thought my personnal preference would still be to avoid use of such generic syscall
> but have device driver set allocation policy through its own userspace API (device
> driver could reuse internal of mbind() to achieve the end result).
Okay, the basic premise of the CDM node is to have a LRU based design where we can avoid
the use of driver specific user space memory management code altogether.

> I am not saying that everything you want to do is doable now with HMM but nothing
> precludes achieving what you want to achieve using ZONE_DEVICE. I really don't think
> any of the existing mm mechanisms (kswapd, lru, numa, ...) are a nice fit and can be
> reused with device memory.

With a CDM node based design, the expectation is to get all/maximum core VM mechanisms
working so that the driver has to do less device specific optimization.

> Each device is so different from the other that i don't believe in a one API fits all.

Right, so as I had mentioned in the cover letter, pglist_data->coherent_device actually
can become a bit mask indicating the type of coherent device the node is, and that can
be used to implement multiple types of requirements in core mm for various kinds of
devices in the future.

> The drm GPU subsystem of the kernel is a testimony of how little can be shared when it
> comes to GPUs. The only common code is modesetting. Everything that deals with how to
> use a GPU to compute stuff is per device and most of the logic is in userspace.

What is the basic reason which prevents such code/functionality sharing ?

> So i do not see any commonality that could be abstracted at the syscall level. I would
> rather let the device driver stack (kernel and userspace) take such decisions and have
> the higher level APIs (OpenCL, Cuda, C++17, ...) expose something that makes sense for
> each of them. Programmers target those high level APIs and they intend to use the
> mechanisms each offers to manage memory and memory placement. I would say forcing them
> to use a second linux specific API to achieve the latter is wrong, at least for now.

But going forward don't we want a more closely integrated coherent device solution which
does not depend too much on a device driver stack, and can be used from a basic user
space program ?
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [RFC 0/8] Define coherent device memory node @ 2016-10-26 12:56 ` Anshuman Khandual 0 siblings, 0 replies; 135+ messages in thread From: Anshuman Khandual @ 2016-10-26 12:56 UTC (permalink / raw) To: Jerome Glisse, Aneesh Kumar K.V Cc: linux-kernel, linux-mm, mhocko, js1304, vbabka, mgorman, minchan, akpm, bsingharora On 10/26/2016 12:22 AM, Jerome Glisse wrote: > On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote: >> Jerome Glisse <j.glisse@gmail.com> writes: >> >>> On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote: >>>> Jerome Glisse <j.glisse@gmail.com> writes: >>>>> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote: >>> >>> [...] >>> >>>>> You can take a look at hmm-v13 if you want to see how i do non LRU page >>>>> migration. While i put most of the migration code inside hmm_migrate.c it >>>>> could easily be move to migrate.c without hmm_ prefix. >>>>> >>>>> There is 2 missing piece with existing migrate code. First is to put memory >>>>> allocation for destination under control of who call the migrate code. Second >>>>> is to allow offloading the copy operation to device (ie not use the CPU to >>>>> copy data). >>>>> >>>>> I believe same requirement also make sense for platform you are targeting. >>>>> Thus same code can be use. >>>>> >>>>> hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13 >>>>> >>>>> I haven't posted this patchset yet because we are doing some modifications >>>>> to the device driver API to accomodate some new features. But the ZONE_DEVICE >>>>> changes and the overall migration code will stay the same more or less (i have >>>>> patches that move it to migrate.c and share more code with existing migrate >>>>> code). >>>>> >>>>> If you think i missed anything about lru and page cache please point it to >>>>> me. Because when i audited code for that i didn't see any road block with >>>>> the few fs i was looking at (ext4, xfs and core page cache code). 
>>>>> >>>> >>>> The other restriction around ZONE_DEVICE is, it is not a managed zone. >>>> That prevents any direct allocation from coherent device by application. >>>> ie, we would like to force allocation from coherent device using >>>> interface like mbind(MPOL_BIND..) . Is that possible with ZONE_DEVICE ? >>> >>> To achieve this we rely on device fault code path ie when device take a page fault >>> with help of HMM it will use existing memory if any for fault address but if CPU >>> page table is empty (and it is not file back vma because of readback) then device >>> can directly allocate device memory and HMM will update CPU page table to point to >>> newly allocated device memory. >>> >> >> That is ok if the device touch the page first. What if we want the >> allocation touched first by cpu to come from GPU ?. Should we always >> depend on GPU driver to migrate such pages later from system RAM to GPU >> memory ? >> > > I am not sure what kind of workload would rather have every first CPU access for > a range to use device memory. So no my code does not handle that and it is pointless > for it as CPU can not access device memory for me. If the user space application can explicitly allocate device memory directly, we can save one round of migration when the device start accessing it. But then one can argue what problem statement the device would work on on a freshly allocated memory which has not been accessed by CPU for loading the data yet. Will look into this scenario in more detail. > > That said nothing forbid to add support for ZONE_DEVICE with mbind() like syscall. > Thought my personnal preference would still be to avoid use of such generic syscall > but have device driver set allocation policy through its own userspace API (device > driver could reuse internal of mbind() to achieve the end result). Okay, the basic premise of CDM node is to have a LRU based design where we can avoid use of driver specific user space memory management code altogether. 
> > I am not saying that eveything you want to do is doable now with HMM but, nothing > preclude achieving what you want to achieve using ZONE_DEVICE. I really don't think > any of the existing mm mechanism (kswapd, lru, numa, ...) are nice fit and can be reuse > with device memory. With the CDM node based design, the expectation is to get as much of the core VM mechanism working as possible so that the driver has to do less device-specific optimization. > > Each device is so different from the other that i don't believe in a one API fit all. Right, so as I had mentioned in the cover letter, pglist_data->coherent_device can actually become a bit mask indicating the type of coherent device the node is, and that can be used to implement multiple types of requirements in core mm for various kinds of devices in the future. > The drm GPU subsystem of the kernel is a testimony of how little can be share when it > comes to GPU. The only common code is modesetting. Everything that deals with how to > use GPU to compute stuff is per device and most of the logic is in userspace. So i do What is the basic reason that prevents such code/functionality sharing? > not see any commonality that could be abstracted at syscall level. I would rather let > device driver stack (kernel and userspace) take such decision and have the higher level > API (OpenCL, Cuda, C++17, ...) expose something that make sense for each of them. > Programmer target those high level API and they intend to use the mechanism each offer > to manage memory and memory placement. I would say forcing them to use a second linux > specific API to achieve the latter is wrong, at lest for now. But going forward, don't we want a more closely integrated coherent device solution which does not depend too much on a device driver stack, and which can be used from a basic user space program? 
* Re: [RFC 0/8] Define coherent device memory node 2016-10-26 12:56 ` Anshuman Khandual @ 2016-10-26 16:28 ` Jerome Glisse -1 siblings, 0 replies; 135+ messages in thread From: Jerome Glisse @ 2016-10-26 16:28 UTC (permalink / raw) To: Anshuman Khandual Cc: Aneesh Kumar K.V, linux-kernel, linux-mm, mhocko, js1304, vbabka, mgorman, minchan, akpm, bsingharora On Wed, Oct 26, 2016 at 06:26:02PM +0530, Anshuman Khandual wrote: > On 10/26/2016 12:22 AM, Jerome Glisse wrote: > > On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote: > >> Jerome Glisse <j.glisse@gmail.com> writes: > >> > >>> On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote: > >>>> Jerome Glisse <j.glisse@gmail.com> writes: > >>>>> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote: > >>> > >>> [...] > >>> > >>>>> You can take a look at hmm-v13 if you want to see how i do non LRU page > >>>>> migration. While i put most of the migration code inside hmm_migrate.c it > >>>>> could easily be move to migrate.c without hmm_ prefix. > >>>>> > >>>>> There is 2 missing piece with existing migrate code. First is to put memory > >>>>> allocation for destination under control of who call the migrate code. Second > >>>>> is to allow offloading the copy operation to device (ie not use the CPU to > >>>>> copy data). > >>>>> > >>>>> I believe same requirement also make sense for platform you are targeting. > >>>>> Thus same code can be use. > >>>>> > >>>>> hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13 > >>>>> > >>>>> I haven't posted this patchset yet because we are doing some modifications > >>>>> to the device driver API to accomodate some new features. But the ZONE_DEVICE > >>>>> changes and the overall migration code will stay the same more or less (i have > >>>>> patches that move it to migrate.c and share more code with existing migrate > >>>>> code). > >>>>> > >>>>> If you think i missed anything about lru and page cache please point it to > >>>>> me. 
Because when i audited code for that i didn't see any road block with > >>>>> the few fs i was looking at (ext4, xfs and core page cache code). > >>>>> > >>>> > >>>> The other restriction around ZONE_DEVICE is, it is not a managed zone. > >>>> That prevents any direct allocation from coherent device by application. > >>>> ie, we would like to force allocation from coherent device using > >>>> interface like mbind(MPOL_BIND..) . Is that possible with ZONE_DEVICE ? > >>> > >>> To achieve this we rely on device fault code path ie when device take a page fault > >>> with help of HMM it will use existing memory if any for fault address but if CPU > >>> page table is empty (and it is not file back vma because of readback) then device > >>> can directly allocate device memory and HMM will update CPU page table to point to > >>> newly allocated device memory. > >>> > >> > >> That is ok if the device touch the page first. What if we want the > >> allocation touched first by cpu to come from GPU ?. Should we always > >> depend on GPU driver to migrate such pages later from system RAM to GPU > >> memory ? > >> > > > > I am not sure what kind of workload would rather have every first CPU access for > > a range to use device memory. So no my code does not handle that and it is pointless > > for it as CPU can not access device memory for me. > > If the user space application can explicitly allocate device memory directly, we > can save one round of migration when the device start accessing it. But then one > can argue what problem statement the device would work on on a freshly allocated > memory which has not been accessed by CPU for loading the data yet. Will look into > this scenario in more detail. > > > > > That said nothing forbid to add support for ZONE_DEVICE with mbind() like syscall. 
> > Thought my personnal preference would still be to avoid use of such generic syscall > > but have device driver set allocation policy through its own userspace API (device > > driver could reuse internal of mbind() to achieve the end result). > > Okay, the basic premise of CDM node is to have a LRU based design where we can > avoid use of driver specific user space memory management code altogether. And I think it is not a good fit, at least not for GPUs. GPU device drivers have a big chunk of code dedicated to memory management. You can look at drm/ttm and at userspace (most of it is in userspace). It is not because we want to reinvent the wheel; it is because there are some unique constraints. > > > > I am not saying that eveything you want to do is doable now with HMM but, nothing > > preclude achieving what you want to achieve using ZONE_DEVICE. I really don't think > > any of the existing mm mechanism (kswapd, lru, numa, ...) are nice fit and can be reuse > > with device memory. > > With CDM node based design, the expectation is to get all/maximum core VM mechanism > working so that, driver has to do less device specific optimization. I think this is a bad idea, today, for GPUs, but I might be wrong. > > > > Each device is so different from the other that i don't believe in a one API fit all. > > Right, so as I had mentioned in the cover letter, pglist_data->coherent_device actually > can become a bit mask indicating the type of coherent device the node is and that can > be used to implement multiple types of requirement in core mm for various kinds of > devices in the future. I really don't want to move GPU memory management into core mm. If you only consider GPGPU then it _might_ make sense, but for the graphics side I definitely don't think so. There are way too many device-specific considerations with respect to memory management for GPUs (not only between different vendors but also between different generations). 
> > The drm GPU subsystem of the kernel is a testimony of how little can be share when it > > comes to GPU. The only common code is modesetting. Everything that deals with how to > > use GPU to compute stuff is per device and most of the logic is in userspace. So i do > > Whats the basic reason which prevents such code/functionality sharing ? While the higher level APIs (OpenGL, OpenCL, Vulkan, Cuda, ...) each offer an abstraction model, they are all different abstractions. There is just no way to have the kernel expose a common API that would allow all of the above to be implemented. Each GPU has complex memory management and requirements (which differ not only between vendors but also between generations from the same vendor). They have a different ISA for each generation. They have a different way to schedule jobs for each generation. They offer different sync mechanisms. They have different page table formats, MMUs, ... Basically each GPU generation is a platform on its own, like arm, ppc, x86, ... so I do not see a way to expose a common API, and I don't think anyone who has worked on any number of GPUs sees one either. I wish, but it is just not the case. > > not see any commonality that could be abstracted at syscall level. I would rather let > > device driver stack (kernel and userspace) take such decision and have the higher level > > API (OpenCL, Cuda, C++17, ...) expose something that make sense for each of them. > > Programmer target those high level API and they intend to use the mechanism each offer > > to manage memory and memory placement. I would say forcing them to use a second linux > > specific API to achieve the latter is wrong, at lest for now. > > But going forward dont we want a more closely integrated coherent device solution > which does not depend too much on a device driver stack ? and can be used from a > basic user space program ? That is something I want, but I strongly believe we are not there yet; we have no real-world experience. 
All we have in the open source community is the graphics stack (drm), and the graphics stack clearly shows that today there is no common denominator between GPUs outside of modesetting. So while I share the same aim, I think for now we need to gain real experience. Once we have something like OpenCL >= 2.0, C++17 and a couple of other userspace APIs being actively used on Linux with different coherent devices, then we can start looking at finding a common denominator that makes sense for enough devices. I am sure device drivers would like to get rid of their custom memory management, but I don't think this is applicable now. I fear existing mm code would always make the worst decision when it comes to memory placement, migration and reclaim. Cheers, Jérôme
* Re: [RFC 0/8] Define coherent device memory node 2016-10-26 16:28 ` Jerome Glisse @ 2016-10-27 10:23 ` Balbir Singh -1 siblings, 0 replies; 135+ messages in thread From: Balbir Singh @ 2016-10-27 10:23 UTC (permalink / raw) To: Jerome Glisse, Anshuman Khandual Cc: Aneesh Kumar K.V, linux-kernel, linux-mm, mhocko, js1304, vbabka, mgorman, minchan, akpm On 27/10/16 03:28, Jerome Glisse wrote: > On Wed, Oct 26, 2016 at 06:26:02PM +0530, Anshuman Khandual wrote: >> On 10/26/2016 12:22 AM, Jerome Glisse wrote: >>> On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote: >>>> Jerome Glisse <j.glisse@gmail.com> writes: >>>> >>>>> On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote: >>>>>> Jerome Glisse <j.glisse@gmail.com> writes: >>>>>>> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote: >>>>> >>>>> [...] >>>>> >>>>>>> You can take a look at hmm-v13 if you want to see how i do non LRU page >>>>>>> migration. While i put most of the migration code inside hmm_migrate.c it >>>>>>> could easily be move to migrate.c without hmm_ prefix. >>>>>>> >>>>>>> There is 2 missing piece with existing migrate code. First is to put memory >>>>>>> allocation for destination under control of who call the migrate code. Second >>>>>>> is to allow offloading the copy operation to device (ie not use the CPU to >>>>>>> copy data). >>>>>>> >>>>>>> I believe same requirement also make sense for platform you are targeting. >>>>>>> Thus same code can be use. >>>>>>> >>>>>>> hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13 >>>>>>> >>>>>>> I haven't posted this patchset yet because we are doing some modifications >>>>>>> to the device driver API to accomodate some new features. But the ZONE_DEVICE >>>>>>> changes and the overall migration code will stay the same more or less (i have >>>>>>> patches that move it to migrate.c and share more code with existing migrate >>>>>>> code). 
>>>>>>> >>>>>>> If you think i missed anything about lru and page cache please point it to >>>>>>> me. Because when i audited code for that i didn't see any road block with >>>>>>> the few fs i was looking at (ext4, xfs and core page cache code). >>>>>>> >>>>>> >>>>>> The other restriction around ZONE_DEVICE is, it is not a managed zone. >>>>>> That prevents any direct allocation from coherent device by application. >>>>>> ie, we would like to force allocation from coherent device using >>>>>> interface like mbind(MPOL_BIND..) . Is that possible with ZONE_DEVICE ? >>>>> >>>>> To achieve this we rely on device fault code path ie when device take a page fault >>>>> with help of HMM it will use existing memory if any for fault address but if CPU >>>>> page table is empty (and it is not file back vma because of readback) then device >>>>> can directly allocate device memory and HMM will update CPU page table to point to >>>>> newly allocated device memory. >>>>> >>>> >>>> That is ok if the device touch the page first. What if we want the >>>> allocation touched first by cpu to come from GPU ?. Should we always >>>> depend on GPU driver to migrate such pages later from system RAM to GPU >>>> memory ? >>>> >>> >>> I am not sure what kind of workload would rather have every first CPU access for >>> a range to use device memory. So no my code does not handle that and it is pointless >>> for it as CPU can not access device memory for me. >> >> If the user space application can explicitly allocate device memory directly, we >> can save one round of migration when the device start accessing it. But then one >> can argue what problem statement the device would work on on a freshly allocated >> memory which has not been accessed by CPU for loading the data yet. Will look into >> this scenario in more detail. >> >>> >>> That said nothing forbid to add support for ZONE_DEVICE with mbind() like syscall. 
>>> Thought my personnal preference would still be to avoid use of such generic syscall >>> but have device driver set allocation policy through its own userspace API (device >>> driver could reuse internal of mbind() to achieve the end result). >> >> Okay, the basic premise of CDM node is to have a LRU based design where we can >> avoid use of driver specific user space memory management code altogether. > > And i think it is not a good fit, at least not for GPU. GPU device driver have a > big chunk of code dedicated to memory management. You can look at drm/ttm and at > userspace (most is in userspace). It is not because we want to reinvent the wheel > it is because they are some unique constraint. > Could you elaborate on the unique constraints a bit more? I looked at ttm briefly (specifically ttm_memory.c), I can see zones being replicated, it feels like a mini-mm is embedded in there. > >>> >>> I am not saying that eveything you want to do is doable now with HMM but, nothing >>> preclude achieving what you want to achieve using ZONE_DEVICE. I really don't think >>> any of the existing mm mechanism (kswapd, lru, numa, ...) are nice fit and can be reuse >>> with device memory. >> >> With CDM node based design, the expectation is to get all/maximum core VM mechanism >> working so that, driver has to do less device specific optimization. > > I think this is a bad idea, today, for GPU but i might be wrong. Why do you think so? What aspects do you think are wrong? I am guessing you mean that the GPU driver via the GEM/DRM/TTM layers should interact with the mm and manage their own memory and use some form of TTM mm abstraction? I'll study those systems if possible as well. > >>> >>> Each device is so different from the other that i don't believe in a one API fit all. 
>> >> Right, so as I had mentioned in the cover letter, pglist_data->coherent_device actually >> can become a bit mask indicating the type of coherent device the node is and that can >> be used to implement multiple types of requirement in core mm for various kinds of >> devices in the future. > > I really don't want to move GPU memory management into core mm, if you only concider GPGPU > then it _might_ make sense but for graphic side i definitly don't think so. There are way > to much device specific consideration to have in respect of memory management for GPU > (not only in between different vendor but difference between different generation). > Yes, GPGPU is of interest. We don't look at it as GPU memory management. The memory on the device is coherent; it is a part of the system. It comes online later and we would like to hotplug it out if required. Since it's sitting on a bus, we do need optimizations and the ability to migrate to and from it. I don't think it makes sense to replicate a lot of the mm core logic to manage this memory, IMHO. One thing I'd like to point out is that it is wrong to assume that only a GPU can have coherent memory, as the RFC clarifies. > >>> The drm GPU subsystem of the kernel is a testimony of how little can be share when it >>> comes to GPU. The only common code is modesetting. Everything that deals with how to >>> use GPU to compute stuff is per device and most of the logic is in userspace. So i do >> >> Whats the basic reason which prevents such code/functionality sharing ? > > While the higher level API (OpenGL, OpenCL, Vulkan, Cuda, ...) offer an abstraction model, > they are all different abstractions. They are just no way to have kernel expose a common > API that would allow all of the above to be implemented. > > Each GPU have complex memory management and requirement (not only differ between vendor > but also between generation of same vendor). They have different isa for each generation. 
> They have different way to schedule job for each generation. They offer different sync > mechanism. They have different page table format, mmu, ... > Agreed > Basicly each GPU generation is a platform on it is own, like arm, ppc, x86, ... so i do > not see a way to expose a common API and i don't think anyone who as work on any number > of GPU see one either. I wish but it is just not the case. > We are trying to leverage the ability to see coherent memory (across a set of devices plus system RAM) to keep memory management as simple as possible > >>> not see any commonality that could be abstracted at syscall level. I would rather let >>> device driver stack (kernel and userspace) take such decision and have the higher level >>> API (OpenCL, Cuda, C++17, ...) expose something that make sense for each of them. >>> Programmer target those high level API and they intend to use the mechanism each offer >>> to manage memory and memory placement. I would say forcing them to use a second linux >>> specific API to achieve the latter is wrong, at lest for now. >> >> But going forward dont we want a more closely integrated coherent device solution >> which does not depend too much on a device driver stack ? and can be used from a >> basic user space program ? > > That is something i want, but i strongly believe we are not there yet, we have no real > world experience. All we have in the open source community is the graphic stack (drm) > and the graphic stack clearly shows that today there is no common denominator between > GPU outside of modesetting. > :) > So while i share the same aim, i think for now we need to have real experience. Once we > have something like OpenCL >= 2.0, C++17 and couple other userspace API being actively > use on linux with different coherent devices then we can start looking at finding a > common denominator that make sense for enough devices. 
> > I am sure device driver would like to get rid of their custom memory management but i > don't think this is applicable now. I fear existing mm code would always make the worst > decision when it comes to memory placement, migration and reclaim. > Agreed, we don't want to make either placement/migration or reclaim slow. As I said earlier, we should not restrict our thinking to just GPU devices. Balbir Singh.
* Re: [RFC 0/8] Define coherent device memory node @ 2016-10-27 10:23 ` Balbir Singh 0 siblings, 0 replies; 135+ messages in thread From: Balbir Singh @ 2016-10-27 10:23 UTC (permalink / raw) To: Jerome Glisse, Anshuman Khandual Cc: Aneesh Kumar K.V, linux-kernel, linux-mm, mhocko, js1304, vbabka, mgorman, minchan, akpm On 27/10/16 03:28, Jerome Glisse wrote: > On Wed, Oct 26, 2016 at 06:26:02PM +0530, Anshuman Khandual wrote: >> On 10/26/2016 12:22 AM, Jerome Glisse wrote: >>> On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote: >>>> Jerome Glisse <j.glisse@gmail.com> writes: >>>> >>>>> On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote: >>>>>> Jerome Glisse <j.glisse@gmail.com> writes: >>>>>>> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote: >>>>> >>>>> [...] >>>>> >>>>>>> You can take a look at hmm-v13 if you want to see how i do non LRU page >>>>>>> migration. While i put most of the migration code inside hmm_migrate.c it >>>>>>> could easily be move to migrate.c without hmm_ prefix. >>>>>>> >>>>>>> There is 2 missing piece with existing migrate code. First is to put memory >>>>>>> allocation for destination under control of who call the migrate code. Second >>>>>>> is to allow offloading the copy operation to device (ie not use the CPU to >>>>>>> copy data). >>>>>>> >>>>>>> I believe same requirement also make sense for platform you are targeting. >>>>>>> Thus same code can be use. >>>>>>> >>>>>>> hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13 >>>>>>> >>>>>>> I haven't posted this patchset yet because we are doing some modifications >>>>>>> to the device driver API to accomodate some new features. But the ZONE_DEVICE >>>>>>> changes and the overall migration code will stay the same more or less (i have >>>>>>> patches that move it to migrate.c and share more code with existing migrate >>>>>>> code). 
>>>>>>> >>>>>>> If you think i missed anything about lru and page cache please point it to >>>>>>> me. Because when i audited code for that i didn't see any road block with >>>>>>> the few fs i was looking at (ext4, xfs and core page cache code). >>>>>>> >>>>>> >>>>>> The other restriction around ZONE_DEVICE is, it is not a managed zone. >>>>>> That prevents any direct allocation from coherent device by application. >>>>>> ie, we would like to force allocation from coherent device using >>>>>> interface like mbind(MPOL_BIND..) . Is that possible with ZONE_DEVICE ? >>>>> >>>>> To achieve this we rely on device fault code path ie when device take a page fault >>>>> with help of HMM it will use existing memory if any for fault address but if CPU >>>>> page table is empty (and it is not file back vma because of readback) then device >>>>> can directly allocate device memory and HMM will update CPU page table to point to >>>>> newly allocated device memory. >>>>> >>>> >>>> That is ok if the device touch the page first. What if we want the >>>> allocation touched first by cpu to come from GPU ?. Should we always >>>> depend on GPU driver to migrate such pages later from system RAM to GPU >>>> memory ? >>>> >>> >>> I am not sure what kind of workload would rather have every first CPU access for >>> a range to use device memory. So no my code does not handle that and it is pointless >>> for it as CPU can not access device memory for me. >> >> If the user space application can explicitly allocate device memory directly, we >> can save one round of migration when the device start accessing it. But then one >> can argue what problem statement the device would work on on a freshly allocated >> memory which has not been accessed by CPU for loading the data yet. Will look into >> this scenario in more detail. >> >>> >>> That said nothing forbid to add support for ZONE_DEVICE with mbind() like syscall. 
>>> Thought my personnal preference would still be to avoid use of such generic syscall >>> but have device driver set allocation policy through its own userspace API (device >>> driver could reuse internal of mbind() to achieve the end result). >> >> Okay, the basic premise of CDM node is to have a LRU based design where we can >> avoid use of driver specific user space memory management code altogether. > > And i think it is not a good fit, at least not for GPU. GPU device driver have a > big chunk of code dedicated to memory management. You can look at drm/ttm and at > userspace (most is in userspace). It is not because we want to reinvent the wheel > it is because they are some unique constraint. > Could you elaborate on the unique constraints a bit more? I looked at ttm briefly (specifically ttm_memory.c), I can see zones being replicated, it feels like a mini-mm is embedded in there. > >>> >>> I am not saying that eveything you want to do is doable now with HMM but, nothing >>> preclude achieving what you want to achieve using ZONE_DEVICE. I really don't think >>> any of the existing mm mechanism (kswapd, lru, numa, ...) are nice fit and can be reuse >>> with device memory. >> >> With CDM node based design, the expectation is to get all/maximum core VM mechanism >> working so that, driver has to do less device specific optimization. > > I think this is a bad idea, today, for GPU but i might be wrong. Why do you think so? What aspects do you think are wrong? I am guessing you mean that the GPU driver via the GEM/DRM/TTM layers should interact with the mm and manage their own memory and use some form of TTM mm abstraction? I'll study those systems if possible as well. > >>> >>> Each device is so different from the other that i don't believe in a one API fit all. 
>> >> Right, so as I had mentioned in the cover letter, pglist_data->coherent_device actually >> can become a bit mask indicating the type of coherent device the node is and that can >> be used to implement multiple types of requirement in core mm for various kinds of >> devices in the future. > > I really don't want to move GPU memory management into core mm, if you only concider GPGPU > then it _might_ make sense but for graphic side i definitly don't think so. There are way > to much device specific consideration to have in respect of memory management for GPU > (not only in between different vendor but difference between different generation). > Yes, GPGPU is of interest. We don't look at it as GPU memory management. The memory on the device is coherent, it is a part of the system. It comes online later and we would like to hotplug it out if required. Since it's sitting on a bus, we do need optimizations and the ability to migrate to and from it. I don't think it makes sense to replicate a lot of the mm core logic to manage this memory, IMHO. I'd also like to point out that it is wrong to assume that only a GPU can have coherent memory, as the RFC clarifies. > >>> The drm GPU subsystem of the kernel is a testimony of how little can be share when it >>> comes to GPU. The only common code is modesetting. Everything that deals with how to >>> use GPU to compute stuff is per device and most of the logic is in userspace. So i do >> >> Whats the basic reason which prevents such code/functionality sharing ? > > While the higher level API (OpenGL, OpenCL, Vulkan, Cuda, ...) offer an abstraction model, > they are all different abstractions. They are just no way to have kernel expose a common > API that would allow all of the above to be implemented. > > Each GPU have complex memory management and requirement (not only differ between vendor > but also between generation of same vendor). They have different isa for each generation. 
> They have different way to schedule job for each generation. They offer different sync > mechanism. They have different page table format, mmu, ... > Agreed > Basicly each GPU generation is a platform on it is own, like arm, ppc, x86, ... so i do > not see a way to expose a common API and i don't think anyone who as work on any number > of GPU see one either. I wish but it is just not the case. > We are trying to leverage the ability to see coherent memory (across a set of devices plus system RAM) to keep memory management as simple as possible > >>> not see any commonality that could be abstracted at syscall level. I would rather let >>> device driver stack (kernel and userspace) take such decision and have the higher level >>> API (OpenCL, Cuda, C++17, ...) expose something that make sense for each of them. >>> Programmer target those high level API and they intend to use the mechanism each offer >>> to manage memory and memory placement. I would say forcing them to use a second linux >>> specific API to achieve the latter is wrong, at lest for now. >> >> But going forward dont we want a more closely integrated coherent device solution >> which does not depend too much on a device driver stack ? and can be used from a >> basic user space program ? > > That is something i want, but i strongly believe we are not there yet, we have no real > world experience. All we have in the open source community is the graphic stack (drm) > and the graphic stack clearly shows that today there is no common denominator between > GPU outside of modesetting. > :) > So while i share the same aim, i think for now we need to have real experience. Once we > have something like OpenCL >= 2.0, C++17 and couple other userspace API being actively > use on linux with different coherent devices then we can start looking at finding a > common denominator that make sense for enough devices. 
> > I am sure device driver would like to get rid of their custom memory management but i > don't think this is applicable now. I fear existing mm code would always make the worst > decision when it comes to memory placement, migration and reclaim. > Agreed, we don't want to make either placement/migration or reclaim slow. As I said earlier we should not restrict our thinking to just GPU devices. Balbir Singh. ^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [RFC 0/8] Define coherent device memory node 2016-10-24 17:09 ` Jerome Glisse @ 2016-10-25 12:07 ` Balbir Singh -1 siblings, 0 replies; 135+ messages in thread From: Balbir Singh @ 2016-10-25 12:07 UTC (permalink / raw) To: Jerome Glisse, Anshuman Khandual Cc: linux-kernel, linux-mm, mhocko, js1304, vbabka, mgorman, minchan, akpm, aneesh.kumar On 25/10/16 04:09, Jerome Glisse wrote: > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote: > >> [...] > >> Core kernel memory features like reclamation, evictions etc. might >> need to be restricted or modified on the coherent device memory node as >> they can be performance limiting. The RFC does not propose anything on this >> yet but it can be looked into later on. For now it just disables Auto NUMA >> for any VMA which has coherent device memory. >> >> Seamless integration of coherent device memory with system memory >> will enable various other features, some of which can be listed as follows. >> >> a. Seamless migrations between system RAM and the coherent memory >> b. Will have asynchronous and high throughput migrations >> c. Be able to allocate huge order pages from these memory regions >> d. Restrict allocations to a large extent to the tasks using the >> device for workload acceleration >> >> Before concluding, will look into the reasons why the existing >> solutions don't work. There are two basic requirements which have to be >> satisfies before the coherent device memory can be integrated with core >> kernel seamlessly. >> >> a. PFN must have struct page >> b. Struct page must able to be inside standard LRU lists >> >> The above two basic requirements discard the existing method of >> device memory representation approaches like these which then requires the >> need of creating a new framework. > > I do not believe the LRU list is a hard requirement, yes when faulting in > a page inside the page cache it assumes it needs to be added to lru list. 
> But i think this can easily be work around. > > In HMM i am using ZONE_DEVICE and because memory is not accessible from CPU > (not everyone is bless with decent system bus like CAPI, CCIX, Gen-Z, ...) > so in my case a file back page must always be spawn first from a regular > page and once read from disk then i can migrate to GPU page. > I've not seen the HMM patchset, but read from disk will go to ZONE_DEVICE? Then get migrated? > So if you accept this intermediary step you can easily use ZONE_DEVICE for > device memory. This way no lru, no complex dance to make the memory out of > reach from regular memory allocator. > > I think we would have much to gain if we pool our effort on a single common > solution for device memory. In my case the device memory is not accessible > by the CPU (because PCIE restrictions), in your case it is. Thus the only > difference is that in my case it can not be map inside the CPU page table > while in yours it can. > I think thats a good idea to pool our efforts at the same time making progress >> >> (1) Traditional ioremap >> >> a. Memory is mapped into kernel (linear and virtual) and user space >> b. These PFNs do not have struct pages associated with it >> c. These special PFNs are marked with special flags inside the PTE >> d. Cannot participate in core VM functions much because of this >> e. Cannot do easy user space migrations >> >> (2) Zone ZONE_DEVICE >> >> a. Memory is mapped into kernel and user space >> b. PFNs do have struct pages associated with it >> c. These struct pages are allocated inside it's own memory range >> d. Unfortunately the struct page's union containing LRU has been >> used for struct dev_pagemap pointer >> e. Hence it cannot be part of any LRU (like Page cache) >> f. Hence file cached mapping cannot reside on these PFNs >> g. 
Cannot do easy migrations >> >> I had also explored non LRU representation of this coherent device >> memory where the integration with system RAM in the core VM is limited only >> to the following functions. Not being inside LRU is definitely going to >> reduce the scope of tight integration with system RAM. >> >> (1) Migration support between system RAM and coherent memory >> (2) Migration support between various coherent memory nodes >> (3) Isolation of the coherent memory >> (4) Mapping the coherent memory into user space through driver's >> struct vm_operations >> (5) HW poisoning of the coherent memory >> >> Allocating the entire memory of the coherent device node right >> after hot plug into ZONE_MOVABLE (where the memory is already inside the >> buddy system) will still expose a time window where other user space >> allocations can come into the coherent device memory node and prevent the >> intended isolation. So traditional hot plug is not the solution. Hence >> started looking into CMA based non LRU solution but then hit the following >> roadblocks. >> >> (1) CMA does not support hot plugging of new memory node >> a. CMA area needs to be marked during boot before buddy is >> initialized >> b. cma_alloc()/cma_release() can happen on the marked area >> c. Should be able to mark the CMA areas just after memory hot plug >> d. cma_alloc()/cma_release() can happen later after the hot plug >> e. This is not currently supported right now >> >> (2) Mapped non LRU migration of pages >> a. Recent work from Minchan Kim makes non LRU page migratable >> b. But it still does not support migration of mapped non LRU pages >> c. With non LRU CMA reserved, again there are some additional >> challenges >> >> With hot pluggable CMA and non LRU mapped migration support there >> may be an alternate approach to represent coherent device memory. Please >> do review this RFC proposal and let me know your comments or suggestions. >> Thank you. 
> > You can take a look at hmm-v13 if you want to see how i do non LRU page > migration. While i put most of the migration code inside hmm_migrate.c it > could easily be move to migrate.c without hmm_ prefix. > > There is 2 missing piece with existing migrate code. First is to put memory > allocation for destination under control of who call the migrate code. Second > is to allow offloading the copy operation to device (ie not use the CPU to > copy data). > > I believe same requirement also make sense for platform you are targeting. > Thus same code can be use. > > hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13 > Thanks for the link > I haven't posted this patchset yet because we are doing some modifications > to the device driver API to accomodate some new features. But the ZONE_DEVICE > changes and the overall migration code will stay the same more or less (i have > patches that move it to migrate.c and share more code with existing migrate > code). > > If you think i missed anything about lru and page cache please point it to > me. Because when i audited code for that i didn't see any road block with > the few fs i was looking at (ext4, xfs and core page cache code). > >> [...] > > Cheers, > Jérôme > Cheers, Balbir Singh. ^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [RFC 0/8] Define coherent device memory node 2016-10-25 12:07 ` Balbir Singh @ 2016-10-25 15:21 ` Jerome Glisse -1 siblings, 0 replies; 135+ messages in thread From: Jerome Glisse @ 2016-10-25 15:21 UTC (permalink / raw) To: Balbir Singh Cc: Anshuman Khandual, linux-kernel, linux-mm, mhocko, js1304, vbabka, mgorman, minchan, akpm, aneesh.kumar On Tue, Oct 25, 2016 at 11:07:39PM +1100, Balbir Singh wrote: > On 25/10/16 04:09, Jerome Glisse wrote: > > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote: > > > >> [...] > > > >> Core kernel memory features like reclamation, evictions etc. might > >> need to be restricted or modified on the coherent device memory node as > >> they can be performance limiting. The RFC does not propose anything on this > >> yet but it can be looked into later on. For now it just disables Auto NUMA > >> for any VMA which has coherent device memory. > >> > >> Seamless integration of coherent device memory with system memory > >> will enable various other features, some of which can be listed as follows. > >> > >> a. Seamless migrations between system RAM and the coherent memory > >> b. Will have asynchronous and high throughput migrations > >> c. Be able to allocate huge order pages from these memory regions > >> d. Restrict allocations to a large extent to the tasks using the > >> device for workload acceleration > >> > >> Before concluding, will look into the reasons why the existing > >> solutions don't work. There are two basic requirements which have to be > >> satisfies before the coherent device memory can be integrated with core > >> kernel seamlessly. > >> > >> a. PFN must have struct page > >> b. Struct page must able to be inside standard LRU lists > >> > >> The above two basic requirements discard the existing method of > >> device memory representation approaches like these which then requires the > >> need of creating a new framework. 
> > > > I do not believe the LRU list is a hard requirement, yes when faulting in > > a page inside the page cache it assumes it needs to be added to lru list. > > But i think this can easily be work around. > > > > In HMM i am using ZONE_DEVICE and because memory is not accessible from CPU > > (not everyone is bless with decent system bus like CAPI, CCIX, Gen-Z, ...) > > so in my case a file back page must always be spawn first from a regular > > page and once read from disk then i can migrate to GPU page. > > > > I've not seen the HMM patchset, but read from disk will go to ZONE_DEVICE? > Then get migrated? Because in my case device memory is not accessible by anything except the device (not entirely true but for sake of design it is) any page read from disk will be first read into regular page (from regular system memory). It is only once it is uptodate and in page cache that it can be migrated to a ZONE_DEVICE page. So read from disk use an intermediary page. Write back is kind of the same i plan on using a bounce page by leveraging existing bio bounce infrastructure. Cheers, Jérôme ^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [RFC 0/8] Define coherent device memory node 2016-10-24 4:31 ` Anshuman Khandual @ 2016-10-24 18:04 ` Dave Hansen -1 siblings, 0 replies; 135+ messages in thread From: Dave Hansen @ 2016-10-24 18:04 UTC (permalink / raw) To: Anshuman Khandual, linux-kernel, linux-mm Cc: mhocko, js1304, vbabka, mgorman, minchan, akpm, aneesh.kumar, bsingharora On 10/23/2016 09:31 PM, Anshuman Khandual wrote: > To achieve seamless integration between system RAM and coherent > device memory it must be able to utilize core memory kernel features like > anon mapping, file mapping, page cache, driver managed pages, HW poisoning, > migrations, reclaim, compaction, etc. So, you need to support all these things, but not autonuma or hugetlbfs? What's the reasoning behind that? If you *really* don't want a "cdm" page to be migrated, then why isn't that policy set on the VMA in the first place? That would keep "cdm" pages from being made non-cdm. And, why would autonuma ever make a non-cdm page and migrate it in to cdm? There will be no NUMA access faults caused by the devices that are fed to autonuma. I'm confused. ^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [RFC 0/8] Define coherent device memory node 2016-10-24 18:04 ` Dave Hansen @ 2016-10-24 18:32 ` David Nellans -1 siblings, 0 replies; 135+ messages in thread From: David Nellans @ 2016-10-24 18:32 UTC (permalink / raw) To: Dave Hansen, Anshuman Khandual, linux-kernel, linux-mm Cc: mhocko, js1304, vbabka, mgorman, minchan, akpm, aneesh.kumar, bsingharora On 10/24/2016 01:04 PM, Dave Hansen wrote: > On 10/23/2016 09:31 PM, Anshuman Khandual wrote: >> To achieve seamless integration between system RAM and coherent >> device memory it must be able to utilize core memory kernel features like >> anon mapping, file mapping, page cache, driver managed pages, HW poisoning, >> migrations, reclaim, compaction, etc. > So, you need to support all these things, but not autonuma or hugetlbfs? > What's the reasoning behind that? > > If you *really* don't want a "cdm" page to be migrated, then why isn't > that policy set on the VMA in the first place? That would keep "cdm" > pages from being made non-cdm. And, why would autonuma ever make a > non-cdm page and migrate it in to cdm? There will be no NUMA access > faults caused by the devices that are fed to autonuma. > Pages are desired to be migrateable, both into (starting cpu zone movable->cdm) and out of (starting cdm->cpu zone movable) but only through explicit migration, not via autonuma. other pages in the same VMA should still be migrateable between CPU nodes via autonuma however. Its expected a lot of these allocations are going to end up in THPs. I'm not sure we need to explicitly disallow hugetlbfs support but the identified use case is definitely via THPs not tlbfs. ^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [RFC 0/8] Define coherent device memory node
  2016-10-24 18:32 ` David Nellans
@ 2016-10-24 19:36 ` Dave Hansen
  0 siblings, 0 replies; 135+ messages in thread
From: Dave Hansen @ 2016-10-24 19:36 UTC (permalink / raw)
  To: David Nellans, Anshuman Khandual, linux-kernel, linux-mm
  Cc: mhocko, js1304, vbabka, mgorman, minchan, akpm, aneesh.kumar, bsingharora

On 10/24/2016 11:32 AM, David Nellans wrote:
> On 10/24/2016 01:04 PM, Dave Hansen wrote:
>> If you *really* don't want a "cdm" page to be migrated, then why isn't
>> that policy set on the VMA in the first place?  That would keep "cdm"
>> pages from being made non-cdm.  And, why would autonuma ever make a
>> non-cdm page and migrate it in to cdm?  There will be no NUMA access
>> faults caused by the devices that are fed to autonuma.
>>
> Pages are desired to be migrateable, both into (starting cpu zone
> movable->cdm) and out of (starting cdm->cpu zone movable) but only
> through explicit migration, not via autonuma.

OK, and is there a reason that the existing mbind code plus NUMA
policies fail to give you this behavior?  Does autonuma somehow
override strict NUMA binding?

> other pages in the same
> VMA should still be migrateable between CPU nodes via autonuma however.

That's not the way the implementation here works, as I understand it.
See the VM_CDM patch and my responses to it.

> Its expected a lot of these allocations are going to end up in THPs.
> I'm not sure we need to explicitly disallow hugetlbfs support but the
> identified use case is definitely via THPs not tlbfs.

I think THP and hugetlbfs are implementations, not use cases. :)  Is it
too hard to support hugetlbfs that we should complicate its code to
exclude it from this type of memory?  Why?

^ permalink raw reply	[flat|nested] 135+ messages in thread
end of thread, other threads: [~2016-11-17  8:29 UTC | newest]

Thread overview: 135+ messages
2016-10-24  4:31 [RFC 0/8] Define coherent device memory node Anshuman Khandual
2016-10-24  4:31 ` [RFC 1/8] mm: " Anshuman Khandual
2016-10-24 17:09   ` Dave Hansen
2016-10-25  1:22     ` Anshuman Khandual
2016-10-25 15:47       ` Dave Hansen
2016-10-24  4:31 ` [RFC 2/8] mm: Add specialized fallback zonelist for coherent device memory nodes Anshuman Khandual
2016-10-24 17:10   ` Dave Hansen
2016-10-25  1:27     ` Anshuman Khandual
2016-11-17  7:40       ` Anshuman Khandual
2016-11-17  7:59         ` [DRAFT 1/2] mm/cpuset: Exclude CDM nodes from each task's mems_allowed node mask Anshuman Khandual
2016-11-17  7:59           ` [DRAFT 2/2] mm/hugetlb: Restrict HugeTLB allocations only to the system RAM nodes Anshuman Khandual
2016-11-17  8:28           ` [DRAFT 1/2] mm/cpuset: Exclude CDM nodes from each task's mems_allowed node mask kbuild test robot
2016-10-24  4:31 ` [RFC 3/8] mm: Isolate coherent device memory nodes from HugeTLB allocation paths Anshuman Khandual
2016-10-24 17:16   ` Dave Hansen
2016-10-25  4:15     ` Aneesh Kumar K.V
2016-10-25  7:17       ` Balbir Singh
2016-10-25  7:25         ` Balbir Singh
2016-10-24  4:31 ` [RFC 4/8] mm: Accommodate coherent device memory nodes in MPOL_BIND implementation Anshuman Khandual
2016-10-24  4:31 ` [RFC 5/8] mm: Add new flag VM_CDM for coherent device memory Anshuman Khandual
2016-10-24 17:38   ` Dave Hansen
2016-10-24 18:00   ` Dave Hansen
2016-10-25 12:36     ` Balbir Singh
2016-10-25 19:20       ` Aneesh Kumar K.V
2016-10-25 20:01         ` Dave Hansen
2016-10-24  4:31 ` [RFC 6/8] mm: Make VM_CDM marked VMAs non migratable Anshuman Khandual
2016-10-24  4:31 ` [RFC 7/8] mm: Add a new migration function migrate_virtual_range() Anshuman Khandual
2016-10-24  4:31 ` [RFC 8/8] mm: Add N_COHERENT_DEVICE node type into node_states[] Anshuman Khandual
2016-10-25  7:22   ` Balbir Singh
2016-10-26  4:52     ` Anshuman Khandual
2016-10-24  4:42 ` [DEBUG 00/10] Test and debug patches for coherent device memory Anshuman Khandual
2016-10-24  4:42   ` [DEBUG 01/10] dt-bindings: Add doc for ibm,hotplug-aperture Anshuman Khandual
2016-10-24  4:42   ` [DEBUG 02/10] powerpc/mm: Create numa nodes for hotplug memory Anshuman Khandual
2016-10-24  4:42   ` [DEBUG 03/10] powerpc/mm: Allow memory hotplug into a memory less node Anshuman Khandual
2016-10-24  4:42   ` [DEBUG 04/10] mm: Enable CONFIG_MOVABLE_NODE on powerpc Anshuman Khandual
2016-10-24  4:42   ` [DEBUG 05/10] powerpc/mm: Identify isolation seeking coherent memory nodes during boot Anshuman Khandual
2016-10-24  4:42   ` [DEBUG 06/10] mm: Export definition of 'zone_names' array through mmzone.h Anshuman Khandual
2016-10-24  4:42   ` [DEBUG 07/10] mm: Add debugfs interface to dump each node's zonelist information Anshuman Khandual
2016-10-24  4:42   ` [DEBUG 08/10] powerpc: Enable CONFIG_MOVABLE_NODE for PPC64 platform Anshuman Khandual
2016-10-24  4:42   ` [DEBUG 09/10] drivers: Add two drivers for coherent device memory tests Anshuman Khandual
2016-10-24  4:42   ` [DEBUG 10/10] test: Add a script to perform random VMA migrations across nodes Anshuman Khandual
2016-10-24 17:09 ` [RFC 0/8] Define coherent device memory node Jerome Glisse
2016-10-25  4:26   ` Aneesh Kumar K.V
2016-10-25 15:16     ` Jerome Glisse
2016-10-26 11:09       ` Aneesh Kumar K.V
2016-10-26 16:07         ` Jerome Glisse
2016-10-28  5:29           ` Aneesh Kumar K.V
2016-10-28 16:16             ` Jerome Glisse
2016-11-05  5:21               ` Anshuman Khandual
2016-11-05 18:02                 ` Jerome Glisse
2016-10-25  4:59   ` Aneesh Kumar K.V
2016-10-25 15:32     ` Jerome Glisse
2016-10-25 17:31       ` Aneesh Kumar K.V
2016-10-25 18:52         ` Jerome Glisse
2016-10-26 11:13           ` Anshuman Khandual
2016-10-26 16:02             ` Jerome Glisse
2016-10-27  4:38               ` Anshuman Khandual
2016-10-27  7:03                 ` Anshuman Khandual
2016-10-27 15:05                   ` Jerome Glisse
2016-10-28  5:47                     ` Anshuman Khandual
2016-10-28 16:08                       ` Jerome Glisse
2016-10-26 12:56           ` Anshuman Khandual
2016-10-26 16:28             ` Jerome Glisse
2016-10-27 10:23               ` Balbir Singh
2016-10-25 12:07   ` Balbir Singh
2016-10-25 15:21     ` Jerome Glisse
2016-10-24 18:04 ` Dave Hansen
2016-10-24 18:32   ` David Nellans
2016-10-24 19:36     ` Dave Hansen