From: Dan Williams <dan.j.williams@intel.com>
To: linux-nvdimm@lists.01.org
Cc: Dave Hansen <dave.hansen@linux.intel.com>,
	Andy Lutomirski <luto@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	"H. Peter Anvin" <hpa@zytor.com>,
	x86@kernel.org, Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@redhat.com>,
	Michal Hocko <mhocko@suse.com>,
	kbuild test robot <lkp@intel.com>,
	hch@lst.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	linux-acpi@vger.kernel.org
Subject: [PATCH v2 17/18] x86/numa: Provide a range-to-target_node lookup facility
Date: Sun, 17 Nov 2019 09:46:07 -0800
Message-ID: <157401276776.43284.12396353118982684546.stgit@dwillia2-desk3.amr.corp.intel.com>
In-Reply-To: <157401267421.43284.2135775608523385279.stgit@dwillia2-desk3.amr.corp.intel.com>

The DEV_DAX_KMEM facility is a generic mechanism to allow device-dax
instances, fronting performance-differentiated-memory like pmem, to be
added to the System RAM pool. The numa node for that hot-added memory is
derived from the device-dax instance's 'target_node' attribute.

Recall that the 'target_node' is the ACPI-PXM-to-node translation for
memory when it comes online whereas the 'numa_node' attribute of the
device represents the closest online cpu node.

Presently useful target_node information from the ACPI SRAT is discarded
with the expectation that "Reserved" memory will never be onlined. Now
that DEV_DAX_KMEM violates that assumption, there is a need to retain
the translation. Move, rather than discard, numa_memblk data to a
secondary array that phys_to_target_node() may consider at a later point
in time.

Note that memory_add_physaddr_to_nid() is currently only available on
CONFIG_MEMORY_HOTPLUG=y platforms whereas the target node information
may be useful on CONFIG_MEMORY_HOTPLUG=n builds. Hence the lookup is
named phys_to_target_node() and optionally defined by asm/io.h, rather
than being a memory_add_physaddr_to_target_nid() helper that lives in
include/linux/memory_hotplug.h.

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <x86@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/mm/numa.c   |   76 +++++++++++++++++++++++++++++++++++++++++++++++---
 include/linux/numa.h |    8 +++++
 mm/mempolicy.c       |    5 +++
 3 files changed, 83 insertions(+), 6 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 4123100e0eaf..f4f02ac0c465 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -31,6 +31,24 @@ __initdata
 #endif
 ;
 
+/*
+ * Presently, DEV_DAX_KMEM is the only kernel facility that might
+ * convert Reserved or Soft Reserved memory to System RAM.
+ */
+#if IS_ENABLED(CONFIG_DEV_DAX_KMEM)
+static struct numa_meminfo __numa_reserved_meminfo;
+
+static struct numa_meminfo *numa_reserved_meminfo(void)
+{
+	return &__numa_reserved_meminfo;
+}
+#else
+static struct numa_meminfo *numa_reserved_meminfo(void)
+{
+	return NULL;
+}
+#endif
+
 static int numa_distance_cnt;
 static u8 *numa_distance;
 
@@ -168,6 +186,26 @@ void __init numa_remove_memblk_from(int idx, struct numa_meminfo *mi)
 		(mi->nr_blks - idx) * sizeof(mi->blk[0]));
 }
 
+/**
+ * numa_move_memblk - Move one numa_memblk from one numa_meminfo to another
+ * @dst: numa_meminfo to move block to
+ * @idx: Index of memblk to remove
+ * @src: numa_meminfo to remove memblk from
+ *
+ * If @dst is non-NULL add it at the @dst->nr_blks index and increment
+ * @dst->nr_blks, then remove it from @src.
+ */
+static void __init numa_move_memblk(struct numa_meminfo *dst, int idx,
+		struct numa_meminfo *src)
+{
+	if (dst) {
+		memcpy(&dst->blk[dst->nr_blks], &src->blk[idx],
+				sizeof(struct numa_memblk));
+		dst->nr_blks++;
+	}
+	numa_remove_memblk_from(idx, src);
+}
+
 /**
  * numa_add_memblk - Add one numa_memblk to numa_meminfo
  * @nid: NUMA node ID of the new memblk
@@ -245,7 +283,7 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
 		if (bi->start >= bi->end ||
 		    !memblock_overlaps_region(&memblock.memory,
 			bi->start, bi->end - bi->start))
-			numa_remove_memblk_from(i--, mi);
+			numa_move_memblk(numa_reserved_meminfo(), i--, mi);
 	}
 
 	/* merge neighboring / overlapping entries */
@@ -881,16 +919,44 @@ EXPORT_SYMBOL(cpumask_of_node);
 
 #endif	/* !CONFIG_DEBUG_PER_CPU_MAPS */
 
+static int meminfo_to_nid(struct numa_meminfo *mi, u64 start, int *nid)
+{
+	int i;
+
+	for (i = 0; mi && i < mi->nr_blks; i++)
+		if (mi->blk[i].start <= start && mi->blk[i].end > start) {
+			*nid = mi->blk[i].nid;
+			break;
+		}
+	return i;
+}
+
+int phys_to_target_node(phys_addr_t start)
+{
+	struct numa_meminfo *mi = &numa_meminfo;
+	int nid = mi->blk[0].nid;
+	int i = meminfo_to_nid(mi, start, &nid);
+
+	/*
+	 * Prefer online nodes, but if reserved memory might be
+	 * hot-added continue the search with reserved ranges.
+	 */
+	if (i < mi->nr_blks)
+		return nid;
+
+	mi = numa_reserved_meminfo();
+	meminfo_to_nid(mi, start, &nid);
+	return nid;
+}
+EXPORT_SYMBOL_GPL(phys_to_target_node);
+
 #ifdef CONFIG_MEMORY_HOTPLUG
 int memory_add_physaddr_to_nid(u64 start)
 {
 	struct numa_meminfo *mi = &numa_meminfo;
 	int nid = mi->blk[0].nid;
-	int i;
 
-	for (i = 0; i < mi->nr_blks; i++)
-		if (mi->blk[i].start <= start && mi->blk[i].end > start)
-			nid = mi->blk[i].nid;
+	meminfo_to_nid(mi, start, &nid);
 	return nid;
 }
 EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
diff --git a/include/linux/numa.h b/include/linux/numa.h
index 20f4e44b186c..941790a0765b 100644
--- a/include/linux/numa.h
+++ b/include/linux/numa.h
@@ -1,7 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef _LINUX_NUMA_H
 #define _LINUX_NUMA_H
-
+#include <linux/types.h>
 
 #ifdef CONFIG_NODES_SHIFT
 #define NODES_SHIFT     CONFIG_NODES_SHIFT
@@ -15,11 +15,17 @@
 
 #ifdef CONFIG_NUMA
 int numa_map_to_online_node(int node);
+int phys_to_target_node(phys_addr_t addr);
 #else
 static inline int numa_map_to_online_node(int node)
 {
 	return NUMA_NO_NODE;
 }
+
+static inline int phys_to_target_node(phys_addr_t addr)
+{
+	return NUMA_NO_NODE;
+}
 #endif
 
 #endif /* _LINUX_NUMA_H */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d618121bcc17..0db8b446e23e 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2996,3 +2996,8 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
 		p += scnprintf(p, buffer + maxlen - p, ":%*pbl",
 			       nodemask_pr_args(&nodes));
 }
+
+__weak int phys_to_target_node(phys_addr_t addr)
+{
+	return NUMA_NO_NODE;
+}
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org
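
For readers who want to poke at the lookup order outside the kernel, the
two-tier search described in the changelog can be sketched in user space
roughly as below. This is only an illustration under stated assumptions,
not the kernel implementation: struct meminfo, struct memblk,
move_memblk(), lookup_target_node(), MAX_BLKS and NO_NODE are hypothetical
stand-ins for numa_meminfo, numa_memblk, numa_move_memblk(),
phys_to_target_node(), the kernel's block limit and NUMA_NO_NODE. Unlike
the patch, the sketch falls back to NO_NODE rather than to the first
block's node when nothing matches.

```c
/* User-space sketch of the online-then-reserved lookup; not kernel code. */
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_BLKS 8	/* hypothetical capacity, chosen for the sketch */
#define NO_NODE  (-1)	/* stands in for NUMA_NO_NODE */

struct memblk {
	uint64_t start;	/* inclusive */
	uint64_t end;	/* exclusive */
	int nid;
};

struct meminfo {
	int nr_blks;
	struct memblk blk[MAX_BLKS];
};

/*
 * Store the node of the first block containing @addr in @nid and return
 * that block's index. Returns mi->nr_blks on a miss, or 0 if @mi is NULL.
 */
static int meminfo_to_nid(const struct meminfo *mi, uint64_t addr, int *nid)
{
	int i;

	for (i = 0; mi && i < mi->nr_blks; i++)
		if (mi->blk[i].start <= addr && addr < mi->blk[i].end) {
			*nid = mi->blk[i].nid;
			break;
		}
	return i;
}

/* Copy blk[idx] to the tail of @dst (if non-NULL), then drop it from @src. */
static void move_memblk(struct meminfo *dst, int idx, struct meminfo *src)
{
	if (dst) {
		assert(dst->nr_blks < MAX_BLKS);
		dst->blk[dst->nr_blks++] = src->blk[idx];
	}
	src->nr_blks--;
	memmove(&src->blk[idx], &src->blk[idx + 1],
		(src->nr_blks - idx) * sizeof(src->blk[0]));
}

/* Prefer the online table; consult the reserved table only on a miss. */
static int lookup_target_node(const struct meminfo *online,
			      const struct meminfo *reserved, uint64_t addr)
{
	int nid = NO_NODE;

	if (meminfo_to_nid(online, addr, &nid) < online->nr_blks)
		return nid;
	meminfo_to_nid(reserved, addr, &nid);
	return nid;
}
```

Passing NULL for the reserved table mimics the CONFIG_DEV_DAX_KMEM=n case,
where numa_reserved_meminfo() returns NULL and both the move and the
second search degrade to no-ops.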