* [PATCH 0/3] Skip numa distance for offline nodes @ 2021-05-20 15:44 Srikar Dronamraju 2021-05-20 15:44 ` [PATCH 1/3] sched/topology: Allow archs to populate distance map Srikar Dronamraju ` (2 more replies) 0 siblings, 3 replies; 17+ messages in thread From: Srikar Dronamraju @ 2021-05-20 15:44 UTC (permalink / raw) To: Ingo Molnar, Peter Zijlstra Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner, Valentin Schneider, Vincent Guittot, Dietmar Eggemann, linuxppc-dev, Nathan Lynch, Michael Ellerman, Scott Cheloha, Gautham R Shenoy, Geetika Moolchandani Geetika reported yet another trace while doing a dlpar CPU add operation. This was true even on top of a recent commit 6980d13f0dd1 ("powerpc/smp: Set numa node before updating mask") which fixed a similar trace. WARNING: CPU: 40 PID: 2954 at kernel/sched/topology.c:2088 build_sched_domains+0x6e8/0x1540 Modules linked in: nft_counter nft_compat rpadlpar_io rpaphp mptcp_diag xsk_diag tcp_diag udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag bonding tls nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set rfkill nf_tables nfnetlink dm_multipath pseries_rng xts vmx_crypto binfmt_misc ip_tables xfs libcrc32c sd_mod t10_pi sg ibmvscsi ibmveth scsi_transport_srp dm_mirror dm_region_hash dm_log dm_mod fuse CPU: 40 PID: 2954 Comm: kworker/40:0 Not tainted 5.13.0-rc1+ #19 Workqueue: events cpuset_hotplug_workfn NIP: c0000000001de588 LR: c0000000001de584 CTR: 00000000006cd36c REGS: c00000002772b250 TRAP: 0700 Not tainted (5.12.0-rc5-master+) MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE> CR: 28828422 XER: 0000000d CFAR: c00000000020c2f8 IRQMASK: 0 #012GPR00: c0000000001de584 c00000002772b4f0 c000000001f55400 0000000000000036 #012GPR04: c0000063c6368010 c0000063c63f0a00 0000000000000027 c0000063c6368018 #012GPR08: 0000000000000023 c0000063c636ef48 00000063c4de0000 c0000063bfe9ffe8 #012GPR12: 0000000028828424 c0000063fe68fe80 0000000000000000 0000000000000417 #012GPR16: 0000000000000028 c00000000740dcd8 c00000000205db68 c000000001a3a4a0 #012GPR20: c000000091ed7d20 c000000091ed8520 0000000000000001 0000000000000000 #012GPR24: c0000000113a9600 0000000000000190 0000000000000028 c0000000010e3ac0 #012GPR28: 0000000000000000 c00000000740dd00 c0000000317b5900 0000000000000190 NIP [c0000000001de588] build_sched_domains+0x6e8/0x1540 LR [c0000000001de584] build_sched_domains+0x6e4/0x1540 Call Trace: [c00000002772b4f0] [c0000000001de584] build_sched_domains+0x6e4/0x1540 (unreliable) [c00000002772b640] [c0000000001e08dc] partition_sched_domains_locked+0x3ec/0x530 [c00000002772b6e0] [c0000000002a2144] rebuild_sched_domains_locked+0x524/0xbf0 [c00000002772b7e0] [c0000000002a5620] rebuild_sched_domains+0x40/0x70 [c00000002772b810] [c0000000002a58e4] cpuset_hotplug_workfn+0x294/0xe20 [c00000002772bc30] [c000000000187510] process_one_work+0x300/0x670 [c00000002772bd10] [c0000000001878f8] worker_thread+0x78/0x520 [c00000002772bda0] [c0000000001937f0] kthread+0x1a0/0x1b0 [c00000002772be10] [c00000000000d6ec] ret_from_kernel_thread+0x5c/0x70 Instruction dump: 7ee5bb78 7f0ac378 7f29cb78 7f68db78 7f46d378 7f84e378 f8610068 3c62ff19 fbe10060 3863e558 4802dd31 60000000 <0fe00000> 3920fff4 f9210080 e86100b0 Detailed analysis of the failing scenario showed that the span in question belongs to NODE domain and further the cpumasks for some cpus in NODE overlapped. There are two possible reasons how we ended up here: (1) The numa node was offline or blank with no CPUs or memory. Hence the sched_max_numa_distance could not be set correctly, or the sched_domains_numa_distance happened to be partially populated. (2) Depending on a bogus node_distance of an offline node to populate cpumasks is the issue. On POWER platform the node_distance is correctly available only for an online node which has some CPU or memory resource associated with it. For example distance info from numactl from a fully populated 8 node system at boot may look like this. node distances: node 0 1 2 3 4 5 6 7 0: 10 20 40 40 40 40 40 40 1: 20 10 40 40 40 40 40 40 2: 40 40 10 20 40 40 40 40 3: 40 40 20 10 40 40 40 40 4: 40 40 40 40 10 20 40 40 5: 40 40 40 40 20 10 40 40 6: 40 40 40 40 40 40 10 20 7: 40 40 40 40 40 40 20 10 However the same system when only two nodes are online at boot, then the numa topology will look like node distances: node 0 1 0: 10 20 1: 20 10 This series tries to fix both these problems. Note: These problems are now visible, thanks to Commit ccf74128d66c ("sched/topology: Assert non-NUMA topology masks don't (partially) overlap") Cc: LKML <linux-kernel@vger.kernel.org> Cc: linuxppc-dev@lists.ozlabs.org Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: Scott Cheloha <cheloha@linux.ibm.com> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Rik van Riel <riel@surriel.com> Cc: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com> Srikar Dronamraju (3): sched/topology: Allow archs to populate distance map powerpc/numa: Populate distance map correctly sched/topology: Skip updating masks for non-online nodes arch/powerpc/include/asm/topology.h | 3 +++ arch/powerpc/mm/numa.c | 19 +++++++++++++++ kernel/sched/topology.c | 38 +++++++++++++++++++++-------- 3 files changed, 50 insertions(+), 10 deletions(-) base-commit: 1699949d3314e5d1956fb082e4cd4798bf6149fc -- 2.27.0 ^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 1/3] sched/topology: Allow archs to populate distance map 2021-05-20 15:44 [PATCH 0/3] Skip numa distance for offline nodes Srikar Dronamraju @ 2021-05-20 15:44 ` Srikar Dronamraju 2021-05-20 18:56 ` Peter Zijlstra 2021-05-20 15:44 ` [PATCH 2/3] powerpc/numa: Populate distance map correctly Srikar Dronamraju 2021-05-20 15:44 ` [PATCH 3/3] sched/topology: Skip updating masks for non-online nodes Srikar Dronamraju 2 siblings, 1 reply; 17+ messages in thread From: Srikar Dronamraju @ 2021-05-20 15:44 UTC (permalink / raw) To: Ingo Molnar, Peter Zijlstra Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner, Valentin Schneider, Vincent Guittot, Dietmar Eggemann, linuxppc-dev, Nathan Lynch, Michael Ellerman, Scott Cheloha, Gautham R Shenoy, Geetika Moolchandani Currently scheduler populates the distance map by looking at distance of each node from all other nodes. This should work for most architectures and platforms. However there are some architectures like POWER that may not expose the distance of nodes that are not yet onlined because those resources are not yet allocated to the OS instance. Such architectures have other means to provide valid distance data for the current platform. For example distance info from numactl from a fully populated 8 node system at boot may look like this. node distances: node 0 1 2 3 4 5 6 7 0: 10 20 40 40 40 40 40 40 1: 20 10 40 40 40 40 40 40 2: 40 40 10 20 40 40 40 40 3: 40 40 20 10 40 40 40 40 4: 40 40 40 40 10 20 40 40 5: 40 40 40 40 20 10 40 40 6: 40 40 40 40 40 40 10 20 7: 40 40 40 40 40 40 20 10 However the same system when only two nodes are online at boot, then the numa topology will look like node distances: node 0 1 0: 10 20 1: 20 10 It may be implementation dependent on what node_distance(0,3) where node 0 is online and node 3 is offline. In POWER case, it returns LOCAL_DISTANCE(10). Here at boot the scheduler would assume that the max distance between nodes is 20. However that would not be true. When Nodes are onlined and CPUs from those nodes are hotplugged, the max node distance would be 40. To handle such scenarios, let scheduler allow architectures to populate the distance map. Architectures that like to populate the distance map can overload arch_populate_distance_map(). Cc: LKML <linux-kernel@vger.kernel.org> Cc: linuxppc-dev@lists.ozlabs.org Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: Scott Cheloha <cheloha@linux.ibm.com> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Rik van Riel <riel@surriel.com> Cc: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com> Reported-by: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> --- kernel/sched/topology.c | 32 ++++++++++++++++++++++---------- 1 file changed, 22 insertions(+), 10 deletions(-) diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index 053115b55f89..ccb9aff59add 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -1630,6 +1630,26 @@ static void init_numa_topology_type(void) #define NR_DISTANCE_VALUES (1 << DISTANCE_BITS) +#ifndef arch_populate_distance_map +static int arch_populate_distance_map(unsigned long *distance_map) +{ + int i, j; + + for (i = 0; i < nr_node_ids; i++) { + for (j = 0; j < nr_node_ids; j++) { + int distance = node_distance(i, j); + + if (distance < LOCAL_DISTANCE || distance >= NR_DISTANCE_VALUES) { + sched_numa_warn("Invalid distance value range"); + return -1; + } + bitmap_set(distance_map, distance, 1); + } + } + return 0; +} +#endif + void sched_init_numa(void) { struct sched_domain_topology_level *tl; @@ -1646,18 +1666,10 @@ void sched_init_numa(void) return; bitmap_zero(distance_map, NR_DISTANCE_VALUES); - for (i = 0; i < nr_node_ids; i++) { - for (j = 0; j < nr_node_ids; j++) { - int distance = node_distance(i, j); - if (distance < LOCAL_DISTANCE || distance >= NR_DISTANCE_VALUES) { - sched_numa_warn("Invalid distance value range"); - return; - } + if (arch_populate_distance_map(distance_map)) + return; - bitmap_set(distance_map, distance, 1); - } - } /* * We can now figure out how many unique distance values there are and * allocate memory accordingly. -- 2.27.0 ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH 1/3] sched/topology: Allow archs to populate distance map 2021-05-20 15:44 ` [PATCH 1/3] sched/topology: Allow archs to populate distance map Srikar Dronamraju @ 2021-05-20 18:56 ` Peter Zijlstra 2021-05-21 2:38 ` Srikar Dronamraju 0 siblings, 1 reply; 17+ messages in thread From: Peter Zijlstra @ 2021-05-20 18:56 UTC (permalink / raw) To: Srikar Dronamraju Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner, Valentin Schneider, Vincent Guittot, Dietmar Eggemann, linuxppc-dev, Nathan Lynch, Michael Ellerman, Scott Cheloha, Gautham R Shenoy, Geetika Moolchandani On Thu, May 20, 2021 at 09:14:25PM +0530, Srikar Dronamraju wrote: > Currently scheduler populates the distance map by looking at distance > of each node from all other nodes. This should work for most > architectures and platforms. > > However there are some architectures like POWER that may not expose > the distance of nodes that are not yet onlined because those resources > are not yet allocated to the OS instance. Such architectures have > other means to provide valid distance data for the current platform. > > For example distance info from numactl from a fully populated 8 node > system at boot may look like this. > > node distances: > node 0 1 2 3 4 5 6 7 > 0: 10 20 40 40 40 40 40 40 > 1: 20 10 40 40 40 40 40 40 > 2: 40 40 10 20 40 40 40 40 > 3: 40 40 20 10 40 40 40 40 > 4: 40 40 40 40 10 20 40 40 > 5: 40 40 40 40 20 10 40 40 > 6: 40 40 40 40 40 40 10 20 > 7: 40 40 40 40 40 40 20 10 > > However the same system when only two nodes are online at boot, then the > numa topology will look like > node distances: > node 0 1 > 0: 10 20 > 1: 20 10 > > It may be implementation dependent on what node_distance(0,3) where > node 0 is online and node 3 is offline. In POWER case, it returns > LOCAL_DISTANCE(10). Here at boot the scheduler would assume that the max > distance between nodes is 20. However that would not be true. > > When Nodes are onlined and CPUs from those nodes are hotplugged, > the max node distance would be 40. > > To handle such scenarios, let scheduler allow architectures to populate > the distance map. Architectures that like to populate the distance map > can overload arch_populate_distance_map(). Why? Why can't your node_distance() DTRT? The arch interface is nr_node_ids and node_distance(), I don't see why we need something new and then replace one special use of it. By virtue of you being able to actually implement this new hook, you supposedly can actually do node_distance() right too. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/3] sched/topology: Allow archs to populate distance map 2021-05-20 18:56 ` Peter Zijlstra @ 2021-05-21 2:38 ` Srikar Dronamraju 2021-05-21 8:14 ` Peter Zijlstra 0 siblings, 1 reply; 17+ messages in thread From: Srikar Dronamraju @ 2021-05-21 2:38 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner, Valentin Schneider, Vincent Guittot, Dietmar Eggemann, linuxppc-dev, Nathan Lynch, Michael Ellerman, Scott Cheloha, Gautham R Shenoy, Geetika Moolchandani * Peter Zijlstra <peterz@infradead.org> [2021-05-20 20:56:31]: > On Thu, May 20, 2021 at 09:14:25PM +0530, Srikar Dronamraju wrote: > > Currently scheduler populates the distance map by looking at distance > > of each node from all other nodes. This should work for most > > architectures and platforms. > > > > However there are some architectures like POWER that may not expose > > the distance of nodes that are not yet onlined because those resources > > are not yet allocated to the OS instance. Such architectures have > > other means to provide valid distance data for the current platform. > > > > For example distance info from numactl from a fully populated 8 node > > system at boot may look like this. > > > > node distances: > > node 0 1 2 3 4 5 6 7 > > 0: 10 20 40 40 40 40 40 40 > > 1: 20 10 40 40 40 40 40 40 > > 2: 40 40 10 20 40 40 40 40 > > 3: 40 40 20 10 40 40 40 40 > > 4: 40 40 40 40 10 20 40 40 > > 5: 40 40 40 40 20 10 40 40 > > 6: 40 40 40 40 40 40 10 20 > > 7: 40 40 40 40 40 40 20 10 > > > > However the same system when only two nodes are online at boot, then the > > numa topology will look like > > node distances: > > node 0 1 > > 0: 10 20 > > 1: 20 10 > > > > It may be implementation dependent on what node_distance(0,3) where > > node 0 is online and node 3 is offline. In POWER case, it returns > > LOCAL_DISTANCE(10). Here at boot the scheduler would assume that the max > > distance between nodes is 20. However that would not be true. > > > > When Nodes are onlined and CPUs from those nodes are hotplugged, > > the max node distance would be 40. > > > > To handle such scenarios, let scheduler allow architectures to populate > > the distance map. Architectures that like to populate the distance map > > can overload arch_populate_distance_map(). > > Why? Why can't your node_distance() DTRT? The arch interface is > nr_node_ids and node_distance(), I don't see why we need something new > and then replace one special use of it. > > By virtue of you being able to actually implement this new hook, you > supposedly can actually do node_distance() right too. Since for an offline node, arch interface code doesn't have the info. As far as I know/understand, in POWER, unless there is an active memory or CPU that's getting onlined, arch can't fetch the correct node distance. Taking the above example: node 3 is offline, then node_distance of (3,X) where X is anything other than 3, is not reliable. The moment node 3 is onlined, the node distance is reliable. This problem will not happen even on POWER if all the nodes have either memory or CPUs active at the time of boot. -- Thanks and Regards Srikar Dronamraju ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/3] sched/topology: Allow archs to populate distance map 2021-05-21 2:38 ` Srikar Dronamraju @ 2021-05-21 8:14 ` Peter Zijlstra 2021-05-21 9:28 ` Srikar Dronamraju 0 siblings, 1 reply; 17+ messages in thread From: Peter Zijlstra @ 2021-05-21 8:14 UTC (permalink / raw) To: Srikar Dronamraju Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner, Valentin Schneider, Vincent Guittot, Dietmar Eggemann, linuxppc-dev, Nathan Lynch, Michael Ellerman, Scott Cheloha, Gautham R Shenoy, Geetika Moolchandani On Fri, May 21, 2021 at 08:08:02AM +0530, Srikar Dronamraju wrote: > * Peter Zijlstra <peterz@infradead.org> [2021-05-20 20:56:31]: > > > On Thu, May 20, 2021 at 09:14:25PM +0530, Srikar Dronamraju wrote: > > > Currently scheduler populates the distance map by looking at distance > > > of each node from all other nodes. This should work for most > > > architectures and platforms. > > > > > > However there are some architectures like POWER that may not expose > > > the distance of nodes that are not yet onlined because those resources > > > are not yet allocated to the OS instance. Such architectures have > > > other means to provide valid distance data for the current platform. > > > > > > For example distance info from numactl from a fully populated 8 node > > > system at boot may look like this. > > > > > > node distances: > > > node 0 1 2 3 4 5 6 7 > > > 0: 10 20 40 40 40 40 40 40 > > > 1: 20 10 40 40 40 40 40 40 > > > 2: 40 40 10 20 40 40 40 40 > > > 3: 40 40 20 10 40 40 40 40 > > > 4: 40 40 40 40 10 20 40 40 > > > 5: 40 40 40 40 20 10 40 40 > > > 6: 40 40 40 40 40 40 10 20 > > > 7: 40 40 40 40 40 40 20 10 > > > > > > However the same system when only two nodes are online at boot, then the > > > numa topology will look like > > > node distances: > > > node 0 1 > > > 0: 10 20 > > > 1: 20 10 > > > > > > It may be implementation dependent on what node_distance(0,3) where > > > node 0 is online and node 3 is offline. In POWER case, it returns > > > LOCAL_DISTANCE(10). Here at boot the scheduler would assume that the max > > > distance between nodes is 20. However that would not be true. > > > > > > When Nodes are onlined and CPUs from those nodes are hotplugged, > > > the max node distance would be 40. > > > > > > To handle such scenarios, let scheduler allow architectures to populate > > > the distance map. Architectures that like to populate the distance map > > > can overload arch_populate_distance_map(). > > > > Why? Why can't your node_distance() DTRT? The arch interface is > > nr_node_ids and node_distance(), I don't see why we need something new > > and then replace one special use of it. > > > > By virtue of you being able to actually implement this new hook, you > > supposedly can actually do node_distance() right too. > > Since for an offline node, arch interface code doesn't have the info. > As far as I know/understand, in POWER, unless there is an active memory or > CPU that's getting onlined, arch can't fetch the correct node distance. > > Taking the above example: node 3 is offline, then node_distance of (3,X) > where X is anything other than 3, is not reliable. The moment node 3 is > onlined, the node distance is reliable. > > This problem will not happen even on POWER if all the nodes have either > memory or CPUs active at the time of boot. But then how can you implement this new hook? Going by the fact that both nr_node_ids and distance_ref_points_depth are fixed, how many possible __node_distance() configurations are there left? The example provided above does not suggest there's much room for alternatives, and hence for actual need of this new interface. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/3] sched/topology: Allow archs to populate distance map 2021-05-21 8:14 ` Peter Zijlstra @ 2021-05-21 9:28 ` Srikar Dronamraju 2021-05-24 14:16 ` Valentin Schneider 0 siblings, 1 reply; 17+ messages in thread From: Srikar Dronamraju @ 2021-05-21 9:28 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner, Valentin Schneider, Vincent Guittot, Dietmar Eggemann, linuxppc-dev, Nathan Lynch, Michael Ellerman, Scott Cheloha, Gautham R Shenoy, Geetika Moolchandani * Peter Zijlstra <peterz@infradead.org> [2021-05-21 10:14:10]: > On Fri, May 21, 2021 at 08:08:02AM +0530, Srikar Dronamraju wrote: > > * Peter Zijlstra <peterz@infradead.org> [2021-05-20 20:56:31]: > > > > > On Thu, May 20, 2021 at 09:14:25PM +0530, Srikar Dronamraju wrote: > > > > Currently scheduler populates the distance map by looking at distance > > > > of each node from all other nodes. This should work for most > > > > architectures and platforms. > > > > > > > > However there are some architectures like POWER that may not expose > > > > the distance of nodes that are not yet onlined because those resources > > > > are not yet allocated to the OS instance. Such architectures have > > > > other means to provide valid distance data for the current platform. > > > > > > > > For example distance info from numactl from a fully populated 8 node > > > > system at boot may look like this. > > > > > > > > node distances: > > > > node 0 1 2 3 4 5 6 7 > > > > 0: 10 20 40 40 40 40 40 40 > > > > 1: 20 10 40 40 40 40 40 40 > > > > 2: 40 40 10 20 40 40 40 40 > > > > 3: 40 40 20 10 40 40 40 40 > > > > 4: 40 40 40 40 10 20 40 40 > > > > 5: 40 40 40 40 20 10 40 40 > > > > 6: 40 40 40 40 40 40 10 20 > > > > 7: 40 40 40 40 40 40 20 10 > > > > > > > > However the same system when only two nodes are online at boot, then the > > > > numa topology will look like > > > > node distances: > > > > node 0 1 > > > > 0: 10 20 > > > > 1: 20 10 > > > > > > > > It may be implementation dependent on what node_distance(0,3) where > > > > node 0 is online and node 3 is offline. In POWER case, it returns > > > > LOCAL_DISTANCE(10). Here at boot the scheduler would assume that the max > > > > distance between nodes is 20. However that would not be true. > > > > > > > > When Nodes are onlined and CPUs from those nodes are hotplugged, > > > > the max node distance would be 40. > > > > > > > > To handle such scenarios, let scheduler allow architectures to populate > > > > the distance map. Architectures that like to populate the distance map > > > > can overload arch_populate_distance_map(). > > > > > > Why? Why can't your node_distance() DTRT? The arch interface is > > > nr_node_ids and node_distance(), I don't see why we need something new > > > and then replace one special use of it. > > > > > > By virtue of you being able to actually implement this new hook, you > > > supposedly can actually do node_distance() right too. > > > > Since for an offline node, arch interface code doesn't have the info. > > As far as I know/understand, in POWER, unless there is an active memory or > > CPU that's getting onlined, arch can't fetch the correct node distance. > > > > Taking the above example: node 3 is offline, then node_distance of (3,X) > > where X is anything other than 3, is not reliable. The moment node 3 is > > onlined, the node distance is reliable. > > > > This problem will not happen even on POWER if all the nodes have either > > memory or CPUs active at the time of boot. > > But then how can you implement this new hook? Going by the fact that > both nr_node_ids and distance_ref_points_depth are fixed, how many > possible __node_distance() configurations are there left? > distance_ref_point_depth is provided as a different property and is readily available at boot. The new api will use just use that. So based on the distance_ref_point_depth, we know all possible node distances for that platform. For an offline node, we don't have that specific nodes distance_lookup_table array entries. Each array would be of distance_ref_point_depth entries. Without the distance_lookup_table for an array populated, we will not be able to tell how far the node is with respect to other nodes. We can lookup the correct distance_lookup_table for a node based on memory or the CPUs attached to that node. Since in an offline node, both of them would not be around, the distance_lookup_table will have stale values. > The example provided above does not suggest there's much room for > alternatives, and hence for actual need of this new interface. > -- Thanks and Regards Srikar Dronamraju ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/3] sched/topology: Allow archs to populate distance map 2021-05-21 9:28 ` Srikar Dronamraju @ 2021-05-24 14:16 ` Valentin Schneider 2021-05-24 16:18 ` Srikar Dronamraju 0 siblings, 1 reply; 17+ messages in thread From: Valentin Schneider @ 2021-05-24 14:16 UTC (permalink / raw) To: Srikar Dronamraju, Peter Zijlstra Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner, Vincent Guittot, Dietmar Eggemann, linuxppc-dev, Nathan Lynch, Michael Ellerman, Scott Cheloha, Gautham R Shenoy, Geetika Moolchandani On 21/05/21 14:58, Srikar Dronamraju wrote: > * Peter Zijlstra <peterz@infradead.org> [2021-05-21 10:14:10]: > >> On Fri, May 21, 2021 at 08:08:02AM +0530, Srikar Dronamraju wrote: >> > * Peter Zijlstra <peterz@infradead.org> [2021-05-20 20:56:31]: >> > >> > > On Thu, May 20, 2021 at 09:14:25PM +0530, Srikar Dronamraju wrote: >> > > > Currently scheduler populates the distance map by looking at distance >> > > > of each node from all other nodes. This should work for most >> > > > architectures and platforms. >> > > > >> > > > However there are some architectures like POWER that may not expose >> > > > the distance of nodes that are not yet onlined because those resources >> > > > are not yet allocated to the OS instance. Such architectures have >> > > > other means to provide valid distance data for the current platform. >> > > > >> > > > For example distance info from numactl from a fully populated 8 node >> > > > system at boot may look like this. >> > > > >> > > > node distances: >> > > > node 0 1 2 3 4 5 6 7 >> > > > 0: 10 20 40 40 40 40 40 40 >> > > > 1: 20 10 40 40 40 40 40 40 >> > > > 2: 40 40 10 20 40 40 40 40 >> > > > 3: 40 40 20 10 40 40 40 40 >> > > > 4: 40 40 40 40 10 20 40 40 >> > > > 5: 40 40 40 40 20 10 40 40 >> > > > 6: 40 40 40 40 40 40 10 20 >> > > > 7: 40 40 40 40 40 40 20 10 >> > > > >> > > > However the same system when only two nodes are online at boot, then the >> > > > numa topology will look like >> > > > node distances: >> > > > node 0 1 >> > > > 0: 10 20 >> > > > 1: 20 10 >> > > > >> > > > It may be implementation dependent on what node_distance(0,3) where >> > > > node 0 is online and node 3 is offline. In POWER case, it returns >> > > > LOCAL_DISTANCE(10). Here at boot the scheduler would assume that the max >> > > > distance between nodes is 20. However that would not be true. >> > > > >> > > > When Nodes are onlined and CPUs from those nodes are hotplugged, >> > > > the max node distance would be 40. >> > > > >> > > > To handle such scenarios, let scheduler allow architectures to populate >> > > > the distance map. Architectures that like to populate the distance map >> > > > can overload arch_populate_distance_map(). >> > > >> > > Why? Why can't your node_distance() DTRT? The arch interface is >> > > nr_node_ids and node_distance(), I don't see why we need something new >> > > and then replace one special use of it. >> > > >> > > By virtue of you being able to actually implement this new hook, you >> > > supposedly can actually do node_distance() right too. >> > >> > Since for an offline node, arch interface code doesn't have the info. >> > As far as I know/understand, in POWER, unless there is an active memory or >> > CPU that's getting onlined, arch can't fetch the correct node distance. >> > >> > Taking the above example: node 3 is offline, then node_distance of (3,X) >> > where X is anything other than 3, is not reliable. The moment node 3 is >> > onlined, the node distance is reliable. >> > >> > This problem will not happen even on POWER if all the nodes have either >> > memory or CPUs active at the time of boot. >> >> But then how can you implement this new hook? Going by the fact that >> both nr_node_ids and distance_ref_points_depth are fixed, how many >> possible __node_distance() configurations are there left? >> > > distance_ref_point_depth is provided as a different property and is readily > available at boot. The new api will use just use that. So based on the > distance_ref_point_depth, we know all possible node distances for that > platform. > > For an offline node, we don't have that specific nodes distance_lookup_table > array entries. Each array would be of distance_ref_point_depth entries. > Without the distance_lookup_table for an array populated, we will not be > able to tell how far the node is with respect to other nodes. > > We can lookup the correct distance_lookup_table for a node based on memory > or the CPUs attached to that node. Since in an offline node, both of them > would not be around, the distance_lookup_table will have stale values. > Ok so from your arch you can figure out the *size* of the set of unique distances, but not the individual node_distance(a, b)... That's quite unfortunate. I suppose one way to avoid the hook would be to write some "fake" distance values into your distance_lookup_table[] for offline nodes using your distance_ref_point_depth thing, i.e. ensure an iteration of node_distance(a, b) covers all distance values [1]. You can then keep patch 3 around, and that should roughly be it. >> The example provided above does not suggest there's much room for >> alternatives, and hence for actual need of this new interface. >> > > -- > Thanks and Regards > Srikar Dronamraju ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/3] sched/topology: Allow archs to populate distance map 2021-05-24 14:16 ` Valentin Schneider @ 2021-05-24 16:18 ` Srikar Dronamraju 2021-05-25 10:21 ` Valentin Schneider 2021-05-28 8:43 ` Peter Zijlstra 0 siblings, 2 replies; 17+ messages in thread From: Srikar Dronamraju @ 2021-05-24 16:18 UTC (permalink / raw) To: Valentin Schneider Cc: Peter Zijlstra, Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner, Vincent Guittot, Dietmar Eggemann, linuxppc-dev, Nathan Lynch, Michael Ellerman, Scott Cheloha, Gautham R Shenoy, Geetika Moolchandani * Valentin Schneider <valentin.schneider@arm.com> [2021-05-24 15:16:09]: > On 21/05/21 14:58, Srikar Dronamraju wrote: > > * Peter Zijlstra <peterz@infradead.org> [2021-05-21 10:14:10]: > > > >> On Fri, May 21, 2021 at 08:08:02AM +0530, Srikar Dronamraju wrote: > >> > * Peter Zijlstra <peterz@infradead.org> [2021-05-20 20:56:31]: > >> > > >> > > On Thu, May 20, 2021 at 09:14:25PM +0530, Srikar Dronamraju wrote: > >> > > > Currently scheduler populates the distance map by looking at distance > >> > > > of each node from all other nodes. This should work for most > >> > > > architectures and platforms. > >> > > > > >> > > > However there are some architectures like POWER that may not expose > >> > > > the distance of nodes that are not yet onlined because those resources > >> > > > are not yet allocated to the OS instance. Such architectures have > >> > > > other means to provide valid distance data for the current platform. > >> > > > > >> > > > For example distance info from numactl from a fully populated 8 node > >> > > > system at boot may look like this. > >> > > > > >> > > > node distances: > >> > > > node 0 1 2 3 4 5 6 7 > >> > > > 0: 10 20 40 40 40 40 40 40 > >> > > > 1: 20 10 40 40 40 40 40 40 > >> > > > 2: 40 40 10 20 40 40 40 40 > >> > > > 3: 40 40 20 10 40 40 40 40 > >> > > > 4: 40 40 40 40 10 20 40 40 > >> > > > 5: 40 40 40 40 20 10 40 40 > >> > > > 6: 40 40 40 40 40 40 10 20 > >> > > > 7: 40 40 40 40 40 40 20 10 > >> > > > > >> > > > However the same system when only two nodes are online at boot, then the > >> > > > numa topology will look like > >> > > > node distances: > >> > > > node 0 1 > >> > > > 0: 10 20 > >> > > > 1: 20 10 > >> > > > > >> > > > It may be implementation dependent on what node_distance(0,3) where > >> > > > node 0 is online and node 3 is offline. In POWER case, it returns > >> > > > LOCAL_DISTANCE(10). Here at boot the scheduler would assume that the max > >> > > > distance between nodes is 20. However that would not be true. > >> > > > > >> > > > When Nodes are onlined and CPUs from those nodes are hotplugged, > >> > > > the max node distance would be 40. > >> > > > > >> > > > To handle such scenarios, let scheduler allow architectures to populate > >> > > > the distance map. Architectures that like to populate the distance map > >> > > > can overload arch_populate_distance_map(). > >> > > > >> > > Why? Why can't your node_distance() DTRT? The arch interface is > >> > > nr_node_ids and node_distance(), I don't see why we need something new > >> > > and then replace one special use of it. > >> > > > >> > > By virtue of you being able to actually implement this new hook, you > >> > > supposedly can actually do node_distance() right too. > >> > > >> > Since for an offline node, arch interface code doesn't have the info. > >> > As far as I know/understand, in POWER, unless there is an active memory or > >> > CPU that's getting onlined, arch can't fetch the correct node distance. > >> > > >> > Taking the above example: node 3 is offline, then node_distance of (3,X) > >> > where X is anything other than 3, is not reliable. The moment node 3 is > >> > onlined, the node distance is reliable. > >> > > >> > This problem will not happen even on POWER if all the nodes have either > >> > memory or CPUs active at the time of boot. > >> > >> But then how can you implement this new hook? Going by the fact that > >> both nr_node_ids and distance_ref_points_depth are fixed, how many > >> possible __node_distance() configurations are there left? > >> > > > > distance_ref_point_depth is provided as a different property and is readily > > available at boot. The new api will use just use that. So based on the > > distance_ref_point_depth, we know all possible node distances for that > > platform. > > > > For an offline node, we don't have that specific nodes distance_lookup_table > > array entries. Each array would be of distance_ref_point_depth entries. > > Without the distance_lookup_table for an array populated, we will not be > > able to tell how far the node is with respect to other nodes. > > > > We can lookup the correct distance_lookup_table for a node based on memory > > or the CPUs attached to that node. Since in an offline node, both of them > > would not be around, the distance_lookup_table will have stale values. > > > > Ok so from your arch you can figure out the *size* of the set of unique > distances, but not the individual node_distance(a, b)... That's quite > unfortunate. Yes, thats true. > > I suppose one way to avoid the hook would be to write some "fake" distance > values into your distance_lookup_table[] for offline nodes using your > distance_ref_point_depth thing, i.e. ensure an iteration of > node_distance(a, b) covers all distance values [1]. You can then keep patch > 3 around, and that should roughly be it. > Yes, this would suffice but to me its not very clean. static int found[distance_ref_point_depth]; for_each_node(node){ int i, nd, distance = LOCAL_DISTANCE; goto out; nd = node_distance(node, first_online_node) for (i=0; i < distance_ref_point_depth; i++, distance *= 2) { if (node_online) { if (distance != nd) continue; found[i] ++; break; } if (found[i]) continue; distance_lookup_table[node][i] = distance_lookup_table[first_online_node][i]; found[i] ++; break; } } But do note: We are setting a precedent for node distance between two nodes to change. > > >> The example provided above does not suggest there's much room for > >> alternatives, and hence for actual need of this new interface. > >> > > > > -- > > Thanks and Regards > > Srikar Dronamraju -- Thanks and Regards Srikar Dronamraju ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/3] sched/topology: Allow archs to populate distance map 2021-05-24 16:18 ` Srikar Dronamraju @ 2021-05-25 10:21 ` Valentin Schneider 2021-05-25 11:32 ` Srikar Dronamraju 2021-05-28 5:21 ` Srikar Dronamraju 2021-05-28 8:43 ` Peter Zijlstra 1 sibling, 2 replies; 17+ messages in thread From: Valentin Schneider @ 2021-05-25 10:21 UTC (permalink / raw) To: Srikar Dronamraju Cc: Peter Zijlstra, Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner, Vincent Guittot, Dietmar Eggemann, linuxppc-dev, Nathan Lynch, Michael Ellerman, Scott Cheloha, Gautham R Shenoy, Geetika Moolchandani On 24/05/21 21:48, Srikar Dronamraju wrote: > * Valentin Schneider <valentin.schneider@arm.com> [2021-05-24 15:16:09]: >> Ok so from your arch you can figure out the *size* of the set of unique >> distances, but not the individual node_distance(a, b)... That's quite >> unfortunate. > > Yes, thats true. > >> >> I suppose one way to avoid the hook would be to write some "fake" distance >> values into your distance_lookup_table[] for offline nodes using your >> distance_ref_point_depth thing, i.e. ensure an iteration of >> node_distance(a, b) covers all distance values [1]. You can then keep patch >> 3 around, and that should roughly be it. >> > > Yes, this would suffice but to me its not very clean. > static int found[distance_ref_point_depth]; > > for_each_node(node){ > int i, nd, distance = LOCAL_DISTANCE; > goto out; > > nd = node_distance(node, first_online_node) > for (i=0; i < distance_ref_point_depth; i++, distance *= 2) { > if (node_online) { > if (distance != nd) > continue; > found[i] ++; > break; > } > if (found[i]) > continue; > distance_lookup_table[node][i] = distance_lookup_table[first_online_node][i]; > found[i] ++; > break; > } > } > > But do note: We are setting a precedent for node distance between two nodes > to change. > Indeed. AFAICT it's that or the unique-distance-values hook :/ ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/3] sched/topology: Allow archs to populate distance map 2021-05-25 10:21 ` Valentin Schneider @ 2021-05-25 11:32 ` Srikar Dronamraju 2021-05-28 5:21 ` Srikar Dronamraju 1 sibling, 0 replies; 17+ messages in thread From: Srikar Dronamraju @ 2021-05-25 11:32 UTC (permalink / raw) To: Valentin Schneider Cc: Peter Zijlstra, Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner, Vincent Guittot, Dietmar Eggemann, linuxppc-dev, Nathan Lynch, Michael Ellerman, Scott Cheloha, Gautham R Shenoy, Geetika Moolchandani * Valentin Schneider <valentin.schneider@arm.com> [2021-05-25 11:21:02]: > On 24/05/21 21:48, Srikar Dronamraju wrote: > > * Valentin Schneider <valentin.schneider@arm.com> [2021-05-24 15:16:09]: > >> Ok so from your arch you can figure out the *size* of the set of unique > >> distances, but not the individual node_distance(a, b)... That's quite > >> unfortunate. > > > > Yes, thats true. > > > >> > >> I suppose one way to avoid the hook would be to write some "fake" distance > >> values into your distance_lookup_table[] for offline nodes using your > >> distance_ref_point_depth thing, i.e. ensure an iteration of > >> node_distance(a, b) covers all distance values [1]. You can then keep patch > >> 3 around, and that should roughly be it. > >> > > > > Yes, this would suffice but to me its not very clean. > > static int found[distance_ref_point_depth]; > > > > for_each_node(node){ > > int i, nd, distance = LOCAL_DISTANCE; > > goto out; > > > > nd = node_distance(node, first_online_node) > > for (i=0; i < distance_ref_point_depth; i++, distance *= 2) { > > if (node_online) { > > if (distance != nd) > > continue; > > found[i] ++; > > break; > > } > > if (found[i]) > > continue; > > distance_lookup_table[node][i] = distance_lookup_table[first_online_node][i]; > > found[i] ++; > > break; > > } > > } > > > > But do note: We are setting a precedent for node distance between two nodes > > to change. > > > > Indeed. AFAICT it's that or the unique-distance-values hook :/ Peter, Please let me know which approach would you prefer. I am open to try any other approach too. In my humble opinion, unique-distance-values hook is more cleaner. Do you still have any concerns with the unique-distance-values hook? -- Thanks and Regards Srikar Dronamraju ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/3] sched/topology: Allow archs to populate distance map 2021-05-25 10:21 ` Valentin Schneider 2021-05-25 11:32 ` Srikar Dronamraju @ 2021-05-28 5:21 ` Srikar Dronamraju 1 sibling, 0 replies; 17+ messages in thread From: Srikar Dronamraju @ 2021-05-28 5:21 UTC (permalink / raw) To: Valentin Schneider Cc: Peter Zijlstra, Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner, Vincent Guittot, Dietmar Eggemann, linuxppc-dev, Nathan Lynch, Michael Ellerman, Scott Cheloha, Gautham R Shenoy, Geetika Moolchandani * Valentin Schneider <valentin.schneider@arm.com> [2021-05-25 11:21:02]: > On 24/05/21 21:48, Srikar Dronamraju wrote: > > * Valentin Schneider <valentin.schneider@arm.com> [2021-05-24 15:16:09]: > >> Ok so from your arch you can figure out the *size* of the set of unique > >> distances, but not the individual node_distance(a, b)... That's quite > >> unfortunate. > > > > Yes, thats true. > > > >> > >> I suppose one way to avoid the hook would be to write some "fake" distance > >> values into your distance_lookup_table[] for offline nodes using your > >> distance_ref_point_depth thing, i.e. ensure an iteration of > >> node_distance(a, b) covers all distance values [1]. You can then keep patch > >> 3 around, and that should roughly be it. > >> > > > > Yes, this would suffice but to me its not very clean. > > static int found[distance_ref_point_depth]; > > > > for_each_node(node){ > > int i, nd, distance = LOCAL_DISTANCE; > > goto out; > > > > nd = node_distance(node, first_online_node) > > for (i=0; i < distance_ref_point_depth; i++, distance *= 2) { > > if (node_online) { > > if (distance != nd) > > continue; > > found[i] ++; > > break; > > } > > if (found[i]) > > continue; > > distance_lookup_table[node][i] = distance_lookup_table[first_online_node][i]; > > found[i] ++; > > break; > > } > > } > > > > But do note: We are setting a precedent for node distance between two nodes > > to change. > > > > Indeed. AFAICT it's that or the unique-distance-values hook :/ Peter, Valentin, Michael, Can you please let me know which approach you would want me to follow. Or do let me know any other alternative solutions that you would want me to try. -- Thanks and Regards Srikar Dronamraju ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/3] sched/topology: Allow archs to populate distance map 2021-05-24 16:18 ` Srikar Dronamraju 2021-05-25 10:21 ` Valentin Schneider @ 2021-05-28 8:43 ` Peter Zijlstra 2021-05-28 10:24 ` Srikar Dronamraju 1 sibling, 1 reply; 17+ messages in thread From: Peter Zijlstra @ 2021-05-28 8:43 UTC (permalink / raw) To: Srikar Dronamraju Cc: Valentin Schneider, Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner, Vincent Guittot, Dietmar Eggemann, linuxppc-dev, Nathan Lynch, Michael Ellerman, Scott Cheloha, Gautham R Shenoy, Geetika Moolchandani On Mon, May 24, 2021 at 09:48:29PM +0530, Srikar Dronamraju wrote: > * Valentin Schneider <valentin.schneider@arm.com> [2021-05-24 15:16:09]: > > I suppose one way to avoid the hook would be to write some "fake" distance > > values into your distance_lookup_table[] for offline nodes using your > > distance_ref_point_depth thing, i.e. ensure an iteration of > > node_distance(a, b) covers all distance values [1]. You can then keep patch > > 3 around, and that should roughly be it. > > > > Yes, this would suffice but to me its not very clean. > static int found[distance_ref_point_depth]; > > for_each_node(node){ > int i, nd, distance = LOCAL_DISTANCE; > goto out; > > nd = node_distance(node, first_online_node) > for (i=0; i < distance_ref_point_depth; i++, distance *= 2) { > if (node_online) { > if (distance != nd) > continue; > found[i] ++; > break; > } > if (found[i]) > continue; > distance_lookup_table[node][i] = distance_lookup_table[first_online_node][i]; > found[i] ++; > break; > } > } > > But do note: We are setting a precedent for node distance between two nodes > to change. Not really; or rather not more than already is the case AFAICT. Because currently your distance table will have *something* in it (LOCAL_DISTANCE afaict) for nodes that have never been online, which is what triggered the whole problem to begin with. Only after the node has come online for the first time, will it contain the right value. So both before and after this proposal the actual distance value changes after the first time a node goes online. Yes that's unfortunate, but I don't see a problem with pre-filling it with something useful in order to avoid aditional arch hooks. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/3] sched/topology: Allow archs to populate distance map 2021-05-28 8:43 ` Peter Zijlstra @ 2021-05-28 10:24 ` Srikar Dronamraju 0 siblings, 0 replies; 17+ messages in thread From: Srikar Dronamraju @ 2021-05-28 10:24 UTC (permalink / raw) To: Peter Zijlstra Cc: Valentin Schneider, Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner, Vincent Guittot, Dietmar Eggemann, linuxppc-dev, Nathan Lynch, Michael Ellerman, Scott Cheloha, Gautham R Shenoy, Geetika Moolchandani * Peter Zijlstra <peterz@infradead.org> [2021-05-28 10:43:23]: > On Mon, May 24, 2021 at 09:48:29PM +0530, Srikar Dronamraju wrote: > > * Valentin Schneider <valentin.schneider@arm.com> [2021-05-24 15:16:09]: > > > > I suppose one way to avoid the hook would be to write some "fake" distance > > > values into your distance_lookup_table[] for offline nodes using your > > > distance_ref_point_depth thing, i.e. ensure an iteration of > > > node_distance(a, b) covers all distance values [1]. You can then keep patch > > > 3 around, and that should roughly be it. > > > > > > > Yes, this would suffice but to me its not very clean. > > static int found[distance_ref_point_depth]; > > > > for_each_node(node){ > > int i, nd, distance = LOCAL_DISTANCE; > > goto out; > > > > nd = node_distance(node, first_online_node) > > for (i=0; i < distance_ref_point_depth; i++, distance *= 2) { > > if (node_online) { > > if (distance != nd) > > continue; > > found[i] ++; > > break; > > } > > if (found[i]) > > continue; > > distance_lookup_table[node][i] = distance_lookup_table[first_online_node][i]; > > found[i] ++; > > break; > > } > > } > > > > But do note: We are setting a precedent for node distance between two nodes > > to change. > > Not really; or rather not more than already is the case AFAICT. Because > currently your distance table will have *something* in it > (LOCAL_DISTANCE afaict) for nodes that have never been online, which is > what triggered the whole problem to begin with. > > Only after the node has come online for the first time, will it contain > the right value. > > So both before and after this proposal the actual distance value changes > after the first time a node goes online. > > Yes that's unfortunate, but I don't see a problem with pre-filling it > with something useful in order to avoid aditional arch hooks. > > Okay, Will post a v2 with prefilling. Thanks for the update. -- Thanks and Regards Srikar Dronamraju ^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 2/3] powerpc/numa: Populate distance map correctly 2021-05-20 15:44 [PATCH 0/3] Skip numa distance for offline nodes Srikar Dronamraju 2021-05-20 15:44 ` [PATCH 1/3] sched/topology: Allow archs to populate distance map Srikar Dronamraju @ 2021-05-20 15:44 ` Srikar Dronamraju 2021-05-24 14:16 ` Valentin Schneider 2021-05-20 15:44 ` [PATCH 3/3] sched/topology: Skip updating masks for non-online nodes Srikar Dronamraju 2 siblings, 1 reply; 17+ messages in thread From: Srikar Dronamraju @ 2021-05-20 15:44 UTC (permalink / raw) To: Ingo Molnar, Peter Zijlstra Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner, Valentin Schneider, Vincent Guittot, Dietmar Eggemann, linuxppc-dev, Nathan Lynch, Michael Ellerman, Scott Cheloha, Gautham R Shenoy, Geetika Moolchandani As per PAPR that defines the OS to hypervisor interface on POWER, there is no way to calculate the node_distance between 2 nodes, when either of the nodes are offline. However scheduler needs the distance map to be populated at boot time. On POWER, this information is provided within the distance_ref_points_depth array, which needs to be parsed to extract the potential node distances. To handle this scenario, lets overload arch_populate_distance_map(), to provide all the distances that are possible in the current platform. Cc: LKML <linux-kernel@vger.kernel.org> Cc: linuxppc-dev@lists.ozlabs.org Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: Scott Cheloha <cheloha@linux.ibm.com> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Rik van Riel <riel@surriel.com> Cc: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com> Reported-by: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> --- arch/powerpc/include/asm/topology.h | 3 +++ arch/powerpc/mm/numa.c | 19 +++++++++++++++++++ 2 files changed, 22 insertions(+) diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h index e4db64c0e184..d7605d833b8d 100644 --- a/arch/powerpc/include/asm/topology.h +++ b/arch/powerpc/include/asm/topology.h @@ -22,6 +22,9 @@ struct drmem_lmb; cpu_all_mask : \ node_to_cpumask_map[node]) +#define arch_populate_distance_map arch_populate_distance_map +extern int arch_populate_distance_map(unsigned long *distance_map); + struct pci_bus; #ifdef CONFIG_PCI extern int pcibus_to_node(struct pci_bus *bus); diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index f2bf98bdcea2..9a225b29814a 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -221,6 +221,25 @@ static void initialize_distance_lookup_table(int nid, } } +int arch_populate_distance_map(unsigned long *distance_map) +{ + int i; + int distance = LOCAL_DISTANCE; + + bitmap_set(distance_map, distance, 1); + + if (!form1_affinity) { + bitmap_set(distance_map, REMOTE_DISTANCE, 1); + return 0; + } + + for (i = 0; i < distance_ref_points_depth; i++) { + distance *= 2; + bitmap_set(distance_map, distance, 1); + } + return 0; +} + /* * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA * info is found. -- 2.27.0 ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH 2/3] powerpc/numa: Populate distance map correctly 2021-05-20 15:44 ` [PATCH 2/3] powerpc/numa: Populate distance map correctly Srikar Dronamraju @ 2021-05-24 14:16 ` Valentin Schneider 2021-05-24 14:50 ` Srikar Dronamraju 0 siblings, 1 reply; 17+ messages in thread From: Valentin Schneider @ 2021-05-24 14:16 UTC (permalink / raw) To: Srikar Dronamraju, Ingo Molnar, Peter Zijlstra Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner, Vincent Guittot, Dietmar Eggemann, linuxppc-dev, Nathan Lynch, Michael Ellerman, Scott Cheloha, Gautham R Shenoy, Geetika Moolchandani On 20/05/21 21:14, Srikar Dronamraju wrote: > +int arch_populate_distance_map(unsigned long *distance_map) > +{ > + int i; > + int distance = LOCAL_DISTANCE; > + > + bitmap_set(distance_map, distance, 1); > + > + if (!form1_affinity) { > + bitmap_set(distance_map, REMOTE_DISTANCE, 1); > + return 0; > + } > + > + for (i = 0; i < distance_ref_points_depth; i++) { > + distance *= 2; > + bitmap_set(distance_map, distance, 1); Do you have guarantees your distance values will always be in the form of LOCAL_DISTANCE * 2^i because that certainly isn't true for x86/arm64. > + } > + return 0; > +} > + > /* > * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA > * info is found. > -- > 2.27.0 ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/3] powerpc/numa: Populate distance map correctly 2021-05-24 14:16 ` Valentin Schneider @ 2021-05-24 14:50 ` Srikar Dronamraju 0 siblings, 0 replies; 17+ messages in thread From: Srikar Dronamraju @ 2021-05-24 14:50 UTC (permalink / raw) To: Valentin Schneider Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner, Vincent Guittot, Dietmar Eggemann, linuxppc-dev, Nathan Lynch, Michael Ellerman, Scott Cheloha, Gautham R Shenoy, Geetika Moolchandani * Valentin Schneider <valentin.schneider@arm.com> [2021-05-24 15:16:22]: > On 20/05/21 21:14, Srikar Dronamraju wrote: > > +int arch_populate_distance_map(unsigned long *distance_map) > > +{ > > + int i; > > + int distance = LOCAL_DISTANCE; > > + > > + bitmap_set(distance_map, distance, 1); > > + > > + if (!form1_affinity) { > > + bitmap_set(distance_map, REMOTE_DISTANCE, 1); > > + return 0; > > + } > > + > > + for (i = 0; i < distance_ref_points_depth; i++) { > > + distance *= 2; > > + bitmap_set(distance_map, distance, 1); > > Do you have guarantees your distance values will always be in the form of > > LOCAL_DISTANCE * 2^i > > because that certainly isn't true for x86/arm64. > This is true till now. It don't think that's going to change anytime soon, but we never know what lies ahead. For all practical purposes, (unless a newer, shinier property is proposed,) distance_ref_points_depth is going to give us the unique distances. > > + } > > + return 0; > > +} > > + > > /* > > * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA > > * info is found. > > -- > > 2.27.0 -- Thanks and Regards Srikar Dronamraju ^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 3/3] sched/topology: Skip updating masks for non-online nodes 2021-05-20 15:44 [PATCH 0/3] Skip numa distance for offline nodes Srikar Dronamraju 2021-05-20 15:44 ` [PATCH 1/3] sched/topology: Allow archs to populate distance map Srikar Dronamraju 2021-05-20 15:44 ` [PATCH 2/3] powerpc/numa: Populate distance map correctly Srikar Dronamraju @ 2021-05-20 15:44 ` Srikar Dronamraju 2 siblings, 0 replies; 17+ messages in thread From: Srikar Dronamraju @ 2021-05-20 15:44 UTC (permalink / raw) To: Ingo Molnar, Peter Zijlstra Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner, Valentin Schneider, Vincent Guittot, Dietmar Eggemann, linuxppc-dev, Nathan Lynch, Michael Ellerman, Scott Cheloha, Gautham R Shenoy, Geetika Moolchandani Currently scheduler doesn't check if node is online before adding CPUs to the node mask. However on some architectures, node distance is only available for nodes that are online. Its not sure how much to rely on the node distance, when one of the nodes is offline. If said node distance is fake (since one of the nodes is offline) and the actual node distance is different, then the cpumask of such nodes when the nodes become becomes online will be wrong. This can cause topology_span_sane to throw up a warning message and the rest of the topology being not updated properly. Resolve this by skipping update of cpumask for nodes that are not online. Cc: LKML <linux-kernel@vger.kernel.org> Cc: linuxppc-dev@lists.ozlabs.org Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: Scott Cheloha <cheloha@linux.ibm.com> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Rik van Riel <riel@surriel.com> Cc: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com> Reported-by: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> --- kernel/sched/topology.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index ccb9aff59add..ba0555e83548 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -1731,6 +1731,9 @@ void sched_init_numa(void) sched_domains_numa_masks[i][j] = mask; for_each_node(k) { + if (!node_online(j)) + continue; + if (sched_debug() && (node_distance(j, k) != node_distance(k, j))) sched_numa_warn("Node-distance not symmetric"); @@ -1793,6 +1796,9 @@ void sched_domains_numa_masks_set(unsigned int cpu) for (i = 0; i < sched_domains_numa_levels; i++) { for (j = 0; j < nr_node_ids; j++) { + if (!node_online(j)) + continue; + if (node_distance(j, node) <= sched_domains_numa_distance[i]) cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]); } -- 2.27.0 ^ permalink raw reply related [flat|nested] 17+ messages in thread
end of thread, other threads:[~2021-05-28 10:25 UTC | newest] Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-05-20 15:44 [PATCH 0/3] Skip numa distance for offline nodes Srikar Dronamraju 2021-05-20 15:44 ` [PATCH 1/3] sched/topology: Allow archs to populate distance map Srikar Dronamraju 2021-05-20 18:56 ` Peter Zijlstra 2021-05-21 2:38 ` Srikar Dronamraju 2021-05-21 8:14 ` Peter Zijlstra 2021-05-21 9:28 ` Srikar Dronamraju 2021-05-24 14:16 ` Valentin Schneider 2021-05-24 16:18 ` Srikar Dronamraju 2021-05-25 10:21 ` Valentin Schneider 2021-05-25 11:32 ` Srikar Dronamraju 2021-05-28 5:21 ` Srikar Dronamraju 2021-05-28 8:43 ` Peter Zijlstra 2021-05-28 10:24 ` Srikar Dronamraju 2021-05-20 15:44 ` [PATCH 2/3] powerpc/numa: Populate distance map correctly Srikar Dronamraju 2021-05-24 14:16 ` Valentin Schneider 2021-05-24 14:50 ` Srikar Dronamraju 2021-05-20 15:44 ` [PATCH 3/3] sched/topology: Skip updating masks for non-online nodes Srikar Dronamraju
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).