[1/3] sched/topology: Allow archs to populate distance map
diff mbox series

Message ID 20210520154427.1041031-2-srikar@linux.vnet.ibm.com
State New, archived
Headers show
Series
  • Skip numa distance for offline nodes
Related show

Commit Message

Srikar Dronamraju May 20, 2021, 3:44 p.m. UTC
Currently scheduler populates the distance map by looking at distance
of each node from all other nodes. This should work for most
architectures and platforms.

However there are some architectures like POWER that may not expose
the distance of nodes that are not yet onlined because those resources
are not yet allocated to the OS instance. Such architectures have
other means to provide valid distance data for the current platform.

For example distance info from numactl from a fully populated 8 node
system at boot may look like this.

node distances:
node   0   1   2   3   4   5   6   7
  0:  10  20  40  40  40  40  40  40
  1:  20  10  40  40  40  40  40  40
  2:  40  40  10  20  40  40  40  40
  3:  40  40  20  10  40  40  40  40
  4:  40  40  40  40  10  20  40  40
  5:  40  40  40  40  20  10  40  40
  6:  40  40  40  40  40  40  10  20
  7:  40  40  40  40  40  40  20  10

However the same system when only two nodes are online at boot, then the
numa topology will look like
node distances:
node   0   1
  0:  10  20
  1:  20  10

It may be implementation dependent on what node_distance(0,3) where
node 0 is online and node 3 is offline. In POWER case, it returns
LOCAL_DISTANCE(10). Here at boot the scheduler would assume that the max
distance between nodes is 20. However that would not be true.

When Nodes are onlined and CPUs from those nodes are hotplugged,
the max node distance would be 40.

To handle such scenarios, let scheduler allow architectures to populate
the distance map. Architectures that like to populate the distance map
can overload arch_populate_distance_map().

Cc: LKML <linux-kernel@vger.kernel.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Scott Cheloha <cheloha@linux.ibm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
Reported-by: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/topology.c | 32 ++++++++++++++++++++++----------
 1 file changed, 22 insertions(+), 10 deletions(-)

Comments

Peter Zijlstra May 20, 2021, 6:56 p.m. UTC | #1
On Thu, May 20, 2021 at 09:14:25PM +0530, Srikar Dronamraju wrote:
> Currently scheduler populates the distance map by looking at distance
> of each node from all other nodes. This should work for most
> architectures and platforms.
> 
> However there are some architectures like POWER that may not expose
> the distance of nodes that are not yet onlined because those resources
> are not yet allocated to the OS instance. Such architectures have
> other means to provide valid distance data for the current platform.
> 
> For example distance info from numactl from a fully populated 8 node
> system at boot may look like this.
> 
> node distances:
> node   0   1   2   3   4   5   6   7
>   0:  10  20  40  40  40  40  40  40
>   1:  20  10  40  40  40  40  40  40
>   2:  40  40  10  20  40  40  40  40
>   3:  40  40  20  10  40  40  40  40
>   4:  40  40  40  40  10  20  40  40
>   5:  40  40  40  40  20  10  40  40
>   6:  40  40  40  40  40  40  10  20
>   7:  40  40  40  40  40  40  20  10
> 
> However the same system when only two nodes are online at boot, then the
> numa topology will look like
> node distances:
> node   0   1
>   0:  10  20
>   1:  20  10
> 
> It may be implementation dependent on what node_distance(0,3) where
> node 0 is online and node 3 is offline. In POWER case, it returns
> LOCAL_DISTANCE(10). Here at boot the scheduler would assume that the max
> distance between nodes is 20. However that would not be true.
> 
> When Nodes are onlined and CPUs from those nodes are hotplugged,
> the max node distance would be 40.
> 
> To handle such scenarios, let scheduler allow architectures to populate
> the distance map. Architectures that like to populate the distance map
> can overload arch_populate_distance_map().

Why? Why can't your node_distance() DTRT? The arch interface is
nr_node_ids and node_distance(), I don't see why we need something new
and then replace one special use of it.

By virtue of you being able to actually implement this new hook, you
supposedly can actually do node_distance() right too.
Srikar Dronamraju May 21, 2021, 2:38 a.m. UTC | #2
* Peter Zijlstra <peterz@infradead.org> [2021-05-20 20:56:31]:

> On Thu, May 20, 2021 at 09:14:25PM +0530, Srikar Dronamraju wrote:
> > Currently scheduler populates the distance map by looking at distance
> > of each node from all other nodes. This should work for most
> > architectures and platforms.
> > 
> > However there are some architectures like POWER that may not expose
> > the distance of nodes that are not yet onlined because those resources
> > are not yet allocated to the OS instance. Such architectures have
> > other means to provide valid distance data for the current platform.
> > 
> > For example distance info from numactl from a fully populated 8 node
> > system at boot may look like this.
> > 
> > node distances:
> > node   0   1   2   3   4   5   6   7
> >   0:  10  20  40  40  40  40  40  40
> >   1:  20  10  40  40  40  40  40  40
> >   2:  40  40  10  20  40  40  40  40
> >   3:  40  40  20  10  40  40  40  40
> >   4:  40  40  40  40  10  20  40  40
> >   5:  40  40  40  40  20  10  40  40
> >   6:  40  40  40  40  40  40  10  20
> >   7:  40  40  40  40  40  40  20  10
> > 
> > However the same system when only two nodes are online at boot, then the
> > numa topology will look like
> > node distances:
> > node   0   1
> >   0:  10  20
> >   1:  20  10
> > 
> > It may be implementation dependent on what node_distance(0,3) where
> > node 0 is online and node 3 is offline. In POWER case, it returns
> > LOCAL_DISTANCE(10). Here at boot the scheduler would assume that the max
> > distance between nodes is 20. However that would not be true.
> > 
> > When Nodes are onlined and CPUs from those nodes are hotplugged,
> > the max node distance would be 40.
> > 
> > To handle such scenarios, let scheduler allow architectures to populate
> > the distance map. Architectures that like to populate the distance map
> > can overload arch_populate_distance_map().
> 
> Why? Why can't your node_distance() DTRT? The arch interface is
> nr_node_ids and node_distance(), I don't see why we need something new
> and then replace one special use of it.
> 
> By virtue of you being able to actually implement this new hook, you
> supposedly can actually do node_distance() right too.

Since for an offline node, arch interface code doesn't have the info.
As far as I know/understand, in POWER, unless there is an active memory or
CPU that's getting onlined, arch can't fetch the correct node distance.

Taking the above example: node 3 is offline, then node_distance of (3,X)
where X is anything other than 3, is not reliable. The moment node 3 is
onlined, the node distance is reliable.

This problem will not happen even on POWER if all the nodes have either
memory or CPUs active at the time of boot.
Peter Zijlstra May 21, 2021, 8:14 a.m. UTC | #3
On Fri, May 21, 2021 at 08:08:02AM +0530, Srikar Dronamraju wrote:
> * Peter Zijlstra <peterz@infradead.org> [2021-05-20 20:56:31]:
> 
> > On Thu, May 20, 2021 at 09:14:25PM +0530, Srikar Dronamraju wrote:
> > > Currently scheduler populates the distance map by looking at distance
> > > of each node from all other nodes. This should work for most
> > > architectures and platforms.
> > > 
> > > However there are some architectures like POWER that may not expose
> > > the distance of nodes that are not yet onlined because those resources
> > > are not yet allocated to the OS instance. Such architectures have
> > > other means to provide valid distance data for the current platform.
> > > 
> > > For example distance info from numactl from a fully populated 8 node
> > > system at boot may look like this.
> > > 
> > > node distances:
> > > node   0   1   2   3   4   5   6   7
> > >   0:  10  20  40  40  40  40  40  40
> > >   1:  20  10  40  40  40  40  40  40
> > >   2:  40  40  10  20  40  40  40  40
> > >   3:  40  40  20  10  40  40  40  40
> > >   4:  40  40  40  40  10  20  40  40
> > >   5:  40  40  40  40  20  10  40  40
> > >   6:  40  40  40  40  40  40  10  20
> > >   7:  40  40  40  40  40  40  20  10
> > > 
> > > However the same system when only two nodes are online at boot, then the
> > > numa topology will look like
> > > node distances:
> > > node   0   1
> > >   0:  10  20
> > >   1:  20  10
> > > 
> > > It may be implementation dependent on what node_distance(0,3) where
> > > node 0 is online and node 3 is offline. In POWER case, it returns
> > > LOCAL_DISTANCE(10). Here at boot the scheduler would assume that the max
> > > distance between nodes is 20. However that would not be true.
> > > 
> > > When Nodes are onlined and CPUs from those nodes are hotplugged,
> > > the max node distance would be 40.
> > > 
> > > To handle such scenarios, let scheduler allow architectures to populate
> > > the distance map. Architectures that like to populate the distance map
> > > can overload arch_populate_distance_map().
> > 
> > Why? Why can't your node_distance() DTRT? The arch interface is
> > nr_node_ids and node_distance(), I don't see why we need something new
> > and then replace one special use of it.
> > 
> > By virtue of you being able to actually implement this new hook, you
> > supposedly can actually do node_distance() right too.
> 
> Since for an offline node, arch interface code doesn't have the info.
> As far as I know/understand, in POWER, unless there is an active memory or
> CPU that's getting onlined, arch can't fetch the correct node distance.
> 
> Taking the above example: node 3 is offline, then node_distance of (3,X)
> where X is anything other than 3, is not reliable. The moment node 3 is
> onlined, the node distance is reliable.
> 
> This problem will not happen even on POWER if all the nodes have either
> memory or CPUs active at the time of boot.

But then how can you implement this new hook? Going by the fact that
both nr_node_ids and distance_ref_points_depth are fixed, how many
possible __node_distance() configurations are there left?

The example provided above does not suggest there's much room for
alternatives, and hence for actual need of this new interface.
Srikar Dronamraju May 21, 2021, 9:28 a.m. UTC | #4
* Peter Zijlstra <peterz@infradead.org> [2021-05-21 10:14:10]:

> On Fri, May 21, 2021 at 08:08:02AM +0530, Srikar Dronamraju wrote:
> > * Peter Zijlstra <peterz@infradead.org> [2021-05-20 20:56:31]:
> > 
> > > On Thu, May 20, 2021 at 09:14:25PM +0530, Srikar Dronamraju wrote:
> > > > Currently scheduler populates the distance map by looking at distance
> > > > of each node from all other nodes. This should work for most
> > > > architectures and platforms.
> > > > 
> > > > However there are some architectures like POWER that may not expose
> > > > the distance of nodes that are not yet onlined because those resources
> > > > are not yet allocated to the OS instance. Such architectures have
> > > > other means to provide valid distance data for the current platform.
> > > > 
> > > > For example distance info from numactl from a fully populated 8 node
> > > > system at boot may look like this.
> > > > 
> > > > node distances:
> > > > node   0   1   2   3   4   5   6   7
> > > >   0:  10  20  40  40  40  40  40  40
> > > >   1:  20  10  40  40  40  40  40  40
> > > >   2:  40  40  10  20  40  40  40  40
> > > >   3:  40  40  20  10  40  40  40  40
> > > >   4:  40  40  40  40  10  20  40  40
> > > >   5:  40  40  40  40  20  10  40  40
> > > >   6:  40  40  40  40  40  40  10  20
> > > >   7:  40  40  40  40  40  40  20  10
> > > > 
> > > > However the same system when only two nodes are online at boot, then the
> > > > numa topology will look like
> > > > node distances:
> > > > node   0   1
> > > >   0:  10  20
> > > >   1:  20  10
> > > > 
> > > > It may be implementation dependent on what node_distance(0,3) where
> > > > node 0 is online and node 3 is offline. In POWER case, it returns
> > > > LOCAL_DISTANCE(10). Here at boot the scheduler would assume that the max
> > > > distance between nodes is 20. However that would not be true.
> > > > 
> > > > When Nodes are onlined and CPUs from those nodes are hotplugged,
> > > > the max node distance would be 40.
> > > > 
> > > > To handle such scenarios, let scheduler allow architectures to populate
> > > > the distance map. Architectures that like to populate the distance map
> > > > can overload arch_populate_distance_map().
> > > 
> > > Why? Why can't your node_distance() DTRT? The arch interface is
> > > nr_node_ids and node_distance(), I don't see why we need something new
> > > and then replace one special use of it.
> > > 
> > > By virtue of you being able to actually implement this new hook, you
> > > supposedly can actually do node_distance() right too.
> > 
> > Since for an offline node, arch interface code doesn't have the info.
> > As far as I know/understand, in POWER, unless there is an active memory or
> > CPU that's getting onlined, arch can't fetch the correct node distance.
> > 
> > Taking the above example: node 3 is offline, then node_distance of (3,X)
> > where X is anything other than 3, is not reliable. The moment node 3 is
> > onlined, the node distance is reliable.
> > 
> > This problem will not happen even on POWER if all the nodes have either
> > memory or CPUs active at the time of boot.
> 
> But then how can you implement this new hook? Going by the fact that
> both nr_node_ids and distance_ref_points_depth are fixed, how many
> possible __node_distance() configurations are there left?
> 

distance_ref_point_depth is provided as a different property and is readily
available at boot. The new api will use just use that. So based on the
distance_ref_point_depth, we know all possible node distances for that
platform.

For an offline node, we don't have that specific nodes distance_lookup_table
array entries. Each array would be of distance_ref_point_depth entries.
Without the distance_lookup_table for an array populated, we will not be
able to tell how far the node is with respect to other nodes.

We can lookup the correct distance_lookup_table for a node based on memory
or the CPUs attached to that node. Since in an offline node, both of them
would not be around, the distance_lookup_table will have stale values.

> The example provided above does not suggest there's much room for
> alternatives, and hence for actual need of this new interface.
>
Valentin Schneider May 24, 2021, 2:16 p.m. UTC | #5
On 21/05/21 14:58, Srikar Dronamraju wrote:
> * Peter Zijlstra <peterz@infradead.org> [2021-05-21 10:14:10]:
>
>> On Fri, May 21, 2021 at 08:08:02AM +0530, Srikar Dronamraju wrote:
>> > * Peter Zijlstra <peterz@infradead.org> [2021-05-20 20:56:31]:
>> >
>> > > On Thu, May 20, 2021 at 09:14:25PM +0530, Srikar Dronamraju wrote:
>> > > > Currently scheduler populates the distance map by looking at distance
>> > > > of each node from all other nodes. This should work for most
>> > > > architectures and platforms.
>> > > >
>> > > > However there are some architectures like POWER that may not expose
>> > > > the distance of nodes that are not yet onlined because those resources
>> > > > are not yet allocated to the OS instance. Such architectures have
>> > > > other means to provide valid distance data for the current platform.
>> > > >
>> > > > For example distance info from numactl from a fully populated 8 node
>> > > > system at boot may look like this.
>> > > >
>> > > > node distances:
>> > > > node   0   1   2   3   4   5   6   7
>> > > >   0:  10  20  40  40  40  40  40  40
>> > > >   1:  20  10  40  40  40  40  40  40
>> > > >   2:  40  40  10  20  40  40  40  40
>> > > >   3:  40  40  20  10  40  40  40  40
>> > > >   4:  40  40  40  40  10  20  40  40
>> > > >   5:  40  40  40  40  20  10  40  40
>> > > >   6:  40  40  40  40  40  40  10  20
>> > > >   7:  40  40  40  40  40  40  20  10
>> > > >
>> > > > However the same system when only two nodes are online at boot, then the
>> > > > numa topology will look like
>> > > > node distances:
>> > > > node   0   1
>> > > >   0:  10  20
>> > > >   1:  20  10
>> > > >
>> > > > It may be implementation dependent on what node_distance(0,3) where
>> > > > node 0 is online and node 3 is offline. In POWER case, it returns
>> > > > LOCAL_DISTANCE(10). Here at boot the scheduler would assume that the max
>> > > > distance between nodes is 20. However that would not be true.
>> > > >
>> > > > When Nodes are onlined and CPUs from those nodes are hotplugged,
>> > > > the max node distance would be 40.
>> > > >
>> > > > To handle such scenarios, let scheduler allow architectures to populate
>> > > > the distance map. Architectures that like to populate the distance map
>> > > > can overload arch_populate_distance_map().
>> > >
>> > > Why? Why can't your node_distance() DTRT? The arch interface is
>> > > nr_node_ids and node_distance(), I don't see why we need something new
>> > > and then replace one special use of it.
>> > >
>> > > By virtue of you being able to actually implement this new hook, you
>> > > supposedly can actually do node_distance() right too.
>> >
>> > Since for an offline node, arch interface code doesn't have the info.
>> > As far as I know/understand, in POWER, unless there is an active memory or
>> > CPU that's getting onlined, arch can't fetch the correct node distance.
>> >
>> > Taking the above example: node 3 is offline, then node_distance of (3,X)
>> > where X is anything other than 3, is not reliable. The moment node 3 is
>> > onlined, the node distance is reliable.
>> >
>> > This problem will not happen even on POWER if all the nodes have either
>> > memory or CPUs active at the time of boot.
>>
>> But then how can you implement this new hook? Going by the fact that
>> both nr_node_ids and distance_ref_points_depth are fixed, how many
>> possible __node_distance() configurations are there left?
>>
>
> distance_ref_point_depth is provided as a different property and is readily
> available at boot. The new api will use just use that. So based on the
> distance_ref_point_depth, we know all possible node distances for that
> platform.
>
> For an offline node, we don't have that specific nodes distance_lookup_table
> array entries. Each array would be of distance_ref_point_depth entries.
> Without the distance_lookup_table for an array populated, we will not be
> able to tell how far the node is with respect to other nodes.
>
> We can lookup the correct distance_lookup_table for a node based on memory
> or the CPUs attached to that node. Since in an offline node, both of them
> would not be around, the distance_lookup_table will have stale values.
>

Ok so from your arch you can figure out the *size* of the set of unique
distances, but not the individual node_distance(a, b)... That's quite
unfortunate.

I suppose one way to avoid the hook would be to write some "fake" distance
values into your distance_lookup_table[] for offline nodes using your
distance_ref_point_depth thing, i.e. ensure an iteration of
node_distance(a, b) covers all distance values [1]. You can then keep patch
3 around, and that should roughly be it.


>> The example provided above does not suggest there's much room for
>> alternatives, and hence for actual need of this new interface.
>>
>
> --
> Thanks and Regards
> Srikar Dronamraju
Srikar Dronamraju May 24, 2021, 4:18 p.m. UTC | #6
* Valentin Schneider <valentin.schneider@arm.com> [2021-05-24 15:16:09]:

> On 21/05/21 14:58, Srikar Dronamraju wrote:
> > * Peter Zijlstra <peterz@infradead.org> [2021-05-21 10:14:10]:
> >
> >> On Fri, May 21, 2021 at 08:08:02AM +0530, Srikar Dronamraju wrote:
> >> > * Peter Zijlstra <peterz@infradead.org> [2021-05-20 20:56:31]:
> >> >
> >> > > On Thu, May 20, 2021 at 09:14:25PM +0530, Srikar Dronamraju wrote:
> >> > > > Currently scheduler populates the distance map by looking at distance
> >> > > > of each node from all other nodes. This should work for most
> >> > > > architectures and platforms.
> >> > > >
> >> > > > However there are some architectures like POWER that may not expose
> >> > > > the distance of nodes that are not yet onlined because those resources
> >> > > > are not yet allocated to the OS instance. Such architectures have
> >> > > > other means to provide valid distance data for the current platform.
> >> > > >
> >> > > > For example distance info from numactl from a fully populated 8 node
> >> > > > system at boot may look like this.
> >> > > >
> >> > > > node distances:
> >> > > > node   0   1   2   3   4   5   6   7
> >> > > >   0:  10  20  40  40  40  40  40  40
> >> > > >   1:  20  10  40  40  40  40  40  40
> >> > > >   2:  40  40  10  20  40  40  40  40
> >> > > >   3:  40  40  20  10  40  40  40  40
> >> > > >   4:  40  40  40  40  10  20  40  40
> >> > > >   5:  40  40  40  40  20  10  40  40
> >> > > >   6:  40  40  40  40  40  40  10  20
> >> > > >   7:  40  40  40  40  40  40  20  10
> >> > > >
> >> > > > However the same system when only two nodes are online at boot, then the
> >> > > > numa topology will look like
> >> > > > node distances:
> >> > > > node   0   1
> >> > > >   0:  10  20
> >> > > >   1:  20  10
> >> > > >
> >> > > > It may be implementation dependent on what node_distance(0,3) where
> >> > > > node 0 is online and node 3 is offline. In POWER case, it returns
> >> > > > LOCAL_DISTANCE(10). Here at boot the scheduler would assume that the max
> >> > > > distance between nodes is 20. However that would not be true.
> >> > > >
> >> > > > When Nodes are onlined and CPUs from those nodes are hotplugged,
> >> > > > the max node distance would be 40.
> >> > > >
> >> > > > To handle such scenarios, let scheduler allow architectures to populate
> >> > > > the distance map. Architectures that like to populate the distance map
> >> > > > can overload arch_populate_distance_map().
> >> > >
> >> > > Why? Why can't your node_distance() DTRT? The arch interface is
> >> > > nr_node_ids and node_distance(), I don't see why we need something new
> >> > > and then replace one special use of it.
> >> > >
> >> > > By virtue of you being able to actually implement this new hook, you
> >> > > supposedly can actually do node_distance() right too.
> >> >
> >> > Since for an offline node, arch interface code doesn't have the info.
> >> > As far as I know/understand, in POWER, unless there is an active memory or
> >> > CPU that's getting onlined, arch can't fetch the correct node distance.
> >> >
> >> > Taking the above example: node 3 is offline, then node_distance of (3,X)
> >> > where X is anything other than 3, is not reliable. The moment node 3 is
> >> > onlined, the node distance is reliable.
> >> >
> >> > This problem will not happen even on POWER if all the nodes have either
> >> > memory or CPUs active at the time of boot.
> >>
> >> But then how can you implement this new hook? Going by the fact that
> >> both nr_node_ids and distance_ref_points_depth are fixed, how many
> >> possible __node_distance() configurations are there left?
> >>
> >
> > distance_ref_point_depth is provided as a different property and is readily
> > available at boot. The new api will use just use that. So based on the
> > distance_ref_point_depth, we know all possible node distances for that
> > platform.
> >
> > For an offline node, we don't have that specific nodes distance_lookup_table
> > array entries. Each array would be of distance_ref_point_depth entries.
> > Without the distance_lookup_table for an array populated, we will not be
> > able to tell how far the node is with respect to other nodes.
> >
> > We can lookup the correct distance_lookup_table for a node based on memory
> > or the CPUs attached to that node. Since in an offline node, both of them
> > would not be around, the distance_lookup_table will have stale values.
> >
> 
> Ok so from your arch you can figure out the *size* of the set of unique
> distances, but not the individual node_distance(a, b)... That's quite
> unfortunate.

Yes, thats true.

> 
> I suppose one way to avoid the hook would be to write some "fake" distance
> values into your distance_lookup_table[] for offline nodes using your
> distance_ref_point_depth thing, i.e. ensure an iteration of
> node_distance(a, b) covers all distance values [1]. You can then keep patch
> 3 around, and that should roughly be it.
> 

Yes, this would suffice but to me its not very clean.
static int found[distance_ref_point_depth];

for_each_node(node){
	int i, nd, distance = LOCAL_DISTANCE;
		goto out;

	nd = node_distance(node, first_online_node)
	for (i=0; i < distance_ref_point_depth; i++, distance *= 2) {
		if (node_online) {
			if (distance != nd)
				continue;
			found[i] ++;
			break;
		}
		if (found[i])
			continue;
		distance_lookup_table[node][i] = distance_lookup_table[first_online_node][i];
		found[i] ++;
		break;
	}
}

But do note: We are setting a precedent for node distance between two nodes
to change.


> 
> >> The example provided above does not suggest there's much room for
> >> alternatives, and hence for actual need of this new interface.
> >>
> >
> > --
> > Thanks and Regards
> > Srikar Dronamraju
Valentin Schneider May 25, 2021, 10:21 a.m. UTC | #7
On 24/05/21 21:48, Srikar Dronamraju wrote:
> * Valentin Schneider <valentin.schneider@arm.com> [2021-05-24 15:16:09]:
>> Ok so from your arch you can figure out the *size* of the set of unique
>> distances, but not the individual node_distance(a, b)... That's quite
>> unfortunate.
>
> Yes, thats true.
>
>>
>> I suppose one way to avoid the hook would be to write some "fake" distance
>> values into your distance_lookup_table[] for offline nodes using your
>> distance_ref_point_depth thing, i.e. ensure an iteration of
>> node_distance(a, b) covers all distance values [1]. You can then keep patch
>> 3 around, and that should roughly be it.
>>
>
> Yes, this would suffice but to me its not very clean.
> static int found[distance_ref_point_depth];
>
> for_each_node(node){
>       int i, nd, distance = LOCAL_DISTANCE;
>               goto out;
>
>       nd = node_distance(node, first_online_node)
>       for (i=0; i < distance_ref_point_depth; i++, distance *= 2) {
>               if (node_online) {
>                       if (distance != nd)
>                               continue;
>                       found[i] ++;
>                       break;
>               }
>               if (found[i])
>                       continue;
>               distance_lookup_table[node][i] = distance_lookup_table[first_online_node][i];
>               found[i] ++;
>               break;
>       }
> }
>
> But do note: We are setting a precedent for node distance between two nodes
> to change.
>

Indeed. AFAICT it's that or the unique-distance-values hook :/
Srikar Dronamraju May 25, 2021, 11:32 a.m. UTC | #8
* Valentin Schneider <valentin.schneider@arm.com> [2021-05-25 11:21:02]:

> On 24/05/21 21:48, Srikar Dronamraju wrote:
> > * Valentin Schneider <valentin.schneider@arm.com> [2021-05-24 15:16:09]:
> >> Ok so from your arch you can figure out the *size* of the set of unique
> >> distances, but not the individual node_distance(a, b)... That's quite
> >> unfortunate.
> >
> > Yes, thats true.
> >
> >>
> >> I suppose one way to avoid the hook would be to write some "fake" distance
> >> values into your distance_lookup_table[] for offline nodes using your
> >> distance_ref_point_depth thing, i.e. ensure an iteration of
> >> node_distance(a, b) covers all distance values [1]. You can then keep patch
> >> 3 around, and that should roughly be it.
> >>
> >
> > Yes, this would suffice but to me its not very clean.
> > static int found[distance_ref_point_depth];
> >
> > for_each_node(node){
> >       int i, nd, distance = LOCAL_DISTANCE;
> >               goto out;
> >
> >       nd = node_distance(node, first_online_node)
> >       for (i=0; i < distance_ref_point_depth; i++, distance *= 2) {
> >               if (node_online) {
> >                       if (distance != nd)
> >                               continue;
> >                       found[i] ++;
> >                       break;
> >               }
> >               if (found[i])
> >                       continue;
> >               distance_lookup_table[node][i] = distance_lookup_table[first_online_node][i];
> >               found[i] ++;
> >               break;
> >       }
> > }
> >
> > But do note: We are setting a precedent for node distance between two nodes
> > to change.
> >
> 
> Indeed. AFAICT it's that or the unique-distance-values hook :/

Peter,

Please let me know which approach would you prefer.
I am open to try any other approach too.

In my humble opinion, unique-distance-values hook is more cleaner.
Do you still have any concerns with the unique-distance-values hook?
Srikar Dronamraju May 28, 2021, 5:21 a.m. UTC | #9
* Valentin Schneider <valentin.schneider@arm.com> [2021-05-25 11:21:02]:

> On 24/05/21 21:48, Srikar Dronamraju wrote:
> > * Valentin Schneider <valentin.schneider@arm.com> [2021-05-24 15:16:09]:
> >> Ok so from your arch you can figure out the *size* of the set of unique
> >> distances, but not the individual node_distance(a, b)... That's quite
> >> unfortunate.
> >
> > Yes, thats true.
> >
> >>
> >> I suppose one way to avoid the hook would be to write some "fake" distance
> >> values into your distance_lookup_table[] for offline nodes using your
> >> distance_ref_point_depth thing, i.e. ensure an iteration of
> >> node_distance(a, b) covers all distance values [1]. You can then keep patch
> >> 3 around, and that should roughly be it.
> >>
> >
> > Yes, this would suffice but to me its not very clean.
> > static int found[distance_ref_point_depth];
> >
> > for_each_node(node){
> >       int i, nd, distance = LOCAL_DISTANCE;
> >               goto out;
> >
> >       nd = node_distance(node, first_online_node)
> >       for (i=0; i < distance_ref_point_depth; i++, distance *= 2) {
> >               if (node_online) {
> >                       if (distance != nd)
> >                               continue;
> >                       found[i] ++;
> >                       break;
> >               }
> >               if (found[i])
> >                       continue;
> >               distance_lookup_table[node][i] = distance_lookup_table[first_online_node][i];
> >               found[i] ++;
> >               break;
> >       }
> > }
> >
> > But do note: We are setting a precedent for node distance between two nodes
> > to change.
> >
> 
> Indeed. AFAICT it's that or the unique-distance-values hook :/

Peter, Valentin, Michael,

Can you please let me know which approach you would want me to follow.

Or do let me know any other alternative solutions that you would want me to
try.
Peter Zijlstra May 28, 2021, 8:43 a.m. UTC | #10
On Mon, May 24, 2021 at 09:48:29PM +0530, Srikar Dronamraju wrote:
> * Valentin Schneider <valentin.schneider@arm.com> [2021-05-24 15:16:09]:

> > I suppose one way to avoid the hook would be to write some "fake" distance
> > values into your distance_lookup_table[] for offline nodes using your
> > distance_ref_point_depth thing, i.e. ensure an iteration of
> > node_distance(a, b) covers all distance values [1]. You can then keep patch
> > 3 around, and that should roughly be it.
> > 
> 
> Yes, this would suffice but to me its not very clean.
> static int found[distance_ref_point_depth];
> 
> for_each_node(node){
> 	int i, nd, distance = LOCAL_DISTANCE;
> 		goto out;
> 
> 	nd = node_distance(node, first_online_node)
> 	for (i=0; i < distance_ref_point_depth; i++, distance *= 2) {
> 		if (node_online) {
> 			if (distance != nd)
> 				continue;
> 			found[i] ++;
> 			break;
> 		}
> 		if (found[i])
> 			continue;
> 		distance_lookup_table[node][i] = distance_lookup_table[first_online_node][i];
> 		found[i] ++;
> 		break;
> 	}
> }
> 
> But do note: We are setting a precedent for node distance between two nodes
> to change.

Not really; or rather not more than already is the case AFAICT. Because
currently your distance table will have *something* in it
(LOCAL_DISTANCE afaict) for nodes that have never been online, which is
what triggered the whole problem to begin with.

Only after the node has come online for the first time, will it contain
the right value.

So both before and after this proposal the actual distance value changes
after the first time a node goes online.

Yes that's unfortunate, but I don't see a problem with pre-filling it
with something useful in order to avoid aditional arch hooks.
Srikar Dronamraju May 28, 2021, 10:24 a.m. UTC | #11
* Peter Zijlstra <peterz@infradead.org> [2021-05-28 10:43:23]:

> On Mon, May 24, 2021 at 09:48:29PM +0530, Srikar Dronamraju wrote:
> > * Valentin Schneider <valentin.schneider@arm.com> [2021-05-24 15:16:09]:
> 
> > > I suppose one way to avoid the hook would be to write some "fake" distance
> > > values into your distance_lookup_table[] for offline nodes using your
> > > distance_ref_point_depth thing, i.e. ensure an iteration of
> > > node_distance(a, b) covers all distance values [1]. You can then keep patch
> > > 3 around, and that should roughly be it.
> > > 
> > 
> > Yes, this would suffice but to me its not very clean.
> > static int found[distance_ref_point_depth];
> > 
> > for_each_node(node){
> > 	int i, nd, distance = LOCAL_DISTANCE;
> > 		goto out;
> > 
> > 	nd = node_distance(node, first_online_node)
> > 	for (i=0; i < distance_ref_point_depth; i++, distance *= 2) {
> > 		if (node_online) {
> > 			if (distance != nd)
> > 				continue;
> > 			found[i] ++;
> > 			break;
> > 		}
> > 		if (found[i])
> > 			continue;
> > 		distance_lookup_table[node][i] = distance_lookup_table[first_online_node][i];
> > 		found[i] ++;
> > 		break;
> > 	}
> > }
> > 
> > But do note: We are setting a precedent for node distance between two nodes
> > to change.
> 
> Not really; or rather not more than already is the case AFAICT. Because
> currently your distance table will have *something* in it
> (LOCAL_DISTANCE afaict) for nodes that have never been online, which is
> what triggered the whole problem to begin with.
> 
> Only after the node has come online for the first time, will it contain
> the right value.
> 
> So both before and after this proposal the actual distance value changes
> after the first time a node goes online.
> 
> Yes that's unfortunate, but I don't see a problem with pre-filling it
> with something useful in order to avoid aditional arch hooks.
> 
> 

Okay,

Will post a v2 with prefilling.
Thanks for the update.

Patch
diff mbox series

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 053115b55f89..ccb9aff59add 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1630,6 +1630,26 @@  static void init_numa_topology_type(void)
 
 #define NR_DISTANCE_VALUES (1 << DISTANCE_BITS)
 
+#ifndef arch_populate_distance_map
+static int arch_populate_distance_map(unsigned long *distance_map)
+{
+	int i, j;
+
+	for (i = 0; i < nr_node_ids; i++) {
+		for (j = 0; j < nr_node_ids; j++) {
+			int distance = node_distance(i, j);
+
+			if (distance < LOCAL_DISTANCE || distance >= NR_DISTANCE_VALUES) {
+				sched_numa_warn("Invalid distance value range");
+				return -1;
+			}
+			bitmap_set(distance_map, distance, 1);
+		}
+	}
+	return 0;
+}
+#endif
+
 void sched_init_numa(void)
 {
 	struct sched_domain_topology_level *tl;
@@ -1646,18 +1666,10 @@  void sched_init_numa(void)
 		return;
 
 	bitmap_zero(distance_map, NR_DISTANCE_VALUES);
-	for (i = 0; i < nr_node_ids; i++) {
-		for (j = 0; j < nr_node_ids; j++) {
-			int distance = node_distance(i, j);
 
-			if (distance < LOCAL_DISTANCE || distance >= NR_DISTANCE_VALUES) {
-				sched_numa_warn("Invalid distance value range");
-				return;
-			}
+	if (arch_populate_distance_map(distance_map))
+		return;
 
-			bitmap_set(distance_map, distance, 1);
-		}
-	}
 	/*
 	 * We can now figure out how many unique distance values there are and
 	 * allocate memory accordingly.