On 2022/4/14 6:04, Andrew Morton wrote:

On Wed, 13 Apr 2022 14:27:54 +0800 "liupeng (DM)" <liupeng256@huawei.com> wrote:

On 2022/4/13 12:42, Andrew Morton wrote:

On Wed, 13 Apr 2022 03:29:12 +0000 Peng Liu <liupeng256@huawei.com> wrote:

Certain systems are designed to have sparse/discontiguous nodes. In
this case, nr_online_nodes cannot be used to walk through the NUMA nodes.
Also, a valid node ID may be greater than or equal to nr_online_nodes.

However, in hugetlb, it is assumed that nodes are contiguous. Recheck
all the places that use nr_online_nodes, and repair them one by one.

What are the runtime effects of this shortcoming?

For sparse/discontiguous nodes, the current code may treat a valid node
as invalid, and will fail to allocate hugepages on a valid node whose
"nid >= nr_online_nodes".

As David suggested:
if (tmp >= nr_online_nodes)
	goto invalid;

Just imagine node 0 and node 2 are online, and node 1 is offline. Assuming
that "node < 2" is valid is wrong.
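
To make the node 0 / node 2 case concrete, here is a minimal userspace
sketch (not kernel code). struct node_state, mark_online(),
valid_by_count() and valid_by_mask() are hypothetical names made up for
this illustration. valid_by_count() has the same shape as the
"tmp >= nr_online_nodes" check quoted above, and valid_by_mask() stands in
for testing the online mask, roughly what the kernel's node_online() does.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NUMNODES 8	/* hypothetical upper bound for this sketch */

/* Userspace stand-in for the kernel's online-node bookkeeping. */
struct node_state {
	bool online[MAX_NUMNODES];
	unsigned int nr_online;	/* like nr_online_nodes: a count, not a max id */
};

static void mark_online(struct node_state *s, unsigned int nid)
{
	assert(nid < MAX_NUMNODES);
	if (!s->online[nid]) {
		s->online[nid] = true;
		s->nr_online++;
	}
}

/* Count-based check: only correct when node ids are contiguous from 0. */
static bool valid_by_count(const struct node_state *s, unsigned int nid)
{
	return nid < s->nr_online;
}

/* Mask-based check: asks whether this particular node id is online. */
static bool valid_by_mask(const struct node_state *s, unsigned int nid)
{
	return nid < MAX_NUMNODES && s->online[nid];
}
```

With nodes 0 and 2 marked online, nr_online is 2, so valid_by_count()
accepts the offline node 1 and rejects the online node 2, while
valid_by_mask() gets both right.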

So do you think we should backport this fix into earlier kernel releases?

I think it is not an urgent bug, because:
1) QEMU does not support sparse nodes so far, although there are some
sparse-node issues open to make QEMU support them.
2) So far I have not found an actual machine that reports sparse nodes and
needs to use hugepages.