Re: [PATCH bpf] bpf: Support for setting numa node in bpf memory allocator

From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
To: Hou Tao <houtao@huaweicloud.com>
Cc: Hao Luo <haoluo@google.com>,
	bpf@vger.kernel.org, Alexei Starovoitov <ast@kernel.org>,
	Martin KaFai Lau <martin.lau@linux.dev>,
	Andrii Nakryiko <andrii@kernel.org>, Song Liu <song@kernel.org>,
	Yonghong Song <yhs@fb.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	KP Singh <kpsingh@kernel.org>,
	Stanislav Fomichev <sdf@google.com>, Jiri Olsa <jolsa@kernel.org>,
	John Fastabend <john.fastabend@gmail.com>,
	houtao1@huawei.com
Subject: Re: [PATCH bpf] bpf: Support for setting numa node in bpf memory allocator
Date: Thu, 20 Oct 2022 18:48:07 -0700	[thread overview]
Message-ID: <20221021014807.pvjppg433lucybui@macbook-pro-4.dhcp.thefacebook.com> (raw)
In-Reply-To: <212fbd46-7371-c3f9-e900-3a49d9fafab8@huaweicloud.com>

On Fri, Oct 21, 2022 at 09:43:08AM +0800, Hou Tao wrote:
> Hi,
> 
> On 10/21/2022 2:01 AM, Hao Luo wrote:
> > On Thu, Oct 20, 2022 at 6:57 AM Hou Tao <houtao@huaweicloud.com> wrote:
> >> From: Hou Tao <houtao1@huawei.com>
> >>
> >> Since commit fba1a1c6c912 ("bpf: Convert hash map to bpf_mem_alloc."),
> >> numa node setting for non-preallocated hash table is ignored. The reason
> >> is that bpf memory allocator only supports NUMA_NO_NODE, but it seems it
> >> is trivial to support numa node setting for bpf memory allocator.
> >>
> >> So adding support for setting numa node in bpf memory allocator and
> >> updating hash map accordingly.
> >>
> >> Fixes: fba1a1c6c912 ("bpf: Convert hash map to bpf_mem_alloc.")
> >> Signed-off-by: Hou Tao <houtao1@huawei.com>
> >> ---
> > Looks good to me with a few nits.
> >
> > <...>
> >> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> >> index fc116cf47d24..44c531ba9534 100644
> >> --- a/kernel/bpf/memalloc.c
> >> +++ b/kernel/bpf/memalloc.c
> > <...>
> >> +static inline bool is_valid_numa_node(int numa_node, bool percpu)
> >> +{
> >> +       return numa_node == NUMA_NO_NODE ||
> >> +              (!percpu && (unsigned int)numa_node < nr_node_ids);
> > Maybe also check node_online? There is a similar helper function in
> > kernel/bpf/syscall.c.
> Will factor out as a helper function and use it in bpf memory allocator in v2.
> >
> > It may help debugging if we could log the reason here, for example,
> > PERCPU map but with numa_node specified.
> Not sure about it, because the validity check must have been done in map related
> code.
> 
> >
> >> +}
> >> +
> >> +/* The initial prefill is running in the context of map creation process, so
> >> + * if the preferred numa node is NUMA_NO_NODE, needs to use numa node of the
> >> + * specific cpu instead.
> >> + */
> >> +static inline int get_prefill_numa_node(int numa_node, int cpu)
> >> +{
> >> +       int prefill_numa_node;
> >> +
> >> +       if (numa_node == NUMA_NO_NODE)
> >> +               prefill_numa_node = cpu_to_node(cpu);
> >> +       else
> >> +               prefill_numa_node = numa_node;
> >> +       return prefill_numa_node;
> >>  }
> > nit: an alternative implementation is
> >
> >  return numa_node == NUMA_NO_NODE ? cpu_to_node(cpu) : numa_node;
> It is shorter and better. Will do it in v2.
> >
> >>  /* When size != 0 bpf_mem_cache for each cpu.
> >> @@ -359,13 +383,17 @@ static void prefill_mem_cache(struct bpf_mem_cache *c, int cpu)
> >>   * kmalloc/kfree. Max allocation size is 4096 in this case.
> >>   * This is bpf_dynptr and bpf_kptr use case.
> >>   */
> > We added a parameter to this function, I think it is worth mentioning
> > the 'numa_node' argument's behavior under different values of
> > 'percpu'.
> How about the following comments ?
> 
>  * For per-cpu allocator (percpu=true), the only valid value of numa_node is
>  * NUMA_NO_NODE. For non-per-cpu allocator, if numa_node is NUMA_NO_NODE, the
>  * preferred memory allocation node is the numa node where the allocating CPU
>  * is located, else the preferred node is the specified numa_node.

No. This patch doesn't make sense to me.
As far as I can see it can only make things worse.
Why would you want a cpu to use non local memory?

The commit log:
" is that bpf memory allocator only supports NUMA_NO_NODE, but it seems it
  is trivial to support numa node setting for bpf memory allocator."
got it wrong.

See the existing comment:
                /* irq_work runs on this cpu and kmalloc will allocate
                 * from the current numa node which is what we want here.
                 */
                alloc_bulk(c, c->batch, NUMA_NO_NODE);