Re: [1/2 bpf-next] bpf: expose net_device from xdp for metadata

From: Yonghong Song <yhs@meta.com>
To: John Fastabend <john.fastabend@gmail.com>,
	hawk@kernel.org, daniel@iogearbox.net, kuba@kernel.org,
	davem@davemloft.net, ast@kernel.org
Cc: netdev@vger.kernel.org, bpf@vger.kernel.org, sdf@google.com
Subject: Re: [1/2 bpf-next] bpf: expose net_device from xdp for metadata
Date: Thu, 10 Nov 2022 22:34:27 -0800
Message-ID: <86af974c-a970-863f-53f5-c57ebba9754e@meta.com>
In-Reply-To: <636d853a8d59_15505d20826@john.notmuch>

On 11/10/22 3:11 PM, John Fastabend wrote:
> John Fastabend wrote:
>> Yonghong Song wrote:
>>>
>>>
>>> On 11/9/22 6:17 PM, John Fastabend wrote:
>>>> Yonghong Song wrote:
>>>>>
>>>>>
>>>>> On 11/9/22 1:52 PM, John Fastabend wrote:
>>>>>> Allow xdp progs to read the net_device structure. It's useful to extract
>>>>>> info from the dev itself. Currently, our tracing tooling uses kprobes
>>>>>> to capture statistics and information about running net devices. We use
>>>>>> kprobes instead of other hooks (tc/xdp) because we need to collect
>>>>>> information about the interface not exposed through the xdp_md structures.
>>>>>> This has some downsides that we want to avoid by moving these into the
>>>>>> XDP hook itself. First, placing the kprobes in a generic function in
>>>>>> the kernel is after XDP, so we miss redirects and such done by the
>>>>>> XDP networking program. Second, it's needless overhead: we are
>>>>>> already paying the cost of calling the XDP program, so calling yet
>>>>>> another prog is a waste. Better to do everything in one hook from
>>>>>> a performance standpoint.
>>>>>>
>>>>>> Of course we could one-off each one of these fields, but that would
>>>>>> explode the xdp_md struct and then require writing convert_ctx_access
>>>>>> writers for each field. By using BTF we avoid writing field-specific
>>>>>> conversion logic: BTF just knows how to read the fields, we don't
>>>>>> have to add many fields to xdp_md, and I don't have to get every
>>>>>> field we will use in the future correct.
>>>>>>
>>>>>> For reference current examples in our code base use the ifindex,
>>>>>> ifname, qdisc stats, net_ns fields, among others. With this
>>>>>> patch we can now do the following,
>>>>>>
>>>>>> 	dev = ctx->rx_dev;
>>>>>> 	net = dev->nd_net.net;
>>>>>>
>>>>>> 	uid.ifindex = dev->ifindex;
>>>>>> 	memcpy(uid.ifname, dev->ifname, NAME);
>>>>>> 	if (net)
>>>>>> 		uid.inum = net->ns.inum;
>>>>>>
>>>>>> to report the name, index and ns.inum which identifies an
>>>>>> interface in our system.
>>>>>
>>>>> In
>>>>> https://lore.kernel.org/bpf/ad15b398-9069-4a0e-48cb-4bb651ec3088@meta.com/
>>>>> Namhyung Kim wanted to access new perf data with a helper.
>>>>> I proposed a helper bpf_get_kern_ctx() which will get
>>>>> the kernel ctx struct from which the actual perf data
>>>>> can be retrieved. The interface looks like
>>>>> 	void *bpf_get_kern_ctx(void *)
>>>>> The input parameter needs to be a PTR_TO_CTX and
>>>>> the verifier is able to return the corresponding kernel
>>>>> ctx struct based on program type.
>>>>>
>>>>> The following is a really hacked demonstration, with
>>>>> some of the changes coming from my bpf_rcu_read_lock()
>>>>> patch set https://lore.kernel.org/bpf/20221109211944.3213817-1-yhs@fb.com/
>>>>>
>>>>> I modified your test to utilize the
>>>>> bpf_get_kern_ctx() helper in your test_xdp_md.c.
>>>>>
>>>>> With this single helper, we can cover the above perf
>>>>> data use case and your use case and maybe others
>>>>> to avoid new UAPI changes.
>>>>
>>>> Hmm, I like the idea of just accessing the xdp_buff directly
>>>> instead of adding more fields. I'm less convinced of the
>>>> kfunc approach. What about a terminating field *self in the
>>>> xdp_md? Then we can use the existing convert_ctx_access to make
>>>> it BPF inlined, with no verifier changes needed.
>>>>
>>>> Something like this quickly typed up and not compiled, but
>>>> I think shows what I'm thinking.
>>>>
>>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>>> index 94659f6b3395..10ebd90d6677 100644
>>>> --- a/include/uapi/linux/bpf.h
>>>> +++ b/include/uapi/linux/bpf.h
>>>> @@ -6123,6 +6123,10 @@ struct xdp_md {
>>>>           __u32 rx_queue_index;  /* rxq->queue_index  */
>>>>    
>>>>           __u32 egress_ifindex;  /* txq->dev->ifindex */
>>>> +       /* Last xdp_md entry, for new types add directly to xdp_buff and use
>>>> +        * BTF access. Reading this gives BTF access to xdp_buff.
>>>> +        */
>>>> +       __bpf_md_ptr(struct xdp_buff *, self);
>>>>    };
>>>
>>> This would be the first instance to have a kernel internal struct
>>> in a uapi struct. Not sure whether this is a good idea or not.
>>
>> We can use probe_read from some of the socket progs already but
>> sure.
>>
>>>
>>>>    
>>>>    /* DEVMAP map-value layout
>>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>>> index bb0136e7a8e4..547e9576a918 100644
>>>> --- a/net/core/filter.c
>>>> +++ b/net/core/filter.c
>>>> @@ -9808,6 +9808,11 @@ static u32 xdp_convert_ctx_access(enum bpf_access_type type,
>>>>                   *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
>>>>                                         offsetof(struct net_device, ifindex));
>>>>                   break;
>>>> +       case offsetof(struct xdp_md, self):
>>>> +               *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff, self),
>>>> +                                     si->dst_reg, si->src_reg,
>>>> +                                     offsetof(struct xdp_buff, 0));
>>>> +               break;
>>>>           }
>>>>    
>>>>           return insn - insn_buf;
>>>>
>>>> Actually, even that single insn conversion is a bit unnecessary because
>>>> it should be enough to just change the type to the correct BTF_ID in the
>>>> verifier and omit any instructions. But it would be a bit confusing
>>>> on the C side. Might be a good use for passing 'cast' info through to
>>>> the verifier as an annotation so it could just do the BTF_ID cast for
>>>> us without any insns.
>>>
>>> We cannot change the context type to BTF_ID style, which would be a
>>> uapi violation.
>>
>> I don't think it would be a uapi violation if the user asks for it
>> by annotating the cast.
>>
>>>
>>> The helper I proposed can be rewritten by verifier as
>>> 	r0 = r1
>>> so we should not have overhead for this.
>>
>> Agreed, other than when reading the bpf asm, where it's a bit odd.
>>
>>> It covers all program types with known uapi ctx -> kern ctx
>>> conversions. So there is no need to change existing uapi structs.
>>> Also, I expect that most people probably won't use this kfunc.
>>> The existing uapi fields might already serve most needs.
>>
>> Maybe, not sure. We are missing some things we need.
>>
>>>
>>> Internally we have another use case to access some 'struct sock' fields
>>> but the uapi struct only has struct bpf_sock. Currently it is advised
>>> to use bpf_probe_read_kernel(...) to get the needed information.
>>> The proposed helper should help that too without uapi change.
>>
>> Yep.
>>
>> I'm fine doing it with bpf_get_kern_ctx(). Did you want me to code it
>> the rest of the way up and test it?
>>
>> .John
> 
> Related, I think. We also want to get the kernel variable net_namespace_list,
> which points to the list of network namespaces. Based on the above, should
> we do something like,
> 
>    void *bpf_get_kern_var(enum var_id);
> 
> then,
> 
>    net_ns_list = bpf_get_kern_var(__btf_net_namespace_list);
> 
> would get us a ptr to the list? The other thought was to put it in the
> xdp_md, but from the above it seems a better idea to get it through a helper.

Sounds great. I guess my newly proposed bpf_get_kern_btf_id() kfunc could
cover such a use case as well.
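
As a rough illustration of the bpf_get_kern_ctx() idea discussed above,
here is a user-space mock in plain C. It is not kernel or BPF code. All
mock_* names, struct uid, fill_uid() and the trimmed struct layouts are
made up for the example. The only point it tries to show is that the helper
hands back the same pointer with a kernel-side type, which is why the
verifier could rewrite the call as r0 = r1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, heavily trimmed stand-ins for kernel-internal structs. */
struct mock_net_device {
	char name[16];
	int ifindex;
};

struct mock_xdp_rxq_info {
	struct mock_net_device *dev;
	uint32_t queue_index;
};

struct mock_xdp_buff {
	void *data;
	void *data_end;
	struct mock_xdp_rxq_info *rxq;
};

/*
 * Stand-in for the uapi ctx type. It is left opaque on purpose: the
 * program only ever holds a pointer to it, and in this mock that pointer
 * is the address of the kernel-side mock_xdp_buff.
 */
struct mock_xdp_md;

/*
 * Mock of the proposed helper (assumption: same address in, same address
 * out, only the type changes). This is the "r0 = r1" behaviour.
 */
static struct mock_xdp_buff *mock_bpf_get_kern_ctx(struct mock_xdp_md *ctx)
{
	assert(ctx != NULL);
	return (struct mock_xdp_buff *)ctx;
}

/* Hypothetical record identifying an interface by index and name. */
struct uid {
	int ifindex;
	char ifname[16];
};

/* Walk ctx -> kernel ctx -> rxq -> dev and copy out the identifying fields. */
static void fill_uid(struct mock_xdp_md *ctx, struct uid *uid)
{
	struct mock_xdp_buff *xdp = mock_bpf_get_kern_ctx(ctx);
	struct mock_net_device *dev = xdp->rxq->dev;

	uid->ifindex = dev->ifindex;
	memcpy(uid->ifname, dev->name, sizeof(uid->ifname));
}
```

In a real program the field reads after the helper call would be
BTF-checked by the verifier rather than plain pointer dereferences, and no
new uapi field would be needed for any of them.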