From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
References: <20221107230950.7117-1-memxor@gmail.com> <20221107230950.7117-4-memxor@gmail.com>
In-Reply-To: <20221107230950.7117-4-memxor@gmail.com>
From: Andrii Nakryiko
Date: Tue, 8 Nov 2022 15:01:52 -0800
Subject: Re: [PATCH bpf-next v5 03/25] bpf: Support bpf_list_head in map values
To: Kumar Kartikeya Dwivedi
Cc: bpf@vger.kernel.org, Alexei Starovoitov, Andrii Nakryiko,
 Daniel Borkmann, Martin KaFai Lau, Dave Marchevsky, Delyan Kratunov
Content-Type: text/plain; charset="UTF-8"
X-Mailing-List: bpf@vger.kernel.org

On Mon, Nov 7, 2022 at 3:10 PM Kumar Kartikeya Dwivedi wrote:
>
> Add support on the map side to parse, recognize, verify, and build a
> metadata table for a new special field of type struct bpf_list_head.
> To parameterize the bpf_list_head for a certain value type and the
> list_node member it will accept in that value type, we use BTF
> declaration tags.
>
> The definition of bpf_list_head in a map value will be done as follows:
>
>   struct foo {
>           struct bpf_list_node node;
>           int data;
>   };
>
>   struct map_value {
>           struct bpf_list_head head __contains(foo, node);
>   };
>
> Then, the bpf_list_head only allows adding to the list 'head' using
> the bpf_list_node 'node' for the type struct foo.
>
> The 'contains' annotation is a BTF declaration tag composed of three
> parts, "contains:name:node", where the name is used to look up the
> type in the map BTF, with its kind hardcoded to BTF_KIND_STRUCT
> during the lookup. The node part names the member in that type which
> has the type struct bpf_list_node and is actually used for linking
> into the linked list. For now, the 'kind' part is not encoded in the
> tag and is hardcoded as struct.
>
> This allows building intrusive linked lists in BPF, using
> container_of to obtain a pointer to the entry, while being completely
> type safe from the perspective of the verifier. The verifier knows
> exactly the type of the nodes, and knows that list helpers return
> that type at some fixed offset where the bpf_list_node member used
> for this list exists. The verifier also uses this information to
> disallow adding types that are not accepted by a certain list.
>
> For now, no elements can be added to such lists. Support for that is
> coming in future patches, hence draining and freeing items is done
> with a TODO that will be resolved in a future patch.
>
> Note that the bpf_list_head_free function moves the list out to a
> local variable under the lock and releases it, doing the actual
> draining of the list items outside the lock. While this helps by not
> holding the lock for too long and pessimizing other concurrent list
> operations, it is also necessary for deadlock prevention: unless
> every function called in the critical section were notrace, a
> fentry/fexit program could attach and call bpf_map_update_elem again
> on the map, acquiring the same lock if the key matches and causing a
> deadlock. While this requires some special effort on the part of the
> BPF programmer to trigger and is highly unlikely to occur in
> practice, it is always better if we can avoid such a condition.
>
> While notrace would prevent this, doing the draining outside the lock
> has advantages of its own, hence it is used to also fix the
> deadlock-related problem.
>
> Signed-off-by: Kumar Kartikeya Dwivedi
> ---
>  include/linux/bpf.h            |  17 ++++
>  include/uapi/linux/bpf.h       |  10 +++
>  kernel/bpf/btf.c               | 143 ++++++++++++++++++++++++++++++++-
>  kernel/bpf/helpers.c           |  32 ++++++++
>  kernel/bpf/syscall.c           |  22 ++++-
>  kernel/bpf/verifier.c          |   7 ++
>  tools/include/uapi/linux/bpf.h |  10 +++
>  7 files changed, 237 insertions(+), 4 deletions(-)

[...]

>  struct bpf_offload_dev;
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 94659f6b3395..dd381086bad9 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -6887,6 +6887,16 @@ struct bpf_dynptr {
>  	__u64 :64;
>  } __attribute__((aligned(8)));
>
> +struct bpf_list_head {
> +	__u64 :64;
> +	__u64 :64;
> +} __attribute__((aligned(8)));
> +
> +struct bpf_list_node {
> +	__u64 :64;
> +	__u64 :64;
> +} __attribute__((aligned(8)));

Dave mentioned that this `__u64 :64` trick makes vmlinux.h lose the
alignment information: the struct itself looks empty in BTF, so there
is nothing indicating that it has to be 8-byte aligned. So what if we
have

struct bpf_list_node {
	__u64 __opaque[2];
} __attribute__((aligned(8)));

?

> +
>  struct bpf_sysctl {
>  	__u32 write;	/* Sysctl is being read (= 0) or written (= 1).
>  			 * Allows 1,2,4-byte read, but no write.
> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c

[...]
> @@ -3284,6 +3347,12 @@ static int btf_get_field_type(const char *name, u32 field_mask, u32 *seen_mask,
>  			goto end;
>  		}
>  	}
> +	if (field_mask & BPF_LIST_HEAD) {
> +		if (!strcmp(name, "bpf_list_head")) {
> +			type = BPF_LIST_HEAD;
> +			goto end;
> +		}
> +	}
>  	/* Only return BPF_KPTR when all other types with matchable names fail */
>  	if (field_mask & BPF_KPTR) {
>  		type = BPF_KPTR_REF;
> @@ -3317,6 +3386,8 @@ static int btf_find_struct_field(const struct btf *btf,
>  			return field_type;
>
>  		off = __btf_member_bit_offset(t, member);
> +		if (i && !off)
> +			return -EFAULT;

Why? Why can't my struct have a zero-sized field at the beginning?
This seems like a very incomplete and unnecessary check to me.

>  		if (off % 8)
>  			/* valid C code cannot generate such BTF */
>  			return -EINVAL;
> @@ -3339,6 +3410,12 @@ static int btf_find_struct_field(const struct btf *btf,
>  			if (ret < 0)
>  				return ret;
>  			break;
> +		case BPF_LIST_HEAD:
> +			ret = btf_find_list_head(btf, t, member_type, i, off, sz,
> +						 idx < info_cnt ? &info[idx] : &tmp);
> +			if (ret < 0)
> +				return ret;
> +			break;
>  		default:
>  			return -EFAULT;
>  		}
> @@ -3373,6 +3450,8 @@ static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
>  			return field_type;
>
>  		off = vsi->offset;
> +		if (i && !off)
> +			return -EFAULT;

Similarly, I'd say that we either need to calculate the exact expected
offset, or just not do anything here?

>  		if (vsi->size != sz)
>  			continue;
>  		if (off % align)
> @@ -3393,6 +3472,12 @@ static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
>  			if (ret < 0)
>  				return ret;
>  			break;
> +		case BPF_LIST_HEAD:
> +			ret = btf_find_list_head(btf, var, var_type, -1, off, sz,
> +						 idx < info_cnt ? &info[idx] : &tmp);
> +			if (ret < 0)
> +				return ret;
> +			break;
>  		default:
>  			return -EFAULT;
>  		}

[...]