From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
References: <20221107230950.7117-1-memxor@gmail.com> <20221107230950.7117-4-memxor@gmail.com>
In-Reply-To: <20221107230950.7117-4-memxor@gmail.com>
From: Andrii Nakryiko
Date: Tue, 8 Nov 2022 15:01:52 -0800
Subject: Re: [PATCH bpf-next v5 03/25] bpf: Support bpf_list_head in map values
To: Kumar Kartikeya Dwivedi
Cc: bpf@vger.kernel.org, Alexei Starovoitov, Andrii Nakryiko,
 Daniel Borkmann, Martin KaFai Lau, Dave Marchevsky, Delyan Kratunov
Content-Type: text/plain; charset="UTF-8"
X-Mailing-List: bpf@vger.kernel.org

On Mon, Nov 7, 2022 at 3:10 PM Kumar Kartikeya Dwivedi wrote:
>
> Add support on the map side to parse, recognize, verify, and build a
> metadata table for a new special field of type struct bpf_list_head.
> To parameterize the bpf_list_head for a certain value type and the
> list_node member it will accept in that value type, we use BTF
> declaration tags.
>
> The definition of bpf_list_head in a map value will be done as follows:
>
>   struct foo {
>           struct bpf_list_node node;
>           int data;
>   };
>
>   struct map_value {
>           struct bpf_list_head head __contains(foo, node);
>   };
>
> Then, the bpf_list_head only allows adding to the list 'head' using
> the bpf_list_node 'node' for the type struct foo.
>
> The 'contains' annotation is a BTF declaration tag composed of three
> parts, "contains:name:node", where the name is used to look up the
> type in the map BTF, with its kind hardcoded to BTF_KIND_STRUCT
> during the lookup. The node part names the member in that type which
> has the type struct bpf_list_node and is actually used for linking
> into the linked list. For now, the 'kind' part is not encoded in the
> tag and is hardcoded as struct.
>
> This allows building intrusive linked lists in BPF, using
> container_of to obtain a pointer to the entry, while being completely
> type safe from the perspective of the verifier. The verifier knows
> exactly the type of the nodes, and knows that list helpers return
> that type at some fixed offset where the bpf_list_node member used
> for this list exists. The verifier also uses this information to
> disallow adding types that are not accepted by a certain list.
>
> For now, no elements can be added to such lists. Support for that is
> coming in future patches, hence draining and freeing items is done
> with a TODO that will be resolved in a future patch.
>
> Note that the bpf_list_head_free function moves the list out to a
> local variable under the lock and releases it, doing the actual
> draining of the list items outside the lock. While this helps by not
> holding the lock for too long and pessimizing other concurrent list
> operations, it is also necessary for deadlock prevention: unless
> every function called in the critical section were notrace, a
> fentry/fexit program could attach and call bpf_map_update_elem again
> on the map, acquiring the same lock if the key matches and causing a
> deadlock. While this requires some special effort on the part of the
> BPF programmer to trigger and is highly unlikely to occur in
> practice, it is always better if we can avoid such a condition.
>
> While notrace would prevent this, doing the draining outside the lock
> has advantages of its own, hence it is used to also fix the
> deadlock-related problem.
>
> Signed-off-by: Kumar Kartikeya Dwivedi
> ---
>  include/linux/bpf.h            |  17 ++++
>  include/uapi/linux/bpf.h       |  10 +++
>  kernel/bpf/btf.c               | 143 ++++++++++++++++++++++++++++++++-
>  kernel/bpf/helpers.c           |  32 ++++++++
>  kernel/bpf/syscall.c           |  22 ++++-
>  kernel/bpf/verifier.c          |   7 ++
>  tools/include/uapi/linux/bpf.h |  10 +++
>  7 files changed, 237 insertions(+), 4 deletions(-)

[...]

>  struct bpf_offload_dev;
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 94659f6b3395..dd381086bad9 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -6887,6 +6887,16 @@ struct bpf_dynptr {
>  	__u64 :64;
>  } __attribute__((aligned(8)));
>
> +struct bpf_list_head {
> +	__u64 :64;
> +	__u64 :64;
> +} __attribute__((aligned(8)));
> +
> +struct bpf_list_node {
> +	__u64 :64;
> +	__u64 :64;
> +} __attribute__((aligned(8)));

Dave mentioned that this `__u64 :64` trick makes vmlinux.h lose the
alignment information: the struct itself looks empty in BTF, so there
is nothing indicating that it has to be 8-byte aligned. So what if we
have

struct bpf_list_node {
	__u64 __opaque[2];
} __attribute__((aligned(8)));

?

> +
>  struct bpf_sysctl {
>  	__u32 write;	/* Sysctl is being read (= 0) or written (= 1).
>  			 * Allows 1,2,4-byte read, but no write.
> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c

[...]
> @@ -3284,6 +3347,12 @@ static int btf_get_field_type(const char *name, u32 field_mask, u32 *seen_mask,
>  			goto end;
>  		}
>  	}
> +	if (field_mask & BPF_LIST_HEAD) {
> +		if (!strcmp(name, "bpf_list_head")) {
> +			type = BPF_LIST_HEAD;
> +			goto end;
> +		}
> +	}
>  	/* Only return BPF_KPTR when all other types with matchable names fail */
>  	if (field_mask & BPF_KPTR) {
>  		type = BPF_KPTR_REF;
> @@ -3317,6 +3386,8 @@ static int btf_find_struct_field(const struct btf *btf,
>  			return field_type;
>
>  		off = __btf_member_bit_offset(t, member);
> +		if (i && !off)
> +			return -EFAULT;

Why? Why can't my struct have a zero-sized field at the beginning?
This seems like a very incomplete and unnecessary check to me.

>  		if (off % 8)
>  			/* valid C code cannot generate such BTF */
>  			return -EINVAL;
> @@ -3339,6 +3410,12 @@ static int btf_find_struct_field(const struct btf *btf,
>  			if (ret < 0)
>  				return ret;
>  			break;
> +		case BPF_LIST_HEAD:
> +			ret = btf_find_list_head(btf, t, member_type, i, off, sz,
> +						 idx < info_cnt ? &info[idx] : &tmp);
> +			if (ret < 0)
> +				return ret;
> +			break;
>  		default:
>  			return -EFAULT;
>  		}
> @@ -3373,6 +3450,8 @@ static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
>  			return field_type;
>
>  		off = vsi->offset;
> +		if (i && !off)
> +			return -EFAULT;

Similarly, I'd say that we either need to calculate the exact expected
offset, or just not do anything here?

>  		if (vsi->size != sz)
>  			continue;
>  		if (off % align)
> @@ -3393,6 +3472,12 @@ static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
>  			if (ret < 0)
>  				return ret;
>  			break;
> +		case BPF_LIST_HEAD:
> +			ret = btf_find_list_head(btf, var, var_type, -1, off, sz,
> +						 idx < info_cnt ? &info[idx] : &tmp);
> +			if (ret < 0)
> +				return ret;
> +			break;
>  		default:
>  			return -EFAULT;
>  		}

[...]