Date: Tue, 13 Aug 2019 10:16:10 -0700
From: Stanislav Fomichev
To: Yonghong Song
Cc: Stanislav Fomichev, "netdev@vger.kernel.org", "bpf@vger.kernel.org",
	"davem@davemloft.net", "ast@kernel.org", "daniel@iogearbox.net",
	Martin Lau
Subject: Re: [PATCH bpf-next v3 2/4] bpf: support cloning sk storage on accept()
Message-ID: <20190813171610.GH2820@mini-arch>
References: <20190813162630.124544-1-sdf@google.com>
	<20190813162630.124544-3-sdf@google.com>
	<173b3736-97af-cad5-9432-8e6422a89d05@fb.com>
In-Reply-To: <173b3736-97af-cad5-9432-8e6422a89d05@fb.com>
User-Agent: Mutt/1.12.1 (2019-06-15)
X-Mailing-List: bpf@vger.kernel.org

On 08/13,
Yonghong Song wrote:
> 
> 
> On 8/13/19 9:26 AM, Stanislav Fomichev wrote:
> > Add new helper bpf_sk_storage_clone which optionally clones sk storage
> > and call it from sk_clone_lock.
> > 
> > Cc: Martin KaFai Lau
> > Cc: Yonghong Song
> > Acked-by: Yonghong Song
> > Signed-off-by: Stanislav Fomichev
> > ---
> >  include/net/bpf_sk_storage.h |  10 ++++
> >  include/uapi/linux/bpf.h     |   3 +
> >  net/core/bpf_sk_storage.c    | 103 ++++++++++++++++++++++++++++++++++-
> >  net/core/sock.c              |   9 ++-
> >  4 files changed, 119 insertions(+), 6 deletions(-)
> > 
> > diff --git a/include/net/bpf_sk_storage.h b/include/net/bpf_sk_storage.h
> > index b9dcb02e756b..8e4f831d2e52 100644
> > --- a/include/net/bpf_sk_storage.h
> > +++ b/include/net/bpf_sk_storage.h
> > @@ -10,4 +10,14 @@ void bpf_sk_storage_free(struct sock *sk);
> >  extern const struct bpf_func_proto bpf_sk_storage_get_proto;
> >  extern const struct bpf_func_proto bpf_sk_storage_delete_proto;
> >  
> > +#ifdef CONFIG_BPF_SYSCALL
> > +int bpf_sk_storage_clone(const struct sock *sk, struct sock *newsk);
> > +#else
> > +static inline int bpf_sk_storage_clone(const struct sock *sk,
> > +				       struct sock *newsk)
> > +{
> > +	return 0;
> > +}
> > +#endif
> > +
> >  #endif /* _BPF_SK_STORAGE_H */
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 4393bd4b2419..0ef594ac3899 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -337,6 +337,9 @@ enum bpf_attach_type {
> >  #define BPF_F_RDONLY_PROG	(1U << 7)
> >  #define BPF_F_WRONLY_PROG	(1U << 8)
> >  
> > +/* Clone map from listener for newly accepted socket */
> > +#define BPF_F_CLONE		(1U << 9)
> > +
> >  /* flags for BPF_PROG_QUERY */
> >  #define BPF_F_QUERY_EFFECTIVE	(1U << 0)
> >  
> > diff --git a/net/core/bpf_sk_storage.c b/net/core/bpf_sk_storage.c
> > index 94c7f77ecb6b..1bc7de7e18ba 100644
> > --- a/net/core/bpf_sk_storage.c
> > +++ b/net/core/bpf_sk_storage.c
> > @@ -12,6 +12,9 @@
> >  
> >  static atomic_t cache_idx;
> >  
> > +#define SK_STORAGE_CREATE_FLAG_MASK \
> > +	(BPF_F_NO_PREALLOC | BPF_F_CLONE)
> > +
> >  struct bucket {
> >  	struct hlist_head list;
> >  	raw_spinlock_t lock;
> > @@ -209,7 +212,6 @@ static void selem_unlink_sk(struct bpf_sk_storage_elem *selem)
> >  	kfree_rcu(sk_storage, rcu);
> >  }
> >  
> > -/* sk_storage->lock must be held and sk_storage->list cannot be empty */
> >  static void __selem_link_sk(struct bpf_sk_storage *sk_storage,
> >  			    struct bpf_sk_storage_elem *selem)
> >  {
> > @@ -509,7 +511,7 @@ static int sk_storage_delete(struct sock *sk, struct bpf_map *map)
> >  	return 0;
> >  }
> >  
> > -/* Called by __sk_destruct() */
> > +/* Called by __sk_destruct() & bpf_sk_storage_clone() */
> >  void bpf_sk_storage_free(struct sock *sk)
> >  {
> >  	struct bpf_sk_storage_elem *selem;
> > @@ -557,6 +559,11 @@ static void bpf_sk_storage_map_free(struct bpf_map *map)
> >  
> >  	smap = (struct bpf_sk_storage_map *)map;
> >  
> > +	/* Note that this map might be concurrently cloned from
> > +	 * bpf_sk_storage_clone. Wait for any existing bpf_sk_storage_clone
> > +	 * RCU read section to finish before proceeding. New RCU
> > +	 * read sections should be prevented via bpf_map_inc_not_zero.
> > +	 */
> >  	synchronize_rcu();
> >  
> >  	/* bpf prog and the userspace can no longer access this map
> > @@ -601,7 +608,9 @@ static void bpf_sk_storage_map_free(struct bpf_map *map)
> >  
> >  static int bpf_sk_storage_map_alloc_check(union bpf_attr *attr)
> >  {
> > -	if (attr->map_flags != BPF_F_NO_PREALLOC || attr->max_entries ||
> > +	if (attr->map_flags & ~SK_STORAGE_CREATE_FLAG_MASK ||
> > +	    !(attr->map_flags & BPF_F_NO_PREALLOC) ||
> > +	    attr->max_entries ||
> >  	    attr->key_size != sizeof(int) || !attr->value_size ||
> >  	    /* Enforce BTF for userspace sk dumping */
> >  	    !attr->btf_key_type_id || !attr->btf_value_type_id)
> > @@ -739,6 +748,94 @@ static int bpf_fd_sk_storage_delete_elem(struct bpf_map *map, void *key)
> >  	return err;
> >  }
> >  
> > +static struct bpf_sk_storage_elem *
> > +bpf_sk_storage_clone_elem(struct sock *newsk,
> > +			  struct bpf_sk_storage_map *smap,
> > +			  struct bpf_sk_storage_elem *selem)
> > +{
> > +	struct bpf_sk_storage_elem *copy_selem;
> > +
> > +	copy_selem = selem_alloc(smap, newsk, NULL, true);
> > +	if (!copy_selem)
> > +		return NULL;
> > +
> > +	if (map_value_has_spin_lock(&smap->map))
> > +		copy_map_value_locked(&smap->map, SDATA(copy_selem)->data,
> > +				      SDATA(selem)->data, true);
> > +	else
> > +		copy_map_value(&smap->map, SDATA(copy_selem)->data,
> > +			       SDATA(selem)->data);
> > +
> > +	return copy_selem;
> > +}
> > +
> > +int bpf_sk_storage_clone(const struct sock *sk, struct sock *newsk)
> > +{
> > +	struct bpf_sk_storage *new_sk_storage = NULL;
> > +	struct bpf_sk_storage *sk_storage;
> > +	struct bpf_sk_storage_elem *selem;
> > +	int ret;
> > +
> > +	RCU_INIT_POINTER(newsk->sk_bpf_storage, NULL);
> > +
> > +	rcu_read_lock();
> > +	sk_storage = rcu_dereference(sk->sk_bpf_storage);
> > +
> > +	if (!sk_storage || hlist_empty(&sk_storage->list))
> > +		goto out;
> > +
> > +	hlist_for_each_entry_rcu(selem, &sk_storage->list, snode) {
> > +		struct bpf_sk_storage_elem *copy_selem;
> > +		struct bpf_sk_storage_map *smap;
> > +		struct bpf_map *map;
> > +
> > +		smap = rcu_dereference(SDATA(selem)->smap);
> > +		if (!(smap->map.map_flags & BPF_F_CLONE))
> > +			continue;
> > +
> > +		map = bpf_map_inc_not_zero(&smap->map, false);
> > +		if (IS_ERR(map))
> > +			continue;
> > +
> > +		copy_selem = bpf_sk_storage_clone_elem(newsk, smap, selem);
> > +		if (!copy_selem) {
> > +			ret = -ENOMEM;
> > +			bpf_map_put(map);
> > +			goto err;
> > +		}
> > +
> > +		if (new_sk_storage) {
> > +			selem_link_map(smap, copy_selem);
> > +			__selem_link_sk(new_sk_storage, copy_selem);
> > +		} else {
> > +			ret = sk_storage_alloc(newsk, smap, copy_selem);
> > +			if (ret) {
> > +				kfree(copy_selem);
> > +				atomic_sub(smap->elem_size,
> > +					   &newsk->sk_omem_alloc);
> > +				bpf_map_put(map);
> > +				goto err;
> > +			}
> > +
> > +			new_sk_storage = rcu_dereference(copy_selem->sk_storage);
> > +		}
> > +		bpf_map_put(map);
> > +	}
> > +
> > +out:
> > +	rcu_read_unlock();
> > +	return 0;
> > +
> > +err:
> > +	rcu_read_unlock();
> > +
> > +	/* Don't free anything explicitly here, caller is responsible to
> > +	 * call bpf_sk_storage_free in case of an error.
> > +	 */
> > +
> > +	return ret;
> 
> A nit.
> If you set ret = 0 initially, you do not need the above two
> rcu_read_unlock(). One "return ret" should be enough.
> The comment can be changed to
> /* In case of an error, ... */
I thought about it, but decided to keep them separate because of this
comment. OTOH, adding "In case of an error," might help with that.
Can do another respin once Martin gets to review this version (don't
want to spam).
> > +}
> > +
> >  BPF_CALL_4(bpf_sk_storage_get, struct bpf_map *, map, struct sock *, sk,
> >  	   void *, value, u64, flags)
> >  {
> > diff --git a/net/core/sock.c b/net/core/sock.c
> > index d57b0cc995a0..f5e801a9cea4 100644
> > --- a/net/core/sock.c
> > +++ b/net/core/sock.c
> > @@ -1851,9 +1851,12 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
> >  		goto out;
> >  	}
> >  	RCU_INIT_POINTER(newsk->sk_reuseport_cb, NULL);
> > -#ifdef CONFIG_BPF_SYSCALL
> > -	RCU_INIT_POINTER(newsk->sk_bpf_storage, NULL);
> > -#endif
> > +
> > +	if (bpf_sk_storage_clone(sk, newsk)) {
> > +		sk_free_unlock_clone(newsk);
> > +		newsk = NULL;
> > +		goto out;
> > +	}
> >  
> >  	newsk->sk_err	   = 0;
> >  	newsk->sk_err_soft = 0;
> > 