From: Stanislav Fomichev
Date: Wed, 13 Jan 2021 11:08:30 -0800
Subject: Re: [PATCH bpf-next v7 3/4] bpf: try to avoid kzalloc in cgroup/{s,g}etsockopt
To: Martin KaFai Lau
Cc: Netdev, bpf, Alexei Starovoitov, Daniel Borkmann, Song Liu
References: <20210112223847.1915615-1-sdf@google.com> <20210112223847.1915615-4-sdf@google.com> <20210113190342.dzqylb6oqrkfhccv@kafai-mbp.dhcp.thefacebook.com>
In-Reply-To: <20210113190342.dzqylb6oqrkfhccv@kafai-mbp.dhcp.thefacebook.com>
X-Mailing-List: bpf@vger.kernel.org

On Wed, Jan 13, 2021 at 11:03 AM Martin KaFai Lau wrote:
>
> On Tue, Jan 12, 2021 at 02:38:46PM -0800, Stanislav Fomichev wrote:
> > When we attach a bpf program to cgroup/getsockopt any other getsockopt()
> > syscall starts incurring kzalloc/kfree cost.
> >
> > Let's add a small buffer on the stack and use it for small (majority)
> > {s,g}etsockopt values. The buffer is small enough to fit into
> > the cache line and cover the majority of simple options (most
> > of them are 4 byte ints).
> >
> > It seems natural to do the same for setsockopt, but it's a bit more
> > involved when the BPF program modifies the data (where we have to
> > kmalloc). The assumption is that for the majority of setsockopt
> > calls (which are doing pure BPF options or apply policy) this
> > will bring some benefit as well.
> >
> > Without this patch (we remove about 1% __kmalloc):
> >      3.38%     0.07%  tcp_mmap  [kernel.kallsyms]  [k] __cgroup_bpf_run_filter_getsockopt
> >             |
> >              --3.30%--__cgroup_bpf_run_filter_getsockopt
> >                        |
> >                         --0.81%--__kmalloc
> >
> > Signed-off-by: Stanislav Fomichev
> > Cc: Martin KaFai Lau
> > Cc: Song Liu
> > ---
> >  include/linux/filter.h |  5 ++++
> >  kernel/bpf/cgroup.c    | 52 ++++++++++++++++++++++++++++++++++++------
> >  2 files changed, 50 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/linux/filter.h b/include/linux/filter.h
> > index 29c27656165b..8739f1d4cac4 100644
> > --- a/include/linux/filter.h
> > +++ b/include/linux/filter.h
> > @@ -1281,6 +1281,11 @@ struct bpf_sysctl_kern {
> >  	u64 tmp_reg;
> >  };
> >
> > +#define BPF_SOCKOPT_KERN_BUF_SIZE	32
> > +struct bpf_sockopt_buf {
> > +	u8		data[BPF_SOCKOPT_KERN_BUF_SIZE];
> > +};
> > +
> >  struct bpf_sockopt_kern {
> >  	struct sock	*sk;
> >  	u8		*optval;
> > diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> > index 416e7738981b..dbeef7afbbf9 100644
> > --- a/kernel/bpf/cgroup.c
> > +++ b/kernel/bpf/cgroup.c
> > @@ -1298,7 +1298,8 @@ static bool __cgroup_bpf_prog_array_is_empty(struct cgroup *cgrp,
> >  	return empty;
> >  }
> >
> > -static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen)
> > +static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen,
> > +			     struct bpf_sockopt_buf *buf)
> >  {
> >  	if (unlikely(max_optlen < 0))
> >  		return -EINVAL;
> > @@ -1310,6 +1311,15 @@ static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen)
> >  		max_optlen = PAGE_SIZE;
> >  	}
> >
> > +	if (max_optlen <= sizeof(buf->data)) {
> > +		/* When the optval fits into BPF_SOCKOPT_KERN_BUF_SIZE
> > +		 * bytes avoid the cost of kzalloc.
> > +		 */
> > +		ctx->optval = buf->data;
> > +		ctx->optval_end = ctx->optval + max_optlen;
> > +		return max_optlen;
> > +	}
> > +
> >  	ctx->optval = kzalloc(max_optlen, GFP_USER);
> >  	if (!ctx->optval)
> >  		return -ENOMEM;
> > @@ -1319,16 +1329,26 @@ static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen)
> >  	return max_optlen;
> >  }
> >
> > -static void sockopt_free_buf(struct bpf_sockopt_kern *ctx)
> > +static void sockopt_free_buf(struct bpf_sockopt_kern *ctx,
> > +			     struct bpf_sockopt_buf *buf)
> >  {
> > +	if (ctx->optval == buf->data)
> > +		return;
> >  	kfree(ctx->optval);
> >  }
> >
> > +static bool sockopt_buf_allocated(struct bpf_sockopt_kern *ctx,
> > +				  struct bpf_sockopt_buf *buf)
> > +{
> > +	return ctx->optval != buf->data;
> > +}
> > +
> >  int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level,
> >  				       int *optname, char __user *optval,
> >  				       int *optlen, char **kernel_optval)
> >  {
> >  	struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> > +	struct bpf_sockopt_buf buf = {};
> >  	struct bpf_sockopt_kern ctx = {
> >  		.sk = sk,
> >  		.level = *level,
> > @@ -1350,7 +1370,7 @@ int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level,
> >  	 */
> >  	max_optlen = max_t(int, 16, *optlen);
> >
> > -	max_optlen = sockopt_alloc_buf(&ctx, max_optlen);
> > +	max_optlen = sockopt_alloc_buf(&ctx, max_optlen, &buf);
> >  	if (max_optlen < 0)
> >  		return max_optlen;
> >
> > @@ -1390,14 +1410,31 @@ int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level,
> >  		 */
> >  		if (ctx.optlen != 0) {
> >  			*optlen = ctx.optlen;
> > -			*kernel_optval = ctx.optval;
> > +			/* We've used bpf_sockopt_kern->buf as an intermediary
> > +			 * storage, but the BPF program indicates that we need
> > +			 * to pass this data to the kernel setsockopt handler.
> > +			 * No way to export on-stack buf, have to allocate a
> > +			 * new buffer.
> > +			 */
> > +			if (!sockopt_buf_allocated(&ctx, &buf)) {
> > +				void *p = kzalloc(ctx.optlen, GFP_USER);
> nit. zero-ing is unnecessary when memcpy() will be done later.

SG, will switch to kmalloc, thanks!

> Acked-by: Martin KaFai Lau
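
For anyone reading along in the archive, the pattern discussed in this thread (use an on-stack buffer for small values, fall back to the heap for larger ones, and copy out of the stack buffer when the data has to outlive the frame) can be sketched in plain userspace C. This is only an illustration under stated assumptions, not the kernel code: small_buf, opt_ctx, opt_alloc_buf, opt_buf_allocated, opt_free_buf and opt_export are hypothetical stand-ins, and calloc/malloc/free take the place of kzalloc/kmalloc/kfree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define SMALL_BUF_SIZE 32 /* same size as BPF_SOCKOPT_KERN_BUF_SIZE in the patch */

/* Hypothetical userspace stand-ins for bpf_sockopt_buf / bpf_sockopt_kern. */
struct small_buf {
	unsigned char data[SMALL_BUF_SIZE];
};

struct opt_ctx {
	unsigned char *optval;
	unsigned char *optval_end;
};

/* Point ctx at the caller's on-stack buffer when len fits, otherwise
 * allocate zeroed heap memory. Returns len, or -1 on error.
 */
static int opt_alloc_buf(struct opt_ctx *ctx, int len, struct small_buf *buf)
{
	assert(ctx && buf);
	if (len < 0)
		return -1;
	if ((size_t)len <= sizeof(buf->data)) {
		ctx->optval = buf->data;
	} else {
		ctx->optval = calloc(1, (size_t)len);
		if (!ctx->optval)
			return -1;
	}
	ctx->optval_end = ctx->optval + len;
	return len;
}

/* True when ctx does not point at the on-stack buffer. */
static bool opt_buf_allocated(const struct opt_ctx *ctx,
			      const struct small_buf *buf)
{
	return ctx->optval != buf->data;
}

/* Release heap memory only; the on-stack buffer needs no cleanup. */
static void opt_free_buf(struct opt_ctx *ctx, struct small_buf *buf)
{
	if (opt_buf_allocated(ctx, buf))
		free(ctx->optval);
}

/* Return a heap pointer holding len bytes of option data that the caller
 * must free(). A heap buffer is handed over as-is; an on-stack buffer is
 * copied. Plain malloc() is enough for the copy because memcpy() overwrites
 * every byte, which is the point of the review nit.
 */
static void *opt_export(struct opt_ctx *ctx, struct small_buf *buf, size_t len)
{
	void *p;

	if (len == 0)
		return NULL;
	if (opt_buf_allocated(ctx, buf)) {
		p = ctx->optval;
		ctx->optval = NULL; /* ownership moves to the caller */
		return p;
	}
	p = malloc(len);
	if (p)
		memcpy(p, ctx->optval, len);
	return p;
}
```

A caller would declare struct small_buf on its own stack, call opt_alloc_buf(), and then either opt_free_buf() on the error path or opt_export() when the data has to be passed on.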