Re: [PATCH bpf-next v6 1/3] bpf: remove extra lock_sock for TCP_ZEROCOPY_RECEIVE

From: Martin KaFai Lau <kafai@fb.com>
To: Stanislav Fomichev <sdf@google.com>
Cc: <netdev@vger.kernel.org>, <bpf@vger.kernel.org>, <ast@kernel.org>,
	<daniel@iogearbox.net>, Song Liu <songliubraving@fb.com>,
	Eric Dumazet <edumazet@google.com>
Subject: Re: [PATCH bpf-next v6 1/3] bpf: remove extra lock_sock for TCP_ZEROCOPY_RECEIVE
Date: Fri, 8 Jan 2021 17:37:39 -0800
Message-ID: <20210109013739.vbqm4gllpo7g5xro@kafai-mbp.dhcp.thefacebook.com>
In-Reply-To: <20210108210223.972802-2-sdf@google.com>

On Fri, Jan 08, 2021 at 01:02:21PM -0800, Stanislav Fomichev wrote:
> Add custom implementation of getsockopt hook for TCP_ZEROCOPY_RECEIVE.
> We skip generic hooks for TCP_ZEROCOPY_RECEIVE and have a custom
> call in do_tcp_getsockopt using the on-stack data. This removes
> 3% overhead for locking/unlocking the socket.
> 
> Without this patch:
>      3.38%     0.07%  tcp_mmap  [kernel.kallsyms]  [k] __cgroup_bpf_run_filter_getsockopt
>             |
>              --3.30%--__cgroup_bpf_run_filter_getsockopt
>                        |
>                         --0.81%--__kmalloc
> 
> With the patch applied:
>      0.52%     0.12%  tcp_mmap  [kernel.kallsyms]  [k] __cgroup_bpf_run_filter_getsockopt_kern
> 
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> Cc: Martin KaFai Lau <kafai@fb.com>
> Cc: Song Liu <songliubraving@fb.com>
> Cc: Eric Dumazet <edumazet@google.com>
> ---
>  include/linux/bpf-cgroup.h                    | 27 +++++++++++--
>  include/linux/indirect_call_wrapper.h         |  6 +++
>  include/net/sock.h                            |  2 +
>  include/net/tcp.h                             |  1 +
>  kernel/bpf/cgroup.c                           | 38 +++++++++++++++++++
>  net/ipv4/tcp.c                                | 14 +++++++
>  net/ipv4/tcp_ipv4.c                           |  1 +
>  net/ipv6/tcp_ipv6.c                           |  1 +
>  net/socket.c                                  |  3 ++
>  .../selftests/bpf/prog_tests/sockopt_sk.c     | 22 +++++++++++
>  .../testing/selftests/bpf/progs/sockopt_sk.c  | 15 ++++++++
>  11 files changed, 126 insertions(+), 4 deletions(-)
> 
[ ... ]

> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> index 6ec088a96302..c41bb2f34013 100644
> --- a/kernel/bpf/cgroup.c
> +++ b/kernel/bpf/cgroup.c
> @@ -1485,6 +1485,44 @@ int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level,
>  	sockopt_free_buf(&ctx);
>  	return ret;
>  }
> +
> +int __cgroup_bpf_run_filter_getsockopt_kern(struct sock *sk, int level,
> +					    int optname, void *optval,
> +					    int *optlen, int retval)
> +{
> +	struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> +	struct bpf_sockopt_kern ctx = {
> +		.sk = sk,
> +		.level = level,
> +		.optname = optname,
> +		.retval = retval,
> +		.optlen = *optlen,
> +		.optval = optval,
> +		.optval_end = optval + *optlen,
> +	};
> +	int ret;
> +
The current behavior only passes the kernel optval to the bpf prog when
retval == 0, while this function passes it regardless of retval.
Can you add a few words here explaining the difference and why it is fine?
That will help in case some other option wants to reuse
__cgroup_bpf_run_filter_getsockopt_kern() in the future.

> +	ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[BPF_CGROUP_GETSOCKOPT],
> +				 &ctx, BPF_PROG_RUN);
> +	if (!ret)
> +		return -EPERM;
> +
> +	if (ctx.optlen > *optlen)
> +		return -EFAULT;
> +
> +	/* BPF programs only allowed to set retval to 0, not some
> +	 * arbitrary value.
> +	 */
> +	if (ctx.retval != 0 && ctx.retval != retval)
> +		return -EFAULT;
> +
> +	/* BPF programs can shrink the buffer, export the modifications.
> +	 */
> +	if (ctx.optlen != 0)
> +		*optlen = ctx.optlen;
> +
> +	return ctx.retval;
> +}
>  #endif
>  
>  static ssize_t sysctl_cpy_dir(const struct ctl_dir *dir, char **bufp,
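To make the checks that follow BPF_PROG_RUN_ARRAY in the hunk above easier to
reason about, here is a minimal userspace sketch of that post-run validation.
It is an illustration only, not kernel code: the function name
validate_kern_result() and its parameters are hypothetical, prog_ret stands in
for the BPF_PROG_RUN_ARRAY result, and ctx_optlen / ctx_retval stand in for
the values the bpf prog left in ctx.

```c
#include <errno.h>

/* Hypothetical userspace model of the checks done after the bpf progs
 * have run. This mirrors the logic in the quoted hunk; it is not the
 * kernel implementation.
 *
 * prog_ret:   models the BPF_PROG_RUN_ARRAY result (0 means "reject")
 * ctx_optlen: optlen as left by the bpf prog
 * ctx_retval: retval as left by the bpf prog
 * optlen:     in/out, the original buffer length
 * retval:     the retval passed in by the caller
 */
int validate_kern_result(int prog_ret, int ctx_optlen, int ctx_retval,
			 int *optlen, int retval)
{
	/* A prog rejected the call. */
	if (!prog_ret)
		return -EPERM;

	/* The prog may shrink the buffer but must not grow it. */
	if (ctx_optlen > *optlen)
		return -EFAULT;

	/* The prog may only reset retval to 0 or leave it untouched. */
	if (ctx_retval != 0 && ctx_retval != retval)
		return -EFAULT;

	/* Export a shrunken length; 0 means "leave optlen as it was". */
	if (ctx_optlen != 0)
		*optlen = ctx_optlen;

	return ctx_retval;
}
```

Under these assumptions, a prog that clears a failing retval to 0 is accepted,
while a prog that replaces retval with some other error code, or reports an
optlen larger than the original buffer, gets -EFAULT.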

[ ... ]

> diff --git a/tools/testing/selftests/bpf/prog_tests/sockopt_sk.c b/tools/testing/selftests/bpf/prog_tests/sockopt_sk.c
> index b25c9c45c148..6bb18b1d8578 100644
> --- a/tools/testing/selftests/bpf/prog_tests/sockopt_sk.c
> +++ b/tools/testing/selftests/bpf/prog_tests/sockopt_sk.c
> @@ -11,6 +11,7 @@ static int getsetsockopt(void)
>  		char u8[4];
>  		__u32 u32;
>  		char cc[16]; /* TCP_CA_NAME_MAX */
> +		struct tcp_zerocopy_receive zc;
I suspect this won't compile, at least in my setup.

tools/testing/selftests/net/tcp_mmap.c does compile fine for me, though.
I _guess_ that is because the net tests include kernel/usr/include.

AFAIK, the bpf tests use tools/include/uapi/.

Everything else LGTM.