From: Cong Wang <xiyou.wangcong@gmail.com>
To: netdev@vger.kernel.org
Cc: bpf@vger.kernel.org, duanxiongchun@bytedance.com,
wangdongdong.6@bytedance.com, jiang.wang@bytedance.com,
Cong Wang <cong.wang@bytedance.com>
Subject: [Patch bpf-next v8 00/16] sockmap: introduce BPF_SK_SKB_VERDICT and support UDP
Date: Tue, 30 Mar 2021 19:32:21 -0700
Message-ID: <20210331023237.41094-1-xiyou.wangcong@gmail.com>
From: Cong Wang <cong.wang@bytedance.com>

We have thousands of services connected to a daemon on every host
via AF_UNIX dgram sockets. After they are moved into VMs, we have to
add a proxy to forward these communications from the VM to the host,
because rewriting thousands of them is not practical. This proxy uses
an AF_UNIX socket connected to the services and a UDP socket to
connect to the host. It is inefficient because data is copied between
kernel space and user space twice, and we cannot use splice(), which
only supports TCP. Therefore, we want to use sockmap to do the
splicing without going to user space at all (after the initial setup).
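
To make the double copy concrete, here is a minimal sketch of what one
forwarding step in such a user-space proxy might look like. It is not the
actual proxy code; relay_one() is a hypothetical helper name, and the two
file descriptors are assumed to be already-connected datagram sockets
(for example the AF_UNIX side and the UDP side).

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Forward one datagram from from_fd to to_fd through a user-space buffer.
 * Hypothetical helper for illustration only.
 * Returns the number of bytes forwarded, or -1 on error.
 */
ssize_t relay_one(int from_fd, int to_fd)
{
	char buf[65536];
	ssize_t n, sent;

	/* Copy #1: kernel receive queue -> user-space buffer. */
	do {
		n = recv(from_fd, buf, sizeof(buf), 0);
	} while (n < 0 && errno == EINTR);
	if (n < 0)
		return -1;

	/* Copy #2: user-space buffer -> kernel send path. */
	do {
		sent = send(to_fd, buf, (size_t)n, 0);
	} while (sent < 0 && errno == EINTR);

	return sent;
}
```

Every datagram crosses the user/kernel boundary twice in this loop, which
is the overhead a sockmap redirect is meant to avoid.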

Currently sockmap only fully supports TCP. UDP is only partially
supported: UDP sockets can only be added to a sockmap. This patchset,
the second part of the original large patchset, extends sockmap with:
1) cross-protocol support with BPF_SK_SKB_VERDICT; 2) full UDP support.

At a high level, ->read_sock() is required for each protocol to support
sockmap redirection. To update the sock proto, a new op,
->psock_update_sk_prot(), is introduced and is also required. The BPF
->recvmsg() is needed as well, replacing the original ->recvmsg() in
order to retrieve skmsg. To make life easier, we have to get rid of
lock_sock() in sk_psock_handle_skb(); otherwise we would have to
implement ->sendmsg_locked() on top of ->sendmsg(), which is ugly.

Please see each patch for more details.
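
As a rough illustration of the proto update idea described above, the
following is a user-space toy model, not kernel code. All toy_* names and
the (sk, restore) signature are assumptions made for this sketch; the real
definitions are in the individual patches. It only shows the general
pattern: a per-protocol hook swaps the socket's ops table for a BPF-aware
one, so that a different ->recvmsg() runs, and can restore the original
later.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for kernel types; none of these are the real definitions. */
struct toy_sock;

struct toy_proto {
	const char *name;
	/* Receive hook; returns a tag so the caller can see which one ran. */
	int (*recvmsg)(struct toy_sock *sk);
	/* Per-protocol hook that installs or restores the BPF-aware ops. */
	int (*psock_update_sk_prot)(struct toy_sock *sk, bool restore);
};

struct toy_sock {
	const struct toy_proto *sk_prot;    /* currently active ops */
	const struct toy_proto *saved_prot; /* original ops, kept for restore */
};

enum { TOY_PLAIN_RECV = 1, TOY_BPF_RECV = 2 };

static int toy_udp_recvmsg(struct toy_sock *sk)
{
	(void)sk;
	return TOY_PLAIN_RECV;
}

static int toy_udp_bpf_recvmsg(struct toy_sock *sk)
{
	(void)sk;
	return TOY_BPF_RECV;
}

static int toy_udp_update_proto(struct toy_sock *sk, bool restore);

static const struct toy_proto toy_udp_prot = {
	.name = "UDP",
	.recvmsg = toy_udp_recvmsg,
	.psock_update_sk_prot = toy_udp_update_proto,
};

static const struct toy_proto toy_udp_bpf_prot = {
	.name = "UDP (sockmap)",
	.recvmsg = toy_udp_bpf_recvmsg,
	.psock_update_sk_prot = toy_udp_update_proto,
};

/* Swap in the BPF-aware ops, or put the saved original ops back. */
static int toy_udp_update_proto(struct toy_sock *sk, bool restore)
{
	if (restore) {
		if (!sk->saved_prot)
			return -1; /* nothing to restore */
		sk->sk_prot = sk->saved_prot;
		sk->saved_prot = NULL;
		return 0;
	}
	if (sk->saved_prot)
		return -1; /* already switched */
	sk->saved_prot = sk->sk_prot;
	sk->sk_prot = &toy_udp_bpf_prot;
	return 0;
}
```

In this model, a socket that starts with toy_udp_prot dispatches to
toy_udp_bpf_recvmsg() after the hook is called with restore == false, and
back to toy_udp_recvmsg() after it is called with restore == true.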

To see the big picture, the original patchset is available here:
https://github.com/congwang/linux/tree/sockmap
This patchset is also available:
https://github.com/congwang/linux/tree/sockmap2

---
v8: get rid of 'offset' in udp_read_sock()
    add checks for skb_verdict/stream_verdict conflict
    add two cleanup patches for sock_map_link()
    add a new test case

v7: use work_mutex to protect psock->work
    return err in udp_read_sock()
    add patch 6/13
    clean up test case

v6: get rid of sk_psock_zap_ingress()
    add rcu work patch

v5: use INDIRECT_CALL_2() for function pointers
    use ingress_lock to fix a race condition found by Jakub
    rename two helper functions

v4: get rid of lock_sock() in sk_psock_handle_skb()
    get rid of udp_sendmsg_locked()
    remove an empty line
    update cover letter

v3: export tcp/udp_update_proto()
    rename sk->sk_prot->psock_update_sk_prot()
    improve changelogs

v2: separate from the original large patchset
    rebase to the latest bpf-next
    split UDP test case
    move inet_csk_has_ulp() check to tcp_bpf.c
    clean up udp_read_sock()

Cong Wang (16):
  skmsg: lock ingress_skb when purging
  skmsg: introduce a spinlock to protect ingress_msg
  net: introduce skb_send_sock() for sock_map
  skmsg: avoid lock_sock() in sk_psock_backlog()
  skmsg: use rcu work for destroying psock
  skmsg: use GFP_KERNEL in sk_psock_create_ingress_msg()
  sock_map: simplify sock_map_link() a bit
  sock_map: kill sock_map_link_no_progs()
  sock_map: introduce BPF_SK_SKB_VERDICT
  sock: introduce sk->sk_prot->psock_update_sk_prot()
  udp: implement ->read_sock() for sockmap
  skmsg: extract __tcp_bpf_recvmsg() and tcp_bpf_wait_data()
  udp: implement udp_bpf_recvmsg() for sockmap
  sock_map: update sock type checks for UDP
  selftests/bpf: add a test case for udp sockmap
  selftests/bpf: add a test case for loading BPF_SK_SKB_VERDICT

include/linux/skbuff.h | 1 +
include/linux/skmsg.h | 77 ++++++--
include/net/sock.h | 3 +
include/net/tcp.h | 3 +-
include/net/udp.h | 3 +
include/uapi/linux/bpf.h | 1 +
kernel/bpf/syscall.c | 1 +
net/core/skbuff.c | 55 +++++-
net/core/skmsg.c | 177 ++++++++++++++----
net/core/sock_map.c | 118 ++++++------
net/ipv4/af_inet.c | 1 +
net/ipv4/tcp_bpf.c | 130 +++----------
net/ipv4/tcp_ipv4.c | 3 +
net/ipv4/udp.c | 32 ++++
net/ipv4/udp_bpf.c | 79 +++++++-
net/ipv6/af_inet6.c | 1 +
net/ipv6/tcp_ipv6.c | 3 +
net/ipv6/udp.c | 3 +
net/tls/tls_sw.c | 4 +-
tools/bpf/bpftool/common.c | 1 +
tools/bpf/bpftool/prog.c | 1 +
tools/include/uapi/linux/bpf.h | 1 +
.../selftests/bpf/prog_tests/sockmap_basic.c | 40 ++++
.../selftests/bpf/prog_tests/sockmap_listen.c | 136 ++++++++++++++
.../selftests/bpf/progs/test_sockmap_listen.c | 22 +++
.../progs/test_sockmap_skb_verdict_attach.c | 18 ++
26 files changed, 677 insertions(+), 237 deletions(-)
create mode 100644 tools/testing/selftests/bpf/progs/test_sockmap_skb_verdict_attach.c
--
2.25.1