From: Jakub Sitnicki <jakub@cloudflare.com> To: Cong Wang <xiyou.wangcong@gmail.com> Cc: Linux Kernel Network Developers <netdev@vger.kernel.org>, bpf <bpf@vger.kernel.org>, duanxiongchun@bytedance.com, Dongdong Wang <wangdongdong.6@bytedance.com>, Jiang Wang <jiang.wang@bytedance.com>, Cong Wang <cong.wang@bytedance.com>, Daniel Borkmann <daniel@iogearbox.net>, Lorenz Bauer <lmb@cloudflare.com>, John Fastabend <john.fastabend@gmail.com> Subject: Re: [Patch bpf-next v6 4/8] skmsg: move sk_redir from TCP_SKB_CB to skb Date: Tue, 23 Feb 2021 18:52:58 +0100 Message-ID: <875z2i4qo5.fsf@cloudflare.com> (raw) In-Reply-To: <CAM_iQpVS_sJy=sM31pHZVi6njZEAa7Hv_Bkt2sB7JcAjFw3guw@mail.gmail.com> On Mon, Feb 22, 2021 at 08:27 PM CET, Cong Wang wrote: > On Mon, Feb 22, 2021 at 4:20 AM Jakub Sitnicki <jakub@cloudflare.com> wrote: >> >> On Sat, Feb 20, 2021 at 06:29 AM CET, Cong Wang wrote: >> > From: Cong Wang <cong.wang@bytedance.com> >> > >> > Currently TCP_SKB_CB() is hard-coded in skmsg code, it certainly >> > does not work for any other non-TCP protocols. We can move them to >> > skb ext, but it introduces a memory allocation on fast path. >> > >> > Fortunately, we only need to a word-size to store all the information, >> > because the flags actually only contains 1 bit so can be just packed >> > into the lowest bit of the "pointer", which is stored as unsigned >> > long. >> > >> > Inside struct sk_buff, '_skb_refdst' can be reused because skb dst is >> > no longer needed after ->sk_data_ready() so we can just drop it. >> > >> > Cc: Daniel Borkmann <daniel@iogearbox.net> >> > Cc: Jakub Sitnicki <jakub@cloudflare.com> >> > Cc: Lorenz Bauer <lmb@cloudflare.com> >> > Acked-by: John Fastabend <john.fastabend@gmail.com> >> > Signed-off-by: Cong Wang <cong.wang@bytedance.com> >> > --- >> >> LGTM. I have some questions (below) that would help me confirm if I >> understand the changes, and what could be improved, if anything. >> >> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> >> >> > include/linux/skbuff.h | 3 +++ >> > include/linux/skmsg.h | 35 +++++++++++++++++++++++++++++++++++ >> > include/net/tcp.h | 19 ------------------- >> > net/core/skmsg.c | 32 ++++++++++++++++++++------------ >> > net/core/sock_map.c | 8 ++------ >> > 5 files changed, 60 insertions(+), 37 deletions(-) >> > >> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h >> > index 6d0a33d1c0db..bd84f799c952 100644 >> > --- a/include/linux/skbuff.h >> > +++ b/include/linux/skbuff.h >> > @@ -755,6 +755,9 @@ struct sk_buff { >> > void (*destructor)(struct sk_buff *skb); >> > }; >> > struct list_head tcp_tsorted_anchor; >> > +#ifdef CONFIG_NET_SOCK_MSG >> > + unsigned long _sk_redir; >> > +#endif >> > }; >> > >> > #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE) >> > diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h >> > index e3bb712af257..fc234d507fd7 100644 >> > --- a/include/linux/skmsg.h >> > +++ b/include/linux/skmsg.h >> > @@ -459,4 +459,39 @@ static inline bool sk_psock_strp_enabled(struct sk_psock *psock) >> > return false; >> > return !!psock->saved_data_ready; >> > } >> > + >> > +#if IS_ENABLED(CONFIG_NET_SOCK_MSG) >> > +static inline bool skb_bpf_ingress(const struct sk_buff *skb) >> > +{ >> > + unsigned long sk_redir = skb->_sk_redir; >> > + >> > + return sk_redir & BPF_F_INGRESS; >> > +} >> > + >> > +static inline void skb_bpf_set_ingress(struct sk_buff *skb) >> > +{ >> > + skb->_sk_redir |= BPF_F_INGRESS; >> > +} >> > + >> > +static inline void skb_bpf_set_redir(struct sk_buff *skb, struct sock *sk_redir, >> > + bool ingress) >> > +{ >> > + skb->_sk_redir = (unsigned long)sk_redir; >> > + if (ingress) >> > + skb->_sk_redir |= BPF_F_INGRESS; >> > +} >> > + >> > +static inline struct sock *skb_bpf_redirect_fetch(const struct sk_buff *skb) >> > +{ >> > + unsigned long sk_redir = skb->_sk_redir; >> > + >> > + sk_redir &= ~0x1UL; >> >> We're using the enum when setting the bit flag, but a hardcoded constant >> when masking it. ~BPF_F_INGRESS would be more consistent here. > > Well, here we need a mask, not a bit, but we don't have a mask yet, > hence I just use hard-coded 0x1. Does #define BPF_F_MASK 0x1UL > look any better? Based on what I've seen around, mask for sanitizing tagged pointers is usually derived from the flag(s). For instance: #define SKB_DST_NOREF 1UL #define SKB_DST_PTRMASK ~(SKB_DST_NOREF) #define SK_USER_DATA_NOCOPY 1UL #define SK_USER_DATA_BPF 2UL /* Managed by BPF */ #define SK_USER_DATA_PTRMASK ~(SK_USER_DATA_NOCOPY | SK_USER_DATA_BPF) Using ~(BPF_F_INGRESS) expression would be like substituting mask definition. [..] >> > diff --git a/include/net/tcp.h b/include/net/tcp.h >> > index 947ef5da6867..075de26f449d 100644 >> > --- a/include/net/tcp.h >> > +++ b/include/net/tcp.h >> > @@ -883,30 +883,11 @@ struct tcp_skb_cb { >> > struct inet6_skb_parm h6; >> > #endif >> > } header; /* For incoming skbs */ >> > - struct { >> > - __u32 flags; >> > - struct sock *sk_redir; >> > - } bpf; >> > }; >> > }; >> > >> > #define TCP_SKB_CB(__skb) ((struct tcp_skb_cb *)&((__skb)->cb[0])) >> > >> > -static inline bool tcp_skb_bpf_ingress(const struct sk_buff *skb) >> > -{ >> > - return TCP_SKB_CB(skb)->bpf.flags & BPF_F_INGRESS; >> > -} >> > - >> > -static inline struct sock *tcp_skb_bpf_redirect_fetch(struct sk_buff *skb) >> > -{ >> > - return TCP_SKB_CB(skb)->bpf.sk_redir; >> > -} >> > - >> > -static inline void tcp_skb_bpf_redirect_clear(struct sk_buff *skb) >> > -{ >> > - TCP_SKB_CB(skb)->bpf.sk_redir = NULL; >> > -} >> > - >> > extern const struct inet_connection_sock_af_ops ipv4_specific; >> > >> > #if IS_ENABLED(CONFIG_IPV6) >> > diff --git a/net/core/skmsg.c b/net/core/skmsg.c >> > index 2d8bbb3fd87c..05b5af09ff42 100644 >> > --- a/net/core/skmsg.c >> > +++ b/net/core/skmsg.c >> > @@ -494,6 +494,8 @@ static int sk_psock_skb_ingress_self(struct sk_psock *psock, struct sk_buff *skb >> > static int sk_psock_handle_skb(struct sk_psock *psock, struct sk_buff *skb, >> > u32 off, u32 len, bool ingress) >> > { >> > + skb_bpf_redirect_clear(skb); >> >> This is called to avoid leaking state in skb->_skb_refdst. Correct? > > This is to teach kfree_skb() not to consider it as a valid _skb_refdst. OK > >> >> I'm wondering why we're doing it every time sk_psock_handle_skb() gets >> invoked from the do/while loop in sk_psock_backlog(), instead of doing >> it once after reading ingress flag with skb_bpf_ingress()? > > It should also work, I don't see much difference here, as we almost > always process a full skb, that is, ret == skb->len. OK > > >> >> > + >> > if (!ingress) { >> > if (!sock_writeable(psock->sk)) >> > return -EAGAIN; >> > @@ -525,7 +527,7 @@ static void sk_psock_backlog(struct work_struct *work) >> > len = skb->len; >> > off = 0; >> > start: >> > - ingress = tcp_skb_bpf_ingress(skb); >> > + ingress = skb_bpf_ingress(skb); >> > do { >> > ret = -EIO; >> > if (likely(psock->sk->sk_socket)) >> > @@ -631,7 +633,12 @@ void __sk_psock_purge_ingress_msg(struct sk_psock *psock) >> > >> > static void sk_psock_zap_ingress(struct sk_psock *psock) >> > { >> > - __skb_queue_purge(&psock->ingress_skb); >> > + struct sk_buff *skb; >> > + >> > + while ((skb = __skb_dequeue(&psock->ingress_skb)) != NULL) { >> > + skb_bpf_redirect_clear(skb); >> >> I believe we clone the skb before enqueuing it psock->ingress_skb. >> Clone happens either in sk_psock_verdict_recv() or in __strp_recv(). >> There are not other users holding a ref, so clearing the redirect seems >> unneeded. Unless I'm missing something? > > Yes, skb dst is also cloned: > > 980 static void __copy_skb_header(struct sk_buff *new, const struct > sk_buff *old) > 981 { > 982 new->tstamp = old->tstamp; > 983 /* We do not copy old->sk */ > 984 new->dev = old->dev; > 985 memcpy(new->cb, old->cb, sizeof(old->cb)); > 986 skb_dst_copy(new, old); > > Also, if without this, dst_release() would complain again. I was not smart > enough to add it in the beginning, dst_release() taught me this lesson. ;) OK, I think I follow you now. Alternatively we could clear _skb_refdest after clone, but before enqueuing the skb in ingress_skb. And only for when we're redirecting. I believe that would be in sk_psock_skb_redirect, right before skb_queue_tail. > >> >> > + kfree_skb(skb); >> > + } >> > __sk_psock_purge_ingress_msg(psock); >> > } >> > >> > @@ -752,7 +759,7 @@ static void sk_psock_skb_redirect(struct sk_buff *skb) >> > struct sk_psock *psock_other; >> > struct sock *sk_other; >> > >> > - sk_other = tcp_skb_bpf_redirect_fetch(skb); >> > + sk_other = skb_bpf_redirect_fetch(skb); >> > /* This error is a buggy BPF program, it returned a redirect >> > * return code, but then didn't set a redirect interface. >> > */ >> > @@ -802,9 +809,10 @@ int sk_psock_tls_strp_read(struct sk_psock *psock, struct sk_buff *skb) >> > * TLS context. >> > */ >> > skb->sk = psock->sk; >> > - tcp_skb_bpf_redirect_clear(skb); >> > + skb_dst_drop(skb); >> > + skb_bpf_redirect_clear(skb); >> >> After skb_dst_drop(), skb->_skb_refdst is clear. So it seems the >> redirect_clear() is not needed. But I'm guessing it is being invoked >> to communicate the intention? > > Technically true, but I prefer to call them explicitly, not to rely on the > fact skb->_skb_refdst shares the same storage with skb->_sk_redir, > which would also require some comments to explain. > OK
next prev parent reply index Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top 2021-02-20 5:29 [Patch bpf-next v6 0/8] sock_map: clean up and refactor code for BPF_SK_SKB_VERDICT Cong Wang 2021-02-20 5:29 ` [Patch bpf-next v6 1/8] bpf: clean up sockmap related Kconfigs Cong Wang 2021-02-22 8:51 ` Jakub Sitnicki 2021-02-22 23:23 ` Cong Wang 2021-02-20 5:29 ` [Patch bpf-next v6 2/8] skmsg: get rid of struct sk_psock_parser Cong Wang 2021-02-20 5:29 ` [Patch bpf-next v6 3/8] bpf: compute data_end dynamically with JIT code Cong Wang 2021-02-20 5:29 ` [Patch bpf-next v6 4/8] skmsg: move sk_redir from TCP_SKB_CB to skb Cong Wang 2021-02-22 12:20 ` Jakub Sitnicki 2021-02-22 19:27 ` Cong Wang 2021-02-23 17:52 ` Jakub Sitnicki [this message] 2021-02-23 18:04 ` Cong Wang 2021-02-23 18:36 ` Jakub Sitnicki 2021-02-20 5:29 ` [Patch bpf-next v6 5/8] sock_map: rename skb_parser and skb_verdict Cong Wang 2021-02-22 12:28 ` Jakub Sitnicki 2021-02-22 19:32 ` Cong Wang 2021-02-20 5:29 ` [Patch bpf-next v6 6/8] sock_map: make sock_map_prog_update() static Cong Wang 2021-02-22 12:29 ` Jakub Sitnicki 2021-02-20 5:29 ` [Patch bpf-next v6 7/8] skmsg: make __sk_psock_purge_ingress_msg() static Cong Wang 2021-02-22 12:30 ` Jakub Sitnicki 2021-02-20 5:29 ` [Patch bpf-next v6 8/8] skmsg: get rid of sk_psock_bpf_run() Cong Wang 2021-02-22 12:31 ` Jakub Sitnicki 2021-02-22 12:32 ` [Patch bpf-next v6 0/8] sock_map: clean up and refactor code for BPF_SK_SKB_VERDICT Jakub Sitnicki
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=875z2i4qo5.fsf@cloudflare.com \ --to=jakub@cloudflare.com \ --cc=bpf@vger.kernel.org \ --cc=cong.wang@bytedance.com \ --cc=daniel@iogearbox.net \ --cc=duanxiongchun@bytedance.com \ --cc=jiang.wang@bytedance.com \ --cc=john.fastabend@gmail.com \ --cc=lmb@cloudflare.com \ --cc=netdev@vger.kernel.org \ --cc=wangdongdong.6@bytedance.com \ --cc=xiyou.wangcong@gmail.com \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
BPF Archive on lore.kernel.org Archives are clonable: git clone --mirror https://lore.kernel.org/bpf/0 bpf/git/0.git # If you have public-inbox 1.1+ installed, you may # initialize and index your mirror using the following commands: public-inbox-init -V2 bpf bpf/ https://lore.kernel.org/bpf \ bpf@vger.kernel.org public-inbox-index bpf Example config snippet for mirrors Newsgroup available over NNTP: nntp://nntp.lore.kernel.org/org.kernel.vger.bpf AGPL code for this site: git clone https://public-inbox.org/public-inbox.git