All of lore.kernel.org
 help / color / mirror / Atom feed
From: Joe Stringer <joe@wand.net.nz>
To: Florian Westphal <fw@strlen.de>
Cc: Joe Stringer <joe@wand.net.nz>,
	Eric Dumazet <eric.dumazet@gmail.com>,
	netdev <netdev@vger.kernel.org>,
	john fastabend <john.fastabend@gmail.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Lorenz Bauer <lmb@cloudflare.com>,
	Jakub Sitnicki <jakub@cloudflare.com>,
	Paolo Abeni <pabeni@redhat.com>
Subject: Re: Removing skb_orphan() from ip_rcv_core()
Date: Mon, 24 Jun 2019 20:17:07 -0700	[thread overview]
Message-ID: <CAOftzPi5SO_tZeoEs1Apd5np=Sd2fFUPm1oome_31=rMqSD-=g@mail.gmail.com> (raw)
In-Reply-To: <20190621205935.og7ajx57j7usgycq@breakpoint.cc>

On Fri, Jun 21, 2019 at 1:59 PM Florian Westphal <fw@strlen.de> wrote:
>
> Joe Stringer <joe@wand.net.nz> wrote:
> > As discussed during LSFMM, I've been looking at adding something like
> > an `skb_sk_assign()` helper to BPF so that logic similar to TPROXY can
> > be implemented with integration into other BPF logic, however
> > currently any attempts to do so are blocked by the skb_orphan() call
> > in ip_rcv_core() (which will effectively ignore any socket assign
> > decision made by the TC BPF program).
> >
> > Recently I was attempting to remove the skb_orphan() call, and I've
> > been trying different things but there seems to be some context I'm
> > missing. Here's the core of the patch:
> >
> > diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
> > index ed97724c5e33..16aea980318a 100644
> > --- a/net/ipv4/ip_input.c
> > +++ b/net/ipv4/ip_input.c
> > @@ -500,8 +500,6 @@ static struct sk_buff *ip_rcv_core(struct sk_buff
> > *skb, struct net *net)
> >        memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
> >        IPCB(skb)->iif = skb->skb_iif;
> >
> > -       /* Must drop socket now because of tproxy. */
> > -       skb_orphan(skb);
> >
> >        return skb;
> >
> > The statement that the socket must be dropped because of tproxy
> > doesn't make sense to me, because the PRE_ROUTING hook is hit after
> > this, which will call into the tproxy logic and eventually
> > nf_tproxy_assign_sock() which already does the skb_orphan() itself.
>
> in comment: s/tproxy/skb_steal_sock/

For reference, I was following the path like this:

ip_rcv()
( -> ip_rcv_core() for skb_orphan)
-> NF_INET_PRE_ROUTING hook
(... invoke iptables hooks)
-> iptable_mangle_hook()
-> ipt_do_table()
... -> tproxy_tg4()
... -> nf_tproxy_assign_sock()
-> skb_orphan()
(... finish iptables processing)
( -> ip_rcv_finish())
( ... -> ip_rcv_finish_core() for early demux / route lookup )
(... -> dst_input())
(... -> tcp_v4_rcv())
( -> __inet_lookup_skb())
( -> skb_steal_sock() )

> at least thats what I concluded a few years ago when I looked into
> the skb_oprhan() need.
>
> IIRC some device drivers use skb->sk for backpressure, so without this
> non-tcp socket would be stolen by skb_steal_sock.

Do you happen to recall which device drivers? Or have some idea of a
list I could try to go through? Are you referring to virtual drivers
like veth or something else?

> We also recently removed skb orphan when crossing netns:
>
> commit 9c4c325252c54b34d53b3d0ffd535182b744e03d
> Author: Flavio Leitner <fbl@redhat.com>
> skbuff: preserve sock reference when scrubbing the skb.
>
> So thats another case where this orphan is needed.

Presumably the orphan is only needed in this case if the packet
crosses a namespace and then is subsequently passed back into the
stack?

> What could be done is adding some way to delay/defer the orphaning
> further, but we would need at the very least some annotation for
> skb_steal_sock to know when the skb->sk is really from TPROXY or
> if it has to orphan.

Eric mentions in another response to this thread that skb_orphan()
should be called from any ndo_start_xmit() which sends traffic back
into the stack. With that, presumably we would be pushing the
orphaning earlier such that the only way that the skb->sk ref can be
non-NULL around this point in receive would be because it was
specifically set by some kind of tproxy logic?

> Same for the safety check in the forwarding path.
> Netfilter modules need o be audited as well, they might make assumptions
> wrt. skb->sk being inet sockets (set by local stack or early demux).
>
> > However, if I drop these lines then I end up causing sockets to
> > release references too many times. Seems like if we don't orphan the
> > skb here, then later logic assumes that we have one more reference
> > than we actually have, and decrements the count when it shouldn't
> > (perhaps the skb_steal_sock() call in __inet_lookup_skb() which seems
> > to assume we always have a reference to the socket?)
>
> We might be calling the wrong destructor (i.e., the one set by tcp
> receive instead of the one set at tx time)?

Hmm, interesting thought. Sure enough, with a bit of bpftrace
debugging we find it's tcp_wfree():

$ cat ip_rcv.bt
#include <linux/skbuff.h>

kprobe:ip_rcv {
       $sk = ((struct sk_buff *)arg0)->sk;
       $des = ((struct sk_buff *)arg0)->destructor;
       if ($sk) {
               if ($des) {
                       printf("received %s on %s with sk destructor %s
set\n", str(arg0), str(arg1), ksym($des));
                       @ip4_stacks[kstack] = count();
               }
       }
}
$ sudo bpftrace ip_rcv.bt
Attaching 1 probe...
received  on eth0 with sk destructor tcp_wfree set
^C

@ip4_stacks[
   ip_rcv+1
   __netif_receive_skb+24
   process_backlog+179
   net_rx_action+304
   __do_softirq+220
   do_softirq_own_stack+42
   do_softirq.part.17+70
   __local_bh_enable_ip+101
   ip_finish_output2+421
   __ip_finish_output+187
   ip_finish_output+44
   ip_output+109
   ip_local_out+59
   __ip_queue_xmit+368
   ip_queue_xmit+16
   __tcp_transmit_skb+1303
   tcp_connect+2758
   tcp_v4_connect+1135
   __inet_stream_connect+214
   inet_stream_connect+59
   __sys_connect+237
   __x64_sys_connect+26
   do_syscall_64+90
   entry_SYSCALL_64_after_hwframe+68
]: 1

Is there a solution here where we call the destructor if it's not
sock_efree()? When the socket is later stolen, it will only return the
reference via a call to sock_put(), so presumably at that point in the
stack we already assume that the skb->destructor is not one of these
other destructors (otherwise we wouldn't release the resources
correctly).

  reply	other threads:[~2019-06-25  3:17 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-06-21 17:58 Removing skb_orphan() from ip_rcv_core() Joe Stringer
2019-06-21 20:59 ` Florian Westphal
2019-06-25  3:17   ` Joe Stringer [this message]
2019-06-25  6:37     ` Eric Dumazet
2019-06-25  9:35       ` Daniel Borkmann
2019-06-25 17:03         ` Eric Dumazet
2019-06-25 18:20       ` Joe Stringer
2019-06-22  0:36 ` Eric Dumazet
2019-06-24 14:47 ` Jamal Hadi Salim
2019-06-24 16:49   ` Eric Dumazet
2019-06-25 10:55     ` Jamal Hadi Salim
2019-06-25  3:26   ` Joe Stringer
2019-06-25 11:06     ` Jamal Hadi Salim
2019-06-25 18:29       ` Joe Stringer

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAOftzPi5SO_tZeoEs1Apd5np=Sd2fFUPm1oome_31=rMqSD-=g@mail.gmail.com' \
    --to=joe@wand.net.nz \
    --cc=daniel@iogearbox.net \
    --cc=eric.dumazet@gmail.com \
    --cc=fw@strlen.de \
    --cc=jakub@cloudflare.com \
    --cc=john.fastabend@gmail.com \
    --cc=lmb@cloudflare.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.