netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Removing skb_orphan() from ip_rcv_core()
@ 2019-06-21 17:58 Joe Stringer
  2019-06-21 20:59 ` Florian Westphal
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Joe Stringer @ 2019-06-21 17:58 UTC (permalink / raw)
  To: Eric Dumazet, Florian Westphal
  Cc: netdev, john fastabend, Daniel Borkmann, Lorenz Bauer,
	Jakub Sitnicki, Paolo Abeni

Hi folks, picking this up again..

As discussed during LSFMM, I've been looking at adding something like
an `skb_sk_assign()` helper to BPF so that logic similar to TPROXY can
be implemented with integration into other BPF logic, however
currently any attempts to do so are blocked by the skb_orphan() call
in ip_rcv_core() (which will effectively ignore any socket assign
decision made by the TC BPF program).

Recently I was attempting to remove the skb_orphan() call, and I've
been trying different things but there seems to be some context I'm
missing. Here's the core of the patch:

diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index ed97724c5e33..16aea980318a 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -500,8 +500,6 @@ static struct sk_buff *ip_rcv_core(struct sk_buff
*skb, struct net *net)
       memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
       IPCB(skb)->iif = skb->skb_iif;

-       /* Must drop socket now because of tproxy. */
-       skb_orphan(skb);

       return skb;

The statement that the socket must be dropped because of tproxy
doesn't make sense to me, because the PRE_ROUTING hook is hit after
this, which will call into the tproxy logic and eventually
nf_tproxy_assign_sock() which already does the skb_orphan() itself.

However, if I drop these lines then I end up causing sockets to
release references too many times. Seems like if we don't orphan the
skb here, then later logic assumes that we have one more reference
than we actually have, and decrements the count when it shouldn't
(perhaps the skb_steal_sock() call in __inet_lookup_skb() which seems
to assume we always have a reference to the socket?)

Splat:

refcount_t hit zero at sk_stop_timer+0x2c/0x30 in cilium-agent[16359],
uid/euid: 0/0
WARNING: CPU: 0 PID: 16359 at kernel/panic.c:686 refcount_error_report+0x9c/0xa1
...
? inet_put_port+0xa6/0xd0
inet_csk_clear_xmit_timers+0x2e/0x50
tcp_done+0x8b/0xf0
tcp_reset+0x49/0xc0
tcp_validate_incoming+0x2f7/0x410
tcp_rcv_state_process+0x250/0xdb6
? tcp_v4_connect+0x46f/0x4e0
tcp_v4_do_rcv+0xbd/0x1f0
__release_sock+0x84/0xd0
release_sock+0x30/0xa0
inet_stream_connect+0x47/0x60

(Full version: https://gist.github.com/joestringer/d5313e4bf4231e2c46405bd7a3053936
)

This seems potentially related to some of the socket referencing
discussion in the peer thread "[RFC bpf-next 0/7] Programming socket
lookup with BPF".

During LSFMM, it seemed like no-one knew quite why the skb_orphan() is
necessary in that path in the current version of the code, and that we
may be able to remove it. Florian, I know you weren't in the room for
that discussion, so raising it again now with a stack trace, Do you
have some sense what's going on here and whether there's a path
towards removing it from this path or allowing the skb->sk to be
retained during ip_rcv() in some conditions?

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: Removing skb_orphan() from ip_rcv_core()
  2019-06-21 17:58 Removing skb_orphan() from ip_rcv_core() Joe Stringer
@ 2019-06-21 20:59 ` Florian Westphal
  2019-06-25  3:17   ` Joe Stringer
  2019-06-22  0:36 ` Eric Dumazet
  2019-06-24 14:47 ` Jamal Hadi Salim
  2 siblings, 1 reply; 14+ messages in thread
From: Florian Westphal @ 2019-06-21 20:59 UTC (permalink / raw)
  To: Joe Stringer
  Cc: Eric Dumazet, Florian Westphal, netdev, john fastabend,
	Daniel Borkmann, Lorenz Bauer, Jakub Sitnicki, Paolo Abeni

Joe Stringer <joe@wand.net.nz> wrote:
> As discussed during LSFMM, I've been looking at adding something like
> an `skb_sk_assign()` helper to BPF so that logic similar to TPROXY can
> be implemented with integration into other BPF logic, however
> currently any attempts to do so are blocked by the skb_orphan() call
> in ip_rcv_core() (which will effectively ignore any socket assign
> decision made by the TC BPF program).
> 
> Recently I was attempting to remove the skb_orphan() call, and I've
> been trying different things but there seems to be some context I'm
> missing. Here's the core of the patch:
> 
> diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
> index ed97724c5e33..16aea980318a 100644
> --- a/net/ipv4/ip_input.c
> +++ b/net/ipv4/ip_input.c
> @@ -500,8 +500,6 @@ static struct sk_buff *ip_rcv_core(struct sk_buff
> *skb, struct net *net)
>        memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
>        IPCB(skb)->iif = skb->skb_iif;
> 
> -       /* Must drop socket now because of tproxy. */
> -       skb_orphan(skb);
> 
>        return skb;
> 
> The statement that the socket must be dropped because of tproxy
> doesn't make sense to me, because the PRE_ROUTING hook is hit after
> this, which will call into the tproxy logic and eventually
> nf_tproxy_assign_sock() which already does the skb_orphan() itself.

in comment: s/tproxy/skb_steal_sock/

at least thats what I concluded a few years ago when I looked into
the skb_oprhan() need.

IIRC some device drivers use skb->sk for backpressure, so without this
non-tcp socket would be stolen by skb_steal_sock.

We also recently removed skb orphan when crossing netns:

commit 9c4c325252c54b34d53b3d0ffd535182b744e03d
Author: Flavio Leitner <fbl@redhat.com>
skbuff: preserve sock reference when scrubbing the skb.

So thats another case where this orphan is needed.

What could be done is adding some way to delay/defer the orphaning
further, but we would need at the very least some annotation for
skb_steal_sock to know when the skb->sk is really from TPROXY or
if it has to orphan.

Same for the safety check in the forwarding path.
Netfilter modules need o be audited as well, they might make assumptions
wrt. skb->sk being inet sockets (set by local stack or early demux).

> However, if I drop these lines then I end up causing sockets to
> release references too many times. Seems like if we don't orphan the
> skb here, then later logic assumes that we have one more reference
> than we actually have, and decrements the count when it shouldn't
> (perhaps the skb_steal_sock() call in __inet_lookup_skb() which seems
> to assume we always have a reference to the socket?)

We might be calling the wrong destructor (i.e., the one set by tcp
receive instead of the one set at tx time)?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Removing skb_orphan() from ip_rcv_core()
  2019-06-21 17:58 Removing skb_orphan() from ip_rcv_core() Joe Stringer
  2019-06-21 20:59 ` Florian Westphal
@ 2019-06-22  0:36 ` Eric Dumazet
  2019-06-24 14:47 ` Jamal Hadi Salim
  2 siblings, 0 replies; 14+ messages in thread
From: Eric Dumazet @ 2019-06-22  0:36 UTC (permalink / raw)
  To: Joe Stringer, Florian Westphal
  Cc: netdev, john fastabend, Daniel Borkmann, Lorenz Bauer,
	Jakub Sitnicki, Paolo Abeni



On 6/21/19 10:58 AM, Joe Stringer wrote:
> Hi folks, picking this up again..
> 
> As discussed during LSFMM, I've been looking at adding something like
> an `skb_sk_assign()` helper to BPF so that logic similar to TPROXY can
> be implemented with integration into other BPF logic, however
> currently any attempts to do so are blocked by the skb_orphan() call
> in ip_rcv_core() (which will effectively ignore any socket assign
> decision made by the TC BPF program).
> 
> Recently I was attempting to remove the skb_orphan() call, and I've
> been trying different things but there seems to be some context I'm
> missing. Here's the core of the patch:
> 
> diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
> index ed97724c5e33..16aea980318a 100644
> --- a/net/ipv4/ip_input.c
> +++ b/net/ipv4/ip_input.c
> @@ -500,8 +500,6 @@ static struct sk_buff *ip_rcv_core(struct sk_buff
> *skb, struct net *net)
>        memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
>        IPCB(skb)->iif = skb->skb_iif;
> 
> -       /* Must drop socket now because of tproxy. */
> -       skb_orphan(skb);
> 
>        return skb;
> 
> The statement that the socket must be dropped because of tproxy
> doesn't make sense to me, because the PRE_ROUTING hook is hit after
> this, which will call into the tproxy logic and eventually
> nf_tproxy_assign_sock() which already does the skb_orphan() itself.
> 
> However, if I drop these lines then I end up causing sockets to
> release references too many times. Seems like if we don't orphan the
> skb here, then later logic assumes that we have one more reference
> than we actually have, and decrements the count when it shouldn't
> (perhaps the skb_steal_sock() call in __inet_lookup_skb() which seems
> to assume we always have a reference to the socket?)
> 
> Splat:
> 
> refcount_t hit zero at sk_stop_timer+0x2c/0x30 in cilium-agent[16359],
> uid/euid: 0/0
> WARNING: CPU: 0 PID: 16359 at kernel/panic.c:686 refcount_error_report+0x9c/0xa1
> ...
> ? inet_put_port+0xa6/0xd0
> inet_csk_clear_xmit_timers+0x2e/0x50
> tcp_done+0x8b/0xf0
> tcp_reset+0x49/0xc0
> tcp_validate_incoming+0x2f7/0x410
> tcp_rcv_state_process+0x250/0xdb6
> ? tcp_v4_connect+0x46f/0x4e0
> tcp_v4_do_rcv+0xbd/0x1f0
> __release_sock+0x84/0xd0
> release_sock+0x30/0xa0
> inet_stream_connect+0x47/0x60
> 
> (Full version: https://gist.github.com/joestringer/d5313e4bf4231e2c46405bd7a3053936
> )
> 
> This seems potentially related to some of the socket referencing
> discussion in the peer thread "[RFC bpf-next 0/7] Programming socket
> lookup with BPF".
> 
> During LSFMM, it seemed like no-one knew quite why the skb_orphan() is
> necessary in that path in the current version of the code, and that we
> may be able to remove it. Florian, I know you weren't in the room for
> that discussion, so raising it again now with a stack trace, Do you
> have some sense what's going on here and whether there's a path
> towards removing it from this path or allowing the skb->sk to be
> retained during ip_rcv() in some conditions?
> 

I guess you might missed part of the LSFMM discussion :

We need to make sure skb_orphan() is called from any virtual
driver ndo_start_xmit() that re-injects packet back to the stack.

loopback driver has it, but probably some driver lacks it.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Removing skb_orphan() from ip_rcv_core()
  2019-06-21 17:58 Removing skb_orphan() from ip_rcv_core() Joe Stringer
  2019-06-21 20:59 ` Florian Westphal
  2019-06-22  0:36 ` Eric Dumazet
@ 2019-06-24 14:47 ` Jamal Hadi Salim
  2019-06-24 16:49   ` Eric Dumazet
  2019-06-25  3:26   ` Joe Stringer
  2 siblings, 2 replies; 14+ messages in thread
From: Jamal Hadi Salim @ 2019-06-24 14:47 UTC (permalink / raw)
  To: Joe Stringer, Eric Dumazet, Florian Westphal
  Cc: netdev, john fastabend, Daniel Borkmann, Lorenz Bauer,
	Jakub Sitnicki, Paolo Abeni

[-- Attachment #1: Type: text/plain, Size: 1567 bytes --]

On 2019-06-21 1:58 p.m., Joe Stringer wrote:
> Hi folks, picking this up again..
[..]
> During LSFMM, it seemed like no-one knew quite why the skb_orphan() is
> necessary in that path in the current version of the code, and that we
> may be able to remove it. Florian, I know you weren't in the room for
> that discussion, so raising it again now with a stack trace, Do you
> have some sense what's going on here and whether there's a path
> towards removing it from this path or allowing the skb->sk to be
> retained during ip_rcv() in some conditions?


Sorry - I havent followed the discussion but saw your email over
the weekend and wanted to be at work to refresh my memory on some
code. For maybe 2-3 years we have deployed the tproxy
equivalent as a tc action on ingress (with no netfilter dependency).

And, of course, we had to work around that specific code you are
referring to - we didnt remove it. The tc action code increments
the sk refcount and sets the tc index. The net core doesnt orphan
the skb if a speacial tc index value is set (see attached patch)

I never bothered up streaming the patch because the hack is a bit 
embarrassing (but worked ;->); and never posted the action code
either because i thought this was just us that had this requirement.
I am glad other people see the need for this feature. Is there effort
to make this _not_ depend on iptables/netfilter? I am guessing if you
want to do this from ebpf (tc or xdp) that is a requirement.
Our need was with tcp at the time; so left udp dependency on netfilter
alone.

cheers,
jamal

[-- Attachment #2: tp_tcindex.patch --]
[-- Type: text/x-patch, Size: 1927 bytes --]

commit 4d130b0a883b4aebc36a88ca116746594e176c6a
Author: Jamal Hadi Salim <hadi@mojatatu.com>
Date:   Fri Nov 25 15:45:48 2016 -0400

    transparent proxy workaround so we can get the tcaction to work

diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index fa2dc8f692c6..29b303dbbfd9 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -482,8 +482,11 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 	memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
 	IPCB(skb)->iif = skb->skb_iif;
 
-	/* Must drop socket now because of tproxy. */
-	skb_orphan(skb);
+	/* Must drop socket now because of tproxy,
+	 * if we didnt set it already as usable
+	 * */
+	if(skb->tc_index != 0xFFFF)
+		skb_orphan(skb);
 
 	return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING,
 		       net, NULL, skb, dev, NULL,
diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index 9ee208a348f5..10148f2eec03 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -77,12 +77,16 @@ int ipv6_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt
 	u32 pkt_len;
 	struct inet6_dev *idev;
 	struct net *net = dev_net(skb->dev);
+	struct sock *orig_sk = NULL;
 
 	if (skb->pkt_type == PACKET_OTHERHOST) {
 		kfree_skb(skb);
 		return NET_RX_DROP;
 	}
 
+	if(skb->tc_index == 0xFFFF)
+		orig_sk = skb->sk;
+
 	rcu_read_lock();
 
 	idev = __in6_dev_get(skb->dev);
@@ -202,8 +206,17 @@ int ipv6_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt
 
 	rcu_read_unlock();
 
+	if (skb->tc_index == 0xFFFF && !skb->sk && orig_sk)
+	{
+		skb_orphan(skb);
+		skb->sk = orig_sk;
+		skb->destructor = sock_edemux;
+		atomic_inc_not_zero(&skb->sk->sk_refcnt);
+	}
+
 	/* Must drop socket now because of tproxy. */
-	skb_orphan(skb);
+	if(skb->tc_index != 0xFFFF)
+		skb_orphan(skb);
 
 	return NF_HOOK(NFPROTO_IPV6, NF_INET_PRE_ROUTING,
 		       net, NULL, skb, dev, NULL,

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: Removing skb_orphan() from ip_rcv_core()
  2019-06-24 14:47 ` Jamal Hadi Salim
@ 2019-06-24 16:49   ` Eric Dumazet
  2019-06-25 10:55     ` Jamal Hadi Salim
  2019-06-25  3:26   ` Joe Stringer
  1 sibling, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2019-06-24 16:49 UTC (permalink / raw)
  To: Jamal Hadi Salim, Joe Stringer, Eric Dumazet, Florian Westphal
  Cc: netdev, john fastabend, Daniel Borkmann, Lorenz Bauer,
	Jakub Sitnicki, Paolo Abeni



On 6/24/19 7:47 AM, Jamal Hadi Salim wrote:
> On 2019-06-21 1:58 p.m., Joe Stringer wrote:
>> Hi folks, picking this up again..
> [..]
>> During LSFMM, it seemed like no-one knew quite why the skb_orphan() is
>> necessary in that path in the current version of the code, and that we
>> may be able to remove it. Florian, I know you weren't in the room for
>> that discussion, so raising it again now with a stack trace, Do you
>> have some sense what's going on here and whether there's a path
>> towards removing it from this path or allowing the skb->sk to be
>> retained during ip_rcv() in some conditions?
> 
> 
> Sorry - I havent followed the discussion but saw your email over
> the weekend and wanted to be at work to refresh my memory on some
> code. For maybe 2-3 years we have deployed the tproxy
> equivalent as a tc action on ingress (with no netfilter dependency).
> 
> And, of course, we had to work around that specific code you are
> referring to - we didnt remove it. The tc action code increments
> the sk refcount and sets the tc index. The net core doesnt orphan
> the skb if a speacial tc index value is set (see attached patch)
> 
> I never bothered up streaming the patch because the hack is a bit embarrassing (but worked ;->); and never posted the action code
> either because i thought this was just us that had this requirement.
> I am glad other people see the need for this feature. Is there effort
> to make this _not_ depend on iptables/netfilter? I am guessing if you
> want to do this from ebpf (tc or xdp) that is a requirement.
> Our need was with tcp at the time; so left udp dependency on netfilter
> alone.
> 


Well, I would simply remove the skb_orphan() call completely.



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Removing skb_orphan() from ip_rcv_core()
  2019-06-21 20:59 ` Florian Westphal
@ 2019-06-25  3:17   ` Joe Stringer
  2019-06-25  6:37     ` Eric Dumazet
  0 siblings, 1 reply; 14+ messages in thread
From: Joe Stringer @ 2019-06-25  3:17 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Joe Stringer, Eric Dumazet, netdev, john fastabend,
	Daniel Borkmann, Lorenz Bauer, Jakub Sitnicki, Paolo Abeni

On Fri, Jun 21, 2019 at 1:59 PM Florian Westphal <fw@strlen.de> wrote:
>
> Joe Stringer <joe@wand.net.nz> wrote:
> > As discussed during LSFMM, I've been looking at adding something like
> > an `skb_sk_assign()` helper to BPF so that logic similar to TPROXY can
> > be implemented with integration into other BPF logic, however
> > currently any attempts to do so are blocked by the skb_orphan() call
> > in ip_rcv_core() (which will effectively ignore any socket assign
> > decision made by the TC BPF program).
> >
> > Recently I was attempting to remove the skb_orphan() call, and I've
> > been trying different things but there seems to be some context I'm
> > missing. Here's the core of the patch:
> >
> > diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
> > index ed97724c5e33..16aea980318a 100644
> > --- a/net/ipv4/ip_input.c
> > +++ b/net/ipv4/ip_input.c
> > @@ -500,8 +500,6 @@ static struct sk_buff *ip_rcv_core(struct sk_buff
> > *skb, struct net *net)
> >        memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
> >        IPCB(skb)->iif = skb->skb_iif;
> >
> > -       /* Must drop socket now because of tproxy. */
> > -       skb_orphan(skb);
> >
> >        return skb;
> >
> > The statement that the socket must be dropped because of tproxy
> > doesn't make sense to me, because the PRE_ROUTING hook is hit after
> > this, which will call into the tproxy logic and eventually
> > nf_tproxy_assign_sock() which already does the skb_orphan() itself.
>
> in comment: s/tproxy/skb_steal_sock/

For reference, I was following the path like this:

ip_rcv()
( -> ip_rcv_core() for skb_orphan)
-> NF_INET_PRE_ROUTING hook
(... invoke iptables hooks)
-> iptable_mangle_hook()
-> ipt_do_table()
... -> tproxy_tg4()
... -> nf_tproxy_assign_sock()
-> skb_orphan()
(... finish iptables processing)
( -> ip_rcv_finish())
( ... -> ip_rcv_finish_core() for early demux / route lookup )
(... -> dst_input())
(... -> tcp_v4_rcv())
( -> __inet_lookup_skb())
( -> skb_steal_sock() )

> at least thats what I concluded a few years ago when I looked into
> the skb_oprhan() need.
>
> IIRC some device drivers use skb->sk for backpressure, so without this
> non-tcp socket would be stolen by skb_steal_sock.

Do you happen to recall which device drivers? Or have some idea of a
list I could try to go through? Are you referring to virtual drivers
like veth or something else?

> We also recently removed skb orphan when crossing netns:
>
> commit 9c4c325252c54b34d53b3d0ffd535182b744e03d
> Author: Flavio Leitner <fbl@redhat.com>
> skbuff: preserve sock reference when scrubbing the skb.
>
> So thats another case where this orphan is needed.

Presumably the orphan is only needed in this case if the packet
crosses a namespace and then is subsequently passed back into the
stack?

> What could be done is adding some way to delay/defer the orphaning
> further, but we would need at the very least some annotation for
> skb_steal_sock to know when the skb->sk is really from TPROXY or
> if it has to orphan.

Eric mentions in another response to this thread that skb_orphan()
should be called from any ndo_start_xmit() which sends traffic back
into the stack. With that, presumably we would be pushing the
orphaning earlier such that the only way that the skb->sk ref can be
non-NULL around this point in receive would be because it was
specifically set by some kind of tproxy logic?

> Same for the safety check in the forwarding path.
> Netfilter modules need o be audited as well, they might make assumptions
> wrt. skb->sk being inet sockets (set by local stack or early demux).
>
> > However, if I drop these lines then I end up causing sockets to
> > release references too many times. Seems like if we don't orphan the
> > skb here, then later logic assumes that we have one more reference
> > than we actually have, and decrements the count when it shouldn't
> > (perhaps the skb_steal_sock() call in __inet_lookup_skb() which seems
> > to assume we always have a reference to the socket?)
>
> We might be calling the wrong destructor (i.e., the one set by tcp
> receive instead of the one set at tx time)?

Hmm, interesting thought. Sure enough, with a bit of bpftrace
debugging we find it's tcp_wfree():

$ cat ip_rcv.bt
#include <linux/skbuff.h>

kprobe:ip_rcv {
       $sk = ((struct sk_buff *)arg0)->sk;
       $des = ((struct sk_buff *)arg0)->destructor;
       if ($sk) {
               if ($des) {
                       printf("received %s on %s with sk destructor %s
set\n", str(arg0), str(arg1), ksym($des));
                       @ip4_stacks[kstack] = count();
               }
       }
}
$ sudo bpftrace ip_rcv.bt
Attaching 1 probe...
received  on eth0 with sk destructor tcp_wfree set
^C

@ip4_stacks[
   ip_rcv+1
   __netif_receive_skb+24
   process_backlog+179
   net_rx_action+304
   __do_softirq+220
   do_softirq_own_stack+42
   do_softirq.part.17+70
   __local_bh_enable_ip+101
   ip_finish_output2+421
   __ip_finish_output+187
   ip_finish_output+44
   ip_output+109
   ip_local_out+59
   __ip_queue_xmit+368
   ip_queue_xmit+16
   __tcp_transmit_skb+1303
   tcp_connect+2758
   tcp_v4_connect+1135
   __inet_stream_connect+214
   inet_stream_connect+59
   __sys_connect+237
   __x64_sys_connect+26
   do_syscall_64+90
   entry_SYSCALL_64_after_hwframe+68
]: 1

Is there a solution here where we call the destructor if it's not
sock_efree()? When the socket is later stolen, it will only return the
reference via a call to sock_put(), so presumably at that point in the
stack we already assume that the skb->destructor is not one of these
other destructors (otherwise we wouldn't release the resources
correctly).

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Removing skb_orphan() from ip_rcv_core()
  2019-06-24 14:47 ` Jamal Hadi Salim
  2019-06-24 16:49   ` Eric Dumazet
@ 2019-06-25  3:26   ` Joe Stringer
  2019-06-25 11:06     ` Jamal Hadi Salim
  1 sibling, 1 reply; 14+ messages in thread
From: Joe Stringer @ 2019-06-25  3:26 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Joe Stringer, Eric Dumazet, Florian Westphal, netdev,
	john fastabend, Daniel Borkmann, Lorenz Bauer, Jakub Sitnicki,
	Paolo Abeni

On Mon, Jun 24, 2019 at 7:47 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On 2019-06-21 1:58 p.m., Joe Stringer wrote:
> > Hi folks, picking this up again..
> [..]
> > During LSFMM, it seemed like no-one knew quite why the skb_orphan() is
> > necessary in that path in the current version of the code, and that we
> > may be able to remove it. Florian, I know you weren't in the room for
> > that discussion, so raising it again now with a stack trace, Do you
> > have some sense what's going on here and whether there's a path
> > towards removing it from this path or allowing the skb->sk to be
> > retained during ip_rcv() in some conditions?
>
>
> Sorry - I havent followed the discussion but saw your email over
> the weekend and wanted to be at work to refresh my memory on some
> code. For maybe 2-3 years we have deployed the tproxy
> equivalent as a tc action on ingress (with no netfilter dependency).
>
> And, of course, we had to work around that specific code you are
> referring to - we didnt remove it. The tc action code increments
> the sk refcount and sets the tc index. The net core doesnt orphan
> the skb if a speacial tc index value is set (see attached patch)
>
> I never bothered up streaming the patch because the hack is a bit
> embarrassing (but worked ;->); and never posted the action code
> either because i thought this was just us that had this requirement.
> I am glad other people see the need for this feature. Is there effort
> to make this _not_ depend on iptables/netfilter? I am guessing if you
> want to do this from ebpf (tc or xdp) that is a requirement.
> Our need was with tcp at the time; so left udp dependency on netfilter
> alone.

I haven't got as far as UDP yet, but I didn't see any need for a
dependency on netfilter.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Removing skb_orphan() from ip_rcv_core()
  2019-06-25  3:17   ` Joe Stringer
@ 2019-06-25  6:37     ` Eric Dumazet
  2019-06-25  9:35       ` Daniel Borkmann
  2019-06-25 18:20       ` Joe Stringer
  0 siblings, 2 replies; 14+ messages in thread
From: Eric Dumazet @ 2019-06-25  6:37 UTC (permalink / raw)
  To: Joe Stringer, Florian Westphal
  Cc: netdev, john fastabend, Daniel Borkmann, Lorenz Bauer,
	Jakub Sitnicki, Paolo Abeni



On 6/24/19 8:17 PM, Joe Stringer wrote:
> On Fri, Jun 21, 2019 at 1:59 PM Florian Westphal <fw@strlen.de> wrote:
>>
>> Joe Stringer <joe@wand.net.nz> wrote:
>>> As discussed during LSFMM, I've been looking at adding something like
>>> an `skb_sk_assign()` helper to BPF so that logic similar to TPROXY can
>>> be implemented with integration into other BPF logic, however
>>> currently any attempts to do so are blocked by the skb_orphan() call
>>> in ip_rcv_core() (which will effectively ignore any socket assign
>>> decision made by the TC BPF program).
>>>
>>> Recently I was attempting to remove the skb_orphan() call, and I've
>>> been trying different things but there seems to be some context I'm
>>> missing. Here's the core of the patch:
>>>
>>> diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
>>> index ed97724c5e33..16aea980318a 100644
>>> --- a/net/ipv4/ip_input.c
>>> +++ b/net/ipv4/ip_input.c
>>> @@ -500,8 +500,6 @@ static struct sk_buff *ip_rcv_core(struct sk_buff
>>> *skb, struct net *net)
>>>        memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
>>>        IPCB(skb)->iif = skb->skb_iif;
>>>
>>> -       /* Must drop socket now because of tproxy. */
>>> -       skb_orphan(skb);
>>>
>>>        return skb;
>>>
>>> The statement that the socket must be dropped because of tproxy
>>> doesn't make sense to me, because the PRE_ROUTING hook is hit after
>>> this, which will call into the tproxy logic and eventually
>>> nf_tproxy_assign_sock() which already does the skb_orphan() itself.
>>
>> in comment: s/tproxy/skb_steal_sock/
> 
> For reference, I was following the path like this:
> 
> ip_rcv()
> ( -> ip_rcv_core() for skb_orphan)
> -> NF_INET_PRE_ROUTING hook
> (... invoke iptables hooks)
> -> iptable_mangle_hook()
> -> ipt_do_table()
> ... -> tproxy_tg4()
> ... -> nf_tproxy_assign_sock()
> -> skb_orphan()
> (... finish iptables processing)
> ( -> ip_rcv_finish())
> ( ... -> ip_rcv_finish_core() for early demux / route lookup )
> (... -> dst_input())
> (... -> tcp_v4_rcv())
> ( -> __inet_lookup_skb())
> ( -> skb_steal_sock() )
> 
>> at least thats what I concluded a few years ago when I looked into
>> the skb_oprhan() need.
>>
>> IIRC some device drivers use skb->sk for backpressure, so without this
>> non-tcp socket would be stolen by skb_steal_sock.
> 
> Do you happen to recall which device drivers? Or have some idea of a
> list I could try to go through? Are you referring to virtual drivers
> like veth or something else?
> 
>> We also recently removed skb orphan when crossing netns:
>>
>> commit 9c4c325252c54b34d53b3d0ffd535182b744e03d
>> Author: Flavio Leitner <fbl@redhat.com>
>> skbuff: preserve sock reference when scrubbing the skb.
>>
>> So thats another case where this orphan is needed.
> 
> Presumably the orphan is only needed in this case if the packet
> crosses a namespace and then is subsequently passed back into the
> stack?

Yes, I understand we do not want the skb_orphan() when 'srubing' the skb.

But we want the skb_orphan() right before the packet is reinjected in ingress path. 

> 
>> What could be done is adding some way to delay/defer the orphaning
>> further, but we would need at the very least some annotation for
>> skb_steal_sock to know when the skb->sk is really from TPROXY or
>> if it has to orphan.
> 
> Eric mentions in another response to this thread that skb_orphan()
> should be called from any ndo_start_xmit() which sends traffic back
> into the stack. With that, presumably we would be pushing the
> orphaning earlier such that the only way that the skb->sk ref can be
> non-NULL around this point in receive would be because it was
> specifically set by some kind of tproxy logic?
> 
>> Same for the safety check in the forwarding path.
>> Netfilter modules need o be audited as well, they might make assumptions
>> wrt. skb->sk being inet sockets (set by local stack or early demux).
>>
>>> However, if I drop these lines then I end up causing sockets to
>>> release references too many times. Seems like if we don't orphan the
>>> skb here, then later logic assumes that we have one more reference
>>> than we actually have, and decrements the count when it shouldn't
>>> (perhaps the skb_steal_sock() call in __inet_lookup_skb() which seems
>>> to assume we always have a reference to the socket?)
>>
>> We might be calling the wrong destructor (i.e., the one set by tcp
>> receive instead of the one set at tx time)?
> 
> Hmm, interesting thought. Sure enough, with a bit of bpftrace
> debugging we find it's tcp_wfree():
> 
> $ cat ip_rcv.bt
> #include <linux/skbuff.h>
> 
> kprobe:ip_rcv {
>        $sk = ((struct sk_buff *)arg0)->sk;
>        $des = ((struct sk_buff *)arg0)->destructor;
>        if ($sk) {
>                if ($des) {
>                        printf("received %s on %s with sk destructor %s
> set\n", str(arg0), str(arg1), ksym($des));
>                        @ip4_stacks[kstack] = count();
>                }
>        }
> }
> $ sudo bpftrace ip_rcv.bt
> Attaching 1 probe...
> received  on eth0 with sk destructor tcp_wfree set
> ^C
> 
> @ip4_stacks[
>    ip_rcv+1
>    __netif_receive_skb+24
>    process_backlog+179
>    net_rx_action+304
>    __do_softirq+220
>    do_softirq_own_stack+42
>    do_softirq.part.17+70
>    __local_bh_enable_ip+101
>    ip_finish_output2+421
>    __ip_finish_output+187
>    ip_finish_output+44
>    ip_output+109
>    ip_local_out+59
>    __ip_queue_xmit+368
>    ip_queue_xmit+16
>    __tcp_transmit_skb+1303
>    tcp_connect+2758
>    tcp_v4_connect+1135
>    __inet_stream_connect+214
>    inet_stream_connect+59
>    __sys_connect+237
>    __x64_sys_connect+26
>    do_syscall_64+90
>    entry_SYSCALL_64_after_hwframe+68
> ]: 1
> 
> Is there a solution here where we call the destructor if it's not
> sock_efree()? When the socket is later stolen, it will only return the
> reference via a call to sock_put(), so presumably at that point in the
> stack we already assume that the skb->destructor is not one of these
> other destructors (otherwise we wouldn't release the resources
> correctly).
> 

What was the driver here ? In any case, the following patch should help.

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index eeacebd7debbe6a55daedb92f00afd48051ebaf8..5075b4b267af7057f69fcb935226fce097a920e2 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3699,6 +3699,7 @@ static __always_inline int ____dev_forward_skb(struct net_device *dev,
                return NET_RX_DROP;
        }
 
+       skb_orphan(skb);
        skb_scrub_packet(skb, true);
        skb->priority = 0;
        return 0;

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: Removing skb_orphan() from ip_rcv_core()
  2019-06-25  6:37     ` Eric Dumazet
@ 2019-06-25  9:35       ` Daniel Borkmann
  2019-06-25 17:03         ` Eric Dumazet
  2019-06-25 18:20       ` Joe Stringer
  1 sibling, 1 reply; 14+ messages in thread
From: Daniel Borkmann @ 2019-06-25  9:35 UTC (permalink / raw)
  To: Eric Dumazet, Joe Stringer, Florian Westphal
  Cc: netdev, john fastabend, Lorenz Bauer, Jakub Sitnicki,
	Paolo Abeni, Flavio Leitner, ast

On 06/25/2019 08:37 AM, Eric Dumazet wrote:
> On 6/24/19 8:17 PM, Joe Stringer wrote:
>> On Fri, Jun 21, 2019 at 1:59 PM Florian Westphal <fw@strlen.de> wrote:
>>>
>>> Joe Stringer <joe@wand.net.nz> wrote:
>>>> As discussed during LSFMM, I've been looking at adding something like
>>>> an `skb_sk_assign()` helper to BPF so that logic similar to TPROXY can
>>>> be implemented with integration into other BPF logic, however
>>>> currently any attempts to do so are blocked by the skb_orphan() call
>>>> in ip_rcv_core() (which will effectively ignore any socket assign
>>>> decision made by the TC BPF program).
>>>>
>>>> Recently I was attempting to remove the skb_orphan() call, and I've
>>>> been trying different things but there seems to be some context I'm
>>>> missing. Here's the core of the patch:
>>>>
>>>> diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
>>>> index ed97724c5e33..16aea980318a 100644
>>>> --- a/net/ipv4/ip_input.c
>>>> +++ b/net/ipv4/ip_input.c
>>>> @@ -500,8 +500,6 @@ static struct sk_buff *ip_rcv_core(struct sk_buff
>>>> *skb, struct net *net)
>>>>        memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
>>>>        IPCB(skb)->iif = skb->skb_iif;
>>>>
>>>> -       /* Must drop socket now because of tproxy. */
>>>> -       skb_orphan(skb);
>>>>
>>>>        return skb;
>>>>
>>>> The statement that the socket must be dropped because of tproxy
>>>> doesn't make sense to me, because the PRE_ROUTING hook is hit after
>>>> this, which will call into the tproxy logic and eventually
>>>> nf_tproxy_assign_sock() which already does the skb_orphan() itself.
>>>
>>> in comment: s/tproxy/skb_steal_sock/
>>
>> For reference, I was following the path like this:
>>
>> ip_rcv()
>> ( -> ip_rcv_core() for skb_orphan)
>> -> NF_INET_PRE_ROUTING hook
>> (... invoke iptables hooks)
>> -> iptable_mangle_hook()
>> -> ipt_do_table()
>> ... -> tproxy_tg4()
>> ... -> nf_tproxy_assign_sock()
>> -> skb_orphan()
>> (... finish iptables processing)
>> ( -> ip_rcv_finish())
>> ( ... -> ip_rcv_finish_core() for early demux / route lookup )
>> (... -> dst_input())
>> (... -> tcp_v4_rcv())
>> ( -> __inet_lookup_skb())
>> ( -> skb_steal_sock() )
>>
>>> at least thats what I concluded a few years ago when I looked into
>>> the skb_oprhan() need.
>>>
>>> IIRC some device drivers use skb->sk for backpressure, so without this
>>> non-tcp socket would be stolen by skb_steal_sock.
>>
>> Do you happen to recall which device drivers? Or have some idea of a
>> list I could try to go through? Are you referring to virtual drivers
>> like veth or something else?
>>
>>> We also recently removed skb orphan when crossing netns:
>>>
>>> commit 9c4c325252c54b34d53b3d0ffd535182b744e03d
>>> Author: Flavio Leitner <fbl@redhat.com>
>>> skbuff: preserve sock reference when scrubbing the skb.
>>>
>>> So thats another case where this orphan is needed.
>>
>> Presumably the orphan is only needed in this case if the packet
>> crosses a namespace and then is subsequently passed back into the
>> stack?
> 
> Yes, I understand we do not want the skb_orphan() when 'srubing' the skb.
> 
> But we want the skb_orphan() right before the packet is reinjected in ingress path. 
> 
>>> What could be done is adding some way to delay/defer the orphaning
>>> further, but we would need at the very least some annotation for
>>> skb_steal_sock to know when the skb->sk is really from TPROXY or
>>> if it has to orphan.
>>
>> Eric mentions in another response to this thread that skb_orphan()
>> should be called from any ndo_start_xmit() which sends traffic back
>> into the stack. With that, presumably we would be pushing the
>> orphaning earlier such that the only way that the skb->sk ref can be
>> non-NULL around this point in receive would be because it was
>> specifically set by some kind of tproxy logic?
>>
>>> Same for the safety check in the forwarding path.
>>> Netfilter modules need o be audited as well, they might make assumptions
>>> wrt. skb->sk being inet sockets (set by local stack or early demux).
>>>
>>>> However, if I drop these lines then I end up causing sockets to
>>>> release references too many times. Seems like if we don't orphan the
>>>> skb here, then later logic assumes that we have one more reference
>>>> than we actually have, and decrements the count when it shouldn't
>>>> (perhaps the skb_steal_sock() call in __inet_lookup_skb() which seems
>>>> to assume we always have a reference to the socket?)
>>>
>>> We might be calling the wrong destructor (i.e., the one set by tcp
>>> receive instead of the one set at tx time)?
>>
>> Hmm, interesting thought. Sure enough, with a bit of bpftrace
>> debugging we find it's tcp_wfree():
>>
>> $ cat ip_rcv.bt
>> #include <linux/skbuff.h>
>>
>> kprobe:ip_rcv {
>>        $sk = ((struct sk_buff *)arg0)->sk;
>>        $des = ((struct sk_buff *)arg0)->destructor;
>>        if ($sk) {
>>                if ($des) {
>>                        printf("received %s on %s with sk destructor %s
>> set\n", str(arg0), str(arg1), ksym($des));
>>                        @ip4_stacks[kstack] = count();
>>                }
>>        }
>> }
>> $ sudo bpftrace ip_rcv.bt
>> Attaching 1 probe...
>> received  on eth0 with sk destructor tcp_wfree set
>> ^C
>>
>> @ip4_stacks[
>>    ip_rcv+1
>>    __netif_receive_skb+24
>>    process_backlog+179
>>    net_rx_action+304
>>    __do_softirq+220
>>    do_softirq_own_stack+42
>>    do_softirq.part.17+70
>>    __local_bh_enable_ip+101
>>    ip_finish_output2+421
>>    __ip_finish_output+187
>>    ip_finish_output+44
>>    ip_output+109
>>    ip_local_out+59
>>    __ip_queue_xmit+368
>>    ip_queue_xmit+16
>>    __tcp_transmit_skb+1303
>>    tcp_connect+2758
>>    tcp_v4_connect+1135
>>    __inet_stream_connect+214
>>    inet_stream_connect+59
>>    __sys_connect+237
>>    __x64_sys_connect+26
>>    do_syscall_64+90
>>    entry_SYSCALL_64_after_hwframe+68
>> ]: 1
>>
>> Is there a solution here where we call the destructor if it's not
>> sock_efree()? When the socket is later stolen, it will only return the
>> reference via a call to sock_put(), so presumably at that point in the
>> stack we already assume that the skb->destructor is not one of these
>> other destructors (otherwise we wouldn't release the resources
>> correctly).
> 
> What was the driver here ? In any case, the following patch should help.
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index eeacebd7debbe6a55daedb92f00afd48051ebaf8..5075b4b267af7057f69fcb935226fce097a920e2 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -3699,6 +3699,7 @@ static __always_inline int ____dev_forward_skb(struct net_device *dev,
>                 return NET_RX_DROP;
>         }
>  
> +       skb_orphan(skb);
>         skb_scrub_packet(skb, true);
>         skb->priority = 0;
>         return 0;
> 

But wasn't the whole point of 9c4c325252c5 ("skbuff: preserve sock reference when
scrubbing the skb.") to defer orphaning to as late as possible? If I'm not missing
anything, then above would reintroduce the issues that 9c4c325252c5 was trying to
solve wrt TSQ/XPS/etc when skb was sent via veth based data path to cross netns and
then forwarded to phys dev for transmission; meaning, skb->sk is lost at the point
of dev_queue_xmit() for the latter. A side-effect this would also have is that this
changes behavior again for tc egress programs sitting on phys dev (e.g. querying
sock cookie or other related features).

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Removing skb_orphan() from ip_rcv_core()
  2019-06-24 16:49   ` Eric Dumazet
@ 2019-06-25 10:55     ` Jamal Hadi Salim
  0 siblings, 0 replies; 14+ messages in thread
From: Jamal Hadi Salim @ 2019-06-25 10:55 UTC (permalink / raw)
  To: Eric Dumazet, Joe Stringer, Florian Westphal
  Cc: netdev, john fastabend, Daniel Borkmann, Lorenz Bauer,
	Jakub Sitnicki, Paolo Abeni

On 2019-06-24 12:49 p.m., Eric Dumazet wrote:
>  
> 
> Well, I would simply remove the skb_orphan() call completely.

My experience: You still need to call skb_orphan() from the lower level
(and set the skb destructor etc).

cheers,
jamal


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Removing skb_orphan() from ip_rcv_core()
  2019-06-25  3:26   ` Joe Stringer
@ 2019-06-25 11:06     ` Jamal Hadi Salim
  2019-06-25 18:29       ` Joe Stringer
  0 siblings, 1 reply; 14+ messages in thread
From: Jamal Hadi Salim @ 2019-06-25 11:06 UTC (permalink / raw)
  To: Joe Stringer
  Cc: Eric Dumazet, Florian Westphal, netdev, john fastabend,
	Daniel Borkmann, Lorenz Bauer, Jakub Sitnicki, Paolo Abeni

On 2019-06-24 11:26 p.m., Joe Stringer wrote:
[..]
> 
> I haven't got as far as UDP yet, but I didn't see any need for a
> dependency on netfilter.

I'd be curious to see what you did. My experience, even for TCP is
the socket(transparent/tproxy) lookup code (to set skb->sk either
listening or established) is entangled in
CONFIG_NETFILTER_SOMETHING_OR_OTHER. You have to rip it out of
there (in the tproxy tc action into that  code). Only then can you
compile out netfilter.
I didnt bother to rip out code for udp case.
i.e if you needed udp to work with the tc action,
youd have to turn on NF. But that was because we had
no need for udp transparent proxying.
IOW:
There is really no reason, afaik, for tproxy code to only be
accessed if netfilter is compiled in. Not sure i made sense.

cheers,
jamal

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Removing skb_orphan() from ip_rcv_core()
  2019-06-25  9:35       ` Daniel Borkmann
@ 2019-06-25 17:03         ` Eric Dumazet
  0 siblings, 0 replies; 14+ messages in thread
From: Eric Dumazet @ 2019-06-25 17:03 UTC (permalink / raw)
  To: Daniel Borkmann, Joe Stringer, Florian Westphal
  Cc: netdev, john fastabend, Lorenz Bauer, Jakub Sitnicki,
	Paolo Abeni, Flavio Leitner, ast



On 6/25/19 2:35 AM, Daniel Borkmann wrote:

> 
> But wasn't the whole point of 9c4c325252c5 ("skbuff: preserve sock reference when
> scrubbing the skb.") to defer orphaning to as late as possible? If I'm not missing
> anything, then above would reintroduce the issues that 9c4c325252c5 was trying to
> solve wrt TSQ/XPS/etc when skb was sent via veth based data path to cross netns and
> then forwarded to phys dev for transmission; meaning, skb->sk is lost at the point
> of dev_queue_xmit() for the latter. A side-effect this would also have is that this
> changes behavior again for tc egress programs sitting on phys dev (e.g. querying
> sock cookie or other related features).


Unless we can detect/decide that a packet going through veth pair is going to be locally
consumed, or forwarded to a physical device (another ndo_start_xmit()), we need
to skb_orphan() the packet, exactly the same way than loopback ndo_start_xmit()

(We could have setups where these packets going through lo interface could be forwarded
to a NIC...)

Backpressure is a best effort, we should not make it an absolute requirement and
prevent doing early demux as early as possible in RX path.

 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Removing skb_orphan() from ip_rcv_core()
  2019-06-25  6:37     ` Eric Dumazet
  2019-06-25  9:35       ` Daniel Borkmann
@ 2019-06-25 18:20       ` Joe Stringer
  1 sibling, 0 replies; 14+ messages in thread
From: Joe Stringer @ 2019-06-25 18:20 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Joe Stringer, Florian Westphal, netdev, john fastabend,
	Daniel Borkmann, Lorenz Bauer, Jakub Sitnicki, Paolo Abeni

On Mon, Jun 24, 2019 at 11:37 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On 6/24/19 8:17 PM, Joe Stringer wrote:
> > On Fri, Jun 21, 2019 at 1:59 PM Florian Westphal <fw@strlen.de> wrote:
> >> Joe Stringer <joe@wand.net.nz> wrote:
> >>> However, if I drop these lines then I end up causing sockets to
> >>> release references too many times. Seems like if we don't orphan the
> >>> skb here, then later logic assumes that we have one more reference
> >>> than we actually have, and decrements the count when it shouldn't
> >>> (perhaps the skb_steal_sock() call in __inet_lookup_skb() which seems
> >>> to assume we always have a reference to the socket?)
> >>
> >> We might be calling the wrong destructor (i.e., the one set by tcp
> >> receive instead of the one set at tx time)?
> >
> > Hmm, interesting thought. Sure enough, with a bit of bpftrace
> > debugging we find it's tcp_wfree():
> >
> > $ cat ip_rcv.bt
> > #include <linux/skbuff.h>
> >
> > kprobe:ip_rcv {
> >        $sk = ((struct sk_buff *)arg0)->sk;
> >        $des = ((struct sk_buff *)arg0)->destructor;
> >        if ($sk) {
> >                if ($des) {
> >                        printf("received %s on %s with sk destructor %s
> > set\n", str(arg0), str(arg1), ksym($des));
> >                        @ip4_stacks[kstack] = count();
> >                }
> >        }
> > }
> > $ sudo bpftrace ip_rcv.bt
> > Attaching 1 probe...
> > received  on eth0 with sk destructor tcp_wfree set
> > ^C
> >
> > @ip4_stacks[
> >    ip_rcv+1
> >    __netif_receive_skb+24
> >    process_backlog+179
> >    net_rx_action+304
> >    __do_softirq+220
> >    do_softirq_own_stack+42
> >    do_softirq.part.17+70
> >    __local_bh_enable_ip+101
> >    ip_finish_output2+421
> >    __ip_finish_output+187
> >    ip_finish_output+44
> >    ip_output+109
> >    ip_local_out+59
> >    __ip_queue_xmit+368
> >    ip_queue_xmit+16
> >    __tcp_transmit_skb+1303
> >    tcp_connect+2758
> >    tcp_v4_connect+1135
> >    __inet_stream_connect+214
> >    inet_stream_connect+59
> >    __sys_connect+237
> >    __x64_sys_connect+26
> >    do_syscall_64+90
> >    entry_SYSCALL_64_after_hwframe+68
> > ]: 1
> >
> > Is there a solution here where we call the destructor if it's not
> > sock_efree()? When the socket is later stolen, it will only return the
> > reference via a call to sock_put(), so presumably at that point in the
> > stack we already assume that the skb->destructor is not one of these
> > other destructors (otherwise we wouldn't release the resources
> > correctly).
> >
>
> What was the driver here ? In any case, the following patch should help.
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index eeacebd7debbe6a55daedb92f00afd48051ebaf8..5075b4b267af7057f69fcb935226fce097a920e2 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -3699,6 +3699,7 @@ static __always_inline int ____dev_forward_skb(struct net_device *dev,
>                 return NET_RX_DROP;
>         }
>
> +       skb_orphan(skb);
>         skb_scrub_packet(skb, true);
>         skb->priority = 0;
>         return 0;

Looks like it was bridge in the end, found by attaching a similar
bpftrace program to __dev_forward_sk(). Interestingly enough, the
device attached to the skb reported its name as "eth0" despite not
having such a named link or named bridge that I could find anywhere
via "ip link" / "brctl show"..

    __dev_forward_skb+1
   dev_hard_start_xmit+151
   __dev_queue_xmit+1928
   dev_queue_xmit+16
   br_dev_queue_push_xmit+123
   br_forward_finish+69
   __br_forward+327
   br_forward+204
   br_dev_xmit+598
   dev_hard_start_xmit+151
   __dev_queue_xmit+1928
   dev_queue_xmit+16
   neigh_resolve_output+339
   ip_finish_output2+402
   __ip_finish_output+187
   ip_finish_output+44
   ip_output+109
   ip_local_out+59
   __ip_queue_xmit+368
   ip_queue_xmit+16
   __tcp_transmit_skb+1303
   tcp_connect+2758
   tcp_v4_connect+1135
   __inet_stream_connect+214
   inet_stream_connect+59
   __sys_connect+237
   __x64_sys_connect+26
   do_syscall_64+90
   entry_SYSCALL_64_after_hwframe+68

So I guess something like this could be another alternative:

diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index 82225b8b54f5..c2de2bb35080 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -65,6 +65,7 @@ EXPORT_SYMBOL_GPL(br_dev_queue_push_xmit);

int br_forward_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
{
+       skb_orphan(skb);
       skb->tstamp = 0;
       return NF_HOOK(NFPROTO_BRIDGE, NF_BR_POST_ROUTING,
                      net, sk, skb, NULL, skb->dev,

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: Removing skb_orphan() from ip_rcv_core()
  2019-06-25 11:06     ` Jamal Hadi Salim
@ 2019-06-25 18:29       ` Joe Stringer
  0 siblings, 0 replies; 14+ messages in thread
From: Joe Stringer @ 2019-06-25 18:29 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Joe Stringer, Eric Dumazet, Florian Westphal, netdev,
	john fastabend, Daniel Borkmann, Lorenz Bauer, Jakub Sitnicki,
	Paolo Abeni

On Tue, Jun 25, 2019 at 4:07 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On 2019-06-24 11:26 p.m., Joe Stringer wrote:
> [..]
> >
> > I haven't got as far as UDP yet, but I didn't see any need for a
> > dependency on netfilter.
>
> I'd be curious to see what you did. My experience, even for TCP is
> the socket(transparent/tproxy) lookup code (to set skb->sk either
> listening or established) is entangled in
> CONFIG_NETFILTER_SOMETHING_OR_OTHER. You have to rip it out of
> there (in the tproxy tc action into that  code). Only then can you
> compile out netfilter.
> I didnt bother to rip out code for udp case.
> i.e if you needed udp to work with the tc action,
> youd have to turn on NF. But that was because we had
> no need for udp transparent proxying.
> IOW:
> There is really no reason, afaik, for tproxy code to only be
> accessed if netfilter is compiled in. Not sure i made sense.

Oh, I see. Between the existing bpf_skc_lookup_tcp() and
bpf_sk_lookup_tcp() helpers in BPF, plus a new bpf_sk_assign() helper
and a little bit of lookup code using the appropriate tproxy ports
etc. from the BPF side, I was able to get it working. One could
imagine perhaps wrapping all this logic up in a higher level
"bpf_sk_lookup_tproxy()" helper call or similar, but I didn't go that
direction given that the BPF socket primitives seemed to provide the
necessary functionality in a more generic manner.

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2019-06-25 18:29 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-21 17:58 Removing skb_orphan() from ip_rcv_core() Joe Stringer
2019-06-21 20:59 ` Florian Westphal
2019-06-25  3:17   ` Joe Stringer
2019-06-25  6:37     ` Eric Dumazet
2019-06-25  9:35       ` Daniel Borkmann
2019-06-25 17:03         ` Eric Dumazet
2019-06-25 18:20       ` Joe Stringer
2019-06-22  0:36 ` Eric Dumazet
2019-06-24 14:47 ` Jamal Hadi Salim
2019-06-24 16:49   ` Eric Dumazet
2019-06-25 10:55     ` Jamal Hadi Salim
2019-06-25  3:26   ` Joe Stringer
2019-06-25 11:06     ` Jamal Hadi Salim
2019-06-25 18:29       ` Joe Stringer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).