All of lore.kernel.org
 help / color / mirror / Atom feed
* Removing skb_orphan() from ip_rcv_core()
@ 2019-06-21 17:58 Joe Stringer
  2019-06-21 20:59 ` Florian Westphal
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Joe Stringer @ 2019-06-21 17:58 UTC (permalink / raw)
  To: Eric Dumazet, Florian Westphal
  Cc: netdev, john fastabend, Daniel Borkmann, Lorenz Bauer,
	Jakub Sitnicki, Paolo Abeni

Hi folks, picking this up again..

As discussed during LSFMM, I've been looking at adding something like
an `skb_sk_assign()` helper to BPF so that logic similar to TPROXY can
be implemented with integration into other BPF logic, however
currently any attempts to do so are blocked by the skb_orphan() call
in ip_rcv_core() (which will effectively ignore any socket assign
decision made by the TC BPF program).

Recently I was attempting to remove the skb_orphan() call, and I've
been trying different things but there seems to be some context I'm
missing. Here's the core of the patch:

diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index ed97724c5e33..16aea980318a 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -500,8 +500,6 @@ static struct sk_buff *ip_rcv_core(struct sk_buff
*skb, struct net *net)
       memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
       IPCB(skb)->iif = skb->skb_iif;

-       /* Must drop socket now because of tproxy. */
-       skb_orphan(skb);

       return skb;

The statement that the socket must be dropped because of tproxy
doesn't make sense to me, because the PRE_ROUTING hook is hit after
this, which will call into the tproxy logic and eventually
nf_tproxy_assign_sock() which already does the skb_orphan() itself.

However, if I drop these lines then I end up causing sockets to
release references too many times. Seems like if we don't orphan the
skb here, then later logic assumes that we have one more reference
than we actually have, and decrements the count when it shouldn't
(perhaps the skb_steal_sock() call in __inet_lookup_skb() which seems
to assume we always have a reference to the socket?)

Splat:

refcount_t hit zero at sk_stop_timer+0x2c/0x30 in cilium-agent[16359],
uid/euid: 0/0
WARNING: CPU: 0 PID: 16359 at kernel/panic.c:686 refcount_error_report+0x9c/0xa1
...
? inet_put_port+0xa6/0xd0
inet_csk_clear_xmit_timers+0x2e/0x50
tcp_done+0x8b/0xf0
tcp_reset+0x49/0xc0
tcp_validate_incoming+0x2f7/0x410
tcp_rcv_state_process+0x250/0xdb6
? tcp_v4_connect+0x46f/0x4e0
tcp_v4_do_rcv+0xbd/0x1f0
__release_sock+0x84/0xd0
release_sock+0x30/0xa0
inet_stream_connect+0x47/0x60

(Full version: https://gist.github.com/joestringer/d5313e4bf4231e2c46405bd7a3053936
)

This seems potentially related to some of the socket referencing
discussion in the peer thread "[RFC bpf-next 0/7] Programming socket
lookup with BPF".

During LSFMM, it seemed like no-one knew quite why the skb_orphan() is
necessary in that path in the current version of the code, and that we
may be able to remove it. Florian, I know you weren't in the room for
that discussion, so raising it again now with a stack trace, Do you
have some sense what's going on here and whether there's a path
towards removing it from this path or allowing the skb->sk to be
retained during ip_rcv() in some conditions?

^ permalink raw reply related	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2019-06-25 18:29 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-21 17:58 Removing skb_orphan() from ip_rcv_core() Joe Stringer
2019-06-21 20:59 ` Florian Westphal
2019-06-25  3:17   ` Joe Stringer
2019-06-25  6:37     ` Eric Dumazet
2019-06-25  9:35       ` Daniel Borkmann
2019-06-25 17:03         ` Eric Dumazet
2019-06-25 18:20       ` Joe Stringer
2019-06-22  0:36 ` Eric Dumazet
2019-06-24 14:47 ` Jamal Hadi Salim
2019-06-24 16:49   ` Eric Dumazet
2019-06-25 10:55     ` Jamal Hadi Salim
2019-06-25  3:26   ` Joe Stringer
2019-06-25 11:06     ` Jamal Hadi Salim
2019-06-25 18:29       ` Joe Stringer

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.