From: Josef Bacik <josef@toxicpanda.com>
To: chuck.lever@oracle.com, jlayton@kernel.org,
	linux-nfs@vger.kernel.org, edumazet@google.com, kuba@kernel.org,
	kernel-team@fb.com
Subject: [PATCH] sunrpc: hold a ref on netns for tcp sockets
Date: Tue, 19 Mar 2024 15:49:49 -0400
Message-ID: <512efbd56ad3679068759586c6fa9b681aec14f0.1710877783.git.josef@toxicpanda.com>

We've been seeing variations of the following panic in production

  BUG: kernel NULL pointer dereference, address: 0000000000000000
  RIP: 0010:ip6_pol_route+0x59/0x7a0
  Call Trace:
   <IRQ>
   ? __die+0x78/0xc0
   ? page_fault_oops+0x286/0x380
   ? fib6_table_lookup+0x95/0xf40
   ? exc_page_fault+0x5d/0x110
   ? asm_exc_page_fault+0x22/0x30
   ? ip6_pol_route+0x59/0x7a0
   ? unlink_anon_vmas+0x370/0x370
   fib6_rule_lookup+0x56/0x1b0
   ? update_blocked_averages+0x2c6/0x6a0
   ip6_route_output_flags+0xd2/0x130
   ip6_dst_lookup_tail+0x3b/0x220
   ip6_dst_lookup_flow+0x2c/0x80
   inet6_sk_rebuild_header+0x14c/0x1e0
   ? tcp_release_cb+0x150/0x150
   __tcp_retransmit_skb+0x68/0x6b0
   ? tcp_current_mss+0xca/0x150
   ? tcp_release_cb+0x150/0x150
   tcp_send_loss_probe+0x8e/0x220
   tcp_write_timer+0xbe/0x2d0
   run_timer_softirq+0x272/0x840
   ? hrtimer_interrupt+0x2c9/0x5f0
   ? sched_clock_cpu+0xc/0x170
   irq_exit_rcu+0x171/0x330
   sysvec_apic_timer_interrupt+0x6d/0x80
   </IRQ>
   <TASK>
   asm_sysvec_apic_timer_interrupt+0x16/0x20
  RIP: 0010:cpuidle_enter_state+0xe7/0x243

Inspecting the vmcore with drgn, you can see why this is a NULL pointer dereference:

    >>> prog.crashed_thread().stack_trace()[0]
    #0 at 0xffffffff810bfa89 (ip6_pol_route+0x59/0x796) in ip6_pol_route at net/ipv6/route.c:2212:40

    2212        if (net->ipv6.devconf_all->forwarding == 0)
    2213              strict |= RT6_LOOKUP_F_REACHABLE;

    >>> prog.crashed_thread().stack_trace()[0]['net'].ipv6.devconf_all
    (struct ipv6_devconf *)0x0

Looking at the socket, you can see that it has been closed:

    >>> decode_enum_type_flags(prog.crashed_thread().stack_trace()[11]['sk'].__sk_common.skc_flags, prog.type('enum sock_flags'))
    'SOCK_DEAD|SOCK_KEEPOPEN|SOCK_ZAPPED|SOCK_USE_WRITE_QUEUE'
    >>> decode_enum_type_flags(1 << prog.crashed_thread().stack_trace()[11]['sk'].__sk_common.skc_state.value_(), prog["TCPF_CLOSE"].type_, bit_numbers=False)
    'TCPF_FIN_WAIT1'
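
You can also confirm that this is a kernel socket which never took a
reference on its netns; sk_net_refcnt lives in sock_common as
skc_net_refcnt, so a check along these lines (a hypothetical drgn
session, the exact field layout varies by kernel version) would show

    >>> prog.crashed_thread().stack_trace()[11]['sk'].__sk_common.skc_net_refcnt
    (unsigned char)0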

This occurs in our container setup where we have an NFS mount that
belongs to the container's network namespace.  On container shutdown
the netns goes away, which sets net->ipv6.devconf_all = NULL, and then
we panic.  In the kernel we're responsible either for destroying our
sockets when the network namespace exits, or for holding a reference on
the network namespace for our sockets so this doesn't happen.
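
For context, the NULL comes from the ipv6 pernet exit path, which frees
the devconf structures and clears the pointers when the netns is torn
down.  Roughly, paraphrased from addrconf_exit_net() in
net/ipv6/addrconf.c (details vary by kernel version):

    static void __net_exit addrconf_exit_net(struct net *net)
    {
            ...
            kfree(net->ipv6.devconf_dflt);
            net->ipv6.devconf_dflt = NULL;
            kfree(net->ipv6.devconf_all);
            net->ipv6.devconf_all = NULL;
            ...
    }

Any timer that dereferences net->ipv6.devconf_all after this point hits
the NULL pointer dereference above.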

Even after we shut down the socket, TCP timers can still fire in the
background, hence this panic.  SUNRPC shuts down the socket and throws
away all knowledge of it, but the TCP stack is still doing work on it
in the background.
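
The reason only kernel sockets are affected is how socket allocation
treats the namespace: sockets with sk_net_refcnt set take a real
reference, while kernel sockets only register a non-refcounted tracker.
Roughly, paraphrased from sk_alloc() in net/core/sock.c (details vary
by kernel version):

    if (likely(sk->sk_net_refcnt)) {
            get_net_track(net, &sk->ns_tracker, priority);
            sock_inuse_add(net, 1);
    } else {
            __netns_tracker_alloc(net, &sk->ns_tracker, false, priority);
    }

SUNRPC creates its transport sockets as kernel sockets, so sk_net_refcnt
is 0 and nothing pins the netns while the TCP timers are still armed.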

Fix this by grabbing a reference on the network namespace for any TCP
sockets we open.  With this patch I'm able to cycle my 500 node stress
tier over and over again without panicking, whereas previously I was
losing 10-20 nodes every shutdown cycle.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 net/sunrpc/xprtsock.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index bb81050c870e..f02387751a94 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -2333,6 +2333,7 @@ static int xs_tcp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)
 
 	if (!transport->inet) {
 		struct sock *sk = sock->sk;
+		struct net *net = sock_net(sk);
 
 		/* Avoid temporary address, they are bad for long-lived
 		 * connections such as NFS mounts.
@@ -2350,7 +2351,26 @@ static int xs_tcp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)
 		tcp_sock_set_nodelay(sk);
 
 		lock_sock(sk);
+		/*
+		 * Because timers can fire after the socket has been shut down,
+		 * we need to hold a reference on the netns for this socket.
+		 */
+		if (!sk->sk_net_refcnt) {
+			if (!maybe_get_net(net)) {
+				release_sock(sk);
+				return -ENOTCONN;
+			}
+			/*
+			 * For kernel sockets a tracker is put in place for
+			 * tracing; we need to free it to maintain consistent
+			 * tracking info.
+			 */
+			__netns_tracker_free(net, &sk->ns_tracker, false);
 
+			sk->sk_net_refcnt = 1;
+			netns_tracker_alloc(net, &sk->ns_tracker, GFP_KERNEL);
+			sock_inuse_add(net, 1);
+		}
 		xs_save_old_callbacks(transport, sk);
 
 		sk->sk_user_data = xprt;
-- 
2.43.0



Thread overview: 10+ messages
2024-03-19 19:49 Josef Bacik [this message]
2024-03-19 20:07 ` [PATCH][RESEND] sunrpc: hold a ref on netns for tcp sockets Josef Bacik
2024-03-19 19:53 ` [PATCH] " Chuck Lever III
2024-03-19 21:59 ` [PATCH][RESEND] " Trond Myklebust
2024-03-20 14:10   ` Josef Bacik
2024-03-20 14:28     ` Eric Dumazet
2024-03-20 14:56       ` Josef Bacik
2024-03-20 15:00         ` Eric Dumazet
2024-03-20 15:35           ` Josef Bacik
2024-03-21 18:49           ` Josef Bacik
