dccp.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/2] tcp/dcpp: tw_timer tweaks for nohz_full and PREEMPT_RT
@ 2023-11-15 21:05 Valentin Schneider
  2023-11-15 21:05 ` [PATCH v2 1/2] tcp/dcpp: Un-pin tw_timer Valentin Schneider
  2023-11-15 21:05 ` [PATCH v2 2/2] tcp/dcpp: Don't disable bh around timewait_sock initialization Valentin Schneider
  0 siblings, 2 replies; 11+ messages in thread
From: Valentin Schneider @ 2023-11-15 21:05 UTC (permalink / raw)
  To: dccp, netdev, linux-kernel, linux-rt-users
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	David Ahern, Juri Lelli, Tomas Glozar, Sebastian Andrzej Siewior,
	Thomas Gleixner

Hi,

This is v2 of [1] where the tw_timer is un-pinned to get rid of interferences in
isolated CPUs setups.

Patch 1 is pretty much the same as v1, just got an extra comment in
inet_twsk_deschedule_put() to highlight the race.

Patch 2 was added as AFAICT the bh_disable is no longer needed after patch 1,
and Sebastian mentioned during LPC the he had been looking at getting rid of it
for removing softirq_ctrl.lock in PREEMPT_RT.

Eric mentionned rsk_timer needs looking into, but I haven't had the time to do
that. It doesn't show up in our testing, which might be due to its relatively
low timeout (IIUC 3s).

[1]: https://lore.kernel.org/all/20231016125934.1970789-1-vschneid@redhat.com/

Valentin Schneider (2):
  tcp/dcpp: Un-pin tw_timer
  tcp/dcpp: Don't disable bh around timewait_sock initialization

 net/dccp/minisocks.c          | 14 ++++----------
 net/ipv4/inet_timewait_sock.c | 20 +++++++++++++++-----
 net/ipv4/tcp_minisocks.c      | 14 ++++----------
 3 files changed, 23 insertions(+), 25 deletions(-)

--
2.41.0


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v2 1/2] tcp/dcpp: Un-pin tw_timer
  2023-11-15 21:05 [PATCH v2 0/2] tcp/dcpp: tw_timer tweaks for nohz_full and PREEMPT_RT Valentin Schneider
@ 2023-11-15 21:05 ` Valentin Schneider
  2023-11-20 17:55   ` Eric Dumazet
  2023-11-15 21:05 ` [PATCH v2 2/2] tcp/dcpp: Don't disable bh around timewait_sock initialization Valentin Schneider
  1 sibling, 1 reply; 11+ messages in thread
From: Valentin Schneider @ 2023-11-15 21:05 UTC (permalink / raw)
  To: dccp, netdev, linux-kernel, linux-rt-users
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	David Ahern, Juri Lelli, Tomas Glozar, Sebastian Andrzej Siewior,
	Thomas Gleixner

The TCP timewait timer is proving to be problematic for setups where scheduler
CPU isolation is achieved at runtime via cpusets (as opposed to statically via
isolcpus=domains).

What happens there is a CPU goes through tcp_time_wait(), arming the time_wait
timer, then gets isolated. TCP_TIMEWAIT_LEN later, the timer fires, causing
interference for the now-isolated CPU. This is conceptually similar to the issue
described in
  e02b93124855 ("workqueue: Unbind kworkers before sending them to exit()")

Keep softirqs disabled, but make the timer un-pinned and arm it *after* the
hashdance.

This introduces the following (non-fatal) race:

  CPU0                        CPU1
    allocates a tw
    insert it in hash table
				finds the TW and removes it
				(timer cancel does nothing)
    arms a TW timer, lasting

This partially reverts
  ed2e92394589 ("tcp/dccp: fix timewait races in timer handling")
and
  ec94c2696f0b ("tcp/dccp: avoid one atomic operation for timewait hashdance")

This also reinstores a comment from
  ec94c2696f0b ("tcp/dccp: avoid one atomic operation for timewait hashdance")
as inet_twsk_hashdance() had a "Step 1" and "Step 3" comment, but the "Step
2" had gone missing.

Link: https://lore.kernel.org/all/ZPhpfMjSiHVjQkTk@localhost.localdomain/
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 net/dccp/minisocks.c          | 16 +++++++---------
 net/ipv4/inet_timewait_sock.c | 20 +++++++++++++++-----
 net/ipv4/tcp_minisocks.c      | 16 +++++++---------
 3 files changed, 29 insertions(+), 23 deletions(-)

diff --git a/net/dccp/minisocks.c b/net/dccp/minisocks.c
index 64d805b27adde..2f0fad4255e36 100644
--- a/net/dccp/minisocks.c
+++ b/net/dccp/minisocks.c
@@ -53,16 +53,14 @@ void dccp_time_wait(struct sock *sk, int state, int timeo)
 		if (state == DCCP_TIME_WAIT)
 			timeo = DCCP_TIMEWAIT_LEN;
 
-		/* tw_timer is pinned, so we need to make sure BH are disabled
-		 * in following section, otherwise timer handler could run before
-		 * we complete the initialization.
-		 */
-		local_bh_disable();
-		inet_twsk_schedule(tw, timeo);
-		/* Linkage updates.
-		 * Note that access to tw after this point is illegal.
-		 */
+	       local_bh_disable();
+
+		// Linkage updates
 		inet_twsk_hashdance(tw, sk, &dccp_hashinfo);
+		inet_twsk_schedule(tw, timeo);
+		// Access to tw after this point is illegal.
+		inet_twsk_put(tw);
+
 		local_bh_enable();
 	} else {
 		/* Sorry, if we're out of memory, just CLOSE this
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index dd37a5bf68811..f9b2bbedf1cfc 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -144,6 +144,7 @@ void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
 
 	spin_lock(lock);
 
+	/* Step 2: Hash TW into tcp ehash chain */
 	inet_twsk_add_node_rcu(tw, &ehead->chain);
 
 	/* Step 3: Remove SK from hash chain */
@@ -152,16 +153,15 @@ void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
 
 	spin_unlock(lock);
 
-	/* tw_refcnt is set to 3 because we have :
+	/* tw_refcnt is set to 4 because we have :
 	 * - one reference for bhash chain.
 	 * - one reference for ehash chain.
 	 * - one reference for timer.
+	 * - one reference for ourself (our caller will release it).
 	 * We can use atomic_set() because prior spin_lock()/spin_unlock()
 	 * committed into memory all tw fields.
-	 * Also note that after this point, we lost our implicit reference
-	 * so we are not allowed to use tw anymore.
 	 */
-	refcount_set(&tw->tw_refcnt, 3);
+	refcount_set(&tw->tw_refcnt, 4);
 }
 EXPORT_SYMBOL_GPL(inet_twsk_hashdance);
 
@@ -207,7 +207,7 @@ struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk,
 		tw->tw_prot	    = sk->sk_prot_creator;
 		atomic64_set(&tw->tw_cookie, atomic64_read(&sk->sk_cookie));
 		twsk_net_set(tw, sock_net(sk));
-		timer_setup(&tw->tw_timer, tw_timer_handler, TIMER_PINNED);
+		timer_setup(&tw->tw_timer, tw_timer_handler, 0);
 		/*
 		 * Because we use RCU lookups, we should not set tw_refcnt
 		 * to a non null value before everything is setup for this
@@ -232,6 +232,16 @@ EXPORT_SYMBOL_GPL(inet_twsk_alloc);
  */
 void inet_twsk_deschedule_put(struct inet_timewait_sock *tw)
 {
+	/* This can race with tcp_time_wait() and dccp_time_wait(), as the timer
+	 * is armed /after/ adding it to the hashtables.
+	 *
+	 * If this is interleaved between inet_twsk_hashdance() and inet_twsk_put(),
+	 * then this is a no-op: the timer will still end up armed.
+	 *
+	 * Conversely, if this successfully deletes the timer, then we know we
+	 * have already gone through {tcp,dcpp}_time_wait(), and we can safely
+	 * call inet_twsk_kill().
+	 */
 	if (del_timer_sync(&tw->tw_timer))
 		inet_twsk_kill(tw);
 	inet_twsk_put(tw);
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index a9807eeb311ca..48eb0310fe837 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -338,16 +338,14 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
 		if (state == TCP_TIME_WAIT)
 			timeo = TCP_TIMEWAIT_LEN;
 
-		/* tw_timer is pinned, so we need to make sure BH are disabled
-		 * in following section, otherwise timer handler could run before
-		 * we complete the initialization.
-		 */
-		local_bh_disable();
-		inet_twsk_schedule(tw, timeo);
-		/* Linkage updates.
-		 * Note that access to tw after this point is illegal.
-		 */
+	       local_bh_disable();
+
+		// Linkage updates.
 		inet_twsk_hashdance(tw, sk, net->ipv4.tcp_death_row.hashinfo);
+		inet_twsk_schedule(tw, timeo);
+		// Access to tw after this point is illegal.
+		inet_twsk_put(tw);
+
 		local_bh_enable();
 	} else {
 		/* Sorry, if we're out of memory, just CLOSE this
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v2 2/2] tcp/dcpp: Don't disable bh around timewait_sock initialization
  2023-11-15 21:05 [PATCH v2 0/2] tcp/dcpp: tw_timer tweaks for nohz_full and PREEMPT_RT Valentin Schneider
  2023-11-15 21:05 ` [PATCH v2 1/2] tcp/dcpp: Un-pin tw_timer Valentin Schneider
@ 2023-11-15 21:05 ` Valentin Schneider
  2023-11-20 15:23   ` kernel test robot
  1 sibling, 1 reply; 11+ messages in thread
From: Valentin Schneider @ 2023-11-15 21:05 UTC (permalink / raw)
  To: dccp, netdev, linux-kernel, linux-rt-users
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	David Ahern, Juri Lelli, Tomas Glozar, Sebastian Andrzej Siewior,
	Thomas Gleixner

Now that the tw_timer is armed *after* the hashdance, it is the last step
of the timewait initialization. We can thus enable softirqs without running
the risk of the timer handler running before the initialization is done.

This is conceptually a revert of
  cfac7f836a71 ("tcp/dccp: block bh before arming time_wait timer")

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 net/dccp/minisocks.c     | 4 ----
 net/ipv4/tcp_minisocks.c | 4 ----
 2 files changed, 8 deletions(-)

diff --git a/net/dccp/minisocks.c b/net/dccp/minisocks.c
index 2f0fad4255e36..cb990bc92a5c9 100644
--- a/net/dccp/minisocks.c
+++ b/net/dccp/minisocks.c
@@ -53,15 +53,11 @@ void dccp_time_wait(struct sock *sk, int state, int timeo)
 		if (state == DCCP_TIME_WAIT)
 			timeo = DCCP_TIMEWAIT_LEN;
 
-	       local_bh_disable();
-
 		// Linkage updates
 		inet_twsk_hashdance(tw, sk, &dccp_hashinfo);
 		inet_twsk_schedule(tw, timeo);
 		// Access to tw after this point is illegal.
 		inet_twsk_put(tw);
-
-		local_bh_enable();
 	} else {
 		/* Sorry, if we're out of memory, just CLOSE this
 		 * socket up.  We've got bigger problems than
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 48eb0310fe837..c7d46674d55cb 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -338,15 +338,11 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
 		if (state == TCP_TIME_WAIT)
 			timeo = TCP_TIMEWAIT_LEN;
 
-	       local_bh_disable();
-
 		// Linkage updates.
 		inet_twsk_hashdance(tw, sk, net->ipv4.tcp_death_row.hashinfo);
 		inet_twsk_schedule(tw, timeo);
 		// Access to tw after this point is illegal.
 		inet_twsk_put(tw);
-
-		local_bh_enable();
 	} else {
 		/* Sorry, if we're out of memory, just CLOSE this
 		 * socket up.  We've got bigger problems than
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 2/2] tcp/dcpp: Don't disable bh around timewait_sock initialization
  2023-11-15 21:05 ` [PATCH v2 2/2] tcp/dcpp: Don't disable bh around timewait_sock initialization Valentin Schneider
@ 2023-11-20 15:23   ` kernel test robot
  2023-11-23 15:26     ` Valentin Schneider
  0 siblings, 1 reply; 11+ messages in thread
From: kernel test robot @ 2023-11-20 15:23 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: oe-lkp, lkp, dccp, netdev, linux-kernel, linux-rt-users,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	David Ahern, Juri Lelli, Tomas Glozar, Sebastian Andrzej Siewior,
	Thomas Gleixner, oliver.sang



Hello,

kernel test robot noticed "WARNING:inconsistent_lock_state" on:

commit: 0258784e371906dfa1419556dcb7905333039f34 ("[PATCH v2 2/2] tcp/dcpp: Don't disable bh around timewait_sock initialization")
url: https://github.com/intel-lab-lkp/linux/commits/Valentin-Schneider/tcp-dcpp-Un-pin-tw_timer/20231116-053100
base: https://git.kernel.org/cgit/linux/kernel/git/davem/net.git 674e318089468ece99aef4796eaef7add57f36b2
patch link: https://lore.kernel.org/all/20231115210509.481514-3-vschneid@redhat.com/
patch subject: [PATCH v2 2/2] tcp/dcpp: Don't disable bh around timewait_sock initialization

in testcase: boot

compiler: gcc-12
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

(please refer to attached dmesg/kmsg for entire log/backtrace)



If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202311202323.8c66ae1c-oliver.sang@intel.com



[   53.969777][    C0]
[   53.970087][    C0] ================================
[   53.970532][    C0] WARNING: inconsistent lock state
[   53.970976][    C0] 6.6.0-15915-g0258784e3719 #1 Tainted: G        W        N
[   53.971566][    C0] --------------------------------
[   53.972004][    C0] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
[   53.972562][    C0] kallsyms_test/97 [HC0[0]:SC1[1]:HE1:SE0] takes:
[ 53.973095][ C0] ffffc90000281348 (&tcp_hashinfo.bhash[i].lock){+.?.}-{2:2}, at: inet_twsk_hashdance (include/linux/spinlock.h:351 net/ipv4/inet_timewait_sock.c:132) 
[   53.973952][    C0] {SOFTIRQ-ON-W} state was registered at:
[ 53.974434][ C0] __lock_acquire (kernel/locking/lockdep.c:5090) 
[ 53.974857][ C0] lock_acquire (kernel/locking/lockdep.c:467 kernel/locking/lockdep.c:5755 kernel/locking/lockdep.c:5718) 
[ 53.975264][ C0] _raw_spin_lock (include/linux/spinlock_api_smp.h:134 kernel/locking/spinlock.c:154) 
[ 53.975673][ C0] inet_twsk_hashdance (include/linux/spinlock.h:351 net/ipv4/inet_timewait_sock.c:132) 
[ 53.976122][ C0] tcp_time_wait (include/net/inet_timewait_sock.h:108 net/ipv4/tcp_minisocks.c:343) 
[ 53.976533][ C0] tcp_fin (net/ipv4/tcp_input.c:4508) 
[ 53.976911][ C0] tcp_data_queue (net/ipv4/tcp_input.c:5188) 
[ 53.977336][ C0] tcp_rcv_state_process (net/ipv4/tcp_input.c:6850) 
[ 53.977802][ C0] tcp_v4_do_rcv (net/ipv4/tcp_ipv4.c:1929) 
[ 53.978216][ C0] __release_sock (net/core/sock.c:2970) 
[ 53.978634][ C0] __tcp_close (net/ipv4/tcp.c:2847) 
[ 53.979035][ C0] tcp_close (net/ipv4/tcp.c:2922) 
[ 53.979413][ C0] inet_release (net/ipv4/af_inet.c:434) 
[ 53.979817][ C0] __sock_release (net/socket.c:660) 
[ 53.980229][ C0] sock_close (net/socket.c:1421) 
[ 53.980610][ C0] __fput (fs/file_table.c:394) 
[ 53.980986][ C0] task_work_run (kernel/task_work.c:182 (discriminator 1)) 
[ 53.981401][ C0] do_exit (kernel/exit.c:872) 
[ 53.981777][ C0] do_group_exit (kernel/exit.c:1002) 
[ 53.982183][ C0] __ia32_sys_exit_group (kernel/exit.c:1030) 
[ 53.982632][ C0] __do_fast_syscall_32 (arch/x86/entry/common.c:164 arch/x86/entry/common.c:230) 
[ 53.983076][ C0] do_fast_syscall_32 (arch/x86/entry/common.c:255) 
[ 53.983505][ C0] entry_SYSENTER_compat_after_hwframe (arch/x86/entry/entry_64_compat.S:121) 
[   53.984034][    C0] irq event stamp: 95812558
[ 53.984434][ C0] hardirqs last enabled at (95812558): _raw_spin_unlock_irqrestore (arch/x86/include/asm/irqflags.h:26 arch/x86/include/asm/irqflags.h:67 arch/x86/include/asm/irqflags.h:127 include/linux/spinlock_api_smp.h:151 kernel/locking/spinlock.c:194) 
[ 53.985302][ C0] hardirqs last disabled at (95812557): _raw_spin_lock_irqsave (include/linux/spinlock_api_smp.h:108 kernel/locking/spinlock.c:162) 
[ 53.986123][ C0] softirqs last enabled at (95812510): __do_softirq (arch/x86/include/asm/preempt.h:27 kernel/softirq.c:400 kernel/softirq.c:582) 
[ 53.986900][ C0] softirqs last disabled at (95812513): irq_exit_rcu (kernel/softirq.c:427 kernel/softirq.c:632 kernel/softirq.c:622 kernel/softirq.c:644) 
[   53.987664][    C0]
[   53.987664][    C0] other info that might help us debug this:
[   53.988335][    C0]  Possible unsafe locking scenario:
[   53.988335][    C0]
[   53.988971][    C0]        CPU0
[   53.989291][    C0]        ----
[   53.989611][    C0]   lock(&tcp_hashinfo.bhash[i].lock);
[   53.990076][    C0]   <Interrupt>
[   53.990404][    C0]     lock(&tcp_hashinfo.bhash[i].lock);
[   53.990883][    C0]
[   53.990883][    C0]  *** DEADLOCK ***
[   53.990883][    C0]
[   53.991593][    C0] 3 locks held by kallsyms_test/97:
[ 53.992048][ C0] #0: ffffffff87560480 (rcu_read_lock){....}-{1:2}, at: netif_receive_skb_list_internal (include/linux/rcupdate.h:301 include/linux/rcupdate.h:747 net/core/dev.c:5748) 
[ 53.992919][ C0] #1: ffffffff87560480 (rcu_read_lock){....}-{1:2}, at: ip_local_deliver_finish+0x1d3/0x410 
[ 53.993815][ C0] #2: ffff88816d626cb0 (slock-AF_INET/1){+.-.}-{2:2}, at: tcp_v4_rcv (include/linux/skbuff.h:1609 include/net/tcp.h:2458 net/ipv4/tcp_ipv4.c:2326) 
[   53.994590][    C0]
[   53.994590][    C0] stack backtrace:
[   53.995117][    C0] CPU: 0 PID: 97 Comm: kallsyms_test Tainted: G        W        N 6.6.0-15915-g0258784e3719 #1 f6bfa75c5f93478329dec95cb0ab5dfc3e35c498
[   53.996184][    C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[   53.997020][    C0] Call Trace:
[   53.997341][    C0]  <IRQ>
[ 53.997631][ C0] dump_stack_lvl (lib/dump_stack.c:107) 
[ 53.998033][ C0] mark_lock_irq (kernel/locking/lockdep.c:4216) 
[ 53.998447][ C0] ? print_usage_bug+0x680/0x680 
[ 53.999002][ C0] ? filter_irq_stacks (kernel/stacktrace.c:114) 
[ 53.999514][ C0] ? save_trace (kernel/locking/lockdep.c:586) 
[ 53.999909][ C0] ? __lock_release+0xf6/0x400 
[ 54.000367][ C0] ? __stack_depot_save (lib/stackdepot.c:441) 
[ 54.000823][ C0] ? reacquire_held_locks (kernel/locking/lockdep.c:5404) 
[ 54.001285][ C0] mark_lock (kernel/locking/lockdep.c:4677) 
[ 54.001675][ C0] ? mark_lock_irq (kernel/locking/lockdep.c:4638) 
[ 54.002104][ C0] ? do_raw_spin_unlock (arch/x86/include/asm/atomic.h:23 include/linux/atomic/atomic-arch-fallback.h:457 include/linux/atomic/atomic-instrumented.h:33 include/asm-generic/qspinlock.h:57 kernel/locking/spinlock_debug.c:100 kernel/locking/spinlock_debug.c:140) 
[ 54.002546][ C0] mark_usage (kernel/locking/lockdep.c:4566) 
[ 54.002930][ C0] __lock_acquire (kernel/locking/lockdep.c:5090) 
[ 54.003352][ C0] ? mark_lock (arch/x86/include/asm/bitops.h:227 (discriminator 3) arch/x86/include/asm/bitops.h:239 (discriminator 3) include/asm-generic/bitops/instrumented-non-atomic.h:142 (discriminator 3) kernel/locking/lockdep.c:228 (discriminator 3) kernel/locking/lockdep.c:4655 (discriminator 3)) 
[ 54.003748][ C0] lock_acquire (kernel/locking/lockdep.c:467 kernel/locking/lockdep.c:5755 kernel/locking/lockdep.c:5718) 
[ 54.004148][ C0] ? inet_twsk_hashdance (include/linux/spinlock.h:351 net/ipv4/inet_timewait_sock.c:132) 
[ 54.004608][ C0] ? lock_sync (kernel/locking/lockdep.c:5721) 
[ 54.005007][ C0] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4566) 
[ 54.005548][ C0] ? lockdep_init_map_type (kernel/locking/lockdep.c:4891) 
[ 54.006012][ C0] _raw_spin_lock (include/linux/spinlock_api_smp.h:134 kernel/locking/spinlock.c:154) 
[ 54.006413][ C0] ? inet_twsk_hashdance (include/linux/spinlock.h:351 net/ipv4/inet_timewait_sock.c:132) 
[ 54.006868][ C0] inet_twsk_hashdance (include/linux/spinlock.h:351 net/ipv4/inet_timewait_sock.c:132) 
[ 54.007313][ C0] ? inet_twsk_alloc (net/ipv4/inet_timewait_sock.c:222) 
[ 54.007742][ C0] tcp_time_wait (include/net/inet_timewait_sock.h:108 net/ipv4/tcp_minisocks.c:343) 
[ 54.008149][ C0] tcp_rcv_state_process (net/ipv4/tcp_input.c:6790) 
[ 54.008621][ C0] ? lock_sync (kernel/locking/lockdep.c:5721) 
[ 54.009020][ C0] ? tcp_rcv_synsent_state_process (net/ipv4/tcp_input.c:6605) 
[ 54.009547][ C0] ? do_raw_spin_lock (arch/x86/include/asm/atomic.h:115 include/linux/atomic/atomic-arch-fallback.h:2164 include/linux/atomic/atomic-instrumented.h:1296 include/asm-generic/qspinlock.h:111 kernel/locking/spinlock_debug.c:115) 
[ 54.009982][ C0] tcp_v4_do_rcv (net/ipv4/tcp_ipv4.c:1929) 
[ 54.010389][ C0] tcp_v4_rcv (net/ipv4/tcp_ipv4.c:2329) 
[ 54.010795][ C0] ? tcp_v4_early_demux (net/ipv4/tcp_ipv4.c:2162) 
[ 54.011246][ C0] ? lock_sync (kernel/locking/lockdep.c:5721) 
[ 54.011641][ C0] ? raw_rcv (net/ipv4/raw.c:165) 
[ 54.012021][ C0] ? __kmem_cache_free (mm/slub.c:3809 mm/slub.c:3822) 
[ 54.012461][ C0] ip_protocol_deliver_rcu (net/ipv4/ip_input.c:205) 
[ 54.012927][ C0] ip_local_deliver_finish+0x28b/0x410 
[ 54.013467][ C0] ip_sublist_rcv_finish (net/ipv4/ip_input.c:578) 
[ 54.013914][ C0] ip_list_rcv_finish+0x316/0x4b0 
[ 54.014422][ C0] ? ip_rcv_finish_core+0x13d0/0x13d0 
[ 54.014952][ C0] ip_list_rcv (net/ipv4/ip_input.c:645) 
[ 54.015351][ C0] ? ip_rcv (net/ipv4/ip_input.c:645) 
[ 54.015729][ C0] __netif_receive_skb_list_core (net/core/dev.c:5582) 
[ 54.016231][ C0] ? __netif_receive_skb_core+0x2050/0x2050 
[ 54.016798][ C0] ? reacquire_held_locks (kernel/locking/lockdep.c:5404) 
[ 54.017257][ C0] netif_receive_skb_list_internal (net/core/dev.c:5674 net/core/dev.c:5763) 
[ 54.017769][ C0] ? dev_gro_receive (arch/x86/include/asm/bitops.h:227 (discriminator 8) arch/x86/include/asm/bitops.h:239 (discriminator 8) include/asm-generic/bitops/instrumented-non-atomic.h:142 (discriminator 8) net/core/gro.c:533 (discriminator 8)) 
[ 54.018208][ C0] ? process_backlog (net/core/dev.c:5735) 
[ 54.018641][ C0] ? napi_gro_receive (include/trace/events/net.h:288 net/core/gro.c:610) 
[ 54.019075][ C0] ? e1000_clean_rx_irq (drivers/net/ethernet/intel/e1000/e1000_main.c:4480) 
[ 54.019528][ C0] napi_complete_done (include/linux/list.h:37 include/net/gro.h:440 include/net/gro.h:435 net/core/dev.c:6103) 
[ 54.019963][ C0] ? napi_busy_loop (net/core/dev.c:6070) 
[ 54.020389][ C0] ? e1000_clean_rx_ring (drivers/net/ethernet/intel/e1000/e1000_main.c:4350) 
[ 54.020848][ C0] e1000_clean (drivers/net/ethernet/intel/e1000/e1000_main.c:3811) 
[ 54.021235][ C0] ? e1000_clean_tx_irq (drivers/net/ethernet/intel/e1000/e1000_main.c:3796) 
[ 54.021693][ C0] __napi_poll+0x99/0x430 
[ 54.022154][ C0] net_rx_action (net/core/dev.c:6604 net/core/dev.c:6735) 
[ 54.022565][ C0] ? __napi_poll+0x430/0x430 
[ 54.023043][ C0] ? asym_cpu_capacity_scan (kernel/sched/clock.c:389) 
[ 54.023516][ C0] ? lock_sync (kernel/locking/lockdep.c:5721) 
[ 54.023911][ C0] __do_softirq (arch/x86/include/asm/jump_label.h:27 include/linux/jump_label.h:207 include/trace/events/irq.h:142 kernel/softirq.c:554) 
[ 54.024316][ C0] ? __lock_text_end (kernel/softirq.c:511) 
[ 54.024729][ C0] irq_exit_rcu (kernel/softirq.c:427 kernel/softirq.c:632 kernel/softirq.c:622 kernel/softirq.c:644) 
[ 54.025116][ C0] common_interrupt (arch/x86/kernel/irq.c:247 (discriminator 14)) 
[   54.025529][    C0]  </IRQ>
[   54.025823][    C0]  <TASK>
[ 54.026117][ C0] asm_common_interrupt (arch/x86/include/asm/idtentry.h:636) 
[ 54.026551][ C0] RIP: kallsyms_expand_symbol+0xff/0x2e0 
[ 54.027129][ C0] Code: 48 81 c3 e0 0e a3 86 48 89 d8 49 89 d8 48 89 df 48 c1 e8 03 41 83 e0 07 42 0f b6 04 28 44 38 c0 7f 08 84 c0 0f 85 39 01 00 00 <0f> b6 ad e0 0e a3 86 40 84 ed 74 65 85 d2 75 2a 48 83 c3 01 48 89
All code
========
   0:	48 81 c3 e0 0e a3 86 	add    $0xffffffff86a30ee0,%rbx
   7:	48 89 d8             	mov    %rbx,%rax
   a:	49 89 d8             	mov    %rbx,%r8
   d:	48 89 df             	mov    %rbx,%rdi
  10:	48 c1 e8 03          	shr    $0x3,%rax
  14:	41 83 e0 07          	and    $0x7,%r8d
  18:	42 0f b6 04 28       	movzbl (%rax,%r13,1),%eax
  1d:	44 38 c0             	cmp    %r8b,%al
  20:	7f 08                	jg     0x2a
  22:	84 c0                	test   %al,%al
  24:	0f 85 39 01 00 00    	jne    0x163
  2a:*	0f b6 ad e0 0e a3 86 	movzbl -0x795cf120(%rbp),%ebp		<-- trapping instruction
  31:	40 84 ed             	test   %bpl,%bpl
  34:	74 65                	je     0x9b
  36:	85 d2                	test   %edx,%edx
  38:	75 2a                	jne    0x64
  3a:	48 83 c3 01          	add    $0x1,%rbx
  3e:	48                   	rex.W
  3f:	89                   	.byte 0x89

Code starting with the faulting instruction
===========================================
   0:	0f b6 ad e0 0e a3 86 	movzbl -0x795cf120(%rbp),%ebp
   7:	40 84 ed             	test   %bpl,%bpl
   a:	74 65                	je     0x71
   c:	85 d2                	test   %edx,%edx
   e:	75 2a                	jne    0x3a
  10:	48 83 c3 01          	add    $0x1,%rbx
  14:	48                   	rex.W
  15:	89                   	.byte 0x89


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231120/202311202323.8c66ae1c-oliver.sang@intel.com



-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] tcp/dcpp: Un-pin tw_timer
  2023-11-15 21:05 ` [PATCH v2 1/2] tcp/dcpp: Un-pin tw_timer Valentin Schneider
@ 2023-11-20 17:55   ` Eric Dumazet
  2023-11-23 14:34     ` Valentin Schneider
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Dumazet @ 2023-11-20 17:55 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: dccp, netdev, linux-kernel, linux-rt-users, David S. Miller,
	Jakub Kicinski, Paolo Abeni, David Ahern, Juri Lelli,
	Tomas Glozar, Sebastian Andrzej Siewior, Thomas Gleixner

On Wed, Nov 15, 2023 at 10:05 PM Valentin Schneider <vschneid@redhat.com> wrote:
>
> The TCP timewait timer is proving to be problematic for setups where scheduler
> CPU isolation is achieved at runtime via cpusets (as opposed to statically via
> isolcpus=domains).
>
> What happens there is a CPU goes through tcp_time_wait(), arming the time_wait
> timer, then gets isolated. TCP_TIMEWAIT_LEN later, the timer fires, causing
> interference for the now-isolated CPU. This is conceptually similar to the issue
> described in
>   e02b93124855 ("workqueue: Unbind kworkers before sending them to exit()")
>
> Keep softirqs disabled, but make the timer un-pinned and arm it *after* the
> hashdance.
>
> This introduces the following (non-fatal) race:
>
>   CPU0                        CPU1
>     allocates a tw
>     insert it in hash table
>                                 finds the TW and removes it
>                                 (timer cancel does nothing)
>     arms a TW timer, lasting
>
> This partially reverts
>   ed2e92394589 ("tcp/dccp: fix timewait races in timer handling")
> and
>   ec94c2696f0b ("tcp/dccp: avoid one atomic operation for timewait hashdance")
>
> This also reinstores a comment from
>   ec94c2696f0b ("tcp/dccp: avoid one atomic operation for timewait hashdance")
> as inet_twsk_hashdance() had a "Step 1" and "Step 3" comment, but the "Step
> 2" had gone missing.
>
> Link: https://lore.kernel.org/all/ZPhpfMjSiHVjQkTk@localhost.localdomain/
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
>  net/dccp/minisocks.c          | 16 +++++++---------
>  net/ipv4/inet_timewait_sock.c | 20 +++++++++++++++-----
>  net/ipv4/tcp_minisocks.c      | 16 +++++++---------
>  3 files changed, 29 insertions(+), 23 deletions(-)
>
> diff --git a/net/dccp/minisocks.c b/net/dccp/minisocks.c
> index 64d805b27adde..2f0fad4255e36 100644
> --- a/net/dccp/minisocks.c
> +++ b/net/dccp/minisocks.c
> @@ -53,16 +53,14 @@ void dccp_time_wait(struct sock *sk, int state, int timeo)
>                 if (state == DCCP_TIME_WAIT)
>                         timeo = DCCP_TIMEWAIT_LEN;
>
> -               /* tw_timer is pinned, so we need to make sure BH are disabled
> -                * in following section, otherwise timer handler could run before
> -                * we complete the initialization.
> -                */
> -               local_bh_disable();
> -               inet_twsk_schedule(tw, timeo);
> -               /* Linkage updates.
> -                * Note that access to tw after this point is illegal.
> -                */
> +              local_bh_disable();
> +
> +               // Linkage updates
>                 inet_twsk_hashdance(tw, sk, &dccp_hashinfo);
> +               inet_twsk_schedule(tw, timeo);

We could arm a timer there, while another thread/cpu found the TW in
the ehash table.



> +               // Access to tw after this point is illegal.
> +               inet_twsk_put(tw);

This would eventually call inet_twsk_free() while the timer is armed.

I think more work is needed.

Perhaps make sure that a live timer owns a reference on tw->tw_refcnt
(This is not the case atm)

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] tcp/dcpp: Un-pin tw_timer
  2023-11-20 17:55   ` Eric Dumazet
@ 2023-11-23 14:34     ` Valentin Schneider
  2023-11-23 16:32       ` Eric Dumazet
  0 siblings, 1 reply; 11+ messages in thread
From: Valentin Schneider @ 2023-11-23 14:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: dccp, netdev, linux-kernel, linux-rt-users, David S. Miller,
	Jakub Kicinski, Paolo Abeni, David Ahern, Juri Lelli,
	Tomas Glozar, Sebastian Andrzej Siewior, Thomas Gleixner

Hi Eric,

On Mon, 20 Nov 2023 at 18:56, Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, Nov 15, 2023 at 10:05 PM Valentin Schneider <vschneid@redhat.com> wrote:
> >
> > @@ -53,16 +53,14 @@ void dccp_time_wait(struct sock *sk, int state, int timeo)
> >                 if (state == DCCP_TIME_WAIT)
> >                         timeo = DCCP_TIMEWAIT_LEN;
> >
> > -               /* tw_timer is pinned, so we need to make sure BH are disabled
> > -                * in following section, otherwise timer handler could run before
> > -                * we complete the initialization.
> > -                */
> > -               local_bh_disable();
> > -               inet_twsk_schedule(tw, timeo);
> > -               /* Linkage updates.
> > -                * Note that access to tw after this point is illegal.
> > -                */
> > +              local_bh_disable();
> > +
> > +               // Linkage updates
> >                 inet_twsk_hashdance(tw, sk, &dccp_hashinfo);
> > +               inet_twsk_schedule(tw, timeo);
>
> We could arm a timer there, while another thread/cpu found the TW in
> the ehash table.
>
>
>
> > +               // Access to tw after this point is illegal.
> > +               inet_twsk_put(tw);
>
> This would eventually call inet_twsk_free() while the timer is armed.
>
> I think more work is needed.
>
> Perhaps make sure that a live timer owns a reference on tw->tw_refcnt
> (This is not the case atm)
>

I thought that was already the case, per inet_twsk_hashdance():

/* tw_refcnt is set to 3 because we have :
 * - one reference for bhash chain.
 * - one reference for ehash chain.
 * - one reference for timer.

and

tw_timer_handler()
`\
  inet_twsk_kill()
  `\
    inet_twsk_put()

So AFAICT, after we go through the hashdance, there's a reference on
tw_refcnt held by the tw_timer.
inet_twsk_deschedule_put() can race with arming the timer, but it only
calls inet_twsk_kill() if the timer
was already armed & has been deleted, so there's no risk of calling it
twice... If I got it right :-)


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 2/2] tcp/dcpp: Don't disable bh around timewait_sock initialization
  2023-11-20 15:23   ` kernel test robot
@ 2023-11-23 15:26     ` Valentin Schneider
  0 siblings, 0 replies; 11+ messages in thread
From: Valentin Schneider @ 2023-11-23 15:26 UTC (permalink / raw)
  To: kernel test robot
  Cc: oe-lkp, lkp, dccp, netdev, linux-kernel, linux-rt-users,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	David Ahern, Juri Lelli, Tomas Glozar, Sebastian Andrzej Siewior,
	Thomas Gleixner

On Mon, 20 Nov 2023 at 16:24, kernel test robot <oliver.sang@intel.com> wrote:
>
>
>
> Hello,
>
> kernel test robot noticed "WARNING:inconsistent_lock_state" on:
>
> commit: 0258784e371906dfa1419556dcb7905333039f34 ("[PATCH v2 2/2] tcp/dcpp: Don't disable bh around timewait_sock initialization")
> url: https://github.com/intel-lab-lkp/linux/commits/Valentin-Schneider/tcp-dcpp-Un-pin-tw_timer/20231116-053100
> base: https://git.kernel.org/cgit/linux/kernel/git/davem/net.git 674e318089468ece99aef4796eaef7add57f36b2
> patch link: https://lore.kernel.org/all/20231115210509.481514-3-vschneid@redhat.com/
> patch subject: [PATCH v2 2/2] tcp/dcpp: Don't disable bh around timewait_sock initialization
>
> in testcase: boot
>
> compiler: gcc-12
> test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
>
> (please refer to attached dmesg/kmsg for entire log/backtrace)
>
>
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <oliver.sang@intel.com>
> | Closes: https://lore.kernel.org/oe-lkp/202311202323.8c66ae1c-oliver.sang@intel.com
>
>
>
> [   53.969777][    C0]
> [   53.970087][    C0] ================================
> [   53.970532][    C0] WARNING: inconsistent lock state
> [   53.970976][    C0] 6.6.0-15915-g0258784e3719 #1 Tainted: G        W        N
> [   53.971566][    C0] --------------------------------
> [   53.972004][    C0] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
> [   53.972562][    C0] kallsyms_test/97 [HC0[0]:SC1[1]:HE1:SE0] takes:
> [ 53.973095][ C0] ffffc90000281348 (&tcp_hashinfo.bhash[i].lock){+.?.}-{2:2}, at: inet_twsk_hashdance (include/linux/spinlock.h:351 net/ipv4/inet_timewait_sock.c:132)
> [   53.973952][    C0] {SOFTIRQ-ON-W} state was registered at:
> [ 53.974434][ C0] __lock_acquire (kernel/locking/lockdep.c:5090)
> [ 53.974857][ C0] lock_acquire (kernel/locking/lockdep.c:467 kernel/locking/lockdep.c:5755 kernel/locking/lockdep.c:5718)
> [ 53.975264][ C0] _raw_spin_lock (include/linux/spinlock_api_smp.h:134 kernel/locking/spinlock.c:154)
> [ 53.975673][ C0] inet_twsk_hashdance (include/linux/spinlock.h:351 net/ipv4/inet_timewait_sock.c:132)
> [ 53.976122][ C0] tcp_time_wait (include/net/inet_timewait_sock.h:108 net/ipv4/tcp_minisocks.c:343)
> [ 53.976533][ C0] tcp_fin (net/ipv4/tcp_input.c:4508)
> [ 53.976911][ C0] tcp_data_queue (net/ipv4/tcp_input.c:5188)
> [ 53.977336][ C0] tcp_rcv_state_process (net/ipv4/tcp_input.c:6850)
> [ 53.977802][ C0] tcp_v4_do_rcv (net/ipv4/tcp_ipv4.c:1929)
> [ 53.978216][ C0] __release_sock (net/core/sock.c:2970)
> [ 53.978634][ C0] __tcp_close (net/ipv4/tcp.c:2847)
> [ 53.979035][ C0] tcp_close (net/ipv4/tcp.c:2922)
> [ 53.979413][ C0] inet_release (net/ipv4/af_inet.c:434)
> [ 53.979817][ C0] __sock_release (net/socket.c:660)
> [ 53.980229][ C0] sock_close (net/socket.c:1421)
> [ 53.980610][ C0] __fput (fs/file_table.c:394)
> [ 53.980986][ C0] task_work_run (kernel/task_work.c:182 (discriminator 1))
> [ 53.981401][ C0] do_exit (kernel/exit.c:872)
> [ 53.981777][ C0] do_group_exit (kernel/exit.c:1002)

> [   53.984034][    C0] irq event stamp: 95812558
> [ 53.984434][ C0] hardirqs last enabled at (95812558): _raw_spin_unlock_irqrestore (arch/x86/include/asm/irqflags.h:26 arch/x86/include/asm/irqflags.h:67 arch/x86/include/asm/irqflags.h:127 include/linux/spinlock_api_smp.h:151 kernel/locking/spinlock.c:194)
> [ 53.985302][ C0] hardirqs last disabled at (95812557): _raw_spin_lock_irqsave (include/linux/spinlock_api_smp.h:108 kernel/locking/spinlock.c:162)
> [ 53.986123][ C0] softirqs last enabled at (95812510): __do_softirq (arch/x86/include/asm/preempt.h:27 kernel/softirq.c:400 kernel/softirq.c:582)
> [ 53.986900][ C0] softirqs last disabled at (95812513): irq_exit_rcu (kernel/softirq.c:427 kernel/softirq.c:632 kernel/softirq.c:622 kernel/softirq.c:644)
> [   53.987664][    C0]
> [   53.987664][    C0] other info that might help us debug this:
> [   53.988335][    C0]  Possible unsafe locking scenario:
> [   53.988335][    C0]
> [   53.988971][    C0]        CPU0
> [   53.989291][    C0]        ----
> [   53.989611][    C0]   lock(&tcp_hashinfo.bhash[i].lock);
> [   53.990076][    C0]   <Interrupt>
> [   53.990404][    C0]     lock(&tcp_hashinfo.bhash[i].lock);
> [   53.990883][    C0]
> [   53.990883][    C0]  *** DEADLOCK ***
> [   53.990883][    C0]
> [   53.991593][    C0] 3 locks held by kallsyms_test/97:
> [ 53.992048][ C0] #0: ffffffff87560480 (rcu_read_lock){....}-{1:2}, at: netif_receive_skb_list_internal (include/linux/rcupdate.h:301 include/linux/rcupdate.h:747 net/core/dev.c:5748)
> [ 53.992919][ C0] #1: ffffffff87560480 (rcu_read_lock){....}-{1:2}, at: ip_local_deliver_finish+0x1d3/0x410
> [ 53.993815][ C0] #2: ffff88816d626cb0 (slock-AF_INET/1){+.-.}-{2:2}, at: tcp_v4_rcv (include/linux/skbuff.h:1609 include/net/tcp.h:2458 net/ipv4/tcp_ipv4.c:2326)

So this looks like we do rely on disabling BH in the hashdance, not
only for the timer but also for the
  spin_lock(&bhead2->lock);
which needs to be softirq-safe.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] tcp/dcpp: Un-pin tw_timer
  2023-11-23 14:34     ` Valentin Schneider
@ 2023-11-23 16:32       ` Eric Dumazet
  2023-11-23 16:45         ` Valentin Schneider
                           ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Eric Dumazet @ 2023-11-23 16:32 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: dccp, netdev, linux-kernel, linux-rt-users, David S. Miller,
	Jakub Kicinski, Paolo Abeni, David Ahern, Juri Lelli,
	Tomas Glozar, Sebastian Andrzej Siewior, Thomas Gleixner

On Thu, Nov 23, 2023 at 3:34 PM Valentin Schneider <vschneid@redhat.com> wrote:
>
> Hi Eric,
>
> On Mon, 20 Nov 2023 at 18:56, Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Wed, Nov 15, 2023 at 10:05 PM Valentin Schneider <vschneid@redhat.com> wrote:
> > >
> > > @@ -53,16 +53,14 @@ void dccp_time_wait(struct sock *sk, int state, int timeo)
> > >                 if (state == DCCP_TIME_WAIT)
> > >                         timeo = DCCP_TIMEWAIT_LEN;
> > >
> > > -               /* tw_timer is pinned, so we need to make sure BH are disabled
> > > -                * in following section, otherwise timer handler could run before
> > > -                * we complete the initialization.
> > > -                */
> > > -               local_bh_disable();
> > > -               inet_twsk_schedule(tw, timeo);
> > > -               /* Linkage updates.
> > > -                * Note that access to tw after this point is illegal.
> > > -                */
> > > +              local_bh_disable();
> > > +
> > > +               // Linkage updates
> > >                 inet_twsk_hashdance(tw, sk, &dccp_hashinfo);
> > > +               inet_twsk_schedule(tw, timeo);
> >
> > We could arm a timer there, while another thread/cpu found the TW in
> > the ehash table.
> >
> >
> >
> > > +               // Access to tw after this point is illegal.
> > > +               inet_twsk_put(tw);
> >
> > This would eventually call inet_twsk_free() while the timer is armed.
> >
> > I think more work is needed.
> >
> > Perhaps make sure that a live timer owns a reference on tw->tw_refcnt
> > (This is not the case atm)
> >
>
> I thought that was already the case, per inet_twsk_hashdance():
>
> /* tw_refcnt is set to 3 because we have :
>  * - one reference for bhash chain.
>  * - one reference for ehash chain.
>  * - one reference for timer.
>
> and
>
> tw_timer_handler()
> `\
>   inet_twsk_kill()
>   `\
>     inet_twsk_put()
>
> So AFAICT, after we go through the hashdance, there's a reference on
> tw_refcnt held by the tw_timer.
> inet_twsk_deschedule_put() can race with arming the timer, but it only
> calls inet_twsk_kill() if the timer
> was already armed & has been deleted, so there's no risk of calling it
> twice... If I got it right :-)
>

Again, I think you missed some details.

I am OOO for a few days, I do not have time to elaborate.

You will need to properly track active timer by elevating
tw->tw_refcnt, or I guarantee something wrong will happen.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] tcp/dcpp: Un-pin tw_timer
  2023-11-23 16:32       ` Eric Dumazet
@ 2023-11-23 16:45         ` Valentin Schneider
  2023-12-08 14:47         ` Valentin Schneider
  2024-01-12  9:08         ` Paolo Abeni
  2 siblings, 0 replies; 11+ messages in thread
From: Valentin Schneider @ 2023-11-23 16:45 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: dccp, netdev, linux-kernel, linux-rt-users, David S. Miller,
	Jakub Kicinski, Paolo Abeni, David Ahern, Juri Lelli,
	Tomas Glozar, Sebastian Andrzej Siewior, Thomas Gleixner

On Thu, 23 Nov 2023 at 17:32, Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Nov 23, 2023 at 3:34 PM Valentin Schneider <vschneid@redhat.com> wrote:
> > I thought that was already the case, per inet_twsk_hashdance():
> >
> > /* tw_refcnt is set to 3 because we have :
> >  * - one reference for bhash chain.
> >  * - one reference for ehash chain.
> >  * - one reference for timer.
> >
> > and
> >
> > tw_timer_handler()
> > `\
> >   inet_twsk_kill()
> >   `\
> >     inet_twsk_put()
> >
> > So AFAICT, after we go through the hashdance, there's a reference on
> > tw_refcnt held by the tw_timer.
> > inet_twsk_deschedule_put() can race with arming the timer, but it only
> > calls inet_twsk_kill() if the timer
> > was already armed & has been deleted, so there's no risk of calling it
> > twice... If I got it right :-)
> >
>
> Again, I think you missed some details.
>
> I am OOO for a few days, I do not have time to elaborate.
>
> You will need to properly track active timer by elevating
> tw->tw_refcnt, or I guarantee something wrong will happen.
>

Gotcha, let me dig into this then!


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] tcp/dcpp: Un-pin tw_timer
  2023-11-23 16:32       ` Eric Dumazet
  2023-11-23 16:45         ` Valentin Schneider
@ 2023-12-08 14:47         ` Valentin Schneider
  2024-01-12  9:08         ` Paolo Abeni
  2 siblings, 0 replies; 11+ messages in thread
From: Valentin Schneider @ 2023-12-08 14:47 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: dccp, netdev, linux-kernel, linux-rt-users, David S. Miller,
	Jakub Kicinski, Paolo Abeni, David Ahern, Juri Lelli,
	Tomas Glozar, Sebastian Andrzej Siewior, Thomas Gleixner

On 23/11/23 17:32, Eric Dumazet wrote:
> Again, I think you missed some details.
>
> I am OOO for a few days, I do not have time to elaborate.
>
> You will need to properly track active timer by elevating
> tw->tw_refcnt, or I guarantee something wrong will happen.

I apologize if I'm being thick skulled, I've been staring at the code and
tracing on live systems and I still can't see the issue with refcounts.

The patch has the hashdance set the refcount to 4:
         * - one reference for bhash chain.
         * - one reference for ehash chain.
         * - one reference for timer.
         * - one reference for ourself (our caller will release it).

AFAICT, finding & using the socket via the ehash table comes with a refcount_inc
(e.g. __inet_lookup_established()).

Worst case, the lookup happens after programming the timer, and we get a
inet_twsk_deschedule_put(). This reduces the refcount by:

  3 via inet_twsk_kill():
   1 (sk_nulls_del_node_init_rcu())
   1 (inet_twsk_bind_unhash())
   1 (inet_twsk_put())
  1 via inet_twsk_put()

IOW 4 total. So we can have:

  tcp_time_wait()
    inet_twsk_hashdance();  // refcount = 4
    inet_twsk_schedule();   // timer armed

                            tcp_v4_rcv()
                              sk = __inet_lookup_skb(); // refcount = 5 (+1)
                              inet_twsk_deschedule_put(inet_twsk(sk));
                                inet_twsk_kill(tw) // refcount = 2 (-3)
                                inet_twsk_put(tw)  // refcount = 1 (-1)

    inet_twsk_put(tw) // refcount = 0 (-1)

__inet_hash_connect() can invoke inet_twsk_bind_unhash() by itself before
calling inet_twsk_deschedule_put(), but that just means it won't be done by
the latter, so the total refcount delta remains the same.

Thinking about it differently, this would mean that currently (without the
patch) another CPU can bring the refcount to 0 without disarming the timer,
because the patch is making the initial value one higher.

What am I missing?


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] tcp/dcpp: Un-pin tw_timer
  2023-11-23 16:32       ` Eric Dumazet
  2023-11-23 16:45         ` Valentin Schneider
  2023-12-08 14:47         ` Valentin Schneider
@ 2024-01-12  9:08         ` Paolo Abeni
  2 siblings, 0 replies; 11+ messages in thread
From: Paolo Abeni @ 2024-01-12  9:08 UTC (permalink / raw)
  To: Eric Dumazet, Valentin Schneider
  Cc: dccp, netdev, linux-kernel, linux-rt-users, David S. Miller,
	Jakub Kicinski, David Ahern, Juri Lelli, Tomas Glozar,
	Sebastian Andrzej Siewior, Thomas Gleixner

Hi,

On Thu, 2023-11-23 at 17:32 +0100, Eric Dumazet wrote:
> On Thu, Nov 23, 2023 at 3:34 PM Valentin Schneider <vschneid@redhat.com> wrote:
> > So AFAICT, after we go through the hashdance, there's a reference on
> > tw_refcnt held by the tw_timer.
> > inet_twsk_deschedule_put() can race with arming the timer, but it only
> > calls inet_twsk_kill() if the timer
> > was already armed & has been deleted, so there's no risk of calling it
> > twice... If I got it right :-)
> 
> Again, I think you missed some details.
> 
> I am OOO for a few days, I do not have time to elaborate.
> 
> You will need to properly track active timer by elevating
> tw->tw_refcnt, or I guarantee something wrong will happen.

I'm sorry to bring this up again, but I tried to understand what is
missing in Valentin's patch and I could not find it.

Direct link to the patch, just in case the thread has been lost:
https://lore.kernel.org/lkml/20231115210509.481514-2-vschneid@redhat.com/

The patch raises the initial tw->tw_refcnt to 4, so it tracks (in
advance) the reference for the tw_timer. AFAICS the patch is still
prone to the race you mentioned on the RFC:

CPU0:

   allocates a tw, insert it in hash table

CPU1:
   finds the TW and removes it (timer cancel does nothing)

CPU0:
   arms a TW timer, lasting

but I understood such race is acceptable.

Could you please shed some light? 

Many thanks,

Paolo


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2024-01-12  9:08 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-11-15 21:05 [PATCH v2 0/2] tcp/dcpp: tw_timer tweaks for nohz_full and PREEMPT_RT Valentin Schneider
2023-11-15 21:05 ` [PATCH v2 1/2] tcp/dcpp: Un-pin tw_timer Valentin Schneider
2023-11-20 17:55   ` Eric Dumazet
2023-11-23 14:34     ` Valentin Schneider
2023-11-23 16:32       ` Eric Dumazet
2023-11-23 16:45         ` Valentin Schneider
2023-12-08 14:47         ` Valentin Schneider
2024-01-12  9:08         ` Paolo Abeni
2023-11-15 21:05 ` [PATCH v2 2/2] tcp/dcpp: Don't disable bh around timewait_sock initialization Valentin Schneider
2023-11-20 15:23   ` kernel test robot
2023-11-23 15:26     ` Valentin Schneider

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).