* Soft lockup in inet_put_port on 4.6
@ 2016-12-06 23:06 Tom Herbert
  2016-12-08 21:03 ` Hannes Frederic Sowa
  0 siblings, 1 reply; 32+ messages in thread
From: Tom Herbert @ 2016-12-06 23:06 UTC (permalink / raw)
  To: Linux Kernel Network Developers, Josef Bacik

Hello,

We are seeing a fair number of machines running into a soft lockup on
the 4.6 kernel. As near as I can tell this is happening on the spinlock
in the bind hash bucket. When inet_csk_get_port exits and does
spin_unlock_bh, the TCP timer runs and we hit the lockup in
inet_put_port (presumably on the same lock). It seems like the lock
isn't properly being unlocked somewhere, but I don't readily see it.
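
To illustrate, the sequence I think we're hitting looks roughly like
this (hand-written sketch only, not the exact kernel code):

    /* Process context, inet_csk_get_port() called from listen(): */
    spin_lock_bh(&head->lock);      /* bind hash bucket lock, BH disabled */
    /* ... walk tb->owners checking for conflicts ... */
    spin_unlock_bh(&head->lock);    /* re-enables BH, so any pending TCP
                                     * timer softirq runs right here */

    /* Softirq on the same CPU, per the stack below:
     * tcp_write_timer -> tcp_done -> tcp_set_state -> inet_put_port() */
    spin_lock(&head->lock);         /* where the watchdog sees us spinning */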

Any ideas?

Thanks,
Tom

NMI watchdog: BUG: soft lockup - CPU#22 stuck for 22s! [proxygend:4152094]
Modules linked in: fuse nf_log_ipv6 ip6t_REJECT nf_reject_ipv6
nf_log_ipv4 nf_log_common xt_LOG ipt_REJECT nf_reject_ipv4 xt_limit
xt_multiport ipip ip_tunnel tunnel4 ip6_tunnel tunnel6 coretemp mptctl
mptbase cls_bpf ipmi_watchdog tcp_diag inet_diag ip6table_filter
xt_NFLOG nfnetlink_log xt_comment xt_statistic iptable_filter xt_mark
tpm_crb i2c_piix4 dm_crypt loop ipmi_devintf acpi_cpufreq iTCO_wdt
iTCO_vendor_support ipmi_si ipmi_msghandler efivars i2c_i801 sg
lpc_ich mfd_core hpilo xhci_pci xhci_hcd button nvme nvme_core
CPU: 22 PID: 4152094 Comm: proxygend Tainted: G W L
4.6.7-13_fbk3_1119_g367d67b #13
Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 12/27/2015
task: ffff88168c52d100 ti: ffff881c12fb0000 task.ti: ffff881c12fb0000
RIP: 0010:[<ffffffff810b87b8>] [<ffffffff810b87b8>]
queued_spin_lock_slowpath+0xf8/0x170
RSP: 0018:ffff883fff303da0 EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffff881257163e00 RCX: 0000000000000001
RDX: ffff883fff375e40 RSI: 00000000005c0000 RDI: ffffc90018d6bae0
RBP: ffff883fff303da0 R08: ffff883fff315e40 R09: 0000000000000000
R10: 0000000000000020 R11: 00000000000001c0 R12: ffffc90018d6bae0
R13: ffffffff820f8a80 R14: ffff881257163f30 R15: 0000000000000000
FS: 00007fa7bb7ff700(0000) GS:ffff883fff300000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ff4be114d90 CR3: 000000243f99c000 CR4: 00000000003406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Stack: ffff883fff303db0 ffffffff817e5910 ffff883fff303dd8 ffffffff8172f6b4
ffff881257163e00 0000000000000007 0000000000000004 ffff883fff303e00
ffffffff81733237 ffff881257163e00 0000000000000000 ffffffff81ce7cc0
Call Trace:
<IRQ>
[<ffffffff817e5910>] _raw_spin_lock+0x20/0x30
[<ffffffff8172f6b4>] inet_put_port+0x54/0xb0
[<ffffffff81733237>] tcp_set_state+0x67/0xc0
[<ffffffff81733a33>] tcp_done+0x33/0x90
[<ffffffff81746431>] tcp_write_err+0x31/0x50
[<ffffffff81746bc9>] tcp_retransmit_timer+0x119/0x7d0
[<ffffffff81747460>] ? tcp_write_timer_handler+0x1e0/0x1e0
[<ffffffff8174730e>] tcp_write_timer_handler+0x8e/0x1e0
[<ffffffff817474c7>] tcp_write_timer+0x67/0x70
[<ffffffff810ccc35>] call_timer_fn+0x35/0x120
[<ffffffff81747460>] ? tcp_write_timer_handler+0x1e0/0x1e0
[<ffffffff810cd01c>] run_timer_softirq+0x1fc/0x2b0
[<ffffffff817e811c>] __do_softirq+0xcc/0x26c
[<ffffffff817e753c>] do_softirq_own_stack+0x1c/0x30 <EOI>
[<ffffffff8107b481>] do_softirq+0x31/0x40
[<ffffffff8107b508>] __local_bh_enable_ip+0x78/0x80
[<ffffffff817e572a>] _raw_spin_unlock_bh +0x1a/0x20
[<ffffffff81730a61>] inet_csk_get_port+0x1c1/0x5a0
[<ffffffff816c7637>] ? sock_poll+0x47/0xb0
[<ffffffff817313f5>] inet_csk_listen_start+0x65/0xc0
[<ffffffff8175ea8c>] inet_listen+0x9c/0xe0
[<ffffffff816c8560>] SyS_listen+0x80/0x90
[<ffffffff817e5adb>] entry_SYSCALL_64_fastpath+0x13/0x8f
Code: c1 ea 0c 83 e8 01 83 e2 30 48 98 48 81 c2 40 5e 01 00 48 03 14
c5 c0 d4 d1 81 4c 89 02 41 8b 40 08 85 c0 75 0a f3 90 41 8b 40 08 <85>
c0 74 f6 4d 8b 08 4d 85 c9 74 08 41 0f 0d 09 eb 02 f3 90 8b

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Soft lockup in inet_put_port on 4.6
  2016-12-06 23:06 Soft lockup in inet_put_port on 4.6 Tom Herbert
@ 2016-12-08 21:03 ` Hannes Frederic Sowa
  2016-12-08 21:36   ` Josef Bacik
  0 siblings, 1 reply; 32+ messages in thread
From: Hannes Frederic Sowa @ 2016-12-08 21:03 UTC (permalink / raw)
  To: Tom Herbert, Linux Kernel Network Developers, Josef Bacik

Hello Tom,

On Wed, Dec 7, 2016, at 00:06, Tom Herbert wrote:
> We are seeing a fair number of machines running into a soft lockup on
> the 4.6 kernel. As near as I can tell this is happening on the spinlock
> in the bind hash bucket. When inet_csk_get_port exits and does
> spin_unlock_bh, the TCP timer runs and we hit the lockup in
> inet_put_port (presumably on the same lock). It seems like the lock
> isn't properly being unlocked somewhere, but I don't readily see it.
> 
> Any ideas?

Likewise we received reports that pretty much look the same on our
heavily patched kernel. Did you have a chance to investigate or
reproduce the problem?

I am wondering if you would be able to take a complete thread stack dump
if you can reproduce this, to check whether one of the user space
processes is looping while trying to find a free port?

Thanks,
Hannes

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Soft lockup in inet_put_port on 4.6
  2016-12-08 21:03 ` Hannes Frederic Sowa
@ 2016-12-08 21:36   ` Josef Bacik
  2016-12-09  0:30     ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Josef Bacik @ 2016-12-08 21:36 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: Tom Herbert, Linux Kernel Network Developers

On Thu, Dec 8, 2016 at 4:03 PM, Hannes Frederic Sowa 
<hannes@stressinduktion.org> wrote:
> Hello Tom,
> 
> On Wed, Dec 7, 2016, at 00:06, Tom Herbert wrote:
>>  We are seeing a fair number of machines running into a soft lockup on
>>  the 4.6 kernel. As near as I can tell this is happening on the spinlock
>>  in the bind hash bucket. When inet_csk_get_port exits and does
>>  spin_unlock_bh, the TCP timer runs and we hit the lockup in
>>  inet_put_port (presumably on the same lock). It seems like the lock
>>  isn't properly being unlocked somewhere, but I don't readily see it.
>> 
>>  Any ideas?
> 
> Likewise we received reports that pretty much look the same on our
> heavily patched kernel. Did you have a chance to investigate or
> reproduce the problem?
> 
> I am wondering if you would be able to take a complete thread stack dump
> if you can reproduce this, to check whether one of the user space
> processes is looping while trying to find a free port?

We can reproduce the problem at will, still trying to run down the 
problem.  I'll try and find one of the boxes that dumped a core and get 
a bt of everybody.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Soft lockup in inet_put_port on 4.6
  2016-12-08 21:36   ` Josef Bacik
@ 2016-12-09  0:30     ` Eric Dumazet
  2016-12-09  1:01       ` Josef Bacik
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2016-12-09  0:30 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Hannes Frederic Sowa, Tom Herbert, Linux Kernel Network Developers

On Thu, 2016-12-08 at 16:36 -0500, Josef Bacik wrote:

> We can reproduce the problem at will, still trying to run down the 
> problem.  I'll try and find one of the boxes that dumped a core and get 
> a bt of everybody.  Thanks,

OK, sounds good.

I had a look and :
- could not spot a fix that came after 4.6. 
- could not spot an obvious bug.

Anything special in the program triggering the issue ?
SO_REUSEPORT and/or special socket options ?

Thanks.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Soft lockup in inet_put_port on 4.6
  2016-12-09  0:30     ` Eric Dumazet
@ 2016-12-09  1:01       ` Josef Bacik
  2016-12-10  1:59         ` Josef Bacik
  0 siblings, 1 reply; 32+ messages in thread
From: Josef Bacik @ 2016-12-09  1:01 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Hannes Frederic Sowa, Tom Herbert, Linux Kernel Network Developers


> On Dec 8, 2016, at 7:32 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
>> On Thu, 2016-12-08 at 16:36 -0500, Josef Bacik wrote:
>> 
>> We can reproduce the problem at will, still trying to run down the 
>> problem.  I'll try and find one of the boxes that dumped a core and get 
>> a bt of everybody.  Thanks,
> 
> OK, sounds good.
> 
> I had a look and :
> - could not spot a fix that came after 4.6. 
> - could not spot an obvious bug.
> 
> Anything special in the program triggering the issue ?
> SO_REUSEPORT and/or special socket options ?
> 

So they recently started using SO_REUSEPORT; that's what triggered it. If they don't use it then everything is fine.

I added some instrumentation for get_port to see if it was looping in there and none of my printk's triggered.  The softlockup messages are always on the inet_bind_bucket lock, sometimes in the process context in get_port or in the softirq context either through inet_put_port or inet_kill_twsk.  On the box that I have a coredump for there's only one processor in the inet code so I'm not sure what to make of that.  That was a box from last week so I'll look at a more recent core and see if it's different.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Soft lockup in inet_put_port on 4.6
  2016-12-09  1:01       ` Josef Bacik
@ 2016-12-10  1:59         ` Josef Bacik
  2016-12-10  3:47           ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Josef Bacik @ 2016-12-10  1:59 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Hannes Frederic Sowa, Tom Herbert, Linux Kernel Network Developers

On Thu, Dec 8, 2016 at 8:01 PM, Josef Bacik <jbacik@fb.com> wrote:
> 
>>  On Dec 8, 2016, at 7:32 PM, Eric Dumazet <eric.dumazet@gmail.com> 
>> wrote:
>> 
>>>  On Thu, 2016-12-08 at 16:36 -0500, Josef Bacik wrote:
>>> 
>>>  We can reproduce the problem at will, still trying to run down the
>>>  problem.  I'll try and find one of the boxes that dumped a core 
>>> and get
>>>  a bt of everybody.  Thanks,
>> 
>>  OK, sounds good.
>> 
>>  I had a look and :
>>  - could not spot a fix that came after 4.6.
>>  - could not spot an obvious bug.
>> 
>>  Anything special in the program triggering the issue ?
>>  SO_REUSEPORT and/or special socket options ?
>> 
> 
> So they recently started using SO_REUSEPORT, that's what triggered 
> it, if they don't use it then everything is fine.
> 
> I added some instrumentation for get_port to see if it was looping in 
> there and none of my printk's triggered.  The softlockup messages are 
> always on the inet_bind_bucket lock, sometimes in the process context 
> in get_port or in the softirq context either through inet_put_port or 
> inet_kill_twsk.  On the box that I have a coredump for there's only 
> one processor in the inet code so I'm not sure what to make of that.  
> That was a box from last week so I'll look at a more recent core and 
> see if it's different.  Thanks,

Ok more investigation today, a few bullet points

- With all the debugging turned on the boxes seem to recover after 
about a minute.  I'd get the spam of the soft lockup messages all on 
the inet_bind_bucket, and then the box would be fine.
- I looked at a core I had from before I started investigating things 
and there's only one process trying to get the inet_bind_bucket of all 
the 48 cpus.
- I noticed that there was over 100k twsk's in that original core.
- I put a global counter of the twsk's (since most of the softlockup 
messages have the twsk timers in the stack) and noticed with the 
debugging kernel it started around 16k twsk's and once it recovered it 
was down to less than a thousand.  There's a jump where it goes from 8k 
to 2k and then there's only one more softlockup message and the box is 
fine.
- This happens when we restart the service with the config option to 
start using SO_REUSEPORT.

The application is our load balancing app, so obviously has lots of 
connections opened at any given time.  What I'm wondering and will test 
on Monday is if the SO_REUSEPORT change even matters, or if simply 
restarting the service is what triggers the problem.  One thing I 
forgot to mention is that it's also using TCP_FASTOPEN in both the 
non-reuseport and reuseport variants.
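
For reference, the listener setup is roughly this shape (a hypothetical
minimal example, not our actual application code; the TFO queue length
is made up and error handling is omitted):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    static int make_listener(unsigned short port, int use_reuseport)
    {
            struct sockaddr_in addr = {
                    .sin_family      = AF_INET,
                    .sin_port        = htons(port),
                    .sin_addr.s_addr = htonl(INADDR_ANY),
            };
            int one = 1, qlen = 256;        /* TFO queue length: made up */
            int fd = socket(AF_INET, SOCK_STREAM, 0);

            if (use_reuseport)              /* the new config option */
                    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT,
                               &one, sizeof(one));
            setsockopt(fd, IPPROTO_TCP, TCP_FASTOPEN, &qlen, sizeof(qlen));
            bind(fd, (struct sockaddr *)&addr, sizeof(addr));
            listen(fd, SOMAXCONN);
            return fd;
    }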

What I suspect is happening is the service stops, all of the sockets it 
had open go into TIMEWAIT with relatively the same timer period, and 
then suddenly all wake up at the same time which coupled with the 
massive amount of traffic that we see per box anyway results in so much 
contention and ksoftirqd usage that the box livelocks for a while.  
With the lock debugging and stuff turned on we aren't able to service 
as much traffic so it recovers relatively quickly, whereas a normal 
production kernel never recovers.

Please keep in mind that I'm a file system developer so my conclusions 
may be completely insane; any guidance would be welcome.  I'll continue 
hammering on this on Monday.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Soft lockup in inet_put_port on 4.6
  2016-12-10  1:59         ` Josef Bacik
@ 2016-12-10  3:47           ` Eric Dumazet
  2016-12-10  4:14             ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2016-12-10  3:47 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Hannes Frederic Sowa, Tom Herbert, Linux Kernel Network Developers

On Fri, 2016-12-09 at 20:59 -0500, Josef Bacik wrote:
> On Thu, Dec 8, 2016 at 8:01 PM, Josef Bacik <jbacik@fb.com> wrote:
> > 
> >>  On Dec 8, 2016, at 7:32 PM, Eric Dumazet <eric.dumazet@gmail.com> 
> >> wrote:
> >> 
> >>>  On Thu, 2016-12-08 at 16:36 -0500, Josef Bacik wrote:
> >>> 
> >>>  We can reproduce the problem at will, still trying to run down the
> >>>  problem.  I'll try and find one of the boxes that dumped a core 
> >>> and get
> >>>  a bt of everybody.  Thanks,
> >> 
> >>  OK, sounds good.
> >> 
> >>  I had a look and :
> >>  - could not spot a fix that came after 4.6.
> >>  - could not spot an obvious bug.
> >> 
> >>  Anything special in the program triggering the issue ?
> >>  SO_REUSEPORT and/or special socket options ?
> >> 
> > 
> > So they recently started using SO_REUSEPORT, that's what triggered 
> > it, if they don't use it then everything is fine.
> > 
> > I added some instrumentation for get_port to see if it was looping in 
> > there and none of my printk's triggered.  The softlockup messages are 
> > always on the inet_bind_bucket lock, sometimes in the process context 
> > in get_port or in the softirq context either through inet_put_port or 
> > inet_kill_twsk.  On the box that I have a coredump for there's only 
> > one processor in the inet code so I'm not sure what to make of that.  
> > That was a box from last week so I'll look at a more recent core and 
> > see if it's different.  Thanks,
> 
> Ok more investigation today, a few bullet points
> 
> - With all the debugging turned on the boxes seem to recover after 
> about a minute.  I'd get the spam of the soft lockup messages all on 
> the inet_bind_bucket, and then the box would be fine.
> - I looked at a core I had from before I started investigating things 
> and there's only one process trying to get the inet_bind_bucket of all 
> the 48 cpus.
> - I noticed that there was over 100k twsk's in that original core.
> - I put a global counter of the twsk's (since most of the softlockup 
> messages have the twsk timers in the stack) and noticed with the 
> debugging kernel it started around 16k twsk's and once it recovered it 
> was down to less than a thousand.  There's a jump where it goes from 8k 
> to 2k and then there's only one more softlockup message and the box is 
> fine.
> - This happens when we restart the service with the config option to 
> start using SO_REUSEPORT.
> 
> The application is our load balancing app, so obviously has lots of 
> connections opened at any given time.  What I'm wondering and will test 
> on Monday is if the SO_REUSEPORT change even matters, or if simply 
> restarting the service is what triggers the problem.  One thing I 
> forgot to mention is that it's also using TCP_FASTOPEN in both the 
> non-reuseport and reuseport variants.
> 
> What I suspect is happening is the service stops, all of the sockets it 
> had open go into TIMEWAIT with relatively the same timer period, and 
> then suddenly all wake up at the same time which coupled with the 
> massive amount of traffic that we see per box anyway results in so much 
> contention and ksoftirqd usage that the box livelocks for a while.  
> With the lock debugging and stuff turned on we aren't able to service 
> as much traffic so it recovers relatively quickly, whereas a normal 
> production kernel never recovers.
> 
> Please keep in mind that I'm a file system developer so my conclusions 
> may be completely insane; any guidance would be welcome.  I'll continue 
> hammering on this on Monday.  Thanks,

Hmm... Does your ephemeral port range include the port your load
balancing app is using ?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Soft lockup in inet_put_port on 4.6
  2016-12-10  3:47           ` Eric Dumazet
@ 2016-12-10  4:14             ` Eric Dumazet
  2016-12-12 18:05               ` Josef Bacik
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2016-12-10  4:14 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Hannes Frederic Sowa, Tom Herbert, Linux Kernel Network Developers

On Fri, 2016-12-09 at 19:47 -0800, Eric Dumazet wrote:

> 
> Hmm... Does your ephemeral port range include the port your load
> balancing app is using ?

I suspect that you might have processes doing bind( port = 0) that are
trapped into the bind_conflict() scan ?

With 100,000 + timewaits there, this possibly hurts.
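
To be clear, by bind( port = 0) I mean a plain ephemeral-port bind,
something like this (illustrative userspace snippet, not taken from
your application):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    struct sockaddr_in a = {
            .sin_family      = AF_INET,
            .sin_port        = htons(0),    /* 0 = kernel picks the port */
            .sin_addr.s_addr = htonl(INADDR_ANY),
    };
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    /* With port 0 the kernel has to scan the bind buckets for a usable
     * port, running the bind_conflict() check against the owners of
     * each candidate bucket. */
    bind(fd, (struct sockaddr *)&a, sizeof(a));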

Can you try the following loop breaker ?

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index d5d3ead0a6c31e42e8843d30f8c643324a91b8e9..74f0f5ee6a02c624edb0263b9ddd27813f68d0a5 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -51,7 +51,7 @@ int inet_csk_bind_conflict(const struct sock *sk,
 	int reuse = sk->sk_reuse;
 	int reuseport = sk->sk_reuseport;
 	kuid_t uid = sock_i_uid((struct sock *)sk);
-
+	unsigned int max_count;
 	/*
 	 * Unlike other sk lookup places we do not check
 	 * for sk_net here, since _all_ the socks listed
@@ -59,6 +59,7 @@ int inet_csk_bind_conflict(const struct sock *sk,
 	 * one this bucket belongs to.
 	 */
 
+	max_count = relax ? ~0U : 100;
 	sk_for_each_bound(sk2, &tb->owners) {
 		if (sk != sk2 &&
 		    !inet_v6_ipv6only(sk2) &&
@@ -84,6 +85,8 @@ int inet_csk_bind_conflict(const struct sock *sk,
 					break;
 			}
 		}
+		if (--max_count == 0)
+			return 1;
 	}
 	return sk2 != NULL;
 }
diff --git a/net/ipv6/inet6_connection_sock.c b/net/ipv6/inet6_connection_sock.c
index 1c86c478f578b49373e61a4c397f23f3dc7f3fc6..4f63d06e0d601da94eb3f2b35a988abd060e156c 100644
--- a/net/ipv6/inet6_connection_sock.c
+++ b/net/ipv6/inet6_connection_sock.c
@@ -35,12 +35,14 @@ int inet6_csk_bind_conflict(const struct sock *sk,
 	int reuse = sk->sk_reuse;
 	int reuseport = sk->sk_reuseport;
 	kuid_t uid = sock_i_uid((struct sock *)sk);
+	unsigned int max_count;
 
 	/* We must walk the whole port owner list in this case. -DaveM */
 	/*
 	 * See comment in inet_csk_bind_conflict about sock lookup
 	 * vs net namespaces issues.
 	 */
+	max_count = relax ? ~0U : 100;
 	sk_for_each_bound(sk2, &tb->owners) {
 		if (sk != sk2 &&
 		    (!sk->sk_bound_dev_if ||
@@ -61,6 +63,8 @@ int inet6_csk_bind_conflict(const struct sock *sk,
 			    ipv6_rcv_saddr_equal(sk, sk2, true))
 				break;
 		}
+		if (--max_count == 0)
+			return 1;
 	}
 
 	return sk2 != NULL;

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: Soft lockup in inet_put_port on 4.6
  2016-12-10  4:14             ` Eric Dumazet
@ 2016-12-12 18:05               ` Josef Bacik
  2016-12-12 18:44                 ` Hannes Frederic Sowa
  0 siblings, 1 reply; 32+ messages in thread
From: Josef Bacik @ 2016-12-12 18:05 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Hannes Frederic Sowa, Tom Herbert, Linux Kernel Network Developers

On Fri, Dec 9, 2016 at 11:14 PM, Eric Dumazet <eric.dumazet@gmail.com> 
wrote:
> On Fri, 2016-12-09 at 19:47 -0800, Eric Dumazet wrote:
> 
>> 
>>  Hmm... Does your ephemeral port range include the port your load
>>  balancing app is using ?
> 
> I suspect that you might have processes doing bind( port = 0) that are
> trapped into the bind_conflict() scan ?
> 
> With 100,000 + timewaits there, this possibly hurts.
> 
> Can you try the following loop breaker ?

It doesn't appear that the app is doing bind(port = 0) during normal 
operation.  I tested this patch and it made no difference.  I'm going 
to test simply restarting the app without changing to the SO_REUSEPORT 
option.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Soft lockup in inet_put_port on 4.6
  2016-12-12 18:05               ` Josef Bacik
@ 2016-12-12 18:44                 ` Hannes Frederic Sowa
  2016-12-12 21:23                   ` Josef Bacik
  2016-12-12 22:24                   ` Josef Bacik
  0 siblings, 2 replies; 32+ messages in thread
From: Hannes Frederic Sowa @ 2016-12-12 18:44 UTC (permalink / raw)
  To: Josef Bacik, Eric Dumazet; +Cc: Tom Herbert, Linux Kernel Network Developers

On 12.12.2016 19:05, Josef Bacik wrote:
> On Fri, Dec 9, 2016 at 11:14 PM, Eric Dumazet <eric.dumazet@gmail.com>
> wrote:
>> On Fri, 2016-12-09 at 19:47 -0800, Eric Dumazet wrote:
>>
>>>
>>>  Hmm... Does your ephemeral port range include the port your load
>>>  balancing app is using ?
>>
>> I suspect that you might have processes doing bind( port = 0) that are
>> trapped into the bind_conflict() scan ?
>>
>> With 100,000 + timewaits there, this possibly hurts.
>>
>> Can you try the following loop breaker ?
> 
> It doesn't appear that the app is doing bind(port = 0) during normal
> operation.  I tested this patch and it made no difference.  I'm going to
> test simply restarting the app without changing to the SO_REUSEPORT
> option.  Thanks,

Would it be possible to trace the time the function uses with trace? If
we don't see the number growing considerably over time we probably can
rule out that we loop somewhere in there (I would instrument
inet_csk_bind_conflict, __inet_hash_connect and inet_csk_get_port).

__inet_hash_connect -> __inet_check_established also takes a lock
(inet_ehash_lockp) which can be locked from inet_diag code path during
socket diag info dumping.
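
For reference, both paths end up on the same per-bucket ehash lock,
roughly like this (simplified sketch from memory, not the exact code):

    /* connect() path: __inet_hash_connect() -> __inet_check_established() */
    spinlock_t *lock = inet_ehash_lockp(hinfo, hash);
    spin_lock(lock);
    /* ... look for an established 4-tuple collision ... */
    spin_unlock(lock);

    /* dump path: inet_diag walking the established hash (e.g. an ss dump) */
    spin_lock_bh(inet_ehash_lockp(hashinfo, slot));
    /* ... dump every socket in this bucket ... */
    spin_unlock_bh(inet_ehash_lockp(hashinfo, slot));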

Unfortunately we couldn't reproduce it so far. :/

Thanks,
Hannes

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Soft lockup in inet_put_port on 4.6
  2016-12-12 18:44                 ` Hannes Frederic Sowa
@ 2016-12-12 21:23                   ` Josef Bacik
  2016-12-12 22:24                   ` Josef Bacik
  1 sibling, 0 replies; 32+ messages in thread
From: Josef Bacik @ 2016-12-12 21:23 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Eric Dumazet, Tom Herbert, Linux Kernel Network Developers

On Mon, Dec 12, 2016 at 1:44 PM, Hannes Frederic Sowa 
<hannes@stressinduktion.org> wrote:
> On 12.12.2016 19:05, Josef Bacik wrote:
>>  On Fri, Dec 9, 2016 at 11:14 PM, Eric Dumazet 
>> <eric.dumazet@gmail.com>
>>  wrote:
>>>  On Fri, 2016-12-09 at 19:47 -0800, Eric Dumazet wrote:
>>> 
>>>> 
>>>>   Hmm... Does your ephemeral port range include the port your load
>>>>   balancing app is using ?
>>> 
>>>  I suspect that you might have processes doing bind( port = 0) that 
>>> are
>>>  trapped into the bind_conflict() scan ?
>>> 
>>>  With 100,000 + timewaits there, this possibly hurts.
>>> 
>>>  Can you try the following loop breaker ?
>> 
>>  It doesn't appear that the app is doing bind(port = 0) during normal
>>  operation.  I tested this patch and it made no difference.  I'm 
>> going to
>>  test simply restarting the app without changing to the SO_REUSEPORT
>>  option.  Thanks,
> 
> Would it be possible to trace the time the function uses with trace? 
> If
> we don't see the number growing considerably over time we probably can
> rule out that we loop somewhere in there (I would instrument
> inet_csk_bind_conflict, __inet_hash_connect and inet_csk_get_port).
> 
> __inet_hash_connect -> __inet_check_established also takes a lock
> (inet_ehash_lockp) which can be locked from inet_diag code path during
> socket diag info dumping.
> 
> Unfortunately we couldn't reproduce it so far. :/

Working on getting the timing info, will probably be tomorrow due to 
meetings.  I did test simply restarting the app without changing to the 
config that enabled the use of SO_REUSEPORT and the problem didn't 
occur, so it definitely has something to do with SO_REUSEPORT.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Soft lockup in inet_put_port on 4.6
  2016-12-12 18:44                 ` Hannes Frederic Sowa
  2016-12-12 21:23                   ` Josef Bacik
@ 2016-12-12 22:24                   ` Josef Bacik
  2016-12-13 20:51                     ` Tom Herbert
  1 sibling, 1 reply; 32+ messages in thread
From: Josef Bacik @ 2016-12-12 22:24 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Eric Dumazet, Tom Herbert, Linux Kernel Network Developers, Josef Bacik


On Mon, Dec 12, 2016 at 1:44 PM, Hannes Frederic Sowa 
<hannes@stressinduktion.org> wrote:
> On 12.12.2016 19:05, Josef Bacik wrote:
>>  On Fri, Dec 9, 2016 at 11:14 PM, Eric Dumazet 
>> <eric.dumazet@gmail.com>
>>  wrote:
>>>  On Fri, 2016-12-09 at 19:47 -0800, Eric Dumazet wrote:
>>> 
>>>> 
>>>>   Hmm... Does your ephemeral port range include the port your load
>>>>   balancing app is using ?
>>> 
>>>  I suspect that you might have processes doing bind( port = 0) that 
>>> are
>>>  trapped into the bind_conflict() scan ?
>>> 
>>>  With 100,000 + timewaits there, this possibly hurts.
>>> 
>>>  Can you try the following loop breaker ?
>> 
>>  It doesn't appear that the app is doing bind(port = 0) during normal
>>  operation.  I tested this patch and it made no difference.  I'm 
>> going to
>>  test simply restarting the app without changing to the SO_REUSEPORT
>>  option.  Thanks,
> 
> Would it be possible to trace the time the function uses with trace? 
> If
> we don't see the number growing considerably over time we probably can
> rule out that we loop somewhere in there (I would instrument
> inet_csk_bind_conflict, __inet_hash_connect and inet_csk_get_port).
> 
> __inet_hash_connect -> __inet_check_established also takes a lock
> (inet_ehash_lockp) which can be locked from inet_diag code path during
> socket diag info dumping.
> 
> Unfortunately we couldn't reproduce it so far. :/

So I had a bcc script running to time how long we spent in 
inet_csk_bind_conflict, __inet_hash_connect and inet_csk_get_port, but 
of course I'm an idiot and didn't actually separate out the stats so I 
could tell _which_ one was taking forever.  But anyway here's a normal 
distribution on the box

     Some shit           : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 74       |                                        |
      2048 -> 4095       : 10537    |****************************************|
      4096 -> 8191       : 8497     |********************************        |
      8192 -> 16383      : 3745     |**************                          |
     16384 -> 32767      : 300      |*                                       |
     32768 -> 65535      : 250      |                                        |
     65536 -> 131071     : 180      |                                        |
    131072 -> 262143     : 71       |                                        |
    262144 -> 524287     : 18       |                                        |
    524288 -> 1048575    : 5        |                                        |

With the times in nanoseconds, and here's the distribution during the 
problem

     Some shit           : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 21       |                                        |
      2048 -> 4095       : 21820    |****************************************|
      4096 -> 8191       : 11598    |*********************                   |
      8192 -> 16383      : 4337     |*******                                 |
     16384 -> 32767      : 290      |                                        |
     32768 -> 65535      : 59       |                                        |
     65536 -> 131071     : 23       |                                        |
    131072 -> 262143     : 12       |                                        |
    262144 -> 524287     : 6        |                                        |
    524288 -> 1048575    : 19       |                                        |
   1048576 -> 2097151    : 1079     |*                                       |
   2097152 -> 4194303    : 0        |                                        |
   4194304 -> 8388607    : 1        |                                        |
   8388608 -> 16777215   : 0        |                                        |
  16777216 -> 33554431   : 0        |                                        |
  33554432 -> 67108863   : 1192     |**                                      |
               Some shit                     : count     distribution
                   0 -> 1                    : 0        |                    |
                   2 -> 3                    : 0        |                    |
                   4 -> 7                    : 0        |                    |
                   8 -> 15                   : 0        |                    |
                  16 -> 31                   : 0        |                    |
                  32 -> 63                   : 0        |                    |
                  64 -> 127                  : 0        |                    |
                 128 -> 255                  : 0        |                    |
                 256 -> 511                  : 0        |                    |
                 512 -> 1023                 : 0        |                    |
                1024 -> 2047                 : 48       |                    |
                2048 -> 4095                 : 14714    |********************|
                4096 -> 8191                 : 6769     |*********           |
                8192 -> 16383                : 2234     |***                 |
               16384 -> 32767                : 422      |                    |
               32768 -> 65535                : 208      |                    |
               65536 -> 131071               : 61       |                    |
              131072 -> 262143               : 10       |                    |
              262144 -> 524287               : 416      |                    |
              524288 -> 1048575              : 826      |*                   |
             1048576 -> 2097151              : 598      |                    |
             2097152 -> 4194303              : 10       |                    |
             4194304 -> 8388607              : 0        |                    |
             8388608 -> 16777215             : 1        |                    |
            16777216 -> 33554431             : 289      |                    |
            33554432 -> 67108863             : 921      |*                   |
            67108864 -> 134217727            : 74       |                    |
           134217728 -> 268435455            : 75       |                    |
           268435456 -> 536870911            : 48       |                    |
           536870912 -> 1073741823           : 25       |                    |
          1073741824 -> 2147483647           : 3        |                    |
          2147483648 -> 4294967295           : 2        |                    |
          4294967296 -> 8589934591           : 1        |                    |

As you can see we start getting tail latencies of up to 4-8 seconds.  
Tomorrow I'll separate out the stats so we can know which function is 
the problem child.  Sorry about not doing that first.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Soft lockup in inet_put_port on 4.6
  2016-12-12 22:24                   ` Josef Bacik
@ 2016-12-13 20:51                     ` Tom Herbert
  2016-12-13 23:03                       ` Craig Gallek
  0 siblings, 1 reply; 32+ messages in thread
From: Tom Herbert @ 2016-12-13 20:51 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Hannes Frederic Sowa, Eric Dumazet, Linux Kernel Network Developers

I think there may be some suspicious code in inet_csk_get_port. At
tb_found there is:

                if (((tb->fastreuse > 0 && reuse) ||
                     (tb->fastreuseport > 0 &&
                      !rcu_access_pointer(sk->sk_reuseport_cb) &&
                      sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&
                    smallest_size == -1)
                        goto success;
                if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, true)) {
                        if ((reuse ||
                             (tb->fastreuseport > 0 &&
                              sk->sk_reuseport &&
                              !rcu_access_pointer(sk->sk_reuseport_cb) &&
                              uid_eq(tb->fastuid, uid))) &&
                            smallest_size != -1 && --attempts >= 0) {
                                spin_unlock_bh(&head->lock);
                                goto again;
                        }
                        goto fail_unlock;
                }

AFAICT there is redundancy in these two conditionals.  The same clause
is being checked in both: (tb->fastreuseport > 0 &&
!rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport &&
uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is true the
first conditional should be hit, goto done,  and the second will never
evaluate that part to true-- unless the sk is changed (do we need
READ_ONCE for sk->sk_reuseport_cb?).

Another potential issue is that the goto again goes back to doing
the port scan, but if snum had been set originally that doesn't seem
like what we want.

Thanks,
Tom




On Mon, Dec 12, 2016 at 2:24 PM, Josef Bacik <jbacik@fb.com> wrote:
>
> On Mon, Dec 12, 2016 at 1:44 PM, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
>>
>> On 12.12.2016 19:05, Josef Bacik wrote:
>>>
>>>  On Fri, Dec 9, 2016 at 11:14 PM, Eric Dumazet <eric.dumazet@gmail.com>
>>>  wrote:
>>>>
>>>>  On Fri, 2016-12-09 at 19:47 -0800, Eric Dumazet wrote:
>>>>
>>>>>
>>>>>   Hmm... Does your ephemeral port range include the port your load
>>>>>   balancing app is using ?
>>>>
>>>>
>>>>  I suspect that you might have processes doing bind( port = 0) that are
>>>>  trapped into the bind_conflict() scan ?
>>>>
>>>>  With 100,000 + timewaits there, this possibly hurts.
>>>>
>>>>  Can you try the following loop breaker ?
>>>
>>>
>>>  It doesn't appear that the app is doing bind(port = 0) during normal
>>>  operation.  I tested this patch and it made no difference.  I'm going to
>>>  test simply restarting the app without changing to the SO_REUSEPORT
>>>  option.  Thanks,
>>
>>
>> Would it be possible to trace the time the function uses with trace? If
>> we don't see the number growing considerably over time we probably can
>> rule out that we loop somewhere in there (I would instrument
>> inet_csk_bind_conflict, __inet_hash_connect and inet_csk_get_port).
>>
>> __inet_hash_connect -> __inet_check_established also takes a lock
>> (inet_ehash_lockp) which can be locked from inet_diag code path during
>> socket diag info dumping.
>>
>> Unfortunately we couldn't reproduce it so far. :/
>
>
> So I had a bcc script running to time how long we spent in
> inet_csk_bind_conflict, __inet_hash_connect and inet_csk_get_port, but of
> course I'm an idiot and didn't actually separate out the stats so I could
> tell _which_ one was taking forever.  But anyway here's a normal
> distribution on the box
>
>     Some shit           : count     distribution
>         0 -> 1          : 0        |                                       |
>         2 -> 3          : 0        |                                       |
>         4 -> 7          : 0        |                                       |
>         8 -> 15         : 0        |                                       |
>        16 -> 31         : 0        |                                       |
>        32 -> 63         : 0        |                                       |
>        64 -> 127        : 0        |                                       |
>       128 -> 255        : 0        |                                       |
>       256 -> 511        : 0        |                                       |
>       512 -> 1023       : 0        |                                       |
>      1024 -> 2047       : 74       |                                       |
>      2048 -> 4095       : 10537    |****************************************|
>      4096 -> 8191       : 8497     |********************************       |
>      8192 -> 16383      : 3745     |**************                         |
>     16384 -> 32767      : 300      |*                                      |
>     32768 -> 65535      : 250      |                                       |
>     65536 -> 131071     : 180      |                                       |
>    131072 -> 262143     : 71       |                                       |
>    262144 -> 524287     : 18       |                                       |
>    524288 -> 1048575    : 5        |                                       |
>
> With the times in nanoseconds, and here's the distribution during the
> problem
>
>     Some shit           : count     distribution
>         0 -> 1          : 0        |                                       |
>         2 -> 3          : 0        |                                       |
>         4 -> 7          : 0        |                                       |
>         8 -> 15         : 0        |                                       |
>        16 -> 31         : 0        |                                       |
>        32 -> 63         : 0        |                                       |
>        64 -> 127        : 0        |                                       |
>       128 -> 255        : 0        |                                       |
>       256 -> 511        : 0        |                                       |
>       512 -> 1023       : 0        |                                       |
>      1024 -> 2047       : 21       |                                       |
>      2048 -> 4095       : 21820    |****************************************|
>      4096 -> 8191       : 11598    |*********************                  |
>      8192 -> 16383      : 4337     |*******                                |
>     16384 -> 32767      : 290      |                                       |
>     32768 -> 65535      : 59       |                                       |
>     65536 -> 131071     : 23       |                                       |
>    131072 -> 262143     : 12       |                                       |
>    262144 -> 524287     : 6        |                                       |
>    524288 -> 1048575    : 19       |                                       |
>   1048576 -> 2097151    : 1079     |*                                      |
>   2097152 -> 4194303    : 0        |                                       |
>   4194304 -> 8388607    : 1        |                                       |
>   8388608 -> 16777215   : 0        |                                       |
>  16777216 -> 33554431   : 0        |                                       |
>  33554432 -> 67108863   : 1192     |**                                     |
>               Some shit                     : count     distribution
>                   0 -> 1                    : 0        |                   |
>                   2 -> 3                    : 0        |                   |
>                   4 -> 7                    : 0        |                   |
>                   8 -> 15                   : 0        |                   |
>                  16 -> 31                   : 0        |                   |
>                  32 -> 63                   : 0        |                   |
>                  64 -> 127                  : 0        |                   |
>                 128 -> 255                  : 0        |                   |
>                 256 -> 511                  : 0        |                   |
>                 512 -> 1023                 : 0        |                   |
>                1024 -> 2047                 : 48       |                   |
>                2048 -> 4095                 : 14714    |********************|
>                4096 -> 8191                 : 6769     |*********          |
>                8192 -> 16383                : 2234     |***                |
>               16384 -> 32767                : 422      |                   |
>               32768 -> 65535                : 208      |                   |
>               65536 -> 131071               : 61       |                   |
>              131072 -> 262143               : 10       |                   |
>              262144 -> 524287               : 416      |                   |
>              524288 -> 1048575              : 826      |*                  |
>             1048576 -> 2097151              : 598      |                   |
>             2097152 -> 4194303              : 10       |                   |
>             4194304 -> 8388607              : 0        |                   |
>             8388608 -> 16777215             : 1        |                   |
>            16777216 -> 33554431             : 289      |                   |
>            33554432 -> 67108863             : 921      |*                  |
>            67108864 -> 134217727            : 74       |                   |
>           134217728 -> 268435455            : 75       |                   |
>           268435456 -> 536870911            : 48       |                   |
>           536870912 -> 1073741823           : 25       |                   |
>          1073741824 -> 2147483647           : 3        |                   |
>          2147483648 -> 4294967295           : 2        |                   |
>          4294967296 -> 8589934591           : 1        |                   |
>
> As you can see we start getting tail latencies of up to 4-8 seconds.
> Tomorrow I'll separate out the stats so we can know which function is the
> problem child.  Sorry about not doing that first.  Thanks,
>
> Josef
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Soft lockup in inet_put_port on 4.6
  2016-12-13 20:51                     ` Tom Herbert
@ 2016-12-13 23:03                       ` Craig Gallek
  2016-12-13 23:32                         ` Tom Herbert
  0 siblings, 1 reply; 32+ messages in thread
From: Craig Gallek @ 2016-12-13 23:03 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Josef Bacik, Hannes Frederic Sowa, Eric Dumazet,
	Linux Kernel Network Developers

On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert <tom@herbertland.com> wrote:
> I think there may be some suspicious code in inet_csk_get_port. At
> tb_found there is:
>
>                 if (((tb->fastreuse > 0 && reuse) ||
>                      (tb->fastreuseport > 0 &&
>                       !rcu_access_pointer(sk->sk_reuseport_cb) &&
>                       sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&
>                     smallest_size == -1)
>                         goto success;
>                 if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, true)) {
>                         if ((reuse ||
>                              (tb->fastreuseport > 0 &&
>                               sk->sk_reuseport &&
>                               !rcu_access_pointer(sk->sk_reuseport_cb) &&
>                               uid_eq(tb->fastuid, uid))) &&
>                             smallest_size != -1 && --attempts >= 0) {
>                                 spin_unlock_bh(&head->lock);
>                                 goto again;
>                         }
>                         goto fail_unlock;
>                 }
>
> AFAICT there is redundancy in these two conditionals.  The same clause
> is being checked in both: (tb->fastreuseport > 0 &&
> !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport &&
> uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is true the
> first conditional should be hit, goto done,  and the second will never
> evaluate that part to true-- unless the sk is changed (do we need
> READ_ONCE for sk->sk_reuseport_cb?).
That's an interesting point... It looks like this function also
changed in 4.6 from using a single local_bh_disable() at the beginning
with several spin_lock(&head->lock) to exclusively
spin_lock_bh(&head->lock) at each locking point.  Perhaps the full bh
disable variant was preventing the timers in your stack trace from
running interleaved with this function before?
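
Roughly the difference I mean (simplified shapes, from memory, not the
exact before/after code):

    /* pre-4.6: BH stays disabled across the whole port search */
    local_bh_disable();
    spin_lock(&head->lock);
    /* ... */
    spin_unlock(&head->lock);
    /* ... possibly move on to another bucket, BH still off ... */
    local_bh_enable();              /* timers only get to run here */

    /* 4.6: each lock/unlock pair toggles BH on its own */
    spin_lock_bh(&head->lock);
    /* ... */
    spin_unlock_bh(&head->lock);    /* pending timers can run here,
                                     * between steps of the search */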

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Soft lockup in inet_put_port on 4.6
  2016-12-13 23:03                       ` Craig Gallek
@ 2016-12-13 23:32                         ` Tom Herbert
  2016-12-15 18:53                           ` Josef Bacik
  0 siblings, 1 reply; 32+ messages in thread
From: Tom Herbert @ 2016-12-13 23:32 UTC (permalink / raw)
  To: Craig Gallek
  Cc: Josef Bacik, Hannes Frederic Sowa, Eric Dumazet,
	Linux Kernel Network Developers

On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek <kraigatgoog@gmail.com> wrote:
> On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert <tom@herbertland.com> wrote:
>> I think there may be some suspicious code in inet_csk_get_port. At
>> tb_found there is:
>>
>>                 if (((tb->fastreuse > 0 && reuse) ||
>>                      (tb->fastreuseport > 0 &&
>>                       !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>                       sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&
>>                     smallest_size == -1)
>>                         goto success;
>>                 if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, true)) {
>>                         if ((reuse ||
>>                              (tb->fastreuseport > 0 &&
>>                               sk->sk_reuseport &&
>>                               !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>                               uid_eq(tb->fastuid, uid))) &&
>>                             smallest_size != -1 && --attempts >= 0) {
>>                                 spin_unlock_bh(&head->lock);
>>                                 goto again;
>>                         }
>>                         goto fail_unlock;
>>                 }
>>
>> AFAICT there is redundancy in these two conditionals.  The same clause
>> is being checked in both: (tb->fastreuseport > 0 &&
>> !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport &&
>> uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is true the
>> first conditional should be hit, goto done,  and the second will never
>> evaluate that part to true-- unless the sk is changed (do we need
>> READ_ONCE for sk->sk_reuseport_cb?).
> That's an interesting point... It looks like this function also
> changed in 4.6 from using a single local_bh_disable() at the beginning
> with several spin_lock(&head->lock) to exclusively
> spin_lock_bh(&head->lock) at each locking point.  Perhaps the full bh
> disable variant was preventing the timers in your stack trace from
> running interleaved with this function before?

Could be, although dropping the lock shouldn't be able to affect the
search state. TBH, I'm a little lost reading this function; the
SO_REUSEPORT handling is pretty complicated. For instance,
rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in that
function and also in every call to inet_csk_bind_conflict. I wonder if
we can simplify this under the assumption that SO_REUSEPORT is only
allowed if the port number (snum) is explicitly specified.

Tom

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Soft lockup in inet_put_port on 4.6
  2016-12-13 23:32                         ` Tom Herbert
@ 2016-12-15 18:53                           ` Josef Bacik
  2016-12-15 22:39                             ` Tom Herbert
  2016-12-16  0:07                             ` Hannes Frederic Sowa
  0 siblings, 2 replies; 32+ messages in thread
From: Josef Bacik @ 2016-12-15 18:53 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Craig Gallek, Hannes Frederic Sowa, Eric Dumazet,
	Linux Kernel Network Developers

[-- Attachment #1: Type: text/plain, Size: 3507 bytes --]

On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert <tom@herbertland.com> 
wrote:
> On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek <kraigatgoog@gmail.com> 
> wrote:
>>  On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert <tom@herbertland.com> 
>> wrote:
>>>  I think there may be some suspicious code in inet_csk_get_port. At
>>>  tb_found there is:
>>> 
>>>                  if (((tb->fastreuse > 0 && reuse) ||
>>>                       (tb->fastreuseport > 0 &&
>>>                        !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>                        sk->sk_reuseport && uid_eq(tb->fastuid, 
>>> uid))) &&
>>>                      smallest_size == -1)
>>>                          goto success;
>>>                  if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, 
>>> tb, true)) {
>>>                          if ((reuse ||
>>>                               (tb->fastreuseport > 0 &&
>>>                                sk->sk_reuseport &&
>>>                                
>>> !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>                                uid_eq(tb->fastuid, uid))) &&
>>>                              smallest_size != -1 && --attempts >= 
>>> 0) {
>>>                                  spin_unlock_bh(&head->lock);
>>>                                  goto again;
>>>                          }
>>>                          goto fail_unlock;
>>>                  }
>>> 
>>>  AFAICT there is redundancy in these two conditionals.  The same 
>>> clause
>>>  is being checked in both: (tb->fastreuseport > 0 &&
>>>  !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport &&
>>>  uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is true 
>>> the
>>>  first conditional should be hit, goto done,  and the second will 
>>> never
>>>  evaluate that part to true-- unless the sk is changed (do we need
>>>  READ_ONCE for sk->sk_reuseport_cb?).
>>  That's an interesting point... It looks like this function also
>>  changed in 4.6 from using a single local_bh_disable() at the 
>> beginning
>>  with several spin_lock(&head->lock) to exclusively
>>  spin_lock_bh(&head->lock) at each locking point.  Perhaps the full 
>> bh
>>  disable variant was preventing the timers in your stack trace from
>>  running interleaved with this function before?
> 
> Could be, although dropping the lock shouldn't be able to affect the
> search state. TBH, I'm a little lost reading this function; the
> SO_REUSEPORT handling is pretty complicated. For instance,
> rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in that
> function and also in every call to inet_csk_bind_conflict. I wonder if
> we can simplify this under the assumption that SO_REUSEPORT is only
> allowed if the port number (snum) is explicitly specified.

Ok first I have data for you Hannes, here are the time distributions 
before, during and after the lockup (with all the debugging in place the 
box eventually recovers).  I've attached it as a text file since it is 
long.

Second, I was thinking about why we would spend so much time walking 
the ->owners list, and obviously it's because of the massive amount of 
timewait sockets on the owners list.  I wrote the following dumb patch 
and tested it and the problem has disappeared completely.  Now I don't 
know if this is right at all, but I thought it was weird we weren't 
copying the soreuseport option from the original socket onto the twsk.  
Is there a reason we aren't doing this currently?  Does this help 
explain what is happening?  Thanks,

Josef

[-- Attachment #2.1: timing-dist.txt --]
[-- Type: text/plain, Size: 30972 bytes --]

     inet_csk_get_port   : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 4        |*                                       |
      2048 -> 4095       : 100      |****************************************|
      4096 -> 8191       : 64       |*************************               |
      8192 -> 16383      : 35       |**************                          |
     16384 -> 32767      : 2        |                                        |
     inet_csk_bind_conflict : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 1        |*                                       |
      2048 -> 4095       : 38       |****************************************|
      4096 -> 8191       : 9        |*********                               |
      8192 -> 16383      : 2        |**                                      |
     16384 -> 32767      : 1        |*                                       |
<restart happens>
     inet_csk_bind_conflict : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 9        |**                                      |
      2048 -> 4095       : 54       |****************                        |
      4096 -> 8191       : 15       |****                                    |
      8192 -> 16383      : 0        |                                        |
     16384 -> 32767      : 1        |                                        |
     32768 -> 65535      : 0        |                                        |
     65536 -> 131071     : 0        |                                        |
    131072 -> 262143     : 0        |                                        |
    262144 -> 524287     : 0        |                                        |
    524288 -> 1048575    : 0        |                                        |
   1048576 -> 2097151    : 0        |                                        |
   2097152 -> 4194303    : 130      |****************************************|
   4194304 -> 8388607    : 0        |                                        |
   8388608 -> 16777215   : 0        |                                        |
  16777216 -> 33554431   : 0        |                                        |
  33554432 -> 67108863   : 92       |****************************            |
     inet_csk_get_port   : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 11       |                                        |
      2048 -> 4095       : 132      |*********                               |
      4096 -> 8191       : 91       |******                                  |
      8192 -> 16383      : 13       |                                        |
     16384 -> 32767      : 0        |                                        |
     32768 -> 65535      : 0        |                                        |
     65536 -> 131071     : 0        |                                        |
    131072 -> 262143     : 0        |                                        |
    262144 -> 524287     : 0        |                                        |
    524288 -> 1048575    : 0        |                                        |
   1048576 -> 2097151    : 0        |                                        |
   2097152 -> 4194303    : 401      |****************************            |
   4194304 -> 8388607    : 274      |*******************                     |
   8388608 -> 16777215   : 0        |                                        |
  16777216 -> 33554431   : 16       |*                                       |
  33554432 -> 67108863   : 561      |****************************************|
     inet_csk_bind_conflict : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 6        |                                        |
      2048 -> 4095       : 68       |****                                    |
      4096 -> 8191       : 9        |                                        |
      8192 -> 16383      : 2        |                                        |
     16384 -> 32767      : 0        |                                        |
     32768 -> 65535      : 0        |                                        |
     65536 -> 131071     : 0        |                                        |
    131072 -> 262143     : 0        |                                        |
    262144 -> 524287     : 0        |                                        |
    524288 -> 1048575    : 0        |                                        |
   1048576 -> 2097151    : 0        |                                        |
   2097152 -> 4194303    : 650      |****************************************|
   4194304 -> 8388607    : 0        |                                        |
   8388608 -> 16777215   : 0        |                                        |
  16777216 -> 33554431   : 15       |                                        |
  33554432 -> 67108863   : 583      |***********************************     |
     inet_csk_get_port   : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 18       |*                                       |
      2048 -> 4095       : 263      |********************                    |
      4096 -> 8191       : 188      |**************                          |
      8192 -> 16383      : 186      |**************                          |
     16384 -> 32767      : 7        |                                        |
     32768 -> 65535      : 1        |                                        |
     65536 -> 131071     : 1        |                                        |
    131072 -> 262143     : 0        |                                        |
    262144 -> 524287     : 0        |                                        |
    524288 -> 1048575    : 0        |                                        |
   1048576 -> 2097151    : 0        |                                        |
   2097152 -> 4194303    : 37       |**                                      |
   4194304 -> 8388607    : 454      |**********************************      |
   8388608 -> 16777215   : 9        |                                        |
  16777216 -> 33554431   : 24       |*                                       |
  33554432 -> 67108863   : 526      |****************************************|
<soft lockup messages start happening>
     inet_csk_bind_conflict : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 20       |*                                       |
      2048 -> 4095       : 130      |**********                              |
      4096 -> 8191       : 40       |***                                     |
      8192 -> 16383      : 2        |                                        |
     16384 -> 32767      : 1        |                                        |
     32768 -> 65535      : 0        |                                        |
     65536 -> 131071     : 0        |                                        |
    131072 -> 262143     : 0        |                                        |
    262144 -> 524287     : 0        |                                        |
    524288 -> 1048575    : 0        |                                        |
   1048576 -> 2097151    : 0        |                                        |
   2097152 -> 4194303    : 506      |*************************************** |
   4194304 -> 8388607    : 0        |                                        |
   8388608 -> 16777215   : 0        |                                        |
  16777216 -> 33554431   : 23       |*                                       |
  33554432 -> 67108863   : 511      |****************************************|
               inet_csk_get_port             : count     distribution
                   0 -> 1                    : 0        |                    |
                   2 -> 3                    : 0        |                    |
                   4 -> 7                    : 0        |                    |
                   8 -> 15                   : 0        |                    |
                  16 -> 31                   : 0        |                    |
                  32 -> 63                   : 0        |                    |
                  64 -> 127                  : 0        |                    |
                 128 -> 255                  : 0        |                    |
                 256 -> 511                  : 0        |                    |
                 512 -> 1023                 : 0        |                    |
                1024 -> 2047                 : 9        |                    |
                2048 -> 4095                 : 356      |********************|
                4096 -> 8191                 : 230      |************        |
                8192 -> 16383                : 342      |******************* |
               16384 -> 32767                : 12       |                    |
               32768 -> 65535                : 1        |                    |
               65536 -> 131071               : 0        |                    |
              131072 -> 262143               : 0        |                    |
              262144 -> 524287               : 1        |                    |
              524288 -> 1048575              : 0        |                    |
             1048576 -> 2097151              : 0        |                    |
             2097152 -> 4194303              : 311      |*****************   |
             4194304 -> 8388607              : 163      |*********           |
             8388608 -> 16777215             : 1        |                    |
            16777216 -> 33554431             : 3        |                    |
            33554432 -> 67108863             : 338      |******************  |
            67108864 -> 134217727            : 55       |***                 |
           134217728 -> 268435455            : 65       |***                 |
           268435456 -> 536870911            : 36       |**                  |
           536870912 -> 1073741823           : 22       |*                   |
          1073741824 -> 2147483647           : 16       |                    |
          2147483648 -> 4294967295           : 7        |                    |
          4294967296 -> 8589934591           : 1        |                    |
     inet_csk_bind_conflict : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 2        |                                        |
      2048 -> 4095       : 86       |***                                     |
      4096 -> 8191       : 16       |                                        |
      8192 -> 16383      : 0        |                                        |
     16384 -> 32767      : 0        |                                        |
     32768 -> 65535      : 0        |                                        |
     65536 -> 131071     : 0        |                                        |
    131072 -> 262143     : 0        |                                        |
    262144 -> 524287     : 0        |                                        |
    524288 -> 1048575    : 0        |                                        |
   1048576 -> 2097151    : 187      |*******                                 |
   2097152 -> 4194303    : 975      |****************************************|
   4194304 -> 8388607    : 0        |                                        |
   8388608 -> 16777215   : 0        |                                        |
  16777216 -> 33554431   : 337      |*************                           |
  33554432 -> 67108863   : 442      |******************                      |
               inet_csk_get_port             : count     distribution
                   0 -> 1                    : 0        |                    |
                   2 -> 3                    : 0        |                    |
                   4 -> 7                    : 0        |                    |
                   8 -> 15                   : 0        |                    |
                  16 -> 31                   : 0        |                    |
                  32 -> 63                   : 0        |                    |
                  64 -> 127                  : 0        |                    |
                 128 -> 255                  : 0        |                    |
                 256 -> 511                  : 0        |                    |
                 512 -> 1023                 : 0        |                    |
                1024 -> 2047                 : 162      |****                |
                2048 -> 4095                 : 495      |**************      |
                4096 -> 8191                 : 66       |*                   |
                8192 -> 16383                : 6        |                    |
               16384 -> 32767                : 2        |                    |
               32768 -> 65535                : 0        |                    |
               65536 -> 131071               : 0        |                    |
              131072 -> 262143               : 0        |                    |
              262144 -> 524287               : 0        |                    |
              524288 -> 1048575              : 0        |                    |
             1048576 -> 2097151              : 0        |                    |
             2097152 -> 4194303              : 680      |********************|
             4194304 -> 8388607              : 166      |****                |
             8388608 -> 16777215             : 10       |                    |
            16777216 -> 33554431             : 6        |                    |
            33554432 -> 67108863             : 150      |****                |
            67108864 -> 134217727            : 275      |********            |
           134217728 -> 268435455            : 205      |******              |
           268435456 -> 536870911            : 151      |****                |
           536870912 -> 1073741823           : 137      |****                |
          1073741824 -> 2147483647           : 76       |**                  |
          2147483648 -> 4294967295           : 48       |*                   |
          4294967296 -> 8589934591           : 6        |                    |
          8589934592 -> 17179869183          : 2        |                    |
     inet_csk_bind_conflict : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 7        |                                        |
      2048 -> 4095       : 40       |***                                     |
      4096 -> 8191       : 0        |                                        |
      8192 -> 16383      : 0        |                                        |
     16384 -> 32767      : 0        |                                        |
     32768 -> 65535      : 0        |                                        |
     65536 -> 131071     : 0        |                                        |
    131072 -> 262143     : 0        |                                        |
    262144 -> 524287     : 0        |                                        |
    524288 -> 1048575    : 0        |                                        |
   1048576 -> 2097151    : 33       |**                                      |
   2097152 -> 4194303    : 159      |************                            |
   4194304 -> 8388607    : 0        |                                        |
   8388608 -> 16777215   : 0        |                                        |
  16777216 -> 33554431   : 311      |*************************               |
  33554432 -> 67108863   : 493      |****************************************|
               inet_csk_get_port             : count     distribution
                   0 -> 1                    : 0        |                    |
                   2 -> 3                    : 0        |                    |
                   4 -> 7                    : 0        |                    |
                   8 -> 15                   : 0        |                    |
                  16 -> 31                   : 0        |                    |
                  32 -> 63                   : 0        |                    |
                  64 -> 127                  : 0        |                    |
                 128 -> 255                  : 0        |                    |
                 256 -> 511                  : 0        |                    |
                 512 -> 1023                 : 0        |                    |
                1024 -> 2047                 : 129      |******************* |
                2048 -> 4095                 : 55       |********            |
                4096 -> 8191                 : 47       |*******             |
                8192 -> 16383                : 17       |**                  |
               16384 -> 32767                : 2        |                    |
               32768 -> 65535                : 0        |                    |
               65536 -> 131071               : 0        |                    |
              131072 -> 262143               : 0        |                    |
              262144 -> 524287               : 0        |                    |
              524288 -> 1048575              : 0        |                    |
             1048576 -> 2097151              : 30       |****                |
             2097152 -> 4194303              : 130      |********************|
             4194304 -> 8388607              : 24       |***                 |
             8388608 -> 16777215             : 0        |                    |
            16777216 -> 33554431             : 13       |**                  |
            33554432 -> 67108863             : 118      |******************  |
            67108864 -> 134217727            : 58       |********            |
           134217728 -> 268435455            : 17       |**                  |
           268435456 -> 536870911            : 7        |*                   |
           536870912 -> 1073741823           : 0        |                    |
          1073741824 -> 2147483647           : 1        |                    |
          2147483648 -> 4294967295           : 0        |                    |
          4294967296 -> 8589934591           : 1        |                    |
     inet_csk_bind_conflict : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 6        |*                                       |
      2048 -> 4095       : 14       |**                                      |
      4096 -> 8191       : 0        |                                        |
      8192 -> 16383      : 1        |                                        |
     16384 -> 32767      : 0        |                                        |
     32768 -> 65535      : 0        |                                        |
     65536 -> 131071     : 0        |                                        |
    131072 -> 262143     : 0        |                                        |
    262144 -> 524287     : 0        |                                        |
    524288 -> 1048575    : 0        |                                        |
   1048576 -> 2097151    : 158      |********************************        |
   2097152 -> 4194303    : 22       |****                                    |
   4194304 -> 8388607    : 0        |                                        |
   8388608 -> 16777215   : 0        |                                        |
  16777216 -> 33554431   : 192      |****************************************|
  33554432 -> 67108863   : 9        |*                                       |
<recovers>
     inet_csk_get_port   : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 10       |****************                        |
      2048 -> 4095       : 25       |****************************************|
      4096 -> 8191       : 16       |*************************               |
      8192 -> 16383      : 1        |*                                       |
     16384 -> 32767      : 0        |                                        |
     32768 -> 65535      : 1        |*                                       |
     inet_csk_bind_conflict : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 10       |*********************************       |
      2048 -> 4095       : 12       |****************************************|
     inet_csk_get_port   : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 4        |****************************************|
      8192 -> 16383      : 1        |**********                              |

[-- Attachment #2.2: tw-reuseport.patch --]
[-- Type: text/x-patch, Size: 1229 bytes --]

commit ea66f43c5b4d94625ad7322e4097acd9a06d7fdd
Author: Josef Bacik <jbacik@fb.com>
Date:   Wed Dec 14 11:54:49 2016 -0800

    do reuseport too

diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
index c9b3eb7..567017b 100644
--- a/include/net/inet_timewait_sock.h
+++ b/include/net/inet_timewait_sock.h
@@ -55,6 +55,7 @@ struct inet_timewait_sock {
 #define tw_family		__tw_common.skc_family
 #define tw_state		__tw_common.skc_state
 #define tw_reuse		__tw_common.skc_reuse
+#define tw_reuseport		__tw_common.skc_reuseport
 #define tw_ipv6only		__tw_common.skc_ipv6only
 #define tw_bound_dev_if		__tw_common.skc_bound_dev_if
 #define tw_node			__tw_common.skc_nulls_node
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index a1b1057..04c560e 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -183,6 +183,7 @@ struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk,
 		tw->tw_dport	    = inet->inet_dport;
 		tw->tw_family	    = sk->sk_family;
 		tw->tw_reuse	    = sk->sk_reuse;
+		tw->tw_reuseport    = sk->sk_reuseport;
 		tw->tw_hash	    = sk->sk_hash;
 		tw->tw_ipv6only	    = 0;
 		tw->tw_transparent  = inet->transparent;

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: Soft lockup in inet_put_port on 4.6
  2016-12-15 18:53                           ` Josef Bacik
@ 2016-12-15 22:39                             ` Tom Herbert
  2016-12-15 23:25                               ` Craig Gallek
  2016-12-16  0:07                             ` Hannes Frederic Sowa
  1 sibling, 1 reply; 32+ messages in thread
From: Tom Herbert @ 2016-12-15 22:39 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Craig Gallek, Hannes Frederic Sowa, Eric Dumazet,
	Linux Kernel Network Developers

On Thu, Dec 15, 2016 at 10:53 AM, Josef Bacik <jbacik@fb.com> wrote:
> On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert <tom@herbertland.com> wrote:
>>
>> On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek <kraigatgoog@gmail.com>
>> wrote:
>>>
>>>  On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert <tom@herbertland.com>
>>> wrote:
>>>>
>>>>  I think there may be some suspicious code in inet_csk_get_port. At
>>>>  tb_found there is:
>>>>
>>>>                  if (((tb->fastreuse > 0 && reuse) ||
>>>>                       (tb->fastreuseport > 0 &&
>>>>                        !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>>                        sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&
>>>>                      smallest_size == -1)
>>>>                          goto success;
>>>>                  if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb,
>>>> true)) {
>>>>                          if ((reuse ||
>>>>                               (tb->fastreuseport > 0 &&
>>>>                                sk->sk_reuseport &&
>>>>                                !rcu_access_pointer(sk->sk_reuseport_cb)
>>>> &&
>>>>                                uid_eq(tb->fastuid, uid))) &&
>>>>                              smallest_size != -1 && --attempts >= 0) {
>>>>                                  spin_unlock_bh(&head->lock);
>>>>                                  goto again;
>>>>                          }
>>>>                          goto fail_unlock;
>>>>                  }
>>>>
>>>>  AFAICT there is redundancy in these two conditionals.  The same clause
>>>>  is being checked in both: (tb->fastreuseport > 0 &&
>>>>  !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport &&
>>>>  uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is true the
>>>>  first conditional should be hit, goto done,  and the second will never
>>>>  evaluate that part to true-- unless the sk is changed (do we need
>>>>  READ_ONCE for sk->sk_reuseport_cb?).
>>>
>>>  That's an interesting point... It looks like this function also
>>>  changed in 4.6 from using a single local_bh_disable() at the beginning
>>>  with several spin_lock(&head->lock) to exclusively
>>>  spin_lock_bh(&head->lock) at each locking point.  Perhaps the full bh
>>>  disable variant was preventing the timers in your stack trace from
>>>  running interleaved with this function before?
>>
>>
>> Could be, although dropping the lock shouldn't be able to affect the
>> search state. TBH, I'm a little lost in reading function, the
>> SO_REUSEPORT handling is pretty complicated. For instance,
>> rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in that
>> function and also in every call to inet_csk_bind_conflict. I wonder if
>> we can simply this under the assumption that SO_REUSEPORT is only
>> allowed if the port number (snum) is explicitly specified.
>
>
> Ok first I have data for you Hannes, here's the time distributions before
> during and after the lockup (with all the debugging in place the box
> eventually recovers).  I've attached it as a text file since it is long.
>
> Second is I was thinking about why we would spend so much time doing the
> ->owners list, and obviously it's because of the massive amount of timewait
> sockets on the owners list.  I wrote the following dumb patch and tested it
> and the problem has disappeared completely.  Now I don't know if this is
> right at all, but I thought it was weird we weren't copying the soreuseport
> option from the original socket onto the twsk.  Is there are reason we
> aren't doing this currently?  Does this help explain what is happening?
> Thanks,
>
I think that would explain it. We would be walking long lists of TW
sockets in inet_bind_bucket_for_each(tb, &head->chain). Your patch
should break that up, although now I'm wondering if there are other
ways we can get into this situation. reuseport ensures that we can have
long lists of sockets in a single bucket, and TW sockets can make that
list really long.

Tom

> Josef

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Soft lockup in inet_put_port on 4.6
  2016-12-15 22:39                             ` Tom Herbert
@ 2016-12-15 23:25                               ` Craig Gallek
  0 siblings, 0 replies; 32+ messages in thread
From: Craig Gallek @ 2016-12-15 23:25 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Josef Bacik, Hannes Frederic Sowa, Eric Dumazet,
	Linux Kernel Network Developers

On Thu, Dec 15, 2016 at 5:39 PM, Tom Herbert <tom@herbertland.com> wrote:
> On Thu, Dec 15, 2016 at 10:53 AM, Josef Bacik <jbacik@fb.com> wrote:
>> On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert <tom@herbertland.com> wrote:
>>>
>>> On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek <kraigatgoog@gmail.com>
>>> wrote:
>>>>
>>>>  On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert <tom@herbertland.com>
>>>> wrote:
>>>>>
>>>>>  I think there may be some suspicious code in inet_csk_get_port. At
>>>>>  tb_found there is:
>>>>>
>>>>>                  if (((tb->fastreuse > 0 && reuse) ||
>>>>>                       (tb->fastreuseport > 0 &&
>>>>>                        !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>>>                        sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&
>>>>>                      smallest_size == -1)
>>>>>                          goto success;
>>>>>                  if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb,
>>>>> true)) {
>>>>>                          if ((reuse ||
>>>>>                               (tb->fastreuseport > 0 &&
>>>>>                                sk->sk_reuseport &&
>>>>>                                !rcu_access_pointer(sk->sk_reuseport_cb)
>>>>> &&
>>>>>                                uid_eq(tb->fastuid, uid))) &&
>>>>>                              smallest_size != -1 && --attempts >= 0) {
>>>>>                                  spin_unlock_bh(&head->lock);
>>>>>                                  goto again;
>>>>>                          }
>>>>>                          goto fail_unlock;
>>>>>                  }
>>>>>
>>>>>  AFAICT there is redundancy in these two conditionals.  The same clause
>>>>>  is being checked in both: (tb->fastreuseport > 0 &&
>>>>>  !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport &&
>>>>>  uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is true the
>>>>>  first conditional should be hit, goto done,  and the second will never
>>>>>  evaluate that part to true-- unless the sk is changed (do we need
>>>>>  READ_ONCE for sk->sk_reuseport_cb?).
>>>>
>>>>  That's an interesting point... It looks like this function also
>>>>  changed in 4.6 from using a single local_bh_disable() at the beginning
>>>>  with several spin_lock(&head->lock) to exclusively
>>>>  spin_lock_bh(&head->lock) at each locking point.  Perhaps the full bh
>>>>  disable variant was preventing the timers in your stack trace from
>>>>  running interleaved with this function before?
>>>
>>>
>>> Could be, although dropping the lock shouldn't be able to affect the
>>> search state. TBH, I'm a little lost in reading function, the
>>> SO_REUSEPORT handling is pretty complicated. For instance,
>>> rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in that
>>> function and also in every call to inet_csk_bind_conflict. I wonder if
>>> we can simply this under the assumption that SO_REUSEPORT is only
>>> allowed if the port number (snum) is explicitly specified.
>>
>>
>> Ok first I have data for you Hannes, here's the time distributions before
>> during and after the lockup (with all the debugging in place the box
>> eventually recovers).  I've attached it as a text file since it is long.
>>
>> Second is I was thinking about why we would spend so much time doing the
>> ->owners list, and obviously it's because of the massive amount of timewait
>> sockets on the owners list.  I wrote the following dumb patch and tested it
>> and the problem has disappeared completely.  Now I don't know if this is
>> right at all, but I thought it was weird we weren't copying the soreuseport
>> option from the original socket onto the twsk.  Is there are reason we
>> aren't doing this currently?  Does this help explain what is happening?
>> Thanks,
>>
> I think that would explain it. We would be walking long lists of TW
> sockets in inet_bind_bucket_for_each(tb, &head->chain). This should
> break, although now I'm wondering if there's other ways we can get
> into this situation. reuseport ensures that we can have long lists of
> sockets in a single bucket, TW sockets can make that list really long.
What if the time-wait timer implementation was changed to do more
opportunistic removals?  In this case, you seem to have a coordinated
timer event causing many independent locking events on the bucket in
question.  If one of those firing events realized it could handle all
of them, you could greatly reduce the contention.  The fact that they
all hash to the same bucket may make this even easier...
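A rough sketch of that batching idea, purely illustrative: one firing
TIME_WAIT timer sweeps the shared bucket and reaps every entry whose
deadline has already passed, so a single lock round-trip retires many
sockets instead of one. All names (toy_tw, toy_bucket, reap_expired) are
invented; this is a userspace toy, not the kernel's timer code.

/*
 * Illustrative sketch of batched TIME_WAIT expiry: one firing timer reaps
 * every already-expired entry hashed to the same bucket.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

struct toy_tw {
	time_t expires;
	struct toy_tw *next;
};

struct toy_bucket {
	struct toy_tw *head;	/* all TIME_WAIT sockets hashed to this bucket */
};

/* Called when any one twsk's timer fires for this bucket. */
static int reap_expired(struct toy_bucket *b, time_t now)
{
	struct toy_tw **pp = &b->head;
	int reaped = 0;

	/* One pass under one (conceptual) lock acquisition. */
	while (*pp) {
		struct toy_tw *tw = *pp;
		if (tw->expires <= now) {
			*pp = tw->next;
			free(tw);
			reaped++;
		} else {
			pp = &tw->next;
		}
	}
	return reaped;
}

int main(void)
{
	struct toy_bucket b = { .head = NULL };
	time_t now = time(NULL);

	for (int i = 0; i < 10; i++) {
		struct toy_tw *tw = malloc(sizeof(*tw));
		tw->expires = now + (i % 2 ? -1 : 60);	/* half already expired */
		tw->next = b.head;
		b.head = tw;
	}
	printf("reaped %d expired entries in one sweep\n", reap_expired(&b, now));
	return 0;
}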

> Tom
>
>> Josef

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Soft lockup in inet_put_port on 4.6
  2016-12-15 18:53                           ` Josef Bacik
  2016-12-15 22:39                             ` Tom Herbert
@ 2016-12-16  0:07                             ` Hannes Frederic Sowa
  2016-12-16 14:54                               ` Josef Bacik
  1 sibling, 1 reply; 32+ messages in thread
From: Hannes Frederic Sowa @ 2016-12-16  0:07 UTC (permalink / raw)
  To: Josef Bacik, Tom Herbert
  Cc: Craig Gallek, Eric Dumazet, Linux Kernel Network Developers

Hi Josef,

On 15.12.2016 19:53, Josef Bacik wrote:
> On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert <tom@herbertland.com> wrote:
>> On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek <kraigatgoog@gmail.com>
>> wrote:
>>>  On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert <tom@herbertland.com>
>>> wrote:
>>>>  I think there may be some suspicious code in inet_csk_get_port. At
>>>>  tb_found there is:
>>>>
>>>>                  if (((tb->fastreuse > 0 && reuse) ||
>>>>                       (tb->fastreuseport > 0 &&
>>>>                        !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>>                        sk->sk_reuseport && uid_eq(tb->fastuid,
>>>> uid))) &&
>>>>                      smallest_size == -1)
>>>>                          goto success;
>>>>                  if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk,
>>>> tb, true)) {
>>>>                          if ((reuse ||
>>>>                               (tb->fastreuseport > 0 &&
>>>>                                sk->sk_reuseport &&
>>>>                               
>>>> !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>>                                uid_eq(tb->fastuid, uid))) &&
>>>>                              smallest_size != -1 && --attempts >= 0) {
>>>>                                  spin_unlock_bh(&head->lock);
>>>>                                  goto again;
>>>>                          }
>>>>                          goto fail_unlock;
>>>>                  }
>>>>
>>>>  AFAICT there is redundancy in these two conditionals.  The same clause
>>>>  is being checked in both: (tb->fastreuseport > 0 &&
>>>>  !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport &&
>>>>  uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is true the
>>>>  first conditional should be hit, goto done,  and the second will never
>>>>  evaluate that part to true-- unless the sk is changed (do we need
>>>>  READ_ONCE for sk->sk_reuseport_cb?).
>>>  That's an interesting point... It looks like this function also
>>>  changed in 4.6 from using a single local_bh_disable() at the beginning
>>>  with several spin_lock(&head->lock) to exclusively
>>>  spin_lock_bh(&head->lock) at each locking point.  Perhaps the full bh
>>>  disable variant was preventing the timers in your stack trace from
>>>  running interleaved with this function before?
>>
>> Could be, although dropping the lock shouldn't be able to affect the
>> search state. TBH, I'm a little lost in reading function, the
>> SO_REUSEPORT handling is pretty complicated. For instance,
>> rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in that
>> function and also in every call to inet_csk_bind_conflict. I wonder if
>> we can simply this under the assumption that SO_REUSEPORT is only
>> allowed if the port number (snum) is explicitly specified.
> 
> Ok first I have data for you Hannes, here's the time distributions
> before during and after the lockup (with all the debugging in place the
> box eventually recovers).  I've attached it as a text file since it is
> long.

Thanks a lot!

> Second is I was thinking about why we would spend so much time doing the
> ->owners list, and obviously it's because of the massive amount of
> timewait sockets on the owners list.  I wrote the following dumb patch
> and tested it and the problem has disappeared completely.  Now I don't
> know if this is right at all, but I thought it was weird we weren't
> copying the soreuseport option from the original socket onto the twsk. 
> Is there are reason we aren't doing this currently?  Does this help
> explain what is happening?  Thanks,

The patch is interesting and a good clue, but I am immediately a bit
concerned that we don't also copy/tag the socket with the uid to keep
the security properties of SO_REUSEPORT. I have to think a bit more
about this.

We have seen hangs during connect. I am afraid this patch wouldn't help
there while also guaranteeing uniqueness.

Thanks a lot!

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Soft lockup in inet_put_port on 4.6
  2016-12-16  0:07                             ` Hannes Frederic Sowa
@ 2016-12-16 14:54                               ` Josef Bacik
  2016-12-16 15:21                                 ` Josef Bacik
  0 siblings, 1 reply; 32+ messages in thread
From: Josef Bacik @ 2016-12-16 14:54 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Tom Herbert, Craig Gallek, Eric Dumazet, Linux Kernel Network Developers

On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa 
<hannes@stressinduktion.org> wrote:
> Hi Josef,
> 
> On 15.12.2016 19:53, Josef Bacik wrote:
>>  On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert <tom@herbertland.com> 
>> wrote:
>>>  On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek 
>>> <kraigatgoog@gmail.com>
>>>  wrote:
>>>>   On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert 
>>>> <tom@herbertland.com>
>>>>  wrote:
>>>>>   I think there may be some suspicious code in inet_csk_get_port. 
>>>>> At
>>>>>   tb_found there is:
>>>>> 
>>>>>                   if (((tb->fastreuse > 0 && reuse) ||
>>>>>                        (tb->fastreuseport > 0 &&
>>>>>                         !rcu_access_pointer(sk->sk_reuseport_cb) 
>>>>> &&
>>>>>                         sk->sk_reuseport && uid_eq(tb->fastuid,
>>>>>  uid))) &&
>>>>>                       smallest_size == -1)
>>>>>                           goto success;
>>>>>                   if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk,
>>>>>  tb, true)) {
>>>>>                           if ((reuse ||
>>>>>                                (tb->fastreuseport > 0 &&
>>>>>                                 sk->sk_reuseport &&
>>>>> 
>>>>>  !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>>>                                 uid_eq(tb->fastuid, uid))) &&
>>>>>                               smallest_size != -1 && --attempts 
>>>>> >= 0) {
>>>>>                                   spin_unlock_bh(&head->lock);
>>>>>                                   goto again;
>>>>>                           }
>>>>>                           goto fail_unlock;
>>>>>                   }
>>>>> 
>>>>>   AFAICT there is redundancy in these two conditionals.  The same 
>>>>> clause
>>>>>   is being checked in both: (tb->fastreuseport > 0 &&
>>>>>   !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport &&
>>>>>   uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is 
>>>>> true the
>>>>>   first conditional should be hit, goto done,  and the second 
>>>>> will never
>>>>>   evaluate that part to true-- unless the sk is changed (do we 
>>>>> need
>>>>>   READ_ONCE for sk->sk_reuseport_cb?).
>>>>   That's an interesting point... It looks like this function also
>>>>   changed in 4.6 from using a single local_bh_disable() at the 
>>>> beginning
>>>>   with several spin_lock(&head->lock) to exclusively
>>>>   spin_lock_bh(&head->lock) at each locking point.  Perhaps the 
>>>> full bh
>>>>   disable variant was preventing the timers in your stack trace 
>>>> from
>>>>   running interleaved with this function before?
>>> 
>>>  Could be, although dropping the lock shouldn't be able to affect 
>>> the
>>>  search state. TBH, I'm a little lost in reading function, the
>>>  SO_REUSEPORT handling is pretty complicated. For instance,
>>>  rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in 
>>> that
>>>  function and also in every call to inet_csk_bind_conflict. I 
>>> wonder if
>>>  we can simply this under the assumption that SO_REUSEPORT is only
>>>  allowed if the port number (snum) is explicitly specified.
>> 
>>  Ok first I have data for you Hannes, here's the time distributions
>>  before during and after the lockup (with all the debugging in place 
>> the
>>  box eventually recovers).  I've attached it as a text file since it 
>> is
>>  long.
> 
> Thanks a lot!
> 
>>  Second is I was thinking about why we would spend so much time 
>> doing the
>>  ->owners list, and obviously it's because of the massive amount of
>>  timewait sockets on the owners list.  I wrote the following dumb 
>> patch
>>  and tested it and the problem has disappeared completely.  Now I 
>> don't
>>  know if this is right at all, but I thought it was weird we weren't
>>  copying the soreuseport option from the original socket onto the 
>> twsk.
>>  Is there are reason we aren't doing this currently?  Does this help
>>  explain what is happening?  Thanks,
> 
> The patch is interesting and a good clue, but I am immediately a bit
> concerned that we don't copy/tag the socket with the uid also to keep
> the security properties for SO_REUSEPORT. I have to think a bit more
> about this.
> 
> We have seen hangs during connect. I am afraid this patch wouldn't 
> help
> there while also guaranteeing uniqueness.


Yeah, so I looked at the code some more and actually my patch is really 
bad.  If sk2->sk_reuseport is set we'll look at sk2->sk_reuseport_cb, 
which is outside of the timewait sock, so that's definitely bad.

But we should at least be setting it to 0 so that we don't do this 
normally.  Unfortunately, simply setting it to 0 doesn't fix the 
problem.  So for some reason having ->sk_reuseport set to 1 on a 
timewait socket makes this problem non-existent, which is strange.

So back to the drawing board I guess.  I wonder if doing what Craig 
suggested and batching the timewait timer expirations so they hurt less 
would accomplish the same result.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Soft lockup in inet_put_port on 4.6
  2016-12-16 14:54                               ` Josef Bacik
@ 2016-12-16 15:21                                 ` Josef Bacik
  2016-12-16 22:08                                   ` Josef Bacik
  0 siblings, 1 reply; 32+ messages in thread
From: Josef Bacik @ 2016-12-16 15:21 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Tom Herbert, Craig Gallek, Eric Dumazet, Linux Kernel Network Developers

On Fri, Dec 16, 2016 at 9:54 AM, Josef Bacik <jbacik@fb.com> wrote:
> On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa 
> <hannes@stressinduktion.org> wrote:
>> Hi Josef,
>> 
>> On 15.12.2016 19:53, Josef Bacik wrote:
>>>  On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert <tom@herbertland.com> 
>>> wrote:
>>>>  On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek 
>>>> <kraigatgoog@gmail.com>
>>>>  wrote:
>>>>>   On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert 
>>>>> <tom@herbertland.com>
>>>>>  wrote:
>>>>>>   I think there may be some suspicious code in 
>>>>>> inet_csk_get_port. At
>>>>>>   tb_found there is:
>>>>>> 
>>>>>>                   if (((tb->fastreuse > 0 && reuse) ||
>>>>>>                        (tb->fastreuseport > 0 &&
>>>>>>                         !rcu_access_pointer(sk->sk_reuseport_cb) 
>>>>>> &&
>>>>>>                         sk->sk_reuseport && uid_eq(tb->fastuid,
>>>>>>  uid))) &&
>>>>>>                       smallest_size == -1)
>>>>>>                           goto success;
>>>>>>                   if 
>>>>>> (inet_csk(sk)->icsk_af_ops->bind_conflict(sk,
>>>>>>  tb, true)) {
>>>>>>                           if ((reuse ||
>>>>>>                                (tb->fastreuseport > 0 &&
>>>>>>                                 sk->sk_reuseport &&
>>>>>> 
>>>>>>  !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>>>>                                 uid_eq(tb->fastuid, uid))) &&
>>>>>>                               smallest_size != -1 && --attempts 
>>>>>> >= 0) {
>>>>>>                                   spin_unlock_bh(&head->lock);
>>>>>>                                   goto again;
>>>>>>                           }
>>>>>>                           goto fail_unlock;
>>>>>>                   }
>>>>>> 
>>>>>>   AFAICT there is redundancy in these two conditionals.  The 
>>>>>> same clause
>>>>>>   is being checked in both: (tb->fastreuseport > 0 &&
>>>>>>   !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport &&
>>>>>>   uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is 
>>>>>> true the
>>>>>>   first conditional should be hit, goto done,  and the second 
>>>>>> will never
>>>>>>   evaluate that part to true-- unless the sk is changed (do we 
>>>>>> need
>>>>>>   READ_ONCE for sk->sk_reuseport_cb?).
>>>>>   That's an interesting point... It looks like this function also
>>>>>   changed in 4.6 from using a single local_bh_disable() at the 
>>>>> beginning
>>>>>   with several spin_lock(&head->lock) to exclusively
>>>>>   spin_lock_bh(&head->lock) at each locking point.  Perhaps the 
>>>>> full bh
>>>>>   disable variant was preventing the timers in your stack trace 
>>>>> from
>>>>>   running interleaved with this function before?
>>>> 
>>>>  Could be, although dropping the lock shouldn't be able to affect 
>>>> the
>>>>  search state. TBH, I'm a little lost in reading function, the
>>>>  SO_REUSEPORT handling is pretty complicated. For instance,
>>>>  rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in 
>>>> that
>>>>  function and also in every call to inet_csk_bind_conflict. I 
>>>> wonder if
>>>>  we can simply this under the assumption that SO_REUSEPORT is only
>>>>  allowed if the port number (snum) is explicitly specified.
>>> 
>>>  Ok first I have data for you Hannes, here's the time distributions
>>>  before during and after the lockup (with all the debugging in 
>>> place the
>>>  box eventually recovers).  I've attached it as a text file since 
>>> it is
>>>  long.
>> 
>> Thanks a lot!
>> 
>>>  Second is I was thinking about why we would spend so much time 
>>> doing the
>>>  ->owners list, and obviously it's because of the massive amount of
>>>  timewait sockets on the owners list.  I wrote the following dumb 
>>> patch
>>>  and tested it and the problem has disappeared completely.  Now I 
>>> don't
>>>  know if this is right at all, but I thought it was weird we weren't
>>>  copying the soreuseport option from the original socket onto the 
>>> twsk.
>>>  Is there are reason we aren't doing this currently?  Does this help
>>>  explain what is happening?  Thanks,
>> 
>> The patch is interesting and a good clue, but I am immediately a bit
>> concerned that we don't copy/tag the socket with the uid also to keep
>> the security properties for SO_REUSEPORT. I have to think a bit more
>> about this.
>> 
>> We have seen hangs during connect. I am afraid this patch wouldn't 
>> help
>> there while also guaranteeing uniqueness.
> 
> 
> Yeah so I looked at the code some more and actually my patch is 
> really bad.  If sk2->sk_reuseport is set we'll look at 
> sk2->sk_reuseport_cb, which is outside of the timewait sock, so 
> that's definitely bad.
> 
> But we should at least be setting it to 0 so that we don't do this 
> normally.  Unfortunately simply setting it to 0 doesn't fix the 
> problem.  So for some reason having ->sk_reuseport set to 1 on a 
> timewait socket makes this problem non-existent, which is strange.
> 
> So back to the drawing board I guess.  I wonder if doing what craig 
> suggested and batching the timewait timer expires so it hurts less 
> would accomplish the same results.  Thanks,

Wait, no, I lied: we access sk->sk_reuseport_cb, not sk2's.  This is 
the code

                        if ((!reuse || !sk2->sk_reuse ||
                             sk2->sk_state == TCP_LISTEN) &&
                            (!reuseport || !sk2->sk_reuseport ||
                             rcu_access_pointer(sk->sk_reuseport_cb) ||
                             (sk2->sk_state != TCP_TIME_WAIT &&
                              !uid_eq(uid, sock_i_uid(sk2))))) {

                                if (!sk2->sk_rcv_saddr || !sk->sk_rcv_saddr ||
                                    sk2->sk_rcv_saddr == sk->sk_rcv_saddr)
                                        break;
                        }

so in my patch's case we now have reuseport == 1 and sk2->sk_reuseport == 
1.  But now we are using reuseport, so sk->sk_reuseport_cb should be 
non-NULL, right?  So really, setting the timewait sock's sk_reuseport 
should have no bearing on how this loop plays out, right?  Thanks,

Josef
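To make the question concrete, here is a tiny standalone evaluation of just
that quoted condition (nothing else from inet_csk_bind_conflict), with
hand-picked inputs reflecting the scenario under discussion: a socket
binding with SO_REUSEPORT against a TIME_WAIT peer of the same uid, with
and without the twsk carrying sk_reuseport, and with sk->sk_reuseport_cb
NULL or set. The mapping of inputs is my own reading of the snippet, so
treat it as a sketch rather than a statement about the rest of the function.

/*
 * Standalone evaluation of the quoted condition only.  "cb" stands in for
 * rcu_access_pointer(sk->sk_reuseport_cb) being non-NULL; uid always
 * matches in this scenario.
 */
#include <stdbool.h>
#include <stdio.h>

static bool candidate(bool reuse, bool sk2_reuse, bool sk2_listen,
		      bool reuseport, bool sk2_reuseport, bool cb,
		      bool sk2_timewait, bool uid_match)
{
	return (!reuse || !sk2_reuse || sk2_listen) &&
	       (!reuseport || !sk2_reuseport || cb ||
		(!sk2_timewait && !uid_match));
}

int main(void)
{
	/* Binding socket uses reuseport; sk2 is a TIME_WAIT socket with
	 * the same uid; SO_REUSEADDR is not in play, so only the second
	 * clause decides anything. */
	for (int twrp = 0; twrp <= 1; twrp++)
		for (int cb = 0; cb <= 1; cb++)
			printf("tw_reuseport=%d cb=%d -> address check %s\n",
			       twrp, cb,
			       candidate(false, false, false,
					 true, twrp, cb,
					 true, true) ? "runs" : "skipped");
	return 0;
}

Under these made-up inputs the twsk's reuseport bit only changes the outcome
when cb is NULL; once cb is non-NULL the address check runs either way,
which is exactly why the observed improvement from the patch is puzzling if
sk->sk_reuseport_cb is expected to be set.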

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Soft lockup in inet_put_port on 4.6
  2016-12-16 15:21                                 ` Josef Bacik
@ 2016-12-16 22:08                                   ` Josef Bacik
  2016-12-16 22:18                                     ` Tom Herbert
  0 siblings, 1 reply; 32+ messages in thread
From: Josef Bacik @ 2016-12-16 22:08 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Tom Herbert, Craig Gallek, Eric Dumazet, Linux Kernel Network Developers

On Fri, Dec 16, 2016 at 10:21 AM, Josef Bacik <jbacik@fb.com> wrote:
> On Fri, Dec 16, 2016 at 9:54 AM, Josef Bacik <jbacik@fb.com> wrote:
>> On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa 
>> <hannes@stressinduktion.org> wrote:
>>> Hi Josef,
>>> 
>>> On 15.12.2016 19:53, Josef Bacik wrote:
>>>>  On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert 
>>>> <tom@herbertland.com> wrote:
>>>>>  On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek 
>>>>> <kraigatgoog@gmail.com>
>>>>>  wrote:
>>>>>>   On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert 
>>>>>> <tom@herbertland.com>
>>>>>>  wrote:
>>>>>>>   I think there may be some suspicious code in 
>>>>>>> inet_csk_get_port. At
>>>>>>>   tb_found there is:
>>>>>>> 
>>>>>>>                   if (((tb->fastreuse > 0 && reuse) ||
>>>>>>>                        (tb->fastreuseport > 0 &&
>>>>>>>                         
>>>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>>>>>                         sk->sk_reuseport && uid_eq(tb->fastuid,
>>>>>>>  uid))) &&
>>>>>>>                       smallest_size == -1)
>>>>>>>                           goto success;
>>>>>>>                   if 
>>>>>>> (inet_csk(sk)->icsk_af_ops->bind_conflict(sk,
>>>>>>>  tb, true)) {
>>>>>>>                           if ((reuse ||
>>>>>>>                                (tb->fastreuseport > 0 &&
>>>>>>>                                 sk->sk_reuseport &&
>>>>>>> 
>>>>>>>  !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>>>>>                                 uid_eq(tb->fastuid, uid))) &&
>>>>>>>                               smallest_size != -1 && --attempts 
>>>>>>> >= 0) {
>>>>>>>                                   spin_unlock_bh(&head->lock);
>>>>>>>                                   goto again;
>>>>>>>                           }
>>>>>>>                           goto fail_unlock;
>>>>>>>                   }
>>>>>>> 
>>>>>>>   AFAICT there is redundancy in these two conditionals.  The 
>>>>>>> same clause
>>>>>>>   is being checked in both: (tb->fastreuseport > 0 &&
>>>>>>>   !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport 
>>>>>>> &&
>>>>>>>   uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is 
>>>>>>> true the
>>>>>>>   first conditional should be hit, goto done,  and the second 
>>>>>>> will never
>>>>>>>   evaluate that part to true-- unless the sk is changed (do we 
>>>>>>> need
>>>>>>>   READ_ONCE for sk->sk_reuseport_cb?).
>>>>>>   That's an interesting point... It looks like this function also
>>>>>>   changed in 4.6 from using a single local_bh_disable() at the 
>>>>>> beginning
>>>>>>   with several spin_lock(&head->lock) to exclusively
>>>>>>   spin_lock_bh(&head->lock) at each locking point.  Perhaps the 
>>>>>> full bh
>>>>>>   disable variant was preventing the timers in your stack trace 
>>>>>> from
>>>>>>   running interleaved with this function before?
>>>>> 
>>>>>  Could be, although dropping the lock shouldn't be able to affect 
>>>>> the
>>>>>  search state. TBH, I'm a little lost in reading function, the
>>>>>  SO_REUSEPORT handling is pretty complicated. For instance,
>>>>>  rcu_access_pointer(sk->sk_reuseport_cb) is checked three times 
>>>>> in that
>>>>>  function and also in every call to inet_csk_bind_conflict. I 
>>>>> wonder if
>>>>>  we can simply this under the assumption that SO_REUSEPORT is only
>>>>>  allowed if the port number (snum) is explicitly specified.
>>>> 
>>>>  Ok first I have data for you Hannes, here's the time distributions
>>>>  before during and after the lockup (with all the debugging in 
>>>> place the
>>>>  box eventually recovers).  I've attached it as a text file since 
>>>> it is
>>>>  long.
>>> 
>>> Thanks a lot!
>>> 
>>>>  Second is I was thinking about why we would spend so much time 
>>>> doing the
>>>>  ->owners list, and obviously it's because of the massive amount of
>>>>  timewait sockets on the owners list.  I wrote the following dumb 
>>>> patch
>>>>  and tested it and the problem has disappeared completely.  Now I 
>>>> don't
>>>>  know if this is right at all, but I thought it was weird we 
>>>> weren't
>>>>  copying the soreuseport option from the original socket onto the 
>>>> twsk.
>>>>  Is there are reason we aren't doing this currently?  Does this 
>>>> help
>>>>  explain what is happening?  Thanks,
>>> 
>>> The patch is interesting and a good clue, but I am immediately a bit
>>> concerned that we don't copy/tag the socket with the uid also to 
>>> keep
>>> the security properties for SO_REUSEPORT. I have to think a bit more
>>> about this.
>>> 
>>> We have seen hangs during connect. I am afraid this patch wouldn't 
>>> help
>>> there while also guaranteeing uniqueness.
>> 
>> 
>> Yeah so I looked at the code some more and actually my patch is 
>> really bad.  If sk2->sk_reuseport is set we'll look at 
>> sk2->sk_reuseport_cb, which is outside of the timewait sock, so 
>> that's definitely bad.
>> 
>> But we should at least be setting it to 0 so that we don't do this 
>> normally.  Unfortunately simply setting it to 0 doesn't fix the 
>> problem.  So for some reason having ->sk_reuseport set to 1 on a 
>> timewait socket makes this problem non-existent, which is strange.
>> 
>> So back to the drawing board I guess.  I wonder if doing what craig 
>> suggested and batching the timewait timer expires so it hurts less 
>> would accomplish the same results.  Thanks,
> 
> Wait no I lied, we access the sk->sk_reuseport_cb, not sk2's.  This 
> is the code
> 
>                        if ((!reuse || !sk2->sk_reuse ||
>                            sk2->sk_state == TCP_LISTEN) &&
>                            (!reuseport || !sk2->sk_reuseport ||
>                             rcu_access_pointer(sk->sk_reuseport_cb) ||
>                             (sk2->sk_state != TCP_TIME_WAIT &&
>                             !uid_eq(uid, sock_i_uid(sk2))))) {
> 
>                                if (!sk2->sk_rcv_saddr || 
> !sk->sk_rcv_saddr ||
>                                    sk2->sk_rcv_saddr == 
> sk->sk_rcv_saddr)
>                                        break;
>                        }
> 
> so in my patches case we now have reuseport == 1, sk2->sk_reuseport 
> == 1.  But now we are using reuseport, so sk->sk_reuseport_cb should 
> be non-NULL right?  So really setting the timewait sock's 
> sk_reuseport should have no bearing on how this loop plays out right? 
>  Thanks,


Some more messing around, and I noticed that we basically don't do the 
tb->fastreuseport logic at all once we've ended up with a non-SO_REUSEPORT 
socket on that tb.  So, before I fully understood what I was doing, I 
fixed it so that after we get through ->bind_conflict() once with a 
SO_REUSEPORT socket, we reset tb->fastreuseport to 1 and set the uid to 
match the uid of the socket.  This made the problem go away.  Tom 
pointed out that if we bind to the same port on a different address and 
we have a non-SO_REUSEPORT socket with the same address on this tb, then 
we'd be screwed with my code.
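
Roughly, the duct tape version does something like this at tb_found in 
inet_csk_get_port() (just a sketch of the idea, not the actual diff; 
locking and the retry path are unchanged):

                if (!inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, true)) {
                        /* we made it through a full walk of tb->owners with a
                         * SO_REUSEPORT socket, so re-arm the fast path for the
                         * next bind to this port */
                        if (sk->sk_reuseport &&
                            !rcu_access_pointer(sk->sk_reuseport_cb)) {
                                tb->fastreuseport = 1;
                                tb->fastuid = uid;
                        }
                        goto success;
                }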

Which brings me to his proposed solution.  We need another hash table 
that is indexed based on the binding address.  Then each node 
corresponds to one address/port binding, with non-SO_REUSEPORT bindings 
having only one entry, and normal SO_REUSEPORT bindings having many.  
This removes the need to search all the possible sockets on any given 
tb; we just go and look at the one we care about.  Does this make 
sense?  Thanks,

Josef

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Soft lockup in inet_put_port on 4.6
  2016-12-16 22:08                                   ` Josef Bacik
@ 2016-12-16 22:18                                     ` Tom Herbert
  2016-12-16 22:50                                       ` Josef Bacik
  0 siblings, 1 reply; 32+ messages in thread
From: Tom Herbert @ 2016-12-16 22:18 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Hannes Frederic Sowa, Craig Gallek, Eric Dumazet,
	Linux Kernel Network Developers

On Fri, Dec 16, 2016 at 2:08 PM, Josef Bacik <jbacik@fb.com> wrote:
> On Fri, Dec 16, 2016 at 10:21 AM, Josef Bacik <jbacik@fb.com> wrote:
>>
>> On Fri, Dec 16, 2016 at 9:54 AM, Josef Bacik <jbacik@fb.com> wrote:
>>>
>>> On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa
>>> <hannes@stressinduktion.org> wrote:
>>>>
>>>> Hi Josef,
>>>>
>>>> On 15.12.2016 19:53, Josef Bacik wrote:
>>>>>
>>>>>  On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert <tom@herbertland.com>
>>>>> wrote:
>>>>>>
>>>>>>  On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek <kraigatgoog@gmail.com>
>>>>>>  wrote:
>>>>>>>
>>>>>>>   On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert <tom@herbertland.com>
>>>>>>>  wrote:
>>>>>>>>
>>>>>>>>   I think there may be some suspicious code in inet_csk_get_port. At
>>>>>>>>   tb_found there is:
>>>>>>>>
>>>>>>>>                   if (((tb->fastreuse > 0 && reuse) ||
>>>>>>>>                        (tb->fastreuseport > 0 &&
>>>>>>>>                         !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>>>>>>                         sk->sk_reuseport && uid_eq(tb->fastuid,
>>>>>>>>  uid))) &&
>>>>>>>>                       smallest_size == -1)
>>>>>>>>                           goto success;
>>>>>>>>                   if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk,
>>>>>>>>  tb, true)) {
>>>>>>>>                           if ((reuse ||
>>>>>>>>                                (tb->fastreuseport > 0 &&
>>>>>>>>                                 sk->sk_reuseport &&
>>>>>>>>
>>>>>>>>  !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>>>>>>                                 uid_eq(tb->fastuid, uid))) &&
>>>>>>>>                               smallest_size != -1 && --attempts >=
>>>>>>>> 0) {
>>>>>>>>                                   spin_unlock_bh(&head->lock);
>>>>>>>>                                   goto again;
>>>>>>>>                           }
>>>>>>>>                           goto fail_unlock;
>>>>>>>>                   }
>>>>>>>>
>>>>>>>>   AFAICT there is redundancy in these two conditionals.  The same
>>>>>>>> clause
>>>>>>>>   is being checked in both: (tb->fastreuseport > 0 &&
>>>>>>>>   !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport &&
>>>>>>>>   uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is true
>>>>>>>> the
>>>>>>>>   first conditional should be hit, goto done,  and the second will
>>>>>>>> never
>>>>>>>>   evaluate that part to true-- unless the sk is changed (do we need
>>>>>>>>   READ_ONCE for sk->sk_reuseport_cb?).
>>>>>>>
>>>>>>>   That's an interesting point... It looks like this function also
>>>>>>>   changed in 4.6 from using a single local_bh_disable() at the
>>>>>>> beginning
>>>>>>>   with several spin_lock(&head->lock) to exclusively
>>>>>>>   spin_lock_bh(&head->lock) at each locking point.  Perhaps the full
>>>>>>> bh
>>>>>>>   disable variant was preventing the timers in your stack trace from
>>>>>>>   running interleaved with this function before?
>>>>>>
>>>>>>
>>>>>>  Could be, although dropping the lock shouldn't be able to affect the
>>>>>>  search state. TBH, I'm a little lost in reading function, the
>>>>>>  SO_REUSEPORT handling is pretty complicated. For instance,
>>>>>>  rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in
>>>>>> that
>>>>>>  function and also in every call to inet_csk_bind_conflict. I wonder
>>>>>> if
>>>>>>  we can simply this under the assumption that SO_REUSEPORT is only
>>>>>>  allowed if the port number (snum) is explicitly specified.
>>>>>
>>>>>
>>>>>  Ok first I have data for you Hannes, here's the time distributions
>>>>>  before during and after the lockup (with all the debugging in place
>>>>> the
>>>>>  box eventually recovers).  I've attached it as a text file since it is
>>>>>  long.
>>>>
>>>>
>>>> Thanks a lot!
>>>>
>>>>>  Second is I was thinking about why we would spend so much time doing
>>>>> the
>>>>>  ->owners list, and obviously it's because of the massive amount of
>>>>>  timewait sockets on the owners list.  I wrote the following dumb patch
>>>>>  and tested it and the problem has disappeared completely.  Now I don't
>>>>>  know if this is right at all, but I thought it was weird we weren't
>>>>>  copying the soreuseport option from the original socket onto the twsk.
>>>>>  Is there are reason we aren't doing this currently?  Does this help
>>>>>  explain what is happening?  Thanks,
>>>>
>>>>
>>>> The patch is interesting and a good clue, but I am immediately a bit
>>>> concerned that we don't copy/tag the socket with the uid also to keep
>>>> the security properties for SO_REUSEPORT. I have to think a bit more
>>>> about this.
>>>>
>>>> We have seen hangs during connect. I am afraid this patch wouldn't help
>>>> there while also guaranteeing uniqueness.
>>>
>>>
>>>
>>> Yeah so I looked at the code some more and actually my patch is really
>>> bad.  If sk2->sk_reuseport is set we'll look at sk2->sk_reuseport_cb, which
>>> is outside of the timewait sock, so that's definitely bad.
>>>
>>> But we should at least be setting it to 0 so that we don't do this
>>> normally.  Unfortunately simply setting it to 0 doesn't fix the problem.  So
>>> for some reason having ->sk_reuseport set to 1 on a timewait socket makes
>>> this problem non-existent, which is strange.
>>>
>>> So back to the drawing board I guess.  I wonder if doing what craig
>>> suggested and batching the timewait timer expires so it hurts less would
>>> accomplish the same results.  Thanks,
>>
>>
>> Wait no I lied, we access the sk->sk_reuseport_cb, not sk2's.  This is the
>> code
>>
>>                        if ((!reuse || !sk2->sk_reuse ||
>>                            sk2->sk_state == TCP_LISTEN) &&
>>                            (!reuseport || !sk2->sk_reuseport ||
>>                             rcu_access_pointer(sk->sk_reuseport_cb) ||
>>                             (sk2->sk_state != TCP_TIME_WAIT &&
>>                             !uid_eq(uid, sock_i_uid(sk2))))) {
>>
>>                                if (!sk2->sk_rcv_saddr || !sk->sk_rcv_saddr
>> ||
>>                                    sk2->sk_rcv_saddr == sk->sk_rcv_saddr)
>>                                        break;
>>                        }
>>
>> so in my patches case we now have reuseport == 1, sk2->sk_reuseport == 1.
>> But now we are using reuseport, so sk->sk_reuseport_cb should be non-NULL
>> right?  So really setting the timewait sock's sk_reuseport should have no
>> bearing on how this loop plays out right?  Thanks,
>
>
>
> So more messing around and I noticed that we basically don't do the
> tb->fastreuseport logic at all if we've ended up with a non SO_REUSEPORT
> socket on that tb.  So before I fully understood what I was doing I fixed it
> so that after we go through ->bind_conflict() once with a SO_REUSEPORT
> socket, we reset tb->fastreuseport to 1 and set the uid to match the uid of
> the socket.  This made the problem go away.  Tom pointed out that if we bind
> to the same port on a different address and we have a non SO_REUSEPORT
> socket with the same address on this tb then we'd be screwed with my code.
>
> Which brings me to his proposed solution.  We need another hash table that
> is indexed based on the binding address.  Then each node corresponds to one
> address/port binding, with non-SO_REUSEPORT entries having only one entry,
> and normal SO_REUSEPORT entries having many.  This cleans up the need to
> search all the possible sockets on any given tb, we just go and look at the
> one we care about.  Does this make sense?  Thanks,
>
Hi Josef,

Thinking about it some more, the hash table won't work because of the
rules of binding different addresses to the same port. What I think we
can do is change inet_bind_bucket to be a structure that contains all
the information used to detect conflicts (reuse*, if, address, uid,
etc.) and a list of sockets that share that exact same information--
for instance, all sockets in timewait state created through some
listener socket should wind up on a single bucket. When we do the
bind_conflict function we should only have to walk these buckets, not
the full list of sockets.
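
Something like this, purely as a sketch of the shape I have in mind
(the names are invented, this isn't a real patch):

struct inet_bind_node {
	unsigned char		reuse;
	unsigned char		reuseport;
	kuid_t			uid;
	int			bound_dev_if;
	struct in6_addr		addr;		/* or a v4/v6 union */
	struct hlist_head	owners;		/* sockets sharing this exact tuple */
	struct hlist_node	node;		/* hangs off the bind bucket */
};

That way a flood of timewait sockets from one listener collapses into a
single node, and bind_conflict only has to compare against a handful of
nodes instead of every socket bound to the port.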

Thoughts on this?

Thanks,
Tom

> Josef
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Soft lockup in inet_put_port on 4.6
  2016-12-16 22:18                                     ` Tom Herbert
@ 2016-12-16 22:50                                       ` Josef Bacik
  2016-12-17 11:08                                         ` Hannes Frederic Sowa
  0 siblings, 1 reply; 32+ messages in thread
From: Josef Bacik @ 2016-12-16 22:50 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Hannes Frederic Sowa, Craig Gallek, Eric Dumazet,
	Linux Kernel Network Developers

On Fri, Dec 16, 2016 at 5:18 PM, Tom Herbert <tom@herbertland.com> 
wrote:
> On Fri, Dec 16, 2016 at 2:08 PM, Josef Bacik <jbacik@fb.com> wrote:
>>  On Fri, Dec 16, 2016 at 10:21 AM, Josef Bacik <jbacik@fb.com> wrote:
>>> 
>>>  On Fri, Dec 16, 2016 at 9:54 AM, Josef Bacik <jbacik@fb.com> wrote:
>>>> 
>>>>  On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa
>>>>  <hannes@stressinduktion.org> wrote:
>>>>> 
>>>>>  Hi Josef,
>>>>> 
>>>>>  On 15.12.2016 19:53, Josef Bacik wrote:
>>>>>> 
>>>>>>   On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert 
>>>>>> <tom@herbertland.com>
>>>>>>  wrote:
>>>>>>> 
>>>>>>>   On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek 
>>>>>>> <kraigatgoog@gmail.com>
>>>>>>>   wrote:
>>>>>>>> 
>>>>>>>>    On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert 
>>>>>>>> <tom@herbertland.com>
>>>>>>>>   wrote:
>>>>>>>>> 
>>>>>>>>>    I think there may be some suspicious code in 
>>>>>>>>> inet_csk_get_port. At
>>>>>>>>>    tb_found there is:
>>>>>>>>> 
>>>>>>>>>                    if (((tb->fastreuse > 0 && reuse) ||
>>>>>>>>>                         (tb->fastreuseport > 0 &&
>>>>>>>>>                          
>>>>>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>>>>>>>                          sk->sk_reuseport && 
>>>>>>>>> uid_eq(tb->fastuid,
>>>>>>>>>   uid))) &&
>>>>>>>>>                        smallest_size == -1)
>>>>>>>>>                            goto success;
>>>>>>>>>                    if 
>>>>>>>>> (inet_csk(sk)->icsk_af_ops->bind_conflict(sk,
>>>>>>>>>   tb, true)) {
>>>>>>>>>                            if ((reuse ||
>>>>>>>>>                                 (tb->fastreuseport > 0 &&
>>>>>>>>>                                  sk->sk_reuseport &&
>>>>>>>>> 
>>>>>>>>>   !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>>>>>>>                                  uid_eq(tb->fastuid, uid))) &&
>>>>>>>>>                                smallest_size != -1 && 
>>>>>>>>> --attempts >=
>>>>>>>>>  0) {
>>>>>>>>>                                    
>>>>>>>>> spin_unlock_bh(&head->lock);
>>>>>>>>>                                    goto again;
>>>>>>>>>                            }
>>>>>>>>>                            goto fail_unlock;
>>>>>>>>>                    }
>>>>>>>>> 
>>>>>>>>>    AFAICT there is redundancy in these two conditionals.  The 
>>>>>>>>> same
>>>>>>>>>  clause
>>>>>>>>>    is being checked in both: (tb->fastreuseport > 0 &&
>>>>>>>>>    !rcu_access_pointer(sk->sk_reuseport_cb) && 
>>>>>>>>> sk->sk_reuseport &&
>>>>>>>>>    uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this 
>>>>>>>>> is true
>>>>>>>>>  the
>>>>>>>>>    first conditional should be hit, goto done,  and the 
>>>>>>>>> second will
>>>>>>>>>  never
>>>>>>>>>    evaluate that part to true-- unless the sk is changed (do 
>>>>>>>>> we need
>>>>>>>>>    READ_ONCE for sk->sk_reuseport_cb?).
>>>>>>>> 
>>>>>>>>    That's an interesting point... It looks like this function 
>>>>>>>> also
>>>>>>>>    changed in 4.6 from using a single local_bh_disable() at the
>>>>>>>>  beginning
>>>>>>>>    with several spin_lock(&head->lock) to exclusively
>>>>>>>>    spin_lock_bh(&head->lock) at each locking point.  Perhaps 
>>>>>>>> the full
>>>>>>>>  bh
>>>>>>>>    disable variant was preventing the timers in your stack 
>>>>>>>> trace from
>>>>>>>>    running interleaved with this function before?
>>>>>>> 
>>>>>>> 
>>>>>>>   Could be, although dropping the lock shouldn't be able to 
>>>>>>> affect the
>>>>>>>   search state. TBH, I'm a little lost in reading function, the
>>>>>>>   SO_REUSEPORT handling is pretty complicated. For instance,
>>>>>>>   rcu_access_pointer(sk->sk_reuseport_cb) is checked three 
>>>>>>> times in
>>>>>>>  that
>>>>>>>   function and also in every call to inet_csk_bind_conflict. I 
>>>>>>> wonder
>>>>>>>  if
>>>>>>>   we can simply this under the assumption that SO_REUSEPORT is 
>>>>>>> only
>>>>>>>   allowed if the port number (snum) is explicitly specified.
>>>>>> 
>>>>>> 
>>>>>>   Ok first I have data for you Hannes, here's the time 
>>>>>> distributions
>>>>>>   before during and after the lockup (with all the debugging in 
>>>>>> place
>>>>>>  the
>>>>>>   box eventually recovers).  I've attached it as a text file 
>>>>>> since it is
>>>>>>   long.
>>>>> 
>>>>> 
>>>>>  Thanks a lot!
>>>>> 
>>>>>>   Second is I was thinking about why we would spend so much time 
>>>>>> doing
>>>>>>  the
>>>>>>   ->owners list, and obviously it's because of the massive 
>>>>>> amount of
>>>>>>   timewait sockets on the owners list.  I wrote the following 
>>>>>> dumb patch
>>>>>>   and tested it and the problem has disappeared completely.  Now 
>>>>>> I don't
>>>>>>   know if this is right at all, but I thought it was weird we 
>>>>>> weren't
>>>>>>   copying the soreuseport option from the original socket onto 
>>>>>> the twsk.
>>>>>>   Is there are reason we aren't doing this currently?  Does this 
>>>>>> help
>>>>>>   explain what is happening?  Thanks,
>>>>> 
>>>>> 
>>>>>  The patch is interesting and a good clue, but I am immediately a 
>>>>> bit
>>>>>  concerned that we don't copy/tag the socket with the uid also to 
>>>>> keep
>>>>>  the security properties for SO_REUSEPORT. I have to think a bit 
>>>>> more
>>>>>  about this.
>>>>> 
>>>>>  We have seen hangs during connect. I am afraid this patch 
>>>>> wouldn't help
>>>>>  there while also guaranteeing uniqueness.
>>>> 
>>>> 
>>>> 
>>>>  Yeah so I looked at the code some more and actually my patch is 
>>>> really
>>>>  bad.  If sk2->sk_reuseport is set we'll look at 
>>>> sk2->sk_reuseport_cb, which
>>>>  is outside of the timewait sock, so that's definitely bad.
>>>> 
>>>>  But we should at least be setting it to 0 so that we don't do this
>>>>  normally.  Unfortunately simply setting it to 0 doesn't fix the 
>>>> problem.  So
>>>>  for some reason having ->sk_reuseport set to 1 on a timewait 
>>>> socket makes
>>>>  this problem non-existent, which is strange.
>>>> 
>>>>  So back to the drawing board I guess.  I wonder if doing what 
>>>> craig
>>>>  suggested and batching the timewait timer expires so it hurts 
>>>> less would
>>>>  accomplish the same results.  Thanks,
>>> 
>>> 
>>>  Wait no I lied, we access the sk->sk_reuseport_cb, not sk2's.  
>>> This is the
>>>  code
>>> 
>>>                         if ((!reuse || !sk2->sk_reuse ||
>>>                             sk2->sk_state == TCP_LISTEN) &&
>>>                             (!reuseport || !sk2->sk_reuseport ||
>>>                              
>>> rcu_access_pointer(sk->sk_reuseport_cb) ||
>>>                              (sk2->sk_state != TCP_TIME_WAIT &&
>>>                              !uid_eq(uid, sock_i_uid(sk2))))) {
>>> 
>>>                                 if (!sk2->sk_rcv_saddr || 
>>> !sk->sk_rcv_saddr
>>>  ||
>>>                                     sk2->sk_rcv_saddr == 
>>> sk->sk_rcv_saddr)
>>>                                         break;
>>>                         }
>>> 
>>>  so in my patches case we now have reuseport == 1, 
>>> sk2->sk_reuseport == 1.
>>>  But now we are using reuseport, so sk->sk_reuseport_cb should be 
>>> non-NULL
>>>  right?  So really setting the timewait sock's sk_reuseport should 
>>> have no
>>>  bearing on how this loop plays out right?  Thanks,
>> 
>> 
>> 
>>  So more messing around and I noticed that we basically don't do the
>>  tb->fastreuseport logic at all if we've ended up with a non 
>> SO_REUSEPORT
>>  socket on that tb.  So before I fully understood what I was doing I 
>> fixed it
>>  so that after we go through ->bind_conflict() once with a 
>> SO_REUSEPORT
>>  socket, we reset tb->fastreuseport to 1 and set the uid to match 
>> the uid of
>>  the socket.  This made the problem go away.  Tom pointed out that 
>> if we bind
>>  to the same port on a different address and we have a non 
>> SO_REUSEPORT
>>  socket with the same address on this tb then we'd be screwed with 
>> my code.
>> 
>>  Which brings me to his proposed solution.  We need another hash 
>> table that
>>  is indexed based on the binding address.  Then each node 
>> corresponds to one
>>  address/port binding, with non-SO_REUSEPORT entries having only one 
>> entry,
>>  and normal SO_REUSEPORT entries having many.  This cleans up the 
>> need to
>>  search all the possible sockets on any given tb, we just go and 
>> look at the
>>  one we care about.  Does this make sense?  Thanks,
>> 
> Hi Josef,
> 
> Thinking about it some more the hash table won't work because of the
> rules of binding different addresses to the same port. What I think we
> can do is to change inet_bind_bucket to be structure that contains all
> the information used to detect conflicts (reuse*, if, address, uid,
> etc.) and a list of sockets that share that exact same information--
> for instance all socket in timewait state create through some listener
> socket should wind up on single bucket. When we do the bind_conflict
> function we only should have to walk this buckets, not the full list
> of sockets.
> 
> Thoughts on this?

This sounds good, maybe tb->owners could be a list of, say,

struct inet_unique_shit {
	struct sock_common sk;		/* the shared conflict-relevant fields */
	struct hlist_head socks;	/* the real sockets behind this entry */
};

Then we make inet_unique_shit like the twsks: just copy the relevant 
information, then hang the real sockets off of the socks hlist.  
Something like that?  Thanks,

Josef

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Soft lockup in inet_put_port on 4.6
  2016-12-16 22:50                                       ` Josef Bacik
@ 2016-12-17 11:08                                         ` Hannes Frederic Sowa
  2016-12-17 13:26                                           ` Josef Bacik
  0 siblings, 1 reply; 32+ messages in thread
From: Hannes Frederic Sowa @ 2016-12-17 11:08 UTC (permalink / raw)
  To: Josef Bacik, Tom Herbert
  Cc: Craig Gallek, Eric Dumazet, Linux Kernel Network Developers

On 16.12.2016 23:50, Josef Bacik wrote:
> On Fri, Dec 16, 2016 at 5:18 PM, Tom Herbert <tom@herbertland.com> wrote:
>> On Fri, Dec 16, 2016 at 2:08 PM, Josef Bacik <jbacik@fb.com> wrote:
>>>  On Fri, Dec 16, 2016 at 10:21 AM, Josef Bacik <jbacik@fb.com> wrote:
>>>>
>>>>  On Fri, Dec 16, 2016 at 9:54 AM, Josef Bacik <jbacik@fb.com> wrote:
>>>>>
>>>>>  On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa
>>>>>  <hannes@stressinduktion.org> wrote:
>>>>>>
>>>>>>  Hi Josef,
>>>>>>
>>>>>>  On 15.12.2016 19:53, Josef Bacik wrote:
>>>>>>>
>>>>>>>   On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert <tom@herbertland.com>
>>>>>>>  wrote:
>>>>>>>>
>>>>>>>>   On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek
>>>>>>>> <kraigatgoog@gmail.com>
>>>>>>>>   wrote:
>>>>>>>>>
>>>>>>>>>    On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert
>>>>>>>>> <tom@herbertland.com>
>>>>>>>>>   wrote:
>>>>>>>>>>
>>>>>>>>>>    I think there may be some suspicious code in
>>>>>>>>>> inet_csk_get_port. At
>>>>>>>>>>    tb_found there is:
>>>>>>>>>>
>>>>>>>>>>                    if (((tb->fastreuse > 0 && reuse) ||
>>>>>>>>>>                         (tb->fastreuseport > 0 &&
>>>>>>>>>>                         
>>>>>>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>>>>>>>>                          sk->sk_reuseport && uid_eq(tb->fastuid,
>>>>>>>>>>   uid))) &&
>>>>>>>>>>                        smallest_size == -1)
>>>>>>>>>>                            goto success;
>>>>>>>>>>                    if
>>>>>>>>>> (inet_csk(sk)->icsk_af_ops->bind_conflict(sk,
>>>>>>>>>>   tb, true)) {
>>>>>>>>>>                            if ((reuse ||
>>>>>>>>>>                                 (tb->fastreuseport > 0 &&
>>>>>>>>>>                                  sk->sk_reuseport &&
>>>>>>>>>>
>>>>>>>>>>   !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>>>>>>>>                                  uid_eq(tb->fastuid, uid))) &&
>>>>>>>>>>                                smallest_size != -1 &&
>>>>>>>>>> --attempts >=
>>>>>>>>>>  0) {
>>>>>>>>>>                                    spin_unlock_bh(&head->lock);
>>>>>>>>>>                                    goto again;
>>>>>>>>>>                            }
>>>>>>>>>>                            goto fail_unlock;
>>>>>>>>>>                    }
>>>>>>>>>>
>>>>>>>>>>    AFAICT there is redundancy in these two conditionals.  The
>>>>>>>>>> same
>>>>>>>>>>  clause
>>>>>>>>>>    is being checked in both: (tb->fastreuseport > 0 &&
>>>>>>>>>>    !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>>>>>>>> sk->sk_reuseport &&
>>>>>>>>>>    uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this
>>>>>>>>>> is true
>>>>>>>>>>  the
>>>>>>>>>>    first conditional should be hit, goto done,  and the second
>>>>>>>>>> will
>>>>>>>>>>  never
>>>>>>>>>>    evaluate that part to true-- unless the sk is changed (do
>>>>>>>>>> we need
>>>>>>>>>>    READ_ONCE for sk->sk_reuseport_cb?).
>>>>>>>>>
>>>>>>>>>    That's an interesting point... It looks like this function also
>>>>>>>>>    changed in 4.6 from using a single local_bh_disable() at the
>>>>>>>>>  beginning
>>>>>>>>>    with several spin_lock(&head->lock) to exclusively
>>>>>>>>>    spin_lock_bh(&head->lock) at each locking point.  Perhaps
>>>>>>>>> the full
>>>>>>>>>  bh
>>>>>>>>>    disable variant was preventing the timers in your stack
>>>>>>>>> trace from
>>>>>>>>>    running interleaved with this function before?
>>>>>>>>
>>>>>>>>
>>>>>>>>   Could be, although dropping the lock shouldn't be able to
>>>>>>>> affect the
>>>>>>>>   search state. TBH, I'm a little lost in reading function, the
>>>>>>>>   SO_REUSEPORT handling is pretty complicated. For instance,
>>>>>>>>   rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in
>>>>>>>>  that
>>>>>>>>   function and also in every call to inet_csk_bind_conflict. I
>>>>>>>> wonder
>>>>>>>>  if
>>>>>>>>   we can simply this under the assumption that SO_REUSEPORT is only
>>>>>>>>   allowed if the port number (snum) is explicitly specified.
>>>>>>>
>>>>>>>
>>>>>>>   Ok first I have data for you Hannes, here's the time distributions
>>>>>>>   before during and after the lockup (with all the debugging in
>>>>>>> place
>>>>>>>  the
>>>>>>>   box eventually recovers).  I've attached it as a text file
>>>>>>> since it is
>>>>>>>   long.
>>>>>>
>>>>>>
>>>>>>  Thanks a lot!
>>>>>>
>>>>>>>   Second is I was thinking about why we would spend so much time
>>>>>>> doing
>>>>>>>  the
>>>>>>>   ->owners list, and obviously it's because of the massive amount of
>>>>>>>   timewait sockets on the owners list.  I wrote the following
>>>>>>> dumb patch
>>>>>>>   and tested it and the problem has disappeared completely.  Now
>>>>>>> I don't
>>>>>>>   know if this is right at all, but I thought it was weird we
>>>>>>> weren't
>>>>>>>   copying the soreuseport option from the original socket onto
>>>>>>> the twsk.
>>>>>>>   Is there are reason we aren't doing this currently?  Does this
>>>>>>> help
>>>>>>>   explain what is happening?  Thanks,
>>>>>>
>>>>>>
>>>>>>  The patch is interesting and a good clue, but I am immediately a bit
>>>>>>  concerned that we don't copy/tag the socket with the uid also to
>>>>>> keep
>>>>>>  the security properties for SO_REUSEPORT. I have to think a bit more
>>>>>>  about this.
>>>>>>
>>>>>>  We have seen hangs during connect. I am afraid this patch
>>>>>> wouldn't help
>>>>>>  there while also guaranteeing uniqueness.
>>>>>
>>>>>
>>>>>
>>>>>  Yeah so I looked at the code some more and actually my patch is
>>>>> really
>>>>>  bad.  If sk2->sk_reuseport is set we'll look at
>>>>> sk2->sk_reuseport_cb, which
>>>>>  is outside of the timewait sock, so that's definitely bad.
>>>>>
>>>>>  But we should at least be setting it to 0 so that we don't do this
>>>>>  normally.  Unfortunately simply setting it to 0 doesn't fix the
>>>>> problem.  So
>>>>>  for some reason having ->sk_reuseport set to 1 on a timewait
>>>>> socket makes
>>>>>  this problem non-existent, which is strange.
>>>>>
>>>>>  So back to the drawing board I guess.  I wonder if doing what craig
>>>>>  suggested and batching the timewait timer expires so it hurts less
>>>>> would
>>>>>  accomplish the same results.  Thanks,
>>>>
>>>>
>>>>  Wait no I lied, we access the sk->sk_reuseport_cb, not sk2's.  This
>>>> is the
>>>>  code
>>>>
>>>>                         if ((!reuse || !sk2->sk_reuse ||
>>>>                             sk2->sk_state == TCP_LISTEN) &&
>>>>                             (!reuseport || !sk2->sk_reuseport ||
>>>>                              rcu_access_pointer(sk->sk_reuseport_cb) ||
>>>>                              (sk2->sk_state != TCP_TIME_WAIT &&
>>>>                              !uid_eq(uid, sock_i_uid(sk2))))) {
>>>>
>>>>                                 if (!sk2->sk_rcv_saddr ||
>>>> !sk->sk_rcv_saddr
>>>>  ||
>>>>                                     sk2->sk_rcv_saddr ==
>>>> sk->sk_rcv_saddr)
>>>>                                         break;
>>>>                         }
>>>>
>>>>  so in my patches case we now have reuseport == 1, sk2->sk_reuseport
>>>> == 1.
>>>>  But now we are using reuseport, so sk->sk_reuseport_cb should be
>>>> non-NULL
>>>>  right?  So really setting the timewait sock's sk_reuseport should
>>>> have no
>>>>  bearing on how this loop plays out right?  Thanks,
>>>
>>>
>>>
>>>  So more messing around and I noticed that we basically don't do the
>>>  tb->fastreuseport logic at all if we've ended up with a non
>>> SO_REUSEPORT
>>>  socket on that tb.  So before I fully understood what I was doing I
>>> fixed it
>>>  so that after we go through ->bind_conflict() once with a SO_REUSEPORT
>>>  socket, we reset tb->fastreuseport to 1 and set the uid to match the
>>> uid of
>>>  the socket.  This made the problem go away.  Tom pointed out that if
>>> we bind
>>>  to the same port on a different address and we have a non SO_REUSEPORT
>>>  socket with the same address on this tb then we'd be screwed with my
>>> code.
>>>
>>>  Which brings me to his proposed solution.  We need another hash
>>> table that
>>>  is indexed based on the binding address.  Then each node corresponds
>>> to one
>>>  address/port binding, with non-SO_REUSEPORT entries having only one
>>> entry,
>>>  and normal SO_REUSEPORT entries having many.  This cleans up the
>>> need to
>>>  search all the possible sockets on any given tb, we just go and look
>>> at the
>>>  one we care about.  Does this make sense?  Thanks,
>>>
>> Hi Josef,
>>
>> Thinking about it some more the hash table won't work because of the
>> rules of binding different addresses to the same port. What I think we
>> can do is to change inet_bind_bucket to be structure that contains all
>> the information used to detect conflicts (reuse*, if, address, uid,
>> etc.) and a list of sockets that share that exact same information--
>> for instance all socket in timewait state create through some listener
>> socket should wind up on single bucket. When we do the bind_conflict
>> function we only should have to walk this buckets, not the full list
>> of sockets.
>>
>> Thoughts on this?
> 
> This sounds good, maybe tb->owners be a list of say
> 
> struct inet_unique_shit {
>     struct sock_common sk;
>     struct hlist socks;
> };
> 
> Then we make inet_unique_shit like twsks', just copy the relevant
> information, then hang the real sockets off of the socks hlist. 
> Something like that?  Thanks,

As a counter idea: can we push up a flag to the inet_bind_bucket that we
check in the fast-path style, indicating that we have a wildcarded
non-reuse(port) socket in there, so we can skip the bind_bucket much
more quickly? This wouldn't require a new data structure.
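
For example (just to illustrate the idea, the flag name is made up):

	/* in struct inet_bind_bucket, next to fastreuse/fastreuseport */
	unsigned char	wildcard_nonreuse;

	/* set when a non-reuse(port) socket binds the wildcard address */
	if (!reuse && !sk->sk_reuseport && !sk->sk_rcv_saddr)
		tb->wildcard_nonreuse = 1;

	/* then a reuse/reuseport bind in inet_csk_bind_conflict() could
	 * bail out right away instead of walking all of tb->owners */
	if (tb->wildcard_nonreuse)
		return true;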

Thanks,
Hannes

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Soft lockup in inet_put_port on 4.6
  2016-12-17 11:08                                         ` Hannes Frederic Sowa
@ 2016-12-17 13:26                                           ` Josef Bacik
  2016-12-20  1:56                                             ` David Miller
  0 siblings, 1 reply; 32+ messages in thread
From: Josef Bacik @ 2016-12-17 13:26 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Tom Herbert, Craig Gallek, Eric Dumazet, Linux Kernel Network Developers


> On Dec 17, 2016, at 6:09 AM, Hannes Frederic Sowa <hannes@stressinduktion.org> wrote:
> 
>> On 16.12.2016 23:50, Josef Bacik wrote:
>>> On Fri, Dec 16, 2016 at 5:18 PM, Tom Herbert <tom@herbertland.com> wrote:
>>>> On Fri, Dec 16, 2016 at 2:08 PM, Josef Bacik <jbacik@fb.com> wrote:
>>>>> On Fri, Dec 16, 2016 at 10:21 AM, Josef Bacik <jbacik@fb.com> wrote:
>>>>> 
>>>>>> On Fri, Dec 16, 2016 at 9:54 AM, Josef Bacik <jbacik@fb.com> wrote:
>>>>>> 
>>>>>> On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa
>>>>>> <hannes@stressinduktion.org> wrote:
>>>>>>> 
>>>>>>> Hi Josef,
>>>>>>> 
>>>>>>>> On 15.12.2016 19:53, Josef Bacik wrote:
>>>>>>>> 
>>>>>>>>  On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert <tom@herbertland.com>
>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>  On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek
>>>>>>>>> <kraigatgoog@gmail.com>
>>>>>>>>>  wrote:
>>>>>>>>>> 
>>>>>>>>>>   On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert
>>>>>>>>>> <tom@herbertland.com>
>>>>>>>>>>  wrote:
>>>>>>>>>>> 
>>>>>>>>>>>   I think there may be some suspicious code in
>>>>>>>>>>> inet_csk_get_port. At
>>>>>>>>>>>   tb_found there is:
>>>>>>>>>>> 
>>>>>>>>>>>                   if (((tb->fastreuse > 0 && reuse) ||
>>>>>>>>>>>                        (tb->fastreuseport > 0 &&
>>>>>>>>>>> 
>>>>>>>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>>>>>>>>>                         sk->sk_reuseport && uid_eq(tb->fastuid,
>>>>>>>>>>>  uid))) &&
>>>>>>>>>>>                       smallest_size == -1)
>>>>>>>>>>>                           goto success;
>>>>>>>>>>>                   if
>>>>>>>>>>> (inet_csk(sk)->icsk_af_ops->bind_conflict(sk,
>>>>>>>>>>>  tb, true)) {
>>>>>>>>>>>                           if ((reuse ||
>>>>>>>>>>>                                (tb->fastreuseport > 0 &&
>>>>>>>>>>>                                 sk->sk_reuseport &&
>>>>>>>>>>> 
>>>>>>>>>>>  !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>>>>>>>>>                                 uid_eq(tb->fastuid, uid))) &&
>>>>>>>>>>>                               smallest_size != -1 &&
>>>>>>>>>>> --attempts >=
>>>>>>>>>>> 0) {
>>>>>>>>>>>                                   spin_unlock_bh(&head->lock);
>>>>>>>>>>>                                   goto again;
>>>>>>>>>>>                           }
>>>>>>>>>>>                           goto fail_unlock;
>>>>>>>>>>>                   }
>>>>>>>>>>> 
>>>>>>>>>>>   AFAICT there is redundancy in these two conditionals.  The
>>>>>>>>>>> same
>>>>>>>>>>> clause
>>>>>>>>>>>   is being checked in both: (tb->fastreuseport > 0 &&
>>>>>>>>>>>   !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>>>>>>>>> sk->sk_reuseport &&
>>>>>>>>>>>   uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this
>>>>>>>>>>> is true
>>>>>>>>>>> the
>>>>>>>>>>>   first conditional should be hit, goto done,  and the second
>>>>>>>>>>> will
>>>>>>>>>>> never
>>>>>>>>>>>   evaluate that part to true-- unless the sk is changed (do
>>>>>>>>>>> we need
>>>>>>>>>>>   READ_ONCE for sk->sk_reuseport_cb?).
>>>>>>>>>> 
>>>>>>>>>>   That's an interesting point... It looks like this function also
>>>>>>>>>>   changed in 4.6 from using a single local_bh_disable() at the
>>>>>>>>>> beginning
>>>>>>>>>>   with several spin_lock(&head->lock) to exclusively
>>>>>>>>>>   spin_lock_bh(&head->lock) at each locking point.  Perhaps
>>>>>>>>>> the full
>>>>>>>>>> bh
>>>>>>>>>>   disable variant was preventing the timers in your stack
>>>>>>>>>> trace from
>>>>>>>>>>   running interleaved with this function before?
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>  Could be, although dropping the lock shouldn't be able to
>>>>>>>>> affect the
>>>>>>>>>  search state. TBH, I'm a little lost in reading function, the
>>>>>>>>>  SO_REUSEPORT handling is pretty complicated. For instance,
>>>>>>>>>  rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in
>>>>>>>>> that
>>>>>>>>>  function and also in every call to inet_csk_bind_conflict. I
>>>>>>>>> wonder
>>>>>>>>> if
>>>>>>>>>  we can simply this under the assumption that SO_REUSEPORT is only
>>>>>>>>>  allowed if the port number (snum) is explicitly specified.
>>>>>>>> 
>>>>>>>> 
>>>>>>>>  Ok first I have data for you Hannes, here's the time distributions
>>>>>>>>  before during and after the lockup (with all the debugging in
>>>>>>>> place
>>>>>>>> the
>>>>>>>>  box eventually recovers).  I've attached it as a text file
>>>>>>>> since it is
>>>>>>>>  long.
>>>>>>> 
>>>>>>> 
>>>>>>> Thanks a lot!
>>>>>>> 
>>>>>>>>  Second is I was thinking about why we would spend so much time
>>>>>>>> doing
>>>>>>>> the
>>>>>>>>  ->owners list, and obviously it's because of the massive amount of
>>>>>>>>  timewait sockets on the owners list.  I wrote the following
>>>>>>>> dumb patch
>>>>>>>>  and tested it and the problem has disappeared completely.  Now
>>>>>>>> I don't
>>>>>>>>  know if this is right at all, but I thought it was weird we
>>>>>>>> weren't
>>>>>>>>  copying the soreuseport option from the original socket onto
>>>>>>>> the twsk.
>>>>>>>>  Is there are reason we aren't doing this currently?  Does this
>>>>>>>> help
>>>>>>>>  explain what is happening?  Thanks,
>>>>>>> 
>>>>>>> 
>>>>>>> The patch is interesting and a good clue, but I am immediately a bit
>>>>>>> concerned that we don't copy/tag the socket with the uid also to
>>>>>>> keep
>>>>>>> the security properties for SO_REUSEPORT. I have to think a bit more
>>>>>>> about this.
>>>>>>> 
>>>>>>> We have seen hangs during connect. I am afraid this patch
>>>>>>> wouldn't help
>>>>>>> there while also guaranteeing uniqueness.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Yeah so I looked at the code some more and actually my patch is
>>>>>> really
>>>>>> bad.  If sk2->sk_reuseport is set we'll look at
>>>>>> sk2->sk_reuseport_cb, which
>>>>>> is outside of the timewait sock, so that's definitely bad.
>>>>>> 
>>>>>> But we should at least be setting it to 0 so that we don't do this
>>>>>> normally.  Unfortunately simply setting it to 0 doesn't fix the
>>>>>> problem.  So
>>>>>> for some reason having ->sk_reuseport set to 1 on a timewait
>>>>>> socket makes
>>>>>> this problem non-existent, which is strange.
>>>>>> 
>>>>>> So back to the drawing board I guess.  I wonder if doing what craig
>>>>>> suggested and batching the timewait timer expires so it hurts less
>>>>>> would
>>>>>> accomplish the same results.  Thanks,
>>>>> 
>>>>> 
>>>>> Wait no I lied, we access the sk->sk_reuseport_cb, not sk2's.  This
>>>>> is the
>>>>> code
>>>>> 
>>>>>                        if ((!reuse || !sk2->sk_reuse ||
>>>>>                            sk2->sk_state == TCP_LISTEN) &&
>>>>>                            (!reuseport || !sk2->sk_reuseport ||
>>>>>                             rcu_access_pointer(sk->sk_reuseport_cb) ||
>>>>>                             (sk2->sk_state != TCP_TIME_WAIT &&
>>>>>                             !uid_eq(uid, sock_i_uid(sk2))))) {
>>>>> 
>>>>>                                if (!sk2->sk_rcv_saddr ||
>>>>> !sk->sk_rcv_saddr
>>>>> ||
>>>>>                                    sk2->sk_rcv_saddr ==
>>>>> sk->sk_rcv_saddr)
>>>>>                                        break;
>>>>>                        }
>>>>> 
>>>>> so in my patches case we now have reuseport == 1, sk2->sk_reuseport
>>>>> == 1.
>>>>> But now we are using reuseport, so sk->sk_reuseport_cb should be
>>>>> non-NULL
>>>>> right?  So really setting the timewait sock's sk_reuseport should
>>>>> have no
>>>>> bearing on how this loop plays out right?  Thanks,
>>>> 
>>>> 
>>>> 
>>>> So more messing around and I noticed that we basically don't do the
>>>> tb->fastreuseport logic at all if we've ended up with a non
>>>> SO_REUSEPORT
>>>> socket on that tb.  So before I fully understood what I was doing I
>>>> fixed it
>>>> so that after we go through ->bind_conflict() once with a SO_REUSEPORT
>>>> socket, we reset tb->fastreuseport to 1 and set the uid to match the
>>>> uid of
>>>> the socket.  This made the problem go away.  Tom pointed out that if
>>>> we bind
>>>> to the same port on a different address and we have a non SO_REUSEPORT
>>>> socket with the same address on this tb then we'd be screwed with my
>>>> code.
>>>> 
>>>> Which brings me to his proposed solution.  We need another hash
>>>> table that
>>>> is indexed based on the binding address.  Then each node corresponds
>>>> to one
>>>> address/port binding, with non-SO_REUSEPORT entries having only one
>>>> entry,
>>>> and normal SO_REUSEPORT entries having many.  This cleans up the
>>>> need to
>>>> search all the possible sockets on any given tb, we just go and look
>>>> at the
>>>> one we care about.  Does this make sense?  Thanks,
>>>> 
>>> Hi Josef,
>>> 
>>> Thinking about it some more the hash table won't work because of the
>>> rules of binding different addresses to the same port. What I think we
>>> can do is to change inet_bind_bucket to be structure that contains all
>>> the information used to detect conflicts (reuse*, if, address, uid,
>>> etc.) and a list of sockets that share that exact same information--
>>> for instance all socket in timewait state create through some listener
>>> socket should wind up on single bucket. When we do the bind_conflict
>>> function we only should have to walk this buckets, not the full list
>>> of sockets.
>>> 
>>> Thoughts on this?
>> 
>> This sounds good, maybe tb->owners be a list of say
>> 
>> struct inet_unique_shit {
>>    struct sock_common sk;
>>    struct hlist socks;
>> };
>> 
>> Then we make inet_unique_shit like twsks', just copy the relevant
>> information, then hang the real sockets off of the socks hlist. 
>> Something like that?  Thanks,
> 
> As a counter idea: can we push up a flag to the inet_bind_bucket that we
> check in the fast- way style that indicates that we have wildcarded
> non-reuse(port) in there, so we can skip the bind_bucket much more
> quickly? This wouldn't require a new data structure.

So take my current duct tape fix and augment it with more information in the bind bucket?  I'm not sure how to make this work without at least having a list of the bound addrs as well to make sure we are really ok.  I suppose we could save the fastreuseport address that last succeeded to make it work properly, but I'd have to make it protocol agnostic and then have a callback so the protocol can make sure we don't have to do the bind_conflict run.  Is that what you were thinking of?  Thanks,
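
Very roughly, something like this is what I'm describing (the field and
hook names are made up, just to show the shape):

	/* in struct inet_bind_bucket */
	kuid_t				fastuid;
	struct sockaddr_storage		fast_addr;	/* addr of the last successful
							 * fastreuseport bind */

	/* and a per-protocol hook alongside ->bind_conflict in icsk_af_ops,
	 * so v4/v6 can each check whether the saved address lets us skip
	 * the bind_conflict() walk */
	bool	(*fast_bind_ok)(const struct sock *sk,
				const struct inet_bind_bucket *tb);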

Josef

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Soft lockup in inet_put_port on 4.6
  2016-12-17 13:26                                           ` Josef Bacik
@ 2016-12-20  1:56                                             ` David Miller
  2016-12-20  2:07                                               ` Tom Herbert
  0 siblings, 1 reply; 32+ messages in thread
From: David Miller @ 2016-12-20  1:56 UTC (permalink / raw)
  To: jbacik; +Cc: hannes, tom, kraigatgoog, eric.dumazet, netdev

From: Josef Bacik <jbacik@fb.com>
Date: Sat, 17 Dec 2016 13:26:00 +0000

> So take my current duct tape fix and augment it with more
> information in the bind bucket?  I'm not sure how to make this work
> without at least having a list of the binded addrs as well to make
> sure we are really ok.  I suppose we could save the fastreuseport
> address that last succeeded to make it work properly, but I'd have
> to make it protocol agnostic and then have a callback to have the
> protocol to make sure we don't have to do the bind_conflict run.  Is
> that what you were thinking of?  Thanks,

So there isn't a deadlock or lockup here, something is just running
really slow, right?

And that "something" is a scan of the sockets on a tb list, and
there's lots of timewait sockets hung off of that tb.

As far as I can tell, this scan is happening in
inet_csk_bind_conflict().

Furthermore, reuseport is somehow required to make this problem
happen.  How exactly?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Soft lockup in inet_put_port on 4.6
  2016-12-20  1:56                                             ` David Miller
@ 2016-12-20  2:07                                               ` Tom Herbert
  2016-12-20  2:41                                                 ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Tom Herbert @ 2016-12-20  2:07 UTC (permalink / raw)
  To: David Miller
  Cc: Josef Bacik, Hannes Frederic Sowa, Craig Gallek, Eric Dumazet,
	Linux Kernel Network Developers

On Mon, Dec 19, 2016 at 5:56 PM, David Miller <davem@davemloft.net> wrote:
> From: Josef Bacik <jbacik@fb.com>
> Date: Sat, 17 Dec 2016 13:26:00 +0000
>
>> So take my current duct tape fix and augment it with more
>> information in the bind bucket?  I'm not sure how to make this work
>> without at least having a list of the binded addrs as well to make
>> sure we are really ok.  I suppose we could save the fastreuseport
>> address that last succeeded to make it work properly, but I'd have
>> to make it protocol agnostic and then have a callback to have the
>> protocol to make sure we don't have to do the bind_conflict run.  Is
>> that what you were thinking of?  Thanks,
>
> So there isn't a deadlock or lockup here, something is just running
> really slow, right?
>
Correct.

> And that "something" is a scan of the sockets on a tb list, and
> there's lots of timewait sockets hung off of that tb.
>
Yes.

> As far as I can tell, this scan is happening in
> inet_csk_bind_conflict().
>
Yes.

> Furthermore, reuseport is somehow required to make this problem
> happen.  How exactly?

When sockets created with SO_REUSEPORT move to TW state they are placed
back on the tb->owners. fastreuse port is no longer set, so we have to
walk a potentially long list of sockets in tb->owners to open a new
listener socket. I imagine this happens when we try to open a new
SO_REUSEPORT listener after the system has been running a while, and so
we hit the long tb->owners list.
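
For reference, the part that gets slow is essentially this walk in
inet_csk_bind_conflict() (trimmed, the full check is the one quoted
earlier in the thread):

	sk_for_each_bound(sk2, &tb->owners) {
		/* with reuseport listeners churning through connections this
		 * list is mostly timewait sockets, so any bind that can't take
		 * the tb->fastreuse* shortcut pays for a full walk while
		 * holding the bind hash bucket lock */
		...
	}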

Tom

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Soft lockup in inet_put_port on 4.6
  2016-12-20  2:07                                               ` Tom Herbert
@ 2016-12-20  2:41                                                 ` Eric Dumazet
  2016-12-20  3:40                                                   ` Josef Bacik
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2016-12-20  2:41 UTC (permalink / raw)
  To: Tom Herbert
  Cc: David Miller, Josef Bacik, Hannes Frederic Sowa, Craig Gallek,
	Linux Kernel Network Developers

On Mon, 2016-12-19 at 18:07 -0800, Tom Herbert wrote:

> When sockets created SO_REUSEPORT move to TW state they are placed
> back on the the tb->owners. fastreuse port is no longer set so we have
> to walk potential long list of sockets in tb->owners to open a new
> listener socket. I imagine this is happens when we try to open a new
> listener SO_REUSEPORT after the system has been running a while and so
> we hit the long tb->owners list.

Hmm...  __inet_twsk_hashdance() does not change tb->fastreuse

So where is tb->fastreuse cleared?

If all your sockets have SO_REUSEPORT set, this should not happen.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Soft lockup in inet_put_port on 4.6
  2016-12-20  2:41                                                 ` Eric Dumazet
@ 2016-12-20  3:40                                                   ` Josef Bacik
  2016-12-20  4:52                                                     ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Josef Bacik @ 2016-12-20  3:40 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tom Herbert, David Miller, Hannes Frederic Sowa, Craig Gallek,
	Linux Kernel Network Developers


> On Dec 19, 2016, at 9:42 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
>> On Mon, 2016-12-19 at 18:07 -0800, Tom Herbert wrote:
>> 
>> When sockets created SO_REUSEPORT move to TW state they are placed
>> back on the the tb->owners. fastreuse port is no longer set so we have
>> to walk potential long list of sockets in tb->owners to open a new
>> listener socket. I imagine this is happens when we try to open a new
>> listener SO_REUSEPORT after the system has been running a while and so
>> we hit the long tb->owners list.
> 
> Hmm...  __inet_twsk_hashdance() does not change tb->fastreuse
> 
> So where tb->fastreuse is cleared ?
> 
> If all your sockets have SO_REUSEPORT set, this should not happen.
> 

The app starts out with no SO_REUSEPORT, and then we restart it with that option enabled.  What I suspect is we have all the twsks from the original service, and the fastreuse stuff is cleared.  My naive patch resets it once we add a reuseport sk to the tb and that makes the problem go away.  I'm reworking all of this logic and adding some extra info to the tb to make the reset actually safe.  I'll send those patches out tomorrow. Thanks,

Josef

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Soft lockup in inet_put_port on 4.6
  2016-12-20  3:40                                                   ` Josef Bacik
@ 2016-12-20  4:52                                                     ` Eric Dumazet
  2016-12-20  4:59                                                       ` Josef Bacik
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2016-12-20  4:52 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Tom Herbert, David Miller, Hannes Frederic Sowa, Craig Gallek,
	Linux Kernel Network Developers

On Tue, 2016-12-20 at 03:40 +0000, Josef Bacik wrote:
> > On Dec 19, 2016, at 9:42 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > 
> >> On Mon, 2016-12-19 at 18:07 -0800, Tom Herbert wrote:
> >> 
> >> When sockets created SO_REUSEPORT move to TW state they are placed
> >> back on the the tb->owners. fastreuse port is no longer set so we have
> >> to walk potential long list of sockets in tb->owners to open a new
> >> listener socket. I imagine this is happens when we try to open a new
> >> listener SO_REUSEPORT after the system has been running a while and so
> >> we hit the long tb->owners list.
> > 
> > Hmm...  __inet_twsk_hashdance() does not change tb->fastreuse
> > 
> > So where tb->fastreuse is cleared ?
> > 
> > If all your sockets have SO_REUSEPORT set, this should not happen.
> > 
> 
> The app starts out with no SO_REUSEPORT, and then we restart it with
> that option enabled.

But... why would the application do this dance ?

I now better understand why we never had these issues...


>   What I suspect is we have all the twsks from the original service,
> and the fastreuse stuff is cleared.  My naive patch resets it once we
> add a reuseport sk to the tb and that makes the problem go away.  I'm
> reworking all of this logic and adding some extra info to the tb to
> make the reset actually safe.  I'll send those patches out tomorrow.
> Thanks,

Okay, we will review them ;)

Note that Willy Tarreau wants some mechanism to be able to freeze a
listener, to allow haproxy to be replaced without closing any sessions.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Soft lockup in inet_put_port on 4.6
  2016-12-20  4:52                                                     ` Eric Dumazet
@ 2016-12-20  4:59                                                       ` Josef Bacik
  0 siblings, 0 replies; 32+ messages in thread
From: Josef Bacik @ 2016-12-20  4:59 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tom Herbert, David Miller, Hannes Frederic Sowa, Craig Gallek,
	Linux Kernel Network Developers


> On Dec 19, 2016, at 11:52 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> On Tue, 2016-12-20 at 03:40 +0000, Josef Bacik wrote:
>>> On Dec 19, 2016, at 9:42 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>> 
>>>> On Mon, 2016-12-19 at 18:07 -0800, Tom Herbert wrote:
>>>> 
>>>> When sockets created SO_REUSEPORT move to TW state they are placed
>>>> back on the the tb->owners. fastreuse port is no longer set so we have
>>>> to walk potential long list of sockets in tb->owners to open a new
>>>> listener socket. I imagine this is happens when we try to open a new
>>>> listener SO_REUSEPORT after the system has been running a while and so
>>>> we hit the long tb->owners list.
>>> 
>>> Hmm...  __inet_twsk_hashdance() does not change tb->fastreuse
>>> 
>>> So where tb->fastreuse is cleared ?
>>> 
>>> If all your sockets have SO_REUSEPORT set, this should not happen.
>>> 
>> 
>> The app starts out with no SO_REUSEPORT, and then we restart it with
>> that option enabled.
> 
> But... why would the application do this dance ?
> 
> I now better understand why we never had these issues...
> 

It doesn't do it as part of its normal operation.  The old version didn't use SO_REUSEPORT, then somebody added support for it, restarted the service with the new option enabled, and boom.  They immediately stopped doing anything and gave it to me to figure out.

> 
>>  What I suspect is we have all the twsks from the original service,
>> and the fastreuse stuff is cleared.  My naive patch resets it once we
>> add a reuseport sk to the tb and that makes the problem go away.  I'm
>> reworking all of this logic and adding some extra info to the tb to
>> make the reset actually safe.  I'll send those patches out tomorrow.
>> Thanks,
> 
> Okay, we will review them ;)
> 
> Note that Willy Tarreau wants some mechanism to be able to freeze a
> listener, to allow haproxy to be replaced without closing any sessions.
> 

I assume that's what these guys would want as well.  They have some weird handoff thing they do when the app starts but I'm not entirely convinced it does what they think it does.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread

Thread overview: 32+ messages
2016-12-06 23:06 Soft lockup in inet_put_port on 4.6 Tom Herbert
2016-12-08 21:03 ` Hannes Frederic Sowa
2016-12-08 21:36   ` Josef Bacik
2016-12-09  0:30     ` Eric Dumazet
2016-12-09  1:01       ` Josef Bacik
2016-12-10  1:59         ` Josef Bacik
2016-12-10  3:47           ` Eric Dumazet
2016-12-10  4:14             ` Eric Dumazet
2016-12-12 18:05               ` Josef Bacik
2016-12-12 18:44                 ` Hannes Frederic Sowa
2016-12-12 21:23                   ` Josef Bacik
2016-12-12 22:24                   ` Josef Bacik
2016-12-13 20:51                     ` Tom Herbert
2016-12-13 23:03                       ` Craig Gallek
2016-12-13 23:32                         ` Tom Herbert
2016-12-15 18:53                           ` Josef Bacik
2016-12-15 22:39                             ` Tom Herbert
2016-12-15 23:25                               ` Craig Gallek
2016-12-16  0:07                             ` Hannes Frederic Sowa
2016-12-16 14:54                               ` Josef Bacik
2016-12-16 15:21                                 ` Josef Bacik
2016-12-16 22:08                                   ` Josef Bacik
2016-12-16 22:18                                     ` Tom Herbert
2016-12-16 22:50                                       ` Josef Bacik
2016-12-17 11:08                                         ` Hannes Frederic Sowa
2016-12-17 13:26                                           ` Josef Bacik
2016-12-20  1:56                                             ` David Miller
2016-12-20  2:07                                               ` Tom Herbert
2016-12-20  2:41                                                 ` Eric Dumazet
2016-12-20  3:40                                                   ` Josef Bacik
2016-12-20  4:52                                                     ` Eric Dumazet
2016-12-20  4:59                                                       ` Josef Bacik
