* Soft lockup in inet_put_port on 4.6
@ 2016-12-06 23:06 Tom Herbert
  2016-12-08 21:03 ` Hannes Frederic Sowa
  0 siblings, 1 reply; 32+ messages in thread

From: Tom Herbert @ 2016-12-06 23:06 UTC (permalink / raw)
  To: Linux Kernel Network Developers, Josef Bacik

Hello,

We are seeing a fair number of machines getting into a soft lockup on the
4.6 kernel. As near as I can tell this is happening on the spinlock in the
bind hash bucket. When inet_csk_get_port exits and does spin_unlock_bh,
the TCP timer runs and we hit the lockup in inet_put_port (presumably on
the same lock). It seems like the lock isn't properly being unlocked
somewhere, but I don't readily see it.

Any ideas?

Thanks,
Tom

NMI watchdog: BUG: soft lockup - CPU#22 stuck for 22s! [proxygend:4152094]
Modules linked in: fuse nf_log_ipv6 ip6t_REJECT nf_reject_ipv6 nf_log_ipv4 nf_log_common xt_LOG ipt_REJECT nf_reject_ipv4 xt_limit xt_multiport ipip ip_tunnel tunnel4 ip6_tunnel tunnel6 coretemp mptctl mptbase cls_bpf ipmi_watchdog tcp_diag inet_diag ip6table_filter xt_NFLOG nfnetlink_log xt_comment xt_statistic iptable_filter xt_mark tpm_crb i2c_piix4 dm_crypt loop ipmi_devintf acpi_cpufreq iTCO_wdt iTCO_vendor_support ipmi_si ipmi_msghandler efivars i2c_i801 sg lpc_ich mfd_core hpilo xhci_pci xhci_hcd button nvme nvme_core
CPU: 22 PID: 4152094 Comm: proxygend Tainted: G W L 4.6.7-13_fbk3_1119_g367d67b #13
Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 12/27/2015
task: ffff88168c52d100 ti: ffff881c12fb0000 task.ti: ffff881c12fb0000
RIP: 0010:[<ffffffff810b87b8>]  [<ffffffff810b87b8>] queued_spin_lock_slowpath+0xf8/0x170
RSP: 0018:ffff883fff303da0  EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffff881257163e00 RCX: 0000000000000001
RDX: ffff883fff375e40 RSI: 00000000005c0000 RDI: ffffc90018d6bae0
RBP: ffff883fff303da0 R08: ffff883fff315e40 R09: 0000000000000000
R10: 0000000000000020 R11: 00000000000001c0 R12: ffffc90018d6bae0
R13: ffffffff820f8a80 R14: ffff881257163f30 R15: 0000000000000000
FS:  00007fa7bb7ff700(0000) GS:ffff883fff300000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ff4be114d90 CR3: 000000243f99c000 CR4: 00000000003406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Stack:
 ffff883fff303db0 ffffffff817e5910 ffff883fff303dd8 ffffffff8172f6b4
 ffff881257163e00 0000000000000007 0000000000000004 ffff883fff303e00
 ffffffff81733237 ffff881257163e00 0000000000000000 ffffffff81ce7cc0
Call Trace:
 <IRQ>
 [<ffffffff817e5910>] _raw_spin_lock+0x20/0x30
 [<ffffffff8172f6b4>] inet_put_port+0x54/0xb0
 [<ffffffff81733237>] tcp_set_state+0x67/0xc0
 [<ffffffff81733a33>] tcp_done+0x33/0x90
 [<ffffffff81746431>] tcp_write_err+0x31/0x50
 [<ffffffff81746bc9>] tcp_retransmit_timer+0x119/0x7d0
 [<ffffffff81747460>] ? tcp_write_timer_handler+0x1e0/0x1e0
 [<ffffffff8174730e>] tcp_write_timer_handler+0x8e/0x1e0
 [<ffffffff817474c7>] tcp_write_timer+0x67/0x70
 [<ffffffff810ccc35>] call_timer_fn+0x35/0x120
 [<ffffffff81747460>] ? tcp_write_timer_handler+0x1e0/0x1e0
 [<ffffffff810cd01c>] run_timer_softirq+0x1fc/0x2b0
 [<ffffffff817e811c>] __do_softirq+0xcc/0x26c
 [<ffffffff817e753c>] do_softirq_own_stack+0x1c/0x30
 <EOI>
 [<ffffffff8107b481>] do_softirq+0x31/0x40
 [<ffffffff8107b508>] __local_bh_enable_ip+0x78/0x80
 [<ffffffff817e572a>] _raw_spin_unlock_bh+0x1a/0x20
 [<ffffffff81730a61>] inet_csk_get_port+0x1c1/0x5a0
 [<ffffffff816c7637>] ? sock_poll+0x47/0xb0
 [<ffffffff817313f5>] inet_csk_listen_start+0x65/0xc0
 [<ffffffff8175ea8c>] inet_listen+0x9c/0xe0
 [<ffffffff816c8560>] SyS_listen+0x80/0x90
 [<ffffffff817e5adb>] entry_SYSCALL_64_fastpath+0x13/0x8f
Code: c1 ea 0c 83 e8 01 83 e2 30 48 98 48 81 c2 40 5e 01 00 48 03 14 c5 c0 d4 d1 81 4c 89 02 41 8b 40 08 85 c0 75 0a f3 90 41 8b 40 08 <85> c0 74 f6 4d 8b 08 4d 85 c9 74 08 41 0f 0d 09 eb 02 f3 90 8b

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Soft lockup in inet_put_port on 4.6 2016-12-06 23:06 Soft lockup in inet_put_port on 4.6 Tom Herbert @ 2016-12-08 21:03 ` Hannes Frederic Sowa 2016-12-08 21:36 ` Josef Bacik 0 siblings, 1 reply; 32+ messages in thread From: Hannes Frederic Sowa @ 2016-12-08 21:03 UTC (permalink / raw) To: Tom Herbert, Linux Kernel Network Developers, Josef Bacik Hello Tom, On Wed, Dec 7, 2016, at 00:06, Tom Herbert wrote: > We are seeing a fair number of machines getting into softlockup in 4.6 > kernel. As near as I can tell this is happening on the spinlock in > bind hash bucket. When inet_csk_get_port exits and does spinunlock_bh > the TCP timer runs and we hit lockup in inet_put_port (presumably on > same lock). It seems like the locked isn't properly be unlocked > somewhere but I don't readily see it. > > Any ideas? Likewise we received reports that pretty much look the same on our heavily patched kernel. Did you have a chance to investigate or reproduce the problem? I am wondering if you would be able to take a complete thread stack dump if you can reproduce this to check if one of the user space processes is looping inside finding a free port? Thanks, Hannes ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Soft lockup in inet_put_port on 4.6 2016-12-08 21:03 ` Hannes Frederic Sowa @ 2016-12-08 21:36 ` Josef Bacik 2016-12-09 0:30 ` Eric Dumazet 0 siblings, 1 reply; 32+ messages in thread From: Josef Bacik @ 2016-12-08 21:36 UTC (permalink / raw) To: Hannes Frederic Sowa; +Cc: Tom Herbert, Linux Kernel Network Developers On Thu, Dec 8, 2016 at 4:03 PM, Hannes Frederic Sowa <hannes@stressinduktion.org> wrote: > Hello Tom, > > On Wed, Dec 7, 2016, at 00:06, Tom Herbert wrote: >> We are seeing a fair number of machines getting into softlockup in >> 4.6 >> kernel. As near as I can tell this is happening on the spinlock in >> bind hash bucket. When inet_csk_get_port exits and does >> spinunlock_bh >> the TCP timer runs and we hit lockup in inet_put_port (presumably on >> same lock). It seems like the locked isn't properly be unlocked >> somewhere but I don't readily see it. >> >> Any ideas? > > Likewise we received reports that pretty much look the same on our > heavily patched kernel. Did you have a chance to investigate or > reproduce the problem? > > I am wondering if you would be able to take a complete thread stack > dump > if you can reproduce this to check if one of the user space processes > is > looping inside finding a free port? We can reproduce the problem at will, still trying to run down the problem. I'll try and find one of the boxes that dumped a core and get a bt of everybody. Thanks, Josef ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Soft lockup in inet_put_port on 4.6 2016-12-08 21:36 ` Josef Bacik @ 2016-12-09 0:30 ` Eric Dumazet 2016-12-09 1:01 ` Josef Bacik 0 siblings, 1 reply; 32+ messages in thread From: Eric Dumazet @ 2016-12-09 0:30 UTC (permalink / raw) To: Josef Bacik Cc: Hannes Frederic Sowa, Tom Herbert, Linux Kernel Network Developers On Thu, 2016-12-08 at 16:36 -0500, Josef Bacik wrote: > We can reproduce the problem at will, still trying to run down the > problem. I'll try and find one of the boxes that dumped a core and get > a bt of everybody. Thanks, OK, sounds good. I had a look and : - could not spot a fix that came after 4.6. - could not spot an obvious bug. Anything special in the program triggering the issue ? SO_REUSEPORT and/or special socket options ? Thanks. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Soft lockup in inet_put_port on 4.6 2016-12-09 0:30 ` Eric Dumazet @ 2016-12-09 1:01 ` Josef Bacik 2016-12-10 1:59 ` Josef Bacik 0 siblings, 1 reply; 32+ messages in thread From: Josef Bacik @ 2016-12-09 1:01 UTC (permalink / raw) To: Eric Dumazet Cc: Hannes Frederic Sowa, Tom Herbert, Linux Kernel Network Developers > On Dec 8, 2016, at 7:32 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > >> On Thu, 2016-12-08 at 16:36 -0500, Josef Bacik wrote: >> >> We can reproduce the problem at will, still trying to run down the >> problem. I'll try and find one of the boxes that dumped a core and get >> a bt of everybody. Thanks, > > OK, sounds good. > > I had a look and : > - could not spot a fix that came after 4.6. > - could not spot an obvious bug. > > Anything special in the program triggering the issue ? > SO_REUSEPORT and/or special socket options ? > So they recently started using SO_REUSEPORT, that's what triggered it, if they don't use it then everything is fine. I added some instrumentation for get_port to see if it was looping in there and none of my printk's triggered. The softlockup messages are always on the inet_bind_bucket lock, sometimes in the process context in get_port or in the softirq context either through inet_put_port or inet_kill_twsk. On the box that I have a coredump for there's only one processor in the inet code so I'm not sure what to make of that. That was a box from last week so I'll look at a more recent core and see if it's different. Thanks, Josef ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Soft lockup in inet_put_port on 4.6
  2016-12-09  1:01 ` Josef Bacik
@ 2016-12-10  1:59 ` Josef Bacik
  2016-12-10  3:47 ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread

From: Josef Bacik @ 2016-12-10 1:59 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Hannes Frederic Sowa, Tom Herbert, Linux Kernel Network Developers

On Thu, Dec 8, 2016 at 8:01 PM, Josef Bacik <jbacik@fb.com> wrote:
>
>> On Dec 8, 2016, at 7:32 PM, Eric Dumazet <eric.dumazet@gmail.com>
>> wrote:
>>
>>> On Thu, 2016-12-08 at 16:36 -0500, Josef Bacik wrote:
>>>
>>> We can reproduce the problem at will, still trying to run down the
>>> problem.  I'll try and find one of the boxes that dumped a core
>>> and get a bt of everybody.  Thanks,
>>
>> OK, sounds good.
>>
>> I had a look and :
>> - could not spot a fix that came after 4.6.
>> - could not spot an obvious bug.
>>
>> Anything special in the program triggering the issue ?
>> SO_REUSEPORT and/or special socket options ?
>>
> So they recently started using SO_REUSEPORT; that's what triggered
> it. If they don't use it then everything is fine.
>
> I added some instrumentation for get_port to see if it was looping in
> there and none of my printks triggered.  The softlockup messages are
> always on the inet_bind_bucket lock, sometimes in process context in
> get_port or in softirq context, either through inet_put_port or
> inet_kill_twsk.  On the box that I have a coredump for there's only
> one processor in the inet code, so I'm not sure what to make of that.
> That was a box from last week, so I'll look at a more recent core and
> see if it's different.  Thanks,

Ok, more investigation today; a few bullet points:

- With all the debugging turned on, the boxes seem to recover after
  about a minute.  I'd get the spam of soft lockup messages, all on the
  inet_bind_bucket, and then the box would be fine.
- I looked at a core I had from before I started investigating things,
  and of all the 48 CPUs there's only one process trying to get the
  inet_bind_bucket.
- I noticed that there were over 100k twsk's in that original core.
- I put a global counter of the twsk's (since most of the softlockup
  messages have the twsk timers in the stack) and noticed that with the
  debugging kernel it started around 16k twsk's, and once it recovered
  it was down to less than a thousand.  There's a jump where it goes
  from 8k to 2k, and then there's only one more softlockup message and
  the box is fine.
- This happens when we restart the service with the config option to
  start using SO_REUSEPORT.

The application is our load balancing app, so it obviously has lots of
connections open at any given time.  What I'm wondering, and will test
on Monday, is whether the SO_REUSEPORT change even matters, or if simply
restarting the service is what triggers the problem.  One thing I forgot
to mention is that it's also using TCP_FASTOPEN in both the
non-reuseport and reuseport variants.

What I suspect is happening is that the service stops, all of the
sockets it had open go into TIME_WAIT with relatively the same timer
period, and then they all suddenly wake up at the same time, which,
coupled with the massive amount of traffic that we see per box anyway,
results in so much contention and ksoftirqd usage that the box livelocks
for a while.  With the lock debugging and such turned on we aren't able
to service as much traffic, so it recovers relatively quickly, whereas a
normal production kernel never recovers.

Please keep in mind that I'm a file system developer, so my conclusions
may be completely insane; any guidance would be welcome.  I'll continue
hammering on this on Monday.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Soft lockup in inet_put_port on 4.6 2016-12-10 1:59 ` Josef Bacik @ 2016-12-10 3:47 ` Eric Dumazet 2016-12-10 4:14 ` Eric Dumazet 0 siblings, 1 reply; 32+ messages in thread From: Eric Dumazet @ 2016-12-10 3:47 UTC (permalink / raw) To: Josef Bacik Cc: Hannes Frederic Sowa, Tom Herbert, Linux Kernel Network Developers On Fri, 2016-12-09 at 20:59 -0500, Josef Bacik wrote: > On Thu, Dec 8, 2016 at 8:01 PM, Josef Bacik <jbacik@fb.com> wrote: > > > >> On Dec 8, 2016, at 7:32 PM, Eric Dumazet <eric.dumazet@gmail.com> > >> wrote: > >> > >>> On Thu, 2016-12-08 at 16:36 -0500, Josef Bacik wrote: > >>> > >>> We can reproduce the problem at will, still trying to run down the > >>> problem. I'll try and find one of the boxes that dumped a core > >>> and get > >>> a bt of everybody. Thanks, > >> > >> OK, sounds good. > >> > >> I had a look and : > >> - could not spot a fix that came after 4.6. > >> - could not spot an obvious bug. > >> > >> Anything special in the program triggering the issue ? > >> SO_REUSEPORT and/or special socket options ? > >> > > > > So they recently started using SO_REUSEPORT, that's what triggered > > it, if they don't use it then everything is fine. > > > > I added some instrumentation for get_port to see if it was looping in > > there and none of my printk's triggered. The softlockup messages are > > always on the inet_bind_bucket lock, sometimes in the process context > > in get_port or in the softirq context either through inet_put_port or > > inet_kill_twsk. On the box that I have a coredump for there's only > > one processor in the inet code so I'm not sure what to make of that. > > That was a box from last week so I'll look at a more recent core and > > see if it's different. Thanks, > > Ok more investigation today, a few bullet points > > - With all the debugging turned on the boxes seem to recover after > about a minute. I'd get the spam of the soft lockup messages all on > the inet_bind_bucket, and then the box would be fine. 
> - I looked at a core I had from before I started investigating things > and there's only one process trying to get the inet_bind_bucket of all > the 48 cpus. > - I noticed that there was over 100k twsk's in that original core. > - I put a global counter of the twsk's (since most of the softlockup > messages have the twsk timers in the stack) and noticed with the > debugging kernel it started around 16k twsk's and once it recovered it > was down to less than a thousand. There's a jump where it goes from 8k > to 2k and then there's only one more softlockup message and the box is > fine. > - This happens when we restart the service with the config option to > start using SO_REUSEPORT. > > The application is our load balancing app, so obviously has lots of > connections opened at any given time. What I'm wondering and will test > on Monday is if the SO_REUSEPORT change even matters, or if simply > restarting the service is what triggers the problem. One thing I > forgot to mention is that it's also using TCP_FASTOPEN in both the > non-reuseport and reuseport variants. > > What I suspect is happening is the service stops, all of the sockets it > had open go into TIMEWAIT with relatively the same timer period, and > then suddenly all wake up at the same time which coupled with the > massive amount of traffic that we see per box anyway results in so much > contention and ksoftirqd usage that the box livelocks for a while. > With the lock debugging and stuff turned on we aren't able to service > as much traffic so it recovers relatively quickly, whereas a normal > production kernel never recovers. > > Please keep in mind that I"m a file system developer so my conclusions > may be completely insane, any guidance would be welcome. I'll continue > hammering on this on Monday. Thanks, Hmm... Is your ephemeral port range includes the port your load balancing app is using ? ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Soft lockup in inet_put_port on 4.6
  2016-12-10  3:47 ` Eric Dumazet
@ 2016-12-10  4:14 ` Eric Dumazet
  2016-12-12 18:05 ` Josef Bacik
  0 siblings, 1 reply; 32+ messages in thread

From: Eric Dumazet @ 2016-12-10 4:14 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Hannes Frederic Sowa, Tom Herbert, Linux Kernel Network Developers

On Fri, 2016-12-09 at 19:47 -0800, Eric Dumazet wrote:
>
> Hmm... Is your ephemeral port range includes the port your load
> balancing app is using ?

I suspect that you might have processes doing bind(port = 0) that are
trapped into the bind_conflict() scan ?

With 100,000+ timewaits there, this possibly hurts.

Can you try the following loop breaker ?

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index d5d3ead0a6c31e42e8843d30f8c643324a91b8e9..74f0f5ee6a02c624edb0263b9ddd27813f68d0a5 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -51,7 +51,7 @@ int inet_csk_bind_conflict(const struct sock *sk,
 	int reuse = sk->sk_reuse;
 	int reuseport = sk->sk_reuseport;
 	kuid_t uid = sock_i_uid((struct sock *)sk);
-
+	unsigned int max_count;
 	/*
 	 * Unlike other sk lookup places we do not check
 	 * for sk_net here, since _all_ the socks listed
@@ -59,6 +59,7 @@ int inet_csk_bind_conflict(const struct sock *sk,
 	 * one this bucket belongs to.
 	 */
 
+	max_count = relax ? ~0U : 100;
 	sk_for_each_bound(sk2, &tb->owners) {
 		if (sk != sk2 &&
 		    !inet_v6_ipv6only(sk2) &&
@@ -84,6 +85,8 @@ int inet_csk_bind_conflict(const struct sock *sk,
 				break;
 			}
 		}
+		if (--max_count == 0)
+			return 1;
 	}
 	return sk2 != NULL;
 }
diff --git a/net/ipv6/inet6_connection_sock.c b/net/ipv6/inet6_connection_sock.c
index 1c86c478f578b49373e61a4c397f23f3dc7f3fc6..4f63d06e0d601da94eb3f2b35a988abd060e156c 100644
--- a/net/ipv6/inet6_connection_sock.c
+++ b/net/ipv6/inet6_connection_sock.c
@@ -35,12 +35,14 @@ int inet6_csk_bind_conflict(const struct sock *sk,
 	int reuse = sk->sk_reuse;
 	int reuseport = sk->sk_reuseport;
 	kuid_t uid = sock_i_uid((struct sock *)sk);
+	unsigned int max_count;
 
 	/* We must walk the whole port owner list in this case. -DaveM */
 	/*
 	 * See comment in inet_csk_bind_conflict about sock lookup
 	 * vs net namespaces issues.
 	 */
+	max_count = relax ? ~0U : 100;
 	sk_for_each_bound(sk2, &tb->owners) {
 		if (sk != sk2 &&
 		    (!sk->sk_bound_dev_if ||
@@ -61,6 +63,8 @@ int inet6_csk_bind_conflict(const struct sock *sk,
 		    ipv6_rcv_saddr_equal(sk, sk2, true))
 			break;
 		}
+		if (--max_count == 0)
+			return 1;
 	}
 	return sk2 != NULL;

^ permalink raw reply related	[flat|nested] 32+ messages in thread
* Re: Soft lockup in inet_put_port on 4.6 2016-12-10 4:14 ` Eric Dumazet @ 2016-12-12 18:05 ` Josef Bacik 2016-12-12 18:44 ` Hannes Frederic Sowa 0 siblings, 1 reply; 32+ messages in thread From: Josef Bacik @ 2016-12-12 18:05 UTC (permalink / raw) To: Eric Dumazet Cc: Hannes Frederic Sowa, Tom Herbert, Linux Kernel Network Developers On Fri, Dec 9, 2016 at 11:14 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > On Fri, 2016-12-09 at 19:47 -0800, Eric Dumazet wrote: > >> >> Hmm... Is your ephemeral port range includes the port your load >> balancing app is using ? > > I suspect that you might have processes doing bind( port = 0) that are > trapped into the bind_conflict() scan ? > > With 100,000 + timewaits there, this possibly hurts. > > Can you try the following loop breaker ? It doesn't appear that the app is doing bind(port = 0) during normal operation. I tested this patch and it made no difference. I'm going to test simply restarting the app without changing to the SO_REUSEPORT option. Thanks, Josef ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Soft lockup in inet_put_port on 4.6 2016-12-12 18:05 ` Josef Bacik @ 2016-12-12 18:44 ` Hannes Frederic Sowa 2016-12-12 21:23 ` Josef Bacik 2016-12-12 22:24 ` Josef Bacik 0 siblings, 2 replies; 32+ messages in thread From: Hannes Frederic Sowa @ 2016-12-12 18:44 UTC (permalink / raw) To: Josef Bacik, Eric Dumazet; +Cc: Tom Herbert, Linux Kernel Network Developers On 12.12.2016 19:05, Josef Bacik wrote: > On Fri, Dec 9, 2016 at 11:14 PM, Eric Dumazet <eric.dumazet@gmail.com> > wrote: >> On Fri, 2016-12-09 at 19:47 -0800, Eric Dumazet wrote: >> >>> >>> Hmm... Is your ephemeral port range includes the port your load >>> balancing app is using ? >> >> I suspect that you might have processes doing bind( port = 0) that are >> trapped into the bind_conflict() scan ? >> >> With 100,000 + timewaits there, this possibly hurts. >> >> Can you try the following loop breaker ? > > It doesn't appear that the app is doing bind(port = 0) during normal > operation. I tested this patch and it made no difference. I'm going to > test simply restarting the app without changing to the SO_REUSEPORT > option. Thanks, Would it be possible to trace the time the function uses with trace? If we don't see the number growing considerably over time we probably can rule out that we loop somewhere in there (I would instrument inet_csk_bind_conflict, __inet_hash_connect and inet_csk_get_port). __inet_hash_connect -> __inet_check_established also takes a lock (inet_ehash_lockp) which can be locked from inet_diag code path during socket diag info dumping. Unfortunately we couldn't reproduce it so far. :/ Thanks, Hannes ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Soft lockup in inet_put_port on 4.6 2016-12-12 18:44 ` Hannes Frederic Sowa @ 2016-12-12 21:23 ` Josef Bacik 2016-12-12 22:24 ` Josef Bacik 1 sibling, 0 replies; 32+ messages in thread From: Josef Bacik @ 2016-12-12 21:23 UTC (permalink / raw) To: Hannes Frederic Sowa Cc: Eric Dumazet, Tom Herbert, Linux Kernel Network Developers On Mon, Dec 12, 2016 at 1:44 PM, Hannes Frederic Sowa <hannes@stressinduktion.org> wrote: > On 12.12.2016 19:05, Josef Bacik wrote: >> On Fri, Dec 9, 2016 at 11:14 PM, Eric Dumazet >> <eric.dumazet@gmail.com> >> wrote: >>> On Fri, 2016-12-09 at 19:47 -0800, Eric Dumazet wrote: >>> >>>> >>>> Hmm... Is your ephemeral port range includes the port your load >>>> balancing app is using ? >>> >>> I suspect that you might have processes doing bind( port = 0) that >>> are >>> trapped into the bind_conflict() scan ? >>> >>> With 100,000 + timewaits there, this possibly hurts. >>> >>> Can you try the following loop breaker ? >> >> It doesn't appear that the app is doing bind(port = 0) during normal >> operation. I tested this patch and it made no difference. I'm >> going to >> test simply restarting the app without changing to the SO_REUSEPORT >> option. Thanks, > > Would it be possible to trace the time the function uses with trace? > If > we don't see the number growing considerably over time we probably can > rule out that we loop somewhere in there (I would instrument > inet_csk_bind_conflict, __inet_hash_connect and inet_csk_get_port). > > __inet_hash_connect -> __inet_check_established also takes a lock > (inet_ehash_lockp) which can be locked from inet_diag code path during > socket diag info dumping. > > Unfortunately we couldn't reproduce it so far. :/ Working on getting the timing info, will probably be tomorrow due to meetings. I did test simply restarting the app without changing to the config that enabled the use of SO_REUSEPORT and the problem didn't occur, so it definitely has something to do with SO_REUSEPORT. 
Thanks, Josef ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Soft lockup in inet_put_port on 4.6 2016-12-12 18:44 ` Hannes Frederic Sowa 2016-12-12 21:23 ` Josef Bacik @ 2016-12-12 22:24 ` Josef Bacik 2016-12-13 20:51 ` Tom Herbert 1 sibling, 1 reply; 32+ messages in thread From: Josef Bacik @ 2016-12-12 22:24 UTC (permalink / raw) To: Hannes Frederic Sowa Cc: Eric Dumazet, Tom Herbert, Linux Kernel Network Developers, Josef Bacik On Mon, Dec 12, 2016 at 1:44 PM, Hannes Frederic Sowa <hannes@stressinduktion.org> wrote: > On 12.12.2016 19:05, Josef Bacik wrote: >> On Fri, Dec 9, 2016 at 11:14 PM, Eric Dumazet >> <eric.dumazet@gmail.com> >> wrote: >>> On Fri, 2016-12-09 at 19:47 -0800, Eric Dumazet wrote: >>> >>>> >>>> Hmm... Is your ephemeral port range includes the port your load >>>> balancing app is using ? >>> >>> I suspect that you might have processes doing bind( port = 0) that >>> are >>> trapped into the bind_conflict() scan ? >>> >>> With 100,000 + timewaits there, this possibly hurts. >>> >>> Can you try the following loop breaker ? >> >> It doesn't appear that the app is doing bind(port = 0) during normal >> operation. I tested this patch and it made no difference. I'm >> going to >> test simply restarting the app without changing to the SO_REUSEPORT >> option. Thanks, > > Would it be possible to trace the time the function uses with trace? > If > we don't see the number growing considerably over time we probably can > rule out that we loop somewhere in there (I would instrument > inet_csk_bind_conflict, __inet_hash_connect and inet_csk_get_port). > > __inet_hash_connect -> __inet_check_established also takes a lock > (inet_ehash_lockp) which can be locked from inet_diag code path during > socket diag info dumping. > > Unfortunately we couldn't reproduce it so far. :/ So I had a bcc script running to time how long we spent in inet_csk_bind_conflict, __inet_hash_connect and inet_csk_get_port, but of course I'm an idiot and didn't actually separate out the stats so I could tell _which_ one was taking forever. 
But anyway here's a normal distribution on the box

Some shit : count     distribution
         0 -> 1          : 0     |                                        |
         2 -> 3          : 0     |                                        |
         4 -> 7          : 0     |                                        |
         8 -> 15         : 0     |                                        |
        16 -> 31         : 0     |                                        |
        32 -> 63         : 0     |                                        |
        64 -> 127        : 0     |                                        |
       128 -> 255        : 0     |                                        |
       256 -> 511        : 0     |                                        |
       512 -> 1023       : 0     |                                        |
      1024 -> 2047       : 74    |                                        |
      2048 -> 4095       : 10537 |****************************************|
      4096 -> 8191       : 8497  |********************************        |
      8192 -> 16383      : 3745  |**************                          |
     16384 -> 32767      : 300   |*                                       |
     32768 -> 65535      : 250   |                                        |
     65536 -> 131071     : 180   |                                        |
    131072 -> 262143     : 71    |                                        |
    262144 -> 524287     : 18    |                                        |
    524288 -> 1048575    : 5     |                                        |

With the times in nanoseconds, and here's the distribution during the
problem

Some shit : count     distribution
         0 -> 1          : 0     |                                        |
         2 -> 3          : 0     |                                        |
         4 -> 7          : 0     |                                        |
         8 -> 15         : 0     |                                        |
        16 -> 31         : 0     |                                        |
        32 -> 63         : 0     |                                        |
        64 -> 127        : 0     |                                        |
       128 -> 255        : 0     |                                        |
       256 -> 511        : 0     |                                        |
       512 -> 1023       : 0     |                                        |
      1024 -> 2047       : 21    |                                        |
      2048 -> 4095       : 21820 |****************************************|
      4096 -> 8191       : 11598 |*********************                   |
      8192 -> 16383      : 4337  |*******                                 |
     16384 -> 32767      : 290   |                                        |
     32768 -> 65535      : 59    |                                        |
     65536 -> 131071     : 23    |                                        |
    131072 -> 262143     : 12    |                                        |
    262144 -> 524287     : 6     |                                        |
    524288 -> 1048575    : 19    |                                        |
   1048576 -> 2097151    : 1079  |*                                       |
   2097152 -> 4194303    : 0     |                                        |
   4194304 -> 8388607    : 1     |                                        |
   8388608 -> 16777215   : 0     |                                        |
  16777216 -> 33554431   : 0     |                                        |
  33554432 -> 67108863   : 1192  |**                                      |

Some shit : count     distribution
         0 -> 1          : 0     |                                        |
         2 -> 3          : 0     |                                        |
         4 -> 7          : 0     |                                        |
         8 -> 15         : 0     |                                        |
        16 -> 31         : 0     |                                        |
        32 -> 63         : 0     |                                        |
        64 -> 127        : 0     |                                        |
       128 -> 255        : 0     |                                        |
       256 -> 511        : 0     |                                        |
       512 -> 1023       : 0     |                                        |
      1024 -> 2047       : 48    |                                        |
      2048 -> 4095       : 14714 |********************                    |
      4096 -> 8191       : 6769  |*********                               |
      8192 -> 16383      : 2234  |***                                     |
     16384 -> 32767      : 422   |                                        |
     32768 -> 65535      : 208   |                                        |
     65536 -> 131071     : 61    |                                        |
    131072 -> 262143     : 10    |                                        |
    262144 -> 524287     : 416   |                                        |
    524288 -> 1048575    : 826   |*                                       |
   1048576 -> 2097151    : 598   |                                        |
   2097152 -> 4194303    : 10    |                                        |
   4194304 -> 8388607    : 0     |                                        |
   8388608 -> 16777215   : 1     |                                        |
  16777216 -> 33554431   : 289   |                                        |
  33554432 -> 67108863   : 921   |*                                       |
  67108864 -> 134217727  : 74    |                                        |
 134217728 -> 268435455  : 75    |                                        |
 268435456 -> 536870911  : 48    |                                        |
 536870912 -> 1073741823 : 25    |                                        |
1073741824 -> 2147483647 : 3     |                                        |
2147483648 -> 4294967295 : 2     |                                        |
4294967296 -> 8589934591 : 1     |                                        |

As you can see, we start getting tail latencies of up to 4-8 seconds.
Tomorrow I'll separate out the stats so we can know which function is
the problem child.  Sorry about not doing that first.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Soft lockup in inet_put_port on 4.6
  2016-12-12 22:24 ` Josef Bacik
@ 2016-12-13 20:51 ` Tom Herbert
  2016-12-13 23:03 ` Craig Gallek
  0 siblings, 1 reply; 32+ messages in thread

From: Tom Herbert @ 2016-12-13 20:51 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Hannes Frederic Sowa, Eric Dumazet, Linux Kernel Network Developers

I think there may be some suspicious code in inet_csk_get_port. At
tb_found there is:

		if (((tb->fastreuse > 0 && reuse) ||
		     (tb->fastreuseport > 0 &&
		      !rcu_access_pointer(sk->sk_reuseport_cb) &&
		      sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&
		    smallest_size == -1)
			goto success;
		if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, true)) {
			if ((reuse ||
			     (tb->fastreuseport > 0 &&
			      sk->sk_reuseport &&
			      !rcu_access_pointer(sk->sk_reuseport_cb) &&
			      uid_eq(tb->fastuid, uid))) &&
			    smallest_size != -1 && --attempts >= 0) {
				spin_unlock_bh(&head->lock);
				goto again;
			}
			goto fail_unlock;
		}

AFAICT there is redundancy in these two conditionals. The same clause is
being checked in both:

	(tb->fastreuseport > 0 &&
	 !rcu_access_pointer(sk->sk_reuseport_cb) &&
	 sk->sk_reuseport && uid_eq(tb->fastuid, uid)) && smallest_size == -1

If this is true, the first conditional should be hit, we goto success,
and the second will never evaluate that part to true, unless the sk is
changed (do we need READ_ONCE for sk->sk_reuseport_cb?).

Another potential issue is that the goto again goes back to doing the
port scan, but if snum had been set originally that doesn't seem like
what we want.

Thanks,
Tom

On Mon, Dec 12, 2016 at 2:24 PM, Josef Bacik <jbacik@fb.com> wrote:
>
> On Mon, Dec 12, 2016 at 1:44 PM, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
>>
>> On 12.12.2016 19:05, Josef Bacik wrote:
>>>
>>> On Fri, Dec 9, 2016 at 11:14 PM, Eric Dumazet <eric.dumazet@gmail.com>
>>> wrote:
>>>>
>>>> On Fri, 2016-12-09 at 19:47 -0800, Eric Dumazet wrote:
>>>>
>>>>>
>>>>> Hmm... Is your ephemeral port range includes the port your load
>>>>> balancing app is using ?
>>>> >>>> >>>> I suspect that you might have processes doing bind( port = 0) that are >>>> trapped into the bind_conflict() scan ? >>>> >>>> With 100,000 + timewaits there, this possibly hurts. >>>> >>>> Can you try the following loop breaker ? >>> >>> >>> It doesn't appear that the app is doing bind(port = 0) during normal >>> operation. I tested this patch and it made no difference. I'm going to >>> test simply restarting the app without changing to the SO_REUSEPORT >>> option. Thanks, >> >> >> Would it be possible to trace the time the function uses with trace? If >> we don't see the number growing considerably over time we probably can >> rule out that we loop somewhere in there (I would instrument >> inet_csk_bind_conflict, __inet_hash_connect and inet_csk_get_port). >> >> __inet_hash_connect -> __inet_check_established also takes a lock >> (inet_ehash_lockp) which can be locked from inet_diag code path during >> socket diag info dumping. >> >> Unfortunately we couldn't reproduce it so far. :/ > > > So I had a bcc script running to time how long we spent in > inet_csk_bind_conflict, __inet_hash_connect and inet_csk_get_port, but of > course I'm an idiot and didn't actually separate out the stats so I could > tell _which_ one was taking forever. 
But anyway here's a normal > distribution on the box > > Some shit : count distribution > 0 -> 1 : 0 | | > 2 -> 3 : 0 | | > 4 -> 7 : 0 | | > 8 -> 15 : 0 | | > 16 -> 31 : 0 | | > 32 -> 63 : 0 | | > 64 -> 127 : 0 | | > 128 -> 255 : 0 | | > 256 -> 511 : 0 | | > 512 -> 1023 : 0 | | > 1024 -> 2047 : 74 | | > 2048 -> 4095 : 10537 > |****************************************| > 4096 -> 8191 : 8497 |******************************** | > 8192 -> 16383 : 3745 |************** | > 16384 -> 32767 : 300 |* | > 32768 -> 65535 : 250 | | > 65536 -> 131071 : 180 | | > 131072 -> 262143 : 71 | | > 262144 -> 524287 : 18 | | > 524288 -> 1048575 : 5 | | > > With the times in nanoseconds, and here's the distribution during the > problem > > Some shit : count distribution > 0 -> 1 : 0 | | > 2 -> 3 : 0 | | > 4 -> 7 : 0 | | > 8 -> 15 : 0 | | > 16 -> 31 : 0 | | > 32 -> 63 : 0 | | > 64 -> 127 : 0 | | > 128 -> 255 : 0 | | > 256 -> 511 : 0 | | > 512 -> 1023 : 0 | | > 1024 -> 2047 : 21 | | > 2048 -> 4095 : 21820 > |****************************************| > 4096 -> 8191 : 11598 |********************* | > 8192 -> 16383 : 4337 |******* | > 16384 -> 32767 : 290 | | > 32768 -> 65535 : 59 | | > 65536 -> 131071 : 23 | | > 131072 -> 262143 : 12 | | > 262144 -> 524287 : 6 | | > 524288 -> 1048575 : 19 | | > 1048576 -> 2097151 : 1079 |* | > 2097152 -> 4194303 : 0 | | > 4194304 -> 8388607 : 1 | | > 8388608 -> 16777215 : 0 | | > 16777216 -> 33554431 : 0 | | > 33554432 -> 67108863 : 1192 |** | > Some shit : count distribution > 0 -> 1 : 0 | | > 2 -> 3 : 0 | | > 4 -> 7 : 0 | | > 8 -> 15 : 0 | | > 16 -> 31 : 0 | | > 32 -> 63 : 0 | | > 64 -> 127 : 0 | | > 128 -> 255 : 0 | | > 256 -> 511 : 0 | | > 512 -> 1023 : 0 | | > 1024 -> 2047 : 48 | | > 2048 -> 4095 : 14714 > |********************| > 4096 -> 8191 : 6769 |********* | > 8192 -> 16383 : 2234 |*** | > 16384 -> 32767 : 422 | | > 32768 -> 65535 : 208 | | > 65536 -> 131071 : 61 | | > 131072 -> 262143 : 10 | | > 262144 -> 524287 : 416 | | > 524288 -> 1048575 : 826 
|* | > 1048576 -> 2097151 : 598 | | > 2097152 -> 4194303 : 10 | | > 4194304 -> 8388607 : 0 | | > 8388608 -> 16777215 : 1 | | > 16777216 -> 33554431 : 289 | | > 33554432 -> 67108863 : 921 |* | > 67108864 -> 134217727 : 74 | | > 134217728 -> 268435455 : 75 | | > 268435456 -> 536870911 : 48 | | > 536870912 -> 1073741823 : 25 | | > 1073741824 -> 2147483647 : 3 | | > 2147483648 -> 4294967295 : 2 | | > 4294967296 -> 8589934591 : 1 | | > > As you can see we start getting tail latencies of up to 4-8 seconds. > Tomorrow I'll separate out the stats so we can know which function is the > problem child. Sorry about not doing that first. Thanks, > > Josef > ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Soft lockup in inet_put_port on 4.6 2016-12-13 20:51 ` Tom Herbert @ 2016-12-13 23:03 ` Craig Gallek 2016-12-13 23:32 ` Tom Herbert 0 siblings, 1 reply; 32+ messages in thread From: Craig Gallek @ 2016-12-13 23:03 UTC (permalink / raw) To: Tom Herbert Cc: Josef Bacik, Hannes Frederic Sowa, Eric Dumazet, Linux Kernel Network Developers On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert <tom@herbertland.com> wrote: > I think there may be some suspicious code in inet_csk_get_port. At > tb_found there is: > > if (((tb->fastreuse > 0 && reuse) || > (tb->fastreuseport > 0 && > !rcu_access_pointer(sk->sk_reuseport_cb) && > sk->sk_reuseport && uid_eq(tb->fastuid, uid))) && > smallest_size == -1) > goto success; > if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, true)) { > if ((reuse || > (tb->fastreuseport > 0 && > sk->sk_reuseport && > !rcu_access_pointer(sk->sk_reuseport_cb) && > uid_eq(tb->fastuid, uid))) && > smallest_size != -1 && --attempts >= 0) { > spin_unlock_bh(&head->lock); > goto again; > } > goto fail_unlock; > } > > AFAICT there is redundancy in these two conditionals. The same clause > is being checked in both: (tb->fastreuseport > 0 && > !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport && > uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is true the > first conditional should be hit, goto done, and the second will never > evaluate that part to true-- unless the sk is changed (do we need > READ_ONCE for sk->sk_reuseport_cb?). That's an interesting point... It looks like this function also changed in 4.6 from using a single local_bh_disable() at the beginning with several spin_lock(&head->lock) to exclusively spin_lock_bh(&head->lock) at each locking point. Perhaps the full bh disable variant was preventing the timers in your stack trace from running interleaved with this function before? ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Soft lockup in inet_put_port on 4.6 2016-12-13 23:03 ` Craig Gallek @ 2016-12-13 23:32 ` Tom Herbert 2016-12-15 18:53 ` Josef Bacik 0 siblings, 1 reply; 32+ messages in thread From: Tom Herbert @ 2016-12-13 23:32 UTC (permalink / raw) To: Craig Gallek Cc: Josef Bacik, Hannes Frederic Sowa, Eric Dumazet, Linux Kernel Network Developers On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek <kraigatgoog@gmail.com> wrote: > On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert <tom@herbertland.com> wrote: >> I think there may be some suspicious code in inet_csk_get_port. At >> tb_found there is: >> >> if (((tb->fastreuse > 0 && reuse) || >> (tb->fastreuseport > 0 && >> !rcu_access_pointer(sk->sk_reuseport_cb) && >> sk->sk_reuseport && uid_eq(tb->fastuid, uid))) && >> smallest_size == -1) >> goto success; >> if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, true)) { >> if ((reuse || >> (tb->fastreuseport > 0 && >> sk->sk_reuseport && >> !rcu_access_pointer(sk->sk_reuseport_cb) && >> uid_eq(tb->fastuid, uid))) && >> smallest_size != -1 && --attempts >= 0) { >> spin_unlock_bh(&head->lock); >> goto again; >> } >> goto fail_unlock; >> } >> >> AFAICT there is redundancy in these two conditionals. The same clause >> is being checked in both: (tb->fastreuseport > 0 && >> !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport && >> uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is true the >> first conditional should be hit, goto done, and the second will never >> evaluate that part to true-- unless the sk is changed (do we need >> READ_ONCE for sk->sk_reuseport_cb?). > That's an interesting point... It looks like this function also > changed in 4.6 from using a single local_bh_disable() at the beginning > with several spin_lock(&head->lock) to exclusively > spin_lock_bh(&head->lock) at each locking point. Perhaps the full bh > disable variant was preventing the timers in your stack trace from > running interleaved with this function before? 
Could be, although dropping the lock shouldn't be able to affect the search state. TBH, I'm a little lost reading this function; the SO_REUSEPORT handling is pretty complicated. For instance, rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in that function and also in every call to inet_csk_bind_conflict. I wonder if we can simplify this under the assumption that SO_REUSEPORT is only allowed if the port number (snum) is explicitly specified. Tom ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Soft lockup in inet_put_port on 4.6 2016-12-13 23:32 ` Tom Herbert @ 2016-12-15 18:53 ` Josef Bacik 2016-12-15 22:39 ` Tom Herbert 2016-12-16 0:07 ` Hannes Frederic Sowa 0 siblings, 2 replies; 32+ messages in thread From: Josef Bacik @ 2016-12-15 18:53 UTC (permalink / raw) To: Tom Herbert Cc: Craig Gallek, Hannes Frederic Sowa, Eric Dumazet, Linux Kernel Network Developers [-- Attachment #1: Type: text/plain, Size: 3507 bytes --] On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert <tom@herbertland.com> wrote: > On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek <kraigatgoog@gmail.com> > wrote: >> On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert <tom@herbertland.com> >> wrote: >>> I think there may be some suspicious code in inet_csk_get_port. At >>> tb_found there is: >>> >>> if (((tb->fastreuse > 0 && reuse) || >>> (tb->fastreuseport > 0 && >>> !rcu_access_pointer(sk->sk_reuseport_cb) && >>> sk->sk_reuseport && uid_eq(tb->fastuid, >>> uid))) && >>> smallest_size == -1) >>> goto success; >>> if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, >>> tb, true)) { >>> if ((reuse || >>> (tb->fastreuseport > 0 && >>> sk->sk_reuseport && >>> >>> !rcu_access_pointer(sk->sk_reuseport_cb) && >>> uid_eq(tb->fastuid, uid))) && >>> smallest_size != -1 && --attempts >= >>> 0) { >>> spin_unlock_bh(&head->lock); >>> goto again; >>> } >>> goto fail_unlock; >>> } >>> >>> AFAICT there is redundancy in these two conditionals. The same >>> clause >>> is being checked in both: (tb->fastreuseport > 0 && >>> !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport && >>> uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is true >>> the >>> first conditional should be hit, goto done, and the second will >>> never >>> evaluate that part to true-- unless the sk is changed (do we need >>> READ_ONCE for sk->sk_reuseport_cb?). >> That's an interesting point... 
It looks like this function also >> changed in 4.6 from using a single local_bh_disable() at the >> beginning >> with several spin_lock(&head->lock) to exclusively >> spin_lock_bh(&head->lock) at each locking point. Perhaps the full bh >> disable variant was preventing the timers in your stack trace from >> running interleaved with this function before? > > Could be, although dropping the lock shouldn't be able to affect the > search state. TBH, I'm a little lost reading this function; the > SO_REUSEPORT handling is pretty complicated. For instance, > rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in that > function and also in every call to inet_csk_bind_conflict. I wonder if > we can simplify this under the assumption that SO_REUSEPORT is only > allowed if the port number (snum) is explicitly specified. Ok, first I have data for you Hannes: here are the time distributions before, during, and after the lockup (with all the debugging in place the box eventually recovers). I've attached it as a text file since it is long. Second, I was thinking about why we would spend so much time walking the ->owners list, and obviously it's because of the massive amount of timewait sockets on the owners list. I wrote the following dumb patch and tested it, and the problem has disappeared completely. Now I don't know if this is right at all, but I thought it was weird we weren't copying the soreuseport option from the original socket onto the twsk. Is there a reason we aren't doing this currently? Does this help explain what is happening? 
Thanks, Josef [-- Attachment #2.1: timing-dist.txt --] [-- Type: text/plain, Size: 30972 bytes --] inet_csk_get_port : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 0 | | 8 -> 15 : 0 | | 16 -> 31 : 0 | | 32 -> 63 : 0 | | 64 -> 127 : 0 | | 128 -> 255 : 0 | | 256 -> 511 : 0 | | 512 -> 1023 : 0 | | 1024 -> 2047 : 4 |* | 2048 -> 4095 : 100 |****************************************| 4096 -> 8191 : 64 |************************* | 8192 -> 16383 : 35 |************** | 16384 -> 32767 : 2 | | inet_csk_bind_conflict : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 0 | | 8 -> 15 : 0 | | 16 -> 31 : 0 | | 32 -> 63 : 0 | | 64 -> 127 : 0 | | 128 -> 255 : 0 | | 256 -> 511 : 0 | | 512 -> 1023 : 0 | | 1024 -> 2047 : 1 |* | 2048 -> 4095 : 38 |****************************************| 4096 -> 8191 : 9 |********* | 8192 -> 16383 : 2 |** | 16384 -> 32767 : 1 |* | <restart happens> inet_csk_bind_conflict : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 0 | | 8 -> 15 : 0 | | 16 -> 31 : 0 | | 32 -> 63 : 0 | | 64 -> 127 : 0 | | 128 -> 255 : 0 | | 256 -> 511 : 0 | | 512 -> 1023 : 0 | | 1024 -> 2047 : 9 |** | 2048 -> 4095 : 54 |**************** | 4096 -> 8191 : 15 |**** | 8192 -> 16383 : 0 | | 16384 -> 32767 : 1 | | 32768 -> 65535 : 0 | | 65536 -> 131071 : 0 | | 131072 -> 262143 : 0 | | 262144 -> 524287 : 0 | | 524288 -> 1048575 : 0 | | 1048576 -> 2097151 : 0 | | 2097152 -> 4194303 : 130 |****************************************| 4194304 -> 8388607 : 0 | | 8388608 -> 16777215 : 0 | | 16777216 -> 33554431 : 0 | | 33554432 -> 67108863 : 92 |**************************** | inet_csk_get_port : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 0 | | 8 -> 15 : 0 | | 16 -> 31 : 0 | | 32 -> 63 : 0 | | 64 -> 127 : 0 | | 128 -> 255 : 0 | | 256 -> 511 : 0 | | 512 -> 1023 : 0 | | 1024 -> 2047 : 11 | | 2048 -> 4095 : 132 |********* | 4096 -> 8191 : 91 |****** | 8192 -> 16383 : 13 | | 16384 -> 32767 : 0 | | 32768 -> 65535 : 0 | | 65536 -> 131071 : 0 | | 131072 -> 
262143 : 0 | | 262144 -> 524287 : 0 | | 524288 -> 1048575 : 0 | | 1048576 -> 2097151 : 0 | | 2097152 -> 4194303 : 401 |**************************** | 4194304 -> 8388607 : 274 |******************* | 8388608 -> 16777215 : 0 | | 16777216 -> 33554431 : 16 |* | 33554432 -> 67108863 : 561 |****************************************| inet_csk_bind_conflict : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 0 | | 8 -> 15 : 0 | | 16 -> 31 : 0 | | 32 -> 63 : 0 | | 64 -> 127 : 0 | | 128 -> 255 : 0 | | 256 -> 511 : 0 | | 512 -> 1023 : 0 | | 1024 -> 2047 : 6 | | 2048 -> 4095 : 68 |**** | 4096 -> 8191 : 9 | | 8192 -> 16383 : 2 | | 16384 -> 32767 : 0 | | 32768 -> 65535 : 0 | | 65536 -> 131071 : 0 | | 131072 -> 262143 : 0 | | 262144 -> 524287 : 0 | | 524288 -> 1048575 : 0 | | 1048576 -> 2097151 : 0 | | 2097152 -> 4194303 : 650 |****************************************| 4194304 -> 8388607 : 0 | | 8388608 -> 16777215 : 0 | | 16777216 -> 33554431 : 15 | | 33554432 -> 67108863 : 583 |*********************************** | inet_csk_get_port : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 0 | | 8 -> 15 : 0 | | 16 -> 31 : 0 | | 32 -> 63 : 0 | | 64 -> 127 : 0 | | 128 -> 255 : 0 | | 256 -> 511 : 0 | | 512 -> 1023 : 0 | | 1024 -> 2047 : 18 |* | 2048 -> 4095 : 263 |******************** | 4096 -> 8191 : 188 |************** | 8192 -> 16383 : 186 |************** | 16384 -> 32767 : 7 | | 32768 -> 65535 : 1 | | 65536 -> 131071 : 1 | | 131072 -> 262143 : 0 | | 262144 -> 524287 : 0 | | 524288 -> 1048575 : 0 | | 1048576 -> 2097151 : 0 | | 2097152 -> 4194303 : 37 |** | 4194304 -> 8388607 : 454 |********************************** | 8388608 -> 16777215 : 9 | | 16777216 -> 33554431 : 24 |* | 33554432 -> 67108863 : 526 |****************************************| <soft lockup messages start happening> inet_csk_bind_conflict : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 0 | | 8 -> 15 : 0 | | 16 -> 31 : 0 | | 32 -> 63 : 0 | | 64 -> 127 : 0 | | 128 -> 255 : 0 | | 256 -> 511 : 0 
| | 512 -> 1023 : 0 | | 1024 -> 2047 : 20 |* | 2048 -> 4095 : 130 |********** | 4096 -> 8191 : 40 |*** | 8192 -> 16383 : 2 | | 16384 -> 32767 : 1 | | 32768 -> 65535 : 0 | | 65536 -> 131071 : 0 | | 131072 -> 262143 : 0 | | 262144 -> 524287 : 0 | | 524288 -> 1048575 : 0 | | 1048576 -> 2097151 : 0 | | 2097152 -> 4194303 : 506 |*************************************** | 4194304 -> 8388607 : 0 | | 8388608 -> 16777215 : 0 | | 16777216 -> 33554431 : 23 |* | 33554432 -> 67108863 : 511 |****************************************| inet_csk_get_port : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 0 | | 8 -> 15 : 0 | | 16 -> 31 : 0 | | 32 -> 63 : 0 | | 64 -> 127 : 0 | | 128 -> 255 : 0 | | 256 -> 511 : 0 | | 512 -> 1023 : 0 | | 1024 -> 2047 : 9 | | 2048 -> 4095 : 356 |********************| 4096 -> 8191 : 230 |************ | 8192 -> 16383 : 342 |******************* | 16384 -> 32767 : 12 | | 32768 -> 65535 : 1 | | 65536 -> 131071 : 0 | | 131072 -> 262143 : 0 | | 262144 -> 524287 : 1 | | 524288 -> 1048575 : 0 | | 1048576 -> 2097151 : 0 | | 2097152 -> 4194303 : 311 |***************** | 4194304 -> 8388607 : 163 |********* | 8388608 -> 16777215 : 1 | | 16777216 -> 33554431 : 3 | | 33554432 -> 67108863 : 338 |****************** | 67108864 -> 134217727 : 55 |*** | 134217728 -> 268435455 : 65 |*** | 268435456 -> 536870911 : 36 |** | 536870912 -> 1073741823 : 22 |* | 1073741824 -> 2147483647 : 16 | | 2147483648 -> 4294967295 : 7 | | 4294967296 -> 8589934591 : 1 | | inet_csk_bind_conflict : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 0 | | 8 -> 15 : 0 | | 16 -> 31 : 0 | | 32 -> 63 : 0 | | 64 -> 127 : 0 | | 128 -> 255 : 0 | | 256 -> 511 : 0 | | 512 -> 1023 : 0 | | 1024 -> 2047 : 2 | | 2048 -> 4095 : 86 |*** | 4096 -> 8191 : 16 | | 8192 -> 16383 : 0 | | 16384 -> 32767 : 0 | | 32768 -> 65535 : 0 | | 65536 -> 131071 : 0 | | 131072 -> 262143 : 0 | | 262144 -> 524287 : 0 | | 524288 -> 1048575 : 0 | | 1048576 -> 2097151 : 187 |******* | 2097152 -> 4194303 : 975 
|****************************************| 4194304 -> 8388607 : 0 | | 8388608 -> 16777215 : 0 | | 16777216 -> 33554431 : 337 |************* | 33554432 -> 67108863 : 442 |****************** | inet_csk_get_port : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 0 | | 8 -> 15 : 0 | | 16 -> 31 : 0 | | 32 -> 63 : 0 | | 64 -> 127 : 0 | | 128 -> 255 : 0 | | 256 -> 511 : 0 | | 512 -> 1023 : 0 | | 1024 -> 2047 : 162 |**** | 2048 -> 4095 : 495 |************** | 4096 -> 8191 : 66 |* | 8192 -> 16383 : 6 | | 16384 -> 32767 : 2 | | 32768 -> 65535 : 0 | | 65536 -> 131071 : 0 | | 131072 -> 262143 : 0 | | 262144 -> 524287 : 0 | | 524288 -> 1048575 : 0 | | 1048576 -> 2097151 : 0 | | 2097152 -> 4194303 : 680 |********************| 4194304 -> 8388607 : 166 |**** | 8388608 -> 16777215 : 10 | | 16777216 -> 33554431 : 6 | | 33554432 -> 67108863 : 150 |**** | 67108864 -> 134217727 : 275 |******** | 134217728 -> 268435455 : 205 |****** | 268435456 -> 536870911 : 151 |**** | 536870912 -> 1073741823 : 137 |**** | 1073741824 -> 2147483647 : 76 |** | 2147483648 -> 4294967295 : 48 |* | 4294967296 -> 8589934591 : 6 | | 8589934592 -> 17179869183 : 2 | | inet_csk_bind_conflict : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 0 | | 8 -> 15 : 0 | | 16 -> 31 : 0 | | 32 -> 63 : 0 | | 64 -> 127 : 0 | | 128 -> 255 : 0 | | 256 -> 511 : 0 | | 512 -> 1023 : 0 | | 1024 -> 2047 : 7 | | 2048 -> 4095 : 40 |*** | 4096 -> 8191 : 0 | | 8192 -> 16383 : 0 | | 16384 -> 32767 : 0 | | 32768 -> 65535 : 0 | | 65536 -> 131071 : 0 | | 131072 -> 262143 : 0 | | 262144 -> 524287 : 0 | | 524288 -> 1048575 : 0 | | 1048576 -> 2097151 : 33 |** | 2097152 -> 4194303 : 159 |************ | 4194304 -> 8388607 : 0 | | 8388608 -> 16777215 : 0 | | 16777216 -> 33554431 : 311 |************************* | 33554432 -> 67108863 : 493 |****************************************| inet_csk_get_port : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 0 | | 8 -> 15 : 0 | | 16 -> 31 : 0 | | 32 -> 63 : 0 | | 64 -> 127 : 0 
| | 128 -> 255 : 0 | | 256 -> 511 : 0 | | 512 -> 1023 : 0 | | 1024 -> 2047 : 129 |******************* | 2048 -> 4095 : 55 |******** | 4096 -> 8191 : 47 |******* | 8192 -> 16383 : 17 |** | 16384 -> 32767 : 2 | | 32768 -> 65535 : 0 | | 65536 -> 131071 : 0 | | 131072 -> 262143 : 0 | | 262144 -> 524287 : 0 | | 524288 -> 1048575 : 0 | | 1048576 -> 2097151 : 30 |**** | 2097152 -> 4194303 : 130 |********************| 4194304 -> 8388607 : 24 |*** | 8388608 -> 16777215 : 0 | | 16777216 -> 33554431 : 13 |** | 33554432 -> 67108863 : 118 |****************** | 67108864 -> 134217727 : 58 |******** | 134217728 -> 268435455 : 17 |** | 268435456 -> 536870911 : 7 |* | 536870912 -> 1073741823 : 0 | | 1073741824 -> 2147483647 : 1 | | 2147483648 -> 4294967295 : 0 | | 4294967296 -> 8589934591 : 1 | | inet_csk_bind_conflict : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 0 | | 8 -> 15 : 0 | | 16 -> 31 : 0 | | 32 -> 63 : 0 | | 64 -> 127 : 0 | | 128 -> 255 : 0 | | 256 -> 511 : 0 | | 512 -> 1023 : 0 | | 1024 -> 2047 : 6 |* | 2048 -> 4095 : 14 |** | 4096 -> 8191 : 0 | | 8192 -> 16383 : 1 | | 16384 -> 32767 : 0 | | 32768 -> 65535 : 0 | | 65536 -> 131071 : 0 | | 131072 -> 262143 : 0 | | 262144 -> 524287 : 0 | | 524288 -> 1048575 : 0 | | 1048576 -> 2097151 : 158 |******************************** | 2097152 -> 4194303 : 22 |**** | 4194304 -> 8388607 : 0 | | 8388608 -> 16777215 : 0 | | 16777216 -> 33554431 : 192 |****************************************| 33554432 -> 67108863 : 9 |* | <recovers> inet_csk_get_port : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 0 | | 8 -> 15 : 0 | | 16 -> 31 : 0 | | 32 -> 63 : 0 | | 64 -> 127 : 0 | | 128 -> 255 : 0 | | 256 -> 511 : 0 | | 512 -> 1023 : 0 | | 1024 -> 2047 : 10 |**************** | 2048 -> 4095 : 25 |****************************************| 4096 -> 8191 : 16 |************************* | 8192 -> 16383 : 1 |* | 16384 -> 32767 : 0 | | 32768 -> 65535 : 1 |* | inet_csk_bind_conflict : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 
4 -> 7 : 0 | | 8 -> 15 : 0 | | 16 -> 31 : 0 | | 32 -> 63 : 0 | | 64 -> 127 : 0 | | 128 -> 255 : 0 | | 256 -> 511 : 0 | | 512 -> 1023 : 0 | | 1024 -> 2047 : 10 |********************************* | 2048 -> 4095 : 12 |****************************************| inet_csk_get_port : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 0 | | 8 -> 15 : 0 | | 16 -> 31 : 0 | | 32 -> 63 : 0 | | 64 -> 127 : 0 | | 128 -> 255 : 0 | | 256 -> 511 : 0 | | 512 -> 1023 : 0 | | 1024 -> 2047 : 0 | | 2048 -> 4095 : 0 | | 4096 -> 8191 : 4 |****************************************| 8192 -> 16383 : 1 |********** | [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2.2: tw-reuseport.patch --] [-- Type: text/x-patch, Size: 1229 bytes --] commit ea66f43c5b4d94625ad7322e4097acd9a06d7fdd Author: Josef Bacik <jbacik@fb.com> Date: Wed Dec 14 11:54:49 2016 -0800 do reuseport too diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h index c9b3eb7..567017b 100644 --- a/include/net/inet_timewait_sock.h +++ b/include/net/inet_timewait_sock.h @@ -55,6 +55,7 @@ struct inet_timewait_sock { #define tw_family __tw_common.skc_family #define tw_state __tw_common.skc_state #define tw_reuse __tw_common.skc_reuse +#define tw_reuseport __tw_common.skc_reuseport #define tw_ipv6only __tw_common.skc_ipv6only #define tw_bound_dev_if __tw_common.skc_bound_dev_if #define tw_node __tw_common.skc_nulls_node diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c index a1b1057..04c560e 100644 --- a/net/ipv4/inet_timewait_sock.c +++ b/net/ipv4/inet_timewait_sock.c @@ -183,6 +183,7 @@ struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk, tw->tw_dport = inet->inet_dport; tw->tw_family = sk->sk_family; tw->tw_reuse = sk->sk_reuse; + tw->tw_reuseport = sk->sk_reuseport; tw->tw_hash = sk->sk_hash; tw->tw_ipv6only = 0; tw->tw_transparent = inet->transparent; ^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: Soft lockup in inet_put_port on 4.6 2016-12-15 18:53 ` Josef Bacik @ 2016-12-15 22:39 ` Tom Herbert 2016-12-15 23:25 ` Craig Gallek 2016-12-16 0:07 ` Hannes Frederic Sowa 1 sibling, 1 reply; 32+ messages in thread From: Tom Herbert @ 2016-12-15 22:39 UTC (permalink / raw) To: Josef Bacik Cc: Craig Gallek, Hannes Frederic Sowa, Eric Dumazet, Linux Kernel Network Developers On Thu, Dec 15, 2016 at 10:53 AM, Josef Bacik <jbacik@fb.com> wrote: > On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert <tom@herbertland.com> wrote: >> >> On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek <kraigatgoog@gmail.com> >> wrote: >>> >>> On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert <tom@herbertland.com> >>> wrote: >>>> >>>> I think there may be some suspicious code in inet_csk_get_port. At >>>> tb_found there is: >>>> >>>> if (((tb->fastreuse > 0 && reuse) || >>>> (tb->fastreuseport > 0 && >>>> !rcu_access_pointer(sk->sk_reuseport_cb) && >>>> sk->sk_reuseport && uid_eq(tb->fastuid, uid))) && >>>> smallest_size == -1) >>>> goto success; >>>> if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, >>>> true)) { >>>> if ((reuse || >>>> (tb->fastreuseport > 0 && >>>> sk->sk_reuseport && >>>> !rcu_access_pointer(sk->sk_reuseport_cb) >>>> && >>>> uid_eq(tb->fastuid, uid))) && >>>> smallest_size != -1 && --attempts >= 0) { >>>> spin_unlock_bh(&head->lock); >>>> goto again; >>>> } >>>> goto fail_unlock; >>>> } >>>> >>>> AFAICT there is redundancy in these two conditionals. The same clause >>>> is being checked in both: (tb->fastreuseport > 0 && >>>> !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport && >>>> uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is true the >>>> first conditional should be hit, goto done, and the second will never >>>> evaluate that part to true-- unless the sk is changed (do we need >>>> READ_ONCE for sk->sk_reuseport_cb?). >>> >>> That's an interesting point... 
It looks like this function also >>> changed in 4.6 from using a single local_bh_disable() at the beginning >>> with several spin_lock(&head->lock) to exclusively >>> spin_lock_bh(&head->lock) at each locking point. Perhaps the full bh >>> disable variant was preventing the timers in your stack trace from >>> running interleaved with this function before? >> >> >> Could be, although dropping the lock shouldn't be able to affect the >> search state. TBH, I'm a little lost in reading function, the >> SO_REUSEPORT handling is pretty complicated. For instance, >> rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in that >> function and also in every call to inet_csk_bind_conflict. I wonder if >> we can simply this under the assumption that SO_REUSEPORT is only >> allowed if the port number (snum) is explicitly specified. > > > Ok first I have data for you Hannes, here's the time distributions before > during and after the lockup (with all the debugging in place the box > eventually recovers). I've attached it as a text file since it is long. > > Second is I was thinking about why we would spend so much time doing the > ->owners list, and obviously it's because of the massive amount of timewait > sockets on the owners list. I wrote the following dumb patch and tested it > and the problem has disappeared completely. Now I don't know if this is > right at all, but I thought it was weird we weren't copying the soreuseport > option from the original socket onto the twsk. Is there are reason we > aren't doing this currently? Does this help explain what is happening? > Thanks, > I think that would explain it. We would be walking long lists of TW sockets in inet_bind_bucket_for_each(tb, &head->chain). This should break, although now I'm wondering if there's other ways we can get into this situation. reuseport ensures that we can have long lists of sockets in a single bucket, TW sockets can make that list really long. 
Tom > Josef ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Soft lockup in inet_put_port on 4.6 2016-12-15 22:39 ` Tom Herbert @ 2016-12-15 23:25 ` Craig Gallek 0 siblings, 0 replies; 32+ messages in thread From: Craig Gallek @ 2016-12-15 23:25 UTC (permalink / raw) To: Tom Herbert Cc: Josef Bacik, Hannes Frederic Sowa, Eric Dumazet, Linux Kernel Network Developers On Thu, Dec 15, 2016 at 5:39 PM, Tom Herbert <tom@herbertland.com> wrote: > On Thu, Dec 15, 2016 at 10:53 AM, Josef Bacik <jbacik@fb.com> wrote: >> On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert <tom@herbertland.com> wrote: >>> >>> On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek <kraigatgoog@gmail.com> >>> wrote: >>>> >>>> On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert <tom@herbertland.com> >>>> wrote: >>>>> >>>>> I think there may be some suspicious code in inet_csk_get_port. At >>>>> tb_found there is: >>>>> >>>>> if (((tb->fastreuse > 0 && reuse) || >>>>> (tb->fastreuseport > 0 && >>>>> !rcu_access_pointer(sk->sk_reuseport_cb) && >>>>> sk->sk_reuseport && uid_eq(tb->fastuid, uid))) && >>>>> smallest_size == -1) >>>>> goto success; >>>>> if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, >>>>> true)) { >>>>> if ((reuse || >>>>> (tb->fastreuseport > 0 && >>>>> sk->sk_reuseport && >>>>> !rcu_access_pointer(sk->sk_reuseport_cb) >>>>> && >>>>> uid_eq(tb->fastuid, uid))) && >>>>> smallest_size != -1 && --attempts >= 0) { >>>>> spin_unlock_bh(&head->lock); >>>>> goto again; >>>>> } >>>>> goto fail_unlock; >>>>> } >>>>> >>>>> AFAICT there is redundancy in these two conditionals. The same clause >>>>> is being checked in both: (tb->fastreuseport > 0 && >>>>> !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport && >>>>> uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is true the >>>>> first conditional should be hit, goto done, and the second will never >>>>> evaluate that part to true-- unless the sk is changed (do we need >>>>> READ_ONCE for sk->sk_reuseport_cb?). >>>> >>>> That's an interesting point... 
It looks like this function also >>>> changed in 4.6 from using a single local_bh_disable() at the beginning >>>> with several spin_lock(&head->lock) to exclusively >>>> spin_lock_bh(&head->lock) at each locking point. Perhaps the full bh >>>> disable variant was preventing the timers in your stack trace from >>>> running interleaved with this function before? >>> >>> >>> Could be, although dropping the lock shouldn't be able to affect the >>> search state. TBH, I'm a little lost in reading function, the >>> SO_REUSEPORT handling is pretty complicated. For instance, >>> rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in that >>> function and also in every call to inet_csk_bind_conflict. I wonder if >>> we can simply this under the assumption that SO_REUSEPORT is only >>> allowed if the port number (snum) is explicitly specified. >> >> >> Ok first I have data for you Hannes, here's the time distributions before >> during and after the lockup (with all the debugging in place the box >> eventually recovers). I've attached it as a text file since it is long. >> >> Second is I was thinking about why we would spend so much time doing the >> ->owners list, and obviously it's because of the massive amount of timewait >> sockets on the owners list. I wrote the following dumb patch and tested it >> and the problem has disappeared completely. Now I don't know if this is >> right at all, but I thought it was weird we weren't copying the soreuseport >> option from the original socket onto the twsk. Is there are reason we >> aren't doing this currently? Does this help explain what is happening? >> Thanks, >> > I think that would explain it. We would be walking long lists of TW > sockets in inet_bind_bucket_for_each(tb, &head->chain). This should > break, although now I'm wondering if there's other ways we can get > into this situation. reuseport ensures that we can have long lists of > sockets in a single bucket, TW sockets can make that list really long. 
What if the time-wait timer implementation was changed to do more opportunistic removals? In this case, you seem to have a coordinated timer event causing many independent locking events on the bucket in question. If one of those firing events realized it could handle all of them, you could greatly reduce the contention. The fact that they all hash to the same bucket may make this even easier... > Tom > >> Josef ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Soft lockup in inet_put_port on 4.6 2016-12-15 18:53 ` Josef Bacik 2016-12-15 22:39 ` Tom Herbert @ 2016-12-16 0:07 ` Hannes Frederic Sowa 2016-12-16 14:54 ` Josef Bacik 1 sibling, 1 reply; 32+ messages in thread From: Hannes Frederic Sowa @ 2016-12-16 0:07 UTC (permalink / raw) To: Josef Bacik, Tom Herbert Cc: Craig Gallek, Eric Dumazet, Linux Kernel Network Developers Hi Josef, On 15.12.2016 19:53, Josef Bacik wrote: > On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert <tom@herbertland.com> wrote: >> On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek <kraigatgoog@gmail.com> >> wrote: >>> On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert <tom@herbertland.com> >>> wrote: >>>> I think there may be some suspicious code in inet_csk_get_port. At >>>> tb_found there is: >>>> >>>> if (((tb->fastreuse > 0 && reuse) || >>>> (tb->fastreuseport > 0 && >>>> !rcu_access_pointer(sk->sk_reuseport_cb) && >>>> sk->sk_reuseport && uid_eq(tb->fastuid, >>>> uid))) && >>>> smallest_size == -1) >>>> goto success; >>>> if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, >>>> tb, true)) { >>>> if ((reuse || >>>> (tb->fastreuseport > 0 && >>>> sk->sk_reuseport && >>>> >>>> !rcu_access_pointer(sk->sk_reuseport_cb) && >>>> uid_eq(tb->fastuid, uid))) && >>>> smallest_size != -1 && --attempts >= 0) { >>>> spin_unlock_bh(&head->lock); >>>> goto again; >>>> } >>>> goto fail_unlock; >>>> } >>>> >>>> AFAICT there is redundancy in these two conditionals. The same clause >>>> is being checked in both: (tb->fastreuseport > 0 && >>>> !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport && >>>> uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is true the >>>> first conditional should be hit, goto done, and the second will never >>>> evaluate that part to true-- unless the sk is changed (do we need >>>> READ_ONCE for sk->sk_reuseport_cb?). >>> That's an interesting point... 
It looks like this function also >>> changed in 4.6 from using a single local_bh_disable() at the beginning >>> with several spin_lock(&head->lock) to exclusively >>> spin_lock_bh(&head->lock) at each locking point. Perhaps the full bh >>> disable variant was preventing the timers in your stack trace from >>> running interleaved with this function before? >> >> Could be, although dropping the lock shouldn't be able to affect the >> search state. TBH, I'm a little lost in reading function, the >> SO_REUSEPORT handling is pretty complicated. For instance, >> rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in that >> function and also in every call to inet_csk_bind_conflict. I wonder if >> we can simply this under the assumption that SO_REUSEPORT is only >> allowed if the port number (snum) is explicitly specified. > > Ok first I have data for you Hannes, here's the time distributions > before during and after the lockup (with all the debugging in place the > box eventually recovers). I've attached it as a text file since it is > long. Thanks a lot! > Second is I was thinking about why we would spend so much time doing the > ->owners list, and obviously it's because of the massive amount of > timewait sockets on the owners list. I wrote the following dumb patch > and tested it and the problem has disappeared completely. Now I don't > know if this is right at all, but I thought it was weird we weren't > copying the soreuseport option from the original socket onto the twsk. > Is there are reason we aren't doing this currently? Does this help > explain what is happening? Thanks, The patch is interesting and a good clue, but I am immediately a bit concerned that we don't copy/tag the socket with the uid also to keep the security properties for SO_REUSEPORT. I have to think a bit more about this. We have seen hangs during connect. I am afraid this patch wouldn't help there while also guaranteeing uniqueness. Thanks a lot! 
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Soft lockup in inet_put_port on 4.6 2016-12-16 0:07 ` Hannes Frederic Sowa @ 2016-12-16 14:54 ` Josef Bacik 2016-12-16 15:21 ` Josef Bacik 0 siblings, 1 reply; 32+ messages in thread From: Josef Bacik @ 2016-12-16 14:54 UTC (permalink / raw) To: Hannes Frederic Sowa Cc: Tom Herbert, Craig Gallek, Eric Dumazet, Linux Kernel Network Developers On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa <hannes@stressinduktion.org> wrote: > Hi Josef, > > On 15.12.2016 19:53, Josef Bacik wrote: >> On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert <tom@herbertland.com> >> wrote: >>> On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek >>> <kraigatgoog@gmail.com> >>> wrote: >>>> On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert >>>> <tom@herbertland.com> >>>> wrote: >>>>> I think there may be some suspicious code in inet_csk_get_port. >>>>> At >>>>> tb_found there is: >>>>> >>>>> if (((tb->fastreuse > 0 && reuse) || >>>>> (tb->fastreuseport > 0 && >>>>> !rcu_access_pointer(sk->sk_reuseport_cb) >>>>> && >>>>> sk->sk_reuseport && uid_eq(tb->fastuid, >>>>> uid))) && >>>>> smallest_size == -1) >>>>> goto success; >>>>> if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, >>>>> tb, true)) { >>>>> if ((reuse || >>>>> (tb->fastreuseport > 0 && >>>>> sk->sk_reuseport && >>>>> >>>>> !rcu_access_pointer(sk->sk_reuseport_cb) && >>>>> uid_eq(tb->fastuid, uid))) && >>>>> smallest_size != -1 && --attempts >>>>> >= 0) { >>>>> spin_unlock_bh(&head->lock); >>>>> goto again; >>>>> } >>>>> goto fail_unlock; >>>>> } >>>>> >>>>> AFAICT there is redundancy in these two conditionals. The same >>>>> clause >>>>> is being checked in both: (tb->fastreuseport > 0 && >>>>> !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport && >>>>> uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is >>>>> true the >>>>> first conditional should be hit, goto done, and the second >>>>> will never >>>>> evaluate that part to true-- unless the sk is changed (do we >>>>> need >>>>> READ_ONCE for sk->sk_reuseport_cb?). 
>>>> That's an interesting point... It looks like this function also >>>> changed in 4.6 from using a single local_bh_disable() at the >>>> beginning >>>> with several spin_lock(&head->lock) to exclusively >>>> spin_lock_bh(&head->lock) at each locking point. Perhaps the >>>> full bh >>>> disable variant was preventing the timers in your stack trace >>>> from >>>> running interleaved with this function before? >>> >>> Could be, although dropping the lock shouldn't be able to affect >>> the >>> search state. TBH, I'm a little lost in reading function, the >>> SO_REUSEPORT handling is pretty complicated. For instance, >>> rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in >>> that >>> function and also in every call to inet_csk_bind_conflict. I >>> wonder if >>> we can simply this under the assumption that SO_REUSEPORT is only >>> allowed if the port number (snum) is explicitly specified. >> >> Ok first I have data for you Hannes, here's the time distributions >> before during and after the lockup (with all the debugging in place >> the >> box eventually recovers). I've attached it as a text file since it >> is >> long. > > Thanks a lot! > >> Second is I was thinking about why we would spend so much time >> doing the >> ->owners list, and obviously it's because of the massive amount of >> timewait sockets on the owners list. I wrote the following dumb >> patch >> and tested it and the problem has disappeared completely. Now I >> don't >> know if this is right at all, but I thought it was weird we weren't >> copying the soreuseport option from the original socket onto the >> twsk. >> Is there are reason we aren't doing this currently? Does this help >> explain what is happening? Thanks, > > The patch is interesting and a good clue, but I am immediately a bit > concerned that we don't copy/tag the socket with the uid also to keep > the security properties for SO_REUSEPORT. I have to think a bit more > about this. > > We have seen hangs during connect. 
> I am afraid this patch wouldn't help
> there while also guaranteeing uniqueness.

Yeah, so I looked at the code some more and actually my patch is really
bad. If sk2->sk_reuseport is set we'll look at sk2->sk_reuseport_cb,
which is outside of the timewait sock, so that's definitely bad.

But we should at least be setting it to 0 so that we don't do this
normally. Unfortunately, simply setting it to 0 doesn't fix the problem.
So for some reason having ->sk_reuseport set to 1 on a timewait socket
makes this problem non-existent, which is strange.

So back to the drawing board, I guess. I wonder if doing what Craig
suggested and batching the timewait timer expiries so it hurts less
would accomplish the same results. Thanks,

Josef

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Soft lockup in inet_put_port on 4.6 2016-12-16 14:54 ` Josef Bacik @ 2016-12-16 15:21 ` Josef Bacik 2016-12-16 22:08 ` Josef Bacik 0 siblings, 1 reply; 32+ messages in thread From: Josef Bacik @ 2016-12-16 15:21 UTC (permalink / raw) To: Hannes Frederic Sowa Cc: Tom Herbert, Craig Gallek, Eric Dumazet, Linux Kernel Network Developers On Fri, Dec 16, 2016 at 9:54 AM, Josef Bacik <jbacik@fb.com> wrote: > On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa > <hannes@stressinduktion.org> wrote: >> Hi Josef, >> >> On 15.12.2016 19:53, Josef Bacik wrote: >>> On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert <tom@herbertland.com> >>> wrote: >>>> On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek >>>> <kraigatgoog@gmail.com> >>>> wrote: >>>>> On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert >>>>> <tom@herbertland.com> >>>>> wrote: >>>>>> I think there may be some suspicious code in >>>>>> inet_csk_get_port. At >>>>>> tb_found there is: >>>>>> >>>>>> if (((tb->fastreuse > 0 && reuse) || >>>>>> (tb->fastreuseport > 0 && >>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) >>>>>> && >>>>>> sk->sk_reuseport && uid_eq(tb->fastuid, >>>>>> uid))) && >>>>>> smallest_size == -1) >>>>>> goto success; >>>>>> if >>>>>> (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, >>>>>> tb, true)) { >>>>>> if ((reuse || >>>>>> (tb->fastreuseport > 0 && >>>>>> sk->sk_reuseport && >>>>>> >>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) && >>>>>> uid_eq(tb->fastuid, uid))) && >>>>>> smallest_size != -1 && --attempts >>>>>> >= 0) { >>>>>> spin_unlock_bh(&head->lock); >>>>>> goto again; >>>>>> } >>>>>> goto fail_unlock; >>>>>> } >>>>>> >>>>>> AFAICT there is redundancy in these two conditionals. The >>>>>> same clause >>>>>> is being checked in both: (tb->fastreuseport > 0 && >>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport && >>>>>> uid_eq(tb->fastuid, uid))) && smallest_size == -1. 
If this is >>>>>> true the >>>>>> first conditional should be hit, goto done, and the second >>>>>> will never >>>>>> evaluate that part to true-- unless the sk is changed (do we >>>>>> need >>>>>> READ_ONCE for sk->sk_reuseport_cb?). >>>>> That's an interesting point... It looks like this function also >>>>> changed in 4.6 from using a single local_bh_disable() at the >>>>> beginning >>>>> with several spin_lock(&head->lock) to exclusively >>>>> spin_lock_bh(&head->lock) at each locking point. Perhaps the >>>>> full bh >>>>> disable variant was preventing the timers in your stack trace >>>>> from >>>>> running interleaved with this function before? >>>> >>>> Could be, although dropping the lock shouldn't be able to affect >>>> the >>>> search state. TBH, I'm a little lost in reading function, the >>>> SO_REUSEPORT handling is pretty complicated. For instance, >>>> rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in >>>> that >>>> function and also in every call to inet_csk_bind_conflict. I >>>> wonder if >>>> we can simply this under the assumption that SO_REUSEPORT is only >>>> allowed if the port number (snum) is explicitly specified. >>> >>> Ok first I have data for you Hannes, here's the time distributions >>> before during and after the lockup (with all the debugging in >>> place the >>> box eventually recovers). I've attached it as a text file since >>> it is >>> long. >> >> Thanks a lot! >> >>> Second is I was thinking about why we would spend so much time >>> doing the >>> ->owners list, and obviously it's because of the massive amount of >>> timewait sockets on the owners list. I wrote the following dumb >>> patch >>> and tested it and the problem has disappeared completely. Now I >>> don't >>> know if this is right at all, but I thought it was weird we weren't >>> copying the soreuseport option from the original socket onto the >>> twsk. >>> Is there are reason we aren't doing this currently? Does this help >>> explain what is happening? 
>>> Thanks,
>>
>> The patch is interesting and a good clue, but I am immediately a bit
>> concerned that we don't copy/tag the socket with the uid also to keep
>> the security properties for SO_REUSEPORT. I have to think a bit more
>> about this.
>>
>> We have seen hangs during connect. I am afraid this patch wouldn't
>> help there while also guaranteeing uniqueness.
>
> Yeah, so I looked at the code some more and actually my patch is really
> bad. If sk2->sk_reuseport is set we'll look at sk2->sk_reuseport_cb,
> which is outside of the timewait sock, so that's definitely bad.
>
> But we should at least be setting it to 0 so that we don't do this
> normally. Unfortunately, simply setting it to 0 doesn't fix the
> problem. So for some reason having ->sk_reuseport set to 1 on a
> timewait socket makes this problem non-existent, which is strange.
>
> So back to the drawing board, I guess. I wonder if doing what Craig
> suggested and batching the timewait timer expiries so it hurts less
> would accomplish the same results. Thanks,

Wait, no, I lied: we access sk->sk_reuseport_cb, not sk2's. This is the
code

        if ((!reuse || !sk2->sk_reuse ||
            sk2->sk_state == TCP_LISTEN) &&
            (!reuseport || !sk2->sk_reuseport ||
            rcu_access_pointer(sk->sk_reuseport_cb) ||
            (sk2->sk_state != TCP_TIME_WAIT &&
            !uid_eq(uid, sock_i_uid(sk2))))) {

                if (!sk2->sk_rcv_saddr || !sk->sk_rcv_saddr ||
                    sk2->sk_rcv_saddr == sk->sk_rcv_saddr)
                        break;
        }

so in my patch's case we now have reuseport == 1, sk2->sk_reuseport == 1.
But now we are using reuseport, so sk->sk_reuseport_cb should be
non-NULL, right? So really, setting the timewait sock's sk_reuseport
should have no bearing on how this loop plays out, right? Thanks,

Josef

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Soft lockup in inet_put_port on 4.6 2016-12-16 15:21 ` Josef Bacik @ 2016-12-16 22:08 ` Josef Bacik 2016-12-16 22:18 ` Tom Herbert 0 siblings, 1 reply; 32+ messages in thread From: Josef Bacik @ 2016-12-16 22:08 UTC (permalink / raw) To: Hannes Frederic Sowa Cc: Tom Herbert, Craig Gallek, Eric Dumazet, Linux Kernel Network Developers On Fri, Dec 16, 2016 at 10:21 AM, Josef Bacik <jbacik@fb.com> wrote: > On Fri, Dec 16, 2016 at 9:54 AM, Josef Bacik <jbacik@fb.com> wrote: >> On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa >> <hannes@stressinduktion.org> wrote: >>> Hi Josef, >>> >>> On 15.12.2016 19:53, Josef Bacik wrote: >>>> On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert >>>> <tom@herbertland.com> wrote: >>>>> On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek >>>>> <kraigatgoog@gmail.com> >>>>> wrote: >>>>>> On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert >>>>>> <tom@herbertland.com> >>>>>> wrote: >>>>>>> I think there may be some suspicious code in >>>>>>> inet_csk_get_port. At >>>>>>> tb_found there is: >>>>>>> >>>>>>> if (((tb->fastreuse > 0 && reuse) || >>>>>>> (tb->fastreuseport > 0 && >>>>>>> >>>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) && >>>>>>> sk->sk_reuseport && uid_eq(tb->fastuid, >>>>>>> uid))) && >>>>>>> smallest_size == -1) >>>>>>> goto success; >>>>>>> if >>>>>>> (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, >>>>>>> tb, true)) { >>>>>>> if ((reuse || >>>>>>> (tb->fastreuseport > 0 && >>>>>>> sk->sk_reuseport && >>>>>>> >>>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) && >>>>>>> uid_eq(tb->fastuid, uid))) && >>>>>>> smallest_size != -1 && --attempts >>>>>>> >= 0) { >>>>>>> spin_unlock_bh(&head->lock); >>>>>>> goto again; >>>>>>> } >>>>>>> goto fail_unlock; >>>>>>> } >>>>>>> >>>>>>> AFAICT there is redundancy in these two conditionals. 
The >>>>>>> same clause >>>>>>> is being checked in both: (tb->fastreuseport > 0 && >>>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport >>>>>>> && >>>>>>> uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is >>>>>>> true the >>>>>>> first conditional should be hit, goto done, and the second >>>>>>> will never >>>>>>> evaluate that part to true-- unless the sk is changed (do we >>>>>>> need >>>>>>> READ_ONCE for sk->sk_reuseport_cb?). >>>>>> That's an interesting point... It looks like this function also >>>>>> changed in 4.6 from using a single local_bh_disable() at the >>>>>> beginning >>>>>> with several spin_lock(&head->lock) to exclusively >>>>>> spin_lock_bh(&head->lock) at each locking point. Perhaps the >>>>>> full bh >>>>>> disable variant was preventing the timers in your stack trace >>>>>> from >>>>>> running interleaved with this function before? >>>>> >>>>> Could be, although dropping the lock shouldn't be able to affect >>>>> the >>>>> search state. TBH, I'm a little lost in reading function, the >>>>> SO_REUSEPORT handling is pretty complicated. For instance, >>>>> rcu_access_pointer(sk->sk_reuseport_cb) is checked three times >>>>> in that >>>>> function and also in every call to inet_csk_bind_conflict. I >>>>> wonder if >>>>> we can simply this under the assumption that SO_REUSEPORT is only >>>>> allowed if the port number (snum) is explicitly specified. >>>> >>>> Ok first I have data for you Hannes, here's the time distributions >>>> before during and after the lockup (with all the debugging in >>>> place the >>>> box eventually recovers). I've attached it as a text file since >>>> it is >>>> long. >>> >>> Thanks a lot! >>> >>>> Second is I was thinking about why we would spend so much time >>>> doing the >>>> ->owners list, and obviously it's because of the massive amount of >>>> timewait sockets on the owners list. I wrote the following dumb >>>> patch >>>> and tested it and the problem has disappeared completely. 
Now I >>>> don't >>>> know if this is right at all, but I thought it was weird we >>>> weren't >>>> copying the soreuseport option from the original socket onto the >>>> twsk. >>>> Is there are reason we aren't doing this currently? Does this >>>> help >>>> explain what is happening? Thanks, >>> >>> The patch is interesting and a good clue, but I am immediately a bit >>> concerned that we don't copy/tag the socket with the uid also to >>> keep >>> the security properties for SO_REUSEPORT. I have to think a bit more >>> about this. >>> >>> We have seen hangs during connect. I am afraid this patch wouldn't >>> help >>> there while also guaranteeing uniqueness. >> >> >> Yeah so I looked at the code some more and actually my patch is >> really bad. If sk2->sk_reuseport is set we'll look at >> sk2->sk_reuseport_cb, which is outside of the timewait sock, so >> that's definitely bad. >> >> But we should at least be setting it to 0 so that we don't do this >> normally. Unfortunately simply setting it to 0 doesn't fix the >> problem. So for some reason having ->sk_reuseport set to 1 on a >> timewait socket makes this problem non-existent, which is strange. >> >> So back to the drawing board I guess. I wonder if doing what craig >> suggested and batching the timewait timer expires so it hurts less >> would accomplish the same results. Thanks, > > Wait no I lied, we access the sk->sk_reuseport_cb, not sk2's. This > is the code > > if ((!reuse || !sk2->sk_reuse || > sk2->sk_state == TCP_LISTEN) && > (!reuseport || !sk2->sk_reuseport || > rcu_access_pointer(sk->sk_reuseport_cb) || > (sk2->sk_state != TCP_TIME_WAIT && > !uid_eq(uid, sock_i_uid(sk2))))) { > > if (!sk2->sk_rcv_saddr || > !sk->sk_rcv_saddr || > sk2->sk_rcv_saddr == > sk->sk_rcv_saddr) > break; > } > > so in my patches case we now have reuseport == 1, sk2->sk_reuseport > == 1. But now we are using reuseport, so sk->sk_reuseport_cb should > be non-NULL right? 
> So really, setting the timewait sock's sk_reuseport should have no
> bearing on how this loop plays out, right? Thanks,

So more messing around, and I noticed that we basically don't do the
tb->fastreuseport logic at all if we've ended up with a non-SO_REUSEPORT
socket on that tb. So before I fully understood what I was doing, I
fixed it so that after we go through ->bind_conflict() once with a
SO_REUSEPORT socket, we reset tb->fastreuseport to 1 and set the uid to
match the uid of the socket. This made the problem go away. Tom pointed
out that if we bind to the same port on a different address and we have
a non-SO_REUSEPORT socket with the same address on this tb, then we'd be
screwed with my code.

Which brings me to his proposed solution. We need another hash table
that is indexed based on the binding address. Then each node corresponds
to one address/port binding, with non-SO_REUSEPORT entries having only
one entry, and normal SO_REUSEPORT entries having many. This cleans up
the need to search all the possible sockets on any given tb; we just go
and look at the one we care about. Does this make sense? Thanks,

Josef

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Soft lockup in inet_put_port on 4.6 2016-12-16 22:08 ` Josef Bacik @ 2016-12-16 22:18 ` Tom Herbert 2016-12-16 22:50 ` Josef Bacik 0 siblings, 1 reply; 32+ messages in thread From: Tom Herbert @ 2016-12-16 22:18 UTC (permalink / raw) To: Josef Bacik Cc: Hannes Frederic Sowa, Craig Gallek, Eric Dumazet, Linux Kernel Network Developers On Fri, Dec 16, 2016 at 2:08 PM, Josef Bacik <jbacik@fb.com> wrote: > On Fri, Dec 16, 2016 at 10:21 AM, Josef Bacik <jbacik@fb.com> wrote: >> >> On Fri, Dec 16, 2016 at 9:54 AM, Josef Bacik <jbacik@fb.com> wrote: >>> >>> On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa >>> <hannes@stressinduktion.org> wrote: >>>> >>>> Hi Josef, >>>> >>>> On 15.12.2016 19:53, Josef Bacik wrote: >>>>> >>>>> On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert <tom@herbertland.com> >>>>> wrote: >>>>>> >>>>>> On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek <kraigatgoog@gmail.com> >>>>>> wrote: >>>>>>> >>>>>>> On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert <tom@herbertland.com> >>>>>>> wrote: >>>>>>>> >>>>>>>> I think there may be some suspicious code in inet_csk_get_port. At >>>>>>>> tb_found there is: >>>>>>>> >>>>>>>> if (((tb->fastreuse > 0 && reuse) || >>>>>>>> (tb->fastreuseport > 0 && >>>>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) && >>>>>>>> sk->sk_reuseport && uid_eq(tb->fastuid, >>>>>>>> uid))) && >>>>>>>> smallest_size == -1) >>>>>>>> goto success; >>>>>>>> if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, >>>>>>>> tb, true)) { >>>>>>>> if ((reuse || >>>>>>>> (tb->fastreuseport > 0 && >>>>>>>> sk->sk_reuseport && >>>>>>>> >>>>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) && >>>>>>>> uid_eq(tb->fastuid, uid))) && >>>>>>>> smallest_size != -1 && --attempts >= >>>>>>>> 0) { >>>>>>>> spin_unlock_bh(&head->lock); >>>>>>>> goto again; >>>>>>>> } >>>>>>>> goto fail_unlock; >>>>>>>> } >>>>>>>> >>>>>>>> AFAICT there is redundancy in these two conditionals. 
The same >>>>>>>> clause >>>>>>>> is being checked in both: (tb->fastreuseport > 0 && >>>>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport && >>>>>>>> uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is true >>>>>>>> the >>>>>>>> first conditional should be hit, goto done, and the second will >>>>>>>> never >>>>>>>> evaluate that part to true-- unless the sk is changed (do we need >>>>>>>> READ_ONCE for sk->sk_reuseport_cb?). >>>>>>> >>>>>>> That's an interesting point... It looks like this function also >>>>>>> changed in 4.6 from using a single local_bh_disable() at the >>>>>>> beginning >>>>>>> with several spin_lock(&head->lock) to exclusively >>>>>>> spin_lock_bh(&head->lock) at each locking point. Perhaps the full >>>>>>> bh >>>>>>> disable variant was preventing the timers in your stack trace from >>>>>>> running interleaved with this function before? >>>>>> >>>>>> >>>>>> Could be, although dropping the lock shouldn't be able to affect the >>>>>> search state. TBH, I'm a little lost in reading function, the >>>>>> SO_REUSEPORT handling is pretty complicated. For instance, >>>>>> rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in >>>>>> that >>>>>> function and also in every call to inet_csk_bind_conflict. I wonder >>>>>> if >>>>>> we can simply this under the assumption that SO_REUSEPORT is only >>>>>> allowed if the port number (snum) is explicitly specified. >>>>> >>>>> >>>>> Ok first I have data for you Hannes, here's the time distributions >>>>> before during and after the lockup (with all the debugging in place >>>>> the >>>>> box eventually recovers). I've attached it as a text file since it is >>>>> long. >>>> >>>> >>>> Thanks a lot! >>>> >>>>> Second is I was thinking about why we would spend so much time doing >>>>> the >>>>> ->owners list, and obviously it's because of the massive amount of >>>>> timewait sockets on the owners list. 
I wrote the following dumb patch >>>>> and tested it and the problem has disappeared completely. Now I don't >>>>> know if this is right at all, but I thought it was weird we weren't >>>>> copying the soreuseport option from the original socket onto the twsk. >>>>> Is there are reason we aren't doing this currently? Does this help >>>>> explain what is happening? Thanks, >>>> >>>> >>>> The patch is interesting and a good clue, but I am immediately a bit >>>> concerned that we don't copy/tag the socket with the uid also to keep >>>> the security properties for SO_REUSEPORT. I have to think a bit more >>>> about this. >>>> >>>> We have seen hangs during connect. I am afraid this patch wouldn't help >>>> there while also guaranteeing uniqueness. >>> >>> >>> >>> Yeah so I looked at the code some more and actually my patch is really >>> bad. If sk2->sk_reuseport is set we'll look at sk2->sk_reuseport_cb, which >>> is outside of the timewait sock, so that's definitely bad. >>> >>> But we should at least be setting it to 0 so that we don't do this >>> normally. Unfortunately simply setting it to 0 doesn't fix the problem. So >>> for some reason having ->sk_reuseport set to 1 on a timewait socket makes >>> this problem non-existent, which is strange. >>> >>> So back to the drawing board I guess. I wonder if doing what craig >>> suggested and batching the timewait timer expires so it hurts less would >>> accomplish the same results. Thanks, >> >> >> Wait no I lied, we access the sk->sk_reuseport_cb, not sk2's. This is the >> code >> >> if ((!reuse || !sk2->sk_reuse || >> sk2->sk_state == TCP_LISTEN) && >> (!reuseport || !sk2->sk_reuseport || >> rcu_access_pointer(sk->sk_reuseport_cb) || >> (sk2->sk_state != TCP_TIME_WAIT && >> !uid_eq(uid, sock_i_uid(sk2))))) { >> >> if (!sk2->sk_rcv_saddr || !sk->sk_rcv_saddr >> || >> sk2->sk_rcv_saddr == sk->sk_rcv_saddr) >> break; >> } >> >> so in my patches case we now have reuseport == 1, sk2->sk_reuseport == 1. 
>> But now we are using reuseport, so sk->sk_reuseport_cb should be
>> non-NULL, right? So really, setting the timewait sock's sk_reuseport
>> should have no bearing on how this loop plays out, right? Thanks,
>
> So more messing around, and I noticed that we basically don't do the
> tb->fastreuseport logic at all if we've ended up with a
> non-SO_REUSEPORT socket on that tb. So before I fully understood what
> I was doing, I fixed it so that after we go through ->bind_conflict()
> once with a SO_REUSEPORT socket, we reset tb->fastreuseport to 1 and
> set the uid to match the uid of the socket. This made the problem go
> away. Tom pointed out that if we bind to the same port on a different
> address and we have a non-SO_REUSEPORT socket with the same address on
> this tb, then we'd be screwed with my code.
>
> Which brings me to his proposed solution. We need another hash table
> that is indexed based on the binding address. Then each node
> corresponds to one address/port binding, with non-SO_REUSEPORT entries
> having only one entry, and normal SO_REUSEPORT entries having many.
> This cleans up the need to search all the possible sockets on any
> given tb; we just go and look at the one we care about. Does this make
> sense? Thanks,

Hi Josef,

Thinking about it some more, the hash table won't work because of the
rules of binding different addresses to the same port. What I think we
can do is to change inet_bind_bucket to be a structure that contains all
the information used to detect conflicts (reuse*, if, address, uid,
etc.) and a list of sockets that share that exact same information --
for instance, all sockets in timewait state created through some
listener socket should wind up on a single bucket. When we do the
bind_conflict function we should only have to walk these buckets, not
the full list of sockets.

Thoughts on this? Thanks,
Tom

> Josef

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Soft lockup in inet_put_port on 4.6 2016-12-16 22:18 ` Tom Herbert @ 2016-12-16 22:50 ` Josef Bacik 2016-12-17 11:08 ` Hannes Frederic Sowa 0 siblings, 1 reply; 32+ messages in thread From: Josef Bacik @ 2016-12-16 22:50 UTC (permalink / raw) To: Tom Herbert Cc: Hannes Frederic Sowa, Craig Gallek, Eric Dumazet, Linux Kernel Network Developers On Fri, Dec 16, 2016 at 5:18 PM, Tom Herbert <tom@herbertland.com> wrote: > On Fri, Dec 16, 2016 at 2:08 PM, Josef Bacik <jbacik@fb.com> wrote: >> On Fri, Dec 16, 2016 at 10:21 AM, Josef Bacik <jbacik@fb.com> wrote: >>> >>> On Fri, Dec 16, 2016 at 9:54 AM, Josef Bacik <jbacik@fb.com> wrote: >>>> >>>> On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa >>>> <hannes@stressinduktion.org> wrote: >>>>> >>>>> Hi Josef, >>>>> >>>>> On 15.12.2016 19:53, Josef Bacik wrote: >>>>>> >>>>>> On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert >>>>>> <tom@herbertland.com> >>>>>> wrote: >>>>>>> >>>>>>> On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek >>>>>>> <kraigatgoog@gmail.com> >>>>>>> wrote: >>>>>>>> >>>>>>>> On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert >>>>>>>> <tom@herbertland.com> >>>>>>>> wrote: >>>>>>>>> >>>>>>>>> I think there may be some suspicious code in >>>>>>>>> inet_csk_get_port. 
At >>>>>>>>> tb_found there is: >>>>>>>>> >>>>>>>>> if (((tb->fastreuse > 0 && reuse) || >>>>>>>>> (tb->fastreuseport > 0 && >>>>>>>>> >>>>>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) && >>>>>>>>> sk->sk_reuseport && >>>>>>>>> uid_eq(tb->fastuid, >>>>>>>>> uid))) && >>>>>>>>> smallest_size == -1) >>>>>>>>> goto success; >>>>>>>>> if >>>>>>>>> (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, >>>>>>>>> tb, true)) { >>>>>>>>> if ((reuse || >>>>>>>>> (tb->fastreuseport > 0 && >>>>>>>>> sk->sk_reuseport && >>>>>>>>> >>>>>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) && >>>>>>>>> uid_eq(tb->fastuid, uid))) && >>>>>>>>> smallest_size != -1 && >>>>>>>>> --attempts >= >>>>>>>>> 0) { >>>>>>>>> >>>>>>>>> spin_unlock_bh(&head->lock); >>>>>>>>> goto again; >>>>>>>>> } >>>>>>>>> goto fail_unlock; >>>>>>>>> } >>>>>>>>> >>>>>>>>> AFAICT there is redundancy in these two conditionals. The >>>>>>>>> same >>>>>>>>> clause >>>>>>>>> is being checked in both: (tb->fastreuseport > 0 && >>>>>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) && >>>>>>>>> sk->sk_reuseport && >>>>>>>>> uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this >>>>>>>>> is true >>>>>>>>> the >>>>>>>>> first conditional should be hit, goto done, and the >>>>>>>>> second will >>>>>>>>> never >>>>>>>>> evaluate that part to true-- unless the sk is changed (do >>>>>>>>> we need >>>>>>>>> READ_ONCE for sk->sk_reuseport_cb?). >>>>>>>> >>>>>>>> That's an interesting point... It looks like this function >>>>>>>> also >>>>>>>> changed in 4.6 from using a single local_bh_disable() at the >>>>>>>> beginning >>>>>>>> with several spin_lock(&head->lock) to exclusively >>>>>>>> spin_lock_bh(&head->lock) at each locking point. Perhaps >>>>>>>> the full >>>>>>>> bh >>>>>>>> disable variant was preventing the timers in your stack >>>>>>>> trace from >>>>>>>> running interleaved with this function before? >>>>>>> >>>>>>> >>>>>>> Could be, although dropping the lock shouldn't be able to >>>>>>> affect the >>>>>>> search state. 
TBH, I'm a little lost in reading function, the >>>>>>> SO_REUSEPORT handling is pretty complicated. For instance, >>>>>>> rcu_access_pointer(sk->sk_reuseport_cb) is checked three >>>>>>> times in >>>>>>> that >>>>>>> function and also in every call to inet_csk_bind_conflict. I >>>>>>> wonder >>>>>>> if >>>>>>> we can simply this under the assumption that SO_REUSEPORT is >>>>>>> only >>>>>>> allowed if the port number (snum) is explicitly specified. >>>>>> >>>>>> >>>>>> Ok first I have data for you Hannes, here's the time >>>>>> distributions >>>>>> before during and after the lockup (with all the debugging in >>>>>> place >>>>>> the >>>>>> box eventually recovers). I've attached it as a text file >>>>>> since it is >>>>>> long. >>>>> >>>>> >>>>> Thanks a lot! >>>>> >>>>>> Second is I was thinking about why we would spend so much time >>>>>> doing >>>>>> the >>>>>> ->owners list, and obviously it's because of the massive >>>>>> amount of >>>>>> timewait sockets on the owners list. I wrote the following >>>>>> dumb patch >>>>>> and tested it and the problem has disappeared completely. Now >>>>>> I don't >>>>>> know if this is right at all, but I thought it was weird we >>>>>> weren't >>>>>> copying the soreuseport option from the original socket onto >>>>>> the twsk. >>>>>> Is there are reason we aren't doing this currently? Does this >>>>>> help >>>>>> explain what is happening? Thanks, >>>>> >>>>> >>>>> The patch is interesting and a good clue, but I am immediately a >>>>> bit >>>>> concerned that we don't copy/tag the socket with the uid also to >>>>> keep >>>>> the security properties for SO_REUSEPORT. I have to think a bit >>>>> more >>>>> about this. >>>>> >>>>> We have seen hangs during connect. I am afraid this patch >>>>> wouldn't help >>>>> there while also guaranteeing uniqueness. >>>> >>>> >>>> >>>> Yeah so I looked at the code some more and actually my patch is >>>> really >>>> bad. 
If sk2->sk_reuseport is set we'll look at >>>> sk2->sk_reuseport_cb, which >>>> is outside of the timewait sock, so that's definitely bad. >>>> >>>> But we should at least be setting it to 0 so that we don't do this >>>> normally. Unfortunately simply setting it to 0 doesn't fix the >>>> problem. So >>>> for some reason having ->sk_reuseport set to 1 on a timewait >>>> socket makes >>>> this problem non-existent, which is strange. >>>> >>>> So back to the drawing board I guess. I wonder if doing what >>>> craig >>>> suggested and batching the timewait timer expires so it hurts >>>> less would >>>> accomplish the same results. Thanks, >>> >>> >>> Wait no I lied, we access the sk->sk_reuseport_cb, not sk2's. >>> This is the >>> code >>> >>> if ((!reuse || !sk2->sk_reuse || >>> sk2->sk_state == TCP_LISTEN) && >>> (!reuseport || !sk2->sk_reuseport || >>> >>> rcu_access_pointer(sk->sk_reuseport_cb) || >>> (sk2->sk_state != TCP_TIME_WAIT && >>> !uid_eq(uid, sock_i_uid(sk2))))) { >>> >>> if (!sk2->sk_rcv_saddr || >>> !sk->sk_rcv_saddr >>> || >>> sk2->sk_rcv_saddr == >>> sk->sk_rcv_saddr) >>> break; >>> } >>> >>> so in my patches case we now have reuseport == 1, >>> sk2->sk_reuseport == 1. >>> But now we are using reuseport, so sk->sk_reuseport_cb should be >>> non-NULL >>> right? So really setting the timewait sock's sk_reuseport should >>> have no >>> bearing on how this loop plays out right? Thanks, >> >> >> >> So more messing around and I noticed that we basically don't do the >> tb->fastreuseport logic at all if we've ended up with a non >> SO_REUSEPORT >> socket on that tb. So before I fully understood what I was doing I >> fixed it >> so that after we go through ->bind_conflict() once with a >> SO_REUSEPORT >> socket, we reset tb->fastreuseport to 1 and set the uid to match >> the uid of >> the socket. This made the problem go away. 
Tom pointed out that >> if we bind >> to the same port on a different address and we have a non >> SO_REUSEPORT >> socket with the same address on this tb then we'd be screwed with >> my code. >> >> Which brings me to his proposed solution. We need another hash >> table that >> is indexed based on the binding address. Then each node >> corresponds to one >> address/port binding, with non-SO_REUSEPORT entries having only one >> entry, >> and normal SO_REUSEPORT entries having many. This cleans up the >> need to >> search all the possible sockets on any given tb, we just go and >> look at the >> one we care about. Does this make sense? Thanks, >> > Hi Josef, > > Thinking about it some more the hash table won't work because of the > rules of binding different addresses to the same port. What I think we > can do is to change inet_bind_bucket to be structure that contains all > the information used to detect conflicts (reuse*, if, address, uid, > etc.) and a list of sockets that share that exact same information-- > for instance all socket in timewait state create through some listener > socket should wind up on single bucket. When we do the bind_conflict > function we only should have to walk this buckets, not the full list > of sockets. > > Thoughts on this? This sounds good, maybe tb->owners be a list of say struct inet_unique_shit { struct sock_common sk; struct hlist socks; }; Then we make inet_unique_shit like twsks', just copy the relevant information, then hang the real sockets off of the socks hlist. Something like that? Thanks, Josef ^ permalink raw reply [flat|nested] 32+ messages in thread
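The bucket layout proposed above can be modeled in userspace. A rough Python sketch (all names, such as `conflict_key` and `BindBucket`, are illustrative, not the kernel's structures) of how keeping one node per distinct set of conflict-relevant fields collapses the walk:

```python
# Rough model of the proposed inet_bind_bucket layout: one node per
# distinct set of conflict-relevant fields, with all sockets sharing
# those fields (e.g. the timewait children of one listener) hung off
# that node. Names are illustrative, not the kernel's.

from collections import defaultdict

def conflict_key(sk):
    # The fields the conflict check actually looks at.
    return (sk["reuse"], sk["reuseport"], sk["if"], sk["addr"], sk["uid"])

class BindBucket:
    def __init__(self):
        self.nodes = defaultdict(list)   # conflict_key -> sockets

    def add(self, sk):
        self.nodes[conflict_key(sk)].append(sk)

    def walk_length(self):
        # A conflict check now walks one entry per node, not per socket.
        return len(self.nodes)

tb = BindBucket()
# 100k timewait sockets left behind by one listener share one key:
for _ in range(100_000):
    tb.add({"reuse": 1, "reuseport": 1, "if": 0, "addr": "10.0.0.1", "uid": 0})
print(tb.walk_length())   # 1, instead of 100000
```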
* Re: Soft lockup in inet_put_port on 4.6 2016-12-16 22:50 ` Josef Bacik @ 2016-12-17 11:08 ` Hannes Frederic Sowa 2016-12-17 13:26 ` Josef Bacik 0 siblings, 1 reply; 32+ messages in thread From: Hannes Frederic Sowa @ 2016-12-17 11:08 UTC (permalink / raw) To: Josef Bacik, Tom Herbert Cc: Craig Gallek, Eric Dumazet, Linux Kernel Network Developers On 16.12.2016 23:50, Josef Bacik wrote: > On Fri, Dec 16, 2016 at 5:18 PM, Tom Herbert <tom@herbertland.com> wrote: >> On Fri, Dec 16, 2016 at 2:08 PM, Josef Bacik <jbacik@fb.com> wrote: >>> On Fri, Dec 16, 2016 at 10:21 AM, Josef Bacik <jbacik@fb.com> wrote: >>>> >>>> On Fri, Dec 16, 2016 at 9:54 AM, Josef Bacik <jbacik@fb.com> wrote: >>>>> >>>>> On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa >>>>> <hannes@stressinduktion.org> wrote: >>>>>> >>>>>> Hi Josef, >>>>>> >>>>>> On 15.12.2016 19:53, Josef Bacik wrote: >>>>>>> >>>>>>> On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert <tom@herbertland.com> >>>>>>> wrote: >>>>>>>> >>>>>>>> On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek >>>>>>>> <kraigatgoog@gmail.com> >>>>>>>> wrote: >>>>>>>>> >>>>>>>>> On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert >>>>>>>>> <tom@herbertland.com> >>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>> I think there may be some suspicious code in >>>>>>>>>> inet_csk_get_port. 
At >>>>>>>>>> tb_found there is: >>>>>>>>>> >>>>>>>>>> if (((tb->fastreuse > 0 && reuse) || >>>>>>>>>> (tb->fastreuseport > 0 && >>>>>>>>>> >>>>>>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) && >>>>>>>>>> sk->sk_reuseport && uid_eq(tb->fastuid, >>>>>>>>>> uid))) && >>>>>>>>>> smallest_size == -1) >>>>>>>>>> goto success; >>>>>>>>>> if >>>>>>>>>> (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, >>>>>>>>>> tb, true)) { >>>>>>>>>> if ((reuse || >>>>>>>>>> (tb->fastreuseport > 0 && >>>>>>>>>> sk->sk_reuseport && >>>>>>>>>> >>>>>>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) && >>>>>>>>>> uid_eq(tb->fastuid, uid))) && >>>>>>>>>> smallest_size != -1 && >>>>>>>>>> --attempts >= >>>>>>>>>> 0) { >>>>>>>>>> spin_unlock_bh(&head->lock); >>>>>>>>>> goto again; >>>>>>>>>> } >>>>>>>>>> goto fail_unlock; >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> AFAICT there is redundancy in these two conditionals. The >>>>>>>>>> same >>>>>>>>>> clause >>>>>>>>>> is being checked in both: (tb->fastreuseport > 0 && >>>>>>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) && >>>>>>>>>> sk->sk_reuseport && >>>>>>>>>> uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this >>>>>>>>>> is true >>>>>>>>>> the >>>>>>>>>> first conditional should be hit, goto done, and the second >>>>>>>>>> will >>>>>>>>>> never >>>>>>>>>> evaluate that part to true-- unless the sk is changed (do >>>>>>>>>> we need >>>>>>>>>> READ_ONCE for sk->sk_reuseport_cb?). >>>>>>>>> >>>>>>>>> That's an interesting point... It looks like this function also >>>>>>>>> changed in 4.6 from using a single local_bh_disable() at the >>>>>>>>> beginning >>>>>>>>> with several spin_lock(&head->lock) to exclusively >>>>>>>>> spin_lock_bh(&head->lock) at each locking point. Perhaps >>>>>>>>> the full >>>>>>>>> bh >>>>>>>>> disable variant was preventing the timers in your stack >>>>>>>>> trace from >>>>>>>>> running interleaved with this function before? 
>>>>>>>> >>>>>>>> >>>>>>>> Could be, although dropping the lock shouldn't be able to >>>>>>>> affect the >>>>>>>> search state. TBH, I'm a little lost in reading function, the >>>>>>>> SO_REUSEPORT handling is pretty complicated. For instance, >>>>>>>> rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in >>>>>>>> that >>>>>>>> function and also in every call to inet_csk_bind_conflict. I >>>>>>>> wonder >>>>>>>> if >>>>>>>> we can simply this under the assumption that SO_REUSEPORT is only >>>>>>>> allowed if the port number (snum) is explicitly specified. >>>>>>> >>>>>>> >>>>>>> Ok first I have data for you Hannes, here's the time distributions >>>>>>> before during and after the lockup (with all the debugging in >>>>>>> place >>>>>>> the >>>>>>> box eventually recovers). I've attached it as a text file >>>>>>> since it is >>>>>>> long. >>>>>> >>>>>> >>>>>> Thanks a lot! >>>>>> >>>>>>> Second is I was thinking about why we would spend so much time >>>>>>> doing >>>>>>> the >>>>>>> ->owners list, and obviously it's because of the massive amount of >>>>>>> timewait sockets on the owners list. I wrote the following >>>>>>> dumb patch >>>>>>> and tested it and the problem has disappeared completely. Now >>>>>>> I don't >>>>>>> know if this is right at all, but I thought it was weird we >>>>>>> weren't >>>>>>> copying the soreuseport option from the original socket onto >>>>>>> the twsk. >>>>>>> Is there are reason we aren't doing this currently? Does this >>>>>>> help >>>>>>> explain what is happening? Thanks, >>>>>> >>>>>> >>>>>> The patch is interesting and a good clue, but I am immediately a bit >>>>>> concerned that we don't copy/tag the socket with the uid also to >>>>>> keep >>>>>> the security properties for SO_REUSEPORT. I have to think a bit more >>>>>> about this. >>>>>> >>>>>> We have seen hangs during connect. I am afraid this patch >>>>>> wouldn't help >>>>>> there while also guaranteeing uniqueness. 
>>>>> >>>>> >>>>> >>>>> Yeah so I looked at the code some more and actually my patch is >>>>> really >>>>> bad. If sk2->sk_reuseport is set we'll look at >>>>> sk2->sk_reuseport_cb, which >>>>> is outside of the timewait sock, so that's definitely bad. >>>>> >>>>> But we should at least be setting it to 0 so that we don't do this >>>>> normally. Unfortunately simply setting it to 0 doesn't fix the >>>>> problem. So >>>>> for some reason having ->sk_reuseport set to 1 on a timewait >>>>> socket makes >>>>> this problem non-existent, which is strange. >>>>> >>>>> So back to the drawing board I guess. I wonder if doing what craig >>>>> suggested and batching the timewait timer expires so it hurts less >>>>> would >>>>> accomplish the same results. Thanks, >>>> >>>> >>>> Wait no I lied, we access the sk->sk_reuseport_cb, not sk2's. This >>>> is the >>>> code >>>> >>>> if ((!reuse || !sk2->sk_reuse || >>>> sk2->sk_state == TCP_LISTEN) && >>>> (!reuseport || !sk2->sk_reuseport || >>>> rcu_access_pointer(sk->sk_reuseport_cb) || >>>> (sk2->sk_state != TCP_TIME_WAIT && >>>> !uid_eq(uid, sock_i_uid(sk2))))) { >>>> >>>> if (!sk2->sk_rcv_saddr || >>>> !sk->sk_rcv_saddr >>>> || >>>> sk2->sk_rcv_saddr == >>>> sk->sk_rcv_saddr) >>>> break; >>>> } >>>> >>>> so in my patches case we now have reuseport == 1, sk2->sk_reuseport >>>> == 1. >>>> But now we are using reuseport, so sk->sk_reuseport_cb should be >>>> non-NULL >>>> right? So really setting the timewait sock's sk_reuseport should >>>> have no >>>> bearing on how this loop plays out right? Thanks, >>> >>> >>> >>> So more messing around and I noticed that we basically don't do the >>> tb->fastreuseport logic at all if we've ended up with a non >>> SO_REUSEPORT >>> socket on that tb. So before I fully understood what I was doing I >>> fixed it >>> so that after we go through ->bind_conflict() once with a SO_REUSEPORT >>> socket, we reset tb->fastreuseport to 1 and set the uid to match the >>> uid of >>> the socket. 
This made the problem go away. Tom pointed out that if >>> we bind >>> to the same port on a different address and we have a non SO_REUSEPORT >>> socket with the same address on this tb then we'd be screwed with my >>> code. >>> >>> Which brings me to his proposed solution. We need another hash >>> table that >>> is indexed based on the binding address. Then each node corresponds >>> to one >>> address/port binding, with non-SO_REUSEPORT entries having only one >>> entry, >>> and normal SO_REUSEPORT entries having many. This cleans up the >>> need to >>> search all the possible sockets on any given tb, we just go and look >>> at the >>> one we care about. Does this make sense? Thanks, >>> >> Hi Josef, >> >> Thinking about it some more the hash table won't work because of the >> rules of binding different addresses to the same port. What I think we >> can do is to change inet_bind_bucket to be structure that contains all >> the information used to detect conflicts (reuse*, if, address, uid, >> etc.) and a list of sockets that share that exact same information-- >> for instance all socket in timewait state create through some listener >> socket should wind up on single bucket. When we do the bind_conflict >> function we only should have to walk this buckets, not the full list >> of sockets. >> >> Thoughts on this? > > This sounds good, maybe tb->owners be a list of say > > struct inet_unique_shit { > struct sock_common sk; > struct hlist socks; > }; > > Then we make inet_unique_shit like twsks', just copy the relevant > information, then hang the real sockets off of the socks hlist. > Something like that? Thanks, As a counter idea: can we push up a flag to the inet_bind_bucket that we check in the fast- way style that indicates that we have wildcarded non-reuse(port) in there, so we can skip the bind_bucket much more quickly? This wouldn't require a new data structure. Thanks, Hannes ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Soft lockup in inet_put_port on 4.6 2016-12-17 11:08 ` Hannes Frederic Sowa @ 2016-12-17 13:26 ` Josef Bacik 2016-12-20 1:56 ` David Miller 0 siblings, 1 reply; 32+ messages in thread From: Josef Bacik @ 2016-12-17 13:26 UTC (permalink / raw) To: Hannes Frederic Sowa Cc: Tom Herbert, Craig Gallek, Eric Dumazet, Linux Kernel Network Developers > On Dec 17, 2016, at 6:09 AM, Hannes Frederic Sowa <hannes@stressinduktion.org> wrote: > >> On 16.12.2016 23:50, Josef Bacik wrote: >>> On Fri, Dec 16, 2016 at 5:18 PM, Tom Herbert <tom@herbertland.com> wrote: >>>> On Fri, Dec 16, 2016 at 2:08 PM, Josef Bacik <jbacik@fb.com> wrote: >>>>> On Fri, Dec 16, 2016 at 10:21 AM, Josef Bacik <jbacik@fb.com> wrote: >>>>> >>>>>> On Fri, Dec 16, 2016 at 9:54 AM, Josef Bacik <jbacik@fb.com> wrote: >>>>>> >>>>>> On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa >>>>>> <hannes@stressinduktion.org> wrote: >>>>>>> >>>>>>> Hi Josef, >>>>>>> >>>>>>>> On 15.12.2016 19:53, Josef Bacik wrote: >>>>>>>> >>>>>>>> On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert <tom@herbertland.com> >>>>>>>> wrote: >>>>>>>>> >>>>>>>>> On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek >>>>>>>>> <kraigatgoog@gmail.com> >>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>> On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert >>>>>>>>>> <tom@herbertland.com> >>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>> I think there may be some suspicious code in >>>>>>>>>>> inet_csk_get_port. 
At >>>>>>>>>>> tb_found there is: >>>>>>>>>>> >>>>>>>>>>> if (((tb->fastreuse > 0 && reuse) || >>>>>>>>>>> (tb->fastreuseport > 0 && >>>>>>>>>>> >>>>>>>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) && >>>>>>>>>>> sk->sk_reuseport && uid_eq(tb->fastuid, >>>>>>>>>>> uid))) && >>>>>>>>>>> smallest_size == -1) >>>>>>>>>>> goto success; >>>>>>>>>>> if >>>>>>>>>>> (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, >>>>>>>>>>> tb, true)) { >>>>>>>>>>> if ((reuse || >>>>>>>>>>> (tb->fastreuseport > 0 && >>>>>>>>>>> sk->sk_reuseport && >>>>>>>>>>> >>>>>>>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) && >>>>>>>>>>> uid_eq(tb->fastuid, uid))) && >>>>>>>>>>> smallest_size != -1 && >>>>>>>>>>> --attempts >= >>>>>>>>>>> 0) { >>>>>>>>>>> spin_unlock_bh(&head->lock); >>>>>>>>>>> goto again; >>>>>>>>>>> } >>>>>>>>>>> goto fail_unlock; >>>>>>>>>>> } >>>>>>>>>>> >>>>>>>>>>> AFAICT there is redundancy in these two conditionals. The >>>>>>>>>>> same >>>>>>>>>>> clause >>>>>>>>>>> is being checked in both: (tb->fastreuseport > 0 && >>>>>>>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) && >>>>>>>>>>> sk->sk_reuseport && >>>>>>>>>>> uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this >>>>>>>>>>> is true >>>>>>>>>>> the >>>>>>>>>>> first conditional should be hit, goto done, and the second >>>>>>>>>>> will >>>>>>>>>>> never >>>>>>>>>>> evaluate that part to true-- unless the sk is changed (do >>>>>>>>>>> we need >>>>>>>>>>> READ_ONCE for sk->sk_reuseport_cb?). >>>>>>>>>> >>>>>>>>>> That's an interesting point... It looks like this function also >>>>>>>>>> changed in 4.6 from using a single local_bh_disable() at the >>>>>>>>>> beginning >>>>>>>>>> with several spin_lock(&head->lock) to exclusively >>>>>>>>>> spin_lock_bh(&head->lock) at each locking point. Perhaps >>>>>>>>>> the full >>>>>>>>>> bh >>>>>>>>>> disable variant was preventing the timers in your stack >>>>>>>>>> trace from >>>>>>>>>> running interleaved with this function before? 
>>>>>>>>> >>>>>>>>> >>>>>>>>> Could be, although dropping the lock shouldn't be able to >>>>>>>>> affect the >>>>>>>>> search state. TBH, I'm a little lost in reading function, the >>>>>>>>> SO_REUSEPORT handling is pretty complicated. For instance, >>>>>>>>> rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in >>>>>>>>> that >>>>>>>>> function and also in every call to inet_csk_bind_conflict. I >>>>>>>>> wonder >>>>>>>>> if >>>>>>>>> we can simply this under the assumption that SO_REUSEPORT is only >>>>>>>>> allowed if the port number (snum) is explicitly specified. >>>>>>>> >>>>>>>> >>>>>>>> Ok first I have data for you Hannes, here's the time distributions >>>>>>>> before during and after the lockup (with all the debugging in >>>>>>>> place >>>>>>>> the >>>>>>>> box eventually recovers). I've attached it as a text file >>>>>>>> since it is >>>>>>>> long. >>>>>>> >>>>>>> >>>>>>> Thanks a lot! >>>>>>> >>>>>>>> Second is I was thinking about why we would spend so much time >>>>>>>> doing >>>>>>>> the >>>>>>>> ->owners list, and obviously it's because of the massive amount of >>>>>>>> timewait sockets on the owners list. I wrote the following >>>>>>>> dumb patch >>>>>>>> and tested it and the problem has disappeared completely. Now >>>>>>>> I don't >>>>>>>> know if this is right at all, but I thought it was weird we >>>>>>>> weren't >>>>>>>> copying the soreuseport option from the original socket onto >>>>>>>> the twsk. >>>>>>>> Is there are reason we aren't doing this currently? Does this >>>>>>>> help >>>>>>>> explain what is happening? Thanks, >>>>>>> >>>>>>> >>>>>>> The patch is interesting and a good clue, but I am immediately a bit >>>>>>> concerned that we don't copy/tag the socket with the uid also to >>>>>>> keep >>>>>>> the security properties for SO_REUSEPORT. I have to think a bit more >>>>>>> about this. >>>>>>> >>>>>>> We have seen hangs during connect. 
I am afraid this patch >>>>>>> wouldn't help >>>>>>> there while also guaranteeing uniqueness. >>>>>> >>>>>> >>>>>> >>>>>> Yeah so I looked at the code some more and actually my patch is >>>>>> really >>>>>> bad. If sk2->sk_reuseport is set we'll look at >>>>>> sk2->sk_reuseport_cb, which >>>>>> is outside of the timewait sock, so that's definitely bad. >>>>>> >>>>>> But we should at least be setting it to 0 so that we don't do this >>>>>> normally. Unfortunately simply setting it to 0 doesn't fix the >>>>>> problem. So >>>>>> for some reason having ->sk_reuseport set to 1 on a timewait >>>>>> socket makes >>>>>> this problem non-existent, which is strange. >>>>>> >>>>>> So back to the drawing board I guess. I wonder if doing what craig >>>>>> suggested and batching the timewait timer expires so it hurts less >>>>>> would >>>>>> accomplish the same results. Thanks, >>>>> >>>>> >>>>> Wait no I lied, we access the sk->sk_reuseport_cb, not sk2's. This >>>>> is the >>>>> code >>>>> >>>>> if ((!reuse || !sk2->sk_reuse || >>>>> sk2->sk_state == TCP_LISTEN) && >>>>> (!reuseport || !sk2->sk_reuseport || >>>>> rcu_access_pointer(sk->sk_reuseport_cb) || >>>>> (sk2->sk_state != TCP_TIME_WAIT && >>>>> !uid_eq(uid, sock_i_uid(sk2))))) { >>>>> >>>>> if (!sk2->sk_rcv_saddr || >>>>> !sk->sk_rcv_saddr >>>>> || >>>>> sk2->sk_rcv_saddr == >>>>> sk->sk_rcv_saddr) >>>>> break; >>>>> } >>>>> >>>>> so in my patches case we now have reuseport == 1, sk2->sk_reuseport >>>>> == 1. >>>>> But now we are using reuseport, so sk->sk_reuseport_cb should be >>>>> non-NULL >>>>> right? So really setting the timewait sock's sk_reuseport should >>>>> have no >>>>> bearing on how this loop plays out right? Thanks, >>>> >>>> >>>> >>>> So more messing around and I noticed that we basically don't do the >>>> tb->fastreuseport logic at all if we've ended up with a non >>>> SO_REUSEPORT >>>> socket on that tb. 
So before I fully understood what I was doing I >>>> fixed it >>>> so that after we go through ->bind_conflict() once with a SO_REUSEPORT >>>> socket, we reset tb->fastreuseport to 1 and set the uid to match the >>>> uid of >>>> the socket. This made the problem go away. Tom pointed out that if >>>> we bind >>>> to the same port on a different address and we have a non SO_REUSEPORT >>>> socket with the same address on this tb then we'd be screwed with my >>>> code. >>>> >>>> Which brings me to his proposed solution. We need another hash >>>> table that >>>> is indexed based on the binding address. Then each node corresponds >>>> to one >>>> address/port binding, with non-SO_REUSEPORT entries having only one >>>> entry, >>>> and normal SO_REUSEPORT entries having many. This cleans up the >>>> need to >>>> search all the possible sockets on any given tb, we just go and look >>>> at the >>>> one we care about. Does this make sense? Thanks, >>>> >>> Hi Josef, >>> >>> Thinking about it some more the hash table won't work because of the >>> rules of binding different addresses to the same port. What I think we >>> can do is to change inet_bind_bucket to be structure that contains all >>> the information used to detect conflicts (reuse*, if, address, uid, >>> etc.) and a list of sockets that share that exact same information-- >>> for instance all socket in timewait state create through some listener >>> socket should wind up on single bucket. When we do the bind_conflict >>> function we only should have to walk this buckets, not the full list >>> of sockets. >>> >>> Thoughts on this? >> >> This sounds good, maybe tb->owners be a list of say >> >> struct inet_unique_shit { >> struct sock_common sk; >> struct hlist socks; >> }; >> >> Then we make inet_unique_shit like twsks', just copy the relevant >> information, then hang the real sockets off of the socks hlist. >> Something like that? 
Thanks, > > As a counter idea: can we push up a flag to the inet_bind_bucket that we > check in the fast- way style that indicates that we have wildcarded > non-reuse(port) in there, so we can skip the bind_bucket much more > quickly? This wouldn't require a new data structure. So take my current duct tape fix and augment it with more information in the bind bucket? I'm not sure how to make this work without at least having a list of the bound addrs as well to make sure we are really ok. I suppose we could save the fastreuseport address that last succeeded to make it work properly, but I'd have to make it protocol agnostic and then have a callback so the protocol can make sure we don't have to do the bind_conflict run. Is that what you were thinking of? Thanks, Josef ^ permalink raw reply [flat|nested] 32+ messages in thread
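A minimal model of the address-caching variant floated here (field names hypothetical; a real version would need a per-protocol address comparison for IPv4 vs IPv6):

```python
# Sketch: remember the address/uid pair that last passed the
# fastreuseport checks, so an identical later bind can skip the
# bind_conflict walk. Everything here is illustrative layout.

class BindBucket:
    def __init__(self):
        self.fastreuseport = 0
        self.fastuid = None
        self.fast_addr = None    # last address that succeeded

def record_success(tb, sk):
    # Called after a bind has gone through the full conflict check once.
    tb.fastreuseport = 1
    tb.fastuid = sk["uid"]
    tb.fast_addr = sk["addr"]

def fast_path_ok(tb, sk):
    return (tb.fastreuseport and sk["reuseport"]
            and tb.fastuid == sk["uid"]
            and tb.fast_addr == sk["addr"])

tb = BindBucket()
record_success(tb, {"reuseport": True, "uid": 1000, "addr": "10.0.0.1"})
print(fast_path_ok(tb, {"reuseport": True, "uid": 1000, "addr": "10.0.0.1"}))
# True: identical rebind skips the walk
print(fast_path_ok(tb, {"reuseport": True, "uid": 1000, "addr": "10.0.0.2"}))
# False: different address must still run bind_conflict
```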
* Re: Soft lockup in inet_put_port on 4.6 2016-12-17 13:26 ` Josef Bacik @ 2016-12-20 1:56 ` David Miller 2016-12-20 2:07 ` Tom Herbert 0 siblings, 1 reply; 32+ messages in thread From: David Miller @ 2016-12-20 1:56 UTC (permalink / raw) To: jbacik; +Cc: hannes, tom, kraigatgoog, eric.dumazet, netdev From: Josef Bacik <jbacik@fb.com> Date: Sat, 17 Dec 2016 13:26:00 +0000 > So take my current duct tape fix and augment it with more > information in the bind bucket? I'm not sure how to make this work > without at least having a list of the binded addrs as well to make > sure we are really ok. I suppose we could save the fastreuseport > address that last succeeded to make it work properly, but I'd have > to make it protocol agnostic and then have a callback to have the > protocol to make sure we don't have to do the bind_conflict run. Is > that what you were thinking of? Thanks, So there isn't a deadlock or lockup here, something is just running really slow, right? And that "something" is a scan of the sockets on a tb list, and there's lots of timewait sockets hung off of that tb. As far as I can tell, this scan is happening in inet_csk_bind_conflict(). Furthermore, reuseport is somehow required to make this problem happen. How exactly? ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Soft lockup in inet_put_port on 4.6 2016-12-20 1:56 ` David Miller @ 2016-12-20 2:07 ` Tom Herbert 2016-12-20 2:41 ` Eric Dumazet 0 siblings, 1 reply; 32+ messages in thread From: Tom Herbert @ 2016-12-20 2:07 UTC (permalink / raw) To: David Miller Cc: Josef Bacik, Hannes Frederic Sowa, Craig Gallek, Eric Dumazet, Linux Kernel Network Developers On Mon, Dec 19, 2016 at 5:56 PM, David Miller <davem@davemloft.net> wrote: > From: Josef Bacik <jbacik@fb.com> > Date: Sat, 17 Dec 2016 13:26:00 +0000 > >> So take my current duct tape fix and augment it with more >> information in the bind bucket? I'm not sure how to make this work >> without at least having a list of the bound addrs as well to make >> sure we are really ok. I suppose we could save the fastreuseport >> address that last succeeded to make it work properly, but I'd have >> to make it protocol agnostic and then have a callback so the >> protocol can make sure we don't have to do the bind_conflict run. Is >> that what you were thinking of? Thanks, > > So there isn't a deadlock or lockup here, something is just running > really slow, right? > Correct. > And that "something" is a scan of the sockets on a tb list, and > there's lots of timewait sockets hung off of that tb. > Yes. > As far as I can tell, this scan is happening in > inet_csk_bind_conflict(). > Yes. > Furthermore, reuseport is somehow required to make this problem > happen. How exactly? When sockets created with SO_REUSEPORT move to TW state they are placed back on the tb->owners. fastreuseport is no longer set so we have to walk a potentially long list of sockets in tb->owners to open a new listener socket. I imagine this happens when we try to open a new SO_REUSEPORT listener after the system has been running a while and so we hit the long tb->owners list. Tom ^ permalink raw reply [flat|nested] 32+ messages in thread
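The walk Tom describes can be transcribed into userspace Python (simplified fields, not the kernel code). With TIME_WAIT lasting about 60 seconds, a port churning an assumed ~5000 connections/s keeps roughly 300000 sockets on the bucket, and a compatible bind has to inspect every one of them before it is allowed through:

```python
# Userspace transcription of the quoted inet_csk_bind_conflict loop
# (simplified field names; not the kernel code). The slow case is a
# bind that does NOT conflict: the loop has no early exit and walks
# the whole owners list under the bucket lock.

from dataclasses import dataclass

@dataclass
class Sock:
    reuse: bool          # SO_REUSEADDR
    reuseport: bool      # SO_REUSEPORT
    state: str           # "LISTEN", "TIME_WAIT", ...
    uid: int
    addr: int            # 0 = wildcard

def bind_conflict(sk, owners, reuseport_cb_set=False):
    for sk2 in owners:
        # De Morgan of the kernel's conditional: sk2 is compatible if
        # either the SO_REUSEADDR or the SO_REUSEPORT exception applies.
        reuse_ok = sk.reuse and sk2.reuse and sk2.state != "LISTEN"
        reuseport_ok = (sk.reuseport and sk2.reuseport
                        and not reuseport_cb_set
                        and (sk2.state == "TIME_WAIT" or sk.uid == sk2.uid))
        if not reuse_ok and not reuseport_ok:
            # Wildcard on either side, or equal addresses: conflict.
            if not sk2.addr or not sk.addr or sk2.addr == sk.addr:
                return True
    return False

# Timewait leftovers with SO_REUSEADDR set: a new bind that is
# compatible with all of them still visits every entry.
owners = [Sock(True, False, "TIME_WAIT", 0, 1) for _ in range(5000 * 60)]
new_sk = Sock(True, True, "CLOSE", 0, 1)
print(len(owners), bind_conflict(new_sk, owners))   # 300000 False
```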
* Re: Soft lockup in inet_put_port on 4.6 2016-12-20 2:07 ` Tom Herbert @ 2016-12-20 2:41 ` Eric Dumazet 2016-12-20 3:40 ` Josef Bacik 0 siblings, 1 reply; 32+ messages in thread From: Eric Dumazet @ 2016-12-20 2:41 UTC (permalink / raw) To: Tom Herbert Cc: David Miller, Josef Bacik, Hannes Frederic Sowa, Craig Gallek, Linux Kernel Network Developers On Mon, 2016-12-19 at 18:07 -0800, Tom Herbert wrote: > When sockets created SO_REUSEPORT move to TW state they are placed > back on the the tb->owners. fastreuse port is no longer set so we have > to walk potential long list of sockets in tb->owners to open a new > listener socket. I imagine this is happens when we try to open a new > listener SO_REUSEPORT after the system has been running a while and so > we hit the long tb->owners list. Hmm... __inet_twsk_hashdance() does not change tb->fastreuse So where tb->fastreuse is cleared ? If all your sockets have SO_REUSEPORT set, this should not happen. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Soft lockup in inet_put_port on 4.6 2016-12-20 2:41 ` Eric Dumazet @ 2016-12-20 3:40 ` Josef Bacik 2016-12-20 4:52 ` Eric Dumazet 0 siblings, 1 reply; 32+ messages in thread From: Josef Bacik @ 2016-12-20 3:40 UTC (permalink / raw) To: Eric Dumazet Cc: Tom Herbert, David Miller, Hannes Frederic Sowa, Craig Gallek, Linux Kernel Network Developers > On Dec 19, 2016, at 9:42 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > >> On Mon, 2016-12-19 at 18:07 -0800, Tom Herbert wrote: >> >> When sockets created SO_REUSEPORT move to TW state they are placed >> back on the the tb->owners. fastreuse port is no longer set so we have >> to walk potential long list of sockets in tb->owners to open a new >> listener socket. I imagine this is happens when we try to open a new >> listener SO_REUSEPORT after the system has been running a while and so >> we hit the long tb->owners list. > > Hmm... __inet_twsk_hashdance() does not change tb->fastreuse > > So where tb->fastreuse is cleared ? > > If all your sockets have SO_REUSEPORT set, this should not happen. > The app starts out with no SO_REUSEPORT, and then we restart it with that option enabled. What I suspect is we have all the twsks from the original service, and the fastreuse stuff is cleared. My naive patch resets it once we add a reuseport sk to the tb and that makes the problem go away. I'm reworking all of this logic and adding some extra info to the tb to make the reset actually safe. I'll send those patches out tomorrow. Thanks, Josef ^ permalink raw reply [flat|nested] 32+ messages in thread
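The one-way flag behavior described here can be modeled directly; the `reset_on_reuseport` knob below stands in for Josef's experimental reset and is hypothetical (as noted in the thread, resetting unconditionally is not known to be safe):

```python
# Model of the one-way tb->fastreuseport transition: binding one
# non-reuseport socket clears the flag, and nothing in the stock
# kernel ever sets it back, so every later bind on the bucket takes
# the slow bind_conflict walk. Illustrative names only.

class BindBucket:
    def __init__(self, first_sk):
        self.fastreuseport = 1 if first_sk["reuseport"] else 0
        self.owners = [first_sk]

    def bind(self, sk, reset_on_reuseport=False):
        if not sk["reuseport"]:
            self.fastreuseport = 0      # cleared by a non-reuseport bind...
        elif reset_on_reuseport:
            self.fastreuseport = 1      # ...the experimental reset
        self.owners.append(sk)

tb = BindBucket({"reuseport": False})   # old binary, no SO_REUSEPORT
tb.bind({"reuseport": True})            # restarted with SO_REUSEPORT
print(tb.fastreuseport)                 # 0: the flag never recovers

tb2 = BindBucket({"reuseport": False})
tb2.bind({"reuseport": True}, reset_on_reuseport=True)
print(tb2.fastreuseport)                # 1: fast path available again
```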
* Re: Soft lockup in inet_put_port on 4.6 2016-12-20 3:40 ` Josef Bacik @ 2016-12-20 4:52 ` Eric Dumazet 2016-12-20 4:59 ` Josef Bacik 0 siblings, 1 reply; 32+ messages in thread From: Eric Dumazet @ 2016-12-20 4:52 UTC (permalink / raw) To: Josef Bacik Cc: Tom Herbert, David Miller, Hannes Frederic Sowa, Craig Gallek, Linux Kernel Network Developers On Tue, 2016-12-20 at 03:40 +0000, Josef Bacik wrote: > > On Dec 19, 2016, at 9:42 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > > > >> On Mon, 2016-12-19 at 18:07 -0800, Tom Herbert wrote: > >> > >> When sockets created SO_REUSEPORT move to TW state they are placed > >> back on the the tb->owners. fastreuse port is no longer set so we have > >> to walk potential long list of sockets in tb->owners to open a new > >> listener socket. I imagine this is happens when we try to open a new > >> listener SO_REUSEPORT after the system has been running a while and so > >> we hit the long tb->owners list. > > > > Hmm... __inet_twsk_hashdance() does not change tb->fastreuse > > > > So where tb->fastreuse is cleared ? > > > > If all your sockets have SO_REUSEPORT set, this should not happen. > > > > The app starts out with no SO_REUSEPORT, and then we restart it with > that option enabled. But... why would the application do this dance ? I now better understand why we never had these issues... > What I suspect is we have all the twsks from the original service, > and the fastreuse stuff is cleared. My naive patch resets it once we > add a reuseport sk to the tb and that makes the problem go away. I'm > reworking all of this logic and adding some extra info to the tb to > make the reset actually safe. I'll send those patches out tomorrow. > Thanks, Okay, we will review them ;) Note that Willy Tarreau wants some mechanism to be able to freeze a listener, to allow haproxy to be replaced without closing any sessions. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Soft lockup in inet_put_port on 4.6 2016-12-20 4:52 ` Eric Dumazet @ 2016-12-20 4:59 ` Josef Bacik 0 siblings, 0 replies; 32+ messages in thread From: Josef Bacik @ 2016-12-20 4:59 UTC (permalink / raw) To: Eric Dumazet Cc: Tom Herbert, David Miller, Hannes Frederic Sowa, Craig Gallek, Linux Kernel Network Developers > On Dec 19, 2016, at 11:52 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > > On Tue, 2016-12-20 at 03:40 +0000, Josef Bacik wrote: >>> On Dec 19, 2016, at 9:42 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: >>> >>>> On Mon, 2016-12-19 at 18:07 -0800, Tom Herbert wrote: >>>> >>>> When sockets created SO_REUSEPORT move to TW state they are placed >>>> back on the the tb->owners. fastreuse port is no longer set so we have >>>> to walk potential long list of sockets in tb->owners to open a new >>>> listener socket. I imagine this is happens when we try to open a new >>>> listener SO_REUSEPORT after the system has been running a while and so >>>> we hit the long tb->owners list. >>> >>> Hmm... __inet_twsk_hashdance() does not change tb->fastreuse >>> >>> So where tb->fastreuse is cleared ? >>> >>> If all your sockets have SO_REUSEPORT set, this should not happen. >>> >> >> The app starts out with no SO_REUSEPORT, and then we restart it with >> that option enabled. > > But... why would the application do this dance ? > > I now better understand why we never had these issues... > It doesn't do it as a part of it's normal operation. The old version didn't use SO_REUSEPORT and then somebody added support for it, restarted the service with the new option enabled and boom. They immediately stopped doing anything and gave it to me to figure out. > >> What I suspect is we have all the twsks from the original service, >> and the fastreuse stuff is cleared. My naive patch resets it once we >> add a reuseport sk to the tb and that makes the problem go away. 
I'm >> reworking all of this logic and adding some extra info to the tb to >> make the reset actually safe. I'll send those patches out tomorrow. >> Thanks, > > Okay, we will review them ;) > > Note that Willy Tarreau wants some mechanism to be able to freeze a > listener, to allow haproxy to be replaced without closing any sessions. > I assume that's what these guys would want as well. They have some weird handoff thing they do when the app starts but I'm not entirely convinced it does what they think it does. Thanks, Josef ^ permalink raw reply [flat|nested] 32+ messages in thread