From mboxrd@z Thu Jan 1 00:00:00 1970 From: Josef Bacik Subject: Re: Soft lockup in inet_put_port on 4.6 Date: Fri, 16 Dec 2016 10:21:24 -0500 Message-ID: <1481901684.24490.7@smtp.office365.com> References: <1481231024.1911284.813071977.72AF4DEE@webmail.messagingengine.com> <1481233016.11849.1@smtp.office365.com> <1481243432.4930.145.camel@edumazet-glaptop3.roam.corp.google.com> <6C6EE0ED-7E78-4866-8AAF-D75FD4719EF3@fb.com> <1481335192.3663.0@smtp.office365.com> <1481341624.4930.204.camel@edumazet-glaptop3.roam.corp.google.com> <1481343298.4930.208.camel@edumazet-glaptop3.roam.corp.google.com> <1481565929.24490.0@smtp.office365.com> <3c022731-e703-34ac-55f1-60f5b94b6d62@stressinduktion.org> <1481581466.24490.2@smtp.office365.com> <1481828016.24490.5@smtp.office365.com> <1481900088.24490.6@smtp.office365.com> Mime-Version: 1.0 Content-Type: text/plain; charset="utf-8"; format=flowed Cc: Tom Herbert , Craig Gallek , Eric Dumazet , Linux Kernel Network Developers To: Hannes Frederic Sowa Return-path: Received: from mx0b-00082601.pphosted.com ([67.231.153.30]:38633 "EHLO mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1757426AbcLPQIM (ORCPT ); Fri, 16 Dec 2016 11:08:12 -0500 In-Reply-To: <1481900088.24490.6@smtp.office365.com> Sender: netdev-owner@vger.kernel.org List-ID: On Fri, Dec 16, 2016 at 9:54 AM, Josef Bacik wrote: > On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa > wrote: >> Hi Josef, >> >> On 15.12.2016 19:53, Josef Bacik wrote: >>> On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert >>> wrote: >>>> On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek >>>> >>>> wrote: >>>>> On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert >>>>> >>>>> wrote: >>>>>> I think there may be some suspicious code in >>>>>> inet_csk_get_port. At >>>>>> tb_found there is: >>>>>> >>>>>> if (((tb->fastreuse > 0 && reuse) || >>>>>> (tb->fastreuseport > 0 && >>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) >>>>>> && >>>>>> sk->sk_reuseport && uid_eq(tb->fastuid, >>>>>> uid))) && >>>>>> smallest_size == -1) >>>>>> goto success; >>>>>> if >>>>>> (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, >>>>>> tb, true)) { >>>>>> if ((reuse || >>>>>> (tb->fastreuseport > 0 && >>>>>> sk->sk_reuseport && >>>>>> >>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) && >>>>>> uid_eq(tb->fastuid, uid))) && >>>>>> smallest_size != -1 && --attempts >>>>>> >= 0) { >>>>>> spin_unlock_bh(&head->lock); >>>>>> goto again; >>>>>> } >>>>>> goto fail_unlock; >>>>>> } >>>>>> >>>>>> AFAICT there is redundancy in these two conditionals. The >>>>>> same clause >>>>>> is being checked in both: (tb->fastreuseport > 0 && >>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport && >>>>>> uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is >>>>>> true the >>>>>> first conditional should be hit, goto done, and the second >>>>>> will never >>>>>> evaluate that part to true-- unless the sk is changed (do we >>>>>> need >>>>>> READ_ONCE for sk->sk_reuseport_cb?). >>>>> That's an interesting point... It looks like this function also >>>>> changed in 4.6 from using a single local_bh_disable() at the >>>>> beginning >>>>> with several spin_lock(&head->lock) to exclusively >>>>> spin_lock_bh(&head->lock) at each locking point. Perhaps the >>>>> full bh >>>>> disable variant was preventing the timers in your stack trace >>>>> from >>>>> running interleaved with this function before? >>>> >>>> Could be, although dropping the lock shouldn't be able to affect >>>> the >>>> search state. TBH, I'm a little lost in reading function, the >>>> SO_REUSEPORT handling is pretty complicated. For instance, >>>> rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in >>>> that >>>> function and also in every call to inet_csk_bind_conflict. I >>>> wonder if >>>> we can simply this under the assumption that SO_REUSEPORT is only >>>> allowed if the port number (snum) is explicitly specified. >>> >>> Ok first I have data for you Hannes, here's the time distributions >>> before during and after the lockup (with all the debugging in >>> place the >>> box eventually recovers). I've attached it as a text file since >>> it is >>> long. >> >> Thanks a lot! >> >>> Second is I was thinking about why we would spend so much time >>> doing the >>> ->owners list, and obviously it's because of the massive amount of >>> timewait sockets on the owners list. I wrote the following dumb >>> patch >>> and tested it and the problem has disappeared completely. Now I >>> don't >>> know if this is right at all, but I thought it was weird we weren't >>> copying the soreuseport option from the original socket onto the >>> twsk. >>> Is there are reason we aren't doing this currently? Does this help >>> explain what is happening? Thanks, >> >> The patch is interesting and a good clue, but I am immediately a bit >> concerned that we don't copy/tag the socket with the uid also to keep >> the security properties for SO_REUSEPORT. I have to think a bit more >> about this. >> >> We have seen hangs during connect. I am afraid this patch wouldn't >> help >> there while also guaranteeing uniqueness. > > > Yeah so I looked at the code some more and actually my patch is > really bad. If sk2->sk_reuseport is set we'll look at > sk2->sk_reuseport_cb, which is outside of the timewait sock, so > that's definitely bad. > > But we should at least be setting it to 0 so that we don't do this > normally. Unfortunately simply setting it to 0 doesn't fix the > problem. So for some reason having ->sk_reuseport set to 1 on a > timewait socket makes this problem non-existent, which is strange. > > So back to the drawing board I guess. I wonder if doing what craig > suggested and batching the timewait timer expires so it hurts less > would accomplish the same results. Thanks, Wait no I lied, we access the sk->sk_reuseport_cb, not sk2's. This is the code if ((!reuse || !sk2->sk_reuse || sk2->sk_state == TCP_LISTEN) && (!reuseport || !sk2->sk_reuseport || rcu_access_pointer(sk->sk_reuseport_cb) || (sk2->sk_state != TCP_TIME_WAIT && !uid_eq(uid, sock_i_uid(sk2))))) { if (!sk2->sk_rcv_saddr || !sk->sk_rcv_saddr || sk2->sk_rcv_saddr == sk->sk_rcv_saddr) break; } so in my patches case we now have reuseport == 1, sk2->sk_reuseport == 1. But now we are using reuseport, so sk->sk_reuseport_cb should be non-NULL right? So really setting the timewait sock's sk_reuseport should have no bearing on how this loop plays out right? Thanks, Josef