From mboxrd@z Thu Jan 1 00:00:00 1970 From: Josef Bacik Subject: Re: Soft lockup in inet_put_port on 4.6 Date: Fri, 16 Dec 2016 09:54:48 -0500 Message-ID: <1481900088.24490.6@smtp.office365.com> References: <1481231024.1911284.813071977.72AF4DEE@webmail.messagingengine.com> <1481233016.11849.1@smtp.office365.com> <1481243432.4930.145.camel@edumazet-glaptop3.roam.corp.google.com> <6C6EE0ED-7E78-4866-8AAF-D75FD4719EF3@fb.com> <1481335192.3663.0@smtp.office365.com> <1481341624.4930.204.camel@edumazet-glaptop3.roam.corp.google.com> <1481343298.4930.208.camel@edumazet-glaptop3.roam.corp.google.com> <1481565929.24490.0@smtp.office365.com> <3c022731-e703-34ac-55f1-60f5b94b6d62@stressinduktion.org> <1481581466.24490.2@smtp.office365.com> <1481828016.24490.5@smtp.office365.com> Mime-Version: 1.0 Content-Type: text/plain; charset="utf-8"; format=flowed Cc: Tom Herbert , Craig Gallek , Eric Dumazet , Linux Kernel Network Developers To: Hannes Frederic Sowa Return-path: Received: from mx0a-00082601.pphosted.com ([67.231.145.42]:41813 "EHLO mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754029AbcLPOzD (ORCPT ); Fri, 16 Dec 2016 09:55:03 -0500 In-Reply-To: Sender: netdev-owner@vger.kernel.org List-ID: On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa wrote: > Hi Josef, > > On 15.12.2016 19:53, Josef Bacik wrote: >> On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert >> wrote: >>> On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek >>> >>> wrote: >>>> On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert >>>> >>>> wrote: >>>>> I think there may be some suspicious code in inet_csk_get_port. >>>>> At >>>>> tb_found there is: >>>>> >>>>> if (((tb->fastreuse > 0 && reuse) || >>>>> (tb->fastreuseport > 0 && >>>>> !rcu_access_pointer(sk->sk_reuseport_cb) >>>>> && >>>>> sk->sk_reuseport && uid_eq(tb->fastuid, >>>>> uid))) && >>>>> smallest_size == -1) >>>>> goto success; >>>>> if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, >>>>> tb, true)) { >>>>> if ((reuse || >>>>> (tb->fastreuseport > 0 && >>>>> sk->sk_reuseport && >>>>> >>>>> !rcu_access_pointer(sk->sk_reuseport_cb) && >>>>> uid_eq(tb->fastuid, uid))) && >>>>> smallest_size != -1 && --attempts >>>>> >= 0) { >>>>> spin_unlock_bh(&head->lock); >>>>> goto again; >>>>> } >>>>> goto fail_unlock; >>>>> } >>>>> >>>>> AFAICT there is redundancy in these two conditionals. The same >>>>> clause >>>>> is being checked in both: (tb->fastreuseport > 0 && >>>>> !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport && >>>>> uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is >>>>> true the >>>>> first conditional should be hit, goto done, and the second >>>>> will never >>>>> evaluate that part to true-- unless the sk is changed (do we >>>>> need >>>>> READ_ONCE for sk->sk_reuseport_cb?). >>>> That's an interesting point... It looks like this function also >>>> changed in 4.6 from using a single local_bh_disable() at the >>>> beginning >>>> with several spin_lock(&head->lock) to exclusively >>>> spin_lock_bh(&head->lock) at each locking point. Perhaps the >>>> full bh >>>> disable variant was preventing the timers in your stack trace >>>> from >>>> running interleaved with this function before? >>> >>> Could be, although dropping the lock shouldn't be able to affect >>> the >>> search state. TBH, I'm a little lost in reading function, the >>> SO_REUSEPORT handling is pretty complicated. For instance, >>> rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in >>> that >>> function and also in every call to inet_csk_bind_conflict. I >>> wonder if >>> we can simply this under the assumption that SO_REUSEPORT is only >>> allowed if the port number (snum) is explicitly specified. >> >> Ok first I have data for you Hannes, here's the time distributions >> before during and after the lockup (with all the debugging in place >> the >> box eventually recovers). I've attached it as a text file since it >> is >> long. > > Thanks a lot! > >> Second is I was thinking about why we would spend so much time >> doing the >> ->owners list, and obviously it's because of the massive amount of >> timewait sockets on the owners list. I wrote the following dumb >> patch >> and tested it and the problem has disappeared completely. Now I >> don't >> know if this is right at all, but I thought it was weird we weren't >> copying the soreuseport option from the original socket onto the >> twsk. >> Is there are reason we aren't doing this currently? Does this help >> explain what is happening? Thanks, > > The patch is interesting and a good clue, but I am immediately a bit > concerned that we don't copy/tag the socket with the uid also to keep > the security properties for SO_REUSEPORT. I have to think a bit more > about this. > > We have seen hangs during connect. I am afraid this patch wouldn't > help > there while also guaranteeing uniqueness. Yeah so I looked at the code some more and actually my patch is really bad. If sk2->sk_reuseport is set we'll look at sk2->sk_reuseport_cb, which is outside of the timewait sock, so that's definitely bad. But we should at least be setting it to 0 so that we don't do this normally. Unfortunately simply setting it to 0 doesn't fix the problem. So for some reason having ->sk_reuseport set to 1 on a timewait socket makes this problem non-existent, which is strange. So back to the drawing board I guess. I wonder if doing what craig suggested and batching the timewait timer expires so it hurts less would accomplish the same results. Thanks, Josef