All of lore.kernel.org
 help / color / mirror / Atom feed
From: Josef Bacik <jbacik@fb.com>
To: Tom Herbert <tom@herbertland.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>,
	Craig Gallek <kraigatgoog@gmail.com>,
	Eric Dumazet <eric.dumazet@gmail.com>,
	"Linux Kernel Network Developers" <netdev@vger.kernel.org>
Subject: Re: Soft lockup in inet_put_port on 4.6
Date: Fri, 16 Dec 2016 17:50:10 -0500	[thread overview]
Message-ID: <1481928610.17731.0@smtp.office365.com> (raw)
In-Reply-To: <CALx6S34qacbD-zxvEJyDVsp1_U_tkL_t0hEXrtfffBALsDMxEw@mail.gmail.com>

On Fri, Dec 16, 2016 at 5:18 PM, Tom Herbert <tom@herbertland.com> 
wrote:
> On Fri, Dec 16, 2016 at 2:08 PM, Josef Bacik <jbacik@fb.com> wrote:
>>  On Fri, Dec 16, 2016 at 10:21 AM, Josef Bacik <jbacik@fb.com> wrote:
>>> 
>>>  On Fri, Dec 16, 2016 at 9:54 AM, Josef Bacik <jbacik@fb.com> wrote:
>>>> 
>>>>  On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa
>>>>  <hannes@stressinduktion.org> wrote:
>>>>> 
>>>>>  Hi Josef,
>>>>> 
>>>>>  On 15.12.2016 19:53, Josef Bacik wrote:
>>>>>> 
>>>>>>   On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert 
>>>>>> <tom@herbertland.com>
>>>>>>  wrote:
>>>>>>> 
>>>>>>>   On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek 
>>>>>>> <kraigatgoog@gmail.com>
>>>>>>>   wrote:
>>>>>>>> 
>>>>>>>>    On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert 
>>>>>>>> <tom@herbertland.com>
>>>>>>>>   wrote:
>>>>>>>>> 
>>>>>>>>>    I think there may be some suspicious code in 
>>>>>>>>> inet_csk_get_port. At
>>>>>>>>>    tb_found there is:
>>>>>>>>> 
>>>>>>>>>                    if (((tb->fastreuse > 0 && reuse) ||
>>>>>>>>>                         (tb->fastreuseport > 0 &&
>>>>>>>>>                          
>>>>>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>>>>>>>                          sk->sk_reuseport && 
>>>>>>>>> uid_eq(tb->fastuid,
>>>>>>>>>   uid))) &&
>>>>>>>>>                        smallest_size == -1)
>>>>>>>>>                            goto success;
>>>>>>>>>                    if 
>>>>>>>>> (inet_csk(sk)->icsk_af_ops->bind_conflict(sk,
>>>>>>>>>   tb, true)) {
>>>>>>>>>                            if ((reuse ||
>>>>>>>>>                                 (tb->fastreuseport > 0 &&
>>>>>>>>>                                  sk->sk_reuseport &&
>>>>>>>>> 
>>>>>>>>>   !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>>>>>>>                                  uid_eq(tb->fastuid, uid))) &&
>>>>>>>>>                                smallest_size != -1 && 
>>>>>>>>> --attempts >=
>>>>>>>>>  0) {
>>>>>>>>>                                    
>>>>>>>>> spin_unlock_bh(&head->lock);
>>>>>>>>>                                    goto again;
>>>>>>>>>                            }
>>>>>>>>>                            goto fail_unlock;
>>>>>>>>>                    }
>>>>>>>>> 
>>>>>>>>>    AFAICT there is redundancy in these two conditionals.  The 
>>>>>>>>> same
>>>>>>>>>  clause
>>>>>>>>>    is being checked in both: (tb->fastreuseport > 0 &&
>>>>>>>>>    !rcu_access_pointer(sk->sk_reuseport_cb) && 
>>>>>>>>> sk->sk_reuseport &&
>>>>>>>>>    uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this 
>>>>>>>>> is true
>>>>>>>>>  the
>>>>>>>>>    first conditional should be hit, goto done,  and the 
>>>>>>>>> second will
>>>>>>>>>  never
>>>>>>>>>    evaluate that part to true-- unless the sk is changed (do 
>>>>>>>>> we need
>>>>>>>>>    READ_ONCE for sk->sk_reuseport_cb?).
>>>>>>>> 
>>>>>>>>    That's an interesting point... It looks like this function 
>>>>>>>> also
>>>>>>>>    changed in 4.6 from using a single local_bh_disable() at the
>>>>>>>>  beginning
>>>>>>>>    with several spin_lock(&head->lock) to exclusively
>>>>>>>>    spin_lock_bh(&head->lock) at each locking point.  Perhaps 
>>>>>>>> the full
>>>>>>>>  bh
>>>>>>>>    disable variant was preventing the timers in your stack 
>>>>>>>> trace from
>>>>>>>>    running interleaved with this function before?
>>>>>>> 
>>>>>>> 
>>>>>>>   Could be, although dropping the lock shouldn't be able to 
>>>>>>> affect the
>>>>>>>   search state. TBH, I'm a little lost in reading function, the
>>>>>>>   SO_REUSEPORT handling is pretty complicated. For instance,
>>>>>>>   rcu_access_pointer(sk->sk_reuseport_cb) is checked three 
>>>>>>> times in
>>>>>>>  that
>>>>>>>   function and also in every call to inet_csk_bind_conflict. I 
>>>>>>> wonder
>>>>>>>  if
>>>>>>>   we can simply this under the assumption that SO_REUSEPORT is 
>>>>>>> only
>>>>>>>   allowed if the port number (snum) is explicitly specified.
>>>>>> 
>>>>>> 
>>>>>>   Ok first I have data for you Hannes, here's the time 
>>>>>> distributions
>>>>>>   before during and after the lockup (with all the debugging in 
>>>>>> place
>>>>>>  the
>>>>>>   box eventually recovers).  I've attached it as a text file 
>>>>>> since it is
>>>>>>   long.
>>>>> 
>>>>> 
>>>>>  Thanks a lot!
>>>>> 
>>>>>>   Second is I was thinking about why we would spend so much time 
>>>>>> doing
>>>>>>  the
>>>>>>   ->owners list, and obviously it's because of the massive 
>>>>>> amount of
>>>>>>   timewait sockets on the owners list.  I wrote the following 
>>>>>> dumb patch
>>>>>>   and tested it and the problem has disappeared completely.  Now 
>>>>>> I don't
>>>>>>   know if this is right at all, but I thought it was weird we 
>>>>>> weren't
>>>>>>   copying the soreuseport option from the original socket onto 
>>>>>> the twsk.
>>>>>>   Is there are reason we aren't doing this currently?  Does this 
>>>>>> help
>>>>>>   explain what is happening?  Thanks,
>>>>> 
>>>>> 
>>>>>  The patch is interesting and a good clue, but I am immediately a 
>>>>> bit
>>>>>  concerned that we don't copy/tag the socket with the uid also to 
>>>>> keep
>>>>>  the security properties for SO_REUSEPORT. I have to think a bit 
>>>>> more
>>>>>  about this.
>>>>> 
>>>>>  We have seen hangs during connect. I am afraid this patch 
>>>>> wouldn't help
>>>>>  there while also guaranteeing uniqueness.
>>>> 
>>>> 
>>>> 
>>>>  Yeah so I looked at the code some more and actually my patch is 
>>>> really
>>>>  bad.  If sk2->sk_reuseport is set we'll look at 
>>>> sk2->sk_reuseport_cb, which
>>>>  is outside of the timewait sock, so that's definitely bad.
>>>> 
>>>>  But we should at least be setting it to 0 so that we don't do this
>>>>  normally.  Unfortunately simply setting it to 0 doesn't fix the 
>>>> problem.  So
>>>>  for some reason having ->sk_reuseport set to 1 on a timewait 
>>>> socket makes
>>>>  this problem non-existent, which is strange.
>>>> 
>>>>  So back to the drawing board I guess.  I wonder if doing what 
>>>> craig
>>>>  suggested and batching the timewait timer expires so it hurts 
>>>> less would
>>>>  accomplish the same results.  Thanks,
>>> 
>>> 
>>>  Wait no I lied, we access the sk->sk_reuseport_cb, not sk2's.  
>>> This is the
>>>  code
>>> 
>>>                         if ((!reuse || !sk2->sk_reuse ||
>>>                             sk2->sk_state == TCP_LISTEN) &&
>>>                             (!reuseport || !sk2->sk_reuseport ||
>>>                              
>>> rcu_access_pointer(sk->sk_reuseport_cb) ||
>>>                              (sk2->sk_state != TCP_TIME_WAIT &&
>>>                              !uid_eq(uid, sock_i_uid(sk2))))) {
>>> 
>>>                                 if (!sk2->sk_rcv_saddr || 
>>> !sk->sk_rcv_saddr
>>>  ||
>>>                                     sk2->sk_rcv_saddr == 
>>> sk->sk_rcv_saddr)
>>>                                         break;
>>>                         }
>>> 
>>>  so in my patches case we now have reuseport == 1, 
>>> sk2->sk_reuseport == 1.
>>>  But now we are using reuseport, so sk->sk_reuseport_cb should be 
>>> non-NULL
>>>  right?  So really setting the timewait sock's sk_reuseport should 
>>> have no
>>>  bearing on how this loop plays out right?  Thanks,
>> 
>> 
>> 
>>  So more messing around and I noticed that we basically don't do the
>>  tb->fastreuseport logic at all if we've ended up with a non 
>> SO_REUSEPORT
>>  socket on that tb.  So before I fully understood what I was doing I 
>> fixed it
>>  so that after we go through ->bind_conflict() once with a 
>> SO_REUSEPORT
>>  socket, we reset tb->fastreuseport to 1 and set the uid to match 
>> the uid of
>>  the socket.  This made the problem go away.  Tom pointed out that 
>> if we bind
>>  to the same port on a different address and we have a non 
>> SO_REUSEPORT
>>  socket with the same address on this tb then we'd be screwed with 
>> my code.
>> 
>>  Which brings me to his proposed solution.  We need another hash 
>> table that
>>  is indexed based on the binding address.  Then each node 
>> corresponds to one
>>  address/port binding, with non-SO_REUSEPORT entries having only one 
>> entry,
>>  and normal SO_REUSEPORT entries having many.  This cleans up the 
>> need to
>>  search all the possible sockets on any given tb, we just go and 
>> look at the
>>  one we care about.  Does this make sense?  Thanks,
>> 
> Hi Josef,
> 
> Thinking about it some more the hash table won't work because of the
> rules of binding different addresses to the same port. What I think we
> can do is to change inet_bind_bucket to be structure that contains all
> the information used to detect conflicts (reuse*, if, address, uid,
> etc.) and a list of sockets that share that exact same information--
> for instance all socket in timewait state create through some listener
> socket should wind up on single bucket. When we do the bind_conflict
> function we only should have to walk this buckets, not the full list
> of sockets.
> 
> Thoughts on this?

This sounds good, maybe tb->owners be a list of say

struct inet_unique_shit {
	struct sock_common sk;
	struct hlist socks;
};

Then we make inet_unique_shit like twsks', just copy the relevant 
information, then hang the real sockets off of the socks hlist.  
Something like that?  Thanks,

Josef

  reply	other threads:[~2016-12-16 23:25 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-12-06 23:06 Soft lockup in inet_put_port on 4.6 Tom Herbert
2016-12-08 21:03 ` Hannes Frederic Sowa
2016-12-08 21:36   ` Josef Bacik
2016-12-09  0:30     ` Eric Dumazet
2016-12-09  1:01       ` Josef Bacik
2016-12-10  1:59         ` Josef Bacik
2016-12-10  3:47           ` Eric Dumazet
2016-12-10  4:14             ` Eric Dumazet
2016-12-12 18:05               ` Josef Bacik
2016-12-12 18:44                 ` Hannes Frederic Sowa
2016-12-12 21:23                   ` Josef Bacik
2016-12-12 22:24                   ` Josef Bacik
2016-12-13 20:51                     ` Tom Herbert
2016-12-13 23:03                       ` Craig Gallek
2016-12-13 23:32                         ` Tom Herbert
2016-12-15 18:53                           ` Josef Bacik
2016-12-15 22:39                             ` Tom Herbert
2016-12-15 23:25                               ` Craig Gallek
2016-12-16  0:07                             ` Hannes Frederic Sowa
2016-12-16 14:54                               ` Josef Bacik
2016-12-16 15:21                                 ` Josef Bacik
2016-12-16 22:08                                   ` Josef Bacik
2016-12-16 22:18                                     ` Tom Herbert
2016-12-16 22:50                                       ` Josef Bacik [this message]
2016-12-17 11:08                                         ` Hannes Frederic Sowa
2016-12-17 13:26                                           ` Josef Bacik
2016-12-20  1:56                                             ` David Miller
2016-12-20  2:07                                               ` Tom Herbert
2016-12-20  2:41                                                 ` Eric Dumazet
2016-12-20  3:40                                                   ` Josef Bacik
2016-12-20  4:52                                                     ` Eric Dumazet
2016-12-20  4:59                                                       ` Josef Bacik

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1481928610.17731.0@smtp.office365.com \
    --to=jbacik@fb.com \
    --cc=eric.dumazet@gmail.com \
    --cc=hannes@stressinduktion.org \
    --cc=kraigatgoog@gmail.com \
    --cc=netdev@vger.kernel.org \
    --cc=tom@herbertland.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.