Re: [PATCH v2] tcp: fix connection reset due to tw hashdance race.

From: Eric Dumazet <edumazet@google.com>
To: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: davem@davemloft.net, dsahern@kernel.org, duanmuquan@baidu.com,
	 kuba@kernel.org, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org,  pabeni@redhat.com
Subject: Re: [PATCH v2] tcp: fix connection reset due to tw hashdance race.
Date: Thu, 8 Jun 2023 08:35:20 +0200	[thread overview]
Message-ID: <CANn89iK8snOz8TYOhhwfimC7ykYA78GA3Nyv8x06SZYa1nKdyA@mail.gmail.com> (raw)
In-Reply-To: <20230608054740.11256-1-kuniyu@amazon.com>

On Thu, Jun 8, 2023 at 7:48 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> From: Eric Dumazet <edumazet@google.com>
> Date: Wed, 7 Jun 2023 15:32:57 +0200
> > On Wed, Jun 7, 2023 at 1:59 PM Duan,Muquan <duanmuquan@baidu.com> wrote:
> > >
> > > Hi, Eric,
> > >
> > >  Thanks for your comments!
> > >
> > >  About the second lookup, I am sorry that I did not give enough explanations about it. Here are some details:
> > >
> > >  1.  The second lookup can find the tw sock and avoid the connection refuse error on userland applications:
> > >
> > > If the original sock is found, but when validating its refcnt, it has been destroyed and sk_refcnt has become 0 after decreased by tcp_time_wait()->tcp_done()->inet_csk_destory_sock()->sock_put().The validation for refcnt fails and the lookup process gets a listener sock.
> > >
> > > When this case occurs, the hashdance has definitely finished，because tcp_done() is executed after inet_twsk_hashdance(). Then if look up the ehash table again, hashdance has already finished, tw sock will be found.
> > >
> > >  With this fix, logically we can solve the connection reset issue completely when no established sock is found due to hashdance race.In my reproducing environment, the connection refuse error will occur about every 6 hours with only the fix of bad case (2). But with both of the 2 fixes, I tested it many times, the longest test continues for 10 days, it does not occur again,
> > >
> > >
> > >
> > > 2. About the performance impact:
> > >
> > >      A similar scenario is that __inet_lookup_established() will do inet_match() check for the second time, if fails it will look up    the list again. It is the extra effort to reduce the race impact without using reader lock. inet_match() failure occurs with about the same probability with refcnt validation failure in my test environment.
> > >
> > >  The second lookup will only be done in the condition that FIN segment gets a listener sock.
> > >
> > >   About the performance impact:
> > >
> > > 1)  Most of the time, this condition will not met, the added codes introduces at most 3 comparisons for each segment.
> > >
> > > The second inet_match() in __inet_lookup_established()  does least 3 comparisons for each segmet.
> > >
> > >
> > > 2)  When this condition is met, the probability is very small. The impact is similar to the second try due to inet_match() failure. Since tw sock can definitely be found in the second try, I think this cost is worthy to avoid connection reused error on userland applications.
> > >
> > >
> > >
> > > My understanding is, current philosophy is avoiding the reader lock by tolerating the minor defect which occurs in a small probability.For example, if the FIN from passive closer is dropped due to the found sock is destroyed, a retransmission can be tolerated, it only makes the connection termination slower. But I think the bottom line is that it does not affect the userland applications’ functionality. If application fails to connect due to the hashdance race, it can’t be tolerated. In fact, guys from product department push hard on the connection refuse error.
> > >
> > >
> > > About bad case (2):
> > >
> > >  tw sock is found, but its tw_refcnt has not been set to 3, it is still 0, validating for sk_refcnt will fail.
> > >
> > > I do not know the reason why setting tw_refcnt after adding it into list, could anyone help point out the reason? It adds  extra race because the new added tw sock may be found and checked in other CPU concurrently before ƒsetting tw_refcnt to 3.
> > >
> > > By setting tw_refcnt to 3 before adding it into list, this case will be solved, and almost no cost. In my reproducing environment, it occurs more frequently than bad case (1), it appears about every 20 minutes, bad case (1) appears about every 6 hours.
> > >
> > >
> > >
> > > About the bucket spinlock, the original established sock and tw sock are stored in the ehash table, I concern about the performance when there are lots of short TCP connections, the reader lock may affect the performance of connection creation and termination. Could you share some details of your idea? Thanks in advance.
> > >
> > >
> >
> > Again, you can write a lot of stuff, the fact is that your patch does
> > not solve the issue.
> >
> > You could add 10 lookups, and still miss some cases, because they are
> > all RCU lookups with no barriers.
> >
> > In order to solve the issue of packets for the same 4-tuple being
> > processed by many cpus, the only way to solve races is to add mutual
> > exclusion.
> >
> > Note that we already have to lock the bucket spinlock every time we
> > transition a request socket to socket, a socket to timewait, or any
> > insert/delete.
> >
> > We need to expand the scope of this lock, and cleanup things that we
> > added in the past, because we tried too hard to 'detect races'
>
> How about this ?  This is still a workaround though, retry sounds
> better than expanding the scope of the lock given the race is rare.

The chance of two cpus having to hold the same spinlock is rather small.

Algo is the following:

Attempt a lockless/RCU lookup.

1) Socket is found, we are good to go. Fast path is still fast.

2) Socket  is not found in ehash
   - We lock the bucket spinlock.
   - We retry the lookup
   - If socket found, continue with it (release the spinlock when
appropriate, after all write manipulations in the bucket are done)
   - If socket still not found, we lookup a listener.
      We insert a TCP_NEW_SYN_RECV ....
       Again, we release the spinlock when appropriate, after all
write manipulations in the bucket are done)

No more races, and the fast path is the same.

>
> ---8<---
> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index e7391bf310a7..b034be2f37c8 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -484,14 +484,24 @@ struct sock *__inet_lookup_established(struct net *net,
>         unsigned int hash = inet_ehashfn(net, daddr, hnum, saddr, sport);
>         unsigned int slot = hash & hashinfo->ehash_mask;
>         struct inet_ehash_bucket *head = &hashinfo->ehash[slot];
> +       bool should_retry = true;
>
>  begin:
>         sk_nulls_for_each_rcu(sk, node, &head->chain) {
>                 if (sk->sk_hash != hash)
>                         continue;
>                 if (likely(inet_match(net, sk, acookie, ports, dif, sdif))) {
> -                       if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
> +                       if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt))) {

Because of SLAB_TYPESAFE_BY_RCU, we can not really do this kind of stuff.

Really this RCU lookup should be a best effort, not something
potentially looping X times .

We only need a proper fallback to spinlock protected lookup.

> +                               if (sk->sk_state == TCP_TIME_WAIT)
> +                                       goto begin;
> +
> +                               if (sk->sk_state == TCP_CLOSE && should_retry) {
> +                                       should_retry = false;
> +                                       goto begin;
> +                               }
> +
>                                 goto out;
> +                       }
>                         if (unlikely(!inet_match(net, sk, acookie,
>                                                  ports, dif, sdif))) {
>                                 sock_gen_put(sk);
> ---8<---
>
>
> >