From: Joanne Koong <joannelkoong@gmail.com>
To: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: davem@davemloft.net, edumazet@google.com, jirislaby@kernel.org,
	kuba@kernel.org, kuni1840@gmail.com, netdev@vger.kernel.org,
	pabeni@redhat.com
Subject: Re: [PATCH RFC net 1/2] tcp: Add TIME_WAIT sockets in bhash2.
Date: Fri, 23 Dec 2022 11:34:40 -0800
Message-ID: <CAJnrk1ZTh89qcMoC4nzE8-E-Do9idwmjXAcV-J1THkPjaZGqFw@mail.gmail.com>
In-Reply-To: <20221223015537.4249-1-kuniyu@amazon.com>

On Thu, Dec 22, 2022 at 5:55 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> From:   Joanne Koong <joannelkoong@gmail.com>
> Date:   Thu, 22 Dec 2022 16:25:10 -0800
> > On Thu, Dec 22, 2022 at 3:27 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > >
> > > From:   Joanne Koong <joannelkoong@gmail.com>
> > > Date:   Thu, 22 Dec 2022 13:46:57 -0800
> > > > On Thu, Dec 22, 2022 at 7:06 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > > > >
> > > > > On Thu, 2022-12-22 at 00:12 +0900, Kuniyuki Iwashima wrote:
> > > > > > Jiri Slaby reported regression of bind() with a simple repro. [0]
> > > > > >
> > > > > > The repro creates a TIME_WAIT socket and tries to bind() a new socket
> > > > > > with the same local address and port.  Before commit 28044fc1d495 ("net:
> > > > > > Add a bhash2 table hashed by port and address"), the bind() failed with
> > > > > > -EADDRINUSE, but now it succeeds.
> > > > > >
> > > > > > The cited commit should have put TIME_WAIT sockets into bhash2; otherwise,
> > > > > > inet_bhash2_conflict() misses TIME_WAIT sockets when validating bind()
> > > > > > requests if the address is not a wildcard one.
> > > >
> > > > (resending my reply because it wasn't in plaintext mode)
> > > >
> > > > Thanks for adding this! I hadn't realized TIME_WAIT sockets also are
> > > > considered when checking against inet bind conflicts.
> > > >
> > > > >
> > > > > How does keeping the timewait sockets inside bhash2 affect the bind
> > > > > loopup performance? I fear that could defeat completely the goal of
> > > > > 28044fc1d495, on quite busy server we could have quite a bit of tw with
> > > > > the same address/port. If so, we could even consider reverting
> > > > > 28044fc1d495.
> > >
> > > It will slow down along the number of twsk, but I think it's still faster
> > > than bhash if we listen() on multiple IP.  If we don't, bhash is always
> > > faster because of bhash2's additional locking.  However, this is the
> > > nature of bhash2 from the beginning.
> > >
> > >
> > > > >
> > > >
> > > > Can you clarify what you mean by bind loopup?
> > >
> > > I think it means just bhash2 traversal.  (s/loopup/lookup/)
> > >
> > > >
> > > > > > [0]: https://lore.kernel.org/netdev/6b971a4e-c7d8-411e-1f92-fda29b5b2fb9@kernel.org/
> > > > > >
> > > > > > Fixes: 28044fc1d495 ("net: Add a bhash2 table hashed by port and address")
> > > > > > Reported-by: Jiri Slaby <jirislaby@kernel.org>
> > > > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > > > > > ---
> > > > > >  include/net/inet_timewait_sock.h |  2 ++
> > > > > >  include/net/sock.h               |  5 +++--
> > > > > >  net/ipv4/inet_hashtables.c       |  5 +++--
> > > > > >  net/ipv4/inet_timewait_sock.c    | 31 +++++++++++++++++++++++++++++--
> > > > > >  4 files changed, 37 insertions(+), 6 deletions(-)
> > > > > >
> > > > > > diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
> > > > > > index 5b47545f22d3..c46ed239ad9a 100644
> > > > > > --- a/include/net/inet_timewait_sock.h
> > > > > > +++ b/include/net/inet_timewait_sock.h
> > > > > > @@ -44,6 +44,7 @@ struct inet_timewait_sock {
> > > > > >  #define tw_bound_dev_if              __tw_common.skc_bound_dev_if
> > > > > >  #define tw_node                      __tw_common.skc_nulls_node
> > > > > >  #define tw_bind_node         __tw_common.skc_bind_node
> > > > > > +#define tw_bind2_node                __tw_common.skc_bind2_node
> > > > > >  #define tw_refcnt            __tw_common.skc_refcnt
> > > > > >  #define tw_hash                      __tw_common.skc_hash
> > > > > >  #define tw_prot                      __tw_common.skc_prot
> > > > > > @@ -73,6 +74,7 @@ struct inet_timewait_sock {
> > > > > >       u32                     tw_priority;
> > > > > >       struct timer_list       tw_timer;
> > > > > >       struct inet_bind_bucket *tw_tb;
> > > > > > +     struct inet_bind2_bucket        *tw_tb2;
> > > > > >  };
> > > > > >  #define tw_tclass tw_tos
> > > > > >
> > > > > > diff --git a/include/net/sock.h b/include/net/sock.h
> > > > > > index dcd72e6285b2..aaec985c1b5b 100644
> > > > > > --- a/include/net/sock.h
> > > > > > +++ b/include/net/sock.h
> > > > > > @@ -156,6 +156,7 @@ typedef __u64 __bitwise __addrpair;
> > > > > >   *   @skc_tw_rcv_nxt: (aka tw_rcv_nxt) TCP window next expected seq number
> > > > > >   *           [union with @skc_incoming_cpu]
> > > > > >   *   @skc_refcnt: reference count
> > > > > > + *   @skc_bind2_node: bind node in the bhash2 table
> > > > > >   *
> > > > > >   *   This is the minimal network layer representation of sockets, the header
> > > > > >   *   for struct sock and struct inet_timewait_sock.
> > > > > > @@ -241,6 +242,7 @@ struct sock_common {
> > > > > >               u32             skc_window_clamp;
> > > > > >               u32             skc_tw_snd_nxt; /* struct tcp_timewait_sock */
> > > > > >       };
> > > > > > +     struct hlist_node       skc_bind2_node;
> > > > >
> > > > > I *think* it would be better adding a tw_bind2_node field to the
> > > > > inet_timewait_sock struct, so that we leave unmodified the request
> > > > > socket and we don't change the struct sock binary layout. That could
> > > > > affect performances moving hot fields on different cachelines.
> > > > >
> > > > +1. The rest of this patch LGTM.
> > >
> > > Then we can't use sk_for_each_bound_bhash2(), or we have to guarantee this.
> > >
> > >   BUILD_BUG_ON(offsetof(struct sock, sk_bind2_node) !=
> > >                offsetof(struct inet_timewait_sock, tw_bind2_node));
> > >
> > > Considering the number of members in struct sock, at least we have
> > > to move sk_bind2_node forward.
> > >
> > > Another option is to have another TIME_WAIT list in inet_bind2_bucket like
> > > tb2->deathrow or something.  sk_for_each_bound_bhash2() is used only in
> > > inet_bhash2_conflict(), so I think this is feasible.
> >
> > Oh I see, thanks for clarifying!
> >
> > I think we could also check sk_state (which is in __sk_common already)
> > and if it's TCP_TIME_WAIT, then we know sk is at offsetof(struct
> > inet_timewait_sock, tw_bind2_node), whereas otherwise it's at
> > offsetof(struct sock, sk_bind2_node). This seems simpler/cleaner to me
> > than the other approaches. What are your thoughts?
>
> Sorry, I don't get it.  You mean we can check sk_state first and change
> how we traverse ?  But then we cannot know the offset of sk_state if we
> don't know if the socket is TIME_WAIT ... ?

I think the offset of sk_state is the same for both sockets because
sk_state is in "struct sock_common" (__sk_common.skc_state) that both
share.
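
To make the shared-prefix argument concrete, here is a minimal userspace
sketch, not kernel code. Every demo_* type, the DEMO_* state values, and
the helper demo_bind2_node() are hypothetical stand-ins invented for
illustration; only the idea (a common header at offset 0 in both structs,
with the state read first to decide which node member to use) comes from
the discussion above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical state values, standing in for the TCP_* states. */
enum { DEMO_ESTABLISHED = 1, DEMO_TIME_WAIT = 6 };

/* Stand-in for struct hlist_node. */
struct demo_node {
	struct demo_node *next;
};

/* Stand-in for struct sock_common: the header both socket types share. */
struct demo_sock_common {
	unsigned char skc_state;
};

/* Stand-in for struct sock (the kernel names this member __sk_common). */
struct demo_sock {
	struct demo_sock_common sk_common;	/* must stay first */
	long other_fields[8];			/* padding to shift the node */
	struct demo_node sk_bind2_node;
};

/* Stand-in for struct inet_timewait_sock (kernel: __tw_common). */
struct demo_timewait_sock {
	struct demo_sock_common tw_common;	/* must stay first */
	int tw_timeout;
	struct demo_node tw_bind2_node;
};

/* The common header sits at offset 0 in both layouts, so skc_state can
 * be read before we know which kind of socket we are looking at. */
static_assert(offsetof(struct demo_sock, sk_common) == 0,
	      "common header must be first in demo_sock");
static_assert(offsetof(struct demo_timewait_sock, tw_common) == 0,
	      "common header must be first in demo_timewait_sock");

/* Read the state through the shared header, then pick the node member
 * that matches the concrete type. */
static struct demo_node *demo_bind2_node(void *sk)
{
	const struct demo_sock_common *common = sk;

	if (common->skc_state == DEMO_TIME_WAIT)
		return &((struct demo_timewait_sock *)sk)->tw_bind2_node;

	return &((struct demo_sock *)sk)->sk_bind2_node;
}
```

In this sketch the two node members live at different offsets, yet the
helper still returns the right one because the state check does not
depend on the node offset at all.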
