netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Joanne Koong <joannelkoong@gmail.com>
To: Eric Dumazet <edumazet@google.com>
Cc: netdev <netdev@vger.kernel.org>, Martin KaFai Lau <kafai@fb.com>,
	David Miller <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>
Subject: Re: [net-next v1] net: Add a second bind table hashed by port and address
Date: Fri, 22 Apr 2022 15:55:12 -0700	[thread overview]
Message-ID: <CAJnrk1btHhosTt_PwW77NK1frmZ2Q8j4DYEB9+7H_VP5iocqcg@mail.gmail.com> (raw)
In-Reply-To: <CANn89iKOkHHJ-papcMXJvq_8xSE2zXvqTfNSfGhq=Y1y_oKy6A@mail.gmail.com>

On Fri, Apr 22, 2022 at 2:25 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Apr 22, 2022 at 2:07 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > On Thu, Apr 21, 2022 at 3:50 PM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Thu, Apr 21, 2022 at 3:16 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > >
> > > > We currently have one tcp bind table (bhash) which hashes by port
> > > > number only. In the socket bind path, we check for bind conflicts by
> > > > traversing the specified port's inet_bind2_bucket while holding the
> > > > bucket's spinlock (see inet_csk_get_port() and inet_csk_bind_conflict()).
> > > >
> > > > In instances where there are tons of sockets hashed to the same port
> > > > at different addresses, checking for a bind conflict is time-intensive
> > > > and can cause softirq cpu lockups, as well as stops new tcp connections
> > > > since __inet_inherit_port() also contests for the spinlock.
> > > >
> > > > This patch proposes adding a second bind table, bhash2, that hashes by
> > > > port and ip address. Searching the bhash2 table leads to significantly
> > > > faster conflict resolution and less time holding the spinlock.
> > > > When experimentally testing this on a local server, the results for how
> > > > long a bind request takes were as follows:
> > > >
> > > > when there are ~24k sockets already bound to the port -
> > > >
> > > > ipv4:
> > > > before - 0.002317 seconds
> > > > with bhash2 - 0.000018 seconds
> > > >
> > > > ipv6:
> > > > before - 0.002431 seconds
> > > > with bhash2 - 0.000021 seconds
> > >
> > >
> > > Hi Joanne
> > >
> > > Do you have a test for this ? Are you using 24k IPv6 addresses on the host ?
> > >
> > > I fear we add some extra code and cost for quite an unusual configuration.
> > >
> > > Thanks.
> > >
> > Hi Eric,
> >
> > I have a test on my local server that populates the bhash table entry
> > with 24k sockets for a given port and address, and then times how long
> > a bind request on that port takes.
>
> OK, but why 24k ? Why not 24 M then ?
>
> In this case, will a 64K hash table be big enough ?
24k was one test case scenario, another one was ~12M; these were used
to get a sense of how the bhash2 table performs in situations where
the bhash table entry for the port is saturated.
>
>  When populating the table entry, I
> > use the same IPv6 address on the host (with SO_REUSEADDR set). At
> > Facebook, there are some internal teams that submit bind requests for
> > 400 vips on the same port on concurrent threads that run into softirq
> > lockup issues due to the bhash table entry spinlock contention, which
> > is the main motivation behind this patch.
>
> I am pretty sure the IPv6 stack does not scale well if we have
> thousands of IPv6 addresses on one netdev.
> Some O(N) behavior will also trigger latency violations.
>
> Can you share the test, in a form that can be added in linux tree ?
I will include it somewhere under testing/selftests/net - does that sound okay?
>
> I mean, before today nobody was trying to have 24k listeners on a host,
> so it would be nice to have a regression test for future changes in the stack.
>
> If the goal is to deal with 400 vips, why using 24k in your changelog ?
> I would rather stick to the reality, and not pretend TCP stack should
> scale to 24k listeners.
I chose 24k to test on because one of the internal team's usages is
binding from 80 workers for ~300 vips in parallel for the same port.
>
> I have not looked at the patch yet, I choked on the changelog for
> being exaggerated.

      reply	other threads:[~2022-04-22 23:17 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-04-21 22:14 [net-next v1] net: Add a second bind table hashed by port and address Joanne Koong
2022-04-21 22:50 ` Eric Dumazet
2022-04-22 21:06   ` Joanne Koong
2022-04-22 21:25     ` Eric Dumazet
2022-04-22 22:55       ` Joanne Koong [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAJnrk1btHhosTt_PwW77NK1frmZ2Q8j4DYEB9+7H_VP5iocqcg@mail.gmail.com \
    --to=joannelkoong@gmail.com \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=kafai@fb.com \
    --cc=kuba@kernel.org \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).