Re: [net-next v1] net: Add a second bind table hashed by port and address

From: Joanne Koong <joannelkoong@gmail.com>
To: Eric Dumazet <edumazet@google.com>
Cc: netdev <netdev@vger.kernel.org>, Martin KaFai Lau <kafai@fb.com>,
	David Miller <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>
Subject: Re: [net-next v1] net: Add a second bind table hashed by port and address
Date: Fri, 22 Apr 2022 15:55:12 -0700
Message-ID: <CAJnrk1btHhosTt_PwW77NK1frmZ2Q8j4DYEB9+7H_VP5iocqcg@mail.gmail.com>
In-Reply-To: <CANn89iKOkHHJ-papcMXJvq_8xSE2zXvqTfNSfGhq=Y1y_oKy6A@mail.gmail.com>

On Fri, Apr 22, 2022 at 2:25 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Apr 22, 2022 at 2:07 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > On Thu, Apr 21, 2022 at 3:50 PM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Thu, Apr 21, 2022 at 3:16 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > >
> > > > We currently have one tcp bind table (bhash) which hashes by port
> > > > number only. In the socket bind path, we check for bind conflicts by
> > > > traversing the specified port's inet_bind_bucket while holding the
> > > > bucket's spinlock (see inet_csk_get_port() and inet_csk_bind_conflict()).
> > > >
> > > > In instances where there are tons of sockets hashed to the same port
> > > > at different addresses, checking for a bind conflict is time-intensive
> > > > and can cause softirq cpu lockups, as well as stop new tcp connections,
> > > > since __inet_inherit_port() also contends for the spinlock.
> > > >
> > > > This patch proposes adding a second bind table, bhash2, that hashes by
> > > > port and ip address. Searching the bhash2 table leads to significantly
> > > > faster conflict resolution and less time holding the spinlock.
> > > > When experimentally testing this on a local server, the results for how
> > > > long a bind request takes were as follows:
> > > >
> > > > when there are ~24k sockets already bound to the port -
> > > >
> > > > ipv4:
> > > > before - 0.002317 seconds
> > > > with bhash2 - 0.000018 seconds
> > > >
> > > > ipv6:
> > > > before - 0.002431 seconds
> > > > with bhash2 - 0.000021 seconds
> > >
> > >
> > > Hi Joanne
> > >
> > > Do you have a test for this ? Are you using 24k IPv6 addresses on the host ?
> > >
> > > I fear we add some extra code and cost for quite an unusual configuration.
> > >
> > > Thanks.
> > >
> > Hi Eric,
> >
> > I have a test on my local server that populates the bhash table entry
> > with 24k sockets for a given port and address, and then times how long
> > a bind request on that port takes.
>
> OK, but why 24k ? Why not 24 M then ?
>
> In this case, will a 64K hash table be big enough ?
24k was one test case scenario, another one was ~12M; these were used
to get a sense of how the bhash2 table performs in situations where
the bhash table entry for the port is saturated.
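As a rough illustration of what "saturated" means here, below is a toy
userspace model, not code from the patch. The class and function names
(ToyBindTables, build_example) are made up for this sketch, the
addresses are documentation-range placeholders, and it ignores hash
collisions and wildcard addresses. It only counts how many entries a
conflict check would have to walk when keyed by port alone versus by
port and address.

```python
from collections import defaultdict


class ToyBindTables:
    """Toy stand-in for the two bind tables; not the kernel structures."""

    def __init__(self):
        self.bhash = defaultdict(list)   # keyed by port only
        self.bhash2 = defaultdict(list)  # keyed by (port, address)

    def bind(self, addr, port):
        # Record the socket in both tables (assumption: both are kept).
        self.bhash[port].append(addr)
        self.bhash2[(port, addr)].append(addr)

    def walk_len_bhash(self, port):
        # Entries a conflict check walks when the key is the port alone.
        return len(self.bhash.get(port, ()))

    def walk_len_bhash2(self, addr, port):
        # Entries walked when the key is the (port, address) pair.
        return len(self.bhash2.get((port, addr), ()))


def build_example(workers=80, vips=300, port=443):
    """Bind `workers` sockets to each of `vips` addresses on one port."""
    tables = ToyBindTables()
    for v in range(vips):
        addr = f"2001:db8::{v + 1:x}"  # placeholder address per vip
        for _ in range(workers):
            tables.bind(addr, port)
    return tables


if __name__ == "__main__":
    t = build_example()
    print("bhash entries walked: ", t.walk_len_bhash(443))
    print("bhash2 entries walked:", t.walk_len_bhash2("2001:db8::1", 443))
```

With the defaults, the port-only lookup walks all 24000 entries for the
port, while the port-and-address lookup walks only the 80 entries for
that one address.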
>
> > When populating the table entry, I
> > use the same IPv6 address on the host (with SO_REUSEADDR set). At
> > Facebook, there are some internal teams that submit bind requests for
> > 400 vips on the same port on concurrent threads that run into softirq
> > lockup issues due to the bhash table entry spinlock contention, which
> > is the main motivation behind this patch.
>
> I am pretty sure the IPv6 stack does not scale well if we have
> thousands of IPv6 addresses on one netdev.
> Some O(N) behavior will also trigger latency violations.
>
> Can you share the test, in a form that can be added in linux tree ?
I will include it somewhere under tools/testing/selftests/net - does that sound okay?
>
> I mean, before today nobody was trying to have 24k listeners on a host,
> so it would be nice to have a regression test for future changes in the stack.
>
> If the goal is to deal with 400 vips, why use 24k in your changelog ?
> I would rather stick to the reality, and not pretend TCP stack should
> scale to 24k listeners.
I chose 24k to test on because one internal team's usage is binding
from 80 workers for ~300 vips in parallel on the same port, which comes
to roughly 80 * 300 = 24k bind requests for that port.
>
> I have not looked at the patch yet, I choked on the changelog for
> being exaggerated.