Re: [PATCH RFC 0/9] sk_buff: optimize layout for GRO

From: Paolo Abeni <pabeni@redhat.com>
To: Casey Schaufler <casey@schaufler-ca.com>, netdev@vger.kernel.org
Cc: "David S. Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>, Florian Westphal <fw@strlen.de>,
	Eric Dumazet <edumazet@google.com>,
	linux-security-module@vger.kernel.org, selinux@vger.kernel.org
Subject: Re: [PATCH RFC 0/9] sk_buff: optimize layout for GRO
Date: Thu, 22 Jul 2021 18:57:31 +0200	[thread overview]
Message-ID: <d3fe6ae85b8fad9090288c553f8d248603758506.camel@redhat.com> (raw)
In-Reply-To: <2e9e57f0-98f9-b64d-fd82-aecef84835c5@schaufler-ca.com>

On Thu, 2021-07-22 at 09:04 -0700, Casey Schaufler wrote:
> On 7/22/2021 12:10 AM, Paolo Abeni wrote:
> > On Wed, 2021-07-21 at 11:15 -0700, Casey Schaufler wrote:
> > > On 7/21/2021 9:44 AM, Paolo Abeni wrote:
> > > > This is a very early draft - in a different world would be
> > > > replaced by hallway discussion at in-person conference - aimed at
> > > > outlining some ideas and collect feedback on the overall outlook.
> > > > There are still bugs to be fixed, more test and benchmark need, etc.
> > > > 
> > > > There are 3 main goals:
> > > > - [try to] avoid the overhead for uncommon conditions at GRO time
> > > >   (patches 1-4)
> > > > - enable backpressure for the veth GRO path (patches 5-6)
> > > > - reduce the number of cacheline used by the sk_buff lifecycle
> > > >   from 4 to 3, at least in some common scenarios (patches 1,7-9).
> > > >   The idea here is avoid the initialization of some fields and
> > > >   control their validity with a bitmask, as presented by at least
> > > >   Florian and Jesper in the past.
> > > If I understand correctly, you're creating an optimized case
> > > which excludes ct, secmark, vlan and UDP tunnel. Is this correct,
> > > and if so, why those particular fields? What impact will this have
> > > in the non-optimal (with any of the excluded fields) case?
> > Thank you for the feedback.
> 
> You're most welcome. You did request comments.
> 
> > There are 2 different relevant points:
> > 
> > - the GRO stage.
> >   packets carring any of CT, dst, sk or skb_ext will do 2 additional
> > conditionals per gro_receive WRT the current code. My understanding is
> > that having any of such field set at GRO receive time is quite
> > exceptional for real nic. All others packet will do 4 or 5 less
> > conditionals, and will traverse a little less code.
> > 
> > - sk_buff lifecycle
> >   * packets carrying vlan and UDP will not see any differences: sk_buff
> > lifecycle will stil use 4 cachelines, as currently does, and no
> > additional conditional is introduced.
> >   * packets carring nfct or secmark will see an additional conditional
> > every time such field is accessed. The number of cacheline used will
> > still be 4, as in the current code. My understanding is that when such
> > access happens, there is already a relevant amount of "additional" code
> > to be executed, the conditional overhead should not be measurable.
> 
> I'm responsible for some of that "additonal" code. If the secmark
> is considered to be outside the performance critical data there are
> changes I would like to make that will substantially improve the
> performance of that "additional" code that would include a u64
> secmark. If use of a secmark is considered indicative of a "slow"
> path, the rationale for restricting it to u32, that it might impact
> the "usual" case performance, seems specious. I can't say that I
> understand all the nuances and implications involved. It does
> appear that the changes you've suggested could negate the classic
> argument that requires the u32 secmark.

I see now I did not reply to one of you questions - why I picked-up
 vlan, tunnel secmark fields to move them at sk_buff tail. 

Tow main drivers on my side:
- there are use cases/deployments that do not use them.
- moving them around was doable in term of required changes.

There are no "slow-path" implications on my side. For example, vlan_*
fields are very critical performance wise, if the traffic is tagged.
But surely there are busy servers not using tagget traffic which will
enjoy the reduced cachelines footprint, and this changeset will not
impact negatively the first case.

WRT to the vlan example, secmark and nfct require an extra conditional
to fetch the data. My understanding is that such additional conditional
is not measurable performance-wise when benchmarking the security
modules (or conntrack) because they have to do much more intersting
things after fetching a few bytes from an already hot cacheline.

Not sure if the above somehow clarify my statements.

As for expanding secmark to 64 bits, I guess that could be an
interesting follow-up discussion :)

Cheers,

Paolo