Re: [kuba@kernel.org: Re: [PATCH net-next v2 0/3] net: introduce rps_default_mask]

From: Jakub Kicinski <kuba@kernel.org>
To: Marcel Apfelbaum <mapfelba@redhat.com>
Cc: marcel@redhat.com, Saeed Mahameed <saeed@kernel.org>,
	netdev@vger.kernel.org, Paolo Abeni <pabeni@redhat.com>,
	Jonathan Corbet <corbet@lwn.net>,
	"David S. Miller" <davem@davemloft.net>,
	Shuah Khan <shuah@kernel.org>,
	Marcelo Tosatti <mtosatti@redhat.com>,
	Daniel Borkmann <daniel@iogearbox.net>
Subject: Re: [kuba@kernel.org: Re: [PATCH net-next v2 0/3] net: introduce rps_default_mask]
Date: Fri, 20 Nov 2020 12:50:39 -0800	[thread overview]
Message-ID: <20201120125039.6a88b0b1@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com> (raw)
In-Reply-To: <CA+aaKfD_6qdNVRgr2TdDeuOau1UCFzRqWRB8iM-_GHV7mMrcsg@mail.gmail.com>

On Fri, 20 Nov 2020 19:39:24 +0200 Marcel Apfelbaum wrote:
> > The CPU isolation is done statically at system boot by setting
> > Linux kernel parameters, So the container management component, in
> > this case the Machine Configuration Operator (for Openshift) or the
> > K8s counterpart can't really help. (actually they would help if a
> > global RPS mask would exist)
> >
> > I tried to tweak the rps_cpus mask using the container management
> > stack, but there is no sane way to do it, please let me get a
> > little into the details.
> >
> > The k8s orchestration component that deals with injecting the
> > network device(s) into the container is CNI, which is interface
> > based and implemented by a lot of plugins, making it hardly
> > feasible to go over all the existing plugins and change them. Also
> > what about the 3rd party ones?

I'm not particularly amenable to the "changing user space is hard"
argument. Especially that you don't appear to have given it an honest
try.

> > Writing a new CNI plugin and chain it into the existing one is also
> > not an option AFAIK, they work at the network level and do not have
> > access to sysfs (they handle the network namespaces). Even if it
> > would be possible (I don't have a deep CNI understanding), it will
> > require a cluster global configuration that is actually needed only
> > for some of the cluster nodes.
> >
> > Another approach is to set the RPS configuration from the inside(of
> > the container), but the /sys mount is read only for unprivileged
> > containers, so we lose again.
> >
> > That leaves us with a host daemon hack:
> > Since the virtual network devices are created in the host namespace
> > and then "moved" into the container, we can listen to some udev
> > event and write to the rps_cpus file after the virtual netdev is
> > created and before it is moved (as stated above, the work is done
> > by a CNI plugin implementation). That is of course extremely racy
> > and not a valid solution.
> >
> >> > Possibly I can reduce the amount of new code introduced by this
> >> > patchset removing some code duplication
> >> > between rps_default_mask_sysctl() and flow_limit_cpu_sysctl().
> >> > Would that make this change more acceptable? Or should I drop
> >> > this altogether?  
> >>
> >> I'm leaning towards drop altogether, unless you can get some
> >> support/review tags from other netdev developers. So far it
> >> appears we only got a down vote from Saeed.

As I said here, try to convince some other senior networking developers
this is the right solution and I'll apply it.

This is giving me flashbacks of trying bend the kernel for OpenStack
because there was no developer on my team who could change OpenStack.

> > Any solution that will allow the user space to avoid the
> > network soft IRQs on specific CPUs would be welcome.
> >
> > The proposed global mask is a solution, maybe there other ways?