Re: [kuba@kernel.org: Re: [PATCH net-next v2 0/3] net: introduce rps_default_mask]

From: Marcel Apfelbaum <mapfelba@redhat.com>
To: Saeed Mahameed <saeed@kernel.org>,
	Jakub Kicinski <kuba@kernel.org>,
	netdev@vger.kernel.org
Cc: Paolo Abeni <pabeni@redhat.com>, Jonathan Corbet <corbet@lwn.net>,
	"David S. Miller" <davem@davemloft.net>,
	Shuah Khan <shuah@kernel.org>,
	Marcelo Tosatti <mtosatti@redhat.com>,
	Daniel Borkmann <daniel@iogearbox.net>
Subject: Re: [kuba@kernel.org: Re: [PATCH net-next v2 0/3] net: introduce rps_default_mask]
Date: Fri, 20 Nov 2020 19:39:24 +0200	[thread overview]
Message-ID: <CA+aaKfD_6qdNVRgr2TdDeuOau1UCFzRqWRB8iM-_GHV7mMrcsg@mail.gmail.com> (raw)
In-Reply-To: <CA+aaKfCMa1sOa6bMXFAaP6Wb=5ZgoAcnZAaP9aBmWwOzaAtcHw@mail.gmail.com>

+netdev
[Sorry for the second email, I failed to set the text-only mode]

On Fri, Nov 20, 2020 at 7:30 PM Marcel Apfelbaum <mapfelba@redhat.com> wrote:
>
> Hi,
>
>>
>>
>> ---------- Forwarded message ----------
>> From: Jakub Kicinski <kuba@kernel.org>
>> To: Paolo Abeni <pabeni@redhat.com>
>> Cc: Saeed Mahameed <saeed@kernel.org>, netdev@vger.kernel.org, Jonathan Corbet <corbet@lwn.net>, "David S. Miller" <davem@davemloft.net>, Shuah Khan <shuah@kernel.org>, linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org, Marcelo Tosatti <mtosatti@redhat.com>, Daniel Borkmann <daniel@iogearbox.net>
>> Bcc:
>> Date: Wed, 4 Nov 2020 11:42:26 -0800
>> Subject: Re: [PATCH net-next v2 0/3] net: introduce rps_default_mask
>> On Wed, 04 Nov 2020 18:36:08 +0100 Paolo Abeni wrote:
>> > On Tue, 2020-11-03 at 08:52 -0800, Jakub Kicinski wrote:
>> > > On Tue, 03 Nov 2020 16:22:07 +0100 Paolo Abeni wrote:
>> > > > The relevant use case is an host running containers (with the related
>> > > > orchestration tools) in a RT environment. Virtual devices (veths, ovs
>> > > > ports, etc.) are created by the orchestration tools at run-time.
>> > > > Critical processes are allowed to send packets/generate outgoing, it gets a network-interface
>> > upstart job just as it does on a real host.
>> >
>> > > > network traffic - but any interrupt is moved away from the related
>> > > > cores, so that usual incoming network traffic processing does not
>> > > > happen there.
>> > > >
>> > > > Still an xmit operation on a virtual devices may be transmitted via ovs
>> > > > or veth, with the relevant forwarding operation happening in a softirq
>> > > > on the same CPU originating the packet.
>> > > >
>> > > > RPS is configured (even) on such virtual devices to move away the
>> > > > forwarding from the relevant CPUs.
>> > > >
>> > > > As Saeed noted, such configuration could be possibly performed via some
>> > > > user-space daemon monitoring network devices and network namespaces
>> > > > creation. That will be anyway prone to some race: the orchestation tool
>> > > > may create and enable the netns and virtual devices before the daemon
>> > > > has properly set the RPS mask.
>> > > >
>> > > > In the latter scenario some packet forwarding could still slip in the
>> > > > relevant CPU, causing measurable latency. In all non RT scenarios the
>> > > > above will be likely irrelevant, but in the RT context that is not
>> > > > acceptable - e.g. it causes in real environments latency above the
>> > > > defined limits, while the proposed patches avoid the issue.
>> > > >
>> > > > Do you see any other simple way to avoid the above race?
>> > > >
>> > > > Please let me know if the above answers your doubts,
>> > >
>> > > Thanks, that makes it clearer now.
>> > >
>> > > Depending on how RT-aware your container management is it may or may not
>> > > be the right place to configure this, as it creates the veth interface.
>> > > Presumably it's the container management which does the placement of
>> > > the tasks to cores, why is it not setting other attributes, like RPS?
>
>
> The CPU isolation is done statically at system boot by setting Linux kernel parameters,
> So the container management component, in this case the Machine Configuration Operator (for Openshift)
> or the K8s counterpart can't really help. (actually they would help if a global RPS mask would exist)
>
> I tried to tweak the rps_cpus mask using the container management stack, but there
> is no sane way to do it, please let me get a little into the details.
>
> The k8s orchestration component that deals with injecting the network device(s) into the
> container is CNI, which is interface based and implemented by a lot of plugins, making it
> hardly feasible to go over all the existing plugins and change them. Also what about
> the 3rd party ones?
>
> Writing a new CNI plugin and chain it into the existing one is also not an option AFAIK,
> they work at the network level and do not have access to sysfs (they handle the network namespaces).
> Even if it would be possible (I don't have a deep CNI understanding), it will require a cluster global configuration
> that is actually needed only for some of the cluster nodes.
>
> Another approach is to set the RPS configuration from the inside(of the container),
> but the /sys mount is read only for unprivileged containers, so we lose again.
>
> That leaves us with a host daemon hack:
> Since the virtual network devices are created in the host namespace and
> then "moved" into the container, we can listen to some udev event
> and write to the rps_cpus file after the virtual netdev is created and before
> it is moved (as stated above, the work is done by a CNI plugin implementation).
> That is of course extremely racy and not a valid solution.
>
>
>>
>> >
>> > The container orchestration is quite complex, and I'm unsure isolation
>> > and networking configuration are performed (or can be performed) by the
>> > same precess (without an heavy refactor).
>> >
>> > On the flip hand, the global rps mask knob looked quite
>> > straightforward to me.
>>
>> I understand, but I can't shake the feeling this is a hack.
>>
>> Whatever sets the CPU isolation should take care of the RPS settings.
>
>
> Sadly it can't be done, please see above.
>
>>
>>
>> > Possibly I can reduce the amount of new code introduced by this
>> > patchset removing some code duplication
>> > between rps_default_mask_sysctl() and flow_limit_cpu_sysctl(). Would
>> > that make this change more acceptable? Or should I drop this
>> > altogether?
>>
>> I'm leaning towards drop altogether, unless you can get some
>> support/review tags from other netdev developers. So far it
>> appears we only got a down vote from Saeed.
>>
>
> Any solution that will allow the user space to avoid the
> network soft IRQs on specific CPUs would be welcome.
>
> The proposed global mask is a solution, maybe there other ways?
>
> Thanks,
> Marcel