Date: Fri, 20 Nov 2020 12:50:39 -0800
From: Jakub Kicinski
To: Marcel Apfelbaum
Cc: marcel@redhat.com, Saeed Mahameed, netdev@vger.kernel.org, Paolo Abeni, Jonathan Corbet, "David S. Miller", Shuah Khan, Marcelo Tosatti, Daniel Borkmann
Subject: Re: [kuba@kernel.org: Re: [PATCH net-next v2 0/3] net: introduce rps_default_mask]
Message-ID: <20201120125039.6a88b0b1@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com>
References: <20201119162527.GB9877@fuller.cnet>

On Fri, 20 Nov 2020 19:39:24 +0200 Marcel Apfelbaum wrote:
> > The CPU isolation is done statically at system boot by setting
> > Linux kernel parameters, so the container management component, in
> > this case the Machine Configuration Operator (for OpenShift) or its
> > K8s counterpart, can't really help. (Actually, they could help if a
> > global RPS mask existed.)
> >
> > I tried to tweak the rps_cpus mask using the container management
> > stack, but there is no sane way to do it; let me go into the
> > details a little.
> >
> > The k8s orchestration component that deals with injecting the
> > network device(s) into the container is CNI, which is
> > interface-based and implemented by a lot of plugins, making it
> > hardly feasible to go over all the existing plugins and change
> > them. And what about the 3rd-party ones?

I'm not particularly amenable to the "changing user space is hard"
argument, especially since you don't appear to have given it an honest
try.
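For reference, the per-device knob the quoted text is trying to reach
is the per-RX-queue rps_cpus file under sysfs. A minimal, illustrative
sketch of such a write from user space follows; the device name
("eth0"), queue index (rx-0) and mask value ("f") are placeholders,
not values taken from this thread:

/*
 * Minimal illustrative sketch: write an RPS CPU mask for one RX queue
 * of a device from user space. Device name, queue index and mask are
 * placeholders.
 */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/class/net/eth0/queues/rx-0/rps_cpus", "w");

	if (!f) {
		perror("rps_cpus");
		return 1;
	}
	fprintf(f, "f\n");	/* hex bitmask of CPUs allowed to do RPS */
	fclose(f);
	return 0;
}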
> > Writing a new CNI plugin and chaining it into the existing one is
> > also not an option AFAIK; they work at the network level and do not
> > have access to sysfs (they handle the network namespaces). Even if
> > it were possible (I don't have a deep understanding of CNI), it
> > would require a cluster-global configuration that is actually
> > needed only on some of the cluster nodes.
> >
> > Another approach is to set the RPS configuration from the inside
> > (of the container), but the /sys mount is read-only for
> > unprivileged containers, so we lose again.
> >
> > That leaves us with a host daemon hack:
> > since the virtual network devices are created in the host namespace
> > and then "moved" into the container, we can listen for some udev
> > event and write to the rps_cpus file after the virtual netdev is
> > created and before it is moved (as stated above, that work is done
> > by a CNI plugin implementation). That is of course extremely racy
> > and not a valid solution.
> >
> >> > Possibly I can reduce the amount of new code introduced by this
> >> > patchset by removing some code duplication between
> >> > rps_default_mask_sysctl() and flow_limit_cpu_sysctl().
> >> > Would that make this change more acceptable? Or should I drop
> >> > it altogether?
> >>
> >> I'm leaning towards dropping it altogether, unless you can get
> >> some support/review tags from other netdev developers. So far it
> >> appears we only got a down vote from Saeed.

As I said here, try to convince some other senior networking
developers this is the right solution and I'll apply it.

This is giving me flashbacks of trying to bend the kernel for
OpenStack because there was no developer on my team who could change
OpenStack.

> > Any solution that allows user space to avoid the network soft IRQs
> > on specific CPUs would be welcome.
> >
> > The proposed global mask is one solution; maybe there are other
> > ways?
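The host-daemon hack described in the quoted text (react to host-side
netdev creation and write rps_cpus before the CNI plugin moves the
device into the container namespace) could look roughly like the
libudev sketch below. It is illustrative only: it assumes libudev is
available, touches only queue rx-0, uses a placeholder mask of "f",
and leaves open exactly the race called out above.

/*
 * Rough sketch of the racy host-daemon approach: watch for net
 * devices appearing on the host via libudev and write an RPS mask
 * before the device is moved into the container namespace.
 * Build with -ludev. Mask and queue index are placeholders.
 */
#include <libudev.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>

static void set_rps_mask(const char *ifname, const char *mask)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/class/net/%s/queues/rx-0/rps_cpus", ifname);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return;
	}
	fprintf(f, "%s\n", mask);
	fclose(f);
}

int main(void)
{
	struct udev *udev = udev_new();
	struct udev_monitor *mon;
	struct pollfd pfd;

	if (!udev)
		return 1;
	mon = udev_monitor_new_from_netlink(udev, "udev");
	udev_monitor_filter_add_match_subsystem_devtype(mon, "net", NULL);
	udev_monitor_enable_receiving(mon);
	pfd.fd = udev_monitor_get_fd(mon);
	pfd.events = POLLIN;

	for (;;) {
		struct udev_device *dev;
		const char *action;

		if (poll(&pfd, 1, -1) <= 0)
			continue;
		dev = udev_monitor_receive_device(mon);
		if (!dev)
			continue;
		action = udev_device_get_action(dev);
		/* "add" fires when the host-side netdev is created; the
		 * device may already be on its way into the container,
		 * which is exactly the race described above. */
		if (action && !strcmp(action, "add"))
			set_rps_mask(udev_device_get_sysname(dev), "f");
		udev_device_unref(dev);
	}
	return 0;
}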