linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Marcelo Tosatti <mtosatti@redhat.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: Nitesh Narayan Lal <nitesh@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	helgaas@kernel.org, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org, linux-pci@vger.kernel.org,
	intel-wired-lan@lists.osuosl.org, frederic@kernel.org,
	sassmann@redhat.com, jesse.brandeburg@intel.com,
	lihong.yang@intel.com, jeffrey.t.kirsher@intel.com,
	jacob.e.keller@intel.com, jlelli@redhat.com, hch@infradead.org,
	bhelgaas@google.com, mike.marciniszyn@intel.com,
	dennis.dalessandro@intel.com, thomas.lendacky@amd.com,
	jiri@nvidia.com, mingo@redhat.com, juri.lelli@redhat.com,
	vincent.guittot@linaro.org, lgoncalv@redhat.com
Subject: Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
Date: Mon, 26 Oct 2020 14:30:12 -0300	[thread overview]
Message-ID: <20201026173012.GA377978@fuller.cnet> (raw)
In-Reply-To: <87ft6464jf.fsf@nanos.tec.linutronix.de>

On Fri, Oct 23, 2020 at 11:00:52PM +0200, Thomas Gleixner wrote:
> On Fri, Oct 23 2020 at 09:10, Nitesh Narayan Lal wrote:
> > On 10/23/20 4:58 AM, Peter Zijlstra wrote:
> >> On Thu, Oct 22, 2020 at 01:47:14PM -0400, Nitesh Narayan Lal wrote:
> >> So shouldn't we then fix the drivers / interface first, to get rid of
> >> this inconsistency?
> >>
> > Considering we agree that excess vector is a problem that needs to be
> > solved across all the drivers and that you are comfortable with the other
> > three patches in the set. If I may suggest the following:
> >
> > - We can pick those three patches for now, as that will atleast fix a
> >   driver that is currently impacting RT workloads. Is that a fair
> >   expectation?
> 
> No. Blindly reducing the maximum vectors to the number of housekeeping
> CPUs is patently wrong. The PCI core _cannot_ just nilly willy decide
> what the right number of interrupts for this situation is.
> 
> Many of these drivers need more than queue interrupts, admin, error
> interrupt and some operate best with seperate RX/TX interrupts per
> queue. They all can "work" with a single PCI interrupt of course, but
> the price you pay is performance.
> 
> An isolated setup, which I'm familiar with, has two housekeeping
> CPUs. So far I restricted the number of network queues with a module
> argument to two, which allocates two management interrupts for the
> device and two interrupts (RX/TX) per queue, i.e. a total of six.
> 
> Now I reduced the number of available interrupts to two according to
> your hack, which makes it use one queue RX/TX combined and one
> management interrupt. Guess what happens? Network performance tanks to
> the points that it breaks a carefully crafted setup.
> 
> The same applies to a device which is application specific and wants one
> channel including an interrupt per isolated application core. Today I
> can isolate 8 out of 12 CPUs and let the device create 8 channels and
> set one interrupt and channel affine to each isolated CPU. With your
> hack, I get only 4 interrupts and channels. Fail!

Good point.

> You cannot declare that all this is perfectly fine, just because it does
> not matter for your particular use case.
> 
> So without information from the driver which tells what the best number
> of interrupts is with a reduced number of CPUs, this cutoff will cause
> more problems than it solves. Regressions guaranteed.

One might want to move from one interrupt per isolated app core
to zero, or vice versa. It seems that "best number of interrupts 
is with reduced number of CPUs" information, is therefore in userspace, 
not in driver...

No?

> Managed interrupts base their interrupt allocation and spreading on
> information which is handed in by the individual driver and not on crude
> assumptions. They are not imposing restrictions on the use case.
> 
> It's perfectly fine for isolated work to save a data set to disk after
> computation has finished and that just works with the per-cpu I/O queue
> which is otherwise completely silent. 

Userspace could only change the mask of interrupts which are not 
triggered by requests from the local CPU (admin, error, mgmt, etc),
to avoid the vector exhaustion problem.

However, there is no explicit way for userspace to know that, as far as
i know.

 130:      34845          0          0          0          0          0          0          0  IR-PCI-MSI 33554433-edge      nvme0q1
 131:          0      27062          0          0          0          0          0          0  IR-PCI-MSI 33554434-edge      nvme0q2
 132:          0          0      24393          0          0          0          0          0  IR-PCI-MSI 33554435-edge      nvme0q3
 133:          0          0          0      24313          0          0          0          0  IR-PCI-MSI 33554436-edge      nvme0q4
 134:          0          0          0          0      20608          0          0          0  IR-PCI-MSI 33554437-edge      nvme0q5
 135:          0          0          0          0          0      22163          0          0  IR-PCI-MSI 33554438-edge      nvme0q6
 136:          0          0          0          0          0          0      23020          0  IR-PCI-MSI 33554439-edge      nvme0q7
 137:          0          0          0          0          0          0          0      24285  IR-PCI-MSI 33554440-edge      nvme0q8


Can that be retrieved from PCI-MSI information, or drivers
have to inform this? 

> All isolated workers can do the
> same in parallel without trampling on each other toes by competing for a
> reduced number of queues which are affine to the housekeeper CPUs.
> 
> Unfortunately network multi-queue is substantially different from block
> multi-queue (as I learned in this conversation), so the concept cannot
> be applied one-to-one to networking as is. But there are certainly part
> of it which can be reused.
> 
> This needs a lot more thought than just these crude hacks.
> 
> Especially under the aspect that there are talks about making isolation
> runtime switchable. Are you going to rmmod/insmod the i40e network
> driver to do so? That's going to work fine if you do that
> reconfiguration over network...
> 
> Thanks,
> 
>         tglx



  parent reply	other threads:[~2020-10-26 17:31 UTC|newest]

Thread overview: 55+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-09-28 18:35 [PATCH v4 0/4] isolation: limit msix vectors to housekeeping CPUs Nitesh Narayan Lal
2020-09-28 18:35 ` [PATCH v4 1/4] sched/isolation: API to get number of " Nitesh Narayan Lal
2020-09-28 18:35 ` [PATCH v4 2/4] sched/isolation: Extend nohz_full to isolate managed IRQs Nitesh Narayan Lal
2020-10-23 13:25   ` Peter Zijlstra
2020-10-23 13:29     ` Frederic Weisbecker
2020-10-23 13:57       ` Nitesh Narayan Lal
2020-10-23 13:45     ` Nitesh Narayan Lal
2020-09-28 18:35 ` [PATCH v4 3/4] i40e: Limit msix vectors to housekeeping CPUs Nitesh Narayan Lal
2020-09-28 18:35 ` [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() " Nitesh Narayan Lal
2020-09-28 21:59   ` Bjorn Helgaas
2020-09-29 17:46     ` Christoph Hellwig
2020-10-16 12:20   ` Peter Zijlstra
2020-10-18 18:14     ` Nitesh Narayan Lal
2020-10-19 11:11       ` Peter Zijlstra
2020-10-19 14:00         ` Marcelo Tosatti
2020-10-19 14:25           ` Nitesh Narayan Lal
2020-10-20  7:30           ` Peter Zijlstra
2020-10-20 13:00             ` Nitesh Narayan Lal
2020-10-20 13:41               ` Peter Zijlstra
2020-10-20 14:39                 ` Nitesh Narayan Lal
2020-10-22 17:47                   ` Nitesh Narayan Lal
2020-10-23  8:58                     ` Peter Zijlstra
2020-10-23 13:10                       ` Nitesh Narayan Lal
2020-10-23 21:00                         ` Thomas Gleixner
2020-10-26 13:35                           ` Nitesh Narayan Lal
2020-10-26 13:57                             ` Thomas Gleixner
2020-10-26 17:30                           ` Marcelo Tosatti [this message]
2020-10-26 19:00                             ` Thomas Gleixner
2020-10-26 19:11                               ` Marcelo Tosatti
2020-10-26 19:21                               ` Jacob Keller
2020-10-26 20:11                                 ` Thomas Gleixner
2020-10-26 21:11                                   ` Jacob Keller
2020-10-26 21:50                                     ` Thomas Gleixner
2020-10-26 22:13                                       ` Jakub Kicinski
2020-10-26 22:46                                         ` Thomas Gleixner
2020-10-26 22:52                                         ` Jacob Keller
2020-10-26 22:22                                       ` Nitesh Narayan Lal
2020-10-26 22:49                                         ` Thomas Gleixner
2020-10-26 23:08                                           ` Jacob Keller
2020-10-27 14:28                                             ` Thomas Gleixner
2020-10-27 11:47                                         ` Marcelo Tosatti
2020-10-27 14:43                                           ` Thomas Gleixner
2020-10-19 14:21         ` Frederic Weisbecker
2020-10-20 14:16   ` Thomas Gleixner
2020-10-20 16:18     ` Nitesh Narayan Lal
2020-10-20 18:07       ` Thomas Gleixner
2020-10-21 20:25         ` Thomas Gleixner
2020-10-21 21:04           ` Nitesh Narayan Lal
2020-10-22  0:02           ` Jakub Kicinski
2020-10-22  0:27             ` Jacob Keller
2020-10-22  8:28             ` Thomas Gleixner
2020-10-22 12:28           ` Marcelo Tosatti
2020-10-22 22:39             ` Thomas Gleixner
2020-10-01 15:49 ` [PATCH v4 0/4] isolation: limit msix vectors " Frederic Weisbecker
2020-10-08 21:40   ` Nitesh Narayan Lal

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20201026173012.GA377978@fuller.cnet \
    --to=mtosatti@redhat.com \
    --cc=bhelgaas@google.com \
    --cc=dennis.dalessandro@intel.com \
    --cc=frederic@kernel.org \
    --cc=hch@infradead.org \
    --cc=helgaas@kernel.org \
    --cc=intel-wired-lan@lists.osuosl.org \
    --cc=jacob.e.keller@intel.com \
    --cc=jeffrey.t.kirsher@intel.com \
    --cc=jesse.brandeburg@intel.com \
    --cc=jiri@nvidia.com \
    --cc=jlelli@redhat.com \
    --cc=juri.lelli@redhat.com \
    --cc=lgoncalv@redhat.com \
    --cc=lihong.yang@intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=mike.marciniszyn@intel.com \
    --cc=mingo@redhat.com \
    --cc=netdev@vger.kernel.org \
    --cc=nitesh@redhat.com \
    --cc=peterz@infradead.org \
    --cc=sassmann@redhat.com \
    --cc=tglx@linutronix.de \
    --cc=thomas.lendacky@amd.com \
    --cc=vincent.guittot@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).