From: Nitesh Narayan Lal <>
Subject: [PATCH v4 0/4] isolation: limit msix vectors to housekeeping CPUs
Date: Mon, 28 Sep 2020 14:35:25 -0400
Message-ID: <>

This is a follow-up posting for "[PATCH v3 0/4] isolation: limit msix vectors
to housekeeping CPUs".

With the current implementation, device drivers take only num_online_cpus()
into consideration while creating their MSI-X vectors. This works quite well
for a non-RT environment, but in an RT environment that has a large number of
isolated CPUs and very few housekeeping CPUs it can lead to a problem.
The problem is triggered when something like tuned tries to move all the
IRQs from the isolated CPUs to the limited number of housekeeping CPUs to
prevent interruptions for a latency-sensitive workload running on the
isolated CPUs. This failure is caused by the per-CPU vector limitation.

Proposed Fix
In this patch-set, the following changes are proposed:
- A generic API housekeeping_num_online_cpus() which is meant to return the
  number of online housekeeping CPUs based on the hk_flag passed by the caller.
- i40e: Specifically for the i40e driver, the num_online_cpus() call used in
  i40e_init_msix() to calculate the number of MSI-X vectors is replaced with
  the above API, which returns the number of online housekeeping CPUs that
  are meant to handle managed IRQ jobs.
- pci_alloc_irq_vectors(): With the help of housekeeping_num_online_cpus(),
  the max_vecs passed to pci_alloc_irq_vectors() is restricted to the number
  of online housekeeping CPUs (designated for managed IRQ jobs), strictly in
  an RT environment. However, if min_vecs exceeds the number of online
  housekeeping CPUs, max_vecs is limited to min_vecs instead.
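
The clamping rule described in the list above can be illustrated with a small
userspace model. This is only a sketch, not the code from the patches: the
names cpu_mask_t, mask_weight(), model_hk_num_online_cpus() and
model_limit_max_vecs() are hypothetical, CPUs are modeled as bits in a 64-bit
word, and "RT environment" is assumed to mean "fewer housekeeping CPUs than
online CPUs".

```c
#include <assert.h>
#include <stdint.h>

/* Userspace model: each bit of a 64-bit word stands for one CPU. */
typedef uint64_t cpu_mask_t;

/* Count the set bits in a mask (portable popcount). */
static unsigned int mask_weight(cpu_mask_t mask)
{
	unsigned int n = 0;

	while (mask) {
		mask &= mask - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

/*
 * Hypothetical stand-in for housekeeping_num_online_cpus(): the number of
 * CPUs that are both online and in the housekeeping set. The housekeeping
 * mask passed in represents whatever set the caller's hk_flag selects.
 */
static unsigned int model_hk_num_online_cpus(cpu_mask_t online,
					     cpu_mask_t housekeeping)
{
	return mask_weight(online & housekeeping);
}

/*
 * Clamp max_vecs following the rule in the cover letter:
 *  - only act when some CPUs are isolated (assumption: hk_cpus < online_cpus),
 *  - if min_vecs exceeds the housekeeping CPUs, limit max_vecs to min_vecs,
 *  - otherwise limit max_vecs to the number of housekeeping CPUs.
 */
static unsigned int model_limit_max_vecs(unsigned int min_vecs,
					 unsigned int max_vecs,
					 unsigned int hk_cpus,
					 unsigned int online_cpus)
{
	if (hk_cpus < online_cpus) {
		if (hk_cpus < min_vecs)
			max_vecs = min_vecs;
		else if (hk_cpus < max_vecs)
			max_vecs = hk_cpus;
	}
	return max_vecs;
}
```

With 24 online CPUs of which 4 are housekeeping, a request of min_vecs=1,
max_vecs=24 is reduced to 4 vectors in this model, while a request with
min_vecs=8 is reduced to 8.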

Future Work

- In the previous upstream discussion [1], it was decided that it would be
  better to have a generic framework that can be consumed by all the
  drivers to fix this kind of issue. However, that is long-term work, and
  since there are RT workloads that are already impacted by the reported
  issue, we agreed upon the proposed per-device approach for now.

Testing

- To test that the issue is resolved with the i40e change, I added a
  tracepoint in i40e_init_msix() to find the number of CPUs derived for
  vector creation with and without tuned's realtime-virtual-host profile. As
  expected, with the profile applied I was only getting the number of
  housekeeping CPUs, and all available CPUs without it. Another way to verify
  is by checking the number of IRQs that get created for an impacted device.
  I also ran a few more tests with different modes, e.g. with only nohz_full,
  only isolcpus, etc.
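
The second verification method, counting the IRQs created for a device, can
be sketched as a small C helper that counts the lines of /proc/interrupts
containing a given substring. This is an illustration only; the helper names
are hypothetical, and the substring to search for (for example an interface
or driver name) depends on how the driver names its vectors on the system
under test.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Count the lines of an open stream that contain `needle`.
 * getline() (POSIX) is used so that very wide lines, as seen on machines
 * with many CPU columns, are never split in the middle of a name.
 */
static int count_matching_lines(FILE *fp, const char *needle)
{
	char *line = NULL;
	size_t cap = 0;
	int count = 0;

	while (getline(&line, &cap, fp) != -1) {
		if (strstr(line, needle))
			count++;
	}
	free(line);
	return count;
}

/*
 * Open `path` (normally "/proc/interrupts") and count the matching lines.
 * Returns -1 if the file cannot be opened.
 */
static int count_irqs_for_device(const char *path, const char *needle)
{
	FILE *fp = fopen(path, "r");
	int count;

	if (!fp)
		return -1;
	count = count_matching_lines(fp, needle);
	fclose(fp);
	return count;
}
```

Calling count_irqs_for_device("/proc/interrupts", name) before and after
applying the isolation profile (and reloading the driver) gives two counts
that can be compared against the number of housekeeping CPUs.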

Performance

- To analyze the performance impact, I targeted the change introduced in
  pci_alloc_irq_vectors() and compared the results against those of a
  vanilla kernel (5.9.0-rc3).

  Setup Information:
  + I had a couple of 24-core machines connected back to back via a couple of
    mlx5 NICs, and I analyzed the average bitrate for server-client TCP and
    UDP transmission via iperf.
  + To minimize the bitrate variation of the iperf TCP and UDP stream tests,
    I applied tuned's network-throughput profile and disabled HT.

  Test Information:
  + For the environment that had no isolated CPUs:
    I tested with a single stream and 24 streams (same as the number of
    online CPUs).
  + For the environment that had 20 isolated CPUs:
    I tested with a single stream, 4 streams (same as the number of
    housekeeping CPUs) and 24 streams (same as the number of online CPUs).

  # UDP Stream Test:
    + No degradation was observed in the UDP stream tests in either
      environment (with or without isolated CPUs) after the introduction
      of the patches.
  # TCP Stream Test - No isolated CPUs:
    + No noticeable degradation was observed.
  # TCP Stream Test - With isolated CPUs:
    + Multiple Stream (4)  - Average degradation of around 5-6%
    + Multiple Stream (24) - Average degradation of around 2-3%
    + Single Stream        - Even on a vanilla kernel the bitrate observed
                             for a TCP single stream test seems to vary
                             significantly across different runs (e.g. the %
                             variation between the best and the worst case on
                             a vanilla kernel was around 8-10%). A similar
                             variation was observed with the kernel that
                             included my patches. No additional degradation
                             was observed.
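
The percentages above can be read as relative differences in average
bitrate. The cover letter does not state the exact formula, so the following
is only an assumed interpretation: degradation is taken relative to the
vanilla-kernel average, and run-to-run variation is taken relative to the
best run. The function names are hypothetical and any numbers fed to them
are illustrative, not measured results.

```c
#include <assert.h>

/*
 * Relative drop of the patched-kernel average bitrate versus the
 * vanilla-kernel average, in percent (assumed definition).
 */
static double degradation_pct(double baseline, double patched)
{
	return (baseline - patched) / baseline * 100.0;
}

/*
 * Spread between the best and the worst run on the same kernel,
 * relative to the best run, in percent (assumed definition).
 */
static double variation_pct(double best, double worst)
{
	return (best - worst) / best * 100.0;
}
```

Under these definitions, a patched-kernel degradation is only meaningful
when it is larger than the variation already seen between runs of the
vanilla kernel, which is why the single-stream case reports no additional
degradation.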

If there are any suggestions for more performance evaluation, I would
be happy to discuss/perform them.

Changes from v3[2]:
- Moved the logic to limit the max_vecs from pci_alloc_irq_vectors() to
  pci_alloc_irq_vectors_affinity() as that's the exported interface and
  drivers using this API also need to be fixed (suggestion from Bjorn Helgaas).

Changes from v2[3]:
- Renamed hk_num_online_cpus() with housekeeping_num_online_cpus() to keep
  the naming convention consistent (based on a suggestion from Peter
  Zijlstra and Frederic Weisbecker).
- Added an argument "enum hk_flags" to the housekeeping_num_online_cpus() API
  to make it more usable in different use-cases (based on a suggestion from 
  Frederic Weisbecker).
- Replaced cpumask_weight(cpu_online_mask) with num_online_cpus() (suggestion
  from Bjorn Helgaas).
- Modified patch commit messages and comments based on Bjorn Helgaas's
  suggestion.

Changes from v1[4]:
- Replaced num_housekeeping_cpus() with hk_num_online_cpus() and started
  using the cpumask corresponding to HK_FLAG_MANAGED_IRQ to derive the number
  of online housekeeping CPUs. This is based on Frederic Weisbecker's
  suggestion.
- Since hk_num_online_cpus() is self-explanatory, got rid of
  the comment that was added previously.
- Added a new patch that is meant to enable managed IRQ isolation for
  nohz_full CPUs. This is based on Frederic Weisbecker's suggestion.
Patch4 (PCI):
- For cases where min_vecs exceeds the number of online housekeeping CPUs,
  instead of skipping the modification to max_vecs, started restricting it
  based on min_vecs. This is based on a suggestion from Marcelo Tosatti.


Nitesh Narayan Lal (4):
  sched/isolation: API to get number of housekeeping CPUs
  sched/isolation: Extend nohz_full to isolate managed IRQs
  i40e: Limit msix vectors to housekeeping CPUs
  PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs

 drivers/net/ethernet/intel/i40e/i40e_main.c |  3 ++-
 drivers/pci/msi.c                           | 18 ++++++++++++++++++
 include/linux/sched/isolation.h             |  9 +++++++++
 kernel/sched/isolation.c                    |  2 +-
 4 files changed, 30 insertions(+), 2 deletions(-)


Thread overview: 55+ messages
2020-09-28 18:35 Nitesh Narayan Lal [this message]
2020-09-28 18:35 ` [PATCH v4 1/4] sched/isolation: API to get number of housekeeping CPUs Nitesh Narayan Lal
2020-09-28 18:35 ` [PATCH v4 2/4] sched/isolation: Extend nohz_full to isolate managed IRQs Nitesh Narayan Lal
2020-10-23 13:25   ` Peter Zijlstra
2020-10-23 13:29     ` Frederic Weisbecker
2020-10-23 13:57       ` Nitesh Narayan Lal
2020-10-23 13:45     ` Nitesh Narayan Lal
2020-09-28 18:35 ` [PATCH v4 3/4] i40e: Limit msix vectors to housekeeping CPUs Nitesh Narayan Lal
2020-09-28 18:35 ` [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() " Nitesh Narayan Lal
2020-09-28 21:59   ` Bjorn Helgaas
2020-09-29 17:46     ` Christoph Hellwig
2020-10-16 12:20   ` Peter Zijlstra
2020-10-18 18:14     ` Nitesh Narayan Lal
2020-10-19 11:11       ` Peter Zijlstra
2020-10-19 14:00         ` Marcelo Tosatti
2020-10-19 14:25           ` Nitesh Narayan Lal
2020-10-20  7:30           ` Peter Zijlstra
2020-10-20 13:00             ` Nitesh Narayan Lal
2020-10-20 13:41               ` Peter Zijlstra
2020-10-20 14:39                 ` Nitesh Narayan Lal
2020-10-22 17:47                   ` Nitesh Narayan Lal
2020-10-23  8:58                     ` Peter Zijlstra
2020-10-23 13:10                       ` Nitesh Narayan Lal
2020-10-23 21:00                         ` Thomas Gleixner
2020-10-26 13:35                           ` Nitesh Narayan Lal
2020-10-26 13:57                             ` Thomas Gleixner
2020-10-26 17:30                           ` Marcelo Tosatti
2020-10-26 19:00                             ` Thomas Gleixner
2020-10-26 19:11                               ` Marcelo Tosatti
2020-10-26 19:21                               ` Jacob Keller
2020-10-26 20:11                                 ` Thomas Gleixner
2020-10-26 21:11                                   ` Jacob Keller
2020-10-26 21:50                                     ` Thomas Gleixner
2020-10-26 22:13                                       ` Jakub Kicinski
2020-10-26 22:46                                         ` Thomas Gleixner
2020-10-26 22:52                                         ` Jacob Keller
2020-10-26 22:22                                       ` Nitesh Narayan Lal
2020-10-26 22:49                                         ` Thomas Gleixner
2020-10-26 23:08                                           ` Jacob Keller
2020-10-27 14:28                                             ` Thomas Gleixner
2020-10-27 11:47                                         ` Marcelo Tosatti
2020-10-27 14:43                                           ` Thomas Gleixner
2020-10-19 14:21         ` Frederic Weisbecker
2020-10-20 14:16   ` Thomas Gleixner
2020-10-20 16:18     ` Nitesh Narayan Lal
2020-10-20 18:07       ` Thomas Gleixner
2020-10-21 20:25         ` Thomas Gleixner
2020-10-21 21:04           ` Nitesh Narayan Lal
2020-10-22  0:02           ` Jakub Kicinski
2020-10-22  0:27             ` Jacob Keller
2020-10-22  8:28             ` Thomas Gleixner
2020-10-22 12:28           ` Marcelo Tosatti
2020-10-22 22:39             ` Thomas Gleixner
2020-10-01 15:49 ` [PATCH v4 0/4] isolation: limit msix vectors " Frederic Weisbecker
2020-10-08 21:40   ` Nitesh Narayan Lal
