All of lore.kernel.org
 help / color / mirror / Atom feed
From: Bjorn Helgaas <helgaas@kernel.org>
To: Leon Romanovsky <leon@kernel.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>,
	Saeed Mahameed <saeedm@nvidia.com>,
	Jason Gunthorpe <jgg@nvidia.com>,
	Alexander Duyck <alexander.duyck@gmail.com>,
	Jakub Kicinski <kuba@kernel.org>,
	linux-pci@vger.kernel.org, linux-rdma@vger.kernel.org,
	netdev@vger.kernel.org, Don Dutile <ddutile@redhat.com>,
	Alex Williamson <alex.williamson@redhat.com>,
	"David S . Miller" <davem@davemloft.net>
Subject: Re: [PATCH mlx5-next v6 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
Date: Tue, 16 Feb 2021 10:12:12 -0600	[thread overview]
Message-ID: <20210216161212.GA805748@bjorn-Precision-5520> (raw)
In-Reply-To: <YCt1WAAEO1hx2pjY@unreal>

Proposed subject:

  PCI/IOV: Add dynamic MSI-X vector assignment sysfs interface

On Tue, Feb 16, 2021 at 09:33:44AM +0200, Leon Romanovsky wrote:
> On Mon, Feb 15, 2021 at 03:01:06PM -0600, Bjorn Helgaas wrote:
> > On Tue, Feb 09, 2021 at 03:34:42PM +0200, Leon Romanovsky wrote:
> > > From: Leon Romanovsky <leonro@nvidia.com>

Here's a draft of the sort of thing I'm looking for here:

  A typical cloud provider SR-IOV use case is to create many VFs for
  use by guest VMs.  The VFs may not be assigned to a VM until a
  customer requests a VM of a certain size, e.g., number of CPUs.  A
  VF may need MSI-X vectors proportional to the number of CPUs in the
  VM, but there is no standard way to change the number of MSI-X
  vectors supported by a VF.

  Some Mellanox ConnectX devices support dynamic assignment of MSI-X
  vectors to SR-IOV VFs.  This can be done by the PF driver after VFs
  are enabled, and it can be done without affecting VFs that are
  already in use.  The hardware supports a limited pool of MSI-X
  vectors that can be assigned to the PF or to individual VFs.  This
  is device-specific behavior that requires support in the PF driver.

  Add a read-only "sriov_vf_total_msix" sysfs file for the PF and a
  writable "sriov_vf_msix_count" file for each VF.  Management
  software may use these to learn how many MSI-X vectors are available
  and to dynamically assign them to VFs before the VFs are passed
  through to a VM.

  If the PF driver implements the ->sriov_get_vf_total_msix()
  callback, "sriov_vf_total_msix" contains the total number of MSI-X
  vectors available for distribution among VFs.

  If no driver is bound to the VF, writing "N" to
  "sriov_vf_msix_count" uses the PF driver ->sriov_set_msix_vec_count()
  callback to assign "N" MSI-X vectors to the VF.  When a VF driver
  subsequently reads the MSI-X Message Control register, it will see
  the new Table Size "N".

> > > Extend PCI sysfs interface with a new callback that allows configuration
> > > of the number of MSI-X vectors for specific SR-IOV VF. This is needed
> > > to optimize the performance of VFs devices by allocating the number of
> > > vectors based on the administrator knowledge of the intended use of the VF.
> > >
> > > This function is applicable for SR-IOV VF because such devices allocate
> > > their MSI-X table before they will run on the VMs and HW can't guess the
> > > right number of vectors, so some devices allocate them statically and equally.
> >
> > This commit log should be clear that this functionality is motivated
> > by *mlx5* behavior.  The description above makes it sound like this is
> > generic PCI spec behavior, and it is not.
> >
> > It may be a reasonable design that conforms to the spec, and we hope
> > the model will be usable by other designs, but it is not required by
> > the spec and AFAIK there is nothing in the spec you can point to as
> > background for this.
> >
> > So don't *remove* the text you have above, but please *add* some
> > preceding background information about how mlx5 works.
> >
> > > 1) The newly added /sys/bus/pci/devices/.../sriov_vf_msix_count
> > > file will be seen for the VFs and it is writable as long as a driver is not
> > > bound to the VF.
> >
> >   This adds /sys/bus/pci/devices/.../sriov_vf_msix_count for VF
> >   devices and is writable ...
> >
> > > The values accepted are:
> > >  * > 0 - this will be number reported by the Table Size in the VF's MSI-X Message
> > >          Control register
> > >  * < 0 - not valid
> > >  * = 0 - will reset to the device default value
> >
> >   = 0 - will reset to a device-specific default value
> >
> > > 2) In order to make management easy, provide new read-only sysfs file that
> > > returns a total number of possible to configure MSI-X vectors.
> >
> >   For PF devices, this adds a read-only
> >   /sys/bus/pci/devices/.../sriov_vf_total_msix file that contains the
> >   total number of MSI-X vectors available for distribution among VFs.
> >
> > Just as in sysfs-bus-pci, this file should be listed first, because
> > you must read it before you can use vf_msix_count.
> 
> No problem, I'll change, just remember that we are talking about commit
> message because in Documentation file, the order is exactly as you request.

Yes, I noticed that, thank you!  It will be good to have them in the
same order in both the commit log and the Documentation file.  I think
it will make more sense to readers.

> > > cat /sys/bus/pci/devices/.../sriov_vf_total_msix
> > >   = 0 - feature is not supported
> > >   > 0 - total number of MSI-X vectors available for distribution among the VFs
> > >
> > > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > > ---
> > >  Documentation/ABI/testing/sysfs-bus-pci |  28 +++++
> > >  drivers/pci/iov.c                       | 153 ++++++++++++++++++++++++
> > >  include/linux/pci.h                     |  12 ++
> > >  3 files changed, 193 insertions(+)
> 
> <...>
> 
> > > + */
> > > +int pci_enable_vf_overlay(struct pci_dev *dev)
> > > +{
> > > +	struct pci_dev *virtfn;
> > > +	int id, ret;
> > > +
> > > +	if (!dev->is_physfn || !dev->sriov->num_VFs)
> > > +		return 0;
> > > +
> > > +	ret = sysfs_create_files(&dev->dev.kobj, sriov_pf_dev_attrs);
> >
> > But I still don't like the fact that we're calling
> > sysfs_create_files() and sysfs_remove_files() directly.  It makes
> > complication and opportunities for errors.
> 
> It is not different from any other code that we have in the kernel.

It *is* different.  There is a general rule that drivers should not
call sysfs_* [1].  The PCI core is arguably not a "driver," but it is
still true that callers of sysfs_create_files() are very special, and
I'd prefer not to add another one.

> Let's be concrete, can you point to the errors in this code that I
> should fix?

I'm not saying there are current errors; I'm saying the additional
code makes errors possible in future code.  For example, we hope that
other drivers can use these sysfs interfaces, and it's possible they
may not call pci_enable_vf_overlay() or pci_disable_vfs_overlay()
correctly.

Or there may be races in device addition/removal.  We have current
issues in this area, e.g., [2], and they're fairly subtle.  I'm not
saying your patches have these issues; only that extra code makes more
chances for mistakes and it's more work to validate it.

> > I don't see the advantage of creating these files only when the PF
> > driver supports this.  The management tools have to deal with
> > sriov_vf_total_msix == 0 and sriov_vf_msix_count == 0 anyway.
> > Having the sysfs files not be present at all might be slightly
> > prettier to the person running "ls", but I'm not sure the code
> > complication is worth that.
> 
> It is more than "ls", right now sriov_numvfs is visible without relation
> to the driver, even if driver doesn't implement ".sriov_configure", which
> IMHO bad. We didn't want to repeat.
> 
> Right now, we have many devices that supports SR-IOV, but small amount
> of them are capable to rewrite their VF MSI-X table siz. We don't want
> "to punish" and clatter their sysfs.

I agree, it's clutter, but at least it's just cosmetic clutter (but
I'm willing to hear discussion about why it's more than cosmetic; see
below).

From the management software point of view, I don't think it matters.
That software already needs to deal with files that don't exist (on
old kernels) and files that contain zero (feature not supported or no
vectors are available).

From my point of view, pci_enable_vf_overlay() or
pci_disable_vfs_overlay() are also clutter, at least compared to
static sysfs attributes.

> > I see a hint that Alex might have requested this "only visible when PF
> > driver supports it" functionality, but I don't see that email on
> > linux-pci, so I missed the background.
> 
> First version of this patch had static files solution.
> https://lore.kernel.org/linux-pci/20210103082440.34994-2-leon@kernel.org/#Z30drivers:pci:iov.c

Thanks for the pointer to the patch.  Can you point me to the
discussion about why we should use the "only visible when PF driver
supports it" model?

Bjorn

[1] https://lore.kernel.org/linux-pci/YBmG7qgIDYIveDfX@kroah.com/
[2] https://lore.kernel.org/linux-pci/20200716110423.xtfyb3n6tn5ixedh@pali/

  reply	other threads:[~2021-02-16 16:13 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-02-09 13:34 [PATCH mlx5-next v6 0/4] Dynamically assign MSI-X vectors count Leon Romanovsky
2021-02-09 13:34 ` [PATCH mlx5-next v6 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs Leon Romanovsky
2021-02-15 21:01   ` Bjorn Helgaas
2021-02-16  7:33     ` Leon Romanovsky
2021-02-16 16:12       ` Bjorn Helgaas [this message]
2021-02-16 19:58         ` Leon Romanovsky
2021-02-17 18:02           ` Bjorn Helgaas
2021-02-17 19:25             ` Jason Gunthorpe
2021-02-17 20:28               ` Bjorn Helgaas
2021-02-17 23:52                 ` Jason Gunthorpe
2021-02-18 10:15             ` Leon Romanovsky
2021-02-18 22:39               ` Bjorn Helgaas
2021-02-19  7:52                 ` Leon Romanovsky
2021-02-19  8:20                   ` Greg Kroah-Hartman
2021-02-19 16:58                     ` Leon Romanovsky
2021-02-20 19:06                     ` Bjorn Helgaas
2021-02-21  6:59                       ` Leon Romanovsky
2021-02-23 21:07                         ` Bjorn Helgaas
2021-02-24  8:09                           ` Greg Kroah-Hartman
2021-02-24 21:37                             ` Don Dutile
2021-02-24  9:53                           ` Leon Romanovsky
2021-02-24 15:07                             ` Bjorn Helgaas
2021-02-21 13:00                       ` Greg Kroah-Hartman
2021-02-21 13:55                         ` Leon Romanovsky
2021-02-21 15:01                           ` Greg Kroah-Hartman
2021-02-21 15:30                             ` Leon Romanovsky
2021-02-16 20:37         ` Jason Gunthorpe
2021-02-09 13:34 ` [PATCH mlx5-next v6 2/4] net/mlx5: Add dynamic MSI-X capabilities bits Leon Romanovsky
2021-02-09 13:34 ` [PATCH mlx5-next v6 3/4] net/mlx5: Dynamically assign MSI-X vectors count Leon Romanovsky
2021-02-09 13:34 ` [PATCH mlx5-next v6 4/4] net/mlx5: Allow to the users to configure number of MSI-X vectors Leon Romanovsky
2021-02-09 21:06 ` [PATCH mlx5-next v6 0/4] Dynamically assign MSI-X vectors count Alexander Duyck
2021-02-14  5:24   ` Leon Romanovsky

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210216161212.GA805748@bjorn-Precision-5520 \
    --to=helgaas@kernel.org \
    --cc=alex.williamson@redhat.com \
    --cc=alexander.duyck@gmail.com \
    --cc=bhelgaas@google.com \
    --cc=davem@davemloft.net \
    --cc=ddutile@redhat.com \
    --cc=jgg@nvidia.com \
    --cc=kuba@kernel.org \
    --cc=leon@kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=saeedm@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.