Re: [PATCH] PCI/IOV: update num_VFs earlier

From: Bjorn Helgaas <helgaas@kernel.org>
To: CREGUT Pierre IMT/OLN <pierre.cregut@orange.com>
Cc: linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] PCI/IOV: update num_VFs earlier
Date: Tue, 1 Oct 2019 18:45:20 -0500	[thread overview]
Message-ID: <20191001234520.GA96866@google.com> (raw)
In-Reply-To: <744273fd-8045-7527-ad29-fa19adf6d015@orange.com>

On Fri, Apr 26, 2019 at 10:11:54AM +0200, CREGUT Pierre IMT/OLN wrote:
> I also initially thought that kobject_uevent generated the netlink event
> but this is not the case. This is generated by the specific driver in use.
> For the Intel i40e driver, this is the call to i40e_do_reset_safe in
> i40e_pci_sriov_configure that sends the event.
> It is followed by i40e_pci_sriov_enable that calls i40e_alloc_vfs that
> finally calls the generic pci_enable_sriov function.

I don't know anything about netlink.  The script from the bugzilla
(https://bugzilla.kernel.org/show_bug.cgi?id=202991) looks like it
runs

  ip monitor dev enp9s0f2

What are the actual netlink events you see?  Are they related to a
device being removed?

When we change num_VFs, I think we have to disable any existing VFs
before enabling the new num_VFs, so if you trigger on a netlink
"remove" event, I wouldn't be surprised that reading sriov_numvfs
would give a zero until the new VFs are enabled.

> So the proposed patch works well for the i40e driver (x710 cards) because
> the update to num_VFs is fast enough to be committed before the event is
> received. It may not work with other cards. The same is true for the zero
> value and there is no guarantee for other cards.
> 
> The clean solution would be to lock the device in sriov_numvfs_show.
> I guess that there are good reasons why locks have been avoided
> in sysfs getter functions so let us explore other approaches.
> 
> We can either return a "not settled" value (-1) or (probably better)
> do not return a value but an error (-EAGAIN returned by the show
> function).
> 
> To distinguish this "not settled" situation we can either:
> * overload the meaning of num_VFs (eg make it negative)
>   but it is an unsigned short.
> * add a bool to pci_sriov struct (rather simple but modifies a well
>   established structure).
> * use the fact that not_settled => device is locked and use
>   mutex_is_locked as an over approximation.
> 
> The later is not perfect but requires minimal changes to
> sriov_numvfs_show:
> 
>      if (mutex_is_locked(&dev->mutex))
>          return -EAGAIN;

I thought this was a good idea, but

  - It does break the device_lock() encapsulation a little bit:
    sriov_numvfs_store() uses device_lock(), which happens to be
    implemented as "mutex_lock(&dev->mutex)", but we really shouldn't
    rely on that implementation, and

  - The netlink events are being generated via the NIC driver, and I'm
    a little hesitant about changing the PCI core to deal with timing
    issues "over there".

> In all cases, the device could be locked or the boolean set just
> after the test. But I don't think there is a case where causality
> would be violated.Thank you in advance for your recommendations.  I will
> update the patch according to your instructions.
> 
> Le 06/04/2019 à 00:33, Bjorn Helgaas a écrit :
> > On Fri, Mar 29, 2019 at 09:00:58AM +0100, Pierre Crégut wrote:
> > > Ensure that iov->num_VFs is set before a netlink message is sent
> > > when the number of VFs is changed. Only the path for num_VFs > 0
> > > is affected. The path for num_VFs = 0 is already correct.
> > > 
> > > Monitoring programs can relie on netlink messages to track interface
> > > change and query their state in /sys. But when sriov_numvfs is set to a
> > > positive value, the netlink message is sent before the value is available
> > > in sysfs. The value read after the message is received is always zero.
> > Thanks, Pierre!  Can you clue me in on where exactly the connection
> > from sriov_enable() to netlink is?
> > 
> > I see one side of the race is with sriov_numvfs_show(), but I don't
> > know where the netlink message is sent.  Is that connected with the
> > kobject_uevent(KOBJ_CHANGE)?
> > 
> > One thing this would help with is figuring out exactly how *much*
> > earlier we need to set iov->num_VFs.  It looks like the current patch
> > sets it before we actually enable the VFs, so a user could read
> > /sys/.../sriov_numvfs and get the wrong value.  Of course, that's
> > unavoidable; the question is whether it's OK to get the new value
> > *before* it actually takes effect, or whether we want to return a
> > stale value until after it takes effect.
> > 
> > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=202991
> > > Signed-off-by: Pierre Crégut <pierre.cregut@orange.com>
> > > ---
> > > note: the behaviour can be tested with the following shell script also
> > > available on the bugzilla (d being the phy device name):
> > > 
> > > ip monitor dev $d | grep --line-buffered "^[0-9]*:" | \
> > > while read line; do cat /sys/class/net/$d/device/sriov_numvfs; done
> > > 
> > >   drivers/pci/iov.c | 3 ++-
> > >   1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> > > index 3aa115ed3a65..a9655c10e87f 100644
> > > --- a/drivers/pci/iov.c
> > > +++ b/drivers/pci/iov.c
> > > @@ -351,6 +351,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
> > >   		goto err_pcibios;
> > >   	}
> > > +	iov->num_VFs = nr_virtfn;
> > >   	pci_iov_set_numvfs(dev, nr_virtfn);
> > >   	iov->ctrl |= PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE;
> > >   	pci_cfg_access_lock(dev);
> > > @@ -363,7 +364,6 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
> > >   		goto err_pcibios;
> > >   	kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE);
> > > -	iov->num_VFs = nr_virtfn;
> > >   	return 0;
> > > @@ -379,6 +379,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
> > >   	if (iov->link != dev->devfn)
> > >   		sysfs_remove_link(&dev->dev.kobj, "dep_link");
> > > +	iov->num_VFs = 0;
> > >   	pci_iov_set_numvfs(dev, 0);
> > >   	return rc;
> > >   }
> > > -- 
> > > 2.17.1
> > >