Re: [PATCH] vfio/pci: Support error recovery

From: Cao jin <caoj.fnst@cn.fujitsu.com>
To: Alex Williamson <alex.williamson@redhat.com>, <mst@redhat.com>
Cc: <linux-kernel@vger.kernel.org>, <kvm@vger.kernel.org>,
	<izumi.taku@jp.fujitsu.com>
Subject: Re: [PATCH] vfio/pci: Support error recovery
Date: Wed, 14 Dec 2016 18:24:23 +0800
Message-ID: <58511DD7.8040508@cn.fujitsu.com>
In-Reply-To: <20161212121216.1c385d65@t450s.home>

Sorry for the late reply.
After reading all your comments, I think I will try solution 1.

On 12/13/2016 03:12 AM, Alex Williamson wrote:
> On Mon, 12 Dec 2016 21:49:01 +0800
> Cao jin <caoj.fnst@cn.fujitsu.com> wrote:
> 
>> Hi,
>> I have 2 solutions(high level design) came to me, please see if they are
>> acceptable, or which one is acceptable. Also have some questions.
>>
>> 1. block guest access during host recovery
>>
>>    add new field error_recovering in struct vfio_pci_device to
>>    indicate host recovery status. aer driver in host will still do
>>    reset link
>>
>>    - set error_recovering in vfio-pci driver's error_detected, used to
>>      block all kinds of user access(config space, mmio)
>>    - in order to solve concurrent issue of device resetting & user
>>      access, check device state[*] in vfio-pci driver's resume, see if
>>      device reset is done, if it is, then clear"error_recovering", or
>>      else new a timer, check device state periodically until device
>>      reset is done. (what if device reset don't end for a long time?)
>>    - In qemu, translate guest link reset to host link reset.
>>      A question here: we already have link reset in host, is a second
>>      link reset necessary? why?
>>  
>>    [*] how to check device state: reading certain config space
>>        register, check return value is valid or not(All F's)
> 
> Isn't this exactly the path we were on previously?

Yes, it is basically the previous path, plus the optimization.
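
To make sure we mean the same thing, here is a rough sketch of the
blocking logic in solution 1. It is a plain user-space simulation, not
kernel code: only the error_recovering field comes from the proposal
above, while struct sketch_vdev, the sketch_* helpers and the -EBUSY
style return value are hypothetical names chosen for illustration. The
real driver would read config space through the PCI core rather than a
simulated register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct vfio_pci_device. Only error_recovering
 * is taken from the proposal; cfg_dword0 simulates a config register. */
struct sketch_vdev {
	bool error_recovering;	/* set in error_detected, cleared after reset */
	uint32_t cfg_dword0;	/* simulated vendor/device ID dword */
};

#define SKETCH_ALL_ONES	0xffffffffu
#define SKETCH_EBUSY	(-16)	/* mirrors -EBUSY; the errno choice is an assumption */

/* A config read that returns all F's means the device is not responding,
 * i.e. the reset has not finished yet. */
bool sketch_device_ready(const struct sketch_vdev *vdev)
{
	assert(vdev != NULL);
	return vdev->cfg_dword0 != SKETCH_ALL_ONES;
}

/* What the error_detected callback would do: start blocking user access. */
void sketch_error_detected(struct sketch_vdev *vdev)
{
	assert(vdev != NULL);
	vdev->error_recovering = true;
}

/* What resume (or the timer it arms) would do. Returns true once access
 * is unblocked, false if the caller should check again later. */
bool sketch_resume_check(struct sketch_vdev *vdev)
{
	if (!sketch_device_ready(vdev))
		return false;
	vdev->error_recovering = false;
	return true;
}

/* Gate in front of user config space / MMIO reads. */
int sketch_user_read(const struct sketch_vdev *vdev, uint32_t *val)
{
	assert(vdev != NULL && val != NULL);
	if (vdev->error_recovering)
		return SKETCH_EBUSY;
	*val = vdev->cfg_dword0;
	return 0;
}
```

The sketch does not bound how long the periodic check may run, so the
open question about a reset that never completes still applies.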

> There might be an
> optimization that we could skip back-to-back resets, but how can you
> necessarily infer that the resets are for the same thing? If the user
> accesses the device between resets, can you still guarantee the guest
> directed reset is unnecessary?  If time passes between resets, do you
> know they're for the same event?  How much time can pass between the
> host and guest reset to know they're for the same event?  In the
> process of error handling, which is more important, speed or
> correctness?
>  

I think the vfio driver itself won't know what each reset is for, and I
don't quite understand why vfio should care about this question. Is this
a new question in the design?

But I agree that user access between the 2 resets may cause trouble for
guest recovery; a misbehaving user could do things beyond our
imagination.  Correctness is more important.

If I understand you right, let me summarize: host recovery only does a
link reset, which is incomplete, so we had better do a complete guest
recovery for correctness.

>> 2. skip link reset in aer driver of host kernel, for vfio-pci.
>>    Let user decide how to do serious recovery
>>
>>    add new field "user_driver" in struct pci_dev, used to skip link
>>    reset for vfio-pci; add new field "link_reset" in struct
>>    vfio_pci_device to indicate link has been reset or not during
>>    recovery
>>
>>    - set user_driver in vfio_pci_probe(), to skip link reset for
>>      vfio-pci in host.
>>    - (use a flag)block user access(config, mmio) during host recovery
>>      (not sure if this step is necessary)
>>    - In qemu, translate guest link reset to host link reset.
>>    - In vfio-pci driver, set link_reset after VFIO_DEVICE_PCI_HOT_RESET
>>      is executed
>>    - In vfio-pci driver's resume, new a timer, check "link_reset" field
>>      periodically, if it is set in reasonable time, then clear it and
>>      delete timer, or else, vfio-pci driver will does the link reset!
> 
> What happens in the case of a multifunction device where each function
> is part of a separate IOMMU group and one function is hot-removed from
> the user? We can't do a link reset on that function since the other
> function is still in use.  We have no choice but release a device in an
> unknown state back to the host.

By hot-remove from the user, do you mean, for example, that all functions
are assigned to a VM, and then suddenly someone does something like the
following

$ echo 0000:06:00.0 > /sys/bus/pci/drivers/vfio-pci/unbind

$ echo 0000:06:00.0 > /sys/bus/pci/drivers/igb/bind

to return the device to a host driver, or doesn't bind it to a host
driver and leaves it in a driver-less state?

>  As previously discussed, we don't
> expect that any sort of function-level FLR will necessarily reset the
> device to the same state.  I also don't really like vfio-pci taking
> over error handling capabilities from the PCI-core.  That's redundant
> code and extra maintenance overhead.
>  

I understand the concern, so I suppose solution 1 is preferred.

-- 
Sincerely,
Cao jin

>> A quick question:
>> I don't know how devices is divided into iommu groups, is it possible
>> for functions in a multi-function device to be split into different groups?
> 
> Yes, if a multifunction device supports ACS or if we have quirks to
> expose that the functions do not perform internal peer-to-peer, then
> they may be in separate IOMMU groups, depending on the rest of the PCI
> topology.  See:
> 
> http://vfio.blogspot.com/2014/08/iommu-groups-inside-and-out.html
> 
> Thanks,
> Alex