Re: [PATCH] vfio/pci: Support error recovery

From: Cao jin <caoj.fnst@cn.fujitsu.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>,
	<linux-kernel@vger.kernel.org>, <kvm@vger.kernel.org>,
	<izumi.taku@jp.fujitsu.com>
Subject: Re: [PATCH] vfio/pci: Support error recovery
Date: Wed, 7 Dec 2016 10:49:39 +0800	[thread overview]
Message-ID: <584778C3.8050302@cn.fujitsu.com> (raw)
In-Reply-To: <20161206083556.23be6ee5@t450s.home>

On 12/06/2016 11:35 PM, Alex Williamson wrote:
> On Tue, 6 Dec 2016 18:46:04 +0800
> Cao jin <caoj.fnst@cn.fujitsu.com> wrote:
> 
>> On 12/06/2016 12:59 PM, Alex Williamson wrote:
>>> On Tue, 6 Dec 2016 05:55:28 +0200
>>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>>   
>>>> On Mon, Dec 05, 2016 at 09:17:30AM -0700, Alex Williamson wrote:  
>>>>> If you're going to take the lead for these AER patches, I would
>>>>> certainly suggest that understanding the reasoning behind the bus reset
>>>>> behavior is a central aspect to this series.  This effort has dragged
>>>>> out for nearly two years and I apologize, but I don't really have a lot
>>>>> of patience for rehashing some of these issues if you're not going to
>>>>> read the previous discussions or consult with your colleagues to
>>>>> understand how we got to this point.  If you want to challenge some of
>>>>> the design points, that's great, it could use some new eyes, but please
>>>>> understand how we got here first.    
>>>>
>>>> Well I'm guessing Cao jin here isn't the only one not
>>>> willing to plough through all historical versions of the patchset
>>>> just to figure out the motivation for some code.
>>>>
>>>> Including a summary of a high level architecture couldn't hurt.
>>>>
>>>> Any chance of writing such?  Alternatively, we can try to build it as
>>>> part of this thread.  Shouldn't be hard as it seems somewhat
>>>> straight-forward on the surface:
>>>>
>>>> - detect link error on the host, don't reset link as we would normally do  
>>>
>>> This is actually a new approach that I'm not sure I agree with.  By
>>> skipping the host directed link reset, vfio is taking responsibility
>>> for doing this, but then we just assume the user will do it.  I have
>>> issues with this.
>>>
>>> The previous approach was to use the error detected notifier to block
>>> access to the device, allowing the host to perform the link reset.  A
>>> subsequent notification in the AER process released the user access
>>> which allowed the user AER process to proceed.  This did result in both
>>> a host directed and a guest directed link reset, but other than
>>> coordinating the blocking of the user process during host reset, that
>>> hasn't been brought up as an issue previously.
>>>   
>>
>> Tests on previous versions didn't bring up issues as I find, I think
>> that is because we didn't test it completely. As I know, before August
>> of this year, we didn't have cable connected to NIC, let alone
>> connecting NIC to gateway.
> 
> Lack of testing has been a significant issue throughout the development
> of this series.
> 
>> Even if I fixed the guest oops issue in igb driver that Alex found in
>> v9, v9 still cannot work in my test. And in my test, disable link
>> reset(in host) in aer core for vfio-pci is the most significant step to
>> get my test passed.
> 
> But is it the correct step?  I'm not convinced.  Why did blocking guest
> access not work?  How do you plan to manage vfio taking the
> responsibility to perform a bus reset when you don't know whether QEMU
> is the user of the device or whether the user supports AER recovery?
>  
>>>> - report link error to guest
>>>> - detect link reset request from guest
>>>> - reset link on host
>>>>
>>>> Since link reset will reset all devices behind it, for this to work we
>>>> need same set of devices behind the link in host and guest.  Enforcing
>>>> this would be nice to have.  
>>>
>>> This is a pretty significant complication and I think it's a
>>> requirement.  This is the heart of why we have an AER vfio-pci device
>>> option and why we require that QEMU should fail to initialize the
>>> device if AER is enabled in an incompatible configuration.  If a user
>>> cannot be sure that AER is enabled on a device, it's pointless to
>>> bother implementing it, though honestly I question the value of it in
>>> the VM altogether given configuration requirements (ie. are users
>>> going to accept the reason that all the ports of a device need to be
>>> assigned to a single guest for guest-based AER recovery when they were
>>> previously able to assign each port to a separate VM?).
>>>    
>>>> - as link now might end up in bad state, reset
>>>>   it when device is unassigned  
>>>
>>> This is also a new aspect for the approach here, previously we allowed
>>> the host directed link reset so we could assume the device was
>>> sufficiently recovered.  In the proposal here, the AER core skips any
>>> devices bound to vfio-pci, but vfio can't guarantee that we can do a
>>> bus reset on them.  PCI bus isolation is not accounted for in DMA
>>> isolation, which is the basis for IOMMU groups.  A bus can host
>>> multiple IOMMU groups, each of which may have a different user.  Only
>>> in a very specific configuration can vfio do a bus reset.
>>>    
>>>> Any details I missed?  
>>>
>>> AIUI, the critical feature is that the guest needs to be able to reset
>>> the device link, all the other design elements play out from the
>>> logical expansion of that feature.  It means that a guest bus reset
>>> needs to translate to a host bus reset, which means that all of the
>>> affected host devices need to be accounted for and those that are
>>> assigned to the guest need to be affected in the guest in the same
>>> way.  QEMU must enforce this configuration or else a user cannot know
>>> the result of a AER fault, ie. will it cause a VMSTOP condition or is
>>> the fault forwarded to the guest.  The required configuration
>>> restrictions are quite involved, therefore we can't simply require this
>>> of all configurations, so a vfio-pci device option is added.  The
>>> coordination of the host directed reset with the guest directed reset
>>> was only a complications discovered within the last few revisions of
>>> the series.  As noted above, the previous solution to this was to
>>> attempt to block access to the device while the host reset proceeds.
>>>
>>> Clearly it's a little disconcerting if we throw all of that away and
>>> simply assume that an FLR is sufficient to reset the device when it
>>> seems like link issues might be a nontrivial source of AER faults.  If
>>> FLR is sufficient, then why does the core AER handling code in the
>>> kernel not simply do this?  Thanks,
>>>   
>>
>> I agree with the points here.  Now I understand why  translate a guest
>> link reset to host link reset[*], and FLR shouldn't be equivalent to
>> link reset, but the PITY is, we can't trigger a real fatal error to see
>> if a FLR is sufficient to reset the device.
> 
> In this case, it's not a matter of finding a scenario where it works.
> Using FLR to recover from a fatal error would need to be supported by
> verbiage in a specification.  I'm not interested in cases where it
> happens to work on single device for a certain type of error.  The link
> reset is the bare metal mechanism for recovering from a fatal error and
> it should be our mechanism as well unless there's spec wording to
> support another approach.  Thanks,
> 

Ok, I understand your points here, will drag that dropped patch back and
test.

-- 
Sincerely,
Cao jin