Re: [PATCH] vfio/pci: Support error recovery

From: Cao jin <caoj.fnst@cn.fujitsu.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>,
	<linux-kernel@vger.kernel.org>, <kvm@vger.kernel.org>,
	<izumi.taku@jp.fujitsu.com>
Subject: Re: [PATCH] vfio/pci: Support error recovery
Date: Fri, 9 Dec 2016 11:40:50 +0800	[thread overview]
Message-ID: <584A27C2.6030407@cn.fujitsu.com> (raw)
In-Reply-To: <58497263.7080500@cn.fujitsu.com>

On 12/08/2016 10:46 PM, Cao jin wrote:
> 
> 
> On 12/06/2016 11:35 PM, Alex Williamson wrote:
>> On Tue, 6 Dec 2016 18:46:04 +0800
>> Cao jin <caoj.fnst@cn.fujitsu.com> wrote:
>>
>>> On 12/06/2016 12:59 PM, Alex Williamson wrote:
>>>> On Tue, 6 Dec 2016 05:55:28 +0200
>>>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>>>   
>>>>> On Mon, Dec 05, 2016 at 09:17:30AM -0700, Alex Williamson wrote:  
>>>>>> If you're going to take the lead for these AER patches, I would
>>>>>> certainly suggest that understanding the reasoning behind the bus reset
>>>>>> behavior is a central aspect to this series.  This effort has dragged
>>>>>> out for nearly two years and I apologize, but I don't really have a lot
>>>>>> of patience for rehashing some of these issues if you're not going to
>>>>>> read the previous discussions or consult with your colleagues to
>>>>>> understand how we got to this point.  If you want to challenge some of
>>>>>> the design points, that's great, it could use some new eyes, but please
>>>>>> understand how we got here first.    
>>>>>
>>>>> Well I'm guessing Cao jin here isn't the only one not
>>>>> willing to plough through all historical versions of the patchset
>>>>> just to figure out the motivation for some code.
>>>>>
>>>>> Including a summary of a high level architecture couldn't hurt.
>>>>>
>>>>> Any chance of writing such?  Alternatively, we can try to build it as
>>>>> part of this thread.  Shouldn't be hard as it seems somewhat
>>>>> straight-forward on the surface:
>>>>>
>>>>> - detect link error on the host, don't reset link as we would normally do  
>>>>
>>>> This is actually a new approach that I'm not sure I agree with.  By
>>>> skipping the host directed link reset, vfio is taking responsibility
>>>> for doing this, but then we just assume the user will do it.  I have
>>>> issues with this.
>>>>
>>>> The previous approach was to use the error detected notifier to block
>>>> access to the device, allowing the host to perform the link reset.  A
>>>> subsequent notification in the AER process released the user access
>>>> which allowed the user AER process to proceed.  This did result in both
>>>> a host directed and a guest directed link reset, but other than
>>>> coordinating the blocking of the user process during host reset, that
>>>> hasn't been brought up as an issue previously.
>>>>   
>>>
>>> Tests on previous versions didn't bring up issues as I find, I think
>>> that is because we didn't test it completely. As I know, before August
>>> of this year, we didn't have cable connected to NIC, let alone
>>> connecting NIC to gateway.
>>
>> Lack of testing has been a significant issue throughout the development
>> of this series.
>>
>>> Even if I fixed the guest oops issue in igb driver that Alex found in
>>> v9, v9 still cannot work in my test. And in my test, disable link
>>> reset(in host) in aer core for vfio-pci is the most significant step to
>>> get my test passed.
>>
>> But is it the correct step?  I'm not convinced.  Why did blocking guest
>> access not work?  How do you plan to manage vfio taking the
>> responsibility to perform a bus reset when you don't know whether QEMU
>> is the user of the device or whether the user supports AER recovery?
>>  
> 
> Maybe currently we don't have enough proof to prove the correctness, but
> I think I did find some facts to prove that link reset in host is a big
> trouble, and can answer part of questions above.
> 
> 1st, some thoughts:
> In pci-error-recovery.txt and do_recovery() of kernel tree, we can see,
> a recovery consists of several steps(callbacks), link reset is one of
> them, and except link reset, the others are seems kind of device
> specific. In our case, both host & guest will do recovery, I think the
> host recovery actually is some kind of fake recovery, see vfio-pci
> driver's error_detected & resume callback, they don't do anything
> special, mainly signal error to user, but the link reset in host "fake
> reset" does some serious work, in other words, I think host does the
> recovery incompletely, so I was thinking, why not just drop incompletely
> host recovery(drop link reset) for vfio-pci, and let the guest take care
> of the whole serious recovery.  This is part of the reason of why my
> version looks like this.  But yes, I admit the issue Alex mentioned,
> vfio can't guarantee that user will do a bus reset, this is an issue I
> will keep looking for a solution.
> 
> 2nd, some facts and analyzation from test:
> In host, the relationship between time and behviour in each component
> roughly looks as following:
> 
>      +   HW    +  host kernel   +     qemu      + guest kernel  +
>      |         |(error recovery)|               |               |
>      |         |                |               |               |
>      |         | vfio-pci's     |               |               |
>      |         | error_detected |               |               |
>      |         | +              |               |               |
>      |         | |              |               |               |
>      |         | | error notify |               |               |
>      |         | | via eventfd  |               |               |
>      |         | +---------------> +----------+ |               |
>      |         |                |  +vfio_err_ | |               |
>      |         |                |  |notifier_ | |               |
>      | +---- +<---+link reset   |  |handler   | |               |
>      | | HW  | |                |  |          | |               |       
>   | |     | |                |             | |               |
>      | |r    | | vfio-pci's     |  |pass aer  | |               |
>      | |e..  | | resume         |  |to guest  | |               |
>      | |s.   | | (blocking end) |  |          | |               |
>      | |e    | |                |  |   *2*    | |               |
>      | |t    | |                |  +----+-----+ |               |
>      | |i    | |                |       |       |               |
>      | |n    | |                |       +--------> +----------+ |
>      | |g    | |                |               |  | guest    | |
>      | |     | |                |               |  | recovery | |
>      | |     | |                |               |  | process  | |
>      | |     | |                |               |  |(include  | |
>      | |     | |                |               |  |register  | |
>      | | *1* | |                |               |  |access)   | |
>      | |     | |                |               |  |          | |
>      | |     | |                |               |  |   *3*    | |
>      |         |                |               |  +----------+ |
> Time |         |                |               |               |
>      |         |                |               |               |
>      |         |                |               |               |
>      |         |                |               |               |
>      v         |                |               |               |
> 
> Now let me try to answer: Why did blocking guest access not work?
> Some important factor:
> 1. host recovery doesn't do anything special except error notifying, so
> it may be executed very fast.
> 2. Hardware resetting time is not sure, from pcie spec 6.6.1, guessing
> it need many ms, pretty long?
> 
> some facts found in v9(block config write, not read, during host
> recovery) test:
> 1. reading uncor error register in vfio_err_notifier_handler sometimes
> returns correct value, sometimes return invalid value(All F's)
> 

I forget another facts:
2. the igb bug[*] our patch v9 triggered also is the same kind things.
error_detected() of igb in guest read config space, and return invalid
all F's, then igb NULLed its hardware address, then oops shows.

I think it is also a concurrent issue as described below.

[*] http://patchwork.ozlabs.org/patch/692171

> So, I am thinking, if host blocking on host end early, and *2*, *3* is
> parallel with *1*, the way used in v9 to blocking guest access, may not
> work.
> 

-- 
Sincerely,
Cao jin