From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1758325AbcLOOuK (ORCPT <rfc822;w@1wt.eu>);
        Thu, 15 Dec 2016 09:50:10 -0500
Received: from mx1.redhat.com ([209.132.183.28]:51594 "EHLO mx1.redhat.com"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S1757854AbcLOOuI (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
        Thu, 15 Dec 2016 09:50:08 -0500
Date: Thu, 15 Dec 2016 16:50:07 +0200
From: "Michael S. Tsirkin" <mst@redhat.com>
To: Cao jin <caoj.fnst@cn.fujitsu.com>
Cc: Alex Williamson <alex.williamson@redhat.com>,
        linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
        izumi.taku@jp.fujitsu.com
Subject: Re: [PATCH] vfio/pci: Support error recovery
Message-ID: <20161215161750-mutt-send-email-mst@kernel.org>
References: <1480246457-10368-1-git-send-email-caoj.fnst@cn.fujitsu.com>
 <584EAACD.9070800@cn.fujitsu.com>
 <20161212121216.1c385d65@t450s.home>
 <58511DD7.8040508@cn.fujitsu.com>
 <20161214151637.6b2c6ddd@t450s.home>
 <5852A119.8050902@cn.fujitsu.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <5852A119.8050902@cn.fujitsu.com>
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.30]); Thu, 15 Dec 2016 14:50:08 +0000 (UTC)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Dec 15, 2016 at 09:56:41PM +0800, Cao jin wrote:
> 
> 
> On 12/15/2016 06:16 AM, Alex Williamson wrote:
> > On Wed, 14 Dec 2016 18:24:23 +0800
> > Cao jin <caoj.fnst@cn.fujitsu.com> wrote:
> > 
> >> Sorry for late.
> >> after reading all your comments, I think I will try the solution 1.
> >>
> >> On 12/13/2016 03:12 AM, Alex Williamson wrote:
> >>> On Mon, 12 Dec 2016 21:49:01 +0800
> >>> Cao jin <caoj.fnst@cn.fujitsu.com> wrote:
> >>>   
> >>>> Hi,
> >>>> I have 2 solutions(high level design) came to me, please see if they are
> >>>> acceptable, or which one is acceptable. Also have some questions.
> >>>>
> >>>> 1. block guest access during host recovery
> >>>>
> >>>>    add new field error_recovering in struct vfio_pci_device to
> >>>>    indicate host recovery status. aer driver in host will still do
> >>>>    reset link
> >>>>
> >>>>    - set error_recovering in vfio-pci driver's error_detected, used to
> >>>>      block all kinds of user access(config space, mmio)
> >>>>    - in order to solve concurrent issue of device resetting & user
> >>>>      access, check device state[*] in vfio-pci driver's resume, see if
> >>>>      device reset is done, if it is, then clear"error_recovering", or
> >>>>      else new a timer, check device state periodically until device
> >>>>      reset is done. (what if device reset don't end for a long time?)
> >>>>    - In qemu, translate guest link reset to host link reset.
> >>>>      A question here: we already have link reset in host, is a second
> >>>>      link reset necessary? why?
> >>>>  
> >>>>    [*] how to check device state: reading certain config space
> >>>>        register, check return value is valid or not(All F's)  
> >>>
> >>> Isn't this exactly the path we were on previously?  
> >>
> >> Yes, it is basically the previous path, plus the optimization.
> >>
> >>> There might be an
> >>> optimization that we could skip back-to-back resets, but how can you
> >>> necessarily infer that the resets are for the same thing? If the user
> >>> accesses the device between resets, can you still guarantee the guest
> >>> directed reset is unnecessary?  If time passes between resets, do you
> >>> know they're for the same event?  How much time can pass between the
> >>> host and guest reset to know they're for the same event?  In the
> >>> process of error handling, which is more important, speed or
> >>> correctness?
> >>>    
> >>
> >> I think vfio driver itself won't know what each reset comes for, and I
> >> don't quite understand why should vfio care this question, is this a new
> >> question in the design?
> > 
> > You're suggesting an optimization to eliminate one of the resets,
> > and as we've discussed, I don't see removing the host induced reset
> > as a viable option.  That means you want to eliminate the guest
> > directed reset.  There are potentially three levels to do that, the
> > vfio-pci driver in the host kernel, somewhere in QEMU, or eliminate it
> > within the guest.  My comments were directed to the first option, the
> > host kernel level cannot correlate user directed resets as duplicates
> > of host directed resets.  
> >  
> 
> Ah, maybe it is mistake, I don't really want to eliminate guest directed
> reset very much, I was just not sure why it is very necessary.
> 
> The optimization I said just is fully separating host recovery from
> guest recovery(timer, check device periodically) in time, because there
> is concurrent device resetting & user access.
> 
> >> But I think it make sense that the user access during 2 resets maybe a
> >> trouble for guest recovery, misbehaved user could be out of our
> >> imagination.  Correctness is more important.
> >>
> >> If I understand you right, let me make a summary: host recovery just
> >> does link reset, which is incomplete, so we'd better do a complete guest
> >> recovery for correctness.
> > 
> > We don't know whether the host link reset is incomplete, but we can't do
> > a link reset transparently to the device, the device is no longer in the
> > same state after the reset.  The device specific driver, which exists
> > in userspace needs to be involved in device recovery.  Therefore
> > regardless of how QEMU handles the error, the driver within the guest
> > needs to be notified and perform recovery.  Since the device is PCI and
> > we're on x86 and nobody wants to introduce paravirtual error recovery,
> > we must use AER.  Part of AER recovery includes the possibility of
> > performing a link reset.  So it seems this eliminates avoiding the link
> > reset within the guest.
> > 
> > That leaves QEMU.  Here we need to decide whether a guest triggered
> > link reset induces a host link reset.  The current working theory is
> > that yes, this must be the case.  If there is ever a case where a
> > driver within the guest could trigger a link reset for the purposes
> > of error recovery when the host has not, I think this must be the
> > case.  Therefore, at least some guest induced link resets must become
> > host link resets.  Currently we assume all guest induced link resets
> > become host link resets.  Minimally to avoid that, QEMU would need to
> > know (not assume) whether the host performed a link reset.  Even with
> > that, QEMU would need to be able to correlate that a link reset from
> > the guest is a duplicate of a link reset that was already performed by
> > the host.  That implies that QEMU needs to deduce the intention of
> > the guest.  That seems like a complicated task for a patch series that
> > is already complicated enough, especially for a feature of questionable
> > value given the configuration restrictions (imo).
> > 
> > I would much rather focus on getting it right and making it as simple
> > as we can, even if that means links get reset one too many times on
> > error.
> > 
> 
> Thanks very much for your detailed explanation, it does helps me to
> understand your concern, understand why a second link reset is necessary.
> 
> I still want to share my thoughts with you(not argue): now we know host
> aer driver will do link reset for vfio-pci first, so I can say, even if
> fatal error is link related, after host link reset, link can work now.
> Then in qemu, we are not necessary to translate guest link reset to host
> link reset, just use vfio_pci_reset() as it is to do device
> reset(probably is FLR). Which also means we don't need following
> patch(make code easier):
> 
> @@ -3120,6 +3122,18 @@ static void vfio_pci_reset(DeviceState *dev)
> 
>       trace_vfio_pci_reset(vdev->vbasedev.name);
> 
> +     if (vdev->features & VFIO_FEATURE_ENABLE_AER) {
> +         PCIDevice *br = pci_bridge_get_device(pdev->bus);
> +
> +         if ((pci_get_word(br->config + PCI_BRIDGE_CONTROL) &
> +              PCI_BRIDGE_CTL_BUS_RESET)) {
> +             if (pci_get_function_0(pdev) == pdev) {
> +                 vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
> +             }
> +             return;
> +         }
> +     }
> +
>       vfio_pci_pre_reset(vdev);
> 
> 
> I think this also implies: we have a virtual link in qemu, but a virtual
> link will never be broken like a physical link.(In particular we already
> know host aer driver surely will do link reset to recover physical
> link). So, guest's link reset don't need to care whether virtual link is
> reset, just care virtual device.  And qemu "translates guest link reset
> to host link reset" seems kind of taking link-reset responsibility over
> from host:)
> 
> >>>> 2. skip link reset in aer driver of host kernel, for vfio-pci.
> >>>>    Let user decide how to do serious recovery
> >>>>
> >>>>    add new field "user_driver" in struct pci_dev, used to skip link
> >>>>    reset for vfio-pci; add new field "link_reset" in struct
> >>>>    vfio_pci_device to indicate link has been reset or not during
> >>>>    recovery
> >>>>
> >>>>    - set user_driver in vfio_pci_probe(), to skip link reset for
> >>>>      vfio-pci in host.
> >>>>    - (use a flag)block user access(config, mmio) during host recovery
> >>>>      (not sure if this step is necessary)
> >>>>    - In qemu, translate guest link reset to host link reset.
> >>>>    - In vfio-pci driver, set link_reset after VFIO_DEVICE_PCI_HOT_RESET
> >>>>      is executed
> >>>>    - In vfio-pci driver's resume, new a timer, check "link_reset" field
> >>>>      periodically, if it is set in reasonable time, then clear it and
> >>>>      delete timer, or else, vfio-pci driver will does the link reset!  
> >>>
> >>> What happens in the case of a multifunction device where each function
> >>> is part of a separate IOMMU group and one function is hot-removed from
> >>> the user? We can't do a link reset on that function since the other
> >>> function is still in use.  We have no choice but release a device in an
> >>> unknown state back to the host.  
> >>
> >> hot-remove from user, do you mean, for example, all functions assigned
> >> to VM, then suddenly a person does something like following
> >>
> >> $ echo 0000:06:00.0 > /sys/bus/pci/drivers/vfio-pci/unbind
> >>
> >> $ echo 0000:06:00.0 > /sys/bus/pci/drivers/igb/bind
> >>
> >> to return device to host driver, or don't bind it to host driver, let it
> >> in driver-less state???
> > 
> > Yes, the host kernel has no visiblity to how a user is making use of
> > devices.  To support AER we require a similar topology between host and
> > guest such that a guest link reset translates to a host reset.  That
> > requirement is imposed by userspace, ie. QEMU.  The host kernel cannot
> > presume that this is the case.  Therefore we could have a
> > multi-function device where each function is assigned to the same or
> > different users in any configuration.  If a fault occurs and we defer
> > to the user to perform the link reset, we have absolutely no guarantee
> > that it will ever occur.  If the functions are assigned to different
> > users, then each user individually doesn't have the capability to
> > perform a link reset.  If the devices happen to be assigned to a single
> > user when the error occurs, we cannot assume the user has an AER
> > compatible configuration, the devices could be exposed as separate
> > single function devices, any one of which might be individually removed
> > from the user and made use of by the host, such as your sysfs example
> > above.  The host cannot perform a link reset in this case either
> > as the sibling devices are still in use by the guest.  Thanks,
> > 
> > Alex
> > 
> > 
> 
> this explanation is valuable to me, so this is also why we can't do link
> reset in vfio driver when one of the function is closed. And do link
> reset in vfio driver until all functions are close is poor solution and
> very complex(quarantine the device) as you said.
> 
> I am going to try solution 1, but I still have some consideration share
> with you, this won't stop my trial, and don't have relationship with
> above discussion, just FYI:
> 
> In non-virtuallization environment, from device's perspective, the steps
> of a normal recovery consists of:
>     error_detect
>     mmio_enabled
>     link_reset
>     slot_reset
>     resume
> 
> Now in our condition, the steps become:
>     *link_reset* (host's, the following are guest's)
>     error_detect
>     mmio_enabled
>     link_reset
>     slot_reset
>     resume
> 
> Especially, some device's specific driver in guest could do some
> specific work in error_detect, take igb_io_error_detected() for example.
> Like the words in pci-error-recovery.txt said:
> 
> it gives the driver a chance to cleanup, waiting for pending stuff
> (timers, whatever, etc...) to complete;
> 
> But if link_reset is the first step, we lost all the status(register
> value, etc) in the device. Of course I don't know if this will be a
> problem (might not), just curious if this has been your concern:)

You'll find I did mention it :)

But consider Documentation/PCI/pcieaer-howto.txt

	3.2.2.2 Non-correctable (non-fatal and fatal) errors

	If an error message indicates a non-fatal error, performing link reset
	at upstream is not required. The AER driver calls error_detected(dev,
	pci_channel_io_normal) to all drivers associated within a hierarchy in
	question. for example,
	EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort.
	If Upstream port A captures an AER error, the hierarchy consists of
	Downstream port B and EndPoint.

	A driver may return PCI_ERS_RESULT_CAN_RECOVER,
	PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on
	whether it can recover or the AER driver calls mmio_enabled as next.

	If an error message indicates a fatal error, kernel will broadcast
	error_detected(dev, pci_channel_io_frozen) to all drivers within
	a hierarchy in question. Then, performing link reset at upstream is
	necessary.

I think that if you just forward errors to guests they will get confused.
I see three possible approaches.


1. Always pretend to guest that there was a fatal error,
  then basically:

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index dce511f..4022f9b 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1299,7 +1299,7 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
 
 	vfio_device_put(device);
 
-	return PCI_ERS_RESULT_CAN_RECOVER;
+	return PCI_ERS_RESULT_DISCONNECT;
 }
 
 static const struct pci_error_handlers vfio_err_handlers = {


probably conditional on userspace invoking some ioctl
to avoid breaking existing users.

2. send non fatal error to guest.
Add another eventfd to distinguish non fatal and fatal errors.

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index dce511f..e22f449 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1292,14 +1292,17 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
 
 	mutex_lock(&vdev->igate);
 
-	if (vdev->err_trigger)
+	if (state == pci_channel_io_normal && vdev->recover_trigger)
+		eventfd_signal(vdev->recover_trigger, 1);
+	else if (vdev->err_trigger)
 		eventfd_signal(vdev->err_trigger, 1);
 
 	mutex_unlock(&vdev->igate);
 
 	vfio_device_put(device);
 
 	return PCI_ERS_RESULT_CAN_RECOVER;
 }
 
 static const struct pci_error_handlers vfio_err_handlers = {

Forward non fatal ones to guest, stop vm on fatal ones.


3. forward both non fatal and fatal error to guest
This includes 1 and 2 above, and

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index dce511f..4022f9b 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1299,7 +1299,8 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
 
 	vfio_device_put(device);
 
-	return PCI_ERS_RESULT_CAN_RECOVER;
+	return state == pci_channel_io_normal : PCI_ERS_RESULT_CAN_RECOVER :
+		PCI_ERS_RESULT_DISCONNECT;
 }
 
 static const struct pci_error_handlers vfio_err_handlers = {
	
Maybe make this conditional on recover_trigger to keep
compatibility.


You seem to be starting from 1. But how about starting small, and doing
2 as a first step? Fatal errors will still stop vm.
This will help you merge a bunch of error reporting infrastructure
without worrying about recovery so much.

Making some progress finally will be good.


Alex, what do you think?


-- 
MST