From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758325AbcLOOuK (ORCPT ); Thu, 15 Dec 2016 09:50:10 -0500 Received: from mx1.redhat.com ([209.132.183.28]:51594 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757854AbcLOOuI (ORCPT ); Thu, 15 Dec 2016 09:50:08 -0500 Date: Thu, 15 Dec 2016 16:50:07 +0200 From: "Michael S. Tsirkin" To: Cao jin Cc: Alex Williamson , linux-kernel@vger.kernel.org, kvm@vger.kernel.org, izumi.taku@jp.fujitsu.com Subject: Re: [PATCH] vfio/pci: Support error recovery Message-ID: <20161215161750-mutt-send-email-mst@kernel.org> References: <1480246457-10368-1-git-send-email-caoj.fnst@cn.fujitsu.com> <584EAACD.9070800@cn.fujitsu.com> <20161212121216.1c385d65@t450s.home> <58511DD7.8040508@cn.fujitsu.com> <20161214151637.6b2c6ddd@t450s.home> <5852A119.8050902@cn.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5852A119.8050902@cn.fujitsu.com> X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.30]); Thu, 15 Dec 2016 14:50:08 +0000 (UTC) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Dec 15, 2016 at 09:56:41PM +0800, Cao jin wrote: > > > On 12/15/2016 06:16 AM, Alex Williamson wrote: > > On Wed, 14 Dec 2016 18:24:23 +0800 > > Cao jin wrote: > > > >> Sorry for late. > >> after reading all your comments, I think I will try the solution 1. > >> > >> On 12/13/2016 03:12 AM, Alex Williamson wrote: > >>> On Mon, 12 Dec 2016 21:49:01 +0800 > >>> Cao jin wrote: > >>> > >>>> Hi, > >>>> I have 2 solutions(high level design) came to me, please see if they are > >>>> acceptable, or which one is acceptable. Also have some questions. > >>>> > >>>> 1. block guest access during host recovery > >>>> > >>>> add new field error_recovering in struct vfio_pci_device to > >>>> indicate host recovery status. aer driver in host will still do > >>>> reset link > >>>> > >>>> - set error_recovering in vfio-pci driver's error_detected, used to > >>>> block all kinds of user access(config space, mmio) > >>>> - in order to solve concurrent issue of device resetting & user > >>>> access, check device state[*] in vfio-pci driver's resume, see if > >>>> device reset is done, if it is, then clear"error_recovering", or > >>>> else new a timer, check device state periodically until device > >>>> reset is done. (what if device reset don't end for a long time?) > >>>> - In qemu, translate guest link reset to host link reset. > >>>> A question here: we already have link reset in host, is a second > >>>> link reset necessary? why? > >>>> > >>>> [*] how to check device state: reading certain config space > >>>> register, check return value is valid or not(All F's) > >>> > >>> Isn't this exactly the path we were on previously? > >> > >> Yes, it is basically the previous path, plus the optimization. > >> > >>> There might be an > >>> optimization that we could skip back-to-back resets, but how can you > >>> necessarily infer that the resets are for the same thing? If the user > >>> accesses the device between resets, can you still guarantee the guest > >>> directed reset is unnecessary? If time passes between resets, do you > >>> know they're for the same event? How much time can pass between the > >>> host and guest reset to know they're for the same event? In the > >>> process of error handling, which is more important, speed or > >>> correctness? > >>> > >> > >> I think vfio driver itself won't know what each reset comes for, and I > >> don't quite understand why should vfio care this question, is this a new > >> question in the design? > > > > You're suggesting an optimization to eliminate one of the resets, > > and as we've discussed, I don't see removing the host induced reset > > as a viable option. That means you want to eliminate the guest > > directed reset. There are potentially three levels to do that, the > > vfio-pci driver in the host kernel, somewhere in QEMU, or eliminate it > > within the guest. My comments were directed to the first option, the > > host kernel level cannot correlate user directed resets as duplicates > > of host directed resets. > > > > Ah, maybe it is mistake, I don't really want to eliminate guest directed > reset very much, I was just not sure why it is very necessary. > > The optimization I said just is fully separating host recovery from > guest recovery(timer, check device periodically) in time, because there > is concurrent device resetting & user access. > > >> But I think it make sense that the user access during 2 resets maybe a > >> trouble for guest recovery, misbehaved user could be out of our > >> imagination. Correctness is more important. > >> > >> If I understand you right, let me make a summary: host recovery just > >> does link reset, which is incomplete, so we'd better do a complete guest > >> recovery for correctness. > > > > We don't know whether the host link reset is incomplete, but we can't do > > a link reset transparently to the device, the device is no longer in the > > same state after the reset. The device specific driver, which exists > > in userspace needs to be involved in device recovery. Therefore > > regardless of how QEMU handles the error, the driver within the guest > > needs to be notified and perform recovery. Since the device is PCI and > > we're on x86 and nobody wants to introduce paravirtual error recovery, > > we must use AER. Part of AER recovery includes the possibility of > > performing a link reset. So it seems this eliminates avoiding the link > > reset within the guest. > > > > That leaves QEMU. Here we need to decide whether a guest triggered > > link reset induces a host link reset. The current working theory is > > that yes, this must be the case. If there is ever a case where a > > driver within the guest could trigger a link reset for the purposes > > of error recovery when the host has not, I think this must be the > > case. Therefore, at least some guest induced link resets must become > > host link resets. Currently we assume all guest induced link resets > > become host link resets. Minimally to avoid that, QEMU would need to > > know (not assume) whether the host performed a link reset. Even with > > that, QEMU would need to be able to correlate that a link reset from > > the guest is a duplicate of a link reset that was already performed by > > the host. That implies that QEMU needs to deduce the intention of > > the guest. That seems like a complicated task for a patch series that > > is already complicated enough, especially for a feature of questionable > > value given the configuration restrictions (imo). > > > > I would much rather focus on getting it right and making it as simple > > as we can, even if that means links get reset one too many times on > > error. > > > > Thanks very much for your detailed explanation, it does helps me to > understand your concern, understand why a second link reset is necessary. > > I still want to share my thoughts with you(not argue): now we know host > aer driver will do link reset for vfio-pci first, so I can say, even if > fatal error is link related, after host link reset, link can work now. > Then in qemu, we are not necessary to translate guest link reset to host > link reset, just use vfio_pci_reset() as it is to do device > reset(probably is FLR). Which also means we don't need following > patch(make code easier): > > @@ -3120,6 +3122,18 @@ static void vfio_pci_reset(DeviceState *dev) > > trace_vfio_pci_reset(vdev->vbasedev.name); > > + if (vdev->features & VFIO_FEATURE_ENABLE_AER) { > + PCIDevice *br = pci_bridge_get_device(pdev->bus); > + > + if ((pci_get_word(br->config + PCI_BRIDGE_CONTROL) & > + PCI_BRIDGE_CTL_BUS_RESET)) { > + if (pci_get_function_0(pdev) == pdev) { > + vfio_pci_hot_reset(vdev, vdev->single_depend_dev); > + } > + return; > + } > + } > + > vfio_pci_pre_reset(vdev); > > > I think this also implies: we have a virtual link in qemu, but a virtual > link will never be broken like a physical link.(In particular we already > know host aer driver surely will do link reset to recover physical > link). So, guest's link reset don't need to care whether virtual link is > reset, just care virtual device. And qemu "translates guest link reset > to host link reset" seems kind of taking link-reset responsibility over > from host:) > > >>>> 2. skip link reset in aer driver of host kernel, for vfio-pci. > >>>> Let user decide how to do serious recovery > >>>> > >>>> add new field "user_driver" in struct pci_dev, used to skip link > >>>> reset for vfio-pci; add new field "link_reset" in struct > >>>> vfio_pci_device to indicate link has been reset or not during > >>>> recovery > >>>> > >>>> - set user_driver in vfio_pci_probe(), to skip link reset for > >>>> vfio-pci in host. > >>>> - (use a flag)block user access(config, mmio) during host recovery > >>>> (not sure if this step is necessary) > >>>> - In qemu, translate guest link reset to host link reset. > >>>> - In vfio-pci driver, set link_reset after VFIO_DEVICE_PCI_HOT_RESET > >>>> is executed > >>>> - In vfio-pci driver's resume, new a timer, check "link_reset" field > >>>> periodically, if it is set in reasonable time, then clear it and > >>>> delete timer, or else, vfio-pci driver will does the link reset! > >>> > >>> What happens in the case of a multifunction device where each function > >>> is part of a separate IOMMU group and one function is hot-removed from > >>> the user? We can't do a link reset on that function since the other > >>> function is still in use. We have no choice but release a device in an > >>> unknown state back to the host. > >> > >> hot-remove from user, do you mean, for example, all functions assigned > >> to VM, then suddenly a person does something like following > >> > >> $ echo 0000:06:00.0 > /sys/bus/pci/drivers/vfio-pci/unbind > >> > >> $ echo 0000:06:00.0 > /sys/bus/pci/drivers/igb/bind > >> > >> to return device to host driver, or don't bind it to host driver, let it > >> in driver-less state??? > > > > Yes, the host kernel has no visiblity to how a user is making use of > > devices. To support AER we require a similar topology between host and > > guest such that a guest link reset translates to a host reset. That > > requirement is imposed by userspace, ie. QEMU. The host kernel cannot > > presume that this is the case. Therefore we could have a > > multi-function device where each function is assigned to the same or > > different users in any configuration. If a fault occurs and we defer > > to the user to perform the link reset, we have absolutely no guarantee > > that it will ever occur. If the functions are assigned to different > > users, then each user individually doesn't have the capability to > > perform a link reset. If the devices happen to be assigned to a single > > user when the error occurs, we cannot assume the user has an AER > > compatible configuration, the devices could be exposed as separate > > single function devices, any one of which might be individually removed > > from the user and made use of by the host, such as your sysfs example > > above. The host cannot perform a link reset in this case either > > as the sibling devices are still in use by the guest. Thanks, > > > > Alex > > > > > > this explanation is valuable to me, so this is also why we can't do link > reset in vfio driver when one of the function is closed. And do link > reset in vfio driver until all functions are close is poor solution and > very complex(quarantine the device) as you said. > > I am going to try solution 1, but I still have some consideration share > with you, this won't stop my trial, and don't have relationship with > above discussion, just FYI: > > In non-virtuallization environment, from device's perspective, the steps > of a normal recovery consists of: > error_detect > mmio_enabled > link_reset > slot_reset > resume > > Now in our condition, the steps become: > *link_reset* (host's, the following are guest's) > error_detect > mmio_enabled > link_reset > slot_reset > resume > > Especially, some device's specific driver in guest could do some > specific work in error_detect, take igb_io_error_detected() for example. > Like the words in pci-error-recovery.txt said: > > it gives the driver a chance to cleanup, waiting for pending stuff > (timers, whatever, etc...) to complete; > > But if link_reset is the first step, we lost all the status(register > value, etc) in the device. Of course I don't know if this will be a > problem (might not), just curious if this has been your concern:) You'll find I did mention it :) But consider Documentation/PCI/pcieaer-howto.txt 3.2.2.2 Non-correctable (non-fatal and fatal) errors If an error message indicates a non-fatal error, performing link reset at upstream is not required. The AER driver calls error_detected(dev, pci_channel_io_normal) to all drivers associated within a hierarchy in question. for example, EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort. If Upstream port A captures an AER error, the hierarchy consists of Downstream port B and EndPoint. A driver may return PCI_ERS_RESULT_CAN_RECOVER, PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on whether it can recover or the AER driver calls mmio_enabled as next. If an error message indicates a fatal error, kernel will broadcast error_detected(dev, pci_channel_io_frozen) to all drivers within a hierarchy in question. Then, performing link reset at upstream is necessary. I think that if you just forward errors to guests they will get confused. I see three possible approaches. 1. Always pretend to guest that there was a fatal error, then basically: diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c index dce511f..4022f9b 100644 --- a/drivers/vfio/pci/vfio_pci.c +++ b/drivers/vfio/pci/vfio_pci.c @@ -1299,7 +1299,7 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev, vfio_device_put(device); - return PCI_ERS_RESULT_CAN_RECOVER; + return PCI_ERS_RESULT_DISCONNECT; } static const struct pci_error_handlers vfio_err_handlers = { probably conditional on userspace invoking some ioctl to avoid breaking existing users. 2. send non fatal error to guest. Add another eventfd to distinguish non fatal and fatal errors. diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c index dce511f..e22f449 100644 --- a/drivers/vfio/pci/vfio_pci.c +++ b/drivers/vfio/pci/vfio_pci.c @@ -1292,14 +1292,17 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev, mutex_lock(&vdev->igate); - if (vdev->err_trigger) + if (state == pci_channel_io_normal && vdev->recover_trigger) + eventfd_signal(vdev->recover_trigger, 1); + else if (vdev->err_trigger) eventfd_signal(vdev->err_trigger, 1); mutex_unlock(&vdev->igate); vfio_device_put(device); return PCI_ERS_RESULT_CAN_RECOVER; } static const struct pci_error_handlers vfio_err_handlers = { Forward non fatal ones to guest, stop vm on fatal ones. 3. forward both non fatal and fatal error to guest This includes 1 and 2 above, and diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c index dce511f..4022f9b 100644 --- a/drivers/vfio/pci/vfio_pci.c +++ b/drivers/vfio/pci/vfio_pci.c @@ -1299,7 +1299,8 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev, vfio_device_put(device); - return PCI_ERS_RESULT_CAN_RECOVER; + return state == pci_channel_io_normal : PCI_ERS_RESULT_CAN_RECOVER : + PCI_ERS_RESULT_DISCONNECT; } static const struct pci_error_handlers vfio_err_handlers = { Maybe make this conditional on recover_trigger to keep compatibility. You seem to be starting from 1. But how about starting small, and doing 2 as a first step? Fatal errors will still stop vm. This will help you merge a bunch of error reporting infrastructure without worrying about recovery so much. Making some progress finally will be good. Alex, what do you think? -- MST