From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934057AbcLNWQk (ORCPT ); Wed, 14 Dec 2016 17:16:40 -0500 Received: from mx1.redhat.com ([209.132.183.28]:49170 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933070AbcLNWQi (ORCPT ); Wed, 14 Dec 2016 17:16:38 -0500 Date: Wed, 14 Dec 2016 15:16:37 -0700 From: Alex Williamson To: Cao jin Cc: , , , Subject: Re: [PATCH] vfio/pci: Support error recovery Message-ID: <20161214151637.6b2c6ddd@t450s.home> In-Reply-To: <58511DD7.8040508@cn.fujitsu.com> References: <1480246457-10368-1-git-send-email-caoj.fnst@cn.fujitsu.com> <584EAACD.9070800@cn.fujitsu.com> <20161212121216.1c385d65@t450s.home> <58511DD7.8040508@cn.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.39]); Wed, 14 Dec 2016 22:16:38 +0000 (UTC) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, 14 Dec 2016 18:24:23 +0800 Cao jin wrote: > Sorry for late. > after reading all your comments, I think I will try the solution 1. > > On 12/13/2016 03:12 AM, Alex Williamson wrote: > > On Mon, 12 Dec 2016 21:49:01 +0800 > > Cao jin wrote: > > > >> Hi, > >> I have 2 solutions(high level design) came to me, please see if they are > >> acceptable, or which one is acceptable. Also have some questions. > >> > >> 1. block guest access during host recovery > >> > >> add new field error_recovering in struct vfio_pci_device to > >> indicate host recovery status. aer driver in host will still do > >> reset link > >> > >> - set error_recovering in vfio-pci driver's error_detected, used to > >> block all kinds of user access(config space, mmio) > >> - in order to solve concurrent issue of device resetting & user > >> access, check device state[*] in vfio-pci driver's resume, see if > >> device reset is done, if it is, then clear"error_recovering", or > >> else new a timer, check device state periodically until device > >> reset is done. (what if device reset don't end for a long time?) > >> - In qemu, translate guest link reset to host link reset. > >> A question here: we already have link reset in host, is a second > >> link reset necessary? why? > >> > >> [*] how to check device state: reading certain config space > >> register, check return value is valid or not(All F's) > > > > Isn't this exactly the path we were on previously? > > Yes, it is basically the previous path, plus the optimization. > > > There might be an > > optimization that we could skip back-to-back resets, but how can you > > necessarily infer that the resets are for the same thing? If the user > > accesses the device between resets, can you still guarantee the guest > > directed reset is unnecessary? If time passes between resets, do you > > know they're for the same event? How much time can pass between the > > host and guest reset to know they're for the same event? In the > > process of error handling, which is more important, speed or > > correctness? > > > > I think vfio driver itself won't know what each reset comes for, and I > don't quite understand why should vfio care this question, is this a new > question in the design? You're suggesting an optimization to eliminate one of the resets, and as we've discussed, I don't see removing the host induced reset as a viable option. That means you want to eliminate the guest directed reset. There are potentially three levels to do that, the vfio-pci driver in the host kernel, somewhere in QEMU, or eliminate it within the guest. My comments were directed to the first option, the host kernel level cannot correlate user directed resets as duplicates of host directed resets. > But I think it make sense that the user access during 2 resets maybe a > trouble for guest recovery, misbehaved user could be out of our > imagination. Correctness is more important. > > If I understand you right, let me make a summary: host recovery just > does link reset, which is incomplete, so we'd better do a complete guest > recovery for correctness. We don't know whether the host link reset is incomplete, but we can't do a link reset transparently to the device, the device is no longer in the same state after the reset. The device specific driver, which exists in userspace needs to be involved in device recovery. Therefore regardless of how QEMU handles the error, the driver within the guest needs to be notified and perform recovery. Since the device is PCI and we're on x86 and nobody wants to introduce paravirtual error recovery, we must use AER. Part of AER recovery includes the possibility of performing a link reset. So it seems this eliminates avoiding the link reset within the guest. That leaves QEMU. Here we need to decide whether a guest triggered link reset induces a host link reset. The current working theory is that yes, this must be the case. If there is ever a case where a driver within the guest could trigger a link reset for the purposes of error recovery when the host has not, I think this must be the case. Therefore, at least some guest induced link resets must become host link resets. Currently we assume all guest induced link resets become host link resets. Minimally to avoid that, QEMU would need to know (not assume) whether the host performed a link reset. Even with that, QEMU would need to be able to correlate that a link reset from the guest is a duplicate of a link reset that was already performed by the host. That implies that QEMU needs to deduce the intention of the guest. That seems like a complicated task for a patch series that is already complicated enough, especially for a feature of questionable value given the configuration restrictions (imo). I would much rather focus on getting it right and making it as simple as we can, even if that means links get reset one too many times on error. > >> 2. skip link reset in aer driver of host kernel, for vfio-pci. > >> Let user decide how to do serious recovery > >> > >> add new field "user_driver" in struct pci_dev, used to skip link > >> reset for vfio-pci; add new field "link_reset" in struct > >> vfio_pci_device to indicate link has been reset or not during > >> recovery > >> > >> - set user_driver in vfio_pci_probe(), to skip link reset for > >> vfio-pci in host. > >> - (use a flag)block user access(config, mmio) during host recovery > >> (not sure if this step is necessary) > >> - In qemu, translate guest link reset to host link reset. > >> - In vfio-pci driver, set link_reset after VFIO_DEVICE_PCI_HOT_RESET > >> is executed > >> - In vfio-pci driver's resume, new a timer, check "link_reset" field > >> periodically, if it is set in reasonable time, then clear it and > >> delete timer, or else, vfio-pci driver will does the link reset! > > > > What happens in the case of a multifunction device where each function > > is part of a separate IOMMU group and one function is hot-removed from > > the user? We can't do a link reset on that function since the other > > function is still in use. We have no choice but release a device in an > > unknown state back to the host. > > hot-remove from user, do you mean, for example, all functions assigned > to VM, then suddenly a person does something like following > > $ echo 0000:06:00.0 > /sys/bus/pci/drivers/vfio-pci/unbind > > $ echo 0000:06:00.0 > /sys/bus/pci/drivers/igb/bind > > to return device to host driver, or don't bind it to host driver, let it > in driver-less state??? Yes, the host kernel has no visiblity to how a user is making use of devices. To support AER we require a similar topology between host and guest such that a guest link reset translates to a host reset. That requirement is imposed by userspace, ie. QEMU. The host kernel cannot presume that this is the case. Therefore we could have a multi-function device where each function is assigned to the same or different users in any configuration. If a fault occurs and we defer to the user to perform the link reset, we have absolutely no guarantee that it will ever occur. If the functions are assigned to different users, then each user individually doesn't have the capability to perform a link reset. If the devices happen to be assigned to a single user when the error occurs, we cannot assume the user has an AER compatible configuration, the devices could be exposed as separate single function devices, any one of which might be individually removed from the user and made use of by the host, such as your sysfs example above. The host cannot perform a link reset in this case either as the sibling devices are still in use by the guest. Thanks, Alex