From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S934057AbcLNWQk (ORCPT <rfc822;w@1wt.eu>);
        Wed, 14 Dec 2016 17:16:40 -0500
Received: from mx1.redhat.com ([209.132.183.28]:49170 "EHLO mx1.redhat.com"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S933070AbcLNWQi (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
        Wed, 14 Dec 2016 17:16:38 -0500
Date: Wed, 14 Dec 2016 15:16:37 -0700
From: Alex Williamson <alex.williamson@redhat.com>
To: Cao jin <caoj.fnst@cn.fujitsu.com>
Cc: <mst@redhat.com>, <linux-kernel@vger.kernel.org>,
        <kvm@vger.kernel.org>, <izumi.taku@jp.fujitsu.com>
Subject: Re: [PATCH] vfio/pci: Support error recovery
Message-ID: <20161214151637.6b2c6ddd@t450s.home>
In-Reply-To: <58511DD7.8040508@cn.fujitsu.com>
References: <1480246457-10368-1-git-send-email-caoj.fnst@cn.fujitsu.com>
        <584EAACD.9070800@cn.fujitsu.com>
        <20161212121216.1c385d65@t450s.home>
        <58511DD7.8040508@cn.fujitsu.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.39]); Wed, 14 Dec 2016 22:16:38 +0000 (UTC)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, 14 Dec 2016 18:24:23 +0800
Cao jin <caoj.fnst@cn.fujitsu.com> wrote:

> Sorry for late.
> after reading all your comments, I think I will try the solution 1.
> 
> On 12/13/2016 03:12 AM, Alex Williamson wrote:
> > On Mon, 12 Dec 2016 21:49:01 +0800
> > Cao jin <caoj.fnst@cn.fujitsu.com> wrote:
> >   
> >> Hi,
> >> I have 2 solutions(high level design) came to me, please see if they are
> >> acceptable, or which one is acceptable. Also have some questions.
> >>
> >> 1. block guest access during host recovery
> >>
> >>    add new field error_recovering in struct vfio_pci_device to
> >>    indicate host recovery status. aer driver in host will still do
> >>    reset link
> >>
> >>    - set error_recovering in vfio-pci driver's error_detected, used to
> >>      block all kinds of user access(config space, mmio)
> >>    - in order to solve concurrent issue of device resetting & user
> >>      access, check device state[*] in vfio-pci driver's resume, see if
> >>      device reset is done, if it is, then clear"error_recovering", or
> >>      else new a timer, check device state periodically until device
> >>      reset is done. (what if device reset don't end for a long time?)
> >>    - In qemu, translate guest link reset to host link reset.
> >>      A question here: we already have link reset in host, is a second
> >>      link reset necessary? why?
> >>  
> >>    [*] how to check device state: reading certain config space
> >>        register, check return value is valid or not(All F's)  
> > 
> > Isn't this exactly the path we were on previously?  
> 
> Yes, it is basically the previous path, plus the optimization.
> 
> > There might be an
> > optimization that we could skip back-to-back resets, but how can you
> > necessarily infer that the resets are for the same thing? If the user
> > accesses the device between resets, can you still guarantee the guest
> > directed reset is unnecessary?  If time passes between resets, do you
> > know they're for the same event?  How much time can pass between the
> > host and guest reset to know they're for the same event?  In the
> > process of error handling, which is more important, speed or
> > correctness?
> >    
> 
> I think vfio driver itself won't know what each reset comes for, and I
> don't quite understand why should vfio care this question, is this a new
> question in the design?

You're suggesting an optimization to eliminate one of the resets,
and as we've discussed, I don't see removing the host induced reset
as a viable option.  That means you want to eliminate the guest
directed reset.  There are potentially three levels to do that, the
vfio-pci driver in the host kernel, somewhere in QEMU, or eliminate it
within the guest.  My comments were directed to the first option, the
host kernel level cannot correlate user directed resets as duplicates
of host directed resets.  
 
> But I think it make sense that the user access during 2 resets maybe a
> trouble for guest recovery, misbehaved user could be out of our
> imagination.  Correctness is more important.
> 
> If I understand you right, let me make a summary: host recovery just
> does link reset, which is incomplete, so we'd better do a complete guest
> recovery for correctness.

We don't know whether the host link reset is incomplete, but we can't do
a link reset transparently to the device, the device is no longer in the
same state after the reset.  The device specific driver, which exists
in userspace needs to be involved in device recovery.  Therefore
regardless of how QEMU handles the error, the driver within the guest
needs to be notified and perform recovery.  Since the device is PCI and
we're on x86 and nobody wants to introduce paravirtual error recovery,
we must use AER.  Part of AER recovery includes the possibility of
performing a link reset.  So it seems this eliminates avoiding the link
reset within the guest.

That leaves QEMU.  Here we need to decide whether a guest triggered
link reset induces a host link reset.  The current working theory is
that yes, this must be the case.  If there is ever a case where a
driver within the guest could trigger a link reset for the purposes
of error recovery when the host has not, I think this must be the
case.  Therefore, at least some guest induced link resets must become
host link resets.  Currently we assume all guest induced link resets
become host link resets.  Minimally to avoid that, QEMU would need to
know (not assume) whether the host performed a link reset.  Even with
that, QEMU would need to be able to correlate that a link reset from
the guest is a duplicate of a link reset that was already performed by
the host.  That implies that QEMU needs to deduce the intention of
the guest.  That seems like a complicated task for a patch series that
is already complicated enough, especially for a feature of questionable
value given the configuration restrictions (imo).

I would much rather focus on getting it right and making it as simple
as we can, even if that means links get reset one too many times on
error.

> >> 2. skip link reset in aer driver of host kernel, for vfio-pci.
> >>    Let user decide how to do serious recovery
> >>
> >>    add new field "user_driver" in struct pci_dev, used to skip link
> >>    reset for vfio-pci; add new field "link_reset" in struct
> >>    vfio_pci_device to indicate link has been reset or not during
> >>    recovery
> >>
> >>    - set user_driver in vfio_pci_probe(), to skip link reset for
> >>      vfio-pci in host.
> >>    - (use a flag)block user access(config, mmio) during host recovery
> >>      (not sure if this step is necessary)
> >>    - In qemu, translate guest link reset to host link reset.
> >>    - In vfio-pci driver, set link_reset after VFIO_DEVICE_PCI_HOT_RESET
> >>      is executed
> >>    - In vfio-pci driver's resume, new a timer, check "link_reset" field
> >>      periodically, if it is set in reasonable time, then clear it and
> >>      delete timer, or else, vfio-pci driver will does the link reset!  
> > 
> > What happens in the case of a multifunction device where each function
> > is part of a separate IOMMU group and one function is hot-removed from
> > the user? We can't do a link reset on that function since the other
> > function is still in use.  We have no choice but release a device in an
> > unknown state back to the host.  
> 
> hot-remove from user, do you mean, for example, all functions assigned
> to VM, then suddenly a person does something like following
> 
> $ echo 0000:06:00.0 > /sys/bus/pci/drivers/vfio-pci/unbind
> 
> $ echo 0000:06:00.0 > /sys/bus/pci/drivers/igb/bind
> 
> to return device to host driver, or don't bind it to host driver, let it
> in driver-less state???

Yes, the host kernel has no visiblity to how a user is making use of
devices.  To support AER we require a similar topology between host and
guest such that a guest link reset translates to a host reset.  That
requirement is imposed by userspace, ie. QEMU.  The host kernel cannot
presume that this is the case.  Therefore we could have a
multi-function device where each function is assigned to the same or
different users in any configuration.  If a fault occurs and we defer
to the user to perform the link reset, we have absolutely no guarantee
that it will ever occur.  If the functions are assigned to different
users, then each user individually doesn't have the capability to
perform a link reset.  If the devices happen to be assigned to a single
user when the error occurs, we cannot assume the user has an AER
compatible configuration, the devices could be exposed as separate
single function devices, any one of which might be individually removed
from the user and made use of by the host, such as your sysfs example
above.  The host cannot perform a link reset in this case either
as the sibling devices are still in use by the guest.  Thanks,

Alex