From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1752269AbcLFDze (ORCPT <rfc822;w@1wt.eu>);
        Mon, 5 Dec 2016 22:55:34 -0500
Received: from mx1.redhat.com ([209.132.183.28]:33936 "EHLO mx1.redhat.com"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S1751201AbcLFDzb (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
        Mon, 5 Dec 2016 22:55:31 -0500
Date: Tue, 6 Dec 2016 05:55:28 +0200
From: "Michael S. Tsirkin" <mst@redhat.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: Cao jin <caoj.fnst@cn.fujitsu.com>, linux-kernel@vger.kernel.org,
        kvm@vger.kernel.org, izumi.taku@jp.fujitsu.com
Subject: Re: [PATCH] vfio/pci: Support error recovery
Message-ID: <20161206054642-mutt-send-email-mst@kernel.org>
References: <1480246457-10368-1-git-send-email-caoj.fnst@cn.fujitsu.com>
 <20161130210413.5161aab1@t450s.home>
 <58402830.3060606@cn.fujitsu.com>
 <20161201075541.756f6332@t450s.home>
 <5844092A.30204@cn.fujitsu.com>
 <20161204083047.7e715b09@t450s.home>
 <58450083.9010201@cn.fujitsu.com>
 <20161205091730.568e5079@t450s.home>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20161205091730.568e5079@t450s.home>
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.39]); Tue, 06 Dec 2016 03:55:31 +0000 (UTC)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Dec 05, 2016 at 09:17:30AM -0700, Alex Williamson wrote:
> If you're going to take the lead for these AER patches, I would
> certainly suggest that understanding the reasoning behind the bus reset
> behavior is a central aspect to this series.  This effort has dragged
> out for nearly two years and I apologize, but I don't really have a lot
> of patience for rehashing some of these issues if you're not going to
> read the previous discussions or consult with your colleagues to
> understand how we got to this point.  If you want to challenge some of
> the design points, that's great, it could use some new eyes, but please
> understand how we got here first.

Well I'm guessing Cao jin here isn't the only one not
willing to plough through all historical versions of the patchset
just to figure out the motivation for some code.

Including a summary of the high-level architecture couldn't hurt.

Any chance of writing one?  Alternatively, we can try to build it as
part of this thread.  Shouldn't be hard, as it seems somewhat
straightforward on the surface:

- detect link error on the host, don't reset link as we would normally do
- report link error to guest
- detect link reset request from guest
- reset link on host

Since a link reset will reset all devices behind it, for this to work we
need the same set of devices behind the link in host and guest.  Enforcing
this would be nice to have.
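
To illustrate what enforcing this could look like, here is a rough
sketch of the check.  It is plain userspace C, not vfio code: dev_id,
contains() and same_device_set() are names made up for this mail, and I
am assuming each guest device can be mapped back to the id of the host
device it is backed by.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical host device identifier (e.g. an encoded bus/dev/fn). */
typedef unsigned int dev_id;

/* Linear search; fine for the handful of devices behind one link. */
static bool contains(const dev_id *set, size_t n, dev_id id)
{
	for (size_t i = 0; i < n; i++)
		if (set[i] == id)
			return true;
	return false;
}

/*
 * Return true if the devices behind the host link and the host ids of
 * the devices behind the corresponding guest link form the same set.
 * Assumes neither array contains duplicates.
 */
static bool same_device_set(const dev_id *host, size_t nhost,
			    const dev_id *guest, size_t nguest)
{
	assert(host != NULL || nhost == 0);
	assert(guest != NULL || nguest == 0);

	if (nhost != nguest)
		return false;
	for (size_t i = 0; i < nhost; i++)
		if (!contains(guest, nguest, host[i]))
			return false;
	return true;
}
```

If the check fails, a guest-triggered link reset would hit host devices
the guest does not own (or miss ones it expects to be reset), so the
feature would have to stay disabled for that configuration.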

- as the link might now end up in a bad state, reset
  it when the device is unassigned
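
To make the flow above concrete, here is a toy model of it as a small
event handler.  This is only a sketch of my reading of the design, in
plain userspace C; struct link_ctx, the EV_* events and handle_event()
are invented for this mail and are not existing vfio or PCI core
symbols.  The counters stand in for the real "notify guest" and "reset
link" actions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical events, one per step of the flow described above. */
enum link_event {
	EV_HOST_LINK_ERROR,	/* host detected a link error */
	EV_GUEST_RESET_REQUEST,	/* guest asked for a link reset */
	EV_DEVICE_UNASSIGNED,	/* device is being taken from the guest */
};

/* Hypothetical per-link bookkeeping. */
struct link_ctx {
	bool error_pending;		/* error reported, link not yet reset */
	unsigned int guest_notifications;
	unsigned int host_resets;
};

static void handle_event(struct link_ctx *c, enum link_event ev)
{
	assert(c != NULL);

	switch (ev) {
	case EV_HOST_LINK_ERROR:
		/* Don't reset the link as we normally would; tell the guest. */
		c->guest_notifications++;
		c->error_pending = true;
		break;
	case EV_GUEST_RESET_REQUEST:
		/* Guest drives recovery: do the reset on the host. */
		c->host_resets++;
		c->error_pending = false;
		break;
	case EV_DEVICE_UNASSIGNED:
		/* Guest never recovered: link may be in a bad state. */
		if (c->error_pending) {
			c->host_resets++;
			c->error_pending = false;
		}
		break;
	}
}
```

The only state that matters here is whether an error was reported
without a reset following it; that is what triggers the reset on
unassignment.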

Any details I missed?

-- 
MST