From: Alex Williamson <alex.williamson@redhat.com>
To: Cao jin <caoj.fnst@cn.fujitsu.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>,
	<linux-kernel@vger.kernel.org>, <kvm@vger.kernel.org>,
	<qemu-devel@nongnu.org>, <izumi.taku@jp.fujitsu.com>
Subject: Re: [PATCH v2] vfio/pci: Support error recovery
Date: Thu, 19 Jan 2017 12:26:04 -0700
Message-ID: <20170119122604.2ee85ebc@t450s.home>
In-Reply-To: <58802CC5.1080005@cn.fujitsu.com>

On Thu, 19 Jan 2017 11:04:37 +0800
Cao jin <caoj.fnst@cn.fujitsu.com> wrote:

> On 01/19/2017 05:32 AM, Alex Williamson wrote:
> > On Tue, 10 Jan 2017 17:11:01 +0200
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >   
> >> On Tue, Jan 10, 2017 at 07:46:17PM +0800, Cao jin wrote:  
> >>>
> >>>
> >>> On 01/10/2017 07:04 AM, Michael S. Tsirkin wrote:    
> >>>> On Sat, Dec 31, 2016 at 05:15:36PM +0800, Cao jin wrote:    
> >>>>> Support serious device error recovery    
> >>>>
> >>>> serious?
> >>>>    
> >>>
> >>> Sorry if my poor vocabulary confuses people. What I wanted to express
> >>> is that vfio-pci cannot actually perform a real recovery of the device,
> >>> even though it provides the callbacks; it relies on the user to do an
> >>> effective (or "serious"?) recovery.
> >>>
> >>> Amendments to the commit log are welcome.    
> >>
> >> It's up to Alex, maybe he's able to figure it all out from
> >> code, but the rest of us could benefit from a description
> >> of what the patch does from userspace point of view.
> >>
> >> Also, is it a pre-requisite of the userspace patches you posted?  
> > 
> > This is the same blocking of user accesses while the device is in recovery
> > that you thought was ineffective/wrong before.  Why do we still need it
> > if QEMU isn't trying to handle fatal errors?  If the kernel is doing a
> > reset shouldn't the user consider the device dead?  A commit log
> > explaining this is absolutely necessary.  Thanks,
> > 
> > Alex
> >   
> 
> Yes, it is the same blocking of user access as before. I did say it is
> not as effective as we expected, and I drew the figure to illustrate my
> analysis. I think the blocking is right, just maybe not sufficient to
> work well, because vfio's blocking can end before the hardware reset is
> done, which leaves the device inaccessible.
> 
> Leaving the blocking here does no harm for now, and it could be useful in
> the future (when we handle fatal errors).

If you want this in the kernel, you're going to need to invest the
effort to make it work.  I'm not going to put in code that is
ineffective at what it intends to do.
 
> We don't forward fatal error events to the guest, so why would the guest
> kernel do a reset? Or do you mean some device drivers would do a hardware
> reset on a non-fatal error?

The question is if we're only trying to recover from non-fatal events,
what is the scenario where the user is attempting to access the device
while it's in reset?  Do we need to consider the existing notifier to
be a fatal event where access to the device should stop immediately and
add a new notifier for non-fatal events where it's safe for the user to
access the device?  Trying to use the eventfd to push status through a
single notifier seems flawed.  Thanks,

Alex
