From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:41043)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <alex.williamson@redhat.com>) id 1bLZfA-0006G8-QD
	for qemu-devel@nongnu.org; Fri, 08 Jul 2016 13:34:06 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <alex.williamson@redhat.com>) id 1bLZf7-0001tR-1M
	for qemu-devel@nongnu.org; Fri, 08 Jul 2016 13:34:03 -0400
Received: from mx1.redhat.com ([209.132.183.28]:45683)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <alex.williamson@redhat.com>) id 1bLZf6-0001tN-PJ
	for qemu-devel@nongnu.org; Fri, 08 Jul 2016 13:34:00 -0400
Date: Fri, 8 Jul 2016 11:33:58 -0600
From: Alex Williamson <alex.williamson@redhat.com>
Message-ID: <20160708113358.2e5ed040@t450s.home>
In-Reply-To: <3e661554-3c79-9d8f-41b2-c4cdd0fcb95f@cn.fujitsu.com>
References: <1464315131-25834-1-git-send-email-zhoujie2011@cn.fujitsu.com>
	<20160621215626.71c99582@t450s.home>
	<113474d2-8408-db49-e7ef-8c6b736af866@cn.fujitsu.com>
	<468b752b-a161-902b-d4cc-489dfa18c21e@cn.fujitsu.com>
	<20160622094236.515549fa@t450s.home>
	<7746532f-2fad-1304-0df7-7cd25ba761af@cn.fujitsu.com>
	<20160627095418.659e6e5f@t450s.home>
	<d52923dc-afb9-9dc1-ac0e-bae645141843@cn.fujitsu.com>
	<20160627215808.1531a774@t450s.home>
	<7912dad0-0e37-603d-fdfe-bb4950b55f28@cn.fujitsu.com>
	<20160628084052.1e85a730@t450s.home>
	<689ac38f-96d7-9717-e9c4-d9926272cb86@cn.fujitsu.com>
	<20160629122242.2ac20254@t450s.home>
	<bb6d23b4-2043-07c4-4bef-a2a00e5f4c58@cn.fujitsu.com>
	<99e88b46-8b8b-f7de-700f-8d644a7f005a@cn.fujitsu.com>
	<f3862836-a03e-0e53-91bb-b0cf5006415d@cn.fujitsu.com>
	<20160705110300.029985a6@t450s.home>
	<24876827-07fc-6f69-6949-7e358720c914@cn.fujitsu.com>
	<20160707130442.7ac934c8@t450s.home>
	<3e661554-3c79-9d8f-41b2-c4cdd0fcb95f@cn.fujitsu.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [PATCH v8 11/12] vfio: register aer resume
 notification handler for aer resume
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel/>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Zhou Jie <zhoujie2011@cn.fujitsu.com>
Cc: izumi.taku@jp.fujitsu.com, caoj.fnst@cn.fujitsu.com, Chen Fan <fan.chen@easystack.cn>, qemu-devel@nongnu.org, mst@redhat.com

On Fri, 8 Jul 2016 09:38:50 +0800
Zhou Jie <zhoujie2011@cn.fujitsu.com> wrote:

> Hi Alex,
> 
> >>>>> The following code will be modified.
> >>>>> 1. vfio_pci_ioctl
> >>>>>    add a flag in vfio_device_info for workable_state support
> >>>>>    return workable_state in "struct vfio_pci_device" when user get info  
> >>>
> >>> Seems like two flags are required, one to indicate the presence of this
> >>> feature and another to indicate the state.  I'd prefer something like
> >>> access_blocked.  
> >> User can get the state by state flag.
> >> And also maybe blocked by ioctl or write functons.
> >> User has choice to invoke which functions.  
> >
> > Let's imagine there's one flag bit, there are two possible polarities,
> > a) the bit is set when access is available, b) the bit is set when
> > access is blocked.
> >
> > Let's examine a), if the bit is not set, does that means that access is
> > not available or does that mean the kernel doesn't support that
> > feature?  There's no way to know.  Fail.  So we switch to b), an error
> > occurs, the bit is not set, does that mean access is blocked or does
> > that mean that the kernel we're using doesn't support the feature.
> > Fail.  If there's a way to do this with one bit, please explain it to
> > me.  Relying on a function to block, which may not be a valid
> > assumption on the kernel we're using also fails.  Userspace must be
> > able to know, in a deterministic and backwards compatible way, the
> > features of the kernel and the behavior to expect.  
> Sorry, I didn't say clearly.
> There is one flag and one variable.
> 1) workable state support flag
>     This flag let user know whether the kenerl support block feature.
> 2) workable state variable in "struct vfio_pci_device"
>     This variable let user know whether the device is blocked.

The variable clearly isn't visible to the user, so the user can know
whether the kernel supports this feature, but not whether the feature
is currently active.  Perhaps there's no way to avoid races completely,
but don't you expect that if we define that certain operations are
blocked after an error notification that a user may want some way to
poll for whether the block is active after a reset rather than simply
calling a blocked interface to probe for it?
 
> >>>>> 2. vfio_pci_ioctl
> >>>>>    During err occurs and resume:
> >>>>>    if (cmd == VFIO_DEVICE_SET_IRQS || VFIO_DEVICE_RESET
> >>>>>    || VFIO_DEVICE_GET_PCI_HOT_RESET_INFO || VFIO_DEVICE_PCI_HOT_RESET)
> >>>>>    block for workable_state clearing
> >>>>> 3. vfio_pci_write
> >>>>>    During err occurs and resume:
> >>>>>    ignore and return 0  
> >>>
> >>> This is contradictory to your comment "Do nothing for bar regions".
> >>> ISTR that returning 0 for read/write calls is an easy way to break
> >>> users since we've return neither the desired bytes nor an error code.  
> >> No, there is not change for read.
> >> Just return 0 for write.
> >> Return 0 mean that there is no byte has been written.
> >> Consider that the aer has occurred,
> >> it is better to not to write any thing to device.
> >> User can still read/write bar regions by mmap address,
> >> this may generate some date errors,
> >> but it doesn't matter as device is going to be reset.  
> >
> > My statement still stands, you state "Do nothing for bar regions" and
> > "return 0 for write".  Those are contradictory and as I indicate,
> > returning 0 is problematic for userspace.  Additionally, making
> > vfio_pci_write return zero while still allowing writes through the BAR
> > mmap is inconsistent.  
> I understand.
> I will block writing to configure space.
> And don't change for writing bar regions.
> 
> >>>>> 4. vfio_pci_aer_err_detected
> >>>>>    Set workable_state to false in "struct vfio_pci_device"
> >>>>>    Disable INTx:
> >>>>>      If Disable INTx is support
> >>>>>        disable by PCI_COMMAND
> >>>>>      else
> >>>>>        disable by disable_irq function
> >>>>>    Disable MSI:
> >>>>>        disable by clearing the "Bus Master Enable" bit of PCI_COMMAND  
> >>>
> >>> I've suggested repeatedly to properly teardown these interrupts.  I
> >>> disagree with your proposed approach here.  If the device is intended to
> >>> be in a state that requires re-initialization then the interrupt setup
> >>> should be part of that.  
> >> I have traced the INTx functions.
> >> -vfio_pci_intx_unmask_handler
> >> -vfio_pci_intx_mask
> >> -vfio_intx_set_signal
> >>
> >> They are invoked by User.
> >> -vfio_pci_write
> >> -vfio_pci_ioctl
> >>
> >> During err occurs and resume above functions are blocked.
> >> So, User cann't set the INTx.
> >> I will disable the INTx in vfio_pci_aer_err_detected.
> >> And reset the INTx in vfio_pci_aer_resume
> >> according the original user setting(vdev->ctx[0].masked).  
> >
> > Again, you're missing the point.  If the device is expected to be
> > reinitialized after resume, why don't we return the device to an
> > initial state where interrupts are not configured?  I think this
> > presents inconsistent behavior to the user.  
> About return the device to an
> initial state where interrupts are not configured.
> Do you mean free_irq?
> User request_irq by vfio_pci_ioctl.
> If we free_irq, how can the user know it?
> Do you think guest will reinitializ the device after resume,
> so it doesn't matter?
> 
> Maybe we should not expect what the guest will do.
> What I want to do in vfio driver is as following.
> 1. aer occurs
> 2. Disable INTx and MSI
> 3. aer driver reset the device
> 4. Restore INTx and MSI
> 5. user process the aer event

As we've discussed before, the AER notification needs to be relayed to
the user without delay, otherwise we only increase the gap where the
user might consume bogus data.  It also only seems reasonable to modify
the behavior of the interfaces (ie. blocking) if the user is notified,
which would be through the existing error notifier.  We can never
depend on a specific behavior from the user, we may be dealing with a
malicious user.

We already disable interrupts in vfio_pci_disable() simply by calling
the ioctl function directly.  If we simply disable and re-enable
interrupts as you propose, how does the user deal with edge triggered
interrupts that may have occurred during that period?  Are they lost?
Should we instead leave the interrupts enabled but skip
eventfd_signal() in the interrupts handlers, queuing interrupts for
re-delivery after the device is resumed?  Or does it make more sense to
simply disable the interrupts as done in vfio_pci_disable() and define
that the user needs to re-establish interrupts before continuing after
an error event?  Thanks,

Alex