All of lore.kernel.org
 help / color / mirror / Atom feed
* RFC: vfio-pci API for PCI bus/slot (hot) resets
@ 2013-08-01 22:18 Alex Williamson
  2013-08-02  5:10 ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 8+ messages in thread
From: Alex Williamson @ 2013-08-01 22:18 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, Alexey Kardashevskiy, Benjamin Herrenschmidt

vfio-pci needs to support an interface to do hot resets (PCI parent
bridge secondary bus reset).   We need this to support reset of
co-assigned devices where one or more of the devices does not support
function level reset.  In particular, discrete graphics cards typically
have no reset options other than doing a link reset.  What I have below
is a bit awkward, so I welcome other ideas to accomplish this goal.
I've been using a "blind" interface based on all affected devices
belonging to the same VFIO container for current VGA testing.  This is
ok when all you want to do is VGA, but I'd really like to make use of
this any time a device doesn't support a function level reset.  I've
posted a series to the PCI list to add bus and slot reset interfaces to
PCI-core, this API is how we expose that through VFIO to a user.  Please
comment.  Thanks,

Alex

---
Mechanism to do PCI hot resets through VFIO:

VFIO is fundamentally an IOMMU group and device level interface.
There's no concept of buses, slots, or hierarchies of devices.  There
are only IOMMU group and devices.  A bus (or slot) may contain exactly
one IOMMU group, multiple IOMMU groups, or a portion of an IOMMU group.
An IOMMU group may contain one or more devices.

The first question is perhaps where should we create the interface to do
a PCI hot reset.  Assuming an ioctl interface, our choices are the
group, the container, or the device file descriptors.  Groups and
containers are not PCI specific, so an extension on either of those
doesn't make much sense.  They also don't have much granularity if your
goal is to do a hot reset on the smallest subset of devices you can.
Therefore the only choice seems to be a VFIO device level interface.

The fact that a hot reset affects multiple devices also raises concerns.
How do we make sure a user has sufficient access/privilege to perform
this operation?  If all of the affected devices are within the same
group, then we know the user already "owns" all those devices.  Using
groups as the boundary excludes a number of use cases though.  The user
would need to prove that they also own the other groups or devices that
are affected by the reset.  This might be multiple groups, so the ioctl
quickly grows to requiring a list of file descriptors be passed for
validation.

We already use the group file descriptor as a unit of ownership for
enabling the container, so it seems like it would make sense to use it
here too.  The alternative is a device file descriptor, but groups may
encompass devices the user doesn't care to use and we don't want to
require that they open a file descriptor simply to perform a hot reset.
Groups can also contain devices that the user cannot open, for instance
those owned by VFIO "compatible" drivers like pci-stub or pcieport.

The user also needs to know the set of devices affected by a hot reset,
otherwise they have no idea which group file descriptors to pass to such
an interface.  That implies we also need a separate "info" ioctl for the
user to learn that information.  We could argue that the user could
learn this information from sysfs, but that imposes non-trivial library
or code overhead on the user to evaluate the topology.  The PCI hot
reset info ioctl would need to indicate whether a hot reset is
available, and the set of affected devices.  It may be useful to provide
this as a {group, device} pair so the user doesn't need to
cross-reference each device with sysfs to determine the group for the
device.  This would then provide both the set of groups required to
perform the hot reset and the set of devices affected by the hot reset.

As an alternative, we could consider simply requiring that all of the
devices affected by a hot reset belong to the same VFIO container.
However, allowing multiple groups per container is an optional IOMMU
capability that really has no relation to PCI bus/slot boundaries.  It
seems a bit arbitrary to require groups be placed in the same container
to get a PCI hot reset.  That likely means we'd still need to support
passing some kind of ownership token as above with groups.  So it
doesn't seem to make the situation any better.

Given the above discussion, I therefore propose the following PCI hot
reset interface:

/**
 * VFIO_DEVICE_PCI_HOT_RESET_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + ??,
 *                                        struct vfio_device_pci_hot_reset_info)
 */

#define VFIO_DEVICE_PCI_HOT_RESET_INFO     _IO(VFIO_TYPE, VFIO_BASE + ??)

struct vfio_device_pci_hot_reset_info_entry {
	__u32	group_id;
	__u16	segment; /* A reset will never include devices on other segments... return it anyway */
	__u8	bus;
	__u8	devfn; /* Use PCI_SLOT/PCI_FUNC */
};

struct vfio_device_pci_hot_reset_info {
	__u32	argsz;
	__u32	flags;
#define VFIO_PCI_HOT_RESET_SUPPORTED	(1 << 0) /* Device supports hot reset */
#define VFIO_PCI_HOT_RESET_POPULATED	(1 << 1) /* Entries field are populated */
	__u32	count;
	struct vfio_device_pci_hot_reset_info_entry	entries[];
};

The user calls VFIO_DEVICE_PCI_HOT_RESET_INFO on a VFIO device file
descriptor with a struct vfio_device_pci_hot_reset_info data structure,
minimally the sizeof the struct with argsz set.  VFIO returns whether
hot reset is supported and the number of devices affected by the reset.
If argsz is big enough, VFIO will fill in the entries and set the
populated flag, otherwise the caller can reallocate the structure and
try again.

/**
 * VFIO_DEVICE_PCI_HOT_RESET - _IOW(VFIO_TYPE, VFIO_BASE + ??,
 *                                  struct vfio_device_pci_hot_reset)
 */

#define VFIO_DEVICE_PCI_HOT_RESET     _IO(VFIO_TYPE, VFIO_BASE + ??)

struct vfio_device_pci_hot_reset {
	__u32	argsz;
	__u32	flags;
	__u32	count;
	__u32	fds[];
};

As above, the user calls VFIO_DEVICE_PCI_HOT_RESET on a VFIO device file
descriptor.  Using the list from PCI_HOT_RESET_INFO, the user allocates
a struct vfio_device_pci_hot_reset of sufficient size to pass a list of
VFIO group file descriptors.  There should be one file descriptor for
each group listed in the info entries.  If the list of groups matches
those affected by a hot reset of the device, then VFIO will perform the
hot reset action and return success.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: RFC: vfio-pci API for PCI bus/slot (hot) resets
  2013-08-01 22:18 RFC: vfio-pci API for PCI bus/slot (hot) resets Alex Williamson
@ 2013-08-02  5:10 ` Benjamin Herrenschmidt
  2013-08-02 16:36   ` Alex Williamson
  0 siblings, 1 reply; 8+ messages in thread
From: Benjamin Herrenschmidt @ 2013-08-02  5:10 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, linux-kernel, Alexey Kardashevskiy

On Thu, 2013-08-01 at 16:18 -0600, Alex Williamson wrote:
> vfio-pci needs to support an interface to do hot resets (PCI parent
> bridge secondary bus reset).   We need this to support reset of
> co-assigned devices where one or more of the devices does not support
> function level reset.  In particular, discrete graphics cards typically
> have no reset options other than doing a link reset.  

Link reset != hot reset but yeah I see your point. There is more too,
there is fundamental reset which may or may not be something we can
control for individual slots (ie, switch legs) depending on platform
specific stuff...

> What I have below
> is a bit awkward, so I welcome other ideas to accomplish this goal.
> I've been using a "blind" interface based on all affected devices
> belonging to the same VFIO container for current VGA testing.  This is
> ok when all you want to do is VGA, but I'd really like to make use of
> this any time a device doesn't support a function level reset.  I've
> posted a series to the PCI list to add bus and slot reset interfaces to
> PCI-core, this API is how we expose that through VFIO to a user.  Please
> comment.  Thanks,

Also in some cases, at least for us, we have a problem where the scope
of the reset can be larger than the group ... in this case I think we
need to forbid the reset.

For us, it's also tied into our "EEH" error handling. IE. The way the
guest will request some of these things is going to be via firmware APIs
that we have yet to implement, but that we were also thinking of
implementing entire in the kernel rather than qemu for various
reasons... IE. There is more to error handling & recovery in our case at
least than AER and reset :-)

I need to spend more time reading your proposal and see how it can work
for us (or not...)... but we might end up just doing our own thing that
carries the whole EEH API instead.

Cheers,
Ben.

> Alex
> 
> ---
> Mechanism to do PCI hot resets through VFIO:
> 
> VFIO is fundamentally an IOMMU group and device level interface.
> There's no concept of buses, slots, or hierarchies of devices.  There
> are only IOMMU group and devices.  A bus (or slot) may contain exactly
> one IOMMU group, multiple IOMMU groups, or a portion of an IOMMU group.
> An IOMMU group may contain one or more devices.
> 
> The first question is perhaps where should we create the interface to do
> a PCI hot reset.  Assuming an ioctl interface, our choices are the
> group, the container, or the device file descriptors.  Groups and
> containers are not PCI specific, so an extension on either of those
> doesn't make much sense.  They also don't have much granularity if your
> goal is to do a hot reset on the smallest subset of devices you can.
> Therefore the only choice seems to be a VFIO device level interface.
> 
> The fact that a hot reset affects multiple devices also raises concerns.
> How do we make sure a user has sufficient access/privilege to perform
> this operation?  If all of the affected devices are within the same
> group, then we know the user already "owns" all those devices.  Using
> groups as the boundary excludes a number of use cases though.  The user
> would need to prove that they also own the other groups or devices that
> are affected by the reset.  This might be multiple groups, so the ioctl
> quickly grows to requiring a list of file descriptors be passed for
> validation.
> 
> We already use the group file descriptor as a unit of ownership for
> enabling the container, so it seems like it would make sense to use it
> here too.  The alternative is a device file descriptor, but groups may
> encompass devices the user doesn't care to use and we don't want to
> require that they open a file descriptor simply to perform a hot reset.
> Groups can also contain devices that the user cannot open, for instance
> those owned by VFIO "compatible" drivers like pci-stub or pcieport.
> 
> The user also needs to know the set of devices affected by a hot reset,
> otherwise they have no idea which group file descriptors to pass to such
> an interface.  That implies we also need a separate "info" ioctl for the
> user to learn that information.  We could argue that the user could
> learn this information from sysfs, but that imposes non-trivial library
> or code overhead on the user to evaluate the topology.  The PCI hot
> reset info ioctl would need to indicate whether a hot reset is
> available, and the set of affected devices.  It may be useful to provide
> this as a {group, device} pair so the user doesn't need to
> cross-reference each device with sysfs to determine the group for the
> device.  This would then provide both the set of groups required to
> perform the hot reset and the set of devices affected by the hot reset.
> 
> As an alternative, we could consider simply requiring that all of the
> devices affected by a hot reset belong to the same VFIO container.
> However, allowing multiple groups per container is an optional IOMMU
> capability that really has no relation to PCI bus/slot boundaries.  It
> seems a bit arbitrary to require groups be placed in the same container
> to get a PCI hot reset.  That likely means we'd still need to support
> passing some kind of ownership token as above with groups.  So it
> doesn't seem to make the situation any better.
> 
> Given the above discussion, I therefore propose the following PCI hot
> reset interface:
> 
> /**
>  * VFIO_DEVICE_PCI_HOT_RESET_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + ??,
>  *                                        struct vfio_device_pci_hot_reset_info)
>  */
> 
> #define VFIO_DEVICE_PCI_HOT_RESET_INFO     _IO(VFIO_TYPE, VFIO_BASE + ??)
> 
> struct vfio_device_pci_hot_reset_info_entry {
> 	__u32	group_id;
> 	__u16	segment; /* A reset will never include devices on other segments... return it anyway */
> 	__u8	bus;
> 	__u8	devfn; /* Use PCI_SLOT/PCI_FUNC */
> };
> 
> struct vfio_device_pci_hot_reset_info {
> 	__u32	argsz;
> 	__u32	flags;
> #define VFIO_PCI_HOT_RESET_SUPPORTED	(1 << 0) /* Device supports hot reset */
> #define VFIO_PCI_HOT_RESET_POPULATED	(1 << 1) /* Entries field are populated */
> 	__u32	count;
> 	struct vfio_device_pci_hot_reset_info_entry	entries[];
> };
> 
> The user calls VFIO_DEVICE_PCI_HOT_RESET_INFO on a VFIO device file
> descriptor with a struct vfio_device_pci_hot_reset_info data structure,
> minimally the sizeof the struct with argsz set.  VFIO returns whether
> hot reset is supported and the number of devices affected by the reset.
> If argsz is big enough, VFIO will fill in the entries and set the
> populated flag, otherwise the caller can reallocate the structure and
> try again.
> 
> /**
>  * VFIO_DEVICE_PCI_HOT_RESET - _IOW(VFIO_TYPE, VFIO_BASE + ??,
>  *                                  struct vfio_device_pci_hot_reset)
>  */
> 
> #define VFIO_DEVICE_PCI_HOT_RESET     _IO(VFIO_TYPE, VFIO_BASE + ??)
> 
> struct vfio_device_pci_hot_reset {
> 	__u32	argsz;
> 	__u32	flags;
> 	__u32	count;
> 	__u32	fds[];
> };
> 
> As above, the user calls VFIO_DEVICE_PCI_HOT_RESET on a VFIO device file
> descriptor.  Using the list from PCI_HOT_RESET_INFO, the user allocates
> a struct vfio_device_pci_hot_reset of sufficient size to pass a list of
> VFIO group file descriptors.  There should be one file descriptor for
> each group listed in the info entries.  If the list of groups matches
> those affected by a hot reset of the device, then VFIO will perform the
> hot reset action and return success.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: RFC: vfio-pci API for PCI bus/slot (hot) resets
  2013-08-02  5:10 ` Benjamin Herrenschmidt
@ 2013-08-02 16:36   ` Alex Williamson
  2013-08-02 21:28     ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 8+ messages in thread
From: Alex Williamson @ 2013-08-02 16:36 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: kvm, linux-kernel, Alexey Kardashevskiy

On Fri, 2013-08-02 at 15:10 +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2013-08-01 at 16:18 -0600, Alex Williamson wrote:
> > vfio-pci needs to support an interface to do hot resets (PCI parent
> > bridge secondary bus reset).   We need this to support reset of
> > co-assigned devices where one or more of the devices does not support
> > function level reset.  In particular, discrete graphics cards typically
> > have no reset options other than doing a link reset.  
> 
> Link reset != hot reset but yeah I see your point. There is more too,
> there is fundamental reset which may or may not be something we can
> control for individual slots (ie, switch legs) depending on platform
> specific stuff...
> 
> > What I have below
> > is a bit awkward, so I welcome other ideas to accomplish this goal.
> > I've been using a "blind" interface based on all affected devices
> > belonging to the same VFIO container for current VGA testing.  This is
> > ok when all you want to do is VGA, but I'd really like to make use of
> > this any time a device doesn't support a function level reset.  I've
> > posted a series to the PCI list to add bus and slot reset interfaces to
> > PCI-core, this API is how we expose that through VFIO to a user.  Please
> > comment.  Thanks,
> 
> Also in some cases, at least for us, we have a problem where the scope
> of the reset can be larger than the group ... in this case I think we
> need to forbid the reset.

Interesting, I would have ventured to guess that resets are always
contained within a single group given the PE isolation domains rooted at
a PHB.  Why disallow it though?  What I propose below would still allow
it if the user can prove that they own all of the affected groups.  Even
on x86, I think it's too restrictive to only allow reset if the affect
is contained within a group.  It's just too easy to need it for a
multi-function device.  We could hope that a device that implements ACS
also implements some sort of FLR, but that may just be wishful thinking.

> For us, it's also tied into our "EEH" error handling. IE. The way the
> guest will request some of these things is going to be via firmware APIs
> that we have yet to implement, but that we were also thinking of
> implementing entire in the kernel rather than qemu for various
> reasons... IE. There is more to error handling & recovery in our case at
> least than AER and reset :-)

I've been focused on reset for device re-initialization and
repeatability, but that's a good point, we need to consider error
recovery as a use of this interface as well.

> I need to spend more time reading your proposal and see how it can work
> for us (or not...)... but we might end up just doing our own thing that
> carries the whole EEH API instead.

Obviously it would be great if it could work for you, but regardless of
whether you intend to use it I'd appreciate feedback.  Thanks,

Alex

> > ---
> > Mechanism to do PCI hot resets through VFIO:
> > 
> > VFIO is fundamentally an IOMMU group and device level interface.
> > There's no concept of buses, slots, or hierarchies of devices.  There
> > are only IOMMU group and devices.  A bus (or slot) may contain exactly
> > one IOMMU group, multiple IOMMU groups, or a portion of an IOMMU group.
> > An IOMMU group may contain one or more devices.
> > 
> > The first question is perhaps where should we create the interface to do
> > a PCI hot reset.  Assuming an ioctl interface, our choices are the
> > group, the container, or the device file descriptors.  Groups and
> > containers are not PCI specific, so an extension on either of those
> > doesn't make much sense.  They also don't have much granularity if your
> > goal is to do a hot reset on the smallest subset of devices you can.
> > Therefore the only choice seems to be a VFIO device level interface.
> > 
> > The fact that a hot reset affects multiple devices also raises concerns.
> > How do we make sure a user has sufficient access/privilege to perform
> > this operation?  If all of the affected devices are within the same
> > group, then we know the user already "owns" all those devices.  Using
> > groups as the boundary excludes a number of use cases though.  The user
> > would need to prove that they also own the other groups or devices that
> > are affected by the reset.  This might be multiple groups, so the ioctl
> > quickly grows to requiring a list of file descriptors be passed for
> > validation.
> > 
> > We already use the group file descriptor as a unit of ownership for
> > enabling the container, so it seems like it would make sense to use it
> > here too.  The alternative is a device file descriptor, but groups may
> > encompass devices the user doesn't care to use and we don't want to
> > require that they open a file descriptor simply to perform a hot reset.
> > Groups can also contain devices that the user cannot open, for instance
> > those owned by VFIO "compatible" drivers like pci-stub or pcieport.
> > 
> > The user also needs to know the set of devices affected by a hot reset,
> > otherwise they have no idea which group file descriptors to pass to such
> > an interface.  That implies we also need a separate "info" ioctl for the
> > user to learn that information.  We could argue that the user could
> > learn this information from sysfs, but that imposes non-trivial library
> > or code overhead on the user to evaluate the topology.  The PCI hot
> > reset info ioctl would need to indicate whether a hot reset is
> > available, and the set of affected devices.  It may be useful to provide
> > this as a {group, device} pair so the user doesn't need to
> > cross-reference each device with sysfs to determine the group for the
> > device.  This would then provide both the set of groups required to
> > perform the hot reset and the set of devices affected by the hot reset.
> > 
> > As an alternative, we could consider simply requiring that all of the
> > devices affected by a hot reset belong to the same VFIO container.
> > However, allowing multiple groups per container is an optional IOMMU
> > capability that really has no relation to PCI bus/slot boundaries.  It
> > seems a bit arbitrary to require groups be placed in the same container
> > to get a PCI hot reset.  That likely means we'd still need to support
> > passing some kind of ownership token as above with groups.  So it
> > doesn't seem to make the situation any better.
> > 
> > Given the above discussion, I therefore propose the following PCI hot
> > reset interface:
> > 
> > /**
> >  * VFIO_DEVICE_PCI_HOT_RESET_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + ??,
> >  *                                        struct vfio_device_pci_hot_reset_info)
> >  */
> > 
> > #define VFIO_DEVICE_PCI_HOT_RESET_INFO     _IO(VFIO_TYPE, VFIO_BASE + ??)
> > 
> > struct vfio_device_pci_hot_reset_info_entry {
> > 	__u32	group_id;
> > 	__u16	segment; /* A reset will never include devices on other segments... return it anyway */
> > 	__u8	bus;
> > 	__u8	devfn; /* Use PCI_SLOT/PCI_FUNC */
> > };
> > 
> > struct vfio_device_pci_hot_reset_info {
> > 	__u32	argsz;
> > 	__u32	flags;
> > #define VFIO_PCI_HOT_RESET_SUPPORTED	(1 << 0) /* Device supports hot reset */
> > #define VFIO_PCI_HOT_RESET_POPULATED	(1 << 1) /* Entries field are populated */
> > 	__u32	count;
> > 	struct vfio_device_pci_hot_reset_info_entry	entries[];
> > };
> > 
> > The user calls VFIO_DEVICE_PCI_HOT_RESET_INFO on a VFIO device file
> > descriptor with a struct vfio_device_pci_hot_reset_info data structure,
> > minimally the sizeof the struct with argsz set.  VFIO returns whether
> > hot reset is supported and the number of devices affected by the reset.
> > If argsz is big enough, VFIO will fill in the entries and set the
> > populated flag, otherwise the caller can reallocate the structure and
> > try again.
> > 
> > /**
> >  * VFIO_DEVICE_PCI_HOT_RESET - _IOW(VFIO_TYPE, VFIO_BASE + ??,
> >  *                                  struct vfio_device_pci_hot_reset)
> >  */
> > 
> > #define VFIO_DEVICE_PCI_HOT_RESET     _IO(VFIO_TYPE, VFIO_BASE + ??)
> > 
> > struct vfio_device_pci_hot_reset {
> > 	__u32	argsz;
> > 	__u32	flags;
> > 	__u32	count;
> > 	__u32	fds[];
> > };
> > 
> > As above, the user calls VFIO_DEVICE_PCI_HOT_RESET on a VFIO device file
> > descriptor.  Using the list from PCI_HOT_RESET_INFO, the user allocates
> > a struct vfio_device_pci_hot_reset of sufficient size to pass a list of
> > VFIO group file descriptors.  There should be one file descriptor for
> > each group listed in the info entries.  If the list of groups matches
> > those affected by a hot reset of the device, then VFIO will perform the
> > hot reset action and return success.
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 




^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: RFC: vfio-pci API for PCI bus/slot (hot) resets
  2013-08-02 16:36   ` Alex Williamson
@ 2013-08-02 21:28     ` Benjamin Herrenschmidt
  2013-08-02 22:49       ` Bjorn Helgaas
  0 siblings, 1 reply; 8+ messages in thread
From: Benjamin Herrenschmidt @ 2013-08-02 21:28 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, linux-kernel, Alexey Kardashevskiy, Gavin Shan

On Fri, 2013-08-02 at 10:36 -0600, Alex Williamson wrote:

> > Also in some cases, at least for us, we have a problem where the scope
> > of the reset can be larger than the group ... in this case I think we
> > need to forbid the reset.
> 
> Interesting, I would have ventured to guess that resets are always
> contained within a single group given the PE isolation domains rooted at
> a PHB. 

Depends how I configure them. For example, if I assign two functions to
two different PEs (SR-IOV typically), a hot reset will kill both.

For fundamental reset, depending on how the mobo is wired up and other
platform things, I might for example be limited on doing a PERST at the
PHB level, thus accross multiple PEs.

>  Why disallow it though?  What I propose below would still allow
> it if the user can prove that they own all of the affected groups.

Right, I mean disallow it if the PEs are not all owned by the same user.
What you propose would probably work.

>   Even
> on x86, I think it's too restrictive to only allow reset if the affect
> is contained within a group.  It's just too easy to need it for a
> multi-function device.  We could hope that a device that implements ACS
> also implements some sort of FLR, but that may just be wishful thinking.

Right. Another use case is, I know of devices that need a fundamental
reset (PERST) after applying a FW update.

> > For us, it's also tied into our "EEH" error handling. IE. The way the
> > guest will request some of these things is going to be via firmware APIs
> > that we have yet to implement, but that we were also thinking of
> > implementing entire in the kernel rather than qemu for various
> > reasons... IE. There is more to error handling & recovery in our case at
> > least than AER and reset :-)
> 
> I've been focused on reset for device re-initialization and
> repeatability, but that's a good point, we need to consider error
> recovery as a use of this interface as well.

Unfortunately, with EEH we have very specific FW interface (including
low level diagnostic data "blobs" etc...) that we need to implement so I
am not sure we can "fit" it in a generic interface.

We might want to attempt to list our interfaces and see. CC'ing Gavin
who is our EEH expert.

A rough cut is

 - We have the concept of "frozen" PE (frozen MMIO and frozen DMA, two
separate attributes). So we can query and change the freeze state on a
PE.

 - We have various level of error severity (from informational
(corrected) to full PHB need a reset with some variations in the middle
of PEs being frozen etc...)

 - We have diag blobs coming out of FW for the guest to log (and
possibly partially decode).

 - We have interfaces for various types of resets

> > I need to spend more time reading your proposal and see how it can work
> > for us (or not...)... but we might end up just doing our own thing that
> > carries the whole EEH API instead.
> 
> Obviously it would be great if it could work for you, but regardless of
> whether you intend to use it I'd appreciate feedback.  Thanks,

Ben.

> Alex
> 
> > > ---
> > > Mechanism to do PCI hot resets through VFIO:
> > > 
> > > VFIO is fundamentally an IOMMU group and device level interface.
> > > There's no concept of buses, slots, or hierarchies of devices.  There
> > > are only IOMMU group and devices.  A bus (or slot) may contain exactly
> > > one IOMMU group, multiple IOMMU groups, or a portion of an IOMMU group.
> > > An IOMMU group may contain one or more devices.
> > > 
> > > The first question is perhaps where should we create the interface to do
> > > a PCI hot reset.  Assuming an ioctl interface, our choices are the
> > > group, the container, or the device file descriptors.  Groups and
> > > containers are not PCI specific, so an extension on either of those
> > > doesn't make much sense.  They also don't have much granularity if your
> > > goal is to do a hot reset on the smallest subset of devices you can.
> > > Therefore the only choice seems to be a VFIO device level interface.
> > > 
> > > The fact that a hot reset affects multiple devices also raises concerns.
> > > How do we make sure a user has sufficient access/privilege to perform
> > > this operation?  If all of the affected devices are within the same
> > > group, then we know the user already "owns" all those devices.  Using
> > > groups as the boundary excludes a number of use cases though.  The user
> > > would need to prove that they also own the other groups or devices that
> > > are affected by the reset.  This might be multiple groups, so the ioctl
> > > quickly grows to requiring a list of file descriptors be passed for
> > > validation.
> > > 
> > > We already use the group file descriptor as a unit of ownership for
> > > enabling the container, so it seems like it would make sense to use it
> > > here too.  The alternative is a device file descriptor, but groups may
> > > encompass devices the user doesn't care to use and we don't want to
> > > require that they open a file descriptor simply to perform a hot reset.
> > > Groups can also contain devices that the user cannot open, for instance
> > > those owned by VFIO "compatible" drivers like pci-stub or pcieport.
> > > 
> > > The user also needs to know the set of devices affected by a hot reset,
> > > otherwise they have no idea which group file descriptors to pass to such
> > > an interface.  That implies we also need a separate "info" ioctl for the
> > > user to learn that information.  We could argue that the user could
> > > learn this information from sysfs, but that imposes non-trivial library
> > > or code overhead on the user to evaluate the topology.  The PCI hot
> > > reset info ioctl would need to indicate whether a hot reset is
> > > available, and the set of affected devices.  It may be useful to provide
> > > this as a {group, device} pair so the user doesn't need to
> > > cross-reference each device with sysfs to determine the group for the
> > > device.  This would then provide both the set of groups required to
> > > perform the hot reset and the set of devices affected by the hot reset.
> > > 
> > > As an alternative, we could consider simply requiring that all of the
> > > devices affected by a hot reset belong to the same VFIO container.
> > > However, allowing multiple groups per container is an optional IOMMU
> > > capability that really has no relation to PCI bus/slot boundaries.  It
> > > seems a bit arbitrary to require groups be placed in the same container
> > > to get a PCI hot reset.  That likely means we'd still need to support
> > > passing some kind of ownership token as above with groups.  So it
> > > doesn't seem to make the situation any better.
> > > 
> > > Given the above discussion, I therefore propose the following PCI hot
> > > reset interface:
> > > 
> > > /**
> > >  * VFIO_DEVICE_PCI_HOT_RESET_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + ??,
> > >  *                                        struct vfio_device_pci_hot_reset_info)
> > >  */
> > > 
> > > #define VFIO_DEVICE_PCI_HOT_RESET_INFO     _IO(VFIO_TYPE, VFIO_BASE + ??)
> > > 
> > > struct vfio_device_pci_hot_reset_info_entry {
> > > 	__u32	group_id;
> > > 	__u16	segment; /* A reset will never include devices on other segments... return it anyway */
> > > 	__u8	bus;
> > > 	__u8	devfn; /* Use PCI_SLOT/PCI_FUNC */
> > > };
> > > 
> > > struct vfio_device_pci_hot_reset_info {
> > > 	__u32	argsz;
> > > 	__u32	flags;
> > > #define VFIO_PCI_HOT_RESET_SUPPORTED	(1 << 0) /* Device supports hot reset */
> > > #define VFIO_PCI_HOT_RESET_POPULATED	(1 << 1) /* Entries field are populated */
> > > 	__u32	count;
> > > 	struct vfio_device_pci_hot_reset_info_entry	entries[];
> > > };
> > > 
> > > The user calls VFIO_DEVICE_PCI_HOT_RESET_INFO on a VFIO device file
> > > descriptor with a struct vfio_device_pci_hot_reset_info data structure,
> > > minimally the sizeof the struct with argsz set.  VFIO returns whether
> > > hot reset is supported and the number of devices affected by the reset.
> > > If argsz is big enough, VFIO will fill in the entries and set the
> > > populated flag, otherwise the caller can reallocate the structure and
> > > try again.
> > > 
> > > /**
> > >  * VFIO_DEVICE_PCI_HOT_RESET - _IOW(VFIO_TYPE, VFIO_BASE + ??,
> > >  *                                  struct vfio_device_pci_hot_reset)
> > >  */
> > > 
> > > #define VFIO_DEVICE_PCI_HOT_RESET     _IO(VFIO_TYPE, VFIO_BASE + ??)
> > > 
> > > struct vfio_device_pci_hot_reset {
> > > 	__u32	argsz;
> > > 	__u32	flags;
> > > 	__u32	count;
> > > 	__u32	fds[];
> > > };
> > > 
> > > As above, the user calls VFIO_DEVICE_PCI_HOT_RESET on a VFIO device file
> > > descriptor.  Using the list from PCI_HOT_RESET_INFO, the user allocates
> > > a struct vfio_device_pci_hot_reset of sufficient size to pass a list of
> > > VFIO group file descriptors.  There should be one file descriptor for
> > > each group listed in the info entries.  If the list of groups matches
> > > those affected by a hot reset of the device, then VFIO will perform the
> > > hot reset action and return success.
> > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> > 
> 
> 



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: RFC: vfio-pci API for PCI bus/slot (hot) resets
  2013-08-02 21:28     ` Benjamin Herrenschmidt
@ 2013-08-02 22:49       ` Bjorn Helgaas
  2013-08-02 23:15         ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 8+ messages in thread
From: Bjorn Helgaas @ 2013-08-02 22:49 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alex Williamson, kvm, linux-kernel, Alexey Kardashevskiy,
	Gavin Shan, linux-pci

[+cc linux-pci]

On Fri, Aug 2, 2013 at 3:28 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:

> Right. Another use case is, I know of devices that need a fundamental
> reset (PERST) after applying a FW update.

This is a tangent from the real discussion here, but the question of
resetting a device after a firmware update concerns me.  Many if not
all of our current reset interfaces save & restore the architected
parts of config space around the reset.  But a reset after a firmware
update may change things like the number and type of BARs or even the
functionality of the device, so I don't think the restore is safe in
general.

I doubt this is a big problem in general, but I have found reports of
people having to do a system reset or reboot after updating, e.g.,
FPGA images.  I suppose at least some of these could be worked around
with the right hotplug incantations.

http://forums.xilinx.com/t5/PCI-Express/PC-is-hung-if-run-the-second-program-FPGA-by-JTAG/td-p/20871
http://www.alteraforum.com/forum/archive/index.php/t-28477.html
http://www.alteraforum.com/forum/showthread.php?t=35091
https://github.com/NetFPGA/NetFPGA-public/wiki/Restore-PCIE-Configuration
http://lkml.indiana.edu/hypermail/linux/kernel/1104.3/02544.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: RFC: vfio-pci API for PCI bus/slot (hot) resets
  2013-08-02 22:49       ` Bjorn Helgaas
@ 2013-08-02 23:15         ` Benjamin Herrenschmidt
  2013-08-02 23:37           ` Alex Williamson
  0 siblings, 1 reply; 8+ messages in thread
From: Benjamin Herrenschmidt @ 2013-08-02 23:15 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Alex Williamson, kvm, linux-kernel, Alexey Kardashevskiy,
	Gavin Shan, linux-pci

On Fri, 2013-08-02 at 16:49 -0600, Bjorn Helgaas wrote:
> [+cc linux-pci]
> 
> On Fri, Aug 2, 2013 at 3:28 PM, Benjamin Herrenschmidt
> <benh@kernel.crashing.org> wrote:
> 
> > Right. Another use case is, I know of devices that need a fundamental
> > reset (PERST) after applying a FW update.
> 
> This is a tangent from the real discussion here, but the question of
> resetting a device after a firmware update concerns me.  Many if not
> all of our current reset interfaces save & restore the architected
> parts of config space around the reset.  But a reset after a firmware
> update may change things like the number and type of BARs or even the
> functionality of the device, so I don't think the restore is safe in
> general.

Right.

> I doubt this is a big problem in general, but I have found reports of
> people having to do a system reset or reboot after updating, e.g.,
> FPGA images.  I suppose at least some of these could be worked around
> with the right hotplug incantations.

Yes.

We have that similar issue with error handling, when the driver doesn't
have the right hooks, we simulate an unplug, reset, then replug.

Maybe we could provide generic helpers to do that...

Cheers,
Ben.

> http://forums.xilinx.com/t5/PCI-Express/PC-is-hung-if-run-the-second-program-FPGA-by-JTAG/td-p/20871
> http://www.alteraforum.com/forum/archive/index.php/t-28477.html
> http://www.alteraforum.com/forum/showthread.php?t=35091
> https://github.com/NetFPGA/NetFPGA-public/wiki/Restore-PCIE-Configuration
> http://lkml.indiana.edu/hypermail/linux/kernel/1104.3/02544.html
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: RFC: vfio-pci API for PCI bus/slot (hot) resets
  2013-08-02 23:15         ` Benjamin Herrenschmidt
@ 2013-08-02 23:37           ` Alex Williamson
  2013-08-02 23:56             ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 8+ messages in thread
From: Alex Williamson @ 2013-08-02 23:37 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Bjorn Helgaas, kvm, linux-kernel, Alexey Kardashevskiy,
	Gavin Shan, linux-pci

On Sat, 2013-08-03 at 09:15 +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2013-08-02 at 16:49 -0600, Bjorn Helgaas wrote:
> > [+cc linux-pci]
> > 
> > On Fri, Aug 2, 2013 at 3:28 PM, Benjamin Herrenschmidt
> > <benh@kernel.crashing.org> wrote:
> > 
> > > Right. Another use case is, I know of devices that need a fundamental
> > > reset (PERST) after applying a FW update.
> > 
> > This is a tangent from the real discussion here, but the question of
> > resetting a device after a firmware update concerns me.  Many if not
> > all of our current reset interfaces save & restore the architected
> > parts of config space around the reset.  But a reset after a firmware
> > update may change things like the number and type of BARs or even the
> > functionality of the device, so I don't think the restore is safe in
> > general.
> 
> Right.
> 
> > I doubt this is a big problem in general, but I have found reports of
> > people having to do a system reset or reboot after updating, e.g.,
> > FPGA images.  I suppose at least some of these could be worked around
> > with the right hotplug incantations.
> 
> Yes.
> 
> We have that similar issue with error handling, when the driver doesn't
> have the right hooks, we simulate an unplug, reset, then replug.
> 
> Maybe we could provide generic helpers to do that...

Devices going away and coming back is pretty difficult for vfio to
handle.  Perhaps helpers to rescan a device in-place would be easier.
On the QEMU side we'd need to rescan the device after each reset, which
would be rather tedious for the typical case where it doesn't change.
Thanks,

Alex


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: RFC: vfio-pci API for PCI bus/slot (hot) resets
  2013-08-02 23:37           ` Alex Williamson
@ 2013-08-02 23:56             ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 8+ messages in thread
From: Benjamin Herrenschmidt @ 2013-08-02 23:56 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Bjorn Helgaas, kvm, linux-kernel, Alexey Kardashevskiy,
	Gavin Shan, linux-pci

On Fri, 2013-08-02 at 17:37 -0600, Alex Williamson wrote:

> > Yes.
> > 
> > We have that similar issue with error handling, when the driver doesn't
> > have the right hooks, we simulate an unplug, reset, then replug.
> > 
> > Maybe we could provide generic helpers to do that...
> 
> Devices going away and coming back is pretty difficult for vfio to
> handle.  Perhaps helpers to rescan a device in-place would be easier.

Well, in the error handling case (and *maybe* in the "FW update" case)
we need to unbind and rebind the driver as well. The whole point is that
we have to do that because the driver doesn't have the right hooks.

In the VFIO case, we will have to implement something here so that the
VFIO driver stub doesn't get handled that treatment :-) We'll probably
need some arch specific stuff in vfio_pci unfortunately, so that the
errors get forwarded to the guest via our EEH interfaces, and let the
guest handle it's error handling.

Of course that leaves an interesting problem as to what state the device
is in when it comes back to the host ...

This is made a LOT more complicated in the VFIO model than it is in the
"pHyp" model (our proprietary hypervisor).

Under pHyp, the PE doesn't have a concept of being used outside of a
guest, and it always reset before being assigned/unassigned. The guest
can mess around all it wants (bus numbers, BARs, etc...) it can only
hurt itself. The hypervisor doesn't keep track of any of that.

> On the QEMU side we'd need to rescan the device after each reset, which
> would be rather tedious for the typical case where it doesn't change.

This is a direct consequence of your model :-) It makes things more
complex for us as well, but we have to bite that bullet now. Maybe we
can consider an alternate/simpler model in the future, more akin to what
pHyp does, where we "unplug" the device from the host when assigning it
to a guest (and the whole hierarchy below it if it's a bridge) and let
the guest do what it wants with it. 

Doing so means we no longer have to emulate/filter config space
(provided your HW handles MSI virtualization properly), care about bus
numbers, BARs, etc... nor do we need to keep track of a lot of this in
qemu. All we need is reset the whole & lot and re-probe the bus leg when
returning the devices to the host.
 
Cheers,
Ben.

> Thanks,
> 
> Alex
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2013-08-02 23:56 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-08-01 22:18 RFC: vfio-pci API for PCI bus/slot (hot) resets Alex Williamson
2013-08-02  5:10 ` Benjamin Herrenschmidt
2013-08-02 16:36   ` Alex Williamson
2013-08-02 21:28     ` Benjamin Herrenschmidt
2013-08-02 22:49       ` Bjorn Helgaas
2013-08-02 23:15         ` Benjamin Herrenschmidt
2013-08-02 23:37           ` Alex Williamson
2013-08-02 23:56             ` Benjamin Herrenschmidt

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.