* kvm PCI assignment & VFIO ramblings
@ 2011-07-29 23:58 ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 322+ messages in thread
From: Benjamin Herrenschmidt @ 2011-07-29 23:58 UTC (permalink / raw)
  To: kvm
  Cc: Anthony Liguori, Alex Williamson, David Gibson, Paul Mackerras,
	Alexey Kardashevskiy, linux-pci, linuxppc-dev

Hi folks !

So I promised Anthony I would try to summarize some of the comments &
issues we have vs. VFIO after we've tried to use it for PCI pass-through
on POWER. It's pretty long, there are various items with more or less
impact, some of it is easily fixable, some are API issues, and we'll
probably want to discuss them separately, but for now here's a brain
dump.

David, Alexey, please make sure I haven't missed anything :-)

* Granularity of pass-through

So let's first start with what is probably the main issue and the most
contentious, which is the problem of dealing with the various
constraints which define the granularity of pass-through, along with
exploiting features like the VTd iommu domains.

For the sake of clarity, let me first talk a bit about the "granularity"
issue I've mentioned above.

There are various constraints that can/will force several devices to be
"owned" by the same guest and on the same side of the host/guest
boundary. This is generally because some kind of HW resource is shared
and thus not doing so would break the isolation barrier and enable a
guest to disrupt the operations of the host and/or another guest.

Some of those constraints are well known, such as shared interrupts. Some
are more subtle: for example, if a PCIe->PCI bridge exists in the system,
there is no way for the iommu to identify transactions from devices
coming from the PCI segment of that bridge with a granularity other than
"behind the bridge". So typically an EHCI/OHCI/OHCI combo (a classic)
behind such a bridge must be treated as a single "entity" for
pass-through purposes.
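To make that concrete, here's a tiny sketch (hypothetical data model, nothing like a real kernel API) of how the iommu's inability to tell devices behind the bridge apart collapses them into one group:

```python
# Hypothetical sketch: devices whose DMA the iommu cannot distinguish
# (e.g. everything behind a PCIe-to-PCI bridge shares the bridge's
# requester ID) must land in the same pass-through group.

def build_groups(devices):
    """devices: list of (name, isolation_domain) pairs, where
    isolation_domain is whatever the iommu can actually tell apart:
    the device itself, or the bridge it sits behind."""
    groups = {}
    for name, domain in devices:
        groups.setdefault(domain, []).append(name)
    return groups

topology = [
    ("ehci", "pcie-to-pci-bridge"),   # all three share the bridge's RID
    ("ohci0", "pcie-to-pci-bridge"),
    ("ohci1", "pcie-to-pci-bridge"),
    ("e1000", "e1000"),               # plain PCIe endpoint: its own RID
]

groups = build_groups(topology)
# the USB combo is one group, the NIC is its own group
```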

In IBM POWER land, we call this a "partitionable endpoint" (the term
"endpoint" here is historic, such a PE can be made of several PCIe
"endpoints"). I think "partitionable" is a pretty good name tho to
represent the constraints, so I'll call this a "partitionable group"
from now on. 

Other examples of such HW imposed constraints can be a shared iommu with
no filtering capability (some older POWER hardware which we might want
to support falls into that category: each PCI host bridge is its own
domain but doesn't have a finer granularity... however those machines
tend to have a lot of host bridges :)

If we are ever going to consider applying some of this to non-PCI
devices (see the ongoing discussions here), then we will be faced with
the craziness of embedded designers, which probably means all sorts of
new constraints we can't even begin to think about.

This leads me to those initial conclusions:

- The -minimum- granularity of pass-through is not always a single
device and not always under SW control

- Having a magic heuristic in libvirt to figure out those constraints is
WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
knowledge of PCI resource management and getting it wrong in many many
cases, something that took years to fix essentially by ripping it all
out. This is kernel knowledge and thus we need the kernel to expose one
way or another what those constraints are, what those "partitionable
groups" are.

- That does -not- mean that we cannot specify for each individual device
within such a group where we want to put it in qemu (what devfn etc...).
As long as there is a clear understanding that the "ownership" of the
device goes with the group, this is somewhat orthogonal to how they are
represented in qemu. (Not completely... if the iommu is exposed to the
guest, via paravirt for example, some of these constraints must be
exposed, but I'll talk about that more later).

The interface currently proposed for VFIO (and associated uiommu)
doesn't handle that problem at all. Instead, it is entirely centered
around a specific "feature" of the VTd iommu's for creating arbitrary
domains with arbitrary devices (tho those devices -do- have the same
constraints exposed above, don't try to put 2 legacy PCI devices behind
the same bridge into 2 different domains !), but the API totally ignores
the problem, leaves it to libvirt "magic foo" and focuses on something
that is both quite secondary in the grand scheme of things, and quite
x86 VTd specific in the implementation and API definition.

Now, I'm not saying these programmable iommu domains aren't a nice
feature and that we shouldn't exploit them when available, but as it is,
it is too much a central part of the API.

I'll talk a little bit more about recent POWER iommu's here to
illustrate where I'm coming from with my idea of groups:

On p7ioc (the IO chip used on recent P7 machines), there -is- a concept
of domain and a per-RID filtering. However it differs from VTd in a few
ways:

The "domains" (aka PEs) encompass more than just an iommu filtering
scheme. The MMIO space and PIO space are also segmented, and those
segments assigned to domains. Interrupts (well, MSI ports at least) are
assigned to domains. Inbound PCIe error messages are targeted to
domains, etc...

Basically, the PEs provide a very strong isolation feature which
includes errors, and has the ability to immediately "isolate" a PE on
the first occurrence of an error. For example, if an inbound PCIe error
is signaled by a device on a PE or such a device does a DMA to a
non-authorized address, the whole PE gets into error state. All
subsequent stores (both DMA and MMIO) are swallowed and reads return all
1's, interrupts are blocked. This is designed to prevent any propagation
of bad data, which is a very important feature in large high reliability
systems.

Software then has the ability to selectively turn back on MMIO and/or
DMA, perform diagnostics, reset devices etc...

Because the domains encompass more than just DMA, but also segment the
MMIO space, it is not practical at all to dynamically reconfigure them
at runtime to "move" devices into domains. The firmware or early kernel
code (it depends) will assign devices BARs using an algorithm that keeps
them within PE segment boundaries, etc....
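A sketch of that BAR placement constraint, with made-up numbers (the segment size and the allocator itself are purely illustrative, not what our firmware actually does):

```python
# Illustrative sketch: MMIO space is cut into fixed segments, each owned
# by at most one PE, so a BAR must never cross a segment boundary and
# two PEs must never share a segment.

SEGMENT_SIZE = 0x10000000  # 256MB, made-up figure for illustration

def assign_bar(next_free, bar_size, pe_of_segment, pe):
    """Place a naturally-aligned BAR of bar_size at or after next_free,
    bumping to the next segment if it would cross a boundary or land in
    a segment already owned by a different PE."""
    addr = (next_free + bar_size - 1) & ~(bar_size - 1)  # natural alignment
    while True:
        seg = addr // SEGMENT_SIZE
        crosses = (addr + bar_size - 1) // SEGMENT_SIZE != seg
        owner = pe_of_segment.get(seg)
        if not crosses and owner in (None, pe):
            pe_of_segment[seg] = pe
            return addr
        # retry from the next segment boundary, re-aligned
        addr = (seg + 1) * SEGMENT_SIZE
        addr = (addr + bar_size - 1) & ~(bar_size - 1)

pe_of_segment = {}
bar0 = assign_bar(0x0, 0x1000, pe_of_segment, pe=1)
# a different PE can't share segment 0, so its BAR gets pushed out
bar1 = assign_bar(bar0 + 0x1000, 0x1000, pe_of_segment, pe=2)
```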

Additionally (and this is indeed a "restriction" compared to VTd, though
I expect our future IO chips to lift it to some extent), PEs don't get
separate DMA address spaces. There is one 64-bit DMA address space per
PCI host bridge, and it is 'segmented' with each segment being assigned
to a PE. Due to the way PE assignment works in hardware, it is not
practical to make several devices share a segment unless they are on the
same bus. Also the resulting limit in the amount of 32-bit DMA space a
device can access means that it's impractical to put too many devices in
a PE anyways. (This is clearly designed for paravirt iommu, I'll talk
more about that later).
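With made-up numbers, that segmentation looks something like this (purely illustrative, not our real window sizes):

```python
# Illustrative sketch: one DMA address space per PCI host bridge,
# chopped into equal segments, each segment being the only DMA window
# its PE ever sees.

def dma_window(pe_index, total_space, num_segments):
    """Return (base, size) of the DMA window assigned to a PE."""
    size = total_space // num_segments
    return (pe_index * size, size)

# e.g. 4GB of 32-bit DMA space shared by 16 PEs leaves each PE a mere
# 256MB window -- hence cramming many devices into a PE is impractical.
base, size = dma_window(3, 4 << 30, 16)
```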

The above essentially extends the granularity requirement (or rather is
another factor defining what the granularity of partitionable entities
is). You can think of it as "pre-existing" domains.

I believe the way to solve that is to introduce a kernel interface to
expose those "partitionable entities" to userspace. In addition, it
occurs to me that the ability to manipulate VTd domains essentially
boils down to manipulating those groups (creating larger ones with
individual components).

I like the idea of defining / playing with those groups statically
(using a command line tool or sysfs, possibly having a config file
defining them in a persistent way) rather than having their lifetime
tied to a uiommu file descriptor.

It also makes it a LOT easier to have a channel to manipulate
platform/arch specific attributes of those domains if any.

So we could define an API or representation in sysfs that exposes what
the partitionable entities are, and we may add to it an API to
manipulate them. But we don't have to and I'm happy to keep the
additional SW grouping you can do on VTd as a separate "add-on" API
(tho I don't like at all the way it works with uiommu). However, qemu
needs to know what the grouping is regardless of the domains, and it's
not nice if it has to manipulate two different concepts here so
eventually those "partitionable entities" from a qemu standpoint must
look like domains.

My main point is that I don't want the "knowledge" here to be in libvirt
or qemu. In fact, I want to be able to do something as simple as passing
a reference to a PE to qemu (sysfs path ?) and have it just pickup all
the devices in there and expose them to the guest.

This can be done in a way that isn't PCI specific as well (the
definition of the groups and what is grouped would obviously be
somewhat bus specific and handled by platform code in the kernel).

Maybe something like /sys/devgroups ? This probably warrants involving
more kernel people into the discussion.

* IOMMU

Now more on iommu. I think I've described in enough detail how ours
works; there are others, I don't know what Freescale or ARM are doing,
sparc doesn't quite work like VTd either, etc...

The main problem isn't that much the mechanics of the iommu but really
how it's exposed (or not) to guests.

VFIO here is basically designed for one and only one thing: expose the
entire guest physical address space to the device more/less 1:1.

This means:

  - It only works with iommu's that provide complete DMA address spaces
to devices. Won't work with a single 'segmented' address space like we
have on POWER.

  - It requires the guest to be pinned. Pass-through -> no more swap

  - The guest cannot make use of the iommu to deal with 32-bit DMA
devices, thus with a guest that has more than a few G of RAM (I don't
know the exact limit on x86, it depends on your IO hole I suppose), you
end up back with swiotlb & bounce buffering.

  - It doesn't work for POWER server anyways because of our need to
provide a paravirt iommu interface to the guest since that's how pHyp
works today and how existing OSes expect to operate.

Now some of this can be fixed with tweaks, and we've started doing it
(we have a working pass-through using VFIO, forgot to mention that, it's
just that we don't like what we had to do to get there).

Basically, what we do today is:

- We add an ioctl to VFIO to expose the segment information to qemu,
i.e. what the DMA address and size of the DMA "window" usable for a
given device are. This is a tweak that should really be handled at the
"domain" level.
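Sketching how qemu might consume that window info (the class and names are made up for illustration, not the actual ioctl or its struct):

```python
# Illustrative sketch: once qemu knows the DMA window for a device, any
# guest map request has to be validated against (or clamped to) it.

class DmaWindow:
    def __init__(self, start, size):
        self.start = start
        self.size = size

    def contains(self, iova, length):
        """A map request is only usable if it fits inside the window."""
        return self.start <= iova and iova + length <= self.start + self.size

win = DmaWindow(start=0x0, size=1 << 30)   # say, a 1GB window
ok = win.contains(0x1000, 0x1000)          # fits
bad = win.contains(1 << 30, 0x1000)        # past the end of the window
```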

That current hack won't work well if two devices share an iommu. Note
that we have an additional constraint here due to our paravirt
interfaces (specified in PAPR) which is that PE domains must have a
common parent. Basically, pHyp makes them look like a PCIe host bridge
per domain in the guest. I think that's a pretty good idea and qemu
might want to do the same.

- We hack out the currently unconditional mapping of the entire guest
space in the iommu. Something will have to be done to "decide" whether
to do that or not ... qemu argument -> ioctl ?

- We hook up the paravirt call to insert/remove a translation from the
iommu to the VFIO map/unmap ioctl's.
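That hookup boils down to turning each guest TCE update into a single-page map or unmap. An illustrative sketch (put_tce and the callback signatures are invented for the example, not the real PAPR hypercall or VFIO entry points):

```python
IOMMU_PAGE = 4096

def put_tce(iommu_map, iommu_unmap, ioba, tce):
    """Sketch: one guest H_PUT_TCE-style call becomes one map/unmap of
    a single iommu page. The low bits of the tce carry the read/write
    permissions; zero permissions mean 'unmap'."""
    perms = tce & 0x3
    real_addr = tce & ~0xfff
    if perms:
        iommu_map(ioba, real_addr, IOMMU_PAGE, perms)
    else:
        iommu_unmap(ioba, IOMMU_PAGE)

calls = []
map_fn = lambda *a: calls.append(("map",) + a)
unmap_fn = lambda *a: calls.append(("unmap",) + a)
put_tce(map_fn, unmap_fn, ioba=0x2000, tce=0xdead0003)  # map, RW
put_tce(map_fn, unmap_fn, ioba=0x2000, tce=0)           # clear the entry
```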

This limps along but it's not great. Some of the problems are:

- I've already mentioned, the domain problem again :-) 

- Performance sucks of course, the vfio map ioctl wasn't meant for that
and has quite a bit of overhead. However we'll want to do the paravirt
call directly in the kernel eventually ...

  - ... from which it isn't trivial to get back to our underlying arch
specific iommu object. We'll probably need a set of arch specific
"sideband" ioctl's to "register" our paravirt iommu "bus numbers" and
link them to the real thing kernel-side.

- PAPR (the specification of our paravirt interface and the expectation
of current OSes) wants iommu pages to be 4k by default, regardless of
the kernel host page size, which makes things a bit tricky since our
enterprise host kernels have a 64k base page size. Additionally, we have
new PAPR interfaces that we want to exploit, to allow the guest to
create secondary iommu segments (in 64-bit space), which can be used
(under guest control) to do things like map the entire guest (here it
is :-) or use larger iommu page sizes (if permitted by the host kernel,
in our case we could allow 64k iommu page size with a 64k host kernel).
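The 4k-iommu-page-on-64k-host-page mismatch is just address math; a sketch with the page sizes above (illustrative only):

```python
HOST_PAGE = 64 * 1024
IOMMU_PAGE = 4 * 1024

def tces_for_host_page(host_pfn):
    """A single 64k host page backs 16 consecutive 4k iommu entries;
    this sketches the address math only."""
    base = host_pfn * HOST_PAGE
    return [base + i * IOMMU_PAGE for i in range(HOST_PAGE // IOMMU_PAGE)]

entries = tces_for_host_page(2)
# 16 entries covering real addresses 0x20000..0x2f000
```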

The above means we need arch specific APIs. So arch specific vfio
ioctl's, either that or kvm ones going to vfio or something ... the
current structure of vfio/kvm interaction doesn't make it easy.

* IO space

On most (if not all) non-x86 archs, each PCI host bridge provides a
completely separate PCI address space. Qemu doesn't deal with that very
well. For MMIO it can be handled since those PCI address spaces are
"remapped" holes in the main CPU address space so devices can be
registered by using BAR + offset of that window in qemu MMIO mapping.
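The offset trick is plain address arithmetic; a sketch with made-up window addresses:

```python
# Sketch of the "BAR + offset" trick: each host bridge's PCI memory
# space appears as a window in the CPU address space, so a device BAR
# registers in qemu at window_cpu_base + (bar - window_pci_base).
# All numbers below are made up for illustration.

def cpu_addr_for_bar(bar_addr, window_pci_base, window_cpu_base):
    return window_cpu_base + (bar_addr - window_pci_base)

# bridge 0: PCI space 0x8000_0000.. remapped at CPU 0x6_0000_0000..
addr = cpu_addr_for_bar(0x80100000,
                        window_pci_base=0x80000000,
                        window_cpu_base=0x600000000)
```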

For PIO things get nasty. We have totally separate PIO spaces and qemu
doesn't seem to like that. We can try to play the offset trick as well,
we haven't tried yet, but basically that's another one to fix. Not a
huge deal I suppose but heh ...

Also our next generation chipset may drop support for PIO completely.

On the other hand, because PIO is just a special range of MMIO for us,
we can do normal pass-through on it and don't need any of the emulation
done by qemu.

  * MMIO constraints

The QEMU side VFIO code hard wires various constraints that are entirely
based on various requirements you decided you have on x86 but don't
necessarily apply to us :-)

Due to our paravirt nature, we don't need to masquerade the MSI-X table
for example. At all. If the guest configures crap into it, too bad, it
can only shoot itself in the foot since the host bridge enforces
validation anyways as I explained earlier. Because it's all paravirt, we
don't need to "translate" the interrupt vectors & addresses, the guest
will call hypercalls to configure things anyways.

We don't need to prevent MMIO pass-through for small BARs at all. This
should be some kind of capability or flag passed by the arch. Our
segmentation of the MMIO domain means that we can give entire segments
to the guest and let it access anything in there (those segments are a
multiple of the page size always). Worst case it will access outside of
a device BAR within a segment and will cause the PE to go into error
state, shooting itself in the foot, there is no risk of side effect
outside of the guest boundaries.

In fact, we don't even need to emulate BAR sizing etc... in theory. Our
paravirt guests expect the BARs to have been already allocated for them
by the firmware and will pick up the addresses from the device-tree :-)

Today we use a "hack", putting all 0's in there and triggering the linux
code path to reassign unassigned resources (which will use BAR
emulation) but that's not what we are -supposed- to do. Not a big deal
and having the emulation there won't -hurt- us, it's just that we don't
really need any of it.

We have a small issue with ROMs. Our current KVM only works with huge
pages for guest memory but that is being fixed. So the way qemu maps the
ROM copy into the guest address space doesn't work. It might be handy
anyways to have a way for qemu to use MMIO emulation for ROM access as a
fallback. I'll look into it.

  * EEH

This is the name of those fancy error handling & isolation features I
mentioned earlier. To some extent it's a superset of AER, but we don't
generally expose AER to guests (or even the host), it's swallowed by
firmware into something else that provides a superset (well mostly) of
the AER information, and allow us to do those additional things like
isolating/de-isolating, reset control etc...

Here too, we'll need arch specific APIs through VFIO. Not necessarily a
huge deal, I mention it for completeness.

   * Misc

There's lots of small bits and pieces... in no special order:

 - netlink ? WTF ! Seriously, we don't need a hybrid API with a bit of
netlink and a bit of ioctl's ... it's not like there's something
fundamentally better about netlink vs. ioctl... it really depends what
you are doing, and in this case I fail to see what netlink brings you
other than bloat and more stupid userspace library deps.

 - I don't like too much the fact that VFIO provides yet another
different API to do what we already have at least 2 kernel APIs for, ie,
BAR mapping and config space access. At least it should be better at
using the backend infrastructure of the 2 others (sysfs & procfs). I
understand it wants to filter in some case (config space) and -maybe-
yet another API is the right way to go but allow me to have my doubts.

One thing I thought about, though you don't seem to like it... was to
reuse the representation of partitionable entities as groups in sysfs
that I talked about earlier. Those could have per-device subdirs with
the usual config & resource files, same semantics as the ones in the
real device, but when accessed via the group they get filtering. It
might or might not be practical in the end, tbd, but it would allow
apps using a slightly modified libpci for example to exploit some of
this.

 - The qemu vfio code hooks directly into ioapic ... of course that
won't fly with anything !x86

 - The various "objects" dealt with here, -especially- interrupts and
iommu, need a better in-kernel API so that fast in-kernel emulation can
take over from qemu based emulation. The way we need to do some of this
on POWER differs from x86. We can elaborate later, it's not necessarily
a killer either but essentially we'll take the bulk of interrupt
handling away from VFIO to the point where it won't see any of it at
all.

  - Non-PCI devices. That's a hot topic for embedded. I think the vast
majority here is platform devices. There's quite a bit of vfio that
isn't intrinsically PCI specific. We could have an in-kernel platform
driver like we have an in-kernel PCI driver to attach to. The mapping of
resources to userspace is rather generic, as goes for interrupts. I
don't know whether that idea can be pushed much further, I don't have
the bandwidth to look into it much at this point, but maybe it would be
possible to refactor vfio a bit to better separate what is PCI specific
to what is not. The idea would be to move the PCI specific bits to
inside the "placeholder" PCI driver, and same goes for platform bits.
"generic" ioctl's go to the VFIO core; anything it doesn't handle, it
passes to the driver, which allows the PCI one to handle things
differently than the platform one, maybe an amba one while at it,
etc.... just a thought, I haven't gone into the details at all.

I think that's all I had on my plate today, it's a long enough email
anyway :-) Anthony suggested we put that on a wiki, I'm a bit
wiki-disabled myself so he proposed to pickup my email and do that. We
should probably discuss the various items in here separately as
different threads to avoid too much confusion.

One other thing we should do on our side is publish somewhere our
current hacks to get you an idea of where we are going and what we had
to do (code speaks more than words). We'll try to do that asap, possibly
next week.

Note that I'll be on/off the next few weeks, travelling and doing
bringup. So expect latency in my replies.

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 322+ messages in thread

* kvm PCI assignment & VFIO ramblings
@ 2011-07-29 23:58 ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 322+ messages in thread
From: Benjamin Herrenschmidt @ 2011-07-29 23:58 UTC (permalink / raw)
  To: kvm
  Cc: Alexey Kardashevskiy, Paul Mackerras, linux-pci, David Gibson,
	Alex Williamson, Anthony Liguori, linuxppc-dev

Hi folks !

So I promised Anthony I would try to summarize some of the comments &
issues we have vs. VFIO after we've tried to use it for PCI pass-through
on POWER. It's pretty long, there are various items with more or less
impact, some of it is easily fixable, some are API issues, and we'll
probably want to discuss them separately, but for now here's a brain
dump.

David, Alexei, please make sure I haven't missed anything :-)

* Granularity of pass-through

So let's first start with what is probably the main issue and the most
contentious, which is the problem of dealing with the various
constraints which define the granularity of pass-through, along with
exploiting features like the VTd iommu domains.

For the sake of clarity, let me first talk a bit about the "granularity"
issue I've mentioned above.

There are various constraints that can/will force several devices to be
"owned" by the same guest and on the same side of the host/guest
boundary. This is generally because some kind of HW resource is shared
and thus not doing so would break the isolation barrier and enable a
guest to disrupt the operations of the host and/or another guest.

Some of those constraints are well know, such as shared interrupts. Some
are more subtle, for example, if a PCIe->PCI bridge exist in the system,
there is no way for the iommu to identify transactions from devices
coming from the PCI segment of that bridge with a granularity other than
"behind the bridge". So typically a EHCI/OHCI/OHCI combo (a classic)
behind such a bridge must be treated as a single "entity" for
pass-trough purposes.

In IBM POWER land, we call this a "partitionable endpoint" (the term
"endpoint" here is historic, such a PE can be made of several PCIe
"endpoints"). I think "partitionable" is a pretty good name tho to
represent the constraints, so I'll call this a "partitionable group"
from now on. 

Other examples of such HW imposed constraints can be a shared iommu with
no filtering capability (some older POWER hardware which we might want
to support fall into that category, each PCI host bridge is its own
domain but doesn't have a finer granularity... however those machines
tend to have a lot of host bridges :)

If we are ever going to consider applying some of this to non-PCI
devices (see the ongoing discussions here), then we will be faced with
the crazyness of embedded designers which probably means all sort of new
constraints we can't even begin to think about

This leads me to those initial conclusions:

- The -minimum- granularity of pass-through is not always a single
device and not always under SW control

- Having a magic heuristic in libvirt to figure out those constraints is
WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
knowledge of PCI resource management and getting it wrong in many many
cases, something that took years to fix essentially by ripping it all
out. This is kernel knowledge and thus we need the kernel to expose in a
way or another what those constraints are, what those "partitionable
groups" are.

- That does -not- mean that we cannot specify for each individual device
within such a group where we want to put it in qemu (what devfn etc...).
As long as there is a clear understanding that the "ownership" of the
device goes with the group, this is somewhat orthogonal to how they are
represented in qemu. (Not completely... if the iommu is exposed to the
guest ,via paravirt for example, some of these constraints must be
exposed but I'll talk about that more later).

The interface currently proposed for VFIO (and associated uiommu)
doesn't handle that problem at all. Instead, it is entirely centered
around a specific "feature" of the VTd iommu's for creating arbitrary
domains with arbitrary devices (tho those devices -do- have the same
constraints exposed above, don't try to put 2 legacy PCI devices behind
the same bridge into 2 different domains !), but the API totally ignores
the problem, leaves it to libvirt "magic foo" and focuses on something
that is both quite secondary in the grand scheme of things, and quite
x86 VTd specific in the implementation and API definition.

Now, I'm not saying these programmable iommu domains aren't a nice
feature and that we shouldn't exploit them when available, but as it is,
it is too much a central part of the API.

I'll talk a little bit more about recent POWER iommu's here to
illustrate where I'm coming from with my idea of groups:

On p7ioc (the IO chip used on recent P7 machines), there -is- a concept
of domain and a per-RID filtering. However it differs from VTd in a few
ways:

The "domains" (aka PEs) encompass more than just an iommu filtering
scheme. The MMIO space and PIO space are also segmented, and those
segments assigned to domains. Interrupts (well, MSI ports at least) are
assigned to domains. Inbound PCIe error messages are targeted to
domains, etc...

Basically, the PEs provide a very strong isolation feature which
includes errors, and has the ability to immediately "isolate" a PE on
the first occurence of an error. For example, if an inbound PCIe error
is signaled by a device on a PE or such a device does a DMA to a
non-authorized address, the whole PE gets into error state. All
subsequent stores (both DMA and MMIO) are swallowed and reads return all
1's, interrupts are blocked. This is designed to prevent any propagation
of bad data, which is a very important feature in large high reliability
systems.

Software then has the ability to selectively turn back on MMIO and/or
DMA, perform diagnostics, reset devices etc...

Because the domains encompass more than just DMA, but also segment the
MMIO space, it is not practical at all to dynamically reconfigure them
at runtime to "move" devices into domains. The firmware or early kernel
code (it depends) will assign devices BARs using an algorithm that keeps
them within PE segment boundaries, etc....

Additionally (and this is indeed a "restriction" compared to VTd, though
I expect our future IO chips to lift it to some extent), PE don't get
separate DMA address spaces. There is one 64-bit DMA address space per
PCI host bridge, and it is 'segmented' with each segment being assigned
to a PE. Due to the way PE assignment works in hardware, it is not
practical to make several devices share a segment unless they are on the
same bus. Also the resulting limit in the amount of 32-bit DMA space a
device can access means that it's impractical to put too many devices in
a PE anyways. (This is clearly designed for paravirt iommu, I'll talk
more about that later).

The above essentially extends the granularity requirement (or rather is
another factor defining what the granularity of partitionable entities
is). You can think of it as "pre-existing" domains.

I believe the way to solve that is to introduce a kernel interface to
expose those "partitionable entities" to userspace. In addition, it
occurs to me that the ability to manipulate VTd domains essentially
boils down to manipulating those groups (creating larger ones with
individual components).

I like the idea of defining / playing with those groups statically
(using a command line tool or sysfs, possibly having a config file
defining them in a persistent way) rather than having their lifetime
tied to a uiommu file descriptor.

It also makes it a LOT easier to have a channel to manipulate
platform/arch specific attributes of those domains if any.

So we could define an API or representation in sysfs that exposes what
the partitionable entities are, and we may add to it an API to
manipulate them. But we don't have to and I'm happy to keep the
additional SW grouping you can do on VTd as a sepparate "add-on" API
(tho I don't like at all the way it works with uiommu). However, qemu
needs to know what the grouping is regardless of the domains, and it's
not nice if it has to manipulate two different concepts here so
eventually those "partitionable entities" from a qemu standpoint must
look like domains.

My main point is that I don't want the "knowledge" here to be in libvirt
or qemu. In fact, I want to be able to do something as simple as passing
a reference to a PE to qemu (sysfs path ?) and have it just pickup all
the devices in there and expose them to the guest.

This can be done in a way that isn't PCI specific as well (the
definition of the groups and what is grouped would would obviously be
somewhat bus specific and handled by platform code in the kernel).

Maybe something like /sys/devgroups ? This probably warrants involving
more kernel people into the discussion.

* IOMMU

Now more on iommu. I've described I think in enough details how ours
work, there are others, I don't know what freescale or ARM are doing,
sparc doesn't quite work like VTd either, etc...

The main problem isn't that much the mechanics of the iommu but really
how it's exposed (or not) to guests.

VFIO here is basically designed for one and only one thing: expose the
entire guest physical address space to the device more/less 1:1.

This means:

  - It only works with iommu's that provide complete DMA address spaces
to devices. Won't work with a single 'segmented' address space like we
have on POWER.

  - It requires the guest to be pinned. Pass-through -> no more swap

  - The guest cannot make use of the iommu to deal with 32-bit DMA
devices, thus a guest with more than a few G of RAM (I don't know the
exact limit on x86, depends on your IO hole I suppose), and you end up
back to swiotlb & bounce buffering.

  - It doesn't work for POWER server anyways because of our need to
provide a paravirt iommu interface to the guest since that's how pHyp
works today and how existing OSes expect to operate.

Now some of this can be fixed with tweaks, and we've started doing it
(we have a working pass-through using VFIO, forgot to mention that, it's
just that we don't like what we had to do to get there).

Basically, what we do today is:

- We add an ioctl to VFIO to expose to qemu the segment information. IE.
What is the DMA address and size of the DMA "window" usable for a given
device. This is a tweak, that should really be handled at the "domain"
level.

That current hack won't work well if two devices share an iommu. Note
that we have an additional constraint here due to our paravirt
interfaces (specificed in PAPR) which is that PE domains must have a
common parent. Basically, pHyp makes them look like a PCIe host bridge
per domain in the guest. I think that's a pretty good idea and qemu
might want to do the same.

- We hack out the currently unconditional mapping of the entire guest
space in the iommu. Something will have to be done to "decide" whether
to do that or not ... qemu argument -> ioctl ?

- We hook up the paravirt call to insert/remove a translation from the
iommu to the VFIO map/unmap ioctl's.

This limps along but it's not great. Some of the problems are:

- I've already mentioned, the domain problem again :-) 

- Performance sucks of course, the vfio map ioctl wasn't mean for that
and has quite a bit of overhead. However we'll want to do the paravirt
call directly in the kernel eventually ...

  - ... which isn't trivial to get back to our underlying arch specific
iommu object from there. We'll probably need a set of arch specific
"sideband" ioctl's to "register" our paravirt iommu "bus numbers" and
link them to the real thing kernel-side.

- PAPR (the specification of our paravirt interface and the expectation
of current OSes) wants iommu pages to be 4k by default, regardless of
the kernel host page size, which makes things a bit tricky since our
enterprise host kernels have a 64k base page size. Additionally, we have
new PAPR interfaces that we want to exploit, to allow the guest to
create secondary iommu segments (in 64-bit space), which can be used
(under guest control) to do things like map the entire guest (here it
is :-) or use larger iommu page sizes (if permitted by the host kernel,
in our case we could allow 64k iommu page size with a 64k host kernel).

The above means we need arch specific APIs. So arch specific vfio
ioctl's, either that or kvm ones going to vfio or something ... the
current structure of vfio/kvm interaction doesn't make it easy.

* IO space

On most (if not all) non-x86 archs, each PCI host bridge provide a
completely separate PCI address space. Qemu doesn't deal with that very
well. For MMIO it can be handled since those PCI address spaces are
"remapped" holes in the main CPU address space so devices can be
registered by using BAR + offset of that window in qemu MMIO mapping.

For PIO things get nasty. We have totally separate PIO spaces and qemu
doesn't seem to like that. We can try to play the offset trick as well,
we haven't tried yet, but basically that's another one to fix. Not a
huge deal I suppose but heh ...

Also our next generation chipset may drop support for PIO completely.

On the other hand, because PIO is just a special range of MMIO for us,
we can do normal pass-through on it and don't need any of the emulation
done by qemu.

  * MMIO constraints

The QEMU side VFIO code hard wires various constraints that are entirely
based on various requirements you decided you have on x86 but don't
necessarily apply to us :-)

Due to our paravirt nature, we don't need to masquerade the MSI-X table
for example. At all. If the guest configures crap into it, too bad, it
can only shoot itself in the foot since the host bridge enforces
validation anyways as I explained earlier. Because it's all paravirt, we
don't need to "translate" the interrupt vectors & addresses, the guest
will make hypercalls to configure things anyways.

We don't need to prevent MMIO pass-through for small BARs at all. This
should be some kind of capability or flag passed by the arch. Our
segmentation of the MMIO domain means that we can give entire segments
to the guest and let it access anything in there (those segments are a
multiple of the page size always). Worst case it will access outside of
a device BAR within a segment and will cause the PE to go into error
state, shooting itself in the foot, there is no risk of side effect
outside of the guest boundaries.

In fact, we don't even need to emulate BAR sizing etc... in theory. Our
paravirt guests expect the BARs to have been already allocated for them
by the firmware and will pick up the addresses from the device-tree :-)

Today we use a "hack", putting all 0's in there and triggering the linux
code path to reassign unassigned resources (which will use BAR
emulation) but that's not what we are -supposed- to do. Not a big deal
and having the emulation there won't -hurt- us, it's just that we don't
really need any of it.

We have a small issue with ROMs. Our current KVM only works with huge
pages for guest memory but that is being fixed. So the way qemu maps the
ROM copy into the guest address space doesn't work. It might be handy
anyways to have a way for qemu to use MMIO emulation for ROM access as a
fallback. I'll look into it.

  * EEH

This is the name of those fancy error handling & isolation features I
mentioned earlier. To some extent it's a superset of AER, but we don't
generally expose AER to guests (or even the host), it's swallowed by
firmware into something else that provides a superset (well mostly) of
the AER information, and allows us to do those additional things like
isolating/de-isolating, reset control etc...

Here too, we'll need arch specific APIs through VFIO. Not necessarily a
huge deal, I mention it for completeness.

   * Misc

There's lots of small bits and pieces... in no special order:

 - netlink ? WTF ! Seriously, we don't need a hybrid API with a bit of
netlink and a bit of ioctl's ... it's not like there's something
fundamentally better for netlink vs. ioctl... it really depends on what
you are doing, and in this case I fail to see what netlink brings you
other than bloat and more stupid userspace library deps.

 - I don't like too much the fact that VFIO provides yet another
different API to do what we already have at least 2 kernel APIs for, ie,
BAR mapping and config space access. At least it should be better at
using the backend infrastructure of the 2 others (sysfs & procfs). I
understand it wants to filter in some case (config space) and -maybe-
yet another API is the right way to go but allow me to have my doubts.

One thing I thought about but you don't seem to like it ... was to use
the need to represent the partitionable entity as groups in sysfs that I
talked about earlier. Those could have per-device subdirs with the usual
config & resource files, same semantic as the ones in the real device,
but when accessed via the group they get filtering. It might or might not
be practical in the end, tbd, but it would allow apps using a slightly
modified libpci for example to exploit some of this.

 - The qemu vfio code hooks directly into ioapic ... of course that
won't fly with anything !x86

 - The various "objects" dealt with here, -especially- interrupts and
iommu, need a better in-kernel API so that fast in-kernel emulation can
take over from qemu based emulation. The way we need to do some of this
on POWER differs from x86. We can elaborate later, it's not necessarily
a killer either but essentially we'll take the bulk of interrupt
handling away from VFIO to the point where it won't see any of it at
all.

  - Non-PCI devices. That's a hot topic for embedded. I think the vast
majority here is platform devices. There's quite a bit of vfio that
isn't intrinsically PCI specific. We could have an in-kernel platform
driver like we have an in-kernel PCI driver to attach to. The mapping of
resources to userspace is rather generic, as goes for interrupts. I
don't know whether that idea can be pushed much further, I don't have
the bandwidth to look into it much at this point, but maybe it would be
possible to refactor vfio a bit to better separate what is PCI specific
to what is not. The idea would be to move the PCI specific bits to
inside the "placeholder" PCI driver, and same goes for platform bits.
"generic" ioctl's go to VFIO core, and anything it doesn't handle, it
passes to the driver, which allows the PCI one to handle things
differently than the platform one, maybe an amba one while at it,
etc.... just a thought, I haven't gone into the details at all.

I think that's all I had on my plate today, it's a long enough email
anyway :-) Anthony suggested we put that on a wiki, I'm a bit
wiki-disabled myself so he proposed to pickup my email and do that. We
should probably discuss the various items in here separately as
different threads to avoid too much confusion.

One other thing we should do on our side is publish somewhere our
current hacks to get you an idea of where we are going and what we had
to do (code speaks more than words). We'll try to do that asap, possibly
next week.

Note that I'll be on/off the next few weeks, travelling and doing
bringup. So expect latency in my replies.

Cheers,
Ben.


* Re: kvm PCI assignment & VFIO ramblings
  2011-07-29 23:58 ` Benjamin Herrenschmidt
  (?)
@ 2011-07-30 18:20   ` Alex Williamson
  -1 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-07-30 18:20 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: kvm, Anthony Liguori, David Gibson, Paul Mackerras,
	Alexey Kardashevskiy, linux-pci, linuxppc-dev, iommu, benve,
	aafabbri, chrisw, qemu-devel

On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote:
> Hi folks !
> 
> So I promised Anthony I would try to summarize some of the comments &
> issues we have vs. VFIO after we've tried to use it for PCI pass-through
> on POWER. It's pretty long, there are various items with more or less
> impact, some of it is easily fixable, some are API issues, and we'll
> probably want to discuss them separately, but for now here's a brain
> dump.

Thanks Ben.  For those wondering what happened to VFIO and where it
lives now, Tom Lyon turned it over to me.  I've been continuing to hack
and bug fix and prep it for upstream.  My trees are here:

git://github.com/awilliam/linux-vfio.git vfio
git://github.com/awilliam/qemu-vfio.git vfio

I was hoping we were close to being ready for an upstream push, but we
obviously need to work through the issues Ben and company have been
hitting.

> David, Alexei, please make sure I haven't missed anything :-)
> 
> * Granularity of pass-through
> 
> So let's first start with what is probably the main issue and the most
> contentious, which is the problem of dealing with the various
> constraints which define the granularity of pass-through, along with
> exploiting features like the VTd iommu domains.
> 
> For the sake of clarity, let me first talk a bit about the "granularity"
> issue I've mentioned above.
> 
> There are various constraints that can/will force several devices to be
> "owned" by the same guest and on the same side of the host/guest
> boundary. This is generally because some kind of HW resource is shared
> and thus not doing so would break the isolation barrier and enable a
> guest to disrupt the operations of the host and/or another guest.
> 
> Some of those constraints are well known, such as shared interrupts. Some
> are more subtle, for example, if a PCIe->PCI bridge exists in the system,
> there is no way for the iommu to identify transactions from devices
> coming from the PCI segment of that bridge with a granularity other than
> "behind the bridge". So typically an EHCI/OHCI/OHCI combo (a classic)
> behind such a bridge must be treated as a single "entity" for
> pass-through purposes.

On x86, the USB controllers don't typically live behind a PCIe-to-PCI
bridge, so don't suffer the source identifier problem, but they do often
share an interrupt.  But even then, we can count on most modern devices
supporting PCI 2.3, and thus the DisINTx feature, which allows us to
share interrupts.  In any case, yes, it's more rare but we need to know
how to handle devices behind PCI bridges.  However I disagree that we
need to assign all the devices behind such a bridge to the guest.
There's a difference between removing the device from the host and
exposing the device to the guest.  If I have a NIC and HBA behind a
bridge, it's perfectly reasonable that I might only assign the NIC to
the guest, but as you describe, we then need to prevent the host, or any
other guest from making use of the HBA.

> In IBM POWER land, we call this a "partitionable endpoint" (the term
> "endpoint" here is historic, such a PE can be made of several PCIe
> "endpoints"). I think "partitionable" is a pretty good name tho to
> represent the constraints, so I'll call this a "partitionable group"
> from now on. 
> 
> Other examples of such HW imposed constraints can be a shared iommu with
> no filtering capability (some older POWER hardware which we might want
> to support fall into that category, each PCI host bridge is its own
> domain but doesn't have a finer granularity... however those machines
> tend to have a lot of host bridges :)
> 
> If we are ever going to consider applying some of this to non-PCI
> devices (see the ongoing discussions here), then we will be faced with
> the craziness of embedded designers, which probably means all sorts of new
> constraints we can't even begin to think about
> 
> This leads me to those initial conclusions:
> 
> - The -minimum- granularity of pass-through is not always a single
> device and not always under SW control

But IMHO, we need to preserve the granularity of exposing a device to a
guest as a single device.  That might mean some devices are held hostage
by an agent on the host.

> - Having a magic heuristic in libvirt to figure out those constraints is
> WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
> knowledge of PCI resource management and getting it wrong in many many
> cases, something that took years to fix essentially by ripping it all
> out. This is kernel knowledge and thus we need the kernel to expose in a
> way or another what those constraints are, what those "partitionable
> groups" are.
> 
> - That does -not- mean that we cannot specify for each individual device
> within such a group where we want to put it in qemu (what devfn etc...).
> As long as there is a clear understanding that the "ownership" of the
> device goes with the group, this is somewhat orthogonal to how they are
> represented in qemu. (Not completely... if the iommu is exposed to the
> guest, via paravirt for example, some of these constraints must be
> exposed but I'll talk about that more later).

Or we can choose not to expose all of the devices in the group to the
guest?

> The interface currently proposed for VFIO (and associated uiommu)
> doesn't handle that problem at all. Instead, it is entirely centered
> around a specific "feature" of the VTd iommu's for creating arbitrary
> domains with arbitrary devices (tho those devices -do- have the same
> constraints exposed above, don't try to put 2 legacy PCI devices behind
> the same bridge into 2 different domains !), but the API totally ignores
> the problem, leaves it to libvirt "magic foo" and focuses on something
> that is both quite secondary in the grand scheme of things, and quite
> x86 VTd specific in the implementation and API definition.

To be fair, libvirt's "magic foo" is built out of the necessity that
nobody else is defining the rules.

> Now, I'm not saying these programmable iommu domains aren't a nice
> feature and that we shouldn't exploit them when available, but as it is,
> it is too much a central part of the API.
> 
> I'll talk a little bit more about recent POWER iommu's here to
> illustrate where I'm coming from with my idea of groups:
> 
> On p7ioc (the IO chip used on recent P7 machines), there -is- a concept
> of domain and a per-RID filtering. However it differs from VTd in a few
> ways:
> 
> The "domains" (aka PEs) encompass more than just an iommu filtering
> scheme. The MMIO space and PIO space are also segmented, and those
> segments assigned to domains. Interrupts (well, MSI ports at least) are
> assigned to domains. Inbound PCIe error messages are targeted to
> domains, etc...
> 
> Basically, the PEs provide a very strong isolation feature which
> includes errors, and has the ability to immediately "isolate" a PE on
> the first occurrence of an error. For example, if an inbound PCIe error
> is signaled by a device on a PE or such a device does a DMA to a
> non-authorized address, the whole PE gets into error state. All
> subsequent stores (both DMA and MMIO) are swallowed and reads return all
> 1's, interrupts are blocked. This is designed to prevent any propagation
> of bad data, which is a very important feature in large high reliability
> systems.
> 
> Software then has the ability to selectively turn back on MMIO and/or
> DMA, perform diagnostics, reset devices etc...
> 
> Because the domains encompass more than just DMA, but also segment the
> MMIO space, it is not practical at all to dynamically reconfigure them
> at runtime to "move" devices into domains. The firmware or early kernel
> code (it depends) will assign device BARs using an algorithm that keeps
> them within PE segment boundaries, etc....
> 
> Additionally (and this is indeed a "restriction" compared to VTd, though
> I expect our future IO chips to lift it to some extent), PE don't get
> separate DMA address spaces. There is one 64-bit DMA address space per
> PCI host bridge, and it is 'segmented' with each segment being assigned
> to a PE. Due to the way PE assignment works in hardware, it is not
> practical to make several devices share a segment unless they are on the
> same bus. Also the resulting limit in the amount of 32-bit DMA space a
> device can access means that it's impractical to put too many devices in
> a PE anyways. (This is clearly designed for paravirt iommu, I'll talk
> more about that later).
> 
> The above essentially extends the granularity requirement (or rather is
> another factor defining what the granularity of partitionable entities
> is). You can think of it as "pre-existing" domains.
> 
> I believe the way to solve that is to introduce a kernel interface to
> expose those "partitionable entities" to userspace. In addition, it
> occurs to me that the ability to manipulate VTd domains essentially
> boils down to manipulating those groups (creating larger ones with
> individual components).
> 
> I like the idea of defining / playing with those groups statically
> (using a command line tool or sysfs, possibly having a config file
> defining them in a persistent way) rather than having their lifetime
> tied to a uiommu file descriptor.
> 
> It also makes it a LOT easier to have a channel to manipulate
> platform/arch specific attributes of those domains if any.
> 
> So we could define an API or representation in sysfs that exposes what
> the partitionable entities are, and we may add to it an API to
> manipulate them. But we don't have to and I'm happy to keep the
> additional SW grouping you can do on VTd as a separate "add-on" API
> (tho I don't like at all the way it works with uiommu). However, qemu
> needs to know what the grouping is regardless of the domains, and it's
> not nice if it has to manipulate two different concepts here so
> eventually those "partitionable entities" from a qemu standpoint must
> look like domains.
> 
> My main point is that I don't want the "knowledge" here to be in libvirt
> or qemu. In fact, I want to be able to do something as simple as passing
> a reference to a PE to qemu (sysfs path ?) and have it just pickup all
> the devices in there and expose them to the guest.
> 
> This can be done in a way that isn't PCI specific as well (the
> definition of the groups and what is grouped would obviously be
> somewhat bus specific and handled by platform code in the kernel).
> 
> Maybe something like /sys/devgroups ? This probably warrants involving
> more kernel people into the discussion.

I don't yet buy into passing groups to qemu since I don't buy into the
idea of always exposing all of those devices to qemu.  Would it be
sufficient to expose iommu nodes in sysfs that link to the devices
behind them and describe properties and capabilities of the iommu
itself?  More on this at the end.

> * IOMMU
> 
> Now more on iommu. I've described I think in enough details how ours
> work, there are others, I don't know what freescale or ARM are doing,
> sparc doesn't quite work like VTd either, etc...
> 
> The main problem isn't that much the mechanics of the iommu but really
> how it's exposed (or not) to guests.
> 
> VFIO here is basically designed for one and only one thing: expose the
> entire guest physical address space to the device more/less 1:1.
> 
> This means:
> 
>   - It only works with iommu's that provide complete DMA address spaces
> to devices. Won't work with a single 'segmented' address space like we
> have on POWER.
> 
>   - It requires the guest to be pinned. Pass-through -> no more swap
> 
>   - The guest cannot make use of the iommu to deal with 32-bit DMA
> devices, thus a guest with more than a few G of RAM (I don't know the
> exact limit on x86, depends on your IO hole I suppose), and you end up
> back to swiotlb & bounce buffering.
> 
>   - It doesn't work for POWER server anyways because of our need to
> provide a paravirt iommu interface to the guest since that's how pHyp
> works today and how existing OSes expect to operate.
> 
> Now some of this can be fixed with tweaks, and we've started doing it
> (we have a working pass-through using VFIO, forgot to mention that, it's
> just that we don't like what we had to do to get there).

This is a result of wanting to support *unmodified* x86 guests.  We
don't have the luxury of having a predefined pvDMA spec that all x86
OSes adhere to.  The 32bit problem is unfortunate, but the priority use
case for assigning devices to guests is high performance I/O, which
usually entails modern, 64bit hardware.  I'd like to see us get to the
point of having emulated IOMMU hardware on x86, which could then be
backed by VFIO, but for now guest pinning is the most practical and
useful.

> Basically, what we do today is:
> 
> - We add an ioctl to VFIO to expose to qemu the segment information. IE.
> What is the DMA address and size of the DMA "window" usable for a given
> device. This is a tweak, that should really be handled at the "domain"
> level.
> 
> That current hack won't work well if two devices share an iommu. Note
> that we have an additional constraint here due to our paravirt
> interfaces (specified in PAPR), which is that PE domains must have a
> common parent. Basically, pHyp makes them look like a PCIe host bridge
> per domain in the guest. I think that's a pretty good idea and qemu
> might want to do the same.
> 
> - We hack out the currently unconditional mapping of the entire guest
> space in the iommu. Something will have to be done to "decide" whether
> to do that or not ... qemu argument -> ioctl ?
> 
> - We hook up the paravirt call to insert/remove a translation from the
> iommu to the VFIO map/unmap ioctl's.
> 
> This limps along but it's not great. Some of the problems are:
> 
> - I've already mentioned, the domain problem again :-) 
> 
> - Performance sucks of course, the vfio map ioctl wasn't meant for that
> and has quite a bit of overhead. However we'll want to do the paravirt
> call directly in the kernel eventually ...
> 
>   - ... which isn't trivial to get back to our underlying arch specific
> iommu object from there. We'll probably need a set of arch specific
> "sideband" ioctl's to "register" our paravirt iommu "bus numbers" and
> link them to the real thing kernel-side.
> 
> - PAPR (the specification of our paravirt interface and the expectation
> of current OSes) wants iommu pages to be 4k by default, regardless of
> the kernel host page size, which makes things a bit tricky since our
> enterprise host kernels have a 64k base page size. Additionally, we have
> new PAPR interfaces that we want to exploit, to allow the guest to
> create secondary iommu segments (in 64-bit space), which can be used
> (under guest control) to do things like map the entire guest (here it
> is :-) or use larger iommu page sizes (if permitted by the host kernel,
> in our case we could allow 64k iommu page size with a 64k host kernel).
> 
> The above means we need arch specific APIs. So arch specific vfio
> ioctl's, either that or kvm ones going to vfio or something ... the
> current structure of vfio/kvm interaction doesn't make it easy.

FYI, we also have large page support for x86 VT-d, but it seems to only
be opportunistic right now.  I'll try to come back to the rest of this
below.

> * IO space
> 
> On most (if not all) non-x86 archs, each PCI host bridge provides a
> completely separate PCI address space. Qemu doesn't deal with that very
> well. For MMIO it can be handled since those PCI address spaces are
> "remapped" holes in the main CPU address space so devices can be
> registered by using BAR + offset of that window in qemu MMIO mapping.
> 
> For PIO things get nasty. We have totally separate PIO spaces and qemu
> doesn't seem to like that. We can try to play the offset trick as well,
> we haven't tried yet, but basically that's another one to fix. Not a
> huge deal I suppose but heh ...
> 
> Also our next generation chipset may drop support for PIO completely.
> 
> On the other hand, because PIO is just a special range of MMIO for us,
> we can do normal pass-through on it and don't need any of the emulation
> done by qemu.

Maybe we can add mmap support to PIO regions on non-x86.

>   * MMIO constraints
> 
> The QEMU side VFIO code hard wires various constraints that are entirely
> based on various requirements you decided you have on x86 but don't
> necessarily apply to us :-)
> 
> Due to our paravirt nature, we don't need to masquerade the MSI-X table
> for example. At all. If the guest configures crap into it, too bad, it
> can only shoot itself in the foot since the host bridge enforces
> validation anyways as I explained earlier. Because it's all paravirt, we
> don't need to "translate" the interrupt vectors & addresses, the guest
> will make hypercalls to configure things anyways.

With interrupt remapping, we can allow the guest access to the MSI-X
table, but since that takes the host out of the loop, there's
effectively no way for the guest to correctly program it directly by
itself.

> We don't need to prevent MMIO pass-through for small BARs at all. This
> should be some kind of capability or flag passed by the arch. Our
> segmentation of the MMIO domain means that we can give entire segments
> to the guest and let it access anything in there (those segments are a
> multiple of the page size always). Worst case it will access outside of
> a device BAR within a segment and will cause the PE to go into error
> state, shooting itself in the foot, there is no risk of side effect
> outside of the guest boundaries.

Sure, this could be some kind of capability flag, maybe even implicit in
certain configurations.

> In fact, we don't even need to emulate BAR sizing etc... in theory. Our
> paravirt guests expect the BARs to have been already allocated for them
> by the firmware and will pick up the addresses from the device-tree :-)
> 
> Today we use a "hack", putting all 0's in there and triggering the linux
> code path to reassign unassigned resources (which will use BAR
> emulation) but that's not what we are -supposed- to do. Not a big deal
> and having the emulation there won't -hurt- us, it's just that we don't
> really need any of it.
> 
> We have a small issue with ROMs. Our current KVM only works with huge
> pages for guest memory but that is being fixed. So the way qemu maps the
> ROM copy into the guest address space doesn't work. It might be handy
> anyways to have a way for qemu to use MMIO emulation for ROM access as a
> fallback. I'll look into it.

So that means ROMs don't work for you on emulated devices either?  The
reason we read it once and map it into the guest is because Michael
Tsirkin found a section in the PCI spec that indicates devices can share
address decoders between BARs and ROM.  This means we can't just leave
the enabled bit set in the ROM BAR, because it could actually disable an
address decoder for a regular BAR.  We could slow-map the actual ROM,
enabling it around each read, but shadowing it seemed far more
efficient.

>   * EEH
> 
> This is the name of those fancy error handling & isolation features I
> mentioned earlier. To some extent it's a superset of AER, but we don't
> generally expose AER to guests (or even the host), it's swallowed by
> firmware into something else that provides a superset (well mostly) of
> the AER information, and allows us to do those additional things like
> isolating/de-isolating, reset control etc...
> 
> Here too, we'll need arch specific APIs through VFIO. Not necessarily a
> huge deal, I mention it for completeness.

We expect to do AER via the VFIO netlink interface, which, even though
it's bashed below, would be quite extensible to supporting different
kinds of errors.

>    * Misc
> 
> There's lots of small bits and pieces... in no special order:
> 
>  - netlink ? WTF ! Seriously, we don't need a hybrid API with a bit of
> netlink and a bit of ioctl's ... it's not like there's something
> fundamentally better for netlink vs. ioctl... it really depends on what
> you are doing, and in this case I fail to see what netlink brings you
> other than bloat and more stupid userspace library deps.

The netlink interface is primarily for host->guest signaling.  I've only
implemented the remove command (since we're lacking a pcie-host in qemu
to do AER), but it seems to work quite well.  If you have suggestions
for how else we might do it, please let me know.  This seems to be the
sort of thing netlink is supposed to be used for.

>  - I don't like too much the fact that VFIO provides yet another
> different API to do what we already have at least 2 kernel APIs for, ie,
> BAR mapping and config space access. At least it should be better at
> using the backend infrastructure of the 2 others (sysfs & procfs). I
> understand it wants to filter in some case (config space) and -maybe-
> yet another API is the right way to go but allow me to have my doubts.

The use of PCI sysfs is actually one of my complaints about current
device assignment.  To do assignment with an unprivileged guest we need
to open the PCI sysfs config file for it, then change ownership on a
handful of other PCI sysfs files, then there's this other pci-stub thing
to maintain ownership, but the kvm ioctls don't actually require it and
can grab onto any free device...  We are duplicating some of that in
VFIO, but we also put the ownership of the device behind a single device
file.  We do have the uiommu problem that we can't give an unprivileged
user ownership of that, but your usage model may actually make that
easier.  More below...

> One thing I thought about but you don't seem to like it ... was to use
> the need to represent the partitionable entity as groups in sysfs that I
> talked about earlier. Those could have per-device subdirs with the usual
> config & resource files, same semantic as the ones in the real device,
> but when accessed via the group they get filtering. It might or might not
> be practical in the end, tbd, but it would allow apps using a slightly
> modified libpci for example to exploit some of this.

I may be tainted by our disagreement that all the devices in a group
need to be exposed to the guest and qemu could just take a pointer to a
sysfs directory.  That seems very unlike qemu and pushes more of the
policy into qemu, which seems like the wrong direction.

>  - The qemu vfio code hooks directly into ioapic ... of course that
> won't fly with anything !x86

I spent a lot of time looking for an architecture neutral solution here,
but I don't think it exists.  Please prove me wrong.  The problem is
that we have to disable INTx on an assigned device after it fires (VFIO
does this automatically).  If we don't do this, a non-responsive or
malicious guest could sit on the interrupt, causing it to fire
repeatedly as a DoS on the host.  The only indication that we can rely
on to re-enable INTx is when the guest CPU writes an EOI to the APIC.
We can't just wait for device accesses because a) the device CSRs are
(hopefully) direct mapped and we'd have to slow map them or attempt to
do some kind of dirty logging to detect when they're accessed, and b) what
constitutes an interrupt service is device specific.

That means we need to figure out how PCI interrupt 'A' (or B...)
translates to a GSI (Global System Interrupt - ACPI definition, but
hopefully a generic concept).  That GSI identifies a pin on an IOAPIC,
which will also see the APIC EOI.  And just to spice things up, the
guest can change the PCI to GSI mappings via ACPI.  I think the set of
callbacks I've added are generic (maybe I left ioapic in the name), but
yes they do need to be implemented for other architectures.  Patches
appreciated from those with knowledge of the systems and/or access to
device specs.  This is the only reason that I make QEMU VFIO only build
for x86.

>  - The various "objects" dealt with here, -especially- interrupts and
> iommu, need a better in-kernel API so that fast in-kernel emulation can
> take over from qemu based emulation. The way we need to do some of this
> on POWER differs from x86. We can elaborate later, it's not necessarily
> a killer either but essentially we'll take the bulk of interrupt
> handling away from VFIO to the point where it won't see any of it at
> all.

The plan for x86 is to connect VFIO eventfds directly to KVM irqfds and
bypass QEMU.  This is exactly what VHOST does today and fairly trivial
to enable for MSI once we get it merged.  INTx would require us to be
able to define a level triggered irqfd in KVM and it's not yet clear if
we care that much about INTx performance.

We don't currently have a plan for accelerating IOMMU access since our
current usage model doesn't need one.  We also need to consider MSI-X
table acceleration for x86.  I hope we'll be able to use the new KVM
ioctls for this.

>   - Non-PCI devices. That's a hot topic for embedded. I think the vast
> majority here is platform devices. There's quite a bit of vfio that
> isn't intrinsically PCI specific. We could have an in-kernel platform
> driver like we have an in-kernel PCI driver to attach to. The mapping of
> resources to userspace is rather generic, as goes for interrupts. I
> don't know whether that idea can be pushed much further, I don't have
> the bandwidth to look into it much at this point, but maybe it would be
> possible to refactor vfio a bit to better separate what is PCI specific
> to what is not. The idea would be to move the PCI specific bits to
> inside the "placeholder" PCI driver, and same goes for platform bits.
> "generic" ioctl's go to VFIO core; anything it doesn't handle, it
> passes to the driver, which allows the PCI one to handle things
> differently than the platform one, maybe an amba one while at it,
> etc.... just a thought, I haven't gone into the details at all.

This is on my radar, but I don't have a good model for it either.  I
suspect there won't be a whole lot left of VFIO if we make all the PCI
bits optional.  The right approach might be to figure out what's missing
between UIO and VFIO for non-PCI, implement that as a driver, then see
if we can base VFIO on using that for MMIO/PIO/INTx, leaving config and
MSI as a VFIO layer on top of the new UIO driver.

> I think that's all I had on my plate today, it's a long enough email
> anyway :-) Anthony suggested we put that on a wiki, I'm a bit
> wiki-disabled myself so he proposed to pick up my email and do that. We
> should probably discuss the various items in here separately as
> different threads to avoid too much confusion.
> 
> One other thing we should do on our side is publish somewhere our
> current hacks to get you an idea of where we are going and what we had
> to do (code speaks more than words). We'll try to do that asap, possibly
> next week.
> 
> Note that I'll be on/off the next few weeks, travelling and doing
> bringup. So expect latency in my replies.

Thanks for the write up, I think it will be good to let everyone digest
it before we discuss this at KVM forum.

Rather than your "groups" idea, I've been mulling over whether we can
just expose the dependencies, configuration, and capabilities in sysfs
and build qemu commandlines to describe it.  For instance, if we simply
start with creating iommu nodes in sysfs, we could create links under
each iommu directory to the devices behind them.  Some kind of
capability file could define properties like whether it's page table
based or fixed iova window or the granularity of mapping the devices
behind it.  Once we have that, we could probably make uiommu attach to
each of those nodes.

That means we know /dev/uiommu7 (random example) is our access to a
specific iommu with a given set of devices behind it.  If that iommu is
a PE (via those capability files), then a user space entity (trying hard
not to call it libvirt) can unbind all those devices from the host,
maybe bind the ones it wants to assign to a guest to vfio and bind the
others to pci-stub for safe keeping.  If you trust a user with
everything in a PE, bind all the devices to VFIO, chown all
the /dev/vfioX entries for those devices, and the /dev/uiommuX device.

We might then come up with qemu command lines to describe interesting
configurations, such as:

-device iommu,model=PAPR,uiommu=/dev/uiommu7,id=iommu.0 \
-device pci-bus,...,iommu=iommu.0,id=pci.0 \
-device vfio,host=ssss:bb:dd.f,bus=pci.0,addr=dd.f,id=hostdev0

The userspace entity would obviously need to put things in the same PE
in the right place, but it doesn't seem to take a lot of sysfs info to
get that right.

Today we do DMA mapping via the VFIO device because the capabilities of
the IOMMU domains change depending on which devices are connected (for
VT-d, the least common denominator of the IOMMUs in play).  Forcing the
DMA mappings through VFIO naturally forces the call order.  If we moved
to something like above, we could switch the DMA mapping to the uiommu
device, since the IOMMU would have fixed capabilities.

What gaps would something like this leave for your IOMMU granularity
problems?  I'll need to think through how it works when we don't want to
expose the iommu to the guest, maybe a model=none (default) that doesn't
need to be connected to a pci bus and maps all guest memory.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
@ 2011-07-30 18:20   ` Alex Williamson
  0 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-07-30 18:20 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: chrisw, Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	qemu-devel, David Gibson, aafabbri, iommu, Anthony Liguori,
	linuxppc-dev, benve

On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote:
> Hi folks !
> 
> So I promised Anthony I would try to summarize some of the comments &
> issues we have vs. VFIO after we've tried to use it for PCI pass-through
> on POWER. It's pretty long, there are various items with more or less
> impact, some of it is easily fixable, some are API issues, and we'll
> probably want to discuss them separately, but for now here's a brain
> dump.

Thanks Ben.  For those wondering what happened to VFIO and where it
lives now, Tom Lyon turned it over to me.  I've been continuing to hack
and bug fix and prep it for upstream.  My trees are here:

git://github.com/awilliam/linux-vfio.git vfio
git://github.com/awilliam/qemu-vfio.git vfio

I was hoping we were close to being ready for an upstream push, but we
obviously need to work through the issues Ben and company have been
hitting.

> David, Alexei, please make sure I haven't missed anything :-)
> 
> * Granularity of pass-through
> 
> So let's first start with what is probably the main issue and the most
> contentious, which is the problem of dealing with the various
> constraints which define the granularity of pass-through, along with
> exploiting features like the VTd iommu domains.
> 
> For the sake of clarity, let me first talk a bit about the "granularity"
> issue I've mentioned above.
> 
> There are various constraints that can/will force several devices to be
> "owned" by the same guest and on the same side of the host/guest
> boundary. This is generally because some kind of HW resource is shared
> and thus not doing so would break the isolation barrier and enable a
> guest to disrupt the operations of the host and/or another guest.
> 
> Some of those constraints are well known, such as shared interrupts. Some
> are more subtle: for example, if a PCIe->PCI bridge exists in the system,
> there is no way for the iommu to identify transactions from devices
> coming from the PCI segment of that bridge with a granularity other than
> "behind the bridge". So typically an EHCI/OHCI/OHCI combo (a classic)
> behind such a bridge must be treated as a single "entity" for
> pass-through purposes.

On x86, the USB controllers don't typically live behind a PCIe-to-PCI
bridge, so don't suffer the source identifier problem, but they do often
share an interrupt.  But even then, we can count on most modern devices
supporting PCI 2.3, and thus the DisINTx feature, which allows us to
share interrupts.  In any case, yes, it's more rare but we need to know
how to handle devices behind PCI bridges.  However I disagree that we
need to assign all the devices behind such a bridge to the guest.
There's a difference between removing the device from the host and
exposing the device to the guest.  If I have a NIC and HBA behind a
bridge, it's perfectly reasonable that I might only assign the NIC to
the guest, but as you describe, we then need to prevent the host, or any
other guest from making use of the HBA.

> In IBM POWER land, we call this a "partitionable endpoint" (the term
> "endpoint" here is historic, such a PE can be made of several PCIe
> "endpoints"). I think "partitionable" is a pretty good name tho to
> represent the constraints, so I'll call this a "partitionable group"
> from now on. 
> 
> Other examples of such HW imposed constraints can be a shared iommu with
> no filtering capability (some older POWER hardware which we might want
> to support falls into that category, each PCI host bridge is its own
> domain but doesn't have a finer granularity... however those machines
> tend to have a lot of host bridges :)
> 
> If we are ever going to consider applying some of this to non-PCI
> devices (see the ongoing discussions here), then we will be faced with
> the craziness of embedded designers, which probably means all sorts of new
> constraints we can't even begin to think about.
> 
> This leads me to those initial conclusions:
> 
> - The -minimum- granularity of pass-through is not always a single
> device and not always under SW control

But IMHO, we need to preserve the granularity of exposing a device to a
guest as a single device.  That might mean some devices are held hostage
by an agent on the host.

> - Having a magic heuristic in libvirt to figure out those constraints is
> WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
> knowledge of PCI resource management and getting it wrong in many many
> cases, something that took years to fix essentially by ripping it all
> out. This is kernel knowledge and thus we need the kernel to expose in a
> way or another what those constraints are, what those "partitionable
> groups" are.
> 
> - That does -not- mean that we cannot specify for each individual device
> within such a group where we want to put it in qemu (what devfn etc...).
> As long as there is a clear understanding that the "ownership" of the
> device goes with the group, this is somewhat orthogonal to how they are
> represented in qemu. (Not completely... if the iommu is exposed to the
> guest, via paravirt for example, some of these constraints must be
> exposed but I'll talk about that more later).

Or we can choose not to expose all of the devices in the group to the
guest?

> The interface currently proposed for VFIO (and associated uiommu)
> doesn't handle that problem at all. Instead, it is entirely centered
> around a specific "feature" of the VTd iommu's for creating arbitrary
> domains with arbitrary devices (tho those devices -do- have the same
> constraints exposed above, don't try to put 2 legacy PCI devices behind
> the same bridge into 2 different domains!), but the API totally ignores
> the problem, leaves it to libvirt "magic foo" and focuses on something
> that is both quite secondary in the grand scheme of things, and quite
> x86 VTd specific in the implementation and API definition.

To be fair, libvirt's "magic foo" is built out of the necessity that
nobody else is defining the rules.

> Now, I'm not saying these programmable iommu domains aren't a nice
> feature and that we shouldn't exploit them when available, but as it is,
> it is too much a central part of the API.
> 
> I'll talk a little bit more about recent POWER iommu's here to
> illustrate where I'm coming from with my idea of groups:
> 
> On p7ioc (the IO chip used on recent P7 machines), there -is- a concept
> of domain and a per-RID filtering. However it differs from VTd in a few
> ways:
> 
> The "domains" (aka PEs) encompass more than just an iommu filtering
> scheme. The MMIO space and PIO space are also segmented, and those
> segments assigned to domains. Interrupts (well, MSI ports at least) are
> assigned to domains. Inbound PCIe error messages are targeted to
> domains, etc...
> 
> Basically, the PEs provide a very strong isolation feature which
> includes errors, and has the ability to immediately "isolate" a PE on
> the first occurrence of an error. For example, if an inbound PCIe error
> is signaled by a device on a PE or such a device does a DMA to a
> non-authorized address, the whole PE gets into error state. All
> subsequent stores (both DMA and MMIO) are swallowed and reads return all
> 1's, interrupts are blocked. This is designed to prevent any propagation
> of bad data, which is a very important feature in large high reliability
> systems.
> 
> Software then has the ability to selectively turn back on MMIO and/or
> DMA, perform diagnostics, reset devices etc...
> 
> Because the domains encompass more than just DMA, but also segment the
> MMIO space, it is not practical at all to dynamically reconfigure them
> at runtime to "move" devices into domains. The firmware or early kernel
> code (it depends) will assign devices BARs using an algorithm that keeps
> them within PE segment boundaries, etc....
> 
> Additionally (and this is indeed a "restriction" compared to VTd, though
> I expect our future IO chips to lift it to some extent), PEs don't get
> separate DMA address spaces. There is one 64-bit DMA address space per
> PCI host bridge, and it is 'segmented' with each segment being assigned
> to a PE. Due to the way PE assignment works in hardware, it is not
> practical to make several devices share a segment unless they are on the
> same bus. Also the resulting limit in the amount of 32-bit DMA space a
> device can access means that it's impractical to put too many devices in
> a PE anyways. (This is clearly designed for paravirt iommu, I'll talk
> more about that later).
> 
> The above essentially extends the granularity requirement (or rather is
> another factor defining what the granularity of partitionable entities
> is). You can think of it as "pre-existing" domains.
> 
> I believe the way to solve that is to introduce a kernel interface to
> expose those "partitionable entities" to userspace. In addition, it
> occurs to me that the ability to manipulate VTd domains essentially
> boils down to manipulating those groups (creating larger ones with
> individual components).
> 
> I like the idea of defining / playing with those groups statically
> (using a command line tool or sysfs, possibly having a config file
> defining them in a persistent way) rather than having their lifetime
> tied to a uiommu file descriptor.
> 
> It also makes it a LOT easier to have a channel to manipulate
> platform/arch specific attributes of those domains if any.
> 
> So we could define an API or representation in sysfs that exposes what
> the partitionable entities are, and we may add to it an API to
> manipulate them. But we don't have to and I'm happy to keep the
> additional SW grouping you can do on VTd as a separate "add-on" API
> (tho I don't like at all the way it works with uiommu). However, qemu
> needs to know what the grouping is regardless of the domains, and it's
> not nice if it has to manipulate two different concepts here so
> eventually those "partitionable entities" from a qemu standpoint must
> look like domains.
> 
> My main point is that I don't want the "knowledge" here to be in libvirt
> or qemu. In fact, I want to be able to do something as simple as passing
> a reference to a PE to qemu (sysfs path ?) and have it just pickup all
> the devices in there and expose them to the guest.
> 
> This can be done in a way that isn't PCI specific as well (the
> definition of the groups and what is grouped would obviously be
> somewhat bus specific and handled by platform code in the kernel).
> 
> Maybe something like /sys/devgroups ? This probably warrants involving
> more kernel people into the discussion.

I don't yet buy into passing groups to qemu since I don't buy into the
idea of always exposing all of those devices to qemu.  Would it be
sufficient to expose iommu nodes in sysfs that link to the devices
behind them and describe properties and capabilities of the iommu
itself?  More on this at the end.

> * IOMMU
> 
> Now more on iommu. I've described I think in enough details how ours
> work, there are others, I don't know what freescale or ARM are doing,
> sparc doesn't quite work like VTd either, etc...
> 
> The main problem isn't that much the mechanics of the iommu but really
> how it's exposed (or not) to guests.
> 
> VFIO here is basically designed for one and only one thing: expose the
> entire guest physical address space to the device more/less 1:1.
> 
> This means:
> 
>   - It only works with iommu's that provide complete DMA address spaces
> to devices. Won't work with a single 'segmented' address space like we
> have on POWER.
> 
>   - It requires the guest to be pinned. Pass-through -> no more swap
> 
>   - The guest cannot make use of the iommu to deal with 32-bit DMA
> devices, thus with a guest that has more than a few G of RAM (I don't
> know the exact limit on x86, depends on your IO hole I suppose) you end
> up back with swiotlb & bounce buffering.
> 
>   - It doesn't work for POWER server anyways because of our need to
> provide a paravirt iommu interface to the guest since that's how pHyp
> works today and how existing OSes expect to operate.
> 
> Now some of this can be fixed with tweaks, and we've started doing it
> (we have a working pass-through using VFIO, forgot to mention that, it's
> just that we don't like what we had to do to get there).

This is a result of wanting to support *unmodified* x86 guests.  We
don't have the luxury of having a predefined pvDMA spec that all x86
OSes adhere to.  The 32bit problem is unfortunate, but the priority use
case for assigning devices to guests is high performance I/O, which
usually entails modern, 64bit hardware.  I'd like to see us get to the
point of having emulated IOMMU hardware on x86, which could then be
backed by VFIO, but for now guest pinning is the most practical and
useful.

> Basically, what we do today is:
> 
> - We add an ioctl to VFIO to expose to qemu the segment information. IE.
> What is the DMA address and size of the DMA "window" usable for a given
> device. This is a tweak, that should really be handled at the "domain"
> level.
> 
> That current hack won't work well if two devices share an iommu. Note
> that we have an additional constraint here due to our paravirt
> interfaces (specified in PAPR) which is that PE domains must have a
> common parent. Basically, pHyp makes them look like a PCIe host bridge
> per domain in the guest. I think that's a pretty good idea and qemu
> might want to do the same.
> 
> - We hack out the currently unconditional mapping of the entire guest
> space in the iommu. Something will have to be done to "decide" whether
> to do that or not ... qemu argument -> ioctl ?
> 
> - We hook up the paravirt call to insert/remove a translation from the
> iommu to the VFIO map/unmap ioctl's.
> 
> This limps along but it's not great. Some of the problems are:
> 
> - I've already mentioned, the domain problem again :-) 
> 
> - Performance sucks of course, the vfio map ioctl wasn't meant for that
> and has quite a bit of overhead. However we'll want to do the paravirt
> call directly in the kernel eventually ...
> 
>   - ... which isn't trivial to get back to our underlying arch specific
> iommu object from there. We'll probably need a set of arch specific
> "sideband" ioctl's to "register" our paravirt iommu "bus numbers" and
> link them to the real thing kernel-side.
> 
> - PAPR (the specification of our paravirt interface and the expectation
> of current OSes) wants iommu pages to be 4k by default, regardless of
> the kernel host page size, which makes things a bit tricky since our
> enterprise host kernels have a 64k base page size. Additionally, we have
> new PAPR interfaces that we want to exploit, to allow the guest to
> create secondary iommu segments (in 64-bit space), which can be used
> (under guest control) to do things like map the entire guest (here it
> is :-) or use larger iommu page sizes (if permitted by the host kernel,
> in our case we could allow 64k iommu page size with a 64k host kernel).
> 
> The above means we need arch specific APIs. So arch specific vfio
> ioctl's, either that or kvm ones going to vfio or something ... the
> current structure of vfio/kvm interaction doesn't make it easy.

FYI, we also have large page support for x86 VT-d, but it seems to only
be opportunistic right now.  I'll try to come back to the rest of this
below.

> * IO space
> 
> On most (if not all) non-x86 archs, each PCI host bridge provides a
> completely separate PCI address space. Qemu doesn't deal with that very
> well. For MMIO it can be handled since those PCI address spaces are
> "remapped" holes in the main CPU address space so devices can be
> registered by using BAR + offset of that window in qemu MMIO mapping.
> 
> For PIO things get nasty. We have totally separate PIO spaces and qemu
> doesn't seem to like that. We can try to play the offset trick as well,
> we haven't tried yet, but basically that's another one to fix. Not a
> huge deal I suppose but heh ...
> 
> Also our next generation chipset may drop support for PIO completely.
> 
> On the other hand, because PIO is just a special range of MMIO for us,
> we can do normal pass-through on it and don't need any of the emulation
> done by qemu.

Maybe we can add mmap support to PIO regions on non-x86.

>   * MMIO constraints
> 
> The QEMU side VFIO code hard wires various constraints that are entirely
> based on various requirements you decided you have on x86 but don't
> necessarily apply to us :-)
> 
> Due to our paravirt nature, we don't need to masquerade the MSI-X table
> for example. At all. If the guest configures crap into it, too bad, it
> can only shoot itself in the foot since the host bridge enforces
> validation anyways as I explained earlier. Because it's all paravirt, we
> don't need to "translate" the interrupt vectors & addresses, the guest
> will use hypercalls to configure things anyways.

With interrupt remapping, we can allow the guest access to the MSI-X
table, but since that takes the host out of the loop, there's
effectively no way for the guest to correctly program it directly by
itself.

> We don't need to prevent MMIO pass-through for small BARs at all. This
> should be some kind of capability or flag passed by the arch. Our
> segmentation of the MMIO domain means that we can give entire segments
> to the guest and let it access anything in there (those segments are a
> multiple of the page size always). Worst case it will access outside of
> a device BAR within a segment and will cause the PE to go into error
> state, shooting itself in the foot, there is no risk of side effect
> outside of the guest boundaries.

Sure, this could be some kind of capability flag, maybe even implicit in
certain configurations.

> In fact, we don't even need to emulate BAR sizing etc... in theory. Our
> paravirt guests expect the BARs to have been already allocated for them
> by the firmware and will pick up the addresses from the device-tree :-)
> 
> Today we use a "hack", putting all 0's in there and triggering the linux
> code path to reassign unassigned resources (which will use BAR
> emulation) but that's not what we are -supposed- to do. Not a big deal
> and having the emulation there won't -hurt- us, it's just that we don't
> really need any of it.
> 
> We have a small issue with ROMs. Our current KVM only works with huge
> pages for guest memory but that is being fixed. So the way qemu maps the
> ROM copy into the guest address space doesn't work. It might be handy
> anyways to have a way for qemu to use MMIO emulation for ROM access as a
> fallback. I'll look into it.

So that means ROMs don't work for you on emulated devices either?  The
reason we read it once and map it into the guest is because Michael
Tsirkin found a section in the PCI spec that indicates devices can share
address decoders between BARs and ROM.  This means we can't just leave
the enabled bit set in the ROM BAR, because it could actually disable an
address decoder for a regular BAR.  We could slow-map the actual ROM,
enabling it around each read, but shadowing it seemed far more
efficient.

>   * EEH
> 
> This is the name of those fancy error handling & isolation features I
> mentioned earlier. To some extent it's a superset of AER, but we don't
> generally expose AER to guests (or even the host), it's swallowed by
> firmware into something else that provides a superset (well mostly) of
> the AER information, and allow us to do those additional things like
> isolating/de-isolating, reset control etc...
> 
> Here too, we'll need arch specific APIs through VFIO. Not necessarily a
> huge deal, I mention it for completeness.

We expect to do AER via the VFIO netlink interface, which, even though
it's bashed below, would be quite extensible to supporting different
kinds of errors.

>    * Misc
> 
> There's lots of small bits and pieces... in no special order:
> 
>  - netlink ? WTF ! Seriously, we don't need a hybrid API with a bit of
> netlink and a bit of ioctl's ... it's not like there's something
> fundamentally better for netlink vs. ioctl... it really depends what
> you are doing, and in this case I fail to see what netlink brings you
> other than bloat and more stupid userspace library deps.

The netlink interface is primarily for host->guest signaling.  I've only
implemented the remove command (since we're lacking a pcie-host in qemu
to do AER), but it seems to work quite well.  If you have suggestions
for how else we might do it, please let me know.  This seems to be the
sort of thing netlink is supposed to be used for.

>  - I don't like too much the fact that VFIO provides yet another
> different API to do what we already have at least 2 kernel APIs for, ie,
> BAR mapping and config space access. At least it should be better at
> using the backend infrastructure of the 2 others (sysfs & procfs). I
> understand it wants to filter in some case (config space) and -maybe-
> yet another API is the right way to go but allow me to have my doubts.

The use of PCI sysfs is actually one of my complaints about current
device assignment.  To do assignment with an unprivileged guest we need
to open the PCI sysfs config file for it, then change ownership on a
handful of other PCI sysfs files, then there's this other pci-stub thing
to maintain ownership, but the kvm ioctls don't actually require it and
can grab onto any free device...  We are duplicating some of that in
VFIO, but we also put the ownership of the device behind a single device
file.  We do have the uiommu problem that we can't give an unprivileged
user ownership of that, but your usage model may actually make that
easier.  More below...

> One thing I thought about but you don't seem to like it ... was to use
> the need to represent the partitionable entity as groups in sysfs that I
> talked about earlier. Those could have per-device subdirs with the usual
> config & resource files, same semantic as the ones in the real device,
> but when accessed via the group they get filtering. It might or might not
> be practical in the end, TBD, but it would allow apps using a slightly
> modified libpci for example to exploit some of this.

I may be tainted by our disagreement that all the devices in a group
need to be exposed to the guest and qemu could just take a pointer to a
sysfs directory.  That seems very unlike qemu and pushes more of the
policy into qemu, which seems like the wrong direction.

>  - The qemu vfio code hooks directly into ioapic ... of course that
> won't fly with anything !x86

I spent a lot of time looking for an architecture neutral solution here,
but I don't think it exists.  Please prove me wrong.  The problem is
that we have to disable INTx on an assigned device after it fires (VFIO
does this automatically).  If we don't do this, a non-responsive or
malicious guest could sit on the interrupt, causing it to fire
repeatedly as a DoS on the host.  The only indication that we can rely
on to re-enable INTx is when the guest CPU writes an EOI to the APIC.
We can't just wait for device accesses because a) the device CSRs are
(hopefully) direct mapped and we'd have to slow map them or attempt to
do some kind of dirty logging to detect when they're accessed, and b) what
constitutes an interrupt service is device specific.

That means we need to figure out how PCI interrupt 'A' (or B...)
translates to a GSI (Global System Interrupt - ACPI definition, but
hopefully a generic concept).  That GSI identifies a pin on an IOAPIC,
which will also see the APIC EOI.  And just to spice things up, the
guest can change the PCI to GSI mappings via ACPI.  I think the set of
callbacks I've added are generic (maybe I left ioapic in the name), but
yes they do need to be implemented for other architectures.  Patches
appreciated from those with knowledge of the systems and/or access to
device specs.  This is the only reason that I make QEMU VFIO only build
for x86.

>  - The various "objects" dealt with here, -especially- interrupts and
> iommu, need a better in-kernel API so that fast in-kernel emulation can
> take over from qemu based emulation. The way we need to do some of this
> on POWER differs from x86. We can elaborate later, it's not necessarily
> a killer either but essentially we'll take the bulk of interrupt
> handling away from VFIO to the point where it won't see any of it at
> all.

The plan for x86 is to connect VFIO eventfds directly to KVM irqfds and
bypass QEMU.  This is exactly what VHOST does today and fairly trivial
to enable for MSI once we get it merged.  INTx would require us to be
able to define a level triggered irqfd in KVM and it's not yet clear if
we care that much about INTx performance.

We don't currently have a plan for accelerating IOMMU access since our
current usage model doesn't need one.  We also need to consider MSI-X
table acceleration for x86.  I hope we'll be able to use the new KVM
ioctls for this.

>   - Non-PCI devices. That's a hot topic for embedded. I think the vast
> majority here is platform devices. There's quite a bit of vfio that
> isn't intrinsically PCI specific. We could have an in-kernel platform
> driver like we have an in-kernel PCI driver to attach to. The mapping of
> resources to userspace is rather generic, as goes for interrupts. I
> don't know whether that idea can be pushed much further, I don't have
> the bandwidth to look into it much at this point, but maybe it would be
> possible to refactor vfio a bit to better separate what is PCI specific
> to what is not. The idea would be to move the PCI specific bits to
> inside the "placeholder" PCI driver, and same goes for platform bits.
> "generic" ioctl's go to VFIO core; anything it doesn't handle, it
> passes to the driver, which allows the PCI one to handle things
> differently than the platform one, maybe an amba one while at it,
> etc.... just a thought, I haven't gone into the details at all.

This is on my radar, but I don't have a good model for it either.  I
suspect there won't be a whole lot left of VFIO if we make all the PCI
bits optional.  The right approach might be to figure out what's missing
between UIO and VFIO for non-PCI, implement that as a driver, then see
if we can base VFIO on using that for MMIO/PIO/INTx, leaving config and
MSI as a VFIO layer on top of the new UIO driver.

> I think that's all I had on my plate today, it's a long enough email
> anyway :-) Anthony suggested we put that on a wiki, I'm a bit
> wiki-disabled myself so he proposed to pick up my email and do that. We
> should probably discuss the various items in here separately as
> different threads to avoid too much confusion.
> 
> One other thing we should do on our side is publish somewhere our
> current hacks to get you an idea of where we are going and what we had
> to do (code speaks more than words). We'll try to do that asap, possibly
> next week.
> 
> Note that I'll be on/off the next few weeks, travelling and doing
> bringup. So expect latency in my replies.

Thanks for the write up, I think it will be good to let everyone digest
it before we discuss this at KVM forum.

Rather than your "groups" idea, I've been mulling over whether we can
just expose the dependencies, configuration, and capabilities in sysfs
and build qemu commandlines to describe it.  For instance, if we simply
start with creating iommu nodes in sysfs, we could create links under
each iommu directory to the devices behind them.  Some kind of
capability file could define properties like whether it's page table
based or fixed iova window or the granularity of mapping the devices
behind it.  Once we have that, we could probably make uiommu attach to
each of those nodes.

That means we know /dev/uiommu7 (random example) is our access to a
specific iommu with a given set of devices behind it.  If that iommu is
a PE (via those capability files), then a user space entity (trying hard
not to call it libvirt) can unbind all those devices from the host,
maybe bind the ones it wants to assign to a guest to vfio and bind the
others to pci-stub for safe keeping.  If you trust a user with
everything in a PE, bind all the devices to VFIO, chown all
the /dev/vfioX entries for those devices, and the /dev/uiommuX device.

We might then come up with qemu command lines to describe interesting
configurations, such as:

-device iommu,model=PAPR,uiommu=/dev/uiommu7,id=iommu.0 \
-device pci-bus,...,iommu=iommu0,id=pci.0 \
-device vfio,host=ssss:bb:dd.f,bus=pci.0,addr=dd.f,id=hostdev0

The userspace entity would obviously need to put things in the same PE
in the right place, but it doesn't seem to take a lot of sysfs info to
get that right.

Today we do DMA mapping via the VFIO device because the capabilities of
the IOMMU domains change depending on which devices are connected (for
VT-d, the least common denominator of the IOMMUs in play).  Forcing the
DMA mappings through VFIO naturally forces the call order.  If we moved
to something like above, we could switch the DMA mapping to the uiommu
device, since the IOMMU would have fixed capabilities.

What gaps would something like this leave for your IOMMU granularity
problems?  I'll need to think through how it works when we don't want to
expose the iommu to the guest, maybe a model=none (default) that doesn't
need to be connected to a pci bus and maps all guest memory.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [Qemu-devel] kvm PCI assignment & VFIO ramblings
@ 2011-07-30 18:20   ` Alex Williamson
  0 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-07-30 18:20 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: chrisw, Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	qemu-devel, David Gibson, aafabbri, iommu, linuxppc-dev, benve

On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote:
> Hi folks !
> 
> So I promised Anthony I would try to summarize some of the comments &
> issues we have vs. VFIO after we've tried to use it for PCI pass-through
> on POWER. It's pretty long, there are various items with more or less
> impact, some of it is easily fixable, some are API issues, and we'll
> probably want to discuss them separately, but for now here's a brain
> dump.

Thanks Ben.  For those wondering what happened to VFIO and where it
lives now, Tom Lyon turned it over to me.  I've been continuing to hack
and bug fix and prep it for upstream.  My trees are here:

git://github.com/awilliam/linux-vfio.git vfio
git://github.com/awilliam/qemu-vfio.git vfio

I was hoping we were close to being ready for an upstream push, but we
obviously need to work through the issues Ben and company have been
hitting.

> David, Alexei, please make sure I haven't missed anything :-)
> 
> * Granularity of pass-through
> 
> So let's first start with what is probably the main issue and the most
> contentious, which is the problem of dealing with the various
> constraints which define the granularity of pass-through, along with
> exploiting features like the VTd iommu domains.
> 
> For the sake of clarity, let me first talk a bit about the "granularity"
> issue I've mentioned above.
> 
> There are various constraints that can/will force several devices to be
> "owned" by the same guest and on the same side of the host/guest
> boundary. This is generally because some kind of HW resource is shared
> and thus not doing so would break the isolation barrier and enable a
> guest to disrupt the operations of the host and/or another guest.
> 
> Some of those constraints are well known, such as shared interrupts. Some
> are more subtle, for example, if a PCIe->PCI bridge exists in the system,
> there is no way for the iommu to identify transactions from devices
> coming from the PCI segment of that bridge with a granularity other than
> "behind the bridge". So typically an EHCI/OHCI/OHCI combo (a classic)
> behind such a bridge must be treated as a single "entity" for
> pass-through purposes.

On x86, the USB controllers don't typically live behind a PCIe-to-PCI
bridge, so they don't suffer the source identifier problem, but they do
often share an interrupt.  But even then, we can count on most modern
devices supporting PCI 2.3, and thus the DisINTx feature, which allows
us to share interrupts.  In any case, yes, it's rarer, but we need to
know how to handle devices behind PCI bridges.  However, I disagree that
we need to assign all the devices behind such a bridge to the guest.
There's a difference between removing the device from the host and
exposing the device to the guest.  If I have a NIC and HBA behind a
bridge, it's perfectly reasonable that I might only assign the NIC to
the guest, but as you describe, we then need to prevent the host, or any
other guest from making use of the HBA.
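For reference, the DisINTx knob mentioned above is just bit 10 of the
PCI command register; a minimal sketch of the mask/unmask computation
(not the actual VFIO code, just the bit twiddling involved):

```c
#include <stdint.h>

/* PCI 2.3 command register bit that masks legacy INTx assertion
 * (PCI_COMMAND_INTX_DISABLE, bit 10, in Linux's pci_regs.h). */
#define PCI_COMMAND_INTX_DISABLE 0x400

/* Return the command word to write back with a 16-bit config access. */
static uint16_t pci_command_intx(uint16_t cmd, int masked)
{
    if (masked)
        return (uint16_t)(cmd | PCI_COMMAND_INTX_DISABLE);
    return (uint16_t)(cmd & ~PCI_COMMAND_INTX_DISABLE);
}
```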

> In IBM POWER land, we call this a "partitionable endpoint" (the term
> "endpoint" here is historic, such a PE can be made of several PCIe
> "endpoints"). I think "partitionable" is a pretty good name tho to
> represent the constraints, so I'll call this a "partitionable group"
> from now on. 
> 
> Other examples of such HW imposed constraints can be a shared iommu with
> no filtering capability (some older POWER hardware which we might want
> to support falls into that category, each PCI host bridge is its own
> domain but doesn't have a finer granularity... however those machines
> tend to have a lot of host bridges :)
> 
> If we are ever going to consider applying some of this to non-PCI
> devices (see the ongoing discussions here), then we will be faced with
> the craziness of embedded designers, which probably means all sorts of
> new constraints we can't even begin to think about.
> 
> This leads me to those initial conclusions:
> 
> - The -minimum- granularity of pass-through is not always a single
> device and not always under SW control

But IMHO, we need to preserve the granularity of exposing a device to a
guest as a single device.  That might mean some devices are held hostage
by an agent on the host.

> - Having a magic heuristic in libvirt to figure out those constraints is
> WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
> knowledge of PCI resource management and getting it wrong in many many
> cases, something that took years to fix essentially by ripping it all
> out. This is kernel knowledge and thus we need the kernel to expose in a
> way or another what those constraints are, what those "partitionable
> groups" are.
> 
> - That does -not- mean that we cannot specify for each individual device
> within such a group where we want to put it in qemu (what devfn etc...).
> As long as there is a clear understanding that the "ownership" of the
> device goes with the group, this is somewhat orthogonal to how they are
> represented in qemu. (Not completely... if the iommu is exposed to the
> guest, via paravirt for example, some of these constraints must be
> exposed but I'll talk about that more later).

Or we can choose not to expose all of the devices in the group to the
guest?

> The interface currently proposed for VFIO (and associated uiommu)
> doesn't handle that problem at all. Instead, it is entirely centered
> around a specific "feature" of the VTd iommu's for creating arbitrary
> domains with arbitrary devices (tho those devices -do- have the same
> constraints exposed above, don't try to put 2 legacy PCI devices behind
> the same bridge into 2 different domains !), but the API totally ignores
> the problem, leaves it to libvirt "magic foo" and focuses on something
> that is both quite secondary in the grand scheme of things, and quite
> x86 VTd specific in the implementation and API definition.

To be fair, libvirt's "magic foo" is built out of the necessity that
nobody else is defining the rules.

> Now, I'm not saying these programmable iommu domains aren't a nice
> feature and that we shouldn't exploit them when available, but as it is,
> it is too much a central part of the API.
> 
> I'll talk a little bit more about recent POWER iommu's here to
> illustrate where I'm coming from with my idea of groups:
> 
> On p7ioc (the IO chip used on recent P7 machines), there -is- a concept
> of domain and a per-RID filtering. However it differs from VTd in a few
> ways:
> 
> The "domains" (aka PEs) encompass more than just an iommu filtering
> scheme. The MMIO space and PIO space are also segmented, and those
> segments assigned to domains. Interrupts (well, MSI ports at least) are
> assigned to domains. Inbound PCIe error messages are targeted to
> domains, etc...
> 
> Basically, the PEs provide a very strong isolation feature which
> includes errors, and has the ability to immediately "isolate" a PE on
> the first occurrence of an error. For example, if an inbound PCIe error
> is signaled by a device on a PE or such a device does a DMA to a
> non-authorized address, the whole PE gets into error state. All
> subsequent stores (both DMA and MMIO) are swallowed and reads return all
> 1's, interrupts are blocked. This is designed to prevent any propagation
> of bad data, which is a very important feature in large high reliability
> systems.
> 
> Software then has the ability to selectively turn back on MMIO and/or
> DMA, perform diagnostics, reset devices etc...
> 
> Because the domains encompass more than just DMA, but also segment the
> MMIO space, it is not practical at all to dynamically reconfigure them
> at runtime to "move" devices into domains. The firmware or early kernel
> code (it depends) will assign devices BARs using an algorithm that keeps
> them within PE segment boundaries, etc....
> 
> Additionally (and this is indeed a "restriction" compared to VTd, though
> I expect our future IO chips to lift it to some extent), PE don't get
> separate DMA address spaces. There is one 64-bit DMA address space per
> PCI host bridge, and it is 'segmented' with each segment being assigned
> to a PE. Due to the way PE assignment works in hardware, it is not
> practical to make several devices share a segment unless they are on the
> same bus. Also the resulting limit in the amount of 32-bit DMA space a
> device can access means that it's impractical to put too many devices in
> a PE anyways. (This is clearly designed for paravirt iommu, I'll talk
> more about that later).
> 
> The above essentially extends the granularity requirement (or rather is
> another factor defining what the granularity of partitionable entities
> is). You can think of it as "pre-existing" domains.
> 
> I believe the way to solve that is to introduce a kernel interface to
> expose those "partitionable entities" to userspace. In addition, it
> occurs to me that the ability to manipulate VTd domains essentially
> boils down to manipulating those groups (creating larger ones with
> individual components).
> 
> I like the idea of defining / playing with those groups statically
> (using a command line tool or sysfs, possibly having a config file
> defining them in a persistent way) rather than having their lifetime
> tied to a uiommu file descriptor.
> 
> It also makes it a LOT easier to have a channel to manipulate
> platform/arch specific attributes of those domains if any.
> 
> So we could define an API or representation in sysfs that exposes what
> the partitionable entities are, and we may add to it an API to
> manipulate them. But we don't have to and I'm happy to keep the
> additional SW grouping you can do on VTd as a separate "add-on" API
> (tho I don't like at all the way it works with uiommu). However, qemu
> needs to know what the grouping is regardless of the domains, and it's
> not nice if it has to manipulate two different concepts here so
> eventually those "partitionable entities" from a qemu standpoint must
> look like domains.
> 
> My main point is that I don't want the "knowledge" here to be in libvirt
> or qemu. In fact, I want to be able to do something as simple as passing
> a reference to a PE to qemu (sysfs path ?) and have it just pickup all
> the devices in there and expose them to the guest.
> 
> This can be done in a way that isn't PCI specific as well (the
> definition of the groups and what is grouped would obviously be
> somewhat bus specific and handled by platform code in the kernel).
> 
> Maybe something like /sys/devgroups ? This probably warrants involving
> more kernel people into the discussion.

I don't yet buy into passing groups to qemu since I don't buy into the
idea of always exposing all of those devices to qemu.  Would it be
sufficient to expose iommu nodes in sysfs that link to the devices
behind them and describe properties and capabilities of the iommu
itself?  More on this at the end.

> * IOMMU
> 
> Now more on iommu. I've described I think in enough details how ours
> work, there are others, I don't know what freescale or ARM are doing,
> sparc doesn't quite work like VTd either, etc...
> 
> The main problem isn't that much the mechanics of the iommu but really
> how it's exposed (or not) to guests.
> 
> VFIO here is basically designed for one and only one thing: expose the
> entire guest physical address space to the device more/less 1:1.
> 
> This means:
> 
>   - It only works with iommu's that provide complete DMA address spaces
> to devices. Won't work with a single 'segmented' address space like we
> have on POWER.
> 
>   - It requires the guest to be pinned. Pass-through -> no more swap
> 
>   - The guest cannot make use of the iommu to deal with 32-bit DMA
> devices, thus a guest with more than a few G of RAM (I don't know the
> exact limit on x86, depends on your IO hole I suppose), and you end up
> back to swiotlb & bounce buffering.
> 
>   - It doesn't work for POWER server anyways because of our need to
> provide a paravirt iommu interface to the guest since that's how pHyp
> works today and how existing OSes expect to operate.
> 
> Now some of this can be fixed with tweaks, and we've started doing it
> (we have a working pass-through using VFIO, forgot to mention that, it's
> just that we don't like what we had to do to get there).

This is a result of wanting to support *unmodified* x86 guests.  We
don't have the luxury of having a predefined pvDMA spec that all x86
OSes adhere to.  The 32bit problem is unfortunate, but the priority use
case for assigning devices to guests is high performance I/O, which
usually entails modern, 64bit hardware.  I'd like to see us get to the
point of having emulated IOMMU hardware on x86, which could then be
backed by VFIO, but for now guest pinning is the most practical and
useful.

> Basically, what we do today is:
> 
> - We add an ioctl to VFIO to expose to qemu the segment information. IE.
> What is the DMA address and size of the DMA "window" usable for a given
> device. This is a tweak, that should really be handled at the "domain"
> level.
> 
> That current hack won't work well if two devices share an iommu. Note
> that we have an additional constraint here due to our paravirt
> interfaces (specified in PAPR) which is that PE domains must have a
> common parent. Basically, pHyp makes them look like a PCIe host bridge
> per domain in the guest. I think that's a pretty good idea and qemu
> might want to do the same.
> 
> - We hack out the currently unconditional mapping of the entire guest
> space in the iommu. Something will have to be done to "decide" whether
> to do that or not ... qemu argument -> ioctl ?
> 
> - We hook up the paravirt call to insert/remove a translation from the
> iommu to the VFIO map/unmap ioctl's.
> 
> This limps along but it's not great. Some of the problems are:
> 
> - I've already mentioned, the domain problem again :-) 
> 
> - Performance sucks of course, the vfio map ioctl wasn't meant for that
> and has quite a bit of overhead. However we'll want to do the paravirt
> call directly in the kernel eventually ...
> 
>   - ... which isn't trivial to get back to our underlying arch specific
> iommu object from there. We'll probably need a set of arch specific
> "sideband" ioctl's to "register" our paravirt iommu "bus numbers" and
> link them to the real thing kernel-side.
> 
> - PAPR (the specification of our paravirt interface and the expectation
> of current OSes) wants iommu pages to be 4k by default, regardless of
> the kernel host page size, which makes things a bit tricky since our
> enterprise host kernels have a 64k base page size. Additionally, we have
> new PAPR interfaces that we want to exploit, to allow the guest to
> create secondary iommu segments (in 64-bit space), which can be used
> (under guest control) to do things like map the entire guest (here it
> is :-) or use larger iommu page sizes (if permitted by the host kernel,
> in our case we could allow 64k iommu page size with a 64k host kernel).
> 
> The above means we need arch specific APIs. So arch specific vfio
> ioctl's, either that or kvm ones going to vfio or something ... the
> current structure of vfio/kvm interaction doesn't make it easy.

FYI, we also have large page support for x86 VT-d, but it seems to only
be opportunistic right now.  I'll try to come back to the rest of this
below.

> * IO space
> 
> On most (if not all) non-x86 archs, each PCI host bridge provide a
> completely separate PCI address space. Qemu doesn't deal with that very
> well. For MMIO it can be handled since those PCI address spaces are
> "remapped" holes in the main CPU address space so devices can be
> registered by using BAR + offset of that window in qemu MMIO mapping.
> 
> For PIO things get nasty. We have totally separate PIO spaces and qemu
> doesn't seem to like that. We can try to play the offset trick as well,
> we haven't tried yet, but basically that's another one to fix. Not a
> huge deal I suppose but heh ...
> 
> Also our next generation chipset may drop support for PIO completely.
> 
> On the other hand, because PIO is just a special range of MMIO for us,
> we can do normal pass-through on it and don't need any of the emulation
> done in qemu.

Maybe we can add mmap support to PIO regions on non-x86.

>   * MMIO constraints
> 
> The QEMU side VFIO code hard wires various constraints that are entirely
> based on various requirements you decided you have on x86 but don't
> necessarily apply to us :-)
> 
> Due to our paravirt nature, we don't need to masquerade the MSI-X table
> for example. At all. If the guest configures crap into it, too bad, it
> can only shoot itself in the foot since the host bridge enforce
> validation anyways as I explained earlier. Because it's all paravirt, we
> don't need to "translate" the interrupt vectors & addresses, the guest
> will call hypercalls to configure things anyways.

With interrupt remapping, we can allow the guest access to the MSI-X
table, but since that takes the host out of the loop, there's
effectively no way for the guest to correctly program it directly by
itself.

> We don't need to prevent MMIO pass-through for small BARs at all. This
> should be some kind of capability or flag passed by the arch. Our
> segmentation of the MMIO domain means that we can give entire segments
> to the guest and let it access anything in there (those segments are a
> multiple of the page size always). Worst case it will access outside of
> a device BAR within a segment and will cause the PE to go into error
> state, shooting itself in the foot, there is no risk of side effect
> outside of the guest boundaries.

Sure, this could be some kind of capability flag, maybe even implicit in
certain configurations.
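As an aside, the small-BAR restriction on the x86 side really boils
down to mmap being page-granular: whether a BAR (or, on POWER, a whole
segment) can be safely direct-mapped is just an alignment question.
A toy helper, purely illustrative:

```c
#include <stdint.h>

/* Round an MMIO window out to page boundaries: mmap can only hand out
 * whole pages, so anything else sharing those pages gets exposed too.
 * That's why sub-page BARs get trapped on x86; with page-multiple
 * segments the window is safe to map as-is. */
static void align_window(uint64_t start, uint64_t len, uint64_t pagesz,
                         uint64_t *mstart, uint64_t *mlen)
{
    uint64_t off = start & (pagesz - 1);

    *mstart = start - off;
    *mlen   = (off + len + pagesz - 1) & ~(pagesz - 1);
}
```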

> In fact, we don't even need to emulate BAR sizing etc... in theory. Our
> paravirt guests expect the BARs to have been already allocated for them
> by the firmware and will pick up the addresses from the device-tree :-)
> 
> Today we use a "hack", putting all 0's in there and triggering the linux
> code path to reassign unassigned resources (which will use BAR
> emulation) but that's not what we are -supposed- to do. Not a big deal
> and having the emulation there won't -hurt- us, it's just that we don't
> really need any of it.
> 
> We have a small issue with ROMs. Our current KVM only works with huge
> pages for guest memory but that is being fixed. So the way qemu maps the
> ROM copy into the guest address space doesn't work. It might be handy
> anyways to have a way for qemu to use MMIO emulation for ROM access as a
> fallback. I'll look into it.

So that means ROMs don't work for you on emulated devices either?  The
reason we read it once and map it into the guest is because Michael
Tsirkin found a section in the PCI spec that indicates devices can share
address decoders between BARs and ROM.  This means we can't just leave
the enabled bit set in the ROM BAR, because it could actually disable an
address decoder for a regular BAR.  We could slow-map the actual ROM,
enabling it around each read, but shadowing it seemed far more
efficient.
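For anyone following along: a shadowed copy can at least be
sanity-checked against the expansion ROM signature from the PCI spec.
A trivial check, assuming a raw byte copy of the ROM:

```c
#include <stddef.h>
#include <stdint.h>

/* PCI expansion ROM images begin with the bytes 0x55 0xAA
 * (the 0xAA55 signature, stored little-endian). */
static int rom_signature_ok(const uint8_t *rom, size_t len)
{
    return len >= 2 && rom[0] == 0x55 && rom[1] == 0xAA;
}
```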

>   * EEH
> 
> This is the name of those fancy error handling & isolation features I
> mentioned earlier. To some extent it's a superset of AER, but we don't
> generally expose AER to guests (or even the host), it's swallowed by
> firmware into something else that provides a superset (well mostly) of
> the AER information, and allow us to do those additional things like
> isolating/de-isolating, reset control etc...
> 
> Here too, we'll need arch specific APIs through VFIO. Not necessarily a
> huge deal, I mention it for completeness.

We expect to do AER via the VFIO netlink interface, which even though
it's bashed below, would be quite extensible to supporting different
kinds of errors.

>    * Misc
> 
> There's lots of small bits and pieces... in no special order:
> 
>  - netlink ? WTF ! Seriously, we don't need a hybrid API with a bit of
> netlink and a bit of ioctl's ... it's not like there's something
> fundamentally  better for netlink vs. ioctl... it really depends what
> you are doing, and in this case I fail to see what netlink brings you
> other than bloat and more stupid userspace library deps.

The netlink interface is primarily for host->guest signaling.  I've only
implemented the remove command (since we're lacking a pcie-host in qemu
to do AER), but it seems to work quite well.  If you have suggestions
for how else we might do it, please let me know.  This seems to be the
sort of thing netlink is supposed to be used for.

>  - I don't like too much the fact that VFIO provides yet another
> different API to do what we already have at least 2 kernel APIs for, ie,
> BAR mapping and config space access. At least it should be better at
> using the backend infrastructure of the 2 others (sysfs & procfs). I
> understand it wants to filter in some case (config space) and -maybe-
> yet another API is the right way to go but allow me to have my doubts.

The use of PCI sysfs is actually one of my complaints about current
device assignment.  To do assignment with an unprivileged guest we need
to open the PCI sysfs config file for it, then change ownership on a
handful of other PCI sysfs files, then there's this other pci-stub thing
to maintain ownership, but the kvm ioctls don't actually require it and
can grab onto any free device...  We are duplicating some of that in
VFIO, but we also put the ownership of the device behind a single device
file.  We do have the uiommu problem that we can't give an unprivileged
user ownership of that, but your usage model may actually make that
easier.  More below...

> One thing I thought about but you don't seem to like it ... was to use
> the need to represent the partitionable entity as groups in sysfs that I
> talked about earlier. Those could have per-device subdirs with the usual
> config & resource files, same semantic as the ones in the real device,
> but when accessed via the group they get filtering. It might or might
> not be practical in the end, TBD, but it would allow apps using a slightly
> modified libpci for example to exploit some of this.

I may be tainted by our disagreement that all the devices in a group
need to be exposed to the guest and qemu could just take a pointer to a
sysfs directory.  That seems very unlike qemu and pushes more of the
policy into qemu, which seems like the wrong direction.

>  - The qemu vfio code hooks directly into ioapic ... of course that
> won't fly with anything !x86

I spent a lot of time looking for an architecture neutral solution here,
but I don't think it exists.  Please prove me wrong.  The problem is
that we have to disable INTx on an assigned device after it fires (VFIO
does this automatically).  If we don't do this, a non-responsive or
malicious guest could sit on the interrupt, causing it to fire
repeatedly as a DoS on the host.  The only indication that we can rely
on to re-enable INTx is when the guest CPU writes an EOI to the APIC.
We can't just wait for device accesses because a) the device CSRs are
(hopefully) direct mapped and we'd have to slow-map them or attempt to
do some kind of dirty logging to detect when they're accessed, and b)
what constitutes an interrupt service is device specific.

That means we need to figure out how PCI interrupt 'A' (or B...)
translates to a GSI (Global System Interrupt - ACPI definition, but
hopefully a generic concept).  That GSI identifies a pin on an IOAPIC,
which will also see the APIC EOI.  And just to spice things up, the
guest can change the PCI to GSI mappings via ACPI.  I think the set of
callbacks I've added are generic (maybe I left ioapic in the name), but
yes they do need to be implemented for other architectures.  Patches
appreciated from those with knowledge of the systems and/or access to
device specs.  This is the only reason that I make QEMU VFIO only build
for x86.
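The flow above, boiled down to a toy state machine (hypothetical types,
just to pin down the ordering, nothing resembling the real code):

```c
/* Host-side INTx handling as described above: mask on fire, inject,
 * and only re-enable once the guest EOIs the matching APIC pin. */
enum intx_state { INTX_ENABLED, INTX_MASKED };

struct intx {
    enum intx_state state;
    int pending;            /* interrupt injected into the guest */
};

/* Host IRQ handler: set DisINTx so the level interrupt stops
 * re-firing, then inject into the guest. */
static void intx_fire(struct intx *i)
{
    i->state = INTX_MASKED;
    i->pending = 1;
}

/* Callback when the guest CPU writes an EOI to the APIC. */
static void intx_guest_eoi(struct intx *i)
{
    i->pending = 0;
    i->state = INTX_ENABLED;
}
```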

>  - The various "objects" dealt with here, -especially- interrupts and
> iommu, need a better in-kernel API so that fast in-kernel emulation can
> take over from qemu based emulation. The way we need to do some of this
> on POWER differs from x86. We can elaborate later, it's not necessarily
> a killer either but essentially we'll take the bulk of interrupt
> handling away from VFIO to the point where it won't see any of it at
> all.

The plan for x86 is to connect VFIO eventfds directly to KVM irqfds and
bypass QEMU.  This is exactly what VHOST does today and fairly trivial
to enable for MSI once we get it merged.  INTx would require us to be
able to define a level triggered irqfd in KVM and it's not yet clear if
we care that much about INTx performance.
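The mechanism is just an eventfd shared between the two: VFIO's
interrupt handler signals it, and KVM's irqfd (wired up with the
KVM_IRQFD ioctl, not shown here) consumes it and injects the interrupt
without ever bouncing through QEMU. The signalling half in isolation:

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Signal-and-consume on an eventfd: the write() is what the VFIO IRQ
 * handler does, the read() is what KVM's irqfd wait does.  read()
 * returns the counter accumulated since the last read and resets it. */
static int eventfd_roundtrip(void)
{
    uint64_t v = 1, got = 0;
    int fd = eventfd(0, 0);

    if (fd < 0)
        return -1;
    if (write(fd, &v, sizeof(v)) != sizeof(v) ||
        read(fd, &got, sizeof(got)) != sizeof(got))
        got = 0;
    close(fd);
    return (int)got;
}
```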

We don't currently have a plan for accelerating IOMMU access since our
current usage model doesn't need one.  We also need to consider MSI-X
table acceleration for x86.  I hope we'll be able to use the new KVM
ioctls for this.

>   - Non-PCI devices. That's a hot topic for embedded. I think the vast
> majority here is platform devices. There's quite a bit of vfio that
> isn't intrinsically PCI specific. We could have an in-kernel platform
> driver like we have an in-kernel PCI driver to attach to. The mapping of
> resources to userspace is rather generic, as goes for interrupts. I
> don't know whether that idea can be pushed much further, I don't have
> the bandwidth to look into it much at this point, but maybe it would be
> possible to refactor vfio a bit to better separate what is PCI specific
> from what is not. The idea would be to move the PCI specific bits
> inside the "placeholder" PCI driver, and the same goes for platform bits.
> "generic" ioctl's go to VFIO core; anything it doesn't handle, it
> passes to the driver, which allows the PCI one to handle things
> differently than the platform one, maybe an amba one while at it,
> etc.... just a thought, I haven't gone into the details at all.

This is on my radar, but I don't have a good model for it either.  I
suspect there won't be a whole lot left of VFIO if we make all the PCI
bits optional.  The right approach might be to figure out what's missing
between UIO and VFIO for non-PCI, implement that as a driver, then see
if we can base VFIO on using that for MMIO/PIO/INTx, leaving config and
MSI as a VFIO layer on top of the new UIO driver.

> I think that's all I had on my plate today, it's a long enough email
> anyway :-) Anthony suggested we put that on a wiki, I'm a bit
> wiki-disabled myself so he proposed to pick up my email and do that. We
> should probably discuss the various items in here separately as
> different threads to avoid too much confusion.
> 
> One other thing we should do on our side is publish somewhere our
> current hacks to get you an idea of where we are going and what we had
> to do (code speaks more than words). We'll try to do that asap, possibly
> next week.
> 
> Note that I'll be on/off the next few weeks, travelling and doing
> bringup. So expect latency in my replies.

Thanks for the write up, I think it will be good to let everyone digest
it before we discuss this at KVM forum.

Rather than your "groups" idea, I've been mulling over whether we can
just expose the dependencies, configuration, and capabilities in sysfs
and build qemu commandlines to describe it.  For instance, if we simply
start with creating iommu nodes in sysfs, we could create links under
each iommu directory to the devices behind them.  Some kind of
capability file could define properties like whether it's page table
based or fixed iova window or the granularity of mapping the devices
behind it.  Once we have that, we could probably make uiommu attach to
each of those nodes.
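To make that concrete, the sort of layout I have in mind (every name
below is hypothetical, nothing like this exists yet):

```
/sys/class/iommu/iommu7/
    capabilities          # page table vs. fixed iova window, granularity
    is_pe                 # whether this boundary is a partitionable endpoint
    devices/
        0000:01:00.0 -> ../../../../devices/pci0000:00/0000:01:00.0
        0000:01:00.1 -> ../../../../devices/pci0000:00/0000:01:00.1
/dev/uiommu7              # uiommu instance attached to this node
```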

That means we know /dev/uiommu7 (random example) is our access to a
specific iommu with a given set of devices behind it.  If that iommu is
a PE (via those capability files), then a user space entity (trying hard
not to call it libvirt) can unbind all those devices from the host,
maybe bind the ones it wants to assign to a guest to vfio and bind the
others to pci-stub for safe keeping.  If you trust a user with
everything in a PE, bind all the devices to VFIO, chown all
the /dev/vfioX entries for those devices, and the /dev/uiommuX device.

We might then come up with qemu command lines to describe interesting
configurations, such as:

-device iommu,model=PAPR,uiommu=/dev/uiommu7,id=iommu.0 \
-device pci-bus,...,iommu=iommu.0,id=pci.0 \
-device vfio,host=ssss:bb:dd.f,bus=pci.0,addr=dd.f,id=hostdev0

The userspace entity would obviously need to put things in the same PE
in the right place, but it doesn't seem to take a lot of sysfs info to
get that right.

Today we do DMA mapping via the VFIO device because the capabilities of
the IOMMU domains change depending on which devices are connected (for
VT-d, the least common denominator of the IOMMUs in play).  Forcing the
DMA mappings through VFIO naturally forces the call order.  If we moved
to something like above, we could switch the DMA mapping to the uiommu
device, since the IOMMU would have fixed capabilities.

What gaps would something like this leave for your IOMMU granularity
problems?  I'll need to think through how it works when we don't want to
expose the iommu to the guest, maybe a model=none (default) that doesn't
need to be connected to a pci bus and maps all guest memory.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-07-29 23:58 ` Benjamin Herrenschmidt
@ 2011-07-30 22:21   ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 322+ messages in thread
From: Benjamin Herrenschmidt @ 2011-07-30 22:21 UTC (permalink / raw)
  To: kvm
  Cc: Anthony Liguori, Alex Williamson, David Gibson, Paul Mackerras,
	Alexey Kardashevskiy, linux-pci, linuxppc-dev

On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote:
> Hi folks !
> 
> So I promised Anthony I would try to summarize some of the comments &
> issues we have vs. VFIO after we've tried to use it for PCI pass-through
> on POWER. It's pretty long, there are various items with more or less
> impact, some of it is easily fixable, some are API issues, and we'll
> probably want to discuss them separately, but for now here's a brain
> dump.
> 
> David, Alexei, please make sure I haven't missed anything :-)

And I think I have :-)

  * Config space

VFIO currently handles that as a byte stream. It's quite gross to be
honest and it's not right. You shouldn't lose access size information
between guest and host when performing real accesses.

Some config space registers can have side effects and not respecting
access sizes can be nasty.

Cheers,
Ben.

> * Granularity of pass-through
> 
> So let's first start with what is probably the main issue and the most
> contentious, which is the problem of dealing with the various
> constraints which define the granularity of pass-through, along with
> exploiting features like the VTd iommu domains.
> 
> For the sake of clarity, let me first talk a bit about the "granularity"
> issue I've mentioned above.
> 
> There are various constraints that can/will force several devices to be
> "owned" by the same guest and on the same side of the host/guest
> boundary. This is generally because some kind of HW resource is shared
> and thus not doing so would break the isolation barrier and enable a
> guest to disrupt the operations of the host and/or another guest.
> 
> Some of those constraints are well known, such as shared interrupts. Some
> are more subtle: for example, if a PCIe->PCI bridge exists in the system,
> there is no way for the iommu to identify transactions from devices
> coming from the PCI segment of that bridge with a granularity other than
> "behind the bridge". So typically an EHCI/OHCI/OHCI combo (a classic)
> behind such a bridge must be treated as a single "entity" for
> pass-through purposes.
> 
> In IBM POWER land, we call this a "partitionable endpoint" (the term
> "endpoint" here is historic, such a PE can be made of several PCIe
> "endpoints"). I think "partitionable" is a pretty good name tho to
> represent the constraints, so I'll call this a "partitionable group"
> from now on. 
> 
> Other examples of such HW imposed constraints can be a shared iommu with
> no filtering capability (some older POWER hardware which we might want
> to support falls into that category, each PCI host bridge is its own
> domain but doesn't have a finer granularity... however those machines
> tend to have a lot of host bridges :)
> 
> If we are ever going to consider applying some of this to non-PCI
> devices (see the ongoing discussions here), then we will be faced with
> the craziness of embedded designers, which probably means all sorts of
> new constraints we can't even begin to think about.
> 
> This leads me to those initial conclusions:
> 
> - The -minimum- granularity of pass-through is not always a single
> device and not always under SW control
> 
> - Having a magic heuristic in libvirt to figure out those constraints is
> WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
> knowledge of PCI resource management and getting it wrong in many many
> cases, something that took years to fix essentially by ripping it all
> out. This is kernel knowledge and thus we need the kernel to expose in
> one way or another what those constraints are, what those "partitionable
> groups" are.
> 
> - That does -not- mean that we cannot specify for each individual device
> within such a group where we want to put it in qemu (what devfn etc...).
> As long as there is a clear understanding that the "ownership" of the
> device goes with the group, this is somewhat orthogonal to how they are
> represented in qemu. (Not completely... if the iommu is exposed to the
> guest, via paravirt for example, some of these constraints must be
> exposed, but I'll talk about that more later).
> 
> The interface currently proposed for VFIO (and associated uiommu)
> doesn't handle that problem at all. Instead, it is entirely centered
> around a specific "feature" of the VTd iommu's for creating arbitrary
> domains with arbitrary devices (tho those devices -do- have the same
> constraints exposed above, don't try to put 2 legacy PCI devices behind
> the same bridge into 2 different domains !), but the API totally ignores
> the problem, leaves it to libvirt "magic foo" and focuses on something
> that is both quite secondary in the grand scheme of things, and quite
> x86 VTd specific in the implementation and API definition.
> 
> Now, I'm not saying these programmable iommu domains aren't a nice
> feature and that we shouldn't exploit them when available, but as it is,
> it is too much a central part of the API.
> 
> I'll talk a little bit more about recent POWER iommu's here to
> illustrate where I'm coming from with my idea of groups:
> 
> On p7ioc (the IO chip used on recent P7 machines), there -is- a concept
> of domain and a per-RID filtering. However it differs from VTd in a few
> ways:
> 
> The "domains" (aka PEs) encompass more than just an iommu filtering
> scheme. The MMIO space and PIO space are also segmented, and those
> segments assigned to domains. Interrupts (well, MSI ports at least) are
> assigned to domains. Inbound PCIe error messages are targeted to
> domains, etc...
> 
> Basically, the PEs provide a very strong isolation feature which
> includes errors, and has the ability to immediately "isolate" a PE on
> the first occurrence of an error. For example, if an inbound PCIe error
> is signaled by a device on a PE or such a device does a DMA to a
> non-authorized address, the whole PE gets into error state. All
> subsequent stores (both DMA and MMIO) are swallowed and reads return all
> 1's, interrupts are blocked. This is designed to prevent any propagation
> of bad data, which is a very important feature in large high reliability
> systems.
> 
> Software then has the ability to selectively turn back on MMIO and/or
> DMA, perform diagnostics, reset devices etc...
> 
> Because the domains encompass more than just DMA, but also segment the
> MMIO space, it is not practical at all to dynamically reconfigure them
> at runtime to "move" devices into domains. The firmware or early kernel
> code (it depends) will assign devices BARs using an algorithm that keeps
> them within PE segment boundaries, etc....
> 
> Additionally (and this is indeed a "restriction" compared to VTd, though
> I expect our future IO chips to lift it to some extent), PEs don't get
> separate DMA address spaces. There is one 64-bit DMA address space per
> PCI host bridge, and it is 'segmented' with each segment being assigned
> to a PE. Due to the way PE assignment works in hardware, it is not
> practical to make several devices share a segment unless they are on the
> same bus. Also the resulting limit in the amount of 32-bit DMA space a
> device can access means that it's impractical to put too many devices in
> a PE anyways. (This is clearly designed for paravirt iommu, I'll talk
> more about that later).
> 
> The above essentially extends the granularity requirement (or rather is
> another factor defining what the granularity of partitionable entities
> is). You can think of it as "pre-existing" domains.
> 
> I believe the way to solve that is to introduce a kernel interface to
> expose those "partitionable entities" to userspace. In addition, it
> occurs to me that the ability to manipulate VTd domains essentially
> boils down to manipulating those groups (creating larger ones with
> individual components).
> 
> I like the idea of defining / playing with those groups statically
> (using a command line tool or sysfs, possibly having a config file
> defining them in a persistent way) rather than having their lifetime
> tied to a uiommu file descriptor.
> 
> It also makes it a LOT easier to have a channel to manipulate
> platform/arch specific attributes of those domains if any.
> 
> So we could define an API or representation in sysfs that exposes what
> the partitionable entities are, and we may add to it an API to
> manipulate them. But we don't have to and I'm happy to keep the
> additional SW grouping you can do on VTd as a separate "add-on" API
> (tho I don't like at all the way it works with uiommu). However, qemu
> needs to know what the grouping is regardless of the domains, and it's
> not nice if it has to manipulate two different concepts here so
> eventually those "partitionable entities" from a qemu standpoint must
> look like domains.
> 
> My main point is that I don't want the "knowledge" here to be in libvirt
> or qemu. In fact, I want to be able to do something as simple as passing
> a reference to a PE to qemu (sysfs path ?) and have it just pickup all
> the devices in there and expose them to the guest.
> 
> This can be done in a way that isn't PCI specific as well (the
> definition of the groups and what is grouped would obviously be
> somewhat bus specific and handled by platform code in the kernel).
> 
> Maybe something like /sys/devgroups ? This probably warrants involving
> more kernel people into the discussion.
> 
> * IOMMU
> 
> Now more on iommu. I've described I think in enough details how ours
> work, there are others, I don't know what freescale or ARM are doing,
> sparc doesn't quite work like VTd either, etc...
> 
> The main problem isn't that much the mechanics of the iommu but really
> how it's exposed (or not) to guests.
> 
> VFIO here is basically designed for one and only one thing: expose the
> entire guest physical address space to the device more/less 1:1.
> 
> This means:
> 
>   - It only works with iommu's that provide complete DMA address spaces
> to devices. Won't work with a single 'segmented' address space like we
> have on POWER.
> 
>   - It requires the guest to be pinned. Pass-through -> no more swap
> 
>   - The guest cannot make use of the iommu to deal with 32-bit DMA
> devices; thus with a guest that has more than a few G of RAM (I don't
> know the exact limit on x86, depends on your IO hole I suppose), you end
> up back to swiotlb & bounce buffering.
> 
>   - It doesn't work for POWER server anyways because of our need to
> provide a paravirt iommu interface to the guest since that's how pHyp
> works today and how existing OSes expect to operate.
> 
> Now some of this can be fixed with tweaks, and we've started doing it
> (we have a working pass-through using VFIO, forgot to mention that, it's
> just that we don't like what we had to do to get there).
> 
> Basically, what we do today is:
> 
> - We add an ioctl to VFIO to expose to qemu the segment information. IE.
> What is the DMA address and size of the DMA "window" usable for a given
> device. This is a tweak, that should really be handled at the "domain"
> level.
> 
> That current hack won't work well if two devices share an iommu. Note
> that we have an additional constraint here due to our paravirt
> interfaces (specified in PAPR) which is that PE domains must have a
> common parent. Basically, pHyp makes them look like a PCIe host bridge
> per domain in the guest. I think that's a pretty good idea and qemu
> might want to do the same.
> 
> - We hack out the currently unconditional mapping of the entire guest
> space in the iommu. Something will have to be done to "decide" whether
> to do that or not ... qemu argument -> ioctl ?
> 
> - We hook up the paravirt call to insert/remove a translation from the
> iommu to the VFIO map/unmap ioctl's.
> 
> This limps along but it's not great. Some of the problems are:
> 
> - I've already mentioned, the domain problem again :-) 
> 
> - Performance sucks of course, the vfio map ioctl wasn't meant for that
> and has quite a bit of overhead. However we'll want to do the paravirt
> call directly in the kernel eventually ...
> 
>   - ... which isn't trivial to get back to our underlying arch specific
> iommu object from there. We'll probably need a set of arch specific
> "sideband" ioctl's to "register" our paravirt iommu "bus numbers" and
> link them to the real thing kernel-side.
> 
> - PAPR (the specification of our paravirt interface and the expectation
> of current OSes) wants iommu pages to be 4k by default, regardless of
> the kernel host page size, which makes things a bit tricky since our
> enterprise host kernels have a 64k base page size. Additionally, we have
> new PAPR interfaces that we want to exploit, to allow the guest to
> create secondary iommu segments (in 64-bit space), which can be used
> (under guest control) to do things like map the entire guest (here it
> is :-) or use larger iommu page sizes (if permitted by the host kernel,
> in our case we could allow 64k iommu page size with a 64k host kernel).
> 
> The above means we need arch specific APIs. So arch specific vfio
> ioctl's, either that or kvm ones going to vfio or something ... the
> current structure of vfio/kvm interaction doesn't make it easy.
> 
> * IO space
> 
> On most (if not all) non-x86 archs, each PCI host bridge provides a
> completely separate PCI address space. Qemu doesn't deal with that very
> well. For MMIO it can be handled since those PCI address spaces are
> "remapped" holes in the main CPU address space so devices can be
> registered by using BAR + offset of that window in qemu MMIO mapping.
> 
> For PIO things get nasty. We have totally separate PIO spaces and qemu
> doesn't seem to like that. We can try to play the offset trick as well,
> we haven't tried yet, but basically that's another one to fix. Not a
> huge deal I suppose but heh ...
> 
> Also our next generation chipset may drop support for PIO completely.
> 
> On the other hand, because PIO is just a special range of MMIO for us,
> we can do normal pass-through on it and don't need any of the emulation
> done by qemu.
> 
>   * MMIO constraints
> 
> The QEMU side VFIO code hard wires various constraints that are entirely
> based on various requirements you decided you have on x86 but don't
> necessarily apply to us :-)
> 
> Due to our paravirt nature, we don't need to masquerade the MSI-X table
> for example. At all. If the guest configures crap into it, too bad, it
> can only shoot itself in the foot since the host bridge enforces
> validation anyways as I explained earlier. Because it's all paravirt, we
> don't need to "translate" the interrupt vectors & addresses, the guest
> will make hypercalls to configure things anyways.
> 
> We don't need to prevent MMIO pass-through for small BARs at all. This
> should be some kind of capability or flag passed by the arch. Our
> segmentation of the MMIO domain means that we can give entire segments
> to the guest and let it access anything in there (those segments are a
> multiple of the page size always). Worst case it will access outside of
> a device BAR within a segment and will cause the PE to go into error
> state, shooting itself in the foot, there is no risk of side effect
> outside of the guest boundaries.
> 
> In fact, we don't even need to emulate BAR sizing etc... in theory. Our
> paravirt guests expect the BARs to have been already allocated for them
> by the firmware and will pick up the addresses from the device-tree :-)
> 
> Today we use a "hack", putting all 0's in there and triggering the linux
> code path to reassign unassigned resources (which will use BAR
> emulation) but that's not what we are -supposed- to do. Not a big deal
> and having the emulation there won't -hurt- us, it's just that we don't
> really need any of it.
> 
> We have a small issue with ROMs. Our current KVM only works with huge
> pages for guest memory but that is being fixed. So the way qemu maps the
> ROM copy into the guest address space doesn't work. It might be handy
> anyways to have a way for qemu to use MMIO emulation for ROM access as a
> fallback. I'll look into it.
> 
>   * EEH
> 
> This is the name of those fancy error handling & isolation features I
> mentioned earlier. To some extent it's a superset of AER, but we don't
> generally expose AER to guests (or even the host), it's swallowed by
> firmware into something else that provides a superset (well mostly) of
> the AER information, and allows us to do those additional things like
> isolating/de-isolating, reset control etc...
> 
> Here too, we'll need arch specific APIs through VFIO. Not necessarily a
> huge deal, I mention it for completeness.
> 
>    * Misc
> 
> There's lots of small bits and pieces... in no special order:
> 
>  - netlink ? WTF ! Seriously, we don't need a hybrid API with a bit of
> netlink and a bit of ioctl's ... it's not like there's something
> fundamentally  better for netlink vs. ioctl... it really depends what
> you are doing, and in this case I fail to see what netlink brings you
> other than bloat and more stupid userspace library deps.
> 
>  - I don't like too much the fact that VFIO provides yet another
> different API to do what we already have at least 2 kernel APIs for, ie,
> BAR mapping and config space access. At least it should be better at
> using the backend infrastructure of the 2 others (sysfs & procfs). I
> understand it wants to filter in some case (config space) and -maybe-
> yet another API is the right way to go but allow me to have my doubts.
> 
> One thing I thought about but you don't seem to like it ... was to use
> the need to represent the partitionable entity as groups in sysfs that I
> talked about earlier. Those could have per-device subdirs with the usual
> config & resource files, same semantic as the ones in the real device,
> but when accessed via the group they get filtering. It might or might not
> be practical in the end, tbd, but it would allow apps using a slightly
> modified libpci for example to exploit some of this.
> 
>  - The qemu vfio code hooks directly into ioapic ... of course that
> won't fly with anything !x86
> 
>  - The various "objects" dealt with here, -especially- interrupts and
> iommu, need a better in-kernel API so that fast in-kernel emulation can
> take over from qemu based emulation. The way we need to do some of this
> on POWER differs from x86. We can elaborate later, it's not necessarily
> a killer either but essentially we'll take the bulk of interrupt
> handling away from VFIO to the point where it won't see any of it at
> all.
> 
>   - Non-PCI devices. That's a hot topic for embedded. I think the vast
> majority here is platform devices. There's quite a bit of vfio that
> isn't intrinsically PCI specific. We could have an in-kernel platform
> driver like we have an in-kernel PCI driver to attach to. The mapping of
> resources to userspace is rather generic, as goes for interrupts. I
> don't know whether that idea can be pushed much further, I don't have
> the bandwidth to look into it much at this point, but maybe it would be
> possible to refactor vfio a bit to better separate what is PCI specific
> to what is not. The idea would be to move the PCI specific bits to
> inside the "placeholder" PCI driver, and same goes for platform bits.
> "generic" ioctl's go to VFIO core; anything it doesn't handle, it
> passes to the driver, which allows the PCI one to handle things
> differently than the platform one, maybe an amba one while at it,
> etc.... just a thought, I haven't gone into the details at all.
> 
> I think that's all I had on my plate today, it's a long enough email
> anyway :-) Anthony suggested we put that on a wiki, I'm a bit
> wiki-disabled myself so he proposed to pickup my email and do that. We
> should probably discuss the various items in here separately as
> different threads to avoid too much confusion.
> 
> One other thing we should do on our side is publish somewhere our
> current hacks to get you an idea of where we are going and what we had
> to do (code speaks more than words). We'll try to do that asap, possibly
> next week.
> 
> Note that I'll be on/off the next few weeks, travelling and doing
> bringup. So expect latency in my replies.
> 
> Cheers,
> Ben.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
@ 2011-07-30 22:21   ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 322+ messages in thread
From: Benjamin Herrenschmidt @ 2011-07-30 22:21 UTC (permalink / raw)
  To: kvm
  Cc: Alexey Kardashevskiy, Paul Mackerras, linux-pci, David Gibson,
	Alex Williamson, Anthony Liguori, linuxppc-dev

On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote:
> Hi folks !
> 
> So I promised Anthony I would try to summarize some of the comments &
> issues we have vs. VFIO after we've tried to use it for PCI pass-through
> on POWER. It's pretty long, there are various items with more or less
> impact, some of it is easily fixable, some are API issues, and we'll
> probably want to discuss them separately, but for now here's a brain
> dump.
> 
> David, Alexei, please make sure I haven't missed anything :-)

And I think I have :-)

  * Config space

VFIO currently handles that as a byte stream. It's quite gross to be
honest and it's not right. You shouldn't lose access size information
between guest and host when performing real accesses.

Some config space registers can have side effects and not respecting
access sizes can be nasty.

Cheers,
Ben.

> * Granularity of pass-through
> 
> So let's first start with what is probably the main issue and the most
> contentious, which is the problem of dealing with the various
> constraints which define the granularity of pass-through, along with
> exploiting features like the VTd iommu domains.
> 
> For the sake of clarity, let me first talk a bit about the "granularity"
> issue I've mentioned above.
> 
> There are various constraints that can/will force several devices to be
> "owned" by the same guest and on the same side of the host/guest
> boundary. This is generally because some kind of HW resource is shared
> and thus not doing so would break the isolation barrier and enable a
> guest to disrupt the operations of the host and/or another guest.
> 
> Some of those constraints are well know, such as shared interrupts. Some
> are more subtle, for example, if a PCIe->PCI bridge exist in the system,
> there is no way for the iommu to identify transactions from devices
> coming from the PCI segment of that bridge with a granularity other than
> "behind the bridge". So typically a EHCI/OHCI/OHCI combo (a classic)
> behind such a bridge must be treated as a single "entity" for
> pass-trough purposes.
> 
> In IBM POWER land, we call this a "partitionable endpoint" (the term
> "endpoint" here is historic, such a PE can be made of several PCIe
> "endpoints"). I think "partitionable" is a pretty good name tho to
> represent the constraints, so I'll call this a "partitionable group"
> from now on. 
> 
> Other examples of such HW imposed constraints can be a shared iommu with
> no filtering capability (some older POWER hardware which we might want
> to support fall into that category, each PCI host bridge is its own
> domain but doesn't have a finer granularity... however those machines
> tend to have a lot of host bridges :)
> 
> If we are ever going to consider applying some of this to non-PCI
> devices (see the ongoing discussions here), then we will be faced with
> the crazyness of embedded designers which probably means all sort of new
> constraints we can't even begin to think about
> 
> This leads me to those initial conclusions:
> 
> - The -minimum- granularity of pass-through is not always a single
> device and not always under SW control
> 
> - Having a magic heuristic in libvirt to figure out those constraints is
> WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
> knowledge of PCI resource management and getting it wrong in many many
> cases, something that took years to fix essentially by ripping it all
> out. This is kernel knowledge and thus we need the kernel to expose in a
> way or another what those constraints are, what those "partitionable
> groups" are.
> 
> - That does -not- mean that we cannot specify for each individual device
> within such a group where we want to put it in qemu (what devfn etc...).
> As long as there is a clear understanding that the "ownership" of the
> device goes with the group, this is somewhat orthogonal to how they are
> represented in qemu. (Not completely... if the iommu is exposed to the
> guest ,via paravirt for example, some of these constraints must be
> exposed but I'll talk about that more later).
> 
> The interface currently proposed for VFIO (and associated uiommu)
> doesn't handle that problem at all. Instead, it is entirely centered
> around a specific "feature" of the VTd iommu's for creating arbitrary
> domains with arbitrary devices (tho those devices -do- have the same
> constraints exposed above, don't try to put 2 legacy PCI devices behind
> the same bridge into 2 different domains !), but the API totally ignores
> the problem, leaves it to libvirt "magic foo" and focuses on something
> that is both quite secondary in the grand scheme of things, and quite
> x86 VTd specific in the implementation and API definition.
> 
> Now, I'm not saying these programmable iommu domains aren't a nice
> feature and that we shouldn't exploit them when available, but as it is,
> it is too much a central part of the API.
> 
> I'll talk a little bit more about recent POWER iommu's here to
> illustrate where I'm coming from with my idea of groups:
> 
> On p7ioc (the IO chip used on recent P7 machines), there -is- a concept
> of domain and a per-RID filtering. However it differs from VTd in a few
> ways:
> 
> The "domains" (aka PEs) encompass more than just an iommu filtering
> scheme. The MMIO space and PIO space are also segmented, and those
> segments assigned to domains. Interrupts (well, MSI ports at least) are
> assigned to domains. Inbound PCIe error messages are targeted to
> domains, etc...
> 
> Basically, the PEs provide a very strong isolation feature which
> includes errors, and has the ability to immediately "isolate" a PE on
> the first occurence of an error. For example, if an inbound PCIe error
> is signaled by a device on a PE or such a device does a DMA to a
> non-authorized address, the whole PE gets into error state. All
> subsequent stores (both DMA and MMIO) are swallowed and reads return all
> 1's, interrupts are blocked. This is designed to prevent any propagation
> of bad data, which is a very important feature in large high reliability
> systems.
> 
> Software then has the ability to selectively turn back on MMIO and/or
> DMA, perform diagnostics, reset devices etc...
> 
> Because the domains encompass more than just DMA, but also segment the
> MMIO space, it is not practical at all to dynamically reconfigure them
> at runtime to "move" devices into domains. The firmware or early kernel
> code (it depends) will assign devices BARs using an algorithm that keeps
> them within PE segment boundaries, etc....
> 
> Additionally (and this is indeed a "restriction" compared to VTd, though
> I expect our future IO chips to lift it to some extent), PE don't get
> separate DMA address spaces. There is one 64-bit DMA address space per
> PCI host bridge, and it is 'segmented' with each segment being assigned
> to a PE. Due to the way PE assignment works in hardware, it is not
> practical to make several devices share a segment unless they are on the
> same bus. Also the resulting limit in the amount of 32-bit DMA space a
> device can access means that it's impractical to put too many devices in
> a PE anyways. (This is clearly designed for paravirt iommu, I'll talk
> more about that later).
> 
> The above essentially extends the granularity requirement (or rather is
> another factor defining what the granularity of partitionable entities
> is). You can think of it as "pre-existing" domains.
> 
> I believe the way to solve that is to introduce a kernel interface to
> expose those "partitionable entities" to userspace. In addition, it
> occurs to me that the ability to manipulate VTd domains essentially
> boils down to manipulating those groups (creating larger ones with
> individual components).
> 
> I like the idea of defining / playing with those groups statically
> (using a command line tool or sysfs, possibly having a config file
> defining them in a persistent way) rather than having their lifetime
> tied to a uiommu file descriptor.
> 
> It also makes it a LOT easier to have a channel to manipulate
> platform/arch specific attributes of those domains if any.
> 
> So we could define an API or representation in sysfs that exposes what
> the partitionable entities are, and we may add to it an API to
> manipulate them. But we don't have to and I'm happy to keep the
> additional SW grouping you can do on VTd as a sepparate "add-on" API
> (tho I don't like at all the way it works with uiommu). However, qemu
> needs to know what the grouping is regardless of the domains, and it's
> not nice if it has to manipulate two different concepts here so
> eventually those "partitionable entities" from a qemu standpoint must
> look like domains.
> 
> My main point is that I don't want the "knowledge" here to be in libvirt
> or qemu. In fact, I want to be able to do something as simple as passing
> a reference to a PE to qemu (sysfs path ?) and have it just pickup all
> the devices in there and expose them to the guest.
> 
> This can be done in a way that isn't PCI specific as well (the
> definition of the groups and what is grouped would would obviously be
> somewhat bus specific and handled by platform code in the kernel).
> 
> Maybe something like /sys/devgroups ? This probably warrants involving
> more kernel people into the discussion.
> 
> * IOMMU
> 
> Now more on iommu. I've described I think in enough details how ours
> work, there are others, I don't know what freescale or ARM are doing,
> sparc doesn't quite work like VTd either, etc...
> 
> The main problem isn't that much the mechanics of the iommu but really
> how it's exposed (or not) to guests.
> 
> VFIO here is basically designed for one and only one thing: expose the
> entire guest physical address space to the device more/less 1:1.
> 
> This means:
> 
>   - It only works with iommu's that provide complete DMA address spaces
> to devices. Won't work with a single 'segmented' address space like we
> have on POWER.
> 
>   - It requires the guest to be pinned. Pass-through -> no more swap
> 
>   - The guest cannot make use of the iommu to deal with 32-bit DMA
> devices, thus a guest with more than a few G of RAM (I don't know the
> exact limit on x86, depends on your IO hole I suppose), and you end up
> back to swiotlb & bounce buffering.
> 
>   - It doesn't work for POWER server anyways because of our need to
> provide a paravirt iommu interface to the guest since that's how pHyp
> works today and how existing OSes expect to operate.
> 
> Now some of this can be fixed with tweaks, and we've started doing it
> (we have a working pass-through using VFIO, forgot to mention that, it's
> just that we don't like what we had to do to get there).
> 
> Basically, what we do today is:
> 
> - We add an ioctl to VFIO to expose to qemu the segment information. IE.
> What is the DMA address and size of the DMA "window" usable for a given
> device. This is a tweak, that should really be handled at the "domain"
> level.
> 
> That current hack won't work well if two devices share an iommu. Note
> that we have an additional constraint here due to our paravirt
> interfaces (specified in PAPR) which is that PE domains must have a
> common parent. Basically, pHyp makes them look like a PCIe host bridge
> per domain in the guest. I think that's a pretty good idea and qemu
> might want to do the same.
> 
> - We hack out the currently unconditional mapping of the entire guest
> space in the iommu. Something will have to be done to "decide" whether
> to do that or not ... qemu argument -> ioctl ?
> 
> - We hook up the paravirt call to insert/remove a translation from the
> iommu to the VFIO map/unmap ioctl's.
> 
> This limps along but it's not great. Some of the problems are:
> 
> - I've already mentioned, the domain problem again :-) 
> 
> - Performance sucks of course, the vfio map ioctl wasn't meant for that
> and has quite a bit of overhead. However we'll want to do the paravirt
> call directly in the kernel eventually ...
> 
>   - ... which isn't trivial to get back to our underlying arch specific
> iommu object from there. We'll probably need a set of arch specific
> "sideband" ioctl's to "register" our paravirt iommu "bus numbers" and
> link them to the real thing kernel-side.
> 
> - PAPR (the specification of our paravirt interface and the expectation
> of current OSes) wants iommu pages to be 4k by default, regardless of
> the kernel host page size, which makes things a bit tricky since our
> enterprise host kernels have a 64k base page size. Additionally, we have
> new PAPR interfaces that we want to exploit, to allow the guest to
> create secondary iommu segments (in 64-bit space), which can be used
> (under guest control) to do things like map the entire guest (here it
> is :-) or use larger iommu page sizes (if permitted by the host kernel,
> in our case we could allow 64k iommu page size with a 64k host kernel).
> 
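As a back-of-the-envelope sketch of that page-size mismatch (numbers only, no real TCE code): with a 64k host page and the PAPR-default 4k iommu page, each host page must be covered by 16 consecutive iommu entries:

```python
def tces_per_host_page(host_page=65536, iommu_page=4096):
    """How many iommu (TCE) entries cover one host page."""
    assert host_page % iommu_page == 0
    return host_page // iommu_page

def tce_slices(host_page_addr, host_page=65536, iommu_page=4096):
    """Real addresses of the iommu-page-sized slices of one host page,
    i.e. what 16 consecutive 4k TCEs would point at for a 64k page."""
    return [host_page_addr + i * iommu_page
            for i in range(tces_per_host_page(host_page, iommu_page))]

print(tces_per_host_page())     # 16
print(tce_slices(0x10000)[:2])  # [65536, 69632]
```

With the new PAPR 64k iommu page size on a 64k host kernel, the ratio collapses back to 1:1, which is the easy case.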
> The above means we need arch specific APIs. So arch specific vfio
> ioctl's, either that or kvm ones going to vfio or something ... the
> current structure of vfio/kvm interaction doesn't make it easy.
> 
> * IO space
> 
> On most (if not all) non-x86 archs, each PCI host bridge provides a
> completely separate PCI address space. Qemu doesn't deal with that very
> well. For MMIO it can be handled since those PCI address spaces are
> "remapped" holes in the main CPU address space so devices can be
> registered by using BAR + offset of that window in qemu MMIO mapping.
> 
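The "remapped hole" trick is just address arithmetic; a small sketch with invented example addresses (not real POWER or qemu values):

```python
def cpu_addr_for_pci(phb_window_cpu_base, phb_pci_base, pci_addr):
    """Translate a PCI MMIO address on one host bridge into the CPU
    address inside that bridge's remapped window. All values here are
    illustrative, not a real machine layout."""
    return phb_window_cpu_base + (pci_addr - phb_pci_base)

# A BAR at PCI address 0x8010_0000 behind a PHB whose MMIO window is
# remapped at CPU address 0x3FE0_0000_0000 (PCI base 0x8000_0000):
print(hex(cpu_addr_for_pci(0x3FE000000000, 0x80000000, 0x80100000)))
# 0x3fe000100000
```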
> For PIO things get nasty. We have totally separate PIO spaces and qemu
> doesn't seem to like that. We can try to play the offset trick as well,
> we haven't tried yet, but basically that's another one to fix. Not a
> huge deal I suppose but heh ...
> 
> Also our next generation chipset may drop support for PIO completely.
> 
> On the other hand, because PIO is just a special range of MMIO for us,
> we can do normal pass-through on it and don't need any of the emulation
> done by qemu.
> 
>   * MMIO constraints
> 
> The QEMU side VFIO code hard wires various constraints that are entirely
> based on various requirements you decided you have on x86 but don't
> necessarily apply to us :-)
> 
> Due to our paravirt nature, we don't need to masquerade the MSI-X table
> for example. At all. If the guest configures crap into it, too bad, it
> can only shoot itself in the foot since the host bridge enforces
> validation anyways as I explained earlier. Because it's all paravirt, we
> don't need to "translate" the interrupt vectors & addresses, the guest
> will call hypercalls to configure things anyways.
> 
> We don't need to prevent MMIO pass-through for small BARs at all. This
> should be some kind of capability or flag passed by the arch. Our
> segmentation of the MMIO domain means that we can give entire segments
> to the guest and let it access anything in there (those segments are a
> multiple of the page size always). Worst case it will access outside of
> a device BAR within a segment and will cause the PE to go into error
> state, shooting itself in the foot, there is no risk of side effect
> outside of the guest boundaries.
> 
> In fact, we don't even need to emulate BAR sizing etc... in theory. Our
> paravirt guests expect the BARs to have been already allocated for them
> by the firmware and will pick up the addresses from the device-tree :-)
> 
> Today we use a "hack", putting all 0's in there and triggering the linux
> code path to reassign unassigned resources (which will use BAR
> emulation) but that's not what we are -supposed- to do. Not a big deal
> and having the emulation there won't -hurt- us, it's just that we don't
> really need any of it.
> 
> We have a small issue with ROMs. Our current KVM only works with huge
> pages for guest memory but that is being fixed. So the way qemu maps the
> ROM copy into the guest address space doesn't work. It might be handy
> anyways to have a way for qemu to use MMIO emulation for ROM access as a
> fallback. I'll look into it.
> 
>   * EEH
> 
> This is the name of those fancy error handling & isolation features I
> mentioned earlier. To some extent it's a superset of AER, but we don't
> generally expose AER to guests (or even the host), it's swallowed by
> firmware into something else that provides a superset (well mostly) of
> the AER information, and allows us to do those additional things like
> isolating/de-isolating, reset control etc...
> 
> Here too, we'll need arch specific APIs through VFIO. Not necessarily a
> huge deal, I mention it for completeness.
> 
>    * Misc
> 
> There's lots of small bits and pieces... in no special order:
> 
>  - netlink ? WTF ! Seriously, we don't need a hybrid API with a bit of
> netlink and a bit of ioctl's ... it's not like there's something
> fundamentally better for netlink vs. ioctl... it really depends what
> you are doing, and in this case I fail to see what netlink brings you
> other than bloat and more stupid userspace library deps.
> 
>  - I don't like too much the fact that VFIO provides yet another
> different API to do what we already have at least 2 kernel APIs for, ie,
> BAR mapping and config space access. At least it should be better at
> using the backend infrastructure of the 2 others (sysfs & procfs). I
> understand it wants to filter in some cases (config space) and -maybe-
> yet another API is the right way to go but allow me to have my doubts.
> 
> One thing I thought about but you don't seem to like it ... was to use
> the need to represent the partitionable entity as groups in sysfs that I
> talked about earlier. Those could have per-device subdirs with the usual
> config & resource files, same semantic as the ones in the real device,
> but when accessed via the group they get filtering. It might or might not
> be practical in the end, TBD, but it would allow apps using a slightly
> modified libpci for example to exploit some of this.
> 
>  - The qemu vfio code hooks directly into ioapic ... of course that
> won't fly with anything !x86
> 
>  - The various "objects" dealt with here, -especially- interrupts and
> iommu, need a better in-kernel API so that fast in-kernel emulation can
> take over from qemu based emulation. The way we need to do some of this
> on POWER differs from x86. We can elaborate later, it's not necessarily
> a killer either but essentially we'll take the bulk of interrupt
> handling away from VFIO to the point where it won't see any of it at
> all.
> 
>   - Non-PCI devices. That's a hot topic for embedded. I think the vast
> majority here is platform devices. There's quite a bit of vfio that
> isn't intrinsically PCI specific. We could have an in-kernel platform
> driver like we have an in-kernel PCI driver to attach to. The mapping of
> resources to userspace is rather generic, as it is for interrupts. I
> don't know whether that idea can be pushed much further, I don't have
> the bandwidth to look into it much at this point, but maybe it would be
> possible to refactor vfio a bit to better separate what is PCI specific
> to what is not. The idea would be to move the PCI specific bits to
> inside the "placeholder" PCI driver, and same goes for platform bits.
> "generic" ioctl's go to VFIO core, anything that doesn't handle, it
> passes them to the driver which allows the PCI one to handle things
> differently than the platform one, maybe an amba one while at it,
> etc.... just a thought, I haven't gone into the details at all.
> 
> I think that's all I had on my plate today, it's a long enough email
> anyway :-) Anthony suggested we put that on a wiki, I'm a bit
> wiki-disabled myself so he proposed to pickup my email and do that. We
> should probably discuss the various items in here separately as
> different threads to avoid too much confusion.
> 
> One other thing we should do on our side is publish somewhere our
> current hacks to get you an idea of where we are going and what we had
> to do (code speaks more than words). We'll try to do that asap, possibly
> next week.
> 
> Note that I'll be on/off the next few weeks, travelling and doing
> bringup. So expect latency in my replies.
> 
> Cheers,
> Ben.


* Re: kvm PCI assignment & VFIO ramblings
  2011-07-30 18:20   ` Alex Williamson
@ 2011-07-30 23:54     ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 322+ messages in thread
From: Benjamin Herrenschmidt @ 2011-07-30 23:54 UTC (permalink / raw)
  To: Alex Williamson
  Cc: chrisw, Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	qemu-devel, David Gibson, aafabbri, iommu, linuxppc-dev, benve

On Sat, 2011-07-30 at 12:20 -0600, Alex Williamson wrote:

> On x86, the USB controllers don't typically live behind a PCIe-to-PCI
> bridge, so don't suffer the source identifier problem, but they do often
> share an interrupt.  But even then, we can count on most modern devices
> supporting PCI2.3, and thus the DisINTx feature, which allows us to
> share interrupts.  In any case, yes, it's more rare but we need to know
> how to handle devices behind PCI bridges.  However I disagree that we
> need to assign all the devices behind such a bridge to the guest.

Well, ok so let's dig a bit more here :-) First, yes I agree they don't
all need to appear to the guest. My point is really that we must prevent
them from being "used" by somebody else, either the host or another guest.

Now once you get there, I personally prefer having a clear "group"
ownership rather than having devices stay in some "limbo" under vfio
control but it's an implementation detail.

Regarding DisINTx, well, it's a bit like putting separate PCIe functions
into separate guests, it looks good ... but you are taking a chance.
Note that I do intend to do some of that for power ... well I think, I
haven't completely made up my mind.

pHyp has a stricter requirement: PEs essentially are everything
behind a bridge. If you have a slot, you have some kind of bridge above
this slot and everything on it will be a PE.

The problem I see is that with your filtering of config space, BAR
emulation, DisINTx etc... you essentially assume that you can reasonably
reliably isolate devices. But in practice, it's chancy. Some devices for
example have "backdoors" into their own config space via MMIO. If I have
such a device in a guest, I can completely override your DisINTx and
thus DOS your host or another guest with a shared interrupt. I can move
my MMIO around and DOS another function by overlapping the addresses.

You can really only protect yourself against a device if you have it
behind a bridge (in addition to having a filtering iommu), which limits
the MMIO span (and thus letting the guest whack the BARs randomly will
only allow that guest to shoot itself in the foot).

Some bridges also provide a way to block INTx below them which comes in
handy but it's bridge specific. Some devices can be coerced to send the
INTx "assert" message and never de-assert it (for example by doing a
soft-reset while it's asserted, which can be done with some devices with
an MMIO).

Anything below a PCIe -> PCI/PCI-X bridge also needs to be "grouped" due to
simple lack of proper filtering by the iommu (PCI-X in theory has RIDs
and forwards them up, but this isn't very reliable, for example it falls
over with split transactions).

Fortunately in PCIe land, we mostly have bridges above everything. The
problem somewhat remains with functions of a device, how can you be sure
that there isn't a way via some MMIO to create side effects on the other
functions of the device ? (For example by checkstopping the whole
thing). You can't really :-)

So it boils down to the "level" of safety/isolation you want to provide,
and I suppose to some extent it's a user decision but the user needs to
be informed to some extent. A hard problem :-)
 
> There's a difference between removing the device from the host and
> exposing the device to the guest.  If I have a NIC and HBA behind a
> bridge, it's perfectly reasonable that I might only assign the NIC to
> the guest, but as you describe, we then need to prevent the host, or any
> other guest from making use of the HBA.

Yes. However the other device is in "limbo" and it may not be clear to
the user why it can't be used anymore :-)

The question is more, the user needs to "know" (or libvirt does, or
somebody ... ) that in order to pass-through device A, it must also
"remove" device B from the host. How can you even provide a meaningful
error message to the user if all VFIO does is give you something like
-EBUSY ?

So the information about the grouping constraint must trickle down
somewhat.
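As a toy illustration of what that trickle-down could buy: if a tool could read group membership (say from the hypothetical /sys/devgroups proposed earlier), it could turn a bare -EBUSY into something actionable. The group ids and device addresses below are all made up:

```python
# Invented group table standing in for kernel-exposed grouping info.
GROUPS = {
    "pe#12": ["0000:01:00.0", "0000:01:00.1"],  # e.g. NIC + HBA behind one bridge
}

def explain_assignment(dev):
    """Turn a grouping constraint into a human-readable message."""
    for gid, members in GROUPS.items():
        if dev in members:
            others = [d for d in members if d != dev]
            if others:
                return ("%s is in group %s; assigning it also removes %s "
                        "from the host" % (dev, gid, ", ".join(others)))
            return "%s is alone in group %s" % (dev, gid)
    return "%s: no known grouping constraint" % dev

print(explain_assignment("0000:01:00.0"))
```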

Look at it from a GUI perspective for example. Imagine a front-end
showing you devices in your system and allowing you to "Drag & drop"
them to your guest. How do you represent that need for grouping ? First
how do you expose it from kernel/libvirt to the GUI tool and how do you
represent it to the user ?

By grouping the devices in logical groups which end up being the
"objects" you can drag around, at least you provide some amount of
clarity. Now if you follow that path down to how the GUI app, libvirt
and possibly qemu need to know / resolve the dependency, being given the
"groups" as the primary information of what can be used for pass-through
makes everything a lot simpler.
 
> > - The -minimum- granularity of pass-through is not always a single
> > device and not always under SW control
> 
> But IMHO, we need to preserve the granularity of exposing a device to a
> guest as a single device.  That might mean some devices are held hostage
> by an agent on the host.

Maybe, but wouldn't that be even more confusing from a user perspective ?
And I think it makes things harder from the perspective of implementing
admin & management tools too.

> > - Having a magic heuristic in libvirt to figure out those constraints is
> > WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
> > knowledge of PCI resource management and getting it wrong in many many
> > cases, something that took years to fix essentially by ripping it all
> > out. This is kernel knowledge and thus we need the kernel to expose in a
> > way or another what those constraints are, what those "partitionable
> > groups" are.
> > 
> > - That does -not- mean that we cannot specify for each individual device
> > within such a group where we want to put it in qemu (what devfn etc...).
> > As long as there is a clear understanding that the "ownership" of the
> > device goes with the group, this is somewhat orthogonal to how they are
> > represented in qemu. (Not completely... if the iommu is exposed to the
> > guest ,via paravirt for example, some of these constraints must be
> > exposed but I'll talk about that more later).
> 
> Or we can choose not to expose all of the devices in the group to the
> guest?

As I said, I don't mind if you don't, I'm just worried about the
consequences of that from a usability standpoint. Having advanced
command line option to fine tune is fine. Being able to specify within a
"group" which devices to show and at what address if fine.

But I believe the basic entity to be manipulated from an interface
standpoint remains the group.

To get back to my GUI example, once you've D&D your group of devices
over, you can have the option to open that group and check/uncheck
individual devices & assign them addresses if you want. That doesn't
change the fact that practically speaking, the whole group is now owned
by the guest.

I will go further than that actually. If you look at how the isolation
HW works on POWER, the fact that I have the MMIO segmentation means that
I can simply give the entire group's MMIO space to the guest. No problem
of small BARs, no need to slow-map them ... etc.. that's a pretty handy
feature don't you think ?

But that means that those other devices -will- be there, mapped along
with the one you care about. We may not expose it in config space but it
will be accessible. I suppose we can keep its IO/MEM decoding disabled.
But my point is that for all intents and purposes, it's actually owned by
the guest.

> > The interface currently proposed for VFIO (and associated uiommu)
> > doesn't handle that problem at all. Instead, it is entirely centered
> > around a specific "feature" of the VTd iommu's for creating arbitrary
> > domains with arbitrary devices (tho those devices -do- have the same
> > constraints exposed above, don't try to put 2 legacy PCI devices behind
> > the same bridge into 2 different domains !), but the API totally ignores
> > the problem, leaves it to libvirt "magic foo" and focuses on something
> > that is both quite secondary in the grand scheme of things, and quite
> > x86 VTd specific in the implementation and API definition.
> 
> To be fair, libvirt's "magic foo" is built out of the necessity that
> nobody else is defining the rules.

Sure, which is why I propose that the kernel exposes the rules since
it's really the one right place to have that sort of HW constraint
knowledge, especially since it can be partially at least platform
specific.
 
 .../...

> > Maybe something like /sys/devgroups ? This probably warrants involving
> > more kernel people into the discussion.
> 
> I don't yet buy into passing groups to qemu since I don't buy into the
> idea of always exposing all of those devices to qemu.  Would it be
> sufficient to expose iommu nodes in sysfs that link to the devices
> behind them and describe properties and capabilities of the iommu
> itself?  More on this at the end.

Well, iommu aren't the only factor. I mentioned shared interrupts (and
my unwillingness to always trust DisINTx), there's also the MMIO
grouping I mentioned above (in which case it's an x86 -limitation- with
small BARs that I don't want to inherit, especially since it's based on
PAGE_SIZE and we commonly have 64K page size on POWER), etc...

So I'm not too much of a fan of making it look entirely like the iommu is the
primary factor, but we -can-, that would be workable. I still prefer
calling a cat a cat and exposing the grouping for what it is, as I think
I've explained already above, tho. 

 .../...

> > Now some of this can be fixed with tweaks, and we've started doing it
> > (we have a working pass-through using VFIO, forgot to mention that, it's
> > just that we don't like what we had to do to get there).
> 
> This is a result of wanting to support *unmodified* x86 guests.  We
> don't have the luxury of having a predefined pvDMA spec that all x86
> OSes adhere to. 

No but you could emulate a HW iommu no ?

>  The 32bit problem is unfortunate, but the priority use
> case for assigning devices to guests is high performance I/O, which
> usually entails modern, 64bit hardware.  I'd like to see us get to the
> point of having emulated IOMMU hardware on x86, which could then be
> backed by VFIO, but for now guest pinning is the most practical and
> useful.

For your current case maybe. It's just not very future proof imho.
Anyways, it's fixable, but the APIs as they are make it a bit clumsy.

 .../...

> > Also our next generation chipset may drop support for PIO completely.
> > 
> > On the other hand, because PIO is just a special range of MMIO for us,
> > we can do normal pass-through on it and don't need any of the emulation
> done by qemu.
> 
> Maybe we can add mmap support to PIO regions on non-x86.

We have to yes. I haven't looked into it yet, it should be easy if VFIO
kernel side starts using the "proper" PCI mmap interfaces in kernel (the
same interfaces sysfs & proc use).

> >   * MMIO constraints
> > 
> > The QEMU side VFIO code hard wires various constraints that are entirely
> > based on various requirements you decided you have on x86 but don't
> > necessarily apply to us :-)
> > 
> > Due to our paravirt nature, we don't need to masquerade the MSI-X table
> > for example. At all. If the guest configures crap into it, too bad, it
> can only shoot itself in the foot since the host bridge enforces
> > validation anyways as I explained earlier. Because it's all paravirt, we
> > don't need to "translate" the interrupt vectors & addresses, the guest
> will call hypercalls to configure things anyways.
> 
> With interrupt remapping, we can allow the guest access to the MSI-X
> table, but since that takes the host out of the loop, there's
> effectively no way for the guest to correctly program it directly by
> itself.

Right, I think what we need here is some kind of capabilities to
"disable" those "features" of qemu vfio.c that aren't needed on our
platform :-) Shouldn't be too hard. We need to make this runtime tho
since different machines can have different "capabilities".

> > We don't need to prevent MMIO pass-through for small BARs at all. This
> > should be some kind of capability or flag passed by the arch. Our
> > segmentation of the MMIO domain means that we can give entire segments
> > to the guest and let it access anything in there (those segments are a
> > multiple of the page size always). Worst case it will access outside of
> > a device BAR within a segment and will cause the PE to go into error
> > state, shooting itself in the foot, there is no risk of side effect
> > outside of the guest boundaries.
> 
> Sure, this could be some kind of capability flag, maybe even implicit in
> certain configurations.

Yup.

> > In fact, we don't even need to emulate BAR sizing etc... in theory. Our
> > paravirt guests expect the BARs to have been already allocated for them
> > by the firmware and will pick up the addresses from the device-tree :-)
> > 
> > Today we use a "hack", putting all 0's in there and triggering the linux
> > code path to reassign unassigned resources (which will use BAR
> > emulation) but that's not what we are -supposed- to do. Not a big deal
> > and having the emulation there won't -hurt- us, it's just that we don't
> > really need any of it.
> > 
> > We have a small issue with ROMs. Our current KVM only works with huge
> > pages for guest memory but that is being fixed. So the way qemu maps the
> > ROM copy into the guest address space doesn't work. It might be handy
> > anyways to have a way for qemu to use MMIO emulation for ROM access as a
> > fallback. I'll look into it.
> 
> So that means ROMs don't work for you on emulated devices either?  The
> reason we read it once and map it into the guest is because Michael
> Tsirkin found a section in the PCI spec that indicates devices can share
> address decoders between BARs and ROM.

Yes, he is correct.

>   This means we can't just leave
> the enabled bit set in the ROM BAR, because it could actually disable an
> address decoder for a regular BAR.  We could slow-map the actual ROM,
> enabling it around each read, but shadowing it seemed far more
> efficient.

Right. We can slow map the ROM, or we can not care :-) At the end of the
day, what is the difference here between a "guest" under qemu and the
real thing bare metal on the machine ? IE. They have the same issue vs.
accessing the ROM. IE. I don't see why qemu should try to make it safe
to access it at any time while it isn't on a real machine. Since VFIO
resets the devices before putting them in guest space, they should be
accessible no ? (Might require a hard reset for some devices tho ... )

In any case, it's not a big deal and we can sort it out, I'm happy to
fallback to slow map to start with and eventually we will support small
pages mappings on POWER anyways, it's a temporary limitation.

> >   * EEH
> > 
> > This is the name of those fancy error handling & isolation features I
> > mentioned earlier. To some extent it's a superset of AER, but we don't
> > generally expose AER to guests (or even the host), it's swallowed by
> > firmware into something else that provides a superset (well mostly) of
> the AER information, and allows us to do those additional things like
> > isolating/de-isolating, reset control etc...
> > 
> > Here too, we'll need arch specific APIs through VFIO. Not necessarily a
> > huge deal, I mention it for completeness.
> 
> We expect to do AER via the VFIO netlink interface, which, even though
> it's bashed below, would be quite extensible to supporting different
> kinds of errors.

As could platform specific ioctls :-)

> >    * Misc
> > 
> > There's lots of small bits and pieces... in no special order:
> > 
> >  - netlink ? WTF ! Seriously, we don't need a hybrid API with a bit of
> > netlink and a bit of ioctl's ... it's not like there's something
> fundamentally better for netlink vs. ioctl... it really depends what
> > you are doing, and in this case I fail to see what netlink brings you
> > other than bloat and more stupid userspace library deps.
> 
> The netlink interface is primarily for host->guest signaling.  I've only
> implemented the remove command (since we're lacking a pcie-host in qemu
> to do AER), but it seems to work quite well.  If you have suggestions
> for how else we might do it, please let me know.  This seems to be the
> sort of thing netlink is supposed to be used for.

I don't understand what the advantage of netlink is compared to just
extending your existing VFIO ioctl interface, possibly using children
fd's as we do for example with spufs, but it's not a huge deal. It's just
that netlink has its own gotchas and I don't like multi-headed
interfaces.

> >  - I don't like too much the fact that VFIO provides yet another
> > different API to do what we already have at least 2 kernel APIs for, ie,
> > BAR mapping and config space access. At least it should be better at
> > using the backend infrastructure of the 2 others (sysfs & procfs). I
> understand it wants to filter in some cases (config space) and -maybe-
> > yet another API is the right way to go but allow me to have my doubts.
> 
> The use of PCI sysfs is actually one of my complaints about current
> device assignment.  To do assignment with an unprivileged guest we need
> to open the PCI sysfs config file for it, then change ownership on a
> handful of other PCI sysfs files, then there's this other pci-stub thing
> to maintain ownership, but the kvm ioctls don't actually require it and
> can grab onto any free device...  We are duplicating some of that in
> VFIO, but we also put the ownership of the device behind a single device
> file.  We do have the uiommu problem that we can't give an unprivileged
> user ownership of that, but your usage model may actually make that
> easier.  More below...
> 
> > One thing I thought about but you don't seem to like it ... was to use
> > the need to represent the partitionable entity as groups in sysfs that I
> > talked about earlier. Those could have per-device subdirs with the usual
> > config & resource files, same semantic as the ones in the real device,
> but when accessed via the group they get filtering. It might or might not
> be practical in the end, TBD, but it would allow apps using a slightly
> > modified libpci for example to exploit some of this.
> 
> I may be tainted by our disagreement that all the devices in a group
> need to be exposed to the guest and qemu could just take a pointer to a
> sysfs directory.  That seems very unlike qemu and pushes more of the
> policy into qemu, which seems like the wrong direction.

I don't see how it pushes "policy" into qemu.

The "policy" here is imposed by the HW setup and exposed by the
kernel :-) Giving qemu a group means qemu takes "ownership" of that bunch
of devices, so far I don't see what's policy about that. From there, it
would be "handy" for people to just stop there and just see all the
devices of the group show up in the guest, but by all means feel free to
suggest a command line interface that allows you to specify more precisely
which of the devices in the group to pass through and at what address.

> >  - The qemu vfio code hooks directly into ioapic ... of course that
> > won't fly with anything !x86
> 
> I spent a lot of time looking for an architecture neutral solution here,
> but I don't think it exists.  Please prove me wrong.

No it doesn't, I agree; that's why it should be some kind of notifier or
function pointer setup by the platform specific code.

>   The problem is
> that we have to disable INTx on an assigned device after it fires (VFIO
> does this automatically).  If we don't do this, a non-responsive or
> malicious guest could sit on the interrupt, causing it to fire
> repeatedly as a DoS on the host.  The only indication that we can rely
> on to re-enable INTx is when the guest CPU writes an EOI to the APIC.
> We can't just wait for device accesses because a) the device CSRs are
> (hopefully) direct mapped and we'd have to slow map them or attempt to
> do some kind of dirty logging to detect when they're accessed b) what
> constitutes an interrupt service is device specific.
> 
> That means we need to figure out how PCI interrupt 'A' (or B...)
> translates to a GSI (Global System Interrupt - ACPI definition, but
> hopefully a generic concept).  That GSI identifies a pin on an IOAPIC,
> which will also see the APIC EOI.  And just to spice things up, the
> guest can change the PCI to GSI mappings via ACPI.  I think the set of
> callbacks I've added are generic (maybe I left ioapic in the name), but
> yes they do need to be implemented for other architectures.  Patches
> appreciated from those with knowledge of the systems and/or access to
> device specs.  This is the only reason that I make QEMU VFIO only build
> for x86.

Right, and we need to cook a similar sauce for POWER, it's an area that
has to be arch specific (and in fact specific to the particular HW machine
being emulated), so we just need to find out the cleanest way for
the platform to "register" the right callbacks here.

Not a big deal, I just felt like mentioning it :-)
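The mask-on-fire / unmask-on-EOI dance Alex describes can be sketched as a tiny state machine. The class and method names are invented for this sketch; the real logic lives in vfio and the arch-specific EOI callbacks:

```python
class IntxForwarder:
    """Toy model of host-side INTx handling for an assigned device:
    mask the level-triggered line as soon as it fires, forward it to
    the guest, and only unmask when the guest EOIs its interrupt
    controller -- otherwise a guest sitting on the interrupt would
    keep it screaming as a DoS on the host."""

    def __init__(self):
        self.line_masked = False
        self.guest_pending = False

    def host_intx_fired(self):
        # vfio disables INTx (DisINTx or at the ioapic) before injecting.
        self.line_masked = True
        self.guest_pending = True

    def guest_eoi(self):
        # The guest's EOI is the only reliable "serviced" signal.
        if self.guest_pending:
            self.guest_pending = False
            self.line_masked = False

f = IntxForwarder()
f.host_intx_fired()
print(f.line_masked)   # True
f.guest_eoi()
print(f.line_masked)   # False
```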

> >  - The various "objects" dealt with here, -especially- interrupts and
> > iommu, need a better in-kernel API so that fast in-kernel emulation can
> > take over from qemu based emulation. The way we need to do some of this
> > on POWER differs from x86. We can elaborate later, it's not necessarily
> > a killer either but essentially we'll take the bulk of interrupt
> > handling away from VFIO to the point where it won't see any of it at
> > all.
> 
> The plan for x86 is to connect VFIO eventfds directly to KVM irqfds and
> bypass QEMU.  This is exactly what VHOST does today and fairly trivial
> to enable for MSI once we get it merged.  INTx would require us to be
> able to define a level triggered irqfd in KVM and it's not yet clear if
> we care that much about INTx performance.

I care enough because our exit cost to qemu is much higher than x86, and
I can pretty easily emulate my PIC entirely in real mode (from within
the guest context) which is what I intend to do :-)

On the other hand, I have no reason to treat MSI or LSI differently, so
all I really need to do is get back to the underlying platform HW interrupt
number, and I think I can do that. So as long as I have a hook to know
what's there and what has been enabled, these interrupts will simply
cease to be visible to either qemu or vfio.

Another reason why I don't like allowing shared interrupts in different
guests with DisINTx :-) Because that means that such interrupts would
have to go back all the way to qemu/vfio :-) But I can always have a
fallback there, it's really the problem of "trusting" DisINTx that
concerns me.

> We don't currently have a plan for accelerating IOMMU access since our
> current usage model doesn't need one.  We also need to consider MSI-X
> table acceleration for x86.  I hope we'll be able to use the new KVM
> ioctls for this.

Ok, we can give direct access to the MSI-X table to the guest on POWER,
so that isn't an issue for us.

> Thanks for the write up, I think it will be good to let everyone digest
> it before we discuss this at KVM forum.

Agreed. As I think I may have mentioned already, I won't be able to make
it to the forum, but Paulus will, and I'll be in a nearby timezone, so I
might be able to join a call if it's deemed useful.

> Rather than your "groups" idea, I've been mulling over whether we can
> just expose the dependencies, configuration, and capabilities in sysfs
> and build qemu commandlines to describe it.  For instance, if we simply
> start with creating iommu nodes in sysfs, we could create links under
> each iommu directory to the devices behind them.  Some kind of
> capability file could define properties like whether it's page table
> based or fixed iova window or the granularity of mapping the devices
> behind it.  Once we have that, we could probably make uiommu attach to
> each of those nodes.

Well, s/iommu/groups and you are pretty close to my original idea :-)

I don't mind that much what the details are, but I like the idea of not
having to construct a 3-page command line every time I want to
pass-through a device; most "simple" usage scenarios don't care that
much.

> That means we know /dev/uiommu7 (random example) is our access to a
> specific iommu with a given set of devices behind it.

Linking those sysfs iommus or groups to a /dev/ entry is fine by me.
  
>   If that iommu is
> a PE (via those capability files), then a user space entity (trying hard
> not to call it libvirt) can unbind all those devices from the host,
> maybe bind the ones it wants to assign to a guest to vfio and bind the
> others to pci-stub for safe keeping.  If you trust a user with
> everything in a PE, bind all the devices to VFIO, chown all
> the /dev/vfioX entries for those devices, and the /dev/uiommuX device.
>
> We might then come up with qemu command lines to describe interesting
> configurations, such as:
> 
> -device iommu,model=PAPR,uiommu=/dev/uiommu7,id=iommu.0 \
> -device pci-bus,...,iommu=iommu0,id=pci.0 \
> -device vfio,host=ssss:bb:dd.f,bus=pci.0,addr=dd.f,id=hostdev0
> 
> The userspace entity would obviously need to put things in the same PE
> in the right place, but it doesn't seem to take a lot of sysfs info to
> get that right.
> 
> Today we do DMA mapping via the VFIO device because the capabilities of
> the IOMMU domains change depending on which devices are connected (for
> VT-d, the least common denominator of the IOMMUs in play).  Forcing the
> DMA mappings through VFIO naturally forces the call order.  If we moved
> to something like above, we could switch the DMA mapping to the uiommu
> device, since the IOMMU would have fixed capabilities.

That makes sense.

> What gaps would something like this leave for your IOMMU granularity
> problems?  I'll need to think through how it works when we don't want to
> expose the iommu to the guest, maybe a model=none (default) that doesn't
> need to be connected to a pci bus and maps all guest memory.  Thanks,

Well, I would map those "iommus" to PEs, so what remains is the path for
all the "other" bits and pieces, such as informing qemu of the location
and size of the MMIO segment(s) (so we can map the whole thing and not
bother with individual BARs), etc...

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [Qemu-devel] kvm PCI assignment & VFIO ramblings
@ 2011-07-30 23:54     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 322+ messages in thread
From: Benjamin Herrenschmidt @ 2011-07-30 23:54 UTC (permalink / raw)
  To: Alex Williamson
  Cc: chrisw, Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	qemu-devel, David Gibson, aafabbri, iommu, linuxppc-dev, benve

On Sat, 2011-07-30 at 12:20 -0600, Alex Williamson wrote:

> On x86, the USB controllers don't typically live behind a PCIe-to-PCI
> bridge, so don't suffer the source identifier problem, but they do often
> share an interrupt.  But even then, we can count on most modern devices
> supporting PCI2.3, and thus the DisINTx feature, which allows us to
> share interrupts.  In any case, yes, it's more rare but we need to know
> how to handle devices behind PCI bridges.  However I disagree that we
> need to assign all the devices behind such a bridge to the guest.

Well, ok so let's dig a bit more here :-) First, yes I agree they don't
all need to appear to the guest. My point is really that we must prevent
them to be "used" by somebody else, either host or another guest.

Now once you get there, I personally prefer having a clear "group"
ownership rather than having devices stay in some "limbo" under vfio
control but it's an implementation detail.

Regarding DisINTx, well, it's a bit like putting separate PCIe functions
into separate guests: it looks good ... but you are taking a chance.
Note that I do intend to do some of that for POWER ... well I think, I
haven't completely made up my mind.

pHyp has a stricter requirement: PEs essentially are everything
behind a bridge. If you have a slot, you have some kind of bridge above
this slot and everything on it will be a PE.

The problem I see is that with your filtering of config space, BAR
emulation, DisINTx etc... you essentially assume that you can reasonably
reliably isolate devices. But in practice, it's chancy. Some devices for
example have "backdoors" into their own config space via MMIO. If I have
such a device in a guest, I can completely override your DisINTx and
thus DoS your host or another guest with a shared interrupt. I can move
my MMIO around and DoS another function by overlapping the addresses.

You can really only protect yourself against a device if you have it
behind a bridge (in addition to having a filtering iommu), which limits
the MMIO span (and thus letting the guest whack the BARs randomly will
only allow that guest to shoot itself in the foot).

Some bridges also provide a way to block INTx below them which comes in
handy but it's bridge specific. Some devices can be coerced to send the
INTx "assert" message and never de-assert it (for example by doing a
soft-reset while it's asserted, which can be done with some devices with
an MMIO).

Anything below a PCIe -> PCI/PCI-X bridge also needs to be "grouped" due
to the simple lack of proper filtering by the iommu (PCI-X in theory has
RIDs and forwards them up, but this isn't very reliable; for example it
falls over with split transactions).

Fortunately in PCIe land, we mostly have bridges above everything. The
problem somewhat remains with functions of a device, how can you be sure
that there isn't a way via some MMIO to create side effects on the other
functions of the device ? (For example by checkstopping the whole
thing). You can't really :-)

So it boils down to the "level" of safety/isolation you want to provide,
and I suppose to some extent it's a user decision, but the user needs to
be informed to some extent. A hard problem :-)
 
> There's a difference between removing the device from the host and
> exposing the device to the guest.  If I have a NIC and HBA behind a
> bridge, it's perfectly reasonable that I might only assign the NIC to
> the guest, but as you describe, we then need to prevent the host, or any
> other guest from making use of the HBA.

Yes. However the other device is in "limbo" and it may not be clear to
the user why it can't be used anymore :-)

The question is more that the user needs to "know" (or libvirt does, or
somebody ... ) that in order to pass through device A, it must also
"remove" device B from the host. How can you even provide a meaningful
error message to the user if all VFIO does is give you something like
-EBUSY ?

So the information about the grouping constraint must trickle down
somewhat.

Look at it from a GUI perspective for example. Imagine a front-end
showing you devices in your system and allowing you to "Drag & drop"
them to your guest. How do you represent that need for grouping ? First
how do you expose it from kernel/libvirt to the GUI tool and how do you
represent it to the user ?

By grouping the devices in logical groups which end up being the
"objects" you can drag around, at least you provide some amount of
clarity. Now if you follow that path down to how the GUI app, libvirt
and possibly qemu need to know / resolve the dependency, being given the
"groups" as the primary information of what can be used for pass-through
makes everything a lot simpler.
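To make that flow concrete, here is a small sketch (Python for brevity,
with hypothetical names and example data throughout) of the resolution
step a GUI or libvirt-like tool would perform once the kernel exposes the
groups: given the device the user wants to assign, answer which other
devices must leave the host along with it.

```python
# Sketch of the group-constraint resolution a management tool could do
# once the kernel exposes partitionable groups. The map below is made-up
# example data, not a real sysfs layout.

GROUPS = {
    "group0": ["0000:00:1d.0", "0000:00:1d.1", "0000:00:1d.7"],  # USB combo
    "group1": ["0000:01:00.0"],                                  # NIC alone in its PE
    "group2": ["0000:02:00.0", "0000:02:00.1"],                  # NIC + HBA behind a bridge
}

def group_of(device):
    """Return the group a device belongs to, or None if unknown."""
    for group, devices in GROUPS.items():
        if device in devices:
            return group
    return None

def codependents(device):
    """Devices that must also leave the host to assign `device`."""
    group = group_of(device)
    if group is None:
        raise ValueError("unknown device: " + device)
    return [d for d in GROUPS[group] if d != device]

# Assigning the NIC in group2 drags the HBA function along with it:
print(codependents("0000:02:00.0"))   # ['0000:02:00.1']
```

This is precisely the information a bare -EBUSY cannot convey: with the
group map in hand, the tool can tell the user up front that device B
leaves the host when device A is assigned.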
 
> > - The -minimum- granularity of pass-through is not always a single
> > device and not always under SW control
> 
> But IMHO, we need to preserve the granularity of exposing a device to a
> guest as a single device.  That might mean some devices are held hostage
> by an agent on the host.

Maybe, but wouldn't that be even more confusing from a user perspective ?
And I think it makes things harder from an admin & management tools
implementation perspective too.

> > - Having a magic heuristic in libvirt to figure out those constraints is
> > WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
> > knowledge of PCI resource management and getting it wrong in many many
> > cases, something that took years to fix essentially by ripping it all
> > out. This is kernel knowledge and thus we need the kernel to expose in
> > one way or another what those constraints are, what those "partitionable
> > groups" are.
> > 
> > - That does -not- mean that we cannot specify for each individual device
> > within such a group where we want to put it in qemu (what devfn etc...).
> > As long as there is a clear understanding that the "ownership" of the
> > device goes with the group, this is somewhat orthogonal to how they are
> > represented in qemu. (Not completely... if the iommu is exposed to the
> > guest ,via paravirt for example, some of these constraints must be
> > exposed but I'll talk about that more later).
> 
> Or we can choose not to expose all of the devices in the group to the
> guest?

As I said, I don't mind if you don't, I'm just worried about the
consequences of that from a usability standpoint. Having advanced
command line options to fine tune is fine. Being able to specify within a
"group" which devices to show and at what address is fine.

But I believe the basic entity to be manipulated from an interface
standpoint remains the group.

To get back to my GUI example, once you've dragged & dropped your group
of devices
over, you can have the option to open that group and check/uncheck
individual devices & assign them addresses if you want. That doesn't
change the fact that practically speaking, the whole group is now owned
by the guest.

I will go further than that actually. If you look at how the isolation
HW works on POWER, the fact that I have the MMIO segmentation means that
I can simply give the entire group MMIO space to the guest. No problem
of small BARs, no need to slow-map them ... etc.. that's a pretty handy
feature don't you think ?

But that means that those other devices -will- be there, mapped along
with the one you care about. We may not expose it in config space but it
will be accessible. I suppose we can keep its IO/MEM decoding disabled.
But my point is that for all intents and purposes, it's actually owned by
the guest.

> > The interface currently proposed for VFIO (and associated uiommu)
> > doesn't handle that problem at all. Instead, it is entirely centered
> > around a specific "feature" of the VTd iommu's for creating arbitrary
> > domains with arbitrary devices (tho those devices -do- have the same
> > constraints exposed above, don't try to put 2 legacy PCI devices behind
> > the same bridge into 2 different domains !), but the API totally ignores
> > the problem, leaves it to libvirt "magic foo" and focuses on something
> > that is both quite secondary in the grand scheme of things, and quite
> > x86 VTd specific in the implementation and API definition.
> 
> To be fair, libvirt's "magic foo" is built out of the necessity that
> nobody else is defining the rules.

Sure, which is why I propose that the kernel exposes the rules since
it's really the one right place to have that sort of HW constraint
knowledge, especially since it can be partially at least platform
specific.
 
 .../...

> > Maybe something like /sys/devgroups ? This probably warrants involving
> > more kernel people into the discussion.
> 
> I don't yet buy into passing groups to qemu since I don't buy into the
> idea of always exposing all of those devices to qemu.  Would it be
> sufficient to expose iommu nodes in sysfs that link to the devices
> behind them and describe properties and capabilities of the iommu
> itself?  More on this at the end.

Well, iommu aren't the only factor. I mentioned shared interrupts (and
my unwillingness to always trust DisINTx), there's also the MMIO
grouping I mentioned above (in which case it's an x86 -limitation- with
small BARs that I don't want to inherit, especially since it's based on
PAGE_SIZE and we commonly have 64K page size on POWER), etc...

So I'm not too much of a fan of making it look like the iommu is the
primary factor, but we -can-, that would be workable. I still prefer
calling a cat a cat and exposing the grouping for what it is, as I think
I've explained already above, tho. 
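To illustrate the MMIO grouping point with a quick sketch (made-up BAR
addresses): isolation by mapping granularity only works if no two
devices' BARs share a mapping unit, so two BARs that are separable with
4K pages can collide once the unit is a 64K page or segment.

```python
# Two small BARs of *different* devices, 16K apart (made-up addresses).
BAR_A = 0x9000_0000      # device A
BAR_B = 0x9000_4000      # device B

def same_mapping_unit(addr_a, addr_b, unit):
    """True if both addresses fall in the same page/segment of size `unit`."""
    return addr_a // unit == addr_b // unit

print(same_mapping_unit(BAR_A, BAR_B, 0x1000))    # False: 4K pages keep them apart
print(same_mapping_unit(BAR_A, BAR_B, 0x10000))   # True: one 64K page covers both
```

If two devices cannot be mapped to the guest independently, they end up
on the same side of the boundary regardless of what the iommu could do,
which is why the grouping is more than an iommu property.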

 .../...

> > Now some of this can be fixed with tweaks, and we've started doing it
> > (we have a working pass-through using VFIO, forgot to mention that, it's
> > just that we don't like what we had to do to get there).
> 
> This is a result of wanting to support *unmodified* x86 guests.  We
> don't have the luxury of having a predefined pvDMA spec that all x86
> OSes adhere to. 

No, but you could emulate a HW iommu, no ?

>  The 32bit problem is unfortunate, but the priority use
> case for assigning devices to guests is high performance I/O, which
> usually entails modern, 64bit hardware.  I'd like to see us get to the
> point of having emulated IOMMU hardware on x86, which could then be
> backed by VFIO, but for now guest pinning is the most practical and
> useful.

For your current use case, maybe. It's just not very future proof imho.
Anyways, it's fixable, but the APIs as they are make it a bit clumsy.

 .../...

> > Also our next generation chipset may drop support for PIO completely.
> > 
> > On the other hand, because PIO is just a special range of MMIO for us,
> > we can do normal pass-through on it and don't need any of the emulation
> > done by qemu.
> 
> Maybe we can add mmap support to PIO regions on non-x86.

We have to, yes. I haven't looked into it yet; it should be easy if the
VFIO kernel side starts using the "proper" PCI mmap interfaces in the
kernel (the same interfaces sysfs & procfs use).

> >   * MMIO constraints
> > 
> > The QEMU side VFIO code hard wires various constraints that are entirely
> > based on various requirements you decided you have on x86 but don't
> > necessarily apply to us :-)
> > 
> > Due to our paravirt nature, we don't need to masquerade the MSI-X table
> > for example. At all. If the guest configures crap into it, too bad, it
> > can only shoot itself in the foot since the host bridge enforce
> > validation anyways as I explained earlier. Because it's all paravirt, we
> > don't need to "translate" the interrupt vectors & addresses, the guest
> > will call hypercalls to configure things anyways.
> 
> With interrupt remapping, we can allow the guest access to the MSI-X
> table, but since that takes the host out of the loop, there's
> effectively no way for the guest to correctly program it directly by
> itself.

Right, I think what we need here is some kind of capability to "disable"
those "features" of qemu's vfio.c that aren't needed on our
platform :-) Shouldn't be too hard. We need to make this runtime-selectable
tho, since different machines can have different "capabilities".

> > We don't need to prevent MMIO pass-through for small BARs at all. This
> > should be some kind of capability or flag passed by the arch. Our
> > segmentation of the MMIO domain means that we can give entire segments
> > to the guest and let it access anything in there (those segments are a
> > multiple of the page size always). Worst case it will access outside of
> > a device BAR within a segment and will cause the PE to go into error
> > state, shooting itself in the foot, there is no risk of side effect
> > outside of the guest boundaries.
> 
> Sure, this could be some kind of capability flag, maybe even implicit in
> certain configurations.

Yup.

> > In fact, we don't even need to emulate BAR sizing etc... in theory. Our
> > paravirt guests expect the BARs to have been already allocated for them
> > by the firmware and will pick up the addresses from the device-tree :-)
> > 
> > Today we use a "hack", putting all 0's in there and triggering the linux
> > code path to reassign unassigned resources (which will use BAR
> > emulation) but that's not what we are -supposed- to do. Not a big deal
> > and having the emulation there won't -hurt- us, it's just that we don't
> > really need any of it.
> > 
> > We have a small issue with ROMs. Our current KVM only works with huge
> > pages for guest memory but that is being fixed. So the way qemu maps the
> > ROM copy into the guest address space doesn't work. It might be handy
> > anyways to have a way for qemu to use MMIO emulation for ROM access as a
> > fallback. I'll look into it.
> 
> So that means ROMs don't work for you on emulated devices either?  The
> reason we read it once and map it into the guest is because Michael
> Tsirkin found a section in the PCI spec that indicates devices can share
> address decoders between BARs and ROM.

Yes, he is correct.

>   This means we can't just leave
> the enabled bit set in the ROM BAR, because it could actually disable an
> address decoder for a regular BAR.  We could slow-map the actual ROM,
> enabling it around each read, but shadowing it seemed far more
> efficient.

Right. We can slow map the ROM, or we can not care :-) At the end of the
day, what is the difference here between a "guest" under qemu and the
real thing running bare metal on the machine ? They have the same issue
vs. accessing the ROM, so I don't see why qemu should try to make it safe
to access at any time while it isn't on a real machine. Since VFIO
resets the devices before putting them in guest space, they should be
accessible, no ? (Might require a hard reset for some devices tho ... )

In any case, it's not a big deal and we can sort it out; I'm happy to
fall back to slow map to start with, and eventually we will support small
page mappings on POWER anyways, it's a temporary limitation.

> >   * EEH
> > 
> > This is the name of those fancy error handling & isolation features I
> > mentioned earlier. To some extent it's a superset of AER, but we don't
> > generally expose AER to guests (or even the host), it's swallowed by
> > firmware into something else that provides a superset (well mostly) of
> > the AER information, and allow us to do those additional things like
> > isolating/de-isolating, reset control etc...
> > 
> > Here too, we'll need arch specific APIs through VFIO. Not necessarily a
> > huge deal, I mention it for completeness.
> 
> We expect to do AER via the VFIO netlink interface, which even though
> its bashed below, would be quite extensible to supporting different
> kinds of errors.

As could platform specific ioctls :-)

> >    * Misc
> > 
> > There's lots of small bits and pieces... in no special order:
> > 
> >  - netlink ? WTF ! Seriously, we don't need a hybrid API with a bit of
> > netlink and a bit of ioctl's ... it's not like there's something
> > fundamentally  better for netlink vs. ioctl... it really depends what
> > you are doing, and in this case I fail to see what netlink brings you
> > other than bloat and more stupid userspace library deps.
> 
> The netlink interface is primarily for host->guest signaling.  I've only
> implemented the remove command (since we're lacking a pcie-host in qemu
> to do AER), but it seems to work quite well.  If you have suggestions
> for how else we might do it, please let me know.  This seems to be the
> sort of thing netlink is supposed to be used for.

I don't understand what the advantage of netlink is compared to just
extending your existing VFIO ioctl interface, possibly using children
fd's as we do for example with spufs, but it's not a huge deal. It's
just that netlink has its own gotchas and I don't like multi-headed
interfaces.

> >  - I don't like too much the fact that VFIO provides yet another
> > different API to do what we already have at least 2 kernel APIs for, ie,
> > BAR mapping and config space access. At least it should be better at
> > using the backend infrastructure of the 2 others (sysfs & procfs). I
> > understand it wants to filter in some case (config space) and -maybe-
> > yet another API is the right way to go but allow me to have my doubts.
> 
> The use of PCI sysfs is actually one of my complaints about current
> device assignment.  To do assignment with an unprivileged guest we need
> to open the PCI sysfs config file for it, then change ownership on a
> handful of other PCI sysfs files, then there's this other pci-stub thing
> to maintain ownership, but the kvm ioctls don't actually require it and
> can grab onto any free device...  We are duplicating some of that in
> VFIO, but we also put the ownership of the device behind a single device
> file.  We do have the uiommu problem that we can't give an unprivileged
> user ownership of that, but your usage model may actually make that
> easier.  More below...
> 
> > One thing I thought about but you don't seem to like it ... was to use
> > the need to represent the partitionable entity as groups in sysfs that I
> > talked about earlier. Those could have per-device subdirs with the usual
> > config & resource files, same semantic as the ones in the real device,
> > but when accessed via the group they get filtering. It might or might not
> > be practical in the end, tbd, but it would allow apps using a slightly
> > modified libpci for example to exploit some of this.
> 
> I may be tainted by our disagreement that all the devices in a group
> need to be exposed to the guest and qemu could just take a pointer to a
> sysfs directory.  That seems very unlike qemu and pushes more of the
> policy into qemu, which seems like the wrong direction.

I don't see how it pushes "policy" into qemu.

The "policy" here is imposed by the HW setup and exposed by the
kernel :-) Giving qemu a group means qemu takes "owership" of that bunch
of devices, so far I don't see what's policy about that. From there, it
would be "handy" for people to just stop there and just see all the
devices of the group show up in the guest, but by all means feel free to
suggest a command line interface that allows to more precisely specify
which of the devices in the group to pass through and at what address.

> >  - The qemu vfio code hooks directly into ioapic ... of course that
> > won't fly with anything !x86
> 
> I spent a lot of time looking for an architecture neutral solution here,
> but I don't think it exists.  Please prove me wrong.

No it doesn't, I agree; that's why it should be some kind of notifier or
function pointer set up by the platform specific code.

>   The problem is
> that we have to disable INTx on an assigned device after it fires (VFIO
> does this automatically).  If we don't do this, a non-responsive or
> malicious guest could sit on the interrupt, causing it to fire
> repeatedly as a DoS on the host.  The only indication that we can rely
> on to re-enable INTx is when the guest CPU writes an EOI to the APIC.
> We can't just wait for device accesses because a) the device CSRs are
> (hopefully) direct mapped and we'd have to slow map them or attempt to
> do some kind of dirty logging to detect when they're accesses b) what
> constitutes an interrupt service is device specific.
> 
> That means we need to figure out how PCI interrupt 'A' (or B...)
> translates to a GSI (Global System Interrupt - ACPI definition, but
> hopefully a generic concept).  That GSI identifies a pin on an IOAPIC,
> which will also see the APIC EOI.  And just to spice things up, the
> guest can change the PCI to GSI mappings via ACPI.  I think the set of
> callbacks I've added are generic (maybe I left ioapic in the name), but
> yes they do need to be implemented for other architectures.  Patches
> appreciated from those with knowledge of the systems and/or access to
> device specs.  This is the only reason that I make QEMU VFIO only build
> for x86.

Right, and we need to cook a similar sauce for POWER; it's an area that
has to be arch specific (and in fact specific to the particular HW
machine being emulated), so we just need to find out the cleanest way
for the platform to "register" the right callbacks here.

Not a big deal, I just felt like mentioning it :-)
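The shape of such a platform registration hook is straightforward. Below
is a small sketch of the idea (in Python for brevity, with entirely
hypothetical names; the real thing would be C function pointers filled in
by machine code): the platform installs its own EOI-notification handler
at setup time, and the generic vfio layer only ever calls through the
registered entry.

```python
# Sketch of platform-registered interrupt callbacks (hypothetical API;
# nothing here reflects actual VFIO or qemu code).

class IntxOps:
    """Hooks the generic layer calls; each platform fills these in."""
    def __init__(self, name, eoi_notify):
        self.name = name
        self.eoi_notify = eoi_notify   # called when the guest EOIs the interrupt

_platform_ops = None

def register_intx_ops(ops):
    """Platform/machine setup code installs its hooks here."""
    global _platform_ops
    _platform_ops = ops

def guest_eoi(pin):
    # Generic code: re-enable INTx via whatever the platform registered,
    # without knowing whether it's an x86 ioapic or a POWER interrupt
    # controller behind the hook.
    if _platform_ops is None:
        raise RuntimeError("no platform INTx ops registered")
    return _platform_ops.eoi_notify(pin)

# x86 would register an ioapic-based handler; POWER would register its own:
register_intx_ops(IntxOps("spapr", lambda pin: f"re-enable INTx on pin {pin}"))
print(guest_eoi(3))   # re-enable INTx on pin 3
```

The point of the indirection is only that the ioapic knowledge moves out
of the generic vfio code and behind a per-platform registration.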

> >  - The various "objects" dealt with here, -especially- interrupts and
> > iommu, need a better in-kernel API so that fast in-kernel emulation can
> > take over from qemu based emulation. The way we need to do some of this
> > on POWER differs from x86. We can elaborate later, it's not necessarily
> > a killer either but essentially we'll take the bulk of interrupt
> > handling away from VFIO to the point where it won't see any of it at
> > all.
> 
> The plan for x86 is to connect VFIO eventfds directly to KVM irqfds and
> bypass QEMU.  This is exactly what VHOST does today and fairly trivial
> to enable for MSI once we get it merged.  INTx would require us to be
> able to define a level triggered irqfd in KVM and it's not yet clear if
> we care that much about INTx performance.

I care enough because our exit cost to qemu is much higher than x86, and
I can pretty easily emulate my PIC entirely in real mode (from within
the guest context) which is what I intend to do :-)

On the other hand, I have no reason to treat MSI or LSI differently, so
all I really need to do is get back to the underlying platform HW
interrupt number, and I think I can do that. So as long as I have a hook
to know what's there and what has been enabled, these interrupts will
simply cease to be visible to either qemu or vfio.

Another reason why I don't like allowing shared interrupts in different
guests with DisINTx :-) Because that means that such interrupts would
have to go back all the way to qemu/vfio :-) But I can always have a
fallback there, it's really the problem of "trusting" DisINTx that
concerns me.
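For reference, the irqfd plumbing in the quoted paragraph is plain
file-descriptor signaling: the interrupt source writes an fd and the
consumer (KVM rather than qemu) drains it, taking qemu out of the hot
path. A minimal stand-in for the pattern, using an ordinary pipe where
the real code uses eventfd(2):

```python
import os

# Demonstrates the fd-handoff pattern behind VFIO -> KVM irqfds: the
# producer signals an fd, the consumer wakes up and handles it. Real
# code uses eventfd(2); a pipe stands in here for portability.

read_fd, write_fd = os.pipe()

def device_fires_interrupt():
    os.write(write_fd, b"\x01")        # VFIO side: interrupt arrived

def consumer_handles_interrupt():
    return os.read(read_fd, 1)         # KVM side: wakes, injects into guest

device_fires_interrupt()
print(consumer_handles_interrupt())    # b'\x01'
```

Once both ends hold the fd, no userspace round trip is needed per
interrupt, which is exactly the exit-cost saving being discussed.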

> We don't currently have a plan for accelerating IOMMU access since our
> current usage model doesn't need one.  We also need to consider MSI-X
> table acceleration for x86.  I hope we'll be able to use the new KVM
> ioctls for this.

Ok, we can give direct access to the MSI-X table to the guest on power
so that isn't an issue for us.

> Thanks for the write up, I think it will be good to let everyone digest
> it before we discuss this at KVM forum.

Agreed. As I think I may have mentioned already, I won't be able to make
it to the forum, but Paulus will, and I'll be in a nearby timezone, so I
might be able to join a call if it's deemed useful.

> Rather than your "groups" idea, I've been mulling over whether we can
> just expose the dependencies, configuration, and capabilities in sysfs
> and build qemu commandlines to describe it.  For instance, if we simply
> start with creating iommu nodes in sysfs, we could create links under
> each iommu directory to the devices behind them.  Some kind of
> capability file could define properties like whether it's page table
> based or fixed iova window or the granularity of mapping the devices
> behind it.  Once we have that, we could probably make uiommu attach to
> each of those nodes.

Well, s/iommu/groups and you are pretty close to my original idea :-)

I don't mind that much what the details are, but I like the idea of not
having to construct a 3-page command line every time I want to
pass a device through; most "simple" usage scenarios don't care that
much.

> That means we know /dev/uiommu7 (random example) is our access to a
> specific iommu with a given set of devices behind it.

Linking those sysfs iommus or groups to a /dev/ entry is fine by me.
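For what it's worth, resolving a device to its group/iommu node from a
layout like the one being discussed is trivial. The sketch below builds a
throwaway directory tree in a made-up `devgroups/<group>/devices/<bdf>`
shape (the names are entirely hypothetical, not a real kernel interface)
and resolves membership from it:

```python
import os
import tempfile

# Build a throwaway tree mimicking the proposed sysfs layout
# (hypothetical names): <root>/devgroups/<group>/devices/<bdf>
root = tempfile.mkdtemp()
for group, devices in {"pe0": ["0001:00:01.0", "0001:00:01.1"],
                       "pe1": ["0002:01:00.0"]}.items():
    os.makedirs(os.path.join(root, "devgroups", group, "devices"))
    for bdf in devices:
        # In real sysfs these would be symlinks to /sys/devices/...;
        # empty files are enough for the walk.
        open(os.path.join(root, "devgroups", group, "devices", bdf), "w").close()

def find_group(bdf):
    """Walk the tree and return the group owning `bdf`, or None."""
    base = os.path.join(root, "devgroups")
    for group in os.listdir(base):
        if bdf in os.listdir(os.path.join(base, group, "devices")):
            return group
    return None

print(find_group("0001:00:01.1"))   # pe0
```

Whether the nodes are called "iommus" or "groups", the same walk gives
userspace the ownership unit and the char device to chown.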
  
>   If that iommu is
> a PE (via those capability files), then a user space entity (trying hard
> not to call it libvirt) can unbind all those devices from the host,
> maybe bind the ones it wants to assign to a guest to vfio and bind the
> others to pci-stub for safe keeping.  If you trust a user with
> everything in a PE, bind all the devices to VFIO, chown all
> the /dev/vfioX entries for those devices, and the /dev/uiommuX device.
>
> We might then come up with qemu command lines to describe interesting
> configurations, such as:
> 
> -device iommu,model=PAPR,uiommu=/dev/uiommu7,id=iommu.0 \
> -device pci-bus,...,iommu=iommu0,id=pci.0 \
> -device vfio,host=ssss:bb:dd.f,bus=pci.0,addr=dd.f,id=hostdev0
> 
> The userspace entity would obviously need to put things in the same PE
> in the right place, but it doesn't seem to take a lot of sysfs info to
> get that right.
> 
> Today we do DMA mapping via the VFIO device because the capabilities of
> the IOMMU domains change depending on which devices are connected (for
> VT-d, the least common denominator of the IOMMUs in play).  Forcing the
> DMA mappings through VFIO naturally forces the call order.  If we moved
> to something like above, we could switch the DMA mapping to the uiommu
> device, since the IOMMU would have fixed capabilities.

That makes sense.

> What gaps would something like this leave for your IOMMU granularity
> problems?  I'll need to think through how it works when we don't want to
> expose the iommu to the guest, maybe a model=none (default) that doesn't
> need to be connected to a pci bus and maps all guest memory.  Thanks,

Well, I would map those "iommus" to PEs, so what remains is the path to
put in place all the "other" bits and pieces, such as informing qemu of
the location and size of the MMIO segment(s) (so we can map the whole
thing and not bother with individual BARs) etc... 

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-07-30 18:20   ` Alex Williamson
  (?)
@ 2011-07-30 23:55     ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 322+ messages in thread
From: Benjamin Herrenschmidt @ 2011-07-30 23:55 UTC (permalink / raw)
  To: Alex Williamson
  Cc: chrisw, Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	qemu-devel, David Gibson, aafabbri, iommu, linuxppc-dev, benve

On Sat, 2011-07-30 at 12:20 -0600, Alex Williamson wrote:

> On x86, the USB controllers don't typically live behind a PCIe-to-PCI
> bridge, so don't suffer the source identifier problem, but they do often
> share an interrupt.  But even then, we can count on most modern devices
> supporting PCI2.3, and thus the DisINTx feature, which allows us to
> share interrupts.  In any case, yes, it's more rare but we need to know
> how to handle devices behind PCI bridges.  However I disagree that we
> need to assign all the devices behind such a bridge to the guest.

Well, ok so let's dig a bit more here :-) First, yes I agree they don't
all need to appear to the guest. My point is really that we must prevent
them to be "used" by somebody else, either host or another guest.

Now once you get there, I personally prefer having a clear "group"
ownership rather than having devices stay in some "limbo" under vfio
control but it's an implementation detail.

Regarding DisINTx, well, it's a bit like putting separate PCIe functions
into separate guests, it looks good ... but you are taking a chance.
Note that I do intend to do some of that for power ... well I think, I
haven't completely made my mind.

pHyp for has a stricter requirement, PEs essentially are everything
behind a bridge. If you have a slot, you have some kind of bridge above
this slot and everything on it will be a PE.

The problem I see is that with your filtering of config space, BAR
emulation, DisINTx etc... you essentially assume that you can reasonably
reliably isolate devices. But in practice, it's chancy. Some devices for
example have "backdoors" into their own config space via MMIO. If I have
such a device in a guest, I can completely override your DisINTx and
thus DOS your host or another guest with a shared interrupt. I can move
my MMIO around and DOS another function by overlapping the addresses.

You can really only be protect yourself against a device if you have it
behind a bridge (in addition to having a filtering iommu), which limits
the MMIO span (and thus letting the guest whack the BARs randomly will
only allow that guest to shoot itself in the foot).

Some bridges also provide a way to block INTx below them which comes in
handy but it's bridge specific. Some devices can be coerced to send the
INTx "assert" message and never de-assert it (for example by doing a
soft-reset while it's asserted, which can be done with some devices with
an MMIO).

Anything below a PCIe -> PCI/PCI-X needs to also be "grouped" due to
simple lack of proper filtering by the iommu (PCI-X in theory has RIDs
and fowards them up, but this isn't very reliable, for example it fails
over with split transactions).

Fortunately in PCIe land, we most have bridges above everything. The
problem somewhat remains with functions of a device, how can you be sure
that there isn't a way via some MMIO to create side effects on the other
functions of the device ? (For example by checkstopping the whole
thing). You can't really :-)

So it boils down of the "level" of safety/isolation you want to provide,
and I suppose to some extent it's a user decision but the user needs to
be informed to some extent. A hard problem :-)
 
> There's a difference between removing the device from the host and
> exposing the device to the guest.  If I have a NIC and HBA behind a
> bridge, it's perfectly reasonable that I might only assign the NIC to
> the guest, but as you describe, we then need to prevent the host, or any
> other guest from making use of the HBA.

Yes. However the other device is in "limbo" and it may be not clear to
the user why it can't be used anymore :-)

The question is more, the user needs to "know" (or libvirt does, or
somebody ... ) that in order to pass-through device A, it must also
"remove" device B from the host. How can you even provide a meaningful
error message to the user if all VFIO does is give you something like
-EBUSY ?

So the information about the grouping constraint must trickle down
somewhat.

Look at it from a GUI perspective for example. Imagine a front-end
showing you devices in your system and allowing you to "Drag & drop"
them to your guest. How do you represent that need for grouping ? First
how do you expose it from kernel/libvirt to the GUI tool and how do you
represent it to the user ?

By grouping the devices in logical groups which end up being the
"objects" you can drag around, at least you provide some amount of
clarity. Now if you follow that path down to how the GUI app, libvirt
and possibly qemu need to know / resolve the dependency, being given the
"groups" as the primary information of what can be used for pass-through
makes everything a lot simpler.
 
> > - The -minimum- granularity of pass-through is not always a single
> > device and not always under SW control
> 
> But IMHO, we need to preserve the granularity of exposing a device to a
> guest as a single device.  That might mean some devices are held hostage
> by an agent on the host.

Maybe but wouldn't that be even more confusing from a user perspective ?
And I think it makes it harder from an implementation of admin &
management tools perspective too.

> > - Having a magic heuristic in libvirt to figure out those constraints is
> > WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
> > knowledge of PCI resource management and getting it wrong in many many
> > cases, something that took years to fix essentially by ripping it all
> > out. This is kernel knowledge and thus we need the kernel to expose in a
> > way or another what those constraints are, what those "partitionable
> > groups" are.
> > 
> > - That does -not- mean that we cannot specify for each individual device
> > within such a group where we want to put it in qemu (what devfn etc...).
> > As long as there is a clear understanding that the "ownership" of the
> > device goes with the group, this is somewhat orthogonal to how they are
> > represented in qemu. (Not completely... if the iommu is exposed to the
> > guest ,via paravirt for example, some of these constraints must be
> > exposed but I'll talk about that more later).
> 
> Or we can choose not to expose all of the devices in the group to the
> guest?

As I said, I don't mind if you don't, I'm just worried about the
consequences of that from a usability standpoint. Having advanced
command line options to fine tune is fine. Being able to specify within a
"group" which devices to show and at what address is fine.

But I believe the basic entity to be manipulated from an interface
standpoint remains the group.

To get back to my GUI example, once you've D&D your group of devices
over, you can have the option to open that group and check/uncheck
individual devices & assign them addresses if you want. That doesn't
change the fact that practically speaking, the whole group is now owned
by the guest.

I will go further than that actually. If you look at how the isolation
HW works on POWER, the fact that I have the MMIO segmentation means that
I can simply give the entire group MMIO space to the guest. No problem
of small BARs, no need to slow-map them ... etc.. that's a pretty handy
feature don't you think ?

But that means that those other devices -will- be there, mapped along
with the one you care about. We may not expose it in config space but it
will be accessible. I suppose we can keep its IO/MEM decoding disabled.
But my point is that for all intents and purposes, it's actually owned by
the guest.

> > The interface currently proposed for VFIO (and associated uiommu)
> > doesn't handle that problem at all. Instead, it is entirely centered
> > around a specific "feature" of the VTd iommu's for creating arbitrary
> > domains with arbitrary devices (tho those devices -do- have the same
> > constraints exposed above, don't try to put 2 legacy PCI devices behind
> > the same bridge into 2 different domains !), but the API totally ignores
> > the problem, leaves it to libvirt "magic foo" and focuses on something
> > that is both quite secondary in the grand scheme of things, and quite
> > x86 VTd specific in the implementation and API definition.
> 
> To be fair, libvirt's "magic foo" is built out of the necessity that
> nobody else is defining the rules.

Sure, which is why I propose that the kernel exposes the rules since
it's really the one right place to have that sort of HW constraint
knowledge, especially since it can be partially at least platform
specific.
 
 .../...

> > Maybe something like /sys/devgroups ? This probably warrants involving
> > more kernel people into the discussion.
> 
> I don't yet buy into passing groups to qemu since I don't buy into the
> idea of always exposing all of those devices to qemu.  Would it be
> sufficient to expose iommu nodes in sysfs that link to the devices
> behind them and describe properties and capabilities of the iommu
> itself?  More on this at the end.

Well, iommu aren't the only factor. I mentioned shared interrupts (and
my unwillingness to always trust DisINTx), there's also the MMIO
grouping I mentioned above (in which case it's an x86 -limitation- with
small BARs that I don't want to inherit, especially since it's based on
PAGE_SIZE and we commonly have 64K page size on POWER), etc...

So I'm not too much of a fan of making it entirely look like the iommu is the
primary factor, but we -can-, that would be workable. I still prefer
calling a cat a cat and exposing the grouping for what it is, as I think
I've explained already above, tho. 

 .../...

> > Now some of this can be fixed with tweaks, and we've started doing it
> > (we have a working pass-through using VFIO, forgot to mention that, it's
> > just that we don't like what we had to do to get there).
> 
> This is a result of wanting to support *unmodified* x86 guests.  We
> don't have the luxury of having a predefined pvDMA spec that all x86
> OSes adhere to. 

No but you could emulate a HW iommu no ?

>  The 32bit problem is unfortunate, but the priority use
> case for assigning devices to guests is high performance I/O, which
> usually entails modern, 64bit hardware.  I'd like to see us get to the
> point of having emulated IOMMU hardware on x86, which could then be
> backed by VFIO, but for now guest pinning is the most practical and
> useful.

For your current case maybe. It's just not very future proof imho.
Anyways, it's fixable, but the APIs as they are make it a bit clumsy.

 .../...

> > Also our next generation chipset may drop support for PIO completely.
> > 
> > On the other hand, because PIO is just a special range of MMIO for us,
> > we can do normal pass-through on it and don't need any of the emulation
> > done by qemu.
> 
> Maybe we can add mmap support to PIO regions on non-x86.

We have to, yes. I haven't looked into it yet, it should be easy if VFIO
kernel side starts using the "proper" PCI mmap interfaces in kernel (the
same interfaces sysfs & proc use).

> >   * MMIO constraints
> > 
> > The QEMU side VFIO code hard wires various constraints that are entirely
> > based on various requirements you decided you have on x86 but don't
> > necessarily apply to us :-)
> > 
> > Due to our paravirt nature, we don't need to masquerade the MSI-X table
> > for example. At all. If the guest configures crap into it, too bad, it
> > can only shoot itself in the foot since the host bridge enforces
> > validation anyways as I explained earlier. Because it's all paravirt, we
> > don't need to "translate" the interrupt vectors & addresses, the guest
> > will call hypercalls to configure things anyways.
> 
> With interrupt remapping, we can allow the guest access to the MSI-X
> table, but since that takes the host out of the loop, there's
> effectively no way for the guest to correctly program it directly by
> itself.

Right, I think what we need here is some kind of capabilities to
"disable" those "features" of qemu vfio.c that aren't needed on our
platform :-) Shouldn't be too hard. We need to make this runtime tho
since different machines can have different "capabilities".

> > We don't need to prevent MMIO pass-through for small BARs at all. This
> > should be some kind of capability or flag passed by the arch. Our
> > segmentation of the MMIO domain means that we can give entire segments
> > to the guest and let it access anything in there (those segments are a
> > multiple of the page size always). Worst case it will access outside of
> > a device BAR within a segment and will cause the PE to go into error
> > state, shooting itself in the foot, there is no risk of side effect
> > outside of the guest boundaries.
> 
> Sure, this could be some kind of capability flag, maybe even implicit in
> certain configurations.

Yup.

> > In fact, we don't even need to emulate BAR sizing etc... in theory. Our
> > paravirt guests expect the BARs to have been already allocated for them
> > by the firmware and will pick up the addresses from the device-tree :-)
> > 
> > Today we use a "hack", putting all 0's in there and triggering the linux
> > code path to reassign unassigned resources (which will use BAR
> > emulation) but that's not what we are -supposed- to do. Not a big deal
> > and having the emulation there won't -hurt- us, it's just that we don't
> > really need any of it.
> > 
> > We have a small issue with ROMs. Our current KVM only works with huge
> > pages for guest memory but that is being fixed. So the way qemu maps the
> > ROM copy into the guest address space doesn't work. It might be handy
> > anyways to have a way for qemu to use MMIO emulation for ROM access as a
> > fallback. I'll look into it.
> 
> So that means ROMs don't work for you on emulated devices either?  The
> reason we read it once and map it into the guest is because Michael
> Tsirkin found a section in the PCI spec that indicates devices can share
> address decoders between BARs and ROM.

Yes, he is correct.

>   This means we can't just leave
> the enabled bit set in the ROM BAR, because it could actually disable an
> address decoder for a regular BAR.  We could slow-map the actual ROM,
> enabling it around each read, but shadowing it seemed far more
> efficient.

Right. We can slow map the ROM, or we can not care :-) At the end of the
day, what is the difference here between a "guest" under qemu and the
real thing bare metal on the machine ? IE. They have the same issue vs.
accessing the ROM. IE. I don't see why qemu should try to make it safe
to access it at any time while it isn't on a real machine. Since VFIO
resets the devices before putting them in guest space, they should be
accessible no ? (Might require a hard reset for some devices tho ... )

In any case, it's not a big deal and we can sort it out, I'm happy to
fallback to slow map to start with and eventually we will support small
pages mappings on POWER anyways, it's a temporary limitation.

> >   * EEH
> > 
> > This is the name of those fancy error handling & isolation features I
> > mentioned earlier. To some extent it's a superset of AER, but we don't
> > generally expose AER to guests (or even the host), it's swallowed by
> > firmware into something else that provides a superset (well mostly) of
> > the AER information, and allow us to do those additional things like
> > isolating/de-isolating, reset control etc...
> > 
> > Here too, we'll need arch specific APIs through VFIO. Not necessarily a
> > huge deal, I mention it for completeness.
> 
> We expect to do AER via the VFIO netlink interface, which even though
> its bashed below, would be quite extensible to supporting different
> kinds of errors.

As could platform specific ioctls :-)

> >    * Misc
> > 
> > There's lots of small bits and pieces... in no special order:
> > 
> >  - netlink ? WTF ! Seriously, we don't need a hybrid API with a bit of
> > netlink and a bit of ioctl's ... it's not like there's something
> > fundamentally  better for netlink vs. ioctl... it really depends what
> > you are doing, and in this case I fail to see what netlink brings you
> > other than bloat and more stupid userspace library deps.
> 
> The netlink interface is primarily for host->guest signaling.  I've only
> implemented the remove command (since we're lacking a pcie-host in qemu
> to do AER), but it seems to work quite well.  If you have suggestions
> for how else we might do it, please let me know.  This seems to be the
> sort of thing netlink is supposed to be used for.

I don't understand what the advantage of netlink is compared to just
extending your existing VFIO ioctl interface, possibly using children
fd's as we do for example with spufs, but it's not a huge deal. It's just
that netlink has its own gotchas and I don't like multi-headed
interfaces.

> >  - I don't like too much the fact that VFIO provides yet another
> > different API to do what we already have at least 2 kernel APIs for, ie,
> > BAR mapping and config space access. At least it should be better at
> > using the backend infrastructure of the 2 others (sysfs & procfs). I
> > understand it wants to filter in some case (config space) and -maybe-
> > yet another API is the right way to go but allow me to have my doubts.
> 
> The use of PCI sysfs is actually one of my complaints about current
> device assignment.  To do assignment with an unprivileged guest we need
> to open the PCI sysfs config file for it, then change ownership on a
> handful of other PCI sysfs files, then there's this other pci-stub thing
> to maintain ownership, but the kvm ioctls don't actually require it and
> can grab onto any free device...  We are duplicating some of that in
> VFIO, but we also put the ownership of the device behind a single device
> file.  We do have the uiommu problem that we can't give an unprivileged
> user ownership of that, but your usage model may actually make that
> easier.  More below...
> 
> > One thing I thought about but you don't seem to like it ... was to use
> > the need to represent the partitionable entity as groups in sysfs that I
> > talked about earlier. Those could have per-device subdirs with the usual
> > config & resource files, same semantic as the ones in the real device,
> > but when accessed via the group they get filtering. I might or might not
> > be practical in the end, tbd, but it would allow apps using a slightly
> > modified libpci for example to exploit some of this.
> 
> I may be tainted by our disagreement that all the devices in a group
> need to be exposed to the guest and qemu could just take a pointer to a
> sysfs directory.  That seems very unlike qemu and pushes more of the
> policy into qemu, which seems like the wrong direction.

I don't see how it pushes "policy" into qemu.

The "policy" here is imposed by the HW setup and exposed by the
kernel :-) Giving qemu a group means qemu takes "owership" of that bunch
of devices, so far I don't see what's policy about that. From there, it
would be "handy" for people to just stop there and just see all the
devices of the group show up in the guest, but by all means feel free to
suggest a command line interface that allows to more precisely specify
which of the devices in the group to pass through and at what address.
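Purely for illustration (none of these option names exist, I'm making them
up here), something along these lines would do:

```
-device vfio-group,group=/sys/devgroups/pe7,id=grp0 \
-device vfio-dev,group=grp0,host=0000:02:00.0,addr=04.0 \
-device vfio-dev,group=grp0,host=0000:02:00.1,addr=04.1
```

Naming the group takes ownership of everything in it; the individual
vfio-dev lines then opt chosen members into guest visibility at chosen
slots, and anything not listed stays owned-but-hidden.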

> >  - The qemu vfio code hooks directly into ioapic ... of course that
> > won't fly with anything !x86
> 
> I spent a lot of time looking for an architecture neutral solution here,
> but I don't think it exists.  Please prove me wrong.

No it doesn't, I agree. That's why it should be some kind of notifier or
function pointer setup by the platform specific code.

>   The problem is
> that we have to disable INTx on an assigned device after it fires (VFIO
> does this automatically).  If we don't do this, a non-responsive or
> malicious guest could sit on the interrupt, causing it to fire
> repeatedly as a DoS on the host.  The only indication that we can rely
> on to re-enable INTx is when the guest CPU writes an EOI to the APIC.
> We can't just wait for device accesses because a) the device CSRs are
> (hopefully) direct mapped and we'd have to slow map them or attempt to
> do some kind of dirty logging to detect when they're accesses b) what
> constitutes an interrupt service is device specific.
> 
> That means we need to figure out how PCI interrupt 'A' (or B...)
> translates to a GSI (Global System Interrupt - ACPI definition, but
> hopefully a generic concept).  That GSI identifies a pin on an IOAPIC,
> which will also see the APIC EOI.  And just to spice things up, the
> guest can change the PCI to GSI mappings via ACPI.  I think the set of
> callbacks I've added are generic (maybe I left ioapic in the name), but
> yes they do need to be implemented for other architectures.  Patches
> appreciated from those with knowledge of the systems and/or access to
> device specs.  This is the only reason that I make QEMU VFIO only build
> for x86.

Right, and we need to cook a similar sauce for POWER, it's an area that
has to be arch specific (and in fact specific to the specific HW machine
being emulated), so we just need to find out what's the cleanest way for
the platform to "register" the right callbacks here.

Not a big deal, I just felt like mentioning it :-)

> >  - The various "objects" dealt with here, -especially- interrupts and
> > iommu, need a better in-kernel API so that fast in-kernel emulation can
> > take over from qemu based emulation. The way we need to do some of this
> > on POWER differs from x86. We can elaborate later, it's not necessarily
> > a killer either but essentially we'll take the bulk of interrupt
> > handling away from VFIO to the point where it won't see any of it at
> > all.
> 
> The plan for x86 is to connect VFIO eventfds directly to KVM irqfds and
> bypass QEMU.  This is exactly what VHOST does today and fairly trivial
> to enable for MSI once we get it merged.  INTx would require us to be
> able to define a level triggered irqfd in KVM and it's not yet clear if
> we care that much about INTx performance.

I care enough because our exit cost to qemu is much higher than x86, and
I can pretty easily emulate my PIC entirely in real mode (from within
the guest context) which is what I intend to do :-)

On the other hand, I have no reason to treat MSI or LSI differently, so
all I really need to is get back to the underlying platform HW interrupt
number and I think I can do that. So as long as I have a hook to know
what's there and what has been enabled, these interrupts will simply
cease to be visible to either qemu or vfio.

Another reason why I don't like allowing shared interrupts in different
guests with DisINTx :-) Because that means that such interrupts would
have to go back all the way to qemu/vfio :-) But I can always have a
fallback there, it's really the problem of "trusting" DisINTx that
concerns me.

> We don't currently have a plan for accelerating IOMMU access since our
> current usage model doesn't need one.  We also need to consider MSI-X
> table acceleration for x86.  I hope we'll be able to use the new KVM
> ioctls for this.

Ok, we can give direct access to the MSI-X table to the guest on POWER
so that isn't an issue for us.

> Thanks for the write up, I think it will be good to let everyone digest
> it before we discuss this at KVM forum.

Agreed. As I think I may have mentioned already, I won't be able to make
it to the forum, but Paulus will and I'll be in a closeby timezone, so I
might be able to join a call if it's deemed useful.

> Rather than your "groups" idea, I've been mulling over whether we can
> just expose the dependencies, configuration, and capabilities in sysfs
> and build qemu commandlines to describe it.  For instance, if we simply
> start with creating iommu nodes in sysfs, we could create links under
> each iommu directory to the devices behind them.  Some kind of
> capability file could define properties like whether it's page table
> based or fixed iova window or the granularity of mapping the devices
> behind it.  Once we have that, we could probably make uiommu attach to
> each of those nodes.

Well, s/iommu/groups and you are pretty close to my original idea :-)

I don't mind that much what the details are, but I like the idea of not
having to construct a 3-pages command line every time I want to
pass-through a device, most "simple" usage scenario don't care that
much.

> That means we know /dev/uiommu7 (random example) is our access to a
> specific iommu with a given set of devices behind it.

Linking those sysfs iommus or groups to a /dev/ entry is fine by me.
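To make that concrete, with purely hypothetical file names (nothing below
exists today), the kind of node I have in mind might look like:

```
/sys/devgroups/pe7/
    capabilities      # e.g. window type, mapping granularity
    mmio_window       # location/size of the PE's MMIO segment(s)
    chardev           # link to the /dev entry for this group/iommu
    devices/
        0000:02:00.0 -> ../../../../devices/pci0000:00/...
        0000:02:00.1 -> ../../../../devices/pci0000:00/...
```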
  
>   If that iommu is
> a PE (via those capability files), then a user space entity (trying hard
> not to call it libvirt) can unbind all those devices from the host,
> maybe bind the ones it wants to assign to a guest to vfio and bind the
> others to pci-stub for safe keeping.  If you trust a user with
> everything in a PE, bind all the devices to VFIO, chown all
> the /dev/vfioX entries for those devices, and the /dev/uiommuX device.
>
> We might then come up with qemu command lines to describe interesting
> configurations, such as:
> 
> -device iommu,model=PAPR,uiommu=/dev/uiommu7,id=iommu.0 \
> -device pci-bus,...,iommu=iommu.0,id=pci.0 \
> -device vfio,host=ssss:bb:dd.f,bus=pci.0,addr=dd.f,id=hostdev0
> 
> The userspace entity would obviously need to put things in the same PE
> in the right place, but it doesn't seem to take a lot of sysfs info to
> get that right.
> 
> Today we do DMA mapping via the VFIO device because the capabilities of
> the IOMMU domains change depending on which devices are connected (for
> VT-d, the least common denominator of the IOMMUs in play).  Forcing the
> DMA mappings through VFIO naturally forces the call order.  If we moved
> to something like above, we could switch the DMA mapping to the uiommu
> device, since the IOMMU would have fixed capabilities.

That makes sense.

> What gaps would something like this leave for your IOMMU granularity
> problems?  I'll need to think through how it works when we don't want to
> expose the iommu to the guest, maybe a model=none (default) that doesn't
> need to be connected to a pci bus and maps all guest memory.  Thanks,

Well, I would map those "iommus" to PEs, so what remains is the path to
put all the "other" bits and pieces, such as informing qemu of the location
and size of the MMIO segment(s) (so we can map the whole thing and not
bother with individual BARs) etc... 

Cheers,
Ben.


* Re: kvm PCI assignment & VFIO ramblings
@ 2011-07-30 23:55     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 322+ messages in thread
From: Benjamin Herrenschmidt @ 2011-07-30 23:55 UTC (permalink / raw)
  To: Alex Williamson
  Cc: chrisw, Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	qemu-devel, David Gibson, aafabbri, iommu, Anthony Liguori,
	linuxppc-dev, benve

On Sat, 2011-07-30 at 12:20 -0600, Alex Williamson wrote:

> On x86, the USB controllers don't typically live behind a PCIe-to-PCI
> bridge, so don't suffer the source identifier problem, but they do often
> share an interrupt.  But even then, we can count on most modern devices
> supporting PCI2.3, and thus the DisINTx feature, which allows us to
> share interrupts.  In any case, yes, it's more rare but we need to know
> how to handle devices behind PCI bridges.  However I disagree that we
> need to assign all the devices behind such a bridge to the guest.

Well, ok so let's dig a bit more here :-) First, yes I agree they don't
all need to appear to the guest. My point is really that we must prevent
them to be "used" by somebody else, either host or another guest.

Now once you get there, I personally prefer having a clear "group"
ownership rather than having devices stay in some "limbo" under vfio
control but it's an implementation detail.

Regarding DisINTx, well, it's a bit like putting separate PCIe functions
into separate guests, it looks good ... but you are taking a chance.
Note that I do intend to do some of that for power ... well I think, I
haven't completely made my mind.

pHyp has a stricter requirement: PEs essentially are everything
behind a bridge. If you have a slot, you have some kind of bridge above
this slot and everything on it will be a PE.

The problem I see is that with your filtering of config space, BAR
emulation, DisINTx etc... you essentially assume that you can reasonably
reliably isolate devices. But in practice, it's chancy. Some devices for
example have "backdoors" into their own config space via MMIO. If I have
such a device in a guest, I can completely override your DisINTx and
thus DOS your host or another guest with a shared interrupt. I can move
my MMIO around and DOS another function by overlapping the addresses.

You can really only protect yourself against a device if you have it
behind a bridge (in addition to having a filtering iommu), which limits
the MMIO span (and thus letting the guest whack the BARs randomly will
only allow that guest to shoot itself in the foot).

Some bridges also provide a way to block INTx below them which comes in
handy but it's bridge specific. Some devices can be coerced to send the
INTx "assert" message and never de-assert it (for example by doing a
soft-reset while it's asserted, which can be done with some devices with
an MMIO).

Anything below a PCIe -> PCI/PCI-X needs to also be "grouped" due to
simple lack of proper filtering by the iommu (PCI-X in theory has RIDs
and forwards them up, but this isn't very reliable; for example it falls
over with split transactions).

Fortunately in PCIe land, we mostly have bridges above everything. The
problem somewhat remains with functions of a device, how can you be sure
that there isn't a way via some MMIO to create side effects on the other
functions of the device ? (For example by checkstopping the whole
thing). You can't really :-)

So it boils down of the "level" of safety/isolation you want to provide,
and I suppose to some extent it's a user decision but the user needs to
be informed to some extent. A hard problem :-)
 
> There's a difference between removing the device from the host and
> exposing the device to the guest.  If I have a NIC and HBA behind a
> bridge, it's perfectly reasonable that I might only assign the NIC to
> the guest, but as you describe, we then need to prevent the host, or any
> other guest from making use of the HBA.

Yes. However the other device is in "limbo" and it may not be clear to
the user why it can't be used anymore :-)

The question is more, the user needs to "know" (or libvirt does, or
somebody ... ) that in order to pass-through device A, it must also
"remove" device B from the host. How can you even provide a meaningful
error message to the user if all VFIO does is give you something like
-EBUSY ?

So the information about the grouping constraint must trickle down
somewhat.

Look at it from a GUI perspective for example. Imagine a front-end
showing you devices in your system and allowing you to "Drag & drop"
them to your guest. How do you represent that need for grouping ? First
how do you expose it from kernel/libvirt to the GUI tool and how do you
represent it to the user ?

By grouping the devices in logical groups which end up being the
"objects" you can drag around, at least you provide some amount of
clarity. Now if you follow that path down to how the GUI app, libvirt
and possibly qemu need to know / resolve the dependency, being given the
"groups" as the primary information of what can be used for pass-through
makes everything a lot simpler.
 
> > - The -minimum- granularity of pass-through is not always a single
> > device and not always under SW control
> 
> But IMHO, we need to preserve the granularity of exposing a device to a
> guest as a single device.  That might mean some devices are held hostage
> by an agent on the host.

Maybe but wouldn't that be even more confusing from a user perspective ?
And I think it makes it harder from an implementation of admin &
management tools perspective too.

> > - Having a magic heuristic in libvirt to figure out those constraints is
> > WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
> > knowledge of PCI resource management and getting it wrong in many many
> > cases, something that took years to fix essentially by ripping it all
> > out. This is kernel knowledge and thus we need the kernel to expose in a
> > way or another what those constraints are, what those "partitionable
> > groups" are.
> > 
> > - That does -not- mean that we cannot specify for each individual device
> > within such a group where we want to put it in qemu (what devfn etc...).
> > As long as there is a clear understanding that the "ownership" of the
> > device goes with the group, this is somewhat orthogonal to how they are
> > represented in qemu. (Not completely... if the iommu is exposed to the
> > guest ,via paravirt for example, some of these constraints must be
> > exposed but I'll talk about that more later).
> 
> Or we can choose not to expose all of the devices in the group to the
> guest?

As I said, I don't mind if you don't, I'm just worried about the
consequences of that from a usability standpoint. Having advanced
command line option to fine tune is fine. Being able to specify within a
"group" which devices to show and at what address if fine.

But I believe the basic entity to be manipulated from an interface
standpoitn remains the group.

To get back to my GUI example, once you've D&D your group of devices
over, you can have the option to open that group and check/uncheck
individual devices & assign them addresses if you want. That doesn't
change the fact that practically speaking, the whole group is now owned
by the guest.

I will go further than that actually. If you look at how the isolation
HW works on POWER, the fact that I have the MMIO segmentation means that
I can simply give the entire group MMIO space to the guest. No problem
of small BARs, no need to slow-map them ... etc.. that's a pretty handy
feature don't you think ?

But that means that those other devices -will- be there, mapped along
with the one you care about. We may not expose it in config space but it
will be accessible. I suppose we can keep its IO/MEM decoding disabled.
But my point is that for all intend and purpose, it's actually owned by
the guest.

> > The interface currently proposed for VFIO (and associated uiommu)
> > doesn't handle that problem at all. Instead, it is entirely centered
> > around a specific "feature" of the VTd iommu's for creating arbitrary
> > domains with arbitrary devices (tho those devices -do- have the same
> > constraints exposed above, don't try to put 2 legacy PCI devices behind
> > the same bridge into 2 different domains !), but the API totally ignores
> > the problem, leaves it to libvirt "magic foo" and focuses on something
> > that is both quite secondary in the grand scheme of things, and quite
> > x86 VTd specific in the implementation and API definition.
> 
> To be fair, libvirt's "magic foo" is built out of the necessity that
> nobody else is defining the rules.

Sure, which is why I propose that the kernel exposes the rules since
it's really the one right place to have that sort of HW constraint
knowledge, especially since it can be partially at least platform
specific.
 
 .../...

> > Maybe something like /sys/devgroups ? This probably warrants involving
> > more kernel people into the discussion.
> 
> I don't yet buy into passing groups to qemu since I don't buy into the
> idea of always exposing all of those devices to qemu.  Would it be
> sufficient to expose iommu nodes in sysfs that link to the devices
> behind them and describe properties and capabilities of the iommu
> itself?  More on this at the end.

Well, iommu aren't the only factor. I mentioned shared interrupts (and
my unwillingness to always trust DisINTx), there's also the MMIO
grouping I mentioned above (in which case it's an x86 -limitation- with
small BARs that I don't want to inherit, especially since it's based on
PAGE_SIZE and we commonly have 64K page size on POWER), etc...

So I'm not too fan of making it entirely look like the iommu is the
primary factor, but we -can-, that would be workable. I still prefer
calling a cat a cat and exposing the grouping for what it is, as I think
I've explained already above, tho. 

 .../...

> > Now some of this can be fixed with tweaks, and we've started doing it
> > (we have a working pass-through using VFIO, forgot to mention that, it's
> > just that we don't like what we had to do to get there).
> 
> This is a result of wanting to support *unmodified* x86 guests.  We
> don't have the luxury of having a predefined pvDMA spec that all x86
> OSes adhere to. 

No but you could emulate a HW iommu no ?

>  The 32bit problem is unfortunate, but the priority use
> case for assigning devices to guests is high performance I/O, which
> usually entails modern, 64bit hardware.  I'd like to see us get to the
> point of having emulated IOMMU hardware on x86, which could then be
> backed by VFIO, but for now guest pinning is the most practical and
> useful.

For your current case maybe. It's just not very future proof imho.
Anyways, it's fixable, but the APIs as they are make it a bit clumsy.

 .../...

> > Also our next generation chipset may drop support for PIO completely.
> > 
> > On the other hand, because PIO is just a special range of MMIO for us,
> > we can do normal pass-through on it and don't need any of the emulation
> > done qemu.
> 
> Maybe we can add mmap support to PIO regions on non-x86.

We have to yes. I haven't looked into it yet, it should be easy if VFIO
kernel side starts using the "proper" PCI mmap interfaces in kernel (the
same interfaces sysfs & proc use).

> >   * MMIO constraints
> > 
> > The QEMU side VFIO code hard wires various constraints that are entirely
> > based on various requirements you decided you have on x86 but don't
> > necessarily apply to us :-)
> > 
> > Due to our paravirt nature, we don't need to masquerade the MSI-X table
> > for example. At all. If the guest configures crap into it, too bad, it
> > can only shoot itself in the foot since the host bridge enforces
> > validation anyways as I explained earlier. Because it's all paravirt, we
> > don't need to "translate" the interrupt vectors & addresses, the guest
> > will call hypercalls to configure things anyways.
> 
> With interrupt remapping, we can allow the guest access to the MSI-X
> table, but since that takes the host out of the loop, there's
> effectively no way for the guest to correctly program it directly by
> itself.

Right, I think what we need here is some kind of capabilities to
"disable" those "features" of qemu vfio.c that aren't needed on our
platform :-) Shouldn't be too hard. We need to make this runtime tho
since different machines can have different "capabilities".

> > We don't need to prevent MMIO pass-through for small BARs at all. This
> > should be some kind of capability or flag passed by the arch. Our
> > segmentation of the MMIO domain means that we can give entire segments
> > to the guest and let it access anything in there (those segments are a
> > multiple of the page size always). Worst case it will access outside of
> > a device BAR within a segment and will cause the PE to go into error
> > state, shooting itself in the foot, there is no risk of side effect
> > outside of the guest boundaries.
> 
> Sure, this could be some kind of capability flag, maybe even implicit in
> certain configurations.

Yup.
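
To make the point concrete, the decision really only needs one
per-platform flag. Here's a rough sketch (Python just to illustrate the
logic, names invented, not actual qemu code) of the mmap-eligibility
test with such a capability:

```python
PAGE_SIZE = 4096  # x86 example; we commonly run 64K pages on POWER

def can_direct_map(bar_addr, bar_size, segmented_mmio):
    """Decide whether a BAR can be handed straight to the guest via mmap.

    Without isolation HW, a BAR must cover whole pages, otherwise
    neighbouring registers leak into the mapping.  With segmented MMIO
    (the PE scheme), a stray access only puts the PE in error state, so
    even tiny BARs can be direct-mapped."""
    if segmented_mmio:
        return True  # isolation HW bounds any stray access
    return (bar_addr % PAGE_SIZE == 0
            and bar_size % PAGE_SIZE == 0
            and bar_size >= PAGE_SIZE)
```

So a 256-byte BAR gets trapped-and-emulated on x86 but mapped through
on a segmented platform; same code path, one capability flag.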

> > In fact, we don't even need to emulate BAR sizing etc... in theory. Our
> > paravirt guests expect the BARs to have been already allocated for them
> > by the firmware and will pick up the addresses from the device-tree :-)
> > 
> > Today we use a "hack", putting all 0's in there and triggering the linux
> > code path to reassign unassigned resources (which will use BAR
> > emulation) but that's not what we are -supposed- to do. Not a big deal
> > and having the emulation there won't -hurt- us, it's just that we don't
> > really need any of it.
> > 
> > We have a small issue with ROMs. Our current KVM only works with huge
> > pages for guest memory but that is being fixed. So the way qemu maps the
> > ROM copy into the guest address space doesn't work. It might be handy
> > anyways to have a way for qemu to use MMIO emulation for ROM access as a
> > fallback. I'll look into it.
> 
> So that means ROMs don't work for you on emulated devices either?  The
> reason we read it once and map it into the guest is because Michael
> Tsirkin found a section in the PCI spec that indicates devices can share
> address decoders between BARs and ROM.

Yes, he is correct.

>   This means we can't just leave
> the enabled bit set in the ROM BAR, because it could actually disable an
> address decoder for a regular BAR.  We could slow-map the actual ROM,
> enabling it around each read, but shadowing it seemed far more
> efficient.

Right. We can slow map the ROM, or we can not care :-) At the end of the
day, what is the difference here between a "guest" under qemu and the
real thing bare metal on the machine ? IE. They have the same issue vs.
accessing the ROM. IE. I don't see why qemu should try to make it safe
to access it at any time while it isn't on a real machine. Since VFIO
resets the devices before putting them in guest space, they should be
accessible no ? (Might require a hard reset for some devices tho ... )

In any case, it's not a big deal and we can sort it out, I'm happy to
fallback to slow map to start with and eventually we will support small
pages mappings on POWER anyways, it's a temporary limitation.
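
For the record, the slow-map fallback is conceptually trivial; here's a
toy sketch (Python, mocked-up config space; the 0x30 offset and enable
bit are the real PCI expansion ROM BAR ones, but the device is fake) of
enabling the ROM decoder only around each read:

```python
PCI_ROM_ADDRESS = 0x30        # config-space offset of the expansion ROM BAR
PCI_ROM_ADDRESS_ENABLE = 0x1  # bit 0 enables the ROM decoder

class MockDevice:
    """Stand-in for a device whose ROM reads only succeed while the
    enable bit is set (mimicking a decoder shared with a regular BAR)."""
    def __init__(self, rom):
        self.cfg = bytearray(256)
        self.rom = rom
    def rom_enabled(self):
        return self.cfg[PCI_ROM_ADDRESS] & PCI_ROM_ADDRESS_ENABLE
    def read_rom(self, off, n):
        if not self.rom_enabled():
            return b'\xff' * n  # decoder routed to a regular BAR instead
        return self.rom[off:off + n]

def slow_read_rom(dev, off, n):
    """Enable the ROM decoder only for the duration of one read, so the
    regular BARs keep their decoder the rest of the time."""
    dev.cfg[PCI_ROM_ADDRESS] |= PCI_ROM_ADDRESS_ENABLE
    try:
        return dev.read_rom(off, n)
    finally:
        dev.cfg[PCI_ROM_ADDRESS] &= ~PCI_ROM_ADDRESS_ENABLE & 0xff
```

Slow, sure, but it never leaves the shared decoder pointing at the ROM
outside of the access itself.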

> >   * EEH
> > 
> > This is the name of those fancy error handling & isolation features I
> > mentioned earlier. To some extent it's a superset of AER, but we don't
> > generally expose AER to guests (or even the host), it's swallowed by
> > firmware into something else that provides a superset (well mostly) of
> > the AER information, and allow us to do those additional things like
> > isolating/de-isolating, reset control etc...
> > 
> > Here too, we'll need arch specific APIs through VFIO. Not necessarily a
> > huge deal, I mention it for completeness.
> 
> We expect to do AER via the VFIO netlink interface, which even though
> its bashed below, would be quite extensible to supporting different
> kinds of errors.

As could platform specific ioctls :-)

> >    * Misc
> > 
> > There's lots of small bits and pieces... in no special order:
> > 
> >  - netlink ? WTF ! Seriously, we don't need a hybrid API with a bit of
> > netlink and a bit of ioctl's ... it's not like there's something
> > fundamentally  better for netlink vs. ioctl... it really depends what
> > you are doing, and in this case I fail to see what netlink brings you
> > other than bloat and more stupid userspace library deps.
> 
> The netlink interface is primarily for host->guest signaling.  I've only
> implemented the remove command (since we're lacking a pcie-host in qemu
> to do AER), but it seems to work quite well.  If you have suggestions
> for how else we might do it, please let me know.  This seems to be the
> sort of thing netlink is supposed to be used for.

I don't understand what the advantage of netlink is compared to just
extending your existing VFIO ioctl interface, possibly using child
fds as we do for example with spufs, but it's not a huge deal. It's
just that netlink has its own gotchas and I don't like multi-headed
interfaces.
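
What I have in mind is roughly this (Python toy model; a pipe stands in
for the child fd, and VFIO_GET_EVENT_FD is a made-up name, not a real
ioctl): the device file hands back a child fd that userspace can poll,
instead of a separate netlink socket:

```python
import os, select, struct

EVT_REMOVE = 1  # hypothetical event code for "device remove requested"

class VfioLikeDevice:
    """Sketch of the fd-based alternative to netlink: events travel
    over a child fd obtained from the main device fd."""
    def __init__(self):
        # A pipe stands in for the fd a hypothetical
        # VFIO_GET_EVENT_FD ioctl would hand back.
        self._r, self._w = os.pipe()
    def event_fd(self):
        return self._r
    def post_event(self, evt):  # "kernel" side, stubbed out here
        os.write(self._w, struct.pack('=I', evt))

dev = VfioLikeDevice()
dev.post_event(EVT_REMOVE)  # e.g. hot-unplug request from the host
ready, _, _ = select.select([dev.event_fd()], [], [], 0)
evt, = struct.unpack('=I', os.read(dev.event_fd(), 4))
```

One interface, one fd namespace, and qemu's existing poll loop picks it
up for free.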

> >  - I don't like too much the fact that VFIO provides yet another
> > different API to do what we already have at least 2 kernel APIs for, ie,
> > BAR mapping and config space access. At least it should be better at
> > using the backend infrastructure of the 2 others (sysfs & procfs). I
> > understand it wants to filter in some case (config space) and -maybe-
> > yet another API is the right way to go but allow me to have my doubts.
> 
> The use of PCI sysfs is actually one of my complaints about current
> device assignment.  To do assignment with an unprivileged guest we need
> to open the PCI sysfs config file for it, then change ownership on a
> handful of other PCI sysfs files, then there's this other pci-stub thing
> to maintain ownership, but the kvm ioctls don't actually require it and
> can grab onto any free device...  We are duplicating some of that in
> VFIO, but we also put the ownership of the device behind a single device
> file.  We do have the uiommu problem that we can't give an unprivileged
> user ownership of that, but your usage model may actually make that
> easier.  More below...
> 
> > One thing I thought about but you don't seem to like it ... was to use
> > the need to represent the partitionable entity as groups in sysfs that I
> > talked about earlier. Those could have per-device subdirs with the usual
> > config & resource files, same semantic as the ones in the real device,
> > but when accessed via the group they get filtering. It might or might not
> > be practical in the end, tbd, but it would allow apps using a slightly
> > modified libpci for example to exploit some of this.
> 
> I may be tainted by our disagreement that all the devices in a group
> need to be exposed to the guest and qemu could just take a pointer to a
> sysfs directory.  That seems very unlike qemu and pushes more of the
> policy into qemu, which seems like the wrong direction.

I don't see how it pushes "policy" into qemu.

The "policy" here is imposed by the HW setup and exposed by the
kernel :-) Giving qemu a group means qemu takes "ownership" of that bunch
of devices, so far I don't see what's policy about that. From there, it
would be "handy" for people to just stop there and just see all the
devices of the group show up in the guest, but by all means feel free to
suggest a command line interface that allows to more precisely specify
which of the devices in the group to pass through and at what address.

> >  - The qemu vfio code hooks directly into ioapic ... of course that
> > won't fly with anything !x86
> 
> I spent a lot of time looking for an architecture neutral solution here,
> but I don't think it exists.  Please prove me wrong.

No it doesn't I agree, that's why it should be some kind of notifier or
function pointer setup by the platform specific code.

>   The problem is
> that we have to disable INTx on an assigned device after it fires (VFIO
> does this automatically).  If we don't do this, a non-responsive or
> malicious guest could sit on the interrupt, causing it to fire
> repeatedly as a DoS on the host.  The only indication that we can rely
> on to re-enable INTx is when the guest CPU writes an EOI to the APIC.
> We can't just wait for device accesses because a) the device CSRs are
> (hopefully) direct mapped and we'd have to slow map them or attempt to
> do some kind of dirty logging to detect when they're accesses b) what
> constitutes an interrupt service is device specific.
> 
> That means we need to figure out how PCI interrupt 'A' (or B...)
> translates to a GSI (Global System Interrupt - ACPI definition, but
> hopefully a generic concept).  That GSI identifies a pin on an IOAPIC,
> which will also see the APIC EOI.  And just to spice things up, the
> guest can change the PCI to GSI mappings via ACPI.  I think the set of
> callbacks I've added are generic (maybe I left ioapic in the name), but
> yes they do need to be implemented for other architectures.  Patches
> appreciated from those with knowledge of the systems and/or access to
> device specs.  This is the only reason that I make QEMU VFIO only build
> for x86.

Right, and we need to cook a similar sauce for POWER, it's an area that
has to be arch specific (and in fact specific to the actual HW machine
being emulated), so we just need to find out what's the cleanest way for
the platform to "register" the right callbacks here.

Not a big deal, I just felt like mentioning it :-)
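
At least the bridge-traversal part of that path is generic: the
conventional PCI-to-PCI bridge INTx swizzle is spec-defined, and only
the final lookup (the role ACPI's _PRT plays on x86) is platform soup.
A sketch (Python; the slot path and routing table are invented):

```python
def swizzle(pin, slot):
    """Conventional PCI-to-PCI bridge 'barber pole' swizzle: the pin a
    device on slot N asserts appears rotated by N on the bridge's
    upstream side.  Pins are 1=INTA .. 4=INTD."""
    return (pin - 1 + slot) % 4 + 1

def intx_to_gsi(slot_path, pin, routing):
    """Walk from the device up through each bridge (slot_path lists the
    device numbers, device-side first), then resolve the resulting pin
    through a platform routing table -- _PRT on x86, device-tree
    interrupt maps on POWER."""
    for slot in slot_path:
        pin = swizzle(pin, slot)
    return routing[pin]
```

So the arch callback really only has to supply `routing` (and know when
the guest rewrites it); the swizzle itself could live in common code.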

> >  - The various "objects" dealt with here, -especially- interrupts and
> > iommu, need a better in-kernel API so that fast in-kernel emulation can
> > take over from qemu based emulation. The way we need to do some of this
> > on POWER differs from x86. We can elaborate later, it's not necessarily
> > a killer either but essentially we'll take the bulk of interrupt
> > handling away from VFIO to the point where it won't see any of it at
> > all.
> 
> The plan for x86 is to connect VFIO eventfds directly to KVM irqfds and
> bypass QEMU.  This is exactly what VHOST does today and fairly trivial
> to enable for MSI once we get it merged.  INTx would require us to be
> able to define a level triggered irqfd in KVM and it's not yet clear if
> we care that much about INTx performance.

I care enough because our exit cost to qemu is much higher than x86, and
I can pretty easily emulate my PIC entirely in real mode (from within
the guest context) which is what I intend to do :-)

On the other hand, I have no reason to treat MSI or LSI differently, so
all I really need is to get back to the underlying platform HW interrupt
number, and I think I can do that. So as long as I have a hook to know
what's there and what has been enabled, these interrupts will simply
cease to be visible to either qemu or vfio.

Another reason why I don't like allowing shared interrupts in different
guests with DisINTx :-) Because that means that such interrupts would
have to go back all the way to qemu/vfio :-) But I can always have a
fallback there, it's really the problem of "trusting" DisINTx that
concerns me.

> We don't currently have a plan for accelerating IOMMU access since our
> current usage model doesn't need one.  We also need to consider MSI-X
> table acceleration for x86.  I hope we'll be able to use the new KVM
> ioctls for this.

Ok, we can give direct access to the MSI-X table to the guest on power
so that isn't an issue for us.

> Thanks for the write up, I think it will be good to let everyone digest
> it before we discuss this at KVM forum.

Agreed. As I think I may have mentioned already, I won't be able to make
it to the forum, but Paulus will and I'll be in a nearby timezone, so I
might be able to join a call if it's deemed useful.

> Rather than your "groups" idea, I've been mulling over whether we can
> just expose the dependencies, configuration, and capabilities in sysfs
> and build qemu commandlines to describe it.  For instance, if we simply
> start with creating iommu nodes in sysfs, we could create links under
> each iommu directory to the devices behind them.  Some kind of
> capability file could define properties like whether it's page table
> based or fixed iova window or the granularity of mapping the devices
> behind it.  Once we have that, we could probably make uiommu attach to
> each of those nodes.

Well, s/iommu/groups and you are pretty close to my original idea :-)

I don't mind that much what the details are, but I like the idea of not
having to construct a 3-pages command line every time I want to
pass-through a device, most "simple" usage scenario don't care that
much.

> That means we know /dev/uiommu7 (random example) is our access to a
> specific iommu with a given set of devices behind it.

Linking those sysfs iommus or groups to a /dev/ entry is fine by me.
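
Sketching what such a layout could look like and how a management tool
would consume it (Python mock; the devgroups path and the BDFs are
invented, the symlink-per-device convention is just the usual sysfs
class-directory idiom):

```python
import os, tempfile

# Build a mock of the proposed layout: <root>/devgroups/<group>/ with
# one symlink per member device, pointing back into the PCI device tree.
root = tempfile.mkdtemp()
grp = os.path.join(root, 'devgroups', 'group0')
os.makedirs(grp)
for bdf in ('0000:01:00.0', '0000:01:00.1'):
    os.symlink('/sys/bus/pci/devices/' + bdf, os.path.join(grp, bdf))

def group_members(group_dir):
    """What libvirt (or a GUI) would do before assignment: enumerate
    every device the kernel says travels with the group, so it can
    tell the user 'assigning .0 also takes .1 away from the host'
    instead of a bare -EBUSY."""
    return sorted(os.listdir(group_dir))
```

The point being that the grouping constraint becomes a directory
listing, not a heuristic somebody reimplements in userspace.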
  
>   If that iommu is
> a PE (via those capability files), then a user space entity (trying hard
> not to call it libvirt) can unbind all those devices from the host,
> maybe bind the ones it wants to assign to a guest to vfio and bind the
> others to pci-stub for safe keeping.  If you trust a user with
> everything in a PE, bind all the devices to VFIO, chown all
> the /dev/vfioX entries for those devices, and the /dev/uiommuX device.
>
> We might then come up with qemu command lines to describe interesting
> configurations, such as:
> 
> -device iommu,model=PAPR,uiommu=/dev/uiommu7,id=iommu.0 \
> -device pci-bus,...,iommu=iommu0,id=pci.0 \
> -device vfio,host=ssss:bb:dd.f,bus=pci.0,addr=dd.f,id=hostdev0
> 
> The userspace entity would obviously need to put things in the same PE
> in the right place, but it doesn't seem to take a lot of sysfs info to
> get that right.
> 
> Today we do DMA mapping via the VFIO device because the capabilities of
> the IOMMU domains change depending on which devices are connected (for
> VT-d, the least common denominator of the IOMMUs in play).  Forcing the
> DMA mappings through VFIO naturally forces the call order.  If we moved
> to something like above, we could switch the DMA mapping to the uiommu
> device, since the IOMMU would have fixed capabilities.

That makes sense.

> What gaps would something like this leave for your IOMMU granularity
> problems?  I'll need to think through how it works when we don't want to
> expose the iommu to the guest, maybe a model=none (default) that doesn't
> need to be connected to a pci bus and maps all guest memory.  Thanks,

Well, I would map those "iommus" to PEs, so what remains is the path to
put all the "other" bits and pieces in place, such as informing qemu of
the location and size of the MMIO segment(s) (so we can map the whole
thing and not bother with individual BARs), etc...

Cheers,
Ben.

* Re: [Qemu-devel] kvm PCI assignment & VFIO ramblings
@ 2011-07-30 23:55     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 322+ messages in thread
From: Benjamin Herrenschmidt @ 2011-07-30 23:55 UTC (permalink / raw)
  To: Alex Williamson
  Cc: chrisw, Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	qemu-devel, David Gibson, aafabbri, iommu, linuxppc-dev, benve

On Sat, 2011-07-30 at 12:20 -0600, Alex Williamson wrote:

> On x86, the USB controllers don't typically live behind a PCIe-to-PCI
> bridge, so don't suffer the source identifier problem, but they do often
> share an interrupt.  But even then, we can count on most modern devices
> supporting PCI2.3, and thus the DisINTx feature, which allows us to
> share interrupts.  In any case, yes, it's more rare but we need to know
> how to handle devices behind PCI bridges.  However I disagree that we
> need to assign all the devices behind such a bridge to the guest.

Well, ok so let's dig a bit more here :-) First, yes I agree they don't
all need to appear to the guest. My point is really that we must prevent
them from being "used" by somebody else, either host or another guest.

Now once you get there, I personally prefer having a clear "group"
ownership rather than having devices stay in some "limbo" under vfio
control but it's an implementation detail.

Regarding DisINTx, well, it's a bit like putting separate PCIe functions
into separate guests, it looks good ... but you are taking a chance.
Note that I do intend to do some of that for power ... well I think, I
haven't completely made up my mind.

pHyp has a stricter requirement: PEs essentially are everything
behind a bridge. If you have a slot, you have some kind of bridge above
this slot and everything on it will be a PE.

The problem I see is that with your filtering of config space, BAR
emulation, DisINTx etc... you essentially assume that you can reasonably
reliably isolate devices. But in practice, it's chancy. Some devices for
example have "backdoors" into their own config space via MMIO. If I have
such a device in a guest, I can completely override your DisINTx and
thus DOS your host or another guest with a shared interrupt. I can move
my MMIO around and DOS another function by overlapping the addresses.

You can really only protect yourself against a device if you have it
behind a bridge (in addition to having a filtering iommu), which limits
the MMIO span (and thus letting the guest whack the BARs randomly will
only allow that guest to shoot itself in the foot).
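
That windowing is cheap to reason about, too. A sketch (Python; the
register values are invented, but the decode of the 0x20/0x22 memory
base/limit registers is per the PCI-to-PCI bridge spec) of what a
bridge will and won't forward downstream:

```python
def bridge_mem_window(mem_base_reg, mem_limit_reg):
    """Decode a bridge's 32-bit memory window: the upper 12 bits of
    each 16-bit register (config offsets 0x20/0x22) give address bits
    31:20, and the limit is inclusive of its bottom 1MB."""
    base = (mem_base_reg & 0xFFF0) << 16
    limit = ((mem_limit_reg & 0xFFF0) << 16) | 0xFFFFF
    return base, limit

def bridge_forwards(addr, mem_base_reg, mem_limit_reg):
    """A downstream device only ever sees transactions inside the
    window, however badly the guest misprograms its BARs."""
    base, limit = bridge_mem_window(mem_base_reg, mem_limit_reg)
    return base <= addr <= limit
```

So with a bridge in the way, a BAR whacked outside the window simply
stops decoding; the damage stays inside the group.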

Some bridges also provide a way to block INTx below them which comes in
handy but it's bridge specific. Some devices can be coerced to send the
INTx "assert" message and never de-assert it (for example by doing a
soft-reset while it's asserted, which can be done with some devices with
an MMIO).

Anything below a PCIe -> PCI/PCI-X bridge needs to also be "grouped" due
to simple lack of proper filtering by the iommu (PCI-X in theory has
RIDs and forwards them up, but this isn't very reliable, for example it
falls over with split transactions).

Fortunately in PCIe land, we mostly have bridges above everything. The
problem somewhat remains with functions of a device, how can you be sure
that there isn't a way via some MMIO to create side effects on the other
functions of the device ? (For example by checkstopping the whole
thing). You can't really :-)

So it boils down of the "level" of safety/isolation you want to provide,
and I suppose to some extent it's a user decision but the user needs to
be informed to some extent. A hard problem :-)
 
> There's a difference between removing the device from the host and
> exposing the device to the guest.  If I have a NIC and HBA behind a
> bridge, it's perfectly reasonable that I might only assign the NIC to
> the guest, but as you describe, we then need to prevent the host, or any
> other guest from making use of the HBA.

Yes. However the other device is in "limbo" and it may not be clear to
the user why it can't be used anymore :-)

The question is more, the user needs to "know" (or libvirt does, or
somebody ... ) that in order to pass-through device A, it must also
"remove" device B from the host. How can you even provide a meaningful
error message to the user if all VFIO does is give you something like
-EBUSY ?

So the information about the grouping constraint must trickle down
somewhat.

Look at it from a GUI perspective for example. Imagine a front-end
showing you devices in your system and allowing you to "Drag & drop"
them to your guest. How do you represent that need for grouping ? First
how do you expose it from kernel/libvirt to the GUI tool and how do you
represent it to the user ?

By grouping the devices in logical groups which end up being the
"objects" you can drag around, at least you provide some amount of
clarity. Now if you follow that path down to how the GUI app, libvirt
and possibly qemu need to know / resolve the dependency, being given the
"groups" as the primary information of what can be used for pass-through
makes everything a lot simpler.
 
> > - The -minimum- granularity of pass-through is not always a single
> > device and not always under SW control
> 
> But IMHO, we need to preserve the granularity of exposing a device to a
> guest as a single device.  That might mean some devices are held hostage
> by an agent on the host.

Maybe, but wouldn't that be even more confusing from a user
perspective ? And I think it makes it harder from an admin & management
tools implementation perspective too.

> > - Having a magic heuristic in libvirt to figure out those constraints is
> > WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
> > knowledge of PCI resource management and getting it wrong in many many
> > cases, something that took years to fix essentially by ripping it all
> > out. This is kernel knowledge and thus we need the kernel to expose in a
> > way or another what those constraints are, what those "partitionable
> > groups" are.
> > 
> > - That does -not- mean that we cannot specify for each individual device
> > within such a group where we want to put it in qemu (what devfn etc...).
> > As long as there is a clear understanding that the "ownership" of the
> > device goes with the group, this is somewhat orthogonal to how they are
> > represented in qemu. (Not completely... if the iommu is exposed to the
> > guest ,via paravirt for example, some of these constraints must be
> > exposed but I'll talk about that more later).
> 
> Or we can choose not to expose all of the devices in the group to the
> guest?

As I said, I don't mind if you don't, I'm just worried about the
consequences of that from a usability standpoint. Having advanced
command line options to fine tune is fine. Being able to specify within
a "group" which devices to show and at what address is fine.

But I believe the basic entity to be manipulated from an interface
standpoint remains the group.

To get back to my GUI example, once you've D&D your group of devices
over, you can have the option to open that group and check/uncheck
individual devices & assign them addresses if you want. That doesn't
change the fact that practically speaking, the whole group is now owned
by the guest.

I will go further than that actually. If you look at how the isolation
HW works on POWER, the fact that I have the MMIO segmentation means that
I can simply give the entire group's MMIO space to the guest. No problem
of small BARs, no need to slow-map them ... etc.. that's a pretty handy
feature don't you think ?

But that means that those other devices -will- be there, mapped along
with the one you care about. We may not expose them in config space but
they will be accessible. I suppose we can keep their IO/MEM decoding
disabled. But my point is that for all intents and purposes, they're
actually owned by the guest.

> > The interface currently proposed for VFIO (and associated uiommu)
> > doesn't handle that problem at all. Instead, it is entirely centered
> > around a specific "feature" of the VTd iommu's for creating arbitrary
> > domains with arbitrary devices (tho those devices -do- have the same
> > constraints exposed above, don't try to put 2 legacy PCI devices behind
> > the same bridge into 2 different domains !), but the API totally ignores
> > the problem, leaves it to libvirt "magic foo" and focuses on something
> > that is both quite secondary in the grand scheme of things, and quite
> > x86 VTd specific in the implementation and API definition.
> 
> To be fair, libvirt's "magic foo" is built out of the necessity that
> nobody else is defining the rules.

Sure, which is why I propose that the kernel exposes the rules since
it's really the one right place to have that sort of HW constraint
knowledge, especially since it can be partially at least platform
specific.
 
 .../...

> > Maybe something like /sys/devgroups ? This probably warrants involving
> > more kernel people into the discussion.
> 
> I don't yet buy into passing groups to qemu since I don't buy into the
> idea of always exposing all of those devices to qemu.  Would it be
> sufficient to expose iommu nodes in sysfs that link to the devices
> behind them and describe properties and capabilities of the iommu
> itself?  More on this at the end.

Well, iommu aren't the only factor. I mentioned shared interrupts (and
my unwillingness to always trust DisINTx), there's also the MMIO
grouping I mentioned above (in which case it's an x86 -limitation- with
small BARs that I don't want to inherit, especially since it's based on
PAGE_SIZE and we commonly have 64K page size on POWER), etc...

So I'm not too fan of making it entirely look like the iommu is the
primary factor, but we -can-, that would be workable. I still prefer
calling a cat a cat and exposing the grouping for what it is, as I think
I've explained already above, tho. 

 .../...

> > Now some of this can be fixed with tweaks, and we've started doing it
> > (we have a working pass-through using VFIO, forgot to mention that, it's
> > just that we don't like what we had to do to get there).
> 
> This is a result of wanting to support *unmodified* x86 guests.  We
> don't have the luxury of having a predefined pvDMA spec that all x86
> OSes adhere to. 

No but you could emulate a HW iommu no ?

>  The 32bit problem is unfortunate, but the priority use
> case for assigning devices to guests is high performance I/O, which
> usually entails modern, 64bit hardware.  I'd like to see us get to the
> point of having emulated IOMMU hardware on x86, which could then be
> backed by VFIO, but for now guest pinning is the most practical and
> useful.

For your current case maybe. It's just not very future proof imho.
Anyways, it's fixable, but the APIs as they are make it a bit clumsy.

 .../...

> > Also our next generation chipset may drop support for PIO completely.
> > 
> > On the other hand, because PIO is just a special range of MMIO for us,
> > we can do normal pass-through on it and don't need any of the emulation
> > done qemu.
> 
> Maybe we can add mmap support to PIO regions on non-x86.

We have to yes. I haven't looked into it yet, it should be easy if VFIO
kernel side starts using the "proper" PCI mmap interfaces in kernel (the
same interfaces sysfs & proc use).

> >   * MMIO constraints
> > 
> > The QEMU side VFIO code hard wires various constraints that are entirely
> > based on various requirements you decided you have on x86 but don't
> > necessarily apply to us :-)
> > 
> > Due to our paravirt nature, we don't need to masquerade the MSI-X table
> > for example. At all. If the guest configures crap into it, too bad, it
> > can only shoot itself in the foot since the host bridge enforce
> > validation anyways as I explained earlier. Because it's all paravirt, we
> > don't need to "translate" the interrupt vectors & addresses, the guest
> > will call hyercalls to configure things anyways.
> 
> With interrupt remapping, we can allow the guest access to the MSI-X
> table, but since that takes the host out of the loop, there's
> effectively no way for the guest to correctly program it directly by
> itself.

Right, I think what we need here is some kind of capabilities to
"disable" those "features" of qemu vfio.c that aren't needed on our
platform :-) Shouldn't be too hard. We need to make this runtime tho
since different machines can have different "capabilities".

> > We don't need to prevent MMIO pass-through for small BARs at all. This
> > should be some kind of capability or flag passed by the arch. Our
> > segmentation of the MMIO domain means that we can give entire segments
> > to the guest and let it access anything in there (those segments are a
> > multiple of the page size always). Worst case it will access outside of
> > a device BAR within a segment and will cause the PE to go into error
> > state, shooting itself in the foot, there is no risk of side effect
> > outside of the guest boundaries.
> 
> Sure, this could be some kind of capability flag, maybe even implicit in
> certain configurations.

Yup.

> > In fact, we don't even need to emulate BAR sizing etc... in theory. Our
> > paravirt guests expect the BARs to have been already allocated for them
> > by the firmware and will pick up the addresses from the device-tree :-)
> > 
> > Today we use a "hack", putting all 0's in there and triggering the linux
> > code path to reassign unassigned resources (which will use BAR
> > emulation) but that's not what we are -supposed- to do. Not a big deal
> > and having the emulation there won't -hurt- us, it's just that we don't
> > really need any of it.
> > 
> > We have a small issue with ROMs. Our current KVM only works with huge
> > pages for guest memory but that is being fixed. So the way qemu maps the
> > ROM copy into the guest address space doesn't work. It might be handy
> > anyways to have a way for qemu to use MMIO emulation for ROM access as a
> > fallback. I'll look into it.
> 
> So that means ROMs don't work for you on emulated devices either?  The
> reason we read it once and map it into the guest is because Michael
> Tsirkin found a section in the PCI spec that indicates devices can share
> address decoders between BARs and ROM.

Yes, he is correct.

>   This means we can't just leave
> the enabled bit set in the ROM BAR, because it could actually disable an
> address decoder for a regular BAR.  We could slow-map the actual ROM,
> enabling it around each read, but shadowing it seemed far more
> efficient.

Right. We can slow map the ROM, or we can not care :-) At the end of the
day, what is the difference here between a "guest" under qemu and the
real thing bare metal on the machine ? IE. They have the same issue vs.
accessing the ROM. IE. I don't see why qemu should try to make it safe
to access it at any time while it isn't on a real machine. Since VFIO
resets the devices before putting them in guest space, they should be
accessible no ? (Might require a hard reset for some devices tho ... )

In any case, it's not a big deal and we can sort it out, I'm happy to
fallback to slow map to start with and eventually we will support small
pages mappings on POWER anyways, it's a temporary limitation.

> >   * EEH
> > 
> > This is the name of those fancy error handling & isolation features I
> > mentioned earlier. To some extent it's a superset of AER, but we don't
> > generally expose AER to guests (or even the host), it's swallowed by
> > firmware into something else that provides a superset (well mostly) of
> > the AER information, and allow us to do those additional things like
> > isolating/de-isolating, reset control etc...
> > 
> > Here too, we'll need arch specific APIs through VFIO. Not necessarily a
> > huge deal, I mention it for completeness.
> 
> We expect to do AER via the VFIO netlink interface, which even though
> its bashed below, would be quite extensible to supporting different
> kinds of errors.

As could platform specific ioctls :-)

> >    * Misc
> > 
> > There's lots of small bits and pieces... in no special order:
> > 
> >  - netlink ? WTF ! Seriously, we don't need a hybrid API with a bit of
> > netlink and a bit of ioctl's ... it's not like there's something
> > fundamentally  better for netlink vs. ioctl... it really depends what
> > you are doing, and in this case I fail to see what netlink brings you
> > other than bloat and more stupid userspace library deps.
> 
> The netlink interface is primarily for host->guest signaling.  I've only
> implemented the remove command (since we're lacking a pcie-host in qemu
> to do AER), but it seems to work quite well.  If you have suggestions
> for how else we might do it, please let me know.  This seems to be the
> sort of thing netlink is supposed to be used for.

I don't understand what the advantage of netlink is compared to just
extending your existing VFIO ioctl interface, possibly using child
fds as we do for example with spufs. It's not a huge deal, it's just
that netlink has its own gotchas and I don't like multi-headed
interfaces.

> >  - I don't like too much the fact that VFIO provides yet another
> > different API to do what we already have at least 2 kernel APIs for, ie,
> > BAR mapping and config space access. At least it should be better at
> > using the backend infrastructure of the 2 others (sysfs & procfs). I
> > understand it wants to filter in some case (config space) and -maybe-
> > yet another API is the right way to go but allow me to have my doubts.
> 
> The use of PCI sysfs is actually one of my complaints about current
> device assignment.  To do assignment with an unprivileged guest we need
> to open the PCI sysfs config file for it, then change ownership on a
> handful of other PCI sysfs files, then there's this other pci-stub thing
> to maintain ownership, but the kvm ioctls don't actually require it and
> can grab onto any free device...  We are duplicating some of that in
> VFIO, but we also put the ownership of the device behind a single device
> file.  We do have the uiommu problem that we can't give an unprivileged
> user ownership of that, but your usage model may actually make that
> easier.  More below...
> 
> > One thing I thought about but you don't seem to like it ... was to use
> > the need to represent the partitionable entity as groups in sysfs that I
> > talked about earlier. Those could have per-device subdirs with the usual
> > config & resource files, same semantic as the ones in the real device,
> > but when accessed via the group they get filtering. I might or might not
> > be practical in the end, tbd, but it would allow apps using a slightly
> > modified libpci for example to exploit some of this.
> 
> I may be tainted by our disagreement that all the devices in a group
> need to be exposed to the guest and qemu could just take a pointer to a
> sysfs directory.  That seems very unlike qemu and pushes more of the
> policy into qemu, which seems like the wrong direction.

I don't see how it pushes "policy" into qemu.

The "policy" here is imposed by the HW setup and exposed by the
kernel :-) Giving qemu a group means qemu takes "ownership" of that bunch
of devices; so far I don't see what's policy about that. From there, it
would be "handy" for people to just stop there and see all the
devices of the group show up in the guest, but by all means feel free to
suggest a command line interface that allows one to specify more
precisely which of the devices in the group to pass through, and at
what address.

> >  - The qemu vfio code hooks directly into ioapic ... of course that
> > won't fly with anything !x86
> 
> I spent a lot of time looking for an architecture neutral solution here,
> but I don't think it exists.  Please prove me wrong.

No, it doesn't, I agree. That's why it should be some kind of notifier or
function pointer set up by the platform-specific code.

>   The problem is
> that we have to disable INTx on an assigned device after it fires (VFIO
> does this automatically).  If we don't do this, a non-responsive or
> malicious guest could sit on the interrupt, causing it to fire
> repeatedly as a DoS on the host.  The only indication that we can rely
> on to re-enable INTx is when the guest CPU writes an EOI to the APIC.
> We can't just wait for device accesses because a) the device CSRs are
> (hopefully) direct mapped and we'd have to slow map them or attempt to
> do some kind of dirty logging to detect when they're accesses b) what
> constitutes an interrupt service is device specific.
> 
> That means we need to figure out how PCI interrupt 'A' (or B...)
> translates to a GSI (Global System Interrupt - ACPI definition, but
> hopefully a generic concept).  That GSI identifies a pin on an IOAPIC,
> which will also see the APIC EOI.  And just to spice things up, the
> guest can change the PCI to GSI mappings via ACPI.  I think the set of
> callbacks I've added are generic (maybe I left ioapic in the name), but
> yes they do need to be implemented for other architectures.  Patches
> appreciated from those with knowledge of the systems and/or access to
> device specs.  This is the only reason that I make QEMU VFIO only build
> for x86.

Right, and we need to cook a similar sauce for POWER. It's an area that
has to be arch specific (and in fact specific to the particular HW machine
being emulated), so we just need to find the cleanest way for the
platform to "register" the right callbacks here.

Not a big deal, I just felt like mentioning it :-)

> >  - The various "objects" dealt with here, -especially- interrupts and
> > iommu, need a better in-kernel API so that fast in-kernel emulation can
> > take over from qemu based emulation. The way we need to do some of this
> > on POWER differs from x86. We can elaborate later, it's not necessarily
> > a killer either but essentially we'll take the bulk of interrupt
> > handling away from VFIO to the point where it won't see any of it at
> > all.
> 
> The plan for x86 is to connect VFIO eventfds directly to KVM irqfds and
> bypass QEMU.  This is exactly what VHOST does today and fairly trivial
> to enable for MSI once we get it merged.  INTx would require us to be
> able to define a level triggered irqfd in KVM and it's not yet clear if
> we care that much about INTx performance.

I care enough because our exit cost to qemu is much higher than x86, and
I can pretty easily emulate my PIC entirely in real mode (from within
the guest context) which is what I intend to do :-)

On the other hand, I have no reason to treat MSI or LSI differently, so
all I really need is to get back to the underlying platform HW interrupt
number, and I think I can do that. So as long as I have a hook to know
what's there and what has been enabled, these interrupts will simply
cease to be visible to either qemu or vfio.

Another reason why I don't like allowing shared interrupts in different
guests with DisINTx :-) Because that means that such interrupts would
have to go back all the way to qemu/vfio :-) But I can always have a
fallback there; it's really the problem of "trusting" DisINTx that
concerns me.

> We don't currently have a plan for accelerating IOMMU access since our
> current usage model doesn't need one.  We also need to consider MSI-X
> table acceleration for x86.  I hope we'll be able to use the new KVM
> ioctls for this.

Ok, we can give direct access to the MSI-X table to the guest on power
so that isn't an issue for us.

> Thanks for the write up, I think it will be good to let everyone digest
> it before we discuss this at KVM forum.

Agreed. As I think I may have mentioned already, I won't be able to make
it to the forum, but Paulus will, and I'll be in a nearby timezone, so I
might be able to join a call if it's deemed useful.

> Rather than your "groups" idea, I've been mulling over whether we can
> just expose the dependencies, configuration, and capabilities in sysfs
> and build qemu commandlines to describe it.  For instance, if we simply
> start with creating iommu nodes in sysfs, we could create links under
> each iommu directory to the devices behind them.  Some kind of
> capability file could define properties like whether it's page table
> based or fixed iova window or the granularity of mapping the devices
> behind it.  Once we have that, we could probably make uiommu attach to
> each of those nodes.

Well, s/iommu/groups and you are pretty close to my original idea :-)

I don't mind that much what the details are, but I like the idea of not
having to construct a 3-page command line every time I want to
pass through a device; most "simple" usage scenarios don't care that
much.
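
For illustration, such a sysfs layout might look like the following; these
paths, links and file names are purely hypothetical, nothing like this
exists today:

```
/sys/class/iommu/iommu7/
    capability            # e.g. "pagetable" vs "fixed-window", granularity
    devices/
        0000:01:00.0 -> ../../../devices/pci0000:00/0000:00:1c.0/0000:01:00.0
        0000:01:00.1 -> ../../../devices/pci0000:00/0000:00:1c.0/0000:01:00.1
```

Management code would then only need to readlink the devices/ directory to
know what travels together, and read capability to know what kind of
mapping the group supports.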

> That means we know /dev/uiommu7 (random example) is our access to a
> specific iommu with a given set of devices behind it.

Linking those sysfs iommus or groups to a /dev/ entry is fine by me.
  
>   If that iommu is
> a PE (via those capability files), then a user space entity (trying hard
> not to call it libvirt) can unbind all those devices from the host,
> maybe bind the ones it wants to assign to a guest to vfio and bind the
> others to pci-stub for safe keeping.  If you trust a user with
> everything in a PE, bind all the devices to VFIO, chown all
> the /dev/vfioX entries for those devices, and the /dev/uiommuX device.
>
> We might then come up with qemu command lines to describe interesting
> configurations, such as:
> 
> -device iommu,model=PAPR,uiommu=/dev/uiommu7,id=iommu.0 \
> -device pci-bus,...,iommu=iommu0,id=pci.0 \
> -device vfio,host=ssss:bb:dd.f,bus=pci.0,addr=dd.f,id=hostdev0
> 
> The userspace entity would obviously need to put things in the same PE
> in the right place, but it doesn't seem to take a lot of sysfs info to
> get that right.
> 
> Today we do DMA mapping via the VFIO device because the capabilities of
> the IOMMU domains change depending on which devices are connected (for
> VT-d, the least common denominator of the IOMMUs in play).  Forcing the
> DMA mappings through VFIO naturally forces the call order.  If we moved
> to something like above, we could switch the DMA mapping to the uiommu
> device, since the IOMMU would have fixed capabilities.

That makes sense.
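
For reference, the unbind/bind dance Alex describes can already be done
with today's PCI sysfs driver bind interface; an illustrative session
(device addresses and the /dev names are made up, and pci-stub needs the
device ID added via new_id before the bind will succeed):

```
# Park the HBA on pci-stub so nothing on the host touches it while
# its PE sibling is assigned:
echo 0000:01:00.1 > /sys/bus/pci/devices/0000:01:00.1/driver/unbind
echo 0000:01:00.1 > /sys/bus/pci/drivers/pci-stub/bind

# Give the NIC to vfio and hand the group's nodes to the user:
echo 0000:01:00.0 > /sys/bus/pci/devices/0000:01:00.0/driver/unbind
echo 0000:01:00.0 > /sys/bus/pci/drivers/vfio/bind
chown user /dev/vfio0 /dev/uiommu7
```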

> What gaps would something like this leave for your IOMMU granularity
> problems?  I'll need to think through how it works when we don't want to
> expose the iommu to the guest, maybe a model=none (default) that doesn't
> need to be connected to a pci bus and maps all guest memory.  Thanks,

Well, I would map those "iommus" to PEs, so what remains is the path to
put in place all the "other" bits and pieces, such as informing qemu of
the location and size of the MMIO segment(s) (so we can map the whole
thing and not bother with individual BARs), etc...

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-07-29 23:58 ` Benjamin Herrenschmidt
@ 2011-07-31 14:09   ` Avi Kivity
  -1 siblings, 0 replies; 322+ messages in thread
From: Avi Kivity @ 2011-07-31 14:09 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: kvm, Anthony Liguori, Alex Williamson, David Gibson,
	Paul Mackerras, Alexey Kardashevskiy, linux-pci, linuxppc-dev

On 07/30/2011 02:58 AM, Benjamin Herrenschmidt wrote:
> - Having a magic heuristic in libvirt to figure out those constraints is
> WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
> knowledge of PCI resource management and getting it wrong in many many
> cases, something that took years to fix essentially by ripping it all
> out. This is kernel knowledge and thus we need the kernel to expose in a
> way or another what those constraints are, what those "partitionable
> groups" are.

How about a sysfs entry partition=<partition-id>? Then libvirt knows not 
to assign devices from the same partition to different guests (and not 
to let the host play with them, either).

> The interface currently proposed for VFIO (and associated uiommu)
> doesn't handle that problem at all. Instead, it is entirely centered
> around a specific "feature" of the VTd iommu's for creating arbitrary
> domains with arbitrary devices (tho those devices -do- have the same
> constraints exposed above, don't try to put 2 legacy PCI devices behind
> the same bridge into 2 different domains !), but the API totally ignores
> the problem, leaves it to libvirt "magic foo" and focuses on something
> that is both quite secondary in the grand scheme of things, and quite
> x86 VTd specific in the implementation and API definition.
>
> Now, I'm not saying these programmable iommu domains aren't a nice
> feature and that we shouldn't exploit them when available, but as it is,
> it is too much a central part of the API.

I have a feeling you'll be getting the same capabilities sooner or 
later, or you won't be able to make use of S/R IOV VFs.  While we should 
support the older hardware, the interfaces should be designed with the 
newer hardware in mind.

> My main point is that I don't want the "knowledge" here to be in libvirt
> or qemu. In fact, I want to be able to do something as simple as passing
> a reference to a PE to qemu (sysfs path ?) and have it just pickup all
> the devices in there and expose them to the guest.

Such magic is nice for a developer playing with qemu but in general less 
useful for a managed system where the various cards need to be exposed 
to the user interface anyway.

> * IOMMU
>
> Now more on iommu. I've described I think in enough details how ours
> work, there are others, I don't know what freescale or ARM are doing,
> sparc doesn't quite work like VTd either, etc...
>
> The main problem isn't that much the mechanics of the iommu but really
> how it's exposed (or not) to guests.
>
> VFIO here is basically designed for one and only one thing: expose the
> entire guest physical address space to the device more/less 1:1.

A single level iommu cannot be exposed to guests.  Well, it can be 
exposed as an iommu that does not provide per-device mapping.

A two level iommu can be emulated and exposed to the guest.  See 
http://support.amd.com/us/Processor_TechDocs/48882.pdf for an example.

> This means:
>
>    - It only works with iommu's that provide complete DMA address spaces
> to devices. Won't work with a single 'segmented' address space like we
> have on POWER.
>
>    - It requires the guest to be pinned. Pass-through ->  no more swap

Newer iommus (and devices, unfortunately) (will) support I/O page faults 
and then the requirement can be removed.

>    - The guest cannot make use of the iommu to deal with 32-bit DMA
> devices; thus with a guest that has more than a few G of RAM (I don't
> know the exact limit on x86, depends on your IO hole I suppose), you end
> up back at swiotlb & bounce buffering.

Is this a problem in practice?

>    - It doesn't work for POWER server anyways because of our need to
> provide a paravirt iommu interface to the guest since that's how pHyp
> works today and how existing OSes expect to operate.

Then you need to provide that same interface, and implement it using the 
real iommu.

> - Performance sucks of course, the vfio map ioctl wasn't meant for that
> and has quite a bit of overhead. However we'll want to do the paravirt
> call directly in the kernel eventually ...

Does the guest iomap each request?  Why?

Emulating the iommu in the kernel is of course the way to go if that's 
the case, still won't performance suck even then?

> The QEMU side VFIO code hard wires various constraints that are entirely
> based on various requirements you decided you have on x86 but don't
> necessarily apply to us :-)
>
> Due to our paravirt nature, we don't need to masquerade the MSI-X table
> for example. At all. If the guest configures crap into it, too bad, it
> can only shoot itself in the foot since the host bridge enforces
> validation anyways as I explained earlier. Because it's all paravirt, we
> don't need to "translate" the interrupt vectors & addresses; the guest
> will call hypercalls to configure things anyways.

So, you have interrupt redirection?  That is, MSI-x table values encode 
the vcpu, not pcpu?

Alex, with interrupt redirection, we can skip this as well?  Perhaps 
only if the guest enables interrupt redirection?

If so, it's not arch specific, it's interrupt redirection specific.

> We don't need to prevent MMIO pass-through for small BARs at all. This
> should be some kind of capability or flag passed by the arch. Our
> segmentation of the MMIO domain means that we can give entire segments
> to the guest and let it access anything in there (those segments are a
> multiple of the page size always). Worst case it will access outside of
> a device BAR within a segment and will cause the PE to go into error
> state, shooting itself in the foot, there is no risk of side effect
> outside of the guest boundaries.

Does the BAR value contain the segment base address?  Or is that added 
later?


-- 
error compiling committee.c: too many arguments to function


* Re: kvm PCI assignment & VFIO ramblings
  2011-07-29 23:58 ` Benjamin Herrenschmidt
                   ` (3 preceding siblings ...)
  (?)
@ 2011-08-01  2:48 ` David Gibson
  -1 siblings, 0 replies; 322+ messages in thread
From: David Gibson @ 2011-08-01  2:48 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	Alex Williamson, Anthony Liguori, linuxppc-dev

On Sat, Jul 30, 2011 at 09:58:53AM +1000, Benjamin Herrenschmidt wrote:
[snip]
> That current hack won't work well if two devices share an iommu. Note
> that we have an additional constraint here due to our paravirt
> interfaces (specified in PAPR) which is that PE domains must have a
> common parent. Basically, pHyp makes them look like a PCIe host bridge
> per domain in the guest. I think that's a pretty good idea and qemu
> might want to do the same.
> 
> - We hack out the currently unconditional mapping of the entire guest
> space in the iommu. Something will have to be done to "decide" whether
> to do that or not ... qemu argument -> ioctl ?

Not quite.  We already require the not-yet-upstream patches which add
guest-side (emulated) IOMMU support to qemu.  The approach we're using
for the passthrough (or at least will, when I fix up my patches again)
is that we map all guest ram into the vfio iommu if and only if
there is no guest-visible iommu advertised in the qdev.

This kind of makes sense - if there is no iommu from the guest
perspective, the guest will expect to see all its physical memory 1:1
in DMA.

The hacky bit is that when there *is* a guest-visible iommu, it's
assumed that whatever interface the guest iommu uses is somehow wired
up to vfio map/unmap calls.  For us at the moment, this means
passthrough devices must be assigned to a special (guest) pci
domain which wires the paravirt iommu up to the vfio iommu.

In theory, under some circumstances, with full emu, you could wire up
an emulated guest iommu interface to a different host iommu
implementation via this mechanism.  However, that wouldn't work if the
guest and host iommus' capabilities are too different, and in any case
it would require considerable extra abstraction work on the qemu guest
iommu code.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: kvm PCI assignment & VFIO ramblings
  2011-07-30 22:21   ` Benjamin Herrenschmidt
@ 2011-08-01 16:40     ` Alex Williamson
  -1 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-08-01 16:40 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: kvm, Anthony Liguori, David Gibson, Paul Mackerras,
	Alexey Kardashevskiy, linux-pci, linuxppc-dev

On Sun, 2011-07-31 at 08:21 +1000, Benjamin Herrenschmidt wrote:
> On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote:
> > Hi folks !
> > 
> > So I promised Anthony I would try to summarize some of the comments &
> > issues we have vs. VFIO after we've tried to use it for PCI pass-through
> > on POWER. It's pretty long, there are various items with more or less
> > impact, some of it is easily fixable, some are API issues, and we'll
> > probably want to discuss them separately, but for now here's a brain
> > dump.
> > 
> > David, Alexei, please make sure I haven't missed anything :-)
> 
> And I think I have :-)
> 
>   * Config space
> 
> VFIO currently handles that as a byte stream. It's quite gross to be
> honest and it's not right. You shouldn't lose access size information
> between guest and host when performing real accesses.
> 
> Some config space registers can have side effects and not respecting
> access sizes can be nasty.

It's a bug, let's fix it.

Alex


* Re: kvm PCI assignment & VFIO ramblings
  2011-07-30 23:54     ` Benjamin Herrenschmidt
  (?)
@ 2011-08-01 18:59       ` Alex Williamson
  -1 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-08-01 18:59 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: kvm, Anthony Liguori, David Gibson, Paul Mackerras,
	Alexey Kardashevskiy, linux-pci, linuxppc-dev, iommu, benve,
	aafabbri, chrisw, qemu-devel

On Sun, 2011-07-31 at 09:54 +1000, Benjamin Herrenschmidt wrote:
> On Sat, 2011-07-30 at 12:20 -0600, Alex Williamson wrote:
> 
> > On x86, the USB controllers don't typically live behind a PCIe-to-PCI
> > bridge, so don't suffer the source identifier problem, but they do often
> > share an interrupt.  But even then, we can count on most modern devices
> > supporting PCI2.3, and thus the DisINTx feature, which allows us to
> > share interrupts.  In any case, yes, it's more rare but we need to know
> > how to handle devices behind PCI bridges.  However I disagree that we
> > need to assign all the devices behind such a bridge to the guest.
> 
> Well, ok so let's dig a bit more here :-) First, yes I agree they don't
> all need to appear to the guest. My point is really that we must prevent
> them to be "used" by somebody else, either host or another guest.
> 
> Now once you get there, I personally prefer having a clear "group"
> ownership rather than having devices stay in some "limbo" under vfio
> control but it's an implementation detail.
> 
> Regarding DisINTx, well, it's a bit like putting separate PCIe functions
> into separate guests: it looks good ... but you are taking a chance.
> Note that I do intend to do some of that for power ... well, I think; I
> haven't completely made up my mind.
> 
> pHyp has a stricter requirement: PEs essentially are everything
> behind a bridge. If you have a slot, you have some kind of bridge above
> this slot and everything on it will be a PE.
> 
> The problem I see is that with your filtering of config space, BAR
> emulation, DisINTx etc... you essentially assume that you can reasonably
> reliably isolate devices. But in practice, it's chancy. Some devices for
> example have "backdoors" into their own config space via MMIO. If I have
> such a device in a guest, I can completely override your DisINTx and
> thus DOS your host or another guest with a shared interrupt. I can move
> my MMIO around and DOS another function by overlapping the addresses.
> 
> You can really only protect yourself against a device if you have it
> behind a bridge (in addition to having a filtering iommu), which limits
> the MMIO span (and thus letting the guest whack the BARs randomly will
> only allow that guest to shoot itself in the foot).
> 
> Some bridges also provide a way to block INTx below them which comes in
> handy but it's bridge specific. Some devices can be coerced to send the
> INTx "assert" message and never de-assert it (for example by doing a
> soft-reset while it's asserted, which can be done with some devices with
> an MMIO).
> 
> Anything below a PCIe -> PCI/PCI-X bridge also needs to be "grouped" due
> to simple lack of proper filtering by the iommu (PCI-X in theory has RIDs
> and forwards them up, but this isn't very reliable; for example it falls
> over with split transactions).
> 
> Fortunately in PCIe land, we mostly have bridges above everything. The
> problem somewhat remains with functions of a device: how can you be sure
> that there isn't a way via some MMIO to create side effects on the other
> functions of the device ? (For example by checkstopping the whole
> thing.) You can't really :-)
> 
> So it boils down of the "level" of safety/isolation you want to provide,
> and I suppose to some extent it's a user decision but the user needs to
> be informed to some extent. A hard problem :-)
>  
> > There's a difference between removing the device from the host and
> > exposing the device to the guest.  If I have a NIC and HBA behind a
> > bridge, it's perfectly reasonable that I might only assign the NIC to
> > the guest, but as you describe, we then need to prevent the host, or any
> > other guest from making use of the HBA.
> 
> Yes. However the other device is in "limbo" and it may not be clear to
> the user why it can't be used anymore :-)
> 
> The question is more, the user needs to "know" (or libvirt does, or
> somebody ... ) that in order to pass-through device A, it must also
> "remove" device B from the host. How can you even provide a meaningful
> error message to the user if all VFIO does is give you something like
> -EBUSY ?
> 
> So the information about the grouping constraint must trickle down
> somewhat.
> 
> Look at it from a GUI perspective for example. Imagine a front-end
> showing you devices in your system and allowing you to "Drag & drop"
> them to your guest. How do you represent that need for grouping ? First
> how do you expose it from kernel/libvirt to the GUI tool and how do you
> represent it to the user ?
> 
> By grouping the devices in logical groups which end up being the
> "objects" you can drag around, at least you provide some amount of
> clarity. Now if you follow that path down to how the GUI app, libvirt
> and possibly qemu need to know / resolve the dependency, being given the
> "groups" as the primary information of what can be used for pass-through
> makes everything a lot simpler.
>  
> > > - The -minimum- granularity of pass-through is not always a single
> > > device and not always under SW control
> > 
> > But IMHO, we need to preserve the granularity of exposing a device to a
> > guest as a single device.  That might mean some devices are held hostage
> > by an agent on the host.
> 
> Maybe but wouldn't that be even more confusing from a user perspective ?
> And I think it makes it harder from an implementation perspective for
> admin & management tools too.
> 
> > > - Having a magic heuristic in libvirt to figure out those constraints is
> > > WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
> > > knowledge of PCI resource management and getting it wrong in many many
> > > cases, something that took years to fix essentially by ripping it all
> > > out. This is kernel knowledge and thus we need the kernel to expose in a
> > > way or another what those constraints are, what those "partitionable
> > > groups" are.
> > > 
> > > - That does -not- mean that we cannot specify for each individual device
> > > within such a group where we want to put it in qemu (what devfn etc...).
> > > As long as there is a clear understanding that the "ownership" of the
> > > device goes with the group, this is somewhat orthogonal to how they are
> > > represented in qemu. (Not completely... if the iommu is exposed to the
> > > guest, via paravirt for example, some of these constraints must be
> > > exposed but I'll talk about that more later).
> > 
> > Or we can choose not to expose all of the devices in the group to the
> > guest?
> 
> As I said, I don't mind if you don't, I'm just worried about the
> consequences of that from a usability standpoint. Having advanced
> command line options to fine tune is fine. Being able to specify within a
> "group" which devices to show and at what address is fine.
> 
> But I believe the basic entity to be manipulated from an interface
> standpoint remains the group.
> 
> To get back to my GUI example, once you've D&D your group of devices
> over, you can have the option to open that group and check/uncheck
> individual devices & assign them addresses if you want. That doesn't
> change the fact that practically speaking, the whole group is now owned
> by the guest.
> 
> I will go further than that actually. If you look at how the isolation
> HW works on POWER, the fact that I have the MMIO segmentation means that
> I can simply give the entire group MMIO space to the guest. No problem
> of small BARs, no need to slow-map them ... etc.. that's a pretty handy
> feature don't you think ?
> 
> But that means that those other devices -will- be there, mapped along
> with the one you care about. We may not expose it in config space but it
> will be accessible. I suppose we can keep its IO/MEM decoding disabled.
> But my point is that for all intents and purposes, it's actually owned by
> the guest.
> 
> > > The interface currently proposed for VFIO (and associated uiommu)
> > > doesn't handle that problem at all. Instead, it is entirely centered
> > > around a specific "feature" of the VTd iommu's for creating arbitrary
> > > domains with arbitrary devices (tho those devices -do- have the same
> > > constraints exposed above, don't try to put 2 legacy PCI devices behind
> > > the same bridge into 2 different domains !), but the API totally ignores
> > > the problem, leaves it to libvirt "magic foo" and focuses on something
> > > that is both quite secondary in the grand scheme of things, and quite
> > > x86 VTd specific in the implementation and API definition.
> > 
> > To be fair, libvirt's "magic foo" is built out of the necessity that
> > nobody else is defining the rules.
> 
> Sure, which is why I propose that the kernel exposes the rules since
> it's really the one right place to have that sort of HW constraint
> knowledge, especially since it can be partially at least platform
> specific.
>  
>  .../...

I'll try to consolidate my reply to all the above here because there are
too many places above to interject and make this thread even more
difficult to respond to.  Much of what you're discussing above comes
down to policy.  Do we trust DisINTx?  Do we trust multi-function
devices?  I have no doubt there are devices we can use as examples for
each behaving badly.  On x86 this is one of the reasons we have SR-IOV.
Besides splitting a single device into multiple, it makes sure each
device is actually virtualization friendly.  POWER seems to add
multiple layers of hardware so that you don't actually have to trust the
device, which is a great value add for enterprise systems, but in doing
so it mostly defeats the purpose and functionality of SR-IOV.

How we present this in a GUI is largely irrelevant because something has
to create a superset of what the hardware dictates (can I uniquely
identify transactions from this device, can I protect other devices from
it, etc.), the system policy (do I trust DisINTx, do I trust function
isolation, do I require ACS) and mold that with what the user actually
wants to assign.  For the VFIO kernel interface, we should only be
concerned with the first problem.  Userspace is free to make the rest as
simple or complete as it cares to.  I argue for x86, we want device
level granularity of assignment, but that also tends to be the typical
case (when only factoring in hardware restrictions) due to our advanced
iommus.
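As a toy illustration of the hardware-dictated half of that superset (not any real kernel interface), a minimal group computation just merges any two devices that share an upstream bridge or an interrupt line; the policy layer (trusting DisINTx, requiring ACS) would then merge groups further:

```c
#include <assert.h>

/* Toy sketch (not kernel code): compute minimal assignment groups by
 * merging any two devices that share an upstream bridge or an IRQ line.
 * Assumes n <= NDEV.  Policy filters would merge further. */
#define NDEV 8

static int parent[NDEV];

static int find(int x)
{
    while (parent[x] != x)
        x = parent[x] = parent[parent[x]];  /* path halving */
    return x;
}

static void unite(int a, int b)
{
    parent[find(a)] = find(b);
}

/* bridge[i]: upstream bridge id (-1 if directly on root); irq[i]: line */
void build_groups(const int *bridge, const int *irq, int n)
{
    for (int i = 0; i < n; i++)
        parent[i] = i;
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++) {
            if (bridge[i] >= 0 && bridge[i] == bridge[j])
                unite(i, j);        /* no per-device RID isolation    */
            if (irq[i] == irq[j])
                unite(i, j);        /* shared INTx, DisINTx untrusted */
        }
}

int same_group(int a, int b)
{
    return find(a) == find(b);
}
```

Userspace would only see the resulting partition; how much of each group it then exposes to the guest is the policy question above.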

> > > Maybe something like /sys/devgroups ? This probably warrants involving
> > > more kernel people into the discussion.
> > 
> > I don't yet buy into passing groups to qemu since I don't buy into the
> > idea of always exposing all of those devices to qemu.  Would it be
> > sufficient to expose iommu nodes in sysfs that link to the devices
> > behind them and describe properties and capabilities of the iommu
> > itself?  More on this at the end.
> 
> Well, iommu aren't the only factor. I mentioned shared interrupts (and
> my unwillingness to always trust DisINTx),

*userspace policy*

>  there's also the MMIO
> grouping I mentioned above (in which case it's an x86 -limitation- with
> small BARs that I don't want to inherit, especially since it's based on
> PAGE_SIZE and we commonly have 64K page size on POWER), etc...

But isn't MMIO grouping effectively *at* the iommu?

> So I'm not too much of a fan of making it entirely look like the iommu is the
> primary factor, but we -can-, that would be workable. I still prefer
> calling a cat a cat and exposing the grouping for what it is, as I think
> I've explained already above, tho. 

The trouble is the "group" analogy is more fitting to a partitionable
system, whereas on x86 we can really mix-n-match devices across iommus
fairly easily.  The iommu seems to be the common point to describe these
differences.

> > > Now some of this can be fixed with tweaks, and we've started doing it
> > > (we have a working pass-through using VFIO, forgot to mention that, it's
> > > just that we don't like what we had to do to get there).
> > 
> > This is a result of wanting to support *unmodified* x86 guests.  We
> > don't have the luxury of having a predefined pvDMA spec that all x86
> > OSes adhere to. 
> 
> No but you could emulate a HW iommu no ?

We can, but then we have to worry about supporting legacy, proprietary
OSes that may not have support or may make use of it differently.  As
Avi mentions, hardware is coming that eases the "pin the whole guest"
requirement and we may implement emulated iommus for the benefit of some
guests.

> >  The 32bit problem is unfortunate, but the priority use
> > case for assigning devices to guests is high performance I/O, which
> > usually entails modern, 64bit hardware.  I'd like to see us get to the
> > point of having emulated IOMMU hardware on x86, which could then be
> > backed by VFIO, but for now guest pinning is the most practical and
> > useful.
> 
> For your current case maybe. It's just not very future proof imho.
> Anyways, it's fixable, but the APIs as they are make it a bit clumsy.

You expect more 32bit devices in the future?

> > > Also our next generation chipset may drop support for PIO completely.
> > > 
> > > On the other hand, because PIO is just a special range of MMIO for us,
> > > we can do normal pass-through on it and don't need any of the emulation
> > > done qemu.
> > 
> > Maybe we can add mmap support to PIO regions on non-x86.
> 
> We have to yes. I haven't looked into it yet, it should be easy if VFIO
> kernel side starts using the "proper" PCI mmap interfaces in kernel (the
> same interfaces sysfs & proc use).

Patches welcome.

> > >   * MMIO constraints
> > > 
> > > The QEMU side VFIO code hard wires various constraints that are entirely
> > > based on various requirements you decided you have on x86 but don't
> > > necessarily apply to us :-)
> > > 
> > > Due to our paravirt nature, we don't need to masquerade the MSI-X table
> > > for example. At all. If the guest configures crap into it, too bad, it
> > > can only shoot itself in the foot since the host bridge enforce
> > > validation anyways as I explained earlier. Because it's all paravirt, we
> > > don't need to "translate" the interrupt vectors & addresses, the guest
> > > will call hypercalls to configure things anyways.
> > 
> > With interrupt remapping, we can allow the guest access to the MSI-X
> > table, but since that takes the host out of the loop, there's
> > effectively no way for the guest to correctly program it directly by
> > itself.
> 
> Right, I think what we need here is some kind of capabilities to
> "disable" those "features" of qemu vfio.c that aren't needed on our
> platform :-) Shouldn't be too hard. We need to make this runtime tho
> since different machines can have different "capabilities".

Sure, we'll probably eventually want a switch to push the MSI-X table to
KVM when it's available.
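For reference, the per-vector layout of the MSI-X table being discussed (trapped today, possibly pushed to KVM later) is fixed by the PCI spec at 16 bytes per entry; a minimal sketch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* MSI-X table entry layout per the PCI spec: 16 bytes per vector. */
struct msix_entry {
    uint32_t addr_lo;   /* Message Address       (offset 0x0) */
    uint32_t addr_hi;   /* Message Upper Address (offset 0x4) */
    uint32_t data;      /* Message Data          (offset 0x8) */
    uint32_t ctrl;      /* Vector Control        (offset 0xC) */
};

#define MSIX_ENTRY_SIZE   16
#define MSIX_CTRL_MASKBIT 0x1u   /* bit 0 of Vector Control masks the vector */

static inline int msix_masked(const struct msix_entry *e)
{
    return e->ctrl & MSIX_CTRL_MASKBIT;
}
```

On x86 the host rewrites addr/data with remappable values; on the paravirt POWER setup described above the guest's writes can go through unfiltered since the bridge validates them anyway.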

> > > We don't need to prevent MMIO pass-through for small BARs at all. This
> > > should be some kind of capability or flag passed by the arch. Our
> > > segmentation of the MMIO domain means that we can give entire segments
> > > to the guest and let it access anything in there (those segments are a
> > > multiple of the page size always). Worst case it will access outside of
> > > a device BAR within a segment and will cause the PE to go into error
> > > state, shooting itself in the foot, there is no risk of side effect
> > > outside of the guest boundaries.
> > 
> > Sure, this could be some kind of capability flag, maybe even implicit in
> > certain configurations.
> 
> Yup.
> 
> > > In fact, we don't even need to emulate BAR sizing etc... in theory. Our
> > > paravirt guests expect the BARs to have been already allocated for them
> > > by the firmware and will pick up the addresses from the device-tree :-)
> > > 
> > > Today we use a "hack", putting all 0's in there and triggering the linux
> > > code path to reassign unassigned resources (which will use BAR
> > > emulation) but that's not what we are -supposed- to do. Not a big deal
> > > and having the emulation there won't -hurt- us, it's just that we don't
> > > really need any of it.
> > > 
> > > We have a small issue with ROMs. Our current KVM only works with huge
> > > pages for guest memory but that is being fixed. So the way qemu maps the
> > > ROM copy into the guest address space doesn't work. It might be handy
> > > anyways to have a way for qemu to use MMIO emulation for ROM access as a
> > > fallback. I'll look into it.
> > 
> > So that means ROMs don't work for you on emulated devices either?  The
> > reason we read it once and map it into the guest is because Michael
> > Tsirkin found a section in the PCI spec that indicates devices can share
> > address decoders between BARs and ROM.
> 
> Yes, he is correct.
> 
> >   This means we can't just leave
> > the enabled bit set in the ROM BAR, because it could actually disable an
> > address decoder for a regular BAR.  We could slow-map the actual ROM,
> > enabling it around each read, but shadowing it seemed far more
> > efficient.
> 
> Right. We can slow map the ROM, or we can not care :-) At the end of the
> day, what is the difference here between a "guest" under qemu and the
> real thing bare metal on the machine ? IE. They have the same issue vs.
> accessing the ROM. IE. I don't see why qemu should try to make it safe
> to access it at any time while it isn't on a real machine. Since VFIO
> resets the devices before putting them in guest space, they should be
> accessible no ? (Might require a hard reset for some devices tho ... )

My primary motivator for doing the ROM the way it's done today is that I
get to push all the ROM handling off to QEMU core PCI code.  The ROM for
an assigned device is handled exactly like the ROM for an emulated
device except it might be generated by reading it from the hardware.
This gives us the benefit of things like rombar=0 if I want to hide the
ROM or romfile=<file> if I want to load an ipxe image for a device that
may not even have a physical ROM.  Not to mention I don't have to
special case ROM handling routines in VFIO.  So it actually has little
to do w/ making it safe to access the ROM at any time.

> In any case, it's not a big deal and we can sort it out, I'm happy to
> fallback to slow map to start with and eventually we will support small
> pages mappings on POWER anyways, it's a temporary limitation.

Perhaps this could also be fixed in the generic QEMU PCI ROM support so
it works for emulated devices too... code reuse paying off already ;)
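For what it's worth, the shadowing approach (as opposed to slow-mapping with the enable bit toggled around every access) can be sketched against a simulated device like this; the helpers are hypothetical, not the actual QEMU/VFIO code:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch of ROM shadowing on a simulated device: the ROM BAR enable bit
 * may share an address decoder with a regular BAR (per the spec note
 * above), so decode is enabled only around the one-time copy. */
#define ROM_BAR_ENABLE 0x1u

struct fake_dev {
    uint32_t rom_bar;   /* ROM BAR register, bit 0 = decode enable */
    uint8_t  rom[64];   /* device-side ROM contents */
};

/* Reads fail unless decode is enabled, as on real hardware. */
static int rom_read(struct fake_dev *d, uint8_t *buf, size_t len)
{
    if (!(d->rom_bar & ROM_BAR_ENABLE))
        return -1;
    memcpy(buf, d->rom, len);
    return 0;
}

/* Shadow the ROM once; leave the decode bit as we found it. */
int shadow_rom(struct fake_dev *d, uint8_t *shadow, size_t len)
{
    uint32_t saved = d->rom_bar;
    int ret;

    d->rom_bar |= ROM_BAR_ENABLE;
    ret = rom_read(d, shadow, len);
    d->rom_bar = saved;
    return ret;
}
```

After this, the shadow buffer can be handed to the generic ROM code like any emulated device's ROM, which is the reuse argument above.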

> > >   * EEH
> > > 
> > > This is the name of those fancy error handling & isolation features I
> > > mentioned earlier. To some extent it's a superset of AER, but we don't
> > > generally expose AER to guests (or even the host), it's swallowed by
> > > firmware into something else that provides a superset (well mostly) of
> > > the AER information, and allow us to do those additional things like
> > > isolating/de-isolating, reset control etc...
> > > 
> > > Here too, we'll need arch specific APIs through VFIO. Not necessarily a
> > > huge deal, I mention it for completeness.
> > 
> > We expect to do AER via the VFIO netlink interface, which even though
> > its bashed below, would be quite extensible to supporting different
> > kinds of errors.
> 
> As could platform specific ioctls :-)

Is qemu going to poll for errors?

> > >    * Misc
> > > 
> > > There's lots of small bits and pieces... in no special order:
> > > 
> > >  - netlink ? WTF ! Seriously, we don't need a hybrid API with a bit of
> > > netlink and a bit of ioctl's ... it's not like there's something
> > > fundamentally  better for netlink vs. ioctl... it really depends what
> > > you are doing, and in this case I fail to see what netlink brings you
> > > other than bloat and more stupid userspace library deps.
> > 
> > The netlink interface is primarily for host->guest signaling.  I've only
> > implemented the remove command (since we're lacking a pcie-host in qemu
> > to do AER), but it seems to work quite well.  If you have suggestions
> > for how else we might do it, please let me know.  This seems to be the
> > sort of thing netlink is supposed to be used for.
> 
> I don't understand what the advantage of netlink is compared to just
> extending your existing VFIO ioctl interface, possibly using children
> fd's as we do for example with spufs but it's not a huge deal. It just
> that netlink has its own gotchas and I don't like multi-headed
> interfaces.

We could do yet another eventfd that triggers the VFIO user to go call
an ioctl to see what happened, but then we're locked into an ioctl
interface for something that we may want to more easily extend over
time.  As I said, it feels like this is what netlink is for and the
arguments against seem to be more gut reaction.
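The eventfd variant dismissed above would look roughly like this; this uses the standard Linux eventfd(2) call, with the kernel-side signaler simulated by a plain write():

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Minimal eventfd round trip: the driver side would write() to signal
 * an event; userspace read()s the accumulated counter and then calls
 * an ioctl to find out what actually happened.  Linux-only. */
uint64_t eventfd_roundtrip(uint64_t nevents)
{
    int fd = eventfd(0, 0);
    uint64_t v = 0;

    if (fd < 0)
        return 0;
    for (uint64_t i = 0; i < nevents; i++) {
        uint64_t one = 1;
        write(fd, &one, sizeof(one));   /* each signal bumps the counter */
    }
    read(fd, &v, sizeof(v));            /* drains the accumulated count */
    close(fd);
    return v;
}
```

The limitation is visible in the sketch: the eventfd carries only a count, so every extensible payload still has to ride on a side channel, which is the argument for netlink.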

> > >  - I don't like too much the fact that VFIO provides yet another
> > > different API to do what we already have at least 2 kernel APIs for, ie,
> > > BAR mapping and config space access. At least it should be better at
> > > using the backend infrastructure of the 2 others (sysfs & procfs). I
> > > understand it wants to filter in some case (config space) and -maybe-
> > > yet another API is the right way to go but allow me to have my doubts.
> > 
> > The use of PCI sysfs is actually one of my complaints about current
> > device assignment.  To do assignment with an unprivileged guest we need
> > to open the PCI sysfs config file for it, then change ownership on a
> > handful of other PCI sysfs files, then there's this other pci-stub thing
> > to maintain ownership, but the kvm ioctls don't actually require it and
> > can grab onto any free device...  We are duplicating some of that in
> > VFIO, but we also put the ownership of the device behind a single device
> > file.  We do have the uiommu problem that we can't give an unprivileged
> > user ownership of that, but your usage model may actually make that
> > easier.  More below...
> > 
> > > One thing I thought about but you don't seem to like it ... was to use
> > > the need to represent the partitionable entity as groups in sysfs that I
> > > talked about earlier. Those could have per-device subdirs with the usual
> > > config & resource files, same semantic as the ones in the real device,
> > > but when accessed via the group they get filtering. It might or might not
> > > be practical in the end, tbd, but it would allow apps using a slightly
> > > modified libpci for example to exploit some of this.
> > 
> > I may be tainted by our disagreement that all the devices in a group
> > need to be exposed to the guest and qemu could just take a pointer to a
> > sysfs directory.  That seems very unlike qemu and pushes more of the
> > policy into qemu, which seems like the wrong direction.
> 
> I don't see how it pushes "policy" into qemu.
> 
> The "policy" here is imposed by the HW setup and exposed by the
> kernel :-) Giving qemu a group means qemu takes "ownership" of that bunch
> of devices, so far I don't see what's policy about that. From there, it
> would be "handy" for people to just stop there and just see all the
> devices of the group show up in the guest, but by all means feel free to
> suggest a command line interface that allows to more precisely specify
> which of the devices in the group to pass through and at what address.

That's exactly the policy I'm thinking of.  Here's a group of devices,
do something with them...  Does qemu assign them all?  where?  does it
allow hotplug?  do we have ROMs?  should we?  from where?

> > >  - The qemu vfio code hooks directly into ioapic ... of course that
> > > won't fly with anything !x86
> > 
> > I spent a lot of time looking for an architecture neutral solution here,
> > but I don't think it exists.  Please prove me wrong.
> 
> No it doesn't I agree, that's why it should be some kind of notifier or
> function pointer setup by the platform specific code.

Hmm... it is.  I added a pci_get_irq() that returns a
platform/architecture specific translation of a PCI interrupt to its
resulting system interrupt.  Implement this in your PCI root bridge.
There's a notifier for when this changes, so vfio will check
pci_get_irq() again, also to be implemented in the PCI root bridge code.
And a notifier that gets registered with that system interrupt and gets
notice for EOI... implemented in x86 ioapic, somewhere else for power.
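A sketch of that arrangement, with all names hypothetical: the root bridge code supplies the pin-to-system-interrupt translation, and vfio re-queries it whenever the route-change notifier fires (e.g. after the guest rewrites its routing):

```c
#include <assert.h>

/* Hypothetical sketch of the pci_get_irq() + notifier arrangement
 * described above; none of these names are the real QEMU API. */
typedef int (*pci_get_irq_fn)(int pin);

static pci_get_irq_fn platform_get_irq;  /* set by root bridge code   */
static int watched_pin = -1;             /* pin vfio cares about      */
static int cached_irq = -1;              /* last translation result   */

void register_platform_irq(pci_get_irq_fn fn)
{
    platform_get_irq = fn;
}

/* vfio attaches this to the route-change notifier. */
void irq_route_changed(void)
{
    if (platform_get_irq && watched_pin >= 0)
        cached_irq = platform_get_irq(watched_pin);
}

void watch_pin(int pin)
{
    watched_pin = pin;
    irq_route_changed();
}

int current_irq(void)
{
    return cached_irq;
}

/* Toy translation standing in for the x86 ioapic / POWER equivalent. */
int demo_get_irq(int pin)
{
    return 16 + (pin & 3);
}
```

The EOI notifier would hang off the returned system interrupt the same way, which is where the ioapic (or its POWER counterpart) comes back in.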

> >   The problem is
> > that we have to disable INTx on an assigned device after it fires (VFIO
> > does this automatically).  If we don't do this, a non-responsive or
> > malicious guest could sit on the interrupt, causing it to fire
> > repeatedly as a DoS on the host.  The only indication that we can rely
> > on to re-enable INTx is when the guest CPU writes an EOI to the APIC.
> > We can't just wait for device accesses because a) the device CSRs are
> > (hopefully) direct mapped and we'd have to slow map them or attempt to
> > do some kind of dirty logging to detect when they're accessed, b) what
> > constitutes an interrupt service is device specific.
> > 
> > That means we need to figure out how PCI interrupt 'A' (or B...)
> > translates to a GSI (Global System Interrupt - ACPI definition, but
> > hopefully a generic concept).  That GSI identifies a pin on an IOAPIC,
> > which will also see the APIC EOI.  And just to spice things up, the
> > guest can change the PCI to GSI mappings via ACPI.  I think the set of
> > callbacks I've added are generic (maybe I left ioapic in the name), but
> > yes they do need to be implemented for other architectures.  Patches
> > appreciated from those with knowledge of the systems and/or access to
> > device specs.  This is the only reason that I make QEMU VFIO only build
> > for x86.
> 
> Right, and we need to cook a similar sauce for POWER, it's an area that
> has to be arch specific (and in fact specific to the specific HW machine
> being emulated), so we just need to find out what's the cleanest way for
> the platform to "register" the right callbacks here.

Aside from the ioapic, I hope it's obvious hooks in the PCI root bridge
emulation.

[snip]
> > Rather than your "groups" idea, I've been mulling over whether we can
> > just expose the dependencies, configuration, and capabilities in sysfs
> > and build qemu commandlines to describe it.  For instance, if we simply
> > start with creating iommu nodes in sysfs, we could create links under
> > each iommu directory to the devices behind them.  Some kind of
> > capability file could define properties like whether it's page table
> > based or fixed iova window or the granularity of mapping the devices
> > behind it.  Once we have that, we could probably make uiommu attach to
> > each of those nodes.
> 
> Well, s/iommu/groups and you are pretty close to my original idea :-)
> 
> I don't mind that much what the details are, but I like the idea of not
> having to construct a 3-pages command line every time I want to
> pass-through a device, most "simple" usage scenario don't care that
> much.
> 
> > That means we know /dev/uiommu7 (random example) is our access to a
> > specific iommu with a given set of devices behind it.
> 
> Linking those sysfs iommus or groups to a /dev/ entry is fine by me.
>   
> >   If that iommu is
> > a PE (via those capability files), then a user space entity (trying hard
> > not to call it libvirt) can unbind all those devices from the host,
> > maybe bind the ones it wants to assign to a guest to vfio and bind the
> > others to pci-stub for safe keeping.  If you trust a user with
> > everything in a PE, bind all the devices to VFIO, chown all
> > the /dev/vfioX entries for those devices, and the /dev/uiommuX device.
> >
> > We might then come up with qemu command lines to describe interesting
> > configurations, such as:
> > 
> > -device iommu,model=PAPR,uiommu=/dev/uiommu7,id=iommu.0 \
> > -device pci-bus,...,iommu=iommu0,id=pci.0 \
> > -device vfio,host=ssss:bb:dd.f,bus=pci.0,addr=dd.f,id=hostdev0
> > 
> > The userspace entity would obviously need to put things in the same PE
> > in the right place, but it doesn't seem to take a lot of sysfs info to
> > get that right.
> > 
> > Today we do DMA mapping via the VFIO device because the capabilities of
> > the IOMMU domains change depending on which devices are connected (for
> > VT-d, the least common denominator of the IOMMUs in play).  Forcing the
> > DMA mappings through VFIO naturally forces the call order.  If we moved
> > to something like above, we could switch the DMA mapping to the uiommu
> > device, since the IOMMU would have fixed capabilities.
> 
> That makes sense.
> 
> > What gaps would something like this leave for your IOMMU granularity
> > problems?  I'll need to think through how it works when we don't want to
> > expose the iommu to the guest, maybe a model=none (default) that doesn't
> > need to be connected to a pci bus and maps all guest memory.  Thanks,
> 
> Well, I would map those "iommus" to PEs, so what remains is the path to
> put all the "other" bits and pieces such as informing qemu of the location
> and size of the MMIO segment(s) (so we can map the whole thing and not
> bother with individual BARs) etc... 

My assumption is that PEs are largely defined by the iommus already.
Are MMIO segments a property of the iommu too?  Thanks,
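As a sketch of what a "fixed iova window" capability file could mean in practice, here is the validation a map request would go through; the struct and the window/page-size values are purely illustrative, not any existing uiommu interface:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: a fixed-window iommu (e.g. a 32-bit DMA window
 * on POWER) advertises its iova range and granularity; userspace map
 * requests are validated against it before any table update.
 * Assumes page_size is a power of two. */
struct iommu_caps {
    uint64_t iova_base;   /* start of the DMA window       */
    uint64_t iova_size;   /* window length in bytes        */
    uint64_t page_size;   /* mapping granularity (e.g. 64K) */
};

int dma_map_ok(const struct iommu_caps *c, uint64_t iova, uint64_t len)
{
    if (len == 0 || len > c->iova_size)
        return 0;                            /* empty or too large   */
    if ((iova | len) & (c->page_size - 1))
        return 0;                            /* misaligned           */
    if (iova < c->iova_base)
        return 0;                            /* below the window     */
    if (iova - c->iova_base > c->iova_size - len)
        return 0;                            /* runs past the window */
    return 1;
}
```

A page-table-based iommu would instead advertise an effectively unbounded window, which is the kind of difference the capability file would have to express.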

Alex


* Re: kvm PCI assignment & VFIO ramblings
@ 2011-08-01 18:59       ` Alex Williamson
  0 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-08-01 18:59 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: chrisw, Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	qemu-devel, David Gibson, aafabbri, iommu, Anthony Liguori,
	linuxppc-dev, benve

On Sun, 2011-07-31 at 09:54 +1000, Benjamin Herrenschmidt wrote:
> On Sat, 2011-07-30 at 12:20 -0600, Alex Williamson wrote:
> 
> > On x86, the USB controllers don't typically live behind a PCIe-to-PCI
> > bridge, so don't suffer the source identifier problem, but they do often
> > share an interrupt.  But even then, we can count on most modern devices
> > supporting PCI2.3, and thus the DisINTx feature, which allows us to
> > share interrupts.  In any case, yes, it's more rare but we need to know
> > how to handle devices behind PCI bridges.  However I disagree that we
> > need to assign all the devices behind such a bridge to the guest.
> 
> Well, ok so let's dig a bit more here :-) First, yes I agree they don't
> all need to appear to the guest. My point is really that we must prevent
> them from being "used" by somebody else, either the host or another guest.
> 
> Now once you get there, I personally prefer having a clear "group"
> ownership rather than having devices stay in some "limbo" under vfio
> control but it's an implementation detail.
> 
> Regarding DisINTx, well, it's a bit like putting separate PCIe functions
> into separate guests, it looks good ... but you are taking a chance.
> Note that I do intend to do some of that for power ... well I think, I
> haven't completely made my mind.
> 
> pHyp has a stricter requirement, PEs essentially are everything
> behind a bridge. If you have a slot, you have some kind of bridge above
> this slot and everything on it will be a PE.
> 
> The problem I see is that with your filtering of config space, BAR
> emulation, DisINTx etc... you essentially assume that you can reasonably
> reliably isolate devices. But in practice, it's chancy. Some devices for
> example have "backdoors" into their own config space via MMIO. If I have
> such a device in a guest, I can completely override your DisINTx and
> thus DOS your host or another guest with a shared interrupt. I can move
> my MMIO around and DOS another function by overlapping the addresses.
> 
> You can really only be protect yourself against a device if you have it
> behind a bridge (in addition to having a filtering iommu), which limits
> the MMIO span (and thus letting the guest whack the BARs randomly will
> only allow that guest to shoot itself in the foot).
> 
> Some bridges also provide a way to block INTx below them which comes in
> handy but it's bridge specific. Some devices can be coerced to send the
> INTx "assert" message and never de-assert it (for example by doing a
> soft-reset while it's asserted, which can be done with some devices with
> an MMIO).
> 
> Anything below a PCIe -> PCI/PCI-X needs to also be "grouped" due to
> simple lack of proper filtering by the iommu (PCI-X in theory has RIDs
> and fowards them up, but this isn't very reliable, for example it fails
> over with split transactions).
> 
> Fortunately in PCIe land, we most have bridges above everything. The
> problem somewhat remains with functions of a device, how can you be sure
> that there isn't a way via some MMIO to create side effects on the other
> functions of the device ? (For example by checkstopping the whole
> thing). You can't really :-)
> 
> So it boils down of the "level" of safety/isolation you want to provide,
> and I suppose to some extent it's a user decision but the user needs to
> be informed to some extent. A hard problem :-)
>  
> > There's a difference between removing the device from the host and
> > exposing the device to the guest.  If I have a NIC and HBA behind a
> > bridge, it's perfectly reasonable that I might only assign the NIC to
> > the guest, but as you describe, we then need to prevent the host, or any
> > other guest from making use of the HBA.
> 
> Yes. However the other device is in "limbo" and it may be not clear to
> the user why it can't be used anymore :-)
> 
> The question is more, the user needs to "know" (or libvirt does, or
> somebody ... ) that in order to pass-through device A, it must also
> "remove" device B from the host. How can you even provide a meaningful
> error message to the user if all VFIO does is give you something like
> -EBUSY ?
> 
> So the information about the grouping constraint must trickle down
> somewhat.
> 
> Look at it from a GUI perspective for example. Imagine a front-end
> showing you devices in your system and allowing you to "Drag & drop"
> them to your guest. How do you represent that need for grouping ? First
> how do you expose it from kernel/libvirt to the GUI tool and how do you
> represent it to the user ?
> 
> By grouping the devices in logical groups which end up being the
> "objects" you can drag around, at least you provide some amount of
> clarity. Now if you follow that path down to how the GUI app, libvirt
> and possibly qemu need to know / resolve the dependency, being given the
> "groups" as the primary information of what can be used for pass-through
> makes everything a lot simpler.
>  
> > > - The -minimum- granularity of pass-through is not always a single
> > > device and not always under SW control
> > 
> > But IMHO, we need to preserve the granularity of exposing a device to a
> > guest as a single device.  That might mean some devices are held hostage
> > by an agent on the host.
> 
> Maybe but wouldn't that be even more confusing from a user perspective ?
> And I think it makes it harder from an implementation of admin &
> management tools perspective too.
> 
> > > - Having a magic heuristic in libvirt to figure out those constraints is
> > > WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
> > > knowledge of PCI resource management and getting it wrong in many many
> > > cases, something that took years to fix essentially by ripping it all
> > > out. This is kernel knowledge and thus we need the kernel to expose in a
> > > way or another what those constraints are, what those "partitionable
> > > groups" are.
> > > 
> > > - That does -not- mean that we cannot specify for each individual device
> > > within such a group where we want to put it in qemu (what devfn etc...).
> > > As long as there is a clear understanding that the "ownership" of the
> > > device goes with the group, this is somewhat orthogonal to how they are
> > > represented in qemu. (Not completely... if the iommu is exposed to the
> > > guest ,via paravirt for example, some of these constraints must be
> > > exposed but I'll talk about that more later).
> > 
> > Or we can choose not to expose all of the devices in the group to the
> > guest?
> 
> As I said, I don't mind if you don't, I'm just worried about the
> consequences of that from a usability standpoint. Having advanced
> command line options to fine tune is fine. Being able to specify within a
> "group" which devices to show and at what address is fine.
> 
> But I believe the basic entity to be manipulated from an interface
> standpoint remains the group.
> 
> To get back to my GUI example, once you've D&D your group of devices
> over, you can have the option to open that group and check/uncheck
> individual devices & assign them addresses if you want. That doesn't
> change the fact that practically speaking, the whole group is now owned
> by the guest.
> 
> I will go further than that actually. If you look at how the isolation
> HW works on POWER, the fact that I have the MMIO segmentation means that
> I can simply give the entire group MMIO space to the guest. No problem
> of small BARs, no need to slow-map them ... etc.. that's a pretty handy
> feature don't you think ?
> 
> But that means that those other devices -will- be there, mapped along
> with the one you care about. We may not expose it in config space but it
> will be accessible. I suppose we can keep its IO/MEM decoding disabled.
> But my point is that for all intents and purposes, it's actually owned by
> the guest.
> 
> > > The interface currently proposed for VFIO (and associated uiommu)
> > > doesn't handle that problem at all. Instead, it is entirely centered
> > > around a specific "feature" of the VTd iommu's for creating arbitrary
> > > domains with arbitrary devices (tho those devices -do- have the same
> > > constraints exposed above, don't try to put 2 legacy PCI devices behind
> > > the same bridge into 2 different domains !), but the API totally ignores
> > > the problem, leaves it to libvirt "magic foo" and focuses on something
> > > that is both quite secondary in the grand scheme of things, and quite
> > > x86 VTd specific in the implementation and API definition.
> > 
> > To be fair, libvirt's "magic foo" is built out of the necessity that
> > nobody else is defining the rules.
> 
> Sure, which is why I propose that the kernel exposes the rules since
> it's really the one right place to have that sort of HW constraint
> knowledge, especially since it can be partially at least platform
> specific.
>  
>  .../...

I'll try to consolidate my reply to all the above here because there are
too many places above to interject and make this thread even more
difficult to respond to.  Much of what you're discussing above comes
down to policy.  Do we trust DisINTx?  Do we trust multi-function
devices?  I have no doubt there are devices we can use as examples for
each behaving badly.  On x86 this is one of the reasons we have SR-IOV.
Besides splitting a single device into multiple, it makes sure each
device is actually virtualization friendly.  POWER seems to add
multiple layers of hardware so that you don't actually have to trust the
device, which is a great value add for enterprise systems, but in doing
so it mostly defeats the purpose and functionality of SR-IOV.

How we present this in a GUI is largely irrelevant because something has
to create a superset of what the hardware dictates (can I uniquely
identify transactions from this device, can I protect other devices from
it, etc.), the system policy (do I trust DisINTx, do I trust function
isolation, do I require ACS) and mold that with what the user actually
wants to assign.  For the VFIO kernel interface, we should only be
concerned with the first problem.  Userspace is free to make the rest as
simple or complete as it cares to.  I argue that for x86, we want device
level granularity of assignment, but that also tends to be the typical
case (when only factoring in hardware restrictions) due to our advanced
iommus.

> > > Maybe something like /sys/devgroups ? This probably warrants involving
> > > more kernel people into the discussion.
> > 
> > I don't yet buy into passing groups to qemu since I don't buy into the
> > idea of always exposing all of those devices to qemu.  Would it be
> > sufficient to expose iommu nodes in sysfs that link to the devices
> > behind them and describe properties and capabilities of the iommu
> > itself?  More on this at the end.
> 
> Well, iommu aren't the only factor. I mentioned shared interrupts (and
> my unwillingness to always trust DisINTx),

*userspace policy*

>  there's also the MMIO
> grouping I mentioned above (in which case it's an x86 -limitation- with
> small BARs that I don't want to inherit, especially since it's based on
> PAGE_SIZE and we commonly have 64K page size on POWER), etc...

But isn't MMIO grouping effectively *at* the iommu?

> So I'm not a big fan of making it entirely look like the iommu is the
> primary factor, but we -can-, that would be workable. I still prefer
> calling a cat a cat and exposing the grouping for what it is, as I think
> I've explained already above, tho. 

The trouble is the "group" analogy is more fitting to a partitionable
system, whereas on x86 we can really mix-n-match devices across iommus
fairly easily.  The iommu seems to be the common point to describe these
differences.

> > > Now some of this can be fixed with tweaks, and we've started doing it
> > > (we have a working pass-through using VFIO, forgot to mention that, it's
> > > just that we don't like what we had to do to get there).
> > 
> > This is a result of wanting to support *unmodified* x86 guests.  We
> > don't have the luxury of having a predefined pvDMA spec that all x86
> > OSes adhere to. 
> 
> No but you could emulate a HW iommu no ?

We can, but then we have to worry about supporting legacy, proprietary
OSes that may not have support or may make use of it differently.  As
Avi mentions, hardware is coming that eases the "pin the whole guest"
requirement and we may implement emulated iommus for the benefit of some
guests.

> >  The 32bit problem is unfortunate, but the priority use
> > case for assigning devices to guests is high performance I/O, which
> > usually entails modern, 64bit hardware.  I'd like to see us get to the
> > point of having emulated IOMMU hardware on x86, which could then be
> > backed by VFIO, but for now guest pinning is the most practical and
> > useful.
> 
> For your current case maybe. It's just not very future proof imho.
> Anyways, it's fixable, but the APIs as they are make it a bit clumsy.

You expect more 32bit devices in the future?

> > > Also our next generation chipset may drop support for PIO completely.
> > > 
> > > On the other hand, because PIO is just a special range of MMIO for us,
> > > we can do normal pass-through on it and don't need any of the emulation
> > > done qemu.
> > 
> > Maybe we can add mmap support to PIO regions on non-x86.
> 
> We have to yes. I haven't looked into it yet, it should be easy if VFIO
> kernel side starts using the "proper" PCI mmap interfaces in kernel (the
> same interfaces sysfs & proc use).

Patches welcome.

> > >   * MMIO constraints
> > > 
> > > The QEMU side VFIO code hard wires various constraints that are entirely
> > > based on various requirements you decided you have on x86 but don't
> > > necessarily apply to us :-)
> > > 
> > > Due to our paravirt nature, we don't need to masquerade the MSI-X table
> > > for example. At all. If the guest configures crap into it, too bad, it
> > > can only shoot itself in the foot since the host bridge enforces
> > > validation anyways as I explained earlier. Because it's all paravirt, we
> > > don't need to "translate" the interrupt vectors & addresses, the guest
> > > will make hypercalls to configure things anyways.
> > 
> > With interrupt remapping, we can allow the guest access to the MSI-X
> > table, but since that takes the host out of the loop, there's
> > effectively no way for the guest to correctly program it directly by
> > itself.
> 
> Right, I think what we need here is some kind of capabilities to
> "disable" those "features" of qemu vfio.c that aren't needed on our
> platform :-) Shouldn't be too hard. We need to make this runtime tho
> since different machines can have different "capabilities".

Sure, we'll probably eventually want a switch to push the MSI-X table to
KVM when it's available.

> > > We don't need to prevent MMIO pass-through for small BARs at all. This
> > > should be some kind of capability or flag passed by the arch. Our
> > > segmentation of the MMIO domain means that we can give entire segments
> > > to the guest and let it access anything in there (those segments are a
> > > multiple of the page size always). Worst case it will access outside of
> > > a device BAR within a segment and will cause the PE to go into error
> > > state, shooting itself in the foot, there is no risk of side effect
> > > outside of the guest boundaries.
> > 
> > Sure, this could be some kind of capability flag, maybe even implicit in
> > certain configurations.
> 
> Yup.
> 
> > > In fact, we don't even need to emulate BAR sizing etc... in theory. Our
> > > paravirt guests expect the BARs to have been already allocated for them
> > > by the firmware and will pick up the addresses from the device-tree :-)
> > > 
> > > Today we use a "hack", putting all 0's in there and triggering the linux
> > > code path to reassign unassigned resources (which will use BAR
> > > emulation) but that's not what we are -supposed- to do. Not a big deal
> > > and having the emulation there won't -hurt- us, it's just that we don't
> > > really need any of it.
> > > 
> > > We have a small issue with ROMs. Our current KVM only works with huge
> > > pages for guest memory but that is being fixed. So the way qemu maps the
> > > ROM copy into the guest address space doesn't work. It might be handy
> > > anyways to have a way for qemu to use MMIO emulation for ROM access as a
> > > fallback. I'll look into it.
> > 
> > So that means ROMs don't work for you on emulated devices either?  The
> > reason we read it once and map it into the guest is because Michael
> > Tsirkin found a section in the PCI spec that indicates devices can share
> > address decoders between BARs and ROM.
> 
> Yes, he is correct.
> 
> >   This means we can't just leave
> > the enabled bit set in the ROM BAR, because it could actually disable an
> > address decoder for a regular BAR.  We could slow-map the actual ROM,
> > enabling it around each read, but shadowing it seemed far more
> > efficient.
> 
> Right. We can slow map the ROM, or we can not care :-) At the end of the
> day, what is the difference here between a "guest" under qemu and the
> real thing bare metal on the machine ? IE. They have the same issue vs.
> accessing the ROM. IE. I don't see why qemu should try to make it safe
> to access it at any time while it isn't on a real machine. Since VFIO
> resets the devices before putting them in guest space, they should be
> accessible no ? (Might require a hard reset for some devices tho ... )

My primary motivator for doing the ROM the way it's done today is that I
get to push all the ROM handling off to QEMU core PCI code.  The ROM for
an assigned device is handled exactly like the ROM for an emulated
device except it might be generated by reading it from the hardware.
This gives us the benefit of things like rombar=0 if I want to hide the
ROM or romfile=<file> if I want to load an ipxe image for a device that
may not even have a physical ROM.  Not to mention I don't have to
special case ROM handling routines in VFIO.  So it actually has little
to do w/ making it safe to access the ROM at any time.

> In any case, it's not a big deal and we can sort it out, I'm happy to
> fallback to slow map to start with and eventually we will support small
> pages mappings on POWER anyways, it's a temporary limitation.

Perhaps this could also be fixed in the generic QEMU PCI ROM support so
it works for emulated devices too... code reuse paying off already ;)

> > >   * EEH
> > > 
> > > This is the name of those fancy error handling & isolation features I
> > > mentioned earlier. To some extent it's a superset of AER, but we don't
> > > generally expose AER to guests (or even the host), it's swallowed by
> > > firmware into something else that provides a superset (well mostly) of
> > > the AER information, and allow us to do those additional things like
> > > isolating/de-isolating, reset control etc...
> > > 
> > > Here too, we'll need arch specific APIs through VFIO. Not necessarily a
> > > huge deal, I mention it for completeness.
> > 
> > We expect to do AER via the VFIO netlink interface, which even though
> > its bashed below, would be quite extensible to supporting different
> > kinds of errors.
> 
> As could platform specific ioctls :-)

Is qemu going to poll for errors?

> > >    * Misc
> > > 
> > > There's lots of small bits and pieces... in no special order:
> > > 
> > >  - netlink ? WTF ! Seriously, we don't need a hybrid API with a bit of
> > > netlink and a bit of ioctl's ... it's not like there's something
> > > fundamentally  better for netlink vs. ioctl... it really depends what
> > > you are doing, and in this case I fail to see what netlink brings you
> > > other than bloat and more stupid userspace library deps.
> > 
> > The netlink interface is primarily for host->guest signaling.  I've only
> > implemented the remove command (since we're lacking a pcie-host in qemu
> > to do AER), but it seems to work quite well.  If you have suggestions
> > for how else we might do it, please let me know.  This seems to be the
> > sort of thing netlink is supposed to be used for.
> 
> I don't understand what the advantage of netlink is compared to just
> extending your existing VFIO ioctl interface, possibly using child
> fds as we do for example with spufs, but it's not a huge deal. It's just
> that netlink has its own gotchas and I don't like multi-headed
> interfaces.

We could do yet another eventfd that triggers the VFIO user to go call
an ioctl to see what happened, but then we're locked into an ioctl
interface for something that we may want to more easily extend over
time.  As I said, it feels like this is what netlink is for and the
arguments against seem to be more gut reaction.

> > >  - I don't like too much the fact that VFIO provides yet another
> > > different API to do what we already have at least 2 kernel APIs for, ie,
> > > BAR mapping and config space access. At least it should be better at
> > > using the backend infrastructure of the 2 others (sysfs & procfs). I
> > > understand it wants to filter in some case (config space) and -maybe-
> > > yet another API is the right way to go but allow me to have my doubts.
> > 
> > The use of PCI sysfs is actually one of my complaints about current
> > device assignment.  To do assignment with an unprivileged guest we need
> > to open the PCI sysfs config file for it, then change ownership on a
> > handful of other PCI sysfs files, then there's this other pci-stub thing
> > to maintain ownership, but the kvm ioctls don't actually require it and
> > can grab onto any free device...  We are duplicating some of that in
> > VFIO, but we also put the ownership of the device behind a single device
> > file.  We do have the uiommu problem that we can't give an unprivileged
> > user ownership of that, but your usage model may actually make that
> > easier.  More below...
> > 
> > > One thing I thought about but you don't seem to like it ... was to use
> > > the need to represent the partitionable entity as groups in sysfs that I
> > > talked about earlier. Those could have per-device subdirs with the usual
> > > config & resource files, same semantic as the ones in the real device,
> > > but when accessed via the group they get filtering. It might or might not
> > > be practical in the end, tbd, but it would allow apps using a slightly
> > > modified libpci for example to exploit some of this.
> > 
> > I may be tainted by our disagreement that all the devices in a group
> > need to be exposed to the guest and qemu could just take a pointer to a
> > sysfs directory.  That seems very unlike qemu and pushes more of the
> > policy into qemu, which seems like the wrong direction.
> 
> I don't see how it pushes "policy" into qemu.
> 
> The "policy" here is imposed by the HW setup and exposed by the
> kernel :-) Giving qemu a group means qemu takes "ownership" of that bunch
> of devices, so far I don't see what's policy about that. From there, it
> would be "handy" for people to just stop there and just see all the
> devices of the group show up in the guest, but by all means feel free to
> suggest a command line interface that allows one to more precisely specify
> which of the devices in the group to pass through and at what address.

That's exactly the policy I'm thinking of.  Here's a group of devices,
do something with them...  Does qemu assign them all?  where?  does it
allow hotplug?  do we have ROMs?  should we?  from where?

> > >  - The qemu vfio code hooks directly into ioapic ... of course that
> > > won't fly with anything !x86
> > 
> > I spent a lot of time looking for an architecture neutral solution here,
> > but I don't think it exists.  Please prove me wrong.
> 
> No it doesn't I agree, that's why it should be some kind of notifier or
> function pointer setup by the platform specific code.

Hmm... it is.  I added a pci_get_irq() that returns a
platform/architecture specific translation of a PCI interrupt to its
resulting system interrupt.  Implement this in your PCI root bridge.
There's a notifier for when this changes, so vfio will check
pci_get_irq() again, also to be implemented in the PCI root bridge code.
And a notifier that gets registered with that system interrupt and gets
notice for EOI... implemented in x86 ioapic, somewhere else for power.

> >   The problem is
> > that we have to disable INTx on an assigned device after it fires (VFIO
> > does this automatically).  If we don't do this, a non-responsive or
> > malicious guest could sit on the interrupt, causing it to fire
> > repeatedly as a DoS on the host.  The only indication that we can rely
> > on to re-enable INTx is when the guest CPU writes an EOI to the APIC.
> > We can't just wait for device accesses because a) the device CSRs are
> > (hopefully) direct mapped and we'd have to slow map them or attempt to
> > do some kind of dirty logging to detect when they're accessed, b) what
> > constitutes an interrupt service is device specific.
> > 
> > That means we need to figure out how PCI interrupt 'A' (or B...)
> > translates to a GSI (Global System Interrupt - ACPI definition, but
> > hopefully a generic concept).  That GSI identifies a pin on an IOAPIC,
> > which will also see the APIC EOI.  And just to spice things up, the
> > guest can change the PCI to GSI mappings via ACPI.  I think the set of
> > callbacks I've added are generic (maybe I left ioapic in the name), but
> > yes they do need to be implemented for other architectures.  Patches
> > appreciated from those with knowledge of the systems and/or access to
> > device specs.  This is the only reason that I make QEMU VFIO only build
> > for x86.
> 
> Right, and we need to cook a similar sauce for POWER, it's an area that
> has to be arch specific (and in fact specific to the specific HW machine
> being emulated), so we just need to find out what's the cleanest way for
> the platform to "register" the right callbacks here.

Aside from the ioapic, I hope it's obvious hooks in the PCI root bridge
emulation.

[snip]
> > Rather than your "groups" idea, I've been mulling over whether we can
> > just expose the dependencies, configuration, and capabilities in sysfs
> > and build qemu commandlines to describe it.  For instance, if we simply
> > start with creating iommu nodes in sysfs, we could create links under
> > each iommu directory to the devices behind them.  Some kind of
> > capability file could define properties like whether it's page table
> > based or fixed iova window or the granularity of mapping the devices
> > behind it.  Once we have that, we could probably make uiommu attach to
> > each of those nodes.
> 
> Well, s/iommu/groups and you are pretty close to my original idea :-)
> 
> I don't mind that much what the details are, but I like the idea of not
> having to construct a 3-page command line every time I want to
> pass-through a device; most "simple" usage scenarios don't care that
> much.
> 
> > That means we know /dev/uiommu7 (random example) is our access to a
> > specific iommu with a given set of devices behind it.
> 
> Linking those sysfs iommus or groups to a /dev/ entry is fine by me.
>   
> >   If that iommu is
> > a PE (via those capability files), then a user space entity (trying hard
> > not to call it libvirt) can unbind all those devices from the host,
> > maybe bind the ones it wants to assign to a guest to vfio and bind the
> > others to pci-stub for safe keeping.  If you trust a user with
> > everything in a PE, bind all the devices to VFIO, chown all
> > the /dev/vfioX entries for those devices, and the /dev/uiommuX device.
> >
> > We might then come up with qemu command lines to describe interesting
> > configurations, such as:
> > 
> > -device iommu,model=PAPR,uiommu=/dev/uiommu7,id=iommu.0 \
> > -device pci-bus,...,iommu=iommu0,id=pci.0 \
> > -device vfio,host=ssss:bb:dd.f,bus=pci.0,addr=dd.f,id=hostdev0
> > 
> > The userspace entity would obviously need to put things in the same PE
> > in the right place, but it doesn't seem to take a lot of sysfs info to
> > get that right.
> > 
> > Today we do DMA mapping via the VFIO device because the capabilities of
> > the IOMMU domains change depending on which devices are connected (for
> > VT-d, the least common denominator of the IOMMUs in play).  Forcing the
> > DMA mappings through VFIO naturally forces the call order.  If we moved
> > to something like above, we could switch the DMA mapping to the uiommu
> > device, since the IOMMU would have fixed capabilities.
> 
> That makes sense.
> 
> > What gaps would something like this leave for your IOMMU granularity
> > problems?  I'll need to think through how it works when we don't want to
> > expose the iommu to the guest, maybe a model=none (default) that doesn't
> > need to be connected to a pci bus and maps all guest memory.  Thanks,
> 
> Well, I would map those "iommus" to PEs, so what remains is the path to
> put all the "other" bits and pieces, such as informing qemu of the location
> and size of the MMIO segment(s) (so we can map the whole thing and not
> bother with individual BARs) etc... 

My assumption is that PEs are largely defined by the iommus already.
Are MMIO segments a property of the iommu too?  Thanks,

Alex

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [Qemu-devel] kvm PCI assignment & VFIO ramblings
@ 2011-08-01 18:59       ` Alex Williamson
  0 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-08-01 18:59 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: chrisw, Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	qemu-devel, David Gibson, aafabbri, iommu, linuxppc-dev, benve

On Sun, 2011-07-31 at 09:54 +1000, Benjamin Herrenschmidt wrote:
> On Sat, 2011-07-30 at 12:20 -0600, Alex Williamson wrote:
> 
> > On x86, the USB controllers don't typically live behind a PCIe-to-PCI
> > bridge, so don't suffer the source identifier problem, but they do often
> > share an interrupt.  But even then, we can count on most modern devices
> > supporting PCI2.3, and thus the DisINTx feature, which allows us to
> > share interrupts.  In any case, yes, it's more rare but we need to know
> > how to handle devices behind PCI bridges.  However I disagree that we
> > need to assign all the devices behind such a bridge to the guest.
> 
> Well, ok so let's dig a bit more here :-) First, yes I agree they don't
> all need to appear to the guest. My point is really that we must prevent
> them from being "used" by somebody else, either host or another guest.
> 
> Now once you get there, I personally prefer having a clear "group"
> ownership rather than having devices stay in some "limbo" under vfio
> control but it's an implementation detail.
> 
> Regarding DisINTx, well, it's a bit like putting separate PCIe functions
> into separate guests, it looks good ... but you are taking a chance.
> Note that I do intend to do some of that for power ... well I think, I
> haven't completely made up my mind.
> 
> pHyp has a stricter requirement: PEs essentially are everything
> behind a bridge. If you have a slot, you have some kind of bridge above
> this slot and everything on it will be a PE.
> 
> The problem I see is that with your filtering of config space, BAR
> emulation, DisINTx etc... you essentially assume that you can reasonably
> reliably isolate devices. But in practice, it's chancy. Some devices for
> example have "backdoors" into their own config space via MMIO. If I have
> such a device in a guest, I can completely override your DisINTx and
> thus DOS your host or another guest with a shared interrupt. I can move
> my MMIO around and DOS another function by overlapping the addresses.
> 
> You can really only protect yourself against a device if you have it
> behind a bridge (in addition to having a filtering iommu), which limits
> the MMIO span (and thus letting the guest whack the BARs randomly will
> only allow that guest to shoot itself in the foot).
> 
> Some bridges also provide a way to block INTx below them which comes in
> handy but it's bridge specific. Some devices can be coerced to send the
> INTx "assert" message and never de-assert it (for example by doing a
> soft-reset while it's asserted, which can be done with some devices with
> an MMIO).
> 
> Anything below a PCIe -> PCI/PCI-X needs to also be "grouped" due to
> simple lack of proper filtering by the iommu (PCI-X in theory has RIDs
> and forwards them up, but this isn't very reliable; for example it falls
> over with split transactions).
> 
> Fortunately in PCIe land, we mostly have bridges above everything. The
> problem somewhat remains with functions of a device, how can you be sure
> that there isn't a way via some MMIO to create side effects on the other
> functions of the device ? (For example by checkstopping the whole
> thing). You can't really :-)
> 
> So it boils down of the "level" of safety/isolation you want to provide,
> and I suppose to some extent it's a user decision but the user needs to
> be informed to some extent. A hard problem :-)
>  
> > There's a difference between removing the device from the host and
> > exposing the device to the guest.  If I have a NIC and HBA behind a
> > bridge, it's perfectly reasonable that I might only assign the NIC to
> > the guest, but as you describe, we then need to prevent the host, or any
> > other guest from making use of the HBA.
> 
> Yes. However the other device is in "limbo" and it may be not clear to
> the user why it can't be used anymore :-)
> 
> The question is more, the user needs to "know" (or libvirt does, or
> somebody ... ) that in order to pass-through device A, it must also
> "remove" device B from the host. How can you even provide a meaningful
> error message to the user if all VFIO does is give you something like
> -EBUSY ?
> 
> So the information about the grouping constraint must trickle down
> somewhat.
> 
> Look at it from a GUI perspective for example. Imagine a front-end
> showing you devices in your system and allowing you to "Drag & drop"
> them to your guest. How do you represent that need for grouping ? First
> how do you expose it from kernel/libvirt to the GUI tool and how do you
> represent it to the user ?
> 
> By grouping the devices in logical groups which end up being the
> "objects" you can drag around, at least you provide some amount of
> clarity. Now if you follow that path down to how the GUI app, libvirt
> and possibly qemu need to know / resolve the dependency, being given the
> "groups" as the primary information of what can be used for pass-through
> makes everything a lot simpler.
>  
> > > - The -minimum- granularity of pass-through is not always a single
> > > device and not always under SW control
> > 
> > But IMHO, we need to preserve the granularity of exposing a device to a
> > guest as a single device.  That might mean some devices are held hostage
> > by an agent on the host.
> 
> Maybe but wouldn't that be even more confusing from a user perspective ?
> And I think it makes it harder from an implementation of admin &
> management tools perspective too.
> 
> > > - Having a magic heuristic in libvirt to figure out those constraints is
> > > WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
> > > knowledge of PCI resource management and getting it wrong in many many
> > > cases, something that took years to fix essentially by ripping it all
> > > out. This is kernel knowledge and thus we need the kernel to expose in a
> > > way or another what those constraints are, what those "partitionable
> > > groups" are.
> > > 
> > > - That does -not- mean that we cannot specify for each individual device
> > > within such a group where we want to put it in qemu (what devfn etc...).
> > > As long as there is a clear understanding that the "ownership" of the
> > > device goes with the group, this is somewhat orthogonal to how they are
> > > represented in qemu. (Not completely... if the iommu is exposed to the
> > > guest ,via paravirt for example, some of these constraints must be
> > > exposed but I'll talk about that more later).
> > 
> > Or we can choose not to expose all of the devices in the group to the
> > guest?
> 
> As I said, I don't mind if you don't, I'm just worried about the
> consequences of that from a usability standpoint. Having advanced
> command line option to fine tune is fine. Being able to specify within a
> "group" which devices to show and at what address if fine.
> 
> But I believe the basic entity to be manipulated from an interface
> standpoitn remains the group.
> 
> To get back to my GUI example, once you've D&D your group of devices
> over, you can have the option to open that group and check/uncheck
> individual devices & assign them addresses if you want. That doesn't
> change the fact that practically speaking, the whole group is now owned
> by the guest.
> 
> I will go further than that actually. If you look at how the isolation
> HW works on POWER, the fact that I have the MMIO segmentation means that
> I can simply give the entire group MMIO space to the guest. No problem
> of small BARs, no need to slow-map them ... etc.. that's a pretty handy
> feature don't you think ?
> 
> But that means that those other devices -will- be there, mapped along
> with the one you care about. We may not expose it in config space but it
> will be accessible. I suppose we can keep its IO/MEM decoding disabled.
> But my point is that for all intend and purpose, it's actually owned by
> the guest.
> 
> > > The interface currently proposed for VFIO (and associated uiommu)
> > > doesn't handle that problem at all. Instead, it is entirely centered
> > > around a specific "feature" of the VTd iommu's for creating arbitrary
> > > domains with arbitrary devices (tho those devices -do- have the same
> > > constraints exposed above, don't try to put 2 legacy PCI devices behind
> > > the same bridge into 2 different domains !), but the API totally ignores
> > > the problem, leaves it to libvirt "magic foo" and focuses on something
> > > that is both quite secondary in the grand scheme of things, and quite
> > > x86 VTd specific in the implementation and API definition.
> > 
> > To be fair, libvirt's "magic foo" is built out of the necessity that
> > nobody else is defining the rules.
> 
> Sure, which is why I propose that the kernel exposes the rules since
> it's really the one right place to have that sort of HW constraint
> knowledge, especially since it can be partially at least platform
> specific.
>  
>  .../...

I'll try to consolidate my reply to all the above here because there are
too many places above to interject and make this thread even more
difficult to respond to.  Much of what you're discussing above comes
down to policy.  Do we trust DisINTx?  Do we trust multi-function
devices?  I have no doubt there are devices we can use as examples for
each behaving badly.  On x86 this is one of the reasons we have SR-IOV.
Besides splitting a single device into multiple, it makes sure each
device is actually virtualization friendly.  POWER seems to add
multiple layers of hardware so that you don't actually have to trust the
device, which is a great value add for enterprise systems, but in doing
so it mostly defeats the purpose and functionality of SR-IOV.

How we present this in a GUI is largely irrelevant because something has
to create a superset of what the hardware dictates (can I uniquely
identify transactions from this device, can I protect other devices from
it, etc.), the system policy (do I trust DisINTx, do I trust function
isolation, do I require ACS) and mold that with what the user actually
wants to assign.  For the VFIO kernel interface, we should only be
concerned with the first problem.  Userspace is free to make the rest as
simple or complete as it cares to.  I argue for x86, we want device
level granularity of assignment, but that also tends to be the typical
case (when only factoring in hardware restrictions) due to our advanced
iommus.

> > > Maybe something like /sys/devgroups ? This probably warrants involving
> > > more kernel people into the discussion.
> > 
> > I don't yet buy into passing groups to qemu since I don't buy into the
> > idea of always exposing all of those devices to qemu.  Would it be
> > sufficient to expose iommu nodes in sysfs that link to the devices
> > behind them and describe properties and capabilities of the iommu
> > itself?  More on this at the end.
> 
> Well, iommu aren't the only factor. I mentioned shared interrupts (and
> my unwillingness to always trust DisINTx),

*userspace policy*

>  there's also the MMIO
> grouping I mentioned above (in which case it's an x86 -limitation- with
> small BARs that I don't want to inherit, especially since it's based on
> PAGE_SIZE and we commonly have 64K page size on POWER), etc...

But isn't MMIO grouping effectively *at* the iommu?

> So I'm not too much of a fan of making it entirely look like the iommu is the
> primary factor, but we -can-, that would be workable. I still prefer
> calling a cat a cat and exposing the grouping for what it is, as I think
> I've explained already above, tho. 

The trouble is the "group" analogy is more fitting to a partitionable
system, whereas on x86 we can really mix-n-match devices across iommus
fairly easily.  The iommu seems to be the common point to describe these
differences.

> > > Now some of this can be fixed with tweaks, and we've started doing it
> > > (we have a working pass-through using VFIO, forgot to mention that, it's
> > > just that we don't like what we had to do to get there).
> > 
> > This is a result of wanting to support *unmodified* x86 guests.  We
> > don't have the luxury of having a predefined pvDMA spec that all x86
> > OSes adhere to. 
> 
> No but you could emulate a HW iommu no ?

We can, but then we have to worry about supporting legacy, proprietary
OSes that may not have support or may make use of it differently.  As
Avi mentions, hardware is coming that eases the "pin the whole guest"
requirement and we may implement emulated iommus for the benefit of some
guests.

> >  The 32bit problem is unfortunate, but the priority use
> > case for assigning devices to guests is high performance I/O, which
> > usually entails modern, 64bit hardware.  I'd like to see us get to the
> > point of having emulated IOMMU hardware on x86, which could then be
> > backed by VFIO, but for now guest pinning is the most practical and
> > useful.
> 
> For your current case maybe. It's just not very future proof imho.
> Anyways, it's fixable, but the APIs as they are make it a bit clumsy.

You expect more 32bit devices in the future?

> > > Also our next generation chipset may drop support for PIO completely.
> > > 
> > > On the other hand, because PIO is just a special range of MMIO for us,
> > > we can do normal pass-through on it and don't need any of the emulation
> > > done qemu.
> > 
> > Maybe we can add mmap support to PIO regions on non-x86.
> 
> We have to yes. I haven't looked into it yet, it should be easy if VFIO
> kernel side starts using the "proper" PCI mmap interfaces in kernel (the
> same interfaces sysfs & proc use).

Patches welcome.

> > >   * MMIO constraints
> > > 
> > > The QEMU side VFIO code hard wires various constraints that are entirely
> > > based on various requirements you decided you have on x86 but don't
> > > necessarily apply to us :-)
> > > 
> > > Due to our paravirt nature, we don't need to masquerade the MSI-X table
> > > for example. At all. If the guest configures crap into it, too bad, it
> > > can only shoot itself in the foot since the host bridge enforces
> > > validation anyways as I explained earlier. Because it's all paravirt, we
> > > don't need to "translate" the interrupt vectors & addresses, the guest
> > > will call hypercalls to configure things anyways.
> > 
> > With interrupt remapping, we can allow the guest access to the MSI-X
> > table, but since that takes the host out of the loop, there's
> > effectively no way for the guest to correctly program it directly by
> > itself.
> 
> Right, I think what we need here is some kind of capabilities to
> "disable" those "features" of qemu vfio.c that aren't needed on our
> platform :-) Shouldn't be too hard. We need to make this runtime tho
> since different machines can have different "capabilities".

Sure, we'll probably eventually want a switch to push the MSI-X table to
KVM when it's available.

> > > We don't need to prevent MMIO pass-through for small BARs at all. This
> > > should be some kind of capability or flag passed by the arch. Our
> > > segmentation of the MMIO domain means that we can give entire segments
> > > to the guest and let it access anything in there (those segments are a
> > > multiple of the page size always). Worst case it will access outside of
> > > a device BAR within a segment and will cause the PE to go into error
> > > state, shooting itself in the foot, there is no risk of side effect
> > > outside of the guest boundaries.
> > 
> > Sure, this could be some kind of capability flag, maybe even implicit in
> > certain configurations.
> 
> Yup.
> 
> > > In fact, we don't even need to emulate BAR sizing etc... in theory. Our
> > > paravirt guests expect the BARs to have been already allocated for them
> > > by the firmware and will pick up the addresses from the device-tree :-)
> > > 
> > > Today we use a "hack", putting all 0's in there and triggering the linux
> > > code path to reassign unassigned resources (which will use BAR
> > > emulation) but that's not what we are -supposed- to do. Not a big deal
> > > and having the emulation there won't -hurt- us, it's just that we don't
> > > really need any of it.
> > > 
> > > We have a small issue with ROMs. Our current KVM only works with huge
> > > pages for guest memory but that is being fixed. So the way qemu maps the
> > > ROM copy into the guest address space doesn't work. It might be handy
> > > anyways to have a way for qemu to use MMIO emulation for ROM access as a
> > > fallback. I'll look into it.
> > 
> > So that means ROMs don't work for you on emulated devices either?  The
> > reason we read it once and map it into the guest is because Michael
> > Tsirkin found a section in the PCI spec that indicates devices can share
> > address decoders between BARs and ROM.
> 
> Yes, he is correct.
> 
> >   This means we can't just leave
> > the enabled bit set in the ROM BAR, because it could actually disable an
> > address decoder for a regular BAR.  We could slow-map the actual ROM,
> > enabling it around each read, but shadowing it seemed far more
> > efficient.
> 
> Right. We can slow map the ROM, or we can not care :-) At the end of the
> day, what is the difference here between a "guest" under qemu and the
> real thing bare metal on the machine ? IE. They have the same issue vs.
> accessing the ROM. IE. I don't see why qemu should try to make it safe
> to access it at any time while it isn't on a real machine. Since VFIO
> resets the devices before putting them in guest space, they should be
> accessible no ? (Might require a hard reset for some devices tho ... )

My primary motivator for doing the ROM the way it's done today is that I
get to push all the ROM handling off to QEMU core PCI code.  The ROM for
an assigned device is handled exactly like the ROM for an emulated
device except it might be generated by reading it from the hardware.
This gives us the benefit of things like rombar=0 if I want to hide the
ROM or romfile=<file> if I want to load an ipxe image for a device that
may not even have a physical ROM.  Not to mention I don't have to
special case ROM handling routines in VFIO.  So it actually has little
to do w/ making it safe to access the ROM at any time.

> In any case, it's not a big deal and we can sort it out, I'm happy to
> fallback to slow map to start with and eventually we will support small
> pages mappings on POWER anyways, it's a temporary limitation.

Perhaps this could also be fixed in the generic QEMU PCI ROM support so
it works for emulated devices too... code reuse paying off already ;)

> > >   * EEH
> > > 
> > > This is the name of those fancy error handling & isolation features I
> > > mentioned earlier. To some extent it's a superset of AER, but we don't
> > > generally expose AER to guests (or even the host), it's swallowed by
> > > firmware into something else that provides a superset (well mostly) of
> > > the AER information, and allow us to do those additional things like
> > > isolating/de-isolating, reset control etc...
> > > 
> > > Here too, we'll need arch specific APIs through VFIO. Not necessarily a
> > > huge deal, I mention it for completeness.
> > 
> > We expect to do AER via the VFIO netlink interface, which even though
> > it's bashed below, would be quite extensible to supporting different
> > kinds of errors.
> 
> As could platform specific ioctls :-)

Is qemu going to poll for errors?

> > >    * Misc
> > > 
> > > There's lots of small bits and pieces... in no special order:
> > > 
> > >  - netlink ? WTF ! Seriously, we don't need a hybrid API with a bit of
> > > netlink and a bit of ioctl's ... it's not like there's something
> > > fundamentally  better for netlink vs. ioctl... it really depends what
> > > you are doing, and in this case I fail to see what netlink brings you
> > > other than bloat and more stupid userspace library deps.
> > 
> > The netlink interface is primarily for host->guest signaling.  I've only
> > implemented the remove command (since we're lacking a pcie-host in qemu
> > to do AER), but it seems to work quite well.  If you have suggestions
> > for how else we might do it, please let me know.  This seems to be the
> > sort of thing netlink is supposed to be used for.
> 
> I don't understand what the advantage of netlink is compared to just
> extending your existing VFIO ioctl interface, possibly using children
> fd's as we do for example with spufs, but it's not a huge deal. It's just
> that netlink has its own gotchas and I don't like multi-headed
> interfaces.

We could do yet another eventfd that triggers the VFIO user to go call
an ioctl to see what happened, but then we're locked into an ioctl
interface for something that we may want to more easily extend over
time.  As I said, it feels like this is what netlink is for and the
arguments against seem to be more gut reaction.

> > >  - I don't like too much the fact that VFIO provides yet another
> > > different API to do what we already have at least 2 kernel APIs for, ie,
> > > BAR mapping and config space access. At least it should be better at
> > > using the backend infrastructure of the 2 others (sysfs & procfs). I
> > > understand it wants to filter in some case (config space) and -maybe-
> > > yet another API is the right way to go but allow me to have my doubts.
> > 
> > The use of PCI sysfs is actually one of my complaints about current
> > device assignment.  To do assignment with an unprivileged guest we need
> > to open the PCI sysfs config file for it, then change ownership on a
> > handful of other PCI sysfs files, then there's this other pci-stub thing
> > to maintain ownership, but the kvm ioctls don't actually require it and
> > can grab onto any free device...  We are duplicating some of that in
> > VFIO, but we also put the ownership of the device behind a single device
> > file.  We do have the uiommu problem that we can't give an unprivileged
> > user ownership of that, but your usage model may actually make that
> > easier.  More below...
> > 
> > > One thing I thought about but you don't seem to like it ... was to use
> > > the need to represent the partitionable entity as groups in sysfs that I
> > > talked about earlier. Those could have per-device subdirs with the usual
> > > config & resource files, same semantic as the ones in the real device,
> > > but when accessed via the group they get filtering. It might or might not
> > > be practical in the end, tbd, but it would allow apps using a slightly
> > > modified libpci for example to exploit some of this.
> > 
> > I may be tainted by our disagreement that all the devices in a group
> > need to be exposed to the guest and qemu could just take a pointer to a
> > sysfs directory.  That seems very unlike qemu and pushes more of the
> > policy into qemu, which seems like the wrong direction.
> 
> I don't see how it pushes "policy" into qemu.
> 
> The "policy" here is imposed by the HW setup and exposed by the
> kernel :-) Giving qemu a group means qemu takes "ownership" of that bunch
> of devices, so far I don't see what's policy about that. From there, it
> would be "handy" for people to just stop there and just see all the
> devices of the group show up in the guest, but by all means feel free to
> suggest a command line interface that allows to more precisely specify
> which of the devices in the group to pass through and at what address.

That's exactly the policy I'm thinking of.  Here's a group of devices,
do something with them...  Does qemu assign them all?  where?  does it
allow hotplug?  do we have ROMs?  should we?  from where?

> > >  - The qemu vfio code hooks directly into ioapic ... of course that
> > > won't fly with anything !x86
> > 
> > I spent a lot of time looking for an architecture neutral solution here,
> > but I don't think it exists.  Please prove me wrong.
> 
> No it doesn't I agree, that's why it should be some kind of notifier or
> function pointer setup by the platform specific code.

Hmm... it is.  I added a pci_get_irq() that returns a
platform/architecture specific translation of a PCI interrupt to its
resulting system interrupt.  Implement this in your PCI root bridge.
There's a notifier for when this changes, so vfio will check
pci_get_irq() again, also to be implemented in the PCI root bridge code.
And a notifier that gets registered with that system interrupt and gets
notice for EOI... implemented in x86 ioapic, somewhere else for power.

> >   The problem is
> > that we have to disable INTx on an assigned device after it fires (VFIO
> > does this automatically).  If we don't do this, a non-responsive or
> > malicious guest could sit on the interrupt, causing it to fire
> > repeatedly as a DoS on the host.  The only indication that we can rely
> > on to re-enable INTx is when the guest CPU writes an EOI to the APIC.
> > We can't just wait for device accesses because a) the device CSRs are
> > (hopefully) direct mapped and we'd have to slow map them or attempt to
> > do some kind of dirty logging to detect when they're accessed b) what
> > constitutes an interrupt service is device specific.
> > 
> > That means we need to figure out how PCI interrupt 'A' (or B...)
> > translates to a GSI (Global System Interrupt - ACPI definition, but
> > hopefully a generic concept).  That GSI identifies a pin on an IOAPIC,
> > which will also see the APIC EOI.  And just to spice things up, the
> > guest can change the PCI to GSI mappings via ACPI.  I think the set of
> > callbacks I've added are generic (maybe I left ioapic in the name), but
> > yes they do need to be implemented for other architectures.  Patches
> > appreciated from those with knowledge of the systems and/or access to
> > device specs.  This is the only reason that I make QEMU VFIO only build
> > for x86.
> 
> Right, and we need to cook a similar sauce for POWER, it's an area that
> has to be arch specific (and in fact specific to the specific HW machine
> being emulated), so we just need to find out what's the cleanest way for
the platform to "register" the right callbacks here.

Aside from the ioapic, I hope it's obvious hooks in the PCI root bridge
emulation.

[snip]
> > Rather than your "groups" idea, I've been mulling over whether we can
> > just expose the dependencies, configuration, and capabilities in sysfs
> > and build qemu commandlines to describe it.  For instance, if we simply
> > start with creating iommu nodes in sysfs, we could create links under
> > each iommu directory to the devices behind them.  Some kind of
> > capability file could define properties like whether it's page table
> > based or fixed iova window or the granularity of mapping the devices
> > behind it.  Once we have that, we could probably make uiommu attach to
> > each of those nodes.
> 
> Well, s/iommu/groups and you are pretty close to my original idea :-)
> 
> I don't mind that much what the details are, but I like the idea of not
> having to construct a 3-pages command line every time I want to
> pass-through a device, most "simple" usage scenario don't care that
> much.
> 
> > That means we know /dev/uiommu7 (random example) is our access to a
> > specific iommu with a given set of devices behind it.
> 
> Linking those sysfs iommus or groups to a /dev/ entry is fine by me.
>   
> >   If that iommu is
> > a PE (via those capability files), then a user space entity (trying hard
> > not to call it libvirt) can unbind all those devices from the host,
> > maybe bind the ones it wants to assign to a guest to vfio and bind the
> > others to pci-stub for safe keeping.  If you trust a user with
> > everything in a PE, bind all the devices to VFIO, chown all
> > the /dev/vfioX entries for those devices, and the /dev/uiommuX device.
> >
> > We might then come up with qemu command lines to describe interesting
> > configurations, such as:
> > 
> > -device iommu,model=PAPR,uiommu=/dev/uiommu7,id=iommu.0 \
> > -device pci-bus,...,iommu=iommu0,id=pci.0 \
> > -device vfio,host=ssss:bb:dd.f,bus=pci.0,addr=dd.f,id=hostdev0
> > 
> > The userspace entity would obviously need to put things in the same PE
> > in the right place, but it doesn't seem to take a lot of sysfs info to
> > get that right.
> > 
> > Today we do DMA mapping via the VFIO device because the capabilities of
> > the IOMMU domains change depending on which devices are connected (for
> > VT-d, the least common denominator of the IOMMUs in play).  Forcing the
> > DMA mappings through VFIO naturally forces the call order.  If we moved
> > to something like above, we could switch the DMA mapping to the uiommu
> > device, since the IOMMU would have fixed capabilities.
> 
> That makes sense.
> 
> > What gaps would something like this leave for your IOMMU granularity
> > problems?  I'll need to think through how it works when we don't want to
> > expose the iommu to the guest, maybe a model=none (default) that doesn't
> > need to be connected to a pci bus and maps all guest memory.  Thanks,
> 
> Well, I would map those "iommus" to PEs, so what remains is the path to
> put all the "other" bits and pieces such as inform qemu of the location
> and size of the MMIO segment(s) (so we can map the whole thing and not
> bother with individual BARs) etc... 

My assumption is that PEs are largely defined by the iommus already.
Are MMIO segments a property of the iommu too?  Thanks,

Alex

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-07-31 14:09   ` Avi Kivity
@ 2011-08-01 20:27     ` Alex Williamson
  -1 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-08-01 20:27 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Benjamin Herrenschmidt, kvm, Anthony Liguori, David Gibson,
	Paul Mackerras, Alexey Kardashevskiy, linux-pci, linuxppc-dev

On Sun, 2011-07-31 at 17:09 +0300, Avi Kivity wrote:
> On 07/30/2011 02:58 AM, Benjamin Herrenschmidt wrote:
> > Due to our paravirt nature, we don't need to masquerade the MSI-X table
> > for example. At all. If the guest configures crap into it, too bad, it
> > can only shoot itself in the foot since the host bridge enforces
> > validation anyways as I explained earlier. Because it's all paravirt, we
> > don't need to "translate" the interrupt vectors & addresses, the guest
> > will call hypercalls to configure things anyways.
> 
> So, you have interrupt redirection?  That is, MSI-x table values encode 
> the vcpu, not pcpu?
> 
> Alex, with interrupt redirection, we can skip this as well?  Perhaps 
> only if the guest enables interrupt redirection?

It's not clear to me how we could skip it.  With VT-d, we'd have to
implement an emulated interrupt remapper and hope that the guest picks
unused indexes in the host interrupt remapping table before it could do
anything useful with direct access to the MSI-X table.  Maybe AMD IOMMU
makes this easier?  Thanks,

Alex

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-07-31 14:09   ` Avi Kivity
@ 2011-08-02  1:27     ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 322+ messages in thread
From: Benjamin Herrenschmidt @ 2011-08-02  1:27 UTC (permalink / raw)
  To: Avi Kivity
  Cc: kvm, Anthony Liguori, Alex Williamson, David Gibson,
	Paul Mackerras, Alexey Kardashevskiy, linux-pci, linuxppc-dev

On Sun, 2011-07-31 at 17:09 +0300, Avi Kivity wrote:
> On 07/30/2011 02:58 AM, Benjamin Herrenschmidt wrote:
> > - Having a magic heuristic in libvirt to figure out those constraints is
> > WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
> > knowledge of PCI resource management and getting it wrong in many many
> > cases, something that took years to fix essentially by ripping it all
> > out. This is kernel knowledge and thus we need the kernel to expose in a
> > way or another what those constraints are, what those "partitionable
> > groups" are.
> 
> How about a sysfs entry partition=<partition-id>? then libvirt knows not 
> to assign devices from the same partition to different guests (and not 
> to let the host play with them, either).

That would work. On POWER I also need to expose the fact that such
partitions imply a shared iommu domain, but that's probably doable.

It would be easy for me to implement it that way since I would just pass
down my PE#.

However, it seems to be a bit of the "smallest possible tweak" to get it
to work. We keep a completely orthogonal iommu domain handling for x86
and there is no link between them.

I still personally prefer a way to statically define the grouping, but
it looks like you guys don't agree... oh well.

> > The interface currently proposed for VFIO (and associated uiommu)
> > doesn't handle that problem at all. Instead, it is entirely centered
> > around a specific "feature" of the VTd iommu's for creating arbitrary
> > domains with arbitrary devices (tho those devices -do- have the same
> > constraints exposed above, don't try to put 2 legacy PCI devices behind
> > the same bridge into 2 different domains !), but the API totally ignores
> > the problem, leaves it to libvirt "magic foo" and focuses on something
> > that is both quite secondary in the grand scheme of things, and quite
> > x86 VTd specific in the implementation and API definition.
> >
> > Now, I'm not saying these programmable iommu domains aren't a nice
> > feature and that we shouldn't exploit them when available, but as it is,
> > it is too much a central part of the API.
> 
> I have a feeling you'll be getting the same capabilities sooner or 
> later, or you won't be able to make use of SR-IOV VFs.

I'm not sure what you mean. We can do SR-IOV just fine (well, with some
limitations due to constraints with how our MMIO segmenting works and
indeed some of those are being lifted in our future chipsets but
overall, it works).

In -theory-, one could do the grouping dynamically with some kind of API
for us as well. However the constraints are such that it's not
practical. Filtering on RID is based on the number of bits to match in the
bus number and whether to match the dev and fn. So it's not arbitrary
(but works fine for SR-IOV).

The MMIO segmentation is a bit special too. There is a single MMIO
region in 32-bit space (size is configurable but that's not very
practical so for now we stick it to 1G) which is evenly divided into N
segments (where N is the number of PE# supported by the host bridge,
typically 128 with the current bridges).

Each segment goes through a remapping table to select the actual PE# (so
large BARs use consecutive segments mapped to the same PE#).
 
For SR-IOV we plan to not use the M32 region. We also have 64-bit MMIO
regions which act as some kind of "accordions", they are evenly divided
into segments in different PE# and there's several of them which we can
"move around" and typically use to map VF BARs.

>  While we should 
> support the older hardware, the interfaces should be designed with the 
> newer hardware in mind.

Well, our newer hardware will relax some of our limitations, like the
way our 64-bit segments work (I didn't go into details but they have
some inconvenient size constraints that will be lifted), having more
PE#, supporting more MSI ports etc... but the basic scheme remains the
same. Oh and the newer IOMMU will support separate address spaces.

But as you said, we -do- need to support the older stuff.

> > My main point is that I don't want the "knowledge" here to be in libvirt
> > or qemu. In fact, I want to be able to do something as simple as passing
> > a reference to a PE to qemu (sysfs path ?) and have it just pickup all
> > the devices in there and expose them to the guest.
> 
> Such magic is nice for a developer playing with qemu but in general less 
> useful for a managed system where the various cards need to be exposed 
> to the user interface anyway.

Right but at least the code that does that exposure can work top-down,
picking groups and exposing their content.

> > * IOMMU
> >
> > Now more on iommu. I've described I think in enough details how ours
> > work, there are others, I don't know what freescale or ARM are doing,
> > sparc doesn't quite work like VTd either, etc...
> >
> > The main problem isn't that much the mechanics of the iommu but really
> > how it's exposed (or not) to guests.
> >
> > VFIO here is basically designed for one and only one thing: expose the
> > entire guest physical address space to the device more/less 1:1.
> 
> A single level iommu cannot be exposed to guests.  Well, it can be 
> exposed as an iommu that does not provide per-device mapping.

Well, x86 ones can't maybe but on POWER we can and must thanks to our
essentially paravirt model :-) Even if it wasn't and we used trapping
of accesses to the table, it would work because in practice, even with
filtering, what we end up having is a per-device (or rather per-PE#)
table.

> A two level iommu can be emulated and exposed to the guest.  See 
> http://support.amd.com/us/Processor_TechDocs/48882.pdf for an example.

What you mean by 2-level is two passes through two trees (ie 6 or 8 levels
right ?). We don't have that and probably never will. But again, because
we have a paravirt interface to the iommu, it's less of an issue.

> > This means:
> >
> >    - It only works with iommu's that provide complete DMA address spaces
> > to devices. Won't work with a single 'segmented' address space like we
> > have on POWER.
> >
> >    - It requires the guest to be pinned. Pass-through ->  no more swap
> 
> Newer iommus (and devices, unfortunately) (will) support I/O page faults 
> and then the requirement can be removed.

No. -Some- newer devices will. Out of these, a bunch will have so many
bugs that they're not usable. Some never will. It's a mess really and I
wouldn't design my stuff based on those premises just yet. Making it
possible to support it for sure, having it in mind, but not making it
the foundation on which the whole API is designed.

> >    - The guest cannot make use of the iommu to deal with 32-bit DMA
> > devices, thus a guest with more than a few G of RAM (I don't know the
> > exact limit on x86, depends on your IO hole I suppose), and you end up
> > back to swiotlb&  bounce buffering.
> 
> Is this a problem in practice?

Could be. It's an artificial limitation we don't need on POWER.

> >    - It doesn't work for POWER server anyways because of our need to
> > provide a paravirt iommu interface to the guest since that's how pHyp
> > works today and how existing OSes expect to operate.
> 
> Then you need to provide that same interface, and implement it using the 
> real iommu.

Yes. Working on it. It's not very practical due to how VFIO interacts in
terms of APIs but solvable. Eventually, we'll make the iommu Hcalls
almost entirely real-mode for performance reasons.

> > - Performance sucks of course, the vfio map ioctl wasn't mean for that
> > and has quite a bit of overhead. However we'll want to do the paravirt
> > call directly in the kernel eventually ...
> 
> Does the guest iomap each request?  Why?

Not sure what you mean... the guest calls h-calls for every iommu page
mapping/unmapping, yes. So the performance of these is critical. So yes,
we'll eventually do it in kernel. We just haven't yet.

> Emulating the iommu in the kernel is of course the way to go if that's 
> the case, still won't performance suck even then?

Well, we have HW in the field where we still beat Intel on 10G
networking performance but heh, yeah, the cost of those h-calls is a
concern.

There are some new interfaces in pHyp that we'll eventually support
that allow creating additional iommu mappings in 64-bit space (the
current base mapping is 32-bit and 4K for backward compatibility) with
larger iommu page sizes.

This will eventually help. For guests backed with hugetlbfs we might be
able to map the whole guest using 16M pages at the iommu level.

But on the other hand, the current method means that we can support
pass-through without losing overcommit & paging which is handy.

> > The QEMU side VFIO code hard wires various constraints that are entirely
> > based on various requirements you decided you have on x86 but don't
> > necessarily apply to us :-)
> >
> > Due to our paravirt nature, we don't need to masquerade the MSI-X table
> > for example. At all. If the guest configures crap into it, too bad, it
> > can only shoot itself in the foot since the host bridge enforces
> > validation anyways as I explained earlier. Because it's all paravirt, we
> > don't need to "translate" the interrupt vectors & addresses, the guest
> > will call hypercalls to configure things anyways.
> 
> So, you have interrupt redirection?  That is, MSI-x table values encode 
> the vcpu, not pcpu?

Not exactly. The MSI-X address is a real PCI address to an MSI port and
the value is a real interrupt number in the PIC.

However, the MSI port filters by RID (using the same matching as PE#) to
ensure that only allowed devices can write to it, and the PIC has a
matching PE# information to ensure that only allowed devices can trigger
the interrupt.

As for the guest knowing what values to put in there (what port address
and interrupt source numbers to use), this is part of the paravirt APIs.

So the paravirt APIs handle the configuration and the HW ensures that
the guest cannot do anything other than what it's allowed to.

> Alex, with interrupt redirection, we can skip this as well?  Perhaps 
> only if the guest enables interrupt redirection?
> 
> If so, it's not arch specific, it's interrupt redirection specific.
> 
> > We don't need to prevent MMIO pass-through for small BARs at all. This
> > should be some kind of capability or flag passed by the arch. Our
> > segmentation of the MMIO domain means that we can give entire segments
> > to the guest and let it access anything in there (those segments are a
> > multiple of the page size always). Worst case it will access outside of
> > a device BAR within a segment and will cause the PE to go into error
> > state, shooting itself in the foot, there is no risk of side effect
> > outside of the guest boundaries.
> 
> Does the BAR value contain the segment base address?  Or is that added 
> later?

It's a shared address space. With a basic configuration on p7ioc for
example we have MMIO going from 3G to 4G (PCI side addresses). BARs
contain the normal PCI address there. But that 1G is divided in 128
segments of equal size which can separately be assigned to PE#'s.

So BARs are allocated by firmware or the kernel PCI code so that devices
in different PEs don't share segments.

Of course there's always the risk that a device can be hacked via a
sideband access to BARs to move out of its allocated segment. That
means that the guest owning that device won't be able to access it
anymore and can potentially disturb a guest or host owning whatever is
in that other segment.

The only way to enforce isolation here is to ensure that PE# are
entirely behind P2P bridges, since those would then ensure that even if
you put crap into your BARs you won't be able to walk over a neighbour.

I believe pHyp enforces that, for example, if you have a slot, all
devices & functions behind that slot pertain to the same PE# under pHyp.

That means you cannot put individual functions of a device into
different PE# with pHyp.

We plan to be a bit less restrictive here for KVM, assuming that if you
use a device that allows such a back-channel to the BARs, then it's your
problem to not trust such a device for virtualization. And most of the
time, you -will- have a P2P to protect you anyways.

The problem doesn't exist (or is assumed as non-existing) for SR-IOV
since in that case, the VFs are meant to be virtualized, so pHyp assumes
there is no such back-channel and it can trust them to be in different
PE#.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 322+ messages in thread


* Re: kvm PCI assignment & VFIO ramblings
  2011-08-01 16:40     ` Alex Williamson
  (?)
@ 2011-08-02  1:29     ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 322+ messages in thread
From: Benjamin Herrenschmidt @ 2011-08-02  1:29 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	David Gibson, Anthony Liguori, linuxppc-dev

On Mon, 2011-08-01 at 10:40 -0600, Alex Williamson wrote:
> On Sun, 2011-07-31 at 08:21 +1000, Benjamin Herrenschmidt wrote:
> > On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote:
> > > Hi folks !
> > > 
> > > So I promised Anthony I would try to summarize some of the comments &
> > > issues we have vs. VFIO after we've tried to use it for PCI pass-through
> > > on POWER. It's pretty long, there are various items with more or less
> > > impact, some of it is easily fixable, some are API issues, and we'll
> > > probably want to discuss them separately, but for now here's a brain
> > > dump.
> > > 
> > > David, Alexei, please make sure I haven't missed anything :-)
> > 
> > And I think I have :-)
> > 
> >   * Config space
> > 
> > VFIO currently handles that as a byte stream. It's quite gross to be
> > honest and it's not right. You shouldn't lose access size information
> > between guest and host when performing real accesses.
> > 
> > Some config space registers can have side effects and not respecting
> > access sizes can be nasty.
> 
> It's a bug, let's fix it.

Right. I was just trying to be exhaustive :-) If you don't beat us to
it, we'll eventually submit patches to fix it; we haven't fixed it yet
either, it's just something I noticed (this byte-stream transport also
makes handling of endianness clumsy).

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-01 18:59       ` Alex Williamson
  (?)
@ 2011-08-02  2:00         ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 322+ messages in thread
From: Benjamin Herrenschmidt @ 2011-08-02  2:00 UTC (permalink / raw)
  To: Alex Williamson
  Cc: chrisw, Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	qemu-devel, David Gibson, aafabbri, iommu, linuxppc-dev, benve

On Mon, 2011-08-01 at 12:59 -0600, Alex Williamson wrote:

> >  
> >  .../...
> 
> I'll try to consolidate my reply to all the above here because there are
> too many places above to interject and make this thread even more
> difficult to respond to.

True, I should try to do the same :-)

>   Much of what you're discussion above comes
> down to policy.  Do we trust DisINTx?  Do we trust multi-function
> devices?  I have no doubt there are devices we can use as examples for
> each behaving badly.  On x86 this is one of the reasons we have SR-IOV.

Right, that and having the ability to provide way more functions than
you would normally have.

> Besides splitting a single device into multiple, it makes sure each
> device is actually virtualization friendly.  POWER seems to add
> multiple layers of hardware so that you don't actually have to trust the
> device, which is a great value add for enterprise systems, but in doing
> so it mostly defeats the purpose and functionality of SR-IOV.

Well not entirely. A lot of what POWER does is also about isolation on
errors. This is going to be useful with and without SR-IOV. Also not all
devices are SR-IOV capable and there are plenty of situations where one
would want to pass-through devices that aren't, I don't see that as
disappearing tomorrow.

> How we present this in a GUI is largely irrelevant because something has
> to create a superset of what the hardware dictates (can I uniquely
> identify transactions from this device, can I protect other devices from
> it, etc.), the system policy (do I trust DisINTx, do I trust function
> isolation, do I require ACS) and mold that with what the user actually
> wants to assign.  For the VFIO kernel interface, we should only be
> concerned with the first problem.  Userspace is free to make the rest as
> simple or complete as it cares to.  I argue for x86, we want device
> level granularity of assignment, but that also tends to be the typical
> case (when only factoring in hardware restrictions) due to our advanced
> iommus.

Well, POWER iommu's are advanced too ... just in a different way :-) x86
seems to be a lot less interested in robustness and reliability for
example :-)

I tend to agree that the policy decisions in general should be done by
the user, tho with appropriate information :-)

But some of them on our side are hard requirements imposed by how our
firmware or early kernel code assigned the PE's and we need to expose
that. It directly determines the sharing of iommu's too but then we -could-
have those different iommu's point to the same table in memory and
essentially mimic the x86 domains. We chose not to. The segments are
too small in our current HW design for one and it means we lose the
isolation between devices which is paramount to getting the kind of
reliability and error handling we want to achieve. 

> > > > Maybe something like /sys/devgroups ? This probably warrants involving
> > > > more kernel people into the discussion.
> > > 
> > > I don't yet buy into passing groups to qemu since I don't buy into the
> > > idea of always exposing all of those devices to qemu.  Would it be
> > > sufficient to expose iommu nodes in sysfs that link to the devices
> > > behind them and describe properties and capabilities of the iommu
> > > itself?  More on this at the end.
> > 
> > Well, iommu aren't the only factor. I mentioned shared interrupts (and
> > my unwillingness to always trust DisINTx),
> 
> *userspace policy*

Maybe ... some of it yes. I suppose. You can always hand out to
userspace bigger guns to shoot itself in the foot. Not always very wise
but heh.

Some of these are hard requirements tho. And we have to make that
decision when we assign PE's at boot time.

> >  there's also the MMIO
> > grouping I mentioned above (in which case it's an x86 -limitation- with
> > small BARs that I don't want to inherit, especially since it's based on
> > PAGE_SIZE and we commonly have 64K page size on POWER), etc...
> 
> But isn't MMIO grouping effectively *at* the iommu?

Not exactly. It's a different set of tables & registers in the host
bridge and essentially a different set of logic, tho it does hook into
the whole "shared PE# state" thingy to enforce isolation of all layers
on error.

> > So I'm not too fan of making it entirely look like the iommu is the
> > primary factor, but we -can-, that would be workable. I still prefer
> > calling a cat a cat and exposing the grouping for what it is, as I think
> > I've explained already above, tho. 
> 
> The trouble is the "group" analogy is more fitting to a partitionable
> system, whereas on x86 we can really mix-n-match devices across iommus
> fairly easily.  The iommu seems to be the common point to describe these
> differences.

No. You can do that by throwing away isolation between those devices and
thus throwing away error isolation capabilities as well. I suppose if
you don't care about RAS... :-)

> > > > Now some of this can be fixed with tweaks, and we've started doing it
> > > > (we have a working pass-through using VFIO, forgot to mention that, it's
> > > > just that we don't like what we had to do to get there).
> > > 
> > > This is a result of wanting to support *unmodified* x86 guests.  We
> > > don't have the luxury of having a predefined pvDMA spec that all x86
> > > OSes adhere to. 
> > 
> > No but you could emulate a HW iommu no ?
> 
> We can, but then we have to worry about supporting legacy, proprietary
> OSes that may not have support or may make use of it differently.  As
> Avi mentions, hardware is coming that eases the "pin the whole guest"
> requirement and we may implement emulated iommus for the benefit of some
> guests.

That's a pipe dream :-) It will take a LONG time before a reasonable
proportion of devices does this in a reliable way I believe.

> > >  The 32bit problem is unfortunate, but the priority use
> > > case for assigning devices to guests is high performance I/O, which
> > > usually entails modern, 64bit hardware.  I'd like to see us get to the
> > > point of having emulated IOMMU hardware on x86, which could then be
> > > backed by VFIO, but for now guest pinning is the most practical and
> > > useful.
> > 
> > For your current case maybe. It's just not very future proof imho.
> > Anyways, it's fixable, but the APIs as they are make it a bit clumsy.
> 
> You expect more 32bit devices in the future?

God knows what embedded ARM folks will come up with :-) I wouldn't
dismiss that completely. I do expect to have to deal with OHCI for a
while tho.

> > > > Also our next generation chipset may drop support for PIO completely.
> > > > 
> > > > On the other hand, because PIO is just a special range of MMIO for us,
> > > > we can do normal pass-through on it and don't need any of the emulation
> > > > done qemu.
> > > 
> > > Maybe we can add mmap support to PIO regions on non-x86.
> > 
> > We have to yes. I haven't looked into it yet, it should be easy if VFIO
> > kernel side starts using the "proper" PCI mmap interfaces in kernel (the
> > same interfaces sysfs & proc use).
> 
> Patches welcome.

Sure, we do plan to send patches for a lot of those things as we get
there, I'm just choosing to mention all the issues at once here and we
haven't got to fixing -that- just yet.
 
 .../...

> > Right. We can slow map the ROM, or we can not care :-) At the end of the
> > day, what is the difference here between a "guest" under qemu and the
> > real thing bare metal on the machine ? IE. They have the same issue vs.
> > accessing the ROM. IE. I don't see why qemu should try to make it safe
> > to access it at any time while it isn't on a real machine. Since VFIO
> > resets the devices before putting them in guest space, they should be
> > accessible no ? (Might require a hard reset for some devices tho ... )
> 
> My primary motivator for doing the ROM the way it's done today is that I
> get to push all the ROM handling off to QEMU core PCI code.  The ROM for
> an assigned device is handled exactly like the ROM for an emulated
> device except it might be generated by reading it from the hardware.
> This gives us the benefit of things like rombar=0 if I want to hide the
> ROM or romfile=<file> if I want to load an ipxe image for a device that
> may not even have a physical ROM.  Not to mention I don't have to
> special case ROM handling routines in VFIO.  So it actually has little
> to do w/ making it safe to access the ROM at any time.

On the other hand, let's hope no device has side effects on the ROM and
expects to exploit them :-) Do we know how ROM/flash updates work for
devices in practice ? Do they expect to be able to write to the ROM BAR
or they always use a different MMIO based sideband access ?
 
> > In any case, it's not a big deal and we can sort it out, I'm happy to
> > fallback to slow map to start with and eventually we will support small
> > pages mappings on POWER anyways, it's a temporary limitation.
> 
> Perhaps this could also be fixed in the generic QEMU PCI ROM support so
> it works for emulated devices too... code reuse paying off already ;)

Heh, I think emulation works.

> > > >   * EEH
> > > > 
> > > > This is the name of those fancy error handling & isolation features I
> > > > mentioned earlier. To some extent it's a superset of AER, but we don't
> > > > generally expose AER to guests (or even the host), it's swallowed by
> > > > firmware into something else that provides a superset (well mostly) of
> > > > the AER information, and allow us to do those additional things like
> > > > isolating/de-isolating, reset control etc...
> > > > 
> > > > Here too, we'll need arch specific APIs through VFIO. Not necessarily a
> > > > huge deal, I mention it for completeness.
> > > 
> > > We expect to do AER via the VFIO netlink interface, which even though
> > > its bashed below, would be quite extensible to supporting different
> > > kinds of errors.
> > 
> > As could platform specific ioctls :-)
> 
> Is qemu going to poll for errors?

I wouldn't mind eventfd + ioctl, I really don't like netlink :-) But
others might disagree with me here. However that's not really my
argument, see below...

> > I don't understand what the advantage of netlink is compared to just
> > extending your existing VFIO ioctl interface, possibly using children
> > fd's as we do for example with spufs but it's not a huge deal. It just
> > that netlink has its own gotchas and I don't like multi-headed
> > interfaces.
> 
> We could do yet another eventfd that triggers the VFIO user to go call
> an ioctl to see what happened, but then we're locked into an ioctl
> interface for something that we may want to more easily extend over
> time.  As I said, it feels like this is what netlink is for and the
> arguments against seem to be more gut reaction.

My argument here is that we already have an fd open, ie, we already
have a communication channel open to vfio as a chardev; I don't like
the idea of creating -another- one.

> Hmm... it is.  I added a pci_get_irq() that returns a
> platform/architecture specific translation of a PCI interrupt to it's
> resulting system interrupt.  Implement this in your PCI root bridge.
> There's a notifier for when this changes, so vfio will check
> pci_get_irq() again, also to be implemented in the PCI root bridge code.
> And a notifier that gets registered with that system interrupt and gets
> notice for EOI... implemented in x86 ioapic, somewhere else for power.

Let's leave this one alone, we'll fix it one way or another and we can
discuss the patches when it comes down to it.

> > >   The problem is
> > > that we have to disable INTx on an assigned device after it fires (VFIO
> > > does this automatically).  If we don't do this, a non-responsive or
> > > malicious guest could sit on the interrupt, causing it to fire
> > > repeatedly as a DoS on the host.  The only indication that we can rely
> > > on to re-enable INTx is when the guest CPU writes an EOI to the APIC.
> > > We can't just wait for device accesses because a) the device CSRs are
> > > (hopefully) direct mapped and we'd have to slow map them or attempt to
> > > do some kind of dirty logging to detect when they're accesses b) what
> > > constitutes an interrupt service is device specific.
> > > 
> > > That means we need to figure out how PCI interrupt 'A' (or B...)
> > > translates to a GSI (Global System Interrupt - ACPI definition, but
> > > hopefully a generic concept).  That GSI identifies a pin on an IOAPIC,
> > > which will also see the APIC EOI.  And just to spice things up, the
> > > guest can change the PCI to GSI mappings via ACPI.  I think the set of
> > > callbacks I've added are generic (maybe I left ioapic in the name), but
> > > yes they do need to be implemented for other architectures.  Patches
> > > appreciated from those with knowledge of the systems and/or access to
> > > device specs.  This is the only reason that I make QEMU VFIO only build
> > > for x86.
> > 
> > Right, and we need to cook a similar sauce for POWER, it's an area that
> > has to be arch specific (and in fact specific to the particular HW machine
> > being emulated), so we just need to find out what's the cleanest way for
> > the platform to "register" the right callbacks here.
> 
> Aside from the ioapic, I hope it's obvious hooks in the PCI root bridge
> emulation.

Yeah, we'll see; whatever we come up with, we'll discuss the details
then :-)

>  Thanks,
> > 
> > Well, I would map those "iommus" to PEs, so what remains is the path to
> > put all the "other" bits and pieces such as inform qemu of the location
> > and size of the MMIO segment(s) (so we can map the whole thing and not
> > bother with individual BARs) etc... 
> 
> My assumption is that PEs are largely defined by the iommus already.
> Are MMIO segments a property of the iommu too?  Thanks,

Not exactly but it's all tied together. See my other replies.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-07-30 18:20   ` Alex Williamson
  (?)
@ 2011-08-02  8:28     ` David Gibson
  -1 siblings, 0 replies; 322+ messages in thread
From: David Gibson @ 2011-08-02  8:28 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Benjamin Herrenschmidt, chrisw, Alexey Kardashevskiy, kvm,
	Paul Mackerras, linux-pci, qemu-devel, aafabbri, iommu,
	Anthony Liguori, linuxppc-dev, benve

On Sat, Jul 30, 2011 at 12:20:08PM -0600, Alex Williamson wrote:
> On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote:
[snip]
> On x86, the USB controllers don't typically live behind a PCIe-to-PCI
> bridge, so don't suffer the source identifier problem, but they do often
> share an interrupt.  But even then, we can count on most modern devices
> supporting PCI2.3, and thus the DisINTx feature, which allows us to
> share interrupts.  In any case, yes, it's more rare but we need to know
> how to handle devices behind PCI bridges.  However I disagree that we
> need to assign all the devices behind such a bridge to the guest.
> There's a difference between removing the device from the host and
> exposing the device to the guest.

I think you're arguing only over details of what words to use for
what, rather than anything of substance here.  The point is that an
entire partitionable group must be assigned to "host" (in which case
kernel drivers may bind to it) or to a particular guest partition (or
at least to a single UID on the host).  Which of the assigned devices
the partition actually uses is another matter of course, as is at
exactly which level they become "de-exposed" if you don't want to use
all of them.

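The per-group ownership rule David describes can be put in a toy sketch (names and structure invented for illustration, not any real kernel API): ownership is granted for the whole partitionable group at once, while which devices the owner actually uses remains a per-device matter.

```python
# Toy model (hypothetical names): ownership is per group, never per
# device, matching the partitionable-group constraint above.
class DeviceGroup:
    def __init__(self, devices):
        self.devices = set(devices)   # e.g. BDFs behind one PCIe-to-PCI bridge
        self.owner = "host"           # whole group: "host" or one guest/uid

    def assign(self, owner):
        # The entire group changes hands at once.
        self.owner = owner

    def can_use(self, owner, device):
        # An owner may use any subset of its group's devices; using
        # fewer than all of them is fine, splitting the group is not.
        return device in self.devices and self.owner == owner

grp = DeviceGroup(["0000:02:00.0", "0000:02:00.1", "0000:02:00.2"])
grp.assign("guest-A")
assert grp.can_use("guest-A", "0000:02:00.1")
assert not grp.can_use("guest-B", "0000:02:00.2")
```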
[snip]
> > Maybe something like /sys/devgroups ? This probably warrants involving
> > more kernel people into the discussion.
> 
> I don't yet buy into passing groups to qemu since I don't buy into the
> idea of always exposing all of those devices to qemu.  Would it be
> sufficient to expose iommu nodes in sysfs that link to the devices
> behind them and describe properties and capabilities of the iommu
> itself?  More on this at the end.

Again, I don't think you're making a distinction of any substance.
Ben is saying the group as a whole must be set to allow partition
access, whether or not you call that "assigning".  There's no reason
that passing a sysfs descriptor to qemu couldn't be the qemu
developer's quick-and-dirty method of putting the devices in, while
also allowing full assignment of the devices within the groups by
libvirt.

[snip]
> > Now some of this can be fixed with tweaks, and we've started doing it
> > (we have a working pass-through using VFIO, forgot to mention that, it's
> > just that we don't like what we had to do to get there).
> 
> This is a result of wanting to support *unmodified* x86 guests.  We
> don't have the luxury of having a predefined pvDMA spec that all x86
> OSes adhere to.  The 32bit problem is unfortunate, but the priority use
> case for assigning devices to guests is high performance I/O, which
> usually entails modern, 64bit hardware.  I'd like to see us get to the
> point of having emulated IOMMU hardware on x86, which could then be
> backed by VFIO, but for now guest pinning is the most practical and
> useful.

No-one's suggesting that this isn't a valid mode of operation.  It's
just that right now conditionally disabling it for us is fairly ugly
because of the way the qemu code is structured.

[snip]
> > The above means we need arch specific APIs. So arch specific vfio
> > ioctl's, either that or kvm ones going to vfio or something ... the
> > current structure of vfio/kvm interaction doesn't make it easy.
> 
> FYI, we also have large page support for x86 VT-d, but it seems to only
> be opportunistic right now.  I'll try to come back to the rest of this
> below.

Incidentally there seems to be a hugepage leak bug in the current
kernel code (which I haven't had a chance to track down yet).  Our
qemu code currently has bugs (working on it..) which means it has
unbalanced maps and unmaps of the pages.  When qemu quits they should
all be released, but somehow they're not.

[snip]
> >  - I don't like too much the fact that VFIO provides yet another
> > different API to do what we already have at least 2 kernel APIs for, ie,
> > BAR mapping and config space access. At least it should be better at
> > using the backend infrastructure of the 2 others (sysfs & procfs). I
> > understand it wants to filter in some cases (config space) and -maybe-
> > yet another API is the right way to go but allow me to have my doubts.
> 
> The use of PCI sysfs is actually one of my complaints about current
> device assignment.  To do assignment with an unprivileged guest we need
> to open the PCI sysfs config file for it, then change ownership on a
> handful of other PCI sysfs files, then there's this other pci-stub thing
> to maintain ownership, but the kvm ioctls don't actually require it and
> can grab onto any free device...  We are duplicating some of that in
> VFIO, but we also put the ownership of the device behind a single device
> file.  We do have the uiommu problem that we can't give an unprivileged
> user ownership of that, but your usage model may actually make that
> easier.  More below...

Hrm.  I was assuming that a sysfs groups interface would provide a
single place to set the ownership of the whole group.  Whether that's
echoing a uid to a magic file or doing a chown on the directory or
whatever is a matter of details.

[snip]
> I spent a lot of time looking for an architecture neutral solution here,
> but I don't think it exists.  Please prove me wrong.  The problem is
> that we have to disable INTx on an assigned device after it fires (VFIO
> does this automatically).  If we don't do this, a non-responsive or
> malicious guest could sit on the interrupt, causing it to fire
> repeatedly as a DoS on the host.  The only indication that we can rely
> on to re-enable INTx is when the guest CPU writes an EOI to the APIC.
> We can't just wait for device accesses because a) the device CSRs are
> (hopefully) direct mapped and we'd have to slow map them or attempt to
> do some kind of dirty logging to detect when they're accessed b) what
> constitutes an interrupt service is device specific.
> 
> That means we need to figure out how PCI interrupt 'A' (or B...)
> translates to a GSI (Global System Interrupt - ACPI definition, but
> hopefully a generic concept).  That GSI identifies a pin on an IOAPIC,
> which will also see the APIC EOI.  And just to spice things up, the
> guest can change the PCI to GSI mappings via ACPI.  I think the set of
> callbacks I've added are generic (maybe I left ioapic in the name), but
> yes they do need to be implemented for other architectures.  Patches
> appreciated from those with knowledge of the systems and/or access to
> device specs.  This is the only reason that I make QEMU VFIO only build
> for x86.

There will certainly need to be some arch hooks here, but it can be
made less intrusively x86 specific without too much difficulty.
e.g. Create an EOI notifier chain in qemu - the master PICs (APIC for
x86, XICS for pSeries) for all vfio capable machines need to kick it,
and vfio subscribes.
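David's notifier-chain suggestion can be sketched in a few lines (a toy model, class and method names invented): the machine's interrupt controller model kicks the chain on guest EOI, and the vfio code subscribes so it knows when it is safe to unmask INTx again.

```python
# Hypothetical sketch of the EOI notifier-chain idea: the APIC/XICS
# model kicks the chain when the guest EOIs; vfio subscribes and uses
# the callback to re-enable (unmask) INTx on the assigned device.
class EOINotifierChain:
    def __init__(self):
        self.subscribers = []

    def subscribe(self, fn):
        self.subscribers.append(fn)

    def notify(self, irq):
        for fn in self.subscribers:
            fn(irq)

class VFIOINTx:
    def __init__(self, irq):
        self.irq = irq
        self.masked = False

    def fire(self):
        # Host handler: mask INTx immediately, so a guest that never
        # services the interrupt can't DoS the host with re-fires.
        self.masked = True

    def on_guest_eoi(self, irq):
        if irq == self.irq:
            self.masked = False   # safe to take the next interrupt

chain = EOINotifierChain()          # owned by the interrupt controller model
dev = VFIOINTx(irq=11)
chain.subscribe(dev.on_guest_eoi)

dev.fire()
assert dev.masked
chain.notify(11)                    # guest CPU writes the EOI
assert not dev.masked
```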

[snip]
> Rather than your "groups" idea, I've been mulling over whether we can
> just expose the dependencies, configuration, and capabilities in sysfs
> and build qemu commandlines to describe it.  For instance, if we simply
> start with creating iommu nodes in sysfs, we could create links under
> each iommu directory to the devices behind them.  Some kind of
> capability file could define properties like whether it's page table
> based or fixed iova window or the granularity of mapping the devices
> behind it.  Once we have that, we could probably make uiommu attach to
> each of those nodes.

Well, that would address our chief concern that inherently tying the
lifetime of a domain to an fd is problematic.  In fact, I don't really
see how this differs from the groups proposal except in the details of
how you inform qemu of the group^H^H^H^H^Hiommu domain.
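As a rough illustration of the per-iommu capability data Alex proposes exposing, here is a toy record (the field names are invented, not a proposed ABI): whether the iommu is page-table based or a fixed iova window, its mapping granularity, and the devices behind it.

```python
# Hypothetical shape of the per-iommu-node capabilities: model type,
# granularity, optional fixed iova window, and links to devices.
from dataclasses import dataclass, field

@dataclass
class IommuNode:
    model: str                 # "page-table" or "fixed-window"
    granularity: int           # smallest mappable unit, in bytes
    iova_base: int = 0         # meaningful only for fixed-window
    iova_size: int = 0
    devices: list = field(default_factory=list)

# A VT-d style page-table domain vs a POWER-style fixed DMA window:
vtd = IommuNode(model="page-table", granularity=4096,
                devices=["0000:01:00.0"])
power = IommuNode(model="fixed-window", granularity=4096,
                  iova_base=0, iova_size=1 << 31,
                  devices=["0001:00:01.0", "0001:00:01.1"])

assert vtd.model == "page-table"
assert power.iova_size == 2**31
```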

[snip]
> Today we do DMA mapping via the VFIO device because the capabilities of
> the IOMMU domains change depending on which devices are connected (for
> VT-d, the least common denominator of the IOMMUs in play).  Forcing the
> DMA mappings through VFIO naturally forces the call order.  If we moved
> to something like above, we could switch the DMA mapping to the uiommu
> device, since the IOMMU would have fixed capabilities.

Ah, that's why you have the map and unmap on the vfio fd,
necessitating the ugly "pick the first vfio fd from the list" thing.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-01 20:27     ` Alex Williamson
@ 2011-08-02  8:32       ` Avi Kivity
  -1 siblings, 0 replies; 322+ messages in thread
From: Avi Kivity @ 2011-08-02  8:32 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Benjamin Herrenschmidt, kvm, Anthony Liguori, David Gibson,
	Paul Mackerras, Alexey Kardashevskiy, linux-pci, linuxppc-dev

On 08/01/2011 11:27 PM, Alex Williamson wrote:
> On Sun, 2011-07-31 at 17:09 +0300, Avi Kivity wrote:
> >  On 07/30/2011 02:58 AM, Benjamin Herrenschmidt wrote:
> >  >  Due to our paravirt nature, we don't need to masquerade the MSI-X table
> >  >  for example. At all. If the guest configures crap into it, too bad, it
> >  >  can only shoot itself in the foot since the host bridge enforces
> >  >  validation anyways as I explained earlier. Because it's all paravirt, we
> >  >  don't need to "translate" the interrupt vectors&   addresses, the guest
> >  >  will call hypercalls to configure things anyways.
> >
> >  So, you have interrupt redirection?  That is, MSI-x table values encode
> >  the vcpu, not pcpu?
> >
> >  Alex, with interrupt redirection, we can skip this as well?  Perhaps
> >  only if the guest enables interrupt redirection?
>
> It's not clear to me how we could skip it.  With VT-d, we'd have to
> implement an emulated interrupt remapper and hope that the guest picks
> unused indexes in the host interrupt remapping table before it could do
> anything useful with direct access to the MSI-X table.

Yeah.  We need the interrupt remapping hardware to indirect based on the 
source of the message, not just the address and data.

> Maybe AMD IOMMU
> makes this easier?  Thanks,
>

No idea.
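Avi's point about indirecting on the message source can be made concrete with a toy lookup (table layout invented for illustration): if the remapping table were keyed only by address and data, a device could forge another device's interrupt; keying on the requester ID as well confines each device to its own entries.

```python
# Toy interrupt-remapping lookup keyed by (requester_id, index).
remap_table = {
    # (requester_id, table_index) -> (target_cpu, vector)
    (0x0100, 0): (2, 0x41),
    (0x0200, 0): (5, 0x61),
}

def deliver(requester_id, table_index):
    entry = remap_table.get((requester_id, table_index))
    if entry is None:
        return None   # blocked: no entry for this source
    return entry

assert deliver(0x0100, 0) == (2, 0x41)
# Device 0x0200 aiming at the same index gets its own entry (or
# nothing), never device 0x0100's vector:
assert deliver(0x0200, 0) == (5, 0x61)
assert deliver(0x0300, 0) is None
```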

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-02  1:27     ` Benjamin Herrenschmidt
@ 2011-08-02  9:12       ` Avi Kivity
  -1 siblings, 0 replies; 322+ messages in thread
From: Avi Kivity @ 2011-08-02  9:12 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: kvm, Anthony Liguori, Alex Williamson, David Gibson,
	Paul Mackerras, Alexey Kardashevskiy, linux-pci, linuxppc-dev

On 08/02/2011 04:27 AM, Benjamin Herrenschmidt wrote:
> >
> >  I have a feeling you'll be getting the same capabilities sooner or
> >  later, or you won't be able to make use of SR-IOV VFs.
>
> I'm not sure what you mean. We can do SR-IOV just fine (well, with some
> limitations due to constraints with how our MMIO segmenting works and
> indeed some of those are being lifted in our future chipsets but
> overall, it works).

Don't those limitations include "all VFs must be assigned to the same 
guest"?

PCI on x86 has function granularity, SRIOV reduces this to VF 
granularity, but I thought power has partition or group granularity 
which is much coarser?

> In -theory-, one could do the grouping dynamically with some kind of API
> for us as well. However the constraints are such that it's not
> practical. Filtering on RID is based on number of bits to match in the
> bus number and whether to match the dev and fn. So it's not arbitrary
> (but works fine for SR-IOV).
>
> The MMIO segmentation is a bit special too. There is a single MMIO
> region in 32-bit space (size is configurable but that's not very
> practical so for now we stick it to 1G) which is evenly divided into N
> segments (where N is the number of PE# supported by the host bridge,
> typically 128 with the current bridges).
>
> Each segment goes through a remapping table to select the actual PE# (so
> large BARs use consecutive segments mapped to the same PE#).
>
> For SR-IOV we plan to not use the M32 region. We also have 64-bit MMIO
> regions which act as some kind of "accordions", they are evenly divided
> into segments in different PE# and there's several of them which we can
> "move around" and typically use to map VF BARs.

So, SRIOV VFs *don't* have the group limitation?  Sorry, I'm deluged by 
technical details with no ppc background to put them to, I can't say I'm 
making any sense of this.
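Ben's M32 segmenting above can be put into numbers as a sanity check (the window split is per his description; the remap-table contents are invented): a 1G window divided into 128 segments gives 8M segments, and a large BAR simply occupies consecutive segments that all remap to the same PE#.

```python
# Worked example of the M32 MMIO segmenting: 1G window, 128 segments,
# per-segment remap table selecting the PE#.
M32_BASE = 0x8000_0000
M32_SIZE = 1 << 30                    # 1G window
NUM_SEGMENTS = 128
SEG_SIZE = M32_SIZE // NUM_SEGMENTS   # 8M per segment

segment_to_pe = [0] * NUM_SEGMENTS
segment_to_pe[4] = 7          # a 16M BAR: two consecutive segments,
segment_to_pe[5] = 7          # both remapped to the same PE# 7

def pe_for_mmio(addr):
    seg = (addr - M32_BASE) // SEG_SIZE
    return segment_to_pe[seg]

assert SEG_SIZE == 8 * 1024 * 1024
assert pe_for_mmio(M32_BASE + 4 * SEG_SIZE) == 7
assert pe_for_mmio(M32_BASE + 5 * SEG_SIZE + 0x1234) == 7
assert pe_for_mmio(M32_BASE) == 0
```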

> >  >
> >  >  VFIO here is basically designed for one and only one thing: expose the
> >  >  entire guest physical address space to the device more/less 1:1.
> >
> >  A single level iommu cannot be exposed to guests.  Well, it can be
> >  exposed as an iommu that does not provide per-device mapping.
>
> Well, x86 ones can't maybe but on POWER we can and must thanks to our
> essentially paravirt model :-) Even if it wasn't and we used trapping
> of accesses to the table, it would work because in practice, even with
> filtering, what we end up having is a per-device (or rather per-PE#
> table).
>
> >  A two level iommu can be emulated and exposed to the guest.  See
> >  http://support.amd.com/us/Processor_TechDocs/48882.pdf for an example.
>
> What you mean by 2-level is two passes through two trees (ie 6 or 8 levels
> right ?).

(16 or 25)

> We don't have that and probably never will. But again, because
> we have a paravirt interface to the iommu, it's less of an issue.

Well, then, I guess we need an additional interface to expose that to 
the guest.

> >  >  This means:
> >  >
> >  >     - It only works with iommu's that provide complete DMA address spaces
> >  >  to devices. Won't work with a single 'segmented' address space like we
> >  >  have on POWER.
> >  >
> >  >     - It requires the guest to be pinned. Pass-through ->   no more swap
> >
> >  Newer iommus (and devices, unfortunately) (will) support I/O page faults
> >  and then the requirement can be removed.
>
> No. -Some- newer devices will. Out of these, a bunch will have so many
> bugs that they're not usable. Some never will. It's a mess really and I
> wouldn't design my stuff based on those premises just yet. Making it
> possible to support it for sure, having it in mind, but not making it
> the foundation on which the whole API is designed.

The API is not designed around pinning.  It's a side effect of how the 
IOMMU works.  If your IOMMU only maps pages which are under active DMA, 
then it would only pin those pages.

But I see what you mean, the API is designed around up-front 
specification of all guest memory.
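Avi's point about up-front specification can be sketched as follows: every guest RAM region is mapped (and thus pinned) 1:1 into the device's IOVA space before the guest runs. All names here are hypothetical; this is an illustrative model, not the actual VFIO code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one guest RAM region and its pinned mapping. */
struct ram_region  { uint64_t gpa; uint64_t hva; uint64_t size; };
struct dma_mapping { uint64_t iova; uint64_t vaddr; uint64_t size; };

/* Map every guest RAM region 1:1 (GPA == IOVA) up front, as the
 * map-ioctl model implies.  Returns the number of mappings created. */
static size_t map_all_guest_memory(const struct ram_region *ram, size_t n,
                                   struct dma_mapping *out)
{
    size_t i;
    for (i = 0; i < n; i++) {
        out[i].iova  = ram[i].gpa;   /* device sees guest-physical addrs */
        out[i].vaddr = ram[i].hva;   /* host pins these pages            */
        out[i].size  = ram[i].size;
    }
    return n;
}
```

The pinning cost follows directly: since mappings cover all of guest RAM rather than pages under active DMA, the whole guest stays resident.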

> >  >     - It doesn't work for POWER server anyways because of our need to
> >  >  provide a paravirt iommu interface to the guest since that's how pHyp
> >  >  works today and how existing OSes expect to operate.
> >
> >  Then you need to provide that same interface, and implement it using the
> >  real iommu.
>
> Yes. Working on it. It's not very practical due to how VFIO interacts in
> terms of APIs but solvable. Eventually, we'll make the iommu Hcalls
> almost entirely real-mode for performance reasons.

The original kvm device assignment code was (and is) part of kvm 
itself.  We're trying to move to vfio to allow sharing with non-kvm 
users, but it does reduce flexibility.  We can have an internal vfio-kvm 
interface to update mappings in real time.

> >  >  - Performance sucks of course, the vfio map ioctl wasn't meant for that
> >  >  and has quite a bit of overhead. However we'll want to do the paravirt
> >  >  call directly in the kernel eventually ...
> >
> >  Does the guest iomap each request?  Why?
>
> Not sure what you mean... the guest calls h-calls for every iommu page
> mapping/unmapping, yes. So the performance of these is critical. So yes,
> we'll eventually do it in kernel. We just haven't yet.

I see.  x86 traditionally doesn't do it for every request.  We had some 
proposals to do a pviommu that does map every request, but none reached 
maturity.

> >
> >  So, you have interrupt redirection?  That is, MSI-x table values encode
> >  the vcpu, not pcpu?
>
> Not exactly. The MSI-X address is a real PCI address to an MSI port and
> the value is a real interrupt number in the PIC.
>
> However, the MSI port filters by RID (using the same matching as PE#) to
> ensure that only allowed devices can write to it, and the PIC has a
> matching PE# information to ensure that only allowed devices can trigger
> the interrupt.
>
> As for the guest knowing what values to put in there (what port address
> and interrupt source numbers to use), this is part of the paravirt APIs.
>
> So the paravirt APIs handles the configuration and the HW ensures that
> the guest cannot do anything else than what it's allowed to.

Okay, this is something that x86 doesn't have.  Strange that it can 
filter DMA at a fine granularity but not MSI, which is practically the 
same thing.

> >
> >  Does the BAR value contain the segment base address?  Or is that added
> >  later?
>
> It's a shared address space. With a basic configuration on p7ioc for
> example we have MMIO going from 3G to 4G (PCI side addresses). BARs
> contain the normal PCI address there. But that 1G is divided in 128
> segments of equal size which can separately be assigned to PE#'s.
>
> So BARs are allocated by firmware or the kernel PCI code so that devices
> in different PEs don't share segments.

Okay, and config space virtualization ensures that the guest can't remap?
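The segmenting quoted above amounts to a simple table lookup. A sketch, assuming the 3G..4G M32 window and 128 segments from the example (names hypothetical, not the actual firmware code):

```c
#include <assert.h>
#include <stdint.h>

#define M32_BASE   0xC0000000ULL           /* 3G, PCI-side, per the example */
#define M32_SIZE   0x40000000ULL           /* 1G window                     */
#define NR_SEGS    128                     /* segments on current bridges   */
#define SEG_SIZE   (M32_SIZE / NR_SEGS)    /* 8M per segment                */

/* Hypothetical remap table: segment index -> PE#.  Large BARs simply get
 * several consecutive segments pointing at the same PE#. */
static uint8_t seg_to_pe[NR_SEGS];

/* Which PE# does a PCI-side MMIO address hit?  -1 if outside the window. */
static int addr_to_pe(uint64_t pci_addr)
{
    if (pci_addr < M32_BASE || pci_addr >= M32_BASE + M32_SIZE)
        return -1;
    return seg_to_pe[(pci_addr - M32_BASE) / SEG_SIZE];
}
```

This is also why BAR allocation matters: firmware only has to ensure devices in different PEs never share a segment, and the lookup above does the rest.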

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-02  9:12       ` Avi Kivity
@ 2011-08-02 12:58         ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 322+ messages in thread
From: Benjamin Herrenschmidt @ 2011-08-02 12:58 UTC (permalink / raw)
  To: Avi Kivity
  Cc: kvm, Anthony Liguori, Alex Williamson, David Gibson,
	Paul Mackerras, Alexey Kardashevskiy, linux-pci, linuxppc-dev

On Tue, 2011-08-02 at 12:12 +0300, Avi Kivity wrote:
> On 08/02/2011 04:27 AM, Benjamin Herrenschmidt wrote:
> > >
> > >  I have a feeling you'll be getting the same capabilities sooner or
> > >  later, or you won't be able to make use of S/R IOV VFs.
> >
> > I'm not sure what you mean. We can do SR/IOV just fine (well, with some
> > limitations due to constraints with how our MMIO segmenting works and
> > indeed some of those are being lifted in our future chipsets but
> > overall, it works).
> 
> Don't those limitations include "all VFs must be assigned to the same 
> guest"?

No, not at all. We put them in different PE# and because the HW is
SR-IOV we know we can trust it to the extent that it won't have nasty
hidden side effects between them. We have 64-bit windows for MMIO that
are also segmented and that we can "resize" to map over the VF BAR
region, the limitations are more about the allowed sizes, number of
segments supported etc...  for these things which can cause us to play
interesting games with the system page size setting to find a good
match.
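The "resize to map over the VF BAR region" idea above can be modelled as picking a window size such that one segment covers exactly one VF BAR, so consecutive VFs land in consecutive PE#s. A sketch under those assumptions (names hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one "accordion" 64-bit window: its total size is
 * adjustable, but it always splits evenly into a fixed number of
 * segments, each landing in a consecutive PE#. */
struct m64_window { uint64_t nr_segments; };

/* To give each VF its own PE#, resize the window so one segment covers
 * exactly one VF BAR.  Returns the required window size, or 0 if the
 * BAR size isn't a power of two (VF BARs always are, per SR-IOV). */
static uint64_t m64_size_for_vf_bar(const struct m64_window *w,
                                    uint64_t vf_bar_size)
{
    if (vf_bar_size == 0 || (vf_bar_size & (vf_bar_size - 1)))
        return 0;
    return vf_bar_size * w->nr_segments;
}
```

The "interesting games with the system page size" then show up as an extra constraint: the resulting segment size also has to be something the iommu and the host page size can accommodate.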

> PCI on x86 has function granularity, SRIOV reduces this to VF 
> granularity, but I thought power has partition or group granularity 
> which is much coarser?

The granularity of a "Group" really depends on what the HW is like. On
pure PCIe SR-IOV we can go down to function granularity.

In fact I currently go down to function granularity on anything pure
PCIe as well, though as I explained earlier, that's a bit chancy since
some adapters -will- allow to create side effects such as side band
access to config space.

pHyp doesn't allow that granularity as far as I can tell, one slot is
always fully assigned to a PE.

However, we might have resource constraints as in reaching max number of
segments or iommu regions that may force us to group a bit more coarsely
under some circumstances.

The main point is that the grouping is pre-existing, so an API designed
around the idea of: 1- create domain, 2- add random devices to it, 3-
use it, won't work for us very well :-)

Since the grouping implies the sharing of iommus, from a VFIO point of
view it really matches well with the idea of having the domains
pre-existing.

That's why I think a good fit is to have a static representation of the
grouping, with tools for creating/manipulating the groups (or
domains) on archs that allow this sort of manipulation, separately
from qemu/libvirt, avoiding those "on the fly" groups whose lifetime is
tied to an instance of a file descriptor.
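One way to picture such a static representation: a fixed table, populated by firmware/platform code, binding each device to its pre-existing group, which user space can only consume whole. A hypothetical sketch (the device names and table are invented for illustration):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical static group table: each device is bound to a
 * pre-existing group number; user space never regroups devices. */
struct pe_entry { const char *dev; int group; };

static const struct pe_entry pe_table[] = {
    { "0000:01:00.0", 1 },   /* PCIe device alone in its PE            */
    { "0000:02:04.0", 2 },   /* EHCI/OHCI combo behind a PCIe->PCI ... */
    { "0000:02:04.1", 2 },   /* ... bridge: one PE, both functions     */
};

/* Look up the fixed group of a device; -1 if unknown. */
static int group_of(const char *dev)
{
    size_t i;
    for (i = 0; i < sizeof(pe_table) / sizeof(pe_table[0]); i++)
        if (!strcmp(pe_table[i].dev, dev))
            return pe_table[i].group;
    return -1;
}
```

The contrast with the create-domain-then-add-devices model is that here the table is read-only from the consumer's point of view: a guest gets whole groups, never individual members.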

> > In -theory-, one could do the grouping dynamically with some kind of API
> > for us as well. However the constraints are such that it's not
> > practical. Filtering on RID is based on number of bits to match in the
> > bus number and whether to match the dev and fn. So it's not arbitrary
> > (but works fine for SR-IOV).
> >
> > The MMIO segmentation is a bit special too. There is a single MMIO
> > region in 32-bit space (size is configurable but that's not very
> > practical so for now we stick it to 1G) which is evenly divided into N
> > segments (where N is the number of PE# supported by the host bridge,
> > typically 128 with the current bridges).
> >
> > Each segment goes through a remapping table to select the actual PE# (so
> > large BARs use consecutive segments mapped to the same PE#).
> >
> > For SR-IOV we plan to not use the M32 region. We also have 64-bit MMIO
> > regions which act as some kind of "accordions", they are evenly divided
> > into segments in different PE# and there's several of them which we can
> > "move around" and typically use to map VF BARs.
> 
> So, SRIOV VFs *don't* have the group limitation?  Sorry, I'm deluged by 
> technical details with no ppc background to put them to, I can't say I'm 
> making any sense of this.

:-)

Don't worry, it took me a while to get my head around the HW :-) SR-IOV
VFs will generally not have limitations like that, no, but on the other
hand, they -will- still require 1 VF = 1 group, ie, you won't be able to
take a bunch of VFs and put them in the same 'domain'.

I think the main deal is that VFIO/qemu sees "domains" as "guests" and
tries to put all devices for a given guest into a "domain".

On POWER, we have a different view of things where domains/groups are
defined to be the smallest granularity we can (down to a single VF) and
we give several groups to a guest (ie we avoid sharing the iommu in most
cases).

This is driven by the HW design but that design is itself driven by the
idea that the domains/group are also error isolation groups and we don't
want to take all of the IOs of a guest down if one adapter in that guest
is having an error.

The x86 domains are conceptually different as they are about sharing the
iommu page tables, with the clear long term intent of then sharing those
page tables with the guest CPU's own. We aren't going in that direction
(at this point at least) on POWER.

> > >  >  VFIO here is basically designed for one and only one thing: expose the
> > >  >  entire guest physical address space to the device more/less 1:1.
> > >
> > >  A single level iommu cannot be exposed to guests.  Well, it can be
> > >  exposed as an iommu that does not provide per-device mapping.
> >
> > Well, x86 ones can't maybe but on POWER we can and must thanks to our
> > essentially paravirt model :-) Even if it wasn't and we used trapping
> > of accesses to the table, it would work because in practice, even with
> > filtering, what we end up having is a per-device (or rather per-PE#
> > table).
> >
> > >  A two level iommu can be emulated and exposed to the guest.  See
> > >  http://support.amd.com/us/Processor_TechDocs/48882.pdf for an example.
> >
> > What you mean by 2-level is two passes through two trees (ie 6 or 8
> > levels, right?).
> 
> (16 or 25)

25 levels? You mean 25 loads to get to a translation? And you get any
kind of performance out of that? :-)

> > We don't have that and probably never will. But again, because
> > we have a paravirt interface to the iommu, it's less of an issue.
> 
> Well, then, I guess we need an additional interface to expose that to 
> the guest.
> 
> > >  >  This means:
> > >  >
> > >  >     - It only works with iommu's that provide complete DMA address spaces
> > >  >  to devices. Won't work with a single 'segmented' address space like we
> > >  >  have on POWER.
> > >  >
> > >  >     - It requires the guest to be pinned. Pass-through ->   no more swap
> > >
> > >  Newer iommus (and devices, unfortunately) (will) support I/O page faults
> > >  and then the requirement can be removed.
> >
> > No. -Some- newer devices will. Out of these, a bunch will have so many
> > bugs that they're not usable. Some never will. It's a mess really and I
> > wouldn't design my stuff based on those premises just yet. Making it
> > possible to support it for sure, having it in mind, but not making it
> > the foundation on which the whole API is designed.
> 
> The API is not designed around pinning.  It's a side effect of how the 
> IOMMU works.  If your IOMMU only maps pages which are under active DMA, 
> then it would only pin those pages.
> 
> But I see what you mean, the API is designed around up-front 
> specification of all guest memory.

Right :-)

> > >  >     - It doesn't work for POWER server anyways because of our need to
> > >  >  provide a paravirt iommu interface to the guest since that's how pHyp
> > >  >  works today and how existing OSes expect to operate.
> > >
> > >  Then you need to provide that same interface, and implement it using the
> > >  real iommu.
> >
> > Yes. Working on it. It's not very practical due to how VFIO interacts in
> > terms of APIs but solvable. Eventually, we'll make the iommu Hcalls
> > almost entirely real-mode for performance reasons.
> 
> The original kvm device assignment code was (and is) part of kvm 
> itself.  We're trying to move to vfio to allow sharing with non-kvm 
> users, but it does reduce flexibility.  We can have an internal vfio-kvm 
> interface to update mappings in real time.
> 
> > >  >  - Performance sucks of course, the vfio map ioctl wasn't meant for that
> > >  >  and has quite a bit of overhead. However we'll want to do the paravirt
> > >  >  call directly in the kernel eventually ...
> > >
> > >  Does the guest iomap each request?  Why?
> >
> > Not sure what you mean... the guest calls h-calls for every iommu page
> > mapping/unmapping, yes. So the performance of these is critical. So yes,
> > we'll eventually do it in kernel. We just haven't yet.
> 
> I see.  x86 traditionally doesn't do it for every request.  We had some 
> proposals to do a pviommu that does map every request, but none reached 
> maturity.

It's quite performance critical, you don't want to go anywhere near a
full exit. On POWER we plan to handle that in "real mode" (ie MMU off)
straight off the interrupt handlers, with the CPU still basically
operating in guest context with HV permission. That is basically do the
permission check, translation and whack the HW iommu immediately. If for
some reason one step fails (!present PTE or something like that), we'd
then fall back to an exit to Linux to handle it in a more "common"
environment where we can handle page faults etc...
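The fast path just described (check the guest translation, update the HW table immediately, otherwise punt to a full exit) might look roughly like this; the names, the flat tables, and the H_FALLBACK code are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

#define H_SUCCESS   0
#define H_FALLBACK  (-1)   /* hypothetical: punt to the full Linux exit */

#define TCE_PRESENT 1ULL
#define NR_TCES     64

static uint64_t guest_pte[NR_TCES];   /* stand-in guest translation     */
static uint64_t hw_tce[NR_TCES];      /* stand-in HW iommu (TCE) table  */

/* Sketch of the real-mode fast path: if the guest-side check passes,
 * whack the HW TCE entry immediately; on a !present entry (or anything
 * else the fast path can't handle), fall back to the slow exit where
 * page faults etc. can be handled. */
static int h_put_tce_realmode(uint64_t index, uint64_t gpa)
{
    if (index >= NR_TCES)
        return H_FALLBACK;
    if (!(guest_pte[index] & TCE_PRESENT))   /* !present: punt          */
        return H_FALLBACK;
    hw_tce[index] = gpa | TCE_PRESENT;       /* update HW immediately   */
    return H_SUCCESS;
}
```

The payoff is that the common case never leaves real mode: only the H_FALLBACK path pays for a full exit.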

> > >  So, you have interrupt redirection?  That is, MSI-x table values encode
> > >  the vcpu, not pcpu?
> >
> > Not exactly. The MSI-X address is a real PCI address to an MSI port and
> > the value is a real interrupt number in the PIC.
> >
> > However, the MSI port filters by RID (using the same matching as PE#) to
> > ensure that only allowed devices can write to it, and the PIC has a
> > matching PE# information to ensure that only allowed devices can trigger
> > the interrupt.
> >
> > As for the guest knowing what values to put in there (what port address
> > and interrupt source numbers to use), this is part of the paravirt APIs.
> >
> > So the paravirt APIs handles the configuration and the HW ensures that
> > the guest cannot do anything else than what it's allowed to.
> 
> Okay, this is something that x86 doesn't have.  Strange that it can 
> filter DMA at a fine granularity but not MSI, which is practically the 
> same thing.

I wouldn't be surprised if it's actually a quite different path in HW.
There's usually some magic decoding based on the top bits that decides
it's an MSI, and from there it goes completely elsewhere in the bridge.

> > >  Does the BAR value contain the segment base address?  Or is that added
> > >  later?
> >
> > It's a shared address space. With a basic configuration on p7ioc for
> > example we have MMIO going from 3G to 4G (PCI side addresses). BARs
> > contain the normal PCI address there. But that 1G is divided in 128
> > segments of equal size which can separately be assigned to PE#'s.
> >
> > So BARs are allocated by firmware or the kernel PCI code so that devices
> > in different PEs don't share segments.
> 
> Okay, and config space virtualization ensures that the guest can't remap?

Well, so it depends :-)

With KVM we currently use whatever config space virtualization you do
and so we somewhat rely on this, but it's not very foolproof.

I believe pHyp doesn't even bother filtering config space. As I said in
another note, you can't trust adapters anyway. Plenty of them (video
cards come to mind) have ways to get to their own config space via MMIO
registers for example.

So what pHyp does is that it always creates PEs (aka groups) that are
below a bridge. With PCIe, everything mostly is below a bridge so that's
easy, but that does mean that you always have all functions of a device
in the same PE (and thus in the same partition). SR-IOV is an exception
to this rule since in that case the HW is designed to be trusted.

That way, being behind a bridge, the bridge windows are going to define
what can be forwarded to the device, and thus the system is immune to
the guest putting crap into the BARs. It can't be remapped to overlap a
neighbouring device.
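The containment argument above is just an interval check: a BAR the guest programs is only reachable if it sits entirely inside the bridge window, so garbage values simply never see traffic. A minimal sketch (names hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* The bridge only forwards MMIO that falls inside its window, so a BAR
 * programmed outside [base, limit] never receives any traffic and
 * cannot overlap a neighbouring device's segment. */
struct bridge_window { uint64_t base, limit; };   /* inclusive limit */

static int bar_is_reachable(const struct bridge_window *w,
                            uint64_t bar_base, uint64_t bar_size)
{
    uint64_t bar_end = bar_base + bar_size - 1;
    return bar_base >= w->base && bar_end <= w->limit;
}
```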

Note that the bridge itself isn't visible to the guest, so yes, config
space is -somewhat- virtualized; typically pHyp makes every pass-through
PE look like a separate PCI host bridge with the devices below it.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
@ 2011-08-02 12:58         ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 322+ messages in thread
From: Benjamin Herrenschmidt @ 2011-08-02 12:58 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	David Gibson, Alex Williamson, Anthony Liguori, linuxppc-dev

On Tue, 2011-08-02 at 12:12 +0300, Avi Kivity wrote:
> On 08/02/2011 04:27 AM, Benjamin Herrenschmidt wrote:
> > >
> > >  I have a feeling you'll be getting the same capabilities sooner or
> > >  later, or you won't be able to make use of S/R IOV VFs.
> >
> > I'm not sure why you mean. We can do SR/IOV just fine (well, with some
> > limitations due to constraints with how our MMIO segmenting works and
> > indeed some of those are being lifted in our future chipsets but
> > overall, it works).
> 
> Don't those limitations include "all VFs must be assigned to the same 
> guest"?

No, not at all. We put them in different PE# and because the HW is
SR-IOV we know we can trust it to the extent that it won't have nasty
hidden side effects between them. We have 64-bit windows for MMIO that
are also segmented and that we can "resize" to map over the VF BAR
region, the limitations are more about the allowed sizes, number of
segments supported etc...  for these things which can cause us to play
interesting games with the system page size setting to find a good
match.

> PCI on x86 has function granularity, SRIOV reduces this to VF 
> granularity, but I thought power has partition or group granularity 
> which is much coarser?

The granularity of a "Group" really depends on what the HW is like. On
pure PCIe SR-IOV we can go down to function granularity.

In fact I currently go down to function granularity on anything pure
PCIe as well, though as I explained earlier, that's a bit chancy since
some adapters -will- allow to create side effects such as side band
access to config space.

pHyp doesn't allow that granularity as far as I can tell, one slot is
always fully assigned to a PE.

However, we might have resource constraints as in reaching max number of
segments or iommu regions that may force us to group a bit more coarsly
under some circumstances.

The main point is that the grouping is pre-existing, so an API designed
around the idea of: 1- create domain, 2- add random devices to it, 3-
use it, won't work for us very well :-)

Since the grouping implies the sharing of iommu's, from a VFIO point of
view is really matches well with the idea of having the domains
pre-existing.

That's why I think a good fit is to have a static representation of the
grouping, with tools allowing to create/manipulate the groups (or
domains) for archs that allow this sort of manipulations, separately
from qemu/libvirt, avoiding those "on the fly" groups whose lifetime is
tied to an instance of a file descriptor.

> > In -theory-, one could do the grouping dynamically with some kind of API
> > for us as well. However the constraints are such that it's not
> > practical. Filtering on RID is based on number of bits to match in the
> > bus number and whether to match the dev and fn. So it's not arbitrary
> > (but works fine for SR-IOV).
> >
> > The MMIO segmentation is a bit special too. There is a single MMIO
> > region in 32-bit space (size is configurable but that's not very
> > practical so for now we stick it to 1G) which is evenly divided into N
> > segments (where N is the number of PE# supported by the host bridge,
> > typically 128 with the current bridges).
> >
> > Each segment goes through a remapping table to select the actual PE# (so
> > large BARs use consecutive segments mapped to the same PE#).
> >
> > For SR-IOV we plan to not use the M32 region. We also have 64-bit MMIO
> > regions which act as some kind of "accordions", they are evenly divided
> > into segments in different PE# and there's several of them which we can
> > "move around" and typically use to map VF BARs.
> 
> So, SRIOV VFs *don't* have the group limitation?  Sorry, I'm deluged by 
> technical details with no ppc background to put them to, I can't say I'm 
> making any sense of this.

:-)

Don't worry, it took me a while to get my head around the HW :-) SR-IOV
VFs will generally not have limitations like that no, but on the other
hand, they -will- still require 1 VF = 1 group, ie, you won't be able to
take a bunch of VFs and put them in the same 'domain'.

I think the main deal is that VFIO/qemu sees "domains" as "guests" and
tries to put all devices for a given guest into a "domain".

On POWER, we have a different view of things were domains/groups are
defined to be the smallest granularity we can (down to a single VF) and
we give several groups to a guest (ie we avoid sharing the iommu in most
cases)

This is driven by the HW design but that design is itself driven by the
idea that the domains/group are also error isolation groups and we don't
want to take all of the IOs of a guest down if one adapter in that guest
is having an error.

The x86 domains are conceptually different as they are about sharing the
iommu page tables with the clear long term intent of then sharing those
page tables with the guest CPU own. We aren't going in that direction
(at this point at least) on POWER..

> > >  >  VFIO here is basically designed for one and only one thing: expose the
> > >  >  entire guest physical address space to the device more/less 1:1.
> > >
> > >  A single level iommu cannot be exposed to guests.  Well, it can be
> > >  exposed as an iommu that does not provide per-device mapping.
> >
> > Well, x86 ones can't maybe but on POWER we can and must thanks to our
> > essentially paravirt model :-) Even if it' wasn't and we used trapping
> > of accesses to the table, it would work because in practice, even with
> > filtering, what we end up having is a per-device (or rather per-PE#
> > table).
> >
> > >  A two level iommu can be emulated and exposed to the guest.  See
> > >  http://support.amd.com/us/Processor_TechDocs/48882.pdf for an example.
> >
> > What you mean 2-level is two passes through two trees (ie 6 or 8 levels
> > right ?).
> 
> (16 or 25)

25 levels ? You mean 25 loads to get to a translation ? And you get any
kind of performance out of that ? :-)

> > We don't have that and probably never will. But again, because
> > we have a paravirt interface to the iommu, it's less of an issue.
> 
> Well, then, I guess we need an additional interface to expose that to 
> the guest.
> 
> > >  >  This means:
> > >  >
> > >  >     - It only works with iommu's that provide complete DMA address spaces
> > >  >  to devices. Won't work with a single 'segmented' address space like we
> > >  >  have on POWER.
> > >  >
> > >  >     - It requires the guest to be pinned. Pass-through ->   no more swap
> > >
> > >  Newer iommus (and devices, unfortunately) (will) support I/O page faults
> > >  and then the requirement can be removed.
> >
> > No. -Some- newer devices will. Out of these, a bunch will have so many
> > bugs in it it's not usable. Some never will. It's a mess really and I
> > wouldn't design my stuff based on those premises just yet. Making it
> > possible to support it for sure, having it in mind, but not making it
> > the fundation on which the whole API is designed.
> 
> The API is not designed around pinning.  It's a side effect of how the 
> IOMMU works.  If your IOMMU only maps pages which are under active DMA, 
> then it would only pin those pages.
> 
> But I see what you mean, the API is designed around up-front 
> specification of all guest memory.

Right :-)

> > >  >     - It doesn't work for POWER server anyways because of our need to
> > >  >  provide a paravirt iommu interface to the guest since that's how pHyp
> > >  >  works today and how existing OSes expect to operate.
> > >
> > >  Then you need to provide that same interface, and implement it using the
> > >  real iommu.
> >
> > Yes. Working on it. It's not very practical due to how VFIO interacts in
> > terms of APIs but solvable. Eventually, we'll make the iommu Hcalls
> > almost entirely real-mode for performance reasons.
> 
> The original kvm device assignment code was (and is) part of kvm 
> itself.  We're trying to move to vfio to allow sharing with non-kvm 
> users, but it does reduce flexibility.  We can have an internal vfio-kvm 
> interface to update mappings in real time.
> 
> > >  >  - Performance sucks of course, the vfio map ioctl wasn't mean for that
> > >  >  and has quite a bit of overhead. However we'll want to do the paravirt
> > >  >  call directly in the kernel eventually ...
> > >
> > >  Does the guest iomap each request?  Why?
> >
> > Not sure what you mean... the guest calls h-calls for every iommu page
> > mapping/unmapping, yes. So the performance of these is critical. So yes,
> > we'll eventually do it in kernel. We just haven't yet.
> 
> I see.  x86 traditionally doesn't do it for every request.  We had some 
> proposals to do a pviommu that does map every request, but none reached 
> maturity.

It's quite performance critical, you don't want to go anywhere near a
full exit. On POWER we plan to handle that in "real mode" (ie MMU off)
straight off the interrupt handlers, with the CPU still basically
operating in guest context with HV permission. That is basically do the
permission check, translation and whack the HW iommu immediately. If for
some reason one step fails (!present PTE or something like that), we'd
then fall back to an exit to Linux to handle it in a more "common"
environment where we can handle page faults etc...
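The fast path Ben describes (permission check, translate, write the HW iommu entry, and bail out to a full exit on any miss) could be sketched roughly as below. All the names and structures here are hypothetical illustrations, not the actual KVM/POWER code; the real real-mode handler also cannot take locks or faults, which is why anything unexpected punts to the slow path:

```c
#include <stdint.h>
#include <stddef.h>

#define TCE_READ   0x1u
#define TCE_WRITE  0x2u
#define H_SUCCESS  0
#define H_TOO_HARD (-1)  /* fall back to a full exit into Linux */

/* Hypothetical per-guest state: a pinned guest-physical -> host-physical
 * map, plus the hardware TCE (iommu) table we shadow entries into. */
struct guest_iommu {
    uint64_t *gpa_to_hpa;    /* 0 means page not present/pinned */
    uint64_t *hw_tce_table;  /* hardware iommu entries */
    size_t    npages;
};

/* Real-mode H_PUT_TCE sketch: permission check, translate, whack the HW
 * iommu immediately.  Anything unexpected (out of range, !present page)
 * returns H_TOO_HARD so the caller exits to the host for the slow path. */
static long h_put_tce(struct guest_iommu *gi, uint64_t ioba, uint64_t tce)
{
    uint64_t idx  = ioba >> 12;          /* 4K iommu page index */
    uint64_t gpfn = tce >> 12;           /* guest page frame to map */
    uint64_t perm = tce & (TCE_READ | TCE_WRITE);

    if (idx >= gi->npages || gpfn >= gi->npages)
        return H_TOO_HARD;               /* out of range: let Linux decide */

    uint64_t hpa = gi->gpa_to_hpa[gpfn];
    if (!hpa)
        return H_TOO_HARD;               /* !present: needs the slow path */

    gi->hw_tce_table[idx] = hpa | perm;  /* write the translated entry */
    return H_SUCCESS;
}
```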

> > >  So, you have interrupt redirection?  That is, MSI-x table values encode
> > >  the vcpu, not pcpu?
> >
> > Not exactly. The MSI-X address is a real PCI address to an MSI port and
> > the value is a real interrupt number in the PIC.
> >
> > However, the MSI port filters by RID (using the same matching as PE#) to
> > ensure that only allowed devices can write to it, and the PIC has a
> > matching PE# information to ensure that only allowed devices can trigger
> > the interrupt.
> >
> > As for the guest knowing what values to put in there (what port address
> > and interrupt source numbers to use), this is part of the paravirt APIs.
> >
> > So the paravirt APIs handles the configuration and the HW ensures that
> > the guest cannot do anything else than what it's allowed to.
> 
> Okay, this is something that x86 doesn't have.  Strange that it can 
> filter DMA at a fine granularity but not MSI, which is practically the 
> same thing.

I wouldn't be surprised if it's actually a quite different path in HW.
There's some magic decoding based on top bits usually that decides it's
an MSI and it goes completely elsewhere from there in the bridge. 

> > >  Does the BAR value contain the segment base address?  Or is that added
> > >  later?
> >
> > It's a shared address space. With a basic configuration on p7ioc for
> > example we have MMIO going from 3G to 4G (PCI side addresses). BARs
> > contain the normal PCI address there. But that 1G is divided in 128
> > segments of equal size which can separately be assigned to PE#'s.
> >
> > So BARs are allocated by firmware or the kernel PCI code so that devices
> > in different PEs don't share segments.
> 
> Okay, and config space virtualization ensures that the guest can't remap?

Well, so it depends :-)

With KVM we currently use whatever config space virtualization you do
and so we somewhat rely on this, but it's not very foolproof.

I believe pHyp doesn't even bother filtering config space. As I said in
another note, you can't trust adapters anyway. Plenty of them (video
cards come to mind) have ways to get to their own config space via MMIO
registers for example.

So what pHyp does is that it always creates PE's (aka groups) that are
below a bridge. With PCIe, everything mostly is below a bridge so that's
easy, but that does mean that you always have all functions of a device
in the same PE (and thus in the same partition). SR-IOV is an exception
to this rule since in that case the HW is designed to be trusted.

That way, being behind a bridge, the bridge windows are going to define
what can be forwarded to the device, and thus the system is immune to
the guest putting crap into the BARs. It can't be remapped to overlap a
neighbouring device.

Note that the bridge itself isn't visible to the guest, so yes, config
space is -somewhat- virtualized; typically pHyp makes every pass-through
PE look like a separate PCI host bridge with the devices below it.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-02 12:58         ` Benjamin Herrenschmidt
@ 2011-08-02 13:39           ` Avi Kivity
  -1 siblings, 0 replies; 322+ messages in thread
From: Avi Kivity @ 2011-08-02 13:39 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: kvm, Anthony Liguori, Alex Williamson, David Gibson,
	Paul Mackerras, Alexey Kardashevskiy, linux-pci, linuxppc-dev

On 08/02/2011 03:58 PM, Benjamin Herrenschmidt wrote:
> >  >
> >  >  What you mean 2-level is two passes through two trees (ie 6 or 8 levels
> >  >  right ?).
> >
> >  (16 or 25)
>
> 25 levels ? You mean 25 loads to get to a translation ? And you get any
> kind of performance out of that ? :-)
>

Aggressive partial translation caching.  Even then, performance does 
suffer on memory intensive workloads.  The fix was transparent 
hugepages; that makes the page table walks much faster since they're 
fully cached, the partial translation caches become more effective, and 
the tlb itself becomes more effective.  On some workloads, THP on both 
guest and host was faster than no-THP on bare metal.

> >  >
> >  >  Not sure what you mean... the guest calls h-calls for every iommu page
> >  >  mapping/unmapping, yes. So the performance of these is critical. So yes,
> >  >  we'll eventually do it in kernel. We just haven't yet.
> >
> >  I see.  x86 traditionally doesn't do it for every request.  We had some
> >  proposals to do a pviommu that does map every request, but none reached
> >  maturity.
>
> It's quite performance critical, you don't want to go anywhere near a
> full exit. On POWER we plan to handle that in "real mode" (ie MMU off)
> straight off the interrupt handlers, with the CPU still basically
> operating in guest context with HV permission. That is basically do the
> permission check, translation and whack the HW iommu immediately. If for
> some reason one step fails (!present PTE or something like that), we'd
> then fall back to an exit to Linux to handle it in a more "common"
> environment where we can handle page faults etc...

I guess we can hack some kind of private interface, though I'd hoped to 
avoid it (and so far we succeeded - we can even get vfio to inject 
interrupts into kvm from the kernel without either knowing anything 
about the other).

> >  >  >   Does the BAR value contain the segment base address?  Or is that added
> >  >  >   later?
> >  >
> >  >  It's a shared address space. With a basic configuration on p7ioc for
> >  >  example we have MMIO going from 3G to 4G (PCI side addresses). BARs
> >  >  contain the normal PCI address there. But that 1G is divided in 128
> >  >  segments of equal size which can separately be assigned to PE#'s.
> >  >
> >  >  So BARs are allocated by firmware or the kernel PCI code so that devices
> >  >  in different PEs don't share segments.
> >
> >  Okay, and config space virtualization ensures that the guest can't remap?
>
> Well, so it depends :-)
>
> With KVM we currently use whatever config space virtualization you do
> and so we somewhat rely on this, but it's not very foolproof.
>
> I believe pHyp doesn't even bother filtering config space. As I said in
> another note, you can't trust adapters anyway. Plenty of them (video
> cards come to mind) have ways to get to their own config space via MMIO
> registers for example.

Yes, we've seen that.

> So what pHyp does is that it always creates PE's (aka groups) that are
> below a bridge. With PCIe, everything mostly is below a bridge so that's
> easy, but that does mean that you always have all functions of a device
> in the same PE (and thus in the same partition). SR-IOV is an exception
> to this rule since in that case the HW is designed to be trusted.
>
> That way, being behind a bridge, the bridge windows are going to define
> what can be forwarded to the device, and thus the system is immune to
> the guest putting crap into the BARs. It can't be remapped to overlap a
> neighbouring device.
>
> Note that the bridge itself isn't visible to the guest, so yes, config
> space is -somewhat- virtualized; typically pHyp makes every pass-through
> PE look like a separate PCI host bridge with the devices below it.

I think I see, yes.

-- 
error compiling committee.c: too many arguments to function


* Re: kvm PCI assignment & VFIO ramblings
  2011-08-02  1:27     ` Benjamin Herrenschmidt
@ 2011-08-02 14:39       ` Alex Williamson
  -1 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-08-02 14:39 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Avi Kivity, kvm, Anthony Liguori, David Gibson, Paul Mackerras,
	Alexey Kardashevskiy, linux-pci, linuxppc-dev

On Tue, 2011-08-02 at 11:27 +1000, Benjamin Herrenschmidt wrote:
> It's a shared address space. With a basic configuration on p7ioc for
> example we have MMIO going from 3G to 4G (PCI side addresses). BARs
> contain the normal PCI address there. But that 1G is divided in 128
> segments of equal size which can separately be assigned to PE#'s.
> 
> So BARs are allocated by firmware or the kernel PCI code so that devices
> in different PEs don't share segments.
> 
> Of course there's always the risk that a device can be hacked via a
> sideband access to BARs to move out of its allocated segment. That
> means that the guest owning that device won't be able to access it
> anymore and can potentially disturb a guest or host owning whatever is
> in that other segment.

Wait, what?  I thought the MMIO segments were specifically so that if
the device BARs moved out of the segment the guest only hurts itself and
not the new segments overlapped.

> The only way to enforce isolation here is to ensure that PE# are
> entirely behind P2P bridges, since those would then ensure that even if
> you put crap into your BARs you won't be able to walk over a neighbour.

Ok, so the MMIO segments are really just a configuration nuance of the
platform and being behind a P2P bridge is what allows you to hand off
BARs to a guest (which needs to know the bridge window to do anything
useful with them).  Is that right?
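The segment arithmetic Ben describes can be made concrete with a small sketch, using the p7ioc example numbers from earlier in the thread (a 1G MMIO window at PCI address 3G, split into 128 equal segments, i.e. 8MB each); the helper name and layout here are made up for illustration:

```c
#include <stdint.h>

#define MMIO_BASE 0xC0000000ull           /* 3G, PCI-side address */
#define MMIO_SIZE 0x40000000ull           /* 1G window */
#define NUM_SEGS  128
#define SEG_SIZE  (MMIO_SIZE / NUM_SEGS)  /* 8MB per segment */

/* Which MMIO segment a PCI-side address falls in, or -1 if it lies
 * outside the window entirely. */
static int mmio_segment(uint64_t pci_addr)
{
    if (pci_addr < MMIO_BASE || pci_addr >= MMIO_BASE + MMIO_SIZE)
        return -1;
    return (int)((pci_addr - MMIO_BASE) / SEG_SIZE);
}
```

Each of the 128 segments can then be assigned to a PE#, with firmware allocating BARs so that no two PEs ever share a segment.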

> I believe pHyp enforces that, for example, if you have a slot, all
> devices & functions behind that slot pertain to the same PE# under pHyp.
> 
> That means you cannot put individual functions of a device into
> different PE# with pHyp.
> 
> We plan to be a bit less restrictive here for KVM, assuming that if you
> use a device that allows such a back-channel to the BARs, then it's your
> problem to not trust such a device for virtualization. And most of the
> time, you -will- have a P2P to protect you anyways.
> 
> The problem doesn't exist (or is assumed as non-existing) for SR-IOV
> since in that case, the VFs are meant to be virtualized, so pHyp assumes
> there is no such back-channel and it can trust them to be in different
> PE#.

But you still need the P2P bridge to protect MMIO segments?  Or do
SR-IOV BARs need to be virtualized?  I'm having trouble with the mental
model of how you can do both.  Thanks,

Alex


* Re: kvm PCI assignment & VFIO ramblings
  2011-08-02 12:58         ` Benjamin Herrenschmidt
@ 2011-08-02 15:34           ` Alex Williamson
  -1 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-08-02 15:34 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Avi Kivity, kvm, Anthony Liguori, David Gibson, Paul Mackerras,
	Alexey Kardashevskiy, linux-pci, linuxppc-dev

On Tue, 2011-08-02 at 22:58 +1000, Benjamin Herrenschmidt wrote:
> 
> Don't worry, it took me a while to get my head around the HW :-) SR-IOV
> VFs will generally not have limitations like that no, but on the other
> hand, they -will- still require 1 VF = 1 group, ie, you won't be able to
> take a bunch of VFs and put them in the same 'domain'.
> 
> I think the main deal is that VFIO/qemu sees "domains" as "guests" and
> tries to put all devices for a given guest into a "domain".

Actually, that's only a recent optimization; before that, each device got
its own iommu domain.  It's actually completely configurable on the
qemu command line which devices get their own iommu and which share.
The default optimizes the number of domains (one) and thus the number of
mapping callbacks since we pin the entire guest.

> On POWER, we have a different view of things where domains/groups are
> defined to be the smallest granularity we can (down to a single VF) and
> we give several groups to a guest (ie we avoid sharing the iommu in most
> cases)
> 
> This is driven by the HW design but that design is itself driven by the
> idea that the domains/group are also error isolation groups and we don't
> want to take all of the IOs of a guest down if one adapter in that guest
> is having an error.
> 
> The x86 domains are conceptually different as they are about sharing the
> iommu page tables with the clear long term intent of then sharing those
> page tables with the guest CPU's own. We aren't going in that direction
> (at this point at least) on POWER..

Yes and no.  The x86 domains are pretty flexible and used a few
different ways.  On the host we do dynamic DMA with a domain per device,
mapping only the inflight DMA ranges.  In order to achieve the
transparent device assignment model, we have to flip that around and map
the entire guest.  As noted, we can continue to use separate domains for
this, but since each maps the entire guest, it doesn't add a lot of
value and uses more resources and requires more mapping callbacks (and
x86 doesn't have the best error containment anyway).  If we had a well
supported IOMMU model that we could adapt for pvDMA, then it would make
sense to keep each device in its own domain again.  Thanks,

Alex



* Re: kvm PCI assignment & VFIO ramblings
  2011-08-02  8:28     ` David Gibson
@ 2011-08-02 18:14       ` Alex Williamson
  -1 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-08-02 18:14 UTC (permalink / raw)
  To: David Gibson
  Cc: Benjamin Herrenschmidt, chrisw, Alexey Kardashevskiy, kvm,
	Paul Mackerras, linux-pci, qemu-devel, aafabbri, iommu,
	Anthony Liguori, linuxppc-dev, benve

On Tue, 2011-08-02 at 18:28 +1000, David Gibson wrote:
> On Sat, Jul 30, 2011 at 12:20:08PM -0600, Alex Williamson wrote:
> > On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote:
> [snip]
> > On x86, the USB controllers don't typically live behind a PCIe-to-PCI
> > bridge, so don't suffer the source identifier problem, but they do often
> > share an interrupt.  But even then, we can count on most modern devices
> > supporting PCI2.3, and thus the DisINTx feature, which allows us to
> > share interrupts.  In any case, yes, it's more rare but we need to know
> > how to handle devices behind PCI bridges.  However I disagree that we
> > need to assign all the devices behind such a bridge to the guest.
> > There's a difference between removing the device from the host and
> > exposing the device to the guest.
> 
> I think you're arguing only over details of what words to use for
> what, rather than anything of substance here.  The point is that an
> entire partitionable group must be assigned to "host" (in which case
> kernel drivers may bind to it) or to a particular guest partition (or
> at least to a single UID on the host).  Which of the assigned devices
> the partition actually uses is another matter of course, as is at
> exactly which level they become "de-exposed" if you don't want to use
> all of them.

Well first we need to define what a partitionable group is, whether it's
based on hardware requirements or user policy.  And while I agree that
we need unique ownership of a partition, I disagree that qemu is
necessarily the owner of the entire partition vs individual devices.
But feel free to dismiss it as unsubstantial.

> [snip]
> > > Maybe something like /sys/devgroups ? This probably warrants involving
> > > more kernel people into the discussion.
> > 
> > I don't yet buy into passing groups to qemu since I don't buy into the
> > idea of always exposing all of those devices to qemu.  Would it be
> > sufficient to expose iommu nodes in sysfs that link to the devices
> > behind them and describe properties and capabilities of the iommu
> > itself?  More on this at the end.
> 
> Again, I don't think you're making a distinction of any substance.
> Ben is saying the group as a whole must be set to allow partition
> access, whether or not you call that "assigning".  There's no reason
> that passing a sysfs descriptor to qemu couldn't be the qemu
> developer's quick-and-dirty method of putting the devices in, while
> also allowing full assignment of the devices within the groups by
> libvirt.

Well, there is a reason for not passing a sysfs descriptor to qemu if
qemu isn't the one defining the policy about how the members of that
group are exposed.  I tend to envision a userspace entity defining
policy and granting devices to qemu.  Do we really want separate
developer vs production interfaces?

> [snip]
> > > Now some of this can be fixed with tweaks, and we've started doing it
> > > (we have a working pass-through using VFIO, forgot to mention that, it's
> > > just that we don't like what we had to do to get there).
> > 
> > This is a result of wanting to support *unmodified* x86 guests.  We
> > don't have the luxury of having a predefined pvDMA spec that all x86
> > OSes adhere to.  The 32bit problem is unfortunate, but the priority use
> > case for assigning devices to guests is high performance I/O, which
> > usually entails modern, 64bit hardware.  I'd like to see us get to the
> > point of having emulated IOMMU hardware on x86, which could then be
> > backed by VFIO, but for now guest pinning is the most practical and
> > useful.
> 
> No-one's suggesting that this isn't a valid mode of operation.  It's
> just that right now conditionally disabling it for us is fairly ugly
> because of the way the qemu code is structured.

It really shouldn't be any more than skipping the
cpu_register_phys_memory_client() and calling the map/unmap routines
elsewhere.

> [snip]
> > >  - I don't like too much the fact that VFIO provides yet another
> > > different API to do what we already have at least 2 kernel APIs for, ie,
> > > BAR mapping and config space access. At least it should be better at
> > > using the backend infrastructure of the 2 others (sysfs & procfs). I
> > > understand it wants to filter in some case (config space) and -maybe-
> > > yet another API is the right way to go but allow me to have my doubts.
> > 
> > The use of PCI sysfs is actually one of my complaints about current
> > device assignment.  To do assignment with an unprivileged guest we need
> > to open the PCI sysfs config file for it, then change ownership on a
> > handful of other PCI sysfs files, then there's this other pci-stub thing
> > to maintain ownership, but the kvm ioctls don't actually require it and
> > can grab onto any free device...  We are duplicating some of that in
> > VFIO, but we also put the ownership of the device behind a single device
> > file.  We do have the uiommu problem that we can't give an unprivileged
> > user ownership of that, but your usage model may actually make that
> > easier.  More below...
> 
> Hrm.  I was assuming that a sysfs groups interface would provide a
> single place to set the ownership of the whole group.  Whether that's
> a echoing a uid to a magic file or doing or chown on the directory or
> whatever is a matter of details.

Except one of those details is whether we manage the group in sysfs or
just expose enough information in sysfs for another userspace entity to
manage the devices.  Where do we manage enforcement of hardware policy
vs userspace policy?

> [snip]
> > I spent a lot of time looking for an architecture neutral solution here,
> > but I don't think it exists.  Please prove me wrong.  The problem is
> > that we have to disable INTx on an assigned device after it fires (VFIO
> > does this automatically).  If we don't do this, a non-responsive or
> > malicious guest could sit on the interrupt, causing it to fire
> > repeatedly as a DoS on the host.  The only indication that we can rely
> > on to re-enable INTx is when the guest CPU writes an EOI to the APIC.
> > We can't just wait for device accesses because a) the device CSRs are
> > (hopefully) direct mapped and we'd have to slow map them or attempt to
> > do some kind of dirty logging to detect when they're accesses b) what
> > constitutes an interrupt service is device specific.
> > 
> > That means we need to figure out how PCI interrupt 'A' (or B...)
> > translates to a GSI (Global System Interrupt - ACPI definition, but
> > hopefully a generic concept).  That GSI identifies a pin on an IOAPIC,
> > which will also see the APIC EOI.  And just to spice things up, the
> > guest can change the PCI to GSI mappings via ACPI.  I think the set of
> > callbacks I've added are generic (maybe I left ioapic in the name), but
> > yes they do need to be implemented for other architectures.  Patches
> > appreciated from those with knowledge of the systems and/or access to
> > device specs.  This is the only reason that I make QEMU VFIO only build
> > for x86.
> 
> There will certainly need to be some arch hooks here, but it can be
> made less intrusively x86 specific without too much difficulty.
> e.g. Create an EOI notifier chain in qemu - the master PICs (APIC for
> x86, XICS for pSeries) for all vfio capable machines need to kick it,
> and vfio subscribes.

Am I the only one that sees ioapic_add/remove_gsi_eoi_notifier() in the
qemu/vfio patch series?  Shoot me for using ioapic in the name, but it's
exactly what you ask for.  It just needs to be made a common service and
implemented for power.

> [snip]
> > Rather than your "groups" idea, I've been mulling over whether we can
> > just expose the dependencies, configuration, and capabilities in sysfs
> > and build qemu commandlines to describe it.  For instance, if we simply
> > start with creating iommu nodes in sysfs, we could create links under
> > each iommu directory to the devices behind them.  Some kind of
> > capability file could define properties like whether it's page table
> > based or fixed iova window or the granularity of mapping the devices
> > behind it.  Once we have that, we could probably make uiommu attach to
> > each of those nodes.
> 
> Well, that would address our chief concern that inherently tying the
> lifetime of a domain to an fd is problematic.  In fact, I don't really
> see how this differs from the groups proposal except in the details of
> how you inform qemu of the group^H^H^H^H^Hiommu domain.

One implies group policy, configuration and management in sysfs, the
other exposes the hardware dependencies in sysfs and leaves the rest for
someone else (libvirt).  Thanks,

Alex
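
The iommu-nodes idea discussed above might yield a sysfs layout along these lines. All of the paths and attribute names are hypothetical; nothing like this existed at the time:

```text
/sys/class/iommu/
    iommu0/
        capability        # e.g. "pagetable" vs "fixed-window"
        iova_start        # only meaningful for fixed-window iommus
        iova_size
        devices/
            0000:01:00.0 -> ../../../../devices/pci0000:00/0000:01:00.0
            0000:01:00.1 -> ../../../../devices/pci0000:00/0000:01:00.1
    iommu1/
        capability
        devices/
            0000:02:00.0 -> ../../../../devices/pci0000:00/0000:02:00.0
```

The links under devices/ express the hardware dependency (which devices sit behind which translation unit), while leaving policy, i.e. who may use them and how they are grouped for a guest, to userspace.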

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-02 18:14       ` Alex Williamson
  (?)
@ 2011-08-02 18:35         ` Alex Williamson
  -1 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-08-02 18:35 UTC (permalink / raw)
  To: David Gibson
  Cc: aafabbri, Alexey Kardashevskiy, kvm, Paul Mackerras, qemu-devel,
	chrisw, iommu, linux-pci, linuxppc-dev, benve

On Tue, 2011-08-02 at 12:14 -0600, Alex Williamson wrote:
> On Tue, 2011-08-02 at 18:28 +1000, David Gibson wrote:
> > On Sat, Jul 30, 2011 at 12:20:08PM -0600, Alex Williamson wrote:
> > > On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote:
> > [snip]
> > > On x86, the USB controllers don't typically live behind a PCIe-to-PCI
> > > bridge, so don't suffer the source identifier problem, but they do often
> > > share an interrupt.  But even then, we can count on most modern devices
> > > supporting PCI2.3, and thus the DisINTx feature, which allows us to
> > > share interrupts.  In any case, yes, it's more rare but we need to know
> > > how to handle devices behind PCI bridges.  However I disagree that we
> > > need to assign all the devices behind such a bridge to the guest.
> > > There's a difference between removing the device from the host and
> > > exposing the device to the guest.
> > 
> > I think you're arguing only over details of what words to use for
> > what, rather than anything of substance here.  The point is that an
> > entire partitionable group must be assigned to "host" (in which case
> > kernel drivers may bind to it) or to a particular guest partition (or
> > at least to a single UID on the host).  Which of the assigned devices
> > the partition actually uses is another matter of course, as is at
> > exactly which level they become "de-exposed" if you don't want to use
> > all of them.
> 
> Well first we need to define what a partitionable group is, whether it's
> based on hardware requirements or user policy.  And while I agree that
> we need unique ownership of a partition, I disagree that qemu is
> necessarily the owner of the entire partition vs individual devices.

Sorry, I didn't intend to have such circular logic.  "... I disagree
that qemu is necessarily the owner of the entire partition vs granted
access to devices within the partition".  Thanks,

Alex

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-02 15:34           ` Alex Williamson
@ 2011-08-02 21:29             ` Konrad Rzeszutek Wilk
  -1 siblings, 0 replies; 322+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-08-02 21:29 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Benjamin Herrenschmidt, Avi Kivity, kvm, Anthony Liguori,
	David Gibson, Paul Mackerras, Alexey Kardashevskiy, linux-pci,
	linuxppc-dev

On Tue, Aug 02, 2011 at 09:34:58AM -0600, Alex Williamson wrote:
> On Tue, 2011-08-02 at 22:58 +1000, Benjamin Herrenschmidt wrote:
> > 
> > Don't worry, it took me a while to get my head around the HW :-) SR-IOV
> > VFs will generally not have limitations like that no, but on the other
> > hand, they -will- still require 1 VF = 1 group, ie, you won't be able to
> > take a bunch of VFs and put them in the same 'domain'.
> > 
> > I think the main deal is that VFIO/qemu sees "domains" as "guests" and
> > tries to put all devices for a given guest into a "domain".
> 
> Actually, that's only a recent optimization, before that each device got
> its own iommu domain.  It's actually completely configurable on the
> qemu command line which devices get their own iommu and which share.
> The default optimizes the number of domains (one) and thus the number of
> mapping callbacks since we pin the entire guest.
> 
> > On POWER, we have a different view of things where domains/groups are
> > defined to be the smallest granularity we can (down to a single VF) and
> > we give several groups to a guest (ie we avoid sharing the iommu in most
> > cases)
> > 
> > This is driven by the HW design but that design is itself driven by the
> > idea that the domains/group are also error isolation groups and we don't
> > want to take all of the IOs of a guest down if one adapter in that guest
> > is having an error.
> > 
> > The x86 domains are conceptually different as they are about sharing the
> > iommu page tables with the clear long term intent of then sharing those
> > page tables with the guest CPU own. We aren't going in that direction
> > (at this point at least) on POWER..
> 
> Yes and no.  The x86 domains are pretty flexible and used a few
> different ways.  On the host we do dynamic DMA with a domain per device,
> mapping only the inflight DMA ranges.  In order to achieve the
> transparent device assignment model, we have to flip that around and map
> the entire guest.  As noted, we can continue to use separate domains for
> this, but since each maps the entire guest, it doesn't add a lot of
> value and uses more resources and requires more mapping callbacks (and
> x86 doesn't have the best error containment anyway).  If we had a well
> supported IOMMU model that we could adapt for pvDMA, then it would make
> sense to keep each device in its own domain again.  Thanks,

Could you have a PV IOMMU (in the guest) that would set up those
maps?

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
@ 2011-08-02 21:29             ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 322+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-08-02 21:29 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, kvm, Paul Mackerras, David Gibson,
	Avi Kivity, Anthony Liguori, linux-pci, linuxppc-dev

On Tue, Aug 02, 2011 at 09:34:58AM -0600, Alex Williamson wrote:
> On Tue, 2011-08-02 at 22:58 +1000, Benjamin Herrenschmidt wrote:
> > 
> > Don't worry, it took me a while to get my head around the HW :-) SR-IOV
> > VFs will generally not have limitations like that no, but on the other
> > hand, they -will- still require 1 VF = 1 group, ie, you won't be able to
> > take a bunch of VFs and put them in the same 'domain'.
> > 
> > I think the main deal is that VFIO/qemu sees "domains" as "guests" and
> > tries to put all devices for a given guest into a "domain".
> 
> Actually, that's only a recent optimization, before that each device got
> it's own iommu domain.  It's actually completely configurable on the
> qemu command line which devices get their own iommu and which share.
> The default optimizes the number of domains (one) and thus the number of
> mapping callbacks since we pin the entire guest.
> 
> > On POWER, we have a different view of things were domains/groups are
> > defined to be the smallest granularity we can (down to a single VF) and
> > we give several groups to a guest (ie we avoid sharing the iommu in most
> > cases)
> > 
> > This is driven by the HW design but that design is itself driven by the
> > idea that the domains/group are also error isolation groups and we don't
> > want to take all of the IOs of a guest down if one adapter in that guest
> > is having an error.
> > 
> > The x86 domains are conceptually different as they are about sharing the
> > iommu page tables with the clear long term intent of then sharing those
> > page tables with the guest CPU own. We aren't going in that direction
> > (at this point at least) on POWER..
> 
> Yes and no.  The x86 domains are pretty flexible and used a few
> different ways.  On the host we do dynamic DMA with a domain per device,
> mapping only the inflight DMA ranges.  In order to achieve the
> transparent device assignment model, we have to flip that around and map
> the entire guest.  As noted, we can continue to use separate domains for
> this, but since each maps the entire guest, it doesn't add a lot of
> value and uses more resources and requires more mapping callbacks (and
> x86 doesn't have the best error containment anyway).  If we had a well
> supported IOMMU model that we could adapt for pvDMA, then it would make
> sense to keep each device in its own domain again.  Thanks,

Could you have a PV IOMMU (in the guest) that would set up those
maps?
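[Editor's sketch: the tradeoff Alex describes in the quoted text above — one shared domain mapping the entire guest once, versus a domain per device each repeating the full guest mapping — reduces to simple arithmetic. This is purely illustrative toy code, not a real iommu API.]

```c
/* Toy model of the domain-sharing tradeoff: transparent device
 * assignment maps the whole guest into each iommu domain, so a
 * domain per device multiplies the pinned-mapping bookkeeping,
 * while one shared domain pays the cost once. Illustrative only. */
#include <assert.h>

static unsigned long pinned_mappings(unsigned long guest_pages,
                                     unsigned ndevices, int shared_domain)
{
    /* Each domain carries a full copy of the guest's memory map. */
    unsigned long domains = shared_domain ? 1 : ndevices;
    return domains * guest_pages;
}
```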

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-02 21:29             ` Konrad Rzeszutek Wilk
@ 2011-08-03  1:02               ` Alex Williamson
  -1 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-08-03  1:02 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Benjamin Herrenschmidt, Avi Kivity, kvm, Anthony Liguori,
	David Gibson, Paul Mackerras, Alexey Kardashevskiy, linux-pci,
	linuxppc-dev

On Tue, 2011-08-02 at 17:29 -0400, Konrad Rzeszutek Wilk wrote:
> On Tue, Aug 02, 2011 at 09:34:58AM -0600, Alex Williamson wrote:
> > On Tue, 2011-08-02 at 22:58 +1000, Benjamin Herrenschmidt wrote:
> > > 
> > > Don't worry, it took me a while to get my head around the HW :-) SR-IOV
> > > VFs will generally not have limitations like that no, but on the other
> > > hand, they -will- still require 1 VF = 1 group, ie, you won't be able to
> > > take a bunch of VFs and put them in the same 'domain'.
> > > 
> > > I think the main deal is that VFIO/qemu sees "domains" as "guests" and
> > > tries to put all devices for a given guest into a "domain".
> > 
> > Actually, that's only a recent optimization, before that each device got
> > its own iommu domain.  It's actually completely configurable on the
> > qemu command line which devices get their own iommu and which share.
> > The default optimizes the number of domains (one) and thus the number of
> > mapping callbacks since we pin the entire guest.
> > 
> > > On POWER, we have a different view of things where domains/groups are
> > > defined to be the smallest granularity we can (down to a single VF) and
> > > we give several groups to a guest (ie we avoid sharing the iommu in most
> > > cases)
> > > 
> > > This is driven by the HW design but that design is itself driven by the
> > > idea that the domains/group are also error isolation groups and we don't
> > > want to take all of the IOs of a guest down if one adapter in that guest
> > > is having an error.
> > > 
> > > The x86 domains are conceptually different as they are about sharing the
> > > iommu page tables with the clear long term intent of then sharing those
> > > page tables with the guest CPU's own. We aren't going in that direction
> > > (at this point at least) on POWER..
> > 
> > Yes and no.  The x86 domains are pretty flexible and used a few
> > different ways.  On the host we do dynamic DMA with a domain per device,
> > mapping only the inflight DMA ranges.  In order to achieve the
> > transparent device assignment model, we have to flip that around and map
> > the entire guest.  As noted, we can continue to use separate domains for
> > this, but since each maps the entire guest, it doesn't add a lot of
> > value and uses more resources and requires more mapping callbacks (and
> > x86 doesn't have the best error containment anyway).  If we had a well
> > supported IOMMU model that we could adapt for pvDMA, then it would make
> > sense to keep each device in its own domain again.  Thanks,
> 
> Could you have a PV IOMMU (in the guest) that would set up those
> maps?

Yep, definitely.  That's effectively what power wants to do.  We could
do it on x86, but as others have noted, the map/unmap interface isn't
tuned to do this at that granularity and our target guest OS audience is
effectively reduced to Linux.  Thanks,

Alex
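[Editor's sketch: the PV IOMMU interface Konrad asks about and Alex confirms — the guest mapping only in-flight DMA ranges instead of the host pinning the whole guest — might look like the following. All names here (pv_iommu_map, pv_iommu_unmap, the single-level table) are hypothetical illustrations, not any real kernel or qemu interface.]

```c
/* Hypothetical guest-visible PV IOMMU: the guest driver "hypercalls"
 * to map each in-flight DMA page, and unmaps it on completion, so the
 * host only ever needs the active ranges pinned. Illustrative only. */
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

#define TABLE_ENTRIES 256
#define PAGE_SHIFT    12
#define PAGE_MASK     ((1ULL << PAGE_SHIFT) - 1)

struct pv_iommu {
    uint64_t gpa[TABLE_ENTRIES];   /* guest-physical page per IO page */
    int      valid[TABLE_ENTRIES];
};

/* Map one guest page at the given IO virtual address (4K granularity).
 * Returns 0 on success, -1 if the slot is already in use. */
static int pv_iommu_map(struct pv_iommu *iommu, uint64_t iova, uint64_t gpa)
{
    size_t idx = (iova >> PAGE_SHIFT) % TABLE_ENTRIES;
    if (iommu->valid[idx])
        return -1;
    iommu->gpa[idx] = gpa & ~PAGE_MASK;
    iommu->valid[idx] = 1;
    return 0;
}

/* Tear down a mapping once the DMA completes. */
static int pv_iommu_unmap(struct pv_iommu *iommu, uint64_t iova)
{
    size_t idx = (iova >> PAGE_SHIFT) % TABLE_ENTRIES;
    if (!iommu->valid[idx])
        return -1;
    iommu->valid[idx] = 0;
    return 0;
}

/* Host-side translation a DMA would trigger: iova -> gpa, or -1 on
 * fault (i.e. DMA outside any in-flight mapping). */
static int64_t pv_iommu_translate(const struct pv_iommu *iommu, uint64_t iova)
{
    size_t idx = (iova >> PAGE_SHIFT) % TABLE_ENTRIES;
    if (!iommu->valid[idx])
        return -1;
    return (int64_t)(iommu->gpa[idx] | (iova & PAGE_MASK));
}
```

This is essentially the POWER H_PUT_TCE-style model: the cost is a map/unmap round trip per DMA, which is why the x86 side notes its interface isn't tuned for that granularity.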

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-02 18:35         ` Alex Williamson
  (?)
@ 2011-08-03  2:04           ` David Gibson
  -1 siblings, 0 replies; 322+ messages in thread
From: David Gibson @ 2011-08-03  2:04 UTC (permalink / raw)
  To: Alex Williamson
  Cc: aafabbri, Alexey Kardashevskiy, kvm, Paul Mackerras, qemu-devel,
	chrisw, iommu, Anthony Liguori, linux-pci, linuxppc-dev, benve

On Tue, Aug 02, 2011 at 12:35:19PM -0600, Alex Williamson wrote:
> On Tue, 2011-08-02 at 12:14 -0600, Alex Williamson wrote:
> > On Tue, 2011-08-02 at 18:28 +1000, David Gibson wrote:
> > > On Sat, Jul 30, 2011 at 12:20:08PM -0600, Alex Williamson wrote:
> > > > On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote:
> > > [snip]
> > > > On x86, the USB controllers don't typically live behind a PCIe-to-PCI
> > > > bridge, so don't suffer the source identifier problem, but they do often
> > > > share an interrupt.  But even then, we can count on most modern devices
> > > > supporting PCI2.3, and thus the DisINTx feature, which allows us to
> > > > share interrupts.  In any case, yes, it's more rare but we need to know
> > > > how to handle devices behind PCI bridges.  However I disagree that we
> > > > need to assign all the devices behind such a bridge to the guest.
> > > > There's a difference between removing the device from the host and
> > > > exposing the device to the guest.
> > > 
> > > I think you're arguing only over details of what words to use for
> > > what, rather than anything of substance here.  The point is that an
> > > entire partitionable group must be assigned to "host" (in which case
> > > kernel drivers may bind to it) or to a particular guest partition (or
> > > at least to a single UID on the host).  Which of the assigned devices
> > > the partition actually uses is another matter of course, as is at
> > > exactly which level they become "de-exposed" if you don't want to use
> > > all of them.
> > 
> > Well first we need to define what a partitionable group is, whether it's
> > based on hardware requirements or user policy.  And while I agree that
> > we need unique ownership of a partition, I disagree that qemu is
> > necessarily the owner of the entire partition vs individual devices.
> 
> Sorry, I didn't intend to have such circular logic.  "... I disagree
> that qemu is necessarily the owner of the entire partition vs granted
> access to devices within the partition".  Thanks,

I still don't understand the distinction you're making.  We're saying
the group is "owned" by a given user or guest in the sense that no-one
else may use anything in the group (including host drivers).  At that
point none, some or all of the devices in the group may actually be
used by the guest.

You seem to be making a distinction between "owned by" and "assigned
to" and "used by" and I really don't see what it is.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-03  2:04           ` David Gibson
  (?)
@ 2011-08-03  3:44             ` Alex Williamson
  -1 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-08-03  3:44 UTC (permalink / raw)
  To: David Gibson
  Cc: chrisw, Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	qemu-devel, aafabbri, iommu, linuxppc-dev, benve

On Wed, 2011-08-03 at 12:04 +1000, David Gibson wrote:
> On Tue, Aug 02, 2011 at 12:35:19PM -0600, Alex Williamson wrote:
> > On Tue, 2011-08-02 at 12:14 -0600, Alex Williamson wrote:
> > > On Tue, 2011-08-02 at 18:28 +1000, David Gibson wrote:
> > > > On Sat, Jul 30, 2011 at 12:20:08PM -0600, Alex Williamson wrote:
> > > > > On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote:
> > > > [snip]
> > > > > On x86, the USB controllers don't typically live behind a PCIe-to-PCI
> > > > > bridge, so don't suffer the source identifier problem, but they do often
> > > > > share an interrupt.  But even then, we can count on most modern devices
> > > > > supporting PCI2.3, and thus the DisINTx feature, which allows us to
> > > > > share interrupts.  In any case, yes, it's more rare but we need to know
> > > > > how to handle devices behind PCI bridges.  However I disagree that we
> > > > > need to assign all the devices behind such a bridge to the guest.
> > > > > There's a difference between removing the device from the host and
> > > > > exposing the device to the guest.
> > > > 
> > > > I think you're arguing only over details of what words to use for
> > > > what, rather than anything of substance here.  The point is that an
> > > > entire partitionable group must be assigned to "host" (in which case
> > > > kernel drivers may bind to it) or to a particular guest partition (or
> > > > at least to a single UID on the host).  Which of the assigned devices
> > > > the partition actually uses is another matter of course, as is at
> > > > exactly which level they become "de-exposed" if you don't want to use
> > > > all of them.
> > > 
> > > Well first we need to define what a partitionable group is, whether it's
> > > based on hardware requirements or user policy.  And while I agree that
> > > we need unique ownership of a partition, I disagree that qemu is
> > > necessarily the owner of the entire partition vs individual devices.
> > 
> > Sorry, I didn't intend to have such circular logic.  "... I disagree
> > that qemu is necessarily the owner of the entire partition vs granted
> > access to devices within the partition".  Thanks,
> 
> I still don't understand the distinction you're making.  We're saying
> the group is "owned" by a given user or guest in the sense that no-one
> else may use anything in the group (including host drivers).  At that
> point none, some or all of the devices in the group may actually be
> used by the guest.
> 
> You seem to be making a distinction between "owned by" and "assigned
> to" and "used by" and I really don't see what it is.

How does a qemu instance that uses none of the devices in a group still
own that group?  Aren't we at that point free to move the group to a
different qemu instance or return ownership to the host?  Who does that?
In my mental model, there's an intermediary that "owns" the group and
just as kernel drivers bind to devices when the host owns the group,
qemu is a userspace device driver that binds to sets of devices when the
intermediary owns it.  Obviously I'm thinking libvirt, but it doesn't
have to be.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-03  3:44             ` Alex Williamson
@ 2011-08-04  0:39               ` David Gibson
  -1 siblings, 0 replies; 322+ messages in thread
From: David Gibson @ 2011-08-04  0:39 UTC (permalink / raw)
  To: Alex Williamson
  Cc: aafabbri, Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	qemu-devel, chrisw, iommu, Anthony Liguori, linuxppc-dev, benve

On Tue, Aug 02, 2011 at 09:44:49PM -0600, Alex Williamson wrote:
> On Wed, 2011-08-03 at 12:04 +1000, David Gibson wrote:
> > On Tue, Aug 02, 2011 at 12:35:19PM -0600, Alex Williamson wrote:
> > > On Tue, 2011-08-02 at 12:14 -0600, Alex Williamson wrote:
> > > > On Tue, 2011-08-02 at 18:28 +1000, David Gibson wrote:
> > > > > On Sat, Jul 30, 2011 at 12:20:08PM -0600, Alex Williamson wrote:
> > > > > > On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote:
> > > > > [snip]
> > > > > > On x86, the USB controllers don't typically live behind a PCIe-to-PCI
> > > > > > bridge, so don't suffer the source identifier problem, but they do often
> > > > > > share an interrupt.  But even then, we can count on most modern devices
> > > > > > supporting PCI2.3, and thus the DisINTx feature, which allows us to
> > > > > > share interrupts.  In any case, yes, it's more rare but we need to know
> > > > > > how to handle devices behind PCI bridges.  However I disagree that we
> > > > > > need to assign all the devices behind such a bridge to the guest.
> > > > > > There's a difference between removing the device from the host and
> > > > > > exposing the device to the guest.
> > > > > 
> > > > > I think you're arguing only over details of what words to use for
> > > > > what, rather than anything of substance here.  The point is that an
> > > > > entire partitionable group must be assigned to "host" (in which case
> > > > > kernel drivers may bind to it) or to a particular guest partition (or
> > > > > at least to a single UID on the host).  Which of the assigned devices
> > > > > the partition actually uses is another matter of course, as is at
> > > > > exactly which level they become "de-exposed" if you don't want to use
> > > > > all of them.
> > > > 
> > > > Well first we need to define what a partitionable group is, whether it's
> > > > based on hardware requirements or user policy.  And while I agree that
> > > > we need unique ownership of a partition, I disagree that qemu is
> > > > necessarily the owner of the entire partition vs individual devices.
> > > 
> > > Sorry, I didn't intend to have such circular logic.  "... I disagree
> > > that qemu is necessarily the owner of the entire partition vs granted
> > > access to devices within the partition".  Thanks,
> > 
> > I still don't understand the distinction you're making.  We're saying
> > the group is "owned" by a given user or guest in the sense that no-one
> > else may use anything in the group (including host drivers).  At that
> > point none, some or all of the devices in the group may actually be
> > used by the guest.
> > 
> > You seem to be making a distinction between "owned by" and "assigned
> > to" and "used by" and I really don't see what it is.
> 
> How does a qemu instance that uses none of the devices in a group still
> own that group?

?? In the same way that you still own a file you don't have open..?

>  Aren't we at that point free to move the group to a
> different qemu instance or return ownership to the host?

Of course.  But until you actually do that, the group is still
notionally owned by the guest.

>  Who does that?

The admin.  Possibly by poking sysfs, or possibly by frobbing some
character device, or maybe something else.  Naturally libvirt or
whatever could also do this.

> In my mental model, there's an intermediary that "owns" the group and
> just as kernel drivers bind to devices when the host owns the group,
> qemu is a userspace device driver that binds to sets of devices when the
> intermediary owns it.  Obviously I'm thinking libvirt, but it doesn't
> have to be.  Thanks,

Well sure, but I really don't see how such an intermediary fits into
the kernel's model of ownership.

So, first, take a step back and look at what sort of entities can
"own" a group (or device or whatever).  I notice that when I've said
"owned by the guest" you seem to have read this as "owned by qemu"
which is not necessarily the same thing.

What I had in mind is that each group is either owned by "host", in
which case host kernel drivers can bind to it, or it's in "guest mode"
in which case it has a user, group and mode and can be bound by user
drivers (and therefore guests) with the right permission.  From the
kernel's perspective there is therefore no distinction between "owned
by qemu" and "owned by libvirt".
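[Editor's sketch: the ownership model described above — a group either in "host" mode, where kernel drivers may bind, or in "guest" mode with a user/group/mode triple, where any matching userspace driver may bind — can be written down as a small state machine. The names (pt_group, group_allow_*) are invented for illustration; this is not a real VFIO interface.]

```c
/* A partitionable group is host-owned (kernel drivers bind) or in
 * guest mode (owned by uid/gid/mode; user drivers bind). The kernel
 * checks only credentials; it never distinguishes "qemu" from
 * "libvirt". Illustrative only. */
#include <assert.h>
#include <sys/types.h>

enum group_mode { GROUP_HOST, GROUP_GUEST };

struct pt_group {
    enum group_mode mode;
    uid_t uid;          /* meaningful only in GROUP_GUEST mode */
    gid_t gid;
    unsigned perm;      /* e.g. 0600 */
};

/* Host kernel drivers may bind only while the group is host-owned. */
static int group_allow_kernel_bind(const struct pt_group *g)
{
    return g->mode == GROUP_HOST;
}

/* A userspace driver may bind iff the group is in guest mode and the
 * caller matches the owning uid (group/other perm checks elided). */
static int group_allow_user_bind(const struct pt_group *g, uid_t uid)
{
    return g->mode == GROUP_GUEST && g->uid == uid;
}

/* The admin flips a group into guest mode, handing it to a user. */
static void group_set_guest(struct pt_group *g, uid_t uid, gid_t gid,
                            unsigned perm)
{
    g->mode = GROUP_GUEST;
    g->uid = uid;
    g->gid = gid;
    g->perm = perm;
}
```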


-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [Qemu-devel] kvm PCI assignment & VFIO ramblings
@ 2011-08-04  0:39               ` David Gibson
  0 siblings, 0 replies; 322+ messages in thread
From: David Gibson @ 2011-08-04  0:39 UTC (permalink / raw)
  To: Alex Williamson
  Cc: aafabbri, Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	qemu-devel, chrisw, iommu, linuxppc-dev, benve

On Tue, Aug 02, 2011 at 09:44:49PM -0600, Alex Williamson wrote:
> On Wed, 2011-08-03 at 12:04 +1000, David Gibson wrote:
> > On Tue, Aug 02, 2011 at 12:35:19PM -0600, Alex Williamson wrote:
> > > On Tue, 2011-08-02 at 12:14 -0600, Alex Williamson wrote:
> > > > On Tue, 2011-08-02 at 18:28 +1000, David Gibson wrote:
> > > > > On Sat, Jul 30, 2011 at 12:20:08PM -0600, Alex Williamson wrote:
> > > > > > On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote:
> > > > > [snip]
> > > > > > On x86, the USB controllers don't typically live behind a PCIe-to-PCI
> > > > > > bridge, so don't suffer the source identifier problem, but they do often
> > > > > > share an interrupt.  But even then, we can count on most modern devices
> > > > > > supporting PCI2.3, and thus the DisINTx feature, which allows us to
> > > > > > share interrupts.  In any case, yes, it's more rare but we need to know
> > > > > > how to handle devices behind PCI bridges.  However I disagree that we
> > > > > > need to assign all the devices behind such a bridge to the guest.
> > > > > > There's a difference between removing the device from the host and
> > > > > > exposing the device to the guest.
> > > > > 
> > > > > I think you're arguing only over details of what words to use for
> > > > > what, rather than anything of substance here.  The point is that an
> > > > > entire partitionable group must be assigned to "host" (in which case
> > > > > kernel drivers may bind to it) or to a particular guest partition (or
> > > > > at least to a single UID on the host).  Which of the assigned devices
> > > > > the partition actually uses is another matter of course, as is at
> > > > > exactly which level they become "de-exposed" if you don't want to use
> > > > > all of then.
> > > > 
> > > > Well first we need to define what a partitionable group is, whether it's
> > > > based on hardware requirements or user policy.  And while I agree that
> > > > we need unique ownership of a partition, I disagree that qemu is
> > > > necessarily the owner of the entire partition vs individual devices.
> > > 
> > > Sorry, I didn't intend to have such circular logic.  "... I disagree
> > > that qemu is necessarily the owner of the entire partition vs granted
> > > access to devices within the partition".  Thanks,
> > 
> > I still don't understand the distinction you're making.  We're saying
> > the group is "owned" by a given user or guest in the sense that no-one
> > else may use anything in the group (including host drivers).  At that
> > point none, some or all of the devices in the group may actually be
> > used by the guest.
> > 
> > You seem to be making a distinction between "owned by" and "assigned
> > to" and "used by" and I really don't see what it is.
> 
> How does a qemu instance that uses none of the devices in a group still
> own that group?

?? In the same way that you still own a file you don't have open..?

>  Aren't we at that point free to move the group to a
> different qemu instance or return ownership to the host?

Of course.  But until you actually do that, the group is still
notionally owned by the guest.

>  Who does that?

The admin.  Possibly by poking sysfs, or possibly by frobbing some
character device, or maybe something else.  Naturally libvirt or
whatever could also do this.

> In my mental model, there's an intermediary that "owns" the group and
> just as kernel drivers bind to devices when the host owns the group,
> qemu is a userspace device driver that binds to sets of devices when the
> intermediary owns it.  Obviously I'm thinking libvirt, but it doesn't
> have to be.  Thanks,

Well sure, but I really don't see how such an intermediary fits into
the kernel's model of ownership.

So, first, take a step back and look at what sort of entities can
"own" a group (or device or whatever).  I notice that when I've said
"owned by the guest" you seem to have read this as "owned by qemu"
which is not necessarily the same thing.

What I had in mind is that each group is either owned by "host", in
which case host kernel drivers can bind to it, or it's in "guest mode"
in which case it has a user, group and mode and can be bound by user
drivers (and therefore guests) with the right permission.  From the
kernel's perspective there is therefore no distinction between "owned
by qemu" and "owned by libvirt".
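
The two-state model described above can be sketched roughly as follows (a simplified illustration only; the class, field names, and permission handling are hypothetical, not an actual kernel interface):

```python
# Sketch of the ownership model: a group is either in "host" mode
# (kernel drivers may bind) or in "guest" mode, where it carries a
# user/group/mode triple and any userspace driver running with the
# right credentials (qemu, or something driven by libvirt) may bind.
# From the kernel's point of view there is no qemu/libvirt
# distinction -- only a uid/gid with permission.

class Group:
    def __init__(self, devices):
        self.devices = set(devices)
        self.mode = "host"        # "host" or "guest"
        self.owner = None         # (uid, gid, perm) when in guest mode

    def set_guest_mode(self, uid, gid, perm):
        # Admin action: detach from host drivers, hand to a user.
        self.mode = "guest"
        self.owner = (uid, gid, perm)

    def set_host_mode(self):
        # Admin action: return the group to host kernel drivers.
        self.mode = "host"
        self.owner = None

    def may_bind_kernel_driver(self):
        return self.mode == "host"

    def may_bind_user_driver(self, uid, gid):
        if self.mode != "guest":
            return False
        owner_uid, owner_gid, perm = self.owner
        # Simplified file-mode-style permission check.
        if uid == owner_uid and perm & 0o600 == 0o600:
            return True
        if gid == owner_gid and perm & 0o060 == 0o060:
            return True
        return bool(perm & 0o006)


g = Group(["0000:01:00.0", "0000:01:00.1"])
assert g.may_bind_kernel_driver()
g.set_guest_mode(uid=1000, gid=1000, perm=0o600)
assert not g.may_bind_kernel_driver()
assert g.may_bind_user_driver(1000, 1000)
assert not g.may_bind_user_driver(1001, 1001)
```

The point of the sketch is the last assertion pair: whoever holds the matching uid owns the group, whether that process happens to be qemu, libvirt, or anything else.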


-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-07-29 23:58 ` Benjamin Herrenschmidt
@ 2011-08-04 10:27   ` Joerg Roedel
  -1 siblings, 0 replies; 322+ messages in thread
From: Joerg Roedel @ 2011-08-04 10:27 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: kvm, Anthony Liguori, Alex Williamson, David Gibson,
	Paul Mackerras, Alexey Kardashevskiy, linux-pci, linuxppc-dev

Hi Ben,

thanks for your detailed introduction to the requirements for POWER. It's
good to know that the granularity problem is not x86-only.

On Sat, Jul 30, 2011 at 09:58:53AM +1000, Benjamin Herrenschmidt wrote:
> In IBM POWER land, we call this a "partitionable endpoint" (the term
> "endpoint" here is historic, such a PE can be made of several PCIe
> "endpoints"). I think "partitionable" is a pretty good name tho to
> represent the constraints, so I'll call this a "partitionable group"
> from now on.

On x86 this is mostly an issue of the IOMMU and which set of devices use
the same request-id. I used to call that an alias-group because the
devices have a request-id alias to the pci-bridge.
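
That aliasing can be modelled roughly like this (a simplified sketch, not real topology-discovery code; note that depending on the bridge, the tag actually seen by the IOMMU may be the bridge's own RID or a bus-number-only RID):

```python
# Simplified model of request-id (RID) aliasing behind a PCIe-to-PCI
# bridge: the bridge re-tags its children's DMA with its own RID, so
# all conventional-PCI devices behind it share one effective RID and
# must land in the same group.

def rid(bus, dev, fn):
    # Classic encoding: 8 bits bus, 5 bits device, 3 bits function.
    return (bus << 8) | (dev << 3) | fn

def effective_rid(device, bridges):
    # 'bridges' maps a secondary bus number to the owning bridge's RID.
    bus = device >> 8
    return bridges.get(bus, device)

def alias_groups(devices, bridges):
    groups = {}
    for d in devices:
        groups.setdefault(effective_rid(d, bridges), []).append(d)
    return groups

# Bus 2 sits behind a PCIe-to-PCI bridge whose own RID is 01:00.0;
# bus 3 devices are plain PCIe and keep their own RIDs.
bridges = {2: rid(1, 0, 0)}
devs = [rid(2, 0, 0), rid(2, 1, 0), rid(3, 0, 0)]
print(alias_groups(devs, bridges))   # {256: [512, 520], 768: [768]}
```

The two bus-2 devices collapse into one alias-group keyed by the bridge's RID; the bus-3 device stands alone.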

> - The -minimum- granularity of pass-through is not always a single
> device and not always under SW control

Correct.
 
> - Having a magic heuristic in libvirt to figure out those constraints is
> WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
> knowledge of PCI resource management and getting it wrong in many many
> cases, something that took years to fix essentially by ripping it all
> out. This is kernel knowledge and thus we need the kernel to expose in
> one way or another what those constraints are, what those "partitionable
> groups" are.

I agree. Managing the ownership of a group should be done in the kernel.
Doing this in userspace is just too dangerous.

The problem to be solved here is how to present these PEs inside the
kernel and to userspace. I thought a bit about making this visible
through the iommu-api for in-kernel users. That is probably the most
logical place.

For userspace I would like to propose a new device attribute in sysfs.
This attribute contains the group number. All devices with the same
group number belong to the same PE. Libvirt needs to scan the whole
device tree to build the groups but that is probably not a big deal.
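
Inverting such a per-device attribute into a group map is indeed cheap. A sketch of the userspace side (the attribute name `group_id` and its exact sysfs location are hypothetical, just following the proposal above):

```python
# Scan every PCI device directory for a (hypothetical) 'group_id'
# attribute and invert it into {group_id: [device, ...]}, which is
# the map libvirt would need to treat each group as one ownership unit.

import os

def read_groups(sysfs_root="/sys/bus/pci/devices"):
    groups = {}
    for dev in sorted(os.listdir(sysfs_root)):
        attr = os.path.join(sysfs_root, dev, "group_id")
        try:
            with open(attr) as f:
                gid = int(f.read().strip())
        except (FileNotFoundError, ValueError):
            continue   # device not in any group, or malformed attribute
        groups.setdefault(gid, []).append(dev)
    return groups
```

Each value list is then one indivisible unit for assignment purposes: either all of its devices stay with the host, or the whole group changes hands.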


	Joerg

> 
> - That does -not- mean that we cannot specify for each individual device
> within such a group where we want to put it in qemu (what devfn etc...).
> As long as there is a clear understanding that the "ownership" of the
> device goes with the group, this is somewhat orthogonal to how they are
> represented in qemu. (Not completely... if the iommu is exposed to the
> guest ,via paravirt for example, some of these constraints must be
> exposed but I'll talk about that more later).
> 
> The interface currently proposed for VFIO (and associated uiommu)
> doesn't handle that problem at all. Instead, it is entirely centered
> around a specific "feature" of the VTd iommu's for creating arbitrary
> domains with arbitrary devices (tho those devices -do- have the same
> constraints exposed above, don't try to put 2 legacy PCI devices behind
> the same bridge into 2 different domains !), but the API totally ignores
> the problem, leaves it to libvirt "magic foo" and focuses on something
> that is both quite secondary in the grand scheme of things, and quite
> x86 VTd specific in the implementation and API definition.
> 
> Now, I'm not saying these programmable iommu domains aren't a nice
> feature and that we shouldn't exploit them when available, but as it is,
> it is too much a central part of the API.
> 
> I'll talk a little bit more about recent POWER iommu's here to
> illustrate where I'm coming from with my idea of groups:
> 
> On p7ioc (the IO chip used on recent P7 machines), there -is- a concept
> of domain and a per-RID filtering. However it differs from VTd in a few
> ways:
> 
> The "domains" (aka PEs) encompass more than just an iommu filtering
> scheme. The MMIO space and PIO space are also segmented, and those
> segments assigned to domains. Interrupts (well, MSI ports at least) are
> assigned to domains. Inbound PCIe error messages are targeted to
> domains, etc...
> 
> Basically, the PEs provide a very strong isolation feature which
> includes errors, and has the ability to immediately "isolate" a PE on
> the first occurrence of an error. For example, if an inbound PCIe error
> is signaled by a device on a PE or such a device does a DMA to a
> non-authorized address, the whole PE gets into error state. All
> subsequent stores (both DMA and MMIO) are swallowed and reads return all
> 1's, interrupts are blocked. This is designed to prevent any propagation
> of bad data, which is a very important feature in large high reliability
> systems.
> 
> Software then has the ability to selectively turn back on MMIO and/or
> DMA, perform diagnostics, reset devices etc...
> 
> Because the domains encompass more than just DMA, but also segment the
> MMIO space, it is not practical at all to dynamically reconfigure them
> at runtime to "move" devices into domains. The firmware or early kernel
> code (it depends) will assign devices BARs using an algorithm that keeps
> them within PE segment boundaries, etc....
> 
> Additionally (and this is indeed a "restriction" compared to VTd, though
> I expect our future IO chips to lift it to some extent), PE don't get
> separate DMA address spaces. There is one 64-bit DMA address space per
> PCI host bridge, and it is 'segmented' with each segment being assigned
> to a PE. Due to the way PE assignment works in hardware, it is not
> practical to make several devices share a segment unless they are on the
> same bus. Also the resulting limit in the amount of 32-bit DMA space a
> device can access means that it's impractical to put too many devices in
> a PE anyways. (This is clearly designed for paravirt iommu, I'll talk
> more about that later).
> 
> The above essentially extends the granularity requirement (or rather is
> another factor defining what the granularity of partitionable entities
> is). You can think of it as "pre-existing" domains.
> 
> I believe the way to solve that is to introduce a kernel interface to
> expose those "partitionable entities" to userspace. In addition, it
> occurs to me that the ability to manipulate VTd domains essentially
> boils down to manipulating those groups (creating larger ones with
> individual components).
> 
> I like the idea of defining / playing with those groups statically
> (using a command line tool or sysfs, possibly having a config file
> defining them in a persistent way) rather than having their lifetime
> tied to a uiommu file descriptor.
> 
> It also makes it a LOT easier to have a channel to manipulate
> platform/arch specific attributes of those domains if any.
> 
> So we could define an API or representation in sysfs that exposes what
> the partitionable entities are, and we may add to it an API to
> manipulate them. But we don't have to and I'm happy to keep the
> additional SW grouping you can do on VTd as a separate "add-on" API
> (tho I don't like at all the way it works with uiommu). However, qemu
> needs to know what the grouping is regardless of the domains, and it's
> not nice if it has to manipulate two different concepts here so
> eventually those "partitionable entities" from a qemu standpoint must
> look like domains.
> 
> My main point is that I don't want the "knowledge" here to be in libvirt
> or qemu. In fact, I want to be able to do something as simple as passing
> a reference to a PE to qemu (sysfs path ?) and have it just pickup all
> the devices in there and expose them to the guest.
> 
> This can be done in a way that isn't PCI specific as well (the
> definition of the groups and what is grouped would obviously be
> somewhat bus specific and handled by platform code in the kernel).
> 
> Maybe something like /sys/devgroups ? This probably warrants involving
> more kernel people into the discussion.
> 
> * IOMMU
> 
> Now more on iommu. I've described I think in enough details how ours
> work, there are others, I don't know what freescale or ARM are doing,
> sparc doesn't quite work like VTd either, etc...
> 
> The main problem isn't that much the mechanics of the iommu but really
> how it's exposed (or not) to guests.
> 
> VFIO here is basically designed for one and only one thing: expose the
> entire guest physical address space to the device more/less 1:1.
> 
> This means:
> 
>   - It only works with iommu's that provide complete DMA address spaces
> to devices. Won't work with a single 'segmented' address space like we
> have on POWER.
> 
>   - It requires the guest to be pinned. Pass-through -> no more swap
> 
>   - The guest cannot make use of the iommu to deal with 32-bit DMA
> devices, thus a guest with more than a few G of RAM (I don't know the
> exact limit on x86, depends on your IO hole I suppose), and you end up
> back to swiotlb & bounce buffering.
> 
>   - It doesn't work for POWER server anyways because of our need to
> provide a paravirt iommu interface to the guest since that's how pHyp
> works today and how existing OSes expect to operate.
> 
> Now some of this can be fixed with tweaks, and we've started doing it
> (we have a working pass-through using VFIO, forgot to mention that, it's
> just that we don't like what we had to do to get there).
> 
> Basically, what we do today is:
> 
> - We add an ioctl to VFIO to expose to qemu the segment information. IE.
> What is the DMA address and size of the DMA "window" usable for a given
> device. This is a tweak, that should really be handled at the "domain"
> level.
> 
> That current hack won't work well if two devices share an iommu. Note
> that we have an additional constraint here due to our paravirt
> interfaces (specified in PAPR) which is that PE domains must have a
> common parent. Basically, pHyp makes them look like a PCIe host bridge
> per domain in the guest. I think that's a pretty good idea and qemu
> might want to do the same.
> 
> - We hack out the currently unconditional mapping of the entire guest
> space in the iommu. Something will have to be done to "decide" whether
> to do that or not ... qemu argument -> ioctl ?
> 
> - We hook up the paravirt call to insert/remove a translation from the
> iommu to the VFIO map/unmap ioctl's.
> 
> This limps along but it's not great. Some of the problems are:
> 
> - I've already mentioned, the domain problem again :-) 
> 
> - Performance sucks of course, the vfio map ioctl wasn't mean for that
> and has quite a bit of overhead. However we'll want to do the paravirt
> call directly in the kernel eventually ...
> 
>   - ... which isn't trivial to get back to our underlying arch specific
> iommu object from there. We'll probably need a set of arch specific
> "sideband" ioctl's to "register" our paravirt iommu "bus numbers" and
> link them to the real thing kernel-side.
> 
> - PAPR (the specification of our paravirt interface and the expectation
> of current OSes) wants iommu pages to be 4k by default, regardless of
> the kernel host page size, which makes things a bit tricky since our
> enterprise host kernels have a 64k base page size. Additionally, we have
> new PAPR interfaces that we want to exploit, to allow the guest to
> create secondary iommu segments (in 64-bit space), which can be used
> (under guest control) to do things like map the entire guest (here it
> is :-) or use larger iommu page sizes (if permitted by the host kernel,
> in our case we could allow 64k iommu page size with a 64k host kernel).
> 
> The above means we need arch specific APIs. So arch specific vfio
> ioctl's, either that or kvm ones going to vfio or something ... the
> current structure of vfio/kvm interaction doesn't make it easy.
> 
> * IO space
> 
> On most (if not all) non-x86 archs, each PCI host bridge provide a
> completely separate PCI address space. Qemu doesn't deal with that very
> well. For MMIO it can be handled since those PCI address spaces are
> "remapped" holes in the main CPU address space so devices can be
> registered by using BAR + offset of that window in qemu MMIO mapping.
> 
> For PIO things get nasty. We have totally separate PIO spaces and qemu
> doesn't seem to like that. We can try to play the offset trick as well,
> we haven't tried yet, but basically that's another one to fix. Not a
> huge deal I suppose but heh ...
> 
> Also our next generation chipset may drop support for PIO completely.
> 
> On the other hand, because PIO is just a special range of MMIO for us,
> we can do normal pass-through on it and don't need any of the emulation
> done by qemu.
> 
>   * MMIO constraints
> 
> The QEMU side VFIO code hard wires various constraints that are entirely
> based on various requirements you decided you have on x86 but don't
> necessarily apply to us :-)
> 
> Due to our paravirt nature, we don't need to masquerade the MSI-X table
> for example. At all. If the guest configures crap into it, too bad, it
> can only shoot itself in the foot since the host bridge enforce
> validation anyways as I explained earlier. Because it's all paravirt, we
> don't need to "translate" the interrupt vectors & addresses, the guest
> will call hypercalls to configure things anyways.
> 
> We don't need to prevent MMIO pass-through for small BARs at all. This
> should be some kind of capability or flag passed by the arch. Our
> segmentation of the MMIO domain means that we can give entire segments
> to the guest and let it access anything in there (those segments are a
> multiple of the page size always). Worst case it will access outside of
> a device BAR within a segment and will cause the PE to go into error
> state, shooting itself in the foot, there is no risk of side effect
> outside of the guest boundaries.
> 
> In fact, we don't even need to emulate BAR sizing etc... in theory. Our
> paravirt guests expect the BARs to have been already allocated for them
> by the firmware and will pick up the addresses from the device-tree :-)
> 
> Today we use a "hack", putting all 0's in there and triggering the linux
> code path to reassign unassigned resources (which will use BAR
> emulation) but that's not what we are -supposed- to do. Not a big deal
> and having the emulation there won't -hurt- us, it's just that we don't
> really need any of it.
> 
> We have a small issue with ROMs. Our current KVM only works with huge
> pages for guest memory but that is being fixed. So the way qemu maps the
> ROM copy into the guest address space doesn't work. It might be handy
> anyways to have a way for qemu to use MMIO emulation for ROM access as a
> fallback. I'll look into it.
> 
>   * EEH
> 
> This is the name of those fancy error handling & isolation features I
> mentioned earlier. To some extent it's a superset of AER, but we don't
> generally expose AER to guests (or even the host), it's swallowed by
> firmware into something else that provides a superset (well mostly) of
> the AER information, and allow us to do those additional things like
> isolating/de-isolating, reset control etc...
> 
> Here too, we'll need arch specific APIs through VFIO. Not necessarily a
> huge deal, I mention it for completeness.
> 
>    * Misc
> 
> There's lots of small bits and pieces... in no special order:
> 
>  - netlink ? WTF ! Seriously, we don't need a hybrid API with a bit of
> netlink and a bit of ioctl's ... it's not like there's something
> fundamentally  better for netlink vs. ioctl... it really depends what
> you are doing, and in this case I fail to see what netlink brings you
> other than bloat and more stupid userspace library deps.
> 
>  - I don't like too much the fact that VFIO provides yet another
> different API to do what we already have at least 2 kernel APIs for, ie,
> BAR mapping and config space access. At least it should be better at
> using the backend infrastructure of the 2 others (sysfs & procfs). I
> understand it wants to filter in some case (config space) and -maybe-
> yet another API is the right way to go but allow me to have my doubts.
> 
> One thing I thought about but you don't seem to like it ... was to use
> the need to represent the partitionable entity as groups in sysfs that I
> talked about earlier. Those could have per-device subdirs with the usual
> config & resource files, same semantic as the ones in the real device,
> but when accessed via the group they get filtering. I might or might not
> be practical in the end, tbd, but it would allow apps using a slightly
> modified libpci for example to exploit some of this.
> 
>  - The qemu vfio code hooks directly into ioapic ... of course that
> won't fly with anything !x86
> 
>  - The various "objects" dealt with here, -especially- interrupts and
> iommu, need a better in-kernel API so that fast in-kernel emulation can
> take over from qemu based emulation. The way we need to do some of this
> on POWER differs from x86. We can elaborate later, it's not necessarily
> a killer either but essentially we'll take the bulk of interrupt
> handling away from VFIO to the point where it won't see any of it at
> all.
> 
>   - Non-PCI devices. That's a hot topic for embedded. I think the vast
> majority here is platform devices. There's quite a bit of vfio that
> isn't intrinsically PCI specific. We could have an in-kernel platform
> driver like we have an in-kernel PCI driver to attach to. The mapping of
> resources to userspace is rather generic, as goes for interrupts. I
> don't know whether that idea can be pushed much further, I don't have
> the bandwidth to look into it much at this point, but maybe it would be
> possible to refactor vfio a bit to better separate what is PCI specific
> to what is not. The idea would be to move the PCI specific bits to
> inside the "placeholder" PCI driver, and same goes for platform bits.
> "generic" ioctl's go to VFIO core, anything that doesn't handle, it
> passes them to the driver which allows the PCI one to handle things
> differently than the platform one, maybe an amba one while at it,
> etc.... just a thought, I haven't gone into the details at all.
> 
> I think that's all I had on my plate today, it's a long enough email
> anyway :-) Anthony suggested we put that on a wiki, I'm a bit
> wiki-disabled myself so he proposed to pickup my email and do that. We
> should probably discuss the various items in here separately as
> different threads to avoid too much confusion.
> 
> One other thing we should do on our side is publish somewhere our
> current hacks to get you an idea of where we are going and what we had
> to do (code speaks more than words). We'll try to do that asap, possibly
> next week.
> 
> Note that I'll be on/off the next few weeks, travelling and doing
> bringup. So expect latency in my replies.
> 
> Cheers,
> Ben.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Nothing great was ever achieved without enthusiasm.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
@ 2011-08-04 10:27   ` Joerg Roedel
  0 siblings, 0 replies; 322+ messages in thread
From: Joerg Roedel @ 2011-08-04 10:27 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	David Gibson, Alex Williamson, Anthony Liguori, linuxppc-dev

Hi Ben,

thanks for your detailed introduction to the requirements for POWER. Its
good to know that the granularity problem is not x86-only.

On Sat, Jul 30, 2011 at 09:58:53AM +1000, Benjamin Herrenschmidt wrote:
> In IBM POWER land, we call this a "partitionable endpoint" (the term
> "endpoint" here is historic, such a PE can be made of several PCIe
> "endpoints"). I think "partitionable" is a pretty good name tho to
> represent the constraints, so I'll call this a "partitionable group"
> from now on.

On x86 this is mostly an issue of the IOMMU and which set of devices use
the same request-id. I used to call that an alias-group because the
devices have a request-id alias to the pci-bridge.

> - The -minimum- granularity of pass-through is not always a single
> device and not always under SW control

Correct.
 
> - Having a magic heuristic in libvirt to figure out those constraints is
> WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
> knowledge of PCI resource management and getting it wrong in many many
> cases, something that took years to fix essentially by ripping it all
> out. This is kernel knowledge and thus we need the kernel to expose in a
> way or another what those constraints are, what those "partitionable
> groups" are.

I agree. Managing the ownership of a group should be done in the kernel.
Doing this in userspace is just too dangerous.

The problem to be solved here is how to present these PEs inside the
kernel and to userspace. I thought a bit about making this visbible
through the iommu-api for in-kernel users. That is probably the most
logical place.

For userspace I would like to propose a new device attribute in sysfs.
This attribute contains the group number. All devices with the same
group number belong to the same PE. Libvirt needs to scan the whole
device tree to build the groups but that is probalbly not a big deal.


	Joerg

> 
> - That does -not- mean that we cannot specify for each individual device
> within such a group where we want to put it in qemu (what devfn etc...).
> As long as there is a clear understanding that the "ownership" of the
> device goes with the group, this is somewhat orthogonal to how they are
> represented in qemu. (Not completely... if the iommu is exposed to the
> guest ,via paravirt for example, some of these constraints must be
> exposed but I'll talk about that more later).
> 
> The interface currently proposed for VFIO (and associated uiommu)
> doesn't handle that problem at all. Instead, it is entirely centered
> around a specific "feature" of the VTd iommu's for creating arbitrary
> domains with arbitrary devices (tho those devices -do- have the same
> constraints exposed above, don't try to put 2 legacy PCI devices behind
> the same bridge into 2 different domains !), but the API totally ignores
> the problem, leaves it to libvirt "magic foo" and focuses on something
> that is both quite secondary in the grand scheme of things, and quite
> x86 VTd specific in the implementation and API definition.
> 
> Now, I'm not saying these programmable iommu domains aren't a nice
> feature and that we shouldn't exploit them when available, but as it is,
> it is too much a central part of the API.
> 
> I'll talk a little bit more about recent POWER iommu's here to
> illustrate where I'm coming from with my idea of groups:
> 
> On p7ioc (the IO chip used on recent P7 machines), there -is- a concept
> of domain and a per-RID filtering. However it differs from VTd in a few
> ways:
> 
> The "domains" (aka PEs) encompass more than just an iommu filtering
> scheme. The MMIO space and PIO space are also segmented, and those
> segments assigned to domains. Interrupts (well, MSI ports at least) are
> assigned to domains. Inbound PCIe error messages are targeted to
> domains, etc...
> 
> Basically, the PEs provide a very strong isolation feature which
> includes errors, and has the ability to immediately "isolate" a PE on
> the first occurence of an error. For example, if an inbound PCIe error
> is signaled by a device on a PE or such a device does a DMA to a
> non-authorized address, the whole PE gets into error state. All
> subsequent stores (both DMA and MMIO) are swallowed and reads return all
> 1's, interrupts are blocked. This is designed to prevent any propagation
> of bad data, which is a very important feature in large high reliability
> systems.
> 
> Software then has the ability to selectively turn back on MMIO and/or
> DMA, perform diagnostics, reset devices etc...
> 
> Because the domains encompass more than just DMA, but also segment the
> MMIO space, it is not practical at all to dynamically reconfigure them
> at runtime to "move" devices into domains. The firmware or early kernel
> code (it depends) will assign devices BARs using an algorithm that keeps
> them within PE segment boundaries, etc....
> 
> Additionally (and this is indeed a "restriction" compared to VTd, though
> I expect our future IO chips to lift it to some extent), PE don't get
> separate DMA address spaces. There is one 64-bit DMA address space per
> PCI host bridge, and it is 'segmented' with each segment being assigned
> to a PE. Due to the way PE assignment works in hardware, it is not
> practical to make several devices share a segment unless they are on the
> same bus. Also the resulting limit in the amount of 32-bit DMA space a
> device can access means that it's impractical to put too many devices in
> a PE anyways. (This is clearly designed for paravirt iommu, I'll talk
> more about that later).
> 
> The above essentially extends the granularity requirement (or rather is
> another factor defining what the granularity of partitionable entities
> is). You can think of it as "pre-existing" domains.
> 
> I believe the way to solve that is to introduce a kernel interface to
> expose those "partitionable entities" to userspace. In addition, it
> occurs to me that the ability to manipulate VTd domains essentially
> boils down to manipulating those groups (creating larger ones with
> individual components).
> 
> I like the idea of defining / playing with those groups statically
> (using a command line tool or sysfs, possibly having a config file
> defining them in a persistent way) rather than having their lifetime
> tied to a uiommu file descriptor.
> 
> It also makes it a LOT easier to have a channel to manipulate
> platform/arch specific attributes of those domains if any.
> 
> So we could define an API or representation in sysfs that exposes what
> the partitionable entities are, and we may add to it an API to
> manipulate them. But we don't have to and I'm happy to keep the
> additional SW grouping you can do on VTd as a sepparate "add-on" API
> (tho I don't like at all the way it works with uiommu). However, qemu
> needs to know what the grouping is regardless of the domains, and it's
> not nice if it has to manipulate two different concepts here so
> eventually those "partitionable entities" from a qemu standpoint must
> look like domains.
> 
> My main point is that I don't want the "knowledge" here to be in libvirt
> or qemu. In fact, I want to be able to do something as simple as passing
> a reference to a PE to qemu (sysfs path ?) and have it just pickup all
> the devices in there and expose them to the guest.
> 
> This can be done in a way that isn't PCI specific as well (the
> definition of the groups and what is grouped would would obviously be
> somewhat bus specific and handled by platform code in the kernel).
> 
> Maybe something like /sys/devgroups ? This probably warrants involving
> more kernel people into the discussion.
> 
> * IOMMU
> 
> Now more on iommu. I've described I think in enough details how ours
> work, there are others, I don't know what freescale or ARM are doing,
> sparc doesn't quite work like VTd either, etc...
> 
> The main problem isn't that much the mechanics of the iommu but really
> how it's exposed (or not) to guests.
> 
> VFIO here is basically designed for one and only one thing: expose the
> entire guest physical address space to the device more/less 1:1.
> 
> This means:
> 
>   - It only works with iommu's that provide complete DMA address spaces
> to devices. Won't work with a single 'segmented' address space like we
> have on POWER.
> 
>   - It requires the guest to be pinned. Pass-through -> no more swap
> 
>   - The guest cannot make use of the iommu to deal with 32-bit DMA
> devices; thus a guest with more than a few GB of RAM (I don't know the
> exact limit on x86, it depends on your IO hole I suppose) ends up
> back with swiotlb & bounce buffering.
> 
>   - It doesn't work for POWER server anyways because of our need to
> provide a paravirt iommu interface to the guest since that's how pHyp
> works today and how existing OSes expect to operate.
> 
> Now some of this can be fixed with tweaks, and we've started doing it
> (we have a working pass-through using VFIO, forgot to mention that, it's
> just that we don't like what we had to do to get there).
> 
> Basically, what we do today is:
> 
> - We add an ioctl to VFIO to expose to qemu the segment information. IE.
> What is the DMA address and size of the DMA "window" usable for a given
> device. This is a tweak, that should really be handled at the "domain"
> level.
> 
> That current hack won't work well if two devices share an iommu. Note
> that we have an additional constraint here due to our paravirt
> interfaces (specified in PAPR) which is that PE domains must have a
> common parent. Basically, pHyp makes them look like a PCIe host bridge
> per domain in the guest. I think that's a pretty good idea and qemu
> might want to do the same.
> 
> - We hack out the currently unconditional mapping of the entire guest
> space in the iommu. Something will have to be done to "decide" whether
> to do that or not ... qemu argument -> ioctl ?
> 
> - We hook up the paravirt call to insert/remove a translation from the
> iommu to the VFIO map/unmap ioctl's.
> 
> This limps along but it's not great. Some of the problems are:
> 
> - I've already mentioned, the domain problem again :-) 
> 
> - Performance sucks of course, the vfio map ioctl wasn't meant for that
> and has quite a bit of overhead. However we'll want to do the paravirt
> call directly in the kernel eventually ...
> 
>   - ... which isn't trivial to get back to our underlying arch specific
> iommu object from there. We'll probably need a set of arch specific
> "sideband" ioctl's to "register" our paravirt iommu "bus numbers" and
> link them to the real thing kernel-side.
> 
> - PAPR (the specification of our paravirt interface and the expectation
> of current OSes) wants iommu pages to be 4k by default, regardless of
> the kernel host page size, which makes things a bit tricky since our
> enterprise host kernels have a 64k base page size. Additionally, we have
> new PAPR interfaces that we want to exploit, to allow the guest to
> create secondary iommu segments (in 64-bit space), which can be used
> (under guest control) to do things like map the entire guest (here it
> is :-) or use larger iommu page sizes (if permitted by the host kernel,
> in our case we could allow 64k iommu page size with a 64k host kernel).
> 
> The above means we need arch specific APIs. So arch specific vfio
> ioctl's, either that or kvm ones going to vfio or something ... the
> current structure of vfio/kvm interaction doesn't make it easy.
> 
> * IO space
> 
> On most (if not all) non-x86 archs, each PCI host bridge provides a
> completely separate PCI address space. Qemu doesn't deal with that very
> well. For MMIO it can be handled since those PCI address spaces are
> "remapped" holes in the main CPU address space so devices can be
> registered by using BAR + offset of that window in qemu MMIO mapping.
> 
> For PIO things get nasty. We have totally separate PIO spaces and qemu
> doesn't seem to like that. We can try to play the offset trick as well,
> we haven't tried yet, but basically that's another one to fix. Not a
> huge deal I suppose but heh ...
> 
> Also our next generation chipset may drop support for PIO completely.
> 
> On the other hand, because PIO is just a special range of MMIO for us,
> we can do normal pass-through on it and don't need any of the emulation
> done by qemu.
> 
>   * MMIO constraints
> 
> The QEMU side VFIO code hard wires various constraints that are entirely
> based on various requirements you decided you have on x86 but don't
> necessarily apply to us :-)
> 
> Due to our paravirt nature, we don't need to masquerade the MSI-X table
> for example. At all. If the guest configures crap into it, too bad, it
> can only shoot itself in the foot since the host bridge enforces
> validation anyways as I explained earlier. Because it's all paravirt, we
> don't need to "translate" the interrupt vectors & addresses, the guest
> will use hypercalls to configure things anyways.
> 
> We don't need to prevent MMIO pass-through for small BARs at all. This
> should be some kind of capability or flag passed by the arch. Our
> segmentation of the MMIO domain means that we can give entire segments
> to the guest and let it access anything in there (those segments are a
> multiple of the page size always). Worst case it will access outside of
> a device BAR within a segment and will cause the PE to go into error
> state, shooting itself in the foot, there is no risk of side effect
> outside of the guest boundaries.
> 
> In fact, we don't even need to emulate BAR sizing etc... in theory. Our
> paravirt guests expect the BARs to have been already allocated for them
> by the firmware and will pick up the addresses from the device-tree :-)
> 
> Today we use a "hack", putting all 0's in there and triggering the linux
> code path to reassign unassigned resources (which will use BAR
> emulation) but that's not what we are -supposed- to do. Not a big deal
> and having the emulation there won't -hurt- us, it's just that we don't
> really need any of it.
> 
> We have a small issue with ROMs. Our current KVM only works with huge
> pages for guest memory but that is being fixed. So the way qemu maps the
> ROM copy into the guest address space doesn't work. It might be handy
> anyways to have a way for qemu to use MMIO emulation for ROM access as a
> fallback. I'll look into it.
> 
>   * EEH
> 
> This is the name of those fancy error handling & isolation features I
> mentioned earlier. To some extent it's a superset of AER, but we don't
> generally expose AER to guests (or even the host), it's swallowed by
> firmware into something else that provides a superset (well mostly) of
> the AER information, and allows us to do those additional things like
> isolating/de-isolating, reset control etc...
> 
> Here too, we'll need arch specific APIs through VFIO. Not necessarily a
> huge deal, I mention it for completeness.
> 
>    * Misc
> 
> There's lots of small bits and pieces... in no special order:
> 
>  - netlink ? WTF ! Seriously, we don't need a hybrid API with a bit of
> netlink and a bit of ioctl's ... it's not like there's something
> fundamentally better for netlink vs. ioctl... it really depends what
> you are doing, and in this case I fail to see what netlink brings you
> other than bloat and more stupid userspace library deps.
> 
>  - I don't like too much the fact that VFIO provides yet another
> different API to do what we already have at least 2 kernel APIs for, ie,
> BAR mapping and config space access. At least it should be better at
> using the backend infrastructure of the 2 others (sysfs & procfs). I
> understand it wants to filter in some case (config space) and -maybe-
> yet another API is the right way to go but allow me to have my doubts.
> 
One thing I thought about, though you don't seem to like it, was to
reuse the sysfs groups representing partitionable entities that I
talked about earlier. Those could have per-device subdirs with the usual
config & resource files, same semantic as the ones in the real device,
but when accessed via the group they get filtering. It might or might not
be practical in the end, TBD, but it would allow apps using a slightly
modified libpci for example to exploit some of this.
> 
>  - The qemu vfio code hooks directly into ioapic ... of course that
> won't fly with anything !x86
> 
>  - The various "objects" dealt with here, -especially- interrupts and
> iommu, need a better in-kernel API so that fast in-kernel emulation can
> take over from qemu based emulation. The way we need to do some of this
> on POWER differs from x86. We can elaborate later, it's not necessarily
> a killer either but essentially we'll take the bulk of interrupt
> handling away from VFIO to the point where it won't see any of it at
> all.
> 
>   - Non-PCI devices. That's a hot topic for embedded. I think the vast
> majority here is platform devices. There's quite a bit of vfio that
> isn't intrinsically PCI specific. We could have an in-kernel platform
> driver like we have an in-kernel PCI driver to attach to. The mapping of
> resources to userspace is rather generic, and the same goes for interrupts. I
> don't know whether that idea can be pushed much further, I don't have
> the bandwidth to look into it much at this point, but maybe it would be
> possible to refactor vfio a bit to better separate what is PCI specific
> to what is not. The idea would be to move the PCI specific bits to
> inside the "placeholder" PCI driver, and same goes for platform bits.
> "generic" ioctl's go to VFIO core; anything it doesn't handle, it
> passes to the driver, which allows the PCI one to handle things
> differently than the platform one, maybe an amba one while at it,
> etc.... just a thought, I haven't gone into the details at all.
> 
> I think that's all I had on my plate today, it's a long enough email
> anyway :-) Anthony suggested we put that on a wiki, I'm a bit
> wiki-disabled myself so he proposed to pickup my email and do that. We
> should probably discuss the various items in here separately as
> different threads to avoid too much confusion.
> 
> One other thing we should do on our side is publish somewhere our
> current hacks to get you an idea of where we are going and what we had
> to do (code speaks more than words). We'll try to do that asap, possibly
> next week.
> 
> Note that I'll be on/off the next few weeks, travelling and doing
> bringup. So expect latency in my replies.
> 
> Cheers,
> Ben.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Nothing great was ever achieved without enthusiasm.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-07-30 18:20   ` Alex Williamson
  (?)
@ 2011-08-04 10:35     ` Joerg Roedel
  -1 siblings, 0 replies; 322+ messages in thread
From: Joerg Roedel @ 2011-08-04 10:35 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Benjamin Herrenschmidt, kvm, Anthony Liguori, David Gibson,
	Paul Mackerras, Alexey Kardashevskiy, linux-pci, linuxppc-dev,
	iommu, benve, aafabbri, chrisw, qemu-devel

On Sat, Jul 30, 2011 at 12:20:08PM -0600, Alex Williamson wrote:
> On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote:
> > - The -minimum- granularity of pass-through is not always a single
> > device and not always under SW control
> 
> But IMHO, we need to preserve the granularity of exposing a device to a
> guest as a single device.  That might mean some devices are held hostage
> by an agent on the host.

That's true. There is a difference between unassigning a group from the host
and making single devices in that PE visible to the guest. But we need
to make sure that no device in a PE is used by the host while at least
one device is assigned to a guest.

Unlike the other proposals to handle this in libvirt, I think this
belongs into the kernel. Doing this in userspace may break the entire
system if done wrong.

For example, if one device from a PE is assigned to a guest while
another one is not unbound from its host driver, the driver may get very
confused when DMA just stops working. This may crash the entire system
or lead to silent data corruption in the guest. The behavior is
basically undefined then. The kernel must not allow that.


	Joerg

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-01 20:27     ` Alex Williamson
@ 2011-08-04 10:41       ` Joerg Roedel
  -1 siblings, 0 replies; 322+ messages in thread
From: Joerg Roedel @ 2011-08-04 10:41 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Avi Kivity, Benjamin Herrenschmidt, kvm, Anthony Liguori,
	David Gibson, Paul Mackerras, Alexey Kardashevskiy, linux-pci,
	linuxppc-dev

On Mon, Aug 01, 2011 at 02:27:36PM -0600, Alex Williamson wrote:
> It's not clear to me how we could skip it.  With VT-d, we'd have to
> implement an emulated interrupt remapper and hope that the guest picks
> unused indexes in the host interrupt remapping table before it could do
> anything useful with direct access to the MSI-X table.  Maybe AMD IOMMU
> makes this easier?

AMD IOMMU provides remapping tables per-device, and not a global one.
But that does not make direct guest-access to the MSI-X table safe. The
table contains the interrupt-type and the vector
which is used as an index into the remapping table by the IOMMU. So when
the guest writes into its MSI-X table the remapping-table in the host
needs to be updated too.

	Joerg

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-04 10:41       ` Joerg Roedel
@ 2011-08-05 10:26         ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 322+ messages in thread
From: Benjamin Herrenschmidt @ 2011-08-05 10:26 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Alex Williamson, Avi Kivity, kvm, Anthony Liguori, David Gibson,
	Paul Mackerras, Alexey Kardashevskiy, linux-pci, linuxppc-dev

On Thu, 2011-08-04 at 12:41 +0200, Joerg Roedel wrote:
> On Mon, Aug 01, 2011 at 02:27:36PM -0600, Alex Williamson wrote:
> > It's not clear to me how we could skip it.  With VT-d, we'd have to
> > implement an emulated interrupt remapper and hope that the guest picks
> > unused indexes in the host interrupt remapping table before it could do
> > anything useful with direct access to the MSI-X table.  Maybe AMD IOMMU
> > makes this easier?
> 
> AMD IOMMU provides remapping tables per-device, and not a global one.
> But that does not make direct guest-access to the MSI-X table safe. The
> table contains the interrupt-type and the vector
> which is used as an index into the remapping table by the IOMMU. So when
> the guest writes into its MSI-X table the remapping-table in the host
> needs to be updated too.

Right, you need paravirt to avoid filtering :-)

IE the problem is two fold:

 - Getting the right value in the table / remapper so things work
(paravirt)

 - Protecting against the guest somehow managing to change the value in
the table (either directly or via a backdoor access to its own config
space).

The latter, for us, comes from the HW PE filtering of the MSI transactions.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-04 10:27   ` Joerg Roedel
@ 2011-08-05 10:42     ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 322+ messages in thread
From: Benjamin Herrenschmidt @ 2011-08-05 10:42 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: kvm, Anthony Liguori, Alex Williamson, David Gibson,
	Paul Mackerras, Alexey Kardashevskiy, linux-pci, linuxppc-dev

On Thu, 2011-08-04 at 12:27 +0200, Joerg Roedel wrote:
> Hi Ben,
> 
> thanks for your detailed introduction to the requirements for POWER. Its
> good to know that the granularity problem is not x86-only.

I'm happy to see your reply :-) I had the feeling I was a bit alone
here...

> On Sat, Jul 30, 2011 at 09:58:53AM +1000, Benjamin Herrenschmidt wrote:
> > In IBM POWER land, we call this a "partitionable endpoint" (the term
> > "endpoint" here is historic, such a PE can be made of several PCIe
> > "endpoints"). I think "partitionable" is a pretty good name tho to
> > represent the constraints, so I'll call this a "partitionable group"
> > from now on.
> 
> On x86 this is mostly an issue of the IOMMU and which set of devices use
> the same request-id. I used to call that an alias-group because the
> devices have a request-id alias to the pci-bridge.

Right. In fact to try to clarify the problem for everybody, I think we
can distinguish two different classes of "constraints" that can
influence the grouping of devices:

 1- Hard constraints. These are typically devices using the same RID or
where the RID cannot be reliably guaranteed (the latter is the case with
some PCIe-PCIX bridges which will take ownership of "some" transactions
such as split but not all). Devices like that must be in the same
domain. This is where PowerPC adds to what x86 does today the concept
that the domains are pre-existing, since we use the RID for error
isolation & MMIO segmenting as well. so we need to create those domains
at boot time.

 2- Softer constraints. Those constraints derive from the fact that not
applying them risks enabling the guest to create side effects outside of
its "sandbox". To some extent, there can be "degrees" of badness between
the various things that can cause such constraints. Examples are shared
LSIs (since trusting DisINTx can be chancy, see earlier discussions),
potentially any set of functions in the same device can be problematic
due to the possibility to get backdoor access to the BARs etc...

Now, what I derive from the discussion we've had so far, is that we need
to find a proper fix for #1, but Alex and Avi seem to prefer that #2
remains a matter of libvirt/user doing the right thing (basically
keeping a loaded gun aimed at the user's foot with a very very very
sweet trigger but heh, let's not start a flamewar here :-)

So let's try to find a proper solution for #1 now, and leave #2 alone
for the time being.

Maybe the right option is for x86 to move toward pre-existing domains
like powerpc does, or maybe we can just expose some kind of ID.

Because #1 is a mix of generic constraints (nasty bridges) and very
platform specific ones (whatever capacity limits in our MMIO segmenting
forced us to put two devices in the same hard domain on power), I
believe it's really something the kernel must solve, not libvirt nor
qemu user or anything else.

I am open to suggestions here. I can easily expose my PE# (it's just a
number) somewhere in sysfs, in fact I'm considering doing it in the PCI
devices sysfs directory, simply because it can/will be useful for other
things such as error reporting, so we could maybe build on that.

The crux for me is really the need for pre-existence of the iommu
domains as my PE's imply a shared iommu space.

> > - The -minimum- granularity of pass-through is not always a single
> > device and not always under SW control
> 
> Correct.
>  
> > - Having a magic heuristic in libvirt to figure out those constraints is
> > WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
> > knowledge of PCI resource management and getting it wrong in many many
> > cases, something that took years to fix essentially by ripping it all
> > out. This is kernel knowledge and thus we need the kernel to expose in a
> > way or another what those constraints are, what those "partitionable
> > groups" are.
> 
> I agree. Managing the ownership of a group should be done in the kernel.
> Doing this in userspace is just too dangerous.
> 
> The problem to be solved here is how to present these PEs inside the
kernel and to userspace. I thought a bit about making this visible
> through the iommu-api for in-kernel users. That is probably the most
> logical place.

Ah you started answering to my above questions :-)

We could do what you propose. It depends what we want to do with
domains. Practically speaking, we could make domains pre-existing (with
the ability to group several PEs into larger domains) or we could keep
the concepts different, possibly with the limitation that on powerpc, a
domain == a PE.

I suppose we -could- make arbitrary domains on ppc as well by making the
various PE's iommu's in HW point to the same in-memory table, but that's
a bit nasty in practice due to the way we manage those, and it would to
some extent increase the risk of a failing device/driver stomping on
another one and thus taking it down with itself. IE. isolation of errors
is an important feature for us.

So I'd rather avoid the whole domain thing for now and keep the
constraint, for powerpc at least, that a domain == a PE, and thus find a
proper way to expose that to qemu/libvirt.

> For userspace I would like to propose a new device attribute in sysfs.
> This attribute contains the group number. All devices with the same
> group number belong to the same PE. Libvirt needs to scan the whole
device tree to build the groups but that is probably not a big deal.

That's trivial for me to map that to my existing PE number. Should we
define the number space to be within a PCI domain (ie a host bridge), or
should it be a global space? In the latter case I can construct them
using domain << 16 | PE# or something like that.

Cheers,
Ben.

>	Joerg
> 
> > 
> > - That does -not- mean that we cannot specify for each individual device
> > within such a group where we want to put it in qemu (what devfn etc...).
> > As long as there is a clear understanding that the "ownership" of the
> > device goes with the group, this is somewhat orthogonal to how they are
> > represented in qemu. (Not completely... if the iommu is exposed to the
> > guest ,via paravirt for example, some of these constraints must be
> > exposed but I'll talk about that more later).
> > 
> > The interface currently proposed for VFIO (and associated uiommu)
> > doesn't handle that problem at all. Instead, it is entirely centered
> > around a specific "feature" of the VTd iommu's for creating arbitrary
> > domains with arbitrary devices (tho those devices -do- have the same
> > constraints exposed above, don't try to put 2 legacy PCI devices behind
> > the same bridge into 2 different domains !), but the API totally ignores
> > the problem, leaves it to libvirt "magic foo" and focuses on something
> > that is both quite secondary in the grand scheme of things, and quite
> > x86 VTd specific in the implementation and API definition.
> > 
> > Now, I'm not saying these programmable iommu domains aren't a nice
> > feature and that we shouldn't exploit them when available, but as it is,
> > it is too much a central part of the API.
> > 
> > I'll talk a little bit more about recent POWER iommu's here to
> > illustrate where I'm coming from with my idea of groups:
> > 
> > On p7ioc (the IO chip used on recent P7 machines), there -is- a concept
> > of domain and a per-RID filtering. However it differs from VTd in a few
> > ways:
> > 
> > The "domains" (aka PEs) encompass more than just an iommu filtering
> > scheme. The MMIO space and PIO space are also segmented, and those
> > segments assigned to domains. Interrupts (well, MSI ports at least) are
> > assigned to domains. Inbound PCIe error messages are targeted to
> > domains, etc...
> > 
> > Basically, the PEs provide a very strong isolation feature which
> > includes errors, and has the ability to immediately "isolate" a PE on
> > the first occurence of an error. For example, if an inbound PCIe error
> > is signaled by a device on a PE or such a device does a DMA to a
> > non-authorized address, the whole PE gets into error state. All
> > subsequent stores (both DMA and MMIO) are swallowed and reads return all
> > 1's, interrupts are blocked. This is designed to prevent any propagation
> > of bad data, which is a very important feature in large high reliability
> > systems.
> > 
> > Software then has the ability to selectively turn back on MMIO and/or
> > DMA, perform diagnostics, reset devices etc...
> > 
> > Because the domains encompass more than just DMA, but also segment the
> > MMIO space, it is not practical at all to dynamically reconfigure them
> > at runtime to "move" devices into domains. The firmware or early kernel
> > code (it depends) will assign devices BARs using an algorithm that keeps
> > them within PE segment boundaries, etc....
> > 
> > Additionally (and this is indeed a "restriction" compared to VTd, though
> > I expect our future IO chips to lift it to some extent), PE don't get
> > separate DMA address spaces. There is one 64-bit DMA address space per
> > PCI host bridge, and it is 'segmented' with each segment being assigned
> > to a PE. Due to the way PE assignment works in hardware, it is not
> > practical to make several devices share a segment unless they are on the
> > same bus. Also the resulting limit in the amount of 32-bit DMA space a
> > device can access means that it's impractical to put too many devices in
> > a PE anyways. (This is clearly designed for paravirt iommu, I'll talk
> > more about that later).
> > 
> > The above essentially extends the granularity requirement (or rather is
> > another factor defining what the granularity of partitionable entities
> > is). You can think of it as "pre-existing" domains.
> > 
> > I believe the way to solve that is to introduce a kernel interface to
> > expose those "partitionable entities" to userspace. In addition, it
> > occurs to me that the ability to manipulate VTd domains essentially
> > boils down to manipulating those groups (creating larger ones with
> > individual components).
> > 
> > I like the idea of defining / playing with those groups statically
> > (using a command line tool or sysfs, possibly having a config file
> > defining them in a persistent way) rather than having their lifetime
> > tied to a uiommu file descriptor.
> > 
> > It also makes it a LOT easier to have a channel to manipulate
> > platform/arch specific attributes of those domains if any.
> > 
> > So we could define an API or representation in sysfs that exposes what
> > the partitionable entities are, and we may add to it an API to
> > manipulate them. But we don't have to and I'm happy to keep the
> > additional SW grouping you can do on VTd as a separate "add-on" API
> > (tho I don't like at all the way it works with uiommu). However, qemu
> > needs to know what the grouping is regardless of the domains, and it's
> > not nice if it has to manipulate two different concepts here so
> > eventually those "partitionable entities" from a qemu standpoint must
> > look like domains.
> > 
> > My main point is that I don't want the "knowledge" here to be in libvirt
> > or qemu. In fact, I want to be able to do something as simple as passing
> > a reference to a PE to qemu (sysfs path ?) and have it just pickup all
> > the devices in there and expose them to the guest.
> > 
> > This can be done in a way that isn't PCI specific as well (the
> > definition of the groups and what is grouped would obviously be
> > somewhat bus specific and handled by platform code in the kernel).
> > 
> > Maybe something like /sys/devgroups ? This probably warrants involving
> > more kernel people into the discussion.
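(As a strawman for what a consumer of such an interface would do: the /sys/devgroups path and a per-device "group" attribute are hypothetical here, and the helper below just models the scan in Python:)

```python
def build_groups(device_to_group):
    """Invert a {device: group_id} mapping, the way a consumer of a
    hypothetical /sys/devgroups (or per-device 'group' attribute)
    would after scanning sysfs, so qemu can be handed one group and
    pick up every member device in it."""
    groups = {}
    for dev, gid in device_to_group.items():
        groups.setdefault(gid, []).append(dev)
    # sort members so the result is stable regardless of scan order
    return {gid: sorted(devs) for gid, devs in groups.items()}
```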
> > 
> > * IOMMU
> > 
> > Now more on iommu. I've described I think in enough detail how ours
> > work, there are others, I don't know what freescale or ARM are doing,
> > sparc doesn't quite work like VTd either, etc...
> > 
> > The main problem isn't that much the mechanics of the iommu but really
> > how it's exposed (or not) to guests.
> > 
> > VFIO here is basically designed for one and only one thing: expose the
> > entire guest physical address space to the device more/less 1:1.
> > 
> > This means:
> > 
> >   - It only works with iommu's that provide complete DMA address spaces
> > to devices. Won't work with a single 'segmented' address space like we
> > have on POWER.
> > 
> >   - It requires the guest to be pinned. Pass-through -> no more swap
> > 
> >   - The guest cannot make use of the iommu to deal with 32-bit DMA
> > devices, so a guest with more than a few G of RAM (I don't know the
> > exact limit on x86, it depends on your IO hole I suppose) ends up back
> > with swiotlb & bounce buffering.
> > 
> >   - It doesn't work for POWER server anyways because of our need to
> > provide a paravirt iommu interface to the guest since that's how pHyp
> > works today and how existing OSes expect to operate.
> > 
> > Now some of this can be fixed with tweaks, and we've started doing it
> > (we have a working pass-through using VFIO, forgot to mention that, it's
> > just that we don't like what we had to do to get there).
> > 
> > Basically, what we do today is:
> > 
> > - We add an ioctl to VFIO to expose to qemu the segment information. IE.
> > What is the DMA address and size of the DMA "window" usable for a given
> > device. This is a tweak, that should really be handled at the "domain"
> > level.
> > 
> > That current hack won't work well if two devices share an iommu. Note
> > that we have an additional constraint here due to our paravirt
> > interfaces (specified in PAPR) which is that PE domains must have a
> > common parent. Basically, pHyp makes them look like a PCIe host bridge
> > per domain in the guest. I think that's a pretty good idea and qemu
> > might want to do the same.
> > 
> > - We hack out the currently unconditional mapping of the entire guest
> > space in the iommu. Something will have to be done to "decide" whether
> > to do that or not ... qemu argument -> ioctl ?
> > 
> > - We hook up the paravirt call to insert/remove a translation from the
> > iommu to the VFIO map/unmap ioctl's.
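(A toy model of that third hook, in Python pseudocode. PAPR's H_PUT_TCE takes a liobn/ioba/tce triple; my understanding is that a TCE with no access bits means "clear the entry", but the exact permission-bit layout and the helper name below are my invention for illustration:)

```python
# Invented constants standing in for the PAPR TCE permission bits;
# the exact bit layout is an assumption, for illustration only.
TCE_READ = 0x1
TCE_WRITE = 0x2

def handle_put_tce(iommu_map, ioba, tce, page_shift=12):
    """Toy model of routing an H_PUT_TCE hypercall to map/unmap
    operations: a TCE with no permission bits clears the entry
    (unmap), anything else installs a translation (map)."""
    perms = tce & (TCE_READ | TCE_WRITE)
    page = ioba >> page_shift
    if perms:
        # store (real address, permissions) for this iommu page
        iommu_map[page] = (tce & ~(TCE_READ | TCE_WRITE), perms)
    else:
        iommu_map.pop(page, None)
    return iommu_map
```

In the real thing each map/unmap is a VFIO ioctl round-trip, which is exactly where the overhead complained about below comes from.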
> > 
> > This limps along but it's not great. Some of the problems are:
> > 
> > - I've already mentioned, the domain problem again :-) 
> > 
> > - Performance sucks of course, the vfio map ioctl wasn't meant for that
> > and has quite a bit of overhead. However we'll want to do the paravirt
> > call directly in the kernel eventually ...
> > 
> >   - ... which isn't trivial to get back to our underlying arch specific
> > iommu object from there. We'll probably need a set of arch specific
> > "sideband" ioctl's to "register" our paravirt iommu "bus numbers" and
> > link them to the real thing kernel-side.
> > 
> > - PAPR (the specification of our paravirt interface and the expectation
> > of current OSes) wants iommu pages to be 4k by default, regardless of
> > the kernel host page size, which makes things a bit tricky since our
> > enterprise host kernels have a 64k base page size. Additionally, we have
> > new PAPR interfaces that we want to exploit, to allow the guest to
> > create secondary iommu segments (in 64-bit space), which can be used
> > (under guest control) to do things like map the entire guest (here it
> > is :-) or use larger iommu page sizes (if permitted by the host kernel,
> > in our case we could allow 64k iommu page size with a 64k host kernel).
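(The 4k-on-64k case means the pinning bookkeeping has to be per host page, not per TCE; here's a sketch, with an invented helper name, using guest real addresses for simplicity:)

```python
def host_page_refs(real_addrs, host_shift=16):
    """With 4k iommu pages on a 64k host kernel, up to 16 TCEs can
    point into one host page, so the host must refcount its pinning
    per 64k host page rather than per TCE. Returns the refcount
    per host page for a list of TCE target real addresses."""
    refs = {}
    for addr in real_addrs:
        host_page = addr >> host_shift
        refs[host_page] = refs.get(host_page, 0) + 1
    return refs
```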
> > 
> > The above means we need arch specific APIs. So arch specific vfio
> > ioctl's, either that or kvm ones going to vfio or something ... the
> > current structure of vfio/kvm interaction doesn't make it easy.
> > 
> > * IO space
> > 
> > On most (if not all) non-x86 archs, each PCI host bridge provides a
> > completely separate PCI address space. Qemu doesn't deal with that very
> > well. For MMIO it can be handled since those PCI address spaces are
> > "remapped" holes in the main CPU address space so devices can be
> > registered by using BAR + offset of that window in qemu MMIO mapping.
> > 
> > For PIO things get nasty. We have totally separate PIO spaces and qemu
> > doesn't seem to like that. We can try to play the offset trick as well,
> > we haven't tried yet, but basically that's another one to fix. Not a
> > huge deal I suppose but heh ...
> > 
> > Also our next generation chipset may drop support for PIO completely.
> > 
> > On the other hand, because PIO is just a special range of MMIO for us,
> > we can do normal pass-through on it and don't need any of the emulation
> > done by qemu.
> > 
> >   * MMIO constraints
> > 
> > The QEMU side VFIO code hard wires various constraints that are entirely
> > based on various requirements you decided you have on x86 but don't
> > necessarily apply to us :-)
> > 
> > Due to our paravirt nature, we don't need to masquerade the MSI-X table
> > for example. At all. If the guest configures crap into it, too bad, it
> > can only shoot itself in the foot since the host bridge enforces
> > validation anyways as I explained earlier. Because it's all paravirt, we
> > don't need to "translate" the interrupt vectors & addresses, the guest
> > will call hypercalls to configure things anyways.
> > 
> > We don't need to prevent MMIO pass-through for small BARs at all. This
> > should be some kind of capability or flag passed by the arch. Our
> > segmentation of the MMIO domain means that we can give entire segments
> > to the guest and let it access anything in there (those segments are a
> > multiple of the page size always). Worst case it will access outside of
> > a device BAR within a segment and will cause the PE to go into error
> > state, shooting itself in the foot, there is no risk of side effect
> > outside of the guest boundaries.
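(In other words, the unit that gets handed to the guest is the MMIO segment, not the BAR. A sketch, with an invented 1MB segment size and helper name:)

```python
def mmio_grant(bar_base, bar_size, seg_size=1 << 20):
    """Round a BAR out to the MMIO segment(s) containing it, since
    segments (always a multiple of the page size; the 1MB size is
    an invented example) are the granularity at which MMIO can be
    given to a guest. Anything the guest touches inside the granted
    range but outside the BAR only trips its own PE into error
    state. Returns (grant_base, grant_size)."""
    start = (bar_base // seg_size) * seg_size
    end = -(-(bar_base + bar_size) // seg_size) * seg_size  # ceil
    return start, end - start
```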
> > 
> > In fact, we don't even need to emulate BAR sizing etc... in theory. Our
> > paravirt guests expect the BARs to have been already allocated for them
> > by the firmware and will pick up the addresses from the device-tree :-)
> > 
> > Today we use a "hack", putting all 0's in there and triggering the linux
> > code path to reassign unassigned resources (which will use BAR
> > emulation) but that's not what we are -supposed- to do. Not a big deal
> > and having the emulation there won't -hurt- us, it's just that we don't
> > really need any of it.
> > 
> > We have a small issue with ROMs. Our current KVM only works with huge
> > pages for guest memory but that is being fixed. So the way qemu maps the
> > ROM copy into the guest address space doesn't work. It might be handy
> > anyways to have a way for qemu to use MMIO emulation for ROM access as a
> > fallback. I'll look into it.
> > 
> >   * EEH
> > 
> > This is the name of those fancy error handling & isolation features I
> > mentioned earlier. To some extent it's a superset of AER, but we don't
> > generally expose AER to guests (or even the host), it's swallowed by
> > firmware into something else that provides a superset (well mostly) of
> > the AER information, and allows us to do those additional things like
> > isolating/de-isolating, reset control etc...
> > 
> > Here too, we'll need arch specific APIs through VFIO. Not necessarily a
> > huge deal, I mention it for completeness.
> > 
> >    * Misc
> > 
> > There's lots of small bits and pieces... in no special order:
> > 
> >  - netlink ? WTF ! Seriously, we don't need a hybrid API with a bit of
> > netlink and a bit of ioctl's ... it's not like there's something
> > fundamentally better for netlink vs. ioctl... it really depends what
> > you are doing, and in this case I fail to see what netlink brings you
> > other than bloat and more stupid userspace library deps.
> > 
> >  - I don't like too much the fact that VFIO provides yet another
> > different API to do what we already have at least 2 kernel APIs for, ie,
> > BAR mapping and config space access. At least it should be better at
> > using the backend infrastructure of the 2 others (sysfs & procfs). I
> > understand it wants to filter in some case (config space) and -maybe-
> > yet another API is the right way to go but allow me to have my doubts.
> > 
> > One thing I thought about but you don't seem to like it ... was to use
> > the need to represent the partitionable entity as groups in sysfs that I
> > talked about earlier. Those could have per-device subdirs with the usual
> > config & resource files, same semantic as the ones in the real device,
> > but when accessed via the group they get filtering. It might or might not
> > be practical in the end, tbd, but it would allow apps using a slightly
> > modified libpci for example to exploit some of this.
> > 
> >  - The qemu vfio code hooks directly into ioapic ... of course that
> > won't fly with anything !x86
> > 
> >  - The various "objects" dealt with here, -especially- interrupts and
> > iommu, need a better in-kernel API so that fast in-kernel emulation can
> > take over from qemu based emulation. The way we need to do some of this
> > on POWER differs from x86. We can elaborate later, it's not necessarily
> > a killer either but essentially we'll take the bulk of interrupt
> > handling away from VFIO to the point where it won't see any of it at
> > all.
> > 
> >   - Non-PCI devices. That's a hot topic for embedded. I think the vast
> > majority here is platform devices. There's quite a bit of vfio that
> > isn't intrinsically PCI specific. We could have an in-kernel platform
> > driver like we have an in-kernel PCI driver to attach to. The mapping of
> > resources to userspace is rather generic, as goes for interrupts. I
> > don't know whether that idea can be pushed much further, I don't have
> > the bandwidth to look into it much at this point, but maybe it would be
> > possible to refactor vfio a bit to better separate what is PCI specific
> > to what is not. The idea would be to move the PCI specific bits to
> > inside the "placeholder" PCI driver, and same goes for platform bits.
> > "generic" ioctl's go to VFIO core; anything it doesn't handle, it
> > passes to the driver, which allows the PCI one to handle things
> > differently than the platform one, maybe an amba one while at it,
> > etc.... just a thought, I haven't gone into the details at all.
> > 
> > I think that's all I had on my plate today, it's a long enough email
> > anyway :-) Anthony suggested we put that on a wiki, I'm a bit
> > wiki-disabled myself so he proposed to pickup my email and do that. We
> > should probably discuss the various items in here separately as
> > different threads to avoid too much confusion.
> > 
> > One other thing we should do on our side is publish somewhere our
> > current hacks to get you an idea of where we are going and what we had
> > to do (code speaks more than words). We'll try to do that asap, possibly
> > next week.
> > 
> > Note that I'll be on/off the next few weeks, travelling and doing
> > bringup. So expect latency in my replies.
> > 
> > Cheers,
> > Ben.
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


* Re: kvm PCI assignment & VFIO ramblings
@ 2011-08-05 10:42     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 322+ messages in thread
From: Benjamin Herrenschmidt @ 2011-08-05 10:42 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	David Gibson, Alex Williamson, Anthony Liguori, linuxppc-dev

On Thu, 2011-08-04 at 12:27 +0200, Joerg Roedel wrote:
> Hi Ben,
> 
> thanks for your detailed introduction to the requirements for POWER. It's
> good to know that the granularity problem is not x86-only.

I'm happy to see your reply :-) I had the feeling I was a bit alone
here...

> On Sat, Jul 30, 2011 at 09:58:53AM +1000, Benjamin Herrenschmidt wrote:
> > In IBM POWER land, we call this a "partitionable endpoint" (the term
> > "endpoint" here is historic, such a PE can be made of several PCIe
> > "endpoints"). I think "partitionable" is a pretty good name tho to
> > represent the constraints, so I'll call this a "partitionable group"
> > from now on.
> 
> On x86 this is mostly an issue of the IOMMU and which set of devices use
> the same request-id. I used to call that an alias-group because the
> devices have a request-id alias to the pci-bridge.

Right. In fact to try to clarify the problem for everybody, I think we
can distinguish two different classes of "constraints" that can
influence the grouping of devices:

 1- Hard constraints. These are typically devices using the same RID or
where the RID cannot be reliably guaranteed (the latter is the case with
some PCIe-PCIX bridges which will take ownership of "some" transactions
such as split but not all). Devices like that must be in the same
domain. This is where PowerPC adds to what x86 does today the concept
that the domains are pre-existing, since we use the RID for error
isolation & MMIO segmenting as well, so we need to create those domains
at boot time.

 2- Softer constraints. Those constraints derive from the fact that not
applying them risks enabling the guest to create side effects outside of
its "sandbox". To some extent, there can be "degrees" of badness between
the various things that can cause such constraints. Examples are shared
LSIs (since trusting DisINTx can be chancy, see earlier discussions),
potentially any set of functions in the same device can be problematic
due to the possibility to get backdoor access to the BARs etc...

Now, what I derive from the discussion we've had so far, is that we need
to find a proper fix for #1, but Alex and Avi seem to prefer that #2
remains a matter of libvirt/user doing the right thing (basically
keeping a loaded gun aimed at the user's foot with a very very very
sweet trigger but heh, let's not start a flamewar here :-)

So let's try to find a proper solution for #1 now, and leave #2 alone
for the time being.

Maybe the right option is for x86 to move toward pre-existing domains
like powerpc does, or maybe we can just expose some kind of ID.

Because #1 is a mix of generic constraints (nasty bridges) and very
platform specific ones (whatever capacity limits in our MMIO segmenting
forced us to put two devices in the same hard domain on power), I
believe it's really something the kernel must solve, not libvirt nor
qemu user or anything else.

I am open to suggestions here. I can easily expose my PE# (it's just a
number) somewhere in sysfs, in fact I'm considering doing it in the PCI
devices sysfs directory, simply because it can/will be useful for other
things such as error reporting, so we could maybe build on that.

The crux for me is really the need for pre-existence of the iommu
domains as my PE's imply a shared iommu space.

> > - The -minimum- granularity of pass-through is not always a single
> > device and not always under SW control
> 
> Correct.
>  
> > - Having a magic heuristic in libvirt to figure out those constraints is
> > WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
> > knowledge of PCI resource management and getting it wrong in many many
> > cases, something that took years to fix essentially by ripping it all
> > out. This is kernel knowledge and thus we need the kernel to expose in a
> > way or another what those constraints are, what those "partitionable
> > groups" are.
> 
> I agree. Managing the ownership of a group should be done in the kernel.
> Doing this in userspace is just too dangerous.
> 
> The problem to be solved here is how to present these PEs inside the
> kernel and to userspace. I thought a bit about making this visible
> through the iommu-api for in-kernel users. That is probably the most
> logical place.

Ah you started answering to my above questions :-)

We could do what you propose. It depends what we want to do with
domains. Practically speaking, we could make domains pre-existing (with
the ability to group several PEs into larger domains) or we could keep
the concepts different, possibly with the limitation that on powerpc, a
domain == a PE.

I suppose we -could- make arbitrary domains on ppc as well by making the
various PE's iommu's in HW point to the same in-memory table, but that's
a bit nasty in practice due to the way we manage those, and it would to
some extent increase the risk of a failing device/driver stomping on
another one and thus taking it down with itself. IE. isolation of errors
is an important feature for us.

So I'd rather avoid the whole domain thing for now and keep the
constraint, for powerpc at least, that a domain == a PE, and thus find a
proper way to expose that to qemu/libvirt.

> For userspace I would like to propose a new device attribute in sysfs.
> This attribute contains the group number. All devices with the same
> group number belong to the same PE. Libvirt needs to scan the whole
> device tree to build the groups but that is probably not a big deal.

That's trivial for me to map that to my existing PE number. Should we
define the number space to be within a PCI domain (ie a host bridge), or
should it be a global space? In the latter case I can construct them
using domain << 16 | PE# or something like that.
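(What I mean by that encoding, as trivial pseudocode; the 16-bit split is just an example of how the construction could look:)

```python
def global_group_id(pci_domain, pe_num):
    """Build a system-global group ID from a PCI domain (host
    bridge) number and a per-domain PE number, assuming PE numbers
    fit in 16 bits."""
    assert 0 <= pe_num < (1 << 16)
    return (pci_domain << 16) | pe_num

def split_group_id(group_id):
    """Recover (pci_domain, pe_num) from a global group ID."""
    return group_id >> 16, group_id & 0xFFFF
```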

Cheers,
Ben.

>	Joerg
> 
> > 
> > - That does -not- mean that we cannot specify for each individual device
> > within such a group where we want to put it in qemu (what devfn etc...).
> > As long as there is a clear understanding that the "ownership" of the
> > device goes with the group, this is somewhat orthogonal to how they are
> > represented in qemu. (Not completely... if the iommu is exposed to the
> > guest ,via paravirt for example, some of these constraints must be
> > exposed but I'll talk about that more later).
> > 
> > The interface currently proposed for VFIO (and associated uiommu)
> > doesn't handle that problem at all. Instead, it is entirely centered
> > around a specific "feature" of the VTd iommu's for creating arbitrary
> > domains with arbitrary devices (tho those devices -do- have the same
> > constraints exposed above, don't try to put 2 legacy PCI devices behind
> > the same bridge into 2 different domains !), but the API totally ignores
> > the problem, leaves it to libvirt "magic foo" and focuses on something
> > that is both quite secondary in the grand scheme of things, and quite
> > x86 VTd specific in the implementation and API definition.
> > 
> > Now, I'm not saying these programmable iommu domains aren't a nice
> > feature and that we shouldn't exploit them when available, but as it is,
> > it is too much a central part of the API.
> > 
> > I'll talk a little bit more about recent POWER iommu's here to
> > illustrate where I'm coming from with my idea of groups:
> > 
> > On p7ioc (the IO chip used on recent P7 machines), there -is- a concept
> > of domain and a per-RID filtering. However it differs from VTd in a few
> > ways:
> > 
> > The "domains" (aka PEs) encompass more than just an iommu filtering
> > scheme. The MMIO space and PIO space are also segmented, and those
> > segments assigned to domains. Interrupts (well, MSI ports at least) are
> > assigned to domains. Inbound PCIe error messages are targeted to
> > domains, etc...
> > 
> > Basically, the PEs provide a very strong isolation feature which
> > includes errors, and has the ability to immediately "isolate" a PE on
> > the first occurence of an error. For example, if an inbound PCIe error
> > is signaled by a device on a PE or such a device does a DMA to a
> > non-authorized address, the whole PE gets into error state. All
> > subsequent stores (both DMA and MMIO) are swallowed and reads return all
> > 1's, interrupts are blocked. This is designed to prevent any propagation
> > of bad data, which is a very important feature in large high reliability
> > systems.
> > 
> > Software then has the ability to selectively turn back on MMIO and/or
> > DMA, perform diagnostics, reset devices etc...
> > 
> > Because the domains encompass more than just DMA, but also segment the
> > MMIO space, it is not practical at all to dynamically reconfigure them
> > at runtime to "move" devices into domains. The firmware or early kernel
> > code (it depends) will assign devices BARs using an algorithm that keeps
> > them within PE segment boundaries, etc....
> > 
> > Additionally (and this is indeed a "restriction" compared to VTd, though
> > I expect our future IO chips to lift it to some extent), PE don't get
> > separate DMA address spaces. There is one 64-bit DMA address space per
> > PCI host bridge, and it is 'segmented' with each segment being assigned
> > to a PE. Due to the way PE assignment works in hardware, it is not
> > practical to make several devices share a segment unless they are on the
> > same bus. Also the resulting limit in the amount of 32-bit DMA space a
> > device can access means that it's impractical to put too many devices in
> > a PE anyways. (This is clearly designed for paravirt iommu, I'll talk
> > more about that later).
> > 
> > The above essentially extends the granularity requirement (or rather is
> > another factor defining what the granularity of partitionable entities
> > is). You can think of it as "pre-existing" domains.
> > 
> > I believe the way to solve that is to introduce a kernel interface to
> > expose those "partitionable entities" to userspace. In addition, it
> > occurs to me that the ability to manipulate VTd domains essentially
> > boils down to manipulating those groups (creating larger ones with
> > individual components).
> > 
> > I like the idea of defining / playing with those groups statically
> > (using a command line tool or sysfs, possibly having a config file
> > defining them in a persistent way) rather than having their lifetime
> > tied to a uiommu file descriptor.
> > 
> > It also makes it a LOT easier to have a channel to manipulate
> > platform/arch specific attributes of those domains if any.
> > 
> > So we could define an API or representation in sysfs that exposes what
> > the partitionable entities are, and we may add to it an API to
> > manipulate them. But we don't have to and I'm happy to keep the
> > additional SW grouping you can do on VTd as a sepparate "add-on" API
> > (tho I don't like at all the way it works with uiommu). However, qemu
> > needs to know what the grouping is regardless of the domains, and it's
> > not nice if it has to manipulate two different concepts here so
> > eventually those "partitionable entities" from a qemu standpoint must
> > look like domains.
> > 
> > My main point is that I don't want the "knowledge" here to be in libvirt
> > or qemu. In fact, I want to be able to do something as simple as passing
> > a reference to a PE to qemu (sysfs path ?) and have it just pickup all
> > the devices in there and expose them to the guest.
> > 
> > This can be done in a way that isn't PCI specific as well (the
> > definition of the groups and what is grouped would would obviously be
> > somewhat bus specific and handled by platform code in the kernel).
> > 
> > Maybe something like /sys/devgroups ? This probably warrants involving
> > more kernel people into the discussion.
> > 
> > * IOMMU
> > 
> > Now more on iommu. I've described I think in enough details how ours
> > work, there are others, I don't know what freescale or ARM are doing,
> > sparc doesn't quite work like VTd either, etc...
> > 
> > The main problem isn't that much the mechanics of the iommu but really
> > how it's exposed (or not) to guests.
> > 
> > VFIO here is basically designed for one and only one thing: expose the
> > entire guest physical address space to the device more/less 1:1.
> > 
> > This means:
> > 
> >   - It only works with iommu's that provide complete DMA address spaces
> > to devices. Won't work with a single 'segmented' address space like we
> > have on POWER.
> > 
> >   - It requires the guest to be pinned. Pass-through -> no more swap
> > 
> >   - The guest cannot make use of the iommu to deal with 32-bit DMA
> > devices, thus a guest with more than a few G of RAM (I don't know the
> > exact limit on x86, depends on your IO hole I suppose), and you end up
> > back to swiotlb & bounce buffering.
> > 
> >   - It doesn't work for POWER server anyways because of our need to
> > provide a paravirt iommu interface to the guest since that's how pHyp
> > works today and how existing OSes expect to operate.
> > 
> > Now some of this can be fixed with tweaks, and we've started doing it
> > (we have a working pass-through using VFIO, forgot to mention that, it's
> > just that we don't like what we had to do to get there).
> > 
> > Basically, what we do today is:
> > 
> > - We add an ioctl to VFIO to expose to qemu the segment information. IE.
> > What is the DMA address and size of the DMA "window" usable for a given
> > device. This is a tweak, that should really be handled at the "domain"
> > level.
> > 
> > That current hack won't work well if two devices share an iommu. Note
> > that we have an additional constraint here due to our paravirt
> > interfaces (specificed in PAPR) which is that PE domains must have a
> > common parent. Basically, pHyp makes them look like a PCIe host bridge
> > per domain in the guest. I think that's a pretty good idea and qemu
> > might want to do the same.
> > 
> > - We hack out the currently unconditional mapping of the entire guest
> > space in the iommu. Something will have to be done to "decide" whether
> > to do that or not ... qemu argument -> ioctl ?
> > 
> > - We hook up the paravirt call to insert/remove a translation from the
> > iommu to the VFIO map/unmap ioctl's.
> > 
> > This limps along but it's not great. Some of the problems are:
> > 
> > - I've already mentioned, the domain problem again :-) 
> > 
> > - Performance sucks of course, the vfio map ioctl wasn't mean for that
> > and has quite a bit of overhead. However we'll want to do the paravirt
> > call directly in the kernel eventually ...
> > 
> >   - ... which isn't trivial to get back to our underlying arch specific
> > iommu object from there. We'll probably need a set of arch specific
> > "sideband" ioctl's to "register" our paravirt iommu "bus numbers" and
> > link them to the real thing kernel-side.
> > 
> > - PAPR (the specification of our paravirt interface and the expectation
> > of current OSes) wants iommu pages to be 4k by default, regardless of
> > the kernel host page size, which makes things a bit tricky since our
> > enterprise host kernels have a 64k base page size. Additionally, we have
> > new PAPR interfaces that we want to exploit, to allow the guest to
> > create secondary iommu segments (in 64-bit space), which can be used
> > (under guest control) to do things like map the entire guest (here it
> > is :-) or use larger iommu page sizes (if permitted by the host kernel,
> > in our case we could allow 64k iommu page size with a 64k host kernel).
> > 
> > The above means we need arch specific APIs. So arch specific vfio
> > ioctl's, either that or kvm ones going to vfio or something ... the
> > current structure of vfio/kvm interaction doesn't make it easy.
> > 
> > * IO space
> > 
> > On most (if not all) non-x86 archs, each PCI host bridge provide a
> > completely separate PCI address space. Qemu doesn't deal with that very
> > well. For MMIO it can be handled since those PCI address spaces are
> > "remapped" holes in the main CPU address space so devices can be
> > registered by using BAR + offset of that window in qemu MMIO mapping.
> > 
> > For PIO things get nasty. We have totally separate PIO spaces and qemu
> > doesn't seem to like that. We can try to play the offset trick as well,
> > we haven't tried yet, but basically that's another one to fix. Not a
> > huge deal I suppose but heh ...
> > 
> > Also our next generation chipset may drop support for PIO completely.
> > 
> > On the other hand, because PIO is just a special range of MMIO for us,
> > we can do normal pass-through on it and don't need any of the emulation
> > done qemu.
> > 
> >   * MMIO constraints
> > 
> > The QEMU side VFIO code hard wires various constraints that are entirely
> > based on various requirements you decided you have on x86 but don't
> > necessarily apply to us :-)
> > 
> > Due to our paravirt nature, we don't need to masquerade the MSI-X table
> > for example. At all. If the guest configures crap into it, too bad, it
> > can only shoot itself in the foot since the host bridge enforce
> > validation anyways as I explained earlier. Because it's all paravirt, we
> > don't need to "translate" the interrupt vectors & addresses, the guest
> > will call hyercalls to configure things anyways.
> > 
> > We don't need to prevent MMIO pass-through for small BARs at all. This
> > should be some kind of capability or flag passed by the arch. Our
> > segmentation of the MMIO domain means that we can give entire segments
> > to the guest and let it access anything in there (those segments are a
> > multiple of the page size always). Worst case it will access outside of
> > a device BAR within a segment and will cause the PE to go into error
> > state, shooting itself in the foot, there is no risk of side effect
> > outside of the guest boundaries.
> > 
> > In fact, we don't even need to emulate BAR sizing etc... in theory. Our
> > paravirt guests expect the BARs to have been already allocated for them
> > by the firmware and will pick up the addresses from the device-tree :-)
> > 
> > Today we use a "hack", putting all 0's in there and triggering the linux
> > code path to reassign unassigned resources (which will use BAR
> > emulation) but that's not what we are -supposed- to do. Not a big deal
> > and having the emulation there won't -hurt- us, it's just that we don't
> > really need any of it.
> > 
> > We have a small issue with ROMs. Our current KVM only works with huge
> > pages for guest memory but that is being fixed. So the way qemu maps the
> > ROM copy into the guest address space doesn't work. It might be handy
> > anyways to have a way for qemu to use MMIO emulation for ROM access as a
> > fallback. I'll look into it.
> > 
> >   * EEH
> > 
> > This is the name of those fancy error handling & isolation features I
> > mentioned earlier. To some extent it's a superset of AER, but we don't
> > generally expose AER to guests (or even the host), it's swallowed by
> > firmware into something else that provides a superset (well mostly) of
> > the AER information, and allows us to do those additional things like
> > isolating/de-isolating, reset control etc...
> > 
> > Here too, we'll need arch specific APIs through VFIO. Not necessarily a
> > huge deal, I mention it for completeness.
> > 
> >    * Misc
> > 
> > There's lots of small bits and pieces... in no special order:
> > 
> >  - netlink ? WTF ! Seriously, we don't need a hybrid API with a bit of
> > netlink and a bit of ioctl's ... it's not like there's something
> > fundamentally better for netlink vs. ioctl... it really depends what
> > you are doing, and in this case I fail to see what netlink brings you
> > other than bloat and more stupid userspace library deps.
> > 
> >  - I don't like too much the fact that VFIO provides yet another
> > different API to do what we already have at least 2 kernel APIs for, ie,
> > BAR mapping and config space access. At least it should be better at
> > using the backend infrastructure of the 2 others (sysfs & procfs). I
> > understand it wants to filter in some case (config space) and -maybe-
> > yet another API is the right way to go but allow me to have my doubts.
> > 
> > One thing I thought about but you don't seem to like it ... was to use
> > the need to represent the partitionable entity as groups in sysfs that I
> > talked about earlier. Those could have per-device subdirs with the usual
> > config & resource files, same semantic as the ones in the real device,
> > but when accessed via the group they get filtering. It might or might
> > not be practical in the end, TBD, but it would allow apps using a
> > slightly modified libpci for example to exploit some of this.
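To make the filtering idea concrete, here is a toy sketch (register offsets and write masks are purely illustrative, not a real policy) of how a write through a group-level config file could be masked while the per-device sysfs file stays unfiltered:

```c
#include <assert.h>
#include <stdint.h>

#define CFG_SPACE_SIZE 256

/* Illustrative write-mask: 0 bits are read-only when the register is
 * written through the group-level file.  Real offsets/masks would come
 * from whatever filtering policy VFIO enforces. */
static const uint8_t cfg_write_mask[CFG_SPACE_SIZE] = {
    [0x04] = 0xff, [0x05] = 0x0f, /* COMMAND: partly writable          */
    [0x3c] = 0x00,                /* INTERRUPT_LINE: locked, say       */
};

/* Apply a guest write through the filter. */
static void group_cfg_write(uint8_t *cfg, unsigned off, uint8_t val)
{
    cfg[off] = (cfg[off] & ~cfg_write_mask[off]) |
               (val & cfg_write_mask[off]);
}
```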
> > 
> >  - The qemu vfio code hooks directly into ioapic ... of course that
> > won't fly with anything !x86
> > 
> >  - The various "objects" dealt with here, -especially- interrupts and
> > iommu, need a better in-kernel API so that fast in-kernel emulation can
> > take over from qemu based emulation. The way we need to do some of this
> > on POWER differs from x86. We can elaborate later, it's not necessarily
> > a killer either but essentially we'll take the bulk of interrupt
> > handling away from VFIO to the point where it won't see any of it at
> > all.
> > 
> >   - Non-PCI devices. That's a hot topic for embedded. I think the vast
> > majority here is platform devices. There's quite a bit of vfio that
> > isn't intrinsically PCI specific. We could have an in-kernel platform
> > driver like we have an in-kernel PCI driver to attach to. The mapping of
> > resources to userspace is rather generic, and the same goes for interrupts. I
> > don't know whether that idea can be pushed much further, I don't have
> > the bandwidth to look into it much at this point, but maybe it would be
> > possible to refactor vfio a bit to better separate what is PCI specific
> > to what is not. The idea would be to move the PCI specific bits to
> > inside the "placeholder" PCI driver, and same goes for platform bits.
> > "generic" ioctl's go to VFIO core; anything the core doesn't handle, it
> > passes to the driver, which allows the PCI one to handle things
> > differently than the platform one, maybe an amba one while at it,
> > etc.... just a thought, I haven't gone into the details at all.
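A minimal sketch of that dispatch split might look like the following (all names and ioctl numbers are hypothetical, not actual VFIO interfaces):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical ioctl numbers for illustration only. */
enum { VFIO_GET_API_VERSION = 0, VFIO_PCI_GET_BAR_INFO = 100 };

/* Bus-specific backend (PCI, platform, amba, ...) plugged into the core. */
struct vfio_bus_ops {
    long (*ioctl)(void *priv, unsigned int cmd, unsigned long arg);
};

static long pci_backend_ioctl(void *priv, unsigned int cmd, unsigned long arg)
{
    (void)priv; (void)arg;
    return cmd == VFIO_PCI_GET_BAR_INFO ? 0 : -25; /* -ENOTTY */
}

/* Generic ioctls are handled in the core; the rest are deferred to the
 * bus driver, so PCI and platform backends can differ. */
static long vfio_core_ioctl(const struct vfio_bus_ops *ops, void *priv,
                            unsigned int cmd, unsigned long arg)
{
    if (cmd == VFIO_GET_API_VERSION)
        return 1;
    return ops->ioctl(priv, cmd, arg);
}
```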
> > 
> > I think that's all I had on my plate today, it's a long enough email
> > anyway :-) Anthony suggested we put that on a wiki, I'm a bit
> > wiki-disabled myself so he proposed to pickup my email and do that. We
> > should probably discuss the various items in here separately as
> > different threads to avoid too much confusion.
> > 
> > One other thing we should do on our side is publish somewhere our
> > current hacks to get you an idea of where we are going and what we had
> > to do (code speaks more than words). We'll try to do that asap, possibly
> > next week.
> > 
> > Note that I'll be on/off the next few weeks, travelling and doing
> > bringup. So expect latency in my replies.
> > 
> > Cheers,
> > Ben.
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-05 10:26         ` Benjamin Herrenschmidt
@ 2011-08-05 12:57           ` Joerg Roedel
  -1 siblings, 0 replies; 322+ messages in thread
From: Joerg Roedel @ 2011-08-05 12:57 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alex Williamson, Avi Kivity, kvm, Anthony Liguori, David Gibson,
	Paul Mackerras, Alexey Kardashevskiy, linux-pci, linuxppc-dev

On Fri, Aug 05, 2011 at 08:26:11PM +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2011-08-04 at 12:41 +0200, Joerg Roedel wrote:
> > On Mon, Aug 01, 2011 at 02:27:36PM -0600, Alex Williamson wrote:
> > > It's not clear to me how we could skip it.  With VT-d, we'd have to
> > > implement an emulated interrupt remapper and hope that the guest picks
> > > unused indexes in the host interrupt remapping table before it could do
> > > anything useful with direct access to the MSI-X table.  Maybe AMD IOMMU
> > > makes this easier?
> > 
> > AMD IOMMU provides remapping tables per-device, and not a global one.
> > But that does not make direct guest-access to the MSI-X table safe. The
> > table contains the interrupt-type and the vector
> > which is used as an index into the remapping table by the IOMMU. So when
> > the guest writes into its MSI-X table the remapping-table in the host
> > needs to be updated too.
> 
> Right, you need paravirt to avoid filtering :-)

Or a shadow MSI-X table like done on x86. How to handle this seems to be
platform specific. As you indicate there is a standardized paravirt
interface for that on Power.
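As a toy illustration of the shadow-table scheme (structures simplified, and the remap lookup below is an invented stand-in for the real remapping-table update):

```c
#include <assert.h>
#include <stdint.h>

struct msix_entry { uint64_t addr; uint32_t data; uint32_t ctrl; };

/* Stand-in for a host remapping-table lookup. */
static uint32_t host_remap_vector(uint32_t guest_vector)
{
    return guest_vector + 0x40;
}

/* The guest's MSI-X write lands in a shadow copy it can read back;
 * the host programs the real table with host-controlled address and
 * a remapped vector, so the guest never controls the wire values. */
static void shadow_msix_write(struct msix_entry *shadow,
                              struct msix_entry *hw,
                              uint64_t guest_addr, uint32_t guest_data)
{
    shadow->addr = guest_addr;
    shadow->data = guest_data;
    hw->addr = 0xfee00000ull;               /* host-chosen MSI address */
    hw->data = host_remap_vector(guest_data);
}
```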

> IE the problem is two fold:
> 
>  - Getting the right value in the table / remapper so things work
> (paravirt)
> 
>  - Protecting against the guest somehow managing to change the value in
> the table (either directly or via a backdoor access to its own config
> space).
> 
> The latter for us comes from the HW PE filtering of the MSI transactions.

Right. The second part of the problem can be avoided with
interrupt-remapping/filtering hardware in the IOMMUs.

	Joerg

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-05 10:42     ` Benjamin Herrenschmidt
@ 2011-08-05 13:44       ` Joerg Roedel
  -1 siblings, 0 replies; 322+ messages in thread
From: Joerg Roedel @ 2011-08-05 13:44 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: kvm, Anthony Liguori, Alex Williamson, David Gibson,
	Paul Mackerras, Alexey Kardashevskiy, linux-pci, linuxppc-dev

On Fri, Aug 05, 2011 at 08:42:38PM +1000, Benjamin Herrenschmidt wrote:

> Right. In fact to try to clarify the problem for everybody, I think we
> can distinguish two different classes of "constraints" that can
> influence the grouping of devices:
> 
>  1- Hard constraints. These are typically devices using the same RID or
> where the RID cannot be reliably guaranteed (the latter is the case with
> some PCIe-PCIX bridges which will take ownership of "some" transactions
> such as split but not all). Devices like that must be in the same
> domain. This is where PowerPC adds to what x86 does today the concept
> that the domains are pre-existing, since we use the RID for error
> isolation & MMIO segmenting as well, so we need to create those domains
> at boot time.

Domains (in the iommu-sense) are created at boot time on x86 today.
Every device needs at least a domain to provide dma-mapping
functionality to the drivers. So all the grouping is done too at
boot-time. This is specific to the iommu-drivers today but can be
generalized I think.

>  2- Softer constraints. Those constraints derive from the fact that not
> applying them risks enabling the guest to create side effects outside of
> its "sandbox". To some extent, there can be "degrees" of badness between
> the various things that can cause such constraints. Examples are shared
> LSIs (since trusting DisINTx can be chancy, see earlier discussions),
> potentially any set of functions in the same device can be problematic
> due to the possibility to get backdoor access to the BARs etc...

Hmm, there is no sane way to handle such constraints in a safe way,
right? We can either blacklist devices which are known to have such
backdoors or we just ignore the problem.

> Now, what I derive from the discussion we've had so far, is that we need
> to find a proper fix for #1, but Alex and Avi seem to prefer that #2
> remains a matter of libvirt/user doing the right thing (basically
> keeping a loaded gun aimed at the user's foot with a very very very
> sweet trigger but heh, let's not start a flamewar here :-)
> 
> So let's try to find a proper solution for #1 now, and leave #2 alone
> for the time being.

Yes, and the solution for #1 should be entirely in the kernel. The
question is how to do that. Probably the most sane way is to introduce a
concept of device ownership. The ownership can either be a kernel driver
or a userspace process. Giving ownership of a device to userspace is
only possible if all devices in the same group are unbound from their
respective drivers. This is a very intrusive concept, no idea if it
has a chance of acceptance :-)
But the advantage is clearly that this allows better semantics in the
IOMMU drivers and a more stable handover of devices from host drivers to
kvm guests.

> Maybe the right option is for x86 to move toward pre-existing domains
> like powerpc does, or maybe we can just expose some kind of ID.

As I said, the domains are created at iommu driver initialization time
(usually boot time). But the groups are internal to the iommu drivers
and not visible somewhere else.

> Ah you started answering to my above questions :-)
> 
> We could do what you propose. It depends what we want to do with
> domains. Practically speaking, we could make domains pre-existing (with
> the ability to group several PEs into larger domains) or we could keep
> the concepts different, possibly with the limitation that on powerpc, a
> domain == a PE.
> 
> I suppose we -could- make arbitrary domains on ppc as well by making the
> various PE's iommu's in HW point to the same in-memory table, but that's
> a bit nasty in practice due to the way we manage those, and it would to
> some extent increase the risk of a failing device/driver stomping on
> another one and thus taking it down with itself. IE. isolation of errors
> is an important feature for us.

These arbitrary domains exist in the iommu-api. It would be good to
emulate them on Power too. Can't you put a PE into an isolated
error-domain when something goes wrong with it? This should provide the
same isolation as before.
What you derive the group number from is your business :-) On x86 it is
certainly the best to use the RID these devices share together with the
PCI segment number.
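One possible packing, purely as an illustration of combining the PCI segment with the shared RID into a system-unique group number (the scheme and names are invented, not a kernel interface):

```c
#include <assert.h>
#include <stdint.h>

/* Derive a group ID from segment + RID.  Devices behind a non-isolating
 * PCIe->PCI bridge are only identifiable by a single RID, modelled here
 * (simplified) by collapsing them onto devfn 0 of their bus. */
static uint32_t iommu_group_id(uint16_t segment, uint8_t bus,
                               uint8_t devfn, int behind_legacy_bridge)
{
    uint16_t rid = behind_legacy_bridge
                       ? (uint16_t)(bus << 8)
                       : (uint16_t)((bus << 8) | devfn);
    return ((uint32_t)segment << 16) | rid;
}
```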

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-05 10:42     ` Benjamin Herrenschmidt
@ 2011-08-05 15:10       ` Alex Williamson
  -1 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-08-05 15:10 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Joerg Roedel, kvm, Anthony Liguori, David Gibson, Paul Mackerras,
	Alexey Kardashevskiy, linux-pci, linuxppc-dev

On Fri, 2011-08-05 at 20:42 +1000, Benjamin Herrenschmidt wrote:
> Right. In fact to try to clarify the problem for everybody, I think we
> can distinguish two different classes of "constraints" that can
> influence the grouping of devices:
> 
>  1- Hard constraints. These are typically devices using the same RID or
> where the RID cannot be reliably guaranteed (the later is the case with
> some PCIe-PCIX bridges which will take ownership of "some" transactions
> such as split but not all). Devices like that must be in the same
> domain. This is where PowerPC adds to what x86 does today the concept
> that the domains are pre-existing, since we use the RID for error
> isolation & MMIO segmenting as well. so we need to create those domains
> at boot time.
> 
>  2- Softer constraints. Those constraints derive from the fact that not
> applying them risks enabling the guest to create side effects outside of
> its "sandbox". To some extent, there can be "degrees" of badness between
> the various things that can cause such constraints. Examples are shared
> LSIs (since trusting DisINTx can be chancy, see earlier discussions),
> potentially any set of functions in the same device can be problematic
> due to the possibility to get backdoor access to the BARs etc...

This is what I've been trying to get to, hardware constraints vs system
policy constraints.

> Now, what I derive from the discussion we've had so far, is that we need
> to find a proper fix for #1, but Alex and Avi seem to prefer that #2
> remains a matter of libvirt/user doing the right thing (basically
> keeping a loaded gun aimed at the user's foot with a very very very
> sweet trigger but heh, let's not start a flamewar here :-)

Doesn't your own uncertainty of whether or not to allow this lead to the
same conclusion, that it belongs in userspace policy?  I don't think we
want to make white lists of which devices we trust to do DisINTx
correctly part of the kernel interface, do we?  Thanks,

Alex

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-05 13:44       ` Joerg Roedel
@ 2011-08-05 22:49         ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 322+ messages in thread
From: Benjamin Herrenschmidt @ 2011-08-05 22:49 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: kvm, Anthony Liguori, Alex Williamson, David Gibson,
	Paul Mackerras, Alexey Kardashevskiy, linux-pci, linuxppc-dev

On Fri, 2011-08-05 at 15:44 +0200, Joerg Roedel wrote:
> On Fri, Aug 05, 2011 at 08:42:38PM +1000, Benjamin Herrenschmidt wrote:
> 
> > Right. In fact to try to clarify the problem for everybody, I think we
> > can distinguish two different classes of "constraints" that can
> > influence the grouping of devices:
> > 
> >  1- Hard constraints. These are typically devices using the same RID or
> > where the RID cannot be reliably guaranteed (the latter is the case with
> > some PCIe-PCIX bridges which will take ownership of "some" transactions
> > such as split but not all). Devices like that must be in the same
> > domain. This is where PowerPC adds to what x86 does today the concept
> > that the domains are pre-existing, since we use the RID for error
> > isolation & MMIO segmenting as well, so we need to create those domains
> > at boot time.
> 
> Domains (in the iommu-sense) are created at boot time on x86 today.
> Every device needs at least a domain to provide dma-mapping
> functionality to the drivers. So all the grouping is done too at
> boot-time. This is specific to the iommu-drivers today but can be
> generalized I think.

Ok, let's go there then.

> >  2- Softer constraints. Those constraints derive from the fact that not
> > applying them risks enabling the guest to create side effects outside of
> > its "sandbox". To some extent, there can be "degrees" of badness between
> > the various things that can cause such constraints. Examples are shared
> > LSIs (since trusting DisINTx can be chancy, see earlier discussions),
> > potentially any set of functions in the same device can be problematic
> > due to the possibility to get backdoor access to the BARs etc...
> 
> Hmm, there is no sane way to handle such constraints in a safe way,
> right? We can either blacklist devices which are known to have such
> backdoors or we just ignore the problem.

Arguably they probably all do have such backdoors. A debug register,
JTAG register, ... My point is you don't really know unless you get
manufacturer guarantee that there is no undocumented register somewhere
or way to change the microcode so that it does it etc.... The more
complex the devices, the less likely to have a guarantee.

The "safe" way is what pHyp does and basically boils down to only
allowing pass-through of entire 'slots', ie, things that are behind a
P2P bridge (virtual one typically, ie, a PCIe switch) and disallowing
pass-through with shared interrupts.

That way, even if the guest can move the BARs around, it cannot make
them overlap somebody else device because the parent bridge restricts
the portion of MMIO space that is forwarded down to that device anyway.
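A one-function illustration of that containment argument (hypothetical helper, not pHyp code): a BAR the guest programs is only reachable if it lies inside the parent bridge's forwarding window.

```c
#include <assert.h>
#include <stdint.h>

/* A guest-programmed BAR is decoded only if it falls entirely within
 * the parent bridge's MMIO forwarding window, so moving it around can
 * never make it overlap another device's space. */
static int bar_reachable(uint64_t bar_base, uint64_t bar_size,
                         uint64_t win_base, uint64_t win_size)
{
    return bar_base >= win_base &&
           bar_base + bar_size <= win_base + win_size;
}
```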

> > Now, what I derive from the discussion we've had so far, is that we need
> > to find a proper fix for #1, but Alex and Avi seem to prefer that #2
> > remains a matter of libvirt/user doing the right thing (basically
> > keeping a loaded gun aimed at the user's foot with a very very very
> > sweet trigger but heh, let's not start a flamewar here :-)
> > 
> > So let's try to find a proper solution for #1 now, and leave #2 alone
> > for the time being.
> 
> Yes, and the solution for #1 should be entirely in the kernel. The
> question is how to do that. Probably the most sane way is to introduce a
> concept of device ownership. The ownership can either be a kernel driver
> or a userspace process. Giving ownership of a device to userspace is
> only possible if all devices in the same group are unbound from their
> respective drivers. This is a very intrusive concept, no idea if it
> has a chance of acceptance :-)
> But the advantage is clearly that this allows better semantics in the
> IOMMU drivers and a more stable handover of devices from host drivers to
> kvm guests.

I tend to think around those lines too, but the ownership concept
doesn't necessarily have to be core-kernel enforced itself, it can be in
VFIO.

If we have a common API to expose the "domain number", it can perfectly
well be a matter of VFIO itself not allowing pass-through until it has
attached its stub driver to all the devices with that domain number, and
it can handle exclusion of iommu domains from there.

> > Maybe the right option is for x86 to move toward pre-existing domains
> > like powerpc does, or maybe we can just expose some kind of ID.
> 
> As I said, the domains are created at iommu driver initialization time
> (usually boot time). But the groups are internal to the iommu drivers
> and not visible somewhere else.

That's what we need to fix :-)

> > Ah you started answering to my above questions :-)
> > 
> > We could do what you propose. It depends what we want to do with
> > domains. Practically speaking, we could make domains pre-existing (with
> > the ability to group several PEs into larger domains) or we could keep
> > the concepts different, possibly with the limitation that on powerpc, a
> > domain == a PE.
> > 
> > I suppose we -could- make arbitrary domains on ppc as well by making the
> > various PE's iommu's in HW point to the same in-memory table, but that's
> > a bit nasty in practice due to the way we manage those, and it would to
> > some extent increase the risk of a failing device/driver stomping on
> > another one and thus taking it down with itself. IE. isolation of errors
> > is an important feature for us.
> 
> These arbitrary domains exist in the iommu-api. It would be good to
> emulate them on Power too. Can't you put a PE into an isolated
> error-domain when something goes wrong with it? This should provide the
> same isolation as before.

Well, my problem is that it's quite hard for me to arbitrarily make PEs
shared the same iommu table. The iommu tables are assigned at boot time
along with the creation of the PEs, and because sadly, I don't (yet)
support tree structures for them, they are large physically contiguous
things, so I need to allocate them early and keep them around.

I -could- make a hack to share tables when creating such arbitrary
domains, but I would definitely have to keep track of the "original"
table of the PE so that can be reverted, I can't afford to free the
memory or I risk not being able to re-allocate it.

We'll have tree iommu's in future HW but not yet.

> What you derive the group number from is your business :-) On x86 it is
> certainly the best to use the RID these devices share together with the
> PCI segment number.

Ok. The question is more, in terms of API, whether this number is to be
unique at the system scope or only at the PCI host bridge scope.

Cheers,
Ben.

> Regards,
> 
> 	Joerg

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
@ 2011-08-05 22:49         ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 322+ messages in thread
From: Benjamin Herrenschmidt @ 2011-08-05 22:49 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	David Gibson, Alex Williamson, Anthony Liguori, linuxppc-dev

On Fri, 2011-08-05 at 15:44 +0200, Joerg Roedel wrote:
> On Fri, Aug 05, 2011 at 08:42:38PM +1000, Benjamin Herrenschmidt wrote:
> 
> > Right. In fact to try to clarify the problem for everybody, I think we
> > can distinguish two different classes of "constraints" that can
> > influence the grouping of devices:
> > 
> >  1- Hard constraints. These are typically devices using the same RID or
> > where the RID cannot be reliably guaranteed (the later is the case with
> > some PCIe-PCIX bridges which will take ownership of "some" transactions
> > such as split but not all). Devices like that must be in the same
> > domain. This is where PowerPC adds to what x86 does today the concept
> > that the domains are pre-existing, since we use the RID for error
> > isolation & MMIO segmenting as well. so we need to create those domains
> > at boot time.
> 
> Domains (in the iommu-sense) are created at boot time on x86 today.
> Every device needs at least a domain to provide dma-mapping
> functionality to the drivers. So all the grouping is done too at
> boot-time. This is specific to the iommu-drivers today but can be
> generalized I think.

Ok, let's go there then.

> >  2- Softer constraints. Those constraints derive from the fact that not
> > applying them risks enabling the guest to create side effects outside of
> > its "sandbox". To some extent, there can be "degrees" of badness between
> > the various things that can cause such constraints. Examples are shared
> > LSIs (since trusting DisINTx can be chancy, see earlier discussions),
> > potentially any set of functions in the same device can be problematic
> > due to the possibility to get backdoor access to the BARs etc...
> 
> Hmm, there is no sane way to handle such constraints in a safe way,
> right? We can either blacklist devices which are known to have such
> backdoors or we just ignore the problem.

Arguably they probably all do have such backdoors: a debug register,
JTAG register, ... My point is you don't really know unless you get a
manufacturer guarantee that there is no undocumented register somewhere,
or no way to change the microcode so that it does it, etc... The more
complex the device, the less likely you are to have such a guarantee.

The "safe" way is what pHyp does and basically boils down to only
allowing pass-through of entire 'slots', ie, things that are behind a
P2P bridge (virtual one typically, ie, a PCIe switch) and disallowing
pass-through with shared interrupts.

That way, even if the guest can move the BARs around, it cannot make
them overlap somebody else's device because the parent bridge restricts
the portion of MMIO space that is forwarded down to that device anyway.
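The safety argument above can be illustrated with a toy model (all window and BAR values invented for illustration): the parent P2P bridge only forwards MMIO accesses that fall inside its memory window, so a guest that moves a BAR outside that window merely breaks its own device rather than overlapping a neighbour's.

```python
# Toy model of bridge MMIO window filtering. A P2P bridge forwards a
# downstream access only if it falls entirely within the bridge's
# memory window; addresses outside the window simply never reach the
# devices behind it.

def bridge_forwards(window_base, window_limit, bar_base, bar_size):
    """True if the bridge would forward accesses to this BAR downstream."""
    return window_base <= bar_base and bar_base + bar_size - 1 <= window_limit

# Hypothetical window assigned to the guest's slot.
WINDOW = (0xC000_0000, 0xC0FF_FFFF)

assert bridge_forwards(*WINDOW, 0xC001_0000, 0x1000)      # inside: forwarded
assert not bridge_forwards(*WINDOW, 0xD000_0000, 0x1000)  # guest moved BAR out
```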

> > Now, what I derive from the discussion we've had so far, is that we need
> > to find a proper fix for #1, but Alex and Avi seem to prefer that #2
> > remains a matter of libvirt/user doing the right thing (basically
> > keeping a loaded gun aimed at the user's foot with a very very very
> > sweet trigger but heh, let's not start a flamewar here :-)
> > 
> > So let's try to find a proper solution for #1 now, and leave #2 alone
> > for the time being.
> 
> Yes, and the solution for #1 should be entirely in the kernel. The
> question is how to do that. Probably the most sane way is to introduce a
> concept of device ownership. The ownership can either be a kernel driver
> or a userspace process. Giving ownership of a device to userspace is
> only possible if all devices in the same group are unbound from their
> respective drivers. This is a very intrusive concept, no idea if it
> has a chance of acceptance :-)
> But the advantage is clearly that this allows better semantics in the
> IOMMU drivers and a more stable handover of devices from host drivers to
> kvm guests.

I tend to think around those lines too, but the ownership concept
doesn't necessarily have to be core-kernel enforced itself, it can be in
VFIO.

If we have a common API to expose the "domain number", it can perfectly
be a matter of VFIO itself not allowing pass-through until it has 
attached its stub driver to all the devices with that domain number, and
it can handle exclusion of iommu domains from there.
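This ownership rule can be sketched as a small model (pure Python, all names invented; the real mechanism would be driver binding in the kernel): user-space access to any device in a group is refused until every device sharing that group/domain number has been detached from its host driver and bound to the VFIO stub.

```python
# Hypothetical model of VFIO-enforced group ownership: pass-through of
# a domain is only permitted once the stub driver owns every device
# that shares that domain number.

class Device:
    def __init__(self, name, group, driver=None):
        self.name = name
        self.group = group        # domain/PE number shared by the group
        self.driver = driver      # current host driver, or None

class VfioModel:
    def __init__(self, devices):
        self.devices = devices

    def bind_stub(self, name):
        for d in self.devices:
            if d.name == name:
                d.driver = "vfio-stub"

    def allow_passthrough(self, group):
        # Pass-through is allowed only when the whole group is held by
        # the stub driver, so no host driver can race with the guest.
        return all(d.driver == "vfio-stub"
                   for d in self.devices if d.group == group)

devs = [Device("0000:01:00.0", group=42, driver="e1000e"),
        Device("0000:01:00.1", group=42, driver="e1000e")]
vfio = VfioModel(devs)
assert not vfio.allow_passthrough(42)   # a host driver still owns a member
vfio.bind_stub("0000:01:00.0")
assert not vfio.allow_passthrough(42)   # one function is not enough
vfio.bind_stub("0000:01:00.1")
assert vfio.allow_passthrough(42)       # whole group captured
```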

> > Maybe the right option is for x86 to move toward pre-existing domains
> > like powerpc does, or maybe we can just expose some kind of ID.
> 
> As I said, the domains are created at iommu driver initialization time
> (usually boot time). But the groups are internal to the iommu drivers
> and not visible somewhere else.

That's what we need to fix :-)

> > Ah you started answering to my above questions :-)
> > 
> > We could do what you propose. It depends what we want to do with
> > domains. Practically speaking, we could make domains pre-existing (with
> > the ability to group several PEs into larger domains) or we could keep
> > the concepts different, possibly with the limitation that on powerpc, a
> > domain == a PE.
> > 
> > I suppose we -could- make arbitrary domains on ppc as well by making the
> > various PE's iommu's in HW point to the same in-memory table, but that's
> > a bit nasty in practice due to the way we manage those, and it would to
> > some extent increase the risk of a failing device/driver stomping on
> > another one and thus taking it down with itself. IE. isolation of errors
> > is an important feature for us.
> 
> These arbitrary domains exist in the iommu-api. It would be good to
> emulate them on Power too. Can't you put a PE into an isolated
> error-domain when something goes wrong with it? This should provide the
> same isolation as before.

Well, my problem is that it's quite hard for me to arbitrarily make PEs
share the same iommu table. The iommu tables are assigned at boot time
along with the creation of the PEs, and because sadly, I don't (yet)
support tree structures for them, they are large physically contiguous
things, so I need to allocate them early and keep them around.

I -could- make a hack to share tables when creating such arbitrary
domains, but I would definitely have to keep track of the "original"
table of the PE so that can be reverted, I can't afford to free the
memory or I risk not being able to re-allocate it.

We'll have tree-structured iommu tables in future HW, but not yet.

> What you derive the group number from is your business :-) On x86 it is
> certainly the best to use the RID these devices share together with the
> PCI segment number.

Ok. The question is more, in terms of API, whether this number is to be
unique at the system scope or only at the PCI host bridge scope. 

Cheers,
Ben.

> Regards,
> 
> 	Joerg

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-05 15:10       ` Alex Williamson
  (?)
@ 2011-08-08  6:07       ` David Gibson
  -1 siblings, 0 replies; 322+ messages in thread
From: David Gibson @ 2011-08-08  6:07 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, kvm, Paul Mackerras, Joerg Roedel,
	Anthony Liguori, linux-pci, linuxppc-dev

On Fri, Aug 05, 2011 at 09:10:09AM -0600, Alex Williamson wrote:
> On Fri, 2011-08-05 at 20:42 +1000, Benjamin Herrenschmidt wrote:
> > Right. In fact to try to clarify the problem for everybody, I think we
> > can distinguish two different classes of "constraints" that can
> > influence the grouping of devices:
> > 
> >  1- Hard constraints. These are typically devices using the same RID or
> > where the RID cannot be reliably guaranteed (the latter is the case with
> > some PCIe-PCIX bridges which will take ownership of "some" transactions
> > such as split but not all). Devices like that must be in the same
> > domain. This is where PowerPC adds to what x86 does today the concept
> > that the domains are pre-existing, since we use the RID for error
> > isolation & MMIO segmenting as well, so we need to create those domains
> > at boot time.
> > 
> >  2- Softer constraints. Those constraints derive from the fact that not
> > applying them risks enabling the guest to create side effects outside of
> > its "sandbox". To some extent, there can be "degrees" of badness between
> > the various things that can cause such constraints. Examples are shared
> > LSIs (since trusting DisINTx can be chancy, see earlier discussions),
> > potentially any set of functions in the same device can be problematic
> > due to the possibility to get backdoor access to the BARs etc...
> 
> This is what I've been trying to get to, hardware constraints vs system
> policy constraints.
> 
> > Now, what I derive from the discussion we've had so far, is that we need
> > to find a proper fix for #1, but Alex and Avi seem to prefer that #2
> > remains a matter of libvirt/user doing the right thing (basically
> > keeping a loaded gun aimed at the user's foot with a very very very
> > sweet trigger but heh, let's not start a flamewar here :-)
> 
> Doesn't your own uncertainty of whether or not to allow this lead to the
> same conclusion, that it belongs in userspace policy?  I don't think we
> want to make white lists of which devices we trust to do DisINTx
> correctly part of the kernel interface, do we?  Thanks,

Yes, but the overall point is that both the hard and soft constraints
are much easier to handle if a group or iommu domain or whatever is a
persistent entity that can be set up once-per-boot by the admin with
whatever degree of safety they want, rather than a transient entity
tied to an fd's lifetime, which must be set up correctly, every time,
by the thing establishing it.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-03  2:04           ` David Gibson
  (?)
@ 2011-08-08  8:28             ` Avi Kivity
  -1 siblings, 0 replies; 322+ messages in thread
From: Avi Kivity @ 2011-08-08  8:28 UTC (permalink / raw)
  To: Alex Williamson, aafabbri, Alexey Kardashevskiy, kvm,
	Paul Mackerras, qemu-devel, chrisw, iommu, Anthony Liguori,
	linux-pci, linuxppc-dev, benve

On 08/03/2011 05:04 AM, David Gibson wrote:
> I still don't understand the distinction you're making.  We're saying
> the group is "owned" by a given user or guest in the sense that no-one
> else may use anything in the group (including host drivers).  At that
> point none, some or all of the devices in the group may actually be
> used by the guest.
>
> You seem to be making a distinction between "owned by" and "assigned
> to" and "used by" and I really don't see what it is.
>

Alex (and I) think that we should work with device/function granularity, 
as is common with other archs, and that the group thing is just a 
constraint on which functions may be assigned where, while you think 
that we should work at group granularity, with 1-function groups for 
archs which don't have constraints.

Is this an accurate way of putting it?

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-08  8:28             ` Avi Kivity
  (?)
@ 2011-08-09 23:24               ` Alex Williamson
  -1 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-08-09 23:24 UTC (permalink / raw)
  To: Avi Kivity
  Cc: aafabbri, Alexey Kardashevskiy, kvm, Paul Mackerras, qemu-devel,
	chrisw, iommu, Anthony Liguori, linux-pci, linuxppc-dev, benve

On Mon, 2011-08-08 at 11:28 +0300, Avi Kivity wrote:
> On 08/03/2011 05:04 AM, David Gibson wrote:
> > I still don't understand the distinction you're making.  We're saying
> > the group is "owned" by a given user or guest in the sense that no-one
> > else may use anything in the group (including host drivers).  At that
> > point none, some or all of the devices in the group may actually be
> > used by the guest.
> >
> > You seem to be making a distinction between "owned by" and "assigned
> > to" and "used by" and I really don't see what it is.
> >
> 
> Alex (and I) think that we should work with device/function granularity, 
> as is common with other archs, and that the group thing is just a 
> constraint on which functions may be assigned where, while you think 
> that we should work at group granularity, with 1-function groups for 
> archs which don't have constraints.
> 
> Is this an accurate way of putting it?

Mostly correct, yes.  x86 isn't immune to the group problem, it shows up
for us any time there's a PCIe-to-PCI bridge in the device hierarchy.
We lose resolution of devices behind the bridge.  As you state though, I
think of this as only a constraint on what we're able to do with those
devices.

Perhaps part of the difference is that on x86 the constraints don't
really affect how we expose devices to the guest.  We need to hold
unused devices in the group hostage and use the same iommu domain for
any devices assigned, but that's not visible to the guest.  AIUI, POWER
probably needs to expose the bridge (or at least an emulated bridge) to
the guest, any devices in the group need to show up behind that bridge,
some kind of pvDMA needs to be associated with that group, there might
be MMIO segments and IOVA windows, etc.  Effectively you want to
transplant the entire group into the guest.  Is that right?  Thanks,

Alex

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-09 23:24               ` Alex Williamson
  (?)
@ 2011-08-10  2:48                 ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 322+ messages in thread
From: Benjamin Herrenschmidt @ 2011-08-10  2:48 UTC (permalink / raw)
  To: Alex Williamson
  Cc: chrisw, Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	qemu-devel, aafabbri, iommu, Avi Kivity, linuxppc-dev, benve


> Mostly correct, yes.  x86 isn't immune to the group problem, it shows up
> for us any time there's a PCIe-to-PCI bridge in the device hierarchy.
> We lose resolution of devices behind the bridge.  As you state though, I
> think of this as only a constraint on what we're able to do with those
> devices.
> 
> Perhaps part of the difference is that on x86 the constraints don't
> really affect how we expose devices to the guest.  We need to hold
> unused devices in the group hostage and use the same iommu domain for
> any devices assigned, but that's not visible to the guest.  AIUI, POWER
> probably needs to expose the bridge (or at least an emulated bridge) to
> the guest, any devices in the group need to show up behind that bridge,

Yes, pretty much, essentially because a group must have a shared iommu
domain, and because of the way our PV representation works, that means
the iommu DMA window has to be exposed by a bridge that covers all the
devices of that group.

> some kind of pvDMA needs to be associated with that group, there might
> be MMIO segments and IOVA windows, etc.  

The MMIO segments are mostly transparent to the guest, we just tell it
where the BARs are and it leaves them alone, at least that's how it
works under pHyp.

In our current qemu/vfio experiments, we do let the guest do the BAR
assignment via the emulated stuff, using a hack to work around the guest
expectation that the BARs have already been set up (I can fill you in on
the details if you really care, but it's not very interesting). It works
because we only ever used that on setups where we had a device == a
group, but it's nasty. But in any case, because they are going to be
always in separate pages, it's not too hard for KVM to remap them
wherever we want, so MMIO is basically a non-issue.

> Effectively you want to
> transplant the entire group into the guest.  Is that right?  Thanks,

Well, at least we want to have a bridge for the group (it could and
probably should be a host bridge, ie, an entire PCI domain, that's a lot
easier than trying to mess around with virtual P2P bridges).

From there, I don't care if we need to explicitly expose each device of
that group one by one. IE, it would be a nice "optimization" to have the
ability to just specify the group and have qemu pick them all up, but it
doesn't really matter in the grand scheme of things.

Currently, we do expose individual devices, but again, it's hacks and it
won't work on many setups etc... with horrid consequences :-) We need to
sort that before we can even think of merging that code on our side.

Cheers,
Ben.

> Alex
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-10  2:48                 ` Benjamin Herrenschmidt
  (?)
@ 2011-08-20 16:51                   ` Alex Williamson
  -1 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-08-20 16:51 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: chrisw, Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	qemu-devel, aafabbri, iommu, Avi Kivity, linuxppc-dev, benve

We had an extremely productive VFIO BoF on Monday.  Here's my attempt to
capture the plan that I think we agreed to:

We need to address both the description and enforcement of device
groups.  Groups are formed any time the iommu does not have resolution
between a set of devices.  On x86, this typically happens when a
PCI-to-PCI bridge exists between the set of devices and the iommu.  For
Power, partitionable endpoints define a group.  Grouping information
needs to be exposed for both userspace and kernel internal usage.  This
will be a sysfs attribute set up by the iommu drivers.  Perhaps:

# cat /sys/devices/pci0000:00/0000:00:19.0/iommu_group
42

(I use a PCI example here, but attribute should not be PCI specific)
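An illustrative sketch of how an iommu driver might compute that group number (topology, device names, and numbering all invented): a device downstream of a PCIe-to-PCI bridge cannot be distinguished by RID, so the whole subtree collapses into the bridge's group, while pure PCIe devices each get their own group.

```python
# Toy grouping logic: devices the iommu can resolve individually get
# their own group; everything behind the same bridge shares one group.

def assign_groups(devices):
    """devices: list of (name, parent_bridge), parent_bridge None for
    devices the iommu sees directly; returns {name: group_id}."""
    groups, bridge_group, next_id = {}, {}, 0
    for name, bridge in devices:
        if bridge is None:                 # iommu sees this RID directly
            groups[name] = next_id
            next_id += 1
        else:                              # no resolution behind the bridge
            if bridge not in bridge_group:
                bridge_group[bridge] = next_id
                next_id += 1
            groups[name] = bridge_group[bridge]
    return groups

topo = [("0000:00:19.0", None),            # on-board PCIe device
        ("0000:02:00.0", "00:1e.0"),       # EHCI  \  conventional PCI
        ("0000:02:00.1", "00:1e.0")]       # OHCI  /  behind one bridge
g = assign_groups(topo)
assert g["0000:00:19.0"] == 0
assert g["0000:02:00.0"] == g["0000:02:00.1"] == 1
```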

From there we have a few options.  In the BoF we discussed a model where
binding a device to vfio creates a /dev/vfio$GROUP character device
file.  This "group" fd provides dma mapping ioctls as well as
ioctls to enumerate and return a "device" fd for each attached member of
the group (similar to KVM_CREATE_VCPU).  We enforce grouping by
returning an error on open() of the group fd if there are members of the
group not bound to the vfio driver.  Each device fd would then support a
similar set of ioctls and mapping (mmio/pio/config) interface as current
vfio, except for the obvious domain and dma ioctls superseded by the
group fd.
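The enforcement rules of this first model can be sketched as a small simulation (pure Python, all names invented; the real interface would be a character device plus ioctls): open() of the group fd fails while any member is unbound from vfio, and a device fd is only handed out through an already-open group fd.

```python
# Simulation of the proposed group-fd semantics: open() is refused
# until every group member is bound to the vfio driver, and at most one
# process may hold the group at a time.

class GroupFD:
    def __init__(self, members):
        self.members = members            # name -> bound-to-vfio flag
        self.opened = False

    def open(self):
        if any(not bound for bound in self.members.values()):
            raise PermissionError("group member not bound to vfio")
        if self.opened:
            raise OSError("group already owned by another process")
        self.opened = True
        return self

    def get_device_fd(self, name):        # analogous to KVM_CREATE_VCPU
        assert self.opened and name in self.members
        return object()                   # stand-in for a real device fd

grp = GroupFD({"0000:02:00.0": True, "0000:02:00.1": False})
try:
    grp.open()
    refused = False
except PermissionError:
    refused = True
assert refused                            # open refused: a member is unbound
grp.members["0000:02:00.1"] = True        # bind the last member to vfio
grp.open()
assert grp.get_device_fd("0000:02:00.0") is not None
```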

Another valid model might be that /dev/vfio/$GROUP is created for all
groups when the vfio module is loaded.  The group fd would allow open()
and some set of iommu querying and device enumeration ioctls, but would
error on dma mapping and retrieving device fds until all of the group
devices are bound to the vfio driver.

In either case, the uiommu interface is removed entirely since dma
mapping is done via the group fd.  As necessary in the future, we can
define a more high performance dma mapping interface for streaming dma
via the group fd.  I expect we'll also include architecture specific
group ioctls to describe features and capabilities of the iommu.  The
group fd will need to prevent concurrent open()s to maintain a 1:1 group
to userspace process ownership model.

Also on the table is supporting non-PCI devices with vfio.  To do this,
we need to generalize the read/write/mmap and irq eventfd interfaces.
We could keep the same model of segmenting the device fd address space,
perhaps adding ioctls to define the segment offset bit position or we
could split each region into its own fd (VFIO_GET_PCI_BAR_FD(0),
VFIO_GET_PCI_CONFIG_FD(), VFIO_GET_MMIO_FD(3)), though we're already
suffering some degree of fd bloat (group fd, device fd(s), interrupt
event fd(s), per resource fd, etc).  For interrupts we can overload
VFIO_SET_IRQ_EVENTFD to be either PCI INTx or non-PCI irq (do non-PCI
devices support MSI?).
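
The "segment offset bit position" idea amounts to encoding a region index in the high bits of the device fd's file offset. A hedged sketch (the 40-bit split is an arbitrary assumption for illustration, not the actual vfio encoding):

```python
# Hypothetical layout: top bits of the file offset select a resource
# (BAR, config space, ...), low bits are the offset within it.  An
# ioctl would report SEGMENT_SHIFT so userspace need not hardcode it.
SEGMENT_SHIFT = 40

def region_offset(region_index, offset):
    """Compose a file offset addressing `offset` within region `region_index`."""
    assert offset < (1 << SEGMENT_SHIFT)
    return (region_index << SEGMENT_SHIFT) | offset

def split(file_offset):
    """Recover (region_index, offset) from a composed file offset."""
    return file_offset >> SEGMENT_SHIFT, file_offset & ((1 << SEGMENT_SHIFT) - 1)

off = region_offset(3, 0x10)            # e.g. byte 0x10 of BAR 3
print(hex(off), split(off))
```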

For qemu, these changes imply we'd only support a model where we have a
1:1 group to iommu domain.  The current vfio driver could probably
become vfio-pci as we might end up with more target specific vfio
drivers for non-pci.  PCI should be able to maintain a simple -device
vfio-pci,host=bb:dd.f to enable hotplug of individual devices.  We'll
need to come up with extra options when we need to expose groups to
guest for pvdma.
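
For reference, the bb:dd.f host address in that option breaks down as hex bus:device.function; a small parser sketch (validation limits follow the PCI encoding: 8-bit bus, 5-bit device, 3-bit function):

```python
def parse_host_addr(addr):
    """Parse the bb:dd.f form used by '-device vfio-pci,host=bb:dd.f'."""
    bus, rest = addr.split(":")
    dev, fn = rest.split(".")
    b, d, f = int(bus, 16), int(dev, 16), int(fn, 16)
    if not (b <= 0xff and d <= 0x1f and f <= 0x7):
        raise ValueError("bad PCI address: %s" % addr)
    return b, d, f

print(parse_host_addr("1a:00.3"))       # -> (26, 0, 3)
```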

Hope that captures it, feel free to jump in with corrections and
suggestions.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-20 16:51                   ` Alex Williamson
  (?)
@ 2011-08-22  5:55                     ` David Gibson
  -1 siblings, 0 replies; 322+ messages in thread
From: David Gibson @ 2011-08-22  5:55 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Benjamin Herrenschmidt, chrisw, Alexey Kardashevskiy, kvm,
	Paul Mackerras, linux-pci, qemu-devel, aafabbri, iommu,
	Avi Kivity, Anthony Liguori, linuxppc-dev, benve

On Sat, Aug 20, 2011 at 09:51:39AM -0700, Alex Williamson wrote:
> We had an extremely productive VFIO BoF on Monday.  Here's my attempt to
> capture the plan that I think we agreed to:
> 
> We need to address both the description and enforcement of device
> groups.  Groups are formed any time the iommu does not have resolution
> between a set of devices.  On x86, this typically happens when a
> PCI-to-PCI bridge exists between the set of devices and the iommu.  For
> Power, partitionable endpoints define a group.  Grouping information
> needs to be exposed for both userspace and kernel internal usage.  This
> will be a sysfs attribute setup by the iommu drivers.  Perhaps:
> 
> # cat /sys/devices/pci0000:00/0000:00:19.0/iommu_group
> 42
> 
> (I use a PCI example here, but attribute should not be PCI specific)

Ok.  Am I correct in thinking these group IDs are representing the
minimum granularity, and are therefore always static, defined only by
the connected hardware, not by configuration?

> From there we have a few options.  In the BoF we discussed a model where
> binding a device to vfio creates a /dev/vfio$GROUP character device
> file.  This "group" fd provides provides dma mapping ioctls as well as
> ioctls to enumerate and return a "device" fd for each attached member of
> the group (similar to KVM_CREATE_VCPU).  We enforce grouping by
> returning an error on open() of the group fd if there are members of the
> group not bound to the vfio driver.  Each device fd would then support a
> similar set of ioctls and mapping (mmio/pio/config) interface as current
> vfio, except for the obvious domain and dma ioctls superseded by the
> group fd.

It seems a slightly strange distinction that the group device appears
when any device in the group is bound to vfio, but only becomes usable
when all devices are bound.

> Another valid model might be that /dev/vfio/$GROUP is created for all
> groups when the vfio module is loaded.  The group fd would allow open()
> and some set of iommu querying and device enumeration ioctls, but would
> error on dma mapping and retrieving device fds until all of the group
> devices are bound to the vfio driver.

Which is why I marginally prefer this model, although it's not a big
deal.

> In either case, the uiommu interface is removed entirely since dma
> mapping is done via the group fd.  As necessary in the future, we can
> define a more high performance dma mapping interface for streaming dma
> via the group fd.  I expect we'll also include architecture specific
> group ioctls to describe features and capabilities of the iommu.  The
> group fd will need to prevent concurrent open()s to maintain a 1:1 group
> to userspace process ownership model.

A 1:1 group<->process correspondence seems wrong to me. But there are
many ways you could legitimately write the userspace side of the code,
many of them involving some sort of concurrency.  Implementing that
concurrency as multiple processes (using explicit shared memory and/or
other IPC mechanisms to co-ordinate) seems a valid choice that we
shouldn't arbitrarily prohibit.

Obviously, only one UID may be permitted to have the group open at a
time, and I think that's enough to prevent them doing any worse than
shooting themselves in the foot.

> Also on the table is supporting non-PCI devices with vfio.  To do this,
> we need to generalize the read/write/mmap and irq eventfd interfaces.
> We could keep the same model of segmenting the device fd address space,
> perhaps adding ioctls to define the segment offset bit position or we
> could split each region into it's own fd (VFIO_GET_PCI_BAR_FD(0),
> VFIO_GET_PCI_CONFIG_FD(), VFIO_GET_MMIO_FD(3)), though we're already
> suffering some degree of fd bloat (group fd, device fd(s), interrupt
> event fd(s), per resource fd, etc).  For interrupts we can overload
> VFIO_SET_IRQ_EVENTFD to be either PCI INTx or non-PCI irq 

Sounds reasonable.

> (do non-PCI
> devices support MSI?).

They can.  Obviously they might not have exactly the same semantics as
PCI MSIs, but I know we have SoC systems with (non-PCI) on-die devices
whose interrupts are treated by the (also on-die) root interrupt
controller in the same way as PCI MSIs.

> For qemu, these changes imply we'd only support a model where we have a
> 1:1 group to iommu domain.  The current vfio driver could probably
> become vfio-pci as we might end up with more target specific vfio
> drivers for non-pci.  PCI should be able to maintain a simple -device
> vfio-pci,host=bb:dd.f to enable hotplug of individual devices.  We'll
> need to come up with extra options when we need to expose groups to
> guest for pvdma.

Are you saying that you'd no longer support the current x86 usage of
putting all of one guest's devices into a single domain?  If that's
not what you're saying, how would the domains - now made up of a
user's selection of groups, rather than individual devices - be
configured?

> Hope that captures it, feel free to jump in with corrections and
> suggestions.  Thanks,

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-20 16:51                   ` Alex Williamson
  (?)
@ 2011-08-22  6:30                     ` Avi Kivity
  -1 siblings, 0 replies; 322+ messages in thread
From: Avi Kivity @ 2011-08-22  6:30 UTC (permalink / raw)
  To: Alex Williamson
  Cc: chrisw, Alexey Kardashevskiy, kvm, Paul Mackerras, qemu-devel,
	aafabbri, iommu, linux-pci, linuxppc-dev, benve

On 08/20/2011 07:51 PM, Alex Williamson wrote:
> We need to address both the description and enforcement of device
> groups.  Groups are formed any time the iommu does not have resolution
> between a set of devices.  On x86, this typically happens when a
> PCI-to-PCI bridge exists between the set of devices and the iommu.  For
> Power, partitionable endpoints define a group.  Grouping information
> needs to be exposed for both userspace and kernel internal usage.  This
> will be a sysfs attribute setup by the iommu drivers.  Perhaps:
>
> # cat /sys/devices/pci0000:00/0000:00:19.0/iommu_group
> 42
>

$ readlink /sys/devices/pci0000:00/0000:00:19.0/iommu_group
../../../path/to/device/which/represents/the/resource/constraint

(the pci-to-pci bridge on x86, or whatever node represents partitionable 
endpoints on power)

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-22  6:30                     ` Avi Kivity
  (?)
@ 2011-08-22 10:46                       ` Joerg Roedel
  -1 siblings, 0 replies; 322+ messages in thread
From: Joerg Roedel @ 2011-08-22 10:46 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Alex Williamson, chrisw, Alexey Kardashevskiy, kvm,
	Paul Mackerras, Benjamin Herrenschmidt, qemu-devel, iommu,
	Anthony Liguori, linux-pci, linuxppc-dev, benve

On Mon, Aug 22, 2011 at 02:30:26AM -0400, Avi Kivity wrote:
> On 08/20/2011 07:51 PM, Alex Williamson wrote:
> > We need to address both the description and enforcement of device
> > groups.  Groups are formed any time the iommu does not have resolution
> > between a set of devices.  On x86, this typically happens when a
> > PCI-to-PCI bridge exists between the set of devices and the iommu.  For
> > Power, partitionable endpoints define a group.  Grouping information
> > needs to be exposed for both userspace and kernel internal usage.  This
> > will be a sysfs attribute setup by the iommu drivers.  Perhaps:
> >
> > # cat /sys/devices/pci0000:00/0000:00:19.0/iommu_group
> > 42
> >
> 
> $ readlink /sys/devices/pci0000:00/0000:00:19.0/iommu_group
> ../../../path/to/device/which/represents/the/resource/constraint
> 
> (the pci-to-pci bridge on x86, or whatever node represents partitionable 
> endpoints on power)

That does not work. The bridge in question may not even be visible as a
PCI device, so you can't link to it. This is the case on a few PCIe
cards which only have a PCI-X chip and a PCIe-to-PCI-X bridge to implement
the PCIe interface (yes, I have seen those cards).

Regards,

	Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [Qemu-devel] kvm PCI assignment & VFIO ramblings
@ 2011-08-22 10:46                       ` Joerg Roedel
  0 siblings, 0 replies; 322+ messages in thread
From: Joerg Roedel @ 2011-08-22 10:46 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Alexey Kardashevskiy, kvm, Paul Mackerras, qemu-devel, iommu,
	chrisw, Alex Williamson, linux-pci, linuxppc-dev, benve

On Mon, Aug 22, 2011 at 02:30:26AM -0400, Avi Kivity wrote:
> On 08/20/2011 07:51 PM, Alex Williamson wrote:
> > We need to address both the description and enforcement of device
> > groups.  Groups are formed any time the iommu does not have resolution
> > between a set of devices.  On x86, this typically happens when a
> > PCI-to-PCI bridge exists between the set of devices and the iommu.  For
> > Power, partitionable endpoints define a group.  Grouping information
> > needs to be exposed for both userspace and kernel internal usage.  This
> > will be a sysfs attribute setup by the iommu drivers.  Perhaps:
> >
> > # cat /sys/devices/pci0000:00/0000:00:19.0/iommu_group
> > 42
> >
> 
> $ readlink /sys/devices/pci0000:00/0000:00:19.0/iommu_group
> ../../../path/to/device/which/represents/the/resource/constraint
> 
> (the pci-to-pci bridge on x86, or whatever node represents partitionable 
> endpoints on power)

That does not work. The bridge in question may not even be visible as a
PCI device, so you can't link to it. This is the case on a few PCIe
cards which only have a PCIx chip and a PCIe-2-PCIx bridge to implement
the PCIe interface (yes, I have seen those cards).

Regards,

	Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-22 10:46                       ` Joerg Roedel
  (?)
@ 2011-08-22 10:51                         ` Avi Kivity
  -1 siblings, 0 replies; 322+ messages in thread
From: Avi Kivity @ 2011-08-22 10:51 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Alex Williamson, chrisw, Alexey Kardashevskiy, kvm,
	Paul Mackerras, Benjamin Herrenschmidt, qemu-devel, iommu,
	Anthony Liguori, linux-pci, linuxppc-dev, benve

On 08/22/2011 01:46 PM, Joerg Roedel wrote:
> >  $ readlink /sys/devices/pci0000:00/0000:00:19.0/iommu_group
> >  ../../../path/to/device/which/represents/the/resource/constraint
> >
> >  (the pci-to-pci bridge on x86, or whatever node represents partitionable
> >  endpoints on power)
>
> That does not work. The bridge in question may not even be visible as a
> PCI device, so you can't link to it. This is the case on a few PCIe
> cards which only have a PCIx chip and a PCIe-2-PCIx bridge to implement
> the PCIe interface (yes, I have seen those cards).
>

How does the kernel detect that devices behind the invisible bridge must 
be assigned as a unit?

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-22 10:51                         ` Avi Kivity
  (?)
@ 2011-08-22 12:36                           ` Roedel, Joerg
  -1 siblings, 0 replies; 322+ messages in thread
From: Roedel, Joerg @ 2011-08-22 12:36 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Alex Williamson, chrisw, Alexey Kardashevskiy, kvm,
	Paul Mackerras, Benjamin Herrenschmidt, qemu-devel, iommu,
	Anthony Liguori, linux-pci, linuxppc-dev, benve

On Mon, Aug 22, 2011 at 06:51:35AM -0400, Avi Kivity wrote:
> On 08/22/2011 01:46 PM, Joerg Roedel wrote:
> > That does not work. The bridge in question may not even be visible as a
> > PCI device, so you can't link to it. This is the case on a few PCIe
> > cards which only have a PCIx chip and a PCIe-2-PCIx bridge to implement
> > the PCIe interface (yes, I have seen those cards).
> 
> How does the kernel detect that devices behind the invisible bridge must 
> be assigned as a unit?

On the AMD IOMMU side this information is stored in the IVRS ACPI table.
Not sure about the VT-d side, though.

	Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 322+ messages in thread
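On the point above about the IVRS ACPI table carrying the grouping information: Linux exports raw ACPI tables under `/sys/firmware/acpi/tables/`, so whether the firmware provides an IVRS table can be checked from userspace. A minimal sketch, assuming that sysfs path; the table is only present on AMD IOMMU systems, and parsing its contents is firmware-specific and not attempted here.

```shell
#!/bin/sh
# Check whether the firmware exports an IVRS table (AMD IOMMU
# systems only); its presence or absence depends on the machine.
if [ -e /sys/firmware/acpi/tables/IVRS ]; then
    ivrs=present
else
    ivrs=absent
fi
echo "IVRS table: $ivrs"
```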

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-22 12:36                           ` Roedel, Joerg
  (?)
@ 2011-08-22 12:42                             ` Avi Kivity
  -1 siblings, 0 replies; 322+ messages in thread
From: Avi Kivity @ 2011-08-22 12:42 UTC (permalink /