* kvm PCI assignment & VFIO ramblings
@ 2011-07-29 23:58 ` Benjamin Herrenschmidt
From: Benjamin Herrenschmidt @ 2011-07-29 23:58 UTC (permalink / raw)
  To: kvm
  Cc: Anthony Liguori, Alex Williamson, David Gibson, Paul Mackerras,
	Alexey Kardashevskiy, linux-pci, linuxppc-dev

Hi folks !

So I promised Anthony I would try to summarize some of the comments &
issues we have vs. VFIO after we've tried to use it for PCI pass-through
on POWER. It's pretty long, there are various items with more or less
impact, some of it is easily fixable, some are API issues, and we'll
probably want to discuss them separately, but for now here's a brain
dump.

David, Alexei, please make sure I haven't missed anything :-)

* Granularity of pass-through

So let's first start with what is probably the main issue and the most
contentious, which is the problem of dealing with the various
constraints which define the granularity of pass-through, along with
exploiting features like the VTd iommu domains.

For the sake of clarity, let me first talk a bit about the "granularity"
issue I've mentioned above.

There are various constraints that can/will force several devices to be
"owned" by the same guest and on the same side of the host/guest
boundary. This is generally because some kind of HW resource is shared
and thus not doing so would break the isolation barrier and enable a
guest to disrupt the operations of the host and/or another guest.

Some of those constraints are well known, such as shared interrupts. Some
are more subtle: for example, if a PCIe->PCI bridge exists in the system,
there is no way for the iommu to identify transactions coming from devices
on the PCI segment of that bridge with a granularity other than
"behind the bridge". So typically an EHCI/OHCI/OHCI combo (a classic)
behind such a bridge must be treated as a single "entity" for
pass-through purposes.

In IBM POWER land, we call this a "partitionable endpoint" (the term
"endpoint" here is historic, such a PE can be made of several PCIe
"endpoints"). I think "partitionable" is a pretty good name tho to
represent the constraints, so I'll call this a "partitionable group"
from now on. 

Other examples of such HW imposed constraints include a shared iommu with
no filtering capability: some older POWER hardware which we might want
to support falls into that category, where each PCI host bridge is its own
domain but doesn't have any finer granularity... however those machines
tend to have a lot of host bridges :)

If we are ever going to consider applying some of this to non-PCI
devices (see the ongoing discussions here), then we will be faced with
the craziness of embedded designers, which probably means all sorts of new
constraints we can't even begin to think about.

This leads me to those initial conclusions:

- The -minimum- granularity of pass-through is not always a single
device and not always under SW control

- Having a magic heuristic in libvirt to figure out those constraints is
WRONG. This reeks of the XFree 4 PCI layer trying to duplicate the kernel's
knowledge of PCI resource management and getting it wrong in many, many
cases, something that took years to fix, essentially by ripping it all
out. This is kernel knowledge and thus we need the kernel to expose in
one way or another what those constraints are, what those "partitionable
groups" are.

- That does -not- mean that we cannot specify for each individual device
within such a group where we want to put it in qemu (what devfn etc...).
As long as there is a clear understanding that the "ownership" of the
device goes with the group, this is somewhat orthogonal to how they are
represented in qemu. (Not completely... if the iommu is exposed to the
guest, via paravirt for example, some of these constraints must be
exposed, but I'll talk about that more later).

The interface currently proposed for VFIO (and associated uiommu)
doesn't handle that problem at all. Instead, it is entirely centered
around a specific "feature" of the VTd iommu's for creating arbitrary
domains with arbitrary devices (tho those devices -do- have the same
constraints exposed above, don't try to put 2 legacy PCI devices behind
the same bridge into 2 different domains !), but the API totally ignores
the problem, leaves it to libvirt "magic foo" and focuses on something
that is both quite secondary in the grand scheme of things, and quite
x86 VTd specific in the implementation and API definition.

Now, I'm not saying these programmable iommu domains aren't a nice
feature and that we shouldn't exploit them when available, but as it is,
it is too much a central part of the API.

I'll talk a little bit more about recent POWER iommu's here to
illustrate where I'm coming from with my idea of groups:

On p7ioc (the IO chip used on recent P7 machines), there -is- a concept
of domain and a per-RID filtering. However it differs from VTd in a few
ways:

The "domains" (aka PEs) encompass more than just an iommu filtering
scheme. The MMIO space and PIO space are also segmented, and those
segments assigned to domains. Interrupts (well, MSI ports at least) are
assigned to domains. Inbound PCIe error messages are targeted to
domains, etc...

Basically, the PEs provide a very strong isolation feature which
includes errors, and has the ability to immediately "isolate" a PE on
the first occurrence of an error. For example, if an inbound PCIe error
is signaled by a device on a PE or such a device does a DMA to a
non-authorized address, the whole PE gets into error state. All
subsequent stores (both DMA and MMIO) are swallowed and reads return all
1's, interrupts are blocked. This is designed to prevent any propagation
of bad data, which is a very important feature in large high reliability
systems.

Software then has the ability to selectively turn back on MMIO and/or
DMA, perform diagnostics, reset devices etc...

Because the domains encompass more than just DMA and also segment the
MMIO space, it is not practical at all to dynamically reconfigure them
at runtime to "move" devices into domains. The firmware or early kernel
code (it depends) will assign device BARs using an algorithm that keeps
them within PE segment boundaries, etc....

Additionally (and this is indeed a "restriction" compared to VTd, though
I expect our future IO chips to lift it to some extent), PEs don't get
separate DMA address spaces. There is one 64-bit DMA address space per
PCI host bridge, and it is 'segmented' with each segment being assigned
to a PE. Due to the way PE assignment works in hardware, it is not
practical to make several devices share a segment unless they are on the
same bus. Also the resulting limit in the amount of 32-bit DMA space a
device can access means that it's impractical to put too many devices in
a PE anyways. (This is clearly designed for paravirt iommu, I'll talk
more about that later).

The above essentially extends the granularity requirement (or rather is
another factor defining what the granularity of partitionable entities
is). You can think of it as "pre-existing" domains.

I believe the way to solve that is to introduce a kernel interface to
expose those "partitionable entities" to userspace. In addition, it
occurs to me that the ability to manipulate VTd domains essentially
boils down to manipulating those groups (creating larger ones with
individual components).

I like the idea of defining / playing with those groups statically
(using a command line tool or sysfs, possibly having a config file
defining them in a persistent way) rather than having their lifetime
tied to a uiommu file descriptor.

It also makes it a LOT easier to have a channel to manipulate
platform/arch specific attributes of those domains if any.

So we could define an API or representation in sysfs that exposes what
the partitionable entities are, and we may add to it an API to
manipulate them. But we don't have to and I'm happy to keep the
additional SW grouping you can do on VTd as a separate "add-on" API
(tho I don't like at all the way it works with uiommu). However, qemu
needs to know what the grouping is regardless of the domains, and it's
not nice if it has to manipulate two different concepts here, so
eventually those "partitionable entities" from a qemu standpoint must
look like domains.

My main point is that I don't want the "knowledge" here to be in libvirt
or qemu. In fact, I want to be able to do something as simple as passing
a reference to a PE to qemu (sysfs path ?) and have it just pick up all
the devices in there and expose them to the guest.

This can be done in a way that isn't PCI specific as well (the
definition of the groups and what is grouped would obviously be
somewhat bus specific and handled by platform code in the kernel).

Maybe something like /sys/devgroups ? This probably warrants involving
more kernel people into the discussion.
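
Purely to illustrate how little qemu/libvirt would then need to know,
assuming a hypothetical layout of /sys/devgroups/<group>/devices/
containing one symlink per device owned by the group (nothing like this
exists today), discovery could be as dumb as:

  /* Hypothetical layout: /sys/devgroups/<group>/devices/ contains one
   * symlink per device owned by the group. Nothing of the sort exists
   * yet; this only illustrates the idea. */
  #include <stdio.h>
  #include <dirent.h>

  static void list_group_devices(const char *group)
  {
      char path[256];
      struct dirent *d;
      DIR *dir;

      snprintf(path, sizeof(path), "/sys/devgroups/%s/devices", group);
      dir = opendir(path);
      if (!dir)
          return;
      while ((d = readdir(dir)) != NULL) {
          if (d->d_name[0] == '.')
              continue;                /* skip "." and ".." */
          printf("group %s owns %s\n", group, d->d_name);
      }
      closedir(dir);
  }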

* IOMMU

Now more on iommus. I've described, I think, in enough detail how ours
works; there are others, I don't know what freescale or ARM are doing,
sparc doesn't quite work like VTd either, etc...

The main problem isn't that much the mechanics of the iommu but really
how it's exposed (or not) to guests.

VFIO here is basically designed for one and only one thing: expose the
entire guest physical address space to the device more/less 1:1.

This means:

  - It only works with iommu's that provide complete DMA address spaces
to devices. Won't work with a single 'segmented' address space like we
have on POWER.

  - It requires the guest to be pinned. Pass-through -> no more swap

  - The guest cannot make use of the iommu to deal with 32-bit DMA
devices, thus a guest with more than a few G of RAM (I don't know the
exact limit on x86, it depends on your IO hole I suppose) ends up
back with swiotlb & bounce buffering.

  - It doesn't work for POWER servers anyways because of our need to
provide a paravirt iommu interface to the guest, since that's how pHyp
works today and how existing OSes expect to operate.
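
For reference, the "map everything 1:1" model boils down to something
like the call below, done once per guest RAM block at startup. The
ioctl and structure names are invented for illustration; the point is
only that every byte of guest memory ends up pinned and mapped at
iova == guest physical:

  /* Names invented for illustration; not the actual vfio ioctl/struct. */
  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/ioctl.h>

  struct dma_map_req {
      uint64_t iova;      /* bus address == guest physical address */
      uint64_t vaddr;     /* qemu virtual address, gets pinned */
      uint64_t size;
  };
  #define VFIO_MAP_GUEST_RAM _IOW(';', 100, struct dma_map_req)

  /* One call per guest RAM block: iova == gpa, the whole range stays
   * pinned for the lifetime of the assignment. */
  static int map_ram_1to1(int fd, uint64_t gpa, uint64_t hva, uint64_t size)
  {
      struct dma_map_req req = { .iova = gpa, .vaddr = hva, .size = size };
      return ioctl(fd, VFIO_MAP_GUEST_RAM, &req);
  }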

Now some of this can be fixed with tweaks, and we've started doing it
(we have a working pass-through using VFIO, forgot to mention that, it's
just that we don't like what we had to do to get there).

Basically, what we do today is:

- We add an ioctl to VFIO to expose to qemu the segment information, i.e.
what the DMA address and size of the DMA "window" usable for a given
device are (see the sketch after this list). This is a tweak that should
really be handled at the "domain" level.

That current hack won't work well if two devices share an iommu. Note
that we have an additional constraint here due to our paravirt
interfaces (specified in PAPR), which is that PE domains must have a
common parent. Basically, pHyp makes them look like a PCIe host bridge
per domain in the guest. I think that's a pretty good idea and qemu
might want to do the same.

- We hack out the currently unconditional mapping of the entire guest
space in the iommu. Something will have to be done to "decide" whether
to do that or not ... qemu argument -> ioctl ?

- We hook up the paravirt call to insert/remove a translation from the
iommu to the VFIO map/unmap ioctl's.
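
Roughly, the window query and the H_PUT_TCE hookup look like this from
the qemu side. Every name below is made up for illustration (this is
not the interface we actually posted), but it gives a feel for the flow:

  /* All names invented; shown only to illustrate the hack described above. */
  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/ioctl.h>

  struct dma_window_info {
      uint64_t start;     /* bus address of the usable DMA window */
      uint64_t size;      /* size of that window for this device's PE */
  };
  #define VFIO_GET_DMA_WINDOW _IOR(';', 101, struct dma_window_info)

  struct dma_map_entry {
      uint64_t iova;      /* bus address within the window */
      uint64_t vaddr;     /* qemu address of the guest page */
      uint32_t flags;     /* read/write bits taken from the TCE */
  };
  #define VFIO_MAP_DMA_PAGE _IOW(';', 102, struct dma_map_entry)

  /* Assumed to exist in qemu; hypothetical helper. */
  extern uint64_t gpa_to_qemu_va(uint64_t gpa);

  /* The guest's H_PUT_TCE hypercall ends up here: one 4k translation at
   * a time is pushed down to the host iommu through the map ioctl. */
  static int h_put_tce(int vfio_fd, uint64_t ioba, uint64_t tce)
  {
      struct dma_map_entry e = {
          .iova  = ioba,
          .vaddr = gpa_to_qemu_va(tce & ~0xfffULL),
          .flags = (uint32_t)(tce & 0x3),    /* PAPR access bits */
      };
      return ioctl(vfio_fd, VFIO_MAP_DMA_PAGE, &e);
  }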

This limps along but it's not great. Some of the problems are:

- I've already mentioned, the domain problem again :-) 

- Performance sucks of course, the vfio map ioctl wasn't meant for that
and has quite a bit of overhead. However, we'll want to do the paravirt
call directly in the kernel eventually ...

  - ... from where it isn't trivial to get back to our underlying arch
specific iommu object. We'll probably need a set of arch specific
"sideband" ioctl's to "register" our paravirt iommu "bus numbers" and
link them to the real thing kernel-side.

- PAPR (the specification of our paravirt interface and the expectation
of current OSes) wants iommu pages to be 4k by default, regardless of
the kernel host page size, which makes things a bit tricky since our
enterprise host kernels have a 64k base page size. Additionally, we have
new PAPR interfaces that we want to exploit, to allow the guest to
create secondary iommu segments (in 64-bit space), which can be used
(under guest control) to do things like map the entire guest (here it
is :-) or use larger iommu page sizes (if permitted by the host kernel,
in our case we could allow 64k iommu page size with a 64k host kernel).
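
Just to illustrate the 4k vs 64k mismatch with some plain arithmetic
(nothing VFIO specific here): with 64k host pages and the 4k iommu
pages PAPR mandates, a single pinned host page has to be entered as 16
consecutive TCEs, something like:

  /* 64k host page vs 4k iommu page: one host page covers
   * HOST_PAGE_SIZE / IOMMU_PAGE_SIZE == 16 TCE slots. */
  #include <stdint.h>

  #define HOST_PAGE_SIZE  (64 * 1024)
  #define IOMMU_PAGE_SIZE (4 * 1024)

  static void tces_for_host_page(uint64_t *tce_table, uint64_t ioba,
                                 uint64_t host_page_addr, uint64_t perms)
  {
      int i;

      for (i = 0; i < HOST_PAGE_SIZE / IOMMU_PAGE_SIZE; i++)
          tce_table[ioba / IOMMU_PAGE_SIZE + i] =
              (host_page_addr + (uint64_t)i * IOMMU_PAGE_SIZE) | perms;
  }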

The above means we need arch specific APIs. So arch specific vfio
ioctl's, either that or kvm ones going to vfio or something ... the
current structure of vfio/kvm interaction doesn't make it easy.

* IO space

On most (if not all) non-x86 archs, each PCI host bridge provides a
completely separate PCI address space. Qemu doesn't deal with that very
well. For MMIO it can be handled, since those PCI address spaces are
"remapped" holes in the main CPU address space, so devices can be
registered using BAR + offset of that window in qemu's MMIO mapping.

For PIO things get nasty. We have totally separate PIO spaces and qemu
doesn't seem to like that. We can try to play the offset trick as well,
we haven't tried yet, but basically that's another one to fix. Not a
huge deal I suppose but heh ...

Also our next generation chipset may drop support for PIO completely.

On the other hand, because PIO is just a special range of MMIO for us,
we can do normal pass-through on it and don't need any of the emulation
done by qemu.
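
To spell out what "PIO is just a special range of MMIO" means in
practice: a guest I/O port access can in principle become a plain
load/store at the host bridge's I/O window plus the port number. The
names below are made up, it's only a sketch:

  /* Sketch: fold a per-PHB PIO space into the CPU address space.
   * phb_io_base would be whatever window the host bridge decodes for
   * I/O cycles; hypothetical name. */
  #include <stdint.h>

  static inline uint8_t pio_inb(volatile uint8_t *phb_io_base, uint32_t port)
  {
      return phb_io_base[port];
  }

  static inline void pio_outb(volatile uint8_t *phb_io_base, uint32_t port,
                              uint8_t val)
  {
      phb_io_base[port] = val;
  }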

  * MMIO constraints

The QEMU side VFIO code hard wires various constraints that are entirely
based on various requirements you decided you have on x86 but don't
necessarily apply to us :-)

Due to our paravirt nature, we don't need to masquerade the MSI-X table
for example. At all. If the guest configures crap into it, too bad, it
can only shoot itself in the foot since the host bridge enforces
validation anyways, as I explained earlier. Because it's all paravirt, we
don't need to "translate" the interrupt vectors & addresses, the guest
will call hypercalls to configure things anyways.

We don't need to prevent MMIO pass-through for small BARs at all. This
should be some kind of capability or flag passed by the arch. Our
segmentation of the MMIO domain means that we can give entire segments
to the guest and let it access anything in there (those segments are
always a multiple of the page size). Worst case it will access outside of
a device BAR within a segment and will cause the PE to go into error
state, shooting itself in the foot; there is no risk of side effects
outside of the guest boundaries.

In fact, we don't even need to emulate BAR sizing etc... in theory. Our
paravirt guests expect the BARs to have been already allocated for them
by the firmware and will pick up the addresses from the device-tree :-)

Today we use a "hack", putting all 0's in there and triggering the linux
code path to reassign unassigned resources (which will use BAR
emulation) but that's not what we are -supposed- to do. Not a big deal
and having the emulation there won't -hurt- us, it's just that we don't
really need any of it.

We have a small issue with ROMs. Our current KVM only works with huge
pages for guest memory but that is being fixed. So the way qemu maps the
ROM copy into the guest address space doesn't work. It might be handy
anyways to have a way for qemu to use MMIO emulation for ROM access as a
fallback. I'll look into it.

  * EEH

This is the name of those fancy error handling & isolation features I
mentioned earlier. To some extent it's a superset of AER, but we don't
generally expose AER to guests (or even the host); it's swallowed by
firmware into something else that provides a superset (well, mostly) of
the AER information and allows us to do those additional things like
isolating/de-isolating, reset control, etc...

Here too, we'll need arch specific APIs through VFIO. Not necessarily a
huge deal, I mention it for completeness.

   * Misc

There's lots of small bits and pieces... in no special order:

 - netlink ? WTF ! Seriously, we don't need a hybrid API with a bit of
netlink and a bit of ioctl's ... it's not like there's something
fundamentally  better for netlink vs. ioctl... it really depends what
you are doing, and in this case I fail to see what netlink brings you
other than bloat and more stupid userspace library deps.

 - I don't like too much the fact that VFIO provides yet another
different API to do what we already have at least 2 kernel APIs for,
i.e., BAR mapping and config space access. At least it should be better
at using the backend infrastructure of the 2 others (sysfs & procfs). I
understand it wants to filter in some cases (config space) and -maybe-
yet another API is the right way to go, but allow me to have my doubts.

One thing I thought about, but you don't seem to like it ... was to
reuse the representation of the partitionable entities as groups in
sysfs that I talked about earlier. Those could have per-device subdirs
with the usual config & resource files, same semantics as the ones on
the real device, but when accessed via the group they get filtering. It
might or might not be practical in the end, tbd, but it would allow apps
using a slightly modified libpci for example to exploit some of this.

 - The qemu vfio code hooks directly into ioapic ... of course that
won't fly with anything !x86

 - The various "objects" dealt with here, -especially- interrupts and
iommu, need a better in-kernel API so that fast in-kernel emulation can
take over from qemu based emulation. The way we need to do some of this
on POWER differs from x86. We can elaborate later, it's not necessarily
a killer either but essentially we'll take the bulk of interrupt
handling away from VFIO to the point where it won't see any of it at
all.

  - Non-PCI devices. That's a hot topic for embedded. I think the vast
majority here is platform devices. There's quite a bit of vfio that
isn't intrinsically PCI specific. We could have an in-kernel platform
driver like we have an in-kernel PCI driver to attach to. The mapping of
resources to userspace is rather generic, and the same goes for
interrupts. I don't know whether that idea can be pushed much further, I
don't have the bandwidth to look into it much at this point, but maybe
it would be possible to refactor vfio a bit to better separate what is
PCI specific from what is not. The idea would be to move the PCI
specific bits inside the "placeholder" PCI driver, and the same goes for
platform bits. "Generic" ioctl's go to the VFIO core; anything it
doesn't handle, it passes to the driver, which allows the PCI one to
handle things differently than the platform one, maybe an amba one while
at it, etc.... Just a thought, I haven't gone into the details at all.
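
To make that last idea slightly more concrete (pure sketch, none of
these structures exist anywhere), the split could be a small ops table
that the PCI / platform / amba "placeholder" drivers fill in, with the
core forwarding whatever it doesn't understand:

  /* Hypothetical split between a bus-agnostic VFIO core and per-bus
   * backends; every name here is invented for the example. */
  struct vm_area_struct;                    /* kernel type, fwd declared */

  struct vfio_bus_ops {
      int  (*open)(void *dev_private);
      void (*release)(void *dev_private);
      /* map a device resource (PCI BAR, platform "reg" range, ...) */
      int  (*mmap_resource)(void *dev_private, int index,
                            struct vm_area_struct *vma);
      /* anything the generic core doesn't handle lands here */
      long (*ioctl)(void *dev_private, unsigned int cmd, unsigned long arg);
  };

  struct vfio_device {
      void                      *dev_private;   /* pci_dev, platform_device... */
      const struct vfio_bus_ops *ops;
  };

The PCI backend would then keep config space filtering and BAR handling
to itself, while a platform backend would just expose its "reg" ranges
and interrupts.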

I think that's all I had on my plate today, it's a long enough email
anyway :-) Anthony suggested we put that on a wiki; I'm a bit
wiki-disabled myself, so he proposed to pick up my email and do that. We
should probably discuss the various items in here separately as
different threads to avoid too much confusion.

One other thing we should do on our side is publish somewhere our
current hacks to give you an idea of where we are going and what we had
to do (code speaks more than words). We'll try to do that asap, possibly
next week.

Note that I'll be on/off the next few weeks, travelling and doing
bringup. So expect latency in my replies.

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-07-29 23:58 ` Benjamin Herrenschmidt
@ 2011-07-30 18:20   ` Alex Williamson
From: Alex Williamson @ 2011-07-30 18:20 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: kvm, Anthony Liguori, David Gibson, Paul Mackerras,
	Alexey Kardashevskiy, linux-pci, linuxppc-dev, iommu, benve,
	aafabbri, chrisw, qemu-devel

On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote:
> Hi folks !
> 
> So I promised Anthony I would try to summarize some of the comments &
> issues we have vs. VFIO after we've tried to use it for PCI pass-through
> on POWER. It's pretty long, there are various items with more or less
> impact, some of it is easily fixable, some are API issues, and we'll
> probably want to discuss them separately, but for now here's a brain
> dump.

Thanks Ben.  For those wondering what happened to VFIO and where it
lives now, Tom Lyon turned it over to me.  I've been continuing to hack
and bug fix and prep it for upstream.  My trees are here:

git://github.com/awilliam/linux-vfio.git vfio
git://github.com/awilliam/qemu-vfio.git vfio

I was hoping we were close to being ready for an upstream push, but we
obviously need to work through the issues Ben and company have been
hitting.

> David, Alexei, please make sure I haven't missed anything :-)
> 
> * Granularity of pass-through
> 
> So let's first start with what is probably the main issue and the most
> contentious, which is the problem of dealing with the various
> constraints which define the granularity of pass-through, along with
> exploiting features like the VTd iommu domains.
> 
> For the sake of clarity, let me first talk a bit about the "granularity"
> issue I've mentioned above.
> 
> There are various constraints that can/will force several devices to be
> "owned" by the same guest and on the same side of the host/guest
> boundary. This is generally because some kind of HW resource is shared
> and thus not doing so would break the isolation barrier and enable a
> guest to disrupt the operations of the host and/or another guest.
> 
> Some of those constraints are well known, such as shared interrupts. Some
> are more subtle: for example, if a PCIe->PCI bridge exists in the system,
> there is no way for the iommu to identify transactions coming from devices
> on the PCI segment of that bridge with a granularity other than
> "behind the bridge". So typically an EHCI/OHCI/OHCI combo (a classic)
> behind such a bridge must be treated as a single "entity" for
> pass-through purposes.

On x86, the USB controllers don't typically live behind a PCIe-to-PCI
bridge, so they don't suffer the source identifier problem, but they do
often share an interrupt.  But even then, we can count on most modern
devices supporting PCI 2.3, and thus the DisINTx feature, which allows us
to share interrupts.  In any case, yes, it's more rare but we need to know
how to handle devices behind PCI bridges.  However I disagree that we
need to assign all the devices behind such a bridge to the guest.
There's a difference between removing the device from the host and
exposing the device to the guest.  If I have a NIC and HBA behind a
bridge, it's perfectly reasonable that I might only assign the NIC to
the guest, but as you describe, we then need to prevent the host, or any
other guest from making use of the HBA.

> In IBM POWER land, we call this a "partitionable endpoint" (the term
> "endpoint" here is historic, such a PE can be made of several PCIe
> "endpoints"). I think "partitionable" is a pretty good name tho to
> represent the constraints, so I'll call this a "partitionable group"
> from now on. 
> 
> Other examples of such HW imposed constraints include a shared iommu with
> no filtering capability: some older POWER hardware which we might want
> to support falls into that category, where each PCI host bridge is its own
> domain but doesn't have any finer granularity... however those machines
> tend to have a lot of host bridges :)
> 
> If we are ever going to consider applying some of this to non-PCI
> devices (see the ongoing discussions here), then we will be faced with
> the craziness of embedded designers, which probably means all sorts of new
> constraints we can't even begin to think about.
> 
> This leads me to those initial conclusions:
> 
> - The -minimum- granularity of pass-through is not always a single
> device and not always under SW control

But IMHO, we need to preserve the granularity of exposing a device to a
guest as a single device.  That might mean some devices are held hostage
by an agent on the host.

> - Having a magic heuristic in libvirt to figure out those constraints is
> WRONG. This reeks of the XFree 4 PCI layer trying to duplicate the kernel's
> knowledge of PCI resource management and getting it wrong in many, many
> cases, something that took years to fix, essentially by ripping it all
> out. This is kernel knowledge and thus we need the kernel to expose in
> one way or another what those constraints are, what those "partitionable
> groups" are.
> 
> - That does -not- mean that we cannot specify for each individual device
> within such a group where we want to put it in qemu (what devfn etc...).
> As long as there is a clear understanding that the "ownership" of the
> device goes with the group, this is somewhat orthogonal to how they are
> represented in qemu. (Not completely... if the iommu is exposed to the
> guest, via paravirt for example, some of these constraints must be
> exposed, but I'll talk about that more later).

Or we can choose not to expose all of the devices in the group to the
guest?

> The interface currently proposed for VFIO (and associated uiommu)
> doesn't handle that problem at all. Instead, it is entirely centered
> around a specific "feature" of the VTd iommu's for creating arbitrary
> domains with arbitrary devices (tho those devices -do- have the same
> constraints exposed above, don't try to put 2 legacy PCI devices behind
> the same bridge into 2 different domains !), but the API totally ignores
> the problem, leaves it to libvirt "magic foo" and focuses on something
> that is both quite secondary in the grand scheme of things, and quite
> x86 VTd specific in the implementation and API definition.

To be fair, libvirt's "magic foo" is built out of the necessity that
nobody else is defining the rules.

> Now, I'm not saying these programmable iommu domains aren't a nice
> feature and that we shouldn't exploit them when available, but as it is,
> it is too much a central part of the API.
> 
> I'll talk a little bit more about recent POWER iommu's here to
> illustrate where I'm coming from with my idea of groups:
> 
> On p7ioc (the IO chip used on recent P7 machines), there -is- a concept
> of domain and a per-RID filtering. However it differs from VTd in a few
> ways:
> 
> The "domains" (aka PEs) encompass more than just an iommu filtering
> scheme. The MMIO space and PIO space are also segmented, and those
> segments assigned to domains. Interrupts (well, MSI ports at least) are
> assigned to domains. Inbound PCIe error messages are targeted to
> domains, etc...
> 
> Basically, the PEs provide a very strong isolation feature which
> includes errors, and has the ability to immediately "isolate" a PE on
> the first occurrence of an error. For example, if an inbound PCIe error
> is signaled by a device on a PE or such a device does a DMA to a
> non-authorized address, the whole PE gets into error state. All
> subsequent stores (both DMA and MMIO) are swallowed and reads return all
> 1's, interrupts are blocked. This is designed to prevent any propagation
> of bad data, which is a very important feature in large high reliability
> systems.
> 
> Software then has the ability to selectively turn back on MMIO and/or
> DMA, perform diagnostics, reset devices etc...
> 
> Because the domains encompass more than just DMA and also segment the
> MMIO space, it is not practical at all to dynamically reconfigure them
> at runtime to "move" devices into domains. The firmware or early kernel
> code (it depends) will assign device BARs using an algorithm that keeps
> them within PE segment boundaries, etc....
> 
> Additionally (and this is indeed a "restriction" compared to VTd, though
> I expect our future IO chips to lift it to some extent), PEs don't get
> separate DMA address spaces. There is one 64-bit DMA address space per
> PCI host bridge, and it is 'segmented' with each segment being assigned
> to a PE. Due to the way PE assignment works in hardware, it is not
> practical to make several devices share a segment unless they are on the
> same bus. Also the resulting limit in the amount of 32-bit DMA space a
> device can access means that it's impractical to put too many devices in
> a PE anyways. (This is clearly designed for paravirt iommu, I'll talk
> more about that later).
> 
> The above essentially extends the granularity requirement (or rather is
> another factor defining what the granularity of partitionable entities
> is). You can think of it as "pre-existing" domains.
> 
> I believe the way to solve that is to introduce a kernel interface to
> expose those "partitionable entities" to userspace. In addition, it
> occurs to me that the ability to manipulate VTd domains essentially
> boils down to manipulating those groups (creating larger ones with
> individual components).
> 
> I like the idea of defining / playing with those groups statically
> (using a command line tool or sysfs, possibly having a config file
> defining them in a persistent way) rather than having their lifetime
> tied to a uiommu file descriptor.
> 
> It also makes it a LOT easier to have a channel to manipulate
> platform/arch specific attributes of those domains if any.
> 
> So we could define an API or representation in sysfs that exposes what
> the partitionable entities are, and we may add to it an API to
> manipulate them. But we don't have to and I'm happy to keep the
> additional SW grouping you can do on VTd as a separate "add-on" API
> (tho I don't like at all the way it works with uiommu). However, qemu
> needs to know what the grouping is regardless of the domains, and it's
> not nice if it has to manipulate two different concepts here, so
> eventually those "partitionable entities" from a qemu standpoint must
> look like domains.
> 
> My main point is that I don't want the "knowledge" here to be in libvirt
> or qemu. In fact, I want to be able to do something as simple as passing
> a reference to a PE to qemu (sysfs path ?) and have it just pick up all
> the devices in there and expose them to the guest.
> 
> This can be done in a way that isn't PCI specific as well (the
> definition of the groups and what is grouped would obviously be
> somewhat bus specific and handled by platform code in the kernel).
> 
> Maybe something like /sys/devgroups ? This probably warrants involving
> more kernel people into the discussion.

I don't yet buy into passing groups to qemu since I don't buy into the
idea of always exposing all of those devices to qemu.  Would it be
sufficient to expose iommu nodes in sysfs that link to the devices
behind them and describe properties and capabilities of the iommu
itself?  More on this at the end.

> * IOMMU
> 
> Now more on iommus. I've described, I think, in enough detail how ours
> works; there are others, I don't know what freescale or ARM are doing,
> sparc doesn't quite work like VTd either, etc...
> 
> The main problem isn't that much the mechanics of the iommu but really
> how it's exposed (or not) to guests.
> 
> VFIO here is basically designed for one and only one thing: expose the
> entire guest physical address space to the device more/less 1:1.
> 
> This means:
> 
>   - It only works with iommu's that provide complete DMA address spaces
> to devices. Won't work with a single 'segmented' address space like we
> have on POWER.
> 
>   - It requires the guest to be pinned. Pass-through -> no more swap
> 
>   - The guest cannot make use of the iommu to deal with 32-bit DMA
> devices, thus a guest with more than a few G of RAM (I don't know the
> exact limit on x86, it depends on your IO hole I suppose) ends up
> back with swiotlb & bounce buffering.
> 
>   - It doesn't work for POWER servers anyways because of our need to
> provide a paravirt iommu interface to the guest, since that's how pHyp
> works today and how existing OSes expect to operate.
> 
> Now some of this can be fixed with tweaks, and we've started doing it
> (we have a working pass-through using VFIO, forgot to mention that, it's
> just that we don't like what we had to do to get there).

This is a result of wanting to support *unmodified* x86 guests.  We
don't have the luxury of having a predefined pvDMA spec that all x86
OSes adhere to.  The 32bit problem is unfortunate, but the priority use
case for assigning devices to guests is high performance I/O, which
usually entails modern, 64bit hardware.  I'd like to see us get to the
point of having emulated IOMMU hardware on x86, which could then be
backed by VFIO, but for now guest pinning is the most practical and
useful.

> Basically, what we do today is:
> 
> - We add an ioctl to VFIO to expose to qemu the segment information, i.e.
> what the DMA address and size of the DMA "window" usable for a given
> device are. This is a tweak that should really be handled at the
> "domain" level.
> 
> That current hack won't work well if two devices share an iommu. Note
> that we have an additional constraint here due to our paravirt
> interfaces (specified in PAPR), which is that PE domains must have a
> common parent. Basically, pHyp makes them look like a PCIe host bridge
> per domain in the guest. I think that's a pretty good idea and qemu
> might want to do the same.
> 
> - We hack out the currently unconditional mapping of the entire guest
> space in the iommu. Something will have to be done to "decide" whether
> to do that or not ... qemu argument -> ioctl ?
> 
> - We hook up the paravirt call to insert/remove a translation from the
> iommu to the VFIO map/unmap ioctl's.
> 
> This limps along but it's not great. Some of the problems are:
> 
> - I've already mentioned, the domain problem again :-) 
> 
> - Performance sucks of course, the vfio map ioctl wasn't meant for that
> and has quite a bit of overhead. However, we'll want to do the paravirt
> call directly in the kernel eventually ...
> 
>   - ... from where it isn't trivial to get back to our underlying arch
> specific iommu object. We'll probably need a set of arch specific
> "sideband" ioctl's to "register" our paravirt iommu "bus numbers" and
> link them to the real thing kernel-side.
> 
> - PAPR (the specification of our paravirt interface and the expectation
> of current OSes) wants iommu pages to be 4k by default, regardless of
> the kernel host page size, which makes things a bit tricky since our
> enterprise host kernels have a 64k base page size. Additionally, we have
> new PAPR interfaces that we want to exploit, to allow the guest to
> create secondary iommu segments (in 64-bit space), which can be used
> (under guest control) to do things like map the entire guest (here it
> is :-) or use larger iommu page sizes (if permitted by the host kernel,
> in our case we could allow 64k iommu page size with a 64k host kernel).
> 
> The above means we need arch specific APIs. So arch specific vfio
> ioctl's, either that or kvm ones going to vfio or something ... the
> current structure of vfio/kvm interaction doesn't make it easy.

FYI, we also have large page support for x86 VT-d, but it seems to only
be opportunistic right now.  I'll try to come back to the rest of this
below.

> * IO space
> 
> On most (if not all) non-x86 archs, each PCI host bridge provides a
> completely separate PCI address space. Qemu doesn't deal with that very
> well. For MMIO it can be handled, since those PCI address spaces are
> "remapped" holes in the main CPU address space, so devices can be
> registered using BAR + offset of that window in qemu's MMIO mapping.
> 
> For PIO things get nasty. We have totally separate PIO spaces and qemu
> doesn't seem to like that. We can try to play the offset trick as well,
> we haven't tried yet, but basically that's another one to fix. Not a
> huge deal I suppose but heh ...
> 
> Also our next generation chipset may drop support for PIO completely.
> 
> On the other hand, because PIO is just a special range of MMIO for us,
> we can do normal pass-through on it and don't need any of the emulation
> done by qemu.

Maybe we can add mmap support to PIO regions on non-x86.

>   * MMIO constraints
> 
> The QEMU side VFIO code hard wires various constraints that are entirely
> based on various requirements you decided you have on x86 but don't
> necessarily apply to us :-)
> 
> Due to our paravirt nature, we don't need to masquerade the MSI-X table
> for example. At all. If the guest configures crap into it, too bad, it
> can only shoot itself in the foot since the host bridge enforces
> validation anyways, as I explained earlier. Because it's all paravirt, we
> don't need to "translate" the interrupt vectors & addresses, the guest
> will call hypercalls to configure things anyways.

With interrupt remapping, we can allow the guest access to the MSI-X
table, but since that takes the host out of the loop, there's
effectively no way for the guest to correctly program it directly by
itself.

> We don't need to prevent MMIO pass-through for small BARs at all. This
> should be some kind of capability or flag passed by the arch. Our
> segmentation of the MMIO domain means that we can give entire segments
> to the guest and let it access anything in there (those segments are
> always a multiple of the page size). Worst case it will access outside of
> a device BAR within a segment and will cause the PE to go into error
> state, shooting itself in the foot; there is no risk of side effects
> outside of the guest boundaries.

Sure, this could be some kind of capability flag, maybe even implicit in
certain configurations.

> In fact, we don't even need to emulate BAR sizing etc... in theory. Our
> paravirt guests expect the BARs to have been already allocated for them
> by the firmware and will pick up the addresses from the device-tree :-)
> 
> Today we use a "hack", putting all 0's in there and triggering the linux
> code path to reassign unassigned resources (which will use BAR
> emulation) but that's not what we are -supposed- to do. Not a big deal
> and having the emulation there won't -hurt- us, it's just that we don't
> really need any of it.
> 
> We have a small issue with ROMs. Our current KVM only works with huge
> pages for guest memory but that is being fixed. So the way qemu maps the
> ROM copy into the guest address space doesn't work. It might be handy
> anyways to have a way for qemu to use MMIO emulation for ROM access as a
> fallback. I'll look into it.

So that means ROMs don't work for you on emulated devices either?  The
reason we read it once and map it into the guest is because Michael
Tsirkin found a section in the PCI spec that indicates devices can share
address decoders between BARs and ROM.  This means we can't just leave
the enabled bit set in the ROM BAR, because it could actually disable an
address decoder for a regular BAR.  We could slow-map the actual ROM,
enabling it around each read, but shadowing it seemed far more
efficient.
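
To illustrate the "shadow it once" idea outside of QEMU, here's a rough
sketch using the PCI sysfs rom attribute (illustrative only, not the
actual QEMU VFIO code path; error handling is minimal):

#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Copy a device ROM once via /sys/bus/pci/devices/<bdf>/rom and serve
 * all further reads from the private copy.  Writing "1" to the
 * attribute enables the ROM decoder just long enough to read it. */
static void *shadow_rom(const char *rom_path, size_t *len)
{
    struct stat st;
    void *buf = NULL;
    int fd = open(rom_path, O_RDWR);

    if (fd < 0)
        return NULL;
    if (write(fd, "1", 1) == 1 && fstat(fd, &st) == 0 && st.st_size > 0) {
        buf = malloc(st.st_size);
        if (buf && pread(fd, buf, st.st_size, 0) != st.st_size) {
            free(buf);
            buf = NULL;
        }
    }
    write(fd, "0", 1);          /* disable the decoder again */
    close(fd);
    if (buf)
        *len = (size_t)st.st_size;
    return buf;
}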

>   * EEH
> 
> This is the name of those fancy error handling & isolation features I
> mentioned earlier. To some extent it's a superset of AER, but we don't
> generally expose AER to guests (or even the host), it's swallowed by
> firmware into something else that provides a superset (well mostly) of
> the AER information, and allow us to do those additional things like
> isolating/de-isolating, reset control etc...
> 
> Here too, we'll need arch specific APIs through VFIO. Not necessarily a
> huge deal, I mention it for completeness.

We expect to do AER via the VFIO netlink interface, which, even though
it's bashed below, would be quite extensible to supporting different
kinds of errors.

>    * Misc
> 
> There's lots of small bits and pieces... in no special order:
> 
>  - netlink ? WTF ! Seriously, we don't need a hybrid API with a bit of
> netlink and a bit of ioctl's ... it's not like there's something
> fundamentally  better for netlink vs. ioctl... it really depends what
> you are doing, and in this case I fail to see what netlink brings you
> other than bloat and more stupid userspace library deps.

The netlink interface is primarily for host->guest signaling.  I've only
implemented the remove command (since we're lacking a pcie-host in qemu
to do AER), but it seems to work quite well.  If you have suggestions
for how else we might do it, please let me know.  This seems to be the
sort of thing netlink is supposed to be used for.

>  - I don't like too much the fact that VFIO provides yet another
> different API to do what we already have at least 2 kernel APIs for, ie,
> BAR mapping and config space access. At least it should be better at
> using the backend infrastructure of the 2 others (sysfs & procfs). I
> understand it wants to filter in some case (config space) and -maybe-
> yet another API is the right way to go but allow me to have my doubts.

The use of PCI sysfs is actually one of my complaints about current
device assignment.  To do assignment with an unprivileged guest we need
to open the PCI sysfs config file for it, then change ownership on a
handful of other PCI sysfs files, then there's this other pci-stub thing
to maintain ownership, but the kvm ioctls don't actually require it and
can grab onto any free device...  We are duplicating some of that in
VFIO, but we also put the ownership of the device behind a single device
file.  We do have the uiommu problem that we can't give an unprivileged
user ownership of that, but your usage model may actually make that
easier.  More below...

> One thing I thought about but you don't seem to like it ... was to use
> the need to represent the partitionable entity as groups in sysfs that I
> talked about earlier. Those could have per-device subdirs with the usual
> config & resource files, same semantic as the ones in the real device,
> but when accessed via the group they get filtering. I might or might not
> be practical in the end, tbd, but it would allow apps using a slightly
> modified libpci for example to exploit some of this.

I may be tainted by our disagreement that all the devices in a group
need to be exposed to the guest and qemu could just take a pointer to a
sysfs directory.  That seems very unlike qemu and pushes more of the
policy into qemu, which seems like the wrong direction.

>  - The qemu vfio code hooks directly into ioapic ... of course that
> won't fly with anything !x86

I spent a lot of time looking for an architecture neutral solution here,
but I don't think it exists.  Please prove me wrong.  The problem is
that we have to disable INTx on an assigned device after it fires (VFIO
does this automatically).  If we don't do this, a non-responsive or
malicious guest could sit on the interrupt, causing it to fire
repeatedly as a DoS on the host.  The only indication that we can rely
on to re-enable INTx is when the guest CPU writes an EOI to the APIC.
We can't just wait for device accesses because a) the device CSRs are
(hopefully) direct mapped and we'd have to slow map them or attempt to
do some kind of dirty logging to detect when they're accessed, and b) what
constitutes an interrupt service is device specific.

That means we need to figure out how PCI interrupt 'A' (or B...)
translates to a GSI (Global System Interrupt - ACPI definition, but
hopefully a generic concept).  That GSI identifies a pin on an IOAPIC,
which will also see the APIC EOI.  And just to spice things up, the
guest can change the PCI to GSI mappings via ACPI.  I think the set of
callbacks I've added are generic (maybe I left ioapic in the name), but
yes they do need to be implemented for other architectures.  Patches
appreciated from those with knowledge of the systems and/or access to
device specs.  This is the only reason that I make QEMU VFIO only build
for x86.

>  - The various "objects" dealt with here, -especially- interrupts and
> iommu, need a better in-kernel API so that fast in-kernel emulation can
> take over from qemu based emulation. The way we need to do some of this
> on POWER differs from x86. We can elaborate later, it's not necessarily
> a killer either but essentially we'll take the bulk of interrupt
> handling away from VFIO to the point where it won't see any of it at
> all.

The plan for x86 is to connect VFIO eventfds directly to KVM irqfds and
bypass QEMU.  This is exactly what VHOST does today, and it's fairly
trivial to enable for MSI once we get it merged.  INTx would require us to be
able to define a level triggered irqfd in KVM and it's not yet clear if
we care that much about INTx performance.
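
For reference, the KVM side of that wiring is pretty small.  A minimal
sketch (not the actual QEMU/VFIO code; how the eventfd gets handed to
VFIO so the host interrupt handler signals it is deliberately left out,
since that's the part still being designed):

#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Create an eventfd and attach it to the VM as an irqfd, so a signal on
 * the eventfd injects 'gsi' into the guest without bouncing through
 * userspace.  'vmfd' is an open KVM VM file descriptor. */
static int wire_irqfd(int vmfd, uint32_t gsi)
{
    int efd = eventfd(0, EFD_CLOEXEC | EFD_NONBLOCK);
    struct kvm_irqfd irqfd = { .fd = efd, .gsi = gsi, .flags = 0 };

    if (efd < 0)
        return -1;
    if (ioctl(vmfd, KVM_IRQFD, &irqfd) < 0)
        return -1;
    /* the same efd would then be registered with VFIO for the device's MSI */
    return efd;
}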

We don't currently have a plan for accelerating IOMMU access since our
current usage model doesn't need one.  We also need to consider MSI-X
table acceleration for x86.  I hope we'll be able to use the new KVM
ioctls for this.

>   - Non-PCI devices. That's a hot topic for embedded. I think the vast
> majority here is platform devices. There's quite a bit of vfio that
> isn't intrinsically PCI specific. We could have an in-kernel platform
> driver like we have an in-kernel PCI driver to attach to. The mapping of
> resources to userspace is rather generic, as goes for interrupts. I
> don't know whether that idea can be pushed much further, I don't have
> the bandwidth to look into it much at this point, but maybe it would be
> possible to refactor vfio a bit to better separate what is PCI specific
> to what is not. The idea would be to move the PCI specific bits to
> inside the "placeholder" PCI driver, and same goes for platform bits.
> "generic" ioctl's go to VFIO core, anything that doesn't handle, it
> passes them to the driver which allows the PCI one to handle things
> differently than the platform one, maybe an amba one while at it,
> etc.... just a thought, I haven't gone into the details at all.

This is on my radar, but I don't have a good model for it either.  I
suspect there won't be a whole lot left of VFIO if we make all the PCI
bits optional.  The right approach might be to figure out what's missing
between UIO and VFIO for non-PCI, implement that as a driver, then see
if we can base VFIO on using that for MMIO/PIO/INTx, leaving config and
MSI as a VFIO layer on top of the new UIO driver.

> I think that's all I had on my plate today, it's a long enough email
> anyway :-) Anthony suggested we put that on a wiki, I'm a bit
> wiki-disabled myself so he proposed to pickup my email and do that. We
> should probably discuss the various items in here separately as
> different threads to avoid too much confusion.
> 
> One other thing we should do on our side is publish somewhere our
> current hacks to get you an idea of where we are going and what we had
> to do (code speaks more than words). We'll try to do that asap, possibly
> next week.
> 
> Note that I'll be on/off the next few weeks, travelling and doing
> bringup. So expect latency in my replies.

Thanks for the write up, I think it will be good to let everyone digest
it before we discuss this at KVM forum.

Rather than your "groups" idea, I've been mulling over whether we can
just expose the dependencies, configuration, and capabilities in sysfs
and build qemu command lines to describe it.  For instance, if we simply
start with creating iommu nodes in sysfs, we could create links under
each iommu directory to the devices behind them.  Some kind of
capability file could define properties such as whether it's page-table
based or a fixed IOVA window, and the granularity at which it can map
the devices behind it.  Once we have that, we could probably make uiommu attach to
each of those nodes.
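
Purely as illustration (every name below is made up, nothing like it
exists in sysfs today), I'm picturing something along these lines:

/sys/class/iommu/iommu7/
    devices/0000:06:0d.0 -> ../../../devices/pci0000:00/...
    devices/0000:06:0d.1 -> ../../../devices/pci0000:00/...
    capabilities   # e.g. "model=fixed-window iova=0x80000000,0x40000000 page_size=4096"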

That means we know /dev/uiommu7 (random example) is our access to a
specific iommu with a given set of devices behind it.  If that iommu is
a PE (via those capability files), then a user space entity (trying hard
not to call it libvirt) can unbind all those devices from the host,
maybe bind the ones it wants to assign to a guest to vfio and bind the
others to pci-stub for safe keeping.  If you trust a user with
everything in a PE, bind all the devices to VFIO, chown all
the /dev/vfioX entries for those devices, and the /dev/uiommuX device.
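
The actual binding dance is just the generic driver-core sysfs
attributes.  Something like the sketch below (not libvirt code; the
"vfio" driver name, and whether it needs the device ID written to its
new_id file the way pci-stub does, are assumptions about the current
out-of-tree bits):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int sysfs_write(const char *path, const char *val)
{
    int fd = open(path, O_WRONLY);
    ssize_t n;

    if (fd < 0)
        return -1;
    n = write(fd, val, strlen(val));
    close(fd);
    return n == (ssize_t)strlen(val) ? 0 : -1;
}

/* Detach a device (bdf like "0000:06:0d.0") from its current host
 * driver and hand it to 'drv' ("vfio" or "pci-stub").  Drivers that
 * don't claim devices by default need "vendor device" written to their
 * new_id file before the bind will stick. */
static int rebind(const char *bdf, const char *drv)
{
    char path[128];

    snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/driver/unbind", bdf);
    sysfs_write(path, bdf);     /* may fail if nothing is bound; that's fine */

    snprintf(path, sizeof(path), "/sys/bus/pci/drivers/%s/bind", drv);
    return sysfs_write(path, bdf);
}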

We might then come up with qemu command lines to describe interesting
configurations, such as:

-device iommu,model=PAPR,uiommu=/dev/uiommu7,id=iommu.0 \
-device pci-bus,...,iommu=iommu.0,id=pci.0 \
-device vfio,host=ssss:bb:dd.f,bus=pci.0,addr=dd.f,id=hostdev0

The userspace entity would obviously need to put the devices belonging
to the same PE in the right place, but it doesn't seem to take a lot of sysfs info to
get that right.

Today we do DMA mapping via the VFIO device because the capabilities of
the IOMMU domains change depending on which devices are connected (for
VT-d, the least common denominator of the IOMMUs in play).  Forcing the
DMA mappings through VFIO naturally forces the call order.  If we moved
to something like above, we could switch the DMA mapping to the uiommu
device, since the IOMMU would have fixed capabilities.

What gaps would something like this leave for your IOMMU granularity
problems?  I'll need to think through how it works when we don't want to
expose the iommu to the guest, maybe a model=none (default) that doesn't
need to be connected to a pci bus and maps all guest memory.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 322+ messages in thread

address decoders between BARs and ROM.  This means we can't just leave
the enabled bit set in the ROM BAR, because it could actually disable an
address decoder for a regular BAR.  We could slow-map the actual ROM,
enabling it around each read, but shadowing it seemed far more
efficient.
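
For reference, "slow-mapping" would look roughly like the sketch below
(the fds, error handling and endian conversion are hand-waved, it's only
meant to show the per-read overhead we'd be adding):

#include <stdint.h>
#include <unistd.h>
#include <linux/pci_regs.h>

/* Toggle the ROM address decoder around every single read so that a
 * decoder shared with a regular BAR is never left stolen by the ROM. */
static ssize_t rom_slow_read(int cfg_fd, int rom_fd, void *buf,
                             size_t len, off_t off)
{
    uint32_t rom_bar;
    ssize_t ret;

    pread(cfg_fd, &rom_bar, 4, PCI_ROM_ADDRESS);
    rom_bar |= PCI_ROM_ADDRESS_ENABLE;             /* decoder on */
    pwrite(cfg_fd, &rom_bar, 4, PCI_ROM_ADDRESS);

    ret = pread(rom_fd, buf, len, off);            /* the actual access */

    rom_bar &= ~PCI_ROM_ADDRESS_ENABLE;            /* decoder off again */
    pwrite(cfg_fd, &rom_bar, 4, PCI_ROM_ADDRESS);
    return ret;
}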

>   * EEH
> 
> This is the name of those fancy error handling & isolation features I
> mentioned earlier. To some extent it's a superset of AER, but we don't
> generally expose AER to guests (or even the host), it's swallowed by
> firmware into something else that provides a superset (well mostly) of
> the AER information, and allow us to do those additional things like
> isolating/de-isolating, reset control etc...
> 
> Here too, we'll need arch specific APIs through VFIO. Not necessarily a
> huge deal, I mention it for completeness.

We expect to do AER via the VFIO netlink interface, which, even though
it's bashed below, would be quite extensible to supporting different
kinds of errors.

>    * Misc
> 
> There's lots of small bits and pieces... in no special order:
> 
>  - netlink ? WTF ! Seriously, we don't need a hybrid API with a bit of
> netlink and a bit of ioctl's ... it's not like there's something
> fundamentally  better for netlink vs. ioctl... it really depends what
> you are doing, and in this case I fail to see what netlink brings you
> other than bloat and more stupid userspace library deps.

The netlink interface is primarily for host->guest signaling.  I've only
implemented the remove command (since we're lacking a pcie-host in qemu
to do AER), but it seems to work quite well.  If you have suggestions
for how else we might do it, please let me know.  This seems to be the
sort of thing netlink is supposed to be used for.

>  - I don't like too much the fact that VFIO provides yet another
> different API to do what we already have at least 2 kernel APIs for, ie,
> BAR mapping and config space access. At least it should be better at
> using the backend infrastructure of the 2 others (sysfs & procfs). I
> understand it wants to filter in some case (config space) and -maybe-
> yet another API is the right way to go but allow me to have my doubts.

The use of PCI sysfs is actually one of my complaints about current
device assignment.  To do assignment with an unprivileged guest we need
to open the PCI sysfs config file for it, then change ownership on a
handful of other PCI sysfs files, then there's this other pci-stub thing
to maintain ownership, but the kvm ioctls don't actually require it and
can grab onto any free device...  We are duplicating some of that in
VFIO, but we also put the ownership of the device behind a single device
file.  We do have the uiommu problem that we can't give an unprivileged
user ownership of that, but your usage model may actually make that
easier.  More below...

> One thing I thought about but you don't seem to like it ... was to use
> the need to represent the partitionable entity as groups in sysfs that I
> talked about earlier. Those could have per-device subdirs with the usual
> config & resource files, same semantic as the ones in the real device,
> but when accessed via the group they get filtering. I might or might not
> be practical in the end, tbd, but it would allow apps using a slightly
> modified libpci for example to exploit some of this.

I may be tainted by our disagreement that all the devices in a group
need to be exposed to the guest and qemu could just take a pointer to a
sysfs directory.  That seems very unlike qemu and pushes more of the
policy into qemu, which seems like the wrong direction.

>  - The qemu vfio code hooks directly into ioapic ... of course that
> won't fly with anything !x86

I spent a lot of time looking for an architecture neutral solution here,
but I don't think it exists.  Please prove me wrong.  The problem is
that we have to disable INTx on an assigned device after it fires (VFIO
does this automatically).  If we don't do this, a non-responsive or
malicious guest could sit on the interrupt, causing it to fire
repeatedly as a DoS on the host.  The only indication that we can rely
on to re-enable INTx is when the guest CPU writes an EOI to the APIC.
We can't just wait for device accesses because a) the device CSRs are
(hopefully) direct mapped and we'd have to slow map them or attempt to
do some kind of dirty logging to detect when they're accessed, and b)
what constitutes an interrupt service is device specific.

That means we need to figure out how PCI interrupt 'A' (or B...)
translates to a GSI (Global System Interrupt - ACPI definition, but
hopefully a generic concept).  That GSI identifies a pin on an IOAPIC,
which will also see the APIC EOI.  And just to spice things up, the
guest can change the PCI to GSI mappings via ACPI.  I think the set of
callbacks I've added are generic (maybe I left ioapic in the name), but
yes they do need to be implemented for other architectures.  Patches
appreciated from those with knowledge of the systems and/or access to
device specs.  This is the only reason that I make QEMU VFIO only build
for x86.

>  - The various "objects" dealt with here, -especially- interrupts and
> iommu, need a better in-kernel API so that fast in-kernel emulation can
> take over from qemu based emulation. The way we need to do some of this
> on POWER differs from x86. We can elaborate later, it's not necessarily
> a killer either but essentially we'll take the bulk of interrupt
> handling away from VFIO to the point where it won't see any of it at
> all.

The plan for x86 is to connect VFIO eventfds directly to KVM irqfds and
bypass QEMU.  This is exactly what VHOST does today and fairly trivial
to enable for MSI once we get it merged.  INTx would require us to be
able to define a level triggered irqfd in KVM and it's not yet clear if
we care that much about INTx performance.
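
Roughly, the wiring looks something like this (the VFIO side and most
error handling are omitted, it's just to show the idea of taking QEMU
out of the interrupt path):

#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Create an eventfd and tell KVM to inject 'gsi' into the guest every
 * time it is signalled; the returned fd is then handed to the in-kernel
 * producer (VHOST today, VFIO MSI later). */
static int wire_irqfd(int vm_fd, int gsi)
{
    struct kvm_irqfd irqfd = { .gsi = gsi };
    int efd = eventfd(0, EFD_CLOEXEC);

    if (efd < 0)
        return -1;
    irqfd.fd = efd;
    if (ioctl(vm_fd, KVM_IRQFD, &irqfd) < 0)
        return -1;
    return efd;
}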

We don't currently have a plan for accelerating IOMMU access since our
current usage model doesn't need one.  We also need to consider MSI-X
table acceleration for x86.  I hope we'll be able to use the new KVM
ioctls for this.

>   - Non-PCI devices. That's a hot topic for embedded. I think the vast
> majority here is platform devices. There's quite a bit of vfio that
> isn't intrinsically PCI specific. We could have an in-kernel platform
> driver like we have an in-kernel PCI driver to attach to. The mapping of
> resources to userspace is rather generic, as goes for interrupts. I
> don't know whether that idea can be pushed much further, I don't have
> the bandwidth to look into it much at this point, but maybe it would be
> possible to refactor vfio a bit to better separate what is PCI specific
> to what is not. The idea would be to move the PCI specific bits to
> inside the "placeholder" PCI driver, and same goes for platform bits.
> "generic" ioctl's go to VFIO core, anything that doesn't handle, it
> passes them to the driver which allows the PCI one to handle things
> differently than the platform one, maybe an amba one while at it,
> etc.... just a thought, I haven't gone into the details at all.

This is on my radar, but I don't have a good model for it either.  I
suspect there won't be a whole lot left of VFIO if we make all the PCI
bits optional.  The right approach might be to figure out what's missing
between UIO and VFIO for non-PCI, implement that as a driver, then see
if we can base VFIO on using that for MMIO/PIO/INTx, leaving config and
MSI as a VFIO layer on top of the new UIO driver.

> I think that's all I had on my plate today, it's a long enough email
> anyway :-) Anthony suggested we put that on a wiki, I'm a bit
> wiki-disabled myself so he proposed to pickup my email and do that. We
> should probably discuss the various items in here separately as
> different threads to avoid too much confusion.
> 
> One other thing we should do on our side is publish somewhere our
> current hacks to get you an idea of where we are going and what we had
> to do (code speaks more than words). We'll try to do that asap, possibly
> next week.
> 
> Note that I'll be on/off the next few weeks, travelling and doing
> bringup. So expect latency in my replies.

Thanks for the write up, I think it will be good to let everyone digest
it before we discuss this at KVM forum.

Rather than your "groups" idea, I've been mulling over whether we can
just expose the dependencies, configuration, and capabilities in sysfs
and build qemu commandlines to describe it.  For instance, if we simply
start with creating iommu nodes in sysfs, we could create links under
each iommu directory to the devices behind them.  Some kind of
capability file could define properties like whether it's page table
based or fixed iova window or the granularity of mapping the devices
behind it.  Once we have that, we could probably make uiommu attach to
each of those nodes.
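
To make that a bit more concrete, I'm imagining a layout along these
lines (paths and file names invented, just for illustration):

/sys/class/iommu/iommu7/
    capabilities          e.g. "pagetable pagesizes=4k,2M"
                          or  "fixed-window iova=0x80000000 size=1G pagesize=4k"
    devices/
        0000:06:00.0 -> ../../../../devices/pci0000:00/0000:00:1c.0/0000:06:00.0
        0000:06:00.1 -> ../../../../devices/pci0000:00/0000:00:1c.0/0000:06:00.1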

That means we know /dev/uiommu7 (random example) is our access to a
specific iommu with a given set of devices behind it.  If that iommu is
a PE (via those capability files), then a user space entity (trying hard
not to call it libvirt) can unbind all those devices from the host,
maybe bind the ones it wants to assign to a guest to vfio and bind the
others to pci-stub for safe keeping.  If you trust a user with
everything in a PE, bind all the devices to VFIO, chown all
the /dev/vfioX entries for those devices, and the /dev/uiommuX device.

We might then come up with qemu command lines to describe interesting
configurations, such as:

-device iommu,model=PAPR,uiommu=/dev/uiommu7,id=iommu.0 \
-device pci-bus,...,iommu=iommu.0,id=pci.0 \
-device vfio,host=ssss:bb:dd.f,bus=pci.0,addr=dd.f,id=hostdev0

The userspace entity would obviously need to put things in the same PE
in the right place, but it doesn't seem to take a lot of sysfs info to
get that right.

Today we do DMA mapping via the VFIO device because the capabilities of
the IOMMU domains change depending on which devices are connected (for
VT-d, the least common denominator of the IOMMUs in play).  Forcing the
DMA mappings through VFIO naturally forces the call order.  If we moved
to something like above, we could switch the DMA mapping to the uiommu
device, since the IOMMU would have fixed capabilities.
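
i.e. DMA setup would become something like the sketch below (the
structure and ioctl number are placeholders I'm making up for the
example, not an existing API):

#include <stdint.h>
#include <sys/ioctl.h>

struct uiommu_dma_map {                 /* hypothetical */
    uint64_t vaddr;                     /* process virtual address */
    uint64_t iova;                      /* device-visible DMA address */
    uint64_t size;
    uint32_t flags;                     /* e.g. read/write permission */
};
#define UIOMMU_MAP_DMA  _IOW(';', 0x10, struct uiommu_dma_map)  /* made up */

/* Map a chunk of guest RAM through the uiommu fd itself, instead of
 * going through one of the vfio device fds attached to it. */
static int uiommu_map(int uiommu_fd, void *ram, uint64_t iova, uint64_t size)
{
    struct uiommu_dma_map map = {
        .vaddr = (uintptr_t)ram,
        .iova  = iova,
        .size  = size,
    };

    return ioctl(uiommu_fd, UIOMMU_MAP_DMA, &map);
}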

What gaps would something like this leave for your IOMMU granularity
problems?  I'll need to think through how it works when we don't want to
expose the iommu to the guest, maybe a model=none (default) that doesn't
need to be connected to a pci bus and maps all guest memory.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-07-29 23:58 ` Benjamin Herrenschmidt
@ 2011-07-30 22:21   ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 322+ messages in thread
From: Benjamin Herrenschmidt @ 2011-07-30 22:21 UTC (permalink / raw)
  To: kvm
  Cc: Anthony Liguori, Alex Williamson, David Gibson, Paul Mackerras,
	Alexey Kardashevskiy, linux-pci, linuxppc-dev

On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote:
> Hi folks !
> 
> So I promised Anthony I would try to summarize some of the comments &
> issues we have vs. VFIO after we've tried to use it for PCI pass-through
> on POWER. It's pretty long, there are various items with more or less
> impact, some of it is easily fixable, some are API issues, and we'll
> probably want to discuss them separately, but for now here's a brain
> dump.
> 
> David, Alexei, please make sure I haven't missed anything :-)

And I think I have :-)

  * Config space

VFIO currently handles that as a byte stream. It's quite gross to be
honest and it's not right. You shouldn't lose access size information
between guest and host when performing real accesses.

Some config space registers can have side effects and not respecting
access sizes can be nasty.
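
To illustrate with a made-up interface: the host side should issue one
access of the size the guest used, rather than decomposing it into
bytes, because something like a 16-bit read-to-clear status register
behaves differently when hit by two 8-bit reads:

#include <stddef.h>
#include <unistd.h>

/* One guest config access == one host access of the same size (1, 2 or
 * 4 bytes), assuming the fd below preserves sizes instead of treating
 * the space as a byte stream. */
static int cfg_read(int cfg_fd, off_t reg, void *val, size_t size)
{
    return pread(cfg_fd, val, size, reg) == (ssize_t)size ? 0 : -1;
}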

Cheers,
Ben.

> * Granularity of pass-through
> 
> So let's first start with what is probably the main issue and the most
> contentious, which is the problem of dealing with the various
> constraints which define the granularity of pass-through, along with
> exploiting features like the VTd iommu domains.
> 
> For the sake of clarity, let me first talk a bit about the "granularity"
> issue I've mentioned above.
> 
> There are various constraints that can/will force several devices to be
> "owned" by the same guest and on the same side of the host/guest
> boundary. This is generally because some kind of HW resource is shared
> and thus not doing so would break the isolation barrier and enable a
> guest to disrupt the operations of the host and/or another guest.
> 
> Some of those constraints are well know, such as shared interrupts. Some
> are more subtle, for example, if a PCIe->PCI bridge exist in the system,
> there is no way for the iommu to identify transactions from devices
> coming from the PCI segment of that bridge with a granularity other than
> "behind the bridge". So typically a EHCI/OHCI/OHCI combo (a classic)
> behind such a bridge must be treated as a single "entity" for
> pass-trough purposes.
> 
> In IBM POWER land, we call this a "partitionable endpoint" (the term
> "endpoint" here is historic, such a PE can be made of several PCIe
> "endpoints"). I think "partitionable" is a pretty good name tho to
> represent the constraints, so I'll call this a "partitionable group"
> from now on. 
> 
> Other examples of such HW imposed constraints can be a shared iommu with
> no filtering capability (some older POWER hardware which we might want
> to support fall into that category, each PCI host bridge is its own
> domain but doesn't have a finer granularity... however those machines
> tend to have a lot of host bridges :)
> 
> If we are ever going to consider applying some of this to non-PCI
> devices (see the ongoing discussions here), then we will be faced with
> the crazyness of embedded designers which probably means all sort of new
> constraints we can't even begin to think about
> 
> This leads me to those initial conclusions:
> 
> - The -minimum- granularity of pass-through is not always a single
> device and not always under SW control
> 
> - Having a magic heuristic in libvirt to figure out those constraints is
> WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
> knowledge of PCI resource management and getting it wrong in many many
> cases, something that took years to fix essentially by ripping it all
> out. This is kernel knowledge and thus we need the kernel to expose in a
> way or another what those constraints are, what those "partitionable
> groups" are.
> 
> - That does -not- mean that we cannot specify for each individual device
> within such a group where we want to put it in qemu (what devfn etc...).
> As long as there is a clear understanding that the "ownership" of the
> device goes with the group, this is somewhat orthogonal to how they are
> represented in qemu. (Not completely... if the iommu is exposed to the
> guest ,via paravirt for example, some of these constraints must be
> exposed but I'll talk about that more later).
> 
> The interface currently proposed for VFIO (and associated uiommu)
> doesn't handle that problem at all. Instead, it is entirely centered
> around a specific "feature" of the VTd iommu's for creating arbitrary
> domains with arbitrary devices (tho those devices -do- have the same
> constraints exposed above, don't try to put 2 legacy PCI devices behind
> the same bridge into 2 different domains !), but the API totally ignores
> the problem, leaves it to libvirt "magic foo" and focuses on something
> that is both quite secondary in the grand scheme of things, and quite
> x86 VTd specific in the implementation and API definition.
> 
> Now, I'm not saying these programmable iommu domains aren't a nice
> feature and that we shouldn't exploit them when available, but as it is,
> it is too much a central part of the API.
> 
> I'll talk a little bit more about recent POWER iommu's here to
> illustrate where I'm coming from with my idea of groups:
> 
> On p7ioc (the IO chip used on recent P7 machines), there -is- a concept
> of domain and a per-RID filtering. However it differs from VTd in a few
> ways:
> 
> The "domains" (aka PEs) encompass more than just an iommu filtering
> scheme. The MMIO space and PIO space are also segmented, and those
> segments assigned to domains. Interrupts (well, MSI ports at least) are
> assigned to domains. Inbound PCIe error messages are targeted to
> domains, etc...
> 
> Basically, the PEs provide a very strong isolation feature which
> includes errors, and has the ability to immediately "isolate" a PE on
> the first occurence of an error. For example, if an inbound PCIe error
> is signaled by a device on a PE or such a device does a DMA to a
> non-authorized address, the whole PE gets into error state. All
> subsequent stores (both DMA and MMIO) are swallowed and reads return all
> 1's, interrupts are blocked. This is designed to prevent any propagation
> of bad data, which is a very important feature in large high reliability
> systems.
> 
> Software then has the ability to selectively turn back on MMIO and/or
> DMA, perform diagnostics, reset devices etc...
> 
> Because the domains encompass more than just DMA, but also segment the
> MMIO space, it is not practical at all to dynamically reconfigure them
> at runtime to "move" devices into domains. The firmware or early kernel
> code (it depends) will assign devices BARs using an algorithm that keeps
> them within PE segment boundaries, etc....
> 
> Additionally (and this is indeed a "restriction" compared to VTd, though
> I expect our future IO chips to lift it to some extent), PE don't get
> separate DMA address spaces. There is one 64-bit DMA address space per
> PCI host bridge, and it is 'segmented' with each segment being assigned
> to a PE. Due to the way PE assignment works in hardware, it is not
> practical to make several devices share a segment unless they are on the
> same bus. Also the resulting limit in the amount of 32-bit DMA space a
> device can access means that it's impractical to put too many devices in
> a PE anyways. (This is clearly designed for paravirt iommu, I'll talk
> more about that later).
> 
> The above essentially extends the granularity requirement (or rather is
> another factor defining what the granularity of partitionable entities
> is). You can think of it as "pre-existing" domains.
> 
> I believe the way to solve that is to introduce a kernel interface to
> expose those "partitionable entities" to userspace. In addition, it
> occurs to me that the ability to manipulate VTd domains essentially
> boils down to manipulating those groups (creating larger ones with
> individual components).
> 
> I like the idea of defining / playing with those groups statically
> (using a command line tool or sysfs, possibly having a config file
> defining them in a persistent way) rather than having their lifetime
> tied to a uiommu file descriptor.
> 
> It also makes it a LOT easier to have a channel to manipulate
> platform/arch specific attributes of those domains if any.
> 
> So we could define an API or representation in sysfs that exposes what
> the partitionable entities are, and we may add to it an API to
> manipulate them. But we don't have to and I'm happy to keep the
> additional SW grouping you can do on VTd as a sepparate "add-on" API
> (tho I don't like at all the way it works with uiommu). However, qemu
> needs to know what the grouping is regardless of the domains, and it's
> not nice if it has to manipulate two different concepts here so
> eventually those "partitionable entities" from a qemu standpoint must
> look like domains.
> 
> My main point is that I don't want the "knowledge" here to be in libvirt
> or qemu. In fact, I want to be able to do something as simple as passing
> a reference to a PE to qemu (sysfs path ?) and have it just pickup all
> the devices in there and expose them to the guest.
> 
> This can be done in a way that isn't PCI specific as well (the
> definition of the groups and what is grouped would would obviously be
> somewhat bus specific and handled by platform code in the kernel).
> 
> Maybe something like /sys/devgroups ? This probably warrants involving
> more kernel people into the discussion.
> 
> * IOMMU
> 
> Now more on iommu. I've described I think in enough details how ours
> work, there are others, I don't know what freescale or ARM are doing,
> sparc doesn't quite work like VTd either, etc...
> 
> The main problem isn't that much the mechanics of the iommu but really
> how it's exposed (or not) to guests.
> 
> VFIO here is basically designed for one and only one thing: expose the
> entire guest physical address space to the device more/less 1:1.
> 
> This means:
> 
>   - It only works with iommu's that provide complete DMA address spaces
> to devices. Won't work with a single 'segmented' address space like we
> have on POWER.
> 
>   - It requires the guest to be pinned. Pass-through -> no more swap
> 
>   - The guest cannot make use of the iommu to deal with 32-bit DMA
> devices, thus a guest with more than a few G of RAM (I don't know the
> exact limit on x86, depends on your IO hole I suppose), and you end up
> back to swiotlb & bounce buffering.
> 
>   - It doesn't work for POWER server anyways because of our need to
> provide a paravirt iommu interface to the guest since that's how pHyp
> works today and how existing OSes expect to operate.
> 
> Now some of this can be fixed with tweaks, and we've started doing it
> (we have a working pass-through using VFIO, forgot to mention that, it's
> just that we don't like what we had to do to get there).
> 
> Basically, what we do today is:
> 
> - We add an ioctl to VFIO to expose to qemu the segment information. IE.
> What is the DMA address and size of the DMA "window" usable for a given
> device. This is a tweak, that should really be handled at the "domain"
> level.
> 
> That current hack won't work well if two devices share an iommu. Note
> that we have an additional constraint here due to our paravirt
> interfaces (specificed in PAPR) which is that PE domains must have a
> common parent. Basically, pHyp makes them look like a PCIe host bridge
> per domain in the guest. I think that's a pretty good idea and qemu
> might want to do the same.
> 
> - We hack out the currently unconditional mapping of the entire guest
> space in the iommu. Something will have to be done to "decide" whether
> to do that or not ... qemu argument -> ioctl ?
> 
> - We hook up the paravirt call to insert/remove a translation from the
> iommu to the VFIO map/unmap ioctl's.
> 
> This limps along but it's not great. Some of the problems are:
> 
> - I've already mentioned, the domain problem again :-) 
> 
> - Performance sucks of course, the vfio map ioctl wasn't mean for that
> and has quite a bit of overhead. However we'll want to do the paravirt
> call directly in the kernel eventually ...
> 
>   - ... which isn't trivial to get back to our underlying arch specific
> iommu object from there. We'll probably need a set of arch specific
> "sideband" ioctl's to "register" our paravirt iommu "bus numbers" and
> link them to the real thing kernel-side.
> 
> - PAPR (the specification of our paravirt interface and the expectation
> of current OSes) wants iommu pages to be 4k by default, regardless of
> the kernel host page size, which makes things a bit tricky since our
> enterprise host kernels have a 64k base page size. Additionally, we have
> new PAPR interfaces that we want to exploit, to allow the guest to
> create secondary iommu segments (in 64-bit space), which can be used
> (under guest control) to do things like map the entire guest (here it
> is :-) or use larger iommu page sizes (if permitted by the host kernel,
> in our case we could allow 64k iommu page size with a 64k host kernel).
> 
> The above means we need arch specific APIs. So arch specific vfio
> ioctl's, either that or kvm ones going to vfio or something ... the
> current structure of vfio/kvm interaction doesn't make it easy.
> 
> * IO space
> 
> On most (if not all) non-x86 archs, each PCI host bridge provide a
> completely separate PCI address space. Qemu doesn't deal with that very
> well. For MMIO it can be handled since those PCI address spaces are
> "remapped" holes in the main CPU address space so devices can be
> registered by using BAR + offset of that window in qemu MMIO mapping.
> 
> For PIO things get nasty. We have totally separate PIO spaces and qemu
> doesn't seem to like that. We can try to play the offset trick as well,
> we haven't tried yet, but basically that's another one to fix. Not a
> huge deal I suppose but heh ...
> 
> Also our next generation chipset may drop support for PIO completely.
> 
> On the other hand, because PIO is just a special range of MMIO for us,
> we can do normal pass-through on it and don't need any of the emulation
> done qemu.
> 
>   * MMIO constraints
> 
> The QEMU side VFIO code hard wires various constraints that are entirely
> based on various requirements you decided you have on x86 but don't
> necessarily apply to us :-)
> 
> Due to our paravirt nature, we don't need to masquerade the MSI-X table
> for example. At all. If the guest configures crap into it, too bad, it
> can only shoot itself in the foot since the host bridge enforce
> validation anyways as I explained earlier. Because it's all paravirt, we
> don't need to "translate" the interrupt vectors & addresses, the guest
> will call hyercalls to configure things anyways.
> 
> We don't need to prevent MMIO pass-through for small BARs at all. This
> should be some kind of capability or flag passed by the arch. Our
> segmentation of the MMIO domain means that we can give entire segments
> to the guest and let it access anything in there (those segments are a
> multiple of the page size always). Worst case it will access outside of
> a device BAR within a segment and will cause the PE to go into error
> state, shooting itself in the foot, there is no risk of side effect
> outside of the guest boundaries.
> 
> In fact, we don't even need to emulate BAR sizing etc... in theory. Our
> paravirt guests expect the BARs to have been already allocated for them
> by the firmware and will pick up the addresses from the device-tree :-)
> 
> Today we use a "hack", putting all 0's in there and triggering the linux
> code path to reassign unassigned resources (which will use BAR
> emulation) but that's not what we are -supposed- to do. Not a big deal
> and having the emulation there won't -hurt- us, it's just that we don't
> really need any of it.
> 
> We have a small issue with ROMs. Our current KVM only works with huge
> pages for guest memory but that is being fixed. So the way qemu maps the
> ROM copy into the guest address space doesn't work. It might be handy
> anyways to have a way for qemu to use MMIO emulation for ROM access as a
> fallback. I'll look into it.
> 
>   * EEH
> 
> This is the name of those fancy error handling & isolation features I
> mentioned earlier. To some extent it's a superset of AER, but we don't
> generally expose AER to guests (or even the host), it's swallowed by
> firmware into something else that provides a superset (well mostly) of
> the AER information, and allow us to do those additional things like
> isolating/de-isolating, reset control etc...
> 
> Here too, we'll need arch specific APIs through VFIO. Not necessarily a
> huge deal, I mention it for completeness.
> 
>    * Misc
> 
> There's lots of small bits and pieces... in no special order:
> 
>  - netlink ? WTF ! Seriously, we don't need a hybrid API with a bit of
> netlink and a bit of ioctl's ... it's not like there's something
> fundamentally  better for netlink vs. ioctl... it really depends what
> you are doing, and in this case I fail to see what netlink brings you
> other than bloat and more stupid userspace library deps.
> 
>  - I don't like too much the fact that VFIO provides yet another
> different API to do what we already have at least 2 kernel APIs for, ie,
> BAR mapping and config space access. At least it should be better at
> using the backend infrastructure of the 2 others (sysfs & procfs). I
> understand it wants to filter in some case (config space) and -maybe-
> yet another API is the right way to go but allow me to have my doubts.
> 
> One thing I thought about but you don't seem to like it ... was to use
> the need to represent the partitionable entity as groups in sysfs that I
> talked about earlier. Those could have per-device subdirs with the usual
> config & resource files, same semantic as the ones in the real device,
> but when accessed via the group they get filtering. I might or might not
> be practical in the end, tbd, but it would allow apps using a slightly
> modified libpci for example to exploit some of this.
> 
>  - The qemu vfio code hooks directly into ioapic ... of course that
> won't fly with anything !x86
> 
>  - The various "objects" dealt with here, -especially- interrupts and
> iommu, need a better in-kernel API so that fast in-kernel emulation can
> take over from qemu based emulation. The way we need to do some of this
> on POWER differs from x86. We can elaborate later, it's not necessarily
> a killer either but essentially we'll take the bulk of interrupt
> handling away from VFIO to the point where it won't see any of it at
> all.
> 
>   - Non-PCI devices. That's a hot topic for embedded. I think the vast
> majority here is platform devices. There's quite a bit of vfio that
> isn't intrinsically PCI specific. We could have an in-kernel platform
> driver like we have an in-kernel PCI driver to attach to. The mapping of
> resources to userspace is rather generic, as goes for interrupts. I
> don't know whether that idea can be pushed much further, I don't have
> the bandwidth to look into it much at this point, but maybe it would be
> possible to refactor vfio a bit to better separate what is PCI specific
> to what is not. The idea would be to move the PCI specific bits to
> inside the "placeholder" PCI driver, and same goes for platform bits.
> "generic" ioctl's go to VFIO core, anything that doesn't handle, it
> passes them to the driver which allows the PCI one to handle things
> differently than the platform one, maybe an amba one while at it,
> etc.... just a thought, I haven't gone into the details at all.
> 
> I think that's all I had on my plate today, it's a long enough email
> anyway :-) Anthony suggested we put that on a wiki, I'm a bit
> wiki-disabled myself so he proposed to pickup my email and do that. We
> should probably discuss the various items in here separately as
> different threads to avoid too much confusion.
> 
> One other thing we should do on our side is publish somewhere our
> current hacks to get you an idea of where we are going and what we had
> to do (code speaks more than words). We'll try to do that asap, possibly
> next week.
> 
> Note that I'll be on/off the next few weeks, travelling and doing
> bringup. So expect latency in my replies.
> 
> Cheers,
> Ben.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-07-30 18:20   ` Alex Williamson
  (?)
@ 2011-07-30 23:54     ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 322+ messages in thread
From: Benjamin Herrenschmidt @ 2011-07-30 23:54 UTC (permalink / raw)
  To: Alex Williamson
  Cc: chrisw, Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	qemu-devel, David Gibson, aafabbri, iommu, linuxppc-dev, benve

On Sat, 2011-07-30 at 12:20 -0600, Alex Williamson wrote:

> On x86, the USB controllers don't typically live behind a PCIe-to-PCI
> bridge, so don't suffer the source identifier problem, but they do often
> share an interrupt.  But even then, we can count on most modern devices
> supporting PCI2.3, and thus the DisINTx feature, which allows us to
> share interrupts.  In any case, yes, it's more rare but we need to know
> how to handle devices behind PCI bridges.  However I disagree that we
> need to assign all the devices behind such a bridge to the guest.

Well, ok so let's dig a bit more here :-) First, yes I agree they don't
all need to appear to the guest. My point is really that we must prevent
them from being "used" by somebody else, either the host or another guest.

Now once you get there, I personally prefer having a clear "group"
ownership rather than having devices stay in some "limbo" under vfio
control but it's an implementation detail.

Regarding DisINTx, well, it's a bit like putting separate PCIe functions
into separate guests, it looks good ... but you are taking a chance.
Note that I do intend to do some of that for power ... well I think, I
haven't completely made up my mind.

pHyp has a stricter requirement: PEs essentially are everything
behind a bridge. If you have a slot, you have some kind of bridge above
this slot and everything on it will be a PE.

The problem I see is that with your filtering of config space, BAR
emulation, DisINTx etc... you essentially assume that you can reasonably
reliably isolate devices. But in practice, it's chancy. Some devices for
example have "backdoors" into their own config space via MMIO. If I have
such a device in a guest, I can completely override your DisINTx and
thus DOS your host or another guest with a shared interrupt. I can move
my MMIO around and DOS another function by overlapping the addresses.

You can really only protect yourself against a device if you have it
behind a bridge (in addition to having a filtering iommu), which limits
the MMIO span (and thus letting the guest whack the BARs randomly will
only allow that guest to shoot itself in the foot).

Some bridges also provide a way to block INTx below them which comes in
handy but it's bridge specific. Some devices can be coerced to send the
INTx "assert" message and never de-assert it (for example by doing a
soft-reset while it's asserted, which can be done with some devices with
an MMIO).

Anything below a PCIe -> PCI/PCI-X bridge also needs to be "grouped" due
to the simple lack of proper filtering by the iommu (PCI-X in theory has
RIDs and forwards them up, but this isn't very reliable; for example it
falls over with split transactions).

Fortunately in PCIe land, we mostly have bridges above everything. The
problem somewhat remains with functions of a device, how can you be sure
that there isn't a way via some MMIO to create side effects on the other
functions of the device ? (For example by checkstopping the whole
thing). You can't really :-)

So it boils down to the "level" of safety/isolation you want to provide,
and I suppose to some extent it's a user decision but the user needs to
be informed to some extent. A hard problem :-)
 
> There's a difference between removing the device from the host and
> exposing the device to the guest.  If I have a NIC and HBA behind a
> bridge, it's perfectly reasonable that I might only assign the NIC to
> the guest, but as you describe, we then need to prevent the host, or any
> other guest from making use of the HBA.

Yes. However the other device is in "limbo" and it may not be clear to
the user why it can't be used anymore :-)

The question is more, the user needs to "know" (or libvirt does, or
somebody ... ) that in order to pass-through device A, it must also
"remove" device B from the host. How can you even provide a meaningful
error message to the user if all VFIO does is give you something like
-EBUSY ?

So the information about the grouping constraint must trickle down
somewhat.

Look at it from a GUI perspective for example. Imagine a front-end
showing you devices in your system and allowing you to "Drag & drop"
them to your guest. How do you represent that need for grouping ? First
how do you expose it from kernel/libvirt to the GUI tool and how do you
represent it to the user ?

By grouping the devices in logical groups which end up being the
"objects" you can drag around, at least you provide some amount of
clarity. Now if you follow that path down to how the GUI app, libvirt
and possibly qemu need to know / resolve the dependency, being given the
"groups" as the primary information of what can be used for pass-through
makes everything a lot simpler.
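
To make the error-message point concrete, here's a minimal userspace
sketch; the /sys/devgroups/<N>/devices/ layout, the group number and
the device address are all hypothetical, it just shows how a tool could
tell the user exactly which sibling devices go along for the ride:

/* Sketch only: the /sys/devgroups layout, the group number and the
 * device address are all made up. */
#include <dirent.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	const char *want = "0000:01:00.0";	/* device the user asked for */
	struct dirent *d;
	DIR *dir = opendir("/sys/devgroups/7/devices");

	if (!dir)
		return 1;
	printf("Assigning %s also requires detaching from the host:\n", want);
	while ((d = readdir(dir)) != NULL) {
		if (d->d_name[0] == '.' || !strcmp(d->d_name, want))
			continue;
		printf("  %s\n", d->d_name);
	}
	closedir(dir);
	return 0;
}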
 
> > - The -minimum- granularity of pass-through is not always a single
> > device and not always under SW control
> 
> But IMHO, we need to preserve the granularity of exposing a device to a
> guest as a single device.  That might mean some devices are held hostage
> by an agent on the host.

Maybe, but wouldn't that be even more confusing from a user perspective ?
And I think it also makes things harder from the perspective of
implementing admin & management tools.

> > - Having a magic heuristic in libvirt to figure out those constraints is
> > WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
> > knowledge of PCI resource management and getting it wrong in many many
> > cases, something that took years to fix essentially by ripping it all
> > out. This is kernel knowledge and thus we need the kernel to expose in a
> > way or another what those constraints are, what those "partitionable
> > groups" are.
> > 
> > - That does -not- mean that we cannot specify for each individual device
> > within such a group where we want to put it in qemu (what devfn etc...).
> > As long as there is a clear understanding that the "ownership" of the
> > device goes with the group, this is somewhat orthogonal to how they are
> > represented in qemu. (Not completely... if the iommu is exposed to the
> > guest ,via paravirt for example, some of these constraints must be
> > exposed but I'll talk about that more later).
> 
> Or we can choose not to expose all of the devices in the group to the
> guest?

As I said, I don't mind if you don't, I'm just worried about the
consequences of that from a usability standpoint. Having advanced
command line options to fine tune is fine. Being able to specify within a
"group" which devices to show and at what address is fine.

But I believe the basic entity to be manipulated from an interface
standpoint remains the group.

To get back to my GUI example, once you've dragged & dropped your group
of devices
over, you can have the option to open that group and check/uncheck
individual devices & assign them addresses if you want. That doesn't
change the fact that practically speaking, the whole group is now owned
by the guest.

I will go further than that actually. If you look at how the isolation
HW works on POWER, the fact that I have the MMIO segmentation means that
I can simply give the entire group's MMIO space to the guest. No problem
of small BARs, no need to slow-map them ... etc.. that's a pretty handy
feature don't you think ?

But that means that those other devices -will- be there, mapped along
with the one you care about. We may not expose them in config space but
they will be accessible. I suppose we can keep their IO/MEM decoding
disabled. But my point is that for all intents and purposes, they are
actually owned by the guest.
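
For illustration, keeping the "hostage" members of a group inert on the
host side can be as simple as turning off their decoders; a minimal
kernel-side sketch (how we track group membership and get hold of the
pci_dev is assumed, only the config space side is shown):

#include <linux/pci.h>

/* Turn off IO/MEM decoding and bus mastering for a device that is part
 * of an assigned group but not exposed to the guest. */
static void group_quiesce_device(struct pci_dev *pdev)
{
	u16 cmd;

	pci_read_config_word(pdev, PCI_COMMAND, &cmd);
	cmd &= ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY | PCI_COMMAND_MASTER);
	pci_write_config_word(pdev, PCI_COMMAND, cmd);
}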

> > The interface currently proposed for VFIO (and associated uiommu)
> > doesn't handle that problem at all. Instead, it is entirely centered
> > around a specific "feature" of the VTd iommu's for creating arbitrary
> > domains with arbitrary devices (tho those devices -do- have the same
> > constraints exposed above, don't try to put 2 legacy PCI devices behind
> > the same bridge into 2 different domains !), but the API totally ignores
> > the problem, leaves it to libvirt "magic foo" and focuses on something
> > that is both quite secondary in the grand scheme of things, and quite
> > x86 VTd specific in the implementation and API definition.
> 
> To be fair, libvirt's "magic foo" is built out of the necessity that
> nobody else is defining the rules.

Sure, which is why I propose that the kernel exposes the rules since
it's really the one right place to have that sort of HW constraint
knowledge, especially since it can be at least partially platform
specific.
 
 .../...

> > Maybe something like /sys/devgroups ? This probably warrants involving
> > more kernel people into the discussion.
> 
> I don't yet buy into passing groups to qemu since I don't buy into the
> idea of always exposing all of those devices to qemu.  Would it be
> sufficient to expose iommu nodes in sysfs that link to the devices
> behind them and describe properties and capabilities of the iommu
> itself?  More on this at the end.

Well, iommus aren't the only factor. I mentioned shared interrupts (and
my unwillingness to always trust DisINTx), there's also the MMIO
grouping I mentioned above (in which case it's an x86 -limitation- with
small BARs that I don't want to inherit, especially since it's based on
PAGE_SIZE and we commonly have 64K page size on POWER), etc...

So I'm not too much of a fan of making it entirely look like the iommu
is the primary factor, but we -can-, that would be workable. I still
prefer calling a cat a cat and exposing the grouping for what it is, as
I think I've explained already above, tho.
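
As a rough sketch of what exposing the grouping itself could look like
(following the /sys/devgroups idea quoted above), the kernel side might
be little more than a kobject per group plus symlinks to the member
devices. None of these names exist today, and how groups are discovered
is obviously platform code:

#include <linux/kobject.h>
#include <linux/pci.h>

static struct kobject *devgroups_kobj;

/* Hypothetical helper: create /sys/devgroups/<nr>/ with one symlink
 * per member device. */
static int devgroup_expose(int group_nr, struct pci_dev **devs, int ndevs)
{
	struct kobject *grp;
	char name[16];
	int i, ret;

	if (!devgroups_kobj)
		devgroups_kobj = kobject_create_and_add("devgroups", NULL);
	if (!devgroups_kobj)
		return -ENOMEM;

	snprintf(name, sizeof(name), "%d", group_nr);
	grp = kobject_create_and_add(name, devgroups_kobj);
	if (!grp)
		return -ENOMEM;

	for (i = 0; i < ndevs; i++) {
		ret = sysfs_create_link(grp, &devs[i]->dev.kobj,
					dev_name(&devs[i]->dev));
		if (ret)
			return ret;
	}
	return 0;
}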

 .../...

> > Now some of this can be fixed with tweaks, and we've started doing it
> > (we have a working pass-through using VFIO, forgot to mention that, it's
> > just that we don't like what we had to do to get there).
> 
> This is a result of wanting to support *unmodified* x86 guests.  We
> don't have the luxury of having a predefined pvDMA spec that all x86
> OSes adhere to. 

No, but you could emulate a HW iommu, no ?

>  The 32bit problem is unfortunate, but the priority use
> case for assigning devices to guests is high performance I/O, which
> usually entails modern, 64bit hardware.  I'd like to see us get to the
> point of having emulated IOMMU hardware on x86, which could then be
> backed by VFIO, but for now guest pinning is the most practical and
> useful.

For your current case maybe. It's just not very future proof imho.
Anyways, it's fixable, but the APIs as they are make it a bit clumsy.

 .../...

> > Also our next generation chipset may drop support for PIO completely.
> > 
> > On the other hand, because PIO is just a special range of MMIO for us,
> > we can do normal pass-through on it and don't need any of the emulation
> done by qemu.
> 
> Maybe we can add mmap support to PIO regions on non-x86.

We have to yes. I haven't looked into it yet, it should be easy if VFIO
kernel side starts using the "proper" PCI mmap interfaces in kernel (the
same interfaces sysfs & proc use).
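
For reference, the sysfs interface in question already lets userspace
mmap BARs directly today (and legacy I/O on platforms that expose it);
a trivial example, with the device path obviously made up:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource0", O_RDWR);
	volatile void *bar;

	if (fd < 0)
		return 1;
	bar = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (bar == MAP_FAILED)
		return 1;
	/* bar now points at the first page of BAR0, no extra API needed */
	munmap((void *)bar, 4096);
	close(fd);
	return 0;
}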

> >   * MMIO constraints
> > 
> > The QEMU side VFIO code hard wires various constraints that are entirely
> > based on various requirements you decided you have on x86 but don't
> > necessarily apply to us :-)
> > 
> > Due to our paravirt nature, we don't need to masquerade the MSI-X table
> > for example. At all. If the guest configures crap into it, too bad, it
> can only shoot itself in the foot since the host bridge enforces
> > validation anyways as I explained earlier. Because it's all paravirt, we
> > don't need to "translate" the interrupt vectors & addresses, the guest
> will call hypercalls to configure things anyways.
> 
> With interrupt remapping, we can allow the guest access to the MSI-X
> table, but since that takes the host out of the loop, there's
> effectively no way for the guest to correctly program it directly by
> itself.

Right, I think what we need here is some kind of capability to "disable"
those "features" of qemu vfio.c that aren't needed on our platform :-)
Shouldn't be too hard. We need to make this runtime tho since different
machines can have different "capabilities".
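
Purely as a sketch of that "runtime capability" idea (none of these
names exist in qemu's vfio.c, they're invented for illustration), the
x86-motivated behaviours could be gated on a per-machine flag set:

#include <string.h>

/* Invented flag names: each bit gates one of the x86-motivated
 * behaviours discussed in this thread. */
enum {
    VFIO_PLAT_MSIX_SHADOW   = 1 << 0,  /* masquerade the MSI-X table   */
    VFIO_PLAT_SLOWMAP_SMALL = 1 << 1,  /* slow-map sub-page BARs       */
    VFIO_PLAT_BAR_EMULATION = 1 << 2,  /* emulate BAR sizing/placement */
};

static unsigned vfio_platform_caps(const char *machine)
{
    /* On a PAPR paravirt machine the host bridge validates MSI-X and
     * MMIO is segmented, so none of the workarounds are needed. */
    if (!strcmp(machine, "pseries"))
        return 0;
    return VFIO_PLAT_MSIX_SHADOW | VFIO_PLAT_SLOWMAP_SMALL |
           VFIO_PLAT_BAR_EMULATION;
}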

> > We don't need to prevent MMIO pass-through for small BARs at all. This
> > should be some kind of capability or flag passed by the arch. Our
> > segmentation of the MMIO domain means that we can give entire segments
> > to the guest and let it access anything in there (those segments are a
> > multiple of the page size always). Worst case it will access outside of
> > a device BAR within a segment and will cause the PE to go into error
> > state, shooting itself in the foot, there is no risk of side effect
> > outside of the guest boundaries.
> 
> Sure, this could be some kind of capability flag, maybe even implicit in
> certain configurations.

Yup.

> > In fact, we don't even need to emulate BAR sizing etc... in theory. Our
> > paravirt guests expect the BARs to have been already allocated for them
> > by the firmware and will pick up the addresses from the device-tree :-)
> > 
> > Today we use a "hack", putting all 0's in there and triggering the linux
> > code path to reassign unassigned resources (which will use BAR
> > emulation) but that's not what we are -supposed- to do. Not a big deal
> > and having the emulation there won't -hurt- us, it's just that we don't
> > really need any of it.
> > 
> > We have a small issue with ROMs. Our current KVM only works with huge
> > pages for guest memory but that is being fixed. So the way qemu maps the
> > ROM copy into the guest address space doesn't work. It might be handy
> > anyways to have a way for qemu to use MMIO emulation for ROM access as a
> > fallback. I'll look into it.
> 
> So that means ROMs don't work for you on emulated devices either?  The
> reason we read it once and map it into the guest is because Michael
> Tsirkin found a section in the PCI spec that indicates devices can share
> address decoders between BARs and ROM.

Yes, he is correct.

>   This means we can't just leave
> the enabled bit set in the ROM BAR, because it could actually disable an
> address decoder for a regular BAR.  We could slow-map the actual ROM,
> enabling it around each read, but shadowing it seemed far more
> efficient.

Right. We can slow map the ROM, or we can not care :-) At the end of the
day, what is the difference here between a "guest" under qemu and the
real thing running bare metal on the machine ? They have the same issue
with accessing the ROM, so I don't see why qemu should try to make it
safe to access at any time when it isn't safe on a real machine. Since VFIO
resets the devices before putting them in guest space, they should be
accessible no ? (Might require a hard reset for some devices tho ... )

In any case, it's not a big deal and we can sort it out, I'm happy to
fallback to slow map to start with and eventually we will support small
pages mappings on POWER anyways, it's a temporary limitation.
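
Incidentally, the existing sysfs rom attribute already does the "enable
it around the read" dance from userspace; a small example, with the
device path made up:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	ssize_t n;
	int fd = open("/sys/bus/pci/devices/0000:01:00.0/rom", O_RDWR);

	if (fd < 0)
		return 1;
	write(fd, "1", 1);		/* enable reading the ROM BAR    */
	lseek(fd, 0, SEEK_SET);
	while ((n = read(fd, buf, sizeof(buf))) > 0)
		;			/* copy/shadow the contents here */
	write(fd, "0", 1);		/* disable it again              */
	close(fd);
	return 0;
}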

> >   * EEH
> > 
> > This is the name of those fancy error handling & isolation features I
> > mentioned earlier. To some extent it's a superset of AER, but we don't
> > generally expose AER to guests (or even the host), it's swallowed by
> > firmware into something else that provides a superset (well mostly) of
> the AER information, and allows us to do those additional things like
> > isolating/de-isolating, reset control etc...
> > 
> > Here too, we'll need arch specific APIs through VFIO. Not necessarily a
> > huge deal, I mention it for completeness.
> 
> We expect to do AER via the VFIO netlink interface, which even though
> it's bashed below, would be quite extensible to supporting different
> kinds of errors.

As could platform specific ioctls :-)

> >    * Misc
> > 
> > There's lots of small bits and pieces... in no special order:
> > 
> >  - netlink ? WTF ! Seriously, we don't need a hybrid API with a bit of
> > netlink and a bit of ioctl's ... it's not like there's something
> > fundamentally  better for netlink vs. ioctl... it really depends what
> > you are doing, and in this case I fail to see what netlink brings you
> > other than bloat and more stupid userspace library deps.
> 
> The netlink interface is primarily for host->guest signaling.  I've only
> implemented the remove command (since we're lacking a pcie-host in qemu
> to do AER), but it seems to work quite well.  If you have suggestions
> for how else we might do it, please let me know.  This seems to be the
> sort of thing netlink is supposed to be used for.

I don't understand what the advantage of netlink is compared to just
extending your existing VFIO ioctl interface, possibly using children
fd's as we do for example with spufs, but it's not a huge deal. It's just
that netlink has its own gotchas and I don't like multi-headed
interfaces.
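
To illustrate the "extend the ioctls / hand out child fd's" alternative,
something like the following could replace the netlink channel. The
ioctl name and number are entirely made up, it's only meant to show the
shape of the thing:

#include <stdint.h>
#include <linux/ioctl.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>

#define VFIO_GET_REMOVE_EVENTFD	_IOW('V', 200, int)	/* made up */

/* Ask the (hypothetical) VFIO ioctl to signal this eventfd when the
 * host wants the device back (surprise removal, AER, ...). */
static int vfio_watch_remove(int vfio_fd)
{
	int efd = eventfd(0, EFD_CLOEXEC);

	if (efd < 0)
		return -1;
	if (ioctl(vfio_fd, VFIO_GET_REMOVE_EVENTFD, &efd) < 0)
		return -1;
	return efd;	/* poll() this fd alongside everything else */
}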

> >  - I don't like too much the fact that VFIO provides yet another
> > different API to do what we already have at least 2 kernel APIs for, ie,
> > BAR mapping and config space access. At least it should be better at
> > using the backend infrastructure of the 2 others (sysfs & procfs). I
> > understand it wants to filter in some case (config space) and -maybe-
> > yet another API is the right way to go but allow me to have my doubts.
> 
> The use of PCI sysfs is actually one of my complaints about current
> device assignment.  To do assignment with an unprivileged guest we need
> to open the PCI sysfs config file for it, then change ownership on a
> handful of other PCI sysfs files, then there's this other pci-stub thing
> to maintain ownership, but the kvm ioctls don't actually require it and
> can grab onto any free device...  We are duplicating some of that in
> VFIO, but we also put the ownership of the device behind a single device
> file.  We do have the uiommu problem that we can't give an unprivileged
> user ownership of that, but your usage model may actually make that
> easier.  More below...
> 
> > One thing I thought about but you don't seem to like it ... was to use
> > the need to represent the partitionable entity as groups in sysfs that I
> > talked about earlier. Those could have per-device subdirs with the usual
> > config & resource files, same semantic as the ones in the real device,
> but when accessed via the group they get filtering. It might or might not
> > be practical in the end, tbd, but it would allow apps using a slightly
> > modified libpci for example to exploit some of this.
> 
> I may be tainted by our disagreement that all the devices in a group
> need to be exposed to the guest and qemu could just take a pointer to a
> sysfs directory.  That seems very unlike qemu and pushes more of the
> policy into qemu, which seems like the wrong direction.

I don't see how it pushes "policy" into qemu.

The "policy" here is imposed by the HW setup and exposed by the
kernel :-) Giving qemu a group means qemu takes "ownership" of that bunch
of devices, so far I don't see what's policy about that. From there, it
would be "handy" for people to just stop there and just see all the
devices of the group show up in the guest, but by all means feel free to
suggest a command line interface that allows one to specify more precisely
which of the devices in the group to pass through and at what address.

> >  - The qemu vfio code hooks directly into ioapic ... of course that
> > won't fly with anything !x86
> 
> I spent a lot of time looking for an architecture neutral solution here,
> but I don't think it exists.  Please prove me wrong.

No it doesn't, I agree; that's why it should be some kind of notifier or
function pointer set up by the platform specific code.

>   The problem is
> that we have to disable INTx on an assigned device after it fires (VFIO
> does this automatically).  If we don't do this, a non-responsive or
> malicious guest could sit on the interrupt, causing it to fire
> repeatedly as a DoS on the host.  The only indication that we can rely
> on to re-enable INTx is when the guest CPU writes an EOI to the APIC.
> We can't just wait for device accesses because a) the device CSRs are
> (hopefully) direct mapped and we'd have to slow map them or attempt to
> do some kind of dirty logging to detect when they're accessed b) what
> constitutes an interrupt service is device specific.
> 
> That means we need to figure out how PCI interrupt 'A' (or B...)
> translates to a GSI (Global System Interrupt - ACPI definition, but
> hopefully a generic concept).  That GSI identifies a pin on an IOAPIC,
> which will also see the APIC EOI.  And just to spice things up, the
> guest can change the PCI to GSI mappings via ACPI.  I think the set of
> callbacks I've added are generic (maybe I left ioapic in the name), but
> yes they do need to be implemented for other architectures.  Patches
> appreciated from those with knowledge of the systems and/or access to
> device specs.  This is the only reason that I make QEMU VFIO only build
> for x86.

Right, and we need to cook a similar sauce for POWER. It's an area that
has to be arch specific (and in fact specific to the particular HW
machine being emulated), so we just need to find out what's the cleanest
way for the platform to "register" the right callbacks here.

Not a big deal, I just felt like mentioning it :-)
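
As a strawman for what "register the right callbacks" could look like on
the qemu side (all names invented, this is just the shape I have in
mind): the board/arch code provides an ops table, and the vfio code only
ever calls through it:

/* Hypothetical platform hooks: the machine code tells vfio how an INTx
 * pin maps to a platform interrupt and how to hear about its EOI. */
typedef struct VFIOIRQOps {
    /* map a device's INTx pin to the platform interrupt identifier */
    int  (*route_intx)(void *opaque, int devfn, int pin);
    /* ask to be notified when the guest EOIs that interrupt */
    int  (*register_eoi)(void *opaque, int irq,
                         void (*eoi_cb)(void *cb_opaque), void *cb_opaque);
    void *opaque;
} VFIOIRQOps;

static const VFIOIRQOps *vfio_irq_ops;

void vfio_register_irq_ops(const VFIOIRQOps *ops)
{
    vfio_irq_ops = ops;   /* called by the board/arch init code */
}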

> >  - The various "objects" dealt with here, -especially- interrupts and
> > iommu, need a better in-kernel API so that fast in-kernel emulation can
> > take over from qemu based emulation. The way we need to do some of this
> > on POWER differs from x86. We can elaborate later, it's not necessarily
> > a killer either but essentially we'll take the bulk of interrupt
> > handling away from VFIO to the point where it won't see any of it at
> > all.
> 
> The plan for x86 is to connect VFIO eventfds directly to KVM irqfds and
> bypass QEMU.  This is exactly what VHOST does today and fairly trivial
> to enable for MSI once we get it merged.  INTx would require us to be
> able to define a level triggered irqfd in KVM and it's not yet clear if
> we care that much about INTx performance.

I care enough because our exit cost to qemu is much higher than x86, and
I can pretty easily emulate my PIC entirely in real mode (from within
the guest context) which is what I intend to do :-)

On the other hand, I have no reason to treat MSI or LSI differently, so
all I really need to do is get back to the underlying platform HW
interrupt number, and I think I can do that. So as long as I have a hook
to know what's there and what has been enabled, these interrupts will simply
cease to be visible to either qemu or vfio.

Another reason why I don't like allowing shared interrupts in different
guests with DisINTx :-) Because that means that such interrupts would
have to go back all the way to qemu/vfio :-) But I can always have a
fallback there, it's really the problem of "trusting" DisINTx that
concerns me.
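
For completeness, the eventfd-to-irqfd path you describe already has the
KVM half in place; roughly (error handling trimmed, and how the eventfd
gets handed to VFIO to signal on device interrupt is left out):

#include <linux/kvm.h>
#include <string.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>

/* Wire an eventfd to a guest GSI so injection bypasses qemu entirely. */
static int wire_irqfd(int kvm_vm_fd, int gsi)
{
	struct kvm_irqfd irqfd;
	int efd = eventfd(0, EFD_CLOEXEC);

	if (efd < 0)
		return -1;
	memset(&irqfd, 0, sizeof(irqfd));
	irqfd.fd  = efd;
	irqfd.gsi = gsi;
	if (ioctl(kvm_vm_fd, KVM_IRQFD, &irqfd) < 0)
		return -1;
	return efd;	/* give this fd to VFIO to signal on interrupt */
}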

> We don't currently have a plan for accelerating IOMMU access since our
> current usage model doesn't need one.  We also need to consider MSI-X
> table acceleration for x86.  I hope we'll be able to use the new KVM
> ioctls for this.

Ok, we can give direct access to the MSI-X table to the guest on power
so that isn't an issue for us.

> Thanks for the write up, I think it will be good to let everyone digest
> it before we discuss this at KVM forum.

Agreed. As I think I may have mentioned already, I won't be able to make
it to the forum, but Paulus will and I'll be in a closeby timezone, so I
might be able to join a call if it's deemed useful.

> Rather than your "groups" idea, I've been mulling over whether we can
> just expose the dependencies, configuration, and capabilities in sysfs
> and build qemu commandlines to describe it.  For instance, if we simply
> start with creating iommu nodes in sysfs, we could create links under
> each iommu directory to the devices behind them.  Some kind of
> capability file could define properties like whether it's page table
> based or fixed iova window or the granularity of mapping the devices
> behind it.  Once we have that, we could probably make uiommu attach to
> each of those nodes.

Well, s/iommu/groups and you are pretty close to my original idea :-)

I don't mind that much what the details are, but I like the idea of not
having to construct a 3-page command line every time I want to
pass-through a device; most "simple" usage scenarios don't care that
much.

> That means we know /dev/uiommu7 (random example) is our access to a
> specific iommu with a given set of devices behind it.

Linking those sysfs iommus or groups to a /dev/ entry is fine by me.
  
>   If that iommu is
> a PE (via those capability files), then a user space entity (trying hard
> not to call it libvirt) can unbind all those devices from the host,
> maybe bind the ones it wants to assign to a guest to vfio and bind the
> others to pci-stub for safe keeping.  If you trust a user with
> everything in a PE, bind all the devices to VFIO, chown all
> the /dev/vfioX entries for those devices, and the /dev/uiommuX device.
>
> We might then come up with qemu command lines to describe interesting
> configurations, such as:
> 
> -device iommu,model=PAPR,uiommu=/dev/uiommu7,id=iommu.0 \
> -device pci-bus,...,iommu=iommu.0,id=pci.0 \
> -device vfio,host=ssss:bb:dd.f,bus=pci.0,addr=dd.f,id=hostdev0
> 
> The userspace entity would obviously need to put things in the same PE
> in the right place, but it doesn't seem to take a lot of sysfs info to
> get that right.
> 
> Today we do DMA mapping via the VFIO device because the capabilities of
> the IOMMU domains change depending on which devices are connected (for
> VT-d, the least common denominator of the IOMMUs in play).  Forcing the
> DMA mappings through VFIO naturally forces the call order.  If we moved
> to something like above, we could switch the DMA mapping to the uiommu
> device, since the IOMMU would have fixed capabilities.

That makes sense.

> What gaps would something like this leave for your IOMMU granularity
> problems?  I'll need to think through how it works when we don't want to
> expose the iommu to the guest, maybe a model=none (default) that doesn't
> need to be connected to a pci bus and maps all guest memory.  Thanks,

Well, I would map those "iommus" to PEs, so what remains is the path to
put all the "other" bits and pieces such as inform qemu of the location
and size of the MMIO segment(s) (so we can map the whole thing and not
bother with individual BARs) etc... 

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
@ 2011-07-30 23:54     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 322+ messages in thread
From: Benjamin Herrenschmidt @ 2011-07-30 23:54 UTC (permalink / raw)
  To: Alex Williamson
  Cc: chrisw, Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	qemu-devel, David Gibson, aafabbri, iommu, Anthony Liguori,
	linuxppc-dev, benve

On Sat, 2011-07-30 at 12:20 -0600, Alex Williamson wrote:

> On x86, the USB controllers don't typically live behind a PCIe-to-PCI
> bridge, so don't suffer the source identifier problem, but they do often
> share an interrupt.  But even then, we can count on most modern devices
> supporting PCI2.3, and thus the DisINTx feature, which allows us to
> share interrupts.  In any case, yes, it's more rare but we need to know
> how to handle devices behind PCI bridges.  However I disagree that we
> need to assign all the devices behind such a bridge to the guest.

Well, ok so let's dig a bit more here :-) First, yes I agree they don't
all need to appear to the guest. My point is really that we must prevent
them to be "used" by somebody else, either host or another guest.

Now once you get there, I personally prefer having a clear "group"
ownership rather than having devices stay in some "limbo" under vfio
control but it's an implementation detail.

Regarding DisINTx, well, it's a bit like putting separate PCIe functions
into separate guests, it looks good ... but you are taking a chance.
Note that I do intend to do some of that for power ... well I think, I
haven't completely made my mind.

pHyp for has a stricter requirement, PEs essentially are everything
behind a bridge. If you have a slot, you have some kind of bridge above
this slot and everything on it will be a PE.

The problem I see is that with your filtering of config space, BAR
emulation, DisINTx etc... you essentially assume that you can reasonably
reliably isolate devices. But in practice, it's chancy. Some devices for
example have "backdoors" into their own config space via MMIO. If I have
such a device in a guest, I can completely override your DisINTx and
thus DOS your host or another guest with a shared interrupt. I can move
my MMIO around and DOS another function by overlapping the addresses.

You can really only be protect yourself against a device if you have it
behind a bridge (in addition to having a filtering iommu), which limits
the MMIO span (and thus letting the guest whack the BARs randomly will
only allow that guest to shoot itself in the foot).

Some bridges also provide a way to block INTx below them which comes in
handy but it's bridge specific. Some devices can be coerced to send the
INTx "assert" message and never de-assert it (for example by doing a
soft-reset while it's asserted, which can be done with some devices with
an MMIO).

Anything below a PCIe -> PCI/PCI-X needs to also be "grouped" due to
simple lack of proper filtering by the iommu (PCI-X in theory has RIDs
and fowards them up, but this isn't very reliable, for example it fails
over with split transactions).

Fortunately in PCIe land, we most have bridges above everything. The
problem somewhat remains with functions of a device, how can you be sure
that there isn't a way via some MMIO to create side effects on the other
functions of the device ? (For example by checkstopping the whole
thing). You can't really :-)

So it boils down of the "level" of safety/isolation you want to provide,
and I suppose to some extent it's a user decision but the user needs to
be informed to some extent. A hard problem :-)
 
> There's a difference between removing the device from the host and
> exposing the device to the guest.  If I have a NIC and HBA behind a
> bridge, it's perfectly reasonable that I might only assign the NIC to
> the guest, but as you describe, we then need to prevent the host, or any
> other guest from making use of the HBA.

Yes. However the other device is in "limbo" and it may be not clear to
the user why it can't be used anymore :-)

The question is more, the user needs to "know" (or libvirt does, or
somebody ... ) that in order to pass-through device A, it must also
"remove" device B from the host. How can you even provide a meaningful
error message to the user if all VFIO does is give you something like
-EBUSY ?

So the information about the grouping constraint must trickle down
somewhat.

Look at it from a GUI perspective for example. Imagine a front-end
showing you devices in your system and allowing you to "Drag & drop"
them to your guest. How do you represent that need for grouping ? First
how do you expose it from kernel/libvirt to the GUI tool and how do you
represent it to the user ?

By grouping the devices in logical groups which end up being the
"objects" you can drag around, at least you provide some amount of
clarity. Now if you follow that path down to how the GUI app, libvirt
and possibly qemu need to know / resolve the dependency, being given the
"groups" as the primary information of what can be used for pass-through
makes everything a lot simpler.
 
> > - The -minimum- granularity of pass-through is not always a single
> > device and not always under SW control
> 
> But IMHO, we need to preserve the granularity of exposing a device to a
> guest as a single device.  That might mean some devices are held hostage
> by an agent on the host.

Maybe but wouldn't that be even more confusing from a user perspective ?
And I think it makes it harder from an implementation of admin &
management tools perspective too.

> > - Having a magic heuristic in libvirt to figure out those constraints is
> > WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
> > knowledge of PCI resource management and getting it wrong in many many
> > cases, something that took years to fix essentially by ripping it all
> > out. This is kernel knowledge and thus we need the kernel to expose in a
> > way or another what those constraints are, what those "partitionable
> > groups" are.
> > 
> > - That does -not- mean that we cannot specify for each individual device
> > within such a group where we want to put it in qemu (what devfn etc...).
> > As long as there is a clear understanding that the "ownership" of the
> > device goes with the group, this is somewhat orthogonal to how they are
> > represented in qemu. (Not completely... if the iommu is exposed to the
> > guest ,via paravirt for example, some of these constraints must be
> > exposed but I'll talk about that more later).
> 
> Or we can choose not to expose all of the devices in the group to the
> guest?

As I said, I don't mind if you don't, I'm just worried about the
consequences of that from a usability standpoint. Having advanced
command line option to fine tune is fine. Being able to specify within a
"group" which devices to show and at what address if fine.

But I believe the basic entity to be manipulated from an interface
standpoitn remains the group.

To get back to my GUI example, once you've D&D your group of devices
over, you can have the option to open that group and check/uncheck
individual devices & assign them addresses if you want. That doesn't
change the fact that practically speaking, the whole group is now owned
by the guest.

I will go further than that actually. If you look at how the isolation
HW works on POWER, the fact that I have the MMIO segmentation means that
I can simply give the entire group MMIO space to the guest. No problem
of small BARs, no need to slow-map them ... etc.. that's a pretty handy
feature don't you think ?

But that means that those other devices -will- be there, mapped along
with the one you care about. We may not expose it in config space but it
will be accessible. I suppose we can keep its IO/MEM decoding disabled.
But my point is that for all intend and purpose, it's actually owned by
the guest.

> > The interface currently proposed for VFIO (and associated uiommu)
> > doesn't handle that problem at all. Instead, it is entirely centered
> > around a specific "feature" of the VTd iommu's for creating arbitrary
> > domains with arbitrary devices (tho those devices -do- have the same
> > constraints exposed above, don't try to put 2 legacy PCI devices behind
> > the same bridge into 2 different domains !), but the API totally ignores
> > the problem, leaves it to libvirt "magic foo" and focuses on something
> > that is both quite secondary in the grand scheme of things, and quite
> > x86 VTd specific in the implementation and API definition.
> 
> To be fair, libvirt's "magic foo" is built out of the necessity that
> nobody else is defining the rules.

Sure, which is why I propose that the kernel exposes the rules since
it's really the one right place to have that sort of HW constraint
knowledge, especially since it can be partially at least platform
specific.
 
 .../...

> > Maybe something like /sys/devgroups ? This probably warrants involving
> > more kernel people into the discussion.
> 
> I don't yet buy into passing groups to qemu since I don't buy into the
> idea of always exposing all of those devices to qemu.  Would it be
> sufficient to expose iommu nodes in sysfs that link to the devices
> behind them and describe properties and capabilities of the iommu
> itself?  More on this at the end.

Well, iommu aren't the only factor. I mentioned shared interrupts (and
my unwillingness to always trust DisINTx), there's also the MMIO
grouping I mentioned above (in which case it's an x86 -limitation- with
small BARs that I don't want to inherit, especially since it's based on
PAGE_SIZE and we commonly have 64K page size on POWER), etc...

So I'm not too fan of making it entirely look like the iommu is the
primary factor, but we -can-, that would be workable. I still prefer
calling a cat a cat and exposing the grouping for what it is, as I think
I've explained already above, tho. 

 .../...

> > Now some of this can be fixed with tweaks, and we've started doing it
> > (we have a working pass-through using VFIO, forgot to mention that, it's
> > just that we don't like what we had to do to get there).
> 
> This is a result of wanting to support *unmodified* x86 guests.  We
> don't have the luxury of having a predefined pvDMA spec that all x86
> OSes adhere to. 

No but you could emulate a HW iommu no ?

>  The 32bit problem is unfortunate, but the priority use
> case for assigning devices to guests is high performance I/O, which
> usually entails modern, 64bit hardware.  I'd like to see us get to the
> point of having emulated IOMMU hardware on x86, which could then be
> backed by VFIO, but for now guest pinning is the most practical and
> useful.

For your current case maybe. It's just not very future proof imho.
Anyways, it's fixable, but the APIs as they are make it a bit clumsy.

 .../...

> > Also our next generation chipset may drop support for PIO completely.
> > 
> > On the other hand, because PIO is just a special range of MMIO for us,
> > we can do normal pass-through on it and don't need any of the emulation
> > done qemu.
> 
> Maybe we can add mmap support to PIO regions on non-x86.

We have to yes. I haven't looked into it yet, it should be easy if VFIO
kernel side starts using the "proper" PCI mmap interfaces in kernel (the
same interfaces sysfs & proc use).

> >   * MMIO constraints
> > 
> > The QEMU side VFIO code hard wires various constraints that are entirely
> > based on various requirements you decided you have on x86 but don't
> > necessarily apply to us :-)
> > 
> > Due to our paravirt nature, we don't need to masquerade the MSI-X table
> > for example. At all. If the guest configures crap into it, too bad, it
> > can only shoot itself in the foot since the host bridge enforce
> > validation anyways as I explained earlier. Because it's all paravirt, we
> > don't need to "translate" the interrupt vectors & addresses, the guest
> > will call hyercalls to configure things anyways.
> 
> With interrupt remapping, we can allow the guest access to the MSI-X
> table, but since that takes the host out of the loop, there's
> effectively no way for the guest to correctly program it directly by
> itself.

Right, I think what we need here is some kind of capabilities to
"disable" those "features" of qemu vfio.c that aren't needed on our
platform :-) Shouldn't be too hard. We need to make this runtime tho
since different machines can have different "capabilities".

> > We don't need to prevent MMIO pass-through for small BARs at all. This
> > should be some kind of capability or flag passed by the arch. Our
> > segmentation of the MMIO domain means that we can give entire segments
> > to the guest and let it access anything in there (those segments are a
> > multiple of the page size always). Worst case it will access outside of
> > a device BAR within a segment and will cause the PE to go into error
> > state, shooting itself in the foot, there is no risk of side effect
> > outside of the guest boundaries.
> 
> Sure, this could be some kind of capability flag, maybe even implicit in
> certain configurations.

Yup.

> > In fact, we don't even need to emulate BAR sizing etc... in theory. Our
> > paravirt guests expect the BARs to have been already allocated for them
> > by the firmware and will pick up the addresses from the device-tree :-)
> > 
> > Today we use a "hack", putting all 0's in there and triggering the linux
> > code path to reassign unassigned resources (which will use BAR
> > emulation) but that's not what we are -supposed- to do. Not a big deal
> > and having the emulation there won't -hurt- us, it's just that we don't
> > really need any of it.
> > 
> > We have a small issue with ROMs. Our current KVM only works with huge
> > pages for guest memory but that is being fixed. So the way qemu maps the
> > ROM copy into the guest address space doesn't work. It might be handy
> > anyways to have a way for qemu to use MMIO emulation for ROM access as a
> > fallback. I'll look into it.
> 
> So that means ROMs don't work for you on emulated devices either?  The
> reason we read it once and map it into the guest is because Michael
> Tsirkin found a section in the PCI spec that indicates devices can share
> address decoders between BARs and ROM.

Yes, he is correct.

>   This means we can't just leave
> the enabled bit set in the ROM BAR, because it could actually disable an
> address decoder for a regular BAR.  We could slow-map the actual ROM,
> enabling it around each read, but shadowing it seemed far more
> efficient.

Right. We can slow map the ROM, or we can not care :-) At the end of the
day, what is the difference here between a "guest" under qemu and the
real thing bare metal on the machine ? IE. They have the same issue vs.
accessing the ROM. IE. I don't see why qemu should try to make it safe
to access it at any time while it isn't on a real machine. Since VFIO
resets the devices before putting them in guest space, they should be
accessible no ? (Might require a hard reset for some devices tho ... )

In any case, it's not a big deal and we can sort it out, I'm happy to
fallback to slow map to start with and eventually we will support small
pages mappings on POWER anyways, it's a temporary limitation.

> >   * EEH
> > 
> > This is the name of those fancy error handling & isolation features I
> > mentioned earlier. To some extent it's a superset of AER, but we don't
> > generally expose AER to guests (or even the host), it's swallowed by
> > firmware into something else that provides a superset (well mostly) of
> > the AER information, and allow us to do those additional things like
> > isolating/de-isolating, reset control etc...
> > 
> > Here too, we'll need arch specific APIs through VFIO. Not necessarily a
> > huge deal, I mention it for completeness.
> 
> We expect to do AER via the VFIO netlink interface, which even though
> its bashed below, would be quite extensible to supporting different
> kinds of errors.

As could platform specific ioctls :-)

> >    * Misc
> > 
> > There's lots of small bits and pieces... in no special order:
> > 
> >  - netlink ? WTF ! Seriously, we don't need a hybrid API with a bit of
> > netlink and a bit of ioctl's ... it's not like there's something
> > fundamentally  better for netlink vs. ioctl... it really depends what
> > you are doing, and in this case I fail to see what netlink brings you
> > other than bloat and more stupid userspace library deps.
> 
> The netlink interface is primarily for host->guest signaling.  I've only
> implemented the remove command (since we're lacking a pcie-host in qemu
> to do AER), but it seems to work quite well.  If you have suggestions
> for how else we might do it, please let me know.  This seems to be the
> sort of thing netlink is supposed to be used for.

I don't understand what the advantage of netlink is compared to just
extending your existing VFIO ioctl interface, possibly using children
fd's as we do for example with spufs but it's not a huge deal. It just
that netlink has its own gotchas and I don't like multi-headed
interfaces.

> >  - I don't like too much the fact that VFIO provides yet another
> > different API to do what we already have at least 2 kernel APIs for, ie,
> > BAR mapping and config space access. At least it should be better at
> > using the backend infrastructure of the 2 others (sysfs & procfs). I
> > understand it wants to filter in some case (config space) and -maybe-
> > yet another API is the right way to go but allow me to have my doubts.
> 
> The use of PCI sysfs is actually one of my complaints about current
> device assignment.  To do assignment with an unprivileged guest we need
> to open the PCI sysfs config file for it, then change ownership on a
> handful of other PCI sysfs files, then there's this other pci-stub thing
> to maintain ownership, but the kvm ioctls don't actually require it and
> can grab onto any free device...  We are duplicating some of that in
> VFIO, but we also put the ownership of the device behind a single device
> file.  We do have the uiommu problem that we can't give an unprivileged
> user ownership of that, but your usage model may actually make that
> easier.  More below...
> 
> > One thing I thought about but you don't seem to like it ... was to use
> > the need to represent the partitionable entity as groups in sysfs that I
> > talked about earlier. Those could have per-device subdirs with the usual
> > config & resource files, same semantic as the ones in the real device,
> > but when accessed via the group they get filtering. I might or might not
> > be practical in the end, tbd, but it would allow apps using a slightly
> > modified libpci for example to exploit some of this.
> 
> I may be tainted by our disagreement that all the devices in a group
> need to be exposed to the guest and qemu could just take a pointer to a
> sysfs directory.  That seems very unlike qemu and pushes more of the
> policy into qemu, which seems like the wrong direction.

I don't see how it pushes "policy" into qemu.

The "policy" here is imposed by the HW setup and exposed by the
kernel :-) Giving qemu a group means qemu takes "owership" of that bunch
of devices, so far I don't see what's policy about that. From there, it
would be "handy" for people to just stop there and just see all the
devices of the group show up in the guest, but by all means feel free to
suggest a command line interface that allows to more precisely specify
which of the devices in the group to pass through and at what address.

> >  - The qemu vfio code hooks directly into ioapic ... of course that
> > won't fly with anything !x86
> 
> I spent a lot of time looking for an architecture neutral solution here,
> but I don't think it exists.  Please prove me wrong.

No it doesn't I agree, that's why it should be some kind of notifier or
function pointer setup by the platform specific code.

>   The problem is
> that we have to disable INTx on an assigned device after it fires (VFIO
> does this automatically).  If we don't do this, a non-responsive or
> malicious guest could sit on the interrupt, causing it to fire
> repeatedly as a DoS on the host.  The only indication that we can rely
> on to re-enable INTx is when the guest CPU writes an EOI to the APIC.
> We can't just wait for device accesses because a) the device CSRs are
> (hopefully) direct mapped and we'd have to slow map them or attempt to
> do some kind of dirty logging to detect when they're accesses b) what
> constitutes an interrupt service is device specific.
> 
> That means we need to figure out how PCI interrupt 'A' (or B...)
> translates to a GSI (Global System Interrupt - ACPI definition, but
> hopefully a generic concept).  That GSI identifies a pin on an IOAPIC,
> which will also see the APIC EOI.  And just to spice things up, the
> guest can change the PCI to GSI mappings via ACPI.  I think the set of
> callbacks I've added are generic (maybe I left ioapic in the name), but
> yes they do need to be implemented for other architectures.  Patches
> appreciated from those with knowledge of the systems and/or access to
> device specs.  This is the only reason that I make QEMU VFIO only build
> for x86.

Right, and we need to cook a similiar sauce for POWER, it's an area that
has to be arch specific (and in fact specific to the specific HW machine
being emulated), so we just need to find out what's the cleanest way for
the plaform to "register" the right callbacks here.

Not a big deal, I just felt like mentioning it :-)

> >  - The various "objects" dealt with here, -especially- interrupts and
> > iommu, need a better in-kernel API so that fast in-kernel emulation can
> > take over from qemu based emulation. The way we need to do some of this
> > on POWER differs from x86. We can elaborate later, it's not necessarily
> > a killer either but essentially we'll take the bulk of interrupt
> > handling away from VFIO to the point where it won't see any of it at
> > all.
> 
> The plan for x86 is to connect VFIO eventfds directly to KVM irqfds and
> bypass QEMU.  This is exactly what VHOST does today and fairly trivial
> to enable for MSI once we get it merged.  INTx would require us to be
> able to define a level triggered irqfd in KVM and it's not yet clear if
> we care that much about INTx performance.

I care enough because our exit cost to qemu is much higher than x86, and
I can pretty easily emulate my PIC entirely in real mode (from within
the guest context) which is what I intend to do :-)

On the other hand, I have no reason to treat MSI or LSI differently, so
all I really need to is get back to the underlying platform HW interrupt
number and I think I can do that. So as long as I have a hook to know
what's there and what has been enabled, thse interrupts will simply
cease to be visible to either qemu or vfio.

Another reason why I don't like allowing shared interrupts in differrent
guests with DisINTx :-) Because that means that such interrupts would
have to go back all the way to qemu/vfio :-) But I can always have a
fallback there, it's really the problem of "trusting" DisINTx that
concerns me.

> We don't currently have a plan for accelerating IOMMU access since our
> current usage model doesn't need one.  We also need to consider MSI-X
> table acceleration for x86.  I hope we'll be able to use the new KVM
> ioctls for this.

Ok, we can give direct access to the MSI-X table to the guest on power
so that isn't an issue for us.

> Thanks for the write up, I think it will be good to let everyone digest
> it before we discuss this at KVM forum.

Agreed. As I think I may have mentioned already, I won't be able to make
it to the forum, but Paulus will and I'll be in a closeby timezone, so I
might be able to join a call if it's deemed useful.

> Rather than your "groups" idea, I've been mulling over whether we can
> just expose the dependencies, configuration, and capabilities in sysfs
> and build qemu commandlines to describe it.  For instance, if we simply
> start with creating iommu nodes in sysfs, we could create links under
> each iommu directory to the devices behind them.  Some kind of
> capability file could define properties like whether it's page table
> based or fixed iova window or the granularity of mapping the devices
> behind it.  Once we have that, we could probably make uiommu attach to
> each of those nodes.

Well, s/iommu/groups and you are pretty close to my original idea :-)

I don't mind that much what the details are, but I like the idea of not
having to construct a 3-pages command line every time I want to
pass-through a device, most "simple" usage scenario don't care that
much.

> That means we know /dev/uiommu7 (random example) is our access to a
> specific iommu with a given set of devices behind it.

Linking those sysfs iommus or groups to a /dev/ entry is fine by me.
  
>   If that iommu is
> a PE (via those capability files), then a user space entity (trying hard
> not to call it libvirt) can unbind all those devices from the host,
> maybe bind the ones it wants to assign to a guest to vfio and bind the
> others to pci-stub for safe keeping.  If you trust a user with
> everything in a PE, bind all the devices to VFIO, chown all
> the /dev/vfioX entries for those devices, and the /dev/uiommuX device.
>
> We might then come up with qemu command lines to describe interesting
> configurations, such as:
> 
> -device iommu,model=PAPR,uiommu=/dev/uiommu7,id=iommu.0 \
> -device pci-bus,...,iommu=iommu0,id=pci.0 \
> -device vfio,host=ssss:bb:dd.f,bus=pci.0,addr=dd.f,id=hostdev0
> 
> The userspace entity would obviously need to put things in the same PE
> in the right place, but it doesn't seem to take a lot of sysfs info to
> get that right.
> 
> Today we do DMA mapping via the VFIO device because the capabilities of
> the IOMMU domains change depending on which devices are connected (for
> VT-d, the least common denominator of the IOMMUs in play).  Forcing the
> DMA mappings through VFIO naturally forces the call order.  If we moved
> to something like above, we could switch the DMA mapping to the uiommu
> device, since the IOMMU would have fixed capabilities.

That makes sense.

> What gaps would something like this leave for your IOMMU granularity
> problems?  I'll need to think through how it works when we don't want to
> expose the iommu to the guest, maybe a model=none (default) that doesn't
> need to be connected to a pci bus and maps all guest memory.  Thanks,

Well, I would map those "iommus" to PEs, so what remains is the path to
put all the "other" bits and pieces such as inform qemu of the location
and size of the MMIO segment(s) (so we can map the whole thing and not
bother with individual BARs) etc... 

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [Qemu-devel] kvm PCI assignment & VFIO ramblings
@ 2011-07-30 23:54     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 322+ messages in thread
From: Benjamin Herrenschmidt @ 2011-07-30 23:54 UTC (permalink / raw)
  To: Alex Williamson
  Cc: chrisw, Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	qemu-devel, David Gibson, aafabbri, iommu, linuxppc-dev, benve

On Sat, 2011-07-30 at 12:20 -0600, Alex Williamson wrote:

> On x86, the USB controllers don't typically live behind a PCIe-to-PCI
> bridge, so don't suffer the source identifier problem, but they do often
> share an interrupt.  But even then, we can count on most modern devices
> supporting PCI2.3, and thus the DisINTx feature, which allows us to
> share interrupts.  In any case, yes, it's more rare but we need to know
> how to handle devices behind PCI bridges.  However I disagree that we
> need to assign all the devices behind such a bridge to the guest.

Well, ok so let's dig a bit more here :-) First, yes I agree they don't
all need to appear to the guest. My point is really that we must prevent
them to be "used" by somebody else, either host or another guest.

Now once you get there, I personally prefer having a clear "group"
ownership rather than having devices stay in some "limbo" under vfio
control but it's an implementation detail.

Regarding DisINTx, well, it's a bit like putting separate PCIe functions
into separate guests, it looks good ... but you are taking a chance.
Note that I do intend to do some of that for power ... well I think, I
haven't completely made my mind.

pHyp for has a stricter requirement, PEs essentially are everything
behind a bridge. If you have a slot, you have some kind of bridge above
this slot and everything on it will be a PE.

The problem I see is that with your filtering of config space, BAR
emulation, DisINTx etc... you essentially assume that you can reasonably
reliably isolate devices. But in practice, it's chancy. Some devices for
example have "backdoors" into their own config space via MMIO. If I have
such a device in a guest, I can completely override your DisINTx and
thus DOS your host or another guest with a shared interrupt. I can move
my MMIO around and DOS another function by overlapping the addresses.

You can really only be protect yourself against a device if you have it
behind a bridge (in addition to having a filtering iommu), which limits
the MMIO span (and thus letting the guest whack the BARs randomly will
only allow that guest to shoot itself in the foot).

Some bridges also provide a way to block INTx below them which comes in
handy but it's bridge specific. Some devices can be coerced to send the
INTx "assert" message and never de-assert it (for example by doing a
soft-reset while it's asserted, which can be done with some devices with
an MMIO).

Anything below a PCIe -> PCI/PCI-X needs to also be "grouped" due to
simple lack of proper filtering by the iommu (PCI-X in theory has RIDs
and fowards them up, but this isn't very reliable, for example it fails
over with split transactions).

Fortunately in PCIe land, we most have bridges above everything. The
problem somewhat remains with functions of a device, how can you be sure
that there isn't a way via some MMIO to create side effects on the other
functions of the device ? (For example by checkstopping the whole
thing). You can't really :-)

So it boils down of the "level" of safety/isolation you want to provide,
and I suppose to some extent it's a user decision but the user needs to
be informed to some extent. A hard problem :-)
 
> There's a difference between removing the device from the host and
> exposing the device to the guest.  If I have a NIC and HBA behind a
> bridge, it's perfectly reasonable that I might only assign the NIC to
> the guest, but as you describe, we then need to prevent the host, or any
> other guest from making use of the HBA.

Yes. However the other device is in "limbo" and it may be not clear to
the user why it can't be used anymore :-)

The question is more, the user needs to "know" (or libvirt does, or
somebody ... ) that in order to pass-through device A, it must also
"remove" device B from the host. How can you even provide a meaningful
error message to the user if all VFIO does is give you something like
-EBUSY ?

So the information about the grouping constraint must trickle down
somewhat.

Look at it from a GUI perspective for example. Imagine a front-end
showing you devices in your system and allowing you to "Drag & drop"
them to your guest. How do you represent that need for grouping ? First
how do you expose it from kernel/libvirt to the GUI tool and how do you
represent it to the user ?

By grouping the devices in logical groups which end up being the
"objects" you can drag around, at least you provide some amount of
clarity. Now if you follow that path down to how the GUI app, libvirt
and possibly qemu need to know / resolve the dependency, being given the
"groups" as the primary information of what can be used for pass-through
makes everything a lot simpler.
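
To make this concrete: if the kernel exposed something like a
(hypothetical) /sys/devgroups/<n>/devices/ directory with one entry per
device, the discovery side for libvirt or a GUI becomes trivial. A rough
sketch, with the path layout entirely made up:

#include <dirent.h>
#include <limits.h>
#include <stdio.h>

/* List the members of one "partitionable group", assuming an invented
 * /sys/devgroups/<n>/devices/ layout with one entry per PCI device. */
static void list_group_members(const char *group_dir)
{
        char path[PATH_MAX];
        snprintf(path, sizeof(path), "%s/devices", group_dir);

        DIR *d = opendir(path);
        if (!d)
                return;

        struct dirent *de;
        while ((de = readdir(d)) != NULL) {
                if (de->d_name[0] == '.')
                        continue;
                /* every entry here changes ownership together with the
                 * rest of the group, e.g. 0000:02:00.0, 0000:02:00.1 */
                printf("group member: %s\n", de->d_name);
        }
        closedir(d);
}

The GUI then shows one draggable object per such directory; which of the
members actually become visible inside the guest is a separate,
per-device decision.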
 
> > - The -minimum- granularity of pass-through is not always a single
> > device and not always under SW control
> 
> But IMHO, we need to preserve the granularity of exposing a device to a
> guest as a single device.  That might mean some devices are held hostage
> by an agent on the host.

Maybe but wouldn't that be even more confusing from a user perspective ?
And I think it makes things harder from the perspective of implementing
admin & management tools, too.

> > - Having a magic heuristic in libvirt to figure out those constraints is
> > WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
> > knowledge of PCI resource management and getting it wrong in many many
> > cases, something that took years to fix essentially by ripping it all
> > out. This is kernel knowledge and thus we need the kernel to expose in a
> > way or another what those constraints are, what those "partitionable
> > groups" are.
> > 
> > - That does -not- mean that we cannot specify for each individual device
> > within such a group where we want to put it in qemu (what devfn etc...).
> > As long as there is a clear understanding that the "ownership" of the
> > device goes with the group, this is somewhat orthogonal to how they are
> > represented in qemu. (Not completely... if the iommu is exposed to the
> > guest ,via paravirt for example, some of these constraints must be
> > exposed but I'll talk about that more later).
> 
> Or we can choose not to expose all of the devices in the group to the
> guest?

As I said, I don't mind if you don't, I'm just worried about the
consequences of that from a usability standpoint. Having advanced
command line options to fine tune is fine. Being able to specify within a
"group" which devices to show and at what address is fine.

But I believe the basic entity to be manipulated from an interface
standpoint remains the group.

To get back to my GUI example, once you've dragged & dropped your group of devices
over, you can have the option to open that group and check/uncheck
individual devices & assign them addresses if you want. That doesn't
change the fact that practically speaking, the whole group is now owned
by the guest.

I will go further than that actually. If you look at how the isolation
HW works on POWER, the fact that I have the MMIO segmentation means that
I can simply give the entire group's MMIO space to the guest. No problem
with small BARs, no need to slow-map them ... etc. That's a pretty handy
feature, don't you think ?

But that means that those other devices -will- be there, mapped along
with the one you care about. We may not expose them in config space but
they will be accessible. I suppose we can keep their IO/MEM decoding
disabled. But my point is that for all intents and purposes, they are
actually owned by the guest.
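
In practice, "give the entire segment to the guest" is just one big
mapping backed by a single KVM memory slot instead of one mapping per
BAR. A sketch of what I mean; the group fd and segment offset are
hypothetical, only the KVM ioctl is the real interface:

#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

/* Map a whole MMIO segment (page aligned by construction on our HW) and
 * back a single KVM slot with it, small BARs and all. */
static int map_mmio_segment(int vm_fd, int group_fd, uint64_t gpa,
                            uint64_t seg_size, uint64_t seg_offset, int slot)
{
        void *seg = mmap(NULL, seg_size, PROT_READ | PROT_WRITE,
                         MAP_SHARED, group_fd, seg_offset);
        if (seg == MAP_FAILED)
                return -1;

        struct kvm_userspace_memory_region region = {
                .slot            = slot,
                .guest_phys_addr = gpa,
                .memory_size     = seg_size,
                .userspace_addr  = (uintptr_t)seg,
        };
        return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}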

> > The interface currently proposed for VFIO (and associated uiommu)
> > doesn't handle that problem at all. Instead, it is entirely centered
> > around a specific "feature" of the VTd iommu's for creating arbitrary
> > domains with arbitrary devices (tho those devices -do- have the same
> > constraints exposed above, don't try to put 2 legacy PCI devices behind
> > the same bridge into 2 different domains !), but the API totally ignores
> > the problem, leaves it to libvirt "magic foo" and focuses on something
> > that is both quite secondary in the grand scheme of things, and quite
> > x86 VTd specific in the implementation and API definition.
> 
> To be fair, libvirt's "magic foo" is built out of the necessity that
> nobody else is defining the rules.

Sure, which is why I propose that the kernel exposes the rules, since
it's really the one right place to have that sort of HW constraint
knowledge, especially since it can be at least partially platform
specific.
 
 .../...

> > Maybe something like /sys/devgroups ? This probably warrants involving
> > more kernel people into the discussion.
> 
> I don't yet buy into passing groups to qemu since I don't buy into the
> idea of always exposing all of those devices to qemu.  Would it be
> sufficient to expose iommu nodes in sysfs that link to the devices
> behind them and describe properties and capabilities of the iommu
> itself?  More on this at the end.

Well, iommus aren't the only factor. I mentioned shared interrupts (and
my unwillingness to always trust DisINTx), there's also the MMIO
grouping I mentioned above (which in this case is an x86 -limitation-
with small BARs that I don't want to inherit, especially since it's based
on PAGE_SIZE and we commonly have a 64K page size on POWER), etc...

So I'm not too much of a fan of making it entirely look like the iommu is
the primary factor, but we -can-, that would be workable. I still prefer
calling a cat a cat and exposing the grouping for what it is, as I think
I've explained already above, tho.

 .../...

> > Now some of this can be fixed with tweaks, and we've started doing it
> > (we have a working pass-through using VFIO, forgot to mention that, it's
> > just that we don't like what we had to do to get there).
> 
> This is a result of wanting to support *unmodified* x86 guests.  We
> don't have the luxury of having a predefined pvDMA spec that all x86
> OSes adhere to. 

No, but you could emulate a HW iommu, no ?

>  The 32bit problem is unfortunate, but the priority use
> case for assigning devices to guests is high performance I/O, which
> usually entails modern, 64bit hardware.  I'd like to see us get to the
> point of having emulated IOMMU hardware on x86, which could then be
> backed by VFIO, but for now guest pinning is the most practical and
> useful.

For your current case maybe. It's just not very future proof imho.
Anyways, it's fixable, but the APIs as they are make it a bit clumsy.

 .../...

> > Also our next generation chipset may drop support for PIO completely.
> > 
> > On the other hand, because PIO is just a special range of MMIO for us,
> > we can do normal pass-through on it and don't need any of the emulation
> > done by qemu.
> 
> Maybe we can add mmap support to PIO regions on non-x86.

We have to, yes. I haven't looked into it yet, it should be easy if VFIO
kernel side starts using the "proper" PCI mmap interfaces in kernel (the
same interfaces sysfs & proc use).
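
For reference, the userspace side of that existing interface is just the
resourceN files in sysfs; something like this (device address and size
made up for the example):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        /* BAR0 of an arbitrary example device */
        const char *path = "/sys/bus/pci/devices/0000:01:00.0/resource0";
        int fd = open(path, O_RDWR);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* map the first page; a real user would size this from the
         * device's "resource" file */
        volatile uint32_t *bar = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                      MAP_SHARED, fd, 0);
        if (bar == MAP_FAILED) {
                perror("mmap");
                close(fd);
                return 1;
        }

        printf("first register: 0x%08x\n", bar[0]);
        munmap((void *)bar, 4096);
        close(fd);
        return 0;
}

If the VFIO kernel side sat on the same pci_mmap backend, getting our
PIO-as-MMIO ranges mapped would come more or less for free.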

> >   * MMIO constraints
> > 
> > The QEMU side VFIO code hard wires various constraints that are entirely
> > based on various requirements you decided you have on x86 but don't
> > necessarily apply to us :-)
> > 
> > Due to our paravirt nature, we don't need to masquerade the MSI-X table
> > for example. At all. If the guest configures crap into it, too bad, it
> > can only shoot itself in the foot since the host bridge enforces
> > validation anyways as I explained earlier. Because it's all paravirt, we
> > don't need to "translate" the interrupt vectors & addresses, the guest
> > will call hypercalls to configure things anyways.
> 
> With interrupt remapping, we can allow the guest access to the MSI-X
> table, but since that takes the host out of the loop, there's
> effectively no way for the guest to correctly program it directly by
> itself.

Right, I think what we need here is some kind of capability flags to
"disable" those "features" of qemu vfio.c that aren't needed on our
platform :-) Shouldn't be too hard. We need to make this runtime tho,
since different machines can have different "capabilities".
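
Just to sketch the shape of it (none of these names exist in vfio.c
today, they're purely illustrative):

#include <stdbool.h>

/* Hypothetical per-platform capability mask for qemu's vfio code; the
 * machine model decides at run time which pieces of emulation it needs. */
enum vfio_plat_caps {
        VFIO_PLAT_MSIX_MASQUERADE    = 1 << 0, /* trap & translate MSI-X table */
        VFIO_PLAT_BAR_EMULATION      = 1 << 1, /* emulate BAR sizing/moves */
        VFIO_PLAT_TRUST_DISINTX      = 1 << 2, /* rely on DisINTx for sharing */
        VFIO_PLAT_SLOWMAP_SMALL_BARS = 1 << 3, /* trap sub-page BARs */
};

/* x86 would set all of these; a PAPR machine could leave most clear. */
extern unsigned long vfio_plat_caps;

static bool vfio_needs_msix_trap(void)
{
        return vfio_plat_caps & VFIO_PLAT_MSIX_MASQUERADE;
}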

> > We don't need to prevent MMIO pass-through for small BARs at all. This
> > should be some kind of capability or flag passed by the arch. Our
> > segmentation of the MMIO domain means that we can give entire segments
> > to the guest and let it access anything in there (those segments are a
> > multiple of the page size always). Worst case it will access outside of
> > a device BAR within a segment and will cause the PE to go into error
> > state, shooting itself in the foot, there is no risk of side effect
> > outside of the guest boundaries.
> 
> Sure, this could be some kind of capability flag, maybe even implicit in
> certain configurations.

Yup.

> > In fact, we don't even need to emulate BAR sizing etc... in theory. Our
> > paravirt guests expect the BARs to have been already allocated for them
> > by the firmware and will pick up the addresses from the device-tree :-)
> > 
> > Today we use a "hack", putting all 0's in there and triggering the linux
> > code path to reassign unassigned resources (which will use BAR
> > emulation) but that's not what we are -supposed- to do. Not a big deal
> > and having the emulation there won't -hurt- us, it's just that we don't
> > really need any of it.
> > 
> > We have a small issue with ROMs. Our current KVM only works with huge
> > pages for guest memory but that is being fixed. So the way qemu maps the
> > ROM copy into the guest address space doesn't work. It might be handy
> > anyways to have a way for qemu to use MMIO emulation for ROM access as a
> > fallback. I'll look into it.
> 
> So that means ROMs don't work for you on emulated devices either?  The
> reason we read it once and map it into the guest is because Michael
> Tsirkin found a section in the PCI spec that indicates devices can share
> address decoders between BARs and ROM.

Yes, he is correct.

>   This means we can't just leave
> the enabled bit set in the ROM BAR, because it could actually disable an
> address decoder for a regular BAR.  We could slow-map the actual ROM,
> enabling it around each read, but shadowing it seemed far more
> efficient.

Right. We can slow map the ROM, or we can not care :-) At the end of the
day, what is the difference here between a "guest" under qemu and the
real thing bare metal on the machine ? IE. They have the same issue vs.
accessing the ROM. IE. I don't see why qemu should try to make it safe
to access it at any time while it isn't on a real machine. Since VFIO
resets the devices before putting them in guest space, they should be
accessible no ? (Might require a hard reset for some devices tho ... )

In any case, it's not a big deal and we can sort it out; I'm happy to
fall back to slow map to start with, and eventually we will support small
page mappings on POWER anyways, so it's a temporary limitation.

> >   * EEH
> > 
> > This is the name of those fancy error handling & isolation features I
> > mentioned earlier. To some extent it's a superset of AER, but we don't
> > generally expose AER to guests (or even the host), it's swallowed by
> > firmware into something else that provides a superset (well mostly) of
> > the AER information, and allow us to do those additional things like
> > isolating/de-isolating, reset control etc...
> > 
> > Here too, we'll need arch specific APIs through VFIO. Not necessarily a
> > huge deal, I mention it for completeness.
> 
> We expect to do AER via the VFIO netlink interface, which even though
> it's bashed below, would be quite extensible to supporting different
> kinds of errors.

As could platform specific ioctls :-)

> >    * Misc
> > 
> > There's lots of small bits and pieces... in no special order:
> > 
> >  - netlink ? WTF ! Seriously, we don't need a hybrid API with a bit of
> > netlink and a bit of ioctl's ... it's not like there's something
> > fundamentally  better for netlink vs. ioctl... it really depends what
> > you are doing, and in this case I fail to see what netlink brings you
> > other than bloat and more stupid userspace library deps.
> 
> The netlink interface is primarily for host->guest signaling.  I've only
> implemented the remove command (since we're lacking a pcie-host in qemu
> to do AER), but it seems to work quite well.  If you have suggestions
> for how else we might do it, please let me know.  This seems to be the
> sort of thing netlink is supposed to be used for.

I don't understand what the advantage of netlink is compared to just
extending your existing VFIO ioctl interface, possibly using child fds as
we do for example with spufs, but it's not a huge deal. It's just that
netlink has its own gotchas and I don't like multi-headed interfaces.
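
For example, something along these lines (everything below is invented,
it only shows that an ioctl handing back an eventfd covers the
host->guest signalling case without a second interface):

#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>

struct vfio_event_fd {
        uint32_t type;      /* e.g. "please release this device", AER error */
        int32_t  eventfd;   /* created by userspace, signalled by the kernel */
};

#define VFIO_DEVICE_SET_EVENTFD  _IOW(';', 200, struct vfio_event_fd)

static int vfio_arm_remove_event(int device_fd)
{
        struct vfio_event_fd ev = {
                .type    = 1,           /* DEVICE_REMOVE_REQUEST, say */
                .eventfd = eventfd(0, 0),
        };
        if (ev.eventfd < 0)
                return -1;
        /* qemu polls this fd alongside its other event sources */
        return ioctl(device_fd, VFIO_DEVICE_SET_EVENTFD, &ev);
}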

> >  - I don't like too much the fact that VFIO provides yet another
> > different API to do what we already have at least 2 kernel APIs for, ie,
> > BAR mapping and config space access. At least it should be better at
> > using the backend infrastructure of the 2 others (sysfs & procfs). I
> > understand it wants to filter in some case (config space) and -maybe-
> > yet another API is the right way to go but allow me to have my doubts.
> 
> The use of PCI sysfs is actually one of my complaints about current
> device assignment.  To do assignment with an unprivileged guest we need
> to open the PCI sysfs config file for it, then change ownership on a
> handful of other PCI sysfs files, then there's this other pci-stub thing
> to maintain ownership, but the kvm ioctls don't actually require it and
> can grab onto any free device...  We are duplicating some of that in
> VFIO, but we also put the ownership of the device behind a single device
> file.  We do have the uiommu problem that we can't give an unprivileged
> user ownership of that, but your usage model may actually make that
> easier.  More below...
> 
> > One thing I thought about but you don't seem to like it ... was to use
> > the need to represent the partitionable entity as groups in sysfs that I
> > talked about earlier. Those could have per-device subdirs with the usual
> > config & resource files, same semantic as the ones in the real device,
> > but when accessed via the group they get filtering. It might or might not
> > be practical in the end, tbd, but it would allow apps using a slightly
> > modified libpci for example to exploit some of this.
> 
> I may be tainted by our disagreement that all the devices in a group
> need to be exposed to the guest and qemu could just take a pointer to a
> sysfs directory.  That seems very unlike qemu and pushes more of the
> policy into qemu, which seems like the wrong direction.

I don't see how it pushes "policy" into qemu.

The "policy" here is imposed by the HW setup and exposed by the
kernel :-) Giving qemu a group means qemu takes "ownership" of that
bunch of devices; so far I don't see what's policy about that. From
there, it would be "handy" for people to just stop there and see all the
devices of the group show up in the guest, but by all means feel free to
suggest a command line interface that allows one to specify more
precisely which of the devices in the group to pass through and at what
address.

> >  - The qemu vfio code hooks directly into ioapic ... of course that
> > won't fly with anything !x86
> 
> I spent a lot of time looking for an architecture neutral solution here,
> but I don't think it exists.  Please prove me wrong.

No it doesn't, I agree; that's why it should be some kind of notifier or
function pointer set up by the platform specific code.

>   The problem is
> that we have to disable INTx on an assigned device after it fires (VFIO
> does this automatically).  If we don't do this, a non-responsive or
> malicious guest could sit on the interrupt, causing it to fire
> repeatedly as a DoS on the host.  The only indication that we can rely
> on to re-enable INTx is when the guest CPU writes an EOI to the APIC.
> We can't just wait for device accesses because a) the device CSRs are
> (hopefully) direct mapped and we'd have to slow map them or attempt to
> do some kind of dirty logging to detect when they're accesses b) what
> constitutes an interrupt service is device specific.
> 
> That means we need to figure out how PCI interrupt 'A' (or B...)
> translates to a GSI (Global System Interrupt - ACPI definition, but
> hopefully a generic concept).  That GSI identifies a pin on an IOAPIC,
> which will also see the APIC EOI.  And just to spice things up, the
> guest can change the PCI to GSI mappings via ACPI.  I think the set of
> callbacks I've added are generic (maybe I left ioapic in the name), but
> yes they do need to be implemented for other architectures.  Patches
> appreciated from those with knowledge of the systems and/or access to
> device specs.  This is the only reason that I make QEMU VFIO only build
> for x86.

Right, and we need to cook a similar sauce for POWER; it's an area that
has to be arch specific (and in fact specific to the particular HW
machine being emulated), so we just need to find out the cleanest way for
the platform to "register" the right callbacks here.
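
Something as small as an ops structure registered by the machine model
would probably do; the names below are made up, it's only to show the
shape:

/* Hypothetical platform hook so the INTx re-enable logic isn't wired
 * straight into the ioapic code. */
struct vfio_intx_ops {
        /* resolve a device's INTx pin to the platform interrupt source */
        int  (*route_intx)(void *machine, int devfn, int pin);
        /* ask to be called when the guest EOIs that source, so VFIO can
         * unmask the host-masked interrupt again */
        void (*register_eoi_notifier)(int platform_irq,
                                      void (*eoi)(void *), void *opaque);
};

/* x86 backs this with the existing GSI/ioapic code; a PAPR machine backs
 * it with its XICS/hypercall interrupt model. */
void vfio_register_intx_ops(const struct vfio_intx_ops *ops);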

Not a big deal, I just felt like mentioning it :-)

> >  - The various "objects" dealt with here, -especially- interrupts and
> > iommu, need a better in-kernel API so that fast in-kernel emulation can
> > take over from qemu based emulation. The way we need to do some of this
> > on POWER differs from x86. We can elaborate later, it's not necessarily
> > a killer either but essentially we'll take the bulk of interrupt
> > handling away from VFIO to the point where it won't see any of it at
> > all.
> 
> The plan for x86 is to connect VFIO eventfds directly to KVM irqfds and
> bypass QEMU.  This is exactly what VHOST does today and fairly trivial
> to enable for MSI once we get it merged.  INTx would require us to be
> able to define a level triggered irqfd in KVM and it's not yet clear if
> we care that much about INTx performance.

I care enough because our exit cost to qemu is much higher than on x86, and
I can pretty easily emulate my PIC entirely in real mode (from within
the guest context) which is what I intend to do :-)

On the other hand, I have no reason to treat MSI or LSI differently, so
all I really need to do is get back to the underlying platform HW
interrupt number, and I think I can do that. So as long as I have a hook
to know what's there and what has been enabled, these interrupts will
simply cease to be visible to either qemu or vfio.

Another reason why I don't like allowing shared interrupts in different
guests with DisINTx :-) Because that means that such interrupts would
have to go back all the way to qemu/vfio :-) But I can always have a
fallback there; it's really the problem of "trusting" DisINTx that
concerns me.

> We don't currently have a plan for accelerating IOMMU access since our
> current usage model doesn't need one.  We also need to consider MSI-X
> table acceleration for x86.  I hope we'll be able to use the new KVM
> ioctls for this.

Ok, we can give direct access to the MSI-X table to the guest on power
so that isn't an issue for us.

> Thanks for the write up, I think it will be good to let everyone digest
> it before we discuss this at KVM forum.

Agreed. As I think I may have mentioned already, I won't be able to make
it to the forum, but Paulus will, and I'll be in a nearby timezone, so I
might be able to join a call if it's deemed useful.

> Rather than your "groups" idea, I've been mulling over whether we can
> just expose the dependencies, configuration, and capabilities in sysfs
> and build qemu commandlines to describe it.  For instance, if we simply
> start with creating iommu nodes in sysfs, we could create links under
> each iommu directory to the devices behind them.  Some kind of
> capability file could define properties like whether it's page table
> based or fixed iova window or the granularity of mapping the devices
> behind it.  Once we have that, we could probably make uiommu attach to
> each of those nodes.

Well, s/iommu/groups and you are pretty close to my original idea :-)

I don't mind that much what the details are, but I like the idea of not
having to construct a 3-page command line every time I want to
pass-through a device; most "simple" usage scenarios don't care that
much.

> That means we know /dev/uiommu7 (random example) is our access to a
> specific iommu with a given set of devices behind it.

Linking those sysfs iommus or groups to a /dev/ entry is fine by me.
  
>   If that iommu is
> a PE (via those capability files), then a user space entity (trying hard
> not to call it libvirt) can unbind all those devices from the host,
> maybe bind the ones it wants to assign to a guest to vfio and bind the
> others to pci-stub for safe keeping.  If you trust a user with
> everything in a PE, bind all the devices to VFIO, chown all
> the /dev/vfioX entries for those devices, and the /dev/uiommuX device.
>
> We might then come up with qemu command lines to describe interesting
> configurations, such as:
> 
> -device iommu,model=PAPR,uiommu=/dev/uiommu7,id=iommu.0 \
> -device pci-bus,...,iommu=iommu0,id=pci.0 \
> -device vfio,host=ssss:bb:dd.f,bus=pci.0,addr=dd.f,id=hostdev0
> 
> The userspace entity would obviously need to put things in the same PE
> in the right place, but it doesn't seem to take a lot of sysfs info to
> get that right.
> 
> Today we do DMA mapping via the VFIO device because the capabilities of
> the IOMMU domains change depending on which devices are connected (for
> VT-d, the least common denominator of the IOMMUs in play).  Forcing the
> DMA mappings through VFIO naturally forces the call order.  If we moved
> to something like above, we could switch the DMA mapping to the uiommu
> device, since the IOMMU would have fixed capabilities.

That makes sense.
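
And in the fixed-capability case the map call itself needs very little; a
hypothetical ioctl on the uiommu/group device (struct and number
invented) would be enough:

#include <stdint.h>
#include <sys/ioctl.h>

struct uiommu_dma_map {
        uint64_t iova;   /* device-visible address */
        uint64_t vaddr;  /* qemu virtual address backing it */
        uint64_t size;   /* bytes, a multiple of the iommu page size */
        uint32_t flags;  /* read/write permission bits */
};

#define UIOMMU_MAP_DMA  _IOW(';', 300, struct uiommu_dma_map)

static int map_guest_ram(int uiommu_fd, uint64_t iova, void *host,
                         uint64_t len)
{
        struct uiommu_dma_map map = {
                .iova  = iova,
                .vaddr = (uintptr_t)host,
                .size  = len,
                .flags = 0x3,   /* read | write, for illustration */
        };
        return ioctl(uiommu_fd, UIOMMU_MAP_DMA, &map);
}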

> What gaps would something like this leave for your IOMMU granularity
> problems?  I'll need to think through how it works when we don't want to
> expose the iommu to the guest, maybe a model=none (default) that doesn't
> need to be connected to a pci bus and maps all guest memory.  Thanks,

Well, I would map those "iommus" to PEs, so what remains is the path for
putting in all the "other" bits and pieces, such as informing qemu of the
location and size of the MMIO segment(s) (so we can map the whole thing
and not bother with individual BARs) etc...

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 322+ messages in thread


* Re: kvm PCI assignment & VFIO ramblings
@ 2011-07-30 23:55     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 322+ messages in thread
From: Benjamin Herrenschmidt @ 2011-07-30 23:55 UTC (permalink / raw)
  To: Alex Williamson
  Cc: chrisw, Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	qemu-devel, David Gibson, aafabbri, iommu, Anthony Liguori,
	linuxppc-dev, benve

On Sat, 2011-07-30 at 12:20 -0600, Alex Williamson wrote:

> On x86, the USB controllers don't typically live behind a PCIe-to-PCI
> bridge, so don't suffer the source identifier problem, but they do often
> share an interrupt.  But even then, we can count on most modern devices
> supporting PCI2.3, and thus the DisINTx feature, which allows us to
> share interrupts.  In any case, yes, it's more rare but we need to know
> how to handle devices behind PCI bridges.  However I disagree that we
> need to assign all the devices behind such a bridge to the guest.

Well, ok so let's dig a bit more here :-) First, yes I agree they don't
all need to appear to the guest. My point is really that we must prevent
them to be "used" by somebody else, either host or another guest.

Now once you get there, I personally prefer having a clear "group"
ownership rather than having devices stay in some "limbo" under vfio
control but it's an implementation detail.

Regarding DisINTx, well, it's a bit like putting separate PCIe functions
into separate guests, it looks good ... but you are taking a chance.
Note that I do intend to do some of that for power ... well I think, I
haven't completely made my mind.

pHyp for has a stricter requirement, PEs essentially are everything
behind a bridge. If you have a slot, you have some kind of bridge above
this slot and everything on it will be a PE.

The problem I see is that with your filtering of config space, BAR
emulation, DisINTx etc... you essentially assume that you can reasonably
reliably isolate devices. But in practice, it's chancy. Some devices for
example have "backdoors" into their own config space via MMIO. If I have
such a device in a guest, I can completely override your DisINTx and
thus DOS your host or another guest with a shared interrupt. I can move
my MMIO around and DOS another function by overlapping the addresses.

You can really only be protect yourself against a device if you have it
behind a bridge (in addition to having a filtering iommu), which limits
the MMIO span (and thus letting the guest whack the BARs randomly will
only allow that guest to shoot itself in the foot).

Some bridges also provide a way to block INTx below them which comes in
handy but it's bridge specific. Some devices can be coerced to send the
INTx "assert" message and never de-assert it (for example by doing a
soft-reset while it's asserted, which can be done with some devices with
an MMIO).

Anything below a PCIe -> PCI/PCI-X needs to also be "grouped" due to
simple lack of proper filtering by the iommu (PCI-X in theory has RIDs
and fowards them up, but this isn't very reliable, for example it fails
over with split transactions).

Fortunately in PCIe land, we most have bridges above everything. The
problem somewhat remains with functions of a device, how can you be sure
that there isn't a way via some MMIO to create side effects on the other
functions of the device ? (For example by checkstopping the whole
thing). You can't really :-)

So it boils down of the "level" of safety/isolation you want to provide,
and I suppose to some extent it's a user decision but the user needs to
be informed to some extent. A hard problem :-)
 
> There's a difference between removing the device from the host and
> exposing the device to the guest.  If I have a NIC and HBA behind a
> bridge, it's perfectly reasonable that I might only assign the NIC to
> the guest, but as you describe, we then need to prevent the host, or any
> other guest from making use of the HBA.

Yes. However the other device is in "limbo" and it may be not clear to
the user why it can't be used anymore :-)

The question is more, the user needs to "know" (or libvirt does, or
somebody ... ) that in order to pass-through device A, it must also
"remove" device B from the host. How can you even provide a meaningful
error message to the user if all VFIO does is give you something like
-EBUSY ?

So the information about the grouping constraint must trickle down
somewhat.

Look at it from a GUI perspective for example. Imagine a front-end
showing you devices in your system and allowing you to "Drag & drop"
them to your guest. How do you represent that need for grouping ? First
how do you expose it from kernel/libvirt to the GUI tool and how do you
represent it to the user ?

By grouping the devices in logical groups which end up being the
"objects" you can drag around, at least you provide some amount of
clarity. Now if you follow that path down to how the GUI app, libvirt
and possibly qemu need to know / resolve the dependency, being given the
"groups" as the primary information of what can be used for pass-through
makes everything a lot simpler.
 
> > - The -minimum- granularity of pass-through is not always a single
> > device and not always under SW control
> 
> But IMHO, we need to preserve the granularity of exposing a device to a
> guest as a single device.  That might mean some devices are held hostage
> by an agent on the host.

Maybe but wouldn't that be even more confusing from a user perspective ?
And I think it makes it harder from an implementation of admin &
management tools perspective too.

> > - Having a magic heuristic in libvirt to figure out those constraints is
> > WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
> > knowledge of PCI resource management and getting it wrong in many many
> > cases, something that took years to fix essentially by ripping it all
> > out. This is kernel knowledge and thus we need the kernel to expose in a
> > way or another what those constraints are, what those "partitionable
> > groups" are.
> > 
> > - That does -not- mean that we cannot specify for each individual device
> > within such a group where we want to put it in qemu (what devfn etc...).
> > As long as there is a clear understanding that the "ownership" of the
> > device goes with the group, this is somewhat orthogonal to how they are
> > represented in qemu. (Not completely... if the iommu is exposed to the
> > guest ,via paravirt for example, some of these constraints must be
> > exposed but I'll talk about that more later).
> 
> Or we can choose not to expose all of the devices in the group to the
> guest?

As I said, I don't mind if you don't, I'm just worried about the
consequences of that from a usability standpoint. Having advanced
command line option to fine tune is fine. Being able to specify within a
"group" which devices to show and at what address if fine.

But I believe the basic entity to be manipulated from an interface
standpoitn remains the group.

To get back to my GUI example, once you've D&D your group of devices
over, you can have the option to open that group and check/uncheck
individual devices & assign them addresses if you want. That doesn't
change the fact that practically speaking, the whole group is now owned
by the guest.

I will go further than that actually. If you look at how the isolation
HW works on POWER, the fact that I have the MMIO segmentation means that
I can simply give the entire group MMIO space to the guest. No problem
of small BARs, no need to slow-map them ... etc.. that's a pretty handy
feature don't you think ?

But that means that those other devices -will- be there, mapped along
with the one you care about. We may not expose it in config space but it
will be accessible. I suppose we can keep its IO/MEM decoding disabled.
But my point is that for all intend and purpose, it's actually owned by
the guest.

> > The interface currently proposed for VFIO (and associated uiommu)
> > doesn't handle that problem at all. Instead, it is entirely centered
> > around a specific "feature" of the VTd iommu's for creating arbitrary
> > domains with arbitrary devices (tho those devices -do- have the same
> > constraints exposed above, don't try to put 2 legacy PCI devices behind
> > the same bridge into 2 different domains !), but the API totally ignores
> > the problem, leaves it to libvirt "magic foo" and focuses on something
> > that is both quite secondary in the grand scheme of things, and quite
> > x86 VTd specific in the implementation and API definition.
> 
> To be fair, libvirt's "magic foo" is built out of the necessity that
> nobody else is defining the rules.

Sure, which is why I propose that the kernel exposes the rules, since
it's really the one right place to have that sort of HW constraint
knowledge, especially since it can be at least partially platform
specific.
 
 .../...

> > Maybe something like /sys/devgroups ? This probably warrants involving
> > more kernel people into the discussion.
> 
> I don't yet buy into passing groups to qemu since I don't buy into the
> idea of always exposing all of those devices to qemu.  Would it be
> sufficient to expose iommu nodes in sysfs that link to the devices
> behind them and describe properties and capabilities of the iommu
> itself?  More on this at the end.

Well, iommus aren't the only factor. I mentioned shared interrupts (and
my unwillingness to always trust DisINTx), there's also the MMIO
grouping I mentioned above (in which case it's an x86 -limitation- with
small BARs that I don't want to inherit, especially since it's based on
PAGE_SIZE and we commonly have 64K page size on POWER), etc...

So I'm not too much of a fan of making it entirely look like the iommu
is the primary factor, but we -can-, and that would be workable. I still
prefer calling a cat a cat and exposing the grouping for what it is, as
I think I've explained already above, tho.

 .../...

> > Now some of this can be fixed with tweaks, and we've started doing it
> > (we have a working pass-through using VFIO, forgot to mention that, it's
> > just that we don't like what we had to do to get there).
> 
> This is a result of wanting to support *unmodified* x86 guests.  We
> don't have the luxury of having a predefined pvDMA spec that all x86
> OSes adhere to. 

No but you could emulate a HW iommu no ?

>  The 32bit problem is unfortunate, but the priority use
> case for assigning devices to guests is high performance I/O, which
> usually entails modern, 64bit hardware.  I'd like to see us get to the
> point of having emulated IOMMU hardware on x86, which could then be
> backed by VFIO, but for now guest pinning is the most practical and
> useful.

For your current case maybe. It's just not very future proof imho.
Anyways, it's fixable, but the APIs as they are make it a bit clumsy.

 .../...

> > Also our next generation chipset may drop support for PIO completely.
> > 
> > On the other hand, because PIO is just a special range of MMIO for us,
> > we can do normal pass-through on it and don't need any of the emulation
> > done by qemu.
> 
> Maybe we can add mmap support to PIO regions on non-x86.

We have to yes. I haven't looked into it yet, it should be easy if VFIO
kernel side starts using the "proper" PCI mmap interfaces in kernel (the
same interfaces sysfs & proc use).

> >   * MMIO constraints
> > 
> > The QEMU side VFIO code hard wires various constraints that are entirely
> > based on various requirements you decided you have on x86 but don't
> > necessarily apply to us :-)
> > 
> > Due to our paravirt nature, we don't need to masquerade the MSI-X table
> > for example. At all. If the guest configures crap into it, too bad, it
> > can only shoot itself in the foot since the host bridge enforces
> > validation anyways as I explained earlier. Because it's all paravirt, we
> > don't need to "translate" the interrupt vectors & addresses, the guest
> > will call hypercalls to configure things anyways.
> 
> With interrupt remapping, we can allow the guest access to the MSI-X
> table, but since that takes the host out of the loop, there's
> effectively no way for the guest to correctly program it directly by
> itself.

Right, I think what we need here is some kind of capability to
"disable" those "features" of qemu vfio.c that aren't needed on our
platform :-) Shouldn't be too hard. We need to make this a runtime
decision tho, since different machines can have different
"capabilities".
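
Something along these lines is all I have in mind (made-up names, purely
a sketch of the shape, nothing like it exists in qemu's vfio.c today):

/* Hypothetical capability flags a machine/platform backend would set at
 * init time; qemu's vfio code would test them instead of hard wiring
 * the x86 behaviour. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

enum {
    VFIO_PLAT_NEED_MSIX_MASQ       = 1u << 0, /* intercept MSI-X table  */
    VFIO_PLAT_NEED_BAR_EMUL        = 1u << 1, /* emulate BAR sizing     */
    VFIO_PLAT_NEED_SLOWMAP_SUBPAGE = 1u << 2, /* slow-map sub-page BARs */
};

struct vfio_platform_caps {
    uint32_t needs;
};

static bool plat_needs(const struct vfio_platform_caps *c, uint32_t f)
{
    return (c->needs & f) != 0;
}

int main(void)
{
    struct vfio_platform_caps x86 = {
        .needs = VFIO_PLAT_NEED_MSIX_MASQ | VFIO_PLAT_NEED_BAR_EMUL |
                 VFIO_PLAT_NEED_SLOWMAP_SUBPAGE,
    };
    struct vfio_platform_caps papr = { .needs = 0 }; /* paravirt POWER */

    printf("x86 needs MSI-X masquerading:  %d\n",
           plat_needs(&x86, VFIO_PLAT_NEED_MSIX_MASQ));
    printf("papr needs MSI-X masquerading: %d\n",
           plat_needs(&papr, VFIO_PLAT_NEED_MSIX_MASQ));
    return 0;
}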

> > We don't need to prevent MMIO pass-through for small BARs at all. This
> > should be some kind of capability or flag passed by the arch. Our
> > segmentation of the MMIO domain means that we can give entire segments
> > to the guest and let it access anything in there (those segments are a
> > multiple of the page size always). Worst case it will access outside of
> > a device BAR within a segment and will cause the PE to go into error
> > state, shooting itself in the foot, there is no risk of side effect
> > outside of the guest boundaries.
> 
> Sure, this could be some kind of capability flag, maybe even implicit in
> certain configurations.

Yup.

> > In fact, we don't even need to emulate BAR sizing etc... in theory. Our
> > paravirt guests expect the BARs to have been already allocated for them
> > by the firmware and will pick up the addresses from the device-tree :-)
> > 
> > Today we use a "hack", putting all 0's in there and triggering the linux
> > code path to reassign unassigned resources (which will use BAR
> > emulation) but that's not what we are -supposed- to do. Not a big deal
> > and having the emulation there won't -hurt- us, it's just that we don't
> > really need any of it.
> > 
> > We have a small issue with ROMs. Our current KVM only works with huge
> > pages for guest memory but that is being fixed. So the way qemu maps the
> > ROM copy into the guest address space doesn't work. It might be handy
> > anyways to have a way for qemu to use MMIO emulation for ROM access as a
> > fallback. I'll look into it.
> 
> So that means ROMs don't work for you on emulated devices either?  The
> reason we read it once and map it into the guest is because Michael
> Tsirkin found a section in the PCI spec that indicates devices can share
> address decoders between BARs and ROM.

Yes, he is correct.

>   This means we can't just leave
> the enabled bit set in the ROM BAR, because it could actually disable an
> address decoder for a regular BAR.  We could slow-map the actual ROM,
> enabling it around each read, but shadowing it seemed far more
> efficient.

Right. We can slow map the ROM, or we can not care :-) At the end of
the day, what is the difference here between a "guest" under qemu and
the real thing running bare metal on the machine ? They have the same
issue vs. accessing the ROM, so I don't see why qemu should try to make
it safe to access it at any time when it isn't safe on a real machine.
Since VFIO resets the devices before putting them in guest space, they
should be accessible, no ? (Might require a hard reset for some devices
tho ... )

In any case, it's not a big deal and we can sort it out. I'm happy to
fall back to slow map to start with, and eventually we will support
small page mappings on POWER anyways, so it's a temporary limitation.

> >   * EEH
> > 
> > This is the name of those fancy error handling & isolation features I
> > mentioned earlier. To some extent it's a superset of AER, but we don't
> > generally expose AER to guests (or even the host), it's swallowed by
> > firmware into something else that provides a superset (well mostly) of
> > the AER information, and allow us to do those additional things like
> > isolating/de-isolating, reset control etc...
> > 
> > Here too, we'll need arch specific APIs through VFIO. Not necessarily a
> > huge deal, I mention it for completeness.
> 
> We expect to do AER via the VFIO netlink interface, which even though
> it's bashed below, would be quite extensible to supporting different
> kinds of errors.

As could platform specific ioctls :-)

> >    * Misc
> > 
> > There's lots of small bits and pieces... in no special order:
> > 
> >  - netlink ? WTF ! Seriously, we don't need a hybrid API with a bit of
> > netlink and a bit of ioctl's ... it's not like there's something
> > fundamentally  better for netlink vs. ioctl... it really depends what
> > you are doing, and in this case I fail to see what netlink brings you
> > other than bloat and more stupid userspace library deps.
> 
> The netlink interface is primarily for host->guest signaling.  I've only
> implemented the remove command (since we're lacking a pcie-host in qemu
> to do AER), but it seems to work quite well.  If you have suggestions
> for how else we might do it, please let me know.  This seems to be the
> sort of thing netlink is supposed to be used for.

I don't understand what the advantage of netlink is compared to just
extending your existing VFIO ioctl interface, possibly using child fds
as we do for example with spufs, but it's not a huge deal. It's just
that netlink has its own gotchas and I don't like multi-headed
interfaces.

> >  - I don't like too much the fact that VFIO provides yet another
> > different API to do what we already have at least 2 kernel APIs for, ie,
> > BAR mapping and config space access. At least it should be better at
> > using the backend infrastructure of the 2 others (sysfs & procfs). I
> > understand it wants to filter in some case (config space) and -maybe-
> > yet another API is the right way to go but allow me to have my doubts.
> 
> The use of PCI sysfs is actually one of my complaints about current
> device assignment.  To do assignment with an unprivileged guest we need
> to open the PCI sysfs config file for it, then change ownership on a
> handful of other PCI sysfs files, then there's this other pci-stub thing
> to maintain ownership, but the kvm ioctls don't actually require it and
> can grab onto any free device...  We are duplicating some of that in
> VFIO, but we also put the ownership of the device behind a single device
> file.  We do have the uiommu problem that we can't give an unprivileged
> user ownership of that, but your usage model may actually make that
> easier.  More below...
> 
> > One thing I thought about but you don't seem to like it ... was to use
> > the need to represent the partitionable entity as groups in sysfs that I
> > talked about earlier. Those could have per-device subdirs with the usual
> > config & resource files, same semantic as the ones in the real device,
> > but when accessed via the group they get filtering. It might or might not
> > be practical in the end, tbd, but it would allow apps using a slightly
> > modified libpci for example to exploit some of this.
> 
> I may be tainted by our disagreement that all the devices in a group
> need to be exposed to the guest and qemu could just take a pointer to a
> sysfs directory.  That seems very unlike qemu and pushes more of the
> policy into qemu, which seems like the wrong direction.

I don't see how it pushes "policy" into qemu.

The "policy" here is imposed by the HW setup and exposed by the
kernel :-) Giving qemu a group means qemu takes "owership" of that bunch
of devices, so far I don't see what's policy about that. From there, it
would be "handy" for people to just stop there and just see all the
devices of the group show up in the guest, but by all means feel free to
suggest a command line interface that allows to more precisely specify
which of the devices in the group to pass through and at what address.

> >  - The qemu vfio code hooks directly into ioapic ... of course that
> > won't fly with anything !x86
> 
> I spent a lot of time looking for an architecture neutral solution here,
> but I don't think it exists.  Please prove me wrong.

No, it doesn't, I agree. That's why it should be some kind of notifier
or function pointer set up by the platform specific code.
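
Roughly something like this (hypothetical names, just to show the shape,
this is not existing qemu code): the machine code registers a small ops
table at init time, and the vfio code calls that instead of poking the
ioapic directly:

/* Sketch of a platform-neutral INTx callback table. */
#include <stdio.h>

struct vfio_intx_ops {
    /* Device fired INTx and VFIO masked it. */
    void (*intx_fired)(void *opaque, int host_irq);
    /* Guest EOI'd the corresponding interrupt; safe to unmask INTx. */
    void (*guest_eoi)(void *opaque, int host_irq);
};

static const struct vfio_intx_ops *plat_ops;
static void *plat_opaque;

static void vfio_register_intx_ops(const struct vfio_intx_ops *ops,
                                   void *opaque)
{
    plat_ops = ops;
    plat_opaque = opaque;
}

/* Example platform hooks: ioapic-based on x86, XICS/paravirt on POWER. */
static void demo_fired(void *opaque, int irq)
{
    (void)opaque;
    printf("mask host irq %d until guest EOI\n", irq);
}

static void demo_eoi(void *opaque, int irq)
{
    (void)opaque;
    printf("guest EOI seen, unmask host irq %d\n", irq);
}

int main(void)
{
    static const struct vfio_intx_ops demo = {
        .intx_fired = demo_fired,
        .guest_eoi  = demo_eoi,
    };

    vfio_register_intx_ops(&demo, NULL);
    plat_ops->intx_fired(plat_opaque, 23);
    plat_ops->guest_eoi(plat_opaque, 23);
    return 0;
}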

>   The problem is
> that we have to disable INTx on an assigned device after it fires (VFIO
> does this automatically).  If we don't do this, a non-responsive or
> malicious guest could sit on the interrupt, causing it to fire
> repeatedly as a DoS on the host.  The only indication that we can rely
> on to re-enable INTx is when the guest CPU writes an EOI to the APIC.
> We can't just wait for device accesses because a) the device CSRs are
> (hopefully) direct mapped and we'd have to slow map them or attempt to
> do some kind of dirty logging to detect when they're accessed b) what
> constitutes an interrupt service is device specific.
> 
> That means we need to figure out how PCI interrupt 'A' (or B...)
> translates to a GSI (Global System Interrupt - ACPI definition, but
> hopefully a generic concept).  That GSI identifies a pin on an IOAPIC,
> which will also see the APIC EOI.  And just to spice things up, the
> guest can change the PCI to GSI mappings via ACPI.  I think the set of
> callbacks I've added are generic (maybe I left ioapic in the name), but
> yes they do need to be implemented for other architectures.  Patches
> appreciated from those with knowledge of the systems and/or access to
> device specs.  This is the only reason that I make QEMU VFIO only build
> for x86.

Right, and we need to cook a similar sauce for POWER. It's an area that
has to be arch specific (and in fact specific to the particular HW
machine being emulated), so we just need to find out what's the cleanest
way for the platform to "register" the right callbacks here.

Not a big deal, I just felt like mentioning it :-)

> >  - The various "objects" dealt with here, -especially- interrupts and
> > iommu, need a better in-kernel API so that fast in-kernel emulation can
> > take over from qemu based emulation. The way we need to do some of this
> > on POWER differs from x86. We can elaborate later, it's not necessarily
> > a killer either but essentially we'll take the bulk of interrupt
> > handling away from VFIO to the point where it won't see any of it at
> > all.
> 
> The plan for x86 is to connect VFIO eventfds directly to KVM irqfds and
> bypass QEMU.  This is exactly what VHOST does today and fairly trivial
> to enable for MSI once we get it merged.  INTx would require us to be
> able to define a level triggered irqfd in KVM and it's not yet clear if
> we care that much about INTx performance.

I care enough because our exit cost to qemu is much higher than x86, and
I can pretty easily emulate my PIC entirely in real mode (from within
the guest context) which is what I intend to do :-)

On the other hand, I have no reason to treat MSI or LSI differently, so
all I really need to do is get back to the underlying platform HW
interrupt number, and I think I can do that. So as long as I have a hook
to know what's there and what has been enabled, these interrupts will
simply cease to be visible to either qemu or vfio.

Another reason why I don't like allowing shared interrupts in different
guests with DisINTx :-) Because that means that such interrupts would
have to go back all the way to qemu/vfio :-) But I can always have a
fallback there, it's really the problem of "trusting" DisINTx that
concerns me.

> We don't currently have a plan for accelerating IOMMU access since our
> current usage model doesn't need one.  We also need to consider MSI-X
> table acceleration for x86.  I hope we'll be able to use the new KVM
> ioctls for this.

Ok, we can give direct access to the MSI-X table to the guest on POWER,
so that isn't an issue for us.

> Thanks for the write up, I think it will be good to let everyone digest
> it before we discuss this at KVM forum.

Agreed. As I think I may have mentioned already, I won't be able to make
it to the forum, but Paulus will, and I'll be in a nearby timezone, so I
might be able to join a call if it's deemed useful.

> Rather than your "groups" idea, I've been mulling over whether we can
> just expose the dependencies, configuration, and capabilities in sysfs
> and build qemu commandlines to describe it.  For instance, if we simply
> start with creating iommu nodes in sysfs, we could create links under
> each iommu directory to the devices behind them.  Some kind of
> capability file could define properties like whether it's page table
> based or fixed iova window or the granularity of mapping the devices
> behind it.  Once we have that, we could probably make uiommu attach to
> each of those nodes.

Well, s/iommu/groups and you are pretty close to my original idea :-)

I don't mind that much what the details are, but I like the idea of not
having to construct a 3-page command line every time I want to
pass through a device; most "simple" usage scenarios don't care that
much.

> That means we know /dev/uiommu7 (random example) is our access to a
> specific iommu with a given set of devices behind it.

Linking those sysfs iommus or groups to a /dev/ entry is fine by me.
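
As a strawman (the /sys/devgroups path below is made up, nothing exposes
it today), a management tool, or qemu itself, could then enumerate the
members of a group with nothing more than readdir():

/* Sketch only: hypothetical group directory full of symlinks to the
 * member PCI devices. */
#include <dirent.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    const char *group = argc > 1 ? argv[1] : "/sys/devgroups/0/devices";
    DIR *d = opendir(group);
    struct dirent *e;

    if (!d) {
        perror(group);
        return 1;
    }
    while ((e = readdir(d)) != NULL) {
        if (strcmp(e->d_name, ".") != 0 && strcmp(e->d_name, "..") != 0)
            printf("group member: %s\n", e->d_name); /* e.g. 0000:01:00.0 */
    }
    closedir(d);
    return 0;
}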
  
>   If that iommu is
> a PE (via those capability files), then a user space entity (trying hard
> not to call it libvirt) can unbind all those devices from the host,
> maybe bind the ones it wants to assign to a guest to vfio and bind the
> others to pci-stub for safe keeping.  If you trust a user with
> everything in a PE, bind all the devices to VFIO, chown all
> the /dev/vfioX entries for those devices, and the /dev/uiommuX device.
>
> We might then come up with qemu command lines to describe interesting
> configurations, such as:
> 
> -device iommu,model=PAPR,uiommu=/dev/uiommu7,id=iommu.0 \
> -device pci-bus,...,iommu=iommu0,id=pci.0 \
> -device vfio,host=ssss:bb:dd.f,bus=pci.0,addr=dd.f,id=hostdev0
> 
> The userspace entity would obviously need to put things in the same PE
> in the right place, but it doesn't seem to take a lot of sysfs info to
> get that right.
> 
> Today we do DMA mapping via the VFIO device because the capabilities of
> the IOMMU domains change depending on which devices are connected (for
> VT-d, the least common denominator of the IOMMUs in play).  Forcing the
> DMA mappings through VFIO naturally forces the call order.  If we moved
> to something like above, we could switch the DMA mapping to the uiommu
> device, since the IOMMU would have fixed capabilities.

That makes sense.

> What gaps would something like this leave for your IOMMU granularity
> problems?  I'll need to think through how it works when we don't want to
> expose the iommu to the guest, maybe a model=none (default) that doesn't
> need to be connected to a pci bus and maps all guest memory.  Thanks,

Well, I would map those "iommus" to PEs, so what remains is the path
for all the "other" bits and pieces, such as informing qemu of the
location and size of the MMIO segment(s) (so we can map the whole thing
and not bother with individual BARs) etc...

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-07-29 23:58 ` Benjamin Herrenschmidt
@ 2011-07-31 14:09   ` Avi Kivity
  -1 siblings, 0 replies; 322+ messages in thread
From: Avi Kivity @ 2011-07-31 14:09 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: kvm, Anthony Liguori, Alex Williamson, David Gibson,
	Paul Mackerras, Alexey Kardashevskiy, linux-pci, linuxppc-dev

On 07/30/2011 02:58 AM, Benjamin Herrenschmidt wrote:
> - Having a magic heuristic in libvirt to figure out those constraints is
> WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
> knowledge of PCI resource management and getting it wrong in many many
> cases, something that took years to fix essentially by ripping it all
> out. This is kernel knowledge and thus we need the kernel to expose in a
> way or another what those constraints are, what those "partitionable
> groups" are.

How about a sysfs entry partition=<partition-id>? Then libvirt knows not
to assign devices from the same partition to different guests (and not
to let the host play with them, either).
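
For example (the "partition" attribute below is hypothetical, nothing
exposes it today), libvirt would only have to compare one integer per
device:

/* Sketch only: read a made-up per-device partition id from sysfs. */
#include <stdio.h>

int main(int argc, char **argv)
{
    const char *dev = argc > 1 ? argv[1] : "0000:01:00.0";
    char path[128];
    FILE *f;
    int id;

    snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/partition", dev);
    f = fopen(path, "r");
    if (!f || fscanf(f, "%d", &id) != 1) {
        fprintf(stderr, "no partition id for %s\n", dev);
        return 1;
    }
    fclose(f);
    /* Two devices with the same id must go to the same guest (or stay
     * with the host together). */
    printf("%s is in partition %d\n", dev, id);
    return 0;
}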

> The interface currently proposed for VFIO (and associated uiommu)
> doesn't handle that problem at all. Instead, it is entirely centered
> around a specific "feature" of the VTd iommu's for creating arbitrary
> domains with arbitrary devices (tho those devices -do- have the same
> constraints exposed above, don't try to put 2 legacy PCI devices behind
> the same bridge into 2 different domains !), but the API totally ignores
> the problem, leaves it to libvirt "magic foo" and focuses on something
> that is both quite secondary in the grand scheme of things, and quite
> x86 VTd specific in the implementation and API definition.
>
> Now, I'm not saying these programmable iommu domains aren't a nice
> feature and that we shouldn't exploit them when available, but as it is,
> it is too much a central part of the API.

I have a feeling you'll be getting the same capabilities sooner or
later, or you won't be able to make use of SR-IOV VFs.  While we should
support the older hardware, the interfaces should be designed with the
newer hardware in mind.

> My main point is that I don't want the "knowledge" here to be in libvirt
> or qemu. In fact, I want to be able to do something as simple as passing
> a reference to a PE to qemu (sysfs path ?) and have it just pickup all
> the devices in there and expose them to the guest.

Such magic is nice for a developer playing with qemu but in general less 
useful for a managed system where the various cards need to be exposed 
to the user interface anyway.

> * IOMMU
>
> Now more on iommu. I've described I think in enough details how ours
> work, there are others, I don't know what freescale or ARM are doing,
> sparc doesn't quite work like VTd either, etc...
>
> The main problem isn't that much the mechanics of the iommu but really
> how it's exposed (or not) to guests.
>
> VFIO here is basically designed for one and only one thing: expose the
> entire guest physical address space to the device more/less 1:1.

A single level iommu cannot be exposed to guests.  Well, it can be 
exposed as an iommu that does not provide per-device mapping.

A two level iommu can be emulated and exposed to the guest.  See 
http://support.amd.com/us/Processor_TechDocs/48882.pdf for an example.

> This means:
>
>    - It only works with iommu's that provide complete DMA address spaces
> to devices. Won't work with a single 'segmented' address space like we
> have on POWER.
>
>    - It requires the guest to be pinned. Pass-through ->  no more swap

Newer iommus (and devices, unfortunately) (will) support I/O page faults 
and then the requirement can be removed.

>    - The guest cannot make use of the iommu to deal with 32-bit DMA
> devices, thus a guest with more than a few G of RAM (I don't know the
> exact limit on x86, depends on your IO hole I suppose), and you end up
> back to swiotlb & bounce buffering.

Is this a problem in practice?

>    - It doesn't work for POWER server anyways because of our need to
> provide a paravirt iommu interface to the guest since that's how pHyp
> works today and how existing OSes expect to operate.

Then you need to provide that same interface, and implement it using the 
real iommu.

> - Performance sucks of course, the vfio map ioctl wasn't mean for that
> and has quite a bit of overhead. However we'll want to do the paravirt
> call directly in the kernel eventually ...

Does the guest iomap each request?  Why?

Emulating the iommu in the kernel is of course the way to go if that's
the case, but won't performance still suck even then?

> The QEMU side VFIO code hard wires various constraints that are entirely
> based on various requirements you decided you have on x86 but don't
> necessarily apply to us :-)
>
> Due to our paravirt nature, we don't need to masquerade the MSI-X table
> for example. At all. If the guest configures crap into it, too bad, it
> can only shoot itself in the foot since the host bridge enforces
> validation anyways as I explained earlier. Because it's all paravirt, we
> don't need to "translate" the interrupt vectors & addresses, the guest
> will call hypercalls to configure things anyways.

So, you have interrupt redirection?  That is, MSI-x table values encode 
the vcpu, not pcpu?

Alex, with interrupt redirection, we can skip this as well?  Perhaps 
only if the guest enables interrupt redirection?

If so, it's not arch specific, it's interrupt redirection specific.

> We don't need to prevent MMIO pass-through for small BARs at all. This
> should be some kind of capability or flag passed by the arch. Our
> segmentation of the MMIO domain means that we can give entire segments
> to the guest and let it access anything in there (those segments are a
> multiple of the page size always). Worst case it will access outside of
> a device BAR within a segment and will cause the PE to go into error
> state, shooting itself in the foot, there is no risk of side effect
> outside of the guest boundaries.

Does the BAR value contain the segment base address?  Or is that added 
later?


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-07-29 23:58 ` Benjamin Herrenschmidt
                   ` (3 preceding siblings ...)
  (?)
@ 2011-08-01  2:48 ` David Gibson
  -1 siblings, 0 replies; 322+ messages in thread
From: David Gibson @ 2011-08-01  2:48 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	Alex Williamson, Anthony Liguori, linuxppc-dev

On Sat, Jul 30, 2011 at 09:58:53AM +1000, Benjamin Herrenschmidt wrote:
[snip]
> That current hack won't work well if two devices share an iommu. Note
> that we have an additional constraint here due to our paravirt
> interfaces (specified in PAPR) which is that PE domains must have a
> common parent. Basically, pHyp makes them look like a PCIe host bridge
> per domain in the guest. I think that's a pretty good idea and qemu
> might want to do the same.
> 
> - We hack out the currently unconditional mapping of the entire guest
> space in the iommu. Something will have to be done to "decide" whether
> to do that or not ... qemu argument -> ioctl ?

Not quite.  We already require the not-yet-upstream patches which add
guest-side (emulated) IOMMU support to qemu.  The approach we're using
for the passthrough (or at least will be once I fix up my patches again)
is that we map all guest ram into the vfio iommu if and only if there is
no guest-visible iommu advertised in the qdev.

This kind of makes sense - if there is no iommu from the guest
perspective, the guest will expect to see all its physical memory 1:1
in DMA.
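
Roughly, the decision looks like the sketch below; the helper names here
are invented and the real qemu code is of course structured differently.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers standing in for the qemu/vfio plumbing. */
extern bool guest_has_visible_iommu(void);
extern void wire_guest_iommu_to_vfio(void);
extern int  vfio_map(uint64_t iova, uint64_t len, void *host_vaddr);

/* The policy described above: with no guest-visible iommu, map the whole
 * of guest RAM 1:1 into the vfio iommu (which pins it); otherwise forward
 * the guest's (paravirt) iommu operations to vfio map/unmap instead. */
int setup_dma_window(void *guest_ram, uint64_t ram_size)
{
        if (guest_has_visible_iommu()) {
                wire_guest_iommu_to_vfio();
                return 0;
        }
        return vfio_map(0, ram_size, guest_ram);
}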

The hacky bit is that when there *is* a guest-visible iommu, it's
assumed that whatever interface the guest iommu uses is somehow wired
up to vfio map/unmap calls.  For us at the moment, this means
passthrough devices must be assigned to a special (guest) pci domain
which wires the paravirt iommu up to the vfio iommu.

In theory under some circumstances, with full emu, you could wire up
an emulated guest iommu interface to a different host iommu
implementation via this mechanism.  However that wouldn't work if the
guest and host iommus' capabilities are too different, and in any case
would require considerable extra abstraction work on the qemu guest
iommu code.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-07-30 22:21   ` Benjamin Herrenschmidt
@ 2011-08-01 16:40     ` Alex Williamson
  -1 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-08-01 16:40 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: kvm, Anthony Liguori, David Gibson, Paul Mackerras,
	Alexey Kardashevskiy, linux-pci, linuxppc-dev

On Sun, 2011-07-31 at 08:21 +1000, Benjamin Herrenschmidt wrote:
> On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote:
> > Hi folks !
> > 
> > So I promised Anthony I would try to summarize some of the comments &
> > issues we have vs. VFIO after we've tried to use it for PCI pass-through
> > on POWER. It's pretty long, there are various items with more or less
> > impact, some of it is easily fixable, some are API issues, and we'll
> > probably want to discuss them separately, but for now here's a brain
> > dump.
> > 
> > David, Alexei, please make sure I haven't missed anything :-)
> 
> And I think I have :-)
> 
>   * Config space
> 
> VFIO currently handles that as a byte stream. It's quite gross to be
> honest and it's not right. You shouldn't lose access size information
> between guest and host when performing real accesses.
> 
> Some config space registers can have side effects and not respecting
> access sizes can be nasty.

It's a bug, let's fix it.
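
For illustration, preserving the access size just means issuing one read
or write of the caller's width instead of splitting it into bytes.  A
minimal userspace sketch of the idea, done here against the regular PCI
sysfs config file rather than the vfio fd (the actual fix is of course on
the kernel side):

#include <endian.h>
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/* Read a config space register at 'pos' as a single access of its
 * natural width (1, 2 or 4 bytes), so registers with read side effects
 * are not hit byte by byte.  Config space data is little-endian, hence
 * the conversions. */
int cfg_read(int fd, unsigned int pos, unsigned int width, uint32_t *val)
{
        union { uint8_t b; uint16_t w; uint32_t l; } u;
        ssize_t ret;

        if (width != 1 && width != 2 && width != 4)
                return -EINVAL;
        ret = pread(fd, &u, width, pos);
        if (ret != (ssize_t)width)
                return ret < 0 ? -errno : -EIO;

        switch (width) {
        case 1: *val = u.b; break;
        case 2: *val = le16toh(u.w); break;
        default: *val = le32toh(u.l); break;
        }
        return 0;
}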

Alex

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-07-30 23:54     ` Benjamin Herrenschmidt
  (?)
@ 2011-08-01 18:59       ` Alex Williamson
  -1 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-08-01 18:59 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: kvm, Anthony Liguori, David Gibson, Paul Mackerras,
	Alexey Kardashevskiy, linux-pci, linuxppc-dev, iommu, benve,
	aafabbri, chrisw, qemu-devel

On Sun, 2011-07-31 at 09:54 +1000, Benjamin Herrenschmidt wrote:
> On Sat, 2011-07-30 at 12:20 -0600, Alex Williamson wrote:
> 
> > On x86, the USB controllers don't typically live behind a PCIe-to-PCI
> > bridge, so don't suffer the source identifier problem, but they do often
> > share an interrupt.  But even then, we can count on most modern devices
> > supporting PCI2.3, and thus the DisINTx feature, which allows us to
> > share interrupts.  In any case, yes, it's more rare but we need to know
> > how to handle devices behind PCI bridges.  However I disagree that we
> > need to assign all the devices behind such a bridge to the guest.
> 
> Well, ok so let's dig a bit more here :-) First, yes I agree they don't
> all need to appear to the guest. My point is really that we must prevent
> them from being "used" by somebody else, either host or another guest.
> 
> Now once you get there, I personally prefer having a clear "group"
> ownership rather than having devices stay in some "limbo" under vfio
> control but it's an implementation detail.
> 
> Regarding DisINTx, well, it's a bit like putting separate PCIe functions
> into separate guests, it looks good ... but you are taking a chance.
> Note that I do intend to do some of that for power ... well I think, I
> haven't completely made up my mind.
> 
> pHyp has a stricter requirement, PEs essentially are everything
> behind a bridge. If you have a slot, you have some kind of bridge above
> this slot and everything on it will be a PE.
> 
> The problem I see is that with your filtering of config space, BAR
> emulation, DisINTx etc... you essentially assume that you can reasonably
> reliably isolate devices. But in practice, it's chancy. Some devices for
> example have "backdoors" into their own config space via MMIO. If I have
> such a device in a guest, I can completely override your DisINTx and
> thus DOS your host or another guest with a shared interrupt. I can move
> my MMIO around and DOS another function by overlapping the addresses.
> 
> You can really only protect yourself against a device if you have it
> behind a bridge (in addition to having a filtering iommu), which limits
> the MMIO span (and thus letting the guest whack the BARs randomly will
> only allow that guest to shoot itself in the foot).
> 
> Some bridges also provide a way to block INTx below them which comes in
> handy but it's bridge specific. Some devices can be coerced to send the
> INTx "assert" message and never de-assert it (for example by doing a
> soft-reset while it's asserted, which can be done with some devices with
> an MMIO).
> 
> Anything below a PCIe -> PCI/PCI-X needs to also be "grouped" due to
> simple lack of proper filtering by the iommu (PCI-X in theory has RIDs
> and forwards them up, but this isn't very reliable, for example it fails
> over with split transactions).
> 
> Fortunately in PCIe land, we mostly have bridges above everything. The
> problem somewhat remains with functions of a device, how can you be sure
> that there isn't a way via some MMIO to create side effects on the other
> functions of the device ? (For example by checkstopping the whole
> thing). You can't really :-)
> 
> So it boils down to the "level" of safety/isolation you want to provide,
> and I suppose to some extent it's a user decision but the user needs to
> be informed to some extent. A hard problem :-)
>  
> > There's a difference between removing the device from the host and
> > exposing the device to the guest.  If I have a NIC and HBA behind a
> > bridge, it's perfectly reasonable that I might only assign the NIC to
> > the guest, but as you describe, we then need to prevent the host, or any
> > other guest from making use of the HBA.
> 
> Yes. However the other device is in "limbo" and it may not be clear to
> the user why it can't be used anymore :-)
> 
> The question is more, the user needs to "know" (or libvirt does, or
> somebody ... ) that in order to pass-through device A, it must also
> "remove" device B from the host. How can you even provide a meaningful
> error message to the user if all VFIO does is give you something like
> -EBUSY ?
> 
> So the information about the grouping constraint must trickle down
> somewhat.
> 
> Look at it from a GUI perspective for example. Imagine a front-end
> showing you devices in your system and allowing you to "Drag & drop"
> them to your guest. How do you represent that need for grouping ? First
> how do you expose it from kernel/libvirt to the GUI tool and how do you
> represent it to the user ?
> 
> By grouping the devices in logical groups which end up being the
> "objects" you can drag around, at least you provide some amount of
> clarity. Now if you follow that path down to how the GUI app, libvirt
> and possibly qemu need to know / resolve the dependency, being given the
> "groups" as the primary information of what can be used for pass-through
> makes everything a lot simpler.
>  
> > > - The -minimum- granularity of pass-through is not always a single
> > > device and not always under SW control
> > 
> > But IMHO, we need to preserve the granularity of exposing a device to a
> > guest as a single device.  That might mean some devices are held hostage
> > by an agent on the host.
> 
> Maybe but wouldn't that be even more confusing from a user perspective ?
> And I think it makes it harder from the perspective of implementing
> admin & management tools too.
> 
> > > - Having a magic heuristic in libvirt to figure out those constraints is
> > > WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
> > > knowledge of PCI resource management and getting it wrong in many many
> > > cases, something that took years to fix essentially by ripping it all
> > > out. This is kernel knowledge and thus we need the kernel to expose in a
> > > way or another what those constraints are, what those "partitionable
> > > groups" are.
> > > 
> > > - That does -not- mean that we cannot specify for each individual device
> > > within such a group where we want to put it in qemu (what devfn etc...).
> > > As long as there is a clear understanding that the "ownership" of the
> > > device goes with the group, this is somewhat orthogonal to how they are
> > > represented in qemu. (Not completely... if the iommu is exposed to the
> > > guest ,via paravirt for example, some of these constraints must be
> > > exposed but I'll talk about that more later).
> > 
> > Or we can choose not to expose all of the devices in the group to the
> > guest?
> 
> As I said, I don't mind if you don't, I'm just worried about the
> consequences of that from a usability standpoint. Having advanced
> command line options to fine tune is fine. Being able to specify within a
> "group" which devices to show and at what address is fine.
> 
> But I believe the basic entity to be manipulated from an interface
> standpoint remains the group.
> 
> To get back to my GUI example, once you've D&D your group of devices
> over, you can have the option to open that group and check/uncheck
> individual devices & assign them addresses if you want. That doesn't
> change the fact that practically speaking, the whole group is now owned
> by the guest.
> 
> I will go further than that actually. If you look at how the isolation
> HW works on POWER, the fact that I have the MMIO segmentation means that
> I can simply give the entire group MMIO space to the guest. No problem
> of small BARs, no need to slow-map them ... etc.. that's a pretty handy
> feature don't you think ?
> 
> But that means that those other devices -will- be there, mapped along
> with the one you care about. We may not expose it in config space but it
> will be accessible. I suppose we can keep its IO/MEM decoding disabled.
> But my point is that for all intents and purposes, it's actually owned by
> the guest.
> 
> > > The interface currently proposed for VFIO (and associated uiommu)
> > > doesn't handle that problem at all. Instead, it is entirely centered
> > > around a specific "feature" of the VTd iommu's for creating arbitrary
> > > domains with arbitrary devices (tho those devices -do- have the same
> > > constraints exposed above, don't try to put 2 legacy PCI devices behind
> > > the same bridge into 2 different domains !), but the API totally ignores
> > > the problem, leaves it to libvirt "magic foo" and focuses on something
> > > that is both quite secondary in the grand scheme of things, and quite
> > > x86 VTd specific in the implementation and API definition.
> > 
> > To be fair, libvirt's "magic foo" is built out of the necessity that
> > nobody else is defining the rules.
> 
> Sure, which is why I propose that the kernel exposes the rules since
> it's really the one right place to have that sort of HW constraint
> knowledge, especially since it can be partially at least platform
> specific.
>  
>  .../...

I'll try to consolidate my reply to all the above here because there are
too many places above to interject and make this thread even more
difficult to respond to.  Much of what you're discussing above comes
down to policy.  Do we trust DisINTx?  Do we trust multi-function
devices?  I have no doubt there are devices we can use as examples for
each behaving badly.  On x86 this is one of the reasons we have SR-IOV.
Besides splitting a single device into multiple, it makes sure each
device is actually virtualization friendly.  POWER seems to add
multiple layers of hardware so that you don't actually have to trust the
device, which is a great value add for enterprise systems, but in doing
so it mostly defeats the purpose and functionality of SR-IOV.

How we present this in a GUI is largely irrelevant because something has
to create a superset of what the hardware dictates (can I uniquely
identify transactions from this device, can I protect other devices from
it, etc.), the system policy (do I trust DisINTx, do I trust function
isolation, do I require ACS) and mold that with what the user actually
wants to assign.  For the VFIO kernel interface, we should only be
concerned with the first problem.  Userspace is free to make the rest as
simple or complete as it cares to.  I argue for x86, we want device
level granularity of assignment, but that also tends to be the typical
case (when only factoring in hardware restrictions) due to our advanced
iommus.

> > > Maybe something like /sys/devgroups ? This probably warrants involving
> > > more kernel people into the discussion.
> > 
> > I don't yet buy into passing groups to qemu since I don't buy into the
> > idea of always exposing all of those devices to qemu.  Would it be
> > sufficient to expose iommu nodes in sysfs that link to the devices
> > behind them and describe properties and capabilities of the iommu
> > itself?  More on this at the end.
> 
> Well, iommus aren't the only factor. I mentioned shared interrupts (and
> my unwillingness to always trust DisINTx),

*userspace policy*

>  there's also the MMIO
> grouping I mentioned above (in which case it's an x86 -limitation- with
> small BARs that I don't want to inherit, especially since it's based on
> PAGE_SIZE and we commonly have 64K page size on POWER), etc...

But isn't MMIO grouping effectively *at* the iommu?

> So I'm not too much of a fan of making it entirely look like the iommu is the
> primary factor, but we -can-, that would be workable. I still prefer
> calling a cat a cat and exposing the grouping for what it is, as I think
> I've explained already above, tho. 

The trouble is the "group" analogy is more fitting to a partitionable
system, whereas on x86 we can really mix-n-match devices across iommus
fairly easily.  The iommu seems to be the common point to describe these
differences.

> > > Now some of this can be fixed with tweaks, and we've started doing it
> > > (we have a working pass-through using VFIO, forgot to mention that, it's
> > > just that we don't like what we had to do to get there).
> > 
> > This is a result of wanting to support *unmodified* x86 guests.  We
> > don't have the luxury of having a predefined pvDMA spec that all x86
> > OSes adhere to. 
> 
> No but you could emulate a HW iommu no ?

We can, but then we have to worry about supporting legacy, proprietary
OSes that may not have support or may make use of it differently.  As
Avi mentions, hardware is coming that eases the "pin the whole guest"
requirement and we may implement emulated iommus for the benefit of some
guests.

> >  The 32bit problem is unfortunate, but the priority use
> > case for assigning devices to guests is high performance I/O, which
> > usually entails modern, 64bit hardware.  I'd like to see us get to the
> > point of having emulated IOMMU hardware on x86, which could then be
> > backed by VFIO, but for now guest pinning is the most practical and
> > useful.
> 
> For your current case maybe. It's just not very future proof imho.
> Anyways, it's fixable, but the APIs as they are make it a bit clumsy.

You expect more 32bit devices in the future?

> > > Also our next generation chipset may drop support for PIO completely.
> > > 
> > > On the other hand, because PIO is just a special range of MMIO for us,
> > > we can do normal pass-through on it and don't need any of the emulation
> > > done qemu.
> > 
> > Maybe we can add mmap support to PIO regions on non-x86.
> 
> We have to yes. I haven't looked into it yet, it should be easy if VFIO
> kernel side starts using the "proper" PCI mmap interfaces in kernel (the
> same interfaces sysfs & proc use).

Patches welcome.
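
For reference, the interface being referred to is just the per-BAR
resource files sysfs already exposes, which can be mmap()ed directly; on
platforms where PIO is merely another MMIO window, the I/O resources
would go through the same path.  The device address and mapping length
below are placeholders.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        /* placeholder device/BAR; the length must cover the BAR and be
         * page aligned */
        const char *path = "/sys/bus/pci/devices/0000:01:00.0/resource0";
        size_t len = 65536;
        void *bar;
        int fd = open(path, O_RDWR | O_SYNC);

        if (fd < 0) {
                perror("open");
                return 1;
        }
        bar = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (bar == MAP_FAILED) {
                perror("mmap");
                close(fd);
                return 1;
        }
        /* ... (volatile) loads/stores to the BAR registers go here ... */
        munmap(bar, len);
        close(fd);
        return 0;
}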

> > >   * MMIO constraints
> > > 
> > > The QEMU side VFIO code hard wires various constraints that are entirely
> > > based on various requirements you decided you have on x86 but don't
> > > necessarily apply to us :-)
> > > 
> > > Due to our paravirt nature, we don't need to masquerade the MSI-X table
> > > for example. At all. If the guest configures crap into it, too bad, it
> > > can only shoot itself in the foot since the host bridge enforces
> > > validation anyways as I explained earlier. Because it's all paravirt, we
> > > don't need to "translate" the interrupt vectors & addresses, the guest
> > > will call hypercalls to configure things anyways.
> > 
> > With interrupt remapping, we can allow the guest access to the MSI-X
> > table, but since that takes the host out of the loop, there's
> > effectively no way for the guest to correctly program it directly by
> > itself.
> 
> Right, I think what we need here is some kind of capabilities to
> "disable" those "features" of qemu vfio.c that aren't needed on our
> platform :-) Shouldn't be too hard. We need to make this runtime tho
> since different machines can have different "capabilities".

Sure, we'll probably eventually want a switch to push the MSI-X table to
KVM when it's available.

> > > We don't need to prevent MMIO pass-through for small BARs at all. This
> > > should be some kind of capability or flag passed by the arch. Our
> > > segmentation of the MMIO domain means that we can give entire segments
> > > to the guest and let it access anything in there (those segments are a
> > > multiple of the page size always). Worst case it will access outside of
> > > a device BAR within a segment and will cause the PE to go into error
> > > state, shooting itself in the foot, there is no risk of side effect
> > > outside of the guest boundaries.
> > 
> > Sure, this could be some kind of capability flag, maybe even implicit in
> > certain configurations.
> 
> Yup.
> 
> > > In fact, we don't even need to emulate BAR sizing etc... in theory. Our
> > > paravirt guests expect the BARs to have been already allocated for them
> > > by the firmware and will pick up the addresses from the device-tree :-)
> > > 
> > > Today we use a "hack", putting all 0's in there and triggering the linux
> > > code path to reassign unassigned resources (which will use BAR
> > > emulation) but that's not what we are -supposed- to do. Not a big deal
> > > and having the emulation there won't -hurt- us, it's just that we don't
> > > really need any of it.
> > > 
> > > We have a small issue with ROMs. Our current KVM only works with huge
> > > pages for guest memory but that is being fixed. So the way qemu maps the
> > > ROM copy into the guest address space doesn't work. It might be handy
> > > anyways to have a way for qemu to use MMIO emulation for ROM access as a
> > > fallback. I'll look into it.
> > 
> > So that means ROMs don't work for you on emulated devices either?  The
> > reason we read it once and map it into the guest is because Michael
> > Tsirkin found a section in the PCI spec that indicates devices can share
> > address decoders between BARs and ROM.
> 
> Yes, he is correct.
> 
> >   This means we can't just leave
> > the enabled bit set in the ROM BAR, because it could actually disable an
> > address decoder for a regular BAR.  We could slow-map the actual ROM,
> > enabling it around each read, but shadowing it seemed far more
> > efficient.
> 
> Right. We can slow map the ROM, or we can not care :-) At the end of the
> day, what is the difference here between a "guest" under qemu and the
> real thing bare metal on the machine ? IE. They have the same issue vs.
> accessing the ROM. IE. I don't see why qemu should try to make it safe
> to access it at any time while it isn't on a real machine. Since VFIO
> resets the devices before putting them in guest space, they should be
> accessible no ? (Might require a hard reset for some devices tho ... )

My primary motivator for doing the ROM the way it's done today is that I
get to push all the ROM handling off to QEMU core PCI code.  The ROM for
an assigned device is handled exactly like the ROM for an emulated
device except it might be generated by reading it from the hardware.
This gives us the benefit of things like rombar=0 if I want to hide the
ROM or romfile=<file> if I want to load an ipxe image for a device that
may not even have a physical ROM.  Not to mention I don't have to
special case ROM handling routines in VFIO.  So it actually has little
to do w/ making it safe to access the ROM at any time.

> In any case, it's not a big deal and we can sort it out, I'm happy to
> fallback to slow map to start with and eventually we will support small
> pages mappings on POWER anyways, it's a temporary limitation.

Perhaps this could also be fixed in the generic QEMU PCI ROM support so
it works for emulated devices too... code reuse paying off already ;)

> > >   * EEH
> > > 
> > > This is the name of those fancy error handling & isolation features I
> > > mentioned earlier. To some extent it's a superset of AER, but we don't
> > > generally expose AER to guests (or even the host), it's swallowed by
> > > firmware into something else that provides a superset (well mostly) of
> > > the AER information, and allow us to do those additional things like
> > > isolating/de-isolating, reset control etc...
> > > 
> > > Here too, we'll need arch specific APIs through VFIO. Not necessarily a
> > > huge deal, I mention it for completeness.
> > 
> > We expect to do AER via the VFIO netlink interface, which even though
> > it's bashed below, would be quite extensible to supporting different
> > kinds of errors.
> 
> As could platform specific ioctls :-)

Is qemu going to poll for errors?

> > >    * Misc
> > > 
> > > There's lots of small bits and pieces... in no special order:
> > > 
> > >  - netlink ? WTF ! Seriously, we don't need a hybrid API with a bit of
> > > netlink and a bit of ioctl's ... it's not like there's something
> > > fundamentally  better for netlink vs. ioctl... it really depends what
> > > you are doing, and in this case I fail to see what netlink brings you
> > > other than bloat and more stupid userspace library deps.
> > 
> > The netlink interface is primarily for host->guest signaling.  I've only
> > implemented the remove command (since we're lacking a pcie-host in qemu
> > to do AER), but it seems to work quite well.  If you have suggestions
> > for how else we might do it, please let me know.  This seems to be the
> > sort of thing netlink is supposed to be used for.
> 
> I don't understand what the advantage of netlink is compared to just
> extending your existing VFIO ioctl interface, possibly using children
> fd's as we do for example with spufs but it's not a huge deal. It just
> that netlink has its own gotchas and I don't like multi-headed
> interfaces.

We could do yet another eventfd that triggers the VFIO user to go call
an ioctl to see what happened, but then we're locked into an ioctl
interface for something that we may want to more easily extend over
time.  As I said, it feels like this is what netlink is for and the
arguments against seem to be more of a gut reaction.
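
For comparison, the eventfd variant would look something like the sketch
below, with a completely made-up query ioctl and event record standing in
for whatever would report the details (neither exists in vfio today):

#include <poll.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Entirely hypothetical event record and ioctl number, only to show the
 * "eventfd kicks userspace, userspace asks an ioctl what happened"
 * pattern; the real vfio interface defines neither of these. */
struct vfio_event_hypothetical {
        uint32_t type;
        uint32_t data;
};
#define VFIO_GET_EVENT_HYPOTHETICAL \
        _IOR(';', 200, struct vfio_event_hypothetical)

int wait_for_device_event(int event_fd, int vfio_fd,
                          struct vfio_event_hypothetical *ev)
{
        struct pollfd pfd = { .fd = event_fd, .events = POLLIN };
        uint64_t count;

        if (poll(&pfd, 1, -1) <= 0)
                return -1;
        if (read(event_fd, &count, sizeof(count)) != sizeof(count))
                return -1;              /* drain the eventfd */
        return ioctl(vfio_fd, VFIO_GET_EVENT_HYPOTHETICAL, ev);
}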

> > >  - I don't like too much the fact that VFIO provides yet another
> > > different API to do what we already have at least 2 kernel APIs for, ie,
> > > BAR mapping and config space access. At least it should be better at
> > > using the backend infrastructure of the 2 others (sysfs & procfs). I
> > > understand it wants to filter in some case (config space) and -maybe-
> > > yet another API is the right way to go but allow me to have my doubts.
> > 
> > The use of PCI sysfs is actually one of my complaints about current
> > device assignment.  To do assignment with an unprivileged guest we need
> > to open the PCI sysfs config file for it, then change ownership on a
> > handful of other PCI sysfs files, then there's this other pci-stub thing
> > to maintain ownership, but the kvm ioctls don't actually require it and
> > can grab onto any free device...  We are duplicating some of that in
> > VFIO, but we also put the ownership of the device behind a single device
> > file.  We do have the uiommu problem that we can't give an unprivileged
> > user ownership of that, but your usage model may actually make that
> > easier.  More below...
> > 
> > > One thing I thought about but you don't seem to like it ... was to use
> > > the need to represent the partitionable entity as groups in sysfs that I
> > > talked about earlier. Those could have per-device subdirs with the usual
> > > config & resource files, same semantic as the ones in the real device,
> > > but when accessed via the group they get filtering. It might or might not
> > > be practical in the end, tbd, but it would allow apps using a slightly
> > > modified libpci for example to exploit some of this.
> > 
> > I may be tainted by our disagreement that all the devices in a group
> > need to be exposed to the guest and qemu could just take a pointer to a
> > sysfs directory.  That seems very unlike qemu and pushes more of the
> > policy into qemu, which seems like the wrong direction.
> 
> I don't see how it pushes "policy" into qemu.
> 
> The "policy" here is imposed by the HW setup and exposed by the
> kernel :-) Giving qemu a group means qemu takes "ownership" of that bunch
> of devices, so far I don't see what's policy about that. From there, it
> would be "handy" for people to just stop there and just see all the
> devices of the group show up in the guest, but by all means feel free to
> suggest a command line interface that allows to more precisely specify
> which of the devices in the group to pass through and at what address.

That's exactly the policy I'm thinking of.  Here's a group of devices,
do something with them...  Does qemu assign them all?  where?  does it
allow hotplug?  do we have ROMs?  should we?  from where?

> > >  - The qemu vfio code hooks directly into ioapic ... of course that
> > > won't fly with anything !x86
> > 
> > I spent a lot of time looking for an architecture neutral solution here,
> > but I don't think it exists.  Please prove me wrong.
> 
> No it doesn't I agree, that's why it should be some kind of notifier or
> function pointer setup by the platform specific code.

Hmm... it is.  I added a pci_get_irq() that returns a
platform/architecture-specific translation of a PCI interrupt to its
resulting system interrupt.  Implement this in your PCI root bridge.
There's a notifier for when this changes, so vfio will check
pci_get_irq() again, also to be implemented in the PCI root bridge code.
And a notifier that gets registered with that system interrupt and gets
notice for EOI... implemented in x86 ioapic, somewhere else for power.
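
In other words, hooks of roughly the shape below; apart from
pci_get_irq() itself the names and signatures are illustrative, not the
actual prototype code.

/* Illustrative only: the rough shape of the callbacks described above. */
typedef void (*irq_route_changed_fn)(void *opaque);
typedef void (*irq_eoi_fn)(void *opaque, int sys_irq);

struct pci_intx_hooks {
        /* PCI root bridge: translate a device's INTx pin to the system
         * interrupt it ends up on (e.g. a GSI on x86). */
        int  (*pci_get_irq)(void *bridge, int devfn, int pin);

        /* PCI root bridge: notify vfio when that routing changes so it
         * can re-query pci_get_irq(). */
        void (*add_route_notifier)(void *bridge, irq_route_changed_fn fn,
                                   void *opaque);

        /* Interrupt controller (ioapic on x86, something else on POWER):
         * notify vfio of the EOI so it can re-enable INTx on the device. */
        void (*add_eoi_notifier)(int sys_irq, irq_eoi_fn fn, void *opaque);
};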

> >   The problem is
> > that we have to disable INTx on an assigned device after it fires (VFIO
> > does this automatically).  If we don't do this, a non-responsive or
> > malicious guest could sit on the interrupt, causing it to fire
> > repeatedly as a DoS on the host.  The only indication that we can rely
> > on to re-enable INTx is when the guest CPU writes an EOI to the APIC.
> > We can't just wait for device accesses because a) the device CSRs are
> > (hopefully) direct mapped and we'd have to slow map them or attempt to
> > do some kind of dirty logging to detect when they're accesses b) what
> > constitutes an interrupt service is device specific.
> > 
> > That means we need to figure out how PCI interrupt 'A' (or B...)
> > translates to a GSI (Global System Interrupt - ACPI definition, but
> > hopefully a generic concept).  That GSI identifies a pin on an IOAPIC,
> > which will also see the APIC EOI.  And just to spice things up, the
> > guest can change the PCI to GSI mappings via ACPI.  I think the set of
> > callbacks I've added are generic (maybe I left ioapic in the name), but
> > yes they do need to be implemented for other architectures.  Patches
> > appreciated from those with knowledge of the systems and/or access to
> > device specs.  This is the only reason that I make QEMU VFIO only build
> > for x86.
> 
> Right, and we need to cook a similar sauce for POWER, it's an area that
> has to be arch specific (and in fact specific to the specific HW machine
> being emulated), so we just need to find out what's the cleanest way for
> the platform to "register" the right callbacks here.

Aside from the ioapic, I hope it's obvious hooks in the PCI root bridge
emulation.

[snip]
> > Rather than your "groups" idea, I've been mulling over whether we can
> > just expose the dependencies, configuration, and capabilities in sysfs
> > and build qemu commandlines to describe it.  For instance, if we simply
> > start with creating iommu nodes in sysfs, we could create links under
> > each iommu directory to the devices behind them.  Some kind of
> > capability file could define properties like whether it's page table
> > based or fixed iova window or the granularity of mapping the devices
> > behind it.  Once we have that, we could probably make uiommu attach to
> > each of those nodes.
> 
> Well, s/iommu/groups and you are pretty close to my original idea :-)
> 
> I don't mind that much what the details are, but I like the idea of not
> having to construct a 3-pages command line every time I want to
> pass-through a device, most "simple" usage scenario don't care that
> much.
> 
> > That means we know /dev/uiommu7 (random example) is our access to a
> > specific iommu with a given set of devices behind it.
> 
> Linking those sysfs iommus or groups to a /dev/ entry is fine by me.
>   
> >   If that iommu is
> > a PE (via those capability files), then a user space entity (trying hard
> > not to call it libvirt) can unbind all those devices from the host,
> > maybe bind the ones it wants to assign to a guest to vfio and bind the
> > others to pci-stub for safe keeping.  If you trust a user with
> > everything in a PE, bind all the devices to VFIO, chown all
> > the /dev/vfioX entries for those devices, and the /dev/uiommuX device.
> >
> > We might then come up with qemu command lines to describe interesting
> > configurations, such as:
> > 
> > -device iommu,model=PAPR,uiommu=/dev/uiommu7,id=iommu.0 \
> > -device pci-bus,...,iommu=iommu0,id=pci.0 \
> > -device vfio,host=ssss:bb:dd.f,bus=pci.0,addr=dd.f,id=hostdev0
> > 
> > The userspace entity would obviously need to put things in the same PE
> > in the right place, but it doesn't seem to take a lot of sysfs info to
> > get that right.
> > 
> > Today we do DMA mapping via the VFIO device because the capabilities of
> > the IOMMU domains change depending on which devices are connected (for
> > VT-d, the least common denominator of the IOMMUs in play).  Forcing the
> > DMA mappings through VFIO naturally forces the call order.  If we moved
> > to something like above, we could switch the DMA mapping to the uiommu
> > device, since the IOMMU would have fixed capabilities.
> 
> That makes sense.
> 
> > What gaps would something like this leave for your IOMMU granularity
> > problems?  I'll need to think through how it works when we don't want to
> > expose the iommu to the guest, maybe a model=none (default) that doesn't
> > need to be connected to a pci bus and maps all guest memory.  Thanks,
> 
> Well, I would map those "iommus" to PEs, so what remains is the path to
> put all the "other" bits and pieces such as inform qemu of the location
> and size of the MMIO segment(s) (so we can map the whole thing and not
> bother with individual BARs) etc... 

My assumption is that PEs are largely defined by the iommus already.
Are MMIO segments a property of the iommu too?  Thanks,

Alex

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
@ 2011-08-01 18:59       ` Alex Williamson
  0 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-08-01 18:59 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: chrisw, Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	qemu-devel, David Gibson, aafabbri, iommu, Anthony Liguori,
	linuxppc-dev, benve

On Sun, 2011-07-31 at 09:54 +1000, Benjamin Herrenschmidt wrote:
> On Sat, 2011-07-30 at 12:20 -0600, Alex Williamson wrote:
> 
> > On x86, the USB controllers don't typically live behind a PCIe-to-PCI
> > bridge, so don't suffer the source identifier problem, but they do often
> > share an interrupt.  But even then, we can count on most modern devices
> > supporting PCI2.3, and thus the DisINTx feature, which allows us to
> > share interrupts.  In any case, yes, it's more rare but we need to know
> > how to handle devices behind PCI bridges.  However I disagree that we
> > need to assign all the devices behind such a bridge to the guest.
> 
> Well, ok so let's dig a bit more here :-) First, yes I agree they don't
> all need to appear to the guest. My point is really that we must prevent
> them to be "used" by somebody else, either host or another guest.
> 
> Now once you get there, I personally prefer having a clear "group"
> ownership rather than having devices stay in some "limbo" under vfio
> control but it's an implementation detail.
> 
> Regarding DisINTx, well, it's a bit like putting separate PCIe functions
> into separate guests, it looks good ... but you are taking a chance.
> Note that I do intend to do some of that for power ... well I think, I
> haven't completely made my mind.
> 
> pHyp for has a stricter requirement, PEs essentially are everything
> behind a bridge. If you have a slot, you have some kind of bridge above
> this slot and everything on it will be a PE.
> 
> The problem I see is that with your filtering of config space, BAR
> emulation, DisINTx etc... you essentially assume that you can reasonably
> reliably isolate devices. But in practice, it's chancy. Some devices for
> example have "backdoors" into their own config space via MMIO. If I have
> such a device in a guest, I can completely override your DisINTx and
> thus DOS your host or another guest with a shared interrupt. I can move
> my MMIO around and DOS another function by overlapping the addresses.
> 
> You can really only be protect yourself against a device if you have it
> behind a bridge (in addition to having a filtering iommu), which limits
> the MMIO span (and thus letting the guest whack the BARs randomly will
> only allow that guest to shoot itself in the foot).
> 
> Some bridges also provide a way to block INTx below them which comes in
> handy but it's bridge specific. Some devices can be coerced to send the
> INTx "assert" message and never de-assert it (for example by doing a
> soft-reset while it's asserted, which can be done with some devices with
> an MMIO).
> 
> Anything below a PCIe -> PCI/PCI-X needs to also be "grouped" due to
> simple lack of proper filtering by the iommu (PCI-X in theory has RIDs
> and fowards them up, but this isn't very reliable, for example it fails
> over with split transactions).
> 
> Fortunately in PCIe land, we most have bridges above everything. The
> problem somewhat remains with functions of a device, how can you be sure
> that there isn't a way via some MMIO to create side effects on the other
> functions of the device ? (For example by checkstopping the whole
> thing). You can't really :-)
> 
> So it boils down of the "level" of safety/isolation you want to provide,
> and I suppose to some extent it's a user decision but the user needs to
> be informed to some extent. A hard problem :-)
>  
> > There's a difference between removing the device from the host and
> > exposing the device to the guest.  If I have a NIC and HBA behind a
> > bridge, it's perfectly reasonable that I might only assign the NIC to
> > the guest, but as you describe, we then need to prevent the host, or any
> > other guest from making use of the HBA.
> 
> Yes. However the other device is in "limbo" and it may be not clear to
> the user why it can't be used anymore :-)
> 
> The question is more, the user needs to "know" (or libvirt does, or
> somebody ... ) that in order to pass-through device A, it must also
> "remove" device B from the host. How can you even provide a meaningful
> error message to the user if all VFIO does is give you something like
> -EBUSY ?
> 
> So the information about the grouping constraint must trickle down
> somewhat.
> 
> Look at it from a GUI perspective for example. Imagine a front-end
> showing you devices in your system and allowing you to "Drag & drop"
> them to your guest. How do you represent that need for grouping ? First
> how do you expose it from kernel/libvirt to the GUI tool and how do you
> represent it to the user ?
> 
> By grouping the devices in logical groups which end up being the
> "objects" you can drag around, at least you provide some amount of
> clarity. Now if you follow that path down to how the GUI app, libvirt
> and possibly qemu need to know / resolve the dependency, being given the
> "groups" as the primary information of what can be used for pass-through
> makes everything a lot simpler.
>  
> > > - The -minimum- granularity of pass-through is not always a single
> > > device and not always under SW control
> > 
> > But IMHO, we need to preserve the granularity of exposing a device to a
> > guest as a single device.  That might mean some devices are held hostage
> > by an agent on the host.
> 
> Maybe but wouldn't that be even more confusing from a user perspective ?
> And I think it makes it harder from an implementation of admin &
> management tools perspective too.
> 
> > > - Having a magic heuristic in libvirt to figure out those constraints is
> > > WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
> > > knowledge of PCI resource management and getting it wrong in many many
> > > cases, something that took years to fix essentially by ripping it all
> > > out. This is kernel knowledge and thus we need the kernel to expose in a
> > > way or another what those constraints are, what those "partitionable
> > > groups" are.
> > > 
> > > - That does -not- mean that we cannot specify for each individual device
> > > within such a group where we want to put it in qemu (what devfn etc...).
> > > As long as there is a clear understanding that the "ownership" of the
> > > device goes with the group, this is somewhat orthogonal to how they are
> > > represented in qemu. (Not completely... if the iommu is exposed to the
> > > guest ,via paravirt for example, some of these constraints must be
> > > exposed but I'll talk about that more later).
> > 
> > Or we can choose not to expose all of the devices in the group to the
> > guest?
> 
> As I said, I don't mind if you don't, I'm just worried about the
> consequences of that from a usability standpoint. Having advanced
> command line option to fine tune is fine. Being able to specify within a
> "group" which devices to show and at what address if fine.
> 
> But I believe the basic entity to be manipulated from an interface
> standpoitn remains the group.
> 
> To get back to my GUI example, once you've D&D your group of devices
> over, you can have the option to open that group and check/uncheck
> individual devices & assign them addresses if you want. That doesn't
> change the fact that practically speaking, the whole group is now owned
> by the guest.
> 
> I will go further than that actually. If you look at how the isolation
> HW works on POWER, the fact that I have the MMIO segmentation means that
> I can simply give the entire group MMIO space to the guest. No problem
> of small BARs, no need to slow-map them ... etc.. that's a pretty handy
> feature don't you think ?
> 
> But that means that those other devices -will- be there, mapped along
> with the one you care about. We may not expose it in config space but it
> will be accessible. I suppose we can keep its IO/MEM decoding disabled.
> But my point is that for all intend and purpose, it's actually owned by
> the guest.
> 
> > > The interface currently proposed for VFIO (and associated uiommu)
> > > doesn't handle that problem at all. Instead, it is entirely centered
> > > around a specific "feature" of the VTd iommu's for creating arbitrary
> > > domains with arbitrary devices (tho those devices -do- have the same
> > > constraints exposed above, don't try to put 2 legacy PCI devices behind
> > > the same bridge into 2 different domains !), but the API totally ignores
> > > the problem, leaves it to libvirt "magic foo" and focuses on something
> > > that is both quite secondary in the grand scheme of things, and quite
> > > x86 VTd specific in the implementation and API definition.
> > 
> > To be fair, libvirt's "magic foo" is built out of the necessity that
> > nobody else is defining the rules.
> 
> Sure, which is why I propose that the kernel exposes the rules since
> it's really the one right place to have that sort of HW constraint
> knowledge, especially since it can be partially at least platform
> specific.
>  
>  .../...

I'll try to consolidate my reply to all the above here because there are
too many places above to interject and make this thread even more
difficult to respond to.  Much of what you're discussion above comes
down to policy.  Do we trust DisINTx?  Do we trust multi-function
devices?  I have no doubt there are devices we can use as examples for
each behaving badly.  On x86 this is one of the reasons we have SR-IOV.
Besides splitting a single device into multiple, it makes sure each
devices is actually virtualization friendly.  POWER seems to add
multiple layers of hardware so that you don't actually have to trust the
device, which is a great value add for enterprise systems, but in doing
so it mostly defeats the purpose and functionality of SR-IOV.

How we present this in a GUI is largely irrelevant because something has
to create a superset of what the hardware dictates (can I uniquely
identify transactions from this device, can I protect other devices from
it, etc.), the system policy (do I trust DisINTx, do I trust function
isolation, do I require ACS) and mold that with what the user actually
wants to assign.  For the VFIO kernel interface, we should only be
concerned with the first problem.  Userspace is free to make the rest as
simple or complete as it cares to.  I argue for x86, we want device
level granularity of assignment, but that also tends to be the typical
case (when only factoring in hardware restrictions) due to our advanced
iommus.

> > > Maybe something like /sys/devgroups ? This probably warrants involving
> > > more kernel people into the discussion.
> > 
> > I don't yet buy into passing groups to qemu since I don't buy into the
> > idea of always exposing all of those devices to qemu.  Would it be
> > sufficient to expose iommu nodes in sysfs that link to the devices
> > behind them and describe properties and capabilities of the iommu
> > itself?  More on this at the end.
> 
> Well, iommu aren't the only factor. I mentioned shared interrupts (and
> my unwillingness to always trust DisINTx),

*userspace policy*

>  there's also the MMIO
> grouping I mentioned above (in which case it's an x86 -limitation- with
> small BARs that I don't want to inherit, especially since it's based on
> PAGE_SIZE and we commonly have 64K page size on POWER), etc...

But isn't MMIO grouping effectively *at* the iommu?

> So I'm not too fan of making it entirely look like the iommu is the
> primary factor, but we -can-, that would be workable. I still prefer
> calling a cat a cat and exposing the grouping for what it is, as I think
> I've explained already above, tho. 

The trouble is the "group" analogy is more fitting to a partitionable
system, whereas on x86 we can really mix-n-match devices across iommus
fairly easily.  The iommu seems to be the common point to describe these
differences.

> > > Now some of this can be fixed with tweaks, and we've started doing it
> > > (we have a working pass-through using VFIO, forgot to mention that, it's
> > > just that we don't like what we had to do to get there).
> > 
> > This is a result of wanting to support *unmodified* x86 guests.  We
> > don't have the luxury of having a predefined pvDMA spec that all x86
> > OSes adhere to. 
> 
> No but you could emulate a HW iommu no ?

We can, but then we have to worry about supporting legacy, proprietary
OSes that may not have support or may make use of it differently.  As
Avi mentions, hardware is coming the eases the "pin the whole guest"
requirement and we may implement emulated iommus for the benefit of some
guests.

> >  The 32bit problem is unfortunate, but the priority use
> > case for assigning devices to guests is high performance I/O, which
> > usually entails modern, 64bit hardware.  I'd like to see us get to the
> > point of having emulated IOMMU hardware on x86, which could then be
> > backed by VFIO, but for now guest pinning is the most practical and
> > useful.
> 
> For your current case maybe. It's just not very future proof imho.
> Anyways, it's fixable, but the APIs as they are make it a bit clumsy.

You expect more 32bit devices in the future?

> > > Also our next generation chipset may drop support for PIO completely.
> > > 
> > > On the other hand, because PIO is just a special range of MMIO for us,
> > > we can do normal pass-through on it and don't need any of the emulation
> > > done qemu.
> > 
> > Maybe we can add mmap support to PIO regions on non-x86.
> 
> We have to yes. I haven't looked into it yet, it should be easy if VFIO
> kernel side starts using the "proper" PCI mmap interfaces in kernel (the
> same interfaces sysfs & proc use).

Patches welcome.

> > >   * MMIO constraints
> > > 
> > > The QEMU side VFIO code hard wires various constraints that are entirely
> > > based on various requirements you decided you have on x86 but don't
> > > necessarily apply to us :-)
> > > 
> > > Due to our paravirt nature, we don't need to masquerade the MSI-X table
> > > for example. At all. If the guest configures crap into it, too bad, it
> > > can only shoot itself in the foot since the host bridge enforce
> > > validation anyways as I explained earlier. Because it's all paravirt, we
> > > don't need to "translate" the interrupt vectors & addresses, the guest
> > > will call hyercalls to configure things anyways.
> > 
> > With interrupt remapping, we can allow the guest access to the MSI-X
> > table, but since that takes the host out of the loop, there's
> > effectively no way for the guest to correctly program it directly by
> > itself.
> 
> Right, I think what we need here is some kind of capabilities to
> "disable" those "features" of qemu vfio.c that aren't needed on our
> platform :-) Shouldn't be too hard. We need to make this runtime tho
> since different machines can have different "capabilities".

Sure, we'll probably eventually want a switch to push the MSI-X table to
KVM when it's available.

> > > We don't need to prevent MMIO pass-through for small BARs at all. This
> > > should be some kind of capability or flag passed by the arch. Our
> > > segmentation of the MMIO domain means that we can give entire segments
> > > to the guest and let it access anything in there (those segments are a
> > > multiple of the page size always). Worst case it will access outside of
> > > a device BAR within a segment and will cause the PE to go into error
> > > state, shooting itself in the foot, there is no risk of side effect
> > > outside of the guest boundaries.
> > 
> > Sure, this could be some kind of capability flag, maybe even implicit in
> > certain configurations.
> 
> Yup.
> 
> > > In fact, we don't even need to emulate BAR sizing etc... in theory. Our
> > > paravirt guests expect the BARs to have been already allocated for them
> > > by the firmware and will pick up the addresses from the device-tree :-)
> > > 
> > > Today we use a "hack", putting all 0's in there and triggering the linux
> > > code path to reassign unassigned resources (which will use BAR
> > > emulation) but that's not what we are -supposed- to do. Not a big deal
> > > and having the emulation there won't -hurt- us, it's just that we don't
> > > really need any of it.
> > > 
> > > We have a small issue with ROMs. Our current KVM only works with huge
> > > pages for guest memory but that is being fixed. So the way qemu maps the
> > > ROM copy into the guest address space doesn't work. It might be handy
> > > anyways to have a way for qemu to use MMIO emulation for ROM access as a
> > > fallback. I'll look into it.
> > 
> > So that means ROMs don't work for you on emulated devices either?  The
> > reason we read it once and map it into the guest is because Michael
> > Tsirkin found a section in the PCI spec that indicates devices can share
> > address decoders between BARs and ROM.
> 
> Yes, he is correct.
> 
> >   This means we can't just leave
> > the enabled bit set in the ROM BAR, because it could actually disable an
> > address decoder for a regular BAR.  We could slow-map the actual ROM,
> > enabling it around each read, but shadowing it seemed far more
> > efficient.
> 
> Right. We can slow map the ROM, or we can not care :-) At the end of the
> day, what is the difference here between a "guest" under qemu and the
> real thing bare metal on the machine ? IE. They have the same issue vs.
> accessing the ROM. IE. I don't see why qemu should try to make it safe
> to access it at any time while it isn't on a real machine. Since VFIO
> resets the devices before putting them in guest space, they should be
> accessible no ? (Might require a hard reset for some devices tho ... )

My primary motivator for doing the ROM the way it's done today is that I
get to push all the ROM handling off to QEMU core PCI code.  The ROM for
an assigned device is handled exactly like the ROM for an emulated
device except it might be generated by reading it from the hardware.
This gives us the benefit of things like rombar=0 if I want to hide the
ROM or romfile=<file> if I want to load an ipxe image for a device that
may not even have a physical ROM.  Not to mention I don't have to
special case ROM handling routines in VFIO.  So it actually has little
to do w/ making it safe to access the ROM at any time.

> In any case, it's not a big deal and we can sort it out, I'm happy to
> fallback to slow map to start with and eventually we will support small
> pages mappings on POWER anyways, it's a temporary limitation.

Perhaps this could also be fixed in the generic QEMU PCI ROM support so
it works for emulated devices too... code reuse paying off already ;)

> > >   * EEH
> > > 
> > > This is the name of those fancy error handling & isolation features I
> > > mentioned earlier. To some extent it's a superset of AER, but we don't
> > > generally expose AER to guests (or even the host), it's swallowed by
> > > firmware into something else that provides a superset (well mostly) of
> > > the AER information, and allow us to do those additional things like
> > > isolating/de-isolating, reset control etc...
> > > 
> > > Here too, we'll need arch specific APIs through VFIO. Not necessarily a
> > > huge deal, I mention it for completeness.
> > 
> > We expect to do AER via the VFIO netlink interface, which even though
> > its bashed below, would be quite extensible to supporting different
> > kinds of errors.
> 
> As could platform specific ioctls :-)

Is qemu going to poll for errors?

> > >    * Misc
> > > 
> > > There's lots of small bits and pieces... in no special order:
> > > 
> > >  - netlink ? WTF ! Seriously, we don't need a hybrid API with a bit of
> > > netlink and a bit of ioctl's ... it's not like there's something
> > > fundamentally  better for netlink vs. ioctl... it really depends what
> > > you are doing, and in this case I fail to see what netlink brings you
> > > other than bloat and more stupid userspace library deps.
> > 
> > The netlink interface is primarily for host->guest signaling.  I've only
> > implemented the remove command (since we're lacking a pcie-host in qemu
> > to do AER), but it seems to work quite well.  If you have suggestions
> > for how else we might do it, please let me know.  This seems to be the
> > sort of thing netlink is supposed to be used for.
> 
> I don't understand what the advantage of netlink is compared to just
> extending your existing VFIO ioctl interface, possibly using child
> fds as we do for example with spufs, but it's not a huge deal. It's
> just that netlink has its own gotchas and I don't like multi-headed
> interfaces.

We could do yet another eventfd that triggers the VFIO user to go call
an ioctl to see what happened, but then we're locked into an ioctl
interface for something that we may want to more easily extend over
time.  As I said, it feels like this is what netlink is for and the
arguments against seem to be more gut reaction.
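
(To make the comparison concrete, the eventfd-plus-ioctl alternative
would look roughly like this from userspace -- VFIO_GET_EVENT and
struct vfio_event are made-up names, not anything that exists today:

#include <poll.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

struct vfio_event {                     /* hypothetical */
        uint32_t type;                  /* e.g. remove, AER, ... */
        uint32_t data;
};
#define VFIO_GET_EVENT _IOR(';', 100, struct vfio_event)   /* made up */

static void wait_for_device_event(int vfio_fd, int event_fd)
{
        struct pollfd pfd = { .fd = event_fd, .events = POLLIN };
        uint64_t count;
        struct vfio_event ev;

        poll(&pfd, 1, -1);                       /* kernel kicked the eventfd */
        read(event_fd, &count, sizeof(count));   /* drain it */
        ioctl(vfio_fd, VFIO_GET_EVENT, &ev);     /* now ask what happened */
        /* ... dispatch on ev.type ... */
}

which works, but every new event type means growing that ioctl, whereas
netlink messages are rather more self-describing.)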

> > >  - I don't like too much the fact that VFIO provides yet another
> > > different API to do what we already have at least 2 kernel APIs for, ie,
> > > BAR mapping and config space access. At least it should be better at
> > > using the backend infrastructure of the 2 others (sysfs & procfs). I
> > > understand it wants to filter in some case (config space) and -maybe-
> > > yet another API is the right way to go but allow me to have my doubts.
> > 
> > The use of PCI sysfs is actually one of my complaints about current
> > device assignment.  To do assignment with an unprivileged guest we need
> > to open the PCI sysfs config file for it, then change ownership on a
> > handful of other PCI sysfs files, then there's this other pci-stub thing
> > to maintain ownership, but the kvm ioctls don't actually require it and
> > can grab onto any free device...  We are duplicating some of that in
> > VFIO, but we also put the ownership of the device behind a single device
> > file.  We do have the uiommu problem that we can't give an unprivileged
> > user ownership of that, but your usage model may actually make that
> > easier.  More below...
> > 
> > > One thing I thought about but you don't seem to like it ... was to use
> > > the need to represent the partitionable entity as groups in sysfs that I
> > > talked about earlier. Those could have per-device subdirs with the usual
> > > config & resource files, same semantic as the ones in the real device,
> > > but when accessed via the group they get filtering. I might or might not
> > > be practical in the end, tbd, but it would allow apps using a slightly
> > > modified libpci for example to exploit some of this.
> > 
> > I may be tainted by our disagreement that all the devices in a group
> > need to be exposed to the guest and qemu could just take a pointer to a
> > sysfs directory.  That seems very unlike qemu and pushes more of the
> > policy into qemu, which seems like the wrong direction.
> 
> I don't see how it pushes "policy" into qemu.
> 
> The "policy" here is imposed by the HW setup and exposed by the
> kernel :-) Giving qemu a group means qemu takes "ownership" of that bunch
> of devices, so far I don't see what's policy about that. From there, it
> would be "handy" for people to just stop there and just see all the
> devices of the group show up in the guest, but by all means feel free to
> suggest a command line interface that allows to more precisely specify
> which of the devices in the group to pass through and at what address.

That's exactly the policy I'm thinking of.  Here's a group of devices,
do something with them...  Does qemu assign them all?  where?  does it
allow hotplug?  do we have ROMs?  should we?  from where?

> > >  - The qemu vfio code hooks directly into ioapic ... of course that
> > > won't fly with anything !x86
> > 
> > I spent a lot of time looking for an architecture neutral solution here,
> > but I don't think it exists.  Please prove me wrong.
> 
> No it doesn't, I agree, that's why it should be some kind of notifier or
> function pointer setup by the platform specific code.

Hmm... it is.  I added a pci_get_irq() that returns a
platform/architecture specific translation of a PCI interrupt to its
resulting system interrupt.  Implement this in your PCI root bridge.
There's a notifier for when this changes, so vfio will check
pci_get_irq() again, also to be implemented in the PCI root bridge code.
And a notifier that gets registered with that system interrupt and gets
notice for EOI... implemented in x86 ioapic, somewhere else for power.
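
(Sketching the flow, with stand-in names for everything except
pci_get_irq() itself, which is what the patches actually call it -- the
real code is in the qemu vfio series:

typedef struct PCIDevice PCIDevice;      /* opaque stand-in here */

/* provided by the platform's PCI root bridge emulation:
 * INTx pin of a device -> resulting system interrupt (GSI on x86) */
extern int pci_get_irq(PCIDevice *dev, int pin);

static int sys_irq;                      /* what vfio is currently watching */

static void route_changed(PCIDevice *dev, int pin)
{
        /* routing notifier: the guest changed the PCI->GSI mapping (ACPI),
         * so re-resolve and move the EOI notifier to the new interrupt */
        sys_irq = pci_get_irq(dev, pin);
}

static void eoi_seen(void)
{
        /* EOI notifier on sys_irq (ioapic on x86, somewhere else on power):
         * the guest serviced the interrupt, so it's safe to re-enable INTx
         * on the physical device */
}

vfio itself only consumes these callbacks; the implementations live in
the per-platform code.)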

> >   The problem is
> > that we have to disable INTx on an assigned device after it fires (VFIO
> > does this automatically).  If we don't do this, a non-responsive or
> > malicious guest could sit on the interrupt, causing it to fire
> > repeatedly as a DoS on the host.  The only indication that we can rely
> > on to re-enable INTx is when the guest CPU writes an EOI to the APIC.
> > We can't just wait for device accesses because a) the device CSRs are
> > (hopefully) direct mapped and we'd have to slow map them or attempt to
> > do some kind of dirty logging to detect when they're accesses b) what
> > constitutes an interrupt service is device specific.
> > 
> > That means we need to figure out how PCI interrupt 'A' (or B...)
> > translates to a GSI (Global System Interrupt - ACPI definition, but
> > hopefully a generic concept).  That GSI identifies a pin on an IOAPIC,
> > which will also see the APIC EOI.  And just to spice things up, the
> > guest can change the PCI to GSI mappings via ACPI.  I think the set of
> > callbacks I've added are generic (maybe I left ioapic in the name), but
> > yes they do need to be implemented for other architectures.  Patches
> > appreciated from those with knowledge of the systems and/or access to
> > device specs.  This is the only reason that I make QEMU VFIO only build
> > for x86.
> 
> Right, and we need to cook a similar sauce for POWER, it's an area that
> has to be arch specific (and in fact specific to the particular HW machine
> being emulated), so we just need to find out what's the cleanest way for
> the platform to "register" the right callbacks here.

Aside from the ioapic, I hope it's obvious that these are hooks in the PCI
root bridge emulation.

[snip]
> > Rather than your "groups" idea, I've been mulling over whether we can
> > just expose the dependencies, configuration, and capabilities in sysfs
> > and build qemu commandlines to describe it.  For instance, if we simply
> > start with creating iommu nodes in sysfs, we could create links under
> > each iommu directory to the devices behind them.  Some kind of
> > capability file could define properties like whether it's page table
> > based or fixed iova window or the granularity of mapping the devices
> > behind it.  Once we have that, we could probably make uiommu attach to
> > each of those nodes.
> 
> Well, s/iommu/groups and you are pretty close to my original idea :-)
> 
> I don't mind that much what the details are, but I like the idea of not
> having to construct a 3-pages command line every time I want to
> pass-through a device, most "simple" usage scenarios don't care that
> much.
> 
> > That means we know /dev/uiommu7 (random example) is our access to a
> > specific iommu with a given set of devices behind it.
> 
> Linking those sysfs iommus or groups to a /dev/ entry is fine by me.
>   
> >   If that iommu is
> > a PE (via those capability files), then a user space entity (trying hard
> > not to call it libvirt) can unbind all those devices from the host,
> > maybe bind the ones it wants to assign to a guest to vfio and bind the
> > others to pci-stub for safe keeping.  If you trust a user with
> > everything in a PE, bind all the devices to VFIO, chown all
> > the /dev/vfioX entries for those devices, and the /dev/uiommuX device.
> >
> > We might then come up with qemu command lines to describe interesting
> > configurations, such as:
> > 
> > -device iommu,model=PAPR,uiommu=/dev/uiommu7,id=iommu.0 \
> > -device pci-bus,...,iommu=iommu0,id=pci.0 \
> > -device vfio,host=ssss:bb:dd.f,bus=pci.0,addr=dd.f,id=hostdev0
> > 
> > The userspace entity would obviously need to put things in the same PE
> > in the right place, but it doesn't seem to take a lot of sysfs info to
> > get that right.
> > 
> > Today we do DMA mapping via the VFIO device because the capabilities of
> > the IOMMU domains change depending on which devices are connected (for
> > VT-d, the least common denominator of the IOMMUs in play).  Forcing the
> > DMA mappings through VFIO naturally forces the call order.  If we moved
> > to something like above, we could switch the DMA mapping to the uiommu
> > device, since the IOMMU would have fixed capabilities.
> 
> That makes sense.
> 
> > What gaps would something like this leave for your IOMMU granularity
> > problems?  I'll need to think through how it works when we don't want to
> > expose the iommu to the guest, maybe a model=none (default) that doesn't
> > need to be connected to a pci bus and maps all guest memory.  Thanks,
> 
> Well, I would map those "iommus" to PEs, so what remains is the path to
> put all the "other" bits and pieces such as inform qemu of the location
> and size of the MMIO segment(s) (so we can map the whole thing and not
> bother with individual BARs) etc... 

My assumption is that PEs are largely defined by the iommus already.
Are MMIO segments a property of the iommu too?  Thanks,

Alex

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-07-31 14:09   ` Avi Kivity
@ 2011-08-01 20:27     ` Alex Williamson
  -1 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-08-01 20:27 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Benjamin Herrenschmidt, kvm, Anthony Liguori, David Gibson,
	Paul Mackerras, Alexey Kardashevskiy, linux-pci, linuxppc-dev

On Sun, 2011-07-31 at 17:09 +0300, Avi Kivity wrote:
> On 07/30/2011 02:58 AM, Benjamin Herrenschmidt wrote:
> > Due to our paravirt nature, we don't need to masquerade the MSI-X table
> > for example. At all. If the guest configures crap into it, too bad, it
> > can only shoot itself in the foot since the host bridge enforce
> > validation anyways as I explained earlier. Because it's all paravirt, we
> > don't need to "translate" the interrupt vectors & addresses, the guest
> > will call hypercalls to configure things anyways.
> 
> So, you have interrupt redirection?  That is, MSI-x table values encode 
> the vcpu, not pcpu?
> 
> Alex, with interrupt redirection, we can skip this as well?  Perhaps 
> only if the guest enables interrupt redirection?

It's not clear to me how we could skip it.  With VT-d, we'd have to
implement an emulated interrupt remapper and hope that the guest picks
unused indexes in the host interrupt remapping table before it could do
anything useful with direct access to the MSI-X table.  Maybe AMD IOMMU
makes this easier?  Thanks,

Alex

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-07-31 14:09   ` Avi Kivity
@ 2011-08-02  1:27     ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 322+ messages in thread
From: Benjamin Herrenschmidt @ 2011-08-02  1:27 UTC (permalink / raw)
  To: Avi Kivity
  Cc: kvm, Anthony Liguori, Alex Williamson, David Gibson,
	Paul Mackerras, Alexey Kardashevskiy, linux-pci, linuxppc-dev

On Sun, 2011-07-31 at 17:09 +0300, Avi Kivity wrote:
> On 07/30/2011 02:58 AM, Benjamin Herrenschmidt wrote:
> > - Having a magic heuristic in libvirt to figure out those constraints is
> > WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
> > knowledge of PCI resource management and getting it wrong in many many
> > cases, something that took years to fix essentially by ripping it all
> > out. This is kernel knowledge and thus we need the kernel to expose in a
> > way or another what those constraints are, what those "partitionable
> > groups" are.
> 
> How about a sysfs entry partition=<partition-id>? then libvirt knows not 
> to assign devices from the same partition to different guests (and not 
> to let the host play with them, either).

That would work. On POWER I also need to expose the way that such
partitions also mean shared iommu domain but that's probably doable.

It would be easy for me to implement it that way since I would just pass
down my PE#.

However, it seems to be a bit of a "smallest possible tweak" approach to
get it to work. We keep a completely orthogonal iommu domain handling for x86
and there is no link between them.

I still personally prefer a way to statically define the grouping, but
it looks like you guys don't agree... oh well.
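
(Just to illustrate what I mean by "easy" -- a rough sketch of such an
attribute on our side, where pnv_pci_get_pe_num() is a made-up helper
standing in for however we look up the PE# of a device:

#include <linux/device.h>
#include <linux/kernel.h>
#include <linux/pci.h>

extern int pnv_pci_get_pe_num(struct pci_dev *pdev);    /* made up */

static ssize_t partition_show(struct device *dev,
                              struct device_attribute *attr, char *buf)
{
        /* on POWER the partition id is simply the device's PE# */
        return sprintf(buf, "%d\n", pnv_pci_get_pe_num(to_pci_dev(dev)));
}
static DEVICE_ATTR(partition, S_IRUGO, partition_show, NULL);

the x86 side would presumably synthesize an id out of whatever iommu and
bridge constraints apply there.)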

> > The interface currently proposed for VFIO (and associated uiommu)
> > doesn't handle that problem at all. Instead, it is entirely centered
> > around a specific "feature" of the VTd iommu's for creating arbitrary
> > domains with arbitrary devices (tho those devices -do- have the same
> > constraints exposed above, don't try to put 2 legacy PCI devices behind
> > the same bridge into 2 different domains !), but the API totally ignores
> > the problem, leaves it to libvirt "magic foo" and focuses on something
> > that is both quite secondary in the grand scheme of things, and quite
> > x86 VTd specific in the implementation and API definition.
> >
> > Now, I'm not saying these programmable iommu domains aren't a nice
> > feature and that we shouldn't exploit them when available, but as it is,
> > it is too much a central part of the API.
> 
> I have a feeling you'll be getting the same capabilities sooner or 
> later, or you won't be able to make use of S/R IOV VFs.

I'm not sure what you mean. We can do SR-IOV just fine (well, with some
limitations due to constraints with how our MMIO segmenting works and
indeed some of those are being lifted in our future chipsets but
overall, it works).

In -theory-, one could do the grouping dynamically with some kind of API
for us as well. However the constraints are such that it's not
practical. Filtering on RID is based on number of bits to match in the
bus number and whether to match the dev and fn. So it's not arbitrary
(but works fine for SR-IOV).

The MMIO segmentation is a bit special too. There is a single MMIO
region in 32-bit space (size is configurable but that's not very
practical so for now we stick with 1G) which is evenly divided into N
segments (where N is the number of PE# supported by the host bridge,
typically 128 with the current bridges).

Each segment goes through a remapping table to select the actual PE# (so
large BARs use consecutive segments mapped to the same PE#).
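
(In code terms, using the example numbers from this mail -- 1G of 32-bit
MMIO space, 128 segments, so 8M per segment -- resolving an MMIO address
to a PE# is just:

#include <stdint.h>

#define M32_BASE  0xC0000000u               /* 3G, PCI-side, see below */
#define M32_SIZE  0x40000000u               /* 1G window */
#define NUM_SEGS  128
#define SEG_SIZE  (M32_SIZE / NUM_SEGS)     /* 8M */

static uint16_t m32_remap[NUM_SEGS];        /* segment -> PE#, set up by fw/kernel */

static int pe_for_mmio_addr(uint32_t pci_addr)
{
        uint32_t off = pci_addr - M32_BASE;

        if (off >= M32_SIZE)
                return -1;                  /* outside the M32 window */
        return m32_remap[off / SEG_SIZE];
}

hence the "large BARs use consecutive segments" bit: a BAR bigger than 8M
simply has several consecutive remap entries pointing at the same PE#.)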
 
For SR-IOV we plan to not use the M32 region. We also have 64-bit MMIO
regions which act as some kind of "accordions", they are evenly divided
into segments in different PE# and there's several of them which we can
"move around" and typically use to map VF BARs.

>  While we should 
> support the older hardware, the interfaces should be designed with the 
> newer hardware in mind.

Well, our newer hardware will relax some of our limitations, like the
way our 64-bit segments work (I didn't go into details but they have
some inconvenient size constraints that will be lifted), having more
PE#, supporting more MSI ports etc... but the basic scheme remains the
same. Oh and the newer IOMMU will support separate address spaces.

But as you said, we -do- need to support the older stuff.

> > My main point is that I don't want the "knowledge" here to be in libvirt
> > or qemu. In fact, I want to be able to do something as simple as passing
> > a reference to a PE to qemu (sysfs path ?) and have it just pickup all
> > the devices in there and expose them to the guest.
> 
> Such magic is nice for a developer playing with qemu but in general less 
> useful for a managed system where the various cards need to be exposed 
> to the user interface anyway.

Right but at least the code that does that exposure can work top-down,
picking groups and exposing their content.

> > * IOMMU
> >
> > Now more on iommu. I've described I think in enough details how ours
> > work, there are others, I don't know what freescale or ARM are doing,
> > sparc doesn't quite work like VTd either, etc...
> >
> > The main problem isn't that much the mechanics of the iommu but really
> > how it's exposed (or not) to guests.
> >
> > VFIO here is basically designed for one and only one thing: expose the
> > entire guest physical address space to the device more/less 1:1.
> 
> A single level iommu cannot be exposed to guests.  Well, it can be 
> exposed as an iommu that does not provide per-device mapping.

Well, x86 ones can't maybe but on POWER we can and must thanks to our
essentially paravirt model :-) Even if it wasn't and we used trapping
of accesses to the table, it would work because in practice, even with
filtering, what we end up having is a per-device (or rather per-PE#
table).

> A two level iommu can be emulated and exposed to the guest.  See 
> http://support.amd.com/us/Processor_TechDocs/48882.pdf for an example.

What you mean by 2-level is two passes through two trees (i.e. 6 or 8
levels, right ?). We don't have that and probably never will. But again, because
we have a paravirt interface to the iommu, it's less of an issue.

> > This means:
> >
> >    - It only works with iommu's that provide complete DMA address spaces
> > to devices. Won't work with a single 'segmented' address space like we
> > have on POWER.
> >
> >    - It requires the guest to be pinned. Pass-through ->  no more swap
> 
> Newer iommus (and devices, unfortunately) (will) support I/O page faults 
> and then the requirement can be removed.

No. -Some- newer devices will. Out of these, a bunch will have so many
bugs that it's not usable. Some never will. It's a mess really and I
wouldn't design my stuff based on those premises just yet. Making it
possible to support it for sure, having it in mind, but not making it
the foundation on which the whole API is designed.

> >    - The guest cannot make use of the iommu to deal with 32-bit DMA
> > devices, thus a guest with more than a few G of RAM (I don't know the
> > exact limit on x86, depends on your IO hole I suppose), and you end up
> > back to swiotlb&  bounce buffering.
> 
> Is this a problem in practice?

Could be. It's an artificial limitation we don't need on POWER.

> >    - It doesn't work for POWER server anyways because of our need to
> > provide a paravirt iommu interface to the guest since that's how pHyp
> > works today and how existing OSes expect to operate.
> 
> Then you need to provide that same interface, and implement it using the 
> real iommu.

Yes. Working on it. It's not very practical due to how VFIO interacts in
terms of APIs but solvable. Eventually, we'll make the iommu Hcalls
almost entirely real-mode for performance reasons.

> > - Performance sucks of course, the vfio map ioctl wasn't meant for that
> > and has quite a bit of overhead. However we'll want to do the paravirt
> > call directly in the kernel eventually ...
> 
> Does the guest iomap each request?  Why?

Not sure what you mean... the guest calls h-calls for every iommu page
mapping/unmapping, yes. So the performance of these is critical. So yes,
we'll eventually do it in kernel. We just haven't yet.
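
(For a feel of where the time goes -- the mapping path is basically one
hypercall per 4K iommu page, H_PUT_TCE in PAPR terms; h_put_tce() below
is an illustrative wrapper, not a real function name:

#include <stdint.h>

/* illustrative wrapper around the H_PUT_TCE hypercall */
extern long h_put_tce(uint64_t liobn, uint64_t ioba, uint64_t tce);

static void guest_iommu_map_range(uint64_t liobn, uint64_t ioba,
                                  uint64_t pa, unsigned long npages)
{
        unsigned long i;

        for (i = 0; i < npages; i++)
                /* one guest exit (and, today, one trip through the vfio
                 * map path on the host) per 4K page; real code also ORs
                 * the TCE read/write permission bits into the value */
                h_put_tce(liobn, ioba + (i << 12), pa + (i << 12));
}

which is why we want to handle these directly in the kernel eventually.)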

> Emulating the iommu in the kernel is of course the way to go if that's 
> the case, still won't performance suck even then?

Well, we have HW in the field where we still beat intel on 10G
networking performance but heh, yeah, the cost of those h-calls is a
concern.

There are some new interfaces in pHyp that we'll eventually support that
allow to create additional iommu mappings in 64-bit space (the current
base mapping is 32-bit and 4K for backward compatibility) with larger
iommu page sizes.

This will eventually help. For guests backed with hugetlbfs we might be
able to map the whole guest in using 16M pages at the iommu level. 

But on the other hand, the current method means that we can support
pass-through without losing overcommit & paging which is handy.

> > The QEMU side VFIO code hard wires various constraints that are entirely
> > based on various requirements you decided you have on x86 but don't
> > necessarily apply to us :-)
> >
> > Due to our paravirt nature, we don't need to masquerade the MSI-X table
> > for example. At all. If the guest configures crap into it, too bad, it
> > can only shoot itself in the foot since the host bridge enforce
> > validation anyways as I explained earlier. Because it's all paravirt, we
> > don't need to "translate" the interrupt vectors & addresses, the guest
> > will call hypercalls to configure things anyways.
> 
> So, you have interrupt redirection?  That is, MSI-x table values encode 
> the vcpu, not pcpu?

Not exactly. The MSI-X address is a real PCI address to an MSI port and
the value is a real interrupt number in the PIC.

However, the MSI port filters by RID (using the same matching as PE#) to
ensure that only allowed devices can write to it, and the PIC has
matching PE# information to ensure that only allowed devices can trigger
the interrupt.

As for the guest knowing what values to put in there (what port address
and interrupt source numbers to use), this is part of the paravirt APIs.

So the paravirt APIs handle the configuration and the HW ensures that
the guest cannot do anything else than what it's allowed to.

> Alex, with interrupt redirection, we can skip this as well?  Perhaps 
> only if the guest enables interrupt redirection?
> 
> If so, it's not arch specific, it's interrupt redirection specific.
> 
> > We don't need to prevent MMIO pass-through for small BARs at all. This
> > should be some kind of capability or flag passed by the arch. Our
> > segmentation of the MMIO domain means that we can give entire segments
> > to the guest and let it access anything in there (those segments are a
> > multiple of the page size always). Worst case it will access outside of
> > a device BAR within a segment and will cause the PE to go into error
> > state, shooting itself in the foot, there is no risk of side effect
> > outside of the guest boundaries.
> 
> Does the BAR value contain the segment base address?  Or is that added 
> later?

It's a shared address space. With a basic configuration on p7ioc for
example we have MMIO going from 3G to 4G (PCI side addresses). BARs
contain the normal PCI address there. But that 1G is divided in 128
segments of equal size which can separately be assigned to PE#'s.

So BARs are allocated by firmware or the kernel PCI code so that devices
in different PEs don't share segments.

Of course there's always the risk that a device can be hacked via a
sideband access to BARs to move out of its allocated segment. That
means that the guest owning that device won't be able to access it
anymore and can potentially disturb a guest or host owning whatever is
in that other segment.

The only way to enforce isolation here is to ensure that PE# are
entirely behind P2P bridges, since those would then ensure that even if
you put crap into your BARs you won't be able to walk over a neighbour.

I believe pHyp enforces that, for example, if you have a slot, all
devices & functions behind that slot pertain to the same PE# under pHyp.

That means you cannot put individual functions of a device into
different PE# with pHyp.

We plan to be a bit less restrictive here for KVM, assuming that if you
use a device that allows such a back-channel to the BARs, then it's your
problem to not trust such a device for virtualization. And most of the
time, you -will- have a P2P to protect you anyways.

The problem doesn't exist (or is assumed as non-existing) for SR-IOV
since in that case, the VFs are meant to be virtualized, so pHyp assumes
there is no such back-channel and it can trust them to be in different
PE#.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
@ 2011-08-02  1:27     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 322+ messages in thread
From: Benjamin Herrenschmidt @ 2011-08-02  1:27 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	David Gibson, Alex Williamson, Anthony Liguori, linuxppc-dev

On Sun, 2011-07-31 at 17:09 +0300, Avi Kivity wrote:
> On 07/30/2011 02:58 AM, Benjamin Herrenschmidt wrote:
> > - Having a magic heuristic in libvirt to figure out those constraints is
> > WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
> > knowledge of PCI resource management and getting it wrong in many many
> > cases, something that took years to fix essentially by ripping it all
> > out. This is kernel knowledge and thus we need the kernel to expose in a
> > way or another what those constraints are, what those "partitionable
> > groups" are.
> 
> How about a sysfs entry partition=<partition-id>? then libvirt knows not 
> to assign devices from the same partition to different guests (and not 
> to let the host play with them, either).

That would work. On POWER I also need to expose the way that such
partitions also mean shared iommu domain but that's probably doable.

It would be easy for me to implement it that way since I would just pass
down my PE#.

However, it seems to be a bit of the "smallest possible tweak" to get it
to work. We keep a completely orthogonal iommu domain handling for x86
and there is no link between them.

I still personally prefer a way to statically define the grouping, but
it looks like you guys don't agree... oh well.

> > The interface currently proposed for VFIO (and associated uiommu)
> > doesn't handle that problem at all. Instead, it is entirely centered
> > around a specific "feature" of the VTd iommu's for creating arbitrary
> > domains with arbitrary devices (tho those devices -do- have the same
> > constraints exposed above, don't try to put 2 legacy PCI devices behind
> > the same bridge into 2 different domains !), but the API totally ignores
> > the problem, leaves it to libvirt "magic foo" and focuses on something
> > that is both quite secondary in the grand scheme of things, and quite
> > x86 VTd specific in the implementation and API definition.
> >
> > Now, I'm not saying these programmable iommu domains aren't a nice
> > feature and that we shouldn't exploit them when available, but as it is,
> > it is too much a central part of the API.
> 
> I have a feeling you'll be getting the same capabilities sooner or 
> later, or you won't be able to make use of S/R IOV VFs.

I'm not sure what you mean. We can do SR-IOV just fine (well, with some
limitations due to constraints on how our MMIO segmenting works, and
indeed some of those are being lifted in our future chipsets, but
overall it works).

In -theory-, one could do the grouping dynamically with some kind of API
for us as well. However the constraints are such that it's not
practical. Filtering on RID is based on the number of bits to match in
the bus number and on whether to match the dev and fn. So it's not
arbitrary (but it works fine for SR-IOV).
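
To give an idea, the match is essentially the below (a sketch only; the
structure and names are invented, but that's the shape of the logic):

    #include <stdbool.h>
    #include <stdint.h>

    /* Invented representation of one RID filter entry, for illustration */
    struct rid_filter {
        uint8_t bus_base;    /* bus number to compare against */
        uint8_t bus_bits;    /* how many top bits of the bus number to match */
        bool    match_devfn; /* also require an exact devfn match ? */
        uint8_t devfn;
    };

    static bool rid_matches(const struct rid_filter *f, uint16_t rid)
    {
        uint8_t bus = rid >> 8, devfn = rid & 0xff;
        uint8_t mask = f->bus_bits ? 0xff << (8 - f->bus_bits) : 0;

        if ((bus & mask) != (f->bus_base & mask))
            return false;
        return !f->match_devfn || devfn == f->devfn;
    }

That lines up fine with how SR-IOV VFs get their RIDs, but it doesn't
let you build arbitrary groupings.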

The MMIO segmentation is a bit special too. There is a single MMIO
region in 32-bit space (size is configurable but that's not very
practical so for now we stick it to 1G) which is evenly divided into N
segments (where N is the number of PE# supported by the host bridge,
typically 128 with the current bridges).

Each segment goes through a remapping table to select the actual PE# (so
large BARs use consecutive segments mapped to the same PE#).
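
So the routing boils down to something like this (a sketch, names
invented; the 3G base is the p7ioc example configuration I give further
down):

    #include <stdint.h>

    #define M32_BASE  0xc0000000ULL            /* 3G, PCI-side address */
    #define M32_SIZE  0x40000000ULL            /* the 1G window */
    #define NUM_SEGS  128                      /* PE# count on current bridges */
    #define SEG_SIZE  (M32_SIZE / NUM_SEGS)    /* 8M per segment */

    /* remap table programmed by firmware / the host kernel */
    extern uint8_t seg_to_pe[NUM_SEGS];

    static int m32_addr_to_pe(uint64_t pci_addr)
    {
        if (pci_addr < M32_BASE || pci_addr >= M32_BASE + M32_SIZE)
            return -1;
        return seg_to_pe[(pci_addr - M32_BASE) / SEG_SIZE];
    }

So a 16M BAR, for example, simply covers two consecutive segments
remapped to the same PE#.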
 
For SR-IOV we plan to not use the M32 region. We also have 64-bit MMIO
regions which act as some kind of "accordions": they are evenly divided
into segments in different PE#s, and there are several of them which we
can "move around" and typically use to map VF BARs.

>  While we should 
> support the older hardware, the interfaces should be designed with the 
> newer hardware in mind.

Well, our newer hardware will relax some of our limitations, like the
way our 64-bit segments work (I didn't go into details but they have
some inconvenient size constraints that will be lifted), having more
PE#, supporting more MSI ports etc... but the basic scheme remains the
same. Oh and the newer IOMMU will support separate address spaces.

But as you said, we -do- need to support the older stuff.

> > My main point is that I don't want the "knowledge" here to be in libvirt
> > or qemu. In fact, I want to be able to do something as simple as passing
> > a reference to a PE to qemu (sysfs path ?) and have it just pickup all
> > the devices in there and expose them to the guest.
> 
> Such magic is nice for a developer playing with qemu but in general less 
> useful for a managed system where the various cards need to be exposed 
> to the user interface anyway.

Right but at least the code that does that exposure can work top-down,
picking groups and exposing their content.

> > * IOMMU
> >
> > Now more on iommu. I've described I think in enough details how ours
> > work, there are others, I don't know what freescale or ARM are doing,
> > sparc doesn't quite work like VTd either, etc...
> >
> > The main problem isn't that much the mechanics of the iommu but really
> > how it's exposed (or not) to guests.
> >
> > VFIO here is basically designed for one and only one thing: expose the
> > entire guest physical address space to the device more/less 1:1.
> 
> A single level iommu cannot be exposed to guests.  Well, it can be 
> exposed as an iommu that does not provide per-device mapping.

Well, x86 ones maybe can't, but on POWER we can and must, thanks to our
essentially paravirt model :-) Even if it weren't and we used trapping
of accesses to the table, it would work because in practice, even with
filtering, what we end up having is a per-device (or rather per-PE#)
table.

> A two level iommu can be emulated and exposed to the guest.  See 
> http://support.amd.com/us/Processor_TechDocs/48882.pdf for an example.

What you mean by 2-level is two passes through two trees (i.e. 6 or 8
levels, right?). We don't have that and probably never will. But again,
because we have a paravirt interface to the iommu, it's less of an
issue.

> > This means:
> >
> >    - It only works with iommu's that provide complete DMA address spaces
> > to devices. Won't work with a single 'segmented' address space like we
> > have on POWER.
> >
> >    - It requires the guest to be pinned. Pass-through ->  no more swap
> 
> Newer iommus (and devices, unfortunately) (will) support I/O page faults 
> and then the requirement can be removed.

No. -Some- newer devices will. Out of these, a bunch will have so many
bugs that it's not usable. Some never will. It's a mess really, and I
wouldn't design my stuff based on those premises just yet. Make it
possible to support it for sure, have it in mind, but don't make it the
foundation on which the whole API is designed.

> >    - The guest cannot make use of the iommu to deal with 32-bit DMA
> > devices, thus a guest with more than a few G of RAM (I don't know the
> > exact limit on x86, depends on your IO hole I suppose), and you end up
> > back to swiotlb&  bounce buffering.
> 
> Is this a problem in practice?

Could be. It's an artificial limitation we don't need on POWER.

> >    - It doesn't work for POWER server anyways because of our need to
> > provide a paravirt iommu interface to the guest since that's how pHyp
> > works today and how existing OSes expect to operate.
> 
> Then you need to provide that same interface, and implement it using the 
> real iommu.

Yes. Working on it. It's not very practical due to how the VFIO APIs
interact, but it's solvable. Eventually, we'll make the iommu hcalls
almost entirely real-mode for performance reasons.

> > - Performance sucks of course, the vfio map ioctl wasn't mean for that
> > and has quite a bit of overhead. However we'll want to do the paravirt
> > call directly in the kernel eventually ...
> 
> Does the guest iomap each request?  Why?

Not sure what you mean... the guest calls h-calls for every iommu page
mapping/unmapping, yes. So the performance of these is critical. So yes,
we'll eventually do it in kernel. We just haven't yet.
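
The guest-side path is basically the below, which is why the per-call
cost matters so much (rough sketch; h_put_tce() and the TCE bits are
stand-ins for the real PAPR hcall wrapper and flags):

    #include <stdint.h>

    /* Stand-ins for the real PAPR hcall wrapper and TCE permission bits */
    extern long h_put_tce(uint64_t liobn, uint64_t ioba, uint64_t tce);
    #define TCE_READ        0x1ULL
    #define TCE_WRITE       0x2ULL
    #define IOMMU_PAGE_SIZE 4096ULL

    /* One hypercall per iommu page mapped */
    static int guest_iommu_map_range(uint64_t liobn, uint64_t ioba,
                                     uint64_t phys, uint64_t npages)
    {
        for (uint64_t i = 0; i < npages; i++) {
            uint64_t off = i * IOMMU_PAGE_SIZE;
            if (h_put_tce(liobn, ioba + off,
                          (phys + off) | TCE_READ | TCE_WRITE))
                return -1;
        }
        return 0;
    }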

> Emulating the iommu in the kernel is of course the way to go if that's 
> the case, still won't performance suck even then?

Well, we have HW in the field where we still beat Intel on 10G
networking performance, but heh, yeah, the cost of those h-calls is a
concern.

There are some new interfaces in pHyp that we'll eventually support
which allow creating additional iommu mappings in 64-bit space (the
current base mapping is 32-bit and 4K for backward compatibility) with
larger iommu page sizes.

This will eventually help. For guests backed with hugetlbfs we might be
able to map the whole guest in, using 16M pages at the iommu level.
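
(Back-of-the-envelope: a 16G guest at the current 4K iommu page size is
4M TCE entries, and as many hcalls to populate them; with 16M iommu
pages it drops to 1024 entries.)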

But on the other hand, the current method means that we can support
pass-through without losing overcommit & paging which is handy.

> > The QEMU side VFIO code hard wires various constraints that are entirely
> > based on various requirements you decided you have on x86 but don't
> > necessarily apply to us :-)
> >
> > Due to our paravirt nature, we don't need to masquerade the MSI-X table
> > for example. At all. If the guest configures crap into it, too bad, it
> > can only shoot itself in the foot since the host bridge enforce
> > validation anyways as I explained earlier. Because it's all paravirt, we
> > don't need to "translate" the interrupt vectors&  addresses, the guest
> > will call hyercalls to configure things anyways.
> 
> So, you have interrupt redirection?  That is, MSI-x table values encode 
> the vcpu, not pcpu?

Not exactly. The MSI-X address is a real PCI address to an MSI port and
the value is a real interrupt number in the PIC.

However, the MSI port filters by RID (using the same matching as PE#) to
ensure that only allowed devices can write to it, and the PIC has a
matching PE# information to ensure that only allowed devices can trigger
the interrupt.

As for the guest knowing what values to put in there (what port address
and interrupt source numbers to use), this is part of the paravirt APIs.

So the paravirt API handles the configuration and the HW ensures that
the guest cannot do anything other than what it's allowed to.
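
Roughly, the flow is the below (very simplified sketch; pv_msi_alloc()
and friends are placeholder names, the real guest goes through the
existing PAPR interfaces for this):

    #include <stdint.h>

    /* Placeholder names for the paravirt MSI setup path */
    extern int pv_msi_alloc(uint32_t dev_handle, int vec,
                            uint64_t *msi_addr, uint32_t *msi_data);
    extern void msix_table_write(void *msix_base, int vec,
                                 uint64_t addr, uint32_t data);

    static int guest_setup_msix_vector(uint32_t dev_handle, void *msix_base,
                                       int vec)
    {
        uint64_t addr;  /* real PCI address of an MSI port */
        uint32_t data;  /* real interrupt source number in the PIC */

        if (pv_msi_alloc(dev_handle, vec, &addr, &data))
            return -1;

        /* The values written are the real ones; the MSI port filters by
         * RID and the PIC checks the PE#, so garbage here only hurts the
         * guest itself. */
        msix_table_write(msix_base, vec, addr, data);
        return 0;
    }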

> Alex, with interrupt redirection, we can skip this as well?  Perhaps 
> only if the guest enables interrupt redirection?
> 
> If so, it's not arch specific, it's interrupt redirection specific.
> 
> > We don't need to prevent MMIO pass-through for small BARs at all. This
> > should be some kind of capability or flag passed by the arch. Our
> > segmentation of the MMIO domain means that we can give entire segments
> > to the guest and let it access anything in there (those segments are a
> > multiple of the page size always). Worst case it will access outside of
> > a device BAR within a segment and will cause the PE to go into error
> > state, shooting itself in the foot, there is no risk of side effect
> > outside of the guest boundaries.
> 
> Does the BAR value contain the segment base address?  Or is that added 
> later?

It's a shared address space. With a basic configuration on p7ioc for
example we have MMIO going from 3G to 4G (PCI side addresses). BARs
contain the normal PCI address there. But that 1G is divided into 128
segments of equal size which can separately be assigned to PE#s.

So BARs are allocated by firmware or the kernel PCI code so that devices
in different PEs don't share segments.

Of course there's always the risk that a device can be hacked via a
sideband access to BARs to move out of its allocated segment. That
means that the guest owning that device won't be able to access it
anymore and can potentially disturb a guest or host owning whatever is
in that other segment.

The only way to enforce isolation here is to ensure that PE# are
entirely behind P2P bridges, since those would then ensure that even if
you put crap into your BARs you won't be able to walk over a neighbour.

I believe pHyp enforces that, for example, if you have a slot, all
devices & functions behind that slot pertain to the same PE# under pHyp.

That means you cannot put individual functions of a device into
different PE# with pHyp.

We plan to be a bit less restrictive here for KVM, assuming that if you
use a device that allows such a back-channel to the BARs, then it's your
problem to not trust such a device for virtualization. And most of the
time, you -will- have a P2P to protect you anyways.

The problem doesn't exist (or is assumed as non-existing) for SR-IOV
since in that case, the VFs are meant to be virtualized, so pHyp assumes
there is no such back-channel and it can trust them to be in different
PE#.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-01 16:40     ` Alex Williamson
  (?)
@ 2011-08-02  1:29     ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 322+ messages in thread
From: Benjamin Herrenschmidt @ 2011-08-02  1:29 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	David Gibson, Anthony Liguori, linuxppc-dev

On Mon, 2011-08-01 at 10:40 -0600, Alex Williamson wrote:
> On Sun, 2011-07-31 at 08:21 +1000, Benjamin Herrenschmidt wrote:
> > On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote:
> > > Hi folks !
> > > 
> > > So I promised Anthony I would try to summarize some of the comments &
> > > issues we have vs. VFIO after we've tried to use it for PCI pass-through
> > > on POWER. It's pretty long, there are various items with more or less
> > > impact, some of it is easily fixable, some are API issues, and we'll
> > > probably want to discuss them separately, but for now here's a brain
> > > dump.
> > > 
> > > David, Alexei, please make sure I haven't missed anything :-)
> > 
> > And I think I have :-)
> > 
> >   * Config space
> > 
> > VFIO currently handles that as a byte stream. It's quite gross to be
> > honest and it's not right. You shouldn't lose access size information
> > between guest and host when performing real accesses.
> > 
> > Some config space registers can have side effects and not respecting
> > access sizes can be nasty.
> 
> It's a bug, let's fix it.

Right. I was just trying to be exhaustive :-) If you don't beat us to
it, we'll eventually submit patches to fix it; we haven't fixed it yet
either, it's just something I noticed (because this byte-transport also
makes handling of endianness clumsy).
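
The shape of the fix is pretty clear though: keep the access size all
the way down and dispatch on it like the sysfs config attribute already
does, something along these lines (a sketch, not the actual VFIO code):

    #include <linux/pci.h>

    /* Size-preserving config space read; writes would mirror this */
    static int cfg_read(struct pci_dev *pdev, int pos, int size, u32 *val)
    {
        u8 b;
        u16 w;
        int ret;

        switch (size) {
        case 1:
            ret = pci_user_read_config_byte(pdev, pos, &b);
            *val = b;
            return ret;
        case 2:
            ret = pci_user_read_config_word(pdev, pos, &w);
            *val = w;
            return ret;
        case 4:
            return pci_user_read_config_dword(pdev, pos, val);
        default:
            return -EINVAL;
        }
    }

which should also make the endianness side less clumsy since those
accessors already deal in native-endian values.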

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-01 18:59       ` Alex Williamson
  (?)
@ 2011-08-02  2:00         ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 322+ messages in thread
From: Benjamin Herrenschmidt @ 2011-08-02  2:00 UTC (permalink / raw)
  To: Alex Williamson
  Cc: chrisw, Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	qemu-devel, David Gibson, aafabbri, iommu, linuxppc-dev, benve

On Mon, 2011-08-01 at 12:59 -0600, Alex Williamson wrote:

> >  
> >  .../...
> 
> I'll try to consolidate my reply to all the above here because there are
> too many places above to interject and make this thread even more
> difficult to respond to.

True, I should try to do the same :-)

>   Much of what you're discussion above comes
> down to policy.  Do we trust DisINTx?  Do we trust multi-function
> devices?  I have no doubt there are devices we can use as examples for
> each behaving badly.  On x86 this is one of the reasons we have SR-IOV.

Right, that and having the ability to provide way more functions than
you would normally have.

> Besides splitting a single device into multiple, it makes sure each
> devices is actually virtualization friendly.  POWER seems to add
> multiple layers of hardware so that you don't actually have to trust the
> device, which is a great value add for enterprise systems, but in doing
> so it mostly defeats the purpose and functionality of SR-IOV.

Well not entirely. A lot of what POWER does is also about isolation on
errors. This is going to be useful with and without SR-IOV. Also not all
devices are SR-IOV capable, and there are plenty of situations where
one would want to pass through devices that aren't; I don't see that as
disappearing tomorrow.

> How we present this in a GUI is largely irrelevant because something has
> to create a superset of what the hardware dictates (can I uniquely
> identify transactions from this device, can I protect other devices from
> it, etc.), the system policy (do I trust DisINTx, do I trust function
> isolation, do I require ACS) and mold that with what the user actually
> wants to assign.  For the VFIO kernel interface, we should only be
> concerned with the first problem.  Userspace is free to make the rest as
> simple or complete as it cares to.  I argue for x86, we want device
> level granularity of assignment, but that also tends to be the typical
> case (when only factoring in hardware restrictions) due to our advanced
> iommus.

Well, POWER iommu's are advanced too ... just in a different way :-) x86
seems to be a lot less interested in robustness and reliability for
example :-)

I tend to agree that the policy decisions in general should be done by
the user, tho with appropriate information :-)

But some of them on our side are hard requirements imposed by how our
firmware or early kernel code assigned the PE's and we need to expose
that. It directly drives the sharing of iommus too, but then we -could-
have those different iommus point to the same table in memory and
essentially mimic the x86 domains. We chose not to. The segments are
too small in our current HW design, for one, and it would mean losing
the isolation between devices, which is paramount to getting the kind
of reliability and error handling we want to achieve. 

> > > > Maybe something like /sys/devgroups ? This probably warrants involving
> > > > more kernel people into the discussion.
> > > 
> > > I don't yet buy into passing groups to qemu since I don't buy into the
> > > idea of always exposing all of those devices to qemu.  Would it be
> > > sufficient to expose iommu nodes in sysfs that link to the devices
> > > behind them and describe properties and capabilities of the iommu
> > > itself?  More on this at the end.
> > 
> > Well, iommu aren't the only factor. I mentioned shared interrupts (and
> > my unwillingness to always trust DisINTx),
> 
> *userspace policy*

Maybe ... some of it, yes, I suppose. You can always hand userspace
bigger guns to shoot itself in the foot with. Not always very wise, but
heh.

Some of these are hard requirements tho. And we have to make that
decision when we assign PE's at boot time.

> >  there's also the MMIO
> > grouping I mentioned above (in which case it's an x86 -limitation- with
> > small BARs that I don't want to inherit, especially since it's based on
> > PAGE_SIZE and we commonly have 64K page size on POWER), etc...
> 
> But isn't MMIO grouping effectively *at* the iommu?

Not exactly. It's a different set of tables & registers in the host
bridge and essentially a different set of logic, tho it does hook into
the whole "shared PE# state" thingy to enforce isolation of all layers
on error.

> > So I'm not too fan of making it entirely look like the iommu is the
> > primary factor, but we -can-, that would be workable. I still prefer
> > calling a cat a cat and exposing the grouping for what it is, as I think
> > I've explained already above, tho. 
> 
> The trouble is the "group" analogy is more fitting to a partitionable
> system, whereas on x86 we can really mix-n-match devices across iommus
> fairly easily.  The iommu seems to be the common point to describe these
> differences.

No. You can do that by throwing away isolation between those devices and
thus throwing away error isolation capabilities as well. I suppose if
you don't care about RAS... :-)

> > > > Now some of this can be fixed with tweaks, and we've started doing it
> > > > (we have a working pass-through using VFIO, forgot to mention that, it's
> > > > just that we don't like what we had to do to get there).
> > > 
> > > This is a result of wanting to support *unmodified* x86 guests.  We
> > > don't have the luxury of having a predefined pvDMA spec that all x86
> > > OSes adhere to. 
> > 
> > No but you could emulate a HW iommu no ?
> 
> We can, but then we have to worry about supporting legacy, proprietary
> OSes that may not have support or may make use of it differently.  As
> Avi mentions, hardware is coming the eases the "pin the whole guest"
> requirement and we may implement emulated iommus for the benefit of some
> guests.

That's a pipe dream :-) It will take a LONG time before a reasonable
proportion of devices does this in a reliable way I believe.

> > >  The 32bit problem is unfortunate, but the priority use
> > > case for assigning devices to guests is high performance I/O, which
> > > usually entails modern, 64bit hardware.  I'd like to see us get to the
> > > point of having emulated IOMMU hardware on x86, which could then be
> > > backed by VFIO, but for now guest pinning is the most practical and
> > > useful.
> > 
> > For your current case maybe. It's just not very future proof imho.
> > Anyways, it's fixable, but the APIs as they are make it a bit clumsy.
> 
> You expect more 32bit devices in the future?

God knows what the embedded ARM folks will come up with :-) I wouldn't
dismiss that completely. I do expect to have to deal with OHCI for a
while tho.

> > > > Also our next generation chipset may drop support for PIO completely.
> > > > 
> > > > On the other hand, because PIO is just a special range of MMIO for us,
> > > > we can do normal pass-through on it and don't need any of the emulation
> > > > done qemu.
> > > 
> > > Maybe we can add mmap support to PIO regions on non-x86.
> > 
> > We have to yes. I haven't looked into it yet, it should be easy if VFIO
> > kernel side starts using the "proper" PCI mmap interfaces in kernel (the
> > same interfaces sysfs & proc use).
> 
> Patches welcome.

Sure, we do plan to send patches for a lot of those things as we get
there, I'm just choosing to mention all the issues at once here and we
haven't got to fixing -that- just yet.
 
 .../...

> > Right. We can slow map the ROM, or we can not care :-) At the end of the
> > day, what is the difference here between a "guest" under qemu and the
> > real thing bare metal on the machine ? IE. They have the same issue vs.
> > accessing the ROM. IE. I don't see why qemu should try to make it safe
> > to access it at any time while it isn't on a real machine. Since VFIO
> > resets the devices before putting them in guest space, they should be
> > accessible no ? (Might require a hard reset for some devices tho ... )
> 
> My primary motivator for doing the ROM the way it's done today is that I
> get to push all the ROM handling off to QEMU core PCI code.  The ROM for
> an assigned device is handled exactly like the ROM for an emulated
> device except it might be generated by reading it from the hardware.
> This gives us the benefit of things like rombar=0 if I want to hide the
> ROM or romfile=<file> if I want to load an ipxe image for a device that
> may not even have a physical ROM.  Not to mention I don't have to
> special case ROM handling routines in VFIO.  So it actually has little
> to do w/ making it safe to access the ROM at any time.

On the other hand, let's hope no device has side effects on the ROM and
expects to exploit them :-) Do we know how ROM/flash updates work for
devices in practice ? Do they expect to be able to write to the ROM BAR
or do they always use a different MMIO-based sideband access ?
 
> > In any case, it's not a big deal and we can sort it out, I'm happy to
> > fallback to slow map to start with and eventually we will support small
> > pages mappings on POWER anyways, it's a temporary limitation.
> 
> Perhaps this could also be fixed in the generic QEMU PCI ROM support so
> it works for emulated devices too... code reuse paying off already ;)

Heh, I think emulation works.

> > > >   * EEH
> > > > 
> > > > This is the name of those fancy error handling & isolation features I
> > > > mentioned earlier. To some extent it's a superset of AER, but we don't
> > > > generally expose AER to guests (or even the host), it's swallowed by
> > > > firmware into something else that provides a superset (well mostly) of
> > > > the AER information, and allow us to do those additional things like
> > > > isolating/de-isolating, reset control etc...
> > > > 
> > > > Here too, we'll need arch specific APIs through VFIO. Not necessarily a
> > > > huge deal, I mention it for completeness.
> > > 
> > > We expect to do AER via the VFIO netlink interface, which even though
> > > its bashed below, would be quite extensible to supporting different
> > > kinds of errors.
> > 
> > As could platform specific ioctls :-)
> 
> Is qemu going to poll for errors?

I wouldn't mind eventfd + ioctl, I really don't like netlink :-) But
others might disagree with me here. However that's not really my
argument, see below...

> > I don't understand what the advantage of netlink is compared to just
> > extending your existing VFIO ioctl interface, possibly using children
> > fd's as we do for example with spufs but it's not a huge deal. It just
> > that netlink has its own gotchas and I don't like multi-headed
> > interfaces.
> 
> We could do yet another eventfd that triggers the VFIO user to go call
> an ioctl to see what happened, but then we're locked into an ioctl
> interface for something that we may want to more easily extend over
> time.  As I said, it feels like this is what netlink is for and the
> arguments against seem to be more gut reaction.

My argument here is that we already have an fd open, ie, we already have
a communication channel open to vfio as a chardev; I don't like the idea
of creating -another- one.

> Hmm... it is.  I added a pci_get_irq() that returns a
> platform/architecture specific translation of a PCI interrupt to it's
> resulting system interrupt.  Implement this in your PCI root bridge.
> There's a notifier for when this changes, so vfio will check
> pci_get_irq() again, also to be implemented in the PCI root bridge code.
> And a notifier that gets registered with that system interrupt and gets
> notice for EOI... implemented in x86 ioapic, somewhere else for power.

Let's leave this one alone, we'll fix it one way or another and we can
discuss the patches when it comes down to it.

> > >   The problem is
> > > that we have to disable INTx on an assigned device after it fires (VFIO
> > > does this automatically).  If we don't do this, a non-responsive or
> > > malicious guest could sit on the interrupt, causing it to fire
> > > repeatedly as a DoS on the host.  The only indication that we can rely
> > > on to re-enable INTx is when the guest CPU writes an EOI to the APIC.
> > > We can't just wait for device accesses because a) the device CSRs are
> > > (hopefully) direct mapped and we'd have to slow map them or attempt to
> > > do some kind of dirty logging to detect when they're accesses b) what
> > > constitutes an interrupt service is device specific.
> > > 
> > > That means we need to figure out how PCI interrupt 'A' (or B...)
> > > translates to a GSI (Global System Interrupt - ACPI definition, but
> > > hopefully a generic concept).  That GSI identifies a pin on an IOAPIC,
> > > which will also see the APIC EOI.  And just to spice things up, the
> > > guest can change the PCI to GSI mappings via ACPI.  I think the set of
> > > callbacks I've added are generic (maybe I left ioapic in the name), but
> > > yes they do need to be implemented for other architectures.  Patches
> > > appreciated from those with knowledge of the systems and/or access to
> > > device specs.  This is the only reason that I make QEMU VFIO only build
> > > for x86.
> > 
> > Right, and we need to cook a similiar sauce for POWER, it's an area that
> > has to be arch specific (and in fact specific to the specific HW machine
> > being emulated), so we just need to find out what's the cleanest way for
> > the plaform to "register" the right callbacks here.
> 
> Aside from the ioapic, I hope it's obvious hooks in the PCI root bridge
> emulation.

Yeah, we'll see; whatever we come up with, we'll discuss the details
then :-)
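
(For completeness, the dance being described above is basically this;
all names invented, just to show the mask-on-fire / unmask-on-EOI logic
we'd need to hook into our PIC emulation instead of the ioapic:)

    #include <stdbool.h>

    /* Invented helpers standing in for the real masking / injection code */
    extern void assigned_intx_mask(void *dev);     /* e.g. via DisINTx */
    extern void assigned_intx_unmask(void *dev);
    extern void inject_guest_intx(void *dev);

    struct intx_state {
        void *dev;
        bool masked;
    };

    /* host-side handler for the assigned device's INTx */
    static void host_intx_fired(struct intx_state *s)
    {
        assigned_intx_mask(s->dev);   /* stop it re-firing on the host */
        s->masked = true;
        inject_guest_intx(s->dev);    /* forward to the guest */
    }

    /* callback when the guest EOIs the corresponding interrupt */
    static void guest_eoi_notify(struct intx_state *s)
    {
        if (s->masked) {
            s->masked = false;
            assigned_intx_unmask(s->dev);
        }
    }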

>  Thanks,
> > 
> > Well, I would map those "iommus" to PEs, so what remains is the path to
> > put all the "other" bits and pieces such as inform qemu of the location
> > and size of the MMIO segment(s) (so we can map the whole thing and not
> > bother with individual BARs) etc... 
> 
> My assumption is that PEs are largely defined by the iommus already.
> Are MMIO segments a property of the iommu too?  Thanks,

Not exactly but it's all tied together. See my other replies.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-07-30 18:20   ` Alex Williamson
  (?)
@ 2011-08-02  8:28     ` David Gibson
  -1 siblings, 0 replies; 322+ messages in thread
From: David Gibson @ 2011-08-02  8:28 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Benjamin Herrenschmidt, chrisw, Alexey Kardashevskiy, kvm,
	Paul Mackerras, linux-pci, qemu-devel, aafabbri, iommu,
	Anthony Liguori, linuxppc-dev, benve

On Sat, Jul 30, 2011 at 12:20:08PM -0600, Alex Williamson wrote:
> On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote:
[snip]
> On x86, the USB controllers don't typically live behind a PCIe-to-PCI
> bridge, so don't suffer the source identifier problem, but they do often
> share an interrupt.  But even then, we can count on most modern devices
> supporting PCI2.3, and thus the DisINTx feature, which allows us to
> share interrupts.  In any case, yes, it's more rare but we need to know
> how to handle devices behind PCI bridges.  However I disagree that we
> need to assign all the devices behind such a bridge to the guest.
> There's a difference between removing the device from the host and
> exposing the device to the guest.

I think you're arguing only over details of what words to use for
what, rather than anything of substance here.  The point is that an
entire partitionable group must be assigned to "host" (in which case
kernel drivers may bind to it) or to a particular guest partition (or
at least to a single UID on the host).  Which of the assigned devices
the partition actually uses is another matter, of course, as is exactly
at which level they become "de-exposed" if you don't want to use all of
them.

[snip]
> > Maybe something like /sys/devgroups ? This probably warrants involving
> > more kernel people into the discussion.
> 
> I don't yet buy into passing groups to qemu since I don't buy into the
> idea of always exposing all of those devices to qemu.  Would it be
> sufficient to expose iommu nodes in sysfs that link to the devices
> behind them and describe properties and capabilities of the iommu
> itself?  More on this at the end.

Again, I don't think you're making a distinction of any substance.
Ben is saying the group as a whole must be set to allow partition
access, whether or not you call that "assigning".  There's no reason
that passing a sysfs descriptor to qemu couldn't be the qemu
developer's quick-and-dirty method of putting the devices in, while
also allowing full assignment of the devices within the groups by
libvirt.

[snip]
> > Now some of this can be fixed with tweaks, and we've started doing it
> > (we have a working pass-through using VFIO, forgot to mention that, it's
> > just that we don't like what we had to do to get there).
> 
> This is a result of wanting to support *unmodified* x86 guests.  We
> don't have the luxury of having a predefined pvDMA spec that all x86
> OSes adhere to.  The 32bit problem is unfortunate, but the priority use
> case for assigning devices to guests is high performance I/O, which
> usually entails modern, 64bit hardware.  I'd like to see us get to the
> point of having emulated IOMMU hardware on x86, which could then be
> backed by VFIO, but for now guest pinning is the most practical and
> useful.

No-one's suggesting that this isn't a valid mode of operation.  It's
just that right now conditionally disabling it for us is fairly ugly
because of the way the qemu code is structured.

[snip]
> > The above means we need arch specific APIs. So arch specific vfio
> > ioctl's, either that or kvm ones going to vfio or something ... the
> > current structure of vfio/kvm interaction doesn't make it easy.
> 
> FYI, we also have large page support for x86 VT-d, but it seems to only
> be opportunistic right now.  I'll try to come back to the rest of this
> below.

Incidentally there seems to be a hugepage leak bug in the current
kernel code (which I haven't had a chance to track down yet).  Our
qemu code currently has bugs (working on it..) which means it has
unbalanced maps and unmaps of the pages.  Even so, when qemu quits they
should all be released, but somehow they're not.

[snip]
> >  - I don't like too much the fact that VFIO provides yet another
> > different API to do what we already have at least 2 kernel APIs for, ie,
> > BAR mapping and config space access. At least it should be better at
> > using the backend infrastructure of the 2 others (sysfs & procfs). I
> > understand it wants to filter in some case (config space) and -maybe-
> > yet another API is the right way to go but allow me to have my doubts.
> 
> The use of PCI sysfs is actually one of my complaints about current
> device assignment.  To do assignment with an unprivileged guest we need
> to open the PCI sysfs config file for it, then change ownership on a
> handful of other PCI sysfs files, then there's this other pci-stub thing
> to maintain ownership, but the kvm ioctls don't actually require it and
> can grab onto any free device...  We are duplicating some of that in
> VFIO, but we also put the ownership of the device behind a single device
> file.  We do have the uiommu problem that we can't give an unprivileged
> user ownership of that, but your usage model may actually make that
> easier.  More below...

Hrm.  I was assuming that a sysfs groups interface would provide a
single place to set the ownership of the whole group.  Whether that's
echoing a uid to a magic file or doing a chown on the directory or
whatever is a matter of details.
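
Just to sketch the "magic file" variant in userspace C - the path and
attribute name here are entirely made up, it's only meant to show how
little machinery is involved:

/* Sketch only: /sys/devgroups/<group>/owner is a hypothetical attribute. */
#include <stdio.h>
#include <sys/types.h>

int set_group_owner(const char *group, uid_t uid)
{
        char path[256];
        FILE *f;

        snprintf(path, sizeof(path), "/sys/devgroups/%s/owner", group);
        f = fopen(path, "w");
        if (!f)
                return -1;
        /* writing the uid hands the whole group to that user */
        fprintf(f, "%u\n", (unsigned int)uid);
        return fclose(f);
}

The chown-on-the-directory variant would just be a chown(2) on the
group's directory instead.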

[snip]
> I spent a lot of time looking for an architecture neutral solution here,
> but I don't think it exists.  Please prove me wrong.  The problem is
> that we have to disable INTx on an assigned device after it fires (VFIO
> does this automatically).  If we don't do this, a non-responsive or
> malicious guest could sit on the interrupt, causing it to fire
> repeatedly as a DoS on the host.  The only indication that we can rely
> on to re-enable INTx is when the guest CPU writes an EOI to the APIC.
> We can't just wait for device accesses because a) the device CSRs are
> (hopefully) direct mapped and we'd have to slow map them or attempt to
> do some kind of dirty logging to detect when they're accessed b) what
> constitutes an interrupt service is device specific.
> 
> That means we need to figure out how PCI interrupt 'A' (or B...)
> translates to a GSI (Global System Interrupt - ACPI definition, but
> hopefully a generic concept).  That GSI identifies a pin on an IOAPIC,
> which will also see the APIC EOI.  And just to spice things up, the
> guest can change the PCI to GSI mappings via ACPI.  I think the set of
> callbacks I've added are generic (maybe I left ioapic in the name), but
> yes they do need to be implemented for other architectures.  Patches
> appreciated from those with knowledge of the systems and/or access to
> device specs.  This is the only reason that I make QEMU VFIO only build
> for x86.

There will certainly need to be some arch hooks here, but it can be
made less intrusively x86 specific without too much difficulty.
e.g. Create an EOI notifier chain in qemu - the master PICs (APIC for
x86, XICS for pSeries) for all vfio capable machines need to kick it,
and vfio subscribes.
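
Something along these lines, kept deliberately generic (this is not
qemu's actual notifier API, just a sketch of the shape):

/* Generic sketch of an EOI notifier chain; not qemu's real interface. */
typedef struct EOINotifier EOINotifier;
struct EOINotifier {
        void (*notify)(EOINotifier *n, int irq);
        EOINotifier *next;
};

static EOINotifier *eoi_notifiers;

/* vfio registers one of these per assigned INTx line */
void eoi_notifier_register(EOINotifier *n)
{
        n->next = eoi_notifiers;
        eoi_notifiers = n;
}

/* the platform PIC (APIC, XICS, ...) calls this whenever it sees an EOI */
void eoi_notifier_fire(int irq)
{
        EOINotifier *n;

        for (n = eoi_notifiers; n; n = n->next)
                n->notify(n, irq);
}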

[snip]
> Rather than your "groups" idea, I've been mulling over whether we can
> just expose the dependencies, configuration, and capabilities in sysfs
> and build qemu commandlines to describe it.  For instance, if we simply
> start with creating iommu nodes in sysfs, we could create links under
> each iommu directory to the devices behind them.  Some kind of
> capability file could define properties like whether it's page table
> based or fixed iova window or the granularity of mapping the devices
> behind it.  Once we have that, we could probably make uiommu attach to
> each of those nodes.

Well, that would address our chief concern that inherently tying the
lifetime of a domain to an fd is problematic.  In fact, I don't really
see how this differs from the groups proposal except in the details of
how you inform qemu of the group^H^H^H^H^Hiommu domain.
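
E.g. if the sysfs layout were something like /sys/class/iommu/<N>/devices/
full of symlinks to the member devices (layout purely hypothetical),
discovering membership from userspace would be trivial:

/* Hypothetical layout: /sys/class/iommu/<N>/devices/ full of device symlinks. */
#include <dirent.h>
#include <stdio.h>

void list_iommu_devices(const char *iommu)      /* e.g. "iommu0" */
{
        char path[256];
        struct dirent *d;
        DIR *dir;

        snprintf(path, sizeof(path), "/sys/class/iommu/%s/devices", iommu);
        dir = opendir(path);
        if (!dir)
                return;
        while ((d = readdir(dir)) != NULL)
                if (d->d_name[0] != '.')
                        printf("%s -> %s\n", iommu, d->d_name);
        closedir(dir);
}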

[snip]
> Today we do DMA mapping via the VFIO device because the capabilities of
> the IOMMU domains change depending on which devices are connected (for
> VT-d, the least common denominator of the IOMMUs in play).  Forcing the
> DMA mappings through VFIO naturally forces the call order.  If we moved
> to something like above, we could switch the DMA mapping to the uiommu
> device, since the IOMMU would have fixed capabilities.

Ah, that's why you have the map and unmap on the vfio fd,
necessitating the ugly "pick the first vfio fd from the list" thing.
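
For anyone skimming: the awkward part is that the mapping ioctl lives on
a per-device fd, so qemu ends up doing something like the sketch below.
The struct layout and ioctl number are made up for illustration, not the
real ABI:

/* Illustrative only: not the actual vfio ABI or headers. */
#include <stdint.h>
#include <sys/ioctl.h>

struct dma_map_req {            /* hypothetical layout */
        uint64_t vaddr;         /* userspace virtual address */
        uint64_t iova;          /* device-visible DMA address */
        uint64_t size;
};

#define DMA_MAP_IOCTL _IOW(';', 100, struct dma_map_req)   /* made-up number */

/*
 * Mappings go through a device fd rather than a standalone iommu object,
 * so qemu has to pick *some* device fd in the group to issue them on.
 */
int map_guest_ram(int vfio_dev_fd, void *hva, uint64_t gpa, uint64_t len)
{
        struct dma_map_req req = {
                .vaddr = (uintptr_t)hva,
                .iova  = gpa,
                .size  = len,
        };

        return ioctl(vfio_dev_fd, DMA_MAP_IOCTL, &req);
}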

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-01 20:27     ` Alex Williamson
@ 2011-08-02  8:32       ` Avi Kivity
  -1 siblings, 0 replies; 322+ messages in thread
From: Avi Kivity @ 2011-08-02  8:32 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Benjamin Herrenschmidt, kvm, Anthony Liguori, David Gibson,
	Paul Mackerras, Alexey Kardashevskiy, linux-pci, linuxppc-dev

On 08/01/2011 11:27 PM, Alex Williamson wrote:
> On Sun, 2011-07-31 at 17:09 +0300, Avi Kivity wrote:
> >  On 07/30/2011 02:58 AM, Benjamin Herrenschmidt wrote:
> >  >  Due to our paravirt nature, we don't need to masquerade the MSI-X table
> >  >  for example. At all. If the guest configures crap into it, too bad, it
> >  >  can only shoot itself in the foot since the host bridge enforces
> >  >  validation anyways as I explained earlier. Because it's all paravirt, we
> >  >  don't need to "translate" the interrupt vectors&   addresses, the guest
> >  >  will call hypercalls to configure things anyways.
> >
> >  So, you have interrupt redirection?  That is, MSI-x table values encode
> >  the vcpu, not pcpu?
> >
> >  Alex, with interrupt redirection, we can skip this as well?  Perhaps
> >  only if the guest enables interrupt redirection?
>
> It's not clear to me how we could skip it.  With VT-d, we'd have to
> implement an emulated interrupt remapper and hope that the guest picks
> unused indexes in the host interrupt remapping table before it could do
> anything useful with direct access to the MSI-X table.

Yeah.  We need the interrupt remapping hardware to indirect based on the 
source of the message, not just the address and data.

> Maybe AMD IOMMU
> makes this easier?  Thanks,
>

No idea.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-02  1:27     ` Benjamin Herrenschmidt
@ 2011-08-02  9:12       ` Avi Kivity
  -1 siblings, 0 replies; 322+ messages in thread
From: Avi Kivity @ 2011-08-02  9:12 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: kvm, Anthony Liguori, Alex Williamson, David Gibson,
	Paul Mackerras, Alexey Kardashevskiy, linux-pci, linuxppc-dev

On 08/02/2011 04:27 AM, Benjamin Herrenschmidt wrote:
> >
> >  I have a feeling you'll be getting the same capabilities sooner or
> >  later, or you won't be able to make use of S/R IOV VFs.
>
> I'm not sure what you mean. We can do SR/IOV just fine (well, with some
> limitations due to constraints with how our MMIO segmenting works and
> indeed some of those are being lifted in our future chipsets but
> overall, it works).

Don't those limitations include "all VFs must be assigned to the same 
guest"?

PCI on x86 has function granularity, SRIOV reduces this to VF 
granularity, but I thought power has partition or group granularity 
which is much coarser?

> In -theory-, one could do the grouping dynamically with some kind of API
> for us as well. However the constraints are such that it's not
> practical. Filtering on RID is based on number of bits to match in the
> bus number and whether to match the dev and fn. So it's not arbitrary
> (but works fine for SR-IOV).
>
> The MMIO segmentation is a bit special too. There is a single MMIO
> region in 32-bit space (size is configurable but that's not very
> practical so for now we stick it to 1G) which is evenly divided into N
> segments (where N is the number of PE# supported by the host bridge,
> typically 128 with the current bridges).
>
> Each segment goes through a remapping table to select the actual PE# (so
> large BARs use consecutive segments mapped to the same PE#).
>
> For SR-IOV we plan to not use the M32 region. We also have 64-bit MMIO
> regions which act as some kind of "accordions", they are evenly divided
> into segments in different PE# and there's several of them which we can
> "move around" and typically use to map VF BARs.

So, SRIOV VFs *don't* have the group limitation?  Sorry, I'm deluged by 
technical details with no ppc background to put them against, so I can't
say I'm making any sense of this.

> >  >
> >  >  VFIO here is basically designed for one and only one thing: expose the
> >  >  entire guest physical address space to the device more/less 1:1.
> >
> >  A single level iommu cannot be exposed to guests.  Well, it can be
> >  exposed as an iommu that does not provide per-device mapping.
>
> Well, x86 ones can't maybe but on POWER we can and must thanks to our
> essentially paravirt model :-) Even if it wasn't and we used trapping
> of accesses to the table, it would work because in practice, even with
> filtering, what we end up having is a per-device (or rather per-PE#
> table).
>
> >  A two level iommu can be emulated and exposed to the guest.  See
> >  http://support.amd.com/us/Processor_TechDocs/48882.pdf for an example.
>
> What you mean 2-level is two passes through two trees (ie 6 or 8 levels
> right ?).

(16 or 25)

> We don't have that and probably never will. But again, because
> we have a paravirt interface to the iommu, it's less of an issue.

Well, then, I guess we need an additional interface to expose that to 
the guest.

> >  >  This means:
> >  >
> >  >     - It only works with iommu's that provide complete DMA address spaces
> >  >  to devices. Won't work with a single 'segmented' address space like we
> >  >  have on POWER.
> >  >
> >  >     - It requires the guest to be pinned. Pass-through ->   no more swap
> >
> >  Newer iommus (and devices, unfortunately) (will) support I/O page faults
> >  and then the requirement can be removed.
>
> No. -Some- newer devices will. Out of these, a bunch will have so many
> bugs in it it's not usable. Some never will. It's a mess really and I
> wouldn't design my stuff based on those premises just yet. Making it
> possible to support it for sure, having it in mind, but not making it
> the foundation on which the whole API is designed.

The API is not designed around pinning.  It's a side effect of how the 
IOMMU works.  If your IOMMU only maps pages which are under active DMA, 
then it would only pin those pages.

But I see what you mean, the API is designed around up-front 
specification of all guest memory.

> >  >     - It doesn't work for POWER server anyways because of our need to
> >  >  provide a paravirt iommu interface to the guest since that's how pHyp
> >  >  works today and how existing OSes expect to operate.
> >
> >  Then you need to provide that same interface, and implement it using the
> >  real iommu.
>
> Yes. Working on it. It's not very practical due to how VFIO interacts in
> terms of APIs but solvable. Eventually, we'll make the iommu Hcalls
> almost entirely real-mode for performance reasons.

The original kvm device assignment code was (and is) part of kvm 
itself.  We're trying to move to vfio to allow sharing with non-kvm 
users, but it does reduce flexibility.  We can have an internal vfio-kvm 
interface to update mappings in real time.

> >  >  - Performance sucks of course, the vfio map ioctl wasn't mean for that
> >  >  and has quite a bit of overhead. However we'll want to do the paravirt
> >  >  call directly in the kernel eventually ...
> >
> >  Does the guest iomap each request?  Why?
>
> Not sure what you mean... the guest calls h-calls for every iommu page
> mapping/unmapping, yes. So the performance of these is critical. So yes,
> we'll eventually do it in kernel. We just haven't yet.

I see.  x86 traditionally doesn't do it for every request.  We had some 
proposals to do a pviommu that does map every request, but none reached 
maturity.
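
For reference, the per-page model described above boils down to a loop
of H_PUT_TCE hypercalls on the guest side, roughly like this (the
wrapper and constants are illustrative, not actual kernel code):

/* Illustrative guest-side sketch: one H_PUT_TCE hypercall per IOMMU page. */
#define IOMMU_PAGE_SHIFT        12
#define IOMMU_PAGE_SIZE         (1UL << IOMMU_PAGE_SHIFT)

/* hypothetical wrapper around the H_PUT_TCE hcall */
extern long h_put_tce(unsigned long liobn, unsigned long ioba,
                      unsigned long tce);

long map_range(unsigned long liobn, unsigned long ioba,
               unsigned long phys, unsigned long npages,
               unsigned long perm_bits)
{
        unsigned long i;
        long rc;

        /* one hcall per 4K page: this is why the path is so hot */
        for (i = 0; i < npages; i++) {
                rc = h_put_tce(liobn, ioba + (i << IOMMU_PAGE_SHIFT),
                               (phys + (i << IOMMU_PAGE_SHIFT)) | perm_bits);
                if (rc)
                        return rc;
        }
        return 0;
}

Each 4K page costs a hypercall, which is why pushing this path into the
kernel matters so much.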

> >
> >  So, you have interrupt redirection?  That is, MSI-x table values encode
> >  the vcpu, not pcpu?
>
> Not exactly. The MSI-X address is a real PCI address to an MSI port and
> the value is a real interrupt number in the PIC.
>
> However, the MSI port filters by RID (using the same matching as PE#) to
> ensure that only allowed devices can write to it, and the PIC has a
> matching PE# information to ensure that only allowed devices can trigger
> the interrupt.
>
> As for the guest knowing what values to put in there (what port address
> and interrupt source numbers to use), this is part of the paravirt APIs.
>
> So the paravirt APIs handles the configuration and the HW ensures that
> the guest cannot do anything else than what it's allowed to.

Okay, this is something that x86 doesn't have.  Strange that it can 
filter DMA at a fine granularity but not MSI, which is practically the 
same thing.

> >
> >  Does the BAR value contain the segment base address?  Or is that added
> >  later?
>
> It's a shared address space. With a basic configuration on p7ioc for
> example we have MMIO going from 3G to 4G (PCI side addresses). BARs
> contain the normal PCI address there. But that 1G is divided in 128
> segments of equal size which can separately be assigned to PE#'s.
>
> So BARs are allocated by firmware or the kernel PCI code so that devices
> in different PEs don't share segments.

Okay, and config space virtualization ensures that the guest can't remap?

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-02  9:12       ` Avi Kivity
@ 2011-08-02 12:58         ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 322+ messages in thread
From: Benjamin Herrenschmidt @ 2011-08-02 12:58 UTC (permalink / raw)
  To: Avi Kivity
  Cc: kvm, Anthony Liguori, Alex Williamson, David Gibson,
	Paul Mackerras, Alexey Kardashevskiy, linux-pci, linuxppc-dev

On Tue, 2011-08-02 at 12:12 +0300, Avi Kivity wrote:
> On 08/02/2011 04:27 AM, Benjamin Herrenschmidt wrote:
> > >
> > >  I have a feeling you'll be getting the same capabilities sooner or
> > >  later, or you won't be able to make use of S/R IOV VFs.
> >
> > I'm not sure what you mean. We can do SR/IOV just fine (well, with some
> > limitations due to constraints with how our MMIO segmenting works and
> > indeed some of those are being lifted in our future chipsets but
> > overall, it works).
> 
> Don't those limitations include "all VFs must be assigned to the same 
> guest"?

No, not at all. We put them in different PE# and because the HW is
SR-IOV we know we can trust it to the extent that it won't have nasty
hidden side effects between them. We have 64-bit windows for MMIO that
are also segmented and that we can "resize" to map over the VF BAR
region; the limitations are more about the allowed sizes, number of
segments supported, etc., which can cause us to play interesting games
with the system page size setting to find a good match.
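
To give a feel for the segmenting arithmetic, using the 1G / 128-segment
M32 numbers quoted below (names and constants are mine, purely
illustrative):

/* Sketch of the M32 segment -> PE# lookup; names and constants illustrative. */
#include <stdint.h>

#define M32_BASE        0xC0000000UL                 /* 3G, PCI-side */
#define M32_SIZE        0x40000000UL                 /* 1G window */
#define NUM_SEGMENTS    128                          /* PE#s on this bridge */
#define SEGMENT_SIZE    (M32_SIZE / NUM_SEGMENTS)    /* 8M per segment */

static uint8_t segment_to_pe[NUM_SEGMENTS];          /* the remapping table */

/* which PE# owns a given PCI-side MMIO address? */
int addr_to_pe(uint64_t pci_addr)
{
        if (pci_addr < M32_BASE || pci_addr >= M32_BASE + M32_SIZE)
                return -1;
        return segment_to_pe[(pci_addr - M32_BASE) / SEGMENT_SIZE];
}

The 64-bit windows work on the same principle, just with segment sizes
we can play with.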

> PCI on x86 has function granularity, SRIOV reduces this to VF 
> granularity, but I thought power has partition or group granularity 
> which is much coarser?

The granularity of a "Group" really depends on what the HW is like. On
pure PCIe SR-IOV we can go down to function granularity.

In fact I currently go down to function granularity on anything pure
PCIe as well, though as I explained earlier, that's a bit chancy since
some adapters -will- let you create side effects such as side-band
access to config space.

pHyp doesn't allow that granularity as far as I can tell, one slot is
always fully assigned to a PE.

However, we might have resource constraints, such as reaching the max
number of segments or iommu regions, that may force us to group a bit
more coarsely under some circumstances.

The main point is that the grouping is pre-existing, so an API designed
around the idea of: 1- create domain, 2- add random devices to it, 3-
use it, won't work for us very well :-)

Since the grouping implies the sharing of iommu's, from a VFIO point of
view it really matches well with the idea of having the domains
pre-existing.

That's why I think a good fit is to have a static representation of the
grouping, with tools allowing to create/manipulate the groups (or
domains) for archs that allow this sort of manipulations, separately
from qemu/libvirt, avoiding those "on the fly" groups whose lifetime is
tied to an instance of a file descriptor.

> > In -theory-, one could do the grouping dynamically with some kind of API
> > for us as well. However the constraints are such that it's not
> > practical. Filtering on RID is based on number of bits to match in the
> > bus number and whether to match the dev and fn. So it's not arbitrary
> > (but works fine for SR-IOV).
> >
> > The MMIO segmentation is a bit special too. There is a single MMIO
> > region in 32-bit space (size is configurable but that's not very
> > practical so for now we stick it to 1G) which is evenly divided into N
> > segments (where N is the number of PE# supported by the host bridge,
> > typically 128 with the current bridges).
> >
> > Each segment goes through a remapping table to select the actual PE# (so
> > large BARs use consecutive segments mapped to the same PE#).
> >
> > For SR-IOV we plan to not use the M32 region. We also have 64-bit MMIO
> > regions which act as some kind of "accordions", they are evenly divided
> > into segments in different PE# and there's several of them which we can
> > "move around" and typically use to map VF BARs.
> 
> So, SRIOV VFs *don't* have the group limitation?  Sorry, I'm deluged by 
> technical details with no ppc background to put them to, I can't say I'm 
> making any sense of this.

:-)

Don't worry, it took me a while to get my head around the HW :-) SR-IOV
VFs will generally not have limitations like that, no, but on the other
hand, they -will- still require 1 VF = 1 group, ie, you won't be able to
take a bunch of VFs and put them in the same 'domain'.

I think the main deal is that VFIO/qemu sees "domains" as "guests" and
tries to put all devices for a given guest into a "domain".

On POWER, we have a different view of things where domains/groups are
defined to be the smallest granularity we can (down to a single VF) and
we give several groups to a guest (ie we avoid sharing the iommu in most
cases).

This is driven by the HW design but that design is itself driven by the
idea that the domains/groups are also error isolation groups and we don't
want to take all of the IOs of a guest down if one adapter in that guest
is having an error.

The x86 domains are conceptually different as they are about sharing the
iommu page tables with the clear long term intent of then sharing those
page tables with the guest CPU's own. We aren't going in that direction
(at this point at least) on POWER.

> > >  >  VFIO here is basically designed for one and only one thing: expose the
> > >  >  entire guest physical address space to the device more/less 1:1.
> > >
> > >  A single level iommu cannot be exposed to guests.  Well, it can be
> > >  exposed as an iommu that does not provide per-device mapping.
> >
> > Well, x86 ones can't maybe but on POWER we can and must thanks to our
> > essentially paravirt model :-) Even if it' wasn't and we used trapping
> > of accesses to the table, it would work because in practice, even with
> > filtering, what we end up having is a per-device (or rather per-PE#
> > table).
> >
> > >  A two level iommu can be emulated and exposed to the guest.  See
> > >  http://support.amd.com/us/Processor_TechDocs/48882.pdf for an example.
> >
> > What you mean 2-level is two passes through two trees (ie 6 or 8 levels
> > right ?).
> 
> (16 or 25)

25 levels ? You mean 25 loads to get to a translation ? And you get any
kind of performance out of that ? :-)

> > We don't have that and probably never will. But again, because
> > we have a paravirt interface to the iommu, it's less of an issue.
> 
> Well, then, I guess we need an additional interface to expose that to 
> the guest.
> 
> > >  >  This means:
> > >  >
> > >  >     - It only works with iommu's that provide complete DMA address spaces
> > >  >  to devices. Won't work with a single 'segmented' address space like we
> > >  >  have on POWER.
> > >  >
> > >  >     - It requires the guest to be pinned. Pass-through ->   no more swap
> > >
> > >  Newer iommus (and devices, unfortunately) (will) support I/O page faults
> > >  and then the requirement can be removed.
> >
> > No. -Some- newer devices will. Out of these, a bunch will have so many
> > bugs in it it's not usable. Some never will. It's a mess really and I
> > wouldn't design my stuff based on those premises just yet. Making it
> > possible to support it for sure, having it in mind, but not making it
> > the fundation on which the whole API is designed.
> 
> The API is not designed around pinning.  It's a side effect of how the 
> IOMMU works.  If your IOMMU only maps pages which are under active DMA, 
> then it would only pin those pages.
> 
> But I see what you mean, the API is designed around up-front 
> specification of all guest memory.

Right :-)

> > >  >     - It doesn't work for POWER server anyways because of our need to
> > >  >  provide a paravirt iommu interface to the guest since that's how pHyp
> > >  >  works today and how existing OSes expect to operate.
> > >
> > >  Then you need to provide that same interface, and implement it using the
> > >  real iommu.
> >
> > Yes. Working on it. It's not very practical due to how VFIO interacts in
> > terms of APIs but solvable. Eventually, we'll make the iommu Hcalls
> > almost entirely real-mode for performance reasons.
> 
> The original kvm device assignment code was (and is) part of kvm 
> itself.  We're trying to move to vfio to allow sharing with non-kvm 
> users, but it does reduce flexibility.  We can have an internal vfio-kvm 
> interface to update mappings in real time.
> 
> > >  >  - Performance sucks of course, the vfio map ioctl wasn't mean for that
> > >  >  and has quite a bit of overhead. However we'll want to do the paravirt
> > >  >  call directly in the kernel eventually ...
> > >
> > >  Does the guest iomap each request?  Why?
> >
> > Not sure what you mean... the guest calls h-calls for every iommu page
> > mapping/unmapping, yes. So the performance of these is critical. So yes,
> > we'll eventually do it in kernel. We just haven't yet.
> 
> I see.  x86 traditionally doesn't do it for every request.  We had some 
> proposals to do a pviommu that does map every request, but none reached 
> maturity.

It's quite performance critical; you don't want to go anywhere near a
full exit. On POWER we plan to handle that in "real mode" (ie MMU off)
straight off the interrupt handlers, with the CPU still basically
operating in guest context with HV permission. That is, basically do the
permission check and translation and whack the HW iommu immediately. If
for some reason one step fails (!present PTE or something like that),
we'd then fall back to an exit to Linux to handle it in a more "common"
environment where we can handle page faults etc...
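
In rough pseudo-C, the intended fast path looks something like this -
all the helpers are hypothetical, it's only a sketch of the flow:

/* Sketch of the intended real-mode H_PUT_TCE fast path; helpers hypothetical. */
#define H_SUCCESS       0
#define H_HARDWARE      (-1)
#define H_RESUME_HOST   (-2)    /* made-up: "punt to the full exit path" */
#define TCE_PERM_MASK   0x3UL   /* read/write permission bits */

/* hypothetical real-mode-safe helpers */
extern int guest_owns_liobn(unsigned long liobn);
extern int realmode_gpa_to_hpa(unsigned long gpa, unsigned long *hpa);
extern void tce_table_write(unsigned long liobn, unsigned long ioba,
                            unsigned long tce);

long h_put_tce_fast(unsigned long liobn, unsigned long ioba,
                    unsigned long tce_from_guest)
{
        unsigned long hpa;

        /* permission check: does this guest own the target TCE table? */
        if (!guest_owns_liobn(liobn))
                return H_HARDWARE;

        /* translate guest physical -> host physical without faulting */
        if (!realmode_gpa_to_hpa(tce_from_guest & ~TCE_PERM_MASK, &hpa))
                return H_RESUME_HOST;   /* e.g. !present PTE: take the full exit */

        /* whack the HW iommu immediately */
        tce_table_write(liobn, ioba, hpa | (tce_from_guest & TCE_PERM_MASK));
        return H_SUCCESS;
}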

> > >  So, you have interrupt redirection?  That is, MSI-x table values encode
> > >  the vcpu, not pcpu?
> >
> > Not exactly. The MSI-X address is a real PCI address to an MSI port and
> > the value is a real interrupt number in the PIC.
> >
> > However, the MSI port filters by RID (using the same matching as PE#) to
> > ensure that only allowed devices can write to it, and the PIC has a
> > matching PE# information to ensure that only allowed devices can trigger
> > the interrupt.
> >
> > As for the guest knowing what values to put in there (what port address
> > and interrupt source numbers to use), this is part of the paravirt APIs.
> >
> > So the paravirt APIs handles the configuration and the HW ensures that
> > the guest cannot do anything else than what it's allowed to.
> 
> Okay, this is something that x86 doesn't have.  Strange that it can 
> filter DMA at a fine granularity but not MSI, which is practically the 
> same thing.

I wouldn't be surprised if it's actually a quite different path in HW.
There's some magic decoding based on top bits usually that decides it's
an MSI and it goes completely elsewhere from there in the bridge. 

> > >  Does the BAR value contain the segment base address?  Or is that added
> > >  later?
> >
> > It's a shared address space. With a basic configuration on p7ioc for
> > example we have MMIO going from 3G to 4G (PCI side addresses). BARs
> > contain the normal PCI address there. But that 1G is divided in 128
> > segments of equal size which can separately be assigned to PE#'s.
> >
> > So BARs are allocated by firmware or the kernel PCI code so that devices
> > in different PEs don't share segments.
> 
> Okay, and config space virtualization ensures that the guest can't remap?

Well, so it depends :-)

With KVM we currently use whatever config space virtualization you do
and so we somewhat rely on this but it's not very fool proof.

I believe pHyp doesn't even bother filtering config space. As I said in
another note, you can't trust adapters anyway. Plenty of them (video
cards come to mind) have ways to get to their own config space via MMIO
registers for example.

So what pHyp does is that it always creates PEs (aka groups) that are
below a bridge. With PCIe, everything mostly is below a bridge so that's
easy, but that does mean that you always have all functions of a device
in the same PE (and thus in the same partition). SR-IOV is an exception
to this rule since in that case the HW is designed to be trusted.

That way, being behind a bridge, the bridge windows are going to define
what can be forwarded to the device, and thus the system is immune to
the guest putting crap into the BARs. It can't be remapped to overlap a
neighbouring device.

Note that the bridge itself isn't visible to the guest, so yes, config
space is -somewhat- virtualized; typically pHyp makes every pass-through
PE look like a separate PCI host bridge with the devices below it.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-02 12:58         ` Benjamin Herrenschmidt
@ 2011-08-02 13:39           ` Avi Kivity
  -1 siblings, 0 replies; 322+ messages in thread
From: Avi Kivity @ 2011-08-02 13:39 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: kvm, Anthony Liguori, Alex Williamson, David Gibson,
	Paul Mackerras, Alexey Kardashevskiy, linux-pci, linuxppc-dev

On 08/02/2011 03:58 PM, Benjamin Herrenschmidt wrote:
> >  >
> >  >  What you mean 2-level is two passes through two trees (ie 6 or 8 levels
> >  >  right ?).
> >
> >  (16 or 25)
>
> 25 levels ? You mean 25 loads to get to a translation ? And you get any
> kind of performance out of that ? :-)
>

Aggressive partial translation caching.  Even then, performance does
suffer on memory-intensive workloads.  The fix was transparent
hugepages; that makes the page table walks much faster since they're
fully cached, the partial translation caches become more effective, and
the TLB itself becomes more effective.  On some workloads, THP on both
guest and host was faster than no-THP on bare metal.
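
For a back-of-envelope sense of where a number like 25 comes from
(assuming 4-level guest and 4-level host tables; this is just the
arithmetic, not a claim about any particular CPU or IOMMU):

/* Each of the 4 guest page-table levels needs a 4-level host walk plus
 * the guest entry load itself, then the final guest-physical data
 * address needs one more host walk before the data access. */
#include <stdio.h>

int main(void)
{
	int g = 4, h = 4;               /* guest / host table levels */
	int loads = g * (h + 1)         /* host walk + entry load per guest level */
		  + h                   /* host walk for the final data address   */
		  + 1;                  /* the data access itself                 */
	printf("worst-case memory references: %d\n", loads);    /* 25 */
	return 0;
}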

> >  >
> >  >  Not sure what you mean... the guest calls h-calls for every iommu page
> >  >  mapping/unmapping, yes. So the performance of these is critical. So yes,
> >  >  we'll eventually do it in kernel. We just haven't yet.
> >
> >  I see.  x86 traditionally doesn't do it for every request.  We had some
> >  proposals to do a pviommu that does map every request, but none reached
> >  maturity.
>
> It's quite performance critical, you don't want to go anywhere near a
> full exit. On POWER we plan to handle that in "real mode" (ie MMU off)
> straight off the interrupt handlers, with the CPU still basically
> operating in guest context with HV permission. That is basically do the
> permission check, translation and whack the HW iommu immediately. If for
> some reason one step fails (!present PTE or something like that), we'd
> then fallback to an exit to Linux to handle it in a more "common"
> environment where we can handle page faults etc...

I guess we can hack some kind of private interface, though I'd hoped to 
avoid it (and so far we succeeded - we can even get vfio to inject 
interrupts into kvm from the kernel without either knowing anything 
about the other).

> >  >  >   Does the BAR value contain the segment base address?  Or is that added
> >  >  >   later?
> >  >
> >  >  It's a shared address space. With a basic configuration on p7ioc for
> >  >  example we have MMIO going from 3G to 4G (PCI side addresses). BARs
> >  >  contain the normal PCI address there. But that 1G is divided in 128
> >  >  segments of equal size which can separately be assigned to PE#'s.
> >  >
> >  >  So BARs are allocated by firmware or the kernel PCI code so that devices
> >  >  in different PEs don't share segments.
> >
> >  Okay, and config space virtualization ensures that the guest can't remap?
>
> Well, so it depends :-)
>
> With KVM we currently use whatever config space virtualization you do
> and so we somewhat rely on this but it's not very fool proof.
>
> I believe pHyp doesn't even bother filtering config space. As I said in
> another note, you can't trust adapters anyway. Plenty of them (video
> cards come to mind) have ways to get to their own config space via MMIO
> registers for example.

Yes, we've seen that.

> So what pHyp does is that it always create PE's (aka groups) that are
> below a bridge. With PCIe, everything mostly is below a bridge so that's
> easy, but that does mean that you always have all functions of a device
> in the same PE (and thus in the same partition). SR-IOV is an exception
> to this rule since in that case the HW is designed to be trusted.
>
> That way, being behind a bridge, the bridge windows are going to define
> what can be forwarded to the device, and thus the system is immune to
> the guest putting crap into the BARs. It can't be remapped to overlap a
> neighbouring device.
>
> Note that the bridge itself isn't visible to the guest, so yes, config
> space is -somewhat- virtualized, typically pHyp make every pass-through
> PE look like a separate PCI host bridge with the devices below it.

I think I see, yes.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-02  1:27     ` Benjamin Herrenschmidt
@ 2011-08-02 14:39       ` Alex Williamson
  -1 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-08-02 14:39 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Avi Kivity, kvm, Anthony Liguori, David Gibson, Paul Mackerras,
	Alexey Kardashevskiy, linux-pci, linuxppc-dev

On Tue, 2011-08-02 at 11:27 +1000, Benjamin Herrenschmidt wrote:
> It's a shared address space. With a basic configuration on p7ioc for
> example we have MMIO going from 3G to 4G (PCI side addresses). BARs
> contain the normal PCI address there. But that 1G is divided in 128
> segments of equal size which can separately be assigned to PE#'s.
> 
> So BARs are allocated by firmware or the kernel PCI code so that devices
> in different PEs don't share segments.
> 
> Of course there's always the risk that a device can be hacked via a
> sideband access to BARs to move out of it's allocated segment. That
> means that the guest owning that device won't be able to access it
> anymore and can potentially disturb a guest or host owning whatever is
> in that other segment.

Wait, what?  I thought the MMIO segments were specifically so that if
the device BARs moved out of the segment, the guest only hurts itself
and not whatever owns the segments it now overlaps.

> The only way to enforce isolation here is to ensure that PE# are
> entirely behind P2P bridges, since those would then ensure that even if
> you put crap into your BARs you won't be able to walk over a neighbour.

Ok, so the MMIO segments are really just a configuration nuance of the
platform and being behind a P2P bridge is what allows you to hand off
BARs to a guest (which needs to know the bridge window to do anything
useful with them).  Is that right?
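
As an illustration of the bridge-window behaviour being described
(standard PCI-to-PCI bridge memory base/limit forwarding): the bridge
only forwards downstream memory cycles that fall inside its programmed
window, so a BAR pushed outside the window simply becomes unreachable
instead of shadowing a neighbour.  A minimal sketch of that forwarding
decision, using a hypothetical helper rather than real kernel code:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct p2p_window {
	uint64_t mem_base;    /* from the bridge's Memory Base register  */
	uint64_t mem_limit;   /* from the bridge's Memory Limit register */
};

/* A downstream memory access is only forwarded if it falls inside the
 * bridge window, so a BAR the guest moves outside the window can no
 * longer be reached (and cannot overlap a neighbouring PE's segment). */
static bool bridge_forwards(const struct p2p_window *w, uint64_t addr)
{
	return addr >= w->mem_base && addr <= w->mem_limit;
}

int main(void)
{
	struct p2p_window w = { 0xC2000000ULL, 0xC2FFFFFFULL };  /* example window */
	printf("in-window access forwarded: %d\n",
	       bridge_forwards(&w, 0xC2400000ULL));              /* 1 */
	printf("moved-out BAR still reachable: %d\n",
	       bridge_forwards(&w, 0xD0000000ULL));              /* 0 */
	return 0;
}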

> I believe pHyp enforces that, for example, if you have a slot, all
> devices & functions behind that slot pertain to the same PE# under pHyp.
> 
> That means you cannot put individual functions of a device into
> different PE# with pHyp.
> 
> We plan to be a bit less restrictive here for KVM, assuming that if you
> use a device that allows such a back-channel to the BARs, then it's your
> problem to not trust such a device for virtualization. And most of the
> time, you -will- have a P2P to protect you anyways.
> 
> The problem doesn't exist (or is assumed as non-existing) for SR-IOV
> since in that case, the VFs are meant to be virtualized, so pHyp assumes
> there is no such back-channel and it can trust them to be in different
> PE#.

But you still need the P2P bridge to protect MMIO segments?  Or do
SR-IOV BARs need to be virtualized?  I'm having trouble with the mental
model of how you can do both.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-02 12:58         ` Benjamin Herrenschmidt
@ 2011-08-02 15:34           ` Alex Williamson
  -1 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-08-02 15:34 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Avi Kivity, kvm, Anthony Liguori, David Gibson, Paul Mackerras,
	Alexey Kardashevskiy, linux-pci, linuxppc-dev

On Tue, 2011-08-02 at 22:58 +1000, Benjamin Herrenschmidt wrote:
> 
> Don't worry, it took me a while to get my head around the HW :-) SR-IOV
> VFs will generally not have limitations like that no, but on the other
> hand, they -will- still require 1 VF = 1 group, ie, you won't be able to
> take a bunch of VFs and put them in the same 'domain'.
> 
> I think the main deal is that VFIO/qemu sees "domains" as "guests" and
> tries to put all devices for a given guest into a "domain".

Actually, that's only a recent optimization; before that, each device
got its own iommu domain.  It's completely configurable on the qemu
command line which devices get their own iommu and which share.  The
default optimizes the number of domains (one) and thus the number of
mapping callbacks, since we pin the entire guest.

> On POWER, we have a different view of things were domains/groups are
> defined to be the smallest granularity we can (down to a single VF) and
> we give several groups to a guest (ie we avoid sharing the iommu in most
> cases)
> 
> This is driven by the HW design but that design is itself driven by the
> idea that the domains/group are also error isolation groups and we don't
> want to take all of the IOs of a guest down if one adapter in that guest
> is having an error.
> 
> The x86 domains are conceptually different as they are about sharing the
> iommu page tables with the clear long term intent of then sharing those
> page tables with the guest CPU own. We aren't going in that direction
> (at this point at least) on POWER..

Yes and no.  The x86 domains are pretty flexible and used in a few
different ways.  On the host we do dynamic DMA with a domain per device,
mapping only the in-flight DMA ranges.  In order to achieve the
transparent device assignment model, we have to flip that around and map
the entire guest.  As noted, we can continue to use separate domains for
this, but since each maps the entire guest, it doesn't add a lot of
value, uses more resources, and requires more mapping callbacks (and
x86 doesn't have the best error containment anyway).  If we had a
well-supported IOMMU model that we could adapt for pvDMA, then it would
make sense to keep each device in its own domain again.  Thanks,

Alex
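
As an illustration of the two usage models contrasted above -- map the
whole pinned guest once up front versus map/unmap each in-flight DMA
range -- here is a toy sketch.  The API names are invented for the
example; this is not VFIO's or the kernel's interface.

/* Toy illustration of the two mapping models. */
#include <stdint.h>
#include <stdio.h>

static void toy_iommu_map(uint64_t iova, uint64_t len)
{
	printf("map   iova=0x%llx len=0x%llx\n",
	       (unsigned long long)iova, (unsigned long long)len);
}

static void toy_iommu_unmap(uint64_t iova, uint64_t len)
{
	printf("unmap iova=0x%llx len=0x%llx\n",
	       (unsigned long long)iova, (unsigned long long)len);
}

/* Assigned-device model: one up-front mapping of all guest RAM, which
 * is why the guest memory has to stay pinned. */
static void map_whole_guest(uint64_t guest_ram_size)
{
	toy_iommu_map(0, guest_ram_size);
}

/* Dynamic model: each DMA is mapped only for its lifetime, so the
 * map/unmap path is a hot path (hence the h-call performance concerns
 * earlier in the thread). */
static void one_dma(uint64_t iova, uint64_t len)
{
	toy_iommu_map(iova, len);
	/* ... device performs the DMA ... */
	toy_iommu_unmap(iova, len);
}

int main(void)
{
	map_whole_guest(2ULL << 30);    /* e.g. a 2G guest            */
	one_dma(0x100000, 0x1000);      /* e.g. one 4K network buffer */
	return 0;
}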


^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-02  8:28     ` David Gibson
  (?)
@ 2011-08-02 18:14       ` Alex Williamson
  -1 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-08-02 18:14 UTC (permalink / raw)
  To: David Gibson
  Cc: Benjamin Herrenschmidt, chrisw, Alexey Kardashevskiy, kvm,
	Paul Mackerras, linux-pci, qemu-devel, aafabbri, iommu,
	Anthony Liguori, linuxppc-dev, benve

On Tue, 2011-08-02 at 18:28 +1000, David Gibson wrote:
> On Sat, Jul 30, 2011 at 12:20:08PM -0600, Alex Williamson wrote:
> > On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote:
> [snip]
> > On x86, the USB controllers don't typically live behind a PCIe-to-PCI
> > bridge, so don't suffer the source identifier problem, but they do often
> > share an interrupt.  But even then, we can count on most modern devices
> > supporting PCI2.3, and thus the DisINTx feature, which allows us to
> > share interrupts.  In any case, yes, it's more rare but we need to know
> > how to handle devices behind PCI bridges.  However I disagree that we
> > need to assign all the devices behind such a bridge to the guest.
> > There's a difference between removing the device from the host and
> > exposing the device to the guest.
> 
> I think you're arguing only over details of what words to use for
> what, rather than anything of substance here.  The point is that an
> entire partitionable group must be assigned to "host" (in which case
> kernel drivers may bind to it) or to a particular guest partition (or
> at least to a single UID on the host).  Which of the assigned devices
> the partition actually uses is another matter of course, as is at
> exactly which level they become "de-exposed" if you don't want to use
> all of then.

Well, first we need to define what a partitionable group is: whether
it's based on hardware requirements or on user policy.  And while I
agree that we need unique ownership of a partition, I disagree that qemu
is necessarily the owner of the entire partition rather than of
individual devices.  But feel free to dismiss that as unsubstantial.

> [snip]
> > > Maybe something like /sys/devgroups ? This probably warrants involving
> > > more kernel people into the discussion.
> > 
> > I don't yet buy into passing groups to qemu since I don't buy into the
> > idea of always exposing all of those devices to qemu.  Would it be
> > sufficient to expose iommu nodes in sysfs that link to the devices
> > behind them and describe properties and capabilities of the iommu
> > itself?  More on this at the end.
> 
> Again, I don't think you're making a distinction of any substance.
> Ben is saying the group as a whole must be set to allow partition
> access, whether or not you call that "assigning".  There's no reason
> that passing a sysfs descriptor to qemu couldn't be the qemu
> developer's quick-and-dirty method of putting the devices in, while
> also allowing full assignment of the devices within the groups by
> libvirt.

Well, there is a reason for not passing a sysfs descriptor to qemu if
qemu isn't the one defining the policy about how the members of that
group are exposed.  I tend to envision a userspace entity defining
policy and granting devices to qemu.  Do we really want separate
developer vs production interfaces?

> [snip]
> > > Now some of this can be fixed with tweaks, and we've started doing it
> > > (we have a working pass-through using VFIO, forgot to mention that, it's
> > > just that we don't like what we had to do to get there).
> > 
> > This is a result of wanting to support *unmodified* x86 guests.  We
> > don't have the luxury of having a predefined pvDMA spec that all x86
> > OSes adhere to.  The 32bit problem is unfortunate, but the priority use
> > case for assigning devices to guests is high performance I/O, which
> > usually entails modern, 64bit hardware.  I'd like to see us get to the
> > point of having emulated IOMMU hardware on x86, which could then be
> > backed by VFIO, but for now guest pinning is the most practical and
> > useful.
> 
> No-one's suggesting that this isn't a valid mode of operation.  It's
> just that right now conditionally disabling it for us is fairly ugly
> because of the way the qemu code is structured.

It really shouldn't be any more than skipping the
cpu_register_phys_memory_client() and calling the map/unmap routines
elsewhere.

> [snip]
> > >  - I don't like too much the fact that VFIO provides yet another
> > > different API to do what we already have at least 2 kernel APIs for, ie,
> > > BAR mapping and config space access. At least it should be better at
> > > using the backend infrastructure of the 2 others (sysfs & procfs). I
> > > understand it wants to filter in some case (config space) and -maybe-
> > > yet another API is the right way to go but allow me to have my doubts.
> > 
> > The use of PCI sysfs is actually one of my complaints about current
> > device assignment.  To do assignment with an unprivileged guest we need
> > to open the PCI sysfs config file for it, then change ownership on a
> > handful of other PCI sysfs files, then there's this other pci-stub thing
> > to maintain ownership, but the kvm ioctls don't actually require it and
> > can grab onto any free device...  We are duplicating some of that in
> > VFIO, but we also put the ownership of the device behind a single device
> > file.  We do have the uiommu problem that we can't give an unprivileged
> > user ownership of that, but your usage model may actually make that
> > easier.  More below...
> 
> Hrm.  I was assuming that a sysfs groups interface would provide a
> single place to set the ownership of the whole group.  Whether that's
> a echoing a uid to a magic file or doing or chown on the directory or
> whatever is a matter of details.

Except one of those details is whether we manage the group in sysfs or
just expose enough information in sysfs for another userspace entity to
manage the devices.  Where do we manage enforcement of hardware policy
vs userspace policy?

> [snip]
> > I spent a lot of time looking for an architecture neutral solution here,
> > but I don't think it exists.  Please prove me wrong.  The problem is
> > that we have to disable INTx on an assigned device after it fires (VFIO
> > does this automatically).  If we don't do this, a non-responsive or
> > malicious guest could sit on the interrupt, causing it to fire
> > repeatedly as a DoS on the host.  The only indication that we can rely
> > on to re-enable INTx is when the guest CPU writes an EOI to the APIC.
> > We can't just wait for device accesses because a) the device CSRs are
> > (hopefully) direct mapped and we'd have to slow map them or attempt to
> > do some kind of dirty logging to detect when they're accesses b) what
> > constitutes an interrupt service is device specific.
> > 
> > That means we need to figure out how PCI interrupt 'A' (or B...)
> > translates to a GSI (Global System Interrupt - ACPI definition, but
> > hopefully a generic concept).  That GSI identifies a pin on an IOAPIC,
> > which will also see the APIC EOI.  And just to spice things up, the
> > guest can change the PCI to GSI mappings via ACPI.  I think the set of
> > callbacks I've added are generic (maybe I left ioapic in the name), but
> > yes they do need to be implemented for other architectures.  Patches
> > appreciated from those with knowledge of the systems and/or access to
> > device specs.  This is the only reason that I make QEMU VFIO only build
> > for x86.
> 
> There will certainly need to be some arch hooks here, but it can be
> made less intrusively x86 specific without too much difficulty.
> e.g. Create an EOF notifier chain in qemu - the master PICs (APIC for
> x86, XICS for pSeries) for all vfio capable machines need to kick it,
> and vfio subscribes.

Am I the only one who sees ioapic_add/remove_gsi_eoi_notifier() in the
qemu/vfio patch series?  Shoot me for using ioapic in the name, but it's
exactly what you're asking for.  It just needs to be made a common
service and implemented for power.
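
For anyone trying to picture the shape of that service, here is a
generic sketch of the EOI-notifier idea.  The names are invented and
this is not the actual ioapic_add_gsi_eoi_notifier() code from the
patch series: the emulated interrupt controller runs a per-pin notifier
list when the guest EOIs, and a vfio-like subscriber uses that to
unmask INTx again.

#include <stdio.h>

#define TOY_NR_PINS 24

struct eoi_notifier {
	void (*notify)(void *opaque);
	void *opaque;
	struct eoi_notifier *next;
};

static struct eoi_notifier *eoi_chain[TOY_NR_PINS];

static void toy_add_eoi_notifier(int pin, struct eoi_notifier *n)
{
	n->next = eoi_chain[pin];
	eoi_chain[pin] = n;
}

/* Called from the emulated PIC/APIC/XICS model when the guest EOIs `pin`. */
static void toy_pin_eoi(int pin)
{
	struct eoi_notifier *n;

	for (n = eoi_chain[pin]; n; n = n->next)
		n->notify(n->opaque);
}

/* What a vfio-like subscriber would do: re-enable the INTx line that
 * was masked when the interrupt fired, now that the guest has EOIed. */
static void toy_vfio_unmask_intx(void *opaque)
{
	printf("unmask INTx for device %s\n", (const char *)opaque);
}

int main(void)
{
	/* "00:19.0" is just an example device address */
	struct eoi_notifier n = { toy_vfio_unmask_intx, "00:19.0", NULL };

	toy_add_eoi_notifier(10, &n);
	toy_pin_eoi(10);        /* guest EOIs pin/GSI 10 -> INTx unmasked */
	return 0;
}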

> [snip]
> > Rather than your "groups" idea, I've been mulling over whether we can
> > just expose the dependencies, configuration, and capabilities in sysfs
> > and build qemu commandlines to describe it.  For instance, if we simply
> > start with creating iommu nodes in sysfs, we could create links under
> > each iommu directory to the devices behind them.  Some kind of
> > capability file could define properties like whether it's page table
> > based or fixed iova window or the granularity of mapping the devices
> > behind it.  Once we have that, we could probably make uiommu attach to
> > each of those nodes.
> 
> Well, that would address our chief concern that inherently tying the
> lifetime of a domain to an fd is problematic.  In fact, I don't really
> see how this differs from the groups proposal except in the details of
> how you inform qemu of the group^H^H^H^H^Hiommu domain.

One implies group policy, configuration, and management in sysfs; the
other exposes the hardware dependencies in sysfs and leaves the rest for
someone else (libvirt).  Thanks,

Alex

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
@ 2011-08-02 18:14       ` Alex Williamson
  0 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-08-02 18:14 UTC (permalink / raw)
  To: David Gibson
  Cc: aafabbri, Alexey Kardashevskiy, kvm, Paul Mackerras, qemu-devel,
	chrisw, iommu, Anthony Liguori, linux-pci, linuxppc-dev, benve

On Tue, 2011-08-02 at 18:28 +1000, David Gibson wrote:
> On Sat, Jul 30, 2011 at 12:20:08PM -0600, Alex Williamson wrote:
> > On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote:
> [snip]
> > On x86, the USB controllers don't typically live behind a PCIe-to-PCI
> > bridge, so don't suffer the source identifier problem, but they do often
> > share an interrupt.  But even then, we can count on most modern devices
> > supporting PCI2.3, and thus the DisINTx feature, which allows us to
> > share interrupts.  In any case, yes, it's more rare but we need to know
> > how to handle devices behind PCI bridges.  However I disagree that we
> > need to assign all the devices behind such a bridge to the guest.
> > There's a difference between removing the device from the host and
> > exposing the device to the guest.
> 
> I think you're arguing only over details of what words to use for
> what, rather than anything of substance here.  The point is that an
> entire partitionable group must be assigned to "host" (in which case
> kernel drivers may bind to it) or to a particular guest partition (or
> at least to a single UID on the host).  Which of the assigned devices
> the partition actually uses is another matter of course, as is at
> exactly which level they become "de-exposed" if you don't want to use
> all of then.

Well first we need to define what a partitionable group is, whether it's
based on hardware requirements or user policy.  And while I agree that
we need unique ownership of a partition, I disagree that qemu is
necessarily the owner of the entire partition vs individual devices.
But feel free to dismiss it as unsubstantial.

> [snip]
> > > Maybe something like /sys/devgroups ? This probably warrants involving
> > > more kernel people into the discussion.
> > 
> > I don't yet buy into passing groups to qemu since I don't buy into the
> > idea of always exposing all of those devices to qemu.  Would it be
> > sufficient to expose iommu nodes in sysfs that link to the devices
> > behind them and describe properties and capabilities of the iommu
> > itself?  More on this at the end.
> 
> Again, I don't think you're making a distinction of any substance.
> Ben is saying the group as a whole must be set to allow partition
> access, whether or not you call that "assigning".  There's no reason
> that passing a sysfs descriptor to qemu couldn't be the qemu
> developer's quick-and-dirty method of putting the devices in, while
> also allowing full assignment of the devices within the groups by
> libvirt.

Well, there is a reason for not passing a sysfs descriptor to qemu if
qemu isn't the one defining the policy about how the members of that
group are exposed.  I tend to envision a userspace entity defining
policy and granting devices to qemu.  Do we really want separate
developer vs production interfaces?

> [snip]
> > > Now some of this can be fixed with tweaks, and we've started doing it
> > > (we have a working pass-through using VFIO, forgot to mention that, it's
> > > just that we don't like what we had to do to get there).
> > 
> > This is a result of wanting to support *unmodified* x86 guests.  We
> > don't have the luxury of having a predefined pvDMA spec that all x86
> > OSes adhere to.  The 32bit problem is unfortunate, but the priority use
> > case for assigning devices to guests is high performance I/O, which
> > usually entails modern, 64bit hardware.  I'd like to see us get to the
> > point of having emulated IOMMU hardware on x86, which could then be
> > backed by VFIO, but for now guest pinning is the most practical and
> > useful.
> 
> No-one's suggesting that this isn't a valid mode of operation.  It's
> just that right now conditionally disabling it for us is fairly ugly
> because of the way the qemu code is structured.

It really shouldn't be any more than skipping the
cpu_register_phys_memory_client() and calling the map/unmap routines
elsewhere.

> [snip]
> > >  - I don't like too much the fact that VFIO provides yet another
> > > different API to do what we already have at least 2 kernel APIs for, ie,
> > > BAR mapping and config space access. At least it should be better at
> > > using the backend infrastructure of the 2 others (sysfs & procfs). I
> > > understand it wants to filter in some case (config space) and -maybe-
> > > yet another API is the right way to go but allow me to have my doubts.
> > 
> > The use of PCI sysfs is actually one of my complaints about current
> > device assignment.  To do assignment with an unprivileged guest we need
> > to open the PCI sysfs config file for it, then change ownership on a
> > handful of other PCI sysfs files, then there's this other pci-stub thing
> > to maintain ownership, but the kvm ioctls don't actually require it and
> > can grab onto any free device...  We are duplicating some of that in
> > VFIO, but we also put the ownership of the device behind a single device
> > file.  We do have the uiommu problem that we can't give an unprivileged
> > user ownership of that, but your usage model may actually make that
> > easier.  More below...
> 
> Hrm.  I was assuming that a sysfs groups interface would provide a
> single place to set the ownership of the whole group.  Whether that's
> a echoing a uid to a magic file or doing or chown on the directory or
> whatever is a matter of details.

Except one of those details is whether we manage the group in sysfs or
just expose enough information in sysfs for another userspace entity to
manage the devices.  Where do we manage enforcement of hardware policy
vs userspace policy?

> [snip]
> > I spent a lot of time looking for an architecture neutral solution here,
> > but I don't think it exists.  Please prove me wrong.  The problem is
> > that we have to disable INTx on an assigned device after it fires (VFIO
> > does this automatically).  If we don't do this, a non-responsive or
> > malicious guest could sit on the interrupt, causing it to fire
> > repeatedly as a DoS on the host.  The only indication that we can rely
> > on to re-enable INTx is when the guest CPU writes an EOI to the APIC.
> > We can't just wait for device accesses because a) the device CSRs are
> > (hopefully) direct mapped and we'd have to slow map them or attempt to
> > do some kind of dirty logging to detect when they're accesses b) what
> > constitutes an interrupt service is device specific.
> > 
> > That means we need to figure out how PCI interrupt 'A' (or B...)
> > translates to a GSI (Global System Interrupt - ACPI definition, but
> > hopefully a generic concept).  That GSI identifies a pin on an IOAPIC,
> > which will also see the APIC EOI.  And just to spice things up, the
> > guest can change the PCI to GSI mappings via ACPI.  I think the set of
> > callbacks I've added are generic (maybe I left ioapic in the name), but
> > yes they do need to be implemented for other architectures.  Patches
> > appreciated from those with knowledge of the systems and/or access to
> > device specs.  This is the only reason that I make QEMU VFIO only build
> > for x86.
> 
> There will certainly need to be some arch hooks here, but it can be
> made less intrusively x86 specific without too much difficulty.
> e.g. Create an EOF notifier chain in qemu - the master PICs (APIC for
> x86, XICS for pSeries) for all vfio capable machines need to kick it,
> and vfio subscribes.

Am I the only one that sees ioapic_add/remove_gsi_eoi_notifier() in the
qemu/vfio patch series?  Shoot me for using ioapic in the name, but it's
exactly what you ask for.  It just needs to be made a common service and
implemented for power.

> [snip]
> > Rather than your "groups" idea, I've been mulling over whether we can
> > just expose the dependencies, configuration, and capabilities in sysfs
> > and build qemu commandlines to describe it.  For instance, if we simply
> > start with creating iommu nodes in sysfs, we could create links under
> > each iommu directory to the devices behind them.  Some kind of
> > capability file could define properties like whether it's page table
> > based or fixed iova window or the granularity of mapping the devices
> > behind it.  Once we have that, we could probably make uiommu attach to
> > each of those nodes.
> 
> Well, that would address our chief concern that inherently tying the
> lifetime of a domain to an fd is problematic.  In fact, I don't really
> see how this differs from the groups proposal except in the details of
> how you inform qemu of the group^H^H^H^H^Hiommu domain.

One implies group policy, configuration and management in sysfs, the
other exposes the hardware dependencies in sysfs and leaves the rest for
someone else (libvirt).  Thanks,

Alex

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [Qemu-devel] kvm PCI assignment & VFIO ramblings
@ 2011-08-02 18:14       ` Alex Williamson
  0 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-08-02 18:14 UTC (permalink / raw)
  To: David Gibson
  Cc: aafabbri, Alexey Kardashevskiy, kvm, Paul Mackerras, qemu-devel,
	chrisw, iommu, linux-pci, linuxppc-dev, benve

On Tue, 2011-08-02 at 18:28 +1000, David Gibson wrote:
> On Sat, Jul 30, 2011 at 12:20:08PM -0600, Alex Williamson wrote:
> > On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote:
> [snip]
> > On x86, the USB controllers don't typically live behind a PCIe-to-PCI
> > bridge, so don't suffer the source identifier problem, but they do often
> > share an interrupt.  But even then, we can count on most modern devices
> > supporting PCI2.3, and thus the DisINTx feature, which allows us to
> > share interrupts.  In any case, yes, it's more rare but we need to know
> > how to handle devices behind PCI bridges.  However I disagree that we
> > need to assign all the devices behind such a bridge to the guest.
> > There's a difference between removing the device from the host and
> > exposing the device to the guest.
> 
> I think you're arguing only over details of what words to use for
> what, rather than anything of substance here.  The point is that an
> entire partitionable group must be assigned to "host" (in which case
> kernel drivers may bind to it) or to a particular guest partition (or
> at least to a single UID on the host).  Which of the assigned devices
> the partition actually uses is another matter of course, as is at
> exactly which level they become "de-exposed" if you don't want to use
> all of then.

Well first we need to define what a partitionable group is, whether it's
based on hardware requirements or user policy.  And while I agree that
we need unique ownership of a partition, I disagree that qemu is
necessarily the owner of the entire partition vs individual devices.
But feel free to dismiss it as unsubstantial.

> [snip]
> > > Maybe something like /sys/devgroups ? This probably warrants involving
> > > more kernel people into the discussion.
> > 
> > I don't yet buy into passing groups to qemu since I don't buy into the
> > idea of always exposing all of those devices to qemu.  Would it be
> > sufficient to expose iommu nodes in sysfs that link to the devices
> > behind them and describe properties and capabilities of the iommu
> > itself?  More on this at the end.
> 
> Again, I don't think you're making a distinction of any substance.
> Ben is saying the group as a whole must be set to allow partition
> access, whether or not you call that "assigning".  There's no reason
> that passing a sysfs descriptor to qemu couldn't be the qemu
> developer's quick-and-dirty method of putting the devices in, while
> also allowing full assignment of the devices within the groups by
> libvirt.

Well, there is a reason for not passing a sysfs descriptor to qemu if
qemu isn't the one defining the policy about how the members of that
group are exposed.  I tend to envision a userspace entity defining
policy and granting devices to qemu.  Do we really want separate
developer vs production interfaces?

> [snip]
> > > Now some of this can be fixed with tweaks, and we've started doing it
> > > (we have a working pass-through using VFIO, forgot to mention that, it's
> > > just that we don't like what we had to do to get there).
> > 
> > This is a result of wanting to support *unmodified* x86 guests.  We
> > don't have the luxury of having a predefined pvDMA spec that all x86
> > OSes adhere to.  The 32bit problem is unfortunate, but the priority use
> > case for assigning devices to guests is high performance I/O, which
> > usually entails modern, 64bit hardware.  I'd like to see us get to the
> > point of having emulated IOMMU hardware on x86, which could then be
> > backed by VFIO, but for now guest pinning is the most practical and
> > useful.
> 
> No-one's suggesting that this isn't a valid mode of operation.  It's
> just that right now conditionally disabling it for us is fairly ugly
> because of the way the qemu code is structured.

It really shouldn't be any more than skipping the
cpu_register_phys_memory_client() and calling the map/unmap routines
elsewhere.

> [snip]
> > >  - I don't like too much the fact that VFIO provides yet another
> > > different API to do what we already have at least 2 kernel APIs for, ie,
> > > BAR mapping and config space access. At least it should be better at
> > > using the backend infrastructure of the 2 others (sysfs & procfs). I
> > > understand it wants to filter in some case (config space) and -maybe-
> > > yet another API is the right way to go but allow me to have my doubts.
> > 
> > The use of PCI sysfs is actually one of my complaints about current
> > device assignment.  To do assignment with an unprivileged guest we need
> > to open the PCI sysfs config file for it, then change ownership on a
> > handful of other PCI sysfs files, then there's this other pci-stub thing
> > to maintain ownership, but the kvm ioctls don't actually require it and
> > can grab onto any free device...  We are duplicating some of that in
> > VFIO, but we also put the ownership of the device behind a single device
> > file.  We do have the uiommu problem that we can't give an unprivileged
> > user ownership of that, but your usage model may actually make that
> > easier.  More below...
> 
> Hrm.  I was assuming that a sysfs groups interface would provide a
> single place to set the ownership of the whole group.  Whether that's
> a echoing a uid to a magic file or doing or chown on the directory or
> whatever is a matter of details.

Except one of those details is whether we manage the group in sysfs or
just expose enough information in sysfs for another userspace entity to
manage the devices.  Where do we manage enforcement of hardware policy
vs userspace policy?

> [snip]
> > I spent a lot of time looking for an architecture neutral solution here,
> > but I don't think it exists.  Please prove me wrong.  The problem is
> > that we have to disable INTx on an assigned device after it fires (VFIO
> > does this automatically).  If we don't do this, a non-responsive or
> > malicious guest could sit on the interrupt, causing it to fire
> > repeatedly as a DoS on the host.  The only indication that we can rely
> > on to re-enable INTx is when the guest CPU writes an EOI to the APIC.
> > We can't just wait for device accesses because a) the device CSRs are
> > (hopefully) direct mapped and we'd have to slow map them or attempt to
> > do some kind of dirty logging to detect when they're accesses b) what
> > constitutes an interrupt service is device specific.
> > 
> > That means we need to figure out how PCI interrupt 'A' (or B...)
> > translates to a GSI (Global System Interrupt - ACPI definition, but
> > hopefully a generic concept).  That GSI identifies a pin on an IOAPIC,
> > which will also see the APIC EOI.  And just to spice things up, the
> > guest can change the PCI to GSI mappings via ACPI.  I think the set of
> > callbacks I've added are generic (maybe I left ioapic in the name), but
> > yes they do need to be implemented for other architectures.  Patches
> > appreciated from those with knowledge of the systems and/or access to
> > device specs.  This is the only reason that I make QEMU VFIO only build
> > for x86.
> 
> There will certainly need to be some arch hooks here, but it can be
> made less intrusively x86 specific without too much difficulty.
> e.g. Create an EOI notifier chain in qemu - the master PICs (APIC for
> x86, XICS for pSeries) for all vfio capable machines need to kick it,
> and vfio subscribes.

Am I the only one that sees ioapic_add/remove_gsi_eoi_notifier() in the
qemu/vfio patch series?  Shoot me for using ioapic in the name, but it's
exactly what you ask for.  It just needs to be made a common service and
implemented for power.
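
As a rough sketch of what such a common service could look like (the names
and the fixed-size table are invented here; only the idea matches the
ioapic_add/remove_gsi_eoi_notifier() calls in the series):

#include <stdio.h>

#define MAX_EOI_NOTIFIERS 8

typedef void (*eoi_fn)(int gsi, void *opaque);

struct eoi_notifier { int gsi; eoi_fn fn; void *opaque; };
static struct eoi_notifier chain[MAX_EOI_NOTIFIERS];
static int nchain;

/* a device model (e.g. vfio) subscribes to EOIs for the GSI it uses */
static void add_eoi_notifier(int gsi, eoi_fn fn, void *opaque)
{
    if (nchain < MAX_EOI_NOTIFIERS)
        chain[nchain++] = (struct eoi_notifier){ gsi, fn, opaque };
}

/* the interrupt controller model (IOAPIC on x86, XICS on pSeries) calls
 * this when it sees the guest EOI the source */
static void fire_eoi(int gsi)
{
    for (int i = 0; i < nchain; i++)
        if (chain[i].gsi == gsi)
            chain[i].fn(gsi, chain[i].opaque);
}

static void vfio_reenable_intx(int gsi, void *opaque)
{
    printf("gsi %d EOId by guest: unmask INTx on the assigned device\n", gsi);
}

int main(void)
{
    add_eoi_notifier(10, vfio_reenable_intx, NULL);
    fire_eoi(10);
    return 0;
}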

> [snip]
> > Rather than your "groups" idea, I've been mulling over whether we can
> > just expose the dependencies, configuration, and capabilities in sysfs
> > and build qemu commandlines to describe it.  For instance, if we simply
> > start with creating iommu nodes in sysfs, we could create links under
> > each iommu directory to the devices behind them.  Some kind of
> > capability file could define properties like whether it's page table
> > based or fixed iova window or the granularity of mapping the devices
> > behind it.  Once we have that, we could probably make uiommu attach to
> > each of those nodes.
> 
> Well, that would address our chief concern that inherently tying the
> lifetime of a domain to an fd is problematic.  In fact, I don't really
> see how this differs from the groups proposal except in the details of
> how you inform qemu of the group^H^H^H^H^Hiommu domain.

One implies group policy, configuration and management in sysfs, the
other exposes the hardware dependencies in sysfs and leaves the rest for
someone else (libvirt).  Thanks,

Alex
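
Purely to sketch the shape of the sysfs idea quoted above -- every path and
file name below is invented, nothing like this exists today:

/sys/devices/.../iommu0/
    devices/0000:01:00.0 -> ../../0000:01:00.0    (one link per device behind it)
    devices/0000:01:00.1 -> ../../0000:01:00.1
    capability                                    ("pagetable" or "fixed-window")
    granularity                                   (smallest mappable unit)
    iova_base, iova_size                          (only for the fixed-window case)

A uiommu-style object could then attach to such a node rather than having its
lifetime tied to an fd.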

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-02 18:14       ` Alex Williamson
  (?)
@ 2011-08-02 18:35         ` Alex Williamson
  -1 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-08-02 18:35 UTC (permalink / raw)
  To: David Gibson
  Cc: aafabbri, Alexey Kardashevskiy, kvm, Paul Mackerras, qemu-devel,
	chrisw, iommu, linux-pci, linuxppc-dev, benve

On Tue, 2011-08-02 at 12:14 -0600, Alex Williamson wrote:
> On Tue, 2011-08-02 at 18:28 +1000, David Gibson wrote:
> > On Sat, Jul 30, 2011 at 12:20:08PM -0600, Alex Williamson wrote:
> > > On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote:
> > [snip]
> > > On x86, the USB controllers don't typically live behind a PCIe-to-PCI
> > > bridge, so don't suffer the source identifier problem, but they do often
> > > share an interrupt.  But even then, we can count on most modern devices
> > > supporting PCI2.3, and thus the DisINTx feature, which allows us to
> > > share interrupts.  In any case, yes, it's more rare but we need to know
> > > how to handle devices behind PCI bridges.  However I disagree that we
> > > need to assign all the devices behind such a bridge to the guest.
> > > There's a difference between removing the device from the host and
> > > exposing the device to the guest.
> > 
> > I think you're arguing only over details of what words to use for
> > what, rather than anything of substance here.  The point is that an
> > entire partitionable group must be assigned to "host" (in which case
> > kernel drivers may bind to it) or to a particular guest partition (or
> > at least to a single UID on the host).  Which of the assigned devices
> > the partition actually uses is another matter of course, as is at
> > exactly which level they become "de-exposed" if you don't want to use
> > all of them.
> 
> Well first we need to define what a partitionable group is, whether it's
> based on hardware requirements or user policy.  And while I agree that
> we need unique ownership of a partition, I disagree that qemu is
> necessarily the owner of the entire partition vs individual devices.

Sorry, I didn't intend to have such circular logic.  "... I disagree
that qemu is necessarily the owner of the entire partition vs granted
access to devices within the partition".  Thanks,

Alex

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-02 15:34           ` Alex Williamson
@ 2011-08-02 21:29             ` Konrad Rzeszutek Wilk
  -1 siblings, 0 replies; 322+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-08-02 21:29 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Benjamin Herrenschmidt, Avi Kivity, kvm, Anthony Liguori,
	David Gibson, Paul Mackerras, Alexey Kardashevskiy, linux-pci,
	linuxppc-dev

On Tue, Aug 02, 2011 at 09:34:58AM -0600, Alex Williamson wrote:
> On Tue, 2011-08-02 at 22:58 +1000, Benjamin Herrenschmidt wrote:
> > 
> > Don't worry, it took me a while to get my head around the HW :-) SR-IOV
> > VFs will generally not have limitations like that no, but on the other
> > hand, they -will- still require 1 VF = 1 group, ie, you won't be able to
> > take a bunch of VFs and put them in the same 'domain'.
> > 
> > I think the main deal is that VFIO/qemu sees "domains" as "guests" and
> > tries to put all devices for a given guest into a "domain".
> 
> Actually, that's only a recent optimization; before that each device got
> its own iommu domain.  It's actually completely configurable on the
> qemu command line which devices get their own iommu and which share.
> The default optimizes the number of domains (one) and thus the number of
> mapping callbacks since we pin the entire guest.
> 
> > On POWER, we have a different view of things where domains/groups are
> > defined to be the smallest granularity we can (down to a single VF) and
> > we give several groups to a guest (ie we avoid sharing the iommu in most
> > cases)
> > 
> > This is driven by the HW design but that design is itself driven by the
> > idea that the domains/group are also error isolation groups and we don't
> > want to take all of the IOs of a guest down if one adapter in that guest
> > is having an error.
> > 
> > The x86 domains are conceptually different as they are about sharing the
> > iommu page tables with the clear long term intent of then sharing those
> > page tables with the guest CPU own. We aren't going in that direction
> > (at this point at least) on POWER..
> 
> Yes and no.  The x86 domains are pretty flexible and used a few
> different ways.  On the host we do dynamic DMA with a domain per device,
> mapping only the inflight DMA ranges.  In order to achieve the
> transparent device assignment model, we have to flip that around and map
> the entire guest.  As noted, we can continue to use separate domains for
> this, but since each maps the entire guest, it doesn't add a lot of
> value and uses more resources and requires more mapping callbacks (and
> x86 doesn't have the best error containment anyway).  If we had a well
> supported IOMMU model that we could adapt for pvDMA, then it would make
> sense to keep each device in its own domain again.  Thanks,

Could you have a PV IOMMU (in the guest) that would set up those
maps?

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-02 21:29             ` Konrad Rzeszutek Wilk
@ 2011-08-03  1:02               ` Alex Williamson
  -1 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-08-03  1:02 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Benjamin Herrenschmidt, Avi Kivity, kvm, Anthony Liguori,
	David Gibson, Paul Mackerras, Alexey Kardashevskiy, linux-pci,
	linuxppc-dev

On Tue, 2011-08-02 at 17:29 -0400, Konrad Rzeszutek Wilk wrote:
> On Tue, Aug 02, 2011 at 09:34:58AM -0600, Alex Williamson wrote:
> > On Tue, 2011-08-02 at 22:58 +1000, Benjamin Herrenschmidt wrote:
> > > 
> > > Don't worry, it took me a while to get my head around the HW :-) SR-IOV
> > > VFs will generally not have limitations like that no, but on the other
> > > hand, they -will- still require 1 VF = 1 group, ie, you won't be able to
> > > take a bunch of VFs and put them in the same 'domain'.
> > > 
> > > I think the main deal is that VFIO/qemu sees "domains" as "guests" and
> > > tries to put all devices for a given guest into a "domain".
> > 
> > Actually, that's only a recent optimization; before that each device got
> > its own iommu domain.  It's actually completely configurable on the
> > qemu command line which devices get their own iommu and which share.
> > The default optimizes the number of domains (one) and thus the number of
> > mapping callbacks since we pin the entire guest.
> > 
> > > On POWER, we have a different view of things where domains/groups are
> > > defined to be the smallest granularity we can (down to a single VF) and
> > > we give several groups to a guest (ie we avoid sharing the iommu in most
> > > cases)
> > > 
> > > This is driven by the HW design but that design is itself driven by the
> > > idea that the domains/group are also error isolation groups and we don't
> > > want to take all of the IOs of a guest down if one adapter in that guest
> > > is having an error.
> > > 
> > > The x86 domains are conceptually different as they are about sharing the
> > > iommu page tables with the clear long term intent of then sharing those
> > > page tables with the guest CPU own. We aren't going in that direction
> > > (at this point at least) on POWER..
> > 
> > Yes and no.  The x86 domains are pretty flexible and used a few
> > different ways.  On the host we do dynamic DMA with a domain per device,
> > mapping only the inflight DMA ranges.  In order to achieve the
> > transparent device assignment model, we have to flip that around and map
> > the entire guest.  As noted, we can continue to use separate domains for
> > this, but since each maps the entire guest, it doesn't add a lot of
> > value and uses more resources and requires more mapping callbacks (and
> > x86 doesn't have the best error containment anyway).  If we had a well
> > supported IOMMU model that we could adapt for pvDMA, then it would make
> > sense to keep each device in its own domain again.  Thanks,
> 
> Could you have a PV IOMMU (in the guest) that would set up those
> maps?

Yep, definitely.  That's effectively what power wants to do.  We could
do it on x86, but as others have noted, the map/unmap interface isn't
tuned to do this at that granularity and our target guest OS audience is
effectively reduced to Linux.  Thanks,

Alex
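
As a toy illustration of the granularity point above (map_range() below is
just a counting stand-in, not the real map/unmap interface):

#include <stdint.h>
#include <stdio.h>

static unsigned long map_calls;

static void map_range(uint64_t iova, uint64_t len)
{
    (void)iova; (void)len;
    map_calls++;                      /* stand-in for one map/unmap request */
}

int main(void)
{
    /* pin-everything model: the whole guest is mapped once up front */
    map_range(0, 4ULL << 30);
    printf("pinned 4G guest: %lu map call(s)\n", map_calls);

    /* paravirt iommu model: one call per page the guest maps for DMA */
    map_calls = 0;
    for (uint64_t iova = 0; iova < (256ULL << 20); iova += 4096)
        map_range(iova, 4096);
    printf("256M of live DMA, page at a time: %lu map calls\n", map_calls);
    return 0;
}

That difference in call pattern is why an interface tuned for the first model
is a poor fit for the second.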

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-02 18:35         ` Alex Williamson
  (?)
@ 2011-08-03  2:04           ` David Gibson
  -1 siblings, 0 replies; 322+ messages in thread
From: David Gibson @ 2011-08-03  2:04 UTC (permalink / raw)
  To: Alex Williamson
  Cc: aafabbri, Alexey Kardashevskiy, kvm, Paul Mackerras, qemu-devel,
	chrisw, iommu, Anthony Liguori, linux-pci, linuxppc-dev, benve

On Tue, Aug 02, 2011 at 12:35:19PM -0600, Alex Williamson wrote:
> On Tue, 2011-08-02 at 12:14 -0600, Alex Williamson wrote:
> > On Tue, 2011-08-02 at 18:28 +1000, David Gibson wrote:
> > > On Sat, Jul 30, 2011 at 12:20:08PM -0600, Alex Williamson wrote:
> > > > On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote:
> > > [snip]
> > > > On x86, the USB controllers don't typically live behind a PCIe-to-PCI
> > > > bridge, so don't suffer the source identifier problem, but they do often
> > > > share an interrupt.  But even then, we can count on most modern devices
> > > > supporting PCI2.3, and thus the DisINTx feature, which allows us to
> > > > share interrupts.  In any case, yes, it's more rare but we need to know
> > > > how to handle devices behind PCI bridges.  However I disagree that we
> > > > need to assign all the devices behind such a bridge to the guest.
> > > > There's a difference between removing the device from the host and
> > > > exposing the device to the guest.
> > > 
> > > I think you're arguing only over details of what words to use for
> > > what, rather than anything of substance here.  The point is that an
> > > entire partitionable group must be assigned to "host" (in which case
> > > kernel drivers may bind to it) or to a particular guest partition (or
> > > at least to a single UID on the host).  Which of the assigned devices
> > > the partition actually uses is another matter of course, as is at
> > > exactly which level they become "de-exposed" if you don't want to use
> > > all of them.
> > 
> > Well first we need to define what a partitionable group is, whether it's
> > based on hardware requirements or user policy.  And while I agree that
> > we need unique ownership of a partition, I disagree that qemu is
> > necessarily the owner of the entire partition vs individual devices.
> 
> Sorry, I didn't intend to have such circular logic.  "... I disagree
> that qemu is necessarily the owner of the entire partition vs granted
> access to devices within the partition".  Thanks,

I still don't understand the distinction you're making.  We're saying
the group is "owned" by a given user or guest in the sense that no-one
else may use anything in the group (including host drivers).  At that
point none, some or all of the devices in the group may actually be
used by the guest.

You seem to be making a distinction between "owned by" and "assigned
to" and "used by" and I really don't see what it is.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-03  2:04           ` David Gibson
  (?)
@ 2011-08-03  3:44             ` Alex Williamson
  -1 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-08-03  3:44 UTC (permalink / raw)
  To: David Gibson
  Cc: chrisw, Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	qemu-devel, aafabbri, iommu, linuxppc-dev, benve

On Wed, 2011-08-03 at 12:04 +1000, David Gibson wrote:
> On Tue, Aug 02, 2011 at 12:35:19PM -0600, Alex Williamson wrote:
> > On Tue, 2011-08-02 at 12:14 -0600, Alex Williamson wrote:
> > > On Tue, 2011-08-02 at 18:28 +1000, David Gibson wrote:
> > > > On Sat, Jul 30, 2011 at 12:20:08PM -0600, Alex Williamson wrote:
> > > > > On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote:
> > > > [snip]
> > > > > On x86, the USB controllers don't typically live behind a PCIe-to-PCI
> > > > > bridge, so don't suffer the source identifier problem, but they do often
> > > > > share an interrupt.  But even then, we can count on most modern devices
> > > > > supporting PCI2.3, and thus the DisINTx feature, which allows us to
> > > > > share interrupts.  In any case, yes, it's more rare but we need to know
> > > > > how to handle devices behind PCI bridges.  However I disagree that we
> > > > > need to assign all the devices behind such a bridge to the guest.
> > > > > There's a difference between removing the device from the host and
> > > > > exposing the device to the guest.
> > > > 
> > > > I think you're arguing only over details of what words to use for
> > > > what, rather than anything of substance here.  The point is that an
> > > > entire partitionable group must be assigned to "host" (in which case
> > > > kernel drivers may bind to it) or to a particular guest partition (or
> > > > at least to a single UID on the host).  Which of the assigned devices
> > > > the partition actually uses is another matter of course, as is at
> > > > exactly which level they become "de-exposed" if you don't want to use
> > > > all of them.
> > > 
> > > Well first we need to define what a partitionable group is, whether it's
> > > based on hardware requirements or user policy.  And while I agree that
> > > we need unique ownership of a partition, I disagree that qemu is
> > > necessarily the owner of the entire partition vs individual devices.
> > 
> > Sorry, I didn't intend to have such circular logic.  "... I disagree
> > that qemu is necessarily the owner of the entire partition vs granted
> > access to devices within the partition".  Thanks,
> 
> I still don't understand the distinction you're making.  We're saying
> the group is "owned" by a given user or guest in the sense that no-one
> else may use anything in the group (including host drivers).  At that
> point none, some or all of the devices in the group may actually be
> used by the guest.
> 
> You seem to be making a distinction between "owned by" and "assigned
> to" and "used by" and I really don't see what it is.

How does a qemu instance that uses none of the devices in a group still
own that group?  Aren't we at that point free to move the group to a
different qemu instance or return ownership to the host?  Who does that?
In my mental model, there's an intermediary that "owns" the group and
just as kernel drivers bind to devices when the host owns the group,
qemu is a userspace device driver that binds to sets of devices when the
intermediary owns it.  Obviously I'm thinking libvirt, but it doesn't
have to be.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-03  3:44             ` Alex Williamson
@ 2011-08-04  0:39               ` David Gibson
  -1 siblings, 0 replies; 322+ messages in thread
From: David Gibson @ 2011-08-04  0:39 UTC (permalink / raw)
  To: Alex Williamson
  Cc: aafabbri, Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	qemu-devel, chrisw, iommu, Anthony Liguori, linuxppc-dev, benve

On Tue, Aug 02, 2011 at 09:44:49PM -0600, Alex Williamson wrote:
> On Wed, 2011-08-03 at 12:04 +1000, David Gibson wrote:
> > On Tue, Aug 02, 2011 at 12:35:19PM -0600, Alex Williamson wrote:
> > > On Tue, 2011-08-02 at 12:14 -0600, Alex Williamson wrote:
> > > > On Tue, 2011-08-02 at 18:28 +1000, David Gibson wrote:
> > > > > On Sat, Jul 30, 2011 at 12:20:08PM -0600, Alex Williamson wrote:
> > > > > > On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote:
> > > > > [snip]
> > > > > > On x86, the USB controllers don't typically live behind a PCIe-to-PCI
> > > > > > bridge, so don't suffer the source identifier problem, but they do often
> > > > > > share an interrupt.  But even then, we can count on most modern devices
> > > > > > supporting PCI2.3, and thus the DisINTx feature, which allows us to
> > > > > > share interrupts.  In any case, yes, it's more rare but we need to know
> > > > > > how to handle devices behind PCI bridges.  However I disagree that we
> > > > > > need to assign all the devices behind such a bridge to the guest.
> > > > > > There's a difference between removing the device from the host and
> > > > > > exposing the device to the guest.
> > > > > 
> > > > > I think you're arguing only over details of what words to use for
> > > > > what, rather than anything of substance here.  The point is that an
> > > > > entire partitionable group must be assigned to "host" (in which case
> > > > > kernel drivers may bind to it) or to a particular guest partition (or
> > > > > at least to a single UID on the host).  Which of the assigned devices
> > > > > the partition actually uses is another matter of course, as is at
> > > > > exactly which level they become "de-exposed" if you don't want to use
> > > > > all of them.
> > > > 
> > > > Well first we need to define what a partitionable group is, whether it's
> > > > based on hardware requirements or user policy.  And while I agree that
> > > > we need unique ownership of a partition, I disagree that qemu is
> > > > necessarily the owner of the entire partition vs individual devices.
> > > 
> > > Sorry, I didn't intend to have such circular logic.  "... I disagree
> > > that qemu is necessarily the owner of the entire partition vs granted
> > > access to devices within the partition".  Thanks,
> > 
> > I still don't understand the distinction you're making.  We're saying
> > the group is "owned" by a given user or guest in the sense that no-one
> > else may use anything in the group (including host drivers).  At that
> > point none, some or all of the devices in the group may actually be
> > used by the guest.
> > 
> > You seem to be making a distinction between "owned by" and "assigned
> > to" and "used by" and I really don't see what it is.
> 
> How does a qemu instance that uses none of the devices in a group still
> own that group?

?? In the same way that you still own a file you don't have open..?

>  Aren't we at that point free to move the group to a
> different qemu instance or return ownership to the host?

Of course.  But until you actually do that, the group is still
notionally owned by the guest.

>  Who does that?

The admin.  Possibly by poking sysfs, or possibly by frobbing some
character device, or maybe something else.  Naturally libvirt or
whatever could also do this.

> In my mental model, there's an intermediary that "owns" the group and
> just as kernel drivers bind to devices when the host owns the group,
> qemu is a userspace device driver that binds to sets of devices when the
> intermediary owns it.  Obviously I'm thinking libvirt, but it doesn't
> have to be.  Thanks,

Well sure, but I really don't see how such an intermediary fits into
the kernel's model of ownership.

So, first, take a step back and look at what sort of entities can
"own" a group (or device or whatever).  I notice that when I've said
"owned by the guest" you seem to have read this as "owned by qemu"
which is not necessarily the same thing.

What I had in mind is that each group is either owned by "host", in
which case host kernel drivers can bind to it, or it's in "guest mode"
in which case it has a user, group and mode and can be bound by user
drivers (and therefore guests) with the right permission.  From the
kernel's perspective there is therefore no distinction between "owned
by qemu" and "owned by libvirt".


-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-07-29 23:58 ` Benjamin Herrenschmidt
@ 2011-08-04 10:27   ` Joerg Roedel
  -1 siblings, 0 replies; 322+ messages in thread
From: Joerg Roedel @ 2011-08-04 10:27 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: kvm, Anthony Liguori, Alex Williamson, David Gibson,
	Paul Mackerras, Alexey Kardashevskiy, linux-pci, linuxppc-dev

Hi Ben,

thanks for your detailed introduction to the requirements for POWER. It's
good to know that the granularity problem is not x86-only.

On Sat, Jul 30, 2011 at 09:58:53AM +1000, Benjamin Herrenschmidt wrote:
> In IBM POWER land, we call this a "partitionable endpoint" (the term
> "endpoint" here is historic, such a PE can be made of several PCIe
> "endpoints"). I think "partitionable" is a pretty good name tho to
> represent the constraints, so I'll call this a "partitionable group"
> from now on.

On x86 this is mostly an issue of the IOMMU and which set of devices use
the same request-id. I used to call that an alias-group because the
devices have a request-id alias to the pci-bridge.

> - The -minimum- granularity of pass-through is not always a single
> device and not always under SW control

Correct.
 
> - Having a magic heuristic in libvirt to figure out those constraints is
> WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
> knowledge of PCI resource management and getting it wrong in many many
> cases, something that took years to fix essentially by ripping it all
> out. This is kernel knowledge and thus we need the kernel to expose in a
> way or another what those constraints are, what those "partitionable
> groups" are.

I agree. Managing the ownership of a group should be done in the kernel.
Doing this in userspace is just too dangerous.

The problem to be solved here is how to present these PEs inside the
kernel and to userspace. I thought a bit about making this visible
through the iommu-api for in-kernel users. That is probably the most
logical place.

For userspace I would like to propose a new device attribute in sysfs.
This attribute contains the group number. All devices with the same
group number belong to the same PE. Libvirt needs to scan the whole
device tree to build the groups but that is probably not a big deal.
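
A sketch of the kind of scan that would imply, with a made-up attribute name
("devgroup" below is not an existing sysfs file):

#include <glob.h>
#include <stdio.h>

int main(void)
{
    glob_t g;

    /* hypothetical per-device attribute holding the group number */
    if (glob("/sys/bus/pci/devices/*/devgroup", 0, NULL, &g) != 0)
        return 0;                        /* no such attribute (yet) */

    for (size_t i = 0; i < g.gl_pathc; i++) {
        FILE *f = fopen(g.gl_pathv[i], "r");
        int group;

        if (f && fscanf(f, "%d", &group) == 1)
            printf("%s -> group %d\n", g.gl_pathv[i], group);
        if (f)
            fclose(f);
    }
    globfree(&g);
    return 0;
}

Devices that print the same number would land in the same PE.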


	Joerg

> 
> - That does -not- mean that we cannot specify for each individual device
> within such a group where we want to put it in qemu (what devfn etc...).
> As long as there is a clear understanding that the "ownership" of the
> device goes with the group, this is somewhat orthogonal to how they are
> represented in qemu. (Not completely... if the iommu is exposed to the
> guest ,via paravirt for example, some of these constraints must be
> exposed but I'll talk about that more later).
> 
> The interface currently proposed for VFIO (and associated uiommu)
> doesn't handle that problem at all. Instead, it is entirely centered
> around a specific "feature" of the VTd iommu's for creating arbitrary
> domains with arbitrary devices (tho those devices -do- have the same
> constraints exposed above, don't try to put 2 legacy PCI devices behind
> the same bridge into 2 different domains !), but the API totally ignores
> the problem, leaves it to libvirt "magic foo" and focuses on something
> that is both quite secondary in the grand scheme of things, and quite
> x86 VTd specific in the implementation and API definition.
> 
> Now, I'm not saying these programmable iommu domains aren't a nice
> feature and that we shouldn't exploit them when available, but as it is,
> it is too much a central part of the API.
> 
> I'll talk a little bit more about recent POWER iommu's here to
> illustrate where I'm coming from with my idea of groups:
> 
> On p7ioc (the IO chip used on recent P7 machines), there -is- a concept
> of domain and a per-RID filtering. However it differs from VTd in a few
> ways:
> 
> The "domains" (aka PEs) encompass more than just an iommu filtering
> scheme. The MMIO space and PIO space are also segmented, and those
> segments assigned to domains. Interrupts (well, MSI ports at least) are
> assigned to domains. Inbound PCIe error messages are targeted to
> domains, etc...
> 
> Basically, the PEs provide a very strong isolation feature which
> includes errors, and has the ability to immediately "isolate" a PE on
> the first occurrence of an error. For example, if an inbound PCIe error
> is signaled by a device on a PE or such a device does a DMA to a
> non-authorized address, the whole PE gets into error state. All
> subsequent stores (both DMA and MMIO) are swallowed and reads return all
> 1's, interrupts are blocked. This is designed to prevent any propagation
> of bad data, which is a very important feature in large high reliability
> systems.
> 
> Software then has the ability to selectively turn back on MMIO and/or
> DMA, perform diagnostics, reset devices etc...
> 
> Because the domains encompass more than just DMA, but also segment the
> MMIO space, it is not practical at all to dynamically reconfigure them
> at runtime to "move" devices into domains. The firmware or early kernel
> code (it depends) will assign devices BARs using an algorithm that keeps
> them within PE segment boundaries, etc....
> 
> Additionally (and this is indeed a "restriction" compared to VTd, though
> I expect our future IO chips to lift it to some extent), PE don't get
> separate DMA address spaces. There is one 64-bit DMA address space per
> PCI host bridge, and it is 'segmented' with each segment being assigned
> to a PE. Due to the way PE assignment works in hardware, it is not
> practical to make several devices share a segment unless they are on the
> same bus. Also the resulting limit in the amount of 32-bit DMA space a
> device can access means that it's impractical to put too many devices in
> a PE anyways. (This is clearly designed for paravirt iommu, I'll talk
> more about that later).
> 
> The above essentially extends the granularity requirement (or rather is
> another factor defining what the granularity of partitionable entities
> is). You can think of it as "pre-existing" domains.
> 
> I believe the way to solve that is to introduce a kernel interface to
> expose those "partitionable entities" to userspace. In addition, it
> occurs to me that the ability to manipulate VTd domains essentially
> boils down to manipulating those groups (creating larger ones with
> individual components).
> 
> I like the idea of defining / playing with those groups statically
> (using a command line tool or sysfs, possibly having a config file
> defining them in a persistent way) rather than having their lifetime
> tied to a uiommu file descriptor.
> 
> It also makes it a LOT easier to have a channel to manipulate
> platform/arch specific attributes of those domains if any.
> 
> So we could define an API or representation in sysfs that exposes what
> the partitionable entities are, and we may add to it an API to
> manipulate them. But we don't have to and I'm happy to keep the
> additional SW grouping you can do on VTd as a sepparate "add-on" API
> (tho I don't like at all the way it works with uiommu). However, qemu
> needs to know what the grouping is regardless of the domains, and it's
> not nice if it has to manipulate two different concepts here so
> eventually those "partitionable entities" from a qemu standpoint must
> look like domains.
> 
> My main point is that I don't want the "knowledge" here to be in libvirt
> or qemu. In fact, I want to be able to do something as simple as passing
> a reference to a PE to qemu (sysfs path ?) and have it just pickup all
> the devices in there and expose them to the guest.
> 
> This can be done in a way that isn't PCI specific as well (the
> definition of the groups and what is grouped would would obviously be
> somewhat bus specific and handled by platform code in the kernel).
> 
> Maybe something like /sys/devgroups ? This probably warrants involving
> more kernel people into the discussion.
> 
> * IOMMU
> 
> Now more on iommu. I've described I think in enough details how ours
> work, there are others, I don't know what freescale or ARM are doing,
> sparc doesn't quite work like VTd either, etc...
> 
> The main problem isn't that much the mechanics of the iommu but really
> how it's exposed (or not) to guests.
> 
> VFIO here is basically designed for one and only one thing: expose the
> entire guest physical address space to the device more/less 1:1.
> 
> This means:
> 
>   - It only works with iommu's that provide complete DMA address spaces
> to devices. Won't work with a single 'segmented' address space like we
> have on POWER.
> 
>   - It requires the guest to be pinned. Pass-through -> no more swap
> 
>   - The guest cannot make use of the iommu to deal with 32-bit DMA
> devices, thus a guest with more than a few G of RAM (I don't know the
> exact limit on x86, depends on your IO hole I suppose), and you end up
> back to swiotlb & bounce buffering.
> 
>   - It doesn't work for POWER server anyways because of our need to
> provide a paravirt iommu interface to the guest since that's how pHyp
> works today and how existing OSes expect to operate.
> 
> Now some of this can be fixed with tweaks, and we've started doing it
> (we have a working pass-through using VFIO, forgot to mention that, it's
> just that we don't like what we had to do to get there).
> 
> Basically, what we do today is:
> 
> - We add an ioctl to VFIO to expose to qemu the segment information. IE.
> What is the DMA address and size of the DMA "window" usable for a given
> device. This is a tweak, that should really be handled at the "domain"
> level.
> 
> That current hack won't work well if two devices share an iommu. Note
> that we have an additional constraint here due to our paravirt
> interfaces (specificed in PAPR) which is that PE domains must have a
> common parent. Basically, pHyp makes them look like a PCIe host bridge
> per domain in the guest. I think that's a pretty good idea and qemu
> might want to do the same.
> 
> - We hack out the currently unconditional mapping of the entire guest
> space in the iommu. Something will have to be done to "decide" whether
> to do that or not ... qemu argument -> ioctl ?
> 
> - We hook up the paravirt call to insert/remove a translation from the
> iommu to the VFIO map/unmap ioctl's.
> 
> This limps along but it's not great. Some of the problems are:
> 
> - I've already mentioned, the domain problem again :-) 
> 
> - Performance sucks of course, the vfio map ioctl wasn't mean for that
> and has quite a bit of overhead. However we'll want to do the paravirt
> call directly in the kernel eventually ...
> 
>   - ... which isn't trivial to get back to our underlying arch specific
> iommu object from there. We'll probably need a set of arch specific
> "sideband" ioctl's to "register" our paravirt iommu "bus numbers" and
> link them to the real thing kernel-side.
> 
> - PAPR (the specification of our paravirt interface and the expectation
> of current OSes) wants iommu pages to be 4k by default, regardless of
> the kernel host page size, which makes things a bit tricky since our
> enterprise host kernels have a 64k base page size. Additionally, we have
> new PAPR interfaces that we want to exploit, to allow the guest to
> create secondary iommu segments (in 64-bit space), which can be used
> (under guest control) to do things like map the entire guest (here it
> is :-) or use larger iommu page sizes (if permitted by the host kernel,
> in our case we could allow 64k iommu page size with a 64k host kernel).
> 
> The above means we need arch specific APIs. So arch specific vfio
> ioctl's, either that or kvm ones going to vfio or something ... the
> current structure of vfio/kvm interaction doesn't make it easy.
> 
> * IO space
> 
> On most (if not all) non-x86 archs, each PCI host bridge provide a
> completely separate PCI address space. Qemu doesn't deal with that very
> well. For MMIO it can be handled since those PCI address spaces are
> "remapped" holes in the main CPU address space so devices can be
> registered by using BAR + offset of that window in qemu MMIO mapping.
> 
> For PIO things get nasty. We have totally separate PIO spaces and qemu
> doesn't seem to like that. We can try to play the offset trick as well,
> we haven't tried yet, but basically that's another one to fix. Not a
> huge deal I suppose but heh ...
> 
> Also our next generation chipset may drop support for PIO completely.
> 
> On the other hand, because PIO is just a special range of MMIO for us,
> we can do normal pass-through on it and don't need any of the emulation
> done qemu.
> 
>   * MMIO constraints
> 
> The QEMU side VFIO code hard wires various constraints that are entirely
> based on various requirements you decided you have on x86 but don't
> necessarily apply to us :-)
> 
> Due to our paravirt nature, we don't need to masquerade the MSI-X table
> for example. At all. If the guest configures crap into it, too bad, it
> can only shoot itself in the foot since the host bridge enforce
> validation anyways as I explained earlier. Because it's all paravirt, we
> don't need to "translate" the interrupt vectors & addresses, the guest
> will call hyercalls to configure things anyways.
> 
> We don't need to prevent MMIO pass-through for small BARs at all. This
> should be some kind of capability or flag passed by the arch. Our
> segmentation of the MMIO domain means that we can give entire segments
> to the guest and let it access anything in there (those segments are a
> multiple of the page size always). Worst case it will access outside of
> a device BAR within a segment and will cause the PE to go into error
> state, shooting itself in the foot, there is no risk of side effect
> outside of the guest boundaries.
> 
> In fact, we don't even need to emulate BAR sizing etc... in theory. Our
> paravirt guests expect the BARs to have been already allocated for them
> by the firmware and will pick up the addresses from the device-tree :-)
> 
> Today we use a "hack", putting all 0's in there and triggering the linux
> code path to reassign unassigned resources (which will use BAR
> emulation) but that's not what we are -supposed- to do. Not a big deal
> and having the emulation there won't -hurt- us, it's just that we don't
> really need any of it.
> 
> We have a small issue with ROMs. Our current KVM only works with huge
> pages for guest memory but that is being fixed. So the way qemu maps the
> ROM copy into the guest address space doesn't work. It might be handy
> anyways to have a way for qemu to use MMIO emulation for ROM access as a
> fallback. I'll look into it.
> 
>   * EEH
> 
> This is the name of those fancy error handling & isolation features I
> mentioned earlier. To some extent it's a superset of AER, but we don't
> generally expose AER to guests (or even the host), it's swallowed by
> firmware into something else that provides a superset (well mostly) of
> the AER information, and allow us to do those additional things like
> isolating/de-isolating, reset control etc...
> 
> Here too, we'll need arch specific APIs through VFIO. Not necessarily a
> huge deal, I mention it for completeness.
> 
>    * Misc
> 
> There's lots of small bits and pieces... in no special order:
> 
>  - netlink ? WTF ! Seriously, we don't need a hybrid API with a bit of
> netlink and a bit of ioctl's ... it's not like there's something
> fundamentally  better for netlink vs. ioctl... it really depends what
> you are doing, and in this case I fail to see what netlink brings you
> other than bloat and more stupid userspace library deps.
> 
>  - I don't like too much the fact that VFIO provides yet another
> different API to do what we already have at least 2 kernel APIs for, ie,
> BAR mapping and config space access. At least it should be better at
> using the backend infrastructure of the 2 others (sysfs & procfs). I
> understand it wants to filter in some case (config space) and -maybe-
> yet another API is the right way to go but allow me to have my doubts.
> 
> One thing I thought about but you don't seem to like it ... was to use
> the need to represent the partitionable entity as groups in sysfs that I
> talked about earlier. Those could have per-device subdirs with the usual
> config & resource files, same semantic as the ones in the real device,
> but when accessed via the group they get filtering. I might or might not
> be practical in the end, tbd, but it would allow apps using a slightly
> modified libpci for example to exploit some of this.
> 
>  - The qemu vfio code hooks directly into ioapic ... of course that
> won't fly with anything !x86
> 
>  - The various "objects" dealt with here, -especially- interrupts and
> iommu, need a better in-kernel API so that fast in-kernel emulation can
> take over from qemu based emulation. The way we need to do some of this
> on POWER differs from x86. We can elaborate later, it's not necessarily
> a killer either but essentially we'll take the bulk of interrupt
> handling away from VFIO to the point where it won't see any of it at
> all.
> 
>   - Non-PCI devices. That's a hot topic for embedded. I think the vast
> majority here is platform devices. There's quite a bit of vfio that
> isn't intrinsically PCI specific. We could have an in-kernel platform
> driver like we have an in-kernel PCI driver to attach to. The mapping of
> resources to userspace is rather generic, as goes for interrupts. I
> don't know whether that idea can be pushed much further, I don't have
> the bandwidth to look into it much at this point, but maybe it would be
> possible to refactor vfio a bit to better separate what is PCI specific
> to what is not. The idea would be to move the PCI specific bits to
> inside the "placeholder" PCI driver, and same goes for platform bits.
> "generic" ioctl's go to VFIO core, anything that doesn't handle, it
> passes them to the driver which allows the PCI one to handle things
> differently than the platform one, maybe an amba one while at it,
> etc.... just a thought, I haven't gone into the details at all.
> 
> I think that's all I had on my plate today, it's a long enough email
> anyway :-) Anthony suggested we put that on a wiki, I'm a bit
> wiki-disabled myself so he proposed to pickup my email and do that. We
> should probably discuss the various items in here separately as
> different threads to avoid too much confusion.
> 
> One other thing we should do on our side is publish somewhere our
> current hacks to get you an idea of where we are going and what we had
> to do (code speaks more than words). We'll try to do that asap, possibly
> next week.
> 
> Note that I'll be on/off the next few weeks, travelling and doing
> bringup. So expect latency in my replies.
> 
> Cheers,
> Ben.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Nothing great was ever achieved without enthusiasm.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-07-30 18:20   ` Alex Williamson
@ 2011-08-04 10:35     ` Joerg Roedel
  -1 siblings, 0 replies; 322+ messages in thread
From: Joerg Roedel @ 2011-08-04 10:35 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Benjamin Herrenschmidt, kvm, Anthony Liguori, David Gibson,
	Paul Mackerras, Alexey Kardashevskiy, linux-pci, linuxppc-dev,
	iommu, benve, aafabbri, chrisw, qemu-devel

On Sat, Jul 30, 2011 at 12:20:08PM -0600, Alex Williamson wrote:
> On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote:
> > - The -minimum- granularity of pass-through is not always a single
> > device and not always under SW control
> 
> But IMHO, we need to preserve the granularity of exposing a device to a
> guest as a single device.  That might mean some devices are held hostage
> by an agent on the host.

That's true. There is a difference between unassigning a group from the
host and making single devices in that PE visible to the guest. But we
need to make sure that no device in a PE is used by the host while at
least one device is assigned to a guest.

Unlike the other proposals to handle this in libvirt, I think this
belongs in the kernel. Doing this in userspace may break the entire
system if done wrong.

For example, if one device from a PE is assigned to a guest while
another one is not unbound from its host driver, the driver may get very
confused when DMA just stops working. This may crash the entire system
or lead to silent data corruption in the guest. The behavior is
basically undefined then. The kernel must not allow that.
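
To illustrate the kind of check the kernel could enforce, a rough sketch
(the struct device_group type, group_for_each_dev() helper and
is_assignment_stub() test are all made up, nothing like this exists
today):

	/* Sketch only: refuse to hand a group to a guest unless every
	 * device in it is either unbound or bound to the pass-through
	 * stub driver.  All helpers here are hypothetical. */
	static int group_ready_for_guest(struct device_group *grp)
	{
		struct device *dev;

		group_for_each_dev(grp, dev) {
			if (dev->driver && !is_assignment_stub(dev->driver))
				return -EBUSY; /* still owned by the host */
		}
		return 0;
	}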


	Joerg

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-01 20:27     ` Alex Williamson
@ 2011-08-04 10:41       ` Joerg Roedel
  -1 siblings, 0 replies; 322+ messages in thread
From: Joerg Roedel @ 2011-08-04 10:41 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Avi Kivity, Benjamin Herrenschmidt, kvm, Anthony Liguori,
	David Gibson, Paul Mackerras, Alexey Kardashevskiy, linux-pci,
	linuxppc-dev

On Mon, Aug 01, 2011 at 02:27:36PM -0600, Alex Williamson wrote:
> It's not clear to me how we could skip it.  With VT-d, we'd have to
> implement an emulated interrupt remapper and hope that the guest picks
> unused indexes in the host interrupt remapping table before it could do
> anything useful with direct access to the MSI-X table.  Maybe AMD IOMMU
> makes this easier?

AMD IOMMU provides remapping tables per-device, not a global one.
But that does not make direct guest access to the MSI-X table safe. The
table entry contains the interrupt type and the vector, which is used
as an index into the remapping table by the IOMMU. So when the guest
writes into its MSI-X table, the remapping table in the host needs to
be updated too.
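
For reference, the per-vector MSI-X table entry layout (as defined by
the PCI spec) that a guest would be writing into if the table were
passed through directly; the message data word is what carries the
vector that the interrupt remapping hardware uses as its index:

	#include <stdint.h>

	/* One MSI-X table entry: 16 bytes per vector. */
	struct msix_table_entry {
		uint32_t msg_addr_lo;  /* message address, low 32 bits  */
		uint32_t msg_addr_hi;  /* message address, high 32 bits */
		uint32_t msg_data;     /* vector/index looked up by the
		                        * interrupt remapping hardware  */
		uint32_t vector_ctrl;  /* bit 0: per-vector mask        */
	};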

	Joerg

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-04 10:41       ` Joerg Roedel
@ 2011-08-05 10:26         ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 322+ messages in thread
From: Benjamin Herrenschmidt @ 2011-08-05 10:26 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Alex Williamson, Avi Kivity, kvm, Anthony Liguori, David Gibson,
	Paul Mackerras, Alexey Kardashevskiy, linux-pci, linuxppc-dev

On Thu, 2011-08-04 at 12:41 +0200, Joerg Roedel wrote:
> On Mon, Aug 01, 2011 at 02:27:36PM -0600, Alex Williamson wrote:
> > It's not clear to me how we could skip it.  With VT-d, we'd have to
> > implement an emulated interrupt remapper and hope that the guest picks
> > unused indexes in the host interrupt remapping table before it could do
> > anything useful with direct access to the MSI-X table.  Maybe AMD IOMMU
> > makes this easier?
> 
> AMD IOMMU provides remapping tables per-device, and not a global one.
> But that does not make direct guest-access to the MSI-X table safe. The
> table contains the table contains the interrupt-type and the vector
> which is used as an index into the remapping table by the IOMMU. So when
> the guest writes into its MSI-X table the remapping-table in the host
> needs to be updated too.

Right, you need paravirt to avoid filtering :-)

I.e. the problem is twofold:

 - Getting the right value in the table / remapper so things work
(paravirt)

 - Protecting against the guest somehow managing to change the value in
the table (either directly or via a backdoor access to its own config
space).

The latter for us comes from the HW PE filtering of the MSI transactions.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-04 10:27   ` Joerg Roedel
@ 2011-08-05 10:42     ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 322+ messages in thread
From: Benjamin Herrenschmidt @ 2011-08-05 10:42 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: kvm, Anthony Liguori, Alex Williamson, David Gibson,
	Paul Mackerras, Alexey Kardashevskiy, linux-pci, linuxppc-dev

On Thu, 2011-08-04 at 12:27 +0200, Joerg Roedel wrote:
> Hi Ben,
> 
> thanks for your detailed introduction to the requirements for POWER. Its
> good to know that the granularity problem is not x86-only.

I'm happy to see your reply :-) I had the feeling I was a bit alone
here...

> On Sat, Jul 30, 2011 at 09:58:53AM +1000, Benjamin Herrenschmidt wrote:
> > In IBM POWER land, we call this a "partitionable endpoint" (the term
> > "endpoint" here is historic, such a PE can be made of several PCIe
> > "endpoints"). I think "partitionable" is a pretty good name tho to
> > represent the constraints, so I'll call this a "partitionable group"
> > from now on.
> 
> On x86 this is mostly an issue of the IOMMU and which set of devices use
> the same request-id. I used to call that an alias-group because the
> devices have a request-id alias to the pci-bridge.

Right. In fact to try to clarify the problem for everybody, I think we
can distinguish two different classes of "constraints" that can
influence the grouping of devices:

 1- Hard constraints. These are typically devices using the same RID or
where the RID cannot be reliably guaranteed (the latter is the case with
some PCIe-to-PCI-X bridges which will take ownership of "some"
transactions, such as split transactions, but not all). Devices like
that must be in the same domain. This is where PowerPC adds to what x86
does today the concept that the domains are pre-existing, since we use
the RID for error isolation & MMIO segmenting as well, so we need to
create those domains at boot time. (A rough sketch of such RID-alias
grouping follows after this list.)

 2- Softer constraints. Those constraints derive from the fact that not
applying them risks enabling the guest to create side effects outside of
its "sandbox". To some extent, there can be "degrees" of badness between
the various things that can cause such constraints. Examples are shared
LSIs (since trusting DisINTx can be chancy, see earlier discussions), or
potentially any set of functions in the same device, which can be
problematic due to the possibility of backdoor access to the BARs etc...
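
Here is the promised sketch of how the kernel could compute the hard,
RID-based part of the grouping, assuming the simple rule that everything
below a conventional PCI bus aliases to the bridge that connects it to
PCIe (the helper name is made up, and real bridges have more corner
cases than this):

	#include <linux/pci.h>

	/*
	 * Sketch only: return the device whose requester-ID the IOMMU
	 * will actually see for @pdev.  Conventional PCI devices behind
	 * a PCIe-to-PCI(-X) bridge are seen with the bridge's RID, so
	 * they all land in the same hard group as the bridge.
	 */
	static struct pci_dev *hard_group_owner(struct pci_dev *pdev)
	{
		struct pci_dev *dev = pdev;

		while (!pci_is_pcie(dev) && dev->bus->self)
			dev = dev->bus->self;

		return dev;
	}

Two devices would then share a hard group whenever hard_group_owner()
returns the same bridge (plus, on POWER, whenever the MMIO segmenting
described above already forces them into the same PE).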

Now, what I derive from the discussion we've had so far is that we need
to find a proper fix for #1, but Alex and Avi seem to prefer that #2
remains a matter of libvirt/user doing the right thing (basically
keeping a loaded gun aimed at the user's foot with a very very very
sweet trigger but heh, let's not start a flamewar here :-)

So let's try to find a proper solution for #1 now, and leave #2 alone
for the time being.

Maybe the right option is for x86 to move toward pre-existing domains
like powerpc does, or maybe we can just expose some kind of ID.

Because #1 is a mix of generic constraints (nasty bridges) and very
platform specific ones (whatever capacity limits in our MMIO segmenting
forced us to put two devices in the same hard domain on power), I
believe it's really something the kernel must solve, not libvirt, qemu,
the user or anything else.

I am open to suggestions here. I can easily expose my PE# (it's just a
number) somewhere in sysfs; in fact I'm considering doing it in the PCI
device's sysfs directory, simply because it can/will be useful for other
things such as error reporting, so we could maybe build on that.

The crux for me is really the need for pre-existence of the iommu
domains, as my PEs imply a shared iommu space.

> > - The -minimum- granularity of pass-through is not always a single
> > device and not always under SW control
> 
> Correct.
>  
> > - Having a magic heuristic in libvirt to figure out those constraints is
> > WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
> > knowledge of PCI resource management and getting it wrong in many many
> > cases, something that took years to fix essentially by ripping it all
> > out. This is kernel knowledge and thus we need the kernel to expose in a
> > way or another what those constraints are, what those "partitionable
> > groups" are.
> 
> I agree. Managing the ownership of a group should be done in the kernel.
> Doing this in userspace is just too dangerous.
> 
> The problem to be solved here is how to present these PEs inside the
> kernel and to userspace. I thought a bit about making this visible
> through the iommu-api for in-kernel users. That is probably the most
> logical place.

Ah, you started answering my questions above :-)

We could do what you propose. It depends what we want to do with
domains. Practically speaking, we could make domains pre-existing (with
the ability to group several PEs into larger domains) or we could keep
the concepts different, possibly with the limitation that on powerpc, a
domain == a PE.

I suppose we -could- make arbitrary domains on ppc as well by making the
various PEs' iommus in HW point to the same in-memory table, but that's
a bit nasty in practice due to the way we manage those, and it would to
some extent increase the risk of a failing device/driver stomping on
another one and thus taking it down with itself. I.e. isolation of
errors is an important feature for us.

So I'd rather avoid the whole domain thing for now and keep the
constraint, for powerpc at least, that a domain == a PE, and thus find a
proper way to expose that to qemu/libvirt.

> For userspace I would like to propose a new device attribute in sysfs.
> This attribute contains the group number. All devices with the same
> group number belong to the same PE. Libvirt needs to scan the whole
> device tree to build the groups but that is probably not a big deal.

It's trivial for me to map that to my existing PE number. Should we
define the number space to be within a PCI domain (i.e. a host bridge),
or should it be a global space? In the latter case I can construct them
using domain << 16 | PE# or something like that.
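
To make the "domain << 16 | PE#" idea concrete, a minimal (untested)
sketch of exposing such a global group number as a per-device sysfs
attribute; the attribute name and the pe_number() helper are made up,
only pci_domain_nr(), to_pci_dev() and DEVICE_ATTR() are existing kernel
interfaces:

	#include <linux/kernel.h>
	#include <linux/pci.h>
	#include <linux/device.h>
	#include <linux/stat.h>

	/* Hypothetical: the platform-specific PE number of @pdev. */
	extern u16 pe_number(struct pci_dev *pdev);

	/* Global group id: PCI domain in the high bits, PE# below. */
	static u32 device_group_id(struct pci_dev *pdev)
	{
		return (u32)pci_domain_nr(pdev->bus) << 16 | pe_number(pdev);
	}

	static ssize_t device_group_show(struct device *dev,
					 struct device_attribute *attr,
					 char *buf)
	{
		return sprintf(buf, "0x%08x\n",
			       device_group_id(to_pci_dev(dev)));
	}
	static DEVICE_ATTR(device_group, S_IRUGO, device_group_show, NULL);

The attribute would then be registered (device_create_file()) when the
host bridge code creates the PE, and libvirt/qemu could read it like any
other PCI device attribute.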

Cheers,
Ben.

>	Joerg
> 
> > 
> > - That does -not- mean that we cannot specify for each individual device
> > within such a group where we want to put it in qemu (what devfn etc...).
> > As long as there is a clear understanding that the "ownership" of the
> > device goes with the group, this is somewhat orthogonal to how they are
> > represented in qemu. (Not completely... if the iommu is exposed to the
> > guest ,via paravirt for example, some of these constraints must be
> > exposed but I'll talk about that more later).
> > 
> > The interface currently proposed for VFIO (and associated uiommu)
> > doesn't handle that problem at all. Instead, it is entirely centered
> > around a specific "feature" of the VTd iommu's for creating arbitrary
> > domains with arbitrary devices (tho those devices -do- have the same
> > constraints exposed above, don't try to put 2 legacy PCI devices behind
> > the same bridge into 2 different domains !), but the API totally ignores
> > the problem, leaves it to libvirt "magic foo" and focuses on something
> > that is both quite secondary in the grand scheme of things, and quite
> > x86 VTd specific in the implementation and API definition.
> > 
> > Now, I'm not saying these programmable iommu domains aren't a nice
> > feature and that we shouldn't exploit them when available, but as it is,
> > it is too much a central part of the API.
> > 
> > I'll talk a little bit more about recent POWER iommu's here to
> > illustrate where I'm coming from with my idea of groups:
> > 
> > On p7ioc (the IO chip used on recent P7 machines), there -is- a concept
> > of domain and a per-RID filtering. However it differs from VTd in a few
> > ways:
> > 
> > The "domains" (aka PEs) encompass more than just an iommu filtering
> > scheme. The MMIO space and PIO space are also segmented, and those
> > segments assigned to domains. Interrupts (well, MSI ports at least) are
> > assigned to domains. Inbound PCIe error messages are targeted to
> > domains, etc...
> > 
> > Basically, the PEs provide a very strong isolation feature which
> > includes errors, and has the ability to immediately "isolate" a PE on
> > the first occurence of an error. For example, if an inbound PCIe error
> > is signaled by a device on a PE or such a device does a DMA to a
> > non-authorized address, the whole PE gets into error state. All
> > subsequent stores (both DMA and MMIO) are swallowed and reads return all
> > 1's, interrupts are blocked. This is designed to prevent any propagation
> > of bad data, which is a very important feature in large high reliability
> > systems.
> > 
> > Software then has the ability to selectively turn back on MMIO and/or
> > DMA, perform diagnostics, reset devices etc...
> > 
> > Because the domains encompass more than just DMA, but also segment the
> > MMIO space, it is not practical at all to dynamically reconfigure them
> > at runtime to "move" devices into domains. The firmware or early kernel
> > code (it depends) will assign devices BARs using an algorithm that keeps
> > them within PE segment boundaries, etc....
> > 
> > Additionally (and this is indeed a "restriction" compared to VTd, though
> > I expect our future IO chips to lift it to some extent), PE don't get
> > separate DMA address spaces. There is one 64-bit DMA address space per
> > PCI host bridge, and it is 'segmented' with each segment being assigned
> > to a PE. Due to the way PE assignment works in hardware, it is not
> > practical to make several devices share a segment unless they are on the
> > same bus. Also the resulting limit in the amount of 32-bit DMA space a
> > device can access means that it's impractical to put too many devices in
> > a PE anyways. (This is clearly designed for paravirt iommu, I'll talk
> > more about that later).
> > 
> > [... rest of the original write-up quoted in full, trimmed ...]

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
@ 2011-08-05 10:42     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 322+ messages in thread
From: Benjamin Herrenschmidt @ 2011-08-05 10:42 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	David Gibson, Alex Williamson, Anthony Liguori, linuxppc-dev

On Thu, 2011-08-04 at 12:27 +0200, Joerg Roedel wrote:
> Hi Ben,
> 
> thanks for your detailed introduction to the requirements for POWER. Its
> good to know that the granularity problem is not x86-only.

I'm happy to see your reply :-) I had the feeling I was a bit alone
here...

> On Sat, Jul 30, 2011 at 09:58:53AM +1000, Benjamin Herrenschmidt wrote:
> > In IBM POWER land, we call this a "partitionable endpoint" (the term
> > "endpoint" here is historic, such a PE can be made of several PCIe
> > "endpoints"). I think "partitionable" is a pretty good name tho to
> > represent the constraints, so I'll call this a "partitionable group"
> > from now on.
> 
> On x86 this is mostly an issue of the IOMMU and which set of devices use
> the same request-id. I used to call that an alias-group because the
> devices have a request-id alias to the pci-bridge.

Right. In fact to try to clarify the problem for everybody, I think we
can distinguish two different classes of "constraints" that can
influence the grouping of devices:

 1- Hard constraints. These are typically devices using the same RID or
where the RID cannot be reliably guaranteed (the latter is the case with
some PCIe-to-PCI-X bridges which will take ownership of "some"
transactions such as split completions, but not all). Devices like that
must be in the same domain. This is where PowerPC adds to what x86 does
today the concept that the domains are pre-existing, since we use the RID
for error isolation & MMIO segmenting as well, so we need to create those
domains at boot time.

 2- Softer constraints. Those constraints derive from the fact that not
applying them risks enabling the guest to create side effects outside of
its "sandbox". To some extent, there can be "degrees" of badness between
the various things that can cause such constraints. Examples are shared
LSIs (since trusting DisINTx can be chancy, see earlier discussions), or
potentially any set of functions in the same device, which can be
problematic due to the possibility of backdoor access to the BARs etc...

Now, what I derive from the discussion we've had so far, is that we need
to find a proper fix for #1, but Alex and Avi seem to prefer that #2
remains a matter of libvirt/user doing the right thing (basically
keeping a loaded gun aimed at the user's foot with a very very very
sweet trigger but heh, let's not start a flamewar here :-)

So let's try to find a proper solution for #1 now, and leave #2 alone
for the time being.

Maybe the right option is for x86 to move toward pre-existing domains
like powerpc does, or maybe we can just expose some kind of ID.

Because #1 is a mix of generic constraints (nasty bridges) and very
platform specific ones (whatever capacity limits in our MMIO segmenting
forced us to put two devices in the same hard domain on power), I
believe it's really something the kernel must solve, not libvirt, qemu,
the user, or anything else.

I am open to suggestions here. I can easily expose my PE# (it's just a
number) somewhere in sysfs; in fact, I'm considering doing it in the PCI
device's sysfs directory, simply because it can/will be useful for other
things such as error reporting, so we could maybe build on that.
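
To illustrate, the attribute itself would be nothing fancy, something
along these lines (rough sketch only; the arch_pci_get_pe_number()
helper and the "pe_number" attribute name are made up here, it would be
whatever arch hook / naming we settle on):

/* rough sketch: expose the per-device PE# in the PCI device's sysfs dir */
static ssize_t pe_number_show(struct device *dev,
                              struct device_attribute *attr, char *buf)
{
        struct pci_dev *pdev = to_pci_dev(dev);
        int pe = arch_pci_get_pe_number(pdev); /* hypothetical arch hook */

        if (pe < 0)
                return -ENODEV;
        return sprintf(buf, "%d\n", pe);
}
static DEVICE_ATTR(pe_number, S_IRUGO, pe_number_show, NULL);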

The crux for me is really the need for pre-existence of the iommu
domains as my PE's imply a shared iommu space.

> > - The -minimum- granularity of pass-through is not always a single
> > device and not always under SW control
> 
> Correct.
>  
> > - Having a magic heuristic in libvirt to figure out those constraints is
> > WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
> > knowledge of PCI resource management and getting it wrong in many many
> > cases, something that took years to fix essentially by ripping it all
> > out. This is kernel knowledge and thus we need the kernel to expose in a
> > way or another what those constraints are, what those "partitionable
> > groups" are.
> 
> I agree. Managing the ownership of a group should be done in the kernel.
> Doing this in userspace is just too dangerous.
> 
> The problem to be solved here is how to present these PEs inside the
> kernel and to userspace. I thought a bit about making this visbible
> through the iommu-api for in-kernel users. That is probably the most
> logical place.

Ah, you started answering my questions above :-)

We could do what you propose. It depends what we want to do with
domains. Practically speaking, we could make domains pre-existing (with
the ability to group several PEs into larger domains) or we could keep
the concepts different, possibly with the limitation that on powerpc, a
domain == a PE.

I suppose we -could- make arbitrary domains on ppc as well by making the
various PE's iommu's in HW point to the same in-memory table, but that's
a bit nasty in practice due to the way we manage those, and it would to
some extent increase the risk of a failing device/driver stomping on
another one and thus taking it down with itself. IE. isolation of errors
is an important feature for us.

So I'd rather avoid the whole domain thing for now and keep the
constraint, for powerpc at least, that a domain == a PE, and thus find a
proper way to expose that to qemu/libvirt.

> For userspace I would like to propose a new device attribute in sysfs.
> This attribute contains the group number. All devices with the same
> group number belong to the same PE. Libvirt needs to scan the whole
> device tree to build the groups but that is probably not a big deal.

That's trivial for me to map to my existing PE number. Should we define
the number space to be within a PCI domain (ie. a host bridge), or
should it be a global space? In the latter case I can construct them
using domain << 16 | PE# or something like that.
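
Purely for illustration, the global variant would be something as simple
as this (assuming pdev and a PE number "pe" obtained along the lines of
the sketch above):

        /* sketch: globally unique group id from PCI domain + PE# */
        u32 group_id = (pci_domain_nr(pdev->bus) << 16) | (pe & 0xffff);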

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-05 10:26         ` Benjamin Herrenschmidt
@ 2011-08-05 12:57           ` Joerg Roedel
  -1 siblings, 0 replies; 322+ messages in thread
From: Joerg Roedel @ 2011-08-05 12:57 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alex Williamson, Avi Kivity, kvm, Anthony Liguori, David Gibson,
	Paul Mackerras, Alexey Kardashevskiy, linux-pci, linuxppc-dev

On Fri, Aug 05, 2011 at 08:26:11PM +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2011-08-04 at 12:41 +0200, Joerg Roedel wrote:
> > On Mon, Aug 01, 2011 at 02:27:36PM -0600, Alex Williamson wrote:
> > > It's not clear to me how we could skip it.  With VT-d, we'd have to
> > > implement an emulated interrupt remapper and hope that the guest picks
> > > unused indexes in the host interrupt remapping table before it could do
> > > anything useful with direct access to the MSI-X table.  Maybe AMD IOMMU
> > > makes this easier?
> > 
> > AMD IOMMU provides remapping tables per-device, and not a global one.
> > But that does not make direct guest-access to the MSI-X table safe. The
> > table contains the interrupt-type and the vector
> > which is used as an index into the remapping table by the IOMMU. So when
> > the guest writes into its MSI-X table the remapping-table in the host
> > needs to be updated too.
> 
> Right, you need paravirt to avoid filtering :-)

Or a shadow MSI-X table like it is done on x86. How to handle this seems
to be platform specific. As you indicate, there is a standardized
paravirt interface for that on Power.
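
Roughly, the shadow-table idea is the following; the structure and the
host_program_msix() helper below are made up purely to illustrate the
concept, this is not the actual code:

/* illustration only: guest writes never reach the real table directly */
struct shadow_msix_entry {
        uint64_t guest_addr;    /* what the guest wrote and reads back */
        uint32_t guest_data;
};

static void shadow_msix_write(struct shadow_msix_entry *shadow,
                              unsigned int entry,
                              uint64_t addr, uint32_t data)
{
        /* the guest's write only ever lands in the shadow copy */
        shadow[entry].guest_addr = addr;
        shadow[entry].guest_data = data;

        /* the real MSI-X entry gets host-chosen address/vector instead */
        host_program_msix(entry);       /* hypothetical helper */
}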

> IE the problem is two fold:
> 
>  - Getting the right value in the table / remapper so things work
> (paravirt)
> 
>  - Protecting against the guest somewhat managing to change the value in
> the table (either directly or via a backdoor access to its own config
> space).
> 
> The later for us comes from the HW PE filtering of the MSI transactions.

Right. The second part of the problem can be avoided with
interrupt-remapping/filtering hardware in the IOMMUs.

	Joerg

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-05 10:42     ` Benjamin Herrenschmidt
@ 2011-08-05 13:44       ` Joerg Roedel
  -1 siblings, 0 replies; 322+ messages in thread
From: Joerg Roedel @ 2011-08-05 13:44 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: kvm, Anthony Liguori, Alex Williamson, David Gibson,
	Paul Mackerras, Alexey Kardashevskiy, linux-pci, linuxppc-dev

On Fri, Aug 05, 2011 at 08:42:38PM +1000, Benjamin Herrenschmidt wrote:

> Right. In fact to try to clarify the problem for everybody, I think we
> can distinguish two different classes of "constraints" that can
> influence the grouping of devices:
> 
>  1- Hard constraints. These are typically devices using the same RID or
> where the RID cannot be reliably guaranteed (the later is the case with
> some PCIe-PCIX bridges which will take ownership of "some" transactions
> such as split but not all). Devices like that must be in the same
> domain. This is where PowerPC adds to what x86 does today the concept
> that the domains are pre-existing, since we use the RID for error
> isolation & MMIO segmenting as well. so we need to create those domains
> at boot time.

Domains (in the iommu-sense) are created at boot time on x86 today.
Every device needs at least a domain to provide dma-mapping
functionality to the drivers. So all the grouping is done at boot time
too. This is specific to the iommu-drivers today but can be generalized,
I think.

>  2- Softer constraints. Those constraints derive from the fact that not
> applying them risks enabling the guest to create side effects outside of
> its "sandbox". To some extent, there can be "degrees" of badness between
> the various things that can cause such constraints. Examples are shared
> LSIs (since trusting DisINTx can be chancy, see earlier discussions),
> potentially any set of functions in the same device can be problematic
> due to the possibility to get backdoor access to the BARs etc...

Hmm, there is no sane way to handle such constraints in a safe way,
right? We can either blacklist devices which are known to have such
backdoors or we just ignore the problem.

> Now, what I derive from the discussion we've had so far, is that we need
> to find a proper fix for #1, but Alex and Avi seem to prefer that #2
> remains a matter of libvirt/user doing the right thing (basically
> keeping a loaded gun aimed at the user's foot with a very very very
> sweet trigger but heh, let's not start a flamewar here :-)
> 
> So let's try to find a proper solution for #1 now, and leave #2 alone
> for the time being.

Yes, and the solution for #1 should be entirely in the kernel. The
question is how to do that. Probably the most sane way is to introduce a
concept of device ownership. The owner can either be a kernel driver
or a userspace process. Giving ownership of a device to userspace is
only possible if all devices in the same group are unbound from their
respective drivers. This is a very intrusive concept, no idea if it
has a chance of acceptance :-)
But the advantage is clearly that this allows better semantics in the
IOMMU drivers and a more stable handover of devices from host drivers to
kvm guests.
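
The check for handing a group over could be as simple as this (pure
sketch; the group_member structure and the driver_is_vfio_stub() helper
are made up, nothing like this exists today):

struct group_member {                   /* made up for illustration */
        struct list_head node;
        struct device *dev;
};

/* allow userspace ownership only if no member is still bound to a
 * regular host driver */
static bool group_may_go_to_userspace(struct list_head *members)
{
        struct group_member *m;

        list_for_each_entry(m, members, node)
                if (m->dev->driver &&
                    !driver_is_vfio_stub(m->dev->driver)) /* hypothetical */
                        return false;
        return true;
}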

> Maybe the right option is for x86 to move toward pre-existing domains
> like powerpc does, or maybe we can just expose some kind of ID.

As I said, the domains are created at iommu driver initialization time
(usually boot time). But the groups are internal to the iommu drivers
and not visible anywhere else.

> Ah you started answering to my above questions :-)
> 
> We could do what you propose. It depends what we want to do with
> domains. Practically speaking, we could make domains pre-existing (with
> the ability to group several PEs into larger domains) or we could keep
> the concepts different, possibly with the limitation that on powerpc, a
> domain == a PE.
> 
> I suppose we -could- make arbitrary domains on ppc as well by making the
> various PE's iommu's in HW point to the same in-memory table, but that's
> a bit nasty in practice due to the way we manage those, and it would to
> some extent increase the risk of a failing device/driver stomping on
> another one and thus taking it down with itself. IE. isolation of errors
> is an important feature for us.

These arbitrary domains exist in the iommu-api. It would be good to
emulate them on Power too. Can't you put a PE into an isolated
error-domain when something goes wrong with it? This should provide the
same isolation as before.
What you derive the group number from is your business :-) On x86 it is
certainly best to use the RID these devices share together with the
PCI segment number.
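
In iommu-api terms, building such an arbitrary domain is essentially the
following (sketch only; error handling of the attach calls and the
iommu_map() details are left out, and pdev_a/pdev_b just stand for
whatever devices you want grouped):

static int make_arbitrary_domain(struct pci_dev *pdev_a,
                                 struct pci_dev *pdev_b)
{
        struct iommu_domain *dom = iommu_domain_alloc();

        if (!dom)
                return -ENOMEM;
        /* put both devices into the same DMA address space */
        iommu_attach_device(dom, &pdev_a->dev);
        iommu_attach_device(dom, &pdev_b->dev);
        /* ... then iommu_map() the guest memory into the domain ... */
        return 0;
}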

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-05 10:42     ` Benjamin Herrenschmidt
@ 2011-08-05 15:10       ` Alex Williamson
  -1 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-08-05 15:10 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Joerg Roedel, kvm, Anthony Liguori, David Gibson, Paul Mackerras,
	Alexey Kardashevskiy, linux-pci, linuxppc-dev

On Fri, 2011-08-05 at 20:42 +1000, Benjamin Herrenschmidt wrote:
> Right. In fact to try to clarify the problem for everybody, I think we
> can distinguish two different classes of "constraints" that can
> influence the grouping of devices:
> 
>  1- Hard constraints. These are typically devices using the same RID or
> where the RID cannot be reliably guaranteed (the later is the case with
> some PCIe-PCIX bridges which will take ownership of "some" transactions
> such as split but not all). Devices like that must be in the same
> domain. This is where PowerPC adds to what x86 does today the concept
> that the domains are pre-existing, since we use the RID for error
> isolation & MMIO segmenting as well. so we need to create those domains
> at boot time.
> 
>  2- Softer constraints. Those constraints derive from the fact that not
> applying them risks enabling the guest to create side effects outside of
> its "sandbox". To some extent, there can be "degrees" of badness between
> the various things that can cause such constraints. Examples are shared
> LSIs (since trusting DisINTx can be chancy, see earlier discussions),
> potentially any set of functions in the same device can be problematic
> due to the possibility to get backdoor access to the BARs etc...

This is what I've been trying to get to, hardware constraints vs system
policy constraints.

> Now, what I derive from the discussion we've had so far, is that we need
> to find a proper fix for #1, but Alex and Avi seem to prefer that #2
> remains a matter of libvirt/user doing the right thing (basically
> keeping a loaded gun aimed at the user's foot with a very very very
> sweet trigger but heh, let's not start a flamewar here :-)

Doesn't your own uncertainty about whether or not to allow this lead to
the same conclusion, that it belongs in userspace policy?  I don't think
we want to make whitelists of which devices we trust to do DisINTx
correctly part of the kernel interface, do we?  Thanks,

Alex

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-05 13:44       ` Joerg Roedel
@ 2011-08-05 22:49         ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 322+ messages in thread
From: Benjamin Herrenschmidt @ 2011-08-05 22:49 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: kvm, Anthony Liguori, Alex Williamson, David Gibson,
	Paul Mackerras, Alexey Kardashevskiy, linux-pci, linuxppc-dev

On Fri, 2011-08-05 at 15:44 +0200, Joerg Roedel wrote:
> On Fri, Aug 05, 2011 at 08:42:38PM +1000, Benjamin Herrenschmidt wrote:
> 
> > Right. In fact to try to clarify the problem for everybody, I think we
> > can distinguish two different classes of "constraints" that can
> > influence the grouping of devices:
> > 
> >  1- Hard constraints. These are typically devices using the same RID or
> > where the RID cannot be reliably guaranteed (the later is the case with
> > some PCIe-PCIX bridges which will take ownership of "some" transactions
> > such as split but not all). Devices like that must be in the same
> > domain. This is where PowerPC adds to what x86 does today the concept
> > that the domains are pre-existing, since we use the RID for error
> > isolation & MMIO segmenting as well. so we need to create those domains
> > at boot time.
> 
> Domains (in the iommu-sense) are created at boot time on x86 today.
> Every device needs at least a domain to provide dma-mapping
> functionality to the drivers. So all the grouping is done too at
> boot-time. This is specific to the iommu-drivers today but can be
> generalized I think.

Ok, let's go there then.

> >  2- Softer constraints. Those constraints derive from the fact that not
> > applying them risks enabling the guest to create side effects outside of
> > its "sandbox". To some extent, there can be "degrees" of badness between
> > the various things that can cause such constraints. Examples are shared
> > LSIs (since trusting DisINTx can be chancy, see earlier discussions),
> > potentially any set of functions in the same device can be problematic
> > due to the possibility to get backdoor access to the BARs etc...
> 
> Hmm, there is no sane way to handle such constraints in a safe way,
> right? We can either blacklist devices which are know to have such
> backdoors or we just ignore the problem.

Arguably they probably all do have such backdoors. A debug register, a
JTAG register, ... My point is you don't really know unless you get a
manufacturer guarantee that there is no undocumented register somewhere,
or a way to change the microcode so that it does it, etc.... The more
complex the device, the less likely you are to get such a guarantee.

The "safe" way is what pHyp does and basically boils down to only
allowing pass-through of entire 'slots', ie, things that are behind a
P2P bridge (virtual one typically, ie, a PCIe switch) and disallowing
pass-through with shared interrupts.

That way, even if the guest can move the BARs around, it cannot make
them overlap somebody else's device, because the parent bridge restricts
the portion of MMIO space that is forwarded down to that device anyway.
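
ie, the "entire slot" rule boils down to a check of roughly this shape
(sketch only; multi-function and shared-interrupt handling left out):

/* is this device the only thing behind its upstream (P2P) bridge ? */
static bool behind_own_bridge(struct pci_dev *pdev)
{
        struct pci_bus *bus = pdev->bus;
        struct pci_dev *tmp;

        if (pci_is_root_bus(bus) || !bus->self)
                return false;           /* sits directly on the root bus */

        list_for_each_entry(tmp, &bus->devices, bus_list)
                if (tmp != pdev)
                        return false;   /* shares the secondary bus */
        return true;
}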

> > Now, what I derive from the discussion we've had so far, is that we need
> > to find a proper fix for #1, but Alex and Avi seem to prefer that #2
> > remains a matter of libvirt/user doing the right thing (basically
> > keeping a loaded gun aimed at the user's foot with a very very very
> > sweet trigger but heh, let's not start a flamewar here :-)
> > 
> > So let's try to find a proper solution for #1 now, and leave #2 alone
> > for the time being.
> 
> Yes, and the solution for #1 should be entirely in the kernel. The
> question is how to do that. Probably the most sane way is to introduce a
> concept of device ownership. The ownership can either be a kernel driver
> or a userspace process. Giving ownership of a device to userspace is
> only possible if all devices in the same group are unbound from its
> respective drivers. This is a very intrusive concept, no idea if it
> has a chance of acceptance :-)
> But the advantage is clearly that this allows better semantics in the
> IOMMU drivers and a more stable handover of devices from host drivers to
> kvm guests.

I tend to think along those lines too, but the ownership concept
doesn't necessarily have to be enforced by the core kernel itself; it
can be in VFIO.

If we have a common API to expose the "domain number", it can perfectly
well be a matter of VFIO itself not allowing pass-through until it has
attached its stub driver to all the devices with that domain number, and
it can handle exclusion of iommu domains from there.

> > Maybe the right option is for x86 to move toward pre-existing domains
> > like powerpc does, or maybe we can just expose some kind of ID.
> 
> As I said, the domains are created at iommu driver initialization time
> (usually boot time). But the groups are internal to the iommu drivers
> and not visible anywhere else.

That's what we need to fix :-)

> > Ah you started answering to my above questions :-)
> > 
> > We could do what you propose. It depends what we want to do with
> > domains. Practically speaking, we could make domains pre-existing (with
> > the ability to group several PEs into larger domains) or we could keep
> > the concepts different, possibly with the limitation that on powerpc, a
> > domain == a PE.
> > 
> > I suppose we -could- make arbitrary domains on ppc as well by making the
> > various PE's iommu's in HW point to the same in-memory table, but that's
> > a bit nasty in practice due to the way we manage those, and it would to
> > some extent increase the risk of a failing device/driver stomping on
> > another one and thus taking it down with itself. IE. isolation of errors
> > is an important feature for us.
> 
> These arbitrary domains exist in the iommu-api. It would be good to
> emulate them on Power too. Can't you put a PE into an isolated
> error-domain when something goes wrong with it? This should provide the
> same isolation as before.

Well, my problem is that it's quite hard for me to arbitrarily make PEs
share the same iommu table. The iommu tables are assigned at boot time
along with the creation of the PEs, and because, sadly, I don't (yet)
support tree structures for them, they are large physically contiguous
things, so I need to allocate them early and keep them around.

I -could- make a hack to share tables when creating such arbitrary
domains, but I would definitely have to keep track of the "original"
table of the PE so that it can be reverted; I can't afford to free the
memory or I risk not being able to re-allocate it.

We'll have tree-structured iommu tables in future HW, but not yet.

> What you derive the group number from is your business :-) On x86 it is
> certainly the best to use the RID these devices share together with the
> PCI segment number.

Ok. The question is more, in terms of API, whether this number is to be
unique at the system scope or only at the PCI host bridge scope.

Cheers,
Ben.

> Regards,
> 
> 	Joerg

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-05 15:10       ` Alex Williamson
  (?)
@ 2011-08-08  6:07       ` David Gibson
  -1 siblings, 0 replies; 322+ messages in thread
From: David Gibson @ 2011-08-08  6:07 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, kvm, Paul Mackerras, Joerg Roedel,
	Anthony Liguori, linux-pci, linuxppc-dev

On Fri, Aug 05, 2011 at 09:10:09AM -0600, Alex Williamson wrote:
> On Fri, 2011-08-05 at 20:42 +1000, Benjamin Herrenschmidt wrote:
> > Right. In fact to try to clarify the problem for everybody, I think we
> > can distinguish two different classes of "constraints" that can
> > influence the grouping of devices:
> > 
> >  1- Hard constraints. These are typically devices using the same RID or
> > where the RID cannot be reliably guaranteed (the latter is the case with
> > some PCIe-PCIX bridges which will take ownership of "some" transactions
> > such as split but not all). Devices like that must be in the same
> > domain. This is where PowerPC adds to what x86 does today the concept
> > that the domains are pre-existing, since we use the RID for error
> > isolation & MMIO segmenting as well. so we need to create those domains
> > at boot time.
> > 
> >  2- Softer constraints. Those constraints derive from the fact that not
> > applying them risks enabling the guest to create side effects outside of
> > its "sandbox". To some extent, there can be "degrees" of badness between
> > the various things that can cause such constraints. Examples are shared
> > LSIs (since trusting DisINTx can be chancy, see earlier discussions),
> > potentially any set of functions in the same device can be problematic
> > due to the possibility to get backdoor access to the BARs etc...
> 
> This is what I've been trying to get to, hardware constraints vs system
> policy constraints.
> 
> > Now, what I derive from the discussion we've had so far, is that we need
> > to find a proper fix for #1, but Alex and Avi seem to prefer that #2
> > remains a matter of libvirt/user doing the right thing (basically
> > keeping a loaded gun aimed at the user's foot with a very very very
> > sweet trigger but heh, let's not start a flamewar here :-)
> 
> Doesn't your own uncertainty of whether or not to allow this lead to the
> same conclusion, that it belongs in userspace policy?  I don't think we
> want to make white lists of which devices we trust to do DisINTx
> correctly part of the kernel interface, do we?  Thanks,

Yes, but the overall point is that both the hard and soft constraints
are much easier to handle if a group or iommu domain or whatever is a
persistent entity that can be set up once-per-boot by the admin with
whatever degree of safety they want, rather than a transient entity
tied to an fd's lifetime, which must be set up correctly, every time,
by the thing establishing it.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-03  2:04           ` David Gibson
  (?)
@ 2011-08-08  8:28             ` Avi Kivity
  -1 siblings, 0 replies; 322+ messages in thread
From: Avi Kivity @ 2011-08-08  8:28 UTC (permalink / raw)
  To: Alex Williamson, aafabbri, Alexey Kardashevskiy, kvm,
	Paul Mackerras, qemu-deve

On 08/03/2011 05:04 AM, David Gibson wrote:
> I still don't understand the distinction you're making.  We're saying
> the group is "owned" by a given user or guest in the sense that no-one
> else may use anything in the group (including host drivers).  At that
> point none, some or all of the devices in the group may actually be
> used by the guest.
>
> You seem to be making a distinction between "owned by" and "assigned
> to" and "used by" and I really don't see what it is.
>

Alex (and I) think that we should work with device/function granularity, 
as is common with other archs, and that the group thing is just a 
constraint on which functions may be assigned where, while you think 
that we should work at group granularity, with 1-function groups for 
archs which don't have constraints.

Is this an accurate way of putting it?

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-08  8:28             ` Avi Kivity
  (?)
@ 2011-08-09 23:24               ` Alex Williamson
  -1 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-08-09 23:24 UTC (permalink / raw)
  To: Avi Kivity
  Cc: aafabbri, Alexey Kardashevskiy, kvm, Paul Mackerras, qemu-devel,
	chrisw, iommu, Anthony Liguori, linux-pci, linuxppc-dev, benve

On Mon, 2011-08-08 at 11:28 +0300, Avi Kivity wrote:
> On 08/03/2011 05:04 AM, David Gibson wrote:
> > I still don't understand the distinction you're making.  We're saying
> > the group is "owned" by a given user or guest in the sense that no-one
> > else may use anything in the group (including host drivers).  At that
> > point none, some or all of the devices in the group may actually be
> > used by the guest.
> >
> > You seem to be making a distinction between "owned by" and "assigned
> > to" and "used by" and I really don't see what it is.
> >
> 
> Alex (and I) think that we should work with device/function granularity, 
> as is common with other archs, and that the group thing is just a 
> constraint on which functions may be assigned where, while you think 
> that we should work at group granularity, with 1-function groups for 
> archs which don't have constraints.
> 
> Is this an accurate way of putting it?

Mostly correct, yes.  x86 isn't immune to the group problem, it shows up
for us any time there's a PCIe-to-PCI bridge in the device hierarchy.
We lose resolution of devices behind the bridge.  As you state though, I
think of this as only a constraint on what we're able to do with those
devices.

Perhaps part of the difference is that on x86 the constraints don't
really affect how we expose devices to the guest.  We need to hold
unused devices in the group hostage and use the same iommu domain for
any devices assigned, but that's not visible to the guest.  AIUI, POWER
probably needs to expose the bridge (or at least an emulated bridge) to
the guest, any devices in the group need to show up behind that bridge,
some kind of pvDMA needs to be associated with that group, there might
be MMIO segments and IOVA windows, etc.  Effectively you want to
transplant the entire group into the guest.  Is that right?  Thanks,

Alex

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-09 23:24               ` Alex Williamson
  (?)
@ 2011-08-10  2:48                 ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 322+ messages in thread
From: Benjamin Herrenschmidt @ 2011-08-10  2:48 UTC (permalink / raw)
  To: Alex Williamson
  Cc: chrisw, Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	qemu-devel, aafabbri, iommu, Avi Kivity, linuxppc-dev, benve


> Mostly correct, yes.  x86 isn't immune to the group problem, it shows up
> for us any time there's a PCIe-to-PCI bridge in the device hierarchy.
> We lose resolution of devices behind the bridge.  As you state though, I
> think of this as only a constraint on what we're able to do with those
> devices.
> 
> Perhaps part of the difference is that on x86 the constraints don't
> really affect how we expose devices to the guest.  We need to hold
> unused devices in the group hostage and use the same iommu domain for
> any devices assigned, but that's not visible to the guest.  AIUI, POWER
> probably needs to expose the bridge (or at least an emulated bridge) to
> the guest, any devices in the group need to show up behind that bridge,

Yes, pretty much, essentially because a group must have a shared iommu
domain, and due to the way our PV representation works, that means the
iommu DMA window has to be exposed by a bridge that covers all the
devices of that group.

> some kind of pvDMA needs to be associated with that group, there might
> be MMIO segments and IOVA windows, etc.  

The MMIO segments are mostly transparent to the guest: we just tell it
where the BARs are and it leaves them alone; at least that's how it
works under pHyp.

Currently, in our qemu/vfio experiments, we do let the guest do the BAR
assignment via the emulated stuff, using a hack to work around the guest
expectation that the BARs have already been set up (I can fill you in on
the details if you really care but it's not very interesting). It works
because we only ever used that on setups where we had a device == a
group, but it's nasty. In any case, because they are always going to be
in separate pages, it's not too hard for KVM to remap them wherever we
want, so MMIO is basically a non-issue.

> Effectively you want to
> transplant the entire group into the guest.  Is that right?  Thanks,

Well, at least we want to have a bridge for the group (it could and
probably should be a host bridge, ie, an entire PCI domain, that's a lot
easier than trying to mess around with virtual P2P bridges).

From there, I don't care if we need to explicitly expose each device of
that group one by one. IE. it would be a nice "optimization" to have the
ability to just specify the group and have qemu pick them all up, but it
doesn't really matter in the grand scheme of things.

Currently, we do expose individual devices, but again, it's all hacks and
it won't work on many setups etc... with horrid consequences :-) We need
to sort that out before we can even think of merging that code on our
side.

Cheers,
Ben.

> Alex

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-10  2:48                 ` Benjamin Herrenschmidt
  (?)
@ 2011-08-20 16:51                   ` Alex Williamson
  -1 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-08-20 16:51 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: chrisw, Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	qemu-devel, aafabbri, iommu, Avi Kivity, linuxppc-dev, benve

We had an extremely productive VFIO BoF on Monday.  Here's my attempt to
capture the plan that I think we agreed to:

We need to address both the description and enforcement of device
groups.  Groups are formed any time the iommu does not have resolution
between a set of devices.  On x86, this typically happens when a
PCI-to-PCI bridge exists between the set of devices and the iommu.  For
Power, partitionable endpoints define a group.  Grouping information
needs to be exposed for both userspace and kernel internal usage.  This
will be a sysfs attribute setup by the iommu drivers.  Perhaps:

# cat /sys/devices/pci0000:00/0000:00:19.0/iommu_group
42

(I use a PCI example here, but attribute should not be PCI specific)
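
Userspace (libvirt or just a test tool) could then collect group
membership with nothing more than one read of that attribute per
device.  A minimal sketch, assuming the attribute name and location
suggested above (not a settled ABI):

#include <stdio.h>

/*
 * Read the proposed iommu_group attribute for one PCI device;
 * /sys/bus/pci/devices/<addr> is just the symlink view of the
 * /sys/devices path in the example above.  Returns the group
 * number, or -1 if the attribute isn't there.
 */
static int iommu_group_of(const char *pcidev)
{
	char path[256];
	int group = -1;
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/bus/pci/devices/%s/iommu_group", pcidev);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%d", &group) != 1)
		group = -1;
	fclose(f);
	return group;
}

Devices reporting the same number would simply belong to the same
group.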

From there we have a few options.  In the BoF we discussed a model where
binding a device to vfio creates a /dev/vfio$GROUP character device
file.  This "group" fd provides dma mapping ioctls as well as
ioctls to enumerate and return a "device" fd for each attached member of
the group (similar to KVM_CREATE_VCPU).  We enforce grouping by
returning an error on open() of the group fd if there are members of the
group not bound to the vfio driver.  Each device fd would then support a
similar set of ioctls and mapping (mmio/pio/config) interface as current
vfio, except for the obvious domain and dma ioctls superseded by the
group fd.
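
To make the intended flow concrete, roughly what a user of that model
would do is sketched below; the ioctl names in the comments are
placeholders for the sake of the example, nothing here is a defined
ABI yet:

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	int group;

	/* open() fails if any member of group 42 is still bound to a
	 * host driver rather than vfio */
	group = open("/dev/vfio42", O_RDWR);
	if (group < 0)
		return 1;

	/* dma mappings for the whole group would be set up here, e.g.
	 * ioctl(group, VFIO_GROUP_MAP_DMA, &map)       -- placeholder */

	/* then a device fd is pulled out per member, KVM_CREATE_VCPU
	 * style, e.g.
	 * ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:00:19.0")
	 * and read/write/mmap on that fd works as with current vfio */

	close(group);
	return 0;
}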

Another valid model might be that /dev/vfio/$GROUP is created for all
groups when the vfio module is loaded.  The group fd would allow open()
and some set of iommu querying and device enumeration ioctls, but would
error on dma mapping and retrieving device fds until all of the group
devices are bound to the vfio driver.

In either case, the uiommu interface is removed entirely since dma
mapping is done via the group fd.  As necessary in the future, we can
define a more high performance dma mapping interface for streaming dma
via the group fd.  I expect we'll also include architecture specific
group ioctls to describe features and capabilities of the iommu.  The
group fd will need to prevent concurrent open()s to maintain a 1:1 group
to userspace process ownership model.

Also on the table is supporting non-PCI devices with vfio.  To do this,
we need to generalize the read/write/mmap and irq eventfd interfaces.
We could keep the same model of segmenting the device fd address space,
perhaps adding ioctls to define the segment offset bit position, or we
could split each region into its own fd (VFIO_GET_PCI_BAR_FD(0),
VFIO_GET_PCI_CONFIG_FD(), VFIO_GET_MMIO_FD(3)), though we're already
suffering some degree of fd bloat (group fd, device fd(s), interrupt
event fd(s), per resource fd, etc).  For interrupts we can overload
VFIO_SET_IRQ_EVENTFD to be either PCI INTx or non-PCI irq (do non-PCI
devices support MSI?).

For qemu, these changes imply we'd only support a model where we have a
1:1 group to iommu domain.  The current vfio driver could probably
become vfio-pci as we might end up with more target specific vfio
drivers for non-pci.  PCI should be able to maintain a simple -device
vfio-pci,host=bb:dd.f to enable hotplug of individual devices.  We'll
need to come up with extra options when we need to expose groups to
guest for pvdma.

Hope that captures it, feel free to jump in with corrections and
suggestions.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-20 16:51                   ` Alex Williamson
  (?)
@ 2011-08-22  5:55                     ` David Gibson
  -1 siblings, 0 replies; 322+ messages in thread
From: David Gibson @ 2011-08-22  5:55 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Benjamin Herrenschmidt, chrisw, Alexey Kardashevskiy, kvm,
	Paul Mackerras, linux-pci, qemu-devel, aafabbri, iommu,
	Avi Kivity, Anthony Liguori, linuxppc-dev, benve

On Sat, Aug 20, 2011 at 09:51:39AM -0700, Alex Williamson wrote:
> We had an extremely productive VFIO BoF on Monday.  Here's my attempt to
> capture the plan that I think we agreed to:
> 
> We need to address both the description and enforcement of device
> groups.  Groups are formed any time the iommu does not have resolution
> between a set of devices.  On x86, this typically happens when a
> PCI-to-PCI bridge exists between the set of devices and the iommu.  For
> Power, partitionable endpoints define a group.  Grouping information
> needs to be exposed for both userspace and kernel internal usage.  This
> will be a sysfs attribute setup by the iommu drivers.  Perhaps:
> 
> # cat /sys/devices/pci0000:00/0000:00:19.0/iommu_group
> 42
> 
> (I use a PCI example here, but attribute should not be PCI specific)

Ok.  Am I correct in thinking these group IDs are representing the
minimum granularity, and are therefore always static, defined only by
the connected hardware, not by configuration?

> From there we have a few options.  In the BoF we discussed a model where
> binding a device to vfio creates a /dev/vfio$GROUP character device
> file.  This "group" fd provides dma mapping ioctls as well as
> ioctls to enumerate and return a "device" fd for each attached member of
> the group (similar to KVM_CREATE_VCPU).  We enforce grouping by
> returning an error on open() of the group fd if there are members of the
> group not bound to the vfio driver.  Each device fd would then support a
> similar set of ioctls and mapping (mmio/pio/config) interface as current
> vfio, except for the obvious domain and dma ioctls superseded by the
> group fd.

It seems a slightly strange distinction that the group device appears
when any device in the group is bound to vfio, but only becomes usable
when all devices are bound.

> Another valid model might be that /dev/vfio/$GROUP is created for all
> groups when the vfio module is loaded.  The group fd would allow open()
> and some set of iommu querying and device enumeration ioctls, but would
> error on dma mapping and retrieving device fds until all of the group
> devices are bound to the vfio driver.

Which is why I marginally prefer this model, although it's not a big
deal.

> In either case, the uiommu interface is removed entirely since dma
> mapping is done via the group fd.  As necessary in the future, we can
> define a more high performance dma mapping interface for streaming dma
> via the group fd.  I expect we'll also include architecture specific
> group ioctls to describe features and capabilities of the iommu.  The
> group fd will need to prevent concurrent open()s to maintain a 1:1 group
> to userspace process ownership model.

A 1:1 group<->process correspondence seems wrong to me.  But there are
many ways you could legitimately write the userspace side of the code,
many of them involving some sort of concurrency.  Implementing that
concurrency as multiple processes (using explicit shared memory and/or
other IPC mechanisms to co-ordinate) seems a valid choice that we
shouldn't arbitrarily prohibit.

Obviously, only one UID may be permitted to have the group open at a
time, and I think that's enough to prevent them doing any worse than
shooting themselves in the foot.

> Also on the table is supporting non-PCI devices with vfio.  To do this,
> we need to generalize the read/write/mmap and irq eventfd interfaces.
> We could keep the same model of segmenting the device fd address space,
> perhaps adding ioctls to define the segment offset bit position or we
> could split each region into its own fd (VFIO_GET_PCI_BAR_FD(0),
> VFIO_GET_PCI_CONFIG_FD(), VFIO_GET_MMIO_FD(3)), though we're already
> suffering some degree of fd bloat (group fd, device fd(s), interrupt
> event fd(s), per resource fd, etc).  For interrupts we can overload
> VFIO_SET_IRQ_EVENTFD to be either PCI INTx or non-PCI irq 

Sounds reasonable.

> (do non-PCI
> devices support MSI?).

They can.  Obviously they might not have exactly the same semantics as
PCI MSIs, but I know we have SoC systems with (non-PCI) on-die devices
whose interrupts are treated by the (also on-die) root interrupt
controller in the same way as PCI MSIs.

> For qemu, these changes imply we'd only support a model where we have a
> 1:1 group to iommu domain.  The current vfio driver could probably
> become vfio-pci as we might end up with more target specific vfio
> drivers for non-pci.  PCI should be able to maintain a simple -device
> vfio-pci,host=bb:dd.f to enable hotplug of individual devices.  We'll
> need to come up with extra options when we need to expose groups to
> guest for pvdma.

Are you saying that you'd no longer support the current x86 usage of
putting all of one guest's devices into a single domain?  If that's
not what you're saying, how would the domains - now made up of a
user's selection of groups, rather than individual devices - be
configured?

> Hope that captures it, feel free to jump in with corrections and
> suggestions.  Thanks,

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
@ 2011-08-22  5:55                     ` David Gibson
  0 siblings, 0 replies; 322+ messages in thread
From: David Gibson @ 2011-08-22  5:55 UTC (permalink / raw)
  To: Alex Williamson
  Cc: aafabbri, Alexey Kardashevskiy, kvm, Paul Mackerras, qemu-devel,
	chrisw, iommu, Avi Kivity, Anthony Liguori, linux-pci,
	linuxppc-dev, benve

On Sat, Aug 20, 2011 at 09:51:39AM -0700, Alex Williamson wrote:
> We had an extremely productive VFIO BoF on Monday.  Here's my attempt to
> capture the plan that I think we agreed to:
> 
> We need to address both the description and enforcement of device
> groups.  Groups are formed any time the iommu does not have resolution
> between a set of devices.  On x86, this typically happens when a
> PCI-to-PCI bridge exists between the set of devices and the iommu.  For
> Power, partitionable endpoints define a group.  Grouping information
> needs to be exposed for both userspace and kernel internal usage.  This
> will be a sysfs attribute setup by the iommu drivers.  Perhaps:
> 
> # cat /sys/devices/pci0000:00/0000:00:19.0/iommu_group
> 42
> 
> (I use a PCI example here, but attribute should not be PCI specific)

Ok.  Am I correct in thinking these group IDs are representing the
minimum granularity, and are therefore always static, defined only by
the connected hardware, not by configuration?

> >From there we have a few options.  In the BoF we discussed a model where
> binding a device to vfio creates a /dev/vfio$GROUP character device
> file.  This "group" fd provides provides dma mapping ioctls as well as
> ioctls to enumerate and return a "device" fd for each attached member of
> the group (similar to KVM_CREATE_VCPU).  We enforce grouping by
> returning an error on open() of the group fd if there are members of the
> group not bound to the vfio driver.  Each device fd would then support a
> similar set of ioctls and mapping (mmio/pio/config) interface as current
> vfio, except for the obvious domain and dma ioctls superseded by the
> group fd.

It seems a slightly strange distinction that the group device appears
when any device in the group is bound to vfio, but only becomes usable
when all devices are bound.

> Another valid model might be that /dev/vfio/$GROUP is created for all
> groups when the vfio module is loaded.  The group fd would allow open()
> and some set of iommu querying and device enumeration ioctls, but would
> error on dma mapping and retrieving device fds until all of the group
> devices are bound to the vfio driver.

Which is why I marginally prefer this model, although it's not a big
deal.

> In either case, the uiommu interface is removed entirely since dma
> mapping is done via the group fd.  As necessary in the future, we can
> define a more high performance dma mapping interface for streaming dma
> via the group fd.  I expect we'll also include architecture specific
> group ioctls to describe features and capabilities of the iommu.  The
> group fd will need to prevent concurrent open()s to maintain a 1:1 group
> to userspace process ownership model.

A 1:1 group<->process correspondance seems wrong to me. But there are
many ways you could legitimately write the userspace side of the code,
many of them involving some sort of concurrency.  Implementing that
concurrency as multiple processes (using explicit shared memory and/or
other IPC mechanisms to co-ordinate) seems a valid choice that we
shouldn't arbitrarily prohibit.

Obviously, only one UID may be permitted to have the group open at a
time, and I think that's enough to prevent them doing any worse than
shooting themselves in the foot.

> Also on the table is supporting non-PCI devices with vfio.  To do this,
> we need to generalize the read/write/mmap and irq eventfd interfaces.
> We could keep the same model of segmenting the device fd address space,
> perhaps adding ioctls to define the segment offset bit position or we
> could split each region into it's own fd (VFIO_GET_PCI_BAR_FD(0),
> VFIO_GET_PCI_CONFIG_FD(), VFIO_GET_MMIO_FD(3)), though we're already
> suffering some degree of fd bloat (group fd, device fd(s), interrupt
> event fd(s), per resource fd, etc).  For interrupts we can overload
> VFIO_SET_IRQ_EVENTFD to be either PCI INTx or non-PCI irq 

Sounds reasonable.

> (do non-PCI
> devices support MSI?).

They can.  Obviously they might not have exactly the same semantics as
PCI MSIs, but I know we have SoC systems with (non-PCI) on-die devices
whose interrupts are treated by the (also on-die) root interrupt
controller in the same way as PCI MSIs.

> For qemu, these changes imply we'd only support a model where we have a
> 1:1 group to iommu domain.  The current vfio driver could probably
> become vfio-pci as we might end up with more target specific vfio
> drivers for non-pci.  PCI should be able to maintain a simple -device
> vfio-pci,host=bb:dd.f to enable hotplug of individual devices.  We'll
> need to come up with extra options when we need to expose groups to
> guest for pvdma.

Are you saying that you'd no longer support the current x86 usage of
putting all of one guest's devices into a single domain?  If that's
not what you're saying, how would the domains - now made up of a
user's selection of groups, rather than individual devices - be
configured?

> Hope that captures it, feel free to jump in with corrections and
> suggestions.  Thanks,

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-20 16:51                   ` Alex Williamson
  (?)
@ 2011-08-22  6:30                     ` Avi Kivity
  -1 siblings, 0 replies; 322+ messages in thread
From: Avi Kivity @ 2011-08-22  6:30 UTC (permalink / raw)
  To: Alex Williamson
  Cc: chrisw, Alexey Kardashevskiy, kvm, Paul Mackerras, qemu-devel,
	aafabbri, iommu, linux-pci, linuxppc-dev, benve

On 08/20/2011 07:51 PM, Alex Williamson wrote:
> We need to address both the description and enforcement of device
> groups.  Groups are formed any time the iommu does not have resolution
> between a set of devices.  On x86, this typically happens when a
> PCI-to-PCI bridge exists between the set of devices and the iommu.  For
> Power, partitionable endpoints define a group.  Grouping information
> needs to be exposed for both userspace and kernel internal usage.  This
> will be a sysfs attribute setup by the iommu drivers.  Perhaps:
>
> # cat /sys/devices/pci0000:00/0000:00:19.0/iommu_group
> 42
>

$ readlink /sys/devices/pci0000:00/0000:00:19.0/iommu_group
../../../path/to/device/which/represents/the/resource/constraint

(the pci-to-pci bridge on x86, or whatever node represents partitionable 
endpoints on power)
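
Either way, consumption from userspace looks about the same.  A minimal
sketch (C, assuming the flat per-device iommu_group attribute quoted
above -- at this point that attribute is only a proposal, not an existing
kernel interface, and with the symlink variant you'd compare readlink()
targets instead) of listing the devices that share a group with a given
device:

#include <stdio.h>
#include <string.h>
#include <dirent.h>

static int read_group(const char *bdf, char *buf, size_t len)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/bus/pci/devices/%s/iommu_group", bdf);
	f = fopen(path, "r");
	if (!f)
		return -1;		/* no grouping info exposed */
	if (!fgets(buf, len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}

int main(int argc, char **argv)
{
	char want[256], got[256];
	struct dirent *de;
	DIR *dir;

	if (argc != 2 || read_group(argv[1], want, sizeof(want)))
		return 1;
	dir = opendir("/sys/bus/pci/devices");
	if (!dir)
		return 1;
	while ((de = readdir(dir)))
		if (!read_group(de->d_name, got, sizeof(got)) &&
		    !strcmp(got, want))
			printf("%s\n", de->d_name);	/* same group */
	closedir(dir);
	return 0;
}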

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-22  6:30                     ` Avi Kivity
  (?)
@ 2011-08-22 10:46                       ` Joerg Roedel
  -1 siblings, 0 replies; 322+ messages in thread
From: Joerg Roedel @ 2011-08-22 10:46 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Alex Williamson, chrisw, Alexey Kardashevskiy, kvm,
	Paul Mackerras, Benjamin Herrenschmidt, qemu-devel, iommu,
	Anthony Liguori, linux-pci, linuxppc-dev, benve

On Mon, Aug 22, 2011 at 02:30:26AM -0400, Avi Kivity wrote:
> On 08/20/2011 07:51 PM, Alex Williamson wrote:
> > We need to address both the description and enforcement of device
> > groups.  Groups are formed any time the iommu does not have resolution
> > between a set of devices.  On x86, this typically happens when a
> > PCI-to-PCI bridge exists between the set of devices and the iommu.  For
> > Power, partitionable endpoints define a group.  Grouping information
> > needs to be exposed for both userspace and kernel internal usage.  This
> > will be a sysfs attribute setup by the iommu drivers.  Perhaps:
> >
> > # cat /sys/devices/pci0000:00/0000:00:19.0/iommu_group
> > 42
> >
> 
> $ readlink /sys/devices/pci0000:00/0000:00:19.0/iommu_group
> ../../../path/to/device/which/represents/the/resource/constraint
> 
> (the pci-to-pci bridge on x86, or whatever node represents partitionable 
> endpoints on power)

That does not work. The bridge in question may not even be visible as a
PCI device, so you can't link to it. This is the case on a few PCIe
cards which only have a PCIx chip and a PCIe-2-PCIx bridge to implement
the PCIe interface (yes, I have seen those cards).

Regards,

	Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-22 10:46                       ` Joerg Roedel
  (?)
@ 2011-08-22 10:51                         ` Avi Kivity
  -1 siblings, 0 replies; 322+ messages in thread
From: Avi Kivity @ 2011-08-22 10:51 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Alex Williamson, chrisw, Alexey Kardashevskiy, kvm,
	Paul Mackerras, Benjamin Herrenschmidt, qemu-devel, iommu,
	Anthony Liguori, linux-pci, linuxppc-dev, benve

On 08/22/2011 01:46 PM, Joerg Roedel wrote:
> >  $ readlink /sys/devices/pci0000:00/0000:00:19.0/iommu_group
> >  ../../../path/to/device/which/represents/the/resource/constraint
> >
> >  (the pci-to-pci bridge on x86, or whatever node represents partitionable
> >  endpoints on power)
>
> That does not work. The bridge in question may not even be visible as a
> PCI device, so you can't link to it. This is the case on a few PCIe
> cards which only have a PCIx chip and a PCIe-2-PCIx bridge to implement
> the PCIe interface (yes, I have seen those cards).
>

How does the kernel detect that devices behind the invisible bridge must 
be assigned as a unit?

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-22 10:51                         ` Avi Kivity
  (?)
@ 2011-08-22 12:36                           ` Roedel, Joerg
  -1 siblings, 0 replies; 322+ messages in thread
From: Roedel, Joerg @ 2011-08-22 12:36 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Alex Williamson, chrisw, Alexey Kardashevskiy, kvm,
	Paul Mackerras, Benjamin Herrenschmidt, qemu-devel, iommu,
	Anthony Liguori, linux-pci, linuxppc-dev, benve

On Mon, Aug 22, 2011 at 06:51:35AM -0400, Avi Kivity wrote:
> On 08/22/2011 01:46 PM, Joerg Roedel wrote:
> > That does not work. The bridge in question may not even be visible as a
> > PCI device, so you can't link to it. This is the case on a few PCIe
> > cards which only have a PCIx chip and a PCIe-2-PCIx bridge to implement
> > the PCIe interface (yes, I have seen those cards).
> 
> How does the kernel detect that devices behind the invisible bridge must 
> be assigned as a unit?

On the AMD IOMMU side this information is stored in the IVRS ACPI table.
Not sure about the VT-d side, though.

	Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-22 12:36                           ` Roedel, Joerg
  (?)
@ 2011-08-22 12:42                             ` Avi Kivity
  -1 siblings, 0 replies; 322+ messages in thread
From: Avi Kivity @ 2011-08-22 12:42 UTC (permalink / raw)
  To: Roedel, Joerg
  Cc: Alex Williamson, chrisw, Alexey Kardashevskiy, kvm,
	Paul Mackerras, Benjamin Herrenschmidt, qemu-devel, iommu,
	Anthony Liguori, linux-pci, linuxppc-dev, benve

On 08/22/2011 03:36 PM, Roedel, Joerg wrote:
> On Mon, Aug 22, 2011 at 06:51:35AM -0400, Avi Kivity wrote:
> >  On 08/22/2011 01:46 PM, Joerg Roedel wrote:
> >  >  That does not work. The bridge in question may not even be visible as a
> >  >  PCI device, so you can't link to it. This is the case on a few PCIe
> >  >  cards which only have a PCIx chip and a PCIe-2-PCIx bridge to implement
> >  >  the PCIe interface (yes, I have seen those cards).
> >
> >  How does the kernel detect that devices behind the invisible bridge must
> >  be assigned as a unit?
>
> On the AMD IOMMU side this information is stored in the IVRS ACPI table.
> Not sure about the VT-d side, though.
>

I see.  There is no sysfs node representing it?

I'd rather not add another meaningless identifier.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-22 12:42                             ` Avi Kivity
  (?)
@ 2011-08-22 12:55                               ` Roedel, Joerg
  -1 siblings, 0 replies; 322+ messages in thread
From: Roedel, Joerg @ 2011-08-22 12:55 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Alex Williamson, chrisw, Alexey Kardashevskiy, kvm,
	Paul Mackerras, Benjamin Herrenschmidt, qemu-devel, iommu,
	Anthony Liguori, linux-pci, linuxppc-dev, benve

On Mon, Aug 22, 2011 at 08:42:35AM -0400, Avi Kivity wrote:
> On 08/22/2011 03:36 PM, Roedel, Joerg wrote:
> > On the AMD IOMMU side this information is stored in the IVRS ACPI table.
> > Not sure about the VT-d side, though.
> 
> I see.  There is no sysfs node representing it?

No. It also doesn't exist as a 'struct pci_dev'. This caused problems in
the AMD IOMMU driver in the past and I needed to fix that, which is how I
know about it :)

> I'd rather not add another meaningless identifier.

Well, I don't think it's really meaningless, but we need some way to
communicate the information about device groups to userspace.

	Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-22 12:55                               ` Roedel, Joerg
  (?)
@ 2011-08-22 13:06                                 ` Avi Kivity
  -1 siblings, 0 replies; 322+ messages in thread
From: Avi Kivity @ 2011-08-22 13:06 UTC (permalink / raw)
  To: Roedel, Joerg
  Cc: Alex Williamson, chrisw, Alexey Kardashevskiy, kvm,
	Paul Mackerras, Benjamin Herrenschmidt, qemu-devel, iommu,
	Anthony Liguori, linux-pci, linuxppc-dev, benve

On 08/22/2011 03:55 PM, Roedel, Joerg wrote:
> On Mon, Aug 22, 2011 at 08:42:35AM -0400, Avi Kivity wrote:
> >  On 08/22/2011 03:36 PM, Roedel, Joerg wrote:
> >  >  On the AMD IOMMU side this information is stored in the IVRS ACPI table.
> >  >  Not sure about the VT-d side, though.
> >
> >  I see.  There is no sysfs node representing it?
>
> No. It also doesn't exist as a 'struct pci_dev'. This caused problems in
> the AMD IOMMU driver in the past and I needed to fix that. There I know
> that from :)

Well, too bad.

>
> >  I'd rather not add another meaningless identifier.
>
> Well, I don't think its really meaningless, but we need some way to
> communicate the information about device groups to userspace.
>

I mean the contents of the group descriptor.  There are enough 42s in 
the kernel, it's better if we can replace a synthetic number with 
something meaningful.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-22 13:06                                 ` Avi Kivity
  (?)
@ 2011-08-22 13:15                                   ` Roedel, Joerg
  -1 siblings, 0 replies; 322+ messages in thread
From: Roedel, Joerg @ 2011-08-22 13:15 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Alex Williamson, chrisw, Alexey Kardashevskiy, kvm,
	Paul Mackerras, Benjamin Herrenschmidt, qemu-devel, iommu,
	Anthony Liguori, linux-pci, linuxppc-dev, benve

On Mon, Aug 22, 2011 at 09:06:07AM -0400, Avi Kivity wrote:
> On 08/22/2011 03:55 PM, Roedel, Joerg wrote:

> > Well, I don't think its really meaningless, but we need some way to
> > communicate the information about device groups to userspace.
> 
> I mean the contents of the group descriptor.  There are enough 42s in 
> the kernel, it's better if we can replace a synthetic number with 
> something meaningful.

If we only look at PCI, then a Segment:Bus:Dev.Fn number would be
sufficient, of course. But the idea was to make it generic enough so
that it works with !PCI too.

	Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632


^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-22 13:15                                   ` Roedel, Joerg
  (?)
@ 2011-08-22 13:17                                     ` Avi Kivity
  -1 siblings, 0 replies; 322+ messages in thread
From: Avi Kivity @ 2011-08-22 13:17 UTC (permalink / raw)
  To: Roedel, Joerg
  Cc: Alexey Kardashevskiy, kvm, Paul Mackerras, qemu-devel, iommu,
	chrisw, Alex Williamson, linux-pci, linuxppc-dev, benve

On 08/22/2011 04:15 PM, Roedel, Joerg wrote:
> On Mon, Aug 22, 2011 at 09:06:07AM -0400, Avi Kivity wrote:
> >  On 08/22/2011 03:55 PM, Roedel, Joerg wrote:
>
> >  >  Well, I don't think its really meaningless, but we need some way to
> >  >  communicate the information about device groups to userspace.
> >
> >  I mean the contents of the group descriptor.  There are enough 42s in
> >  the kernel, it's better if we can replace a synthetic number with
> >  something meaningful.
>
> If we only look at PCI than a Segment:Bus:Dev.Fn Number would be
> sufficient, of course. But the idea was to make it generic enough so
> that it works with !PCI too.
>

We could make it an arch-defined string instead of a symlink.  Then it
wouldn't return 42, but rather something the admin can use to figure out
what the constraint was.
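
Purely as illustration (these strings are invented, no iommu driver
emits anything like them today), that could look like:

$ cat /sys/devices/pci0000:00/0000:00:19.0/iommu_group
pci-bridge:0000:00:1e.0

or, on Power, something like "partitionable-endpoint:PHB2-PE4", i.e. a
string that names the hardware constraint rather than an opaque number.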

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-22 13:17                                     ` Avi Kivity
  (?)
@ 2011-08-22 14:37                                       ` Roedel, Joerg
  -1 siblings, 0 replies; 322+ messages in thread
From: Roedel, Joerg @ 2011-08-22 14:37 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Alex Williamson, chrisw, Alexey Kardashevskiy, kvm,
	Paul Mackerras, Benjamin Herrenschmidt, qemu-devel, iommu,
	Anthony Liguori, linux-pci, linuxppc-dev, benve

On Mon, Aug 22, 2011 at 09:17:41AM -0400, Avi Kivity wrote:
> On 08/22/2011 04:15 PM, Roedel, Joerg wrote:
> > On Mon, Aug 22, 2011 at 09:06:07AM -0400, Avi Kivity wrote:
> > >  On 08/22/2011 03:55 PM, Roedel, Joerg wrote:
> >
> > >  >  Well, I don't think its really meaningless, but we need some way to
> > >  >  communicate the information about device groups to userspace.
> > >
> > >  I mean the contents of the group descriptor.  There are enough 42s in
> > >  the kernel, it's better if we can replace a synthetic number with
> > >  something meaningful.
> >
> > If we only look at PCI than a Segment:Bus:Dev.Fn Number would be
> > sufficient, of course. But the idea was to make it generic enough so
> > that it works with !PCI too.
> >
> 
> We could make it an arch defined string instead of a symlink.  So it 
> doesn't return 42, rather something that can be used by the admin to 
> figure out what the problem was.

Well, ok, it would then certainly differ from the in-kernel representation
and introduce new architecture dependencies into libvirt. But if the
'group-string' is more meaningful to users then it's certainly good.
Suggestions?

	Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-22  5:55                     ` David Gibson
@ 2011-08-22 15:45                       ` Alex Williamson
  -1 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-08-22 15:45 UTC (permalink / raw)
  To: David Gibson
  Cc: aafabbri, Alexey Kardashevskiy, kvm, Paul Mackerras, qemu-devel,
	chrisw, iommu, Avi Kivity, Anthony Liguori, linux-pci,
	linuxppc-dev, benve

On Mon, 2011-08-22 at 15:55 +1000, David Gibson wrote:
> On Sat, Aug 20, 2011 at 09:51:39AM -0700, Alex Williamson wrote:
> > We had an extremely productive VFIO BoF on Monday.  Here's my attempt to
> > capture the plan that I think we agreed to:
> > 
> > We need to address both the description and enforcement of device
> > groups.  Groups are formed any time the iommu does not have resolution
> > between a set of devices.  On x86, this typically happens when a
> > PCI-to-PCI bridge exists between the set of devices and the iommu.  For
> > Power, partitionable endpoints define a group.  Grouping information
> > needs to be exposed for both userspace and kernel internal usage.  This
> > will be a sysfs attribute setup by the iommu drivers.  Perhaps:
> > 
> > # cat /sys/devices/pci0000:00/0000:00:19.0/iommu_group
> > 42
> > 
> > (I use a PCI example here, but attribute should not be PCI specific)
> 
> Ok.  Am I correct in thinking these group IDs are representing the
> minimum granularity, and are therefore always static, defined only by
> the connected hardware, not by configuration?

Yes, that's the idea.  An open question I have towards the configuration
side is whether we might add iommu driver specific options to the
groups.  For instance on x86 where we typically have B:D.F granularity,
should we have an option not to trust multi-function devices and use a
B:D granularity for grouping?

> > >From there we have a few options.  In the BoF we discussed a model where
> > binding a device to vfio creates a /dev/vfio$GROUP character device
> > file.  This "group" fd provides provides dma mapping ioctls as well as
> > ioctls to enumerate and return a "device" fd for each attached member of
> > the group (similar to KVM_CREATE_VCPU).  We enforce grouping by
> > returning an error on open() of the group fd if there are members of the
> > group not bound to the vfio driver.  Each device fd would then support a
> > similar set of ioctls and mapping (mmio/pio/config) interface as current
> > vfio, except for the obvious domain and dma ioctls superseded by the
> > group fd.
> 
> It seems a slightly strange distinction that the group device appears
> when any device in the group is bound to vfio, but only becomes usable
> when all devices are bound.
> 
> > Another valid model might be that /dev/vfio/$GROUP is created for all
> > groups when the vfio module is loaded.  The group fd would allow open()
> > and some set of iommu querying and device enumeration ioctls, but would
> > error on dma mapping and retrieving device fds until all of the group
> > devices are bound to the vfio driver.
> 
> Which is why I marginally prefer this model, although it's not a big
> deal.

Right, we can also combine models.  Binding a device to vfio
creates /dev/vfio$GROUP, which only allows a subset of ioctls and no
device access until all the group devices are also bound.  I think
the /dev/vfio/$GROUP might help provide an enumeration interface as well
though, which could be useful.
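
To make the flow concrete, the qemu side of that combined model might
look something like the sketch below.  Every name here (the /dev path,
the ioctl numbers, the dma map structure) is a placeholder invented for
illustration -- the real ABI is exactly what's being designed in this
thread:

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>

struct vfio_dma_map {				/* placeholder layout */
	uint64_t vaddr;
	uint64_t iova;
	uint64_t size;
};

#define VFIO_GROUP_GET_DEVICE_FD _IOW(';', 0x00, char *)	/* placeholder */
#define VFIO_GROUP_MAP_DMA	 _IOW(';', 0x01, struct vfio_dma_map) /* placeholder */

int main(void)
{
	struct vfio_dma_map map = { 0 };
	int group, device;

	/* Appears once the first member of group 42 binds to vfio... */
	group = open("/dev/vfio42", O_RDWR);
	if (group < 0) {
		perror("open group");
		return 1;
	}

	/* ...but device access and dma mapping only succeed once every
	 * device in the group is bound (i.e. the group is "viable"). */
	device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:00:19.0");
	if (device < 0)
		fprintf(stderr, "group not viable yet\n");

	if (ioctl(group, VFIO_GROUP_MAP_DMA, &map) < 0)
		fprintf(stderr, "dma map refused\n");

	close(group);
	return 0;
}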

> > In either case, the uiommu interface is removed entirely since dma
> > mapping is done via the group fd.  As necessary in the future, we can
> > define a more high performance dma mapping interface for streaming dma
> > via the group fd.  I expect we'll also include architecture specific
> > group ioctls to describe features and capabilities of the iommu.  The
> > group fd will need to prevent concurrent open()s to maintain a 1:1 group
> > to userspace process ownership model.
> 
> A 1:1 group<->process correspondance seems wrong to me. But there are
> many ways you could legitimately write the userspace side of the code,
> many of them involving some sort of concurrency.  Implementing that
> concurrency as multiple processes (using explicit shared memory and/or
> other IPC mechanisms to co-ordinate) seems a valid choice that we
> shouldn't arbitrarily prohibit.
> 
> Obviously, only one UID may be permitted to have the group open at a
> time, and I think that's enough to prevent them doing any worse than
> shooting themselves in the foot.

1:1 group<->process is probably too strong.  Not allowing concurrent
open()s on the group file enforces that a single userspace entity is
responsible for that group.  Device fds can be passed to other
processes, but only retrieved via the group fd.  I suppose we could even
branch off the dma interface into a different fd, but it seems like we
would logically want to serialize dma mappings at each iommu group
anyway.  I'm open to alternatives, this just seemed an easy way to do
it.  Restricting on UID implies that we require isolated qemu instances
to run as different UIDs.  I know that's a goal, but I don't know if we
want to make it an assumption in the group security model.
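
For reference, handing a device fd to another process is just the
standard SCM_RIGHTS dance over a unix domain socket -- nothing
vfio-specific about it.  A minimal sketch of the sending side:

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

static int send_fd(int sock, int fd)
{
	char dummy = 'x';
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		struct cmsghdr align;
		char buf[CMSG_SPACE(sizeof(int))];
	} u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}

The receiver pulls the fd back out of the control message with
recvmsg() in the usual way.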

> > Also on the table is supporting non-PCI devices with vfio.  To do this,
> > we need to generalize the read/write/mmap and irq eventfd interfaces.
> > We could keep the same model of segmenting the device fd address space,
> > perhaps adding ioctls to define the segment offset bit position or we
> > could split each region into it's own fd (VFIO_GET_PCI_BAR_FD(0),
> > VFIO_GET_PCI_CONFIG_FD(), VFIO_GET_MMIO_FD(3)), though we're already
> > suffering some degree of fd bloat (group fd, device fd(s), interrupt
> > event fd(s), per resource fd, etc).  For interrupts we can overload
> > VFIO_SET_IRQ_EVENTFD to be either PCI INTx or non-PCI irq 
> 
> Sounds reasonable.
> 
> > (do non-PCI
> > devices support MSI?).
> 
> They can.  Obviously they might not have exactly the same semantics as
> PCI MSIs, but I know we have SoC systems with (non-PCI) on-die devices
> whose interrupts are treated by the (also on-die) root interrupt
> controller in the same way as PCI MSIs.

Ok, I suppose we can define ioctls to enable these as we go.  We also
need to figure out how non-PCI resources, interrupts, and iommu mapping
restrictions are described via vfio.

> > For qemu, these changes imply we'd only support a model where we have a
> > 1:1 group to iommu domain.  The current vfio driver could probably
> > become vfio-pci as we might end up with more target specific vfio
> > drivers for non-pci.  PCI should be able to maintain a simple -device
> > vfio-pci,host=bb:dd.f to enable hotplug of individual devices.  We'll
> > need to come up with extra options when we need to expose groups to
> > guest for pvdma.
> 
> Are you saying that you'd no longer support the current x86 usage of
> putting all of one guest's devices into a single domain?

Yes.  I'm not sure there's a good ROI to prioritize that model.  We have
to assume >1 device per guest is a typical model and that the iotlb is
large enough that we might improve thrashing to see both a resource and
performance benefit from it.  I'm open to suggestions for how we could
include it though.

> If that's
> not what you're saying, how would the domains - now made up of a
> user's selection of groups, rather than individual devices - be
> configured?
> 
> > Hope that captures it, feel free to jump in with corrections and
> > suggestions.  Thanks,
> 

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-20 16:51                   ` Alex Williamson
  (?)
@ 2011-08-22 17:25                     ` Joerg Roedel
  -1 siblings, 0 replies; 322+ messages in thread
From: Joerg Roedel @ 2011-08-22 17:25 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, kvm, Paul Mackerras, qemu-devel, chrisw,
	iommu, Avi Kivity, linux-pci, linuxppc-dev, benve

On Sat, Aug 20, 2011 at 12:51:39PM -0400, Alex Williamson wrote:
> We had an extremely productive VFIO BoF on Monday.  Here's my attempt to
> capture the plan that I think we agreed to:
> 
> We need to address both the description and enforcement of device
> groups.  Groups are formed any time the iommu does not have resolution
> between a set of devices.  On x86, this typically happens when a
> PCI-to-PCI bridge exists between the set of devices and the iommu.  For
> Power, partitionable endpoints define a group.  Grouping information
> needs to be exposed for both userspace and kernel internal usage.  This
> will be a sysfs attribute setup by the iommu drivers.  Perhaps:
> 
> # cat /sys/devices/pci0000:00/0000:00:19.0/iommu_group
> 42

Right, that is mainly for libvirt to provide that information to the
user in a meaningful way. So userspace is aware that other devices might
not work anymore when it assigns one to a guest.
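
Just to illustrate what libvirt (or any other tool) could do with it,
assuming the attribute stays a plain number as in the example above --
rough sketch, most error handling omitted:

#include <stdio.h>
#include <string.h>
#include <dirent.h>

/* List all PCI devices sharing an iommu_group with the device named on
 * the command line (e.g. "0000:00:19.0"), by comparing the proposed
 * sysfs attribute.
 */
static int read_group(const char *dev, char *buf, int len)
{
        char path[256];
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/bus/pci/devices/%s/iommu_group", dev);
        f = fopen(path, "r");
        if (!f)
                return -1;
        if (!fgets(buf, len, f))
                buf[0] = '\0';
        fclose(f);
        return 0;
}

int main(int argc, char **argv)
{
        char want[64], got[64];
        struct dirent *d;
        DIR *dir;

        if (argc < 2 || read_group(argv[1], want, sizeof(want)))
                return 1;

        dir = opendir("/sys/bus/pci/devices");
        if (!dir)
                return 1;

        while ((d = readdir(dir)) != NULL) {
                if (d->d_name[0] == '.')
                        continue;
                if (!read_group(d->d_name, got, sizeof(got)) &&
                    !strcmp(got, want))
                        printf("%s\n", d->d_name);      /* same group */
        }
        closedir(dir);
        return 0;
}

If the attribute ends up being a symlink instead, the comparison simply
becomes a readlink() instead of a read.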

> 
> (I use a PCI example here, but attribute should not be PCI specific)
> 
> From there we have a few options.  In the BoF we discussed a model where
> binding a device to vfio creates a /dev/vfio$GROUP character device
> file.  This "group" fd provides dma mapping ioctls as well as
> ioctls to enumerate and return a "device" fd for each attached member of
> the group (similar to KVM_CREATE_VCPU).  We enforce grouping by
> returning an error on open() of the group fd if there are members of the
> group not bound to the vfio driver.  Each device fd would then support a
> similar set of ioctls and mapping (mmio/pio/config) interface as current
> vfio, except for the obvious domain and dma ioctls superseded by the
> group fd.
> 
> Another valid model might be that /dev/vfio/$GROUP is created for all
> groups when the vfio module is loaded.  The group fd would allow open()
> and some set of iommu querying and device enumeration ioctls, but would
> error on dma mapping and retrieving device fds until all of the group
> devices are bound to the vfio driver.

I am in favour of /dev/vfio/$GROUP. If multiple devices should be
assigned to a guest, there can also be an ioctl to bind a group to an
address-space of another group (certainly needs some care to not allow
that both groups belong to different processes).

Btw, a problem we haven't talked about yet is driver de-assignment.
User space can decide to de-assign the device from vfio while an fd is
still open on it. With PCI there is no way to let this fail (the
.release function returns void, last time I checked). Is this a
problem, and if yes, how do we handle it?


	Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-22 17:25                     ` Joerg Roedel
  (?)
@ 2011-08-22 19:17                       ` Alex Williamson
  -1 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-08-22 19:17 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Benjamin Herrenschmidt, chrisw, Alexey Kardashevskiy, kvm,
	Paul Mackerras, linux-pci, qemu-devel, iommu, Avi Kivity,
	Anthony Liguori, linuxppc-dev, benve

On Mon, 2011-08-22 at 19:25 +0200, Joerg Roedel wrote:
> On Sat, Aug 20, 2011 at 12:51:39PM -0400, Alex Williamson wrote:
> > We had an extremely productive VFIO BoF on Monday.  Here's my attempt to
> > capture the plan that I think we agreed to:
> > 
> > We need to address both the description and enforcement of device
> > groups.  Groups are formed any time the iommu does not have resolution
> > between a set of devices.  On x86, this typically happens when a
> > PCI-to-PCI bridge exists between the set of devices and the iommu.  For
> > Power, partitionable endpoints define a group.  Grouping information
> > needs to be exposed for both userspace and kernel internal usage.  This
> > will be a sysfs attribute setup by the iommu drivers.  Perhaps:
> > 
> > # cat /sys/devices/pci0000:00/0000:00:19.0/iommu_group
> > 42
> 
> Right, that is mainly for libvirt to provide that information to the
> user in a meaningful way. So userspace is aware that other devices might
> not work anymore when it assigns one to a guest.
> 
> > 
> > (I use a PCI example here, but attribute should not be PCI specific)
> > 
> > From there we have a few options.  In the BoF we discussed a model where
> > binding a device to vfio creates a /dev/vfio$GROUP character device
> > file.  This "group" fd provides dma mapping ioctls as well as
> > ioctls to enumerate and return a "device" fd for each attached member of
> > the group (similar to KVM_CREATE_VCPU).  We enforce grouping by
> > returning an error on open() of the group fd if there are members of the
> > group not bound to the vfio driver.  Each device fd would then support a
> > similar set of ioctls and mapping (mmio/pio/config) interface as current
> > vfio, except for the obvious domain and dma ioctls superseded by the
> > group fd.
> > 
> > Another valid model might be that /dev/vfio/$GROUP is created for all
> > groups when the vfio module is loaded.  The group fd would allow open()
> > and some set of iommu querying and device enumeration ioctls, but would
> > error on dma mapping and retrieving device fds until all of the group
> > devices are bound to the vfio driver.
> 
> I am in favour of /dev/vfio/$GROUP. If multiple devices should be
> assigned to a guest, there can also be an ioctl to bind a group to an
> address-space of another group (certainly needs some care to not allow
> that both groups belong to different processes).

That's an interesting idea.  Maybe an interface similar to the current
uiommu interface, where you open() the 2nd group fd and pass the fd via
ioctl to the primary group.  IOMMUs that don't support this would fail
the attach device callback, which would fail the ioctl to bind them.  It
will need to be designed so any group can be removed from the super-set
and the remaining group(s) still work.  This feels like something that
can be added after we get an initial implementation.
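
From the user's side that might look like the below -- VFIO_GROUP_MERGE
is a name and number I just made up here, the interesting part is only
the call flow and the fallback:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>

#define VFIO_GROUP_MERGE        0x3b64  /* hypothetical ioctl number */

int main(void)
{
        /* Two groups the user owns; both must already be fully bound
         * to vfio for the open()s to succeed.
         */
        int a = open("/dev/vfio/42", O_RDWR);
        int b = open("/dev/vfio/43", O_RDWR);

        if (a < 0 || b < 0)
                return 1;

        if (ioctl(a, VFIO_GROUP_MERGE, b) == 0)
                printf("shared domain: map once through group a\n");
        else
                printf("no merge support: map through each group fd\n");

        return 0;
}

Pulling a group back out of the super-set could then be the reverse
ioctl, or an implicit side effect of closing its fd, which keeps the
"remaining groups still work" requirement simple.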

> Btw, a problem we haven't talked about yet is driver de-assignment.
> User space can decide to de-assign the device from vfio while an fd is
> still open on it. With PCI there is no way to let this fail (the
> .release function returns void, last time I checked). Is this a
> problem, and if yes, how do we handle it?

The current vfio has the same problem: we can't unbind a device from
vfio while it's attached to a guest.  I think we'd use the same solution
too; send out a netlink packet for a device removal and have the .remove
call sleep on a wait_event(, refcnt == 0).  We could also set a timeout
and SIGBUS the PIDs holding the device if they don't return it
willingly.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-20 16:51                   ` Alex Williamson
  (?)
@ 2011-08-22 20:29                     ` aafabbri
  -1 siblings, 0 replies; 322+ messages in thread
From: aafabbri @ 2011-08-22 20:29 UTC (permalink / raw)
  To: Alex Williamson, Benjamin Herrenschmidt
  Cc: Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci, qemu-devel,
	chrisw, iommu, Avi Kivity, linuxppc-dev, benve




On 8/20/11 9:51 AM, "Alex Williamson" <alex.williamson@redhat.com> wrote:

> We had an extremely productive VFIO BoF on Monday.  Here's my attempt to
> capture the plan that I think we agreed to:
> 
> We need to address both the description and enforcement of device
> groups.  Groups are formed any time the iommu does not have resolution
> between a set of devices.  On x86, this typically happens when a
> PCI-to-PCI bridge exists between the set of devices and the iommu.  For
> Power, partitionable endpoints define a group.  Grouping information
> needs to be exposed for both userspace and kernel internal usage.  This
> will be a sysfs attribute setup by the iommu drivers.  Perhaps:
> 
> # cat /sys/devices/pci0000:00/0000:00:19.0/iommu_group
> 42
> 
> (I use a PCI example here, but attribute should not be PCI specific)
> 
> From there we have a few options.  In the BoF we discussed a model where
> binding a device to vfio creates a /dev/vfio$GROUP character device
> file.  This "group" fd provides dma mapping ioctls as well as
> ioctls to enumerate and return a "device" fd for each attached member of
> the group (similar to KVM_CREATE_VCPU).  We enforce grouping by
> returning an error on open() of the group fd if there are members of the
> group not bound to the vfio driver.

Sounds reasonable.

> Each device fd would then support a
> similar set of ioctls and mapping (mmio/pio/config) interface as current
> vfio, except for the obvious domain and dma ioctls superseded by the
> group fd.
> 
> Another valid model might be that /dev/vfio/$GROUP is created for all
> groups when the vfio module is loaded.  The group fd would allow open()
> and some set of iommu querying and device enumeration ioctls, but would
> error on dma mapping and retrieving device fds until all of the group
> devices are bound to the vfio driver.
> 
> In either case, the uiommu interface is removed entirely since dma
> mapping is done via the group fd.

The loss in generality is unfortunate. I'd like to be able to support
arbitrary iommu domain <-> device assignment.  One way to do this would be
to keep uiommu, but to return an error if someone tries to assign more than
one uiommu context to devices in the same group.
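
The check I'm thinking of is tiny on the kernel side, something like
this (structures obviously invented for the sketch):

#include <linux/kernel.h>
#include <linux/list.h>
#include <linux/errno.h>

struct uiommu_domain;                   /* opaque for this sketch */

struct vfio_dev {
        struct list_head group_next;    /* link in the group's list */
        struct uiommu_domain *domain;   /* NULL until assigned */
};

struct vfio_group {
        struct list_head devices;
};

/* Refuse to attach a device to a uiommu context if any other device in
 * its group already sits in a different one.
 */
static int vfio_attach_uiommu(struct vfio_group *group,
                              struct vfio_dev *vdev,
                              struct uiommu_domain *dom)
{
        struct vfio_dev *tmp;

        list_for_each_entry(tmp, &group->devices, group_next)
                if (tmp->domain && tmp->domain != dom)
                        return -EBUSY;  /* group is owned elsewhere */

        vdev->domain = dom;
        return 0;
}

That keeps the generality for iommus that can do arbitrary assignment
and degenerates to exactly the group model where they can't.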


-Aaron

> As necessary in the future, we can
> define a more high performance dma mapping interface for streaming dma
> via the group fd.  I expect we'll also include architecture specific
> group ioctls to describe features and capabilities of the iommu.  The
> group fd will need to prevent concurrent open()s to maintain a 1:1 group
> to userspace process ownership model.
> 
> Also on the table is supporting non-PCI devices with vfio.  To do this,
> we need to generalize the read/write/mmap and irq eventfd interfaces.
> We could keep the same model of segmenting the device fd address space,
> perhaps adding ioctls to define the segment offset bit position or we
> could split each region into its own fd (VFIO_GET_PCI_BAR_FD(0),
> VFIO_GET_PCI_CONFIG_FD(), VFIO_GET_MMIO_FD(3)), though we're already
> suffering some degree of fd bloat (group fd, device fd(s), interrupt
> event fd(s), per resource fd, etc).  For interrupts we can overload
> VFIO_SET_IRQ_EVENTFD to be either PCI INTx or non-PCI irq (do non-PCI
> devices support MSI?).
> 
> For qemu, these changes imply we'd only support a model where we have a
> 1:1 group to iommu domain.  The current vfio driver could probably
> become vfio-pci as we might end up with more target specific vfio
> drivers for non-pci.  PCI should be able to maintain a simple -device
> vfio-pci,host=bb:dd.f to enable hotplug of individual devices.  We'll
> need to come up with extra options when we need to expose groups to
> guest for pvdma.
> 
> Hope that captures it, feel free to jump in with corrections and
> suggestions.  Thanks,
> 
> Alex
> 

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-22 20:29                     ` aafabbri
@ 2011-08-22 20:49                       ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 322+ messages in thread
From: Benjamin Herrenschmidt @ 2011-08-22 20:49 UTC (permalink / raw)
  To: aafabbri
  Cc: Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci, qemu-devel,
	iommu, chrisw, Alex Williamson, Avi Kivity, Anthony Liguori,
	linuxppc-dev, benve

On Mon, 2011-08-22 at 13:29 -0700, aafabbri wrote:

> > Each device fd would then support a
> > similar set of ioctls and mapping (mmio/pio/config) interface as current
> > vfio, except for the obvious domain and dma ioctls superseded by the
> > group fd.
> > 
> > Another valid model might be that /dev/vfio/$GROUP is created for all
> > groups when the vfio module is loaded.  The group fd would allow open()
> > and some set of iommu querying and device enumeration ioctls, but would
> > error on dma mapping and retrieving device fds until all of the group
> > devices are bound to the vfio driver.
> > 
> > In either case, the uiommu interface is removed entirely since dma
> > mapping is done via the group fd.
> 
> The loss in generality is unfortunate. I'd like to be able to support
> arbitrary iommu domain <-> device assignment.  One way to do this would be
> to keep uiommu, but to return an error if someone tries to assign more than
> one uiommu context to devices in the same group.

I wouldn't use uiommu for that. If the HW or underlying kernel drivers
support it, what I'd suggest is that you have an (optional) ioctl to
bind two groups (you have to have both opened already) or for one group
to "capture" another one.

The binding means under the hood the iommus get shared, with the
lifetime being that of the "owning" group.

Another option is to make that a static configuration API via special
ioctls (or even netlink if you really like it), to change the grouping
on architectures that allow it.
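
To sketch the "capture" semantics (nothing below exists, the structures
are invented, and whether the iommus can actually be shared is of course
HW/arch specific):

#include <linux/kref.h>
#include <linux/iommu.h>
#include <linux/errno.h>

struct vfio_group {
        struct kref kref;
        struct iommu_domain *domain;
        struct vfio_group *owner;       /* non-NULL once captured */
};

static int vfio_group_capture(struct vfio_group *owner,
                              struct vfio_group *victim)
{
        if (victim->owner)
                return -EBUSY;          /* already captured */

        /* HW/arch specific check whether the two iommus can be shared
         * would go here, failing with -EINVAL if not.
         */

        kref_get(&owner->kref);         /* victim pins the owning group */
        victim->owner = owner;
        victim->domain = owner->domain; /* real code would re-attach the
                                         * victim's devices to this domain */
        return 0;
}

The lifetime rule then falls out naturally: the shared iommu context
goes away with the owning group, not with the captured one.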

Cheers.
Ben.

> 
> -Aaron
> 
> > As necessary in the future, we can
> > define a more high performance dma mapping interface for streaming dma
> > via the group fd.  I expect we'll also include architecture specific
> > group ioctls to describe features and capabilities of the iommu.  The
> > group fd will need to prevent concurrent open()s to maintain a 1:1 group
> > to userspace process ownership model.
> > 
> > Also on the table is supporting non-PCI devices with vfio.  To do this,
> > we need to generalize the read/write/mmap and irq eventfd interfaces.
> > We could keep the same model of segmenting the device fd address space,
> > perhaps adding ioctls to define the segment offset bit position or we
> > could split each region into its own fd (VFIO_GET_PCI_BAR_FD(0),
> > VFIO_GET_PCI_CONFIG_FD(), VFIO_GET_MMIO_FD(3)), though we're already
> > suffering some degree of fd bloat (group fd, device fd(s), interrupt
> > event fd(s), per resource fd, etc).  For interrupts we can overload
> > VFIO_SET_IRQ_EVENTFD to be either PCI INTx or non-PCI irq (do non-PCI
> > devices support MSI?).
> > 
> > For qemu, these changes imply we'd only support a model where we have a
> > 1:1 group to iommu domain.  The current vfio driver could probably
> > become vfio-pci as we might end up with more target specific vfio
> > drivers for non-pci.  PCI should be able to maintain a simple -device
> > vfio-pci,host=bb:dd.f to enable hotplug of individual devices.  We'll
> > need to come up with extra options when we need to expose groups to
> > guest for pvdma.
> > 
> > Hope that captures it, feel free to jump in with corrections and
> > suggestions.  Thanks,
> > 
> > Alex
> > 

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-22  6:30                     ` Avi Kivity
  (?)
@ 2011-08-22 20:53                       ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 322+ messages in thread
From: Benjamin Herrenschmidt @ 2011-08-22 20:53 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Alex Williamson, aafabbri, Alexey Kardashevskiy, kvm,
	Paul Mackerras, qemu-devel, chrisw, iommu, Anthony Liguori,
	linux-pci, linuxppc-dev, benve

On Mon, 2011-08-22 at 09:30 +0300, Avi Kivity wrote:
> On 08/20/2011 07:51 PM, Alex Williamson wrote:
> > We need to address both the description and enforcement of device
> > groups.  Groups are formed any time the iommu does not have resolution
> > between a set of devices.  On x86, this typically happens when a
> > PCI-to-PCI bridge exists between the set of devices and the iommu.  For
> > Power, partitionable endpoints define a group.  Grouping information
> > needs to be exposed for both userspace and kernel internal usage.  This
> > will be a sysfs attribute setup by the iommu drivers.  Perhaps:
> >
> > # cat /sys/devices/pci0000:00/0000:00:19.0/iommu_group
> > 42
> >
> 
> $ readlink /sys/devices/pci0000:00/0000:00:19.0/iommu_group
> ../../../path/to/device/which/represents/the/resource/constraint
> 
> (the pci-to-pci bridge on x86, or whatever node represents partitionable 
> endpoints on power)

The constraint might not necessarily be a device.

The PCI bridge is just an example. There are other possible constraints.
On POWER for example, it could be a limit in how far I can segment the
DMA address space, forcing me to arbitrarily put devices together, or it
could be a similar constraint related to how the MMIO space is broken
up.

So either that remains a path, in which case we do have a separate set
of sysfs nodes representing the groups themselves (which may or may not
contain a pointer to the "constraining" device), or we just make that an
arbitrary number (in my case the PE#).

Cheers,
Ben

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-22 15:45                       ` [Qemu-devel] " Alex Williamson
  (?)
@ 2011-08-22 21:01                         ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 322+ messages in thread
From: Benjamin Herrenschmidt @ 2011-08-22 21:01 UTC (permalink / raw)
  To: Alex Williamson
  Cc: aafabbri, Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	qemu-devel, David Gibson, chrisw, iommu, Avi Kivity,
	linuxppc-dev, benve

On Mon, 2011-08-22 at 09:45 -0600, Alex Williamson wrote:

> Yes, that's the idea.  An open question I have towards the configuration
> side is whether we might add iommu driver specific options to the
> groups.  For instance on x86 where we typically have B:D.F granularity,
> should we have an option not to trust multi-function devices and use a
> B:D granularity for grouping?

Or even B or range of busses... if you want to enforce strict isolation
you really can't trust anything below a bus level :-)

> Right, we can also combine models.  Binding a device to vfio
> creates /dev/vfio$GROUP, which only allows a subset of ioctls and no
> device access until all the group devices are also bound.  I think
> the /dev/vfio/$GROUP might help provide an enumeration interface as well
> though, which could be useful.

Could be, though in what form?  Returning sysfs paths?

> 1:1 group<->process is probably too strong.  Not allowing concurrent
> open()s on the group file enforces a single userspace entity is
> responsible for that group.  Device fds can be passed to other
> processes, but only retrieved via the group fd.  I suppose we could even
> branch off the dma interface into a different fd, but it seems like we
> would logically want to serialize dma mappings at each iommu group
> anyway.  I'm open to alternatives, this just seemed an easy way to do
> it.  Restricting on UID implies that we require isolated qemu instances
> to run as different UIDs.  I know that's a goal, but I don't know if we
> want to make it an assumption in the group security model.

1:1 process has the advantage of linking to an -mm which makes the whole
mmu notifier business doable. How do you want to track down mappings and
do the second level translation in the case of explicit map/unmap (like
on power) if you are not tied to an mm_struct ?
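
To illustrate: with the 1:1 model the group fd can simply latch the mm
at open time and every later map/unmap ioctl operates on that one
mm_struct -- rough sketch, the structure is invented:

#include <linux/sched.h>
#include <linux/mm.h>
#include <linux/errno.h>

struct vfio_group {
        struct mm_struct *mm;   /* the mm all mappings are done against */
};

static int vfio_group_open(struct vfio_group *group)
{
        group->mm = get_task_mm(current);       /* takes a reference */
        if (!group->mm)
                return -ESRCH;
        /* mmu notifier registration, locked_vm accounting etc. can all
         * anchor on group->mm from here on.
         */
        return 0;
}

static void vfio_group_release(struct vfio_group *group)
{
        if (group->mm)
                mmput(group->mm);               /* drop the reference */
}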

> Yes.  I'm not sure there's a good ROI to prioritize that model.  We have
> to assume >1 device per guest is a typical model and that the iotlb is
> large enough that we might improve thrashing to see both a resource and
> performance benefit from it.  I'm open to suggestions for how we could
> include it though.

Sharing may or may not be possible depending on setups so yes, it's a
bit tricky.

My preference is to have a static interface (and that's actually where
your pet netlink might make some sense :-) to create "synthetic" groups
made of other groups if the arch allows it. But that might not be the
best approach. In another email I also proposed an option for a group to
"capture" another one...

> > If that's
> > not what you're saying, how would the domains - now made up of a
> > user's selection of groups, rather than individual devices - be
> > configured?
> > 
> > > Hope that captures it, feel free to jump in with corrections and
> > > suggestions.  Thanks,
> > 

Another aspect I don't see discussed is how we represent these things to
the guest.

On Power, for example, I have a requirement that a given iommu domain is
represented by a single dma window property in the device-tree. What
that means is that the property needs to be either in the node of the
device itself, if there's only one device in the group, or in a parent
node (i.e. a bridge or host bridge) if there are multiple devices.

Now I do -not- want to go down the path of simulating P2P bridges;
besides, we'll quickly run out of bus numbers if we go there.

For us the simplest and most logical approach (which is also what pHyp
uses and what Linux handles well) is really to expose one PCI host
bridge per group to the guest. Believe it or not, it makes things
easier :-)

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-22 17:25                     ` Joerg Roedel
  (?)
@ 2011-08-22 21:03                       ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 322+ messages in thread
From: Benjamin Herrenschmidt @ 2011-08-22 21:03 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Alex Williamson, chrisw, Alexey Kardashevskiy, kvm,
	Paul Mackerras, linux-pci, qemu-devel, iommu, Avi Kivity,
	Anthony Liguori, linuxppc-dev, benve


> I am in favour of /dev/vfio/$GROUP. If multiple devices should be
> assigned to a guest, there can also be an ioctl to bind a group to an
> address-space of another group (certainly needs some care to not allow
> that both groups belong to different processes).
> 
> Btw, a problem we havn't talked about yet entirely is
> driver-deassignment. User space can decide to de-assign the device from
> vfio while a fd is open on it. With PCI there is no way to let this fail
> (the .release function returns void last time i checked). Is this a
> problem, and yes, how we handle that?

We can treat it as a hard unplug (like a CardBus card going away).

I.e. dispose of the direct mappings (switch to MMIO emulation), return
all ff's from reads and ignore writes.
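
A trivial sketch of the emulation side of that, purely illustrative
helpers:

#include <stdint.h>

/* Once the device is gone, reads of its (now emulated) MMIO space return
 * all ff's for the access width, and writes are silently discarded. */
static uint64_t dead_device_read(unsigned int size)
{
        return size >= 8 ? ~0ULL : (1ULL << (8 * size)) - 1;
}

static void dead_device_write(uint64_t val, unsigned int size)
{
        (void)val;
        (void)size;             /* ignore writes */
}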

Then send an unplug event via whatever mechanism the platform provides
(an ACPI hotplug controller on x86, for example; we haven't quite sorted
out what to do on power for hotplug yet).

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-22 20:49                       ` [Qemu-devel] " Benjamin Herrenschmidt
  (?)
@ 2011-08-22 21:38                         ` aafabbri
  -1 siblings, 0 replies; 322+ messages in thread
From: aafabbri @ 2011-08-22 21:38 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alex Williamson, Avi Kivity, Alexey Kardashevskiy, kvm,
	Paul Mackerras, qemu-devel, chrisw, iommu, Anthony Liguori,
	linux-pci, linuxppc-dev, benve




On 8/22/11 1:49 PM, "Benjamin Herrenschmidt" <benh@kernel.crashing.org>
wrote:

> On Mon, 2011-08-22 at 13:29 -0700, aafabbri wrote:
> 
>>> Each device fd would then support a
>>> similar set of ioctls and mapping (mmio/pio/config) interface as current
>>> vfio, except for the obvious domain and dma ioctls superseded by the
>>> group fd.
>>> 
>>> Another valid model might be that /dev/vfio/$GROUP is created for all
>>> groups when the vfio module is loaded.  The group fd would allow open()
>>> and some set of iommu querying and device enumeration ioctls, but would
>>> error on dma mapping and retrieving device fds until all of the group
>>> devices are bound to the vfio driver.
>>> 
>>> In either case, the uiommu interface is removed entirely since dma
>>> mapping is done via the group fd.
>> 
>> The loss in generality is unfortunate. I'd like to be able to support
>> arbitrary iommu domain <-> device assignment.  One way to do this would be
>> to keep uiommu, but to return an error if someone tries to assign more than
>> one uiommu context to devices in the same group.
> 
> I wouldn't use uiommu for that.

Any particular reason besides saving a file descriptor?

We use it today, and it seems like a cleaner API than what you propose
changing it to.

> If the HW or underlying kernel drivers
> support it, what I'd suggest is that you have an (optional) ioctl to
> bind two groups (you have to have both opened already) or for one group
> to "capture" another one.

You'll need other rules there too... "both opened already, but with zero
mappings performed yet, as they would have instantiated a default IOMMU
domain".

Keep in mind the only case I'm using is singleton groups, a.k.a. devices.

Since what I want is to specify which devices can do things like share
network buffers (in a way that conserves IOMMU hw resources), it seems
cleanest to expose this explicitly, versus some "inherit iommu domain from
another device" ioctl.  What happens if I do something like this:

dev1_fd = open ("/dev/vfio0")
dev2_fd = open ("/dev/vfio1")
dev2_fd.inherit_iommu(dev1_fd)

error = close(dev1_fd)

There are other gross cases as well.

> 
> The binding means under the hood the iommus get shared, with the
> lifetime being that of the "owning" group.

So what happens in the close() above?  EBUSY?  Reset all children?  Still
seems less clean than having an explicit iommu fd.  Without some benefit,
I'm not sure why we'd want to change this API.

If we in singleton-group land were building our own "groups" which were sets
of devices sharing the IOMMU domains we wanted, I suppose we could do away
with uiommu fds, but it sounds like the current proposal would create 20
singleton groups (x86 iommu w/o PCI bridges => all devices are partitionable
endpoints).  Asking me to ioctl(inherit) them together into a blob sounds
worse than the current explicit uiommu API.

Thanks,
Aaron

> 
> Another option is to make that static configuration APIs via special
> ioctls (or even netlink if you really like it), to change the grouping
> on architectures that allow it.
> 
> Cheers.
> Ben.
> 
>> 
>> -Aaron
>> 
>>> As necessary in the future, we can
>>> define a more high performance dma mapping interface for streaming dma
>>> via the group fd.  I expect we'll also include architecture specific
>>> group ioctls to describe features and capabilities of the iommu.  The
>>> group fd will need to prevent concurrent open()s to maintain a 1:1 group
>>> to userspace process ownership model.
>>> 
>>> Also on the table is supporting non-PCI devices with vfio.  To do this,
>>> we need to generalize the read/write/mmap and irq eventfd interfaces.
>>> We could keep the same model of segmenting the device fd address space,
>>> perhaps adding ioctls to define the segment offset bit position or we
>>> could split each region into it's own fd (VFIO_GET_PCI_BAR_FD(0),
>>> VFIO_GET_PCI_CONFIG_FD(), VFIO_GET_MMIO_FD(3)), though we're already
>>> suffering some degree of fd bloat (group fd, device fd(s), interrupt
>>> event fd(s), per resource fd, etc).  For interrupts we can overload
>>> VFIO_SET_IRQ_EVENTFD to be either PCI INTx or non-PCI irq (do non-PCI
>>> devices support MSI?).
>>> 
>>> For qemu, these changes imply we'd only support a model where we have a
>>> 1:1 group to iommu domain.  The current vfio driver could probably
>>> become vfio-pci as we might end up with more target specific vfio
>>> drivers for non-pci.  PCI should be able to maintain a simple -device
>>> vfio-pci,host=bb:dd.f to enable hotplug of individual devices.  We'll
>>> need to come up with extra options when we need to expose groups to
>>> guest for pvdma.
>>> 
>>> Hope that captures it, feel free to jump in with corrections and
>>> suggestions.  Thanks,
>>> 
>>> Alex
>>> 
> 
> 

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-22 21:38                         ` aafabbri
  (?)
@ 2011-08-22 21:49                           ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 322+ messages in thread
From: Benjamin Herrenschmidt @ 2011-08-22 21:49 UTC (permalink / raw)
  To: aafabbri
  Cc: Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci, qemu-devel,
	iommu, chrisw, Alex Williamson, Avi Kivity, linuxppc-dev, benve


> > I wouldn't use uiommu for that.
> 
> Any particular reason besides saving a file descriptor?
> 
> We use it today, and it seems like a cleaner API than what you propose
> changing it to.

Well for one, we are back to square one vs. grouping constraints.

 .../...

> If we in singleton-group land were building our own "groups" which were sets
> of devices sharing the IOMMU domains we wanted, I suppose we could do away
> with uiommu fds, but it sounds like the current proposal would create 20
> singleton groups (x86 iommu w/o PCI bridges => all devices are partitionable
> endpoints).  Asking me to ioctl(inherit) them together into a blob sounds
> worse than the current explicit uiommu API.

I'd rather have an API to create super-groups (groups of groups)
statically; you could then use such groups as normal groups through the
same interface. That creation/management process could be done via a
simple command line utility or via sysfs banging, whatever...
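
Purely to make the "sysfs banging" option concrete, it could be as dumb
as the following; the sysfs path and format are completely made up:

#include <stdio.h>

int main(void)
{
        /* Invented attribute: merge groups 10 and 11 into a new synthetic
         * super-group, which then shows up as a group like any other. */
        FILE *f = fopen("/sys/kernel/vfio/create_supergroup", "w");

        if (!f)
                return 1;
        fprintf(f, "10 11\n");
        return fclose(f) ? 1 : 0;
}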

Cheers,
Ben.

> Thanks,
> Aaron
> 
> > 
> > Another option is to make that static configuration APIs via special
> > ioctls (or even netlink if you really like it), to change the grouping
> > on architectures that allow it.
> > 
> > Cheers.
> > Ben.
> > 
> >> 
> >> -Aaron
> >> 
> >>> As necessary in the future, we can
> >>> define a more high performance dma mapping interface for streaming dma
> >>> via the group fd.  I expect we'll also include architecture specific
> >>> group ioctls to describe features and capabilities of the iommu.  The
> >>> group fd will need to prevent concurrent open()s to maintain a 1:1 group
> >>> to userspace process ownership model.
> >>> 
> >>> Also on the table is supporting non-PCI devices with vfio.  To do this,
> >>> we need to generalize the read/write/mmap and irq eventfd interfaces.
> >>> We could keep the same model of segmenting the device fd address space,
> >>> perhaps adding ioctls to define the segment offset bit position or we
> >>> could split each region into it's own fd (VFIO_GET_PCI_BAR_FD(0),
> >>> VFIO_GET_PCI_CONFIG_FD(), VFIO_GET_MMIO_FD(3)), though we're already
> >>> suffering some degree of fd bloat (group fd, device fd(s), interrupt
> >>> event fd(s), per resource fd, etc).  For interrupts we can overload
> >>> VFIO_SET_IRQ_EVENTFD to be either PCI INTx or non-PCI irq (do non-PCI
> >>> devices support MSI?).
> >>> 
> >>> For qemu, these changes imply we'd only support a model where we have a
> >>> 1:1 group to iommu domain.  The current vfio driver could probably
> >>> become vfio-pci as we might end up with more target specific vfio
> >>> drivers for non-pci.  PCI should be able to maintain a simple -device
> >>> vfio-pci,host=bb:dd.f to enable hotplug of individual devices.  We'll
> >>> need to come up with extra options when we need to expose groups to
> >>> guest for pvdma.
> >>> 
> >>> Hope that captures it, feel free to jump in with corrections and
> >>> suggestions.  Thanks,
> >>> 
> >>> Alex
> >>> 
> > 
> > 

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-22 21:49                           ` Benjamin Herrenschmidt
  (?)
@ 2011-08-23  0:52                             ` aafabbri
  -1 siblings, 0 replies; 322+ messages in thread
From: aafabbri @ 2011-08-23  0:52 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alex Williamson, Avi Kivity, Alexey Kardashevskiy, kvm,
	Paul Mackerras, qemu-devel, chrisw, iommu, Anthony Liguori,
	linux-pci, linuxppc-dev, benve




On 8/22/11 2:49 PM, "Benjamin Herrenschmidt" <benh@kernel.crashing.org>
wrote:

> 
>>> I wouldn't use uiommu for that.
>> 
>> Any particular reason besides saving a file descriptor?
>> 
>> We use it today, and it seems like a cleaner API than what you propose
>> changing it to.
> 
> Well for one, we are back to square one vs. grouping constraints.

I'm not following you.

You have to enforce group/iommu domain assignment whether you keep the
existing uiommu API or change it to your proposed ioctl(inherit_iommu)
API.

The only change needed to VFIO here should be to make uiommu fd assignment
happen on the groups instead of on device fds.  That operation fails or
succeeds according to the group semantics (all-or-none assignment/same
uiommu).

I think the question is: do we force a 1:1 iommu/group mapping, or do we
allow arbitrary mappings (satisfying group constraints) as we do today?

I'm saying I'm an existing user who wants the arbitrary iommu/group
mapping ability, and I definitely think the uiommu approach is cleaner
than the ioctl(inherit_iommu) approach.  We considered that approach
before, but it seemed less clean, so we went with the explicit uiommu
context.
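
In other words, roughly this; VFIO_GROUP_SET_UIOMMU is an invented name,
/dev/uiommu is the existing uiommu character device, and the group paths
and numbers are made up:

#include <fcntl.h>
#include <sys/ioctl.h>

#define VFIO_GROUP_SET_UIOMMU _IOW(';', 101, int)       /* invented */

int main(void)
{
        int iommu_fd  = open("/dev/uiommu", O_RDWR);
        int group1_fd = open("/dev/vfio/10", O_RDWR);   /* singleton group: device A */
        int group2_fd = open("/dev/vfio/11", O_RDWR);   /* singleton group: device B */

        /* Same uiommu context on both groups -> one shared IOMMU domain, so
         * e.g. network buffers get mapped once for both devices.  Fails with
         * the usual group semantics if a group can't join that domain. */
        ioctl(group1_fd, VFIO_GROUP_SET_UIOMMU, iommu_fd);
        ioctl(group2_fd, VFIO_GROUP_SET_UIOMMU, iommu_fd);
        return 0;
}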

>  .../...
> 
>> If we in singleton-group land were building our own "groups" which were sets
>> of devices sharing the IOMMU domains we wanted, I suppose we could do away
>> with uiommu fds, but it sounds like the current proposal would create 20
>> singleton groups (x86 iommu w/o PCI bridges => all devices are partitionable
>> endpoints).  Asking me to ioctl(inherit) them together into a blob sounds
>> worse than the current explicit uiommu API.
> 
> I'd rather have an API to create super-groups (groups of groups)
> statically and then you can use such groups as normal groups using the
> same interface. That create/management process could be done via a
> simple command line utility or via sysfs banging, whatever...

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-22 15:45                       ` [Qemu-devel] " Alex Williamson
  (?)
@ 2011-08-23  2:38                         ` David Gibson
  -1 siblings, 0 replies; 322+ messages in thread
From: David Gibson @ 2011-08-23  2:38 UTC (permalink / raw)
  To: Alex Williamson
  Cc: aafabbri, Alexey Kardashevskiy, kvm, Paul Mackerras, qemu-devel,
	chrisw, iommu, Avi Kivity, Anthony Liguori, linux-pci,
	linuxppc-dev, benve

On Mon, Aug 22, 2011 at 09:45:48AM -0600, Alex Williamson wrote:
> On Mon, 2011-08-22 at 15:55 +1000, David Gibson wrote:
> > On Sat, Aug 20, 2011 at 09:51:39AM -0700, Alex Williamson wrote:
> > > We had an extremely productive VFIO BoF on Monday.  Here's my attempt to
> > > capture the plan that I think we agreed to:
> > > 
> > > We need to address both the description and enforcement of device
> > > groups.  Groups are formed any time the iommu does not have resolution
> > > between a set of devices.  On x86, this typically happens when a
> > > PCI-to-PCI bridge exists between the set of devices and the iommu.  For
> > > Power, partitionable endpoints define a group.  Grouping information
> > > needs to be exposed for both userspace and kernel internal usage.  This
> > > will be a sysfs attribute setup by the iommu drivers.  Perhaps:
> > > 
> > > # cat /sys/devices/pci0000:00/0000:00:19.0/iommu_group
> > > 42
> > > 
> > > (I use a PCI example here, but attribute should not be PCI specific)
> > 
> > Ok.  Am I correct in thinking these group IDs are representing the
> > minimum granularity, and are therefore always static, defined only by
> > the connected hardware, not by configuration?
> 
> Yes, that's the idea.  An open question I have towards the configuration
> side is whether we might add iommu driver specific options to the
> groups.  For instance on x86 where we typically have B:D.F granularity,
> should we have an option not to trust multi-function devices and use a
> B:D granularity for grouping?

Right.  And likewise I can see a place for configuration parameters
like the present 'allow_unsafe_irqs'.  But these would be more-or-less
global options which affected the overall granularity, rather than
detailed configuration such as explicitly binding some devices into a
group, yes?

> > > >From there we have a few options.  In the BoF we discussed a model where
> > > binding a device to vfio creates a /dev/vfio$GROUP character device
> > > file.  This "group" fd provides provides dma mapping ioctls as well as
> > > ioctls to enumerate and return a "device" fd for each attached member of
> > > the group (similar to KVM_CREATE_VCPU).  We enforce grouping by
> > > returning an error on open() of the group fd if there are members of the
> > > group not bound to the vfio driver.  Each device fd would then support a
> > > similar set of ioctls and mapping (mmio/pio/config) interface as current
> > > vfio, except for the obvious domain and dma ioctls superseded by the
> > > group fd.
> > 
> > It seems a slightly strange distinction that the group device appears
> > when any device in the group is bound to vfio, but only becomes usable
> > when all devices are bound.
> > 
> > > Another valid model might be that /dev/vfio/$GROUP is created for all
> > > groups when the vfio module is loaded.  The group fd would allow open()
> > > and some set of iommu querying and device enumeration ioctls, but would
> > > error on dma mapping and retrieving device fds until all of the group
> > > devices are bound to the vfio driver.
> > 
> > Which is why I marginally prefer this model, although it's not a big
> > deal.
> 
> Right, we can also combine models.  Binding a device to vfio
> creates /dev/vfio$GROUP, which only allows a subset of ioctls and no
> device access until all the group devices are also bound.  I think
> the /dev/vfio/$GROUP might help provide an enumeration interface as well
> though, which could be useful.

I'm not entirely sure what you mean here.  But that's now several
weak votes in favour of the always-present group devices, and none in
favour of the created-when-first-device-bound model, so I suggest we
take /dev/vfio/$GROUP as our tentative approach.

> > > In either case, the uiommu interface is removed entirely since dma
> > > mapping is done via the group fd.  As necessary in the future, we can
> > > define a more high performance dma mapping interface for streaming dma
> > > via the group fd.  I expect we'll also include architecture specific
> > > group ioctls to describe features and capabilities of the iommu.  The
> > > group fd will need to prevent concurrent open()s to maintain a 1:1 group
> > > to userspace process ownership model.
> > 
> > A 1:1 group<->process correspondance seems wrong to me. But there are
> > many ways you could legitimately write the userspace side of the code,
> > many of them involving some sort of concurrency.  Implementing that
> > concurrency as multiple processes (using explicit shared memory and/or
> > other IPC mechanisms to co-ordinate) seems a valid choice that we
> > shouldn't arbitrarily prohibit.
> > 
> > Obviously, only one UID may be permitted to have the group open at a
> > time, and I think that's enough to prevent them doing any worse than
> > shooting themselves in the foot.
> 
> 1:1 group<->process is probably too strong.  Not allowing concurrent
> open()s on the group file enforces a single userspace entity is
> responsible for that group.  Device fds can be passed to other
> processes, but only retrieved via the group fd.  I suppose we could even
> branch off the dma interface into a different fd, but it seems like we
> would logically want to serialize dma mappings at each iommu group
> anyway.  I'm open to alternatives, this just seemed an easy way to do
> it.  Restricting on UID implies that we require isolated qemu instances
> to run as different UIDs.

Well... yes and no.  It means guests which need to be isolated from
malicious interference with each other need different UIDs; but given
that, with the same UID, one qemu can kill() or ptrace() the other,
they're not isolated in that sense anyway.

It seems to me that, running as the same UID but with different device
groups assigned, the guests are still pretty well isolated from
accidental interference with each other.

>  I know that's a goal, but I don't know if we
> want to make it an assumption in the group security model.
> 
> > > Also on the table is supporting non-PCI devices with vfio.  To do this,
> > > we need to generalize the read/write/mmap and irq eventfd interfaces.
> > > We could keep the same model of segmenting the device fd address space,
> > > perhaps adding ioctls to define the segment offset bit position or we
> > > could split each region into it's own fd (VFIO_GET_PCI_BAR_FD(0),
> > > VFIO_GET_PCI_CONFIG_FD(), VFIO_GET_MMIO_FD(3)), though we're already
> > > suffering some degree of fd bloat (group fd, device fd(s), interrupt
> > > event fd(s), per resource fd, etc).  For interrupts we can overload
> > > VFIO_SET_IRQ_EVENTFD to be either PCI INTx or non-PCI irq 
> > 
> > Sounds reasonable.
> > 
> > > (do non-PCI
> > > devices support MSI?).
> > 
> > They can.  Obviously they might not have exactly the same semantics as
> > PCI MSIs, but I know we have SoC systems with (non-PCI) on-die devices
> > whose interrupts are treated by the (also on-die) root interrupt
> > controller in the same way as PCI MSIs.
> 
> Ok, I suppose we can define ioctls to enable these as we go.  We also
> need to figure out how non-PCI resources, interrupts, and iommu mapping
> restrictions are described via vfio.

Yeah.  On device tree platforms we'd want it to be bound to the device
tree representation in some way.

For platform devices, at least, could we have the index into the array
of resources take the place of the BAR number used for PCI?
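
Something along these lines, perhaps (struct and field names invented
just to illustrate the idea):

#include <stdint.h>

/* One descriptor per region of the device fd: for PCI, index is the BAR
 * number; for a platform device, it's the index into its resource array. */
struct vfio_device_region {             /* invented */
        uint32_t index;
        uint64_t size;
        uint64_t fd_offset;             /* where to read/write/mmap it on the device fd */
        uint32_t flags;                 /* mmap-able, read-only, ... */
};
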
> 
> > > For qemu, these changes imply we'd only support a model where we have a
> > > 1:1 group to iommu domain.  The current vfio driver could probably
> > > become vfio-pci as we might end up with more target specific vfio
> > > drivers for non-pci.  PCI should be able to maintain a simple -device
> > > vfio-pci,host=bb:dd.f to enable hotplug of individual devices.  We'll
> > > need to come up with extra options when we need to expose groups to
> > > guest for pvdma.
> > 
> > Are you saying that you'd no longer support the current x86 usage of
> > putting all of one guest's devices into a single domain?
> 
> Yes.  I'm not sure there's a good ROI to prioritize that model.  We have
> to assume >1 device per guest is a typical model and that the iotlb is
> large enough that we might improve thrashing to see both a resource and
> performance benefit from it.  I'm open to suggestions for how we could
> include it though.

Creating supergroups of some sort seems to be what we need, but I'm
not sure what the best interface for doing that would be.

> > If that's
> > not what you're saying, how would the domains - now made up of a
> > user's selection of groups, rather than individual devices - be
> > configured?
> > 
> > > Hope that captures it, feel free to jump in with corrections and
> > > suggestions.  Thanks,

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-23  0:52                             ` aafabbri
@ 2011-08-23  6:54                               ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 322+ messages in thread
From: Benjamin Herrenschmidt @ 2011-08-23  6:54 UTC (permalink / raw)
  To: aafabbri
  Cc: Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci, qemu-devel,
	iommu, chrisw, Alex Williamson, Avi Kivity, linuxppc-dev, benve

On Mon, 2011-08-22 at 17:52 -0700, aafabbri wrote:

> I'm not following you.
> 
> You have to enforce group/iommu domain assignment whether you have the
> existing uiommu API, or if you change it to your proposed
> ioctl(inherit_iommu) API.
> 
> The only change needed to VFIO here should be to make uiommu fd assignment
> happen on the groups instead of on device fds.  That operation fails or
> succeeds according to the group semantics (all-or-none assignment/same
> uiommu).

Ok, so I missed that part where you change uiommu to operate on group
fd's rather than device fd's, my apologies if you actually wrote that
down :-) It might be obvious ... bear with me, I just flew back from the
US and I am badly jet lagged ...

So I see what you mean, however...

> I think the question is: do we force 1:1 iommu/group mapping, or do we allow
> arbitrary mapping (satisfying group constraints) as we do today.
> 
> I'm saying I'm an existing user who wants the arbitrary iommu/group mapping
> ability and definitely think the uiommu approach is cleaner than the
> ioctl(inherit_iommu) approach.  We considered that approach before but it
> seemed less clean so we went with the explicit uiommu context.

Possibly, the question that interests me the most is what interface will
KVM end up using. I'm also not terribly keen on the (perceived)
discrepancy between using uiommu to create groups but using the group fd
to actually do the mappings, at least if that is still the plan.

If the separate uiommu interface is kept, then anything that wants to be
able to benefit from the ability to put multiple devices (or existing
groups) into such a "meta group" would need to be explicitly modified to
deal with the uiommu APIs.

I tend to prefer such "meta groups" as being something you create
statically using a configuration interface, either via sysfs, netlink or
ioctl's to a "control" vfio device driven by a simple command line tool
(which can have the configuration stored in /etc and re-apply it at
boot).

That way, any program capable of exploiting VFIO "groups" will
automatically be able to exploit those "meta groups" (or groups of
groups) as well, as long as they are supported on the system.

If we ever have system specific constraints as to how such groups can be
created, then it can all be handled at the level of that configuration
tool without impact on whatever programs know how to exploit them via
the VFIO interfaces.
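
As a strawman of what that boot-time tool would do -- every device node,
structure and ioctl name below is invented purely for illustration --
the whole thing could be as small as:

#include <stdio.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

/* hypothetical interface of a "control" vfio device */
struct vfio_merge_groups {
	unsigned int groups[2];	/* iommu group IDs to fuse into a meta group */
};
#define VFIO_CREATE_META_GROUP	_IOW(';', 100, struct vfio_merge_groups)

int main(void)
{
	struct vfio_merge_groups req = { .groups = { 26, 42 } };
	int ctl = open("/dev/vfio/control", O_RDWR);	/* made-up node */

	if (ctl < 0 || ioctl(ctl, VFIO_CREATE_META_GROUP, &req) < 0)
		perror("meta group creation");
	return 0;
}

The point being that once such a meta group exists it looks like any
other /dev/vfio/$GROUP to whoever consumes it.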

> >  .../...
> > 
> >> If we in singleton-group land were building our own "groups" which were sets
> >> of devices sharing the IOMMU domains we wanted, I suppose we could do away
> >> with uiommu fds, but it sounds like the current proposal would create 20
> >> singleton groups (x86 iommu w/o PCI bridges => all devices are partitionable
> >> endpoints).  Asking me to ioctl(inherit) them together into a blob sounds
> >> worse than the current explicit uiommu API.
> > 
> > I'd rather have an API to create super-groups (groups of groups)
> > statically and then you can use such groups as normal groups using the
> > same interface. That create/management process could be done via a
> > simple command line utility or via sysfs banging, whatever...

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-23  0:52                             ` aafabbri
@ 2011-08-23 11:04                               ` Joerg Roedel
  -1 siblings, 0 replies; 322+ messages in thread
From: Joerg Roedel @ 2011-08-23 11:04 UTC (permalink / raw)
  To: aafabbri
  Cc: Benjamin Herrenschmidt, Alexey Kardashevskiy, kvm,
	Paul Mackerras, linux-pci, qemu-devel, iommu, chrisw, Avi Kivity,
	Anthony Liguori, linuxppc-dev, benve

On Mon, Aug 22, 2011 at 08:52:18PM -0400, aafabbri wrote:
> You have to enforce group/iommu domain assignment whether you have the
> existing uiommu API, or if you change it to your proposed
> ioctl(inherit_iommu) API.
> 
> The only change needed to VFIO here should be to make uiommu fd assignment
> happen on the groups instead of on device fds.  That operation fails or
> succeeds according to the group semantics (all-or-none assignment/same
> uiommu).

That makes uiommu basically the same as the meta-groups, right?

	Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-23  6:54                               ` Benjamin Herrenschmidt
@ 2011-08-23 11:09                                 ` Joerg Roedel
  -1 siblings, 0 replies; 322+ messages in thread
From: Joerg Roedel @ 2011-08-23 11:09 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: aafabbri, Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	qemu-devel, iommu, chrisw, Avi Kivity, Anthony Liguori,
	linuxppc-dev, benve

On Tue, Aug 23, 2011 at 02:54:43AM -0400, Benjamin Herrenschmidt wrote:
> Possibly, the question that interest me the most is what interface will
> KVM end up using. I'm also not terribly fan with the (perceived)
> discrepancy between using uiommu to create groups but using the group fd
> to actually do the mappings, at least if that is still the plan.
> 
> If the separate uiommu interface is kept, then anything that wants to be
> able to benefit from the ability to put multiple devices (or existing
> groups) into such a "meta group" would need to be explicitly modified to
> deal with the uiommu APIs.
> 
> I tend to prefer such "meta groups" as being something you create
> statically using a configuration interface, either via sysfs, netlink or
> ioctl's to a "control" vfio device driven by a simple command line tool
> (which can have the configuration stored in /etc and re-apply it at
> boot).

Hmm, I don't think that these groups are static for the system's
run-time. They only exist for the lifetime of a guest by default, at
least on x86. That's why I prefer to do this grouping using VFIO and not
some sysfs interface (which would be the third interface besides the
ioctls and netlink a VFIO user needs to be aware of). Doing this in the
ioctl interface just makes things easier.

	Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-22 19:17                       ` Alex Williamson
@ 2011-08-23 13:14                         ` Roedel, Joerg
  -1 siblings, 0 replies; 322+ messages in thread
From: Roedel, Joerg @ 2011-08-23 13:14 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Benjamin Herrenschmidt, chrisw, Alexey Kardashevskiy, kvm,
	Paul Mackerras, linux-pci, qemu-devel, iommu, Avi Kivity,
	Anthony Liguori, linuxppc-dev, benve

On Mon, Aug 22, 2011 at 03:17:00PM -0400, Alex Williamson wrote:
> On Mon, 2011-08-22 at 19:25 +0200, Joerg Roedel wrote:

> > I am in favour of /dev/vfio/$GROUP. If multiple devices should be
> > assigned to a guest, there can also be an ioctl to bind a group to an
> > address-space of another group (certainly needs some care to not allow
> > that both groups belong to different processes).
> 
> That's an interesting idea.  Maybe an interface similar to the current
> uiommu interface, where you open() the 2nd group fd and pass the fd via
> ioctl to the primary group.  IOMMUs that don't support this would fail
> the attach device callback, which would fail the ioctl to bind them.  It
> will need to be designed so any group can be removed from the super-set
> and the remaining group(s) still works.  This feels like something that
> can be added after we get an initial implementation.

Handling it through fds is a good idea. This makes sure that everything
belongs to one process. I am not really sure yet if we go the way to
just bind plain groups together or if we create meta-groups. The
meta-groups thing seems somewhat cleaner, though.

> > Btw, a problem we haven't talked about yet entirely is
> > driver-deassignment. User space can decide to de-assign the device from
> > vfio while a fd is open on it. With PCI there is no way to let this fail
> > (the .release function returns void last time i checked). Is this a
> > problem, and yes, how we handle that?
> 
> The current vfio has the same problem, we can't unbind a device from
> vfio while it's attached to a guest.  I think we'd use the same solution
> too; send out a netlink packet for a device removal and have the .remove
> call sleep on a wait_event(, refcnt == 0).  We could also set a timeout
> and SIGBUS the PIDs holding the device if they don't return it
> willingly.  Thanks,

Putting the process to sleep (which would be uninterruptible) seems bad.
The process would sleep until the guest releases the device-group, which
can take days or months.
The best thing (and the most intrusive :-) ) is to change PCI core to
allow unbindings to fail, I think. But this probably further complicates
the way to upstream VFIO...
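
For reference, the sleep-with-timeout variant described above would
amount to roughly the sketch below in the .remove path; struct
vfio_device and its fields are placeholders, and the objection is
precisely to the window where the task sits in that wait:

#include <linux/pci.h>
#include <linux/wait.h>
#include <linux/jiffies.h>
#include <linux/atomic.h>

struct vfio_device {		/* placeholder: only what the sketch needs */
	wait_queue_head_t wait;	/* woken whenever a device fd is released */
	atomic_t refcnt;	/* open fds still referencing the device */
};

static void vfio_pci_remove(struct pci_dev *pdev)
{
	struct vfio_device *vdev = pci_get_drvdata(pdev);

	/* 1. notify userspace (netlink) that the device is being withdrawn */

	/* 2. wait for all fds to go away, but only for so long ... */
	if (!wait_event_timeout(vdev->wait,
				atomic_read(&vdev->refcnt) == 0,
				msecs_to_jiffies(30000))) {
		/* 3. ... then SIGBUS the holders and revoke the mappings */
	}

	/* 4. tear down; note .remove has no way to report failure */
}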

	Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-22 21:03                       ` Benjamin Herrenschmidt
@ 2011-08-23 13:18                         ` Roedel, Joerg
  -1 siblings, 0 replies; 322+ messages in thread
From: Roedel, Joerg @ 2011-08-23 13:18 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alex Williamson, chrisw, Alexey Kardashevskiy, kvm,
	Paul Mackerras, linux-pci, qemu-devel, iommu, Avi Kivity,
	Anthony Liguori, linuxppc-dev, benve

On Mon, Aug 22, 2011 at 05:03:53PM -0400, Benjamin Herrenschmidt wrote:
> 
> > I am in favour of /dev/vfio/$GROUP. If multiple devices should be
> > assigned to a guest, there can also be an ioctl to bind a group to an
> > address-space of another group (certainly needs some care to not allow
> > that both groups belong to different processes).
> > 
> > Btw, a problem we haven't talked about yet entirely is
> > driver-deassignment. User space can decide to de-assign the device from
> > vfio while a fd is open on it. With PCI there is no way to let this fail
> > (the .release function returns void last time i checked). Is this a
> > problem, and yes, how we handle that?
> 
> We can treat it as a hard unplug (like a cardbus gone away).
> 
> IE. Dispose of the direct mappings (switch to MMIO emulation) and return
> all ff's from reads (& ignore writes).
> 
> Then send an unplug event via whatever mechanism the platform provides
> (ACPI hotplug controller on x86 for example, we haven't quite sorted out
> what to do on power for hotplug yet).

Hmm, good idea. But as far as I know the hotplug-event needs to be in
the guest _before_ the device is actually unplugged (so that the guest
can unbind its driver first). That somehow brings back the sleep-idea
and the timeout in the .release function.

	Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
@ 2011-08-23 13:18                         ` Roedel, Joerg
  0 siblings, 0 replies; 322+ messages in thread
From: Roedel, Joerg @ 2011-08-23 13:18 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci, qemu-devel,
	iommu, chrisw, Alex Williamson, Avi Kivity, Anthony Liguori,
	linuxppc-dev, benve

On Mon, Aug 22, 2011 at 05:03:53PM -0400, Benjamin Herrenschmidt wrote:
> 
> > I am in favour of /dev/vfio/$GROUP. If multiple devices should be
> > assigned to a guest, there can also be an ioctl to bind a group to an
> > address-space of another group (certainly needs some care to not allow
> > that both groups belong to different processes).
> > 
> > Btw, a problem we havn't talked about yet entirely is
> > driver-deassignment. User space can decide to de-assign the device from
> > vfio while a fd is open on it. With PCI there is no way to let this fail
> > (the .release function returns void last time i checked). Is this a
> > problem, and yes, how we handle that?
> 
> We can treat it as a hard unplug (like a cardbus gone away).
> 
> IE. Dispose of the direct mappings (switch to MMIO emulation) and return
> all ff's from reads (& ignore writes).
> 
> Then send an unplug event via whatever mechanism the platform provides
> (ACPI hotplug controller on x86 for example, we haven't quite sorted out
> what to do on power for hotplug yet).

Hmm, good idea. But as far as I know the hotplug-event needs to be in
the guest _before_ the device is actually unplugged (so that the guest
can unbind its driver first). That somehow brings back the sleep-idea
and the timeout in the .release function.

	Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [Qemu-devel] kvm PCI assignment & VFIO ramblings
@ 2011-08-23 13:18                         ` Roedel, Joerg
  0 siblings, 0 replies; 322+ messages in thread
From: Roedel, Joerg @ 2011-08-23 13:18 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci, qemu-devel,
	iommu, chrisw, Alex Williamson, Avi Kivity, linuxppc-dev, benve

On Mon, Aug 22, 2011 at 05:03:53PM -0400, Benjamin Herrenschmidt wrote:
> 
> > I am in favour of /dev/vfio/$GROUP. If multiple devices should be
> > assigned to a guest, there can also be an ioctl to bind a group to an
> > address-space of another group (certainly needs some care to not allow
> > that both groups belong to different processes).
> > 
> > Btw, a problem we havn't talked about yet entirely is
> > driver-deassignment. User space can decide to de-assign the device from
> > vfio while a fd is open on it. With PCI there is no way to let this fail
> > (the .release function returns void last time i checked). Is this a
> > problem, and yes, how we handle that?
> 
> We can treat it as a hard unplug (like a cardbus gone away).
> 
> IE. Dispose of the direct mappings (switch to MMIO emulation) and return
> all ff's from reads (& ignore writes).
> 
> Then send an unplug event via whatever mechanism the platform provides
> (ACPI hotplug controller on x86 for example, we haven't quite sorted out
> what to do on power for hotplug yet).

Hmm, good idea. But as far as I know the hotplug-event needs to be in
the guest _before_ the device is actually unplugged (so that the guest
can unbind its driver first). That somehow brings back the sleep-idea
and the timeout in the .release function.

	Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-23  2:38                         ` David Gibson
@ 2011-08-23 16:23                           ` Alex Williamson
  -1 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-08-23 16:23 UTC (permalink / raw)
  To: David Gibson
  Cc: aafabbri, Alexey Kardashevskiy, kvm, Paul Mackerras, qemu-devel,
	chrisw, iommu, Avi Kivity, Anthony Liguori, linux-pci,
	linuxppc-dev, benve

On Tue, 2011-08-23 at 12:38 +1000, David Gibson wrote:
> On Mon, Aug 22, 2011 at 09:45:48AM -0600, Alex Williamson wrote:
> > On Mon, 2011-08-22 at 15:55 +1000, David Gibson wrote:
> > > On Sat, Aug 20, 2011 at 09:51:39AM -0700, Alex Williamson wrote:
> > > > We had an extremely productive VFIO BoF on Monday.  Here's my attempt to
> > > > capture the plan that I think we agreed to:
> > > > 
> > > > We need to address both the description and enforcement of device
> > > > groups.  Groups are formed any time the iommu does not have resolution
> > > > between a set of devices.  On x86, this typically happens when a
> > > > PCI-to-PCI bridge exists between the set of devices and the iommu.  For
> > > > Power, partitionable endpoints define a group.  Grouping information
> > > > needs to be exposed for both userspace and kernel internal usage.  This
> > > > will be a sysfs attribute setup by the iommu drivers.  Perhaps:
> > > > 
> > > > # cat /sys/devices/pci0000:00/0000:00:19.0/iommu_group
> > > > 42
> > > > 
> > > > (I use a PCI example here, but attribute should not be PCI specific)
> > > 
> > > Ok.  Am I correct in thinking these group IDs are representing the
> > > minimum granularity, and are therefore always static, defined only by
> > > the connected hardware, not by configuration?
> > 
> > Yes, that's the idea.  An open question I have towards the configuration
> > side is whether we might add iommu driver specific options to the
> > groups.  For instance on x86 where we typically have B:D.F granularity,
> > should we have an option not to trust multi-function devices and use a
> > B:D granularity for grouping?
> 
> Right.  And likewise I can see a place for configuration parameters
> like the present 'allow_unsafe_irqs'.  But these would be more-or-less
> global options which affected the overall granularity, rather than
> detailed configuration such as explicitly binding some devices into a
> group, yes?

Yes, currently the interrupt remapping support is a global iommu
capability.  I suppose it's possible that this could be an iommu option,
where the iommu driver would not advertise a group if the interrupt
remapping constraint isn't met.

> > > > From there we have a few options.  In the BoF we discussed a model where
> > > > binding a device to vfio creates a /dev/vfio$GROUP character device
> > > > file.  This "group" fd provides dma mapping ioctls as well as
> > > > ioctls to enumerate and return a "device" fd for each attached member of
> > > > the group (similar to KVM_CREATE_VCPU).  We enforce grouping by
> > > > returning an error on open() of the group fd if there are members of the
> > > > group not bound to the vfio driver.  Each device fd would then support a
> > > > similar set of ioctls and mapping (mmio/pio/config) interface as current
> > > > vfio, except for the obvious domain and dma ioctls superseded by the
> > > > group fd.
> > > 
> > > It seems a slightly strange distinction that the group device appears
> > > when any device in the group is bound to vfio, but only becomes usable
> > > when all devices are bound.
> > > 
> > > > Another valid model might be that /dev/vfio/$GROUP is created for all
> > > > groups when the vfio module is loaded.  The group fd would allow open()
> > > > and some set of iommu querying and device enumeration ioctls, but would
> > > > error on dma mapping and retrieving device fds until all of the group
> > > > devices are bound to the vfio driver.
> > > 
> > > Which is why I marginally prefer this model, although it's not a big
> > > deal.
> > 
> > Right, we can also combine models.  Binding a device to vfio
> > creates /dev/vfio$GROUP, which only allows a subset of ioctls and no
> > device access until all the group devices are also bound.  I think
> > the /dev/vfio/$GROUP might help provide an enumeration interface as well
> > though, which could be useful.
> 
> I'm not entirely sure what you mean here.  But, that's now several
> weak votes in favour of the always-present group devices, and none in
> favour of the created-when-first-device-bound model, so I suggest we
> take the /dev/vfio/$GROUP as our tentative approach.

Yep

> > > > In either case, the uiommu interface is removed entirely since dma
> > > > mapping is done via the group fd.  As necessary in the future, we can
> > > > define a more high performance dma mapping interface for streaming dma
> > > > via the group fd.  I expect we'll also include architecture specific
> > > > group ioctls to describe features and capabilities of the iommu.  The
> > > > group fd will need to prevent concurrent open()s to maintain a 1:1 group
> > > > to userspace process ownership model.
> > > 
> > > A 1:1 group<->process correspondance seems wrong to me. But there are
> > > many ways you could legitimately write the userspace side of the code,
> > > many of them involving some sort of concurrency.  Implementing that
> > > concurrency as multiple processes (using explicit shared memory and/or
> > > other IPC mechanisms to co-ordinate) seems a valid choice that we
> > > shouldn't arbitrarily prohibit.
> > > 
> > > Obviously, only one UID may be permitted to have the group open at a
> > > time, and I think that's enough to prevent them doing any worse than
> > > shooting themselves in the foot.
> > 
> > 1:1 group<->process is probably too strong.  Not allowing concurrent
> > open()s on the group file enforces that a single userspace entity is
> > responsible for that group.  Device fds can be passed to other
> > processes, but only retrieved via the group fd.  I suppose we could even
> > branch off the dma interface into a different fd, but it seems like we
> > would logically want to serialize dma mappings at each iommu group
> > anyway.  I'm open to alternatives, this just seemed an easy way to do
> > it.  Restricting on UID implies that we require isolated qemu instances
> > to run as different UIDs.
> 
> Well.. yes and no.  It means guests which need to be isolated from
> malicious interference with each other need different UIDs, but given
> that if they have the same UID one qemu can kill() or ptrace() the
> other, they're not isolated in that sense anyway.
> 
> It seems to me that running as the same UIDs with different device
> groups assigned, the guests are still pretty well isolated from
> accidental interference with each other.

If our only restriction is UID, what prevents a non-clueful user from
trying to create separate qemu instances making use of different devices
within the same group (or even the same device)?  If we restrict
concurrent opens, it's just that subsequent instances get a -EBUSY.
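
For illustration, a minimal kernel-side sketch of enforcing that single-open
rule; the struct and function names are made up, not taken from any posted
patch:

#include <linux/atomic.h>
#include <linux/cdev.h>
#include <linux/fs.h>
#include <linux/kernel.h>

/* Hypothetical per-group state. */
struct vfio_group {
        struct cdev cdev;
        atomic_t opened;
};

static int vfio_group_open(struct inode *inode, struct file *filep)
{
        struct vfio_group *group =
                container_of(inode->i_cdev, struct vfio_group, cdev);

        /* First opener wins; any concurrent open() sees -EBUSY. */
        if (atomic_cmpxchg(&group->opened, 0, 1) != 0)
                return -EBUSY;

        filep->private_data = group;
        return 0;
}

static int vfio_group_release(struct inode *inode, struct file *filep)
{
        struct vfio_group *group = filep->private_data;

        atomic_set(&group->opened, 0);
        return 0;
}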

> >  I know that's a goal, but I don't know if we
> > want to make it an assumption in the group security model.
> > 
> > > > Also on the table is supporting non-PCI devices with vfio.  To do this,
> > > > we need to generalize the read/write/mmap and irq eventfd interfaces.
> > > > We could keep the same model of segmenting the device fd address space,
> > > > perhaps adding ioctls to define the segment offset bit position or we
> > > > could split each region into its own fd (VFIO_GET_PCI_BAR_FD(0),
> > > > VFIO_GET_PCI_CONFIG_FD(), VFIO_GET_MMIO_FD(3)), though we're already
> > > > suffering some degree of fd bloat (group fd, device fd(s), interrupt
> > > > event fd(s), per resource fd, etc).  For interrupts we can overload
> > > > VFIO_SET_IRQ_EVENTFD to be either PCI INTx or non-PCI irq 
> > > 
> > > Sounds reasonable.
> > > 
> > > > (do non-PCI
> > > > devices support MSI?).
> > > 
> > > They can.  Obviously they might not have exactly the same semantics as
> > > PCI MSIs, but I know we have SoC systems with (non-PCI) on-die devices
> > > whose interrupts are treated by the (also on-die) root interrupt
> > > controller in the same way as PCI MSIs.
> > 
> > Ok, I suppose we can define ioctls to enable these as we go.  We also
> > need to figure out how non-PCI resources, interrupts, and iommu mapping
> > restrictions are described via vfio.
> 
> Yeah.  On device tree platforms we'd want it to be bound to the device
> tree representation in some way.
> 
> For platform devices, at least, could we have the index into the array
> of resources take the place of BAR number for PCI?

That's what I was thinking, but we need some way to describe the set of
valid indexes and type and size for each as well.  We already have the
BAR_LEN helper ioctl, we could make that generic (RANGE_LEN?) and add
NUM_RANGES and RANGE_TYPE.  For PCI there would always be 7 ranges (6
BARs + ROM).
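
A strawman of what such a generic interface could look like, folding RANGE_LEN
and RANGE_TYPE into a single RANGE_INFO call; the struct layout, the type
encoding and the ioctl numbers are purely illustrative:

#include <linux/ioctl.h>
#include <linux/types.h>

struct vfio_range_info {
        __u32 index;    /* in: 0 .. NUM_RANGES-1 (BAR number for PCI) */
        __u32 type;     /* out: MMIO / PIO / config / ROM, say        */
        __u64 len;      /* out: size of the region in bytes           */
};

/* For PCI, NUM_RANGES would always return 7 (6 BARs + ROM). */
#define VFIO_DEVICE_NUM_RANGES  _IOR(';', 100, __u32)
#define VFIO_DEVICE_RANGE_INFO  _IOWR(';', 101, struct vfio_range_info)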

> > > > For qemu, these changes imply we'd only support a model where we have a
> > > > 1:1 group to iommu domain.  The current vfio driver could probably
> > > > become vfio-pci as we might end up with more target specific vfio
> > > > drivers for non-pci.  PCI should be able to maintain a simple -device
> > > > vfio-pci,host=bb:dd.f to enable hotplug of individual devices.  We'll
> > > > need to come up with extra options when we need to expose groups to
> > > > guest for pvdma.
> > > 
> > > Are you saying that you'd no longer support the current x86 usage of
> > > putting all of one guest's devices into a single domain?
> > 
> > Yes.  I'm not sure there's a good ROI to prioritize that model.  We have
> > to assume >1 device per guest is a typical model and that the iotlb is
> > large enough that we might improve thrashing to see both a resource and
> > performance benefit from it.  I'm open to suggestions for how we could
> > include it though.
> 
> Creating supergroups of some sort seems to be what we need, but I'm
> not sure what's the best interface for doing that.

Yeah.  Joerg's idea of binding groups internally (pass the fd of one
group to another via ioctl) is one option.  The tricky part will be
implementing it to support hot unplug of any group from the supergroup.
I believe Ben had a suggestion that supergroups could be created in
sysfs, but I don't know what the mechanism to do that looks like.  It
would also be an extra management step to dynamically bind and unbind
groups to the supergroup around hotplug.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-23 11:04                               ` Joerg Roedel
@ 2011-08-23 16:54                                 ` aafabbri
  -1 siblings, 0 replies; 322+ messages in thread
From: aafabbri @ 2011-08-23 16:54 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Benjamin Herrenschmidt, Alexey Kardashevskiy, kvm,
	Paul Mackerras, linux-pci, qemu-devel, iommu, chrisw, Avi Kivity,
	Anthony Liguori, linuxppc-dev, benve




On 8/23/11 4:04 AM, "Joerg Roedel" <joerg.roedel@amd.com> wrote:

> On Mon, Aug 22, 2011 at 08:52:18PM -0400, aafabbri wrote:
>> You have to enforce group/iommu domain assignment whether you have the
>> existing uiommu API, or if you change it to your proposed
>> ioctl(inherit_iommu) API.
>> 
>> The only change needed to VFIO here should be to make uiommu fd assignment
>> happen on the groups instead of on device fds.  That operation fails or
>> succeeds according to the group semantics (all-or-none assignment/same
>> uiommu).
> 
> That makes uiommu basically the same as the meta-groups, right?

Yes, functionality seems the same, thus my suggestion to keep uiommu
explicit.  Is there some need for group-groups besides defining sets of
groups which share IOMMU resources?

I do all this stuff (bringing up sets of devices which may share IOMMU
domain) dynamically from C applications.  I don't really want some static
(boot-time or sysfs fiddling) supergroup config unless there is a good
reason KVM/power needs it.
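
Roughly, with uiommu assignment moved from device fds to group fds, that
dynamic setup might look like the sketch below; the device paths and the
VFIO_GROUP_SET_UIOMMU ioctl are hypothetical:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

#define VFIO_GROUP_SET_UIOMMU   _IOW(';', 10, int)      /* hypothetical */

int share_domain(const char *grp_a, const char *grp_b)
{
        int uiommu = open("/dev/uiommu", O_RDWR);       /* one IOMMU domain  */
        int ga = open(grp_a, O_RDWR);                   /* e.g. /dev/vfio/42 */
        int gb = open(grp_b, O_RDWR);

        if (uiommu < 0 || ga < 0 || gb < 0)
                return -1;

        /* All-or-none per the group semantics: a group either joins the
         * domain as a whole or the call fails. */
        if (ioctl(ga, VFIO_GROUP_SET_UIOMMU, uiommu) ||
            ioctl(gb, VFIO_GROUP_SET_UIOMMU, uiommu))
                return -1;

        /* DMA mappings made once are now visible to devices in both groups. */
        return 0;
}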

As you say in your next email, doing it all from ioctls is very easy,
programmatically.

-Aaron Fabbri

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-23  6:54                               ` Benjamin Herrenschmidt
@ 2011-08-23 17:01                                 ` Alex Williamson
  -1 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-08-23 17:01 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: aafabbri, Avi Kivity, Alexey Kardashevskiy, kvm, Paul Mackerras,
	qemu-devel, chrisw, iommu, Anthony Liguori, linux-pci,
	linuxppc-dev, benve

On Tue, 2011-08-23 at 16:54 +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2011-08-22 at 17:52 -0700, aafabbri wrote:
> 
> > I'm not following you.
> > 
> > You have to enforce group/iommu domain assignment whether you have the
> > existing uiommu API, or if you change it to your proposed
> > ioctl(inherit_iommu) API.
> > 
> > The only change needed to VFIO here should be to make uiommu fd assignment
> > happen on the groups instead of on device fds.  That operation fails or
> > succeeds according to the group semantics (all-or-none assignment/same
> > uiommu).
> 
> Ok, so I missed that part where you change uiommu to operate on group
> fd's rather than device fd's, my apologies if you actually wrote that
> down :-) It might be obvious ... bear with me I just flew back from the
> US and I am badly jet lagged ...

I missed it too, the model I'm proposing entirely removes the uiommu
concept.

> So I see what you mean, however...
> 
> > I think the question is: do we force 1:1 iommu/group mapping, or do we allow
> > arbitrary mapping (satisfying group constraints) as we do today.
> > 
> > I'm saying I'm an existing user who wants the arbitrary iommu/group mapping
> > ability and definitely think the uiommu approach is cleaner than the
> > ioctl(inherit_iommu) approach.  We considered that approach before but it
> > seemed less clean so we went with the explicit uiommu context.
> 
> Possibly, the question that interests me the most is what interface will
> KVM end up using. I'm also not terribly fond of the (perceived)
> discrepancy between using uiommu to create groups but using the group fd
> to actually do the mappings, at least if that is still the plan.

Current code: uiommu creates the domain, we bind a vfio device to that
domain via a SET_UIOMMU_DOMAIN ioctl on the vfio device, then do
mappings via MAP_DMA on the vfio device (affecting all the vfio devices
bound to the domain)

My current proposal: "groups" are predefined.  groups ~= iommu domain.
The iommu domain would probably be allocated when the first device is
bound to vfio.  As each device is bound, it gets attached to the group.
DMAs are done via an ioctl on the group.
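
Sketching the two flows side by side; SET_UIOMMU_DOMAIN and MAP_DMA follow the
names used above, while GET_DEVICE_FD, the ioctl numbers and the argument
struct are made up:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

struct dma_map_args { unsigned long vaddr, iova, size; };  /* illustrative */

#define SET_UIOMMU_DOMAIN  _IOW(';', 1, int)                /* placeholders */
#define MAP_DMA            _IOW(';', 2, struct dma_map_args)
#define GET_DEVICE_FD      _IOW(';', 3, char *)

void current_flow(void)
{
        int uiommu = open("/dev/uiommu", O_RDWR);   /* iommu_domain_alloc() */
        int dev = open("/dev/vfio0", O_RDWR);       /* one assigned device  */
        struct dma_map_args map = { 0 };

        ioctl(dev, SET_UIOMMU_DOMAIN, uiommu);      /* bind device to domain */
        /* One mapping call affects every device bound to that domain. */
        ioctl(dev, MAP_DMA, &map);
}

void proposed_flow(void)
{
        int grp = open("/dev/vfio/42", O_RDWR);     /* group from sysfs */
        struct dma_map_args map = { 0 };

        /* Fails until every device in the group is bound to vfio. */
        int dev = ioctl(grp, GET_DEVICE_FD, "0000:00:19.0");

        /* DMA is mapped on the group itself. */
        ioctl(grp, MAP_DMA, &map);
        (void)dev;
}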

I think group + uiommu leads to effectively reliving most of the
problems with the current code.  The only benefit is the group
assignment to enforce hardware restrictions.  We still have the problem
that uiommu open() = iommu_domain_alloc(), whose properties are
meaningless without attached devices (groups).  Which I think leads to
the same awkward model of attaching groups to define the domain, then we
end up doing mappings via the group to enforce ordering.

> If the separate uiommu interface is kept, then anything that wants to be
> able to benefit from the ability to put multiple devices (or existing
> groups) into such a "meta group" would need to be explicitly modified to
> deal with the uiommu APIs.
> 
> I tend to prefer such "meta groups" as being something you create
> statically using a configuration interface, either via sysfs, netlink or
> ioctl's to a "control" vfio device driven by a simple command line tool
> (which can have the configuration stored in /etc and re-apply it at
> boot).

I cringe anytime there's a mention of "static".  IMHO, we have to
support hotplug.  That means "meta groups" change dynamically.  Maybe
this supports the idea that we should be able to retrieve a new fd from
the group to do mappings.  Any groups bound together will return the
same fd and the fd will persist so long as any member of the group is
open.

> That way, any program capable of exploiting VFIO "groups" will
> automatically be able to exploit those "meta groups" (or groups of
> groups) as well, as long as they are supported on the system.
> 
> If we ever have system specific constraints as to how such groups can be
> created, then it can all be handled at the level of that configuration
> tool without impact on whatever programs know how to exploit them via
> the VFIO interfaces.

I'd prefer to have the constraints be represented in the ioctl to bind
groups.  It works or not and the platform gets to define what it
considers compatible.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-23 13:14                         ` Roedel, Joerg
@ 2011-08-23 17:08                           ` Alex Williamson
  -1 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-08-23 17:08 UTC (permalink / raw)
  To: Roedel, Joerg
  Cc: Benjamin Herrenschmidt, chrisw, Alexey Kardashevskiy, kvm,
	Paul Mackerras, linux-pci, qemu-devel, iommu, Avi Kivity,
	Anthony Liguori, linuxppc-dev, benve

On Tue, 2011-08-23 at 15:14 +0200, Roedel, Joerg wrote:
> On Mon, Aug 22, 2011 at 03:17:00PM -0400, Alex Williamson wrote:
> > On Mon, 2011-08-22 at 19:25 +0200, Joerg Roedel wrote:
> 
> > > I am in favour of /dev/vfio/$GROUP. If multiple devices should be
> > > assigned to a guest, there can also be an ioctl to bind a group to an
> > > address-space of another group (certainly needs some care to not allow
> > > that both groups belong to different processes).
> > 
> > That's an interesting idea.  Maybe an interface similar to the current
> > uiommu interface, where you open() the 2nd group fd and pass the fd via
> > ioctl to the primary group.  IOMMUs that don't support this would fail
> > the attach device callback, which would fail the ioctl to bind them.  It
> > will need to be designed so any group can be removed from the super-set
> > and the remaining group(s) still works.  This feels like something that
> > can be added after we get an initial implementation.
> 
> Handling it through fds is a good idea. This makes sure that everything
> belongs to one process. I am not really sure yet if we go the way to
> just bind plain groups together or if we create meta-groups. The
> meta-groups thing seems somewhat cleaner, though.

I'm leaning towards binding because we need to make it dynamic, but I
don't really have a good picture of the lifecycle of a meta-group.
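
A rough sketch of that fd-passing bind; the VFIO_GROUP_BIND_GROUP name and
numbers are made up, and the real compatibility check would live in the iommu
driver's attach callback:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

#define VFIO_GROUP_BIND_GROUP   _IOW(';', 20, int)      /* hypothetical */

int bind_groups(const char *primary, const char *secondary)
{
        int g1 = open(primary, O_RDWR);         /* e.g. /dev/vfio/3 */
        int g2 = open(secondary, O_RDWR);       /* e.g. /dev/vfio/7 */

        if (g1 < 0 || g2 < 0)
                return -1;

        /* The iommu layer decides whether the two groups can share an
         * address space; if the attach fails, the ioctl fails and both
         * groups keep their own domains. */
        return ioctl(g1, VFIO_GROUP_BIND_GROUP, g2);
}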

> > > Btw, a problem we haven't entirely talked about yet is
> > > driver-deassignment. User space can decide to de-assign the device from
> > > vfio while a fd is open on it. With PCI there is no way to let this fail
> > > (the .release function returns void last time i checked). Is this a
> > > problem, and yes, how we handle that?
> > 
> > The current vfio has the same problem, we can't unbind a device from
> > vfio while it's attached to a guest.  I think we'd use the same solution
> > too; send out a netlink packet for a device removal and have the .remove
> > call sleep on a wait_event(, refcnt == 0).  We could also set a timeout
> > and SIGBUS the PIDs holding the device if they don't return it
> > willingly.  Thanks,
> 
> Putting the process to sleep (which would be uninterruptible) seems bad.
> The process would sleep until the guest releases the device-group, which
> can take days or months.
> The best thing (and the most intrusive :-) ) is to change PCI core to
> allow unbindings to fail, I think. But this probably further complicates
> the way to upstream VFIO...

Yes, it's not ideal but I think it's sufficient for now and if we later
get support for returning an error from release, we can set a timeout
after notifying the user to make use of that.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-23 17:01                                 ` Alex Williamson
@ 2011-08-23 17:33                                   ` Aaron Fabbri
  -1 siblings, 0 replies; 322+ messages in thread
From: Aaron Fabbri @ 2011-08-23 17:33 UTC (permalink / raw)
  To: Alex Williamson, Benjamin Herrenschmidt
  Cc: Avi Kivity, Alexey Kardashevskiy, kvm, Paul Mackerras,
	qemu-devel, chrisw, iommu, Anthony Liguori, linux-pci,
	linuxppc-dev, benve




On 8/23/11 10:01 AM, "Alex Williamson" <alex.williamson@redhat.com> wrote:

> On Tue, 2011-08-23 at 16:54 +1000, Benjamin Herrenschmidt wrote:
>> On Mon, 2011-08-22 at 17:52 -0700, aafabbri wrote:
>> 
>>> I'm not following you.
>>> 
>>> You have to enforce group/iommu domain assignment whether you have the
>>> existing uiommu API, or if you change it to your proposed
>>> ioctl(inherit_iommu) API.
>>> 
>>> The only change needed to VFIO here should be to make uiommu fd assignment
>>> happen on the groups instead of on device fds.  That operation fails or
>>> succeeds according to the group semantics (all-or-none assignment/same
>>> uiommu).
>> 
>> Ok, so I missed that part where you change uiommu to operate on group
>> fd's rather than device fd's, my apologies if you actually wrote that
>> down :-) It might be obvious ... bear with me I just flew back from the
>> US and I am badly jet lagged ...
> 
> I missed it too, the model I'm proposing entirely removes the uiommu
> concept.
> 
>> So I see what you mean, however...
>> 
>>> I think the question is: do we force 1:1 iommu/group mapping, or do we allow
>>> arbitrary mapping (satisfying group constraints) as we do today.
>>> 
>>> I'm saying I'm an existing user who wants the arbitrary iommu/group mapping
>>> ability and definitely think the uiommu approach is cleaner than the
>>> ioctl(inherit_iommu) approach.  We considered that approach before but it
>>> seemed less clean so we went with the explicit uiommu context.
>> 
>> Possibly, the question that interests me the most is what interface will
>> KVM end up using. I'm also not terribly fond of the (perceived)
>> discrepancy between using uiommu to create groups but using the group fd
>> to actually do the mappings, at least if that is still the plan.
> 
> Current code: uiommu creates the domain, we bind a vfio device to that
> domain via a SET_UIOMMU_DOMAIN ioctl on the vfio device, then do
> mappings via MAP_DMA on the vfio device (affecting all the vfio devices
> bound to the domain)
> 
> My current proposal: "groups" are predefined.  groups ~= iommu domain.

This is my main objection.  I'd rather not lose the ability to have multiple
devices (which are all predefined as singleton groups on x86 w/o PCI
bridges) share IOMMU resources.  Otherwise, 20 devices sharing buffers would
require 20x the IOMMU/ioTLB resources.  KVM doesn't care about this case?

> The iommu domain would probably be allocated when the first device is
> bound to vfio.  As each device is bound, it gets attached to the group.
> DMAs are done via an ioctl on the group.
> 
> I think group + uiommu leads to effectively reliving most of the
> problems with the current code.  The only benefit is the group
> assignment to enforce hardware restrictions.  We still have the problem
> that uiommu open() = iommu_domain_alloc(), whose properties are
> meaningless without attached devices (groups).  Which I think leads to
> the same awkward model of attaching groups to define the domain, then we
> end up doing mappings via the group to enforce ordering.

Is there a better way to allow groups to share an IOMMU domain?

Maybe, instead of having an ioctl to allow a group A to inherit the same
iommu domain as group B, we could have an ioctl to fully merge two groups
(could be what Ben was thinking):

A.ioctl(MERGE_TO_GROUP, B)

The group A now goes away and its devices join group B.  If A ever had an
iommu domain assigned (and buffers mapped?) we fail.

Groups cannot get smaller (they are defined as minimum granularity of an
IOMMU, initially).  They can get bigger if you want to share IOMMU
resources, though.
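
For illustration, a minimal userspace sketch of how such a merge could look
(the ioctl name and number below are invented just to make the semantics
concrete; this is not an existing interface):

/* Hypothetical MERGE_TO_GROUP -- illustration only, not an existing ioctl. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>

#define VFIO_GROUP_MERGE_TO	0x3b01	/* made-up ioctl number */

/* Merge group A into group B: A's devices join B and A goes away.
 * The kernel would fail this if A already has a domain or mappings.
 * Error-path cleanup omitted for brevity. */
static int merge_groups(const char *a_path, const char *b_path)
{
	int a = open(a_path, O_RDWR);
	int b = open(b_path, O_RDWR);

	if (a < 0 || b < 0)
		return -1;

	if (ioctl(a, VFIO_GROUP_MERGE_TO, b) < 0) {
		perror("merge");
		return -1;
	}
	return 0;	/* from here on, b is the fd to do mappings with */
}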

Any downsides to this approach?

-AF

> 
>> If the separate uiommu interface is kept, then anything that wants to be
>> able to benefit from the ability to put multiple devices (or existing
>> groups) into such a "meta group" would need to be explicitly modified to
>> deal with the uiommu APIs.
>> 
>> I tend to prefer such "meta groups" as being something you create
>> statically using a configuration interface, either via sysfs, netlink or
>> ioctl's to a "control" vfio device driven by a simple command line tool
>> (which can have the configuration stored in /etc and re-apply it at
>> boot).
> 
> I cringe anytime there's a mention of "static".  IMHO, we have to
> support hotplug.  That means "meta groups" change dynamically.  Maybe
> this supports the idea that we should be able to retrieve a new fd from
> the group to do mappings.  Any groups bound together will return the
> same fd and the fd will persist so long as any member of the group is
> open.
> 
>> That way, any program capable of exploiting VFIO "groups" will
>> automatically be able to exploit those "meta groups" (or groups of
>> groups) as well, as long as they are supported on the system.
>> 
>> If we ever have system specific constraints as to how such groups can be
>> created, then it can all be handled at the level of that configuration
>> tool without impact on whatever programs know how to exploit them via
>> the VFIO interfaces.
> 
> I'd prefer to have the constraints be represented in the ioctl to bind
> groups.  It works or not and the platform gets to define what it
> considers compatible.  Thanks,
> 
> Alex
> 

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-23 17:33                                   ` Aaron Fabbri
  (?)
@ 2011-08-23 18:01                                     ` Alex Williamson
  -1 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-08-23 18:01 UTC (permalink / raw)
  To: Aaron Fabbri
  Cc: Benjamin Herrenschmidt, Avi Kivity, Alexey Kardashevskiy, kvm,
	Paul Mackerras, qemu-devel, chrisw, iommu, Anthony Liguori,
	linux-pci, linuxppc-dev, benve

On Tue, 2011-08-23 at 10:33 -0700, Aaron Fabbri wrote:
> 
> 
> On 8/23/11 10:01 AM, "Alex Williamson" <alex.williamson@redhat.com> wrote:
> 
> > On Tue, 2011-08-23 at 16:54 +1000, Benjamin Herrenschmidt wrote:
> >> On Mon, 2011-08-22 at 17:52 -0700, aafabbri wrote:
> >> 
> >>> I'm not following you.
> >>> 
> >>> You have to enforce group/iommu domain assignment whether you have the
> >>> existing uiommu API, or if you change it to your proposed
> >>> ioctl(inherit_iommu) API.
> >>> 
> >>> The only change needed to VFIO here should be to make uiommu fd assignment
> >>> happen on the groups instead of on device fds.  That operation fails or
> >>> succeeds according to the group semantics (all-or-none assignment/same
> >>> uiommu).
> >> 
> >> Ok, so I missed that part where you change uiommu to operate on group
> >> fd's rather than device fd's, my apologies if you actually wrote that
> >> down :-) It might be obvious ... bear with me I just flew back from the
> >> US and I am badly jet lagged ...
> > 
> > I missed it too, the model I'm proposing entirely removes the uiommu
> > concept.
> > 
> >> So I see what you mean, however...
> >> 
> >>> I think the question is: do we force 1:1 iommu/group mapping, or do we allow
> >>> arbitrary mapping (satisfying group constraints) as we do today.
> >>> 
> >>> I'm saying I'm an existing user who wants the arbitrary iommu/group mapping
> >>> ability and definitely think the uiommu approach is cleaner than the
> >>> ioctl(inherit_iommu) approach.  We considered that approach before but it
> >>> seemed less clean so we went with the explicit uiommu context.
> >> 
> >> Possibly, the question that interests me the most is what interface will
> >> KVM end up using. I'm also not terribly fond of the (perceived)
> >> discrepancy between using uiommu to create groups but using the group fd
> >> to actually do the mappings, at least if that is still the plan.
> > 
> > Current code: uiommu creates the domain, we bind a vfio device to that
> > domain via a SET_UIOMMU_DOMAIN ioctl on the vfio device, then do
> > mappings via MAP_DMA on the vfio device (affecting all the vfio devices
> > bound to the domain)
> > 
> > My current proposal: "groups" are predefined.  groups ~= iommu domain.
> 
> This is my main objection.  I'd rather not lose the ability to have multiple
> devices (which are all predefined as singleton groups on x86 w/o PCI
> bridges) share IOMMU resources.  Otherwise, 20 devices sharing buffers would
> require 20x the IOMMU/ioTLB resources.  KVM doesn't care about this case?

We do care, I just wasn't prioritizing it as heavily since I think the
typical model is probably closer to 1 device per guest.

> > The iommu domain would probably be allocated when the first device is
> > bound to vfio.  As each device is bound, it gets attached to the group.
> > DMAs are done via an ioctl on the group.
> > 
> > I think group + uiommu leads to effectively reliving most of the
> > problems with the current code.  The only benefit is the group
> > assignment to enforce hardware restrictions.  We still have the problem
> > that uiommu open() = iommu_domain_alloc(), whose properties are
> > meaningless without attached devices (groups).  Which I think leads to
> > the same awkward model of attaching groups to define the domain, then we
> > end up doing mappings via the group to enforce ordering.
> 
> Is there a better way to allow groups to share an IOMMU domain?
> 
> Maybe, instead of having an ioctl to allow a group A to inherit the same
> iommu domain as group B, we could have an ioctl to fully merge two groups
> (could be what Ben was thinking):
> 
> A.ioctl(MERGE_TO_GROUP, B)
> 
> The group A now goes away and its devices join group B.  If A ever had an
> iommu domain assigned (and buffers mapped?) we fail.
> 
> Groups cannot get smaller (they are defined as minimum granularity of an
> IOMMU, initially).  They can get bigger if you want to share IOMMU
> resources, though.
> 
> Any downsides to this approach?

That's sort of the way I'm picturing it.  When groups are bound
together, they effectively form a pool, where all the groups are peers.
When the MERGE/BIND ioctl is called on group A and passed the group B
fd, A can check compatibility of the domain associated with B, unbind
devices from the B domain and attach them to the A domain.  The B domain
would then be freed and it would bump the refcnt on the A domain.  If we
need to remove A from the pool, we call UNMERGE/UNBIND on B with the A
fd, it will remove the A devices from the shared object, disassociate A
with the shared object, re-alloc a domain for A and rebind A devices to
that domain. 
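
Roughly, in userspace terms, that would be something like the sketch below
(the ioctl names, numbers and group paths are all invented here, just to pin
the proposed lifecycle down):

/* Illustration only: proposed semantics, not an existing ABI. */
#include <fcntl.h>
#include <sys/ioctl.h>

#define VFIO_GROUP_MERGE	0x3b10	/* arg: fd of the group joining the pool */
#define VFIO_GROUP_UNMERGE	0x3b11	/* arg: fd of the group leaving the pool */

static void pool_example(void)
{
	int a = open("/dev/vfio/26", O_RDWR);	/* group numbers made up */
	int b = open("/dev/vfio/27", O_RDWR);	/* error handling omitted */

	/* B's devices are detached from B's domain and attached to A's;
	 * B's domain is freed and A's domain refcount is bumped. */
	ioctl(a, VFIO_GROUP_MERGE, b);

	/* ... DMA mappings via either group fd now affect the whole pool ... */

	/* Remove A from the pool: called on B with A's fd, A gets a freshly
	 * allocated domain and its devices are rebound to it. */
	ioctl(b, VFIO_GROUP_UNMERGE, a);
}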

This is where it seems like it might be helpful to make a GET_IOMMU_FD
ioctl so that an iommu object is ubiquitous and persistent across the
pool.  Operations on any group fd work on the pool as a whole.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-22 21:01                         ` Benjamin Herrenschmidt
  (?)
@ 2011-08-23 19:30                           ` Alex Williamson
  -1 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-08-23 19:30 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: David Gibson, chrisw, Alexey Kardashevskiy, kvm, Paul Mackerras,
	linux-pci, qemu-devel, aafabbri, iommu, Avi Kivity,
	Anthony Liguori, linuxppc-dev, benve

On Tue, 2011-08-23 at 07:01 +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2011-08-22 at 09:45 -0600, Alex Williamson wrote:
> 
> > Yes, that's the idea.  An open question I have towards the configuration
> > side is whether we might add iommu driver specific options to the
> > groups.  For instance on x86 where we typically have B:D.F granularity,
> > should we have an option not to trust multi-function devices and use a
> > B:D granularity for grouping?
> 
> Or even B or range of busses... if you want to enforce strict isolation
> you really can't trust anything below a bus level :-)
> 
> > Right, we can also combine models.  Binding a device to vfio
> > creates /dev/vfio$GROUP, which only allows a subset of ioctls and no
> > device access until all the group devices are also bound.  I think
> > the /dev/vfio/$GROUP might help provide an enumeration interface as well
> > though, which could be useful.
> 
> Could be tho in what form ? returning sysfs paths ?

I'm at a loss there, please suggest.  I think we need an ioctl that
returns some kind of array of devices within the group and another that
maybe takes an index from that array and returns an fd for that device.
A sysfs path string might be a reasonable array element, but it sounds
like a pain to work with.
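
For example, something along these lines might do (everything below is made
up; it's only meant to show a count + index-to-info shape instead of raw
sysfs path strings):

/* Hypothetical enumeration interface -- nothing here exists today. */
#include <stdio.h>
#include <sys/ioctl.h>

struct group_device_info {
	int  index;		/* in: which device in the group */
	char name[64];		/* out: e.g. "0000:00:19.0" */
};

#define GROUP_GET_NUM_DEVICES	0x3b20	/* returns the number of devices */
#define GROUP_GET_DEVICE_INFO	0x3b21	/* fills a struct group_device_info */
#define GROUP_GET_DEVICE_FD	0x3b22	/* arg: index, returns a new device fd */

static void list_group(int group_fd)
{
	int i, n = ioctl(group_fd, GROUP_GET_NUM_DEVICES);

	for (i = 0; i < n; i++) {
		struct group_device_info info = { .index = i };

		if (ioctl(group_fd, GROUP_GET_DEVICE_INFO, &info) == 0)
			printf("%d: %s\n", i, info.name);
	}
}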

> > 1:1 group<->process is probably too strong.  Not allowing concurrent
> > open()s on the group file enforces a single userspace entity is
> > responsible for that group.  Device fds can be passed to other
> > processes, but only retrieved via the group fd.  I suppose we could even
> > branch off the dma interface into a different fd, but it seems like we
> > would logically want to serialize dma mappings at each iommu group
> > anyway.  I'm open to alternatives, this just seemed an easy way to do
> > it.  Restricting on UID implies that we require isolated qemu instances
> > to run as different UIDs.  I know that's a goal, but I don't know if we
> > want to make it an assumption in the group security model.
> 
> 1:1 process has the advantage of linking to an -mm which makes the whole
> mmu notifier business doable. How do you want to track down mappings and
> do the second level translation in the case of explicit map/unmap (like
> on power) if you are not tied to an mm_struct ?

Right, I threw away the mmu notifier code that was originally part of
vfio because we can't do anything useful with it yet on x86.  I
definitely don't want to prevent it where it makes sense though.  Maybe
we just record current->mm on open and restrict subsequent opens to the
same.
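
Kernel-side that could be as small as the sketch below (struct vfio_group,
its lock and the lookup helper are assumed, and the matching release path
would drop the claim when the last opener goes away):

/* Sketch: claim the group for the first opener's mm, reject other mms. */
static int vfio_group_open(struct inode *inode, struct file *filep)
{
	struct vfio_group *group = vfio_group_from_inode(inode); /* assumed helper */
	int ret = 0;

	mutex_lock(&group->lock);
	if (!group->mm)
		group->mm = current->mm;	/* first open claims the group */
	else if (group->mm != current->mm)
		ret = -EBUSY;			/* different mm: refuse the open */
	mutex_unlock(&group->lock);

	if (!ret)
		filep->private_data = group;
	return ret;
}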

> > Yes.  I'm not sure there's a good ROI to prioritize that model.  We have
> > to assume >1 device per guest is a typical model and that the iotlb is
> > large enough that we might improve thrashing to see both a resource and
> > performance benefit from it.  I'm open to suggestions for how we could
> > include it though.
> 
> Sharing may or may not be possible depending on setups so yes, it's a
> bit tricky.
> 
> My preference is to have a static interface (and that's actually where
> your pet netlink might make some sense :-) to create "synthetic" groups
> made of other groups if the arch allows it. But that might not be the
> best approach. In another email I also proposed an option for a group to
> "capture" another one...

I already made some comments on this in a different thread, so I won't
repeat here.

> > > If that's
> > > not what you're saying, how would the domains - now made up of a
> > > user's selection of groups, rather than individual devices - be
> > > configured?
> > > 
> > > > Hope that captures it, feel free to jump in with corrections and
> > > > suggestions.  Thanks,
> > > 
> 
> Another aspect I don't see discussed is how we represent these things to
> the guest.
> 
> On Power for example, I have a requirement that a given iommu domain is
> represented by a single dma window property in the device-tree. What
> that means is that that property needs to be either in the node of the
> device itself if there's only one device in the group or in a parent
> node (ie a bridge or host bridge) if there are multiple devices.
> 
> Now I do -not- want to go down the path of simulating P2P bridges,
> besides we'll quickly run out of bus numbers if we go there.
> 
> For us the most simple and logical approach (which is also what pHyp
> uses and what Linux handles well) is really to expose a given PCI host
> bridge per group to the guest. Believe it or not, it makes things
> easier :-)

I'm all for easier.  Why does exposing the bridge use fewer bus numbers
than emulating a bridge?

On x86, I want to maintain that our default assignment is at the device
level.  A user should be able to pick single or multiple devices from
across several groups and have them all show up as individual,
hotpluggable devices on bus 0 in the guest.  Not surprisingly, we've
also seen cases where users try to attach a bridge to the guest,
assuming they'll get all the devices below the bridge, so I'd be in
favor of making this "just work" if possible too, though we may have to
prevent hotplug of those.

Given the device-level requirement and since everything is a PCI device
on x86, I'd like to keep a qemu command line something like -device
vfio,host=00:19.0.  I assume that some of the iommu properties, such as
dma window size/address, will be query-able through an architecture
specific (or general if possible) ioctl on the vfio group fd.  I hope
that will help the specification, but I don't fully understand what all
remains.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-23 13:18                         ` Roedel, Joerg
  (?)
@ 2011-08-23 23:35                           ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 322+ messages in thread
From: Benjamin Herrenschmidt @ 2011-08-23 23:35 UTC (permalink / raw)
  To: Roedel, Joerg
  Cc: Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci, qemu-devel,
	iommu, chrisw, Alex Williamson, Avi Kivity, linuxppc-dev, benve

On Tue, 2011-08-23 at 15:18 +0200, Roedel, Joerg wrote:
> On Mon, Aug 22, 2011 at 05:03:53PM -0400, Benjamin Herrenschmidt wrote:
> > 
> > > I am in favour of /dev/vfio/$GROUP. If multiple devices should be
> > > assigned to a guest, there can also be an ioctl to bind a group to an
> > > address-space of another group (certainly needs some care to not allow
> > > that both groups belong to different processes).
> > > 
> > > Btw, a problem we haven't entirely talked about yet is
> > > driver-deassignment. User space can decide to de-assign the device from
> > > vfio while a fd is open on it. With PCI there is no way to let this fail
> > > (the .release function returns void last time I checked). Is this a
> > > problem, and if yes, how do we handle it?
> > 
> > We can treat it as a hard unplug (like a cardbus gone away).
> > 
> > IE. Dispose of the direct mappings (switch to MMIO emulation) and return
> > all ff's from reads (& ignore writes).
> > 
> > Then send an unplug event via whatever mechanism the platform provides
> > (ACPI hotplug controller on x86 for example, we haven't quite sorted out
> > what to do on power for hotplug yet).
> 
> Hmm, good idea. But as far as I know the hotplug-event needs to be in
> the guest _before_ the device is actually unplugged (so that the guest
> can unbind its driver first). That somehow brings back the sleep-idea
> and the timeout in the .release function.

That's for normal assisted hotplug, but don't we support hard hotplug ?
I mean, things like cardbus, thunderbolt (if we ever support that)
etc... will need it and some platforms do support hard hotplug of PCIe
devices.

(That's why drivers should never spin on MMIO waiting for a 1 bit to
clear without a timeout :-)
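
To make the hard-unplug behaviour above concrete, the device-fd read path
could degrade to something like this once the device is gone (a sketch only,
with struct vfio_dev and the normal access path assumed; the write path would
just return count and discard the data):

/* Sketch: after a hard unplug, reads return all 1s and writes are dropped. */
static ssize_t vfio_dev_read(struct file *filep, char __user *buf,
			     size_t count, loff_t *ppos)
{
	struct vfio_dev *vdev = filep->private_data;

	if (vdev->gone) {
		size_t i;

		for (i = 0; i < count; i++)
			if (put_user((u8)0xff, (u8 __user *)buf + i))
				return -EFAULT;
		return count;
	}

	return vfio_dev_do_read(vdev, buf, count, ppos);	/* assumed */
}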

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-23 16:23                           ` Alex Williamson
  (?)
@ 2011-08-23 23:41                             ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 322+ messages in thread
From: Benjamin Herrenschmidt @ 2011-08-23 23:41 UTC (permalink / raw)
  To: Alex Williamson
  Cc: chrisw, Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	qemu-devel, David Gibson, aafabbri, iommu, Avi Kivity,
	linuxppc-dev, benve

On Tue, 2011-08-23 at 10:23 -0600, Alex Williamson wrote:
> 
> Yeah.  Joerg's idea of binding groups internally (pass the fd of one
> group to another via ioctl) is one option.  The tricky part will be
> implementing it to support hot unplug of any group from the
> supergroup.
> I believe Ben had a suggestion that supergroups could be created in
> sysfs, but I don't know what the mechanism to do that looks like.  It
> would also be an extra management step to dynamically bind and unbind
> groups to the supergroup around hotplug.  Thanks, 

I don't really care that much what the method for creating them is, to
be honest, I just prefer this concept of "meta groups" or "super groups"
or "synthetic groups" (whatever you want to name them) to having a
separate uiommu file descriptor.

The one reason I have a slight preference for creating them "statically"
using some kind of separate interface (again, I don't care whether it's
sysfs, netlink, etc...) is that it means things like qemu don't have to
care about them.

In general, apps that want to use vfio can just get passed the path to
such a group or the /dev/ path or the group number (whatever we choose as
the way to identify a group), and don't need to know anything about
"super groups", how to manipulate them, create them, possible
constraints etc...

Now, libvirt might want to know about that other API in order to provide
control on the creation of these things, but that's a different issue.

By "static" I mean they persist, they aren't tied to the lifetime of an
fd.

Now that's purely a preference on my side because I believe it will make
life easier for actual programs wanting to use vfio to not have to care
about those super-groups, but as I said earlier, I don't actually care
that much :-)

Cheers,
Ben.
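
A small sketch of the consumer side being argued for here: the program is
simply handed a group path by whatever does the configuration (libvirt, a
wrapper script, ...) and opens it, without knowing whether any "super group"
was assembled behind the scenes. The path below is made up, and no real
ioctl ABI is assumed.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	/* e.g. "/dev/vfio/42", handed to us by management */
	const char *group_path = argc > 1 ? argv[1] : "/dev/vfio/42";
	int fd = open(group_path, O_RDWR);

	if (fd < 0) {
		perror(group_path);
		return EXIT_FAILURE;
	}

	/*
	 * From here the app would set up DMA, fetch device fds, etc. via
	 * whatever ioctls the final ABI provides; none of that requires it
	 * to know how (or whether) super-groups were created.
	 */
	close(fd);
	return EXIT_SUCCESS;
}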

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-23 19:30                           ` Alex Williamson
  (?)
@ 2011-08-23 23:51                             ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 322+ messages in thread
From: Benjamin Herrenschmidt @ 2011-08-23 23:51 UTC (permalink / raw)
  To: Alex Williamson
  Cc: aafabbri, Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	qemu-devel, David Gibson, chrisw, iommu, Avi Kivity,
	linuxppc-dev, benve


> > For us the most simple and logical approach (which is also what pHyp
> > uses and what Linux handles well) is really to expose a given PCI host
> > bridge per group to the guest. Believe it or not, it makes things
> > easier :-)
> 
> I'm all for easier.  Why does exposing the bridge use less bus numbers
> than emulating a bridge?

Because a host bridge doesn't look like a PCI to PCI bridge at all for
us. It's an entirely separate domain with its own bus number space
(unlike most x86 setups).

In fact we have some problems afaik in qemu today with the concept of
PCI domains, for example, I think qemu has assumptions about a single
shared IO space domain which isn't true for us (each PCI host bridge
provides a distinct IO space domain starting at 0). We'll have to fix
that, but it's not a huge deal.

So for each "group" we'd expose in the guest an entire separate PCI
domain space with its own IO, MMIO etc... spaces, handed off from a
single device-tree "host bridge" which doesn't itself appear in the
config space, doesn't need any emulation of any config space etc...

> On x86, I want to maintain that our default assignment is at the device
> level.  A user should be able to pick single or multiple devices from
> across several groups and have them all show up as individual,
> hotpluggable devices on bus 0 in the guest.  Not surprisingly, we've
> also seen cases where users try to attach a bridge to the guest,
> assuming they'll get all the devices below the bridge, so I'd be in
> favor of making this "just work" if possible too, though we may have to
> prevent hotplug of those.
>
> Given the device requirement on x86 and since everything is a PCI device
> on x86, I'd like to keep a qemu command line something like -device
> vfio,host=00:19.0.  I assume that some of the iommu properties, such as
> dma window size/address, will be query-able through an architecture
> specific (or general if possible) ioctl on the vfio group fd.  I hope
> that will help the specification, but I don't fully understand what all
> remains.  Thanks,

Well, for iommu there's a couple of different issues here but yes,
basically on one side we'll have some kind of ioctl to know what segment
of the device(s) DMA address space is assigned to the group and we'll
need to represent that to the guest via a device-tree property in some
kind of "parent" node of all the devices in that group.

We -might- be able to implement some kind of hotplug of individual
devices of a group under such a PHB (PCI Host Bridge), I don't know for
sure yet, some of that PAPR stuff is pretty arcane, but basically, for
all intents and purposes, we really want a group to be represented as a
PHB in the guest.

We cannot arbitrarily have individual devices of separate groups be
represented in the guest as siblings on a single simulated PCI bus.

Cheers,
Ben.
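
A hypothetical sketch of the query Ben mentions: an ioctl on the group fd
that reports the DMA window (offset and size of the bus address segment
assigned to the group), which qemu would then advertise via a device-tree
property on the "parent" node of the group's devices. The structure and
ioctl number are invented for illustration only.

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

struct dma_window_info {
	uint64_t start;  /* first usable bus/DMA address for the group */
	uint64_t size;   /* size of the window in bytes                */
};

/* placeholder ioctl number, not a real ABI */
#define GROUP_GET_DMA_WINDOW _IOR(';', 100, struct dma_window_info)

static int query_dma_window(int group_fd, struct dma_window_info *info)
{
	return ioctl(group_fd, GROUP_GET_DMA_WINDOW, info);
}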

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-23 23:41                             ` Benjamin Herrenschmidt
  (?)
@ 2011-08-24  3:36                               ` Alexander Graf
  -1 siblings, 0 replies; 322+ messages in thread
From: Alexander Graf @ 2011-08-24  3:36 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: chrisw, Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	qemu-devel, iommu, David Gibson, aafabbri, Alex Williamson,
	Avi Kivity, linuxppc-dev, benve


On 23.08.2011, at 18:41, Benjamin Herrenschmidt wrote:

> On Tue, 2011-08-23 at 10:23 -0600, Alex Williamson wrote:
>> 
>> Yeah.  Joerg's idea of binding groups internally (pass the fd of one
>> group to another via ioctl) is one option.  The tricky part will be
>> implementing it to support hot unplug of any group from the
>> supergroup.
>> I believe Ben had a suggestion that supergroups could be created in
>> sysfs, but I don't know what the mechanism to do that looks like.  It
>> would also be an extra management step to dynamically bind and unbind
>> groups to the supergroup around hotplug.  Thanks, 
> 
> I don't really care that much what the method for creating them is, to
> be honest, I just prefer this concept of "meta groups" or "super groups"
> or "synthetic groups" (whatever you want to name them) to having a
> separate uiommu file descriptor.
> 
> The one reason I have a slight preference for creating them "statically"
> using some kind of separate interface (again, I don't care whether it's
> sysfs, netlink, etc...) is that it means things like qemu don't have to
> care about them.
> 
> In general, apps that want to use vfio can just get passed the path to
> such a group or the /dev/ path or the group number (whatever we choose as
> the way to identify a group), and don't need to know anything about
> "super groups", how to manipulate them, create them, possible
> constraints etc...
> 
> Now, libvirt might want to know about that other API in order to provide
> control on the creation of these things, but that's a different issue.
> 
> By "static" I mean they persist, they aren't tied to the lifetime of an
> fd.
> 
> Now that's purely a preference on my side because I believe it will make
> life easier for actual programs wanting to use vfio to not have to care
> about those super-groups, but as I said earlier, I don't actually care
> that much :-)

Oh I think it's one of the building blocks we need for a sane user space device exposure API. If I want to pass user X a few devices that are all behind a single IOMMU, I just chown that device node to user X and be done with it.

The user space tool actually using the VFIO interface wouldn't be in the configuration business then - and it really shouldn't. That's what system configuration is there for :).

But I'm fairly sure we managed to persuade Alex that this is the right path at the BOF :)


Alex

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-23 23:51                             ` Benjamin Herrenschmidt
  (?)
@ 2011-08-24  3:40                               ` Alexander Graf
  -1 siblings, 0 replies; 322+ messages in thread
From: Alexander Graf @ 2011-08-24  3:40 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alex Williamson, David Gibson, chrisw Wright,
	Alexey Kardashevskiy, kvm@vger.kernel.org list, Paul Mackerras,
	linux-pci, qemu-devel Developers, aafabbri, iommu, Avi Kivity,
	Anthony Liguori, linuxppc-dev, benve, Yoder Stuart-B08248


On 23.08.2011, at 18:51, Benjamin Herrenschmidt wrote:

> 
>>> For us the most simple and logical approach (which is also what pHyp
>>> uses and what Linux handles well) is really to expose a given PCI host
>>> bridge per group to the guest. Believe it or not, it makes things
>>> easier :-)
>> 
>> I'm all for easier.  Why does exposing the bridge use less bus numbers
>> than emulating a bridge?
> 
> Because a host bridge doesn't look like a PCI to PCI bridge at all for
> us. It's an entirely separate domain with its own bus number space
> (unlike most x86 setups).
> 
> In fact we have some problems afaik in qemu today with the concept of
> PCI domains, for example, I think qemu has assumptions about a single
> shared IO space domain which isn't true for us (each PCI host bridge
> provides a distinct IO space domain starting at 0). We'll have to fix
> that, but it's not a huge deal.
> 
> So for each "group" we'd expose in the guest an entire separate PCI
> domain space with its own IO, MMIO etc... spaces, handed off from a
> single device-tree "host bridge" which doesn't itself appear in the
> config space, doesn't need any emulation of any config space etc...
> 
>> On x86, I want to maintain that our default assignment is at the device
>> level.  A user should be able to pick single or multiple devices from
>> across several groups and have them all show up as individual,
>> hotpluggable devices on bus 0 in the guest.  Not surprisingly, we've
>> also seen cases where users try to attach a bridge to the guest,
>> assuming they'll get all the devices below the bridge, so I'd be in
>> favor of making this "just work" if possible too, though we may have to
>> prevent hotplug of those.
>> 
>> Given the device requirement on x86 and since everything is a PCI device
>> on x86, I'd like to keep a qemu command line something like -device
>> vfio,host=00:19.0.  I assume that some of the iommu properties, such as
>> dma window size/address, will be query-able through an architecture
>> specific (or general if possible) ioctl on the vfio group fd.  I hope
>> that will help the specification, but I don't fully understand what all
>> remains.  Thanks,
> 
> Well, for iommu there's a couple of different issues here but yes,
> basically on one side we'll have some kind of ioctl to know what segment
> of the device(s) DMA address space is assigned to the group and we'll
> need to represent that to the guest via a device-tree property in some
> kind of "parent" node of all the devices in that group.
> 
> We -might- be able to implement some kind of hotplug of individual
> devices of a group under such a PHB (PCI Host Bridge), I don't know for
> sure yet, some of that PAPR stuff is pretty arcane, but basically, for
> all intents and purposes, we really want a group to be represented as a
> PHB in the guest.
> 
> We cannot arbitrarily have individual devices of separate groups be
> represented in the guest as siblings on a single simulated PCI bus.

So would it make sense for you to go the same route that we need to go on embedded power, with a separate VFIO style interface that simply exports memory ranges and irq bindings, but doesn't know anything about PCI? For e500, we'll be using something like that to pass through a full PCI bus into the system.


Alex
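
A rough sketch of the kind of bus-agnostic description Alexander is
suggesting: the interface would only export mappable memory ranges and
interrupt bindings, with no PCI-specific notions (config space, BARs, BDFs).
These structures are purely illustrative, not a proposed ABI.

#include <stdint.h>

struct pt_mem_range {
	uint64_t phys_addr;  /* host physical address of the region */
	uint64_t size;       /* length in bytes                     */
	uint32_t flags;      /* e.g. read-only, prefetchable, ...   */
};

struct pt_irq_binding {
	uint32_t hwirq;      /* host interrupt line or MSI slot     */
	uint32_t flags;      /* level/edge, maskable, ...           */
};

struct pt_device_desc {
	uint32_t nr_ranges;
	uint32_t nr_irqs;
	struct pt_mem_range   *ranges;  /* nr_ranges entries */
	struct pt_irq_binding *irqs;    /* nr_irqs entries   */
};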

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-23 19:30                           ` Alex Williamson
  (?)
@ 2011-08-24  8:43                             ` Joerg Roedel
  -1 siblings, 0 replies; 322+ messages in thread
From: Joerg Roedel @ 2011-08-24  8:43 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Benjamin Herrenschmidt, Alexey Kardashevskiy, kvm,
	Paul Mackerras, linux-pci, qemu-devel, David Gibson, chrisw,
	iommu, Avi Kivity, Anthony Liguori, linuxppc-dev, benve

On Tue, Aug 23, 2011 at 03:30:06PM -0400, Alex Williamson wrote:
> On Tue, 2011-08-23 at 07:01 +1000, Benjamin Herrenschmidt wrote:

> > Could be tho in what form ? returning sysfs pathes ?
> 
> I'm at a loss there, please suggest.  I think we need an ioctl that
> returns some kind of array of devices within the group and another that
> maybe takes an index from that array and returns an fd for that device.
> A sysfs path string might be a reasonable array element, but it sounds
> like a pain to work with.

Limiting to PCI, we can just pass the BDF as the argument to obtain the
device-fd. For a more generic solution we need an identifier which is
unique across all 'struct device' instances in the system. As far as I
know we don't have that yet (besides the sysfs path), so we either add
that or stick with bus-specific solutions.

> > 1:1 process has the advantage of linking to an -mm which makes the whole
> > mmu notifier business doable. How do you want to track down mappings and
> > do the second level translation in the case of explicit map/unmap (like
> > on power) if you are not tied to an mm_struct ?
> 
> Right, I threw away the mmu notifier code that was originally part of
> vfio because we can't do anything useful with it yet on x86.  I
> definitely don't want to prevent it where it makes sense though.  Maybe
> we just record current->mm on open and restrict subsequent opens to the
> same.

Hmm, I think we need io-page-fault support in the iommu-api then.

> > Another aspect I don't see discussed is how we represent these things to
> > the guest.
> > 
> > On Power for example, I have a requirement that a given iommu domain is
> > represented by a single dma window property in the device-tree. What
> > that means is that that property needs to be either in the node of the
> > device itself if there's only one device in the group or in a parent
> > node (ie a bridge or host bridge) if there are multiple devices.
> > 
> > Now I do -not- want to go down the path of simulating P2P bridges,
> > besides we'll quickly run out of bus numbers if we go there.
> > 
> > For us the most simple and logical approach (which is also what pHyp
> > uses and what Linux handles well) is really to expose a given PCI host
> > bridge per group to the guest. Believe it or not, it makes things
> > easier :-)
> 
> I'm all for easier.  Why does exposing the bridge use less bus numbers
> than emulating a bridge?
> 
> On x86, I want to maintain that our default assignment is at the device
> level.  A user should be able to pick single or multiple devices from
> across several groups and have them all show up as individual,
> hotpluggable devices on bus 0 in the guest.  Not surprisingly, we've
> also seen cases where users try to attach a bridge to the guest,
> assuming they'll get all the devices below the bridge, so I'd be in
> favor of making this "just work" if possible too, though we may have to
> prevent hotplug of those.

A side-note: Might it be better to expose assigned devices in a guest on
a separate bus? This will make it easier to emulate an IOMMU for the
guest inside qemu.


	Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632
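
To make the "pass the BDF to obtain the device-fd" idea concrete, here is a
hypothetical sketch of such an ioctl on the group fd. The structure, name
and ioctl number are invented; the eventual ABI could just as well take a
sysfs path string for the bus-agnostic case.

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

struct get_device_fd_bdf {
	uint16_t domain;  /* PCI domain (segment) number    */
	uint8_t  bus;     /* bus number                     */
	uint8_t  devfn;   /* PCI_DEVFN(slot, func) encoding */
};

/* placeholder ioctl number, not a real ABI */
#define GROUP_GET_DEVICE_FD_BDF _IOW(';', 101, struct get_device_fd_bdf)

static int device_fd_from_bdf(int group_fd, uint16_t domain, uint8_t bus,
			      uint8_t devfn)
{
	struct get_device_fd_bdf arg = {
		.domain = domain,
		.bus    = bus,
		.devfn  = devfn,
	};

	/* hypothetical semantics: a new fd for the device on success */
	return ioctl(group_fd, GROUP_GET_DEVICE_FD_BDF, &arg);
}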

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-23 17:08                           ` Alex Williamson
  (?)
@ 2011-08-24  8:52                             ` Roedel, Joerg
  -1 siblings, 0 replies; 322+ messages in thread
From: Roedel, Joerg @ 2011-08-24  8:52 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Benjamin Herrenschmidt, chrisw, Alexey Kardashevskiy, kvm,
	Paul Mackerras, linux-pci, qemu-devel, iommu, Avi Kivity,
	Anthony Liguori, linuxppc-dev, benve

On Tue, Aug 23, 2011 at 01:08:29PM -0400, Alex Williamson wrote:
> On Tue, 2011-08-23 at 15:14 +0200, Roedel, Joerg wrote:

> > Handling it through fds is a good idea. This makes sure that everything
> > belongs to one process. I am not really sure yet if we go the way to
> > just bind plain groups together or if we create meta-groups. The
> > meta-groups thing seems somewhat cleaner, though.
> 
> I'm leaning towards binding because we need to make it dynamic, but I
> don't really have a good picture of the lifecycle of a meta-group.

In my view the life-cycle of the meta-group is a subrange of the
qemu-instance's life-cycle.

> > Putting the process to sleep (which would be uninterruptible) seems bad.
> > The process would sleep until the guest releases the device-group, which
> > can take days or months.
> > The best thing (and the most intrusive :-) ) is to change PCI core to
> > allow unbindings to fail, I think. But this probably further complicates
> > the way to upstream VFIO...
> 
> Yes, it's not ideal but I think it's sufficient for now and if we later
> get support for returning an error from release, we can set a timeout
> after notifying the user to make use of that.  Thanks,

Ben had the idea of just forcing a hard unplug of the device from the
guest. That's probably the best way to deal with it, I think. VFIO
sends a notification to qemu that the device is gone, and qemu informs
the guest about it in some way.

	Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632
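
One possible shape, purely illustrative, of the notification path Joerg
describes: qemu waits on the group (or device) fd, and when the kernel
signals that the device has been revoked it triggers a hard unplug of the
guest device. The poll-based mechanism here is an assumption, not something
the thread settled on.

#include <poll.h>
#include <stdio.h>

static void wait_for_device_gone(int group_fd)
{
	struct pollfd pfd = { .fd = group_fd, .events = POLLPRI };

	if (poll(&pfd, 1, -1) == 1 &&
	    (pfd.revents & (POLLPRI | POLLERR | POLLHUP))) {
		/* tear down mappings, then surprise-remove the device
		 * from the guest (hard unplug) */
		printf("host revoked the device, unplugging it from the guest\n");
	}
}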

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-23 23:35                           ` Benjamin Herrenschmidt
  (?)
@ 2011-08-24  8:53                             ` Roedel, Joerg
  -1 siblings, 0 replies; 322+ messages in thread
From: Roedel, Joerg @ 2011-08-24  8:53 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alex Williamson, chrisw, Alexey Kardashevskiy, kvm,
	Paul Mackerras, linux-pci, qemu-devel, iommu, Avi Kivity,
	Anthony Liguori, linuxppc-dev, benve

On Tue, Aug 23, 2011 at 07:35:37PM -0400, Benjamin Herrenschmidt wrote:
> On Tue, 2011-08-23 at 15:18 +0200, Roedel, Joerg wrote:

> > Hmm, good idea. But as far as I know the hotplug-event needs to be in
> > the guest _before_ the device is actually unplugged (so that the guest
> > can unbind its driver first). That somehow brings back the sleep-idea
> > and the timeout in the .release function.
> 
> That's for normal assisted hotplug, but don't we support hard hotplug ?
> I mean, things like cardbus, thunderbolt (if we ever support that)
> etc... will need it and some platforms do support hard hotplug of PCIe
> devices.
> 
> (That's why drivers should never spin on MMIO waiting for a 1 bit to
> clear without a timeout :-)

Right, that's probably the best semantics for this issue then. The worst
that can happen is that the admin crashes the guest.
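
The polling pattern Ben alludes to in the parenthetical is the usual
bounded-wait loop; a minimal kernel-side sketch, with made-up register
names:

#include <linux/io.h>
#include <linux/jiffies.h>
#include <linux/delay.h>
#include <linux/errno.h>

/* Made-up register layout, purely to illustrate the pattern. */
#define FOO_REG_STATUS   0x10
#define FOO_STATUS_BUSY  0x01

static int foo_wait_idle(void __iomem *mmio)
{
        unsigned long timeout = jiffies + msecs_to_jiffies(100);

        /* Poll the busy bit, but never spin forever: after a surprise
         * removal the read returns all-ones and the bit looks set for good. */
        while (readl(mmio + FOO_REG_STATUS) & FOO_STATUS_BUSY) {
                if (time_after(jiffies, timeout))
                        return -ETIMEDOUT;
                udelay(10);
        }
        return 0;
}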

	Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-23 17:33                                   ` Aaron Fabbri
  (?)
@ 2011-08-24  9:10                                     ` Joerg Roedel
  -1 siblings, 0 replies; 322+ messages in thread
From: Joerg Roedel @ 2011-08-24  9:10 UTC (permalink / raw)
  To: Aaron Fabbri
  Cc: Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci, qemu-devel,
	iommu, chrisw, Alex Williamson, Avi Kivity, linuxppc-dev, benve

On Tue, Aug 23, 2011 at 01:33:14PM -0400, Aaron Fabbri wrote:
> On 8/23/11 10:01 AM, "Alex Williamson" <alex.williamson@redhat.com> wrote:
> > The iommu domain would probably be allocated when the first device is
> > bound to vfio.  As each device is bound, it gets attached to the group.
> > DMAs are done via an ioctl on the group.
> > 
> > I think group + uiommu leads to effectively reliving most of the
> > problems with the current code.  The only benefit is the group
> > assignment to enforce hardware restrictions.  We still have the problem
> > that uiommu open() = iommu_domain_alloc(), whose properties are
> > meaningless without attached devices (groups).  Which I think leads to
> > the same awkward model of attaching groups to define the domain, then we
> > end up doing mappings via the group to enforce ordering.
> 
> Is there a better way to allow groups to share an IOMMU domain?
> 
> Maybe, instead of having an ioctl to allow a group A to inherit the same
> iommu domain as group B, we could have an ioctl to fully merge two groups
> (could be what Ben was thinking):
> 
> A.ioctl(MERGE_TO_GROUP, B)
> 
> The group A now goes away and its devices join group B.  If A ever had an
> iommu domain assigned (and buffers mapped?) we fail.
> 
> Groups cannot get smaller (they are defined as minimum granularity of an
> IOMMU, initially).  They can get bigger if you want to share IOMMU
> resources, though.
> 
> Any downsides to this approach?

As long as this is a two-way road it's fine. There must be a way to split
the groups again after the guest exits. But then we are back at the
super-groups (aka meta-groups, aka uiommu) point.
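
To make the two-way road concrete, the userspace flow could look roughly
like this; the ioctl names, numbers and device paths are placeholders for
the interface being discussed here, nothing that exists today:

#include <sys/ioctl.h>
#include <fcntl.h>

/* Placeholder ioctls -- the real interface is what this thread is about. */
#define VFIO_GROUP_MERGE    _IOW(';', 101, int)
#define VFIO_GROUP_UNMERGE  _IOW(';', 102, int)

static int run_guest_with_merged_groups(void)
{
        int group_a = open("/dev/vfio/pe-a", O_RDWR);   /* illustrative paths */
        int group_b = open("/dev/vfio/pe-b", O_RDWR);

        if (group_a < 0 || group_b < 0)
                return -1;

        /* A's devices join B's iommu domain; fails if A already has mappings. */
        if (ioctl(group_a, VFIO_GROUP_MERGE, &group_b) < 0)
                return -1;

        /* ... map DMA, run the guest ... */

        /* Split the groups again once the guest has exited, so they can be
         * handed to different users later. */
        return ioctl(group_a, VFIO_GROUP_UNMERGE, &group_b);
}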

	Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-23 16:54                                 ` aafabbri
  (?)
@ 2011-08-24  9:14                                   ` Roedel, Joerg
  -1 siblings, 0 replies; 322+ messages in thread
From: Roedel, Joerg @ 2011-08-24  9:14 UTC (permalink / raw)
  To: aafabbri
  Cc: Benjamin Herrenschmidt, Alexey Kardashevskiy, kvm,
	Paul Mackerras, linux-pci, qemu-devel, iommu, chrisw, Avi Kivity,
	Anthony Liguori, linuxppc-dev, benve

On Tue, Aug 23, 2011 at 12:54:27PM -0400, aafabbri wrote:
> On 8/23/11 4:04 AM, "Joerg Roedel" <joerg.roedel@amd.com> wrote:
> > That makes uiommu basically the same as the meta-groups, right?
> 
> Yes, functionality seems the same, thus my suggestion to keep uiommu
> explicit.  Is there some need for group-groups besides defining sets of
> groups which share IOMMU resources?
> 
> I do all this stuff (bringing up sets of devices which may share IOMMU
> domain) dynamically from C applications.  I don't really want some static
> (boot-time or sysfs fiddling) supergroup config unless there is a good
> reason KVM/power needs it.
> 
> As you say in your next email, doing it all from ioctls is very easy,
> programmatically.

I don't see a reason to make this meta-grouping static. It would harm
flexibility on x86. I think it makes things easier on power but there
are options on that platform to get the dynamic solution too.
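
The kind of dynamic, ioctl-driven setup Aaron describes looks roughly
like the following; the device paths and the VFIO_DOMAIN_SET name are
illustrative, in the spirit of the existing uiommu prototype rather than
its exact interface:

#include <sys/ioctl.h>
#include <fcntl.h>

/* Illustrative name/number, not the actual prototype interface. */
#define VFIO_DOMAIN_SET  _IOW(';', 103, int)

/* Share one iommu domain between two devices, set up entirely at runtime
 * from userspace -- no boot-time or sysfs configuration involved. */
static int share_domain(const char *dev1, const char *dev2)
{
        int domain = open("/dev/uiommu", O_RDWR);
        int fd1 = open(dev1, O_RDWR);
        int fd2 = open(dev2, O_RDWR);

        if (domain < 0 || fd1 < 0 || fd2 < 0)
                return -1;

        /* Attach both devices to the same explicitly created domain. */
        if (ioctl(fd1, VFIO_DOMAIN_SET, &domain) < 0 ||
            ioctl(fd2, VFIO_DOMAIN_SET, &domain) < 0)
                return -1;

        return domain;
}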

	Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-24  9:14                                   ` Roedel, Joerg
  (?)
@ 2011-08-24  9:33                                     ` David Gibson
  -1 siblings, 0 replies; 322+ messages in thread
From: David Gibson @ 2011-08-24  9:33 UTC (permalink / raw)
  To: Roedel, Joerg
  Cc: aafabbri, Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	qemu-devel, chrisw, iommu, Avi Kivity, Anthony Liguori,
	linuxppc-dev, benve

On Wed, Aug 24, 2011 at 11:14:26AM +0200, Roedel, Joerg wrote:
> On Tue, Aug 23, 2011 at 12:54:27PM -0400, aafabbri wrote:
> > On 8/23/11 4:04 AM, "Joerg Roedel" <joerg.roedel@amd.com> wrote:
> > > That makes uiommu basically the same as the meta-groups, right?
> > 
> > Yes, functionality seems the same, thus my suggestion to keep uiommu
> > explicit.  Is there some need for group-groups besides defining sets of
> > groups which share IOMMU resources?
> > 
> > I do all this stuff (bringing up sets of devices which may share IOMMU
> > domain) dynamically from C applications.  I don't really want some static
> > (boot-time or sysfs fiddling) supergroup config unless there is a good
> > reason KVM/power needs it.
> > 
> > As you say in your next email, doing it all from ioctls is very easy,
> > programmatically.
> 
> I don't see a reason to make this meta-grouping static. It would harm
> flexibility on x86. I think it makes things easier on power but there
> are options on that platform to get the dynamic solution too.

I think several people are misreading what Ben means by "static".  I
would prefer to say 'persistent', in that the meta-groups' lifetime is
not tied to an fd, but they can be freely created, altered and removed
during runtime.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-24  9:33                                     ` David Gibson
  (?)
@ 2011-08-24 11:03                                       ` Roedel, Joerg
  -1 siblings, 0 replies; 322+ messages in thread
From: Roedel, Joerg @ 2011-08-24 11:03 UTC (permalink / raw)
  To: aafabbri, Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	qemu-devel, chrisw, iommu, Avi Kivity, Anthony Liguori,
	linuxppc-dev, benve

On Wed, Aug 24, 2011 at 05:33:00AM -0400, David Gibson wrote:
> On Wed, Aug 24, 2011 at 11:14:26AM +0200, Roedel, Joerg wrote:

> > I don't see a reason to make this meta-grouping static. It would harm
> > flexibility on x86. I think it makes things easier on power but there
> > are options on that platform to get the dynamic solution too.
> 
> I think several people are misreading what Ben means by "static".  I
> would prefer to say 'persistent', in that the meta-groups' lifetime is
> not tied to an fd, but they can be freely created, altered and removed
> during runtime.

Even if it can be altered at runtime, from a usability perspective it is
certainly best to handle these groups directly in qemu. Or are there
strong reasons to do it somewhere else?

	Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-23 23:51                             ` Benjamin Herrenschmidt
  (?)
@ 2011-08-24 14:47                               ` Alex Williamson
  -1 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-08-24 14:47 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: David Gibson, chrisw, Alexey Kardashevskiy, kvm, Paul Mackerras,
	linux-pci, qemu-devel, aafabbri, iommu, Avi Kivity,
	Anthony Liguori, linuxppc-dev, benve

On Wed, 2011-08-24 at 09:51 +1000, Benjamin Herrenschmidt wrote:
> > > For us the most simple and logical approach (which is also what pHyp
> > > uses and what Linux handles well) is really to expose a given PCI host
> > > bridge per group to the guest. Believe it or not, it makes things
> > > easier :-)
> > 
> > I'm all for easier.  Why does exposing the bridge use less bus numbers
> > than emulating a bridge?
> 
> Because a host bridge doesn't look like a PCI to PCI bridge at all for
> us. It's an entire separate domain with its own bus number space
> (unlike most x86 setups).

Ok, I missed the "host" bridge.

> In fact we have some problems afaik in qemu today with the concept of
> PCI domains, for example, I think qemu has assumptions about a single
> shared IO space domain which isn't true for us (each PCI host bridge
> provides a distinct IO space domain starting at 0). We'll have to fix
> that, but it's not a huge deal.

Yep, I've seen similar on ia64 systems.

> So for each "group" we'd expose in the guest an entire separate PCI
> domain space with its own IO, MMIO etc... spaces, handed off from a
> single device-tree "host bridge" which doesn't itself appear in the
> config space, doesn't need any emulation of any config space etc...
> 
> > On x86, I want to maintain that our default assignment is at the device
> > level.  A user should be able to pick single or multiple devices from
> > across several groups and have them all show up as individual,
> > hotpluggable devices on bus 0 in the guest.  Not surprisingly, we've
> > also seen cases where users try to attach a bridge to the guest,
> > assuming they'll get all the devices below the bridge, so I'd be in
> > favor of making this "just work" if possible too, though we may have to
> > prevent hotplug of those.
> >
> > Given the device requirement on x86 and since everything is a PCI device
> > on x86, I'd like to keep a qemu command line something like -device
> > vfio,host=00:19.0.  I assume that some of the iommu properties, such as
> > dma window size/address, will be query-able through an architecture
> > specific (or general if possible) ioctl on the vfio group fd.  I hope
> > that will help the specification, but I don't fully understand what all
> > remains.  Thanks,
> 
> Well, for iommu there's a couple of different issues here but yes,
> basically on one side we'll have some kind of ioctl to know what segment
> of the device(s) DMA address space is assigned to the group and we'll
> need to represent that to the guest via a device-tree property in some
> kind of "parent" node of all the devices in that group.
> 
> We -might- be able to implement some kind of hotplug of individual
> devices of a group under such a PHB (PCI Host Bridge), I don't know for
> sure yet, some of that PAPR stuff is pretty arcane, but basically, for
> all intents and purposes, we really want a group to be represented as a
> PHB in the guest.
> 
> We cannot arbitrarily have individual devices of separate groups be
> represented in the guest as siblings on a single simulated PCI bus.

I think the vfio kernel layer we're describing easily supports both.
This is just a matter of adding qemu-vfio code to expose different
topologies based on group iommu capabilities and mapping mode.  Thanks,
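
The dma-window query mentioned in the quoted exchange could be something
along these lines on the qemu side; the struct layout and ioctl name are
purely hypothetical:

#include <sys/ioctl.h>
#include <stdint.h>

/* Hypothetical interface: report the group's DMA window so qemu can
 * expose it to the guest (e.g. as a device-tree property on POWER). */
struct vfio_dma_window {
        uint64_t start;         /* first usable DMA (bus) address */
        uint64_t size;          /* size of the window in bytes */
};

#define VFIO_GROUP_GET_DMA_WINDOW  _IOR(';', 104, struct vfio_dma_window)

static int query_dma_window(int group_fd, uint64_t *start, uint64_t *size)
{
        struct vfio_dma_window w;

        if (ioctl(group_fd, VFIO_GROUP_GET_DMA_WINDOW, &w) < 0)
                return -1;

        *start = w.start;
        *size = w.size;
        return 0;
}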

Alex

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-24  8:43                             ` Joerg Roedel
  (?)
@ 2011-08-24 14:56                               ` Alex Williamson
  -1 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-08-24 14:56 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Benjamin Herrenschmidt, Alexey Kardashevskiy, kvm,
	Paul Mackerras, linux-pci, qemu-devel, David Gibson, chrisw,
	iommu, Avi Kivity, Anthony Liguori, linuxppc-dev, benve

On Wed, 2011-08-24 at 10:43 +0200, Joerg Roedel wrote:
> On Tue, Aug 23, 2011 at 03:30:06PM -0400, Alex Williamson wrote:
> > On Tue, 2011-08-23 at 07:01 +1000, Benjamin Herrenschmidt wrote:
> 
> > > Could be tho in what form ? returning sysfs pathes ?
> > 
> > I'm at a loss there, please suggest.  I think we need an ioctl that
> > returns some kind of array of devices within the group and another that
> > maybe takes an index from that array and returns an fd for that device.
> > A sysfs path string might be a reasonable array element, but it sounds
> > like a pain to work with.
> 
> Limiting to PCI we can just pass the BDF as the argument to obtain the
> device-fd. For a more generic solution we need a unique identifier in
> some way which is unique across all 'struct device' instances in the
> system. As far as I know we don't have that yet (besides the sysfs-path)
> so we either add that or stick with bus-specific solutions.
> 
> > > 1:1 process has the advantage of linking to an -mm which makes the whole
> > > mmu notifier business doable. How do you want to track down mappings and
> > > do the second level translation in the case of explicit map/unmap (like
> > > on power) if you are not tied to an mm_struct ?
> > 
> > Right, I threw away the mmu notifier code that was originally part of
> > vfio because we can't do anything useful with it yet on x86.  I
> > definitely don't want to prevent it where it makes sense though.  Maybe
> > we just record current->mm on open and restrict subsequent opens to the
> > same.
> 
> Hmm, I think we need io-page-fault support in the iommu-api then.

Yeah, when we can handle iommu page faults, this gets more interesting.

> > > Another aspect I don't see discussed is how we represent these things to
> > > the guest.
> > > 
> > > On Power for example, I have a requirement that a given iommu domain is
> > > represented by a single dma window property in the device-tree. What
> > > that means is that that property needs to be either in the node of the
> > > device itself if there's only one device in the group or in a parent
> > > node (ie a bridge or host bridge) if there are multiple devices.
> > > 
> > > Now I do -not- want to go down the path of simulating P2P bridges,
> > > besides we'll quickly run out of bus numbers if we go there.
> > > 
> > > For us the most simple and logical approach (which is also what pHyp
> > > uses and what Linux handles well) is really to expose a given PCI host
> > > bridge per group to the guest. Believe it or not, it makes things
> > > easier :-)
> > 
> > I'm all for easier.  Why does exposing the bridge use less bus numbers
> > than emulating a bridge?
> > 
> > On x86, I want to maintain that our default assignment is at the device
> > level.  A user should be able to pick single or multiple devices from
> > across several groups and have them all show up as individual,
> > hotpluggable devices on bus 0 in the guest.  Not surprisingly, we've
> > also seen cases where users try to attach a bridge to the guest,
> > assuming they'll get all the devices below the bridge, so I'd be in
> > favor of making this "just work" if possible too, though we may have to
> > prevent hotplug of those.
> 
> A side-note: Might it be better to expose assigned devices in a guest on
> a separate bus? This will make it easier to emulate an IOMMU for the
> guest inside qemu.

I think we want that option, sure.  A lot of guests aren't going to
support hotplugging buses though, so I think our default model of mapping
the entire guest should still use bus 0.  The ACPI gets a lot more
complicated for that model too; dynamic SSDTs?  Thanks,
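
The enumeration interface discussed in the quotes above (device count,
then index- or BDF-to-fd lookup) could look roughly like this from
userspace; every name here is a placeholder, not an agreed-on API:

#include <sys/ioctl.h>

/* Placeholder ioctls for the enumeration interface discussed above. */
#define VFIO_GROUP_GET_NUM_DEVICES  _IOR(';', 105, int)
#define VFIO_GROUP_GET_DEVICE_FD    _IOW(';', 106, int)   /* index in, fd out */

static int first_device_fd(int group_fd)
{
        int n, index = 0;

        if (ioctl(group_fd, VFIO_GROUP_GET_NUM_DEVICES, &n) < 0 || n == 0)
                return -1;

        /* Returns a new fd for device 'index' within the group; the argument
         * could just as well be a BDF or a sysfs path, as discussed above. */
        return ioctl(group_fd, VFIO_GROUP_GET_DEVICE_FD, &index);
}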

Alex

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-24  8:52                             ` Roedel, Joerg
  (?)
@ 2011-08-24 15:07                               ` Alex Williamson
  -1 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-08-24 15:07 UTC (permalink / raw)
  To: Roedel, Joerg
  Cc: Alexey Kardashevskiy, kvm, Paul Mackerras, qemu-devel, chrisw,
	iommu, Avi Kivity, linux-pci, linuxppc-dev, benve

On Wed, 2011-08-24 at 10:52 +0200, Roedel, Joerg wrote:
> On Tue, Aug 23, 2011 at 01:08:29PM -0400, Alex Williamson wrote:
> > On Tue, 2011-08-23 at 15:14 +0200, Roedel, Joerg wrote:
> 
> > > Handling it through fds is a good idea. This makes sure that everything
> > > belongs to one process. I am not really sure yet if we go the way to
> > > just bind plain groups together or if we create meta-groups. The
> > > meta-groups thing seems somewhat cleaner, though.
> > 
> > I'm leaning towards binding because we need to make it dynamic, but I
> > don't really have a good picture of the lifecycle of a meta-group.
> 
> In my view the life-cycle of the meta-group is a subrange of the
> qemu-instance's life-cycle.

I guess I mean the lifecycle of a super-group that's actually exposed as
a new group in sysfs.  Who creates it?  How?  How are groups dynamically
added and removed from the super-group?  The group merging makes sense
to me because it's largely just an optimization that qemu will try to
merge groups.  If it works, great.  If not, it manages them separately.
When all the devices from a group are unplugged, unmerge the group if
necessary.

> > > Putting the process to sleep (which would be uninterruptible) seems bad.
> > > The process would sleep until the guest releases the device-group, which
> > > can take days or months.
> > > The best thing (and the most intrusive :-) ) is to change PCI core to
> > > allow unbindings to fail, I think. But this probably further complicates
> > > the way to upstream VFIO...
> > 
> > Yes, it's not ideal but I think it's sufficient for now and if we later
> > get support for returning an error from release, we can set a timeout
> > after notifying the user to make use of that.  Thanks,
> 
> Ben had the idea of just forcing a hard unplug of the device from the
> guest. That's probably the best way to deal with it, I think: VFIO
> sends a notification to qemu that the device is gone, and qemu informs
> the guest about it in some way.

We need to try the polite method of attempting to hot unplug the device
from qemu first, which the current vfio code already implements.  We can
then escalate if it doesn't respond.  The current code calls abort in
qemu if the guest doesn't respond, but I agree we should also be
enforcing this at the kernel interface.  I think the problem with the
hard-unplug is that we don't have a good revoke mechanism for the mmio
mmaps.  Thanks,
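
A sketch of that escalation from the user side, with entirely made-up
ioctls standing in for "request unplug", "did the guest release?" and
"hard revoke":

#include <unistd.h>
#include <sys/ioctl.h>

/* Placeholder ioctls -- none of these exist, they just sketch the flow. */
#define VFIO_DEVICE_REQUEST_UNPLUG  _IO(';', 107)   /* ask the guest nicely */
#define VFIO_DEVICE_GUEST_RELEASED  _IOR(';', 108, int)
#define VFIO_DEVICE_HARD_REVOKE     _IO(';', 109)   /* incl. mmio mmaps */

static void forced_unplug(int device_fd)
{
        int released = 0, waited;

        /* 1. Polite: trigger assisted hotplug removal in the guest. */
        ioctl(device_fd, VFIO_DEVICE_REQUEST_UNPLUG);

        /* 2. Bounded wait for the guest to unbind its driver and release. */
        for (waited = 0; waited < 30; waited++) {
                ioctl(device_fd, VFIO_DEVICE_GUEST_RELEASED, &released);
                if (released)
                        return;
                sleep(1);
        }

        /* 3. Escalate: revoke the mappings (the missing piece above) and
         * yank the device regardless of the guest's cooperation. */
        ioctl(device_fd, VFIO_DEVICE_HARD_REVOKE);
}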

Alex

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-24  9:10                                     ` Joerg Roedel
  (?)
@ 2011-08-24 21:13                                       ` Alex Williamson
  -1 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-08-24 21:13 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Aaron Fabbri, Benjamin Herrenschmidt, Alexey Kardashevskiy, kvm,
	Paul Mackerras, linux-pci, qemu-devel, chrisw, iommu, Avi Kivity,
	Anthony Liguori, linuxppc-dev, benve

Joerg,

Is this roughly what you're thinking of for the iommu_group component?
Adding a dev_to_group iommu ops callback lets us consolidate the sysfs
support in the iommu base.  Would AMD-Vi do something similar (or
exactly the same) for group #s?  Thanks,

Alex

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

diff --git a/drivers/base/iommu.c b/drivers/base/iommu.c
index 6e6b6a1..6b54c1a 100644
--- a/drivers/base/iommu.c
+++ b/drivers/base/iommu.c
@@ -17,20 +17,56 @@
  */
 
 #include <linux/bug.h>
+#include <linux/device.h>
 #include <linux/types.h>
 #include <linux/module.h>
 #include <linux/slab.h>
 #include <linux/errno.h>
 #include <linux/iommu.h>
+#include <linux/pci.h>
 
 static struct iommu_ops *iommu_ops;
 
+static ssize_t show_iommu_group(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%lx", iommu_dev_to_group(dev));
+}
+static DEVICE_ATTR(iommu_group, S_IRUGO, show_iommu_group, NULL);
+
+static int add_iommu_group(struct device *dev, void *unused)
+{
+	if (iommu_dev_to_group(dev) >= 0)
+		return device_create_file(dev, &dev_attr_iommu_group);
+
+	return 0;
+}
+
+static int device_notifier(struct notifier_block *nb,
+			   unsigned long action, void *data)
+{
+	struct device *dev = data;
+
+	if (action == BUS_NOTIFY_ADD_DEVICE)
+		return add_iommu_group(dev, NULL);
+
+	return 0;
+}
+
+static struct notifier_block device_nb = {
+	.notifier_call = device_notifier,
+};
+
 void register_iommu(struct iommu_ops *ops)
 {
 	if (iommu_ops)
 		BUG();
 
 	iommu_ops = ops;
+
+	/* FIXME - non-PCI, really want for_each_bus() */
+	bus_register_notifier(&pci_bus_type, &device_nb);
+	bus_for_each_dev(&pci_bus_type, NULL, NULL, add_iommu_group);
 }
 
 bool iommu_found(void)
@@ -94,6 +130,14 @@ int iommu_domain_has_cap(struct iommu_domain *domain,
 }
 EXPORT_SYMBOL_GPL(iommu_domain_has_cap);
 
+long iommu_dev_to_group(struct device *dev)
+{
+	if (iommu_ops->dev_to_group)
+		return iommu_ops->dev_to_group(dev);
+	return -ENODEV;
+}
+EXPORT_SYMBOL_GPL(iommu_dev_to_group);
+
 int iommu_map(struct iommu_domain *domain, unsigned long iova,
 	      phys_addr_t paddr, int gfp_order, int prot)
 {
diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
index f02c34d..477259c 100644
--- a/drivers/pci/intel-iommu.c
+++ b/drivers/pci/intel-iommu.c
@@ -404,6 +404,7 @@ static int dmar_map_gfx = 1;
 static int dmar_forcedac;
 static int intel_iommu_strict;
 static int intel_iommu_superpage = 1;
+static int intel_iommu_no_mf_groups;
 
 #define DUMMY_DEVICE_DOMAIN_INFO ((struct device_domain_info *)(-1))
 static DEFINE_SPINLOCK(device_domain_lock);
@@ -438,6 +439,10 @@ static int __init intel_iommu_setup(char *str)
 			printk(KERN_INFO
 				"Intel-IOMMU: disable supported super page\n");
 			intel_iommu_superpage = 0;
+		} else if (!strncmp(str, "no_mf_groups", 12)) {
+			printk(KERN_INFO
+				"Intel-IOMMU: disable separate groups for multifunction devices\n");
+			intel_iommu_no_mf_groups = 1;
 		}
 
 		str += strcspn(str, ",");
@@ -3902,6 +3907,52 @@ static int intel_iommu_domain_has_cap(struct iommu_domain *domain,
 	return 0;
 }
 
+/* Group numbers are arbitrary.  Devices with the same group number
+ * indicate that the iommu cannot differentiate between them.  To avoid
+ * tracking used groups we just use the seg|bus|devfn of the lowest
+ * level at which we're able to differentiate devices. */
+static long intel_iommu_dev_to_group(struct device *dev)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	struct pci_dev *bridge;
+	union {
+		struct {
+			u8 devfn;
+			u8 bus;
+			u16 segment;
+		} pci;
+		u32 group;
+	} id;
+
+	if (iommu_no_mapping(dev))
+		return -ENODEV;
+
+	id.pci.segment = pci_domain_nr(pdev->bus);
+	id.pci.bus = pdev->bus->number;
+	id.pci.devfn = pdev->devfn;
+
+	if (!device_to_iommu(id.pci.segment, id.pci.bus, id.pci.devfn))
+		return -ENODEV;
+
+	bridge = pci_find_upstream_pcie_bridge(pdev);
+	if (bridge) {
+		if (pci_is_pcie(bridge)) {
+			id.pci.bus = bridge->subordinate->number;
+			id.pci.devfn = 0;
+		} else {
+			id.pci.bus = bridge->bus->number;
+			id.pci.devfn = bridge->devfn;
+		}
+	}
+
+	/* Virtual functions always get their own group */
+	if (!pdev->is_virtfn && intel_iommu_no_mf_groups)
+		id.pci.devfn = PCI_DEVFN(PCI_SLOT(id.pci.devfn), 0);
+
+	/* FIXME - seg # >= 0x8000 on 32b */
+	return id.group;
+}
+
 static struct iommu_ops intel_iommu_ops = {
 	.domain_init	= intel_iommu_domain_init,
 	.domain_destroy = intel_iommu_domain_destroy,
@@ -3911,6 +3962,7 @@ static struct iommu_ops intel_iommu_ops = {
 	.unmap		= intel_iommu_unmap,
 	.iova_to_phys	= intel_iommu_iova_to_phys,
 	.domain_has_cap = intel_iommu_domain_has_cap,
+	.dev_to_group	= intel_iommu_dev_to_group,
 };
 
 static void __devinit quirk_iommu_rwbf(struct pci_dev *dev)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 0a2ba40..90c1a86 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -45,6 +45,7 @@ struct iommu_ops {
 				    unsigned long iova);
 	int (*domain_has_cap)(struct iommu_domain *domain,
 			      unsigned long cap);
+	long (*dev_to_group)(struct device *dev);
 };
 
 #ifdef CONFIG_IOMMU_API
@@ -65,6 +66,7 @@ extern phys_addr_t iommu_iova_to_phys(struct iommu_domain *domain,
 				      unsigned long iova);
 extern int iommu_domain_has_cap(struct iommu_domain *domain,
 				unsigned long cap);
+extern long iommu_dev_to_group(struct device *dev);
 
 #else /* CONFIG_IOMMU_API */
 
@@ -121,6 +123,10 @@ static inline int domain_has_cap(struct iommu_domain *domain,
 	return 0;
 }
 
+static inline long iommu_dev_to_group(struct device *dev)
+{
+	return -ENODEV;
+}
 #endif /* CONFIG_IOMMU_API */
 
 #endif /* __LINUX_IOMMU_H */

^ permalink raw reply related	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-24 21:13                                       ` Alex Williamson
  (?)
@ 2011-08-25 10:54                                         ` Roedel, Joerg
  -1 siblings, 0 replies; 322+ messages in thread
From: Roedel, Joerg @ 2011-08-25 10:54 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Aaron Fabbri, Benjamin Herrenschmidt, Alexey Kardashevskiy, kvm,
	Paul Mackerras, linux-pci, qemu-devel, chrisw, iommu, Avi Kivity,
	Anthony Liguori, linuxppc-dev, benve

Hi Alex,

On Wed, Aug 24, 2011 at 05:13:49PM -0400, Alex Williamson wrote:
> Is this roughly what you're thinking of for the iommu_group component?
> Adding a dev_to_group iommu ops callback lets us consolidate the sysfs
> support in the iommu base.  Would AMD-Vi do something similar (or
> exactly the same) for group #s?  Thanks,

The concept looks good; I have some comments, though. On AMD-Vi the
implementation would look a bit different because there is a
data structure from which the information can be gathered, so there is
no need for PCI bus scanning there.

> diff --git a/drivers/base/iommu.c b/drivers/base/iommu.c
> index 6e6b6a1..6b54c1a 100644
> --- a/drivers/base/iommu.c
> +++ b/drivers/base/iommu.c
> @@ -17,20 +17,56 @@
>   */
>  
>  #include <linux/bug.h>
> +#include <linux/device.h>
>  #include <linux/types.h>
>  #include <linux/module.h>
>  #include <linux/slab.h>
>  #include <linux/errno.h>
>  #include <linux/iommu.h>
> +#include <linux/pci.h>
>  
>  static struct iommu_ops *iommu_ops;
>  
> +static ssize_t show_iommu_group(struct device *dev,
> +				struct device_attribute *attr, char *buf)
> +{
> +	return sprintf(buf, "%lx", iommu_dev_to_group(dev));

Probably add a 0x prefix so userspace knows the format?
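
Something like this, I guess (sketch only; also adding the trailing
newline that sysfs attributes usually want):

	return sprintf(buf, "0x%lx\n", iommu_dev_to_group(dev));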

> +}
> +static DEVICE_ATTR(iommu_group, S_IRUGO, show_iommu_group, NULL);
> +
> +static int add_iommu_group(struct device *dev, void *unused)
> +{
> +	if (iommu_dev_to_group(dev) >= 0)
> +		return device_create_file(dev, &dev_attr_iommu_group);
> +
> +	return 0;
> +}
> +
> +static int device_notifier(struct notifier_block *nb,
> +			   unsigned long action, void *data)
> +{
> +	struct device *dev = data;
> +
> +	if (action == BUS_NOTIFY_ADD_DEVICE)
> +		return add_iommu_group(dev, NULL);
> +
> +	return 0;
> +}
> +
> +static struct notifier_block device_nb = {
> +	.notifier_call = device_notifier,
> +};
> +
>  void register_iommu(struct iommu_ops *ops)
>  {
>  	if (iommu_ops)
>  		BUG();
>  
>  	iommu_ops = ops;
> +
> +	/* FIXME - non-PCI, really want for_each_bus() */
> +	bus_register_notifier(&pci_bus_type, &device_nb);
> +	bus_for_each_dev(&pci_bus_type, NULL, NULL, add_iommu_group);
>  }

We need to solve this differently. ARM is starting to use the iommu-api
too and this definitely does not work there. One possible solution might
be to make the iommu-ops per-bus.
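
One possible shape for that, purely illustrative and assuming a new
iommu_ops pointer in struct bus_type (nothing like this exists today):

	int bus_set_iommu(struct bus_type *bus, struct iommu_ops *ops)
	{
		if (bus->iommu_ops)
			return -EBUSY;

		bus->iommu_ops = ops;

		/* the add-device notifier and the initial device scan
		 * would then also be registered per bus here */
		return 0;
	}

	/* e.g. from the VT-d driver:
	 *	bus_set_iommu(&pci_bus_type, &intel_iommu_ops);
	 */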

>  bool iommu_found(void)
> @@ -94,6 +130,14 @@ int iommu_domain_has_cap(struct iommu_domain *domain,
>  }
>  EXPORT_SYMBOL_GPL(iommu_domain_has_cap);
>  
> +long iommu_dev_to_group(struct device *dev)
> +{
> +	if (iommu_ops->dev_to_group)
> +		return iommu_ops->dev_to_group(dev);
> +	return -ENODEV;
> +}
> +EXPORT_SYMBOL_GPL(iommu_dev_to_group);

Please rename this to iommu_device_group(). The dev_to_group name
suggests a conversion but it is actually just a property of the device.
Also, the return type should not be long but something that fits into
32 bits on all platforms. Since you use -ENODEV, probably s32 is a good
choice.
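
I.e. roughly the following, with the ops callback renamed to match
(sketch of the suggested shape only):

	s32 iommu_device_group(struct device *dev)
	{
		if (iommu_ops->device_group)
			return iommu_ops->device_group(dev);

		return -ENODEV;
	}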

> +
>  int iommu_map(struct iommu_domain *domain, unsigned long iova,
>  	      phys_addr_t paddr, int gfp_order, int prot)
>  {
> diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
> index f02c34d..477259c 100644
> --- a/drivers/pci/intel-iommu.c
> +++ b/drivers/pci/intel-iommu.c
> @@ -404,6 +404,7 @@ static int dmar_map_gfx = 1;
>  static int dmar_forcedac;
>  static int intel_iommu_strict;
>  static int intel_iommu_superpage = 1;
> +static int intel_iommu_no_mf_groups;
>  
>  #define DUMMY_DEVICE_DOMAIN_INFO ((struct device_domain_info *)(-1))
>  static DEFINE_SPINLOCK(device_domain_lock);
> @@ -438,6 +439,10 @@ static int __init intel_iommu_setup(char *str)
>  			printk(KERN_INFO
>  				"Intel-IOMMU: disable supported super page\n");
>  			intel_iommu_superpage = 0;
> +		} else if (!strncmp(str, "no_mf_groups", 12)) {
> +			printk(KERN_INFO
> +				"Intel-IOMMU: disable separate groups for multifunction devices\n");
> +			intel_iommu_no_mf_groups = 1;

This should really be a global iommu option and not be VT-d specific.

>  
>  		str += strcspn(str, ",");
> @@ -3902,6 +3907,52 @@ static int intel_iommu_domain_has_cap(struct iommu_domain *domain,
>  	return 0;
>  }
>  
> +/* Group numbers are arbitrary.  Devices with the same group number
> + * indicate that the iommu cannot differentiate between them.  To avoid
> + * tracking used groups we just use the seg|bus|devfn of the lowest
> + * level at which we're able to differentiate devices. */
> +static long intel_iommu_dev_to_group(struct device *dev)
> +{
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +	struct pci_dev *bridge;
> +	union {
> +		struct {
> +			u8 devfn;
> +			u8 bus;
> +			u16 segment;
> +		} pci;
> +		u32 group;
> +	} id;
> +
> +	if (iommu_no_mapping(dev))
> +		return -ENODEV;
> +
> +	id.pci.segment = pci_domain_nr(pdev->bus);
> +	id.pci.bus = pdev->bus->number;
> +	id.pci.devfn = pdev->devfn;
> +
> +	if (!device_to_iommu(id.pci.segment, id.pci.bus, id.pci.devfn))
> +		return -ENODEV;
> +
> +	bridge = pci_find_upstream_pcie_bridge(pdev);
> +	if (bridge) {
> +		if (pci_is_pcie(bridge)) {
> +			id.pci.bus = bridge->subordinate->number;
> +			id.pci.devfn = 0;
> +		} else {
> +			id.pci.bus = bridge->bus->number;
> +			id.pci.devfn = bridge->devfn;
> +		}
> +	}
> +
> +	/* Virtual functions always get their own group */
> +	if (!pdev->is_virtfn && intel_iommu_no_mf_groups)
> +		id.pci.devfn = PCI_DEVFN(PCI_SLOT(id.pci.devfn), 0);
> +
> +	/* FIXME - seg # >= 0x8000 on 32b */
> +	return id.group;
> +}

This looks like code duplication in the VT-d driver. It doesn't need to
be generalized now, but we should keep in mind to do a more general
solution later.
Maybe it would be beneficial if the IOMMU drivers only set up the number
in dev->arch.iommu.groupid and the iommu-api then fetches it from there.
But as I said, this is some more work and does not need to be done for
this patch(-set).
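
Something in this direction, say (illustrative only; the groupid field
is an assumption, and 0 is reserved to mean "no group" just to keep the
sketch short):

	/* generic code, no per-driver callback needed any more */
	s32 iommu_device_group(struct device *dev)
	{
		if (!dev->arch.iommu.groupid)	/* filled in by the IOMMU driver */
			return -ENODEV;

		return dev->arch.iommu.groupid;
	}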

> +
>  static struct iommu_ops intel_iommu_ops = {
>  	.domain_init	= intel_iommu_domain_init,
>  	.domain_destroy = intel_iommu_domain_destroy,
> @@ -3911,6 +3962,7 @@ static struct iommu_ops intel_iommu_ops = {
>  	.unmap		= intel_iommu_unmap,
>  	.iova_to_phys	= intel_iommu_iova_to_phys,
>  	.domain_has_cap = intel_iommu_domain_has_cap,
> +	.dev_to_group	= intel_iommu_dev_to_group,
>  };
>  
>  static void __devinit quirk_iommu_rwbf(struct pci_dev *dev)
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 0a2ba40..90c1a86 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -45,6 +45,7 @@ struct iommu_ops {
>  				    unsigned long iova);
>  	int (*domain_has_cap)(struct iommu_domain *domain,
>  			      unsigned long cap);
> +	long (*dev_to_group)(struct device *dev);
>  };
>  
>  #ifdef CONFIG_IOMMU_API
> @@ -65,6 +66,7 @@ extern phys_addr_t iommu_iova_to_phys(struct iommu_domain *domain,
>  				      unsigned long iova);
>  extern int iommu_domain_has_cap(struct iommu_domain *domain,
>  				unsigned long cap);
> +extern long iommu_dev_to_group(struct device *dev);
>  
>  #else /* CONFIG_IOMMU_API */
>  
> @@ -121,6 +123,10 @@ static inline int domain_has_cap(struct iommu_domain *domain,
>  	return 0;
>  }
>  
> +static inline long iommu_dev_to_group(struct device *dev)
> +{
> +	return -ENODEV;
> +}
>  #endif /* CONFIG_IOMMU_API */
>  
>  #endif /* __LINUX_IOMMU_H */
> 
> 
> 

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-24 14:56                               ` Alex Williamson
  (?)
@ 2011-08-25 11:01                                 ` Roedel, Joerg
  -1 siblings, 0 replies; 322+ messages in thread
From: Roedel, Joerg @ 2011-08-25 11:01 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Benjamin Herrenschmidt, Alexey Kardashevskiy, kvm,
	Paul Mackerras, linux-pci, qemu-devel, David Gibson, chrisw,
	iommu, Avi Kivity, Anthony Liguori, linuxppc-dev, benve

On Wed, Aug 24, 2011 at 10:56:13AM -0400, Alex Williamson wrote:
> On Wed, 2011-08-24 at 10:43 +0200, Joerg Roedel wrote:
> > A side-note: Might it be better to expose assigned devices in a guest on
> > a seperate bus? This will make it easier to emulate an IOMMU for the
> > guest inside qemu.
> 
> I think we want that option, sure.  A lot of guests aren't going to
> support hotplugging buses though, so I think our default, map the entire
> guest model should still be using bus 0.  The ACPI gets a lot more
> complicated for that model too; dynamic SSDTs?  Thanks,

Ok, if only AMD-Vi should be emulated then it is not strictly
necessary. For this IOMMU we can specify that devices on the same bus
belong to different IOMMUs. So we can implement an IOMMU that handles
internal qemu-devices and one that handles pass-through devices.
Not sure if this is possible with VT-d too. Okay, VT-d emulation would
also require emulating a PCIe bridge for those devices, no?

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-24 15:07                               ` Alex Williamson
  (?)
@ 2011-08-25 12:31                                 ` Roedel, Joerg
  -1 siblings, 0 replies; 322+ messages in thread
From: Roedel, Joerg @ 2011-08-25 12:31 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Benjamin Herrenschmidt, chrisw, Alexey Kardashevskiy, kvm,
	Paul Mackerras, linux-pci, qemu-devel, iommu, Avi Kivity,
	Anthony Liguori, linuxppc-dev, benve

On Wed, Aug 24, 2011 at 11:07:46AM -0400, Alex Williamson wrote:
> On Wed, 2011-08-24 at 10:52 +0200, Roedel, Joerg wrote:
> > On Tue, Aug 23, 2011 at 01:08:29PM -0400, Alex Williamson wrote:
> > > On Tue, 2011-08-23 at 15:14 +0200, Roedel, Joerg wrote:
> > 
> > > > Handling it through fds is a good idea. This makes sure that everything
> > > > belongs to one process. I am not really sure yet if we go the way to
> > > > just bind plain groups together or if we create meta-groups. The
> > > > meta-groups thing seems somewhat cleaner, though.
> > > 
> > > I'm leaning towards binding because we need to make it dynamic, but I
> > > don't really have a good picture of the lifecycle of a meta-group.
> > 
> > In my view the life-cycle of the meta-group is a subrange of the
> > qemu-instance's life-cycle.
> 
> I guess I mean the lifecycle of a super-group that's actually exposed as
> a new group in sysfs.  Who creates it?  How?  How are groups dynamically
> added and removed from the super-group?  The group merging makes sense
> to me because it's largely just an optimization that qemu will try to
> merge groups.  If it works, great.  If not, it manages them separately.
> When all the devices from a group are unplugged, unmerge the group if
> necessary.

Right. The super-group thing is an optimization.

> We need to try the polite method of attempting to hot unplug the device
> from qemu first, which the current vfio code already implements.  We can
> then escalate if it doesn't respond.  The current code calls abort in
> qemu if the guest doesn't respond, but I agree we should also be
> enforcing this at the kernel interface.  I think the problem with the
> hard-unplug is that we don't have a good revoke mechanism for the mmio
> mmaps.

For mmio we could stop the guest and replace the mmio region with a
region that is filled with 0xff, no?
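
From the userspace side that could be as simple as the following sketch
(assuming the mapping is page aligned, which BAR mmaps are, and ignoring
error handling):

	#include <string.h>
	#include <sys/mman.h>

	/* Atomically replace the BAR mapping with anonymous memory and
	 * fill it with 0xff, so further accesses no longer reach the HW. */
	static void neuter_bar_mapping(void *vaddr, size_t size)
	{
		if (mmap(vaddr, size, PROT_READ | PROT_WRITE,
			 MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0) == MAP_FAILED)
			return;

		memset(vaddr, 0xff, size);
	}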

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-25 12:31                                 ` Roedel, Joerg
  (?)
@ 2011-08-25 13:25                                   ` Alexander Graf
  -1 siblings, 0 replies; 322+ messages in thread
From: Alexander Graf @ 2011-08-25 13:25 UTC (permalink / raw)
  To: Roedel, Joerg
  Cc: Alexey Kardashevskiy, kvm, Paul Mackerras, qemu-devel, iommu,
	chrisw, Alex Williamson, Avi Kivity, linux-pci, linuxppc-dev,
	benve

On 25.08.2011, at 07:31, Roedel, Joerg wrote:

> On Wed, Aug 24, 2011 at 11:07:46AM -0400, Alex Williamson wrote:
>> On Wed, 2011-08-24 at 10:52 +0200, Roedel, Joerg wrote:
> 

[...]

>> We need to try the polite method of attempting to hot unplug the device
>> from qemu first, which the current vfio code already implements.  We can
>> then escalate if it doesn't respond.  The current code calls abort in
>> qemu if the guest doesn't respond, but I agree we should also be
>> enforcing this at the kernel interface.  I think the problem with the
>> hard-unplug is that we don't have a good revoke mechanism for the mmio
>> mmaps.
> 
> For mmio we could stop the guest and replace the mmio region with a
> region that is filled with 0xff, no?

Sure, but that happens in user space. The question is how kernel space enforces that an MMIO region is no longer mapped after the hotplug event has occurred. Keep in mind that user space is pretty much untrusted here - it doesn't have to be QEMU. It could just as well be a generic user space driver, and that can simply ignore hotplug events.
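
For reference, a minimal sketch of what a kernel-side revoke could look
like, assuming the driver services its BAR mmaps through a fault handler
and keeps a per-device flag (my_vfio_dev and its fields are illustrative,
not actual vfio structures):

#include <linux/fs.h>
#include <linux/mm.h>

struct my_vfio_dev {
	struct inode *inode;	/* inode backing the device file mmaps */
	bool revoked;		/* checked by the .fault handler */
};

/* On hard unplug: flip the flag so future faults return VM_FAULT_SIGBUS
 * (or map a 0xff-filled page), then zap every PTE currently mapping the
 * device file.  Even a user space that ignores the hotplug event can no
 * longer reach the device through its stale mappings. */
static void revoke_device_mmaps(struct my_vfio_dev *vdev)
{
	vdev->revoked = true;
	unmap_mapping_range(vdev->inode->i_mapping, 0, 0, 1);
}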


Alex


^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-25 10:54                                         ` Roedel, Joerg
  (?)
@ 2011-08-25 15:38                                           ` Don Dutile
  -1 siblings, 0 replies; 322+ messages in thread
From: Don Dutile @ 2011-08-25 15:38 UTC (permalink / raw)
  To: Roedel, Joerg
  Cc: Alexey Kardashevskiy, kvm, Paul Mackerras, qemu-devel, iommu,
	chrisw, Alex Williamson, Avi Kivity, linux-pci, linuxppc-dev,
	benve

On 08/25/2011 06:54 AM, Roedel, Joerg wrote:
> Hi Alex,
>
> On Wed, Aug 24, 2011 at 05:13:49PM -0400, Alex Williamson wrote:
>> Is this roughly what you're thinking of for the iommu_group component?
>> Adding a dev_to_group iommu ops callback lets us consolidate the sysfs
>> support in the iommu base.  Would AMD-Vi do something similar (or
>> exactly the same) for group #s?  Thanks,
>
> The concept looks good, I have some comments, though. On AMD-Vi the
> implementation would look a bit different because there is a
> data-structure where the information can be gathered from, so no need for
> PCI bus scanning there.
>
>> diff --git a/drivers/base/iommu.c b/drivers/base/iommu.c
>> index 6e6b6a1..6b54c1a 100644
>> --- a/drivers/base/iommu.c
>> +++ b/drivers/base/iommu.c
>> @@ -17,20 +17,56 @@
>>    */
>>
>>   #include <linux/bug.h>
>> +#include <linux/device.h>
>>   #include <linux/types.h>
>>   #include <linux/module.h>
>>   #include <linux/slab.h>
>>   #include <linux/errno.h>
>>   #include <linux/iommu.h>
>> +#include <linux/pci.h>
>>
>>   static struct iommu_ops *iommu_ops;
>>
>> +static ssize_t show_iommu_group(struct device *dev,
>> +				struct device_attribute *attr, char *buf)
>> +{
>> +	return sprintf(buf, "%lx", iommu_dev_to_group(dev));
>
> Probably add a 0x prefix so userspace knows the format?
>
>> +}
>> +static DEVICE_ATTR(iommu_group, S_IRUGO, show_iommu_group, NULL);
>> +
>> +static int add_iommu_group(struct device *dev, void *unused)
>> +{
>> +	if (iommu_dev_to_group(dev) >= 0)
>> +		return device_create_file(dev, &dev_attr_iommu_group);
>> +
>> +	return 0;
>> +}
>> +
>> +static int device_notifier(struct notifier_block *nb,
>> +			   unsigned long action, void *data)
>> +{
>> +	struct device *dev = data;
>> +
>> +	if (action == BUS_NOTIFY_ADD_DEVICE)
>> +		return add_iommu_group(dev, NULL);
>> +
>> +	return 0;
>> +}
>> +
>> +static struct notifier_block device_nb = {
>> +	.notifier_call = device_notifier,
>> +};
>> +
>>   void register_iommu(struct iommu_ops *ops)
>>   {
>>   	if (iommu_ops)
>>   		BUG();
>>
>>   	iommu_ops = ops;
>> +
>> +	/* FIXME - non-PCI, really want for_each_bus() */
>> +	bus_register_notifier(&pci_bus_type, &device_nb);
>> +	bus_for_each_dev(&pci_bus_type, NULL, NULL, add_iommu_group);
>>   }
>
> We need to solve this differently. ARM is starting to use the iommu-api
> too and this definitely does not work there. One possible solution might
> be to make the iommu-ops per-bus.
>
When you think of a system where there isn't just one bus type
with IOMMU support, it makes more sense.
Additionally, it allows the long-term architecture to use different types
of IOMMUs on each bus segment -- think per-PCIe-switch/bridge IOMMUs --
especially 'tuned' IOMMUs: ones better geared for networking, ones better
geared for direct-attach disk HBAs.
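
One way to picture it, as a sketch only (it assumes a bus->iommu_ops
field, which does not exist today), is to have the iommu-api wrappers
dispatch through the device's bus instead of a single global iommu_ops:

long iommu_dev_to_group(struct device *dev)
{
	struct iommu_ops *ops = dev->bus->iommu_ops;	/* assumed field */

	if (ops && ops->dev_to_group)
		return ops->dev_to_group(dev);

	return -ENODEV;
}

/* A PCI IOMMU and, say, an ARM bus IOMMU can then register different
 * ops without stepping on each other. */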


>>   bool iommu_found(void)
>> @@ -94,6 +130,14 @@ int iommu_domain_has_cap(struct iommu_domain *domain,
>>   }
>>   EXPORT_SYMBOL_GPL(iommu_domain_has_cap);
>>
>> +long iommu_dev_to_group(struct device *dev)
>> +{
>> +	if (iommu_ops->dev_to_group)
>> +		return iommu_ops->dev_to_group(dev);
>> +	return -ENODEV;
>> +}
>> +EXPORT_SYMBOL_GPL(iommu_dev_to_group);
>
> Please rename this to iommu_device_group(). The dev_to_group name
> suggests a conversion but it is actually just a property of the device.
> Also the return type should not be long but something that fits into
> 32bit on all platforms. Since you use -ENODEV, probably s32 is a good
> choice.
>
>> +
>>   int iommu_map(struct iommu_domain *domain, unsigned long iova,
>>   	      phys_addr_t paddr, int gfp_order, int prot)
>>   {
>> diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
>> index f02c34d..477259c 100644
>> --- a/drivers/pci/intel-iommu.c
>> +++ b/drivers/pci/intel-iommu.c
>> @@ -404,6 +404,7 @@ static int dmar_map_gfx = 1;
>>   static int dmar_forcedac;
>>   static int intel_iommu_strict;
>>   static int intel_iommu_superpage = 1;
>> +static int intel_iommu_no_mf_groups;
>>
>>   #define DUMMY_DEVICE_DOMAIN_INFO ((struct device_domain_info *)(-1))
>>   static DEFINE_SPINLOCK(device_domain_lock);
>> @@ -438,6 +439,10 @@ static int __init intel_iommu_setup(char *str)
>>   			printk(KERN_INFO
>>   				"Intel-IOMMU: disable supported super page\n");
>>   			intel_iommu_superpage = 0;
>> +		} else if (!strncmp(str, "no_mf_groups", 12)) {
>> +			printk(KERN_INFO
>> +				"Intel-IOMMU: disable separate groups for multifunction devices\n");
>> +			intel_iommu_no_mf_groups = 1;
>
> This should really be a global iommu option and not be VT-d specific.
>
>>
>>   		str += strcspn(str, ",");
>> @@ -3902,6 +3907,52 @@ static int intel_iommu_domain_has_cap(struct iommu_domain *domain,
>>   	return 0;
>>   }
>>
>> +/* Group numbers are arbitrary.  Devices with the same group number
>> + * indicate the iommu cannot differentiate between them.  To avoid
>> + * tracking used groups we just use the seg|bus|devfn of the lowest
>> + * level we're able to differentiate devices */
>> +static long intel_iommu_dev_to_group(struct device *dev)
>> +{
>> +	struct pci_dev *pdev = to_pci_dev(dev);
>> +	struct pci_dev *bridge;
>> +	union {
>> +		struct {
>> +			u8 devfn;
>> +			u8 bus;
>> +			u16 segment;
>> +		} pci;
>> +		u32 group;
>> +	} id;
>> +
>> +	if (iommu_no_mapping(dev))
>> +		return -ENODEV;
>> +
>> +	id.pci.segment = pci_domain_nr(pdev->bus);
>> +	id.pci.bus = pdev->bus->number;
>> +	id.pci.devfn = pdev->devfn;
>> +
>> +	if (!device_to_iommu(id.pci.segment, id.pci.bus, id.pci.devfn))
>> +		return -ENODEV;
>> +
>> +	bridge = pci_find_upstream_pcie_bridge(pdev);
>> +	if (bridge) {
>> +		if (pci_is_pcie(bridge)) {
>> +			id.pci.bus = bridge->subordinate->number;
>> +			id.pci.devfn = 0;
>> +		} else {
>> +			id.pci.bus = bridge->bus->number;
>> +			id.pci.devfn = bridge->devfn;
>> +		}
>> +	}
>> +
>> +	/* Virtual functions always get their own group */
>> +	if (!pdev->is_virtfn && intel_iommu_no_mf_groups)
>> +		id.pci.devfn = PCI_DEVFN(PCI_SLOT(id.pci.devfn), 0);
>> +
>> +	/* FIXME - seg # >= 0x8000 on 32b */
>> +	return id.group;
>> +}
>
> This looks like code duplication in the VT-d driver. It doesn't need to
> be generalized now, but we should keep in mind to do a more general
> solution later.
> Maybe it is beneficial if the IOMMU drivers only setup the number in
> dev->arch.iommu.groupid and the iommu-api fetches it from there then.
> But as I said, this is some more work and does not need to be done for
> this patch(-set).
>
>> +
>>   static struct iommu_ops intel_iommu_ops = {
>>   	.domain_init	= intel_iommu_domain_init,
>>   	.domain_destroy = intel_iommu_domain_destroy,
>> @@ -3911,6 +3962,7 @@ static struct iommu_ops intel_iommu_ops = {
>>   	.unmap		= intel_iommu_unmap,
>>   	.iova_to_phys	= intel_iommu_iova_to_phys,
>>   	.domain_has_cap = intel_iommu_domain_has_cap,
>> +	.dev_to_group	= intel_iommu_dev_to_group,
>>   };
>>
>>   static void __devinit quirk_iommu_rwbf(struct pci_dev *dev)
>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>> index 0a2ba40..90c1a86 100644
>> --- a/include/linux/iommu.h
>> +++ b/include/linux/iommu.h
>> @@ -45,6 +45,7 @@ struct iommu_ops {
>>   				    unsigned long iova);
>>   	int (*domain_has_cap)(struct iommu_domain *domain,
>>   			      unsigned long cap);
>> +	long (*dev_to_group)(struct device *dev);
>>   };
>>
>>   #ifdef CONFIG_IOMMU_API
>> @@ -65,6 +66,7 @@ extern phys_addr_t iommu_iova_to_phys(struct iommu_domain *domain,
>>   				      unsigned long iova);
>>   extern int iommu_domain_has_cap(struct iommu_domain *domain,
>>   				unsigned long cap);
>> +extern long iommu_dev_to_group(struct device *dev);
>>
>>   #else /* CONFIG_IOMMU_API */
>>
>> @@ -121,6 +123,10 @@ static inline int domain_has_cap(struct iommu_domain *domain,
>>   	return 0;
>>   }
>>
>> +static inline long iommu_dev_to_group(struct device *dev)
>> +{
>> +	return -ENODEV;
>> +}
>>   #endif /* CONFIG_IOMMU_API */
>>
>>   #endif /* __LINUX_IOMMU_H */
>>
>>
>>
>

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-25 15:38                                           ` Don Dutile
  (?)
@ 2011-08-25 16:46                                             ` Roedel, Joerg
  -1 siblings, 0 replies; 322+ messages in thread
From: Roedel, Joerg @ 2011-08-25 16:46 UTC (permalink / raw)
  To: Don Dutile
  Cc: Alexey Kardashevskiy, kvm, Paul Mackerras, qemu-devel, iommu,
	chrisw, Alex Williamson, Avi Kivity, linux-pci, linuxppc-dev,
	benve

On Thu, Aug 25, 2011 at 11:38:09AM -0400, Don Dutile wrote:

> On 08/25/2011 06:54 AM, Roedel, Joerg wrote:
> > We need to solve this differently. ARM is starting to use the iommu-api
> > too and this definitely does not work there. One possible solution might
> > be to make the iommu-ops per-bus.
> >
> When you think of a system where there isn't just one bus-type
> with iommu support, it makes more sense.
> Additionally, it also allows the long-term architecture to use different types
> of IOMMUs on each bus segment -- think per-PCIe-switch/bridge IOMMUs --
> esp. 'tuned' IOMMUs -- ones better geared for networks, ones better geared
> for direct-attach disk hba's.

Not sure how likely it is to have different types of IOMMUs within a
given bus type. But if they become reality, we can multiplex in the
iommu-api without much hassle :)
For now, something like bus_set_iommu() or bus_register_iommu() would
provide a nice way to do bus-specific setups for a given iommu
implementation.
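
A rough sketch of the registration side (illustrative only: it assumes a
bus->iommu_ops field and reuses the device_nb/add_iommu_group bits from
Alex's patch; none of this exists yet):

int bus_set_iommu(struct bus_type *bus, struct iommu_ops *ops)
{
	if (bus->iommu_ops)
		return -EBUSY;	/* one IOMMU implementation per bus type */
	bus->iommu_ops = ops;

	/* per-bus replacement for the global notifier + PCI-only scan */
	bus_register_notifier(bus, &device_nb);
	bus_for_each_dev(bus, NULL, NULL, add_iommu_group);

	return 0;
}
/* The VT-d driver would then call
 *	bus_set_iommu(&pci_bus_type, &intel_iommu_ops);
 * instead of register_iommu(&intel_iommu_ops). */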

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-25 10:54                                         ` Roedel, Joerg
  (?)
@ 2011-08-25 17:20                                           ` Alex Williamson
  -1 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-08-25 17:20 UTC (permalink / raw)
  To: Roedel, Joerg
  Cc: Aaron Fabbri, Benjamin Herrenschmidt, Alexey Kardashevskiy, kvm,
	Paul Mackerras, linux-pci, qemu-devel, chrisw, iommu, Avi Kivity,
	Anthony Liguori, linuxppc-dev, benve

On Thu, 2011-08-25 at 12:54 +0200, Roedel, Joerg wrote:
> Hi Alex,
> 
> On Wed, Aug 24, 2011 at 05:13:49PM -0400, Alex Williamson wrote:
> > Is this roughly what you're thinking of for the iommu_group component?
> > Adding a dev_to_group iommu ops callback lets us consolidate the sysfs
> > support in the iommu base.  Would AMD-Vi do something similar (or
> > exactly the same) for group #s?  Thanks,
> 
> The concept looks good, I have some comments, though. On AMD-Vi the
> implementation would look a bit different because there is a
> data-structure where the information can be gathered from, so no need for
> PCI bus scanning there.
> 
> > diff --git a/drivers/base/iommu.c b/drivers/base/iommu.c
> > index 6e6b6a1..6b54c1a 100644
> > --- a/drivers/base/iommu.c
> > +++ b/drivers/base/iommu.c
> > @@ -17,20 +17,56 @@
> >   */
> >  
> >  #include <linux/bug.h>
> > +#include <linux/device.h>
> >  #include <linux/types.h>
> >  #include <linux/module.h>
> >  #include <linux/slab.h>
> >  #include <linux/errno.h>
> >  #include <linux/iommu.h>
> > +#include <linux/pci.h>
> >  
> >  static struct iommu_ops *iommu_ops;
> >  
> > +static ssize_t show_iommu_group(struct device *dev,
> > +				struct device_attribute *attr, char *buf)
> > +{
> > +	return sprintf(buf, "%lx", iommu_dev_to_group(dev));
> 
> Probably add a 0x prefix so userspace knows the format?

I think I'll probably change it to %u.  Seems common to have decimal in
sysfs and doesn't get confusing if we cat it with a string.  As a bonus,
it abstracts that vt-d is just stuffing a PCI device address in there,
which nobody should ever rely on.

> > +}
> > +static DEVICE_ATTR(iommu_group, S_IRUGO, show_iommu_group, NULL);
> > +
> > +static int add_iommu_group(struct device *dev, void *unused)
> > +{
> > +	if (iommu_dev_to_group(dev) >= 0)
> > +		return device_create_file(dev, &dev_attr_iommu_group);
> > +
> > +	return 0;
> > +}
> > +
> > +static int device_notifier(struct notifier_block *nb,
> > +			   unsigned long action, void *data)
> > +{
> > +	struct device *dev = data;
> > +
> > +	if (action == BUS_NOTIFY_ADD_DEVICE)
> > +		return add_iommu_group(dev, NULL);
> > +
> > +	return 0;
> > +}
> > +
> > +static struct notifier_block device_nb = {
> > +	.notifier_call = device_notifier,
> > +};
> > +
> >  void register_iommu(struct iommu_ops *ops)
> >  {
> >  	if (iommu_ops)
> >  		BUG();
> >  
> >  	iommu_ops = ops;
> > +
> > +	/* FIXME - non-PCI, really want for_each_bus() */
> > +	bus_register_notifier(&pci_bus_type, &device_nb);
> > +	bus_for_each_dev(&pci_bus_type, NULL, NULL, add_iommu_group);
> >  }
> 
> We need to solve this differently. ARM is starting to use the iommu-api
> too and this definitly does not work there. One possible solution might
> be to make the iommu-ops per-bus.

That sounds good.  Is anyone working on it?  It seems like it doesn't
hurt to use this in the interim, we may just be watching the wrong bus
and never add any sysfs group info.

> >  bool iommu_found(void)
> > @@ -94,6 +130,14 @@ int iommu_domain_has_cap(struct iommu_domain *domain,
> >  }
> >  EXPORT_SYMBOL_GPL(iommu_domain_has_cap);
> >  
> > +long iommu_dev_to_group(struct device *dev)
> > +{
> > +	if (iommu_ops->dev_to_group)
> > +		return iommu_ops->dev_to_group(dev);
> > +	return -ENODEV;
> > +}
> > +EXPORT_SYMBOL_GPL(iommu_dev_to_group);
> 
> Please rename this to iommu_device_group(). The dev_to_group name
> suggests a conversion but it is actually just a property of the device.

Ok.

> Also the return type should not be long but something that fits into
> 32bit on all platforms. Since you use -ENODEV, probably s32 is a good
> choice.

The convenience of using seg|bus|dev|fn was too much to resist, too bad
it requires a full 32bits.  Maybe I'll change it to:
        int iommu_device_group(struct device *dev, unsigned int *group)
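
Roughly, assuming iommu_ops grows a matching callback (the names below
are placeholders, not settled interfaces):

int iommu_device_group(struct device *dev, unsigned int *groupid)
{
	if (iommu_ops->device_group)
		return iommu_ops->device_group(dev, groupid);

	return -ENODEV;
}
EXPORT_SYMBOL_GPL(iommu_device_group);

/* ...and the sysfs attribute then prints it in decimal, per the above: */
static ssize_t show_iommu_group(struct device *dev,
				struct device_attribute *attr, char *buf)
{
	unsigned int groupid;

	if (iommu_device_group(dev, &groupid))
		return 0;

	return sprintf(buf, "%u", groupid);
}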

> > +
> >  int iommu_map(struct iommu_domain *domain, unsigned long iova,
> >  	      phys_addr_t paddr, int gfp_order, int prot)
> >  {
> > diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
> > index f02c34d..477259c 100644
> > --- a/drivers/pci/intel-iommu.c
> > +++ b/drivers/pci/intel-iommu.c
> > @@ -404,6 +404,7 @@ static int dmar_map_gfx = 1;
> >  static int dmar_forcedac;
> >  static int intel_iommu_strict;
> >  static int intel_iommu_superpage = 1;
> > +static int intel_iommu_no_mf_groups;
> >  
> >  #define DUMMY_DEVICE_DOMAIN_INFO ((struct device_domain_info *)(-1))
> >  static DEFINE_SPINLOCK(device_domain_lock);
> > @@ -438,6 +439,10 @@ static int __init intel_iommu_setup(char *str)
> >  			printk(KERN_INFO
> >  				"Intel-IOMMU: disable supported super page\n");
> >  			intel_iommu_superpage = 0;
> > +		} else if (!strncmp(str, "no_mf_groups", 12)) {
> > +			printk(KERN_INFO
> > +				"Intel-IOMMU: disable separate groups for multifunction devices\n");
> > +			intel_iommu_no_mf_groups = 1;
> 
> This should really be a global iommu option and not be VT-d specific.

You think?  It's meaningless on benh's power systems.

> >  
> >  		str += strcspn(str, ",");
> > @@ -3902,6 +3907,52 @@ static int intel_iommu_domain_has_cap(struct iommu_domain *domain,
> >  	return 0;
> >  }
> >  
> > +/* Group numbers are arbitrary.  Devices with the same group number
> > + * indicate the iommu cannot differentiate between them.  To avoid
> > + * tracking used groups we just use the seg|bus|devfn of the lowest
> > + * level we're able to differentiate devices */
> > +static long intel_iommu_dev_to_group(struct device *dev)
> > +{
> > +	struct pci_dev *pdev = to_pci_dev(dev);
> > +	struct pci_dev *bridge;
> > +	union {
> > +		struct {
> > +			u8 devfn;
> > +			u8 bus;
> > +			u16 segment;
> > +		} pci;
> > +		u32 group;
> > +	} id;
> > +
> > +	if (iommu_no_mapping(dev))
> > +		return -ENODEV;
> > +
> > +	id.pci.segment = pci_domain_nr(pdev->bus);
> > +	id.pci.bus = pdev->bus->number;
> > +	id.pci.devfn = pdev->devfn;
> > +
> > +	if (!device_to_iommu(id.pci.segment, id.pci.bus, id.pci.devfn))
> > +		return -ENODEV;
> > +
> > +	bridge = pci_find_upstream_pcie_bridge(pdev);
> > +	if (bridge) {
> > +		if (pci_is_pcie(bridge)) {
> > +			id.pci.bus = bridge->subordinate->number;
> > +			id.pci.devfn = 0;
> > +		} else {
> > +			id.pci.bus = bridge->bus->number;
> > +			id.pci.devfn = bridge->devfn;
> > +		}
> > +	}
> > +
> > +	/* Virtual functions always get their own group */
> > +	if (!pdev->is_virtfn && intel_iommu_no_mf_groups)
> > +		id.pci.devfn = PCI_DEVFN(PCI_SLOT(id.pci.devfn), 0);
> > +
> > +	/* FIXME - seg # >= 0x8000 on 32b */
> > +	return id.group;
> > +}
> 
> This looks like code duplication in the VT-d driver. It doesn't need to
> be generalized now, but we should keep in mind to do a more general
> solution later.
> Maybe it is beneficial if the IOMMU drivers only setup the number in
> dev->arch.iommu.groupid and the iommu-api fetches it from there then.
> But as I said, this is some more work and does not need to be done for
> this patch(-set).

The iommu-api reaches into dev->arch.iommu.groupid?  I figured we should
at least start out with a lightweight, optional interface without the
overhead of predefining group IDs set up by bus notification callbacks in
each iommu driver.  Thanks,

Alex

> 
> > +
> >  static struct iommu_ops intel_iommu_ops = {
> >  	.domain_init	= intel_iommu_domain_init,
> >  	.domain_destroy = intel_iommu_domain_destroy,
> > @@ -3911,6 +3962,7 @@ static struct iommu_ops intel_iommu_ops = {
> >  	.unmap		= intel_iommu_unmap,
> >  	.iova_to_phys	= intel_iommu_iova_to_phys,
> >  	.domain_has_cap = intel_iommu_domain_has_cap,
> > +	.dev_to_group	= intel_iommu_dev_to_group,
> >  };
> >  
> >  static void __devinit quirk_iommu_rwbf(struct pci_dev *dev)
> > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > index 0a2ba40..90c1a86 100644
> > --- a/include/linux/iommu.h
> > +++ b/include/linux/iommu.h
> > @@ -45,6 +45,7 @@ struct iommu_ops {
> >  				    unsigned long iova);
> >  	int (*domain_has_cap)(struct iommu_domain *domain,
> >  			      unsigned long cap);
> > +	long (*dev_to_group)(struct device *dev);
> >  };
> >  
> >  #ifdef CONFIG_IOMMU_API
> > @@ -65,6 +66,7 @@ extern phys_addr_t iommu_iova_to_phys(struct iommu_domain *domain,
> >  				      unsigned long iova);
> >  extern int iommu_domain_has_cap(struct iommu_domain *domain,
> >  				unsigned long cap);
> > +extern long iommu_dev_to_group(struct device *dev);
> >  
> >  #else /* CONFIG_IOMMU_API */
> >  
> > @@ -121,6 +123,10 @@ static inline int domain_has_cap(struct iommu_domain *domain,
> >  	return 0;
> >  }
> >  
> > +static inline long iommu_dev_to_group(struct device *dev)
> > +{
> > +	return -ENODEV;
> > +}
> >  #endif /* CONFIG_IOMMU_API */
> >  
> >  #endif /* __LINUX_IOMMU_H */
> > 
> > 
> > 
> 

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: [Qemu-devel] kvm PCI assignment & VFIO ramblings
@ 2011-08-25 17:20                                           ` Alex Williamson
  0 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-08-25 17:20 UTC (permalink / raw)
  To: Roedel, Joerg
  Cc: chrisw, Alexey Kardashevskiy, kvm, Paul Mackerras, qemu-devel,
	Aaron Fabbri, iommu, Avi Kivity, linux-pci, linuxppc-dev, benve

On Thu, 2011-08-25 at 12:54 +0200, Roedel, Joerg wrote:
> Hi Alex,
> 
> On Wed, Aug 24, 2011 at 05:13:49PM -0400, Alex Williamson wrote:
> > Is this roughly what you're thinking of for the iommu_group component?
> > Adding a dev_to_group iommu ops callback lets us consolidate the sysfs
> > support in the iommu base.  Would AMD-Vi do something similar (or
> > exactly the same) for group #s?  Thanks,
> 
> The concept looks good, I have some comments, though. On AMD-Vi the
> implementation would look a bit different because there is a
> data-structure where the information can be gathered from, so no need for
> PCI bus scanning there.
> 
> > diff --git a/drivers/base/iommu.c b/drivers/base/iommu.c
> > index 6e6b6a1..6b54c1a 100644
> > --- a/drivers/base/iommu.c
> > +++ b/drivers/base/iommu.c
> > @@ -17,20 +17,56 @@
> >   */
> >  
> >  #include <linux/bug.h>
> > +#include <linux/device.h>
> >  #include <linux/types.h>
> >  #include <linux/module.h>
> >  #include <linux/slab.h>
> >  #include <linux/errno.h>
> >  #include <linux/iommu.h>
> > +#include <linux/pci.h>
> >  
> >  static struct iommu_ops *iommu_ops;
> >  
> > +static ssize_t show_iommu_group(struct device *dev,
> > +				struct device_attribute *attr, char *buf)
> > +{
> > +	return sprintf(buf, "%lx", iommu_dev_to_group(dev));
> 
> Probably add a 0x prefix so userspace knows the format?

I think I'll probably change it to %u.  Seems common to have decimal in
sysfs and doesn't get confusing if we cat it with a string.  As a bonus,
it abstracts that vt-d is just stuffing a PCI device address in there,
which nobody should ever rely on.

> > +}
> > +static DEVICE_ATTR(iommu_group, S_IRUGO, show_iommu_group, NULL);
> > +
> > +static int add_iommu_group(struct device *dev, void *unused)
> > +{
> > +	if (iommu_dev_to_group(dev) >= 0)
> > +		return device_create_file(dev, &dev_attr_iommu_group);
> > +
> > +	return 0;
> > +}
> > +
> > +static int device_notifier(struct notifier_block *nb,
> > +			   unsigned long action, void *data)
> > +{
> > +	struct device *dev = data;
> > +
> > +	if (action == BUS_NOTIFY_ADD_DEVICE)
> > +		return add_iommu_group(dev, NULL);
> > +
> > +	return 0;
> > +}
> > +
> > +static struct notifier_block device_nb = {
> > +	.notifier_call = device_notifier,
> > +};
> > +
> >  void register_iommu(struct iommu_ops *ops)
> >  {
> >  	if (iommu_ops)
> >  		BUG();
> >  
> >  	iommu_ops = ops;
> > +
> > +	/* FIXME - non-PCI, really want for_each_bus() */
> > +	bus_register_notifier(&pci_bus_type, &device_nb);
> > +	bus_for_each_dev(&pci_bus_type, NULL, NULL, add_iommu_group);
> >  }
> 
> We need to solve this differently. ARM is starting to use the iommu-api
> too and this definitely does not work there. One possible solution might
> be to make the iommu-ops per-bus.

That sounds good.  Is anyone working on it?  It seems like it doesn't
hurt to use this in the interim, we may just be watching the wrong bus
and never add any sysfs group info.
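
For reference, a rough sketch of what per-bus registration could look
like (hypothetical interface, nothing like this exists at this point in
the thread):

	/* hypothetical: each bus type with an iommu behind it registers its ops */
	extern int bus_set_iommu(struct bus_type *bus, struct iommu_ops *ops);

	/* the VT-d driver would then do something like */
	static int __init intel_iommu_register(void)
	{
		return bus_set_iommu(&pci_bus_type, &intel_iommu_ops);
	}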

> >  bool iommu_found(void)
> > @@ -94,6 +130,14 @@ int iommu_domain_has_cap(struct iommu_domain *domain,
> >  }
> >  EXPORT_SYMBOL_GPL(iommu_domain_has_cap);
> >  
> > +long iommu_dev_to_group(struct device *dev)
> > +{
> > +	if (iommu_ops->dev_to_group)
> > +		return iommu_ops->dev_to_group(dev);
> > +	return -ENODEV;
> > +}
> > +EXPORT_SYMBOL_GPL(iommu_dev_to_group);
> 
> Please rename this to iommu_device_group(). The dev_to_group name
> suggests a conversion but it is actually just a property of the device.

Ok.

> Also the return type should not be long but something that fits into
> 32bit on all platforms. Since you use -ENODEV, probably s32 is a good
> choice.

The convenience of using seg|bus|dev|fn was too much to resist, too bad
it requires a full 32bits.  Maybe I'll change it to:
        int iommu_device_group(struct device *dev, unsigned int *group)
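
A caller of that variant would then look roughly like this (just a
sketch of the proposed out-parameter form, not code from the patch):

	static int example_get_group(struct device *dev)
	{
		unsigned int group;
		int ret;

		ret = iommu_device_group(dev, &group);
		if (ret)
			return ret;	/* e.g. -ENODEV: device is not behind an iommu */

		/* 'group' now holds the full 32-bit seg|bus|devfn-style id */
		return 0;
	}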

> > +
> >  int iommu_map(struct iommu_domain *domain, unsigned long iova,
> >  	      phys_addr_t paddr, int gfp_order, int prot)
> >  {
> > diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
> > index f02c34d..477259c 100644
> > --- a/drivers/pci/intel-iommu.c
> > +++ b/drivers/pci/intel-iommu.c
> > @@ -404,6 +404,7 @@ static int dmar_map_gfx = 1;
> >  static int dmar_forcedac;
> >  static int intel_iommu_strict;
> >  static int intel_iommu_superpage = 1;
> > +static int intel_iommu_no_mf_groups;
> >  
> >  #define DUMMY_DEVICE_DOMAIN_INFO ((struct device_domain_info *)(-1))
> >  static DEFINE_SPINLOCK(device_domain_lock);
> > @@ -438,6 +439,10 @@ static int __init intel_iommu_setup(char *str)
> >  			printk(KERN_INFO
> >  				"Intel-IOMMU: disable supported super page\n");
> >  			intel_iommu_superpage = 0;
> > +		} else if (!strncmp(str, "no_mf_groups", 12)) {
> > +			printk(KERN_INFO
> > +				"Intel-IOMMU: disable separate groups for multifunction devices\n");
> > +			intel_iommu_no_mf_groups = 1;
> 
> This should really be a global iommu option and not be VT-d specific.

You think?  It's meaningless on benh's power systems.

> >  
> >  		str += strcspn(str, ",");
> > @@ -3902,6 +3907,52 @@ static int intel_iommu_domain_has_cap(struct iommu_domain *domain,
> >  	return 0;
> >  }
> >  
> > +/* Group numbers are arbitrary.  Devices with the same group number
> > + * indicate the iommu cannot differentiate between them.  To avoid
> > + * tracking used groups we just use the seg|bus|devfn of the lowest
> > + * level at which we're able to differentiate devices */
> > +static long intel_iommu_dev_to_group(struct device *dev)
> > +{
> > +	struct pci_dev *pdev = to_pci_dev(dev);
> > +	struct pci_dev *bridge;
> > +	union {
> > +		struct {
> > +			u8 devfn;
> > +			u8 bus;
> > +			u16 segment;
> > +		} pci;
> > +		u32 group;
> > +	} id;
> > +
> > +	if (iommu_no_mapping(dev))
> > +		return -ENODEV;
> > +
> > +	id.pci.segment = pci_domain_nr(pdev->bus);
> > +	id.pci.bus = pdev->bus->number;
> > +	id.pci.devfn = pdev->devfn;
> > +
> > +	if (!device_to_iommu(id.pci.segment, id.pci.bus, id.pci.devfn))
> > +		return -ENODEV;
> > +
> > +	bridge = pci_find_upstream_pcie_bridge(pdev);
> > +	if (bridge) {
> > +		if (pci_is_pcie(bridge)) {
> > +			id.pci.bus = bridge->subordinate->number;
> > +			id.pci.devfn = 0;
> > +		} else {
> > +			id.pci.bus = bridge->bus->number;
> > +			id.pci.devfn = bridge->devfn;
> > +		}
> > +	}
> > +
> > +	/* Virtual functions always get their own group */
> > +	if (!pdev->is_virtfn && intel_iommu_no_mf_groups)
> > +		id.pci.devfn = PCI_DEVFN(PCI_SLOT(id.pci.devfn), 0);
> > +
> > +	/* FIXME - seg # >= 0x8000 on 32b */
> > +	return id.group;
> > +}
> 
> This looks like code duplication in the VT-d driver. It doesn't need to
> be generalized now, but we should keep in mind to do a more general
> solution later.
> Maybe it is beneficial if the IOMMU drivers only setup the number in
> dev->arch.iommu.groupid and the iommu-api fetches it from there then.
> But as I said, this is some more work and does not need to be done for
> this patch(-set).

The iommu-api reaches into dev->arch.iommu.groupid?  I figured we should
at least start out with a lightweight, optional interface without the
overhead of predefining groupids set up by bus notification callbacks in
each iommu driver.  Thanks,

Alex

> 
> > +
> >  static struct iommu_ops intel_iommu_ops = {
> >  	.domain_init	= intel_iommu_domain_init,
> >  	.domain_destroy = intel_iommu_domain_destroy,
> > @@ -3911,6 +3962,7 @@ static struct iommu_ops intel_iommu_ops = {
> >  	.unmap		= intel_iommu_unmap,
> >  	.iova_to_phys	= intel_iommu_iova_to_phys,
> >  	.domain_has_cap = intel_iommu_domain_has_cap,
> > +	.dev_to_group	= intel_iommu_dev_to_group,
> >  };
> >  
> >  static void __devinit quirk_iommu_rwbf(struct pci_dev *dev)
> > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > index 0a2ba40..90c1a86 100644
> > --- a/include/linux/iommu.h
> > +++ b/include/linux/iommu.h
> > @@ -45,6 +45,7 @@ struct iommu_ops {
> >  				    unsigned long iova);
> >  	int (*domain_has_cap)(struct iommu_domain *domain,
> >  			      unsigned long cap);
> > +	long (*dev_to_group)(struct device *dev);
> >  };
> >  
> >  #ifdef CONFIG_IOMMU_API
> > @@ -65,6 +66,7 @@ extern phys_addr_t iommu_iova_to_phys(struct iommu_domain *domain,
> >  				      unsigned long iova);
> >  extern int iommu_domain_has_cap(struct iommu_domain *domain,
> >  				unsigned long cap);
> > +extern long iommu_dev_to_group(struct device *dev);
> >  
> >  #else /* CONFIG_IOMMU_API */
> >  
> > @@ -121,6 +123,10 @@ static inline int domain_has_cap(struct iommu_domain *domain,
> >  	return 0;
> >  }
> >  
> > +static inline long iommu_dev_to_group(struct device *dev)
> > +{
> > +	return -ENODEV;
> > +}
> >  #endif /* CONFIG_IOMMU_API */
> >  
> >  #endif /* __LINUX_IOMMU_H */
> > 
> > 
> > 
> 

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-25 17:20                                           ` Alex Williamson
  (?)
@ 2011-08-25 18:05                                             ` Joerg Roedel
  -1 siblings, 0 replies; 322+ messages in thread
From: Joerg Roedel @ 2011-08-25 18:05 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Roedel, Joerg, Aaron Fabbri, Benjamin Herrenschmidt,
	Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci, qemu-devel,
	chrisw, iommu, Avi Kivity, Anthony Liguori, linuxppc-dev, benve

On Thu, Aug 25, 2011 at 11:20:30AM -0600, Alex Williamson wrote:
> On Thu, 2011-08-25 at 12:54 +0200, Roedel, Joerg wrote:

> > We need to solve this differently. ARM is starting to use the iommu-api
> > too and this definitely does not work there. One possible solution might
> > be to make the iommu-ops per-bus.
> 
> That sounds good.  Is anyone working on it?  It seems like it doesn't
> hurt to use this in the interim, we may just be watching the wrong bus
> and never add any sysfs group info.

I'll cook something up for RFC over the weekend.

> > Also the return type should not be long but something that fits into
> > 32bit on all platforms. Since you use -ENODEV, probably s32 is a good
> > choice.
> 
> The convenience of using seg|bus|dev|fn was too much to resist, too bad
> it requires a full 32bits.  Maybe I'll change it to:
>         int iommu_device_group(struct device *dev, unsigned int *group)

If we really expect segment numbers that need the full 16 bits then this
would be the way to go. Otherwise I would prefer returning the group-id
directly and partition the group-id space for the error values (s32 with
negative numbers being errors).
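
In other words (sketch only, assuming the signed-return convention
described above):

	s32 group = iommu_device_group(dev);

	if (group < 0)
		return group;	/* negative values are errno codes, e.g. -ENODEV */
	/* non-negative values are valid group ids */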

> > > @@ -438,6 +439,10 @@ static int __init intel_iommu_setup(char *str)
> > >  			printk(KERN_INFO
> > >  				"Intel-IOMMU: disable supported super page\n");
> > >  			intel_iommu_superpage = 0;
> > > +		} else if (!strncmp(str, "no_mf_groups", 12)) {
> > > +			printk(KERN_INFO
> > > +				"Intel-IOMMU: disable separate groups for multifunction devices\n");
> > > +			intel_iommu_no_mf_groups = 1;
> > 
> > This should really be a global iommu option and not be VT-d specific.
> 
> You think?  It's meaningless on benh's power systems.

But it is not meaningless on AMD-Vi systems :) There should be one
option for both.
On the other hand this requires an iommu= parameter on ia64, but that's
probably not that bad.

> > This looks like code duplication in the VT-d driver. It doesn't need to
> > be generalized now, but we should keep in mind to do a more general
> > solution later.
> > Maybe it is beneficial if the IOMMU drivers only setup the number in
> > dev->arch.iommu.groupid and the iommu-api fetches it from there then.
> > But as I said, this is some more work and does not need to be done for
> > this patch(-set).
> 
> The iommu-api reaches into dev->arch.iommu.groupid?  I figured we should
> at least start out with a lightweight, optional interface without the
> overhead of predefining groupids setup by bus notification callbacks in
> each iommu driver.  Thanks,

As I said, this is just an idea for a later optimization. It is fine
for now as it is in this patch.

	Joerg

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-24 11:03                                       ` Roedel, Joerg
  (?)
@ 2011-08-26  4:20                                         ` David Gibson
  -1 siblings, 0 replies; 322+ messages in thread
From: David Gibson @ 2011-08-26  4:20 UTC (permalink / raw)
  To: Roedel, Joerg
  Cc: aafabbri, Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	qemu-devel, chrisw, iommu, Avi Kivity, Anthony Liguori,
	linuxppc-dev, benve

On Wed, Aug 24, 2011 at 01:03:32PM +0200, Roedel, Joerg wrote:
> On Wed, Aug 24, 2011 at 05:33:00AM -0400, David Gibson wrote:
> > On Wed, Aug 24, 2011 at 11:14:26AM +0200, Roedel, Joerg wrote:
> 
> > > I don't see a reason to make this meta-grouping static. It would harm
> > > flexibility on x86. I think it makes things easier on power but there
> > > are options on that platform to get the dynamic solution too.
> > 
> > I think several people are misreading what Ben means by "static".  I
> > would prefer to say 'persistent', in that the meta-groups lifetime is
> > not tied to an fd, but they can be freely created, altered and removed
> > during runtime.
> 
> Even if it can be altered at runtime, from a usability perspective it is
> certainly the best to handle these groups directly in qemu. Or are there
> strong reasons to do it somewhere else?

Funny, Ben and I think usability demands it be the other way around.

If the meta-groups are transient - that is lifetime tied to an fd -
then any program that wants to use meta-groups *must* know the
interfaces for creating one, whatever they are.

But if they're persistent, the admin can use other tools to create the
meta-group then just hand it to a program to use, since the interfaces
for _using_ a meta-group are identical to those for an atomic group.

This doesn't preclude a program from being meta-group aware, and
creating its own if it wants to, of course.  My guess is that qemu
would not want to build its own meta-groups, but libvirt probably
would.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-25 13:25                                   ` Alexander Graf
  (?)
@ 2011-08-26  4:24                                     ` David Gibson
  -1 siblings, 0 replies; 322+ messages in thread
From: David Gibson @ 2011-08-26  4:24 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Roedel, Joerg, Alexey Kardashevskiy, kvm, Paul Mackerras,
	qemu-devel, iommu, chrisw, Alex Williamson, Avi Kivity,
	Anthony Liguori, linux-pci, linuxppc-dev, benve

On Thu, Aug 25, 2011 at 08:25:45AM -0500, Alexander Graf wrote:
> 
> On 25.08.2011, at 07:31, Roedel, Joerg wrote:
> 
> > On Wed, Aug 24, 2011 at 11:07:46AM -0400, Alex Williamson wrote:
> >> On Wed, 2011-08-24 at 10:52 +0200, Roedel, Joerg wrote:
> > 
> 
> [...]
> 
> >> We need to try the polite method of attempting to hot unplug the device
> >> from qemu first, which the current vfio code already implements.  We can
> >> then escalate if it doesn't respond.  The current code calls abort in
> >> qemu if the guest doesn't respond, but I agree we should also be
> >> enforcing this at the kernel interface.  I think the problem with the
> >> hard-unplug is that we don't have a good revoke mechanism for the mmio
> >> mmaps.
> > 
> > For mmio we could stop the guest and replace the mmio region with a
> > region that is filled with 0xff, no?
> 
> Sure, but that happens in user space. The question is how does
> kernel space enforce an MMIO region to not be mapped after the
> hotplug event occured? Keep in mind that user space is pretty much
> untrusted here - it doesn't have to be QEMU. It could just as well
> be a generic user space driver. And that can just ignore hotplug
> events.

We're saying you hard yank the mapping from the userspace process.
That is, you invalidate all its PTEs mapping the MMIO space, and don't
let it fault them back in.

As I see it there are two options: (a) make subsequent accesses from
userspace or the guest result in either a SIGBUS that userspace must
either deal with or die, or (b) replace the mapping with a dummy RO
mapping containing 0xff, with any trapped writes emulated as nops.
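
A very rough sketch of option (a), assuming the driver keeps a handle on
the address_space backing the BAR mmap (the struct and field names below
are illustrative only, not from any posted patch):

	/* on hard unplug: drop every userspace mapping of the BARs */
	static void vfio_revoke_mmio(struct vfio_dev *vdev)
	{
		vdev->revoked = true;
		unmap_mapping_range(vdev->mapping, 0, 0, 1);
	}

	/* fault handler for the BAR vma: once revoked, never fault the
	 * pages back in, so the process gets SIGBUS instead */
	static int vfio_bar_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
	{
		struct vfio_dev *vdev = vma->vm_private_data;

		if (vdev->revoked)
			return VM_FAULT_SIGBUS;

		return VM_FAULT_NOPAGE;	/* a real handler would insert the BAR pfn here */
	}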

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-26  4:24                                     ` David Gibson
  (?)
@ 2011-08-26  9:24                                       ` Roedel, Joerg
  -1 siblings, 0 replies; 322+ messages in thread
From: Roedel, Joerg @ 2011-08-26  9:24 UTC (permalink / raw)
  To: Alexander Graf, Alexey Kardashevskiy, kvm, Paul Mackerras, qemu-devel

On Fri, Aug 26, 2011 at 12:24:23AM -0400, David Gibson wrote:
> On Thu, Aug 25, 2011 at 08:25:45AM -0500, Alexander Graf wrote:
> > On 25.08.2011, at 07:31, Roedel, Joerg wrote:

> > > For mmio we could stop the guest and replace the mmio region with a
> > > region that is filled with 0xff, no?
> > 
> > Sure, but that happens in user space. The question is how does
> > kernel space enforce an MMIO region to not be mapped after the
> > hotplug event occured? Keep in mind that user space is pretty much
> > untrusted here - it doesn't have to be QEMU. It could just as well
> > be a generic user space driver. And that can just ignore hotplug
> > events.
> 
> We're saying you hard yank the mapping from the userspace process.
> That is, you invalidate all its PTEs mapping the MMIO space, and don't
> let it fault them back in.
> 
> As I see it there are two options: (a) make subsequent accesses from
> userspace or the guest result in either a SIGBUS that userspace must
> either deal with or die, or (b) replace the mapping with a dummy RO
> mapping containing 0xff, with any trapped writes emulated as nops.

The biggest problem with this approach is that it has to happen in the
context of the given process. Linux can't really modify an mm which
belongs to another context in a safe way.

The more I think about this, the more I come to the conclusion that it
would be best to just kill the process accessing the device if it is
manually de-assigned from vfio. It should be a non-standard path anyway
so it
doesn't make a lot of sense to implement complicated handling semantics
for it, no?
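
(Sketch of that fallback, assuming the driver remembers which task has
the device open - purely illustrative, not a posted patch:)

	static void vfio_kill_user(struct vfio_dev *vdev)
	{
		if (vdev->task)
			send_sig(SIGKILL, vdev->task, 1);	/* forced, cannot be caught */
	}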

	Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-26  4:20                                         ` David Gibson
  (?)
@ 2011-08-26  9:33                                           ` Roedel, Joerg
  -1 siblings, 0 replies; 322+ messages in thread
From: Roedel, Joerg @ 2011-08-26  9:33 UTC (permalink / raw)
  To: aafabbri, Alexey Kardashevskiy, kvm, Paul Mackerras,
	linux-pci@vger.kernel.org

On Fri, Aug 26, 2011 at 12:20:00AM -0400, David Gibson wrote:
> On Wed, Aug 24, 2011 at 01:03:32PM +0200, Roedel, Joerg wrote:
> > On Wed, Aug 24, 2011 at 05:33:00AM -0400, David Gibson wrote:
> > > On Wed, Aug 24, 2011 at 11:14:26AM +0200, Roedel, Joerg wrote:
> > 
> > > > I don't see a reason to make this meta-grouping static. It would harm
> > > > flexibility on x86. I think it makes things easier on power but there
> > > > are options on that platform to get the dynamic solution too.
> > > 
> > > I think several people are misreading what Ben means by "static".  I
> > > would prefer to say 'persistent', in that the meta-groups lifetime is
> > > not tied to an fd, but they can be freely created, altered and removed
> > > during runtime.
> > 
> > Even if it can be altered at runtime, from a usability perspective it is
> > certainly the best to handle these groups directly in qemu. Or are there
> > strong reasons to do it somewhere else?
> 
> Funny, Ben and I think usability demands it be the other way around.

The reason is that you mean the usability for the programmer and I mean
it for the actual user of qemu :)

> If the meta-groups are transient - that is lifetime tied to an fd -
> then any program that wants to use meta-groups *must* know the
> interfaces for creating one, whatever they are.
> 
> But if they're persistent, the admin can use other tools to create the
> meta-group then just hand it to a program to use, since the interfaces
> for _using_ a meta-group are identical to those for an atomic group.
> 
> This doesn't preclude a program from being meta-group aware, and
> creating its own if it wants to, of course.  My guess is that qemu
> would not want to build its own meta-groups, but libvirt probably
> would.

Doing it in libvirt makes it really hard for a plain user of qemu to
assign more than one device to a guest. What I want is that a user just
types

	qemu -device vfio,host=00:01.0 -device vfio,host=00:02.0 ...

and it just works. Qemu creates the meta-groups and they are
automatically destroyed when qemu exits. That the programs are not aware
of meta-groups is not a big problem because all software using vfio
still needs to be written :)

Btw, with this concept the programmer can still decide to not use
meta-groups and just multiplex the mappings to all open device-fds it
uses.
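
In userspace that multiplexing could look roughly like this (the ioctl
name and struct below are placeholders, since the vfio API is still
being hashed out in this very thread):

	/* establish the same iova mapping on every open device fd */
	static int map_on_all(int *devfds, int ndev, struct vfio_dma_map *map)
	{
		int i;

		for (i = 0; i < ndev; i++)
			if (ioctl(devfds[i], VFIO_DMA_MAP_IOVA, map) < 0)
				return -1;
		return 0;
	}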

	Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-26  9:33                                           ` Roedel, Joerg
  (?)
@ 2011-08-26 14:07                                             ` Alexander Graf
  -1 siblings, 0 replies; 322+ messages in thread
From: Alexander Graf @ 2011-08-26 14:07 UTC (permalink / raw)
  To: Roedel, Joerg
  Cc: aafabbri, Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	qemu-devel, chrisw, iommu, Avi Kivity, Anthony Liguori,
	linuxppc-dev, benve


On 26.08.2011, at 04:33, Roedel, Joerg wrote:

> On Fri, Aug 26, 2011 at 12:20:00AM -0400, David Gibson wrote:
>> On Wed, Aug 24, 2011 at 01:03:32PM +0200, Roedel, Joerg wrote:
>>> On Wed, Aug 24, 2011 at 05:33:00AM -0400, David Gibson wrote:
>>>> On Wed, Aug 24, 2011 at 11:14:26AM +0200, Roedel, Joerg wrote:
>>> 
>>>>> I don't see a reason to make this meta-grouping static. It would harm
>>>>> flexibility on x86. I think it makes things easier on power but there
>>>>> are options on that platform to get the dynamic solution too.
>>>> 
>>>> I think several people are misreading what Ben means by "static".  I
>>>> would prefer to say 'persistent', in that the meta-groups lifetime is
>>>> not tied to an fd, but they can be freely created, altered and removed
>>>> during runtime.
>>> 
>>> Even if it can be altered at runtime, from a usability perspective it is
>>> certainly the best to handle these groups directly in qemu. Or are there
>>> strong reasons to do it somewhere else?
>> 
>> Funny, Ben and I think usability demands it be the other way around.
> 
> The reason is that you mean the usability for the programmer and I mean
> it for the actual user of qemu :)

No, we mean the actual user of qemu. The reason being that making a device available for any user space application is an administrative task.

Forget the KVM case for a moment and think of a user space device driver. I as a user am not root. But I as a user when having access to /dev/vfioX want to be able to access the device and manage it - and only it. The admin of that box needs to set it up properly for me to be able to access it.

So having two steps is really the correct way to go:

  * create VFIO group
  * use VFIO group

because the two are done by completely different users. It's similar to how tun/tap works in Linux too. Of course nothing keeps you from also creating a group on the fly, but it shouldn't be the only interface available. The persistent setup is definitely more useful.

> 
>> If the meta-groups are transient - that is lifetime tied to an fd -
>> then any program that wants to use meta-groups *must* know the
>> interfaces for creating one, whatever they are.
>> 
>> But if they're persistent, the admin can use other tools to create the
>> meta-group then just hand it to a program to use, since the interfaces
>> for _using_ a meta-group are identical to those for an atomic group.
>> 
>> This doesn't preclude a program from being meta-group aware, and
>> creating its own if it wants to, of course.  My guess is that qemu
>> would not want to build its own meta-groups, but libvirt probably
>> would.
> 
> Doing it in libvirt makes it really hard for a plain user of qemu to
> assign more than one device to a guest. What I want is that a user just
> types
> 
> 	qemu -device vfio,host=00:01.0 -device vfio,host=00:02.0 ...
> 
> and it just works. Qemu creates the meta-groups and they are
> automatically destroyed when qemu exits. That the programs are not aware
> of meta-groups is not a big problem because all software using vfio
> needs still to be written :)
> 
> Btw, with this concept the programmer can still decide to not use
> meta-groups and just multiplex the mappings to all open device-fds it
> uses.

What I want to see is:

  # vfio-create 00:01.0
    /dev/vfio0
  # vfio-create -a /dev/vfio0 00:02.0
    /dev/vfio0

  $ qemu -vfio dev=/dev/vfio0,id=vfio0 -device vfio,vfio=vfio0.0 -device vfio,vfio=vfio0.1


Alex

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-26 14:07                                             ` Alexander Graf
@ 2011-08-26 15:24                                               ` Joerg Roedel
  -1 siblings, 0 replies; 322+ messages in thread
From: Joerg Roedel @ 2011-08-26 15:24 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Roedel, Joerg, aafabbri, Alexey Kardashevskiy, kvm,
	Paul Mackerras, linux-pci, qemu-devel, chrisw, iommu, Avi Kivity,
	Anthony Liguori, linuxppc-dev, benve

On Fri, Aug 26, 2011 at 09:07:35AM -0500, Alexander Graf wrote:
> On 26.08.2011, at 04:33, Roedel, Joerg wrote:
> > 
> > The reason is that you mean the usability for the programmer and I mean
> > it for the actual user of qemu :)
> 
> No, we mean the actual user of qemu. The reason being that making a
> device available for any user space application is an administrative
> task.
>
> Forget the KVM case for a moment and think of a user space device
> driver. I as a user am not root. But I as a user when having access to
> /dev/vfioX want to be able to access the device and manage it - and
> only it. The admin of that box needs to set it up properly for me to
> be able to access it.

Right, and that task is being performed by attaching the device(s) in
question to the vfio driver. The rights-management happens on the
/dev/vfio/$group file.

> So having two steps is really the correct way to go:
> 
>   * create VFIO group
>   * use VFIO group
> 
> because the two are done by completely different users. It's similar
> to how tun/tap works in Linux too. Of course nothing keeps you from
> also creating a group on the fly, but it shouldn't be the only
> interface available. The persistent setup is definitely more useful.

I see the use-case. But to make it as easy as possible for the end-user
we can do both.

So the user (of qemu, again) does this:

# vfio-ctl attach 00:01.0
vfio-ctl: attached to group 8
# vfio-ctl attach 00:02.0
vfio-ctl: attached to group 16
$ qemu -device vfio-pci,host=00:01.0 -device vfio-pci,host=00:02.0 ...

which should cover the use-case you prefer. Qemu still creates the
meta-group that allows the devices to share the same page-table. But what
should also be possible is:

# qemu -device vfio-pci,host=00:01.0 -device vfio-pci,host=00:02.0

In that case qemu detects that the devices are not yet bound to vfio and
will do so and also unbinds them afterwards (essentially the developer
use-case).
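
A minimal sketch of what that on-the-fly bind could look like through the
generic PCI sysfs interface; the "vfio-pci" driver name and the use of
new_id are assumptions here, since the host-driver side of vfio was still
being settled in this thread:

  #include <stdio.h>

  /* Rebind a PCI device (e.g. "0000:00:01.0") from its current driver to
   * a vfio host driver via sysfs.  Driver name and ID string are example
   * values, not a fixed interface. */
  static int bind_to_vfio(const char *bdf, const char *vendor_device)
  {
      char path[128];
      FILE *f;

      /* Detach from whatever driver currently owns the device (it may
       * already be unbound, so ignore failure here). */
      snprintf(path, sizeof(path),
               "/sys/bus/pci/devices/%s/driver/unbind", bdf);
      if ((f = fopen(path, "w"))) {
          fprintf(f, "%s\n", bdf);
          fclose(f);
      }

      /* Let the vfio driver claim this vendor:device ID, e.g. "8086 10d3". */
      if (!(f = fopen("/sys/bus/pci/drivers/vfio-pci/new_id", "w")))
          return -1;
      fprintf(f, "%s\n", vendor_device);
      fclose(f);

      /* Bind explicitly in case new_id did not already trigger a probe. */
      if ((f = fopen("/sys/bus/pci/drivers/vfio-pci/bind", "w"))) {
          fprintf(f, "%s\n", bdf);
          fclose(f);
      }
      return 0;
  }

  int main(void)
  {
      /* Example address and IDs only. */
      return bind_to_vfio("0000:00:01.0", "8086 10d3");
  }

Unbinding afterwards would be the reverse: write the address to the vfio
driver's unbind file and hand the device back to its original driver.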

Your interface which requires pre-binding of devices into one group by
the administrator only makes sense if you want to force userspace to
use certain devices (which do not belong to the same hw-group) only
together. But I don't see a usecase for defining such constraints (yet).

	Joerg


^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-26 15:24                                               ` Joerg Roedel
@ 2011-08-26 15:29                                                 ` Alexander Graf
  -1 siblings, 0 replies; 322+ messages in thread
From: Alexander Graf @ 2011-08-26 15:29 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Roedel, Joerg, aafabbri, Alexey Kardashevskiy, kvm,
	Paul Mackerras, linux-pci, qemu-devel, chrisw, iommu, Avi Kivity,
	Anthony Liguori, linuxppc-dev, benve


On 26.08.2011, at 10:24, Joerg Roedel wrote:

> On Fri, Aug 26, 2011 at 09:07:35AM -0500, Alexander Graf wrote:
>> On 26.08.2011, at 04:33, Roedel, Joerg wrote:
>>> 
>>> The reason is that you mean the usability for the programmer and I mean
>>> it for the actual user of qemu :)
>> 
>> No, we mean the actual user of qemu. The reason being that making a
>> device available for any user space application is an administrative
>> task.
>> 
>> Forget the KVM case for a moment and think of a user space device
>> driver. I as a user am not root. But I as a user when having access to
>> /dev/vfioX want to be able to access the device and manage it - and
>> only it. The admin of that box needs to set it up properly for me to
>> be able to access it.
> 
> Right, and that task is being performed by attaching the device(s) in
> question to the vfio driver. The rights-management happens on the
> /dev/vfio/$group file.

Yup :)

> 
>> So having two steps is really the correct way to go:
>> 
>>  * create VFIO group
>>  * use VFIO group
>> 
>> because the two are done by completely different users. It's similar
>> to how tun/tap works in Linux too. Of course nothing keeps you from
>> also creating a group on the fly, but it shouldn't be the only
>> interface available. The persistent setup is definitely more useful.
> 
> I see the use-case. But to make it as easy as possible for the end-user
> we can do both.
> 
> So the user (of qemu, again) does this:
> 
> # vfio-ctl attach 00:01.0
> vfio-ctl: attached to group 8
> # vfio-ctl attach 00:02.0
> vfio-ctl: attached to group 16
> $ qemu -device vfio-pci,host=00:01.0 -device vfio-pci,host=00:02.0 ...
> 
> which should cover the use-case you prefer. Qemu still creates the
> meta-group that allows the devices to share the same page-table. But what
> should also be possible is:
> 
> # qemu -device vfio-pci,host=00:01.0 -device vfio-pci,host=00:02.0
> 
> In that case qemu detects that the devices are not yet bound to vfio and
> will do so and also unbinds them afterwards (essentially the developer
> use-case).

I agree. It works the same with tun today. You can either have qemu spawn a tun device dynamically or have a preallocated one you use. If you run qemu as a user (which I always do), I preallocate a tun device and attach qemu to it.
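
For comparison, the tun side of that analogy looks roughly like this today:
a privileged helper creates a persistent tap device owned by the
unprivileged user, who can later hand it to qemu by name (the interface
name and uid below are just examples):

  #include <fcntl.h>
  #include <string.h>
  #include <sys/ioctl.h>
  #include <sys/types.h>
  #include <unistd.h>
  #include <net/if.h>
  #include <linux/if_tun.h>

  /* Privileged setup step, roughly what tunctl / "ip tuntap add" do. */
  static int create_persistent_tap(const char *name, uid_t owner)
  {
      struct ifreq ifr;
      int fd = open("/dev/net/tun", O_RDWR);

      if (fd < 0)
          return -1;
      memset(&ifr, 0, sizeof(ifr));
      ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
      strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);
      if (ioctl(fd, TUNSETIFF, &ifr) < 0 ||
          ioctl(fd, TUNSETOWNER, owner) < 0 ||
          ioctl(fd, TUNSETPERSIST, 1) < 0) {
          close(fd);
          return -1;
      }
      close(fd);          /* the interface survives: it is persistent */
      return 0;
  }

  int main(void)
  {
      return create_persistent_tap("tap0", 1000);   /* example values */
  }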

> Your interface which requires pre-binding of devices into one group by
> the administrator only makes sense if you want to force userspace to
> use certain devices (which do not belong to the same hw-group) only
> together. But I don't see a usecase for defining such constraints (yet).

Agreed. As long as the kernel backend can always figure out the hw-groups, we're good :)


Alex


^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-26 14:07                                             ` Alexander Graf
@ 2011-08-26 17:52                                               ` Aaron Fabbri
  -1 siblings, 0 replies; 322+ messages in thread
From: Aaron Fabbri @ 2011-08-26 17:52 UTC (permalink / raw)
  To: Alexander Graf, Roedel, Joerg
  Cc: Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci, qemu-devel,
	chrisw, iommu, Avi Kivity, Anthony Liguori, linuxppc-dev, benve




On 8/26/11 7:07 AM, "Alexander Graf" <agraf@suse.de> wrote:

> 
<snip>
> 
> Forget the KVM case for a moment and think of a user space device driver. I as
> a user am not root. But I as a user when having access to /dev/vfioX want to
> be able to access the device and manage it - and only it. The admin of that
> box needs to set it up properly for me to be able to access it.
> 
> So having two steps is really the correct way to go:
> 
>   * create VFIO group
>   * use VFIO group
> 
> because the two are done by completely different users.

This is not the case for my userspace drivers using VFIO today.

Each process will open vfio devices on the fly, and they need to be able to
share IOMMU resources.

So I need the ability to dynamically bring up devices and assign them to a
group.  The number of actual devices and how they map to iommu domains is
not known ahead of time.  We have a single piece of silicon that can expose
hundreds of pci devices.

In my case, the only administrative task would be to give my processes/users
access to the vfio groups (which are initially singletons), and the
application actually opens them and needs the ability to merge groups
together to conserve IOMMU resources (assuming we're not going to expose
uiommu).
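
In code, that model would look something like the sketch below; the
/dev/vfio paths follow the group naming used earlier in the thread, and
the merge ioctl is a placeholder, since no such interface existed yet:

  #include <fcntl.h>
  #include <sys/ioctl.h>
  #include <linux/ioctl.h>

  /* Placeholder request number -- the real merge/bind interface was still
   * under discussion at this point. */
  #define VFIO_GROUP_MERGE_X  _IO('x', 0)

  /* Open two initially-singleton groups and merge the second into the
   * first so they share a single iommu domain (and its mappings). */
  int merge_singleton_groups(void)
  {
      int g1 = open("/dev/vfio/8", O_RDWR);     /* group numbers are examples */
      int g2 = open("/dev/vfio/16", O_RDWR);

      if (g1 < 0 || g2 < 0)
          return -1;
      return ioctl(g1, VFIO_GROUP_MERGE_X, g2);
  }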

-Aaron

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-25 18:05                                             ` Joerg Roedel
@ 2011-08-26 18:04                                               ` Alex Williamson
  -1 siblings, 0 replies; 322+ messages in thread
From: Alex Williamson @ 2011-08-26 18:04 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: chrisw, Alexey Kardashevskiy, kvm, Paul Mackerras, Roedel, Joerg,
	linux-pci, qemu-devel, Aaron Fabbri, iommu, Avi Kivity,
	linuxppc-dev, benve

On Thu, 2011-08-25 at 20:05 +0200, Joerg Roedel wrote:
> On Thu, Aug 25, 2011 at 11:20:30AM -0600, Alex Williamson wrote:
> > On Thu, 2011-08-25 at 12:54 +0200, Roedel, Joerg wrote:
> 
> > > We need to solve this differently. ARM is starting to use the iommu-api
> > > too and this definitly does not work there. One possible solution might
> > > be to make the iommu-ops per-bus.
> > 
> > That sounds good.  Is anyone working on it?  It seems like it doesn't
> > hurt to use this in the interim, we may just be watching the wrong bus
> > and never add any sysfs group info.
> 
> I'll cook something up for RFC over the weekend.
> 
> > > Also the return type should not be long but something that fits into
> > > 32bit on all platforms. Since you use -ENODEV, probably s32 is a good
> > > choice.
> > 
> > The convenience of using seg|bus|dev|fn was too much to resist, too bad
> > it requires a full 32bits.  Maybe I'll change it to:
> >         int iommu_device_group(struct device *dev, unsigned int *group)
> 
> If we really expect segment numbers that need the full 16 bit then this
> would be the way to go. Otherwise I would prefer returning the group-id
> directly and partition the group-id space for the error values (s32 with
> negative numbers being errors).

It's unlikely to have segments using the top bit, but it would be broken
for an iommu driver to define its group numbers using pci s:b:d.f if we
don't have that bit available.  Ben/David, do PEs have an identifier of
a convenient size?  I'd guess any hardware based identifier is going to
use a full unsigned bit width.  Thanks,
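
To make the width concern concrete, here is a small sketch of a group
number packed from seg|bus|dev|fn; a segment that uses its top bit
produces a value that would collide with the negative-error range of a
signed return:

  #include <stdint.h>
  #include <stdio.h>

  /* 16-bit segment | 8-bit bus | 5-bit device | 3-bit function */
  static uint32_t sbdf_group(uint16_t seg, uint8_t bus, uint8_t dev, uint8_t fn)
  {
      return ((uint32_t)seg << 16) | ((uint32_t)bus << 8) |
             ((uint32_t)(dev & 0x1f) << 3) | (fn & 0x07);
  }

  int main(void)
  {
      /* 0000:00:01.0 fits easily ... */
      printf("%#010x\n", (unsigned)sbdf_group(0x0000, 0x00, 0x01, 0));
      /* ... but a segment with the top bit set looks like a negative errno
       * if the group number is returned as a signed 32-bit value. */
      printf("%#010x\n", (unsigned)sbdf_group(0x8000, 0x00, 0x01, 0));
      return 0;
  }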

Alex

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-26 17:52                                               ` [Qemu-devel] " Aaron Fabbri
@ 2011-08-26 19:35                                                 ` Chris Wright
  -1 siblings, 0 replies; 322+ messages in thread
From: Chris Wright @ 2011-08-26 19:35 UTC (permalink / raw)
  To: Aaron Fabbri
  Cc: Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	Alexander Graf, qemu-devel, chrisw, iommu, Avi Kivity, Roedel,
	Joerg, linuxppc-dev, benve

* Aaron Fabbri (aafabbri@cisco.com) wrote:
> On 8/26/11 7:07 AM, "Alexander Graf" <agraf@suse.de> wrote:
> > Forget the KVM case for a moment and think of a user space device driver. I as
> > a user am not root. But I as a user when having access to /dev/vfioX want to
> > be able to access the device and manage it - and only it. The admin of that
> > box needs to set it up properly for me to be able to access it.
> > 
> > So having two steps is really the correct way to go:
> > 
> >   * create VFIO group
> >   * use VFIO group
> > 
> > because the two are done by completely different users.
> 
> This is not the case for my userspace drivers using VFIO today.
> 
> Each process will open vfio devices on the fly, and they need to be able to
> share IOMMU resources.

How do you share IOMMU resources w/ multiple processes, are the processes
sharing memory?

> So I need the ability to dynamically bring up devices and assign them to a
> group.  The number of actual devices and how they map to iommu domains is
> not known ahead of time.  We have a single piece of silicon that can expose
> hundreds of pci devices.

This does not seem fundamentally different from the KVM use case.

We have 2 kinds of groupings.

1) low-level system or topology grouping

   Some may have multiple devices in a single group

   * the PCIe-PCI bridge example
   * the POWER partitionable endpoint

   Many will not

   * singleton group, e.g. typical x86 PCIe function (majority of
     assigned devices)

   Not sure it makes sense to have these administratively defined as
   opposed to system defined.

2) logical grouping

   * multiple low-level groups (singleton or otherwise) attached to same
     process, allowing things like single set of io page tables where
     applicable.

   These are nominally administratively defined.  In the KVM case, there
   is likely a privileged task (i.e. libvirtd) involved w/ making the
   device available to the guest and can do things like group merging.
   In your userspace case, perhaps it should be directly exposed.
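
Purely as an illustration of that split, with made-up structure and field
names:

  /* Kind 1: system-defined grouping -- membership is fixed by topology
   * and iommu capability, not by the admin. */
  struct hw_group {
      unsigned int id;          /* e.g. derived from the PE or s:b:d.f */
      unsigned int ndevices;    /* 1 in the common singleton case      */
  };

  /* Kind 2: logical grouping -- hw_groups attached to the same process
   * or guest so they can share one set of io page tables.  This is the
   * administratively (or application) defined layer. */
  struct logical_group {
      struct hw_group **members;
      unsigned int nmembers;
      void *io_pgtable;         /* the shared iommu domain */
  };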

> In my case, the only administrative task would be to give my processes/users
> access to the vfio groups (which are initially singletons), and the
> application actually opens them and needs the ability to merge groups
> together to conserve IOMMU resources (assuming we're not going to expose
> uiommu).

I agree, we definitely need to expose _some_ way to do this.

thanks,
-chris

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-26 19:35                                                 ` Chris Wright
@ 2011-08-26 20:17                                                   ` Aaron Fabbri
  -1 siblings, 0 replies; 322+ messages in thread
From: Aaron Fabbri @ 2011-08-26 20:17 UTC (permalink / raw)
  To: Chris Wright
  Cc: Alexander Graf, Roedel, Joerg, Alexey Kardashevskiy, kvm,
	Paul Mackerras, linux-pci, qemu-devel, iommu, Avi Kivity,
	Anthony Liguori, linuxppc-dev, benve




On 8/26/11 12:35 PM, "Chris Wright" <chrisw@sous-sol.org> wrote:

> * Aaron Fabbri (aafabbri@cisco.com) wrote:
>> On 8/26/11 7:07 AM, "Alexander Graf" <agraf@suse.de> wrote:
>>> Forget the KVM case for a moment and think of a user space device driver. I
>>> as
>>> a user am not root. But I as a user when having access to /dev/vfioX want to
>>> be able to access the device and manage it - and only it. The admin of that
>>> box needs to set it up properly for me to be able to access it.
>>> 
>>> So having two steps is really the correct way to go:
>>> 
>>>   * create VFIO group
>>>   * use VFIO group
>>> 
>>> because the two are done by completely different users.
>> 
>> This is not the case for my userspace drivers using VFIO today.
>> 
>> Each process will open vfio devices on the fly, and they need to be able to
>> share IOMMU resources.
> 
> How do you share IOMMU resources w/ multiple processes, are the processes
> sharing memory?

Sorry, bad wording.  I share IOMMU domains *within* each process.

E.g. If one process has 3 devices and another has 10, I can get by with two
iommu domains (and can share buffers among devices within each process).

If I ever need to share devices across processes, the shared memory case
might be interesting.

> 
>> So I need the ability to dynamically bring up devices and assign them to a
>> group.  The number of actual devices and how they map to iommu domains is
>> not known ahead of time.  We have a single piece of silicon that can expose
>> hundreds of pci devices.
> 
> This does not seem fundamentally different from the KVM use case.
> 
> We have 2 kinds of groupings.
> 
> 1) low-level system or topology grouping
> 
>    Some may have multiple devices in a single group
> 
>    * the PCIe-PCI bridge example
>    * the POWER partitionable endpoint
> 
>    Many will not
> 
>    * singleton group, e.g. typical x86 PCIe function (majority of
>      assigned devices)
> 
>    Not sure it makes sense to have these administratively defined as
>    opposed to system defined.
> 
> 2) logical grouping
> 
>    * multiple low-level groups (singleton or otherwise) attached to same
>      process, allowing things like single set of io page tables where
>      applicable.
> 
>    These are nominally administratively defined.  In the KVM case, there
>    is likely a privileged task (i.e. libvirtd) involved w/ making the
>    device available to the guest and can do things like group merging.
>    In your userspace case, perhaps it should be directly exposed.

Yes.  In essence, I'd rather not have to run any other admin processes.
Doing things programmatically, on the fly, from each process, is the
cleanest model right now.

> 
>> In my case, the only administrative task would be to give my processes/users
>> access to the vfio groups (which are initially singletons), and the
>> application actually opens them and needs the ability to merge groups
>> together to conserve IOMMU resources (assuming we're not going to expose
>> uiommu).
> 
> I agree, we definitely need to expose _some_ way to do this.
> 
> thanks,
> -chris

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-26 20:17                                                   ` Aaron Fabbri
@ 2011-08-26 21:06                                                     ` Chris Wright
  -1 siblings, 0 replies; 322+ messages in thread
From: Chris Wright @ 2011-08-26 21:06 UTC (permalink / raw)
  To: Aaron Fabbri
  Cc: Alexey Kardashevskiy, kvm, Paul Mackerras, Roedel, Joerg,
	Alexander Graf, qemu-devel, Chris Wright, iommu, Avi Kivity,
	linux-pci, linuxppc-dev, benve

* Aaron Fabbri (aafabbri@cisco.com) wrote:
> On 8/26/11 12:35 PM, "Chris Wright" <chrisw@sous-sol.org> wrote:
> > * Aaron Fabbri (aafabbri@cisco.com) wrote:
> >> Each process will open vfio devices on the fly, and they need to be able to
> >> share IOMMU resources.
> > 
> > How do you share IOMMU resources w/ multiple processes, are the processes
> > sharing memory?
> 
> Sorry, bad wording.  I share IOMMU domains *within* each process.

Ah, got it.  Thanks.

> E.g. If one process has 3 devices and another has 10, I can get by with two
> iommu domains (and can share buffers among devices within each process).
> 
> If I ever need to share devices across processes, the shared memory case
> might be interesting.
> 
> > 
> >> So I need the ability to dynamically bring up devices and assign them to a
> >> group.  The number of actual devices and how they map to iommu domains is
> >> not known ahead of time.  We have a single piece of silicon that can expose
> >> hundreds of pci devices.
> > 
> > This does not seem fundamentally different from the KVM use case.
> > 
> > We have 2 kinds of groupings.
> > 
> > 1) low-level system or topology grouping
> > 
> >    Some may have multiple devices in a single group
> > 
> >    * the PCIe-PCI bridge example
> >    * the POWER partitionable endpoint
> > 
> >    Many will not
> > 
> >    * singleton group, e.g. typical x86 PCIe function (majority of
> >      assigned devices)
> > 
> >    Not sure it makes sense to have these administratively defined as
> >    opposed to system defined.
> > 
> > 2) logical grouping
> > 
> >    * multiple low-level groups (singleton or otherwise) attached to same
> >      process, allowing things like single set of io page tables where
> >      applicable.
> > 
> >    These are nominally administratively defined.  In the KVM case, there
> >    is likely a privileged task (i.e. libvirtd) involved w/ making the
> >    device available to the guest and can do things like group merging.
> >    In your userspace case, perhaps it should be directly exposed.
> 
> Yes.  In essence, I'd rather not have to run any other admin processes.
> Doing things programmatically, on the fly, from each process, is the
> cleanest model right now.

I don't see an issue w/ this.  As long as it cannot add devices to the
system-defined groups, it's not a privileged operation.  So we still
need the iommu domain concept exposed in some form to logically put
groups into a single iommu domain (if desired).  In fact, I believe Alex
covered this in his most recent recap:

  ...The group fd will provide interfaces for enumerating the devices
  in the group, returning a file descriptor for each device in the group
  (the "device fd"), binding groups together, and returning a file
  descriptor for iommu operations (the "iommu fd").
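
Taken together, that recap suggests a userspace flow along the lines of
the sketch below; every ioctl name in it is a placeholder for an
interface that was still being designed at this point:

  #include <fcntl.h>
  #include <sys/ioctl.h>
  #include <linux/ioctl.h>
  #include <unistd.h>

  /* Placeholder request numbers -- illustration only. */
  #define GROUP_GET_IOMMU_FD_X   _IO('x', 1)
  #define GROUP_GET_DEVICE_FD_X  _IO('x', 2)   /* arg: device name        */
  #define GROUP_MERGE_X          _IO('x', 3)   /* arg: other group's fd
                                                  ("binding groups together") */

  int group_flow_example(void)
  {
      /* The admin has already made /dev/vfio/8 accessible to this user. */
      int group = open("/dev/vfio/8", O_RDWR);
      if (group < 0)
          return -1;

      /* iommu fd: where DMA mappings for the whole group are programmed */
      int iommu = ioctl(group, GROUP_GET_IOMMU_FD_X);

      /* device fd: per-device resource/config/interrupt access */
      int device = ioctl(group, GROUP_GET_DEVICE_FD_X, "0000:00:01.0");

      if (iommu < 0 || device < 0) {
          close(group);
          return -1;
      }
      /* ... set up mappings via iommu, drive the device via device ... */
      return 0;
  }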

thanks,
-chris

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-26  9:24                                       ` Roedel, Joerg
  (?)
@ 2011-08-28 13:14                                         ` Avi Kivity
  -1 siblings, 0 replies; 322+ messages in thread
From: Avi Kivity @ 2011-08-28 13:14 UTC (permalink / raw)
  To: Roedel, Joerg
  Cc: Alex Williamson, Alexey Kardashevskiy, kvm, Paul Mackerras,
	linux-pci, qemu-devel, Alexander Graf, chrisw, iommu,
	linuxppc-dev, benve

On 08/26/2011 12:24 PM, Roedel, Joerg wrote:
> >
> >  As I see it there are two options: (a) make subsequent accesses from
> >  userspace or the guest result in either a SIGBUS that userspace must
> >  either deal with or die, or (b) replace the mapping with a dummy RO
> >  mapping containing 0xff, with any trapped writes emulated as nops.
>
> The biggest problem with this approach is that it has to happen in the
> context of the given process. Linux can't really modify an mm which
> belongs to another context in a safe way.
>

Is use_mm() insufficient?
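
A minimal sketch of what the read side of option (b) quoted above could
look like (vfio_dummy_* and dummy_page are invented names, not actual VFIO
code, and the "writes emulated as nops" half would need extra machinery):

#include <linux/mm.h>

static struct page *dummy_page;	/* allocated once and memset to 0xff */

static int vfio_dummy_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	get_page(dummy_page);
	vmf->page = dummy_page;	/* reads of the revoked BAR now return 0xff */
	return 0;
}

static const struct vm_operations_struct vfio_dummy_vm_ops = {
	.fault	= vfio_dummy_fault,
};

Zapping the existing ptes and pointing the vma at these vm_ops is the part
the discussion above worries about doing from outside the owning process.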

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-28 13:14                                         ` Avi Kivity
  (?)
@ 2011-08-28 13:56                                           ` Joerg Roedel
  -1 siblings, 0 replies; 322+ messages in thread
From: Joerg Roedel @ 2011-08-28 13:56 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Roedel, Joerg, Alexander Graf, Alexey Kardashevskiy, kvm,
	Paul Mackerras, qemu-devel, iommu, chrisw, Alex Williamson,
	Anthony Liguori, linux-pci, linuxppc-dev, benve

On Sun, Aug 28, 2011 at 04:14:00PM +0300, Avi Kivity wrote:
> On 08/26/2011 12:24 PM, Roedel, Joerg wrote:

>> The biggest problem with this approach is that it has to happen in the
>> context of the given process. Linux can't really modify an mm which
>> belongs to another context in a safe way.
>>
>
> Is use_mm() insufficient?

Yes, it introduces a set of race conditions when a process that already
has an mm wants to take over another process's mm temporarily (and when
use_mm is modified to actually provide this functionality). It is only
safe when used from kernel-thread context.

One example:

	Process A		Process B			Process C
	.			.				.
	.		<--	takes A->mm			.
	.			and assigns it as B->mm		.
	.			.			-->	Wants to take
	.			.				B->mm, but gets
								A->mm now

This can't be made safe with a lock, because that introduces a potential
A->B / B->A lock-ordering problem when two processes try to take each
other's mm. It could probably be solved by a task->real_mm pointer; I
haven't thought about this yet...
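
From memory, use_mm() itself is roughly the following (details trimmed);
it blindly overwrites tsk->mm, which is why it is only safe from
kernel-thread context, where tsk->mm starts out NULL:

void use_mm(struct mm_struct *mm)
{
	struct mm_struct *active_mm;
	struct task_struct *tsk = current;

	task_lock(tsk);
	active_mm = tsk->active_mm;
	if (active_mm != mm) {
		atomic_inc(&mm->mm_count);
		tsk->active_mm = mm;
	}
	tsk->mm = mm;		/* a user task's own mm would be lost here */
	switch_mm(active_mm, mm, tsk);
	task_unlock(tsk);

	if (active_mm != mm)
		mmdrop(active_mm);
}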

	Joerg

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-28 13:56                                           ` Joerg Roedel
  (?)
@ 2011-08-28 14:04                                             ` Avi Kivity
  -1 siblings, 0 replies; 322+ messages in thread
From: Avi Kivity @ 2011-08-28 14:04 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Alex Williamson, Alexey Kardashevskiy, kvm, Paul Mackerras,
	Roedel, Joerg, qemu-devel, Alexander Graf, chrisw, iommu,
	linux-pci, linuxppc-dev, benve

On 08/28/2011 04:56 PM, Joerg Roedel wrote:
> On Sun, Aug 28, 2011 at 04:14:00PM +0300, Avi Kivity wrote:
> >  On 08/26/2011 12:24 PM, Roedel, Joerg wrote:
>
> >>  The biggest problem with this approach is that it has to happen in the
> >>  context of the given process. Linux can't really modify an mm which
> >>  belongs to another context in a safe way.
> >>
> >
> >  Is use_mm() insufficient?
>
> Yes, it introduces a set of race conditions when a process that already
> has an mm wants to take over another process's mm temporarily (and when
> use_mm is modified to actually provide this functionality). It is only
> safe when used from kernel-thread context.
>
> One example:
>
> 	Process A		Process B			Process C
> 	.			.				.
> 	.		<--	takes A->mm			.
> 	.			and assigns it as B->mm		.
> 	.			.			-->	Wants to take
> 	.			.				B->mm, but gets
> 								A->mm now

Good catch.

>
> This can't be made safe with a lock, because that introduces a potential
> A->B / B->A lock-ordering problem when two processes try to take each
> other's mm. It could probably be solved by a task->real_mm pointer; I
> haven't thought about this yet...
>

Or a workqueue -  you get a kernel thread context with a bit of boilerplate.
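
Something like this is probably all the boilerplate it takes (the struct
and function names are made up for the sketch; the caller is assumed to
have taken a reference on the mm with get_task_mm()):

#include <linux/workqueue.h>
#include <linux/mmu_context.h>
#include <linux/sched.h>
#include <linux/slab.h>

struct mm_fixup_work {
	struct work_struct work;
	struct mm_struct *mm;		/* mm of the affected process */
};

static void mm_fixup_fn(struct work_struct *work)
{
	struct mm_fixup_work *w = container_of(work, struct mm_fixup_work, work);

	use_mm(w->mm);			/* fine here: workqueue workers are kthreads */
	/* ... zap/replace the stale device mappings ... */
	unuse_mm(w->mm);

	mmput(w->mm);
	kfree(w);
}

/* caller:  INIT_WORK(&w->work, mm_fixup_fn);  w->mm = get_task_mm(task);
 *          schedule_work(&w->work);  */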

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-26 20:17                                                   ` Aaron Fabbri
  (?)
@ 2011-08-30  1:29                                                     ` David Gibson
  -1 siblings, 0 replies; 322+ messages in thread
From: David Gibson @ 2011-08-30  1:29 UTC (permalink / raw)
  To: Aaron Fabbri
  Cc: Alexey Kardashevskiy, kvm, Paul Mackerras, linux-pci,
	Alexander Graf, qemu-devel, Chris Wright, iommu, Avi Kivity,
	Roedel, Joerg, linuxppc-dev, benve

On Fri, Aug 26, 2011 at 01:17:05PM -0700, Aaron Fabbri wrote:
[snip]
> Yes.  In essence, I'd rather not have to run any other admin processes.
> Doing things programmatically, on the fly, from each process, is the
> cleanest model right now.

The "persistent group" model doesn't necessarily prevent that.
There's no reason your program can't use the administrative interface
as well as the "use" interface, and I don't see that making the admin
interface separate and persistent makes this any harder.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-26 18:04                                               ` Alex Williamson
  (?)
@ 2011-08-30 16:13                                                 ` Joerg Roedel
  -1 siblings, 0 replies; 322+ messages in thread
From: Joerg Roedel @ 2011-08-30 16:13 UTC (permalink / raw)
  To: Alex Williamson
  Cc: chrisw, Alexey Kardashevskiy, kvm, Paul Mackerras, Roedel, Joerg,
	linux-pci, qemu-devel, Aaron Fabbri, iommu, Avi Kivity,
	linuxppc-dev, benve

On Fri, Aug 26, 2011 at 12:04:22PM -0600, Alex Williamson wrote:
> On Thu, 2011-08-25 at 20:05 +0200, Joerg Roedel wrote:

> > If we really expect segment numbers that need the full 16 bits, then this
> > would be the way to go. Otherwise I would prefer returning the group-id
> > directly and partitioning the group-id space for the error values (s32 with
> > negative numbers being errors).
> 
> It's unlikely to have segments using the top bit, but it would be broken
> for an iommu driver to define its group numbers using pci s:b:d.f if we
> don't have that bit available.  Ben/David, do PEs have an identifier of
> a convenient size?  I'd guess any hardware based identifier is going to
> use a full unsigned bit width.

Okay, if we want to go the secure way, I am fine with the "int *group"
parameter. Another option is to just return a u64 and use the extended
number space for errors, but that is even worse as an interface, I
think.
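
To spell out the two shapes being compared (the prototypes below are
placeholders, not an existing interface):

/* (a) out-parameter: the whole unsigned range stays usable for group
 *     numbers, and the return value is a plain 0 / -errno */
int iommu_device_group(struct device *dev, unsigned int *group);

/* (b) id in the return value: negative values mean error, so the id
 *     space loses its top bit (and a u64 return only makes the
 *     interface clumsier) */
int iommu_device_group(struct device *dev);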

	Joerg

^ permalink raw reply	[flat|nested] 322+ messages in thread

* Re: kvm PCI assignment & VFIO ramblings
  2011-08-28 14:04                                             ` Avi Kivity
  (?)
@ 2011-08-30 16:14                                               ` Joerg Roedel
  -1 siblings, 0 replies; 322+ messages in thread
From: Joerg Roedel @ 2011-08-30 16:14 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Roedel, Joerg, Alexander Graf, Alexey Kardashevskiy, kvm,
	Paul Mackerras, qemu-devel, iommu, chrisw, Alex Williamson,
	Anthony Liguori, linux-pci, linuxppc-dev, benve

On Sun, Aug 28, 2011 at 05:04:32PM +0300, Avi Kivity wrote:
> On 08/28/2011 04:56 PM, Joerg Roedel wrote:

>> This can't be made safe with a lock, because that introduces a potential
>> A->B / B->A lock-ordering problem when two processes try to take each
>> other's mm. It could probably be solved by a task->real_mm pointer; I
>> haven't thought about this yet...
>>
>
> Or a workqueue -  you get a kernel thread context with a bit of boilerplate.

Right, a workqueue might do the trick. We'll evaluate that. Thanks for
the idea :)

	Joerg

^ permalink raw reply	[flat|nested] 322+ messages in thread

end of thread, other threads:[~2011-08-30 16:14 UTC | newest]

Thread overview: 322+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-07-29 23:58 kvm PCI assignment & VFIO ramblings Benjamin Herrenschmidt
2011-07-29 23:58 ` Benjamin Herrenschmidt
2011-07-30 18:20 ` Alex Williamson
2011-07-30 18:20   ` [Qemu-devel] " Alex Williamson
2011-07-30 18:20   ` Alex Williamson
2011-07-30 23:54   ` Benjamin Herrenschmidt
2011-07-30 23:54     ` [Qemu-devel] " Benjamin Herrenschmidt
2011-07-30 23:54     ` Benjamin Herrenschmidt
2011-08-01 18:59     ` Alex Williamson
2011-08-01 18:59       ` [Qemu-devel] " Alex Williamson
2011-08-01 18:59       ` Alex Williamson
2011-08-02  2:00       ` Benjamin Herrenschmidt
2011-08-02  2:00         ` [Qemu-devel] " Benjamin Herrenschmidt
2011-08-02  2:00         ` Benjamin Herrenschmidt
2011-07-30 23:55   ` Benjamin Herrenschmidt
2011-07-30 23:55     ` [Qemu-devel] " Benjamin Herrenschmidt
2011-07-30 23:55     ` Benjamin Herrenschmidt
2011-08-02  8:28   ` David Gibson
2011-08-02  8:28     ` [Qemu-devel] " David Gibson
2011-08-02  8:28     ` David Gibson
2011-08-02 18:14     ` Alex Williamson
2011-08-02 18:14       ` [Qemu-devel] " Alex Williamson
2011-08-02 18:14       ` Alex Williamson
2011-08-02 18:35       ` Alex Williamson
2011-08-02 18:35         ` [Qemu-devel] " Alex Williamson
2011-08-02 18:35         ` Alex Williamson
2011-08-03  2:04         ` David Gibson
2011-08-03  2:04           ` [Qemu-devel] " David Gibson
2011-08-03  2:04           ` David Gibson
2011-08-03  3:44           ` Alex Williamson
2011-08-03  3:44             ` [Qemu-devel] " Alex Williamson
2011-08-03  3:44             ` Alex Williamson
2011-08-04  0:39             ` David Gibson
2011-08-04  0:39               ` [Qemu-devel] " David Gibson
2011-08-08  8:28           ` Avi Kivity
2011-08-08  8:28             ` [Qemu-devel] " Avi Kivity
2011-08-08  8:28             ` Avi Kivity
2011-08-09 23:24             ` Alex Williamson
2011-08-09 23:24               ` [Qemu-devel] " Alex Williamson
2011-08-09 23:24               ` Alex Williamson
2011-08-10  2:48               ` Benjamin Herrenschmidt
2011-08-10  2:48                 ` [Qemu-devel] " Benjamin Herrenschmidt
2011-08-10  2:48                 ` Benjamin Herrenschmidt
2011-08-20 16:51                 ` Alex Williamson
2011-08-20 16:51                   ` [Qemu-devel] " Alex Williamson
2011-08-20 16:51                   ` Alex Williamson
2011-08-22  5:55                   ` David Gibson
2011-08-22  5:55                     ` [Qemu-devel] " David Gibson
2011-08-22  5:55                     ` David Gibson
2011-08-22 15:45                     ` Alex Williamson
2011-08-22 15:45                       ` [Qemu-devel] " Alex Williamson
2011-08-22 21:01                       ` Benjamin Herrenschmidt
2011-08-22 21:01                         ` [Qemu-devel] " Benjamin Herrenschmidt
2011-08-22 21:01                         ` Benjamin Herrenschmidt
2011-08-23 19:30                         ` Alex Williamson
2011-08-23 19:30                           ` [Qemu-devel] " Alex Williamson
2011-08-23 19:30                           ` Alex Williamson
2011-08-23 23:51                           ` Benjamin Herrenschmidt
2011-08-23 23:51                             ` [Qemu-devel] " Benjamin Herrenschmidt
2011-08-23 23:51                             ` Benjamin Herrenschmidt
2011-08-24  3:40                             ` Alexander Graf
2011-08-24  3:40                               ` [Qemu-devel] " Alexander Graf
2011-08-24  3:40                               ` Alexander Graf
2011-08-24 14:47                             ` Alex Williamson
2011-08-24 14:47                               ` [Qemu-devel] " Alex Williamson
2011-08-24 14:47                               ` Alex Williamson
2011-08-24  8:43                           ` Joerg Roedel
2011-08-24  8:43                             ` [Qemu-devel] " Joerg Roedel
2011-08-24  8:43                             ` Joerg Roedel
2011-08-24 14:56                             ` Alex Williamson
2011-08-24 14:56                               ` [Qemu-devel] " Alex Williamson
2011-08-24 14:56                               ` Alex Williamson
2011-08-25 11:01                               ` Roedel, Joerg
2011-08-25 11:01                                 ` [Qemu-devel] " Roedel, Joerg
2011-08-25 11:01                                 ` Roedel, Joerg
2011-08-23  2:38                       ` David Gibson
2011-08-23  2:38                         ` [Qemu-devel] " David Gibson
2011-08-23  2:38                         ` David Gibson
2011-08-23 16:23                         ` Alex Williamson
2011-08-23 16:23                           ` [Qemu-devel] " Alex Williamson
2011-08-23 16:23                           ` Alex Williamson
2011-08-23 23:41                           ` Benjamin Herrenschmidt
2011-08-23 23:41                             ` [Qemu-devel] " Benjamin Herrenschmidt
2011-08-23 23:41                             ` Benjamin Herrenschmidt
2011-08-24  3:36                             ` Alexander Graf
2011-08-24  3:36                               ` [Qemu-devel] " Alexander Graf
2011-08-24  3:36                               ` Alexander Graf
2011-08-22  6:30                   ` Avi Kivity
2011-08-22  6:30                     ` [Qemu-devel] " Avi Kivity
2011-08-22  6:30                     ` Avi Kivity
2011-08-22 10:46                     ` Joerg Roedel
2011-08-22 10:46                       ` [Qemu-devel] " Joerg Roedel
2011-08-22 10:46                       ` Joerg Roedel
2011-08-22 10:51                       ` Avi Kivity
2011-08-22 10:51                         ` [Qemu-devel] " Avi Kivity
2011-08-22 10:51                         ` Avi Kivity
2011-08-22 12:36                         ` Roedel, Joerg
2011-08-22 12:36                           ` [Qemu-devel] " Roedel, Joerg
2011-08-22 12:36                           ` Roedel, Joerg
2011-08-22 12:42                           ` Avi Kivity
2011-08-22 12:42                             ` [Qemu-devel] " Avi Kivity
2011-08-22 12:42                             ` Avi Kivity
2011-08-22 12:55                             ` Roedel, Joerg
2011-08-22 12:55                               ` [Qemu-devel] " Roedel, Joerg
2011-08-22 12:55                               ` Roedel, Joerg
2011-08-22 13:06                               ` Avi Kivity
2011-08-22 13:06                                 ` [Qemu-devel] " Avi Kivity
2011-08-22 13:06                                 ` Avi Kivity
2011-08-22 13:15                                 ` Roedel, Joerg
2011-08-22 13:15                                   ` [Qemu-devel] " Roedel, Joerg
2011-08-22 13:15                                   ` Roedel, Joerg
2011-08-22 13:17                                   ` Avi Kivity
2011-08-22 13:17                                     ` [Qemu-devel] " Avi Kivity
2011-08-22 13:17                                     ` Avi Kivity
2011-08-22 14:37                                     ` Roedel, Joerg
2011-08-22 14:37                                       ` [Qemu-devel] " Roedel, Joerg
2011-08-22 14:37                                       ` Roedel, Joerg
2011-08-22 20:53                     ` Benjamin Herrenschmidt
2011-08-22 20:53                       ` [Qemu-devel] " Benjamin Herrenschmidt
2011-08-22 20:53                       ` Benjamin Herrenschmidt
2011-08-22 17:25                   ` Joerg Roedel
2011-08-22 17:25                     ` [Qemu-devel] " Joerg Roedel
2011-08-22 17:25                     ` Joerg Roedel
2011-08-22 19:17                     ` Alex Williamson
2011-08-22 19:17                       ` [Qemu-devel] " Alex Williamson
2011-08-22 19:17                       ` Alex Williamson
2011-08-23 13:14                       ` Roedel, Joerg
2011-08-23 13:14                         ` [Qemu-devel] " Roedel, Joerg
2011-08-23 13:14                         ` Roedel, Joerg
2011-08-23 17:08                         ` Alex Williamson
2011-08-23 17:08                           ` [Qemu-devel] " Alex Williamson
2011-08-23 17:08                           ` Alex Williamson
2011-08-24  8:52                           ` Roedel, Joerg
2011-08-24  8:52                             ` [Qemu-devel] " Roedel, Joerg
2011-08-24  8:52                             ` Roedel, Joerg
2011-08-24 15:07                             ` Alex Williamson
2011-08-24 15:07                               ` [Qemu-devel] " Alex Williamson
2011-08-24 15:07                               ` Alex Williamson
2011-08-25 12:31                               ` Roedel, Joerg
2011-08-25 12:31                                 ` [Qemu-devel] " Roedel, Joerg
2011-08-25 12:31                                 ` Roedel, Joerg
2011-08-25 13:25                                 ` Alexander Graf
2011-08-25 13:25                                   ` [Qemu-devel] " Alexander Graf
2011-08-25 13:25                                   ` Alexander Graf
2011-08-26  4:24                                   ` David Gibson
2011-08-26  4:24                                     ` [Qemu-devel] " David Gibson
2011-08-26  4:24                                     ` David Gibson
2011-08-26  9:24                                     ` Roedel, Joerg
2011-08-26  9:24                                       ` [Qemu-devel] " Roedel, Joerg
2011-08-26  9:24                                       ` Roedel, Joerg
2011-08-28 13:14                                       ` Avi Kivity
2011-08-28 13:14                                         ` [Qemu-devel] " Avi Kivity
2011-08-28 13:14                                         ` Avi Kivity
2011-08-28 13:56                                         ` Joerg Roedel
2011-08-28 13:56                                           ` [Qemu-devel] " Joerg Roedel
2011-08-28 13:56                                           ` Joerg Roedel
2011-08-28 14:04                                           ` Avi Kivity
2011-08-28 14:04                                             ` [Qemu-devel] " Avi Kivity
2011-08-28 14:04                                             ` Avi Kivity
2011-08-30 16:14                                             ` Joerg Roedel
2011-08-30 16:14                                               ` [Qemu-devel] " Joerg Roedel
2011-08-30 16:14                                               ` Joerg Roedel
2011-08-22 21:03                     ` Benjamin Herrenschmidt
2011-08-22 21:03                       ` [Qemu-devel] " Benjamin Herrenschmidt
2011-08-22 21:03                       ` Benjamin Herrenschmidt
2011-08-23 13:18                       ` Roedel, Joerg
2011-08-23 13:18                         ` [Qemu-devel] " Roedel, Joerg
2011-08-23 13:18                         ` Roedel, Joerg
2011-08-23 23:35                         ` Benjamin Herrenschmidt
2011-08-23 23:35                           ` [Qemu-devel] " Benjamin Herrenschmidt
2011-08-23 23:35                           ` Benjamin Herrenschmidt
2011-08-24  8:53                           ` Roedel, Joerg
2011-08-24  8:53                             ` [Qemu-devel] " Roedel, Joerg
2011-08-24  8:53                             ` Roedel, Joerg
2011-08-22 20:29                   ` aafabbri
2011-08-22 20:29                     ` [Qemu-devel] " aafabbri
2011-08-22 20:29                     ` aafabbri
2011-08-22 20:49                     ` Benjamin Herrenschmidt
2011-08-22 20:49                       ` [Qemu-devel] " Benjamin Herrenschmidt
2011-08-22 21:38                       ` aafabbri
2011-08-22 21:38                         ` [Qemu-devel] " aafabbri
2011-08-22 21:38                         ` aafabbri
2011-08-22 21:49                         ` Benjamin Herrenschmidt
2011-08-22 21:49                           ` [Qemu-devel] " Benjamin Herrenschmidt
2011-08-22 21:49                           ` Benjamin Herrenschmidt
2011-08-23  0:52                           ` aafabbri
2011-08-23  0:52                             ` [Qemu-devel] " aafabbri
2011-08-23  0:52                             ` aafabbri
2011-08-23  6:54                             ` Benjamin Herrenschmidt
2011-08-23  6:54                               ` [Qemu-devel] " Benjamin Herrenschmidt
2011-08-23  6:54                               ` Benjamin Herrenschmidt
2011-08-23 11:09                               ` Joerg Roedel
2011-08-23 11:09                                 ` [Qemu-devel] " Joerg Roedel
2011-08-23 11:09                                 ` Joerg Roedel
2011-08-23 17:01                               ` Alex Williamson
2011-08-23 17:01                                 ` [Qemu-devel] " Alex Williamson
2011-08-23 17:01                                 ` Alex Williamson
2011-08-23 17:33                                 ` Aaron Fabbri
2011-08-23 17:33                                   ` [Qemu-devel] " Aaron Fabbri
2011-08-23 17:33                                   ` Aaron Fabbri
2011-08-23 18:01                                   ` Alex Williamson
2011-08-23 18:01                                     ` [Qemu-devel] " Alex Williamson
2011-08-23 18:01                                     ` Alex Williamson
2011-08-24  9:10                                   ` Joerg Roedel
2011-08-24  9:10                                     ` [Qemu-devel] " Joerg Roedel
2011-08-24  9:10                                     ` Joerg Roedel
2011-08-24 21:13                                     ` Alex Williamson
2011-08-24 21:13                                       ` [Qemu-devel] " Alex Williamson
2011-08-24 21:13                                       ` Alex Williamson
2011-08-25 10:54                                       ` Roedel, Joerg
2011-08-25 10:54                                         ` [Qemu-devel] " Roedel, Joerg
2011-08-25 10:54                                         ` Roedel, Joerg
2011-08-25 15:38                                         ` Don Dutile
2011-08-25 15:38                                           ` [Qemu-devel] " Don Dutile
2011-08-25 15:38                                           ` Don Dutile
2011-08-25 16:46                                           ` Roedel, Joerg
2011-08-25 16:46                                             ` [Qemu-devel] " Roedel, Joerg
2011-08-25 16:46                                             ` Roedel, Joerg
2011-08-25 17:20                                         ` Alex Williamson
2011-08-25 17:20                                           ` [Qemu-devel] " Alex Williamson
2011-08-25 17:20                                           ` Alex Williamson
2011-08-25 18:05                                           ` Joerg Roedel
2011-08-25 18:05                                             ` [Qemu-devel] " Joerg Roedel
2011-08-25 18:05                                             ` Joerg Roedel
2011-08-26 18:04                                             ` Alex Williamson
2011-08-26 18:04                                               ` [Qemu-devel] " Alex Williamson
2011-08-26 18:04                                               ` Alex Williamson
2011-08-30 16:13                                               ` Joerg Roedel
2011-08-30 16:13                                                 ` [Qemu-devel] " Joerg Roedel
2011-08-30 16:13                                                 ` Joerg Roedel
2011-08-23 11:04                             ` Joerg Roedel
2011-08-23 11:04                               ` [Qemu-devel] " Joerg Roedel
2011-08-23 11:04                               ` Joerg Roedel
2011-08-23 16:54                               ` aafabbri
2011-08-23 16:54                                 ` [Qemu-devel] " aafabbri
2011-08-23 16:54                                 ` aafabbri
2011-08-24  9:14                                 ` Roedel, Joerg
2011-08-24  9:14                                   ` [Qemu-devel] " Roedel, Joerg
2011-08-24  9:14                                   ` Roedel, Joerg
2011-08-24  9:33                                   ` David Gibson
2011-08-24  9:33                                     ` [Qemu-devel] " David Gibson
2011-08-24  9:33                                     ` David Gibson
2011-08-24 11:03                                     ` Roedel, Joerg
2011-08-24 11:03                                       ` [Qemu-devel] " Roedel, Joerg
2011-08-24 11:03                                       ` Roedel, Joerg
2011-08-26  4:20                                       ` David Gibson
2011-08-26  4:20                                         ` [Qemu-devel] " David Gibson
2011-08-26  4:20                                         ` David Gibson
2011-08-26  9:33                                         ` Roedel, Joerg
2011-08-26  9:33                                           ` [Qemu-devel] " Roedel, Joerg
2011-08-26  9:33                                           ` Roedel, Joerg
2011-08-26 14:07                                           ` Alexander Graf
2011-08-26 14:07                                             ` [Qemu-devel] " Alexander Graf
2011-08-26 14:07                                             ` Alexander Graf
2011-08-26 15:24                                             ` Joerg Roedel
2011-08-26 15:24                                               ` [Qemu-devel] " Joerg Roedel
2011-08-26 15:24                                               ` Joerg Roedel
2011-08-26 15:29                                               ` Alexander Graf
2011-08-26 15:29                                                 ` [Qemu-devel] " Alexander Graf
2011-08-26 15:29                                                 ` Alexander Graf
2011-08-26 17:52                                             ` Aaron Fabbri
2011-08-26 17:52                                               ` [Qemu-devel] " Aaron Fabbri
2011-08-26 19:35                                               ` Chris Wright
2011-08-26 19:35                                                 ` [Qemu-devel] " Chris Wright
2011-08-26 19:35                                                 ` Chris Wright
2011-08-26 20:17                                                 ` Aaron Fabbri
2011-08-26 20:17                                                   ` [Qemu-devel] " Aaron Fabbri
2011-08-26 20:17                                                   ` Aaron Fabbri
2011-08-26 21:06                                                   ` Chris Wright
2011-08-26 21:06                                                     ` [Qemu-devel] " Chris Wright
2011-08-26 21:06                                                     ` Chris Wright
2011-08-30  1:29                                                   ` David Gibson
2011-08-30  1:29                                                     ` [Qemu-devel] " David Gibson
2011-08-30  1:29                                                     ` David Gibson
2011-08-04 10:35   ` Joerg Roedel
2011-08-04 10:35     ` [Qemu-devel] " Joerg Roedel
2011-08-04 10:35     ` Joerg Roedel
2011-07-30 22:21 ` Benjamin Herrenschmidt
2011-07-30 22:21   ` Benjamin Herrenschmidt
2011-08-01 16:40   ` Alex Williamson
2011-08-01 16:40     ` Alex Williamson
2011-08-02  1:29     ` Benjamin Herrenschmidt
2011-07-31 14:09 ` Avi Kivity
2011-07-31 14:09   ` Avi Kivity
2011-08-01 20:27   ` Alex Williamson
2011-08-01 20:27     ` Alex Williamson
2011-08-02  8:32     ` Avi Kivity
2011-08-02  8:32       ` Avi Kivity
2011-08-04 10:41     ` Joerg Roedel
2011-08-04 10:41       ` Joerg Roedel
2011-08-05 10:26       ` Benjamin Herrenschmidt
2011-08-05 10:26         ` Benjamin Herrenschmidt
2011-08-05 12:57         ` Joerg Roedel
2011-08-05 12:57           ` Joerg Roedel
2011-08-02  1:27   ` Benjamin Herrenschmidt
2011-08-02  1:27     ` Benjamin Herrenschmidt
2011-08-02  9:12     ` Avi Kivity
2011-08-02  9:12       ` Avi Kivity
2011-08-02 12:58       ` Benjamin Herrenschmidt
2011-08-02 12:58         ` Benjamin Herrenschmidt
2011-08-02 13:39         ` Avi Kivity
2011-08-02 13:39           ` Avi Kivity
2011-08-02 15:34         ` Alex Williamson
2011-08-02 15:34           ` Alex Williamson
2011-08-02 21:29           ` Konrad Rzeszutek Wilk
2011-08-02 21:29             ` Konrad Rzeszutek Wilk
2011-08-03  1:02             ` Alex Williamson
2011-08-03  1:02               ` Alex Williamson
2011-08-02 14:39     ` Alex Williamson
2011-08-02 14:39       ` Alex Williamson
2011-08-01  2:48 ` David Gibson
2011-08-04 10:27 ` Joerg Roedel
2011-08-04 10:27   ` Joerg Roedel
2011-08-05 10:42   ` Benjamin Herrenschmidt
2011-08-05 10:42     ` Benjamin Herrenschmidt
2011-08-05 13:44     ` Joerg Roedel
2011-08-05 13:44       ` Joerg Roedel
2011-08-05 22:49       ` Benjamin Herrenschmidt
2011-08-05 22:49         ` Benjamin Herrenschmidt
2011-08-05 15:10     ` Alex Williamson
2011-08-05 15:10       ` Alex Williamson
2011-08-08  6:07       ` David Gibson
