* RFC:  vfio API changes needed for powerpc
From: Yoder Stuart-B08248 @ 2013-04-02 17:32 UTC
  To: Alex Williamson
  Cc: Wood Scott-B07421, kvm, qemu-devel, agraf, iommu,
	Bhushan Bharat-R65777

Alex,

We are in the process of implementing vfio-pci support for the Freescale
IOMMU (PAMU).  It is an aperture/window-based IOMMU, quite different from
x86, and will involve creating a 'type 2' vfio implementation.

For each device's DMA mappings, PAMU has an overall aperture and a number
of windows.  All sizes and window counts must be powers of 2.  To illustrate,
below is a mapping for a 256MB guest, including guest memory (backed by
64MB huge pages) and some windows for MSIs:

    Total aperture: 512MB
    # of windows: 8

    win gphys/
    #   iova        phys          size
    --- ----        ----          ----
    0   0x00000000  0xX_XX000000  64MB
    1   0x04000000  0xX_XX000000  64MB
    2   0x08000000  0xX_XX000000  64MB
    3   0x0C000000  0xX_XX000000  64MB
    4   0x10000000  0xf_fe044000  4KB    // msi bank 1
    5   0x14000000  0xf_fe045000  4KB    // msi bank 2
    6   0x18000000  0xf_fe046000  4KB    // msi bank 3
    7            -             -  disabled
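
To make the power-of-2 constraints concrete, the arithmetic behind the
example above looks roughly like this (illustrative C only; the helper
is not a real API):

    #include <stdint.h>

    /* round x up to the next power of 2 */
    static uint64_t roundup_pow2(uint64_t x)
    {
            uint64_t p = 1;
            while (p < x)
                    p <<= 1;
            return p;
    }

    static void example_geometry(void)
    {
            uint64_t win_size = 64ULL << 20;    /* one 64MB huge page */
            /* 4 windows of RAM + 3 MSI windows = 7, rounded up to 8 */
            uint64_t windows  = roundup_pow2((256ULL << 20) / win_size + 3);
            uint64_t aperture = windows * win_size;     /* 8 * 64MB = 512MB */
            (void)aperture;
    }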

There are a couple of updates needed to the vfio user->kernel interface
that we would like your feedback on.

1.  IOMMU geometry

   The kernel IOMMU driver now has an interface (see domain_set_attr,
   domain_get_attr) that lets us set the domain geometry using
   "attributes".

   We want to expose that to user space, so we envision needing a couple
   of new ioctls to do this:
        VFIO_IOMMU_SET_ATTR
        VFIO_IOMMU_GET_ATTR     
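
   To sketch what the user-space side might look like (the struct layout
   and ioctl numbers below are purely hypothetical -- only the two names
   above come from this proposal, and a real version would need fixed-size
   types throughout):

        /* hypothetical uapi sketch, not an existing kernel interface */
        struct vfio_iommu_attr {
                __u32 argsz;
                __u32 flags;
                __u32 attr;     /* which attribute, e.g. geometry or windows */
                __u8  data[];   /* attribute-specific payload */
        };

        #define VFIO_IOMMU_GET_ATTR  _IO(VFIO_TYPE, VFIO_BASE + 16)
        #define VFIO_IOMMU_SET_ATTR  _IO(VFIO_TYPE, VFIO_BASE + 17)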

2.   MSI window mappings

   The more problematic question is how to deal with MSIs.  We need to
   create mappings for up to 3 MSI banks that a device may need to target
   to generate interrupts.  The Linux MSI driver can allocate MSIs from
   the 3 banks any way it wants, and currently user space has no way of
   knowing which bank may be used for a given device.   

   There are 3 options we have discussed and would like your direction:

   A.  Implicit mappings -- with this approach user space would not
       explicitly map MSIs.  User space would be required to set the
       geometry so that there are 3 unused windows (the last 3 windows)
       for MSIs, and it would be up to the kernel to create the mappings.
       This approach requires some specific semantics (leaving 3 windows)
       and it potentially gets a little weird-- when should the kernel
       actually create the MSI mappings?  When should they be unmapped?
       Some convention would need to be established.

   B.  Explicit mapping using DMA map flags.  The idea is that a new
       flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that
       a mapping is to be created for the supplied iova.  No vaddr
       is given though.  So in the above example there would be a
       dma map at 0x10000000 for 24KB (and no vaddr).   It's
       up to the kernel to determine which bank gets mapped where.
       So, this option puts user space in control of which windows
       are used for MSIs and when MSIs are mapped/unmapped.   There
       would need to be some semantics as to how this is used-- it
       only makes sense
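
       For illustration, reusing the existing type1 DMA-map struct, an
       option B call might look like this (VFIO_DMA_MAP_FLAG_MSI is only
       proposed here, and container_fd stands for the open /dev/vfio/vfio
       container):

           struct vfio_iommu_type1_dma_map map = {
                   .argsz = sizeof(map),
                   .flags = VFIO_DMA_MAP_FLAG_MSI, /* proposed; no vaddr */
                   .iova  = 0x10000000,    /* windows 4-6 in the table */
                   .size  = 3 * 4096,      /* 24KB, one page per bank */
           };
           ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);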

   C.  Explicit mapping using normal DMA map.  The last idea is that
       we would introduce a new ioctl to give user-space an fd to 
       the MSI bank, which could be mmapped.  The flow would be
       something like this:
          -for each group user space calls new ioctl VFIO_GROUP_GET_MSI_FD
          -user space mmaps the fd, getting a vaddr
          -user space does a normal DMA map for desired iova
       This approach makes everything explicit, but adds a new ioctl
       applicable most likely only to the PAMU (type2 iommu).
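
       For illustration, that flow in rough C (VFIO_GROUP_GET_MSI_FD is
       only proposed here; the fd names and 4KB size are illustrative):

           int msi_fd = ioctl(group_fd, VFIO_GROUP_GET_MSI_FD);

           /* mmap the MSI bank registers to get a vaddr */
           void *vaddr = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                              MAP_SHARED, msi_fd, 0);

           /* then a normal DMA map of that vaddr at the desired iova */
           struct vfio_iommu_type1_dma_map map = {
                   .argsz = sizeof(map),
                   .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
                   .vaddr = (uintptr_t)vaddr,
                   .iova  = 0x10000000,
                   .size  = 4096,
           };
           ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);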

Any feedback or direction?

Thanks,
Stuart 
    

* Re: RFC: vfio API changes needed for powerpc
From: Scott Wood @ 2013-04-02 19:39 UTC
  To: Yoder Stuart-B08248
  Cc: Wood Scott-B07421, kvm, agraf, iommu, qemu-devel,
	Bhushan Bharat-R65777

On 04/02/2013 12:32:00 PM, Yoder Stuart-B08248 wrote:
> Alex,
> 
> We are in the process of implementing vfio-pci support for the Freescale
> IOMMU (PAMU).  It is an aperture/window-based IOMMU, quite different from
> x86, and will involve creating a 'type 2' vfio implementation.
> 
> For each device's DMA mappings, PAMU has an overall aperture and a number
> of windows.  All sizes and window counts must be powers of 2.  To illustrate,
> below is a mapping for a 256MB guest, including guest memory (backed by
> 64MB huge pages) and some windows for MSIs:
> 
>     Total aperture: 512MB
>     # of windows: 8
> 
>     win gphys/
>     #   iova        phys          size
>     --- ----        ----          ----
>     0   0x00000000  0xX_XX000000  64MB
>     1   0x04000000  0xX_XX000000  64MB
>     2   0x08000000  0xX_XX000000  64MB
>     3   0x0C000000  0xX_XX000000  64MB
>     4   0x10000000  0xf_fe044000  4KB    // msi bank 1
>     5   0x14000000  0xf_fe045000  4KB    // msi bank 2
>     6   0x18000000  0xf_fe046000  4KB    // msi bank 3
>     7            -             -  disabled
> 
> There are a couple of updates needed to the vfio user->kernel interface
> that we would like your feedback on.
> 
> 1.  IOMMU geometry
> 
>    The kernel IOMMU driver now has an interface (see domain_set_attr,
>    domain_get_attr) that lets us set the domain geometry using
>    "attributes".
> 
>    We want to expose that to user space, so we envision needing a couple
>    of new ioctls to do this:
>         VFIO_IOMMU_SET_ATTR
>         VFIO_IOMMU_GET_ATTR

Note that this means attributes need to be updated for user-API  
appropriateness, such as using fixed-size types.

> 2.   MSI window mappings
> 
>    The more problematic question is how to deal with MSIs.  We need to
>    create mappings for up to 3 MSI banks that a device may need to target
>    to generate interrupts.  The Linux MSI driver can allocate MSIs from
>    the 3 banks any way it wants, and currently user space has no way of
>    knowing which bank may be used for a given device.
> 
>    There are 3 options we have discussed and would like your direction:
> 
>    A.  Implicit mappings -- with this approach user space would not
>        explicitly map MSIs.  User space would be required to set the
>        geometry so that there are 3 unused windows (the last 3 windows)

Where does userspace get the number "3" from?  E.g. on newer chips  
there are 4 MSI banks.  Maybe future chips have even more.

>    B.  Explicit mapping using DMA map flags.  The idea is that a new
>        flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that
>        a mapping is to be created for the supplied iova.  No vaddr
>        is given though.  So in the above example there would be a
>        dma map at 0x10000000 for 24KB (and no vaddr).

A single 24 KiB mapping wouldn't work (and why 24KB?  What if only one  
MSI group is involved in this VFIO group?  What if four MSI groups are  
involved?).  You'd need to either have a naturally aligned,  
power-of-two sized mapping that covers exactly the pages you want to  
map and no more, or you'd need to create a separate mapping for each  
MSI bank, and due to PAMU subwindow alignment restrictions these  
mappings could not be contiguous in iova-space.
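
Stated as a check (illustrative C, not kernel code -- note that 24KB
already fails the power-of-two test):

    static int naturally_aligned(uint64_t iova, uint64_t size)
    {
            /* size is a power of 2 and iova is a multiple of it */
            return size && !(size & (size - 1)) && !(iova & (size - 1));
    }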

>    C.  Explicit mapping using normal DMA map.  The last idea is that
>        we would introduce a new ioctl to give user-space an fd to
>        the MSI bank, which could be mmapped.  The flow would be
>        something like this:
>           -for each group user space calls new ioctl VFIO_GROUP_GET_MSI_FD
>           -user space mmaps the fd, getting a vaddr
>           -user space does a normal DMA map for desired iova
>        This approach makes everything explicit, but adds a new ioctl
>        applicable most likely only to the PAMU (type2 iommu).

The new ioctl isn't really specific to PAMU (or whatever "type2" is  
supposed to be, which nobody ever explains when I ask), so much as to  
the MSI implementation.  It just exposes the MSI register as another  
device resource (well, technically a groupwide resource, unless we  
expose it on a per-device basis and provide enough information for  
userspace to recognize when it's the same for other devices in the  
group) to be mmapped, which userspace can choose to map in the IOMMU as  
well.

Note that in the explicit case, userspace would have to program the MSI  
iova into the PCI device's config space (or communicate the chosen  
address to the kernel so it can set the config space registers).

-Scott

* Re: RFC:  vfio API changes needed for powerpc
From: Alex Williamson @ 2013-04-02 20:32 UTC
  To: Yoder Stuart-B08248
  Cc: Wood Scott-B07421, kvm, qemu-devel, agraf, iommu,
	Bhushan Bharat-R65777

Hi Stuart,

On Tue, 2013-04-02 at 17:32 +0000, Yoder Stuart-B08248 wrote:
> Alex,
> 
> We are in the process of implementing vfio-pci support for the Freescale
> IOMMU (PAMU).  It is an aperture/window-based IOMMU, quite different from
> x86, and will involve creating a 'type 2' vfio implementation.
> 
> For each device's DMA mappings, PAMU has an overall aperture and a number
> of windows.  All sizes and window counts must be powers of 2.  To illustrate,
> below is a mapping for a 256MB guest, including guest memory (backed by
> 64MB huge pages) and some windows for MSIs:
> 
>     Total aperture: 512MB
>     # of windows: 8
> 
>     win gphys/
>     #   iova        phys          size
>     --- ----        ----          ----
>     0   0x00000000  0xX_XX000000  64MB
>     1   0x04000000  0xX_XX000000  64MB
>     2   0x08000000  0xX_XX000000  64MB
>     3   0x0C000000  0xX_XX000000  64MB
>     4   0x10000000  0xf_fe044000  4KB    // msi bank 1
>     5   0x14000000  0xf_fe045000  4KB    // msi bank 2
>     6   0x18000000  0xf_fe046000  4KB    // msi bank 3
>     7            -             -  disabled
> 
> There are a couple of updates needed to the vfio user->kernel interface
> that we would like your feedback on.
> 
> 1.  IOMMU geometry
> 
>    The kernel IOMMU driver now has an interface (see domain_set_attr,
>    domain_get_attr) that lets us set the domain geometry using
>    "attributes".
> 
>    We want to expose that to user space, so we envision needing a couple
>    of new ioctls to do this:
>         VFIO_IOMMU_SET_ATTR
>         VFIO_IOMMU_GET_ATTR     

Any ioctls to the vfiofd (/dev/vfio/vfio) not claimed by vfio-core are
passed to the IOMMU driver.  So you can effectively have your own type2
ioctl extensions.  Alexey has already posted patches to do this for
SPAPR that add VFIO_IOMMU_ENABLE/DISABLE to allow him access to
VFIO_IOMMU_GET_INFO to examine locked page requirements.  As Scott notes
we need to come up with a clean userspace interface for these though.
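
For reference, that pass-through means a type2 backend could simply claim
its own numbers in its vfio_iommu_driver_ops ioctl hook -- roughly like
this sketch, where the two attr handlers are hypothetical:

    static long type2_ioctl(void *iommu_data, unsigned int cmd,
                            unsigned long arg)
    {
            switch (cmd) {
            case VFIO_IOMMU_SET_ATTR:       /* proposed above */
                    return type2_set_attr(iommu_data, arg);
            case VFIO_IOMMU_GET_ATTR:
                    return type2_get_attr(iommu_data, arg);
            default:
                    return -ENOTTY;         /* not ours */
            }
    }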

> 2.   MSI window mappings
> 
>    The more problematic question is how to deal with MSIs.  We need to
>    create mappings for up to 3 MSI banks that a device may need to target
>    to generate interrupts.  The Linux MSI driver can allocate MSIs from
>    the 3 banks any way it wants, and currently user space has no way of
>    knowing which bank may be used for a given device.   
> 
>    There are 3 options we have discussed and would like your direction:
> 
>    A.  Implicit mappings -- with this approach user space would not
>        explicitly map MSIs.  User space would be required to set the
>        geometry so that there are 3 unused windows (the last 3 windows)
>        for MSIs, and it would be up to the kernel to create the mappings.
>        This approach requires some specific semantics (leaving 3 windows)
>        and it potentially gets a little weird-- when should the kernel
>        actually create the MSI mappings?  When should they be unmapped?
>        Some convention would need to be established.

VFIO would have control of SET/GET_ATTR, right?  So we could reduce the
number exposed to userspace on GET and transparently add MSI entries on
SET.  On x86 the interrupt remapper handles this transparently when MSI
is enabled and userspace never gets direct access to the device MSI
address/data registers.  What kind of restrictions do you have around
adding and removing windows while the aperture is enabled?

>    B.  Explicit mapping using DMA map flags.  The idea is that a new
>        flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that
>        a mapping is to be created for the supplied iova.  No vaddr
>        is given though.  So in the above example there would be a
>        dma map at 0x10000000 for 24KB (and no vaddr).   It's
>        up to the kernel to determine which bank gets mapped where.
>        So, this option puts user space in control of which windows
>        are used for MSIs and when MSIs are mapped/unmapped.   There
>        would need to be some semantics as to how this is used-- it
>        only makes sense

This could also be done as another "type2" ioctl extension.  What's the
value to userspace in determining which windows are used by which banks?
It sounds like the case that there are X banks and if userspace wants to
use MSI it needs to leave X windows available for that.  Is this just
buying userspace a few more windows to allow them the choice between MSI
or RAM?

>    C.  Explicit mapping using normal DMA map.  The last idea is that
>        we would introduce a new ioctl to give user-space an fd to 
>        the MSI bank, which could be mmapped.  The flow would be
>        something like this:
>           -for each group user space calls new ioctl VFIO_GROUP_GET_MSI_FD
>           -user space mmaps the fd, getting a vaddr
>           -user space does a normal DMA map for desired iova
>        This approach makes everything explicit, but adds a new ioctl
>        applicable most likely only to the PAMU (type2 iommu).

And the DMA_MAP of that mmap then allows userspace to select the window
used?  This one seems like a lot of overhead, adding a new ioctl, new
fd, mmap, special mapping path, etc.  It would be less overhead to just
add an ioctl to enable MSI, maybe letting userspace pick which windows
get used, but I'm still not sure what the value is to userspace in
exposing it.  Thanks,

Alex

* Re: RFC: vfio API changes needed for powerpc
From: Stuart Yoder @ 2013-04-02 20:38 UTC
  To: Scott Wood
  Cc: Yoder Stuart-B08248, Wood Scott-B07421, kvm, agraf, iommu,
	qemu-devel, Bhushan Bharat-R65777

On Tue, Apr 2, 2013 at 2:39 PM, Scott Wood <scottwood@freescale.com> wrote:
> On 04/02/2013 12:32:00 PM, Yoder Stuart-B08248 wrote:
>>
>> Alex,
>>
>> We are in the process of implementing vfio-pci support for the Freescale
>> IOMMU (PAMU).  It is an aperture/window-based IOMMU, quite different from
>> x86, and will involve creating a 'type 2' vfio implementation.
>>
>> For each device's DMA mappings, PAMU has an overall aperture and a number
>> of windows.  All sizes and window counts must be powers of 2.  To
>> illustrate,
>> below is a mapping for a 256MB guest, including guest memory (backed by
>> 64MB huge pages) and some windows for MSIs:
>>
>>     Total aperture: 512MB
>>     # of windows: 8
>>
>>     win gphys/
>>     #   iova        phys          size
>>     --- ----        ----          ----
>>     0   0x00000000  0xX_XX000000  64MB
>>     1   0x04000000  0xX_XX000000  64MB
>>     2   0x08000000  0xX_XX000000  64MB
>>     3   0x0C000000  0xX_XX000000  64MB
>>     4   0x10000000  0xf_fe044000  4KB    // msi bank 1
>>     5   0x14000000  0xf_fe045000  4KB    // msi bank 2
>>     6   0x18000000  0xf_fe046000  4KB    // msi bank 3
>>     7            -             -  disabled
>>
>> There are a couple of updates needed to the vfio user->kernel interface
>> that we would like your feedback on.
>>
>> 1.  IOMMU geometry
>>
>>    The kernel IOMMU driver now has an interface (see domain_set_attr,
>>    domain_get_attr) that lets us set the domain geometry using
>>    "attributes".
>>
>>    We want to expose that to user space, so we envision needing a couple
>>    of new ioctls to do this:
>>         VFIO_IOMMU_SET_ATTR
>>         VFIO_IOMMU_GET_ATTR
>
>
> Note that this means attributes need to be updated for user-API
> appropriateness, such as using fixed-size types.
>
>
>> 2.   MSI window mappings
>>
>>    The more problematic question is how to deal with MSIs.  We need to
>>    create mappings for up to 3 MSI banks that a device may need to target
>>    to generate interrupts.  The Linux MSI driver can allocate MSIs from
>>    the 3 banks any way it wants, and currently user space has no way of
>>    knowing which bank may be used for a given device.
>>
>>    There are 3 options we have discussed and would like your direction:
>>
>>    A.  Implicit mappings -- with this approach user space would not
>>        explicitly map MSIs.  User space would be required to set the
>>        geometry so that there are 3 unused windows (the last 3 windows)
>
>
> Where does userspace get the number "3" from?  E.g. on newer chips there are
> 4 MSI banks.  Maybe future chips have even more.

Ok, then make the number 4.   The chance of more MSI banks in future chips
is nil, and if it ever happened user space could adjust.  Also,
practically speaking, since memory is typically allocated in powers of
2, you need to approximately double the window geometry anyway.

>>    B.  Explicit mapping using DMA map flags.  The idea is that a new
>>        flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that
>>        a mapping is to be created for the supplied iova.  No vaddr
>>        is given though.  So in the above example there would be a
>>        dma map at 0x10000000 for 24KB (and no vaddr).
>
>
> A single 24 KiB mapping wouldn't work (and why 24KB? What if only one MSI
> group is involved in this VFIO group?  What if four MSI groups are
> involved?).  You'd need to either have a naturally aligned, power-of-two
> sized mapping that covers exactly the pages you want to map and no more, or
> you'd need to create a separate mapping for each MSI bank, and due to PAMU
> subwindow alignment restrictions these mappings could not be contiguous in
> iova-space.

You're right, a single 24KB mapping wouldn't work--  in the case of 3 MSI banks
perhaps we could just do one 64MB*3 mapping to identify which windows
are used for MSIs.

If only one MSI bank was involved the kernel could get clever and only enable
the banks actually needed.

Stuart

* Re: RFC: vfio API changes needed for powerpc
From: Scott Wood @ 2013-04-02 20:47 UTC
  To: Stuart Yoder
  Cc: Yoder Stuart-B08248, Wood Scott-B07421, kvm, agraf, iommu,
	qemu-devel, Bhushan Bharat-R65777

On 04/02/2013 03:38:42 PM, Stuart Yoder wrote:
> On Tue, Apr 2, 2013 at 2:39 PM, Scott Wood <scottwood@freescale.com> wrote:
> > On 04/02/2013 12:32:00 PM, Yoder Stuart-B08248 wrote:
> >>
> >> Alex,
> >>
> >> We are in the process of implementing vfio-pci support for the Freescale
> >> IOMMU (PAMU).  It is an aperture/window-based IOMMU, quite different from
> >> x86, and will involve creating a 'type 2' vfio implementation.
> >>
> >> For each device's DMA mappings, PAMU has an overall aperture and a number
> >> of windows.  All sizes and window counts must be powers of 2.  To illustrate,
> >> below is a mapping for a 256MB guest, including guest memory (backed by
> >> 64MB huge pages) and some windows for MSIs:
> >>
> >>     Total aperture: 512MB
> >>     # of windows: 8
> >>
> >>     win gphys/
> >>     #   iova        phys          size
> >>     --- ----        ----          ----
> >>     0   0x00000000  0xX_XX000000  64MB
> >>     1   0x04000000  0xX_XX000000  64MB
> >>     2   0x08000000  0xX_XX000000  64MB
> >>     3   0x0C000000  0xX_XX000000  64MB
> >>     4   0x10000000  0xf_fe044000  4KB    // msi bank 1
> >>     5   0x14000000  0xf_fe045000  4KB    // msi bank 2
> >>     6   0x18000000  0xf_fe046000  4KB    // msi bank 3
> >>     7            -             -  disabled
> >>
> >> There are a couple of updates needed to the vfio user->kernel interface
> >> that we would like your feedback on.
> >>
> >> 1.  IOMMU geometry
> >>
> >>    The kernel IOMMU driver now has an interface (see domain_set_attr,
> >>    domain_get_attr) that lets us set the domain geometry using
> >>    "attributes".
> >>
> >>    We want to expose that to user space, so we envision needing a couple
> >>    of new ioctls to do this:
> >>         VFIO_IOMMU_SET_ATTR
> >>         VFIO_IOMMU_GET_ATTR
> >
> >
> > Note that this means attributes need to be updated for user-API
> > appropriateness, such as using fixed-size types.
> >
> >
> >> 2.   MSI window mappings
> >>
> >>    The more problematic question is how to deal with MSIs.  We need to
> >>    create mappings for up to 3 MSI banks that a device may need to target
> >>    to generate interrupts.  The Linux MSI driver can allocate MSIs from
> >>    the 3 banks any way it wants, and currently user space has no way of
> >>    knowing which bank may be used for a given device.
> >>
> >>    There are 3 options we have discussed and would like your direction:
> >>
> >>    A.  Implicit mappings -- with this approach user space would not
> >>        explicitly map MSIs.  User space would be required to set the
> >>        geometry so that there are 3 unused windows (the last 3 windows)
> >
> >
> > Where does userspace get the number "3" from?  E.g. on newer chips there are
> > 4 MSI banks.  Maybe future chips have even more.
> 
> Ok, then make the number 4.   The chance of more MSI banks in future chips
> is nil,

What makes you so sure?  Especially since you seem to be presenting  
this as not specifically an MPIC API.

> and if it ever happened user space could adjust.

What bit of API is going to tell it that it needs to adjust?

> Also, practically speaking, since memory is typically allocated in powers of
> 2, you need to approximately double the window geometry anyway.

Only if your existing mapping needs fit exactly in a power of two.

> >>    B.  Explicit mapping using DMA map flags.  The idea is that a new
> >>        flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that
> >>        a mapping is to be created for the supplied iova.  No vaddr
> >>        is given though.  So in the above example there would be a
> >>        dma map at 0x10000000 for 24KB (and no vaddr).
> >
> >
> > A single 24 KiB mapping wouldn't work (and why 24KB? What if only one MSI
> > group is involved in this VFIO group?  What if four MSI groups are
> > involved?).  You'd need to either have a naturally aligned, power-of-two
> > sized mapping that covers exactly the pages you want to map and no more, or
> > you'd need to create a separate mapping for each MSI bank, and due to PAMU
> > subwindow alignment restrictions these mappings could not be contiguous in
> > iova-space.
> 
> You're right, a single 24KB mapping wouldn't work--  in the case of 3 MSI banks
> perhaps we could just do one 64MB*3 mapping to identify which windows
> are used for MSIs.

Where did the assumption of a 64MiB subwindow size come from?

> If only one MSI bank was involved the kernel could get clever and only enable
> the banks actually needed.

I'd rather see cleverness kept in userspace.

-Scott

* Re: RFC: vfio API changes needed for powerpc
From: Stuart Yoder @ 2013-04-02 20:54 UTC
  To: Alex Williamson
  Cc: Wood Scott-B07421, kvm, agraf, qemu-devel, Yoder Stuart-B08248,
	iommu, Bhushan Bharat-R65777

On Tue, Apr 2, 2013 at 3:32 PM, Alex Williamson
<alex.williamson@redhat.com> wrote:
>> 2.   MSI window mappings
>>
>>    The more problematic question is how to deal with MSIs.  We need to
>>    create mappings for up to 3 MSI banks that a device may need to target
>>    to generate interrupts.  The Linux MSI driver can allocate MSIs from
>>    the 3 banks any way it wants, and currently user space has no way of
>>    knowing which bank may be used for a given device.
>>
>>    There are 3 options we have discussed and would like your direction:
>>
>>    A.  Implicit mappings -- with this approach user space would not
>>        explicitly map MSIs.  User space would be required to set the
>>        geometry so that there are 3 unused windows (the last 3 windows)
>>        for MSIs, and it would be up to the kernel to create the mappings.
>>        This approach requires some specific semantics (leaving 3 windows)
>>        and it potentially gets a little weird-- when should the kernel
>>        actually create the MSI mappings?  When should they be unmapped?
>>        Some convention would need to be established.
>
> VFIO would have control of SET/GET_ATTR, right?  So we could reduce the
> number exposed to userspace on GET and transparently add MSI entries on
> SET.

The number of windows is always a power of 2 (and the max is 256).  And to
reduce PAMU cache pressure you want to use the fewest windows you can.
So, I don't see practically how we could transparently steal entries to
add the MSIs.  Either user space knows to leave empty windows for
MSIs and by convention the kernel knows which windows those are (as
in option #A), or it explicitly tells the kernel which windows (as in
option #B).

> On x86 the interrupt remapper handles this transparently when MSI
> is enabled and userspace never gets direct access to the device MSI
> address/data registers.  What kind of restrictions do you have around
> adding and removing windows while the aperture is enabled?

The windows can be enabled/disabled even while the aperture is
enabled (pretty sure)...

>>    B.  Explicit mapping using DMA map flags.  The idea is that a new
>>        flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that
>>        a mapping is to be created for the supplied iova.  No vaddr
>>        is given though.  So in the above example there would be a
>>        dma map at 0x10000000 for 24KB (and no vaddr).   It's
>>        up to the kernel to determine which bank gets mapped where.
>>        So, this option puts user space in control of which windows
>>        are used for MSIs and when MSIs are mapped/unmapped.   There
>>        would need to be some semantics as to how this is used-- it
>>        only makes sense
>
> This could also be done as another "type2" ioctl extension.  What's the
> value to userspace in determining which windows are used by which banks?
> It sounds like the case that there are X banks and if userspace wants to
> use MSI it needs to leave X windows available for that.  Is this just
> buying userspace a few more windows to allow them the choice between MSI
> or RAM?

Yes, it would potentially give user space the flexibility of some more windows.
It also makes more explicit when the MSI mappings are created.  In option
#A the MSI mappings would probably get created at the time of the first
normal DMA map.

So, you're saying with this approach you'd rather see a new type 2
ioctl instead of adding new flags to DMA map, right?

>>    C.  Explicit mapping using normal DMA map.  The last idea is that
>>        we would introduce a new ioctl to give user-space an fd to
>>        the MSI bank, which could be mmapped.  The flow would be
>>        something like this:
>>           -for each group user space calls new ioctl VFIO_GROUP_GET_MSI_FD
>>           -user space mmaps the fd, getting a vaddr
>>           -user space does a normal DMA map for desired iova
>>        This approach makes everything explicit, but adds a new ioctl
>>        applicable most likely only to the PAMU (type2 iommu).
>
> And the DMA_MAP of that mmap then allows userspace to select the window
> used?  This one seems like a lot of overhead, adding a new ioctl, new
> fd, mmap, special mapping path, etc.  It would be less overhead to just
> add an ioctl to enable MSI, maybe letting userspace pick which windows
> get used, but I'm still not sure what the value is to userspace in
> exposing it.  Thanks,

Thanks,
Stuart

* Re: RFC: vfio API changes needed for powerpc
  2013-04-02 20:32     ` [Qemu-devel] " Alex Williamson
@ 2013-04-02 20:57         ` Scott Wood
  -1 siblings, 0 replies; 60+ messages in thread
From: Scott Wood @ 2013-04-02 20:57 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Wood Scott-B07421, kvm-u79uwXL29TY76Z2rM5mHXA, agraf-l3A5Bk7waGM,
	qemu-devel-qX2TKyscuCcdnm+yROfE0A, Yoder Stuart-B08248,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Bhushan Bharat-R65777

On 04/02/2013 03:32:17 PM, Alex Williamson wrote:
> On Tue, 2013-04-02 at 17:32 +0000, Yoder Stuart-B08248 wrote:
> > 2.   MSI window mappings
> >
> >    The more problematic question is how to deal with MSIs.  We need to
> >    create mappings for up to 3 MSI banks that a device may need to target
> >    to generate interrupts.  The Linux MSI driver can allocate MSIs from
> >    the 3 banks any way it wants, and currently user space has no way of
> >    knowing which bank may be used for a given device.
> >
> >    There are 3 options we have discussed and would like your direction:
> >
> >    A.  Implicit mappings -- with this approach user space would not
> >        explicitly map MSIs.  User space would be required to set the
> >        geometry so that there are 3 unused windows (the last 3 windows)
> >        for MSIs, and it would be up to the kernel to create the mappings.
> >        This approach requires some specific semantics (leaving 3 windows)
> >        and it potentially gets a little weird-- when should the kernel
> >        actually create the MSI mappings?  When should they be unmapped?
> >        Some convention would need to be established.
>
> VFIO would have control of SET/GET_ATTR, right?  So we could reduce the
> number exposed to userspace on GET and transparently add MSI entries on
> SET.

What do you mean by "reduce the number exposed"?  Userspace decides how  
many entries there are, but it must be a power of two between 1 and 256.

> On x86 the interrupt remapper handles this transparently when MSI
> is enabled and userspace never gets direct access to the device MSI
> address/data registers.

x86 has a totally different mechanism here, as far as I understand --  
even before you get into restrictions on mappings.

> What kind of restrictions do you have around
> adding and removing windows while the aperture is enabled?

Subwindows can be modified while the aperture is enabled, but the  
aperture size and number of subwindows cannot be changed.
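
To make that concrete, a rough sketch of driving a windowed domain through
the kernel IOMMU API of that era might look like the following
(iommu_domain_set_attr(), DOMAIN_ATTR_GEOMETRY, DOMAIN_ATTR_WINDOWS and
iommu_domain_window_enable() are the real interfaces; the function, the
values, and the abbreviated error handling are illustrative only):

    #include <linux/iommu.h>

    /* Rough sketch only; exact signatures varied a bit across
     * kernel versions.
     */
    static int pamu_domain_setup(struct iommu_domain *dom)
    {
            struct iommu_domain_geometry geom = {
                    .aperture_start = 0x00000000,
                    .aperture_end   = 0x1fffffff,   /* 512MB aperture */
                    .force_aperture = true,
            };
            u32 count = 8;  /* subwindows: power of 2, max 256 */
            int ret;

            /* Fixed for the life of the enabled aperture... */
            ret = iommu_domain_set_attr(dom, DOMAIN_ATTR_GEOMETRY, &geom);
            if (ret)
                    return ret;
            ret = iommu_domain_set_attr(dom, DOMAIN_ATTR_WINDOWS, &count);
            if (ret)
                    return ret;

            /* ...while individual subwindows can be toggled at any time
             * (here window 4 -> msi bank 1, per the example earlier in
             * the thread).
             */
            return iommu_domain_window_enable(dom, 4, 0xffe044000ULL, 4096);
    }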

> >    B.  Explicit mapping using DMA map flags.  The idea is that a new
> >        flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that
> >        a mapping is to be created for the supplied iova.  No vaddr
> >        is given though.  So in the above example there would be a
> >        a dma map at 0x10000000 for 24KB (and no vaddr).   It's
> >        up to the kernel to determine which bank gets mapped where.
> >        So, this option puts user space in control of which windows
> >        are used for MSIs and when MSIs are mapped/unmapped.   There
> >        would need to be some semantics as to how this is used-- it
> >        only makes sense
> 
> This could also be done as another "type2" ioctl extension.

Again, what is "type2", specifically?  If someone else is adding their  
own IOMMU that is kind of, sort of like PAMU, how would they know if  
it's close enough?  What assumptions can a user make when they see that  
they're dealing with "type2"?

> What's the value to userspace in determining which windows are used  
> by which banks?

That depends on who programs the MSI config space address.  What is  
important is userspace controlling which iovas will be dedicated to  
this, in case it wants to put something else there.

> It sounds like the case that there are X banks and if userspace wants to
> use MSI it needs to leave X windows available for that.  Is this just
> buying userspace a few more windows to allow them the choice between MSI
> or RAM?

Well, there could be that.  But also, userspace will generally have a  
much better idea of the type of mappings it's creating, so it's easier  
to keep everything explicit at the kernel/user interface than require  
more complicated code in the kernel to figure things out automatically  
(not just for MSIs but in general).

If the kernel automatically creates the MSI mappings, when does it  
assume that userspace is done creating its own?  What if userspace  
doesn't need any DMA other than the MSIs?  What if userspace wants to  
continue dynamically modifying its other mappings?

> >    C.  Explicit mapping using normal DMA map.  The last idea is that
> >        we would introduce a new ioctl to give user-space an fd to
> >        the MSI bank, which could be mmapped.  The flow would be
> >        something like this:
> >           -for each group user space calls new ioctl VFIO_GROUP_GET_MSI_FD
> >           -user space mmaps the fd, getting a vaddr
> >           -user space does a normal DMA map for desired iova
> >        This approach makes everything explicit, but adds a new ioctl
> >        applicable most likely only to the PAMU (type2 iommu).
> 
> And the DMA_MAP of that mmap then allows userspace to select the window
> used?  This one seems like a lot of overhead, adding a new ioctl, new
> fd, mmap, special mapping path, etc.

There's going to be special stuff no matter what.  This would keep it  
separated from the IOMMU map code.

I'm not sure what you mean by "overhead" here... the runtime overhead  
of setting things up is not particularly relevant as long as it's  
reasonable.  If you mean development and maintenance effort, keeping  
things well separated should help.

-Scott

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: RFC: vfio API changes needed for powerpc
  2013-04-02 20:47         ` [Qemu-devel] " Scott Wood
@ 2013-04-02 20:58           ` Stuart Yoder
  -1 siblings, 0 replies; 60+ messages in thread
From: Stuart Yoder @ 2013-04-02 20:58 UTC (permalink / raw)
  To: Scott Wood
  Cc: Wood Scott-B07421, kvm-u79uwXL29TY76Z2rM5mHXA,
	qemu-devel-qX2TKyscuCcdnm+yROfE0A, agraf-l3A5Bk7waGM,
	Yoder Stuart-B08248,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Bhushan Bharat-R65777

On Tue, Apr 2, 2013 at 3:47 PM, Scott Wood <scottwood-KZfg59tc24xl57MIdRCFDg@public.gmane.org> wrote:
> On 04/02/2013 03:38:42 PM, Stuart Yoder wrote:
>>
>> On Tue, Apr 2, 2013 at 2:39 PM, Scott Wood <scottwood-KZfg59tc24xl57MIdRCFDg@public.gmane.org>
>> wrote:
>> > On 04/02/2013 12:32:00 PM, Yoder Stuart-B08248 wrote:
>> >>
>> >> Alex,
>> >>
>> >> We are in the process of implementing vfio-pci support for the
>> >> Freescale
>> >> IOMMU (PAMU).  It is an aperture/window-based IOMMU and is quite
>> >> different
>> >> than x86, and will involve creating a 'type 2' vfio implementation.
>> >>
>> >> For each device's DMA mappings, PAMU has an overall aperture and a
>> >> number
>> >> of windows.  All sizes and window counts must be power of 2.  To
>> >> illustrate,
>> >> below is a mapping for a 256MB guest, including guest memory (backed by
>> >> 64MB huge pages) and some windows for MSIs:
>> >>
>> >>     Total aperture: 512MB
>> >>     # of windows: 8
>> >>
>> >>     win gphys/
>> >>     #   iova        phys          size
>> >>     --- ----        ----          ----
>> >>     0   0x00000000  0xX_XX000000  64MB
>> >>     1   0x04000000  0xX_XX000000  64MB
>> >>     2   0x08000000  0xX_XX000000  64MB
>> >>     3   0x0C000000  0xX_XX000000  64MB
>> >>     4   0x10000000  0xf_fe044000  4KB    // msi bank 1
>> >>     5   0x14000000  0xf_fe045000  4KB    // msi bank 2
>> >>     6   0x18000000  0xf_fe046000  4KB    // msi bank 3
>> >>     7            -             -  disabled
>> >>
>> >> There are a couple of updates needed to the vfio user->kernel interface
>> >> that we would like your feedback on.
>> >>
>> >> 1.  IOMMU geometry
>> >>
>> >>    The kernel IOMMU driver now has an interface (see domain_set_attr,
>> >>    domain_get_attr) that lets us set the domain geometry using
>> >>    "attributes".
>> >>
>> >>    We want to expose that to user space, so envision needing a couple
>> >>    of new ioctls to do this:
>> >>         VFIO_IOMMU_SET_ATTR
>> >>         VFIO_IOMMU_GET_ATTR
>> >
>> >
>> > Note that this means attributes need to be updated for user-API
>> > appropriateness, such as using fixed-size types.
>> >
>> >
>> >> 2.   MSI window mappings
>> >>
>> >>    The more problematic question is how to deal with MSIs.  We need to
>> >>    create mappings for up to 3 MSI banks that a device may need to
>> >> target
>> >>    to generate interrupts.  The Linux MSI driver can allocate MSIs from
>> >>    the 3 banks any way it wants, and currently user space has no way of
>> >>    knowing which bank may be used for a given device.
>> >>
>> >>    There are 3 options we have discussed and would like your direction:
>> >>
>> >>    A.  Implicit mappings -- with this approach user space would not
>> >>        explicitly map MSIs.  User space would be required to set the
>> >>        geometry so that there are 3 unused windows (the last 3 windows)
>> >
>> >
>> > Where does userspace get the number "3" from?  E.g. on newer chips
>> > there are 4 MSI banks.  Maybe future chips have even more.
>>
>> Ok, then make the number 4.   The chance of more MSI banks in future chips
>> is nil,
>
>
> What makes you so sure?  Especially since you seem to be presenting this as
> not specifically an MPIC API.
>
>
>> and if it ever happened user space could adjust.
>
>
> What bit of API is going to tell it that it needs to adjust?

Haven't thought through that completely, but I guess we could add an API
to return the number of MSI banks for type 2 iommus.

>> Also, practically speaking, since memory is typically allocated in powers
>> of 2, you need to approximately double the window geometry anyway.
>
>
> Only if your existing mapping needs fit exactly in a power of two.
>
>
>> >>    B.  Explicit mapping using DMA map flags.  The idea is that a new
>> >>        flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that
>> >>        a mapping is to be created for the supplied iova.  No vaddr
>> >>        is given though.  So in the above example there would be a
>> >>        a dma map at 0x10000000 for 24KB (and no vaddr).
>> >
>> >
>> > A single 24 KiB mapping wouldn't work (and why 24KB?  What if only one
>> > MSI group is involved in this VFIO group?  What if four MSI groups are
>> > involved?).  You'd need to either have a naturally aligned, power-of-two
>> > sized mapping that covers exactly the pages you want to map and no more,
>> > or you'd need to create a separate mapping for each MSI bank, and due to
>> > PAMU subwindow alignment restrictions these mappings could not be
>> > contiguous in iova-space.
>>
>> You're right, a single 24KB mapping wouldn't work-- in the case of 3 MSI
>> banks perhaps we could just do one 64MB*3 mapping to identify which
>> windows are used for MSIs.
>
>
> Where did the assumption of a 64MiB subwindow size come from?

It came from the example I was using.  User space would need to create a
mapping for window_size * msi_bank_count.
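
As a sketch of that arithmetic (the numbers come from the example above,
and the macro names are made up for illustration):

    /* Example numbers only: 512MB aperture split into 8 subwindows. */
    #define APERTURE_SZ  (512UL << 20)
    #define WINDOWS      8                            /* power of 2 */
    #define WINDOW_SZ    (APERTURE_SZ / WINDOWS)      /* 64MB */
    #define MSI_BANKS    3                            /* 4 on newer chips */
    #define MSI_MAP_SZ   (WINDOW_SZ * MSI_BANKS)      /* 192MB of iova */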

Stuart

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: RFC: vfio API changes needed for powerpc
  2013-04-02 20:57         ` [Qemu-devel] " Scott Wood
@ 2013-04-02 21:08           ` Stuart Yoder
  -1 siblings, 0 replies; 60+ messages in thread
From: Stuart Yoder @ 2013-04-02 21:08 UTC (permalink / raw)
  To: Scott Wood
  Cc: Wood Scott-B07421, kvm-u79uwXL29TY76Z2rM5mHXA, agraf-l3A5Bk7waGM,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	qemu-devel-qX2TKyscuCcdnm+yROfE0A, Yoder Stuart-B08248,
	Bhushan Bharat-R65777

On Tue, Apr 2, 2013 at 3:57 PM, Scott Wood <scottwood-KZfg59tc24xl57MIdRCFDg@public.gmane.org> wrote:
>> This could also be done as another "type2" ioctl extension.
>
>
> Again, what is "type2", specifically?  If someone else is adding their own
> IOMMU that is kind of, sort of like PAMU, how would they know if it's close
> enough?  What assumptions can a user make when they see that they're dealing
> with "type2"?

We will define that as part of the type2 implementation.   Highly unlikely
anything but a PAMU will comply.

>> What's the value to userspace in determining which windows are used by
>> which banks?
>
>
> That depends on who programs the MSI config space address.  What is
> important is userspace controlling which iovas will be dedicated to this, in
> case it wants to put something else there.
>
>
>> It sounds like the case that there are X banks and if userspace wants to
>> use MSI it needs to leave X windows available for that.  Is this just
>> buying userspace a few more windows to allow them the choice between MSI
>> or RAM?
>
>
> Well, there could be that.  But also, userspace will generally have a much
> better idea of the type of mappings it's creating, so it's easier to keep
> everything explicit at the kernel/user interface than require more
> complicated code in the kernel to figure things out automatically (not just
> for MSIs but in general).
>
> If the kernel automatically creates the MSI mappings, when does it assume
> that userspace is done creating its own?  What if userspace doesn't need any
> DMA other than the MSIs?  What if userspace wants to continue dynamically
> modifying its other mappings?
>
>
>> >    C.  Explicit mapping using normal DMA map.  The last idea is that
>> >        we would introduce a new ioctl to give user-space an fd to
>> >        the MSI bank, which could be mmapped.  The flow would be
>> >        something like this:
>> >           -for each group user space calls new ioctl VFIO_GROUP_GET_MSI_FD
>> >           -user space mmaps the fd, getting a vaddr
>> >           -user space does a normal DMA map for desired iova
>> >        This approach makes everything explicit, but adds a new ioctl
>> >        applicable most likely only to the PAMU (type2 iommu).
>>
>> And the DMA_MAP of that mmap then allows userspace to select the window
>> used?  This one seems like a lot of overhead, adding a new ioctl, new
>> fd, mmap, special mapping path, etc.
>
>
> There's going to be special stuff no matter what.  This would keep it
> separated from the IOMMU map code.
>
> I'm not sure what you mean by "overhead" here... the runtime overhead of
> setting things up is not particularly relevant as long as it's reasonable.
> If you mean development and maintenance effort, keeping things well
> separated should help.

We don't need to change DMA_MAP.  If we can simply add a new "type 2"
ioctl that allows user space to set which windows are MSIs, it seems vastly
less complex than an ioctl to supply a new fd, mmap of it, etc.

So maybe 2 ioctls (sketched below):
    VFIO_IOMMU_GET_MSI_COUNT
    VFIO_IOMMU_MAP_MSI(iova, size)
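
As a rough sketch of how that could look from userspace -- the two ioctl
names come from this proposal, while the request numbers and the struct
layout are invented for illustration:

    #include <sys/ioctl.h>
    #include <linux/types.h>
    #include <linux/vfio.h>

    /* Hypothetical additions; only the two names come from this thread. */
    #define VFIO_IOMMU_GET_MSI_COUNT  _IO(VFIO_TYPE, VFIO_BASE + 20)
    #define VFIO_IOMMU_MAP_MSI        _IO(VFIO_TYPE, VFIO_BASE + 21)

    struct vfio_iommu_msi_map {
            __u32 argsz;
            __u32 flags;
            __u64 iova;    /* start of the windows dedicated to MSIs */
            __u64 size;    /* window_size * msi_bank_count */
    };

    static int map_msis(int container_fd)
    {
            struct vfio_iommu_msi_map map = {
                    .argsz = sizeof(map),
                    .iova  = 0x10000000,       /* from the example mapping */
                    .size  = 3 * (64UL << 20), /* 3 banks, 64MB windows */
            };
            int banks = ioctl(container_fd, VFIO_IOMMU_GET_MSI_COUNT);

            if (banks <= 0)
                    return -1;
            return ioctl(container_fd, VFIO_IOMMU_MAP_MSI, &map);
    }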

Stuart

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: RFC: vfio API changes needed for powerpc
  2013-04-02 20:54         ` [Qemu-devel] " Stuart Yoder
@ 2013-04-02 21:16             ` Alex Williamson
  -1 siblings, 0 replies; 60+ messages in thread
From: Alex Williamson @ 2013-04-02 21:16 UTC (permalink / raw)
  To: Stuart Yoder
  Cc: Wood Scott-B07421, kvm-u79uwXL29TY76Z2rM5mHXA, agraf-l3A5Bk7waGM,
	qemu-devel-qX2TKyscuCcdnm+yROfE0A, Yoder Stuart-B08248,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Bhushan Bharat-R65777

On Tue, 2013-04-02 at 15:54 -0500, Stuart Yoder wrote:
> On Tue, Apr 2, 2013 at 3:32 PM, Alex Williamson
> <alex.williamson-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> >> 2.   MSI window mappings
> >>
> >>    The more problematic question is how to deal with MSIs.  We need to
> >>    create mappings for up to 3 MSI banks that a device may need to target
> >>    to generate interrupts.  The Linux MSI driver can allocate MSIs from
> >>    the 3 banks any way it wants, and currently user space has no way of
> >>    knowing which bank may be used for a given device.
> >>
> >>    There are 3 options we have discussed and would like your direction:
> >>
> >>    A.  Implicit mappings -- with this approach user space would not
> >>        explicitly map MSIs.  User space would be required to set the
> >>        geometry so that there are 3 unused windows (the last 3 windows)
> >>        for MSIs, and it would be up to the kernel to create the mappings.
> >>        This approach requires some specific semantics (leaving 3 windows)
> >>        and it potentially gets a little weird-- when should the kernel
> >>        actually create the MSI mappings?  When should they be unmapped?
> >>        Some convention would need to be established.
> >
> > VFIO would have control of SET/GET_ATTR, right?  So we could reduce the
> > number exposed to userspace on GET and transparently add MSI entries on
> > SET.
> 
> The number of windows is always a power of 2 (and the max is 256).  And
> to reduce PAMU cache pressure you want to use the fewest number of windows
> you can.  So, I don't see, practically, how we could transparently steal
> entries to add the MSIs.  Either user space knows to leave empty windows
> for MSIs and by convention the kernel knows which windows those are (as
> in option #A), or it explicitly tells the kernel which windows (as in
> option #B).

Ok, apparently I don't understand the API.  Is it something like
userspace calls GET_ATTR and finds out that there are 256 available
windows, userspace determines that it needs 8 for RAM and then it has an
MSI device, so it needs to call SET_ATTR and ask for 16?  That seems
prone to exploitation by the first userspace to allocate its aperture,
but I'm also not sure why userspace couldn't specify the (non-power-of-2)
number of windows it needs for RAM, and then VFIO would see that the
devices attached have MSI, add those windows, and align the total to a
power of 2.
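
For concreteness, that flow might look something like this from userspace.
VFIO_IOMMU_GET_ATTR/SET_ATTR are only proposed in this thread; the request
numbers, the attribute id, and the struct below are guesses:

    #include <sys/ioctl.h>
    #include <linux/types.h>
    #include <linux/vfio.h>

    /* Nothing here exists yet; it is a sketch of the proposed flow. */
    #define VFIO_IOMMU_GET_ATTR  _IO(VFIO_TYPE, VFIO_BASE + 22)
    #define VFIO_IOMMU_SET_ATTR  _IO(VFIO_TYPE, VFIO_BASE + 23)
    #define ATTR_WINDOWS         0        /* hypothetical attribute id */

    struct vfio_iommu_attr {
            __u32 argsz;
            __u32 attr;
            __u64 value;
    };

    static int size_aperture(int container_fd)
    {
            struct vfio_iommu_attr attr = {
                    .argsz = sizeof(attr),
                    .attr  = ATTR_WINDOWS,
            };

            /* Learn the limit (up to 256 windows)... */
            if (ioctl(container_fd, VFIO_IOMMU_GET_ATTR, &attr))
                    return -1;
            /* ...then ask for 8 RAM windows plus room for the MSI banks,
             * rounded up to a power of 2.
             */
            attr.value = 16;
            return ioctl(container_fd, VFIO_IOMMU_SET_ATTR, &attr);
    }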

> > On x86 the interrupt remapper handles this transparently when MSI
> > is enabled and userspace never gets direct access to the device MSI
> > address/data registers.  What kind of restrictions do you have around
> > adding and removing windows while the aperture is enabled?
> 
> The windows can be enabled/disabled even while the aperture is
> enabled (pretty sure)...
> 
> >>    B.  Explicit mapping using DMA map flags.  The idea is that a new
> >>        flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that
> >>        a mapping is to be created for the supplied iova.  No vaddr
> >>        is given though.  So in the above example there would be a
> >>        a dma map at 0x10000000 for 24KB (and no vaddr).   It's
> >>        up to the kernel to determine which bank gets mapped where.
> >>        So, this option puts user space in control of which windows
> >>        are used for MSIs and when MSIs are mapped/unmapped.   There
> >>        would need to be some semantics as to how this is used-- it
> >>        only makes sense
> >
> > This could also be done as another "type2" ioctl extension.  What's the
> > value to userspace in determining which windows are used by which banks?
> > It sounds like the case that there are X banks and if userspace wants to
> > use MSI it needs to leave X windows available for that.  Is this just
> > buying userspace a few more windows to allow them the choice between MSI
> > or RAM?
> 
> Yes, it would potentially give user space the flexibility of some more
> windows.  It also makes it more explicit when the MSI mappings are
> created.  In option
> #A the MSI mappings would probably get created at the time of the first
> normal DMA map.
> 
> So, you're saying with this approach you'd rather see a new type 2
> ioctl instead of adding new flags to DMA map, right?

I'm not sure I know enough yet to have a suggestion.  What would be the
purpose of userspace specifying the iova and size here?  If userspace
just needs to know that it needs X additional windows for MSI and can tell
the kernel to use banks 0 through (X-1) for MSI, that sounds more like
an ioctl interface than a DMA_MAP flag.  Thanks,

Alex

> >>    C.  Explicit mapping using normal DMA map.  The last idea is that
> >>        we would introduce a new ioctl to give user-space an fd to
> >>        the MSI bank, which could be mmapped.  The flow would be
> >>        something like this:
> >>           -for each group user space calls new ioctl VFIO_GROUP_GET_MSI_FD
> >>           -user space mmaps the fd, getting a vaddr
> >>           -user space does a normal DMA map for desired iova
> >>        This approach makes everything explicit, but adds a new ioctl
> >>        applicable most likely only to the PAMU (type2 iommu).
> >
> > And the DMA_MAP of that mmap then allows userspace to select the window
> > used?  This one seems like a lot of overhead, adding a new ioctl, new
> > fd, mmap, special mapping path, etc.  It would be less overhead to just
> > add an ioctl to enable MSI, maybe letting userspace pick which windows
> > get used, but I'm still not sure what the value is to userspace in
> > exposing it.  Thanks,
> 
> Thanks,
> Stuart

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: RFC: vfio API changes needed for powerpc
  2013-04-02 20:57         ` [Qemu-devel] " Scott Wood
@ 2013-04-02 21:32           ` Alex Williamson
  -1 siblings, 0 replies; 60+ messages in thread
From: Alex Williamson @ 2013-04-02 21:32 UTC (permalink / raw)
  To: Scott Wood
  Cc: Wood Scott-B07421, kvm-u79uwXL29TY76Z2rM5mHXA, agraf-l3A5Bk7waGM,
	qemu-devel-qX2TKyscuCcdnm+yROfE0A, Yoder Stuart-B08248,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Bhushan Bharat-R65777

On Tue, 2013-04-02 at 15:57 -0500, Scott Wood wrote:
> On 04/02/2013 03:32:17 PM, Alex Williamson wrote:
> > On Tue, 2013-04-02 at 17:32 +0000, Yoder Stuart-B08248 wrote:
> > > 2.   MSI window mappings
> > >
> > >    The more problematic question is how to deal with MSIs.  We need to
> > >    create mappings for up to 3 MSI banks that a device may need to target
> > >    to generate interrupts.  The Linux MSI driver can allocate MSIs from
> > >    the 3 banks any way it wants, and currently user space has no way of
> > >    knowing which bank may be used for a given device.
> > >
> > >    There are 3 options we have discussed and would like your direction:
> > >
> > >    A.  Implicit mappings -- with this approach user space would not
> > >        explicitly map MSIs.  User space would be required to set the
> > >        geometry so that there are 3 unused windows (the last 3 windows)
> > >        for MSIs, and it would be up to the kernel to create the mappings.
> > >        This approach requires some specific semantics (leaving 3 windows)
> > >        and it potentially gets a little weird-- when should the kernel
> > >        actually create the MSI mappings?  When should they be unmapped?
> > >        Some convention would need to be established.
> >
> > VFIO would have control of SET/GET_ATTR, right?  So we could reduce the
> > number exposed to userspace on GET and transparently add MSI entries on
> > SET.
> 
> What do you mean by "reduce the number exposed"?  Userspace decides how  
> many entries there are, but it must be a power of two between 1 and 256.

I didn't understand the API.

> > On x86 the interrupt remapper handles this transparently when MSI
> > is enabled and userspace never gets direct access to the device MSI
> > address/data registers.
> 
> x86 has a totally different mechanism here, as far as I understand --  
> even before you get into restrictions on mappings.

So what control will userspace have over programming the actual MSI
vectors on PAMU?

> > What kind of restrictions do you have around
> > adding and removing windows while the aperture is enabled?
> 
> Subwindows can be modified while the aperture is enabled, but the  
> aperture size and number of subwindows cannot be changed.
> 
> > >    B.  Explicit mapping using DMA map flags.  The idea is that a new
> > >        flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that
> > >        a mapping is to be created for the supplied iova.  No vaddr
> > >        is given though.  So in the above example there would be a
> > >        a dma map at 0x10000000 for 24KB (and no vaddr).   It's
> > >        up to the kernel to determine which bank gets mapped where.
> > >        So, this option puts user space in control of which windows
> > >        are used for MSIs and when MSIs are mapped/unmapped.   There
> > >        would need to be some semantics as to how this is used-- it
> > >        only makes sense
> > 
> > This could also be done as another "type2" ioctl extension.
> 
> Again, what is "type2", specifically?  If someone else is adding their  
> own IOMMU that is kind of, sort of like PAMU, how would they know if  
> it's close enough?  What assumptions can a user make when they see that  
> they're dealing with "type2"?

Naming always has and always will be a problem.  I assume this is named
type2 rather than PAMU because it's trying to expose a generic windowed
IOMMU fitting the IOMMU API.  Like type1, it doesn't really make sense
to name it "IOMMU API" because that's a kernel internal interface and
we're designing a userspace interface that just happens to use that.
Tagging it to a piece of hardware makes it less reusable.  Type1 is
arbitrary.  It might as well be named "brown" and this one can be
"blue".

> > What's the value to userspace in determining which windows are used  
> > by which banks?
> 
> That depends on who programs the MSI config space address.  What is  
> important is userspace controlling which iovas will be dedicated to  
> this, in case it wants to put something else there.

So userspace is programming the MSI vectors, targeting a user-programmed
iova?  But an iova selects a window and I thought there were some number
of MSI banks and we don't really know which ones we'll need...  still
confused.

> > It sounds like the case that there are X banks and if userspace wants to
> > use MSI it needs to leave X windows available for that.  Is this just
> > buying userspace a few more windows to allow them the choice between MSI
> > or RAM?
> 
> Well, there could be that.  But also, userspace will generally have a  
> much better idea of the type of mappings it's creating, so it's easier  
> to keep everything explicit at the kernel/user interface than require  
> more complicated code in the kernel to figure things out automatically  
> (not just for MSIs but in general).
> 
> If the kernel automatically creates the MSI mappings, when does it  
> assume that userspace is done creating its own?  What if userspace  
> doesn't need any DMA other than the MSIs?  What if userspace wants to  
> continue dynamically modifying its other mappings?

Yep, valid arguments.

> > >    C.  Explicit mapping using normal DMA map.  The last idea is that
> > >        we would introduce a new ioctl to give user-space an fd to
> > >        the MSI bank, which could be mmapped.  The flow would be
> > >        something like this:
> > >           -for each group user space calls new ioctl VFIO_GROUP_GET_MSI_FD
> > >           -user space mmaps the fd, getting a vaddr
> > >           -user space does a normal DMA map for desired iova
> > >        This approach makes everything explicit, but adds a new ioctl
> > >        applicable most likely only to the PAMU (type2 iommu).
> > 
> > And the DMA_MAP of that mmap then allows userspace to select the window
> > used?  This one seems like a lot of overhead, adding a new ioctl, new
> > fd, mmap, special mapping path, etc.
> 
> There's going to be special stuff no matter what.  This would keep it  
> separated from the IOMMU map code.
> 
> I'm not sure what you mean by "overhead" here... the runtime overhead  
> of setting things up is not particularly relevant as long as it's  
> reasonable.  If you mean development and maintenance effort, keeping  
> things well separated should help.

Overhead in terms of code required and complexity.  More things to
reference count and shut down in the proper order on userspace exit.
Thanks,

Alex

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] RFC: vfio API changes needed for powerpc
@ 2013-04-02 21:32           ` Alex Williamson
  0 siblings, 0 replies; 60+ messages in thread
From: Alex Williamson @ 2013-04-02 21:32 UTC (permalink / raw)
  To: Scott Wood
  Cc: Wood Scott-B07421, kvm, agraf, qemu-devel, Yoder Stuart-B08248,
	iommu, Bhushan Bharat-R65777

On Tue, 2013-04-02 at 15:57 -0500, Scott Wood wrote:
> On 04/02/2013 03:32:17 PM, Alex Williamson wrote:
> > On Tue, 2013-04-02 at 17:32 +0000, Yoder Stuart-B08248 wrote:
> > > 2.   MSI window mappings
> > >
> > >    The more problematic question is how to deal with MSIs.  We need  
> > to
> > >    create mappings for up to 3 MSI banks that a device may need to  
> > target
> > >    to generate interrupts.  The Linux MSI driver can allocate MSIs  
> > from
> > >    the 3 banks any way it wants, and currently user space has no  
> > way of
> > >    knowing which bank may be used for a given device.
> > >
> > >    There are 3 options we have discussed and would like your  
> > direction:
> > >
> > >    A.  Implicit mappings -- with this approach user space would not
> > >        explicitly map MSIs.  User space would be required to set the
> > >        geometry so that there are 3 unused windows (the last 3  
> > windows)
> > >        for MSIs, and it would be up to the kernel to create the  
> > mappings.
> > >        This approach requires some specific semantics (leaving 3  
> > windows)
> > >        and it potentially gets a little weird-- when should the  
> > kernel
> > >        actually create the MSI mappings?  When should they be  
> > unmapped?
> > >        Some convention would need to be established.
> > 
> > VFIO would have control of SET/GET_ATTR, right?  So we could reduce  
> > the
> > number exposed to userspace on GET and transparently add MSI entries  
> > on
> > SET.
> 
> What do you mean by "reduce the number exposed"?  Userspace decides how  
> many entries there are, but it must be a power of two beteen 1 and 256.

I didn't understand the API.

> > On x86 the interrupt remapper handles this transparently when MSI
> > is enabled and userspace never gets direct access to the device MSI
> > address/data registers.
> 
> x86 has a totally different mechanism here, as far as I understand --  
> even before you get into restrictions on mappings.

So what control will userspace have over programming the actually MSI
vectors on PAMU?

> > What kind of restrictions do you have around
> > adding and removing windows while the aperture is enabled?
> 
> Subwindows can be modified while the aperture is enabled, but the  
> aperture size and number of subwindows cannot be changed.
> 
> > >    B.  Explicit mapping using DMA map flags.  The idea is that a new
> > >        flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that
> > >        a mapping is to be created for the supplied iova.  No vaddr
> > >        is given though.  So in the above example there would be a
> > >        a dma map at 0x10000000 for 24KB (and no vaddr).   It's
> > >        up to the kernel to determine which bank gets mapped where.
> > >        So, this option puts user space in control of which windows
> > >        are used for MSIs and when MSIs are mapped/unmapped.   There
> > >        would need to be some semantics as to how this is used-- it
> > >        only makes sense
> > 
> > This could also be done as another "type2" ioctl extension.
> 
> Again, what is "type2", specifically?  If someone else is adding their  
> own IOMMU that is kind of, sort of like PAMU, how would they know if  
> it's close enough?  What assumptions can a user make when they see that  
> they're dealing with "type2"?

Naming always has and always will be a problem.  I assume this is named
type2 rather than PAMU because it's trying to expose a generic windowed
IOMMU fitting the IOMMU API.  Like type1, it doesn't really make sense
to name it "IOMMU API" because that's a kernel internal interface and
we're designing a userspace interface that just happens to use that.
Tagging it to a piece of hardware makes it less reusable.  Type1 is
arbitrary.  It might as well be named "brown" and this one can be
"blue".

> > What's the value to userspace in determining which windows are used  
> > by which banks?
> 
> That depends on who programs the MSI config space address.  What is  
> important is userspace controlling which iovas will be dedicated to  
> this, in case it wants to put something else there.

So userspace is programming the MSI vectors, targeting a user programmed
iova?  But an iova selects a window and I thought there were some number
of MSI banks and we don't really know which ones we'll need...  still
confused.

> > It sounds like the case that there are X banks and if userspace wants  
> > to
> > use MSI it needs to leave X windows available for that.  Is this just
> > buying userspace a few more windows to allow them the choice between  
> > MSI
> > or RAM?
> 
> Well, there could be that.  But also, userspace will generally have a  
> much better idea of the type of mappings it's creating, so it's easier  
> to keep everything explicit at the kernel/user interface than require  
> more complicated code in the kernel to figure things out automatically  
> (not just for MSIs but in general).
> 
> If the kernel automatically creates the MSI mappings, when does it  
> assume that userspace is done creating its own?  What if userspace  
> doesn't need any DMA other than the MSIs?  What if userspace wants to  
> continue dynamically modifying its other mappings?

Yep, valid arguments.

> > >    C.  Explicit mapping using normal DMA map.  The last idea is that
> > >        we would introduce a new ioctl to give user-space an fd to
> > >        the MSI bank, which could be mmapped.  The flow would be
> > >        something like this:
> > >           -for each group user space calls new ioctl
> > > VFIO_GROUP_GET_MSI_FD
> > >           -user space mmaps the fd, getting a vaddr
> > >           -user space does a normal DMA map for desired iova
> > >        This approach makes everything explicit, but adds a new ioctl
> > >        applicable most likely only to the PAMU (type2 iommu).
> > 
> > And the DMA_MAP of that mmap then allows userspace to select the  
> > window
> > used?  This one seems like a lot of overhead, adding a new ioctl, new
> > fd, mmap, special mapping path, etc.
> 
> There's going to be special stuff no matter what.  This would keep it  
> separated from the IOMMU map code.
> 
> I'm not sure what you mean by "overhead" here... the runtime overhead  
> of setting things up is not particularly relevant as long as it's  
> reasonable.  If you mean development and maintenance effort, keeping  
> things well separated should help.

Overhead in terms of code required and complexity.  More things to
reference count and shut down in the proper order on userspace exit.
Thanks,

Alex
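
For illustration, a minimal userspace sketch of the option C flow being
debated above.  VFIO_GROUP_GET_MSI_FD and its request number are
hypothetical names from the proposal, not an existing API; the rest uses
the stock type1 DMA map structures, and error handling is omitted:

    /* Sketch only: VFIO_GROUP_GET_MSI_FD is hypothetical. */
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <linux/vfio.h>

    #define VFIO_GROUP_GET_MSI_FD _IO(VFIO_TYPE, VFIO_BASE + 40)

    static int map_msi_bank(int group_fd, int container_fd, uint64_t iova)
    {
        struct vfio_iommu_type1_dma_map map = { .argsz = sizeof(map) };
        void *vaddr;
        int msi_fd;

        /* 1. Obtain an fd representing the group's MSI bank. */
        msi_fd = ioctl(group_fd, VFIO_GROUP_GET_MSI_FD);

        /* 2. mmap it to get a vaddr, as with a BAR region. */
        vaddr = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                     MAP_SHARED, msi_fd, 0);

        /* 3. DMA-map that vaddr at the iova (PAMU window slot)
         *    userspace has reserved for MSIs. */
        map.vaddr = (uint64_t)(uintptr_t)vaddr;
        map.iova  = iova;
        map.size  = 4096;
        map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
        return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
    }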

* Re: [Qemu-devel] RFC: vfio API changes needed for powerpc
@ 2013-04-02 21:38               ` Alex Williamson
  0 siblings, 0 replies; 60+ messages in thread
From: Alex Williamson @ 2013-04-02 21:38 UTC (permalink / raw)
  To: Stuart Yoder
  Cc: Wood Scott-B07421, kvm, agraf, qemu-devel, Yoder Stuart-B08248,
	iommu, Bhushan Bharat-R65777, Scott Wood

On Tue, 2013-04-02 at 16:08 -0500, Stuart Yoder wrote:
> On Tue, Apr 2, 2013 at 3:57 PM, Scott Wood <scottwood@freescale.com> wrote:
> >> >    C.  Explicit mapping using normal DMA map.  The last idea is that
> >> >        we would introduce a new ioctl to give user-space an fd to
> >> >        the MSI bank, which could be mmapped.  The flow would be
> >> >        something like this:
> >> >           -for each group user space calls new ioctl
> >> > VFIO_GROUP_GET_MSI_FD
> >> >           -user space mmaps the fd, getting a vaddr
> >> >           -user space does a normal DMA map for desired iova
> >> >        This approach makes everything explicit, but adds a new ioctl
> >> >        applicable most likely only to the PAMU (type2 iommu).
> >>
> >> And the DMA_MAP of that mmap then allows userspace to select the window
> >> used?  This one seems like a lot of overhead, adding a new ioctl, new
> >> fd, mmap, special mapping path, etc.
> >
> >
> > There's going to be special stuff no matter what.  This would keep it
> > separated from the IOMMU map code.
> >
> > I'm not sure what you mean by "overhead" here... the runtime overhead of
> > setting things up is not particularly relevant as long as it's reasonable.
> > If you mean development and maintenance effort, keeping things well
> > separated should help.
> 
> We don't need to change DMA_MAP.  If we can simply add a new "type 2"
> ioctl that allows user space to set which windows are MSIs, it seems vastly
> less complex than an ioctl to supply a new fd, mmap of it, etc.
> 
> So maybe 2 ioctls:
>     VFIO_IOMMU_GET_MSI_COUNT
>     VFIO_IOMMU_MAP_MSI(iova, size)
> 

How are MSIs related to devices on PAMU?  On x86 MSI count is very
device specific, which means it would be a VFIO_DEVICE_* ioctl (actually
VFIO_DEVICE_GET_IRQ_INFO does this for us on x86).  The trouble with it
being a device ioctl is that you need to get the device FD, but the
IOMMU protection needs to be established before you can get that... so
there's an ordering problem if you need it from the device before
configuring the IOMMU.  Thanks,

Alex
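
For illustration, one way the two ioctls Stuart floats above might be
declared.  Both are hypothetical -- neither exists in the VFIO API --
and the request numbers are placeholders:

    #include <linux/types.h>
    #include <linux/vfio.h>

    /* Hypothetical: how many MSI banks the devices behind this
     * container may need mapped. */
    #define VFIO_IOMMU_GET_MSI_COUNT _IO(VFIO_TYPE, VFIO_BASE + 41)

    /* Hypothetical: ask the kernel to map MSI bank(s) somewhere in a
     * userspace-chosen iova range; the kernel picks which bank lands
     * where within it. */
    struct vfio_iommu_msi_map {
        __u32 argsz;
        __u32 flags;
        __u64 iova;   /* in: window base chosen by userspace */
        __u64 size;   /* in: room reserved for MSI mappings  */
    };
    #define VFIO_IOMMU_MAP_MSI _IO(VFIO_TYPE, VFIO_BASE + 42)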

* Re: [Qemu-devel] RFC: vfio API changes needed for powerpc
@ 2013-04-02 21:55               ` Scott Wood
  0 siblings, 0 replies; 60+ messages in thread
From: Scott Wood @ 2013-04-02 21:55 UTC (permalink / raw)
  To: Stuart Yoder
  Cc: Wood Scott-B07421, kvm, agraf, iommu, qemu-devel,
	Yoder Stuart-B08248, Alex Williamson, Bhushan Bharat-R65777

On 04/02/2013 04:08:27 PM, Stuart Yoder wrote:
> On Tue, Apr 2, 2013 at 3:57 PM, Scott Wood <scottwood@freescale.com>  
> wrote:
> >> This could also be done as another "type2" ioctl extension.
> >
> >
> > Again, what is "type2", specifically?  If someone else is adding  
> their own
> > IOMMU that is kind of, sort of like PAMU, how would they know if  
> it's close
> > enough?  What assumptions can a user make when they see that  
> they're dealing
> > with "type2"?
> 
> We will define that as part of the type2 implementation.   Highly  
> unlikely
> anything but a PAMU will comply.

So then why not just call it "pamu" instead of being obfuscatory?

> > There's going to be special stuff no matter what.  This would keep  
> it
> > separated from the IOMMU map code.
> >
> > I'm not sure what you mean by "overhead" here... the runtime  
> overhead of
> > setting things up is not particularly relevant as long as it's  
> reasonable.
> > If you mean development and maintenance effort, keeping things well
> > separated should help.
> 
> We don't need to change DMA_MAP.  If we can simply add a new "type 2"
> ioctl that allows user space to set which windows are MSIs,

And what specifically does that ioctl do?  It causes new mappings to be  
created, right?  So you're changing (or at least adding to) the DMA map  
mechanism.

> it seems vastly less complex than an ioctl to supply a new fd, mmap  
> of it, etc.

I don't see enough complexity in the mmap approach for anything to be  
"vastly less complex" in comparison.  I think you're building the mmap  
approach up in your head to be a lot worse than it would actually be.

-Scott

* Re: [Qemu-devel] RFC: vfio API changes needed for powerpc
@ 2013-04-02 22:13                 ` Scott Wood
  0 siblings, 0 replies; 60+ messages in thread
From: Scott Wood @ 2013-04-02 22:13 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Wood Scott-B07421, kvm, Stuart Yoder, agraf, qemu-devel,
	Yoder Stuart-B08248, iommu, Bhushan Bharat-R65777

On 04/02/2013 04:16:11 PM, Alex Williamson wrote:
> On Tue, 2013-04-02 at 15:54 -0500, Stuart Yoder wrote:
> > The number of windows is always power of 2 (and max is 256).  And  
> to reduce
> > PAMU cache pressure you want to use the fewest number of windows
> > you can.    So, I don't see practically how we could transparently
> > steal entries to
> > add the MSIs.     Either user space knows to leave empty windows for
> > MSIs and by convention the kernel knows which windows those are (as
> > in option #A) or explicitly tell the kernel which windows (as in  
> option #B).
> 
> Ok, apparently I don't understand the API.  Is it something like
> userspace calls GET_ATTR and finds out that there are 256 available
> windows, userspace determines that it needs 8 for RAM and then it has  
> an
> MSI device, so it needs to call SET_ATTR and ask for 16?  That seems
> prone to exploitation by the first userspace to allocate its
> aperture,

What exploitation?

It's not as if there is a pool of 256 global windows that users  
allocate from.  The subwindow count is just how finely divided the  
aperture is.  The only way one user will affect another is through  
cache contention (which is why we want the minimum number of subwindows  
that we can get away with).

> but I'm also not sure why userspace couldn't specify the (non-power of 2)
> number of windows it needs for RAM, then VFIO would see that the  
> devices
> attached have MSI and add those windows and align to a power of 2.

If you double the subwindow count without userspace knowing, you have  
to double the aperture as well (and you may need to grow up or down  
depending on alignment).  This means you also need to halve the maximum  
aperture that userspace can request.  And you need to expose a  
different number of maximum subwindows in the IOMMU API based on  
whether we might have MSIs of this type.  It's ugly and awkward, and  
removes the possibility for userspace to place the MSIs in some unused  
slot in the middle, or not use MSIs at all.

-Scott
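
A worked example of the constraint Scott describes, using the numbers
from earlier in the thread: a 512MB aperture split into 8 subwindows
gives 64MB per subwindow.  If VFIO silently doubled the count to 16
while preserving the 64MB granularity userspace sized its RAM mappings
for, the aperture would have to grow to 1GB (and be realigned to the
new power-of-2 boundary), which in turn means the maximum aperture
userspace may request has to be halved up front to keep that headroom
available.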

* Re: [Qemu-devel] RFC: vfio API changes needed for powerpc
@ 2013-04-02 22:44               ` Scott Wood
  0 siblings, 0 replies; 60+ messages in thread
From: Scott Wood @ 2013-04-02 22:44 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Wood Scott-B07421, kvm, agraf, qemu-devel, Yoder Stuart-B08248,
	iommu, Bhushan Bharat-R65777

On 04/02/2013 04:32:04 PM, Alex Williamson wrote:
> On Tue, 2013-04-02 at 15:57 -0500, Scott Wood wrote:
> > On 04/02/2013 03:32:17 PM, Alex Williamson wrote:
> > > On x86 the interrupt remapper handles this transparently when MSI
> > > is enabled and userspace never gets direct access to the device  
> MSI
> > > address/data registers.
> >
> > x86 has a totally different mechanism here, as far as I understand  
> --
> > even before you get into restrictions on mappings.
> 
> So what control will userspace have over programming the actual MSI
> vectors on PAMU?

Not sure what you mean -- PAMU doesn't get explicitly involved in  
MSIs.  It's just another 4K page mapping (per relevant MSI bank).  If  
you want isolation, you need to make sure that an MSI group is only  
used by one VFIO group, and that you're on a chip that has alias pages  
with just one MSI bank register each (newer chips do, but the first  
chip to have a PAMU didn't).

> > > This could also be done as another "type2" ioctl extension.
> >
> > Again, what is "type2", specifically?  If someone else is adding  
> their
> > own IOMMU that is kind of, sort of like PAMU, how would they know if
> > it's close enough?  What assumptions can a user make when they see  
> that
> > they're dealing with "type2"?
> 
> Naming always has and always will be a problem.  I assume this is  
> named
> type2 rather than PAMU because it's trying to expose a generic  
> windowed
> IOMMU fitting the IOMMU API.

But how closely is the MSI situation related to a generic windowed  
IOMMU, then?  We could just as well have a highly flexible IOMMU in  
terms of arbitrary 4K page mappings, but still handle MSIs as pages to  
be mapped rather than a translation table.  Or we could have a windowed  
IOMMU that has an MSI translation table.

> Like type1, it doesn't really make sense
> to name it "IOMMU API" because that's a kernel internal interface and
> we're designing a userspace interface that just happens to use that.
> Tagging it to a piece of hardware makes it less reusable.

Well, that's my point.  Is it reusable at all, anyway?  If not, then  
giving it a more obscure name won't change that.  If it is reusable,  
then where is the line drawn between things that are PAMU-specific or  
MPIC-specific and things that are part of the "generic windowed IOMMU"  
abstraction?

>  Type1 is arbitrary.  It might as well be named "brown" and this one  
> can be
> "blue".

The difference is that "type1" seems to refer to hardware that can do  
arbitrary 4K page mappings, possibly constrained by an aperture but  
nothing else.  More than one IOMMU can reasonably fit that.  The odds  
that another IOMMU would have exactly the same restrictions as PAMU  
seem smaller in comparison.

In any case, if you had to deal with some Intel-only quirk, would it  
make sense to call it a "type1 attribute"?  I'm not advocating one way  
or the other on whether an abstraction is viable here (though Stuart  
seems to think it's "highly unlikely anything but a PAMU will comply"),  
just that if it is to be abstracted rather than a hardware-specific  
interface, we need to document what is and is not part of the  
abstraction.  Otherwise a non-PAMU-specific user won't know what they  
can rely on, and someone adding support for a new windowed IOMMU won't  
know if theirs is close enough, or they need to introduce a "type3".

> > > What's the value to userspace in determining which windows are  
> used
> > > by which banks?
> >
> > That depends on who programs the MSI config space address.  What is
> > important is userspace controlling which iovas will be dedicated to
> > this, in case it wants to put something else there.
> 
> So userspace is programming the MSI vectors, targeting a user  
> programmed
> iova?  But an iova selects a window and I thought there were some  
> number
> of MSI banks and we don't really know which ones we'll need...  still
> confused.

Userspace would also need a way to find out the page offset and data  
value.  That may be an argument in favor of having the two ioctls  
Stuart later suggested (get MSI count, and map MSI).  Would there be  
any complication in the VFIO code from tracking a mapping that doesn't  
have a userspace virtual address associated with it?

> > There's going to be special stuff no matter what.  This would keep  
> it
> > separated from the IOMMU map code.
> >
> > I'm not sure what you mean by "overhead" here... the runtime  
> overhead
> > of setting things up is not particularly relevant as long as it's
> > reasonable.  If you mean development and maintenance effort, keeping
> > things well separated should help.
> 
> Overhead in terms of code required and complexity.  More things to
> reference count and shut down in the proper order on userspace exit.
> Thanks,

That didn't stop others from having me convert the KVM device control  
API to use file descriptors instead of something more ad-hoc with a  
better-defined destruction order. :-)

I don't know if it necessarily needs to be a separate fd -- it could be  
just another device resource like BARs, with some way for userspace to  
tell if the page is shared by multiple devices in the group (e.g. make  
the physical address visible).

-Scott
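
On the "page offset and data value" point, a hypothetical discovery
ioctl could look something like the sketch below.  This is purely
illustrative -- the struct, fields, and request number are invented
here, beyond the get-count/map pair actually mentioned in the thread:

    #include <linux/types.h>
    #include <linux/vfio.h>

    /* Hypothetical: report where a mapped MSI bank's register page
     * sits and what data value targets it. */
    struct vfio_iommu_msi_bank_info {
        __u32 argsz;
        __u32 flags;
        __u32 bank;         /* in:  which MSI bank                 */
        __u32 data;         /* out: MSI data value for this bank   */
        __u64 page_offset;  /* out: offset of the 4K register page */
    };
    #define VFIO_IOMMU_GET_MSI_BANK_INFO _IO(VFIO_TYPE, VFIO_BASE + 43)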

* Re: [Qemu-devel] RFC: vfio API changes needed for powerpc
@ 2013-04-02 22:50                   ` Scott Wood
  0 siblings, 0 replies; 60+ messages in thread
From: Scott Wood @ 2013-04-02 22:50 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Wood Scott-B07421, kvm, Stuart Yoder, qemu-devel, agraf,
	Yoder Stuart-B08248, iommu, Bhushan Bharat-R65777

On 04/02/2013 04:38:45 PM, Alex Williamson wrote:
> On Tue, 2013-04-02 at 16:08 -0500, Stuart Yoder wrote:
> > On Tue, Apr 2, 2013 at 3:57 PM, Scott Wood  
> <scottwood@freescale.com> wrote:
> > >> >    C.  Explicit mapping using normal DMA map.  The last idea  
> is that
> > >> >        we would introduce a new ioctl to give user-space an fd  
> to
> > >> >        the MSI bank, which could be mmapped.  The flow would be
> > >> >        something like this:
> > >> >           -for each group user space calls new ioctl
> > >> > VFIO_GROUP_GET_MSI_FD
> > >> >           -user space mmaps the fd, getting a vaddr
> > >> >           -user space does a normal DMA map for desired iova
> > >> >        This approach makes everything explicit, but adds a new  
> ioctl
> > >> >        applicable most likely only to the PAMU (type2 iommu).
> > >>
> > >> And the DMA_MAP of that mmap then allows userspace to select the  
> window
> > >> used?  This one seems like a lot of overhead, adding a new  
> ioctl, new
> > >> fd, mmap, special mapping path, etc.
> > >
> > >
> > > There's going to be special stuff no matter what.  This would  
> keep it
> > > separated from the IOMMU map code.
> > >
> > > I'm not sure what you mean by "overhead" here... the runtime  
> overhead of
> > > setting things up is not particularly relevant as long as it's  
> reasonable.
> > > If you mean development and maintenance effort, keeping things  
> well
> > > separated should help.
> >
> > We don't need to change DMA_MAP.  If we can simply add a new "type  
> 2"
> > ioctl that allows user space to set which windows are MSIs, it  
> seems vastly
> > less complex than an ioctl to supply a new fd, mmap of it, etc.
> >
> > So maybe 2 ioctls:
> >     VFIO_IOMMU_GET_MSI_COUNT

Do you mean a count of actual MSIs or a count of MSI banks used by the  
whole VFIO group?

> >     VFIO_IOMMU_MAP_MSI(iova, size)

Not sure how you mean "size" to be used -- for MPIC it would be 4K per  
bank, and you can only map one bank at a time (which bank you're  
mapping should be a parameter, if only so that the kernel doesn't have  
to keep iteration state for you).

> How are MSIs related to devices on PAMU?

PAMU doesn't care about MSIs.  The relation of individual MSIs to a  
device is standard PCI stuff.  Each MSI bank (which is part of the  
MPIC, not PAMU) can hold numerous MSIs.  The VFIO user would want to  
map all MSI banks that are in use by any of the devices in the group.   
Ideally we'd let the VFIO grouping influence the allocation of MSIs.

> On x86 MSI count is very
> device specific, which means it would be a VFIO_DEVICE_* ioctl
> (actually
> VFIO_DEVICE_GET_IRQ_INFO does this for us on x86).  The trouble with  
> it
> being a device ioctl is that you need to get the device FD, but the
> IOMMU protection needs to be established before you can get that... so
> there's an ordering problem if you need it from the device before
> configuring the IOMMU.  Thanks,

What do you mean by "IOMMU protection needs to be established"?   
Wouldn't we just start with no mappings in place?

-Scott
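
Folding in Scott's corrections -- 4K per bank, one bank per call, the
bank as an explicit parameter -- the map ioctl would presumably look
more like this (hypothetical sketch, not a real interface):

    #include <linux/types.h>
    #include <linux/vfio.h>

    /* Hypothetical: map a single MSI bank; the caller names the bank
     * so the kernel keeps no iteration state on its behalf. */
    struct vfio_iommu_msi_bank_map {
        __u32 argsz;
        __u32 flags;
        __u32 bank;   /* in: which 4K MSI bank to map */
        __u64 iova;   /* in: window base to map it at */
    };
    #define VFIO_IOMMU_MAP_MSI_BANK _IO(VFIO_TYPE, VFIO_BASE + 44)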

* Re: [Qemu-devel] RFC: vfio API changes needed for powerpc
@ 2013-04-03  2:54                   ` Alex Williamson
  0 siblings, 0 replies; 60+ messages in thread
From: Alex Williamson @ 2013-04-03  2:54 UTC (permalink / raw)
  To: Scott Wood
  Cc: Wood Scott-B07421, kvm, Stuart Yoder, agraf, qemu-devel,
	Yoder Stuart-B08248, iommu, Bhushan Bharat-R65777

On Tue, 2013-04-02 at 17:13 -0500, Scott Wood wrote:
> On 04/02/2013 04:16:11 PM, Alex Williamson wrote:
> > On Tue, 2013-04-02 at 15:54 -0500, Stuart Yoder wrote:
> > > The number of windows is always power of 2 (and max is 256).  And  
> > to reduce
> > > PAMU cache pressure you want to use the fewest number of windows
> > > you can.    So, I don't see practically how we could transparently
> > > steal entries to
> > > add the MSIs.     Either user space knows to leave empty windows for
> > > MSIs and by convention the kernel knows which windows those are (as
> > > in option #A) or explicitly tell the kernel which windows (as in  
> > option #B).
> > 
> > Ok, apparently I don't understand the API.  Is it something like
> > userspace calls GET_ATTR and finds out that there are 256 available
> > windows, userspace determines that it needs 8 for RAM and then it has  
> > an
> > MSI device, so it needs to call SET_ATTR and ask for 16?  That seems
> > prone to exploitation by the first userspace to allocate its
> > aperture,
> 
> What exploitation?
> 
> It's not as if there is a pool of 256 global windows that users  
> allocate from.  The subwindow count is just how finely divided the  
> aperture is.  The only way one user will affect another is through  
> cache contention (which is why we want the minimum number of subwindows  
> that we can get away with).
> 
> > but I'm also not sure why userspace couldn't specify the (non-power of 2)
> > number of windows it needs for RAM, then VFIO would see that the  
> > devices
> > attached have MSI and add those windows and align to a power of 2.
> 
> If you double the subwindow count without userspace knowing, you have  
> to double the aperture as well (and you may need to grow up or down  
> depending on alignment).  This means you also need to halve the maximum  
> aperture that userspace can request.  And you need to expose a  
> different number of maximum subwindows in the IOMMU API based on  
> whether we might have MSIs of this type.  It's ugly and awkward, and  
> removes the possibility for userspace to place the MSIs in some unused  
> slot in the middle, or not use MSIs at all.

Ok, I missed this in Stuart's example:

    Total aperture: 512MB
    # of windows: 8

    win gphys/
    #   iova        phys          size
    --- ----        ----          ----
    0   0x00000000  0xX_XX000000  64MB
    1   0x04000000  0xX_XX000000  64MB
    2   0x08000000  0xX_XX000000  64MB
    3   0x0C000000  0xX_XX000000  64MB
    4   0x10000000  0xf_fe044000  4KB    // msi bank 1
          ^^
    5   0x14000000  0xf_fe045000  4KB    // msi bank 2
          ^^
    6   0x18000000  0xf_fe046000  4KB    // msi bank 3
          ^^
    7            -             -  disabled

So even though the MSI banks are 4k in this example, they're still on
64MB boundaries.  If userspace were to leave this as 256 windows, each
would be 2MB and we'd use 128 of them to map the same memory as these
4x64MB windows and thrash the iotlb harder.  The picture is becoming
clearer.  Thanks,

Alex

* Re: [Qemu-devel] RFC: vfio API changes needed for powerpc
@ 2013-04-03  3:12                 ` Alex Williamson
  0 siblings, 0 replies; 60+ messages in thread
From: Alex Williamson @ 2013-04-03  3:12 UTC (permalink / raw)
  To: Scott Wood
  Cc: Wood Scott-B07421, kvm, agraf, qemu-devel, Yoder Stuart-B08248,
	iommu, Bhushan Bharat-R65777

On Tue, 2013-04-02 at 17:44 -0500, Scott Wood wrote:
> On 04/02/2013 04:32:04 PM, Alex Williamson wrote:
> > On Tue, 2013-04-02 at 15:57 -0500, Scott Wood wrote:
> > > On 04/02/2013 03:32:17 PM, Alex Williamson wrote:
> > > > On x86 the interrupt remapper handles this transparently when MSI
> > > > is enabled and userspace never gets direct access to the device  
> > MSI
> > > > address/data registers.
> > >
> > > x86 has a totally different mechanism here, as far as I understand  
> > --
> > > even before you get into restrictions on mappings.
> > 
> > So what control will userspace have over programming the actually MSI
> > vectors on PAMU?
> 
> Not sure what you mean -- PAMU doesn't get explicitly involved in  
> MSIs.  It's just another 4K page mapping (per relevant MSI bank).  If  
> you want isolation, you need to make sure that an MSI group is only  
> used by one VFIO group, and that you're on a chip that has alias pages  
> with just one MSI bank register each (newer chips do, but the first  
> chip to have a PAMU didn't).

How does a user figure this out?

> > > > This could also be done as another "type2" ioctl extension.
> > >
> > > Again, what is "type2", specifically?  If someone else is adding  
> > their
> > > own IOMMU that is kind of, sort of like PAMU, how would they know if
> > > it's close enough?  What assumptions can a user make when they see  
> > that
> > > they're dealing with "type2"?
> > 
> > Naming always has and always will be a problem.  I assume this is  
> > named
> > type2 rather than PAMU because it's trying to expose a generic  
> > windowed
> > IOMMU fitting the IOMMU API.
> 
> But how closely is the MSI situation related to a generic windowed  
> IOMMU, then?  We could just as well have a highly flexible IOMMU in  
> terms of arbitrary 4K page mappings, but still handle MSIs as pages to  
> be mapped rather than a translation table.  Or we could have a windowed  
> IOMMU that has an MSI translation table.
> 
> > Like type1, it doesn't really make sense
> > to name it "IOMMU API" because that's a kernel internal interface and
> > we're designing a userspace interface that just happens to use that.
> > Tagging it to a piece of hardware makes it less reusable.
> 
> Well, that's my point.  Is it reusable at all, anyway?  If not, then  
> giving it a more obscure name won't change that.  If it is reusable,  
> then where is the line drawn between things that are PAMU-specific or  
> MPIC-specific and things that are part of the "generic windowed IOMMU"  
> abstraction?
> 
> >  Type1 is arbitrary.  It might as well be named "brown" and this one  
> > can be
> > "blue".
> 
> The difference is that "type1" seems to refer to hardware that can do  
> arbitrary 4K page mappings, possibly constrained by an aperture but  
> nothing else.  More than one IOMMU can reasonably fit that.  The odds  
> that another IOMMU would have exactly the same restrictions as PAMU  
> seem smaller in comparison.
> 
> In any case, if you had to deal with some Intel-only quirk, would it  
> make sense to call it a "type1 attribute"?  I'm not advocating one way  
> or the other on whether an abstraction is viable here (though Stuart  
> seems to think it's "highly unlikely anything but a PAMU will comply"),  
> just that if it is to be abstracted rather than a hardware-specific  
> interface, we need to document what is and is not part of the  
> abstraction.  Otherwise a non-PAMU-specific user won't know what they  
> can rely on, and someone adding support for a new windowed IOMMU won't  
> know if theirs is close enough, or they need to introduce a "type3".

So Alexey named the SPAPR IOMMU something related to spapr...
surprisingly enough.  I'm fine with that.  If you think it's unique
enough, name it something appropriately.  I haven't seen the code and
don't know the architecture sufficiently to have an opinion.

> > > > What's the value to userspace in determining which windows are  
> > used
> > > > by which banks?
> > >
> > > That depends on who programs the MSI config space address.  What is
> > > important is userspace controlling which iovas will be dedicated to
> > > this, in case it wants to put something else there.
> > 
> > So userspace is programming the MSI vectors, targeting a user  
> > programmed
> > iova?  But an iova selects a window and I thought there were some  
> > number
> > of MSI banks and we don't really know which ones we'll need...  still
> > confused.
> 
> Userspace would also need a way to find out the page offset and data  
> value.  That may be an argument in favor of having the two ioctls  
> Stuart later suggested (get MSI count, and map MSI).

Connecting the user-set iova and the host-kernel-assigned irq number is
where I'm still lost, but I'll follow up with that question in the other
thread.

> Would there be  
> any complication in the VFIO code from tracking a mapping that doesn't  
> have a userspace virtual address associated with it?

Only the VFIO iommu driver tracks mappings; the QEMU userspace component
doesn't (it relies on the memory API for type1), nor does any of the
kernel framework code.

> > > There's going to be special stuff no matter what.  This would keep  
> > it
> > > separated from the IOMMU map code.
> > >
> > > I'm not sure what you mean by "overhead" here... the runtime  
> > overhead
> > > of setting things up is not particularly relevant as long as it's
> > > reasonable.  If you mean development and maintenance effort, keeping
> > > things well separated should help.
> > 
> > Overhead in terms of code required and complexity.  More things to
> > reference count and shut down in the proper order on userspace exit.
> > Thanks,
> 
> That didn't stop others from having me convert the KVM device control  
> API to use file descriptors instead of something more ad-hoc with a  
> better-defined destruction order. :-)
> 
> I don't know if it necessarily needs to be a separate fd -- it could be  
> just another device resource like BARs, with some way for userspace to  
> tell if the page is shared by multiple devices in the group (e.g. make  
> the physical address visible).

That was my first thought when I read option C.  The downside is that
resources are attached to a device and these MSI banks are potentially
associated with multiple devices.  Thanks,

Alex

* Re: RFC: vfio API changes needed for powerpc
  2013-04-02 22:50                   ` [Qemu-devel] " Scott Wood
@ 2013-04-03  3:37                     ` Alex Williamson
  0 siblings, 0 replies; 60+ messages in thread
From: Alex Williamson @ 2013-04-03  3:37 UTC (permalink / raw)
  To: Scott Wood
  Cc: Wood Scott-B07421, kvm, qemu-devel, agraf, Yoder Stuart-B08248,
	iommu, Bhushan Bharat-R65777

On Tue, 2013-04-02 at 17:50 -0500, Scott Wood wrote:
> On 04/02/2013 04:38:45 PM, Alex Williamson wrote:
> > On Tue, 2013-04-02 at 16:08 -0500, Stuart Yoder wrote:
> > > On Tue, Apr 2, 2013 at 3:57 PM, Scott Wood  
> <scottwood@freescale.com> wrote:
> > > >> >    C.  Explicit mapping using normal DMA map.  The last idea  
> > is that
> > > >> >        we would introduce a new ioctl to give user-space an fd  
> > to
> > > >> >        the MSI bank, which could be mmapped.  The flow would be
> > > >> >        something like this:
> > > >> >           -for each group user space calls new ioctl
> > > >> > VFIO_GROUP_GET_MSI_FD
> > > >> >           -user space mmaps the fd, getting a vaddr
> > > >> >           -user space does a normal DMA map for desired iova
> > > >> >        This approach makes everything explicit, but adds a new  
> > ioctl
> > > >> >        applicable most likely only to the PAMU (type2 iommu).
> > > >>
> > > >> And the DMA_MAP of that mmap then allows userspace to select the  
> > window
> > > >> used?  This one seems like a lot of overhead, adding a new  
> > ioctl, new
> > > >> fd, mmap, special mapping path, etc.
> > > >
> > > >
> > > > There's going to be special stuff no matter what.  This would  
> > keep it
> > > > separated from the IOMMU map code.
> > > >
> > > > I'm not sure what you mean by "overhead" here... the runtime  
> > overhead of
> > > > setting things up is not particularly relevant as long as it's  
> > reasonable.
> > > > If you mean development and maintenance effort, keeping things  
> > well
> > > > separated should help.
> > >
> > > We don't need to change DMA_MAP.  If we can simply add a new "type  
> > 2"
> > > ioctl that allows user space to set which windows are MSIs, it  
> > seems vastly
> > > less complex than an ioctl to supply a new fd, mmap of it, etc.
> > >
> > > So maybe 2 ioctls:
> > >     VFIO_IOMMU_GET_MSI_COUNT
> 
> Do you mean a count of actual MSIs or a count of MSI banks used by the  
> whole VFIO group?

I hope the latter, which would clarify how this is distinct from
DEVICE_GET_IRQ_INFO.  Is hotplug even on the table?  Presumably
dynamically adding a device could bring along additional MSI banks?

> > >     VFIO_IOMMU_MAP_MSI(iova, size)
> 
> Not sure how you mean "size" to be used -- for MPIC it would be 4K per  
> bank, and you can only map one bank at a time (which bank you're  
> mapping should be a parameter, if only so that the kernel doesn't have  
> to keep iteration state for you).
> 
> > How are MSIs related to devices on PAMU?
> 
> PAMU doesn't care about MSIs.  The relation of individual MSIs to a  
> device is standard PCI stuff.  Each MSI bank (which is part of the  
> MPIC, not PAMU) can hold numerous MSIs.  The VFIO user would want to  
> map all MSI banks that are in use by any of the devices in the group.   
> Ideally we'd let the VFIO grouping influence the allocation of MSIs.

The current VFIO MSI support has the host handling everything about MSI.
The user never programs an MSI vector to the physical device, they set
up everything through ioctl.  On interrupt, we simply trigger an eventfd
and leave it to things like KVM irqfd or QEMU to do the right thing in a
virtual machine.

Here the MSI vector has to go through a PAMU window to hit the correct
MSI bank.  So that means it has some component of the iova involved,
which we're proposing here is controlled by userspace (whether that
vector uses an offset from 0x10000000 or 0x00000000 depending on which
window slot is used to map the MSI bank).  I assume we're still working
in a model where the physical interrupt fires into the host and a
host-based interrupt handler triggers an eventfd, right?  So that means
the vector also has host components so we trigger the correct ISR.  How
is that coordinated?

Would it be possible for userspace to simply leave room for MSI bank
mapping (how much room could be determined by something like
VFIO_IOMMU_GET_MSI_BANK_COUNT), then document in the API that userspace
can DMA_MAP starting at the 0x0 address of the aperture, growing up, and
VFIO will map banks on demand at the top of the aperture, growing down?
Wouldn't that avoid a lot of issues with userspace needing to know
anything about MSI banks (other than count) and coordinating irq numbers
and enabling handlers?
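
For illustration only (numbers assumed here, not part of the proposal):
with a 512MB aperture split into 8 windows of 64MB and 3 MSI banks,
userspace would DMA_MAP guest memory at iovas 0x00000000, 0x04000000,
... growing up, while VFIO would place the banks at 0x1C000000,
0x18000000 and 0x14000000, growing down from the top of the aperture.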

> > On x86 MSI count is very
> > device specific, which means it would be a VFIO_DEVICE_* ioctl
> > (actually
> > VFIO_DEVICE_GET_IRQ_INFO does this for us on x86).  The trouble with  
> > it
> > being a device ioctl is that you need to get the device FD, but the
> > IOMMU protection needs to be established before you can get that... so
> > there's an ordering problem if you need it from the device before
> > configuring the IOMMU.  Thanks,
> 
> What do you mean by "IOMMU protection needs to be established"?   
> Wouldn't we just start with no mappings in place?

If having no mappings blocks all DMA, sure, that's fine.  Once the VFIO device
FD is accessible by userspace we have to protect the host against DMA.
If any IOMMU_SET_ATTR calls temporarily disable DMA protection, that
could be exploitable.  Thanks,

Alex

* Re: RFC: vfio API changes needed for powerpc
  2013-04-03  3:12                 ` [Qemu-devel] " Alex Williamson
@ 2013-04-03 18:25                   ` Stuart Yoder
  -1 siblings, 0 replies; 60+ messages in thread
From: Stuart Yoder @ 2013-04-03 18:25 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Scott Wood, Wood Scott-B07421, kvm, agraf, qemu-devel,
	Yoder Stuart-B08248, iommu, Bhushan Bharat-R65777

>> >  Type1 is arbitrary.  It might as well be named "brown" and this one
>> > can be
>> > "blue".
>>
>> The difference is that "type1" seems to refer to hardware that can do
>> arbitrary 4K page mappings, possibly constrained by an aperture but
>> nothing else.  More than one IOMMU can reasonably fit that.  The odds
>> that another IOMMU would have exactly the same restrictions as PAMU
>> seem smaller in comparison.
>>
>> In any case, if you had to deal with some Intel-only quirk, would it
>> make sense to call it a "type1 attribute"?  I'm not advocating one way
>> or the other on whether an abstraction is viable here (though Stuart
>> seems to think it's "highly unlikely anything but a PAMU will comply"),
>> just that if it is to be abstracted rather than a hardware-specific
>> interface, we need to document what is and is not part of the
>> abstraction.  Otherwise a non-PAMU-specific user won't know what they
>> can rely on, and someone adding support for a new windowed IOMMU won't
>> know if theirs is close enough, or they need to introduce a "type3".
>
> So Alexey named the SPAPR IOMMU something related to spapr...
> surprisingly enough.  I'm fine with that.  If you think it's unique
> enough, name it something appropriately.  I haven't seen the code and
> don't know the architecture sufficiently to have an opinion.

The only reason I suggested "type 2" is that I thought that was the
convention...we would enumerate different iommus.   I think that
calling it "pamu" is better and clearer.

Stuart

* Re: RFC: vfio API changes needed for powerpc
  2013-04-02 22:50                   ` [Qemu-devel] " Scott Wood
@ 2013-04-03 18:32                     ` Stuart Yoder
  -1 siblings, 0 replies; 60+ messages in thread
From: Stuart Yoder @ 2013-04-03 18:32 UTC (permalink / raw)
  To: Scott Wood
  Cc: Wood Scott-B07421, kvm, agraf, iommu, qemu-devel,
	Yoder Stuart-B08248, Alex Williamson, Bhushan Bharat-R65777

On Tue, Apr 2, 2013 at 5:50 PM, Scott Wood <scottwood@freescale.com> wrote:
> On 04/02/2013 04:38:45 PM, Alex Williamson wrote:
>>
>> On Tue, 2013-04-02 at 16:08 -0500, Stuart Yoder wrote:
>> > On Tue, Apr 2, 2013 at 3:57 PM, Scott Wood <scottwood@freescale.com>
>> > wrote:
>> > >> >    C.  Explicit mapping using normal DMA map.  The last idea is
>> > >> > that
>> > >> >        we would introduce a new ioctl to give user-space an fd to
>> > >> >        the MSI bank, which could be mmapped.  The flow would be
>> > >> >        something like this:
>> > >> >           -for each group user space calls new ioctl
>> > >> > VFIO_GROUP_GET_MSI_FD
>> > >> >           -user space mmaps the fd, getting a vaddr
>> > >> >           -user space does a normal DMA map for desired iova
>> > >> >        This approach makes everything explicit, but adds a new
>> > >> > ioctl
>> > >> >        applicable most likely only to the PAMU (type2 iommu).
>> > >>
>> > >> And the DMA_MAP of that mmap then allows userspace to select the
>> > >> window
>> > >> used?  This one seems like a lot of overhead, adding a new ioctl, new
>> > >> fd, mmap, special mapping path, etc.
>> > >
>> > >
>> > > There's going to be special stuff no matter what.  This would keep it
>> > > separated from the IOMMU map code.
>> > >
>> > > I'm not sure what you mean by "overhead" here... the runtime overhead
>> > > of
>> > > setting things up is not particularly relevant as long as it's
>> > > reasonable.
>> > > If you mean development and maintenance effort, keeping things well
>> > > separated should help.
>> >
>> > We don't need to change DMA_MAP.  If we can simply add a new "type 2"
>> > ioctl that allows user space to set which windows are MSIs, it seems
>> > vastly
>> > less complex than an ioctl to supply a new fd, mmap of it, etc.
>> >
>> > So maybe 2 ioctls:
>> >     VFIO_IOMMU_GET_MSI_COUNT
>
>
> Do you mean a count of actual MSIs or a count of MSI banks used by the whole
> VFIO group?

I meant # of MSI banks, so VFIO_IOMMU_GET_MSI_BANK_COUNT would be better.

>> >     VFIO_IOMMU_MAP_MSI(iova, size)
>
>
> Not sure how you mean "size" to be used -- for MPIC it would be 4K per bank,
> and you can only map one bank at a time (which bank you're mapping should be
> a parameter, if only so that the kernel doesn't have to keep iteration state
> for you).

The intent was for user space to tell the kernel which windows to use
for MSI.   So I envisioned a total size of window-size * msi-bank-count.
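
(A worked example with assumed numbers: with 64MB subwindows and 3 MSI
banks, that would be one VFIO_IOMMU_MAP_MSI(iova, size) call with
size = 3 * 64MB = 0x0C000000, e.g. covering iova 0x10000000 through
0x1BFFFFFF.)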

Stuart

* Re: RFC: vfio API changes needed for powerpc
  2013-04-03 18:32                     ` [Qemu-devel] " Stuart Yoder
@ 2013-04-03 18:39                       ` Scott Wood
  -1 siblings, 0 replies; 60+ messages in thread
From: Scott Wood @ 2013-04-03 18:39 UTC (permalink / raw)
  To: Stuart Yoder
  Cc: Alex Williamson, Wood Scott-B07421, kvm, agraf, qemu-devel,
	Yoder Stuart-B08248, iommu, Bhushan Bharat-R65777

On 04/03/2013 01:32:26 PM, Stuart Yoder wrote:
> On Tue, Apr 2, 2013 at 5:50 PM, Scott Wood <scottwood@freescale.com>  
> wrote:
> > On 04/02/2013 04:38:45 PM, Alex Williamson wrote:
> >>
> >> On Tue, 2013-04-02 at 16:08 -0500, Stuart Yoder wrote:
> >> >     VFIO_IOMMU_MAP_MSI(iova, size)
> >
> >
> > Not sure how you mean "size" to be used -- for MPIC it would be 4K  
> per bank,
> > and you can only map one bank at a time (which bank you're mapping  
> should be
> > a parameter, if only so that the kernel doesn't have to keep  
> iteration state
> > for you).
> 
> The intent was for user space to tell the kernel which windows to use
> for MSI.   So I envisioned a total size of window-size *  
> msi-bank-count.

Size doesn't tell the kernel *which* banks to use, only how many.  If  
it already knows which banks are used by the group, then it also knows  
how many are used.  And size is misleading because the mapping is not  
generally going to be contiguous.

-Scott

* Re: RFC: vfio API changes needed for powerpc
  2013-04-03  3:37                     ` [Qemu-devel] " Alex Williamson
@ 2013-04-03 19:09                       ` Stuart Yoder
  -1 siblings, 0 replies; 60+ messages in thread
From: Stuart Yoder @ 2013-04-03 19:09 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Scott Wood, Wood Scott-B07421, kvm, agraf, qemu-devel,
	Yoder Stuart-B08248, iommu, Bhushan Bharat-R65777

> Would it be possible for userspace to simply leave room for MSI bank
> mapping (how much room could be determined by something like
> VFIO_IOMMU_GET_MSI_BANK_COUNT) then document the API that userspace can
> DMA_MAP starting at the 0x0 address of the aperture, growing up, and
> VFIO will map banks on demand at the top of the aperture, growing down?
> Wouldn't that avoid a lot of issues with userspace needing to know
> anything about MSI banks (other than count) and coordinating irq numbers
> and enabling handlers?

This is basically option #A in the original proposals sent.   I like
this approach, in that it is simpler and keeps user space mostly out
of this...which is consistent with how things are done on x86.  User
space just needs to know how many windows to leave at the top of the
aperture.  The kernel then has the flexibility to use those windows
how it wants.

But one question is when the kernel should actually map (and unmap)
the MSI banks.   One thing we need to do is enable the aperture...and
current thinking is that that is done on the first DMA_MAP.   Similarly,
when the last mapping is unmapped we could also unmap the MSI banks.

Sequence would be something like:

        VFIO_GROUP_SET_CONTAINER     // add groups to the container

        VFIO_SET_IOMMU(VFIO_FSL_PAMU)    // set iommu model

        cnt = VFIO_IOMMU_GET_MSI_BANK_COUNT    // returns max # of MSI banks

        VFIO_IOMMU_SET_ATTR(ATTR_GEOMETRY) // set overall aperture

        VFIO_IOMMU_SET_ATTR(ATTR_WINDOWS) // set # of windows, including MSI banks

        VFIO_IOMMU_MAP_DMA    // map the guest's memory
               ---> kernel enables aperture and maps needed MSI banks here

        VFIO_DEVICE_SET_IRQS
               ---> kernel sets actual MSI addr/data in physical device here (I think)
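
Roughly, in C (my sketch only: the container/group calls and MAP_DMA are
the existing VFIO uapi, while VFIO_FSL_PAMU, VFIO_IOMMU_SET_ATTR and
VFIO_IOMMU_GET_MSI_BANK_COUNT are just the names proposed in this
thread; struct pamu_attr and the constant values are made up here so
the sketch compiles):

    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <linux/vfio.h>

    /* Placeholders for the proposed PAMU interface -- values invented */
    #define VFIO_FSL_PAMU                  7
    #define VFIO_IOMMU_SET_ATTR            _IO(VFIO_TYPE, VFIO_BASE + 20)
    #define VFIO_IOMMU_GET_MSI_BANK_COUNT  _IO(VFIO_TYPE, VFIO_BASE + 21)
    struct pamu_attr { uint32_t attribute; uint64_t val[2]; };
    enum { ATTR_GEOMETRY, ATTR_WINDOWS };

    int main(void)
    {
            int container = open("/dev/vfio/vfio", O_RDWR);
            int group = open("/dev/vfio/26", O_RDWR); /* group 26: example */

            ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
            ioctl(container, VFIO_SET_IOMMU, VFIO_FSL_PAMU);

            int banks = ioctl(container, VFIO_IOMMU_GET_MSI_BANK_COUNT);

            /* aperture 0..512MB, 8 windows, some of them left for MSIs */
            struct pamu_attr geom = { ATTR_GEOMETRY, { 0, 512ULL << 20 } };
            ioctl(container, VFIO_IOMMU_SET_ATTR, &geom);
            struct pamu_attr wins = { ATTR_WINDOWS, { 8, 0 } };
            ioctl(container, VFIO_IOMMU_SET_ATTR, &wins);

            /* map 256MB of guest memory at iova 0; per the sequence
               above, the kernel would enable the aperture here */
            void *mem = mmap(NULL, 256UL << 20, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            struct vfio_iommu_type1_dma_map map = {
                    .argsz = sizeof(map),
                    .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
                    .vaddr = (uint64_t)(uintptr_t)mem,
                    .iova  = 0,
                    .size  = 256UL << 20,
            };
            ioctl(container, VFIO_IOMMU_MAP_DMA, &map);

            /* VFIO_DEVICE_SET_IRQS on the device fd would then program
               the MSI addr/data; error handling omitted throughout */
            (void)banks;
            return 0;
    }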


Stuart

* Re: RFC: vfio API changes needed for powerpc
  2013-04-03 19:09                       ` [Qemu-devel] " Stuart Yoder
@ 2013-04-03 19:18                           ` Scott Wood
  -1 siblings, 0 replies; 60+ messages in thread
From: Scott Wood @ 2013-04-03 19:18 UTC (permalink / raw)
  To: Stuart Yoder
  Cc: Wood Scott-B07421, kvm, agraf, iommu, qemu-devel,
	Yoder Stuart-B08248, Alex Williamson, Bhushan Bharat-R65777

On 04/03/2013 02:09:45 PM, Stuart Yoder wrote:
> > Would it be possible for userspace to simply leave room for MSI bank
> > mapping (how much room could be determined by something like
> > VFIO_IOMMU_GET_MSI_BANK_COUNT) then document the API that userspace  
> can
> > DMA_MAP starting at the 0x0 address of the aperture, growing up, and
> > VFIO will map banks on demand at the top of the aperture, growing  
> down?
> > Wouldn't that avoid a lot of issues with userspace needing to know
> > anything about MSI banks (other than count) and coordinating irq  
> numbers
> > and enabling handlers?
> 
> This is basically option #A in the original proposals sent.   I like
> this approach, in that it
> is simpler and keeps user space mostly out of this...which is
> consistent with how things are done
> on x86.  User space just needs to know how many windows to leave at
> the top of the aperture.
> The kernel then has the flexibility to use those windows how it wants.
> 
> But one question, is when should the kernel actually map (and unmap)
> the MSI banks.

I think userspace should explicitly request it.  Userspace still  
wouldn't need to know anything but the count:

count = VFIO_IOMMU_GET_MSI_BANK_COUNT
VFIO_IOMMU_SET_ATTR(ATTR_GEOMETRY)
VFIO_IOMMU_SET_ATTR(ATTR_WINDOWS)
// do other DMA maps now, or later, or not at all, doesn't matter
for (i = 0; i < count; i++)
	VFIO_IOMMU_MAP_MSI_BANK(iova, i);
// The kernel now knows where each bank has been mapped, and can
// update PCI config space appropriately.
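
For the iova, one possibility (purely illustrative, following the
pseudocode above and reusing the top-of-aperture placement discussed
earlier, with 64MB subwindows in a 512MB aperture assumed) would be:

	/* place bank i in the i-th subwindow below the aperture top */
	uint64_t aperture_end = 512ULL << 20;
	uint64_t win_size = 64ULL << 20;
	for (i = 0; i < count; i++)
		VFIO_IOMMU_MAP_MSI_BANK(aperture_end - (i + 1) * win_size, i);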

> One thing we need to do is enable the aperture...and current
> thinking is that is done on the first DMA_MAP.

What if there are no other mappings required?

-Scott

* Re: RFC: vfio API changes needed for powerpc
  2013-04-03 19:09                       ` [Qemu-devel] " Stuart Yoder
@ 2013-04-03 19:23                           ` Alex Williamson
  -1 siblings, 0 replies; 60+ messages in thread
From: Alex Williamson @ 2013-04-03 19:23 UTC (permalink / raw)
  To: Stuart Yoder
  Cc: Wood Scott-B07421, kvm, agraf, qemu-devel, Yoder Stuart-B08248,
	iommu, Bhushan Bharat-R65777, Scott Wood

On Wed, 2013-04-03 at 14:09 -0500, Stuart Yoder wrote:
> > Would it be possible for userspace to simply leave room for MSI bank
> > mapping (how much room could be determined by something like
> > VFIO_IOMMU_GET_MSI_BANK_COUNT) then document the API that userspace can
> > DMA_MAP starting at the 0x0 address of the aperture, growing up, and
> > VFIO will map banks on demand at the top of the aperture, growing down?
> > Wouldn't that avoid a lot of issues with userspace needing to know
> > anything about MSI banks (other than count) and coordinating irq numbers
> > and enabling handlers?
> 
> This is basically option #A in the original proposals sent.   I like
> this approach, in that it
> is simpler and keeps user space mostly out of this...which is
> consistent with how things are done
> on x86.  User space just needs to know how many windows to leave at
> the top of the aperture.
> The kernel then has the flexibility to use those windows how it wants.
> 
> But one question, is when should the kernel actually map (and unmap)
> the MSI banks.   One thing we need to do is enable the aperture...and current
> thinking is that is done on the first DMA_MAP.   Similarly when the last mapping
> is unmapped we could also unmap the MSI banks.
> 
> Sequence would be something like:
> 
>         VFIO_GROUP_SET_CONTAINER     // add groups to the container
> 
>         VFIO_SET_IOMMU(VFIO_FSL_PAMU)    // set iommu model
> 
>         cnt = VFIO_IOMMU_GET_MSI_BANK_COUNT    // returns max # of MSI banks
> 
>         VFIO_IOMMU_SET_ATTR(ATTR_GEOMETRY) // set overall aperture
> 
>         VFIO_IOMMU_SET_ATTR(ATTR_WINDOWS) // set # of windows,
> including MSI banks
> 
>         VFIO_IOMMU_MAP_DMA    // map the guest's memory
>                ---> kernel enables aperture and maps needed MSI banks here
> 
>         VFIO_DEVICE_SET_IRQS
>                ---> kernel sets actual MSI addr/data in physical
> device here (I think)

You could also make use of the IOMMU_ENABLE/DISABLE entry points that
Alexey plans to use.  Ideally I'd think that you'd want to enable the
required MSI banks for a device on DEVICE_SET_IRQs.  That's effectively
what happens on x86.  Perhaps some information stored in the domain
structure would let architecture hooks in MSI setup enable those
mappings for you?  Thanks,

Alex

* Re: RFC: vfio API changes needed for powerpc
  2013-04-03 19:09                       ` [Qemu-devel] " Stuart Yoder
@ 2013-04-03 19:26                         ` Scott Wood
  -1 siblings, 0 replies; 60+ messages in thread
From: Scott Wood @ 2013-04-03 19:26 UTC (permalink / raw)
  To: Stuart Yoder
  Cc: Alex Williamson, Wood Scott-B07421, kvm, agraf, qemu-devel,
	Yoder Stuart-B08248, iommu, Bhushan Bharat-R65777

On 04/03/2013 02:09:45 PM, Stuart Yoder wrote:
> > Would it be possible for userspace to simply leave room for MSI bank
> > mapping (how much room could be determined by something like
> > VFIO_IOMMU_GET_MSI_BANK_COUNT) then document the API that userspace  
> can
> > DMA_MAP starting at the 0x0 address of the aperture, growing up, and
> > VFIO will map banks on demand at the top of the aperture, growing  
> down?
> > Wouldn't that avoid a lot of issues with userspace needing to know
> > anything about MSI banks (other than count) and coordinating irq  
> numbers
> > and enabling handlers?
> 
> This is basically option #A in the original proposals sent.   I like
> this approach, in that it
> is simpler and keeps user space mostly out of this...which is
> consistent with how things are done
> on x86.  User space just needs to know how many windows to leave at
> the top of the aperture.
> The kernel then has the flexibility to use those windows how it wants.
> 
> But one question, is when should the kernel actually map (and unmap)
> the MSI banks.   One thing we need to do is enable the aperture...and  
> current
> thinking is that is done on the first DMA_MAP.   Similarly when the  
> last mapping
> is unmapped we could also unmap the MSI banks.
> 
> Sequence would be something like:
> 
>         VFIO_GROUP_SET_CONTAINER     // add groups to the container
> 
>         VFIO_SET_IOMMU(VFIO_FSL_PAMU)    // set iommu model
> 
>         cnt = VFIO_IOMMU_GET_MSI_BANK_COUNT    // returns max # of  
> MSI banks
> 
>         VFIO_IOMMU_SET_ATTR(ATTR_GEOMETRY) // set overall aperture
> 
>         VFIO_IOMMU_SET_ATTR(ATTR_WINDOWS) // set # of windows,
> including MSI banks
> 
>         VFIO_IOMMU_MAP_DMA    // map the guest's memory
>                ---> kernel enables aperture and maps needed MSI banks  
> here

Maps them where?

What if there is more than one explicit DMA mapping?  What if DMA  
mappings are changed during operation?

-Scott

* Re: RFC: vfio API changes needed for powerpc
  2013-04-03 19:18                           ` [Qemu-devel] " Scott Wood
@ 2013-04-03 19:43                             ` Stuart Yoder
  -1 siblings, 0 replies; 60+ messages in thread
From: Stuart Yoder @ 2013-04-03 19:43 UTC (permalink / raw)
  To: Scott Wood
  Cc: Wood Scott-B07421, kvm, agraf, iommu, qemu-devel,
	Yoder Stuart-B08248, Alex Williamson, Bhushan Bharat-R65777

On Wed, Apr 3, 2013 at 2:18 PM, Scott Wood <scottwood@freescale.com> wrote:
> On 04/03/2013 02:09:45 PM, Stuart Yoder wrote:
>>
>> > Would it be possible for userspace to simply leave room for MSI bank
>> > mapping (how much room could be determined by something like
>> > VFIO_IOMMU_GET_MSI_BANK_COUNT) then document the API that userspace can
>> > DMA_MAP starting at the 0x0 address of the aperture, growing up, and
>> > VFIO will map banks on demand at the top of the aperture, growing down?
>> > Wouldn't that avoid a lot of issues with userspace needing to know
>> > anything about MSI banks (other than count) and coordinating irq numbers
>> > and enabling handlers?
>>
>> This is basically option #A in the original proposals sent.   I like
>> this approach, in that it
>> is simpler and keeps user space mostly out of this...which is
>> consistent with how things are done
>> on x86.  User space just needs to know how many windows to leave at
>> the top of the aperture.
>> The kernel then has the flexibility to use those windows how it wants.
>>
>> But one question, is when should the kernel actually map (and unmap)
>> the MSI banks.
>
>
> I think userspace should explicitly request it.  Userspace still wouldn't
> need to know anything but the count:
>
> count = VFIO_IOMMU_GET_MSI_BANK_COUNT
> VFIO_IOMMU_SET_ATTR(ATTR_GEOMETRY)
> VFIO_IOMMU_SET_ATTR(ATTR_WINDOWS)
> // do other DMA maps now, or later, or not at all, doesn't matter
> for (i = 0; i < count; i++)
>         VFIO_IOMMU_MAP_MSI_BANK(iova, i);
> // The kernel now knows where each bank has been mapped, and can update PCI
> config space appropriately.

And the overall aperture enable/disable would occur on the first
dma/msi map() and last dma/msi unmap()?

Stuart

* Re: RFC: vfio API changes needed for powerpc
  2013-04-03 19:43                             ` [Qemu-devel] " Stuart Yoder
@ 2013-04-03 20:00                                 ` Scott Wood
  -1 siblings, 0 replies; 60+ messages in thread
From: Scott Wood @ 2013-04-03 20:00 UTC (permalink / raw)
  To: Stuart Yoder
  Cc: Wood Scott-B07421, kvm, agraf, qemu-devel, Yoder Stuart-B08248,
	iommu, Bhushan Bharat-R65777

On 04/03/2013 02:43:06 PM, Stuart Yoder wrote:
> On Wed, Apr 3, 2013 at 2:18 PM, Scott Wood <scottwood@freescale.com>
> wrote:
> > On 04/03/2013 02:09:45 PM, Stuart Yoder wrote:
> >>
> >> > Would it be possible for userspace to simply leave room for MSI
> bank
> >> > mapping (how much room could be determined by something like
> >> > VFIO_IOMMU_GET_MSI_BANK_COUNT) then document the API that  
> userspace can
> >> > DMA_MAP starting at the 0x0 address of the aperture, growing up,  
> and
> >> > VFIO will map banks on demand at the top of the aperture,  
> growing down?
> >> > Wouldn't that avoid a lot of issues with userspace needing to  
> know
> >> > anything about MSI banks (other than count) and coordinating irq  
> numbers
> >> > and enabling handlers?
> >>
> >> This is basically option #A in the original proposals sent.   I  
> like
> >> this approach, in that it
> >> is simpler and keeps user space mostly out of this...which is
> >> consistent with how things are done
> >> on x86.  User space just needs to know how many windows to leave at
> >> the top of the aperture.
> >> The kernel then has the flexibility to use those windows how it  
> wants.
> >>
> >> But one question, is when should the kernel actually map (and  
> unmap)
> >> the MSI banks.
> >
> >
> > I think userspace should explicitly request it.  Userspace still  
> wouldn't
> > need to know anything but the count:
> >
> > count = VFIO_IOMMU_GET_MSI_BANK_COUNT
> > VFIO_IOMMU_SET_ATTR(ATTR_GEOMETRY)
> > VFIO_IOMMU_SET_ATTR(ATTR_WINDOWS)
> > // do other DMA maps now, or later, or not at all, doesn't matter
> > for (i = 0; i < count; i++)
> >         VFIO_IOMMU_MAP_MSI_BANK(iova, i);
> > // The kernel now knows where each bank has been mapped, and can  
> update PCI
> > config space appropriately.
> 
> And the overall aperture enable/disable would occur on the first
> dma/msi map() and last dma/msi unmap()?

Yes.  We may want the optional ability to do an overall enable/disable  
for reasons we discussed a while ago, but in the absence of an explicit  
disable the domain would be enabled on first map.

-Scott

* Re: RFC: vfio API changes needed for powerpc
  2013-04-03  3:37                     ` [Qemu-devel] " Alex Williamson
@ 2013-04-03 21:19                         ` Scott Wood
  -1 siblings, 0 replies; 60+ messages in thread
From: Scott Wood @ 2013-04-03 21:19 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Wood Scott-B07421, kvm, Stuart Yoder, qemu-devel, agraf,
	Yoder Stuart-B08248, iommu, Bhushan Bharat-R65777

On 04/02/2013 10:37:20 PM, Alex Williamson wrote:
> On Tue, 2013-04-02 at 17:50 -0500, Scott Wood wrote:
> > On 04/02/2013 04:38:45 PM, Alex Williamson wrote:
> > > On Tue, 2013-04-02 at 16:08 -0500, Stuart Yoder wrote:
> > > > On Tue, Apr 2, 2013 at 3:57 PM, Scott Wood
> > > <scottwood@freescale.com> wrote:
> > > > >> >    C.  Explicit mapping using normal DMA map.  The last  
> idea
> > > is that
> > > > >> >        we would introduce a new ioctl to give user-space  
> an fd
> > > to
> > > > >> >        the MSI bank, which could be mmapped.  The flow  
> would be
> > > > >> >        something like this:
> > > > >> >           -for each group user space calls new ioctl
> > > > >> > VFIO_GROUP_GET_MSI_FD
> > > > >> >           -user space mmaps the fd, getting a vaddr
> > > > >> >           -user space does a normal DMA map for desired  
> iova
> > > > >> >        This approach makes everything explicit, but adds a  
> new
> > > ioctl
> > > > >> >        applicable most likely only to the PAMU (type2  
> iommu).
> > > > >>
> > > > >> And the DMA_MAP of that mmap then allows userspace to select  
> the
> > > window
> > > > >> used?  This one seems like a lot of overhead, adding a new
> > > ioctl, new
> > > > >> fd, mmap, special mapping path, etc.
> > > > >
> > > > >
> > > > > There's going to be special stuff no matter what.  This would
> > > keep it
> > > > > separated from the IOMMU map code.
> > > > >
> > > > > I'm not sure what you mean by "overhead" here... the runtime
> > > overhead of
> > > > > setting things up is not particularly relevant as long as it's
> > > reasonable.
> > > > > If you mean development and maintenance effort, keeping things
> > > well
> > > > > separated should help.
> > > >
> > > > We don't need to change DMA_MAP.  If we can simply add a new  
> "type
> > > 2"
> > > > ioctl that allows user space to set which windows are MSIs, it
> > > seems vastly
> > > > less complex than an ioctl to supply a new fd, mmap of it, etc.
> > > >
> > > > So maybe 2 ioctls:
> > > >     VFIO_IOMMU_GET_MSI_COUNT
> >
> > Do you mean a count of actual MSIs or a count of MSI banks used by  
> the
> > whole VFIO group?
> 
> I hope the latter, which would clarify how this is distinct from
> DEVICE_GET_IRQ_INFO.  Is hotplug even on the table?  Presumably
> dynamically adding a device could bring along additional MSI banks?

I'm not sure -- maybe we could say that hotplug can add banks, but not  
remove them or change the order, so userspace would just need to check  
if the number of banks changed, and map the extras.
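
Something like this sketch (same caveat as before: GET_MSI_BANK_COUNT
and the bank-mapping call are only the proposed interface, and
next_msi_iova() is a hypothetical helper for whatever placement policy
userspace uses):

	/* after hotplug, pick up any banks the new device brought along */
	int new_count = ioctl(container, VFIO_IOMMU_GET_MSI_BANK_COUNT);
	for (i = old_count; i < new_count; i++)
		VFIO_IOMMU_MAP_MSI_BANK(next_msi_iova(), i);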

> The current VFIO MSI support has the host handling everything about  
> MSI.
> The user never programs an MSI vector to the physical device, they set
> up everything through ioctl.  On interrupt, we simply trigger an  
> eventfd
> and leave it to things like KVM irqfd or QEMU to do the right thing  
> in a
> virtual machine.
> 
> Here the MSI vector has to go through a PAMU window to hit the correct
> MSI bank.  So that means it has some component of the iova involved,
> which we're proposing here is controlled by userspace (whether that
> vector uses an offset from 0x10000000 or 0x00000000 depending on which
> window slot is used to map the MSI bank).  I assume we're still  
> working
> in a model where the physical interrupt fires into the host and a
> host-based interrupt handler triggers an eventfd, right?

Yes (subject to possible future optimizations).

> So that means the vector also has host components so we trigger the  
> correct ISR.  How
> is that coordinated?

Everything but the iova component needs to come from the host MSI  
allocator.

> Would it be possible for userspace to simply leave room for MSI bank
> mapping (how much room could be determined by something like
> VFIO_IOMMU_GET_MSI_BANK_COUNT) then document the API that userspace  
> can
> DMA_MAP starting at the 0x0 address of the aperture, growing up, and
> VFIO will map banks on demand at the top of the aperture, growing  
> down?
> Wouldn't that avoid a lot of issues with userspace needing to know
> anything about MSI banks (other than count) and coordinating irq  
> numbers
> and enabling handlers?

This would restrict a (possibly unlikely) use case where the user wants  
to map something near the top of the aperture but has another place  
MSIs can go (or is willing to live without MSIs).  Otherwise it could  
be workable, as long as we can require an explicit MSI enabling on a  
device to happen after the aperture and subwindow count are set up.   
I'm not sure it would really buy anything over having userspace iterate  
over the MSI bank count, though -- it would probably be a bit more  
complicated.

> > > On x86 MSI count is very
> > > device specific, which means it would be a VFIO_DEVICE_* ioctl
> > > (actually
> > > VFIO_DEVICE_GET_IRQ_INFO does this for us on x86).  The trouble  
> with
> > > it
> > > being a device ioctl is that you need to get the device FD, but  
> the
> > > IOMMU protection needs to be established before you can get  
> that... so
> > > there's an ordering problem if you need it from the device before
> > > configuring the IOMMU.  Thanks,
> >
> > What do you mean by "IOMMU protection needs to be established"?
> > Wouldn't we just start with no mappings in place?
> 
> If having no mappings blocks all DMA, sure, that's fine.  Once the
> VFIO device FD is accessible by userspace we have to protect the host
> against DMA.  If any IOMMU_SET_ATTR calls temporarily disable DMA
> protection, that could be exploitable.  Thanks,

Unless the PAMU is globally in bypass mode (which it wouldn't be),  
there's no way to disable protection other than creating one giant  
mapping.

-Scott


* Re: RFC: vfio API changes needed for powerpc
  2013-04-03  3:12                 ` [Qemu-devel] " Alex Williamson
@ 2013-04-03 21:25                     ` Scott Wood
  -1 siblings, 0 replies; 60+ messages in thread
From: Scott Wood @ 2013-04-03 21:25 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Wood Scott-B07421, kvm-u79uwXL29TY76Z2rM5mHXA, agraf-l3A5Bk7waGM,
	qemu-devel-qX2TKyscuCcdnm+yROfE0A, Yoder Stuart-B08248,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Bhushan Bharat-R65777

On 04/02/2013 10:12:31 PM, Alex Williamson wrote:
> On Tue, 2013-04-02 at 17:44 -0500, Scott Wood wrote:
> > On 04/02/2013 04:32:04 PM, Alex Williamson wrote:
> > > On Tue, 2013-04-02 at 15:57 -0500, Scott Wood wrote:
> > > > On 04/02/2013 03:32:17 PM, Alex Williamson wrote:
> > > > > On x86 the interrupt remapper handles this transparently when
> > > > > MSI is enabled and userspace never gets direct access to the
> > > > > device MSI address/data registers.
> > > >
> > > > x86 has a totally different mechanism here, as far as I
> > > > understand -- even before you get into restrictions on mappings.
> > >
> > > So what control will userspace have over programming the actual
> > > MSI vectors on PAMU?
> >
> > Not sure what you mean -- PAMU doesn't get explicitly involved in
> > MSIs.  It's just another 4K page mapping (per relevant MSI bank).
> > If you want isolation, you need to make sure that an MSI group is
> > only used by one VFIO group, and that you're on a chip that has
> > alias pages with just one MSI bank register each (newer chips do,
> > but the first chip to have a PAMU didn't).
> 
> How does a user figure this out?

The user's involvement could be limited to setting a policy knob of  
whether that degree of isolation is required (if required and  
unavailable, all devices using an MSI bank would be forced into the  
same group).  We'd need to do something with MSI allocation so that we  
avoid using an MSI bank with more than one IOMMU group where possible.   
I'm not sure about the details yet, or how practical this is.  There  
might need to be some MSI bank assignment done as part of the VFIO  
device binding process, if there are going to be more VFIO groups than  
there are MSI banks (reserving one bank for host use).
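
One possible shape for that assignment, purely as a sketch (none of
these names exist in the kernel):

    /* Reuse a bank already assigned to this IOMMU group, else take an
     * unused one (owner -1), and share a bank only when strict
     * isolation was not requested. */
    static int pick_msi_bank(int strict, int nbanks,
                             const int bank_owner[], int group)
    {
        for (int i = 0; i < nbanks; i++)
            if (bank_owner[i] == group)
                return i;            /* already assigned to this group */
        for (int i = 0; i < nbanks; i++)
            if (bank_owner[i] == -1)
                return i;            /* free bank */
        return strict ? -1 : 0;      /* share only if policy allows */
    }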

-Scott


end of thread

Thread overview:
2013-04-02 17:32 RFC: vfio API changes needed for powerpc Yoder Stuart-B08248
2013-04-02 19:39   ` Scott Wood
2013-04-02 20:38     ` Stuart Yoder
2013-04-02 20:47       ` Scott Wood
2013-04-02 20:58         ` Stuart Yoder
2013-04-02 20:32   ` Alex Williamson
2013-04-02 20:54       ` Stuart Yoder
2013-04-02 21:16           ` Alex Williamson
2013-04-02 22:13               ` Scott Wood
2013-04-03  2:54                 ` Alex Williamson
2013-04-02 20:57       ` Scott Wood
2013-04-02 21:08         ` Stuart Yoder
2013-04-02 21:38             ` Alex Williamson
2013-04-02 22:50                 ` Scott Wood
2013-04-03  3:37                   ` Alex Williamson
2013-04-03 19:09                     ` Stuart Yoder
2013-04-03 19:18                         ` Scott Wood
2013-04-03 19:43                           ` Stuart Yoder
2013-04-03 20:00                               ` Scott Wood
2013-04-03 19:23                         ` Alex Williamson
2013-04-03 19:26                       ` Scott Wood
2013-04-03 21:19                       ` Scott Wood
2013-04-03 18:32                   ` Stuart Yoder
2013-04-03 18:39                     ` Scott Wood
2013-04-02 21:55             ` Scott Wood
2013-04-02 21:32         ` Alex Williamson
2013-04-02 22:44             ` Scott Wood
2013-04-03  3:12               ` Alex Williamson
2013-04-03 18:25                 ` Stuart Yoder
2013-04-03 21:25                   ` Scott Wood
