* [PATCH 00/13] IOMMU Groups + VFIO
From: Alex Williamson @ 2012-05-11 22:55 UTC
  To: benh, aik, david, joerg.roedel, dwmw2
  Cc: chrisw, agraf, benve, aafabbri, B08248, B07421, avi, konrad.wilk,
	kvm, qemu-devel, iommu, linux-pci, linux-kernel, gregkh,
	bhelgaas, alex.williamson

This is the latest incantation of IOMMU Groups and matching VFIO
code.  After some discussions with BenH and David Gibson I think
we have some agreement on moving forward with the IOMMU Group
approach, which covers the first 6 patches in this series.  The
basic idea is that IOMMU groups need to be a more
integrated part of the driver model.  The current iommu_device_group
interface requires the user to do all the heavy lifting of
finding devices reporting the same ID and putting them together into
something useful.  If an esoteric driver like VFIO is the only user
of the groups, this works nicely, but if we want to use groups for
common DMA paths, it's too much overhead.  We therefore add an
iommu_group pointer off the device so we can easily get to the group
for fast paths like streaming DMA (GregKH, does this work for you?).
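
For illustration only (not part of this series), a fast path could
then reach the group with a single dereference; iommu_group_get() in
patch 2/13 is the reference-counted wrapper around the same pointer:

	/* sketch: fast-path lookup via the new struct device pointer */
	struct iommu_group *group = dev->iommu_group;

	if (group) {
		/* ... consult per-group context for streaming DMA ... */
	}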

I include code to support IOMMU groups on both amd_iommu and
intel-iommu.  These take into account device quirks, like the Ricoh
device that David Woodhouse has noted on a few occasions.  I'll
leave it to the iommu owners to make use of these.

For VFIO itself, the core has had nearly a complete rewrite to
support the IOMMU group interface and also to switch to a new model
for allowing IOMMU domains to share context (merge is gone).  To
allow the most flexibility (and also to avoid ratholing on the exposed
IOMMU interface), the VFIO IOMMU backend is completely modular,
allowing the user to probe and initiate a specific backend for a
group.  I've poorly chosen the name "x86" for the current IOMMU
backend, simply for lack of a good way to describe a non-window based,
page table driven IOMMU implementation.
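
As a rough sketch of the intended usage model (the ioctl and backend
names follow the documentation in this series and should be treated
as illustrative; error handling omitted):

	/* userspace: bind a group to a container, probe a backend;
	 * requires <fcntl.h>, <sys/ioctl.h>, <linux/vfio.h> */
	int container = open("/dev/vfio/vfio", O_RDWR);
	int group = open("/dev/vfio/26", O_RDWR);	/* iommu group 26 */

	ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
	ioctl(container, VFIO_SET_IOMMU, VFIO_X86_IOMMU);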

These patches can be found in git here:

git://github.com/awilliam/linux-vfio.git (iommu-group-vfio-20120511)

The matching qemu tree supporting VFIO device assignment is here:

git://github.com/awilliam/qemu-vfio.git (iommu-group-vfio)

Sorry, this is based on Qemu-1.0.  Updating to 1.1 is next on my
todo list.

I'd really like to move forward with this, so please provide
comments and feedback.  This touches the iommu core, x86 iommu
drivers, pci, and the driver core, and adds a new driver, so I
appreciate any kind of ack
(or nak I suppose) to know if we've got end-to-end support.  Thanks!

Alex
---

Alex Williamson (13):
      vfio: Add PCI device driver
      pci: Misc pci_reg additions
      pci: Create common pcibios_err_to_errno
      pci: export pci_user functions for use by other drivers
      vfio: x86 IOMMU implementation
      vfio: Add documentation
      vfio: VFIO core
      iommu: Make use of DMA quirking and ACS enabled check for groups
      pci: New pci_acs_enabled()
      pci: New pci_dma_quirk()
      iommu: IOMMU groups for VT-d and AMD-Vi
      iommu: IOMMU Groups
      driver core: Add iommu_group tracking to struct device


 Documentation/ioctl/ioctl-number.txt |    1 
 Documentation/vfio.txt               |  315 +++++++
 MAINTAINERS                          |    8 
 drivers/Kconfig                      |    2 
 drivers/Makefile                     |    1 
 drivers/iommu/amd_iommu.c            |   52 +
 drivers/iommu/intel-iommu.c          |   72 +-
 drivers/iommu/iommu.c                |  449 +++++++++-
 drivers/pci/access.c                 |    6 
 drivers/pci/pci.c                    |   43 +
 drivers/pci/pci.h                    |    8 
 drivers/pci/quirks.c                 |   22 
 drivers/vfio/Kconfig                 |   16 
 drivers/vfio/Makefile                |    3 
 drivers/vfio/pci/Kconfig             |    8 
 drivers/vfio/pci/Makefile            |    4 
 drivers/vfio/pci/vfio_pci.c          |  557 ++++++++++++
 drivers/vfio/pci/vfio_pci_config.c   | 1527 ++++++++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_intrs.c    |  724 ++++++++++++++++
 drivers/vfio/pci/vfio_pci_private.h  |   91 ++
 drivers/vfio/pci/vfio_pci_rdwr.c     |  267 ++++++
 drivers/vfio/vfio.c                  | 1406 +++++++++++++++++++++++++++++++
 drivers/vfio/vfio_iommu_x86.c        |  743 +++++++++++++++++
 drivers/xen/xen-pciback/conf_space.c |    6 
 include/linux/device.h               |    2 
 include/linux/iommu.h                |   84 ++
 include/linux/pci.h                  |   37 +
 include/linux/pci_regs.h             |  112 ++
 include/linux/vfio.h                 |  442 ++++++++++
 29 files changed, 6892 insertions(+), 116 deletions(-)
 create mode 100644 Documentation/vfio.txt
 create mode 100644 drivers/vfio/Kconfig
 create mode 100644 drivers/vfio/Makefile
 create mode 100644 drivers/vfio/pci/Kconfig
 create mode 100644 drivers/vfio/pci/Makefile
 create mode 100644 drivers/vfio/pci/vfio_pci.c
 create mode 100644 drivers/vfio/pci/vfio_pci_config.c
 create mode 100644 drivers/vfio/pci/vfio_pci_intrs.c
 create mode 100644 drivers/vfio/pci/vfio_pci_private.h
 create mode 100644 drivers/vfio/pci/vfio_pci_rdwr.c
 create mode 100644 drivers/vfio/vfio.c
 create mode 100644 drivers/vfio/vfio_iommu_x86.c
 create mode 100644 include/linux/vfio.h

* [PATCH 01/13] driver core: Add iommu_group tracking to struct device
From: Alex Williamson @ 2012-05-11 22:55 UTC
  To: benh, aik, david, joerg.roedel, dwmw2
  Cc: chrisw, agraf, benve, aafabbri, B08248, B07421, avi, konrad.wilk,
	kvm, qemu-devel, iommu, linux-pci, linux-kernel, gregkh,
	bhelgaas, alex.williamson

IOMMU groups allow IOMMU drivers to represent DMA visibility
and isolation of devices.  Multiple devices may be grouped
together for the purposes of DMA.  Placing a pointer on
struct device enables easy access for things like streaming
DMA programming and drivers like VFIO.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

 include/linux/device.h |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/include/linux/device.h b/include/linux/device.h
index 5ad17cc..13dd26b 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -35,6 +35,7 @@ struct subsys_private;
 struct bus_type;
 struct device_node;
 struct iommu_ops;
+struct iommu_group;
 
 struct bus_attribute {
 	struct attribute	attr;
@@ -677,6 +678,7 @@ struct device {
 	const struct attribute_group **groups;	/* optional groups */
 
 	void	(*release)(struct device *dev);
+	struct iommu_group	*iommu_group;
 };
 
 /* Get the wakeup routines, which depend on struct device */


* [PATCH 02/13] iommu: IOMMU Groups
From: Alex Williamson @ 2012-05-11 22:55 UTC
  To: benh, aik, david, joerg.roedel, dwmw2
  Cc: chrisw, agraf, benve, aafabbri, B08248, B07421, avi, konrad.wilk,
	kvm, qemu-devel, iommu, linux-pci, linux-kernel, gregkh,
	bhelgaas, alex.williamson

IOMMU device groups are currently a rather vague associative notion
with assembly required by the user or user level driver provider to
do anything useful.  This patch intends to grow the IOMMU group concept
into something a bit more consumable.

To do this, we first create an object representing the group, struct
iommu_group.  This structure is allocated (iommu_group_alloc) and
filled (iommu_group_add_device) by the iommu driver.  The iommu driver
is free to add devices to the group using its own set of policies.
This allows inclusion of devices based on physical hardware or topology
limitations of the platform, as well as soft requirements, such as
multi-function trust levels or peer-to-peer protection of the
interconnects.  Each device may only belong to a single iommu group,
which is linked from struct device.iommu_group.  IOMMU groups are
maintained using kobject reference counting, allowing for automatic
removal of empty, unreferenced groups.  It is the responsibility of
the iommu driver to remove devices from the group
(iommu_group_remove_device).
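
For example (sketch only; the real AMD-Vi and VT-d policies follow in
later patches, and my_find_alias_dev() is a hypothetical stand-in for
driver policy), an iommu driver's add_device callback might look like:

	static int my_iommu_add_device(struct device *dev)
	{
		struct device *alias = my_find_alias_dev(dev);
		struct iommu_group *group = NULL;
		int ret;

		if (alias)
			group = iommu_group_get(alias);	/* share its group */
		if (!group) {
			group = iommu_group_alloc();
			if (IS_ERR(group))
				return PTR_ERR(group);
		}

		ret = iommu_group_add_device(group, dev);
		iommu_group_put(group);	/* the device now holds the group */
		return ret;
	}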

IOMMU groups also include a userspace representation in sysfs under
/sys/kernel/iommu_groups.  When allocated, each group is given a
dynamically assigned ID (int).  The ID is managed by the core IOMMU group
code to support multiple heterogeneous iommu drivers, which could
potentially collide in group naming/numbering.  This also keeps group
IDs to small, easily managed values.  A directory is created under
/sys/kernel/iommu_groups for each group.  A further subdirectory named
"devices" contains links to each device within the group.  The iommu_group
file in the device's sysfs directory, which formerly contained a group
number when read, is now a link to the iommu group.  Example:

$ ls -l /sys/kernel/iommu_groups/26/devices/
total 0
lrwxrwxrwx. 1 root root 0 Apr 17 12:57 0000:00:1e.0 ->
		../../../../devices/pci0000:00/0000:00:1e.0
lrwxrwxrwx. 1 root root 0 Apr 17 12:57 0000:06:0d.0 ->
		../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.0
lrwxrwxrwx. 1 root root 0 Apr 17 12:57 0000:06:0d.1 ->
		../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.1

$ ls -l  /sys/kernel/iommu_groups/26/devices/*/iommu_group
[truncating perms/owner/timestamp]
/sys/kernel/iommu_groups/26/devices/0000:00:1e.0/iommu_group ->
					../../../kernel/iommu_groups/26
/sys/kernel/iommu_groups/26/devices/0000:06:0d.0/iommu_group ->
					../../../../kernel/iommu_groups/26
/sys/kernel/iommu_groups/26/devices/0000:06:0d.1/iommu_group ->
					../../../../kernel/iommu_groups/26

Groups also include several exported functions for use by user level
driver providers, for example VFIO.  These include:

iommu_group_get(): Acquires a reference to a group from a device
iommu_group_put(): Releases reference
iommu_group_for_each_dev(): Iterates over group devices using callback
iommu_group_[un]register_notifier(): Allows notification of device add
        and remove operations relevant to the group
iommu_group_id(): Return the group number

This patch also extends the IOMMU API to allow attaching groups to
domains.  This is currently a simple wrapper for iterating through
devices within a group, but it's expected that the IOMMU API may
eventually make groups a more integral part of domains.
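
Roughly, a group user might combine these interfaces as follows
(sketch only; the ownership and locking concerns discussed below are
omitted, and my_nb is an assumed notifier_block):

	static int my_use_group(struct iommu_domain *domain,
				struct device *dev,
				struct notifier_block *my_nb)
	{
		struct iommu_group *group = iommu_group_get(dev);
		int ret;

		if (!group)
			return -ENODEV;

		pr_debug("device in iommu group %d\n", iommu_group_id(group));

		iommu_group_register_notifier(group, my_nb);
		ret = iommu_attach_group(domain, group);
		if (ret)
			iommu_group_unregister_notifier(group, my_nb);

		iommu_group_put(group);
		return ret;
	}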

Groups intentionally do not try to manage group ownership.  A user
level driver provider must independently acquire ownership for each
device within a group before making use of the group as a whole.
This may change in the future if group usage becomes more pervasive
across both DMA and IOMMU ops.

Groups intentionally do not provide a mechanism for driver locking
or otherwise manipulating driver matching/probing of devices within
the group.  Such interfaces are generic to devices and beyond the
scope of IOMMU groups.  If implemented, user level providers have
ready access via iommu_group_for_each_dev and group notifiers.

Groups currently provide no storage for iommu context, but some kind
of iommu_group_get/set_iommudata() interface is likely if groups
become more pervasive in the dma layers.

iommu_device_group() is removed here as it has no users.  The
replacement is:

	group = iommu_group_get(dev);
	id = iommu_group_id(group);
	iommu_group_put(group);

AMD-Vi & Intel VT-d support re-added in following patches.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

 drivers/iommu/amd_iommu.c   |   21 --
 drivers/iommu/intel-iommu.c |   49 -----
 drivers/iommu/iommu.c       |  449 ++++++++++++++++++++++++++++++++++++++++---
 include/linux/iommu.h       |   84 ++++++++
 4 files changed, 499 insertions(+), 104 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index a5bee8e..32c00cd 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -3193,26 +3193,6 @@ static int amd_iommu_domain_has_cap(struct iommu_domain *domain,
 	return 0;
 }
 
-static int amd_iommu_device_group(struct device *dev, unsigned int *groupid)
-{
-	struct iommu_dev_data *dev_data = dev->archdata.iommu;
-	struct pci_dev *pdev = to_pci_dev(dev);
-	u16 devid;
-
-	if (!dev_data)
-		return -ENODEV;
-
-	if (pdev->is_virtfn || !iommu_group_mf)
-		devid = dev_data->devid;
-	else
-		devid = calc_devid(pdev->bus->number,
-				   PCI_DEVFN(PCI_SLOT(pdev->devfn), 0));
-
-	*groupid = amd_iommu_alias_table[devid];
-
-	return 0;
-}
-
 static struct iommu_ops amd_iommu_ops = {
 	.domain_init = amd_iommu_domain_init,
 	.domain_destroy = amd_iommu_domain_destroy,
@@ -3222,7 +3202,6 @@ static struct iommu_ops amd_iommu_ops = {
 	.unmap = amd_iommu_unmap,
 	.iova_to_phys = amd_iommu_iova_to_phys,
 	.domain_has_cap = amd_iommu_domain_has_cap,
-	.device_group = amd_iommu_device_group,
 	.pgsize_bitmap	= AMD_IOMMU_PGSIZES,
 };
 
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index f93d5ac..d4a0ff7 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -4087,54 +4087,6 @@ static int intel_iommu_domain_has_cap(struct iommu_domain *domain,
 	return 0;
 }
 
-/*
- * Group numbers are arbitrary.  Device with the same group number
- * indicate the iommu cannot differentiate between them.  To avoid
- * tracking used groups we just use the seg|bus|devfn of the lowest
- * level we're able to differentiate devices
- */
-static int intel_iommu_device_group(struct device *dev, unsigned int *groupid)
-{
-	struct pci_dev *pdev = to_pci_dev(dev);
-	struct pci_dev *bridge;
-	union {
-		struct {
-			u8 devfn;
-			u8 bus;
-			u16 segment;
-		} pci;
-		u32 group;
-	} id;
-
-	if (iommu_no_mapping(dev))
-		return -ENODEV;
-
-	id.pci.segment = pci_domain_nr(pdev->bus);
-	id.pci.bus = pdev->bus->number;
-	id.pci.devfn = pdev->devfn;
-
-	if (!device_to_iommu(id.pci.segment, id.pci.bus, id.pci.devfn))
-		return -ENODEV;
-
-	bridge = pci_find_upstream_pcie_bridge(pdev);
-	if (bridge) {
-		if (pci_is_pcie(bridge)) {
-			id.pci.bus = bridge->subordinate->number;
-			id.pci.devfn = 0;
-		} else {
-			id.pci.bus = bridge->bus->number;
-			id.pci.devfn = bridge->devfn;
-		}
-	}
-
-	if (!pdev->is_virtfn && iommu_group_mf)
-		id.pci.devfn = PCI_DEVFN(PCI_SLOT(id.pci.devfn), 0);
-
-	*groupid = id.group;
-
-	return 0;
-}
-
 static struct iommu_ops intel_iommu_ops = {
 	.domain_init	= intel_iommu_domain_init,
 	.domain_destroy = intel_iommu_domain_destroy,
@@ -4144,7 +4096,6 @@ static struct iommu_ops intel_iommu_ops = {
 	.unmap		= intel_iommu_unmap,
 	.iova_to_phys	= intel_iommu_iova_to_phys,
 	.domain_has_cap = intel_iommu_domain_has_cap,
-	.device_group	= intel_iommu_device_group,
 	.pgsize_bitmap	= INTEL_IOMMU_PGSIZES,
 };
 
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 2198b2d..f75004e 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -26,60 +26,404 @@
 #include <linux/slab.h>
 #include <linux/errno.h>
 #include <linux/iommu.h>
+#include <linux/idr.h>
+#include <linux/notifier.h>
+
+static struct kset *iommu_group_kset;
+static struct ida iommu_group_ida;
+static struct mutex iommu_group_mutex;
+
+struct iommu_group {
+	struct kobject kobj;
+	struct kobject *devices_kobj;
+	struct list_head devices;
+	struct mutex mutex;
+	struct blocking_notifier_head notifier;
+	int id;
+};
+
+struct iommu_device {
+	struct list_head list;
+	struct device *dev;
+};
+
+struct iommu_group_attribute {
+	struct attribute attr;
+	ssize_t (*show)(struct iommu_group *group, char *buf);
+	ssize_t (*store)(struct iommu_group *group,
+			 const char *buf, size_t count);
+};
 
-static ssize_t show_iommu_group(struct device *dev,
-				struct device_attribute *attr, char *buf)
+#define to_iommu_group_attr(_attr)	\
+	container_of(_attr, struct iommu_group_attribute, attr)
+#define to_iommu_group(_kobj)		\
+	container_of(_kobj, struct iommu_group, kobj)
+
+static ssize_t iommu_group_attr_show(struct kobject *kobj,
+				     struct attribute *__attr, char *buf)
 {
-	unsigned int groupid;
+	struct iommu_group_attribute *attr = to_iommu_group_attr(__attr);
+	struct iommu_group *group = to_iommu_group(kobj);
+	ssize_t ret = -EIO;
 
-	if (iommu_device_group(dev, &groupid))
-		return 0;
+	if (attr->show)
+		ret = attr->show(group, buf);
+	return ret;
+}
 
-	return sprintf(buf, "%u", groupid);
+static ssize_t iommu_group_attr_store(struct kobject *kobj,
+				      struct attribute *__attr,
+				      const char *buf, size_t count)
+{
+	struct iommu_group_attribute *attr = to_iommu_group_attr(__attr);
+	struct iommu_group *group = to_iommu_group(kobj);
+	ssize_t ret = -EIO;
+
+	if (attr->store)
+		ret = attr->store(group, buf, count);
+	return ret;
 }
-static DEVICE_ATTR(iommu_group, S_IRUGO, show_iommu_group, NULL);
 
-static int add_iommu_group(struct device *dev, void *data)
+static void iommu_group_release(struct kobject *kobj)
 {
-	unsigned int groupid;
+	struct iommu_group *group = to_iommu_group(kobj);
 
-	if (iommu_device_group(dev, &groupid) == 0)
-		return device_create_file(dev, &dev_attr_iommu_group);
+	mutex_lock(&iommu_group_mutex);
+	ida_remove(&iommu_group_ida, group->id);
+	mutex_unlock(&iommu_group_mutex);
 
-	return 0;
+	kfree(group);
 }
 
-static int remove_iommu_group(struct device *dev)
+static const struct sysfs_ops iommu_group_sysfs_ops = {
+	.show = iommu_group_attr_show,
+	.store = iommu_group_attr_store,
+};
+
+static struct kobj_type iommu_group_ktype = {
+	.sysfs_ops = &iommu_group_sysfs_ops,
+	.release = iommu_group_release,
+};
+
+/**
+ * iommu_group_alloc - Allocate a new group
+ *
+ * This function is called by an iommu driver to allocate a new iommu
+ * group.  The iommu group represents the minimum granularity of the iommu.
+ * Upon successful return, the caller holds a reference to the supplied
+ * group in order to hold the group until devices are added.  Use
+ * iommu_group_put() to release this extra reference count, allowing the
+ * group to be automatically reclaimed once it has no devices or external
+ * references.
+ */
+struct iommu_group *iommu_group_alloc(void)
 {
-	unsigned int groupid;
+	struct iommu_group *group;
+	int ret;
+
+	group = kzalloc(sizeof(*group), GFP_KERNEL);
+	if (!group)
+		return ERR_PTR(-ENOMEM);
+
+	group->kobj.kset = iommu_group_kset;
+	mutex_init(&group->mutex);
+	INIT_LIST_HEAD(&group->devices);
+	BLOCKING_INIT_NOTIFIER_HEAD(&group->notifier);
+
+	mutex_lock(&iommu_group_mutex);
+
+again:
+	if (unlikely(0 == ida_pre_get(&iommu_group_ida, GFP_KERNEL))) {
+		kfree(group);
+		mutex_unlock(&iommu_group_mutex);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	if (-EAGAIN == ida_get_new(&iommu_group_ida, &group->id))
+		goto again;
 
-	if (iommu_device_group(dev, &groupid) == 0)
-		device_remove_file(dev, &dev_attr_iommu_group);
+	mutex_unlock(&iommu_group_mutex);
+
+	ret = kobject_init_and_add(&group->kobj, &iommu_group_ktype,
+				   NULL, "%d", group->id);
+	if (ret) {
+		mutex_lock(&iommu_group_mutex);
+		ida_remove(&iommu_group_ida, group->id);
+		mutex_unlock(&iommu_group_mutex);
+		kfree(group);
+		return ERR_PTR(ret);
+	}
+
+	group->devices_kobj = kobject_create_and_add("devices", &group->kobj);
+	if (!group->devices_kobj) {
+		kobject_put(&group->kobj); /* triggers .release & free */
+		return ERR_PTR(-ENOMEM);
+	}
+
+	/*
+	 * The devices_kobj holds a reference on the group kobject, so
+	 * as long as that exists so will the group.  We can therefore
+	 * use the devices_kobj for reference counting.
+	 */
+	kobject_put(&group->kobj);
+
+	return group;
+}
 
+/**
+ * iommu_group_add_device - add a device to an iommu group
+ * @group: the group into which to add the device (reference should be held)
+ * @dev: the device
+ *
+ * This function is called by an iommu driver to add a device into a
+ * group.  Adding a device increments the group reference count.
+ */
+int iommu_group_add_device(struct iommu_group *group, struct device *dev)
+{
+	int ret;
+	struct iommu_device *device;
+
+	device = kzalloc(sizeof(*device), GFP_KERNEL);
+	if (!device)
+		return -ENOMEM;
+
+	device->dev = dev;
+
+	ret = sysfs_create_link(&dev->kobj, &group->kobj, "iommu_group");
+	if (ret) {
+		kfree(device);
+		return ret;
+	}
+
+	ret = sysfs_create_link(group->devices_kobj, &dev->kobj,
+				kobject_name(&dev->kobj));
+	if (ret) {
+		sysfs_remove_link(&dev->kobj, "iommu_group");
+		kfree(device);
+		return ret;
+	}
+
+	kobject_get(group->devices_kobj);
+
+	dev->iommu_group = group;
+
+	mutex_lock(&group->mutex);
+	list_add_tail(&device->list, &group->devices);
+	mutex_unlock(&group->mutex);
+
+	/* Notify any listeners about change to group. */
+	blocking_notifier_call_chain(&group->notifier,
+				     IOMMU_GROUP_NOTIFY_ADD_DEVICE, dev);
 	return 0;
 }
 
-static int iommu_device_notifier(struct notifier_block *nb,
-				 unsigned long action, void *data)
+/**
+ * iommu_group_remove_device - remove a device from its current group
+ * @dev: device to be removed
+ *
+ * This function is called by an iommu driver to remove the device from
+ * its current group.  This decrements the iommu group reference count.
+ */
+void iommu_group_remove_device(struct device *dev)
+{
+	struct iommu_group *group = dev->iommu_group;
+	struct iommu_device *device;
+
+	/* Pre-notify listeners that a device is being removed. */
+	blocking_notifier_call_chain(&group->notifier,
+				     IOMMU_GROUP_NOTIFY_DEL_DEVICE, dev);
+
+	mutex_lock(&group->mutex);
+	list_for_each_entry(device, &group->devices, list) {
+		if (device->dev == dev) {
+			list_del(&device->list);
+			kfree(device);
+			break;
+		}
+	}
+	mutex_unlock(&group->mutex);
+
+	sysfs_remove_link(group->devices_kobj, kobject_name(&dev->kobj));
+	sysfs_remove_link(&dev->kobj, "iommu_group");
+
+	dev->iommu_group = NULL;
+	kobject_put(group->devices_kobj);
+}
+
+/**
+ * iommu_group_for_each_dev - iterate over each device in the group
+ * @group: the group
+ * @data: caller opaque data to be passed to callback function
+ * @fn: caller supplied callback function
+ *
+ * This function is called by group users to iterate over group devices.
+ * Callers should hold a reference count to the group during callback.
+ */
+int iommu_group_for_each_dev(struct iommu_group *group, void *data,
+			     int (*fn)(struct device *, void *))
+{
+	struct iommu_device *device;
+	int ret = 0;
+
+	mutex_lock(&group->mutex);
+	list_for_each_entry(device, &group->devices, list) {
+		ret = fn(device->dev, data);
+		if (ret)
+			break;
+	}
+	mutex_unlock(&group->mutex);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_group_for_each_dev);
+
+/**
+ * iommu_group_get - Return the group for a device and increment reference
+ * @dev: get the group that this device belongs to
+ *
+ * This function is called by iommu drivers and users to get the group
+ * for the specified device.  If found, the group is returned and the group
+ * reference is incremented, else NULL.
+ */
+struct iommu_group *iommu_group_get(struct device *dev)
+{
+	struct iommu_group *group = dev->iommu_group;
+
+	if (group)
+		kobject_get(group->devices_kobj);
+
+	return group;
+}
+EXPORT_SYMBOL_GPL(iommu_group_get);
+
+/**
+ * iommu_group_put - Decrement group reference
+ * @group: the group to use
+ *
+ * This function is called by iommu drivers and users to release the
+ * iommu group.  Once the reference count is zero, the group is released.
+ */
+void iommu_group_put(struct iommu_group *group)
+{
+	if (group)
+		kobject_put(group->devices_kobj);
+}
+EXPORT_SYMBOL_GPL(iommu_group_put);
+
+/**
+ * iommu_group_register_notifier - Register a notifier for group changes
+ * @group: the group to watch
+ * @nb: notifier block to signal
+ *
+ * This function allows iommu group users to track changes in a group.
+ * See include/linux/iommu.h for actions sent via this notifier.  Caller
+ * should hold a reference to the group throughout notifier registration.
+ */
+int iommu_group_register_notifier(struct iommu_group *group,
+				  struct notifier_block *nb)
+{
+	return blocking_notifier_chain_register(&group->notifier, nb);
+}
+EXPORT_SYMBOL_GPL(iommu_group_register_notifier);
+
+/**
+ * iommu_group_unregister_notifier - Unregister a notifier
+ * @group: the group to watch
+ * @nb: notifier block to signal
+ *
+ * Unregister a previously registered group notifier block.
+ */
+int iommu_group_unregister_notifier(struct iommu_group *group,
+				    struct notifier_block *nb)
+{
+	return blocking_notifier_chain_unregister(&group->notifier, nb);
+}
+EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier);
+
+/**
+ * iommu_group_id - Return ID for a group
+ * @group: the group to ID
+ *
+ * Return the unique ID for the group matching the sysfs group number.
+ */
+int iommu_group_id(struct iommu_group *group)
+{
+	return group->id;
+}
+EXPORT_SYMBOL_GPL(iommu_group_id);
+
+static int add_iommu_group(struct device *dev, void *data)
+{
+	struct iommu_ops *ops = data;
+
+	if (!ops->add_device)
+		return -ENODEV;
+
+	WARN_ON(dev->iommu_group);
+
+	return ops->add_device(dev);
+}
+
+static int iommu_bus_notifier(struct notifier_block *nb,
+			      unsigned long action, void *data)
 {
 	struct device *dev = data;
+	struct iommu_ops *ops = dev->bus->iommu_ops;
+	struct iommu_group *group;
+	unsigned long group_action = 0;
+
+	/*
+	 * ADD/DEL call into iommu driver ops if provided, which may
+	 * result in ADD/DEL notifiers to group->notifier
+	 */
+	if (action == BUS_NOTIFY_ADD_DEVICE) {
+		if (ops->add_device)
+			return ops->add_device(dev);
+	} else if (action == BUS_NOTIFY_DEL_DEVICE) {
+		if (ops->remove_device && dev->iommu_group) {
+			ops->remove_device(dev);
+			return 0;
+		}
+	}
 
-	if (action == BUS_NOTIFY_ADD_DEVICE)
-		return add_iommu_group(dev, NULL);
-	else if (action == BUS_NOTIFY_DEL_DEVICE)
-		return remove_iommu_group(dev);
+	/*
+	 * Remaining BUS_NOTIFYs get filtered and republished to the
+	 * group, if anyone is listening
+	 */
+	group = iommu_group_get(dev);
+	if (!group)
+		return 0;
 
+	switch (action) {
+	case BUS_NOTIFY_BIND_DRIVER:
+		group_action = IOMMU_GROUP_NOTIFY_BIND_DRIVER;
+		break;
+	case BUS_NOTIFY_BOUND_DRIVER:
+		group_action = IOMMU_GROUP_NOTIFY_BOUND_DRIVER;
+		break;
+	case BUS_NOTIFY_UNBIND_DRIVER:
+		group_action = IOMMU_GROUP_NOTIFY_UNBIND_DRIVER;
+		break;
+	case BUS_NOTIFY_UNBOUND_DRIVER:
+		group_action = IOMMU_GROUP_NOTIFY_UNBOUND_DRIVER;
+		break;
+	}
+
+	if (group_action)
+		blocking_notifier_call_chain(&group->notifier,
+					     group_action, dev);
+
+	iommu_group_put(group);
 	return 0;
 }
 
-static struct notifier_block iommu_device_nb = {
-	.notifier_call = iommu_device_notifier,
+static struct notifier_block iommu_bus_nb = {
+	.notifier_call = iommu_bus_notifier,
 };
 
 static void iommu_bus_init(struct bus_type *bus, struct iommu_ops *ops)
 {
-	bus_register_notifier(bus, &iommu_device_nb);
-	bus_for_each_dev(bus, NULL, NULL, add_iommu_group);
+	bus_register_notifier(bus, &iommu_bus_nb);
+	bus_for_each_dev(bus, NULL, ops, add_iommu_group);
 }
 
 /**
@@ -189,6 +533,45 @@ void iommu_detach_device(struct iommu_domain *domain, struct device *dev)
 }
 EXPORT_SYMBOL_GPL(iommu_detach_device);
 
+/*
+ * IOMMU groups are really the natural working unit of the IOMMU, but
+ * the IOMMU API works on domains and devices.  Bridge that gap by
+ * iterating over the devices in a group.  Ideally we'd have a single
+ * device which represents the requestor ID of the group, but we also
+ * allow IOMMU drivers to create policy defined minimum sets, where
+ * the physical hardware may be able to distinguish members, but we
+ * wish to group them at a higher level (ex. untrusted multi-function
+ * PCI devices).  Thus we attach each device.
+ */
+static int iommu_group_do_attach_device(struct device *dev, void *data)
+{
+	struct iommu_domain *domain = data;
+
+	return iommu_attach_device(domain, dev);
+}
+
+int iommu_attach_group(struct iommu_domain *domain, struct iommu_group *group)
+{
+	return iommu_group_for_each_dev(group, domain,
+					iommu_group_do_attach_device);
+}
+EXPORT_SYMBOL_GPL(iommu_attach_group);
+
+static int iommu_group_do_detach_device(struct device *dev, void *data)
+{
+	struct iommu_domain *domain = data;
+
+	iommu_detach_device(domain, dev);
+
+	return 0;
+}
+
+void iommu_detach_group(struct iommu_domain *domain, struct iommu_group *group)
+{
+	iommu_group_for_each_dev(group, domain, iommu_group_do_detach_device);
+}
+EXPORT_SYMBOL_GPL(iommu_detach_group);
+
 phys_addr_t iommu_iova_to_phys(struct iommu_domain *domain,
 			       unsigned long iova)
 {
@@ -333,11 +716,15 @@ size_t iommu_unmap(struct iommu_domain *domain, unsigned long iova, size_t size)
 }
 EXPORT_SYMBOL_GPL(iommu_unmap);
 
-int iommu_device_group(struct device *dev, unsigned int *groupid)
+static int __init iommu_init(void)
 {
-	if (iommu_present(dev->bus) && dev->bus->iommu_ops->device_group)
-		return dev->bus->iommu_ops->device_group(dev, groupid);
+	iommu_group_kset = kset_create_and_add("iommu_groups",
+					       NULL, kernel_kobj);
+	ida_init(&iommu_group_ida);
+	mutex_init(&iommu_group_mutex);
 
-	return -ENODEV;
+	BUG_ON(!iommu_group_kset);
+
+	return 0;
 }
-EXPORT_SYMBOL_GPL(iommu_device_group);
+subsys_initcall(iommu_init);
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index d937580..92cf3c2 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -26,6 +26,7 @@
 #define IOMMU_CACHE	(4) /* DMA cache coherency */
 
 struct iommu_ops;
+struct iommu_group;
 struct bus_type;
 struct device;
 struct iommu_domain;
@@ -59,6 +60,8 @@ struct iommu_domain {
  * @iova_to_phys: translate iova to physical address
  * @domain_has_cap: domain capabilities query
  * @commit: commit iommu domain
+ * @add_device: add device to iommu grouping
+ * @remove_device: remove device from iommu grouping
  * @pgsize_bitmap: bitmap of supported page sizes
  */
 struct iommu_ops {
@@ -74,10 +77,18 @@ struct iommu_ops {
 				    unsigned long iova);
 	int (*domain_has_cap)(struct iommu_domain *domain,
 			      unsigned long cap);
-	int (*device_group)(struct device *dev, unsigned int *groupid);
+	int (*add_device)(struct device *dev);
+	void (*remove_device)(struct device *dev);
 	unsigned long pgsize_bitmap;
 };
 
+#define IOMMU_GROUP_NOTIFY_ADD_DEVICE		1 /* Device added */
+#define IOMMU_GROUP_NOTIFY_DEL_DEVICE		2 /* Pre Device removed */
+#define IOMMU_GROUP_NOTIFY_BIND_DRIVER		3 /* Pre Driver bind */
+#define IOMMU_GROUP_NOTIFY_BOUND_DRIVER		4 /* Post Driver bind */
+#define IOMMU_GROUP_NOTIFY_UNBIND_DRIVER	5 /* Pre Driver unbind */
+#define IOMMU_GROUP_NOTIFY_UNBOUND_DRIVER	6 /* Post Driver unbind */
+
 extern int bus_set_iommu(struct bus_type *bus, struct iommu_ops *ops);
 extern bool iommu_present(struct bus_type *bus);
 extern struct iommu_domain *iommu_domain_alloc(struct bus_type *bus);
@@ -96,7 +107,24 @@ extern int iommu_domain_has_cap(struct iommu_domain *domain,
 				unsigned long cap);
 extern void iommu_set_fault_handler(struct iommu_domain *domain,
 					iommu_fault_handler_t handler);
-extern int iommu_device_group(struct device *dev, unsigned int *groupid);
+
+extern int iommu_attach_group(struct iommu_domain *domain,
+			      struct iommu_group *group);
+extern void iommu_detach_group(struct iommu_domain *domain,
+			       struct iommu_group *group);
+extern struct iommu_group *iommu_group_alloc(void);
+extern int iommu_group_add_device(struct iommu_group *group,
+				  struct device *dev);
+extern void iommu_group_remove_device(struct device *dev);
+extern int iommu_group_for_each_dev(struct iommu_group *group, void *data,
+				    int (*fn)(struct device *, void *));
+extern struct iommu_group *iommu_group_get(struct device *dev);
+extern void iommu_group_put(struct iommu_group *group);
+extern int iommu_group_register_notifier(struct iommu_group *group,
+					 struct notifier_block *nb);
+extern int iommu_group_unregister_notifier(struct iommu_group *group,
+					   struct notifier_block *nb);
+extern int iommu_group_id(struct iommu_group *group);
 
 /**
  * report_iommu_fault() - report about an IOMMU fault to the IOMMU framework
@@ -140,6 +168,7 @@ static inline int report_iommu_fault(struct iommu_domain *domain,
 #else /* CONFIG_IOMMU_API */
 
 struct iommu_ops {};
+struct iommu_group {};
 
 static inline bool iommu_present(struct bus_type *bus)
 {
@@ -195,11 +224,60 @@ static inline void iommu_set_fault_handler(struct iommu_domain *domain,
 {
 }
 
-static inline int iommu_device_group(struct device *dev, unsigned int *groupid)
+static inline int iommu_attach_group(struct iommu_domain *domain, struct iommu_group *group)
+{
+	return -ENODEV;
+}
+
+static inline void iommu_detach_group(struct iommu_domain *domain, struct iommu_group *group)
+{
+}
+
+static inline struct iommu_group *iommu_group_alloc(void)
+{
+	return ERR_PTR(-ENODEV);
+}
+
+static inline int iommu_group_add_device(struct iommu_group *group, struct device *dev)
+{
+	return -ENODEV;
+}
+
+static inline void iommu_group_remove_device(struct device *dev)
+{
+}
+
+static inline int iommu_group_for_each_dev(struct iommu_group *group, void *data,
+			     int (*fn)(struct device *, void *))
 {
 	return -ENODEV;
 }
 
+static inline struct iommu_group *iommu_group_get(struct device *dev)
+{
+	return NULL;
+}
+
+static inline void iommu_group_put(struct iommu_group *group)
+{
+}
+
+static inline int iommu_group_register_notifier(struct iommu_group *group,
+				  struct notifier_block *nb)
+{
+	return -ENODEV;
+}
+
+static inline int iommu_group_unregister_notifier(struct iommu_group *group,
+				    struct notifier_block *nb)
+{
+	return 0;
+}
+
+static inline int iommu_group_id(struct iommu_group *group)
+{
+	return -ENODEV;
+}
 #endif /* CONFIG_IOMMU_API */
 
 #endif /* __LINUX_IOMMU_H */


+
+	mutex_lock(&iommu_group_mutex);
+
+again:
+	if (unlikely(0 == ida_pre_get(&iommu_group_ida, GFP_KERNEL))) {
+		kfree(group);
+		mutex_unlock(&iommu_group_mutex);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	if (-EAGAIN == ida_get_new(&iommu_group_ida, &group->id))
+		goto again;
 
-	if (iommu_device_group(dev, &groupid) == 0)
-		device_remove_file(dev, &dev_attr_iommu_group);
+	mutex_unlock(&iommu_group_mutex);
+
+	ret = kobject_init_and_add(&group->kobj, &iommu_group_ktype,
+				   NULL, "%d", group->id);
+	if (ret) {
+		mutex_lock(&iommu_group_mutex);
+		ida_remove(&iommu_group_ida, group->id);
+		mutex_unlock(&iommu_group_mutex);
+		kfree(group);
+		return ERR_PTR(ret);
+	}
+
+	group->devices_kobj = kobject_create_and_add("devices", &group->kobj);
+	if (!group->devices_kobj) {
+		kobject_put(&group->kobj); /* triggers .release & free */
+		return ERR_PTR(-ENOMEM);
+	}
+
+	/*
+	 * The devices_kobj holds a reference on the group kobject, so
+	 * as long as that exists so will the group.  We can therefore
+	 * use the devices_kobj for reference counting.
+	 */
+	kobject_put(&group->kobj);
+
+	return group;
+}
 
+/**
+ * iommu_group_add_device - add a device to an iommu group
+ * @group: the group into which to add the device (reference should be held)
+ * @dev: the device
+ *
+ * This function is called by an iommu driver to add a device into a
+ * group.  Adding a device increments the group reference count.
+ */
+int iommu_group_add_device(struct iommu_group *group, struct device *dev)
+{
+	int ret;
+	struct iommu_device *device;
+
+	device = kzalloc(sizeof(*device), GFP_KERNEL);
+	if (!device)
+		return -ENOMEM;
+
+	device->dev = dev;
+
+	ret = sysfs_create_link(&dev->kobj, &group->kobj, "iommu_group");
+	if (ret) {
+		kfree(device);
+		return ret;
+	}
+
+	ret = sysfs_create_link(group->devices_kobj, &dev->kobj,
+				kobject_name(&dev->kobj));
+	if (ret) {
+		sysfs_remove_link(&dev->kobj, "iommu_group");
+		kfree(device);
+		return ret;
+	}
+
+	kobject_get(group->devices_kobj);
+
+	dev->iommu_group = group;
+
+	mutex_lock(&group->mutex);
+	list_add_tail(&device->list, &group->devices);
+	mutex_unlock(&group->mutex);
+
+	/* Notify any listeners about change to group. */
+	blocking_notifier_call_chain(&group->notifier,
+				     IOMMU_GROUP_NOTIFY_ADD_DEVICE, dev);
 	return 0;
 }
 
-static int iommu_device_notifier(struct notifier_block *nb,
-				 unsigned long action, void *data)
+/**
+ * iommu_group_remove_device - remove a device from its current group
+ * @dev: device to be removed
+ *
+ * This function is called by an iommu driver to remove the device from
+ * its current group.  This decrements the iommu group reference count.
+ */
+void iommu_group_remove_device(struct device *dev)
+{
+	struct iommu_group *group = dev->iommu_group;
+	struct iommu_device *device;
+
+	/* Pre-notify listeners that a device is being removed. */
+	blocking_notifier_call_chain(&group->notifier,
+				     IOMMU_GROUP_NOTIFY_DEL_DEVICE, dev);
+
+	mutex_lock(&group->mutex);
+	list_for_each_entry(device, &group->devices, list) {
+		if (device->dev == dev) {
+			list_del(&device->list);
+			kfree(device);
+			break;
+		}
+	}
+	mutex_unlock(&group->mutex);
+
+	sysfs_remove_link(group->devices_kobj, kobject_name(&dev->kobj));
+	sysfs_remove_link(&dev->kobj, "iommu_group");
+
+	dev->iommu_group = NULL;
+	kobject_put(group->devices_kobj);
+}
+
+/**
+ * iommu_group_for_each_dev - iterate over each device in the group
+ * @group: the group
+ * @data: caller opaque data to be passed to callback function
+ * @fn: caller supplied callback function
+ *
+ * This function is called by group users to iterate over group devices.
+ * Callers should hold a reference count to the group during callback.
+ */
+int iommu_group_for_each_dev(struct iommu_group *group, void *data,
+			     int (*fn)(struct device *, void *))
+{
+	struct iommu_device *device;
+	int ret = 0;
+
+	mutex_lock(&group->mutex);
+	list_for_each_entry(device, &group->devices, list) {
+		ret = fn(device->dev, data);
+		if (ret)
+			break;
+	}
+	mutex_unlock(&group->mutex);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_group_for_each_dev);
+
+/**
+ * iommu_group_get - Return the group for a device and increment reference
+ * @dev: get the group that this device belongs to
+ *
+ * This function is called by iommu drivers and users to get the group
+ * for the specified device.  If found, the group is returned and the group
+ * reference is incremented, else NULL.
+ */
+struct iommu_group *iommu_group_get(struct device *dev)
+{
+	struct iommu_group *group = dev->iommu_group;
+
+	if (group)
+		kobject_get(group->devices_kobj);
+
+	return group;
+}
+EXPORT_SYMBOL_GPL(iommu_group_get);
+
+/**
+ * iommu_group_put - Decrement group reference
+ * @group: the group to use
+ *
+ * This function is called by iommu drivers and users to release the
+ * iommu group.  Once the reference count is zero, the group is released.
+ */
+void iommu_group_put(struct iommu_group *group)
+{
+	if (group)
+		kobject_put(group->devices_kobj);
+}
+EXPORT_SYMBOL_GPL(iommu_group_put);
+
+/**
+ * iommu_group_register_notifier - Register a notifier for group changes
+ * @group: the group to watch
+ * @nb: notifier block to signal
+ *
+ * This function allows iommu group users to track changes in a group.
+ * See include/linux/iommu.h for actions sent via this notifier.  Caller
+ * should hold a reference to the group throughout notifier registration.
+ */
+int iommu_group_register_notifier(struct iommu_group *group,
+				  struct notifier_block *nb)
+{
+	return blocking_notifier_chain_register(&group->notifier, nb);
+}
+EXPORT_SYMBOL_GPL(iommu_group_register_notifier);
+
+/**
+ * iommu_group_unregister_notifier - Unregister a notifier
+ * @group: the group to watch
+ * @nb: notifier block to signal
+ *
+ * Unregister a previously registered group notifier block.
+ */
+int iommu_group_unregister_notifier(struct iommu_group *group,
+				    struct notifier_block *nb)
+{
+	return blocking_notifier_chain_unregister(&group->notifier, nb);
+}
+EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier);
+
+/**
+ * iommu_group_id - Return ID for a group
+ * @group: the group to ID
+ *
+ * Return the unique ID for the group matching the sysfs group number.
+ */
+int iommu_group_id(struct iommu_group *group)
+{
+	return group->id;
+}
+EXPORT_SYMBOL_GPL(iommu_group_id);
+
+static int add_iommu_group(struct device *dev, void *data)
+{
+	struct iommu_ops *ops = data;
+
+	if (!ops->add_device)
+		return -ENODEV;
+
+	WARN_ON(dev->iommu_group);
+
+	return ops->add_device(dev);
+}
+
+static int iommu_bus_notifier(struct notifier_block *nb,
+			      unsigned long action, void *data)
 {
 	struct device *dev = data;
+	struct iommu_ops *ops = dev->bus->iommu_ops;
+	struct iommu_group *group;
+	unsigned long group_action = 0;
+
+	/*
+	 * ADD/DEL call into iommu driver ops if provided, which may
+	 * result in ADD/DEL notifiers to group->notifier
+	 */
+	if (action == BUS_NOTIFY_ADD_DEVICE) {
+		if (ops->add_device)
+			return ops->add_device(dev);
+	} else if (action == BUS_NOTIFY_DEL_DEVICE) {
+		if (ops->remove_device && dev->iommu_group) {
+			ops->remove_device(dev);
+			return 0;
+		}
+	}
 
-	if (action == BUS_NOTIFY_ADD_DEVICE)
-		return add_iommu_group(dev, NULL);
-	else if (action == BUS_NOTIFY_DEL_DEVICE)
-		return remove_iommu_group(dev);
+	/*
+	 * Remaining BUS_NOTIFYs get filtered and republished to the
+	 * group, if anyone is listening
+	 */
+	group = iommu_group_get(dev);
+	if (!group)
+		return 0;
 
+	switch (action) {
+	case BUS_NOTIFY_BIND_DRIVER:
+		group_action = IOMMU_GROUP_NOTIFY_BIND_DRIVER;
+		break;
+	case BUS_NOTIFY_BOUND_DRIVER:
+		group_action = IOMMU_GROUP_NOTIFY_BOUND_DRIVER;
+		break;
+	case BUS_NOTIFY_UNBIND_DRIVER:
+		group_action = IOMMU_GROUP_NOTIFY_UNBIND_DRIVER;
+		break;
+	case BUS_NOTIFY_UNBOUND_DRIVER:
+		group_action = IOMMU_GROUP_NOTIFY_UNBOUND_DRIVER;
+		break;
+	}
+
+	if (group_action)
+		blocking_notifier_call_chain(&group->notifier,
+					     group_action, dev);
+
+	iommu_group_put(group);
 	return 0;
 }
 
-static struct notifier_block iommu_device_nb = {
-	.notifier_call = iommu_device_notifier,
+static struct notifier_block iommu_bus_nb = {
+	.notifier_call = iommu_bus_notifier,
 };
 
 static void iommu_bus_init(struct bus_type *bus, struct iommu_ops *ops)
 {
-	bus_register_notifier(bus, &iommu_device_nb);
-	bus_for_each_dev(bus, NULL, NULL, add_iommu_group);
+	bus_register_notifier(bus, &iommu_bus_nb);
+	bus_for_each_dev(bus, NULL, ops, add_iommu_group);
 }
 
 /**
@@ -189,6 +533,45 @@ void iommu_detach_device(struct iommu_domain *domain, struct device *dev)
 }
 EXPORT_SYMBOL_GPL(iommu_detach_device);
 
+/*
+ * IOMMU groups are really the natural working unit of the IOMMU, but
+ * the IOMMU API works on domains and devices.  Bridge that gap by
+ * iterating over the devices in a group.  Ideally we'd have a single
+ * device which represents the requestor ID of the group, but we also
+ * allow IOMMU drivers to create policy defined minimum sets, where
+ * the physical hardware may be able to distinguish members, but we
+ * wish to group them at a higher level (ex. untrusted multi-function
+ * PCI devices).  Thus we attach each device.
+ */
+static int iommu_group_do_attach_device(struct device *dev, void *data)
+{
+	struct iommu_domain *domain = data;
+
+	return iommu_attach_device(domain, dev);
+}
+
+int iommu_attach_group(struct iommu_domain *domain, struct iommu_group *group)
+{
+	return iommu_group_for_each_dev(group, domain,
+					iommu_group_do_attach_device);
+}
+EXPORT_SYMBOL_GPL(iommu_attach_group);
+
+static int iommu_group_do_detach_device(struct device *dev, void *data)
+{
+	struct iommu_domain *domain = data;
+
+	iommu_detach_device(domain, dev);
+
+	return 0;
+}
+
+void iommu_detach_group(struct iommu_domain *domain, struct iommu_group *group)
+{
+	iommu_group_for_each_dev(group, domain, iommu_group_do_detach_device);
+}
+EXPORT_SYMBOL_GPL(iommu_detach_group);
+
 phys_addr_t iommu_iova_to_phys(struct iommu_domain *domain,
 			       unsigned long iova)
 {
@@ -333,11 +716,15 @@ size_t iommu_unmap(struct iommu_domain *domain, unsigned long iova, size_t size)
 }
 EXPORT_SYMBOL_GPL(iommu_unmap);
 
-int iommu_device_group(struct device *dev, unsigned int *groupid)
+static int __init iommu_init(void)
 {
-	if (iommu_present(dev->bus) && dev->bus->iommu_ops->device_group)
-		return dev->bus->iommu_ops->device_group(dev, groupid);
+	iommu_group_kset = kset_create_and_add("iommu_groups",
+					       NULL, kernel_kobj);
+	ida_init(&iommu_group_ida);
+	mutex_init(&iommu_group_mutex);
 
-	return -ENODEV;
+	BUG_ON(!iommu_group_kset);
+
+	return 0;
 }
-EXPORT_SYMBOL_GPL(iommu_device_group);
+subsys_initcall(iommu_init);
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index d937580..92cf3c2 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -26,6 +26,7 @@
 #define IOMMU_CACHE	(4) /* DMA cache coherency */
 
 struct iommu_ops;
+struct iommu_group;
 struct bus_type;
 struct device;
 struct iommu_domain;
@@ -59,6 +60,8 @@ struct iommu_domain {
  * @iova_to_phys: translate iova to physical address
  * @domain_has_cap: domain capabilities query
  * @commit: commit iommu domain
+ * @add_device: add device to iommu grouping
+ * @remove_device: remove device from iommu grouping
  * @pgsize_bitmap: bitmap of supported page sizes
  */
 struct iommu_ops {
@@ -74,10 +77,18 @@ struct iommu_ops {
 				    unsigned long iova);
 	int (*domain_has_cap)(struct iommu_domain *domain,
 			      unsigned long cap);
-	int (*device_group)(struct device *dev, unsigned int *groupid);
+	int (*add_device)(struct device *dev);
+	void (*remove_device)(struct device *dev);
 	unsigned long pgsize_bitmap;
 };
 
+#define IOMMU_GROUP_NOTIFY_ADD_DEVICE		1 /* Device added */
+#define IOMMU_GROUP_NOTIFY_DEL_DEVICE		2 /* Pre Device removed */
+#define IOMMU_GROUP_NOTIFY_BIND_DRIVER		3 /* Pre Driver bind */
+#define IOMMU_GROUP_NOTIFY_BOUND_DRIVER		4 /* Post Driver bind */
+#define IOMMU_GROUP_NOTIFY_UNBIND_DRIVER	5 /* Pre Driver unbind */
+#define IOMMU_GROUP_NOTIFY_UNBOUND_DRIVER	6 /* Post Driver unbind */
+
 extern int bus_set_iommu(struct bus_type *bus, struct iommu_ops *ops);
 extern bool iommu_present(struct bus_type *bus);
 extern struct iommu_domain *iommu_domain_alloc(struct bus_type *bus);
@@ -96,7 +107,24 @@ extern int iommu_domain_has_cap(struct iommu_domain *domain,
 				unsigned long cap);
 extern void iommu_set_fault_handler(struct iommu_domain *domain,
 					iommu_fault_handler_t handler);
-extern int iommu_device_group(struct device *dev, unsigned int *groupid);
+
+extern int iommu_attach_group(struct iommu_domain *domain,
+			      struct iommu_group *group);
+extern void iommu_detach_group(struct iommu_domain *domain,
+			       struct iommu_group *group);
+extern struct iommu_group *iommu_group_alloc(void);
+extern int iommu_group_add_device(struct iommu_group *group,
+				  struct device *dev);
+extern void iommu_group_remove_device(struct device *dev);
+extern int iommu_group_for_each_dev(struct iommu_group *group, void *data,
+				    int (*fn)(struct device *, void *));
+extern struct iommu_group *iommu_group_get(struct device *dev);
+extern void iommu_group_put(struct iommu_group *group);
+extern int iommu_group_register_notifier(struct iommu_group *group,
+					 struct notifier_block *nb);
+extern int iommu_group_unregister_notifier(struct iommu_group *group,
+					   struct notifier_block *nb);
+extern int iommu_group_id(struct iommu_group *group);
 
 /**
  * report_iommu_fault() - report about an IOMMU fault to the IOMMU framework
@@ -140,6 +168,7 @@ static inline int report_iommu_fault(struct iommu_domain *domain,
 #else /* CONFIG_IOMMU_API */
 
 struct iommu_ops {};
+struct iommu_group {};
 
 static inline bool iommu_present(struct bus_type *bus)
 {
@@ -195,11 +224,60 @@ static inline void iommu_set_fault_handler(struct iommu_domain *domain,
 {
 }
 
-static inline int iommu_device_group(struct device *dev, unsigned int *groupid)
+int iommu_attach_group(struct iommu_domain *domain, struct iommu_group *group)
+{
+	return -ENODEV;
+}
+
+void iommu_detach_group(struct iommu_domain *domain, struct iommu_group *group)
+{
+}
+
+struct iommu_group *iommu_group_alloc(void)
+{
+	return ERR_PTR(-ENODEV);
+}
+
+int iommu_group_add_device(struct iommu_group *group, struct device *dev)
+{
+	return -ENODEV;
+}
+
+void iommu_group_remove_device(struct device *dev)
+{
+}
+
+int iommu_group_for_each_dev(struct iommu_group *group, void *data,
+			     int (*fn)(struct device *, void *))
 {
 	return -ENODEV;
 }
 
+struct iommu_group *iommu_group_get(struct device *dev)
+{
+	return NULL;
+}
+
+void iommu_group_put(struct iommu_group *group)
+{
+}
+
+int iommu_group_register_notifier(struct iommu_group *group,
+				  struct notifier_block *nb)
+{
+	return -ENODEV;
+}
+
+int iommu_group_unregister_notifier(struct iommu_group *group,
+				    struct notifier_block *nb)
+{
+	return 0;
+}
+
+int iommu_group_id(struct iommu_group *group)
+{
+	return -ENODEV;
+}
 #endif /* CONFIG_IOMMU_API */
 
 #endif /* __LINUX_IOMMU_H */
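
Taken together, the exported interfaces above are enough for a user
level driver provider such as VFIO to consume a group without knowing
how the iommu driver assembled it: look up the group, watch it for
device changes, and attach it to a domain.  A minimal sketch of such a
consumer (hypothetical code, error handling trimmed):

	static int consumer_notify(struct notifier_block *nb,
				   unsigned long action, void *data)
	{
		/* react to IOMMU_GROUP_NOTIFY_* events for the group */
		return NOTIFY_OK;
	}

	static struct notifier_block consumer_nb = {
		.notifier_call = consumer_notify,
	};

	static int consumer_claim(struct iommu_domain *domain,
				  struct device *dev)
	{
		struct iommu_group *group = iommu_group_get(dev);
		int ret;

		if (!group)
			return -ENODEV;

		ret = iommu_group_register_notifier(group, &consumer_nb);
		if (!ret)
			ret = iommu_attach_group(domain, group);

		iommu_group_put(group);
		return ret;
	}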

* [PATCH 03/13] iommu: IOMMU groups for VT-d and AMD-Vi
@ 2012-05-11 22:55   ` Alex Williamson
  0 siblings, 0 replies; 129+ messages in thread
From: Alex Williamson @ 2012-05-11 22:55 UTC (permalink / raw)
  To: benh, aik, david, joerg.roedel, dwmw2
  Cc: chrisw, agraf, benve, aafabbri, B08248, B07421, avi, konrad.wilk,
	kvm, qemu-devel, iommu, linux-pci, linux-kernel, gregkh,
	bhelgaas, alex.williamson

Add back group support for AMD & Intel.  amd_iommu already tracks
devices and has init and uninit routines to manage groups.
intel-iommu does this on the fly, so we make use of the notifier
support built into iommu groups to create and remove groups.
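
Both drivers end up funneling into the same get-or-allocate pattern on
the device that actually issues DMA for the group.  Condensed from the
hunks below (illustrative only):

	struct iommu_group *group;
	int ret;

	group = iommu_group_get(&dma_pdev->dev);	/* alias/quirk device */
	if (!group) {
		group = iommu_group_alloc();
		if (IS_ERR(group))
			return PTR_ERR(group);
	}

	ret = iommu_group_add_device(group, dev);
	iommu_group_put(group);
	return ret;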

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

 drivers/iommu/amd_iommu.c   |   28 +++++++++++++++++++++++++-
 drivers/iommu/intel-iommu.c |   46 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 73 insertions(+), 1 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 32c00cd..b7e5ddf 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -256,9 +256,11 @@ static bool check_device(struct device *dev)
 
 static int iommu_init_device(struct device *dev)
 {
-	struct pci_dev *pdev = to_pci_dev(dev);
+	struct pci_dev *dma_pdev, *pdev = to_pci_dev(dev);
 	struct iommu_dev_data *dev_data;
+	struct iommu_group *group;
 	u16 alias;
+	int ret;
 
 	if (dev->archdata.iommu)
 		return 0;
@@ -279,8 +281,30 @@ static int iommu_init_device(struct device *dev)
 			return -ENOTSUPP;
 		}
 		dev_data->alias_data = alias_data;
+
+		dma_pdev = pci_get_bus_and_slot(alias >> 8, alias & 0xff);
+	} else
+		dma_pdev = pdev;
+
+	if (!pdev->is_virtfn && PCI_FUNC(pdev->devfn) && iommu_group_mf &&
+	    pdev->hdr_type == PCI_HEADER_TYPE_NORMAL)
+		dma_pdev = pci_get_slot(pdev->bus,
+					PCI_DEVFN(PCI_SLOT(pdev->devfn), 0));
+
+	group = iommu_group_get(&dma_pdev->dev);
+	if (!group) {
+		group = iommu_group_alloc();
+		if (IS_ERR(group))
+			return PTR_ERR(group);
 	}
 
+	ret = iommu_group_add_device(group, dev);
+
+	iommu_group_put(group);
+
+	if (ret)
+		return ret;
+
 	if (pci_iommuv2_capable(pdev)) {
 		struct amd_iommu *iommu;
 
@@ -309,6 +333,8 @@ static void iommu_ignore_device(struct device *dev)
 
 static void iommu_uninit_device(struct device *dev)
 {
+	iommu_group_remove_device(dev);
+
 	/*
 	 * Nothing to do here - we keep dev_data around for unplugged devices
 	 * and reuse it when the device is re-plugged - not doing so would
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index d4a0ff7..e63b33b 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -4087,6 +4087,50 @@ static int intel_iommu_domain_has_cap(struct iommu_domain *domain,
 	return 0;
 }
 
+static int intel_iommu_add_device(struct device *dev)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	struct pci_dev *bridge, *dma_pdev = pdev;
+	struct iommu_group *group;
+	int ret;
+
+	if (!device_to_iommu(pci_domain_nr(pdev->bus),
+			     pdev->bus->number, pdev->devfn))
+		return -ENODEV;
+
+	bridge = pci_find_upstream_pcie_bridge(pdev);
+	if (bridge) {
+		if (pci_is_pcie(bridge))
+			dma_pdev = pci_get_domain_bus_and_slot(
+						pci_domain_nr(pdev->bus),
+						bridge->subordinate->number, 0);
+		else
+			dma_pdev = bridge;
+	}
+
+	if (!pdev->is_virtfn && PCI_FUNC(pdev->devfn) && iommu_group_mf &&
+	    pdev->hdr_type == PCI_HEADER_TYPE_NORMAL)
+		dma_pdev = pci_get_slot(pdev->bus,
+					PCI_DEVFN(PCI_SLOT(pdev->devfn), 0));
+
+	group = iommu_group_get(&dma_pdev->dev);
+	if (!group) {
+		group = iommu_group_alloc();
+		if (IS_ERR(group))
+			return PTR_ERR(group);
+	}
+
+	ret = iommu_group_add_device(group, dev);
+
+	iommu_group_put(group);
+	return ret;
+}
+
+static void intel_iommu_remove_device(struct device *dev)
+{
+	iommu_group_remove_device(dev);
+}
+
 static struct iommu_ops intel_iommu_ops = {
 	.domain_init	= intel_iommu_domain_init,
 	.domain_destroy = intel_iommu_domain_destroy,
@@ -4096,6 +4140,8 @@ static struct iommu_ops intel_iommu_ops = {
 	.unmap		= intel_iommu_unmap,
 	.iova_to_phys	= intel_iommu_iova_to_phys,
 	.domain_has_cap = intel_iommu_domain_has_cap,
+	.add_device	= intel_iommu_add_device,
+	.remove_device	= intel_iommu_remove_device,
 	.pgsize_bitmap	= INTEL_IOMMU_PGSIZES,
 };
 


* [PATCH 04/13] pci: New pci_dma_quirk()
@ 2012-05-11 22:55   ` Alex Williamson
  0 siblings, 0 replies; 129+ messages in thread
From: Alex Williamson @ 2012-05-11 22:55 UTC (permalink / raw)
  To: benh, aik, david, joerg.roedel, dwmw2
  Cc: chrisw, agraf, benve, aafabbri, B08248, B07421, avi, konrad.wilk,
	kvm, qemu-devel, iommu, linux-pci, linux-kernel, gregkh,
	bhelgaas, alex.williamson

Integrating IOMMU groups more closely into the driver core allows
us to more easily work around DMA quirks.  The Ricoh multifunction
controller is a favorite example of a device that is currently
incompatible with IOMMU isolation: all of its functions use the
requester ID of function 0 for DMA.  Passing such a device into
pci_dma_quirk returns the PCI device to use for DMA, allowing the
IOMMU driver to construct an IOMMU group containing both devices.
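
For illustration, a minimal sketch of the intended call site (this
mirrors how the IOMMU drivers consume it in a later patch):

	/* Group by the function that actually emits the DMA. */
	dma_pdev = pci_dma_quirk(dma_pdev);
	group = iommu_group_get(&dma_pdev->dev);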

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

 drivers/pci/quirks.c |   22 ++++++++++++++++++++++
 include/linux/pci.h  |    2 ++
 2 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 4bf7102..6f9f7f9 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3109,3 +3109,25 @@ int pci_dev_specific_reset(struct pci_dev *dev, int probe)
 
 	return -ENOTTY;
 }
+
+struct pci_dev *pci_dma_quirk(struct pci_dev *dev)
+{
+	struct pci_dev *dma_dev = dev;
+
+	/*
+	 * https://bugzilla.redhat.com/show_bug.cgi?id=605888
+	 *
+	 * Some Ricoh devices use the function 0 source ID for DMA on
+ * other functions of a multifunction device.  The DMA device
+ * is therefore function 0, which has implications for the
+ * IOMMU grouping of these devices.
+	 */
+	if (dev->vendor == PCI_VENDOR_ID_RICOH &&
+	    (dev->device == 0xe822 || dev->device == 0xe230 ||
+	     dev->device == 0xe832 || dev->device == 0xe476)) {
+		dma_dev = pci_get_slot(dev->bus,
+				       PCI_DEVFN(PCI_SLOT(dev->devfn), 0));
+	}
+
+	return dma_dev;
+}
diff --git a/include/linux/pci.h b/include/linux/pci.h
index e444f5b..9910b5c 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1479,9 +1479,11 @@ enum pci_fixup_pass {
 
 #ifdef CONFIG_PCI_QUIRKS
 void pci_fixup_device(enum pci_fixup_pass pass, struct pci_dev *dev);
+struct pci_dev *pci_dma_quirk(struct pci_dev *dev);
 #else
 static inline void pci_fixup_device(enum pci_fixup_pass pass,
 				    struct pci_dev *dev) {}
+static inline struct pci_dev *pci_dma_quirk(struct pci_dev *dev) { return dev; }
 #endif
 
 void __iomem *pcim_iomap(struct pci_dev *pdev, int bar, unsigned long maxlen);


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 05/13] pci: New pci_acs_enabled()
@ 2012-05-11 22:56   ` Alex Williamson
  0 siblings, 0 replies; 129+ messages in thread
From: Alex Williamson @ 2012-05-11 22:56 UTC (permalink / raw)
  To: benh, aik, david, joerg.roedel, dwmw2
  Cc: chrisw, agraf, benve, aafabbri, B08248, B07421, avi, konrad.wilk,
	kvm, qemu-devel, iommu, linux-pci, linux-kernel, gregkh,
	bhelgaas, alex.williamson

In a PCIe environment, transactions aren't always required to
reach the root bus before being re-routed.  Peer-to-peer DMA
may actually not be seen by the IOMMU in these cases.  For
IOMMU groups, we want to provide IOMMU drivers a way to detect
these restrictions.  Provided with a PCI device, pci_acs_enabled
returns the furthest downstream device with a complete PCI ACS
chain.  This information can then be used in grouping to create
fully isolated groups.  ACS chain logic extracted from libvirt.
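
As a worked example, assume a hypothetical topology of:

	root port R -- upstream port U -- downstream port D (no ACS) -- endpoint E

pci_acs_enabled(E) recurses up to R on the root bus, then unwinds:
U is an upstream port, so the chain passes through it; D is a
downstream port lacking the ACS capability, so the chain breaks
there and U is returned.  Every endpoint below the switch resolves
to the same U, allowing the IOMMU driver to place the entire switch
into a single isolation group.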

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

 drivers/pci/pci.c   |   43 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/pci.h |    1 +
 2 files changed, 44 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 111569c..d7f05ce 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -2358,6 +2358,49 @@ void pci_enable_acs(struct pci_dev *dev)
 	pci_write_config_word(dev, pos + PCI_ACS_CTRL, ctrl);
 }
 
+#define PCI_EXT_CAP_ACS_ENABLED		(PCI_ACS_SV | PCI_ACS_RR | \
+					 PCI_ACS_CR | PCI_ACS_UF)
+
+/**
+ * pci_acs_enabled - test ACS support in downstream chain
+ * @dev: starting PCI device
+ *
+ * Returns the furthest downstream device with an unbroken ACS chain.  If
+ * ACS is enabled throughout the chain, the returned device is the same as
+ * the one passed in.
+ */
+struct pci_dev *pci_acs_enabled(struct pci_dev *dev)
+{
+	struct pci_dev *acs_dev;
+	int pos;
+	u16 ctrl;
+
+	if (!pci_is_root_bus(dev->bus))
+		acs_dev = pci_acs_enabled(dev->bus->self);
+	else
+		return dev;
+
+	/* If the chain is already broken, pass on the device */
+	if (acs_dev != dev->bus->self)
+		return acs_dev;
+
+	if (!pci_is_pcie(dev) || (dev->class >> 8) != PCI_CLASS_BRIDGE_PCI)
+		return dev;
+
+	if (dev->pcie_type != PCI_EXP_TYPE_DOWNSTREAM)
+		return dev;
+
+	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ACS);
+	if (!pos)
+		return acs_dev;
+
+	pci_read_config_word(dev, pos + PCI_ACS_CTRL, &ctrl);
+	if ((ctrl & PCI_EXT_CAP_ACS_ENABLED) != PCI_EXT_CAP_ACS_ENABLED)
+		return acs_dev;
+
+	return dev;
+}
+
 /**
  * pci_swizzle_interrupt_pin - swizzle INTx for device behind bridge
  * @dev: the PCI device
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 9910b5c..dc25da3 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1586,6 +1586,7 @@ static inline bool pci_is_pcie(struct pci_dev *dev)
 }
 
 void pci_request_acs(void);
+struct pci_dev *pci_acs_enabled(struct pci_dev *dev);
 
 
 #define PCI_VPD_LRDT			0x80	/* Large Resource Data Type */


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 06/13] iommu: Make use of DMA quirking and ACS enabled check for groups
@ 2012-05-11 22:56   ` Alex Williamson
  0 siblings, 0 replies; 129+ messages in thread
From: Alex Williamson @ 2012-05-11 22:56 UTC (permalink / raw)
  To: benh, aik, david, joerg.roedel, dwmw2
  Cc: chrisw, agraf, benve, aafabbri, B08248, B07421, avi, konrad.wilk,
	kvm, qemu-devel, iommu, linux-pci, linux-kernel, gregkh,
	bhelgaas, alex.williamson

Incorporate DMA quirking and ACS checking into amd_iommu and
intel-iommu.  Note that IOMMU groups are not yet used for
streaming DMA, so this doesn't immediately solve the problems
with broken Ricoh devices.
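
As a worked example of the combined lookup: a quirky Ricoh function 1
is first folded onto function 0 by pci_dma_quirk(); if function 0
then sits behind a switch whose downstream port lacks ACS,
pci_acs_enabled() pushes the grouping device further up to the
switch's upstream port, so all of the functions and any peers behind
that switch land in a single IOMMU group.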

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

 drivers/iommu/amd_iommu.c   |    3 +++
 drivers/iommu/intel-iommu.c |    3 +++
 drivers/pci/pci.h           |    1 +
 3 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index b7e5ddf..a165311 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -291,6 +291,9 @@ static int iommu_init_device(struct device *dev)
 		dma_pdev = pci_get_slot(pdev->bus,
 					PCI_DEVFN(PCI_SLOT(pdev->devfn), 0));
 
+	dma_pdev = pci_dma_quirk(dma_pdev);
+	dma_pdev = pci_acs_enabled(dma_pdev);
+
 	group = iommu_group_get(&dma_pdev->dev);
 	if (!group) {
 		group = iommu_group_alloc();
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index e63b33b..5f526c7 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -4113,6 +4113,9 @@ static int intel_iommu_add_device(struct device *dev)
 		dma_pdev = pci_get_slot(pdev->bus,
 					PCI_DEVFN(PCI_SLOT(pdev->devfn), 0));
 
+	dma_pdev = pci_dma_quirk(dma_pdev);
+	dma_pdev = pci_acs_enabled(dma_pdev);
+
 	group = iommu_group_get(&dma_pdev->dev);
 	if (!group) {
 		group = iommu_group_alloc();
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index e494347..e8f2f8f 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -310,6 +310,7 @@ static inline resource_size_t pci_resource_alignment(struct pci_dev *dev,
 }
 
 extern void pci_enable_acs(struct pci_dev *dev);
+extern struct pci_dev *pci_acs_enabled(struct pci_dev *dev);
 
 struct pci_dev_reset_methods {
 	u16 vendor;


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 07/13] vfio: VFIO core
@ 2012-05-11 22:56   ` Alex Williamson
  0 siblings, 0 replies; 129+ messages in thread
From: Alex Williamson @ 2012-05-11 22:56 UTC (permalink / raw)
  To: benh, aik, david, joerg.roedel, dwmw2
  Cc: chrisw, agraf, benve, aafabbri, B08248, B07421, avi, konrad.wilk,
	kvm, qemu-devel, iommu, linux-pci, linux-kernel, gregkh,
	bhelgaas, alex.williamson

VFIO is a secure user level driver framework for use with both
virtual machines and user level drivers.  VFIO makes use of IOMMU
groups to ensure the isolation of devices in use, allowing
unprivileged user access.  It's intended that VFIO will replace KVM
device assignment and UIO drivers (in cases where the target
platform includes a sufficiently capable IOMMU).

New in this version of VFIO is support for IOMMU groups managed
through the IOMMU core as well as a rework of the API, removing the
group merge interface.  We now go back to a model more similar to
the original VFIO with UIOMMU support: the file descriptor obtained
from /dev/vfio/vfio allows access to the IOMMU, but only after a
group is added, which avoids the privilege issues of the earlier
model.  IOMMU support is also now fully modular as IOMMUs have
vastly different interface requirements on different platforms.  VFIO
users are able to query and initialize the IOMMU model of their
choice.

Please see the follow-on Documentation commit for further description
and usage example.
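
In the meantime, a minimal userspace sketch of the new model (error
checking elided; VFIO_X86_IOMMU stands in for whichever extension the
chosen backend advertises, and the group number and device name below
are hypothetical):

	int container, group, device;

	container = open("/dev/vfio/vfio", O_RDWR);
	assert(ioctl(container, VFIO_GET_API_VERSION) == VFIO_API_VERSION);

	/* A bare container is unprivileged until a group is added. */
	group = open("/dev/vfio/26", O_RDWR);
	ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);

	/* Only now may an IOMMU backend be enabled for the container. */
	ioctl(container, VFIO_SET_IOMMU, VFIO_X86_IOMMU);

	/* Device fds come from the group, looked up by device name. */
	device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");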

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

 Documentation/ioctl/ioctl-number.txt |    1 
 MAINTAINERS                          |    8 
 drivers/Kconfig                      |    2 
 drivers/Makefile                     |    1 
 drivers/vfio/Kconfig                 |    8 
 drivers/vfio/Makefile                |    1 
 drivers/vfio/vfio.c                  | 1399 ++++++++++++++++++++++++++++++++++
 include/linux/vfio.h                 |  364 +++++++++
 8 files changed, 1784 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vfio/Kconfig
 create mode 100644 drivers/vfio/Makefile
 create mode 100644 drivers/vfio/vfio.c
 create mode 100644 include/linux/vfio.h

diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
index e34b531..111e30a 100644
--- a/Documentation/ioctl/ioctl-number.txt
+++ b/Documentation/ioctl/ioctl-number.txt
@@ -88,6 +88,7 @@ Code  Seq#(hex)	Include File		Comments
 		and kernel/power/user.c
 '8'	all				SNP8023 advanced NIC card
 					<mailto:mcr@solidum.com>
+';'	64-6F	linux/vfio.h
 '@'	00-0F	linux/radeonfb.h	conflict!
 '@'	00-0F	drivers/video/aty/aty128fb.c	conflict!
 'A'	00-1F	linux/apm_bios.h	conflict!
diff --git a/MAINTAINERS b/MAINTAINERS
index de4e280..48e7600 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7227,6 +7227,14 @@ S:	Maintained
 F:	Documentation/filesystems/vfat.txt
 F:	fs/fat/
 
+VFIO DRIVER
+M:	Alex Williamson <alex.williamson@redhat.com>
+L:	kvm@vger.kernel.org
+S:	Maintained
+F:	Documentation/vfio.txt
+F:	drivers/vfio/
+F:	include/linux/vfio.h
+
 VIDEOBUF2 FRAMEWORK
 M:	Pawel Osciak <pawel@osciak.com>
 M:	Marek Szyprowski <m.szyprowski@samsung.com>
diff --git a/drivers/Kconfig b/drivers/Kconfig
index d236aef..46eb115 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -112,6 +112,8 @@ source "drivers/auxdisplay/Kconfig"
 
 source "drivers/uio/Kconfig"
 
+source "drivers/vfio/Kconfig"
+
 source "drivers/vlynq/Kconfig"
 
 source "drivers/virtio/Kconfig"
diff --git a/drivers/Makefile b/drivers/Makefile
index 95952c8..fe1880a 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -59,6 +59,7 @@ obj-$(CONFIG_ATM)		+= atm/
 obj-$(CONFIG_FUSION)		+= message/
 obj-y				+= firewire/
 obj-$(CONFIG_UIO)		+= uio/
+obj-$(CONFIG_VFIO)		+= vfio/
 obj-y				+= cdrom/
 obj-y				+= auxdisplay/
 obj-$(CONFIG_PCCARD)		+= pcmcia/
diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
new file mode 100644
index 0000000..9acb1e7
--- /dev/null
+++ b/drivers/vfio/Kconfig
@@ -0,0 +1,8 @@
+menuconfig VFIO
+	tristate "VFIO Non-Privileged userspace driver framework"
+	depends on IOMMU_API
+	help
+	  VFIO provides a framework for secure userspace device drivers.
+	  See Documentation/vfio.txt for more details.
+
+	  If you don't know what to do here, say N.
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
new file mode 100644
index 0000000..7500a67
--- /dev/null
+++ b/drivers/vfio/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_VFIO) += vfio.o
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
new file mode 100644
index 0000000..af0e4f8
--- /dev/null
+++ b/drivers/vfio/vfio.c
@@ -0,0 +1,1399 @@
+/*
+ * VFIO core
+ *
+ * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
+ *     Author: Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Derived from original vfio:
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ * Author: Tom Lyon, pugs@cisco.com
+ */
+
+#include <linux/cdev.h>
+#include <linux/compat.h>
+#include <linux/device.h>
+#include <linux/file.h>
+#include <linux/anon_inodes.h>
+#include <linux/fs.h>
+#include <linux/idr.h>
+#include <linux/iommu.h>
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+#include <linux/wait.h>
+
+#define DRIVER_VERSION	"0.3"
+#define DRIVER_AUTHOR	"Alex Williamson <alex.williamson@redhat.com>"
+#define DRIVER_DESC	"VFIO - User Level meta-driver"
+
+static struct vfio {
+	struct class			*class;
+	struct list_head		iommu_drivers_list;
+	struct mutex			iommu_drivers_lock;
+	struct list_head		group_list;
+	struct idr			group_idr;
+	struct mutex			group_lock;
+	struct cdev			group_cdev;
+	struct device			*dev;
+	dev_t				devt;
+	struct cdev			cdev;
+	wait_queue_head_t		release_q;
+} vfio;
+
+struct vfio_iommu_driver {
+	const struct vfio_iommu_driver_ops	*ops;
+	struct list_head			vfio_next;
+};
+
+struct vfio_container {
+	struct kref			kref;
+	struct list_head		group_list;
+	struct mutex			group_lock;
+	struct vfio_iommu_driver	*iommu_driver;
+	void				*iommu_data;
+};
+
+struct vfio_group {
+	struct kref			kref;
+	int				minor;
+	atomic_t			container_users;
+	struct iommu_group		*iommu_group;
+	struct vfio_container		*container;
+	struct list_head		device_list;
+	struct mutex			device_lock;
+	struct device			*dev;
+	struct notifier_block		nb;
+	struct list_head		vfio_next;
+	struct list_head		container_next;
+};
+
+struct vfio_device {
+	struct kref			kref;
+	struct device			*dev;
+	const struct vfio_device_ops	*ops;
+	struct vfio_group		*group;
+	struct list_head		group_next;
+	void				*device_data;
+};
+
+/**
+ * IOMMU driver registration
+ */
+int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops)
+{
+	struct vfio_iommu_driver *driver, *tmp;
+
+	driver = kzalloc(sizeof(*driver), GFP_KERNEL);
+	if (!driver)
+		return -ENOMEM;
+
+	driver->ops = ops;
+
+	mutex_lock(&vfio.iommu_drivers_lock);
+
+	/* Check for duplicates */
+	list_for_each_entry(tmp, &vfio.iommu_drivers_list, vfio_next) {
+		if (tmp->ops == ops) {
+			mutex_unlock(&vfio.iommu_drivers_lock);
+			kfree(driver);
+			return -EINVAL;
+		}
+	}
+
+	list_add(&driver->vfio_next, &vfio.iommu_drivers_list);
+
+	mutex_unlock(&vfio.iommu_drivers_lock);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(vfio_register_iommu_driver);
+
+void vfio_unregister_iommu_driver(const struct vfio_iommu_driver_ops *ops)
+{
+	struct vfio_iommu_driver *driver;
+
+	mutex_lock(&vfio.iommu_drivers_lock);
+	list_for_each_entry(driver, &vfio.iommu_drivers_list, vfio_next) {
+		if (driver->ops == ops) {
+			list_del(&driver->vfio_next);
+			mutex_unlock(&vfio.iommu_drivers_lock);
+			kfree(driver);
+			return;
+		}
+	}
+	mutex_unlock(&vfio.iommu_drivers_lock);
+}
+EXPORT_SYMBOL_GPL(vfio_unregister_iommu_driver);
+
+/**
+ * Group minor allocation/free - both called with vfio.group_lock held
+ */
+static int vfio_alloc_group_minor(struct vfio_group *group)
+{
+	int ret, minor;
+
+again:
+	if (unlikely(idr_pre_get(&vfio.group_idr, GFP_KERNEL) == 0))
+		return -ENOMEM;
+
+	/* index 0 is used by /dev/vfio/vfio */
+	ret = idr_get_new_above(&vfio.group_idr, group, 1, &minor);
+	if (ret == -EAGAIN)
+		goto again;
+	if (ret || minor > MINORMASK) {
+		if (minor > MINORMASK)
+			idr_remove(&vfio.group_idr, minor);
+		return -ENOSPC;
+	}
+
+	return minor;
+}
+
+static void vfio_free_group_minor(int minor)
+{
+	idr_remove(&vfio.group_idr, minor);
+}
+
+static int vfio_iommu_group_notifier(struct notifier_block *nb,
+				     unsigned long action, void *data);
+static void vfio_group_get(struct vfio_group *group);
+
+/**
+ * Container objects - containers are created when /dev/vfio/vfio is
+ * opened, but their lifecycle extends until the last user is done, so
+ * it's freed via kref.  Must support container/group/device being
+ * closed in any order.
+ */
+static void vfio_container_get(struct vfio_container *container)
+{
+	kref_get(&container->kref);
+}
+
+static void vfio_container_release(struct kref *kref)
+{
+	struct vfio_container *container;
+	container = container_of(kref, struct vfio_container, kref);
+
+	kfree(container);
+}
+
+static void vfio_container_put(struct vfio_container *container)
+{
+	kref_put(&container->kref, vfio_container_release);
+}
+
+/**
+ * Group objects - create, release, get, put, search
+ */
+static struct vfio_group *vfio_create_group(struct iommu_group *iommu_group)
+{
+	struct vfio_group *group, *tmp;
+	struct device *dev;
+	int ret, minor;
+
+	group = kzalloc(sizeof(*group), GFP_KERNEL);
+	if (!group)
+		return ERR_PTR(-ENOMEM);
+
+	kref_init(&group->kref);
+	INIT_LIST_HEAD(&group->device_list);
+	mutex_init(&group->device_lock);
+	atomic_set(&group->container_users, 0);
+	group->iommu_group = iommu_group;
+
+	group->nb.notifier_call = vfio_iommu_group_notifier;
+
+	/*
+	 * Blocking notifiers acquire an rwsem around registering and hold
+	 * it across the callback.  Therefore, we need to register outside of
+	 * vfio.group_lock to avoid A-B/B-A contention.  Our callback won't
+	 * do anything unless it can find the group in vfio.group_list, so
+	 * no harm in registering early.
+	 */
+	ret = iommu_group_register_notifier(iommu_group, &group->nb);
+	if (ret) {
+		kfree(group);
+		return ERR_PTR(ret);
+	}
+
+	mutex_lock(&vfio.group_lock);
+
+	minor = vfio_alloc_group_minor(group);
+	if (minor < 0) {
+		mutex_unlock(&vfio.group_lock);
+		kfree(group);
+		return ERR_PTR(minor);
+	}
+
+	/* Did we race creating this group? */
+	list_for_each_entry(tmp, &vfio.group_list, vfio_next) {
+		if (tmp->iommu_group == iommu_group) {
+			vfio_group_get(tmp);
+			vfio_free_group_minor(minor);
+			mutex_unlock(&vfio.group_lock);
+			kfree(group);
+			return tmp;
+		}
+	}
+
+	dev = device_create(vfio.class, NULL, MKDEV(MAJOR(vfio.devt), minor),
+			    group, "%d", iommu_group_id(iommu_group));
+	if (IS_ERR(dev)) {
+		vfio_free_group_minor(minor);
+		mutex_unlock(&vfio.group_lock);
+		kfree(group);
+		return (struct vfio_group *)dev; /* ERR_PTR */
+	}
+
+	group->minor = minor;
+	group->dev = dev;
+
+	list_add(&group->vfio_next, &vfio.group_list);
+
+	mutex_unlock(&vfio.group_lock);
+
+	return group;
+}
+
+static void vfio_group_release(struct kref *kref)
+{
+	struct vfio_group *group = container_of(kref, struct vfio_group, kref);
+
+	WARN_ON(!list_empty(&group->device_list));
+
+	device_destroy(vfio.class, MKDEV(MAJOR(vfio.devt), group->minor));
+	list_del(&group->vfio_next);
+	vfio_free_group_minor(group->minor);
+
+	mutex_unlock(&vfio.group_lock);
+
+	/*
+	 * Unregister outside of lock.  A spurious callback is harmless now
+	 * that the group is no longer in vfio.group_list.
+	 */
+	iommu_group_unregister_notifier(group->iommu_group, &group->nb);
+
+	kfree(group);
+}
+
+static void vfio_group_put(struct vfio_group *group)
+{
+	mutex_lock(&vfio.group_lock);
+	/*
+	 * Release needs to unlock to unregister the notifier, so only
+	 * unlock if not released.
+	 */
+	if (!kref_put(&group->kref, vfio_group_release))
+		mutex_unlock(&vfio.group_lock);
+}
+
+/* Assume group_lock or group reference is held */
+static void vfio_group_get(struct vfio_group *group)
+{
+	kref_get(&group->kref);
+}
+
+/*
+ * Not really a "try" as we will sleep on the mutex, but we need to
+ * make sure the group pointer is valid under lock and get a reference.
+ */
+static struct vfio_group *vfio_group_try_get(struct vfio_group *group)
+{
+	struct vfio_group *target = group;
+
+	mutex_lock(&vfio.group_lock);
+	list_for_each_entry(group, &vfio.group_list, vfio_next) {
+		if (group == target) {
+			vfio_group_get(group);
+			mutex_unlock(&vfio.group_lock);
+			return group;
+		}
+	}
+	mutex_unlock(&vfio.group_lock);
+
+	return NULL;
+}
+
+static
+struct vfio_group *vfio_group_get_from_iommu(struct iommu_group *iommu_group)
+{
+	struct vfio_group *group;
+
+	mutex_lock(&vfio.group_lock);
+	list_for_each_entry(group, &vfio.group_list, vfio_next) {
+		if (group->iommu_group == iommu_group) {
+			vfio_group_get(group);
+			mutex_unlock(&vfio.group_lock);
+			return group;
+		}
+	}
+	mutex_unlock(&vfio.group_lock);
+
+	return NULL;
+}
+
+static struct vfio_group *vfio_group_get_from_minor(int minor)
+{
+	struct vfio_group *group;
+
+	mutex_lock(&vfio.group_lock);
+	group = idr_find(&vfio.group_idr, minor);
+	if (!group) {
+		mutex_unlock(&vfio.group_lock);
+		return NULL;
+	}
+	vfio_group_get(group);
+	mutex_unlock(&vfio.group_lock);
+
+	return group;
+}
+
+/**
+ * Device objects - create, release, get, put, search
+ */
+static
+struct vfio_device *vfio_group_create_device(struct vfio_group *group,
+					     struct device *dev,
+					     const struct vfio_device_ops *ops,
+					     void *device_data)
+{
+	struct vfio_device *device;
+	int ret;
+
+	device = kzalloc(sizeof(*device), GFP_KERNEL);
+	if (!device)
+		return ERR_PTR(-ENOMEM);
+
+	kref_init(&device->kref);
+	device->dev = dev;
+	device->group = group;
+	device->ops = ops;
+	device->device_data = device_data;
+
+	ret = dev_set_drvdata(dev, device);
+	if (ret) {
+		kfree(device);
+		return ERR_PTR(ret);
+	}
+
+	/* No need to get group_lock, caller has group reference */
+	vfio_group_get(group);
+
+	mutex_lock(&group->device_lock);
+	list_add(&device->group_next, &group->device_list);
+	mutex_unlock(&group->device_lock);
+
+	return device;
+}
+
+static void vfio_device_release(struct kref *kref)
+{
+	struct vfio_device *device = container_of(kref,
+						  struct vfio_device, kref);
+	struct vfio_group *group = device->group;
+
+	mutex_lock(&group->device_lock);
+	list_del(&device->group_next);
+	mutex_unlock(&group->device_lock);
+
+	dev_set_drvdata(device->dev, NULL);
+
+	kfree(device);
+
+	/* vfio_del_group_dev may be waiting for this device */
+	wake_up(&vfio.release_q);
+}
+
+/* Device reference always implies a group reference */
+static void vfio_device_put(struct vfio_device *device)
+{
+	kref_put(&device->kref, vfio_device_release);
+	vfio_group_put(device->group);
+}
+
+static void vfio_device_get(struct vfio_device *device)
+{
+	vfio_group_get(device->group);
+	kref_get(&device->kref);
+}
+
+static struct vfio_device *vfio_group_get_device(struct vfio_group *group,
+						 struct device *dev)
+{
+	struct vfio_device *device;
+
+	mutex_lock(&group->device_lock);
+	list_for_each_entry(device, &group->device_list, group_next) {
+		if (device->dev == dev) {
+			vfio_device_get(device);
+			mutex_unlock(&group->device_lock);
+			return device;
+		}
+	}
+	mutex_unlock(&group->device_lock);
+	return NULL;
+}
+
+/**
+ * Async device support
+ */
+static int vfio_group_nb_add_dev(struct vfio_group *group, struct device *dev)
+{
+	struct vfio_device *device;
+
+	/* Do we already know about it?  We shouldn't */
+	device = vfio_group_get_device(group, dev);
+	if (WARN_ON_ONCE(device)) {
+		vfio_device_put(device);
+		return 0;
+	}
+
+	/* Nothing to do for idle groups */
+	if (!atomic_read(&group->container_users))
+		return 0;
+
+	/* TODO Prevent device auto probing */
+	WARN(1, "Device %s added to live group %d!\n", dev_name(dev),
+	     iommu_group_id(group->iommu_group));
+
+	return 0;
+}
+
+static int vfio_group_nb_del_dev(struct vfio_group *group, struct device *dev)
+{
+	struct vfio_device *device;
+
+	/*
+	 * Expect to fall out here.  If a device was in use, it would
+	 * have been bound to a vfio sub-driver, which would have blocked
+	 * in .remove at vfio_del_group_dev.  Sanity check that we no
+	 * longer track the device, so it's safe to remove.
+	 */
+	device = vfio_group_get_device(group, dev);
+	if (likely(!device))
+		return 0;
+
+	WARN(1, "Device %s removed from live group %d!\n", dev_name(dev),
+	     iommu_group_id(group->iommu_group));
+
+	vfio_device_put(device);
+	return 0;
+}
+
+static int vfio_group_nb_verify(struct vfio_group *group, struct device *dev)
+{
+	struct vfio_device *device;
+
+	/* We don't care what happens when the group isn't in use */
+	if (!atomic_read(&group->container_users))
+		return 0;
+
+	device = vfio_group_get_device(group, dev);
+	if (device)
+		vfio_device_put(device);
+
+	return device ? 0 : -EINVAL;
+}
+
+static int vfio_iommu_group_notifier(struct notifier_block *nb,
+				     unsigned long action, void *data)
+{
+	struct vfio_group *group = container_of(nb, struct vfio_group, nb);
+	struct device *dev = data;
+
+	/*
+	 * Need to go through a group_lock lookup to get a reference or
+	 * we risk racing a group being removed.  Leave a WARN_ON for
+	 * debugging, but if the group no longer exists, a spurious notify
+	 * is harmless.
+	 */
+	group = vfio_group_try_get(group);
+	if (WARN_ON(!group))
+		return NOTIFY_OK;
+
+	switch (action) {
+	case IOMMU_GROUP_NOTIFY_ADD_DEVICE:
+		vfio_group_nb_add_dev(group, dev);
+		break;
+	case IOMMU_GROUP_NOTIFY_DEL_DEVICE:
+		vfio_group_nb_del_dev(group, dev);
+		break;
+	case IOMMU_GROUP_NOTIFY_BIND_DRIVER:
+		printk(KERN_DEBUG
+		       "%s: Device %s, group %d binding to driver\n", __func__,
+		      dev_name(dev), iommu_group_id(group->iommu_group));
+		break;
+	case IOMMU_GROUP_NOTIFY_BOUND_DRIVER:
+		printk(KERN_DEBUG
+		       "%s: Device %s, group %d bound to driver %s\n", __func__,
+		       dev_name(dev), iommu_group_id(group->iommu_group),
+		       dev->driver->name);
+		BUG_ON(vfio_group_nb_verify(group, dev));
+		break;
+	case IOMMU_GROUP_NOTIFY_UNBIND_DRIVER:
+		printk(KERN_DEBUG
+		       "%s: Device %s, group %d unbinding from driver %s\n",
+		       __func__, dev_name(dev),
+		       iommu_group_id(group->iommu_group), dev->driver->name);
+		break;
+	case IOMMU_GROUP_NOTIFY_UNBOUND_DRIVER:
+		printk(KERN_DEBUG
+		       "%s: Device %s, group %d unbound from driver\n",
+		       __func__, dev_name(dev),
+		       iommu_group_id(group->iommu_group));
+		/*
+		 * XXX An unbound device in a live group is ok, but we'd
+		 * really like to avoid the above BUG_ON by preventing other
+		 * drivers from binding to it.  Once that occurs, we have to
+		 * stop the system to maintain isolation.  At a minimum, we'd
+		 * want a toggle to disable driver auto probe for this device.
+		 */
+		break;
+	}
+
+	vfio_group_put(group);
+	return NOTIFY_OK;
+}
+
+/**
+ * VFIO driver API
+ */
+int vfio_add_group_dev(struct device *dev,
+		       const struct vfio_device_ops *ops, void *device_data)
+{
+	struct iommu_group *iommu_group;
+	struct vfio_group *group;
+	struct vfio_device *device;
+
+	iommu_group = iommu_group_get(dev);
+	if (!iommu_group)
+		return -EINVAL;
+
+	group = vfio_group_get_from_iommu(iommu_group);
+	if (!group) {
+		group = vfio_create_group(iommu_group);
+		if (IS_ERR(group)) {
+			iommu_group_put(iommu_group);
+			return PTR_ERR(group);
+		}
+	}
+
+	device = vfio_group_get_device(group, dev);
+	if (device) {
+		WARN(1, "Device %s already exists on group %d\n",
+		     dev_name(dev), iommu_group_id(iommu_group));
+		vfio_device_put(device);
+		vfio_group_put(group);
+		iommu_group_put(iommu_group);
+		return -EBUSY;
+	}
+
+	device = vfio_group_create_device(group, dev, ops, device_data);
+	if (IS_ERR(device)) {
+		vfio_group_put(group);
+		iommu_group_put(iommu_group);
+		return PTR_ERR(device);
+	}
+
+	/*
+	 * Added device holds reference to iommu_group and vfio_device
+	 * (which in turn holds reference to vfio_group).  Drop extra
+	 * group reference used while acquiring device.
+	 */
+	vfio_group_put(group);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(vfio_add_group_dev);
+
+/* Test whether a struct device is present in our tracking */
+static bool vfio_dev_present(struct device *dev)
+{
+	struct iommu_group *iommu_group;
+	struct vfio_group *group;
+	struct vfio_device *device;
+
+	iommu_group = iommu_group_get(dev);
+	if (!iommu_group)
+		return false;
+
+	group = vfio_group_get_from_iommu(iommu_group);
+	if (!group) {
+		iommu_group_put(iommu_group);
+		return false;
+	}
+
+	device = vfio_group_get_device(group, dev);
+	if (!device) {
+		vfio_group_put(group);
+		iommu_group_put(iommu_group);
+		return false;
+	}
+
+	vfio_device_put(device);
+	vfio_group_put(group);
+	iommu_group_put(iommu_group);
+	return true;
+}
+
+/*
+ * Decrement the device reference count and wait for the device to be
+ * removed.  Open file descriptors for the device...
+ */
+void *vfio_del_group_dev(struct device *dev)
+{
+	struct vfio_device *device = dev_get_drvdata(dev);
+	struct vfio_group *group = device->group;
+	struct iommu_group *iommu_group = group->iommu_group;
+	void *device_data = device->device_data;
+
+	vfio_device_put(device);
+
+	/* TODO send a signal to encourage this to be released */
+	wait_event(vfio.release_q, !vfio_dev_present(dev));
+
+	iommu_group_put(iommu_group);
+
+	return device_data;
+}
+EXPORT_SYMBOL_GPL(vfio_del_group_dev);
+
+/**
+ * VFIO base fd, /dev/vfio/vfio
+ */
+static long vfio_ioctl_check_extension(struct vfio_container *container,
+				       unsigned long arg)
+{
+	struct vfio_iommu_driver *driver = container->iommu_driver;
+	long ret = 0;
+
+	switch (arg) {
+		/* No base extensions yet */
+	default:
+		/*
+		 * If no driver is set, poll all registered drivers for
+		 * extensions and return the first positive result.  If
+		 * a driver is already set, further queries will be passed
+		 * only to that driver.
+		 */
+		if (!driver) {
+			mutex_lock(&vfio.iommu_drivers_lock);
+			list_for_each_entry(driver, &vfio.iommu_drivers_list,
+					    vfio_next) {
+				if (!try_module_get(driver->ops->owner))
+					continue;
+
+				ret = driver->ops->ioctl(NULL,
+							 VFIO_CHECK_EXTENSION,
+							 arg);
+				module_put(driver->ops->owner);
+				if (ret > 0)
+					break;
+			}
+			mutex_unlock(&vfio.iommu_drivers_lock);
+		} else
+			ret = driver->ops->ioctl(container->iommu_data,
+						 VFIO_CHECK_EXTENSION, arg);
+	}
+
+	return ret;
+}
+
+/* hold container->group_lock */
+static int __vfio_container_attach_groups(struct vfio_container *container,
+					  struct vfio_iommu_driver *driver,
+					  void *data)
+{
+	struct vfio_group *group;
+	int ret = -ENODEV;
+
+	list_for_each_entry(group, &container->group_list, container_next) {
+		ret = driver->ops->attach_group(data, group->iommu_group);
+		if (ret)
+			goto unwind;
+	}
+
+	return ret;
+
+unwind:
+	list_for_each_entry_continue_reverse(group, &container->group_list,
+					     container_next) {
+		driver->ops->detach_group(data, group->iommu_group);
+	}
+
+	return ret;
+}
+
+static long vfio_ioctl_set_iommu(struct vfio_container *container,
+				 unsigned long arg)
+{
+	struct vfio_iommu_driver *driver;
+	long ret = -ENODEV;
+
+	mutex_lock(&container->group_lock);
+
+	/*
+	 * The container is designed to be an unprivileged interface while
+	 * the group can be assigned to specific users.  Therefore, only by
+	 * adding a group to a container does the user get the privilege of
+	 * enabling the iommu, which may allocate finite resources.  There
+	 * is no unset_iommu, but by removing all the groups from a container,
+	 * the container is deprivileged and returns to an unset state.
+	 */
+	if (list_empty(&container->group_list) || container->iommu_driver) {
+		mutex_unlock(&container->group_lock);
+		return -EINVAL;
+	}
+
+	mutex_lock(&vfio.iommu_drivers_lock);
+	list_for_each_entry(driver, &vfio.iommu_drivers_list, vfio_next) {
+		void *data;
+
+		if (!try_module_get(driver->ops->owner))
+			continue;
+
+		/*
+		 * The arg magic for SET_IOMMU is the same as CHECK_EXTENSION,
+		 * so test which iommu driver reported support for this
+		 * extension and call open on them.  We also pass them the
+		 * magic, allowing a single driver to support multiple
+		 * interfaces if they'd like.
+		 */
+		if (driver->ops->ioctl(NULL, VFIO_CHECK_EXTENSION, arg) <= 0) {
+			module_put(driver->ops->owner);
+			continue;
+		}
+
+		/* module reference holds the driver we're working on */
+		mutex_unlock(&vfio.iommu_drivers_lock);
+
+		data = driver->ops->open(arg);
+		if (IS_ERR(data)) {
+			ret = PTR_ERR(data);
+			goto skip_drivers_unlock;
+		}
+
+		ret = __vfio_container_attach_groups(container, driver, data);
+		if (!ret) {
+			container->iommu_driver = driver;
+			container->iommu_data = data;
+		} else
+			driver->ops->release(data);
+
+		goto skip_drivers_unlock;
+	}
+
+	mutex_unlock(&vfio.iommu_drivers_lock);
+skip_drivers_unlock:
+	mutex_unlock(&container->group_lock);
+
+	return ret;
+}
+
+static long vfio_fops_unl_ioctl(struct file *filep,
+				unsigned int cmd, unsigned long arg)
+{
+	struct vfio_container *container = filep->private_data;
+	struct vfio_iommu_driver *driver;
+	void *data;
+	long ret = -EINVAL;
+
+	if (!container)
+		return ret;
+
+	driver = container->iommu_driver;
+	data = container->iommu_data;
+
+	switch (cmd) {
+	case VFIO_GET_API_VERSION:
+		ret = VFIO_API_VERSION;
+		break;
+	case VFIO_CHECK_EXTENSION:
+		ret = vfio_ioctl_check_extension(container, arg);
+		break;
+	case VFIO_SET_IOMMU:
+		ret = vfio_ioctl_set_iommu(container, arg);
+		break;
+	default:
+		if (driver) /* passthrough all unrecognized ioctls */
+			ret = driver->ops->ioctl(data, cmd, arg);
+	}
+
+	return ret;
+}
+
+#ifdef CONFIG_COMPAT
+static long vfio_fops_compat_ioctl(struct file *filep,
+				   unsigned int cmd, unsigned long arg)
+{
+	arg = (unsigned long)compat_ptr(arg);
+	return vfio_fops_unl_ioctl(filep, cmd, arg);
+}
+#endif	/* CONFIG_COMPAT */
+
+static int vfio_fops_open(struct inode *inode, struct file *filep)
+{
+	struct vfio_container *container;
+
+	container = kzalloc(sizeof(*container), GFP_KERNEL);
+	if (!container)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&container->group_list);
+	mutex_init(&container->group_lock);
+	kref_init(&container->kref);
+
+	filep->private_data = container;
+
+	return 0;
+}
+
+static int vfio_fops_release(struct inode *inode, struct file *filep)
+{
+	struct vfio_container *container = filep->private_data;
+
+	filep->private_data = NULL;
+
+	vfio_container_put(container);
+
+	return 0;
+}
+
+/*
+ * Once an iommu driver is set, we optionally pass read/write/mmap
+ * on to the driver, allowing management interfaces beyond ioctl.
+ */
+static ssize_t vfio_fops_read(struct file *filep, char __user *buf,
+			      size_t count, loff_t *ppos)
+{
+	struct vfio_container *container = filep->private_data;
+	struct vfio_iommu_driver *driver = container->iommu_driver;
+
+	if (unlikely(!driver || !driver->ops->read))
+		return -EINVAL;
+
+	return driver->ops->read(container->iommu_data, buf, count, ppos);
+}
+
+static ssize_t vfio_fops_write(struct file *filep, const char __user *buf,
+			       size_t count, loff_t *ppos)
+{
+	struct vfio_container *container = filep->private_data;
+	struct vfio_iommu_driver *driver = container->iommu_driver;
+
+	if (unlikely(!driver || !driver->ops->write))
+		return -EINVAL;
+
+	return driver->ops->write(container->iommu_data, buf, count, ppos);
+}
+
+static int vfio_fops_mmap(struct file *filep, struct vm_area_struct *vma)
+{
+	struct vfio_container *container = filep->private_data;
+	struct vfio_iommu_driver *driver = container->iommu_driver;
+
+	if (unlikely(!driver || !driver->ops->mmap))
+		return -EINVAL;
+
+	return driver->ops->mmap(container->iommu_data, vma);
+}
+
+static const struct file_operations vfio_fops = {
+	.owner		= THIS_MODULE,
+	.open		= vfio_fops_open,
+	.release	= vfio_fops_release,
+	.read		= vfio_fops_read,
+	.write		= vfio_fops_write,
+	.unlocked_ioctl	= vfio_fops_unl_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= vfio_fops_compat_ioctl,
+#endif
+	.mmap		= vfio_fops_mmap,
+};
+
+/**
+ * VFIO Group fd, /dev/vfio/$GROUP
+ */
+static void __vfio_group_unset_container(struct vfio_group *group)
+{
+	struct vfio_container *container = group->container;
+	struct vfio_iommu_driver *driver;
+
+	mutex_lock(&container->group_lock);
+
+	driver = container->iommu_driver;
+	if (driver)
+		driver->ops->detach_group(container->iommu_data,
+					  group->iommu_group);
+
+	group->container = NULL;
+	list_del(&group->container_next);
+
+	/* Detaching the last group deprivileges a container, remove iommu */
+	if (driver && list_empty(&container->group_list)) {
+		driver->ops->release(container->iommu_data);
+		module_put(driver->ops->owner);
+		container->iommu_driver = NULL;
+		container->iommu_data = NULL;
+	}
+
+	mutex_unlock(&container->group_lock);
+
+	vfio_container_put(container);
+}
+
+/*
+ * VFIO_GROUP_UNSET_CONTAINER should fail if there are other users or
+ * if there was no container to unset.  Since the ioctl is called on
+ * the group, we know the group still exists, therefore the only valid
+ * transition here is 1->0.
+ */
+static int vfio_group_unset_container(struct vfio_group *group)
+{
+	int users = atomic_cmpxchg(&group->container_users, 1, 0);
+
+	if (!users)
+		return -EINVAL;
+	if (users != 1)
+		return -EBUSY;
+
+	__vfio_group_unset_container(group);
+
+	return 0;
+}
+
+/*
+ * When removing container users, anything that removes the last user
+ * implicitly removes the group from the container.  That is, once the
+ * group file descriptor and any device file descriptors are closed,
+ * the group is free.
+ */
+static void vfio_group_try_dissolve_container(struct vfio_group *group)
+{
+	if (0 == atomic_dec_if_positive(&group->container_users))
+		__vfio_group_unset_container(group);
+}
+
+static int vfio_group_set_container(struct vfio_group *group, int container_fd)
+{
+	struct file *filep;
+	struct vfio_container *container;
+	struct vfio_iommu_driver *driver;
+	int ret = 0;
+
+	if (atomic_read(&group->container_users))
+		return -EINVAL;
+
+	filep = fget(container_fd);
+	if (!filep)
+		return -EBADF;
+
+	/* Sanity check, is this really our fd? */
+	if (filep->f_op != &vfio_fops) {
+		fput(filep);
+		return -EINVAL;
+	}
+
+	container = filep->private_data;
+	WARN_ON(!container); /* fget ensures we don't race vfio_release */
+
+	mutex_lock(&container->group_lock);
+
+	driver = container->iommu_driver;
+	if (driver) {
+		ret = driver->ops->attach_group(container->iommu_data,
+						group->iommu_group);
+		if (ret)
+			goto unlock_out;
+	}
+
+	group->container = container;
+	list_add(&group->container_next, &container->group_list);
+
+	/* Get a reference on the container and mark a user within the group */
+	vfio_container_get(container);
+	atomic_inc(&group->container_users);
+
+unlock_out:
+	mutex_unlock(&container->group_lock);
+	fput(filep);
+	if (ret)
+		vfio_container_put(container);
+
+	return ret;
+}
+
+/*
+ * A vfio group is viable for use by userspace if all devices are either
+ * driver-less or bound to a vfio driver.  We test the latter by the
+ * existence of a struct vfio_device matching the dev.
+ */
+static int vfio_dev_viable(struct device *dev, void *data)
+{
+	struct vfio_group *group = data;
+	struct vfio_device *device;
+
+	if (!dev->driver)
+		return 0;
+
+	device = vfio_group_get_device(group, dev);
+	vfio_device_put(device);
+
+	if (!device)
+		return -EINVAL;
+
+	return 0;
+}
+
+static bool vfio_group_viable(struct vfio_group *group)
+{
+	return (iommu_group_for_each_dev(group->iommu_group,
+					 group, vfio_dev_viable) == 0);
+}
+
+static const struct file_operations vfio_device_fops;
+
+static int vfio_group_get_device_fd(struct vfio_group *group, char *buf)
+{
+	struct vfio_device *device;
+	struct file *filep;
+	int ret = -ENODEV;
+
+	if (0 == atomic_read(&group->container_users) ||
+	    !group->container->iommu_driver || !vfio_group_viable(group))
+		return -EINVAL;
+
+	mutex_lock(&group->device_lock);
+	list_for_each_entry(device, &group->device_list, group_next) {
+		if (strcmp(dev_name(device->dev), buf))
+			continue;
+
+		ret = device->ops->open(device->device_data);
+		if (ret)
+			break;
+		/*
+		 * We can't use anon_inode_getfd() because we need to modify
+		 * the f_mode flags directly to allow more than just ioctls
+		 */
+		ret = get_unused_fd();
+		if (ret < 0) {
+			device->ops->release(device->device_data);
+			break;
+		}
+
+		filep = anon_inode_getfile("[vfio-device]", &vfio_device_fops,
+					   device, O_RDWR);
+		if (IS_ERR(filep)) {
+			put_unused_fd(ret);
+			ret = PTR_ERR(filep);
+			device->ops->release(device->device_data);
+			break;
+		}
+
+		/*
+		 * TODO: add an anon_inode interface to do this.
+		 * Appears to be missing by lack of need rather than
+		 * explicitly prevented.  Now there's need.
+		 */
+		filep->f_mode |= (FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE);
+
+		fd_install(ret, filep);
+
+		vfio_device_get(device);
+		atomic_inc(&group->container_users);
+		break;
+	}
+	mutex_unlock(&group->device_lock);
+
+	return ret;
+}
+
+static long vfio_group_fops_unl_ioctl(struct file *filep,
+				      unsigned int cmd, unsigned long arg)
+{
+	struct vfio_group *group = filep->private_data;
+	long ret = -ENOTTY;
+
+	switch (cmd) {
+	case VFIO_GROUP_GET_STATUS:
+	{
+		struct vfio_group_status status;
+		unsigned long minsz;
+
+		minsz = offsetofend(struct vfio_group_status, flags);
+
+		if (copy_from_user(&status, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (status.argsz < minsz)
+			return -EINVAL;
+
+		status.flags = 0;
+
+		if (vfio_group_viable(group))
+			status.flags |= VFIO_GROUP_FLAGS_VIABLE;
+
+		if (group->container)
+			status.flags |= VFIO_GROUP_FLAGS_CONTAINER_SET;
+
+		if (copy_to_user((void __user *)arg, &status, minsz))
+			return -EFAULT;
+
+		ret = 0;
+
+		break;
+	}
+	case VFIO_GROUP_SET_CONTAINER:
+	{
+		int fd;
+
+		if (get_user(fd, (int __user *)arg))
+			return -EFAULT;
+
+		if (fd < 0)
+			return -EINVAL;
+
+		ret = vfio_group_set_container(group, fd);
+		break;
+	}
+	case VFIO_GROUP_UNSET_CONTAINER:
+		ret = vfio_group_unset_container(group);
+		break;
+	case VFIO_GROUP_GET_DEVICE_FD:
+	{
+		char *buf;
+
+		buf = strndup_user((const char __user *)arg, PAGE_SIZE);
+		if (IS_ERR(buf))
+			return PTR_ERR(buf);
+
+		ret = vfio_group_get_device_fd(group, buf);
+		kfree(buf);
+		break;
+	}
+	}
+
+	return ret;
+}
+
+#ifdef CONFIG_COMPAT
+static long vfio_group_fops_compat_ioctl(struct file *filep,
+					 unsigned int cmd, unsigned long arg)
+{
+	arg = (unsigned long)compat_ptr(arg);
+	return vfio_group_fops_unl_ioctl(filep, cmd, arg);
+}
+#endif	/* CONFIG_COMPAT */
+
+static int vfio_group_fops_open(struct inode *inode, struct file *filep)
+{
+	struct vfio_group *group;
+
+	group = vfio_group_get_from_minor(iminor(inode));
+	if (!group)
+		return -ENODEV;
+
+	if (group->container) {
+		vfio_group_put(group);
+		return -EBUSY;
+	}
+
+	filep->private_data = group;
+
+	return 0;
+}
+
+static int vfio_group_fops_release(struct inode *inode, struct file *filep)
+{
+	struct vfio_group *group = filep->private_data;
+
+	filep->private_data = NULL;
+
+	vfio_group_try_dissolve_container(group);
+
+	vfio_group_put(group);
+
+	return 0;
+}
+
+static const struct file_operations vfio_group_fops = {
+	.owner		= THIS_MODULE,
+	.unlocked_ioctl	= vfio_group_fops_unl_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= vfio_group_fops_compat_ioctl,
+#endif
+	.open		= vfio_group_fops_open,
+	.release	= vfio_group_fops_release,
+};
+
+/**
+ * VFIO Device fd
+ */
+static int vfio_device_fops_release(struct inode *inode, struct file *filep)
+{
+	struct vfio_device *device = filep->private_data;
+
+	device->ops->release(device->device_data);
+
+	vfio_group_try_dissolve_container(device->group);
+
+	vfio_device_put(device);
+
+	return 0;
+}
+
+static long vfio_device_fops_unl_ioctl(struct file *filep,
+				       unsigned int cmd, unsigned long arg)
+{
+	struct vfio_device *device = filep->private_data;
+
+	if (unlikely(!device->ops->ioctl))
+		return -EINVAL;
+
+	return device->ops->ioctl(device->device_data, cmd, arg);
+}
+
+static ssize_t vfio_device_fops_read(struct file *filep, char __user *buf,
+				     size_t count, loff_t *ppos)
+{
+	struct vfio_device *device = filep->private_data;
+
+	if (unlikely(!device->ops->read))
+		return -EINVAL;
+
+	return device->ops->read(device->device_data, buf, count, ppos);
+}
+
+static ssize_t vfio_device_fops_write(struct file *filep,
+				      const char __user *buf,
+				      size_t count, loff_t *ppos)
+{
+	struct vfio_device *device = filep->private_data;
+
+	if (unlikely(!device->ops->write))
+		return -EINVAL;
+
+	return device->ops->write(device->device_data, buf, count, ppos);
+}
+
+static int vfio_device_fops_mmap(struct file *filep, struct vm_area_struct *vma)
+{
+	struct vfio_device *device = filep->private_data;
+
+	if (unlikely(!device->ops->mmap))
+		return -EINVAL;
+
+	return device->ops->mmap(device->device_data, vma);
+}
+
+#ifdef CONFIG_COMPAT
+static long vfio_device_fops_compat_ioctl(struct file *filep,
+					  unsigned int cmd, unsigned long arg)
+{
+	arg = (unsigned long)compat_ptr(arg);
+	return vfio_device_fops_unl_ioctl(filep, cmd, arg);
+}
+#endif	/* CONFIG_COMPAT */
+
+static const struct file_operations vfio_device_fops = {
+	.owner		= THIS_MODULE,
+	.release	= vfio_device_fops_release,
+	.read		= vfio_device_fops_read,
+	.write		= vfio_device_fops_write,
+	.unlocked_ioctl	= vfio_device_fops_unl_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= vfio_device_fops_compat_ioctl,
+#endif
+	.mmap		= vfio_device_fops_mmap,
+};
+
+/**
+ * Module/class support
+ */
+static char *vfio_devnode(struct device *dev, umode_t *mode)
+{
+	return kasprintf(GFP_KERNEL, "vfio/%s", dev_name(dev));
+}
+
+static int __init vfio_init(void)
+{
+	int ret;
+
+	idr_init(&vfio.group_idr);
+	mutex_init(&vfio.group_lock);
+	mutex_init(&vfio.iommu_drivers_lock);
+	INIT_LIST_HEAD(&vfio.group_list);
+	INIT_LIST_HEAD(&vfio.iommu_drivers_list);
+	init_waitqueue_head(&vfio.release_q);
+
+	vfio.class = class_create(THIS_MODULE, "vfio");
+	if (IS_ERR(vfio.class)) {
+		ret = PTR_ERR(vfio.class);
+		goto err_class;
+	}
+
+	vfio.class->devnode = vfio_devnode;
+
+	ret = alloc_chrdev_region(&vfio.devt, 0, MINORMASK, "vfio");
+	if (ret)
+		goto err_base_chrdev;
+
+	cdev_init(&vfio.cdev, &vfio_fops);
+	ret = cdev_add(&vfio.cdev, vfio.devt, 1);
+	if (ret)
+		goto err_base_cdev;
+
+	vfio.dev = device_create(vfio.class, NULL, vfio.devt, NULL, "vfio");
+	if (IS_ERR(vfio.dev)) {
+		ret = PTR_ERR(vfio.dev);
+		goto err_base_dev;
+	}
+
+	/* /dev/vfio/$GROUP */
+	cdev_init(&vfio.group_cdev, &vfio_group_fops);
+	ret = cdev_add(&vfio.group_cdev,
+		       MKDEV(MAJOR(vfio.devt), 1), MINORMASK - 1);
+	if (ret)
+		goto err_groups_cdev;
+
+	pr_info(DRIVER_DESC " version: " DRIVER_VERSION "\n");
+
+	return 0;
+
+err_groups_cdev:
+	device_destroy(vfio.class, vfio.devt);
+err_base_dev:
+	cdev_del(&vfio.cdev);
+err_base_cdev:
+	unregister_chrdev_region(vfio.devt, MINORMASK);
+err_base_chrdev:
+	class_destroy(vfio.class);
+	vfio.class = NULL;
+err_class:
+	return ret;
+}
+
+static void __exit vfio_cleanup(void)
+{
+	WARN_ON(!list_empty(&vfio.group_list));
+
+	idr_destroy(&vfio.group_idr);
+	cdev_del(&vfio.group_cdev);
+	device_destroy(vfio.class, vfio.devt);
+	cdev_del(&vfio.cdev);
+	unregister_chrdev_region(vfio.devt, MINORMASK);
+	class_destroy(vfio.class);
+	vfio.class = NULL;
+}
+
+module_init(vfio_init);
+module_exit(vfio_cleanup);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
new file mode 100644
index 0000000..a264054
--- /dev/null
+++ b/include/linux/vfio.h
@@ -0,0 +1,364 @@
+/*
+ * VFIO API definition
+ *
+ * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
+ *     Author: Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#ifndef VFIO_H
+#define VFIO_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+#define VFIO_API_VERSION	0
+
+#ifdef __KERNEL__	/* Internal VFIO-core/bus driver API */
+
+/**
+ * struct vfio_device_ops - VFIO bus driver device callbacks
+ *
+ * @open: Called when userspace creates new file descriptor for device
+ * @release: Called when userspace releases file descriptor for device
+ * @read: Perform read(2) on device file descriptor
+ * @write: Perform write(2) on device file descriptor
+ * @ioctl: Perform ioctl(2) on device file descriptor, supporting VFIO_DEVICE_*
+ *         operations documented below
+ * @mmap: Perform mmap(2) on a region of the device file descriptor
+ */
+struct vfio_device_ops {
+	char	*name;
+	int	(*open)(void *device_data);
+	void	(*release)(void *device_data);
+	ssize_t	(*read)(void *device_data, char __user *buf,
+			size_t count, loff_t *ppos);
+	ssize_t	(*write)(void *device_data, const char __user *buf,
+			 size_t count, loff_t *size);
+	long	(*ioctl)(void *device_data, unsigned int cmd,
+			 unsigned long arg);
+	int	(*mmap)(void *device_data, struct vm_area_struct *vma);
+};
+
+extern int vfio_add_group_dev(struct device *dev,
+			      const struct vfio_device_ops *ops,
+			      void *device_data);
+
+extern void *vfio_del_group_dev(struct device *dev);
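
A bus driver is expected to pair these two calls around its own probe
and remove paths.  A minimal sketch, with hypothetical foo_* names
(vfio-pci, added later in this series, is the real consumer of this
interface):

	#include <linux/device.h>
	#include <linux/slab.h>
	#include <linux/vfio.h>

	struct foo_vfio_device {		/* hypothetical per-device state */
		struct device	*dev;
	};

	static const struct vfio_device_ops foo_vfio_ops = {
		.name = "foo-vfio",
		/* real open/release/read/write/ioctl/mmap callbacks here */
	};

	static int foo_vfio_probe(struct device *dev)
	{
		struct foo_vfio_device *vdev;

		vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
		if (!vdev)
			return -ENOMEM;

		vdev->dev = dev;

		/* Fails with -EINVAL if the device has no iommu_group */
		return vfio_add_group_dev(dev, &foo_vfio_ops, vdev);
	}

	static void foo_vfio_remove(struct device *dev)
	{
		/* Blocks until all users of the device are gone */
		struct foo_vfio_device *vdev = vfio_del_group_dev(dev);

		kfree(vdev);
	}
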
+
+/**
+ * struct vfio_iommu_driver_ops - VFIO IOMMU driver callbacks
+ */
+struct vfio_iommu_driver_ops {
+	char		*name;
+	struct module	*owner;
+	void		*(*open)(unsigned long arg);
+	void		(*release)(void *iommu_data);
+	ssize_t		(*read)(void *iommu_data, char __user *buf,
+				size_t count, loff_t *ppos);
+	ssize_t		(*write)(void *iommu_data, const char __user *buf,
+				 size_t count, loff_t *size);
+	long		(*ioctl)(void *iommu_data, unsigned int cmd,
+				 unsigned long arg);
+	int		(*mmap)(void *iommu_data, struct vm_area_struct *vma);
+	int		(*attach_group)(void *iommu_data,
+					struct iommu_group *group);
+	void		(*detach_group)(void *iommu_data,
+					struct iommu_group *group);
+
+};
+
+extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
+
+extern void vfio_unregister_iommu_driver(
+				const struct vfio_iommu_driver_ops *ops);
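
A backend implementation would then register itself at module load,
roughly as below (a sketch with placeholder foo_* callbacks, shown here
only as prototypes; the "x86" backend later in this series is the
in-tree example):

	static void *foo_iommu_open(unsigned long arg);
	static void foo_iommu_release(void *iommu_data);
	static long foo_iommu_ioctl(void *iommu_data,
				    unsigned int cmd, unsigned long arg);
	static int foo_iommu_attach_group(void *iommu_data,
					  struct iommu_group *group);
	static void foo_iommu_detach_group(void *iommu_data,
					   struct iommu_group *group);

	static const struct vfio_iommu_driver_ops foo_iommu_ops = {
		.name		= "foo_iommu",
		.owner		= THIS_MODULE,
		.open		= foo_iommu_open,
		.release	= foo_iommu_release,
		.ioctl		= foo_iommu_ioctl,
		.attach_group	= foo_iommu_attach_group,
		.detach_group	= foo_iommu_detach_group,
	};

	static int __init foo_iommu_init(void)
	{
		/* Returns -EINVAL if these ops are already registered */
		return vfio_register_iommu_driver(&foo_iommu_ops);
	}

	static void __exit foo_iommu_exit(void)
	{
		vfio_unregister_iommu_driver(&foo_iommu_ops);
	}
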
+
+/**
+ * offsetofend(TYPE, MEMBER)
+ *
+ * @TYPE: The type of the structure
+ * @MEMBER: The member within the structure to get the end offset of
+ *
+ * Simple helper macro for dealing with variable sized structures passed
+ * from user space.  This allows us to easily determine if the provided
+ * structure is sized to include various fields.
+ */
+#define offsetofend(TYPE, MEMBER) ({				\
+	TYPE tmp;						\
+	offsetof(TYPE, MEMBER) + sizeof(tmp.MEMBER); })
+
+#endif /* __KERNEL__ */
+
+/* Kernel & User level defines for VFIO IOCTLs. */
+
+/* Extensions */
+
+#define VFIO_X86_IOMMU		1
+
+/*
+ * The IOCTL interface is designed for extensibility by embedding the
+ * structure length (argsz) and flags into structures passed between
+ * kernel and userspace.  We therefore use the _IO() macro for these
+ * defines to avoid implicitly embedding a size into the ioctl request.
+ * As structure fields are added, argsz will increase to match and flag
+ * bits will be defined to indicate additional fields with valid data.
+ * It's *always* the caller's responsibility to indicate the size of
+ * the structure passed by setting argsz appropriately.
+ */
+
+#define VFIO_TYPE	(';')
+#define VFIO_BASE	100
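
To make the convention concrete, a userspace caller always sizes the
structure it passes, e.g. (a sketch, assuming an open VFIO device fd
and using struct vfio_device_info defined later in this header):

	struct vfio_device_info info = { .argsz = sizeof(info) };

	/* the kernel returns -EINVAL if argsz is smaller than it requires */
	if (ioctl(device_fd, VFIO_DEVICE_GET_INFO, &info) < 0)
		perror("VFIO_DEVICE_GET_INFO");
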
+
+/* -------- IOCTLs for VFIO file descriptor (/dev/vfio/vfio) -------- */
+
+/**
+ * VFIO_GET_API_VERSION - _IO(VFIO_TYPE, VFIO_BASE + 0)
+ *
+ * Report the version of the VFIO API.  This allows us to bump the entire
+ * API version should we later need to add or change features in incompatible
+ * ways.
+ * Return: VFIO_API_VERSION
+ * Availability: Always
+ */
+#define VFIO_GET_API_VERSION		_IO(VFIO_TYPE, VFIO_BASE + 0)
+
+/**
+ * VFIO_CHECK_EXTENSION - _IOW(VFIO_TYPE, VFIO_BASE + 1, __s32)
+ *
+ * Check whether an extension is supported.
+ * Return: 0 if not supported, 1 (or some other positive integer) if supported.
+ * Availability: Always
+ */
+#define VFIO_CHECK_EXTENSION		_IO(VFIO_TYPE, VFIO_BASE + 1)
+
+/**
+ * VFIO_SET_IOMMU - _IOW(VFIO_TYPE, VFIO_BASE + 2, __s32)
+ *
+ * Set the iommu to the given type.  The type must be supported by an
+ * iommu driver as verified by calling CHECK_EXTENSION using the same
+ * type.  A group must be set to this file descriptor before this
+ * ioctl is available.  The IOMMU interfaces enabled by this call are
+ * specific to the value set.
+ * Return: 0 on success, -errno on failure
+ * Availability: When VFIO group attached
+ */
+#define VFIO_SET_IOMMU			_IO(VFIO_TYPE, VFIO_BASE + 2)
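
Taken together, the container-level sequence from userspace looks
roughly like this (a sketch; error handling is elided, these and the
later snippets assume <fcntl.h>, <sys/ioctl.h> and <linux/vfio.h>, and
the SET_IOMMU call only succeeds after a group has been attached, as
described above):

	int container = open("/dev/vfio/vfio", O_RDWR);

	if (ioctl(container, VFIO_GET_API_VERSION) != VFIO_API_VERSION)
		/* unknown API version, bail */;

	if (!ioctl(container, VFIO_CHECK_EXTENSION, VFIO_X86_IOMMU))
		/* no registered backend supports this IOMMU model */;

	/* ... VFIO_GROUP_SET_CONTAINER on a group fd, see below ... */

	ioctl(container, VFIO_SET_IOMMU, VFIO_X86_IOMMU);
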
+
+/* -------- IOCTLs for GROUP file descriptors (/dev/vfio/$GROUP) -------- */
+
+/**
+ * VFIO_GROUP_GET_STATUS - _IOR(VFIO_TYPE, VFIO_BASE + 3,
+ *						struct vfio_group_status)
+ *
+ * Retrieve information about the group.  Fills in provided
+ * struct vfio_group_status.  Caller sets argsz.
+ * Return: 0 on success, -errno on failure.
+ * Availability: Always
+ */
+struct vfio_group_status {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_GROUP_FLAGS_VIABLE		(1 << 0)
+#define VFIO_GROUP_FLAGS_CONTAINER_SET	(1 << 1)
+};
+#define VFIO_GROUP_GET_STATUS		_IO(VFIO_TYPE, VFIO_BASE + 3)
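
For example, a user would typically verify viability before trying to
attach the group anywhere (a sketch; group number 26 is made up):

	struct vfio_group_status status = { .argsz = sizeof(status) };
	int group = open("/dev/vfio/26", O_RDWR);

	ioctl(group, VFIO_GROUP_GET_STATUS, &status);

	if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE))
		/* a device in the group is bound to a non-vfio driver */;
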
+
+/**
+ * VFIO_GROUP_SET_CONTAINER - _IOW(VFIO_TYPE, VFIO_BASE + 4, __s32)
+ *
+ * Set the container for the VFIO group to the open VFIO file
+ * descriptor provided.  Groups may only belong to a single
+ * container.  Containers may, at their discretion, support multiple
+ * groups.  Only when a container is set are all of the interfaces
+ * of the VFIO file descriptor and the VFIO group file descriptor
+ * available to the user.
+ * Return: 0 on success, -errno on failure.
+ * Availability: Always
+ */
+#define VFIO_GROUP_SET_CONTAINER	_IO(VFIO_TYPE, VFIO_BASE + 4)
+
+/**
+ * VFIO_GROUP_UNSET_CONTAINER - _IO(VFIO_TYPE, VFIO_BASE + 5)
+ *
+ * Remove the group from the attached container.  This is the
+ * opposite of the SET_CONTAINER call and returns the group to
+ * an initial state.  All device file descriptors must be released
+ * prior to calling this interface.  When removing the last group
+ * from a container, the IOMMU will be disabled and all state lost,
+ * effectively also returning the VFIO file descriptor to an initial
+ * state.
+ * Return: 0 on success, -errno on failure.
+ * Availability: When attached to container
+ */
+#define VFIO_GROUP_UNSET_CONTAINER	_IO(VFIO_TYPE, VFIO_BASE + 5)
+
+/**
+ * VFIO_GROUP_GET_DEVICE_FD - _IOW(VFIO_TYPE, VFIO_BASE + 6, char)
+ *
+ * Return a new file descriptor for the device object described by
+ * the provided string.  The string should match a device listed in
+ * the devices subdirectory of the IOMMU group sysfs entry.  The
+ * group containing the device must already be added to this context.
+ * Return: new file descriptor on success, -errno on failure.
+ * Availability: When attached to container
+ */
+#define VFIO_GROUP_GET_DEVICE_FD	_IO(VFIO_TYPE, VFIO_BASE + 6)
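
Putting the pieces together, the path from group to device fd is
roughly as follows (a sketch; the group number and PCI device name are
hypothetical, and the Documentation patch in this series walks through
a complete example):

	int container = open("/dev/vfio/vfio", O_RDWR);
	int group = open("/dev/vfio/26", O_RDWR);
	int device;

	/* note: pointer to the container fd, not the fd itself */
	ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);

	ioctl(container, VFIO_SET_IOMMU, VFIO_X86_IOMMU);

	/* name as listed in the group's sysfs devices subdirectory */
	device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");
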
+
+/* --------------- IOCTLs for DEVICE file descriptors --------------- */
+
+/**
+ * VFIO_DEVICE_GET_INFO - _IOR(VFIO_TYPE, VFIO_BASE + 7,
+ *						struct vfio_device_info)
+ *
+ * Retrieve information about the device.  Fills in provided
+ * struct vfio_device_info.  Caller sets argsz.
+ * Return: 0 on success, -errno on failure.
+ */
+struct vfio_device_info {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_DEVICE_FLAGS_RESET	(1 << 0)	/* Device supports reset */
+	__u32	num_regions;	/* Max region index + 1 */
+	__u32	num_irqs;	/* Max IRQ index + 1 */
+};
+#define VFIO_DEVICE_GET_INFO		_IO(VFIO_TYPE, VFIO_BASE + 7)
+
+/**
+ * VFIO_DEVICE_GET_REGION_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 8,
+ *				       struct vfio_region_info)
+ *
+ * Retrieve information about a device region.  Caller provides
+ * struct vfio_region_info with index value set.  Caller sets argsz.
+ * Implementation of region mapping is bus driver specific.  This is
+ * intended to describe MMIO, I/O port, as well as bus specific
+ * regions (ex. PCI config space).  Zero sized regions may be used
+ * to describe unimplemented regions (ex. unimplemented PCI BARs).
+ * Return: 0 on success, -errno on failure.
+ */
+struct vfio_region_info {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_REGION_INFO_FLAG_READ	(1 << 0) /* Region supports read */
+#define VFIO_REGION_INFO_FLAG_WRITE	(1 << 1) /* Region supports write */
+#define VFIO_REGION_INFO_FLAG_MMAP	(1 << 2) /* Region supports mmap */
+	__u32	index;		/* Region index */
+	__u32	resv;		/* Reserved for alignment */
+	__u64	size;		/* Region size (bytes) */
+	__u64	offset;		/* Region offset from start of device fd */
+};
+#define VFIO_DEVICE_GET_REGION_INFO	_IO(VFIO_TYPE, VFIO_BASE + 8)
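
As an example of combining the two calls (a sketch; device is an open
VFIO device fd and error checking is elided):

	struct vfio_device_info dinfo = { .argsz = sizeof(dinfo) };
	struct vfio_region_info rinfo = { .argsz = sizeof(rinfo) };
	__u32 i;

	ioctl(device, VFIO_DEVICE_GET_INFO, &dinfo);

	for (i = 0; i < dinfo.num_regions; i++) {
		rinfo.index = i;
		ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &rinfo);

		if (rinfo.size && (rinfo.flags & VFIO_REGION_INFO_FLAG_READ)) {
			char buf[4];

			/* pread works because the core sets FMODE_PREAD
			 * on the device fd */
			pread(device, buf, sizeof(buf), rinfo.offset);
		}
	}
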
+
+/**
+ * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
+ *				    struct vfio_irq_info)
+ *
+ * Retrieve information about a device IRQ.  Caller provides
+ * struct vfio_irq_info with index value set.  Caller sets argsz.
+ * Implementation of IRQ mapping is bus driver specific.  Indexes
+ * using multiple IRQs are primarily intended to support MSI-like
+ * interrupt blocks.  Zero-count IRQ blocks may be used to describe
+ * unimplemented interrupt types.
+ *
+ * The EVENTFD flag indicates the interrupt index supports eventfd based
+ * signaling.
+ *
+ * The MASKABLE flag indicates the index supports MASK and UNMASK
+ * actions described below.
+ *
+ * AUTOMASKED indicates that after signaling, the interrupt line is
+ * automatically masked by VFIO and the user needs to unmask the line
+ * to receive new interrupts.  This is primarily intended to distinguish
+ * level triggered interrupts.
+ *
+ * The NORESIZE flag indicates that the interrupt lines within the index
+ * are set up as a set and new subindexes cannot be enabled without first
+ * disabling the entire index.  This is used for interrupts like PCI MSI
+ * and MSI-X where the driver may only use a subset of the available
+ * indexes, but VFIO needs to enable a specific number of vectors
+ * upfront.  In the case of MSI-X, where the user can enable MSI-X and
+ * then add and unmask vectors, it's up to userspace to decide whether
+ * to allocate the maximum supported number of vectors upfront or to
+ * tear down the setup and incrementally increase the vectors as each
+ * is enabled.
+ */
+struct vfio_irq_info {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_IRQ_INFO_EVENTFD		(1 << 0)
+#define VFIO_IRQ_INFO_MASKABLE		(1 << 1)
+#define VFIO_IRQ_INFO_AUTOMASKED	(1 << 2)
+#define VFIO_IRQ_INFO_NORESIZE		(1 << 3)
+	__u32	index;		/* IRQ index */
+	__s32	count;		/* Number of IRQs within this index */
+};
+#define VFIO_DEVICE_GET_IRQ_INFO	_IO(VFIO_TYPE, VFIO_BASE + 9)
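
For instance, to discover whether index 0 can be wired to an eventfd
(a sketch; see VFIO_DEVICE_SET_IRQS below for acting on the result):

	struct vfio_irq_info irq = { .argsz = sizeof(irq), .index = 0 };

	ioctl(device, VFIO_DEVICE_GET_IRQ_INFO, &irq);

	if ((irq.flags & VFIO_IRQ_INFO_EVENTFD) && irq.count)
		/* index 0 supports eventfd signaling */;
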
+
+/**
+ * VFIO_DEVICE_SET_IRQS - _IOW(VFIO_TYPE, VFIO_BASE + 10, struct vfio_irq_set)
+ *
+ * Set signaling, masking, and unmasking of interrupts.  Caller provides
+ * struct vfio_irq_set with all fields set.  'start' and 'count' indicate
+ * the range of subindexes being specified.
+ *
+ * The DATA flags specify the type of data provided.  If DATA_NONE, the
+ * operation performs the specified action immediately on the specified
+ * interrupt(s).  For example, to unmask AUTOMASKED interrupt [0,0]:
+ * flags = (DATA_NONE|ACTION_UNMASK), index = 0, start = 0, count = 1.
+ *
+ * DATA_BOOL allows the same actions to be applied sparsely across an
+ * array of interrupts.
+ * For example, to mask interrupts [0,1] and [0,3] (but not [0,2]):
+ * flags = (DATA_BOOL|ACTION_MASK), index = 0, start = 1, count = 3,
+ * data = {1,0,1}
+ *
+ * DATA_EVENTFD binds the specified ACTION to the provided __s32 eventfd.
+ * A value of -1 can be used to either de-assign interrupts if already
+ * assigned or skip un-assigned interrupts.  For example, to set an eventfd
+ * to be signaled for interrupts [0,0] and [0,2]:
+ * flags = (DATA_EVENTFD|ACTION_TRIGGER), index = 0, start = 0, count = 3,
+ * data = {fd1, -1, fd2}
+ * If index [0,1] is previously set, two count = 1 ioctl calls would be
+ * required to set [0,0] and [0,2] without changing [0,1].
+ *
+ * Once a signaling mechanism is set, DATA_BOOL or DATA_NONE can be used
+ * with ACTION_TRIGGER to perform kernel level interrupt loopback testing
+ * from userspace (ie. simulate hardware triggering).
+ *
+ * Setting of an event triggering mechanism to userspace for ACTION_TRIGGER
+ * enables the interrupt index for the device.  Individual subindex interrupts
+ * can be disabled using the -1 value for DATA_EVENTFD or the index can be
+ * disabled as a whole with: flags = (DATA_NONE|ACTION_TRIGGER), count = 0.
+ *
+ * Note that ACTION_[UN]MASK specify user->kernel signaling (irqfds) while
+ * ACTION_TRIGGER specifies kernel->user signaling.
+ */
+struct vfio_irq_set {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_IRQ_SET_DATA_NONE		(1 << 0) /* Data not present */
+#define VFIO_IRQ_SET_DATA_BOOL		(1 << 1) /* Data is bool (u8) */
+#define VFIO_IRQ_SET_DATA_EVENTFD	(1 << 2) /* Data is eventfd (s32) */
+#define VFIO_IRQ_SET_ACTION_MASK	(1 << 3) /* Mask interrupt */
+#define VFIO_IRQ_SET_ACTION_UNMASK	(1 << 4) /* Unmask interrupt */
+#define VFIO_IRQ_SET_ACTION_TRIGGER	(1 << 5) /* Trigger interrupt */
+	__u32	index;
+	__s32	start;
+	__s32	count;
+	__u8	data[];
+};
+#define VFIO_DEVICE_SET_IRQS		_IO(VFIO_TYPE, VFIO_BASE + 10)
+
+#define VFIO_IRQ_SET_DATA_TYPE_MASK	(VFIO_IRQ_SET_DATA_NONE | \
+					 VFIO_IRQ_SET_DATA_BOOL | \
+					 VFIO_IRQ_SET_DATA_EVENTFD)
+#define VFIO_IRQ_SET_ACTION_TYPE_MASK	(VFIO_IRQ_SET_ACTION_MASK | \
+					 VFIO_IRQ_SET_ACTION_UNMASK | \
+					 VFIO_IRQ_SET_ACTION_TRIGGER)
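
For the common DATA_EVENTFD case, the variable-sized structure is
assembled as below (a sketch; one eventfd triggering on subindex 0 of
index 0, error handling elided, additionally assuming <stdlib.h> and
<sys/eventfd.h>):

	struct vfio_irq_set *set;
	size_t sz = sizeof(*set) + sizeof(__s32);
	int efd = eventfd(0, 0);

	set = malloc(sz);
	set->argsz = sz;
	set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
	set->index = 0;
	set->start = 0;
	set->count = 1;
	*(__s32 *)set->data = efd;

	/* enables the index; the eventfd is signaled on each interrupt */
	ioctl(device, VFIO_DEVICE_SET_IRQS, set);
	free(set);
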
+/**
+ * VFIO_DEVICE_RESET - _IO(VFIO_TYPE, VFIO_BASE + 11)
+ *
+ * Reset a device.
+ */
+#define VFIO_DEVICE_RESET		_IO(VFIO_TYPE, VFIO_BASE + 11)
+
+#endif /* VFIO_H */



* [PATCH 07/13] vfio: VFIO core
@ 2012-05-11 22:56   ` Alex Williamson
  0 siblings, 0 replies; 129+ messages in thread
From: Alex Williamson @ 2012-05-11 22:56 UTC (permalink / raw)
  To: benh, aik, david, joerg.roedel, dwmw2
  Cc: kvm, B07421, linux-pci, agraf, qemu-devel, chrisw, B08248, iommu,
	avi, gregkh, bhelgaas, linux-kernel, benve

VFIO is a secure user level driver for use with both virtual machines
and user level drivers.  VFIO makes use of IOMMU groups to ensure the
isolation of devices in use, allowing unprivileged user access.  It's
intended that VFIO will replace KVM device assignment and UIO drivers
(in cases where the target platform includes a sufficiently capable
IOMMU).

New in this version of VFIO is support for IOMMU groups managed
through the IOMMU core as well as a rework of the API, removing the
group merge interface.  We now go back to a model more similar to
original VFIO with UIOMMU support where the file descriptor obtained
from /dev/vfio/vfio allows access to the IOMMU, but only after a
group is added, avoiding the previous privilege issues with this type
of model.  IOMMU support is also now fully modular as IOMMUs have
vastly different interface requirements on different platforms.  VFIO
users are able to query and initialize the IOMMU model of their
choice.

Please see the follow-on Documentation commit for further description
and usage example.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

 Documentation/ioctl/ioctl-number.txt |    1 
 MAINTAINERS                          |    8 
 drivers/Kconfig                      |    2 
 drivers/Makefile                     |    1 
 drivers/vfio/Kconfig                 |    8 
 drivers/vfio/Makefile                |    1 
 drivers/vfio/vfio.c                  | 1399 ++++++++++++++++++++++++++++++++++
 include/linux/vfio.h                 |  364 +++++++++
 8 files changed, 1784 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vfio/Kconfig
 create mode 100644 drivers/vfio/Makefile
 create mode 100644 drivers/vfio/vfio.c
 create mode 100644 include/linux/vfio.h

diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
index e34b531..111e30a 100644
--- a/Documentation/ioctl/ioctl-number.txt
+++ b/Documentation/ioctl/ioctl-number.txt
@@ -88,6 +88,7 @@ Code  Seq#(hex)	Include File		Comments
 		and kernel/power/user.c
 '8'	all				SNP8023 advanced NIC card
 					<mailto:mcr-b3vwm0MSY3FBDgjK7y7TUQ@public.gmane.org>
+';'	64-6F	linux/vfio.h
 '@'	00-0F	linux/radeonfb.h	conflict!
 '@'	00-0F	drivers/video/aty/aty128fb.c	conflict!
 'A'	00-1F	linux/apm_bios.h	conflict!
diff --git a/MAINTAINERS b/MAINTAINERS
index de4e280..48e7600 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7227,6 +7227,14 @@ S:	Maintained
 F:	Documentation/filesystems/vfat.txt
 F:	fs/fat/
 
+VFIO DRIVER
+M:	Alex Williamson <alex.williamson@redhat.com>
+L:	kvm@vger.kernel.org
+S:	Maintained
+F:	Documentation/vfio.txt
+F:	drivers/vfio/
+F:	include/linux/vfio.h
+
 VIDEOBUF2 FRAMEWORK
 M:	Pawel Osciak <pawel@osciak.com>
 M:	Marek Szyprowski <m.szyprowski@samsung.com>
diff --git a/drivers/Kconfig b/drivers/Kconfig
index d236aef..46eb115 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -112,6 +112,8 @@ source "drivers/auxdisplay/Kconfig"
 
 source "drivers/uio/Kconfig"
 
+source "drivers/vfio/Kconfig"
+
 source "drivers/vlynq/Kconfig"
 
 source "drivers/virtio/Kconfig"
diff --git a/drivers/Makefile b/drivers/Makefile
index 95952c8..fe1880a 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -59,6 +59,7 @@ obj-$(CONFIG_ATM)		+= atm/
 obj-$(CONFIG_FUSION)		+= message/
 obj-y				+= firewire/
 obj-$(CONFIG_UIO)		+= uio/
+obj-$(CONFIG_VFIO)		+= vfio/
 obj-y				+= cdrom/
 obj-y				+= auxdisplay/
 obj-$(CONFIG_PCCARD)		+= pcmcia/
diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
new file mode 100644
index 0000000..9acb1e7
--- /dev/null
+++ b/drivers/vfio/Kconfig
@@ -0,0 +1,8 @@
+menuconfig VFIO
+	tristate "VFIO Non-Privileged userspace driver framework"
+	depends on IOMMU_API
+	help
+	  VFIO provides a framework for secure userspace device drivers.
+	  See Documentation/vfio.txt for more details.
+
+	  If you don't know what to do here, say N.
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
new file mode 100644
index 0000000..7500a67
--- /dev/null
+++ b/drivers/vfio/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_VFIO) += vfio.o
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
new file mode 100644
index 0000000..af0e4f8
--- /dev/null
+++ b/drivers/vfio/vfio.c
@@ -0,0 +1,1399 @@
+/*
+ * VFIO core
+ *
+ * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
 *     Author: Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Derived from original vfio:
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
 * Author: Tom Lyon, pugs@cisco.com
+ */
+
+#include <linux/cdev.h>
+#include <linux/compat.h>
+#include <linux/device.h>
+#include <linux/file.h>
+#include <linux/anon_inodes.h>
+#include <linux/fs.h>
+#include <linux/idr.h>
+#include <linux/iommu.h>
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+#include <linux/wait.h>
+
+#define DRIVER_VERSION	"0.3"
+#define DRIVER_AUTHOR	"Alex Williamson <alex.williamson@redhat.com>"
+#define DRIVER_DESC	"VFIO - User Level meta-driver"
+
+static struct vfio {
+	struct class			*class;
+	struct list_head		iommu_drivers_list;
+	struct mutex			iommu_drivers_lock;
+	struct list_head		group_list;
+	struct idr			group_idr;
+	struct mutex			group_lock;
+	struct cdev			group_cdev;
+	struct device			*dev;
+	dev_t				devt;
+	struct cdev			cdev;
+	wait_queue_head_t		release_q;
+} vfio;
+
+struct vfio_iommu_driver {
+	const struct vfio_iommu_driver_ops	*ops;
+	struct list_head			vfio_next;
+};
+
+struct vfio_container {
+	struct kref			kref;
+	struct list_head		group_list;
+	struct mutex			group_lock;
+	struct vfio_iommu_driver	*iommu_driver;
+	void				*iommu_data;
+};
+
+struct vfio_group {
+	struct kref			kref;
+	int				minor;
+	atomic_t			container_users;
+	struct iommu_group		*iommu_group;
+	struct vfio_container		*container;
+	struct list_head		device_list;
+	struct mutex			device_lock;
+	struct device			*dev;
+	struct notifier_block		nb;
+	struct list_head		vfio_next;
+	struct list_head		container_next;
+};
+
+struct vfio_device {
+	struct kref			kref;
+	struct device			*dev;
+	const struct vfio_device_ops	*ops;
+	struct vfio_group		*group;
+	struct list_head		group_next;
+	void				*device_data;
+};
+
+/**
+ * IOMMU driver registration
+ */
+int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops)
+{
+	struct vfio_iommu_driver *driver, *tmp;
+
+	driver = kzalloc(sizeof(*driver), GFP_KERNEL);
+	if (!driver)
+		return -ENOMEM;
+
+	driver->ops = ops;
+
+	mutex_lock(&vfio.iommu_drivers_lock);
+
+	/* Check for duplicates */
+	list_for_each_entry(tmp, &vfio.iommu_drivers_list, vfio_next) {
+		if (tmp->ops == ops) {
+			mutex_unlock(&vfio.iommu_drivers_lock);
+			kfree(driver);
+			return -EINVAL;
+		}
+	}
+
+	list_add(&driver->vfio_next, &vfio.iommu_drivers_list);
+
+	mutex_unlock(&vfio.iommu_drivers_lock);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(vfio_register_iommu_driver);
+
+void vfio_unregister_iommu_driver(const struct vfio_iommu_driver_ops *ops)
+{
+	struct vfio_iommu_driver *driver;
+
+	mutex_lock(&vfio.iommu_drivers_lock);
+	list_for_each_entry(driver, &vfio.iommu_drivers_list, vfio_next) {
+		if (driver->ops == ops) {
+			list_del(&driver->vfio_next);
+			mutex_unlock(&vfio.iommu_drivers_lock);
+			kfree(driver);
+			return;
+		}
+	}
+	mutex_unlock(&vfio.iommu_drivers_lock);
+}
+EXPORT_SYMBOL_GPL(vfio_unregister_iommu_driver);
+
+/**
+ * Group minor allocation/free - both called with vfio.group_lock held
+ */
+static int vfio_alloc_group_minor(struct vfio_group *group)
+{
+	int ret, minor;
+
+again:
+	if (unlikely(idr_pre_get(&vfio.group_idr, GFP_KERNEL) == 0))
+		return -ENOMEM;
+
+	/* index 0 is used by /dev/vfio/vfio */
+	ret = idr_get_new_above(&vfio.group_idr, group, 1, &minor);
+	if (ret == -EAGAIN)
+		goto again;
+	if (ret || minor > MINORMASK) {
+		if (minor > MINORMASK)
+			idr_remove(&vfio.group_idr, minor);
+		return -ENOSPC;
+	}
+
+	return minor;
+}
+
+static void vfio_free_group_minor(int minor)
+{
+	idr_remove(&vfio.group_idr, minor);
+}
+
+static int vfio_iommu_group_notifier(struct notifier_block *nb,
+				     unsigned long action, void *data);
+static void vfio_group_get(struct vfio_group *group);
+
+/**
+ * Container objects - containers are created when /dev/vfio/vfio is
+ * opened, but their lifecycle extends until the last user is done, so
+ * it's freed via kref.  Must support container/group/device being
+ * closed in any order.
+ */
+static void vfio_container_get(struct vfio_container *container)
+{
+	kref_get(&container->kref);
+}
+
+static void vfio_container_release(struct kref *kref)
+{
+	struct vfio_container *container;
+	container = container_of(kref, struct vfio_container, kref);
+
+	kfree(container);
+}
+
+static void vfio_container_put(struct vfio_container *container)
+{
+	kref_put(&container->kref, vfio_container_release);
+}
+
+/**
+ * Group objects - create, release, get, put, search
+ */
+static struct vfio_group *vfio_create_group(struct iommu_group *iommu_group)
+{
+	struct vfio_group *group, *tmp;
+	struct device *dev;
+	int ret, minor;
+
+	group = kzalloc(sizeof(*group), GFP_KERNEL);
+	if (!group)
+		return ERR_PTR(-ENOMEM);
+
+	kref_init(&group->kref);
+	INIT_LIST_HEAD(&group->device_list);
+	mutex_init(&group->device_lock);
+	atomic_set(&group->container_users, 0);
+	group->iommu_group = iommu_group;
+
+	group->nb.notifier_call = vfio_iommu_group_notifier;
+
+	/*
+	 * Blocking notifiers acquire a rwsem around registering and hold
+	 * it around the callback.  Therefore, we need to register outside
+	 * of vfio.group_lock to avoid A-B/B-A contention.  Our callback won't
+	 * do anything unless it can find the group in vfio.group_list, so
+	 * no harm in registering early.
+	 */
+	ret = iommu_group_register_notifier(iommu_group, &group->nb);
+	if (ret) {
+		kfree(group);
+		return ERR_PTR(ret);
+	}
+
+	mutex_lock(&vfio.group_lock);
+
+	minor = vfio_alloc_group_minor(group);
+	if (minor < 0) {
+		mutex_unlock(&vfio.group_lock);
+		kfree(group);
+		return ERR_PTR(minor);
+	}
+
+	/* Did we race creating this group? */
+	list_for_each_entry(tmp, &vfio.group_list, vfio_next) {
+		if (tmp->iommu_group == iommu_group) {
+			vfio_group_get(tmp);
+			vfio_free_group_minor(minor);
+			mutex_unlock(&vfio.group_lock);
+			kfree(group);
+			return tmp;
+		}
+	}
+
+	dev = device_create(vfio.class, NULL, MKDEV(MAJOR(vfio.devt), minor),
+			    group, "%d", iommu_group_id(iommu_group));
+	if (IS_ERR(dev)) {
+		vfio_free_group_minor(minor);
+		mutex_unlock(&vfio.group_lock);
+		kfree(group);
+		return (struct vfio_group *)dev; /* ERR_PTR */
+	}
+
+	group->minor = minor;
+	group->dev = dev;
+
+	list_add(&group->vfio_next, &vfio.group_list);
+
+	mutex_unlock(&vfio.group_lock);
+
+	return group;
+}
+
+static void vfio_group_release(struct kref *kref)
+{
+	struct vfio_group *group = container_of(kref, struct vfio_group, kref);
+
+	WARN_ON(!list_empty(&group->device_list));
+
+	device_destroy(vfio.class, MKDEV(MAJOR(vfio.devt), group->minor));
+	list_del(&group->vfio_next);
+	vfio_free_group_minor(group->minor);
+
+	mutex_unlock(&vfio.group_lock);
+
+	/*
+	 * Unregister outside of lock.  A spurious callback is harmless now
+	 * that the group is no longer in vfio.group_list.
+	 */
+	iommu_group_unregister_notifier(group->iommu_group, &group->nb);
+
+	kfree(group);
+}
+
+static void vfio_group_put(struct vfio_group *group)
+{
+	mutex_lock(&vfio.group_lock);
+	/*
+	 * Release needs to unlock to unregister the notifier, so only
+	 * unlock if not released.
+	 */
+	if (!kref_put(&group->kref, vfio_group_release))
+		mutex_unlock(&vfio.group_lock);
+}
+
+/* Assume group_lock or group reference is held */
+static void vfio_group_get(struct vfio_group *group)
+{
+	kref_get(&group->kref);
+}
+
+/*
+ * Not really a try as we will sleep on the mutex, but we need to make
+ * sure the group pointer is valid under lock and get a reference.
+ */
+static struct vfio_group *vfio_group_try_get(struct vfio_group *group)
+{
+	struct vfio_group *target = group;
+
+	mutex_lock(&vfio.group_lock);
+	list_for_each_entry(group, &vfio.group_list, vfio_next) {
+		if (group == target) {
+			vfio_group_get(group);
+			mutex_unlock(&vfio.group_lock);
+			return group;
+		}
+	}
+	mutex_unlock(&vfio.group_lock);
+
+	return NULL;
+}
+
+static
+struct vfio_group *vfio_group_get_from_iommu(struct iommu_group *iommu_group)
+{
+	struct vfio_group *group;
+
+	mutex_lock(&vfio.group_lock);
+	list_for_each_entry(group, &vfio.group_list, vfio_next) {
+		if (group->iommu_group == iommu_group) {
+			vfio_group_get(group);
+			mutex_unlock(&vfio.group_lock);
+			return group;
+		}
+	}
+	mutex_unlock(&vfio.group_lock);
+
+	return NULL;
+}
+
+static struct vfio_group *vfio_group_get_from_minor(int minor)
+{
+	struct vfio_group *group;
+
+	mutex_lock(&vfio.group_lock);
+	group = idr_find(&vfio.group_idr, minor);
+	if (!group) {
+		mutex_unlock(&vfio.group_lock);
+		return NULL;
+	}
+	vfio_group_get(group);
+	mutex_unlock(&vfio.group_lock);
+
+	return group;
+}
+
+/**
+ * Device objects - create, release, get, put, search
+ */
+static
+struct vfio_device *vfio_group_create_device(struct vfio_group *group,
+					     struct device *dev,
+					     const struct vfio_device_ops *ops,
+					     void *device_data)
+{
+	struct vfio_device *device;
+	int ret;
+
+	device = kzalloc(sizeof(*device), GFP_KERNEL);
+	if (!device)
+		return ERR_PTR(-ENOMEM);
+
+	kref_init(&device->kref);
+	device->dev = dev;
+	device->group = group;
+	device->ops = ops;
+	device->device_data = device_data;
+
+	ret = dev_set_drvdata(dev, device);
+	if (ret) {
+		kfree(device);
+		return ERR_PTR(ret);
+	}
+
+	/* No need to get group_lock, caller has group reference */
+	vfio_group_get(group);
+
+	mutex_lock(&group->device_lock);
+	list_add(&device->group_next, &group->device_list);
+	mutex_unlock(&group->device_lock);
+
+	return device;
+}
+
+static void vfio_device_release(struct kref *kref)
+{
+	struct vfio_device *device = container_of(kref,
+						  struct vfio_device, kref);
+	struct vfio_group *group = device->group;
+
+	mutex_lock(&group->device_lock);
+	list_del(&device->group_next);
+	mutex_unlock(&group->device_lock);
+
+	dev_set_drvdata(device->dev, NULL);
+
+	kfree(device);
+
+	/* vfio_del_group_dev may be waiting for this device */
+	wake_up(&vfio.release_q);
+}
+
+/* Device reference always implies a group reference */
+static void vfio_device_put(struct vfio_device *device)
+{
+	kref_put(&device->kref, vfio_device_release);
+	vfio_group_put(device->group);
+}
+
+static void vfio_device_get(struct vfio_device *device)
+{
+	vfio_group_get(device->group);
+	kref_get(&device->kref);
+}
+
+static struct vfio_device *vfio_group_get_device(struct vfio_group *group,
+						 struct device *dev)
+{
+	struct vfio_device *device;
+
+	mutex_lock(&group->device_lock);
+	list_for_each_entry(device, &group->device_list, group_next) {
+		if (device->dev == dev) {
+			vfio_device_get(device);
+			mutex_unlock(&group->device_lock);
+			return device;
+		}
+	}
+	mutex_unlock(&group->device_lock);
+	return NULL;
+}
+
+/**
+ * Async device support
+ */
+static int vfio_group_nb_add_dev(struct vfio_group *group, struct device *dev)
+{
+	struct vfio_device *device;
+
+	/* Do we already know about it?  We shouldn't */
+	device = vfio_group_get_device(group, dev);
+	if (WARN_ON_ONCE(device)) {
+		vfio_device_put(device);
+		return 0;
+	}
+
+	/* Nothing to do for idle groups */
+	if (!atomic_read(&group->container_users))
+		return 0;
+
+	/* TODO Prevent device auto probing */
+	WARN("Device %s added to live group %d!\n", dev_name(dev),
+	     iommu_group_id(group->iommu_group));
+
+	return 0;
+}
+
+static int vfio_group_nb_del_dev(struct vfio_group *group, struct device *dev)
+{
+	struct vfio_device *device;
+
+	/*
+	 * Expect to fall out here.  If a device was in use, it would
+	 * have been bound to a vfio sub-driver, which would have blocked
+	 * in .remove at vfio_del_group_dev.  Sanity check that we no
+	 * longer track the device, so it's safe to remove.
+	 */
+	device = vfio_group_get_device(group, dev);
+	if (likely(!device))
+		return 0;
+
+	WARN("Device %s removed from live group %d!\n", dev_name(dev),
+	     iommu_group_id(group->iommu_group));
+
+	vfio_device_put(device);
+	return 0;
+}
+
+static int vfio_group_nb_verify(struct vfio_group *group, struct device *dev)
+{
+	struct vfio_device *device;
+
+	/* We don't care what happens when the group isn't in use */
+	if (!atomic_read(&group->container_users))
+		return 0;
+
+	device = vfio_group_get_device(group, dev);
+	if (device)
+		vfio_device_put(device);
+
+	return device ? 0 : -EINVAL;
+}
+
+static int vfio_iommu_group_notifier(struct notifier_block *nb,
+				     unsigned long action, void *data)
+{
+	struct vfio_group *group = container_of(nb, struct vfio_group, nb);
+	struct device *dev = data;
+
+	/*
+	 * Need to go through a group_lock lookup to get a reference or
+	 * we risk racing a group being removed.  Leave a WARN_ON for
+	 * debugging, but if the group no longer exists, a spurious notify
+	 * is harmless.
+	 */
+	group = vfio_group_try_get(group);
+	if (WARN_ON(!group))
+		return NOTIFY_OK;
+
+	switch (action) {
+	case IOMMU_GROUP_NOTIFY_ADD_DEVICE:
+		vfio_group_nb_add_dev(group, dev);
+		break;
+	case IOMMU_GROUP_NOTIFY_DEL_DEVICE:
+		vfio_group_nb_del_dev(group, dev);
+		break;
+	case IOMMU_GROUP_NOTIFY_BIND_DRIVER:
+		printk(KERN_DEBUG
+		       "%s: Device %s, group %d binding to driver\n", __func__,
+		      dev_name(dev), iommu_group_id(group->iommu_group));
+		break;
+	case IOMMU_GROUP_NOTIFY_BOUND_DRIVER:
+		printk(KERN_DEBUG
+		       "%s: Device %s, group %d bound to driver %s\n", __func__,
+		       dev_name(dev), iommu_group_id(group->iommu_group),
+		       dev->driver->name);
+		BUG_ON(vfio_group_nb_verify(group, dev));
+		break;
+	case IOMMU_GROUP_NOTIFY_UNBIND_DRIVER:
+		printk(KERN_DEBUG
+		       "%s: Device %s, group %d unbinding from driver %s\n",
+		       __func__, dev_name(dev),
+		       iommu_group_id(group->iommu_group), dev->driver->name);
+		break;
+	case IOMMU_GROUP_NOTIFY_UNBOUND_DRIVER:
+		printk(KERN_DEBUG
+		       "%s: Device %s, group %d unbound from driver\n",
+		       __func__, dev_name(dev),
+		       iommu_group_id(group->iommu_group));
+		/*
+		 * XXX An unbound device in a live group is ok, but we'd
+		 * really like to avoid the above BUG_ON by preventing other
+		 * drivers from binding to it.  Once that occurs, we have to
+		 * stop the system to maintain isolation.  At a minimum, we'd
+		 * want a toggle to disable driver auto probe for this device.
+		 */
+		break;
+	}
+
+	vfio_group_put(group);
+	return NOTIFY_OK;
+}
+
+/**
+ * VFIO driver API
+ */
+int vfio_add_group_dev(struct device *dev,
+		       const struct vfio_device_ops *ops, void *device_data)
+{
+	struct iommu_group *iommu_group;
+	struct vfio_group *group;
+	struct vfio_device *device;
+
+	iommu_group = iommu_group_get(dev);
+	if (!iommu_group)
+		return -EINVAL;
+
+	group = vfio_group_get_from_iommu(iommu_group);
+	if (!group) {
+		group = vfio_create_group(iommu_group);
+		if (IS_ERR(group)) {
+			iommu_group_put(iommu_group);
+			return PTR_ERR(group);
+		}
+	}
+
+	device = vfio_group_get_device(group, dev);
+	if (device) {
+		WARN(1, "Device %s already exists on group %d\n",
+		     dev_name(dev), iommu_group_id(iommu_group));
+		vfio_device_put(device);
+		vfio_group_put(group);
+		iommu_group_put(iommu_group);
+		return -EBUSY;
+	}
+
+	device = vfio_group_create_device(group, dev, ops, device_data);
+	if (IS_ERR(device)) {
+		vfio_group_put(group);
+		iommu_group_put(iommu_group);
+		return PTR_ERR(device);
+	}
+
+	/*
+	 * Added device holds reference to iommu_group and vfio_device
+	 * (which in turn holds reference to vfio_group).  Drop extra
+	 * group reference used while acquiring device.
+	 */
+	vfio_group_put(group);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(vfio_add_group_dev);
+
+/* Test whether a struct device is present in our tracking */
+static bool vfio_dev_present(struct device *dev)
+{
+	struct iommu_group *iommu_group;
+	struct vfio_group *group;
+	struct vfio_device *device;
+
+	iommu_group = iommu_group_get(dev);
+	if (!iommu_group)
+		return false;
+
+	group = vfio_group_get_from_iommu(iommu_group);
+	if (!group) {
+		iommu_group_put(iommu_group);
+		return false;
+	}
+
+	device = vfio_group_get_device(group, dev);
+	if (!device) {
+		vfio_group_put(group);
+		iommu_group_put(iommu_group);
+		return false;
+	}
+
+	vfio_device_put(device);
+	vfio_group_put(group);
+	iommu_group_put(iommu_group);
+	return true;
+}
+
+/*
+ * Decrement the device reference count and wait for the device to be
+ * removed.  Open file descriptors for the device... */
+void *vfio_del_group_dev(struct device *dev)
+{
+	struct vfio_device *device = dev_get_drvdata(dev);
+	struct vfio_group *group = device->group;
+	struct iommu_group *iommu_group = group->iommu_group;
+	void *device_data = device->device_data;
+
+	vfio_device_put(device);
+
+	/* TODO send a signal to encourage this to be released */
+	wait_event(vfio.release_q, !vfio_dev_present(dev));
+
+	iommu_group_put(iommu_group);
+
+	return device_data;
+}
+EXPORT_SYMBOL_GPL(vfio_del_group_dev);
+
+/**
+ * VFIO base fd, /dev/vfio/vfio
+ */
+static long vfio_ioctl_check_extension(struct vfio_container *container,
+				       unsigned long arg)
+{
+	struct vfio_iommu_driver *driver = container->iommu_driver;
+	long ret = 0;
+
+	switch (arg) {
+		/* No base extensions yet */
+	default:
+		/*
+		 * If no driver is set, poll all registered drivers for
+		 * extensions and return the first positive result.  If
+		 * a driver is already set, further queries will be passed
+		 * only to that driver.
+		 */
+		if (!driver) {
+			mutex_lock(&vfio.iommu_drivers_lock);
+			list_for_each_entry(driver, &vfio.iommu_drivers_list,
+					    vfio_next) {
+				if (!try_module_get(driver->ops->owner))
+					continue;
+
+				ret = driver->ops->ioctl(NULL,
+							 VFIO_CHECK_EXTENSION,
+							 arg);
+				module_put(driver->ops->owner);
+				if (ret > 0)
+					break;
+			}
+			mutex_unlock(&vfio.iommu_drivers_lock);
+		} else
+			ret = driver->ops->ioctl(container->iommu_data,
+						 VFIO_CHECK_EXTENSION, arg);
+	}
+
+	return ret;
+}
+
+/* hold container->group_lock */
+static int __vfio_container_attach_groups(struct vfio_container *container,
+					  struct vfio_iommu_driver *driver,
+					  void *data)
+{
+	struct vfio_group *group;
+	int ret = -ENODEV;
+
+	list_for_each_entry(group, &container->group_list, container_next) {
+		ret = driver->ops->attach_group(data, group->iommu_group);
+		if (ret)
+			goto unwind;
+	}
+
+	return ret;
+
+unwind:
+	list_for_each_entry_continue_reverse(group, &container->group_list,
+					     container_next) {
+		driver->ops->detach_group(data, group->iommu_group);
+	}
+
+	return ret;
+}
+
+static long vfio_ioctl_set_iommu(struct vfio_container *container,
+				 unsigned long arg)
+{
+	struct vfio_iommu_driver *driver;
+	long ret = -ENODEV;
+
+	mutex_lock(&container->group_lock);
+
+	/*
+	 * The container is designed to be an unprivileged interface while
+	 * the group can be assigned to specific users.  Therefore, only by
+	 * adding a group to a container does the user get the privilege of
+	 * enabling the iommu, which may allocate finite resources.  There
+	 * is no unset_iommu, but by removing all the groups from a container,
+	 * the container is deprivileged and returns to an unset state.
+	 */
+	if (list_empty(&container->group_list) || container->iommu_driver) {
+		mutex_unlock(&container->group_lock);
+		return -EINVAL;
+	}
+
+	mutex_lock(&vfio.iommu_drivers_lock);
+	list_for_each_entry(driver, &vfio.iommu_drivers_list, vfio_next) {
+		void *data;
+
+		if (!try_module_get(driver->ops->owner))
+			continue;
+
+		/*
+		 * The arg magic for SET_IOMMU is the same as CHECK_EXTENSION,
+		 * so test which iommu driver reported support for this
+		 * extension and call open on it.  We also pass it the
+		 * magic, allowing a single driver to support multiple
+		 * interfaces if it wants.
+		 */
+		if (driver->ops->ioctl(NULL, VFIO_CHECK_EXTENSION, arg) <= 0) {
+			module_put(driver->ops->owner);
+			continue;
+		}
+
+		/* module reference holds the driver we're working on */
+		mutex_unlock(&vfio.iommu_drivers_lock);
+
+		data = driver->ops->open(arg);
+		if (IS_ERR(data)) {
+			ret = PTR_ERR(data);
+			goto skip_drivers_unlock;
+		}
+
+		ret = __vfio_container_attach_groups(container, driver, data);
+		if (!ret) {
+			container->iommu_driver = driver;
+			container->iommu_data = data;
+		} else
+			driver->ops->release(data);
+
+		goto skip_drivers_unlock;
+	}
+
+	mutex_unlock(&vfio.iommu_drivers_lock);
+skip_drivers_unlock:
+	mutex_unlock(&container->group_lock);
+
+	return ret;
+}
+
+static long vfio_fops_unl_ioctl(struct file *filep,
+				unsigned int cmd, unsigned long arg)
+{
+	struct vfio_container *container = filep->private_data;
+	struct vfio_iommu_driver *driver;
+	void *data;
+	long ret = -EINVAL;
+
+	if (!container)
+		return ret;
+
+	driver = container->iommu_driver;
+	data = container->iommu_data;
+
+	switch (cmd) {
+	case VFIO_GET_API_VERSION:
+		ret = VFIO_API_VERSION;
+		break;
+	case VFIO_CHECK_EXTENSION:
+		ret = vfio_ioctl_check_extension(container, arg);
+		break;
+	case VFIO_SET_IOMMU:
+		ret = vfio_ioctl_set_iommu(container, arg);
+		break;
+	default:
+		if (driver) /* passthrough all unrecognized ioctls */
+			ret = driver->ops->ioctl(data, cmd, arg);
+	}
+
+	return ret;
+}
+
+#ifdef CONFIG_COMPAT
+static long vfio_fops_compat_ioctl(struct file *filep,
+				   unsigned int cmd, unsigned long arg)
+{
+	arg = (unsigned long)compat_ptr(arg);
+	return vfio_fops_unl_ioctl(filep, cmd, arg);
+}
+#endif	/* CONFIG_COMPAT */
+
+static int vfio_fops_open(struct inode *inode, struct file *filep)
+{
+	struct vfio_container *container;
+
+	container = kzalloc(sizeof(*container), GFP_KERNEL);
+	if (!container)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&container->group_list);
+	mutex_init(&container->group_lock);
+	kref_init(&container->kref);
+
+	filep->private_data = container;
+
+	return 0;
+}
+
+static int vfio_fops_release(struct inode *inode, struct file *filep)
+{
+	struct vfio_container *container = filep->private_data;
+
+	filep->private_data = NULL;
+
+	vfio_container_put(container);
+
+	return 0;
+}
+
+/*
+ * Once an iommu driver is set, we optionally pass read/write/mmap
+ * on to the driver, allowing management interfaces beyond ioctl.
+ */
+static ssize_t vfio_fops_read(struct file *filep, char __user *buf,
+			      size_t count, loff_t *ppos)
+{
+	struct vfio_container *container = filep->private_data;
+	struct vfio_iommu_driver *driver = container->iommu_driver;
+
+	if (unlikely(!driver || !driver->ops->read))
+		return -EINVAL;
+
+	return driver->ops->read(container->iommu_data, buf, count, ppos);
+}
+
+static ssize_t vfio_fops_write(struct file *filep, const char __user *buf,
+			       size_t count, loff_t *ppos)
+{
+	struct vfio_container *container = filep->private_data;
+	struct vfio_iommu_driver *driver = container->iommu_driver;
+
+	if (unlikely(!driver || !driver->ops->write))
+		return -EINVAL;
+
+	return driver->ops->write(container->iommu_data, buf, count, ppos);
+}
+
+static int vfio_fops_mmap(struct file *filep, struct vm_area_struct *vma)
+{
+	struct vfio_container *container = filep->private_data;
+	struct vfio_iommu_driver *driver = container->iommu_driver;
+
+	if (unlikely(!driver || !driver->ops->mmap))
+		return -EINVAL;
+
+	return driver->ops->mmap(container->iommu_data, vma);
+}
+
+static const struct file_operations vfio_fops = {
+	.owner		= THIS_MODULE,
+	.open		= vfio_fops_open,
+	.release	= vfio_fops_release,
+	.read		= vfio_fops_read,
+	.write		= vfio_fops_write,
+	.unlocked_ioctl	= vfio_fops_unl_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= vfio_fops_compat_ioctl,
+#endif
+	.mmap		= vfio_fops_mmap,
+};
+
+/**
+ * VFIO Group fd, /dev/vfio/$GROUP
+ */
+static void __vfio_group_unset_container(struct vfio_group *group)
+{
+	struct vfio_container *container = group->container;
+	struct vfio_iommu_driver *driver;
+
+	mutex_lock(&container->group_lock);
+
+	driver = container->iommu_driver;
+	if (driver)
+		driver->ops->detach_group(container->iommu_data,
+					  group->iommu_group);
+
+	group->container = NULL;
+	list_del(&group->container_next);
+
+	/* Detaching the last group deprivileges a container, remove iommu */
+	if (driver && list_empty(&container->group_list)) {
+		driver->ops->release(container->iommu_data);
+		module_put(driver->ops->owner);
+		container->iommu_driver = NULL;
+		container->iommu_data = NULL;
+	}
+
+	mutex_unlock(&container->group_lock);
+
+	vfio_container_put(container);
+}
+
+/*
+ * VFIO_GROUP_UNSET_CONTAINER should fail if there are other users or
+ * if there was no container to unset.  Since the ioctl is called on
+ * the group, we know the group still exists, therefore the only valid
+ * transition here is 1->0.
+ */
+static int vfio_group_unset_container(struct vfio_group *group)
+{
+	int users = atomic_cmpxchg(&group->container_users, 1, 0);
+
+	if (!users)
+		return -EINVAL;
+	if (users != 1)
+		return -EBUSY;
+
+	__vfio_group_unset_container(group);
+
+	return 0;
+}
+
+/*
+ * When removing container users, anything that removes the last user
+ * implicitly removes the group from the container.  That is, once the
+ * group file descriptor and any device file descriptors are closed,
+ * the group is free.
+ */
+static void vfio_group_try_dissolve_container(struct vfio_group *group)
+{
+	if (0 == atomic_dec_if_positive(&group->container_users))
+		__vfio_group_unset_container(group);
+}
+
+static int vfio_group_set_container(struct vfio_group *group, int container_fd)
+{
+	struct file *filep;
+	struct vfio_container *container;
+	struct vfio_iommu_driver *driver;
+	int ret = 0;
+
+	if (atomic_read(&group->container_users))
+		return -EINVAL;
+
+	filep = fget(container_fd);
+	if (!filep)
+		return -EBADF;
+
+	/* Sanity check, is this really our fd? */
+	if (filep->f_op != &vfio_fops) {
+		fput(filep);
+		return -EINVAL;
+	}
+
+	container = filep->private_data;
+	WARN_ON(!container); /* fget ensures we don't race vfio_release */
+
+	mutex_lock(&container->group_lock);
+
+	driver = container->iommu_driver;
+	if (driver) {
+		ret = driver->ops->attach_group(container->iommu_data,
+						group->iommu_group);
+		if (ret)
+			goto unlock_out;
+	}
+
+	group->container = container;
+	list_add(&group->container_next, &container->group_list);
+
+	/* Get a reference on the container and mark a user within the group */
+	vfio_container_get(container);
+	atomic_inc(&group->container_users);
+
+unlock_out:
+	mutex_unlock(&container->group_lock);
+	fput(filep);
+	if (ret)
+		vfio_container_put(container);
+
+	return ret;
+}
+
+/*
+ * A vfio group is viable for use by userspace if all devices are either
+ * driver-less or bound to a vfio driver.  We test the latter by the
+ * existence of a struct vfio_device matching the dev.
+ */
+static int vfio_dev_viable(struct device *dev, void *data)
+{
+	struct vfio_group *group = data;
+	struct vfio_device *device;
+
+	if (!dev->driver)
+		return 0;
+
+	device = vfio_group_get_device(group, dev);
+	vfio_device_put(device);
+
+	if (!device)
+		return -EINVAL;
+
+	return 0;
+}
+
+static bool vfio_group_viable(struct vfio_group *group)
+{
+	return (iommu_group_for_each_dev(group->iommu_group,
+					 group, vfio_dev_viable) == 0);
+}
+
+static const struct file_operations vfio_device_fops;
+
+static int vfio_group_get_device_fd(struct vfio_group *group, char *buf)
+{
+	struct vfio_device *device;
+	struct file *filep;
+	int ret = -ENODEV;
+
+	if (0 == atomic_read(&group->container_users) ||
+	    !group->container->iommu_driver || !vfio_group_viable(group))
+		return -EINVAL;
+
+	mutex_lock(&group->device_lock);
+	list_for_each_entry(device, &group->device_list, group_next) {
+		if (strcmp(dev_name(device->dev), buf))
+			continue;
+
+		ret = device->ops->open(device->device_data);
+		if (ret)
+			break;
+		/*
+		 * We can't use anon_inode_getfd() because we need to modify
+		 * the f_mode flags directly to allow more than just ioctls
+		 */
+		ret = get_unused_fd();
+		if (ret < 0) {
+			device->ops->release(device->device_data);
+			break;
+		}
+
+		filep = anon_inode_getfile("[vfio-device]", &vfio_device_fops,
+					   device, O_RDWR);
+		if (IS_ERR(filep)) {
+			put_unused_fd(ret);
+			ret = PTR_ERR(filep);
+			device->ops->release(device->device_data);
+			break;
+		}
+
+		/*
+		 * TODO: add an anon_inode interface to do this.
+		 * Appears to be missing for lack of need rather than
+		 * explicitly prevented.  Now there's need.
+		 */
+		filep->f_mode |= (FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE);
+
+		fd_install(ret, filep);
+
+		vfio_device_get(device);
+		atomic_inc(&group->container_users);
+		break;
+	}
+	mutex_unlock(&group->device_lock);
+
+	return ret;
+}
+
+static long vfio_group_fops_unl_ioctl(struct file *filep,
+				      unsigned int cmd, unsigned long arg)
+{
+	struct vfio_group *group = filep->private_data;
+	long ret = -ENOTTY;
+
+	switch (cmd) {
+	case VFIO_GROUP_GET_STATUS:
+	{
+		struct vfio_group_status status;
+		unsigned long minsz;
+
+		minsz = offsetofend(struct vfio_group_status, flags);
+
+		if (copy_from_user(&status, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (status.argsz < minsz)
+			return -EINVAL;
+
+		status.flags = 0;
+
+		if (vfio_group_viable(group))
+			status.flags |= VFIO_GROUP_FLAGS_VIABLE;
+
+		if (group->container)
+			status.flags |= VFIO_GROUP_FLAGS_CONTAINER_SET;
+
+		ret = copy_to_user((void __user *)arg, &status, minsz) ?
+			-EFAULT : 0;
+
+		break;
+	}
+	case VFIO_GROUP_SET_CONTAINER:
+	{
+		int fd;
+
+		if (get_user(fd, (int __user *)arg))
+			return -EFAULT;
+
+		if (fd < 0)
+			return -EINVAL;
+
+		ret = vfio_group_set_container(group, fd);
+		break;
+	}
+	case VFIO_GROUP_UNSET_CONTAINER:
+		ret = vfio_group_unset_container(group);
+		break;
+	case VFIO_GROUP_GET_DEVICE_FD:
+	{
+		char *buf;
+
+		buf = strndup_user((const char __user *)arg, PAGE_SIZE);
+		if (IS_ERR(buf))
+			return PTR_ERR(buf);
+
+		ret = vfio_group_get_device_fd(group, buf);
+		kfree(buf);
+		break;
+	}
+	}
+
+	return ret;
+}
+
+#ifdef CONFIG_COMPAT
+static long vfio_group_fops_compat_ioctl(struct file *filep,
+					 unsigned int cmd, unsigned long arg)
+{
+	arg = (unsigned long)compat_ptr(arg);
+	return vfio_group_fops_unl_ioctl(filep, cmd, arg);
+}
+#endif	/* CONFIG_COMPAT */
+
+static int vfio_group_fops_open(struct inode *inode, struct file *filep)
+{
+	struct vfio_group *group;
+
+	group = vfio_group_get_from_minor(iminor(inode));
+	if (!group)
+		return -ENODEV;
+
+	if (group->container) {
+		vfio_group_put(group);
+		return -EBUSY;
+	}
+
+	filep->private_data = group;
+
+	return 0;
+}
+
+static int vfio_group_fops_release(struct inode *inode, struct file *filep)
+{
+	struct vfio_group *group = filep->private_data;
+
+	filep->private_data = NULL;
+
+	vfio_group_try_dissolve_container(group);
+
+	vfio_group_put(group);
+
+	return 0;
+}
+
+static const struct file_operations vfio_group_fops = {
+	.owner		= THIS_MODULE,
+	.unlocked_ioctl	= vfio_group_fops_unl_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= vfio_group_fops_compat_ioctl,
+#endif
+	.open		= vfio_group_fops_open,
+	.release	= vfio_group_fops_release,
+};
+
+/**
+ * VFIO Device fd
+ */
+static int vfio_device_fops_release(struct inode *inode, struct file *filep)
+{
+	struct vfio_device *device = filep->private_data;
+
+	device->ops->release(device->device_data);
+
+	vfio_group_try_dissolve_container(device->group);
+
+	vfio_device_put(device);
+
+	return 0;
+}
+
+static long vfio_device_fops_unl_ioctl(struct file *filep,
+				       unsigned int cmd, unsigned long arg)
+{
+	struct vfio_device *device = filep->private_data;
+
+	if (unlikely(!device->ops->ioctl))
+		return -EINVAL;
+
+	return device->ops->ioctl(device->device_data, cmd, arg);
+}
+
+static ssize_t vfio_device_fops_read(struct file *filep, char __user *buf,
+				     size_t count, loff_t *ppos)
+{
+	struct vfio_device *device = filep->private_data;
+
+	if (unlikely(!device->ops->read))
+		return -EINVAL;
+
+	return device->ops->read(device->device_data, buf, count, ppos);
+}
+
+static ssize_t vfio_device_fops_write(struct file *filep,
+				      const char __user *buf,
+				      size_t count, loff_t *ppos)
+{
+	struct vfio_device *device = filep->private_data;
+
+	if (unlikely(!device->ops->write))
+		return -EINVAL;
+
+	return device->ops->write(device->device_data, buf, count, ppos);
+}
+
+static int vfio_device_fops_mmap(struct file *filep, struct vm_area_struct *vma)
+{
+	struct vfio_device *device = filep->private_data;
+
+	if (unlikely(!device->ops->mmap))
+		return -EINVAL;
+
+	return device->ops->mmap(device->device_data, vma);
+}
+
+#ifdef CONFIG_COMPAT
+static long vfio_device_fops_compat_ioctl(struct file *filep,
+					  unsigned int cmd, unsigned long arg)
+{
+	arg = (unsigned long)compat_ptr(arg);
+	return vfio_device_fops_unl_ioctl(filep, cmd, arg);
+}
+#endif	/* CONFIG_COMPAT */
+
+static const struct file_operations vfio_device_fops = {
+	.owner		= THIS_MODULE,
+	.release	= vfio_device_fops_release,
+	.read		= vfio_device_fops_read,
+	.write		= vfio_device_fops_write,
+	.unlocked_ioctl	= vfio_device_fops_unl_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= vfio_device_fops_compat_ioctl,
+#endif
+	.mmap		= vfio_device_fops_mmap,
+};
+
+/**
+ * Module/class support
+ */
+static char *vfio_devnode(struct device *dev, umode_t *mode)
+{
+	return kasprintf(GFP_KERNEL, "vfio/%s", dev_name(dev));
+}
+
+static int __init vfio_init(void)
+{
+	int ret;
+
+	idr_init(&vfio.group_idr);
+	mutex_init(&vfio.group_lock);
+	mutex_init(&vfio.iommu_drivers_lock);
+	INIT_LIST_HEAD(&vfio.group_list);
+	INIT_LIST_HEAD(&vfio.iommu_drivers_list);
+	init_waitqueue_head(&vfio.release_q);
+
+	vfio.class = class_create(THIS_MODULE, "vfio");
+	if (IS_ERR(vfio.class)) {
+		ret = PTR_ERR(vfio.class);
+		goto err_class;
+	}
+
+	vfio.class->devnode = vfio_devnode;
+
+	ret = alloc_chrdev_region(&vfio.devt, 0, MINORMASK, "vfio");
+	if (ret)
+		goto err_base_chrdev;
+
+	cdev_init(&vfio.cdev, &vfio_fops);
+	ret = cdev_add(&vfio.cdev, vfio.devt, 1);
+	if (ret)
+		goto err_base_cdev;
+
+	vfio.dev = device_create(vfio.class, NULL, vfio.devt, NULL, "vfio");
+	if (IS_ERR(vfio.dev)) {
+		ret = PTR_ERR(vfio.dev);
+		goto err_base_dev;
+	}
+
+	/* /dev/vfio/$GROUP */
+	cdev_init(&vfio.group_cdev, &vfio_group_fops);
+	ret = cdev_add(&vfio.group_cdev,
+		       MKDEV(MAJOR(vfio.devt), 1), MINORMASK - 1);
+	if (ret)
+		goto err_groups_cdev;
+
+	pr_info(DRIVER_DESC " version: " DRIVER_VERSION "\n");
+
+	return 0;
+
+err_groups_cdev:
+	device_destroy(vfio.class, vfio.devt);
+err_base_dev:
+	cdev_del(&vfio.cdev);
+err_base_cdev:
+	unregister_chrdev_region(vfio.devt, MINORMASK);
+err_base_chrdev:
+	class_destroy(vfio.class);
+	vfio.class = NULL;
+err_class:
+	return ret;
+}
+
+static void __exit vfio_cleanup(void)
+{
+	WARN_ON(!list_empty(&vfio.group_list));
+
+	idr_destroy(&vfio.group_idr);
+	cdev_del(&vfio.group_cdev);
+	device_destroy(vfio.class, vfio.devt);
+	cdev_del(&vfio.cdev);
+	unregister_chrdev_region(vfio.devt, MINORMASK);
+	class_destroy(vfio.class);
+	vfio.class = NULL;
+}
+
+module_init(vfio_init);
+module_exit(vfio_cleanup);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
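
Before moving on to the header, it may help to see what the bus-driver side
of the API above looks like.  The following is a hypothetical sketch, not
part of the patch; the my_* names are invented, and the probe/remove
signatures depend on the bus in question:

#include <linux/device.h>
#include <linux/slab.h>
#include <linux/vfio.h>

struct my_device_data {
	struct device *dev;
	/* ... per-device state ... */
};

static int my_open(void *device_data)
{
	return 0;
}

static void my_release(void *device_data)
{
}

static long my_ioctl(void *device_data, unsigned int cmd, unsigned long arg)
{
	return -ENOTTY;
}

static const struct vfio_device_ops my_vfio_ops = {
	.name		= "my-vfio-driver",
	.open		= my_open,
	.release	= my_release,
	.ioctl		= my_ioctl,
};

static int my_probe(struct device *dev)
{
	struct my_device_data *data = kzalloc(sizeof(*data), GFP_KERNEL);

	if (!data)
		return -ENOMEM;

	data->dev = dev;

	/* Fails with -EINVAL unless the device has an iommu_group */
	return vfio_add_group_dev(dev, &my_vfio_ops, data);
}

static int my_remove(struct device *dev)
{
	/* Blocks until all open device file descriptors are released */
	struct my_device_data *data = vfio_del_group_dev(dev);

	kfree(data);
	return 0;
}
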
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
new file mode 100644
index 0000000..a264054
--- /dev/null
+++ b/include/linux/vfio.h
@@ -0,0 +1,364 @@
+/*
+ * VFIO API definition
+ *
+ * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
+ *     Author: Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#ifndef VFIO_H
+#define VFIO_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+#define VFIO_API_VERSION	0
+
+#ifdef __KERNEL__	/* Internal VFIO-core/bus driver API */
+
+/**
+ * struct vfio_device_ops - VFIO bus driver device callbacks
+ *
+ * @open: Called when userspace creates new file descriptor for device
+ * @release: Called when userspace releases file descriptor for device
+ * @read: Perform read(2) on device file descriptor
+ * @write: Perform write(2) on device file descriptor
+ * @ioctl: Perform ioctl(2) on device file descriptor, supporting VFIO_DEVICE_*
+ *         operations documented below
+ * @mmap: Perform mmap(2) on a region of the device file descriptor
+ */
+struct vfio_device_ops {
+	char	*name;
+	int	(*open)(void *device_data);
+	void	(*release)(void *device_data);
+	ssize_t	(*read)(void *device_data, char __user *buf,
+			size_t count, loff_t *ppos);
+	ssize_t	(*write)(void *device_data, const char __user *buf,
+			 size_t count, loff_t *size);
+	long	(*ioctl)(void *device_data, unsigned int cmd,
+			 unsigned long arg);
+	int	(*mmap)(void *device_data, struct vm_area_struct *vma);
+};
+
+extern int vfio_add_group_dev(struct device *dev,
+			      const struct vfio_device_ops *ops,
+			      void *device_data);
+
+extern void *vfio_del_group_dev(struct device *dev);
+
+/**
+ * struct vfio_iommu_driver_ops - VFIO IOMMU driver callbacks
+ */
+struct vfio_iommu_driver_ops {
+	char		*name;
+	struct module	*owner;
+	void		*(*open)(unsigned long arg);
+	void		(*release)(void *iommu_data);
+	ssize_t		(*read)(void *iommu_data, char __user *buf,
+				size_t count, loff_t *ppos);
+	ssize_t		(*write)(void *iommu_data, const char __user *buf,
+				 size_t count, loff_t *size);
+	long		(*ioctl)(void *iommu_data, unsigned int cmd,
+				 unsigned long arg);
+	int		(*mmap)(void *iommu_data, struct vm_area_struct *vma);
+	int		(*attach_group)(void *iommu_data,
+					struct iommu_group *group);
+	void		(*detach_group)(void *iommu_data,
+					struct iommu_group *group);
+
+};
+
+extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
+
+extern void vfio_unregister_iommu_driver(
+				const struct vfio_iommu_driver_ops *ops);
+
+/**
+ * offsetofend(TYPE, MEMBER)
+ *
+ * @TYPE: The type of the structure
+ * @MEMBER: The member within the structure to get the end offset of
+ *
+ * Simple helper macro for dealing with variable sized structures passed
+ * from user space.  This allows us to easily determine if the provided
+ * structure is sized to include various fields.
+ */
+#define offsetofend(TYPE, MEMBER) ({				\
+	TYPE tmp;						\
+	offsetof(TYPE, MEMBER) + sizeof(tmp.MEMBER); })
+
+#endif /* __KERNEL__ */
+
+/* Kernel & User level defines for VFIO IOCTLs. */
+
+/* Extensions */
+
+#define VFIO_X86_IOMMU		1
+
+/*
+ * The IOCTL interface is designed for extensibility by embedding the
+ * structure length (argsz) and flags into structures passed between
+ * kernel and userspace.  We therefore use the _IO() macro for these
+ * defines to avoid implicitly embedding a size into the ioctl request.
+ * As structure fields are added, argsz will increase to match and flag
+ * bits will be defined to indicate additional fields with valid data.
+ * It's *always* the caller's responsibility to indicate the size of
+ * the structure passed by setting argsz appropriately.
+ */
+
+#define VFIO_TYPE	(';')
+#define VFIO_BASE	100
+
+/* -------- IOCTLs for VFIO file descriptor (/dev/vfio/vfio) -------- */
+
+/**
+ * VFIO_GET_API_VERSION - _IO(VFIO_TYPE, VFIO_BASE + 0)
+ *
+ * Report the version of the VFIO API.  This allows us to bump the entire
+ * API version should we later need to add or change features in incompatible
+ * ways.
+ * Return: VFIO_API_VERSION
+ * Availability: Always
+ */
+#define VFIO_GET_API_VERSION		_IO(VFIO_TYPE, VFIO_BASE + 0)
+
+/**
+ * VFIO_CHECK_EXTENSION - _IOW(VFIO_TYPE, VFIO_BASE + 1, __s32)
+ *
+ * Check whether an extension is supported.
+ * Return: 0 if not supported, 1 (or some other positive integer) if supported.
+ * Availability: Always
+ */
+#define VFIO_CHECK_EXTENSION		_IO(VFIO_TYPE, VFIO_BASE + 1)
+
+/**
+ * VFIO_SET_IOMMU - _IOW(VFIO_TYPE, VFIO_BASE + 2, __s32)
+ *
+ * Set the iommu to the given type.  The type must be supported by an
+ * iommu driver as verified by calling CHECK_EXTENSION using the same
+ * type.  A group must be attached to this file descriptor before this
+ * ioctl is available.  The IOMMU interfaces enabled by this call are
+ * specific to the value set.
+ * Return: 0 on success, -errno on failure
+ * Availability: When VFIO group attached
+ */
+#define VFIO_SET_IOMMU			_IO(VFIO_TYPE, VFIO_BASE + 2)
+
+/* -------- IOCTLs for GROUP file descriptors (/dev/vfio/$GROUP) -------- */
+
+/**
+ * VFIO_GROUP_GET_STATUS - _IOR(VFIO_TYPE, VFIO_BASE + 3,
+ *						struct vfio_group_status)
+ *
+ * Retrieve information about the group.  Fills in provided
+ * struct vfio_group_status.  Caller sets argsz.
+ * Return: 0 on success, -errno on failure.
+ * Availability: Always
+ */
+struct vfio_group_status {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_GROUP_FLAGS_VIABLE		(1 << 0)
+#define VFIO_GROUP_FLAGS_CONTAINER_SET	(1 << 1)
+};
+#define VFIO_GROUP_GET_STATUS		_IO(VFIO_TYPE, VFIO_BASE + 3)
+
+/**
+ * VFIO_GROUP_SET_CONTAINER - _IOW(VFIO_TYPE, VFIO_BASE + 4, __s32)
+ *
+ * Set the container for the VFIO group to the open VFIO file
+ * descriptor provided.  Groups may only belong to a single
+ * container.  Containers may, at their discretion, support multiple
+ * groups.  Only when a container is set are all of the interfaces
+ * of the VFIO file descriptor and the VFIO group file descriptor
+ * available to the user.
+ * Return: 0 on success, -errno on failure.
+ * Availability: Always
+ */
+#define VFIO_GROUP_SET_CONTAINER	_IO(VFIO_TYPE, VFIO_BASE + 4)
+
+/**
+ * VFIO_GROUP_UNSET_CONTAINER - _IO(VFIO_TYPE, VFIO_BASE + 5)
+ *
+ * Remove the group from the attached container.  This is the
+ * opposite of the SET_CONTAINER call and returns the group to
+ * an initial state.  All device file descriptors must be released
+ * prior to calling this interface.  When removing the last group
+ * from a container, the IOMMU will be disabled and all state lost,
+ * effectively also returning the VFIO file descriptor to an initial
+ * state.
+ * Return: 0 on success, -errno on failure.
+ * Availability: When attached to container
+ */
+#define VFIO_GROUP_UNSET_CONTAINER	_IO(VFIO_TYPE, VFIO_BASE + 5)
+
+/**
+ * VFIO_GROUP_GET_DEVICE_FD - _IOW(VFIO_TYPE, VFIO_BASE + 6, char)
+ *
+ * Return a new file descriptor for the device object described by
+ * the provided string.  The string should match a device listed in
+ * the devices subdirectory of the IOMMU group sysfs entry.  The
+ * group containing the device must already be added to this context.
+ * Return: new file descriptor on success, -errno on failure.
+ * Availability: When attached to container
+ */
+#define VFIO_GROUP_GET_DEVICE_FD	_IO(VFIO_TYPE, VFIO_BASE + 6)
+
+/* --------------- IOCTLs for DEVICE file descriptors --------------- */
+
+/**
+ * VFIO_DEVICE_GET_INFO - _IOR(VFIO_TYPE, VFIO_BASE + 7,
+ *						struct vfio_device_info)
+ *
+ * Retrieve information about the device.  Fills in provided
+ * struct vfio_device_info.  Caller sets argsz.
+ * Return: 0 on success, -errno on failure.
+ */
+struct vfio_device_info {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_DEVICE_FLAGS_RESET	(1 << 0)	/* Device supports reset */
+	__u32	num_regions;	/* Max region index + 1 */
+	__u32	num_irqs;	/* Max IRQ index + 1 */
+};
+#define VFIO_DEVICE_GET_INFO		_IO(VFIO_TYPE, VFIO_BASE + 7)
+
+/**
+ * VFIO_DEVICE_GET_REGION_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 8,
+ *				       struct vfio_region_info)
+ *
+ * Retrieve information about a device region.  Caller provides
+ * struct vfio_region_info with index value set.  Caller sets argsz.
+ * Implementation of region mapping is bus driver specific.  This is
+ * intended to describe MMIO, I/O port, as well as bus-specific
+ * regions (e.g. PCI config space).  Zero-sized regions may be used
+ * to describe unimplemented regions (e.g. unimplemented PCI BARs).
+ * Return: 0 on success, -errno on failure.
+ */
+struct vfio_region_info {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_REGION_INFO_FLAG_READ	(1 << 0) /* Region supports read */
+#define VFIO_REGION_INFO_FLAG_WRITE	(1 << 1) /* Region supports write */
+#define VFIO_REGION_INFO_FLAG_MMAP	(1 << 2) /* Region supports mmap */
+	__u32	index;		/* Region index */
+	__u32	resv;		/* Reserved for alignment */
+	__u64	size;		/* Region size (bytes) */
+	__u64	offset;		/* Region offset from start of device fd */
+};
+#define VFIO_DEVICE_GET_REGION_INFO	_IO(VFIO_TYPE, VFIO_BASE + 8)
+
+/**
+ * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
+ *				    struct vfio_irq_info)
+ *
+ * Retrieve information about a device IRQ.  Caller provides
+ * struct vfio_irq_info with index value set.  Caller sets argsz.
+ * Implementation of IRQ mapping is bus driver specific.  Indexes
+ * using multiple IRQs are primarily intended to support MSI-like
+ * interrupt blocks.  Zero-count IRQ blocks may be used to describe
+ * unimplemented interrupt types.
+ *
+ * The EVENTFD flag indicates the interrupt index supports eventfd based
+ * signaling.
+ *
+ * The MASKABLE flag indicates the index supports MASK and UNMASK
+ * actions described below.
+ *
+ * AUTOMASKED indicates that after signaling, the interrupt line is
+ * automatically masked by VFIO and the user needs to unmask the line
+ * to receive new interrupts.  This is primarily intended to distinguish
+ * level triggered interrupts.
+ *
+ * The NORESIZE flag indicates that the interrupt lines within the index
+ * are set up as a set and new subindexes cannot be enabled without first
+ * disabling the entire index.  This is used for interrupts like PCI MSI
+ * and MSI-X where the driver may only use a subset of the available
+ * indexes, but VFIO needs to enable a specific number of vectors
+ * upfront.  In the case of MSI-X, where the user can enable MSI-X and
+ * then add and unmask vectors, it's up to userspace to make the decision
+ * whether to allocate the maximum supported number of vectors or to tear
+ * down the setup and incrementally increase the vectors as each is enabled.
+ */
+struct vfio_irq_info {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_IRQ_INFO_EVENTFD		(1 << 0)
+#define VFIO_IRQ_INFO_MASKABLE		(1 << 1)
+#define VFIO_IRQ_INFO_AUTOMASKED	(1 << 2)
+#define VFIO_IRQ_INFO_NORESIZE		(1 << 3)
+	__u32	index;		/* IRQ index */
+	__s32	count;		/* Number of IRQs within this index */
+};
+#define VFIO_DEVICE_GET_IRQ_INFO	_IO(VFIO_TYPE, VFIO_BASE + 9)
+
+/**
+ * VFIO_DEVICE_SET_IRQS - _IOW(VFIO_TYPE, VFIO_BASE + 10, struct vfio_irq_set)
+ *
+ * Set signaling, masking, and unmasking of interrupts.  Caller provides
+ * struct vfio_irq_set with all fields set.  'start' and 'count' indicate
+ * the range of subindexes being specified.
+ *
+ * The DATA flags specify the type of data provided.  If DATA_NONE, the
+ * operation performs the specified action immediately on the specified
+ * interrupt(s).  For example, to unmask AUTOMASKED interrupt [0,0]:
+ * flags = (DATA_NONE|ACTION_UNMASK), index = 0, start = 0, count = 1.
+ *
+ * DATA_BOOL allows the same actions to be applied sparsely across an
+ * array of interrupts.  For example, to mask interrupts [0,1] and [0,3]
+ * (but not [0,2]):
+ * flags = (DATA_BOOL|ACTION_MASK), index = 0, start = 1, count = 3,
+ * data = {1,0,1}
+ *
+ * DATA_EVENTFD binds the specified ACTION to the provided __s32 eventfd.
+ * A value of -1 can be used to either de-assign interrupts if already
+ * assigned or skip un-assigned interrupts.  For example, to set an eventfd
+ * to be triggered for interrupts [0,0] and [0,2]:
+ * flags = (DATA_EVENTFD|ACTION_TRIGGER), index = 0, start = 0, count = 3,
+ * data = {fd1, -1, fd2}
+ * If index [0,1] is previously set, two count = 1 ioctl calls would be
+ * required to set [0,0] and [0,2] without changing [0,1].
+ *
+ * Once a signaling mechanism is set, DATA_BOOL or DATA_NONE can be used
+ * with ACTION_TRIGGER to perform kernel-level interrupt loopback testing
+ * from userspace (i.e. simulate hardware triggering).
+ *
+ * Setting an event triggering mechanism toward userspace for ACTION_TRIGGER
+ * enables the interrupt index for the device.  Individual subindex interrupts
+ * can be disabled using the -1 value for DATA_EVENTFD or the index can be
+ * disabled as a whole with: flags = (DATA_NONE|ACTION_TRIGGER), count = 0.
+ *
+ * Note that ACTION_[UN]MASK specify user->kernel signaling (irqfds) while
+ * ACTION_TRIGGER specifies kernel->user signaling.
+ */
+struct vfio_irq_set {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_IRQ_SET_DATA_NONE		(1 << 0) /* Data not present */
+#define VFIO_IRQ_SET_DATA_BOOL		(1 << 1) /* Data is bool (u8) */
+#define VFIO_IRQ_SET_DATA_EVENTFD	(1 << 2) /* Data is eventfd (s32) */
+#define VFIO_IRQ_SET_ACTION_MASK	(1 << 3) /* Mask interrupt */
+#define VFIO_IRQ_SET_ACTION_UNMASK	(1 << 4) /* Unmask interrupt */
+#define VFIO_IRQ_SET_ACTION_TRIGGER	(1 << 5) /* Trigger interrupt */
+	__u32	index;
+	__s32	start;
+	__s32	count;
+	__u8	data[];
+};
+#define VFIO_DEVICE_SET_IRQS		_IO(VFIO_TYPE, VFIO_BASE + 10)
+
+#define VFIO_IRQ_SET_DATA_TYPE_MASK	(VFIO_IRQ_SET_DATA_NONE | \
+					 VFIO_IRQ_SET_DATA_BOOL | \
+					 VFIO_IRQ_SET_DATA_EVENTFD)
+#define VFIO_IRQ_SET_ACTION_TYPE_MASK	(VFIO_IRQ_SET_ACTION_MASK | \
+					 VFIO_IRQ_SET_ACTION_UNMASK | \
+					 VFIO_IRQ_SET_ACTION_TRIGGER)
+/**
+ * VFIO_DEVICE_RESET - _IO(VFIO_TYPE, VFIO_BASE + 11)
+ *
+ * Reset a device.
+ */
+#define VFIO_DEVICE_RESET		_IO(VFIO_TYPE, VFIO_BASE + 11)
+
+#endif /* VFIO_H */
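
For readers following along, the ioctls above chain into a short bootstrap
sequence: open the container, verify the API, attach a group, enable an
IOMMU backend, then request device fds.  A minimal userspace sketch of that
flow follows; it is illustrative rather than part of the patch, the group
number (26) and device name (0000:06:0d.0) are made up, and the function
name vfio_bootstrap is hypothetical:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int vfio_bootstrap(void)
{
	int container, group, device;
	struct vfio_group_status group_status = { .argsz = sizeof(group_status) };
	struct vfio_device_info device_info = { .argsz = sizeof(device_info) };

	/* New, unprivileged container */
	container = open("/dev/vfio/vfio", O_RDWR);
	if (container < 0)
		return -1;

	if (ioctl(container, VFIO_GET_API_VERSION) != VFIO_API_VERSION)
		return -1;	/* unknown API version */

	if (ioctl(container, VFIO_CHECK_EXTENSION, VFIO_X86_IOMMU) <= 0)
		return -1;	/* x86 IOMMU backend not registered */

	group = open("/dev/vfio/26", O_RDWR);
	if (group < 0)
		return -1;

	/* Group must be viable: every device bound to vfio or driver-less */
	if (ioctl(group, VFIO_GROUP_GET_STATUS, &group_status) ||
	    !(group_status.flags & VFIO_GROUP_FLAGS_VIABLE))
		return -1;

	/* Adding the group is what privileges the container... */
	if (ioctl(group, VFIO_GROUP_SET_CONTAINER, &container))
		return -1;

	/* ...and only then does SET_IOMMU become available */
	if (ioctl(container, VFIO_SET_IOMMU, VFIO_X86_IOMMU))
		return -1;

	/* Name matches the group's sysfs devices/ directory */
	device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");
	if (device < 0)
		return -1;

	return ioctl(device, VFIO_DEVICE_GET_INFO, &device_info);
}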

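The variable-sized vfio_irq_set is the one structure above where the argsz
handling is easy to get wrong, so a second sketch may help.  It implements
the DATA_EVENTFD example from the SET_IRQS comment, signaling subindexes
[0,0] and [0,2] through two eventfds while leaving [0,1] untouched; again
this is illustrative only, set_irq_triggers is a made-up name, and eventfd
creation plus error reporting are elided:

#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int set_irq_triggers(int device, int fd1, int fd2)
{
	struct vfio_irq_set *irq_set;
	size_t argsz = sizeof(*irq_set) + 3 * sizeof(int32_t);
	int32_t *fds;
	int ret;

	irq_set = calloc(1, argsz);
	if (!irq_set)
		return -1;

	/* argsz covers the header plus the 3-entry data[] payload */
	irq_set->argsz = argsz;
	irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
	irq_set->index = 0;
	irq_set->start = 0;
	irq_set->count = 3;

	fds = (int32_t *)irq_set->data;
	fds[0] = fd1;	/* [0,0] signals through fd1 */
	fds[1] = -1;	/* [0,1] skipped, or de-assigned if set */
	fds[2] = fd2;	/* [0,2] signals through fd2 */

	ret = ioctl(device, VFIO_DEVICE_SET_IRQS, irq_set);
	free(irq_set);
	return ret;
}

The same caller-sets-argsz pattern applies to every structure in this
header, which is what the offsetofend() helper exists to make painless on
the kernel side.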
^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [Qemu-devel] [PATCH 07/13] vfio: VFIO core
@ 2012-05-11 22:56   ` Alex Williamson
  0 siblings, 0 replies; 129+ messages in thread
From: Alex Williamson @ 2012-05-11 22:56 UTC (permalink / raw)
  To: benh, aik, david, joerg.roedel, dwmw2
  Cc: aafabbri, alex.williamson, kvm, B07421, linux-pci, konrad.wilk,
	agraf, qemu-devel, chrisw, B08248, iommu, avi, gregkh, bhelgaas,
	linux-kernel, benve

VFIO is a secure user level driver for use with both virtual machines
and user level drivers.  VFIO makes use of IOMMU groups to ensure the
isolation of devices in use, allowing unprivileged user access.  It's
intended that VFIO will replace KVM device assignment and UIO drivers
(in cases where the target platform includes a sufficiently capable
IOMMU).

New in this version of VFIO is support for IOMMU groups managed
through the IOMMU core as well as a rework of the API, removing the
group merge interface.  We now go back to a model more similar to
original VFIO with UIOMMU support where the file descriptor obtained
from /dev/vfio/vfio allows access to the IOMMU, but only after a
group is added, avoiding the previous privilege issues with this type
of model.  IOMMU support is also now fully modular as IOMMUs have
vastly different interface requirements on different platforms.  VFIO
users are able to query and initialize the IOMMU model of their
choice.

Please see the follow-on Documentation commit for further description
and usage example.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

 Documentation/ioctl/ioctl-number.txt |    1 
 MAINTAINERS                          |    8 
 drivers/Kconfig                      |    2 
 drivers/Makefile                     |    1 
 drivers/vfio/Kconfig                 |    8 
 drivers/vfio/Makefile                |    1 
 drivers/vfio/vfio.c                  | 1399 ++++++++++++++++++++++++++++++++++
 include/linux/vfio.h                 |  364 +++++++++
 8 files changed, 1784 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vfio/Kconfig
 create mode 100644 drivers/vfio/Makefile
 create mode 100644 drivers/vfio/vfio.c
 create mode 100644 include/linux/vfio.h

diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
index e34b531..111e30a 100644
--- a/Documentation/ioctl/ioctl-number.txt
+++ b/Documentation/ioctl/ioctl-number.txt
@@ -88,6 +88,7 @@ Code  Seq#(hex)	Include File		Comments
 		and kernel/power/user.c
 '8'	all				SNP8023 advanced NIC card
 					<mailto:mcr@solidum.com>
+';'	64-6F	linux/vfio.h
 '@'	00-0F	linux/radeonfb.h	conflict!
 '@'	00-0F	drivers/video/aty/aty128fb.c	conflict!
 'A'	00-1F	linux/apm_bios.h	conflict!
diff --git a/MAINTAINERS b/MAINTAINERS
index de4e280..48e7600 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7227,6 +7227,14 @@ S:	Maintained
 F:	Documentation/filesystems/vfat.txt
 F:	fs/fat/
 
+VFIO DRIVER
+M:	Alex Williamson <alex.williamson@redhat.com>
+L:	kvm@vger.kernel.org
+S:	Maintained
+F:	Documentation/vfio.txt
+F:	drivers/vfio/
+F:	include/linux/vfio.h
+
 VIDEOBUF2 FRAMEWORK
 M:	Pawel Osciak <pawel@osciak.com>
 M:	Marek Szyprowski <m.szyprowski@samsung.com>
diff --git a/drivers/Kconfig b/drivers/Kconfig
index d236aef..46eb115 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -112,6 +112,8 @@ source "drivers/auxdisplay/Kconfig"
 
 source "drivers/uio/Kconfig"
 
+source "drivers/vfio/Kconfig"
+
 source "drivers/vlynq/Kconfig"
 
 source "drivers/virtio/Kconfig"
diff --git a/drivers/Makefile b/drivers/Makefile
index 95952c8..fe1880a 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -59,6 +59,7 @@ obj-$(CONFIG_ATM)		+= atm/
 obj-$(CONFIG_FUSION)		+= message/
 obj-y				+= firewire/
 obj-$(CONFIG_UIO)		+= uio/
+obj-$(CONFIG_VFIO)		+= vfio/
 obj-y				+= cdrom/
 obj-y				+= auxdisplay/
 obj-$(CONFIG_PCCARD)		+= pcmcia/
diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
new file mode 100644
index 0000000..9acb1e7
--- /dev/null
+++ b/drivers/vfio/Kconfig
@@ -0,0 +1,8 @@
+menuconfig VFIO
+	tristate "VFIO Non-Privileged userspace driver framework"
+	depends on IOMMU_API
+	help
+	  VFIO provides a framework for secure userspace device drivers.
+	  See Documentation/vfio.txt for more details.
+
+	  If you don't know what to do here, say N.
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
new file mode 100644
index 0000000..7500a67
--- /dev/null
+++ b/drivers/vfio/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_VFIO) += vfio.o
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
new file mode 100644
index 0000000..af0e4f8
--- /dev/null
+++ b/drivers/vfio/vfio.c
@@ -0,0 +1,1399 @@
+/*
+ * VFIO core
+ *
+ * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
+ *     Author: Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Derived from original vfio:
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ * Author: Tom Lyon, pugs@cisco.com
+ */
+
+#include <linux/cdev.h>
+#include <linux/compat.h>
+#include <linux/device.h>
+#include <linux/file.h>
+#include <linux/anon_inodes.h>
+#include <linux/fs.h>
+#include <linux/idr.h>
+#include <linux/iommu.h>
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+#include <linux/wait.h>
+
+#define DRIVER_VERSION	"0.3"
+#define DRIVER_AUTHOR	"Alex Williamson <alex.williamson@redhat.com>"
+#define DRIVER_DESC	"VFIO - User Level meta-driver"
+
+static struct vfio {
+	struct class			*class;
+	struct list_head		iommu_drivers_list;
+	struct mutex			iommu_drivers_lock;
+	struct list_head		group_list;
+	struct idr			group_idr;
+	struct mutex			group_lock;
+	struct cdev			group_cdev;
+	struct device			*dev;
+	dev_t				devt;
+	struct cdev			cdev;
+	wait_queue_head_t		release_q;
+} vfio;
+
+struct vfio_iommu_driver {
+	const struct vfio_iommu_driver_ops	*ops;
+	struct list_head			vfio_next;
+};
+
+struct vfio_container {
+	struct kref			kref;
+	struct list_head		group_list;
+	struct mutex			group_lock;
+	struct vfio_iommu_driver	*iommu_driver;
+	void				*iommu_data;
+};
+
+struct vfio_group {
+	struct kref			kref;
+	int				minor;
+	atomic_t			container_users;
+	struct iommu_group		*iommu_group;
+	struct vfio_container		*container;
+	struct list_head		device_list;
+	struct mutex			device_lock;
+	struct device			*dev;
+	struct notifier_block		nb;
+	struct list_head		vfio_next;
+	struct list_head		container_next;
+};
+
+struct vfio_device {
+	struct kref			kref;
+	struct device			*dev;
+	const struct vfio_device_ops	*ops;
+	struct vfio_group		*group;
+	struct list_head		group_next;
+	void				*device_data;
+};
+
+/**
+ * IOMMU driver registration
+ */
+int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops)
+{
+	struct vfio_iommu_driver *driver, *tmp;
+
+	driver = kzalloc(sizeof(*driver), GFP_KERNEL);
+	if (!driver)
+		return -ENOMEM;
+
+	driver->ops = ops;
+
+	mutex_lock(&vfio.iommu_drivers_lock);
+
+	/* Check for duplicates */
+	list_for_each_entry(tmp, &vfio.iommu_drivers_list, vfio_next) {
+		if (tmp->ops == ops) {
+			mutex_unlock(&vfio.iommu_drivers_lock);
+			kfree(driver);
+			return -EINVAL;
+		}
+	}
+
+	list_add(&driver->vfio_next, &vfio.iommu_drivers_list);
+
+	mutex_unlock(&vfio.iommu_drivers_lock);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(vfio_register_iommu_driver);
+
+void vfio_unregister_iommu_driver(const struct vfio_iommu_driver_ops *ops)
+{
+	struct vfio_iommu_driver *driver;
+
+	mutex_lock(&vfio.iommu_drivers_lock);
+	list_for_each_entry(driver, &vfio.iommu_drivers_list, vfio_next) {
+		if (driver->ops == ops) {
+			list_del(&driver->vfio_next);
+			mutex_unlock(&vfio.iommu_drivers_lock);
+			kfree(driver);
+			return;
+		}
+	}
+	mutex_unlock(&vfio.iommu_drivers_lock);
+}
+EXPORT_SYMBOL_GPL(vfio_unregister_iommu_driver);
+
+/**
+ * Group minor allocation/free - both called with vfio.group_lock held
+ */
+static int vfio_alloc_group_minor(struct vfio_group *group)
+{
+	int ret, minor;
+
+again:
+	if (unlikely(idr_pre_get(&vfio.group_idr, GFP_KERNEL) == 0))
+		return -ENOMEM;
+
+	/* index 0 is used by /dev/vfio/vfio */
+	ret = idr_get_new_above(&vfio.group_idr, group, 1, &minor);
+	if (ret == -EAGAIN)
+		goto again;
+	if (ret || minor > MINORMASK) {
+		if (minor > MINORMASK)
+			idr_remove(&vfio.group_idr, minor);
+		return -ENOSPC;
+	}
+
+	return minor;
+}
+
+static void vfio_free_group_minor(int minor)
+{
+	idr_remove(&vfio.group_idr, minor);
+}
+
+static int vfio_iommu_group_notifier(struct notifier_block *nb,
+				     unsigned long action, void *data);
+static void vfio_group_get(struct vfio_group *group);
+
+/**
+ * Container objects - containers are created when /dev/vfio/vfio is
+ * opened, but their lifecycle extends until the last user is done, so
+ * it's freed via kref.  Must support container/group/device being
+ * closed in any order.
+ */
+static void vfio_container_get(struct vfio_container *container)
+{
+	kref_get(&container->kref);
+}
+
+static void vfio_container_release(struct kref *kref)
+{
+	struct vfio_container *container;
+	container = container_of(kref, struct vfio_container, kref);
+
+	kfree(container);
+}
+
+static void vfio_container_put(struct vfio_container *container)
+{
+	kref_put(&container->kref, vfio_container_release);
+}
+
+/**
+ * Group objects - create, release, get, put, search
+ */
+static struct vfio_group *vfio_create_group(struct iommu_group *iommu_group)
+{
+	struct vfio_group *group, *tmp;
+	struct device *dev;
+	int ret, minor;
+
+	group = kzalloc(sizeof(*group), GFP_KERNEL);
+	if (!group)
+		return ERR_PTR(-ENOMEM);
+
+	kref_init(&group->kref);
+	INIT_LIST_HEAD(&group->device_list);
+	mutex_init(&group->device_lock);
+	atomic_set(&group->container_users, 0);
+	group->iommu_group = iommu_group;
+
+	group->nb.notifier_call = vfio_iommu_group_notifier;
+
+	/*
+	 * blocking notifiers acquire a rwsem around registering and hold
+	 * it around callback.  Therefore, need to register outside of
+	 * vfio.group_lock to avoid A-B/B-A contention.  Our callback won't
+	 * do anything unless it can find the group in vfio.group_list, so
+	 * no harm in registering early.
+	 */
+	ret = iommu_group_register_notifier(iommu_group, &group->nb);
+	if (ret) {
+		kfree(group);
+		return ERR_PTR(ret);
+	}
+
+	mutex_lock(&vfio.group_lock);
+
+	minor = vfio_alloc_group_minor(group);
+	if (minor < 0) {
+		mutex_unlock(&vfio.group_lock);
+		kfree(group);
+		return ERR_PTR(minor);
+	}
+
+	/* Did we race creating this group? */
+	list_for_each_entry(tmp, &vfio.group_list, vfio_next) {
+		if (tmp->iommu_group == iommu_group) {
+			vfio_group_get(tmp);
+			vfio_free_group_minor(minor);
+			mutex_unlock(&vfio.group_lock);
+			kfree(group);
+			return tmp;
+		}
+	}
+
+	dev = device_create(vfio.class, NULL, MKDEV(MAJOR(vfio.devt), minor),
+			    group, "%d", iommu_group_id(iommu_group));
+	if (IS_ERR(dev)) {
+		vfio_free_group_minor(minor);
+		mutex_unlock(&vfio.group_lock);
+		kfree(group);
+		return (struct vfio_group *)dev; /* ERR_PTR */
+	}
+
+	group->minor = minor;
+	group->dev = dev;
+
+	list_add(&group->vfio_next, &vfio.group_list);
+
+	mutex_unlock(&vfio.group_lock);
+
+	return group;
+}
+
+static void vfio_group_release(struct kref *kref)
+{
+	struct vfio_group *group = container_of(kref, struct vfio_group, kref);
+
+	WARN_ON(!list_empty(&group->device_list));
+
+	device_destroy(vfio.class, MKDEV(MAJOR(vfio.devt), group->minor));
+	list_del(&group->vfio_next);
+	vfio_free_group_minor(group->minor);
+
+	mutex_unlock(&vfio.group_lock);
+
+	/*
+	 * Unregister outside of lock.  A spurious callback is harmless now
+	 * that the group is no longer in vfio.group_list.
+	 */
+	iommu_group_unregister_notifier(group->iommu_group, &group->nb);
+
+	kfree(group);
+}
+
+static void vfio_group_put(struct vfio_group *group)
+{
+	mutex_lock(&vfio.group_lock);
+	/*
+	 * Release needs to unlock to unregister the notifier, so only
+	 * unlock if not released.
+	 */
+	if (!kref_put(&group->kref, vfio_group_release))
+		mutex_unlock(&vfio.group_lock);
+}
+
+/* Assume group_lock or group reference is held */
+static void vfio_group_get(struct vfio_group *group)
+{
+	kref_get(&group->kref);
+}
+
+/*
+ * Not really a try as we will sleep for mutex, but we need to make
+ * sure the group pointer is valid under lock and get a reference.
+ */
+static struct vfio_group *vfio_group_try_get(struct vfio_group *group)
+{
+	struct vfio_group *target = group;
+
+	mutex_lock(&vfio.group_lock);
+	list_for_each_entry(group, &vfio.group_list, vfio_next) {
+		if (group == target) {
+			vfio_group_get(group);
+			mutex_unlock(&vfio.group_lock);
+			return group;
+		}
+	}
+	mutex_unlock(&vfio.group_lock);
+
+	return NULL;
+}
+
+static
+struct vfio_group *vfio_group_get_from_iommu(struct iommu_group *iommu_group)
+{
+	struct vfio_group *group;
+
+	mutex_lock(&vfio.group_lock);
+	list_for_each_entry(group, &vfio.group_list, vfio_next) {
+		if (group->iommu_group == iommu_group) {
+			vfio_group_get(group);
+			mutex_unlock(&vfio.group_lock);
+			return group;
+		}
+	}
+	mutex_unlock(&vfio.group_lock);
+
+	return NULL;
+}
+
+static struct vfio_group *vfio_group_get_from_minor(int minor)
+{
+	struct vfio_group *group;
+
+	mutex_lock(&vfio.group_lock);
+	group = idr_find(&vfio.group_idr, minor);
+	if (!group) {
+		mutex_unlock(&vfio.group_lock);
+		return NULL;
+	}
+	vfio_group_get(group);
+	mutex_unlock(&vfio.group_lock);
+
+	return group;
+}
+
+/**
+ * Device objects - create, release, get, put, search
+ */
+static
+struct vfio_device *vfio_group_create_device(struct vfio_group *group,
+					     struct device *dev,
+					     const struct vfio_device_ops *ops,
+					     void *device_data)
+{
+	struct vfio_device *device;
+	int ret;
+
+	device = kzalloc(sizeof(*device), GFP_KERNEL);
+	if (!device)
+		return ERR_PTR(-ENOMEM);
+
+	kref_init(&device->kref);
+	device->dev = dev;
+	device->group = group;
+	device->ops = ops;
+	device->device_data = device_data;
+
+	ret = dev_set_drvdata(dev, device);
+	if (ret) {
+		kfree(device);
+		return ERR_PTR(ret);
+	}
+
+	/* No need to get group_lock, caller has group reference */
+	vfio_group_get(group);
+
+	mutex_lock(&group->device_lock);
+	list_add(&device->group_next, &group->device_list);
+	mutex_unlock(&group->device_lock);
+
+	return device;
+}
+
+static void vfio_device_release(struct kref *kref)
+{
+	struct vfio_device *device = container_of(kref,
+						  struct vfio_device, kref);
+	struct vfio_group *group = device->group;
+
+	mutex_lock(&group->device_lock);
+	list_del(&device->group_next);
+	mutex_unlock(&group->device_lock);
+
+	dev_set_drvdata(device->dev, NULL);
+
+	kfree(device);
+
+	/* vfio_del_group_dev may be waiting for this device */
+	wake_up(&vfio.release_q);
+}
+
+/* Device reference always implies a group reference */
+static void vfio_device_put(struct vfio_device *device)
+{
+	kref_put(&device->kref, vfio_device_release);
+	vfio_group_put(device->group);
+}
+
+static void vfio_device_get(struct vfio_device *device)
+{
+	vfio_group_get(device->group);
+	kref_get(&device->kref);
+}
+
+static struct vfio_device *vfio_group_get_device(struct vfio_group *group,
+						 struct device *dev)
+{
+	struct vfio_device *device;
+
+	mutex_lock(&group->device_lock);
+	list_for_each_entry(device, &group->device_list, group_next) {
+		if (device->dev == dev) {
+			vfio_device_get(device);
+			mutex_unlock(&group->device_lock);
+			return device;
+		}
+	}
+	mutex_unlock(&group->device_lock);
+	return NULL;
+}
+
+/**
+ * Async device support
+ */
+static int vfio_group_nb_add_dev(struct vfio_group *group, struct device *dev)
+{
+	struct vfio_device *device;
+
+	/* Do we already know about it?  We shouldn't */
+	device = vfio_group_get_device(group, dev);
+	if (WARN_ON_ONCE(device)) {
+		vfio_device_put(device);
+		return 0;
+	}
+
+	/* Nothing to do for idle groups */
+	if (!atomic_read(&group->container_users))
+		return 0;
+
+	/* TODO Prevent device auto probing */
+	WARN("Device %s added to live group %d!\n", dev_name(dev),
+	     iommu_group_id(group->iommu_group));
+
+	return 0;
+}
+
+static int vfio_group_nb_del_dev(struct vfio_group *group, struct device *dev)
+{
+	struct vfio_device *device;
+
+	/*
+	 * Expect to fall out here.  If a device was in use, it would
+	 * have been bound to a vfio sub-driver, which would have blocked
+	 * in .remove at vfio_del_group_dev.  Sanity check that we no
+	 * longer track the device, so it's safe to remove.
+	 */
+	device = vfio_group_get_device(group, dev);
+	if (likely(!device))
+		return 0;
+
+	WARN("Device %s removed from live group %d!\n", dev_name(dev),
+	     iommu_group_id(group->iommu_group));
+
+	vfio_device_put(device);
+	return 0;
+}
+
+static int vfio_group_nb_verify(struct vfio_group *group, struct device *dev)
+{
+	struct vfio_device *device;
+
+	/* We don't care what happens when the group isn't in use */
+	if (!atomic_read(&group->container_users))
+		return 0;
+
+	device = vfio_group_get_device(group, dev);
+	if (device)
+		vfio_device_put(device);
+
+	return device ? 0 : -EINVAL;
+}
+
+static int vfio_iommu_group_notifier(struct notifier_block *nb,
+				     unsigned long action, void *data)
+{
+	struct vfio_group *group = container_of(nb, struct vfio_group, nb);
+	struct device *dev = data;
+
+	/*
+	 * Need to go through a group_lock lookup to get a reference or
+	 * we risk racing a group being removed.  Leave a WARN_ON for
+	 * debuging, but if the group no longer exists, a spurious notify
+	 * is harmless.
+	 */
+	group = vfio_group_try_get(group);
+	if (WARN_ON(!group))
+		return NOTIFY_OK;
+
+	switch (action) {
+	case IOMMU_GROUP_NOTIFY_ADD_DEVICE:
+		vfio_group_nb_add_dev(group, dev);
+		break;
+	case IOMMU_GROUP_NOTIFY_DEL_DEVICE:
+		vfio_group_nb_del_dev(group, dev);
+		break;
+	case IOMMU_GROUP_NOTIFY_BIND_DRIVER:
+		printk(KERN_DEBUG
+		       "%s: Device %s, group %d binding to driver\n", __func__,
+		      dev_name(dev), iommu_group_id(group->iommu_group));
+		break;
+	case IOMMU_GROUP_NOTIFY_BOUND_DRIVER:
+		printk(KERN_DEBUG
+		       "%s: Device %s, group %d bound to driver %s\n", __func__,
+		       dev_name(dev), iommu_group_id(group->iommu_group),
+		       dev->driver->name);
+		BUG_ON(vfio_group_nb_verify(group, dev));
+		break;
+	case IOMMU_GROUP_NOTIFY_UNBIND_DRIVER:
+		printk(KERN_DEBUG
+		       "%s: Device %s, group %d unbinding from driver %s\n",
+		       __func__, dev_name(dev),
+		       iommu_group_id(group->iommu_group), dev->driver->name);
+		break;
+	case IOMMU_GROUP_NOTIFY_UNBOUND_DRIVER:
+		printk(KERN_DEBUG
+		       "%s: Device %s, group %d unbound from driver\n",
+		       __func__, dev_name(dev),
+		       iommu_group_id(group->iommu_group));
+		/*
+		 * XXX An unbound device in a live group is ok, but we'd
+		 * really like to avoid the above BUG_ON by preventing other
+		 * drivers from binding to it.  Once that occurs, we have to
+		 * stop the system to maintain isolation.  At a minimum, we'd
+		 * want a toggle to disable driver auto probe for this device.
+		 */
+		break;
+	}
+
+	vfio_group_put(group);
+	return NOTIFY_OK;
+}
+
+/**
+ * VFIO driver API
+ */
+int vfio_add_group_dev(struct device *dev,
+		       const struct vfio_device_ops *ops, void *device_data)
+{
+	struct iommu_group *iommu_group;
+	struct vfio_group *group;
+	struct vfio_device *device;
+
+	iommu_group = iommu_group_get(dev);
+	if (!iommu_group)
+		return -EINVAL;
+
+	group = vfio_group_get_from_iommu(iommu_group);
+	if (!group) {
+		group = vfio_create_group(iommu_group);
+		if (IS_ERR(group)) {
+			iommu_group_put(iommu_group);
+			return PTR_ERR(group);
+		}
+	}
+
+	device = vfio_group_get_device(group, dev);
+	if (device) {
+		WARN(1, "Device %s already exists on group %d\n",
+		     dev_name(dev), iommu_group_id(iommu_group));
+		vfio_device_put(device);
+		vfio_group_put(group);
+		iommu_group_put(iommu_group);
+		return -EBUSY;
+	}
+
+	device = vfio_group_create_device(group, dev, ops, device_data);
+	if (IS_ERR(device)) {
+		vfio_group_put(group);
+		iommu_group_put(iommu_group);
+		return PTR_ERR(device);
+	}
+
+	/*
+	 * Added device holds reference to iommu_group and vfio_device
+	 * (which in turn holds reference to vfio_group).  Drop extra
+	 * group reference used while acquiring device.
+	 */
+	vfio_group_put(group);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(vfio_add_group_dev);
+
+/* Test whether a struct device is present in our tracking */
+static bool vfio_dev_present(struct device *dev)
+{
+	struct iommu_group *iommu_group;
+	struct vfio_group *group;
+	struct vfio_device *device;
+
+	iommu_group = iommu_group_get(dev);
+	if (!iommu_group)
+		return false;
+
+	group = vfio_group_get_from_iommu(iommu_group);
+	if (!group) {
+		iommu_group_put(iommu_group);
+		return false;
+	}
+
+	device = vfio_group_get_device(group, dev);
+	if (!device) {
+		vfio_group_put(group);
+		iommu_group_put(iommu_group);
+		return false;
+	}
+
+	vfio_device_put(device);
+	vfio_group_put(group);
+	iommu_group_put(iommu_group);
+	return true;
+}
+
+/*
+ * Decrement the device reference count and wait for the device to be
+ * removed.  Open file descriptors for the device... */
+void *vfio_del_group_dev(struct device *dev)
+{
+	struct vfio_device *device = dev_get_drvdata(dev);
+	struct vfio_group *group = device->group;
+	struct iommu_group *iommu_group = group->iommu_group;
+	void *device_data = device->device_data;
+
+	vfio_device_put(device);
+
+	/* TODO send a signal to encourage this to be released */
+	wait_event(vfio.release_q, !vfio_dev_present(dev));
+
+	iommu_group_put(iommu_group);
+
+	return device_data;
+}
+EXPORT_SYMBOL_GPL(vfio_del_group_dev);
+
+/**
+ * VFIO base fd, /dev/vfio/vfio
+ */
+static long vfio_ioctl_check_extension(struct vfio_container *container,
+				       unsigned long arg)
+{
+	struct vfio_iommu_driver *driver = container->iommu_driver;
+	long ret = 0;
+
+	switch (arg) {
+		/* No base extensions yet */
+	default:
+		/*
+		 * If no driver is set, poll all registered drivers for
+		 * extensions and return the first positive result.  If
+		 * a driver is already set, further queries will be passed
+		 * only to that driver.
+		 */
+		if (!driver) {
+			mutex_lock(&vfio.iommu_drivers_lock);
+			list_for_each_entry(driver, &vfio.iommu_drivers_list,
+					    vfio_next) {
+				if (!try_module_get(driver->ops->owner))
+					continue;
+
+				ret = driver->ops->ioctl(NULL,
+							 VFIO_CHECK_EXTENSION,
+							 arg);
+				module_put(driver->ops->owner);
+				if (ret > 0)
+					break;
+			}
+			mutex_unlock(&vfio.iommu_drivers_lock);
+		} else
+			ret = driver->ops->ioctl(container->iommu_data,
+						 VFIO_CHECK_EXTENSION, arg);
+	}
+
+	return ret;
+}
+
+/* hold container->group_lock */
+static int __vfio_container_attach_groups(struct vfio_container *container,
+					  struct vfio_iommu_driver *driver,
+					  void *data)
+{
+	struct vfio_group *group;
+	int ret = -ENODEV;
+
+	list_for_each_entry(group, &container->group_list, container_next) {
+		ret = driver->ops->attach_group(data, group->iommu_group);
+		if (ret)
+			goto unwind;
+	}
+
+	return ret;
+
+unwind:
+	list_for_each_entry_continue_reverse(group, &container->group_list,
+					     container_next) {
+		driver->ops->detach_group(data, group->iommu_group);
+	}
+
+	return ret;
+}
+
+static long vfio_ioctl_set_iommu(struct vfio_container *container,
+				 unsigned long arg)
+{
+	struct vfio_iommu_driver *driver;
+	long ret = -ENODEV;
+
+	mutex_lock(&container->group_lock);
+
+	/*
+	 * The container is designed to be an unprivileged interface while
+	 * the group can be assigned to specific users.  Therefore, only by
+	 * adding a group to a container does the user get the privilege of
+	 * enabling the iommu, which may allocate finite resources.  There
+	 * is no unset_iommu, but by removing all the groups from a container,
+	 * the container is deprivileged and returns to an unset state.
+	 */
+	if (list_empty(&container->group_list) || container->iommu_driver) {
+		mutex_unlock(&container->group_lock);
+		return -EINVAL;
+	}
+
+	mutex_lock(&vfio.iommu_drivers_lock);
+	list_for_each_entry(driver, &vfio.iommu_drivers_list, vfio_next) {
+		void *data;
+
+		if (!try_module_get(driver->ops->owner))
+			continue;
+
+		/*
+		 * The arg magic for SET_IOMMU is the same as CHECK_EXTENSION,
+		 * so test which iommu driver reported support for this
+		 * extension and call open on them.  We also pass them the
+		 * magic, allowing a single driver to support multiple
+		 * interfaces if they'd like.
+		 */
+		if (driver->ops->ioctl(NULL, VFIO_CHECK_EXTENSION, arg) <= 0) {
+			module_put(driver->ops->owner);
+			continue;
+		}
+
+		/* module reference holds the driver we're working on */
+		mutex_unlock(&vfio.iommu_drivers_lock);
+
+		data = driver->ops->open(arg);
+		if (IS_ERR(data)) {
+			ret = PTR_ERR(data);
+			goto skip_drivers_unlock;
+		}
+
+		ret = __vfio_container_attach_groups(container, driver, data);
+		if (!ret) {
+			container->iommu_driver = driver;
+			container->iommu_data = data;
+		} else
+			driver->ops->release(data);
+
+		goto skip_drivers_unlock;
+	}
+
+	mutex_unlock(&vfio.iommu_drivers_lock);
+skip_drivers_unlock:
+	mutex_unlock(&container->group_lock);
+
+	return ret;
+}
+
+static long vfio_fops_unl_ioctl(struct file *filep,
+				unsigned int cmd, unsigned long arg)
+{
+	struct vfio_container *container = filep->private_data;
+	struct vfio_iommu_driver *driver;
+	void *data;
+	long ret = -EINVAL;
+
+	if (!container)
+		return ret;
+
+	driver = container->iommu_driver;
+	data = container->iommu_data;
+
+	switch (cmd) {
+	case VFIO_GET_API_VERSION:
+		ret = VFIO_API_VERSION;
+		break;
+	case VFIO_CHECK_EXTENSION:
+		ret = vfio_ioctl_check_extension(container, arg);
+		break;
+	case VFIO_SET_IOMMU:
+		ret = vfio_ioctl_set_iommu(container, arg);
+		break;
+	default:
+		if (driver) /* passthrough all unrecognized ioctls */
+			ret = driver->ops->ioctl(data, cmd, arg);
+	}
+
+	return ret;
+}
+
+#ifdef CONFIG_COMPAT
+static long vfio_fops_compat_ioctl(struct file *filep,
+				   unsigned int cmd, unsigned long arg)
+{
+	arg = (unsigned long)compat_ptr(arg);
+	return vfio_fops_unl_ioctl(filep, cmd, arg);
+}
+#endif	/* CONFIG_COMPAT */
+
+static int vfio_fops_open(struct inode *inode, struct file *filep)
+{
+	struct vfio_container *container;
+
+	container = kzalloc(sizeof(*container), GFP_KERNEL);
+	if (!container)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&container->group_list);
+	mutex_init(&container->group_lock);
+	kref_init(&container->kref);
+
+	filep->private_data = container;
+
+	return 0;
+}
+
+static int vfio_fops_release(struct inode *inode, struct file *filep)
+{
+	struct vfio_container *container = filep->private_data;
+
+	filep->private_data = NULL;
+
+	vfio_container_put(container);
+
+	return 0;
+}
+
+/*
+ * Once an iommu driver is set, we optionally pass read/write/mmap
+ * on to the driver, allowing management interfaces beyond ioctl.
+ */
+static ssize_t vfio_fops_read(struct file *filep, char __user *buf,
+			      size_t count, loff_t *ppos)
+{
+	struct vfio_container *container = filep->private_data;
+	struct vfio_iommu_driver *driver = container->iommu_driver;
+
+	if (unlikely(!driver || !driver->ops->read))
+		return -EINVAL;
+
+	return driver->ops->read(container->iommu_data, buf, count, ppos);
+}
+
+static ssize_t vfio_fops_write(struct file *filep, const char __user *buf,
+			       size_t count, loff_t *ppos)
+{
+	struct vfio_container *container = filep->private_data;
+	struct vfio_iommu_driver *driver = container->iommu_driver;
+
+	if (unlikely(!driver || !driver->ops->write))
+		return -EINVAL;
+
+	return driver->ops->write(container->iommu_data, buf, count, ppos);
+}
+
+static int vfio_fops_mmap(struct file *filep, struct vm_area_struct *vma)
+{
+	struct vfio_container *container = filep->private_data;
+	struct vfio_iommu_driver *driver = container->iommu_driver;
+
+	if (unlikely(!driver || !driver->ops->mmap))
+		return -EINVAL;
+
+	return driver->ops->mmap(container->iommu_data, vma);
+}
+
+static const struct file_operations vfio_fops = {
+	.owner		= THIS_MODULE,
+	.open		= vfio_fops_open,
+	.release	= vfio_fops_release,
+	.read		= vfio_fops_read,
+	.write		= vfio_fops_write,
+	.unlocked_ioctl	= vfio_fops_unl_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= vfio_fops_compat_ioctl,
+#endif
+	.mmap		= vfio_fops_mmap,
+};
+
+/**
+ * VFIO Group fd, /dev/vfio/$GROUP
+ */
+static void __vfio_group_unset_container(struct vfio_group *group)
+{
+	struct vfio_container *container = group->container;
+	struct vfio_iommu_driver *driver;
+
+	mutex_lock(&container->group_lock);
+
+	driver = container->iommu_driver;
+	if (driver)
+		driver->ops->detach_group(container->iommu_data,
+					  group->iommu_group);
+
+	group->container = NULL;
+	list_del(&group->container_next);
+
+	/* Detaching the last group deprivileges a container, remove iommu */
+	if (driver && list_empty(&container->group_list)) {
+		driver->ops->release(container->iommu_data);
+		module_put(driver->ops->owner);
+		container->iommu_driver = NULL;
+		container->iommu_data = NULL;
+	}
+
+	mutex_unlock(&container->group_lock);
+
+	vfio_container_put(container);
+}
+
+/*
+ * VFIO_GROUP_UNSET_CONTAINER should fail if there are other users or
+ * if there was no container to unset.  Since the ioctl is called on
+ * the group, we know that still exists, therefore the only valid
+ * transition here is 1->0.
+ */
+static int vfio_group_unset_container(struct vfio_group *group)
+{
+	int users = atomic_cmpxchg(&group->container_users, 1, 0);
+
+	if (!users)
+		return -EINVAL;
+	if (users != 1)
+		return -EBUSY;
+
+	__vfio_group_unset_container(group);
+
+	return 0;
+}
+
+/*
+ * When removing container users, anything that removes the last user
+ * implicitly removes the group from the container.  That is, if the
+ * group file descriptor is closed, as well as any device file descriptors,
+ * the group is free.
+ */
+static void vfio_group_try_dissolve_container(struct vfio_group *group)
+{
+	if (0 == atomic_dec_if_positive(&group->container_users))
+		__vfio_group_unset_container(group);
+}
+
+static int vfio_group_set_container(struct vfio_group *group, int container_fd)
+{
+	struct file *filep;
+	struct vfio_container *container;
+	struct vfio_iommu_driver *driver;
+	int ret = 0;
+
+	if (atomic_read(&group->container_users))
+		return -EINVAL;
+
+	filep = fget(container_fd);
+	if (!filep)
+		return -EBADF;
+
+	/* Sanity check, is this really our fd? */
+	if (filep->f_op != &vfio_fops) {
+		fput(filep);
+		return -EINVAL;
+	}
+
+	container = filep->private_data;
+	WARN_ON(!container); /* fget ensures we don't race vfio_release */
+
+	mutex_lock(&container->group_lock);
+
+	driver = container->iommu_driver;
+	if (driver) {
+		ret = driver->ops->attach_group(container->iommu_data,
+						group->iommu_group);
+		if (ret)
+			goto unlock_out;
+	}
+
+	group->container = container;
+	list_add(&group->container_next, &container->group_list);
+
+	/* Get a reference on the container and mark a user within the group */
+	vfio_container_get(container);
+	atomic_inc(&group->container_users);
+
+unlock_out:
+	mutex_unlock(&container->group_lock);
+	fput(filep);
+	if (ret)
+		vfio_container_put(container);
+
+	return ret;
+}
+
+/*
+ * A vfio group is viable for use by userspace if all devices are either
+ * driver-less or bound to a vfio driver.  We test the latter by the
+ * existence of a struct vfio_device matching the dev.
+ */
+static int vfio_dev_viable(struct device *dev, void *data)
+{
+	struct vfio_group *group = data;
+	struct vfio_device *device;
+
+	if (!dev->driver)
+		return 0;
+
+	device = vfio_group_get_device(group, dev);
+	vfio_device_put(device);
+
+	if (!device)
+		return -EINVAL;
+
+	return 0;
+}
+
+static bool vfio_group_viable(struct vfio_group *group)
+{
+	return (iommu_group_for_each_dev(group->iommu_group,
+					 group, vfio_dev_viable) == 0);
+}
+
+static const struct file_operations vfio_device_fops;
+
+static int vfio_group_get_device_fd(struct vfio_group *group, char *buf)
+{
+	struct vfio_device *device;
+	struct file *filep;
+	int ret = -ENODEV;
+
+	if (0 == atomic_read(&group->container_users) ||
+	    !group->container->iommu_driver || !vfio_group_viable(group))
+		return -EINVAL;
+
+	mutex_lock(&group->device_lock);
+	list_for_each_entry(device, &group->device_list, group_next) {
+		if (strcmp(dev_name(device->dev), buf))
+			continue;
+
+		ret = device->ops->open(device->device_data);
+		if (ret)
+			break;
+		/*
+		 * We can't use anon_inode_getfd() because we need to modify
+		 * the f_mode flags directly to allow more than just ioctls
+		 */
+		ret = get_unused_fd();
+		if (ret < 0) {
+			device->ops->release(device->device_data);
+			break;
+		}
+
+		filep = anon_inode_getfile("[vfio-device]", &vfio_device_fops,
+					   device, O_RDWR);
+		if (IS_ERR(filep)) {
+			put_unused_fd(ret);
+			ret = PTR_ERR(filep);
+			device->ops->release(device->device_data);
+			break;
+		}
+
+		/*
+		 * TODO: add an anon_inode interface to do this.
+		 * Appears to be missing by lack of need rather than
+		 * explicitly prevented.  Now there's need.
+		 */
+		filep->f_mode |= (FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE);
+
+		fd_install(ret, filep);
+
+		vfio_device_get(device);
+		atomic_inc(&group->container_users);
+		break;
+	}
+	mutex_unlock(&group->device_lock);
+
+	return ret;
+}
+
+static long vfio_group_fops_unl_ioctl(struct file *filep,
+				      unsigned int cmd, unsigned long arg)
+{
+	struct vfio_group *group = filep->private_data;
+	long ret = -ENOTTY;
+
+	switch (cmd) {
+	case VFIO_GROUP_GET_STATUS:
+	{
+		struct vfio_group_status status;
+		unsigned long minsz;
+
+		minsz = offsetofend(struct vfio_group_status, flags);
+
+		if (copy_from_user(&status, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (status.argsz < minsz)
+			return -EINVAL;
+
+		status.flags = 0;
+
+		if (vfio_group_viable(group))
+			status.flags |= VFIO_GROUP_FLAGS_VIABLE;
+
+		if (group->container)
+			status.flags |= VFIO_GROUP_FLAGS_CONTAINER_SET;
+
+		ret = copy_to_user((void __user *)arg, &status, minsz) ?
+			-EFAULT : 0;
+
+		break;
+	}
+	case VFIO_GROUP_SET_CONTAINER:
+	{
+		int fd;
+
+		if (get_user(fd, (int __user *)arg))
+			return -EFAULT;
+
+		if (fd < 0)
+			return -EINVAL;
+
+		ret = vfio_group_set_container(group, fd);
+		break;
+	}
+	case VFIO_GROUP_UNSET_CONTAINER:
+		ret = vfio_group_unset_container(group);
+		break;
+	case VFIO_GROUP_GET_DEVICE_FD:
+	{
+		char *buf;
+
+		buf = strndup_user((const char __user *)arg, PAGE_SIZE);
+		if (IS_ERR(buf))
+			return PTR_ERR(buf);
+
+		ret = vfio_group_get_device_fd(group, buf);
+		kfree(buf);
+		break;
+	}
+	}
+
+	return ret;
+}
+
+#ifdef CONFIG_COMPAT
+static long vfio_group_fops_compat_ioctl(struct file *filep,
+					 unsigned int cmd, unsigned long arg)
+{
+	arg = (unsigned long)compat_ptr(arg);
+	return vfio_group_fops_unl_ioctl(filep, cmd, arg);
+}
+#endif	/* CONFIG_COMPAT */
+
+static int vfio_group_fops_open(struct inode *inode, struct file *filep)
+{
+	struct vfio_group *group;
+
+	group = vfio_group_get_from_minor(iminor(inode));
+	if (!group)
+		return -ENODEV;
+
+	if (group->container) {
+		vfio_group_put(group);
+		return -EBUSY;
+	}
+
+	filep->private_data = group;
+
+	return 0;
+}
+
+static int vfio_group_fops_release(struct inode *inode, struct file *filep)
+{
+	struct vfio_group *group = filep->private_data;
+
+	filep->private_data = NULL;
+
+	vfio_group_try_dissolve_container(group);
+
+	vfio_group_put(group);
+
+	return 0;
+}
+
+static const struct file_operations vfio_group_fops = {
+	.owner		= THIS_MODULE,
+	.unlocked_ioctl	= vfio_group_fops_unl_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= vfio_group_fops_compat_ioctl,
+#endif
+	.open		= vfio_group_fops_open,
+	.release	= vfio_group_fops_release,
+};
+
+/**
+ * VFIO Device fd
+ */
+static int vfio_device_fops_release(struct inode *inode, struct file *filep)
+{
+	struct vfio_device *device = filep->private_data;
+
+	device->ops->release(device->device_data);
+
+	vfio_group_try_dissolve_container(device->group);
+
+	vfio_device_put(device);
+
+	return 0;
+}
+
+static long vfio_device_fops_unl_ioctl(struct file *filep,
+				       unsigned int cmd, unsigned long arg)
+{
+	struct vfio_device *device = filep->private_data;
+
+	if (unlikely(!device->ops->ioctl))
+		return -EINVAL;
+
+	return device->ops->ioctl(device->device_data, cmd, arg);
+}
+
+static ssize_t vfio_device_fops_read(struct file *filep, char __user *buf,
+				     size_t count, loff_t *ppos)
+{
+	struct vfio_device *device = filep->private_data;
+
+	if (unlikely(!device->ops->read))
+		return -EINVAL;
+
+	return device->ops->read(device->device_data, buf, count, ppos);
+}
+
+static ssize_t vfio_device_fops_write(struct file *filep,
+				      const char __user *buf,
+				      size_t count, loff_t *ppos)
+{
+	struct vfio_device *device = filep->private_data;
+
+	if (unlikely(!device->ops->write))
+		return -EINVAL;
+
+	return device->ops->write(device->device_data, buf, count, ppos);
+}
+
+static int vfio_device_fops_mmap(struct file *filep, struct vm_area_struct *vma)
+{
+	struct vfio_device *device = filep->private_data;
+
+	if (unlikely(!device->ops->mmap))
+		return -EINVAL;
+
+	return device->ops->mmap(device->device_data, vma);
+}
+
+#ifdef CONFIG_COMPAT
+static long vfio_device_fops_compat_ioctl(struct file *filep,
+					  unsigned int cmd, unsigned long arg)
+{
+	arg = (unsigned long)compat_ptr(arg);
+	return vfio_device_fops_unl_ioctl(filep, cmd, arg);
+}
+#endif	/* CONFIG_COMPAT */
+
+static const struct file_operations vfio_device_fops = {
+	.owner		= THIS_MODULE,
+	.release	= vfio_device_fops_release,
+	.read		= vfio_device_fops_read,
+	.write		= vfio_device_fops_write,
+	.unlocked_ioctl	= vfio_device_fops_unl_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= vfio_device_fops_compat_ioctl,
+#endif
+	.mmap		= vfio_device_fops_mmap,
+};
+
+/**
+ * Module/class support
+ */
+static char *vfio_devnode(struct device *dev, umode_t *mode)
+{
+	return kasprintf(GFP_KERNEL, "vfio/%s", dev_name(dev));
+}
+
+static int __init vfio_init(void)
+{
+	int ret;
+
+	idr_init(&vfio.group_idr);
+	mutex_init(&vfio.group_lock);
+	mutex_init(&vfio.iommu_drivers_lock);
+	INIT_LIST_HEAD(&vfio.group_list);
+	INIT_LIST_HEAD(&vfio.iommu_drivers_list);
+	init_waitqueue_head(&vfio.release_q);
+
+	vfio.class = class_create(THIS_MODULE, "vfio");
+	if (IS_ERR(vfio.class)) {
+		ret = PTR_ERR(vfio.class);
+		goto err_class;
+	}
+
+	vfio.class->devnode = vfio_devnode;
+
+	ret = alloc_chrdev_region(&vfio.devt, 0, MINORMASK, "vfio");
+	if (ret)
+		goto err_base_chrdev;
+
+	cdev_init(&vfio.cdev, &vfio_fops);
+	ret = cdev_add(&vfio.cdev, vfio.devt, 1);
+	if (ret)
+		goto err_base_cdev;
+
+	vfio.dev = device_create(vfio.class, NULL, vfio.devt, NULL, "vfio");
+	if (IS_ERR(vfio.dev)) {
+		ret = PTR_ERR(vfio.dev);
+		goto err_base_dev;
+	}
+
+	/* /dev/vfio/$GROUP */
+	cdev_init(&vfio.group_cdev, &vfio_group_fops);
+	ret = cdev_add(&vfio.group_cdev,
+		       MKDEV(MAJOR(vfio.devt), 1), MINORMASK - 1);
+	if (ret)
+		goto err_groups_cdev;
+
+	pr_info(DRIVER_DESC " version: " DRIVER_VERSION "\n");
+
+	return 0;
+
+err_groups_cdev:
+	device_destroy(vfio.class, vfio.devt);
+err_base_dev:
+	cdev_del(&vfio.cdev);
+err_base_cdev:
+	unregister_chrdev_region(vfio.devt, MINORMASK);
+err_base_chrdev:
+	class_destroy(vfio.class);
+	vfio.class = NULL;
+err_class:
+	return ret;
+}
+
+static void __exit vfio_cleanup(void)
+{
+	WARN_ON(!list_empty(&vfio.group_list));
+
+	idr_destroy(&vfio.group_idr);
+	cdev_del(&vfio.group_cdev);
+	device_destroy(vfio.class, vfio.devt);
+	cdev_del(&vfio.cdev);
+	unregister_chrdev_region(vfio.devt, MINORMASK);
+	class_destroy(vfio.class);
+	vfio.class = NULL;
+}
+
+module_init(vfio_init);
+module_exit(vfio_cleanup);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
new file mode 100644
index 0000000..a264054
--- /dev/null
+++ b/include/linux/vfio.h
@@ -0,0 +1,364 @@
+/*
+ * VFIO API definition
+ *
+ * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
+ *     Author: Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#ifndef VFIO_H
+#define VFIO_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+#define VFIO_API_VERSION	0
+
+#ifdef __KERNEL__	/* Internal VFIO-core/bus driver API */
+
+/**
+ * struct vfio_device_ops - VFIO bus driver device callbacks
+ *
+ * @open: Called when userspace creates new file descriptor for device
+ * @release: Called when userspace releases file descriptor for device
+ * @read: Perform read(2) on device file descriptor
+ * @write: Perform write(2) on device file descriptor
+ * @ioctl: Perform ioctl(2) on device file descriptor, supporting VFIO_DEVICE_*
+ *         operations documented below
+ * @mmap: Perform mmap(2) on a region of the device file descriptor
+ */
+struct vfio_device_ops {
+	char	*name;
+	int	(*open)(void *device_data);
+	void	(*release)(void *device_data);
+	ssize_t	(*read)(void *device_data, char __user *buf,
+			size_t count, loff_t *ppos);
+	ssize_t	(*write)(void *device_data, const char __user *buf,
+			 size_t count, loff_t *size);
+	long	(*ioctl)(void *device_data, unsigned int cmd,
+			 unsigned long arg);
+	int	(*mmap)(void *device_data, struct vm_area_struct *vma);
+};
+
+extern int vfio_add_group_dev(struct device *dev,
+			      const struct vfio_device_ops *ops,
+			      void *device_data);
+
+extern void *vfio_del_group_dev(struct device *dev);
+
+/**
+ * struct vfio_iommu_driver_ops - VFIO IOMMU driver callbacks
+ */
+struct vfio_iommu_driver_ops {
+	char		*name;
+	struct module	*owner;
+	void		*(*open)(unsigned long arg);
+	void		(*release)(void *iommu_data);
+	ssize_t		(*read)(void *iommu_data, char __user *buf,
+				size_t count, loff_t *ppos);
+	ssize_t		(*write)(void *iommu_data, const char __user *buf,
+				 size_t count, loff_t *size);
+	long		(*ioctl)(void *iommu_data, unsigned int cmd,
+				 unsigned long arg);
+	int		(*mmap)(void *iommu_data, struct vm_area_struct *vma);
+	int		(*attach_group)(void *iommu_data,
+					struct iommu_group *group);
+	void		(*detach_group)(void *iommu_data,
+					struct iommu_group *group);
+
+};
+
+extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
+
+extern void vfio_unregister_iommu_driver(
+				const struct vfio_iommu_driver_ops *ops);
+
+/**
+ * offsetofend(TYPE, MEMBER)
+ *
+ * @TYPE: The type of the structure
+ * @MEMBER: The member within the structure to get the end offset of
+ *
+ * Simple helper macro for dealing with variable sized structures passed
+ * from user space.  This allows us to easily determine if the provided
+ * structure is sized to include various fields.
+ */
+#define offsetofend(TYPE, MEMBER) ({				\
+	TYPE tmp;						\
+	offsetof(TYPE, MEMBER) + sizeof(tmp.MEMBER); })
+
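+/*
+ * For example, an ioctl handler for a hypothetical extended structure
+ * (a sketch only; "foo" and its "bar" member are illustrative, not
+ * part of this API) might validate the user-supplied size as:
+ *
+ *	struct foo foo;
+ *	unsigned long minsz = offsetofend(struct foo, bar);
+ *
+ *	if (copy_from_user(&foo, (void __user *)arg, minsz))
+ *		return -EFAULT;
+ *	if (foo.argsz < minsz)
+ *		return -EINVAL;
+ */
+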
+#endif /* __KERNEL__ */
+
+/* Kernel & User level defines for VFIO IOCTLs. */
+
+/* Extensions */
+
+#define VFIO_X86_IOMMU		1
+
+/*
+ * The IOCTL interface is designed for extensibility by embedding the
+ * structure length (argsz) and flags into structures passed between
+ * kernel and userspace.  We therefore use the _IO() macro for these
+ * defines to avoid implicitly embedding a size into the ioctl request.
+ * As structure fields are added, argsz will increase to match and flag
+ * bits will be defined to indicate additional fields with valid data.
+ * It's *always* the caller's responsibility to indicate the size of
+ * the structure passed by setting argsz appropriately.
+ */
+
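+/*
+ * From userspace this means initializing argsz before every call,
+ * e.g. (an illustrative sketch):
+ *
+ *	struct vfio_device_info info = { .argsz = sizeof(info) };
+ *
+ *	ioctl(device_fd, VFIO_DEVICE_GET_INFO, &info);
+ *
+ * Flag bits in the returned structure indicate which extended
+ * fields, if any, contain valid data.
+ */
+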
+#define VFIO_TYPE	(';')
+#define VFIO_BASE	100
+
+/* -------- IOCTLs for VFIO file descriptor (/dev/vfio/vfio) -------- */
+
+/**
+ * VFIO_GET_API_VERSION - _IO(VFIO_TYPE, VFIO_BASE + 0)
+ *
+ * Report the version of the VFIO API.  This allows us to bump the entire
+ * API version should we later need to add or change features in incompatible
+ * ways.
+ * Return: VFIO_API_VERSION
+ * Availability: Always
+ */
+#define VFIO_GET_API_VERSION		_IO(VFIO_TYPE, VFIO_BASE + 0)
+
+/**
+ * VFIO_CHECK_EXTENSION - _IOW(VFIO_TYPE, VFIO_BASE + 1, __s32)
+ *
+ * Check whether an extension is supported.
+ * Return: 0 if not supported, 1 (or some other positive integer) if supported.
+ * Availability: Always
+ */
+#define VFIO_CHECK_EXTENSION		_IO(VFIO_TYPE, VFIO_BASE + 1)
+
+/**
+ * VFIO_SET_IOMMU - _IOW(VFIO_TYPE, VFIO_BASE + 2, __s32)
+ *
+ * Set the iommu to the given type.  The type must be supported by an
+ * iommu driver as verified by calling CHECK_EXTENSION using the same
+ * type.  A group must be set to this file descriptor before this
+ * ioctl is available.  The IOMMU interfaces enabled by this call are
+ * specific to the value set.
+ * Return: 0 on success, -errno on failure
+ * Availability: When VFIO group attached
+ */
+#define VFIO_SET_IOMMU			_IO(VFIO_TYPE, VFIO_BASE + 2)
+
+/* -------- IOCTLs for GROUP file descriptors (/dev/vfio/$GROUP) -------- */
+
+/**
+ * VFIO_GROUP_GET_STATUS - _IOR(VFIO_TYPE, VFIO_BASE + 3,
+ *						struct vfio_group_status)
+ *
+ * Retrieve information about the group.  Fills in provided
+ * struct vfio_group_status.  Caller sets argsz.
+ * Return: 0 on success, -errno on failure.
+ * Availability: Always
+ */
+struct vfio_group_status {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_GROUP_FLAGS_VIABLE		(1 << 0)
+#define VFIO_GROUP_FLAGS_CONTAINER_SET	(1 << 1)
+};
+#define VFIO_GROUP_GET_STATUS		_IO(VFIO_TYPE, VFIO_BASE + 3)
+
+/**
+ * VFIO_GROUP_SET_CONTAINER - _IOW(VFIO_TYPE, VFIO_BASE + 4, __s32)
+ *
+ * Set the container for the VFIO group to the open VFIO file
+ * descriptor provided.  Groups may only belong to a single
+ * container.  Containers may, at their discretion, support multiple
+ * groups.  Only when a container is set are all of the interfaces
+ * of the VFIO file descriptor and the VFIO group file descriptor
+ * available to the user.
+ * Return: 0 on success, -errno on failure.
+ * Availability: Always
+ */
+#define VFIO_GROUP_SET_CONTAINER	_IO(VFIO_TYPE, VFIO_BASE + 4)
+
+/**
+ * VFIO_GROUP_UNSET_CONTAINER - _IO(VFIO_TYPE, VFIO_BASE + 5)
+ *
+ * Remove the group from the attached container.  This is the
+ * opposite of the SET_CONTAINER call and returns the group to
+ * an initial state.  All device file descriptors must be released
+ * prior to calling this interface.  When removing the last group
+ * from a container, the IOMMU will be disabled and all state lost,
+ * effectively also returning the VFIO file descriptor to an initial
+ * state.
+ * Return: 0 on success, -errno on failure.
+ * Availability: When attached to container
+ */
+#define VFIO_GROUP_UNSET_CONTAINER	_IO(VFIO_TYPE, VFIO_BASE + 5)
+
+/**
+ * VFIO_GROUP_GET_DEVICE_FD - _IOW(VFIO_TYPE, VFIO_BASE + 6, char)
+ *
+ * Return a new file descriptor for the device object described by
+ * the provided string.  The string should match a device listed in
+ * the devices subdirectory of the IOMMU group sysfs entry.  The
+ * group containing the device must already be added to this context.
+ * Return: new file descriptor on success, -errno on failure.
+ * Availability: When attached to container
+ */
+#define VFIO_GROUP_GET_DEVICE_FD	_IO(VFIO_TYPE, VFIO_BASE + 6)
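+
+/*
+ * For example (an illustrative sketch):
+ *
+ *	int fd = ioctl(group_fd, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");
+ */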
+
+/* --------------- IOCTLs for DEVICE file descriptors --------------- */
+
+/**
+ * VFIO_DEVICE_GET_INFO - _IOR(VFIO_TYPE, VFIO_BASE + 7,
+ *						struct vfio_device_info)
+ *
+ * Retrieve information about the device.  Fills in provided
+ * struct vfio_device_info.  Caller sets argsz.
+ * Return: 0 on success, -errno on failure.
+ */
+struct vfio_device_info {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_DEVICE_FLAGS_RESET	(1 << 0)	/* Device supports reset */
+	__u32	num_regions;	/* Max region index + 1 */
+	__u32	num_irqs;	/* Max IRQ index + 1 */
+};
+#define VFIO_DEVICE_GET_INFO		_IO(VFIO_TYPE, VFIO_BASE + 7)
+
+/**
+ * VFIO_DEVICE_GET_REGION_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 8,
+ *				       struct vfio_region_info)
+ *
+ * Retrieve information about a device region.  Caller provides
+ * struct vfio_region_info with index value set.  Caller sets argsz.
+ * Implementation of region mapping is bus driver specific.  This is
+ * intended to describe MMIO, I/O port, as well as bus specific
+ * regions (ex. PCI config space).  Zero sized regions may be used
+ * to describe unimplemented regions (ex. unimplemented PCI BARs).
+ * Return: 0 on success, -errno on failure.
+ */
+struct vfio_region_info {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_REGION_INFO_FLAG_READ	(1 << 0) /* Region supports read */
+#define VFIO_REGION_INFO_FLAG_WRITE	(1 << 1) /* Region supports write */
+#define VFIO_REGION_INFO_FLAG_MMAP	(1 << 2) /* Region supports mmap */
+	__u32	index;		/* Region index */
+	__u32	resv;		/* Reserved for alignment */
+	__u64	size;		/* Region size (bytes) */
+	__u64	offset;		/* Region offset from start of device fd */
+};
+#define VFIO_DEVICE_GET_REGION_INFO	_IO(VFIO_TYPE, VFIO_BASE + 8)
+
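+/*
+ * A region's offset and size define a window into the device file
+ * descriptor.  For example (an illustrative sketch, error checking
+ * omitted), reading the first four bytes of region 0:
+ *
+ *	struct vfio_region_info reg = { .argsz = sizeof(reg), .index = 0 };
+ *	__u32 val;
+ *
+ *	ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &reg);
+ *	if (reg.flags & VFIO_REGION_INFO_FLAG_READ)
+ *		pread(device_fd, &val, sizeof(val), reg.offset);
+ */
+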
+/**
+ * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
+ *				    struct vfio_irq_info)
+ *
+ * Retrieve information about a device IRQ.  Caller provides
+ * struct vfio_irq_info with index value set.  Caller sets argsz.
+ * Implementation of IRQ mapping is bus driver specific.  Indexes
+ * using multiple IRQs are primarily intended to support MSI-like
+ * interrupt blocks.  Zero count irq blocks may be used to describe
+ * unimplemented interrupt types.
+ *
+ * The EVENTFD flag indicates the interrupt index supports eventfd based
+ * signaling.
+ *
+ * The MASKABLE flag indicates the index supports MASK and UNMASK
+ * actions described below.
+ *
+ * AUTOMASKED indicates that after signaling, the interrupt line is
+ * automatically masked by VFIO and the user needs to unmask the line
+ * to receive new interrupts.  This is primarily intended to distinguish
+ * level triggered interrupts.
+ *
+ * The NORESIZE flag indicates that the interrupt lines within the index
+ * are set up as a set and new subindexes cannot be enabled without first
+ * disabling the entire index.  This is used for interrupts like PCI MSI
+ * and MSI-X where the driver may only use a subset of the available
+ * indexes, but VFIO needs to enable a specific number of vectors
+ * upfront.  In the case of MSI-X, where the user can enable MSI-X and
+ * then add and unmask vectors, it's up to userspace to make the decision
+ * whether to allocate the maximum supported number of vectors or tear
+ * down the setup and incrementally increase the vectors as each is enabled.
+ */
+struct vfio_irq_info {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_IRQ_INFO_EVENTFD		(1 << 0)
+#define VFIO_IRQ_INFO_MASKABLE		(1 << 1)
+#define VFIO_IRQ_INFO_AUTOMASKED	(1 << 2)
+#define VFIO_IRQ_INFO_NORESIZE		(1 << 3)
+	__u32	index;		/* IRQ index */
+	__s32	count;		/* Number of IRQs within this index */
+};
+#define VFIO_DEVICE_GET_IRQ_INFO	_IO(VFIO_TYPE, VFIO_BASE + 9)
+
+/**
+ * VFIO_DEVICE_SET_IRQS - _IOW(VFIO_TYPE, VFIO_BASE + 10, struct vfio_irq_set)
+ *
+ * Set signaling, masking, and unmasking of interrupts.  Caller provides
+ * struct vfio_irq_set with all fields set.  'start' and 'count' indicate
+ * the range of subindexes being specified.
+ *
+ * The DATA flags specify the type of data provided.  If DATA_NONE, the
+ * operation performs the specified action immediately on the specified
+ * interrupt(s).  For example, to unmask AUTOMASKED interrupt [0,0]:
+ * flags = (DATA_NONE|ACTION_UNMASK), index = 0, start = 0, count = 1.
+ *
+ * DATA_BOOL allows the same actions to be applied sparsely across an
+ * array of interrupts.
+ * For example, to mask interrupts [0,1] and [0,3] (but not [0,2]):
+ * flags = (DATA_BOOL|ACTION_MASK), index = 0, start = 1, count = 3,
+ * data = {1,0,1}
+ *
+ * DATA_EVENTFD binds the specified ACTION to the provided __s32 eventfd.
+ * A value of -1 can be used to either de-assign interrupts if already
+ * assigned or skip un-assigned interrupts.  For example, to set an eventfd
+ * to be triggered for interrupts [0,0] and [0,2]:
+ * flags = (DATA_EVENTFD|ACTION_TRIGGER), index = 0, start = 0, count = 3,
+ * data = {fd1, -1, fd2}
+ * If index [0,1] is previously set, two count = 1 ioctl calls would be
+ * required to set [0,0] and [0,2] without changing [0,1].
+ *
+ * Once a signaling mechanism is set, DATA_BOOL or DATA_NONE can be used
+ * with ACTION_TRIGGER to perform kernel level interrupt loopback testing
+ * from userspace (ie. simulate hardware triggering).
+ *
+ * Setting of an event triggering mechanism to userspace for ACTION_TRIGGER
+ * enables the interrupt index for the device.  Individual subindex interrupts
+ * can be disabled using the -1 value for DATA_EVENTFD or the index can be
+ * disabled as a whole with: flags = (DATA_NONE|ACTION_TRIGGER), count = 0.
+ *
+ * Note that ACTION_[UN]MASK specify user->kernel signaling (irqfds) while
+ * ACTION_TRIGGER specifies kernel->user signaling.
+ */
+struct vfio_irq_set {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_IRQ_SET_DATA_NONE		(1 << 0) /* Data not present */
+#define VFIO_IRQ_SET_DATA_BOOL		(1 << 1) /* Data is bool (u8) */
+#define VFIO_IRQ_SET_DATA_EVENTFD	(1 << 2) /* Data is eventfd (s32) */
+#define VFIO_IRQ_SET_ACTION_MASK	(1 << 3) /* Mask interrupt */
+#define VFIO_IRQ_SET_ACTION_UNMASK	(1 << 4) /* Unmask interrupt */
+#define VFIO_IRQ_SET_ACTION_TRIGGER	(1 << 5) /* Trigger interrupt */
+	__u32	index;
+	__s32	start;
+	__s32	count;
+	__u8	data[];
+};
+#define VFIO_DEVICE_SET_IRQS		_IO(VFIO_TYPE, VFIO_BASE + 10)
+
+#define VFIO_IRQ_SET_DATA_TYPE_MASK	(VFIO_IRQ_SET_DATA_NONE | \
+					 VFIO_IRQ_SET_DATA_BOOL | \
+					 VFIO_IRQ_SET_DATA_EVENTFD)
+#define VFIO_IRQ_SET_ACTION_TYPE_MASK	(VFIO_IRQ_SET_ACTION_MASK | \
+					 VFIO_IRQ_SET_ACTION_UNMASK | \
+					 VFIO_IRQ_SET_ACTION_TRIGGER)
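+
+/*
+ * For example (an illustrative sketch), binding an eventfd to trigger
+ * subindex 0 of index 0 places a single __s32 in the variable length
+ * data array:
+ *
+ *	struct vfio_irq_set *set;
+ *	int argsz = sizeof(*set) + sizeof(__s32);
+ *
+ *	set = malloc(argsz);
+ *	set->argsz = argsz;
+ *	set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
+ *	set->index = 0;
+ *	set->start = 0;
+ *	set->count = 1;
+ *	*(__s32 *)&set->data = eventfd(0, 0);
+ *
+ *	ioctl(device_fd, VFIO_DEVICE_SET_IRQS, set);
+ */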
+/**
+ * VFIO_DEVICE_RESET - _IO(VFIO_TYPE, VFIO_BASE + 11)
+ *
+ * Reset a device.
+ */
+#define VFIO_DEVICE_RESET		_IO(VFIO_TYPE, VFIO_BASE + 11)
+
+#endif /* VFIO_H */

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 08/13] vfio: Add documentation
@ 2012-05-11 22:56   ` Alex Williamson
  0 siblings, 0 replies; 129+ messages in thread
From: Alex Williamson @ 2012-05-11 22:56 UTC (permalink / raw)
  To: benh, aik, david, joerg.roedel, dwmw2
  Cc: chrisw, agraf, benve, aafabbri, B08248, B07421, avi, konrad.wilk,
	kvm, qemu-devel, iommu, linux-pci, linux-kernel, gregkh,
	bhelgaas, alex.williamson

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

 Documentation/vfio.txt |  315 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 315 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/vfio.txt

diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
new file mode 100644
index 0000000..6808ac5
--- /dev/null
+++ b/Documentation/vfio.txt
@@ -0,0 +1,315 @@
+VFIO - "Virtual Function I/O"[1]
+-------------------------------------------------------------------------------
+Many modern systems now provide DMA and interrupt remapping facilities
+to help ensure I/O devices behave within the boundaries they've been
+allotted.  This includes x86 hardware with AMD-Vi and Intel VT-d,
+POWER systems with Partitionable Endpoints (PEs) and embedded PowerPC
+systems such as Freescale PAMU.  The VFIO driver is an IOMMU/device
+agnostic framework for exposing direct device access to userspace, in
+a secure, IOMMU protected environment.  In other words, this allows
+safe[2], non-privileged, userspace drivers.
+
+Why do we want that?  Virtual machines often make use of direct device
+access ("device assignment") when configured for the highest possible
+I/O performance.  From a device and host perspective, this simply
+turns the VM into a userspace driver, with the benefits of
+significantly reduced latency, higher bandwidth, and direct use of
+bare-metal device drivers[3].
+
+Some applications, particularly in the high performance computing
+field, also benefit from low-overhead, direct device access from
+userspace.  Examples include network adapters (often non-TCP/IP based)
+and compute accelerators.  Prior to VFIO, these drivers had to either
+go through the full development cycle to become a proper upstream
+driver, be maintained out of tree, or make use of the UIO framework,
+which has no notion of IOMMU protection, limited interrupt support,
+and requires root privileges to access things like PCI configuration
+space.
+
+The VFIO driver framework intends to unify these, replacing the
+KVM PCI-specific device assignment code and providing a more
+secure, more featureful userspace driver environment than UIO.
+
+Groups, Devices, and IOMMUs
+-------------------------------------------------------------------------------
+
+Devices are the main target of any I/O driver.  Devices typically
+create a programming interface made up of I/O access, interrupts,
+and DMA.  Without going into the details of each of these, DMA is
+by far the most critical aspect for maintaining a secure environment
+as allowing a device read-write access to system memory imposes the
+greatest risk to the overall system integrity.
+
+To help mitigate this risk, many modern IOMMUs now incorporate
+isolation properties into what was, in many cases, an interface only
+meant for translation (ie. solving the addressing problems of devices
+with limited address spaces).  With this, devices can now be isolated
+from each other and from arbitrary memory access, thus allowing
+things like secure direct assignment of devices into virtual machines.
+
+This isolation is not always at the granularity of a single device
+though.  Even when an IOMMU is capable of this, properties of devices,
+interconnects, and IOMMU topologies can each reduce this isolation.
+For instance, an individual device may be part of a larger multi-
+function enclosure.  While the IOMMU may be able to distinguish
+between devices within the enclosure, the enclosure may not require
+transactions between devices to reach the IOMMU.  Examples of this
+could be anything from a multi-function PCI device with backdoors
+between functions to a non-PCI-ACS (Access Control Services) capable
+bridge allowing redirection without reaching the IOMMU.  Topology
+can also be a factor in hiding devices.  A PCIe-to-PCI bridge masks
+the devices behind it, making transactions appear to come from the
+bridge itself.  Obviously IOMMU design is a major factor as well.
+
+Therefore, while for the most part an IOMMU may have device level
+granularity, any system is susceptible to reduced granularity.  The
+IOMMU API therefore supports a notion of IOMMU groups.  A group is
+a set of devices which is isolatable from all other devices in the
+system.  Groups are therefore the unit of ownership used by VFIO.
+
+While the group is the minimum granularity that must be used to
+ensure secure user access, it's not necessarily the preferred
+granularity.  In IOMMUs which make use of page tables, it may be
+possible to share a set of page tables between different groups,
+reducing the overhead both to the platform (reduced TLB thrashing,
+reduced duplicate page tables), and to the user (programming only
+a single set of translations).  For this reason, VFIO makes use of
+a container class, which may hold one or more groups.  A container
+is created by simply opening the /dev/vfio/vfio character device.
+
+On its own, the container provides little functionality, with all
+but a couple of version and extension query interfaces locked away.
+The user needs to add a group into the container for the next level
+of functionality.  To do this, the user first needs to identify the
+group associated with the desired device.  This can be done using
+the sysfs links described in the example below.  By unbinding the
+device from the host driver and binding it to a VFIO driver, a new
+VFIO group will appear for the group as /dev/vfio/$GROUP, where
+$GROUP is the IOMMU group number of which the device is a member.
+If the IOMMU group contains multiple devices, each will need to
+be bound to a VFIO driver before operations on the VFIO group
+are allowed (it's also sufficient to only unbind the device from
+host drivers if a VFIO driver is unavailable; this will make the
+group available, but not that particular device).  TBD - interface
+for disabling driver probing/locking a device.
+
+Once the group is ready, it may be added to the container by opening
+the VFIO group character device (/dev/vfio/$GROUP) and using the
+VFIO_GROUP_SET_CONTAINER ioctl, passing the file descriptor of the
+previously opened container file.  If desired and if the IOMMU driver
+supports sharing the IOMMU context between groups, multiple groups may
+be set to the same container.  If attaching a group to a container
+with existing groups fails, a new empty container will need to be
+used instead.
+
+With a group (or groups) attached to a container, the remaining
+ioctls become available, enabling access to the VFIO IOMMU interfaces.
+Additionally, it now becomes possible to get file descriptors for each
+device within a group using an ioctl on the VFIO group file descriptor.
+
+The VFIO device API includes ioctls for describing the device, the I/O
+regions and their read/write/mmap offsets on the device descriptor, as
+well as mechanisms for describing and registering interrupt
+notifications.
+
+VFIO Usage Example
+-------------------------------------------------------------------------------
+
+Assume a user wants to access PCI device 0000:06:0d.0
+
+$ readlink /sys/bus/pci/devices/0000:06:0d.0/iommu_group
+../../../../kernel/iommu_groups/26
+
+This device is therefore in IOMMU group 26.  It is on the PCI bus,
+so the user will make use of vfio-pci to manage the group:
+
+# modprobe vfio-pci
+
+Binding this device to the vfio-pci driver creates the VFIO group
+character devices for this group:
+
+$ lspci -n -s 0000:06:0d.0
+06:0d.0 0401: 1102:0002 (rev 08)
+# echo 0000:06:0d.0 > /sys/bus/pci/devices/0000:06:0d.0/driver/unbind
+# echo 1102 0002 > /sys/bus/pci/drivers/vfio/new_id
+# echo 0000:06:0d.0 > /sys/bus/pci/drivers/vfio/bind
+
+Now we need to look at what other devices are in the group to free
+it for use by VFIO:
+
+$ ls -l /sys/bus/pci/devices/0000:06:0d.0/iommu_group/devices
+total 0
+lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:00:1e.0 ->
+	../../../../devices/pci0000:00/0000:00:1e.0
+lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.0 ->
+	../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.0
+lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.1 ->
+	../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.1
+
+This device is behind a PCIe-to-PCI bridge[4], therefore we also
+need to add device 0000:06:0d.1 to the group following the same
+procedure as above.  Device 0000:00:1e.0 is a bridge that does
+not currently have a host driver, therefore it's not required to
+bind this device to the vfio-pci driver (vfio-pci does not currently
+support PCI bridges).
+
+The final step is to provide the user with access to the group if
+unprivileged operation is desired (note that /dev/vfio/vfio provides
+no capabilities on its own and is therefore expected to be set to
+mode 0666 by the system).
+
+# chown user:user /dev/vfio/26
+
+The user now has full access to all the devices and the iommu for this
+group and can access them as follows:
+
+	int container, group, device, i;
+	struct vfio_group_status group_status =
+					{ .argsz = sizeof(group_status) };
+	struct vfio_iommu_x86_info iommu_info = { .argsz = sizeof(iommu_info) };
+	struct vfio_iommu_x86_dma_map dma_map = { .argsz = sizeof(dma_map) };
+	struct vfio_device_info device_info = { .argsz = sizeof(device_info) };
+
+	/* Create a new container */
+	container = open("/dev/vfio/vfio", O_RDWR);
+
+	if (ioctl(container, VFIO_GET_API_VERSION) != VFIO_API_VERSION)
+		/* Unknown API version */
+
+	if (!ioctl(container, VFIO_CHECK_EXTENSION, VFIO_X86_IOMMU))
+		/* Doesn't support the IOMMU driver we want. */
+
+	/* Open the group */
+	group = open("/dev/vfio/26", O_RDWR);
+
+	/* Test the group is viable and available */
+	ioctl(group, VFIO_GROUP_GET_STATUS, &group_status);
+
+	if (!(group_status.flags & VFIO_GROUP_FLAGS_VIABLE))
+		/* Group is not viable (ie, not all devices bound for vfio) */
+
+	/* Add the group to the container */
+	ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
+
+	/* Enable the IOMMU model we want */
+	ioctl(container, VFIO_SET_IOMMU, VFIO_X86_IOMMU);
+
+	/* Get additional IOMMU info */
+	ioctl(container, VFIO_IOMMU_GET_INFO, &iommu_info);
+
+	/* Allocate some space and setup a DMA mapping */
+	dma_map.vaddr = mmap(0, 1024 * 1024, PROT_READ | PROT_WRITE,
+			     MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
+	dma_map.size = 1024 * 1024;
+	dma_map.iova = 0; /* 1MB starting at 0x0 from device view */
+	dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
+
+	ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);
+
+	/* Get a file descriptor for the device */
+	device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");
+
+	/* Test and setup the device */
+	ioctl(device, VFIO_DEVICE_GET_INFO, &device_info);
+
+	for (i = 0; i < device_info.num_regions; i++) {
+		struct vfio_region_info reg = { .argsz = sizeof(reg) };
+
+		reg.index = i;
+
+		ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &reg);
+
+		/* Setup mappings... read/write offsets, mmaps
+		 * For PCI devices, config space is a region */
+	}
+
+	for (i = 0; i < device_info.num_irqs; i++) {
+		struct vfio_irq_info irq = { .argsz = sizeof(irq) };
+
+		irq.index = i;
+
+		ioctl(device, VFIO_DEVICE_GET_IRQ_INFO, &irq);
+
+		/* Setup IRQs... eventfds, VFIO_DEVICE_SET_IRQS */
+	}
+
+	/* Gratuitous device reset and go... */
+	ioctl(device, VFIO_DEVICE_RESET);
+
+VFIO User API
+-------------------------------------------------------------------------------
+
+Please see include/linux/vfio.h for complete API documentation.
+
+VFIO bus driver API
+-------------------------------------------------------------------------------
+
+VFIO bus drivers, such as vfio-pci, make use of only a few interfaces
+into VFIO core.  When devices are bound to and unbound from the driver,
+the driver should call vfio_add_group_dev() and vfio_del_group_dev()
+respectively:
+
+extern int vfio_add_group_dev(struct device *dev,
+                              const struct vfio_device_ops *ops,
+                              void *device_data);
+
+extern void *vfio_del_group_dev(struct device *dev);
+
+vfio_add_group_dev() indicates to the core to begin tracking the
+iommu_group of the specified dev and register the dev as owned by
+a VFIO bus driver.  The driver provides an ops structure for callbacks
+similar to a file operations structure:
+
+struct vfio_device_ops {
+	int	(*open)(void *device_data);
+	void	(*release)(void *device_data);
+	ssize_t	(*read)(void *device_data, char __user *buf,
+			size_t count, loff_t *ppos);
+	ssize_t	(*write)(void *device_data, const char __user *buf,
+			 size_t size, loff_t *ppos);
+	long	(*ioctl)(void *device_data, unsigned int cmd,
+			 unsigned long arg);
+	int	(*mmap)(void *device_data, struct vm_area_struct *vma);
+};
+
+Each function is passed the device_data that was originally registered
+in the vfio_add_group_dev() call above.  This allows the bus driver
+an easy place to store its opaque, private data.  The open/release
+callbacks are issued when a new file descriptor is created for a
+device (via VFIO_GROUP_GET_DEVICE_FD).  The ioctl interface provides
+a direct pass through for VFIO_DEVICE_* ioctls.  The read/write/mmap
+interfaces implement the device region access defined by the device's
+own VFIO_DEVICE_GET_REGION_INFO ioctl.
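+
+As a sketch (illustrative only: the "foo" names and the probe/remove
+wiring below are hypothetical, and foo_vfio_ops is assumed to be the
+driver's vfio_device_ops instance), a minimal bus driver might
+register and unregister a device as follows:
+
+	static int foo_probe(struct device *dev)
+	{
+		struct foo_device *fdev;
+
+		fdev = kzalloc(sizeof(*fdev), GFP_KERNEL);
+		if (!fdev)
+			return -ENOMEM;
+
+		/* device_data handed to each vfio_device_ops callback */
+		return vfio_add_group_dev(dev, &foo_vfio_ops, fdev);
+	}
+
+	static int foo_remove(struct device *dev)
+	{
+		struct foo_device *fdev = vfio_del_group_dev(dev);
+
+		kfree(fdev);
+		return 0;
+	}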
+
+-------------------------------------------------------------------------------
+
+[1] VFIO was originally an acronym for "Virtual Function I/O" in its
+initial implementation by Tom Lyon while at Cisco.  We've since
+outgrown the acronym, but it's catchy.
+
+[2] "safe" also depends upon a device being "well behaved".  It's
+possible for multi-function devices to have backdoors between
+functions and even for single function devices to have alternative
+access to things like PCI config space through MMIO registers.  To
+guard against the former we can include additional precautions in the
+IOMMU driver to group multi-function PCI devices together
+(iommu=group_mf).  The latter we can't prevent, but the IOMMU should
+still provide isolation.  For PCI, SR-IOV Virtual Functions are the
+best indicator of "well behaved", as these are designed for
+virtualization usage models.
+
+[3] As always there are trade-offs to virtual machine device
+assignment that are beyond the scope of VFIO.  It's expected that
+future IOMMU technologies will reduce some, but maybe not all, of
+these trade-offs.
+
+[4] In this case the device is below a PCI bridge, so transactions
+from either function of the device are indistinguishable to the iommu:
+
+-[0000:00]-+-1e.0-[06]--+-0d.0
+                        \-0d.1
+
+00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 08/13] vfio: Add documentation
@ 2012-05-11 22:56   ` Alex Williamson
  0 siblings, 0 replies; 129+ messages in thread
From: Alex Williamson @ 2012-05-11 22:56 UTC (permalink / raw)
  To: benh-XVmvHMARGAS8U2dJNN8I7kB+6BGkLq7r,
	aik-sLpHqDYs0B2HXe+LvDLADg,
	david-xT8FGy+AXnRB3Ne2BGzF6laj5H9X9Tb+,
	joerg.roedel-5C7GfCeVMHo, dwmw2-wEGCiKHe2LqWVfeAwA7xHQ
  Cc: kvm-u79uwXL29TY76Z2rM5mHXA, B07421-KZfg59tc24xl57MIdRCFDg,
	linux-pci-u79uwXL29TY76Z2rM5mHXA, agraf-l3A5Bk7waGM,
	qemu-devel-qX2TKyscuCcdnm+yROfE0A, chrisw-69jw2NvuJkxg9hUCZPvPmw,
	B08248-KZfg59tc24xl57MIdRCFDg,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	avi-H+wXaHxf7aLQT0dZR+AlfA,
	gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
	bhelgaas-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	benve-FYB4Gu1CFyUAvxtiuMwx3w

Signed-off-by: Alex Williamson <alex.williamson-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---

 Documentation/vfio.txt |  315 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 315 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/vfio.txt

diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
new file mode 100644
index 0000000..6808ac5
--- /dev/null
+++ b/Documentation/vfio.txt
@@ -0,0 +1,315 @@
+VFIO - "Virtual Function I/O"[1]
+-------------------------------------------------------------------------------
+Many modern system now provide DMA and interrupt remapping facilities
+to help ensure I/O devices behave within the boundaries they've been
+allotted.  This includes x86 hardware with AMD-Vi and Intel VT-d,
+POWER systems with Partitionable Endpoints (PEs) and embedded PowerPC
+systems such as Freescale PAMU.  The VFIO driver is an IOMMU/device
+agnostic framework for exposing direct device access to userspace, in
+a secure, IOMMU protected environment.  In other words, this allows
+safe[2], non-privileged, userspace drivers.
+
+Why do we want that?  Virtual machines often make use of direct device
+access ("device assignment") when configured for the highest possible
+I/O performance.  From a device and host perspective, this simply
+turns the VM into a userspace driver, with the benefits of
+significantly reduced latency, higher bandwidth, and direct use of
+bare-metal device drivers[3].
+
+Some applications, particularly in the high performance computing
+field, also benefit from low-overhead, direct device access from
+userspace.  Examples include network adapters (often non-TCP/IP based)
+and compute accelerators.  Prior to VFIO, these drivers had to either
+go through the full development cycle to become proper upstream
+driver, be maintained out of tree, or make use of the UIO framework,
+which has no notion of IOMMU protection, limited interrupt support,
+and requires root privileges to access things like PCI configuration
+space.
+
+The VFIO driver framework intends to unify these, replacing both the
+KVM PCI specific device assignment code as well as provide a more
+secure, more featureful userspace driver environment than UIO.
+
+Groups, Devices, and IOMMUs
+-------------------------------------------------------------------------------
+
+Devices are the main target of any I/O driver.  Devices typically
+create a programming interface made up of I/O access, interrupts,
+and DMA.  Without going into the details of each of these, DMA is
+by far the most critical aspect for maintaining a secure environment
+as allowing a device read-write access to system memory imposes the
+greatest risk to the overall system integrity.
+
+To help mitigate this risk, many modern IOMMUs now incorporate
+isolation properties into what was, in many cases, an interface only
+meant for translation (ie. solving the addressing problems of devices
+with limited address spaces).  With this, devices can now be isolated
+from each other and from arbitrary memory access, thus allowing
+things like secure direct assignment of devices into virtual machines.
+
+This isolation is not always at the granularity of a single device
+though.  Even when an IOMMU is capable of this, properties of devices,
+interconnects, and IOMMU topologies can each reduce this isolation.
+For instance, an individual device may be part of a larger multi-
+function enclosure.  While the IOMMU may be able to distinguish
+between devices within the enclosure, the enclosure may not require
+transactions between devices to reach the IOMMU.  Examples of this
+could be anything from a multi-function PCI device with backdoors
+between functions to a non-PCI-ACS (Access Control Services) capable
+bridge allowing redirection without reaching the IOMMU.  Topology
+can also play a factor in terms of hiding devices.  A PCIe-to-PCI
+bridge masks the devices behind it, making transaction appear as if
+from the bridge itself.  Obviously IOMMU design plays a major factor
+as well.
+
+Therefore, while for the most part an IOMMU may have device level
+granularity, any system is susceptible to reduced granularity.  The
+IOMMU API therefore supports a notion of IOMMU groups.  A group is
+a set of devices which is isolatable from all other devices in the
+system.  Groups are therefore the unit of ownership used by VFIO.
+
+While the group is the minimum granularity that must be used to
+ensure secure user access, it's not necessarily the preferred
+granularity.  In IOMMUs which make use of page tables, it may be
+possible to share a set of page tables between different groups,
+reducing the overhead both to the platform (reduced TLB thrashing,
+reduced duplicate page tables), and to the user (programming only
+a single set of translations).  For this reason, VFIO makes use of
+a container class, which may hold one or more groups.  A container
+is created by simply opening the /dev/vfio/vfio character device.
+
+On it's own, the container provides little functionality, with all
+but a couple version and extension query interfaces locked away.
+The user needs to add a group into the container for the next level
+of functionality.  To do this, the user first needs to identify the
+group associated with the desired device.  This can be done using
+the sysfs links described in the example below.  By unbinding the
+device from the host driver and binding it to a VFIO driver, a new
+VFIO group will appear for the group as /dev/vfio/$GROUP, where
+$GROUP is the IOMMU group number of which the device is a member.
+If the IOMMU group contains multiple devices, each will need to
+be bound to a VFIO driver before operations on the VFIO group
+are allowed (it's also sufficient to only unbind the device from
+host drivers if a VFIO driver is unavailable; this will make the
+group available, but not that particular device).  TBD - interface
+for disabling driver probing/locking a device.
+
+Once the group is ready, it may be added to the container by opening
+the VFIO group character device (/dev/vfio/$GROUP) and using the
+VFIO_GROUP_SET_CONTAINER ioctl, passing the file descriptor of the
+previously opened container file.  If desired and if the IOMMU driver
+supports sharing the IOMMU context between groups, multiple groups may
+be set to the same container.  If a group fails to set to a container
+with existing groups, a new empty container will need to be used
+instead.
+
+With a group (or groups) attached to a container, the remaining
+ioctls become available, enabling access to the VFIO IOMMU interfaces.
+Additionally, it now becomes possible to get file descriptors for each
+device within a group using an ioctl on the VFIO group file descriptor.
+
+The VFIO device API includes ioctls for describing the device, the I/O
+regions and their read/write/mmap offsets on the device descriptor, as
+well as mechanisms for describing and registering interrupt
+notifications.
+
+VFIO Usage Example
+-------------------------------------------------------------------------------
+
+Assume user wants to access PCI device 0000:06:0d.0
+
+$ readlink /sys/bus/pci/devices/0000:06:0d.0/iommu_group
+../../../../kernel/iommu_groups/26
+
+This device is therefore in IOMMU group 26.  This device is on the
+pci bus, therefore the user will make use of vfio-pci to manage the
+group:
+
+# modprobe vfio-pci
+
+Binding this device to the vfio-pci driver creates the VFIO group
+character devices for this group:
+
+$ lspci -n -s 0000:06:0d.0
+06:0d.0 0401: 1102:0002 (rev 08)
+# echo 0000:06:0d.0 > /sys/bus/pci/devices/0000:06:0d.0/driver/unbind
+# echo 1102 0002 > /sys/bus/pci/drivers/vfio/new_id
+# echo 0000:06:0d.0 > /sys/bus/pci/drivers/vfio/bind
+
+Now we need to look at what other devices are in the group to free
+it for use by VFIO:
+
+$ ls -l /sys/bus/pci/devices/0000:06:0d.0/iommu_group/devices
+total 0
+lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:00:1e.0 ->
+	../../../../devices/pci0000:00/0000:00:1e.0
+lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.0 ->
+	../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.0
+lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.1 ->
+	../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.1
+
+This device is behind a PCIe-to-PCI bridge[4], therefore we also
+need to add device 0000:06:0d.1 to the group following the same
+procedure as above.  Device 0000:00:1e.0 is a bridge that does
+not currently have a host driver, therefore it's not required to
+bind this device to the vfio-pci driver (vfio-pci does not currently
+support PCI bridges).
+
+The final step is to provide the user with access to the group if
+unprivileged operation is desired (note that /dev/vfio/vfio provides
+no capabilities on it's own and is therefore expected to be set to
+mode 0666 by the system).
+
+# chown user:user /dev/vfio/26
+
+The user now has full access to all the devices and the iommu for this
+group and can access them as follows:
+
+	int container, group, device, i;
+	struct vfio_group_status group_status =
+					{ .argsz = sizeof(group_status) };
+	struct vfio_iommu_x86_info iommu_info = { .argsz = sizeof(iommu_info) };
+	struct vfio_iommu_x86_dma_map dma_map = { .argsz = sizeof(dma_map) };
+	struct vfio_device_info device_info = { .argsz = sizeof(device_info) };
+
+	/* Create a new container */
+	container = open("/dev/vfio/vfio, O_RDWR);
+
+	if (ioctl(container, VFIO_GET_API_VERSION) != VFIO_API_VERSION)
+		/* Unknown API version */
+
+	if (!ioctl(container, VFIO_CHECK_EXTENSION, VFIO_X86_IOMMU))
+		/* Doesn't support the IOMMU driver we want. */
+
+	/* Open the group */
+	group = open("/dev/vfio/26", O_RDWR);
+
+	/* Test the group is viable and available */
+	ioctl(group, VFIO_GROUP_GET_STATUS, &group_status);
+
+	if (!(group_status.flags & VFIO_GROUP_FLAGS_VIABLE))
+		/* Group is not viable (ie, not all devices bound for vfio) */
+
+	/* Add the group to the container */
+	ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
+
+	/* Enable the IOMMU model we want */
+	ioctl(container, VFIO_SET_IOMMU, VFIO_X86_IOMMU)
+
+	/* Get addition IOMMU info */
+	ioctl(container, VFIO_IOMMU_GET_INFO, &iommu_info);
+
+	/* Allocate some space and setup a DMA mapping */
+	dma_map.vaddr = mmap(0, 1024 * 1024, PROT_READ | PROT_WRITE,
+			     MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
+	dma_map.size = 1024 * 1024;
+	dma_map.iova = 0; /* 1MB starting at 0x0 from device view */
+	dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
+
+	ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);
+
+	/* Get a file descriptor for the device */
+	device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");
+
+	/* Test and setup the device */
+	ioctl(device, VFIO_DEVICE_GET_INFO, &device_info);
+
+	for (i = 0; i < device_info.num_regions; i++) {
+		struct vfio_region_info reg = { .argsz = sizeof(reg) };
+
+		reg.index = i;
+
+		ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &reg);
+
+		/* Setup mappings... read/write offsets, mmaps
+		 * For PCI devices, config space is a region */
+	}
+
+	for (i = 0; i < device_info.num_irqs; i++) {
+		struct vfio_irq_info irq = { .argsz = sizeof(irq) };
+
+		irq.index = i;
+
+		ioctl(device, VFIO_DEVICE_GET_IRQ_INFO, &reg);
+
+		/* Setup IRQs... eventfds, VFIO_DEVICE_SET_IRQS */
+	}
+
+	/* Gratuitous device reset and go... */
+	ioctl(device, VFIO_DEVICE_RESET);
+
+VFIO User API
+-------------------------------------------------------------------------------
+
+Please see include/linux/vfio.h for complete API documentation.
+
+VFIO bus driver API
+-------------------------------------------------------------------------------
+
+VFIO bus drivers, such as vfio-pci make use of only a few interfaces
+into VFIO core.  When devices are bound and unbound to the driver,
+the driver should call vfio_add_group_dev() and vfio_del_group_dev()
+respectively:
+
+extern int vfio_add_group_dev(struct iommu_group *iommu_group,
+                              struct device *dev,
+                              const struct vfio_device_ops *ops,
+                              void *device_data);
+
+extern void *vfio_del_group_dev(struct device *dev);
+
+vfio_add_group_dev() indicates to the core to begin tracking the
+specified iommu_group and register the specified dev as owned by
+a VFIO bus driver.  The driver provides an ops structure for callbacks
+similar to a file operations structure:
+
+struct vfio_device_ops {
+	int	(*open)(void *device_data);
+	void	(*release)(void *device_data);
+	ssize_t	(*read)(void *device_data, char __user *buf,
+			size_t count, loff_t *ppos);
+	ssize_t	(*write)(void *device_data, const char __user *buf,
+			 size_t size, loff_t *ppos);
+	long	(*ioctl)(void *device_data, unsigned int cmd,
+			 unsigned long arg);
+	int	(*mmap)(void *device_data, struct vm_area_struct *vma);
+};
+
+Each function is passed the device_data that was originally registered
+in the vfio_add_group_dev() call above.  This allows the bus driver
+an easy place to store it's opaque, private data.  The open/release
+callbacks are issued when a new file descriptor is created for a
+device (via VFIO_GROUP_GET_DEVICE_FD).  The ioctl interface provides
+a direct pass through for VFIO_DEVICE_* ioctls.  The read/write/mmap
+interfaces implement the device region access defined by the device's
+own VFIO_DEVICE_GET_REGION_INFO ioctl.
+
+-------------------------------------------------------------------------------
+
+[1] VFIO was originally an acronym for "Virtual Function I/O" in it's
+initial implementation by Tom Lyon while as Cisco.  We've since
+outgrown the acronym, but it's catchy.
+
+[2] "safe" also depends upon a device being "well behaved".  It's
+possible for multi-function devices to have backdoors between
+functions and even for single function devices to have alternative
+access to things like PCI config space through MMIO registers.  To
+guard against the former we can include additional precautions in the
+IOMMU driver to group multi-function PCI devices together
+(iommu=group_mf).  The latter we can't prevent, but the IOMMU should
+still provide isolation.  For PCI, SR-IOV Virtual Functions are the
+best indicator of "well behaved", as these are designed for
+virtualization usage models.
+
+[3] As always there are trade-offs to virtual machine device
+assignment that are beyond the scope of VFIO.  It's expected that
+future IOMMU technologies will reduce some, but maybe not all, of
+these trade-offs.
+
+[4] In this case the device is below a PCI bridge, so transactions
+from either function of the device are indistinguishable to the iommu:
+
+-[0000:00]-+-1e.0-[06]--+-0d.0
+                        \-0d.1
+
+00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [Qemu-devel] [PATCH 08/13] vfio: Add documentation
@ 2012-05-11 22:56   ` Alex Williamson
  0 siblings, 0 replies; 129+ messages in thread
From: Alex Williamson @ 2012-05-11 22:56 UTC (permalink / raw)
  To: benh, aik, david, joerg.roedel, dwmw2
  Cc: aafabbri, alex.williamson, kvm, B07421, linux-pci, konrad.wilk,
	agraf, qemu-devel, chrisw, B08248, iommu, avi, gregkh, bhelgaas,
	linux-kernel, benve

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

 Documentation/vfio.txt |  315 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 315 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/vfio.txt

diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
new file mode 100644
index 0000000..6808ac5
--- /dev/null
+++ b/Documentation/vfio.txt
@@ -0,0 +1,315 @@
+VFIO - "Virtual Function I/O"[1]
+-------------------------------------------------------------------------------
+Many modern system now provide DMA and interrupt remapping facilities
+to help ensure I/O devices behave within the boundaries they've been
+allotted.  This includes x86 hardware with AMD-Vi and Intel VT-d,
+POWER systems with Partitionable Endpoints (PEs) and embedded PowerPC
+systems such as Freescale PAMU.  The VFIO driver is an IOMMU/device
+agnostic framework for exposing direct device access to userspace, in
+a secure, IOMMU protected environment.  In other words, this allows
+safe[2], non-privileged, userspace drivers.
+
+Why do we want that?  Virtual machines often make use of direct device
+access ("device assignment") when configured for the highest possible
+I/O performance.  From a device and host perspective, this simply
+turns the VM into a userspace driver, with the benefits of
+significantly reduced latency, higher bandwidth, and direct use of
+bare-metal device drivers[3].
+
+Some applications, particularly in the high performance computing
+field, also benefit from low-overhead, direct device access from
+userspace.  Examples include network adapters (often non-TCP/IP based)
+and compute accelerators.  Prior to VFIO, these drivers had to either
+go through the full development cycle to become proper upstream
+driver, be maintained out of tree, or make use of the UIO framework,
+which has no notion of IOMMU protection, limited interrupt support,
+and requires root privileges to access things like PCI configuration
+space.
+
+The VFIO driver framework intends to unify these, replacing both the
+KVM PCI specific device assignment code as well as provide a more
+secure, more featureful userspace driver environment than UIO.
+
+Groups, Devices, and IOMMUs
+-------------------------------------------------------------------------------
+
+Devices are the main target of any I/O driver.  Devices typically
+create a programming interface made up of I/O access, interrupts,
+and DMA.  Without going into the details of each of these, DMA is
+by far the most critical aspect for maintaining a secure environment
+as allowing a device read-write access to system memory imposes the
+greatest risk to the overall system integrity.
+
+To help mitigate this risk, many modern IOMMUs now incorporate
+isolation properties into what was, in many cases, an interface only
+meant for translation (ie. solving the addressing problems of devices
+with limited address spaces).  With this, devices can now be isolated
+from each other and from arbitrary memory access, thus allowing
+things like secure direct assignment of devices into virtual machines.
+
+This isolation is not always at the granularity of a single device
+though.  Even when an IOMMU is capable of this, properties of devices,
+interconnects, and IOMMU topologies can each reduce this isolation.
+For instance, an individual device may be part of a larger multi-
+function enclosure.  While the IOMMU may be able to distinguish
+between devices within the enclosure, the enclosure may not require
+transactions between devices to reach the IOMMU.  Examples of this
+could be anything from a multi-function PCI device with backdoors
+between functions to a non-PCI-ACS (Access Control Services) capable
+bridge allowing redirection without reaching the IOMMU.  Topology
+can also play a factor in terms of hiding devices.  A PCIe-to-PCI
+bridge masks the devices behind it, making transaction appear as if
+from the bridge itself.  Obviously IOMMU design plays a major factor
+as well.
+
+Therefore, while for the most part an IOMMU may have device level
+granularity, any system is susceptible to reduced granularity.  The
+IOMMU API therefore supports a notion of IOMMU groups.  A group is
+a set of devices which is isolatable from all other devices in the
+system.  Groups are therefore the unit of ownership used by VFIO.
+
+While the group is the minimum granularity that must be used to
+ensure secure user access, it's not necessarily the preferred
+granularity.  In IOMMUs which make use of page tables, it may be
+possible to share a set of page tables between different groups,
+reducing the overhead both to the platform (reduced TLB thrashing,
+reduced duplicate page tables), and to the user (programming only
+a single set of translations).  For this reason, VFIO makes use of
+a container class, which may hold one or more groups.  A container
+is created by simply opening the /dev/vfio/vfio character device.
+
+On it's own, the container provides little functionality, with all
+but a couple version and extension query interfaces locked away.
+The user needs to add a group into the container for the next level
+of functionality.  To do this, the user first needs to identify the
+group associated with the desired device.  This can be done using
+the sysfs links described in the example below.  By unbinding the
+device from the host driver and binding it to a VFIO driver, a new
+VFIO group will appear for the group as /dev/vfio/$GROUP, where
+$GROUP is the IOMMU group number of which the device is a member.
+If the IOMMU group contains multiple devices, each will need to
+be bound to a VFIO driver before operations on the VFIO group
+are allowed (it's also sufficient to only unbind the device from
+host drivers if a VFIO driver is unavailable; this will make the
+group available, but not that particular device).  TBD - interface
+for disabling driver probing/locking a device.
+
+Once the group is ready, it may be added to the container by opening
+the VFIO group character device (/dev/vfio/$GROUP) and using the
+VFIO_GROUP_SET_CONTAINER ioctl, passing the file descriptor of the
+previously opened container file.  If desired and if the IOMMU driver
+supports sharing the IOMMU context between groups, multiple groups may
+be set to the same container.  If a group fails to attach to a
+container that already holds other groups, a new empty container
+will need to be used instead.
+
+With a group (or groups) attached to a container, the remaining
+ioctls become available, enabling access to the VFIO IOMMU interfaces.
+Additionally, it now becomes possible to get file descriptors for each
+device within a group using an ioctl on the VFIO group file descriptor.
+
+The VFIO device API includes ioctls for describing the device, the I/O
+regions and their read/write/mmap offsets on the device descriptor, as
+well as mechanisms for describing and registering interrupt
+notifications.
+
+VFIO Usage Example
+-------------------------------------------------------------------------------
+
+Assume the user wants to access PCI device 0000:06:0d.0
+
+$ readlink /sys/bus/pci/devices/0000:06:0d.0/iommu_group
+../../../../kernel/iommu_groups/26
+
+This device is therefore in IOMMU group 26.  The device is on the
+PCI bus, so the user will make use of vfio-pci to manage the
+group:
+
+# modprobe vfio-pci
+
+Binding this device to the vfio-pci driver creates the VFIO group
+character devices for this group:
+
+$ lspci -n -s 0000:06:0d.0
+06:0d.0 0401: 1102:0002 (rev 08)
+# echo 0000:06:0d.0 > /sys/bus/pci/devices/0000:06:0d.0/driver/unbind
+# echo 1102 0002 > /sys/bus/pci/drivers/vfio-pci/new_id
+# echo 0000:06:0d.0 > /sys/bus/pci/drivers/vfio-pci/bind
+
+Now we need to look at what other devices are in the group to free
+it for use by VFIO:
+
+$ ls -l /sys/bus/pci/devices/0000:06:0d.0/iommu_group/devices
+total 0
+lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:00:1e.0 ->
+	../../../../devices/pci0000:00/0000:00:1e.0
+lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.0 ->
+	../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.0
+lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.1 ->
+	../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.1
+
+This device is behind a PCIe-to-PCI bridge[4], therefore we also
+need to add device 0000:06:0d.1 to the group following the same
+procedure as above.  Device 0000:00:1e.0 is a bridge that does
+not currently have a host driver, therefore it's not required to
+bind this device to the vfio-pci driver (vfio-pci does not currently
+support PCI bridges).
+
+The final step is to provide the user with access to the group if
+unprivileged operation is desired (note that /dev/vfio/vfio provides
+no capabilities on its own and is therefore expected to be set to
+mode 0666 by the system).
+
+# chown user:user /dev/vfio/26
+
+The user now has full access to all the devices and the IOMMU for this
+group and can access them as follows:
+
+	int container, group, device, i;
+	struct vfio_group_status group_status =
+					{ .argsz = sizeof(group_status) };
+	struct vfio_iommu_x86_info iommu_info = { .argsz = sizeof(iommu_info) };
+	struct vfio_iommu_x86_dma_map dma_map = { .argsz = sizeof(dma_map) };
+	struct vfio_device_info device_info = { .argsz = sizeof(device_info) };
+
+	/* Create a new container */
+	container = open("/dev/vfio/vfio", O_RDWR);
+
+	if (ioctl(container, VFIO_GET_API_VERSION) != VFIO_API_VERSION)
+		/* Unknown API version */
+
+	if (!ioctl(container, VFIO_CHECK_EXTENSION, VFIO_X86_IOMMU))
+		/* Doesn't support the IOMMU driver we want. */
+
+	/* Open the group */
+	group = open("/dev/vfio/26", O_RDWR);
+
+	/* Test the group is viable and available */
+	ioctl(group, VFIO_GROUP_GET_STATUS, &group_status);
+
+	if (!(group_status.flags & VFIO_GROUP_FLAGS_VIABLE))
+		/* Group is not viable (i.e. not all devices bound to vfio) */
+
+	/* Add the group to the container */
+	ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
+
+	/* Enable the IOMMU model we want */
+	ioctl(container, VFIO_SET_IOMMU, VFIO_X86_IOMMU);
+
+	/* Get additional IOMMU info */
+	ioctl(container, VFIO_IOMMU_GET_INFO, &iommu_info);
+
+	/* Allocate some space and setup a DMA mapping */
+	dma_map.vaddr = mmap(0, 1024 * 1024, PROT_READ | PROT_WRITE,
+			     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	dma_map.size = 1024 * 1024;
+	dma_map.iova = 0; /* 1MB starting at 0x0 from device view */
+	dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
+
+	ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);
+
+	/* Get a file descriptor for the device */
+	device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");
+
+	/* Test and setup the device */
+	ioctl(device, VFIO_DEVICE_GET_INFO, &device_info);
+
+	for (i = 0; i < device_info.num_regions; i++) {
+		struct vfio_region_info reg = { .argsz = sizeof(reg) };
+
+		reg.index = i;
+
+		ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &reg);
+
+		/* Setup mappings... read/write offsets, mmaps
+		 * For PCI devices, config space is a region */
+	}
+
+	for (i = 0; i < device_info.num_irqs; i++) {
+		struct vfio_irq_info irq = { .argsz = sizeof(irq) };
+
+		irq.index = i;
+
+		ioctl(device, VFIO_DEVICE_GET_IRQ_INFO, &irq);
+
+		/* Setup IRQs... eventfds, VFIO_DEVICE_SET_IRQS */
+	}
+
+	/* Gratuitous device reset and go... */
+	ioctl(device, VFIO_DEVICE_RESET);
+
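+As a brief sketch of region access, reusing the device descriptor
+from the example above and assuming struct vfio_region_info reports
+the region's size and file offset, with config space exposed under
+an index such as VFIO_PCI_CONFIG_REGION_INDEX (both taken here as
+assumptions from the vfio-pci bus driver, not guarantees of this
+API):
+
+	struct vfio_region_info reg = { .argsz = sizeof(reg) };
+	unsigned char buf[64];
+
+	reg.index = VFIO_PCI_CONFIG_REGION_INDEX;
+	ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &reg);
+
+	/* Read the start of config space through the device fd */
+	pread(device, buf, sizeof(buf), reg.offset);
+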
+VFIO User API
+-------------------------------------------------------------------------------
+
+Please see include/linux/vfio.h for complete API documentation.
+
+VFIO bus driver API
+-------------------------------------------------------------------------------
+
+VFIO bus drivers, such as vfio-pci, make use of only a few interfaces
+into VFIO core.  When devices are bound and unbound to the driver,
+the driver should call vfio_add_group_dev() and vfio_del_group_dev()
+respectively:
+
+extern int vfio_add_group_dev(struct iommu_group *iommu_group,
+                              struct device *dev,
+                              const struct vfio_device_ops *ops,
+                              void *device_data);
+
+extern void *vfio_del_group_dev(struct device *dev);
+
+vfio_add_group_dev() tells the core to begin tracking the
+specified iommu_group and to register the specified dev as owned by
+a VFIO bus driver.  The driver provides an ops structure for callbacks
+similar to a file operations structure:
+
+struct vfio_device_ops {
+	int	(*open)(void *device_data);
+	void	(*release)(void *device_data);
+	ssize_t	(*read)(void *device_data, char __user *buf,
+			size_t count, loff_t *ppos);
+	ssize_t	(*write)(void *device_data, const char __user *buf,
+			 size_t size, loff_t *ppos);
+	long	(*ioctl)(void *device_data, unsigned int cmd,
+			 unsigned long arg);
+	int	(*mmap)(void *device_data, struct vm_area_struct *vma);
+};
+
+Each function is passed the device_data that was originally registered
+in the vfio_add_group_dev() call above.  This allows the bus driver
+an easy place to store its opaque, private data.  The open/release
+callbacks are issued when a new file descriptor is created for a
+device (via VFIO_GROUP_GET_DEVICE_FD).  The ioctl interface provides
+a direct pass through for VFIO_DEVICE_* ioctls.  The read/write/mmap
+interfaces implement the device region access defined by the device's
+own VFIO_DEVICE_GET_REGION_INFO ioctl.
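+
+As an illustrative sketch of how a bus driver might wire this
+together in its probe path (the my_device_data structure and error
+handling are invented for this example; iommu_group_get() and
+iommu_group_put() come from the IOMMU groups interface added
+earlier in this series):
+
+	static int my_vfio_pci_probe(struct pci_dev *pdev,
+				     const struct pci_device_id *id)
+	{
+		struct my_device_data *data;
+		struct iommu_group *group;
+		int ret;
+
+		/* No IOMMU group means no isolation; refuse the device */
+		group = iommu_group_get(&pdev->dev);
+		if (!group)
+			return -EINVAL;
+
+		data = kzalloc(sizeof(*data), GFP_KERNEL);
+		if (!data) {
+			iommu_group_put(group);
+			return -ENOMEM;
+		}
+
+		ret = vfio_add_group_dev(group, &pdev->dev,
+					 &my_vfio_device_ops, data);
+		if (ret) {
+			iommu_group_put(group);
+			kfree(data);
+		}
+
+		return ret;
+	}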
+
+-------------------------------------------------------------------------------
+
+[1] VFIO was originally an acronym for "Virtual Function I/O" in its
+initial implementation by Tom Lyon while at Cisco.  We've since
+outgrown the acronym, but it's catchy.
+
+[2] "safe" also depends upon a device being "well behaved".  It's
+possible for multi-function devices to have backdoors between
+functions and even for single function devices to have alternative
+access to things like PCI config space through MMIO registers.  To
+guard against the former we can include additional precautions in the
+IOMMU driver to group multi-function PCI devices together
+(iommu=group_mf).  The latter we can't prevent, but the IOMMU should
+still provide isolation.  For PCI, SR-IOV Virtual Functions are the
+best indicator of "well behaved", as these are designed for
+virtualization usage models.
+
+[3] As always there are trade-offs to virtual machine device
+assignment that are beyond the scope of VFIO.  It's expected that
+future IOMMU technologies will reduce some, but maybe not all, of
+these trade-offs.
+
+[4] In this case the device is below a PCI bridge, so transactions
+from either function of the device are indistinguishable to the IOMMU:
+
+-[0000:00]-+-1e.0-[06]--+-0d.0
+                        \-0d.1
+
+00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 09/13] vfio: x86 IOMMU implementation
@ 2012-05-11 22:56   ` Alex Williamson
  0 siblings, 0 replies; 129+ messages in thread
From: Alex Williamson @ 2012-05-11 22:56 UTC (permalink / raw)
  To: benh, aik, david, joerg.roedel, dwmw2
  Cc: chrisw, agraf, benve, aafabbri, B08248, B07421, avi, konrad.wilk,
	kvm, qemu-devel, iommu, linux-pci, linux-kernel, gregkh,
	bhelgaas, alex.williamson

x86 is probably the wrong name for this VFIO IOMMU driver, but x86
is the primary target for it.  This driver supports a very simple
usage model using the existing IOMMU API.  The IOMMU is expected to
support the full host address space with no special IOVA windows,
no restriction on the number of mappings, and no unique processor
target options.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

 Documentation/ioctl/ioctl-number.txt |    2 
 drivers/vfio/Kconfig                 |    6 
 drivers/vfio/Makefile                |    2 
 drivers/vfio/vfio.c                  |    7 
 drivers/vfio/vfio_iommu_x86.c        |  743 ++++++++++++++++++++++++++++++++++
 include/linux/vfio.h                 |   52 ++
 6 files changed, 811 insertions(+), 1 deletions(-)
 create mode 100644 drivers/vfio/vfio_iommu_x86.c

diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
index 111e30a..9d1694e 100644
--- a/Documentation/ioctl/ioctl-number.txt
+++ b/Documentation/ioctl/ioctl-number.txt
@@ -88,7 +88,7 @@ Code  Seq#(hex)	Include File		Comments
 		and kernel/power/user.c
 '8'	all				SNP8023 advanced NIC card
 					<mailto:mcr@solidum.com>
-';'	64-6F	linux/vfio.h
+';'	64-72	linux/vfio.h
 '@'	00-0F	linux/radeonfb.h	conflict!
 '@'	00-0F	drivers/video/aty/aty128fb.c	conflict!
 'A'	00-1F	linux/apm_bios.h	conflict!
diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index 9acb1e7..bd88a30 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -1,6 +1,12 @@
+config VFIO_IOMMU_X86
+	tristate
+	depends on VFIO && X86
+	default n
+
 menuconfig VFIO
 	tristate "VFIO Non-Privileged userspace driver framework"
 	depends on IOMMU_API
+	select VFIO_IOMMU_X86 if X86
 	help
 	  VFIO provides a framework for secure userspace device drivers.
 	  See Documentation/vfio.txt for more details.
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index 7500a67..1f1abee 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -1 +1,3 @@
 obj-$(CONFIG_VFIO) += vfio.o
+obj-$(CONFIG_VFIO_IOMMU_X86) += vfio_iommu_x86.o
+obj-$(CONFIG_VFIO_PCI) += pci/
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index af0e4f8..410c4b0 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1362,6 +1362,13 @@ static int __init vfio_init(void)
 
 	pr_info(DRIVER_DESC " version: " DRIVER_VERSION "\n");
 
+	/*
+	 * Attempt to load known iommu-drivers.  This gives us a working
+	 * environment without the user needing to explicitly load iommu
+	 * drivers.
+	 */
+	request_module_nowait("vfio_iommu_x86");
+
 	return 0;
 
 err_groups_cdev:
diff --git a/drivers/vfio/vfio_iommu_x86.c b/drivers/vfio/vfio_iommu_x86.c
new file mode 100644
index 0000000..a52391d
--- /dev/null
+++ b/drivers/vfio/vfio_iommu_x86.c
@@ -0,0 +1,743 @@
+/*
+ * VFIO: IOMMU DMA mapping support for x86 (Intel VT-d & AMD-Vi)
+ *
+ * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
+ *     Author: Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Derived from original vfio:
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ * Author: Tom Lyon, pugs@cisco.com
+ */
+
+#include <linux/compat.h>
+#include <linux/device.h>
+#include <linux/fs.h>
+#include <linux/iommu.h>
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/pci.h>		/* pci_bus_type */
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+#include <linux/workqueue.h>
+
+#define DRIVER_VERSION  "0.2"
+#define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
+#define DRIVER_DESC     "x86 IOMMU driver for VFIO"
+
+static bool allow_unsafe_interrupts;
+module_param_named(allow_unsafe_interrupts,
+		   allow_unsafe_interrupts, bool, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(allow_unsafe_interrupts,
+		 "Enable VFIO IOMMU support for on platforms without interrupt remapping support.");
+
+struct vfio_iommu {
+	struct iommu_domain	*domain;
+	struct mutex		lock;
+	struct list_head	dma_list;
+	struct list_head	group_list;
+	bool			cache;
+};
+
+struct vfio_dma {
+	struct list_head	next;
+	dma_addr_t		iova;		/* Device address */
+	unsigned long		vaddr;		/* Process virtual addr */
+	long			npage;		/* Number of pages */
+	int			prot;		/* IOMMU_READ/WRITE */
+};
+
+struct vfio_group {
+	struct iommu_group	*iommu_group;
+	struct list_head	next;
+};
+
+/*
+ * This code handles mapping and unmapping of user data buffers
+ * into DMA'ble space using the IOMMU
+ */
+
+#define NPAGE_TO_SIZE(npage)	((size_t)(npage) << PAGE_SHIFT)
+
+struct vwork {
+	struct mm_struct	*mm;
+	long			npage;
+	struct work_struct	work;
+};
+
+/* delayed decrement/increment for locked_vm */
+static void vfio_lock_acct_bg(struct work_struct *work)
+{
+	struct vwork *vwork = container_of(work, struct vwork, work);
+	struct mm_struct *mm;
+
+	mm = vwork->mm;
+	down_write(&mm->mmap_sem);
+	mm->locked_vm += vwork->npage;
+	up_write(&mm->mmap_sem);
+	mmput(mm);
+	kfree(vwork);
+}
+
+static void vfio_lock_acct(long npage)
+{
+	struct vwork *vwork;
+	struct mm_struct *mm;
+
+	if (!current->mm)
+		return; /* process exited */
+
+	if (down_write_trylock(&current->mm->mmap_sem)) {
+		current->mm->locked_vm += npage;
+		up_write(&current->mm->mmap_sem);
+		return;
+	}
+
+	/*
+	 * Couldn't get mmap_sem lock, so must setup to update
+	 * mm->locked_vm later. If locked_vm were atomic, we
+	 * wouldn't need this silliness
+	 */
+	vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
+	if (!vwork)
+		return;
+	mm = get_task_mm(current);
+	if (!mm) {
+		kfree(vwork);
+		return;
+	}
+	INIT_WORK(&vwork->work, vfio_lock_acct_bg);
+	vwork->mm = mm;
+	vwork->npage = npage;
+	schedule_work(&vwork->work);
+}
+
+/*
+ * Some mappings aren't backed by a struct page, for example an mmap'd
+ * MMIO range for our own or another device.  These use a different
+ * pfn conversion and shouldn't be tracked as locked pages.
+ */
+static bool is_invalid_reserved_pfn(unsigned long pfn)
+{
+	if (pfn_valid(pfn)) {
+		bool reserved;
+		struct page *tail = pfn_to_page(pfn);
+		struct page *head = compound_trans_head(tail);
+		reserved = !!(PageReserved(head));
+		if (head != tail) {
+			/*
+			 * "head" is not a dangling pointer
+			 * (compound_trans_head takes care of that)
+			 * but the hugepage may have been split
+			 * from under us (and we may not hold a
+			 * reference count on the head page so it can
+			 * be reused before we run PageReferenced), so
+			 * we've to check PageTail before returning
+			 * what we just read.
+			 */
+			smp_rmb();
+			if (PageTail(tail))
+				return reserved;
+		}
+		return PageReserved(tail);
+	}
+
+	return true;
+}
+
+static int put_pfn(unsigned long pfn, int prot)
+{
+	if (!is_invalid_reserved_pfn(pfn)) {
+		struct page *page = pfn_to_page(pfn);
+		if (prot & IOMMU_WRITE)
+			SetPageDirty(page);
+		put_page(page);
+		return 1;
+	}
+	return 0;
+}
+
+/* Unmap DMA region */
+static long __vfio_dma_do_unmap(struct vfio_iommu *iommu, dma_addr_t iova,
+			     long npage, int prot)
+{
+	long i, unlocked = 0;
+
+	for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
+		unsigned long pfn;
+
+		pfn = iommu_iova_to_phys(iommu->domain, iova) >> PAGE_SHIFT;
+		if (pfn) {
+			iommu_unmap(iommu->domain, iova, PAGE_SIZE);
+			unlocked += put_pfn(pfn, prot);
+		}
+	}
+	return unlocked;
+}
+
+static void vfio_dma_unmap(struct vfio_iommu *iommu, dma_addr_t iova,
+			   long npage, int prot)
+{
+	long unlocked;
+
+	unlocked = __vfio_dma_do_unmap(iommu, iova, npage, prot);
+	vfio_lock_acct(-unlocked);
+}
+
+static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
+{
+	struct page *page[1];
+	struct vm_area_struct *vma;
+	int ret = -EFAULT;
+
+	if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
+		*pfn = page_to_pfn(page[0]);
+		return 0;
+	}
+
+	down_read(&current->mm->mmap_sem);
+
+	vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
+
+	if (vma && vma->vm_flags & VM_PFNMAP) {
+		*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+		if (is_invalid_reserved_pfn(*pfn))
+			ret = 0;
+	}
+
+	up_read(&current->mm->mmap_sem);
+
+	return ret;
+}
+
+/* Map DMA region */
+static int __vfio_dma_map(struct vfio_iommu *iommu, dma_addr_t iova,
+			  unsigned long vaddr, long npage, int prot)
+{
+	dma_addr_t start = iova;
+	long i, locked = 0;
+	int ret;
+
+	/* Verify that pages are not already mapped */
+	for (i = 0; i < npage; i++, iova += PAGE_SIZE)
+		if (iommu_iova_to_phys(iommu->domain, iova))
+			return -EBUSY;
+
+	iova = start;
+
+	if (iommu->cache)
+		prot |= IOMMU_CACHE;
+
+	/*
+	 * XXX We break mappings into pages and use get_user_pages_fast to
+	 * pin the pages in memory.  It's been suggested that mlock might
+	 * provide a more efficient mechanism, but nothing prevents the
+	 * user from munlocking the pages, which could then allow the user
+	 * access to random host memory.  We also have no guarantee from the
+	 * IOMMU API that the iommu driver can unmap sub-pages of previous
+	 * mappings.  This means we might lose an entire range if a single
+	 * page within it is unmapped.  Single page mappings are inefficient,
+	 * but provide the most flexibility for now.
+	 */
+	for (i = 0; i < npage; i++, iova += PAGE_SIZE, vaddr += PAGE_SIZE) {
+		unsigned long pfn = 0;
+
+		ret = vaddr_get_pfn(vaddr, prot, &pfn);
+		if (ret) {
+			__vfio_dma_do_unmap(iommu, start, i, prot);
+			return ret;
+		}
+
+		/*
+		 * Only add actual locked pages to accounting
+		 * XXX We're effectively marking a page locked for every
+		 * IOVA page even though it's possible the user could be
+		 * backing multiple IOVAs with the same vaddr.  This over-
+		 * penalizes the user process, but we currently have no
+		 * easy way to do this properly.
+		 */
+		if (!is_invalid_reserved_pfn(pfn))
+			locked++;
+
+		ret = iommu_map(iommu->domain, iova,
+				(phys_addr_t)pfn << PAGE_SHIFT,
+				PAGE_SIZE, prot);
+		if (ret) {
+			/* Back out mappings on error */
+			put_pfn(pfn, prot);
+			__vfio_dma_do_unmap(iommu, start, i, prot);
+			return ret;
+		}
+	}
+	vfio_lock_acct(locked);
+	return 0;
+}
+
+static inline bool ranges_overlap(dma_addr_t start1, size_t size1,
+				  dma_addr_t start2, size_t size2)
+{
+	if (start1 < start2)
+		return (start2 - start1 < size1);
+	else if (start2 < start1)
+		return (start1 - start2 < size2);
+	return (size1 > 0 && size2 > 0);
+}
+
+static struct vfio_dma *vfio_find_dma(struct vfio_iommu *iommu,
+						dma_addr_t start, size_t size)
+{
+	struct vfio_dma *dma;
+
+	list_for_each_entry(dma, &iommu->dma_list, next) {
+		if (ranges_overlap(dma->iova, NPAGE_TO_SIZE(dma->npage),
+				   start, size))
+			return dma;
+	}
+	return NULL;
+}
+
+static long vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t start,
+				    size_t size, struct vfio_dma *dma)
+{
+	struct vfio_dma *split;
+	long npage_lo, npage_hi;
+
+	/* Existing dma region is completely covered, unmap all */
+	if (start <= dma->iova &&
+	    start + size >= dma->iova + NPAGE_TO_SIZE(dma->npage)) {
+		vfio_dma_unmap(iommu, dma->iova, dma->npage, dma->prot);
+		list_del(&dma->next);
+		npage_lo = dma->npage;
+		kfree(dma);
+		return npage_lo;
+	}
+
+	/* Overlap low address of existing range */
+	if (start <= dma->iova) {
+		size_t overlap;
+
+		overlap = start + size - dma->iova;
+		npage_lo = overlap >> PAGE_SHIFT;
+
+		vfio_dma_unmap(iommu, dma->iova, npage_lo, dma->prot);
+		dma->iova += overlap;
+		dma->vaddr += overlap;
+		dma->npage -= npage_lo;
+		return npage_lo;
+	}
+
+	/* Overlap high address of existing range */
+	if (start + size >= dma->iova + NPAGE_TO_SIZE(dma->npage)) {
+		size_t overlap;
+
+		overlap = dma->iova + NPAGE_TO_SIZE(dma->npage) - start;
+		npage_hi = overlap >> PAGE_SHIFT;
+
+		vfio_dma_unmap(iommu, start, npage_hi, dma->prot);
+		dma->npage -= npage_hi;
+		return npage_hi;
+	}
+
+	/* Split existing */
+	npage_lo = (start - dma->iova) >> PAGE_SHIFT;
+	npage_hi = dma->npage - (size >> PAGE_SHIFT) - npage_lo;
+
+	split = kzalloc(sizeof *split, GFP_KERNEL);
+	if (!split)
+		return -ENOMEM;
+
+	vfio_dma_unmap(iommu, start, size >> PAGE_SHIFT, dma->prot);
+
+	dma->npage = npage_lo;
+
+	split->npage = npage_hi;
+	split->iova = start + size;
+	split->vaddr = dma->vaddr + NPAGE_TO_SIZE(npage_lo) + size;
+	split->prot = dma->prot;
+	list_add(&split->next, &iommu->dma_list);
+	return size >> PAGE_SHIFT;
+}
+
+static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
+			     struct vfio_iommu_x86_dma_unmap *unmap)
+{
+	long ret = 0, npage = unmap->size >> PAGE_SHIFT;
+	struct vfio_dma *dma, *tmp;
+	uint64_t mask;
+
+	mask = ((uint64_t)1 << __ffs(iommu->domain->ops->pgsize_bitmap)) - 1;
+
+	if (unmap->iova & mask)
+		return -EINVAL;
+	if (unmap->size & mask)
+		return -EINVAL;
+
+	/* XXX We still break these down into PAGE_SIZE */
+	WARN_ON(mask & PAGE_MASK);
+
+	mutex_lock(&iommu->lock);
+
+	list_for_each_entry_safe(dma, tmp, &iommu->dma_list, next) {
+		if (ranges_overlap(dma->iova, NPAGE_TO_SIZE(dma->npage),
+				   unmap->iova, unmap->size)) {
+			ret = vfio_remove_dma_overlap(iommu, unmap->iova,
+						      unmap->size, dma);
+			if (ret > 0)
+				npage -= ret;
+			if (ret < 0 || npage == 0)
+				break;
+		}
+	}
+	mutex_unlock(&iommu->lock);
+	return ret > 0 ? 0 : (int)ret;
+}
+
+static int vfio_dma_do_map(struct vfio_iommu *iommu,
+			   struct vfio_iommu_x86_dma_map *map)
+{
+	struct vfio_dma *dma, *pdma = NULL;
+	dma_addr_t iova = map->iova;
+	unsigned long locked, lock_limit, vaddr = map->vaddr;
+	size_t size = map->size;
+	int ret = 0, prot = 0;
+	uint64_t mask;
+	long npage;
+
+	mask = ((uint64_t)1 << __ffs(iommu->domain->ops->pgsize_bitmap)) - 1;
+
+	/* READ/WRITE from device perspective */
+	if (map->flags & VFIO_DMA_MAP_FLAG_WRITE)
+		prot |= IOMMU_WRITE;
+	if (map->flags & VFIO_DMA_MAP_FLAG_READ)
+		prot |= IOMMU_READ;
+
+	if (!prot)
+		return -EINVAL; /* No READ/WRITE? */
+
+	if (vaddr & mask)
+		return -EINVAL;
+	if (iova & mask)
+		return -EINVAL;
+	if (size & mask)
+		return -EINVAL;
+
+	/* XXX We still break these down into PAGE_SIZE */
+	WARN_ON(mask & PAGE_MASK);
+
+	/* Don't allow IOVA wrap */
+	if (iova + size && iova + size < iova)
+		return -EINVAL;
+
+	/* Don't allow virtual address wrap */
+	if (vaddr + size && vaddr + size < vaddr)
+		return -EINVAL;
+
+	npage = size >> PAGE_SHIFT;
+	if (!npage)
+		return -EINVAL;
+
+	mutex_lock(&iommu->lock);
+
+	if (vfio_find_dma(iommu, iova, size)) {
+		ret = -EBUSY;
+		goto out_lock;
+	}
+
+	/* account for locked pages */
+	locked = current->mm->locked_vm + npage;
+	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+	if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
+		printk(KERN_WARNING "%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
+			__func__, rlimit(RLIMIT_MEMLOCK));
+		ret = -ENOMEM;
+		goto out_lock;
+	}
+
+	ret = __vfio_dma_map(iommu, iova, vaddr, npage, prot);
+	if (ret)
+		goto out_lock;
+
+	/* Check if we abut a region below - nothing below 0 */
+	if (iova) {
+		dma = vfio_find_dma(iommu, iova - 1, 1);
+		if (dma && dma->prot == prot &&
+		    dma->vaddr + NPAGE_TO_SIZE(dma->npage) == vaddr) {
+
+			dma->npage += npage;
+			iova = dma->iova;
+			vaddr = dma->vaddr;
+			npage = dma->npage;
+			size = NPAGE_TO_SIZE(npage);
+
+			pdma = dma;
+		}
+	}
+
+	/* Check if we abut a region above - nothing above ~0 + 1 */
+	if (iova + size) {
+		dma = vfio_find_dma(iommu, iova + size, 1);
+		if (dma && dma->prot == prot &&
+		    dma->vaddr == vaddr + size) {
+
+			dma->npage += npage;
+			dma->iova = iova;
+			dma->vaddr = vaddr;
+
+			/*
+			 * If merged above and below, remove previously
+			 * merged entry.  New entry covers it.
+			 */
+			if (pdma) {
+				list_del(&pdma->next);
+				kfree(pdma);
+			}
+			pdma = dma;
+		}
+	}
+
+	/* Isolated, new region */
+	if (!pdma) {
+		dma = kzalloc(sizeof *dma, GFP_KERNEL);
+		if (!dma) {
+			ret = -ENOMEM;
+			vfio_dma_unmap(iommu, iova, npage, prot);
+			goto out_lock;
+		}
+
+		dma->npage = npage;
+		dma->iova = iova;
+		dma->vaddr = vaddr;
+		dma->prot = prot;
+		list_add(&dma->next, &iommu->dma_list);
+	}
+
+out_lock:
+	mutex_unlock(&iommu->lock);
+	return ret;
+}
+
+static int vfio_iommu_x86_attach_group(void *iommu_data,
+				       struct iommu_group *iommu_group)
+{
+	struct vfio_iommu *iommu = iommu_data;
+	struct vfio_group *group, *tmp;
+	int ret;
+
+	group = kzalloc(sizeof(*group), GFP_KERNEL);
+	if (!group)
+		return -ENOMEM;
+
+	mutex_lock(&iommu->lock);
+
+	list_for_each_entry(tmp, &iommu->group_list, next) {
+		if (tmp->iommu_group == iommu_group) {
+			mutex_unlock(&iommu->lock);
+			kfree(group);
+			return -EINVAL;
+		}
+	}
+
+	/*
+	 * TODO: Domains have capabilities that might change as we add
+	 * groups (see iommu->cache, currently never set).  Check for
+	 * them and potentially disallow groups to be attached when it
+	 * would change capabilities (ugh).
+	 */
+	ret = iommu_attach_group(iommu->domain, iommu_group);
+	if (ret) {
+		mutex_unlock(&iommu->lock);
+		kfree(group);
+		return ret;
+	}
+
+	group->iommu_group = iommu_group;
+	list_add(&group->next, &iommu->group_list);
+
+	mutex_unlock(&iommu->lock);
+
+	return 0;
+}
+
+static void vfio_iommu_x86_detach_group(void *iommu_data,
+					struct iommu_group *iommu_group)
+{
+	struct vfio_iommu *iommu = iommu_data;
+	struct vfio_group *group;
+
+	mutex_lock(&iommu->lock);
+
+	list_for_each_entry(group, &iommu->group_list, next) {
+		if (group->iommu_group == iommu_group) {
+			iommu_detach_group(iommu->domain, iommu_group);
+			list_del(&group->next);
+			kfree(group);
+			break;
+		}
+	}
+
+	mutex_unlock(&iommu->lock);
+}
+
+static void *vfio_iommu_x86_open(unsigned long arg)
+{
+	struct vfio_iommu *iommu;
+
+	if (arg != VFIO_X86_IOMMU)
+		return ERR_PTR(-EINVAL);
+
+	iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
+	if (!iommu)
+		return ERR_PTR(-ENOMEM);
+
+	INIT_LIST_HEAD(&iommu->group_list);
+	INIT_LIST_HEAD(&iommu->dma_list);
+	mutex_init(&iommu->lock);
+
+	/*
+	 * Wish we didn't have to know about bus_type here.
+	 */
+	iommu->domain = iommu_domain_alloc(&pci_bus_type);
+	if (!iommu->domain) {
+		kfree(iommu);
+		return ERR_PTR(-EIO);
+	}
+
+	/*
+	 * Wish we could specify required capabilities rather than create
+	 * a domain, see what comes out and hope it doesn't change along
+	 * the way.  Fortunately we know interrupt remapping is global for
+	 * our iommus.
+	 */
+	if (!allow_unsafe_interrupts &&
+	    !iommu_domain_has_cap(iommu->domain, IOMMU_CAP_INTR_REMAP)) {
+		printk(KERN_WARNING
+		       "%s: No interrupt remapping support.  Use the module param \"allow_unsafe_interrupts\" to enable VFIO IOMMU support on this platform\n",
+		       __func__);
+		iommu_domain_free(iommu->domain);
+		kfree(iommu);
+		return ERR_PTR(-EPERM);
+	}
+
+	return iommu;
+}
+
+static void vfio_iommu_x86_release(void *iommu_data)
+{
+	struct vfio_iommu *iommu = iommu_data;
+	struct vfio_group *group, *group_tmp;
+	struct vfio_dma *dma, *dma_tmp;
+
+	list_for_each_entry_safe(group, group_tmp, &iommu->group_list, next) {
+		iommu_detach_group(iommu->domain, group->iommu_group);
+		list_del(&group->next);
+		kfree(group);
+	}
+
+	list_for_each_entry_safe(dma, dma_tmp, &iommu->dma_list, next) {
+		vfio_dma_unmap(iommu, dma->iova, dma->npage, dma->prot);
+		list_del(&dma->next);
+		kfree(dma);
+	}
+
+	iommu_domain_free(iommu->domain);
+	iommu->domain = NULL;
+	kfree(iommu);
+}
+
+static long vfio_iommu_x86_ioctl(void *iommu_data,
+				 unsigned int cmd, unsigned long arg)
+{
+	struct vfio_iommu *iommu = iommu_data;
+	unsigned long minsz;
+
+	if (cmd == VFIO_CHECK_EXTENSION) {
+		switch (arg) {
+		case VFIO_X86_IOMMU:
+			return 1;
+		default:
+			return 0;
+		}
+	} else if (cmd == VFIO_IOMMU_GET_INFO) {
+		struct vfio_iommu_x86_info info;
+
+		minsz = offsetofend(struct vfio_iommu_x86_info, iova_pgsizes);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		info.flags = VFIO_IOMMU_INFO_PGSIZES;
+
+		info.iova_pgsizes = iommu->domain->ops->pgsize_bitmap;
+
+		return copy_to_user((void __user *)arg, &info, minsz) ?
+			-EFAULT : 0;
+
+	} else if (cmd == VFIO_IOMMU_MAP_DMA) {
+		struct vfio_iommu_x86_dma_map map;
+		uint32_t mask = VFIO_DMA_MAP_FLAG_READ |
+				VFIO_DMA_MAP_FLAG_WRITE;
+
+		minsz = offsetofend(struct vfio_iommu_x86_dma_map, size);
+
+		if (copy_from_user(&map, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (map.argsz < minsz || map.flags & ~mask)
+			return -EINVAL;
+
+		return vfio_dma_do_map(iommu, &map);
+
+	} else if (cmd == VFIO_IOMMU_UNMAP_DMA) {
+		struct vfio_iommu_x86_dma_unmap unmap;
+
+		minsz = offsetofend(struct vfio_iommu_x86_dma_unmap, size);
+
+		if (copy_from_user(&unmap, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (unmap.argsz < minsz || unmap.flags)
+			return -EINVAL;
+
+		return vfio_dma_do_unmap(iommu, &unmap);
+	}
+
+	return -ENOTTY;
+}
+
+const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_x86 = {
+	.name		= "vfio-iommu-x86",
+	.owner		= THIS_MODULE,
+	.open		= vfio_iommu_x86_open,
+	.release	= vfio_iommu_x86_release,
+	.ioctl		= vfio_iommu_x86_ioctl,
+	.attach_group	= vfio_iommu_x86_attach_group,
+	.detach_group	= vfio_iommu_x86_detach_group,
+};
+
+static int __init vfio_iommu_x86_init(void)
+{
+	if (!iommu_present(&pci_bus_type))
+		return -ENODEV;
+
+	return vfio_register_iommu_driver(&vfio_iommu_driver_ops_x86);
+}
+
+static void __exit vfio_iommu_x86_cleanup(void)
+{
+	vfio_unregister_iommu_driver(&vfio_iommu_driver_ops_x86);
+}
+
+module_init(vfio_iommu_x86_init);
+module_exit(vfio_iommu_x86_cleanup);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index a264054..1c7119c 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -361,4 +361,56 @@ struct vfio_irq_set {
  */
 #define VFIO_DEVICE_RESET		_IO(VFIO_TYPE, VFIO_BASE + 11)
 
+/* -------- API for x86 VFIO IOMMU -------- */
+
+/**
+ * VFIO_IOMMU_GET_INFO - _IOR(VFIO_TYPE, VFIO_BASE + 12, struct vfio_iommu_x86_info)
+ *
+ * Retrieve information about the IOMMU object.  Fills in the provided
+ * struct vfio_iommu_x86_info.  Caller sets argsz.
+ *
+ * XXX Should we do these by CHECK_EXTENSION too?
+ */
+struct vfio_iommu_x86_info {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_IOMMU_INFO_PGSIZES (1 << 0)	/* supported page sizes info */
+	__u64	iova_pgsizes;		/* Bitmap of supported page sizes */
+};
+
+#define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
+
+/**
+ * VFIO_IOMMU_MAP_DMA - _IOW(VFIO_TYPE, VFIO_BASE + 13, struct vfio_iommu_x86_dma_map)
+ *
+ * Map process virtual addresses to IO virtual addresses using the
+ * provided struct vfio_iommu_x86_dma_map.  Caller sets argsz.  READ
+ * and/or WRITE required.
+ */
+struct vfio_iommu_x86_dma_map {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_DMA_MAP_FLAG_READ (1 << 0)		/* readable from device */
+#define VFIO_DMA_MAP_FLAG_WRITE (1 << 1)	/* writable from device */
+	__u64	vaddr;				/* Process virtual address */
+	__u64	iova;				/* IO virtual address */
+	__u64	size;				/* Size of mapping (bytes) */
+};
+
+#define VFIO_IOMMU_MAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 13)
+
+/**
+ * VFIO_IOMMU_UNMAP_DMA - _IOW(VFIO_TYPE, VFIO_BASE + 14, struct vfio_iommu_x86_dma_unmap)
+ *
+ * Unmap IO virtual addresses using the provided struct
+ * vfio_iommu_x86_dma_unmap.  Caller sets argsz.
+ */
+struct vfio_iommu_x86_dma_unmap {
+	__u32	argsz;
+	__u32	flags;
+	__u64	iova;				/* IO virtual address */
+	__u64	size;				/* Size of mapping (bytes) */
+};
+
+#define VFIO_IOMMU_UNMAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 14)
+
 #endif /* VFIO_H */


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 09/13] vfio: x86 IOMMU implementation
@ 2012-05-11 22:56   ` Alex Williamson
  0 siblings, 0 replies; 129+ messages in thread
From: Alex Williamson @ 2012-05-11 22:56 UTC (permalink / raw)
  To: benh-XVmvHMARGAS8U2dJNN8I7kB+6BGkLq7r,
	aik-sLpHqDYs0B2HXe+LvDLADg,
	david-xT8FGy+AXnRB3Ne2BGzF6laj5H9X9Tb+,
	joerg.roedel-5C7GfCeVMHo, dwmw2-wEGCiKHe2LqWVfeAwA7xHQ
  Cc: kvm-u79uwXL29TY76Z2rM5mHXA, B07421-KZfg59tc24xl57MIdRCFDg,
	linux-pci-u79uwXL29TY76Z2rM5mHXA, agraf-l3A5Bk7waGM,
	qemu-devel-qX2TKyscuCcdnm+yROfE0A, chrisw-69jw2NvuJkxg9hUCZPvPmw,
	B08248-KZfg59tc24xl57MIdRCFDg,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	avi-H+wXaHxf7aLQT0dZR+AlfA,
	gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
	bhelgaas-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	benve-FYB4Gu1CFyUAvxtiuMwx3w

x86 is probably the wrong name for this VFIO IOMMU driver, but x86
is the primary target for it.  This driver support a very simple
usage model using the existing IOMMU API.  The IOMMU is expected to
support the full host address space with no special IOVA windows,
number of mappings restrictions, or unique processor target options.

Signed-off-by: Alex Williamson <alex.williamson-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---

 Documentation/ioctl/ioctl-number.txt |    2 
 drivers/vfio/Kconfig                 |    6 
 drivers/vfio/Makefile                |    2 
 drivers/vfio/vfio.c                  |    7 
 drivers/vfio/vfio_iommu_x86.c        |  743 ++++++++++++++++++++++++++++++++++
 include/linux/vfio.h                 |   52 ++
 6 files changed, 811 insertions(+), 1 deletions(-)
 create mode 100644 drivers/vfio/vfio_iommu_x86.c

diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
index 111e30a..9d1694e 100644
--- a/Documentation/ioctl/ioctl-number.txt
+++ b/Documentation/ioctl/ioctl-number.txt
@@ -88,7 +88,7 @@ Code  Seq#(hex)	Include File		Comments
 		and kernel/power/user.c
 '8'	all				SNP8023 advanced NIC card
 					<mailto:mcr-b3vwm0MSY3FBDgjK7y7TUQ@public.gmane.org>
-';'	64-6F	linux/vfio.h
+';'	64-72	linux/vfio.h
 '@'	00-0F	linux/radeonfb.h	conflict!
 '@'	00-0F	drivers/video/aty/aty128fb.c	conflict!
 'A'	00-1F	linux/apm_bios.h	conflict!
diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index 9acb1e7..bd88a30 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -1,6 +1,12 @@
+config VFIO_IOMMU_X86
+	tristate
+	depends on VFIO && X86
+	default n
+
 menuconfig VFIO
 	tristate "VFIO Non-Privileged userspace driver framework"
 	depends on IOMMU_API
+	select VFIO_IOMMU_X86 if X86
 	help
 	  VFIO provides a framework for secure userspace device drivers.
 	  See Documentation/vfio.txt for more details.
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index 7500a67..1f1abee 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -1 +1,3 @@
 obj-$(CONFIG_VFIO) += vfio.o
+obj-$(CONFIG_VFIO_IOMMU_X86) += vfio_iommu_x86.o
+obj-$(CONFIG_VFIO_PCI) += pci/
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index af0e4f8..410c4b0 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1362,6 +1362,13 @@ static int __init vfio_init(void)
 
 	pr_info(DRIVER_DESC " version: " DRIVER_VERSION "\n");
 
+	/*
+	 * Attempt to load known iommu-drivers.  This gives us a working
+	 * environment without the user needing to explicitly load iommu
+	 * drivers.
+	 */
+	request_module_nowait("vfio_iommu_x86");
+
 	return 0;
 
 err_groups_cdev:
diff --git a/drivers/vfio/vfio_iommu_x86.c b/drivers/vfio/vfio_iommu_x86.c
new file mode 100644
index 0000000..a52391d
--- /dev/null
+++ b/drivers/vfio/vfio_iommu_x86.c
@@ -0,0 +1,743 @@
+/*
+ * VFIO: IOMMU DMA mapping support for x86 (Intel VT-d & AMD-Vi)
+ *
+ * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
+ *     Author: Alex Williamson <alex.williamson-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Derived from original vfio:
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ * Author: Tom Lyon, pugs-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org
+ */
+
+#include <linux/compat.h>
+#include <linux/device.h>
+#include <linux/fs.h>
+#include <linux/iommu.h>
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/pci.h>		/* pci_bus_type */
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+#include <linux/workqueue.h>
+
+#define DRIVER_VERSION  "0.2"
+#define DRIVER_AUTHOR   "Alex Williamson <alex.williamson-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>"
+#define DRIVER_DESC     "x86 IOMMU driver for VFIO"
+
+static bool allow_unsafe_interrupts;
+module_param_named(allow_unsafe_interrupts,
+		   allow_unsafe_interrupts, bool, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(allow_unsafe_interrupts,
+		 "Enable VFIO IOMMU support for on platforms without interrupt remapping support.");
+
+struct vfio_iommu {
+	struct iommu_domain	*domain;
+	struct mutex		lock;
+	struct list_head	dma_list;
+	struct list_head	group_list;
+	bool			cache;
+};
+
+struct vfio_dma {
+	struct list_head	next;
+	dma_addr_t		iova;		/* Device address */
+	unsigned long		vaddr;		/* Process virtual addr */
+	long			npage;		/* Number of pages */
+	int			prot;		/* IOMMU_READ/WRITE */
+};
+
+struct vfio_group {
+	struct iommu_group	*iommu_group;
+	struct list_head	next;
+};
+
+/*
+ * This code handles mapping and unmapping of user data buffers
+ * into DMA'ble space using the IOMMU
+ */
+
+#define NPAGE_TO_SIZE(npage)	((size_t)(npage) << PAGE_SHIFT)
+
+struct vwork {
+	struct mm_struct	*mm;
+	long			npage;
+	struct work_struct	work;
+};
+
+/* delayed decrement/increment for locked_vm */
+static void vfio_lock_acct_bg(struct work_struct *work)
+{
+	struct vwork *vwork = container_of(work, struct vwork, work);
+	struct mm_struct *mm;
+
+	mm = vwork->mm;
+	down_write(&mm->mmap_sem);
+	mm->locked_vm += vwork->npage;
+	up_write(&mm->mmap_sem);
+	mmput(mm);
+	kfree(vwork);
+}
+
+static void vfio_lock_acct(long npage)
+{
+	struct vwork *vwork;
+	struct mm_struct *mm;
+
+	if (!current->mm)
+		return; /* process exited */
+
+	if (down_write_trylock(&current->mm->mmap_sem)) {
+		current->mm->locked_vm += npage;
+		up_write(&current->mm->mmap_sem);
+		return;
+	}
+
+	/*
+	 * Couldn't get mmap_sem lock, so must setup to update
+	 * mm->locked_vm later. If locked_vm were atomic, we
+	 * wouldn't need this silliness
+	 */
+	vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
+	if (!vwork)
+		return;
+	mm = get_task_mm(current);
+	if (!mm) {
+		kfree(vwork);
+		return;
+	}
+	INIT_WORK(&vwork->work, vfio_lock_acct_bg);
+	vwork->mm = mm;
+	vwork->npage = npage;
+	schedule_work(&vwork->work);
+}
+
+/*
+ * Some mappings aren't backed by a struct page, for example an mmap'd
+ * MMIO range for our own or another device.  These use a different
+ * pfn conversion and shouldn't be tracked as locked pages.
+ */
+static bool is_invalid_reserved_pfn(unsigned long pfn)
+{
+	if (pfn_valid(pfn)) {
+		bool reserved;
+		struct page *tail = pfn_to_page(pfn);
+		struct page *head = compound_trans_head(tail);
+		reserved = !!(PageReserved(head));
+		if (head != tail) {
+			/*
+			 * "head" is not a dangling pointer
+			 * (compound_trans_head takes care of that)
+			 * but the hugepage may have been split
+			 * from under us (and we may not hold a
+			 * reference count on the head page so it can
+			 * be reused before we run PageReferenced), so
+			 * we've to check PageTail before returning
+			 * what we just read.
+			 */
+			smp_rmb();
+			if (PageTail(tail))
+				return reserved;
+		}
+		return PageReserved(tail);
+	}
+
+	return true;
+}
+
+static int put_pfn(unsigned long pfn, int prot)
+{
+	if (!is_invalid_reserved_pfn(pfn)) {
+		struct page *page = pfn_to_page(pfn);
+		if (prot & IOMMU_WRITE)
+			SetPageDirty(page);
+		put_page(page);
+		return 1;
+	}
+	return 0;
+}
+
+/* Unmap DMA region */
+static long __vfio_dma_do_unmap(struct vfio_iommu *iommu, dma_addr_t iova,
+			     long npage, int prot)
+{
+	long i, unlocked = 0;
+
+	for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
+		unsigned long pfn;
+
+		pfn = iommu_iova_to_phys(iommu->domain, iova) >> PAGE_SHIFT;
+		if (pfn) {
+			iommu_unmap(iommu->domain, iova, PAGE_SIZE);
+			unlocked += put_pfn(pfn, prot);
+		}
+	}
+	return unlocked;
+}
+
+static void vfio_dma_unmap(struct vfio_iommu *iommu, dma_addr_t iova,
+			   long npage, int prot)
+{
+	long unlocked;
+
+	unlocked = __vfio_dma_do_unmap(iommu, iova, npage, prot);
+	vfio_lock_acct(-unlocked);
+}
+
+static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
+{
+	struct page *page[1];
+	struct vm_area_struct *vma;
+	int ret = -EFAULT;
+
+	if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
+		*pfn = page_to_pfn(page[0]);
+		return 0;
+	}
+
+	down_read(&current->mm->mmap_sem);
+
+	vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
+
+	if (vma && vma->vm_flags & VM_PFNMAP) {
+		*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+		if (is_invalid_reserved_pfn(*pfn))
+			ret = 0;
+	}
+
+	up_read(&current->mm->mmap_sem);
+
+	return ret;
+}
+
+/* Map DMA region */
+static int __vfio_dma_map(struct vfio_iommu *iommu, dma_addr_t iova,
+			  unsigned long vaddr, long npage, int prot)
+{
+	dma_addr_t start = iova;
+	long i, locked = 0;
+	int ret;
+
+	/* Verify that pages are not already mapped */
+	for (i = 0; i < npage; i++, iova += PAGE_SIZE)
+		if (iommu_iova_to_phys(iommu->domain, iova))
+			return -EBUSY;
+
+	iova = start;
+
+	if (iommu->cache)
+		prot |= IOMMU_CACHE;
+
+	/*
+	 * XXX We break mappings into pages and use get_user_pages_fast to
+	 * pin the pages in memory.  It's been suggested that mlock might
+	 * provide a more efficient mechanism, but nothing prevents the
+	 * user from munlocking the pages, which could then allow the user
+	 * access to random host memory.  We also have no guarantee from the
+	 * IOMMU API that the iommu driver can unmap sub-pages of previous
+	 * mappings.  This means we might lose an entire range if a single
+	 * page within it is unmapped.  Single page mappings are inefficient,
+	 * but provide the most flexibility for now.
+	 */
+	for (i = 0; i < npage; i++, iova += PAGE_SIZE, vaddr += PAGE_SIZE) {
+		unsigned long pfn = 0;
+
+		ret = vaddr_get_pfn(vaddr, prot, &pfn);
+		if (ret) {
+			__vfio_dma_do_unmap(iommu, start, i, prot);
+			return ret;
+		}
+
+		/*
+		 * Only add actual locked pages to accounting
+		 * XXX We're effectively marking a page locked for every
+		 * IOVA page even though it's possible the user could be
+		 * backing multiple IOVAs with the same vaddr.  This over-
+		 * penalizes the user process, but we currently have no
+		 * easy way to do this properly.
+		 */
+		if (!is_invalid_reserved_pfn(pfn))
+			locked++;
+
+		ret = iommu_map(iommu->domain, iova,
+				(phys_addr_t)pfn << PAGE_SHIFT,
+				PAGE_SIZE, prot);
+		if (ret) {
+			/* Back out mappings on error */
+			put_pfn(pfn, prot);
+			__vfio_dma_do_unmap(iommu, start, i, prot);
+			return ret;
+		}
+	}
+	vfio_lock_acct(locked);
+	return 0;
+}
+
+static inline bool ranges_overlap(dma_addr_t start1, size_t size1,
+				  dma_addr_t start2, size_t size2)
+{
+	if (start1 < start2)
+		return (start2 - start1 < size1);
+	else if (start2 < start1)
+		return (start1 - start2 < size2);
+	return (size1 > 0 && size2 > 0);
+}
+
+static struct vfio_dma *vfio_find_dma(struct vfio_iommu *iommu,
+						dma_addr_t start, size_t size)
+{
+	struct vfio_dma *dma;
+
+	list_for_each_entry(dma, &iommu->dma_list, next) {
+		if (ranges_overlap(dma->iova, NPAGE_TO_SIZE(dma->npage),
+				   start, size))
+			return dma;
+	}
+	return NULL;
+}
+
+static long vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t start,
+				    size_t size, struct vfio_dma *dma)
+{
+	struct vfio_dma *split;
+	long npage_lo, npage_hi;
+
+	/* Existing dma region is completely covered, unmap all */
+	if (start <= dma->iova &&
+	    start + size >= dma->iova + NPAGE_TO_SIZE(dma->npage)) {
+		vfio_dma_unmap(iommu, dma->iova, dma->npage, dma->prot);
+		list_del(&dma->next);
+		npage_lo = dma->npage;
+		kfree(dma);
+		return npage_lo;
+	}
+
+	/* Overlap low address of existing range */
+	if (start <= dma->iova) {
+		size_t overlap;
+
+		overlap = start + size - dma->iova;
+		npage_lo = overlap >> PAGE_SHIFT;
+
+		vfio_dma_unmap(iommu, dma->iova, npage_lo, dma->prot);
+		dma->iova += overlap;
+		dma->vaddr += overlap;
+		dma->npage -= npage_lo;
+		return npage_lo;
+	}
+
+	/* Overlap high address of existing range */
+	if (start + size >= dma->iova + NPAGE_TO_SIZE(dma->npage)) {
+		size_t overlap;
+
+		overlap = dma->iova + NPAGE_TO_SIZE(dma->npage) - start;
+		npage_hi = overlap >> PAGE_SHIFT;
+
+		vfio_dma_unmap(iommu, start, npage_hi, dma->prot);
+		dma->npage -= npage_hi;
+		return npage_hi;
+	}
+
+	/* Split existing */
+	npage_lo = (start - dma->iova) >> PAGE_SHIFT;
+	npage_hi = dma->npage - (size >> PAGE_SHIFT) - npage_lo;
+
+	split = kzalloc(sizeof *split, GFP_KERNEL);
+	if (!split)
+		return -ENOMEM;
+
+	vfio_dma_unmap(iommu, start, size >> PAGE_SHIFT, dma->prot);
+
+	dma->npage = npage_lo;
+
+	split->npage = npage_hi;
+	split->iova = start + size;
+	split->vaddr = dma->vaddr + NPAGE_TO_SIZE(npage_lo) + size;
+	split->prot = dma->prot;
+	list_add(&split->next, &iommu->dma_list);
+	return size >> PAGE_SHIFT;
+}
+
+static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
+			     struct vfio_iommu_x86_dma_unmap *unmap)
+{
+	long ret = 0, npage = unmap->size >> PAGE_SHIFT;
+	struct vfio_dma *dma, *tmp;
+	uint64_t mask;
+
+	mask = ((uint64_t)1 << __ffs(iommu->domain->ops->pgsize_bitmap)) - 1;
+
+	if (unmap->iova & mask)
+		return -EINVAL;
+	if (unmap->size & mask)
+		return -EINVAL;
+
+	/* XXX We still break these down into PAGE_SIZE */
+	WARN_ON(mask & PAGE_MASK);
+
+	mutex_lock(&iommu->lock);
+
+	list_for_each_entry_safe(dma, tmp, &iommu->dma_list, next) {
+		if (ranges_overlap(dma->iova, NPAGE_TO_SIZE(dma->npage),
+				   unmap->iova, unmap->size)) {
+			ret = vfio_remove_dma_overlap(iommu, unmap->iova,
+						      unmap->size, dma);
+			if (ret > 0)
+				npage -= ret;
+			if (ret < 0 || npage == 0)
+				break;
+		}
+	}
+	mutex_unlock(&iommu->lock);
+	return ret > 0 ? 0 : (int)ret;
+}
+
+static int vfio_dma_do_map(struct vfio_iommu *iommu,
+			   struct vfio_iommu_x86_dma_map *map)
+{
+	struct vfio_dma *dma, *pdma = NULL;
+	dma_addr_t iova = map->iova;
+	unsigned long locked, lock_limit, vaddr = map->vaddr;
+	size_t size = map->size;
+	int ret = 0, prot = 0;
+	uint64_t mask;
+	long npage;
+
+	mask = ((uint64_t)1 << __ffs(iommu->domain->ops->pgsize_bitmap)) - 1;
+
+	/* READ/WRITE from device perspective */
+	if (map->flags & VFIO_DMA_MAP_FLAG_WRITE)
+		prot |= IOMMU_WRITE;
+	if (map->flags & VFIO_DMA_MAP_FLAG_READ)
+		prot |= IOMMU_READ;
+
+	if (!prot)
+		return -EINVAL; /* No READ/WRITE? */
+
+	if (vaddr & mask)
+		return -EINVAL;
+	if (iova & mask)
+		return -EINVAL;
+	if (size & mask)
+		return -EINVAL;
+
+	/* XXX We still break these down into PAGE_SIZE */
+	WARN_ON(mask & PAGE_MASK);
+
+	/* Don't allow IOVA wrap */
+	if (iova + size && iova + size < iova)
+		return -EINVAL;
+
+	/* Don't allow virtual address wrap */
+	if (vaddr + size && vaddr + size < vaddr)
+		return -EINVAL;
+
+	npage = size >> PAGE_SHIFT;
+	if (!npage)
+		return -EINVAL;
+
+	mutex_lock(&iommu->lock);
+
+	if (vfio_find_dma(iommu, iova, size)) {
+		ret = -EBUSY;
+		goto out_lock;
+	}
+
+	/* account for locked pages */
+	locked = current->mm->locked_vm + npage;
+	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+	if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
+		printk(KERN_WARNING "%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
+			__func__, rlimit(RLIMIT_MEMLOCK));
+		ret = -ENOMEM;
+		goto out_lock;
+	}
+
+	ret = __vfio_dma_map(iommu, iova, vaddr, npage, prot);
+	if (ret)
+		goto out_lock;
+
+	/* Check if we abut a region below - nothing below 0 */
+	if (iova) {
+		dma = vfio_find_dma(iommu, iova - 1, 1);
+		if (dma && dma->prot == prot &&
+		    dma->vaddr + NPAGE_TO_SIZE(dma->npage) == vaddr) {
+
+			dma->npage += npage;
+			iova = dma->iova;
+			vaddr = dma->vaddr;
+			npage = dma->npage;
+			size = NPAGE_TO_SIZE(npage);
+
+			pdma = dma;
+		}
+	}
+
+	/* Check if we abut a region above - nothing above ~0 + 1 */
+	if (iova + size) {
+		dma = vfio_find_dma(iommu, iova + size, 1);
+		if (dma && dma->prot == prot &&
+		    dma->vaddr == vaddr + size) {
+
+			dma->npage += npage;
+			dma->iova = iova;
+			dma->vaddr = vaddr;
+
+			/*
+			 * If merged above and below, remove previously
+			 * merged entry.  New entry covers it.
+			 */
+			if (pdma) {
+				list_del(&pdma->next);
+				kfree(pdma);
+			}
+			pdma = dma;
+		}
+	}
+
+	/* Isolated, new region */
+	if (!pdma) {
+		dma = kzalloc(sizeof *dma, GFP_KERNEL);
+		if (!dma) {
+			ret = -ENOMEM;
+			vfio_dma_unmap(iommu, iova, npage, prot);
+			goto out_lock;
+		}
+
+		dma->npage = npage;
+		dma->iova = iova;
+		dma->vaddr = vaddr;
+		dma->prot = prot;
+		list_add(&dma->next, &iommu->dma_list);
+	}
+
+out_lock:
+	mutex_unlock(&iommu->lock);
+	return ret;
+}
+
+static int vfio_iommu_x86_attach_group(void *iommu_data,
+				       struct iommu_group *iommu_group)
+{
+	struct vfio_iommu *iommu = iommu_data;
+	struct vfio_group *group, *tmp;
+	int ret;
+
+	group = kzalloc(sizeof(*group), GFP_KERNEL);
+	if (!group)
+		return -ENOMEM;
+
+	mutex_lock(&iommu->lock);
+
+	list_for_each_entry(tmp, &iommu->group_list, next) {
+		if (tmp->iommu_group == iommu_group) {
+			mutex_unlock(&iommu->lock);
+			kfree(group);
+			return -EINVAL;
+		}
+	}
+
+	/*
+	 * TODO: Domain have capabilities that might change as we add
+	 * groups (see iommu->cache, currently never set).  Check for
+	 * them and potentially disallow groups to be attached when it
+	 * would change capabilities (ugh).
+	 */
+	ret = iommu_attach_group(iommu->domain, iommu_group);
+	if (ret) {
+		mutex_unlock(&iommu->lock);
+		kfree(group);
+		return ret;
+	}
+
+	group->iommu_group = iommu_group;
+	list_add(&group->next, &iommu->group_list);
+
+	mutex_unlock(&iommu->lock);
+
+	return 0;
+}
+
+static void vfio_iommu_x86_detach_group(void *iommu_data,
+					struct iommu_group *iommu_group)
+{
+	struct vfio_iommu *iommu = iommu_data;
+	struct vfio_group *group;
+
+	mutex_lock(&iommu->lock);
+
+	list_for_each_entry(group, &iommu->group_list, next) {
+		if (group->iommu_group == iommu_group) {
+			iommu_detach_group(iommu->domain, iommu_group);
+			list_del(&group->next);
+			kfree(group);
+			break;
+		}
+	}
+
+	mutex_unlock(&iommu->lock);
+}
+
+static void *vfio_iommu_x86_open(unsigned long arg)
+{
+	struct vfio_iommu *iommu;
+
+	if (arg != VFIO_X86_IOMMU)
+		return ERR_PTR(-EINVAL);
+
+	iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
+	if (!iommu)
+		return ERR_PTR(-ENOMEM);
+
+	INIT_LIST_HEAD(&iommu->group_list);
+	INIT_LIST_HEAD(&iommu->dma_list);
+	mutex_init(&iommu->lock);
+
+	/*
+	 * Wish we didn't have to know about bus_type here.
+	 */
+	iommu->domain = iommu_domain_alloc(&pci_bus_type);
+	if (!iommu->domain) {
+		kfree(iommu);
+		return ERR_PTR(-EIO);
+	}
+
+	/*
+	 * Wish we could specify required capabilities rather than create
+	 * a domain, see what comes out and hope it doesn't change along
+	 * the way.  Fortunately we know interrupt remapping is global for
+	 * our iommus.
+	 */
+	if (!allow_unsafe_interrupts &&
+	    !iommu_domain_has_cap(iommu->domain, IOMMU_CAP_INTR_REMAP)) {
+		printk(KERN_WARNING
+		       "%s: No interrupt remapping support.  Use the module param \"allow_unsafe_interrupts\" to enable VFIO IOMMU support on this platform\n",
+		       __func__);
+		iommu_domain_free(iommu->domain);
+		kfree(iommu);
+		return ERR_PTR(-EPERM);
+	}
+
+	return iommu;
+}
+
+static void vfio_iommu_x86_release(void *iommu_data)
+{
+	struct vfio_iommu *iommu = iommu_data;
+	struct vfio_group *group, *group_tmp;
+	struct vfio_dma *dma, *dma_tmp;
+
+	list_for_each_entry_safe(group, group_tmp, &iommu->group_list, next) {
+		iommu_detach_group(iommu->domain, group->iommu_group);
+		list_del(&group->next);
+		kfree(group);
+	}
+
+	list_for_each_entry_safe(dma, dma_tmp, &iommu->dma_list, next) {
+		vfio_dma_unmap(iommu, dma->iova, dma->npage, dma->prot);
+		list_del(&dma->next);
+		kfree(dma);
+	}
+
+	iommu_domain_free(iommu->domain);
+	iommu->domain = NULL;
+	kfree(iommu);
+}
+
+static long vfio_iommu_x86_ioctl(void *iommu_data,
+				 unsigned int cmd, unsigned long arg)
+{
+	struct vfio_iommu *iommu = iommu_data;
+	unsigned long minsz;
+
+	if (cmd == VFIO_CHECK_EXTENSION) {
+		switch (arg) {
+		case VFIO_X86_IOMMU:
+			return 1;
+		default:
+			return 0;
+		}
+	} else if (cmd == VFIO_IOMMU_GET_INFO) {
+		struct vfio_iommu_x86_info info;
+
+		minsz = offsetofend(struct vfio_iommu_x86_info, iova_pgsizes);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		info.flags = 0;
+
+		info.iova_pgsizes = iommu->domain->ops->pgsize_bitmap;
+
+		return copy_to_user((void __user *)arg, &info, minsz);
+
+	} else if (cmd == VFIO_IOMMU_MAP_DMA) {
+		struct vfio_iommu_x86_dma_map map;
+		uint32_t mask = VFIO_DMA_MAP_FLAG_READ |
+				VFIO_DMA_MAP_FLAG_WRITE;
+
+		minsz = offsetofend(struct vfio_iommu_x86_dma_map, size);
+
+		if (copy_from_user(&map, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (map.argsz < minsz || map.flags & ~mask)
+			return -EINVAL;
+
+		return vfio_dma_do_map(iommu, &map);
+
+	} else if (cmd == VFIO_IOMMU_UNMAP_DMA) {
+		struct vfio_iommu_x86_dma_unmap unmap;
+
+		minsz = offsetofend(struct vfio_iommu_x86_dma_unmap, size);
+
+		if (copy_from_user(&unmap, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (unmap.argsz < minsz || unmap.flags)
+			return -EINVAL;
+
+		return vfio_dma_do_unmap(iommu, &unmap);
+	}
+
+	return -ENOTTY;
+}
+
+const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_x86 = {
+	.name		= "vfio-iommu-x86",
+	.owner		= THIS_MODULE,
+	.open		= vfio_iommu_x86_open,
+	.release	= vfio_iommu_x86_release,
+	.ioctl		= vfio_iommu_x86_ioctl,
+	.attach_group	= vfio_iommu_x86_attach_group,
+	.detach_group	= vfio_iommu_x86_detach_group,
+};
+
+static int __init vfio_iommu_x86_init(void)
+{
+	if (!iommu_present(&pci_bus_type))
+		return -ENODEV;
+
+	return vfio_register_iommu_driver(&vfio_iommu_driver_ops_x86);
+}
+
+static void __exit vfio_iommu_x86_cleanup(void)
+{
+	vfio_unregister_iommu_driver(&vfio_iommu_driver_ops_x86);
+}
+
+module_init(vfio_iommu_x86_init);
+module_exit(vfio_iommu_x86_cleanup);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index a264054..1c7119c 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -361,4 +361,56 @@ struct vfio_irq_set {
  */
 #define VFIO_DEVICE_RESET		_IO(VFIO_TYPE, VFIO_BASE + 11)
 
+/* -------- API for x86 VFIO IOMMU -------- */
+
+/**
+ * VFIO_IOMMU_GET_INFO - _IOR(VFIO_TYPE, VFIO_BASE + 12, struct vfio_iommu_info)
+ *
+ * Retrieve information about the IOMMU object. Fills in provided
+ * struct vfio_iommu_info. Caller sets argsz.
+ *
+ * XXX Should we do these by CHECK_EXTENSION too?
+ */
+struct vfio_iommu_x86_info {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_IOMMU_INFO_PGSIZES (1 << 0)	/* supported page sizes info */
+	__u64	iova_pgsizes;		/* Bitmap of supported page sizes */
+};
+
+#define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
+
+/**
+ * VFIO_IOMMU_MAP_DMA - _IOW(VFIO_TYPE, VFIO_BASE + 13, struct vfio_iommu_x86_dma_map)
+ *
+ * Map process virtual addresses to IO virtual addresses using the
+ * provided struct vfio_iommu_x86_dma_map. Caller sets argsz. READ and/or WRITE required.
+ */
+struct vfio_iommu_x86_dma_map {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_DMA_MAP_FLAG_READ (1 << 0)		/* readable from device */
+#define VFIO_DMA_MAP_FLAG_WRITE (1 << 1)	/* writable from device */
+	__u64	vaddr;				/* Process virtual address */
+	__u64	iova;				/* IO virtual address */
+	__u64	size;				/* Size of mapping (bytes) */
+};
+
+#define VFIO_IOMMU_MAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 13)
+
+/**
+ * VFIO_IOMMU_UNMAP_DMA - _IOW(VFIO_TYPE, VFIO_BASE + 14, struct vfio_iommu_x86_dma_unmap)
+ *
+ * Unmap IO virtual addresses using the provided struct vfio_iommu_x86_dma_unmap.
+ * Caller sets argsz.
+ */
+struct vfio_iommu_x86_dma_unmap {
+	__u32	argsz;
+	__u32	flags;
+	__u64	iova;				/* IO virtual address */
+	__u64	size;				/* Size of mapping (bytes) */
+};
+
+#define VFIO_IOMMU_UNMAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 14)
+
 #endif /* VFIO_H */
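
The argsz/minsz handshake above is easy to get wrong, so a short userspace
sketch may help: the caller advertises the size it allocated, the kernel
copies in only up to the minsz it knows about, and the structs can later
grow without breaking old binaries.  This assumes a container fd already
bound to the x86 backend; error handling is minimal and the buffer size is
arbitrary.

	#include <stdio.h>
	#include <sys/ioctl.h>
	#include <sys/mman.h>
	#include <linux/vfio.h>

	static int example_map_one_page(int container)
	{
		struct vfio_iommu_x86_info info = { .argsz = sizeof(info) };
		struct vfio_iommu_x86_dma_map map = { .argsz = sizeof(map) };
		void *buf;

		if (ioctl(container, VFIO_IOMMU_GET_INFO, &info))
			return -1;
		printf("IOVA page sizes: 0x%llx\n",
		       (unsigned long long)info.iova_pgsizes);

		buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (buf == MAP_FAILED)
			return -1;

		map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
		map.vaddr = (__u64)(unsigned long)buf;	/* process virtual address */
		map.iova = 0;				/* device-visible address */
		map.size = 4096;

		return ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
	}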

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 10/13] pci: export pci_user functions for use by other drivers
@ 2012-05-11 22:56   ` Alex Williamson
  0 siblings, 0 replies; 129+ messages in thread
From: Alex Williamson @ 2012-05-11 22:56 UTC (permalink / raw)
  To: benh, aik, david, joerg.roedel, dwmw2
  Cc: chrisw, agraf, benve, aafabbri, B08248, B07421, avi, konrad.wilk,
	kvm, qemu-devel, iommu, linux-pci, linux-kernel, gregkh,
	bhelgaas, alex.williamson

VFIO PCI support will make use of these for user-initiated
PCI config accesses.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

 drivers/pci/access.c |    6 ++++--
 drivers/pci/pci.h    |    7 -------
 include/linux/pci.h  |    8 ++++++++
 3 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/drivers/pci/access.c b/drivers/pci/access.c
index 2a58164..ba91a7e 100644
--- a/drivers/pci/access.c
+++ b/drivers/pci/access.c
@@ -162,7 +162,8 @@ int pci_user_read_config_##size						\
 	if (ret > 0)							\
 		ret = -EINVAL;						\
 	return ret;							\
-}
+}									\
+EXPORT_SYMBOL_GPL(pci_user_read_config_##size);
 
 /* Returns 0 on success, negative values indicate error. */
 #define PCI_USER_WRITE_CONFIG(size,type)				\
@@ -181,7 +182,8 @@ int pci_user_write_config_##size					\
 	if (ret > 0)							\
 		ret = -EINVAL;						\
 	return ret;							\
-}
+}									\
+EXPORT_SYMBOL_GPL(pci_user_write_config_##size);
 
 PCI_USER_READ_CONFIG(byte, u8)
 PCI_USER_READ_CONFIG(word, u16)
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index e8f2f8f..68e3475 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -86,13 +86,6 @@ static inline bool pci_is_bridge(struct pci_dev *pci_dev)
 	return !!(pci_dev->subordinate);
 }
 
-extern int pci_user_read_config_byte(struct pci_dev *dev, int where, u8 *val);
-extern int pci_user_read_config_word(struct pci_dev *dev, int where, u16 *val);
-extern int pci_user_read_config_dword(struct pci_dev *dev, int where, u32 *val);
-extern int pci_user_write_config_byte(struct pci_dev *dev, int where, u8 val);
-extern int pci_user_write_config_word(struct pci_dev *dev, int where, u16 val);
-extern int pci_user_write_config_dword(struct pci_dev *dev, int where, u32 val);
-
 struct pci_vpd_ops {
 	ssize_t (*read)(struct pci_dev *dev, loff_t pos, size_t count, void *buf);
 	ssize_t (*write)(struct pci_dev *dev, loff_t pos, size_t count, const void *buf);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index dc25da3..b437225 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -770,6 +770,14 @@ static inline int pci_write_config_dword(const struct pci_dev *dev, int where,
 	return pci_bus_write_config_dword(dev->bus, dev->devfn, where, val);
 }
 
+/* user-space driven config access */
+extern int pci_user_read_config_byte(struct pci_dev *dev, int where, u8 *val);
+extern int pci_user_read_config_word(struct pci_dev *dev, int where, u16 *val);
+extern int pci_user_read_config_dword(struct pci_dev *dev, int where, u32 *val);
+extern int pci_user_write_config_byte(struct pci_dev *dev, int where, u8 val);
+extern int pci_user_write_config_word(struct pci_dev *dev, int where, u16 val);
+extern int pci_user_write_config_dword(struct pci_dev *dev, int where, u32 val);
+
 int __must_check pci_enable_device(struct pci_dev *dev);
 int __must_check pci_enable_device_io(struct pci_dev *dev);
 int __must_check pci_enable_device_mem(struct pci_dev *dev);
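
To see why these exports matter, here is a sketch of the sort of consumer
the changelog has in mind: a driver performing a user-initiated
read-modify-write of a config register through the pci_user_* helpers,
which serialize against other config-space accesses.  The wrapper name is
hypothetical, not part of this series.

	#include <linux/pci.h>

	/* Hypothetical consumer: toggle bus mastering on behalf of
	 * userspace via the now-exported pci_user_* accessors, which
	 * return 0 on success or a negative errno. */
	static int example_user_set_busmaster(struct pci_dev *pdev, bool enable)
	{
		u16 cmd;
		int ret;

		ret = pci_user_read_config_word(pdev, PCI_COMMAND, &cmd);
		if (ret)
			return ret;

		if (enable)
			cmd |= PCI_COMMAND_MASTER;
		else
			cmd &= ~PCI_COMMAND_MASTER;

		return pci_user_write_config_word(pdev, PCI_COMMAND, cmd);
	}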


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 11/13] pci: Create common pcibios_err_to_errno
@ 2012-05-11 22:56   ` Alex Williamson
  0 siblings, 0 replies; 129+ messages in thread
From: Alex Williamson @ 2012-05-11 22:56 UTC (permalink / raw)
  To: benh, aik, david, joerg.roedel, dwmw2
  Cc: chrisw, agraf, benve, aafabbri, B08248, B07421, avi, konrad.wilk,
	kvm, qemu-devel, iommu, linux-pci, linux-kernel, gregkh,
	bhelgaas, alex.williamson

For returning errors out to non-PCI code.  Rename Xen's version.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

 drivers/xen/xen-pciback/conf_space.c |    6 +++---
 include/linux/pci.h                  |   26 ++++++++++++++++++++++++++
 2 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/drivers/xen/xen-pciback/conf_space.c b/drivers/xen/xen-pciback/conf_space.c
index 30d7be0..46ae0f9 100644
--- a/drivers/xen/xen-pciback/conf_space.c
+++ b/drivers/xen/xen-pciback/conf_space.c
@@ -124,7 +124,7 @@ static inline u32 merge_value(u32 val, u32 new_val, u32 new_val_mask,
 	return val;
 }
 
-static int pcibios_err_to_errno(int err)
+static int xen_pcibios_err_to_errno(int err)
 {
 	switch (err) {
 	case PCIBIOS_SUCCESSFUL:
@@ -202,7 +202,7 @@ out:
 		       pci_name(dev), size, offset, value);
 
 	*ret_val = value;
-	return pcibios_err_to_errno(err);
+	return xen_pcibios_err_to_errno(err);
 }
 
 int xen_pcibk_config_write(struct pci_dev *dev, int offset, int size, u32 value)
@@ -290,7 +290,7 @@ int xen_pcibk_config_write(struct pci_dev *dev, int offset, int size, u32 value)
 		}
 	}
 
-	return pcibios_err_to_errno(err);
+	return xen_pcibios_err_to_errno(err);
 }
 
 void xen_pcibk_config_free_dyn_fields(struct pci_dev *dev)
diff --git a/include/linux/pci.h b/include/linux/pci.h
index b437225..20a8f2e 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -467,6 +467,32 @@ static inline bool pci_dev_msi_enabled(struct pci_dev *pci_dev) { return false;
 #define PCIBIOS_SET_FAILED		0x88
 #define PCIBIOS_BUFFER_TOO_SMALL	0x89
 
+/*
+ * Translate the PCIBIOS_* values above to generic errnos for passing
+ * back through non-PCI code.
+ */
+static inline int pcibios_err_to_errno(int err)
+{
+	if (err <= PCIBIOS_SUCCESSFUL)
+		return err; /* Assume already errno */
+
+	switch (err) {
+	case PCIBIOS_FUNC_NOT_SUPPORTED:
+		return -ENOENT;
+	case PCIBIOS_BAD_VENDOR_ID:
+		return -EINVAL;
+	case PCIBIOS_DEVICE_NOT_FOUND:
+		return -ENODEV;
+	case PCIBIOS_BAD_REGISTER_NUMBER:
+		return -EFAULT;
+	case PCIBIOS_SET_FAILED:
+		return -EIO;
+	case PCIBIOS_BUFFER_TOO_SMALL:
+		return -ENOSPC;
+	}
+
+	return -ENOTTY;
+}
+
 /* Low-level architecture-dependent routines */
 
 struct pci_ops {
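
As a quick illustration of the intended use (a sketch with a hypothetical
wrapper name): a driver that must hand a config-access result to non-PCI
code can now do the translation in one call rather than open-coding the
switch, as xen-pciback used to.

	#include <linux/pci.h>

	/* Hypothetical example: return a normal -errno to non-PCI callers. */
	static int example_read_vendor(struct pci_dev *pdev, u16 *vendor)
	{
		int err = pci_read_config_word(pdev, PCI_VENDOR_ID, vendor);

		return pcibios_err_to_errno(err);	/* PCIBIOS_SUCCESSFUL (0) stays 0 */
	}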


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 12/13] pci: Misc pci_reg additions
@ 2012-05-11 22:56   ` Alex Williamson
  0 siblings, 0 replies; 129+ messages in thread
From: Alex Williamson @ 2012-05-11 22:56 UTC (permalink / raw)
  To: benh, aik, david, joerg.roedel, dwmw2
  Cc: chrisw, agraf, benve, aafabbri, B08248, B07421, avi, konrad.wilk,
	kvm, qemu-devel, iommu, linux-pci, linux-kernel, gregkh,
	bhelgaas, alex.williamson

Fill in many missing definitions and add sizeof fields for many
sections, allowing for more extensive config parsing.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

 include/linux/pci_regs.h |  112 +++++++++++++++++++++++++++++++++++++++++-----
 1 files changed, 100 insertions(+), 12 deletions(-)

diff --git a/include/linux/pci_regs.h b/include/linux/pci_regs.h
index 4b608f5..379be84 100644
--- a/include/linux/pci_regs.h
+++ b/include/linux/pci_regs.h
@@ -26,6 +26,7 @@
  * Under PCI, each device has 256 bytes of configuration address space,
  * of which the first 64 bytes are standardized as follows:
  */
+#define PCI_STD_HEADER_SIZEOF	64
 #define PCI_VENDOR_ID		0x00	/* 16 bits */
 #define PCI_DEVICE_ID		0x02	/* 16 bits */
 #define PCI_COMMAND		0x04	/* 16 bits */
@@ -209,9 +210,12 @@
 #define  PCI_CAP_ID_SHPC 	0x0C	/* PCI Standard Hot-Plug Controller */
 #define  PCI_CAP_ID_SSVID	0x0D	/* Bridge subsystem vendor/device ID */
 #define  PCI_CAP_ID_AGP3	0x0E	/* AGP Target PCI-PCI bridge */
+#define  PCI_CAP_ID_SECDEV	0x0F	/* Secure Device */
 #define  PCI_CAP_ID_EXP 	0x10	/* PCI Express */
 #define  PCI_CAP_ID_MSIX	0x11	/* MSI-X */
+#define  PCI_CAP_ID_SATA	0x12	/* SATA Data/Index Conf. */
 #define  PCI_CAP_ID_AF		0x13	/* PCI Advanced Features */
+#define  PCI_CAP_ID_MAX		PCI_CAP_ID_AF
 #define PCI_CAP_LIST_NEXT	1	/* Next capability in the list */
 #define PCI_CAP_FLAGS		2	/* Capability defined flags (16 bits) */
 #define PCI_CAP_SIZEOF		4
@@ -276,6 +280,7 @@
 #define  PCI_VPD_ADDR_MASK	0x7fff	/* Address mask */
 #define  PCI_VPD_ADDR_F		0x8000	/* Write 0, 1 indicates completion */
 #define PCI_VPD_DATA		4	/* 32-bits of data returned here */
+#define PCI_CAP_VPD_SIZEOF	8
 
 /* Slot Identification */
 
@@ -297,8 +302,10 @@
 #define PCI_MSI_ADDRESS_HI	8	/* Upper 32 bits (if PCI_MSI_FLAGS_64BIT set) */
 #define PCI_MSI_DATA_32		8	/* 16 bits of data for 32-bit devices */
 #define PCI_MSI_MASK_32		12	/* Mask bits register for 32-bit devices */
+#define PCI_MSI_PENDING_32	16	/* Pending intrs for 32-bit devices */
 #define PCI_MSI_DATA_64		12	/* 16 bits of data for 64-bit devices */
 #define PCI_MSI_MASK_64		16	/* Mask bits register for 64-bit devices */
+#define PCI_MSI_PENDING_64	20	/* Pending intrs for 64-bit devices */
 
 /* MSI-X registers */
 #define PCI_MSIX_FLAGS		2
@@ -308,6 +315,7 @@
 #define PCI_MSIX_TABLE		4
 #define PCI_MSIX_PBA		8
 #define  PCI_MSIX_FLAGS_BIRMASK	(7 << 0)
+#define PCI_CAP_MSIX_SIZEOF	12	/* size of MSIX registers */
 
 /* MSI-X entry's format */
 #define PCI_MSIX_ENTRY_SIZE		16
@@ -338,6 +346,7 @@
 #define  PCI_AF_CTRL_FLR	0x01
 #define PCI_AF_STATUS		5
 #define  PCI_AF_STATUS_TP	0x01
+#define PCI_CAP_AF_SIZEOF	6	/* size of AF registers */
 
 /* PCI-X registers */
 
@@ -374,6 +383,9 @@
 #define  PCI_X_STATUS_SPL_ERR	0x20000000	/* Rcvd Split Completion Error Msg */
 #define  PCI_X_STATUS_266MHZ	0x40000000	/* 266 MHz capable */
 #define  PCI_X_STATUS_533MHZ	0x80000000	/* 533 MHz capable */
+#define PCI_X_ECC_CSR		8	/* ECC control and status */
+#define PCI_CAP_PCIX_SIZEOF_V0	8	/* size of registers for Version 0 */
+#define PCI_CAP_PCIX_SIZEOF_V12	24	/* size for Version 1 & 2 */
 
 /* PCI Bridge Subsystem ID registers */
 
@@ -462,6 +474,7 @@
 #define  PCI_EXP_LNKSTA_DLLLA	0x2000	/* Data Link Layer Link Active */
 #define  PCI_EXP_LNKSTA_LBMS	0x4000	/* Link Bandwidth Management Status */
 #define  PCI_EXP_LNKSTA_LABS	0x8000	/* Link Autonomous Bandwidth Status */
+#define PCI_CAP_EXP_ENDPOINT_SIZEOF_V1	20	/* v1 endpoints end here */
 #define PCI_EXP_SLTCAP		20	/* Slot Capabilities */
 #define  PCI_EXP_SLTCAP_ABP	0x00000001 /* Attention Button Present */
 #define  PCI_EXP_SLTCAP_PCP	0x00000002 /* Power Controller Present */
@@ -521,6 +534,7 @@
 #define  PCI_EXP_OBFF_MSGA_EN	0x2000	/* OBFF enable with Message type A */
 #define  PCI_EXP_OBFF_MSGB_EN	0x4000	/* OBFF enable with Message type B */
 #define  PCI_EXP_OBFF_WAKE_EN	0x6000	/* OBFF using WAKE# signaling */
+#define PCI_CAP_EXP_ENDPOINT_SIZEOF_V2	44	/* v2 endpoints end here */
 #define PCI_EXP_LNKCTL2		48	/* Link Control 2 */
 #define PCI_EXP_SLTCTL2		56	/* Slot Control 2 */
 
@@ -529,23 +543,43 @@
 #define PCI_EXT_CAP_VER(header)		((header >> 16) & 0xf)
 #define PCI_EXT_CAP_NEXT(header)	((header >> 20) & 0xffc)
 
-#define PCI_EXT_CAP_ID_ERR	1
-#define PCI_EXT_CAP_ID_VC	2
-#define PCI_EXT_CAP_ID_DSN	3
-#define PCI_EXT_CAP_ID_PWR	4
-#define PCI_EXT_CAP_ID_VNDR	11
-#define PCI_EXT_CAP_ID_ACS	13
-#define PCI_EXT_CAP_ID_ARI	14
-#define PCI_EXT_CAP_ID_ATS	15
-#define PCI_EXT_CAP_ID_SRIOV	16
-#define PCI_EXT_CAP_ID_PRI	19
-#define PCI_EXT_CAP_ID_LTR	24
-#define PCI_EXT_CAP_ID_PASID	27
+#define PCI_EXT_CAP_ID_ERR	0x01	/* Advanced Error Reporting */
+#define PCI_EXT_CAP_ID_VC	0x02	/* Virtual Channel Capability */
+#define PCI_EXT_CAP_ID_DSN	0x03	/* Device Serial Number */
+#define PCI_EXT_CAP_ID_PWR	0x04	/* Power Budgeting */
+#define PCI_EXT_CAP_ID_RCLD	0x05	/* Root Complex Link Declaration */
+#define PCI_EXT_CAP_ID_RCILC	0x06	/* Root Complex Internal Link Control */
+#define PCI_EXT_CAP_ID_RCEC	0x07	/* Root Complex Event Collector */
+#define PCI_EXT_CAP_ID_MFVC	0x08	/* Multi-Function VC Capability */
+#define PCI_EXT_CAP_ID_VC9	0x09	/* same as _VC */
+#define PCI_EXT_CAP_ID_RCRB	0x0A	/* Root Complex RB? */
+#define PCI_EXT_CAP_ID_VNDR	0x0B	/* Vendor Specific */
+#define PCI_EXT_CAP_ID_CAC	0x0C	/* Config Access - obsolete */
+#define PCI_EXT_CAP_ID_ACS	0x0D	/* Access Control Services */
+#define PCI_EXT_CAP_ID_ARI	0x0E	/* Alternate Routing ID */
+#define PCI_EXT_CAP_ID_ATS	0x0F	/* Address Translation Services */
+#define PCI_EXT_CAP_ID_SRIOV	0x10	/* Single Root I/O Virtualization */
+#define PCI_EXT_CAP_ID_MRIOV	0x11	/* Multi Root I/O Virtualization */
+#define PCI_EXT_CAP_ID_MCAST	0x12	/* Multicast */
+#define PCI_EXT_CAP_ID_PRI	0x13	/* Page Request Interface */
+#define PCI_EXT_CAP_ID_AMD_XXX	0x14	/* reserved for AMD */
+#define PCI_EXT_CAP_ID_REBAR	0x15	/* resizable BAR */
+#define PCI_EXT_CAP_ID_DPA	0x16	/* dynamic power alloc */
+#define PCI_EXT_CAP_ID_TPH	0x17	/* TPH request */
+#define PCI_EXT_CAP_ID_LTR	0x18	/* latency tolerance reporting */
+#define PCI_EXT_CAP_ID_SECPCI	0x19	/* Secondary PCIe */
+#define PCI_EXT_CAP_ID_PMUX	0x1A	/* Protocol Multiplexing */
+#define PCI_EXT_CAP_ID_PASID	0x1B	/* Process Address Space ID */
+#define PCI_EXT_CAP_ID_MAX	PCI_EXT_CAP_ID_PASID
+
+#define PCI_EXT_CAP_DSN_SIZEOF	12
+#define PCI_EXT_CAP_MCAST_ENDPOINT_SIZEOF 40
 
 /* Advanced Error Reporting */
 #define PCI_ERR_UNCOR_STATUS	4	/* Uncorrectable Error Status */
 #define  PCI_ERR_UNC_TRAIN	0x00000001	/* Training */
 #define  PCI_ERR_UNC_DLP	0x00000010	/* Data Link Protocol */
+#define  PCI_ERR_UNC_SURPDN	0x00000020	/* Surprise Down */
 #define  PCI_ERR_UNC_POISON_TLP	0x00001000	/* Poisoned TLP */
 #define  PCI_ERR_UNC_FCP	0x00002000	/* Flow Control Protocol */
 #define  PCI_ERR_UNC_COMP_TIME	0x00004000	/* Completion Timeout */
@@ -555,6 +589,11 @@
 #define  PCI_ERR_UNC_MALF_TLP	0x00040000	/* Malformed TLP */
 #define  PCI_ERR_UNC_ECRC	0x00080000	/* ECRC Error Status */
 #define  PCI_ERR_UNC_UNSUP	0x00100000	/* Unsupported Request */
+#define  PCI_ERR_UNC_ACSV	0x00200000	/* ACS Violation */
+#define  PCI_ERR_UNC_INTN	0x00400000	/* internal error */
+#define  PCI_ERR_UNC_MCBTLP	0x00800000	/* MC blocked TLP */
+#define  PCI_ERR_UNC_ATOMEG	0x01000000	/* Atomic egress blocked */
+#define  PCI_ERR_UNC_TLPPRE	0x02000000	/* TLP prefix blocked */
 #define PCI_ERR_UNCOR_MASK	8	/* Uncorrectable Error Mask */
 	/* Same bits as above */
 #define PCI_ERR_UNCOR_SEVER	12	/* Uncorrectable Error Severity */
@@ -565,6 +604,9 @@
 #define  PCI_ERR_COR_BAD_DLLP	0x00000080	/* Bad DLLP Status */
 #define  PCI_ERR_COR_REP_ROLL	0x00000100	/* REPLAY_NUM Rollover */
 #define  PCI_ERR_COR_REP_TIMER	0x00001000	/* Replay Timer Timeout */
+#define  PCI_ERR_COR_ADV_NFAT	0x00002000	/* Advisory Non-Fatal */
+#define  PCI_ERR_COR_INTERNAL	0x00004000	/* Corrected Internal */
+#define  PCI_ERR_COR_LOG_OVER	0x00008000	/* Header Log Overflow */
 #define PCI_ERR_COR_MASK	20	/* Correctable Error Mask */
 	/* Same bits as above */
 #define PCI_ERR_CAP		24	/* Advanced Error Capabilities */
@@ -596,12 +638,18 @@
 
 /* Virtual Channel */
 #define PCI_VC_PORT_REG1	4
+#define  PCI_VC_REG1_EVCC	0x7	/* extended vc count */
 #define PCI_VC_PORT_REG2	8
+#define  PCI_VC_REG2_32_PHASE	0x2
+#define  PCI_VC_REG2_64_PHASE	0x4
+#define  PCI_VC_REG2_128_PHASE	0x8
 #define PCI_VC_PORT_CTRL	12
 #define PCI_VC_PORT_STATUS	14
 #define PCI_VC_RES_CAP		16
 #define PCI_VC_RES_CTRL		20
 #define PCI_VC_RES_STATUS	26
+#define PCI_CAP_VC_BASE_SIZEOF		0x10
+#define PCI_CAP_VC_PER_VC_SIZEOF	0x0C
 
 /* Power Budgeting */
 #define PCI_PWR_DSR		4	/* Data Select Register */
@@ -614,6 +662,7 @@
 #define  PCI_PWR_DATA_RAIL(x)	(((x) >> 18) & 7)   /* Power Rail */
 #define PCI_PWR_CAP		12	/* Capability */
 #define  PCI_PWR_CAP_BUDGET(x)	((x) & 1)	/* Included in system budget */
+#define PCI_EXT_CAP_PWR_SIZEOF	16
 
 /*
  * Hypertransport sub capability types
@@ -646,6 +695,8 @@
 #define HT_CAPTYPE_ERROR_RETRY	0xC0	/* Retry on error configuration */
 #define HT_CAPTYPE_GEN3		0xD0	/* Generation 3 hypertransport configuration */
 #define HT_CAPTYPE_PM		0xE0	/* Hypertransport powermanagement configuration */
+#define HT_CAP_SIZEOF_LONG	28	/* slave & primary */
+#define HT_CAP_SIZEOF_SHORT	24	/* host & secondary */
 
 /* Alternative Routing-ID Interpretation */
 #define PCI_ARI_CAP		0x04	/* ARI Capability Register */
@@ -656,6 +707,7 @@
 #define  PCI_ARI_CTRL_MFVC	0x0001	/* MFVC Function Groups Enable */
 #define  PCI_ARI_CTRL_ACS	0x0002	/* ACS Function Groups Enable */
 #define  PCI_ARI_CTRL_FG(x)	(((x) >> 4) & 7) /* Function Group */
+#define PCI_EXT_CAP_ARI_SIZEOF	8
 
 /* Address Translation Service */
 #define PCI_ATS_CAP		0x04	/* ATS Capability Register */
@@ -665,6 +717,7 @@
 #define  PCI_ATS_CTRL_ENABLE	0x8000	/* ATS Enable */
 #define  PCI_ATS_CTRL_STU(x)	((x) & 0x1f)	/* Smallest Translation Unit */
 #define  PCI_ATS_MIN_STU	12	/* shift of minimum STU block */
+#define PCI_EXT_CAP_ATS_SIZEOF	8
 
 /* Page Request Interface */
 #define PCI_PRI_CTRL		0x04	/* PRI control register */
@@ -676,6 +729,7 @@
 #define  PCI_PRI_STATUS_STOPPED	0x100	/* PRI Stopped */
 #define PCI_PRI_MAX_REQ		0x08	/* PRI max reqs supported */
 #define PCI_PRI_ALLOC_REQ	0x0c	/* PRI max reqs allowed */
+#define PCI_EXT_CAP_PRI_SIZEOF	16
 
 /* PASID capability */
 #define PCI_PASID_CAP		0x04    /* PASID feature register */
@@ -685,6 +739,7 @@
 #define  PCI_PASID_CTRL_ENABLE	0x01	/* Enable bit */
 #define  PCI_PASID_CTRL_EXEC	0x02	/* Exec permissions Enable */
+#define  PCI_PASID_CTRL_PRIV	0x04	/* Privilege Mode Enable */
+#define PCI_EXT_CAP_PASID_SIZEOF	8
 
 /* Single Root I/O Virtualization */
 #define PCI_SRIOV_CAP		0x04	/* SR-IOV Capabilities */
@@ -716,12 +771,14 @@
 #define  PCI_SRIOV_VFM_MI	0x1	/* Dormant.MigrateIn */
 #define  PCI_SRIOV_VFM_MO	0x2	/* Active.MigrateOut */
 #define  PCI_SRIOV_VFM_AV	0x3	/* Active.Available */
+#define PCI_EXT_CAP_SRIOV_SIZEOF 64
 
 #define PCI_LTR_MAX_SNOOP_LAT	0x4
 #define PCI_LTR_MAX_NOSNOOP_LAT	0x6
 #define  PCI_LTR_VALUE_MASK	0x000003ff
 #define  PCI_LTR_SCALE_MASK	0x00001c00
 #define  PCI_LTR_SCALE_SHIFT	10
+#define PCI_EXT_CAP_LTR_SIZEOF	8
 
 /* Access Control Service */
 #define PCI_ACS_CAP		0x04	/* ACS Capability Register */
@@ -732,7 +789,38 @@
 #define  PCI_ACS_UF		0x10	/* Upstream Forwarding */
 #define  PCI_ACS_EC		0x20	/* P2P Egress Control */
 #define  PCI_ACS_DT		0x40	/* Direct Translated P2P */
+#define PCI_ACS_EGRESS_BITS	0x05	/* ACS Egress Control Vector Size */
 #define PCI_ACS_CTRL		0x06	/* ACS Control Register */
 #define PCI_ACS_EGRESS_CTL_V	0x08	/* ACS Egress Control Vector */
 
+#define PCI_VSEC_HDR		4	/* extended cap - vendor specific */
+#define  PCI_VSEC_HDR_LEN_SHIFT	20	/* shift for length field */
+
+/* sata capability */
+#define PCI_SATA_REGS		4	/* SATA REGs specifier */
+#define  PCI_SATA_REGS_MASK	0xF	/* location - BAR#/inline */
+#define  PCI_SATA_REGS_INLINE	0xF	/* REGS in config space */
+#define PCI_SATA_SIZEOF_SHORT	8
+#define PCI_SATA_SIZEOF_LONG	16
+
+/* resizable BARs */
+#define PCI_REBAR_CTRL		8	/* control register */
+#define  PCI_REBAR_CTRL_NBAR_MASK	(7 << 5)	/* mask for # bars */
+#define  PCI_REBAR_CTRL_NBAR_SHIFT	5	/* shift for # bars */
+
+/* dynamic power allocation */
+#define PCI_DPA_CAP		4	/* capability register */
+#define  PCI_DPA_CAP_SUBSTATE_MASK	0x1F	/* # substates - 1 */
+#define PCI_DPA_BASE_SIZEOF	16	/* size with 0 substates */
+
+/* TPH Requester */
+#define PCI_TPH_CAP		4	/* capability register */
+#define  PCI_TPH_CAP_LOC_MASK	0x600	/* location mask */
+#define   PCI_TPH_LOC_NONE	0x000	/* no location */
+#define   PCI_TPH_LOC_CAP	0x200	/* in capability */
+#define   PCI_TPH_LOC_MSIX	0x400	/* in MSI-X */
+#define PCI_TPH_CAP_ST_MASK	0x07FF0000	/* st table mask */
+#define PCI_TPH_CAP_ST_SHIFT	16	/* st table shift */
+#define PCI_TPH_BASE_SIZEOF	12	/* size with no st table */
+
 #endif /* LINUX_PCI_REGS_H */
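
Illustrative only, not part of the patch: a minimal sketch of the
kind of config parsing these new constants enable.  It walks the
extended capability list, assuming "config" holds a little-endian
4K snapshot of the device's config space and that the existing
PCI_EXT_CAP_ID()/PCI_EXT_CAP_NEXT() helpers from this header are
available:

	#include <stdint.h>
	#include <stdio.h>
	#include <string.h>
	#include <linux/pci_regs.h>

	#define EXT_CAP_START	0x100	/* extended caps follow std config space */

	static void dump_ext_caps(const uint8_t *config)
	{
		int pos = EXT_CAP_START;

		while (pos >= EXT_CAP_START && pos < 4096) {
			uint32_t header;

			memcpy(&header, config + pos, sizeof(header));
			if (!header)	/* empty list terminates immediately */
				break;

			/* Fixed-size capabilities now have known spans */
			if (PCI_EXT_CAP_ID(header) == PCI_EXT_CAP_ID_ARI)
				printf("ARI cap at 0x%03x, %d bytes\n",
				       pos, PCI_EXT_CAP_ARI_SIZEOF);

			pos = PCI_EXT_CAP_NEXT(header);	/* 0 ends the walk */
		}
	}

PCI_CAP_ID_MAX and PCI_EXT_CAP_ID_MAX similarly let a parser bound
its lookup tables, which is how the config virtualization code in
patch 13 uses them.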


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 13/13] vfio: Add PCI device driver
@ 2012-05-11 22:56   ` Alex Williamson
  0 siblings, 0 replies; 129+ messages in thread
From: Alex Williamson @ 2012-05-11 22:56 UTC (permalink / raw)
  To: benh, aik, david, joerg.roedel, dwmw2
  Cc: chrisw, agraf, benve, aafabbri, B08248, B07421, avi, konrad.wilk,
	kvm, qemu-devel, iommu, linux-pci, linux-kernel, gregkh,
	bhelgaas, alex.williamson

Add PCI device support for VFIO.  PCI devices expose regions
for accessing config space, I/O port space, and MMIO areas
of the device.  PCI config access is virtualized in the kernel,
allowing us to protect the integrity of the system by blocking
unsafe accesses, while avoiding duplicated config-handling code
across userspace drivers.  I/O port regions support read/write
access, while MMIO regions also support mmap of sufficiently
sized areas.  Support for INTx, MSI, and MSI-X interrupts is
provided through eventfds signaled to userspace.
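
Illustrative only, not part of the patch: a minimal userspace
sketch of how these regions and interrupts are consumed.  It
assumes a device fd has already been obtained through the VFIO
group interface from the core patch, and that the BAR0 region
index is named VFIO_PCI_BAR0_REGION_INDEX in the series' vfio.h;
error handling is elided for brevity.

	#include <stdlib.h>
	#include <string.h>
	#include <sys/eventfd.h>
	#include <sys/ioctl.h>
	#include <sys/mman.h>
	#include <linux/vfio.h>

	/* Map BAR0 if the kernel reports it as mmap-capable MMIO */
	static void *map_bar0(int device_fd)
	{
		struct vfio_region_info reg = { .argsz = sizeof(reg) };

		reg.index = VFIO_PCI_BAR0_REGION_INDEX;	/* assumed name */
		if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &reg) ||
		    !(reg.flags & VFIO_REGION_INFO_FLAG_MMAP))
			return NULL;

		return mmap(NULL, reg.size, PROT_READ | PROT_WRITE,
			    MAP_SHARED, device_fd, reg.offset);
	}

	/* Route INTx to an eventfd the caller can poll and read */
	static int intx_to_eventfd(int device_fd)
	{
		size_t sz = sizeof(struct vfio_irq_set) + sizeof(int32_t);
		struct vfio_irq_set *set = calloc(1, sz);
		int32_t efd = eventfd(0, 0);

		set->argsz = sz;
		set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
			     VFIO_IRQ_SET_ACTION_TRIGGER;
		set->index = VFIO_PCI_INTX_IRQ_INDEX;
		set->start = 0;
		set->count = 1;
		memcpy(set->data, &efd, sizeof(efd));

		ioctl(device_fd, VFIO_DEVICE_SET_IRQS, set);
		free(set);
		return efd;
	}

Regions that don't support mmap (config space, I/O port) are
accessed with read()/write() at the same per-region offsets.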

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

 drivers/vfio/Kconfig                |    2 
 drivers/vfio/pci/Kconfig            |    8 
 drivers/vfio/pci/Makefile           |    4 
 drivers/vfio/pci/vfio_pci.c         |  557 +++++++++++++
 drivers/vfio/pci/vfio_pci_config.c  | 1527 +++++++++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_intrs.c   |  724 +++++++++++++++++
 drivers/vfio/pci/vfio_pci_private.h |   91 ++
 drivers/vfio/pci/vfio_pci_rdwr.c    |  267 ++++++
 include/linux/vfio.h                |   26 +
 9 files changed, 3206 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vfio/pci/Kconfig
 create mode 100644 drivers/vfio/pci/Makefile
 create mode 100644 drivers/vfio/pci/vfio_pci.c
 create mode 100644 drivers/vfio/pci/vfio_pci_config.c
 create mode 100644 drivers/vfio/pci/vfio_pci_intrs.c
 create mode 100644 drivers/vfio/pci/vfio_pci_private.h
 create mode 100644 drivers/vfio/pci/vfio_pci_rdwr.c

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index bd88a30..77b754c 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -12,3 +12,5 @@ menuconfig VFIO
 	  See Documentation/vfio.txt for more details.
 
 	  If you don't know what to do here, say N.
+
+source "drivers/vfio/pci/Kconfig"
diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
new file mode 100644
index 0000000..cc7db62
--- /dev/null
+++ b/drivers/vfio/pci/Kconfig
@@ -0,0 +1,8 @@
+config VFIO_PCI
+	tristate "VFIO support for PCI devices"
+	depends on VFIO && PCI
+	help
+	  Support for the PCI VFIO bus driver.  This is required to make
+	  use of PCI drivers using the VFIO framework.
+
+	  If you don't know what to do here, say N.
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
new file mode 100644
index 0000000..1310792
--- /dev/null
+++ b/drivers/vfio/pci/Makefile
@@ -0,0 +1,4 @@
+
+vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
+
+obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
new file mode 100644
index 0000000..b2f1f3a
--- /dev/null
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -0,0 +1,557 @@
+/*
+ * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
+ *     Author: Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Derived from original vfio:
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ * Author: Tom Lyon, pugs@cisco.com
+ */
+
+#include <linux/device.h>
+#include <linux/eventfd.h>
+#include <linux/interrupt.h>
+#include <linux/iommu.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/notifier.h>
+#include <linux/pci.h>
+#include <linux/pm_runtime.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+
+#include "vfio_pci_private.h"
+
+#define DRIVER_VERSION  "0.1.9"
+#define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
+#define DRIVER_DESC     "VFIO PCI - User Level meta-driver"
+
+static int vfio_pci_enable(struct vfio_pci_device *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	int ret;
+	u16 cmd;
+	u8 msix_pos;
+
+	vdev->reset_works = (pci_reset_function(pdev) == 0);
+	pci_save_state(pdev);
+	vdev->pci_saved_state = pci_store_saved_state(pdev);
+	if (!vdev->pci_saved_state)
+		printk(KERN_DEBUG "%s: Couldn't store %s saved state\n",
+		       __func__, dev_name(&pdev->dev));
+
+	ret = vfio_config_init(vdev);
+	if (ret)
+		goto out;
+
+	vdev->pci_2_3 = pci_intx_mask_supported(pdev);
+
+	pci_read_config_word(pdev, PCI_COMMAND, &cmd);
+	if (vdev->pci_2_3 && (cmd & PCI_COMMAND_INTX_DISABLE)) {
+		cmd &= ~PCI_COMMAND_INTX_DISABLE;
+		pci_write_config_word(pdev, PCI_COMMAND, cmd);
+	}
+
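+	/*
+	 * Cache the MSI-X table location up front; vfio_pci_mmap() uses
+	 * it below to refuse mappings that overlap the table.
+	 */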
+	msix_pos = pci_find_capability(pdev, PCI_CAP_ID_MSIX);
+	if (msix_pos) {
+		u16 flags;
+		u32 table;
+
+		pci_read_config_word(pdev, msix_pos + PCI_MSIX_FLAGS, &flags);
+		pci_read_config_dword(pdev, msix_pos + PCI_MSIX_TABLE, &table);
+
+		vdev->msix_bar = table & PCI_MSIX_FLAGS_BIRMASK;
+		vdev->msix_offset = table & ~PCI_MSIX_FLAGS_BIRMASK;
+		vdev->msix_size = ((flags & PCI_MSIX_FLAGS_QSIZE) + 1) * 16;
+	} else
+		vdev->msix_bar = 0xFF;
+
+	ret = pci_enable_device(pdev);
+	if (ret)
+		goto out;
+
+	return ret;
+
+out:
+	kfree(vdev->pci_saved_state);
+	vdev->pci_saved_state = NULL;
+	vfio_config_free(vdev);
+	return ret;
+}
+
+static void vfio_pci_disable(struct vfio_pci_device *vdev)
+{
+	int bar;
+
+	pci_disable_device(vdev->pdev);
+
+	vfio_pci_set_irqs_ioctl(vdev, VFIO_IRQ_SET_DATA_NONE |
+				VFIO_IRQ_SET_ACTION_TRIGGER,
+				vdev->irq_type, 0, 0, NULL);
+
+	vdev->virq_disabled = false;
+
+	vfio_config_free(vdev);
+
+	if (pci_reset_function(vdev->pdev) == 0) {
+		if (pci_load_and_free_saved_state(vdev->pdev,
+						  &vdev->pci_saved_state) == 0)
+			pci_restore_state(vdev->pdev);
+		else
+			printk(KERN_INFO "%s: Couldn't reload %s saved state\n",
+			       __func__, dev_name(&vdev->pdev->dev));
+	}
+
+	for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
+		if (!vdev->barmap[bar])
+			continue;
+		pci_iounmap(vdev->pdev, vdev->barmap[bar]);
+		pci_release_selected_regions(vdev->pdev, 1 << bar);
+		vdev->barmap[bar] = NULL;
+	}
+}
+
+static void vfio_pci_release(void *device_data)
+{
+	struct vfio_pci_device *vdev = device_data;
+
+	if (atomic_dec_and_test(&vdev->refcnt))
+		vfio_pci_disable(vdev);
+
+	module_put(THIS_MODULE);
+}
+
+static int vfio_pci_open(void *device_data)
+{
+	struct vfio_pci_device *vdev = device_data;
+
+	if (!try_module_get(THIS_MODULE))
+		return -ENODEV;
+
+	if (atomic_inc_return(&vdev->refcnt) == 1) {
+		int ret = vfio_pci_enable(vdev);
+		if (ret) {
+			module_put(THIS_MODULE);
+			return ret;
+		}
+	}
+
+	return 0;
+}
+
+static int vfio_pci_get_irq_count(struct vfio_pci_device *vdev, int irq_type)
+{
+	if (irq_type == VFIO_PCI_INTX_IRQ_INDEX) {
+		u8 pin;
+		pci_read_config_byte(vdev->pdev, PCI_INTERRUPT_PIN, &pin);
+		if (pin)
+			return 1;
+
+	} else if (irq_type == VFIO_PCI_MSI_IRQ_INDEX) {
+		u8 pos;
+		u16 flags;
+
+		pos = pci_find_capability(vdev->pdev, PCI_CAP_ID_MSI);
+		if (pos) {
+			pci_read_config_word(vdev->pdev,
+					     pos + PCI_MSI_FLAGS, &flags);
+
+			return 1 << (flags & PCI_MSI_FLAGS_QMASK);
+		}
+	} else if (irq_type == VFIO_PCI_MSIX_IRQ_INDEX) {
+		u8 pos;
+		u16 flags;
+
+		pos = pci_find_capability(vdev->pdev, PCI_CAP_ID_MSIX);
+		if (pos) {
+			pci_read_config_word(vdev->pdev,
+					     pos + PCI_MSIX_FLAGS, &flags);
+
+			return (flags & PCI_MSIX_FLAGS_QSIZE) + 1;
+		}
+	}
+
+	return 0;
+}
+
+static long vfio_pci_ioctl(void *device_data,
+			   unsigned int cmd, unsigned long arg)
+{
+	struct vfio_pci_device *vdev = device_data;
+	unsigned long minsz;
+
+	if (cmd == VFIO_DEVICE_GET_INFO) {
+		struct vfio_device_info info;
+
+		minsz = offsetofend(struct vfio_device_info, num_irqs);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		info.flags = VFIO_DEVICE_FLAGS_PCI;
+
+		if (vdev->reset_works)
+			info.flags |= VFIO_DEVICE_FLAGS_RESET;
+
+		info.num_regions = VFIO_PCI_NUM_REGIONS;
+		info.num_irqs = VFIO_PCI_NUM_IRQS;
+
+		return copy_to_user((void __user *)arg, &info, minsz) ?
+			-EFAULT : 0;
+
+	} else if (cmd == VFIO_DEVICE_GET_REGION_INFO) {
+		struct pci_dev *pdev = vdev->pdev;
+		struct vfio_region_info info;
+
+		minsz = offsetofend(struct vfio_region_info, offset);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz || info.index >= VFIO_PCI_NUM_REGIONS)
+			return -EINVAL;
+
+		info.flags = 0;
+		info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+
+		if (info.index == VFIO_PCI_CONFIG_REGION_INDEX) {
+			info.size = pdev->cfg_size;
+		} else if (pci_resource_start(pdev, info.index)) {
+			unsigned long flags;
+
+			flags = pci_resource_flags(pdev, info.index);
+
+			info.flags |= VFIO_REGION_INFO_FLAG_READ;
+
+			/* Report the actual ROM size instead of the BAR size,
+			 * this gives the user an easy way to determine whether
+			 * there's anything here w/o trying to read it. */
+			if (info.index == VFIO_PCI_ROM_REGION_INDEX) {
+				void __iomem *io;
+				size_t size;
+
+				io = pci_map_rom(pdev, &size);
+				info.size = io ? size : 0;
+				if (io)
+					pci_unmap_rom(pdev, io);
+			} else if (flags & IORESOURCE_MEM) {
+				info.size = pci_resource_len(pdev, info.index);
+				info.flags |= (VFIO_REGION_INFO_FLAG_WRITE |
+					       VFIO_REGION_INFO_FLAG_MMAP);
+			} else {
+				info.size = pci_resource_len(pdev, info.index);
+				info.flags |= VFIO_REGION_INFO_FLAG_WRITE;
+			}
+		} else
+			info.size = 0;
+
+		return copy_to_user((void __user *)arg, &info, minsz) ?
+			-EFAULT : 0;
+
+	} else if (cmd == VFIO_DEVICE_GET_IRQ_INFO) {
+		struct vfio_irq_info info;
+
+		minsz = offsetofend(struct vfio_irq_info, count);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz || info.index >= VFIO_PCI_NUM_IRQS)
+			return -EINVAL;
+
+		info.flags = VFIO_IRQ_INFO_EVENTFD;
+
+		info.count = vfio_pci_get_irq_count(vdev, info.index);
+
+		if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
+			info.flags |= (VFIO_IRQ_INFO_MASKABLE |
+				       VFIO_IRQ_INFO_AUTOMASKED);
+		else
+			info.flags |= VFIO_IRQ_INFO_NORESIZE;
+
+		return copy_to_user((void __user *)arg, &info, minsz) ?
+			-EFAULT : 0;
+
+	} else if (cmd == VFIO_DEVICE_SET_IRQS) {
+		struct vfio_irq_set hdr;
+		u8 *data = NULL;
+		int ret = 0;
+
+		minsz = offsetofend(struct vfio_irq_set, count);
+
+		if (copy_from_user(&hdr, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (hdr.argsz < minsz || hdr.index >= VFIO_PCI_NUM_IRQS ||
+		    hdr.flags & ~(VFIO_IRQ_SET_DATA_TYPE_MASK |
+				  VFIO_IRQ_SET_ACTION_TYPE_MASK))
+			return -EINVAL;
+
+		if (!(hdr.flags & VFIO_IRQ_SET_DATA_NONE)) {
+			size_t size;
+
+			if (hdr.flags & VFIO_IRQ_SET_DATA_BOOL)
+				size = sizeof(uint8_t);
+			else if (hdr.flags & VFIO_IRQ_SET_DATA_EVENTFD)
+				size = sizeof(int32_t);
+			else
+				return -EINVAL;
+
+			if (hdr.argsz - minsz < hdr.count * size ||
+			    hdr.count > vfio_pci_get_irq_count(vdev, hdr.index))
+				return -EINVAL;
+
+			data = kmalloc(hdr.count * size, GFP_KERNEL);
+			if (!data)
+				return -ENOMEM;
+
+			if (copy_from_user(data, (void __user *)(arg + minsz),
+					   hdr.count * size)) {
+				kfree(data);
+				return -EFAULT;
+			}
+		}
+
+		mutex_lock(&vdev->igate);
+
+		ret = vfio_pci_set_irqs_ioctl(vdev, hdr.flags, hdr.index,
+					      hdr.start, hdr.count, data);
+
+		mutex_unlock(&vdev->igate);
+		kfree(data);
+
+		return ret;
+
+	} else if (cmd == VFIO_DEVICE_RESET)
+		return vdev->reset_works ?
+			pci_reset_function(vdev->pdev) : -EINVAL;
+
+	return -ENOTTY;
+}
+
+static ssize_t vfio_pci_read(void *device_data, char __user *buf,
+			     size_t count, loff_t *ppos)
+{
+	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	struct vfio_pci_device *vdev = device_data;
+	struct pci_dev *pdev = vdev->pdev;
+
+	if (index >= VFIO_PCI_NUM_REGIONS)
+		return -EINVAL;
+
+	if (index == VFIO_PCI_CONFIG_REGION_INDEX)
+		return vfio_pci_config_readwrite(vdev, buf, count, ppos, false);
+	else if (index == VFIO_PCI_ROM_REGION_INDEX)
+		return vfio_pci_mem_readwrite(vdev, buf, count, ppos, false);
+	else if (pci_resource_flags(pdev, index) & IORESOURCE_IO)
+		return vfio_pci_io_readwrite(vdev, buf, count, ppos, false);
+	else if (pci_resource_flags(pdev, index) & IORESOURCE_MEM)
+		return vfio_pci_mem_readwrite(vdev, buf, count, ppos, false);
+
+	return -EINVAL;
+}
+
+static ssize_t vfio_pci_write(void *device_data, const char __user *buf,
+			      size_t count, loff_t *ppos)
+{
+	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	struct vfio_pci_device *vdev = device_data;
+	struct pci_dev *pdev = vdev->pdev;
+
+	if (index >= VFIO_PCI_NUM_REGIONS)
+		return -EINVAL;
+
+	if (index == VFIO_PCI_CONFIG_REGION_INDEX)
+		return vfio_pci_config_readwrite(vdev, (char __user *)buf,
+						 count, ppos, true);
+	else if (index == VFIO_PCI_ROM_REGION_INDEX)
+		return -EINVAL;
+	else if (pci_resource_flags(pdev, index) & IORESOURCE_IO)
+		return vfio_pci_io_readwrite(vdev, (char __user *)buf,
+					     count, ppos, true);
+	else if (pci_resource_flags(pdev, index) & IORESOURCE_MEM) {
+		return vfio_pci_mem_readwrite(vdev, (char __user *)buf,
+					      count, ppos, true);
+	}
+
+	return -EINVAL;
+}
+
+static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
+{
+	struct vfio_pci_device *vdev = device_data;
+	struct pci_dev *pdev = vdev->pdev;
+	unsigned int index;
+	u64 phys_len, req_len, pgoff, req_start, phys;
+	int ret;
+
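+	/*
+	 * The region index is encoded in the upper bits of the file
+	 * offset (see VFIO_PCI_INDEX_TO_OFFSET); recover it from the
+	 * vma's page offset.
+	 */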
+	index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
+
+	if (vma->vm_end < vma->vm_start)
+		return -EINVAL;
+	if ((vma->vm_flags & VM_SHARED) == 0)
+		return -EINVAL;
+	if (index >= VFIO_PCI_ROM_REGION_INDEX)
+		return -EINVAL;
+	if (!(pci_resource_flags(pdev, index) & IORESOURCE_MEM))
+		return -EINVAL;
+
+	phys_len = pci_resource_len(pdev, index);
+	req_len = vma->vm_end - vma->vm_start;
+	pgoff = vma->vm_pgoff &
+		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
+	req_start = pgoff << PAGE_SHIFT;
+
+	if (phys_len < PAGE_SIZE || req_start + req_len > phys_len)
+		return -EINVAL;
+
+	if (index == vdev->msix_bar) {
+		/*
+		 * Disallow mmaps overlapping the MSI-X table; users don't
+		 * get to touch this directly.  We could find somewhere
+		 * else to map the overlap, but page granularity is only
+		 * a recommendation, not a requirement, so the user needs
+		 * to know which bits are real.  Requiring them to mmap
+		 * around the table makes that clear.
+		 */
+
+		/* If neither entirely above nor below, then it overlaps */
+		if (!(req_start >= vdev->msix_offset + vdev->msix_size ||
+		      req_start + req_len <= vdev->msix_offset))
+			return -EINVAL;
+	}
+
+	/*
+	 * Even though we don't make use of the barmap for the mmap,
+	 * we need to request the region and the barmap tracks that.
+	 */
+	if (!vdev->barmap[index]) {
+		ret = pci_request_selected_regions(pdev,
+						   1 << index, "vfio-pci");
+		if (ret)
+			return ret;
+
+		vdev->barmap[index] = pci_iomap(pdev, index, 0);
+	}
+
+	vma->vm_private_data = vdev;
+	vma->vm_flags |= (VM_IO | VM_RESERVED);
+	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+
+	phys = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
+
+	return remap_pfn_range(vma, vma->vm_start, phys,
+			       req_len, vma->vm_page_prot);
+}
+
+static const struct vfio_device_ops vfio_pci_ops = {
+	.name		= "vfio-pci",
+	.open		= vfio_pci_open,
+	.release	= vfio_pci_release,
+	.ioctl		= vfio_pci_ioctl,
+	.read		= vfio_pci_read,
+	.write		= vfio_pci_write,
+	.mmap		= vfio_pci_mmap,
+};
+
+static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
+{
+	u8 type;
+	struct vfio_pci_device *vdev;
+	struct iommu_group *group;
+	int ret;
+
+	pci_read_config_byte(pdev, PCI_HEADER_TYPE, &type);
+	if ((type & 0x7f) != PCI_HEADER_TYPE_NORMAL)
+		return -EINVAL;
+
+	group = iommu_group_get(&pdev->dev);
+	if (!group)
+		return -EINVAL;
+
+	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
+	if (!vdev) {
+		iommu_group_put(group);
+		return -ENOMEM;
+	}
+
+	vdev->pdev = pdev;
+	vdev->irq_type = VFIO_PCI_NUM_IRQS;
+	mutex_init(&vdev->igate);
+	spin_lock_init(&vdev->irqlock);
+	atomic_set(&vdev->refcnt, 0);
+
+	ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
+	if (ret) {
+		iommu_group_put(group);
+		kfree(vdev);
+	}
+
+	return ret;
+}
+
+static void vfio_pci_remove(struct pci_dev *pdev)
+{
+	struct vfio_pci_device *vdev;
+
+	vdev = vfio_del_group_dev(&pdev->dev);
+	if (!vdev)
+		return;
+
+	iommu_group_put(pdev->dev.iommu_group);
+	kfree(vdev);
+}
+
+static struct pci_driver vfio_pci_driver = {
+	.name		= "vfio-pci",
+	.id_table	= NULL, /* only dynamic ids */
+	.probe		= vfio_pci_probe,
+	.remove		= vfio_pci_remove,
+};
+
+void __exit vfio_pci_cleanup(void)
+{
+	pci_unregister_driver(&vfio_pci_driver);
+	vfio_pci_virqfd_exit();
+	vfio_pci_uninit_perm_bits();
+}
+
+int __init vfio_pci_init(void)
+{
+	int ret;
+
+	/* Allocate shared config space permission data used by all devices */
+	ret = vfio_pci_init_perm_bits();
+	if (ret)
+		return ret;
+
+	/* Start the virqfd cleanup handler */
+	ret = vfio_pci_virqfd_init();
+	if (ret)
+		goto out_virqfd;
+
+	/* Register and scan for devices */
+	ret = pci_register_driver(&vfio_pci_driver);
+	if (ret)
+		goto out_driver;
+
+	return 0;
+
+out_driver:
+	vfio_pci_virqfd_exit();
+out_virqfd:
+	vfio_pci_uninit_perm_bits();
+	return ret;
+}
+
+module_init(vfio_pci_init);
+module_exit(vfio_pci_cleanup);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
new file mode 100644
index 0000000..a909433
--- /dev/null
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -0,0 +1,1527 @@
+/*
+ * VFIO PCI config space virtualization
+ *
+ * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
+ *     Author: Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Derived from original vfio:
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ * Author: Tom Lyon, pugs@cisco.com
+ */
+
+/*
+ * This code handles reading and writing of PCI configuration registers.
+ * This is hairy because we want to allow a lot of flexibility to the
+ * user driver, but cannot trust it with all of the config fields.
+ * Tables determine which fields can be read and written, as well as
+ * which fields are 'virtualized' - special actions and translations to
+ * make it appear to the user that they have control, when in fact things
+ * must be negotiated with the underlying OS.
+ */
+
+#include <linux/fs.h>
+#include <linux/pci.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+
+#include "vfio_pci_private.h"
+
+#define PCI_CFG_SPACE_SIZE	256
+
+/* Useful "pseudo" capabilities */
+#define PCI_CAP_ID_BASIC	0
+#define PCI_CAP_ID_INVALID	0xFF
+
+#define is_bar(offset)	\
+	((offset >= PCI_BASE_ADDRESS_0 && offset < PCI_BASE_ADDRESS_5 + 4) || \
+	 (offset >= PCI_ROM_ADDRESS && offset < PCI_ROM_ADDRESS + 4))
+
+/*
+ * Lengths of PCI Config Capabilities
+ *   0: Removed from the user visible capability list
+ *   FF: Variable length
+ */
+static u8 pci_cap_length[] = {
+	[PCI_CAP_ID_BASIC]	= PCI_STD_HEADER_SIZEOF, /* pci config header */
+	[PCI_CAP_ID_PM]		= PCI_PM_SIZEOF,
+	[PCI_CAP_ID_AGP]	= PCI_AGP_SIZEOF,
+	[PCI_CAP_ID_VPD]	= PCI_CAP_VPD_SIZEOF,
+	[PCI_CAP_ID_SLOTID]	= 0,		/* bridge - don't care */
+	[PCI_CAP_ID_MSI]	= 0xFF,		/* 10, 14, 20, or 24 */
+	[PCI_CAP_ID_CHSWP]	= 0,		/* cpci - not yet */
+	[PCI_CAP_ID_PCIX]	= 0xFF,		/* 8 or 24 */
+	[PCI_CAP_ID_HT]		= 0xFF,		/* hypertransport */
+	[PCI_CAP_ID_VNDR]	= 0xFF,		/* variable */
+	[PCI_CAP_ID_DBG]	= 0,		/* debug - don't care */
+	[PCI_CAP_ID_CCRC]	= 0,		/* cpci - not yet */
+	[PCI_CAP_ID_SHPC]	= 0,		/* hotswap - not yet */
+	[PCI_CAP_ID_SSVID]	= 0,		/* bridge - don't care */
+	[PCI_CAP_ID_AGP3]	= 0,		/* AGP8x - not yet */
+	[PCI_CAP_ID_SECDEV]	= 0,		/* secure device not yet */
+	[PCI_CAP_ID_EXP]	= 0xFF,		/* 20 or 44 */
+	[PCI_CAP_ID_MSIX]	= PCI_CAP_MSIX_SIZEOF,
+	[PCI_CAP_ID_SATA]	= 0xFF,
+	[PCI_CAP_ID_AF]		= PCI_CAP_AF_SIZEOF,
+};
+
+/*
+ * Lengths of PCIe/PCI-X Extended Config Capabilities
+ *   0: Removed or masked from the user visible capability list
+ *   FF: Variable length
+ */
+static u16 pci_ext_cap_length[] = {
+	[PCI_EXT_CAP_ID_ERR]	=	PCI_ERR_ROOT_COMMAND,
+	[PCI_EXT_CAP_ID_VC]	=	0xFF,
+	[PCI_EXT_CAP_ID_DSN]	=	PCI_EXT_CAP_DSN_SIZEOF,
+	[PCI_EXT_CAP_ID_PWR]	=	PCI_EXT_CAP_PWR_SIZEOF,
+	[PCI_EXT_CAP_ID_RCLD]	=	0,	/* root only - don't care */
+	[PCI_EXT_CAP_ID_RCILC]	=	0,	/* root only - don't care */
+	[PCI_EXT_CAP_ID_RCEC]	=	0,	/* root only - don't care */
+	[PCI_EXT_CAP_ID_MFVC]	=	0xFF,
+	[PCI_EXT_CAP_ID_VC9]	=	0xFF,	/* same as CAP_ID_VC */
+	[PCI_EXT_CAP_ID_RCRB]	=	0,	/* root only - don't care */
+	[PCI_EXT_CAP_ID_VNDR]	=	0xFF,
+	[PCI_EXT_CAP_ID_CAC]	=	0,	/* obsolete */
+	[PCI_EXT_CAP_ID_ACS]	=	0xFF,
+	[PCI_EXT_CAP_ID_ARI]	=	PCI_EXT_CAP_ARI_SIZEOF,
+	[PCI_EXT_CAP_ID_ATS]	=	PCI_EXT_CAP_ATS_SIZEOF,
+	[PCI_EXT_CAP_ID_SRIOV]	=	PCI_EXT_CAP_SRIOV_SIZEOF,
+	[PCI_EXT_CAP_ID_MRIOV]	=	0,	/* not yet */
+	[PCI_EXT_CAP_ID_MCAST]	=	PCI_EXT_CAP_MCAST_ENDPOINT_SIZEOF,
+	[PCI_EXT_CAP_ID_PRI]	=	PCI_EXT_CAP_PRI_SIZEOF,
+	[PCI_EXT_CAP_ID_AMD_XXX] =	0,	/* not yet */
+	[PCI_EXT_CAP_ID_REBAR]	=	0xFF,
+	[PCI_EXT_CAP_ID_DPA]	=	0xFF,
+	[PCI_EXT_CAP_ID_TPH]	=	0xFF,
+	[PCI_EXT_CAP_ID_LTR]	=	PCI_EXT_CAP_LTR_SIZEOF,
+	[PCI_EXT_CAP_ID_SECPCI]	=	0,	/* not yet */
+	[PCI_EXT_CAP_ID_PMUX]	=	0,	/* not yet */
+	[PCI_EXT_CAP_ID_PASID]	=	0,	/* not yet */
+};
+
+/*
+ * Read/Write Permission Bits - one bit for each bit in capability
+ * Any field can be read if it exists, but what is read depends on
+ * whether the field is 'virtualized' or just passed through to the
+ * hardware.  Any virtualized field is also virtualized for writes.
+ * Writes are only permitted if they have a 1 bit here.
+ */
+struct perm_bits {
+	u8	*virt;		/* read/write virtual data, not hw */
+	u8	*write;		/* writeable bits */
+	int	(*readfn)(struct vfio_pci_device *vdev, int pos, int count,
+			  struct perm_bits *perm, int offset, u32 *val);
+	int	(*writefn)(struct vfio_pci_device *vdev, int pos, int count,
+			   struct perm_bits *perm, int offset, u32 val);
+};
+
+#define	NO_VIRT		0
+#define	ALL_VIRT	0xFFFFFFFFU
+#define	NO_WRITE	0
+#define	ALL_WRITE	0xFFFFFFFFU
+
+static int vfio_user_config_read(struct pci_dev *pdev, int offset,
+				 u32 *val, int count)
+{
+	int ret = -EINVAL;
+
+	switch (count) {
+	case 1:
+		ret = pci_user_read_config_byte(pdev, offset, (u8 *)val);
+		break;
+	case 2:
+		ret = pci_user_read_config_word(pdev, offset, (u16 *)val);
+		*val = cpu_to_le16(*val);
+		break;
+	case 4:
+		ret = pci_user_read_config_dword(pdev, offset, val);
+		*val = cpu_to_le32(*val);
+		break;
+	}
+
+	return pcibios_err_to_errno(ret);
+}
+
+static int vfio_user_config_write(struct pci_dev *pdev, int offset,
+				  u32 val, int count)
+{
+	int ret = -EINVAL;
+
+	switch (count) {
+	case 1:
+		ret = pci_user_write_config_byte(pdev, offset, val);
+		break;
+	case 2:
+		ret = pci_user_write_config_word(pdev, offset, val);
+		break;
+	case 4:
+		ret = pci_user_write_config_dword(pdev, offset, val);
+		break;
+	}
+
+	return pcibios_err_to_errno(ret);
+}
+
+static int vfio_default_config_read(struct vfio_pci_device *vdev, int pos,
+				    int count, struct perm_bits *perm,
+				    int offset, u32 *val)
+{
+	u32 virt = 0;
+
+	memcpy(val, vdev->vconfig + pos, count);
+
+	memcpy(&virt, perm->virt + offset, count);
+
+	/* Any non-virtualized bits? */
+	if (cpu_to_le32(~0U >> (32 - (count * 8))) != virt) {
+		struct pci_dev *pdev = vdev->pdev;
+		u32 phys_val = 0;
+		int ret;
+
+		ret = vfio_user_config_read(pdev, pos, &phys_val, count);
+		if (ret)
+			return ret;
+
+		*val = (phys_val & ~virt) | (*val & virt);
+	}
+
+	return count;
+}
+
+static int vfio_default_config_write(struct vfio_pci_device *vdev, int pos,
+				     int count, struct perm_bits *perm,
+				     int offset, u32 val)
+{
+	u32 virt = 0, write = 0;
+
+	memcpy(&write, perm->write + offset, count);
+
+	if (!write)
+		return count; /* drop, no writable bits */
+
+	memcpy(&virt, perm->virt + offset, count);
+
+	/* Virtualized and writable bits go to vconfig */
+	if (write & virt) {
+		u32 virt_val = 0;
+
+		memcpy(&virt_val, vdev->vconfig + pos, count);
+
+		virt_val &= ~(write & virt);
+		virt_val |= (val & (write & virt));
+
+		memcpy(vdev->vconfig + pos, &virt_val, count);
+	}
+
+	/* Non-virtualized and writable bits go to hardware */
+	if (write & ~virt) {
+		struct pci_dev *pdev = vdev->pdev;
+		u32 phys_val = 0;
+		int ret;
+
+		ret = vfio_user_config_read(pdev, pos, &phys_val, count);
+		if (ret)
+			return ret;
+
+		phys_val &= ~(write & ~virt);
+		phys_val |= (val & (write & ~virt));
+
+		ret = vfio_user_config_write(pdev, pos, phys_val, count);
+		if (ret)
+			return ret;
+	}
+
+	return count;
+}
+
+/* Allow direct read from hardware, except for capability next pointer */
+static int vfio_direct_config_read(struct vfio_pci_device *vdev, int pos,
+				   int count, struct perm_bits *perm,
+				   int offset, u32 *val)
+{
+	int ret;
+
+	ret = vfio_user_config_read(vdev->pdev, pos, val, count);
+	if (ret)
+		return ret;
+
+	if (pos >= PCI_CFG_SPACE_SIZE) { /* Extended cap header mangling */
+		if (offset < 4)
+			memcpy(val, vdev->vconfig + pos, count);
+	} else if (pos >= PCI_STD_HEADER_SIZEOF) { /* Std cap mangling */
+		if (offset == PCI_CAP_LIST_ID && count > 1)
+			memcpy(val, vdev->vconfig + pos,
+			       min(PCI_CAP_FLAGS, count));
+		else if (offset == PCI_CAP_LIST_NEXT)
+			memcpy(val, vdev->vconfig + pos, 1);
+	}
+
+	return count;
+}
+
+static int vfio_direct_config_write(struct vfio_pci_device *vdev, int pos,
+				    int count, struct perm_bits *perm,
+				    int offset, u32 val)
+{
+	int ret;
+
+	ret = vfio_user_config_write(vdev->pdev, pos, val, count);
+	if (ret)
+		return ret;
+
+	return count;
+}
+
+/* Default all regions to read-only, no-virtualization */
+static struct perm_bits cap_perms[PCI_CAP_ID_MAX + 1] = {
+	[0 ... PCI_CAP_ID_MAX] = { .readfn = vfio_direct_config_read }
+};
+static struct perm_bits ecap_perms[PCI_EXT_CAP_ID_MAX + 1] = {
+	[0 ... PCI_EXT_CAP_ID_MAX] = { .readfn = vfio_direct_config_read }
+};
+
+static void free_perm_bits(struct perm_bits *perm)
+{
+	kfree(perm->virt);
+	kfree(perm->write);
+	perm->virt = NULL;
+	perm->write = NULL;
+}
+
+static int alloc_perm_bits(struct perm_bits *perm, int size)
+{
+	/*
+	 * Round up all permission bits to the next dword, this lets us
+	 * ignore whether a read/write exceeds the defined capability
+	 * structure.  We can do this because:
+	 *  - Standard config space is already dword aligned
+	 *  - Capabilities are all dword aligned (bits 0:1 of next reserved)
+	 *  - Express capabilities defined as dword aligned
+	 */
+	size = round_up(size, 4);
+
+	/*
+	 * Zero state is
+	 * - All Readable, None Writeable, None Virtualized
+	 */
+	perm->virt = kzalloc(size, GFP_KERNEL);
+	perm->write = kzalloc(size, GFP_KERNEL);
+	if (!perm->virt || !perm->write) {
+		free_perm_bits(perm);
+		return -ENOMEM;
+	}
+
+	perm->readfn = vfio_default_config_read;
+	perm->writefn = vfio_default_config_write;
+
+	return 0;
+}
+
+/*
+ * Helper functions for filling in permission tables
+ */
+static inline void p_setb(struct perm_bits *p, int off, u8 virt, u8 write)
+{
+	p->virt[off] = virt;
+	p->write[off] = write;
+}
+
+/* Handle endian-ness - pci and tables are little-endian */
+static inline void p_setw(struct perm_bits *p, int off, u16 virt, u16 write)
+{
+	*(u16 *)(&p->virt[off]) = cpu_to_le16(virt);
+	*(u16 *)(&p->write[off]) = cpu_to_le16(write);
+}
+
+/* Handle endian-ness - pci and tables are little-endian */
+static inline void p_setd(struct perm_bits *p, int off, u32 virt, u32 write)
+{
+	*(u32 *)(&p->virt[off]) = cpu_to_le32(virt);
+	*(u32 *)(&p->write[off]) = cpu_to_le32(write);
+}
+
+/*
+ * Restore the *real* BARs after we detect a FLR or backdoor reset.
+ * (backdoor = some device specific technique that we didn't catch)
+ */
+static void vfio_bar_restore(struct vfio_pci_device *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	u32 *rbar = vdev->rbar;
+	int i;
+
+	if (pdev->is_virtfn)
+		return;
+
+	printk(KERN_INFO "%s: %s reset recovery - restoring bars\n",
+	       __func__, dev_name(&pdev->dev));
+
+	for (i = PCI_BASE_ADDRESS_0; i <= PCI_BASE_ADDRESS_5; i += 4, rbar++)
+		pci_user_write_config_dword(pdev, i, *rbar);
+
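+	/* rbar[6] holds the saved ROM BAR (see vfio_config_init) */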
+	pci_user_write_config_dword(pdev, PCI_ROM_ADDRESS, *rbar);
+}
+
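+/*
+ * e.g. a 64-bit prefetchable memory BAR reports
+ * PCI_BASE_ADDRESS_MEM_PREFETCH | PCI_BASE_ADDRESS_MEM_TYPE_64 (0x0c)
+ * in its low flag bits, while an I/O BAR reports only
+ * PCI_BASE_ADDRESS_SPACE_IO.
+ */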
+static u32 vfio_generate_bar_flags(struct pci_dev *pdev, int bar)
+{
+	unsigned long flags = pci_resource_flags(pdev, bar);
+	u32 val;
+
+	if (flags & IORESOURCE_IO)
+		return cpu_to_le32(PCI_BASE_ADDRESS_SPACE_IO);
+
+	val = PCI_BASE_ADDRESS_SPACE_MEMORY;
+
+	if (flags & IORESOURCE_PREFETCH)
+		val |= PCI_BASE_ADDRESS_MEM_PREFETCH;
+
+	if (flags & IORESOURCE_MEM_64)
+		val |= PCI_BASE_ADDRESS_MEM_TYPE_64;
+
+	return cpu_to_le32(val);
+}
+
+/*
+ * Pretend we're hardware and tweak the values of the *virtual* PCI BARs
+ * to reflect the hardware capabilities.  This implements BAR sizing.
+ */
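+/*
+ * e.g. for a 4K memory BAR, mask = ~(4096 - 1) = 0xfffff000; when the
+ * user writes 0xffffffff to the virtual BAR and reads it back, they
+ * see 0xfffff000 plus the flag bits and compute ~(val & ~0xf) + 1 =
+ * 0x1000, exactly as with real hardware.
+ */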
+static void vfio_bar_fixup(struct vfio_pci_device *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	int i;
+	u32 *bar;
+	u64 mask;
+
+	bar = (u32 *)&vdev->vconfig[PCI_BASE_ADDRESS_0];
+
+	for (i = PCI_STD_RESOURCES; i <= PCI_STD_RESOURCE_END; i++, bar++) {
+		if (!pci_resource_start(pdev, i)) {
+			*bar = 0; /* Unmapped by host = unimplemented to user */
+			continue;
+		}
+
+		mask = ~(pci_resource_len(pdev, i) - 1);
+
+		*bar &= cpu_to_le32((u32)mask);
+		*bar |= vfio_generate_bar_flags(pdev, i);
+
+		if (*bar & cpu_to_le32(PCI_BASE_ADDRESS_MEM_TYPE_64)) {
+			bar++;
+			*bar &= cpu_to_le32((u32)(mask >> 32));
+			i++;
+		}
+	}
+
+	bar = (u32 *)&vdev->vconfig[PCI_ROM_ADDRESS];
+
+	/*
+	 * NB. we expose the actual BAR size here, regardless of whether
+	 * we can read it.  When we report the REGION_INFO for the ROM
+	 * we report what PCI tells us is the actual ROM size.
+	 */
+	if (pci_resource_start(pdev, PCI_ROM_RESOURCE)) {
+		mask = ~(pci_resource_len(pdev, PCI_ROM_RESOURCE) - 1);
+		mask |= PCI_ROM_ADDRESS_ENABLE;
+		*bar &= cpu_to_le32((u32)mask);
+	} else
+		*bar = 0;
+
+	vdev->bardirty = false;
+}
+
+static int vfio_basic_config_read(struct vfio_pci_device *vdev, int pos,
+				  int count, struct perm_bits *perm,
+				  int offset, u32 *val)
+{
+	if (is_bar(offset)) /* pos == offset for basic config */
+		vfio_bar_fixup(vdev);
+
+	count = vfio_default_config_read(vdev, pos, count, perm, offset, val);
+
+	/* Mask in virtual memory enable for SR-IOV devices */
+	if (offset == PCI_COMMAND && vdev->pdev->is_virtfn) {
+		u16 cmd = *(u16 *)&vdev->vconfig[PCI_COMMAND];
+		*val |= cmd & cpu_to_le16(PCI_COMMAND_MEMORY);
+	}
+
+	return count;
+}
+
+static int vfio_basic_config_write(struct vfio_pci_device *vdev, int pos,
+				   int count, struct perm_bits *perm,
+				   int offset, u32 val)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	u16 phys_cmd, *virt_cmd, new_cmd = 0;
+	int ret;
+
+	virt_cmd = (u16 *)&vdev->vconfig[PCI_COMMAND];
+
+	if (offset == PCI_COMMAND) {
+		bool phys_mem, virt_mem, new_mem, phys_io, virt_io, new_io;
+
+		ret = pci_user_read_config_word(pdev, PCI_COMMAND, &phys_cmd);
+		if (ret)
+			return pcibios_err_to_errno(ret);
+
+		switch (count) {
+		case 1:
+			new_cmd = val;
+			break;
+		case 2:
+			new_cmd = le16_to_cpu(val);
+			break;
+		case 4:
+			new_cmd = (u16)le32_to_cpu(val);
+			break;
+		}
+
+		phys_mem = !!(phys_cmd & PCI_COMMAND_MEMORY);
+		virt_mem = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_MEMORY);
+		new_mem = !!(new_cmd & PCI_COMMAND_MEMORY);
+
+		phys_io = !!(phys_cmd & PCI_COMMAND_IO);
+		virt_io = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_IO);
+		new_io = !!(new_cmd & PCI_COMMAND_IO);
+
+		/*
+		 * If the user is writing mem/io enable (new_mem/io) and we
+		 * think it's already enabled (virt_mem/io), but the hardware
+		 * shows it disabled (phys_mem/io), then the device has
+		 * undergone some kind of backdoor reset and needs to be
+		 * restored before we allow it to enable the bars.
+		 * SR-IOV devices will trigger this, but we catch them later
+		 */
+		if ((new_mem && virt_mem && !phys_mem) ||
+		    (new_io && virt_io && !phys_io))
+			vfio_bar_restore(vdev);
+	}
+
+	count = vfio_default_config_write(vdev, pos, count, perm, offset, val);
+	if (count < 0)
+		return count;
+
+	/*
+	 * Save current memory/io enable bits in vconfig to allow for
+	 * the test above next time.
+	 */
+	if (offset == PCI_COMMAND) {
+		u16 mask = PCI_COMMAND_MEMORY | PCI_COMMAND_IO;
+
+		*virt_cmd &= cpu_to_le16(~mask);
+		*virt_cmd |= new_cmd & cpu_to_le16(mask);
+	}
+
+	/* Emulate INTx disable */
+	if (offset >= PCI_COMMAND && offset <= PCI_COMMAND + 1) {
+		bool virt_intx_disable;
+
+		virt_intx_disable = !!(le16_to_cpu(*virt_cmd) &
+				       PCI_COMMAND_INTX_DISABLE);
+
+		if (virt_intx_disable && !vdev->virq_disabled) {
+			vdev->virq_disabled = true;
+			vfio_pci_intx_mask(vdev);
+		} else if (!virt_intx_disable && vdev->virq_disabled) {
+			vdev->virq_disabled = false;
+			vfio_pci_intx_unmask(vdev);
+		}
+	}
+
+	if (is_bar(offset))
+		vdev->bardirty = true;
+
+	return count;
+}
+
+/* Permissions for the Basic PCI Header */
+static int __init init_pci_cap_basic_perm(struct perm_bits *perm)
+{
+	if (alloc_perm_bits(perm, PCI_STD_HEADER_SIZEOF))
+		return -ENOMEM;
+
+	perm->readfn = vfio_basic_config_read;
+	perm->writefn = vfio_basic_config_write;
+
+	/* Virtualized for SR-IOV functions, which just return 0xFFFF */
+	p_setw(perm, PCI_VENDOR_ID, (u16)ALL_VIRT, NO_WRITE);
+	p_setw(perm, PCI_DEVICE_ID, (u16)ALL_VIRT, NO_WRITE);
+
+	/*
+	 * Virtualize INTx disable, we use it internally for interrupt
+	 * control and can emulate it for non-PCI 2.3 devices.
+	 */
+	p_setw(perm, PCI_COMMAND, PCI_COMMAND_INTX_DISABLE, (u16)ALL_WRITE);
+
+	/* Virtualize capability list, we might want to skip/disable */
+	p_setw(perm, PCI_STATUS, PCI_STATUS_CAP_LIST, NO_WRITE);
+
+	/* No harm to write */
+	p_setb(perm, PCI_CACHE_LINE_SIZE, NO_VIRT, (u8)ALL_WRITE);
+	p_setb(perm, PCI_LATENCY_TIMER, NO_VIRT, (u8)ALL_WRITE);
+	p_setb(perm, PCI_BIST, NO_VIRT, (u8)ALL_WRITE);
+
+	/* Virtualize all bars, can't touch the real ones */
+	p_setd(perm, PCI_BASE_ADDRESS_0, ALL_VIRT, ALL_WRITE);
+	p_setd(perm, PCI_BASE_ADDRESS_1, ALL_VIRT, ALL_WRITE);
+	p_setd(perm, PCI_BASE_ADDRESS_2, ALL_VIRT, ALL_WRITE);
+	p_setd(perm, PCI_BASE_ADDRESS_3, ALL_VIRT, ALL_WRITE);
+	p_setd(perm, PCI_BASE_ADDRESS_4, ALL_VIRT, ALL_WRITE);
+	p_setd(perm, PCI_BASE_ADDRESS_5, ALL_VIRT, ALL_WRITE);
+	p_setd(perm, PCI_ROM_ADDRESS, ALL_VIRT, ALL_WRITE);
+
+	/* Allow us to adjust capability chain */
+	p_setb(perm, PCI_CAPABILITY_LIST, (u8)ALL_VIRT, NO_WRITE);
+
+	/* Sometimes used by sw, just virtualize */
+	p_setb(perm, PCI_INTERRUPT_LINE, (u8)ALL_VIRT, (u8)ALL_WRITE);
+	return 0;
+}
+
+/* Permissions for the Power Management capability */
+static int __init init_pci_cap_pm_perm(struct perm_bits *perm)
+{
+	if (alloc_perm_bits(perm, pci_cap_length[PCI_CAP_ID_PM]))
+		return -ENOMEM;
+
+	/*
+	 * We always virtualize the next field so we can remove
+	 * capabilities from the chain if we want to.
+	 */
+	p_setb(perm, PCI_CAP_LIST_NEXT, (u8)ALL_VIRT, NO_WRITE);
+
+	/*
+	 * Power management is defined *per function*,
+	 * so we let the user write this
+	 */
+	p_setd(perm, PCI_PM_CTRL, NO_VIRT, ALL_WRITE);
+	return 0;
+}
+
+/* Permissions for PCI-X capability */
+static int __init init_pci_cap_pcix_perm(struct perm_bits *perm)
+{
+	/* Alloc 24, but only 8 are used in v0 */
+	if (alloc_perm_bits(perm, PCI_CAP_PCIX_SIZEOF_V12))
+		return -ENOMEM;
+
+	p_setb(perm, PCI_CAP_LIST_NEXT, (u8)ALL_VIRT, NO_WRITE);
+
+	p_setw(perm, PCI_X_CMD, NO_VIRT, (u16)ALL_WRITE);
+	p_setd(perm, PCI_X_ECC_CSR, NO_VIRT, ALL_WRITE);
+	return 0;
+}
+
+/* Permissions for PCI Express capability */
+static int __init init_pci_cap_exp_perm(struct perm_bits *perm)
+{
+	/* Alloc larger of two possible sizes */
+	if (alloc_perm_bits(perm, PCI_CAP_EXP_ENDPOINT_SIZEOF_V2))
+		return -ENOMEM;
+
+	p_setb(perm, PCI_CAP_LIST_NEXT, (u8)ALL_VIRT, NO_WRITE);
+
+	/*
+	 * Allow writes to device control fields (includes FLR!)
+	 * but not to devctl_phantom which could confuse IOMMU
+	 * or to the ARI bit in devctl2 which is set at probe time
+	 */
+	p_setw(perm, PCI_EXP_DEVCTL, NO_VIRT, ~PCI_EXP_DEVCTL_PHANTOM);
+	p_setw(perm, PCI_EXP_DEVCTL2, NO_VIRT, ~PCI_EXP_DEVCTL2_ARI);
+	return 0;
+}
+
+/* Permissions for Advanced Function capability */
+static int __init init_pci_cap_af_perm(struct perm_bits *perm)
+{
+	if (alloc_perm_bits(perm, pci_cap_length[PCI_CAP_ID_AF]))
+		return -ENOMEM;
+
+	p_setb(perm, PCI_CAP_LIST_NEXT, (u8)ALL_VIRT, NO_WRITE);
+	p_setb(perm, PCI_AF_CTRL, NO_VIRT, PCI_AF_CTRL_FLR);
+	return 0;
+}
+
+/* Permissions for Advanced Error Reporting extended capability */
+static int __init init_pci_ext_cap_err_perm(struct perm_bits *perm)
+{
+	u32 mask;
+
+	if (alloc_perm_bits(perm, pci_ext_cap_length[PCI_EXT_CAP_ID_ERR]))
+		return -ENOMEM;
+
+	/*
+	 * Virtualize the first dword of all express capabilities
+	 * because it includes the next pointer.  This lets us later
+	 * remove capabilities from the chain if we need to.
+	 */
+	p_setd(perm, 0, ALL_VIRT, NO_WRITE);
+
+	/* Writable bits mask */
+	mask =	PCI_ERR_UNC_TRAIN |		/* Training */
+		PCI_ERR_UNC_DLP |		/* Data Link Protocol */
+		PCI_ERR_UNC_SURPDN |		/* Surprise Down */
+		PCI_ERR_UNC_POISON_TLP |	/* Poisoned TLP */
+		PCI_ERR_UNC_FCP |		/* Flow Control Protocol */
+		PCI_ERR_UNC_COMP_TIME |		/* Completion Timeout */
+		PCI_ERR_UNC_COMP_ABORT |	/* Completer Abort */
+		PCI_ERR_UNC_UNX_COMP |		/* Unexpected Completion */
+		PCI_ERR_UNC_RX_OVER |		/* Receiver Overflow */
+		PCI_ERR_UNC_MALF_TLP |		/* Malformed TLP */
+		PCI_ERR_UNC_ECRC |		/* ECRC Error Status */
+		PCI_ERR_UNC_UNSUP |		/* Unsupported Request */
+		PCI_ERR_UNC_ACSV |		/* ACS Violation */
+		PCI_ERR_UNC_INTN |		/* internal error */
+		PCI_ERR_UNC_MCBTLP |		/* MC blocked TLP */
+		PCI_ERR_UNC_ATOMEG |		/* Atomic egress blocked */
+		PCI_ERR_UNC_TLPPRE;		/* TLP prefix blocked */
+	p_setd(perm, PCI_ERR_UNCOR_STATUS, NO_VIRT, mask);
+	p_setd(perm, PCI_ERR_UNCOR_MASK, NO_VIRT, mask);
+	p_setd(perm, PCI_ERR_UNCOR_SEVER, NO_VIRT, mask);
+
+	mask =	PCI_ERR_COR_RCVR |		/* Receiver Error Status */
+		PCI_ERR_COR_BAD_TLP |		/* Bad TLP Status */
+		PCI_ERR_COR_BAD_DLLP |		/* Bad DLLP Status */
+		PCI_ERR_COR_REP_ROLL |		/* REPLAY_NUM Rollover */
+		PCI_ERR_COR_REP_TIMER |		/* Replay Timer Timeout */
+		PCI_ERR_COR_ADV_NFAT |		/* Advisory Non-Fatal */
+		PCI_ERR_COR_INTERNAL |		/* Corrected Internal */
+		PCI_ERR_COR_LOG_OVER;		/* Header Log Overflow */
+	p_setd(perm, PCI_ERR_COR_STATUS, NO_VIRT, mask);
+	p_setd(perm, PCI_ERR_COR_MASK, NO_VIRT, mask);
+
+	mask =	PCI_ERR_CAP_ECRC_GENE |		/* ECRC Generation Enable */
+		PCI_ERR_CAP_ECRC_CHKE;		/* ECRC Check Enable */
+	p_setd(perm, PCI_ERR_CAP, NO_VIRT, mask);
+	return 0;
+}
+
+/* Permissions for Power Budgeting extended capability */
+static int __init init_pci_ext_cap_pwr_perm(struct perm_bits *perm)
+{
+	if (alloc_perm_bits(perm, pci_ext_cap_length[PCI_EXT_CAP_ID_PWR]))
+		return -ENOMEM;
+
+	p_setd(perm, 0, ALL_VIRT, NO_WRITE);
+
+	/* Writing the data selector is OK, the info is still read-only */
+	p_setb(perm, PCI_PWR_DATA, NO_VIRT, (u8)ALL_WRITE);
+	return 0;
+}
+
+/*
+ * Initialize and release the shared permission tables
+ */
+void vfio_pci_uninit_perm_bits(void)
+{
+	free_perm_bits(&cap_perms[PCI_CAP_ID_BASIC]);
+
+	free_perm_bits(&cap_perms[PCI_CAP_ID_PM]);
+	free_perm_bits(&cap_perms[PCI_CAP_ID_PCIX]);
+	free_perm_bits(&cap_perms[PCI_CAP_ID_EXP]);
+	free_perm_bits(&cap_perms[PCI_CAP_ID_AF]);
+
+	free_perm_bits(&ecap_perms[PCI_EXT_CAP_ID_ERR]);
+	free_perm_bits(&ecap_perms[PCI_EXT_CAP_ID_PWR]);
+}
+
+int __init vfio_pci_init_perm_bits(void)
+{
+	int ret;
+
+	/* Basic config space */
+	ret = init_pci_cap_basic_perm(&cap_perms[PCI_CAP_ID_BASIC]);
+
+	/* Capabilities */
+	ret |= init_pci_cap_pm_perm(&cap_perms[PCI_CAP_ID_PM]);
+	cap_perms[PCI_CAP_ID_VPD].writefn = vfio_direct_config_write;
+	ret |= init_pci_cap_pcix_perm(&cap_perms[PCI_CAP_ID_PCIX]);
+	cap_perms[PCI_CAP_ID_VNDR].writefn = vfio_direct_config_write;
+	ret |= init_pci_cap_exp_perm(&cap_perms[PCI_CAP_ID_EXP]);
+	ret |= init_pci_cap_af_perm(&cap_perms[PCI_CAP_ID_AF]);
+
+	/* Extended capabilities */
+	ret |= init_pci_ext_cap_err_perm(&ecap_perms[PCI_EXT_CAP_ID_ERR]);
+	ret |= init_pci_ext_cap_pwr_perm(&ecap_perms[PCI_EXT_CAP_ID_PWR]);
+	ecap_perms[PCI_EXT_CAP_ID_VNDR].writefn = vfio_direct_config_write;
+
+	if (ret)
+		vfio_pci_uninit_perm_bits();
+
+	return ret;
+}
+
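+/*
+ * Walk back through the config map to the first dword of the
+ * capability containing pos, e.g. any pos within an MSI capability
+ * starting at 0x50 returns 0x50.
+ */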
+static int vfio_find_cap_start(struct vfio_pci_device *vdev, int pos)
+{
+	u8 cap;
+	int base = (pos >= PCI_CFG_SPACE_SIZE) ? PCI_CFG_SPACE_SIZE :
+						 PCI_STD_HEADER_SIZEOF;
+	base /= 4;
+	pos /= 4;
+
+	cap = vdev->pci_config_map[pos];
+
+	if (cap == PCI_CAP_ID_BASIC)
+		return 0;
+
+	/* XXX Can we have two abutting capabilities of the same type? */
+	while (pos - 1 >= base && vdev->pci_config_map[pos - 1] == cap)
+		pos--;
+
+	return pos * 4;
+}
+
+static int vfio_msi_config_read(struct vfio_pci_device *vdev, int pos,
+				int count, struct perm_bits *perm,
+				int offset, u32 *val)
+{
+	/* Update max available queue size from msi_qmax */
+	if (offset <= PCI_MSI_FLAGS && offset + count >= PCI_MSI_FLAGS) {
+		u16 *flags;
+		int start;
+
+		start = vfio_find_cap_start(vdev, pos);
+
+		flags = (u16 *)&vdev->vconfig[start];
+
+		*flags &= cpu_to_le16(~PCI_MSI_FLAGS_QMASK);
+		*flags |= cpu_to_le16(vdev->msi_qmax << 1);
+	}
+
+	return vfio_default_config_read(vdev, pos, count, perm, offset, val);
+}
+
+static int vfio_msi_config_write(struct vfio_pci_device *vdev, int pos,
+				 int count, struct perm_bits *perm,
+				 int offset, u32 val)
+{
+	count = vfio_default_config_write(vdev, pos, count, perm, offset, val);
+	if (count < 0)
+		return count;
+
+	/* Fixup and write configured queue size and enable to hardware */
+	if (offset <= PCI_MSI_FLAGS && offset + count >= PCI_MSI_FLAGS) {
+		u16 *pflags, flags;
+		int start, ret;
+
+		start = vfio_find_cap_start(vdev, pos);
+
+		pflags = (u16 *)&vdev->vconfig[start + PCI_MSI_FLAGS];
+
+		flags = le16_to_cpu(*pflags);
+
+		/* MSI is enabled via ioctl */
+		if  (!is_msi(vdev))
+			flags &= ~PCI_MSI_FLAGS_ENABLE;
+
+		/* Check queue size */
+		if ((flags & PCI_MSI_FLAGS_QSIZE) >> 4 > vdev->msi_qmax) {
+			flags &= ~PCI_MSI_FLAGS_QSIZE;
+			flags |= vdev->msi_qmax << 4;
+		}
+
+		/* Write back to virt and to hardware */
+		*pflags = cpu_to_le16(flags);
+		ret = pci_user_write_config_word(vdev->pdev,
+						 start + PCI_MSI_FLAGS,
+						 flags);
+		if (ret)
+			return pcibios_err_to_errno(ret);
+	}
+
+	return count;
+}
+
+/*
+ * MSI determination is per-device, so this routine gets used beyond
+ * initialization time. Don't add __init
+ */
+static int init_pci_cap_msi_perm(struct perm_bits *perm, int len, u16 flags)
+{
+	if (alloc_perm_bits(perm, len))
+		return -ENOMEM;
+
+	perm->readfn = vfio_msi_config_read;
+	perm->writefn = vfio_msi_config_write;
+
+	p_setb(perm, PCI_CAP_LIST_NEXT, (u8)ALL_VIRT, NO_WRITE);
+
+	/*
+	 * The upper byte of the control register is reserved,
+	 * just setup the lower byte.
+	 */
+	p_setb(perm, PCI_MSI_FLAGS, (u8)ALL_VIRT, (u8)ALL_WRITE);
+	p_setd(perm, PCI_MSI_ADDRESS_LO, ALL_VIRT, ALL_WRITE);
+	if (flags & PCI_MSI_FLAGS_64BIT) {
+		p_setd(perm, PCI_MSI_ADDRESS_HI, ALL_VIRT, ALL_WRITE);
+		p_setw(perm, PCI_MSI_DATA_64, (u16)ALL_VIRT, (u16)ALL_WRITE);
+		if (flags & PCI_MSI_FLAGS_MASKBIT) {
+			p_setd(perm, PCI_MSI_MASK_64, NO_VIRT, ALL_WRITE);
+			p_setd(perm, PCI_MSI_PENDING_64, NO_VIRT, ALL_WRITE);
+		}
+	} else {
+		p_setw(perm, PCI_MSI_DATA_32, (u16)ALL_VIRT, (u16)ALL_WRITE);
+		if (flags & PCI_MSI_FLAGS_MASKBIT) {
+			p_setd(perm, PCI_MSI_MASK_32, NO_VIRT, ALL_WRITE);
+			p_setd(perm, PCI_MSI_PENDING_32, NO_VIRT, ALL_WRITE);
+		}
+	}
+	return 0;
+}
+
+/* Determine MSI CAP field length; initialize msi_perms on 1st call per vdev */
+static int vfio_msi_cap_len(struct vfio_pci_device *vdev, u8 pos)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	int len, ret;
+	u16 flags;
+
+	ret = pci_read_config_word(pdev, pos + PCI_MSI_FLAGS, &flags);
+	if (ret)
+		return pcibios_err_to_errno(ret);
+
+	len = 10; /* Minimum size */
+	if (flags & PCI_MSI_FLAGS_64BIT)
+		len += 4;
+	if (flags & PCI_MSI_FLAGS_MASKBIT)
+		len += 10;
+
+	if (vdev->msi_perm)
+		return len;
+
+	vdev->msi_perm = kmalloc(sizeof(struct perm_bits), GFP_KERNEL);
+	if (!vdev->msi_perm)
+		return -ENOMEM;
+
+	ret = init_pci_cap_msi_perm(vdev->msi_perm, len, flags);
+	if (ret)
+		return ret;
+
+	return len;
+}
+
+/* Determine extended capability length for VC (2 & 9) and MFVC */
+static int vfio_vc_cap_len(struct vfio_pci_device *vdev, u16 pos)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	u32 tmp;
+	int ret, evcc, phases, vc_arb;
+	int len = PCI_CAP_VC_BASE_SIZEOF;
+
+	ret = pci_read_config_dword(pdev, pos + PCI_VC_PORT_REG1, &tmp);
+	if (ret)
+		return pcibios_err_to_errno(ret);
+
+	evcc = tmp & PCI_VC_REG1_EVCC; /* extended vc count */
+	ret = pci_read_config_dword(pdev, pos + PCI_VC_PORT_REG2, &tmp);
+	if (ret)
+		return pcibios_err_to_errno(ret);
+
+	if (tmp & PCI_VC_REG2_128_PHASE)
+		phases = 128;
+	else if (tmp & PCI_VC_REG2_64_PHASE)
+		phases = 64;
+	else if (tmp & PCI_VC_REG2_32_PHASE)
+		phases = 32;
+	else
+		phases = 0;
+
+	vc_arb = phases * 4;
+
+	/*
+	 * Port arbitration tables are root & switch only;
+	 * function arbitration tables are function 0 only.
+	 * In either case, we'll never let user write them so
+	 * we don't care how big they are
+	 */
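+	/*
+	 * e.g. assuming a 0x10 byte base and 0x0c bytes per VC: evcc = 2
+	 * with 64-phase arbitration gives 0x10 + 3 * 0x0c = 52 bytes,
+	 * rounded up to 64, plus 64 * 4 / 8 = 32 table bytes, 96 total.
+	 */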
+	len += (1 + evcc) * PCI_CAP_VC_PER_VC_SIZEOF;
+	if (vc_arb) {
+		len = round_up(len, 16);
+		len += vc_arb / 8;
+	}
+	return len;
+}
+
+static int vfio_cap_len(struct vfio_pci_device *vdev, u8 cap, u8 pos)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	u16 word;
+	u8 byte;
+	int ret;
+
+	switch (cap) {
+	case PCI_CAP_ID_MSI:
+		return vfio_msi_cap_len(vdev, pos);
+	case PCI_CAP_ID_PCIX:
+		ret = pci_read_config_word(pdev, pos + PCI_X_CMD, &word);
+		if (ret)
+			return pcibios_err_to_errno(ret);
+
+		if (PCI_X_CMD_VERSION(word)) {
+			vdev->extended_caps = true;
+			return PCI_CAP_PCIX_SIZEOF_V12;
+		} else
+			return PCI_CAP_PCIX_SIZEOF_V0;
+	case PCI_CAP_ID_VNDR:
+		/* length follows next field */
+		ret = pci_read_config_byte(pdev, pos + PCI_CAP_FLAGS, &byte);
+		if (ret)
+			return pcibios_err_to_errno(ret);
+
+		return byte;
+	case PCI_CAP_ID_EXP:
+		/* length based on version */
+		ret = pci_read_config_word(pdev, pos + PCI_EXP_FLAGS, &word);
+		if (ret)
+			return pcibios_err_to_errno(ret);
+
+		if ((word & PCI_EXP_FLAGS_VERS) == 1)
+			return PCI_CAP_EXP_ENDPOINT_SIZEOF_V1;
+		else {
+			vdev->extended_caps = true;
+			return PCI_CAP_EXP_ENDPOINT_SIZEOF_V2;
+		}
+	case PCI_CAP_ID_HT:
+		ret = pci_read_config_byte(pdev, pos + 3, &byte);
+		if (ret)
+			return pcibios_err_to_errno(ret);
+
+		return (byte & HT_3BIT_CAP_MASK) ?
+			HT_CAP_SIZEOF_SHORT : HT_CAP_SIZEOF_LONG;
+	case PCI_CAP_ID_SATA:
+		ret = pci_read_config_byte(pdev, pos + PCI_SATA_REGS, &byte);
+		if (ret)
+			return pcibios_err_to_errno(ret);
+
+		byte &= PCI_SATA_REGS_MASK;
+		if (byte == PCI_SATA_REGS_INLINE)
+			return PCI_SATA_SIZEOF_LONG;
+		else
+			return PCI_SATA_SIZEOF_SHORT;
+	default:
+		printk(KERN_WARNING
+		       "%s: %s unknown length for pci cap 0x%x@0x%x\n",
+		       dev_name(&pdev->dev), __func__, cap, pos);
+	}
+
+	return 0;
+}
+
+static int vfio_ext_cap_len(struct vfio_pci_device *vdev, u16 ecap, u16 epos)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	u8 byte;
+	u32 dword;
+	int ret;
+
+	switch (ecap) {
+	case PCI_EXT_CAP_ID_VNDR:
+		ret = pci_read_config_dword(pdev, epos + PCI_VSEC_HDR, &dword);
+		if (ret)
+			return pcibios_err_to_errno(ret);
+
+		return dword >> PCI_VSEC_HDR_LEN_SHIFT;
+	case PCI_EXT_CAP_ID_VC:
+	case PCI_EXT_CAP_ID_VC9:
+	case PCI_EXT_CAP_ID_MFVC:
+		return vfio_vc_cap_len(vdev, epos);
+	case PCI_EXT_CAP_ID_ACS:
+		ret = pci_read_config_byte(pdev, epos + PCI_ACS_CAP, &byte);
+		if (ret)
+			return pcibios_err_to_errno(ret);
+
+		if (byte & PCI_ACS_EC) {
+			int bits;
+
+			ret = pci_read_config_byte(pdev,
+						   epos + PCI_ACS_EGRESS_BITS,
+						   &byte);
+			if (ret)
+				return pcibios_err_to_errno(ret);
+
+			bits = byte ? round_up(byte, 32) : 256;
+			return 8 + (bits / 8);
+		}
+		return 8;
+
+	case PCI_EXT_CAP_ID_REBAR:
+		ret = pci_read_config_byte(pdev, epos + PCI_REBAR_CTRL, &byte);
+		if (ret)
+			return pcibios_err_to_errno(ret);
+
+		byte &= PCI_REBAR_CTRL_NBAR_MASK;
+		byte >>= PCI_REBAR_CTRL_NBAR_SHIFT;
+
+		return 4 + (byte * 8);
+	case PCI_EXT_CAP_ID_DPA:
+		ret = pci_read_config_byte(pdev, epos + PCI_DPA_CAP, &byte);
+		if (ret)
+			return pcibios_err_to_errno(ret);
+
+		byte &= PCI_DPA_CAP_SUBSTATE_MASK;
+		byte = round_up(byte + 1, 4);
+		return PCI_DPA_BASE_SIZEOF + byte;
+	case PCI_EXT_CAP_ID_TPH:
+		ret = pci_read_config_dword(pdev, epos + PCI_TPH_CAP, &dword);
+		if (ret)
+			return pcibios_err_to_errno(ret);
+
+		if ((dword & PCI_TPH_CAP_LOC_MASK) == PCI_TPH_LOC_CAP) {
+			int sts;
+
+			sts = dword & PCI_TPH_CAP_ST_MASK;
+			sts >>= PCI_TPH_CAP_ST_SHIFT;
+			return PCI_TPH_BASE_SIZEOF + round_up(sts * 2, 4);
+		}
+		return PCI_TPH_BASE_SIZEOF;
+	default:
+		printk(KERN_WARNING
+		       "%s: %s unknown length for pci ecap 0x%x@0x%x\n",
+		       dev_name(&pdev->dev), __func__, ecap, epos);
+	}
+
+	return 0;
+}
+
+static int vfio_fill_vconfig_bytes(struct vfio_pci_device *vdev,
+				   int offset, int size)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	int ret = 0;
+
+	/*
+	 * We try to read physical config space in the largest chunks
+	 * we can, assuming that all of the fields support dword access.
+	 * pci_save_state() makes this same assumption and seems to do ok.
+	 */
+	while (size) {
+		int filled;
+
+		if (size >= 4 && !(offset % 4)) {
+			u32 *dword = (u32 *)&vdev->vconfig[offset];
+			ret = pci_read_config_dword(pdev, offset, dword);
+			if (ret)
+				return ret;
+			*dword = cpu_to_le32(*dword);
+			filled = 4;
+		} else if (size >= 2 && !(offset % 2)) {
+			u16 *word = (u16 *)&vdev->vconfig[offset];
+			ret = pci_read_config_word(pdev, offset, word);
+			if (ret)
+				return ret;
+			*word = cpu_to_le16(*word);
+			filled = 2;
+		} else {
+			u8 *byte = &vdev->vconfig[offset];
+			ret = pci_read_config_byte(pdev, offset, byte);
+			if (ret)
+				return ret;
+			filled = 1;
+		}
+
+		offset += filled;
+		size -= filled;
+	}
+
+	return ret;
+}
+
+static int vfio_cap_init(struct vfio_pci_device *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	u8 *map = vdev->pci_config_map;
+	u16 status;
+	u8 pos, *prev, cap;
+	int loops, ret, caps = 0;
+
+	/* Any capabilities? */
+	ret = pci_read_config_word(pdev, PCI_STATUS, &status);
+	if (ret)
+		return ret;
+
+	if (!(status & PCI_STATUS_CAP_LIST))
+		return 0; /* Done */
+
+	ret = pci_read_config_byte(pdev, PCI_CAPABILITY_LIST, &pos);
+	if (ret)
+		return ret;
+
+	/* Mark the previous position in case we want to skip a capability */
+	prev = &vdev->vconfig[PCI_CAPABILITY_LIST];
+
+	/* We can bound our loop, capabilities are dword aligned */
+	loops = (PCI_CFG_SPACE_SIZE - PCI_STD_HEADER_SIZEOF) / PCI_CAP_SIZEOF;
+	while (pos && loops--) {
+		u8 next;
+		int i, len = 0;
+
+		ret = pci_read_config_byte(pdev, pos, &cap);
+		if (ret)
+			return ret;
+
+		ret = pci_read_config_byte(pdev,
+					   pos + PCI_CAP_LIST_NEXT, &next);
+		if (ret)
+			return ret;
+
+		if (cap <= PCI_CAP_ID_MAX) {
+			len = pci_cap_length[cap];
+			if (len == 0xFF) { /* Variable length */
+				len = vfio_cap_len(vdev, cap, pos);
+				if (len < 0)
+					return len;
+			}
+		}
+
+		if (!len) {
+			printk(KERN_INFO "%s: %s hiding cap 0x%x\n",
+			       __func__, dev_name(&pdev->dev), cap);
+			*prev = next;
+			pos = next;
+			continue;
+		}
+
+		/* Sanity check, do we overlap other capabilities? */
+		for (i = 0; i < len; i += 4) {
+			if (likely(map[(pos + i) / 4] == PCI_CAP_ID_INVALID))
+				continue;
+
+			printk(KERN_WARNING
+			       "%s: %s pci config conflict @0x%x, was cap 0x%x now cap 0x%x\n",
+			       __func__, dev_name(&pdev->dev), pos + i,
+			       map[(pos + i) / 4], cap);
+		}
+
+		memset(map + (pos / 4), cap, len / 4);
+		ret = vfio_fill_vconfig_bytes(vdev, pos, len);
+		if (ret)
+			return ret;
+
+		prev = &vdev->vconfig[pos + PCI_CAP_LIST_NEXT];
+		pos = next;
+		caps++;
+	}
+
+	/* If we didn't fill any capabilities, clear the status flag */
+	if (!caps) {
+		u16 *vstatus = (u16 *)&vdev->vconfig[PCI_STATUS];
+		*vstatus &= cpu_to_le16(~PCI_STATUS_CAP_LIST);
+	}
+
+	return 0;
+}
+
+static int vfio_ecap_init(struct vfio_pci_device *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	u8 *map = vdev->pci_config_map;
+	u16 epos;
+	u32 *prev = NULL;
+	int loops, ret, ecaps = 0;
+
+	if (!vdev->extended_caps)
+		return 0;
+
+	epos = PCI_CFG_SPACE_SIZE;
+
+	loops = (pdev->cfg_size - PCI_CFG_SPACE_SIZE) / PCI_CAP_SIZEOF;
+
+	while (loops-- && epos >= PCI_CFG_SPACE_SIZE) {
+		u32 header;
+		u16 ecap;
+		int i, len = 0;
+		bool hidden = false;
+
+		ret = pci_read_config_dword(pdev, epos, &header);
+		if (ret)
+			return ret;
+
+		ecap = PCI_EXT_CAP_ID(header);
+
+		if (ecap <= PCI_EXT_CAP_ID_MAX) {
+			len = pci_ext_cap_length[ecap];
+			if (len == 0xFF) {
+				len = vfio_ext_cap_len(vdev, ecap, epos);
+				if (len < 0)
+					return len;
+			}
+		}
+
+		if (!len) {
+			printk(KERN_INFO "%s: %s hiding ecap 0x%x@0x%x\n",
+			       __func__, dev_name(&pdev->dev), ecap, epos);
+
+			/* If not the first in the chain, we can skip over it */
+			if (prev) {
+				epos = PCI_EXT_CAP_NEXT(header);
+				*prev &= cpu_to_le32(~((u32)0xffc << 20));
+				*prev |= cpu_to_le32((u32)epos << 20);
+				continue;
+			}
+
+			/*
+			 * Otherwise, fill in a placeholder, the direct
+			 * readfn will virtualize this automatically
+			 */
+			len = PCI_CAP_SIZEOF;
+			hidden = true;
+		}
+
+		for (i = 0; i < len; i += 4) {
+			if (likely(map[(epos + i) / 4] == PCI_CAP_ID_INVALID))
+				continue;
+
+			printk(KERN_WARNING
+			       "%s: %s pci config conflict @0x%x, was ecap 0x%x now ecap 0x%x\n",
+			       __func__, dev_name(&pdev->dev),
+			       epos + i, map[(epos + i) / 4], ecap);
+		}
+
+		/*
+		 * Even though ecap is 2 bytes, we're currently a long way
+		 * from exceeding 1-byte capability IDs.  If we ever make it
+		 * up to 0xFF we'll need to grow this into a two-byte map.
+		 */
+		BUILD_BUG_ON(PCI_EXT_CAP_ID_MAX >= PCI_CAP_ID_INVALID);
+
+		memset(map + (epos / 4), ecap, len / 4);
+		ret = vfio_fill_vconfig_bytes(vdev, epos, len);
+		if (ret)
+			return ret;
+
+		/*
+		 * If we're just using this capability to anchor the list,
+		 * hide the real ID.  Only count real ecaps.  XXX PCI spec
+		 * indicates to use cap id = 0, version = 0, next = 0 if
+		 * ecaps are absent, hope users check all the way to next.
+		 */
+		if (hidden)
+			*(u32 *)&vdev->vconfig[epos] &=
+				cpu_to_le32(((u32)0xffc << 20));
+		else
+			ecaps++;
+
+		prev = (u32 *)&vdev->vconfig[epos];
+		epos = PCI_EXT_CAP_NEXT(header);
+	}
+
+	if (!ecaps)
+		*(u32 *)&vdev->vconfig[PCI_CFG_SPACE_SIZE] = 0;
+
+	return 0;
+}
+
+/*
+ * For each device we allocate a pci_config_map that indicates the
+ * capability occupying each dword and thus the struct perm_bits we
+ * use for read and write.  We also allocate a virtualized config
+ * space which tracks reads and writes to bits that we emulate for
+ * the user.  Initial values filled from device.
+ *
+ * Using shared struct perm_bits between all vfio-pci devices saves
+ * us from allocating cfg_size buffers for virt and write for every
+ * device.  We could remove vconfig and allocate individual buffers
+ * for each area requiring emulated bits, but the array of pointers
+ * would be comparable in size (at least for standard config space).
+ */
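+/*
+ * e.g. with only an MSI capability at config offset 0x50,
+ * pci_config_map[0..15] is PCI_CAP_ID_BASIC (the standard header),
+ * the dwords starting at 0x50 / 4 = 20 are PCI_CAP_ID_MSI, and all
+ * remaining entries are PCI_CAP_ID_INVALID.
+ */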
+int vfio_config_init(struct vfio_pci_device *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	u8 *map, *vconfig;
+	int ret;
+
+	/*
+	 * Config space, caps and ecaps are all dword aligned, so we can
+	 * use one byte per dword to record the type.
+	 */
+	map = kmalloc(pdev->cfg_size / 4, GFP_KERNEL);
+	if (!map)
+		return -ENOMEM;
+
+	vconfig = kmalloc(pdev->cfg_size, GFP_KERNEL);
+	if (!vconfig) {
+		kfree(map);
+		return -ENOMEM;
+	}
+
+	vdev->pci_config_map = map;
+	vdev->vconfig = vconfig;
+
+	memset(map, PCI_CAP_ID_BASIC, PCI_STD_HEADER_SIZEOF / 4);
+	memset(map + (PCI_STD_HEADER_SIZEOF / 4), PCI_CAP_ID_INVALID,
+	       (pdev->cfg_size - PCI_STD_HEADER_SIZEOF) / 4);
+
+	ret = vfio_fill_vconfig_bytes(vdev, 0, PCI_STD_HEADER_SIZEOF);
+	if (ret)
+		goto out;
+
+	vdev->bardirty = true;
+
+	/*
+	 * XXX can we just pci_load_saved_state/pci_restore_state?
+	 * may need to rebuild vconfig after that
+	 */
+
+	/* For restore after reset */
+	vdev->rbar[0] = *(u32 *)&vconfig[PCI_BASE_ADDRESS_0];
+	vdev->rbar[1] = *(u32 *)&vconfig[PCI_BASE_ADDRESS_1];
+	vdev->rbar[2] = *(u32 *)&vconfig[PCI_BASE_ADDRESS_2];
+	vdev->rbar[3] = *(u32 *)&vconfig[PCI_BASE_ADDRESS_3];
+	vdev->rbar[4] = *(u32 *)&vconfig[PCI_BASE_ADDRESS_4];
+	vdev->rbar[5] = *(u32 *)&vconfig[PCI_BASE_ADDRESS_5];
+	vdev->rbar[6] = *(u32 *)&vconfig[PCI_ROM_ADDRESS];
+
+	if (pdev->is_virtfn) {
+		*(u16 *)&vconfig[PCI_VENDOR_ID] = cpu_to_le16(pdev->vendor);
+		*(u16 *)&vconfig[PCI_DEVICE_ID] = cpu_to_le16(pdev->device);
+	}
+
+	ret = vfio_cap_init(vdev);
+	if (ret)
+		goto out;
+
+	ret = vfio_ecap_init(vdev);
+	if (ret)
+		goto out;
+
+	return 0;
+
+out:
+	kfree(map);
+	vdev->pci_config_map = NULL;
+	kfree(vconfig);
+	vdev->vconfig = NULL;
+	return pcibios_err_to_errno(ret);
+}
+
+void vfio_config_free(struct vfio_pci_device *vdev)
+{
+	kfree(vdev->vconfig);
+	vdev->vconfig = NULL;
+	kfree(vdev->pci_config_map);
+	vdev->pci_config_map = NULL;
+	kfree(vdev->msi_perm);
+	vdev->msi_perm = NULL;
+}
+
+ssize_t vfio_config_do_rw(struct vfio_pci_device *vdev, char __user *buf,
+			  size_t count, loff_t *ppos, bool iswrite)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	struct perm_bits *perm;
+	u32 val = 0;
+	int cap_start = 0, offset;
+	u8 cap_id;
+
+	if (*ppos < 0 || *ppos + count > pdev->cfg_size)
+		return -EFAULT;
+
+	cap_id = vdev->pci_config_map[*ppos / 4];
+
+	if (cap_id == PCI_CAP_ID_INVALID) {
+		if (iswrite)
+			return count; /* drop */
+
+		/*
+		 * Per PCI spec 3.0, section 6.1, reads from reserved and
+		 * unimplemented registers return 0
+		 */
+		if (copy_to_user(buf, &val, count))
+			return -EFAULT;
+
+		return count;
+	}
+
+	/*
+	 * All capabilities are minimum 4 bytes and aligned on dword
+	 * boundaries.  Since we don't support unaligned accesses, we're
+	 * only ever accessing a single capability.
+	 */
+	if (*ppos >= PCI_CFG_SPACE_SIZE) {
+		WARN_ON(cap_id > PCI_EXT_CAP_ID_MAX);
+
+		perm = &ecap_perms[cap_id];
+		cap_start = vfio_find_cap_start(vdev, *ppos);
+
+	} else {
+		WARN_ON(cap_id > PCI_CAP_ID_MAX);
+
+		perm = &cap_perms[cap_id];
+
+		if (cap_id == PCI_CAP_ID_MSI)
+			perm = vdev->msi_perm;
+
+		if (cap_id > PCI_CAP_ID_BASIC)
+			cap_start = vfio_find_cap_start(vdev, *ppos);
+	}
+
+	WARN_ON(!cap_start && cap_id != PCI_CAP_ID_BASIC);
+	WARN_ON(cap_start > *ppos);
+
+	offset = *ppos - cap_start;
+
+	if (iswrite) {
+		if (perm->writefn) {
+			if (copy_from_user(&val, buf, count))
+				return -EFAULT;
+
+			count = perm->writefn(vdev, *ppos, count,
+					      perm, offset, val);
+		}
+	} else {
+		if (perm->readfn) {
+			count = perm->readfn(vdev, *ppos, count,
+					     perm, offset, &val);
+			if (count < 0)
+				return count;
+		}
+
+		if (copy_to_user(buf, &val, count))
+			return -EFAULT;
+	}
+
+	return count;
+}
+
+ssize_t vfio_pci_config_readwrite(struct vfio_pci_device *vdev,
+				  char __user *buf, size_t count,
+				  loff_t *ppos, bool iswrite)
+{
+	size_t done = 0;
+	int ret = 0;
+	loff_t pos = *ppos;
+
+	pos &= VFIO_PCI_OFFSET_MASK;
+
+	/*
+	 * We want to both keep the access size the caller uses as well as
+	 * support reading large chunks of config space in a single call.
+	 * PCI doesn't support unaligned accesses, so we can safely break
+	 * those apart.
+	 */
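+	/*
+	 * e.g. an 8-byte read at config offset 2 is issued as a 2-byte
+	 * access at 2, a 4-byte access at 4, then a 2-byte access at 8.
+	 */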
+	while (count) {
+		if (count >= 4 && !(pos % 4))
+			ret = vfio_config_do_rw(vdev, buf, 4, &pos, iswrite);
+		else if (count >= 2 && !(pos % 2))
+			ret = vfio_config_do_rw(vdev, buf, 2, &pos, iswrite);
+		else
+			ret = vfio_config_do_rw(vdev, buf, 1, &pos, iswrite);
+
+		if (ret < 0)
+			return ret;
+
+		count -= ret;
+		done += ret;
+		buf += ret;
+		pos += ret;
+	}
+
+	*ppos += done;
+
+	return done;
+}
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
new file mode 100644
index 0000000..2996f37
--- /dev/null
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -0,0 +1,724 @@
+/*
+ * VFIO PCI interrupt handling
+ *
+ * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
+ *     Author: Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Derived from original vfio:
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ * Author: Tom Lyon, pugs@cisco.com
+ */
+
+#include <linux/device.h>
+#include <linux/interrupt.h>
+#include <linux/eventfd.h>
+#include <linux/pci.h>
+#include <linux/file.h>
+#include <linux/poll.h>
+#include <linux/vfio.h>
+#include <linux/wait.h>
+#include <linux/workqueue.h>
+
+#include "vfio_pci_private.h"
+
+/*
+ * IRQfd - generic
+ */
+struct virqfd {
+	struct vfio_pci_device	*vdev;
+	void			*data;
+	struct eventfd_ctx	*eventfd;
+	poll_table		pt;
+	wait_queue_t		wait;
+	struct work_struct	inject;
+	struct work_struct	shutdown;
+	struct virqfd		**pvirqfd;
+};
+
+static struct workqueue_struct *vfio_irqfd_cleanup_wq;
+
+int __init vfio_pci_virqfd_init(void)
+{
+	vfio_irqfd_cleanup_wq =
+		create_singlethread_workqueue("vfio-irqfd-cleanup");
+	if (!vfio_irqfd_cleanup_wq)
+		return -ENOMEM;
+
+	return 0;
+}
+
+void vfio_pci_virqfd_exit(void)
+{
+	destroy_workqueue(vfio_irqfd_cleanup_wq);
+}
+
+static void virqfd_deactivate(struct virqfd *virqfd)
+{
+	queue_work(vfio_irqfd_cleanup_wq, &virqfd->shutdown);
+}
+
+static int virqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
+{
+	struct virqfd *virqfd = container_of(wait, struct virqfd, wait);
+	unsigned long flags = (unsigned long)key;
+
+	if (flags & POLLIN)
+		/* An event has been signaled, inject an interrupt */
+		schedule_work(&virqfd->inject);
+
+	if (flags & POLLHUP)
+		/* The eventfd is closing, detach from VFIO */
+		virqfd_deactivate(virqfd);
+
+	return 0;
+}
+
+static void virqfd_ptable_queue_proc(struct file *file,
+				     wait_queue_head_t *wqh, poll_table *pt)
+{
+	struct virqfd *virqfd = container_of(pt, struct virqfd, pt);
+	add_wait_queue(wqh, &virqfd->wait);
+}
+
+static void virqfd_shutdown(struct work_struct *work)
+{
+	u64 cnt;
+	struct virqfd *virqfd = container_of(work, struct virqfd, shutdown);
+	struct virqfd **pvirqfd = virqfd->pvirqfd;
+
+	eventfd_ctx_remove_wait_queue(virqfd->eventfd, &virqfd->wait, &cnt);
+	flush_work(&virqfd->inject);
+	eventfd_ctx_put(virqfd->eventfd);
+
+	kfree(virqfd);
+	*pvirqfd = NULL;
+}
+
+static int virqfd_enable(struct vfio_pci_device *vdev,
+			 void (*inject)(struct work_struct *work),
+			 void *data, struct virqfd **pvirqfd, int fd)
+{
+	struct file *file = NULL;
+	struct eventfd_ctx *ctx = NULL;
+	struct virqfd *virqfd;
+	int ret = 0;
+	unsigned int events;
+
+	if (*pvirqfd)
+		return -EBUSY;
+
+	virqfd = kzalloc(sizeof(*virqfd), GFP_KERNEL);
+	if (!virqfd)
+		return -ENOMEM;
+
+	virqfd->vdev = vdev;
+	virqfd->data = data;
+	virqfd->pvirqfd = pvirqfd;
+	*pvirqfd = virqfd;
+
+	INIT_WORK(&virqfd->inject, inject);
+	INIT_WORK(&virqfd->shutdown, virqfd_shutdown);
+
+	file = eventfd_fget(fd);
+	if (IS_ERR(file)) {
+		ret = PTR_ERR(file);
+		goto fail;
+	}
+
+	ctx = eventfd_ctx_fileget(file);
+	if (IS_ERR(ctx)) {
+		ret = PTR_ERR(ctx);
+		goto fail;
+	}
+
+	virqfd->eventfd = ctx;
+
+	/*
+	 * Install our own custom wake-up handling so we are notified via
+	 * a callback whenever someone signals the underlying eventfd.
+	 */
+	init_waitqueue_func_entry(&virqfd->wait, virqfd_wakeup);
+	init_poll_funcptr(&virqfd->pt, virqfd_ptable_queue_proc);
+
+	events = file->f_op->poll(file, &virqfd->pt);
+
+	/*
+	 * Check if there was an event already pending on the eventfd
+	 * before we registered and trigger it as if we didn't miss it.
+	 */
+	if (events & POLLIN)
+		schedule_work(&virqfd->inject);
+
+	/*
+	 * Do not drop the file until the irqfd is fully initialized,
+	 * otherwise we might race against the POLLHUP.
+	 */
+	fput(file);
+
+	return 0;
+
+fail:
+	if (ctx && !IS_ERR(ctx))
+		eventfd_ctx_put(ctx);
+
+	if (!IS_ERR(file))
+		fput(file);
+
+	kfree(virqfd);
+	*pvirqfd = NULL;
+
+	return ret;
+}
+
+static void virqfd_disable(struct virqfd *virqfd)
+{
+	if (!virqfd)
+		return;
+
+	virqfd_deactivate(virqfd);
+
+	/* Block until we know all outstanding shutdown jobs have completed. */
+	flush_workqueue(vfio_irqfd_cleanup_wq);
+}
+
+/*
+ * INTx
+ */
+static inline void vfio_send_intx_eventfd(struct vfio_pci_device *vdev)
+{
+	if (likely(is_intx(vdev) && !vdev->virq_disabled))
+		eventfd_signal(vdev->ctx[0].trigger, 1);
+}
+
+void vfio_pci_intx_mask(struct vfio_pci_device *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+
+	spin_lock_irq(&vdev->irqlock);
+
+	/*
+	 * Masking can come from interrupt, ioctl, or config space
+	 * via INTx disable.  The latter means this can get called
+	 * even when not using intx delivery.  In this case, just
+	 * try to have the physical bit follow the virtual bit.
+	 */
+	if (unlikely(!is_intx(vdev))) {
+		if (vdev->pci_2_3)
+			pci_intx(pdev, 0);
+	} else if (!vdev->ctx[0].masked) {
+		/*
+		 * Can't use check_and_mask here because we always want to
+		 * mask, not just when something is pending.
+		 */
+		if (vdev->pci_2_3)
+			pci_intx(pdev, 0);
+		else
+			disable_irq_nosync(pdev->irq);
+
+		vdev->ctx[0].masked = true;
+	}
+
+	spin_unlock_irq(&vdev->irqlock);
+}
+
+void vfio_pci_intx_unmask(struct vfio_pci_device *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	bool signal = false;
+
+	spin_lock_irq(&vdev->irqlock);
+
+	/*
+	 * Unmasking comes from ioctl or config, so again, have the
+	 * physical bit follow the virtual even when not using INTx.
+	 */
+	if (unlikely(!is_intx(vdev))) {
+		if (vdev->pci_2_3)
+			pci_intx(pdev, 1);
+	} else if (vdev->ctx[0].masked && !vdev->virq_disabled) {
+		/*
+		 * A pending interrupt here would immediately trigger,
+		 * but we can avoid that overhead by just re-sending
+		 * the interrupt to the user.
+		 */
+		if (vdev->pci_2_3) {
+			if (!pci_check_and_unmask_intx(pdev))
+				signal = true;
+		} else
+			enable_irq(pdev->irq);
+
+		vdev->ctx[0].masked = signal;
+	}
+
+	spin_unlock_irq(&vdev->irqlock);
+
+	if (signal)
+		vfio_send_intx_eventfd(vdev);
+}
+
+static irqreturn_t vfio_intx_handler(int irq, void *dev_id)
+{
+	struct vfio_pci_device *vdev = dev_id;
+	struct pci_dev *pdev = vdev->pdev;
+	irqreturn_t ret = IRQ_NONE;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vdev->irqlock, flags);
+
+	/* Non-PCI 2.3 devices don't use this hard IRQ handler */
+	if (pci_check_and_mask_intx(pdev)) {
+		ret = IRQ_WAKE_THREAD;
+		vdev->ctx[0].masked = true;
+	}
+
+	spin_unlock_irqrestore(&vdev->irqlock, flags);
+
+	return ret;
+}
+
+static irqreturn_t vfio_intx_thread(int irq, void *dev_id)
+{
+	struct vfio_pci_device *vdev = dev_id;
+	int ret = IRQ_HANDLED;
+
+	if (unlikely(!vdev->pci_2_3)) {
+		spin_lock_irq(&vdev->irqlock);
+		if (!vdev->ctx[0].masked) {
+			disable_irq_nosync(vdev->pdev->irq);
+			vdev->ctx[0].masked = true;
+		} else
+			ret = IRQ_NONE;
+		spin_unlock_irq(&vdev->irqlock);
+	}
+
+	if (ret == IRQ_HANDLED)
+		vfio_send_intx_eventfd(vdev);
+
+	return ret;
+}
+
+static int vfio_intx_enable(struct vfio_pci_device *vdev)
+{
+	if (!is_irq_none(vdev))
+		return -EINVAL;
+
+	if (!vdev->pdev->irq)
+		return -ENODEV;
+
+	vdev->ctx = kzalloc(sizeof(struct vfio_pci_irq_ctx), GFP_KERNEL);
+	if (!vdev->ctx)
+		return -ENOMEM;
+
+	vdev->num_ctx = 1;
+	vdev->irq_type = VFIO_PCI_INTX_IRQ_INDEX;
+
+	return 0;
+}
+
+static int vfio_intx_set_signal(struct vfio_pci_device *vdev, int fd)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	irq_handler_t handler = vfio_intx_handler;
+	unsigned long irqflags = IRQF_SHARED;
+	int ret;
+
+	if (vdev->ctx[0].trigger) {
+		free_irq(pdev->irq, vdev);
+		kfree(vdev->ctx[0].name);
+		eventfd_ctx_put(vdev->ctx[0].trigger);
+		vdev->ctx[0].trigger = NULL;
+	}
+
+	if (fd >= 0) {
+		vdev->ctx[0].name = kasprintf(GFP_KERNEL, "vfio-intx(%s)",
+					    pci_name(pdev));
+		if (!vdev->ctx[0].name)
+			return -ENOMEM;
+
+		vdev->ctx[0].trigger = eventfd_ctx_fdget(fd);
+		if (IS_ERR(vdev->ctx[0].trigger)) {
+			ret = PTR_ERR(vdev->ctx[0].trigger);
+			vdev->ctx[0].trigger = NULL;
+			kfree(vdev->ctx[0].name);
+			return ret;
+		}
+
+		if (!vdev->pci_2_3) {
+			handler = NULL;
+			irqflags = IRQF_ONESHOT;
+		}
+
+		ret = request_threaded_irq(pdev->irq, handler, vfio_intx_thread,
+					   irqflags, vdev->ctx[0].name, vdev);
+		if (ret) {
+			eventfd_ctx_put(vdev->ctx[0].trigger);
+			kfree(vdev->ctx[0].name);
+			return ret;
+		}
+
+		/*
+		 * INTx disable will stick across the new irq setup,
+		 * disable_irq won't.
+		 */
+		if (!vdev->pci_2_3)
+			if (vdev->ctx[0].masked || vdev->virq_disabled)
+				disable_irq_nosync(pdev->irq);
+	}
+	return 0;
+}
+
+static void vfio_intx_disable(struct vfio_pci_device *vdev)
+{
+	vfio_intx_set_signal(vdev, -1);
+	virqfd_disable(vdev->ctx[0].unmask);
+	virqfd_disable(vdev->ctx[0].mask);
+	vdev->irq_type = VFIO_PCI_NUM_IRQS;
+	vdev->num_ctx = 0;
+	kfree(vdev->ctx);
+}
+
+static void vfio_intx_unmask_inject(struct work_struct *work)
+{
+	struct virqfd *virqfd = container_of(work, struct virqfd, inject);
+	vfio_pci_intx_unmask(virqfd->vdev);
+}
+
+/*
+ * MSI/MSI-X
+ */
+static irqreturn_t vfio_msihandler(int irq, void *arg)
+{
+	struct eventfd_ctx *trigger = arg;
+
+	eventfd_signal(trigger, 1);
+	return IRQ_HANDLED;
+}
+
+static int vfio_msi_enable(struct vfio_pci_device *vdev, int nvec, bool msix)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	int ret;
+
+	if (!is_irq_none(vdev))
+		return -EINVAL;
+
+	vdev->ctx = kzalloc(nvec * sizeof(struct vfio_pci_irq_ctx), GFP_KERNEL);
+	if (!vdev->ctx)
+		return -ENOMEM;
+
+	if (msix) {
+		int i;
+
+		vdev->msix = kzalloc(nvec * sizeof(struct msix_entry),
+				     GFP_KERNEL);
+		if (!vdev->msix) {
+			kfree(vdev->ctx);
+			return -ENOMEM;
+		}
+
+		for (i = 0; i < nvec; i++)
+			vdev->msix[i].entry = i;
+
+		ret = pci_enable_msix(pdev, vdev->msix, nvec);
+		if (ret) {
+			kfree(vdev->msix);
+			kfree(vdev->ctx);
+			return ret;
+		}
+	} else {
+		ret = pci_enable_msi_block(pdev, nvec);
+		if (ret) {
+			kfree(vdev->ctx);
+			return ret;
+		}
+	}
+
+	vdev->num_ctx = nvec;
+	vdev->irq_type = msix ? VFIO_PCI_MSIX_IRQ_INDEX :
+				VFIO_PCI_MSI_IRQ_INDEX;
+
+	if (!msix) {
+		/*
+		 * Compute the virtual hardware field for max msi vectors -
+		 * it is the log base 2 of the number of vectors, rounded up.
+		 */
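+		/* e.g. nvec = 3: fls(5) - 1 = 2, since 1 << 2 = 4 >= 3 */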
+		vdev->msi_qmax = fls(nvec * 2 - 1) - 1;
+	}
+
+	return 0;
+}
+
+static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
+				      int vector, int fd, bool msix)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	int irq = msix ? vdev->msix[vector].vector : pdev->irq + vector;
+	char *name = msix ? "vfio-msix" : "vfio-msi";
+
+	if (vector >= vdev->num_ctx)
+		return -EINVAL;
+
+	if (vdev->ctx[vector].trigger) {
+		free_irq(irq, vdev->ctx[vector].trigger);
+		kfree(vdev->ctx[vector].name);
+		eventfd_ctx_put(vdev->ctx[vector].trigger);
+		vdev->ctx[vector].trigger = NULL;
+	}
+
+	if (fd >= 0) {
+		struct eventfd_ctx *trigger;
+		int ret;
+
+		vdev->ctx[vector].name = kasprintf(GFP_KERNEL, "%s[%d](%s)",
+						   name, vector,
+						   pci_name(pdev));
+		if (!vdev->ctx[vector].name)
+			return -ENOMEM;
+
+		trigger = eventfd_ctx_fdget(fd);
+		if (IS_ERR(trigger)) {
+			kfree(vdev->ctx[vector].name);
+			return PTR_ERR(trigger);
+		}
+
+		ret = request_threaded_irq(irq, NULL, vfio_msihandler, 0,
+					   vdev->ctx[vector].name, trigger);
+		if (ret) {
+			eventfd_ctx_put(trigger);
+			kfree(vdev->ctx[vector].name);
+			return ret;
+		}
+
+		vdev->ctx[vector].trigger = trigger;
+	}
+
+	return 0;
+}
+
+static int vfio_msi_set_block(struct vfio_pci_device *vdev, int start,
+			      int count, int32_t *fds, bool msix)
+{
+	int i, j, ret = 0;
+
+	if (start + count > vdev->num_ctx)
+		return -EINVAL;
+
+	for (i = 0, j = start; i < count && !ret; i++, j++) {
+		int fd = fds ? fds[i] : -1;
+		ret = vfio_msi_set_vector_signal(vdev, j, fd, msix);
+	}
+
+	if (ret) {
+		for (--j; j >= start; j--)
+			vfio_msi_set_vector_signal(vdev, j, -1, msix);
+	}
+
+	return ret;
+}
+
+static void vfio_msi_disable(struct vfio_pci_device *vdev, bool msix)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	int i;
+
+	vfio_msi_set_block(vdev, 0, vdev->num_ctx, NULL, msix);
+
+	for (i = 0; i < vdev->num_ctx; i++) {
+		virqfd_disable(vdev->ctx[i].unmask);
+		virqfd_disable(vdev->ctx[i].mask);
+	}
+
+	if (msix) {
+		pci_disable_msix(vdev->pdev);
+		kfree(vdev->msix);
+	} else
+		pci_disable_msi(pdev);
+
+	vdev->irq_type = VFIO_PCI_NUM_IRQS;
+	vdev->num_ctx = 0;
+	kfree(vdev->ctx);
+}
+
+/*
+ * IOCTL support
+ */
+static int vfio_pci_set_intx_unmask(struct vfio_pci_device *vdev,
+				    int index, int start, int count,
+				    uint32_t flags, void *data)
+{
+	if (!is_intx(vdev) || start != 0 || count != 1)
+		return -EINVAL;
+
+	if (flags & VFIO_IRQ_SET_DATA_NONE) {
+		vfio_pci_intx_unmask(vdev);
+	} else if (flags & VFIO_IRQ_SET_DATA_BOOL) {
+		uint8_t unmask = *(uint8_t *)data;
+		if (unmask)
+			vfio_pci_intx_unmask(vdev);
+	} else if (flags & VFIO_IRQ_SET_DATA_EVENTFD) {
+		int32_t fd = *(int32_t *)data;
+		if (fd >= 0)
+			return virqfd_enable(vdev, vfio_intx_unmask_inject,
+					     NULL, &vdev->ctx[0].unmask, fd);
+
+		virqfd_disable(vdev->ctx[0].unmask);
+	}
+
+	return 0;
+}
+
+static int vfio_pci_set_intx_mask(struct vfio_pci_device *vdev,
+				  int index, int start, int count,
+				  uint32_t flags, void *data)
+{
+	if (!is_intx(vdev) || start != 0 || count != 1)
+		return -EINVAL;
+
+	if (flags & VFIO_IRQ_SET_DATA_NONE) {
+		vfio_pci_intx_mask(vdev);
+	} else if (flags & VFIO_IRQ_SET_DATA_BOOL) {
+		uint8_t mask = *(uint8_t *)data;
+		if (mask)
+			vfio_pci_intx_mask(vdev);
+	} else if (flags & VFIO_IRQ_SET_DATA_EVENTFD) {
+		return -ENOTTY; /* XXX implement me */
+	}
+
+	return 0;
+}
+
+static int vfio_pci_set_intx_trigger(struct vfio_pci_device *vdev,
+				     int index, int start, int count,
+				     uint32_t flags, void *data)
+{
+	if (is_intx(vdev) && !count && (flags & VFIO_IRQ_SET_DATA_NONE)) {
+		vfio_intx_disable(vdev);
+		return 0;
+	}
+
+	if (!(is_intx(vdev) || is_irq_none(vdev)) || start != 0 || count != 1)
+		return -EINVAL;
+
+	if (flags & VFIO_IRQ_SET_DATA_EVENTFD) {
+		int32_t fd = *(int32_t *)data;
+		int ret;
+
+		if (is_intx(vdev))
+			return vfio_intx_set_signal(vdev, fd);
+
+		ret = vfio_intx_enable(vdev);
+		if (ret)
+			return ret;
+
+		ret = vfio_intx_set_signal(vdev, fd);
+		if (ret)
+			vfio_intx_disable(vdev);
+
+		return ret;
+	}
+
+	if (!is_intx(vdev))
+		return -EINVAL;
+
+	if (flags & VFIO_IRQ_SET_DATA_NONE) {
+		vfio_send_intx_eventfd(vdev);
+	} else if (flags & VFIO_IRQ_SET_DATA_BOOL) {
+		uint8_t trigger = *(uint8_t *)data;
+		if (trigger)
+			vfio_send_intx_eventfd(vdev);
+	}
+	return 0;
+}
+
+static int vfio_pci_set_msi_trigger(struct vfio_pci_device *vdev,
+				    int index, int start, int count,
+				    uint32_t flags, void *data)
+{
+	int i;
+	bool msix = (index == VFIO_PCI_MSIX_IRQ_INDEX);
+
+	if (irq_is(vdev, index) && !count && (flags & VFIO_IRQ_SET_DATA_NONE)) {
+		vfio_msi_disable(vdev, msix);
+		return 0;
+	}
+
+	if (!(irq_is(vdev, index) || is_irq_none(vdev)))
+		return -EINVAL;
+
+	if (flags & VFIO_IRQ_SET_DATA_EVENTFD) {
+		int32_t *fds = data;
+		int ret;
+
+		if (vdev->irq_type == index)
+			return vfio_msi_set_block(vdev, start, count,
+						  fds, msix);
+
+		ret = vfio_msi_enable(vdev, start + count, msix);
+		if (ret)
+			return ret;
+
+		ret = vfio_msi_set_block(vdev, start, count, fds, msix);
+		if (ret)
+			vfio_msi_disable(vdev, msix);
+
+		return ret;
+	}
+
+	if (!irq_is(vdev, index) || start + count > vdev->num_ctx)
+		return -EINVAL;
+
+	for (i = start; i < start + count; i++) {
+		if (!vdev->ctx[i].trigger)
+			continue;
+		if (flags & VFIO_IRQ_SET_DATA_NONE) {
+			eventfd_signal(vdev->ctx[i].trigger, 1);
+		} else if (flags & VFIO_IRQ_SET_DATA_BOOL) {
+			uint8_t *bools = data;
+			if (bools[i - start])
+				eventfd_signal(vdev->ctx[i].trigger, 1);
+		}
+	}
+	return 0;
+}
+
+int vfio_pci_set_irqs_ioctl(struct vfio_pci_device *vdev, uint32_t flags,
+			    int index, int start, int count, void *data)
+{
+	int (*func)(struct vfio_pci_device *vdev, int index, int start,
+		    int count, uint32_t flags, void *data) = NULL;
+
+	switch (index) {
+	case VFIO_PCI_INTX_IRQ_INDEX:
+		switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
+		case VFIO_IRQ_SET_ACTION_MASK:
+			func = vfio_pci_set_intx_mask;
+			break;
+		case VFIO_IRQ_SET_ACTION_UNMASK:
+			func = vfio_pci_set_intx_unmask;
+			break;
+		case VFIO_IRQ_SET_ACTION_TRIGGER:
+			func = vfio_pci_set_intx_trigger;
+			break;
+		}
+		break;
+	case VFIO_PCI_MSI_IRQ_INDEX:
+	case VFIO_PCI_MSIX_IRQ_INDEX:
+		switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
+		case VFIO_IRQ_SET_ACTION_MASK:
+		case VFIO_IRQ_SET_ACTION_UNMASK:
+			/* XXX Need masking support exported */
+			break;
+		case VFIO_IRQ_SET_ACTION_TRIGGER:
+			func = vfio_pci_set_msi_trigger;
+			break;
+		}
+		break;
+	}
+
+	if (!func)
+		return -ENOTTY;
+
+	return func(vdev, index, start, count, flags, data);
+}
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
new file mode 100644
index 0000000..a4a3678
--- /dev/null
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -0,0 +1,91 @@
+/*
+ * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
+ *     Author: Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Derived from original vfio:
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ * Author: Tom Lyon, pugs@cisco.com
+ */
+
+#include <linux/mutex.h>
+#include <linux/pci.h>
+
+#ifndef VFIO_PCI_PRIVATE_H
+#define VFIO_PCI_PRIVATE_H
+
+#define VFIO_PCI_OFFSET_SHIFT   40
+
+#define VFIO_PCI_OFFSET_TO_INDEX(off)	(off >> VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_INDEX_TO_OFFSET(index)	((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_OFFSET_MASK	(((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
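+
+/*
+ * e.g. a read at offset 0x10 into BAR2 arrives as
+ * VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR2_REGION_INDEX) | 0x10, i.e.
+ * ((u64)2 << 40) | 0x10; handlers recover the region index with
+ * VFIO_PCI_OFFSET_TO_INDEX() and the offset via VFIO_PCI_OFFSET_MASK.
+ */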
+
+struct vfio_pci_irq_ctx {
+	struct eventfd_ctx	*trigger;
+	struct virqfd		*unmask;
+	struct virqfd		*mask;
+	char			*name;
+	bool			masked;
+};
+
+struct vfio_pci_device {
+	struct pci_dev		*pdev;
+	void __iomem		*barmap[PCI_STD_RESOURCE_END + 1];
+	u8			*pci_config_map;
+	u8			*vconfig;
+	struct perm_bits	*msi_perm;
+	spinlock_t		irqlock;
+	struct mutex		igate;
+	struct msix_entry	*msix;
+	struct vfio_pci_irq_ctx	*ctx;
+	int			num_ctx;
+	int			irq_type;
+	u8			msi_qmax;
+	u8			msix_bar;
+	u16			msix_size;
+	u32			msix_offset;
+	u32			rbar[7];
+	bool			pci_2_3;
+	bool			virq_disabled;
+	bool			reset_works;
+	bool			extended_caps;
+	bool			bardirty;
+	struct pci_saved_state	*pci_saved_state;
+	atomic_t		refcnt;
+};
+
+#define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)
+#define is_msi(vdev) (vdev->irq_type == VFIO_PCI_MSI_IRQ_INDEX)
+#define is_msix(vdev) (vdev->irq_type == VFIO_PCI_MSIX_IRQ_INDEX)
+#define is_irq_none(vdev) (!(is_intx(vdev) || is_msi(vdev) || is_msix(vdev)))
+#define irq_is(vdev, type) (vdev->irq_type == type)
+
+extern void vfio_pci_intx_mask(struct vfio_pci_device *vdev);
+extern void vfio_pci_intx_unmask(struct vfio_pci_device *vdev);
+
+extern int vfio_pci_set_irqs_ioctl(struct vfio_pci_device *vdev,
+				   uint32_t flags, int index, int start,
+				   int count, void *data);
+
+extern ssize_t vfio_pci_config_readwrite(struct vfio_pci_device *vdev,
+					 char __user *buf, size_t count,
+					 loff_t *ppos, bool iswrite);
+extern ssize_t vfio_pci_mem_readwrite(struct vfio_pci_device *vdev,
+				      char __user *buf, size_t count,
+				      loff_t *ppos, bool iswrite);
+extern ssize_t vfio_pci_io_readwrite(struct vfio_pci_device *vdev,
+				     char __user *buf, size_t count,
+				     loff_t *ppos, bool iswrite);
+
+extern int vfio_pci_init_perm_bits(void);
+extern void vfio_pci_uninit_perm_bits(void);
+
+extern int vfio_pci_virqfd_init(void);
+extern void vfio_pci_virqfd_exit(void);
+
+extern int vfio_config_init(struct vfio_pci_device *vdev);
+extern void vfio_config_free(struct vfio_pci_device *vdev);
+#endif /* VFIO_PCI_PRIVATE_H */
diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
new file mode 100644
index 0000000..44c3ba2
--- /dev/null
+++ b/drivers/vfio/pci/vfio_pci_rdwr.c
@@ -0,0 +1,267 @@
+/*
+ * VFIO PCI I/O Port & MMIO access
+ *
+ * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
+ *     Author: Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Derived from original vfio:
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ * Author: Tom Lyon, pugs@cisco.com
+ */
+
+#include <linux/fs.h>
+#include <linux/pci.h>
+#include <linux/uaccess.h>
+#include <linux/io.h>
+
+#include "vfio_pci_private.h"
+
+/* I/O Port BAR access */
+ssize_t vfio_pci_io_readwrite(struct vfio_pci_device *vdev, char __user *buf,
+			      size_t count, loff_t *ppos, bool iswrite)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+	int bar = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	void __iomem *io;
+	size_t done = 0;
+
+	if (!pci_resource_start(pdev, bar))
+		return -EINVAL;
+
+	if (pos + count > pci_resource_len(pdev, bar))
+		return -EINVAL;
+
+	if (!vdev->barmap[bar]) {
+		int ret;
+
+		ret = pci_request_selected_regions(pdev, 1 << bar, "vfio");
+		if (ret)
+			return ret;
+
+		vdev->barmap[bar] = pci_iomap(pdev, bar, 0);
+
+		if (!vdev->barmap[bar]) {
+			pci_release_selected_regions(pdev, 1 << bar);
+			return -EINVAL;
+		}
+	}
+
+	io = vdev->barmap[bar];
+
+	while (count) {
+		int filled;
+
+		if (count >= 4 && !(pos % 4)) {
+			u32 val;
+
+			if (iswrite) {
+				if (copy_from_user(&val, buf, 4))
+					return -EFAULT;
+
+				iowrite32(val, io + pos);
+			} else {
+				val = ioread32(io + pos);
+
+				if (copy_to_user(buf, &val, 4))
+					return -EFAULT;
+			}
+
+			filled = 4;
+
+		} else if ((pos % 2) == 0 && count >= 2) {
+			u16 val;
+
+			if (iswrite) {
+				if (copy_from_user(&val, buf, 2))
+					return -EFAULT;
+
+				iowrite16(val, io + pos);
+			} else {
+				val = ioread16(io + pos);
+
+				if (copy_to_user(buf, &val, 2))
+					return -EFAULT;
+			}
+
+			filled = 2;
+		} else {
+			u8 val;
+
+			if (iswrite) {
+				if (copy_from_user(&val, buf, 1))
+					return -EFAULT;
+
+				iowrite8(val, io + pos);
+			} else {
+				val = ioread8(io + pos);
+
+				if (copy_to_user(buf, &val, 1))
+					return -EFAULT;
+			}
+
+			filled = 1;
+		}
+
+		count -= filled;
+		done += filled;
+		buf += filled;
+		pos += filled;
+	}
+
+	*ppos += done;
+
+	return done;
+}
+
+/*
+ * MMIO BAR access
+ * We handle two excluded ranges here: if the user tries to read the
+ * ROM beyond what PCI tells us is available, or touches the MSI-X
+ * table region, reads return 0xFF and writes are dropped.
+ */
+ssize_t vfio_pci_mem_readwrite(struct vfio_pci_device *vdev, char __user *buf,
+			       size_t count, loff_t *ppos, bool iswrite)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+	int bar = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	void __iomem *io;
+	resource_size_t end;
+	size_t done = 0;
+	size_t x_start = 0, x_end = 0; /* excluded range */
+
+	if (!pci_resource_start(pdev, bar))
+		return -EINVAL;
+
+	end = pci_resource_len(pdev, bar);
+
+	if (pos > end)
+		return -EINVAL;
+
+	if (pos == end)
+		return 0;
+
+	if (pos + count > end)
+		count = end - pos;
+
+	if (bar == PCI_ROM_RESOURCE) {
+		io = pci_map_rom(pdev, &x_start);
+		x_end = end;
+	} else {
+		if (!vdev->barmap[bar]) {
+			int ret;
+
+			ret = pci_request_selected_regions(pdev, 1 << bar,
+							   "vfio");
+			if (ret)
+				return ret;
+
+			vdev->barmap[bar] = pci_iomap(pdev, bar, 0);
+
+			if (!vdev->barmap[bar]) {
+				pci_release_selected_regions(pdev, 1 << bar);
+				return -EINVAL;
+			}
+		}
+
+		io = vdev->barmap[bar];
+
+		if (bar == vdev->msix_bar) {
+			x_start = vdev->msix_offset;
+			x_end = vdev->msix_offset + vdev->msix_size;
+		}
+	}
+
+	if (!io)
+		return -EINVAL;
+
+	while (count) {
+		size_t fillable, filled;
+
+		if (pos < x_start)
+			fillable = x_start - pos;
+		else if (pos >= x_end)
+			fillable = end - pos;
+		else
+			fillable = 0;
+
+		if (fillable >= 4 && !(pos % 4)) {
+			u32 val;
+
+			if (iswrite) {
+				if (copy_from_user(&val, buf, 4))
+					goto out;
+
+				iowrite32(val, io + pos);
+			} else {
+				val = ioread32(io + pos);
+				if (copy_to_user(buf, &val, 4))
+					goto out;
+			}
+
+			filled = 4;
+		} else if (fillable >= 2 && !(pos % 2)) {
+			u16 val;
+
+			if (iswrite) {
+				if (copy_from_user(&val, buf, 2))
+					goto out;
+
+				iowrite16(val, io + pos);
+			} else {
+				val = ioread16(io + pos);
+				if (copy_to_user(buf, &val, 2))
+					goto out;
+			}
+
+			filled = 2;
+		} else if (fillable) {
+			u8 val;
+
+			if (iswrite) {
+				if (copy_from_user(&val, buf, 1))
+					goto out;
+
+				iowrite8(val, io + pos);
+			} else {
+				val = ioread8(io + pos);
+
+				if (copy_to_user(buf, &val, 1))
+					goto out;
+			}
+
+			filled = 1;
+		} else {
+			/* Drop writes, fill reads with FF; bound by count */
+			filled = min(count, (size_t)(x_end - pos));
+
+			if (!iswrite) {
+				char val = 0xFF;
+				size_t i;
+
+				for (i = 0; i < filled; i++) {
+					if (put_user(val, buf + i))
+						goto out;
+				}
+			}
+		}
+
+		count -= filled;
+		done += filled;
+		buf += filled;
+		pos += filled;
+	}
+
+	*ppos += done;
+
+out:
+	if (bar == PCI_ROM_RESOURCE)
+		pci_unmap_rom(pdev, io);
+
+	return count ? -EFAULT : done;
+}
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 1c7119c..d668283 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -220,6 +220,7 @@ struct vfio_device_info {
 	__u32	argsz;
 	__u32	flags;
 #define VFIO_DEVICE_FLAGS_RESET	(1 << 0)	/* Device supports reset */
+#define VFIO_DEVICE_FLAGS_PCI	(1 << 1)	/* vfio-pci device */
 	__u32	num_regions;	/* Max region index + 1 */
 	__u32	num_irqs;	/* Max IRQ index + 1 */
 };
@@ -361,6 +362,31 @@ struct vfio_irq_set {
  */
 #define VFIO_DEVICE_RESET		_IO(VFIO_TYPE, VFIO_BASE + 11)
 
+/*
+ * The VFIO-PCI bus driver makes use of the following fixed region and
+ * IRQ index mapping.  Unimplemented regions return a size of zero.
+ * Unimplemented IRQ types return a count of zero.
+ */
+
+enum {
+	VFIO_PCI_BAR0_REGION_INDEX,
+	VFIO_PCI_BAR1_REGION_INDEX,
+	VFIO_PCI_BAR2_REGION_INDEX,
+	VFIO_PCI_BAR3_REGION_INDEX,
+	VFIO_PCI_BAR4_REGION_INDEX,
+	VFIO_PCI_BAR5_REGION_INDEX,
+	VFIO_PCI_ROM_REGION_INDEX,
+	VFIO_PCI_CONFIG_REGION_INDEX,
+	VFIO_PCI_NUM_REGIONS
+};
+
+enum {
+	VFIO_PCI_INTX_IRQ_INDEX,
+	VFIO_PCI_MSI_IRQ_INDEX,
+	VFIO_PCI_MSIX_IRQ_INDEX,
+	VFIO_PCI_NUM_IRQS
+};
+
 /* -------- API for x86 VFIO IOMMU -------- */
 
 /**


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 13/13] vfio: Add PCI device driver
@ 2012-05-11 22:56   ` Alex Williamson
  0 siblings, 0 replies; 129+ messages in thread
From: Alex Williamson @ 2012-05-11 22:56 UTC (permalink / raw)
  To: benh, aik, david, joerg.roedel, dwmw2
  Cc: kvm, B07421, linux-pci, agraf, qemu-devel, chrisw, B08248,
	iommu, avi, gregkh, bhelgaas, linux-kernel, benve

Add PCI device support for VFIO.  PCI devices expose regions
for accessing config space, I/O port space, and MMIO areas
of the device.  PCI config access is virtualized in the kernel,
allowing us to protect the integrity of the system by blocking
unsafe accesses, while also reducing duplicated emulation across
userspace drivers.  I/O port regions support read/write access,
and MMIO regions additionally support mmap of sufficiently sized
regions.  Support for INTx, MSI, and MSI-X interrupts is provided
through eventfds signaled to userspace.
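
For example, userspace can locate and read the virtualized config
region through the region info ioctl; a minimal sketch (assuming
"device" is a vfio device fd obtained from the VFIO core, e.g. via
VFIO_GROUP_GET_DEVICE_FD, with error handling omitted):

	struct vfio_region_info info = { .argsz = sizeof(info) };
	__u16 vendor;

	info.index = VFIO_PCI_CONFIG_REGION_INDEX;
	ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &info);

	/* Config accesses route through the kernel's virtualization */
	pread(device, &vendor, 2, info.offset + PCI_VENDOR_ID);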

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

 drivers/vfio/Kconfig                |    2 
 drivers/vfio/pci/Kconfig            |    8 
 drivers/vfio/pci/Makefile           |    4 
 drivers/vfio/pci/vfio_pci.c         |  557 +++++++++++++
 drivers/vfio/pci/vfio_pci_config.c  | 1527 +++++++++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_intrs.c   |  724 +++++++++++++++++
 drivers/vfio/pci/vfio_pci_private.h |   91 ++
 drivers/vfio/pci/vfio_pci_rdwr.c    |  267 ++++++
 include/linux/vfio.h                |   26 +
 9 files changed, 3206 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vfio/pci/Kconfig
 create mode 100644 drivers/vfio/pci/Makefile
 create mode 100644 drivers/vfio/pci/vfio_pci.c
 create mode 100644 drivers/vfio/pci/vfio_pci_config.c
 create mode 100644 drivers/vfio/pci/vfio_pci_intrs.c
 create mode 100644 drivers/vfio/pci/vfio_pci_private.h
 create mode 100644 drivers/vfio/pci/vfio_pci_rdwr.c

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index bd88a30..77b754c 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -12,3 +12,5 @@ menuconfig VFIO
 	  See Documentation/vfio.txt for more details.
 
 	  If you don't know what to do here, say N.
+
+source "drivers/vfio/pci/Kconfig"
diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
new file mode 100644
index 0000000..cc7db62
--- /dev/null
+++ b/drivers/vfio/pci/Kconfig
@@ -0,0 +1,8 @@
+config VFIO_PCI
+	tristate "VFIO support for PCI devices"
+	depends on VFIO && PCI
+	help
+	  Support for the PCI VFIO bus driver.  This is required to make
+	  use of PCI drivers using the VFIO framework.
+
+	  If you don't know what to do here, say N.
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
new file mode 100644
index 0000000..1310792
--- /dev/null
+++ b/drivers/vfio/pci/Makefile
@@ -0,0 +1,4 @@
+
+vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
+
+obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
new file mode 100644
index 0000000..b2f1f3a
--- /dev/null
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -0,0 +1,557 @@
+/*
+ * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
+ *     Author: Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Derived from original vfio:
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ * Author: Tom Lyon, pugs@cisco.com
+ */
+
+#include <linux/device.h>
+#include <linux/eventfd.h>
+#include <linux/interrupt.h>
+#include <linux/iommu.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/notifier.h>
+#include <linux/pci.h>
+#include <linux/pm_runtime.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+
+#include "vfio_pci_private.h"
+
+#define DRIVER_VERSION  "0.1.9"
+#define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
+#define DRIVER_DESC     "VFIO PCI - User Level meta-driver"
+
+static int vfio_pci_enable(struct vfio_pci_device *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	int ret;
+	u16 cmd;
+	u8 msix_pos;
+
+	vdev->reset_works = (pci_reset_function(pdev) == 0);
+	pci_save_state(pdev);
+	vdev->pci_saved_state = pci_store_saved_state(pdev);
+	if (!vdev->pci_saved_state)
+		printk(KERN_DEBUG "%s: Couldn't store %s saved state\n",
+		       __func__, dev_name(&pdev->dev));
+
+	ret = vfio_config_init(vdev);
+	if (ret)
+		goto out;
+
+	vdev->pci_2_3 = pci_intx_mask_supported(pdev);
+
+	pci_read_config_word(pdev, PCI_COMMAND, &cmd);
+	if (vdev->pci_2_3 && (cmd & PCI_COMMAND_INTX_DISABLE)) {
+		cmd &= ~PCI_COMMAND_INTX_DISABLE;
+		pci_write_config_word(pdev, PCI_COMMAND, cmd);
+	}
+
+	msix_pos = pci_find_capability(pdev, PCI_CAP_ID_MSIX);
+	if (msix_pos) {
+		u16 flags;
+		u32 table;
+
+		pci_read_config_word(pdev, msix_pos + PCI_MSIX_FLAGS, &flags);
+		pci_read_config_dword(pdev, msix_pos + PCI_MSIX_TABLE, &table);
+
+		vdev->msix_bar = table & PCI_MSIX_FLAGS_BIRMASK;
+		vdev->msix_offset = table & ~PCI_MSIX_FLAGS_BIRMASK;
+		vdev->msix_size = ((flags & PCI_MSIX_FLAGS_QSIZE) + 1) * 16;
+	} else
+		vdev->msix_bar = 0xFF;
+
+	ret = pci_enable_device(pdev);
+	if (ret)
+		goto out;
+
+	return ret;
+
+out:
+	kfree(vdev->pci_saved_state);
+	vdev->pci_saved_state = NULL;
+	vfio_config_free(vdev);
+	return ret;
+}
+
+static void vfio_pci_disable(struct vfio_pci_device *vdev)
+{
+	int bar;
+
+	pci_disable_device(vdev->pdev);
+
+	vfio_pci_set_irqs_ioctl(vdev, VFIO_IRQ_SET_DATA_NONE |
+				VFIO_IRQ_SET_ACTION_TRIGGER,
+				vdev->irq_type, 0, 0, NULL);
+
+	vdev->virq_disabled = false;
+
+	vfio_config_free(vdev);
+
+	if (pci_reset_function(vdev->pdev) == 0) {
+		if (pci_load_and_free_saved_state(vdev->pdev,
+						  &vdev->pci_saved_state) == 0)
+			pci_restore_state(vdev->pdev);
+		else
+			printk(KERN_INFO "%s: Couldn't reload %s saved state\n",
+			       __func__, dev_name(&vdev->pdev->dev));
+	}
+
+	for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
+		if (!vdev->barmap[bar])
+			continue;
+		pci_iounmap(vdev->pdev, vdev->barmap[bar]);
+		pci_release_selected_regions(vdev->pdev, 1 << bar);
+		vdev->barmap[bar] = NULL;
+	}
+}
+
+static void vfio_pci_release(void *device_data)
+{
+	struct vfio_pci_device *vdev = device_data;
+
+	if (atomic_dec_and_test(&vdev->refcnt))
+		vfio_pci_disable(vdev);
+
+	module_put(THIS_MODULE);
+}
+
+static int vfio_pci_open(void *device_data)
+{
+	struct vfio_pci_device *vdev = device_data;
+
+	if (!try_module_get(THIS_MODULE))
+		return -ENODEV;
+
+	if (atomic_inc_return(&vdev->refcnt) == 1) {
+		int ret = vfio_pci_enable(vdev);
+		if (ret) {
+			module_put(THIS_MODULE);
+			return ret;
+		}
+	}
+
+	return 0;
+}
+
+static int vfio_pci_get_irq_count(struct vfio_pci_device *vdev, int irq_type)
+{
+	if (irq_type == VFIO_PCI_INTX_IRQ_INDEX) {
+		u8 pin;
+		pci_read_config_byte(vdev->pdev, PCI_INTERRUPT_PIN, &pin);
+		if (pin)
+			return 1;
+
+	} else if (irq_type == VFIO_PCI_MSI_IRQ_INDEX) {
+		u8 pos;
+		u16 flags;
+
+		pos = pci_find_capability(vdev->pdev, PCI_CAP_ID_MSI);
+		if (pos) {
+			pci_read_config_word(vdev->pdev,
+					     pos + PCI_MSI_FLAGS, &flags);
+
+			return 1 << (flags & PCI_MSI_FLAGS_QMASK);
+		}
+	} else if (irq_type == VFIO_PCI_MSIX_IRQ_INDEX) {
+		u8 pos;
+		u16 flags;
+
+		pos = pci_find_capability(vdev->pdev, PCI_CAP_ID_MSIX);
+		if (pos) {
+			pci_read_config_word(vdev->pdev,
+					     pos + PCI_MSIX_FLAGS, &flags);
+
+			return (flags & PCI_MSIX_FLAGS_QSIZE) + 1;
+		}
+	}
+
+	return 0;
+}
+
+static long vfio_pci_ioctl(void *device_data,
+			   unsigned int cmd, unsigned long arg)
+{
+	struct vfio_pci_device *vdev = device_data;
+	unsigned long minsz;
+
+	if (cmd == VFIO_DEVICE_GET_INFO) {
+		struct vfio_device_info info;
+
+		minsz = offsetofend(struct vfio_device_info, num_irqs);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		info.flags = VFIO_DEVICE_FLAGS_PCI;
+
+		if (vdev->reset_works)
+			info.flags |= VFIO_DEVICE_FLAGS_RESET;
+
+		info.num_regions = VFIO_PCI_NUM_REGIONS;
+		info.num_irqs = VFIO_PCI_NUM_IRQS;
+
+		return copy_to_user((void __user *)arg, &info, minsz) ?
+			-EFAULT : 0;
+
+	} else if (cmd == VFIO_DEVICE_GET_REGION_INFO) {
+		struct pci_dev *pdev = vdev->pdev;
+		struct vfio_region_info info;
+
+		minsz = offsetofend(struct vfio_region_info, offset);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz || info.index >= VFIO_PCI_NUM_REGIONS)
+			return -EINVAL;
+
+		info.flags = 0;
+		info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+
+		if (info.index == VFIO_PCI_CONFIG_REGION_INDEX) {
+			info.size = pdev->cfg_size;
+		} else if (pci_resource_start(pdev, info.index)) {
+			unsigned long flags;
+
+			flags = pci_resource_flags(pdev, info.index);
+
+			info.flags |= VFIO_REGION_INFO_FLAG_READ;
+
+			/* Report the actual ROM size instead of the BAR size;
+			 * this gives the user an easy way to determine whether
+			 * there's anything here without trying to read it. */
+			if (info.index == VFIO_PCI_ROM_REGION_INDEX) {
+				void __iomem *io;
+				size_t size;
+
+				io = pci_map_rom(pdev, &size);
+				info.size = io ? size : 0;
+				pci_unmap_rom(pdev, io);
+			} else if (flags & IORESOURCE_MEM) {
+				info.size = pci_resource_len(pdev, info.index);
+				info.flags |= (VFIO_REGION_INFO_FLAG_WRITE |
+					       VFIO_REGION_INFO_FLAG_MMAP);
+			} else {
+				info.size = pci_resource_len(pdev, info.index);
+				info.flags |= VFIO_REGION_INFO_FLAG_WRITE;
+			}
+		} else
+			info.size = 0;
+
+		return copy_to_user((void __user *)arg, &info, minsz) ?
+			-EFAULT : 0;
+
+	} else if (cmd == VFIO_DEVICE_GET_IRQ_INFO) {
+		struct vfio_irq_info info;
+
+		minsz = offsetofend(struct vfio_irq_info, count);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz || info.index >= VFIO_PCI_NUM_IRQS)
+			return -EINVAL;
+
+		info.flags = VFIO_IRQ_INFO_EVENTFD;
+
+		info.count = vfio_pci_get_irq_count(vdev, info.index);
+
+		if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
+			info.flags |= (VFIO_IRQ_INFO_MASKABLE |
+				       VFIO_IRQ_INFO_AUTOMASKED);
+		else
+			info.flags |= VFIO_IRQ_INFO_NORESIZE;
+
+		return copy_to_user((void __user *)arg, &info, minsz) ?
+			-EFAULT : 0;
+
+	} else if (cmd == VFIO_DEVICE_SET_IRQS) {
+		struct vfio_irq_set hdr;
+		u8 *data = NULL;
+		int ret = 0;
+
+		minsz = offsetofend(struct vfio_irq_set, count);
+
+		if (copy_from_user(&hdr, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (hdr.argsz < minsz || hdr.index >= VFIO_PCI_NUM_IRQS ||
+		    hdr.flags & ~(VFIO_IRQ_SET_DATA_TYPE_MASK |
+				  VFIO_IRQ_SET_ACTION_TYPE_MASK))
+			return -EINVAL;
+
+		if (!(hdr.flags & VFIO_IRQ_SET_DATA_NONE)) {
+			size_t size;
+
+			if (hdr.flags & VFIO_IRQ_SET_DATA_BOOL)
+				size = sizeof(uint8_t);
+			else if (hdr.flags & VFIO_IRQ_SET_DATA_EVENTFD)
+				size = sizeof(int32_t);
+			else
+				return -EINVAL;
+
+			if (hdr.argsz - minsz < hdr.count * size ||
+			    hdr.count > vfio_pci_get_irq_count(vdev, hdr.index))
+				return -EINVAL;
+
+			data = kmalloc(hdr.count * size, GFP_KERNEL);
+			if (!data)
+				return -ENOMEM;
+
+			if (copy_from_user(data, (void __user *)(arg + minsz),
+					   hdr.count * size)) {
+				kfree(data);
+				return -EFAULT;
+			}
+		}
+
+		mutex_lock(&vdev->igate);
+
+		ret = vfio_pci_set_irqs_ioctl(vdev, hdr.flags, hdr.index,
+					      hdr.start, hdr.count, data);
+
+		mutex_unlock(&vdev->igate);
+		kfree(data);
+
+		return ret;
+
+	} else if (cmd == VFIO_DEVICE_RESET)
+		return vdev->reset_works ?
+			pci_reset_function(vdev->pdev) : -EINVAL;
+
+	return -ENOTTY;
+}
+
+static ssize_t vfio_pci_read(void *device_data, char __user *buf,
+			     size_t count, loff_t *ppos)
+{
+	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	struct vfio_pci_device *vdev = device_data;
+	struct pci_dev *pdev = vdev->pdev;
+
+	if (index >= VFIO_PCI_NUM_REGIONS)
+		return -EINVAL;
+
+	if (index == VFIO_PCI_CONFIG_REGION_INDEX)
+		return vfio_pci_config_readwrite(vdev, buf, count, ppos, false);
+	else if (index == VFIO_PCI_ROM_REGION_INDEX)
+		return vfio_pci_mem_readwrite(vdev, buf, count, ppos, false);
+	else if (pci_resource_flags(pdev, index) & IORESOURCE_IO)
+		return vfio_pci_io_readwrite(vdev, buf, count, ppos, false);
+	else if (pci_resource_flags(pdev, index) & IORESOURCE_MEM)
+		return vfio_pci_mem_readwrite(vdev, buf, count, ppos, false);
+
+	return -EINVAL;
+}
+
+static ssize_t vfio_pci_write(void *device_data, const char __user *buf,
+			      size_t count, loff_t *ppos)
+{
+	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	struct vfio_pci_device *vdev = device_data;
+	struct pci_dev *pdev = vdev->pdev;
+
+	if (index >= VFIO_PCI_NUM_REGIONS)
+		return -EINVAL;
+
+	if (index == VFIO_PCI_CONFIG_REGION_INDEX)
+		return vfio_pci_config_readwrite(vdev, (char __user *)buf,
+						 count, ppos, true);
+	else if (index == VFIO_PCI_ROM_REGION_INDEX)
+		return -EINVAL;
+	else if (pci_resource_flags(pdev, index) & IORESOURCE_IO)
+		return vfio_pci_io_readwrite(vdev, (char __user *)buf,
+					     count, ppos, true);
+	else if (pci_resource_flags(pdev, index) & IORESOURCE_MEM) {
+		return vfio_pci_mem_readwrite(vdev, (char __user *)buf,
+					      count, ppos, true);
+	}
+
+	return -EINVAL;
+}
+
+static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
+{
+	struct vfio_pci_device *vdev = device_data;
+	struct pci_dev *pdev = vdev->pdev;
+	unsigned int index;
+	u64 phys_len, req_len, pgoff, req_start, phys;
+	int ret;
+
+	index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
+
+	if (vma->vm_end < vma->vm_start)
+		return -EINVAL;
+	if ((vma->vm_flags & VM_SHARED) == 0)
+		return -EINVAL;
+	if (index >= VFIO_PCI_ROM_REGION_INDEX)
+		return -EINVAL;
+	if (!(pci_resource_flags(pdev, index) & IORESOURCE_MEM))
+		return -EINVAL;
+
+	phys_len = pci_resource_len(pdev, index);
+	req_len = vma->vm_end - vma->vm_start;
+	pgoff = vma->vm_pgoff &
+		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
+	req_start = pgoff << PAGE_SHIFT;
+
+	if (phys_len < PAGE_SIZE || req_start + req_len > phys_len)
+		return -EINVAL;
+
+	if (index == vdev->msix_bar) {
+		/*
+		 * Disallow mmaps overlapping the MSI-X table; users don't
+		 * get to touch this directly.  We could find somewhere
+		 * else to map the overlap, but page granularity is only
+		 * a recommendation, not a requirement, so the user needs
+		 * to know which bits are real.  Requiring them to mmap
+		 * around the table makes that clear.
+		 */
+
+		/* If neither entirely above nor below, then it overlaps */
+		if (!(req_start >= vdev->msix_offset + vdev->msix_size ||
+		      req_start + req_len <= vdev->msix_offset))
+			return -EINVAL;
+	}
+
+	/*
+	 * Even though we don't make use of the barmap for the mmap,
+	 * we need to request the region and the barmap tracks that.
+	 */
+	if (!vdev->barmap[index]) {
+		ret = pci_request_selected_regions(pdev,
+						   1 << index, "vfio-pci");
+		if (ret)
+			return ret;
+
+		vdev->barmap[index] = pci_iomap(pdev, index, 0);
+	}
+
+	vma->vm_private_data = vdev;
+	vma->vm_flags |= (VM_IO | VM_RESERVED);
+	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+
+	phys = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
+
+	return remap_pfn_range(vma, vma->vm_start, phys,
+			       req_len, vma->vm_page_prot);
+}
+
+static const struct vfio_device_ops vfio_pci_ops = {
+	.name		= "vfio-pci",
+	.open		= vfio_pci_open,
+	.release	= vfio_pci_release,
+	.ioctl		= vfio_pci_ioctl,
+	.read		= vfio_pci_read,
+	.write		= vfio_pci_write,
+	.mmap		= vfio_pci_mmap,
+};
+
+static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
+{
+	u8 type;
+	struct vfio_pci_device *vdev;
+	struct iommu_group *group;
+	int ret;
+
+	pci_read_config_byte(pdev, PCI_HEADER_TYPE, &type);
+	/* Mask the multi-function bit; we only handle type 0 headers */
+	if ((type & 0x7f) != PCI_HEADER_TYPE_NORMAL)
+		return -EINVAL;
+
+	group = iommu_group_get(&pdev->dev);
+	if (!group)
+		return -EINVAL;
+
+	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
+	if (!vdev) {
+		iommu_group_put(group);
+		return -ENOMEM;
+	}
+
+	vdev->pdev = pdev;
+	vdev->irq_type = VFIO_PCI_NUM_IRQS;
+	mutex_init(&vdev->igate);
+	spin_lock_init(&vdev->irqlock);
+	atomic_set(&vdev->refcnt, 0);
+
+	ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
+	if (ret) {
+		iommu_group_put(group);
+		kfree(vdev);
+	}
+
+	return ret;
+}
+
+static void vfio_pci_remove(struct pci_dev *pdev)
+{
+	struct vfio_pci_device *vdev;
+
+	vdev = vfio_del_group_dev(&pdev->dev);
+	if (!vdev)
+		return;
+
+	iommu_group_put(pdev->dev.iommu_group);
+	kfree(vdev);
+}
+
+static struct pci_driver vfio_pci_driver = {
+	.name		= "vfio-pci",
+	.id_table	= NULL, /* only dynamic ids */
+	.probe		= vfio_pci_probe,
+	.remove		= vfio_pci_remove,
+};
+
+void __exit vfio_pci_cleanup(void)
+{
+	pci_unregister_driver(&vfio_pci_driver);
+	vfio_pci_virqfd_exit();
+	vfio_pci_uninit_perm_bits();
+}
+
+int __init vfio_pci_init(void)
+{
+	int ret;
+
+	/* Allocate shared config space permission data used by all devices */
+	ret = vfio_pci_init_perm_bits();
+	if (ret)
+		return ret;
+
+	/* Start the virqfd cleanup handler */
+	ret = vfio_pci_virqfd_init();
+	if (ret)
+		goto out_virqfd;
+
+	/* Register and scan for devices */
+	ret = pci_register_driver(&vfio_pci_driver);
+	if (ret)
+		goto out_driver;
+
+	return 0;
+
+out_virqfd:
+	vfio_pci_virqfd_exit();
+out_driver:
+	vfio_pci_uninit_perm_bits();
+	return ret;
+}
+
+module_init(vfio_pci_init);
+module_exit(vfio_pci_cleanup);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
new file mode 100644
index 0000000..a909433
--- /dev/null
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -0,0 +1,1527 @@
+/*
+ * VFIO PCI config space virtualization
+ *
+ * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
+ *     Author: Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Derived from original vfio:
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ * Author: Tom Lyon, pugs@cisco.com
+ */
+
+/*
+ * This code handles reading and writing of PCI configuration registers.
+ * This is hairy because we want to allow a lot of flexibility to the
+ * user driver, but cannot trust it with all of the config fields.
+ * Tables determine which fields can be read and written, as well as
+ * which fields are 'virtualized' - special actions and translations
+ * that make it appear to users that they have control, when in fact
+ * things must be negotiated with the underlying OS.
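+ *
+ * For example (a sketch of the read path): a read of the COMMAND
+ * register walks pci_config_map to find the capability owning that
+ * dword, then dispatches to that capability's perm_bits->readfn,
+ * which merges virtualized bits from vconfig with pass-through bits
+ * read from the hardware.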
+ */
+
+#include <linux/fs.h>
+#include <linux/pci.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+
+#include "vfio_pci_private.h"
+
+#define PCI_CFG_SPACE_SIZE	256
+
+/* Useful "pseudo" capabilities */
+#define PCI_CAP_ID_BASIC	0
+#define PCI_CAP_ID_INVALID	0xFF
+
+#define is_bar(offset)	\
+	((offset >= PCI_BASE_ADDRESS_0 && offset < PCI_BASE_ADDRESS_5 + 4) || \
+	 (offset >= PCI_ROM_ADDRESS && offset < PCI_ROM_ADDRESS + 4))
+
+/*
+ * Lengths of PCI Config Capabilities
+ *   0: Removed from the user visible capability list
+ *   FF: Variable length
+ */
+static u8 pci_cap_length[] = {
+	[PCI_CAP_ID_BASIC]	= PCI_STD_HEADER_SIZEOF, /* pci config header */
+	[PCI_CAP_ID_PM]		= PCI_PM_SIZEOF,
+	[PCI_CAP_ID_AGP]	= PCI_AGP_SIZEOF,
+	[PCI_CAP_ID_VPD]	= PCI_CAP_VPD_SIZEOF,
+	[PCI_CAP_ID_SLOTID]	= 0,		/* bridge - don't care */
+	[PCI_CAP_ID_MSI]	= 0xFF,		/* 10, 14, 20, or 24 */
+	[PCI_CAP_ID_CHSWP]	= 0,		/* cpci - not yet */
+	[PCI_CAP_ID_PCIX]	= 0xFF,		/* 8 or 24 */
+	[PCI_CAP_ID_HT]		= 0xFF,		/* hypertransport */
+	[PCI_CAP_ID_VNDR]	= 0xFF,		/* variable */
+	[PCI_CAP_ID_DBG]	= 0,		/* debug - don't care */
+	[PCI_CAP_ID_CCRC]	= 0,		/* cpci - not yet */
+	[PCI_CAP_ID_SHPC]	= 0,		/* hotswap - not yet */
+	[PCI_CAP_ID_SSVID]	= 0,		/* bridge - don't care */
+	[PCI_CAP_ID_AGP3]	= 0,		/* AGP8x - not yet */
+	[PCI_CAP_ID_SECDEV]	= 0,		/* secure device not yet */
+	[PCI_CAP_ID_EXP]	= 0xFF,		/* 20 or 44 */
+	[PCI_CAP_ID_MSIX]	= PCI_CAP_MSIX_SIZEOF,
+	[PCI_CAP_ID_SATA]	= 0xFF,
+	[PCI_CAP_ID_AF]		= PCI_CAP_AF_SIZEOF,
+};
+
+/*
+ * Lengths of PCIe/PCI-X Extended Config Capabilities
+ *   0: Removed or masked from the user visible capability list
+ *   FF: Variable length
+ */
+static u16 pci_ext_cap_length[] = {
+	[PCI_EXT_CAP_ID_ERR]	=	PCI_ERR_ROOT_COMMAND,
+	[PCI_EXT_CAP_ID_VC]	=	0xFF,
+	[PCI_EXT_CAP_ID_DSN]	=	PCI_EXT_CAP_DSN_SIZEOF,
+	[PCI_EXT_CAP_ID_PWR]	=	PCI_EXT_CAP_PWR_SIZEOF,
+	[PCI_EXT_CAP_ID_RCLD]	=	0,	/* root only - don't care */
+	[PCI_EXT_CAP_ID_RCILC]	=	0,	/* root only - don't care */
+	[PCI_EXT_CAP_ID_RCEC]	=	0,	/* root only - don't care */
+	[PCI_EXT_CAP_ID_MFVC]	=	0xFF,
+	[PCI_EXT_CAP_ID_VC9]	=	0xFF,	/* same as CAP_ID_VC */
+	[PCI_EXT_CAP_ID_RCRB]	=	0,	/* root only - don't care */
+	[PCI_EXT_CAP_ID_VNDR]	=	0xFF,
+	[PCI_EXT_CAP_ID_CAC]	=	0,	/* obsolete */
+	[PCI_EXT_CAP_ID_ACS]	=	0xFF,
+	[PCI_EXT_CAP_ID_ARI]	=	PCI_EXT_CAP_ARI_SIZEOF,
+	[PCI_EXT_CAP_ID_ATS]	=	PCI_EXT_CAP_ATS_SIZEOF,
+	[PCI_EXT_CAP_ID_SRIOV]	=	PCI_EXT_CAP_SRIOV_SIZEOF,
+	[PCI_EXT_CAP_ID_MRIOV]	=	0,	/* not yet */
+	[PCI_EXT_CAP_ID_MCAST]	=	PCI_EXT_CAP_MCAST_ENDPOINT_SIZEOF,
+	[PCI_EXT_CAP_ID_PRI]	=	PCI_EXT_CAP_PRI_SIZEOF,
+	[PCI_EXT_CAP_ID_AMD_XXX] =	0,	/* not yet */
+	[PCI_EXT_CAP_ID_REBAR]	=	0xFF,
+	[PCI_EXT_CAP_ID_DPA]	=	0xFF,
+	[PCI_EXT_CAP_ID_TPH]	=	0xFF,
+	[PCI_EXT_CAP_ID_LTR]	=	PCI_EXT_CAP_LTR_SIZEOF,
+	[PCI_EXT_CAP_ID_SECPCI]	=	0,	/* not yet */
+	[PCI_EXT_CAP_ID_PMUX]	=	0,	/* not yet */
+	[PCI_EXT_CAP_ID_PASID]	=	0,	/* not yet */
+};
+
+/*
+ * Read/Write Permission Bits - one bit for each bit in the capability.
+ * Any field can be read if it exists, but what is read depends on
+ * whether the field is 'virtualized' or passed straight through to
+ * the hardware.  Any virtualized field is also virtualized for writes.
+ * Writes are only permitted where the write bitmap has a 1 bit.
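+ *
+ * For example: a bit with virt=1/write=1 is emulated entirely in
+ * vconfig, a bit with virt=0/write=1 passes user writes through to
+ * the hardware, and a bit with write=0 silently drops user writes.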
+ */
+struct perm_bits {
+	u8	*virt;		/* read/write virtual data, not hw */
+	u8	*write;		/* writeable bits */
+	int	(*readfn)(struct vfio_pci_device *vdev, int pos, int count,
+			  struct perm_bits *perm, int offset, u32 *val);
+	int	(*writefn)(struct vfio_pci_device *vdev, int pos, int count,
+			   struct perm_bits *perm, int offset, u32 val);
+};
+
+#define	NO_VIRT		0
+#define	ALL_VIRT	0xFFFFFFFFU
+#define	NO_WRITE	0
+#define	ALL_WRITE	0xFFFFFFFFU
+
+static int vfio_user_config_read(struct pci_dev *pdev, int offset,
+				 u32 *val, int count)
+{
+	int ret = -EINVAL;
+
+	switch (count) {
+	case 1:
+		ret = pci_user_read_config_byte(pdev, offset, (u8 *)val);
+		break;
+	case 2:
+		ret = pci_user_read_config_word(pdev, offset, (u16 *)val);
+		*val = cpu_to_le16(*val);
+		break;
+	case 4:
+		ret = pci_user_read_config_dword(pdev, offset, val);
+		*val = cpu_to_le32(*val);
+		break;
+	}
+
+	return pcibios_err_to_errno(ret);
+}
+
+static int vfio_user_config_write(struct pci_dev *pdev, int offset,
+				  u32 val, int count)
+{
+	int ret = -EINVAL;
+
+	switch (count) {
+	case 1:
+		ret = pci_user_write_config_byte(pdev, offset, val);
+		break;
+	case 2:
+		ret = pci_user_write_config_word(pdev, offset, val);
+		break;
+	case 4:
+		ret = pci_user_write_config_dword(pdev, offset, val);
+		break;
+	}
+
+	return pcibios_err_to_errno(ret);
+}
+
+static int vfio_default_config_read(struct vfio_pci_device *vdev, int pos,
+				    int count, struct perm_bits *perm,
+				    int offset, u32 *val)
+{
+	u32 virt = 0;
+
+	memcpy(val, vdev->vconfig + pos, count);
+
+	memcpy(&virt, perm->virt + offset, count);
+
+	/* Any non-virtualized bits? */
+	if (cpu_to_le32(~0U >> (32 - (count * 8))) != virt) {
+		struct pci_dev *pdev = vdev->pdev;
+		u32 phys_val = 0;
+		int ret;
+
+		ret = vfio_user_config_read(pdev, pos, &phys_val, count);
+		if (ret)
+			return ret;
+
+		*val = (phys_val & ~virt) | (*val & virt);
+	}
+
+	return count;
+}
+
+static int vfio_default_config_write(struct vfio_pci_device *vdev, int pos,
+				     int count, struct perm_bits *perm,
+				     int offset, u32 val)
+{
+	u32 virt = 0, write = 0;
+
+	memcpy(&write, perm->write + offset, count);
+
+	if (!write)
+		return count; /* drop, no writable bits */
+
+	memcpy(&virt, perm->virt + offset, count);
+
+	/* Virtualized and writable bits go to vconfig */
+	if (write & virt) {
+		u32 virt_val = 0;
+
+		memcpy(&virt_val, vdev->vconfig + pos, count);
+
+		virt_val &= ~(write & virt);
+		virt_val |= (val & (write & virt));
+
+		memcpy(vdev->vconfig + pos, &virt_val, count);
+	}
+
+	/* Non-virtualized and writable bits go to hardware */
+	if (write & ~virt) {
+		struct pci_dev *pdev = vdev->pdev;
+		u32 phys_val = 0;
+		int ret;
+
+		ret = vfio_user_config_read(pdev, pos, &phys_val, count);
+		if (ret)
+			return ret;
+
+		phys_val &= ~(write & ~virt);
+		phys_val |= (val & (write & ~virt));
+
+		ret = vfio_user_config_write(pdev, pos, phys_val, count);
+		if (ret)
+			return ret;
+	}
+
+	return count;
+}
+
+/* Allow direct read from hardware, except for capability next pointer */
+static int vfio_direct_config_read(struct vfio_pci_device *vdev, int pos,
+				   int count, struct perm_bits *perm,
+				   int offset, u32 *val)
+{
+	int ret;
+
+	ret = vfio_user_config_read(vdev->pdev, pos, val, count);
+	if (ret)
+		return pcibios_err_to_errno(ret);
+
+	if (pos >= PCI_CFG_SPACE_SIZE) { /* Extended cap header mangling */
+		if (offset < 4)
+			memcpy(val, vdev->vconfig + pos, count);
+	} else if (pos >= PCI_STD_HEADER_SIZEOF) { /* Std cap mangling */
+		if (offset == PCI_CAP_LIST_ID && count > 1)
+			memcpy(val, vdev->vconfig + pos,
+			       min(PCI_CAP_FLAGS, count));
+		else if (offset == PCI_CAP_LIST_NEXT)
+			memcpy(val, vdev->vconfig + pos, 1);
+	}
+
+	return count;
+}
+
+static int vfio_direct_config_write(struct vfio_pci_device *vdev, int pos,
+				    int count, struct perm_bits *perm,
+				    int offset, u32 val)
+{
+	int ret;
+
+	ret = vfio_user_config_write(vdev->pdev, pos, val, count);
+	if (ret)
+		return ret;
+
+	return count;
+}
+
+/* Default all regions to read-only, no-virtualization */
+static struct perm_bits cap_perms[PCI_CAP_ID_MAX + 1] = {
+	[0 ... PCI_CAP_ID_MAX] = { .readfn = vfio_direct_config_read }
+};
+static struct perm_bits ecap_perms[PCI_EXT_CAP_ID_MAX + 1] = {
+	[0 ... PCI_EXT_CAP_ID_MAX] = { .readfn = vfio_direct_config_read }
+};
+
+static void free_perm_bits(struct perm_bits *perm)
+{
+	kfree(perm->virt);
+	kfree(perm->write);
+	perm->virt = NULL;
+	perm->write = NULL;
+}
+
+static int alloc_perm_bits(struct perm_bits *perm, int size)
+{
+	/*
+	 * Round up all permission bits to the next dword, this lets us
+	 * ignore whether a read/write exceeds the defined capability
+	 * structure.  We can do this because:
+	 *  - Standard config space is already dword aligned
+	 *  - Capabilities are all dword aligned (bits 0:1 of next reserved)
+	 *  - Express capabilities defined as dword aligned
+	 */
+	size = round_up(size, 4);
+
+	/*
+	 * Zero state is
+	 * - All Readable, None Writeable, None Virtualized
+	 */
+	perm->virt = kzalloc(size, GFP_KERNEL);
+	perm->write = kzalloc(size, GFP_KERNEL);
+	if (!perm->virt || !perm->write) {
+		free_perm_bits(perm);
+		return -ENOMEM;
+	}
+
+	perm->readfn = vfio_default_config_read;
+	perm->writefn = vfio_default_config_write;
+
+	return 0;
+}
+
+/*
+ * Helper functions for filling in permission tables
+ */
+static inline void p_setb(struct perm_bits *p, int off, u8 virt, u8 write)
+{
+	p->virt[off] = virt;
+	p->write[off] = write;
+}
+
+/* Handle endian-ness - pci and tables are little-endian */
+static inline void p_setw(struct perm_bits *p, int off, u16 virt, u16 write)
+{
+	*(u16 *)(&p->virt[off]) = cpu_to_le16(virt);
+	*(u16 *)(&p->write[off]) = cpu_to_le16(write);
+}
+
+/* Handle endian-ness - pci and tables are little-endian */
+static inline void p_setd(struct perm_bits *p, int off, u32 virt, u32 write)
+{
+	*(u32 *)(&p->virt[off]) = cpu_to_le32(virt);
+	*(u32 *)(&p->write[off]) = cpu_to_le32(write);
+}
+
+/*
+ * Restore the *real* BARs after we detect a FLR or backdoor reset.
+ * (backdoor = some device specific technique that we didn't catch)
+ */
+static void vfio_bar_restore(struct vfio_pci_device *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	u32 *rbar = vdev->rbar;
+	int i;
+
+	if (pdev->is_virtfn)
+		return;
+
+	printk(KERN_INFO "%s: %s reset recovery - restoring bars\n",
+	       __func__, dev_name(&pdev->dev));
+
+	for (i = PCI_BASE_ADDRESS_0; i <= PCI_BASE_ADDRESS_5; i += 4, rbar++)
+		pci_user_write_config_dword(pdev, i, *rbar);
+
+	pci_user_write_config_dword(pdev, PCI_ROM_ADDRESS, *rbar);
+}
+
+static u32 vfio_generate_bar_flags(struct pci_dev *pdev, int bar)
+{
+	unsigned long flags = pci_resource_flags(pdev, bar);
+	u32 val;
+
+	if (flags & IORESOURCE_IO)
+		return cpu_to_le32(PCI_BASE_ADDRESS_SPACE_IO);
+
+	val = PCI_BASE_ADDRESS_SPACE_MEMORY;
+
+	if (flags & IORESOURCE_PREFETCH)
+		val |= PCI_BASE_ADDRESS_MEM_PREFETCH;
+
+	if (flags & IORESOURCE_MEM_64)
+		val |= PCI_BASE_ADDRESS_MEM_TYPE_64;
+
+	return cpu_to_le32(val);
+}
+
+/*
+ * Pretend we're hardware and tweak the values of the *virtual* PCI BARs
+ * to reflect the hardware capabilities.  This implements BAR sizing.
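+ *
+ * For example (sketch): to size a 4K memory BAR the user writes
+ * 0xFFFFFFFF to the virtual BAR and reads back 0xFFFFF000 plus the
+ * flag bits, just as real hardware would respond.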
+ */
+static void vfio_bar_fixup(struct vfio_pci_device *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	int i;
+	u32 *bar;
+	u64 mask;
+
+	bar = (u32 *)&vdev->vconfig[PCI_BASE_ADDRESS_0];
+
+	for (i = PCI_STD_RESOURCES; i <= PCI_STD_RESOURCE_END; i++, bar++) {
+		if (!pci_resource_start(pdev, i)) {
+			*bar = 0; /* Unmapped by host = unimplemented to user */
+			continue;
+		}
+
+		mask = ~(pci_resource_len(pdev, i) - 1);
+
+		*bar &= cpu_to_le32((u32)mask);
+		*bar |= vfio_generate_bar_flags(pdev, i);
+
+		if (*bar & cpu_to_le32(IORESOURCE_MEM_64)) {
+			bar++;
+			*bar &= cpu_to_le32((u32)(mask >> 32));
+			i++;
+		}
+	}
+
+	bar = (u32 *)&vdev->vconfig[PCI_ROM_ADDRESS];
+
+	/*
+	 * NB. we expose the actual BAR size here, regardless of whether
+	 * we can read it.  When we report the REGION_INFO for the ROM
+	 * we report what PCI tells us is the actual ROM size.
+	 */
+	if (pci_resource_start(pdev, PCI_ROM_RESOURCE)) {
+		mask = ~(pci_resource_len(pdev, PCI_ROM_RESOURCE) - 1);
+		mask |= PCI_ROM_ADDRESS_ENABLE;
+		*bar &= cpu_to_le32((u32)mask);
+	} else
+		*bar = 0;
+
+	vdev->bardirty = false;
+}
+
+static int vfio_basic_config_read(struct vfio_pci_device *vdev, int pos,
+				  int count, struct perm_bits *perm,
+				  int offset, u32 *val)
+{
+	if (is_bar(offset)) /* pos == offset for basic config */
+		vfio_bar_fixup(vdev);
+
+	count = vfio_default_config_read(vdev, pos, count, perm, offset, val);
+
+	/* Mask in virtual memory enable for SR-IOV devices */
+	if (offset == PCI_COMMAND && vdev->pdev->is_virtfn) {
+		u16 cmd = *(u16 *)&vdev->vconfig[PCI_COMMAND];
+		*val |= cmd & cpu_to_le16(PCI_COMMAND_MEMORY);
+	}
+
+	return count;
+}
+
+static int vfio_basic_config_write(struct vfio_pci_device *vdev, int pos,
+				   int count, struct perm_bits *perm,
+				   int offset, u32 val)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	u16 phys_cmd, *virt_cmd, new_cmd = 0;
+	int ret;
+
+	virt_cmd = (u16 *)&vdev->vconfig[PCI_COMMAND];
+
+	if (offset == PCI_COMMAND) {
+		bool phys_mem, virt_mem, new_mem, phys_io, virt_io, new_io;
+
+		ret = pci_user_read_config_word(pdev, PCI_COMMAND, &phys_cmd);
+		if (ret)
+			return pcibios_err_to_errno(ret);
+
+		switch (count) {
+		case 1:
+			new_cmd = val;
+			break;
+		case 2:
+			new_cmd = le16_to_cpu(val);
+			break;
+		case 4:
+			new_cmd = (u16)le32_to_cpu(val);
+			break;
+		}
+
+		phys_mem = !!(phys_cmd & PCI_COMMAND_MEMORY);
+		virt_mem = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_MEMORY);
+		new_mem = !!(new_cmd & PCI_COMMAND_MEMORY);
+
+		phys_io = !!(phys_cmd & PCI_COMMAND_IO);
+		virt_io = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_IO);
+		new_io = !!(new_cmd & PCI_COMMAND_IO);
+
+		/*
+		 * If the user is writing mem/io enable (new_mem/io) and we
+		 * think it's already enabled (virt_mem/io), but the hardware
+		 * shows it disabled (phys_mem/io), then the device has
+		 * undergone some kind of backdoor reset and needs to be
+		 * restored before we allow it to enable the bars.
+		 * SR-IOV devices will trigger this, but we catch them later.
+		 */
+		if ((new_mem && virt_mem && !phys_mem) ||
+		    (new_io && virt_io && !phys_io))
+			vfio_bar_restore(vdev);
+	}
+
+	count = vfio_default_config_write(vdev, pos, count, perm, offset, val);
+	if (count < 0)
+		return count;
+
+	/*
+	 * Save current memory/io enable bits in vconfig to allow for
+	 * the test above next time.
+	 */
+	if (offset == PCI_COMMAND) {
+		u16 mask = PCI_COMMAND_MEMORY | PCI_COMMAND_IO;
+
+		*virt_cmd &= cpu_to_le16(~mask);
+		*virt_cmd |= new_cmd & cpu_to_le16(mask);
+	}
+
+	/* Emulate INTx disable */
+	if (offset >= PCI_COMMAND && offset <= PCI_COMMAND + 1) {
+		bool virt_intx_disable;
+
+		virt_intx_disable = !!(le16_to_cpu(*virt_cmd) &
+				       PCI_COMMAND_INTX_DISABLE);
+
+		if (virt_intx_disable && !vdev->virq_disabled) {
+			vdev->virq_disabled = true;
+			vfio_pci_intx_mask(vdev);
+		} else if (!virt_intx_disable && vdev->virq_disabled) {
+			vdev->virq_disabled = false;
+			vfio_pci_intx_unmask(vdev);
+		}
+	}
+
+	if (is_bar(offset))
+		vdev->bardirty = true;
+
+	return count;
+}
+
+/* Permissions for the Basic PCI Header */
+static int __init init_pci_cap_basic_perm(struct perm_bits *perm)
+{
+	if (alloc_perm_bits(perm, PCI_STD_HEADER_SIZEOF))
+		return -ENOMEM;
+
+	perm->readfn = vfio_basic_config_read;
+	perm->writefn = vfio_basic_config_write;
+
+	/* Virtualized for SR-IOV functions, which otherwise read FFFF */
+	p_setw(perm, PCI_VENDOR_ID, (u16)ALL_VIRT, NO_WRITE);
+	p_setw(perm, PCI_DEVICE_ID, (u16)ALL_VIRT, NO_WRITE);
+
+	/*
+	 * Virtualize INTx disable, we use it internally for interrupt
+	 * control and can emulate it for non-PCI 2.3 devices.
+	 */
+	p_setw(perm, PCI_COMMAND, PCI_COMMAND_INTX_DISABLE, (u16)ALL_WRITE);
+
+	/* Virtualize capability list, we might want to skip/disable */
+	p_setw(perm, PCI_STATUS, PCI_STATUS_CAP_LIST, NO_WRITE);
+
+	/* No harm to write */
+	p_setb(perm, PCI_CACHE_LINE_SIZE, NO_VIRT, (u8)ALL_WRITE);
+	p_setb(perm, PCI_LATENCY_TIMER, NO_VIRT, (u8)ALL_WRITE);
+	p_setb(perm, PCI_BIST, NO_VIRT, (u8)ALL_WRITE);
+
+	/* Virtualize all bars, can't touch the real ones */
+	p_setd(perm, PCI_BASE_ADDRESS_0, ALL_VIRT, ALL_WRITE);
+	p_setd(perm, PCI_BASE_ADDRESS_1, ALL_VIRT, ALL_WRITE);
+	p_setd(perm, PCI_BASE_ADDRESS_2, ALL_VIRT, ALL_WRITE);
+	p_setd(perm, PCI_BASE_ADDRESS_3, ALL_VIRT, ALL_WRITE);
+	p_setd(perm, PCI_BASE_ADDRESS_4, ALL_VIRT, ALL_WRITE);
+	p_setd(perm, PCI_BASE_ADDRESS_5, ALL_VIRT, ALL_WRITE);
+	p_setd(perm, PCI_ROM_ADDRESS, ALL_VIRT, ALL_WRITE);
+
+	/* Allow us to adjust capability chain */
+	p_setb(perm, PCI_CAPABILITY_LIST, (u8)ALL_VIRT, NO_WRITE);
+
+	/* Sometimes used by sw, just virtualize */
+	p_setb(perm, PCI_INTERRUPT_LINE, (u8)ALL_VIRT, (u8)ALL_WRITE);
+	return 0;
+}
+
+/* Permissions for the Power Management capability */
+static int __init init_pci_cap_pm_perm(struct perm_bits *perm)
+{
+	if (alloc_perm_bits(perm, pci_cap_length[PCI_CAP_ID_PM]))
+		return -ENOMEM;
+
+	/*
+	 * We always virtualize the next field so we can remove
+	 * capabilities from the chain if we want to.
+	 */
+	p_setb(perm, PCI_CAP_LIST_NEXT, (u8)ALL_VIRT, NO_WRITE);
+
+	/*
+	 * Power management is defined *per function*,
+	 * so we let the user write this
+	 */
+	p_setd(perm, PCI_PM_CTRL, NO_VIRT, ALL_WRITE);
+	return 0;
+}
+
+/* Permissions for PCI-X capability */
+static int __init init_pci_cap_pcix_perm(struct perm_bits *perm)
+{
+	/* Alloc 24, but only 8 are used in v0 */
+	if (alloc_perm_bits(perm, PCI_CAP_PCIX_SIZEOF_V12))
+		return -ENOMEM;
+
+	p_setb(perm, PCI_CAP_LIST_NEXT, (u8)ALL_VIRT, NO_WRITE);
+
+	p_setw(perm, PCI_X_CMD, NO_VIRT, (u16)ALL_WRITE);
+	p_setd(perm, PCI_X_ECC_CSR, NO_VIRT, ALL_WRITE);
+	return 0;
+}
+
+/* Permissions for PCI Express capability */
+static int __init init_pci_cap_exp_perm(struct perm_bits *perm)
+{
+	/* Alloc larger of two possible sizes */
+	if (alloc_perm_bits(perm, PCI_CAP_EXP_ENDPOINT_SIZEOF_V2))
+		return -ENOMEM;
+
+	p_setb(perm, PCI_CAP_LIST_NEXT, (u8)ALL_VIRT, NO_WRITE);
+
+	/*
+	 * Allow writes to device control fields (includes FLR!)
+	 * but not to devctl_phantom which could confuse IOMMU
+	 * or to the ARI bit in devctl2 which is set at probe time
+	 */
+	p_setw(perm, PCI_EXP_DEVCTL, NO_VIRT, ~PCI_EXP_DEVCTL_PHANTOM);
+	p_setw(perm, PCI_EXP_DEVCTL2, NO_VIRT, ~PCI_EXP_DEVCTL2_ARI);
+	return 0;
+}
+
+/* Permissions for Advanced Function capability */
+static int __init init_pci_cap_af_perm(struct perm_bits *perm)
+{
+	if (alloc_perm_bits(perm, pci_cap_length[PCI_CAP_ID_AF]))
+		return -ENOMEM;
+
+	p_setb(perm, PCI_CAP_LIST_NEXT, (u8)ALL_VIRT, NO_WRITE);
+	p_setb(perm, PCI_AF_CTRL, NO_VIRT, PCI_AF_CTRL_FLR);
+	return 0;
+}
+
+/* Permissions for Advanced Error Reporting extended capability */
+static int __init init_pci_ext_cap_err_perm(struct perm_bits *perm)
+{
+	u32 mask;
+
+	if (alloc_perm_bits(perm, pci_ext_cap_length[PCI_EXT_CAP_ID_ERR]))
+		return -ENOMEM;
+
+	/*
+	 * Virtualize the first dword of all express capabilities
+	 * because it includes the next pointer.  This lets us later
+	 * remove capabilities from the chain if we need to.
+	 */
+	p_setd(perm, 0, ALL_VIRT, NO_WRITE);
+
+	/* Writable bits mask */
+	mask =	PCI_ERR_UNC_TRAIN |		/* Training */
+		PCI_ERR_UNC_DLP |		/* Data Link Protocol */
+		PCI_ERR_UNC_SURPDN |		/* Surprise Down */
+		PCI_ERR_UNC_POISON_TLP |	/* Poisoned TLP */
+		PCI_ERR_UNC_FCP |		/* Flow Control Protocol */
+		PCI_ERR_UNC_COMP_TIME |		/* Completion Timeout */
+		PCI_ERR_UNC_COMP_ABORT |	/* Completer Abort */
+		PCI_ERR_UNC_UNX_COMP |		/* Unexpected Completion */
+		PCI_ERR_UNC_RX_OVER |		/* Receiver Overflow */
+		PCI_ERR_UNC_MALF_TLP |		/* Malformed TLP */
+		PCI_ERR_UNC_ECRC |		/* ECRC Error Status */
+		PCI_ERR_UNC_UNSUP |		/* Unsupported Request */
+		PCI_ERR_UNC_ACSV |		/* ACS Violation */
+		PCI_ERR_UNC_INTN |		/* internal error */
+		PCI_ERR_UNC_MCBTLP |		/* MC blocked TLP */
+		PCI_ERR_UNC_ATOMEG |		/* Atomic egress blocked */
+		PCI_ERR_UNC_TLPPRE;		/* TLP prefix blocked */
+	p_setd(perm, PCI_ERR_UNCOR_STATUS, NO_VIRT, mask);
+	p_setd(perm, PCI_ERR_UNCOR_MASK, NO_VIRT, mask);
+	p_setd(perm, PCI_ERR_UNCOR_SEVER, NO_VIRT, mask);
+
+	mask =	PCI_ERR_COR_RCVR |		/* Receiver Error Status */
+		PCI_ERR_COR_BAD_TLP |		/* Bad TLP Status */
+		PCI_ERR_COR_BAD_DLLP |		/* Bad DLLP Status */
+		PCI_ERR_COR_REP_ROLL |		/* REPLAY_NUM Rollover */
+		PCI_ERR_COR_REP_TIMER |		/* Replay Timer Timeout */
+		PCI_ERR_COR_ADV_NFAT |		/* Advisory Non-Fatal */
+		PCI_ERR_COR_INTERNAL |		/* Corrected Internal */
+		PCI_ERR_COR_LOG_OVER;		/* Header Log Overflow */
+	p_setd(perm, PCI_ERR_COR_STATUS, NO_VIRT, mask);
+	p_setd(perm, PCI_ERR_COR_MASK, NO_VIRT, mask);
+
+	mask =	PCI_ERR_CAP_ECRC_GENE |		/* ECRC Generation Enable */
+		PCI_ERR_CAP_ECRC_CHKE;		/* ECRC Check Enable */
+	p_setd(perm, PCI_ERR_CAP, NO_VIRT, mask);
+	return 0;
+}
+
+/* Permissions for Power Budgeting extended capability */
+static int __init init_pci_ext_cap_pwr_perm(struct perm_bits *perm)
+{
+	if (alloc_perm_bits(perm, pci_ext_cap_length[PCI_EXT_CAP_ID_PWR]))
+		return -ENOMEM;
+
+	p_setd(perm, 0, ALL_VIRT, NO_WRITE);
+
+	/* Writing the data selector is OK, the info is still read-only */
+	p_setb(perm, PCI_PWR_DATA, NO_VIRT, (u8)ALL_WRITE);
+	return 0;
+}
+
+/*
+ * Initialize the shared permission tables
+ */
+void vfio_pci_uninit_perm_bits(void)
+{
+	free_perm_bits(&cap_perms[PCI_CAP_ID_BASIC]);
+
+	free_perm_bits(&cap_perms[PCI_CAP_ID_PM]);
+	free_perm_bits(&cap_perms[PCI_CAP_ID_PCIX]);
+	free_perm_bits(&cap_perms[PCI_CAP_ID_EXP]);
+	free_perm_bits(&cap_perms[PCI_CAP_ID_AF]);
+
+	free_perm_bits(&ecap_perms[PCI_EXT_CAP_ID_ERR]);
+	free_perm_bits(&ecap_perms[PCI_EXT_CAP_ID_PWR]);
+}
+
+int __init vfio_pci_init_perm_bits(void)
+{
+	int ret;
+
+	/* Basic config space */
+	ret = init_pci_cap_basic_perm(&cap_perms[PCI_CAP_ID_BASIC]);
+
+	/* Capabilities */
+	ret |= init_pci_cap_pm_perm(&cap_perms[PCI_CAP_ID_PM]);
+	cap_perms[PCI_CAP_ID_VPD].writefn = vfio_direct_config_write;
+	ret |= init_pci_cap_pcix_perm(&cap_perms[PCI_CAP_ID_PCIX]);
+	cap_perms[PCI_CAP_ID_VNDR].writefn = vfio_direct_config_write;
+	ret |= init_pci_cap_exp_perm(&cap_perms[PCI_CAP_ID_EXP]);
+	ret |= init_pci_cap_af_perm(&cap_perms[PCI_CAP_ID_AF]);
+
+	/* Extended capabilities */
+	ret |= init_pci_ext_cap_err_perm(&ecap_perms[PCI_EXT_CAP_ID_ERR]);
+	ret |= init_pci_ext_cap_pwr_perm(&ecap_perms[PCI_EXT_CAP_ID_PWR]);
+	ecap_perms[PCI_EXT_CAP_ID_VNDR].writefn = vfio_direct_config_write;
+
+	if (ret)
+		vfio_pci_uninit_perm_bits();
+
+	return ret;
+}
+
+static int vfio_find_cap_start(struct vfio_pci_device *vdev, int pos)
+{
+	u8 cap;
+	int base = (pos >= PCI_CFG_SPACE_SIZE) ? PCI_CFG_SPACE_SIZE :
+						 PCI_STD_HEADER_SIZEOF;
+	base /= 4;
+	pos /= 4;
+
+	cap = vdev->pci_config_map[pos];
+
+	if (cap == PCI_CAP_ID_BASIC)
+		return 0;
+
+	/* XXX Can we have two abutting capabilities of the same type? */
+	while (pos - 1 >= base && vdev->pci_config_map[pos - 1] == cap)
+		pos--;
+
+	return pos * 4;
+}
+
+static int vfio_msi_config_read(struct vfio_pci_device *vdev, int pos,
+				int count, struct perm_bits *perm,
+				int offset, u32 *val)
+{
+	/* Update max available queue size from msi_qmax */
+	if (offset <= PCI_MSI_FLAGS && offset + count >= PCI_MSI_FLAGS) {
+		u16 *flags;
+		int start;
+
+		start = vfio_find_cap_start(vdev, pos);
+
+		flags = (u16 *)&vdev->vconfig[start];
+
+		*flags &= cpu_to_le16(~PCI_MSI_FLAGS_QMASK);
+		*flags |= cpu_to_le16(vdev->msi_qmax << 1);
+	}
+
+	return vfio_default_config_read(vdev, pos, count, perm, offset, val);
+}
+
+static int vfio_msi_config_write(struct vfio_pci_device *vdev, int pos,
+				 int count, struct perm_bits *perm,
+				 int offset, u32 val)
+{
+	count = vfio_default_config_write(vdev, pos, count, perm, offset, val);
+	if (count < 0)
+		return count;
+
+	/* Fixup and write configured queue size and enable to hardware */
+	if (offset <= PCI_MSI_FLAGS && offset + count >= PCI_MSI_FLAGS) {
+		u16 *pflags, flags;
+		int start, ret;
+
+		start = vfio_find_cap_start(vdev, pos);
+
+		pflags = (u16 *)&vdev->vconfig[start + PCI_MSI_FLAGS];
+
+		flags = le16_to_cpu(*pflags);
+
+		/* MSI is enabled via ioctl */
+		if  (!is_msi(vdev))
+			flags &= ~PCI_MSI_FLAGS_ENABLE;
+
+		/* Check queue size */
+		if ((flags & PCI_MSI_FLAGS_QSIZE) >> 4 > vdev->msi_qmax) {
+			flags &= ~PCI_MSI_FLAGS_QSIZE;
+			flags |= vdev->msi_qmax << 4;
+		}
+
+		/* Write back to virt and to hardware */
+		*pflags = cpu_to_le16(flags);
+		ret = pci_user_write_config_word(vdev->pdev,
+						 start + PCI_MSI_FLAGS,
+						 flags);
+		if (ret)
+			return pcibios_err_to_errno(ret);
+	}
+
+	return count;
+}
+
+/*
+ * MSI determination is per-device, so this routine gets used beyond
+ * initialization time. Don't add __init
+ */
+static int init_pci_cap_msi_perm(struct perm_bits *perm, int len, u16 flags)
+{
+	if (alloc_perm_bits(perm, len))
+		return -ENOMEM;
+
+	perm->readfn = vfio_msi_config_read;
+	perm->writefn = vfio_msi_config_write;
+
+	p_setb(perm, PCI_CAP_LIST_NEXT, (u8)ALL_VIRT, NO_WRITE);
+
+	/*
+	 * The upper byte of the control register is reserved,
+	 * just set up the lower byte.
+	 */
+	p_setb(perm, PCI_MSI_FLAGS, (u8)ALL_VIRT, (u8)ALL_WRITE);
+	p_setd(perm, PCI_MSI_ADDRESS_LO, ALL_VIRT, ALL_WRITE);
+	if (flags & PCI_MSI_FLAGS_64BIT) {
+		p_setd(perm, PCI_MSI_ADDRESS_HI, ALL_VIRT, ALL_WRITE);
+		p_setw(perm, PCI_MSI_DATA_64, (u16)ALL_VIRT, (u16)ALL_WRITE);
+		if (flags & PCI_MSI_FLAGS_MASKBIT) {
+			p_setd(perm, PCI_MSI_MASK_64, NO_VIRT, ALL_WRITE);
+			p_setd(perm, PCI_MSI_PENDING_64, NO_VIRT, ALL_WRITE);
+		}
+	} else {
+		p_setw(perm, PCI_MSI_DATA_32, (u16)ALL_VIRT, (u16)ALL_WRITE);
+		if (flags & PCI_MSI_FLAGS_MASKBIT) {
+			p_setd(perm, PCI_MSI_MASK_32, NO_VIRT, ALL_WRITE);
+			p_setd(perm, PCI_MSI_PENDING_32, NO_VIRT, ALL_WRITE);
+		}
+	}
+	return 0;
+}
+
+/* Determine MSI CAP field length; initialize msi_perms on 1st call per vdev */
+static int vfio_msi_cap_len(struct vfio_pci_device *vdev, u8 pos)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	int len, ret;
+	u16 flags;
+
+	ret = pci_read_config_word(pdev, pos + PCI_MSI_FLAGS, &flags);
+	if (ret)
+		return pcibios_err_to_errno(ret);
+
+	len = 10; /* Minimum size */
+	if (flags & PCI_MSI_FLAGS_64BIT)
+		len += 4;
+	if (flags & PCI_MSI_FLAGS_MASKBIT)
+		len += 10;
+
+	if (vdev->msi_perm)
+		return len;
+
+	vdev->msi_perm = kmalloc(sizeof(struct perm_bits), GFP_KERNEL);
+	if (!vdev->msi_perm)
+		return -ENOMEM;
+
+	ret = init_pci_cap_msi_perm(vdev->msi_perm, len, flags);
+	if (ret)
+		return ret;
+
+	return len;
+}
+
+/* Determine extended capability length for VC (2 & 9) and MFVC */
+static int vfio_vc_cap_len(struct vfio_pci_device *vdev, u16 pos)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	u32 tmp;
+	int ret, evcc, phases, vc_arb;
+	int len = PCI_CAP_VC_BASE_SIZEOF;
+
+	ret = pci_read_config_dword(pdev, pos + PCI_VC_PORT_REG1, &tmp);
+	if (ret)
+		return pcibios_err_to_errno(ret);
+
+	evcc = tmp & PCI_VC_REG1_EVCC; /* extended vc count */
+	ret = pci_read_config_dword(pdev, pos + PCI_VC_PORT_REG2, &tmp);
+	if (ret)
+		return pcibios_err_to_errno(ret);
+
+	if (tmp & PCI_VC_REG2_128_PHASE)
+		phases = 128;
+	else if (tmp & PCI_VC_REG2_64_PHASE)
+		phases = 64;
+	else if (tmp & PCI_VC_REG2_32_PHASE)
+		phases = 32;
+	else
+		phases = 0;
+
+	vc_arb = phases * 4;
+
+	/*
+	 * Port arbitration tables are root & switch only;
+	 * function arbitration tables are function 0 only.
+	 * In either case, we'll never let user write them so
+	 * we don't care how big they are
+	 */
+	len += (1 + evcc) * PCI_CAP_VC_PER_VC_SIZEOF;
+	if (vc_arb) {
+		len = round_up(len, 16);
+		len += vc_arb / 8;
+	}
+	return len;
+}
+
+static int vfio_cap_len(struct vfio_pci_device *vdev, u8 cap, u8 pos)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	u16 word;
+	u8 byte;
+	int ret;
+
+	switch (cap) {
+	case PCI_CAP_ID_MSI:
+		return vfio_msi_cap_len(vdev, pos);
+	case PCI_CAP_ID_PCIX:
+		ret = pci_read_config_word(pdev, pos + PCI_X_CMD, &word);
+		if (ret)
+			return pcibios_err_to_errno(ret);
+
+		if (PCI_X_CMD_VERSION(word)) {
+			vdev->extended_caps = true;
+			return PCI_CAP_PCIX_SIZEOF_V12;
+		} else
+			return PCI_CAP_PCIX_SIZEOF_V0;
+	case PCI_CAP_ID_VNDR:
+		/* length follows next field */
+		ret = pci_read_config_byte(pdev, pos + PCI_CAP_FLAGS, &byte);
+		if (ret)
+			return pcibios_err_to_errno(ret);
+
+		return byte;
+	case PCI_CAP_ID_EXP:
+		/* length based on version */
+		ret = pci_read_config_word(pdev, pos + PCI_EXP_FLAGS, &word);
+		if (ret)
+			return pcibios_err_to_errno(ret);
+
+		if ((word & PCI_EXP_FLAGS_VERS) == 1)
+			return PCI_CAP_EXP_ENDPOINT_SIZEOF_V1;
+		else {
+			vdev->extended_caps = true;
+			return PCI_CAP_EXP_ENDPOINT_SIZEOF_V2;
+		}
+	case PCI_CAP_ID_HT:
+		ret = pci_read_config_byte(pdev, pos + 3, &byte);
+		if (ret)
+			return pcibios_err_to_errno(ret);
+
+		return (byte & HT_3BIT_CAP_MASK) ?
+			HT_CAP_SIZEOF_SHORT : HT_CAP_SIZEOF_LONG;
+	case PCI_CAP_ID_SATA:
+		ret = pci_read_config_byte(pdev, pos + PCI_SATA_REGS, &byte);
+		if (ret)
+			return pcibios_err_to_errno(ret);
+
+		byte &= PCI_SATA_REGS_MASK;
+		if (byte == PCI_SATA_REGS_INLINE)
+			return PCI_SATA_SIZEOF_LONG;
+		else
+			return PCI_SATA_SIZEOF_SHORT;
+	default:
+		printk(KERN_WARNING
+		       "%s: %s unknown length for pci cap 0x%x@0x%x\n",
+		       dev_name(&pdev->dev), __func__, cap, pos);
+	}
+
+	return 0;
+}
+
+static int vfio_ext_cap_len(struct vfio_pci_device *vdev, u16 ecap, u16 epos)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	u8 byte;
+	u32 dword;
+	int ret;
+
+	switch (ecap) {
+	case PCI_EXT_CAP_ID_VNDR:
+		ret = pci_read_config_dword(pdev, epos + PCI_VSEC_HDR, &dword);
+		if (ret)
+			return pcibios_err_to_errno(ret);
+
+		return dword >> PCI_VSEC_HDR_LEN_SHIFT;
+	case PCI_EXT_CAP_ID_VC:
+	case PCI_EXT_CAP_ID_VC9:
+	case PCI_EXT_CAP_ID_MFVC:
+		return vfio_vc_cap_len(vdev, epos);
+	case PCI_EXT_CAP_ID_ACS:
+		ret = pci_read_config_byte(pdev, epos + PCI_ACS_CAP, &byte);
+		if (ret)
+			return pcibios_err_to_errno(ret);
+
+		if (byte & PCI_ACS_EC) {
+			int bits;
+
+			ret = pci_read_config_byte(pdev,
+						   epos + PCI_ACS_EGRESS_BITS,
+						   &byte);
+			if (ret)
+				return pcibios_err_to_errno(ret);
+
+			bits = byte ? round_up(byte, 32) : 256;
+			return 8 + (bits / 8);
+		}
+		return 8;
+
+	case PCI_EXT_CAP_ID_REBAR:
+		ret = pci_read_config_byte(pdev, epos + PCI_REBAR_CTRL, &byte);
+		if (ret)
+			return pcibios_err_to_errno(ret);
+
+		byte &= PCI_REBAR_CTRL_NBAR_MASK;
+		byte >>= PCI_REBAR_CTRL_NBAR_SHIFT;
+
+		return 4 + (byte * 8);
+	case PCI_EXT_CAP_ID_DPA:
+		ret = pci_read_config_byte(pdev, epos + PCI_DPA_CAP, &byte);
+		if (ret)
+			return pcibios_err_to_errno(ret);
+
+		byte &= PCI_DPA_CAP_SUBSTATE_MASK;
+		byte = round_up(byte + 1, 4);
+		return PCI_DPA_BASE_SIZEOF + byte;
+	case PCI_EXT_CAP_ID_TPH:
+		ret = pci_read_config_dword(pdev, epos + PCI_TPH_CAP, &dword);
+		if (ret)
+			return pcibios_err_to_errno(ret);
+
+		if ((dword & PCI_TPH_CAP_LOC_MASK) == PCI_TPH_LOC_CAP) {
+			int sts;
+
+			sts = dword & PCI_TPH_CAP_ST_MASK;
+			sts >>= PCI_TPH_CAP_ST_SHIFT;
+			return PCI_TPH_BASE_SIZEOF + round_up(sts * 2, 4);
+		}
+		return PCI_TPH_BASE_SIZEOF;
+	default:
+		printk(KERN_WARNING
+		       "%s: %s unknown length for pci ecap 0x%x@0x%x\n",
+		       dev_name(&pdev->dev), __func__, ecap, epos);
+	}
+
+	return 0;
+}
+
+static int vfio_fill_vconfig_bytes(struct vfio_pci_device *vdev,
+				   int offset, int size)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	int ret = 0;
+
+	/*
+	 * We try to read physical config space in the largest chunks
+	 * we can, assuming that all of the fields support dword access.
+	 * pci_save_state() makes this same assumption and seems to do ok.
+	 */
+	while (size) {
+		int filled;
+
+		if (size >= 4 && !(offset % 4)) {
+			u32 *dword = (u32 *)&vdev->vconfig[offset];
+			ret = pci_read_config_dword(pdev, offset, dword);
+			if (ret)
+				return ret;
+			*dword = cpu_to_le32(*dword);
+			filled = 4;
+		} else if (size >= 2 && !(offset % 2)) {
+			u16 *word = (u16 *)&vdev->vconfig[offset];
+			ret = pci_read_config_word(pdev, offset, word);
+			if (ret)
+				return ret;
+			*word = cpu_to_le16(*word);
+			filled = 2;
+		} else {
+			u8 *byte = &vdev->vconfig[offset];
+			ret = pci_read_config_byte(pdev, offset, byte);
+			if (ret)
+				return ret;
+			filled = 1;
+		}
+
+		offset += filled;
+		size -= filled;
+	}
+
+	return ret;
+}
+
+static int vfio_cap_init(struct vfio_pci_device *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	u8 *map = vdev->pci_config_map;
+	u16 status;
+	u8 pos, *prev, cap;
+	int loops, ret, caps = 0;
+
+	/* Any capabilities? */
+	ret = pci_read_config_word(pdev, PCI_STATUS, &status);
+	if (ret)
+		return ret;
+
+	if (!(status & PCI_STATUS_CAP_LIST))
+		return 0; /* Done */
+
+	ret = pci_read_config_byte(pdev, PCI_CAPABILITY_LIST, &pos);
+	if (ret)
+		return ret;
+
+	/* Mark the previous position in case we want to skip a capability */
+	prev = &vdev->vconfig[PCI_CAPABILITY_LIST];
+
+	/* We can bound our loop, capabilities are dword aligned */
+	loops = (PCI_CFG_SPACE_SIZE - PCI_STD_HEADER_SIZEOF) / PCI_CAP_SIZEOF;
+	while (pos && loops--) {
+		u8 next;
+		int i, len = 0;
+
+		ret = pci_read_config_byte(pdev, pos, &cap);
+		if (ret)
+			return ret;
+
+		ret = pci_read_config_byte(pdev,
+					   pos + PCI_CAP_LIST_NEXT, &next);
+		if (ret)
+			return ret;
+
+		if (cap <= PCI_CAP_ID_MAX) {
+			len = pci_cap_length[cap];
+			if (len == 0xFF) { /* Variable length */
+				len = vfio_cap_len(vdev, cap, pos);
+				if (len < 0)
+					return len;
+			}
+		}
+
+		if (!len) {
+			printk(KERN_INFO "%s: %s hiding cap 0x%x\n",
+			       __func__, dev_name(&pdev->dev), cap);
+			*prev = next;
+			pos = next;
+			continue;
+		}
+
+		/* Sanity check, do we overlap other capabilities? */
+		for (i = 0; i < len; i += 4) {
+			if (likely(map[(pos + i) / 4] == PCI_CAP_ID_INVALID))
+				continue;
+
+			printk(KERN_WARNING
+			       "%s: %s pci config conflict @0x%x, was cap 0x%x now cap 0x%x\n",
+			       __func__, dev_name(&pdev->dev), pos + i,
+			       map[(pos + i) / 4], cap);
+		}
+
+		memset(map + (pos / 4), cap, len / 4);
+		ret = vfio_fill_vconfig_bytes(vdev, pos, len);
+		if (ret)
+			return ret;
+
+		prev = &vdev->vconfig[pos + PCI_CAP_LIST_NEXT];
+		pos = next;
+		caps++;
+	}
+
+	/* If we didn't fill any capabilities, clear the status flag */
+	if (!caps) {
+		u16 *vstatus = (u16 *)&vdev->vconfig[PCI_STATUS];
+		*vstatus &= ~cpu_to_le16(PCI_STATUS_CAP_LIST);
+	}
+
+	return 0;
+}
+
+static int vfio_ecap_init(struct vfio_pci_device *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	u8 *map = vdev->pci_config_map;
+	u16 epos;
+	u32 *prev = NULL;
+	int loops, ret, ecaps = 0;
+
+	if (!vdev->extended_caps)
+		return 0;
+
+	epos = PCI_CFG_SPACE_SIZE;
+
+	loops = (pdev->cfg_size - PCI_CFG_SPACE_SIZE) / PCI_CAP_SIZEOF;
+
+	while (loops-- && epos >= PCI_CFG_SPACE_SIZE) {
+		u32 header;
+		u16 ecap;
+		int i, len = 0;
+		bool hidden = false;
+
+		ret = pci_read_config_dword(pdev, epos, &header);
+		if (ret)
+			return ret;
+
+		ecap = PCI_EXT_CAP_ID(header);
+
+		if (ecap <= PCI_EXT_CAP_ID_MAX) {
+			len = pci_ext_cap_length[ecap];
+			if (len == 0xFF) {
+				len = vfio_ext_cap_len(vdev, ecap, epos);
+				if (len < 0)
+					return len;
+			}
+		}
+
+		if (!len) {
+			printk(KERN_INFO "%s: %s hiding ecap 0x%x@0x%x\n",
+			       __func__, dev_name(&pdev->dev), ecap, epos);
+
+			/* If not the first in the chain, we can skip over it */
+			if (prev) {
+				epos = PCI_EXT_CAP_NEXT(header);
+				*prev &= cpu_to_le32(~((u32)0xffc << 20));
+				*prev |= cpu_to_le32((u32)epos << 20);
+				continue;
+			}
+
+			/*
+			 * Otherwise, fill in a placeholder, the direct
+			 * readfn will virtualize this automatically
+			 */
+			len = PCI_CAP_SIZEOF;
+			hidden = true;
+		}
+
+		for (i = 0; i < len; i += 4) {
+			if (likely(map[(epos + i) / 4] == PCI_CAP_ID_INVALID))
+				continue;
+
+			printk(KERN_WARNING
+			       "%s: %s pci config conflict @0x%x, was ecap 0x%x now ecap 0x%x\n",
+			       __func__, dev_name(&pdev->dev),
+			       epos + i, map[(epos + i) / 4], ecap);
+		}
+
+		/*
+		 * Even though the ecap ID is 2 bytes, we're currently a long
+		 * way from exceeding 1-byte capability IDs.  If we ever make
+		 * it up to 0xFF we'll need to switch to a two-byte map.
+		 */
+		BUILD_BUG_ON(PCI_EXT_CAP_ID_MAX >= PCI_CAP_ID_INVALID);
+
+		memset(map + (epos / 4), ecap, len / 4);
+		ret = vfio_fill_vconfig_bytes(vdev, epos, len);
+		if (ret)
+			return ret;
+
+		/*
+		 * If we're just using this capability to anchor the list,
+		 * hide the real ID.  Only count real ecaps.  XXX PCI spec
+		 * indicates to use cap id = 0, version = 0, next = 0 if
+		 * ecaps are absent, hope users check all the way to next.
+		 */
+		if (hidden)
+			*(u32 *)&vdev->vconfig[epos] &=
+				cpu_to_le32(((u32)0xffc << 20));
+		else
+			ecaps++;
+
+		prev = (u32 *)&vdev->vconfig[epos];
+		epos = PCI_EXT_CAP_NEXT(header);
+	}
+
+	if (!ecaps)
+		*(u32 *)&vdev->vconfig[PCI_CFG_SPACE_SIZE] = 0;
+
+	return 0;
+}
+
+/*
+ * For each device we allocate a pci_config_map that indicates the
+ * capability occupying each dword and thus the struct perm_bits we
+ * use for read and write.  We also allocate a virtualized config
+ * space which tracks reads and writes to bits that we emulate for
+ * the user.  Initial values filled from device.
+ *
+ * Using a shared struct perm_bits between all vfio-pci devices saves
+ * us from allocating cfg_size buffers for virt and write for every
+ * device.  We could remove vconfig and allocate individual buffers
+ * for each area requiring emulated bits, but the array of pointers
+ * would be comparable in size (at least for standard config space).
+ */
+int vfio_config_init(struct vfio_pci_device *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	u8 *map, *vconfig;
+	int ret;
+
+	/*
+	 * Config space, caps and ecaps are all dword aligned, so we can
+	 * use one byte per dword to record the type.
+	 */
+	map = kmalloc(pdev->cfg_size / 4, GFP_KERNEL);
+	if (!map)
+		return -ENOMEM;
+
+	vconfig = kmalloc(pdev->cfg_size, GFP_KERNEL);
+	if (!vconfig) {
+		kfree(map);
+		return -ENOMEM;
+	}
+
+	vdev->pci_config_map = map;
+	vdev->vconfig = vconfig;
+
+	memset(map, PCI_CAP_ID_BASIC, PCI_STD_HEADER_SIZEOF / 4);
+	memset(map + (PCI_STD_HEADER_SIZEOF / 4), PCI_CAP_ID_INVALID,
+	       (pdev->cfg_size - PCI_STD_HEADER_SIZEOF) / 4);
+
+	ret = vfio_fill_vconfig_bytes(vdev, 0, PCI_STD_HEADER_SIZEOF);
+	if (ret)
+		goto out;
+
+	vdev->bardirty = true;
+
+	/*
+	 * XXX can we just pci_load_saved_state/pci_restore_state?
+	 * may need to rebuild vconfig after that
+	 */
+
+	/* For restore after reset */
+	vdev->rbar[0] = *(u32 *)&vconfig[PCI_BASE_ADDRESS_0];
+	vdev->rbar[1] = *(u32 *)&vconfig[PCI_BASE_ADDRESS_1];
+	vdev->rbar[2] = *(u32 *)&vconfig[PCI_BASE_ADDRESS_2];
+	vdev->rbar[3] = *(u32 *)&vconfig[PCI_BASE_ADDRESS_3];
+	vdev->rbar[4] = *(u32 *)&vconfig[PCI_BASE_ADDRESS_4];
+	vdev->rbar[5] = *(u32 *)&vconfig[PCI_BASE_ADDRESS_5];
+	vdev->rbar[6] = *(u32 *)&vconfig[PCI_ROM_ADDRESS];
+
+	if (pdev->is_virtfn) {
+		*(u16 *)&vconfig[PCI_VENDOR_ID] = cpu_to_le16(pdev->vendor);
+		*(u16 *)&vconfig[PCI_DEVICE_ID] = cpu_to_le16(pdev->device);
+	}
+
+	ret = vfio_cap_init(vdev);
+	if (ret)
+		goto out;
+
+	ret = vfio_ecap_init(vdev);
+	if (ret)
+		goto out;
+
+	return 0;
+
+out:
+	kfree(map);
+	vdev->pci_config_map = NULL;
+	kfree(vconfig);
+	vdev->vconfig = NULL;
+	return pcibios_err_to_errno(ret);
+}
+
+void vfio_config_free(struct vfio_pci_device *vdev)
+{
+	kfree(vdev->vconfig);
+	vdev->vconfig = NULL;
+	kfree(vdev->pci_config_map);
+	vdev->pci_config_map = NULL;
+	kfree(vdev->msi_perm);
+	vdev->msi_perm = NULL;
+}
+
+ssize_t vfio_config_do_rw(struct vfio_pci_device *vdev, char __user *buf,
+			  size_t count, loff_t *ppos, bool iswrite)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	struct perm_bits *perm;
+	u32 val = 0;
+	int cap_start = 0, offset;
+	u8 cap_id;
+
+	if (*ppos < 0 || *ppos + count > pdev->cfg_size)
+		return -EFAULT;
+
+	cap_id = vdev->pci_config_map[*ppos / 4];
+
+	if (cap_id == PCI_CAP_ID_INVALID) {
+		if (iswrite)
+			return count; /* drop */
+
+		/*
+		 * Per PCI spec 3.0, section 6.1, reads from reserved and
+		 * unimplemented registers return 0
+		 */
+		if (copy_to_user(buf, &val, count))
+			return -EFAULT;
+
+		return count;
+	}
+
+	/*
+	 * All capabilities are minimum 4 bytes and aligned on dword
+	 * boundaries.  Since we don't support unaligned accesses, we're
+	 * only ever accessing a single capability.
+	 */
+	if (*ppos >= PCI_CFG_SPACE_SIZE) {
+		WARN_ON(cap_id > PCI_EXT_CAP_ID_MAX);
+
+		perm = &ecap_perms[cap_id];
+		cap_start = vfio_find_cap_start(vdev, *ppos);
+
+	} else {
+		WARN_ON(cap_id > PCI_CAP_ID_MAX);
+
+		perm = &cap_perms[cap_id];
+
+		if (cap_id == PCI_CAP_ID_MSI)
+			perm = vdev->msi_perm;
+
+		if (cap_id > PCI_CAP_ID_BASIC)
+			cap_start = vfio_find_cap_start(vdev, *ppos);
+	}
+
+	WARN_ON(!cap_start && cap_id != PCI_CAP_ID_BASIC);
+	WARN_ON(cap_start > *ppos);
+
+	offset = *ppos - cap_start;
+
+	if (iswrite) {
+		if (perm->writefn) {
+			if (copy_from_user(&val, buf, count))
+				return -EFAULT;
+
+			count = perm->writefn(vdev, *ppos, count,
+					      perm, offset, val);
+		}
+	} else {
+		if (perm->readfn) {
+			count = perm->readfn(vdev, *ppos, count,
+					     perm, offset, &val);
+			if (count < 0)
+				return count;
+		}
+
+		if (copy_to_user(buf, &val, count))
+			return -EFAULT;
+	}
+
+	return count;
+}
+
+ssize_t vfio_pci_config_readwrite(struct vfio_pci_device *vdev,
+				  char __user *buf, size_t count,
+				  loff_t *ppos, bool iswrite)
+{
+	size_t done = 0;
+	int ret = 0;
+	loff_t pos = *ppos;
+
+	pos &= VFIO_PCI_OFFSET_MASK;
+
+	/*
+	 * We want to keep the access size the caller uses as well as
+	 * support reading large chunks of config space in a single call.
+	 * PCI doesn't support unaligned accesses, so we can safely break
+	 * those apart.
+	 */
+	while (count) {
+		if (count >= 4 && !(pos % 4))
+			ret = vfio_config_do_rw(vdev, buf, 4, &pos, iswrite);
+		else if (count >= 2 && !(pos % 2))
+			ret = vfio_config_do_rw(vdev, buf, 2, &pos, iswrite);
+		else
+			ret = vfio_config_do_rw(vdev, buf, 1, &pos, iswrite);
+
+		if (ret < 0)
+			return ret;
+
+		count -= ret;
+		done += ret;
+		buf += ret;
+		pos += ret;
+	}
+
+	*ppos += done;
+
+	return done;
+}
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
new file mode 100644
index 0000000..2996f37
--- /dev/null
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -0,0 +1,724 @@
+/*
+ * VFIO PCI interrupt handling
+ *
+ * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
+ *     Author: Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Derived from original vfio:
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ * Author: Tom Lyon, pugs@cisco.com
+ */
+
+#include <linux/device.h>
+#include <linux/interrupt.h>
+#include <linux/eventfd.h>
+#include <linux/pci.h>
+#include <linux/file.h>
+#include <linux/poll.h>
+#include <linux/vfio.h>
+#include <linux/wait.h>
+#include <linux/workqueue.h>
+
+#include "vfio_pci_private.h"
+
+/*
+ * IRQfd - generic
+ */
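+/*
+ * A virqfd ties an eventfd to deferred work: the wait-queue callback
+ * runs from the eventfd wakeup path, so it never sleeps and only
+ * schedules work.  POLLIN schedules the caller-supplied inject work;
+ * POLLHUP queues shutdown on a dedicated workqueue so that teardown
+ * can sleep while flushing any in-flight injection.
+ */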
+struct virqfd {
+	struct vfio_pci_device	*vdev;
+	void			*data;
+	struct eventfd_ctx	*eventfd;
+	poll_table		pt;
+	wait_queue_t		wait;
+	struct work_struct	inject;
+	struct work_struct	shutdown;
+	struct virqfd		**pvirqfd;
+};
+
+static struct workqueue_struct *vfio_irqfd_cleanup_wq;
+
+int __init vfio_pci_virqfd_init(void)
+{
+	vfio_irqfd_cleanup_wq =
+		create_singlethread_workqueue("vfio-irqfd-cleanup");
+	if (!vfio_irqfd_cleanup_wq)
+		return -ENOMEM;
+
+	return 0;
+}
+
+void vfio_pci_virqfd_exit(void)
+{
+	destroy_workqueue(vfio_irqfd_cleanup_wq);
+}
+
+static void virqfd_deactivate(struct virqfd *virqfd)
+{
+	queue_work(vfio_irqfd_cleanup_wq, &virqfd->shutdown);
+}
+
+static int virqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
+{
+	struct virqfd *virqfd = container_of(wait, struct virqfd, wait);
+	unsigned long flags = (unsigned long)key;
+
+	if (flags & POLLIN)
+		/* An event has been signaled, inject an interrupt */
+		schedule_work(&virqfd->inject);
+
+	if (flags & POLLHUP)
+		/* The eventfd is closing, detach from VFIO */
+		virqfd_deactivate(virqfd);
+
+	return 0;
+}
+
+static void virqfd_ptable_queue_proc(struct file *file,
+				     wait_queue_head_t *wqh, poll_table *pt)
+{
+	struct virqfd *virqfd = container_of(pt, struct virqfd, pt);
+	add_wait_queue(wqh, &virqfd->wait);
+}
+
+static void virqfd_shutdown(struct work_struct *work)
+{
+	u64 cnt;
+	struct virqfd *virqfd = container_of(work, struct virqfd, shutdown);
+	struct virqfd **pvirqfd = virqfd->pvirqfd;
+
+	eventfd_ctx_remove_wait_queue(virqfd->eventfd, &virqfd->wait, &cnt);
+	flush_work(&virqfd->inject);
+	eventfd_ctx_put(virqfd->eventfd);
+
+	kfree(virqfd);
+	*pvirqfd = NULL;
+}
+
+static int virqfd_enable(struct vfio_pci_device *vdev,
+			 void (*inject)(struct work_struct *work),
+			 void *data, struct virqfd **pvirqfd, int fd)
+{
+	struct file *file = NULL;
+	struct eventfd_ctx *ctx = NULL;
+	struct virqfd *virqfd;
+	int ret = 0;
+	unsigned int events;
+
+	if (*pvirqfd)
+		return -EBUSY;
+
+	virqfd = kzalloc(sizeof(*virqfd), GFP_KERNEL);
+	if (!virqfd)
+		return -ENOMEM;
+
+	virqfd->vdev = vdev;
+	virqfd->data = data;
+	virqfd->pvirqfd = pvirqfd;
+	*pvirqfd = virqfd;
+
+	INIT_WORK(&virqfd->inject, inject);
+	INIT_WORK(&virqfd->shutdown, virqfd_shutdown);
+
+	file = eventfd_fget(fd);
+	if (IS_ERR(file)) {
+		ret = PTR_ERR(file);
+		goto fail;
+	}
+
+	ctx = eventfd_ctx_fileget(file);
+	if (IS_ERR(ctx)) {
+		ret = PTR_ERR(ctx);
+		goto fail;
+	}
+
+	virqfd->eventfd = ctx;
+
+	/*
+	 * Install our own custom wake-up handling so we are notified via
+	 * a callback whenever someone signals the underlying eventfd.
+	 */
+	init_waitqueue_func_entry(&virqfd->wait, virqfd_wakeup);
+	init_poll_funcptr(&virqfd->pt, virqfd_ptable_queue_proc);
+
+	events = file->f_op->poll(file, &virqfd->pt);
+
+	/*
+	 * Check if there was an event already pending on the eventfd
+	 * before we registered and trigger it as if we didn't miss it.
+	 */
+	if (events & POLLIN)
+		schedule_work(&virqfd->inject);
+
+	/*
+	 * Do not drop the file until the irqfd is fully initialized,
+	 * otherwise we might race against the POLLHUP.
+	 */
+	fput(file);
+
+	return 0;
+
+fail:
+	if (ctx && !IS_ERR(ctx))
+		eventfd_ctx_put(ctx);
+
+	if (!IS_ERR(file))
+		fput(file);
+
+	kfree(virqfd);
+	*pvirqfd = NULL;
+
+	return ret;
+}
+
+static void virqfd_disable(struct virqfd *virqfd)
+{
+	if (!virqfd)
+		return;
+
+	virqfd_deactivate(virqfd);
+
+	/* Block until we know all outstanding shutdown jobs have completed. */
+	flush_workqueue(vfio_irqfd_cleanup_wq);
+}
+
+/*
+ * INTx
+ */
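+/*
+ * Two masking strategies are used below: PCI 2.3 compliant devices
+ * (vdev->pci_2_3) are masked non-destructively via the INTx disable
+ * bit in the command register, which also lets the host interrupt
+ * line be shared; everything else falls back to masking at the
+ * interrupt controller with disable_irq()/enable_irq().
+ */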
+static inline void vfio_send_intx_eventfd(struct vfio_pci_device *vdev)
+{
+	if (likely(is_intx(vdev) && !vdev->virq_disabled))
+		eventfd_signal(vdev->ctx[0].trigger, 1);
+}
+
+void vfio_pci_intx_mask(struct vfio_pci_device *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+
+	spin_lock_irq(&vdev->irqlock);
+
+	/*
+	 * Masking can come from interrupt, ioctl, or config space
+	 * via INTx disable.  The latter means this can get called
+	 * even when not using intx delivery.  In this case, just
+	 * try to have the physical bit follow the virtual bit.
+	 */
+	if (unlikely(!is_intx(vdev))) {
+		if (vdev->pci_2_3)
+			pci_intx(pdev, 0);
+	} else if (!vdev->ctx[0].masked) {
+		/*
+		 * Can't use check_and_mask here because we always want to
+		 * mask, not just when something is pending.
+		 */
+		if (vdev->pci_2_3)
+			pci_intx(pdev, 0);
+		else
+			disable_irq_nosync(pdev->irq);
+
+		vdev->ctx[0].masked = true;
+	}
+
+	spin_unlock_irq(&vdev->irqlock);
+}
+
+void vfio_pci_intx_unmask(struct vfio_pci_device *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	bool signal = false;
+
+	spin_lock_irq(&vdev->irqlock);
+
+	/*
+	 * Unmasking comes from ioctl or config, so again, have the
+	 * physical bit follow the virtual even when not using INTx.
+	 */
+	if (unlikely(!is_intx(vdev))) {
+		if (vdev->pci_2_3)
+			pci_intx(pdev, 1);
+	} else if (vdev->ctx[0].masked && !vdev->virq_disabled) {
+		/*
+		 * A pending interrupt here would immediately trigger,
+		 * but we can avoid that overhead by just re-sending
+		 * the interrupt to the user.
+		 */
+		if (vdev->pci_2_3) {
+			if (!pci_check_and_unmask_intx(pdev))
+				signal = true;
+		} else
+			enable_irq(pdev->irq);
+
+		vdev->ctx[0].masked = signal;
+	}
+
+	spin_unlock_irq(&vdev->irqlock);
+
+	if (signal)
+		vfio_send_intx_eventfd(vdev);
+}
+
+static irqreturn_t vfio_intx_handler(int irq, void *dev_id)
+{
+	struct vfio_pci_device *vdev = dev_id;
+	struct pci_dev *pdev = vdev->pdev;
+	irqreturn_t ret = IRQ_NONE;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vdev->irqlock, flags);
+
+	/* Non-PCI 2.3 devices don't use this hard handler */
+	if (pci_check_and_mask_intx(pdev)) {
+		ret = IRQ_WAKE_THREAD;
+		vdev->ctx[0].masked = true;
+	}
+
+	spin_unlock_irqrestore(&vdev->irqlock, flags);
+
+	return ret;
+}
+
+static irqreturn_t vfio_intx_thread(int irq, void *dev_id)
+{
+	struct vfio_pci_device *vdev = dev_id;
+	int ret = IRQ_HANDLED;
+
+	if (unlikely(!vdev->pci_2_3)) {
+		spin_lock_irq(&vdev->irqlock);
+		if (!vdev->ctx[0].masked) {
+			disable_irq_nosync(vdev->pdev->irq);
+			vdev->ctx[0].masked = true;
+		} else
+			ret = IRQ_NONE;
+		spin_unlock_irq(&vdev->irqlock);
+	}
+
+	if (ret == IRQ_HANDLED)
+		vfio_send_intx_eventfd(vdev);
+
+	return ret;
+}
+
+static int vfio_intx_enable(struct vfio_pci_device *vdev)
+{
+	if (!is_irq_none(vdev))
+		return -EINVAL;
+
+	if (!vdev->pdev->irq)
+		return -ENODEV;
+
+	vdev->ctx = kzalloc(sizeof(struct vfio_pci_irq_ctx), GFP_KERNEL);
+	if (!vdev->ctx)
+		return -ENOMEM;
+
+	vdev->num_ctx = 1;
+	vdev->irq_type = VFIO_PCI_INTX_IRQ_INDEX;
+
+	return 0;
+}
+
+static int vfio_intx_set_signal(struct vfio_pci_device *vdev, int fd)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	irq_handler_t handler = vfio_intx_handler;
+	unsigned long irqflags = IRQF_SHARED;
+	int ret;
+
+	if (vdev->ctx[0].trigger) {
+		free_irq(pdev->irq, vdev);
+		kfree(vdev->ctx[0].name);
+		eventfd_ctx_put(vdev->ctx[0].trigger);
+		vdev->ctx[0].trigger = NULL;
+	}
+
+	if (fd >= 0) {
+		vdev->ctx[0].name = kasprintf(GFP_KERNEL, "vfio-intx(%s)",
+					    pci_name(pdev));
+		if (!vdev->ctx[0].name)
+			return -ENOMEM;
+
+		vdev->ctx[0].trigger = eventfd_ctx_fdget(fd);
+		if (IS_ERR(vdev->ctx[0].trigger)) {
+			ret = PTR_ERR(vdev->ctx[0].trigger);
+			vdev->ctx[0].trigger = NULL;
+			kfree(vdev->ctx[0].name);
+			return ret;
+		}
+
+		if (!vdev->pci_2_3) {
+			handler = NULL;
+			irqflags = IRQF_ONESHOT;
+		}
+
+		ret = request_threaded_irq(pdev->irq, handler, vfio_intx_thread,
+					   irqflags, vdev->ctx[0].name, vdev);
+		if (ret) {
+			eventfd_ctx_put(vdev->ctx[0].trigger);
+			kfree(vdev->ctx[0].name);
+			return ret;
+		}
+
+		/*
+		 * INTx disable will stick across the new irq setup,
+		 * disable_irq won't.
+		 */
+		if (!vdev->pci_2_3)
+			if (vdev->ctx[0].masked || vdev->virq_disabled)
+				disable_irq_nosync(pdev->irq);
+	}
+	return 0;
+}
+
+static void vfio_intx_disable(struct vfio_pci_device *vdev)
+{
+	vfio_intx_set_signal(vdev, -1);
+	virqfd_disable(vdev->ctx[0].unmask);
+	virqfd_disable(vdev->ctx[0].mask);
+	vdev->irq_type = VFIO_PCI_NUM_IRQS;
+	vdev->num_ctx = 0;
+	kfree(vdev->ctx);
+}
+
+static void vfio_intx_unmask_inject(struct work_struct *work)
+{
+	struct virqfd *virqfd = container_of(work, struct virqfd, inject);
+	vfio_pci_intx_unmask(virqfd->vdev);
+}
+
+/*
+ * MSI/MSI-X
+ */
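+/*
+ * Each MSI/MSI-X vector is wired to its own eventfd; the handler
+ * below does nothing but signal the trigger, so delivery to
+ * userspace is a single eventfd_signal() per interrupt.
+ */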
+static irqreturn_t vfio_msihandler(int irq, void *arg)
+{
+	struct eventfd_ctx *trigger = arg;
+
+	eventfd_signal(trigger, 1);
+	return IRQ_HANDLED;
+}
+
+static int vfio_msi_enable(struct vfio_pci_device *vdev, int nvec, bool msix)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	int ret;
+
+	if (!is_irq_none(vdev))
+		return -EINVAL;
+
+	vdev->ctx = kzalloc(nvec * sizeof(struct vfio_pci_irq_ctx), GFP_KERNEL);
+	if (!vdev->ctx)
+		return -ENOMEM;
+
+	if (msix) {
+		int i;
+
+		vdev->msix = kzalloc(nvec * sizeof(struct msix_entry),
+				     GFP_KERNEL);
+		if (!vdev->msix) {
+			kfree(vdev->ctx);
+			return -ENOMEM;
+		}
+
+		for (i = 0; i < nvec; i++)
+			vdev->msix[i].entry = i;
+
+		ret = pci_enable_msix(pdev, vdev->msix, nvec);
+		if (ret) {
+			kfree(vdev->msix);
+			kfree(vdev->ctx);
+			return ret;
+		}
+	} else {
+		ret = pci_enable_msi_block(pdev, nvec);
+		if (ret) {
+			kfree(vdev->ctx);
+			return ret;
+		}
+	}
+
+	vdev->num_ctx = nvec;
+	vdev->irq_type = msix ? VFIO_PCI_MSIX_IRQ_INDEX :
+				VFIO_PCI_MSI_IRQ_INDEX;
+
+	if (!msix) {
+		/*
+		 * Compute the virtual hardware field for max msi vectors -
+		 * it is the log base 2 of the number of vectors.
+		 */
+		vdev->msi_qmax = fls(nvec * 2 - 1) - 1;
+	}
+
+	return 0;
+}
+
+static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
+				      int vector, int fd, bool msix)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	int irq = msix ? vdev->msix[vector].vector : pdev->irq + vector;
+	char *name = msix ? "vfio-msix" : "vfio-msi";
+
+	if (vector >= vdev->num_ctx)
+		return -EINVAL;
+
+	if (vdev->ctx[vector].trigger) {
+		free_irq(irq, vdev->ctx[vector].trigger);
+		kfree(vdev->ctx[vector].name);
+		eventfd_ctx_put(vdev->ctx[vector].trigger);
+		vdev->ctx[vector].trigger = NULL;
+	}
+
+	if (fd >= 0) {
+		struct eventfd_ctx *trigger;
+		int ret;
+
+		vdev->ctx[vector].name = kasprintf(GFP_KERNEL, "%s[%d](%s)",
+						   name, vector,
+						   pci_name(pdev));
+		if (!vdev->ctx[vector].name)
+			return -ENOMEM;
+
+		trigger = eventfd_ctx_fdget(fd);
+		if (IS_ERR(trigger)) {
+			kfree(vdev->ctx[vector].name);
+			return PTR_ERR(trigger);
+		}
+
+		ret = request_threaded_irq(irq, NULL, vfio_msihandler, 0,
+					   vdev->ctx[vector].name, trigger);
+		if (ret) {
+			eventfd_ctx_put(trigger);
+			kfree(vdev->ctx[vector].name);
+			return ret;
+		}
+
+		vdev->ctx[vector].trigger = trigger;
+	}
+
+	return 0;
+}
+
+static int vfio_msi_set_block(struct vfio_pci_device *vdev, int start,
+			      int count, int32_t *fds, bool msix)
+{
+	int i, j, ret = 0;
+
+	if (start + count > vdev->num_ctx)
+		return -EINVAL;
+
+	for (i = 0, j = start; i < count && !ret; i++, j++) {
+		int fd = fds ? fds[i] : -1;
+		ret = vfio_msi_set_vector_signal(vdev, j, fd, msix);
+	}
+
+	if (ret) {
+		for (--j; j >= start; j--)
+			vfio_msi_set_vector_signal(vdev, j, -1, msix);
+	}
+
+	return ret;
+}
+
+static void vfio_msi_disable(struct vfio_pci_device *vdev, bool msix)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	int i;
+
+	vfio_msi_set_block(vdev, 0, vdev->num_ctx, NULL, msix);
+
+	for (i = 0; i < vdev->num_ctx; i++) {
+		virqfd_disable(vdev->ctx[i].unmask);
+		virqfd_disable(vdev->ctx[i].mask);
+	}
+
+	if (msix) {
+		pci_disable_msix(vdev->pdev);
+		kfree(vdev->msix);
+	} else
+		pci_disable_msi(pdev);
+
+	vdev->irq_type = VFIO_PCI_NUM_IRQS;
+	vdev->num_ctx = 0;
+	kfree(vdev->ctx);
+}
+
+/*
+ * IOCTL support
+ */
+static int vfio_pci_set_intx_unmask(struct vfio_pci_device *vdev,
+				    int index, int start, int count,
+				    uint32_t flags, void *data)
+{
+	if (!is_intx(vdev) || start != 0 || count != 1)
+		return -EINVAL;
+
+	if (flags & VFIO_IRQ_SET_DATA_NONE) {
+		vfio_pci_intx_unmask(vdev);
+	} else if (flags & VFIO_IRQ_SET_DATA_BOOL) {
+		uint8_t unmask = *(uint8_t *)data;
+		if (unmask)
+			vfio_pci_intx_unmask(vdev);
+	} else if (flags & VFIO_IRQ_SET_DATA_EVENTFD) {
+		int32_t fd = *(int32_t *)data;
+		if (fd >= 0)
+			return virqfd_enable(vdev, vfio_intx_unmask_inject,
+					     NULL, &vdev->ctx[0].unmask, fd);
+
+		virqfd_disable(vdev->ctx[0].unmask);
+	}
+
+	return 0;
+}
+
+static int vfio_pci_set_intx_mask(struct vfio_pci_device *vdev,
+				  int index, int start, int count,
+				  uint32_t flags, void *data)
+{
+	if (!is_intx(vdev) || start != 0 || count != 1)
+		return -EINVAL;
+
+	if (flags & VFIO_IRQ_SET_DATA_NONE) {
+		vfio_pci_intx_mask(vdev);
+	} else if (flags & VFIO_IRQ_SET_DATA_BOOL) {
+		uint8_t mask = *(uint8_t *)data;
+		if (mask)
+			vfio_pci_intx_mask(vdev);
+	} else if (flags & VFIO_IRQ_SET_DATA_EVENTFD) {
+		return -ENOTTY; /* XXX implement me */
+	}
+
+	return 0;
+}
+
+static int vfio_pci_set_intx_trigger(struct vfio_pci_device *vdev,
+				     int index, int start, int count,
+				     uint32_t flags, void *data)
+{
+	if (is_intx(vdev) && !count && (flags & VFIO_IRQ_SET_DATA_NONE)) {
+		vfio_intx_disable(vdev);
+		return 0;
+	}
+
+	if (!(is_intx(vdev) || is_irq_none(vdev)) || start != 0 || count != 1)
+		return -EINVAL;
+
+	if (flags & VFIO_IRQ_SET_DATA_EVENTFD) {
+		int32_t fd = *(int32_t *)data;
+		int ret;
+
+		if (is_intx(vdev))
+			return vfio_intx_set_signal(vdev, fd);
+
+		ret = vfio_intx_enable(vdev);
+		if (ret)
+			return ret;
+
+		ret = vfio_intx_set_signal(vdev, fd);
+		if (ret)
+			vfio_intx_disable(vdev);
+
+		return ret;
+	}
+
+	if (!is_intx(vdev))
+		return -EINVAL;
+
+	if (flags & VFIO_IRQ_SET_DATA_NONE) {
+		vfio_send_intx_eventfd(vdev);
+	} else if (flags & VFIO_IRQ_SET_DATA_BOOL) {
+		uint8_t trigger = *(uint8_t *)data;
+		if (trigger)
+			vfio_send_intx_eventfd(vdev);
+	}
+	return 0;
+}
+
+static int vfio_pci_set_msi_trigger(struct vfio_pci_device *vdev,
+				    int index, int start, int count,
+				    uint32_t flags, void *data)
+{
+	int i;
+	bool msix = (index == VFIO_PCI_MSIX_IRQ_INDEX);
+
+	if (irq_is(vdev, index) && !count && (flags & VFIO_IRQ_SET_DATA_NONE)) {
+		vfio_msi_disable(vdev, msix);
+		return 0;
+	}
+
+	if (!(irq_is(vdev, index) || is_irq_none(vdev)))
+		return -EINVAL;
+
+	if (flags & VFIO_IRQ_SET_DATA_EVENTFD) {
+		int32_t *fds = data;
+		int ret;
+
+		if (vdev->irq_type == index)
+			return vfio_msi_set_block(vdev, start, count,
+						  fds, msix);
+
+		ret = vfio_msi_enable(vdev, start + count, msix);
+		if (ret)
+			return ret;
+
+		ret = vfio_msi_set_block(vdev, start, count, fds, msix);
+		if (ret)
+			vfio_msi_disable(vdev, msix);
+
+		return ret;
+	}
+
+	if (!irq_is(vdev, index) || start + count > vdev->num_ctx)
+		return -EINVAL;
+
+	for (i = start; i < start + count; i++) {
+		if (!vdev->ctx[i].trigger)
+			continue;
+		if (flags & VFIO_IRQ_SET_DATA_NONE) {
+			eventfd_signal(vdev->ctx[i].trigger, 1);
+		} else if (flags & VFIO_IRQ_SET_DATA_BOOL) {
+			uint8_t *bools = data;
+			if (bools[i - start])
+				eventfd_signal(vdev->ctx[i].trigger, 1);
+		}
+	}
+	return 0;
+}
+
+int vfio_pci_set_irqs_ioctl(struct vfio_pci_device *vdev, uint32_t flags,
+			    int index, int start, int count, void *data)
+{
+	int (*func)(struct vfio_pci_device *vdev, int index, int start,
+		    int count, uint32_t flags, void *data) = NULL;
+
+	switch (index) {
+	case VFIO_PCI_INTX_IRQ_INDEX:
+		switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
+		case VFIO_IRQ_SET_ACTION_MASK:
+			func = vfio_pci_set_intx_mask;
+			break;
+		case VFIO_IRQ_SET_ACTION_UNMASK:
+			func = vfio_pci_set_intx_unmask;
+			break;
+		case VFIO_IRQ_SET_ACTION_TRIGGER:
+			func = vfio_pci_set_intx_trigger;
+			break;
+		}
+		break;
+	case VFIO_PCI_MSI_IRQ_INDEX:
+	case VFIO_PCI_MSIX_IRQ_INDEX:
+		switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
+		case VFIO_IRQ_SET_ACTION_MASK:
+		case VFIO_IRQ_SET_ACTION_UNMASK:
+			/* XXX Need masking support exported */
+			break;
+		case VFIO_IRQ_SET_ACTION_TRIGGER:
+			func = vfio_pci_set_msi_trigger;
+			break;
+		}
+		break;
+	}
+
+	if (!func)
+		return -ENOTTY;
+
+	return func(vdev, index, start, count, flags, data);
+}
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
new file mode 100644
index 0000000..a4a3678
--- /dev/null
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -0,0 +1,91 @@
+/*
+ * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
+ *     Author: Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Derived from original vfio:
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ * Author: Tom Lyon, pugs@cisco.com
+ */
+
+#ifndef VFIO_PCI_PRIVATE_H
+#define VFIO_PCI_PRIVATE_H
+
+#include <linux/mutex.h>
+#include <linux/pci.h>
+
+#define VFIO_PCI_OFFSET_SHIFT   40
+
+#define VFIO_PCI_OFFSET_TO_INDEX(off)	(off >> VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_INDEX_TO_OFFSET(index)	((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_OFFSET_MASK	(((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
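+
+/*
+ * Example: a read at byte 0x10 of BAR1 uses file offset
+ * VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR1_REGION_INDEX) | 0x10,
+ * i.e. ((u64)1 << 40) | 0x10; the low 40 bits address within the
+ * region and the bits above select the region index.
+ */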
+
+struct vfio_pci_irq_ctx {
+	struct eventfd_ctx	*trigger;
+	struct virqfd		*unmask;
+	struct virqfd		*mask;
+	char			*name;
+	bool			masked;
+};
+
+struct vfio_pci_device {
+	struct pci_dev		*pdev;
+	void __iomem		*barmap[PCI_STD_RESOURCE_END + 1];
+	u8			*pci_config_map;
+	u8			*vconfig;
+	struct perm_bits	*msi_perm;
+	spinlock_t		irqlock;
+	struct mutex		igate;
+	struct msix_entry	*msix;
+	struct vfio_pci_irq_ctx	*ctx;
+	int			num_ctx;
+	int			irq_type;
+	u8			msi_qmax;
+	u8			msix_bar;
+	u16			msix_size;
+	u32			msix_offset;
+	u32			rbar[7];
+	bool			pci_2_3;
+	bool			virq_disabled;
+	bool			reset_works;
+	bool			extended_caps;
+	bool			bardirty;
+	struct pci_saved_state	*pci_saved_state;
+	atomic_t		refcnt;
+};
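+
+/*
+ * Locking: irqlock protects the INTx mask state shared with the hard
+ * IRQ handler; igate serializes interrupt reconfiguration through
+ * VFIO_DEVICE_SET_IRQS.
+ */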
+
+#define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)
+#define is_msi(vdev) (vdev->irq_type == VFIO_PCI_MSI_IRQ_INDEX)
+#define is_msix(vdev) (vdev->irq_type == VFIO_PCI_MSIX_IRQ_INDEX)
+#define is_irq_none(vdev) (!(is_intx(vdev) || is_msi(vdev) || is_msix(vdev)))
+#define irq_is(vdev, type) (vdev->irq_type == type)
+
+extern void vfio_pci_intx_mask(struct vfio_pci_device *vdev);
+extern void vfio_pci_intx_unmask(struct vfio_pci_device *vdev);
+
+extern int vfio_pci_set_irqs_ioctl(struct vfio_pci_device *vdev,
+				   uint32_t flags, int index, int start,
+				   int count, void *data);
+
+extern ssize_t vfio_pci_config_readwrite(struct vfio_pci_device *vdev,
+					 char __user *buf, size_t count,
+					 loff_t *ppos, bool iswrite);
+extern ssize_t vfio_pci_mem_readwrite(struct vfio_pci_device *vdev,
+				      char __user *buf, size_t count,
+				      loff_t *ppos, bool iswrite);
+extern ssize_t vfio_pci_io_readwrite(struct vfio_pci_device *vdev,
+				     char __user *buf, size_t count,
+				     loff_t *ppos, bool iswrite);
+
+extern int vfio_pci_init_perm_bits(void);
+extern void vfio_pci_uninit_perm_bits(void);
+
+extern int vfio_pci_virqfd_init(void);
+extern void vfio_pci_virqfd_exit(void);
+
+extern int vfio_config_init(struct vfio_pci_device *vdev);
+extern void vfio_config_free(struct vfio_pci_device *vdev);
+#endif /* VFIO_PCI_PRIVATE_H */
diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
new file mode 100644
index 0000000..44c3ba2
--- /dev/null
+++ b/drivers/vfio/pci/vfio_pci_rdwr.c
@@ -0,0 +1,267 @@
+/*
+ * VFIO PCI I/O Port & MMIO access
+ *
+ * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
+ *     Author: Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Derived from original vfio:
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ * Author: Tom Lyon, pugs@cisco.com
+ */
+
+#include <linux/fs.h>
+#include <linux/pci.h>
+#include <linux/uaccess.h>
+#include <linux/io.h>
+
+#include "vfio_pci_private.h"
+
+/* I/O Port BAR access */
+ssize_t vfio_pci_io_readwrite(struct vfio_pci_device *vdev, char __user *buf,
+			      size_t count, loff_t *ppos, bool iswrite)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+	int bar = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	void __iomem *io;
+	size_t done = 0;
+
+	if (!pci_resource_start(pdev, bar))
+		return -EINVAL;
+
+	if (pos + count > pci_resource_len(pdev, bar))
+		return -EINVAL;
+
+	if (!vdev->barmap[bar]) {
+		int ret;
+
+		ret = pci_request_selected_regions(pdev, 1 << bar, "vfio");
+		if (ret)
+			return ret;
+
+		vdev->barmap[bar] = pci_iomap(pdev, bar, 0);
+
+		if (!vdev->barmap[bar]) {
+			pci_release_selected_regions(pdev, 1 << bar);
+			return -EINVAL;
+		}
+	}
+
+	io = vdev->barmap[bar];
+
+	while (count) {
+		int filled;
+
+		if (count >= 4 && !(pos % 4)) {
+			u32 val;
+
+			if (iswrite) {
+				if (copy_from_user(&val, buf, 4))
+					return -EFAULT;
+
+				iowrite32(val, io + pos);
+			} else {
+				val = ioread32(io + pos);
+
+				if (copy_to_user(buf, &val, 4))
+					return -EFAULT;
+			}
+
+			filled = 4;
+
+		} else if ((pos % 2) == 0 && count >= 2) {
+			u16 val;
+
+			if (iswrite) {
+				if (copy_from_user(&val, buf, 2))
+					return -EFAULT;
+
+				iowrite16(val, io + pos);
+			} else {
+				val = ioread16(io + pos);
+
+				if (copy_to_user(buf, &val, 2))
+					return -EFAULT;
+			}
+
+			filled = 2;
+		} else {
+			u8 val;
+
+			if (iswrite) {
+				if (copy_from_user(&val, buf, 1))
+					return -EFAULT;
+
+				iowrite8(val, io + pos);
+			} else {
+				val = ioread8(io + pos);
+
+				if (copy_to_user(buf, &val, 1))
+					return -EFAULT;
+			}
+
+			filled = 1;
+		}
+
+		count -= filled;
+		done += filled;
+		buf += filled;
+		pos += filled;
+	}
+
+	*ppos += done;
+
+	return done;
+}
+
+/*
+ * MMIO BAR access
+ * We handle two excluded ranges here as well, if the user tries to read
+ * the ROM beyond what PCI tells us is available or the MSI-X table region,
+ * we return 0xFF and writes are dropped.
+ */
+ssize_t vfio_pci_mem_readwrite(struct vfio_pci_device *vdev, char __user *buf,
+			       size_t count, loff_t *ppos, bool iswrite)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+	int bar = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	void __iomem *io;
+	resource_size_t end;
+	size_t done = 0;
+	size_t x_start = 0, x_end = 0; /* excluded range */
+
+	if (!pci_resource_start(pdev, bar))
+		return -EINVAL;
+
+	end = pci_resource_len(pdev, bar);
+
+	if (pos > end)
+		return -EINVAL;
+
+	if (pos == end)
+		return 0;
+
+	if (pos + count > end)
+		count = end - pos;
+
+	if (bar == PCI_ROM_RESOURCE) {
+		io = pci_map_rom(pdev, &x_start);
+		x_end = end;
+	} else {
+		if (!vdev->barmap[bar]) {
+			int ret;
+
+			ret = pci_request_selected_regions(pdev, 1 << bar,
+							   "vfio");
+			if (ret)
+				return ret;
+
+			vdev->barmap[bar] = pci_iomap(pdev, bar, 0);
+
+			if (!vdev->barmap[bar]) {
+				pci_release_selected_regions(pdev, 1 << bar);
+				return -EINVAL;
+			}
+		}
+
+		io = vdev->barmap[bar];
+
+		if (bar == vdev->msix_bar) {
+			x_start = vdev->msix_offset;
+			x_end = vdev->msix_offset + vdev->msix_size;
+		}
+	}
+
+	if (!io)
+		return -EINVAL;
+
+	while (count) {
+		size_t fillable, filled;
+
+		if (pos < x_start)
+			fillable = x_start - pos;
+		else if (pos >= x_end)
+			fillable = end - pos;
+		else
+			fillable = 0;
+
+		if (fillable >= 4 && !(pos % 4)) {
+			u32 val;
+
+			if (iswrite) {
+				if (copy_from_user(&val, buf, 4))
+					goto out;
+
+				iowrite32(val, io + pos);
+			} else {
+				val = ioread32(io + pos);
+				if (copy_to_user(buf, &val, 4))
+					goto out;
+			}
+
+			filled = 4;
+		} else if (fillable >= 2 && !(pos % 2)) {
+			u16 val;
+
+			if (iswrite) {
+				if (copy_from_user(&val, buf, 2))
+					goto out;
+
+				iowrite16(val, io + pos);
+			} else {
+				val = ioread16(io + pos);
+				if (copy_to_user(buf, &val, 2))
+					goto out;
+			}
+
+			filled = 2;
+		} else if (fillable) {
+			u8 val;
+
+			if (iswrite) {
+				if (copy_from_user(&val, buf, 1))
+					goto out;
+
+				iowrite8(val, io + pos);
+			} else {
+				val = ioread8(io + pos);
+
+				if (copy_to_user(buf, &val, 1))
+					goto out;
+			}
+
+			filled = 1;
+		} else {
+			/* Drop writes, fill reads with FF */
+			if (!iswrite) {
+				u8 val = 0xFF;
+				size_t i;
+
+				for (i = 0; i < x_end - pos; i++) {
+					if (put_user(val, buf + i))
+						goto out;
+				}
+			}
+
+			filled = x_end - pos;
+		}
+
+		count -= filled;
+		done += filled;
+		buf += filled;
+		pos += filled;
+	}
+
+	*ppos += done;
+
+out:
+	if (bar == PCI_ROM_RESOURCE)
+		pci_unmap_rom(pdev, io);
+
+	return count ? -EFAULT : done;
+}
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 1c7119c..d668283 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -220,6 +220,7 @@ struct vfio_device_info {
 	__u32	argsz;
 	__u32	flags;
 #define VFIO_DEVICE_FLAGS_RESET	(1 << 0)	/* Device supports reset */
+#define VFIO_DEVICE_FLAGS_PCI	(1 << 1)	/* vfio-pci device */
 	__u32	num_regions;	/* Max region index + 1 */
 	__u32	num_irqs;	/* Max IRQ index + 1 */
 };
@@ -361,6 +362,31 @@ struct vfio_irq_set {
  */
 #define VFIO_DEVICE_RESET		_IO(VFIO_TYPE, VFIO_BASE + 11)
 
+/*
+ * The VFIO-PCI bus driver makes use of the following fixed region and
+ * IRQ index mapping.  Unimplemented regions return a size of zero.
+ * Unimplemented IRQ types return a count of zero.
+ */
+
+enum {
+	VFIO_PCI_BAR0_REGION_INDEX,
+	VFIO_PCI_BAR1_REGION_INDEX,
+	VFIO_PCI_BAR2_REGION_INDEX,
+	VFIO_PCI_BAR3_REGION_INDEX,
+	VFIO_PCI_BAR4_REGION_INDEX,
+	VFIO_PCI_BAR5_REGION_INDEX,
+	VFIO_PCI_ROM_REGION_INDEX,
+	VFIO_PCI_CONFIG_REGION_INDEX,
+	VFIO_PCI_NUM_REGIONS
+};
+
+enum {
+	VFIO_PCI_INTX_IRQ_INDEX,
+	VFIO_PCI_MSI_IRQ_INDEX,
+	VFIO_PCI_MSIX_IRQ_INDEX,
+	VFIO_PCI_NUM_IRQS
+};
+
 /* -------- API for x86 VFIO IOMMU -------- */
 
 /**

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [Qemu-devel] [PATCH 13/13] vfio: Add PCI device driver
@ 2012-05-11 22:56   ` Alex Williamson
  0 siblings, 0 replies; 129+ messages in thread
From: Alex Williamson @ 2012-05-11 22:56 UTC (permalink / raw)
  To: benh, aik, david, joerg.roedel, dwmw2
  Cc: aafabbri, alex.williamson, kvm, B07421, linux-pci, konrad.wilk,
	agraf, qemu-devel, chrisw, B08248, iommu, avi, gregkh, bhelgaas,
	linux-kernel, benve

Add PCI device support for VFIO.  PCI devices expose regions
for accessing config space, I/O port space, and MMIO areas
of the device.  PCI config access is virtualized in the kernel,
allowing us to protect the integrity of the system by blocking
unsafe accesses while avoiding duplicated emulation across
userspace drivers.  I/O port space supports read/write access,
while MMIO additionally supports mmap of sufficiently sized
regions.  Support for INTx, MSI, and MSI-X interrupts is
provided using eventfds signaled to userspace.
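
For illustration, a minimal userspace sketch of finding the config
region and reading the vendor ID (assuming a device fd already
obtained via VFIO_GROUP_GET_DEVICE_FD; error handling omitted):

    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/vfio.h>

    void dump_vendor(int device)
    {
            struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
            struct vfio_region_info reg = { .argsz = sizeof(reg) };
            unsigned short vendor;

            ioctl(device, VFIO_DEVICE_GET_INFO, &dev_info);

            reg.index = VFIO_PCI_CONFIG_REGION_INDEX;
            ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &reg);

            /* Reads of the config region are virtualized by the kernel;
             * the vendor ID sits at offset 0 of config space. */
            pread(device, &vendor, sizeof(vendor), reg.offset);
            printf("vendor: %04x\n", vendor);
    }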

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

 drivers/vfio/Kconfig                |    2 
 drivers/vfio/pci/Kconfig            |    8 
 drivers/vfio/pci/Makefile           |    4 
 drivers/vfio/pci/vfio_pci.c         |  557 +++++++++++++
 drivers/vfio/pci/vfio_pci_config.c  | 1527 +++++++++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_intrs.c   |  724 +++++++++++++++++
 drivers/vfio/pci/vfio_pci_private.h |   91 ++
 drivers/vfio/pci/vfio_pci_rdwr.c    |  267 ++++++
 include/linux/vfio.h                |   26 +
 9 files changed, 3206 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vfio/pci/Kconfig
 create mode 100644 drivers/vfio/pci/Makefile
 create mode 100644 drivers/vfio/pci/vfio_pci.c
 create mode 100644 drivers/vfio/pci/vfio_pci_config.c
 create mode 100644 drivers/vfio/pci/vfio_pci_intrs.c
 create mode 100644 drivers/vfio/pci/vfio_pci_private.h
 create mode 100644 drivers/vfio/pci/vfio_pci_rdwr.c

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index bd88a30..77b754c 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -12,3 +12,5 @@ menuconfig VFIO
 	  See Documentation/vfio.txt for more details.
 
 	  If you don't know what to do here, say N.
+
+source "drivers/vfio/pci/Kconfig"
diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
new file mode 100644
index 0000000..cc7db62
--- /dev/null
+++ b/drivers/vfio/pci/Kconfig
@@ -0,0 +1,8 @@
+config VFIO_PCI
+	tristate "VFIO support for PCI devices"
+	depends on VFIO && PCI
+	help
+	  Support for the PCI VFIO bus driver.  This is required to make
+	  use of PCI drivers using the VFIO framework.
+
+	  If you don't know what to do here, say N.
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
new file mode 100644
index 0000000..1310792
--- /dev/null
+++ b/drivers/vfio/pci/Makefile
@@ -0,0 +1,4 @@
+
+vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
+
+obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
new file mode 100644
index 0000000..b2f1f3a
--- /dev/null
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -0,0 +1,557 @@
+/*
+ * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
+ *     Author: Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Derived from original vfio:
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ * Author: Tom Lyon, pugs@cisco.com
+ */
+
+#include <linux/device.h>
+#include <linux/eventfd.h>
+#include <linux/interrupt.h>
+#include <linux/iommu.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/notifier.h>
+#include <linux/pci.h>
+#include <linux/pm_runtime.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+
+#include "vfio_pci_private.h"
+
+#define DRIVER_VERSION  "0.1.9"
+#define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
+#define DRIVER_DESC     "VFIO PCI - User Level meta-driver"
+
+static int vfio_pci_enable(struct vfio_pci_device *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	int ret;
+	u16 cmd;
+	u8 msix_pos;
+
+	vdev->reset_works = (pci_reset_function(pdev) == 0);
+	pci_save_state(pdev);
+	vdev->pci_saved_state = pci_store_saved_state(pdev);
+	if (!vdev->pci_saved_state)
+		printk(KERN_DEBUG "%s: Couldn't store %s saved state\n",
+		       __func__, dev_name(&pdev->dev));
+
+	ret = vfio_config_init(vdev);
+	if (ret)
+		goto out;
+
+	vdev->pci_2_3 = pci_intx_mask_supported(pdev);
+
+	pci_read_config_word(pdev, PCI_COMMAND, &cmd);
+	if (vdev->pci_2_3 && (cmd & PCI_COMMAND_INTX_DISABLE)) {
+		cmd &= ~PCI_COMMAND_INTX_DISABLE;
+		pci_write_config_word(pdev, PCI_COMMAND, cmd);
+	}
+
+	msix_pos = pci_find_capability(pdev, PCI_CAP_ID_MSIX);
+	if (msix_pos) {
+		u16 flags;
+		u32 table;
+
+		pci_read_config_word(pdev, msix_pos + PCI_MSIX_FLAGS, &flags);
+		pci_read_config_dword(pdev, msix_pos + PCI_MSIX_TABLE, &table);
+
+		vdev->msix_bar = table & PCI_MSIX_FLAGS_BIRMASK;
+		vdev->msix_offset = table & ~PCI_MSIX_FLAGS_BIRMASK;
+		vdev->msix_size = ((flags & PCI_MSIX_FLAGS_QSIZE) + 1) * 16;
+	} else
+		vdev->msix_bar = 0xFF;
+
+	ret = pci_enable_device(pdev);
+	if (ret)
+		goto out;
+
+	return ret;
+
+out:
+	kfree(vdev->pci_saved_state);
+	vdev->pci_saved_state = NULL;
+	vfio_config_free(vdev);
+	return ret;
+}
+
+static void vfio_pci_disable(struct vfio_pci_device *vdev)
+{
+	int bar;
+
+	pci_disable_device(vdev->pdev);
+
+	vfio_pci_set_irqs_ioctl(vdev, VFIO_IRQ_SET_DATA_NONE |
+				VFIO_IRQ_SET_ACTION_TRIGGER,
+				vdev->irq_type, 0, 0, NULL);
+
+	vdev->virq_disabled = false;
+
+	vfio_config_free(vdev);
+
+	if (pci_reset_function(vdev->pdev) == 0) {
+		if (pci_load_and_free_saved_state(vdev->pdev,
+						  &vdev->pci_saved_state) == 0)
+			pci_restore_state(vdev->pdev);
+		else
+			printk(KERN_INFO "%s: Couldn't reload %s saved state\n",
+			       __func__, dev_name(&vdev->pdev->dev));
+	}
+
+	for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
+		if (!vdev->barmap[bar])
+			continue;
+		pci_iounmap(vdev->pdev, vdev->barmap[bar]);
+		pci_release_selected_regions(vdev->pdev, 1 << bar);
+		vdev->barmap[bar] = NULL;
+	}
+}
+
+static void vfio_pci_release(void *device_data)
+{
+	struct vfio_pci_device *vdev = device_data;
+
+	if (atomic_dec_and_test(&vdev->refcnt))
+		vfio_pci_disable(vdev);
+
+	module_put(THIS_MODULE);
+}
+
+static int vfio_pci_open(void *device_data)
+{
+	struct vfio_pci_device *vdev = device_data;
+
+	if (!try_module_get(THIS_MODULE))
+		return -ENODEV;
+
+	if (atomic_inc_return(&vdev->refcnt) == 1) {
+		int ret = vfio_pci_enable(vdev);
+		if (ret) {
+			module_put(THIS_MODULE);
+			return ret;
+		}
+	}
+
+	return 0;
+}
+
+static int vfio_pci_get_irq_count(struct vfio_pci_device *vdev, int irq_type)
+{
+	if (irq_type == VFIO_PCI_INTX_IRQ_INDEX) {
+		u8 pin;
+		pci_read_config_byte(vdev->pdev, PCI_INTERRUPT_PIN, &pin);
+		if (pin)
+			return 1;
+
+	} else if (irq_type == VFIO_PCI_MSI_IRQ_INDEX) {
+		u8 pos;
+		u16 flags;
+
+		pos = pci_find_capability(vdev->pdev, PCI_CAP_ID_MSI);
+		if (pos) {
+			pci_read_config_word(vdev->pdev,
+					     pos + PCI_MSI_FLAGS, &flags);
+
+			return 1 << (flags & PCI_MSI_FLAGS_QMASK);
+		}
+	} else if (irq_type == VFIO_PCI_MSIX_IRQ_INDEX) {
+		u8 pos;
+		u16 flags;
+
+		pos = pci_find_capability(vdev->pdev, PCI_CAP_ID_MSIX);
+		if (pos) {
+			pci_read_config_word(vdev->pdev,
+					     pos + PCI_MSIX_FLAGS, &flags);
+
+			return (flags & PCI_MSIX_FLAGS_QSIZE) + 1;
+		}
+	}
+
+	return 0;
+}
+
+static long vfio_pci_ioctl(void *device_data,
+			   unsigned int cmd, unsigned long arg)
+{
+	struct vfio_pci_device *vdev = device_data;
+	unsigned long minsz;
+
+	if (cmd == VFIO_DEVICE_GET_INFO) {
+		struct vfio_device_info info;
+
+		minsz = offsetofend(struct vfio_device_info, num_irqs);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		info.flags = VFIO_DEVICE_FLAGS_PCI;
+
+		if (vdev->reset_works)
+			info.flags |= VFIO_DEVICE_FLAGS_RESET;
+
+		info.num_regions = VFIO_PCI_NUM_REGIONS;
+		info.num_irqs = VFIO_PCI_NUM_IRQS;
+
+		return copy_to_user((void __user *)arg, &info, minsz) ?
+			-EFAULT : 0;
+
+	} else if (cmd == VFIO_DEVICE_GET_REGION_INFO) {
+		struct pci_dev *pdev = vdev->pdev;
+		struct vfio_region_info info;
+
+		minsz = offsetofend(struct vfio_region_info, offset);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz || info.index >= VFIO_PCI_NUM_REGIONS)
+			return -EINVAL;
+
+		info.flags = 0;
+		info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+
+		if (info.index == VFIO_PCI_CONFIG_REGION_INDEX) {
+			info.size = pdev->cfg_size;
+		} else if (pci_resource_start(pdev, info.index)) {
+			unsigned long flags;
+
+			flags = pci_resource_flags(pdev, info.index);
+
+			info.flags |= VFIO_REGION_INFO_FLAG_READ;
+
+			/* Report the actual ROM size instead of the BAR size;
+			 * this gives the user an easy way to determine whether
+			 * there's anything here w/o trying to read it. */
+			if (info.index == VFIO_PCI_ROM_REGION_INDEX) {
+				void __iomem *io;
+				size_t size;
+
+				io = pci_map_rom(pdev, &size);
+				info.size = io ? size : 0;
+				pci_unmap_rom(pdev, io);
+			} else if (flags & IORESOURCE_MEM) {
+				info.size = pci_resource_len(pdev, info.index);
+				info.flags |= (VFIO_REGION_INFO_FLAG_WRITE |
+					       VFIO_REGION_INFO_FLAG_MMAP);
+			} else {
+				info.size = pci_resource_len(pdev, info.index);
+				info.flags |= VFIO_REGION_INFO_FLAG_WRITE;
+			}
+		} else
+			info.size = 0;
+
+		return copy_to_user((void __user *)arg, &info, minsz) ?
+			-EFAULT : 0;
+
+	} else if (cmd == VFIO_DEVICE_GET_IRQ_INFO) {
+		struct vfio_irq_info info;
+
+		minsz = offsetofend(struct vfio_irq_info, count);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz || info.index >= VFIO_PCI_NUM_IRQS)
+			return -EINVAL;
+
+		info.flags = VFIO_IRQ_INFO_EVENTFD;
+
+		info.count = vfio_pci_get_irq_count(vdev, info.index);
+
+		if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
+			info.flags |= (VFIO_IRQ_INFO_MASKABLE |
+				       VFIO_IRQ_INFO_AUTOMASKED);
+		else
+			info.flags |= VFIO_IRQ_INFO_NORESIZE;
+
+		return copy_to_user((void __user *)arg, &info, minsz) ?
+			-EFAULT : 0;
+
+	} else if (cmd == VFIO_DEVICE_SET_IRQS) {
+		struct vfio_irq_set hdr;
+		u8 *data = NULL;
+		int ret = 0;
+
+		minsz = offsetofend(struct vfio_irq_set, count);
+
+		if (copy_from_user(&hdr, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (hdr.argsz < minsz || hdr.index >= VFIO_PCI_NUM_IRQS ||
+		    hdr.flags & ~(VFIO_IRQ_SET_DATA_TYPE_MASK |
+				  VFIO_IRQ_SET_ACTION_TYPE_MASK))
+			return -EINVAL;
+
+		if (!(hdr.flags & VFIO_IRQ_SET_DATA_NONE)) {
+			size_t size;
+
+			if (hdr.flags & VFIO_IRQ_SET_DATA_BOOL)
+				size = sizeof(uint8_t);
+			else if (hdr.flags & VFIO_IRQ_SET_DATA_EVENTFD)
+				size = sizeof(int32_t);
+			else
+				return -EINVAL;
+
+			if (hdr.argsz - minsz < hdr.count * size ||
+			    hdr.count > vfio_pci_get_irq_count(vdev, hdr.index))
+				return -EINVAL;
+
+			data = kmalloc(hdr.count * size, GFP_KERNEL);
+			if (!data)
+				return -ENOMEM;
+
+			if (copy_from_user(data, (void __user *)(arg + minsz),
+					   hdr.count * size)) {
+				kfree(data);
+				return -EFAULT;
+			}
+		}
+
+		mutex_lock(&vdev->igate);
+
+		ret = vfio_pci_set_irqs_ioctl(vdev, hdr.flags, hdr.index,
+					      hdr.start, hdr.count, data);
+
+		mutex_unlock(&vdev->igate);
+		kfree(data);
+
+		return ret;
+
+	} else if (cmd == VFIO_DEVICE_RESET)
+		return vdev->reset_works ?
+			pci_reset_function(vdev->pdev) : -EINVAL;
+
+	return -ENOTTY;
+}
+
+static ssize_t vfio_pci_read(void *device_data, char __user *buf,
+			     size_t count, loff_t *ppos)
+{
+	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	struct vfio_pci_device *vdev = device_data;
+	struct pci_dev *pdev = vdev->pdev;
+
+	if (index >= VFIO_PCI_NUM_REGIONS)
+		return -EINVAL;
+
+	if (index == VFIO_PCI_CONFIG_REGION_INDEX)
+		return vfio_pci_config_readwrite(vdev, buf, count, ppos, false);
+	else if (index == VFIO_PCI_ROM_REGION_INDEX)
+		return vfio_pci_mem_readwrite(vdev, buf, count, ppos, false);
+	else if (pci_resource_flags(pdev, index) & IORESOURCE_IO)
+		return vfio_pci_io_readwrite(vdev, buf, count, ppos, false);
+	else if (pci_resource_flags(pdev, index) & IORESOURCE_MEM)
+		return vfio_pci_mem_readwrite(vdev, buf, count, ppos, false);
+
+	return -EINVAL;
+}
+
+static ssize_t vfio_pci_write(void *device_data, const char __user *buf,
+			      size_t count, loff_t *ppos)
+{
+	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	struct vfio_pci_device *vdev = device_data;
+	struct pci_dev *pdev = vdev->pdev;
+
+	if (index >= VFIO_PCI_NUM_REGIONS)
+		return -EINVAL;
+
+	if (index == VFIO_PCI_CONFIG_REGION_INDEX)
+		return vfio_pci_config_readwrite(vdev, (char __user *)buf,
+						 count, ppos, true);
+	else if (index == VFIO_PCI_ROM_REGION_INDEX)
+		return -EINVAL;
+	else if (pci_resource_flags(pdev, index) & IORESOURCE_IO)
+		return vfio_pci_io_readwrite(vdev, (char __user *)buf,
+					     count, ppos, true);
+	else if (pci_resource_flags(pdev, index) & IORESOURCE_MEM) {
+		return vfio_pci_mem_readwrite(vdev, (char __user *)buf,
+					      count, ppos, true);
+	}
+
+	return -EINVAL;
+}
+
+static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
+{
+	struct vfio_pci_device *vdev = device_data;
+	struct pci_dev *pdev = vdev->pdev;
+	unsigned int index;
+	u64 phys_len, req_len, pgoff, req_start, phys;
+	int ret;
+
+	index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
+
+	if (vma->vm_end < vma->vm_start)
+		return -EINVAL;
+	if ((vma->vm_flags & VM_SHARED) == 0)
+		return -EINVAL;
+	if (index >= VFIO_PCI_ROM_REGION_INDEX)
+		return -EINVAL;
+	if (!(pci_resource_flags(pdev, index) & IORESOURCE_MEM))
+		return -EINVAL;
+
+	phys_len = pci_resource_len(pdev, index);
+	req_len = vma->vm_end - vma->vm_start;
+	pgoff = vma->vm_pgoff &
+		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
+	req_start = pgoff << PAGE_SHIFT;
+
+	if (phys_len < PAGE_SIZE || req_start + req_len > phys_len)
+		return -EINVAL;
+
+	if (index == vdev->msix_bar) {
+		/*
+		 * Disallow mmaps overlapping the MSI-X table; users don't
+		 * get to touch this directly.  We could find somewhere
+		 * else to map the overlap, but page granularity is only
+		 * a recommendation, not a requirement, so the user needs
+		 * to know which bits are real.  Requiring them to mmap
+		 * around the table makes that clear.
+		 */
+
+		/* If neither entirely above nor below, then it overlaps */
+		if (!(req_start >= vdev->msix_offset + vdev->msix_size ||
+		      req_start + req_len <= vdev->msix_offset))
+			return -EINVAL;
+	}
+
+	/*
+	 * Even though we don't make use of the barmap for the mmap,
+	 * we need to request the region and the barmap tracks that.
+	 */
+	if (!vdev->barmap[index]) {
+		ret = pci_request_selected_regions(pdev,
+						   1 << index, "vfio-pci");
+		if (ret)
+			return ret;
+
+		vdev->barmap[index] = pci_iomap(pdev, index, 0);
+	}
+
+	vma->vm_private_data = vdev;
+	vma->vm_flags |= (VM_IO | VM_RESERVED);
+	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+
+	phys = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
+
+	return remap_pfn_range(vma, vma->vm_start, phys,
+			       req_len, vma->vm_page_prot);
+}
+
+static const struct vfio_device_ops vfio_pci_ops = {
+	.name		= "vfio-pci",
+	.open		= vfio_pci_open,
+	.release	= vfio_pci_release,
+	.ioctl		= vfio_pci_ioctl,
+	.read		= vfio_pci_read,
+	.write		= vfio_pci_write,
+	.mmap		= vfio_pci_mmap,
+};
+
+static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
+{
+	u8 type;
+	struct vfio_pci_device *vdev;
+	struct iommu_group *group;
+	int ret;
+
+	/* Mask off the multifunction bit; only type 0 headers supported */
+	pci_read_config_byte(pdev, PCI_HEADER_TYPE, &type);
+	if ((type & 0x7f) != PCI_HEADER_TYPE_NORMAL)
+		return -EINVAL;
+
+	group = iommu_group_get(&pdev->dev);
+	if (!group)
+		return -EINVAL;
+
+	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
+	if (!vdev) {
+		iommu_group_put(group);
+		return -ENOMEM;
+	}
+
+	vdev->pdev = pdev;
+	vdev->irq_type = VFIO_PCI_NUM_IRQS;
+	mutex_init(&vdev->igate);
+	spin_lock_init(&vdev->irqlock);
+	atomic_set(&vdev->refcnt, 0);
+
+	ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
+	if (ret) {
+		iommu_group_put(group);
+		kfree(vdev);
+	}
+
+	return ret;
+}
+
+static void vfio_pci_remove(struct pci_dev *pdev)
+{
+	struct vfio_pci_device *vdev;
+
+	vdev = vfio_del_group_dev(&pdev->dev);
+	if (!vdev)
+		return;
+
+	iommu_group_put(pdev->dev.iommu_group);
+	kfree(vdev);
+}
+
+static struct pci_driver vfio_pci_driver = {
+	.name		= "vfio-pci",
+	.id_table	= NULL, /* only dynamic ids */
+	.probe		= vfio_pci_probe,
+	.remove		= vfio_pci_remove,
+};
+
+void __exit vfio_pci_cleanup(void)
+{
+	pci_unregister_driver(&vfio_pci_driver);
+	vfio_pci_virqfd_exit();
+	vfio_pci_uninit_perm_bits();
+}
+
+int __init vfio_pci_init(void)
+{
+	int ret;
+
+	/* Allocate shared config space permission data used by all devices */
+	ret = vfio_pci_init_perm_bits();
+	if (ret)
+		return ret;
+
+	/* Start the virqfd cleanup handler */
+	ret = vfio_pci_virqfd_init();
+	if (ret)
+		goto out_virqfd;
+
+	/* Register and scan for devices */
+	ret = pci_register_driver(&vfio_pci_driver);
+	if (ret)
+		goto out_driver;
+
+	return 0;
+
+out_virqfd:
+	vfio_pci_virqfd_exit();
+out_driver:
+	vfio_pci_uninit_perm_bits();
+	return ret;
+}
+
+module_init(vfio_pci_init);
+module_exit(vfio_pci_cleanup);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
new file mode 100644
index 0000000..a909433
--- /dev/null
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -0,0 +1,1527 @@
+/*
+ * VFIO PCI config space virtualization
+ *
+ * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
+ *     Author: Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Derived from original vfio:
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ * Author: Tom Lyon, pugs@cisco.com
+ */
+
+/*
+ * This code handles reading and writing of PCI configuration registers.
+ * This is hairy because we want to allow a lot of flexibility to the
+ * user driver, but cannot trust it with all of the config fields.
+ * Tables determine which fields can be read and written, as well as
+ * which fields are 'virtualized' - special actions and translations to
+ * make it appear to the user that he has control, when in fact things
+ * must be negotiated with the underlying OS.
+ */
+
+#include <linux/fs.h>
+#include <linux/pci.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+
+#include "vfio_pci_private.h"
+
+#define PCI_CFG_SPACE_SIZE	256
+
+/* Useful "pseudo" capabilities */
+#define PCI_CAP_ID_BASIC	0
+#define PCI_CAP_ID_INVALID	0xFF
+
+#define is_bar(offset)	\
+	((offset >= PCI_BASE_ADDRESS_0 && offset < PCI_BASE_ADDRESS_5 + 4) || \
+	 (offset >= PCI_ROM_ADDRESS && offset < PCI_ROM_ADDRESS + 4))
+
+/*
+ * Lengths of PCI Config Capabilities
+ *   0: Removed from the user visible capability list
+ *   FF: Variable length
+ */
+static u8 pci_cap_length[] = {
+	[PCI_CAP_ID_BASIC]	= PCI_STD_HEADER_SIZEOF, /* pci config header */
+	[PCI_CAP_ID_PM]		= PCI_PM_SIZEOF,
+	[PCI_CAP_ID_AGP]	= PCI_AGP_SIZEOF,
+	[PCI_CAP_ID_VPD]	= PCI_CAP_VPD_SIZEOF,
+	[PCI_CAP_ID_SLOTID]	= 0,		/* bridge - don't care */
+	[PCI_CAP_ID_MSI]	= 0xFF,		/* 10, 14, 20, or 24 */
+	[PCI_CAP_ID_CHSWP]	= 0,		/* cpci - not yet */
+	[PCI_CAP_ID_PCIX]	= 0xFF,		/* 8 or 24 */
+	[PCI_CAP_ID_HT]		= 0xFF,		/* hypertransport */
+	[PCI_CAP_ID_VNDR]	= 0xFF,		/* variable */
+	[PCI_CAP_ID_DBG]	= 0,		/* debug - don't care */
+	[PCI_CAP_ID_CCRC]	= 0,		/* cpci - not yet */
+	[PCI_CAP_ID_SHPC]	= 0,		/* hotswap - not yet */
+	[PCI_CAP_ID_SSVID]	= 0,		/* bridge - don't care */
+	[PCI_CAP_ID_AGP3]	= 0,		/* AGP8x - not yet */
+	[PCI_CAP_ID_SECDEV]	= 0,		/* secure device not yet */
+	[PCI_CAP_ID_EXP]	= 0xFF,		/* 20 or 44 */
+	[PCI_CAP_ID_MSIX]	= PCI_CAP_MSIX_SIZEOF,
+	[PCI_CAP_ID_SATA]	= 0xFF,
+	[PCI_CAP_ID_AF]		= PCI_CAP_AF_SIZEOF,
+};
+
+/*
+ * Lengths of PCIe/PCI-X Extended Config Capabilities
+ *   0: Removed or masked from the user visible capability list
+ *   FF: Variable length
+ */
+static u16 pci_ext_cap_length[] = {
+	[PCI_EXT_CAP_ID_ERR]	=	PCI_ERR_ROOT_COMMAND,
+	[PCI_EXT_CAP_ID_VC]	=	0xFF,
+	[PCI_EXT_CAP_ID_DSN]	=	PCI_EXT_CAP_DSN_SIZEOF,
+	[PCI_EXT_CAP_ID_PWR]	=	PCI_EXT_CAP_PWR_SIZEOF,
+	[PCI_EXT_CAP_ID_RCLD]	=	0,	/* root only - don't care */
+	[PCI_EXT_CAP_ID_RCILC]	=	0,	/* root only - don't care */
+	[PCI_EXT_CAP_ID_RCEC]	=	0,	/* root only - don't care */
+	[PCI_EXT_CAP_ID_MFVC]	=	0xFF,
+	[PCI_EXT_CAP_ID_VC9]	=	0xFF,	/* same as CAP_ID_VC */
+	[PCI_EXT_CAP_ID_RCRB]	=	0,	/* root only - don't care */
+	[PCI_EXT_CAP_ID_VNDR]	=	0xFF,
+	[PCI_EXT_CAP_ID_CAC]	=	0,	/* obsolete */
+	[PCI_EXT_CAP_ID_ACS]	=	0xFF,
+	[PCI_EXT_CAP_ID_ARI]	=	PCI_EXT_CAP_ARI_SIZEOF,
+	[PCI_EXT_CAP_ID_ATS]	=	PCI_EXT_CAP_ATS_SIZEOF,
+	[PCI_EXT_CAP_ID_SRIOV]	=	PCI_EXT_CAP_SRIOV_SIZEOF,
+	[PCI_EXT_CAP_ID_MRIOV]	=	0,	/* not yet */
+	[PCI_EXT_CAP_ID_MCAST]	=	PCI_EXT_CAP_MCAST_ENDPOINT_SIZEOF,
+	[PCI_EXT_CAP_ID_PRI]	=	PCI_EXT_CAP_PRI_SIZEOF,
+	[PCI_EXT_CAP_ID_AMD_XXX] =	0,	/* not yet */
+	[PCI_EXT_CAP_ID_REBAR]	=	0xFF,
+	[PCI_EXT_CAP_ID_DPA]	=	0xFF,
+	[PCI_EXT_CAP_ID_TPH]	=	0xFF,
+	[PCI_EXT_CAP_ID_LTR]	=	PCI_EXT_CAP_LTR_SIZEOF,
+	[PCI_EXT_CAP_ID_SECPCI]	=	0,	/* not yet */
+	[PCI_EXT_CAP_ID_PMUX]	=	0,	/* not yet */
+	[PCI_EXT_CAP_ID_PASID]	=	0,	/* not yet */
+};
+
+/*
+ * Read/Write Permission Bits - one bit for each bit in capability
+ * Any field can be read if it exists, but what is read depends on
+ * whether the field is 'virtualized' or passed straight through to
+ * the hardware.  Any virtualized field is also virtualized for writes.
+ * Writes are only permitted if they have a 1 bit here.
+ */
+struct perm_bits {
+	u8	*virt;		/* read/write virtual data, not hw */
+	u8	*write;		/* writeable bits */
+	int	(*readfn)(struct vfio_pci_device *vdev, int pos, int count,
+			  struct perm_bits *perm, int offset, u32 *val);
+	int	(*writefn)(struct vfio_pci_device *vdev, int pos, int count,
+			   struct perm_bits *perm, int offset, u32 val);
+};
+
+#define	NO_VIRT		0
+#define	ALL_VIRT	0xFFFFFFFFU
+#define	NO_WRITE	0
+#define	ALL_WRITE	0xFFFFFFFFU
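+
+/*
+ * Worked example (illustrative only): with the basic header setup
+ * below, PCI_COMMAND has virt = PCI_COMMAND_INTX_DISABLE and
+ * write = ALL_WRITE.  A 2-byte user write of 0x0507 then splits as:
+ *   val & write & virt  = 0x0400 -> INTX_DISABLE lands only in vconfig
+ *   val & write & ~virt = 0x0107 -> the remaining bits go to hardware
+ * and a read merges the two sources back together:
+ *   *val = (phys_val & ~virt) | (vconfig_val & virt)
+ */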
+
+static int vfio_user_config_read(struct pci_dev *pdev, int offset,
+				 u32 *val, int count)
+{
+	int ret = -EINVAL;
+
+	switch (count) {
+	case 1:
+		ret = pci_user_read_config_byte(pdev, offset, (u8 *)val);
+		break;
+	case 2:
+		ret = pci_user_read_config_word(pdev, offset, (u16 *)val);
+		*val = cpu_to_le16(*val);
+		break;
+	case 4:
+		ret = pci_user_read_config_dword(pdev, offset, val);
+		*val = cpu_to_le32(*val);
+		break;
+	}
+
+	return pcibios_err_to_errno(ret);
+}
+
+static int vfio_user_config_write(struct pci_dev *pdev, int offset,
+				  u32 val, int count)
+{
+	int ret = -EINVAL;
+
+	switch (count) {
+	case 1:
+		ret = pci_user_write_config_byte(pdev, offset, val);
+		break;
+	case 2:
+		ret = pci_user_write_config_word(pdev, offset, val);
+		break;
+	case 4:
+		ret = pci_user_write_config_dword(pdev, offset, val);
+		break;
+	}
+
+	return pcibios_err_to_errno(ret);
+}
+
+static int vfio_default_config_read(struct vfio_pci_device *vdev, int pos,
+				    int count, struct perm_bits *perm,
+				    int offset, u32 *val)
+{
+	u32 virt = 0;
+
+	memcpy(val, vdev->vconfig + pos, count);
+
+	memcpy(&virt, perm->virt + offset, count);
+
+	/* Any non-virtualized bits? */
+	if (cpu_to_le32(~0U >> (32 - (count * 8))) != virt) {
+		struct pci_dev *pdev = vdev->pdev;
+		u32 phys_val = 0;
+		int ret;
+
+		ret = vfio_user_config_read(pdev, pos, &phys_val, count);
+		if (ret)
+			return ret;
+
+		*val = (phys_val & ~virt) | (*val & virt);
+	}
+
+	return count;
+}
+
+static int vfio_default_config_write(struct vfio_pci_device *vdev, int pos,
+				     int count, struct perm_bits *perm,
+				     int offset, u32 val)
+{
+	u32 virt = 0, write = 0;
+
+	memcpy(&write, perm->write + offset, count);
+
+	if (!write)
+		return count; /* drop, no writable bits */
+
+	memcpy(&virt, perm->virt + offset, count);
+
+	/* Virtualized and writable bits go to vconfig */
+	if (write & virt) {
+		u32 virt_val = 0;
+
+		memcpy(&virt_val, vdev->vconfig + pos, count);
+
+		virt_val &= ~(write & virt);
+		virt_val |= (val & (write & virt));
+
+		memcpy(vdev->vconfig + pos, &virt_val, count);
+	}
+
+	/* Non-virtualzed and writable bits go to hardware */
+	if (write & ~virt) {
+		struct pci_dev *pdev = vdev->pdev;
+		u32 phys_val = 0;
+		int ret;
+
+		ret = vfio_user_config_read(pdev, pos, &phys_val, count);
+		if (ret)
+			return ret;
+
+		phys_val &= ~(write & ~virt);
+		phys_val |= (val & (write & ~virt));
+
+		ret = vfio_user_config_write(pdev, pos, phys_val, count);
+		if (ret)
+			return ret;
+	}
+
+	return count;
+}
+
+/* Allow direct read from hardware, except for capability next pointer */
+static int vfio_direct_config_read(struct vfio_pci_device *vdev, int pos,
+				   int count, struct perm_bits *perm,
+				   int offset, u32 *val)
+{
+	int ret;
+
+	ret = vfio_user_config_read(vdev->pdev, pos, val, count);
+	if (ret)
+		return pcibios_err_to_errno(ret);
+
+	if (pos >= PCI_CFG_SPACE_SIZE) { /* Extended cap header mangling */
+		if (offset < 4)
+			memcpy(val, vdev->vconfig + pos, count);
+	} else if (pos >= PCI_STD_HEADER_SIZEOF) { /* Std cap mangling */
+		if (offset == PCI_CAP_LIST_ID && count > 1)
+			memcpy(val, vdev->vconfig + pos,
+			       min(PCI_CAP_FLAGS, count));
+		else if (offset == PCI_CAP_LIST_NEXT)
+			memcpy(val, vdev->vconfig + pos, 1);
+	}
+
+	return count;
+}
+
+static int vfio_direct_config_write(struct vfio_pci_device *vdev, int pos,
+				    int count, struct perm_bits *perm,
+				    int offset, u32 val)
+{
+	int ret;
+
+	ret = vfio_user_config_write(vdev->pdev, pos, val, count);
+	if (ret)
+		return ret;
+
+	return count;
+}
+
+/* Default all regions to read-only, no-virtualization */
+static struct perm_bits cap_perms[PCI_CAP_ID_MAX + 1] = {
+	[0 ... PCI_CAP_ID_MAX] = { .readfn = vfio_direct_config_read }
+};
+static struct perm_bits ecap_perms[PCI_EXT_CAP_ID_MAX + 1] = {
+	[0 ... PCI_EXT_CAP_ID_MAX] = { .readfn = vfio_direct_config_read }
+};
+
+static void free_perm_bits(struct perm_bits *perm)
+{
+	kfree(perm->virt);
+	kfree(perm->write);
+	perm->virt = NULL;
+	perm->write = NULL;
+}
+
+static int alloc_perm_bits(struct perm_bits *perm, int size)
+{
+	/*
+	 * Round up all permission bits to the next dword; this lets us
+	 * ignore whether a read/write exceeds the defined capability
+	 * structure.  We can do this because:
+	 *  - Standard config space is already dword aligned
+	 *  - Capabilities are all dword aligned (bits 0:1 of next reserved)
+	 *  - Express capabilities are defined as dword aligned
+	 */
+	size = round_up(size, 4);
+
+	/*
+	 * Zero state is
+	 * - All Readable, None Writeable, None Virtualized
+	 */
+	perm->virt = kzalloc(size, GFP_KERNEL);
+	perm->write = kzalloc(size, GFP_KERNEL);
+	if (!perm->virt || !perm->write) {
+		free_perm_bits(perm);
+		return -ENOMEM;
+	}
+
+	perm->readfn = vfio_default_config_read;
+	perm->writefn = vfio_default_config_write;
+
+	return 0;
+}
+
+/*
+ * Helper functions for filling in permission tables
+ */
+static inline void p_setb(struct perm_bits *p, int off, u8 virt, u8 write)
+{
+	p->virt[off] = virt;
+	p->write[off] = write;
+}
+
+/* Handle endian-ness - pci and tables are little-endian */
+static inline void p_setw(struct perm_bits *p, int off, u16 virt, u16 write)
+{
+	*(u16 *)(&p->virt[off]) = cpu_to_le16(virt);
+	*(u16 *)(&p->write[off]) = cpu_to_le16(write);
+}
+
+/* Handle endian-ness - pci and tables are little-endian */
+static inline void p_setd(struct perm_bits *p, int off, u32 virt, u32 write)
+{
+	*(u32 *)(&p->virt[off]) = cpu_to_le32(virt);
+	*(u32 *)(&p->write[off]) = cpu_to_le32(write);
+}
+
+/*
+ * Restore the *real* BARs after we detect a FLR or backdoor reset.
+ * (backdoor = some device specific technique that we didn't catch)
+ */
+static void vfio_bar_restore(struct vfio_pci_device *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	u32 *rbar = vdev->rbar;
+	int i;
+
+	if (pdev->is_virtfn)
+		return;
+
+	printk(KERN_INFO "%s: %s reset recovery - restoring bars\n",
+	       __func__, dev_name(&pdev->dev));
+
+	for (i = PCI_BASE_ADDRESS_0; i <= PCI_BASE_ADDRESS_5; i += 4, rbar++)
+		pci_user_write_config_dword(pdev, i, *rbar);
+
+	pci_user_write_config_dword(pdev, PCI_ROM_ADDRESS, *rbar);
+}
+
+static u32 vfio_generate_bar_flags(struct pci_dev *pdev, int bar)
+{
+	unsigned long flags = pci_resource_flags(pdev, bar);
+	u32 val;
+
+	if (flags & IORESOURCE_IO)
+		return cpu_to_le32(PCI_BASE_ADDRESS_SPACE_IO);
+
+	val = PCI_BASE_ADDRESS_SPACE_MEMORY;
+
+	if (flags & IORESOURCE_PREFETCH)
+		val |= PCI_BASE_ADDRESS_MEM_PREFETCH;
+
+	if (flags & IORESOURCE_MEM_64)
+		val |= PCI_BASE_ADDRESS_MEM_TYPE_64;
+
+	return cpu_to_le32(val);
+}
+
+/*
+ * Pretend we're hardware and tweak the values of the *virtual* PCI BARs
+ * to reflect the hardware capabilities.  This implements BAR sizing.
+ */
+static void vfio_bar_fixup(struct vfio_pci_device *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	int i;
+	u32 *bar;
+	u64 mask;
+
+	bar = (u32 *)&vdev->vconfig[PCI_BASE_ADDRESS_0];
+
+	for (i = PCI_STD_RESOURCES; i <= PCI_STD_RESOURCE_END; i++, bar++) {
+		if (!pci_resource_start(pdev, i)) {
+			*bar = 0; /* Unmapped by host = unimplemented to user */
+			continue;
+		}
+
+		mask = ~(pci_resource_len(pdev, i) - 1);
+
+		*bar &= cpu_to_le32((u32)mask);
+		*bar |= vfio_generate_bar_flags(pdev, i);
+
+		if (*bar & cpu_to_le32(PCI_BASE_ADDRESS_MEM_TYPE_64)) {
+			bar++;
+			*bar &= cpu_to_le32((u32)(mask >> 32));
+			i++;
+		}
+	}
+
+	bar = (u32 *)&vdev->vconfig[PCI_ROM_ADDRESS];
+
+	/*
+	 * NB. we expose the actual BAR size here, regardless of whether
+	 * we can read it.  When we report the REGION_INFO for the ROM
+	 * we report what PCI tells us is the actual ROM size.
+	 */
+	if (pci_resource_start(pdev, PCI_ROM_RESOURCE)) {
+		mask = ~(pci_resource_len(pdev, PCI_ROM_RESOURCE) - 1);
+		mask |= PCI_ROM_ADDRESS_ENABLE;
+		*bar &= cpu_to_le32((u32)mask);
+	} else
+		*bar = 0;
+
+	vdev->bardirty = false;
+}
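+
+/*
+ * A minimal sketch of the sizing cycle this emulates, as a user driver
+ * might issue it through the config region (hypothetical device_fd and
+ * cfg_offset variables, not part of this patch):
+ *
+ *   u32 val = 0xffffffff;
+ *   pwrite(device_fd, &val, 4, cfg_offset + PCI_BASE_ADDRESS_0);
+ *   pread(device_fd, &val, 4, cfg_offset + PCI_BASE_ADDRESS_0);
+ *   size = ~(val & PCI_BASE_ADDRESS_MEM_MASK) + 1;
+ *
+ * The write lands only in vconfig (BARs are fully virtualized) and the
+ * read returns it masked to the real resource size by the code above.
+ */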
+
+static int vfio_basic_config_read(struct vfio_pci_device *vdev, int pos,
+				  int count, struct perm_bits *perm,
+				  int offset, u32 *val)
+{
+	if (is_bar(offset)) /* pos == offset for basic config */
+		vfio_bar_fixup(vdev);
+
+	count = vfio_default_config_read(vdev, pos, count, perm, offset, val);
+
+	/* Mask in virtual memory enable for SR-IOV devices */
+	if (offset == PCI_COMMAND && vdev->pdev->is_virtfn) {
+		u16 cmd = *(u16 *)&vdev->vconfig[PCI_COMMAND];
+		*val |= cmd & cpu_to_le16(PCI_COMMAND_MEMORY);
+	}
+
+	return count;
+}
+
+static int vfio_basic_config_write(struct vfio_pci_device *vdev, int pos,
+				   int count, struct perm_bits *perm,
+				   int offset, u32 val)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	u16 phys_cmd, *virt_cmd, new_cmd = 0;
+	int ret;
+
+	virt_cmd = (u16 *)&vdev->vconfig[PCI_COMMAND];
+
+	if (offset == PCI_COMMAND) {
+		bool phys_mem, virt_mem, new_mem, phys_io, virt_io, new_io;
+
+		ret = pci_user_read_config_word(pdev, PCI_COMMAND, &phys_cmd);
+		if (ret)
+			return pcibios_err_to_errno(ret);
+
+		switch (count) {
+		case 1:
+			new_cmd = val;
+			break;
+		case 2:
+			new_cmd = le16_to_cpu(val);
+			break;
+		case 4:
+			new_cmd = (u16)le32_to_cpu(val);
+			break;
+		}
+
+		phys_mem = !!(phys_cmd & PCI_COMMAND_MEMORY);
+		virt_mem = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_MEMORY);
+		new_mem = !!(new_cmd & PCI_COMMAND_MEMORY);
+
+		phys_io = !!(phys_cmd & PCI_COMMAND_IO);
+		virt_io = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_IO);
+		new_io = !!(new_cmd & PCI_COMMAND_IO);
+
+		/*
+		 * If the user is writing mem/io enable (new_mem/io) and we
+		 * think it's already enabled (virt_mem/io), but the hardware
+		 * shows it disabled (phys_mem/io), then the device has
+		 * undergone some kind of backdoor reset and needs to be
+		 * restored before we allow it to enable the bars.
+		 * SR-IOV devices will trigger this, but we catch them later.
+		 */
+		if ((new_mem && virt_mem && !phys_mem) ||
+		    (new_io && virt_io && !phys_io))
+			vfio_bar_restore(vdev);
+	}
+
+	count = vfio_default_config_write(vdev, pos, count, perm, offset, val);
+	if (count < 0)
+		return count;
+
+	/*
+	 * Save current memory/io enable bits in vconfig to allow for
+	 * the test above next time.
+	 */
+	if (offset == PCI_COMMAND) {
+		u16 mask = PCI_COMMAND_MEMORY | PCI_COMMAND_IO;
+
+		*virt_cmd &= cpu_to_le16(~mask);
+		*virt_cmd |= new_cmd & cpu_to_le16(mask);
+	}
+
+	/* Emulate INTx disable */
+	if (offset >= PCI_COMMAND && offset <= PCI_COMMAND + 1) {
+		bool virt_intx_disable;
+
+		virt_intx_disable = !!(le16_to_cpu(*virt_cmd) &
+				       PCI_COMMAND_INTX_DISABLE);
+
+		if (virt_intx_disable && !vdev->virq_disabled) {
+			vdev->virq_disabled = true;
+			vfio_pci_intx_mask(vdev);
+		} else if (!virt_intx_disable && vdev->virq_disabled) {
+			vdev->virq_disabled = false;
+			vfio_pci_intx_unmask(vdev);
+		}
+	}
+
+	if (is_bar(offset))
+		vdev->bardirty = true;
+
+	return count;
+}
+
+/* Permissions for the Basic PCI Header */
+static int __init init_pci_cap_basic_perm(struct perm_bits *perm)
+{
+	if (alloc_perm_bits(perm, PCI_STD_HEADER_SIZEOF))
+		return -ENOMEM;
+
+	perm->readfn = vfio_basic_config_read;
+	perm->writefn = vfio_basic_config_write;
+
+	/* Virtualized for SR-IOV functions, which just have FFFF */
+	p_setw(perm, PCI_VENDOR_ID, (u16)ALL_VIRT, NO_WRITE);
+	p_setw(perm, PCI_DEVICE_ID, (u16)ALL_VIRT, NO_WRITE);
+
+	/*
+	 * Virtualize INTx disable, we use it internally for interrupt
+	 * control and can emulate it for non-PCI 2.3 devices.
+	 */
+	p_setw(perm, PCI_COMMAND, PCI_COMMAND_INTX_DISABLE, (u16)ALL_WRITE);
+
+	/* Virtualize capability list, we might want to skip/disable */
+	p_setw(perm, PCI_STATUS, PCI_STATUS_CAP_LIST, NO_WRITE);
+
+	/* No harm to write */
+	p_setb(perm, PCI_CACHE_LINE_SIZE, NO_VIRT, (u8)ALL_WRITE);
+	p_setb(perm, PCI_LATENCY_TIMER, NO_VIRT, (u8)ALL_WRITE);
+	p_setb(perm, PCI_BIST, NO_VIRT, (u8)ALL_WRITE);
+
+	/* Virtualize all bars, can't touch the real ones */
+	p_setd(perm, PCI_BASE_ADDRESS_0, ALL_VIRT, ALL_WRITE);
+	p_setd(perm, PCI_BASE_ADDRESS_1, ALL_VIRT, ALL_WRITE);
+	p_setd(perm, PCI_BASE_ADDRESS_2, ALL_VIRT, ALL_WRITE);
+	p_setd(perm, PCI_BASE_ADDRESS_3, ALL_VIRT, ALL_WRITE);
+	p_setd(perm, PCI_BASE_ADDRESS_4, ALL_VIRT, ALL_WRITE);
+	p_setd(perm, PCI_BASE_ADDRESS_5, ALL_VIRT, ALL_WRITE);
+	p_setd(perm, PCI_ROM_ADDRESS, ALL_VIRT, ALL_WRITE);
+
+	/* Allow us to adjust capability chain */
+	p_setb(perm, PCI_CAPABILITY_LIST, (u8)ALL_VIRT, NO_WRITE);
+
+	/* Sometimes used by sw, just virtualize */
+	p_setb(perm, PCI_INTERRUPT_LINE, (u8)ALL_VIRT, (u8)ALL_WRITE);
+	return 0;
+}
+
+/* Permissions for the Power Management capability */
+static int __init init_pci_cap_pm_perm(struct perm_bits *perm)
+{
+	if (alloc_perm_bits(perm, pci_cap_length[PCI_CAP_ID_PM]))
+		return -ENOMEM;
+
+	/*
+	 * We always virtualize the next field so we can remove
+	 * capabilities from the chain if we want to.
+	 */
+	p_setb(perm, PCI_CAP_LIST_NEXT, (u8)ALL_VIRT, NO_WRITE);
+
+	/*
+	 * Power management is defined *per function*,
+	 * so we let the user write this
+	 */
+	p_setd(perm, PCI_PM_CTRL, NO_VIRT, ALL_WRITE);
+	return 0;
+}
+
+/* Permissions for PCI-X capability */
+static int __init init_pci_cap_pcix_perm(struct perm_bits *perm)
+{
+	/* Alloc 24, but only 8 are used in v0 */
+	if (alloc_perm_bits(perm, PCI_CAP_PCIX_SIZEOF_V12))
+		return -ENOMEM;
+
+	p_setb(perm, PCI_CAP_LIST_NEXT, (u8)ALL_VIRT, NO_WRITE);
+
+	p_setw(perm, PCI_X_CMD, NO_VIRT, (u16)ALL_WRITE);
+	p_setd(perm, PCI_X_ECC_CSR, NO_VIRT, ALL_WRITE);
+	return 0;
+}
+
+/* Permissions for PCI Express capability */
+static int __init init_pci_cap_exp_perm(struct perm_bits *perm)
+{
+	/* Alloc larger of two possible sizes */
+	if (alloc_perm_bits(perm, PCI_CAP_EXP_ENDPOINT_SIZEOF_V2))
+		return -ENOMEM;
+
+	p_setb(perm, PCI_CAP_LIST_NEXT, (u8)ALL_VIRT, NO_WRITE);
+
+	/*
+	 * Allow writes to device control fields (includes FLR!),
+	 * but not to devctl_phantom, which could confuse the IOMMU,
+	 * or to the ARI bit in devctl2, which is set at probe time.
+	 */
+	p_setw(perm, PCI_EXP_DEVCTL, NO_VIRT, ~PCI_EXP_DEVCTL_PHANTOM);
+	p_setw(perm, PCI_EXP_DEVCTL2, NO_VIRT, ~PCI_EXP_DEVCTL2_ARI);
+	return 0;
+}
+
+/* Permissions for Advanced Function capability */
+static int __init init_pci_cap_af_perm(struct perm_bits *perm)
+{
+	if (alloc_perm_bits(perm, pci_cap_length[PCI_CAP_ID_AF]))
+		return -ENOMEM;
+
+	p_setb(perm, PCI_CAP_LIST_NEXT, (u8)ALL_VIRT, NO_WRITE);
+	p_setb(perm, PCI_AF_CTRL, NO_VIRT, PCI_AF_CTRL_FLR);
+	return 0;
+}
+
+/* Permissions for Advanced Error Reporting extended capability */
+static int __init init_pci_ext_cap_err_perm(struct perm_bits *perm)
+{
+	u32 mask;
+
+	if (alloc_perm_bits(perm, pci_ext_cap_length[PCI_EXT_CAP_ID_ERR]))
+		return -ENOMEM;
+
+	/*
+	 * Virtualize the first dword of all express capabilities
+	 * because it includes the next pointer.  This lets us later
+	 * remove capabilities from the chain if we need to.
+	 */
+	p_setd(perm, 0, ALL_VIRT, NO_WRITE);
+
+	/* Writable bits mask */
+	mask =	PCI_ERR_UNC_TRAIN |		/* Training */
+		PCI_ERR_UNC_DLP |		/* Data Link Protocol */
+		PCI_ERR_UNC_SURPDN |		/* Surprise Down */
+		PCI_ERR_UNC_POISON_TLP |	/* Poisoned TLP */
+		PCI_ERR_UNC_FCP |		/* Flow Control Protocol */
+		PCI_ERR_UNC_COMP_TIME |		/* Completion Timeout */
+		PCI_ERR_UNC_COMP_ABORT |	/* Completer Abort */
+		PCI_ERR_UNC_UNX_COMP |		/* Unexpected Completion */
+		PCI_ERR_UNC_RX_OVER |		/* Receiver Overflow */
+		PCI_ERR_UNC_MALF_TLP |		/* Malformed TLP */
+		PCI_ERR_UNC_ECRC |		/* ECRC Error Status */
+		PCI_ERR_UNC_UNSUP |		/* Unsupported Request */
+		PCI_ERR_UNC_ACSV |		/* ACS Violation */
+		PCI_ERR_UNC_INTN |		/* internal error */
+		PCI_ERR_UNC_MCBTLP |		/* MC blocked TLP */
+		PCI_ERR_UNC_ATOMEG |		/* Atomic egress blocked */
+		PCI_ERR_UNC_TLPPRE;		/* TLP prefix blocked */
+	p_setd(perm, PCI_ERR_UNCOR_STATUS, NO_VIRT, mask);
+	p_setd(perm, PCI_ERR_UNCOR_MASK, NO_VIRT, mask);
+	p_setd(perm, PCI_ERR_UNCOR_SEVER, NO_VIRT, mask);
+
+	mask =	PCI_ERR_COR_RCVR |		/* Receiver Error Status */
+		PCI_ERR_COR_BAD_TLP |		/* Bad TLP Status */
+		PCI_ERR_COR_BAD_DLLP |		/* Bad DLLP Status */
+		PCI_ERR_COR_REP_ROLL |		/* REPLAY_NUM Rollover */
+		PCI_ERR_COR_REP_TIMER |		/* Replay Timer Timeout */
+		PCI_ERR_COR_ADV_NFAT |		/* Advisory Non-Fatal */
+		PCI_ERR_COR_INTERNAL |		/* Corrected Internal */
+		PCI_ERR_COR_LOG_OVER;		/* Header Log Overflow */
+	p_setd(perm, PCI_ERR_COR_STATUS, NO_VIRT, mask);
+	p_setd(perm, PCI_ERR_COR_MASK, NO_VIRT, mask);
+
+	mask =	PCI_ERR_CAP_ECRC_GENE |		/* ECRC Generation Enable */
+		PCI_ERR_CAP_ECRC_CHKE;		/* ECRC Check Enable */
+	p_setd(perm, PCI_ERR_CAP, NO_VIRT, mask);
+	return 0;
+}
+
+/* Permissions for Power Budgeting extended capability */
+static int __init init_pci_ext_cap_pwr_perm(struct perm_bits *perm)
+{
+	if (alloc_perm_bits(perm, pci_ext_cap_length[PCI_EXT_CAP_ID_PWR]))
+		return -ENOMEM;
+
+	p_setd(perm, 0, ALL_VIRT, NO_WRITE);
+
+	/* Writing the data selector is OK, the info is still read-only */
+	p_setb(perm, PCI_PWR_DATA, NO_VIRT, (u8)ALL_WRITE);
+	return 0;
+}
+
+/*
+ * Initialize the shared permission tables
+ */
+void vfio_pci_uninit_perm_bits(void)
+{
+	free_perm_bits(&cap_perms[PCI_CAP_ID_BASIC]);
+
+	free_perm_bits(&cap_perms[PCI_CAP_ID_PM]);
+	free_perm_bits(&cap_perms[PCI_CAP_ID_PCIX]);
+	free_perm_bits(&cap_perms[PCI_CAP_ID_EXP]);
+	free_perm_bits(&cap_perms[PCI_CAP_ID_AF]);
+
+	free_perm_bits(&ecap_perms[PCI_EXT_CAP_ID_ERR]);
+	free_perm_bits(&ecap_perms[PCI_EXT_CAP_ID_PWR]);
+}
+
+int __init vfio_pci_init_perm_bits(void)
+{
+	int ret;
+
+	/* Basic config space */
+	ret = init_pci_cap_basic_perm(&cap_perms[PCI_CAP_ID_BASIC]);
+
+	/* Capabilities */
+	ret |= init_pci_cap_pm_perm(&cap_perms[PCI_CAP_ID_PM]);
+	cap_perms[PCI_CAP_ID_VPD].writefn = vfio_direct_config_write;
+	ret |= init_pci_cap_pcix_perm(&cap_perms[PCI_CAP_ID_PCIX]);
+	cap_perms[PCI_CAP_ID_VNDR].writefn = vfio_direct_config_write;
+	ret |= init_pci_cap_exp_perm(&cap_perms[PCI_CAP_ID_EXP]);
+	ret |= init_pci_cap_af_perm(&cap_perms[PCI_CAP_ID_AF]);
+
+	/* Extended capabilities */
+	ret |= init_pci_ext_cap_err_perm(&ecap_perms[PCI_EXT_CAP_ID_ERR]);
+	ret |= init_pci_ext_cap_pwr_perm(&ecap_perms[PCI_EXT_CAP_ID_PWR]);
+	ecap_perms[PCI_EXT_CAP_ID_VNDR].writefn = vfio_direct_config_write;
+
+	if (ret)
+		vfio_pci_uninit_perm_bits();
+
+	return ret;
+}
+
+static int vfio_find_cap_start(struct vfio_pci_device *vdev, int pos)
+{
+	u8 cap;
+	int base = (pos >= PCI_CFG_SPACE_SIZE) ? PCI_CFG_SPACE_SIZE :
+						 PCI_STD_HEADER_SIZEOF;
+	base /= 4;
+	pos /= 4;
+
+	cap = vdev->pci_config_map[pos];
+
+	if (cap == PCI_CAP_ID_BASIC)
+		return 0;
+
+	/* XXX Can we have two abutting capabilities of the same type? */
+	while (pos - 1 >= base && vdev->pci_config_map[pos - 1] == cap)
+		pos--;
+
+	return pos * 4;
+}
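+
+/*
+ * For example (illustrative): an MSI capability at 0x50 of length 0x10
+ * fills pci_config_map[0x14-0x17] with PCI_CAP_ID_MSI, so an access at
+ * 0x58 walks the map entries back to 0x14 and returns 0x50.
+ */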
+
+static int vfio_msi_config_read(struct vfio_pci_device *vdev, int pos,
+				int count, struct perm_bits *perm,
+				int offset, u32 *val)
+{
+	/* Update max available queue size from msi_qmax */
+	if (offset <= PCI_MSI_FLAGS && offset + count >= PCI_MSI_FLAGS) {
+		u16 *flags;
+		int start;
+
+		start = vfio_find_cap_start(vdev, pos);
+
+		flags = (u16 *)&vdev->vconfig[start];
+
+		*flags &= cpu_to_le16(~PCI_MSI_FLAGS_QMASK);
+		*flags |= cpu_to_le16(vdev->msi_qmax << 1);
+	}
+
+	return vfio_default_config_read(vdev, pos, count, perm, offset, val);
+}
+
+static int vfio_msi_config_write(struct vfio_pci_device *vdev, int pos,
+				 int count, struct perm_bits *perm,
+				 int offset, u32 val)
+{
+	count = vfio_default_config_write(vdev, pos, count, perm, offset, val);
+	if (count < 0)
+		return count;
+
+	/* Fixup and write configured queue size and enable to hardware */
+	if (offset <= PCI_MSI_FLAGS && offset + count >= PCI_MSI_FLAGS) {
+		u16 *pflags, flags;
+		int start, ret;
+
+		start = vfio_find_cap_start(vdev, pos);
+
+		pflags = (u16 *)&vdev->vconfig[start + PCI_MSI_FLAGS];
+
+		flags = le16_to_cpu(*pflags);
+
+		/* MSI is enabled via ioctl */
+		if (!is_msi(vdev))
+			flags &= ~PCI_MSI_FLAGS_ENABLE;
+
+		/* Check queue size */
+		if ((flags & PCI_MSI_FLAGS_QSIZE) >> 4 > vdev->msi_qmax) {
+			flags &= ~PCI_MSI_FLAGS_QSIZE;
+			flags |= vdev->msi_qmax << 4;
+		}
+
+		/* Write back to virt and to hardware */
+		*pflags = cpu_to_le16(flags);
+		ret = pci_user_write_config_word(vdev->pdev,
+						 start + PCI_MSI_FLAGS,
+						 flags);
+		if (ret)
+			return pcibios_err_to_errno(ret);
+	}
+
+	return count;
+}
+
+/*
+ * MSI determination is per-device, so this routine gets used beyond
+ * initialization time. Don't add __init
+ */
+static int init_pci_cap_msi_perm(struct perm_bits *perm, int len, u16 flags)
+{
+	if (alloc_perm_bits(perm, len))
+		return -ENOMEM;
+
+	perm->readfn = vfio_msi_config_read;
+	perm->writefn = vfio_msi_config_write;
+
+	p_setb(perm, PCI_CAP_LIST_NEXT, (u8)ALL_VIRT, NO_WRITE);
+
+	/*
+	 * The upper byte of the control register is reserved,
+	 * so just set up the lower byte.
+	 */
+	p_setb(perm, PCI_MSI_FLAGS, (u8)ALL_VIRT, (u8)ALL_WRITE);
+	p_setd(perm, PCI_MSI_ADDRESS_LO, ALL_VIRT, ALL_WRITE);
+	if (flags & PCI_MSI_FLAGS_64BIT) {
+		p_setd(perm, PCI_MSI_ADDRESS_HI, ALL_VIRT, ALL_WRITE);
+		p_setw(perm, PCI_MSI_DATA_64, (u16)ALL_VIRT, (u16)ALL_WRITE);
+		if (flags & PCI_MSI_FLAGS_MASKBIT) {
+			p_setd(perm, PCI_MSI_MASK_64, NO_VIRT, ALL_WRITE);
+			p_setd(perm, PCI_MSI_PENDING_64, NO_VIRT, ALL_WRITE);
+		}
+	} else {
+		p_setw(perm, PCI_MSI_DATA_32, (u16)ALL_VIRT, (u16)ALL_WRITE);
+		if (flags & PCI_MSI_FLAGS_MASKBIT) {
+			p_setd(perm, PCI_MSI_MASK_32, NO_VIRT, ALL_WRITE);
+			p_setd(perm, PCI_MSI_PENDING_32, NO_VIRT, ALL_WRITE);
+		}
+	}
+	return 0;
+}
+
+/* Determine MSI CAP field length; initialize msi_perms on 1st call per vdev */
+static int vfio_msi_cap_len(struct vfio_pci_device *vdev, u8 pos)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	int len, ret;
+	u16 flags;
+
+	ret = pci_read_config_word(pdev, pos + PCI_MSI_FLAGS, &flags);
+	if (ret)
+		return pcibios_err_to_errno(ret);
+
+	len = 10; /* Minimum size */
+	if (flags & PCI_MSI_FLAGS_64BIT)
+		len += 4;
+	if (flags & PCI_MSI_FLAGS_MASKBIT)
+		len += 10;
+
+	if (vdev->msi_perm)
+		return len;
+
+	vdev->msi_perm = kmalloc(sizeof(struct perm_bits), GFP_KERNEL);
+	if (!vdev->msi_perm)
+		return -ENOMEM;
+
+	ret = init_pci_cap_msi_perm(vdev->msi_perm, len, flags);
+	if (ret)
+		return ret;
+
+	return len;
+}
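+
+/*
+ * Example (illustrative): flags with PCI_MSI_FLAGS_64BIT and
+ * PCI_MSI_FLAGS_MASKBIT set yield len = 10 + 4 + 10 = 24, matching
+ * the 10, 14, 20, or 24 byte sizes noted in pci_cap_length[] above.
+ */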
+
+/* Determine extended capability length for VC (2 & 9) and MFVC */
+static int vfio_vc_cap_len(struct vfio_pci_device *vdev, u16 pos)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	u32 tmp;
+	int ret, evcc, phases, vc_arb;
+	int len = PCI_CAP_VC_BASE_SIZEOF;
+
+	ret = pci_read_config_dword(pdev, pos + PCI_VC_PORT_REG1, &tmp);
+	if (ret)
+		return pcibios_err_to_errno(ret);
+
+	evcc = tmp & PCI_VC_REG1_EVCC; /* extended vc count */
+	ret = pci_read_config_dword(pdev, pos + PCI_VC_PORT_REG2, &tmp);
+	if (ret)
+		return pcibios_err_to_errno(ret);
+
+	if (tmp & PCI_VC_REG2_128_PHASE)
+		phases = 128;
+	else if (tmp & PCI_VC_REG2_64_PHASE)
+		phases = 64;
+	else if (tmp & PCI_VC_REG2_32_PHASE)
+		phases = 32;
+	else
+		phases = 0;
+
+	vc_arb = phases * 4;
+
+	/*
+	 * Port arbitration tables are root & switch only;
+	 * function arbitration tables are function 0 only.
+	 * In either case, we'll never let the user write them,
+	 * so we don't care how big they are.
+	 */
+	len += (1 + evcc) * PCI_CAP_VC_PER_VC_SIZEOF;
+	if (vc_arb) {
+		len = round_up(len, 16);
+		len += vc_arb / 8;
+	}
+	return len;
+}
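+
+/*
+ * Example (illustrative, assuming the pci_regs.h sizes of 0x10 for
+ * PCI_CAP_VC_BASE_SIZEOF and 0x0c for PCI_CAP_VC_PER_VC_SIZEOF):
+ * evcc = 1 with 64-phase VC arbitration gives len = 0x10 + 2 * 0x0c
+ * = 0x28, rounded up to 0x30, plus 64 * 4 / 8 = 32 bytes of
+ * arbitration table, i.e. 0x50 total.
+ */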
+
+static int vfio_cap_len(struct vfio_pci_device *vdev, u8 cap, u8 pos)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	u16 word;
+	u8 byte;
+	int ret;
+
+	switch (cap) {
+	case PCI_CAP_ID_MSI:
+		return vfio_msi_cap_len(vdev, pos);
+	case PCI_CAP_ID_PCIX:
+		ret = pci_read_config_word(pdev, pos + PCI_X_CMD, &word);
+		if (ret)
+			return pcibios_err_to_errno(ret);
+
+		if (PCI_X_CMD_VERSION(word)) {
+			vdev->extended_caps = true;
+			return PCI_CAP_PCIX_SIZEOF_V12;
+		} else
+			return PCI_CAP_PCIX_SIZEOF_V0;
+	case PCI_CAP_ID_VNDR:
+		/* length follows next field */
+		ret = pci_read_config_byte(pdev, pos + PCI_CAP_FLAGS, &byte);
+		if (ret)
+			return pcibios_err_to_errno(ret);
+
+		return byte;
+	case PCI_CAP_ID_EXP:
+		/* length based on version */
+		ret = pci_read_config_word(pdev, pos + PCI_EXP_FLAGS, &word);
+		if (ret)
+			return pcibios_err_to_errno(ret);
+
+		if ((word & PCI_EXP_FLAGS_VERS) == 1)
+			return PCI_CAP_EXP_ENDPOINT_SIZEOF_V1;
+		else {
+			vdev->extended_caps = true;
+			return PCI_CAP_EXP_ENDPOINT_SIZEOF_V2;
+		}
+	case PCI_CAP_ID_HT:
+		ret = pci_read_config_byte(pdev, pos + 3, &byte);
+		if (ret)
+			return pcibios_err_to_errno(ret);
+
+		return (byte & HT_3BIT_CAP_MASK) ?
+			HT_CAP_SIZEOF_SHORT : HT_CAP_SIZEOF_LONG;
+	case PCI_CAP_ID_SATA:
+		ret = pci_read_config_byte(pdev, pos + PCI_SATA_REGS, &byte);
+		if (ret)
+			return pcibios_err_to_errno(ret);
+
+		byte &= PCI_SATA_REGS_MASK;
+		if (byte == PCI_SATA_REGS_INLINE)
+			return PCI_SATA_SIZEOF_LONG;
+		else
+			return PCI_SATA_SIZEOF_SHORT;
+	default:
+		printk(KERN_WARNING
+		       "%s: %s unknown length for pci cap 0x%x@0x%x\n",
+		       dev_name(&pdev->dev), __func__, cap, pos);
+	}
+
+	return 0;
+}
+
+static int vfio_ext_cap_len(struct vfio_pci_device *vdev, u16 ecap, u16 epos)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	u8 byte;
+	u32 dword;
+	int ret;
+
+	switch (ecap) {
+	case PCI_EXT_CAP_ID_VNDR:
+		ret = pci_read_config_dword(pdev, epos + PCI_VSEC_HDR, &dword);
+		if (ret)
+			return pcibios_err_to_errno(ret);
+
+		return dword >> PCI_VSEC_HDR_LEN_SHIFT;
+	case PCI_EXT_CAP_ID_VC:
+	case PCI_EXT_CAP_ID_VC9:
+	case PCI_EXT_CAP_ID_MFVC:
+		return vfio_vc_cap_len(vdev, epos);
+	case PCI_EXT_CAP_ID_ACS:
+		ret = pci_read_config_byte(pdev, epos + PCI_ACS_CAP, &byte);
+		if (ret)
+			return pcibios_err_to_errno(ret);
+
+		if (byte & PCI_ACS_EC) {
+			int bits;
+
+			ret = pci_read_config_byte(pdev,
+						   epos + PCI_ACS_EGRESS_BITS,
+						   &byte);
+			if (ret)
+				return pcibios_err_to_errno(ret);
+
+			bits = byte ? round_up(byte, 32) : 256;
+			return 8 + (bits / 8);
+		}
+		return 8;
+
+	case PCI_EXT_CAP_ID_REBAR:
+		ret = pci_read_config_byte(pdev, epos + PCI_REBAR_CTRL, &byte);
+		if (ret)
+			return pcibios_err_to_errno(ret);
+
+		byte &= PCI_REBAR_CTRL_NBAR_MASK;
+		byte >>= PCI_REBAR_CTRL_NBAR_SHIFT;
+
+		return 4 + (byte * 8);
+	case PCI_EXT_CAP_ID_DPA:
+		ret = pci_read_config_byte(pdev, epos + PCI_DPA_CAP, &byte);
+		if (ret)
+			return pcibios_err_to_errno(ret);
+
+		byte &= PCI_DPA_CAP_SUBSTATE_MASK;
+		byte = round_up(byte + 1, 4);
+		return PCI_DPA_BASE_SIZEOF + byte;
+	case PCI_EXT_CAP_ID_TPH:
+		ret = pci_read_config_dword(pdev, epos + PCI_TPH_CAP, &dword);
+		if (ret)
+			return pcibios_err_to_errno(ret);
+
+		if ((dword & PCI_TPH_CAP_LOC_MASK) == PCI_TPH_LOC_CAP) {
+			int sts;
+
+			sts = dword & PCI_TPH_CAP_ST_MASK;
+			sts >>= PCI_TPH_CAP_ST_SHIFT;
+			return PCI_TPH_BASE_SIZEOF + round_up(sts * 2, 4);
+		}
+		return PCI_TPH_BASE_SIZEOF;
+	default:
+		printk(KERN_WARNING
+		       "%s: %s unknown length for pci ecap 0x%x@0x%x\n",
+		       dev_name(&pdev->dev), __func__, ecap, epos);
+	}
+
+	return 0;
+}
+
+static int vfio_fill_vconfig_bytes(struct vfio_pci_device *vdev,
+				   int offset, int size)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	int ret = 0;
+
+	/*
+	 * We try to read physical config space in the largest chunks
+	 * we can, assuming that all of the fields support dword access.
+	 * pci_save_state() makes this same assumption and seems to do ok.
+	 */
+	while (size) {
+		int filled;
+
+		if (size >= 4 && !(offset % 4)) {
+			u32 *dword = (u32 *)&vdev->vconfig[offset];
+			ret = pci_read_config_dword(pdev, offset, dword);
+			if (ret)
+				return ret;
+			*dword = cpu_to_le32(*dword);
+			filled = 4;
+		} else if (size >= 2 && !(offset % 2)) {
+			u16 *word = (u16 *)&vdev->vconfig[offset];
+			ret = pci_read_config_word(pdev, offset, word);
+			if (ret)
+				return ret;
+			*word = cpu_to_le16(*word);
+			filled = 2;
+		} else {
+			u8 *byte = &vdev->vconfig[offset];
+			ret = pci_read_config_byte(pdev, offset, byte);
+			if (ret)
+				return ret;
+			filled = 1;
+		}
+
+		offset += filled;
+		size -= filled;
+	}
+
+	return ret;
+}
+
+static int vfio_cap_init(struct vfio_pci_device *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	u8 *map = vdev->pci_config_map;
+	u16 status;
+	u8 pos, *prev, cap;
+	int loops, ret, caps = 0;
+
+	/* Any capabilities? */
+	ret = pci_read_config_word(pdev, PCI_STATUS, &status);
+	if (ret)
+		return ret;
+
+	if (!(status & PCI_STATUS_CAP_LIST))
+		return 0; /* Done */
+
+	ret = pci_read_config_byte(pdev, PCI_CAPABILITY_LIST, &pos);
+	if (ret)
+		return ret;
+
+	/* Mark the previous position in case we want to skip a capability */
+	prev = &vdev->vconfig[PCI_CAPABILITY_LIST];
+
+	/* We can bound our loop, capabilities are dword aligned */
+	loops = (PCI_CFG_SPACE_SIZE - PCI_STD_HEADER_SIZEOF) / PCI_CAP_SIZEOF;
+	while (pos && loops--) {
+		u8 next;
+		int i, len = 0;
+
+		ret = pci_read_config_byte(pdev, pos, &cap);
+		if (ret)
+			return ret;
+
+		ret = pci_read_config_byte(pdev,
+					   pos + PCI_CAP_LIST_NEXT, &next);
+		if (ret)
+			return ret;
+
+		if (cap <= PCI_CAP_ID_MAX) {
+			len = pci_cap_length[cap];
+			if (len == 0xFF) { /* Variable length */
+				len = vfio_cap_len(vdev, cap, pos);
+				if (len < 0)
+					return len;
+			}
+		}
+
+		if (!len) {
+			printk(KERN_INFO "%s: %s hiding cap 0x%x\n",
+			       __func__, dev_name(&pdev->dev), cap);
+			*prev = next;
+			pos = next;
+			continue;
+		}
+
+		/* Sanity check, do we overlap other capabilities? */
+		for (i = 0; i < len; i += 4) {
+			if (likely(map[(pos + i) / 4] == PCI_CAP_ID_INVALID))
+				continue;
+
+			printk(KERN_WARNING
+			       "%s: %s pci config conflict @0x%x, was cap 0x%x now cap 0x%x\n",
+			       __func__, dev_name(&pdev->dev), pos + i,
+			       map[(pos + i) / 4], cap);
+		}
+
+		memset(map + (pos / 4), cap, len / 4);
+		ret = vfio_fill_vconfig_bytes(vdev, pos, len);
+		if (ret)
+			return ret;
+
+		prev = &vdev->vconfig[pos + PCI_CAP_LIST_NEXT];
+		pos = next;
+		caps++;
+	}
+
+	/* If we didn't fill any capabilities, clear the status flag */
+	if (!caps) {
+		u16 *vstatus = (u16 *)&vdev->vconfig[PCI_STATUS];
+		*vstatus &= ~cpu_to_le16(PCI_STATUS_CAP_LIST);
+	}
+
+	return 0;
+}
+
+static int vfio_ecap_init(struct vfio_pci_device *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	u8 *map = vdev->pci_config_map;
+	u16 epos;
+	u32 *prev = NULL;
+	int loops, ret, ecaps = 0;
+
+	if (!vdev->extended_caps)
+		return 0;
+
+	epos = PCI_CFG_SPACE_SIZE;
+
+	loops = (pdev->cfg_size - PCI_CFG_SPACE_SIZE) / PCI_CAP_SIZEOF;
+
+	while (loops-- && epos >= PCI_CFG_SPACE_SIZE) {
+		u32 header;
+		u16 ecap;
+		int i, len = 0;
+		bool hidden = false;
+
+		ret = pci_read_config_dword(pdev, epos, &header);
+		if (ret)
+			return ret;
+
+		ecap = PCI_EXT_CAP_ID(header);
+
+		if (ecap <= PCI_EXT_CAP_ID_MAX) {
+			len = pci_ext_cap_length[ecap];
+			if (len == 0xFF) {
+				len = vfio_ext_cap_len(vdev, ecap, epos);
+				if (len < 0)
+					return len;
+			}
+		}
+
+		if (!len) {
+			printk(KERN_INFO "%s: %s hiding ecap 0x%x@0x%x\n",
+			       __func__, dev_name(&pdev->dev), ecap, epos);
+
+			/* If not the first in the chain, we can skip over it */
+			if (prev) {
+				epos = PCI_EXT_CAP_NEXT(header);
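+				/*
+				 * The ecap next pointer lives in header
+				 * bits 31:20; rewriting the previous
+				 * header's next field skips this cap.
+				 */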
+				*prev &= cpu_to_le32(~((u32)0xffc << 20));
+				*prev |= cpu_to_le32((u32)epos << 20);
+				continue;
+			}
+
+			/*
+			 * Otherwise, fill in a placeholder, the direct
+			 * readfn will virtualize this automatically
+			 */
+			len = PCI_CAP_SIZEOF;
+			hidden = true;
+		}
+
+		for (i = 0; i < len; i += 4) {
+			if (likely(map[(epos + i) / 4] == PCI_CAP_ID_INVALID))
+				continue;
+
+			printk(KERN_WARNING
+			       "%s: %s pci config conflict @0x%x, was ecap 0x%x now ecap 0x%x\n",
+			       __func__, dev_name(&pdev->dev),
+			       epos + i, map[(epos + i) / 4], ecap);
+		}
+
+		/*
+		 * Even though ecap IDs are 2 bytes, we're currently a long
+		 * way from exceeding 1 byte's worth of them.  If we ever
+		 * reach 0xFF we'll need to widen this to a two-byte map.
+		 */
+		BUILD_BUG_ON(PCI_EXT_CAP_ID_MAX >= PCI_CAP_ID_INVALID);
+
+		memset(map + (epos / 4), ecap, len / 4);
+		ret = vfio_fill_vconfig_bytes(vdev, epos, len);
+		if (ret)
+			return ret;
+
+		/*
+		 * If we're just using this capability to anchor the list,
+		 * hide the real ID.  Only count real ecaps.  XXX The PCI
+		 * spec says to use cap id = 0, version = 0, next = 0 when
+		 * ecaps are absent; we hope users check all the way to next.
+		 */
+		if (hidden)
+			*(u32 *)&vdev->vconfig[epos] &=
+				cpu_to_le32(((u32)0xffc << 20));
+		else
+			ecaps++;
+
+		prev = (u32 *)&vdev->vconfig[epos];
+		epos = PCI_EXT_CAP_NEXT(header);
+	}
+
+	if (!ecaps)
+		*(u32 *)&vdev->vconfig[PCI_CFG_SPACE_SIZE] = 0;
+
+	return 0;
+}
+
+/*
+ * For each device we allocate a pci_config_map that indicates the
+ * capability occupying each dword and thus the struct perm_bits we
+ * use for read and write.  We also allocate a virtualized config
+ * space which tracks reads and writes to bits that we emulate for
+ * the user.  Initial values filled from device.
+ *
+ * Using shared struct perm_bits between all vfio-pci devices saves
+ * us from allocating cfg_size buffers for virt and write for every
+ * device.  We could remove vconfig and allocate individual buffers
+ * for each area requiring emulated bits, but the array of pointers
+ * would be comparable in size (at least for standard config space).
+ */
+int vfio_config_init(struct vfio_pci_device *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	u8 *map, *vconfig;
+	int ret;
+
+	/*
+	 * Config space, caps and ecaps are all dword aligned, so we can
+	 * use one byte per dword to record the type.
+	 */
+	map = kmalloc(pdev->cfg_size / 4, GFP_KERNEL);
+	if (!map)
+		return -ENOMEM;
+
+	vconfig = kmalloc(pdev->cfg_size, GFP_KERNEL);
+	if (!vconfig) {
+		kfree(map);
+		return -ENOMEM;
+	}
+
+	vdev->pci_config_map = map;
+	vdev->vconfig = vconfig;
+
+	memset(map, PCI_CAP_ID_BASIC, PCI_STD_HEADER_SIZEOF / 4);
+	memset(map + (PCI_STD_HEADER_SIZEOF / 4), PCI_CAP_ID_INVALID,
+	       (pdev->cfg_size - PCI_STD_HEADER_SIZEOF) / 4);
+
+	ret = vfio_fill_vconfig_bytes(vdev, 0, PCI_STD_HEADER_SIZEOF);
+	if (ret)
+		goto out;
+
+	vdev->bardirty = true;
+
+	/*
+	 * XXX can we just pci_load_saved_state/pci_restore_state?
+	 * may need to rebuild vconfig after that
+	 */
+
+	/* For restore after reset */
+	vdev->rbar[0] = *(u32 *)&vconfig[PCI_BASE_ADDRESS_0];
+	vdev->rbar[1] = *(u32 *)&vconfig[PCI_BASE_ADDRESS_1];
+	vdev->rbar[2] = *(u32 *)&vconfig[PCI_BASE_ADDRESS_2];
+	vdev->rbar[3] = *(u32 *)&vconfig[PCI_BASE_ADDRESS_3];
+	vdev->rbar[4] = *(u32 *)&vconfig[PCI_BASE_ADDRESS_4];
+	vdev->rbar[5] = *(u32 *)&vconfig[PCI_BASE_ADDRESS_5];
+	vdev->rbar[6] = *(u32 *)&vconfig[PCI_ROM_ADDRESS];
+
+	if (pdev->is_virtfn) {
+		*(u16 *)&vconfig[PCI_VENDOR_ID] = cpu_to_le16(pdev->vendor);
+		*(u16 *)&vconfig[PCI_DEVICE_ID] = cpu_to_le16(pdev->device);
+	}
+
+	ret = vfio_cap_init(vdev);
+	if (ret)
+		goto out;
+
+	ret = vfio_ecap_init(vdev);
+	if (ret)
+		goto out;
+
+	return 0;
+
+out:
+	kfree(map);
+	vdev->pci_config_map = NULL;
+	kfree(vconfig);
+	vdev->vconfig = NULL;
+	return pcibios_err_to_errno(ret);
+}
+
+void vfio_config_free(struct vfio_pci_device *vdev)
+{
+	kfree(vdev->vconfig);
+	vdev->vconfig = NULL;
+	kfree(vdev->pci_config_map);
+	vdev->pci_config_map = NULL;
+	kfree(vdev->msi_perm);
+	vdev->msi_perm = NULL;
+}
+
+ssize_t vfio_config_do_rw(struct vfio_pci_device *vdev, char __user *buf,
+			  size_t count, loff_t *ppos, bool iswrite)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	struct perm_bits *perm;
+	u32 val = 0;
+	int cap_start = 0, offset;
+	u8 cap_id;
+
+	if (*ppos < 0 || *ppos + count > pdev->cfg_size)
+		return -EFAULT;
+
+	cap_id = vdev->pci_config_map[*ppos / 4];
+
+	if (cap_id == PCI_CAP_ID_INVALID) {
+		if (iswrite)
+			return count; /* drop */
+
+		/*
+		 * Per PCI spec 3.0, section 6.1, reads from reserved and
+		 * unimplemented registers return 0
+		 */
+		if (copy_to_user(buf, &val, count))
+			return -EFAULT;
+
+		return count;
+	}
+
+	/*
+	 * All capabilities are minimum 4 bytes and aligned on dword
+	 * boundaries.  Since we don't support unaligned accesses, we're
+	 * only ever accessing a single capability.
+	 */
+	if (*ppos >= PCI_CFG_SPACE_SIZE) {
+		WARN_ON(cap_id > PCI_EXT_CAP_ID_MAX);
+
+		perm = &ecap_perms[cap_id];
+		cap_start = vfio_find_cap_start(vdev, *ppos);
+
+	} else {
+		WARN_ON(cap_id > PCI_CAP_ID_MAX);
+
+		perm = &cap_perms[cap_id];
+
+		if (cap_id == PCI_CAP_ID_MSI)
+			perm = vdev->msi_perm;
+
+		if (cap_id > PCI_CAP_ID_BASIC)
+			cap_start = vfio_find_cap_start(vdev, *ppos);
+	}
+
+	WARN_ON(!cap_start && cap_id != PCI_CAP_ID_BASIC);
+	WARN_ON(cap_start > *ppos);
+
+	offset = *ppos - cap_start;
+
+	if (iswrite) {
+		if (perm->writefn) {
+			if (copy_from_user(&val, buf, count))
+				return -EFAULT;
+
+			count = perm->writefn(vdev, *ppos, count,
+					      perm, offset, val);
+		}
+	} else {
+		if (perm->readfn) {
+			count = perm->readfn(vdev, *ppos, count,
+					     perm, offset, &val);
+			if (count < 0)
+				return count;
+		}
+
+		if (copy_to_user(buf, &val, count))
+			return -EFAULT;
+	}
+
+	return count;
+}
+
+ssize_t vfio_pci_config_readwrite(struct vfio_pci_device *vdev,
+				  char __user *buf, size_t count,
+				  loff_t *ppos, bool iswrite)
+{
+	size_t done = 0;
+	int ret = 0;
+	loff_t pos = *ppos;
+
+	pos &= VFIO_PCI_OFFSET_MASK;
+
+	/*
+	 * We want to keep the access size the caller uses, while also
+	 * supporting reads of large chunks of config space in a single call.
+	 * PCI doesn't support unaligned accesses, so we can safely break
+	 * those apart.
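+	 * For example, a 6-byte read at offset 0x0 becomes one 4-byte
+	 * access followed by one 2-byte access.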
+	 */
+	while (count) {
+		if (count >= 4 && !(pos % 4))
+			ret = vfio_config_do_rw(vdev, buf, 4, &pos, iswrite);
+		else if (count >= 2 && !(pos % 2))
+			ret = vfio_config_do_rw(vdev, buf, 2, &pos, iswrite);
+		else
+			ret = vfio_config_do_rw(vdev, buf, 1, &pos, iswrite);
+
+		if (ret < 0)
+			return ret;
+
+		count -= ret;
+		done += ret;
+		buf += ret;
+		pos += ret;
+	}
+
+	*ppos += done;
+
+	return done;
+}
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
new file mode 100644
index 0000000..2996f37
--- /dev/null
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -0,0 +1,724 @@
+/*
+ * VFIO PCI interrupt handling
+ *
+ * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
+ *     Author: Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Derived from original vfio:
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ * Author: Tom Lyon, pugs@cisco.com
+ */
+
+#include <linux/device.h>
+#include <linux/interrupt.h>
+#include <linux/eventfd.h>
+#include <linux/pci.h>
+#include <linux/file.h>
+#include <linux/poll.h>
+#include <linux/vfio.h>
+#include <linux/wait.h>
+#include <linux/workqueue.h>
+
+#include "vfio_pci_private.h"
+
+/*
+ * IRQfd - generic
+ */
+struct virqfd {
+	struct vfio_pci_device	*vdev;
+	void			*data;
+	struct eventfd_ctx	*eventfd;
+	poll_table		pt;
+	wait_queue_t		wait;
+	struct work_struct	inject;
+	struct work_struct	shutdown;
+	struct virqfd		**pvirqfd;
+};
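+
+/*
+ * Lifecycle sketch: virqfd_enable() hooks virqfd_wakeup() into the
+ * eventfd's wait queue.  POLLIN schedules the inject work; POLLHUP
+ * queues virqfd_shutdown() on the cleanup workqueue below, which
+ * detaches from the wait queue and frees the virqfd.
+ */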
+
+static struct workqueue_struct *vfio_irqfd_cleanup_wq;
+
+int __init vfio_pci_virqfd_init(void)
+{
+	vfio_irqfd_cleanup_wq =
+		create_singlethread_workqueue("vfio-irqfd-cleanup");
+	if (!vfio_irqfd_cleanup_wq)
+		return -ENOMEM;
+
+	return 0;
+}
+
+void vfio_pci_virqfd_exit(void)
+{
+	destroy_workqueue(vfio_irqfd_cleanup_wq);
+}
+
+static void virqfd_deactivate(struct virqfd *virqfd)
+{
+	queue_work(vfio_irqfd_cleanup_wq, &virqfd->shutdown);
+}
+
+static int virqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
+{
+	struct virqfd *virqfd = container_of(wait, struct virqfd, wait);
+	unsigned long flags = (unsigned long)key;
+
+	if (flags & POLLIN)
+		/* An event has been signaled, inject an interrupt */
+		schedule_work(&virqfd->inject);
+
+	if (flags & POLLHUP)
+		/* The eventfd is closing, detach from VFIO */
+		virqfd_deactivate(virqfd);
+
+	return 0;
+}
+
+static void virqfd_ptable_queue_proc(struct file *file,
+				     wait_queue_head_t *wqh, poll_table *pt)
+{
+	struct virqfd *virqfd = container_of(pt, struct virqfd, pt);
+	add_wait_queue(wqh, &virqfd->wait);
+}
+
+static void virqfd_shutdown(struct work_struct *work)
+{
+	u64 cnt;
+	struct virqfd *virqfd = container_of(work, struct virqfd, shutdown);
+	struct virqfd **pvirqfd = virqfd->pvirqfd;
+
+	eventfd_ctx_remove_wait_queue(virqfd->eventfd, &virqfd->wait, &cnt);
+	flush_work(&virqfd->inject);
+	eventfd_ctx_put(virqfd->eventfd);
+
+	kfree(virqfd);
+	*pvirqfd = NULL;
+}
+
+static int virqfd_enable(struct vfio_pci_device *vdev,
+			 void (*inject)(struct work_struct *work),
+			 void *data, struct virqfd **pvirqfd, int fd)
+{
+	struct file *file = NULL;
+	struct eventfd_ctx *ctx = NULL;
+	struct virqfd *virqfd;
+	int ret = 0;
+	unsigned int events;
+
+	if (*pvirqfd)
+		return -EBUSY;
+
+	virqfd = kzalloc(sizeof(*virqfd), GFP_KERNEL);
+	if (!virqfd)
+		return -ENOMEM;
+
+	virqfd->vdev = vdev;
+	virqfd->data = data;
+	virqfd->pvirqfd = pvirqfd;
+	*pvirqfd = virqfd;
+
+	INIT_WORK(&virqfd->inject, inject);
+	INIT_WORK(&virqfd->shutdown, virqfd_shutdown);
+
+	file = eventfd_fget(fd);
+	if (IS_ERR(file)) {
+		ret = PTR_ERR(file);
+		goto fail;
+	}
+
+	ctx = eventfd_ctx_fileget(file);
+	if (IS_ERR(ctx)) {
+		ret = PTR_ERR(ctx);
+		goto fail;
+	}
+
+	virqfd->eventfd = ctx;
+
+	/*
+	 * Install our own custom wake-up handling so we are notified via
+	 * a callback whenever someone signals the underlying eventfd.
+	 */
+	init_waitqueue_func_entry(&virqfd->wait, virqfd_wakeup);
+	init_poll_funcptr(&virqfd->pt, virqfd_ptable_queue_proc);
+
+	events = file->f_op->poll(file, &virqfd->pt);
+
+	/*
+	 * Check if there was an event already pending on the eventfd
+	 * before we registered and trigger it as if we didn't miss it.
+	 */
+	if (events & POLLIN)
+		schedule_work(&virqfd->inject);
+
+	/*
+	 * Do not drop the file until the irqfd is fully initialized,
+	 * otherwise we might race against the POLLHUP.
+	 */
+	fput(file);
+
+	return 0;
+
+fail:
+	if (ctx && !IS_ERR(ctx))
+		eventfd_ctx_put(ctx);
+
+	if (!IS_ERR(file))
+		fput(file);
+
+	kfree(virqfd);
+	*pvirqfd = NULL;
+
+	return ret;
+}
+
+static void virqfd_disable(struct virqfd *virqfd)
+{
+	if (!virqfd)
+		return;
+
+	virqfd_deactivate(virqfd);
+
+	/* Block until we know all outstanding shutdown jobs have completed. */
+	flush_workqueue(vfio_irqfd_cleanup_wq);
+}
+
+/*
+ * INTx
+ */
+static inline void vfio_send_intx_eventfd(struct vfio_pci_device *vdev)
+{
+	if (likely(is_intx(vdev) && !vdev->virq_disabled))
+		eventfd_signal(vdev->ctx[0].trigger, 1);
+}
+
+void vfio_pci_intx_mask(struct vfio_pci_device *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+
+	spin_lock_irq(&vdev->irqlock);
+
+	/*
+	 * Masking can come from interrupt, ioctl, or config space
+	 * via INTx disable.  The latter means this can get called
+	 * even when not using intx delivery.  In this case, just
+	 * try to have the physical bit follow the virtual bit.
+	 */
+	if (unlikely(!is_intx(vdev))) {
+		if (vdev->pci_2_3)
+			pci_intx(pdev, 0);
+	} else if (!vdev->ctx[0].masked) {
+		/*
+		 * Can't use check_and_mask here because we always want to
+		 * mask, not just when something is pending.
+		 */
+		if (vdev->pci_2_3)
+			pci_intx(pdev, 0);
+		else
+			disable_irq_nosync(pdev->irq);
+
+		vdev->ctx[0].masked = true;
+	}
+
+	spin_unlock_irq(&vdev->irqlock);
+}
+
+void vfio_pci_intx_unmask(struct vfio_pci_device *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	bool signal = false;
+
+	spin_lock_irq(&vdev->irqlock);
+
+	/*
+	 * Unmasking comes from ioctl or config, so again, have the
+	 * physical bit follow the virtual even when not using INTx.
+	 */
+	if (unlikely(!is_intx(vdev))) {
+		if (vdev->pci_2_3)
+			pci_intx(pdev, 1);
+	} else if (vdev->ctx[0].masked && !vdev->virq_disabled) {
+		/*
+		 * A pending interrupt here would immediately trigger,
+		 * but we can avoid that overhead by just re-sending
+		 * the interrupt to the user.
+		 */
+		if (vdev->pci_2_3) {
+			if (!pci_check_and_unmask_intx(pdev))
+				signal = true;
+		} else
+			enable_irq(pdev->irq);
+
+		vdev->ctx[0].masked = signal;
+	}
+
+	spin_unlock_irq(&vdev->irqlock);
+
+	if (signal)
+		vfio_send_intx_eventfd(vdev);
+}
+
+static irqreturn_t vfio_intx_handler(int irq, void *dev_id)
+{
+	struct vfio_pci_device *vdev = dev_id;
+	struct pci_dev *pdev = vdev->pdev;
+	irqreturn_t ret = IRQ_NONE;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vdev->irqlock, flags);
+
+	/* Non-PCI 2.3 devices don't use this hard handler */
+	if (pci_check_and_mask_intx(pdev)) {
+		ret = IRQ_WAKE_THREAD;
+		vdev->ctx[0].masked = true;
+	}
+
+	spin_unlock_irqrestore(&vdev->irqlock, flags);
+
+	return ret;
+}
+
+static irqreturn_t vfio_intx_thread(int irq, void *dev_id)
+{
+	struct vfio_pci_device *vdev = dev_id;
+	int ret = IRQ_HANDLED;
+
+	if (unlikely(!vdev->pci_2_3)) {
+		spin_lock_irq(&vdev->irqlock);
+		if (!vdev->ctx[0].masked) {
+			disable_irq_nosync(vdev->pdev->irq);
+			vdev->ctx[0].masked = true;
+		} else
+			ret = IRQ_NONE;
+		spin_unlock_irq(&vdev->irqlock);
+	}
+
+	if (ret == IRQ_HANDLED)
+		vfio_send_intx_eventfd(vdev);
+
+	return ret;
+}
+
+static int vfio_intx_enable(struct vfio_pci_device *vdev)
+{
+	if (!is_irq_none(vdev))
+		return -EINVAL;
+
+	if (!vdev->pdev->irq)
+		return -ENODEV;
+
+	vdev->ctx = kzalloc(sizeof(struct vfio_pci_irq_ctx), GFP_KERNEL);
+	if (!vdev->ctx)
+		return -ENOMEM;
+
+	vdev->num_ctx = 1;
+	vdev->irq_type = VFIO_PCI_INTX_IRQ_INDEX;
+
+	return 0;
+}
+
+static int vfio_intx_set_signal(struct vfio_pci_device *vdev, int fd)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	irq_handler_t handler = vfio_intx_handler;
+	unsigned long irqflags = IRQF_SHARED;
+	int ret;
+
+	if (vdev->ctx[0].trigger) {
+		free_irq(pdev->irq, vdev);
+		kfree(vdev->ctx[0].name);
+		eventfd_ctx_put(vdev->ctx[0].trigger);
+		vdev->ctx[0].trigger = NULL;
+	}
+
+	if (fd >= 0) {
+		vdev->ctx[0].name = kasprintf(GFP_KERNEL, "vfio-intx(%s)",
+					    pci_name(pdev));
+		if (!vdev->ctx[0].name)
+			return -ENOMEM;
+
+		vdev->ctx[0].trigger = eventfd_ctx_fdget(fd);
+		if (IS_ERR(vdev->ctx[0].trigger)) {
+			/* eventfd_ctx_fdget() returns ERR_PTR, never NULL */
+			ret = PTR_ERR(vdev->ctx[0].trigger);
+			vdev->ctx[0].trigger = NULL;
+			kfree(vdev->ctx[0].name);
+			return ret;
+		}
+
+		if (!vdev->pci_2_3) {
+			handler = NULL;
+			irqflags = IRQF_ONESHOT;
+		}
+
+		ret = request_threaded_irq(pdev->irq, handler, vfio_intx_thread,
+					   irqflags, vdev->ctx[0].name, vdev);
+		if (ret) {
+			eventfd_ctx_put(vdev->ctx[0].trigger);
+			kfree(vdev->ctx[0].name);
+			return ret;
+		}
+
+		/*
+		 * INTx disable will stick across the new irq setup,
+		 * disable_irq won't.
+		 */
+		if (!vdev->pci_2_3)
+			if (vdev->ctx[0].masked || vdev->virq_disabled)
+				disable_irq_nosync(pdev->irq);
+	}
+	return 0;
+}
+
+static void vfio_intx_disable(struct vfio_pci_device *vdev)
+{
+	vfio_intx_set_signal(vdev, -1);
+	virqfd_disable(vdev->ctx[0].unmask);
+	virqfd_disable(vdev->ctx[0].mask);
+	vdev->irq_type = VFIO_PCI_NUM_IRQS;
+	vdev->num_ctx = 0;
+	kfree(vdev->ctx);
+}
+
+static void vfio_intx_unmask_inject(struct work_struct *work)
+{
+	struct virqfd *virqfd = container_of(work, struct virqfd, inject);
+	vfio_pci_intx_unmask(virqfd->vdev);
+}
+
+/*
+ * MSI/MSI-X
+ */
+static irqreturn_t vfio_msihandler(int irq, void *arg)
+{
+	struct eventfd_ctx *trigger = arg;
+
+	eventfd_signal(trigger, 1);
+	return IRQ_HANDLED;
+}
+
+static int vfio_msi_enable(struct vfio_pci_device *vdev, int nvec, bool msix)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	int ret;
+
+	if (!is_irq_none(vdev))
+		return -EINVAL;
+
+	vdev->ctx = kzalloc(nvec * sizeof(struct vfio_pci_irq_ctx), GFP_KERNEL);
+	if (!vdev->ctx)
+		return -ENOMEM;
+
+	if (msix) {
+		int i;
+
+		vdev->msix = kzalloc(nvec * sizeof(struct msix_entry),
+				     GFP_KERNEL);
+		if (!vdev->msix) {
+			kfree(vdev->ctx);
+			return -ENOMEM;
+		}
+
+		for (i = 0; i < nvec; i++)
+			vdev->msix[i].entry = i;
+
+		ret = pci_enable_msix(pdev, vdev->msix, nvec);
+		if (ret) {
+			kfree(vdev->msix);
+			kfree(vdev->ctx);
+			return ret;
+		}
+	} else {
+		ret = pci_enable_msi_block(pdev, nvec);
+		if (ret) {
+			kfree(vdev->ctx);
+			return ret;
+		}
+	}
+
+	vdev->num_ctx = nvec;
+	vdev->irq_type = msix ? VFIO_PCI_MSIX_IRQ_INDEX :
+				VFIO_PCI_MSI_IRQ_INDEX;
+
+	if (!msix) {
+		/*
+		 * Compute the virtual hardware field for max msi vectors -
+		 * it is the log base 2 of the number of vectors.
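+		 * For example, nvec == 3 gives fls(5) - 1 == 2, i.e. the
+		 * field advertises 2^2 == 4 vectors.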
+		 */
+		vdev->msi_qmax = fls(nvec * 2 - 1) - 1;
+	}
+
+	return 0;
+}
+
+static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
+				      int vector, int fd, bool msix)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	int irq = msix ? vdev->msix[vector].vector : pdev->irq + vector;
+	char *name = msix ? "vfio-msix" : "vfio-msi";
+
+	if (vector >= vdev->num_ctx)
+		return -EINVAL;
+
+	if (vdev->ctx[vector].trigger) {
+		free_irq(irq, vdev->ctx[vector].trigger);
+		kfree(vdev->ctx[vector].name);
+		eventfd_ctx_put(vdev->ctx[vector].trigger);
+		vdev->ctx[vector].trigger = NULL;
+	}
+
+	if (fd >= 0) {
+		struct eventfd_ctx *trigger;
+		int ret;
+
+		vdev->ctx[vector].name = kasprintf(GFP_KERNEL, "%s[%d](%s)",
+						   name, vector,
+						   pci_name(pdev));
+		if (!vdev->ctx[vector].name)
+			return -ENOMEM;
+
+		trigger = eventfd_ctx_fdget(fd);
+		if (IS_ERR(trigger)) {
+			kfree(vdev->ctx[vector].name);
+			return PTR_ERR(trigger);
+		}
+
+		ret = request_threaded_irq(irq, NULL, vfio_msihandler, 0,
+					   vdev->ctx[vector].name, trigger);
+		if (ret) {
+			eventfd_ctx_put(trigger);
+			kfree(vdev->ctx[vector].name);
+			return ret;
+		}
+
+		vdev->ctx[vector].trigger = trigger;
+	}
+
+	return 0;
+}
+
+static int vfio_msi_set_block(struct vfio_pci_device *vdev, int start,
+			      int count, int32_t *fds, bool msix)
+{
+	int i, j, ret = 0;
+
+	if (start + count > vdev->num_ctx)
+		return -EINVAL;
+
+	for (i = 0, j = start; i < count && !ret; i++, j++) {
+		int fd = fds ? fds[i] : -1;
+		ret = vfio_msi_set_vector_signal(vdev, j, fd, msix);
+	}
+
+	if (ret) {
+		for (--j; j >= start; j--)
+			vfio_msi_set_vector_signal(vdev, j, -1, msix);
+	}
+
+	return ret;
+}
+
+static void vfio_msi_disable(struct vfio_pci_device *vdev, bool msix)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	int i;
+
+	vfio_msi_set_block(vdev, 0, vdev->num_ctx, NULL, msix);
+
+	for (i = 0; i < vdev->num_ctx; i++) {
+		virqfd_disable(vdev->ctx[i].unmask);
+		virqfd_disable(vdev->ctx[i].mask);
+	}
+
+	if (msix) {
+		pci_disable_msix(vdev->pdev);
+		kfree(vdev->msix);
+	} else
+		pci_disable_msi(pdev);
+
+	vdev->irq_type = VFIO_PCI_NUM_IRQS;
+	vdev->num_ctx = 0;
+	kfree(vdev->ctx);
+}
+
+/*
+ * IOCTL support
+ */
+static int vfio_pci_set_intx_unmask(struct vfio_pci_device *vdev,
+				    int index, int start, int count,
+				    uint32_t flags, void *data)
+{
+	if (!is_intx(vdev) || start != 0 || count != 1)
+		return -EINVAL;
+
+	if (flags & VFIO_IRQ_SET_DATA_NONE) {
+		vfio_pci_intx_unmask(vdev);
+	} else if (flags & VFIO_IRQ_SET_DATA_BOOL) {
+		uint8_t unmask = *(uint8_t *)data;
+		if (unmask)
+			vfio_pci_intx_unmask(vdev);
+	} else if (flags & VFIO_IRQ_SET_DATA_EVENTFD) {
+		int32_t fd = *(int32_t *)data;
+		if (fd >= 0)
+			return virqfd_enable(vdev, vfio_intx_unmask_inject,
+					     NULL, &vdev->ctx[0].unmask, fd);
+
+		virqfd_disable(vdev->ctx[0].unmask);
+	}
+
+	return 0;
+}
+
+static int vfio_pci_set_intx_mask(struct vfio_pci_device *vdev,
+				  int index, int start, int count,
+				  uint32_t flags, void *data)
+{
+	if (!is_intx(vdev) || start != 0 || count != 1)
+		return -EINVAL;
+
+	if (flags & VFIO_IRQ_SET_DATA_NONE) {
+		vfio_pci_intx_mask(vdev);
+	} else if (flags & VFIO_IRQ_SET_DATA_BOOL) {
+		uint8_t mask = *(uint8_t *)data;
+		if (mask)
+			vfio_pci_intx_mask(vdev);
+	} else if (flags & VFIO_IRQ_SET_DATA_EVENTFD) {
+		return -ENOTTY; /* XXX implement me */
+	}
+
+	return 0;
+}
+
+static int vfio_pci_set_intx_trigger(struct vfio_pci_device *vdev,
+				     int index, int start, int count,
+				     uint32_t flags, void *data)
+{
+	if (is_intx(vdev) && !count && (flags & VFIO_IRQ_SET_DATA_NONE)) {
+		vfio_intx_disable(vdev);
+		return 0;
+	}
+
+	if (!(is_intx(vdev) || is_irq_none(vdev)) || start != 0 || count != 1)
+		return -EINVAL;
+
+	if (flags & VFIO_IRQ_SET_DATA_EVENTFD) {
+		int32_t fd = *(int32_t *)data;
+		int ret;
+
+		if (is_intx(vdev))
+			return vfio_intx_set_signal(vdev, fd);
+
+		ret = vfio_intx_enable(vdev);
+		if (ret)
+			return ret;
+
+		ret = vfio_intx_set_signal(vdev, fd);
+		if (ret)
+			vfio_intx_disable(vdev);
+
+		return ret;
+	}
+
+	if (!is_intx(vdev))
+		return -EINVAL;
+
+	if (flags & VFIO_IRQ_SET_DATA_NONE) {
+		vfio_send_intx_eventfd(vdev);
+	} else if (flags & VFIO_IRQ_SET_DATA_BOOL) {
+		uint8_t trigger = *(uint8_t *)data;
+		if (trigger)
+			vfio_send_intx_eventfd(vdev);
+	}
+	return 0;
+}
+
+static int vfio_pci_set_msi_trigger(struct vfio_pci_device *vdev,
+				    int index, int start, int count,
+				    uint32_t flags, void *data)
+{
+	int i;
+	bool msix = (index == VFIO_PCI_MSIX_IRQ_INDEX);
+
+	if (irq_is(vdev, index) && !count && (flags & VFIO_IRQ_SET_DATA_NONE)) {
+		vfio_msi_disable(vdev, msix);
+		return 0;
+	}
+
+	if (!(irq_is(vdev, index) || is_irq_none(vdev)))
+		return -EINVAL;
+
+	if (flags & VFIO_IRQ_SET_DATA_EVENTFD) {
+		int32_t *fds = data;
+		int ret;
+
+		if (vdev->irq_type == index)
+			return vfio_msi_set_block(vdev, start, count,
+						  fds, msix);
+
+		ret = vfio_msi_enable(vdev, start + count, msix);
+		if (ret)
+			return ret;
+
+		ret = vfio_msi_set_block(vdev, start, count, fds, msix);
+		if (ret)
+			vfio_msi_disable(vdev, msix);
+
+		return ret;
+	}
+
+	if (!irq_is(vdev, index) || start + count > vdev->num_ctx)
+		return -EINVAL;
+
+	for (i = start; i < start + count; i++) {
+		if (!vdev->ctx[i].trigger)
+			continue;
+		if (flags & VFIO_IRQ_SET_DATA_NONE) {
+			eventfd_signal(vdev->ctx[i].trigger, 1);
+		} else if (flags & VFIO_IRQ_SET_DATA_BOOL) {
+			uint8_t *bools = data;
+			if (bools[i - start])
+				eventfd_signal(vdev->ctx[i].trigger, 1);
+		}
+	}
+	return 0;
+}
+
+int vfio_pci_set_irqs_ioctl(struct vfio_pci_device *vdev, uint32_t flags,
+			    int index, int start, int count, void *data)
+{
+	int (*func)(struct vfio_pci_device *vdev, int index, int start,
+		    int count, uint32_t flags, void *data) = NULL;
+
+	switch (index) {
+	case VFIO_PCI_INTX_IRQ_INDEX:
+		switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
+		case VFIO_IRQ_SET_ACTION_MASK:
+			func = vfio_pci_set_intx_mask;
+			break;
+		case VFIO_IRQ_SET_ACTION_UNMASK:
+			func = vfio_pci_set_intx_unmask;
+			break;
+		case VFIO_IRQ_SET_ACTION_TRIGGER:
+			func = vfio_pci_set_intx_trigger;
+			break;
+		}
+		break;
+	case VFIO_PCI_MSI_IRQ_INDEX:
+	case VFIO_PCI_MSIX_IRQ_INDEX:
+		switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
+		case VFIO_IRQ_SET_ACTION_MASK:
+		case VFIO_IRQ_SET_ACTION_UNMASK:
+			/* XXX Need masking support exported */
+			break;
+		case VFIO_IRQ_SET_ACTION_TRIGGER:
+			func = vfio_pci_set_msi_trigger;
+			break;
+		}
+		break;
+	}
+
+	if (!func)
+		return -ENOTTY;
+
+	return func(vdev, index, start, count, flags, data);
+}
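+
+/*
+ * A minimal sketch (illustrative, from user space, with a hypothetical
+ * device_fd) of wiring MSI vector 0 to an eventfd via this entry point:
+ *
+ *   struct { struct vfio_irq_set set; int32_t fd; } irq = {
+ *           .set = { .argsz = sizeof(irq),
+ *                    .flags = VFIO_IRQ_SET_DATA_EVENTFD |
+ *                             VFIO_IRQ_SET_ACTION_TRIGGER,
+ *                    .index = VFIO_PCI_MSI_IRQ_INDEX,
+ *                    .start = 0, .count = 1 },
+ *           .fd = eventfd(0, 0),
+ *   };
+ *   ioctl(device_fd, VFIO_DEVICE_SET_IRQS, &irq);
+ */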
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
new file mode 100644
index 0000000..a4a3678
--- /dev/null
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -0,0 +1,91 @@
+/*
+ * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
+ *     Author: Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Derived from original vfio:
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ * Author: Tom Lyon, pugs@cisco.com
+ */
+
+#include <linux/mutex.h>
+#include <linux/pci.h>
+
+#ifndef VFIO_PCI_PRIVATE_H
+#define VFIO_PCI_PRIVATE_H
+
+#define VFIO_PCI_OFFSET_SHIFT   40
+
+#define VFIO_PCI_OFFSET_TO_INDEX(off)	(off >> VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_INDEX_TO_OFFSET(index)	((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_OFFSET_MASK	(((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
+
+struct vfio_pci_irq_ctx {
+	struct eventfd_ctx	*trigger;
+	struct virqfd		*unmask;
+	struct virqfd		*mask;
+	char			*name;
+	bool			masked;
+};
+
+struct vfio_pci_device {
+	struct pci_dev		*pdev;
+	void __iomem		*barmap[PCI_STD_RESOURCE_END + 1];
+	u8			*pci_config_map;
+	u8			*vconfig;
+	struct perm_bits	*msi_perm;
+	spinlock_t		irqlock;
+	struct mutex		igate;
+	struct msix_entry	*msix;
+	struct vfio_pci_irq_ctx	*ctx;
+	int			num_ctx;
+	int			irq_type;
+	u8			msi_qmax;
+	u8			msix_bar;
+	u16			msix_size;
+	u32			msix_offset;
+	u32			rbar[7];
+	bool			pci_2_3;
+	bool			virq_disabled;
+	bool			reset_works;
+	bool			extended_caps;
+	bool			bardirty;
+	struct pci_saved_state	*pci_saved_state;
+	atomic_t		refcnt;
+};
+
+#define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)
+#define is_msi(vdev) (vdev->irq_type == VFIO_PCI_MSI_IRQ_INDEX)
+#define is_msix(vdev) (vdev->irq_type == VFIO_PCI_MSIX_IRQ_INDEX)
+#define is_irq_none(vdev) (!(is_intx(vdev) || is_msi(vdev) || is_msix(vdev)))
+#define irq_is(vdev, type) (vdev->irq_type == type)
+
+extern void vfio_pci_intx_mask(struct vfio_pci_device *vdev);
+extern void vfio_pci_intx_unmask(struct vfio_pci_device *vdev);
+
+extern int vfio_pci_set_irqs_ioctl(struct vfio_pci_device *vdev,
+				   uint32_t flags, int index, int start,
+				   int count, void *data);
+
+extern ssize_t vfio_pci_config_readwrite(struct vfio_pci_device *vdev,
+					 char __user *buf, size_t count,
+					 loff_t *ppos, bool iswrite);
+extern ssize_t vfio_pci_mem_readwrite(struct vfio_pci_device *vdev,
+				      char __user *buf, size_t count,
+				      loff_t *ppos, bool iswrite);
+extern ssize_t vfio_pci_io_readwrite(struct vfio_pci_device *vdev,
+				     char __user *buf, size_t count,
+				     loff_t *ppos, bool iswrite);
+
+extern int vfio_pci_init_perm_bits(void);
+extern void vfio_pci_uninit_perm_bits(void);
+
+extern int vfio_pci_virqfd_init(void);
+extern void vfio_pci_virqfd_exit(void);
+
+extern int vfio_config_init(struct vfio_pci_device *vdev);
+extern void vfio_config_free(struct vfio_pci_device *vdev);
+#endif /* VFIO_PCI_PRIVATE_H */
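
As a worked example of the fixed offset layout above (illustrative values,
not from the patch), region index 2 with a byte offset of 0x10 round-trips
through the macros like so:

	u64 off = VFIO_PCI_INDEX_TO_OFFSET(2) | 0x10;	/* (2ULL << 40) | 0x10 */
	int idx = VFIO_PCI_OFFSET_TO_INDEX(off);	/* back to 2 */
	u64 pos = off & VFIO_PCI_OFFSET_MASK;		/* back to 0x10 */
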
diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
new file mode 100644
index 0000000..44c3ba2
--- /dev/null
+++ b/drivers/vfio/pci/vfio_pci_rdwr.c
@@ -0,0 +1,267 @@
+/*
+ * VFIO PCI I/O Port & MMIO access
+ *
+ * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
+ *     Author: Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Derived from original vfio:
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ * Author: Tom Lyon, pugs@cisco.com
+ */
+
+#include <linux/fs.h>
+#include <linux/pci.h>
+#include <linux/uaccess.h>
+#include <linux/io.h>
+
+#include "vfio_pci_private.h"
+
+/* I/O Port BAR access */
+ssize_t vfio_pci_io_readwrite(struct vfio_pci_device *vdev, char __user *buf,
+			      size_t count, loff_t *ppos, bool iswrite)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+	int bar = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	void __iomem *io;
+	size_t done = 0;
+
+	if (!pci_resource_start(pdev, bar))
+		return -EINVAL;
+
+	if (pos + count > pci_resource_len(pdev, bar))
+		return -EINVAL;
+
+	if (!vdev->barmap[bar]) {
+		int ret;
+
+		ret = pci_request_selected_regions(pdev, 1 << bar, "vfio");
+		if (ret)
+			return ret;
+
+		vdev->barmap[bar] = pci_iomap(pdev, bar, 0);
+
+		if (!vdev->barmap[bar]) {
+			pci_release_selected_regions(pdev, 1 << bar);
+			return -EINVAL;
+		}
+	}
+
+	io = vdev->barmap[bar];
+
+	while (count) {
+		int filled;
+
+		if (count >= 4 && !(pos % 4)) {
+			u32 val;
+
+			if (iswrite) {
+				if (copy_from_user(&val, buf, 4))
+					return -EFAULT;
+
+				iowrite32(val, io + pos);
+			} else {
+				val = ioread32(io + pos);
+
+				if (copy_to_user(buf, &val, 4))
+					return -EFAULT;
+			}
+
+			filled = 4;
+
+		} else if ((pos % 2) == 0 && count >= 2) {
+			u16 val;
+
+			if (iswrite) {
+				if (copy_from_user(&val, buf, 2))
+					return -EFAULT;
+
+				iowrite16(val, io + pos);
+			} else {
+				val = ioread16(io + pos);
+
+				if (copy_to_user(buf, &val, 2))
+					return -EFAULT;
+			}
+
+			filled = 2;
+		} else {
+			u8 val;
+
+			if (iswrite) {
+				if (copy_from_user(&val, buf, 1))
+					return -EFAULT;
+
+				iowrite8(val, io + pos);
+			} else {
+				val = ioread8(io + pos);
+
+				if (copy_to_user(buf, &val, 1))
+					return -EFAULT;
+			}
+
+			filled = 1;
+		}
+
+		count -= filled;
+		done += filled;
+		buf += filled;
+		pos += filled;
+	}
+
+	*ppos += done;
+
+	return done;
+}
+
+/*
+ * MMIO BAR access
+ * We also handle two excluded ranges here: if the user tries to access
+ * the ROM beyond what PCI reports as available, or the MSI-X table
+ * region, reads return 0xFF and writes are dropped.
+ */
+ssize_t vfio_pci_mem_readwrite(struct vfio_pci_device *vdev, char __user *buf,
+			       size_t count, loff_t *ppos, bool iswrite)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+	int bar = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	void __iomem *io;
+	resource_size_t end;
+	size_t done = 0;
+	size_t x_start = 0, x_end = 0; /* excluded range */
+
+	if (!pci_resource_start(pdev, bar))
+		return -EINVAL;
+
+	end = pci_resource_len(pdev, bar);
+
+	if (pos > end)
+		return -EINVAL;
+
+	if (pos == end)
+		return 0;
+
+	if (pos + count > end)
+		count = end - pos;
+
+	if (bar == PCI_ROM_RESOURCE) {
+		io = pci_map_rom(pdev, &x_start);
+		x_end = end;
+	} else {
+		if (!vdev->barmap[bar]) {
+			int ret;
+
+			ret = pci_request_selected_regions(pdev, 1 << bar,
+							   "vfio");
+			if (ret)
+				return ret;
+
+			vdev->barmap[bar] = pci_iomap(pdev, bar, 0);
+
+			if (!vdev->barmap[bar]) {
+				pci_release_selected_regions(pdev, 1 << bar);
+				return -EINVAL;
+			}
+		}
+
+		io = vdev->barmap[bar];
+
+		if (bar == vdev->msix_bar) {
+			x_start = vdev->msix_offset;
+			x_end = vdev->msix_offset + vdev->msix_size;
+		}
+	}
+
+	if (!io)
+		return -EINVAL;
+
+	while (count) {
+		size_t fillable, filled;
+
+		if (pos < x_start)
+			fillable = min(count, (size_t)(x_start - pos));
+		else if (pos >= x_end)
+			fillable = min(count, (size_t)(end - pos));
+		else
+			fillable = 0;
+
+		if (fillable >= 4 && !(pos % 4)) {
+			u32 val;
+
+			if (iswrite) {
+				if (copy_from_user(&val, buf, 4))
+					goto out;
+
+				iowrite32(val, io + pos);
+			} else {
+				val = ioread32(io + pos);
+				if (copy_to_user(buf, &val, 4))
+					goto out;
+			}
+
+			filled = 4;
+		} else if (fillable >= 2 && !(pos % 2)) {
+			u16 val;
+
+			if (iswrite) {
+				if (copy_from_user(&val, buf, 2))
+					goto out;
+
+				iowrite16(val, io + pos);
+			} else {
+				val = ioread16(io + pos);
+				if (copy_to_user(buf, &val, 2))
+					goto out;
+			}
+
+			filled = 2;
+		} else if (fillable) {
+			u8 val;
+
+			if (iswrite) {
+				if (copy_from_user(&val, buf, 1))
+					goto out;
+
+				iowrite8(val, io + pos);
+			} else {
+				val = ioread8(io + pos);
+
+				if (copy_to_user(buf, &val, 1))
+					goto out;
+			}
+
+			filled = 1;
+		} else {
+			/* Drop writes, fill reads with FF, but never
+			 * copy more than the user asked for */
+			filled = min(count, (size_t)(x_end - pos));
+
+			if (!iswrite) {
+				char val = 0xFF;
+				size_t i;
+
+				for (i = 0; i < filled; i++) {
+					if (put_user(val, buf + i))
+						goto out;
+				}
+			}
+		}
+
+		count -= filled;
+		done += filled;
+		buf += filled;
+		pos += filled;
+	}
+
+	*ppos += done;
+
+out:
+	if (bar == PCI_ROM_RESOURCE)
+		pci_unmap_rom(pdev, io);
+
+	return count ? -EFAULT : done;
+}
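
A sketch of how userspace reaches these handlers through the device file
descriptor (not part of the patch; it assumes the driver's read/write file
operations route offsets through the index macros in vfio_pci_private.h and
that device_fd comes from the VFIO group interface):

	#include <stdint.h>
	#include <unistd.h>

	/* mirrors VFIO_PCI_INDEX_TO_OFFSET from vfio_pci_private.h */
	#define VFIO_PCI_OFFSET_SHIFT	40
	#define REGION_OFFSET(index)	((uint64_t)(index) << VFIO_PCI_OFFSET_SHIFT)

	/* read a dword at byte offset pos within BAR0 (region index 0) */
	uint32_t read_bar0_dword(int device_fd, uint64_t pos)
	{
		uint32_t val = 0;

		/* lands in vfio_pci_mem_readwrite() for an MMIO BAR */
		pread(device_fd, &val, sizeof(val), REGION_OFFSET(0) + pos);
		return val;
	}
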
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 1c7119c..d668283 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -220,6 +220,7 @@ struct vfio_device_info {
 	__u32	argsz;
 	__u32	flags;
 #define VFIO_DEVICE_FLAGS_RESET	(1 << 0)	/* Device supports reset */
+#define VFIO_DEVICE_FLAGS_PCI	(1 << 1)	/* vfio-pci device */
 	__u32	num_regions;	/* Max region index + 1 */
 	__u32	num_irqs;	/* Max IRQ index + 1 */
 };
@@ -361,6 +362,31 @@ struct vfio_irq_set {
  */
 #define VFIO_DEVICE_RESET		_IO(VFIO_TYPE, VFIO_BASE + 11)
 
+/*
+ * The VFIO-PCI bus driver makes use of the following fixed region and
+ * IRQ index mapping.  Unimplemented regions return a size of zero.
+ * Unimplemented IRQ types return a count of zero.
+ */
+
+enum {
+	VFIO_PCI_BAR0_REGION_INDEX,
+	VFIO_PCI_BAR1_REGION_INDEX,
+	VFIO_PCI_BAR2_REGION_INDEX,
+	VFIO_PCI_BAR3_REGION_INDEX,
+	VFIO_PCI_BAR4_REGION_INDEX,
+	VFIO_PCI_BAR5_REGION_INDEX,
+	VFIO_PCI_ROM_REGION_INDEX,
+	VFIO_PCI_CONFIG_REGION_INDEX,
+	VFIO_PCI_NUM_REGIONS
+};
+
+enum {
+	VFIO_PCI_INTX_IRQ_INDEX,
+	VFIO_PCI_MSI_IRQ_INDEX,
+	VFIO_PCI_MSIX_IRQ_INDEX,
+	VFIO_PCI_NUM_IRQS
+};
+
 /* -------- API for x86 VFIO IOMMU -------- */
 
 /**
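
Since unimplemented regions report a size of zero, userspace can probe the
fixed index space exhaustively.  A sketch (VFIO_DEVICE_GET_REGION_INFO and
struct vfio_region_info are defined elsewhere in this series; the field
names used here are an assumption):

	#include <stdio.h>
	#include <string.h>
	#include <sys/ioctl.h>
	#include <linux/vfio.h>

	static void probe_regions(int device_fd)
	{
		struct vfio_region_info info;
		int i;

		for (i = 0; i < VFIO_PCI_NUM_REGIONS; i++) {
			memset(&info, 0, sizeof(info));
			info.argsz = sizeof(info);
			info.index = i;
			if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info))
				continue;
			if (!info.size)
				continue;	/* unimplemented region */
			printf("region %d: %llu bytes at offset 0x%llx\n", i,
			       (unsigned long long)info.size,
			       (unsigned long long)info.offset);
		}
	}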

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* Re: [PATCH 01/13] driver core: Add iommu_group tracking to struct device
  2012-05-11 22:55   ` Alex Williamson
@ 2012-05-11 23:38     ` Greg KH
  -1 siblings, 0 replies; 129+ messages in thread
From: Greg KH @ 2012-05-11 23:38 UTC (permalink / raw)
  To: Alex Williamson
  Cc: benh, aik, david, joerg.roedel, dwmw2, chrisw, agraf, benve,
	aafabbri, B08248, B07421, avi, konrad.wilk, kvm, qemu-devel,
	iommu, linux-pci, linux-kernel, bhelgaas

On Fri, May 11, 2012 at 04:55:35PM -0600, Alex Williamson wrote:
> IOMMU groups allow IOMMU drivers to represent DMA visibility
> and isolation of devices.  Multiple devices may be grouped
> together for the purposes of DMA.  Placing a pointer on
> struct device enables easy access for things like streaming
> DMA programming and drivers like VFIO.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

Can't you get this today from the iommu_ops pointer that is on the bus
that the device is associated with?  Or can devices on a bus have
different iommu_group pointers?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 02/13] iommu: IOMMU Groups
  2012-05-11 22:55   ` Alex Williamson
@ 2012-05-11 23:39     ` Greg KH
  -1 siblings, 0 replies; 129+ messages in thread
From: Greg KH @ 2012-05-11 23:39 UTC (permalink / raw)
  To: Alex Williamson
  Cc: benh, aik, david, joerg.roedel, dwmw2, chrisw, agraf, benve,
	aafabbri, B08248, B07421, avi, konrad.wilk, kvm, qemu-devel,
	iommu, linux-pci, linux-kernel, bhelgaas

On Fri, May 11, 2012 at 04:55:41PM -0600, Alex Williamson wrote:
> IOMMU device groups are currently a rather vague associative notion
> with assembly required by the user or user level driver provider to
> do anything useful.  This patch intends to grow the IOMMU group concept
> into something a bit more consumable.
> 
> To do this, we first create an object representing the group, struct
> iommu_group.  This structure is allocated (iommu_group_alloc) and
> filled (iommu_group_add_device) by the iommu driver.  The iommu driver
> is free to add devices to the group using its own set of policies.
> This allows inclusion of devices based on physical hardware or topology
> limitations of the platform, as well as soft requirements, such as
> multi-function trust levels or peer-to-peer protection of the
> interconnects.  Each device may only belong to a single iommu group,
> which is linked from struct device.iommu_group.  IOMMU groups are
> maintained using kobject reference counting, allowing for automatic
> removal of empty, unreferenced groups.  It is the responsibility of
> the iommu driver to remove devices from the group
> (iommu_group_remove_device).
> 
> IOMMU groups also include a userspace representation in sysfs under
> /sys/kernel/iommu_groups.  When allocated, each group is given a
> dynamically assigned ID (int).  The ID is managed by the core IOMMU group
> code to support multiple heterogeneous iommu drivers, which could
> potentially collide in group naming/numbering.  This also keeps group
> IDs to small, easily managed values.  A directory is created under
> /sys/kernel/iommu_groups for each group.  A further subdirectory named
> "devices" contains links to each device within the group.  The iommu_group
> file in the device's sysfs directory, which formerly contained a group
> number when read, is now a link to the iommu group.  Example:
> 
> $ ls -l /sys/kernel/iommu_groups/26/devices/

<snip>

As you are creating new sysfs files and directories, you need to also
add the proper Documentation/ABI/ files at the same time.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 01/13] driver core: Add iommu_group tracking to struct device
@ 2012-05-11 23:58       ` Alex Williamson
  0 siblings, 0 replies; 129+ messages in thread
From: Alex Williamson @ 2012-05-11 23:58 UTC (permalink / raw)
  To: Greg KH
  Cc: benh, aik, david, joerg.roedel, dwmw2, chrisw, agraf, benve,
	aafabbri, B08248, B07421, avi, konrad.wilk, kvm, qemu-devel,
	iommu, linux-pci, linux-kernel, bhelgaas

On Fri, 2012-05-11 at 16:38 -0700, Greg KH wrote:
> On Fri, May 11, 2012 at 04:55:35PM -0600, Alex Williamson wrote:
> > IOMMU groups allow IOMMU drivers to represent DMA visibility
> > and isolation of devices.  Multiple devices may be grouped
> > together for the purposes of DMA.  Placing a pointer on
> > struct device enables easy access for things like streaming
> > DMA programming and drivers like VFIO.
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> 
> Can't you get this today from the iommu_ops pointer that is on the bus
> that the device is associated with?  Or can devices on a bus have
> different iommu_group pointers?

The latter; each device on a bus might be its own group.  This is often
the case on x86 unless PCIe-to-PCI bridges obscure the device
visibility.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 02/13] iommu: IOMMU Groups
@ 2012-05-11 23:58       ` Alex Williamson
  0 siblings, 0 replies; 129+ messages in thread
From: Alex Williamson @ 2012-05-11 23:58 UTC (permalink / raw)
  To: Greg KH
  Cc: benh, aik, david, joerg.roedel, dwmw2, chrisw, agraf, benve,
	aafabbri, B08248, B07421, avi, konrad.wilk, kvm, qemu-devel,
	iommu, linux-pci, linux-kernel, bhelgaas

On Fri, 2012-05-11 at 16:39 -0700, Greg KH wrote:
> On Fri, May 11, 2012 at 04:55:41PM -0600, Alex Williamson wrote:
> > IOMMU device groups are currently a rather vague associative notion
> > with assembly required by the user or user level driver provider to
> > do anything useful.  This patch intends to grow the IOMMU group concept
> > into something a bit more consumable.
> > 
> > To do this, we first create an object representing the group, struct
> > iommu_group.  This structure is allocated (iommu_group_alloc) and
> > filled (iommu_group_add_device) by the iommu driver.  The iommu driver
> > is free to add devices to the group using its own set of policies.
> > This allows inclusion of devices based on physical hardware or topology
> > limitations of the platform, as well as soft requirements, such as
> > multi-function trust levels or peer-to-peer protection of the
> > interconnects.  Each device may only belong to a single iommu group,
> > which is linked from struct device.iommu_group.  IOMMU groups are
> > maintained using kobject reference counting, allowing for automatic
> > removal of empty, unreferenced groups.  It is the responsibility of
> > the iommu driver to remove devices from the group
> > (iommu_group_remove_device).
> > 
> > IOMMU groups also include a userspace representation in sysfs under
> > /sys/kernel/iommu_groups.  When allocated, each group is given a
> > dynamically assigned ID (int).  The ID is managed by the core IOMMU group
> > code to support multiple heterogeneous iommu drivers, which could
> > potentially collide in group naming/numbering.  This also keeps group
> > IDs to small, easily managed values.  A directory is created under
> > /sys/kernel/iommu_groups for each group.  A further subdirectory named
> > "devices" contains links to each device within the group.  The iommu_group
> > file in the device's sysfs directory, which formerly contained a group
> > number when read, is now a link to the iommu group.  Example:
> > 
> > $ ls -l /sys/kernel/iommu_groups/26/devices/
> 
> <snip>
> 
> As you are creating new sysfs files and directories, you need to also
> add the proper Documentation/ABI/ files at the same time.

I'll update.  Thanks,

Alex



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 01/13] driver core: Add iommu_group tracking to struct device
  2012-05-11 23:58       ` Alex Williamson
@ 2012-05-12  0:00         ` Greg KH
  -1 siblings, 0 replies; 129+ messages in thread
From: Greg KH @ 2012-05-12  0:00 UTC (permalink / raw)
  To: Alex Williamson
  Cc: benh, aik, david, joerg.roedel, dwmw2, chrisw, agraf, benve,
	aafabbri, B08248, B07421, avi, konrad.wilk, kvm, qemu-devel,
	iommu, linux-pci, linux-kernel, bhelgaas

On Fri, May 11, 2012 at 05:58:01PM -0600, Alex Williamson wrote:
> On Fri, 2012-05-11 at 16:38 -0700, Greg KH wrote:
> > On Fri, May 11, 2012 at 04:55:35PM -0600, Alex Williamson wrote:
> > > IOMMU groups allow IOMMU drivers to represent DMA visibility
> > > and isolation of devices.  Multiple devices may be grouped
> > > together for the purposes of DMA.  Placing a pointer on
> > > struct device enables easy access for things like streaming
> > > DMA programming and drivers like VFIO.
> > > 
> > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > 
> > Can't you get this today from the iommu_ops pointer that is on the bus
> > that the device is associated with?  Or can devices on a bus have
> > different iommu_group pointers?
> 
> The latter; each device on a bus might be its own group.  This is often
> the case on x86 unless PCIe-to-PCI bridges obscure the device
> visibility.  Thanks,

Ah, ok, then I have no objection to add this to struct device:

Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 02/13] iommu: IOMMU Groups
@ 2012-05-14  1:16     ` David Gibson
  0 siblings, 0 replies; 129+ messages in thread
From: David Gibson @ 2012-05-14  1:16 UTC (permalink / raw)
  To: Alex Williamson
  Cc: benh, aik, joerg.roedel, dwmw2, chrisw, agraf, benve, aafabbri,
	B08248, B07421, avi, konrad.wilk, kvm, qemu-devel, iommu,
	linux-pci, linux-kernel, gregkh, bhelgaas

On Fri, May 11, 2012 at 04:55:41PM -0600, Alex Williamson wrote:
> IOMMU device groups are currently a rather vague associative notion
> with assembly required by the user or user level driver provider to
> do anything useful.  This patch intends to grow the IOMMU group concept
> into something a bit more consumable.
> 
> To do this, we first create an object representing the group, struct
> iommu_group.  This structure is allocated (iommu_group_alloc) and
> filled (iommu_group_add_device) by the iommu driver.  The iommu driver
> is free to add devices to the group using its own set of policies.
> This allows inclusion of devices based on physical hardware or topology
> limitations of the platform, as well as soft requirements, such as
> multi-function trust levels or peer-to-peer protection of the
> interconnects.  Each device may only belong to a single iommu group,
> which is linked from struct device.iommu_group.  IOMMU groups are
> maintained using kobject reference counting, allowing for automatic
> removal of empty, unreferenced groups.  It is the responsibility of
> the iommu driver to remove devices from the group
> (iommu_group_remove_device).
> 
> IOMMU groups also include a userspace representation in sysfs under
> /sys/kernel/iommu_groups.  When allocated, each group is given a
> dynamically assigned ID (int).  The ID is managed by the core IOMMU group
> code to support multiple heterogeneous iommu drivers, which could
> potentially collide in group naming/numbering.  This also keeps group
> IDs to small, easily managed values.  A directory is created under
> /sys/kernel/iommu_groups for each group.  A further subdirectory named
> "devices" contains links to each device within the group.  The iommu_group
> file in the device's sysfs directory, which formerly contained a group
> number when read, is now a link to the iommu group.  Example:
> 
> $ ls -l /sys/kernel/iommu_groups/26/devices/
> total 0
> lrwxrwxrwx. 1 root root 0 Apr 17 12:57 0000:00:1e.0 ->
> 		../../../../devices/pci0000:00/0000:00:1e.0
> lrwxrwxrwx. 1 root root 0 Apr 17 12:57 0000:06:0d.0 ->
> 		../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.0
> lrwxrwxrwx. 1 root root 0 Apr 17 12:57 0000:06:0d.1 ->
> 		../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.1
> 
> $ ls -l  /sys/kernel/iommu_groups/26/devices/*/iommu_group
> [truncating perms/owner/timestamp]
> /sys/kernel/iommu_groups/26/devices/0000:00:1e.0/iommu_group ->
> 					../../../kernel/iommu_groups/26
> /sys/kernel/iommu_groups/26/devices/0000:06:0d.0/iommu_group ->
> 					../../../../kernel/iommu_groups/26
> /sys/kernel/iommu_groups/26/devices/0000:06:0d.1/iommu_group ->
> 					../../../../kernel/iommu_groups/26
> 
> Groups also include several exported functions for use by user level
> driver providers, for example VFIO.  These include:
> 
> iommu_group_get(): Acquires a reference to a group from a device
> iommu_group_put(): Releases reference
> iommu_group_for_each_dev(): Iterates over group devices using callback
> iommu_group_[un]register_notifier(): Allows notification of device add
>         and remove operations relevant to the group
> iommu_group_id(): Return the group number
> 
> This patch also extends the IOMMU API to allow attaching groups to
> domains.  This is currently a simple wrapper for iterating through
> devices within a group, but it's expected that the IOMMU API may
> eventually make groups a more integral part of domains.
> 
> Groups intentionally do not try to manage group ownership.  A user
> level driver provider must independently acquire ownership for each
> device within a group before making use of the group as a whole.
> This may change in the future if group usage becomes more pervasive
> across both DMA and IOMMU ops.
> 
> Groups intentionally do not provide a mechanism for driver locking
> or otherwise manipulating driver matching/probing of devices within
> the group.  Such interfaces are generic to devices and beyond the
> scope of IOMMU groups.  If implemented, user level providers have
> ready access via iommu_group_for_each_dev and group notifiers.
> 
> Groups currently provide no storage for iommu context, but some kind
> of iommu_group_get/set_iommudata() interface is likely if groups
> become more pervasive in the dma layers.
> 
> iommu_device_group() is removed here as it has no users.  The
> replacement is:
> 
> 	group = iommu_group_get(dev);
> 	id = iommu_group_id(group);
> 	iommu_group_put(group);

Looks reasonable to me, with a few nits, noted below.

[snip]
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 2198b2d..f75004e 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -26,60 +26,404 @@
>  #include <linux/slab.h>
>  #include <linux/errno.h>
>  #include <linux/iommu.h>
> +#include <linux/idr.h>
> +#include <linux/notifier.h>
> +
> +static struct kset *iommu_group_kset;
> +static struct ida iommu_group_ida;
> +static struct mutex iommu_group_mutex;
> +
> +struct iommu_group {
> +	struct kobject kobj;
> +	struct kobject *devices_kobj;
> +	struct list_head devices;
> +	struct mutex mutex;
> +	struct blocking_notifier_head notifier;
> +	int id;

I think you should add some sort of name string to the group as well
(supplied by the iommu driver creating the group).  That would make it
easier to connect the arbitrarily assigned IDs to any platform-standard
naming convention for these things.

[snip]
> +/**
> + * iommu_group_add_device - add a device to an iommu group
> + * @group: the group into which to add the device (reference should be held)
> + * @dev: the device
> + *
> + * This function is called by an iommu driver to add a device into a
> + * group.  Adding a device increments the group reference count.
> + */
> +int iommu_group_add_device(struct iommu_group *group, struct device *dev)
> +{
> +	int ret;
> +	struct iommu_device *device;
> +
> +	device = kzalloc(sizeof(*device), GFP_KERNEL);
> +	if (!device)
> +		return -ENOMEM;
> +
> +	device->dev = dev;
> +
> +	ret = sysfs_create_link(&dev->kobj, &group->kobj, "iommu_group");
> +	if (ret) {
> +		kfree(device);
> +		return ret;
> +	}
> +
> +	ret = sysfs_create_link(group->devices_kobj, &dev->kobj,
> +				kobject_name(&dev->kobj));
> +	if (ret) {
> +		sysfs_remove_link(&dev->kobj, "iommu_group");
> +		kfree(device);
> +		return ret;
> +	}

So, it's not clear that the kobject_name() here has to be unique
across all devices in the group.  It might be better to use an
arbitrary index here instead of a name to avoid that problem.

[snip]
> +/**
> + * iommu_group_remove_device - remove a device from its current group
> + * @dev: device to be removed
> + *
> + * This function is called by an iommu driver to remove the device from
> + * its current group.  This decrements the iommu group reference count.
> + */
> +void iommu_group_remove_device(struct device *dev)
> +{
> +	struct iommu_group *group = dev->iommu_group;
> +	struct iommu_device *device;
> +
> +	/* Pre-notify listeners that a device is being removed. */
> +	blocking_notifier_call_chain(&group->notifier,
> +				     IOMMU_GROUP_NOTIFY_DEL_DEVICE, dev);
> +
> +	mutex_lock(&group->mutex);
> +	list_for_each_entry(device, &group->devices, list) {
> +		if (device->dev == dev) {
> +			list_del(&device->list);
> +			kfree(device);
> +			break;
> +		}
> +	}
> +	mutex_unlock(&group->mutex);
> +
> +	sysfs_remove_link(group->devices_kobj, kobject_name(&dev->kobj));
> +	sysfs_remove_link(&dev->kobj, "iommu_group");
> +
> +	dev->iommu_group = NULL;

I suspect the dev -> group pointer should be cleared first, under the
group lock, but I'm not certain about that.

[snip]
> +/**
> + * iommu_group_for_each_dev - iterate over each device in the group
> + * @group: the group
> + * @data: caller opaque data to be passed to callback function
> + * @fn: caller supplied callback function
> + *
> + * This function is called by group users to iterate over group devices.
> + * Callers should hold a reference count to the group during
> callback.

Probably also worth noting in this doco that the group lock will be
held across the callback.
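
To make the iteration contract concrete, a minimal consumer sketch (using
the signatures described in the commit message; per the comment above, fn
runs with the group mutex held, so it must not call back into group
operations):

	static int count_one(struct device *dev, void *data)
	{
		(*(int *)data)++;
		return 0;	/* a non-zero return stops the iteration */
	}

	static int devices_in_group(struct device *dev)
	{
		struct iommu_group *group = iommu_group_get(dev);
		int count = 0;

		if (!group)
			return 0;
		iommu_group_for_each_dev(group, &count, count_one);
		iommu_group_put(group);
		return count;
	}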

[snip]
> +static int iommu_bus_notifier(struct notifier_block *nb,
> +			      unsigned long action, void *data)
>  {
>  	struct device *dev = data;
> +	struct iommu_ops *ops = dev->bus->iommu_ops;
> +	struct iommu_group *group;
> +	unsigned long group_action = 0;
> +
> +	/*
> +	 * ADD/DEL call into iommu driver ops if provided, which may
> +	 * result in ADD/DEL notifiers to group->notifier
> +	 */
> +	if (action == BUS_NOTIFY_ADD_DEVICE) {
> +		if (ops->add_device)
> +			return ops->add_device(dev);
> +	} else if (action == BUS_NOTIFY_DEL_DEVICE) {
> +		if (ops->remove_device && dev->iommu_group) {
> +			ops->remove_device(dev);
> +			return 0;
> +		}
> +	}

So, there's still the question of how to assign grouping for devices
on a subordinate bus behind a bridge which is iommu managed.  The
iommu driver for the top-level bus can't know about all possible
subordinate bus types, but the subordinate devices will (in most
cases, anyway) be iommu translated as if originating with the bus
bridge.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 02/13] iommu: IOMMU Groups
@ 2012-05-14  1:16     ` David Gibson
  0 siblings, 0 replies; 129+ messages in thread
From: David Gibson @ 2012-05-14  1:16 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm-u79uwXL29TY76Z2rM5mHXA, B07421-KZfg59tc24xl57MIdRCFDg,
	aik-sLpHqDYs0B2HXe+LvDLADg,
	benh-XVmvHMARGAS8U2dJNN8I7kB+6BGkLq7r,
	linux-pci-u79uwXL29TY76Z2rM5mHXA, agraf-l3A5Bk7waGM,
	qemu-devel-qX2TKyscuCcdnm+yROfE0A, chrisw-69jw2NvuJkxg9hUCZPvPmw,
	B08248-KZfg59tc24xl57MIdRCFDg,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
	avi-H+wXaHxf7aLQT0dZR+AlfA, bhelgaas-hpIqsD4AKlfQT0dZR+AlfA,
	dwmw2-wEGCiKHe2LqWVfeAwA7xHQ,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	benve-FYB4Gu1CFyUAvxtiuMwx3w

On Fri, May 11, 2012 at 04:55:41PM -0600, Alex Williamson wrote:
> IOMMU device groups are currently a rather vague associative notion
> with assembly required by the user or user level driver provider to
> do anything useful.  This patch intends to grow the IOMMU group concept
> into something a bit more consumable.
> 
> To do this, we first create an object representing the group, struct
> iommu_group.  This structure is allocated (iommu_group_alloc) and
> filled (iommu_group_add_device) by the iommu driver.  The iommu driver
> is free to add devices to the group using it's own set of policies.
> This allows inclusion of devices based on physical hardware or topology
> limitations of the platform, as well as soft requirements, such as
> multi-function trust levels or peer-to-peer protection of the
> interconnects.  Each device may only belong to a single iommu group,
> which is linked from struct device.iommu_group.  IOMMU groups are
> maintained using kobject reference counting, allowing for automatic
> removal of empty, unreferenced groups.  It is the responsibility of
> the iommu driver to remove devices from the group
> (iommu_group_remove_device).
> 
> IOMMU groups also include a userspace representation in sysfs under
> /sys/kernel/iommu_groups.  When allocated, each group is given a
> dynamically assign ID (int).  The ID is managed by the core IOMMU group
> code to support multiple heterogeneous iommu drivers, which could
> potentially collide in group naming/numbering.  This also keeps group
> IDs to small, easily managed values.  A directory is created under
> /sys/kernel/iommu_groups for each group.  A further subdirectory named
> "devices" contains links to each device within the group.  The iommu_group
> file in the device's sysfs directory, which formerly contained a group
> number when read, is now a link to the iommu group.  Example:
> 
> $ ls -l /sys/kernel/iommu_groups/26/devices/
> total 0
> lrwxrwxrwx. 1 root root 0 Apr 17 12:57 0000:00:1e.0 ->
> 		../../../../devices/pci0000:00/0000:00:1e.0
> lrwxrwxrwx. 1 root root 0 Apr 17 12:57 0000:06:0d.0 ->
> 		../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.0
> lrwxrwxrwx. 1 root root 0 Apr 17 12:57 0000:06:0d.1 ->
> 		../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.1
> 
> $ ls -l  /sys/kernel/iommu_groups/26/devices/*/iommu_group
> [truncating perms/owner/timestamp]
> /sys/kernel/iommu_groups/26/devices/0000:00:1e.0/iommu_group ->
> 					../../../kernel/iommu_groups/26
> /sys/kernel/iommu_groups/26/devices/0000:06:0d.0/iommu_group ->
> 					../../../../kernel/iommu_groups/26
> /sys/kernel/iommu_groups/26/devices/0000:06:0d.1/iommu_group ->
> 					../../../../kernel/iommu_groups/26
> 
> Groups also include several exported functions for use by user level
> driver providers, for example VFIO.  These include:
> 
> iommu_group_get(): Acquires a reference to a group from a device
> iommu_group_put(): Releases reference
> iommu_group_for_each_dev(): Iterates over group devices using callback
> iommu_group_[un]register_notifier(): Allows notification of device add
>         and remove operations relevant to the group
> iommu_group_id(): Return the group number
> 
> This patch also extends the IOMMU API to allow attaching groups to
> domains.  This is currently a simple wrapper for iterating through
> devices within a group, but it's expected that the IOMMU API may
> eventually make groups a more integral part of domains.
> 
> Groups intentionally do not try to manage group ownership.  A user
> level driver provider must independently acquire ownership for each
> device within a group before making use of the group as a whole.
> This may change in the future if group usage becomes more pervasive
> across both DMA and IOMMU ops.
> 
> Groups intentionally do not provide a mechanism for driver locking
> or otherwise manipulating driver matching/probing of devices within
> the group.  Such interfaces are generic to devices and beyond the
> scope of IOMMU groups.  If implemented, user level providers have
> ready access via iommu_group_for_each_dev and group notifiers.
> 
> Groups currently provide no storage for iommu context, but some kind
> of iommu_group_get/set_iommudata() interface is likely if groups
> become more pervasive in the dma layers.
> 
> iommu_device_group() is removed here as it has no users.  The
> replacement is:
> 
> 	group = iommu_group_get(dev);
> 	id = iommu_group_id(group);
> 	iommu_group_put(group);

Looks reasonable to me, with a few nits, noted below.

[snip]
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 2198b2d..f75004e 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -26,60 +26,404 @@
>  #include <linux/slab.h>
>  #include <linux/errno.h>
>  #include <linux/iommu.h>
> +#include <linux/idr.h>
> +#include <linux/notifier.h>
> +
> +static struct kset *iommu_group_kset;
> +static struct ida iommu_group_ida;
> +static struct mutex iommu_group_mutex;
> +
> +struct iommu_group {
> +	struct kobject kobj;
> +	struct kobject *devices_kobj;
> +	struct list_head devices;
> +	struct mutex mutex;
> +	struct blocking_notifier_head notifier;
> +	int id;

I think you should add some sort of name string to the group as well
(supplied by the iommu driver creating the group).  That would make it
easier to connect the arbitrary assigned IDs to any platform-standard
naming convention for these things.

[snip]
> +/**
> + * iommu_group_add_device - add a device to an iommu group
> + * @group: the group into which to add the device (reference should be held)
> + * @dev: the device
> + *
> + * This function is called by an iommu driver to add a device into a
> + * group.  Adding a device increments the group reference count.
> + */
> +int iommu_group_add_device(struct iommu_group *group, struct device *dev)
> +{
> +	int ret;
> +	struct iommu_device *device;
> +
> +	device = kzalloc(sizeof(*device), GFP_KERNEL);
> +	if (!device)
> +		return -ENOMEM;
> +
> +	device->dev = dev;
> +
> +	ret = sysfs_create_link(&dev->kobj, &group->kobj, "iommu_group");
> +	if (ret) {
> +		kfree(device);
> +		return ret;
> +	}
> +
> +	ret = sysfs_create_link(group->devices_kobj, &dev->kobj,
> +				kobject_name(&dev->kobj));
> +	if (ret) {
> +		sysfs_remove_link(&dev->kobj, "iommu_group");
> +		kfree(device);
> +		return ret;
> +	}

So, it's not clear that the kobject_name() here has to be unique
across all devices in the group.  It might be better to use an
arbitrary index here instead of a name to avoid that problem.

[snip]
> +/**
> + * iommu_group_remove_device - remove a device from it's current group
> + * @dev: device to be removed
> + *
> + * This function is called by an iommu driver to remove the device from
> + * it's current group.  This decrements the iommu group reference count.
> + */
> +void iommu_group_remove_device(struct device *dev)
> +{
> +	struct iommu_group *group = dev->iommu_group;
> +	struct iommu_device *device;
> +
> +	/* Pre-notify listeners that a device is being removed. */
> +	blocking_notifier_call_chain(&group->notifier,
> +				     IOMMU_GROUP_NOTIFY_DEL_DEVICE, dev);
> +
> +	mutex_lock(&group->mutex);
> +	list_for_each_entry(device, &group->devices, list) {
> +		if (device->dev == dev) {
> +			list_del(&device->list);
> +			kfree(device);
> +			break;
> +		}
> +	}
> +	mutex_unlock(&group->mutex);
> +
> +	sysfs_remove_link(group->devices_kobj, kobject_name(&dev->kobj));
> +	sysfs_remove_link(&dev->kobj, "iommu_group");
> +
> +	dev->iommu_group = NULL;

I suspect the dev -> group pointer should be cleared first, under the
group lock, but I'm not certain about that.

[snip]
> +/**
> + * iommu_group_for_each_dev - iterate over each device in the group
> + * @group: the group
> + * @data: caller opaque data to be passed to callback function
> + * @fn: caller supplied callback function
> + *
> + * This function is called by group users to iterate over group devices.
> + * Callers should hold a reference count to the group during
> callback.

Probably also worth noting in this doco that the group lock will be
held across the callback.

[snip]
> +static int iommu_bus_notifier(struct notifier_block *nb,
> +			      unsigned long action, void *data)
>  {
>  	struct device *dev = data;
> +	struct iommu_ops *ops = dev->bus->iommu_ops;
> +	struct iommu_group *group;
> +	unsigned long group_action = 0;
> +
> +	/*
> +	 * ADD/DEL call into iommu driver ops if provided, which may
> +	 * result in ADD/DEL notifiers to group->notifier
> +	 */
> +	if (action == BUS_NOTIFY_ADD_DEVICE) {
> +		if (ops->add_device)
> +			return ops->add_device(dev);
> +	} else if (action == BUS_NOTIFY_DEL_DEVICE) {
> +		if (ops->remove_device && dev->iommu_group) {
> +			ops->remove_device(dev);
> +			return 0;
> +		}
> +	}

So, there's still the question of how to assign grouping for devices
on a subordinate bus behind a bridge which is iommu managed.  The
iommu driver for the top-level bus can't know about all possible
subordinate bus types, but the subordinate devices will (in most
cases, anyway) be iommu translated as if originating with the bus
bridge.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [Qemu-devel] [PATCH 02/13] iommu: IOMMU Groups
@ 2012-05-14  1:16     ` David Gibson
  0 siblings, 0 replies; 129+ messages in thread
From: David Gibson @ 2012-05-14  1:16 UTC (permalink / raw)
  To: Alex Williamson
  Cc: aafabbri, kvm, B07421, aik, konrad.wilk, linux-pci, agraf,
	qemu-devel, chrisw, B08248, iommu, gregkh, avi, joerg.roedel,
	bhelgaas, dwmw2, linux-kernel, benve

On Fri, May 11, 2012 at 04:55:41PM -0600, Alex Williamson wrote:
> IOMMU device groups are currently a rather vague associative notion
> with assembly required by the user or user level driver provider to
> do anything useful.  This patch intends to grow the IOMMU group concept
> into something a bit more consumable.
> 
> To do this, we first create an object representing the group, struct
> iommu_group.  This structure is allocated (iommu_group_alloc) and
> filled (iommu_group_add_device) by the iommu driver.  The iommu driver
> is free to add devices to the group using it's own set of policies.
> This allows inclusion of devices based on physical hardware or topology
> limitations of the platform, as well as soft requirements, such as
> multi-function trust levels or peer-to-peer protection of the
> interconnects.  Each device may only belong to a single iommu group,
> which is linked from struct device.iommu_group.  IOMMU groups are
> maintained using kobject reference counting, allowing for automatic
> removal of empty, unreferenced groups.  It is the responsibility of
> the iommu driver to remove devices from the group
> (iommu_group_remove_device).
> 
> IOMMU groups also include a userspace representation in sysfs under
> /sys/kernel/iommu_groups.  When allocated, each group is given a
> dynamically assign ID (int).  The ID is managed by the core IOMMU group
> code to support multiple heterogeneous iommu drivers, which could
> potentially collide in group naming/numbering.  This also keeps group
> IDs to small, easily managed values.  A directory is created under
> /sys/kernel/iommu_groups for each group.  A further subdirectory named
> "devices" contains links to each device within the group.  The iommu_group
> file in the device's sysfs directory, which formerly contained a group
> number when read, is now a link to the iommu group.  Example:
> 
> $ ls -l /sys/kernel/iommu_groups/26/devices/
> total 0
> lrwxrwxrwx. 1 root root 0 Apr 17 12:57 0000:00:1e.0 ->
> 		../../../../devices/pci0000:00/0000:00:1e.0
> lrwxrwxrwx. 1 root root 0 Apr 17 12:57 0000:06:0d.0 ->
> 		../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.0
> lrwxrwxrwx. 1 root root 0 Apr 17 12:57 0000:06:0d.1 ->
> 		../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.1
> 
> $ ls -l  /sys/kernel/iommu_groups/26/devices/*/iommu_group
> [truncating perms/owner/timestamp]
> /sys/kernel/iommu_groups/26/devices/0000:00:1e.0/iommu_group ->
> 					../../../kernel/iommu_groups/26
> /sys/kernel/iommu_groups/26/devices/0000:06:0d.0/iommu_group ->
> 					../../../../kernel/iommu_groups/26
> /sys/kernel/iommu_groups/26/devices/0000:06:0d.1/iommu_group ->
> 					../../../../kernel/iommu_groups/26
> 
> Groups also include several exported functions for use by user level
> driver providers, for example VFIO.  These include:
> 
> iommu_group_get(): Acquires a reference to a group from a device
> iommu_group_put(): Releases reference
> iommu_group_for_each_dev(): Iterates over group devices using callback
> iommu_group_[un]register_notifier(): Allows notification of device add
>         and remove operations relevant to the group
> iommu_group_id(): Return the group number
> 
> This patch also extends the IOMMU API to allow attaching groups to
> domains.  This is currently a simple wrapper for iterating through
> devices within a group, but it's expected that the IOMMU API may
> eventually make groups a more integral part of domains.
> 
> Groups intentionally do not try to manage group ownership.  A user
> level driver provider must independently acquire ownership for each
> device within a group before making use of the group as a whole.
> This may change in the future if group usage becomes more pervasive
> across both DMA and IOMMU ops.
> 
> Groups intentionally do not provide a mechanism for driver locking
> or otherwise manipulating driver matching/probing of devices within
> the group.  Such interfaces are generic to devices and beyond the
> scope of IOMMU groups.  If implemented, user level providers have
> ready access via iommu_group_for_each_dev and group notifiers.
> 
> Groups currently provide no storage for iommu context, but some kind
> of iommu_group_get/set_iommudata() interface is likely if groups
> become more pervasive in the dma layers.
> 
> iommu_device_group() is removed here as it has no users.  The
> replacement is:
> 
> 	group = iommu_group_get(dev);
> 	id = iommu_group_id(group);
> 	iommu_group_put(group);

Looks reasonable to me, with a few nits, noted below.

[snip]
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 2198b2d..f75004e 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -26,60 +26,404 @@
>  #include <linux/slab.h>
>  #include <linux/errno.h>
>  #include <linux/iommu.h>
> +#include <linux/idr.h>
> +#include <linux/notifier.h>
> +
> +static struct kset *iommu_group_kset;
> +static struct ida iommu_group_ida;
> +static struct mutex iommu_group_mutex;
> +
> +struct iommu_group {
> +	struct kobject kobj;
> +	struct kobject *devices_kobj;
> +	struct list_head devices;
> +	struct mutex mutex;
> +	struct blocking_notifier_head notifier;
> +	int id;

I think you should add some sort of name string to the group as well
(supplied by the iommu driver creating the group).  That would make it
easier to connect the arbitrary assigned IDs to any platform-standard
naming convention for these things.

[snip]
> +/**
> + * iommu_group_add_device - add a device to an iommu group
> + * @group: the group into which to add the device (reference should be held)
> + * @dev: the device
> + *
> + * This function is called by an iommu driver to add a device into a
> + * group.  Adding a device increments the group reference count.
> + */
> +int iommu_group_add_device(struct iommu_group *group, struct device *dev)
> +{
> +	int ret;
> +	struct iommu_device *device;
> +
> +	device = kzalloc(sizeof(*device), GFP_KERNEL);
> +	if (!device)
> +		return -ENOMEM;
> +
> +	device->dev = dev;
> +
> +	ret = sysfs_create_link(&dev->kobj, &group->kobj, "iommu_group");
> +	if (ret) {
> +		kfree(device);
> +		return ret;
> +	}
> +
> +	ret = sysfs_create_link(group->devices_kobj, &dev->kobj,
> +				kobject_name(&dev->kobj));
> +	if (ret) {
> +		sysfs_remove_link(&dev->kobj, "iommu_group");
> +		kfree(device);
> +		return ret;
> +	}

So, it's not clear that the kobject_name() here has to be unique
across all devices in the group.  It might be better to use an
arbitrary index here instead of a name to avoid that problem.

[snip]
> +/**
> + * iommu_group_remove_device - remove a device from its current group
> + * @dev: device to be removed
> + *
> + * This function is called by an iommu driver to remove the device from
> + * its current group.  This decrements the iommu group reference count.
> + */
> +void iommu_group_remove_device(struct device *dev)
> +{
> +	struct iommu_group *group = dev->iommu_group;
> +	struct iommu_device *device;
> +
> +	/* Pre-notify listeners that a device is being removed. */
> +	blocking_notifier_call_chain(&group->notifier,
> +				     IOMMU_GROUP_NOTIFY_DEL_DEVICE, dev);
> +
> +	mutex_lock(&group->mutex);
> +	list_for_each_entry(device, &group->devices, list) {
> +		if (device->dev == dev) {
> +			list_del(&device->list);
> +			kfree(device);
> +			break;
> +		}
> +	}
> +	mutex_unlock(&group->mutex);
> +
> +	sysfs_remove_link(group->devices_kobj, kobject_name(&dev->kobj));
> +	sysfs_remove_link(&dev->kobj, "iommu_group");
> +
> +	dev->iommu_group = NULL;

I suspect the dev -> group pointer should be cleared first, under the
group lock, but I'm not certain about that.

[snip]
> +/**
> + * iommu_group_for_each_dev - iterate over each device in the group
> + * @group: the group
> + * @data: caller opaque data to be passed to callback function
> + * @fn: caller supplied callback function
> + *
> + * This function is called by group users to iterate over group devices.
> + * Callers should hold a reference count to the group during callback.

Probably also worth noting in this doco that the group lock will be
held across the callback.

[snip]
> +static int iommu_bus_notifier(struct notifier_block *nb,
> +			      unsigned long action, void *data)
>  {
>  	struct device *dev = data;
> +	struct iommu_ops *ops = dev->bus->iommu_ops;
> +	struct iommu_group *group;
> +	unsigned long group_action = 0;
> +
> +	/*
> +	 * ADD/DEL call into iommu driver ops if provided, which may
> +	 * result in ADD/DEL notifiers to group->notifier
> +	 */
> +	if (action == BUS_NOTIFY_ADD_DEVICE) {
> +		if (ops->add_device)
> +			return ops->add_device(dev);
> +	} else if (action == BUS_NOTIFY_DEL_DEVICE) {
> +		if (ops->remove_device && dev->iommu_group) {
> +			ops->remove_device(dev);
> +			return 0;
> +		}
> +	}

So, there's still the question of how to assign grouping for devices
on a subordinate bus behind a bridge which is iommu managed.  The
iommu driver for the top-level bus can't know about all possible
subordinate bus types, but the subordinate devices will (in most
cases, anyway) be iommu translated as if originating with the bus
bridge.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 02/13] iommu: IOMMU Groups
@ 2012-05-14 17:11       ` Alex Williamson
  0 siblings, 0 replies; 129+ messages in thread
From: Alex Williamson @ 2012-05-14 17:11 UTC (permalink / raw)
  To: David Gibson
  Cc: benh, aik, joerg.roedel, dwmw2, chrisw, agraf, benve, aafabbri,
	B08248, B07421, avi, konrad.wilk, kvm, qemu-devel, iommu,
	linux-pci, linux-kernel, gregkh, bhelgaas

On Mon, 2012-05-14 at 11:16 +1000, David Gibson wrote:
> On Fri, May 11, 2012 at 04:55:41PM -0600, Alex Williamson wrote:
> > IOMMU device groups are currently a rather vague associative notion
> > with assembly required by the user or user level driver provider to
> > do anything useful.  This patch intends to grow the IOMMU group concept
> > into something a bit more consumable.
> > 
> > To do this, we first create an object representing the group, struct
> > iommu_group.  This structure is allocated (iommu_group_alloc) and
> > filled (iommu_group_add_device) by the iommu driver.  The iommu driver
> > is free to add devices to the group using its own set of policies.
> > This allows inclusion of devices based on physical hardware or topology
> > limitations of the platform, as well as soft requirements, such as
> > multi-function trust levels or peer-to-peer protection of the
> > interconnects.  Each device may only belong to a single iommu group,
> > which is linked from struct device.iommu_group.  IOMMU groups are
> > maintained using kobject reference counting, allowing for automatic
> > removal of empty, unreferenced groups.  It is the responsibility of
> > the iommu driver to remove devices from the group
> > (iommu_group_remove_device).
> > 
> > IOMMU groups also include a userspace representation in sysfs under
> > /sys/kernel/iommu_groups.  When allocated, each group is given a
> > dynamically assigned ID (int).  The ID is managed by the core IOMMU group
> > code to support multiple heterogeneous iommu drivers, which could
> > potentially collide in group naming/numbering.  This also keeps group
> > IDs to small, easily managed values.  A directory is created under
> > /sys/kernel/iommu_groups for each group.  A further subdirectory named
> > "devices" contains links to each device within the group.  The iommu_group
> > file in the device's sysfs directory, which formerly contained a group
> > number when read, is now a link to the iommu group.  Example:
> > 
> > $ ls -l /sys/kernel/iommu_groups/26/devices/
> > total 0
> > lrwxrwxrwx. 1 root root 0 Apr 17 12:57 0000:00:1e.0 ->
> > 		../../../../devices/pci0000:00/0000:00:1e.0
> > lrwxrwxrwx. 1 root root 0 Apr 17 12:57 0000:06:0d.0 ->
> > 		../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.0
> > lrwxrwxrwx. 1 root root 0 Apr 17 12:57 0000:06:0d.1 ->
> > 		../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.1
> > 
> > $ ls -l  /sys/kernel/iommu_groups/26/devices/*/iommu_group
> > [truncating perms/owner/timestamp]
> > /sys/kernel/iommu_groups/26/devices/0000:00:1e.0/iommu_group ->
> > 					../../../kernel/iommu_groups/26
> > /sys/kernel/iommu_groups/26/devices/0000:06:0d.0/iommu_group ->
> > 					../../../../kernel/iommu_groups/26
> > /sys/kernel/iommu_groups/26/devices/0000:06:0d.1/iommu_group ->
> > 					../../../../kernel/iommu_groups/26
> > 
> > Groups also include several exported functions for use by user level
> > driver providers, for example VFIO.  These include:
> > 
> > iommu_group_get(): Acquires a reference to a group from a device
> > iommu_group_put(): Releases reference
> > iommu_group_for_each_dev(): Iterates over group devices using callback
> > iommu_group_[un]register_notifier(): Allows notification of device add
> >         and remove operations relevant to the group
> > iommu_group_id(): Returns the group number
> > 
> > This patch also extends the IOMMU API to allow attaching groups to
> > domains.  This is currently a simple wrapper for iterating through
> > devices within a group, but it's expected that the IOMMU API may
> > eventually make groups a more integral part of domains.
> > 
> > Groups intentionally do not try to manage group ownership.  A user
> > level driver provider must independently acquire ownership for each
> > device within a group before making use of the group as a whole.
> > This may change in the future if group usage becomes more pervasive
> > across both DMA and IOMMU ops.
> > 
> > Groups intentionally do not provide a mechanism for driver locking
> > or otherwise manipulating driver matching/probing of devices within
> > the group.  Such interfaces are generic to devices and beyond the
> > scope of IOMMU groups.  If implemented, user level providers have
> > ready access via iommu_group_for_each_dev and group notifiers.
> > 
> > Groups currently provide no storage for iommu context, but some kind
> > of iommu_group_get/set_iommudata() interface is likely if groups
> > become more pervasive in the dma layers.
> > 
> > iommu_device_group() is removed here as it has no users.  The
> > replacement is:
> > 
> > 	group = iommu_group_get(dev);
> > 	id = iommu_group_id(group);
> > 	iommu_group_put(group);
> 
> Looks reasonable to me, with a few nits, noted below.
> 
> [snip]
> > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > index 2198b2d..f75004e 100644
> > --- a/drivers/iommu/iommu.c
> > +++ b/drivers/iommu/iommu.c
> > @@ -26,60 +26,404 @@
> >  #include <linux/slab.h>
> >  #include <linux/errno.h>
> >  #include <linux/iommu.h>
> > +#include <linux/idr.h>
> > +#include <linux/notifier.h>
> > +
> > +static struct kset *iommu_group_kset;
> > +static struct ida iommu_group_ida;
> > +static struct mutex iommu_group_mutex;
> > +
> > +struct iommu_group {
> > +	struct kobject kobj;
> > +	struct kobject *devices_kobj;
> > +	struct list_head devices;
> > +	struct mutex mutex;
> > +	struct blocking_notifier_head notifier;
> > +	int id;
> 
> I think you should add some sort of name string to the group as well
> (supplied by the iommu driver creating the group).  That would make it
> easier to connect the arbitrary assigned IDs to any platform-standard
> naming convention for these things.

When would the name be used and how is it exposed?
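
For illustration, one way such a name could be exposed is as a sysfs
attribute on the group kobject; the "name" field and show routine below
are hypothetical, not part of the posted patch:

	/* Sketch: driver-supplied name string, visible to userspace as
	 * /sys/kernel/iommu_groups/<id>/name */
	static ssize_t iommu_group_show_name(struct iommu_group *group,
					     char *buf)
	{
		return sprintf(buf, "%s\n", group->name);
	}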

> [snip]
> > +/**
> > + * iommu_group_add_device - add a device to an iommu group
> > + * @group: the group into which to add the device (reference should be held)
> > + * @dev: the device
> > + *
> > + * This function is called by an iommu driver to add a device into a
> > + * group.  Adding a device increments the group reference count.
> > + */
> > +int iommu_group_add_device(struct iommu_group *group, struct device *dev)
> > +{
> > +	int ret;
> > +	struct iommu_device *device;
> > +
> > +	device = kzalloc(sizeof(*device), GFP_KERNEL);
> > +	if (!device)
> > +		return -ENOMEM;
> > +
> > +	device->dev = dev;
> > +
> > +	ret = sysfs_create_link(&dev->kobj, &group->kobj, "iommu_group");
> > +	if (ret) {
> > +		kfree(device);
> > +		return ret;
> > +	}
> > +
> > +	ret = sysfs_create_link(group->devices_kobj, &dev->kobj,
> > +				kobject_name(&dev->kobj));
> > +	if (ret) {
> > +		sysfs_remove_link(&dev->kobj, "iommu_group");
> > +		kfree(device);
> > +		return ret;
> > +	}
> 
> So, it's not clear that the kobject_name() here has to be unique
> across all devices in the group.  It might be better to use an
> arbitrary index here instead of a name to avoid that problem.

Hmm, that loses useful convenience when they are unique, such as on PCI.
I'll look and see if sysfs_create_link will fail on duplicate names and
see about adding some kind of instance to it.
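
For what it's worth, that fallback might take a shape like the
following; the sysfs_create_link_nowarn() call and the ".N" suffix
scheme are assumptions for illustration, not the posted code:

	/* Sketch: try the plain device name first, disambiguate with a
	 * per-group instance suffix if the link already exists. */
	ret = sysfs_create_link_nowarn(group->devices_kobj, &dev->kobj,
				       kobject_name(&dev->kobj));
	if (ret == -EEXIST) {
		char *name = kasprintf(GFP_KERNEL, "%s.%d",
				       kobject_name(&dev->kobj),
				       instance++);	/* assumed counter */
		if (!name)
			return -ENOMEM;
		ret = sysfs_create_link(group->devices_kobj, &dev->kobj,
					name);
		kfree(name);
	}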

> [snip]
> > +/**
> > + * iommu_group_remove_device - remove a device from its current group
> > + * @dev: device to be removed
> > + *
> > + * This function is called by an iommu driver to remove the device from
> > + * its current group.  This decrements the iommu group reference count.
> > + */
> > +void iommu_group_remove_device(struct device *dev)
> > +{
> > +	struct iommu_group *group = dev->iommu_group;
> > +	struct iommu_device *device;
> > +
> > +	/* Pre-notify listeners that a device is being removed. */
> > +	blocking_notifier_call_chain(&group->notifier,
> > +				     IOMMU_GROUP_NOTIFY_DEL_DEVICE, dev);
> > +
> > +	mutex_lock(&group->mutex);
> > +	list_for_each_entry(device, &group->devices, list) {
> > +		if (device->dev == dev) {
> > +			list_del(&device->list);
> > +			kfree(device);
> > +			break;
> > +		}
> > +	}
> > +	mutex_unlock(&group->mutex);
> > +
> > +	sysfs_remove_link(group->devices_kobj, kobject_name(&dev->kobj));
> > +	sysfs_remove_link(&dev->kobj, "iommu_group");
> > +
> > +	dev->iommu_group = NULL;
> 
> I suspect the dev -> group pointer should be cleared first, under the
> group lock, but I'm not certain about that.

group->mutex is protecting the group's device list.  I think my
assumption is that when a device is being removed, there should be no
references to it for anyone to race with iommu_group_get(dev), but I'm
not sure how valid that is.
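
For reference, the reordering David suggests would look something like
this (untested sketch of the posted removal path):

	mutex_lock(&group->mutex);
	list_for_each_entry(device, &group->devices, list) {
		if (device->dev == dev) {
			list_del(&device->list);
			kfree(device);
			break;
		}
	}
	dev->iommu_group = NULL;	/* cleared before the lock drops */
	mutex_unlock(&group->mutex);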

> [snip]
> > +/**
> > + * iommu_group_for_each_dev - iterate over each device in the group
> > + * @group: the group
> > + * @data: caller opaque data to be passed to callback function
> > + * @fn: caller supplied callback function
> > + *
> > + * This function is called by group users to iterate over group devices.
> > + * Callers should hold a reference count to the group during callback.
> 
> Probably also worth noting in this doco that the group lock will be
> held across the callback.

Yes; calling iommu_group_remove_device through this would be a bad idea.
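
The extra kernel-doc wording David asks for could be as small as this
(illustrative):

	 * The group->mutex is held across the callback, so the callback
	 * must not call iommu_group_add_device() or
	 * iommu_group_remove_device() on the same group.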

> [snip]
> > +static int iommu_bus_notifier(struct notifier_block *nb,
> > +			      unsigned long action, void *data)
> >  {
> >  	struct device *dev = data;
> > +	struct iommu_ops *ops = dev->bus->iommu_ops;
> > +	struct iommu_group *group;
> > +	unsigned long group_action = 0;
> > +
> > +	/*
> > +	 * ADD/DEL call into iommu driver ops if provided, which may
> > +	 * result in ADD/DEL notifiers to group->notifier
> > +	 */
> > +	if (action == BUS_NOTIFY_ADD_DEVICE) {
> > +		if (ops->add_device)
> > +			return ops->add_device(dev);
> > +	} else if (action == BUS_NOTIFY_DEL_DEVICE) {
> > +		if (ops->remove_device && dev->iommu_group) {
> > +			ops->remove_device(dev);
> > +			return 0;
> > +		}
> > +	}
> 
> So, there's still the question of how to assign grouping for devices
> on a subordinate bus behind a bridge which is iommu managed.  The
> iommu driver for the top-level bus can't know about all possible
> subordinate bus types, but the subordinate devices will (in most
> cases, anyway) be iommu translated as if originating with the bus
> bridge.

Not just any bridge, there has to be a different bus_type on the
subordinate side.  Things like PCI-to-PCI bridges work as is, but a
PCI-to-ISA bridge would trigger this.  In general, I don't know how to
handle it and I'm
open to suggestions.  Perhaps we need to register notifiers for every
bus_type and create notifiers for new bus_types.  If we can catch the
ADD_DEVICE for it, we can walk up to the first parent bus that supports
an iommu_ops and add the device there.  But then we need some kind of
foreign bus support for the group since the iommu driver won't know what
to do with the device if we call add_device with it.  Other ideas?
Thanks,

Alex
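
For concreteness, the parent-bus walk described above might look
something like the sketch below, assuming the ADD_DEVICE notification
can be caught for the subordinate bus_type; the helper name is made up:

	/* Walk up until a parent bus provides iommu_ops; that parent is
	 * the device the IOMMU actually sees for DMA translation. */
	static struct device *iommu_find_translated_parent(struct device *dev)
	{
		struct device *parent;

		for (parent = dev->parent; parent; parent = parent->parent)
			if (parent->bus && parent->bus->iommu_ops)
				return parent;

		return NULL;
	}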



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 10/13] pci: export pci_user functions for use by other drivers
@ 2012-05-14 21:20     ` Bjorn Helgaas
  0 siblings, 0 replies; 129+ messages in thread
From: Bjorn Helgaas @ 2012-05-14 21:20 UTC (permalink / raw)
  To: Alex Williamson
  Cc: benh, aik, david, joerg.roedel, dwmw2, chrisw, agraf, benve,
	aafabbri, B08248, B07421, avi, konrad.wilk, kvm, qemu-devel,
	iommu, linux-pci, linux-kernel, gregkh

> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index dc25da3..b437225 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -770,6 +770,14 @@ static inline int pci_write_config_dword(const struct pci_dev *dev, int where,
>        return pci_bus_write_config_dword(dev->bus, dev->devfn, where, val);
>  }
>
> +/* user-space driven config access */
> +extern int pci_user_read_config_byte(struct pci_dev *dev, int where, u8 *val);
> +extern int pci_user_read_config_word(struct pci_dev *dev, int where, u16 *val);
> +extern int pci_user_read_config_dword(struct pci_dev *dev, int where, u32 *val);
> +extern int pci_user_write_config_byte(struct pci_dev *dev, int where, u8 val);
> +extern int pci_user_write_config_word(struct pci_dev *dev, int where, u16 val);
> +extern int pci_user_write_config_dword(struct pci_dev *dev, int where, u32 val);

If you repost this, can you remove the externs when you move these
declarations?  I know the file's currently a random mix, but we might
as well make a tiny improvement.
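
The extern-less form would simply read:

	/* user-space driven config access */
	int pci_user_read_config_byte(struct pci_dev *dev, int where, u8 *val);
	int pci_user_read_config_word(struct pci_dev *dev, int where, u16 *val);
	int pci_user_read_config_dword(struct pci_dev *dev, int where, u32 *val);
	int pci_user_write_config_byte(struct pci_dev *dev, int where, u8 val);
	int pci_user_write_config_word(struct pci_dev *dev, int where, u16 val);
	int pci_user_write_config_dword(struct pci_dev *dev, int where, u32 val);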

Looks fine to me otherwise, and if you don't have any other reason to
update this series, don't worry about it.

Acked-by: Bjorn Helgaas <bhelgaas@google.com>

>  int __must_check pci_enable_device(struct pci_dev *dev);
>  int __must_check pci_enable_device_io(struct pci_dev *dev);
>  int __must_check pci_enable_device_mem(struct pci_dev *dev);
>

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/13] pci: New pci_acs_enabled()
@ 2012-05-14 22:02     ` Bjorn Helgaas
  0 siblings, 0 replies; 129+ messages in thread
From: Bjorn Helgaas @ 2012-05-14 22:02 UTC (permalink / raw)
  To: Alex Williamson
  Cc: benh, aik, david, joerg.roedel, dwmw2, chrisw, agraf, benve,
	aafabbri, B08248, B07421, avi, konrad.wilk, kvm, qemu-devel,
	iommu, linux-pci, linux-kernel, gregkh

On Fri, May 11, 2012 at 4:56 PM, Alex Williamson
<alex.williamson@redhat.com> wrote:
> In a PCIe environment, transactions aren't always required to
> reach the root bus before being re-routed.  Peer-to-peer DMA
> may actually not be seen by the IOMMU in these cases.  For
> IOMMU groups, we want to provide IOMMU drivers a way to detect
> these restrictions.  Provided with a PCI device, pci_acs_enabled
> returns the furthest downstream device with a complete PCI ACS
> chain.  This information can then be used in grouping to create
> fully isolated groups.  ACS chain logic extracted from libvirt.

The name "pci_acs_enabled()" sounds like it returns a boolean, but it doesn't.

I'm not sure what "a complete PCI ACS chain" means.

The function starts from "dev" and searches *upstream*, so I'm
guessing it returns the root of a subtree that must be contained in a
group.

> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>
>  drivers/pci/pci.c   |   43 +++++++++++++++++++++++++++++++++++++++++++
>  include/linux/pci.h |    1 +
>  2 files changed, 44 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 111569c..d7f05ce 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -2358,6 +2358,49 @@ void pci_enable_acs(struct pci_dev *dev)
>        pci_write_config_word(dev, pos + PCI_ACS_CTRL, ctrl);
>  }
>
> +#define PCI_EXT_CAP_ACS_ENABLED                (PCI_ACS_SV | PCI_ACS_RR | \
> +                                        PCI_ACS_CR | PCI_ACS_UF)
> +
> +/**
> + * pci_acs_enabled - test ACS support in downstream chain
> + * @dev: starting PCI device
> + *
> + * Returns the furthest downstream device with an unbroken ACS chain.  If
> + * ACS is enabled throughout the chain, the returned device is the same as
> + * the one passed in.
> + */
> +struct pci_dev *pci_acs_enabled(struct pci_dev *dev)
> +{
> +       struct pci_dev *acs_dev;
> +       int pos;
> +       u16 ctrl;
> +
> +       if (!pci_is_root_bus(dev->bus))
> +               acs_dev = pci_acs_enabled(dev->bus->self);
> +       else
> +               return dev;
> +
> +       /* If the chain is already broken, pass on the device */
> +       if (acs_dev != dev->bus->self)
> +               return acs_dev;
> +
> +       if (!pci_is_pcie(dev) || (dev->class >> 8) != PCI_CLASS_BRIDGE_PCI)
> +               return dev;
> +
> +       if (dev->pcie_type != PCI_EXP_TYPE_DOWNSTREAM)
> +               return dev;
> +
> +       pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ACS);
> +       if (!pos)
> +               return acs_dev;
> +
> +       pci_read_config_word(dev, pos + PCI_ACS_CTRL, &ctrl);
> +       if ((ctrl & PCI_EXT_CAP_ACS_ENABLED) != PCI_EXT_CAP_ACS_ENABLED)
> +               return acs_dev;
> +
> +       return dev;
> +}
> +
>  /**
>  * pci_swizzle_interrupt_pin - swizzle INTx for device behind bridge
>  * @dev: the PCI device
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 9910b5c..dc25da3 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1586,6 +1586,7 @@ static inline bool pci_is_pcie(struct pci_dev *dev)
>  }
>
>  void pci_request_acs(void);
> +struct pci_dev *pci_acs_enabled(struct pci_dev *dev);
>
>
>  #define PCI_VPD_LRDT                   0x80    /* Large Resource Data Type */
>

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/13] pci: New pci_acs_enabled()
@ 2012-05-14 22:49       ` Alex Williamson
  0 siblings, 0 replies; 129+ messages in thread
From: Alex Williamson @ 2012-05-14 22:49 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: benh, aik, david, joerg.roedel, dwmw2, chrisw, agraf, benve,
	aafabbri, B08248, B07421, avi, konrad.wilk, kvm, qemu-devel,
	iommu, linux-pci, linux-kernel, gregkh

On Mon, 2012-05-14 at 16:02 -0600, Bjorn Helgaas wrote:
> On Fri, May 11, 2012 at 4:56 PM, Alex Williamson
> <alex.williamson@redhat.com> wrote:
> > In a PCIe environment, transactions aren't always required to
> > reach the root bus before being re-routed.  Peer-to-peer DMA
> > may actually not be seen by the IOMMU in these cases.  For
> > IOMMU groups, we want to provide IOMMU drivers a way to detect
> > these restrictions.  Provided with a PCI device, pci_acs_enabled
> > returns the furthest downstream device with a complete PCI ACS
> > chain.  This information can then be used in grouping to create
> > fully isolated groups.  ACS chain logic extracted from libvirt.
> 
> The name "pci_acs_enabled()" sounds like it returns a boolean, but it doesn't.

Right, maybe this should be:

struct pci_dev *pci_find_upstream_acs(struct pci_dev *pdev);

> I'm not sure what "a complete PCI ACS chain" means.
> 
> The function starts from "dev" and searches *upstream*, so I'm
> guessing it returns the root of a subtree that must be contained in a
> group.

Any intermediate switch between an endpoint and the root bus can
redirect a DMA access without IOMMU translation, so we're looking for
the furthest upstream device for which ACS is enabled all the way up to
the root bus.  I'll fix the function name and comments/commit log if
that makes it sufficiently clear.
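
Roughly, the walk I have in mind is something like the sketch below
(untested; it flattens the recursion, only handles the PCIe downstream
port case, and glosses over exactly which device to hand back at a
break):

struct pci_dev *pci_find_upstream_acs(struct pci_dev *pdev)
{
        struct pci_dev *acs_dev = pdev;
        struct pci_dev *dev;

        /*
         * Walk from the device towards the root bus.  A PCIe downstream
         * port without the full set of ACS flags may redirect DMA from
         * below it peer-to-peer, so the isolation boundary has to widen
         * to include everything below that port's switch.
         */
        for (dev = pdev; !pci_is_root_bus(dev->bus); dev = dev->bus->self) {
                struct pci_dev *bridge = dev->bus->self;
                int pos;
                u16 ctrl;

                if (!pci_is_pcie(bridge) ||
                    bridge->pcie_type != PCI_EXP_TYPE_DOWNSTREAM)
                        continue;

                pos = pci_find_ext_capability(bridge, PCI_EXT_CAP_ID_ACS);
                if (pos) {
                        pci_read_config_word(bridge, pos + PCI_ACS_CTRL, &ctrl);
                        if ((ctrl & PCI_EXT_CAP_ACS_ENABLED) ==
                            PCI_EXT_CAP_ACS_ENABLED)
                                continue;
                }

                /* Broken chain; the break furthest upstream wins */
                acs_dev = bridge;
        }

        return acs_dev;
}

Thanks,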

Alex

> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> >
> >  drivers/pci/pci.c   |   43 +++++++++++++++++++++++++++++++++++++++++++
> >  include/linux/pci.h |    1 +
> >  2 files changed, 44 insertions(+), 0 deletions(-)
> >
> > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > index 111569c..d7f05ce 100644
> > --- a/drivers/pci/pci.c
> > +++ b/drivers/pci/pci.c
> > @@ -2358,6 +2358,49 @@ void pci_enable_acs(struct pci_dev *dev)
> >        pci_write_config_word(dev, pos + PCI_ACS_CTRL, ctrl);
> >  }
> >
> > +#define PCI_EXT_CAP_ACS_ENABLED                (PCI_ACS_SV | PCI_ACS_RR | \
> > +                                        PCI_ACS_CR | PCI_ACS_UF)
> > +
> > +/**
> > + * pci_acs_enabled - test ACS support in downstream chain
> > + * @dev: starting PCI device
> > + *
> > + * Returns the furthest downstream device with an unbroken ACS chain.  If
> > + * ACS is enabled throughout the chain, the returned device is the same as
> > + * the one passed in.
> > + */
> > +struct pci_dev *pci_acs_enabled(struct pci_dev *dev)
> > +{
> > +       struct pci_dev *acs_dev;
> > +       int pos;
> > +       u16 ctrl;
> > +
> > +       if (!pci_is_root_bus(dev->bus))
> > +               acs_dev = pci_acs_enabled(dev->bus->self);
> > +       else
> > +               return dev;
> > +
> > +       /* If the chain is already broken, pass on the device */
> > +       if (acs_dev != dev->bus->self)
> > +               return acs_dev;
> > +
> > +       if (!pci_is_pcie(dev) || (dev->class >> 8) != PCI_CLASS_BRIDGE_PCI)
> > +               return dev;
> > +
> > +       if (dev->pcie_type != PCI_EXP_TYPE_DOWNSTREAM)
> > +               return dev;
> > +
> > +       pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ACS);
> > +       if (!pos)
> > +               return acs_dev;
> > +
> > +       pci_read_config_word(dev, pos + PCI_ACS_CTRL, &ctrl);
> > +       if ((ctrl & PCI_EXT_CAP_ACS_ENABLED) != PCI_EXT_CAP_ACS_ENABLED)
> > +               return acs_dev;
> > +
> > +       return dev;
> > +}
> > +
> >  /**
> >  * pci_swizzle_interrupt_pin - swizzle INTx for device behind bridge
> >  * @dev: the PCI device
> > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > index 9910b5c..dc25da3 100644
> > --- a/include/linux/pci.h
> > +++ b/include/linux/pci.h
> > @@ -1586,6 +1586,7 @@ static inline bool pci_is_pcie(struct pci_dev *dev)
> >  }
> >
> >  void pci_request_acs(void);
> > +struct pci_dev *pci_acs_enabled(struct pci_dev *dev);
> >
> >
> >  #define PCI_VPD_LRDT                   0x80    /* Large Resource Data Type */
> >

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 02/13] iommu: IOMMU Groups
@ 2012-05-15  2:03         ` David Gibson
  0 siblings, 0 replies; 129+ messages in thread
From: David Gibson @ 2012-05-15  2:03 UTC (permalink / raw)
  To: Alex Williamson
  Cc: benh, aik, joerg.roedel, dwmw2, chrisw, agraf, benve, aafabbri,
	B08248, B07421, avi, konrad.wilk, kvm, qemu-devel, iommu,
	linux-pci, linux-kernel, gregkh, bhelgaas

On Mon, May 14, 2012 at 11:11:42AM -0600, Alex Williamson wrote:
> On Mon, 2012-05-14 at 11:16 +1000, David Gibson wrote:
> > On Fri, May 11, 2012 at 04:55:41PM -0600, Alex Williamson wrote:
[snip]
> > > +struct iommu_group {
> > > +	struct kobject kobj;
> > > +	struct kobject *devices_kobj;
> > > +	struct list_head devices;
> > > +	struct mutex mutex;
> > > +	struct blocking_notifier_head notifier;
> > > +	int id;
> > 
> > I think you should add some sort of name string to the group as well
> > (supplied by the iommu driver creating the group).  That would make it
> > easier to connect the arbitrary assigned IDs to any platform-standard
> > naming convention for these things.
> 
> When would the name be used and how is it exposed?

I'm thinking of this basically as a debugging aid.  So I'd expect it
to appear in a 'name' (or 'description') sysfs property on the group,
and in printk messages regarding the group.
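
Something like this, say (sketch only; the 'name' field is the addition
I'm suggesting, not anything in your current patch, and it assumes the
group ktype uses the stock kobj_sysfs_ops so that kobj_attribute works):

static ssize_t iommu_group_name_show(struct kobject *kobj,
                                     struct kobj_attribute *attr, char *buf)
{
        struct iommu_group *group = container_of(kobj, struct iommu_group,
                                                 kobj);

        return sprintf(buf, "%s\n", group->name ? group->name : "");
}

static struct kobj_attribute iommu_group_attr_name =
        __ATTR(name, S_IRUGO, iommu_group_name_show, NULL);

/* ...then in the group creation path: */
sysfs_create_file(&group->kobj, &iommu_group_attr_name.attr);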

[snip]
> > So, it's not clear that the kobject_name() here has to be unique
> > across all devices in the group.  It might be better to use an
> > arbitrary index here instead of a name to avoid that problem.
> 
> Hmm, that loses useful convenience when they are unique, such as on PCI.
> I'll look and see if sysfs_create_link will fail on duplicate names and
> see about adding some kind of instance to it.

Ok.  Is the name necessarily unique even for PCI, if the group crosses
multiple domains?

[snip]
> > > +	mutex_lock(&group->mutex);
> > > +	list_for_each_entry(device, &group->devices, list) {
> > > +		if (device->dev == dev) {
> > > +			list_del(&device->list);
> > > +			kfree(device);
> > > +			break;
> > > +		}
> > > +	}
> > > +	mutex_unlock(&group->mutex);
> > > +
> > > +	sysfs_remove_link(group->devices_kobj, kobject_name(&dev->kobj));
> > > +	sysfs_remove_link(&dev->kobj, "iommu_group");
> > > +
> > > +	dev->iommu_group = NULL;
> > 
> > I suspect the dev -> group pointer should be cleared first, under the
> > group lock, but I'm not certain about that.
> 
> group->mutex is protecting the group's device list.  I think my
> assumption is that when a device is being removed, there should be no
> references to it for anyone to race with iommu_group_get(dev), but I'm
> not sure how valid that is.

What I'm concerned about here is someone grabbing the device by
non-group-related means, grabbing a pointer to its group and that
racing with remove_device().  It would then end up with a group
pointer it thinks is right for the device, when the group no longer
thinks it owns the device.

Doing it under the lock is so that on the other side, group aware code
doesn't traverse the group member list and grab a reference to a
device which no longer points back to the group.
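
i.e. roughly this reshuffle of your snippet (sketch):

        mutex_lock(&group->mutex);
        dev->iommu_group = NULL;        /* clear before dropping the lock */
        list_for_each_entry(device, &group->devices, list) {
                if (device->dev == dev) {
                        list_del(&device->list);
                        kfree(device);
                        break;
                }
        }
        mutex_unlock(&group->mutex);

        sysfs_remove_link(group->devices_kobj, kobject_name(&dev->kobj));
        sysfs_remove_link(&dev->kobj, "iommu_group");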

> > [snip]
> > > +/**
> > > + * iommu_group_for_each_dev - iterate over each device in the group
> > > + * @group: the group
> > > + * @data: caller opaque data to be passed to callback function
> > > + * @fn: caller supplied callback function
> > > + *
> > > + * This function is called by group users to iterate over group devices.
> > > + * Callers should hold a reference count to the group during
> > > callback.
> > 
> > Probably also worth noting in this doco that the group lock will be
> > held across the callback.
> 
> Yes; calling iommu_group_remove_device through this would be a bad idea.

Or anything which blocks.

> > [snip]
> > > +static int iommu_bus_notifier(struct notifier_block *nb,
> > > +			      unsigned long action, void *data)
> > >  {
> > >  	struct device *dev = data;
> > > +	struct iommu_ops *ops = dev->bus->iommu_ops;
> > > +	struct iommu_group *group;
> > > +	unsigned long group_action = 0;
> > > +
> > > +	/*
> > > +	 * ADD/DEL call into iommu driver ops if provided, which may
> > > +	 * result in ADD/DEL notifiers to group->notifier
> > > +	 */
> > > +	if (action == BUS_NOTIFY_ADD_DEVICE) {
> > > +		if (ops->add_device)
> > > +			return ops->add_device(dev);
> > > +	} else if (action == BUS_NOTIFY_DEL_DEVICE) {
> > > +		if (ops->remove_device && dev->iommu_group) {
> > > +			ops->remove_device(dev);
> > > +			return 0;
> > > +		}
> > > +	}
> > 
> > So, there's still the question of how to assign grouping for devices
> > on a subordinate bus behind a bridge which is iommu managed.  The
> > iommu driver for the top-level bus can't know about all possible
> > subordinate bus types, but the subordinate devices will (in most
> > cases, anyway) be iommu translated as if originating with the bus
> > bridge.
> 
> Not just any bridge, there has to be a different bus_type on the
> subordinate side.  Things like PCI-to-PCI work as is, but a PCI-to-ISA
> would trigger this.

Right, although ISA-under-PCI is a bit of a special case anyway.  I
think PCI to Firewire/IEEE1394 would also have this issue, as would
SoC-bus-to-PCI for a SoC which had an IOMMU at the SoC bus level.  And
virtual struct devices where the PCI driver is structured as a wrapper
around a "vanilla" device driver, a pattern used in a number of
drivers for chips with both PCI and non-PCI variants.

>  In general, I don't know how to handle it and I'm
> open to suggestions.  Perhaps we need to register notifiers for every
> bus_type and create notifiers for new bus_types.

I think we do need to do something like that and basically ensure that
if a new device is not explicitly claimed by an IOMMU driver, it
inherits its group from its parent bridge.
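
As a rough sketch (assuming a fallback notifier registered on the
subordinate bus_type, and the iommu_group_add_device() from this
series):

static int iommu_subordinate_notifier(struct notifier_block *nb,
                                      unsigned long action, void *data)
{
        struct device *dev = data;
        struct iommu_group *group;

        if (action != BUS_NOTIFY_ADD_DEVICE || dev->iommu_group)
                return 0;

        /* Not claimed by an IOMMU driver; inherit the bridge's group */
        group = dev->parent ? dev->parent->iommu_group : NULL;
        if (group)
                return iommu_group_add_device(group, dev);

        return 0;
}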

>  If we can catch the
> ADD_DEVICE for it, we can walk up to the first parent bus that supports
> an iommu_ops and add the device there.  But then we need some kind of
> foreign bus support for the group since the iommu driver won't know what
> to do with the device if we call add_device with it.  Other ideas?

Hrm, yeah.  We may need a distinction between primary group members
(i.e. explicitly claimed by the IOMMU driver) and subordinate group
members (devices which are descendants of a primary group member).

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 02/13] iommu: IOMMU Groups
@ 2012-05-15  6:34           ` Alex Williamson
  0 siblings, 0 replies; 129+ messages in thread
From: Alex Williamson @ 2012-05-15  6:34 UTC (permalink / raw)
  To: David Gibson
  Cc: benh, aik, joerg.roedel, dwmw2, chrisw, agraf, benve, aafabbri,
	B08248, B07421, avi, konrad.wilk, kvm, qemu-devel, iommu,
	linux-pci, linux-kernel, gregkh, bhelgaas

On Tue, 2012-05-15 at 12:03 +1000, David Gibson wrote:
> On Mon, May 14, 2012 at 11:11:42AM -0600, Alex Williamson wrote:
> > On Mon, 2012-05-14 at 11:16 +1000, David Gibson wrote:
> > > On Fri, May 11, 2012 at 04:55:41PM -0600, Alex Williamson wrote:
> [snip]
> > > > +struct iommu_group {
> > > > +	struct kobject kobj;
> > > > +	struct kobject *devices_kobj;
> > > > +	struct list_head devices;
> > > > +	struct mutex mutex;
> > > > +	struct blocking_notifier_head notifier;
> > > > +	int id;
> > > 
> > > I think you should add some sort of name string to the group as well
> > > (supplied by the iommu driver creating the group).  That would make it
> > > easier to connect the arbitrary assigned IDs to any platform-standard
> > > naming convention for these things.
> > 
> > When would the name be used and how is it exposed?
> 
> I'm thinking of this basically as a debugging aid.  So I'd expect it
> to appear in a 'name' (or 'description') sysfs property on the group,
> and in printk messages regarding the group.

Ok, so long as it's only descriptive/debugging I don't have a problem
adding something like that.

> [snip]
> > > So, it's not clear that the kobject_name() here has to be unique
> > > across all devices in the group.  It might be better to use an
> > > arbitrary index here instead of a name to avoid that problem.
> > 
> > Hmm, that loses useful convenience when they are unique, such as on PCI.
> > I'll look and see if sysfs_create_link will fail on duplicate names and
> > see about adding some kind of instance to it.
> 
> Ok.  Is the name necessarily unique even for PCI, if the group crosses
> multiple domains?

Yes, it includes the domain in the dddd:bb:dd.f form.  I've found I can
just use sysfs_create_link_nowarn and add a .# index when we have a name
collision.
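
i.e. something like this in the add path (rough sketch):

        const char *name = kobject_name(&dev->kobj);
        int i = 0, ret;

        ret = sysfs_create_link_nowarn(group->devices_kobj, &dev->kobj, name);
        while (ret == -EEXIST) {
                char *alt = kasprintf(GFP_KERNEL, "%s.%d", name, i++);

                if (!alt)
                        return -ENOMEM;
                ret = sysfs_create_link_nowarn(group->devices_kobj,
                                               &dev->kobj, alt);
                kfree(alt);     /* sysfs keeps its own copy of the name */
        }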

> [snip]
> > > > +	mutex_lock(&group->mutex);
> > > > +	list_for_each_entry(device, &group->devices, list) {
> > > > +		if (device->dev == dev) {
> > > > +			list_del(&device->list);
> > > > +			kfree(device);
> > > > +			break;
> > > > +		}
> > > > +	}
> > > > +	mutex_unlock(&group->mutex);
> > > > +
> > > > +	sysfs_remove_link(group->devices_kobj, kobject_name(&dev->kobj));
> > > > +	sysfs_remove_link(&dev->kobj, "iommu_group");
> > > > +
> > > > +	dev->iommu_group = NULL;
> > > 
> > > I suspect the dev -> group pointer should be cleared first, under the
> > > group lock, but I'm not certain about that.
> > 
> > group->mutex is protecting the group's device list.  I think my
> > assumption is that when a device is being removed, there should be no
> > references to it for anyone to race with iommu_group_get(dev), but I'm
> > not sure how valid that is.
> 
> What I'm concerned about here is someone grabbing the device by
> non-group-related means, grabbing a pointer to its group and that
> racing with remove_device().  It would then end up with a group
> pointer it thinks is right for the device, when the group no longer
> thinks it owns the device.
> 
> Doing it under the lock is so that on the other side, group aware code
> doesn't traverse the group member list and grab a reference to a
> device which no longer points back to the group.

Our for_each function does grab the lock, as you noticed below, so
removing it from the list under lock prevents that path.  Where it gets
fuzzy is if someone can call iommu_group_get(dev) to get a group
reference in this gap.  Whether we clear the iommu_group pointer under
lock or not doesn't matter for that path since it doesn't retrieve it
under lock.  The assumption there is that the caller is going to have a
reference to the device and therefore the device is not being removed.
The asynchronous locking and reference counting are by far the hardest
part of iommu_groups and the vfio core, so I appreciate any hard
analysis of that.

> > > [snip]
> > > > +/**
> > > > + * iommu_group_for_each_dev - iterate over each device in the group
> > > > + * @group: the group
> > > > + * @data: caller opaque data to be passed to callback function
> > > > + * @fn: caller supplied callback function
> > > > + *
> > > > + * This function is called by group users to iterate over group devices.
> > > > + * Callers should hold a reference count to the group during
> > > > callback.
> > > 
> > > Probably also worth noting in this doco that the group lock will be
> > > held across the callback.
> > 
> > Yes; calling iommu_group_remove_device through this would be a bad idea.
> 
> Or anything which blocks.
> 
> > > [snip]
> > > > +static int iommu_bus_notifier(struct notifier_block *nb,
> > > > +			      unsigned long action, void *data)
> > > >  {
> > > >  	struct device *dev = data;
> > > > +	struct iommu_ops *ops = dev->bus->iommu_ops;
> > > > +	struct iommu_group *group;
> > > > +	unsigned long group_action = 0;
> > > > +
> > > > +	/*
> > > > +	 * ADD/DEL call into iommu driver ops if provided, which may
> > > > +	 * result in ADD/DEL notifiers to group->notifier
> > > > +	 */
> > > > +	if (action == BUS_NOTIFY_ADD_DEVICE) {
> > > > +		if (ops->add_device)
> > > > +			return ops->add_device(dev);
> > > > +	} else if (action == BUS_NOTIFY_DEL_DEVICE) {
> > > > +		if (ops->remove_device && dev->iommu_group) {
> > > > +			ops->remove_device(dev);
> > > > +			return 0;
> > > > +		}
> > > > +	}
> > > 
> > > So, there's still the question of how to assign grouping for devices
> > > on a subordinate bus behind a bridge which is iommu managed.  The
> > > iommu driver for the top-level bus can't know about all possible
> > > subordinate bus types, but the subordinate devices will (in most
> > > cases, anyway) be iommu translated as if originating with the bus
> > > bridge.
> > 
> > Not just any bridge, there has to be a different bus_type on the
> > subordinate side.  Things like PCI-to-PCI work as is, but a PCI-to-ISA
> > would trigger this.
> 
> Right, although ISA-under-PCI is a bit of a special case anyway.  I
> think PCI to Firewire/IEEE1394 would also have this issue, as would
> SoC-bus-to-PCI for a SoC which had an IOMMU at the SoC bus level.  And
> virtual struct devices where the PCI driver is structured as a wrapper
> around a "vanilla" device driver, a pattern used in a number of
> drivers for chips with both PCI and non PCI variants.

Sorry, I jumped into reliving this issue without remembering how I
decided to rationalize it for IOMMU groups.  Let's step through it.
Given DeviceA that's a member of GroupA and potentially sources a
subordinate bus (A_bus_type) exposing DeviceA', what are the issues?
From a VFIO perspective, GroupA isn't usable so long as DeviceA is
claimed by a non-VFIO driver.  That same non-VFIO driver is the one
causing DeviceA to source A_bus_type, so remove the driver and DeviceA'
goes away and we can freely give GroupA to userspace.  I believe this is
always true; there are no subordinate buses to devices that meet the
"viable" driver requirements of VFIO.  I don't see any problems with the
fact that userspace can then re-source A_bus_type and find DeviceA'.
That's what should happen.  If we want to assign just DeviceA' to
userspace, well, it has no IOMMU group of its own, so clearly it's not
assignable on its own.

For the more general IOMMU group case, I'm still struggling to figure
out why this is an issue.  If we were to do dma_ops via IOMMU groups, I
don't think it's unreasonable that map_page would discover there's no
iommu_ops on dev->bus (A_bus_type) and step up to dev->bus->self to find
both iommu_group on DeviceA and iommu_ops on DeviceA->bus.  Is there a
practical reason why DeviceA' would need to be listed as a member of
GroupA, or is it just an optimization?  I know we had a number of
discussions about these types of devices for isolation groups, but I
think that trying to strictly represent these types of devices was also
one of the downfalls of the isolation proposal.
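
For illustration, that lookup might do something like the sketch below
(using dev->parent for the walk, since a generic struct device has no
bus->self):

static struct iommu_ops *iommu_ops_for_dev(struct device **devp)
{
        struct device *dev = *devp;

        /*
         * No iommu_ops on this bus: DMA from a subordinate bus is
         * translated as if it came from the bridge, so walk towards
         * the root until we hit a bus the IOMMU driver knows about.
         */
        while (dev && (!dev->bus || !dev->bus->iommu_ops))
                dev = dev->parent;

        if (dev)
                *devp = dev;    /* and use this device's iommu_group */

        return dev ? dev->bus->iommu_ops : NULL;
}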

This did make me think of one other generic quirk we might need.
There's some funkiness with USB that makes me think that it's
effectively a shared bus between 1.x and 2.0 controllers.  So assigning
the 1.x and 2.0 controllers to different groups potentially allows a
fast and a slow path to the same set of devices.  Is this true?  If so,
we probably need to quirk OHCI/UHCI and EHCI functions together when
they're on the same PCI device.  I think the PCI class code is
sufficient to do this.  Thanks,
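
Something like this, matching on the PCI_CLASS_SERIAL_USB base/sub-class
and the slot (sketch; the helper name is made up):

static bool pci_usb_companion_quirk(struct pci_dev *a, struct pci_dev *b)
{
        /*
         * OHCI/UHCI companion functions share the physical USB bus with
         * the EHCI function on the same device, so keep them in one
         * group.
         */
        return (a->class >> 8) == PCI_CLASS_SERIAL_USB &&
               (b->class >> 8) == PCI_CLASS_SERIAL_USB &&
               a->bus == b->bus &&
               PCI_SLOT(a->devfn) == PCI_SLOT(b->devfn);
}

Thanks,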

Alex


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 02/13] iommu: IOMMU Groups
@ 2012-05-15  6:34           ` Alex Williamson
  0 siblings, 0 replies; 129+ messages in thread
From: Alex Williamson @ 2012-05-15  6:34 UTC (permalink / raw)
  To: David Gibson
  Cc: kvm-u79uwXL29TY76Z2rM5mHXA, B07421-KZfg59tc24xl57MIdRCFDg,
	aik-sLpHqDYs0B2HXe+LvDLADg,
	benh-XVmvHMARGAS8U2dJNN8I7kB+6BGkLq7r,
	linux-pci-u79uwXL29TY76Z2rM5mHXA, agraf-l3A5Bk7waGM,
	qemu-devel-qX2TKyscuCcdnm+yROfE0A, chrisw-69jw2NvuJkxg9hUCZPvPmw,
	B08248-KZfg59tc24xl57MIdRCFDg,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
	avi-H+wXaHxf7aLQT0dZR+AlfA, bhelgaas-hpIqsD4AKlfQT0dZR+AlfA,
	dwmw2-wEGCiKHe2LqWVfeAwA7xHQ,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	benve-FYB4Gu1CFyUAvxtiuMwx3w

On Tue, 2012-05-15 at 12:03 +1000, David Gibson wrote:
> On Mon, May 14, 2012 at 11:11:42AM -0600, Alex Williamson wrote:
> > On Mon, 2012-05-14 at 11:16 +1000, David Gibson wrote:
> > > On Fri, May 11, 2012 at 04:55:41PM -0600, Alex Williamson wrote:
> [snip]
> > > > +struct iommu_group {
> > > > +	struct kobject kobj;
> > > > +	struct kobject *devices_kobj;
> > > > +	struct list_head devices;
> > > > +	struct mutex mutex;
> > > > +	struct blocking_notifier_head notifier;
> > > > +	int id;
> > > 
> > > I think you should add some sort of name string to the group as well
> > > (supplied by the iommu driver creating the group).  That would make it
> > > easier to connect the arbitrary assigned IDs to any platform-standard
> > > naming convention for these things.
> > 
> > When would the name be used and how is it exposed?
> 
> I'm thinking of this basically as a debugging aid.  So I'd expect it
> to appear in a 'name' (or 'description') sysfs property on the group,
> and in printk messages regarding the group.

Ok, so long as it's only descriptive/debugging I don't have a problem
adding something like that.

> [snip]
> > > So, it's not clear that the kobject_name() here has to be unique
> > > across all devices in the group.  It might be better to use an
> > > arbitrary index here instead of a name to avoid that problem.
> > 
> > Hmm, that loses useful convenience when they are unique, such as on PCI.
> > I'll look and see if sysfs_create_link will fail on duplicate names and
> > see about adding some kind of instance to it.
> 
> Ok.  Is the name necessarily unique even for PCI, if the group crosses
> multiple domains?

Yes, it includes the domain in the dddd:bb:dd.f form.  I've found I can
just use sysfs_create_link_nowarn and add a .# index when we have a name
collision.

> [snip]
> > > > +	mutex_lock(&group->mutex);
> > > > +	list_for_each_entry(device, &group->devices, list) {
> > > > +		if (device->dev == dev) {
> > > > +			list_del(&device->list);
> > > > +			kfree(device);
> > > > +			break;
> > > > +		}
> > > > +	}
> > > > +	mutex_unlock(&group->mutex);
> > > > +
> > > > +	sysfs_remove_link(group->devices_kobj, kobject_name(&dev->kobj));
> > > > +	sysfs_remove_link(&dev->kobj, "iommu_group");
> > > > +
> > > > +	dev->iommu_group = NULL;
> > > 
> > > I suspect the dev -> group pointer should be cleared first, under the
> > > group lock, but I'm not certain about that.
> > 
> > group->mutex is protecting the group's device list.  I think my
> > assumption is that when a device is being removed, there should be no
> > references to it for anyone to race with iommu_group_get(dev), but I'm
> > not sure how valid that is.
> 
> What I'm concerned about here is someone grabbing the device by
> non-group-related means, grabbing a pointer to its group and that
> racing with remove_device().  It would then end up with a group
> pointer it thinks is right for the device, when the group no longer
> thinks it owns the device.
> 
> Doing it under the lock is so that on the other side, group aware code
> doesn't traverse the group member list and grab a reference to a
> device which no longer points back to the group.

Our for_each function does grab the lock, as you noticed below, so
removing it from the list under lock prevents that path.  Where it gets
fuzzy is if someone can call iommu_group_get(dev) to get a group
reference in this gap.  Whether we clear the iommu_group pointer under
lock or not doesn't matter for that path since it doesn't retrieve it
under lock.  The assumption there is that the caller is going to have a
reference to the device and therefore the device is not being removed.
The asynchronous locking and reference counting is by far the hardest
part of iommu_groups and vfio core, so appreciate any hard analysis of
that.

> > > [snip]
> > > > +/**
> > > > + * iommu_group_for_each_dev - iterate over each device in the group
> > > > + * @group: the group
> > > > + * @data: caller opaque data to be passed to callback function
> > > > + * @fn: caller supplied callback function
> > > > + *
> > > > + * This function is called by group users to iterate over group devices.
> > > > + * Callers should hold a reference count to the group during
> > > > callback.
> > > 
> > > Probably also worth noting in this doco that the group lock will be
> > > held across the callback.
> > 
> > Yes; calling iommu_group_remove_device through this would be a bad idea.
> 
> Or anything which blocks.
> 
> > > [snip]
> > > > +static int iommu_bus_notifier(struct notifier_block *nb,
> > > > +			      unsigned long action, void *data)
> > > >  {
> > > >  	struct device *dev = data;
> > > > +	struct iommu_ops *ops = dev->bus->iommu_ops;
> > > > +	struct iommu_group *group;
> > > > +	unsigned long group_action = 0;
> > > > +
> > > > +	/*
> > > > +	 * ADD/DEL call into iommu driver ops if provided, which may
> > > > +	 * result in ADD/DEL notifiers to group->notifier
> > > > +	 */
> > > > +	if (action == BUS_NOTIFY_ADD_DEVICE) {
> > > > +		if (ops->add_device)
> > > > +			return ops->add_device(dev);
> > > > +	} else if (action == BUS_NOTIFY_DEL_DEVICE) {
> > > > +		if (ops->remove_device && dev->iommu_group) {
> > > > +			ops->remove_device(dev);
> > > > +			return 0;
> > > > +		}
> > > > +	}
> > > 
> > > So, there's still the question of how to assign grouping for devices
> > > on a subordinate bus behind a bridge which is iommu managed.  The
> > > iommu driver for the top-level bus can't know about all possible
> > > subordinate bus types, but the subordinate devices will (in most
> > > cases, anyway) be iommu translated as if originating with the bus
> > > bridge.
> > 
> > Not just any bridge, there has to be a different bus_type on the
> > subordinate side.  Things like PCI-to-PCI work as is, but a PCI-to-ISA
> > would trigger this.
> 
> Right, although ISA-under-PCI is a bit of a special case anyway.  I
> think PCI to Firewire/IEEE1394 would also have this issue, as would
> SoC-bus-to-PCI for a SoC which had an IOMMU at the SoC bus level.  And
> virtual struct devices where the PCI driver is structured as a wrapper
> around a "vanilla" device driver, a pattern used in a number of
> drivers for chips with both PCI and non PCI variants.

Sorry, I jumped into reliving this issue without remembering how I
decided to rationalize it for IOMMU groups.  Let's step through it.
Given DeviceA that's a member of GroupA and potentially sources a
subordinate bus (A_bus_type) exposing DeviceA', what are the issues?
>From a VFIO perspective, GroupA isn't usable so long as DeviceA is
claimed by a non-VFIO driver.  That same non-VFIO driver is the one
causing DeviceA to source A_bus_type, so remove the driver and DeviceA'
goes away and we can freely give GroupA to userspace.  I believe this is
always true; there are no subordinate buses to devices that meet the
"viable" driver requirements of VFIO.  I don't see any problems with the
fact that userspace can then re-source A_bus_type and find DeviceA'.
That's what should happen.  If we want to assign just DeviceA' to
userspace, well, it has no IOMMU group of it's own, so clearly it's not
assignable on it's own.

For the more general IOMMU group case, I'm still struggling to figure
out why this is an issue.  If we were to do dma_ops via IOMMU groups, I
don't think it's unreasonable that map_page would discover there's no
iommu_ops on dev->bus (A_bus_type) and step up to dev->bus->self to find
both iommu_group on DeviceA and iommu_ops on DeviceA->bus.  Is there a
practical reason why DeviceA' would need to be listed as a member of
GroupA, or is it just an optimization?  I know we had a number of
discussions about these type of devices for isolation groups, but I
think that trying to strictly represent these types of devices was also
one of the downfalls of the isolation proposal.

This did make me think of one other generic quirk we might need.
There's some funkiness with USB that makes me think that it's
effectively a shared bus between 1.x and 2.0 controllers.  So assigning
the 1.x and 2.0 controllers to different groups potentially allows a
fast and a slow path to the same set of devices.  Is this true?  If so,
we probably need to quirk OHCI/UHCI and EHCI functions together when
they're on the same PCI device.  I think the PCI class code is
sufficient to do this.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [Qemu-devel] [PATCH 02/13] iommu: IOMMU Groups
@ 2012-05-15  6:34           ` Alex Williamson
  0 siblings, 0 replies; 129+ messages in thread
From: Alex Williamson @ 2012-05-15  6:34 UTC (permalink / raw)
  To: David Gibson
  Cc: aafabbri, kvm, B07421, aik, konrad.wilk, linux-pci, agraf,
	qemu-devel, chrisw, B08248, iommu, gregkh, avi, joerg.roedel,
	bhelgaas, dwmw2, linux-kernel, benve

On Tue, 2012-05-15 at 12:03 +1000, David Gibson wrote:
> On Mon, May 14, 2012 at 11:11:42AM -0600, Alex Williamson wrote:
> > On Mon, 2012-05-14 at 11:16 +1000, David Gibson wrote:
> > > On Fri, May 11, 2012 at 04:55:41PM -0600, Alex Williamson wrote:
> [snip]
> > > > +struct iommu_group {
> > > > +	struct kobject kobj;
> > > > +	struct kobject *devices_kobj;
> > > > +	struct list_head devices;
> > > > +	struct mutex mutex;
> > > > +	struct blocking_notifier_head notifier;
> > > > +	int id;
> > > 
> > > I think you should add some sort of name string to the group as well
> > > (supplied by the iommu driver creating the group).  That would make it
> > > easier to connect the arbitrary assigned IDs to any platform-standard
> > > naming convention for these things.
> > 
> > When would the name be used and how is it exposed?
> 
> I'm thinking of this basically as a debugging aid.  So I'd expect it
> to appear in a 'name' (or 'description') sysfs property on the group,
> and in printk messages regarding the group.

Ok, so long as it's only descriptive/debugging I don't have a problem
adding something like that.

> [snip]
> > > So, it's not clear that the kobject_name() here has to be unique
> > > across all devices in the group.  It might be better to use an
> > > arbitrary index here instead of a name to avoid that problem.
> > 
> > Hmm, that loses useful convenience when they are unique, such as on PCI.
> > I'll look and see if sysfs_create_link will fail on duplicate names and
> > see about adding some kind of instance to it.
> 
> Ok.  Is the name necessarily unique even for PCI, if the group crosses
> multiple domains?

Yes, it includes the domain in the dddd:bb:dd.f form.  I've found I can
just use sysfs_create_link_nowarn and add a .# index when we have a name
collision.

> [snip]
> > > > +	mutex_lock(&group->mutex);
> > > > +	list_for_each_entry(device, &group->devices, list) {
> > > > +		if (device->dev == dev) {
> > > > +			list_del(&device->list);
> > > > +			kfree(device);
> > > > +			break;
> > > > +		}
> > > > +	}
> > > > +	mutex_unlock(&group->mutex);
> > > > +
> > > > +	sysfs_remove_link(group->devices_kobj, kobject_name(&dev->kobj));
> > > > +	sysfs_remove_link(&dev->kobj, "iommu_group");
> > > > +
> > > > +	dev->iommu_group = NULL;
> > > 
> > > I suspect the dev -> group pointer should be cleared first, under the
> > > group lock, but I'm not certain about that.
> > 
> > group->mutex is protecting the group's device list.  I think my
> > assumption is that when a device is being removed, there should be no
> > references to it for anyone to race with iommu_group_get(dev), but I'm
> > not sure how valid that is.
> 
> What I'm concerned about here is someone grabbing the device by
> non-group-related means, grabbing a pointer to its group and that
> racing with remove_device().  It would then end up with a group
> pointer it thinks is right for the device, when the group no longer
> thinks it owns the device.
> 
> Doing it under the lock is so that on the other side, group aware code
> doesn't traverse the group member list and grab a reference to a
> device which no longer points back to the group.

Our for_each function does grab the lock, as you noticed below, so
removing it from the list under lock prevents that path.  Where it gets
fuzzy is if someone can call iommu_group_get(dev) to get a group
reference in this gap.  Whether we clear the iommu_group pointer under
lock or not doesn't matter for that path since it doesn't retrieve it
under lock.  The assumption there is that the caller is going to have a
reference to the device and therefore the device is not being removed.
The asynchronous locking and reference counting is by far the hardest
part of iommu_groups and vfio core, so appreciate any hard analysis of
that.

> > > [snip]
> > > > +/**
> > > > + * iommu_group_for_each_dev - iterate over each device in the group
> > > > + * @group: the group
> > > > + * @data: caller opaque data to be passed to callback function
> > > > + * @fn: caller supplied callback function
> > > > + *
> > > > + * This function is called by group users to iterate over group devices.
> > > > + * Callers should hold a reference count to the group during
> > > > callback.
> > > 
> > > Probably also worth noting in this doco that the group lock will be
> > > held across the callback.
> > 
> > Yes; calling iommu_group_remove_device through this would be a bad idea.
> 
> Or anything which blocks.
> 
> > > [snip]
> > > > +static int iommu_bus_notifier(struct notifier_block *nb,
> > > > +			      unsigned long action, void *data)
> > > >  {
> > > >  	struct device *dev = data;
> > > > +	struct iommu_ops *ops = dev->bus->iommu_ops;
> > > > +	struct iommu_group *group;
> > > > +	unsigned long group_action = 0;
> > > > +
> > > > +	/*
> > > > +	 * ADD/DEL call into iommu driver ops if provided, which may
> > > > +	 * result in ADD/DEL notifiers to group->notifier
> > > > +	 */
> > > > +	if (action == BUS_NOTIFY_ADD_DEVICE) {
> > > > +		if (ops->add_device)
> > > > +			return ops->add_device(dev);
> > > > +	} else if (action == BUS_NOTIFY_DEL_DEVICE) {
> > > > +		if (ops->remove_device && dev->iommu_group) {
> > > > +			ops->remove_device(dev);
> > > > +			return 0;
> > > > +		}
> > > > +	}
> > > 
> > > So, there's still the question of how to assign grouping for devices
> > > on a subordinate bus behind a bridge which is iommu managed.  The
> > > iommu driver for the top-level bus can't know about all possible
> > > subordinate bus types, but the subordinate devices will (in most
> > > cases, anyway) be iommu translated as if originating with the bus
> > > bridge.
> > 
> > Not just any bridge, there has to be a different bus_type on the
> > subordinate side.  Things like PCI-to-PCI work as is, but a PCI-to-ISA
> > would trigger this.
> 
> Right, although ISA-under-PCI is a bit of a special case anyway.  I
> think PCI to Firewire/IEEE1394 would also have this issue, as would
> SoC-bus-to-PCI for a SoC which had an IOMMU at the SoC bus level.  And
> virtual struct devices where the PCI driver is structured as a wrapper
> around a "vanilla" device driver, a pattern used in a number of
> drivers for chips with both PCI and non PCI variants.

Sorry, I jumped into reliving this issue without remembering how I
decided to rationalize it for IOMMU groups.  Let's step through it.
Given DeviceA that's a member of GroupA and potentially sources a
subordinate bus (A_bus_type) exposing DeviceA', what are the issues?
From a VFIO perspective, GroupA isn't usable so long as DeviceA is
claimed by a non-VFIO driver.  That same non-VFIO driver is the one
causing DeviceA to source A_bus_type, so remove the driver and DeviceA'
goes away and we can freely give GroupA to userspace.  I believe this is
always true; there are no subordinate buses to devices that meet the
"viable" driver requirements of VFIO.  I don't see any problems with the
fact that userspace can then re-source A_bus_type and find DeviceA'.
That's what should happen.  If we want to assign just DeviceA' to
userspace, well, it has no IOMMU group of its own, so clearly it's not
assignable on its own.

For the more general IOMMU group case, I'm still struggling to figure
out why this is an issue.  If we were to do dma_ops via IOMMU groups, I
don't think it's unreasonable that map_page would discover there's no
iommu_ops on dev->bus (A_bus_type) and step up to the parent device to
find both the iommu_group on DeviceA and the iommu_ops on DeviceA's
bus.  Is there a
practical reason why DeviceA' would need to be listed as a member of
GroupA, or is it just an optimization?  I know we had a number of
discussions about these types of devices for isolation groups, but I
think that trying to strictly represent these types of devices was also
one of the downfalls of the isolation proposal.
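
As a sketch, that fallback walk might look like this (the helper is
hypothetical and assumes DeviceA' hangs off DeviceA via dev->parent):

	static struct device *iommu_translated_device(struct device *dev)
	{
		/*
		 * A device on a subordinate bus with no iommu_ops
		 * (DeviceA' on A_bus_type) resolves to the device
		 * sourcing that bus (DeviceA), whose iommu_group and
		 * bus iommu_ops can then serve the dma_ops path.
		 */
		while (dev && (!dev->bus || !dev->bus->iommu_ops))
			dev = dev->parent;

		return dev;
	}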

This did make me think of one other generic quirk we might need.
There's some funkiness with USB that makes me think that it's
effectively a shared bus between 1.x and 2.0 controllers.  So assigning
the 1.x and 2.0 controllers to different groups potentially allows a
fast and a slow path to the same set of devices.  Is this true?  If so,
we probably need to quirk OHCI/UHCI and EHCI functions together when
they're on the same PCI device.  I think the PCI class code is
sufficient to do this.  Thanks,

Alex
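
(A minimal sketch of the class code test suggested above; the helper
name is hypothetical, and only the same-device USB check comes from
the discussion:

	static bool usb_companion_functions(struct pci_dev *a,
					    struct pci_dev *b)
	{
		/*
		 * USB host controller functions of one physical device;
		 * UHCI/OHCI/EHCI differ only in the prog-if byte, which
		 * (class >> 8) deliberately ignores.
		 */
		return a->bus == b->bus &&
		       PCI_SLOT(a->devfn) == PCI_SLOT(b->devfn) &&
		       (a->class >> 8) == PCI_CLASS_SERIAL_USB &&
		       (b->class >> 8) == PCI_CLASS_SERIAL_USB;
	}

Functions matching this test would be quirked into a single group.)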

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/13] pci: New pci_acs_enabled()
@ 2012-05-15 19:56         ` Bjorn Helgaas
  0 siblings, 0 replies; 129+ messages in thread
From: Bjorn Helgaas @ 2012-05-15 19:56 UTC (permalink / raw)
  To: Alex Williamson
  Cc: benh, aik, david, joerg.roedel, dwmw2, chrisw, agraf, benve,
	aafabbri, B08248, B07421, avi, konrad.wilk, kvm, qemu-devel,
	iommu, linux-pci, linux-kernel, gregkh

On Mon, May 14, 2012 at 4:49 PM, Alex Williamson
<alex.williamson@redhat.com> wrote:
> On Mon, 2012-05-14 at 16:02 -0600, Bjorn Helgaas wrote:
>> On Fri, May 11, 2012 at 4:56 PM, Alex Williamson
>> <alex.williamson@redhat.com> wrote:
>> > In a PCIe environment, transactions aren't always required to
>> > reach the root bus before being re-routed.  Peer-to-peer DMA
>> > may actually not be seen by the IOMMU in these cases.  For
>> > IOMMU groups, we want to provide IOMMU drivers a way to detect
>> > these restrictions.  Provided with a PCI device, pci_acs_enabled
>> > returns the furthest downstream device with a complete PCI ACS
>> > chain.  This information can then be used in grouping to create
>> > fully isolated groups.  ACS chain logic extracted from libvirt.
>>
>> The name "pci_acs_enabled()" sounds like it returns a boolean, but it doesn't.
>
> Right, maybe this should be:
>
> struct pci_dev *pci_find_upstream_acs(struct pci_dev *pdev);
>
>> I'm not sure what "a complete PCI ACS chain" means.
>>
>> The function starts from "dev" and searches *upstream*, so I'm
>> guessing it returns the root of a subtree that must be contained in a
>> group.
>
> Any intermediate switch between an endpoint and the root bus can
> redirect a dma access without iommu translation,

Is this "redirection" just the normal PCI bridge forwarding that
allows peer-to-peer transactions, i.e., the rule (from P2P bridge
spec, rev 1.2, sec 4.1) that the bridge apertures define address
ranges that are forwarded from primary to secondary interface, and the
inverse ranges are forwarded from secondary to primary?  For example,
here:

                   ^
                   |
          +--------+-------+
          |                |
   +------+-----+    +-----++-----+
   | Downstream |    | Downstream |
   |    Port    |    |    Port    |
   |   06:05.0  |    |   06:06.0  |
   +------+-----+    +------+-----+
          |                 |
     +----v----+       +----v----+
     | Endpoint|       | Endpoint|
     | 07:00.0 |       | 08:00.0 |
     +---------+       +---------+

that rule is all that's needed for a transaction from 07:00.0 to be
forwarded from upstream to the internal switch bus 06, then claimed by
06:06.0 and forwarded downstream to 08:00.0.  This is plain old PCI,
nothing specific to PCIe.

I don't understand ACS very well, but it looks like it basically
provides ways to prevent that peer-to-peer forwarding, so transactions
would be sent upstream toward the root (and specifically, the IOMMU)
instead of being directly claimed by 06:06.0.
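
(That matches the control bits involved.  As a sketch, the per-port
test looks roughly like this, reusing the PCI_ACS_* flags from the
patch quoted below; the helper name is hypothetical:

	static bool port_acs_blocks_p2p(struct pci_dev *pdev)
	{
		u16 want = PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF;
		int pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
		u16 ctrl;

		if (!pos)
			return false;	/* no ACS capability at all */

		pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
		return (ctrl & want) == want;
	}

Source validation plus request/completion redirect and upstream
forwarding together make the port send peer-to-peer traffic upstream
rather than claiming it directly.)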

> so we're looking for
> the furthest upstream device for which acs is enabled all the way up to
> the root bus.

Correct me if this is wrong: To force device A's DMAs to be processed
by an IOMMU, ACS must be enabled on the root port and every downstream
port along the path to A.

If so, I think you're trying to find out the closest upstream device X
such that everything leading to X has ACS enabled.  Every device below
X can DMA freely to other devices below X, so they would all have to
be in the same isolated group.

I tried to work through some examples to develop some intuition about this:

                              |
     +--------------+---------+------------------------+
     |              |                                  |
+----v-----+   +----v-----+                       +----v-----+
| 00:00.0  |   | 00:01.0  |                       | 00:02.0  |
|   PCI    |   | PCIe-to  |                       | Upstream |
+----------+   |   PCI    |                       +----+-----+
               +----+-----+                            |
                    |                                  |
          +---------+----+              +--------------+--------------+
          |              |              |              |              |
     +----v-----+   +----v-----+   +----v-----+   +----v-----+   +----v-----+
     | 01:00.0  |   | 01:01.0  |   | 02:00.0  |   | 02:01.0  |   | 02:02.0  |
     |   PCI    |   |   PCI    |   |Downstream|   |Downstream|   |Downstream|
     +----------+   +----------+   | w/o ACS  |   |  w/ ACS  |   |  w/ ACS  |
                                   +----+-----+   +----+-----+   +----+-----+
                                        |              |              |
                                   +----v-----+   +----v-----+   +----v-----+
                                   | 03:00.0  |   | 04:00.0  |   | 05:00.0  |
                                   |   PCIe   |   |   PCIe   |   |   PCIe   |
                                   +----------+   | w/o ACS  |   |  w/ ACS  |
                                                  +----------+   +----------+

(00:02.0 and the 02:xx downstream ports form a single PCIe switch.)

pci_acs_enabled(00:00.0) = 00:00.0 (on root bus (but doesn't it matter
if 00:00.0 is PCIe or if RP has ACS?))
pci_acs_enabled(00:01.0) = 00:01.0 (on root bus)
pci_acs_enabled(01:00.0) = 01:00.0 (acs_dev = 00:01.0, 01:00.0 is not
PCIe; seems wrong)
pci_acs_enabled(00:02.0) = 00:02.0 (on root bus; seems wrong if RP
doesn't have ACS)
pci_acs_enabled(02:00.0) = 00:02.0 (acs_dev = 00:02.0, 02:00.0 has no ACS cap)
pci_acs_enabled(03:00.0) = 00:02.0 (acs_dev = 00:02.0)
pci_acs_enabled(02:01.0) = 02:01.0 (acs_dev = 00:02.0, 02:01.0 has ACS enabled)
pci_acs_enabled(04:00.0) = 04:00.0 (acs_dev = 02:01.0, 04:00.0 is not
a bridge; seems wrong if 04:00 is a multi-function device)
pci_acs_enabled(02:02.0) = 02:02.0 (acs_dev = 00:02.0, 02:02.0 has ACS enabled)
pci_acs_enabled(05:00.0) = 05:00.0 (acs_dev = 02:02.0, 05:00.0 is not a bridge)

But it didn't really help.  I still can't develop a mental picture of
what this function does.

>> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>> > ---
>> >
>> >  drivers/pci/pci.c   |   43 +++++++++++++++++++++++++++++++++++++++++++
>> >  include/linux/pci.h |    1 +
>> >  2 files changed, 44 insertions(+), 0 deletions(-)
>> >
>> > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> > index 111569c..d7f05ce 100644
>> > --- a/drivers/pci/pci.c
>> > +++ b/drivers/pci/pci.c
>> > @@ -2358,6 +2358,49 @@ void pci_enable_acs(struct pci_dev *dev)
>> >        pci_write_config_word(dev, pos + PCI_ACS_CTRL, ctrl);
>> >  }
>> >
>> > +#define PCI_EXT_CAP_ACS_ENABLED                (PCI_ACS_SV | PCI_ACS_RR | \
>> > +                                        PCI_ACS_CR | PCI_ACS_UF)
>> > +
>> > +/**
>> > + * pci_acs_enabled - test ACS support in downstream chain
>> > + * @dev: starting PCI device
>> > + *
>> > + * Returns the furthest downstream device with an unbroken ACS chain.  If
>> > + * ACS is enabled throughout the chain, the returned device is the same as
>> > + * the one passed in.
>> > + */
>> > +struct pci_dev *pci_acs_enabled(struct pci_dev *dev)
>> > +{
>> > +       struct pci_dev *acs_dev;
>> > +       int pos;
>> > +       u16 ctrl;
>> > +
>> > +       if (!pci_is_root_bus(dev->bus))
>> > +               acs_dev = pci_acs_enabled(dev->bus->self);
>> > +       else
>> > +               return dev;
>> > +
>> > +       /* If the chain is already broken, pass on the device */
>> > +       if (acs_dev != dev->bus->self)
>> > +               return acs_dev;
>> > +
>> > +       if (!pci_is_pcie(dev) || (dev->class >> 8) != PCI_CLASS_BRIDGE_PCI)
>> > +               return dev;
>> > +
>> > +       if (dev->pcie_type != PCI_EXP_TYPE_DOWNSTREAM)
>> > +               return dev;
>> > +
>> > +       pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ACS);
>> > +       if (!pos)
>> > +               return acs_dev;
>> > +
>> > +       pci_read_config_word(dev, pos + PCI_ACS_CTRL, &ctrl);
>> > +       if ((ctrl & PCI_EXT_CAP_ACS_ENABLED) != PCI_EXT_CAP_ACS_ENABLED)
>> > +               return acs_dev;
>> > +
>> > +       return dev;
>> > +}
>> > +
>> >  /**
>> >  * pci_swizzle_interrupt_pin - swizzle INTx for device behind bridge
>> >  * @dev: the PCI device
>> > diff --git a/include/linux/pci.h b/include/linux/pci.h
>> > index 9910b5c..dc25da3 100644
>> > --- a/include/linux/pci.h
>> > +++ b/include/linux/pci.h
>> > @@ -1586,6 +1586,7 @@ static inline bool pci_is_pcie(struct pci_dev *dev)
>> >  }
>> >
>> >  void pci_request_acs(void);
>> > +struct pci_dev *pci_acs_enabled(struct pci_dev *dev);
>> >
>> >
>> >  #define PCI_VPD_LRDT                   0x80    /* Large Resource Data Type */
>> >

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/13] pci: New pci_acs_enabled()
@ 2012-05-15 20:05           ` Bjorn Helgaas
  0 siblings, 0 replies; 129+ messages in thread
From: Bjorn Helgaas @ 2012-05-15 20:05 UTC (permalink / raw)
  To: Alex Williamson
  Cc: benh, aik, david, joerg.roedel, dwmw2, chrisw, agraf, benve,
	aafabbri, B08248, B07421, avi, konrad.wilk, kvm, qemu-devel,
	iommu, linux-pci, linux-kernel, gregkh

> I tried to work through some examples to develop some intuition about this:

Sorry, gmail inserted line breaks that ruined this picture.  Here's a
URL for it:

http://www.asciiflow.com/#3736558963405980039

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/13] pci: New pci_acs_enabled()
@ 2012-05-15 21:09           ` Alex Williamson
  0 siblings, 0 replies; 129+ messages in thread
From: Alex Williamson @ 2012-05-15 21:09 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: benh, aik, david, joerg.roedel, dwmw2, chrisw, agraf, benve,
	aafabbri, B08248, B07421, avi, konrad.wilk, kvm, qemu-devel,
	iommu, linux-pci, linux-kernel, gregkh

On Tue, 2012-05-15 at 13:56 -0600, Bjorn Helgaas wrote:
> On Mon, May 14, 2012 at 4:49 PM, Alex Williamson
> <alex.williamson@redhat.com> wrote:
> > On Mon, 2012-05-14 at 16:02 -0600, Bjorn Helgaas wrote:
> >> On Fri, May 11, 2012 at 4:56 PM, Alex Williamson
> >> <alex.williamson@redhat.com> wrote:
> >> > In a PCIe environment, transactions aren't always required to
> >> > reach the root bus before being re-routed.  Peer-to-peer DMA
> >> > may actually not be seen by the IOMMU in these cases.  For
> >> > IOMMU groups, we want to provide IOMMU drivers a way to detect
> >> > these restrictions.  Provided with a PCI device, pci_acs_enabled
> >> > returns the furthest downstream device with a complete PCI ACS
> >> > chain.  This information can then be used in grouping to create
> >> > fully isolated groups.  ACS chain logic extracted from libvirt.
> >>
> >> The name "pci_acs_enabled()" sounds like it returns a boolean, but it doesn't.
> >
> > Right, maybe this should be:
> >
> > struct pci_dev *pci_find_upstream_acs(struct pci_dev *pdev);
> >
> >> I'm not sure what "a complete PCI ACS chain" means.
> >>
> >> The function starts from "dev" and searches *upstream*, so I'm
> >> guessing it returns the root of a subtree that must be contained in a
> >> group.
> >
> > Any intermediate switch between an endpoint and the root bus can
> > redirect a dma access without iommu translation,
> 
> Is this "redirection" just the normal PCI bridge forwarding that
> allows peer-to-peer transactions, i.e., the rule (from P2P bridge
> spec, rev 1.2, sec 4.1) that the bridge apertures define address
> ranges that are forwarded from primary to secondary interface, and the
> inverse ranges are forwarded from secondary to primary?  For example,
> here:
> 
>                    ^
>                    |
>           +--------+-------+
>           |                |
>    +------+-----+    +-----++-----+
>    | Downstream |    | Downstream |
>    |    Port    |    |    Port    |
>    |   06:05.0  |    |   06:06.0  |
>    +------+-----+    +------+-----+
>           |                 |
>      +----v----+       +----v----+
>      | Endpoint|       | Endpoint|
>      | 07:00.0 |       | 08:00.0 |
>      +---------+       +---------+
> 
> that rule is all that's needed for a transaction from 07:00.0 to be
> forwarded from upstream to the internal switch bus 06, then claimed by
> 06:06.0 and forwarded downstream to 08:00.0.  This is plain old PCI,
> nothing specific to PCIe.

Right, I think the main difference is the point-to-point nature of
PCIe vs the shared legacy PCI bus.  On a legacy PCI bus there's no way
to prevent devices from talking to each other, but on PCIe the
transaction makes a U-turn at some point and heads back out another
downstream port.  ACS allows us to prevent that from happening.

> I don't understand ACS very well, but it looks like it basically
> provides ways to prevent that peer-to-peer forwarding, so transactions
> would be sent upstream toward the root (and specifically, the IOMMU)
> instead of being directly claimed by 06:06.0.

Yep, that's my meager understanding as well.

> > so we're looking for
> > the furthest upstream device for which acs is enabled all the way up to
> > the root bus.
> 
> Correct me if this is wrong: To force device A's DMAs to be processed
> by an IOMMU, ACS must be enabled on the root port and every downstream
> port along the path to A.

Yes, modulo this comment in libvirt source:

    /* if we have no parent, and this is the root bus, ACS doesn't come
     * into play since devices on the root bus can't P2P without going
     * through the root IOMMU.
     */

So we assume that a redirect at the point of the iommu will factor in
iommu translation.

> If so, I think you're trying to find out the closest upstream device X
> such that everything leading to X has ACS enabled.  Every device below
> X can DMA freely to other devices below X, so they would all have to
> be in the same isolated group.

Yes

> I tried to work through some examples to develop some intuition about this:

(inserting fixed url)
> http://www.asciiflow.com/#3736558963405980039

> pci_acs_enabled(00:00.0) = 00:00.0 (on root bus (but doesn't it matter
> if 00:00.0 is PCIe or if RP has ACS?))

Hmm, the latter is the assumption above.  For the former, I think
libvirt was probably assuming that PCI devices must have a PCIe device
upstream from them because x86 doesn't have assignment-friendly IOMMUs
except on PCIe.  I'll need to work on making that more generic.

> pci_acs_enabled(00:01.0) = 00:01.0 (on root bus)
> pci_acs_enabled(01:00.0) = 01:00.0 (acs_dev = 00:01.0, 01:00.0 is not
> PCIe; seems wrong)

Oops, I'm calling pci_find_upstream_pcie_bridge() first on any of my
input devices, so this was passing for me.  I'll need to incorporate
that generically.
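
As a sketch, the generic version might start like this (the helper
name is hypothetical):

	static struct pci_dev *acs_start_device(struct pci_dev *pdev)
	{
		/*
		 * For conventional PCI devices, begin the ACS walk from
		 * the PCIe bridge the IOMMU actually sees;
		 * pci_find_upstream_pcie_bridge() returns NULL when
		 * @pdev is already PCIe.
		 */
		struct pci_dev *bridge = pci_find_upstream_pcie_bridge(pdev);

		return bridge ? bridge : pdev;
	}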

> pci_acs_enabled(00:02.0) = 00:02.0 (on root bus; seems wrong if RP
> doesn't have ACS)

Yeah, let me validate the libvirt assumption.  I see ACS on my root
port, so maybe they're just assuming it's always enabled or that the
precedence favors IOMMU translation.  I'm also starting to think that we
might want "from" and "to" struct pci_dev parameters to be more flexible
about where the iommu lives in the system.
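
A sketch of the shape that interface might take (the name and exact
semantics are hypothetical):

	/*
	 * Test ACS along the path from @from up to @to, where @to is
	 * wherever the IOMMU logically sits in this system.  Returns
	 * the closest device to @from below which peer-to-peer is
	 * still possible, i.e. the top of the subtree that must share
	 * an IOMMU group.
	 */
	struct pci_dev *pci_acs_path_enabled(struct pci_dev *from,
					     struct pci_dev *to);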

> pci_acs_enabled(02:00.0) = 00:02.0 (acs_dev = 00:02.0, 02:00.0 has no ACS cap)
> pci_acs_enabled(03:00.0) = 00:02.0 (acs_dev = 00:02.0)
> pci_acs_enabled(02:01.0) = 02:01.0 (acs_dev = 00:02.0, 02:01.0 has ACS enabled)
> pci_acs_enabled(04:00.0) = 04:00.0 (acs_dev = 02:01.0, 04:00.0 is not
> a bridge; seems wrong if 04:00 is a multi-function device)

AIUI, ACS is not an endpoint property, so this is what should happen.  I
don't think multifunction plays a role other than how much we trust
the implementation not to allow back channels between functions (the
answer should probably be not at all).

> pci_acs_enabled(02:02.0) = 02:02.0 (acs_dev = 00:02.0, 02:02.0 has ACS enabled)
> pci_acs_enabled(05:00.0) = 05:00.0 (acs_dev = 02:02.0, 05:00.0 is not a bridge)
> 
> But it didn't really help.  I still can't develop a mental picture of
> what this function does.

It helped me :)  These are good examples, I'll work on fixing it for
them.  Thanks,

Alex



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/13] pci: New pci_acs_enabled()
@ 2012-05-16 13:29             ` Don Dutile
  0 siblings, 0 replies; 129+ messages in thread
From: Don Dutile @ 2012-05-16 13:29 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Bjorn Helgaas, kvm, B07421, aik, benh, linux-pci, agraf,
	qemu-devel, chrisw, B08248, iommu, gregkh, avi, benve, dwmw2,
	linux-kernel, david

On 05/15/2012 05:09 PM, Alex Williamson wrote:
> On Tue, 2012-05-15 at 13:56 -0600, Bjorn Helgaas wrote:
>> On Mon, May 14, 2012 at 4:49 PM, Alex Williamson
>> <alex.williamson@redhat.com>  wrote:
>>> On Mon, 2012-05-14 at 16:02 -0600, Bjorn Helgaas wrote:
>>>> On Fri, May 11, 2012 at 4:56 PM, Alex Williamson
>>>> <alex.williamson@redhat.com>  wrote:
>>>>> In a PCIe environment, transactions aren't always required to
>>>>> reach the root bus before being re-routed.  Peer-to-peer DMA
>>>>> may actually not be seen by the IOMMU in these cases.  For
>>>>> IOMMU groups, we want to provide IOMMU drivers a way to detect
>>>>> these restrictions.  Provided with a PCI device, pci_acs_enabled
>>>>> returns the furthest downstream device with a complete PCI ACS
>>>>> chain.  This information can then be used in grouping to create
>>>>> fully isolated groups.  ACS chain logic extracted from libvirt.
>>>>
>>>> The name "pci_acs_enabled()" sounds like it returns a boolean, but it doesn't.
>>>
>>> Right, maybe this should be:
>>>
>>> struct pci_dev *pci_find_upstream_acs(struct pci_dev *pdev);
>>>
+1; there is a global in the PCI code, pci_acs_enable,
and a function, pci_enable_acs(), both of which the above name is
easily confused with.  I'd recommend pci_find_top_acs_bridge(),
which would be most descriptive.

>>>> I'm not sure what "a complete PCI ACS chain" means.
>>>>
>>>> The function starts from "dev" and searches *upstream*, so I'm
>>>> guessing it returns the root of a subtree that must be contained in a
>>>> group.
>>>
>>> Any intermediate switch between an endpoint and the root bus can
>>> redirect a dma access without iommu translation,
>>
>> Is this "redirection" just the normal PCI bridge forwarding that
>> allows peer-to-peer transactions, i.e., the rule (from P2P bridge
>> spec, rev 1.2, sec 4.1) that the bridge apertures define address
>> ranges that are forwarded from primary to secondary interface, and the
>> inverse ranges are forwarded from secondary to primary?  For example,
>> here:
>>
>>                     ^
>>                     |
>>            +--------+-------+
>>            |                |
>>     +------+-----+    +-----++-----+
>>     | Downstream |    | Downstream |
>>     |    Port    |    |    Port    |
>>     |   06:05.0  |    |   06:06.0  |
>>     +------+-----+    +------+-----+
>>            |                 |
>>       +----v----+       +----v----+
>>       | Endpoint|       | Endpoint|
>>       | 07:00.0 |       | 08:00.0 |
>>       +---------+       +---------+
>>
>> that rule is all that's needed for a transaction from 07:00.0 to be
>> forwarded from upstream to the internal switch bus 06, then claimed by
>> 06:06.0 and forwarded downstream to 08:00.0.  This is plain old PCI,
>> nothing specific to PCIe.
>
> Right, I think the main PCI difference is the point-to-point nature of
> PCIe vs legacy PCI bus.  On a legacy PCI bus there's no way to prevent
> devices talking to each other, but on PCIe the transaction makes a
> U-turn at some point and heads out another downstream port.  ACS allows
> us to prevent that from happening.
>
detail: PCIe up/downstream routing is really done by an internal switch;
         ACS imposes the legacy PCI base/limit address routing and *forces*
         the switch to always route the transaction from a downstream port
         to the upstream port.

>> I don't understand ACS very well, but it looks like it basically
>> provides ways to prevent that peer-to-peer forwarding, so transactions
>> would be sent upstream toward the root (and specifically, the IOMMU)
>> instead of being directly claimed by 06:06.0.
>
> Yep, that's my meager understanding as well.
>
+1

>>> so we're looking for
>>> the furthest upstream device for which acs is enabled all the way up to
>>> the root bus.
>>
>> Correct me if this is wrong: To force device A's DMAs to be processed
>> by an IOMMU, ACS must be enabled on the root port and every downstream
>> port along the path to A.
>
> Yes, modulo this comment in libvirt source:
>
>      /* if we have no parent, and this is the root bus, ACS doesn't come
>       * into play since devices on the root bus can't P2P without going
>       * through the root IOMMU.
>       */
>
Correct. PCIe spec says roots must support ACS. I believe all the
root bridges that have an IOMMU have ACS wired in/on.

> So we assume that a redirect at the point of the iommu will factor in
> iommu translation.
>
>> If so, I think you're trying to find out the closest upstream device X
>> such that everything leading to X has ACS enabled.  Every device below
>> X can DMA freely to other devices below X, so they would all have to
>> be in the same isolated group.
>
> Yes
>
>> I tried to work through some examples to develop some intuition about this:
>
> (inserting fixed url)
>> http://www.asciiflow.com/#3736558963405980039
>
>> pci_acs_enabled(00:00.0) = 00:00.0 (on root bus (but doesn't it matter
>> if 00:00.0 is PCIe or if RP has ACS?))
>
> Hmm, the latter is the assumption above.  For the former, I think
> libvirt was probably assuming that PCI devices must have a PCIe device
> upstream from them because x86 doesn't have assignment friendly IOMMUs
> except on PCIe.  I'll need to work on making that more generic.
>
>> pci_acs_enabled(00:01.0) = 00:01.0 (on root bus)
>> pci_acs_enabled(01:00.0) = 01:00.0 (acs_dev = 00:01.0, 01:00.0 is not
>> PCIe; seems wrong)
>
> Oops, I'm calling pci_find_upstream_pcie_bridge() first on any of my
> input devices, so this was passing for me.  I'll need to incorporate
> that generically.
>
>> pci_acs_enabled(00:02.0) = 00:02.0 (on root bus; seems wrong if RP
>> doesn't have ACS)
>
> Yeah, let me validate the libvirt assumption.  I see ACS on my root
> port, so maybe they're just assuming it's always enabled or that the
> precedence favors IOMMU translation.  I'm also starting to think that we
> might want "from" and "to" struct pci_dev parameters to make it more
> flexible where the iommu lives in the system.
>
see comment above wrt root ports that have IOMMUs in them.

>> pci_acs_enabled(02:00.0) = 00:02.0 (acs_dev = 00:02.0, 02:00.0 has no ACS cap)
>> pci_acs_enabled(03:00.0) = 00:02.0 (acs_dev = 00:02.0)
>> pci_acs_enabled(02:01.0) = 02:01.0 (acs_dev = 00:02.0, 02:01.0 has ACS enabled)
>> pci_acs_enabled(04:00.0) = 04:00.0 (acs_dev = 02:01.0, 04:00.0 is not
>> a bridge; seems wrong if 04:00 is a multi-function device)
>
> AIUI, ACS is not an endpoint property, so this is what should happen.  I
> don't think multifunction plays a role other than how much do we trust
> the implementation to not allow back channels between functions (the
> answer should probably be not at all).
>
Correct. ACS is a *bridge* property.
The unknown wrt multifunction devices is that such devices *could* be
implemented with a hidden PCI bridge (one not responding to PCI cfg accesses
from the downstream port) between the functions within a device.
Such a bridge could allow peer-to-peer transactions, and there is no way for
OSes to force ACS on it.  So one has to ask the hw vendors whether such a
hidden device exists in the implementation, and whether peer-to-peer is
enabled/allowed -- a hidden PCI bridge/PCIe switch could just be hardwired to
push all IO to the upstream port and let the parent bridge re-route it back
down if peer-to-peer is desired.
Debate exists over whether multifunction devices are 'secure' because of this
unknown.  Maybe a PCIe (at minimum, SR-IOV) spec change is needed in this area
to make this status discoverable for a device (via pci cfg/cap space).

>> pci_acs_enabled(02:02.0) = 02:02.0 (acs_dev = 00:02.0, 02:02.0 has ACS enabled)
>> pci_acs_enabled(05:00.0) = 05:00.0 (acs_dev = 02:02.0, 05:00.0 is not a bridge)
>>
>> But it didn't really help.  I still can't develop a mental picture of
>> what this function does.
>
> It helped me :)  These are good examples, I'll work on fixing it for
> them.  Thanks,
>
> Alex
>
>
> _______________________________________________
> iommu mailing list
> iommu@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/iommu


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/13] pci: New pci_acs_enabled()
@ 2012-05-16 16:21               ` Alex Williamson
  0 siblings, 0 replies; 129+ messages in thread
From: Alex Williamson @ 2012-05-16 16:21 UTC (permalink / raw)
  To: Don Dutile
  Cc: Bjorn Helgaas, kvm, B07421, aik, benh, linux-pci, agraf,
	qemu-devel, chrisw, B08248, iommu, gregkh, avi, benve, dwmw2,
	linux-kernel, david

On Wed, 2012-05-16 at 09:29 -0400, Don Dutile wrote:
> On 05/15/2012 05:09 PM, Alex Williamson wrote:
> > On Tue, 2012-05-15 at 13:56 -0600, Bjorn Helgaas wrote:
> >> On Mon, May 14, 2012 at 4:49 PM, Alex Williamson
> >> <alex.williamson@redhat.com>  wrote:
> >>> On Mon, 2012-05-14 at 16:02 -0600, Bjorn Helgaas wrote:
> >>>> On Fri, May 11, 2012 at 4:56 PM, Alex Williamson
> >>>> <alex.williamson@redhat.com>  wrote:
> >>>>> In a PCIe environment, transactions aren't always required to
> >>>>> reach the root bus before being re-routed.  Peer-to-peer DMA
> >>>>> may actually not be seen by the IOMMU in these cases.  For
> >>>>> IOMMU groups, we want to provide IOMMU drivers a way to detect
> >>>>> these restrictions.  Provided with a PCI device, pci_acs_enabled
> >>>>> returns the furthest downstream device with a complete PCI ACS
> >>>>> chain.  This information can then be used in grouping to create
> >>>>> fully isolated groups.  ACS chain logic extracted from libvirt.
> >>>>
> >>>> The name "pci_acs_enabled()" sounds like it returns a boolean, but it doesn't.
> >>>
> >>> Right, maybe this should be:
> >>>
> >>> struct pci_dev *pci_find_upstream_acs(struct pci_dev *pdev);
> >>>
> +1; there is a global in the PCI code, pci_acs_enable,
> and a function pci_enable_acs(), which the above name certainly
> confuses.  I recommend  pci_find_top_acs_bridge()
> would be most descriptive.

Yep, the new API I'm working with is:

bool pci_acs_enabled(struct pci_dev *pdev, u16 acs_flags);
bool pci_acs_path_enabled(struct pci_dev *start,
                          struct pci_dev *end, u16 acs_flags);
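
A rough sketch of what the path variant amounts to -- untested, not the
actual patch, just the per-device check applied at each step on the way up:

bool pci_acs_path_enabled(struct pci_dev *start,
                          struct pci_dev *end, u16 acs_flags)
{
        struct pci_dev *pdev = start;

        while (1) {
                if (!pci_acs_enabled(pdev, acs_flags))
                        return false;
                if (pdev == end)
                        return true;
                if (pci_is_root_bus(pdev->bus))
                        return end == NULL;  /* NULL end: up to the root bus */
                pdev = pdev->bus->self;
        }
}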

> >>>> I'm not sure what "a complete PCI ACS chain" means.
> >>>>
> >>>> The function starts from "dev" and searches *upstream*, so I'm
> >>>> guessing it returns the root of a subtree that must be contained in a
> >>>> group.
> >>>
> >>> Any intermediate switch between an endpoint and the root bus can
> >>> redirect a dma access without iommu translation,
> >>
> >> Is this "redirection" just the normal PCI bridge forwarding that
> >> allows peer-to-peer transactions, i.e., the rule (from P2P bridge
> >> spec, rev 1.2, sec 4.1) that the bridge apertures define address
> >> ranges that are forwarded from primary to secondary interface, and the
> >> inverse ranges are forwarded from secondary to primary?  For example,
> >> here:
> >>
> >>                     ^
> >>                     |
> >>            +--------+-------+
> >>            |                |
> >>     +------+-----+    +-----++-----+
> >>     | Downstream |    | Downstream |
> >>     |    Port    |    |    Port    |
> >>     |   06:05.0  |    |   06:06.0  |
> >>     +------+-----+    +------+-----+
> >>            |                 |
> >>       +----v----+       +----v----+
> >>       | Endpoint|       | Endpoint|
> >>       | 07:00.0 |       | 08:00.0 |
> >>       +---------+       +---------+
> >>
> >> that rule is all that's needed for a transaction from 07:00.0 to be
> >> forwarded from upstream to the internal switch bus 06, then claimed by
> >> 06:06.0 and forwarded downstream to 08:00.0.  This is plain old PCI,
> >> nothing specific to PCIe.
> >
> > Right, I think the main PCI difference is the point-to-point nature of
> > PCIe vs legacy PCI bus.  On a legacy PCI bus there's no way to prevent
> > devices talking to each other, but on PCIe the transaction makes a
> > U-turn at some point and heads out another downstream port.  ACS allows
> > us to prevent that from happening.
> >
> detail: PCIe up/downstream routing is really done by an internal switch;
>          ACS forces the legacy, PCI base-limit address routing and *forces*
>          the switch to always route the transaction from a downstream port
>          to the upstream port.
> 
> >> I don't understand ACS very well, but it looks like it basically
> >> provides ways to prevent that peer-to-peer forwarding, so transactions
> >> would be sent upstream toward the root (and specifically, the IOMMU)
> >> instead of being directly claimed by 06:06.0.
> >
> > Yep, that's my meager understanding as well.
> >
> +1
> 
> >>> so we're looking for
> >>> the furthest upstream device for which acs is enabled all the way up to
> >>> the root bus.
> >>
> >> Correct me if this is wrong: To force device A's DMAs to be processed
> >> by an IOMMU, ACS must be enabled on the root port and every downstream
> >> port along the path to A.
> >
> > Yes, modulo this comment in libvirt source:
> >
> >      /* if we have no parent, and this is the root bus, ACS doesn't come
> >       * into play since devices on the root bus can't P2P without going
> >       * through the root IOMMU.
> >       */
> >
> Correct. PCIe spec says roots must support ACS. I believe all the
> root bridges that have an IOMMU have ACS wired in/on.

Would you mind looking for the paragraph that says this?  I'd rather
code this into the iommu driver callers than core PCI code if this is
just a platform standard.

> > So we assume that a redirect at the point of the iommu will factor in
> > iommu translation.
> >
> >> If so, I think you're trying to find out the closest upstream device X
> >> such that everything leading to X has ACS enabled.  Every device below
> >> X can DMA freely to other devices below X, so they would all have to
> >> be in the same isolated group.
> >
> > Yes
> >
> >> I tried to work through some examples to develop some intuition about this:
> >
> > (inserting fixed url)
> >> http://www.asciiflow.com/#3736558963405980039
> >
> >> pci_acs_enabled(00:00.0) = 00:00.0 (on root bus (but doesn't it matter
> >> if 00:00.0 is PCIe or if RP has ACS?))
> >
> > Hmm, the latter is the assumption above.  For the former, I think
> > libvirt was probably assuming that PCI devices must have a PCIe device
> > upstream from them because x86 doesn't have assignment friendly IOMMUs
> > except on PCIe.  I'll need to work on making that more generic.
> >
> >> pci_acs_enabled(00:01.0) = 00:01.0 (on root bus)
> >> pci_acs_enabled(01:00.0) = 01:00.0 (acs_dev = 00:01.0, 01:00.0 is not
> >> PCIe; seems wrong)
> >
> > Oops, I'm calling pci_find_upstream_pcie_bridge() first on any of my
> > input devices, so this was passing for me.  I'll need to incorporate
> > that generically.
> >
> >> pci_acs_enabled(00:02.0) = 00:02.0 (on root bus; seems wrong if RP
> >> doesn't have ACS)
> >
> > Yeah, let me validate the libvirt assumption.  I see ACS on my root
> > port, so maybe they're just assuming it's always enabled or that the
> > precedence favors IOMMU translation.  I'm also starting to think that we
> > might want "from" and "to" struct pci_dev parameters to make it more
> > flexible where the iommu lives in the system.
> >
> see comment above wrt root ports that have IOMMUs in them.

Except it really seems to be a platform convention where the IOMMU lives.
The DMAR for VT-d describes which devices and hierarchies a DRHD is used
for, and from that we can make assumptions about where it physically
lives.  AMD-Vi exposes a PCI device as the IOMMU, but it's just a peer
function on the root bus.  For now I'm just allowing
pci_acs_path_enabled to take NULL for an end, which means "up to the
root bus".

> 
> >> pci_acs_enabled(02:00.0) = 00:02.0 (acs_dev = 00:02.0, 02:00.0 has no ACS cap)
> >> pci_acs_enabled(03:00.0) = 00:02.0 (acs_dev = 00:02.0)
> >> pci_acs_enabled(02:01.0) = 02:01.0 (acs_dev = 00:02.0, 02:01.0 has ACS enabled)
> >> pci_acs_enabled(04:00.0) = 04:00.0 (acs_dev = 02:01.0, 04:00.0 is not
> >> a bridge; seems wrong if 04:00 is a multi-function device)
> >
> > AIUI, ACS is not an endpoint property, so this is what should happen.  I
> > don't think multifunction plays a role other than how much do we trust
> > the implementation to not allow back channels between functions (the
> > answer should probably be not at all).
> >
> correct. ACS is a *bridge* property.
> The unknown wrt multifunction devices is that such devices *could* be implemented
> by a hidden (not responding to PCI cfg accesses from downstream port) PCI bridge
> btwn the functions within a device.
> Such a bridge could allow peer-to-peer xactions and there is no way for OS's to
> force ACS.  So, one has to ask the hw vendors if such a hidden device exists
> in the implementation, and whether peer-to-peer is enabled/allowed -- a hidden PCI
> bridge/PCIe-switch could just be hardwired to push all IO to upstream port,
> and allow parent bridge re-route it back down if peer-to-peer is desired.
> Debate exists whether multifunction devices are 'secure' b/c of this unknown.
> Maybe a PCIe (min., SRIOV) spec change is needed in this area to
> determine this status about a device (via pci cfg/cap space).

Well, there is actually a section of the ACS part of the spec
identifying valid flags for multifunction devices.  Secretly I'd like to
use this as justification for blacklisting all multifunction devices
that don't explicitly support ACS, but that makes for pretty coarse
granularity.  For instance, all these devices end up in a group:

   +-14.0  ATI Technologies Inc SBx00 SMBus Controller
   +-14.2  ATI Technologies Inc SBx00 Azalia (Intel HDA)
   +-14.3  ATI Technologies Inc SB7x0/SB8x0/SB9x0 LPC host controller
   +-14.4-[05]----0e.0  VIA Technologies, Inc. VT6306/7/8 [Fire II(M)] IEEE 1394 OHCI Controller

  00:14.4 PCI bridge: ATI Technologies Inc SBx00 PCI to PCI Bridge (rev 40)

And these in another:

   +-15.0-[06]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller
   +-15.1-[07]----00.0  Etron Technology, Inc. EJ168 USB 3.0 Host Controller
   +-15.2-[08]--
   +-15.3-[09]--

  00:15.0 PCI bridge: ATI Technologies Inc SB700/SB800/SB900 PCI to PCI bridge (PCIE port 0)
  00:15.1 PCI bridge: ATI Technologies Inc SB700/SB800/SB900 PCI to PCI bridge (PCIE port 1)
  00:15.2 PCI bridge: ATI Technologies Inc SB900 PCI to PCI bridge (PCIE port 2)
  00:15.3 PCI bridge: ATI Technologies Inc SB900 PCI to PCI bridge (PCIE port 3)

Am I misinterpreting the spec or is this the correct, if strict,
interpretation?

Alex


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/13] pci: New pci_acs_enabled()
@ 2012-05-16 19:36                 ` Alex Williamson
  0 siblings, 0 replies; 129+ messages in thread
From: Alex Williamson @ 2012-05-16 19:36 UTC (permalink / raw)
  To: Don Dutile
  Cc: Bjorn Helgaas, kvm, B07421, aik, benh, linux-pci, agraf,
	qemu-devel, chrisw, B08248, iommu, gregkh, avi, benve, dwmw2,
	linux-kernel, david

On Wed, 2012-05-16 at 10:21 -0600, Alex Williamson wrote:
> On Wed, 2012-05-16 at 09:29 -0400, Don Dutile wrote:
> > On 05/15/2012 05:09 PM, Alex Williamson wrote:
> > > On Tue, 2012-05-15 at 13:56 -0600, Bjorn Helgaas wrote:
> > >> pci_acs_enabled(02:00.0) = 00:02.0 (acs_dev = 00:02.0, 02:00.0 has no ACS cap)
> > >> pci_acs_enabled(03:00.0) = 00:02.0 (acs_dev = 00:02.0)
> > >> pci_acs_enabled(02:01.0) = 02:01.0 (acs_dev = 00:02.0, 02:01.0 has ACS enabled)
> > >> pci_acs_enabled(04:00.0) = 04:00.0 (acs_dev = 02:01.0, 04:00.0 is not
> > >> a bridge; seems wrong if 04:00 is a multi-function device)
> > >
> > > AIUI, ACS is not an endpoint property, so this is what should happen.  I
> > > don't think multifunction plays a role other than how much do we trust
> > > the implementation to not allow back channels between functions (the
> > > answer should probably be not at all).
> > >
> > correct. ACS is a *bridge* property.
> > The unknown wrt multifunction devices is that such devices *could* be implemented
> > by a hidden (not responding to PCI cfg accesses from downstream port) PCI bridge
> > btwn the functions within a device.
> > Such a bridge could allow peer-to-peer xactions and there is no way for OS's to
> > force ACS.  So, one has to ask the hw vendors if such a hidden device exists
> > in the implementation, and whether peer-to-peer is enabled/allowed -- a hidden PCI
> > bridge/PCIe-switch could just be hardwired to push all IO to upstream port,
> > and allow the parent bridge to re-route it back down if peer-to-peer is desired.
> > Debate exists whether multifunction devices are 'secure' b/c of this unknown.
> > Maybe a PCIe (min., SRIOV) spec change is needed in this area to
> > determine this status about a device (via pci cfg/cap space).
> 
> Well, there is actually a section of the ACS part of the spec
> identifying valid flags for multifunction devices.  Secretly I'd like to
> use this as justification for blacklisting all multifunction devices
> that don't explicitly support ACS, but that makes for pretty coarse
> granularity.  For instance, all these devices end up in a group:
> 
>    +-14.0  ATI Technologies Inc SBx00 SMBus Controller
>    +-14.2  ATI Technologies Inc SBx00 Azalia (Intel HDA)
>    +-14.3  ATI Technologies Inc SB7x0/SB8x0/SB9x0 LPC host controller
>    +-14.4-[05]----0e.0  VIA Technologies, Inc. VT6306/7/8 [Fire II(M)] IEEE 1394 OHCI Controller
> 
>   00:14.4 PCI bridge: ATI Technologies Inc SBx00 PCI to PCI Bridge (rev 40)
> 
> And these in another:
> 
>    +-15.0-[06]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller
>    +-15.1-[07]----00.0  Etron Technology, Inc. EJ168 USB 3.0 Host Controller
>    +-15.2-[08]--
>    +-15.3-[09]--
> 
>   00:15.0 PCI bridge: ATI Technologies Inc SB700/SB800/SB900 PCI to PCI bridge (PCIE port 0)
>   00:15.1 PCI bridge: ATI Technologies Inc SB700/SB800/SB900 PCI to PCI bridge (PCIE port 1)
>   00:15.2 PCI bridge: ATI Technologies Inc SB900 PCI to PCI bridge (PCIE port 2)
>   00:15.3 PCI bridge: ATI Technologies Inc SB900 PCI to PCI bridge (PCIE port 3)
> 
> Am I misinterpreting the spec or is this the correct, if strict,
> interpretation?

Here's what I'm currently thinking.  This is a much simpler
interface, but I don't know if I'm correctly accounting for
multifunction devices.  Callers use something like:

+       if (dma_pdev->multifunction &&
+           !pci_acs_enabled(dma_pdev, PCI_ACS_ENABLED))
+               dma_pdev = pci_get_slot(dma_pdev->bus,
+                                       PCI_DEVFN(PCI_SLOT(dma_pdev->devfn),
+                                       0));
+
+       while (!pci_is_root_bus(dma_pdev->bus)) {
+               if (pci_acs_path_enabled(dma_pdev->bus->self,
+                                        NULL, PCI_ACS_ENABLED))
+                       break;
+
+               dma_pdev = dma_pdev->bus->self;
+       }

The first test is where we have the option to make a very strict
ACS check for multifunction devices.  Does this start to make more sense?
I'm interested in opinions on multifunction strictness.  Thanks,

Alex
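
For illustration only, here's the snippet above folded into a
self-contained helper.  This is a sketch, not part of the patch below;
PCI_ACS_ENABLED stands in for whatever required-flags mask we settle
on, and reference counting from pci_get_slot() is elided:

static struct pci_dev *iommu_group_dma_dev(struct pci_dev *pdev)
{
	/* Strict option: a multifunction device without ACS is grouped
	 * with its function 0, since peer-to-peer between functions
	 * can't be ruled out. */
	if (pdev->multifunction && !pci_acs_enabled(pdev, PCI_ACS_ENABLED))
		pdev = pci_get_slot(pdev->bus,
				    PCI_DEVFN(PCI_SLOT(pdev->devfn), 0));

	/* Walk upstream until the remaining path to the root bus is
	 * fully ACS protected; the device we stop at bounds the group. */
	while (!pci_is_root_bus(pdev->bus)) {
		if (pci_acs_path_enabled(pdev->bus->self, NULL,
					 PCI_ACS_ENABLED))
			break;
		pdev = pdev->bus->self;
	}

	return pdev;
}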

Author: Alex Williamson <alex.williamson@redhat.com>
Date:   Wed May 16 13:17:24 2012 -0600

    pci: Add ACS validation utility
    
    In a PCIe environment, transactions aren't always required to
    reach the root bus before being re-routed.  Intermediate
    switches between an endpoint and the root bus can redirect
    DMA back downstream before things like the IOMMU have a chance
    to intervene.  This utility function allows us to determine
    the closest device from which the path back to the root bus
    is fully ACS enabled, for use in determining the boundaries
    of an IOMMU group.  The logic for this is extracted from libvirt.
    
    Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 111569c..0300e7a 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -2359,6 +2359,79 @@ void pci_enable_acs(struct pci_dev *dev)
 }
 
 /**
+ * pci_acs_enabled - test ACS against required flags for a given device
+ * @pdev: device to test
+ * @acs_flags: required PCI ACS flags
+ *
+ * Return true if the device supports the provided flags.  Automatically
+ * filters out flags that are not implemented on multifunction devices.
+ */
+bool pci_acs_enabled(struct pci_dev *pdev, u16 acs_flags)
+{
+	int pos;
+	u16 ctrl;
+
+	if (!pci_is_pcie(pdev))
+		return false;
+
+	if (pdev->pcie_type == PCI_EXP_TYPE_DOWNSTREAM ||
+	    pdev->pcie_type == PCI_EXP_TYPE_ROOT_PORT) {
+		pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
+		if (!pos)
+			return false;
+
+		pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
+		if ((ctrl & acs_flags) != acs_flags)
+			return false;
+	} else if (pdev->multifunction) {
+		/* Filter out flags not applicable to multifunction */
+		acs_flags &= (PCI_ACS_RR | PCI_ACS_CR |
+			      PCI_ACS_EC | PCI_ACS_DT);
+
+		pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
+		if (!pos)
+			return false;
+
+		pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
+		if ((ctrl & acs_flags) != acs_flags)
+			return false;
+	}
+
+	return true;
+}
+EXPORT_SYMBOL_GPL(pci_acs_enabled);
+
+/**
+ * pci_acs_path_enabled - test ACS flags from start to end in a hierarchy
+ * @start: starting downstream device
+ * @end: ending upstream device or NULL to search to the root bus
+ * @acs_flags: required flags
+ *
+ * Walk up a device tree from start to end testing PCI ACS support.  If
+ * any step along the way does not support the required flags, return false.
+ */
+bool pci_acs_path_enabled(struct pci_dev *start,
+			  struct pci_dev *end, u16 acs_flags)
+{
+	struct pci_dev *pdev, *parent = start;
+
+	do {
+		pdev = parent;
+
+		if (!pci_acs_enabled(pdev, acs_flags))
+			return false;
+
+		if (pci_is_root_bus(pdev->bus))
+			return (end == NULL);
+
+		parent = pdev->bus->self;
+	} while (pdev != end);
+
+	return true;
+}
+EXPORT_SYMBOL_GPL(pci_acs_path_enabled);
+
+/**
  * pci_swizzle_interrupt_pin - swizzle INTx for device behind bridge
  * @dev: the PCI device
  * @pin: the INTx pin (1=INTA, 2=INTB, 3=INTD, 4=INTD)
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 9910b5c..83c1711 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1586,7 +1586,9 @@ static inline bool pci_is_pcie(struct pci_dev *dev)
 }
 
 void pci_request_acs(void);
-
+bool pci_acs_enabled(struct pci_dev *pdev, u16 acs_flags);
+bool pci_acs_path_enabled(struct pci_dev *start,
+			  struct pci_dev *end, u16 acs_flags);
 
 #define PCI_VPD_LRDT			0x80	/* Large Resource Data Type */
 #define PCI_VPD_LRDT_ID(x)		(x | PCI_VPD_LRDT)




^ permalink raw reply related	[flat|nested] 129+ messages in thread

* Re: [PATCH 02/13] iommu: IOMMU Groups
@ 2012-05-17  3:29             ` David Gibson
  0 siblings, 0 replies; 129+ messages in thread
From: David Gibson @ 2012-05-17  3:29 UTC (permalink / raw)
  To: Alex Williamson
  Cc: benh, aik, joerg.roedel, dwmw2, chrisw, agraf, benve, aafabbri,
	B08248, B07421, avi, konrad.wilk, kvm, qemu-devel, iommu,
	linux-pci, linux-kernel, gregkh, bhelgaas

On Tue, May 15, 2012 at 12:34:03AM -0600, Alex Williamson wrote:
> On Tue, 2012-05-15 at 12:03 +1000, David Gibson wrote:
> > On Mon, May 14, 2012 at 11:11:42AM -0600, Alex Williamson wrote:
> > > On Mon, 2012-05-14 at 11:16 +1000, David Gibson wrote:
> > > > On Fri, May 11, 2012 at 04:55:41PM -0600, Alex Williamson wrote:
> > [snip]
> > > > > +struct iommu_group {
> > > > > +	struct kobject kobj;
> > > > > +	struct kobject *devices_kobj;
> > > > > +	struct list_head devices;
> > > > > +	struct mutex mutex;
> > > > > +	struct blocking_notifier_head notifier;
> > > > > +	int id;
> > > > 
> > > > I think you should add some sort of name string to the group as well
> > > > (supplied by the iommu driver creating the group).  That would make it
> > > > easier to connect the arbitrary assigned IDs to any platform-standard
> > > > naming convention for these things.
> > > 
> > > When would the name be used and how is it exposed?
> > 
> > I'm thinking of this basically as a debugging aid.  So I'd expect it
> > to appear in a 'name' (or 'description') sysfs property on the group,
> > and in printk messages regarding the group.
> 
> Ok, so long as it's only descriptive/debugging I don't have a problem
> adding something like that.
> 
> > [snip]
> > > > So, it's not clear that the kobject_name() here has to be unique
> > > > across all devices in the group.  It might be better to use an
> > > > arbitrary index here instead of a name to avoid that problem.
> > > 
> > > Hmm, that loses useful convenience when they are unique, such as on PCI.
> > > I'll look and see if sysfs_create_link will fail on duplicate names and
> > > see about adding some kind of instance to it.
> > 
> > Ok.  Is the name necessarily unique even for PCI, if the group crosses
> > multiple domains?
> 
> Yes, it includes the domain in the dddd:bb:dd.f form.  I've found I can
> just use sysfs_create_link_nowarn and add a .# index when we have a name
> collision.

Ok, that sounds good.
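
As a sketch of what that collision handling might look like
(hypothetical helper written from your description, not the actual
patch):

	static int iommu_group_link_device(struct iommu_group *group,
					   struct device *dev)
	{
		const char *name = kobject_name(&dev->kobj);
		char *alias = NULL;
		int i, ret;

		ret = sysfs_create_link_nowarn(group->devices_kobj,
					       &dev->kobj, name);
		/* On a name collision, retry with a .1, .2, ... suffix */
		for (i = 1; ret == -EEXIST; i++) {
			kfree(alias);
			alias = kasprintf(GFP_KERNEL, "%s.%d", name, i);
			if (!alias)
				return -ENOMEM;
			ret = sysfs_create_link_nowarn(group->devices_kobj,
						       &dev->kobj, alias);
		}
		kfree(alias);
		return ret;
	}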

> > [snip]
> > > > > +	mutex_lock(&group->mutex);
> > > > > +	list_for_each_entry(device, &group->devices, list) {
> > > > > +		if (device->dev == dev) {
> > > > > +			list_del(&device->list);
> > > > > +			kfree(device);
> > > > > +			break;
> > > > > +		}
> > > > > +	}
> > > > > +	mutex_unlock(&group->mutex);
> > > > > +
> > > > > +	sysfs_remove_link(group->devices_kobj, kobject_name(&dev->kobj));
> > > > > +	sysfs_remove_link(&dev->kobj, "iommu_group");
> > > > > +
> > > > > +	dev->iommu_group = NULL;
> > > > 
> > > > I suspect the dev -> group pointer should be cleared first, under the
> > > > group lock, but I'm not certain about that.
> > > 
> > > group->mutex is protecting the group's device list.  I think my
> > > assumption is that when a device is being removed, there should be no
> > > references to it for anyone to race with iommu_group_get(dev), but I'm
> > > not sure how valid that is.
> > 
> > What I'm concerned about here is someone grabbing the device by
> > non-group-related means, grabbing a pointer to its group and that
> > racing with remove_device().  It would then end up with a group
> > pointer it thinks is right for the device, when the group no longer
> > thinks it owns the device.
> > 
> > Doing it under the lock is so that on the other side, group aware code
> > doesn't traverse the group member list and grab a reference to a
> > device which no longer points back to the group.
> 
> Our for_each function does grab the lock, as you noticed below, so
> removing it from the list under lock prevents that path.  Where it gets
> fuzzy is if someone can call iommu_group_get(dev) to get a group
> reference in this gap.

Right, that's what I'm concerned about.

>  Whether we clear the iommu_group pointer under
> lock or not doesn't matter for that path since it doesn't retrieve it
> under lock.  The assumption there is that the caller is going to have a
> reference to the device and therefore the device is not being
> removed.

Hrm.  I guess that works, assuming that removing a device from the
system is the only thing that will cause a device to be removed from
the group.  How confident are we of that?

> The asynchronous locking and reference counting is by far the hardest
> part of iommu_groups and vfio core, so I appreciate any hard analysis of
> that.

Well, I've given my recommendation on this small aspect.
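
To make the recommendation concrete, the removal path I have in mind
would look roughly like this (a sketch against the code quoted above,
not a tested patch):

	mutex_lock(&group->mutex);
	/* Clear the dev -> group pointer under the lock, before the
	 * list removal, so group-aware code can't find a device the
	 * group has already disowned. */
	dev->iommu_group = NULL;
	list_for_each_entry(device, &group->devices, list) {
		if (device->dev == dev) {
			list_del(&device->list);
			kfree(device);
			break;
		}
	}
	mutex_unlock(&group->mutex);

	sysfs_remove_link(group->devices_kobj, kobject_name(&dev->kobj));
	sysfs_remove_link(&dev->kobj, "iommu_group");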

[snip]
> > > > So, there's still the question of how to assign grouping for devices
> > > > on a subordinate bus behind a bridge which is iommu managed.  The
> > > > iommu driver for the top-level bus can't know about all possible
> > > > subordinate bus types, but the subordinate devices will (in most
> > > > cases, anyway) be iommu translated as if originating with the bus
> > > > bridge.
> > > 
> > > Not just any bridge, there has to be a different bus_type on the
> > > subordinate side.  Things like PCI-to-PCI work as is, but a PCI-to-ISA
> > > would trigger this.
> > 
> > Right, although ISA-under-PCI is a bit of a special case anyway.  I
> > think PCI to Firewire/IEEE1394 would also have this issue, as would
> > SoC-bus-to-PCI for a SoC which had an IOMMU at the SoC bus level.  And
> > virtual struct devices where the PCI driver is structured as a wrapper
> > around a "vanilla" device driver, a pattern used in a number of
> > drivers for chips with both PCI and non PCI variants.
> 
> Sorry, I jumped into reliving this issue without remembering how I
> decided to rationalize it for IOMMU groups.  Let's step through it.
> Given DeviceA that's a member of GroupA and potentially sources a
> subordinate bus (A_bus_type) exposing DeviceA', what are the issues?
> From a VFIO perspective, GroupA isn't usable so long as DeviceA is
> claimed by a non-VFIO driver.  That same non-VFIO driver is the one
> causing DeviceA to source A_bus_type, so remove the driver and DeviceA'
> goes away and we can freely give GroupA to userspace.  I believe this is
> always true; there are no subordinate buses to devices that meet the
> "viable" driver requirements of VFIO.  I don't see any problems with the
> fact that userspace can then re-source A_bus_type and find DeviceA'.
> That's what should happen.  If we want to assign just DeviceA' to
> userspace, well, it has no IOMMU group of its own, so clearly it's not
> assignable on its own.

Ah, right, so the assumption is that a bridge will only expose its
subordinate bus when it has an active kernel driver.  Yeah, I think
that works at least for the current use cases.

> For the more general IOMMU group case, I'm still struggling to figure
> out why this is an issue.  If we were to do dma_ops via IOMMU groups, I
> don't think it's unreasonable that map_page would discover there's no
> iommu_ops on dev->bus (A_bus_type) and step up to dev->bus->self to find
> both iommu_group on DeviceA and iommu_ops on DeviceA->bus.  Is there a
> practical reason why DeviceA' would need to be listed as a member of
> GroupA, or is it just an optimization?  I know we had a number of
> discussions about these type of devices for isolation groups, but I
> think that trying to strictly represent these types of devices was also
> one of the downfalls of the isolation proposal.

Yeah, walking up the tree to find a group is certainly a possibility
as well.  Ok, so I'm happy enough only to track primary group members
for now.  I think we should bear in mind the possibility we might need
to track secondary members in future though (either as an optimization
or as something we need for a use case we haven't pinned down yet).
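
For what it's worth, the walk you describe could be as small as the
following sketch (using the generic dev->parent link rather than the
PCI-specific bus->self, and assuming iommu_ops continues to hang off
the bus_type as it does today):

	static struct device *iommu_dma_dev(struct device *dev)
	{
		/* Step up until we reach a bus level that actually has
		 * iommu_ops; that device's group governs the DMA path. */
		while (dev->parent && (!dev->bus || !dev->bus->iommu_ops))
			dev = dev->parent;
		return dev;
	}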

> This did make me think of one other generic quirk we might need.
> There's some funkiness with USB that makes me think that it's
> effectively a shared bus between 1.x and 2.0 controllers.  So assigning
> the 1.x and 2.0 controllers to different groups potentially allows a
> fast and a slow path to the same set of devices.  Is this true?  If so,
> we probably need to quirk OHCI/UHCI and EHCI functions together when
> they're on the same PCI device.  I think the PCI class code is
> sufficient to do this.  Thanks,

Ah, maybe.  I don't know enough about USB conventions to say for sure.
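
That said, if the class codes are sufficient as you say, the test
itself is mechanical - a sketch, with the constants spelled out since
UHCI/OHCI/EHCI differ only in the programming-interface byte:

	static bool pci_is_usb1_or_usb2_hc(struct pci_dev *pdev)
	{
		/* pdev->class is base/sub/prog-if: 0x0c03xx is a USB
		 * host controller; prog-if 0x00/0x10/0x20 selects
		 * UHCI/OHCI/EHCI respectively. */
		return pdev->class == 0x0c0300 ||	/* UHCI */
		       pdev->class == 0x0c0310 ||	/* OHCI */
		       pdev->class == 0x0c0320;		/* EHCI */
	}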

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 03/13] iommu: IOMMU groups for VT-d and AMD-Vi
@ 2012-05-17  3:37     ` David Gibson
  0 siblings, 0 replies; 129+ messages in thread
From: David Gibson @ 2012-05-17  3:37 UTC (permalink / raw)
  To: Alex Williamson
  Cc: benh, aik, joerg.roedel, dwmw2, chrisw, agraf, benve, aafabbri,
	B08248, B07421, avi, konrad.wilk, kvm, qemu-devel, iommu,
	linux-pci, linux-kernel, gregkh, bhelgaas

On Fri, May 11, 2012 at 04:55:48PM -0600, Alex Williamson wrote:
> Add back group support for AMD & Intel.  amd_iommu already tracks
> devices and has init and uninit routines to manage groups.
> intel-iommu does this on the fly, so we make use of the notifier
> support built into iommu groups to create and remove groups.

Looks ok to me.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 04/13] pci: New pci_dma_quirk()
@ 2012-05-17  3:39     ` David Gibson
  0 siblings, 0 replies; 129+ messages in thread
From: David Gibson @ 2012-05-17  3:39 UTC (permalink / raw)
  To: Alex Williamson
  Cc: benh, aik, joerg.roedel, dwmw2, chrisw, agraf, benve, aafabbri,
	B08248, B07421, avi, konrad.wilk, kvm, qemu-devel, iommu,
	linux-pci, linux-kernel, gregkh, bhelgaas

On Fri, May 11, 2012 at 04:55:55PM -0600, Alex Williamson wrote:
> Integrating IOMMU groups more closely into the driver core allows
> us to more easily work around DMA quirks.  The Ricoh multifunction
> controller is a favorite example of devices that are currently
> incompatible with IOMMU isolation as all the functions use the
> requestor ID of function 0 for DMA.  Passing this device into
> pci_dma_quirk returns the PCI device to use for DMA.  The IOMMU
> driver can then construct an IOMMU group including both devices.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
> 
>  drivers/pci/quirks.c |   22 ++++++++++++++++++++++
>  include/linux/pci.h  |    2 ++
>  2 files changed, 24 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 4bf7102..6f9f7f9 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -3109,3 +3109,25 @@ int pci_dev_specific_reset(struct pci_dev *dev, int probe)
>  
>  	return -ENOTTY;
>  }
> +
> +struct pci_dev *pci_dma_quirk(struct pci_dev *dev)
> +{
> +	struct pci_dev *dma_dev = dev;
> +
> +	/*
> +	 * https://bugzilla.redhat.com/show_bug.cgi?id=605888
> +	 *
> +	 * Some Ricoh devices use the function 0 source ID for DMA on
> +	 * other functions of a multifunction device.  The DMA device
> +	 * is therefore function 0, which has implications for the
> +	 * iommu grouping of these devices.
> +	 */
> +	if (dev->vendor == PCI_VENDOR_ID_RICOH &&
> +	    (dev->device == 0xe822 || dev->device == 0xe230 ||
> +	     dev->device == 0xe832 || dev->device == 0xe476)) {
> +		dma_dev = pci_get_slot(dev->bus,
> +				       PCI_DEVFN(PCI_SLOT(dev->devfn), 0));
> +	}

Hrm.  This seems like a very generic name for a function performing a
very specific test.  We could well have devices with the same problem
in future, so shouldn't this be set up so the same quirk can be easily
added to new device ids without changing the function code itself.
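
Something table driven, perhaps - a sketch only, reusing the existing
pci_match_id() helper so new IDs become table entries rather than code
changes:

	static const struct pci_device_id dma_fn0_alias_tbl[] = {
		{ PCI_DEVICE(PCI_VENDOR_ID_RICOH, 0xe822) },
		{ PCI_DEVICE(PCI_VENDOR_ID_RICOH, 0xe230) },
		{ PCI_DEVICE(PCI_VENDOR_ID_RICOH, 0xe832) },
		{ PCI_DEVICE(PCI_VENDOR_ID_RICOH, 0xe476) },
		{ }	/* terminator */
	};

	struct pci_dev *pci_dma_quirk(struct pci_dev *dev)
	{
		/* Devices quirked to DMA as function 0 get grouped with
		 * function 0 of their slot. */
		if (pci_match_id(dma_fn0_alias_tbl, dev))
			return pci_get_slot(dev->bus,
					    PCI_DEVFN(PCI_SLOT(dev->devfn), 0));
		return dev;
	}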

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 04/13] pci: New pci_dma_quirk()
@ 2012-05-17  4:06       ` Alex Williamson
  0 siblings, 0 replies; 129+ messages in thread
From: Alex Williamson @ 2012-05-17  4:06 UTC (permalink / raw)
  To: David Gibson
  Cc: benh, aik, joerg.roedel, dwmw2, chrisw, agraf, benve, aafabbri,
	B08248, B07421, avi, konrad.wilk, kvm, qemu-devel, iommu,
	linux-pci, linux-kernel, gregkh, bhelgaas

On Thu, 2012-05-17 at 13:39 +1000, David Gibson wrote:
> On Fri, May 11, 2012 at 04:55:55PM -0600, Alex Williamson wrote:
> > Integrating IOMMU groups more closely into the driver core allows
> > us to more easily work around DMA quirks.  The Ricoh multifunction
> > controller is a favorite example of devices that are currently
> > incompatible with IOMMU isolation as all the functions use the
> > requestor ID of function 0 for DMA.  Passing this device into
> > pci_dma_quirk returns the PCI device to use for DMA.  The IOMMU
> > driver can then construct an IOMMU group including both devices.
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> > 
> >  drivers/pci/quirks.c |   22 ++++++++++++++++++++++
> >  include/linux/pci.h  |    2 ++
> >  2 files changed, 24 insertions(+), 0 deletions(-)
> > 
> > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > index 4bf7102..6f9f7f9 100644
> > --- a/drivers/pci/quirks.c
> > +++ b/drivers/pci/quirks.c
> > @@ -3109,3 +3109,25 @@ int pci_dev_specific_reset(struct pci_dev *dev, int probe)
> >  
> >  	return -ENOTTY;
> >  }
> > +
> > +struct pci_dev *pci_dma_quirk(struct pci_dev *dev)
> > +{
> > +	struct pci_dev *dma_dev = dev;
> > +
> > +	/*
> > +	 * https://bugzilla.redhat.com/show_bug.cgi?id=605888
> > +	 *
> > +	 * Some Ricoh devices use the function 0 source ID for DMA on
> > +	 * other functions of a multifunction device.  The DMA device
> > +	 * is therefore function 0, which has implications for the
> > +	 * IOMMU grouping of these devices.
> > +	 */
> > +	if (dev->vendor == PCI_VENDOR_ID_RICOH &&
> > +	    (dev->device == 0xe822 || dev->device == 0xe230 ||
> > +	     dev->device == 0xe832 || dev->device == 0xe476)) {
> > +		dma_dev = pci_get_slot(dev->bus,
> > +				       PCI_DEVFN(PCI_SLOT(dev->devfn), 0));
> > +	}
> 
> Hrm.  This seems like a very generic name for a function performing a
> very specific test.  We could well have devices with the same problem
> in the future, so shouldn't this be set up so the same quirk can easily
> be added to new device IDs without changing the function code itself?

I've since added a USB quirk here to group all the USB functions in a
slot.  I'll take a closer look at the quirk helpers to see if anything
makes this easier, but I didn't see much point in spending a lot of time
over-optimizing this for 1 or 2 quirks that we can just step through in
a monolithic function.  Thanks,

Alex
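
For illustration, the monolithic step-through described above might
look roughly like this sketch (the USB class test is an assumption for
illustration, not the posted code):

	struct pci_dev *pci_dma_quirk(struct pci_dev *dev)
	{
		struct pci_dev *dma_dev = dev;

		/* Ricoh multifunction quirk as in the patch above ... */

		/* Sketch of the USB case: attribute DMA for all USB
		 * functions in a slot to function 0 so they share a
		 * group. */
		if ((dev->class >> 8) == PCI_CLASS_SERIAL_USB)
			dma_dev = pci_get_slot(dev->bus,
					       PCI_DEVFN(PCI_SLOT(dev->devfn), 0));

		return dma_dev;
	}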


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 04/13] pci: New pci_dma_quirk()
@ 2012-05-17  7:19     ` Anonymous
  0 siblings, 0 replies; 129+ messages in thread
From: Anonymous @ 2012-05-17  7:19 UTC (permalink / raw)
  To: Alex Williamson
  Cc: benh, aik, david, joerg.roedel, dwmw2, kvm, B07421, linux-pci,
	agraf, qemu-devel, chrisw, B08248, iommu, avi, gregkh, bhelgaas,
	linux-kernel, benve

Alex,

On Sat, May 12, 2012 at 6:55 AM, Alex Williamson
<alex.williamson@redhat.com> wrote:
> Integrating IOMMU groups more closely into the driver core allows
> us to more easily work around DMA quirks.  The Ricoh multifunction
> controller is a favorite example of devices that are currently
> incompatible with IOMMU isolation as all the functions use the
> requestor ID of function 0 for DMA.  Passing this device into
> pci_dma_quirk returns the PCI device to use for DMA.  The IOMMU
> driver can then construct an IOMMU group including both devices.
>

Please give some thought to the Marvell SATA controller quirk as well.

Instead of multiple visible functions using the requestor ID of
function 0, the Marvell device makes only function 0 visible, but also
uses the requestor ID of function 1 during DMA.

See https://bugzilla.redhat.com/show_bug.cgi?id=757166

--
A.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 04/13] pci: New pci_dma_quirk()
@ 2012-05-17 15:22       ` Alex Williamson
  0 siblings, 0 replies; 129+ messages in thread
From: Alex Williamson @ 2012-05-17 15:22 UTC (permalink / raw)
  To: Anonymous
  Cc: benh, aik, david, joerg.roedel, dwmw2, kvm, B07421, linux-pci,
	agraf, qemu-devel, chrisw, B08248, iommu, avi, gregkh, bhelgaas,
	linux-kernel, benve

On Thu, 2012-05-17 at 15:19 +0800, Anonymous wrote:
> Alex,
> 
> On Sat, May 12, 2012 at 6:55 AM, Alex Williamson
> <alex.williamson@redhat.com> wrote:
> > Integrating IOMMU groups more closely into the driver core allows
> > us to more easily work around DMA quirks.  The Ricoh multifunction
> > controller is a favorite example of devices that are currently
> > incompatible with IOMMU isolation as all the functions use the
> > requestor ID of function 0 for DMA.  Passing this device into
> > pci_dma_quirk returns the PCI device to use for DMA.  The IOMMU
> > driver can then construct an IOMMU group including both devices.
> >
> 
> Please give some thought to the Marvell SATA controller quirk as well.
> 
> Instead of multiple visible functions using the requestor ID of
> function 0, the Marvell device makes only function 0 visible, but also
> uses the requestor ID of function 1 during DMA.
> 
> See https://bugzilla.redhat.com/show_bug.cgi?id=757166

Wow.  That one isn't quite as easy to deal with since there's no
existing device in the kernel to point to.  This comment might be on the
right track:

https://bugzilla.kernel.org/show_bug.cgi?id=42679#c11

Perhaps David Woodhouse can comment on support for phantom functions.
If we had infrastructure for that, it might be easy for the quirk to
update the pci_dev struct, inserting a new phantom function.  Otherwise
we'd need to create a new device in the kernel for it.  Thanks,

Alex
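
For the record, one possible shape for such a fix, assuming a DMA-alias
style helper that does not exist at the time of this thread
(pci_add_dma_alias() and the device ID below are illustrative only):

	/* Hypothetical sketch: register the invisible function 1 as a
	 * DMA alias of the visible function 0, so the IOMMU driver maps
	 * both requestor IDs into the same domain/group. */
	static void quirk_marvell_dma_func1(struct pci_dev *dev)
	{
		if (PCI_FUNC(dev->devfn) == 0)
			pci_add_dma_alias(dev,
					  PCI_DEVFN(PCI_SLOT(dev->devfn), 1));
	}
	DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_MARVELL_EXT, 0x9128,
				 quirk_marvell_dma_func1);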



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: RESEND3: Re: [PATCH 05/13] pci: New pci_acs_enabled()
@ 2012-05-18 23:00                 ` Don Dutile
  0 siblings, 0 replies; 129+ messages in thread
From: Don Dutile @ 2012-05-18 23:00 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Bjorn Helgaas, kvm, B07421, aik, benh, linux-pci, agraf,
	qemu-devel, chrisw, B08248, iommu, gregkh, avi, benve, dwmw2,
	linux-kernel, david

On 05/18/2012 06:02 PM, Alex Williamson wrote:
> On Wed, 2012-05-16 at 09:29 -0400, Don Dutile wrote:
>> On 05/15/2012 05:09 PM, Alex Williamson wrote:
>>> On Tue, 2012-05-15 at 13:56 -0600, Bjorn Helgaas wrote:
>>>> On Mon, May 14, 2012 at 4:49 PM, Alex Williamson
>>>> <alex.williamson@redhat.com>   wrote:
>>>>> On Mon, 2012-05-14 at 16:02 -0600, Bjorn Helgaas wrote:
>>>>>> On Fri, May 11, 2012 at 4:56 PM, Alex Williamson
>>>>>> <alex.williamson@redhat.com>   wrote:
>>>>>>> In a PCIe environment, transactions aren't always required to
>>>>>>> reach the root bus before being re-routed.  Peer-to-peer DMA
>>>>>>> may actually not be seen by the IOMMU in these cases.  For
>>>>>>> IOMMU groups, we want to provide IOMMU drivers a way to detect
>>>>>>> these restrictions.  Provided with a PCI device, pci_acs_enabled
>>>>>>> returns the furthest downstream device with a complete PCI ACS
>>>>>>> chain.  This information can then be used in grouping to create
>>>>>>> fully isolated groups.  ACS chain logic extracted from libvirt.
>>>>>>
>>>>>> The name "pci_acs_enabled()" sounds like it returns a boolean, but it doesn't.
>>>>>
>>>>> Right, maybe this should be:
>>>>>
>>>>> struct pci_dev *pci_find_upstream_acs(struct pci_dev *pdev);
>>>>>
>> +1; there is a global in the PCI code, pci_acs_enable,
>> and a function pci_enable_acs(), which the above name certainly
>> confuses.  I recommend pci_find_top_acs_bridge(), which
>> would be most descriptive.
Finally, with my email filters fixed, I can see this email... :)

>
> Yep, the new API I'm working with is:
>
> bool pci_acs_enabled(struct pci_dev *pdev, u16 acs_flags);
> bool pci_acs_path_enabled(struct pci_dev *start,
>                            struct pci_dev *end, u16 acs_flags);
>
ok.
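
For illustration, an IOMMU driver consuming this API might look roughly
like the sketch below (pdev_is_isolated is a hypothetical name; the
flag set assumes the PCI_ACS_* definitions added earlier in this
series):

	/* Sketch: true if pdev sits behind an unbroken ACS chain, i.e.
	 * all peer-to-peer traffic from it is redirected up to the
	 * IOMMU.  NULL for "end" means "up to the root bus". */
	static bool pdev_is_isolated(struct pci_dev *pdev)
	{
		u16 flags = PCI_ACS_SV | PCI_ACS_RR |
			    PCI_ACS_CR | PCI_ACS_UF;

		return pci_acs_path_enabled(pdev, NULL, flags);
	}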

>>>>>> I'm not sure what "a complete PCI ACS chain" means.
>>>>>>
>>>>>> The function starts from "dev" and searches *upstream*, so I'm
>>>>>> guessing it returns the root of a subtree that must be contained in a
>>>>>> group.
>>>>>
>>>>> Any intermediate switch between an endpoint and the root bus can
>>>>> redirect a dma access without iommu translation,
>>>>
>>>> Is this "redirection" just the normal PCI bridge forwarding that
>>>> allows peer-to-peer transactions, i.e., the rule (from P2P bridge
>>>> spec, rev 1.2, sec 4.1) that the bridge apertures define address
>>>> ranges that are forwarded from primary to secondary interface, and the
>>>> inverse ranges are forwarded from secondary to primary?  For example,
>>>> here:
>>>>
>>>>                      ^
>>>>                      |
>>>>             +--------+-------+
>>>>             |                |
>>>>      +------+-----+    +-----++-----+
>>>>      | Downstream |    | Downstream |
>>>>      |    Port    |    |    Port    |
>>>>      |   06:05.0  |    |   06:06.0  |
>>>>      +------+-----+    +------+-----+
>>>>             |                 |
>>>>        +----v----+       +----v----+
>>>>        | Endpoint|       | Endpoint|
>>>>        | 07:00.0 |       | 08:00.0 |
>>>>        +---------+       +---------+
>>>>
>>>> that rule is all that's needed for a transaction from 07:00.0 to be
>>>> forwarded from upstream to the internal switch bus 06, then claimed by
>>>> 06:06.0 and forwarded downstream to 08:00.0.  This is plain old PCI,
>>>> nothing specific to PCIe.
>>>
>>> Right, I think the main PCI difference is the point-to-point nature of
>>> PCIe vs legacy PCI bus.  On a legacy PCI bus there's no way to prevent
>>> devices talking to each other, but on PCIe the transaction makes a
>>> U-turn at some point and heads out another downstream port.  ACS allows
>>> us to prevent that from happening.
>>>
>> detail: PCIe up/downstream routing is really done by an internal switch;
>>           ACS forces the legacy, PCI base-limit address routing and *forces*
>>           the switch to always route the transaction from a downstream port
>>           to the upstream port.
>>
>>>> I don't understand ACS very well, but it looks like it basically
>>>> provides ways to prevent that peer-to-peer forwarding, so transactions
>>>> would be sent upstream toward the root (and specifically, the IOMMU)
>>>> instead of being directly claimed by 06:06.0.
>>>
>>> Yep, that's my meager understanding as well.
>>>
>> +1
>>
>>>>> so we're looking for
>>>>> the furthest upstream device for which acs is enabled all the way up to
>>>>> the root bus.
>>>>
>>>> Correct me if this is wrong: To force device A's DMAs to be processed
>>>> by an IOMMU, ACS must be enabled on the root port and every downstream
>>>> port along the path to A.
>>>
>>> Yes, modulo this comment in libvirt source:
>>>
>>>       /* if we have no parent, and this is the root bus, ACS doesn't come
>>>        * into play since devices on the root bus can't P2P without going
>>>        * through the root IOMMU.
>>>        */
>>>
>> Correct. PCIe spec says roots must support ACS. I believe all the
>> root bridges that have an IOMMU have ACS wired in/on.
>
> Would you mind looking for the paragraph that says this?  I'd rather
> code this into the iommu driver callers than core PCI code if this is
> just a platform standard.
>
Section 6.12.1.1 of the PCIe Base spec, rev 3.0, states:
ACS upstream forwarding: Must be implemented by Root Ports if the RC supports
                      Redirected Request Validation;
-- which means that if a Root Port allows a peer-to-peer transaction to
    another one of its ports, then it has to support ACS.

So, this means that:
(a) if a Root Complex with multiple ports can't do peer-to-peer between
     its ports, ACS isn't needed;
(b) if a Root Complex with multiple ports can do peer-to-peer between
     its ports, it must have the ACS capability.
And since the Linux code turns on ACS for every port with an ACS
capability, the net effect (in Linux) is that all Root Ports end up with
ACS on and all traffic goes up to the IOMMU in the RC.
  
>>> So we assume that a redirect at the point of the iommu will factor in
>>> iommu translation.
>>>
>>>> If so, I think you're trying to find out the closest upstream device X
>>>> such that everything leading to X has ACS enabled.  Every device below
>>>> X can DMA freely to other devices below X, so they would all have to
>>>> be in the same isolated group.
>>>
>>> Yes
>>>
>>>> I tried to work through some examples to develop some intuition about this:
>>>
>>> (inserting fixed url)
>>>> http://www.asciiflow.com/#3736558963405980039
>>>
>>>> pci_acs_enabled(00:00.0) = 00:00.0 (on root bus (but doesn't it matter
>>>> if 00:00.0 is PCIe or if RP has ACS?))
>>>
>>> Hmm, the latter is the assumption above.  For the former, I think
>>> libvirt was probably assuming that PCI devices must have a PCIe device
>>> upstream from them because x86 doesn't have assignment friendly IOMMUs
>>> except on PCIe.  I'll need to work on making that more generic.
>>>
>>>> pci_acs_enabled(00:01.0) = 00:01.0 (on root bus)
>>>> pci_acs_enabled(01:00.0) = 01:00.0 (acs_dev = 00:01.0, 01:00.0 is not
>>>> PCIe; seems wrong)
>>>
>>> Oops, I'm calling pci_find_upstream_pcie_bridge() first on any of my
>>> input devices, so this was passing for me.  I'll need to incorporate
>>> that generically.
>>>
>>>> pci_acs_enabled(00:02.0) = 00:02.0 (on root bus; seems wrong if RP
>>>> doesn't have ACS)
>>>
>>> Yeah, let me validate the libvirt assumption.  I see ACS on my root
>>> port, so maybe they're just assuming it's always enabled or that the
>>> precedence favors IOMMU translation.  I'm also starting to think that we
>>> might want "from" and "to" struct pci_dev parameters to make it more
>>> flexible where the iommu lives in the system.
>>>
>> see comment above wrt root ports that have IOMMUs in them.
>
> Except it really seems to be platform convention where the IOMMU lives.
> The DMAR for VT-d describes which devices and hierarchies a DRHD is used
> for and from that we can make assumptions about where it physically
> lives.  AMD-Vi exposes a PCI device as the IOMMU, but it's just a peer
> function on the root bus.  For now I'm just allowing
> pci_acs_path_enabled to take NULL for an end, which means "up to the
> root bus".
>
ATM, VT-d IOMMUs live only in the RCs, so the acs_enabled check at each
downstream port in a tree would/should return 'true' by the time it
reaches a Root Port.  For AMD-Vi, I thought the same held true, but I
have to dig through yet another spec (or ask Joerg to check in on this
thread and provide the details).

>>
>>>> pci_acs_enabled(02:00.0) = 00:02.0 (acs_dev = 00:02.0, 02:00.0 has no ACS cap)
>>>> pci_acs_enabled(03:00.0) = 00:02.0 (acs_dev = 00:02.0)
>>>> pci_acs_enabled(02:01.0) = 02:01.0 (acs_dev = 00:02.0, 02:01.0 has ACS enabled)
>>>> pci_acs_enabled(04:00.0) = 04:00.0 (acs_dev = 02:01.0, 04:00.0 is not
>>>> a bridge; seems wrong if 04:00 is a multi-function device)
>>>
>>> AIUI, ACS is not an endpoint property, so this is what should happen.  I
>>> don't think multifunction plays a role other than how much do we trust
>>> the implementation to not allow back channels between functions (the
>>> answer should probably be not at all).
>>>
>> correct. ACS is a *bridge* property.
>> The unknown wrt multifunction devices is that such devices *could* be implemented
>> by a hidden (not responding to PCI cfg accesses from downstream port) PCI bridge
>> btwn the functions within a device.
>> Such a bridge could allow peer-to-peer xactions and there is no way for OS's to
>> force ACS.  So, one has to ask the hw vendors if such a hidden device exists
>> in the implementation, and whether peer-to-peer is enabled/allowed -- a hidden PCI
>> bridge/PCIe-switch could just be hardwired to push all IO to upstream port,
>> and allow the parent bridge to re-route it back down if peer-to-peer is desired.
>> Debate exists whether multifunction devices are 'secure' b/c of this unknown.
>> Maybe a PCIe (min., SRIOV) spec change is needed in this area to
>> determine this status about a device (via pci cfg/cap space).
>
> Well, there is actually a section of the ACS part of the spec
> identifying valid flags for multifunction devices.  Secretly I'd like to
> use this as justification for blacklisting all multifunction devices
> that don't explicitly support ACS, but that makes for pretty coarse
> granularity.  For instance, all these devices end up in a group:
>
>     +-14.0  ATI Technologies Inc SBx00 SMBus Controller
>     +-14.2  ATI Technologies Inc SBx00 Azalia (Intel HDA)
>     +-14.3  ATI Technologies Inc SB7x0/SB8x0/SB9x0 LPC host controller
>     +-14.4-[05]----0e.0  VIA Technologies, Inc. VT6306/7/8 [Fire II(M)] IEEE 1394 OHCI Controller
>
>    00:14.4 PCI bridge: ATI Technologies Inc SBx00 PCI to PCI Bridge (rev 40)
>
> And these in another:
>
>     +-15.0-[06]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller
>     +-15.1-[07]----00.0  Etron Technology, Inc. EJ168 USB 3.0 Host Controller
>     +-15.2-[08]--
>     +-15.3-[09]--
>
>    00:15.0 PCI bridge: ATI Technologies Inc SB700/SB800/SB900 PCI to PCI bridge (PCIE port 0)
>    00:15.1 PCI bridge: ATI Technologies Inc SB700/SB800/SB900 PCI to PCI bridge (PCIE port 1)
>    00:15.2 PCI bridge: ATI Technologies Inc SB900 PCI to PCI bridge (PCIE port 2)
>    00:15.3 PCI bridge: ATI Technologies Inc SB900 PCI to PCI bridge (PCIE port 3)
>
> Am I misinterpreting the spec or is this the correct, if strict,
> interpretation?
>
> Alex
>
Well, in digging into the Root Port ACS-support question above,
I just found out about this ACS support status for multifunction
devices too.  I need more time to read and digest it, but a quick read
says MFDs should have an ACS cap with the relevant RO-status & control
bits to direct/control peer-to-peer and ACS upstream forwarding.  Lack
of an ACS cap implies(?) peer-to-peer can happen, and thus the devices
above have to be in the same IOMMU group.  Unfortunately, I think a
large number of MFDs lack ACS caps yet don't/can't do peer-to-peer, so
the above is heavy-handed, albeit to spec.
Maybe we need a (large?) pci-quirk listing the existing MFDs without
ACS caps, so the devices above could land in separate groups; one
possible shape is sketched below.
On the flip side, it solves some of the quirks for MFDs that use the
wrong BDF in their src-id DMA packets! :) -- they now default to the
same group...
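
For illustration, such a quirk might take roughly this shape (a sketch
only; the table, helper name, and entry are hypothetical):

	/* MFDs known not to do peer-to-peer between functions even
	 * though they expose no ACS capability.  The entry is a
	 * placeholder for whatever hardware gets vetted. */
	static const struct pci_device_id pci_mfd_isolated_tbl[] = {
		{ PCI_DEVICE(PCI_VENDOR_ID_ATI, 0x4385) },	/* SBx00 SMBus */
		{ }
	};

	static bool pci_mfd_acs_quirk(struct pci_dev *dev)
	{
		/* Treat listed devices as though ACS were enabled */
		return pci_match_id(pci_mfd_isolated_tbl, dev) != NULL;
	}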


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: RESEND3: Re: [PATCH 05/13] pci: New pci_acs_enabled()
@ 2012-05-18 23:00                 ` Don Dutile
  0 siblings, 0 replies; 129+ messages in thread
From: Don Dutile @ 2012-05-18 23:00 UTC (permalink / raw)
  To: Alex Williamson
  Cc: B07421-KZfg59tc24xl57MIdRCFDg, kvm-u79uwXL29TY76Z2rM5mHXA,
	gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
	aik-sLpHqDYs0B2HXe+LvDLADg, linux-pci-u79uwXL29TY76Z2rM5mHXA,
	agraf-l3A5Bk7waGM, qemu-devel-qX2TKyscuCcdnm+yROfE0A,
	chrisw-69jw2NvuJkxg9hUCZPvPmw, B08248-KZfg59tc24xl57MIdRCFDg,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	avi-H+wXaHxf7aLQT0dZR+AlfA,
	benh-XVmvHMARGAS8U2dJNN8I7kB+6BGkLq7r, Bjorn Helgaas,
	david-xT8FGy+AXnRB3Ne2BGzF6laj5H9X9Tb+,
	dwmw2-wEGCiKHe2LqWVfeAwA7xHQ,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	benve-FYB4Gu1CFyUAvxtiuMwx3w

On 05/18/2012 06:02 PM, Alex Williamson wrote:
> On Wed, 2012-05-16 at 09:29 -0400, Don Dutile wrote:
>> On 05/15/2012 05:09 PM, Alex Williamson wrote:
>>> On Tue, 2012-05-15 at 13:56 -0600, Bjorn Helgaas wrote:
>>>> On Mon, May 14, 2012 at 4:49 PM, Alex Williamson
>>>> <alex.williamson-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>   wrote:
>>>>> On Mon, 2012-05-14 at 16:02 -0600, Bjorn Helgaas wrote:
>>>>>> On Fri, May 11, 2012 at 4:56 PM, Alex Williamson
>>>>>> <alex.williamson-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>   wrote:
>>>>>>> In a PCIe environment, transactions aren't always required to
>>>>>>> reach the root bus before being re-routed.  Peer-to-peer DMA
>>>>>>> may actually not be seen by the IOMMU in these cases.  For
>>>>>>> IOMMU groups, we want to provide IOMMU drivers a way to detect
>>>>>>> these restrictions.  Provided with a PCI device, pci_acs_enabled
>>>>>>> returns the furthest downstream device with a complete PCI ACS
>>>>>>> chain.  This information can then be used in grouping to create
>>>>>>> fully isolated groups.  ACS chain logic extracted from libvirt.
>>>>>>
>>>>>> The name "pci_acs_enabled()" sounds like it returns a boolean, but it doesn't.
>>>>>
>>>>> Right, maybe this should be:
>>>>>
>>>>> struct pci_dev *pci_find_upstream_acs(struct pci_dev *pdev);
>>>>>
>> +1; there is a global in the PCI code, pci_acs_enable,
>> and a function pci_enable_acs(), which the above name certainly
>> confuses.  I recommend  pci_find_top_acs_bridge()
>> would be most descriptive.
Finally, with my email filters fixed, I can see this email... :)

>
> Yep, the new API I'm working with is:
>
> bool pci_acs_enabled(struct pci_dev *pdev, u16 acs_flags);
> bool pci_acs_path_enabled(struct pci_dev *start,
>                            struct pci_dev *end, u16 acs_flags);
>
ok.

>>>>>> I'm not sure what "a complete PCI ACS chain" means.
>>>>>>
>>>>>> The function starts from "dev" and searches *upstream*, so I'm
>>>>>> guessing it returns the root of a subtree that must be contained in a
>>>>>> group.
>>>>>
>>>>> Any intermediate switch between an endpoint and the root bus can
>>>>> redirect a dma access without iommu translation,
>>>>
>>>> Is this "redirection" just the normal PCI bridge forwarding that
>>>> allows peer-to-peer transactions, i.e., the rule (from P2P bridge
>>>> spec, rev 1.2, sec 4.1) that the bridge apertures define address
>>>> ranges that are forwarded from primary to secondary interface, and the
>>>> inverse ranges are forwarded from secondary to primary?  For example,
>>>> here:
>>>>
>>>>                      ^
>>>>                      |
>>>>             +--------+-------+
>>>>             |                |
>>>>      +------+-----+    +-----++-----+
>>>>      | Downstream |    | Downstream |
>>>>      |    Port    |    |    Port    |
>>>>      |   06:05.0  |    |   06:06.0  |
>>>>      +------+-----+    +------+-----+
>>>>             |                 |
>>>>        +----v----+       +----v----+
>>>>        | Endpoint|       | Endpoint|
>>>>        | 07:00.0 |       | 08:00.0 |
>>>>        +---------+       +---------+
>>>>
>>>> that rule is all that's needed for a transaction from 07:00.0 to be
>>>> forwarded from upstream to the internal switch bus 06, then claimed by
>>>> 06:06.0 and forwarded downstream to 08:00.0.  This is plain old PCI,
>>>> nothing specific to PCIe.
>>>
>>> Right, I think the main PCI difference is the point-to-point nature of
>>> PCIe vs legacy PCI bus.  On a legacy PCI bus there's no way to prevent
>>> devices talking to each other, but on PCIe the transaction makes a
>>> U-turn at some point and heads out another downstream port.  ACS allows
>>> us to prevent that from happening.
>>>
>> detail: PCIe up/downstream routing is really done by an internal switch;
>>           ACS forces the legacy, PCI base-limit address routing and *forces*
>>           the switch to always route the transaction from a downstream port
>>           to the upstream port.
>>
>>>> I don't understand ACS very well, but it looks like it basically
>>>> provides ways to prevent that peer-to-peer forwarding, so transactions
>>>> would be sent upstream toward the root (and specifically, the IOMMU)
>>>> instead of being directly claimed by 06:06.0.
>>>
>>> Yep, that's my meager understanding as well.
>>>
>> +1
>>
>>>>> so we're looking for
>>>>> the furthest upstream device for which acs is enabled all the way up to
>>>>> the root bus.
>>>>
>>>> Correct me if this is wrong: To force device A's DMAs to be processed
>>>> by an IOMMU, ACS must be enabled on the root port and every downstream
>>>> port along the path to A.
>>>
>>> Yes, modulo this comment in libvirt source:
>>>
>>>       /* if we have no parent, and this is the root bus, ACS doesn't come
>>>        * into play since devices on the root bus can't P2P without going
>>>        * through the root IOMMU.
>>>        */
>>>
>> Correct. PCIe spec says roots must support ACS. I believe all the
>> root bridges that have an IOMMU have ACS wired in/on.
>
> Would you mind looking for the paragraph that says this?  I'd rather
> code this into the iommu driver callers than core PCI code if this is
> just a platform standard.
>
In section 6.12.1.1 of PCIe Base spec, rev 3.0, it states:
ACS upstream fwding: Must be implemented by Root Ports if the RC supports
                      Redirected Request Validation;
-- which means, if a Root port allows a peer-to-peer transaction to another
    one of its ports, then it has to support ACS.

So, this means that:
(a) if a Root complex with multiple ports can't do peer-to-peer to another port,
     ACS isn't needed
(b) if a Root complex w/multiple ports can do peer-to-peer to another port,
     it must have ACS capability if it does...
and since the linux code turns on ACS for all ports with an ACS cap,
it degenerates (in Linux) that all Root ports are implementing the
end functionality of ACS==on, all traffic goes up to IOMMU in RC.
  
>>> So we assume that a redirect at the point of the iommu will factor in
>>> iommu translation.
>>>
>>>> If so, I think you're trying to find out the closest upstream device X
>>>> such that everything leading to X has ACS enabled.  Every device below
>>>> X can DMA freely to other devices below X, so they would all have to
>>>> be in the same isolated group.
>>>
>>> Yes
>>>
>>>> I tried to work through some examples to develop some intuition about this:
>>>
>>> (inserting fixed url)
>>>> http://www.asciiflow.com/#3736558963405980039
>>>
>>>> pci_acs_enabled(00:00.0) = 00:00.0 (on root bus (but doesn't it matter
>>>> if 00:00.0 is PCIe or if RP has ACS?))
>>>
>>> Hmm, the latter is the assumption above.  For the former, I think
>>> libvirt was probably assuming that PCI devices must have a PCIe device
>>> upstream from them because x86 doesn't have assignment friendly IOMMUs
>>> except on PCIe.  I'll need to work on making that more generic.
>>>
>>>> pci_acs_enabled(00:01.0) = 00:01.0 (on root bus)
>>>> pci_acs_enabled(01:00.0) = 01:00.0 (acs_dev = 00:01.0, 01:00.0 is not
>>>> PCIe; seems wrong)
>>>
>>> Oops, I'm calling pci_find_upstream_pcie_bridge() first on any of my
>>> input devices, so this was passing for me.  I'll need to incorporate
>>> that generically.
>>>
>>>> pci_acs_enabled(00:02.0) = 00:02.0 (on root bus; seems wrong if RP
>>>> doesn't have ACS)
>>>
>>> Yeah, let me validate the libvirt assumption.  I see ACS on my root
>>> port, so maybe they're just assuming it's always enabled or that the
>>> precedence favors IOMMU translation.  I'm also starting to think that we
>>> might want "from" and "to" struct pci_dev parameters to make it more
>>> flexible where the iommu lives in the system.
>>>
>> see comment above wrt root ports that have IOMMUs in them.
>
> Except it really seems to be platform convention where the IOMMU lives.
> The DMAR for VT-d describes which devices and hierarchies a DRHD is used
> for and from that we can make assumptions about where it physically
> lives.  AMD-Vi exposes a PCI device as the IOMMU, but it's just a peer
> function on the root bus.  For now I'm just allowing
> pci_acs_path_enabled to take NULL for and end, which means "up to the
> root bus".
>
ATM, VT-d IOMMUs are only in the RCs, so, ACS at each downstream port
in a tree would/should return 'true' to the acs_enabled check when it gets to a Root port.
For AMD-Vi, I thought the same held true, ATM, but I have to dig through
yet-another spec (or ask Joerg to check-in to this thread & provide the details).

>>
>>>> pci_acs_enabled(02:00.0) = 00:02.0 (acs_dev = 00:02.0, 02:00.0 has no ACS cap)
>>>> pci_acs_enabled(03:00.0) = 00:02.0 (acs_dev = 00:02.0)
>>>> pci_acs_enabled(02:01.0) = 02:01.0 (acs_dev = 00:02.0, 02:01.0 has ACS enabled)
>>>> pci_acs_enabled(04:00.0) = 04:00.0 (acs_dev = 02:01.0, 04:00.0 is not
>>>> a bridge; seems wrong if 04:00 is a multi-function device)
>>>
>>> AIUI, ACS is not an endpoint property, so this is what should happen.  I
>>> don't think multifunction plays a role other than how much do we trust
>>> the implementation to not allow back channels between functions (the
>>> answer should probably be not at all).
>>>
>> correct. ACS is a *bridge* property.
>> The unknown wrt multifunction devices is that such devices *could* be implemented
>> by a hidden (not responding to PCI cfg accesses from downstream port) PCI bridge
>> btwn the functions within a device.
>> Such a bridge could allow peer-to-peer xactions and there is no way for OS's to
>> force ACS.  So, one has to ask the hw vendors if such a hidden device exists
>> in the implementation, and whether peer-to-peer is enabled/allowed -- a hidden PCI
>> bridge/PCIe-switch could just be hardwired to push all IO to upstream port,
>> and allow parent bridge re-route it back down if peer-to-peer is desired.
>> Debate exists whether multifunction devices are 'secure' b/c of this unknown.
>> Maybe a PCIe (min., SRIOV) spec change is needed in this area to
>> determine this status about a device (via pci cfg/cap space).
>
> Well, there is actually a section of the ACS part of the spec
> identifying valid flags for multifunction devices.  Secretly I'd like to
> use this as justification for blacklisting all multifunction devices
> that don't explicitly support ACS, but that makes for pretty course
> granularity.  For instance, all these devices end up in a group:
>
>     +-14.0  ATI Technologies Inc SBx00 SMBus Controller
>     +-14.2  ATI Technologies Inc SBx00 Azalia (Intel HDA)
>     +-14.3  ATI Technologies Inc SB7x0/SB8x0/SB9x0 LPC host controller
>     +-14.4-[05]----0e.0  VIA Technologies, Inc. VT6306/7/8 [Fire II(M)] IEEE 1394 OHCI Controller
>
>    00:14.4 PCI bridge: ATI Technologies Inc SBx00 PCI to PCI Bridge (rev 40)
>
> And these in another:
>
>     +-15.0-[06]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller
>     +-15.1-[07]----00.0  Etron Technology, Inc. EJ168 USB 3.0 Host Controller
>     +-15.2-[08]--
>     +-15.3-[09]--
>
>    00:15.0 PCI bridge: ATI Technologies Inc SB700/SB800/SB900 PCI to PCI bridge (PCIE port 0)
>    00:15.1 PCI bridge: ATI Technologies Inc SB700/SB800/SB900 PCI to PCI bridge (PCIE port 1)
>    00:15.2 PCI bridge: ATI Technologies Inc SB900 PCI to PCI bridge (PCIE port 2)
>    00:15.3 PCI bridge: ATI Technologies Inc SB900 PCI to PCI bridge (PCIE port 3)
>
> Am I misinterpreting the spec or is this the correct, if strict,
> interpretation?
>
> Alex
>
Well, in digging into the ACS-support in Root port question above,
I just found out about this ACS support status for multifunctions too.
I need more time to read/digest, but a quick read says MFDs should have
an ACS cap with relevant RO-status & control bits to direct/control
peer-to-peer and ACS-upstream.  Lack of an ACS struct implies(?)
peer-to-peer can happen, and thus, the above have to be in the same iommu group.
Unfortunately, I think a large lot of MFDs don't have ACS caps,
and don't/can't do peer-to-peer, so the above is heavy-handed,
albeit to spec.
Maybe we need a (large?) pci-quirk for the list of existing
MFDs that don't have ACS caps that would enable the above devices
to be in separate groups.
On the flip side, it solves some of the quirks for MFDs that
use the wrong BDF in their src-id dma packets! :) -- they default
to the same group now...

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [Qemu-devel] RESEND3: Re: [PATCH 05/13] pci: New pci_acs_enabled()
@ 2012-05-18 23:00                 ` Don Dutile
  0 siblings, 0 replies; 129+ messages in thread
From: Don Dutile @ 2012-05-18 23:00 UTC (permalink / raw)
  To: Alex Williamson
  Cc: B07421, kvm, gregkh, aik, linux-pci, agraf, qemu-devel, chrisw,
	B08248, iommu, avi, Bjorn Helgaas, david, dwmw2, linux-kernel,
	benve

On 05/18/2012 06:02 PM, Alex Williamson wrote:
> On Wed, 2012-05-16 at 09:29 -0400, Don Dutile wrote:
>> On 05/15/2012 05:09 PM, Alex Williamson wrote:
>>> On Tue, 2012-05-15 at 13:56 -0600, Bjorn Helgaas wrote:
>>>> On Mon, May 14, 2012 at 4:49 PM, Alex Williamson
>>>> <alex.williamson@redhat.com>   wrote:
>>>>> On Mon, 2012-05-14 at 16:02 -0600, Bjorn Helgaas wrote:
>>>>>> On Fri, May 11, 2012 at 4:56 PM, Alex Williamson
>>>>>> <alex.williamson@redhat.com>   wrote:
>>>>>>> In a PCIe environment, transactions aren't always required to
>>>>>>> reach the root bus before being re-routed.  Peer-to-peer DMA
>>>>>>> may actually not be seen by the IOMMU in these cases.  For
>>>>>>> IOMMU groups, we want to provide IOMMU drivers a way to detect
>>>>>>> these restrictions.  Provided with a PCI device, pci_acs_enabled
>>>>>>> returns the furthest downstream device with a complete PCI ACS
>>>>>>> chain.  This information can then be used in grouping to create
>>>>>>> fully isolated groups.  ACS chain logic extracted from libvirt.
>>>>>>
>>>>>> The name "pci_acs_enabled()" sounds like it returns a boolean, but it doesn't.
>>>>>
>>>>> Right, maybe this should be:
>>>>>
>>>>> struct pci_dev *pci_find_upstream_acs(struct pci_dev *pdev);
>>>>>
>> +1; there is a global in the PCI code, pci_acs_enable,
>> and a function pci_enable_acs(), which the above name certainly
>> confuses.  I recommend  pci_find_top_acs_bridge()
>> would be most descriptive.
Finally, with my email filters fixed, I can see this email... :)

>
> Yep, the new API I'm working with is:
>
> bool pci_acs_enabled(struct pci_dev *pdev, u16 acs_flags);
> bool pci_acs_path_enabled(struct pci_dev *start,
>                            struct pci_dev *end, u16 acs_flags);
>
ok.

>>>>>> I'm not sure what "a complete PCI ACS chain" means.
>>>>>>
>>>>>> The function starts from "dev" and searches *upstream*, so I'm
>>>>>> guessing it returns the root of a subtree that must be contained in a
>>>>>> group.
>>>>>
>>>>> Any intermediate switch between an endpoint and the root bus can
>>>>> redirect a dma access without iommu translation,
>>>>
>>>> Is this "redirection" just the normal PCI bridge forwarding that
>>>> allows peer-to-peer transactions, i.e., the rule (from P2P bridge
>>>> spec, rev 1.2, sec 4.1) that the bridge apertures define address
>>>> ranges that are forwarded from primary to secondary interface, and the
>>>> inverse ranges are forwarded from secondary to primary?  For example,
>>>> here:
>>>>
>>>>                      ^
>>>>                      |
>>>>             +--------+-------+
>>>>             |                |
>>>>      +------+-----+    +-----++-----+
>>>>      | Downstream |    | Downstream |
>>>>      |    Port    |    |    Port    |
>>>>      |   06:05.0  |    |   06:06.0  |
>>>>      +------+-----+    +------+-----+
>>>>             |                 |
>>>>        +----v----+       +----v----+
>>>>        | Endpoint|       | Endpoint|
>>>>        | 07:00.0 |       | 08:00.0 |
>>>>        +---------+       +---------+
>>>>
>>>> that rule is all that's needed for a transaction from 07:00.0 to be
>>>> forwarded from upstream to the internal switch bus 06, then claimed by
>>>> 06:06.0 and forwarded downstream to 08:00.0.  This is plain old PCI,
>>>> nothing specific to PCIe.
>>>
>>> Right, I think the main PCI difference is the point-to-point nature of
>>> PCIe vs legacy PCI bus.  On a legacy PCI bus there's no way to prevent
>>> devices talking to each other, but on PCIe the transaction makes a
>>> U-turn at some point and heads out another downstream port.  ACS allows
>>> us to prevent that from happening.
>>>
>> detail: PCIe up/downstream routing is really done by an internal switch;
>>           ACS forces the legacy, PCI base-limit address routing and *forces*
>>           the switch to always route the transaction from a downstream port
>>           to the upstream port.
>>
>>>> I don't understand ACS very well, but it looks like it basically
>>>> provides ways to prevent that peer-to-peer forwarding, so transactions
>>>> would be sent upstream toward the root (and specifically, the IOMMU)
>>>> instead of being directly claimed by 06:06.0.
>>>
>>> Yep, that's my meager understanding as well.
>>>
>> +1
>>
>>>>> so we're looking for
>>>>> the furthest upstream device for which acs is enabled all the way up to
>>>>> the root bus.
>>>>
>>>> Correct me if this is wrong: To force device A's DMAs to be processed
>>>> by an IOMMU, ACS must be enabled on the root port and every downstream
>>>> port along the path to A.
>>>
>>> Yes, modulo this comment in libvirt source:
>>>
>>>       /* if we have no parent, and this is the root bus, ACS doesn't come
>>>        * into play since devices on the root bus can't P2P without going
>>>        * through the root IOMMU.
>>>        */
>>>
>> Correct. PCIe spec says roots must support ACS. I believe all the
>> root bridges that have an IOMMU have ACS wired in/on.
>
> Would you mind looking for the paragraph that says this?  I'd rather
> code this into the iommu driver callers than core PCI code if this is
> just a platform standard.
>
In section 6.12.1.1 of PCIe Base spec, rev 3.0, it states:
ACS upstream fwding: Must be implemented by Root Ports if the RC supports
                      Redirected Request Validation;
-- which means, if a Root port allows a peer-to-peer transaction to another
    one of its ports, then it has to support ACS.

So, this means that:
(a) if a Root complex with multiple ports can't do peer-to-peer to another port,
     ACS isn't needed;
(b) if a Root complex w/multiple ports can do peer-to-peer to another port,
     it must have an ACS capability.
And since the Linux code turns on ACS for all ports with an ACS cap, the net
effect (in Linux) is that all Root Ports behave as if ACS==on: all traffic
goes up to the IOMMU in the RC.
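For reference, the "turns on ACS" part is pci_enable_acs() in
drivers/pci/pci.c, which goes roughly like this (sketch from memory, so
check the source for the authoritative version; pci_acs_enable is a
file-scope flag in pci.c):

#include <linux/pci.h>

void pci_enable_acs(struct pci_dev *dev)
{
        int pos;
        u16 cap, ctrl;

        if (!pci_acs_enable)    /* set via pci_request_acs(), e.g. by IOMMU drivers */
                return;

        pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ACS);
        if (!pos)
                return;

        pci_read_config_word(dev, pos + PCI_ACS_CAP, &cap);
        pci_read_config_word(dev, pos + PCI_ACS_CTRL, &ctrl);

        /* enable whichever of SV/RR/CF/UF the port actually implements */
        ctrl |= cap & (PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CF | PCI_ACS_UF);

        pci_write_config_word(dev, pos + PCI_ACS_CTRL, ctrl);
}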
  
>>> So we assume that a redirect at the point of the iommu will factor in
>>> iommu translation.
>>>
>>>> If so, I think you're trying to find out the closest upstream device X
>>>> such that everything leading to X has ACS enabled.  Every device below
>>>> X can DMA freely to other devices below X, so they would all have to
>>>> be in the same isolated group.
>>>
>>> Yes
>>>
>>>> I tried to work through some examples to develop some intuition about this:
>>>
>>> (inserting fixed url)
>>>> http://www.asciiflow.com/#3736558963405980039
>>>
>>>> pci_acs_enabled(00:00.0) = 00:00.0 (on root bus (but doesn't it matter
>>>> if 00:00.0 is PCIe or if RP has ACS?))
>>>
>>> Hmm, the latter is the assumption above.  For the former, I think
>>> libvirt was probably assuming that PCI devices must have a PCIe device
>>> upstream from them because x86 doesn't have assignment friendly IOMMUs
>>> except on PCIe.  I'll need to work on making that more generic.
>>>
>>>> pci_acs_enabled(00:01.0) = 00:01.0 (on root bus)
>>>> pci_acs_enabled(01:00.0) = 01:00.0 (acs_dev = 00:01.0, 01:00.0 is not
>>>> PCIe; seems wrong)
>>>
>>> Oops, I'm calling pci_find_upstream_pcie_bridge() first on any of my
>>> input devices, so this was passing for me.  I'll need to incorporate
>>> that generically.
>>>
>>>> pci_acs_enabled(00:02.0) = 00:02.0 (on root bus; seems wrong if RP
>>>> doesn't have ACS)
>>>
>>> Yeah, let me validate the libvirt assumption.  I see ACS on my root
>>> port, so maybe they're just assuming it's always enabled or that the
>>> precedence favors IOMMU translation.  I'm also starting to think that we
>>> might want "from" and "to" struct pci_dev parameters to make it more
>>> flexible where the iommu lives in the system.
>>>
>> see comment above wrt root ports that have IOMMUs in them.
>
> Except it really seems to be platform convention where the IOMMU lives.
> The DMAR for VT-d describes which devices and hierarchies a DRHD is used
> for and from that we can make assumptions about where it physically
> lives.  AMD-Vi exposes a PCI device as the IOMMU, but it's just a peer
> function on the root bus.  For now I'm just allowing
> pci_acs_path_enabled to take NULL for an end, which means "up to the
> root bus".
>
ATM, VT-d IOMMUs are only in the RCs, so, ACS at each downstream port
in a tree would/should return 'true' to the acs_enabled check when it gets to a Root port.
For AMD-Vi, I thought the same held true, ATM, but I have to dig through
yet-another spec (or ask Joerg to check-in to this thread & provide the details).

>>
>>>> pci_acs_enabled(02:00.0) = 00:02.0 (acs_dev = 00:02.0, 02:00.0 has no ACS cap)
>>>> pci_acs_enabled(03:00.0) = 00:02.0 (acs_dev = 00:02.0)
>>>> pci_acs_enabled(02:01.0) = 02:01.0 (acs_dev = 00:02.0, 02:01.0 has ACS enabled)
>>>> pci_acs_enabled(04:00.0) = 04:00.0 (acs_dev = 02:01.0, 04:00.0 is not
>>>> a bridge; seems wrong if 04:00 is a multi-function device)
>>>
>>> AIUI, ACS is not an endpoint property, so this is what should happen.  I
>>> don't think multifunction plays a role other than how much do we trust
>>> the implementation to not allow back channels between functions (the
>>> answer should probably be not at all).
>>>
>> correct. ACS is a *bridge* property.
>> The unknown wrt multifunction devices is that such devices *could* be implemented
>> by a hidden (not responding to PCI cfg accesses from downstream port) PCI bridge
>> btwn the functions within a device.
>> Such a bridge could allow peer-to-peer xactions and there is no way for OSes to
>> force ACS.  So, one has to ask the hw vendors if such a hidden device exists
>> in the implementation, and whether peer-to-peer is enabled/allowed -- a hidden PCI
>> bridge/PCIe-switch could just be hardwired to push all IO to the upstream port,
>> and allow the parent bridge to re-route it back down if peer-to-peer is desired.
>> Debate exists whether multifunction devices are 'secure' b/c of this unknown.
>> Maybe a PCIe (min., SRIOV) spec change is needed in this area to
>> determine this status about a device (via pci cfg/cap space).
>
> Well, there is actually a section of the ACS part of the spec
> identifying valid flags for multifunction devices.  Secretly I'd like to
> use this as justification for blacklisting all multifunction devices
> that don't explicitly support ACS, but that makes for pretty coarse
> granularity.  For instance, all these devices end up in a group:
>
>     +-14.0  ATI Technologies Inc SBx00 SMBus Controller
>     +-14.2  ATI Technologies Inc SBx00 Azalia (Intel HDA)
>     +-14.3  ATI Technologies Inc SB7x0/SB8x0/SB9x0 LPC host controller
>     +-14.4-[05]----0e.0  VIA Technologies, Inc. VT6306/7/8 [Fire II(M)] IEEE 1394 OHCI Controller
>
>    00:14.4 PCI bridge: ATI Technologies Inc SBx00 PCI to PCI Bridge (rev 40)
>
> And these in another:
>
>     +-15.0-[06]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller
>     +-15.1-[07]----00.0  Etron Technology, Inc. EJ168 USB 3.0 Host Controller
>     +-15.2-[08]--
>     +-15.3-[09]--
>
>    00:15.0 PCI bridge: ATI Technologies Inc SB700/SB800/SB900 PCI to PCI bridge (PCIE port 0)
>    00:15.1 PCI bridge: ATI Technologies Inc SB700/SB800/SB900 PCI to PCI bridge (PCIE port 1)
>    00:15.2 PCI bridge: ATI Technologies Inc SB900 PCI to PCI bridge (PCIE port 2)
>    00:15.3 PCI bridge: ATI Technologies Inc SB900 PCI to PCI bridge (PCIE port 3)
>
> Am I misinterpreting the spec or is this the correct, if strict,
> interpretation?
>
> Alex
>
Well, in digging into the ACS-support in Root port question above,
I just found out about this ACS support status for multifunctions too.
I need more time to read/digest, but a quick read says MFDs should have
an ACS cap with relevant RO-status & control bits to direct/control
peer-to-peer and ACS-upstream.  Lack of an ACS struct implies(?)
peer-to-peer can happen, and thus, the above have to be in the same iommu group.
Unfortunately, I think a large lot of MFDs don't have ACS caps,
and don't/can't do peer-to-peer, so the above is heavy-handed,
albeit to spec.
Maybe we need a (large?) pci-quirk for the list of existing
MFDs that don't have ACS caps that would enable the above devices
to be in separate groups.
On the flip side, it solves some of the quirks for MFDs that
use the wrong BDF in their src-id dma packets! :) -- they default
to the same group now...
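Something along these lines, say -- the table name, helper, and device ID
below are purely illustrative, not proposed entries:

#include <linux/pci.h>

/* illustrative only: MFDs whose vendors have confirmed no internal P2P */
static const struct pci_device_id acs_exempt_mfd_ids[] = {
        { PCI_DEVICE(PCI_VENDOR_ID_ATI, 0x4385) },      /* e.g. an SMBus func */
        { 0 }
};

/* treat a listed MFD as isolated even though it has no ACS capability */
static bool pci_mfd_acs_exempt(struct pci_dev *pdev)
{
        return pdev->multifunction &&
               pci_match_id(acs_exempt_mfd_ids, pdev) != NULL;
}

The grouping code could then check pci_mfd_acs_exempt() before lumping
ACS-less sibling functions into one group.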

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/13] pci: New pci_acs_enabled()
@ 2012-05-19  2:47                   ` Alex Williamson
  0 siblings, 0 replies; 129+ messages in thread
From: Alex Williamson @ 2012-05-19  2:47 UTC (permalink / raw)
  To: Don Dutile
  Cc: Bjorn Helgaas, kvm, B07421, aik, benh, linux-pci, agraf,
	qemu-devel, chrisw, B08248, iommu, gregkh, avi, benve, dwmw2,
	linux-kernel, david

On Fri, 2012-05-18 at 19:00 -0400, Don Dutile wrote:
> On 05/18/2012 06:02 PM, Alex Williamson wrote:
> > On Wed, 2012-05-16 at 09:29 -0400, Don Dutile wrote:
> >> On 05/15/2012 05:09 PM, Alex Williamson wrote:
> >>> On Tue, 2012-05-15 at 13:56 -0600, Bjorn Helgaas wrote:
> >>>> On Mon, May 14, 2012 at 4:49 PM, Alex Williamson
> >>>> <alex.williamson@redhat.com>   wrote:
> >>>>> On Mon, 2012-05-14 at 16:02 -0600, Bjorn Helgaas wrote:
> >>>>>> On Fri, May 11, 2012 at 4:56 PM, Alex Williamson
> >>>>>> <alex.williamson@redhat.com>   wrote:
> >>>>>>> In a PCIe environment, transactions aren't always required to
> >>>>>>> reach the root bus before being re-routed.  Peer-to-peer DMA
> >>>>>>> may actually not be seen by the IOMMU in these cases.  For
> >>>>>>> IOMMU groups, we want to provide IOMMU drivers a way to detect
> >>>>>>> these restrictions.  Provided with a PCI device, pci_acs_enabled
> >>>>>>> returns the furthest downstream device with a complete PCI ACS
> >>>>>>> chain.  This information can then be used in grouping to create
> >>>>>>> fully isolated groups.  ACS chain logic extracted from libvirt.
> >>>>>>
> >>>>>> The name "pci_acs_enabled()" sounds like it returns a boolean, but it doesn't.
> >>>>>
> >>>>> Right, maybe this should be:
> >>>>>
> >>>>> struct pci_dev *pci_find_upstream_acs(struct pci_dev *pdev);
> >>>>>
> >> +1; there is a global in the PCI code, pci_acs_enable,
> >> and a function pci_enable_acs(), which the above name certainly
> >> confuses.  I recommend  pci_find_top_acs_bridge()
> >> would be most descriptive.
> Finally, with my email filters fixed, I can see this email... :)

Welcome back ;)

> > Yep, the new API I'm working with is:
> >
> > bool pci_acs_enabled(struct pci_dev *pdev, u16 acs_flags);
> > bool pci_acs_path_enabled(struct pci_dev *start,
> >                            struct pci_dev *end, u16 acs_flags);
> >
> ok.
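For concreteness, the implementations I'm thinking of look something like
the following (rough sketch only -- the posted patch is what counts):

#include <linux/pci.h>

bool pci_acs_enabled(struct pci_dev *pdev, u16 acs_flags)
{
        int pos;
        u16 ctrl;

        pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
        if (!pos)
                return false;   /* no ACS capability, no isolation guarantee */

        pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
        return (ctrl & acs_flags) == acs_flags;
}

bool pci_acs_path_enabled(struct pci_dev *start,
                          struct pci_dev *end, u16 acs_flags)
{
        struct pci_dev *pdev, *parent = start;

        do {
                pdev = parent;

                if (!pci_acs_enabled(pdev, acs_flags))
                        return false;

                /* a NULL end means "test all the way to the root bus" */
                if (pci_is_root_bus(pdev->bus))
                        return (end == NULL);

                parent = pdev->bus->self;
        } while (pdev != end);

        return true;
}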
> 
> >>>>>> I'm not sure what "a complete PCI ACS chain" means.
> >>>>>>
> >>>>>> The function starts from "dev" and searches *upstream*, so I'm
> >>>>>> guessing it returns the root of a subtree that must be contained in a
> >>>>>> group.
> >>>>>
> >>>>> Any intermediate switch between an endpoint and the root bus can
> >>>>> redirect a dma access without iommu translation,
> >>>>
> >>>> Is this "redirection" just the normal PCI bridge forwarding that
> >>>> allows peer-to-peer transactions, i.e., the rule (from P2P bridge
> >>>> spec, rev 1.2, sec 4.1) that the bridge apertures define address
> >>>> ranges that are forwarded from primary to secondary interface, and the
> >>>> inverse ranges are forwarded from secondary to primary?  For example,
> >>>> here:
> >>>>
> >>>>                      ^
> >>>>                      |
> >>>>             +--------+-------+
> >>>>             |                |
> >>>>      +------+-----+    +-----++-----+
> >>>>      | Downstream |    | Downstream |
> >>>>      |    Port    |    |    Port    |
> >>>>      |   06:05.0  |    |   06:06.0  |
> >>>>      +------+-----+    +------+-----+
> >>>>             |                 |
> >>>>        +----v----+       +----v----+
> >>>>        | Endpoint|       | Endpoint|
> >>>>        | 07:00.0 |       | 08:00.0 |
> >>>>        +---------+       +---------+
> >>>>
> >>>> that rule is all that's needed for a transaction from 07:00.0 to be
> >>>> forwarded from upstream to the internal switch bus 06, then claimed by
> >>>> 06:06.0 and forwarded downstream to 08:00.0.  This is plain old PCI,
> >>>> nothing specific to PCIe.
> >>>
> >>> Right, I think the main PCI difference is the point-to-point nature of
> >>> PCIe vs legacy PCI bus.  On a legacy PCI bus there's no way to prevent
> >>> devices talking to each other, but on PCIe the transaction makes a
> >>> U-turn at some point and heads out another downstream port.  ACS allows
> >>> us to prevent that from happening.
> >>>
> >> detail: PCIe up/downstream routing is really done by an internal switch;
> >>           ACS overrides the legacy PCI base-limit address routing and *forces*
> >>           the switch to always route the transaction from a downstream port
> >>           to the upstream port.
> >>
> >>>> I don't understand ACS very well, but it looks like it basically
> >>>> provides ways to prevent that peer-to-peer forwarding, so transactions
> >>>> would be sent upstream toward the root (and specifically, the IOMMU)
> >>>> instead of being directly claimed by 06:06.0.
> >>>
> >>> Yep, that's my meager understanding as well.
> >>>
> >> +1
> >>
> >>>>> so we're looking for
> >>>>> the furthest upstream device for which acs is enabled all the way up to
> >>>>> the root bus.
> >>>>
> >>>> Correct me if this is wrong: To force device A's DMAs to be processed
> >>>> by an IOMMU, ACS must be enabled on the root port and every downstream
> >>>> port along the path to A.
> >>>
> >>> Yes, modulo this comment in libvirt source:
> >>>
> >>>       /* if we have no parent, and this is the root bus, ACS doesn't come
> >>>        * into play since devices on the root bus can't P2P without going
> >>>        * through the root IOMMU.
> >>>        */
> >>>
> >> Correct. PCIe spec says roots must support ACS. I believe all the
> >> root bridges that have an IOMMU have ACS wired in/on.
> >
> > Would you mind looking for the paragraph that says this?  I'd rather
> > code this into the iommu driver callers than core PCI code if this is
> > just a platform standard.
> >
> In section 6.12.1.1 of PCIe Base spec, rev 3.0, it states:
> ACS upstream fwding: Must be implemented by Root Ports if the RC supports
>                       Redirected Request Validation;

Huh?  (If support ACS.RR then must support ACS.UF) != must support ACS.

> -- which means, if a Root port allows a peer-to-peer transaction to another
>     one of its ports, then it has to support ACS.

I don't get that at all from 6.12.1.1, especially given the first
sentence of that section:

        This section applies to Root Ports and Downstream Switch Ports
        that implement an ACS Extended Capability structure.
        

> So, this means that:
> (a) if a Root complex with multiple ports can't do peer-to-peer to another port,
>      ACS isn't needed;
> (b) if a Root complex w/multiple ports can do peer-to-peer to another port,
>      it must have an ACS capability.
> And since the Linux code turns on ACS for all ports with an ACS cap, the net
> effect (in Linux) is that all Root Ports behave as if ACS==on: all traffic
> goes up to the IOMMU in the RC.
>   
> >>> So we assume that a redirect at the point of the iommu will factor in
> >>> iommu translation.
> >>>
> >>>> If so, I think you're trying to find out the closest upstream device X
> >>>> such that everything leading to X has ACS enabled.  Every device below
> >>>> X can DMA freely to other devices below X, so they would all have to
> >>>> be in the same isolated group.
> >>>
> >>> Yes
> >>>
> >>>> I tried to work through some examples to develop some intuition about this:
> >>>
> >>> (inserting fixed url)
> >>>> http://www.asciiflow.com/#3736558963405980039
> >>>
> >>>> pci_acs_enabled(00:00.0) = 00:00.0 (on root bus (but doesn't it matter
> >>>> if 00:00.0 is PCIe or if RP has ACS?))
> >>>
> >>> Hmm, the latter is the assumption above.  For the former, I think
> >>> libvirt was probably assuming that PCI devices must have a PCIe device
> >>> upstream from them because x86 doesn't have assignment friendly IOMMUs
> >>> except on PCIe.  I'll need to work on making that more generic.
> >>>
> >>>> pci_acs_enabled(00:01.0) = 00:01.0 (on root bus)
> >>>> pci_acs_enabled(01:00.0) = 01:00.0 (acs_dev = 00:01.0, 01:00.0 is not
> >>>> PCIe; seems wrong)
> >>>
> >>> Oops, I'm calling pci_find_upstream_pcie_bridge() first on any of my
> >>> input devices, so this was passing for me.  I'll need to incorporate
> >>> that generically.
> >>>
> >>>> pci_acs_enabled(00:02.0) = 00:02.0 (on root bus; seems wrong if RP
> >>>> doesn't have ACS)
> >>>
> >>> Yeah, let me validate the libvirt assumption.  I see ACS on my root
> >>> port, so maybe they're just assuming it's always enabled or that the
> >>> precedence favors IOMMU translation.  I'm also starting to think that we
> >>> might want "from" and "to" struct pci_dev parameters to make it more
> >>> flexible where the iommu lives in the system.
> >>>
> >> see comment above wrt root ports that have IOMMUs in them.
> >
> > Except it really seems to be platform convention where the IOMMU lives.
> > The DMAR for VT-d describes which devices and hierarchies a DRHD is used
> > for and from that we can make assumptions about where it physically
> > lives.  AMD-Vi exposes a PCI device as the IOMMU, but it's just a peer
> > function on the root bus.  For now I'm just allowing
> > pci_acs_path_enabled to take NULL for an end, which means "up to the
> > root bus".
> >
> ATM, VT-d IOMMUs are only in the RCs, so, ACS at each downstream port
> in a tree would/should return 'true' to the acs_enabled check when it gets to a Root port.
> For AMD-Vi, I thought the same held true, ATM, but I have to dig through
> yet-another spec (or ask Joerg to check-in to this thread & provide the details).

But are we programming to convention or spec?  And I'm still confused
about why we assume the root port isn't susceptible to redirection
before IOMMU translation.  One of the benefits of the
pci_acs_path_enabled() API is that it pushes convention out to the IOMMU
driver.  So it's intel-iommu.c's problem whether to test for ACS support
to the RC or to a given level (and for that matter know whether IOMMU
translation takes precedence over redirection in the RC).
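For example (hypothetical helper and flag mask, just to show the shape of
what an IOMMU driver might do): climb the topology until the rest of the
path to the root is ACS-protected, and group everything at or below the
last bridge that wasn't:

#include <linux/pci.h>

/* assumed mask for full peer-to-peer isolation */
#define REQ_ACS_FLAGS   (PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CF | PCI_ACS_UF)

/* find the top of the smallest isolated subtree containing pdev */
static struct pci_dev *pci_isolation_root(struct pci_dev *pdev)
{
        struct pci_bus *bus;

        for (bus = pdev->bus; !pci_is_root_bus(bus); bus = bus->parent) {
                if (!bus->self)
                        continue;

                /*
                 * If everything from this bridge up to the root is
                 * ACS-protected, nothing above it can be reached by
                 * P2P, so the current candidate is the group root.
                 */
                if (pci_acs_path_enabled(bus->self, NULL, REQ_ACS_FLAGS))
                        break;

                pdev = bus->self;
        }

        return pdev;    /* pdev and everything below it share a group */
}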

> >>
> >>>> pci_acs_enabled(02:00.0) = 00:02.0 (acs_dev = 00:02.0, 02:00.0 has no ACS cap)
> >>>> pci_acs_enabled(03:00.0) = 00:02.0 (acs_dev = 00:02.0)
> >>>> pci_acs_enabled(02:01.0) = 02:01.0 (acs_dev = 00:02.0, 02:01.0 has ACS enabled)
> >>>> pci_acs_enabled(04:00.0) = 04:00.0 (acs_dev = 02:01.0, 04:00.0 is not
> >>>> a bridge; seems wrong if 04:00 is a multi-function device)
> >>>
> >>> AIUI, ACS is not an endpoint property, so this is what should happen.  I
> >>> don't think multifunction plays a role other than how much do we trust
> >>> the implementation to not allow back channels between functions (the
> >>> answer should probably be not at all).
> >>>
> >> correct. ACS is a *bridge* property.
> >> The unknown wrt multifunction devices is that such devices *could* be implemented
> >> by a hidden (not responding to PCI cfg accesses from downstream port) PCI bridge
> >> btwn the functions within a device.
> >> Such a bridge could allow peer-to-peer xactions and there is no way for OSes to
> >> force ACS.  So, one has to ask the hw vendors if such a hidden device exists
> >> in the implementation, and whether peer-to-peer is enabled/allowed -- a hidden PCI
> >> bridge/PCIe-switch could just be hardwired to push all IO to the upstream port,
> >> and allow the parent bridge to re-route it back down if peer-to-peer is desired.
> >> Debate exists whether multifunction devices are 'secure' b/c of this unknown.
> >> Maybe a PCIe (min., SRIOV) spec change is needed in this area to
> >> determine this status about a device (via pci cfg/cap space).
> >
> > Well, there is actually a section of the ACS part of the spec
> > identifying valid flags for multifunction devices.  Secretly I'd like to
> > use this as justification for blacklisting all multifunction devices
> > that don't explicitly support ACS, but that makes for pretty coarse
> > granularity.  For instance, all these devices end up in a group:
> >
> >     +-14.0  ATI Technologies Inc SBx00 SMBus Controller
> >     +-14.2  ATI Technologies Inc SBx00 Azalia (Intel HDA)
> >     +-14.3  ATI Technologies Inc SB7x0/SB8x0/SB9x0 LPC host controller
> >     +-14.4-[05]----0e.0  VIA Technologies, Inc. VT6306/7/8 [Fire II(M)] IEEE 1394 OHCI Controller
> >
> >    00:14.4 PCI bridge: ATI Technologies Inc SBx00 PCI to PCI Bridge (rev 40)
> >
> > And these in another:
> >
> >     +-15.0-[06]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller
> >     +-15.1-[07]----00.0  Etron Technology, Inc. EJ168 USB 3.0 Host Controller
> >     +-15.2-[08]--
> >     +-15.3-[09]--
> >
> >    00:15.0 PCI bridge: ATI Technologies Inc SB700/SB800/SB900 PCI to PCI bridge (PCIE port 0)
> >    00:15.1 PCI bridge: ATI Technologies Inc SB700/SB800/SB900 PCI to PCI bridge (PCIE port 1)
> >    00:15.2 PCI bridge: ATI Technologies Inc SB900 PCI to PCI bridge (PCIE port 2)
> >    00:15.3 PCI bridge: ATI Technologies Inc SB900 PCI to PCI bridge (PCIE port 3)
> >
> > Am I misinterpreting the spec or is this the correct, if strict,
> > interpretation?
> >
> > Alex
> >
> Well, in digging into the ACS-support in Root port question above,
> I just found out about this ACS support status for multifunctions too.
> I need more time to read/digest, but a quick read says MFDs should have
> an ACS cap with relevant RO-status & control bits to direct/control
> peer-to-peer and ACS-upstream.  Lack of an ACS struct implies(?)
> peer-to-peer can happen, and thus, the above have to be in the same iommu group.
> Unfortunately, I think a large lot of MFDs don't have ACS caps,
> and don't/can't do peer-to-peer, so the above is heavy-handed,
> albeit to spec.
> Maybe we need a (large?) pci-quirk for the list of existing
> MFDs that don't have ACS caps that would enable the above devices
> to be in separate groups.
> On the flip side, it solves some of the quirks for MFDs that
> use the wrong BDF in their src-id dma packets! :) -- they default
> to the same group now...

Yep, sounds like you might agree with my patch: it's heavy-handed, but it
seems to adhere to the spec.  That probably just means we need an option
to allow a more lenient interpretation, one that maybe we don't have to
support.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/13] pci: New pci_acs_enabled()
@ 2012-05-21 13:31                     ` Don Dutile
  0 siblings, 0 replies; 129+ messages in thread
From: Don Dutile @ 2012-05-21 13:31 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Bjorn Helgaas, kvm, B07421, aik, benh, linux-pci, agraf,
	qemu-devel, chrisw, B08248, iommu, gregkh, avi, benve, dwmw2,
	linux-kernel, david

On 05/18/2012 10:47 PM, Alex Williamson wrote:
> On Fri, 2012-05-18 at 19:00 -0400, Don Dutile wrote:
>> On 05/18/2012 06:02 PM, Alex Williamson wrote:
>>> On Wed, 2012-05-16 at 09:29 -0400, Don Dutile wrote:
>>>> On 05/15/2012 05:09 PM, Alex Williamson wrote:
>>>>> On Tue, 2012-05-15 at 13:56 -0600, Bjorn Helgaas wrote:
>>>>>> On Mon, May 14, 2012 at 4:49 PM, Alex Williamson
>>>>>> <alex.williamson@redhat.com>    wrote:
>>>>>>> On Mon, 2012-05-14 at 16:02 -0600, Bjorn Helgaas wrote:
>>>>>>>> On Fri, May 11, 2012 at 4:56 PM, Alex Williamson
>>>>>>>> <alex.williamson@redhat.com>    wrote:
>>>>>>>>> In a PCIe environment, transactions aren't always required to
>>>>>>>>> reach the root bus before being re-routed.  Peer-to-peer DMA
>>>>>>>>> may actually not be seen by the IOMMU in these cases.  For
>>>>>>>>> IOMMU groups, we want to provide IOMMU drivers a way to detect
>>>>>>>>> these restrictions.  Provided with a PCI device, pci_acs_enabled
>>>>>>>>> returns the furthest downstream device with a complete PCI ACS
>>>>>>>>> chain.  This information can then be used in grouping to create
>>>>>>>>> fully isolated groups.  ACS chain logic extracted from libvirt.
>>>>>>>>
>>>>>>>> The name "pci_acs_enabled()" sounds like it returns a boolean, but it doesn't.
>>>>>>>
>>>>>>> Right, maybe this should be:
>>>>>>>
>>>>>>> struct pci_dev *pci_find_upstream_acs(struct pci_dev *pdev);
>>>>>>>
>>>> +1; there is a global in the PCI code, pci_acs_enable,
>>>> and a function pci_enable_acs(), which the above name certainly
>>>> confuses.  I recommend  pci_find_top_acs_bridge()
>>>> would be most descriptive.
>> Finally, with my email filters fixed, I can see this email... :)
>
> Welcome back ;)
>
Indeed... and I recvd 3 copies of this reply,
so the pendulum has flipped the other direction... ;-)

>>> Yep, the new API I'm working with is:
>>>
>>> bool pci_acs_enabled(struct pci_dev *pdev, u16 acs_flags);
>>> bool pci_acs_path_enabled(struct pci_dev *start,
>>>                             struct pci_dev *end, u16 acs_flags);
>>>
>> ok.
>>
>>>>>>>> I'm not sure what "a complete PCI ACS chain" means.
>>>>>>>>
>>>>>>>> The function starts from "dev" and searches *upstream*, so I'm
>>>>>>>> guessing it returns the root of a subtree that must be contained in a
>>>>>>>> group.
>>>>>>>
>>>>>>> Any intermediate switch between an endpoint and the root bus can
>>>>>>> redirect a dma access without iommu translation,
>>>>>>
>>>>>> Is this "redirection" just the normal PCI bridge forwarding that
>>>>>> allows peer-to-peer transactions, i.e., the rule (from P2P bridge
>>>>>> spec, rev 1.2, sec 4.1) that the bridge apertures define address
>>>>>> ranges that are forwarded from primary to secondary interface, and the
>>>>>> inverse ranges are forwarded from secondary to primary?  For example,
>>>>>> here:
>>>>>>
>>>>>>                       ^
>>>>>>                       |
>>>>>>              +--------+-------+
>>>>>>              |                |
>>>>>>       +------+-----+    +-----++-----+
>>>>>>       | Downstream |    | Downstream |
>>>>>>       |    Port    |    |    Port    |
>>>>>>       |   06:05.0  |    |   06:06.0  |
>>>>>>       +------+-----+    +------+-----+
>>>>>>              |                 |
>>>>>>         +----v----+       +----v----+
>>>>>>         | Endpoint|       | Endpoint|
>>>>>>         | 07:00.0 |       | 08:00.0 |
>>>>>>         +---------+       +---------+
>>>>>>
>>>>>> that rule is all that's needed for a transaction from 07:00.0 to be
>>>>>> forwarded from upstream to the internal switch bus 06, then claimed by
>>>>>> 06:06.0 and forwarded downstream to 08:00.0.  This is plain old PCI,
>>>>>> nothing specific to PCIe.
>>>>>
>>>>> Right, I think the main PCI difference is the point-to-point nature of
>>>>> PCIe vs legacy PCI bus.  On a legacy PCI bus there's no way to prevent
>>>>> devices talking to each other, but on PCIe the transaction makes a
>>>>> U-turn at some point and heads out another downstream port.  ACS allows
>>>>> us to prevent that from happening.
>>>>>
>>>> detail: PCIe up/downstream routing is really done by an internal switch;
>>>>            ACS overrides the legacy PCI base-limit address routing and *forces*
>>>>            the switch to always route the transaction from a downstream port
>>>>            to the upstream port.
>>>>
>>>>>> I don't understand ACS very well, but it looks like it basically
>>>>>> provides ways to prevent that peer-to-peer forwarding, so transactions
>>>>>> would be sent upstream toward the root (and specifically, the IOMMU)
>>>>>> instead of being directly claimed by 06:06.0.
>>>>>
>>>>> Yep, that's my meager understanding as well.
>>>>>
>>>> +1
>>>>
>>>>>>> so we're looking for
>>>>>>> the furthest upstream device for which acs is enabled all the way up to
>>>>>>> the root bus.
>>>>>>
>>>>>> Correct me if this is wrong: To force device A's DMAs to be processed
>>>>>> by an IOMMU, ACS must be enabled on the root port and every downstream
>>>>>> port along the path to A.
>>>>>
>>>>> Yes, modulo this comment in libvirt source:
>>>>>
>>>>>        /* if we have no parent, and this is the root bus, ACS doesn't come
>>>>>         * into play since devices on the root bus can't P2P without going
>>>>>         * through the root IOMMU.
>>>>>         */
>>>>>
>>>> Correct. PCIe spec says roots must support ACS. I believe all the
>>>> root bridges that have an IOMMU have ACS wired in/on.
>>>
>>> Would you mind looking for the paragraph that says this?  I'd rather
>>> code this into the iommu driver callers than core PCI code if this is
>>> just a platform standard.
>>>
>> In section 6.12.1.1 of PCIe Base spec, rev 3.0, it states:
>> ACS upstream fwding: Must be implemented by Root Ports if the RC supports
>>                        Redirected Request Validation;
>
> Huh?  (If support ACS.RR then must support ACS.UF) != must support ACS.
>
UF?

>> -- which means, if a Root port allows a peer-to-peer transaction to another
>>      one of its ports, then it has to support ACS.
>
> I don't get that at all from 6.12.1.1, especially given the first
> sentence of that section:
>
>          This section applies to Root Ports and Downstream Switch Ports
>          that implement an ACS Extended Capability structure.
>
>
hmmm, well I did.  The Root Port section is different from the Downstream
Port section as well.  Downstream ports *must* support peer-xfers due to
positive decoding of base/limit addresses, while ACS is optional in
downstream ports.  Peer-to-peer btwn root ports is optional.
So, I don't get what you don't get... ;-)
.. but, I understand how the spec can be read/interpreted differently,
    given its clarity (did lawyers write it?!?!!), so I could be interpreting
    incorrectly.

>> So, this means that:
>> (a) if a Root complex with multiple ports can't do peer-to-peer to another port,
>>       ACS isn't needed
>> (b) if a Root complex w/multiple ports can do peer-to-peer to another port,
>>       it must have ACS capability if it does...
>> and since the linux code turns on ACS for all ports with an ACS cap,
>> it degenerates (in Linux) that all Root ports are implementing the
>> end functionality of ACS==on, all traffic goes up to IOMMU in RC.
>>
I thought I explained how I interpreted the root-port part of ACS above,
so maybe you can tell me how you think my interpretation is incorrect.

>>>>> So we assume that a redirect at the point of the iommu will factor in
>>>>> iommu translation.
>>>>>
>>>>>> If so, I think you're trying to find out the closest upstream device X
>>>>>> such that everything leading to X has ACS enabled.  Every device below
>>>>>> X can DMA freely to other devices below X, so they would all have to
>>>>>> be in the same isolated group.
>>>>>
>>>>> Yes
>>>>>
>>>>>> I tried to work through some examples to develop some intuition about this:
>>>>>
>>>>> (inserting fixed url)
>>>>>> http://www.asciiflow.com/#3736558963405980039
>>>>>
>>>>>> pci_acs_enabled(00:00.0) = 00:00.0 (on root bus (but doesn't it matter
>>>>>> if 00:00.0 is PCIe or if RP has ACS?))
>>>>>
>>>>> Hmm, the latter is the assumption above.  For the former, I think
>>>>> libvirt was probably assuming that PCI devices must have a PCIe device
>>>>> upstream from them because x86 doesn't have assignment friendly IOMMUs
>>>>> except on PCIe.  I'll need to work on making that more generic.
>>>>>
>>>>>> pci_acs_enabled(00:01.0) = 00:01.0 (on root bus)
>>>>>> pci_acs_enabled(01:00.0) = 01:00.0 (acs_dev = 00:01.0, 01:00.0 is not
>>>>>> PCIe; seems wrong)
>>>>>
>>>>> Oops, I'm calling pci_find_upstream_pcie_bridge() first on any of my
>>>>> input devices, so this was passing for me.  I'll need to incorporate
>>>>> that generically.
>>>>>
>>>>>> pci_acs_enabled(00:02.0) = 00:02.0 (on root bus; seems wrong if RP
>>>>>> doesn't have ACS)
>>>>>
>>>>> Yeah, let me validate the libvirt assumption.  I see ACS on my root
>>>>> port, so maybe they're just assuming it's always enabled or that the
>>>>> precedence favors IOMMU translation.  I'm also starting to think that we
>>>>> might want "from" and "to" struct pci_dev parameters to make it more
>>>>> flexible where the iommu lives in the system.
>>>>>
>>>> see comment above wrt root ports that have IOMMUs in them.
>>>
>>> Except it really seems to be platform convention where the IOMMU lives.
>>> The DMAR for VT-d describes which devices and hierarchies a DRHD is used
>>> for and from that we can make assumptions about where it physically
>>> lives.  AMD-Vi exposes a PCI device as the IOMMU, but it's just a peer
>>> function on the root bus.  For now I'm just allowing
>>> pci_acs_path_enabled to take NULL for an end, which means "up to the
>>> root bus".
>>>
>> ATM, VT-d IOMMUs are only in the RCs, so, ACS at each downstream port
>> in a tree would/should return 'true' to the acs_enabled check when it gets to a Root port.
>> For AMD-Vi, I thought the same held true, ATM, but I have to dig through
>> yet-another spec (or ask Joerg to check in to this thread & provide the details).
>
> But are we programming to convention or spec?  And I'm still confused
> about why we assume the root port isn't susceptible to redirection
> before IOMMU translation.  One of the benefits of the
> pci_acs_path_enable() API is that it pushes convention out to the IOMMU
> driver.  So it's intel-iommu.c's problem whether to test for ACS support
> to the RC or to a given level (and for that matter know whether IOMMU
> translation takes precedence over redirection in the RC).
>
Agreed.  The best design here would be for intel-iommu.c to test for ACS
wrt the root port (not to be confused with the root complex) since Intel
IOMMUs are not 'attached' to a PCI bus.  Now, AMD's IOMMUs are attached to
a PCI bus, so maybe it can be simplified in that case.... but it may be
best to mimic the iommu-set-acs-for-root-port attribute in the same manner.
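
As a concrete illustration of that split, the platform convention can live
entirely in the IOMMU driver by leaning on the proposed path API; a hedged
sketch, assuming the bool pci_acs_path_enabled() signature quoted earlier
in the thread (the flag set here is illustrative, not taken from the series):

    /* Sketch only: how intel-iommu.c might decide a device is fully
     * isolated.  NULL for "end" means "test all the way up to the
     * root bus", per the semantics described above. */
    static bool device_path_is_isolated(struct pci_dev *pdev)
    {
            u16 flags = PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CC | PCI_ACS_UF;

            return pci_acs_path_enabled(pdev, NULL, flags);
    }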

>>>>
>>>>>> pci_acs_enabled(02:00.0) = 00:02.0 (acs_dev = 00:02.0, 02:00.0 has no ACS cap)
>>>>>> pci_acs_enabled(03:00.0) = 00:02.0 (acs_dev = 00:02.0)
>>>>>> pci_acs_enabled(02:01.0) = 02:01.0 (acs_dev = 00:02.0, 02:01.0 has ACS enabled)
>>>>>> pci_acs_enabled(04:00.0) = 04:00.0 (acs_dev = 02:01.0, 04:00.0 is not
>>>>>> a bridge; seems wrong if 04:00 is a multi-function device)
>>>>>
>>>>> AIUI, ACS is not an endpoint property, so this is what should happen.  I
>>>>> don't think multifunction plays a role other than how much do we trust
>>>>> the implementation to not allow back channels between functions (the
>>>>> answer should probably be not at all).
>>>>>
>>>> correct. ACS is a *bridge* property.
>>>> The unknown wrt multifunction devices is that such devices *could* be implemented
>>>> by a hidden (not responding to PCI cfg accesses from downstream port) PCI bridge
>>>> btwn the functions within a device.
>>>> Such a bridge could allow peer-to-peer xactions and there is no way for OS's to
>>>> force ACS.  So, one has to ask the hw vendors if such a hidden device exists
>>>> in the implementation, and whether peer-to-peer is enabled/allowed -- a hidden PCI
>>>> bridge/PCIe-switch could just be hardwired to push all IO to upstream port,
>>>> and allow the parent bridge to re-route it back down if peer-to-peer is desired.
>>>> Debate exists whether multifunction devices are 'secure' b/c of this unknown.
>>>> Maybe a PCIe (min., SRIOV) spec change is needed in this area to
>>>> determine this status about a device (via pci cfg/cap space).
>>>
>>> Well, there is actually a section of the ACS part of the spec
>>> identifying valid flags for multifunction devices.  Secretly I'd like to
>>> use this as justification for blacklisting all multifunction devices
>>> that don't explicitly support ACS, but that makes for pretty coarse
>>> granularity.  For instance, all these devices end up in a group:
>>>
>>>      +-14.0  ATI Technologies Inc SBx00 SMBus Controller
>>>      +-14.2  ATI Technologies Inc SBx00 Azalia (Intel HDA)
>>>      +-14.3  ATI Technologies Inc SB7x0/SB8x0/SB9x0 LPC host controller
>>>      +-14.4-[05]----0e.0  VIA Technologies, Inc. VT6306/7/8 [Fire II(M)] IEEE 1394 OHCI Controller
>>>
>>>     00:14.4 PCI bridge: ATI Technologies Inc SBx00 PCI to PCI Bridge (rev 40)
>>>
>>> And these in another:
>>>
>>>      +-15.0-[06]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller
>>>      +-15.1-[07]----00.0  Etron Technology, Inc. EJ168 USB 3.0 Host Controller
>>>      +-15.2-[08]--
>>>      +-15.3-[09]--
>>>
>>>     00:15.0 PCI bridge: ATI Technologies Inc SB700/SB800/SB900 PCI to PCI bridge (PCIE port 0)
>>>     00:15.1 PCI bridge: ATI Technologies Inc SB700/SB800/SB900 PCI to PCI bridge (PCIE port 1)
>>>     00:15.2 PCI bridge: ATI Technologies Inc SB900 PCI to PCI bridge (PCIE port 2)
>>>     00:15.3 PCI bridge: ATI Technologies Inc SB900 PCI to PCI bridge (PCIE port 3)
>>>
>>> Am I misinterpreting the spec or is this the correct, if strict,
>>> interpretation?
>>>
>>> Alex
>>>
>> Well, in digging into the ACS-support in Root port question above,
>> I just found out about this ACS support status for multifunctions too.
>> I need more time to read/digest, but a quick read says MFDs should have
>> an ACS cap with relevant RO-status & control bits to direct/control
>> peer-to-peer and ACS-upstream.  Lack of an ACS struct implies(?)
>> peer-to-peer can happen, and thus, the above have to be in the same iommu group.
>> Unfortunately, I think a large lot of MFDs don't have ACS caps,
>> and don't/can't do peer-to-peer, so the above is heavy-handed,
>> albeit to spec.
>> Maybe we need a (large?) pci-quirk for the list of existing
>> MFDs that don't have ACS caps that would enable the above devices
>> to be in separate groups.
>> On the flip side, it solves some of the quirks for MFDs that
>> use the wrong BDF in their src-id dma packets! :) -- they default
>> to the same group now...
>
> Yep, sounds like you might agree with my patch, it's heavy handed, but
Yes, but we should design-in a quirk check list for MFDs,
b/c we already know some will fail when they should pass this check,
b/c the hw was made post ACS, or the vendors didn't see the benefit
of ACS reporting (even if the functionality was enabling bit-settings
that did nothing hw-wise), b/c they didn't understand the use case (VFIO,
dev-assignment to virt guests, etc.).

> seems to adhere to the spec.  That probably just means we need an option
> to allow a more lenient interpretation, that maybe we don't have to
> support.  Thanks,
Right.  As I said, a hook to do quirk-level additions from the get-go
would speed this expected need/addition.
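
For the record, a hedged sketch of the shape such a quirk list might take:
a static table of multifunction devices known not to do internal
peer-to-peer despite lacking an ACS capability.  The IDs below are
placeholders, not validated hardware:

    /* Hypothetical quirk table: MFDs with no ACS capability that are
     * known (per vendor) to be incapable of internal peer-to-peer,
     * so they may be treated as if ACS isolation were in effect. */
    static const struct pci_device_id mfd_acs_quirk_table[] = {
            { PCI_DEVICE(0x8086, 0xffff) }, /* placeholder entry */
            { 0 }
    };

    static bool mfd_quirk_acs_enabled(struct pci_dev *pdev)
    {
            return pci_match_id(mfd_acs_quirk_table, pdev) != NULL;
    }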

>
> Alex
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/13] pci: New pci_acs_enabled()
@ 2012-05-21 14:59                       ` Alex Williamson
  0 siblings, 0 replies; 129+ messages in thread
From: Alex Williamson @ 2012-05-21 14:59 UTC (permalink / raw)
  To: Don Dutile
  Cc: Bjorn Helgaas, kvm, B07421, aik, benh, linux-pci, agraf,
	qemu-devel, chrisw, B08248, iommu, gregkh, avi, benve, dwmw2,
	linux-kernel, david

On Mon, 2012-05-21 at 09:31 -0400, Don Dutile wrote:
> On 05/18/2012 10:47 PM, Alex Williamson wrote:
> > On Fri, 2012-05-18 at 19:00 -0400, Don Dutile wrote:
> >> On 05/18/2012 06:02 PM, Alex Williamson wrote:
> >>> On Wed, 2012-05-16 at 09:29 -0400, Don Dutile wrote:
> >>>> On 05/15/2012 05:09 PM, Alex Williamson wrote:
> >>>>> On Tue, 2012-05-15 at 13:56 -0600, Bjorn Helgaas wrote:
> >>>>>> On Mon, May 14, 2012 at 4:49 PM, Alex Williamson
> >>>>>> <alex.williamson@redhat.com>    wrote:
> >>>>>>> On Mon, 2012-05-14 at 16:02 -0600, Bjorn Helgaas wrote:
> >>>>>>>> On Fri, May 11, 2012 at 4:56 PM, Alex Williamson
> >>>>>>>> <alex.williamson@redhat.com>    wrote:
> >>>>>>>>> In a PCIe environment, transactions aren't always required to
> >>>>>>>>> reach the root bus before being re-routed.  Peer-to-peer DMA
> >>>>>>>>> may actually not be seen by the IOMMU in these cases.  For
> >>>>>>>>> IOMMU groups, we want to provide IOMMU drivers a way to detect
> >>>>>>>>> these restrictions.  Provided with a PCI device, pci_acs_enabled
> >>>>>>>>> returns the furthest downstream device with a complete PCI ACS
> >>>>>>>>> chain.  This information can then be used in grouping to create
> >>>>>>>>> fully isolated groups.  ACS chain logic extracted from libvirt.
> >>>>>>>>
> >>>>>>>> The name "pci_acs_enabled()" sounds like it returns a boolean, but it doesn't.
> >>>>>>>
> >>>>>>> Right, maybe this should be:
> >>>>>>>
> >>>>>>> struct pci_dev *pci_find_upstream_acs(struct pci_dev *pdev);
> >>>>>>>
> >>>> +1; there is a global in the PCI code, pci_acs_enable,
> >>>> and a function pci_enable_acs(), which the above name certainly
> >>>> confuses.  I recommend  pci_find_top_acs_bridge()
> >>>> would be most descriptive.
> >> Finally, with my email filters fixed, I can see this email... :)
> >
> > Welcome back ;)
> >
> Indeed... and I recvd 3 copies of this reply,
> so the pendulum has flipped the other direction... ;-)
> 
> >>> Yep, the new API I'm working with is:
> >>>
> >>> bool pci_acs_enabled(struct pci_dev *pdev, u16 acs_flags);
> >>> bool pci_acs_path_enabled(struct pci_dev *start,
> >>>                             struct pci_dev *end, u16 acs_flags);
> >>>
> >> ok.
> >>
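
For readers following the thread: a minimal sketch of how such a path check
could walk the topology, assuming each device's upstream bridge is reachable
via pdev->bus->self (an illustration of the proposed semantics, not
necessarily the code in the series):

    /* Every bridge from 'start' up to (and including) 'end' must have
     * the requested ACS flags enabled; end == NULL means walk all the
     * way to the root bus. */
    bool pci_acs_path_enabled(struct pci_dev *start,
                              struct pci_dev *end, u16 acs_flags)
    {
            struct pci_dev *pdev, *parent = start;

            do {
                    pdev = parent;

                    if (!pci_acs_enabled(pdev, acs_flags))
                            return false;

                    if (pci_is_root_bus(pdev->bus))
                            return (end == NULL);

                    parent = pdev->bus->self;
            } while (pdev != end);

            return true;
    }
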
> >>>>>>>> I'm not sure what "a complete PCI ACS chain" means.
> >>>>>>>>
> >>>>>>>> The function starts from "dev" and searches *upstream*, so I'm
> >>>>>>>> guessing it returns the root of a subtree that must be contained in a
> >>>>>>>> group.
> >>>>>>>
> >>>>>>> Any intermediate switch between an endpoint and the root bus can
> >>>>>>> redirect a dma access without iommu translation,
> >>>>>>
> >>>>>> Is this "redirection" just the normal PCI bridge forwarding that
> >>>>>> allows peer-to-peer transactions, i.e., the rule (from P2P bridge
> >>>>>> spec, rev 1.2, sec 4.1) that the bridge apertures define address
> >>>>>> ranges that are forwarded from primary to secondary interface, and the
> >>>>>> inverse ranges are forwarded from secondary to primary?  For example,
> >>>>>> here:
> >>>>>>
> >>>>>>                       ^
> >>>>>>                       |
> >>>>>>              +--------+-------+
> >>>>>>              |                |
> >>>>>>       +------+-----+    +-----++-----+
> >>>>>>       | Downstream |    | Downstream |
> >>>>>>       |    Port    |    |    Port    |
> >>>>>>       |   06:05.0  |    |   06:06.0  |
> >>>>>>       +------+-----+    +------+-----+
> >>>>>>              |                 |
> >>>>>>         +----v----+       +----v----+
> >>>>>>         | Endpoint|       | Endpoint|
> >>>>>>         | 07:00.0 |       | 08:00.0 |
> >>>>>>         +---------+       +---------+
> >>>>>>
> >>>>>> that rule is all that's needed for a transaction from 07:00.0 to be
> >>>>>> forwarded from upstream to the internal switch bus 06, then claimed by
> >>>>>> 06:06.0 and forwarded downstream to 08:00.0.  This is plain old PCI,
> >>>>>> nothing specific to PCIe.
> >>>>>
> >>>>> Right, I think the main PCI difference is the point-to-point nature of
> >>>>> PCIe vs legacy PCI bus.  On a legacy PCI bus there's no way to prevent
> >>>>> devices talking to each other, but on PCIe the transaction makes a
> >>>>> U-turn at some point and heads out another downstream port.  ACS allows
> >>>>> us to prevent that from happening.
> >>>>>
> >>>> detail: PCIe up/downstream routing is really done by an internal switch;
> >>>>            ACS overrides the legacy PCI base/limit address routing and *forces*
> >>>>            the switch to always route the transaction from a downstream port
> >>>>            to the upstream port.
> >>>>
> >>>>>> I don't understand ACS very well, but it looks like it basically
> >>>>>> provides ways to prevent that peer-to-peer forwarding, so transactions
> >>>>>> would be sent upstream toward the root (and specifically, the IOMMU)
> >>>>>> instead of being directly claimed by 06:06.0.
> >>>>>
> >>>>> Yep, that's my meager understanding as well.
> >>>>>
> >>>> +1
> >>>>
> >>>>>>> so we're looking for
> >>>>>>> the furthest upstream device for which acs is enabled all the way up to
> >>>>>>> the root bus.
> >>>>>>
> >>>>>> Correct me if this is wrong: To force device A's DMAs to be processed
> >>>>>> by an IOMMU, ACS must be enabled on the root port and every downstream
> >>>>>> port along the path to A.
> >>>>>
> >>>>> Yes, modulo this comment in libvirt source:
> >>>>>
> >>>>>        /* if we have no parent, and this is the root bus, ACS doesn't come
> >>>>>         * into play since devices on the root bus can't P2P without going
> >>>>>         * through the root IOMMU.
> >>>>>         */
> >>>>>
> >>>> Correct. PCIe spec says roots must support ACS. I believe all the
> >>>> root bridges that have an IOMMU have ACS wired in/on.
> >>>
> >>> Would you mind looking for the paragraph that says this?  I'd rather
> >>> code this into the iommu driver callers than core PCI code if this is
> >>> just a platform standard.
> >>>
> >> In section 6.12.1.1 of PCIe Base spec, rev 3.0, it states:
> >> ACS upstream fwding: Must be implemented by Root Ports if the RC supports
> >>                        Redirected Request Validation;
> >
> > Huh?  (If support ACS.RR then must support ACS.UF) != must support ACS.
> >
> UF?

Upstream Forwarding.  So basically that spec quote is saying that if the
RC supports one aspect of ACS then it must support this other.  But that
doesn't say to me that it must support ACS to begin with.
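
In capability-bit terms that reading is a conditional between two ACS
features, not a mandate that ACS exist; a hedged illustration (assuming the
PCI_ACS_RR/PCI_ACS_UF definitions and a port that actually exposes an ACS
capability at 'pos'):

    u16 cap;

    /* Per the reading above: RR advertised without UF would be
     * suspect, but a port with neither is still spec-legal. */
    pci_read_config_word(pdev, pos + PCI_ACS_CAP, &cap);
    if ((cap & PCI_ACS_RR) && !(cap & PCI_ACS_UF))
            dev_warn(&pdev->dev,
                     "ACS: Request Redirect without Upstream Forwarding?\n");
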

> >> -- which means, if a Root port allows a peer-to-peer transaction to another
> >>      one of its ports, then it has to support ACS.
> >
> > I don't get that at all from 6.12.1.1, especially given the first
> > sentence of that section:
> >
> >          This section applies to Root Ports and Downstream Switch Ports
> >          that implement an ACS Extended Capability structure.
> >
> >
> hmmm, well I did.  The root port section is different from the downstream
> port section as well.  Downstream ports *must* support peer-xfers due to
> positive decoding of base/limit addresses, and ACS is optional in downstream
> ports.  Peer-to-peer btwn root ports is optional.
> so, I don't get what you don't get... ;-)
> .. but, I understand how the spec can be read/interpreted differently,
>     given its clarity (did lawyers write it?!?!!), so I could be interpreting
>     incorrectly.
> 
> >> So, this means that:
> >> (a) if a Root complex with multiple ports can't do peer-to-peer to another port,
> >>       ACS isn't needed
> >> (b) if a Root complex w/multiple ports can do peer-to-peer to another port,
> >>       it must have ACS capability if it does...
> >> and since the linux code turns on ACS for all ports with an ACS cap,
> >> it degenerates (in Linux) that all Root ports are implementing the
> >> end functionality of ACS==on, all traffic goes up to IOMMU in RC.
> >>
> I thought I explained how I interpreted the root-port part of ACS above,
> so maybe you can tell me how you think my interpretation is incorrect.

I don't know that it's incorrect, but I don't see how you're making the
leap you are.  I think the version I have now of the patch leaves this
nicely to the iommu drivers.  It would make sense, not that hardware
always makes sense, that anything with an iommu is going to hardwire
translations through the iommu when enabled or they couldn't reasonably
do any kind of peer-to-peer with an iommu.

> >>>>> So we assume that a redirect at the point of the iommu will factor in
> >>>>> iommu translation.
> >>>>>
> >>>>>> If so, I think you're trying to find out the closest upstream device X
> >>>>>> such that everything leading to X has ACS enabled.  Every device below
> >>>>>> X can DMA freely to other devices below X, so they would all have to
> >>>>>> be in the same isolated group.
> >>>>>
> >>>>> Yes
> >>>>>
> >>>>>> I tried to work through some examples to develop some intuition about this:
> >>>>>
> >>>>> (inserting fixed url)
> >>>>>> http://www.asciiflow.com/#3736558963405980039
> >>>>>
> >>>>>> pci_acs_enabled(00:00.0) = 00:00.0 (on root bus (but doesn't it matter
> >>>>>> if 00:00.0 is PCIe or if RP has ACS?))
> >>>>>
> >>>>> Hmm, the latter is the assumption above.  For the former, I think
> >>>>> libvirt was probably assuming that PCI devices must have a PCIe device
> >>>>> upstream from them because x86 doesn't have assignment friendly IOMMUs
> >>>>> except on PCIe.  I'll need to work on making that more generic.
> >>>>>
> >>>>>> pci_acs_enabled(00:01.0) = 00:01.0 (on root bus)
> >>>>>> pci_acs_enabled(01:00.0) = 01:00.0 (acs_dev = 00:01.0, 01:00.0 is not
> >>>>>> PCIe; seems wrong)
> >>>>>
> >>>>> Oops, I'm calling pci_find_upstream_pcie_bridge() first on any of my
> >>>>> input devices, so this was passing for me.  I'll need to incorporate
> >>>>> that generically.
> >>>>>
> >>>>>> pci_acs_enabled(00:02.0) = 00:02.0 (on root bus; seems wrong if RP
> >>>>>> doesn't have ACS)
> >>>>>
> >>>>> Yeah, let me validate the libvirt assumption.  I see ACS on my root
> >>>>> port, so maybe they're just assuming it's always enabled or that the
> >>>>> precedence favors IOMMU translation.  I'm also starting to think that we
> >>>>> might want "from" and "to" struct pci_dev parameters to make it more
> >>>>> flexible where the iommu lives in the system.
> >>>>>
> >>>> see comment above wrt root ports that have IOMMUs in them.
> >>>
> >>> Except it really seems to be platform convention where the IOMMU lives.
> >>> The DMAR for VT-d describes which devices and hierarchies a DRHD is used
> >>> for and from that we can make assumptions about where it physically
> >>> lives.  AMD-Vi exposes a PCI device as the IOMMU, but it's just a peer
> >>> function on the root bus.  For now I'm just allowing
> >>> pci_acs_path_enabled to take NULL for an end, which means "up to the
> >>> root bus".
> >>>
> >> ATM, VT-d IOMMUs are only in the RCs, so, ACS at each downstream port
> >> in a tree would/should return 'true' to the acs_enabled check when it gets to a Root port.
> >> For AMD-Vi, I thought the same held true, ATM, but I have to dig through
> >> yet-another spec (or ask Joerg to check in to this thread & provide the details).
> >
> > But are we programming to convention or spec?  And I'm still confused
> > about why we assume the root port isn't susceptible to redirection
> > before IOMMU translation.  One of the benefits of the
> > pci_acs_path_enable() API is that it pushes convention out to the IOMMU
> > driver.  So it's intel-iommu.c's problem whether to test for ACS support
> > to the RC or to a given level (and for that matter know whether IOMMU
> > translation takes precedence over redirection in the RC).
> >
> Agreed.  The best design here would be for intel-iommu.c to test for ACS
> wrt the root port (not to be confused with the root complex) since Intel
> IOMMUs are not 'attached' to a PCI bus.  Now, AMD's IOMMUs are attached to
> a PCI bus, so maybe it can be simplified in that case.... but it may be
> best to mimic the iommu-set-acs-for-root-port attribute in the same manner.
> 
> >>>>
> >>>>>> pci_acs_enabled(02:00.0) = 00:02.0 (acs_dev = 00:02.0, 02:00.0 has no ACS cap)
> >>>>>> pci_acs_enabled(03:00.0) = 00:02.0 (acs_dev = 00:02.0)
> >>>>>> pci_acs_enabled(02:01.0) = 02:01.0 (acs_dev = 00:02.0, 02:01.0 has ACS enabled)
> >>>>>> pci_acs_enabled(04:00.0) = 04:00.0 (acs_dev = 02:01.0, 04:00.0 is not
> >>>>>> a bridge; seems wrong if 04:00 is a multi-function device)
> >>>>>
> >>>>> AIUI, ACS is not an endpoint property, so this is what should happen.  I
> >>>>> don't think multifunction plays a role other than how much do we trust
> >>>>> the implementation to not allow back channels between functions (the
> >>>>> answer should probably be not at all).
> >>>>>
> >>>> correct. ACS is a *bridge* property.
> >>>> The unknown wrt multifunction devices is that such devices *could* be implemented
> >>>> by a hidden (not responding to PCI cfg accesses from downstream port) PCI bridge
> >>>> btwn the functions within a device.
> >>>> Such a bridge could allow peer-to-peer xactions and there is no way for OS's to
> >>>> force ACS.  So, one has to ask the hw vendors if such a hidden device exists
> >>>> in the implementation, and whether peer-to-peer is enabled/allowed -- a hidden PCI
> >>>> bridge/PCIe-switch could just be hardwired to push all IO to upstream port,
> >>>> and allow the parent bridge to re-route it back down if peer-to-peer is desired.
> >>>> Debate exists whether multifunction devices are 'secure' b/c of this unknown.
> >>>> Maybe a PCIe (min., SRIOV) spec change is needed in this area to
> >>>> determine this status about a device (via pci cfg/cap space).
> >>>
> >>> Well, there is actually a section of the ACS part of the spec
> >>> identifying valid flags for multifunction devices.  Secretly I'd like to
> >>> use this as justification for blacklisting all multifunction devices
> >>> that don't explicitly support ACS, but that makes for pretty coarse
> >>> granularity.  For instance, all these devices end up in a group:
> >>>
> >>>      +-14.0  ATI Technologies Inc SBx00 SMBus Controller
> >>>      +-14.2  ATI Technologies Inc SBx00 Azalia (Intel HDA)
> >>>      +-14.3  ATI Technologies Inc SB7x0/SB8x0/SB9x0 LPC host controller
> >>>      +-14.4-[05]----0e.0  VIA Technologies, Inc. VT6306/7/8 [Fire II(M)] IEEE 1394 OHCI Controller
> >>>
> >>>     00:14.4 PCI bridge: ATI Technologies Inc SBx00 PCI to PCI Bridge (rev 40)
> >>>
> >>> And these in another:
> >>>
> >>>      +-15.0-[06]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller
> >>>      +-15.1-[07]----00.0  Etron Technology, Inc. EJ168 USB 3.0 Host Controller
> >>>      +-15.2-[08]--
> >>>      +-15.3-[09]--
> >>>
> >>>     00:15.0 PCI bridge: ATI Technologies Inc SB700/SB800/SB900 PCI to PCI bridge (PCIE port 0)
> >>>     00:15.1 PCI bridge: ATI Technologies Inc SB700/SB800/SB900 PCI to PCI bridge (PCIE port 1)
> >>>     00:15.2 PCI bridge: ATI Technologies Inc SB900 PCI to PCI bridge (PCIE port 2)
> >>>     00:15.3 PCI bridge: ATI Technologies Inc SB900 PCI to PCI bridge (PCIE port 3)
> >>>
> >>> Am I misinterpreting the spec or is this the correct, if strict,
> >>> interpretation?
> >>>
> >>> Alex
> >>>
> >> Well, in digging into the ACS-support in Root port question above,
> >> I just found out about this ACS support status for multifunctions too.
> >> I need more time to read/digest, but a quick read says MFDs should have
> >> an ACS cap with relevant RO-status & control bits to direct/control
> >> peer-to-peer and ACS-upstream.  Lack of an ACS struct implies(?)
> >> peer-to-peer can happen, and thus, the above have to be in the same iommu group.
> >> Unfortunately, I think a large number of MFDs don't have ACS caps,
> >> and don't/can't do peer-to-peer, so the above is heavy-handed,
> >> albeit to spec.
> >> Maybe we need a (large?) pci-quirk for the list of existing
> >> MFDs that don't have ACS caps that would enable the above devices
> >> to be in separate groups.
> >> On the flip side, it solves some of the quirks for MFDs that
> >> use the wrong BDF in their src-id dma packets! :) -- they default
> >> to the same group now...
> >
> > Yep, sounds like you might agree with my patch, it's heavy handed, but
> Yes, but we should design-in a quirk check list for MFDs,
> b/c we already know some will fail when they should pass this check,
> b/c the hw predates ACS, or the vendors didn't see the benefit
> of ACS reporting (even if the functionality was enabling bit-settings
> that did nothing hw-wise), b/c they didn't understand the use case (VFIO,
> dev-assignment to virt guests, etc.).
> 
> > seems to adhere to the spec.  That probably just means we need an option
> > to allow a more lenient interpretation, that maybe we don't have to
> > support.  Thanks,
> Right.  As I said, a hook to do quirk-level additions from the get-go
> would speed this expected need/addition.

Ok, a quirk should be easy there.  Thanks,

Alex
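
As a rough illustration of the quirk hook being discussed, here is a minimal
sketch in the direction the thread suggests; the table, the helper name, and
where it gets consulted are assumptions for illustration, not part of the
posted series:

#include <linux/pci.h>

/* Sketch only: whitelist of multifunction devices known not to do
 * peer-to-peer despite lacking an ACS capability.  A grouping check
 * could consult this before falling back to the strict
 * all-functions-in-one-group behavior. */
static const struct pci_device_id mfd_acs_whitelist[] = {
	/* { PCI_VDEVICE(VENDOR, DEVICE) }, filled in per quirk */
	{ 0 }
};

static bool pci_mfd_acs_quirk(struct pci_dev *pdev)
{
	/* Treat listed MFDs as if they implemented the requested
	 * ACS controls. */
	return pdev->multifunction &&
	       pci_match_id(mfd_acs_whitelist, pdev) != NULL;
}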


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 11/13] pci: Create common pcibios_err_to_errno
@ 2012-05-21 17:55     ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 129+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-05-21 17:55 UTC (permalink / raw)
  To: Alex Williamson
  Cc: benh, aik, david, joerg.roedel, dwmw2, chrisw, agraf, benve,
	aafabbri, B08248, B07421, avi, kvm, qemu-devel, iommu, linux-pci,
	linux-kernel, gregkh, bhelgaas

On Fri, May 11, 2012 at 04:56:44PM -0600, Alex Williamson wrote:
> For returning errors out to non-PCI code.  Re-name xen's version.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> ---
> 
>  drivers/xen/xen-pciback/conf_space.c |    6 +++---
>  include/linux/pci.h                  |   26 ++++++++++++++++++++++++++
>  2 files changed, 29 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/xen/xen-pciback/conf_space.c b/drivers/xen/xen-pciback/conf_space.c
> index 30d7be0..46ae0f9 100644
> --- a/drivers/xen/xen-pciback/conf_space.c
> +++ b/drivers/xen/xen-pciback/conf_space.c
> @@ -124,7 +124,7 @@ static inline u32 merge_value(u32 val, u32 new_val, u32 new_val_mask,
>  	return val;
>  }
>  
> -static int pcibios_err_to_errno(int err)
> +static int xen_pcibios_err_to_errno(int err)
>  {
>  	switch (err) {
>  	case PCIBIOS_SUCCESSFUL:
> @@ -202,7 +202,7 @@ out:
>  		       pci_name(dev), size, offset, value);
>  
>  	*ret_val = value;
> -	return pcibios_err_to_errno(err);
> +	return xen_pcibios_err_to_errno(err);
>  }
>  
>  int xen_pcibk_config_write(struct pci_dev *dev, int offset, int size, u32 value)
> @@ -290,7 +290,7 @@ int xen_pcibk_config_write(struct pci_dev *dev, int offset, int size, u32 value)
>  		}
>  	}
>  
> -	return pcibios_err_to_errno(err);
> +	return xen_pcibios_err_to_errno(err);
>  }
>  
>  void xen_pcibk_config_free_dyn_fields(struct pci_dev *dev)
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index b437225..20a8f2e 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -467,6 +467,32 @@ static inline bool pci_dev_msi_enabled(struct pci_dev *pci_dev) { return false;
>  #define PCIBIOS_SET_FAILED		0x88
>  #define PCIBIOS_BUFFER_TOO_SMALL	0x89
>  
> +/*
> + * Translate above to generic errno for passing back through non-pci.
> + */
> +static inline int pcibios_err_to_errno(int err)
> +{
> +	if (err <= PCIBIOS_SUCCESSFUL)
> +		return err; /* Assume already errno */
> +
> +	switch (err) {
> +	case PCIBIOS_FUNC_NOT_SUPPORTED:
> +		return -ENOENT;
> +	case PCIBIOS_BAD_VENDOR_ID:
> +		return -EINVAL;
> +	case PCIBIOS_DEVICE_NOT_FOUND:
> +		return -ENODEV;
> +	case PCIBIOS_BAD_REGISTER_NUMBER:
> +		return -EFAULT;
> +	case PCIBIOS_SET_FAILED:
> +		return -EIO;
> +	case PCIBIOS_BUFFER_TOO_SMALL:
> +		return -ENOSPC;
> +	}
> +
> +	return -ENOTTY;
> +}
> +
>  /* Low-level architecture-dependent routines */
>  
>  struct pci_ops {
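
As a usage sketch (not part of the patch), a caller that must hand a
config-space result back to non-PCI code might translate it like this;
the function name is illustrative:

#include <linux/pci.h>

/* Sketch: convert the PCIBIOS_* result of a config accessor into a
 * normal errno before returning it outside the PCI layer. */
static int example_read_vendor(struct pci_dev *pdev, u16 *vendor)
{
	int err = pci_read_config_word(pdev, PCI_VENDOR_ID, vendor);

	return pcibios_err_to_errno(err);
}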

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/13] pci: New pci_acs_enabled()
@ 2012-05-21 18:14                         ` Don Dutile
  0 siblings, 0 replies; 129+ messages in thread
From: Don Dutile @ 2012-05-21 18:14 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Bjorn Helgaas, kvm, B07421, aik, benh, linux-pci, agraf,
	qemu-devel, chrisw, B08248, iommu, gregkh, avi, benve, dwmw2,
	linux-kernel, david

On 05/21/2012 10:59 AM, Alex Williamson wrote:
> On Mon, 2012-05-21 at 09:31 -0400, Don Dutile wrote:
>> On 05/18/2012 10:47 PM, Alex Williamson wrote:
>>> On Fri, 2012-05-18 at 19:00 -0400, Don Dutile wrote:
>>>> On 05/18/2012 06:02 PM, Alex Williamson wrote:
>>>>> On Wed, 2012-05-16 at 09:29 -0400, Don Dutile wrote:
>>>>>> On 05/15/2012 05:09 PM, Alex Williamson wrote:
>>>>>>> On Tue, 2012-05-15 at 13:56 -0600, Bjorn Helgaas wrote:
>>>>>>>> On Mon, May 14, 2012 at 4:49 PM, Alex Williamson
>>>>>>>> <alex.williamson@redhat.com>     wrote:
>>>>>>>>> On Mon, 2012-05-14 at 16:02 -0600, Bjorn Helgaas wrote:
>>>>>>>>>> On Fri, May 11, 2012 at 4:56 PM, Alex Williamson
>>>>>>>>>> <alex.williamson@redhat.com>     wrote:
>>>>>>>>>>> In a PCIe environment, transactions aren't always required to
>>>>>>>>>>> reach the root bus before being re-routed.  Peer-to-peer DMA
>>>>>>>>>>> may actually not be seen by the IOMMU in these cases.  For
>>>>>>>>>>> IOMMU groups, we want to provide IOMMU drivers a way to detect
>>>>>>>>>>> these restrictions.  Provided with a PCI device, pci_acs_enabled
>>>>>>>>>>> returns the furthest downstream device with a complete PCI ACS
>>>>>>>>>>> chain.  This information can then be used in grouping to create
>>>>>>>>>>> fully isolated groups.  ACS chain logic extracted from libvirt.
>>>>>>>>>>
>>>>>>>>>> The name "pci_acs_enabled()" sounds like it returns a boolean, but it doesn't.
>>>>>>>>>
>>>>>>>>> Right, maybe this should be:
>>>>>>>>>
>>>>>>>>> struct pci_dev *pci_find_upstream_acs(struct pci_dev *pdev);
>>>>>>>>>
>>>>>> +1; there is a global in the PCI code, pci_acs_enable,
>>>>>> and a function pci_enable_acs(), which the above name certainly
>>>>>> confuses.  I recommend  pci_find_top_acs_bridge()
>>>>>> would be most descriptive.
>>>> Finally, with my email filters fixed, I can see this email... :)
>>>
>>> Welcome back ;)
>>>
>> Indeed... and I recvd 3 copies of this reply,
>> so the pendulum has flipped the other direction... ;-)
>>
>>>>> Yep, the new API I'm working with is:
>>>>>
>>>>> bool pci_acs_enabled(struct pci_dev *pdev, u16 acs_flags);
>>>>> bool pci_acs_path_enabled(struct pci_dev *start,
>>>>>                              struct pci_dev *end, u16 acs_flags);
>>>>>
>>>> ok.
>>>>
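
For concreteness, a sketch of how an IOMMU driver might use the pair of
checks quoted above; the flag mask is illustrative (the PCI_ACS_* bits
come from pci_regs.h), and the NULL end is the "up to the root bus" form
discussed below:

#include <linux/pci.h>

#define ISOLATION_FLAGS (PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF)

/* Sketch: a device can be considered isolated only if every port
 * between it and the root bus implements the requested ACS controls. */
static bool device_is_isolated(struct pci_dev *pdev)
{
	return pci_acs_path_enabled(pdev, NULL, ISOLATION_FLAGS);
}
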
>>>>>>>>>> I'm not sure what "a complete PCI ACS chain" means.
>>>>>>>>>>
>>>>>>>>>> The function starts from "dev" and searches *upstream*, so I'm
>>>>>>>>>> guessing it returns the root of a subtree that must be contained in a
>>>>>>>>>> group.
>>>>>>>>>
>>>>>>>>> Any intermediate switch between an endpoint and the root bus can
>>>>>>>>> redirect a dma access without iommu translation,
>>>>>>>>
>>>>>>>> Is this "redirection" just the normal PCI bridge forwarding that
>>>>>>>> allows peer-to-peer transactions, i.e., the rule (from P2P bridge
>>>>>>>> spec, rev 1.2, sec 4.1) that the bridge apertures define address
>>>>>>>> ranges that are forwarded from primary to secondary interface, and the
>>>>>>>> inverse ranges are forwarded from secondary to primary?  For example,
>>>>>>>> here:
>>>>>>>>
>>>>>>>>                        ^
>>>>>>>>                        |
>>>>>>>>               +--------+-------+
>>>>>>>>               |                |
>>>>>>>>        +------+-----+    +-----++-----+
>>>>>>>>        | Downstream |    | Downstream |
>>>>>>>>        |    Port    |    |    Port    |
>>>>>>>>        |   06:05.0  |    |   06:06.0  |
>>>>>>>>        +------+-----+    +------+-----+
>>>>>>>>               |                 |
>>>>>>>>          +----v----+       +----v----+
>>>>>>>>          | Endpoint|       | Endpoint|
>>>>>>>>          | 07:00.0 |       | 08:00.0 |
>>>>>>>>          +---------+       +---------+
>>>>>>>>
>>>>>>>> that rule is all that's needed for a transaction from 07:00.0 to be
>>>>>>>> forwarded from upstream to the internal switch bus 06, then claimed by
>>>>>>>> 06:06.0 and forwarded downstream to 08:00.0.  This is plain old PCI,
>>>>>>>> nothing specific to PCIe.
>>>>>>>
>>>>>>> Right, I think the main PCI difference is the point-to-point nature of
>>>>>>> PCIe vs legacy PCI bus.  On a legacy PCI bus there's no way to prevent
>>>>>>> devices talking to each other, but on PCIe the transaction makes a
>>>>>>> U-turn at some point and heads out another downstream port.  ACS allows
>>>>>>> us to prevent that from happening.
>>>>>>>
>>>>>> detail: PCIe up/downstream routing is really done by an internal switch;
>>>>>>             ACS forces the legacy, PCI base-limit address routing and *forces*
>>>>>>             the switch to always route the transaction from a downstream port
>>>>>>             to the upstream port.
>>>>>>
>>>>>>>> I don't understand ACS very well, but it looks like it basically
>>>>>>>> provides ways to prevent that peer-to-peer forwarding, so transactions
>>>>>>>> would be sent upstream toward the root (and specifically, the IOMMU)
>>>>>>>> instead of being directly claimed by 06:06.0.
>>>>>>>
>>>>>>> Yep, that's my meager understanding as well.
>>>>>>>
>>>>>> +1
>>>>>>
>>>>>>>>> so we're looking for
>>>>>>>>> the furthest upstream device for which acs is enabled all the way up to
>>>>>>>>> the root bus.
>>>>>>>>
>>>>>>>> Correct me if this is wrong: To force device A's DMAs to be processed
>>>>>>>> by an IOMMU, ACS must be enabled on the root port and every downstream
>>>>>>>> port along the path to A.
>>>>>>>
>>>>>>> Yes, modulo this comment in libvirt source:
>>>>>>>
>>>>>>>         /* if we have no parent, and this is the root bus, ACS doesn't come
>>>>>>>          * into play since devices on the root bus can't P2P without going
>>>>>>>          * through the root IOMMU.
>>>>>>>          */
>>>>>>>
>>>>>> Correct. PCIe spec says roots must support ACS. I believe all the
>>>>>> root bridges that have an IOMMU have ACS wired in/on.
>>>>>
>>>>> Would you mind looking for the paragraph that says this?  I'd rather
>>>>> code this into the iommu driver callers than core PCI code if this is
>>>>> just a platform standard.
>>>>>
>>>> In section 6.12.1.1 of PCIe Base spec, rev 3.0, it states:
>>>> ACS upstream fwding: Must be implemented by Root Ports if the RC supports
>>>>                         Redirected Request Validation;
>>>
>>> Huh?  (If support ACS.RR then must support ACS.UF) != must support ACS.
>>>
>> UF?
>
> Upstream Forwarding.  So basically that spec quote is saying that if the
> RC supports one aspect of ACS then it must support this other.  But that
> doesn't say to me that it must support ACS to begin with.
>
It says that if the RC supports peer-to-peer btwn root ports, the root ports
must support ACS... at least that's how I understood it.
But, as we've seen, there's spec, and then there's reality...
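
Setting the spec reading aside, whether a given root port implements ACS
at all is observable from its extended capability list; a small probe
along these lines (illustrative, not from the series) shows which of the
controls under debate a port actually advertises:

#include <linux/pci.h>

/* Sketch: report whether a port exposes an ACS Extended Capability
 * and which controls it implements. */
static void report_acs(struct pci_dev *pdev)
{
	u16 cap;
	int pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);

	if (!pos) {
		dev_info(&pdev->dev, "no ACS capability\n");
		return;
	}

	pci_read_config_word(pdev, pos + PCI_ACS_CAP, &cap);
	dev_info(&pdev->dev, "ACS: SV%d RR%d CR%d UF%d\n",
		 !!(cap & PCI_ACS_SV), !!(cap & PCI_ACS_RR),
		 !!(cap & PCI_ACS_CR), !!(cap & PCI_ACS_UF));
}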

>>>> -- which means, if a Root port allows a peer-to-peer transaction to another
>>>>       one of its ports, then it has to support ACS.
>>>
>>> I don't get that at all from 6.12.1.1, especially given the first
>>> sentence of that section:
>>>
>>>           This section applies to Root Ports and Downstream Switch Ports
>>>           that implement an ACS Extended Capability structure.
>>>
>>>
>> hmmm, well I did.  The root port section is different from the Downstream ports
>> section as well.  Downstream ports *must* support peer-xfers due to positive decoding
>> of base/limit addresses, and ACS is optional in downstream ports.
>> Peer-to-peer btwn root ports is optional.
>> so, I don't get what you don't get... ;-)
>> .. but, I understand how the spec can be read/interpreted differently,
>>      given its clarity (did lawyers write it?!?!!), so I could be interpreting
>>      incorrectly.
>>
>>>> So, this means that:
>>>> (a) if a Root complex with multiple ports can't do peer-to-peer to another port,
>>>>        ACS isn't needed
>>>> (b) if a Root complex w/multiple ports can do peer-to-peer to another port,
>>>>        it must have ACS capability if it does...
>>>> and since the linux code turns on ACS for all ports with an ACS cap,
>>>> it degenerates (in Linux) that all Root ports are implementing the
>>>> end functionality of ACS==on, all traffic goes up to IOMMU in RC.
>>>>
>> I thought I explained how I interpreted the root-port part of ACS above,
>> so maybe you can tell me how you think my interpretation is incorrect.
>
> I don't know that it's incorrect, but I don't see how you're making the
> leap you are.  I think the version I have now of the patch leaves this
> nicely to the iommu drivers.  It would make sense, not that hardware
> always makes sense, that anything with an iommu is going to hardwire
> translations through the iommu when enabled, or they couldn't reasonably
> do any kind of peer-to-peer with the iommu.
>
Agreed. The version you have looks good,
and avoids/handles potential ACS cap idiosyncrasies in RCs.
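
A sketch of what leaving that convention to the driver might look like:
require ACS along the switch path but trust redirection at the root
complex.  pcie_find_root_port() is assumed here purely for illustration,
and whether the root port's own ACS bits must also be tested is exactly
the open question above:

/* Sketch: trust the RC to translate, require ACS below the root port. */
static bool isolated_below_root_port(struct pci_dev *pdev, u16 acs_flags)
{
	struct pci_dev *rp = pcie_find_root_port(pdev);	/* assumed helper */

	/* A device on the root bus can only reach peers through the
	 * root complex, which is presumed to translate. */
	if (!rp)
		return true;

	return pci_acs_path_enabled(pdev, rp, acs_flags);
}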

>>>>>>> So we assume that a redirect at the point of the iommu will factor in
>>>>>>> iommu translation.
>>>>>>>
>>>>>>>> If so, I think you're trying to find out the closest upstream device X
>>>>>>>> such that everything leading to X has ACS enabled.  Every device below
>>>>>>>> X can DMA freely to other devices below X, so they would all have to
>>>>>>>> be in the same isolated group.
>>>>>>>
>>>>>>> Yes
>>>>>>>
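
The walk just described -- climb toward the root and widen the group at any
bridge that does not force transactions upstream -- might look like the
sketch below.  find_isolation_root() is a hypothetical name, it reuses the
pci_acs_flags_enabled() helper sketched earlier, and it glosses over the
PCIe upstream-port and multifunction special cases argued later in this
thread:

    /* Return the topmost device whose subtree must share a group:
     * the highest ancestor bridge without ACS redirect enabled, or
     * the device itself if the whole upstream chain has ACS. */
    static struct pci_dev *find_isolation_root(struct pci_dev *pdev)
    {
        struct pci_dev *iso = pdev;
        struct pci_dev *parent = pdev->bus->self;  /* NULL on root bus */

        while (parent) {
            if (!pci_acs_flags_enabled(parent, PCI_ACS_RR | PCI_ACS_UF))
                iso = parent;  /* chain broken here: group widens */
            parent = parent->bus->self;
        }
        return iso;
    }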
>>>>>>>> I tried to work through some examples to develop some intuition about this:
>>>>>>>
>>>>>>> (inserting fixed url)
>>>>>>>> http://www.asciiflow.com/#3736558963405980039
>>>>>>>
>>>>>>>> pci_acs_enabled(00:00.0) = 00:00.0 (on root bus (but doesn't it matter
>>>>>>>> if 00:00.0 is PCIe or if RP has ACS?))
>>>>>>>
>>>>>>> Hmm, the latter is the assumption above.  For the former, I think
>>>>>>> libvirt was probably assuming that PCI devices must have a PCIe device
>>>>>>> upstream from them because x86 doesn't have assignment friendly IOMMUs
>>>>>>> except on PCIe.  I'll need to work on making that more generic.
>>>>>>>
>>>>>>>> pci_acs_enabled(00:01.0) = 00:01.0 (on root bus)
>>>>>>>> pci_acs_enabled(01:00.0) = 01:00.0 (acs_dev = 00:01.0, 01:00.0 is not
>>>>>>>> PCIe; seems wrong)
>>>>>>>
>>>>>>> Oops, I'm calling pci_find_upstream_pcie_bridge() first on any of my
>>>>>>> input devices, so this was passing for me.  I'll need to incorporate
>>>>>>> that generically.
>>>>>>>
>>>>>>>> pci_acs_enabled(00:02.0) = 00:02.0 (on root bus; seems wrong if RP
>>>>>>>> doesn't have ACS)
>>>>>>>
>>>>>>> Yeah, let me validate the libvirt assumption.  I see ACS on my root
>>>>>>> port, so maybe they're just assuming it's always enabled or that the
>>>>>>> precedence favors IOMMU translation.  I'm also starting to think that we
>>>>>>> might want "from" and "to" struct pci_dev parameters to make it more
>>>>>>> flexible where the iommu lives in the system.
>>>>>>>
>>>>>> see comment above wrt root ports that have IOMMUs in them.
>>>>>
>>>>> Except it really seems to be platform convention where the IOMMU lives.
>>>>> The DMAR for VT-d describes which devices and hierarchies a DRHD is used
>>>>> for and from that we can make assumptions about where it physically
>>>>> lives.  AMD-Vi exposes a PCI device as the IOMMU, but it's just a peer
>>>>> function on the root bus.  For now I'm just allowing
>>>>> pci_acs_path_enabled to take NULL for an end, which means "up to the
>>>>> root bus".
>>>>>
>>>> ATM, VT-d IOMMUs are only in the RCs, so ACS at each downstream port
>>>> in a tree would/should return 'true' to the acs_enabled check when it gets to a Root port.
>>>> For AMD-Vi, I thought the same held true, ATM, but I have to dig through
>>>> yet another spec (or ask Joerg to check in to this thread & provide the details).
>>>
>>> But are we programming to convention or spec?  And I'm still confused
>>> about why we assume the root port isn't susceptible to redirection
>>> before IOMMU translation.  One of the benefits of the
>>> pci_acs_path_enable() API is that it pushes convention out to the IOMMU
>>> driver.  So it's intel-iommu.c's problem whether to test for ACS support
>>> to the RC or to a given level (and for that matter know whether IOMMU
>>> translation takes precedence over redirection in the RC).
>>>
>> Agreed. The best design here would be for the intel-iommu.c to test for ACS
>> wrt the root port (not to be confused with the root complex) since Intel-IOMMUs
>> are not 'attached' to a PCI bus.  Now, AMD's IOMMUs are attached to a PCI bus,
>> so maybe it can be simplified in that case.... but it may be best to mimic
>> the iommu-set-acs-for-root-port attribute in the same manner.
>>
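
Under that design, an IOMMU driver's use of the proposed API might reduce to
something like the sketch below; the flag set and the wrapper name are
assumptions, not code from the series:

    /* Assumed flag set: Request Redirect + Upstream Forwarding must
     * be enabled on every bridge along the path for DMA to reliably
     * reach the IOMMU; the exact set is still open in this thread. */
    #define REQ_ACS_FLAGS  (PCI_ACS_RR | PCI_ACS_UF)

    static bool device_is_isolated(struct pci_dev *pdev)
    {
        /* end == NULL means "up to the root bus"; whether that is
         * the right stopping point is exactly the per-platform
         * convention each iommu driver decides under this API. */
        return pci_acs_path_enabled(pdev, NULL, REQ_ACS_FLAGS);
    }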
>>>>>>
>>>>>>>> pci_acs_enabled(02:00.0) = 00:02.0 (acs_dev = 00:02.0, 02:00.0 has no ACS cap)
>>>>>>>> pci_acs_enabled(03:00.0) = 00:02.0 (acs_dev = 00:02.0)
>>>>>>>> pci_acs_enabled(02:01.0) = 02:01.0 (acs_dev = 00:02.0, 02:01.0 has ACS enabled)
>>>>>>>> pci_acs_enabled(04:00.0) = 04:00.0 (acs_dev = 02:01.0, 04:00.0 is not
>>>>>>>> a bridge; seems wrong if 04:00 is a multi-function device)
>>>>>>>
>>>>>>> AIUI, ACS is not an endpoint property, so this is what should happen.  I
>>>>>>> don't think multifunction plays a role other than how much do we trust
>>>>>>> the implementation to not allow back channels between functions (the
>>>>>>> answer should probably be not at all).
>>>>>>>
>>>>>> correct. ACS is a *bridge* property.
>>>>>> The unknown wrt multifunction devices is that such devices *could* be implemented
>>>>>> by a hidden (not responding to PCI cfg accesses from downstream port) PCI bridge
>>>>>> btwn the functions within a device.
>>>>>> Such a bridge could allow peer-to-peer xactions and there is no way for OS's to
>>>>>> force ACS.  So, one has to ask the hw vendors if such a hidden device exists
>>>>>> in the implementation, and whether peer-to-peer is enabled/allowed -- a hidden PCI
>>>>>> bridge/PCIe-switch could just be hardwired to push all IO to upstream port,
>>>>>> and allow the parent bridge to re-route it back down if peer-to-peer is desired.
>>>>>> Debate exists whether multifunction devices are 'secure' b/c of this unknown.
>>>>>> Maybe a PCIe (min., SRIOV) spec change is needed in this area to
>>>>>> determine this status about a device (via pci cfg/cap space).
>>>>>
>>>>> Well, there is actually a section of the ACS part of the spec
>>>>> identifying valid flags for multifunction devices.  Secretly I'd like to
>>>>> use this as justification for blacklisting all multifunction devices
>>>>> that don't explicitly support ACS, but that makes for pretty coarse
>>>>> granularity.  For instance, all these devices end up in a group:
>>>>>
>>>>>       +-14.0  ATI Technologies Inc SBx00 SMBus Controller
>>>>>       +-14.2  ATI Technologies Inc SBx00 Azalia (Intel HDA)
>>>>>       +-14.3  ATI Technologies Inc SB7x0/SB8x0/SB9x0 LPC host controller
>>>>>       +-14.4-[05]----0e.0  VIA Technologies, Inc. VT6306/7/8 [Fire II(M)] IEEE 1394 OHCI Controller
>>>>>
>>>>>      00:14.4 PCI bridge: ATI Technologies Inc SBx00 PCI to PCI Bridge (rev 40)
>>>>>
>>>>> And these in another:
>>>>>
>>>>>       +-15.0-[06]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller
>>>>>       +-15.1-[07]----00.0  Etron Technology, Inc. EJ168 USB 3.0 Host Controller
>>>>>       +-15.2-[08]--
>>>>>       +-15.3-[09]--
>>>>>
>>>>>      00:15.0 PCI bridge: ATI Technologies Inc SB700/SB800/SB900 PCI to PCI bridge (PCIE port 0)
>>>>>      00:15.1 PCI bridge: ATI Technologies Inc SB700/SB800/SB900 PCI to PCI bridge (PCIE port 1)
>>>>>      00:15.2 PCI bridge: ATI Technologies Inc SB900 PCI to PCI bridge (PCIE port 2)
>>>>>      00:15.3 PCI bridge: ATI Technologies Inc SB900 PCI to PCI bridge (PCIE port 3)
>>>>>
>>>>> Am I misinterpreting the spec or is this the correct, if strict,
>>>>> interpretation?
>>>>>
>>>>> Alex
>>>>>
>>>> Well, in digging into the ACS-support in Root port question above,
>>>> I just found out about this ACS support status for multifunctions too.
>>>> I need more time to read/digest, but a quick read says MFDs should have
>>>> an ACS cap with relevant RO-status & control bits to direct/control
>>>> peer-to-peer and ACS-upstream.  Lack of an ACS struct implies(?)
>>>> peer-to-peer can happen, and thus, the above have to be in the same iommu group.
>>>> Unfortunately, I think a large lot of MFDs don't have ACS caps,
>>>> and don't/can't do peer-to-peer, so the above is heavy-handed,
>>>> albeit to spec.
>>>> Maybe we need a (large?) pci-quirk for the list of existing
>>>> MFDs that don't have ACS caps that would enable the above devices
>>>> to be in separate groups.
>>>> On the flip side, it solves some of the quirks for MFDs that
>>>> use the wrong BDF in their src-id dma packets! :) -- they default
>>>> to the same group now...
>>>
>>> Yep, sounds like you might agree with my patch, it's heavy handed, but
>> Yes, but we should design-in a quirk check list for MFDs,
>> b/c we already know some will fail when they should pass this check,
>> b/c the hw pre-dates ACS, or the vendors didn't see the benefit
>> of ACS reporting (even if the functionality was enabling bit-settings
>> that did nothing hw-wise), b/c they didn't understand the use case (VFIO,
>> dev-assignment to virt guests, etc.).
>>
>>> seems to adhere to the spec.  That probably just means we need an option
>>> to allow a more lenient interpretation, that maybe we don't have to
>>> support.  Thanks,
>> Right.  As I said, a hook for quirk-level additions from the get-go
>> would speed up this expected addition.
>
> Ok, a quirk should be easy there.  Thanks,
>
> Alex
>
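
One plausible shape for that quirk hook: a table of multifunction devices
known not to do peer-to-peer despite lacking an ACS capability, consulted
before falling back to the strict no-cap-means-shared-group rule.  All
names here, and the commented-out example entry, are illustrative rather
than taken from the posted patches:

    /* Multifunction devices vouched for by their vendors as doing no
     * internal peer-to-peer, even though they expose no ACS capability. */
    struct acs_quirk {
        u16 vendor;
        u16 device;
    };

    static const struct acs_quirk acs_safe_mfds[] = {
        /* { PCI_VENDOR_ID_ATI, 0x4385 },  e.g. the SBx00 SMBus above */
        { 0 }  /* sentinel */
    };

    /* Returns true if the device is quirked as isolated; a grouping
     * check would consult this before applying the strict rule. */
    static bool pci_mfd_acs_quirk(struct pci_dev *pdev)
    {
        const struct acs_quirk *q;

        for (q = acs_safe_mfds; q->vendor; q++)
            if (pdev->vendor == q->vendor && pdev->device == q->device)
                return true;
        return false;
    }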


^ permalink raw reply	[flat|nested] 129+ messages in thread

end of thread, other threads:[~2012-05-21 18:15 UTC | newest]

Thread overview: 129+ messages (download: mbox.gz / follow: Atom feed)
2012-05-11 22:55 [PATCH 00/13] IOMMU Groups + VFIO Alex Williamson
2012-05-11 22:55 ` [Qemu-devel] " Alex Williamson
2012-05-11 22:55 ` Alex Williamson
2012-05-11 22:55 ` [PATCH 01/13] driver core: Add iommu_group tracking to struct device Alex Williamson
2012-05-11 22:55   ` [Qemu-devel] " Alex Williamson
2012-05-11 22:55   ` Alex Williamson
2012-05-11 23:38   ` Greg KH
2012-05-11 23:38     ` [Qemu-devel] " Greg KH
2012-05-11 23:58     ` Alex Williamson
2012-05-11 23:58       ` [Qemu-devel] " Alex Williamson
2012-05-11 23:58       ` Alex Williamson
2012-05-12  0:00       ` Greg KH
2012-05-12  0:00         ` [Qemu-devel] " Greg KH
2012-05-11 22:55 ` [PATCH 02/13] iommu: IOMMU Groups Alex Williamson
2012-05-11 22:55   ` [Qemu-devel] " Alex Williamson
2012-05-11 22:55   ` Alex Williamson
2012-05-11 23:39   ` Greg KH
2012-05-11 23:39     ` [Qemu-devel] " Greg KH
2012-05-11 23:58     ` Alex Williamson
2012-05-11 23:58       ` [Qemu-devel] " Alex Williamson
2012-05-11 23:58       ` Alex Williamson
2012-05-14  1:16   ` David Gibson
2012-05-14  1:16     ` [Qemu-devel] " David Gibson
2012-05-14  1:16     ` David Gibson
2012-05-14 17:11     ` Alex Williamson
2012-05-14 17:11       ` [Qemu-devel] " Alex Williamson
2012-05-14 17:11       ` Alex Williamson
2012-05-15  2:03       ` David Gibson
2012-05-15  2:03         ` [Qemu-devel] " David Gibson
2012-05-15  2:03         ` David Gibson
2012-05-15  6:34         ` Alex Williamson
2012-05-15  6:34           ` [Qemu-devel] " Alex Williamson
2012-05-15  6:34           ` Alex Williamson
2012-05-17  3:29           ` David Gibson
2012-05-17  3:29             ` [Qemu-devel] " David Gibson
2012-05-17  3:29             ` David Gibson
2012-05-11 22:55 ` [PATCH 03/13] iommu: IOMMU groups for VT-d and AMD-Vi Alex Williamson
2012-05-11 22:55   ` [Qemu-devel] " Alex Williamson
2012-05-11 22:55   ` Alex Williamson
2012-05-17  3:37   ` David Gibson
2012-05-17  3:37     ` [Qemu-devel] " David Gibson
2012-05-17  3:37     ` David Gibson
2012-05-11 22:55 ` [PATCH 04/13] pci: New pci_dma_quirk() Alex Williamson
2012-05-11 22:55   ` [Qemu-devel] " Alex Williamson
2012-05-11 22:55   ` Alex Williamson
2012-05-17  3:39   ` David Gibson
2012-05-17  3:39     ` [Qemu-devel] " David Gibson
2012-05-17  3:39     ` David Gibson
2012-05-17  4:06     ` Alex Williamson
2012-05-17  4:06       ` [Qemu-devel] " Alex Williamson
2012-05-17  4:06       ` Alex Williamson
2012-05-17  7:19   ` Anonymous
2012-05-17  7:19     ` [Qemu-devel] " Anonymous
2012-05-17  7:19     ` Anonymous
2012-05-17 15:22     ` Alex Williamson
2012-05-17 15:22       ` [Qemu-devel] " Alex Williamson
2012-05-17 15:22       ` Alex Williamson
2012-05-11 22:56 ` [PATCH 05/13] pci: New pci_acs_enabled() Alex Williamson
2012-05-11 22:56   ` [Qemu-devel] " Alex Williamson
2012-05-11 22:56   ` Alex Williamson
2012-05-14 22:02   ` Bjorn Helgaas
2012-05-14 22:02     ` [Qemu-devel] " Bjorn Helgaas
2012-05-14 22:02     ` Bjorn Helgaas
2012-05-14 22:49     ` Alex Williamson
2012-05-14 22:49       ` [Qemu-devel] " Alex Williamson
2012-05-14 22:49       ` Alex Williamson
2012-05-15 19:56       ` Bjorn Helgaas
2012-05-15 19:56         ` [Qemu-devel] " Bjorn Helgaas
2012-05-15 19:56         ` Bjorn Helgaas
2012-05-15 20:05         ` Bjorn Helgaas
2012-05-15 20:05           ` [Qemu-devel] " Bjorn Helgaas
2012-05-15 20:05           ` Bjorn Helgaas
2012-05-15 21:09         ` Alex Williamson
2012-05-15 21:09           ` [Qemu-devel] " Alex Williamson
2012-05-15 21:09           ` Alex Williamson
2012-05-16 13:29           ` Don Dutile
2012-05-16 13:29             ` [Qemu-devel] " Don Dutile
2012-05-16 13:29             ` Don Dutile
2012-05-16 16:21             ` Alex Williamson
2012-05-16 16:21               ` [Qemu-devel] " Alex Williamson
2012-05-16 16:21               ` Alex Williamson
2012-05-16 19:36               ` Alex Williamson
2012-05-16 19:36                 ` [Qemu-devel] " Alex Williamson
2012-05-16 19:36                 ` Alex Williamson
2012-05-18 23:00               ` RESEND3: " Don Dutile
2012-05-18 23:00                 ` [Qemu-devel] " Don Dutile
2012-05-18 23:00                 ` Don Dutile
2012-05-19  2:47                 ` Alex Williamson
2012-05-19  2:47                   ` [Qemu-devel] " Alex Williamson
2012-05-19  2:47                   ` Alex Williamson
2012-05-21 13:31                   ` Don Dutile
2012-05-21 13:31                     ` [Qemu-devel] " Don Dutile
2012-05-21 13:31                     ` Don Dutile
2012-05-21 14:59                     ` Alex Williamson
2012-05-21 14:59                       ` [Qemu-devel] " Alex Williamson
2012-05-21 14:59                       ` Alex Williamson
2012-05-21 18:14                       ` Don Dutile
2012-05-21 18:14                         ` [Qemu-devel] " Don Dutile
2012-05-21 18:14                         ` Don Dutile
2012-05-11 22:56 ` [PATCH 06/13] iommu: Make use of DMA quirking and ACS enabled check for groups Alex Williamson
2012-05-11 22:56   ` [Qemu-devel] " Alex Williamson
2012-05-11 22:56   ` Alex Williamson
2012-05-11 22:56 ` [PATCH 07/13] vfio: VFIO core Alex Williamson
2012-05-11 22:56   ` [Qemu-devel] " Alex Williamson
2012-05-11 22:56   ` Alex Williamson
2012-05-11 22:56 ` [PATCH 08/13] vfio: Add documentation Alex Williamson
2012-05-11 22:56   ` [Qemu-devel] " Alex Williamson
2012-05-11 22:56   ` Alex Williamson
2012-05-11 22:56 ` [PATCH 09/13] vfio: x86 IOMMU implementation Alex Williamson
2012-05-11 22:56   ` [Qemu-devel] " Alex Williamson
2012-05-11 22:56   ` Alex Williamson
2012-05-11 22:56 ` [PATCH 10/13] pci: export pci_user functions for use by other drivers Alex Williamson
2012-05-11 22:56   ` [Qemu-devel] " Alex Williamson
2012-05-11 22:56   ` Alex Williamson
2012-05-14 21:20   ` Bjorn Helgaas
2012-05-14 21:20     ` [Qemu-devel] " Bjorn Helgaas
2012-05-14 21:20     ` Bjorn Helgaas
2012-05-11 22:56 ` [PATCH 11/13] pci: Create common pcibios_err_to_errno Alex Williamson
2012-05-11 22:56   ` [Qemu-devel] " Alex Williamson
2012-05-11 22:56   ` Alex Williamson
2012-05-21 17:55   ` Konrad Rzeszutek Wilk
2012-05-21 17:55     ` [Qemu-devel] " Konrad Rzeszutek Wilk
2012-05-21 17:55     ` Konrad Rzeszutek Wilk
2012-05-11 22:56 ` [PATCH 12/13] pci: Misc pci_reg additions Alex Williamson
2012-05-11 22:56   ` [Qemu-devel] " Alex Williamson
2012-05-11 22:56   ` Alex Williamson
2012-05-11 22:56 ` [PATCH 13/13] vfio: Add PCI device driver Alex Williamson
2012-05-11 22:56   ` [Qemu-devel] " Alex Williamson
2012-05-11 22:56   ` Alex Williamson
