KVM Archive on lore.kernel.org
From: Eric Auger <eric.auger@redhat.com>
To: eric.auger.pro@gmail.com, eric.auger@redhat.com,
	iommu@lists.linux-foundation.org, linux-kernel@vger.kernel.org,
	kvm@vger.kernel.org, kvmarm@lists.cs.columbia.edu,
	joro@8bytes.org, alex.williamson@redhat.com,
	jacob.jun.pan@linux.intel.com, yi.l.liu@intel.com,
	jean-philippe.brucker@arm.com, will.deacon@arm.com,
	robin.murphy@arm.com
Cc: kevin.tian@intel.com, ashok.raj@intel.com, marc.zyngier@arm.com,
	peter.maydell@linaro.org, vincent.stehle@arm.com,
	zhangfei.gao@gmail.com, tina.zhang@intel.com
Subject: [PATCH v9 11/11] vfio: Document nested stage control
Date: Thu, 11 Jul 2019 15:56:25 +0200
Message-ID: <20190711135625.20684-12-eric.auger@redhat.com> (raw)
In-Reply-To: <20190711135625.20684-1-eric.auger@redhat.com>

The VFIO API was enhanced to support nested stage control: several
new ioctls, one DMA FAULT region and an associated specific IRQ.

Let's document the process to follow to set up nested mode.

Signed-off-by: Eric Auger <eric.auger@redhat.com>

---

v8 -> v9:
- new layout for the DMA FAULT memory region and specific IRQ

v2 -> v3:
- document the new fault API

v1 -> v2:
- use the new ioctl names
- add doc related to fault handling
---
 Documentation/vfio.txt | 77 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 77 insertions(+)

diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
index f1a4d3c3ba0b..563ebcec9224 100644
--- a/Documentation/vfio.txt
+++ b/Documentation/vfio.txt
@@ -239,6 +239,83 @@ group and can access them as follows::
 	/* Gratuitous device reset and go... */
 	ioctl(device, VFIO_DEVICE_RESET);
+IOMMU Dual Stage Control
+
+Some IOMMUs support 2 stages/levels of translation. "Stage" corresponds to
+the ARM terminology while "level" corresponds to Intel's VT-d terminology. In
+the following text we use the two terms interchangeably.
+
+This is useful when a virtual IOMMU is exposed to the guest and some
+devices are assigned to the guest through VFIO. The guest OS can then use
+stage 1 (IOVA -> GPA), while the hypervisor uses stage 2 for VM isolation
+(GPA -> HPA).
+
+The guest gets ownership of the stage 1 page tables and also owns the stage 1
+configuration structures. The hypervisor owns the root configuration structure
+(for security reasons), including the stage 2 configuration. This works as long
+as the configuration structures and page table formats are compatible between
+the virtual IOMMU and the physical IOMMU.
+
+Assuming the HW supports it, this nested mode is selected by choosing the
+This forces the hypervisor to use stage 2, leaving stage 1 available for
+guest usage.
+
+Once groups are attached to the container, the guest stage 1 translation
+configuration data can be passed to VFIO by using::
+
+	ioctl(container, VFIO_IOMMU_SET_PASID_TABLE, &pasid_table_info);
+
+This allows combining the guest stage 1 configuration structure with
+the hypervisor stage 2 configuration structure. Stage 1 configuration
+structures are dependent on the IOMMU type.
+
+As the stage 1 translation is fully delegated to the HW, translation faults
+encountered during the translation process need to be propagated up to
+the virtualizer and re-injected into the guest.
+
+Userspace must be prepared to receive faults. The VFIO-PCI device
+exposes one dedicated DMA FAULT region: it contains a ring buffer and
+a header that allows managing the head/tail indices. The region is
+identified by the following index/subindex:
+region capability that allows the userspace to retrieve the ABI version
+of the fault records filled by the host.
+
+On top of that region, the userspace can be notified whenever a fault
+occurs at the physical level. It can use the VFIO_IRQ_TYPE_NESTED/
+VFIO_IRQ_SUBTYPE_DMA_FAULT specific IRQ to attach the eventfd to be
+signalled.
+
+The ring buffer containing the fault records can be mmapped. When
+the userspace consumes a fault in the queue, it should increment
+the consumer index to allow new fault records to replace the used ones.
+
+The queue size and the entry size can be retrieved from the header.
+The tail index should never overshoot the producer index, as in any
+other circular buffer scheme. It must also be less than the queue size,
+otherwise the change fails.
+
+When the guest invalidates stage 1 related caches, invalidations must be
+forwarded to the host through::
+
+	ioctl(container, VFIO_IOMMU_CACHE_INVALIDATE, &inv_data);
+
+Those invalidations can happen at various granularity levels: page,
+context, ...
+
+The ARM SMMU specification introduces another challenge: MSIs are translated by
+both the virtual SMMU and the physical SMMU. To build a nested mapping for the
+IOVA programmed into the assigned device, the guest needs to pass its IOVA/MSI
+doorbell GPA binding to the host. The hypervisor can then build a nested stage 2
+binding that eventually translates into the physical MSI doorbell.
+
+This is achieved by calling::
+
+	ioctl(container, VFIO_IOMMU_SET_MSI_BINDING, &guest_binding);
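
As an illustration of the IOVA -> GPA -> HPA chain described in the patch, here is a toy model of a nested translation. It is only a sketch: each stage is modelled as a flat page-number lookup array, whereas real IOMMUs walk multi-level page tables. The names struct stage, stage_translate() and nested_translate() are hypothetical and do not exist in the kernel or in this series.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_MASK  ((UINT64_C(1) << PAGE_SHIFT) - 1)
#define INVALID    UINT64_MAX

/* Hypothetical model of one stage: input page number -> output page number. */
struct stage {
	const uint64_t *pfn;   /* pfn[i] is the output page for input page i */
	size_t nr_pages;       /* number of entries in pfn[] */
};

/* Translate one address through a single stage, keeping the page offset. */
static uint64_t stage_translate(const struct stage *s, uint64_t addr)
{
	uint64_t page = addr >> PAGE_SHIFT;

	/* An unmapped input would raise a translation fault on real HW. */
	if (addr == INVALID || page >= s->nr_pages || s->pfn[page] == INVALID)
		return INVALID;
	return (s->pfn[page] << PAGE_SHIFT) | (addr & PAGE_MASK);
}

/* Nested walk: stage 1 (guest-owned), then stage 2 (hypervisor-owned). */
static uint64_t nested_translate(const struct stage *s1,
				 const struct stage *s2, uint64_t iova)
{
	uint64_t gpa = stage_translate(s1, iova);   /* IOVA -> GPA */

	return stage_translate(s2, gpa);            /* GPA -> HPA */
}
```

In this model a miss in either stage yields INVALID. This mirrors why stage 1 faults detected by the physical IOMMU have to be reported back to the guest that owns the stage 1 tables.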

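
The consumer side of the fault queue described in the patch can be sketched as follows. This is a minimal sketch under stated assumptions, not the ABI defined by the series. The header layout (struct fault_queue_hdr and its field names), the drain_faults() helper and the count_fault() handler are all hypothetical. A real consumer would operate on the mmapped region and would also need the appropriate memory barriers, which are omitted here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header layout; the real one is defined by the series. */
struct fault_queue_hdr {
	uint32_t entry_size;   /* size of one fault record, in bytes */
	uint32_t nb_entries;   /* queue size, in records */
	uint32_t prod;         /* producer index, advanced by the host */
	uint32_t cons;         /* consumer index, advanced by userspace */
};

typedef void (*fault_handler)(const uint8_t *rec, uint32_t size, void *ctx);

/*
 * Consume every pending record and advance the consumer index.
 * Returns the number of records handled, or -1 if an index is
 * out of range (both must stay below the queue size).
 */
static int drain_faults(struct fault_queue_hdr *hdr, const uint8_t *records,
			fault_handler handle, void *ctx)
{
	int n = 0;

	if (hdr->nb_entries == 0 || hdr->prod >= hdr->nb_entries ||
	    hdr->cons >= hdr->nb_entries)
		return -1;

	/* Stop at the producer index: the consumer must never overshoot it. */
	while (hdr->cons != hdr->prod) {
		handle(records + (size_t)hdr->cons * hdr->entry_size,
		       hdr->entry_size, ctx);
		/* Wrap around so the index stays below the queue size. */
		hdr->cons = (hdr->cons + 1) % hdr->nb_entries;
		n++;
	}
	return n;
}

/* Example handler: only counts records; ctx points to an unsigned int. */
static void count_fault(const uint8_t *rec, uint32_t size, void *ctx)
{
	(void)rec;
	(void)size;
	(*(unsigned int *)ctx)++;
}
```

A VMM would typically call such a helper from the eventfd handler attached to the DMA fault IRQ, and translate each record into a fault injected through its virtual IOMMU model.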