* [Qemu-devel] [PATCH v3 0/5] intel-iommu: introduce Intel IOMMU (VT-d) emulation to q35 chipset
@ 2014-08-11  7:04 Le Tan
  2014-08-11  7:04 ` [Qemu-devel] [PATCH v3 1/5] iommu: add is_write as a parameter to the translate function of MemoryRegionIOMMUOps Le Tan
                   ` (5 more replies)
  0 siblings, 6 replies; 34+ messages in thread
From: Le Tan @ 2014-08-11  7:04 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, Stefan Weil, Knut Omang, Le Tan,
	Alex Williamson, Jan Kiszka, Anthony Liguori, Paolo Bonzini

Hi,

These patches introduce Intel IOMMU (VT-d) emulation to the q35 chipset. The
major job in these patches is to add support for emulating an Intel IOMMU
according to the VT-d specification, including basic responses to CSR
accesses, the logic of DMAR (DMA remapping), and DMA memory address
translation.

Features implemented for now are:
1. Responses to important CSR accesses;
2. DMAR (DMA remapping) without PASID support;
3. Primary fault logging;
4. Support for both register-based and queued invalidation of the IOTLB and
   the context cache;
5. Add DMAR table to ACPI tables to expose VT-d to BIOS;
6. Add "-machine iommu=on|off" option to enable/disable VT-d (an example
   invocation follows this list);
7. Only one DMAR unit for all the devices of PCI Segment 0.
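
For reference, a typical invocation to try the emulation might look like the
following (the disk image name is only a placeholder):

  qemu-system-x86_64 -M q35,iommu=on -enable-kvm -m 2G guest.img

with the L1 guest kernel then booted with intel_iommu=on, as in the tests
below.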

Testing:
1. An L1 Linux guest booted with intel_iommu=on interacts with VT-d and boots
smoothly, and the kernel log contains information about VT-d;
2. Running L1 with VT-d, an L2 Linux guest boots smoothly without PCI device
passthrough;
3. Running L1 with VT-d and "-soundhw ac97" (QEMU_AUDIO_DRV=alsa), then
assigning the sound card to L2: L2 boots smoothly with legacy PCI assignment
and I can hear the music played in L2 through the host speakers;
4. The Jailhouse hypervisor runs smoothly (tested by Jan).
5. Running L1 with VT-d and an e1000 network card, then assigning the e1000 to
L2: L2 gets STUCK while booting. This is still unsolved. As far as I can tell,
L2 crashes in e1000_probe(). The QEMU instance running L1 dumps
"KVM: entry failed, hardware error 0x0", and the host KVM prints
"nested_vmx_exit_handled failed vm entry 7". Unlike the sound-card case, after
the e1000 is assigned to L2 there is no translation entry for it in VT-d,
which I think means that the e1000 does not issue any DMA access during the
boot of L2. Sometimes the L2 kernel prints "divide error" during boot; this
may result from the lack of a reset mechanism.
6. VFIO was tested and behaves similarly to legacy PCI assignment.

Discussion:
1. There is a feature called Zero-Length Read (ZLR) which allows zero-length
DMA read requests to write-only pages. If the VT-d emulation is to support
ZLR, we need to know the exact length of an access. Can QEMU currently express
zero-length requests?

TODO:
1. Implement context-cache and IOTLB caching;
2. Fix the legacy PCI assignment bug described in the Testing section;

Changes since v2:
*address review suggestions from Jan
-add support for primary fault logging
-add support for queued invalidation

Changes since v1:
*address review suggestions from Michael, Paolo, Stefan and Jan
-split intel_iommu.h to include/hw/i386/intel_iommu.h and
 hw/i386/intel_iommu_internal.h
-change the copyright information
-change D() to VTD_DPRINTF()
-remove dead code
-rename constant definitions with consistent prefix VTD_
-rename some struct definitions according to QEMU standard
-rename some CSRs access functions
-use endian-safe functions to access CSRs
-change machine option to "iommu=on|off"

Thanks very much!

Git trees:
https://github.com/tamlok/qemu

Le Tan (5):
  iommu: add is_write as a parameter to the translate function of
    MemoryRegionIOMMUOps
  intel-iommu: introduce Intel IOMMU (VT-d) emulation
  intel-iommu: add DMAR table to ACPI tables
  intel-iommu: add Intel IOMMU emulation to q35 and add a machine option
    "iommu" as a switch
  intel-iommu: add support for queued invalidation interface

 exec.c                         |    2 +-
 hw/alpha/typhoon.c             |    3 +-
 hw/core/machine.c              |   27 +-
 hw/i386/Makefile.objs          |    1 +
 hw/i386/acpi-build.c           |   41 +
 hw/i386/acpi-defs.h            |   70 ++
 hw/i386/intel_iommu.c          | 1722 ++++++++++++++++++++++++++++++++++++++++
 hw/i386/intel_iommu_internal.h |  358 +++++++++
 hw/pci-host/apb.c              |    3 +-
 hw/pci-host/q35.c              |   64 +-
 hw/ppc/spapr_iommu.c           |    3 +-
 include/exec/memory.h          |    2 +-
 include/hw/boards.h            |    1 +
 include/hw/i386/intel_iommu.h  |   90 +++
 include/hw/pci-host/q35.h      |    2 +
 qemu-options.hx                |    5 +-
 vl.c                           |    4 +
 17 files changed, 2384 insertions(+), 14 deletions(-)
 create mode 100644 hw/i386/intel_iommu.c
 create mode 100644 hw/i386/intel_iommu_internal.h
 create mode 100644 include/hw/i386/intel_iommu.h

-- 
1.9.1


* [Qemu-devel] [PATCH v3 1/5] iommu: add is_write as a parameter to the translate function of MemoryRegionIOMMUOps
  2014-08-11  7:04 [Qemu-devel] [PATCH v3 0/5] intel-iommu: introduce Intel IOMMU (VT-d) emulation to q35 chipset Le Tan
@ 2014-08-11  7:04 ` Le Tan
  2014-08-11  7:04 ` [Qemu-devel] [PATCH v3 2/5] intel-iommu: introduce Intel IOMMU (VT-d) emulation Le Tan
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 34+ messages in thread
From: Le Tan @ 2014-08-11  7:04 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, Stefan Weil, Knut Omang, Le Tan,
	Alex Williamson, Jan Kiszka, Anthony Liguori, Paolo Bonzini

Add a bool parameter is_write to the translate function of
MemoryRegionIOMMUOps to indicate whether the access is a write. It can be
used for correct fault reporting from within the callback.
Update the interfaces of the related functions accordingly.
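
For illustration, here is a minimal sketch of an implementation of the updated
callback; the function name and the omitted table walk are hypothetical and
not part of this patch:

    static IOMMUTLBEntry my_iommu_translate(MemoryRegion *iommu, hwaddr addr,
                                            bool is_write)
    {
        IOMMUTLBEntry ret = {
            .target_as = &address_space_memory,
            .iova = addr,
            .translated_addr = 0,
            .addr_mask = ~(hwaddr)0,
            .perm = IOMMU_NONE,
        };

        /* Walk the device's translation tables here. is_write says whether
         * to check write or read permission, so the model can report the
         * correct fault type on failure.
         */
        return ret;
    }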

Signed-off-by: Le Tan <tamlokveer@gmail.com>
---
 exec.c                | 2 +-
 hw/alpha/typhoon.c    | 3 ++-
 hw/pci-host/apb.c     | 3 ++-
 hw/ppc/spapr_iommu.c  | 3 ++-
 include/exec/memory.h | 2 +-
 5 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/exec.c b/exec.c
index 765bd94..5ccc106 100644
--- a/exec.c
+++ b/exec.c
@@ -373,7 +373,7 @@ MemoryRegion *address_space_translate(AddressSpace *as, hwaddr addr,
             break;
         }
 
-        iotlb = mr->iommu_ops->translate(mr, addr);
+        iotlb = mr->iommu_ops->translate(mr, addr, is_write);
         addr = ((iotlb.translated_addr & ~iotlb.addr_mask)
                 | (addr & iotlb.addr_mask));
         len = MIN(len, (addr | iotlb.addr_mask) - addr + 1);
diff --git a/hw/alpha/typhoon.c b/hw/alpha/typhoon.c
index 67a1070..31947d9 100644
--- a/hw/alpha/typhoon.c
+++ b/hw/alpha/typhoon.c
@@ -660,7 +660,8 @@ static bool window_translate(TyphoonWindow *win, hwaddr addr,
 /* Handle PCI-to-system address translation.  */
 /* TODO: A translation failure here ought to set PCI error codes on the
    Pchip and generate a machine check interrupt.  */
-static IOMMUTLBEntry typhoon_translate_iommu(MemoryRegion *iommu, hwaddr addr)
+static IOMMUTLBEntry typhoon_translate_iommu(MemoryRegion *iommu, hwaddr addr,
+                                             bool is_write)
 {
     TyphoonPchip *pchip = container_of(iommu, TyphoonPchip, iommu);
     IOMMUTLBEntry ret;
diff --git a/hw/pci-host/apb.c b/hw/pci-host/apb.c
index d238a84..0e0e0ee 100644
--- a/hw/pci-host/apb.c
+++ b/hw/pci-host/apb.c
@@ -203,7 +203,8 @@ static AddressSpace *pbm_pci_dma_iommu(PCIBus *bus, void *opaque, int devfn)
     return &is->iommu_as;
 }
 
-static IOMMUTLBEntry pbm_translate_iommu(MemoryRegion *iommu, hwaddr addr)
+static IOMMUTLBEntry pbm_translate_iommu(MemoryRegion *iommu, hwaddr addr,
+                                         bool is_write)
 {
     IOMMUState *is = container_of(iommu, IOMMUState, iommu);
     hwaddr baseaddr, offset;
diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index f6e32a4..6c91d8e 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -59,7 +59,8 @@ static sPAPRTCETable *spapr_tce_find_by_liobn(uint32_t liobn)
     return NULL;
 }
 
-static IOMMUTLBEntry spapr_tce_translate_iommu(MemoryRegion *iommu, hwaddr addr)
+static IOMMUTLBEntry spapr_tce_translate_iommu(MemoryRegion *iommu, hwaddr addr,
+                                               bool is_write)
 {
     sPAPRTCETable *tcet = container_of(iommu, sPAPRTCETable, iommu);
     uint64_t tce;
diff --git a/include/exec/memory.h b/include/exec/memory.h
index e2c8e3e..834543d 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -129,7 +129,7 @@ typedef struct MemoryRegionIOMMUOps MemoryRegionIOMMUOps;
 
 struct MemoryRegionIOMMUOps {
     /* Return a TLB entry that contains a given address. */
-    IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr);
+    IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr, bool is_write);
 };
 
 typedef struct CoalescedMemoryRange CoalescedMemoryRange;
-- 
1.9.1


* [Qemu-devel] [PATCH v3 2/5] intel-iommu: introduce Intel IOMMU (VT-d) emulation
  2014-08-11  7:04 [Qemu-devel] [PATCH v3 0/5] intel-iommu: introduce Intel IOMMU (VT-d) emulation to q35 chipset Le Tan
  2014-08-11  7:04 ` [Qemu-devel] [PATCH v3 1/5] iommu: add is_write as a parameter to the translate function of MemoryRegionIOMMUOps Le Tan
@ 2014-08-11  7:04 ` Le Tan
  2014-08-12  7:34   ` Jan Kiszka
  2014-08-14 11:03   ` Michael S. Tsirkin
  2014-08-11  7:05 ` [Qemu-devel] [PATCH v3 3/5] intel-iommu: add DMAR table to ACPI tables Le Tan
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 34+ messages in thread
From: Le Tan @ 2014-08-11  7:04 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, Stefan Weil, Knut Omang, Le Tan,
	Alex Williamson, Jan Kiszka, Anthony Liguori, Paolo Bonzini

Add support for emulating an Intel IOMMU according to the VT-d specification
for the q35 chipset machine. Implement the logic for DMAR (DMA remapping)
without PASID support. The emulation supports register-based invalidation and
primary fault logging.
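
As a rough orientation (a sketch, not code from this patch; root_table_pa and
the vtd_read_l()/vtd_write_*() MMIO helpers are hypothetical), the CSR
programming sequence the emulation expects from a guest driver is roughly:

    uint32_t gsts = vtd_read_l(DMAR_GSTS_REG);
    vtd_write_q(DMAR_RTADDR_REG, root_table_pa);      /* root table address */
    vtd_write_l(DMAR_GCMD_REG, gsts | VTD_GCMD_SRTP); /* latch it; GSTS.RTPS set */
    vtd_write_l(DMAR_GCMD_REG, gsts | VTD_GCMD_TE);   /* enable DMAR; GSTS.TES set */

Once TE is set, DMA from devices behind the IOMMU goes through the translation
path implemented below instead of the 1:1 passthrough used while DMAR is
disabled.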

Signed-off-by: Le Tan <tamlokveer@gmail.com>
---
 hw/i386/Makefile.objs          |    1 +
 hw/i386/intel_iommu.c          | 1345 ++++++++++++++++++++++++++++++++++++++++
 hw/i386/intel_iommu_internal.h |  345 +++++++++++
 include/hw/i386/intel_iommu.h  |   90 +++
 4 files changed, 1781 insertions(+)
 create mode 100644 hw/i386/intel_iommu.c
 create mode 100644 hw/i386/intel_iommu_internal.h
 create mode 100644 include/hw/i386/intel_iommu.h

diff --git a/hw/i386/Makefile.objs b/hw/i386/Makefile.objs
index 48014ab..6936111 100644
--- a/hw/i386/Makefile.objs
+++ b/hw/i386/Makefile.objs
@@ -2,6 +2,7 @@ obj-$(CONFIG_KVM) += kvm/
 obj-y += multiboot.o smbios.o
 obj-y += pc.o pc_piix.o pc_q35.o
 obj-y += pc_sysfw.o
+obj-y += intel_iommu.o
 obj-$(CONFIG_XEN) += ../xenpv/ xen/
 
 obj-y += kvmvapic.o
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
new file mode 100644
index 0000000..b3a4f78
--- /dev/null
+++ b/hw/i386/intel_iommu.c
@@ -0,0 +1,1345 @@
+/*
+ * QEMU emulation of an Intel IOMMU (VT-d)
+ *   (DMA Remapping device)
+ *
+ * Copyright (C) 2013 Knut Omang, Oracle <knut.omang@oracle.com>
+ * Copyright (C) 2014 Le Tan, <tamlokveer@gmail.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "hw/sysbus.h"
+#include "exec/address-spaces.h"
+#include "intel_iommu_internal.h"
+
+
+/*#define DEBUG_INTEL_IOMMU*/
+#ifdef DEBUG_INTEL_IOMMU
+enum {
+    DEBUG_GENERAL, DEBUG_CSR, DEBUG_INV, DEBUG_MMU, DEBUG_FLOG,
+};
+#define VTD_DBGBIT(x)   (1 << DEBUG_##x)
+static int vtd_dbgflags = VTD_DBGBIT(GENERAL) | VTD_DBGBIT(CSR) |
+                          VTD_DBGBIT(FLOG);
+
+#define VTD_DPRINTF(what, fmt, ...) do { \
+    if (vtd_dbgflags & VTD_DBGBIT(what)) { \
+        fprintf(stderr, "(vtd)%s: " fmt "\n", __func__, \
+                ## __VA_ARGS__); } \
+    } while (0)
+#else
+#define VTD_DPRINTF(what, fmt, ...) do {} while (0)
+#endif
+
+static inline void define_quad(IntelIOMMUState *s, hwaddr addr, uint64_t val,
+                               uint64_t wmask, uint64_t w1cmask)
+{
+    stq_le_p(&s->csr[addr], val);
+    stq_le_p(&s->wmask[addr], wmask);
+    stq_le_p(&s->w1cmask[addr], w1cmask);
+}
+
+static inline void define_quad_wo(IntelIOMMUState *s, hwaddr addr,
+                                  uint64_t mask)
+{
+    stq_le_p(&s->womask[addr], mask);
+}
+
+static inline void define_long(IntelIOMMUState *s, hwaddr addr, uint32_t val,
+                               uint32_t wmask, uint32_t w1cmask)
+{
+    stl_le_p(&s->csr[addr], val);
+    stl_le_p(&s->wmask[addr], wmask);
+    stl_le_p(&s->w1cmask[addr], w1cmask);
+}
+
+static inline void define_long_wo(IntelIOMMUState *s, hwaddr addr,
+                                  uint32_t mask)
+{
+    stl_le_p(&s->womask[addr], mask);
+}
+
+/* "External" get/set operations */
+static inline void set_quad(IntelIOMMUState *s, hwaddr addr, uint64_t val)
+{
+    uint64_t oldval = ldq_le_p(&s->csr[addr]);
+    uint64_t wmask = ldq_le_p(&s->wmask[addr]);
+    uint64_t w1cmask = ldq_le_p(&s->w1cmask[addr]);
+    stq_le_p(&s->csr[addr],
+             ((oldval & ~wmask) | (val & wmask)) & ~(w1cmask & val));
+}
+
+static inline void set_long(IntelIOMMUState *s, hwaddr addr, uint32_t val)
+{
+    uint32_t oldval = ldl_le_p(&s->csr[addr]);
+    uint32_t wmask = ldl_le_p(&s->wmask[addr]);
+    uint32_t w1cmask = ldl_le_p(&s->w1cmask[addr]);
+    stl_le_p(&s->csr[addr],
+             ((oldval & ~wmask) | (val & wmask)) & ~(w1cmask & val));
+}
+
+static inline uint64_t get_quad(IntelIOMMUState *s, hwaddr addr)
+{
+    uint64_t val = ldq_le_p(&s->csr[addr]);
+    uint64_t womask = ldq_le_p(&s->womask[addr]);
+    return val & ~womask;
+}
+
+
+static inline uint32_t get_long(IntelIOMMUState *s, hwaddr addr)
+{
+    uint32_t val = ldl_le_p(&s->csr[addr]);
+    uint32_t womask = ldl_le_p(&s->womask[addr]);
+    return val & ~womask;
+}
+
+/* "Internal" get/set operations */
+static inline uint64_t get_quad_raw(IntelIOMMUState *s, hwaddr addr)
+{
+    return ldq_le_p(&s->csr[addr]);
+}
+
+static inline uint32_t get_long_raw(IntelIOMMUState *s, hwaddr addr)
+{
+    return ldl_le_p(&s->csr[addr]);
+}
+
+static inline void set_quad_raw(IntelIOMMUState *s, hwaddr addr, uint64_t val)
+{
+    stq_le_p(&s->csr[addr], val);
+}
+
+static inline uint32_t set_clear_mask_long(IntelIOMMUState *s, hwaddr addr,
+                                           uint32_t clear, uint32_t mask)
+{
+    uint32_t new_val = (ldl_le_p(&s->csr[addr]) & ~clear) | mask;
+    stl_le_p(&s->csr[addr], new_val);
+    return new_val;
+}
+
+static inline uint64_t set_clear_mask_quad(IntelIOMMUState *s, hwaddr addr,
+                                           uint64_t clear, uint64_t mask)
+{
+    uint64_t new_val = (ldq_le_p(&s->csr[addr]) & ~clear) | mask;
+    stq_le_p(&s->csr[addr], new_val);
+    return new_val;
+}
+
+/* Given the reg addr of both the message data and address, generate an
+ * interrupt via MSI.
+ */
+static void vtd_generate_interrupt(IntelIOMMUState *s, hwaddr mesg_addr_reg,
+                                   hwaddr mesg_data_reg)
+{
+    hwaddr addr;
+    uint32_t data;
+
+    assert(mesg_data_reg < DMAR_REG_SIZE);
+    assert(mesg_addr_reg < DMAR_REG_SIZE);
+
+    addr = get_long_raw(s, mesg_addr_reg);
+    data = get_long_raw(s, mesg_data_reg);
+
+    VTD_DPRINTF(FLOG, "msi: addr 0x%"PRIx64 " data 0x%"PRIx32, addr, data);
+    stl_le_phys(&address_space_memory, addr, data);
+}
+
+/* Generate a fault event to software via MSI if conditions are met.
+ * Note that the value of FSTS_REG passed in should be the value before any
+ * update.
+ */
+static void vtd_generate_fault_event(IntelIOMMUState *s, uint32_t pre_fsts)
+{
+    /* Check if there are any previously reported interrupt conditions */
+    if (pre_fsts & VTD_FSTS_PPF || pre_fsts & VTD_FSTS_PFO ||
+        pre_fsts & VTD_FSTS_IQE) {
+        VTD_DPRINTF(FLOG, "there are previous interrupt conditions "
+                    "to be serviced by software, fault event is not generated "
+                    "(FSTS_REG 0x%"PRIx32 ")", pre_fsts);
+        return;
+    }
+    set_clear_mask_long(s, DMAR_FECTL_REG, 0, VTD_FECTL_IP);
+    if (get_long_raw(s, DMAR_FECTL_REG) & VTD_FECTL_IM) {
+        /* Interrupt Mask */
+        VTD_DPRINTF(FLOG, "Interrupt Mask set, fault event is not generated");
+    } else {
+        /* generate interrupt */
+        vtd_generate_interrupt(s, DMAR_FEADDR_REG, DMAR_FEDATA_REG);
+        set_clear_mask_long(s, DMAR_FECTL_REG, VTD_FECTL_IP, 0);
+    }
+}
+
+/* Check if the Fault (F) field of the Fault Recording Register referenced by
+ * @index is Set.
+ */
+static inline bool is_frcd_set(IntelIOMMUState *s, uint16_t index)
+{
+    /* Each reg is 128-bit */
+    hwaddr addr = DMAR_FRCD_REG_OFFSET + (((uint64_t)index) << 4);
+    addr += 8; /* Access the high 64-bit half */
+
+    assert(index < DMAR_FRCD_REG_NR);
+
+    return get_quad_raw(s, addr) & VTD_FRCD_F;
+}
+
+/* Update the PPF field of Fault Status Register.
+ * Should be called whenever the F field of any fault recording register is
+ * changed.
+ */
+static inline void update_fsts_ppf(IntelIOMMUState *s)
+{
+    uint32_t i;
+    uint32_t ppf_mask = 0;
+
+    for (i = 0; i < DMAR_FRCD_REG_NR; i++) {
+        if (is_frcd_set(s, i)) {
+            ppf_mask = VTD_FSTS_PPF;
+            break;
+        }
+    }
+    set_clear_mask_long(s, DMAR_FSTS_REG, VTD_FSTS_PPF, ppf_mask);
+    VTD_DPRINTF(FLOG, "set PPF of FSTS_REG to %d", ppf_mask ? 1 : 0);
+}
+
+static inline void set_frcd_and_update_ppf(IntelIOMMUState *s, uint16_t index)
+{
+    /* Each reg is 128-bit */
+    hwaddr addr = DMAR_FRCD_REG_OFFSET + (((uint64_t)index) << 4);
+    addr += 8; /* Access the high 64-bit half */
+
+    assert(index < DMAR_FRCD_REG_NR);
+
+    set_clear_mask_quad(s, addr, 0, VTD_FRCD_F);
+    update_fsts_ppf(s);
+}
+
+/* Must not update F field now, should be done later */
+static void record_frcd(IntelIOMMUState *s, uint16_t index, uint16_t source_id,
+                        hwaddr addr, VTDFaultReason fault, bool is_write)
+{
+    uint64_t hi = 0, lo;
+    hwaddr frcd_reg_addr = DMAR_FRCD_REG_OFFSET + (((uint64_t)index) << 4);
+
+    assert(index < DMAR_FRCD_REG_NR);
+
+    lo = VTD_FRCD_FI(addr);
+    hi = VTD_FRCD_SID(source_id) | VTD_FRCD_FR(fault);
+    if (!is_write) {
+        hi |= VTD_FRCD_T;
+    }
+
+    set_quad_raw(s, frcd_reg_addr, lo);
+    set_quad_raw(s, frcd_reg_addr + 8, hi);
+    VTD_DPRINTF(FLOG, "record to FRCD_REG #%"PRIu16 ": hi 0x%"PRIx64
+                ", lo 0x%"PRIx64, index, hi, lo);
+}
+
+/* Try to collapse multiple pending faults from the same requester */
+static inline bool try_collapse_fault(IntelIOMMUState *s, uint16_t source_id)
+{
+    uint32_t i;
+    uint64_t frcd_reg;
+    hwaddr addr = DMAR_FRCD_REG_OFFSET + 8; /* The high 64-bit half */
+
+    for (i = 0; i < DMAR_FRCD_REG_NR; i++) {
+        frcd_reg = get_quad_raw(s, addr);
+        VTD_DPRINTF(FLOG, "frcd_reg #%d 0x%"PRIx64, i, frcd_reg);
+        if ((frcd_reg & VTD_FRCD_F) &&
+            ((frcd_reg & VTD_FRCD_SID_MASK) == source_id)) {
+            return true;
+        }
+        addr += 16; /* 128-bit for each */
+    }
+
+    return false;
+}
+
+/* Log and report a DMAR (address translation) fault to software */
+static void vtd_report_dmar_fault(IntelIOMMUState *s, uint16_t source_id,
+                                  hwaddr addr, VTDFaultReason fault,
+                                  bool is_write)
+{
+    uint32_t fsts_reg = get_long_raw(s, DMAR_FSTS_REG);
+
+    assert(fault < VTD_FR_MAX);
+
+    if (fault == VTD_FR_RESERVED_ERR) {
+        /* This is not a normal fault reason case. Drop it. */
+        return;
+    }
+
+    VTD_DPRINTF(FLOG, "sid 0x%"PRIx16 ", fault %d, addr 0x%"PRIx64
+                ", is_write %d", source_id, fault, addr, is_write);
+
+    /* Check PFO field in FSTS_REG */
+    if (fsts_reg & VTD_FSTS_PFO) {
+        VTD_DPRINTF(FLOG, "new fault is not recorded due to "
+                    "Primary Fault Overflow");
+        return;
+    }
+
+    /* Compression of multiple faults from the same requester */
+    if (try_collapse_fault(s, source_id)) {
+        VTD_DPRINTF(FLOG, "new fault is not recorded due to "
+                    "compression of faults");
+        return;
+    }
+
+    /* Check next_frcd_reg to see whether it is overflow now */
+    if (is_frcd_set(s, s->next_frcd_reg)) {
+        VTD_DPRINTF(FLOG, "Primary Fault Overflow and "
+                    "new fault is not recorded, set PFO field");
+        set_clear_mask_long(s, DMAR_FSTS_REG, 0, VTD_FSTS_PFO);
+        return;
+    }
+
+    record_frcd(s, s->next_frcd_reg, source_id, addr, fault, is_write);
+
+    if (fsts_reg & VTD_FSTS_PPF) {
+        /* There are already one or more pending faults */
+        VTD_DPRINTF(FLOG, "there are pending faults already, "
+                    "fault event is not generated");
+        set_frcd_and_update_ppf(s, s->next_frcd_reg);
+        s->next_frcd_reg++;
+        if (s->next_frcd_reg == DMAR_FRCD_REG_NR) {
+            s->next_frcd_reg = 0;
+        }
+    } else {
+        set_clear_mask_long(s, DMAR_FSTS_REG, VTD_FSTS_FRI_MASK,
+                            VTD_FSTS_FRI(s->next_frcd_reg));
+        set_frcd_and_update_ppf(s, s->next_frcd_reg); /* It will also set PPF */
+        s->next_frcd_reg++;
+        if (s->next_frcd_reg == DMAR_FRCD_REG_NR) {
+            s->next_frcd_reg = 0;
+        }
+
+        /* This case actually causes the PPF to be Set,
+         * so generate a fault event (interrupt).
+         */
+        vtd_generate_fault_event(s, fsts_reg);
+    }
+}
+
+static inline bool root_entry_present(VTDRootEntry *root)
+{
+    return root->val & VTD_ROOT_ENTRY_P;
+}
+
+static int get_root_entry(IntelIOMMUState *s, uint32_t index, VTDRootEntry *re)
+{
+    dma_addr_t addr;
+
+    assert(index < VTD_ROOT_ENTRY_NR);
+
+    addr = s->root + index * sizeof(*re);
+
+    if (dma_memory_read(&address_space_memory, addr, re, sizeof(*re))) {
+        VTD_DPRINTF(GENERAL, "error: fail to access root-entry at 0x%"PRIx64
+                    " + %"PRIu32, s->root, index);
+        re->val = 0;
+        return -VTD_FR_ROOT_TABLE_INV;
+    }
+
+    re->val = le64_to_cpu(re->val);
+    return VTD_FR_RESERVED;
+}
+
+static inline bool context_entry_present(VTDContextEntry *context)
+{
+    return context->lo & VTD_CONTEXT_ENTRY_P;
+}
+
+static int get_context_entry_from_root(VTDRootEntry *root, uint32_t index,
+                                       VTDContextEntry *ce)
+{
+    dma_addr_t addr;
+
+    if (!root_entry_present(root)) {
+        ce->lo = 0;
+        ce->hi = 0;
+        VTD_DPRINTF(GENERAL, "error: root-entry is not present");
+        return -VTD_FR_ROOT_ENTRY_P;
+    }
+
+    assert(index < VTD_CONTEXT_ENTRY_NR);
+
+    addr = (root->val & VTD_ROOT_ENTRY_CTP) + index * sizeof(*ce);
+
+    if (dma_memory_read(&address_space_memory, addr, ce, sizeof(*ce))) {
+        VTD_DPRINTF(GENERAL, "error: fail to access context-entry at 0x%"PRIx64
+                    " + %"PRIu32,
+                    (uint64_t)(root->val & VTD_ROOT_ENTRY_CTP), index);
+        ce->lo = 0;
+        ce->hi = 0;
+        return -VTD_FR_CONTEXT_TABLE_INV;
+    }
+
+    ce->lo = le64_to_cpu(ce->lo);
+    ce->hi = le64_to_cpu(ce->hi);
+    return VTD_FR_RESERVED;
+}
+
+static inline dma_addr_t get_slpt_base_from_context(VTDContextEntry *ce)
+{
+    return ce->lo & VTD_CONTEXT_ENTRY_SLPTPTR;
+}
+
+/* The shift of an addr for a certain level of paging structure */
+static inline uint32_t slpt_level_shift(uint32_t level)
+{
+    return VTD_PAGE_SHIFT_4K + (level - 1) * VTD_SL_LEVEL_BITS;
+}
+
+static inline uint64_t get_slpte_addr(uint64_t slpte)
+{
+    return slpte & VTD_SL_PT_BASE_ADDR_MASK;
+}
+
+/* Whether the pte indicates the address of the page frame */
+static inline bool is_last_slpte(uint64_t slpte, uint32_t level)
+{
+    return level == VTD_SL_PT_LEVEL || (slpte & VTD_SL_PT_PAGE_SIZE_MASK);
+}
+
+/* Get the content of the slpte located in @base_addr[@index] */
+static inline uint64_t get_slpte(dma_addr_t base_addr, uint32_t index)
+{
+    uint64_t slpte;
+
+    assert(index < VTD_SL_PT_ENTRY_NR);
+
+    if (dma_memory_read(&address_space_memory,
+                        base_addr + index * sizeof(slpte), &slpte,
+                        sizeof(slpte))) {
+        slpte = (uint64_t)-1;
+        return slpte;
+    }
+
+    slpte = le64_to_cpu(slpte);
+    return slpte;
+}
+
+/* Given a gpa and the level of the paging structure, return the offset within
+ * the current level.
+ */
+static inline uint32_t gpa_level_offset(uint64_t gpa, uint32_t level)
+{
+    return (gpa >> slpt_level_shift(level)) & ((1ULL << VTD_SL_LEVEL_BITS) - 1);
+}
+
+/* Check Capability Register to see if the @level of page-table is supported */
+static inline bool is_level_supported(IntelIOMMUState *s, uint32_t level)
+{
+    return VTD_CAP_SAGAW_MASK & s->cap &
+           (1ULL << (level - 2 + VTD_CAP_SAGAW_SHIFT));
+}
+
+/* Get the page-table level that hardware should use for the second-level
+ * page-table walk from the Address Width field of context-entry.
+ */
+static inline uint32_t get_level_from_context_entry(VTDContextEntry *ce)
+{
+    return 2 + (ce->hi & VTD_CONTEXT_ENTRY_AW);
+}
+
+static inline uint32_t get_agaw_from_context_entry(VTDContextEntry *ce)
+{
+    return 30 + (ce->hi & VTD_CONTEXT_ENTRY_AW) * 9;
+}
+
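+/* Bits of a second-level paging entry that must be zero. Indices 1-4 are used
+ * for ordinary entries at paging levels 1-4; indices 5-8 (level + 4) are used
+ * when the entry maps a large page.
+ */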
+static const uint64_t paging_entry_rsvd_field[] = {
+    [0] = ~0ULL,
+    /* For non-large pages */
+    [1] = 0x800ULL | ~(VTD_HAW_MASK | VTD_SL_IGN_COM),
+    [2] = 0x800ULL | ~(VTD_HAW_MASK | VTD_SL_IGN_COM),
+    [3] = 0x800ULL | ~(VTD_HAW_MASK | VTD_SL_IGN_COM),
+    [4] = 0x880ULL | ~(VTD_HAW_MASK | VTD_SL_IGN_COM),
+    /* For large page */
+    [5] = 0x800ULL | ~(VTD_HAW_MASK | VTD_SL_IGN_COM),
+    [6] = 0x1ff800ULL | ~(VTD_HAW_MASK | VTD_SL_IGN_COM),
+    [7] = 0x3ffff800ULL | ~(VTD_HAW_MASK | VTD_SL_IGN_COM),
+    [8] = 0x880ULL | ~(VTD_HAW_MASK | VTD_SL_IGN_COM),
+};
+
+static inline bool slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
+{
+    if (slpte & VTD_SL_PT_PAGE_SIZE_MASK) {
+        /* Maybe large page */
+        return slpte & paging_entry_rsvd_field[level + 4];
+    } else {
+        return slpte & paging_entry_rsvd_field[level];
+    }
+}
+
+/* Given the @gpa, fetch the relevant @slptep. @slpte_level will be the last
+ * level of the translation and can be used to decide the size of a large page.
+ * @slptep and @slpte_level are not touched if an error occurs.
+ */
+static int gpa_to_slpte(VTDContextEntry *ce, uint64_t gpa, bool is_write,
+                        uint64_t *slptep, uint32_t *slpte_level)
+{
+    dma_addr_t addr = get_slpt_base_from_context(ce);
+    uint32_t level = get_level_from_context_entry(ce);
+    uint32_t offset;
+    uint64_t slpte;
+    uint32_t ce_agaw = get_agaw_from_context_entry(ce);
+    uint64_t access_right_check;
+
+    /* Check if @gpa is above 2^X-1, where X is the minimum of MGAW in CAP_REG
+     * and AW in context-entry.
+     */
+    if (gpa & ~((1ULL << MIN(ce_agaw, VTD_MGAW)) - 1)) {
+        VTD_DPRINTF(GENERAL, "error: gpa 0x%"PRIx64 " exceeds limits", gpa);
+        return -VTD_FR_ADDR_BEYOND_MGAW;
+    }
+
+    /* FIXME: what is the Atomics request here? */
+    access_right_check = is_write ? VTD_SL_W : VTD_SL_R;
+
+    while (true) {
+        offset = gpa_level_offset(gpa, level);
+        slpte = get_slpte(addr, offset);
+
+        if (slpte == (uint64_t)-1) {
+            VTD_DPRINTF(GENERAL, "error: fail to access second-level paging "
+                        "entry at level %"PRIu32 " for gpa 0x%"PRIx64,
+                        level, gpa);
+            if (level == get_level_from_context_entry(ce)) {
+                /* Invalid programming of context-entry */
+                return -VTD_FR_CONTEXT_ENTRY_INV;
+            } else {
+                return -VTD_FR_PAGING_ENTRY_INV;
+            }
+        }
+        if (!(slpte & access_right_check)) {
+            VTD_DPRINTF(GENERAL, "error: lack of %s permission for "
+                        "gpa 0x%"PRIx64 " slpte 0x%"PRIx64,
+                        (is_write ? "write" : "read"), gpa, slpte);
+            return is_write ? -VTD_FR_WRITE : -VTD_FR_READ;
+        }
+        if (slpte_nonzero_rsvd(slpte, level)) {
+            VTD_DPRINTF(GENERAL, "error: non-zero reserved field in second "
+                        "level paging entry level %"PRIu32 " slpte 0x%"PRIx64,
+                        level, slpte);
+            return -VTD_FR_PAGING_ENTRY_RSVD;
+        }
+
+        if (is_last_slpte(slpte, level)) {
+            *slptep = slpte;
+            *slpte_level = level;
+            return VTD_FR_RESERVED;
+        }
+        addr = get_slpte_addr(slpte);
+        level--;
+    }
+}
+
+/* Map a device to its corresponding domain (context-entry). @ce will be set
+ * to zero if an error occurs while accessing the context-entry.
+ */
+static inline int dev_to_context_entry(IntelIOMMUState *s, int bus_num,
+                                       int devfn, VTDContextEntry *ce)
+{
+    VTDRootEntry re;
+    int ret_fr;
+
+    assert(0 <= bus_num && bus_num < VTD_PCI_BUS_MAX);
+    assert(0 <= devfn && devfn < VTD_PCI_SLOT_MAX * VTD_PCI_FUNC_MAX);
+
+    ret_fr = get_root_entry(s, bus_num, &re);
+    if (ret_fr) {
+        ce->hi = 0;
+        ce->lo = 0;
+        return ret_fr;
+    }
+
+    if (!root_entry_present(&re)) {
+        VTD_DPRINTF(GENERAL, "error: root-entry #%d is not present", bus_num);
+        ce->hi = 0;
+        ce->lo = 0;
+        return -VTD_FR_ROOT_ENTRY_P;
+    } else if (re.rsvd || (re.val & VTD_ROOT_ENTRY_RSVD)) {
+        VTD_DPRINTF(GENERAL, "error: non-zero reserved field in root-entry "
+                    "hi 0x%"PRIx64 " lo 0x%"PRIx64, re.rsvd, re.val);
+        ce->hi = 0;
+        ce->lo = 0;
+        return -VTD_FR_ROOT_ENTRY_RSVD;
+    }
+
+    ret_fr = get_context_entry_from_root(&re, devfn, ce);
+    if (ret_fr) {
+        return ret_fr;
+    }
+
+    if (!context_entry_present(ce)) {
+        VTD_DPRINTF(GENERAL,
+                    "error: context-entry #%d(bus #%d) is not present", devfn,
+                    bus_num);
+        return -VTD_FR_CONTEXT_ENTRY_P;
+    } else if ((ce->hi & VTD_CONTEXT_ENTRY_RSVD_HI) ||
+               (ce->lo & VTD_CONTEXT_ENTRY_RSVD_LO)) {
+        VTD_DPRINTF(GENERAL,
+                    "error: non-zero reserved field in context-entry "
+                    "hi 0x%"PRIx64 " lo 0x%"PRIx64, ce->hi, ce->lo);
+        return -VTD_FR_CONTEXT_ENTRY_RSVD;
+    }
+
+    /* Check if the programming of context-entry is valid */
+    if (!is_level_supported(s, get_level_from_context_entry(ce))) {
+        VTD_DPRINTF(GENERAL, "error: unsupported Address Width value in "
+                    "context-entry hi 0x%"PRIx64 " lo 0x%"PRIx64,
+                    ce->hi, ce->lo);
+        return -VTD_FR_CONTEXT_ENTRY_INV;
+    } else if (ce->lo & VTD_CONTEXT_ENTRY_TT) {
+        VTD_DPRINTF(GENERAL, "error: unsupported Translation Type in "
+                    "context-entry hi 0x%"PRIx64 " lo 0x%"PRIx64,
+                    ce->hi, ce->lo);
+        return -VTD_FR_CONTEXT_ENTRY_INV;
+    }
+
+    return VTD_FR_RESERVED;
+}
+
+static inline uint16_t make_source_id(int bus_num, int devfn)
+{
+    return ((bus_num & 0xffUL) << 8) | (devfn & 0xffUL);
+}
+
+static const bool qualified_faults[] = {
+    [VTD_FR_RESERVED] = false,
+    [VTD_FR_ROOT_ENTRY_P] = false,
+    [VTD_FR_CONTEXT_ENTRY_P] = true,
+    [VTD_FR_CONTEXT_ENTRY_INV] = true,
+    [VTD_FR_ADDR_BEYOND_MGAW] = true,
+    [VTD_FR_WRITE] = true,
+    [VTD_FR_READ] = true,
+    [VTD_FR_PAGING_ENTRY_INV] = true,
+    [VTD_FR_ROOT_TABLE_INV] = false,
+    [VTD_FR_CONTEXT_TABLE_INV] = false,
+    [VTD_FR_ROOT_ENTRY_RSVD] = false,
+    [VTD_FR_PAGING_ENTRY_RSVD] = true,
+    [VTD_FR_CONTEXT_ENTRY_TT] = true,
+    [VTD_FR_RESERVED_ERR] = false,
+    [VTD_FR_MAX] = false,
+};
+
+/* Check whether a fault condition is "qualified". A qualified fault is
+ * reported to software only if the FPD field in the context-entry used to
+ * process the faulting request is 0.
+ */
+static inline bool is_qualified_fault(VTDFaultReason fault)
+{
+    return qualified_faults[fault];
+}
+
+static inline bool is_interrupt_addr(hwaddr addr)
+{
+    return VTD_INTERRUPT_ADDR_FIRST <= addr && addr <= VTD_INTERRUPT_ADDR_LAST;
+}
+
+/* Map the device to its context-entry, then do a paging-structure walk to
+ * perform the IOMMU translation.
+ * @bus_num: The bus number
+ * @devfn: The devfn, which is the combined device and function number
+ * @is_write: Whether the access is a write operation
+ * @entry: IOMMUTLBEntry that contains the addr to be translated and the result
+ */
+static void iommu_translate(IntelIOMMUState *s, int bus_num, int devfn,
+                            hwaddr addr, bool is_write, IOMMUTLBEntry *entry)
+{
+    VTDContextEntry ce;
+    uint64_t slpte;
+    uint32_t level;
+    uint64_t page_mask;
+    uint16_t source_id = make_source_id(bus_num, devfn);
+    int ret_fr;
+    bool is_fpd_set = false;
+
+    /* Check if the request is in interrupt address range */
+    if (is_interrupt_addr(addr)) {
+        if (is_write) {
+            /* FIXME: since we don't know the length of the access here, we
+             * treat Non-DWORD length write requests without PASID as
+             * interrupt requests, too. Without interrupt remapping support,
+             * we just use 1:1 mapping.
+             */
+            VTD_DPRINTF(MMU, "write request to interrupt address "
+                        "gpa 0x%"PRIx64, addr);
+            entry->iova = addr & VTD_PAGE_MASK_4K;
+            entry->translated_addr = addr & VTD_PAGE_MASK_4K;
+            entry->addr_mask = ~VTD_PAGE_MASK_4K;
+            entry->perm = IOMMU_WO;
+            return;
+        } else {
+            VTD_DPRINTF(GENERAL, "error: read request from interrupt address "
+                        "gpa 0x%"PRIx64, addr);
+            vtd_report_dmar_fault(s, source_id, addr, VTD_FR_READ, is_write);
+            return;
+        }
+    }
+
+    ret_fr = dev_to_context_entry(s, bus_num, devfn, &ce);
+    is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
+    if (ret_fr) {
+        ret_fr = -ret_fr;
+        if (is_fpd_set && is_qualified_fault(ret_fr)) {
+            VTD_DPRINTF(FLOG, "fault processing is disabled for DMA requests "
+                        "through this context-entry (with FPD Set)");
+        } else {
+            vtd_report_dmar_fault(s, source_id, addr, ret_fr, is_write);
+        }
+        return;
+    }
+
+    ret_fr = gpa_to_slpte(&ce, addr, is_write, &slpte, &level);
+    if (ret_fr) {
+        ret_fr = -ret_fr;
+        if (is_fpd_set && is_qualified_fault(ret_fr)) {
+            VTD_DPRINTF(FLOG, "fault processing is disabled for DMA requests "
+                        "through this context-entry (with FPD Set)");
+        } else {
+            vtd_report_dmar_fault(s, source_id, addr, ret_fr, is_write);
+        }
+        return;
+    }
+
+    if (level == VTD_SL_PT_LEVEL) {
+        /* 4-KB page */
+        page_mask = VTD_PAGE_MASK_4K;
+    } else if (level == VTD_SL_PDP_LEVEL) {
+        /* 1-GB page */
+        page_mask = VTD_PAGE_MASK_1G;
+    } else {
+        /* 2-MB page */
+        page_mask = VTD_PAGE_MASK_2M;
+    }
+
+    entry->iova = addr & page_mask;
+    entry->translated_addr = get_slpte_addr(slpte) & page_mask;
+    entry->addr_mask = ~page_mask;
+    entry->perm = slpte & VTD_SL_RW_MASK;
+}
+
+static void vtd_root_table_setup(IntelIOMMUState *s)
+{
+    s->root = get_quad_raw(s, DMAR_RTADDR_REG);
+    s->root_extended = s->root & VTD_RTADDR_RTT;
+    s->root &= VTD_RTADDR_ADDR_MASK;
+
+    VTD_DPRINTF(CSR, "root_table addr 0x%"PRIx64 " %s", s->root,
+                (s->root_extended ? "(extended)" : ""));
+}
+
+/* Context-cache invalidation
+ * Returns the Context Actual Invalidation Granularity.
+ * @val: the content of the CCMD_REG
+ */
+static uint64_t vtd_context_cache_invalidate(IntelIOMMUState *s, uint64_t val)
+{
+    uint64_t caig;
+    uint64_t type = val & VTD_CCMD_CIRG_MASK;
+
+    switch (type) {
+    case VTD_CCMD_GLOBAL_INVL:
+        VTD_DPRINTF(INV, "Global invalidation request");
+        caig = VTD_CCMD_GLOBAL_INVL_A;
+        break;
+
+    case VTD_CCMD_DOMAIN_INVL:
+        VTD_DPRINTF(INV, "Domain-selective invalidation request");
+        caig = VTD_CCMD_DOMAIN_INVL_A;
+        break;
+
+    case VTD_CCMD_DEVICE_INVL:
+        VTD_DPRINTF(INV, "Domain-selective invalidation request");
+        caig = VTD_CCMD_DEVICE_INVL_A;
+        break;
+
+    default:
+        VTD_DPRINTF(GENERAL,
+                    "error: wrong context-cache invalidation granularity");
+        caig = 0;
+    }
+
+    return caig;
+}
+
+/* Flush IOTLB
+ * Returns the IOTLB Actual Invalidation Granularity.
+ * @val: the content of the IOTLB_REG
+ */
+static uint64_t vtd_iotlb_flush(IntelIOMMUState *s, uint64_t val)
+{
+    uint64_t iaig;
+    uint64_t type = val & VTD_TLB_FLUSH_GRANU_MASK;
+
+    switch (type) {
+    case VTD_TLB_GLOBAL_FLUSH:
+        VTD_DPRINTF(INV, "Global IOTLB flush");
+        iaig = VTD_TLB_GLOBAL_FLUSH_A;
+        break;
+
+    case VTD_TLB_DSI_FLUSH:
+        VTD_DPRINTF(INV, "Domain-selective IOTLB flush");
+        iaig = VTD_TLB_DSI_FLUSH_A;
+        break;
+
+    case VTD_TLB_PSI_FLUSH:
+        VTD_DPRINTF(INV, "Page-selective-within-domain IOTLB flush");
+        iaig = VTD_TLB_PSI_FLUSH_A;
+        break;
+
+    default:
+        VTD_DPRINTF(GENERAL, "error: wrong iotlb flush granularity");
+        iaig = 0;
+    }
+
+    return iaig;
+}
+
+/* Set Root Table Pointer */
+static void handle_gcmd_srtp(IntelIOMMUState *s)
+{
+    VTD_DPRINTF(CSR, "set Root Table Pointer");
+
+    vtd_root_table_setup(s);
+    /* Ok - report back to driver */
+    set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_RTPS);
+}
+
+/* Handle Translation Enable/Disable */
+static void handle_gcmd_te(IntelIOMMUState *s, bool en)
+{
+    VTD_DPRINTF(CSR, "Translation Enable %s", (en ? "on" : "off"));
+
+    if (en) {
+        s->dmar_enabled = true;
+        /* Ok - report back to driver */
+        set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_TES);
+    } else {
+        s->dmar_enabled = false;
+
+        /* Clear the index of Fault Recording Register */
+        s->next_frcd_reg = 0;
+        /* Ok - report back to driver */
+        set_clear_mask_long(s, DMAR_GSTS_REG, VTD_GSTS_TES, 0);
+    }
+}
+
+/* Handle write to Global Command Register */
+static void handle_gcmd_write(IntelIOMMUState *s)
+{
+    uint32_t status = get_long_raw(s, DMAR_GSTS_REG);
+    uint32_t val = get_long_raw(s, DMAR_GCMD_REG);
+    uint32_t changed = status ^ val;
+
+    VTD_DPRINTF(CSR, "value 0x%"PRIx32 " status 0x%"PRIx32, val, status);
+    if (changed & VTD_GCMD_TE) {
+        /* Translation enable/disable */
+        handle_gcmd_te(s, val & VTD_GCMD_TE);
+    }
+    if (val & VTD_GCMD_SRTP) {
+        /* Set/update the root-table pointer */
+        handle_gcmd_srtp(s);
+    }
+}
+
+/* Handle write to Context Command Register */
+static void handle_ccmd_write(IntelIOMMUState *s)
+{
+    uint64_t ret;
+    uint64_t val = get_quad_raw(s, DMAR_CCMD_REG);
+
+    /* Context-cache invalidation request */
+    if (val & VTD_CCMD_ICC) {
+        ret = vtd_context_cache_invalidate(s, val);
+
+        /* Invalidation completed: clear ICC, report actual granularity */
+        set_clear_mask_quad(s, DMAR_CCMD_REG, VTD_CCMD_ICC, 0ULL);
+        ret = set_clear_mask_quad(s, DMAR_CCMD_REG, VTD_CCMD_CAIG_MASK, ret);
+        VTD_DPRINTF(INV, "CCMD_REG write-back val: 0x%"PRIx64, ret);
+    }
+}
+
+/* Handle write to IOTLB Invalidation Register */
+static void handle_iotlb_write(IntelIOMMUState *s)
+{
+    uint64_t ret;
+    uint64_t val = get_quad_raw(s, DMAR_IOTLB_REG);
+
+    /* IOTLB invalidation request */
+    if (val & VTD_TLB_IVT) {
+        ret = vtd_iotlb_flush(s, val);
+
+        /* Invalidation completed: clear IVT, report actual granularity */
+        set_clear_mask_quad(s, DMAR_IOTLB_REG, VTD_TLB_IVT, 0ULL);
+        ret = set_clear_mask_quad(s, DMAR_IOTLB_REG,
+                                  VTD_TLB_FLUSH_GRANU_MASK_A, ret);
+        VTD_DPRINTF(INV, "IOTLB_REG write-back val: 0x%"PRIx64, ret);
+    }
+}
+
+static inline void handle_fsts_write(IntelIOMMUState *s)
+{
+    uint32_t fsts_reg = get_long_raw(s, DMAR_FSTS_REG);
+    uint32_t fectl_reg = get_long_raw(s, DMAR_FECTL_REG);
+    uint32_t status_fields = VTD_FSTS_PFO | VTD_FSTS_PPF | VTD_FSTS_IQE;
+
+    if ((fectl_reg & VTD_FECTL_IP) && !(fsts_reg & status_fields)) {
+        set_clear_mask_long(s, DMAR_FECTL_REG, VTD_FECTL_IP, 0);
+        VTD_DPRINTF(FLOG, "all pending interrupt conditions serviced, clear "
+                    "IP field of FECTL_REG");
+    }
+}
+
+static inline void handle_fectl_write(IntelIOMMUState *s)
+{
+    uint32_t fectl_reg;
+    /* When software clears the IM field, check the IP field. But do we
+     * need to compare the old value and the new value to conclude that
+     * software clears the IM field? Or just check if the IM field is zero?
+     */
+    fectl_reg = get_long_raw(s, DMAR_FECTL_REG);
+    if ((fectl_reg & VTD_FECTL_IP) && !(fectl_reg & VTD_FECTL_IM)) {
+        vtd_generate_interrupt(s, DMAR_FEADDR_REG, DMAR_FEDATA_REG);
+        set_clear_mask_long(s, DMAR_FECTL_REG, VTD_FECTL_IP, 0);
+        VTD_DPRINTF(FLOG, "IM field is cleared, generate "
+                    "fault event interrupt");
+    }
+}
+
+static uint64_t vtd_mem_read(void *opaque, hwaddr addr, unsigned size)
+{
+    IntelIOMMUState *s = opaque;
+    uint64_t val;
+
+    if (addr + size > DMAR_REG_SIZE) {
+        VTD_DPRINTF(GENERAL, "error: addr outside region: max 0x%"PRIx64
+                    ", got 0x%"PRIx64 " %d",
+                    (uint64_t)DMAR_REG_SIZE, addr, size);
+        return (uint64_t)-1;
+    }
+
+    assert(size == 4 || size == 8);
+
+    switch (addr) {
+    /* Root Table Address Register, 64-bit */
+    case DMAR_RTADDR_REG:
+        if (size == 4) {
+            val = s->root & ((1ULL << 32) - 1);
+        } else {
+            val = s->root;
+        }
+        break;
+
+    case DMAR_RTADDR_REG_HI:
+        assert(size == 4);
+        val = s->root >> 32;
+        break;
+
+    default:
+        if (size == 4) {
+            val = get_long(s, addr);
+        } else {
+            val = get_quad(s, addr);
+        }
+    }
+
+    VTD_DPRINTF(CSR, "addr 0x%"PRIx64 " size %d val 0x%"PRIx64,
+                addr, size, val);
+    return val;
+}
+
+static void vtd_mem_write(void *opaque, hwaddr addr,
+                          uint64_t val, unsigned size)
+{
+    IntelIOMMUState *s = opaque;
+
+    if (addr + size > DMAR_REG_SIZE) {
+        VTD_DPRINTF(GENERAL, "error: addr outside region: max 0x%"PRIx64
+                    ", got 0x%"PRIx64 " %d",
+                    (uint64_t)DMAR_REG_SIZE, addr, size);
+        return;
+    }
+
+    assert(size == 4 || size == 8);
+
+    switch (addr) {
+    /* Global Command Register, 32-bit */
+    case DMAR_GCMD_REG:
+        VTD_DPRINTF(CSR, "DMAR_GCMD_REG write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        set_long(s, addr, val);
+        handle_gcmd_write(s);
+        break;
+
+    /* Context Command Register, 64-bit */
+    case DMAR_CCMD_REG:
+        VTD_DPRINTF(CSR, "DMAR_CCMD_REG write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        if (size == 4) {
+            set_long(s, addr, val);
+        } else {
+            set_quad(s, addr, val);
+            handle_ccmd_write(s);
+        }
+        break;
+
+    case DMAR_CCMD_REG_HI:
+        VTD_DPRINTF(CSR, "DMAR_CCMD_REG_HI write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        assert(size == 4);
+        set_long(s, addr, val);
+        handle_ccmd_write(s);
+        break;
+
+
+    /* IOTLB Invalidation Register, 64-bit */
+    case DMAR_IOTLB_REG:
+        VTD_DPRINTF(INV, "DMAR_IOTLB_REG write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        if (size == 4) {
+            set_long(s, addr, val);
+        } else {
+            set_quad(s, addr, val);
+            handle_iotlb_write(s);
+        }
+        break;
+
+    case DMAR_IOTLB_REG_HI:
+        VTD_DPRINTF(INV, "DMAR_IOTLB_REG_HI write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        assert(size == 4);
+        set_long(s, addr, val);
+        handle_iotlb_write(s);
+        break;
+
+    /* Fault Status Register, 32-bit */
+    case DMAR_FSTS_REG:
+        VTD_DPRINTF(FLOG, "DMAR_FSTS_REG write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        assert(size == 4);
+        set_long(s, addr, val);
+        handle_fsts_write(s);
+        break;
+
+    /* Fault Event Control Register, 32-bit */
+    case DMAR_FECTL_REG:
+        VTD_DPRINTF(FLOG, "DMAR_FECTL_REG write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        assert(size == 4);
+        set_long(s, addr, val);
+        handle_fectl_write(s);
+        break;
+
+    /* Fault Event Data Register, 32-bit */
+    case DMAR_FEDATA_REG:
+        VTD_DPRINTF(FLOG, "DMAR_FEDATA_REG write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        assert(size == 4);
+        set_long(s, addr, val);
+        break;
+
+    /* Fault Event Address Register, 32-bit */
+    case DMAR_FEADDR_REG:
+        VTD_DPRINTF(FLOG, "DMAR_FEADDR_REG write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        assert(size == 4);
+        set_long(s, addr, val);
+        break;
+
+    /* Fault Event Upper Address Register, 32-bit */
+    case DMAR_FEUADDR_REG:
+        VTD_DPRINTF(FLOG, "DMAR_FEUADDR_REG write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        assert(size == 4);
+        set_long(s, addr, val);
+        break;
+
+    /* Protected Memory Enable Register, 32-bit */
+    case DMAR_PMEN_REG:
+        VTD_DPRINTF(CSR, "DMAR_PMEN_REG write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        assert(size == 4);
+        set_long(s, addr, val);
+        break;
+
+
+    /* Root Table Address Register, 64-bit */
+    case DMAR_RTADDR_REG:
+        VTD_DPRINTF(CSR, "DMAR_RTADDR_REG write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        if (size == 4) {
+            set_long(s, addr, val);
+        } else {
+            set_quad(s, addr, val);
+        }
+        break;
+
+    case DMAR_RTADDR_REG_HI:
+        VTD_DPRINTF(CSR, "DMAR_RTADDR_REG_HI write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        assert(size == 4);
+        set_long(s, addr, val);
+        break;
+
+    /* Fault Recording Registers, 128-bit */
+    case DMAR_FRCD_REG_0_0:
+        VTD_DPRINTF(FLOG, "DMAR_FRCD_REG_0_0 write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        if (size == 4) {
+            set_long(s, addr, val);
+        } else {
+            set_quad(s, addr, val);
+        }
+        break;
+
+    case DMAR_FRCD_REG_0_1:
+        VTD_DPRINTF(FLOG, "DMAR_FRCD_REG_0_1 write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        assert(size == 4);
+        set_long(s, addr, val);
+        break;
+
+    case DMAR_FRCD_REG_0_2:
+        VTD_DPRINTF(FLOG, "DMAR_FRCD_REG_0_2 write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        if (size == 4) {
+            set_long(s, addr, val);
+        } else {
+            set_quad(s, addr, val);
+            /* May clear bit 127 (Fault), update PPF */
+            update_fsts_ppf(s);
+        }
+        break;
+
+    case DMAR_FRCD_REG_0_3:
+        VTD_DPRINTF(FLOG, "DMAR_FRCD_REG_0_3 write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        assert(size == 4);
+        set_long(s, addr, val);
+        /* May clear bit 127 (Fault), update PPF */
+        update_fsts_ppf(s);
+        break;
+
+    default:
+        VTD_DPRINTF(GENERAL, "error: unhandled reg write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        if (size == 4) {
+            set_long(s, addr, val);
+        } else {
+            set_quad(s, addr, val);
+        }
+    }
+
+}
+
+static IOMMUTLBEntry vtd_iommu_translate(MemoryRegion *iommu, hwaddr addr,
+                                         bool is_write)
+{
+    VTDAddressSpace *vtd_as = container_of(iommu, VTDAddressSpace, iommu);
+    IntelIOMMUState *s = vtd_as->iommu_state;
+    int bus_num = vtd_as->bus_num;
+    int devfn = vtd_as->devfn;
+    IOMMUTLBEntry ret = {
+        .target_as = &address_space_memory,
+        .iova = addr,
+        .translated_addr = 0,
+        .addr_mask = ~(hwaddr)0,
+        .perm = IOMMU_NONE,
+    };
+
+    if (!s->dmar_enabled) {
+        /* DMAR disabled, passthrough, use 4k-page */
+        ret.iova = addr & VTD_PAGE_MASK_4K;
+        ret.translated_addr = addr & VTD_PAGE_MASK_4K;
+        ret.addr_mask = ~VTD_PAGE_MASK_4K;
+        ret.perm = IOMMU_RW;
+        return ret;
+    }
+
+    iommu_translate(s, bus_num, devfn, addr, is_write, &ret);
+
+    VTD_DPRINTF(MMU,
+                "bus %d slot %d func %d devfn %d gpa %"PRIx64 " hpa %"PRIx64,
+                bus_num, VTD_PCI_SLOT(devfn), VTD_PCI_FUNC(devfn), devfn, addr,
+                ret.translated_addr);
+    return ret;
+}
+
+static const VMStateDescription vtd_vmstate = {
+    .name = "iommu_intel",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .minimum_version_id_old = 1,
+    .fields = (VMStateField[]) {
+        VMSTATE_UINT8_ARRAY(csr, IntelIOMMUState, DMAR_REG_SIZE),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+static const MemoryRegionOps vtd_mem_ops = {
+    .read = vtd_mem_read,
+    .write = vtd_mem_write,
+    .endianness = DEVICE_LITTLE_ENDIAN,
+    .impl = {
+        .min_access_size = 4,
+        .max_access_size = 8,
+    },
+    .valid = {
+        .min_access_size = 4,
+        .max_access_size = 8,
+    },
+};
+
+static Property iommu_properties[] = {
+    DEFINE_PROP_UINT32("version", IntelIOMMUState, version, 0),
+    DEFINE_PROP_END_OF_LIST(),
+};
+
+/* Do the real initialization. It is also called on reset, so pay attention
+ * when adding new initialization code.
+ */
+static void do_vtd_init(IntelIOMMUState *s)
+{
+    memset(s->csr, 0, DMAR_REG_SIZE);
+    memset(s->wmask, 0, DMAR_REG_SIZE);
+    memset(s->w1cmask, 0, DMAR_REG_SIZE);
+    memset(s->womask, 0, DMAR_REG_SIZE);
+
+    s->iommu_ops.translate = vtd_iommu_translate;
+    s->root = 0;
+    s->root_extended = false;
+    s->dmar_enabled = false;
+    s->iq_head = 0;
+    s->iq_tail = 0;
+    s->iq = 0;
+    s->iq_size = 0;
+    s->qi_enabled = false;
+    s->iq_last_desc_type = VTD_INV_DESC_NONE;
+    s->next_frcd_reg = 0;
+
+    /* b.0:2 = 6: Number of domains supported: 64K using 16 bit ids
+     * b.3   = 0: Advanced fault logging not supported
+     * b.4   = 0: Required write buffer flushing not supported
+     * b.5   = 0: Protected low memory region not supported
+     * b.6   = 0: Protected high memory region not supported
+     * b.8:12 = 2: SAGAW(Supported Adjusted Guest Address Widths), 39-bit,
+     *             3-level page-table
+     * b.16:21 = 38: MGAW(Maximum Guest Address Width) = 39
+     * b.22 = 0: ZLR(Zero Length Read) zero length DMA read requests
+     *           to write-only pages not supported
+     * b.24:33 = 34: FRO(Fault-recording Register offset)
+     * b.54 = 0: DWD(Write Draining), draining of write requests not supported
+     * b.55 = 0: DRD(Read Draining), draining of read requests not supported
+     */
+    s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND | VTD_CAP_MGAW |
+             VTD_CAP_SAGAW;
+
+    /* b.1 = 0: QI(Queued Invalidation support) not supported
+     * b.2 = 0: DT(Device-TLB support) not supported
+     * b.3 = 0: IR(Interrupt Remapping support) not supported
+     * b.4 = 0: EIM(Extended Interrupt Mode) not supported
+     * b.8:17 = 15: IRO(IOTLB Register Offset)
+     * b.20:23 = 0: MHMV(Maximum Handle Mask Value) not valid
+     */
+    s->ecap = VTD_ECAP_IRO;
+
+    /* Define registers with default values and bit semantics */
+    define_long(s, DMAR_VER_REG, 0x10UL, 0, 0);  /* set MAX = 1, RO */
+    define_quad(s, DMAR_CAP_REG, s->cap, 0, 0);
+    define_quad(s, DMAR_ECAP_REG, s->ecap, 0, 0);
+    define_long(s, DMAR_GCMD_REG, 0, 0xff800000UL, 0);
+    define_long_wo(s, DMAR_GCMD_REG, 0xff800000UL);
+    define_long(s, DMAR_GSTS_REG, 0, 0, 0); /* All bits RO, default 0 */
+    define_quad(s, DMAR_RTADDR_REG, 0, 0xfffffffffffff000ULL, 0);
+    define_quad(s, DMAR_CCMD_REG, 0, 0xe0000003ffffffffULL, 0);
+    define_quad_wo(s, DMAR_CCMD_REG, 0x3ffff0000ULL);
+
+    /* Advanced Fault Logging not supported */
+    define_long(s, DMAR_FSTS_REG, 0, 0, 0x11UL);
+    define_long(s, DMAR_FECTL_REG, 0x80000000UL, 0x80000000UL, 0);
+    define_long(s, DMAR_FEDATA_REG, 0, 0x0000ffffUL, 0); /* 15:0 RW */
+    define_long(s, DMAR_FEADDR_REG, 0, 0xfffffffcUL, 0); /* 31:2 RW */
+
+    /* Treated as RsvdZ when EIM in ECAP_REG is not supported
+     * define_long(s, DMAR_FEUADDR_REG, 0, 0xffffffffUL, 0);
+     */
+    define_long(s, DMAR_FEUADDR_REG, 0, 0, 0);
+
+    /* Treated as RO for implementations that report the PLMR and PHMR fields
+     * as Clear in CAP_REG.
+     * define_long(s, DMAR_PMEN_REG, 0, 0x80000000UL, 0);
+     */
+    define_long(s, DMAR_PMEN_REG, 0, 0, 0);
+
+    /* IOTLB registers */
+    define_quad(s, DMAR_IOTLB_REG, 0, 0xb003ffff00000000ULL, 0);
+    define_quad(s, DMAR_IVA_REG, 0, 0xfffffffffffff07fULL, 0);
+    define_quad_wo(s, DMAR_IVA_REG, 0xfffffffffffff07fULL);
+
+    /* Fault Recording Registers, 128-bit */
+    define_quad(s, DMAR_FRCD_REG_0_0, 0, 0, 0);
+    define_quad(s, DMAR_FRCD_REG_0_2, 0, 0, 0x8000000000000000ULL);
+}
+
+/* QOM reset function.
+ * Should not reset address_spaces on reset.
+ */
+static void vtd_reset(DeviceState *dev)
+{
+    IntelIOMMUState *s = INTEL_IOMMU_DEVICE(dev);
+
+    VTD_DPRINTF(GENERAL, "");
+    do_vtd_init(s);
+}
+
+/* Initialization function of QOM */
+static void vtd_realize(DeviceState *dev, Error **errp)
+{
+    IntelIOMMUState *s = INTEL_IOMMU_DEVICE(dev);
+
+    VTD_DPRINTF(GENERAL, "");
+    memset(s->address_spaces, 0, sizeof(s->address_spaces));
+    memory_region_init_io(&s->csrmem, OBJECT(s), &vtd_mem_ops, s,
+                          "intel_iommu", DMAR_REG_SIZE);
+    sysbus_init_mmio(SYS_BUS_DEVICE(s), &s->csrmem);
+    do_vtd_init(s);
+}
+
+static void vtd_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+
+    dc->reset = vtd_reset;
+    dc->realize = vtd_realize;
+    dc->vmsd = &vtd_vmstate;
+    dc->props = iommu_properties;
+}
+
+static const TypeInfo vtd_info = {
+    .name          = TYPE_INTEL_IOMMU_DEVICE,
+    .parent        = TYPE_SYS_BUS_DEVICE,
+    .instance_size = sizeof(IntelIOMMUState),
+    .class_init    = vtd_class_init,
+};
+
+static void vtd_register_types(void)
+{
+    VTD_DPRINTF(GENERAL, "");
+    type_register_static(&vtd_info);
+}
+
+type_init(vtd_register_types)
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
new file mode 100644
index 0000000..7bc679a
--- /dev/null
+++ b/hw/i386/intel_iommu_internal.h
@@ -0,0 +1,345 @@
+/*
+ * QEMU emulation of an Intel IOMMU (VT-d)
+ *   (DMA Remapping device)
+ *
+ * Copyright (C) 2013 Knut Omang, Oracle <knut.omang@oracle.com>
+ * Copyright (C) 2014 Le Tan, <tamlokveer@gmail.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ *
+ * Lots of defines copied from kernel/include/linux/intel-iommu.h:
+ *   Copyright (C) 2006-2008 Intel Corporation
+ *   Author: Ashok Raj <ashok.raj@intel.com>
+ *   Author: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
+ *
+ */
+
+#ifndef HW_I386_INTEL_IOMMU_INTERNAL_H
+#define HW_I386_INTEL_IOMMU_INTERNAL_H
+#include "hw/i386/intel_iommu.h"
+
+/*
+ * Intel IOMMU register specification
+ */
+#define DMAR_VER_REG    0x0 /* Arch version supported by this IOMMU */
+#define DMAR_CAP_REG    0x8 /* Hardware supported capabilities */
+#define DMAR_CAP_REG_HI 0xc /* High 32-bit of DMAR_CAP_REG */
+#define DMAR_ECAP_REG   0x10    /* Extended capabilities supported */
+#define DMAR_ECAP_REG_HI    0X14
+#define DMAR_GCMD_REG   0x18    /* Global command register */
+#define DMAR_GSTS_REG   0x1c    /* Global status register */
+#define DMAR_RTADDR_REG 0x20    /* Root entry table */
+#define DMAR_RTADDR_REG_HI  0X24
+#define DMAR_CCMD_REG   0x28  /* Context command reg */
+#define DMAR_CCMD_REG_HI    0x2c
+#define DMAR_FSTS_REG   0x34  /* Fault Status register */
+#define DMAR_FECTL_REG  0x38 /* Fault control register */
+#define DMAR_FEDATA_REG 0x3c    /* Fault event interrupt data register */
+#define DMAR_FEADDR_REG 0x40    /* Fault event interrupt addr register */
+#define DMAR_FEUADDR_REG    0x44   /* Upper address register */
+#define DMAR_AFLOG_REG  0x58 /* Advanced Fault control */
+#define DMAR_AFLOG_REG_HI   0X5c
+#define DMAR_PMEN_REG   0x64  /* Enable Protected Memory Region */
+#define DMAR_PLMBASE_REG    0x68    /* PMRR Low addr */
+#define DMAR_PLMLIMIT_REG 0x6c  /* PMRR low limit */
+#define DMAR_PHMBASE_REG 0x70   /* pmrr high base addr */
+#define DMAR_PHMBASE_REG_HI 0X74
+#define DMAR_PHMLIMIT_REG 0x78  /* pmrr high limit */
+#define DMAR_PHMLIMIT_REG_HI 0x7c
+#define DMAR_IQH_REG    0x80   /* Invalidation queue head register */
+#define DMAR_IQH_REG_HI 0X84
+#define DMAR_IQT_REG    0x88   /* Invalidation queue tail register */
+#define DMAR_IQT_REG_HI 0X8c
+#define DMAR_IQ_SHIFT   4 /* Invalidation queue head/tail shift */
+#define DMAR_IQA_REG    0x90   /* Invalidation queue addr register */
+#define DMAR_IQA_REG_HI 0x94
+#define DMAR_ICS_REG    0x9c   /* Invalidation complete status register */
+#define DMAR_IRTA_REG   0xb8    /* Interrupt remapping table addr register */
+#define DMAR_IRTA_REG_HI    0xbc
+
+#define DMAR_IECTL_REG  0xa0    /* Invalidation event control register */
+#define DMAR_IEDATA_REG 0xa4    /* Invalidation event data register */
+#define DMAR_IEADDR_REG 0xa8    /* Invalidation event address register */
+#define DMAR_IEUADDR_REG 0xac    /* Invalidation event address register */
+#define DMAR_PQH_REG    0xc0    /* Page request queue head register */
+#define DMAR_PQH_REG_HI 0xc4
+#define DMAR_PQT_REG    0xc8    /* Page request queue tail register */
+#define DMAR_PQT_REG_HI     0xcc
+#define DMAR_PQA_REG    0xd0    /* Page request queue address register */
+#define DMAR_PQA_REG_HI 0xd4
+#define DMAR_PRS_REG    0xdc    /* Page request status register */
+#define DMAR_PECTL_REG  0xe0    /* Page request event control register */
+#define DMAR_PEDATA_REG 0xe4    /* Page request event data register */
+#define DMAR_PEADDR_REG 0xe8    /* Page request event address register */
+#define DMAR_PEUADDR_REG  0xec  /* Page event upper address register */
+#define DMAR_MTRRCAP_REG 0x100  /* MTRR capability register */
+#define DMAR_MTRRCAP_REG_HI 0x104
+#define DMAR_MTRRDEF_REG 0x108  /* MTRR default type register */
+#define DMAR_MTRRDEF_REG_HI 0x10c
+
+/* IOTLB */
+#define DMAR_IOTLB_REG_OFFSET 0xf0  /* Offset to the IOTLB registers */
+#define DMAR_IVA_REG DMAR_IOTLB_REG_OFFSET  /* Invalidate Address Register */
+#define DMAR_IVA_REG_HI (DMAR_IVA_REG + 4)
+/* IOTLB Invalidate Register */
+#define DMAR_IOTLB_REG (DMAR_IOTLB_REG_OFFSET + 0x8)
+#define DMAR_IOTLB_REG_HI (DMAR_IOTLB_REG + 4)
+
+/* FRCD */
+#define DMAR_FRCD_REG_OFFSET 0x220 /* Offset to the Fault Recording Registers */
+/* NOTICE: If you change the DMAR_FRCD_REG_NR, please remember to change the
+ * DMAR_REG_SIZE in include/hw/i386/intel_iommu.h.
+ * #define DMAR_REG_SIZE   (DMAR_FRCD_REG_OFFSET + 16 * DMAR_FRCD_REG_NR)
+ */
+#define DMAR_FRCD_REG_NR 1ULL /* Num of Fault Recording Registers */
+
+#define DMAR_FRCD_REG_0_0    0x220 /* The 0th Fault Recording Register */
+#define DMAR_FRCD_REG_0_1    0x224
+#define DMAR_FRCD_REG_0_2    0x228
+#define DMAR_FRCD_REG_0_3    0x22c
+
+/* Interrupt Address Range */
+#define VTD_INTERRUPT_ADDR_FIRST    0xfee00000ULL
+#define VTD_INTERRUPT_ADDR_LAST     0xfeefffffULL
+
+/* IOTLB_REG */
+#define VTD_TLB_GLOBAL_FLUSH (1ULL << 60) /* Global invalidation */
+#define VTD_TLB_DSI_FLUSH (2ULL << 60)  /* Domain-selective invalidation */
+#define VTD_TLB_PSI_FLUSH (3ULL << 60)  /* Page-selective invalidation */
+#define VTD_TLB_FLUSH_GRANU_MASK (3ULL << 60)
+#define VTD_TLB_GLOBAL_FLUSH_A (1ULL << 57)
+#define VTD_TLB_DSI_FLUSH_A (2ULL << 57)
+#define VTD_TLB_PSI_FLUSH_A (3ULL << 57)
+#define VTD_TLB_FLUSH_GRANU_MASK_A (3ULL << 57)
+#define VTD_TLB_IVT (1ULL << 63)
+
+/* GCMD_REG */
+#define VTD_GCMD_TE (1UL << 31)
+#define VTD_GCMD_SRTP (1UL << 30)
+#define VTD_GCMD_SFL (1UL << 29)
+#define VTD_GCMD_EAFL (1UL << 28)
+#define VTD_GCMD_WBF (1UL << 27)
+#define VTD_GCMD_QIE (1UL << 26)
+#define VTD_GCMD_IRE (1UL << 25)
+#define VTD_GCMD_SIRTP (1UL << 24)
+#define VTD_GCMD_CFI (1UL << 23)
+
+/* GSTS_REG */
+#define VTD_GSTS_TES (1UL << 31)
+#define VTD_GSTS_RTPS (1UL << 30)
+#define VTD_GSTS_FLS (1UL << 29)
+#define VTD_GSTS_AFLS (1UL << 28)
+#define VTD_GSTS_WBFS (1UL << 27)
+#define VTD_GSTS_QIES (1UL << 26)
+#define VTD_GSTS_IRES (1UL << 25)
+#define VTD_GSTS_IRTPS (1UL << 24)
+#define VTD_GSTS_CFIS (1UL << 23)
+
+/* CCMD_REG */
+#define VTD_CCMD_ICC (1ULL << 63)
+#define VTD_CCMD_GLOBAL_INVL (1ULL << 61)
+#define VTD_CCMD_DOMAIN_INVL (2ULL << 61)
+#define VTD_CCMD_DEVICE_INVL (3ULL << 61)
+#define VTD_CCMD_CIRG_MASK (3ULL << 61)
+#define VTD_CCMD_GLOBAL_INVL_A (1ULL << 59)
+#define VTD_CCMD_DOMAIN_INVL_A (2ULL << 59)
+#define VTD_CCMD_DEVICE_INVL_A (3ULL << 59)
+#define VTD_CCMD_CAIG_MASK (3ULL << 59)
+
+/* RTADDR_REG */
+#define VTD_RTADDR_RTT (1ULL << 11)
+#define VTD_RTADDR_ADDR_MASK (VTD_HAW_MASK ^ 0xfffULL)
+
+/* ECAP_REG */
+#define VTD_ECAP_IRO (DMAR_IOTLB_REG_OFFSET << 4)  /* (offset >> 4) << 8 */
+#define VTD_ECAP_QI  (1ULL << 1)
+
+/* CAP_REG */
+#define VTD_CAP_FRO  (DMAR_FRCD_REG_OFFSET << 20) /* (offset >> 4) << 24 */
+#define VTD_CAP_NFR  ((DMAR_FRCD_REG_NR - 1) << 40)
+#define VTD_DOMAIN_ID_SHIFT     16  /* 16-bit domain id for 64K domains */
+#define VTD_CAP_ND  (((VTD_DOMAIN_ID_SHIFT - 4) / 2) & 7ULL)
+#define VTD_MGAW    39  /* Maximum Guest Address Width */
+#define VTD_CAP_MGAW    (((VTD_MGAW - 1) & 0x3fULL) << 16)
+
+/* Supported Adjusted Guest Address Widths */
+#define VTD_CAP_SAGAW_SHIFT (8)
+#define VTD_CAP_SAGAW_MASK  (0x1fULL << VTD_CAP_SAGAW_SHIFT)
+ /* 39-bit AGAW, 3-level page-table */
+#define VTD_CAP_SAGAW_39bit (0x2ULL << VTD_CAP_SAGAW_SHIFT)
+ /* 48-bit AGAW, 4-level page-table */
+#define VTD_CAP_SAGAW_48bit (0x4ULL << VTD_CAP_SAGAW_SHIFT)
+#define VTD_CAP_SAGAW       VTD_CAP_SAGAW_39bit
+
+/* IQT_REG */
+#define VTD_IQT_QT(val)     (((val) >> 4) & 0x7fffULL)
+
+/* IQA_REG */
+#define VTD_IQA_IQA_MASK    (VTD_HAW_MASK ^ 0xfffULL)
+#define VTD_IQA_QS          (0x7ULL)
+
+/* IQH_REG */
+#define VTD_IQH_QH_SHIFT    (4)
+#define VTD_IQH_QH_MASK     (0x7fff0ULL)
+
+/* ICS_REG */
+#define VTD_ICS_IWC         (1UL)
+
+/* IECTL_REG */
+#define VTD_IECTL_IM        (1UL << 31)
+#define VTD_IECTL_IP        (1UL << 30)
+
+/* FSTS_REG */
+#define VTD_FSTS_FRI_MASK  (0xff00)
+#define VTD_FSTS_FRI(val)  ((((uint32_t)(val)) << 8) & VTD_FSTS_FRI_MASK)
+#define VTD_FSTS_IQE       (1UL << 4)
+#define VTD_FSTS_PPF       (1UL << 1)
+#define VTD_FSTS_PFO       (1UL)
+
+/* FECTL_REG */
+#define VTD_FECTL_IM       (1UL << 31)
+#define VTD_FECTL_IP       (1UL << 30)
+
+/* Fault Recording Register */
+/* For the high 64-bit of 128-bit */
+#define VTD_FRCD_F         (1ULL << 63)
+#define VTD_FRCD_T         (1ULL << 62)
+#define VTD_FRCD_FR(val)   (((val) & 0xffULL) << 32)
+#define VTD_FRCD_SID_MASK   0xffffULL
+#define VTD_FRCD_SID(val)  ((val) & VTD_FRCD_SID_MASK)
+/* For the low 64-bit of 128-bit */
+#define VTD_FRCD_FI(val)   ((val) & (((1ULL << VTD_MGAW) - 1) ^ 0xfffULL))
+
+/* DMA Remapping Fault Conditions */
+typedef enum VTDFaultReason {
+    /* Reserved for Advanced Fault logging. We use this to represent the case
+     * with no fault event.
+     */
+    VTD_FR_RESERVED = 0,
+    VTD_FR_ROOT_ENTRY_P = 1, /* The Present(P) field of root-entry is 0 */
+    VTD_FR_CONTEXT_ENTRY_P, /* The Present(P) field of context-entry is 0 */
+    VTD_FR_CONTEXT_ENTRY_INV, /* Invalid programming of a context-entry */
+    VTD_FR_ADDR_BEYOND_MGAW, /* Input-address above (2^x-1) */
+    VTD_FR_WRITE, /* No write permission */
+    VTD_FR_READ, /* No read permission */
+    /* Fail to access a second-level paging entry (not SL_PML4E) */
+    VTD_FR_PAGING_ENTRY_INV,
+    VTD_FR_ROOT_TABLE_INV, /* Fail to access a root-entry */
+    VTD_FR_CONTEXT_TABLE_INV, /* Fail to access a context-entry */
+    /* Non-zero reserved field in a present root-entry */
+    VTD_FR_ROOT_ENTRY_RSVD,
+    /* Non-zero reserved field in a present context-entry */
+    VTD_FR_CONTEXT_ENTRY_RSVD,
+    /* Non-zero reserved field in a second-level paging entry with at least
+     * one of the Read(R), Write(W) or Execute(E) fields Set.
+     */
+    VTD_FR_PAGING_ENTRY_RSVD,
+    /* Translation request or translated request explicitly blocked due to the
+     * programming of the Translation Type (T) field in the present
+     * context-entry.
+     */
+    VTD_FR_CONTEXT_ENTRY_TT,
+    /* This is not a normal fault reason. We use this to indicate some faults
+     * that are not referenced by the VT-d specification.
+     * Fault events with such a reason should not be recorded.
+     */
+    VTD_FR_RESERVED_ERR,
+    /* Guard */
+    VTD_FR_MAX,
+} VTDFaultReason;
+
+
+/* Masks for Queued Invalidation Descriptor */
+#define VTD_INV_DESC_TYPE  (0xf)
+#define VTD_INV_DESC_CC    (0x1) /* Context-cache Invalidate Descriptor */
+#define VTD_INV_DESC_IOTLB (0x2)
+#define VTD_INV_DESC_WAIT  (0x5) /* Invalidation Wait Descriptor */
+#define VTD_INV_DESC_NONE  (0)   /* Not an Invalidate Descriptor */
+
+
+/* Pagesize of VTD paging structures, including root and context tables */
+#define VTD_PAGE_SHIFT      (12)
+#define VTD_PAGE_SIZE       (1ULL << VTD_PAGE_SHIFT)
+
+#define VTD_PAGE_SHIFT_4K   (12)
+#define VTD_PAGE_MASK_4K    (~((1ULL << VTD_PAGE_SHIFT_4K) - 1))
+#define VTD_PAGE_SHIFT_2M   (21)
+#define VTD_PAGE_MASK_2M    (~((1ULL << VTD_PAGE_SHIFT_2M) - 1))
+#define VTD_PAGE_SHIFT_1G   (30)
+#define VTD_PAGE_MASK_1G    (~((1ULL << VTD_PAGE_SHIFT_1G) - 1))
+
+/* Root-Entry
+ * 0: Present
+ * 1-11: Reserved
+ * 12-63: Context-table Pointer
+ * 64-127: Reserved
+ */
+struct VTDRootEntry {
+    uint64_t val;
+    uint64_t rsvd;
+};
+typedef struct VTDRootEntry VTDRootEntry;
+
+/* Masks for struct VTDRootEntry */
+#define VTD_ROOT_ENTRY_P (1ULL << 0)
+#define VTD_ROOT_ENTRY_CTP  (~0xfffULL)
+
+#define VTD_ROOT_ENTRY_NR   (VTD_PAGE_SIZE / sizeof(VTDRootEntry))
+#define VTD_ROOT_ENTRY_RSVD (0xffeULL | ~VTD_HAW_MASK)
+
+/* Context-Entry */
+struct VTDContextEntry {
+    uint64_t lo;
+    uint64_t hi;
+};
+typedef struct VTDContextEntry VTDContextEntry;
+
+/* Masks for struct VTDContextEntry */
+/* lo */
+#define VTD_CONTEXT_ENTRY_P (1ULL << 0)
+#define VTD_CONTEXT_ENTRY_FPD   (1ULL << 1) /* Fault Processing Disable */
+#define VTD_CONTEXT_ENTRY_TT    (3ULL << 2) /* Translation Type */
+#define VTD_CONTEXT_TT_MULTI_LEVEL  (0)
+#define VTD_CONTEXT_TT_DEV_IOTLB    (1)
+#define VTD_CONTEXT_TT_PASS_THROUGH (2)
+/* Second Level Page Translation Pointer */
+#define VTD_CONTEXT_ENTRY_SLPTPTR   (~0xfffULL)
+#define VTD_CONTEXT_ENTRY_RSVD_LO   (0xff0ULL | ~VTD_HAW_MASK)
+/* hi */
+#define VTD_CONTEXT_ENTRY_AW    (7ULL) /* Adjusted guest-address-width */
+#define VTD_CONTEXT_ENTRY_DID   (0xffffULL << 8)    /* Domain Identifier */
+#define VTD_CONTEXT_ENTRY_RSVD_HI   (0xffffffffff000080ULL)
+
+#define VTD_CONTEXT_ENTRY_NR    (VTD_PAGE_SIZE / sizeof(VTDContextEntry))
+
+
+/* Paging Structure common */
+#define VTD_SL_PT_PAGE_SIZE_MASK   (1ULL << 7)
+#define VTD_SL_LEVEL_BITS   9   /* Bits to decide the offset for each level */
+
+/* Second Level Paging Structure */
+#define VTD_SL_PML4_LEVEL   4
+#define VTD_SL_PDP_LEVEL    3
+#define VTD_SL_PD_LEVEL     2
+#define VTD_SL_PT_LEVEL     1
+#define VTD_SL_PT_ENTRY_NR  512
+
+/* Masks for Second Level Paging Entry */
+#define VTD_SL_RW_MASK              (3ULL)
+#define VTD_SL_R                    (1ULL)
+#define VTD_SL_W                    (1ULL << 1)
+#define VTD_SL_PT_BASE_ADDR_MASK    (~(VTD_PAGE_SIZE - 1) & VTD_HAW_MASK)
+#define VTD_SL_IGN_COM    (0xbff0000000000000ULL)
+
+#endif
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
new file mode 100644
index 0000000..6601e62
--- /dev/null
+++ b/include/hw/i386/intel_iommu.h
@@ -0,0 +1,90 @@
+/*
+ * QEMU emulation of an Intel IOMMU (VT-d)
+ *   (DMA Remapping device)
+ *
+ * Copyright (C) 2013 Knut Omang, Oracle <knut.omang@oracle.com>
+ * Copyright (C) 2014 Le Tan, <tamlokveer@gmail.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef INTEL_IOMMU_H
+#define INTEL_IOMMU_H
+#include "hw/qdev.h"
+#include "sysemu/dma.h"
+
+#define TYPE_INTEL_IOMMU_DEVICE "intel-iommu"
+#define INTEL_IOMMU_DEVICE(obj) \
+     OBJECT_CHECK(IntelIOMMUState, (obj), TYPE_INTEL_IOMMU_DEVICE)
+
+/* DMAR Hardware Unit Definition address (IOMMU unit) */
+#define Q35_HOST_BRIDGE_IOMMU_ADDR 0xfed90000ULL
+
+#define VTD_PCI_BUS_MAX 256
+#define VTD_PCI_SLOT_MAX 32
+#define VTD_PCI_FUNC_MAX 8
+#define VTD_PCI_SLOT(devfn)         (((devfn) >> 3) & 0x1f)
+#define VTD_PCI_FUNC(devfn)         ((devfn) & 0x07)
+
+#define DMAR_REG_SIZE   0x230
+
+/* FIXME: not sure how to determine the host address width (HAW) */
+#define VTD_HOST_ADDRESS_WIDTH  39
+#define VTD_HAW_MASK    ((1ULL << VTD_HOST_ADDRESS_WIDTH) - 1)
+
+typedef struct IntelIOMMUState IntelIOMMUState;
+typedef struct VTDAddressSpace VTDAddressSpace;
+
+struct VTDAddressSpace {
+    int bus_num;
+    int devfn;
+    AddressSpace as;
+    MemoryRegion iommu;
+    IntelIOMMUState *iommu_state;
+};
+
+/* The iommu (DMAR) device state struct */
+struct IntelIOMMUState {
+    SysBusDevice busdev;
+    MemoryRegion csrmem;
+    uint8_t csr[DMAR_REG_SIZE];     /* register values */
+    uint8_t wmask[DMAR_REG_SIZE];   /* R/W bytes */
+    uint8_t w1cmask[DMAR_REG_SIZE]; /* RW1C(Write 1 to Clear) bytes */
+    uint8_t womask[DMAR_REG_SIZE]; /* WO (write only - read returns 0) */
+    uint32_t version;
+
+    dma_addr_t root;        /* Current root table pointer */
+    bool root_extended;     /* Type of root table (extended or not) */
+    bool dmar_enabled;      /* Set if DMA remapping is enabled */
+
+    uint16_t iq_head;       /* Current invalidation queue head */
+    uint16_t iq_tail;       /* Current invalidation queue tail */
+    dma_addr_t iq;          /* Current invalidation queue (IQ) pointer */
+    uint16_t iq_size;       /* IQ Size in number of entries */
+    bool qi_enabled;        /* Set if the QI is enabled */
+    uint8_t iq_last_desc_type; /* The type of last completed descriptor */
+
+    /* The index of the Fault Recording Register to be used next.
+     * Wraps around from N-1 to 0, where N is the number of FRCD_REG.
+     */
+    uint16_t next_frcd_reg;
+
+    uint64_t cap;           /* The value of Capability Register */
+    uint64_t ecap;          /* The value of Extended Capability Register */
+
+    MemoryRegionIOMMUOps iommu_ops;
+    VTDAddressSpace **address_spaces[VTD_PCI_BUS_MAX];
+};
+
+#endif
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [Qemu-devel] [PATCH v3 3/5] intel-iommu: add DMAR table to ACPI tables
  2014-08-11  7:04 [Qemu-devel] [PATCH v3 0/5] intel-iommu: introduce Intel IOMMU (VT-d) emulation to q35 chipset Le Tan
  2014-08-11  7:04 ` [Qemu-devel] [PATCH v3 1/5] iommu: add is_write as a parameter to the translate function of MemoryRegionIOMMUOps Le Tan
  2014-08-11  7:04 ` [Qemu-devel] [PATCH v3 2/5] intel-iommu: introduce Intel IOMMU (VT-d) emulation Le Tan
@ 2014-08-11  7:05 ` Le Tan
  2014-08-14 11:06   ` Michael S. Tsirkin
  2014-08-11  7:05 ` [Qemu-devel] [PATCH v3 4/5] intel-iommu: add Intel IOMMU emulation to q35 and add a machine option "iommu" as a switch Le Tan
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 34+ messages in thread
From: Le Tan @ 2014-08-11  7:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, Stefan Weil, Knut Omang, Le Tan,
	Alex Williamson, Jan Kiszka, Anthony Liguori, Paolo Bonzini

Expose the Intel IOMMU to the BIOS. If an object of TYPE_INTEL_IOMMU_DEVICE
exists, add a DMAR table to the ACPI RSDT. For now the DMAR table indicates
that there is only one hardware unit, without INTR_REMAP capability, on the
platform.
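
As an illustration only (not part of this patch): with a Linux guest, the
presence of the table can be checked after boot with something like

    dmesg | grep -i -e DMAR -e IOMMU

or by dumping the ACPI tables (e.g. acpidump from acpica-tools, assuming it
is available in the guest) and looking for the DMAR signature.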

Signed-off-by: Le Tan <tamlokveer@gmail.com>
---
 hw/i386/acpi-build.c | 41 ++++++++++++++++++++++++++++++
 hw/i386/acpi-defs.h  | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 111 insertions(+)

diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index 816c6d9..595f501 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -47,6 +47,7 @@
 #include "hw/i386/ich9.h"
 #include "hw/pci/pci_bus.h"
 #include "hw/pci-host/q35.h"
+#include "hw/i386/intel_iommu.h"
 
 #include "hw/i386/q35-acpi-dsdt.hex"
 #include "hw/i386/acpi-dsdt.hex"
@@ -1350,6 +1351,31 @@ build_mcfg_q35(GArray *table_data, GArray *linker, AcpiMcfgInfo *info)
 }
 
 static void
+build_dmar_q35(GArray *table_data, GArray *linker)
+{
+    int dmar_start = table_data->len;
+
+    AcpiTableDmar *dmar;
+    AcpiDmarHardwareUnit *drhd;
+
+    dmar = acpi_data_push(table_data, sizeof(*dmar));
+    dmar->host_address_width = VTD_HOST_ADDRESS_WIDTH - 1;
+    dmar->flags = 0;    /* No intr_remap for now */
+
+    /* DMAR Remapping Hardware Unit Definition structure */
+    drhd = acpi_data_push(table_data, sizeof(*drhd));
+    drhd->type = cpu_to_le16(ACPI_DMAR_TYPE_HARDWARE_UNIT);
+    drhd->length = cpu_to_le16(sizeof(*drhd));   /* No device scope now */
+    drhd->flags = ACPI_DMAR_INCLUDE_PCI_ALL;
+    drhd->pci_segment = cpu_to_le16(0);
+    drhd->address = cpu_to_le64(Q35_HOST_BRIDGE_IOMMU_ADDR);
+
+    build_header(linker, table_data, (void *)(table_data->data + dmar_start),
+                 "DMAR", table_data->len - dmar_start, 1);
+}
+
+
+static void
 build_dsdt(GArray *table_data, GArray *linker, AcpiMiscInfo *misc)
 {
     AcpiTableHeader *dsdt;
@@ -1470,6 +1496,17 @@ static bool acpi_get_mcfg(AcpiMcfgInfo *mcfg)
     return true;
 }
 
+static bool acpi_has_iommu(void)
+{
+    bool ambiguous;
+    Object *intel_iommu;
+
+    intel_iommu = object_resolve_path_type("", TYPE_INTEL_IOMMU_DEVICE,
+                                           &ambiguous);
+    return intel_iommu && !ambiguous;
+}
+
+
 static
 void acpi_build(PcGuestInfo *guest_info, AcpiBuildTables *tables)
 {
@@ -1539,6 +1576,10 @@ void acpi_build(PcGuestInfo *guest_info, AcpiBuildTables *tables)
         acpi_add_table(table_offsets, tables->table_data);
         build_mcfg_q35(tables->table_data, tables->linker, &mcfg);
     }
+    if (acpi_has_iommu()) {
+        acpi_add_table(table_offsets, tables->table_data);
+        build_dmar_q35(tables->table_data, tables->linker);
+    }
 
     /* Add tables supplied by user (if any) */
     for (u = acpi_table_first(); u; u = acpi_table_next(u)) {
diff --git a/hw/i386/acpi-defs.h b/hw/i386/acpi-defs.h
index e93babb..9674825 100644
--- a/hw/i386/acpi-defs.h
+++ b/hw/i386/acpi-defs.h
@@ -314,4 +314,74 @@ struct AcpiTableMcfg {
 } QEMU_PACKED;
 typedef struct AcpiTableMcfg AcpiTableMcfg;
 
+/* DMAR - DMA Remapping table r2.2 */
+struct AcpiTableDmar {
+    ACPI_TABLE_HEADER_DEF
+    uint8_t host_address_width; /* Maximum DMA physical addressability */
+    uint8_t flags;
+    uint8_t reserved[10];
+} QEMU_PACKED;
+typedef struct AcpiTableDmar AcpiTableDmar;
+
+/* Masks for Flags field above */
+#define ACPI_DMAR_INTR_REMAP    (1)
+#define ACPI_DMAR_X2APIC_OPT_OUT    (2)
+
+/*
+ * DMAR sub-structures (Follow DMA Remapping table)
+ */
+#define ACPI_DMAR_SUB_HEADER_DEF /* Common ACPI DMAR sub-structure header */\
+    uint16_t type;  \
+    uint16_t length;
+
+/* Values for sub-structure type for DMAR */
+enum {
+    ACPI_DMAR_TYPE_HARDWARE_UNIT = 0,   /* DRHD */
+    ACPI_DMAR_TYPE_RESERVED_MEMORY = 1, /* RMRR */
+    ACPI_DMAR_TYPE_ATSR = 2,    /* ATSR */
+    ACPI_DMAR_TYPE_HARDWARE_AFFINITY = 3,   /* RHSA */
+    ACPI_DMAR_TYPE_ANDD = 4,    /* ANDD */
+    ACPI_DMAR_TYPE_RESERVED = 5 /* Reserved for future use */
+};
+
+/*
+ * Sub-structures for DMAR, correspond to Type in ACPI_DMAR_SUB_HEADER_DEF
+ */
+
+/* DMAR Device Scope structures */
+struct AcpiDmarDeviceScope {
+    uint8_t type;
+    uint8_t length;
+    uint16_t reserved;
+    uint8_t enumeration_id;
+    uint8_t start_bus_number;
+    uint8_t path[0];
+} QEMU_PACKED;
+typedef struct AcpiDmarDeviceScope AcpiDmarDeviceScope;
+
+/* Values for type in struct AcpiDmarDeviceScope */
+enum {
+    ACPI_DMAR_SCOPE_TYPE_NOT_USED = 0,
+    ACPI_DMAR_SCOPE_TYPE_ENDPOINT = 1,
+    ACPI_DMAR_SCOPE_TYPE_BRIDGE = 2,
+    ACPI_DMAR_SCOPE_TYPE_IOAPIC = 3,
+    ACPI_DMAR_SCOPE_TYPE_HPET = 4,
+    ACPI_DMAR_SCOPE_TYPE_ACPI = 5,
+    ACPI_DMAR_SCOPE_TYPE_RESERVED = 6 /* Reserved for future use */
+};
+
+/* 0: Hardware Unit Definition */
+struct AcpiDmarHardwareUnit {
+    ACPI_DMAR_SUB_HEADER_DEF
+    uint8_t flags;
+    uint8_t reserved;
+    uint16_t pci_segment;   /* The PCI Segment associated with this unit */
+    uint64_t address;   /* Base address of remapping hardware register-set */
+} QEMU_PACKED;
+typedef struct AcpiDmarHardwareUnit AcpiDmarHardwareUnit;
+
+/* Masks for Flags field above */
+#define ACPI_DMAR_INCLUDE_PCI_ALL (1)
+
+
 #endif
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [Qemu-devel] [PATCH v3 4/5] intel-iommu: add Intel IOMMU emulation to q35 and add a machine option "iommu" as a switch
  2014-08-11  7:04 [Qemu-devel] [PATCH v3 0/5] intel-iommu: introduce Intel IOMMU (VT-d) emulation to q35 chipset Le Tan
                   ` (2 preceding siblings ...)
  2014-08-11  7:05 ` [Qemu-devel] [PATCH v3 3/5] intel-iommu: add DMAR table to ACPI tables Le Tan
@ 2014-08-11  7:05 ` Le Tan
  2014-08-14 11:12   ` Michael S. Tsirkin
  2014-08-11  7:05 ` [Qemu-devel] [PATCH v3 5/5] intel-iommu: add supports for queued invalidation interface Le Tan
  2014-08-14 11:15 ` [Qemu-devel] [PATCH v3 0/5] intel-iommu: introduce Intel IOMMU (VT-d) emulation to q35 chipset Michael S. Tsirkin
  5 siblings, 1 reply; 34+ messages in thread
From: Le Tan @ 2014-08-11  7:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, Stefan Weil, Knut Omang, Le Tan,
	Alex Williamson, Jan Kiszka, Anthony Liguori, Paolo Bonzini

Add Intel IOMMU emulation to the q35 chipset and expose it to the guest.
1. Add a machine option. Users can use "-machine iommu=on|off" in the command
line to enable/disable Intel IOMMU. The default is off.
2. According to the machine option, q35 will initialize the Intel IOMMU and
use pci_setup_iommu() to set up q35_host_dma_iommu() as the IOMMU function for
the PCI bus.
3. q35_host_dma_iommu() will return different address space according to the
bus_num and devfn of the device.
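
For illustration, a command line exercising the new option might look like
the following sketch (the disk image and other devices are placeholders;
only the iommu=on machine property comes from this patch):

    qemu-system-x86_64 -M q35,iommu=on -enable-kvm -m 1G \
        -drive file=guest.img,if=virtio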

Signed-off-by: Le Tan <tamlokveer@gmail.com>
---
 hw/core/machine.c         | 27 +++++++++++++++++---
 hw/pci-host/q35.c         | 64 +++++++++++++++++++++++++++++++++++++++++++----
 include/hw/boards.h       |  1 +
 include/hw/pci-host/q35.h |  2 ++
 qemu-options.hx           |  5 +++-
 vl.c                      |  4 +++
 6 files changed, 94 insertions(+), 9 deletions(-)

diff --git a/hw/core/machine.c b/hw/core/machine.c
index 7a66c57..f0046d6 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -235,6 +235,20 @@ static void machine_set_firmware(Object *obj, const char *value, Error **errp)
     ms->firmware = g_strdup(value);
 }
 
+static bool machine_get_iommu(Object *obj, Error **errp)
+{
+    MachineState *ms = MACHINE(obj);
+
+    return ms->iommu;
+}
+
+static void machine_set_iommu(Object *obj, bool value, Error **errp)
+{
+    MachineState *ms = MACHINE(obj);
+
+    ms->iommu = value;
+}
+
 static void machine_initfn(Object *obj)
 {
     object_property_add_str(obj, "accel",
@@ -270,10 +284,17 @@ static void machine_initfn(Object *obj)
                              machine_set_dump_guest_core,
                              NULL);
     object_property_add_bool(obj, "mem-merge",
-                             machine_get_mem_merge, machine_set_mem_merge, NULL);
-    object_property_add_bool(obj, "usb", machine_get_usb, machine_set_usb, NULL);
+                             machine_get_mem_merge,
+                             machine_set_mem_merge, NULL);
+    object_property_add_bool(obj, "usb",
+                             machine_get_usb,
+                             machine_set_usb, NULL);
     object_property_add_str(obj, "firmware",
-                            machine_get_firmware, machine_set_firmware, NULL);
+                            machine_get_firmware,
+                            machine_set_firmware, NULL);
+    object_property_add_bool(obj, "iommu",
+                             machine_get_iommu,
+                             machine_set_iommu, NULL);
 }
 
 static void machine_finalize(Object *obj)
diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c
index a0a3068..3342711 100644
--- a/hw/pci-host/q35.c
+++ b/hw/pci-host/q35.c
@@ -347,6 +347,53 @@ static void mch_reset(DeviceState *qdev)
     mch_update(mch);
 }
 
+static AddressSpace *q35_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
+{
+    IntelIOMMUState *s = opaque;
+    VTDAddressSpace **pvtd_as;
+    VTDAddressSpace *vtd_as;
+    int bus_num = pci_bus_num(bus);
+
+    assert(devfn >= 0);
+
+    pvtd_as = s->address_spaces[bus_num];
+    if (!pvtd_as) {
+        /* No corresponding free() */
+        pvtd_as = g_malloc0(sizeof(VTDAddressSpace *) *
+                            VTD_PCI_SLOT_MAX * VTD_PCI_FUNC_MAX);
+        s->address_spaces[bus_num] = pvtd_as;
+    }
+
+    vtd_as = *(pvtd_as + devfn);
+    if (!vtd_as) {
+        vtd_as = g_malloc0(sizeof(*vtd_as));
+        *(pvtd_as + devfn) = vtd_as;
+
+        vtd_as->bus_num = bus_num;
+        vtd_as->devfn = devfn;
+        vtd_as->iommu_state = s;
+        memory_region_init_iommu(&vtd_as->iommu, OBJECT(s), &s->iommu_ops,
+                                 "intel_iommu", UINT64_MAX);
+        address_space_init(&vtd_as->as, &vtd_as->iommu, "intel_iommu");
+    }
+
+    return &vtd_as->as;
+}
+
+static void mch_init_dmar(MCHPCIState *mch)
+{
+    PCIBus *pci_bus = PCI_BUS(qdev_get_parent_bus(DEVICE(mch)));
+
+    mch->iommu = INTEL_IOMMU_DEVICE(qdev_create(NULL, TYPE_INTEL_IOMMU_DEVICE));
+    object_property_add_child(OBJECT(mch), "intel-iommu",
+                              OBJECT(mch->iommu), NULL);
+    qdev_init_nofail(DEVICE(mch->iommu));
+    sysbus_mmio_map(SYS_BUS_DEVICE(mch->iommu), 0, Q35_HOST_BRIDGE_IOMMU_ADDR);
+
+    pci_setup_iommu(pci_bus, q35_host_dma_iommu, mch->iommu);
+}
+
+
 static int mch_init(PCIDevice *d)
 {
     int i;
@@ -363,13 +410,20 @@ static int mch_init(PCIDevice *d)
     memory_region_add_subregion_overlap(mch->system_memory, 0xa0000,
                                         &mch->smram_region, 1);
     memory_region_set_enabled(&mch->smram_region, false);
-    init_pam(DEVICE(mch), mch->ram_memory, mch->system_memory, mch->pci_address_space,
-             &mch->pam_regions[0], PAM_BIOS_BASE, PAM_BIOS_SIZE);
+    init_pam(DEVICE(mch), mch->ram_memory, mch->system_memory,
+             mch->pci_address_space, &mch->pam_regions[0], PAM_BIOS_BASE,
+             PAM_BIOS_SIZE);
     for (i = 0; i < 12; ++i) {
-        init_pam(DEVICE(mch), mch->ram_memory, mch->system_memory, mch->pci_address_space,
-                 &mch->pam_regions[i+1], PAM_EXPAN_BASE + i * PAM_EXPAN_SIZE,
-                 PAM_EXPAN_SIZE);
+        init_pam(DEVICE(mch), mch->ram_memory, mch->system_memory,
+                 mch->pci_address_space, &mch->pam_regions[i+1],
+                 PAM_EXPAN_BASE + i * PAM_EXPAN_SIZE, PAM_EXPAN_SIZE);
+    }
+
+    /* Intel IOMMU (VT-d) */
+    if (qemu_opt_get_bool(qemu_get_machine_opts(), "iommu", false)) {
+        mch_init_dmar(mch);
     }
+
     return 0;
 }
 
diff --git a/include/hw/boards.h b/include/hw/boards.h
index 605a970..dfb6718 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -123,6 +123,7 @@ struct MachineState {
     bool mem_merge;
     bool usb;
     char *firmware;
+    bool iommu;
 
     ram_addr_t ram_size;
     ram_addr_t maxram_size;
diff --git a/include/hw/pci-host/q35.h b/include/hw/pci-host/q35.h
index d9ee978..025d6e6 100644
--- a/include/hw/pci-host/q35.h
+++ b/include/hw/pci-host/q35.h
@@ -33,6 +33,7 @@
 #include "hw/acpi/acpi.h"
 #include "hw/acpi/ich9.h"
 #include "hw/pci-host/pam.h"
+#include "hw/i386/intel_iommu.h"
 
 #define TYPE_Q35_HOST_DEVICE "q35-pcihost"
 #define Q35_HOST_DEVICE(obj) \
@@ -60,6 +61,7 @@ typedef struct MCHPCIState {
     uint64_t pci_hole64_size;
     PcGuestInfo *guest_info;
     uint32_t short_root_bus;
+    IntelIOMMUState *iommu;
 } MCHPCIState;
 
 typedef struct Q35PCIHost {
diff --git a/qemu-options.hx b/qemu-options.hx
index 96516c1..7406a17 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -35,7 +35,8 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
     "                kernel_irqchip=on|off controls accelerated irqchip support\n"
     "                kvm_shadow_mem=size of KVM shadow MMU\n"
     "                dump-guest-core=on|off include guest memory in a core dump (default=on)\n"
-    "                mem-merge=on|off controls memory merge support (default: on)\n",
+    "                mem-merge=on|off controls memory merge support (default: on)\n"
+    "                iommu=on|off controls emulated Intel IOMMU (VT-d) support (default=off)\n",
     QEMU_ARCH_ALL)
 STEXI
 @item -machine [type=]@var{name}[,prop=@var{value}[,...]]
@@ -58,6 +59,8 @@ Include guest memory in a core dump. The default is on.
 Enables or disables memory merge support. This feature, when supported by
 the host, de-duplicates identical memory pages among VMs instances
 (enabled by default).
+@item iommu=on|off
+Enables or disables emulated Intel IOMMU (VT-d) support. The default is off.
 @end table
 ETEXI
 
diff --git a/vl.c b/vl.c
index a8029d5..2ab1643 100644
--- a/vl.c
+++ b/vl.c
@@ -388,6 +388,10 @@ static QemuOptsList qemu_machine_opts = {
             .name = PC_MACHINE_MAX_RAM_BELOW_4G,
             .type = QEMU_OPT_SIZE,
             .help = "maximum ram below the 4G boundary (32bit boundary)",
+        },{
+            .name = "iommu",
+            .type = QEMU_OPT_BOOL,
+            .help = "Set on/off to enable/disable Intel IOMMU (VT-d)",
         },
         { /* End of list */ }
     },
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [Qemu-devel] [PATCH v3 5/5] intel-iommu: add supports for queued invalidation interface
  2014-08-11  7:04 [Qemu-devel] [PATCH v3 0/5] intel-iommu: introduce Intel IOMMU (VT-d) emulation to q35 chipset Le Tan
                   ` (3 preceding siblings ...)
  2014-08-11  7:05 ` [Qemu-devel] [PATCH v3 4/5] intel-iommu: add Intel IOMMU emulation to q35 and add a machine option "iommu" as a switch Le Tan
@ 2014-08-11  7:05 ` Le Tan
  2014-08-14 11:15 ` [Qemu-devel] [PATCH v3 0/5] intel-iommu: introduce Intel IOMMU (VT-d) emulation to q35 chipset Michael S. Tsirkin
  5 siblings, 0 replies; 34+ messages in thread
From: Le Tan @ 2014-08-11  7:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, Stefan Weil, Knut Omang, Le Tan,
	Alex Williamson, Jan Kiszka, Anthony Liguori, Paolo Bonzini

Add support for the queued invalidation interface, an expanded invalidation
interface with extended capabilities.
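
As a purely illustrative sketch (not code in this patch), an Invalidation
Wait Descriptor with Status Write, as a guest driver would place it in the
invalidation queue, maps onto the VTD_INV_DESC_WAIT_* masks added below
roughly like this (status_data and status_addr are placeholder variables):

    VTDInvDesc desc;
    desc.lo = VTD_INV_DESC_WAIT | VTD_INV_DESC_WAIT_SW |
              ((uint64_t)status_data << VTD_INV_DESC_WAIT_DATA_SHIFT);
    desc.hi = status_addr;  /* emulation writes status_data here on completion */

which is roughly the layout that vtd_process_wait_desc() decodes on the
emulation side.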

Signed-off-by: Le Tan <tamlokveer@gmail.com>
---
 hw/i386/intel_iommu.c          | 381 ++++++++++++++++++++++++++++++++++++++++-
 hw/i386/intel_iommu_internal.h |  15 +-
 2 files changed, 393 insertions(+), 3 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index b3a4f78..0e7d62d 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -332,6 +332,43 @@ static void vtd_report_dmar_fault(IntelIOMMUState *s, uint16_t source_id,
     }
 }
 
+/* Handle Invalidation Queue Errors of queued invalidation interface error
+ * conditions.
+ */
+static void vtd_handle_inv_queue_error(IntelIOMMUState *s)
+{
+    uint32_t fsts_reg = get_long_raw(s, DMAR_FSTS_REG);
+
+    set_clear_mask_long(s, DMAR_FSTS_REG, 0, VTD_FSTS_IQE);
+    vtd_generate_fault_event(s, fsts_reg);
+}
+
+/* Set the IWC field and try to generate an invalidation completion interrupt */
+static void vtd_generate_completion_event(IntelIOMMUState *s)
+{
+    VTD_DPRINTF(INV, "completes an invalidation wait command with "
+                "Interrupt Flag");
+    if (get_long_raw(s, DMAR_ICS_REG) & VTD_ICS_IWC) {
+        VTD_DPRINTF(INV, "there is a previous interrupt condition to be "
+                    "serviced by software, "
+                    "new invalidation event is not generated");
+        return;
+    }
+
+    set_clear_mask_long(s, DMAR_ICS_REG, 0, VTD_ICS_IWC);
+    set_clear_mask_long(s, DMAR_IECTL_REG, 0, VTD_IECTL_IP);
+    if (get_long_raw(s, DMAR_IECTL_REG) & VTD_IECTL_IM) {
+        VTD_DPRINTF(INV, "IM filed in IECTL_REG is set, new invalidation "
+                    "event is not generated");
+        return;
+    } else {
+        /* Generate the interrupt event */
+        vtd_generate_interrupt(s, DMAR_IEADDR_REG, DMAR_IEDATA_REG);
+        set_clear_mask_long(s, DMAR_IECTL_REG, VTD_IECTL_IP, 0);
+    }
+
+}
+
 static inline bool root_entry_present(VTDRootEntry *root)
 {
     return root->val & VTD_ROOT_ENTRY_P;
@@ -809,6 +846,54 @@ static uint64_t vtd_iotlb_flush(IntelIOMMUState *s, uint64_t val)
     return iaig;
 }
 
+static inline bool queued_inv_enable_check(IntelIOMMUState *s)
+{
+    return s->iq_tail == 0;
+}
+
+static inline bool queued_inv_disable_check(IntelIOMMUState *s)
+{
+    return s->qi_enabled && (s->iq_tail == s->iq_head) &&
+           (s->iq_last_desc_type == VTD_INV_DESC_WAIT);
+}
+
+static void handle_gcmd_qie(IntelIOMMUState *s, bool en)
+{
+    uint64_t iqa_val = get_quad_raw(s, DMAR_IQA_REG);
+    VTD_DPRINTF(INV, "Queued Invalidation Enable %s", (en ? "on" : "off"));
+
+    if (en) {
+        if (queued_inv_enable_check(s)) {
+            s->iq = iqa_val & VTD_IQA_IQA_MASK;
+            /* 2^(x+8) entries */
+            s->iq_size = 1UL << ((iqa_val & VTD_IQA_QS) + 8);
+            s->qi_enabled = true;
+            VTD_DPRINTF(INV, "DMAR_IQA_REG 0x%"PRIx64, iqa_val);
+            VTD_DPRINTF(INV, "Invalidation Queue addr 0x%"PRIx64 " size %d",
+                        s->iq, s->iq_size);
+            /* Ok - report back to driver */
+            set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_QIES);
+        } else {
+            VTD_DPRINTF(GENERAL, "error: can't enable Queued Invalidation: "
+                        "tail %"PRIu16, s->iq_tail);
+        }
+    } else {
+        if (queued_inv_disable_check(s)) {
+            /* disable Queued Invalidation */
+            set_quad_raw(s, DMAR_IQH_REG, 0);
+            s->iq_head = 0;
+            s->qi_enabled = false;
+            /* Ok - report back to driver */
+            set_clear_mask_long(s, DMAR_GSTS_REG, VTD_GSTS_QIES, 0);
+        } else {
+            VTD_DPRINTF(GENERAL, "error: can't disable Queued Invalidation: "
+                        "head %"PRIu16 ", tail %"PRIu16
+                        ", last_descriptor %"PRIu8,
+                        s->iq_head, s->iq_tail, s->iq_last_desc_type);
+        }
+    }
+}
+
 /* Set Root Table Pointer */
 static void handle_gcmd_srtp(IntelIOMMUState *s)
 {
@@ -854,6 +939,10 @@ static void handle_gcmd_write(IntelIOMMUState *s)
         /* Set/update the root-table pointer */
         handle_gcmd_srtp(s);
     }
+    if (changed & VTD_GCMD_QIE) {
+        /* Queued Invalidation Enable */
+        handle_gcmd_qie(s, val & VTD_GCMD_QIE);
+    }
 }
 
 /* Handle write to Context Command Register */
@@ -864,6 +953,11 @@ static void handle_ccmd_write(IntelIOMMUState *s)
 
     /* Context-cache invalidation request */
     if (val & VTD_CCMD_ICC) {
+        if (s->qi_enabled) {
+            VTD_DPRINTF(GENERAL, "error: Queued Invalidation enabled, "
+                        "should not use register-based invalidation");
+            return;
+        }
         ret = vtd_context_cache_invalidate(s, val);
 
         /* Invalidation completed. Change something to show */
@@ -881,6 +975,11 @@ static void handle_iotlb_write(IntelIOMMUState *s)
 
     /* IOTLB invalidation request */
     if (val & VTD_TLB_IVT) {
+        if (s->qi_enabled) {
+            VTD_DPRINTF(GENERAL, "error: Queued Invalidation enabled, "
+                        "should not use register-based invalidation");
+            return;
+        }
         ret = vtd_iotlb_flush(s, val);
 
         /* Invalidation completed. Change something to show */
@@ -891,6 +990,142 @@ static void handle_iotlb_write(IntelIOMMUState *s)
     }
 }
 
+/* Fetch an Invalidation Descriptor from the Invalidation Queue */
+static bool get_inv_desc(dma_addr_t base_addr, uint32_t offset,
+                         VTDInvDesc *inv_desc)
+{
+    dma_addr_t addr = base_addr + offset * sizeof(*inv_desc);
+    if (dma_memory_read(&address_space_memory, addr, inv_desc,
+        sizeof(*inv_desc))) {
+        VTD_DPRINTF(GENERAL, "error: fail to fetch Invalidation Descriptor "
+                    "base_addr 0x%"PRIx64 " offset %"PRIu32, base_addr, offset);
+        inv_desc->lo = 0;
+        inv_desc->hi = 0;
+
+        return false;
+    }
+
+    inv_desc->lo = le64_to_cpu(inv_desc->lo);
+    inv_desc->hi = le64_to_cpu(inv_desc->hi);
+    return true;
+}
+
+static bool vtd_process_wait_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
+{
+    if (inv_desc->lo & VTD_INV_DESC_WAIT_SW) {
+        /* Status Write */
+        uint32_t status_data = (uint32_t)(inv_desc->lo >>
+                               VTD_INV_DESC_WAIT_DATA_SHIFT);
+
+        assert(!(inv_desc->lo & VTD_INV_DESC_WAIT_IF));
+
+        /* FIXME: need to be masked with HAW? */
+        dma_addr_t status_addr = inv_desc->hi;
+        VTD_DPRINTF(INV, "status data 0x%x, status addr 0x%"PRIx64,
+                    status_data, status_addr);
+        status_data = cpu_to_le32(status_data);
+        if (dma_memory_write(&address_space_memory, status_addr, &status_data,
+                             sizeof(status_data))) {
+            VTD_DPRINTF(GENERAL, "error: fail to perform a coherent write");
+            return false;
+        }
+    } else if (inv_desc->lo & VTD_INV_DESC_WAIT_IF) {
+        /* Interrupt flag */
+        VTD_DPRINTF(INV, "Invalidation Wait Descriptor interrupt completion");
+        vtd_generate_completion_event(s);
+    } else {
+        VTD_DPRINTF(GENERAL, "error: invalid Invalidation Wait Descriptor: "
+                    "hi 0x%"PRIx64 " lo 0x%"PRIx64, inv_desc->hi, inv_desc->lo);
+        return false;
+    }
+    return true;
+}
+
+static inline bool vtd_process_inv_desc(IntelIOMMUState *s)
+{
+    VTDInvDesc inv_desc;
+    uint8_t desc_type;
+
+    VTD_DPRINTF(INV, "iq head %"PRIu16, s->iq_head);
+
+    if (!get_inv_desc(s->iq, s->iq_head, &inv_desc)) {
+        s->iq_last_desc_type = VTD_INV_DESC_NONE;
+        return false;
+    }
+    desc_type = inv_desc.lo & VTD_INV_DESC_TYPE;
+    /* FIXME: should update at first or at last? */
+    s->iq_last_desc_type = desc_type;
+
+    switch (desc_type) {
+    case VTD_INV_DESC_CC:
+        VTD_DPRINTF(INV, "Context-cache Invalidate Descriptor hi 0x%"PRIx64
+                    " lo 0x%"PRIx64, inv_desc.hi, inv_desc.lo);
+        break;
+
+    case VTD_INV_DESC_IOTLB:
+        VTD_DPRINTF(INV, "IOTLB Invalidate Descriptor hi 0x%"PRIx64
+                    " lo 0x%"PRIx64, inv_desc.hi, inv_desc.lo);
+        break;
+
+    case VTD_INV_DESC_WAIT:
+        VTD_DPRINTF(INV, "Invalidation Wait Descriptor hi 0x%"PRIx64
+                    " lo 0x%"PRIx64, inv_desc.hi, inv_desc.lo);
+        if (!vtd_process_wait_desc(s, &inv_desc)) {
+            return false;
+        }
+        break;
+
+    default:
+        VTD_DPRINTF(GENERAL, "error: unkonw Invalidation Descriptor type "
+                    "hi 0x%"PRIx64 " lo 0x%"PRIx64 " type %"PRIu8,
+                    inv_desc.hi, inv_desc.lo, desc_type);
+        return false;
+    }
+    s->iq_head++;
+    if (s->iq_head == s->iq_size) {
+        s->iq_head = 0;
+    }
+    return true;
+}
+
+/* Try to fetch and process more Invalidation Descriptors */
+static void vtd_fetch_inv_desc(IntelIOMMUState *s)
+{
+    VTD_DPRINTF(INV, "fetch Invalidation Descriptors");
+    if (s->iq_tail >= s->iq_size) {
+        /* Detects an invalid Tail pointer */
+        VTD_DPRINTF(GENERAL, "error: iq_tail is %"PRIu16
+                    " while iq_size is %"PRIu16, s->iq_tail, s->iq_size);
+        vtd_handle_inv_queue_error(s);
+        return;
+    }
+    while (s->iq_head != s->iq_tail) {
+        if (!vtd_process_inv_desc(s)) {
+            /* Invalidation Queue Errors */
+            vtd_handle_inv_queue_error(s);
+            break;
+        }
+        /* Must update the IQH_REG in time */
+        set_quad_raw(s, DMAR_IQH_REG,
+                     (((uint64_t)(s->iq_head)) << VTD_IQH_QH_SHIFT) &
+                     VTD_IQH_QH_MASK);
+    }
+}
+
+/* Handle write to Invalidation Queue Tail Register */
+static inline void handle_iqt_write(IntelIOMMUState *s)
+{
+    uint64_t val = get_quad_raw(s, DMAR_IQT_REG);
+
+    s->iq_tail = VTD_IQT_QT(val);
+    VTD_DPRINTF(INV, "set iq tail %"PRIu16, s->iq_tail);
+
+    if (s->qi_enabled && !(get_long_raw(s, DMAR_FSTS_REG) & VTD_FSTS_IQE)) {
+        /* Process Invalidation Queue here */
+        vtd_fetch_inv_desc(s);
+    }
+}
+
 static inline void handle_fsts_write(IntelIOMMUState *s)
 {
     uint32_t fsts_reg = get_long_raw(s, DMAR_FSTS_REG);
@@ -902,6 +1137,9 @@ static inline void handle_fsts_write(IntelIOMMUState *s)
         VTD_DPRINTF(FLOG, "all pending interrupt conditions serviced, clear "
                     "IP field of FECTL_REG");
     }
+    /* FIXME: when IQE is Clear, should we try to fetch some Invalidation
+     * Descriptors if there are any when Queued Invalidation is enabled?
+     */
 }
 
 static inline void handle_fectl_write(IntelIOMMUState *s)
@@ -920,6 +1158,34 @@ static inline void handle_fectl_write(IntelIOMMUState *s)
     }
 }
 
+static inline void handle_ics_write(IntelIOMMUState *s)
+{
+    uint32_t ics_reg = get_long_raw(s, DMAR_ICS_REG);
+    uint32_t iectl_reg = get_long_raw(s, DMAR_IECTL_REG);
+
+    if ((iectl_reg & VTD_IECTL_IP) && !(ics_reg & VTD_ICS_IWC)) {
+        set_clear_mask_long(s, DMAR_IECTL_REG, VTD_IECTL_IP, 0);
+        VTD_DPRINTF(INV, "pending completion interrupt condition serviced, "
+                    "clear IP field of IECTL_REG");
+    }
+}
+
+static inline void handle_iectl_write(IntelIOMMUState *s)
+{
+    uint32_t iectl_reg;
+    /* FIXME: when software clears the IM field, check the IP field. But do we
+     * need to compare the old value and the new value to conclude that
+     * software clears the IM field? Or just check if the IM field is zero?
+     */
+    iectl_reg = get_long_raw(s, DMAR_IECTL_REG);
+    if ((iectl_reg & VTD_IECTL_IP) && !(iectl_reg & VTD_IECTL_IM)) {
+        vtd_generate_interrupt(s, DMAR_IEADDR_REG, DMAR_IEDATA_REG);
+        set_clear_mask_long(s, DMAR_IECTL_REG, VTD_IECTL_IP, 0);
+        VTD_DPRINTF(INV, "IM field is cleared, generate "
+                    "invalidation event interrupt");
+    }
+}
+
 static uint64_t vtd_mem_read(void *opaque, hwaddr addr, unsigned size)
 {
     IntelIOMMUState *s = opaque;
@@ -949,6 +1215,19 @@ static uint64_t vtd_mem_read(void *opaque, hwaddr addr, unsigned size)
         val = s->root >> 32;
         break;
 
+    /* Invalidation Queue Address Register, 64-bit */
+    case DMAR_IQA_REG:
+        val = s->iq | (get_quad(s, DMAR_IQA_REG) & VTD_IQA_QS);
+        if (size == 4) {
+            val = val & ((1ULL << 32) - 1);
+        }
+        break;
+
+    case DMAR_IQA_REG_HI:
+        assert(size == 4);
+        val = s->iq >> 32;
+        break;
+
     default:
         if (size == 4) {
             val = get_long(s, addr);
@@ -1095,6 +1374,87 @@ static void vtd_mem_write(void *opaque, hwaddr addr,
         set_long(s, addr, val);
         break;
 
+    /* Invalidation Queue Tail Register, 64-bit */
+    case DMAR_IQT_REG:
+        VTD_DPRINTF(INV, "DMAR_IQT_REG write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        if (size == 4) {
+            set_long(s, addr, val);
+        } else {
+            set_quad(s, addr, val);
+        }
+        handle_iqt_write(s);
+        break;
+
+    case DMAR_IQT_REG_HI:
+        VTD_DPRINTF(INV, "DMAR_IQT_REG_HI write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        assert(size == 4);
+        set_long(s, addr, val);
+        /* 19:63 of IQT_REG is RsvdZ, do nothing here */
+        break;
+
+    /* Invalidation Queue Address Register, 64-bit */
+    case DMAR_IQA_REG:
+        VTD_DPRINTF(INV, "DMAR_IQA_REG write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        if (size == 4) {
+            set_long(s, addr, val);
+        } else {
+            set_quad(s, addr, val);
+        }
+        break;
+
+    case DMAR_IQA_REG_HI:
+        VTD_DPRINTF(INV, "DMAR_IQA_REG_HI write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        assert(size == 4);
+        set_long(s, addr, val);
+        break;
+
+    /* Invalidation Completion Status Register, 32-bit */
+    case DMAR_ICS_REG:
+        VTD_DPRINTF(INV, "DMAR_ICS_REG write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        assert(size == 4);
+        set_long(s, addr, val);
+        handle_ics_write(s);
+        break;
+
+    /* Invalidation Event Control Register, 32-bit */
+    case DMAR_IECTL_REG:
+        VTD_DPRINTF(INV, "DMAR_IECTL_REG write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        assert(size == 4);
+        set_long(s, addr, val);
+        handle_iectl_write(s);
+        break;
+
+    /* Invalidation Event Data Register, 32-bit */
+    case DMAR_IEDATA_REG:
+        VTD_DPRINTF(INV, "DMAR_IEDATA_REG write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        assert(size == 4);
+        set_long(s, addr, val);
+        break;
+
+    /* Invalidation Event Address Register, 32-bit */
+    case DMAR_IEADDR_REG:
+        VTD_DPRINTF(INV, "DMAR_IEADDR_REG write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        assert(size == 4);
+        set_long(s, addr, val);
+        break;
+
+    /* Invalidation Event Upper Address Register, 32-bit */
+    case DMAR_IEUADDR_REG:
+        VTD_DPRINTF(INV, "DMAR_IEUADDR_REG write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        assert(size == 4);
+        set_long(s, addr, val);
+        break;
+
+
     /* Fault Recording Registers, 128-bit */
     case DMAR_FRCD_REG_0_0:
         VTD_DPRINTF(FLOG, "DMAR_FRCD_REG_0_0 write addr 0x%"PRIx64
@@ -1248,14 +1608,14 @@ static void do_vtd_init(IntelIOMMUState *s)
     s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND | VTD_CAP_MGAW |
              VTD_CAP_SAGAW;
 
-    /* b.1 = 0: QI(Queued Invalidation support) not supported
+    /* b.1 = 1: QI(Queued Invalidation support) supported
      * b.2 = 0: DT(Device-TLB support) not supported
      * b.3 = 0: IR(Interrupt Remapping support) not supported
      * b.4 = 0: EIM(Extended Interrupt Mode) not supported
      * b.8:17 = 15: IRO(IOTLB Register Offset)
      * b.20:23 = 0: MHMV(Maximum Handle Mask Value) not valid
      */
-    s->ecap = VTD_ECAP_IRO;
+    s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
 
     /* Define registers with default values and bit semantics */
     define_long(s, DMAR_VER_REG, 0x10UL, 0, 0);  /* set MAX = 1, RO */
@@ -1285,6 +1645,23 @@ static void do_vtd_init(IntelIOMMUState *s)
      */
     define_long(s, DMAR_PMEN_REG, 0, 0, 0);
 
+    /* Bits 18:4 (0x7fff0) are RO, the rest is RsvdZ */
+    define_quad(s, DMAR_IQH_REG, 0, 0, 0);
+    define_quad(s, DMAR_IQT_REG, 0, 0x7fff0ULL, 0);
+    define_quad(s, DMAR_IQA_REG, 0, 0xfffffffffffff007ULL, 0);
+
+    /* Bit 0 is RW1CS - rest is RsvdZ */
+    define_long(s, DMAR_ICS_REG, 0, 0, 0x1UL);
+
+    /* b.31 is RW, b.30 RO, rest: RsvdZ */
+    define_long(s, DMAR_IECTL_REG, 0x80000000UL, 0x80000000UL, 0);
+
+    define_long(s, DMAR_IEDATA_REG, 0, 0xffffffffUL, 0);
+    define_long(s, DMAR_IEADDR_REG, 0, 0xfffffffcUL, 0);
+
+    /* Treated as RsvdZ when EIM in ECAP_REG is not supported */
+    define_long(s, DMAR_IEUADDR_REG, 0, 0, 0);
+
     /* IOTLB registers */
     define_quad(s, DMAR_IOTLB_REG, 0, 0Xb003ffff00000000ULL, 0);
     define_quad(s, DMAR_IVA_REG, 0, 0xfffffffffffff07fULL, 0);
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 7bc679a..71577ff 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -260,13 +260,26 @@ typedef enum VTDFaultReason {
 } VTDFaultReason;
 
 
-/* Masks for Queued Invalidation Descriptor */
+/* Queued Invalidation Descriptor */
+struct VTDInvDesc {
+    uint64_t lo;
+    uint64_t hi;
+};
+typedef struct VTDInvDesc VTDInvDesc;
+
+/* Masks for struct VTDInvDesc */
 #define VTD_INV_DESC_TYPE  (0xf)
 #define VTD_INV_DESC_CC    (0x1) /* Context-cache Invalidate Descriptor */
 #define VTD_INV_DESC_IOTLB (0x2)
 #define VTD_INV_DESC_WAIT  (0x5) /* Invalidation Wait Descriptor */
 #define VTD_INV_DESC_NONE  (0)   /* Not an Invalidate Descriptor */
 
+/* Masks for Invalidation Wait Descriptor */
+#define VTD_INV_DESC_WAIT_SW    (1ULL << 5)
+#define VTD_INV_DESC_WAIT_IF    (1ULL << 4)
+#define VTD_INV_DESC_WAIT_FN    (1ULL << 6)
+#define VTD_INV_DESC_WAIT_DATA_SHIFT (32)
+
 
 /* Pagesize of VTD paging structures, including root and context tables */
 #define VTD_PAGE_SHIFT      (12)
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v3 2/5] intel-iommu: introduce Intel IOMMU (VT-d) emulation
  2014-08-11  7:04 ` [Qemu-devel] [PATCH v3 2/5] intel-iommu: introduce Intel IOMMU (VT-d) emulation Le Tan
@ 2014-08-12  7:34   ` Jan Kiszka
  2014-08-12  9:04     ` Le Tan
  2014-08-14 11:03   ` Michael S. Tsirkin
  1 sibling, 1 reply; 34+ messages in thread
From: Jan Kiszka @ 2014-08-12  7:34 UTC (permalink / raw)
  To: Le Tan, qemu-devel
  Cc: Michael S. Tsirkin, Stefan Weil, Knut Omang, Alex Williamson,
	Anthony Liguori, Paolo Bonzini


On 2014-08-11 09:04, Le Tan wrote:
> Add support for emulating Intel IOMMU according to the VT-d specification for
> the q35 chipset machine. Implement the logics for DMAR (DMA remapping) without
> PASID support. The emulation supports register-based invalidation and primary
> fault logging.

Some arbitrary comments below (meaning I didn't read every line and
likely missed some things). In general, this looks and works pretty well!

> 
> Signed-off-by: Le Tan <tamlokveer@gmail.com>
> ---
>  hw/i386/Makefile.objs          |    1 +
>  hw/i386/intel_iommu.c          | 1345 ++++++++++++++++++++++++++++++++++++++++
>  hw/i386/intel_iommu_internal.h |  345 +++++++++++
>  include/hw/i386/intel_iommu.h  |   90 +++
>  4 files changed, 1781 insertions(+)
>  create mode 100644 hw/i386/intel_iommu.c
>  create mode 100644 hw/i386/intel_iommu_internal.h
>  create mode 100644 include/hw/i386/intel_iommu.h
> 
> diff --git a/hw/i386/Makefile.objs b/hw/i386/Makefile.objs
> index 48014ab..6936111 100644
> --- a/hw/i386/Makefile.objs
> +++ b/hw/i386/Makefile.objs
> @@ -2,6 +2,7 @@ obj-$(CONFIG_KVM) += kvm/
>  obj-y += multiboot.o smbios.o
>  obj-y += pc.o pc_piix.o pc_q35.o
>  obj-y += pc_sysfw.o
> +obj-y += intel_iommu.o
>  obj-$(CONFIG_XEN) += ../xenpv/ xen/
>  
>  obj-y += kvmvapic.o
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> new file mode 100644
> index 0000000..b3a4f78
> --- /dev/null
> +++ b/hw/i386/intel_iommu.c
> @@ -0,0 +1,1345 @@
> +/*
> + * QEMU emulation of an Intel IOMMU (VT-d)
> + *   (DMA Remapping device)
> + *
> + * Copyright (C) 2013 Knut Omang, Oracle <knut.omang@oracle.com>
> + * Copyright (C) 2014 Le Tan, <tamlokveer@gmail.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> +
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> +
> + * You should have received a copy of the GNU General Public License along
> + * with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "hw/sysbus.h"
> +#include "exec/address-spaces.h"
> +#include "intel_iommu_internal.h"
> +
> +
> +/*#define DEBUG_INTEL_IOMMU*/
> +#ifdef DEBUG_INTEL_IOMMU
> +enum {
> +    DEBUG_GENERAL, DEBUG_CSR, DEBUG_INV, DEBUG_MMU, DEBUG_FLOG,
> +};
> +#define VTD_DBGBIT(x)   (1 << DEBUG_##x)
> +static int vtd_dbgflags = VTD_DBGBIT(GENERAL) | VTD_DBGBIT(CSR) |
> +                          VTD_DBGBIT(FLOG);
> +
> +#define VTD_DPRINTF(what, fmt, ...) do { \
> +    if (vtd_dbgflags & VTD_DBGBIT(what)) { \
> +        fprintf(stderr, "(vtd)%s: " fmt "\n", __func__, \
> +                ## __VA_ARGS__); } \
> +    } while (0)
> +#else
> +#define VTD_DPRINTF(what, fmt, ...) do {} while (0)
> +#endif
> +
> +static inline void define_quad(IntelIOMMUState *s, hwaddr addr, uint64_t val,
> +                               uint64_t wmask, uint64_t w1cmask)

In general, don't declare functions inline needlessly. It makes sense
for trivial ones you export via a header, but even a 2- or 3-liner like
this can be more efficient as a stand-alone function. Anything bigger
definitely does not deserve that tag.

Inline is just a hint to the compiler anyway, so you can safely leave
it out for any almost-trivial static function.
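
E.g. simply dropping the tag here (sketch only, same body as in the quoted
hunk) loses nothing; the compiler will still inline it when it pays off:

    static void define_quad(IntelIOMMUState *s, hwaddr addr, uint64_t val,
                            uint64_t wmask, uint64_t w1cmask)
    {
        stq_le_p(&s->csr[addr], val);
        stq_le_p(&s->wmask[addr], wmask);
        stq_le_p(&s->w1cmask[addr], w1cmask);
    }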

> +{
> +    stq_le_p(&s->csr[addr], val);
> +    stq_le_p(&s->wmask[addr], wmask);
> +    stq_le_p(&s->w1cmask[addr], w1cmask);
> +}
> +
> +static inline void define_quad_wo(IntelIOMMUState *s, hwaddr addr,
> +                                  uint64_t mask)
> +{
> +    stq_le_p(&s->womask[addr], mask);
> +}
> +
> +static inline void define_long(IntelIOMMUState *s, hwaddr addr, uint32_t val,
> +                               uint32_t wmask, uint32_t w1cmask)
> +{
> +    stl_le_p(&s->csr[addr], val);
> +    stl_le_p(&s->wmask[addr], wmask);
> +    stl_le_p(&s->w1cmask[addr], w1cmask);
> +}
> +
> +static inline void define_long_wo(IntelIOMMUState *s, hwaddr addr,
> +                                  uint32_t mask)
> +{
> +    stl_le_p(&s->womask[addr], mask);
> +}
> +
> +/* "External" get/set operations */
> +static inline void set_quad(IntelIOMMUState *s, hwaddr addr, uint64_t val)
> +{
> +    uint64_t oldval = ldq_le_p(&s->csr[addr]);
> +    uint64_t wmask = ldq_le_p(&s->wmask[addr]);
> +    uint64_t w1cmask = ldq_le_p(&s->w1cmask[addr]);
> +    stq_le_p(&s->csr[addr],
> +             ((oldval & ~wmask) | (val & wmask)) & ~(w1cmask & val));
> +}
> +
> +static inline void set_long(IntelIOMMUState *s, hwaddr addr, uint32_t val)
> +{
> +    uint32_t oldval = ldl_le_p(&s->csr[addr]);
> +    uint32_t wmask = ldl_le_p(&s->wmask[addr]);
> +    uint32_t w1cmask = ldl_le_p(&s->w1cmask[addr]);
> +    stl_le_p(&s->csr[addr],
> +             ((oldval & ~wmask) | (val & wmask)) & ~(w1cmask & val));
> +}
> +
> +static inline uint64_t get_quad(IntelIOMMUState *s, hwaddr addr)
> +{
> +    uint64_t val = ldq_le_p(&s->csr[addr]);
> +    uint64_t womask = ldq_le_p(&s->womask[addr]);
> +    return val & ~womask;
> +}
> +
> +
> +static inline uint32_t get_long(IntelIOMMUState *s, hwaddr addr)
> +{
> +    uint32_t val = ldl_le_p(&s->csr[addr]);
> +    uint32_t womask = ldl_le_p(&s->womask[addr]);
> +    return val & ~womask;
> +}
> +
> +/* "Internal" get/set operations */
> +static inline uint64_t get_quad_raw(IntelIOMMUState *s, hwaddr addr)
> +{
> +    return ldq_le_p(&s->csr[addr]);
> +}
> +
> +static inline uint32_t get_long_raw(IntelIOMMUState *s, hwaddr addr)
> +{
> +    return ldl_le_p(&s->csr[addr]);
> +}
> +
> +static inline void set_quad_raw(IntelIOMMUState *s, hwaddr addr, uint64_t val)
> +{
> +    stq_le_p(&s->csr[addr], val);
> +}
> +
> +static inline uint32_t set_clear_mask_long(IntelIOMMUState *s, hwaddr addr,
> +                                           uint32_t clear, uint32_t mask)
> +{
> +    uint32_t new_val = (ldl_le_p(&s->csr[addr]) & ~clear) | mask;
> +    stl_le_p(&s->csr[addr], new_val);
> +    return new_val;
> +}
> +
> +static inline uint64_t set_clear_mask_quad(IntelIOMMUState *s, hwaddr addr,
> +                                           uint64_t clear, uint64_t mask)
> +{
> +    uint64_t new_val = (ldq_le_p(&s->csr[addr]) & ~clear) | mask;
> +    stq_le_p(&s->csr[addr], new_val);
> +    return new_val;
> +}
> +
> +/* Given the register addresses of the message data and message address,
> + * generate an interrupt via MSI.
> + */
> +static void vtd_generate_interrupt(IntelIOMMUState *s, hwaddr mesg_addr_reg,
> +                                   hwaddr mesg_data_reg)
> +{
> +    hwaddr addr;
> +    uint32_t data;
> +
> +    assert(mesg_data_reg < DMAR_REG_SIZE);
> +    assert(mesg_addr_reg < DMAR_REG_SIZE);
> +
> +    addr = get_long_raw(s, mesg_addr_reg);
> +    data = get_long_raw(s, mesg_data_reg);
> +
> +    VTD_DPRINTF(FLOG, "msi: addr 0x%"PRIx64 " data 0x%"PRIx32, addr, data);
> +    stl_le_phys(&address_space_memory, addr, data);
> +}
> +
> +/* Generate a fault event to software via MSI if conditions are met.
> + * Notice that the value of FSTS_REG being passed to it should be the one
> + * before any update.
> + */
> +static void vtd_generate_fault_event(IntelIOMMUState *s, uint32_t pre_fsts)
> +{
> +    /* Check if there are any previously reported interrupt conditions */
> +    if (pre_fsts & VTD_FSTS_PPF || pre_fsts & VTD_FSTS_PFO ||
> +        pre_fsts & VTD_FSTS_IQE) {
> +        VTD_DPRINTF(FLOG, "there are previous interrupt conditions "
> +                    "to be serviced by software, fault event is not generated "
> +                    "(FSTS_REG 0x%"PRIx32 ")", pre_fsts);
> +        return;
> +    }
> +    set_clear_mask_long(s, DMAR_FECTL_REG, 0, VTD_FECTL_IP);
> +    if (get_long_raw(s, DMAR_FECTL_REG) & VTD_FECTL_IM) {
> +        /* Interrupt Mask */
> +        VTD_DPRINTF(FLOG, "Interrupt Mask set, fault event is not generated");
> +    } else {
> +        /* generate interrupt */
> +        vtd_generate_interrupt(s, DMAR_FEADDR_REG, DMAR_FEDATA_REG);
> +        set_clear_mask_long(s, DMAR_FECTL_REG, VTD_FECTL_IP, 0);
> +    }
> +}
> +
> +/* Check if the Fault (F) field of the Fault Recording Register referenced by
> + * @index is Set.
> + */
> +static inline bool is_frcd_set(IntelIOMMUState *s, uint16_t index)
> +{
> +    /* Each reg is 128-bit */
> +    hwaddr addr = DMAR_FRCD_REG_OFFSET + (((uint64_t)index) << 4);
> +    addr += 8; /* Access the high 64-bit half */
> +
> +    assert(index < DMAR_FRCD_REG_NR);
> +
> +    return get_quad_raw(s, addr) & VTD_FRCD_F;
> +}
> +
> +/* Update the PPF field of Fault Status Register.
> + * Should be called whenever the F field of any fault recording register
> + * is changed.
> + */
> +static inline void update_fsts_ppf(IntelIOMMUState *s)
> +{
> +    uint32_t i;
> +    uint32_t ppf_mask = 0;
> +
> +    for (i = 0; i < DMAR_FRCD_REG_NR; i++) {
> +        if (is_frcd_set(s, i)) {
> +            ppf_mask = VTD_FSTS_PPF;
> +            break;
> +        }
> +    }
> +    set_clear_mask_long(s, DMAR_FSTS_REG, VTD_FSTS_PPF, ppf_mask);
> +    VTD_DPRINTF(FLOG, "set PPF of FSTS_REG to %d", ppf_mask ? 1 : 0);
> +}
> +
> +static inline void set_frcd_and_update_ppf(IntelIOMMUState *s, uint16_t index)
> +{
> +    /* Each reg is 128-bit */
> +    hwaddr addr = DMAR_FRCD_REG_OFFSET + (((uint64_t)index) << 4);
> +    addr += 8; /* Access the high 64-bit half */
> +
> +    assert(index < DMAR_FRCD_REG_NR);
> +
> +    set_clear_mask_quad(s, addr, 0, VTD_FRCD_F);
> +    update_fsts_ppf(s);
> +}
> +
> +/* Must not update F field now, should be done later */
> +static void record_frcd(IntelIOMMUState *s, uint16_t index, uint16_t source_id,
> +                        hwaddr addr, VTDFaultReason fault, bool is_write)
> +{
> +    uint64_t hi = 0, lo;
> +    hwaddr frcd_reg_addr = DMAR_FRCD_REG_OFFSET + (((uint64_t)index) << 4);
> +
> +    assert(index < DMAR_FRCD_REG_NR);
> +
> +    lo = VTD_FRCD_FI(addr);
> +    hi = VTD_FRCD_SID(source_id) | VTD_FRCD_FR(fault);
> +    if (!is_write) {
> +        hi |= VTD_FRCD_T;
> +    }
> +
> +    set_quad_raw(s, frcd_reg_addr, lo);
> +    set_quad_raw(s, frcd_reg_addr + 8, hi);
> +    VTD_DPRINTF(FLOG, "record to FRCD_REG #%"PRIu16 ": hi 0x%"PRIx64
> +                ", lo 0x%"PRIx64, index, hi, lo);
> +}
> +
> +/* Try to collapse multiple pending faults from the same requester */
> +static inline bool try_collapse_fault(IntelIOMMUState *s, uint16_t source_id)
> +{
> +    uint32_t i;
> +    uint64_t frcd_reg;
> +    hwaddr addr = DMAR_FRCD_REG_OFFSET + 8; /* The high 64-bit half */
> +
> +    for (i = 0; i < DMAR_FRCD_REG_NR; i++) {
> +        frcd_reg = get_quad_raw(s, addr);
> +        VTD_DPRINTF(FLOG, "frcd_reg #%d 0x%"PRIx64, i, frcd_reg);
> +        if ((frcd_reg & VTD_FRCD_F) &&
> +            ((frcd_reg & VTD_FRCD_SID_MASK) == source_id)) {
> +            return true;
> +        }
> +        addr += 16; /* 128-bit for each */
> +    }
> +
> +    return false;
> +}
> +
> +/* Log and report a DMAR (address translation) fault to software */
> +static void vtd_report_dmar_fault(IntelIOMMUState *s, uint16_t source_id,
> +                                  hwaddr addr, VTDFaultReason fault,
> +                                  bool is_write)
> +{
> +    uint32_t fsts_reg = get_long_raw(s, DMAR_FSTS_REG);
> +
> +    assert(fault < VTD_FR_MAX);
> +
> +    if (fault == VTD_FR_RESERVED_ERR) {
> +        /* This is not a normal fault reason case. Drop it. */
> +        return;
> +    }
> +
> +    VTD_DPRINTF(FLOG, "sid 0x%"PRIx16 ", fault %d, addr 0x%"PRIx64
> +                ", is_write %d", source_id, fault, addr, is_write);
> +
> +    /* Check PFO field in FSTS_REG */
> +    if (fsts_reg & VTD_FSTS_PFO) {
> +        VTD_DPRINTF(FLOG, "new fault is not recorded due to "
> +                    "Primary Fault Overflow");
> +        return;
> +    }
> +
> +    /* Compression of multiple faults from the same requester */
> +    if (try_collapse_fault(s, source_id)) {
> +        VTD_DPRINTF(FLOG, "new fault is not recorded due to "
> +                    "compression of faults");
> +        return;
> +    }
> +
> +    /* Check next_frcd_reg to see whether we would overflow now */
> +    if (is_frcd_set(s, s->next_frcd_reg)) {
> +        VTD_DPRINTF(FLOG, "Primary Fault Overflow and "
> +                    "new fault is not recorded, set PFO field");
> +        set_clear_mask_long(s, DMAR_FSTS_REG, 0, VTD_FSTS_PFO);
> +        return;
> +    }
> +
> +    record_frcd(s, s->next_frcd_reg, source_id, addr, fault, is_write);
> +
> +    if (fsts_reg & VTD_FSTS_PPF) {
> +        /* There are already one or more pending faults */
> +        VTD_DPRINTF(FLOG, "there are pending faults already, "
> +                    "fault event is not generated");
> +        set_frcd_and_update_ppf(s, s->next_frcd_reg);
> +        s->next_frcd_reg++;
> +        if (s->next_frcd_reg == DMAR_FRCD_REG_NR) {
> +            s->next_frcd_reg = 0;
> +        }
> +    } else {
> +        set_clear_mask_long(s, DMAR_FSTS_REG, VTD_FSTS_FRI_MASK,
> +                            VTD_FSTS_FRI(s->next_frcd_reg));
> +        set_frcd_and_update_ppf(s, s->next_frcd_reg); /* It will also set PPF */
> +        s->next_frcd_reg++;
> +        if (s->next_frcd_reg == DMAR_FRCD_REG_NR) {
> +            s->next_frcd_reg = 0;
> +        }
> +
> +        /* This case actually causes the PPF to be Set.
> +         * So generate a fault event (interrupt).
> +         */
> +         vtd_generate_fault_event(s, fsts_reg);
> +    }
> +}
> +
> +static inline bool root_entry_present(VTDRootEntry *root)
> +{
> +    return root->val & VTD_ROOT_ENTRY_P;
> +}
> +
> +static int get_root_entry(IntelIOMMUState *s, uint32_t index, VTDRootEntry *re)
> +{
> +    dma_addr_t addr;
> +
> +    assert(index < VTD_ROOT_ENTRY_NR);
> +
> +    addr = s->root + index * sizeof(*re);
> +
> +    if (dma_memory_read(&address_space_memory, addr, re, sizeof(*re))) {
> +        VTD_DPRINTF(GENERAL, "error: fail to access root-entry at 0x%"PRIx64
> +                    " + %"PRIu32, s->root, index);
> +        re->val = 0;
> +        return -VTD_FR_ROOT_TABLE_INV;
> +    }
> +
> +    re->val = le64_to_cpu(re->val);
> +    return VTD_FR_RESERVED;

This looks a bit weird, here and elsewhere: VTD_FR_RESERVED is a
reserved error code in the VT-d specification, and it's 0. OK, but here
the meaning of returning 0 is actually "everything went well". So either
provide a constant that documents this or simply use 0 consistently to
declare the absence of errors.
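
One option, just as a sketch (VTD_SUCCESS is a made-up name):

/* Explicitly "no error", as opposed to the VTD_FR_* fault reasons */
#define VTD_SUCCESS 0

and then "return VTD_SUCCESS;" at the success exits. Or simply return 0
everywhere and state once that 0 means success.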

> +}
> +
> +static inline bool context_entry_present(VTDContextEntry *context)
> +{
> +    return context->lo & VTD_CONTEXT_ENTRY_P;
> +}
> +
> +static int get_context_entry_from_root(VTDRootEntry *root, uint32_t index,
> +                                       VTDContextEntry *ce)
> +{
> +    dma_addr_t addr;
> +
> +    if (!root_entry_present(root)) {
> +        ce->lo = 0;
> +        ce->hi = 0;
> +        VTD_DPRINTF(GENERAL, "error: root-entry is not present");
> +        return -VTD_FR_ROOT_ENTRY_P;
> +    }
> +
> +    assert(index < VTD_CONTEXT_ENTRY_NR);
> +
> +    addr = (root->val & VTD_ROOT_ENTRY_CTP) + index * sizeof(*ce);
> +
> +    if (dma_memory_read(&address_space_memory, addr, ce, sizeof(*ce))) {
> +        VTD_DPRINTF(GENERAL, "error: fail to access context-entry at 0x%"PRIx64
> +                    " + %"PRIu32,
> +                    (uint64_t)(root->val & VTD_ROOT_ENTRY_CTP), index);
> +        ce->lo = 0;
> +        ce->hi = 0;
> +        return -VTD_FR_CONTEXT_TABLE_INV;
> +    }
> +
> +    ce->lo = le64_to_cpu(ce->lo);
> +    ce->hi = le64_to_cpu(ce->hi);
> +    return VTD_FR_RESERVED;
> +}
> +
> +static inline dma_addr_t get_slpt_base_from_context(VTDContextEntry *ce)
> +{
> +    return ce->lo & VTD_CONTEXT_ENTRY_SLPTPTR;
> +}
> +
> +/* The shift of an addr for a certain level of paging structure */
> +static inline uint32_t slpt_level_shift(uint32_t level)
> +{
> +    return VTD_PAGE_SHIFT_4K + (level - 1) * VTD_SL_LEVEL_BITS;
> +}
> +
> +static inline uint64_t get_slpte_addr(uint64_t slpte)
> +{
> +    return slpte & VTD_SL_PT_BASE_ADDR_MASK;
> +}
> +
> +/* Whether the pte indicates the address of the page frame */
> +static inline bool is_last_slpte(uint64_t slpte, uint32_t level)
> +{
> +    return level == VTD_SL_PT_LEVEL || (slpte & VTD_SL_PT_PAGE_SIZE_MASK);
> +}
> +
> +/* Get the content of an slpte located at @base_addr[@index] */
> +static inline uint64_t get_slpte(dma_addr_t base_addr, uint32_t index)
> +{
> +    uint64_t slpte;
> +
> +    assert(index < VTD_SL_PT_ENTRY_NR);
> +
> +    if (dma_memory_read(&address_space_memory,
> +                        base_addr + index * sizeof(slpte), &slpte,
> +                        sizeof(slpte))) {
> +        slpte = (uint64_t)-1;
> +        return slpte;
> +    }
> +
> +    slpte = le64_to_cpu(slpte);
> +    return slpte;
> +}
> +
> +/* Given a gpa and the level of paging structure, return the offset of current
> + * level.
> + */
> +static inline uint32_t gpa_level_offset(uint64_t gpa, uint32_t level)
> +{
> +    return (gpa >> slpt_level_shift(level)) & ((1ULL << VTD_SL_LEVEL_BITS) - 1);
> +}
> +
> +/* Check Capability Register to see if the @level of page-table is supported */
> +static inline bool is_level_supported(IntelIOMMUState *s, uint32_t level)
> +{
> +    return VTD_CAP_SAGAW_MASK & s->cap &
> +           (1ULL << (level - 2 + VTD_CAP_SAGAW_SHIFT));
> +}
> +
> +/* Get the page-table level that hardware should use for the second-level
> + * page-table walk from the Address Width field of context-entry.
> + */
> +static inline uint32_t get_level_from_context_entry(VTDContextEntry *ce)
> +{
> +    return 2 + (ce->hi & VTD_CONTEXT_ENTRY_AW);
> +}
> +
> +static inline uint32_t get_agaw_from_context_entry(VTDContextEntry *ce)
> +{
> +    return 30 + (ce->hi & VTD_CONTEXT_ENTRY_AW) * 9;
> +}
> +
> +static const uint64_t paging_entry_rsvd_field[] = {
> +    [0] = ~0ULL,
> +    /* For non-large page */
> +    [1] = 0x800ULL | ~(VTD_HAW_MASK | VTD_SL_IGN_COM),
> +    [2] = 0x800ULL | ~(VTD_HAW_MASK | VTD_SL_IGN_COM),
> +    [3] = 0x800ULL | ~(VTD_HAW_MASK | VTD_SL_IGN_COM),
> +    [4] = 0x880ULL | ~(VTD_HAW_MASK | VTD_SL_IGN_COM),
> +    /* For large page */
> +    [5] = 0x800ULL | ~(VTD_HAW_MASK | VTD_SL_IGN_COM),
> +    [6] = 0x1ff800ULL | ~(VTD_HAW_MASK | VTD_SL_IGN_COM),
> +    [7] = 0x3ffff800ULL | ~(VTD_HAW_MASK | VTD_SL_IGN_COM),
> +    [8] = 0x880ULL | ~(VTD_HAW_MASK | VTD_SL_IGN_COM),
> +};
> +
> +static inline bool slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
> +{
> +    if (slpte & VTD_SL_PT_PAGE_SIZE_MASK) {
> +        /* Maybe large page */
> +        return slpte & paging_entry_rsvd_field[level + 4];
> +    } else {
> +        return slpte & paging_entry_rsvd_field[level];
> +    }
> +}
> +
> +/* Given the @gpa, get relevant @slptep. @slpte_level will be the last level
> + * of the translation, which can be used to decide the size of a large page.
> + * @slptep and @slpte_level will not be touched if an error happens.
> + */
> +static int gpa_to_slpte(VTDContextEntry *ce, uint64_t gpa, bool is_write,
> +                        uint64_t *slptep, uint32_t *slpte_level)
> +{
> +    dma_addr_t addr = get_slpt_base_from_context(ce);
> +    uint32_t level = get_level_from_context_entry(ce);
> +    uint32_t offset;
> +    uint64_t slpte;
> +    uint32_t ce_agaw = get_agaw_from_context_entry(ce);
> +    uint64_t access_right_check;
> +
> +    /* Check if @gpa is above 2^X-1, where X is the minimum of MGAW in CAP_REG
> +     * and AW in context-entry.
> +     */
> +    if (gpa & ~((1ULL << MIN(ce_agaw, VTD_MGAW)) - 1)) {
> +        VTD_DPRINTF(GENERAL, "error: gpa 0x%"PRIx64 " exceeds limits", gpa);
> +        return -VTD_FR_ADDR_BEYOND_MGAW;
> +    }
> +
> +    /* FIXME: what is the Atomics request here? */
> +    access_right_check = is_write ? VTD_SL_W : VTD_SL_R;
> +
> +    while (true) {
> +        offset = gpa_level_offset(gpa, level);
> +        slpte = get_slpte(addr, offset);
> +
> +        if (slpte == (uint64_t)-1) {
> +            VTD_DPRINTF(GENERAL, "error: fail to access second-level paging "
> +                        "entry at level %"PRIu32 " for gpa 0x%"PRIx64,
> +                        level, gpa);
> +            if (level == get_level_from_context_entry(ce)) {
> +                /* Invalid programming of context-entry */
> +                return -VTD_FR_CONTEXT_ENTRY_INV;
> +            } else {
> +                return -VTD_FR_PAGING_ENTRY_INV;
> +            }
> +        }
> +        if (!(slpte & access_right_check)) {
> +            VTD_DPRINTF(GENERAL, "error: lack of %s permission for "
> +                        "gpa 0x%"PRIx64 " slpte 0x%"PRIx64,
> +                        (is_write ? "write" : "read"), gpa, slpte);
> +            return is_write ? -VTD_FR_WRITE : -VTD_FR_READ;
> +        }
> +        if (slpte_nonzero_rsvd(slpte, level)) {
> +            VTD_DPRINTF(GENERAL, "error: non-zero reserved field in second "
> +                        "level paging entry level %"PRIu32 " slpte 0x%"PRIx64,
> +                        level, slpte);
> +            return -VTD_FR_PAGING_ENTRY_RSVD;
> +        }
> +
> +        if (is_last_slpte(slpte, level)) {
> +            *slptep = slpte;
> +            *slpte_level = level;
> +            return VTD_FR_RESERVED;
> +        }
> +        addr = get_slpte_addr(slpte);
> +        level--;
> +    }
> +}
> +
> +/* Map a device to its corresponding domain (context-entry). @ce will be set
> + * to Zero if an error happens while accessing the context-entry.
> + */
> +static inline int dev_to_context_entry(IntelIOMMUState *s, int bus_num,
> +                                       int devfn, VTDContextEntry *ce)
> +{
> +    VTDRootEntry re;
> +    int ret_fr;
> +
> +    assert(0 <= bus_num && bus_num < VTD_PCI_BUS_MAX);
> +    assert(0 <= devfn && devfn < VTD_PCI_SLOT_MAX * VTD_PCI_FUNC_MAX);

Use the proper type, uint8_t, for bus_num and devfn, and you can get rid of
these assertions. I know that the PCI layer improperly uses int
for them in many places, but you don't need to copy this.
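
I.e., something like (sketch):

static int dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
                                uint8_t devfn, VTDContextEntry *ce)

With uint8_t, both parameters are limited to 0..255 by the type, which is
what the two assertions effectively check.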

> +
> +    ret_fr = get_root_entry(s, bus_num, &re);
> +    if (ret_fr) {
> +        ce->hi = 0;
> +        ce->lo = 0;

That's a bit of overly defensive programming: the context entry is simply
invalid when such a function returns an error, no? You can document that
in the function description.
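
A one-liner in the header comment would do, e.g. (sketch):

/* Map a device to its corresponding domain (context-entry).
 * @ce is only valid if 0 is returned.
 */

and the explicit zeroing on the error paths can then go away.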

> +        return ret_fr;
> +    }
> +
> +    if (!root_entry_present(&re)) {
> +        VTD_DPRINTF(GENERAL, "error: root-entry #%d is not present", bus_num);
> +        ce->hi = 0;
> +        ce->lo = 0;
> +        return -VTD_FR_ROOT_ENTRY_P;
> +    } else if (re.rsvd || (re.val & VTD_ROOT_ENTRY_RSVD)) {
> +        VTD_DPRINTF(GENERAL, "error: non-zero reserved field in root-entry "
> +                    "hi 0x%"PRIx64 " lo 0x%"PRIx64, re.rsvd, re.val);
> +        ce->hi = 0;
> +        ce->lo = 0;
> +        return -VTD_FR_ROOT_ENTRY_RSVD;
> +    }
> +
> +    ret_fr = get_context_entry_from_root(&re, devfn, ce);
> +    if (ret_fr) {
> +        return ret_fr;
> +    }
> +
> +    if (!context_entry_present(ce)) {
> +        VTD_DPRINTF(GENERAL,
> +                    "error: context-entry #%d(bus #%d) is not present", devfn,
> +                    bus_num);
> +        return -VTD_FR_CONTEXT_ENTRY_P;
> +    } else if ((ce->hi & VTD_CONTEXT_ENTRY_RSVD_HI) ||
> +               (ce->lo & VTD_CONTEXT_ENTRY_RSVD_LO)) {
> +        VTD_DPRINTF(GENERAL,
> +                    "error: non-zero reserved field in context-entry "
> +                    "hi 0x%"PRIx64 " lo 0x%"PRIx64, ce->hi, ce->lo);
> +        return -VTD_FR_CONTEXT_ENTRY_RSVD;
> +    }
> +
> +    /* Check if the programming of context-entry is valid */
> +    if (!is_level_supported(s, get_level_from_context_entry(ce))) {
> +        VTD_DPRINTF(GENERAL, "error: unsupported Address Width value in "
> +                    "context-entry hi 0x%"PRIx64 " lo 0x%"PRIx64,
> +                    ce->hi, ce->lo);
> +        return -VTD_FR_CONTEXT_ENTRY_INV;
> +    } else if (ce->lo & VTD_CONTEXT_ENTRY_TT) {
> +        VTD_DPRINTF(GENERAL, "error: unsupported Translation Type in "
> +                    "context-entry hi 0x%"PRIx64 " lo 0x%"PRIx64,
> +                    ce->hi, ce->lo);
> +        return -VTD_FR_CONTEXT_ENTRY_INV;
> +    }
> +
> +    return VTD_FR_RESERVED;
> +}
> +
> +static inline uint16_t make_source_id(int bus_num, int devfn)
> +{
> +    return ((bus_num & 0xffUL) << 8) | (devfn & 0xffUL);
> +}
> +
> +static const bool qualified_faults[] = {
> +    [VTD_FR_RESERVED] = false,
> +    [VTD_FR_ROOT_ENTRY_P] = false,
> +    [VTD_FR_CONTEXT_ENTRY_P] = true,
> +    [VTD_FR_CONTEXT_ENTRY_INV] = true,
> +    [VTD_FR_ADDR_BEYOND_MGAW] = true,
> +    [VTD_FR_WRITE] = true,
> +    [VTD_FR_READ] = true,
> +    [VTD_FR_PAGING_ENTRY_INV] = true,
> +    [VTD_FR_ROOT_TABLE_INV] = false,
> +    [VTD_FR_CONTEXT_TABLE_INV] = false,
> +    [VTD_FR_ROOT_ENTRY_RSVD] = false,
> +    [VTD_FR_PAGING_ENTRY_RSVD] = true,
> +    [VTD_FR_CONTEXT_ENTRY_TT] = true,
> +    [VTD_FR_RESERVED_ERR] = false,
> +    [VTD_FR_MAX] = false,
> +};
> +
> +/* Check whether a fault condition is "qualified", i.e. reported to software
> + * only if the FPD field in the context-entry used to process the faulting
> + * request is 0.
> + */
> +static inline bool is_qualified_fault(VTDFaultReason fault)
> +{
> +    return qualified_faults[fault];
> +}
> +
> +static inline bool is_interrupt_addr(hwaddr addr)
> +{
> +    return VTD_INTERRUPT_ADDR_FIRST <= addr && addr <= VTD_INTERRUPT_ADDR_LAST;
> +}
> +
> +/* Map dev to context-entry then do a paging-structures walk to do an iommu
> + * translation.
> + * @bus_num: The bus number
> + * @devfn: The combined device and function number
> + * @is_write: Whether the access is a write operation
> + * @entry: IOMMUTLBEntry that contains the addr to be translated and result
> + */
> +static void iommu_translate(IntelIOMMUState *s, int bus_num, int devfn,
> +                            hwaddr addr, bool is_write, IOMMUTLBEntry *entry)
> +{
> +    VTDContextEntry ce;
> +    uint64_t slpte;
> +    uint32_t level;
> +    uint64_t page_mask;
> +    uint16_t source_id = make_source_id(bus_num, devfn);
> +    int ret_fr;
> +    bool is_fpd_set = false;
> +
> +    /* Check if the request is in interrupt address range */
> +    if (is_interrupt_addr(addr)) {
> +        if (is_write) {
> +            /* FIXME: since we don't know the length of the access here, we
> +             * treat Non-DWORD length write requests without PASID as
> +             * interrupt requests, too. Without interrupt remapping support,
> +             * we just use 1:1 mapping.
> +             */
> +            VTD_DPRINTF(MMU, "write request to interrupt address "
> +                        "gpa 0x%"PRIx64, addr);
> +            entry->iova = addr & VTD_PAGE_MASK_4K;
> +            entry->translated_addr = addr & VTD_PAGE_MASK_4K;
> +            entry->addr_mask = ~VTD_PAGE_MASK_4K;
> +            entry->perm = IOMMU_WO;
> +            return;
> +        } else {
> +            VTD_DPRINTF(GENERAL, "error: read request from interrupt address "
> +                        "gpa 0x%"PRIx64, addr);
> +            vtd_report_dmar_fault(s, source_id, addr, VTD_FR_READ, is_write);
> +            return;
> +        }
> +    }
> +
> +    ret_fr = dev_to_context_entry(s, bus_num, devfn, &ce);
> +    is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
> +    if (ret_fr) {
> +        ret_fr = -ret_fr;
> +        if (is_fpd_set && is_qualified_fault(ret_fr)) {
> +            VTD_DPRINTF(FLOG, "fault processing is disabled for DMA requests "
> +                        "through this context-entry (with FPD Set)");
> +        } else {
> +            vtd_report_dmar_fault(s, source_id, addr, ret_fr, is_write);
> +        }
> +        return;
> +    }
> +
> +    ret_fr = gpa_to_slpte(&ce, addr, is_write, &slpte, &level);
> +    if (ret_fr) {
> +        ret_fr = -ret_fr;
> +        if (is_fpd_set && is_qualified_fault(ret_fr)) {
> +            VTD_DPRINTF(FLOG, "fault processing is disabled for DMA requests "
> +                        "through this context-entry (with FPD Set)");
> +        } else {
> +            vtd_report_dmar_fault(s, source_id, addr, ret_fr, is_write);
> +        }
> +        return;
> +    }
> +
> +    if (level == VTD_SL_PT_LEVEL) {
> +        /* 4-KB page */
> +        page_mask = VTD_PAGE_MASK_4K;
> +    } else if (level == VTD_SL_PDP_LEVEL) {
> +        /* 1-GB page */
> +        page_mask = VTD_PAGE_MASK_1G;
> +    } else {
> +        /* 2-MB page */
> +        page_mask = VTD_PAGE_MASK_2M;
> +    }

You don't declare 1G and 2M pages as supported in caps.sllps, do you?
I'm wondering if we should have some device property for intel-iommu
that enables all available features, even if our emulated chipset never
supported them (I guess Q35 had no support; my younger QM57 does not
have it either). Then you could do "-global intel-iommu.full_featured=on"
and have all those nice things available.
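
Roughly along these lines - property and field names made up, untested:

static Property iommu_properties[] = {
    DEFINE_PROP_UINT32("version", IntelIOMMUState, version, 0),
    DEFINE_PROP_BIT("full_featured", IntelIOMMUState, features, 0, false),
    DEFINE_PROP_END_OF_LIST(),
};

plus setting the corresponding CAP/ECAP bits in do_vtd_init() when it is on.
But that is clearly follow-up material.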

> +
> +    entry->iova = addr & page_mask;
> +    entry->translated_addr = get_slpte_addr(slpte) & page_mask;
> +    entry->addr_mask = ~page_mask;
> +    entry->perm = slpte & VTD_SL_RW_MASK;
> +}
> +
> +static void vtd_root_table_setup(IntelIOMMUState *s)
> +{
> +    s->root = get_quad_raw(s, DMAR_RTADDR_REG);
> +    s->root_extended = s->root & VTD_RTADDR_RTT;
> +    s->root &= VTD_RTADDR_ADDR_MASK;
> +
> +    VTD_DPRINTF(CSR, "root_table addr 0x%"PRIx64 " %s", s->root,
> +                (s->root_extended ? "(extended)" : ""));
> +}
> +
> +/* Context-cache invalidation
> + * Returns the Context Actual Invalidation Granularity.
> + * @val: the content of the CCMD_REG
> + */
> +static uint64_t vtd_context_cache_invalidate(IntelIOMMUState *s, uint64_t val)
> +{
> +    uint64_t caig;
> +    uint64_t type = val & VTD_CCMD_CIRG_MASK;
> +
> +    switch (type) {
> +    case VTD_CCMD_GLOBAL_INVL:
> +        VTD_DPRINTF(INV, "Global invalidation request");
> +        caig = VTD_CCMD_GLOBAL_INVL_A;
> +        break;
> +
> +    case VTD_CCMD_DOMAIN_INVL:
> +        VTD_DPRINTF(INV, "Domain-selective invalidation request");
> +        caig = VTD_CCMD_DOMAIN_INVL_A;
> +        break;
> +
> +    case VTD_CCMD_DEVICE_INVL:
> +        VTD_DPRINTF(INV, "Domain-selective invalidation request");
> +        caig = VTD_CCMD_DEVICE_INVL_A;
> +        break;
> +
> +    default:
> +        VTD_DPRINTF(GENERAL,
> +                    "error: wrong context-cache invalidation granularity");
> +        caig = 0;
> +    }
> +
> +    return caig;
> +}
> +
> +/* Flush IOTLB
> + * Returns the IOTLB Actual Invalidation Granularity.
> + * @val: the content of the IOTLB_REG
> + */
> +static uint64_t vtd_iotlb_flush(IntelIOMMUState *s, uint64_t val)
> +{
> +    uint64_t iaig;
> +    uint64_t type = val & VTD_TLB_FLUSH_GRANU_MASK;
> +
> +    switch (type) {
> +    case VTD_TLB_GLOBAL_FLUSH:
> +        VTD_DPRINTF(INV, "Global IOTLB flush");
> +        iaig = VTD_TLB_GLOBAL_FLUSH_A;
> +        break;
> +
> +    case VTD_TLB_DSI_FLUSH:
> +        VTD_DPRINTF(INV, "Domain-selective IOTLB flush");
> +        iaig = VTD_TLB_DSI_FLUSH_A;
> +        break;
> +
> +    case VTD_TLB_PSI_FLUSH:
> +        VTD_DPRINTF(INV, "Page-selective-within-domain IOTLB flush");
> +        iaig = VTD_TLB_PSI_FLUSH_A;
> +        break;
> +
> +    default:
> +        VTD_DPRINTF(GENERAL, "error: wrong iotlb flush granularity");
> +        iaig = 0;
> +    }
> +
> +    return iaig;
> +}
> +
> +/* Set Root Table Pointer */
> +static void handle_gcmd_srtp(IntelIOMMUState *s)
> +{
> +    VTD_DPRINTF(CSR, "set Root Table Pointer");
> +
> +    vtd_root_table_setup(s);
> +    /* Ok - report back to driver */
> +    set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_RTPS);
> +}
> +
> +/* Handle Translation Enable/Disable */
> +static void handle_gcmd_te(IntelIOMMUState *s, bool en)
> +{
> +    VTD_DPRINTF(CSR, "Translation Enable %s", (en ? "on" : "off"));
> +
> +    if (en) {
> +        s->dmar_enabled = true;
> +        /* Ok - report back to driver */
> +        set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_TES);
> +    } else {
> +        s->dmar_enabled = false;
> +
> +        /* Clear the index of Fault Recording Register */
> +        s->next_frcd_reg = 0;
> +        /* Ok - report back to driver */
> +        set_clear_mask_long(s, DMAR_GSTS_REG, VTD_GSTS_TES, 0);
> +    }
> +}
> +
> +/* Handle write to Global Command Register */
> +static void handle_gcmd_write(IntelIOMMUState *s)
> +{
> +    uint32_t status = get_long_raw(s, DMAR_GSTS_REG);
> +    uint32_t val = get_long_raw(s, DMAR_GCMD_REG);
> +    uint32_t changed = status ^ val;
> +
> +    VTD_DPRINTF(CSR, "value 0x%"PRIx32 " status 0x%"PRIx32, val, status);
> +    if (changed & VTD_GCMD_TE) {
> +        /* Translation enable/disable */
> +        handle_gcmd_te(s, val & VTD_GCMD_TE);
> +    }
> +    if (val & VTD_GCMD_SRTP) {
> +        /* Set/update the root-table pointer */
> +        handle_gcmd_srtp(s);
> +    }
> +}
> +
> +/* Handle write to Context Command Register */
> +static void handle_ccmd_write(IntelIOMMUState *s)
> +{
> +    uint64_t ret;
> +    uint64_t val = get_quad_raw(s, DMAR_CCMD_REG);
> +
> +    /* Context-cache invalidation request */
> +    if (val & VTD_CCMD_ICC) {
> +        ret = vtd_context_cache_invalidate(s, val);
> +
> +        /* Invalidation completed, clear ICC and report back CAIG */
> +        set_clear_mask_quad(s, DMAR_CCMD_REG, VTD_CCMD_ICC, 0ULL);
> +        ret = set_clear_mask_quad(s, DMAR_CCMD_REG, VTD_CCMD_CAIG_MASK, ret);
> +        VTD_DPRINTF(INV, "CCMD_REG write-back val: 0x%"PRIx64, ret);
> +    }
> +}
> +
> +/* Handle write to IOTLB Invalidation Register */
> +static void handle_iotlb_write(IntelIOMMUState *s)
> +{
> +    uint64_t ret;
> +    uint64_t val = get_quad_raw(s, DMAR_IOTLB_REG);
> +
> +    /* IOTLB invalidation request */
> +    if (val & VTD_TLB_IVT) {
> +        ret = vtd_iotlb_flush(s, val);
> +
> +        /* Invalidation completed, clear IVT and report back IAIG */
> +        set_clear_mask_quad(s, DMAR_IOTLB_REG, VTD_TLB_IVT, 0ULL);
> +        ret = set_clear_mask_quad(s, DMAR_IOTLB_REG,
> +                                  VTD_TLB_FLUSH_GRANU_MASK_A, ret);
> +        VTD_DPRINTF(INV, "IOTLB_REG write-back val: 0x%"PRIx64, ret);
> +    }
> +}
> +
> +static inline void handle_fsts_write(IntelIOMMUState *s)
> +{
> +    uint32_t fsts_reg = get_long_raw(s, DMAR_FSTS_REG);
> +    uint32_t fectl_reg = get_long_raw(s, DMAR_FECTL_REG);
> +    uint32_t status_fields = VTD_FSTS_PFO | VTD_FSTS_PPF | VTD_FSTS_IQE;
> +
> +    if ((fectl_reg & VTD_FECTL_IP) && !(fsts_reg & status_fields)) {
> +        set_clear_mask_long(s, DMAR_FECTL_REG, VTD_FECTL_IP, 0);
> +        VTD_DPRINTF(FLOG, "all pending interrupt conditions serviced, clear "
> +                    "IP field of FECTL_REG");
> +    }
> +}
> +
> +static inline void handle_fectl_write(IntelIOMMUState *s)
> +{
> +    uint32_t fectl_reg;
> +    /* When software clears the IM field, check the IP field. But do we
> +     * need to compare the old value and the new value to conclude that
> +     * software cleared the IM field? Or just check if the IM field is zero?
> +     */
> +    fectl_reg = get_long_raw(s, DMAR_FECTL_REG);
> +    if ((fectl_reg & VTD_FECTL_IP) && !(fectl_reg & VTD_FECTL_IM)) {
> +        vtd_generate_interrupt(s, DMAR_FEADDR_REG, DMAR_FEDATA_REG);
> +        set_clear_mask_long(s, DMAR_FECTL_REG, VTD_FECTL_IP, 0);
> +        VTD_DPRINTF(FLOG, "IM field is cleared, generate "
> +                    "fault event interrupt");
> +    }
> +}
> +
> +static uint64_t vtd_mem_read(void *opaque, hwaddr addr, unsigned size)
> +{
> +    IntelIOMMUState *s = opaque;
> +    uint64_t val;
> +
> +    if (addr + size > DMAR_REG_SIZE) {
> +        VTD_DPRINTF(GENERAL, "error: addr outside region: max 0x%"PRIx64
> +                    ", got 0x%"PRIx64 " %d",
> +                    (uint64_t)DMAR_REG_SIZE, addr, size);
> +        return (uint64_t)-1;
> +    }
> +
> +    assert(size == 4 || size == 8);

You already declare in the ops that you only support 4 and 8 byte
accesses, no?
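
I.e. the

.impl = {
    .min_access_size = 4,
    .max_access_size = 8,
},

in vtd_mem_ops below already restricts the callbacks to 4- and 8-byte
accesses, so the assert is redundant.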

> +
> +    switch (addr) {
> +    /* Root Table Address Register, 64-bit */
> +    case DMAR_RTADDR_REG:
> +        if (size == 4) {
> +            val = s->root & ((1ULL << 32) - 1);
> +        } else {
> +            val = s->root;
> +        }
> +        break;
> +
> +    case DMAR_RTADDR_REG_HI:
> +        assert(size == 4);
> +        val = s->root >> 32;
> +        break;
> +
> +    default:
> +        if (size == 4) {
> +            val = get_long(s, addr);
> +        } else {
> +            val = get_quad(s, addr);
> +        }
> +    }
> +
> +    VTD_DPRINTF(CSR, "addr 0x%"PRIx64 " size %d val 0x%"PRIx64,
> +                addr, size, val);
> +    return val;
> +}
> +
> +static void vtd_mem_write(void *opaque, hwaddr addr,
> +                          uint64_t val, unsigned size)
> +{
> +    IntelIOMMUState *s = opaque;
> +
> +    if (addr + size > DMAR_REG_SIZE) {
> +        VTD_DPRINTF(GENERAL, "error: addr outside region: max 0x%"PRIx64
> +                    ", got 0x%"PRIx64 " %d",
> +                    (uint64_t)DMAR_REG_SIZE, addr, size);
> +        return;
> +    }
> +
> +    assert(size == 4 || size == 8);
> +
> +    switch (addr) {
> +    /* Global Command Register, 32-bit */
> +    case DMAR_GCMD_REG:
> +        VTD_DPRINTF(CSR, "DMAR_GCMD_REG write addr 0x%"PRIx64
> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
> +        set_long(s, addr, val);
> +        handle_gcmd_write(s);
> +        break;
> +
> +    /* Context Command Register, 64-bit */
> +    case DMAR_CCMD_REG:
> +        VTD_DPRINTF(CSR, "DMAR_CCMD_REG write addr 0x%"PRIx64
> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
> +        if (size == 4) {
> +            set_long(s, addr, val);
> +        } else {
> +            set_quad(s, addr, val);
> +            handle_ccmd_write(s);
> +        }
> +        break;
> +
> +    case DMAR_CCMD_REG_HI:
> +        VTD_DPRINTF(CSR, "DMAR_CCMD_REG_HI write addr 0x%"PRIx64
> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
> +        assert(size == 4);
> +        set_long(s, addr, val);
> +        handle_ccmd_write(s);
> +        break;
> +
> +
> +    /* IOTLB Invalidation Register, 64-bit */
> +    case DMAR_IOTLB_REG:
> +        VTD_DPRINTF(INV, "DMAR_IOTLB_REG write addr 0x%"PRIx64
> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
> +        if (size == 4) {
> +            set_long(s, addr, val);
> +        } else {
> +            set_quad(s, addr, val);
> +            handle_iotlb_write(s);
> +        }
> +        break;
> +
> +    case DMAR_IOTLB_REG_HI:
> +        VTD_DPRINTF(INV, "DMAR_IOTLB_REG_HI write addr 0x%"PRIx64
> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
> +        assert(size == 4);
> +        set_long(s, addr, val);
> +        handle_iotlb_write(s);
> +        break;
> +
> +    /* Fault Status Register, 32-bit */
> +    case DMAR_FSTS_REG:
> +        VTD_DPRINTF(FLOG, "DMAR_FSTS_REG write addr 0x%"PRIx64
> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
> +        assert(size == 4);
> +        set_long(s, addr, val);
> +        handle_fsts_write(s);
> +        break;
> +
> +    /* Fault Event Control Register, 32-bit */
> +    case DMAR_FECTL_REG:
> +        VTD_DPRINTF(FLOG, "DMAR_FECTL_REG write addr 0x%"PRIx64
> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
> +        assert(size == 4);
> +        set_long(s, addr, val);
> +        handle_fectl_write(s);
> +        break;
> +
> +    /* Fault Event Data Register, 32-bit */
> +    case DMAR_FEDATA_REG:
> +        VTD_DPRINTF(FLOG, "DMAR_FEDATA_REG write addr 0x%"PRIx64
> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
> +        assert(size == 4);
> +        set_long(s, addr, val);
> +        break;
> +
> +    /* Fault Event Address Register, 32-bit */
> +    case DMAR_FEADDR_REG:
> +        VTD_DPRINTF(FLOG, "DMAR_FEADDR_REG write addr 0x%"PRIx64
> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
> +        assert(size == 4);
> +        set_long(s, addr, val);
> +        break;
> +
> +    /* Fault Event Upper Address Register, 32-bit */
> +    case DMAR_FEUADDR_REG:
> +        VTD_DPRINTF(FLOG, "DMAR_FEUADDR_REG write addr 0x%"PRIx64
> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
> +        assert(size == 4);
> +        set_long(s, addr, val);
> +        break;
> +
> +    /* Protected Memory Enable Register, 32-bit */
> +    case DMAR_PMEN_REG:
> +        VTD_DPRINTF(CSR, "DMAR_PMEN_REG write addr 0x%"PRIx64
> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
> +        assert(size == 4);
> +        set_long(s, addr, val);
> +        break;
> +
> +
> +    /* Root Table Address Register, 64-bit */
> +    case DMAR_RTADDR_REG:
> +        VTD_DPRINTF(CSR, "DMAR_RTADDR_REG write addr 0x%"PRIx64
> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
> +        if (size == 4) {
> +            set_long(s, addr, val);
> +        } else {
> +            set_quad(s, addr, val);
> +        }
> +        break;
> +
> +    case DMAR_RTADDR_REG_HI:
> +        VTD_DPRINTF(CSR, "DMAR_RTADDR_REG_HI write addr 0x%"PRIx64
> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
> +        assert(size == 4);
> +        set_long(s, addr, val);
> +        break;
> +
> +    /* Fault Recording Registers, 128-bit */
> +    case DMAR_FRCD_REG_0_0:
> +        VTD_DPRINTF(FLOG, "DMAR_FRCD_REG_0_0 write addr 0x%"PRIx64
> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
> +        if (size == 4) {
> +            set_long(s, addr, val);
> +        } else {
> +            set_quad(s, addr, val);
> +        }
> +        break;
> +
> +    case DMAR_FRCD_REG_0_1:
> +        VTD_DPRINTF(FLOG, "DMAR_FRCD_REG_0_1 write addr 0x%"PRIx64
> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
> +        assert(size == 4);
> +        set_long(s, addr, val);
> +        break;
> +
> +    case DMAR_FRCD_REG_0_2:
> +        VTD_DPRINTF(FLOG, "DMAR_FRCD_REG_0_2 write addr 0x%"PRIx64
> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
> +        if (size == 4) {
> +            set_long(s, addr, val);
> +        } else {
> +            set_quad(s, addr, val);
> +            /* May clear bit 127 (Fault), update PPF */
> +            update_fsts_ppf(s);
> +        }
> +        break;
> +
> +    case DMAR_FRCD_REG_0_3:
> +        VTD_DPRINTF(FLOG, "DMAR_FRCD_REG_0_3 write addr 0x%"PRIx64
> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
> +        assert(size == 4);
> +        set_long(s, addr, val);
> +        /* May clear bit 127 (Fault), update PPF */
> +        update_fsts_ppf(s);
> +        break;
> +
> +    default:
> +        VTD_DPRINTF(GENERAL, "error: unhandled reg write addr 0x%"PRIx64
> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
> +        if (size == 4) {
> +            set_long(s, addr, val);
> +        } else {
> +            set_quad(s, addr, val);
> +        }
> +    }
> +
> +}
> +
> +static IOMMUTLBEntry vtd_iommu_translate(MemoryRegion *iommu, hwaddr addr,
> +                                         bool is_write)
> +{
> +    VTDAddressSpace *vtd_as = container_of(iommu, VTDAddressSpace, iommu);
> +    IntelIOMMUState *s = vtd_as->iommu_state;
> +    int bus_num = vtd_as->bus_num;
> +    int devfn = vtd_as->devfn;
> +    IOMMUTLBEntry ret = {
> +        .target_as = &address_space_memory,
> +        .iova = addr,
> +        .translated_addr = 0,
> +        .addr_mask = ~(hwaddr)0,
> +        .perm = IOMMU_NONE,
> +    };
> +
> +    if (!s->dmar_enabled) {
> +        /* DMAR disabled, passthrough, use 4k page */
> +        ret.iova = addr & VTD_PAGE_MASK_4K;
> +        ret.translated_addr = addr & VTD_PAGE_MASK_4K;
> +        ret.addr_mask = ~VTD_PAGE_MASK_4K;
> +        ret.perm = IOMMU_RW;
> +        return ret;
> +    }
> +
> +    iommu_translate(s, bus_num, devfn, addr, is_write, &ret);
> +
> +    VTD_DPRINTF(MMU,
> +                "bus %d slot %d func %d devfn %d gpa %"PRIx64 " hpa %"PRIx64,
> +                bus_num, VTD_PCI_SLOT(devfn), VTD_PCI_FUNC(devfn), devfn, addr,
> +                ret.translated_addr);
> +    return ret;
> +}
> +
> +static const VMStateDescription vtd_vmstate = {
> +    .name = "iommu_intel",
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .minimum_version_id_old = 1,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_UINT8_ARRAY(csr, IntelIOMMUState, DMAR_REG_SIZE),
> +        VMSTATE_END_OF_LIST()
> +    }
> +};

Did you test migration? I suppose not. :)

Background: you mirror several register states into IntelIOMMUState
fields, I guess to make them handier to use. However, those need to
be updated on vmload. And there are surely more internal states that
have to be migrated as well, e.g. the currently active root pointer.

I would suggest either reviewing and fixing this or leaving migration support
out for now (".unmigratable = 1").
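
The minimal variant would be just (sketch):

static const VMStateDescription vtd_vmstate = {
    .name = "iommu_intel",
    .unmigratable = 1,
};

until the mirrored registers and the remaining internal state are properly
covered by the vmstate description.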

> +
> +static const MemoryRegionOps vtd_mem_ops = {
> +    .read = vtd_mem_read,
> +    .write = vtd_mem_write,
> +    .endianness = DEVICE_LITTLE_ENDIAN,
> +    .impl = {
> +        .min_access_size = 4,
> +        .max_access_size = 8,
> +    },
> +    .valid = {
> +        .min_access_size = 4,
> +        .max_access_size = 8,
> +    },
> +};
> +
> +static Property iommu_properties[] = {
> +    DEFINE_PROP_UINT32("version", IntelIOMMUState, version, 0),
> +    DEFINE_PROP_END_OF_LIST(),
> +};
> +
> +/* Do the real initialization. It will also be called on reset, so pay
> + * attention when adding new initialization code.
> + */
> +static void do_vtd_init(IntelIOMMUState *s)
> +{
> +    memset(s->csr, 0, DMAR_REG_SIZE);
> +    memset(s->wmask, 0, DMAR_REG_SIZE);
> +    memset(s->w1cmask, 0, DMAR_REG_SIZE);
> +    memset(s->womask, 0, DMAR_REG_SIZE);
> +
> +    s->iommu_ops.translate = vtd_iommu_translate;
> +    s->root = 0;
> +    s->root_extended = false;
> +    s->dmar_enabled = false;
> +    s->iq_head = 0;
> +    s->iq_tail = 0;
> +    s->iq = 0;
> +    s->iq_size = 0;
> +    s->qi_enabled = false;
> +    s->iq_last_desc_type = VTD_INV_DESC_NONE;
> +    s->next_frcd_reg = 0;
> +
> +    /* b.0:2 = 6: Number of domains supported: 64K using 16 bit ids
> +     * b.3   = 0: Advanced fault logging not supported
> +     * b.4   = 0: Required write buffer flushing not supported
> +     * b.5   = 0: Protected low memory region not supported
> +     * b.6   = 0: Protected high memory region not supported
> +     * b.8:12 = 2: SAGAW(Supported Adjusted Guest Address Widths), 39-bit,
> +     *             3-level page-table
> +     * b.16:21 = 38: MGAW(Maximum Guest Address Width) = 39
> +     * b.22 = 0: ZLR(Zero Length Read) zero length DMA read requests
> +     *           to write-only pages not supported
> +     * b.24:33 = 34: FRO(Fault-recording Register offset)
> +     * b.54 = 0: DWD(Write Draining), draining of write requests not supported
> +     * b.55 = 0: DRD(Read Draining), draining of read requests not supported
> +     */

I think this level of documentation is a bit overkill. You already
document the register layout implicitly by defining the constants.
Applies elsewhere, too.
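
E.g. a one- or two-line summary would carry the same information:

/* 64K domains, 39-bit MGAW, 39-bit/3-level SAGAW, no ZLR, no DWD/DRD,
 * FRO as defined above */
s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND | VTD_CAP_MGAW |
         VTD_CAP_SAGAW;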

> +    s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND | VTD_CAP_MGAW |
> +             VTD_CAP_SAGAW;
> +
> +    /* b.1 = 0: QI(Queued Invalidation support) not supported
> +     * b.2 = 0: DT(Device-TLB support) not supported
> +     * b.3 = 0: IR(Interrupt Remapping support) not supported
> +     * b.4 = 0: EIM(Extended Interrupt Mode) not supported
> +     * b.8:17 = 15: IRO(IOTLB Register Offset)
> +     * b.20:23 = 0: MHMV(Maximum Handle Mask Value) not valid
> +     */
> +    s->ecap = VTD_ECAP_IRO;
> +
> +    /* Define registers with default values and bit semantics */
> +    define_long(s, DMAR_VER_REG, 0x10UL, 0, 0);  /* set MAX = 1, RO */
> +    define_quad(s, DMAR_CAP_REG, s->cap, 0, 0);
> +    define_quad(s, DMAR_ECAP_REG, s->ecap, 0, 0);
> +    define_long(s, DMAR_GCMD_REG, 0, 0xff800000UL, 0);
> +    define_long_wo(s, DMAR_GCMD_REG, 0xff800000UL);
> +    define_long(s, DMAR_GSTS_REG, 0, 0, 0); /* All bits RO, default 0 */
> +    define_quad(s, DMAR_RTADDR_REG, 0, 0xfffffffffffff000ULL, 0);
> +    define_quad(s, DMAR_CCMD_REG, 0, 0xe0000003ffffffffULL, 0);
> +    define_quad_wo(s, DMAR_CCMD_REG, 0x3ffff0000ULL);
> +
> +    /* Advanced Fault Logging not supported */
> +    define_long(s, DMAR_FSTS_REG, 0, 0, 0x11UL);
> +    define_long(s, DMAR_FECTL_REG, 0x80000000UL, 0x80000000UL, 0);
> +    define_long(s, DMAR_FEDATA_REG, 0, 0x0000ffffUL, 0); /* 15:0 RW */
> +    define_long(s, DMAR_FEADDR_REG, 0, 0xfffffffcUL, 0); /* 31:2 RW */
> +
> +    /* Treated as RsvdZ when EIM in ECAP_REG is not supported
> +     * define_long(s, DMAR_FEUADDR_REG, 0, 0xffffffffUL, 0);
> +     */
> +    define_long(s, DMAR_FEUADDR_REG, 0, 0, 0);
> +
> +    /* Treated as RO for implementations that PLMR and PHMR fields reported
> +     * as Clear in the CAP_REG.
> +     * define_long(s, DMAR_PMEN_REG, 0, 0x80000000UL, 0);
> +     */
> +    define_long(s, DMAR_PMEN_REG, 0, 0, 0);
> +
> +    /* IOTLB registers */
> +    define_quad(s, DMAR_IOTLB_REG, 0, 0xb003ffff00000000ULL, 0);
> +    define_quad(s, DMAR_IVA_REG, 0, 0xfffffffffffff07fULL, 0);
> +    define_quad_wo(s, DMAR_IVA_REG, 0xfffffffffffff07fULL);
> +
> +    /* Fault Recording Registers, 128-bit */
> +    define_quad(s, DMAR_FRCD_REG_0_0, 0, 0, 0);
> +    define_quad(s, DMAR_FRCD_REG_0_2, 0, 0, 0x8000000000000000ULL);
> +}
> +
> +/* Reset function of QOM
> + * Should not reset address_spaces when reset

What does "should not" mean here? Is it an open todo?

> + */
> +static void vtd_reset(DeviceState *dev)
> +{
> +    IntelIOMMUState *s = INTEL_IOMMU_DEVICE(dev);
> +
> +    VTD_DPRINTF(GENERAL, "");
> +    do_vtd_init(s);
> +}
> +
> +/* Initialization function of QOM */
> +static void vtd_realize(DeviceState *dev, Error **errp)
> +{
> +    IntelIOMMUState *s = INTEL_IOMMU_DEVICE(dev);
> +
> +    VTD_DPRINTF(GENERAL, "");
> +    memset(s->address_spaces, 0, sizeof(s->address_spaces));
> +    memory_region_init_io(&s->csrmem, OBJECT(s), &vtd_mem_ops, s,
> +                          "intel_iommu", DMAR_REG_SIZE);
> +    sysbus_init_mmio(SYS_BUS_DEVICE(s), &s->csrmem);
> +    do_vtd_init(s);
> +}
> +
> +static void vtd_class_init(ObjectClass *klass, void *data)
> +{
> +    DeviceClass *dc = DEVICE_CLASS(klass);
> +
> +    dc->reset = vtd_reset;
> +    dc->realize = vtd_realize;
> +    dc->vmsd = &vtd_vmstate;
> +    dc->props = iommu_properties;
> +}
> +
> +static const TypeInfo vtd_info = {
> +    .name          = TYPE_INTEL_IOMMU_DEVICE,
> +    .parent        = TYPE_SYS_BUS_DEVICE,
> +    .instance_size = sizeof(IntelIOMMUState),
> +    .class_init    = vtd_class_init,
> +};
> +
> +static void vtd_register_types(void)
> +{
> +    VTD_DPRINTF(GENERAL, "");
> +    type_register_static(&vtd_info);
> +}
> +
> +type_init(vtd_register_types)
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> new file mode 100644
> index 0000000..7bc679a
> --- /dev/null
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -0,0 +1,345 @@
> +/*
> + * QEMU emulation of an Intel IOMMU (VT-d)
> + *   (DMA Remapping device)
> + *
> + * Copyright (C) 2013 Knut Omang, Oracle <knut.omang@oracle.com>
> + * Copyright (C) 2014 Le Tan, <tamlokveer@gmail.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> +
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> +
> + * You should have received a copy of the GNU General Public License along
> + * with this program; if not, see <http://www.gnu.org/licenses/>.
> + *
> + * Lots of defines copied from kernel/include/linux/intel-iommu.h:
> + *   Copyright (C) 2006-2008 Intel Corporation
> + *   Author: Ashok Raj <ashok.raj@intel.com>
> + *   Author: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
> + *
> + */
> +
> +#ifndef HW_I386_INTEL_IOMMU_INTERNAL_H
> +#define HW_I386_INTEL_IOMMU_INTERNAL_H
> +#include "hw/i386/intel_iommu.h"
> +
> +/*
> + * Intel IOMMU register specification
> + */
> +#define DMAR_VER_REG    0x0 /* Arch version supported by this IOMMU */
> +#define DMAR_CAP_REG    0x8 /* Hardware supported capabilities */
> +#define DMAR_CAP_REG_HI 0xc /* High 32-bit of DMAR_CAP_REG */
> +#define DMAR_ECAP_REG   0x10    /* Extended capabilities supported */
> +#define DMAR_ECAP_REG_HI    0X14
> +#define DMAR_GCMD_REG   0x18    /* Global command register */
> +#define DMAR_GSTS_REG   0x1c    /* Global status register */
> +#define DMAR_RTADDR_REG 0x20    /* Root entry table */
> +#define DMAR_RTADDR_REG_HI  0X24
> +#define DMAR_CCMD_REG   0x28  /* Context command reg */
> +#define DMAR_CCMD_REG_HI    0x2c
> +#define DMAR_FSTS_REG   0x34  /* Fault Status register */
> +#define DMAR_FECTL_REG  0x38 /* Fault control register */
> +#define DMAR_FEDATA_REG 0x3c    /* Fault event interrupt data register */
> +#define DMAR_FEADDR_REG 0x40    /* Fault event interrupt addr register */
> +#define DMAR_FEUADDR_REG    0x44   /* Upper address register */
> +#define DMAR_AFLOG_REG  0x58 /* Advanced Fault control */
> +#define DMAR_AFLOG_REG_HI   0X5c
> +#define DMAR_PMEN_REG   0x64  /* Enable Protected Memory Region */
> +#define DMAR_PLMBASE_REG    0x68    /* PMRR Low addr */
> +#define DMAR_PLMLIMIT_REG 0x6c  /* PMRR low limit */
> +#define DMAR_PHMBASE_REG 0x70   /* pmrr high base addr */
> +#define DMAR_PHMBASE_REG_HI 0X74
> +#define DMAR_PHMLIMIT_REG 0x78  /* pmrr high limit */
> +#define DMAR_PHMLIMIT_REG_HI 0x7c
> +#define DMAR_IQH_REG    0x80   /* Invalidation queue head register */
> +#define DMAR_IQH_REG_HI 0X84
> +#define DMAR_IQT_REG    0x88   /* Invalidation queue tail register */
> +#define DMAR_IQT_REG_HI 0X8c
> +#define DMAR_IQ_SHIFT   4 /* Invalidation queue head/tail shift */
> +#define DMAR_IQA_REG    0x90   /* Invalidation queue addr register */
> +#define DMAR_IQA_REG_HI 0x94
> +#define DMAR_ICS_REG    0x9c   /* Invalidation complete status register */
> +#define DMAR_IRTA_REG   0xb8    /* Interrupt remapping table addr register */
> +#define DMAR_IRTA_REG_HI    0xbc

Please align all those constants:

#define CONSTANT                        0x1234
#define CONSTANT_WITH_LONGER_NAME       0x5678

> +
> +#define DMAR_IECTL_REG  0xa0    /* Invalidation event control register */
> +#define DMAR_IEDATA_REG 0xa4    /* Invalidation event data register */
> +#define DMAR_IEADDR_REG 0xa8    /* Invalidation event address register */
> +#define DMAR_IEUADDR_REG 0xac    /* Invalidation event address register */
> +#define DMAR_PQH_REG    0xc0    /* Page request queue head register */
> +#define DMAR_PQH_REG_HI 0xc4
> +#define DMAR_PQT_REG    0xc8    /* Page request queue tail register*/
> +#define DMAR_PQT_REG_HI     0xcc
> +#define DMAR_PQA_REG    0xd0    /* Page request queue address register */
> +#define DMAR_PQA_REG_HI 0xd4
> +#define DMAR_PRS_REG    0xdc    /* Page request status register */
> +#define DMAR_PECTL_REG  0xe0    /* Page request event control register */
> +#define DMAR_PEDATA_REG 0xe4    /* Page request event data register */
> +#define DMAR_PEADDR_REG 0xe8    /* Page request event address register */
> +#define DMAR_PEUADDR_REG  0xec  /* Page event upper address register */
> +#define DMAR_MTRRCAP_REG 0x100  /* MTRR capability register */
> +#define DMAR_MTRRCAP_REG_HI 0x104
> +#define DMAR_MTRRDEF_REG 0x108  /* MTRR default type register */
> +#define DMAR_MTRRDEF_REG_HI 0x10c
> +
> +/* IOTLB */
> +#define DMAR_IOTLB_REG_OFFSET 0xf0  /* Offset to the IOTLB registers */
> +#define DMAR_IVA_REG DMAR_IOTLB_REG_OFFSET  /* Invalidate Address Register */
> +#define DMAR_IVA_REG_HI (DMAR_IVA_REG + 4)
> +/* IOTLB Invalidate Register */
> +#define DMAR_IOTLB_REG (DMAR_IOTLB_REG_OFFSET + 0x8)
> +#define DMAR_IOTLB_REG_HI (DMAR_IOTLB_REG + 4)
> +
> +/* FRCD */
> +#define DMAR_FRCD_REG_OFFSET 0x220 /* Offset to the Fault Recording Registers */
> +/* NOTICE: If you change the DMAR_FRCD_REG_NR, please remember to change the
> + * DMAR_REG_SIZE in include/hw/i386/intel_iommu.h.
> + * #define DMAR_REG_SIZE   (DMAR_FRCD_REG_OFFSET + 16 * DMAR_FRCD_REG_NR)
> + */
> +#define DMAR_FRCD_REG_NR 1ULL /* Num of Fault Recording Registers */
> +
> +#define DMAR_FRCD_REG_0_0    0x220 /* The 0th Fault Recording Register */
> +#define DMAR_FRCD_REG_0_1    0x224
> +#define DMAR_FRCD_REG_0_2    0x228
> +#define DMAR_FRCD_REG_0_3    0x22c
> +
> +/* Interrupt Address Range */
> +#define VTD_INTERRUPT_ADDR_FIRST    0xfee00000ULL
> +#define VTD_INTERRUPT_ADDR_LAST     0xfeefffffULL
> +
> +/* IOTLB_REG */
> +#define VTD_TLB_GLOBAL_FLUSH (1ULL << 60) /* Global invalidation */
> +#define VTD_TLB_DSI_FLUSH (2ULL << 60)  /* Domain-selective invalidation */
> +#define VTD_TLB_PSI_FLUSH (3ULL << 60)  /* Page-selective invalidation */
> +#define VTD_TLB_FLUSH_GRANU_MASK (3ULL << 60)
> +#define VTD_TLB_GLOBAL_FLUSH_A (1ULL << 57)
> +#define VTD_TLB_DSI_FLUSH_A (2ULL << 57)
> +#define VTD_TLB_PSI_FLUSH_A (3ULL << 57)
> +#define VTD_TLB_FLUSH_GRANU_MASK_A (3ULL << 57)
> +#define VTD_TLB_IVT (1ULL << 63)
> +
> +/* GCMD_REG */
> +#define VTD_GCMD_TE (1UL << 31)
> +#define VTD_GCMD_SRTP (1UL << 30)
> +#define VTD_GCMD_SFL (1UL << 29)
> +#define VTD_GCMD_EAFL (1UL << 28)
> +#define VTD_GCMD_WBF (1UL << 27)
> +#define VTD_GCMD_QIE (1UL << 26)
> +#define VTD_GCMD_IRE (1UL << 25)
> +#define VTD_GCMD_SIRTP (1UL << 24)
> +#define VTD_GCMD_CFI (1UL << 23)
> +
> +/* GSTS_REG */
> +#define VTD_GSTS_TES (1UL << 31)
> +#define VTD_GSTS_RTPS (1UL << 30)
> +#define VTD_GSTS_FLS (1UL << 29)
> +#define VTD_GSTS_AFLS (1UL << 28)
> +#define VTD_GSTS_WBFS (1UL << 27)
> +#define VTD_GSTS_QIES (1UL << 26)
> +#define VTD_GSTS_IRES (1UL << 25)
> +#define VTD_GSTS_IRTPS (1UL << 24)
> +#define VTD_GSTS_CFIS (1UL << 23)
> +
> +/* CCMD_REG */
> +#define VTD_CCMD_ICC (1ULL << 63)
> +#define VTD_CCMD_GLOBAL_INVL (1ULL << 61)
> +#define VTD_CCMD_DOMAIN_INVL (2ULL << 61)
> +#define VTD_CCMD_DEVICE_INVL (3ULL << 61)
> +#define VTD_CCMD_CIRG_MASK (3ULL << 61)
> +#define VTD_CCMD_GLOBAL_INVL_A (1ULL << 59)
> +#define VTD_CCMD_DOMAIN_INVL_A (2ULL << 59)
> +#define VTD_CCMD_DEVICE_INVL_A (3ULL << 59)
> +#define VTD_CCMD_CAIG_MASK (3ULL << 59)
> +
> +/* RTADDR_REG */
> +#define VTD_RTADDR_RTT (1ULL << 11)
> +#define VTD_RTADDR_ADDR_MASK (VTD_HAW_MASK ^ 0xfffULL)
> +
> +/* ECAP_REG */
> +#define VTD_ECAP_IRO (DMAR_IOTLB_REG_OFFSET << 4)  /* (offset >> 4) << 8 */
> +#define VTD_ECAP_QI  (1ULL << 1)
> +
> +/* CAP_REG */
> +#define VTD_CAP_FRO  (DMAR_FRCD_REG_OFFSET << 20) /* (offset >> 4) << 24 */
> +#define VTD_CAP_NFR  ((DMAR_FRCD_REG_NR - 1) << 40)
> +#define VTD_DOMAIN_ID_SHIFT     16  /* 16-bit domain id for 64K domains */
> +#define VTD_CAP_ND  (((VTD_DOMAIN_ID_SHIFT - 4) / 2) & 7ULL)
> +#define VTD_MGAW    39  /* Maximum Guest Address Width */
> +#define VTD_CAP_MGAW    (((VTD_MGAW - 1) & 0x3fULL) << 16)
> +
> +/* Supported Adjusted Guest Address Widths */
> +#define VTD_CAP_SAGAW_SHIFT (8)
> +#define VTD_CAP_SAGAW_MASK  (0x1fULL << VTD_CAP_SAGAW_SHIFT)
> + /* 39-bit AGAW, 3-level page-table */
> +#define VTD_CAP_SAGAW_39bit (0x2ULL << VTD_CAP_SAGAW_SHIFT)
> + /* 48-bit AGAW, 4-level page-table */
> +#define VTD_CAP_SAGAW_48bit (0x4ULL << VTD_CAP_SAGAW_SHIFT)
> +#define VTD_CAP_SAGAW       VTD_CAP_SAGAW_39bit
> +
> +/* IQT_REG */
> +#define VTD_IQT_QT(val)     (((val) >> 4) & 0x7fffULL)
> +
> +/* IQA_REG */
> +#define VTD_IQA_IQA_MASK    (VTD_HAW_MASK ^ 0xfffULL)
> +#define VTD_IQA_QS          (0x7ULL)
> +
> +/* IQH_REG */
> +#define VTD_IQH_QH_SHIFT    (4)
> +#define VTD_IQH_QH_MASK     (0x7fff0ULL)

No need for braces around plain values (i.e. when there are no
operators), here and elsewhere.
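
I.e. simply

#define VTD_IQH_QH_SHIFT    4
#define VTD_IQH_QH_MASK     0x7fff0ULL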

> +
> +/* ICS_REG */
> +#define VTD_ICS_IWC         (1UL)
> +
> +/* IECTL_REG */
> +#define VTD_IECTL_IM        (1UL << 31)
> +#define VTD_IECTL_IP        (1UL << 30)
> +
> +/* FSTS_REG */
> +#define VTD_FSTS_FRI_MASK  (0xff00)
> +#define VTD_FSTS_FRI(val)  ((((uint32_t)(val)) << 8) & VTD_FSTS_FRI_MASK)
> +#define VTD_FSTS_IQE       (1UL << 4)
> +#define VTD_FSTS_PPF       (1UL << 1)
> +#define VTD_FSTS_PFO       (1UL)
> +
> +/* FECTL_REG */
> +#define VTD_FECTL_IM       (1UL << 31)
> +#define VTD_FECTL_IP       (1UL << 30)
> +
> +/* Fault Recording Register */
> +/* For the high 64-bit of 128-bit */
> +#define VTD_FRCD_F         (1ULL << 63)
> +#define VTD_FRCD_T         (1ULL << 62)
> +#define VTD_FRCD_FR(val)   (((val) & 0xffULL) << 32)
> +#define VTD_FRCD_SID_MASK   0xffffULL
> +#define VTD_FRCD_SID(val)  ((val) & VTD_FRCD_SID_MASK)
> +/* For the low 64-bit of 128-bit */
> +#define VTD_FRCD_FI(val)   ((val) & (((1ULL << VTD_MGAW) - 1) ^ 0xfffULL))
> +
> +/* DMA Remapping Fault Conditions */
> +typedef enum VTDFaultReason {
> +    /* Reserved for Advanced Fault logging. We use this to represent the case
> +     * with no fault event.
> +     */
> +    VTD_FR_RESERVED = 0,
> +    VTD_FR_ROOT_ENTRY_P = 1, /* The Present(P) field of root-entry is 0 */
> +    VTD_FR_CONTEXT_ENTRY_P, /* The Present(P) field of context-entry is 0 */
> +    VTD_FR_CONTEXT_ENTRY_INV, /* Invalid programming of a context-entry */
> +    VTD_FR_ADDR_BEYOND_MGAW, /* Input-address above (2^x-1) */
> +    VTD_FR_WRITE, /* No write permission */
> +    VTD_FR_READ, /* No read permission */
> +    /* Fail to access a second-level paging entry (not SL_PML4E) */
> +    VTD_FR_PAGING_ENTRY_INV,
> +    VTD_FR_ROOT_TABLE_INV, /* Fail to access a root-entry */
> +    VTD_FR_CONTEXT_TABLE_INV, /* Fail to access a context-entry */
> +    /* Non-zero reserved field in a present root-entry */
> +    VTD_FR_ROOT_ENTRY_RSVD,
> +    /* Non-zero reserved field in a present context-entry */
> +    VTD_FR_CONTEXT_ENTRY_RSVD,
> +    /* Non-zero reserved field in a second-level paging entry with at least
> +     * one of the Read(R), Write(W), or Execute(E) fields Set.
> +     */
> +    VTD_FR_PAGING_ENTRY_RSVD,
> +    /* Translation request or translated request explicitly blocked due to the
> +     * programming of the Translation Type (T) field in the present
> +     * context-entry.
> +     */
> +    VTD_FR_CONTEXT_ENTRY_TT,
> +    /* This is not a normal fault reason. We use this to indicate some faults
> +     * that are not referenced by the VT-d specification.
> +     * Fault events with such a reason should not be recorded.
> +     */
> +    VTD_FR_RESERVED_ERR,
> +    /* Guard */
> +    VTD_FR_MAX,
> +} VTDFaultReason;
> +
> +
> +/* Masks for Queued Invalidation Descriptor */
> +#define VTD_INV_DESC_TYPE  (0xf)
> +#define VTD_INV_DESC_CC    (0x1) /* Context-cache Invalidate Descriptor */
> +#define VTD_INV_DESC_IOTLB (0x2)
> +#define VTD_INV_DESC_WAIT  (0x5) /* Invalidation Wait Descriptor */
> +#define VTD_INV_DESC_NONE  (0)   /* Not an Invalidate Descriptor */
> +
> +
> +/* Pagesize of VTD paging structures, including root and context tables */
> +#define VTD_PAGE_SHIFT      (12)
> +#define VTD_PAGE_SIZE       (1ULL << VTD_PAGE_SHIFT)
> +
> +#define VTD_PAGE_SHIFT_4K   (12)
> +#define VTD_PAGE_MASK_4K    (~((1ULL << VTD_PAGE_SHIFT_4K) - 1))
> +#define VTD_PAGE_SHIFT_2M   (21)
> +#define VTD_PAGE_MASK_2M    (~((1ULL << VTD_PAGE_SHIFT_2M) - 1))
> +#define VTD_PAGE_SHIFT_1G   (30)
> +#define VTD_PAGE_MASK_1G    (~((1ULL << VTD_PAGE_SHIFT_1G) - 1))
> +
> +/* Root-Entry
> + * 0: Present
> + * 1-11: Reserved
> + * 12-63: Context-table Pointer
> + * 64-127: Reserved
> + */
> +struct VTDRootEntry {
> +    uint64_t val;
> +    uint64_t rsvd;
> +};
> +typedef struct VTDRootEntry VTDRootEntry;
> +
> +/* Masks for struct VTDRootEntry */
> +#define VTD_ROOT_ENTRY_P (1ULL << 0)
> +#define VTD_ROOT_ENTRY_CTP  (~0xfffULL)
> +
> +#define VTD_ROOT_ENTRY_NR   (VTD_PAGE_SIZE / sizeof(VTDRootEntry))
> +#define VTD_ROOT_ENTRY_RSVD (0xffeULL | ~VTD_HAW_MASK)
> +
> +/* Context-Entry */
> +struct VTDContextEntry {
> +    uint64_t lo;
> +    uint64_t hi;
> +};
> +typedef struct VTDContextEntry VTDContextEntry;
> +
> +/* Masks for struct VTDContextEntry */
> +/* lo */
> +#define VTD_CONTEXT_ENTRY_P (1ULL << 0)
> +#define VTD_CONTEXT_ENTRY_FPD   (1ULL << 1) /* Fault Processing Disable */
> +#define VTD_CONTEXT_ENTRY_TT    (3ULL << 2) /* Translation Type */
> +#define VTD_CONTEXT_TT_MULTI_LEVEL  (0)
> +#define VTD_CONTEXT_TT_DEV_IOTLB    (1)
> +#define VTD_CONTEXT_TT_PASS_THROUGH (2)
> +/* Second Level Page Translation Pointer */
> +#define VTD_CONTEXT_ENTRY_SLPTPTR   (~0xfffULL)
> +#define VTD_CONTEXT_ENTRY_RSVD_LO   (0xff0ULL | ~VTD_HAW_MASK)
> +/* hi */
> +#define VTD_CONTEXT_ENTRY_AW    (7ULL) /* Adjusted guest-address-width */
> +#define VTD_CONTEXT_ENTRY_DID   (0xffffULL << 8)    /* Domain Identifier */
> +#define VTD_CONTEXT_ENTRY_RSVD_HI   (0xffffffffff000080ULL)
> +
> +#define VTD_CONTEXT_ENTRY_NR    (VTD_PAGE_SIZE / sizeof(VTDContextEntry))
> +
> +
> +/* Paging Structure common */
> +#define VTD_SL_PT_PAGE_SIZE_MASK   (1ULL << 7)
> +#define VTD_SL_LEVEL_BITS   9   /* Bits to decide the offset for each level */
> +
> +/* Second Level Paging Structure */
> +#define VTD_SL_PML4_LEVEL   4
> +#define VTD_SL_PDP_LEVEL    3
> +#define VTD_SL_PD_LEVEL     2
> +#define VTD_SL_PT_LEVEL     1
> +#define VTD_SL_PT_ENTRY_NR  512
> +
> +/* Masks for Second Level Paging Entry */
> +#define VTD_SL_RW_MASK              (3ULL)
> +#define VTD_SL_R                    (1ULL)
> +#define VTD_SL_W                    (1ULL << 1)
> +#define VTD_SL_PT_BASE_ADDR_MASK    (~(VTD_PAGE_SIZE - 1) & VTD_HAW_MASK)
> +#define VTD_SL_IGN_COM    (0xbff0000000000000ULL)
> +
> +#endif
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> new file mode 100644
> index 0000000..6601e62
> --- /dev/null
> +++ b/include/hw/i386/intel_iommu.h
> @@ -0,0 +1,90 @@
> +/*
> + * QEMU emulation of an Intel IOMMU (VT-d)
> + *   (DMA Remapping device)
> + *
> + * Copyright (C) 2013 Knut Omang, Oracle <knut.omang@oracle.com>
> + * Copyright (C) 2014 Le Tan, <tamlokveer@gmail.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> +
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> +
> + * You should have received a copy of the GNU General Public License along
> + * with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#ifndef INTEL_IOMMU_H
> +#define INTEL_IOMMU_H
> +#include "hw/qdev.h"
> +#include "sysemu/dma.h"
> +
> +#define TYPE_INTEL_IOMMU_DEVICE "intel-iommu"
> +#define INTEL_IOMMU_DEVICE(obj) \
> +     OBJECT_CHECK(IntelIOMMUState, (obj), TYPE_INTEL_IOMMU_DEVICE)
> +
> +/* DMAR Hardware Unit Definition address (IOMMU unit) */
> +#define Q35_HOST_BRIDGE_IOMMU_ADDR 0xfed90000ULL
> +
> +#define VTD_PCI_BUS_MAX 256
> +#define VTD_PCI_SLOT_MAX 32
> +#define VTD_PCI_FUNC_MAX 8
> +#define VTD_PCI_SLOT(devfn)         (((devfn) >> 3) & 0x1f)
> +#define VTD_PCI_FUNC(devfn)         ((devfn) & 0x07)
> +
> +#define DMAR_REG_SIZE   0x230
> +
> +/* FIXME: do not know how to decide the haw */

Nothing to fix IMHO. Just state that this definition is arbitrary, just
large enough to cover all currently expected guest RAM sizes.
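Something along these lines would do (wording is just a suggestion):

    /* This value is arbitrary, just large enough to cover all currently
     * expected guest RAM sizes.
     */
    #define VTD_HOST_ADDRESS_WIDTH  39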

> +#define VTD_HOST_ADDRESS_WIDTH  39
> +#define VTD_HAW_MASK    ((1ULL << VTD_HOST_ADDRESS_WIDTH) - 1)
> +
> +typedef struct IntelIOMMUState IntelIOMMUState;
> +typedef struct VTDAddressSpace VTDAddressSpace;
> +
> +struct VTDAddressSpace {
> +    int bus_num;
> +    int devfn;
> +    AddressSpace as;
> +    MemoryRegion iommu;
> +    IntelIOMMUState *iommu_state;
> +};
> +
> +/* The iommu (DMAR) device state struct */
> +struct IntelIOMMUState {
> +    SysBusDevice busdev;
> +    MemoryRegion csrmem;
> +    uint8_t csr[DMAR_REG_SIZE];     /* register values */
> +    uint8_t wmask[DMAR_REG_SIZE];   /* R/W bytes */
> +    uint8_t w1cmask[DMAR_REG_SIZE]; /* RW1C(Write 1 to Clear) bytes */
> +    uint8_t womask[DMAR_REG_SIZE]; /* WO (write only - read returns 0) */
> +    uint32_t version;
> +
> +    dma_addr_t root;        /* Current root table pointer */
> +    bool root_extended;     /* Type of root table (extended or not) */
> +    bool dmar_enabled;      /* Set if DMA remapping is enabled */
> +
> +    uint16_t iq_head;       /* Current invalidation queue head */
> +    uint16_t iq_tail;       /* Current invalidation queue tail */
> +    dma_addr_t iq;          /* Current invalidation queue (IQ) pointer */
> +    uint16_t iq_size;       /* IQ Size in number of entries */
> +    bool qi_enabled;        /* Set if the QI is enabled */
> +    uint8_t iq_last_desc_type; /* The type of last completed descriptor */
> +
> +    /* The index of the Fault Recording Register to be used next.
> +     * Wraps around from N-1 to 0, where N is the number of FRCD_REG.
> +     */
> +    uint16_t next_frcd_reg;
> +
> +    uint64_t cap;           /* The value of Capability Register */
> +    uint64_t ecap;          /* The value of Extended Capability Register */
> +
> +    MemoryRegionIOMMUOps iommu_ops;
> +    VTDAddressSpace **address_spaces[VTD_PCI_BUS_MAX];
> +};
> +
> +#endif
> 

Very nice job!

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v3 2/5] intel-iommu: introduce Intel IOMMU (VT-d) emulation
  2014-08-12  7:34   ` Jan Kiszka
@ 2014-08-12  9:04     ` Le Tan
  0 siblings, 0 replies; 34+ messages in thread
From: Le Tan @ 2014-08-12  9:04 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Michael S. Tsirkin, Stefan Weil, Knut Omang, qemu-devel,
	Alex Williamson, Anthony Liguori, Paolo Bonzini

Hi Jan,

2014-08-12 15:34 GMT+08:00 Jan Kiszka <jan.kiszka@web.de>:
> On 2014-08-11 09:04, Le Tan wrote:
>> Add support for emulating Intel IOMMU according to the VT-d specification for
>> the q35 chipset machine. Implement the logic for DMAR (DMA remapping) without
>> PASID support. The emulation supports register-based invalidation and primary
>> fault logging.
>
> Some arbitrary comments below (meaning I didn't read every line and
> likely missed some things). In general, this looks and works pretty well!
>
>>
>> Signed-off-by: Le Tan <tamlokveer@gmail.com>
>> ---
>>  hw/i386/Makefile.objs          |    1 +
>>  hw/i386/intel_iommu.c          | 1345 ++++++++++++++++++++++++++++++++++++++++
>>  hw/i386/intel_iommu_internal.h |  345 +++++++++++
>>  include/hw/i386/intel_iommu.h  |   90 +++
>>  4 files changed, 1781 insertions(+)
>>  create mode 100644 hw/i386/intel_iommu.c
>>  create mode 100644 hw/i386/intel_iommu_internal.h
>>  create mode 100644 include/hw/i386/intel_iommu.h
>>
>> diff --git a/hw/i386/Makefile.objs b/hw/i386/Makefile.objs
>> index 48014ab..6936111 100644
>> --- a/hw/i386/Makefile.objs
>> +++ b/hw/i386/Makefile.objs
>> @@ -2,6 +2,7 @@ obj-$(CONFIG_KVM) += kvm/
>>  obj-y += multiboot.o smbios.o
>>  obj-y += pc.o pc_piix.o pc_q35.o
>>  obj-y += pc_sysfw.o
>> +obj-y += intel_iommu.o
>>  obj-$(CONFIG_XEN) += ../xenpv/ xen/
>>
>>  obj-y += kvmvapic.o
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> new file mode 100644
>> index 0000000..b3a4f78
>> --- /dev/null
>> +++ b/hw/i386/intel_iommu.c
>> @@ -0,0 +1,1345 @@
>> +/*
>> + * QEMU emulation of an Intel IOMMU (VT-d)
>> + *   (DMA Remapping device)
>> + *
>> + * Copyright (C) 2013 Knut Omang, Oracle <knut.omang@oracle.com>
>> + * Copyright (C) 2014 Le Tan, <tamlokveer@gmail.com>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License, or
>> + * (at your option) any later version.
>> +
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> +
>> + * You should have received a copy of the GNU General Public License along
>> + * with this program; if not, see <http://www.gnu.org/licenses/>.
>> + */
>> +
>> +#include "hw/sysbus.h"
>> +#include "exec/address-spaces.h"
>> +#include "intel_iommu_internal.h"
>> +
>> +
>> +/*#define DEBUG_INTEL_IOMMU*/
>> +#ifdef DEBUG_INTEL_IOMMU
>> +enum {
>> +    DEBUG_GENERAL, DEBUG_CSR, DEBUG_INV, DEBUG_MMU, DEBUG_FLOG,
>> +};
>> +#define VTD_DBGBIT(x)   (1 << DEBUG_##x)
>> +static int vtd_dbgflags = VTD_DBGBIT(GENERAL) | VTD_DBGBIT(CSR) |
>> +                          VTD_DBGBIT(FLOG);
>> +
>> +#define VTD_DPRINTF(what, fmt, ...) do { \
>> +    if (vtd_dbgflags & VTD_DBGBIT(what)) { \
>> +        fprintf(stderr, "(vtd)%s: " fmt "\n", __func__, \
>> +                ## __VA_ARGS__); } \
>> +    } while (0)
>> +#else
>> +#define VTD_DPRINTF(what, fmt, ...) do {} while (0)
>> +#endif
>> +
>> +static inline void define_quad(IntelIOMMUState *s, hwaddr addr, uint64_t val,
>> +                               uint64_t wmask, uint64_t w1cmask)
>
> In general, don't declare functions inline needlessly. It makes sense
> for trivial ones you export via a header, but even a 2- or 3-liner like
> this can be more efficient as stand-alone function. Anything bigger
> definitely does not deserve that tag.
>
> Inline is just a hint to the compiler anyway, so you can perfectly leave
> it out for any almost-trivial static function.

OK, I will delete most of them except those that have just one "return" line.
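E.g. this one would then simply become (sketch):

    static void define_quad(IntelIOMMUState *s, hwaddr addr, uint64_t val,
                            uint64_t wmask, uint64_t w1cmask)
    {
        stq_le_p(&s->csr[addr], val);
        stq_le_p(&s->wmask[addr], wmask);
        stq_le_p(&s->w1cmask[addr], w1cmask);
    }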

>> +{
>> +    stq_le_p(&s->csr[addr], val);
>> +    stq_le_p(&s->wmask[addr], wmask);
>> +    stq_le_p(&s->w1cmask[addr], w1cmask);
>> +}
>> +
>> +static inline void define_quad_wo(IntelIOMMUState *s, hwaddr addr,
>> +                                  uint64_t mask)
>> +{
>> +    stq_le_p(&s->womask[addr], mask);
>> +}
>> +
>> +static inline void define_long(IntelIOMMUState *s, hwaddr addr, uint32_t val,
>> +                               uint32_t wmask, uint32_t w1cmask)
>> +{
>> +    stl_le_p(&s->csr[addr], val);
>> +    stl_le_p(&s->wmask[addr], wmask);
>> +    stl_le_p(&s->w1cmask[addr], w1cmask);
>> +}
>> +
>> +static inline void define_long_wo(IntelIOMMUState *s, hwaddr addr,
>> +                                  uint32_t mask)
>> +{
>> +    stl_le_p(&s->womask[addr], mask);
>> +}
>> +
>> +/* "External" get/set operations */
>> +static inline void set_quad(IntelIOMMUState *s, hwaddr addr, uint64_t val)
>> +{
>> +    uint64_t oldval = ldq_le_p(&s->csr[addr]);
>> +    uint64_t wmask = ldq_le_p(&s->wmask[addr]);
>> +    uint64_t w1cmask = ldq_le_p(&s->w1cmask[addr]);
>> +    stq_le_p(&s->csr[addr],
>> +             ((oldval & ~wmask) | (val & wmask)) & ~(w1cmask & val));
>> +}
>> +
>> +static inline void set_long(IntelIOMMUState *s, hwaddr addr, uint32_t val)
>> +{
>> +    uint32_t oldval = ldl_le_p(&s->csr[addr]);
>> +    uint32_t wmask = ldl_le_p(&s->wmask[addr]);
>> +    uint32_t w1cmask = ldl_le_p(&s->w1cmask[addr]);
>> +    stl_le_p(&s->csr[addr],
>> +             ((oldval & ~wmask) | (val & wmask)) & ~(w1cmask & val));
>> +}
>> +
>> +static inline uint64_t get_quad(IntelIOMMUState *s, hwaddr addr)
>> +{
>> +    uint64_t val = ldq_le_p(&s->csr[addr]);
>> +    uint64_t womask = ldq_le_p(&s->womask[addr]);
>> +    return val & ~womask;
>> +}
>> +
>> +
>> +static inline uint32_t get_long(IntelIOMMUState *s, hwaddr addr)
>> +{
>> +    uint32_t val = ldl_le_p(&s->csr[addr]);
>> +    uint32_t womask = ldl_le_p(&s->womask[addr]);
>> +    return val & ~womask;
>> +}
>> +
>> +/* "Internal" get/set operations */
>> +static inline uint64_t get_quad_raw(IntelIOMMUState *s, hwaddr addr)
>> +{
>> +    return ldq_le_p(&s->csr[addr]);
>> +}
>> +
>> +static inline uint32_t get_long_raw(IntelIOMMUState *s, hwaddr addr)
>> +{
>> +    return ldl_le_p(&s->csr[addr]);
>> +}
>> +
>> +static inline void set_quad_raw(IntelIOMMUState *s, hwaddr addr, uint64_t val)
>> +{
>> +    stq_le_p(&s->csr[addr], val);
>> +}
>> +
>> +static inline uint32_t set_clear_mask_long(IntelIOMMUState *s, hwaddr addr,
>> +                                           uint32_t clear, uint32_t mask)
>> +{
>> +    uint32_t new_val = (ldl_le_p(&s->csr[addr]) & ~clear) | mask;
>> +    stl_le_p(&s->csr[addr], new_val);
>> +    return new_val;
>> +}
>> +
>> +static inline uint64_t set_clear_mask_quad(IntelIOMMUState *s, hwaddr addr,
>> +                                           uint64_t clear, uint64_t mask)
>> +{
>> +    uint64_t new_val = (ldq_le_p(&s->csr[addr]) & ~clear) | mask;
>> +    stq_le_p(&s->csr[addr], new_val);
>> +    return new_val;
>> +}
>> +
>> +/* Given the reg addr of both the message data and address, generate an
>> + * interrupt via MSI.
>> + */
>> +static void vtd_generate_interrupt(IntelIOMMUState *s, hwaddr mesg_addr_reg,
>> +                                   hwaddr mesg_data_reg)
>> +{
>> +    hwaddr addr;
>> +    uint32_t data;
>> +
>> +    assert(mesg_data_reg < DMAR_REG_SIZE);
>> +    assert(mesg_addr_reg < DMAR_REG_SIZE);
>> +
>> +    addr = get_long_raw(s, mesg_addr_reg);
>> +    data = get_long_raw(s, mesg_data_reg);
>> +
>> +    VTD_DPRINTF(FLOG, "msi: addr 0x%"PRIx64 " data 0x%"PRIx32, addr, data);
>> +    stl_le_phys(&address_space_memory, addr, data);
>> +}
>> +
>> +/* Generate a fault event to software via MSI if conditions are met.
>> + * Notice that the value of FSTS_REG being passed to it should be the one
>> + * before any update.
>> + */
>> +static void vtd_generate_fault_event(IntelIOMMUState *s, uint32_t pre_fsts)
>> +{
>> +    /* Check if there are any previously reported interrupt conditions */
>> +    if (pre_fsts & VTD_FSTS_PPF || pre_fsts & VTD_FSTS_PFO ||
>> +        pre_fsts & VTD_FSTS_IQE) {
>> +        VTD_DPRINTF(FLOG, "there are previous interrupt conditions "
>> +                    "to be serviced by software, fault event is not generated "
>> +                    "(FSTS_REG 0x%"PRIx32 ")", pre_fsts);
>> +        return;
>> +    }
>> +    set_clear_mask_long(s, DMAR_FECTL_REG, 0, VTD_FECTL_IP);
>> +    if (get_long_raw(s, DMAR_FECTL_REG) & VTD_FECTL_IM) {
>> +        /* Interrupt Mask */
>> +        VTD_DPRINTF(FLOG, "Interrupt Mask set, fault event is not generated");
>> +    } else {
>> +        /* generate interrupt */
>> +        vtd_generate_interrupt(s, DMAR_FEADDR_REG, DMAR_FEDATA_REG);
>> +        set_clear_mask_long(s, DMAR_FECTL_REG, VTD_FECTL_IP, 0);
>> +    }
>> +}
>> +
>> +/* Check if the Fault (F) field of the Fault Recording Register referenced by
>> + * @index is Set.
>> + */
>> +static inline bool is_frcd_set(IntelIOMMUState *s, uint16_t index)
>> +{
>> +    /* Each reg is 128-bit */
>> +    hwaddr addr = DMAR_FRCD_REG_OFFSET + (((uint64_t)index) << 4);
>> +    addr += 8; /* Access the high 64-bit half */
>> +
>> +    assert(index < DMAR_FRCD_REG_NR);
>> +
>> +    return get_quad_raw(s, addr) & VTD_FRCD_F;
>> +}
>> +
>> +/* Update the PPF field of Fault Status Register.
>> + * Should be called whenever the F field of any fault recording register
>> + * is changed.
>> + */
>> +static inline void update_fsts_ppf(IntelIOMMUState *s)
>> +{
>> +    uint32_t i;
>> +    uint32_t ppf_mask = 0;
>> +
>> +    for (i = 0; i < DMAR_FRCD_REG_NR; i++) {
>> +        if (is_frcd_set(s, i)) {
>> +            ppf_mask = VTD_FSTS_PPF;
>> +            break;
>> +        }
>> +    }
>> +    set_clear_mask_long(s, DMAR_FSTS_REG, VTD_FSTS_PPF, ppf_mask);
>> +    VTD_DPRINTF(FLOG, "set PPF of FSTS_REG to %d", ppf_mask ? 1 : 0);
>> +}
>> +
>> +static inline void set_frcd_and_update_ppf(IntelIOMMUState *s, uint16_t index)
>> +{
>> +    /* Each reg is 128-bit */
>> +    hwaddr addr = DMAR_FRCD_REG_OFFSET + (((uint64_t)index) << 4);
>> +    addr += 8; /* Access the high 64-bit half */
>> +
>> +    assert(index < DMAR_FRCD_REG_NR);
>> +
>> +    set_clear_mask_quad(s, addr, 0, VTD_FRCD_F);
>> +    update_fsts_ppf(s);
>> +}
>> +
>> +/* Must not update F field now, should be done later */
>> +static void record_frcd(IntelIOMMUState *s, uint16_t index, uint16_t source_id,
>> +                        hwaddr addr, VTDFaultReason fault, bool is_write)
>> +{
>> +    uint64_t hi = 0, lo;
>> +    hwaddr frcd_reg_addr = DMAR_FRCD_REG_OFFSET + (((uint64_t)index) << 4);
>> +
>> +    assert(index < DMAR_FRCD_REG_NR);
>> +
>> +    lo = VTD_FRCD_FI(addr);
>> +    hi = VTD_FRCD_SID(source_id) | VTD_FRCD_FR(fault);
>> +    if (!is_write) {
>> +        hi |= VTD_FRCD_T;
>> +    }
>> +
>> +    set_quad_raw(s, frcd_reg_addr, lo);
>> +    set_quad_raw(s, frcd_reg_addr + 8, hi);
>> +    VTD_DPRINTF(FLOG, "record to FRCD_REG #%"PRIu16 ": hi 0x%"PRIx64
>> +                ", lo 0x%"PRIx64, index, hi, lo);
>> +}
>> +
>> +/* Try to collapse multiple pending faults from the same requester */
>> +static inline bool try_collapse_fault(IntelIOMMUState *s, uint16_t source_id)
>> +{
>> +    uint32_t i;
>> +    uint64_t frcd_reg;
>> +    hwaddr addr = DMAR_FRCD_REG_OFFSET + 8; /* The high 64-bit half */
>> +
>> +    for (i = 0; i < DMAR_FRCD_REG_NR; i++) {
>> +        frcd_reg = get_quad_raw(s, addr);
>> +        VTD_DPRINTF(FLOG, "frcd_reg #%d 0x%"PRIx64, i, frcd_reg);
>> +        if ((frcd_reg & VTD_FRCD_F) &&
>> +            ((frcd_reg & VTD_FRCD_SID_MASK) == source_id)) {
>> +            return true;
>> +        }
>> +        addr += 16; /* 128-bit for each */
>> +    }
>> +
>> +    return false;
>> +}
>> +
>> +/* Log and report a DMAR (address translation) fault to software */
>> +static void vtd_report_dmar_fault(IntelIOMMUState *s, uint16_t source_id,
>> +                                  hwaddr addr, VTDFaultReason fault,
>> +                                  bool is_write)
>> +{
>> +    uint32_t fsts_reg = get_long_raw(s, DMAR_FSTS_REG);
>> +
>> +    assert(fault < VTD_FR_MAX);
>> +
>> +    if (fault == VTD_FR_RESERVED_ERR) {
>> +        /* This is not a normal fault reason case. Drop it. */
>> +        return;
>> +    }
>> +
>> +    VTD_DPRINTF(FLOG, "sid 0x%"PRIx16 ", fault %d, addr 0x%"PRIx64
>> +                ", is_write %d", source_id, fault, addr, is_write);
>> +
>> +    /* Check PFO field in FSTS_REG */
>> +    if (fsts_reg & VTD_FSTS_PFO) {
>> +        VTD_DPRINTF(FLOG, "new fault is not recorded due to "
>> +                    "Primary Fault Overflow");
>> +        return;
>> +    }
>> +
>> +    /* Compression of multiple faults from the same requester */
>> +    if (try_collapse_fault(s, source_id)) {
>> +        VTD_DPRINTF(FLOG, "new fault is not recorded due to "
>> +                    "compression of faults");
>> +        return;
>> +    }
>> +
>> +    /* Check next_frcd_reg to see whether we have an overflow now */
>> +    if (is_frcd_set(s, s->next_frcd_reg)) {
>> +        VTD_DPRINTF(FLOG, "Primary Fault Overflow and "
>> +                    "new fault is not recorded, set PFO field");
>> +        set_clear_mask_long(s, DMAR_FSTS_REG, 0, VTD_FSTS_PFO);
>> +        return;
>> +    }
>> +
>> +    record_frcd(s, s->next_frcd_reg, source_id, addr, fault, is_write);
>> +
>> +    if (fsts_reg & VTD_FSTS_PPF) {
>> +        /* There are already one or more pending faults */
>> +        VTD_DPRINTF(FLOG, "there are pending faults already, "
>> +                    "fault event is not generated");
>> +        set_frcd_and_update_ppf(s, s->next_frcd_reg);
>> +        s->next_frcd_reg++;
>> +        if (s->next_frcd_reg == DMAR_FRCD_REG_NR) {
>> +            s->next_frcd_reg = 0;
>> +        }
>> +    } else {
>> +        set_clear_mask_long(s, DMAR_FSTS_REG, VTD_FSTS_FRI_MASK,
>> +                            VTD_FSTS_FRI(s->next_frcd_reg));
>> +        set_frcd_and_update_ppf(s, s->next_frcd_reg); /* It will also set PPF */
>> +        s->next_frcd_reg++;
>> +        if (s->next_frcd_reg == DMAR_FRCD_REG_NR) {
>> +            s->next_frcd_reg = 0;
>> +        }
>> +
>> +        /* This case actually causes the PPF to be Set.
>> +         * So generate a fault event (interrupt).
>> +         */
>> +        vtd_generate_fault_event(s, fsts_reg);
>> +    }
>> +}
>> +
>> +static inline bool root_entry_present(VTDRootEntry *root)
>> +{
>> +    return root->val & VTD_ROOT_ENTRY_P;
>> +}
>> +
>> +static int get_root_entry(IntelIOMMUState *s, uint32_t index, VTDRootEntry *re)
>> +{
>> +    dma_addr_t addr;
>> +
>> +    assert(index < VTD_ROOT_ENTRY_NR);
>> +
>> +    addr = s->root + index * sizeof(*re);
>> +
>> +    if (dma_memory_read(&address_space_memory, addr, re, sizeof(*re))) {
>> +        VTD_DPRINTF(GENERAL, "error: fail to access root-entry at 0x%"PRIx64
>> +                    " + %"PRIu32, s->root, index);
>> +        re->val = 0;
>> +        return -VTD_FR_ROOT_TABLE_INV;
>> +    }
>> +
>> +    re->val = le64_to_cpu(re->val);
>> +    return VTD_FR_RESERVED;
>
> This looks a bit weird, here and elsewhere: VTD_FR_RESERVED is a
> reserved error code in the VT-d specification, and it's 0. OK, but here
> the meaning of returning 0 is actually "everything went well". So either
> provide a constant that documents this or simply use 0 consistently to
> declare the absence of errors.

I think VTD_FR_RESERVED is reserved and VT-d will not use it, and it
happens to be zero, so I just used it to represent the absence of
errors. Maybe I will simply return 0 here instead.
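I.e. just (sketch):

    re->val = le64_to_cpu(re->val);
    return 0;

and the same 0 in the other places that currently return VTD_FR_RESERVED
on success.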

>> +}
>> +
>> +static inline bool context_entry_present(VTDContextEntry *context)
>> +{
>> +    return context->lo & VTD_CONTEXT_ENTRY_P;
>> +}
>> +
>> +static int get_context_entry_from_root(VTDRootEntry *root, uint32_t index,
>> +                                       VTDContextEntry *ce)
>> +{
>> +    dma_addr_t addr;
>> +
>> +    if (!root_entry_present(root)) {
>> +        ce->lo = 0;
>> +        ce->hi = 0;
>> +        VTD_DPRINTF(GENERAL, "error: root-entry is not present");
>> +        return -VTD_FR_ROOT_ENTRY_P;
>> +    }
>> +
>> +    assert(index < VTD_CONTEXT_ENTRY_NR);
>> +
>> +    addr = (root->val & VTD_ROOT_ENTRY_CTP) + index * sizeof(*ce);
>> +
>> +    if (dma_memory_read(&address_space_memory, addr, ce, sizeof(*ce))) {
>> +        VTD_DPRINTF(GENERAL, "error: fail to access context-entry at 0x%"PRIx64
>> +                    " + %"PRIu32,
>> +                    (uint64_t)(root->val & VTD_ROOT_ENTRY_CTP), index);
>> +        ce->lo = 0;
>> +        ce->hi = 0;
>> +        return -VTD_FR_CONTEXT_TABLE_INV;
>> +    }
>> +
>> +    ce->lo = le64_to_cpu(ce->lo);
>> +    ce->hi = le64_to_cpu(ce->hi);
>> +    return VTD_FR_RESERVED;
>> +}
>> +
>> +static inline dma_addr_t get_slpt_base_from_context(VTDContextEntry *ce)
>> +{
>> +    return ce->lo & VTD_CONTEXT_ENTRY_SLPTPTR;
>> +}
>> +
>> +/* The shift of an addr for a certain level of paging structure */
>> +static inline uint32_t slpt_level_shift(uint32_t level)
>> +{
>> +    return VTD_PAGE_SHIFT_4K + (level - 1) * VTD_SL_LEVEL_BITS;
>> +}
>> +
>> +static inline uint64_t get_slpte_addr(uint64_t slpte)
>> +{
>> +    return slpte & VTD_SL_PT_BASE_ADDR_MASK;
>> +}
>> +
>> +/* Whether the pte indicates the address of the page frame */
>> +static inline bool is_last_slpte(uint64_t slpte, uint32_t level)
>> +{
>> +    return level == VTD_SL_PT_LEVEL || (slpte & VTD_SL_PT_PAGE_SIZE_MASK);
>> +}
>> +
>> +/* Get the content of an slpte located in @base_addr[@index] */
>> +static inline uint64_t get_slpte(dma_addr_t base_addr, uint32_t index)
>> +{
>> +    uint64_t slpte;
>> +
>> +    assert(index < VTD_SL_PT_ENTRY_NR);
>> +
>> +    if (dma_memory_read(&address_space_memory,
>> +                        base_addr + index * sizeof(slpte), &slpte,
>> +                        sizeof(slpte))) {
>> +        slpte = (uint64_t)-1;
>> +        return slpte;
>> +    }
>> +
>> +    slpte = le64_to_cpu(slpte);
>> +    return slpte;
>> +}
>> +
>> +/* Given a gpa and the level of paging structure, return the offset of current
>> + * level.
>> + */
>> +static inline uint32_t gpa_level_offset(uint64_t gpa, uint32_t level)
>> +{
>> +    return (gpa >> slpt_level_shift(level)) & ((1ULL << VTD_SL_LEVEL_BITS) - 1);
>> +}
>> +
>> +/* Check Capability Register to see if the @level of page-table is supported */
>> +static inline bool is_level_supported(IntelIOMMUState *s, uint32_t level)
>> +{
>> +    return VTD_CAP_SAGAW_MASK & s->cap &
>> +           (1ULL << (level - 2 + VTD_CAP_SAGAW_SHIFT));
>> +}
>> +
>> +/* Get the page-table level that hardware should use for the second-level
>> + * page-table walk from the Address Width field of context-entry.
>> + */
>> +static inline uint32_t get_level_from_context_entry(VTDContextEntry *ce)
>> +{
>> +    return 2 + (ce->hi & VTD_CONTEXT_ENTRY_AW);
>> +}
>> +
>> +static inline uint32_t get_agaw_from_context_entry(VTDContextEntry *ce)
>> +{
>> +    return 30 + (ce->hi & VTD_CONTEXT_ENTRY_AW) * 9;
>> +}
>> +
>> +static const uint64_t paging_entry_rsvd_field[] = {
>> +    [0] = ~0ULL,
>> +    /* For non-large pages */
>> +    [1] = 0x800ULL | ~(VTD_HAW_MASK | VTD_SL_IGN_COM),
>> +    [2] = 0x800ULL | ~(VTD_HAW_MASK | VTD_SL_IGN_COM),
>> +    [3] = 0x800ULL | ~(VTD_HAW_MASK | VTD_SL_IGN_COM),
>> +    [4] = 0x880ULL | ~(VTD_HAW_MASK | VTD_SL_IGN_COM),
>> +    /* For large page */
>> +    [5] = 0x800ULL | ~(VTD_HAW_MASK | VTD_SL_IGN_COM),
>> +    [6] = 0x1ff800ULL | ~(VTD_HAW_MASK | VTD_SL_IGN_COM),
>> +    [7] = 0x3ffff800ULL | ~(VTD_HAW_MASK | VTD_SL_IGN_COM),
>> +    [8] = 0x880ULL | ~(VTD_HAW_MASK | VTD_SL_IGN_COM),
>> +};
>> +
>> +static inline bool slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
>> +{
>> +    if (slpte & VTD_SL_PT_PAGE_SIZE_MASK) {
>> +        /* Maybe large page */
>> +        return slpte & paging_entry_rsvd_field[level + 4];
>> +    } else {
>> +        return slpte & paging_entry_rsvd_field[level];
>> +    }
>> +}
>> +
>> +/* Given the @gpa, get the relevant @slptep. @slpte_level will be the last
>> + * level of the translation, which can be used for deciding the size of a
>> + * large page. @slptep and @slpte_level will not be touched if an error
>> + * happens.
>> + */
>> +static int gpa_to_slpte(VTDContextEntry *ce, uint64_t gpa, bool is_write,
>> +                        uint64_t *slptep, uint32_t *slpte_level)
>> +{
>> +    dma_addr_t addr = get_slpt_base_from_context(ce);
>> +    uint32_t level = get_level_from_context_entry(ce);
>> +    uint32_t offset;
>> +    uint64_t slpte;
>> +    uint32_t ce_agaw = get_agaw_from_context_entry(ce);
>> +    uint64_t access_right_check;
>> +
>> +    /* Check if @gpa is above 2^X-1, where X is the minimum of MGAW in CAP_REG
>> +     * and AW in context-entry.
>> +     */
>> +    if (gpa & ~((1ULL << MIN(ce_agaw, VTD_MGAW)) - 1)) {
>> +        VTD_DPRINTF(GENERAL, "error: gpa 0x%"PRIx64 " exceeds limits", gpa);
>> +        return -VTD_FR_ADDR_BEYOND_MGAW;
>> +    }
>> +
>> +    /* FIXME: what is the Atomics request here? */
>> +    access_right_check = is_write ? VTD_SL_W : VTD_SL_R;
>> +
>> +    while (true) {
>> +        offset = gpa_level_offset(gpa, level);
>> +        slpte = get_slpte(addr, offset);
>> +
>> +        if (slpte == (uint64_t)-1) {
>> +            VTD_DPRINTF(GENERAL, "error: fail to access second-level paging "
>> +                        "entry at level %"PRIu32 " for gpa 0x%"PRIx64,
>> +                        level, gpa);
>> +            if (level == get_level_from_context_entry(ce)) {
>> +                /* Invalid programming of context-entry */
>> +                return -VTD_FR_CONTEXT_ENTRY_INV;
>> +            } else {
>> +                return -VTD_FR_PAGING_ENTRY_INV;
>> +            }
>> +        }
>> +        if (!(slpte & access_right_check)) {
>> +            VTD_DPRINTF(GENERAL, "error: lack of %s permission for "
>> +                        "gpa 0x%"PRIx64 " slpte 0x%"PRIx64,
>> +                        (is_write ? "write" : "read"), gpa, slpte);
>> +            return is_write ? -VTD_FR_WRITE : -VTD_FR_READ;
>> +        }
>> +        if (slpte_nonzero_rsvd(slpte, level)) {
>> +            VTD_DPRINTF(GENERAL, "error: non-zero reserved field in second "
>> +                        "level paging entry level %"PRIu32 " slpte 0x%"PRIx64,
>> +                        level, slpte);
>> +            return -VTD_FR_PAGING_ENTRY_RSVD;
>> +        }
>> +
>> +        if (is_last_slpte(slpte, level)) {
>> +            *slptep = slpte;
>> +            *slpte_level = level;
>> +            return VTD_FR_RESERVED;
>> +        }
>> +        addr = get_slpte_addr(slpte);
>> +        level--;
>> +    }
>> +}
>> +
>> +/* Map a device to its corresponding domain (context-entry). @ce will be set
>> + * to zero if an error happens while accessing the context-entry.
>> + */
>> +static inline int dev_to_context_entry(IntelIOMMUState *s, int bus_num,
>> +                                       int devfn, VTDContextEntry *ce)
>> +{
>> +    VTDRootEntry re;
>> +    int ret_fr;
>> +
>> +    assert(0 <= bus_num && bus_num < VTD_PCI_BUS_MAX);
>> +    assert(0 <= devfn && devfn < VTD_PCI_SLOT_MAX * VTD_PCI_FUNC_MAX);
>
> Use the proper types for bus_num and devfn, and you can get rid of
> these assertions: uint8_t. I know that the PCI layer improperly uses int
> for them in many places, but you don't need to copy this.

OK. I will convert them to uint8_t in iommu_translate().
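I.e. roughly (sketch):

    static int dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
                                    uint8_t devfn, VTDContextEntry *ce)

with the same uint8_t types used in iommu_translate(), so both assertions
can simply go away.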

>> +
>> +    ret_fr = get_root_entry(s, bus_num, &re);
>> +    if (ret_fr) {
>> +        ce->hi = 0;
>> +        ce->lo = 0;
>
> That's a bit too defensive programming: The context entry is simply
> invalid when such a function returns an error, no? You can document that
> in the function description.

Ah, yes, it is needless. I made a mistake before. I thought I needed to
evaluate the FPD field in the context entry, so to avoid using an
arbitrary value, I set it to 0 when an error happens. But it is needless
because is_qualified_fault(ret_fr) will fail when the context entry is
undefined (it is surely caused by some non-qualified fault). Thanks!
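So the call site can simply become (sketch):

    ret_fr = get_root_entry(s, bus_num, &re);
    if (ret_fr) {
        return ret_fr;
    }

plus a note in the function description that *ce is undefined when an
error is returned.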

>> +        return ret_fr;
>> +    }
>> +
>> +    if (!root_entry_present(&re)) {
>> +        VTD_DPRINTF(GENERAL, "error: root-entry #%d is not present", bus_num);
>> +        ce->hi = 0;
>> +        ce->lo = 0;
>> +        return -VTD_FR_ROOT_ENTRY_P;
>> +    } else if (re.rsvd || (re.val & VTD_ROOT_ENTRY_RSVD)) {
>> +        VTD_DPRINTF(GENERAL, "error: non-zero reserved field in root-entry "
>> +                    "hi 0x%"PRIx64 " lo 0x%"PRIx64, re.rsvd, re.val);
>> +        ce->hi = 0;
>> +        ce->lo = 0;
>> +        return -VTD_FR_ROOT_ENTRY_RSVD;
>> +    }
>> +
>> +    ret_fr = get_context_entry_from_root(&re, devfn, ce);
>> +    if (ret_fr) {
>> +        return ret_fr;
>> +    }
>> +
>> +    if (!context_entry_present(ce)) {
>> +        VTD_DPRINTF(GENERAL,
>> +                    "error: context-entry #%d(bus #%d) is not present", devfn,
>> +                    bus_num);
>> +        return -VTD_FR_CONTEXT_ENTRY_P;
>> +    } else if ((ce->hi & VTD_CONTEXT_ENTRY_RSVD_HI) ||
>> +               (ce->lo & VTD_CONTEXT_ENTRY_RSVD_LO)) {
>> +        VTD_DPRINTF(GENERAL,
>> +                    "error: non-zero reserved field in context-entry "
>> +                    "hi 0x%"PRIx64 " lo 0x%"PRIx64, ce->hi, ce->lo);
>> +        return -VTD_FR_CONTEXT_ENTRY_RSVD;
>> +    }
>> +
>> +    /* Check if the programming of context-entry is valid */
>> +    if (!is_level_supported(s, get_level_from_context_entry(ce))) {
>> +        VTD_DPRINTF(GENERAL, "error: unsupported Address Width value in "
>> +                    "context-entry hi 0x%"PRIx64 " lo 0x%"PRIx64,
>> +                    ce->hi, ce->lo);
>> +        return -VTD_FR_CONTEXT_ENTRY_INV;
>> +    } else if (ce->lo & VTD_CONTEXT_ENTRY_TT) {
>> +        VTD_DPRINTF(GENERAL, "error: unsupported Translation Type in "
>> +                    "context-entry hi 0x%"PRIx64 " lo 0x%"PRIx64,
>> +                    ce->hi, ce->lo);
>> +        return -VTD_FR_CONTEXT_ENTRY_INV;
>> +    }
>> +
>> +    return VTD_FR_RESERVED;
>> +}
>> +
>> +static inline uint16_t make_source_id(int bus_num, int devfn)
>> +{
>> +    return ((bus_num & 0xffUL) << 8) | (devfn & 0xffUL);
>> +}
>> +
>> +static const bool qualified_faults[] = {
>> +    [VTD_FR_RESERVED] = false,
>> +    [VTD_FR_ROOT_ENTRY_P] = false,
>> +    [VTD_FR_CONTEXT_ENTRY_P] = true,
>> +    [VTD_FR_CONTEXT_ENTRY_INV] = true,
>> +    [VTD_FR_ADDR_BEYOND_MGAW] = true,
>> +    [VTD_FR_WRITE] = true,
>> +    [VTD_FR_READ] = true,
>> +    [VTD_FR_PAGING_ENTRY_INV] = true,
>> +    [VTD_FR_ROOT_TABLE_INV] = false,
>> +    [VTD_FR_CONTEXT_TABLE_INV] = false,
>> +    [VTD_FR_ROOT_ENTRY_RSVD] = false,
>> +    [VTD_FR_PAGING_ENTRY_RSVD] = true,
>> +    [VTD_FR_CONTEXT_ENTRY_TT] = true,
>> +    [VTD_FR_RESERVED_ERR] = false,
>> +    [VTD_FR_MAX] = false,
>> +};
>> +
>> +/* To see if a fault condition is "qualified", which is reported to software
>> + * only if the FPD field in the context-entry used to process the faulting
>> + * request is 0.
>> + */
>> +static inline bool is_qualified_fault(VTDFaultReason fault)
>> +{
>> +    return qualified_faults[fault];
>> +}
>> +
>> +static inline bool is_interrupt_addr(hwaddr addr)
>> +{
>> +    return VTD_INTERRUPT_ADDR_FIRST <= addr && addr <= VTD_INTERRUPT_ADDR_LAST;
>> +}
>> +
>> +/* Map dev to context-entry then do a paging-structures walk to do an iommu
>> + * translation.
>> + * @bus_num: The bus number
>> + * @devfn: The devfn, which is the combined device and function number
>> + * @is_write: The access is a write operation
>> + * @entry: IOMMUTLBEntry that contains the addr to be translated and result
>> + */
>> +static void iommu_translate(IntelIOMMUState *s, int bus_num, int devfn,
>> +                            hwaddr addr, bool is_write, IOMMUTLBEntry *entry)
>> +{
>> +    VTDContextEntry ce;
>> +    uint64_t slpte;
>> +    uint32_t level;
>> +    uint64_t page_mask;
>> +    uint16_t source_id = make_source_id(bus_num, devfn);
>> +    int ret_fr;
>> +    bool is_fpd_set = false;
>> +
>> +    /* Check if the request is in interrupt address range */
>> +    if (is_interrupt_addr(addr)) {
>> +        if (is_write) {
>> +            /* FIXME: since we don't know the length of the access here, we
>> +             * treat Non-DWORD length write requests without PASID as
>> +             * interrupt requests, too. Without interrupt remapping support,
>> +             * we just use 1:1 mapping.
>> +             */
>> +            VTD_DPRINTF(MMU, "write request to interrupt address "
>> +                        "gpa 0x%"PRIx64, addr);
>> +            entry->iova = addr & VTD_PAGE_MASK_4K;
>> +            entry->translated_addr = addr & VTD_PAGE_MASK_4K;
>> +            entry->addr_mask = ~VTD_PAGE_MASK_4K;
>> +            entry->perm = IOMMU_WO;
>> +            return;
>> +        } else {
>> +            VTD_DPRINTF(GENERAL, "error: read request from interrupt address "
>> +                        "gpa 0x%"PRIx64, addr);
>> +            vtd_report_dmar_fault(s, source_id, addr, VTD_FR_READ, is_write);
>> +            return;
>> +        }
>> +    }
>> +
>> +    ret_fr = dev_to_context_entry(s, bus_num, devfn, &ce);
>> +    is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
>> +    if (ret_fr) {
>> +        ret_fr = -ret_fr;
>> +        if (is_fpd_set && is_qualified_fault(ret_fr)) {
>> +            VTD_DPRINTF(FLOG, "fault processing is disabled for DMA requests "
>> +                        "through this context-entry (with FPD Set)");
>> +        } else {
>> +            vtd_report_dmar_fault(s, source_id, addr, ret_fr, is_write);
>> +        }
>> +        return;
>> +    }
>> +
>> +    ret_fr = gpa_to_slpte(&ce, addr, is_write, &slpte, &level);
>> +    if (ret_fr) {
>> +        ret_fr = -ret_fr;
>> +        if (is_fpd_set && is_qualified_fault(ret_fr)) {
>> +            VTD_DPRINTF(FLOG, "fault processing is disabled for DMA requests "
>> +                        "through this context-entry (with FPD Set)");
>> +        } else {
>> +            vtd_report_dmar_fault(s, source_id, addr, ret_fr, is_write);
>> +        }
>> +        return;
>> +    }
>> +
>> +    if (level == VTD_SL_PT_LEVEL) {
>> +        /* 4-KB page */
>> +        page_mask = VTD_PAGE_MASK_4K;
>> +    } else if (level == VTD_SL_PDP_LEVEL) {
>> +        /* 1-GB page */
>> +        page_mask = VTD_PAGE_MASK_1G;
>> +    } else {
>> +        /* 2-MB page */
>> +        page_mask = VTD_PAGE_MASK_2M;
>> +    }
>
> You don't declare 1G and 2M pages as supported in caps.sllps, do you?
> I'm wondering if we should have some device property for intel-iommu
> that enables all available features, even if our emulated chipset never
> supported them (I guess Q35 had no support - my younger QM57 does not
> have it either). Then you could do "-global intel-iommu.full_featured=on"
> and have all those nice things available.

Oh, I forgot to report this in the Capability Register. Maybe I can
add the device property later.
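A minimal sketch of such a property (the "full_featured" name and the
field are just placeholders):

    static Property iommu_properties[] = {
        DEFINE_PROP_UINT32("version", IntelIOMMUState, version, 0),
        DEFINE_PROP_BOOL("full_featured", IntelIOMMUState, full_featured,
                         false),
        DEFINE_PROP_END_OF_LIST(),
    };

and then report 2M/1G page support in the SLLPS field of CAP_REG only
when it is set.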

>> +
>> +    entry->iova = addr & page_mask;
>> +    entry->translated_addr = get_slpte_addr(slpte) & page_mask;
>> +    entry->addr_mask = ~page_mask;
>> +    entry->perm = slpte & VTD_SL_RW_MASK;
>> +}
>> +
>> +static void vtd_root_table_setup(IntelIOMMUState *s)
>> +{
>> +    s->root = get_quad_raw(s, DMAR_RTADDR_REG);
>> +    s->root_extended = s->root & VTD_RTADDR_RTT;
>> +    s->root &= VTD_RTADDR_ADDR_MASK;
>> +
>> +    VTD_DPRINTF(CSR, "root_table addr 0x%"PRIx64 " %s", s->root,
>> +                (s->root_extended ? "(extended)" : ""));
>> +}
>> +
>> +/* Context-cache invalidation
>> + * Returns the Context Actual Invalidation Granularity.
>> + * @val: the content of the CCMD_REG
>> + */
>> +static uint64_t vtd_context_cache_invalidate(IntelIOMMUState *s, uint64_t val)
>> +{
>> +    uint64_t caig;
>> +    uint64_t type = val & VTD_CCMD_CIRG_MASK;
>> +
>> +    switch (type) {
>> +    case VTD_CCMD_GLOBAL_INVL:
>> +        VTD_DPRINTF(INV, "Global invalidation request");
>> +        caig = VTD_CCMD_GLOBAL_INVL_A;
>> +        break;
>> +
>> +    case VTD_CCMD_DOMAIN_INVL:
>> +        VTD_DPRINTF(INV, "Domain-selective invalidation request");
>> +        caig = VTD_CCMD_DOMAIN_INVL_A;
>> +        break;
>> +
>> +    case VTD_CCMD_DEVICE_INVL:
>> +        VTD_DPRINTF(INV, "Domain-selective invalidation request");
>> +        caig = VTD_CCMD_DEVICE_INVL_A;
>> +        break;
>> +
>> +    default:
>> +        VTD_DPRINTF(GENERAL,
>> +                    "error: wrong context-cache invalidation granularity");
>> +        caig = 0;
>> +    }
>> +
>> +    return caig;
>> +}
>> +
>> +/* Flush IOTLB
>> + * Returns the IOTLB Actual Invalidation Granularity.
>> + * @val: the content of the IOTLB_REG
>> + */
>> +static uint64_t vtd_iotlb_flush(IntelIOMMUState *s, uint64_t val)
>> +{
>> +    uint64_t iaig;
>> +    uint64_t type = val & VTD_TLB_FLUSH_GRANU_MASK;
>> +
>> +    switch (type) {
>> +    case VTD_TLB_GLOBAL_FLUSH:
>> +        VTD_DPRINTF(INV, "Global IOTLB flush");
>> +        iaig = VTD_TLB_GLOBAL_FLUSH_A;
>> +        break;
>> +
>> +    case VTD_TLB_DSI_FLUSH:
>> +        VTD_DPRINTF(INV, "Domain-selective IOTLB flush");
>> +        iaig = VTD_TLB_DSI_FLUSH_A;
>> +        break;
>> +
>> +    case VTD_TLB_PSI_FLUSH:
>> +        VTD_DPRINTF(INV, "Page-selective-within-domain IOTLB flush");
>> +        iaig = VTD_TLB_PSI_FLUSH_A;
>> +        break;
>> +
>> +    default:
>> +        VTD_DPRINTF(GENERAL, "error: wrong iotlb flush granularity");
>> +        iaig = 0;
>> +    }
>> +
>> +    return iaig;
>> +}
>> +
>> +/* Set Root Table Pointer */
>> +static void handle_gcmd_srtp(IntelIOMMUState *s)
>> +{
>> +    VTD_DPRINTF(CSR, "set Root Table Pointer");
>> +
>> +    vtd_root_table_setup(s);
>> +    /* Ok - report back to driver */
>> +    set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_RTPS);
>> +}
>> +
>> +/* Handle Translation Enable/Disable */
>> +static void handle_gcmd_te(IntelIOMMUState *s, bool en)
>> +{
>> +    VTD_DPRINTF(CSR, "Translation Enable %s", (en ? "on" : "off"));
>> +
>> +    if (en) {
>> +        s->dmar_enabled = true;
>> +        /* Ok - report back to driver */
>> +        set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_TES);
>> +    } else {
>> +        s->dmar_enabled = false;
>> +
>> +        /* Clear the index of Fault Recording Register */
>> +        s->next_frcd_reg = 0;
>> +        /* Ok - report back to driver */
>> +        set_clear_mask_long(s, DMAR_GSTS_REG, VTD_GSTS_TES, 0);
>> +    }
>> +}
>> +
>> +/* Handle write to Global Command Register */
>> +static void handle_gcmd_write(IntelIOMMUState *s)
>> +{
>> +    uint32_t status = get_long_raw(s, DMAR_GSTS_REG);
>> +    uint32_t val = get_long_raw(s, DMAR_GCMD_REG);
>> +    uint32_t changed = status ^ val;
>> +
>> +    VTD_DPRINTF(CSR, "value 0x%"PRIx32 " status 0x%"PRIx32, val, status);
>> +    if (changed & VTD_GCMD_TE) {
>> +        /* Translation enable/disable */
>> +        handle_gcmd_te(s, val & VTD_GCMD_TE);
>> +    }
>> +    if (val & VTD_GCMD_SRTP) {
>> +        /* Set/update the root-table pointer */
>> +        handle_gcmd_srtp(s);
>> +    }
>> +}
>> +
>> +/* Handle write to Context Command Register */
>> +static void handle_ccmd_write(IntelIOMMUState *s)
>> +{
>> +    uint64_t ret;
>> +    uint64_t val = get_quad_raw(s, DMAR_CCMD_REG);
>> +
>> +    /* Context-cache invalidation request */
>> +    if (val & VTD_CCMD_ICC) {
>> +        ret = vtd_context_cache_invalidate(s, val);
>> +
>> +        /* Invalidation completed. Clear ICC and report the actual granularity */
>> +        set_clear_mask_quad(s, DMAR_CCMD_REG, VTD_CCMD_ICC, 0ULL);
>> +        ret = set_clear_mask_quad(s, DMAR_CCMD_REG, VTD_CCMD_CAIG_MASK, ret);
>> +        VTD_DPRINTF(INV, "CCMD_REG write-back val: 0x%"PRIx64, ret);
>> +    }
>> +}
>> +
>> +/* Handle write to IOTLB Invalidation Register */
>> +static void handle_iotlb_write(IntelIOMMUState *s)
>> +{
>> +    uint64_t ret;
>> +    uint64_t val = get_quad_raw(s, DMAR_IOTLB_REG);
>> +
>> +    /* IOTLB invalidation request */
>> +    if (val & VTD_TLB_IVT) {
>> +        ret = vtd_iotlb_flush(s, val);
>> +
>> +        /* Invalidation completed. Clear IVT and report the actual granularity */
>> +        set_clear_mask_quad(s, DMAR_IOTLB_REG, VTD_TLB_IVT, 0ULL);
>> +        ret = set_clear_mask_quad(s, DMAR_IOTLB_REG,
>> +                                  VTD_TLB_FLUSH_GRANU_MASK_A, ret);
>> +        VTD_DPRINTF(INV, "IOTLB_REG write-back val: 0x%"PRIx64, ret);
>> +    }
>> +}
>> +
>> +static inline void handle_fsts_write(IntelIOMMUState *s)
>> +{
>> +    uint32_t fsts_reg = get_long_raw(s, DMAR_FSTS_REG);
>> +    uint32_t fectl_reg = get_long_raw(s, DMAR_FECTL_REG);
>> +    uint32_t status_fields = VTD_FSTS_PFO | VTD_FSTS_PPF | VTD_FSTS_IQE;
>> +
>> +    if ((fectl_reg & VTD_FECTL_IP) && !(fsts_reg & status_fields)) {
>> +        set_clear_mask_long(s, DMAR_FECTL_REG, VTD_FECTL_IP, 0);
>> +        VTD_DPRINTF(FLOG, "all pending interrupt conditions serviced, clear "
>> +                    "IP field of FECTL_REG");
>> +    }
>> +}
>> +
>> +static inline void handle_fectl_write(IntelIOMMUState *s)
>> +{
>> +    uint32_t fectl_reg;
>> +    /* When software clears the IM field, check the IP field. But do we
>> +     * need to compare the old value and the new value to conclude that
>> +     * software clears the IM field? Or just check if the IM field is zero?
>> +     */
>> +    fectl_reg = get_long_raw(s, DMAR_FECTL_REG);
>> +    if ((fectl_reg & VTD_FECTL_IP) && !(fectl_reg & VTD_FECTL_IM)) {
>> +        vtd_generate_interrupt(s, DMAR_FEADDR_REG, DMAR_FEDATA_REG);
>> +        set_clear_mask_long(s, DMAR_FECTL_REG, VTD_FECTL_IP, 0);
>> +        VTD_DPRINTF(FLOG, "IM field is cleared, generate "
>> +                    "fault event interrupt");
>> +    }
>> +}
>> +
>> +static uint64_t vtd_mem_read(void *opaque, hwaddr addr, unsigned size)
>> +{
>> +    IntelIOMMUState *s = opaque;
>> +    uint64_t val;
>> +
>> +    if (addr + size > DMAR_REG_SIZE) {
>> +        VTD_DPRINTF(GENERAL, "error: addr outside region: max 0x%"PRIx64
>> +                    ", got 0x%"PRIx64 " %d",
>> +                    (uint64_t)DMAR_REG_SIZE, addr, size);
>> +        return (uint64_t)-1;
>> +    }
>> +
>> +    assert(size == 4 || size == 8);
>
> You already declare in the ops that you only support 4 and 8 byte
> accesses, no?

Yes, I added this to be safe because I didn't know the QEMU mechanisms
well. I will remove them soon. :)

>> +
>> +    switch (addr) {
>> +    /* Root Table Address Register, 64-bit */
>> +    case DMAR_RTADDR_REG:
>> +        if (size == 4) {
>> +            val = s->root & ((1ULL << 32) - 1);
>> +        } else {
>> +            val = s->root;
>> +        }
>> +        break;
>> +
>> +    case DMAR_RTADDR_REG_HI:
>> +        assert(size == 4);
>> +        val = s->root >> 32;
>> +        break;
>> +
>> +    default:
>> +        if (size == 4) {
>> +            val = get_long(s, addr);
>> +        } else {
>> +            val = get_quad(s, addr);
>> +        }
>> +    }
>> +
>> +    VTD_DPRINTF(CSR, "addr 0x%"PRIx64 " size %d val 0x%"PRIx64,
>> +                addr, size, val);
>> +    return val;
>> +}
>> +
>> +static void vtd_mem_write(void *opaque, hwaddr addr,
>> +                          uint64_t val, unsigned size)
>> +{
>> +    IntelIOMMUState *s = opaque;
>> +
>> +    if (addr + size > DMAR_REG_SIZE) {
>> +        VTD_DPRINTF(GENERAL, "error: addr outside region: max 0x%"PRIx64
>> +                    ", got 0x%"PRIx64 " %d",
>> +                    (uint64_t)DMAR_REG_SIZE, addr, size);
>> +        return;
>> +    }
>> +
>> +    assert(size == 4 || size == 8);
>> +
>> +    switch (addr) {
>> +    /* Global Command Register, 32-bit */
>> +    case DMAR_GCMD_REG:
>> +        VTD_DPRINTF(CSR, "DMAR_GCMD_REG write addr 0x%"PRIx64
>> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
>> +        set_long(s, addr, val);
>> +        handle_gcmd_write(s);
>> +        break;
>> +
>> +    /* Context Command Register, 64-bit */
>> +    case DMAR_CCMD_REG:
>> +        VTD_DPRINTF(CSR, "DMAR_CCMD_REG write addr 0x%"PRIx64
>> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
>> +        if (size == 4) {
>> +            set_long(s, addr, val);
>> +        } else {
>> +            set_quad(s, addr, val);
>> +            handle_ccmd_write(s);
>> +        }
>> +        break;
>> +
>> +    case DMAR_CCMD_REG_HI:
>> +        VTD_DPRINTF(CSR, "DMAR_CCMD_REG_HI write addr 0x%"PRIx64
>> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
>> +        assert(size == 4);
>> +        set_long(s, addr, val);
>> +        handle_ccmd_write(s);
>> +        break;
>> +
>> +
>> +    /* IOTLB Invalidation Register, 64-bit */
>> +    case DMAR_IOTLB_REG:
>> +        VTD_DPRINTF(INV, "DMAR_IOTLB_REG write addr 0x%"PRIx64
>> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
>> +        if (size == 4) {
>> +            set_long(s, addr, val);
>> +        } else {
>> +            set_quad(s, addr, val);
>> +            handle_iotlb_write(s);
>> +        }
>> +        break;
>> +
>> +    case DMAR_IOTLB_REG_HI:
>> +        VTD_DPRINTF(INV, "DMAR_IOTLB_REG_HI write addr 0x%"PRIx64
>> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
>> +        assert(size == 4);
>> +        set_long(s, addr, val);
>> +        handle_iotlb_write(s);
>> +        break;
>> +
>> +    /* Fault Status Register, 32-bit */
>> +    case DMAR_FSTS_REG:
>> +        VTD_DPRINTF(FLOG, "DMAR_FSTS_REG write addr 0x%"PRIx64
>> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
>> +        assert(size == 4);
>> +        set_long(s, addr, val);
>> +        handle_fsts_write(s);
>> +        break;
>> +
>> +    /* Fault Event Control Register, 32-bit */
>> +    case DMAR_FECTL_REG:
>> +        VTD_DPRINTF(FLOG, "DMAR_FECTL_REG write addr 0x%"PRIx64
>> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
>> +        assert(size == 4);
>> +        set_long(s, addr, val);
>> +        handle_fectl_write(s);
>> +        break;
>> +
>> +    /* Fault Event Data Register, 32-bit */
>> +    case DMAR_FEDATA_REG:
>> +        VTD_DPRINTF(FLOG, "DMAR_FEDATA_REG write addr 0x%"PRIx64
>> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
>> +        assert(size == 4);
>> +        set_long(s, addr, val);
>> +        break;
>> +
>> +    /* Fault Event Address Register, 32-bit */
>> +    case DMAR_FEADDR_REG:
>> +        VTD_DPRINTF(FLOG, "DMAR_FEADDR_REG write addr 0x%"PRIx64
>> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
>> +        assert(size == 4);
>> +        set_long(s, addr, val);
>> +        break;
>> +
>> +    /* Fault Event Upper Address Register, 32-bit */
>> +    case DMAR_FEUADDR_REG:
>> +        VTD_DPRINTF(FLOG, "DMAR_FEUADDR_REG write addr 0x%"PRIx64
>> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
>> +        assert(size == 4);
>> +        set_long(s, addr, val);
>> +        break;
>> +
>> +    /* Protected Memory Enable Register, 32-bit */
>> +    case DMAR_PMEN_REG:
>> +        VTD_DPRINTF(CSR, "DMAR_PMEN_REG write addr 0x%"PRIx64
>> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
>> +        assert(size == 4);
>> +        set_long(s, addr, val);
>> +        break;
>> +
>> +
>> +    /* Root Table Address Register, 64-bit */
>> +    case DMAR_RTADDR_REG:
>> +        VTD_DPRINTF(CSR, "DMAR_RTADDR_REG write addr 0x%"PRIx64
>> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
>> +        if (size == 4) {
>> +            set_long(s, addr, val);
>> +        } else {
>> +            set_quad(s, addr, val);
>> +        }
>> +        break;
>> +
>> +    case DMAR_RTADDR_REG_HI:
>> +        VTD_DPRINTF(CSR, "DMAR_RTADDR_REG_HI write addr 0x%"PRIx64
>> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
>> +        assert(size == 4);
>> +        set_long(s, addr, val);
>> +        break;
>> +
>> +    /* Fault Recording Registers, 128-bit */
>> +    case DMAR_FRCD_REG_0_0:
>> +        VTD_DPRINTF(FLOG, "DMAR_FRCD_REG_0_0 write addr 0x%"PRIx64
>> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
>> +        if (size == 4) {
>> +            set_long(s, addr, val);
>> +        } else {
>> +            set_quad(s, addr, val);
>> +        }
>> +        break;
>> +
>> +    case DMAR_FRCD_REG_0_1:
>> +        VTD_DPRINTF(FLOG, "DMAR_FRCD_REG_0_1 write addr 0x%"PRIx64
>> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
>> +        assert(size == 4);
>> +        set_long(s, addr, val);
>> +        break;
>> +
>> +    case DMAR_FRCD_REG_0_2:
>> +        VTD_DPRINTF(FLOG, "DMAR_FRCD_REG_0_2 write addr 0x%"PRIx64
>> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
>> +        if (size == 4) {
>> +            set_long(s, addr, val);
>> +        } else {
>> +            set_quad(s, addr, val);
>> +            /* May clear bit 127 (Fault), update PPF */
>> +            update_fsts_ppf(s);
>> +        }
>> +        break;
>> +
>> +    case DMAR_FRCD_REG_0_3:
>> +        VTD_DPRINTF(FLOG, "DMAR_FRCD_REG_0_3 write addr 0x%"PRIx64
>> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
>> +        assert(size == 4);
>> +        set_long(s, addr, val);
>> +        /* May clear bit 127 (Fault), update PPF */
>> +        update_fsts_ppf(s);
>> +        break;
>> +
>> +    default:
>> +        VTD_DPRINTF(GENERAL, "error: unhandled reg write addr 0x%"PRIx64
>> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
>> +        if (size == 4) {
>> +            set_long(s, addr, val);
>> +        } else {
>> +            set_quad(s, addr, val);
>> +        }
>> +    }
>> +
>> +}
>> +
>> +static IOMMUTLBEntry vtd_iommu_translate(MemoryRegion *iommu, hwaddr addr,
>> +                                         bool is_write)
>> +{
>> +    VTDAddressSpace *vtd_as = container_of(iommu, VTDAddressSpace, iommu);
>> +    IntelIOMMUState *s = vtd_as->iommu_state;
>> +    int bus_num = vtd_as->bus_num;
>> +    int devfn = vtd_as->devfn;
>> +    IOMMUTLBEntry ret = {
>> +        .target_as = &address_space_memory,
>> +        .iova = addr,
>> +        .translated_addr = 0,
>> +        .addr_mask = ~(hwaddr)0,
>> +        .perm = IOMMU_NONE,
>> +    };
>> +
>> +    if (!s->dmar_enabled) {
>> +        /* DMAR disabled, passthrough, use 4k-page */
>> +        ret.iova = addr & VTD_PAGE_MASK_4K;
>> +        ret.translated_addr = addr & VTD_PAGE_MASK_4K;
>> +        ret.addr_mask = ~VTD_PAGE_MASK_4K;
>> +        ret.perm = IOMMU_RW;
>> +        return ret;
>> +    }
>> +
>> +    iommu_translate(s, bus_num, devfn, addr, is_write, &ret);
>> +
>> +    VTD_DPRINTF(MMU,
>> +                "bus %d slot %d func %d devfn %d gpa %"PRIx64 " hpa %"PRIx64,
>> +                bus_num, VTD_PCI_SLOT(devfn), VTD_PCI_FUNC(devfn), devfn, addr,
>> +                ret.translated_addr);
>> +    return ret;
>> +}
>> +
>> +static const VMStateDescription vtd_vmstate = {
>> +    .name = "iommu_intel",
>> +    .version_id = 1,
>> +    .minimum_version_id = 1,
>> +    .minimum_version_id_old = 1,
>> +    .fields = (VMStateField[]) {
>> +        VMSTATE_UINT8_ARRAY(csr, IntelIOMMUState, DMAR_REG_SIZE),
>> +        VMSTATE_END_OF_LIST()
>> +    }
>> +};
>
> Did you test migration? I suppose not. :)
>
> Background: you mirror several register states into IntelIOMMUState
> fields, I guess to make them more handy to use. However, those need to
> be updated on vmload. And there are surely more internal states that
> have to be migrated as well, e.g. the currently active root pointer.
>
> I would suggest either reviewing and fixing this or leaving migration
> support out for now (".unmigratable = 1").

Not tested yet. I declared some of them for efficiency. For example,
the dmar_enabled variable: we need to decide whether DMAR is enabled
in iommu_translate(), and I think reading the dmar_enabled field is
faster than going through ldl_le_p() on the register array every time.
So maybe for now I will just leave migration support out and come back
to fix it later.
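Something minimal like this (following your ".unmigratable = 1"
suggestion) should be enough to block migration until the mirrored
fields and the internal state are handled properly; sketch only:

    /* Sketch: mark the device unmigratable for now. The mirrored
     * fields (dmar_enabled, root, ...) and the invalidation-queue
     * state would otherwise have to be rebuilt on vmload.
     */
    static const VMStateDescription vtd_vmstate = {
        .name = "iommu_intel",
        .unmigratable = 1,
    };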

>> +
>> +static const MemoryRegionOps vtd_mem_ops = {
>> +    .read = vtd_mem_read,
>> +    .write = vtd_mem_write,
>> +    .endianness = DEVICE_LITTLE_ENDIAN,
>> +    .impl = {
>> +        .min_access_size = 4,
>> +        .max_access_size = 8,
>> +    },
>> +    .valid = {
>> +        .min_access_size = 4,
>> +        .max_access_size = 8,
>> +    },
>> +};
>> +
>> +static Property iommu_properties[] = {
>> +    DEFINE_PROP_UINT32("version", IntelIOMMUState, version, 0),
>> +    DEFINE_PROP_END_OF_LIST(),
>> +};
>> +
>> +/* Do the real initialization. It will also be called when reset, so pay
>> + * attention when adding new initialization stuff.
>> + */
>> +static void do_vtd_init(IntelIOMMUState *s)
>> +{
>> +    memset(s->csr, 0, DMAR_REG_SIZE);
>> +    memset(s->wmask, 0, DMAR_REG_SIZE);
>> +    memset(s->w1cmask, 0, DMAR_REG_SIZE);
>> +    memset(s->womask, 0, DMAR_REG_SIZE);
>> +
>> +    s->iommu_ops.translate = vtd_iommu_translate;
>> +    s->root = 0;
>> +    s->root_extended = false;
>> +    s->dmar_enabled = false;
>> +    s->iq_head = 0;
>> +    s->iq_tail = 0;
>> +    s->iq = 0;
>> +    s->iq_size = 0;
>> +    s->qi_enabled = false;
>> +    s->iq_last_desc_type = VTD_INV_DESC_NONE;
>> +    s->next_frcd_reg = 0;
>> +
>> +    /* b.0:2 = 6: Number of domains supported: 64K using 16 bit ids
>> +     * b.3   = 0: Advanced fault logging not supported
>> +     * b.4   = 0: Required write buffer flushing not supported
>> +     * b.5   = 0: Protected low memory region not supported
>> +     * b.6   = 0: Protected high memory region not supported
>> +     * b.8:12 = 2: SAGAW(Supported Adjusted Guest Address Widths), 39-bit,
>> +     *             3-level page-table
>> +     * b.16:21 = 38: MGAW(Maximum Guest Address Width) = 39
>> +     * b.22 = 0: ZLR(Zero Length Read) zero length DMA read requests
>> +     *           to write-only pages not supported
>> +     * b.24:33 = 34: FRO(Fault-recording Register offset)
>> +     * b.54 = 0: DWD(Write Draining), draining of write requests not supported
>> +     * b.55 = 0: DRD(Read Draining), draining of read requests not supported
>> +     */
>
> I think this level of documentation is a bit overkill. You already
> document the register layout implicitly by defining the constants.
> Applies elsewhere, too.

Get it.

>
>> +    s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND | VTD_CAP_MGAW |
>> +             VTD_CAP_SAGAW;
>> +
>> +    /* b.1 = 0: QI(Queued Invalidation support) not supported
>> +     * b.2 = 0: DT(Device-TLB support) not supported
>> +     * b.3 = 0: IR(Interrupt Remapping support) not supported
>> +     * b.4 = 0: EIM(Extended Interrupt Mode) not supported
>> +     * b.8:17 = 15: IRO(IOTLB Register Offset)
>> +     * b.20:23 = 0: MHMV(Maximum Handle Mask Value) not valid
>> +     */
>> +    s->ecap = VTD_ECAP_IRO;
>> +
>> +    /* Define registers with default values and bit semantics */
>> +    define_long(s, DMAR_VER_REG, 0x10UL, 0, 0);  /* set MAX = 1, RO */
>> +    define_quad(s, DMAR_CAP_REG, s->cap, 0, 0);
>> +    define_quad(s, DMAR_ECAP_REG, s->ecap, 0, 0);
>> +    define_long(s, DMAR_GCMD_REG, 0, 0xff800000UL, 0);
>> +    define_long_wo(s, DMAR_GCMD_REG, 0xff800000UL);
>> +    define_long(s, DMAR_GSTS_REG, 0, 0, 0); /* All bits RO, default 0 */
>> +    define_quad(s, DMAR_RTADDR_REG, 0, 0xfffffffffffff000ULL, 0);
>> +    define_quad(s, DMAR_CCMD_REG, 0, 0xe0000003ffffffffULL, 0);
>> +    define_quad_wo(s, DMAR_CCMD_REG, 0x3ffff0000ULL);
>> +
>> +    /* Advanced Fault Logging not supported */
>> +    define_long(s, DMAR_FSTS_REG, 0, 0, 0x11UL);
>> +    define_long(s, DMAR_FECTL_REG, 0x80000000UL, 0x80000000UL, 0);
>> +    define_long(s, DMAR_FEDATA_REG, 0, 0x0000ffffUL, 0); /* 15:0 RW */
>> +    define_long(s, DMAR_FEADDR_REG, 0, 0xfffffffcUL, 0); /* 31:2 RW */
>> +
>> +    /* Treated as RsvdZ when EIM in ECAP_REG is not supported
>> +     * define_long(s, DMAR_FEUADDR_REG, 0, 0xffffffffUL, 0);
>> +     */
>> +    define_long(s, DMAR_FEUADDR_REG, 0, 0, 0);
>> +
>> +    /* Treated as RO for implementations that PLMR and PHMR fields reported
>> +     * as Clear in the CAP_REG.
>> +     * define_long(s, DMAR_PMEN_REG, 0, 0x80000000UL, 0);
>> +     */
>> +    define_long(s, DMAR_PMEN_REG, 0, 0, 0);
>> +
>> +    /* IOTLB registers */
>> +    define_quad(s, DMAR_IOTLB_REG, 0, 0Xb003ffff00000000ULL, 0);
>> +    define_quad(s, DMAR_IVA_REG, 0, 0xfffffffffffff07fULL, 0);
>> +    define_quad_wo(s, DMAR_IVA_REG, 0xfffffffffffff07fULL);
>> +
>> +    /* Fault Recording Registers, 128-bit */
>> +    define_quad(s, DMAR_FRCD_REG_0_0, 0, 0, 0);
>> +    define_quad(s, DMAR_FRCD_REG_0_2, 0, 0, 0x8000000000000000ULL);
>> +}
>> +
>> +/* Reset function of QOM
>> + * Should not reset address_spaces when reset
>
> What does "should not" mean here? Is it an open todo?

Because after VT-d is realized, devices call
pci_device_iommu_address_space() to get their address space, and only
then does QEMU call vtd_reset(). What's more, when L1 resets,
vtd_reset() will be called twice, but devices won't try to get the
address space again; they just keep using the original one. So I think
we should not reset address_spaces[] in vtd_reset(). Besides, there
seems to be nothing in VTDAddressSpace that needs to be reset.
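Roughly the device-side picture I have in mind (sketch only;
some_pci_device_realize and pci_dev are just placeholders for whatever
device is doing DMA):

    /* A device looks up its DMA address space once, typically around
     * realize time, and keeps using the returned pointer afterwards.
     */
    static void some_pci_device_realize(PCIDevice *pci_dev, Error **errp)
    {
        AddressSpace *dma_as = pci_device_iommu_address_space(pci_dev);
        /* dma_as is cached by the device and is not re-queried after a
         * reset, so vtd_reset() must not throw away the VTDAddressSpace
         * behind it.
         */
        (void)dma_as;
    }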

>> + */
>> +static void vtd_reset(DeviceState *dev)
>> +{
>> +    IntelIOMMUState *s = INTEL_IOMMU_DEVICE(dev);
>> +
>> +    VTD_DPRINTF(GENERAL, "");
>> +    do_vtd_init(s);
>> +}
>> +
>> +/* Initialization function of QOM */
>> +static void vtd_realize(DeviceState *dev, Error **errp)
>> +{
>> +    IntelIOMMUState *s = INTEL_IOMMU_DEVICE(dev);
>> +
>> +    VTD_DPRINTF(GENERAL, "");
>> +    memset(s->address_spaces, 0, sizeof(s->address_spaces));
>> +    memory_region_init_io(&s->csrmem, OBJECT(s), &vtd_mem_ops, s,
>> +                          "intel_iommu", DMAR_REG_SIZE);
>> +    sysbus_init_mmio(SYS_BUS_DEVICE(s), &s->csrmem);
>> +    do_vtd_init(s);
>> +}
>> +
>> +static void vtd_class_init(ObjectClass *klass, void *data)
>> +{
>> +    DeviceClass *dc = DEVICE_CLASS(klass);
>> +
>> +    dc->reset = vtd_reset;
>> +    dc->realize = vtd_realize;
>> +    dc->vmsd = &vtd_vmstate;
>> +    dc->props = iommu_properties;
>> +}
>> +
>> +static const TypeInfo vtd_info = {
>> +    .name          = TYPE_INTEL_IOMMU_DEVICE,
>> +    .parent        = TYPE_SYS_BUS_DEVICE,
>> +    .instance_size = sizeof(IntelIOMMUState),
>> +    .class_init    = vtd_class_init,
>> +};
>> +
>> +static void vtd_register_types(void)
>> +{
>> +    VTD_DPRINTF(GENERAL, "");
>> +    type_register_static(&vtd_info);
>> +}
>> +
>> +type_init(vtd_register_types)
>> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
>> new file mode 100644
>> index 0000000..7bc679a
>> --- /dev/null
>> +++ b/hw/i386/intel_iommu_internal.h
>> @@ -0,0 +1,345 @@
>> +/*
>> + * QEMU emulation of an Intel IOMMU (VT-d)
>> + *   (DMA Remapping device)
>> + *
>> + * Copyright (C) 2013 Knut Omang, Oracle <knut.omang@oracle.com>
>> + * Copyright (C) 2014 Le Tan, <tamlokveer@gmail.com>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License, or
>> + * (at your option) any later version.
>> +
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> +
>> + * You should have received a copy of the GNU General Public License along
>> + * with this program; if not, see <http://www.gnu.org/licenses/>.
>> + *
>> + * Lots of defines copied from kernel/include/linux/intel-iommu.h:
>> + *   Copyright (C) 2006-2008 Intel Corporation
>> + *   Author: Ashok Raj <ashok.raj@intel.com>
>> + *   Author: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
>> + *
>> + */
>> +
>> +#ifndef HW_I386_INTEL_IOMMU_INTERNAL_H
>> +#define HW_I386_INTEL_IOMMU_INTERNAL_H
>> +#include "hw/i386/intel_iommu.h"
>> +
>> +/*
>> + * Intel IOMMU register specification
>> + */
>> +#define DMAR_VER_REG    0x0 /* Arch version supported by this IOMMU */
>> +#define DMAR_CAP_REG    0x8 /* Hardware supported capabilities */
>> +#define DMAR_CAP_REG_HI 0xc /* High 32-bit of DMAR_CAP_REG */
>> +#define DMAR_ECAP_REG   0x10    /* Extended capabilities supported */
>> +#define DMAR_ECAP_REG_HI    0X14
>> +#define DMAR_GCMD_REG   0x18    /* Global command register */
>> +#define DMAR_GSTS_REG   0x1c    /* Global status register */
>> +#define DMAR_RTADDR_REG 0x20    /* Root entry table */
>> +#define DMAR_RTADDR_REG_HI  0X24
>> +#define DMAR_CCMD_REG   0x28  /* Context command reg */
>> +#define DMAR_CCMD_REG_HI    0x2c
>> +#define DMAR_FSTS_REG   0x34  /* Fault Status register */
>> +#define DMAR_FECTL_REG  0x38 /* Fault control register */
>> +#define DMAR_FEDATA_REG 0x3c    /* Fault event interrupt data register */
>> +#define DMAR_FEADDR_REG 0x40    /* Fault event interrupt addr register */
>> +#define DMAR_FEUADDR_REG    0x44   /* Upper address register */
>> +#define DMAR_AFLOG_REG  0x58 /* Advanced Fault control */
>> +#define DMAR_AFLOG_REG_HI   0X5c
>> +#define DMAR_PMEN_REG   0x64  /* Enable Protected Memory Region */
>> +#define DMAR_PLMBASE_REG    0x68    /* PMRR Low addr */
>> +#define DMAR_PLMLIMIT_REG 0x6c  /* PMRR low limit */
>> +#define DMAR_PHMBASE_REG 0x70   /* pmrr high base addr */
>> +#define DMAR_PHMBASE_REG_HI 0X74
>> +#define DMAR_PHMLIMIT_REG 0x78  /* pmrr high limit */
>> +#define DMAR_PHMLIMIT_REG_HI 0x7c
>> +#define DMAR_IQH_REG    0x80   /* Invalidation queue head register */
>> +#define DMAR_IQH_REG_HI 0X84
>> +#define DMAR_IQT_REG    0x88   /* Invalidation queue tail register */
>> +#define DMAR_IQT_REG_HI 0X8c
>> +#define DMAR_IQ_SHIFT   4 /* Invalidation queue head/tail shift */
>> +#define DMAR_IQA_REG    0x90   /* Invalidation queue addr register */
>> +#define DMAR_IQA_REG_HI 0x94
>> +#define DMAR_ICS_REG    0x9c   /* Invalidation complete status register */
>> +#define DMAR_IRTA_REG   0xb8    /* Interrupt remapping table addr register */
>> +#define DMAR_IRTA_REG_HI    0xbc
>
> Please align all those constants:
>
> #define CONSTANT                        0x1234
> #define CONSTANT_WITH_LONGER_NAME       0x5678
>

What does align mean here? Like this?
#define CONSTANT                        0x1234
#define CONSTANT_WITH_LONGER_NAME       0x5678

>> +
>> +#define DMAR_IECTL_REG  0xa0    /* Invalidation event control register */
>> +#define DMAR_IEDATA_REG 0xa4    /* Invalidation event data register */
>> +#define DMAR_IEADDR_REG 0xa8    /* Invalidation event address register */
>> +#define DMAR_IEUADDR_REG 0xac    /* Invalidation event address register */
>> +#define DMAR_PQH_REG    0xc0    /* Page request queue head register */
>> +#define DMAR_PQH_REG_HI 0xc4
>> +#define DMAR_PQT_REG    0xc8    /* Page request queue tail register*/
>> +#define DMAR_PQT_REG_HI     0xcc
>> +#define DMAR_PQA_REG    0xd0    /* Page request queue address register */
>> +#define DMAR_PQA_REG_HI 0xd4
>> +#define DMAR_PRS_REG    0xdc    /* Page request status register */
>> +#define DMAR_PECTL_REG  0xe0    /* Page request event control register */
>> +#define DMAR_PEDATA_REG 0xe4    /* Page request event data register */
>> +#define DMAR_PEADDR_REG 0xe8    /* Page request event address register */
>> +#define DMAR_PEUADDR_REG  0xec  /* Page event upper address register */
>> +#define DMAR_MTRRCAP_REG 0x100  /* MTRR capability register */
>> +#define DMAR_MTRRCAP_REG_HI 0x104
>> +#define DMAR_MTRRDEF_REG 0x108  /* MTRR default type register */
>> +#define DMAR_MTRRDEF_REG_HI 0x10c
>> +
>> +/* IOTLB */
>> +#define DMAR_IOTLB_REG_OFFSET 0xf0  /* Offset to the IOTLB registers */
>> +#define DMAR_IVA_REG DMAR_IOTLB_REG_OFFSET  /* Invalidate Address Register */
>> +#define DMAR_IVA_REG_HI (DMAR_IVA_REG + 4)
>> +/* IOTLB Invalidate Register */
>> +#define DMAR_IOTLB_REG (DMAR_IOTLB_REG_OFFSET + 0x8)
>> +#define DMAR_IOTLB_REG_HI (DMAR_IOTLB_REG + 4)
>> +
>> +/* FRCD */
>> +#define DMAR_FRCD_REG_OFFSET 0x220 /* Offset to the Fault Recording Registers */
>> +/* NOTICE: If you change the DMAR_FRCD_REG_NR, please remember to change the
>> + * DMAR_REG_SIZE in include/hw/i386/intel_iommu.h.
>> + * #define DMAR_REG_SIZE   (DMAR_FRCD_REG_OFFSET + 16 * DMAR_FRCD_REG_NR)
>> + */
>> +#define DMAR_FRCD_REG_NR 1ULL /* Num of Fault Recording Registers */
>> +
>> +#define DMAR_FRCD_REG_0_0    0x220 /* The 0th Fault Recording Register */
>> +#define DMAR_FRCD_REG_0_1    0x224
>> +#define DMAR_FRCD_REG_0_2    0x228
>> +#define DMAR_FRCD_REG_0_3    0x22c
>> +
>> +/* Interrupt Address Range */
>> +#define VTD_INTERRUPT_ADDR_FIRST    0xfee00000ULL
>> +#define VTD_INTERRUPT_ADDR_LAST     0xfeefffffULL
>> +
>> +/* IOTLB_REG */
>> +#define VTD_TLB_GLOBAL_FLUSH (1ULL << 60) /* Global invalidation */
>> +#define VTD_TLB_DSI_FLUSH (2ULL << 60)  /* Domain-selective invalidation */
>> +#define VTD_TLB_PSI_FLUSH (3ULL << 60)  /* Page-selective invalidation */
>> +#define VTD_TLB_FLUSH_GRANU_MASK (3ULL << 60)
>> +#define VTD_TLB_GLOBAL_FLUSH_A (1ULL << 57)
>> +#define VTD_TLB_DSI_FLUSH_A (2ULL << 57)
>> +#define VTD_TLB_PSI_FLUSH_A (3ULL << 57)
>> +#define VTD_TLB_FLUSH_GRANU_MASK_A (3ULL << 57)
>> +#define VTD_TLB_IVT (1ULL << 63)
>> +
>> +/* GCMD_REG */
>> +#define VTD_GCMD_TE (1UL << 31)
>> +#define VTD_GCMD_SRTP (1UL << 30)
>> +#define VTD_GCMD_SFL (1UL << 29)
>> +#define VTD_GCMD_EAFL (1UL << 28)
>> +#define VTD_GCMD_WBF (1UL << 27)
>> +#define VTD_GCMD_QIE (1UL << 26)
>> +#define VTD_GCMD_IRE (1UL << 25)
>> +#define VTD_GCMD_SIRTP (1UL << 24)
>> +#define VTD_GCMD_CFI (1UL << 23)
>> +
>> +/* GSTS_REG */
>> +#define VTD_GSTS_TES (1UL << 31)
>> +#define VTD_GSTS_RTPS (1UL << 30)
>> +#define VTD_GSTS_FLS (1UL << 29)
>> +#define VTD_GSTS_AFLS (1UL << 28)
>> +#define VTD_GSTS_WBFS (1UL << 27)
>> +#define VTD_GSTS_QIES (1UL << 26)
>> +#define VTD_GSTS_IRES (1UL << 25)
>> +#define VTD_GSTS_IRTPS (1UL << 24)
>> +#define VTD_GSTS_CFIS (1UL << 23)
>> +
>> +/* CCMD_REG */
>> +#define VTD_CCMD_ICC (1ULL << 63)
>> +#define VTD_CCMD_GLOBAL_INVL (1ULL << 61)
>> +#define VTD_CCMD_DOMAIN_INVL (2ULL << 61)
>> +#define VTD_CCMD_DEVICE_INVL (3ULL << 61)
>> +#define VTD_CCMD_CIRG_MASK (3ULL << 61)
>> +#define VTD_CCMD_GLOBAL_INVL_A (1ULL << 59)
>> +#define VTD_CCMD_DOMAIN_INVL_A (2ULL << 59)
>> +#define VTD_CCMD_DEVICE_INVL_A (3ULL << 59)
>> +#define VTD_CCMD_CAIG_MASK (3ULL << 59)
>> +
>> +/* RTADDR_REG */
>> +#define VTD_RTADDR_RTT (1ULL << 11)
>> +#define VTD_RTADDR_ADDR_MASK (VTD_HAW_MASK ^ 0xfffULL)
>> +
>> +/* ECAP_REG */
>> +#define VTD_ECAP_IRO (DMAR_IOTLB_REG_OFFSET << 4)  /* (offset >> 4) << 8 */
>> +#define VTD_ECAP_QI  (1ULL << 1)
>> +
>> +/* CAP_REG */
>> +#define VTD_CAP_FRO  (DMAR_FRCD_REG_OFFSET << 20) /* (offset >> 4) << 24 */
>> +#define VTD_CAP_NFR  ((DMAR_FRCD_REG_NR - 1) << 40)
>> +#define VTD_DOMAIN_ID_SHIFT     16  /* 16-bit domain id for 64K domains */
>> +#define VTD_CAP_ND  (((VTD_DOMAIN_ID_SHIFT - 4) / 2) & 7ULL)
>> +#define VTD_MGAW    39  /* Maximum Guest Address Width */
>> +#define VTD_CAP_MGAW    (((VTD_MGAW - 1) & 0x3fULL) << 16)
>> +
>> +/* Supported Adjusted Guest Address Widths */
>> +#define VTD_CAP_SAGAW_SHIFT (8)
>> +#define VTD_CAP_SAGAW_MASK  (0x1fULL << VTD_CAP_SAGAW_SHIFT)
>> + /* 39-bit AGAW, 3-level page-table */
>> +#define VTD_CAP_SAGAW_39bit (0x2ULL << VTD_CAP_SAGAW_SHIFT)
>> + /* 48-bit AGAW, 4-level page-table */
>> +#define VTD_CAP_SAGAW_48bit (0x4ULL << VTD_CAP_SAGAW_SHIFT)
>> +#define VTD_CAP_SAGAW       VTD_CAP_SAGAW_39bit
>> +
>> +/* IQT_REG */
>> +#define VTD_IQT_QT(val)     (((val) >> 4) & 0x7fffULL)
>> +
>> +/* IQA_REG */
>> +#define VTD_IQA_IQA_MASK    (VTD_HAW_MASK ^ 0xfffULL)
>> +#define VTD_IQA_QS          (0x7ULL)
>> +
>> +/* IQH_REG */
>> +#define VTD_IQH_QH_SHIFT    (4)
>> +#define VTD_IQH_QH_MASK     (0x7fff0ULL)
>
> No need for braces around plain values (i.e. when there are no
> operators), here and elsewhere.

Get it.

>> +
>> +/* ICS_REG */
>> +#define VTD_ICS_IWC         (1UL)
>> +
>> +/* IECTL_REG */
>> +#define VTD_IECTL_IM        (1UL << 31)
>> +#define VTD_IECTL_IP        (1UL << 30)
>> +
>> +/* FSTS_REG */
>> +#define VTD_FSTS_FRI_MASK  (0xff00)
>> +#define VTD_FSTS_FRI(val)  ((((uint32_t)(val)) << 8) & VTD_FSTS_FRI_MASK)
>> +#define VTD_FSTS_IQE       (1UL << 4)
>> +#define VTD_FSTS_PPF       (1UL << 1)
>> +#define VTD_FSTS_PFO       (1UL)
>> +
>> +/* FECTL_REG */
>> +#define VTD_FECTL_IM       (1UL << 31)
>> +#define VTD_FECTL_IP       (1UL << 30)
>> +
>> +/* Fault Recording Register */
>> +/* For the high 64-bit of 128-bit */
>> +#define VTD_FRCD_F         (1ULL << 63)
>> +#define VTD_FRCD_T         (1ULL << 62)
>> +#define VTD_FRCD_FR(val)   (((val) & 0xffULL) << 32)
>> +#define VTD_FRCD_SID_MASK   0xffffULL
>> +#define VTD_FRCD_SID(val)  ((val) & VTD_FRCD_SID_MASK)
>> +/* For the low 64-bit of 128-bit */
>> +#define VTD_FRCD_FI(val)   ((val) & (((1ULL << VTD_MGAW) - 1) ^ 0xfffULL))
>> +
>> +/* DMA Remapping Fault Conditions */
>> +typedef enum VTDFaultReason {
>> +    /* Reserved for Advanced Fault logging. We use this to represent the case
>> +     * with no fault event.
>> +     */
>> +    VTD_FR_RESERVED = 0,
>> +    VTD_FR_ROOT_ENTRY_P = 1, /* The Present(P) field of root-entry is 0 */
>> +    VTD_FR_CONTEXT_ENTRY_P, /* The Present(P) field of context-entry is 0 */
>> +    VTD_FR_CONTEXT_ENTRY_INV, /* Invalid programming of a context-entry */
>> +    VTD_FR_ADDR_BEYOND_MGAW, /* Input-address above (2^x-1) */
>> +    VTD_FR_WRITE, /* No write permission */
>> +    VTD_FR_READ, /* No read permission */
>> +    /* Fail to access a second-level paging entry (not SL_PML4E) */
>> +    VTD_FR_PAGING_ENTRY_INV,
>> +    VTD_FR_ROOT_TABLE_INV, /* Fail to access a root-entry */
>> +    VTD_FR_CONTEXT_TABLE_INV, /* Fail to access a context-entry */
>> +    /* Non-zero reserved field in a present root-entry */
>> +    VTD_FR_ROOT_ENTRY_RSVD,
>> +    /* Non-zero reserved field in a present context-entry */
>> +    VTD_FR_CONTEXT_ENTRY_RSVD,
>> +    /* Non-zero reserved field in a second-level paging entry with at least
>> +     * one of the Read(R), Write(W) or Execute(E) fields Set.
>> +     */
>> +    VTD_FR_PAGING_ENTRY_RSVD,
>> +    /* Translation request or translated request explicitly blocked due to the
>> +     * programming of the Translation Type (T) field in the present
>> +     * context-entry.
>> +     */
>> +    VTD_FR_CONTEXT_ENTRY_TT,
>> +    /* This is not a normal fault reason. We use this to indicate some faults
>> +     * that are not referenced by the VT-d specification.
>> +     * Fault event with such reason should not be recorded.
>> +     */
>> +    VTD_FR_RESERVED_ERR,
>> +    /* Guard */
>> +    VTD_FR_MAX,
>> +} VTDFaultReason;
>> +
>> +
>> +/* Masks for Queued Invalidation Descriptor */
>> +#define VTD_INV_DESC_TYPE  (0xf)
>> +#define VTD_INV_DESC_CC    (0x1) /* Context-cache Invalidate Descriptor */
>> +#define VTD_INV_DESC_IOTLB (0x2)
>> +#define VTD_INV_DESC_WAIT  (0x5) /* Invalidation Wait Descriptor */
>> +#define VTD_INV_DESC_NONE  (0)   /* Not an Invalidate Descriptor */
>> +
>> +
>> +/* Pagesize of VTD paging structures, including root and context tables */
>> +#define VTD_PAGE_SHIFT      (12)
>> +#define VTD_PAGE_SIZE       (1ULL << VTD_PAGE_SHIFT)
>> +
>> +#define VTD_PAGE_SHIFT_4K   (12)
>> +#define VTD_PAGE_MASK_4K    (~((1ULL << VTD_PAGE_SHIFT_4K) - 1))
>> +#define VTD_PAGE_SHIFT_2M   (21)
>> +#define VTD_PAGE_MASK_2M    (~((1ULL << VTD_PAGE_SHIFT_2M) - 1))
>> +#define VTD_PAGE_SHIFT_1G   (30)
>> +#define VTD_PAGE_MASK_1G    (~((1ULL << VTD_PAGE_SHIFT_1G) - 1))
>> +
>> +/* Root-Entry
>> + * 0: Present
>> + * 1-11: Reserved
>> + * 12-63: Context-table Pointer
>> + * 64-127: Reserved
>> + */
>> +struct VTDRootEntry {
>> +    uint64_t val;
>> +    uint64_t rsvd;
>> +};
>> +typedef struct VTDRootEntry VTDRootEntry;
>> +
>> +/* Masks for struct VTDRootEntry */
>> +#define VTD_ROOT_ENTRY_P (1ULL << 0)
>> +#define VTD_ROOT_ENTRY_CTP  (~0xfffULL)
>> +
>> +#define VTD_ROOT_ENTRY_NR   (VTD_PAGE_SIZE / sizeof(VTDRootEntry))
>> +#define VTD_ROOT_ENTRY_RSVD (0xffeULL | ~VTD_HAW_MASK)
>> +
>> +/* Context-Entry */
>> +struct VTDContextEntry {
>> +    uint64_t lo;
>> +    uint64_t hi;
>> +};
>> +typedef struct VTDContextEntry VTDContextEntry;
>> +
>> +/* Masks for struct VTDContextEntry */
>> +/* lo */
>> +#define VTD_CONTEXT_ENTRY_P (1ULL << 0)
>> +#define VTD_CONTEXT_ENTRY_FPD   (1ULL << 1) /* Fault Processing Disable */
>> +#define VTD_CONTEXT_ENTRY_TT    (3ULL << 2) /* Translation Type */
>> +#define VTD_CONTEXT_TT_MULTI_LEVEL  (0)
>> +#define VTD_CONTEXT_TT_DEV_IOTLB    (1)
>> +#define VTD_CONTEXT_TT_PASS_THROUGH (2)
>> +/* Second Level Page Translation Pointer */
>> +#define VTD_CONTEXT_ENTRY_SLPTPTR   (~0xfffULL)
>> +#define VTD_CONTEXT_ENTRY_RSVD_LO   (0xff0ULL | ~VTD_HAW_MASK)
>> +/* hi */
>> +#define VTD_CONTEXT_ENTRY_AW    (7ULL) /* Adjusted guest-address-width */
>> +#define VTD_CONTEXT_ENTRY_DID   (0xffffULL << 8)    /* Domain Identifier */
>> +#define VTD_CONTEXT_ENTRY_RSVD_HI   (0xffffffffff000080ULL)
>> +
>> +#define VTD_CONTEXT_ENTRY_NR    (VTD_PAGE_SIZE / sizeof(VTDContextEntry))
>> +
>> +
>> +/* Paging Structure common */
>> +#define VTD_SL_PT_PAGE_SIZE_MASK   (1ULL << 7)
>> +#define VTD_SL_LEVEL_BITS   9   /* Bits to decide the offset for each level */
>> +
>> +/* Second Level Paging Structure */
>> +#define VTD_SL_PML4_LEVEL   4
>> +#define VTD_SL_PDP_LEVEL    3
>> +#define VTD_SL_PD_LEVEL     2
>> +#define VTD_SL_PT_LEVEL     1
>> +#define VTD_SL_PT_ENTRY_NR  512
>> +
>> +/* Masks for Second Level Paging Entry */
>> +#define VTD_SL_RW_MASK              (3ULL)
>> +#define VTD_SL_R                    (1ULL)
>> +#define VTD_SL_W                    (1ULL << 1)
>> +#define VTD_SL_PT_BASE_ADDR_MASK    (~(VTD_PAGE_SIZE - 1) & VTD_HAW_MASK)
>> +#define VTD_SL_IGN_COM    (0xbff0000000000000ULL)
>> +
>> +#endif
>> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
>> new file mode 100644
>> index 0000000..6601e62
>> --- /dev/null
>> +++ b/include/hw/i386/intel_iommu.h
>> @@ -0,0 +1,90 @@
>> +/*
>> + * QEMU emulation of an Intel IOMMU (VT-d)
>> + *   (DMA Remapping device)
>> + *
>> + * Copyright (C) 2013 Knut Omang, Oracle <knut.omang@oracle.com>
>> + * Copyright (C) 2014 Le Tan, <tamlokveer@gmail.com>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License, or
>> + * (at your option) any later version.
>> +
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> +
>> + * You should have received a copy of the GNU General Public License along
>> + * with this program; if not, see <http://www.gnu.org/licenses/>.
>> + */
>> +
>> +#ifndef INTEL_IOMMU_H
>> +#define INTEL_IOMMU_H
>> +#include "hw/qdev.h"
>> +#include "sysemu/dma.h"
>> +
>> +#define TYPE_INTEL_IOMMU_DEVICE "intel-iommu"
>> +#define INTEL_IOMMU_DEVICE(obj) \
>> +     OBJECT_CHECK(IntelIOMMUState, (obj), TYPE_INTEL_IOMMU_DEVICE)
>> +
>> +/* DMAR Hardware Unit Definition address (IOMMU unit) */
>> +#define Q35_HOST_BRIDGE_IOMMU_ADDR 0xfed90000ULL
>> +
>> +#define VTD_PCI_BUS_MAX 256
>> +#define VTD_PCI_SLOT_MAX 32
>> +#define VTD_PCI_FUNC_MAX 8
>> +#define VTD_PCI_SLOT(devfn)         (((devfn) >> 3) & 0x1f)
>> +#define VTD_PCI_FUNC(devfn)         ((devfn) & 0x07)
>> +
>> +#define DMAR_REG_SIZE   0x230
>> +
>> +/* FIXME: do not know how to decide the haw */
>
> Nothing to fix IMHO. Just state that this definition is arbitrary, just
> large enough to cover all currently expected guest RAM sizes.

OK.

>> +#define VTD_HOST_ADDRESS_WIDTH  39
>> +#define VTD_HAW_MASK    ((1ULL << VTD_HOST_ADDRESS_WIDTH) - 1)
>> +
>> +typedef struct IntelIOMMUState IntelIOMMUState;
>> +typedef struct VTDAddressSpace VTDAddressSpace;
>> +
>> +struct VTDAddressSpace {
>> +    int bus_num;
>> +    int devfn;
>> +    AddressSpace as;
>> +    MemoryRegion iommu;
>> +    IntelIOMMUState *iommu_state;
>> +};
>> +
>> +/* The iommu (DMAR) device state struct */
>> +struct IntelIOMMUState {
>> +    SysBusDevice busdev;
>> +    MemoryRegion csrmem;
>> +    uint8_t csr[DMAR_REG_SIZE];     /* register values */
>> +    uint8_t wmask[DMAR_REG_SIZE];   /* R/W bytes */
>> +    uint8_t w1cmask[DMAR_REG_SIZE]; /* RW1C(Write 1 to Clear) bytes */
>> +    uint8_t womask[DMAR_REG_SIZE]; /* WO (write only - read returns 0) */
>> +    uint32_t version;
>> +
>> +    dma_addr_t root;        /* Current root table pointer */
>> +    bool root_extended;     /* Type of root table (extended or not) */
>> +    bool dmar_enabled;      /* Set if DMA remapping is enabled */
>> +
>> +    uint16_t iq_head;       /* Current invalidation queue head */
>> +    uint16_t iq_tail;       /* Current invalidation queue tail */
>> +    dma_addr_t iq;          /* Current invalidation queue (IQ) pointer */
>> +    uint16_t iq_size;       /* IQ Size in number of entries */
>> +    bool qi_enabled;        /* Set if the QI is enabled */
>> +    uint8_t iq_last_desc_type; /* The type of last completed descriptor */
>> +
>> +    /* The index of the Fault Recording Register to be used next.
>> +     * Wraps around from N-1 to 0, where N is the number of FRCD_REG.
>> +     */
>> +    uint16_t next_frcd_reg;
>> +
>> +    uint64_t cap;           /* The value of Capability Register */
>> +    uint64_t ecap;          /* The value of Extended Capability Register */
>> +
>> +    MemoryRegionIOMMUOps iommu_ops;
>> +    VTDAddressSpace **address_spaces[VTD_PCI_BUS_MAX];
>> +};
>> +
>> +#endif
>>
>
> Very nice job!

Thanks very much for your review! :)

Le

> Jan
>


* Re: [Qemu-devel] [PATCH v3 2/5] intel-iommu: introduce Intel IOMMU (VT-d) emulation
  2014-08-11  7:04 ` [Qemu-devel] [PATCH v3 2/5] intel-iommu: introduce Intel IOMMU (VT-d) emulation Le Tan
  2014-08-12  7:34   ` Jan Kiszka
@ 2014-08-14 11:03   ` Michael S. Tsirkin
  1 sibling, 0 replies; 34+ messages in thread
From: Michael S. Tsirkin @ 2014-08-14 11:03 UTC (permalink / raw)
  To: Le Tan
  Cc: Stefan Weil, Knut Omang, qemu-devel, Alex Williamson, Jan Kiszka,
	Anthony Liguori, Paolo Bonzini

On Mon, Aug 11, 2014 at 03:04:59PM +0800, Le Tan wrote:
> Add support for emulating Intel IOMMU according to the VT-d specification for
> the q35 chipset machine. Implement the logics for DMAR (DMA remapping) without
> PASID support. The emulation supports register-based invalidation and primary
> fault logging.
> 
> Signed-off-by: Le Tan <tamlokveer@gmail.com>
> ---
>  hw/i386/Makefile.objs          |    1 +
>  hw/i386/intel_iommu.c          | 1345 ++++++++++++++++++++++++++++++++++++++++
>  hw/i386/intel_iommu_internal.h |  345 +++++++++++
>  include/hw/i386/intel_iommu.h  |   90 +++
>  4 files changed, 1781 insertions(+)
>  create mode 100644 hw/i386/intel_iommu.c
>  create mode 100644 hw/i386/intel_iommu_internal.h
>  create mode 100644 include/hw/i386/intel_iommu.h
> 
> diff --git a/hw/i386/Makefile.objs b/hw/i386/Makefile.objs
> index 48014ab..6936111 100644
> --- a/hw/i386/Makefile.objs
> +++ b/hw/i386/Makefile.objs
> @@ -2,6 +2,7 @@ obj-$(CONFIG_KVM) += kvm/
>  obj-y += multiboot.o smbios.o
>  obj-y += pc.o pc_piix.o pc_q35.o
>  obj-y += pc_sysfw.o
> +obj-y += intel_iommu.o
>  obj-$(CONFIG_XEN) += ../xenpv/ xen/
>  
>  obj-y += kvmvapic.o
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> new file mode 100644
> index 0000000..b3a4f78
> --- /dev/null
> +++ b/hw/i386/intel_iommu.c
> @@ -0,0 +1,1345 @@
> +/*
> + * QEMU emulation of an Intel IOMMU (VT-d)
> + *   (DMA Remapping device)
> + *
> + * Copyright (C) 2013 Knut Omang, Oracle <knut.omang@oracle.com>
> + * Copyright (C) 2014 Le Tan, <tamlokveer@gmail.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> +
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> +
> + * You should have received a copy of the GNU General Public License along
> + * with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "hw/sysbus.h"
> +#include "exec/address-spaces.h"
> +#include "intel_iommu_internal.h"
> +
> +
> +/*#define DEBUG_INTEL_IOMMU*/
> +#ifdef DEBUG_INTEL_IOMMU
> +enum {
> +    DEBUG_GENERAL, DEBUG_CSR, DEBUG_INV, DEBUG_MMU, DEBUG_FLOG,
> +};
> +#define VTD_DBGBIT(x)   (1 << DEBUG_##x)
> +static int vtd_dbgflags = VTD_DBGBIT(GENERAL) | VTD_DBGBIT(CSR) |
> +                          VTD_DBGBIT(FLOG);
> +
> +#define VTD_DPRINTF(what, fmt, ...) do { \
> +    if (vtd_dbgflags & VTD_DBGBIT(what)) { \
> +        fprintf(stderr, "(vtd)%s: " fmt "\n", __func__, \
> +                ## __VA_ARGS__); } \
> +    } while (0)
> +#else
> +#define VTD_DPRINTF(what, fmt, ...) do {} while (0)
> +#endif
> +
> +static inline void define_quad(IntelIOMMUState *s, hwaddr addr, uint64_t val,
> +                               uint64_t wmask, uint64_t w1cmask)

Please prefix functions with intel_iommu_ or vtd_; we don't want the
build failing when someone adds a function with the same name in a
global header, and the prefix will also serve as a hint to the type of
the first parameter.
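E.g., just to illustrate the naming (not a full conversion of the
patch):

    static inline void vtd_define_quad(IntelIOMMUState *s, hwaddr addr,
                                       uint64_t val, uint64_t wmask,
                                       uint64_t w1cmask)
    {
        stq_le_p(&s->csr[addr], val);
        stq_le_p(&s->wmask[addr], wmask);
        stq_le_p(&s->w1cmask[addr], w1cmask);
    }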


> +{
> +    stq_le_p(&s->csr[addr], val);
> +    stq_le_p(&s->wmask[addr], wmask);
> +    stq_le_p(&s->w1cmask[addr], w1cmask);
> +}
> +
> +static inline void define_quad_wo(IntelIOMMUState *s, hwaddr addr,
> +                                  uint64_t mask)
> +{
> +    stq_le_p(&s->womask[addr], mask);
> +}
> +
> +static inline void define_long(IntelIOMMUState *s, hwaddr addr, uint32_t val,
> +                               uint32_t wmask, uint32_t w1cmask)
> +{
> +    stl_le_p(&s->csr[addr], val);
> +    stl_le_p(&s->wmask[addr], wmask);
> +    stl_le_p(&s->w1cmask[addr], w1cmask);
> +}
> +
> +static inline void define_long_wo(IntelIOMMUState *s, hwaddr addr,
> +                                  uint32_t mask)
> +{
> +    stl_le_p(&s->womask[addr], mask);
> +}
> +
> +/* "External" get/set operations */
> +static inline void set_quad(IntelIOMMUState *s, hwaddr addr, uint64_t val)
> +{
> +    uint64_t oldval = ldq_le_p(&s->csr[addr]);
> +    uint64_t wmask = ldq_le_p(&s->wmask[addr]);
> +    uint64_t w1cmask = ldq_le_p(&s->w1cmask[addr]);
> +    stq_le_p(&s->csr[addr],
> +             ((oldval & ~wmask) | (val & wmask)) & ~(w1cmask & val));
> +}
> +
> +static inline void set_long(IntelIOMMUState *s, hwaddr addr, uint32_t val)
> +{
> +    uint32_t oldval = ldl_le_p(&s->csr[addr]);
> +    uint32_t wmask = ldl_le_p(&s->wmask[addr]);
> +    uint32_t w1cmask = ldl_le_p(&s->w1cmask[addr]);
> +    stl_le_p(&s->csr[addr],
> +             ((oldval & ~wmask) | (val & wmask)) & ~(w1cmask & val));
> +}
> +
> +static inline uint64_t get_quad(IntelIOMMUState *s, hwaddr addr)
> +{
> +    uint64_t val = ldq_le_p(&s->csr[addr]);
> +    uint64_t womask = ldq_le_p(&s->womask[addr]);
> +    return val & ~womask;
> +}
> +
> +
> +static inline uint32_t get_long(IntelIOMMUState *s, hwaddr addr)
> +{
> +    uint32_t val = ldl_le_p(&s->csr[addr]);
> +    uint32_t womask = ldl_le_p(&s->womask[addr]);
> +    return val & ~womask;
> +}
> +
> +/* "Internal" get/set operations */
> +static inline uint64_t get_quad_raw(IntelIOMMUState *s, hwaddr addr)
> +{
> +    return ldq_le_p(&s->csr[addr]);
> +}
> +
> +static inline uint32_t get_long_raw(IntelIOMMUState *s, hwaddr addr)
> +{
> +    return ldl_le_p(&s->csr[addr]);
> +}
> +
> +static inline void set_quad_raw(IntelIOMMUState *s, hwaddr addr, uint64_t val)
> +{
> +    stq_le_p(&s->csr[addr], val);
> +}
> +
> +static inline uint32_t set_clear_mask_long(IntelIOMMUState *s, hwaddr addr,
> +                                           uint32_t clear, uint32_t mask)
> +{
> +    uint32_t new_val = (ldl_le_p(&s->csr[addr]) & ~clear) | mask;
> +    stl_le_p(&s->csr[addr], new_val);
> +    return new_val;
> +}
> +
> +static inline uint64_t set_clear_mask_quad(IntelIOMMUState *s, hwaddr addr,
> +                                           uint64_t clear, uint64_t mask)
> +{
> +    uint64_t new_val = (ldq_le_p(&s->csr[addr]) & ~clear) | mask;
> +    stq_le_p(&s->csr[addr], new_val);
> +    return new_val;
> +}
> +
> +/* Given the register addresses of the message data and the message address,
> + * generate an interrupt via MSI.
> + */
> +static void vtd_generate_interrupt(IntelIOMMUState *s, hwaddr mesg_addr_reg,
> +                                   hwaddr mesg_data_reg)
> +{
> +    hwaddr addr;
> +    uint32_t data;
> +
> +    assert(mesg_data_reg < DMAR_REG_SIZE);
> +    assert(mesg_addr_reg < DMAR_REG_SIZE);
> +
> +    addr = get_long_raw(s, mesg_addr_reg);
> +    data = get_long_raw(s, mesg_data_reg);
> +
> +    VTD_DPRINTF(FLOG, "msi: addr 0x%"PRIx64 " data 0x%"PRIx32, addr, data);
> +    stl_le_phys(&address_space_memory, addr, data);
> +}
> +
> +/* Generate a fault event to software via MSI if conditions are met.
> + * Notice that the value of FSTS_REG being passed to it should be the one
> + * before any update.
> + */
> +static void vtd_generate_fault_event(IntelIOMMUState *s, uint32_t pre_fsts)
> +{
> +    /* Check if there are any previously reported interrupt conditions */
> +    if (pre_fsts & VTD_FSTS_PPF || pre_fsts & VTD_FSTS_PFO ||
> +        pre_fsts & VTD_FSTS_IQE) {
> +        VTD_DPRINTF(FLOG, "there are previous interrupt conditions "
> +                    "to be serviced by software, fault event is not generated "
> +                    "(FSTS_REG 0x%"PRIx32 ")", pre_fsts);
> +        return;
> +    }
> +    set_clear_mask_long(s, DMAR_FECTL_REG, 0, VTD_FECTL_IP);
> +    if (get_long_raw(s, DMAR_FECTL_REG) & VTD_FECTL_IM) {
> +        /* Interrupt Mask */
> +        VTD_DPRINTF(FLOG, "Interrupt Mask set, fault event is not generated");
> +    } else {
> +        /* generate interrupt */
> +        vtd_generate_interrupt(s, DMAR_FEADDR_REG, DMAR_FEDATA_REG);
> +        set_clear_mask_long(s, DMAR_FECTL_REG, VTD_FECTL_IP, 0);
> +    }
> +}
> +
> +/* Check if the Fault (F) field of the Fault Recording Register referenced by
> + * @index is Set.
> + */
> +static inline bool is_frcd_set(IntelIOMMUState *s, uint16_t index)
> +{
> +    /* Each reg is 128-bit */
> +    hwaddr addr = DMAR_FRCD_REG_OFFSET + (((uint64_t)index) << 4);
> +    addr += 8; /* Access the high 64-bit half */
> +
> +    assert(index < DMAR_FRCD_REG_NR);
> +
> +    return get_quad_raw(s, addr) & VTD_FRCD_F;
> +}
> +
> +/* Update the PPF field of Fault Status Register.
> + * Should be called whenever the F field of any fault recording register
> + * is changed.
> + */
> +static inline void update_fsts_ppf(IntelIOMMUState *s)
> +{
> +    uint32_t i;
> +    uint32_t ppf_mask = 0;
> +
> +    for (i = 0; i < DMAR_FRCD_REG_NR; i++) {
> +        if (is_frcd_set(s, i)) {
> +            ppf_mask = VTD_FSTS_PPF;
> +            break;
> +        }
> +    }
> +    set_clear_mask_long(s, DMAR_FSTS_REG, VTD_FSTS_PPF, ppf_mask);
> +    VTD_DPRINTF(FLOG, "set PPF of FSTS_REG to %d", ppf_mask ? 1 : 0);
> +}
> +
> +static inline void set_frcd_and_update_ppf(IntelIOMMUState *s, uint16_t index)
> +{
> +    /* Each reg is 128-bit */
> +    hwaddr addr = DMAR_FRCD_REG_OFFSET + (((uint64_t)index) << 4);
> +    addr += 8; /* Access the high 64-bit half */
> +
> +    assert(index < DMAR_FRCD_REG_NR);
> +
> +    set_clear_mask_quad(s, addr, 0, VTD_FRCD_F);
> +    update_fsts_ppf(s);
> +}
> +
> +/* Must not update F field now, should be done later */
> +static void record_frcd(IntelIOMMUState *s, uint16_t index, uint16_t source_id,
> +                        hwaddr addr, VTDFaultReason fault, bool is_write)
> +{
> +    uint64_t hi = 0, lo;
> +    hwaddr frcd_reg_addr = DMAR_FRCD_REG_OFFSET + (((uint64_t)index) << 4);
> +
> +    assert(index < DMAR_FRCD_REG_NR);
> +
> +    lo = VTD_FRCD_FI(addr);
> +    hi = VTD_FRCD_SID(source_id) | VTD_FRCD_FR(fault);
> +    if (!is_write) {
> +        hi |= VTD_FRCD_T;
> +    }
> +
> +    set_quad_raw(s, frcd_reg_addr, lo);
> +    set_quad_raw(s, frcd_reg_addr + 8, hi);
> +    VTD_DPRINTF(FLOG, "record to FRCD_REG #%"PRIu16 ": hi 0x%"PRIx64
> +                ", lo 0x%"PRIx64, index, hi, lo);
> +}
> +
> +/* Try to collapse multiple pending faults from the same requester */
> +static inline bool try_collapse_fault(IntelIOMMUState *s, uint16_t source_id)
> +{
> +    uint32_t i;
> +    uint64_t frcd_reg;
> +    hwaddr addr = DMAR_FRCD_REG_OFFSET + 8; /* The high 64-bit half */
> +
> +    for (i = 0; i < DMAR_FRCD_REG_NR; i++) {
> +        frcd_reg = get_quad_raw(s, addr);
> +        VTD_DPRINTF(FLOG, "frcd_reg #%d 0x%"PRIx64, i, frcd_reg);
> +        if ((frcd_reg & VTD_FRCD_F) &&
> +            ((frcd_reg & VTD_FRCD_SID_MASK) == source_id)) {
> +            return true;
> +        }
> +        addr += 16; /* 128-bit for each */
> +    }
> +
> +    return false;
> +}
> +
> +/* Log and report a DMAR (address translation) fault to software */
> +static void vtd_report_dmar_fault(IntelIOMMUState *s, uint16_t source_id,
> +                                  hwaddr addr, VTDFaultReason fault,
> +                                  bool is_write)
> +{
> +    uint32_t fsts_reg = get_long_raw(s, DMAR_FSTS_REG);
> +
> +    assert(fault < VTD_FR_MAX);
> +
> +    if (fault == VTD_FR_RESERVED_ERR) {
> +        /* This is not a normal fault reason case. Drop it. */
> +        return;
> +    }
> +
> +    VTD_DPRINTF(FLOG, "sid 0x%"PRIx16 ", fault %d, addr 0x%"PRIx64
> +                ", is_write %d", source_id, fault, addr, is_write);
> +
> +    /* Check PFO field in FSTS_REG */
> +    if (fsts_reg & VTD_FSTS_PFO) {
> +        VTD_DPRINTF(FLOG, "new fault is not recorded due to "
> +                    "Primary Fault Overflow");
> +        return;
> +    }
> +
> +    /* Compression of multiple faults from the same requester */
> +    if (try_collapse_fault(s, source_id)) {
> +        VTD_DPRINTF(FLOG, "new fault is not recorded due to "
> +                    "compression of faults");
> +        return;
> +    }
> +
> +    /* Check next_frcd_reg to see whether it is overflow now */
> +    if (is_frcd_set(s, s->next_frcd_reg)) {
> +        VTD_DPRINTF(FLOG, "Primary Fault Overflow and "
> +                    "new fault is not recorded, set PFO field");
> +        set_clear_mask_long(s, DMAR_FSTS_REG, 0, VTD_FSTS_PFO);
> +        return;
> +    }
> +
> +    record_frcd(s, s->next_frcd_reg, source_id, addr, fault, is_write);
> +
> +    if (fsts_reg & VTD_FSTS_PPF) {
> +        /* There are already one or more pending faults */
> +        VTD_DPRINTF(FLOG, "there are pending faults already, "
> +                    "fault event is not generated");
> +        set_frcd_and_update_ppf(s, s->next_frcd_reg);
> +        s->next_frcd_reg++;
> +        if (s->next_frcd_reg == DMAR_FRCD_REG_NR) {
> +            s->next_frcd_reg = 0;
> +        }
> +    } else {
> +        set_clear_mask_long(s, DMAR_FSTS_REG, VTD_FSTS_FRI_MASK,
> +                            VTD_FSTS_FRI(s->next_frcd_reg));
> +        set_frcd_and_update_ppf(s, s->next_frcd_reg); /* It will also set PPF */
> +        s->next_frcd_reg++;
> +        if (s->next_frcd_reg == DMAR_FRCD_REG_NR) {
> +            s->next_frcd_reg = 0;
> +        }
> +
> +        /* This case actually causes the PPF to be Set.
> +         * So generate a fault event (interrupt).
> +         */
> +         vtd_generate_fault_event(s, fsts_reg);
> +    }
> +}
> +
> +static inline bool root_entry_present(VTDRootEntry *root)
> +{
> +    return root->val & VTD_ROOT_ENTRY_P;
> +}
> +
> +static int get_root_entry(IntelIOMMUState *s, uint32_t index, VTDRootEntry *re)
> +{
> +    dma_addr_t addr;
> +
> +    assert(index < VTD_ROOT_ENTRY_NR);
> +
> +    addr = s->root + index * sizeof(*re);
> +
> +    if (dma_memory_read(&address_space_memory, addr, re, sizeof(*re))) {
> +        VTD_DPRINTF(GENERAL, "error: fail to access root-entry at 0x%"PRIx64
> +                    " + %"PRIu32, s->root, index);
> +        re->val = 0;
> +        return -VTD_FR_ROOT_TABLE_INV;
> +    }
> +
> +    re->val = le64_to_cpu(re->val);
> +    return VTD_FR_RESERVED;
> +}
> +
> +static inline bool context_entry_present(VTDContextEntry *context)
> +{
> +    return context->lo & VTD_CONTEXT_ENTRY_P;
> +}
> +
> +static int get_context_entry_from_root(VTDRootEntry *root, uint32_t index,
> +                                       VTDContextEntry *ce)
> +{
> +    dma_addr_t addr;
> +
> +    if (!root_entry_present(root)) {
> +        ce->lo = 0;
> +        ce->hi = 0;
> +        VTD_DPRINTF(GENERAL, "error: root-entry is not present");
> +        return -VTD_FR_ROOT_ENTRY_P;
> +    }
> +
> +    assert(index < VTD_CONTEXT_ENTRY_NR);
> +
> +    addr = (root->val & VTD_ROOT_ENTRY_CTP) + index * sizeof(*ce);
> +
> +    if (dma_memory_read(&address_space_memory, addr, ce, sizeof(*ce))) {
> +        VTD_DPRINTF(GENERAL, "error: fail to access context-entry at 0x%"PRIx64
> +                    " + %"PRIu32,
> +                    (uint64_t)(root->val & VTD_ROOT_ENTRY_CTP), index);
> +        ce->lo = 0;
> +        ce->hi = 0;
> +        return -VTD_FR_CONTEXT_TABLE_INV;
> +    }
> +
> +    ce->lo = le64_to_cpu(ce->lo);
> +    ce->hi = le64_to_cpu(ce->hi);
> +    return VTD_FR_RESERVED;
> +}
> +
> +static inline dma_addr_t get_slpt_base_from_context(VTDContextEntry *ce)
> +{
> +    return ce->lo & VTD_CONTEXT_ENTRY_SLPTPTR;
> +}
> +
> +/* The shift of an addr for a certain level of paging structure */
> +static inline uint32_t slpt_level_shift(uint32_t level)
> +{
> +    return VTD_PAGE_SHIFT_4K + (level - 1) * VTD_SL_LEVEL_BITS;
> +}
> +
> +static inline uint64_t get_slpte_addr(uint64_t slpte)
> +{
> +    return slpte & VTD_SL_PT_BASE_ADDR_MASK;
> +}
> +
> +/* Whether the pte indicates the address of the page frame */
> +static inline bool is_last_slpte(uint64_t slpte, uint32_t level)
> +{
> +    return level == VTD_SL_PT_LEVEL || (slpte & VTD_SL_PT_PAGE_SIZE_MASK);
> +}
> +
> +/* Get the content of an slpte located in @base_addr[@index] */
> +static inline uint64_t get_slpte(dma_addr_t base_addr, uint32_t index)
> +{
> +    uint64_t slpte;
> +
> +    assert(index < VTD_SL_PT_ENTRY_NR);
> +
> +    if (dma_memory_read(&address_space_memory,
> +                        base_addr + index * sizeof(slpte), &slpte,
> +                        sizeof(slpte))) {
> +        slpte = (uint64_t)-1;
> +        return slpte;
> +    }
> +
> +    slpte = le64_to_cpu(slpte);
> +    return slpte;
> +}
> +
> +/* Given a gpa and the level of paging structure, return the offset of current
> + * level.
> + */
> +static inline uint32_t gpa_level_offset(uint64_t gpa, uint32_t level)
> +{
> +    return (gpa >> slpt_level_shift(level)) & ((1ULL << VTD_SL_LEVEL_BITS) - 1);
> +}
> +
> +/* Check Capability Register to see if the @level of page-table is supported */
> +static inline bool is_level_supported(IntelIOMMUState *s, uint32_t level)
> +{
> +    return VTD_CAP_SAGAW_MASK & s->cap &
> +           (1ULL << (level - 2 + VTD_CAP_SAGAW_SHIFT));
> +}
> +
> +/* Get the page-table level that hardware should use for the second-level
> + * page-table walk from the Address Width field of context-entry.
> + */
> +static inline uint32_t get_level_from_context_entry(VTDContextEntry *ce)
> +{
> +    return 2 + (ce->hi & VTD_CONTEXT_ENTRY_AW);
> +}
> +
> +static inline uint32_t get_agaw_from_context_entry(VTDContextEntry *ce)
> +{
> +    return 30 + (ce->hi & VTD_CONTEXT_ENTRY_AW) * 9;
> +}
> +
> +static const uint64_t paging_entry_rsvd_field[] = {
> +    [0] = ~0ULL,
> +    /* For not large page */
> +    [1] = 0x800ULL | ~(VTD_HAW_MASK | VTD_SL_IGN_COM),
> +    [2] = 0x800ULL | ~(VTD_HAW_MASK | VTD_SL_IGN_COM),
> +    [3] = 0x800ULL | ~(VTD_HAW_MASK | VTD_SL_IGN_COM),
> +    [4] = 0x880ULL | ~(VTD_HAW_MASK | VTD_SL_IGN_COM),
> +    /* For large page */
> +    [5] = 0x800ULL | ~(VTD_HAW_MASK | VTD_SL_IGN_COM),
> +    [6] = 0x1ff800ULL | ~(VTD_HAW_MASK | VTD_SL_IGN_COM),
> +    [7] = 0x3ffff800ULL | ~(VTD_HAW_MASK | VTD_SL_IGN_COM),
> +    [8] = 0x880ULL | ~(VTD_HAW_MASK | VTD_SL_IGN_COM),
> +};
> +
> +static inline bool slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
> +{
> +    if (slpte & VTD_SL_PT_PAGE_SIZE_MASK) {
> +        /* Maybe large page */
> +        return slpte & paging_entry_rsvd_field[level + 4];
> +    } else {
> +        return slpte & paging_entry_rsvd_field[level];
> +    }
> +}
> +
> +/* Given the @gpa, get the relevant @slptep. @slpte_level will be the last
> + * level of the translation, which can be used for deciding the size of a
> + * large page. @slptep and @slpte_level will not be touched if an error
> + * happens.
> + */
> +static int gpa_to_slpte(VTDContextEntry *ce, uint64_t gpa, bool is_write,
> +                        uint64_t *slptep, uint32_t *slpte_level)
> +{
> +    dma_addr_t addr = get_slpt_base_from_context(ce);
> +    uint32_t level = get_level_from_context_entry(ce);
> +    uint32_t offset;
> +    uint64_t slpte;
> +    uint32_t ce_agaw = get_agaw_from_context_entry(ce);
> +    uint64_t access_right_check;
> +
> +    /* Check if @gpa is above 2^X-1, where X is the minimum of MGAW in CAP_REG
> +     * and AW in context-entry.
> +     */
> +    if (gpa & ~((1ULL << MIN(ce_agaw, VTD_MGAW)) - 1)) {
> +        VTD_DPRINTF(GENERAL, "error: gpa 0x%"PRIx64 " exceeds limits", gpa);
> +        return -VTD_FR_ADDR_BEYOND_MGAW;
> +    }
> +
> +    /* FIXME: what is the Atomics request here? */
> +    access_right_check = is_write ? VTD_SL_W : VTD_SL_R;
> +
> +    while (true) {
> +        offset = gpa_level_offset(gpa, level);
> +        slpte = get_slpte(addr, offset);
> +
> +        if (slpte == (uint64_t)-1) {
> +            VTD_DPRINTF(GENERAL, "error: fail to access second-level paging "
> +                        "entry at level %"PRIu32 " for gpa 0x%"PRIx64,
> +                        level, gpa);
> +            if (level == get_level_from_context_entry(ce)) {
> +                /* Invalid programming of context-entry */
> +                return -VTD_FR_CONTEXT_ENTRY_INV;
> +            } else {
> +                return -VTD_FR_PAGING_ENTRY_INV;
> +            }
> +        }
> +        if (!(slpte & access_right_check)) {
> +            VTD_DPRINTF(GENERAL, "error: lack of %s permission for "
> +                        "gpa 0x%"PRIx64 " slpte 0x%"PRIx64,
> +                        (is_write ? "write" : "read"), gpa, slpte);
> +            return is_write ? -VTD_FR_WRITE : -VTD_FR_READ;
> +        }
> +        if (slpte_nonzero_rsvd(slpte, level)) {
> +            VTD_DPRINTF(GENERAL, "error: non-zero reserved field in second "
> +                        "level paging entry level %"PRIu32 " slpte 0x%"PRIx64,
> +                        level, slpte);
> +            return -VTD_FR_PAGING_ENTRY_RSVD;
> +        }
> +
> +        if (is_last_slpte(slpte, level)) {
> +            *slptep = slpte;
> +            *slpte_level = level;
> +            return VTD_FR_RESERVED;
> +        }
> +        addr = get_slpte_addr(slpte);
> +        level--;
> +    }
> +}
> +
> +/* Map a device to its corresponding domain (context-entry). @ce will be set
> + * to Zero if an error happens while accessing the context-entry.
> + */
> +static inline int dev_to_context_entry(IntelIOMMUState *s, int bus_num,
> +                                       int devfn, VTDContextEntry *ce)
> +{
> +    VTDRootEntry re;
> +    int ret_fr;
> +
> +    assert(0 <= bus_num && bus_num < VTD_PCI_BUS_MAX);
> +    assert(0 <= devfn && devfn < VTD_PCI_SLOT_MAX * VTD_PCI_FUNC_MAX);
> +
> +    ret_fr = get_root_entry(s, bus_num, &re);
> +    if (ret_fr) {
> +        ce->hi = 0;
> +        ce->lo = 0;
> +        return ret_fr;
> +    }
> +
> +    if (!root_entry_present(&re)) {
> +        VTD_DPRINTF(GENERAL, "error: root-entry #%d is not present", bus_num);
> +        ce->hi = 0;
> +        ce->lo = 0;
> +        return -VTD_FR_ROOT_ENTRY_P;
> +    } else if (re.rsvd || (re.val & VTD_ROOT_ENTRY_RSVD)) {
> +        VTD_DPRINTF(GENERAL, "error: non-zero reserved field in root-entry "
> +                    "hi 0x%"PRIx64 " lo 0x%"PRIx64, re.rsvd, re.val);
> +        ce->hi = 0;
> +        ce->lo = 0;
> +        return -VTD_FR_ROOT_ENTRY_RSVD;
> +    }
> +
> +    ret_fr = get_context_entry_from_root(&re, devfn, ce);
> +    if (ret_fr) {
> +        return ret_fr;
> +    }
> +
> +    if (!context_entry_present(ce)) {
> +        VTD_DPRINTF(GENERAL,
> +                    "error: context-entry #%d(bus #%d) is not present", devfn,
> +                    bus_num);
> +        return -VTD_FR_CONTEXT_ENTRY_P;
> +    } else if ((ce->hi & VTD_CONTEXT_ENTRY_RSVD_HI) ||
> +               (ce->lo & VTD_CONTEXT_ENTRY_RSVD_LO)) {
> +        VTD_DPRINTF(GENERAL,
> +                    "error: non-zero reserved field in context-entry "
> +                    "hi 0x%"PRIx64 " lo 0x%"PRIx64, ce->hi, ce->lo);
> +        return -VTD_FR_CONTEXT_ENTRY_RSVD;
> +    }
> +
> +    /* Check if the programming of context-entry is valid */
> +    if (!is_level_supported(s, get_level_from_context_entry(ce))) {
> +        VTD_DPRINTF(GENERAL, "error: unsupported Address Width value in "
> +                    "context-entry hi 0x%"PRIx64 " lo 0x%"PRIx64,
> +                    ce->hi, ce->lo);
> +        return -VTD_FR_CONTEXT_ENTRY_INV;
> +    } else if (ce->lo & VTD_CONTEXT_ENTRY_TT) {
> +        VTD_DPRINTF(GENERAL, "error: unsupported Translation Type in "
> +                    "context-entry hi 0x%"PRIx64 " lo 0x%"PRIx64,
> +                    ce->hi, ce->lo);
> +        return -VTD_FR_CONTEXT_ENTRY_INV;
> +    }
> +
> +    return VTD_FR_RESERVED;
> +}
> +
> +static inline uint16_t make_source_id(int bus_num, int devfn)
> +{
> +    return ((bus_num & 0xffUL) << 8) | (devfn & 0xffUL);
> +}
> +
> +static const bool qualified_faults[] = {
> +    [VTD_FR_RESERVED] = false,
> +    [VTD_FR_ROOT_ENTRY_P] = false,
> +    [VTD_FR_CONTEXT_ENTRY_P] = true,
> +    [VTD_FR_CONTEXT_ENTRY_INV] = true,
> +    [VTD_FR_ADDR_BEYOND_MGAW] = true,
> +    [VTD_FR_WRITE] = true,
> +    [VTD_FR_READ] = true,
> +    [VTD_FR_PAGING_ENTRY_INV] = true,
> +    [VTD_FR_ROOT_TABLE_INV] = false,
> +    [VTD_FR_CONTEXT_TABLE_INV] = false,
> +    [VTD_FR_ROOT_ENTRY_RSVD] = false,
> +    [VTD_FR_PAGING_ENTRY_RSVD] = true,
> +    [VTD_FR_CONTEXT_ENTRY_TT] = true,
> +    [VTD_FR_RESERVED_ERR] = false,
> +    [VTD_FR_MAX] = false,
> +};
> +
> +/* Whether a fault condition is "qualified", i.e. one that is reported to
> + * software only if the FPD field in the context-entry used to process the
> + * faulting request is 0.
> + */
> +static inline bool is_qualified_fault(VTDFaultReason fault)
> +{
> +    return qualified_faults[fault];
> +}
> +
> +static inline bool is_interrupt_addr(hwaddr addr)
> +{
> +    return VTD_INTERRUPT_ADDR_FIRST <= addr && addr <= VTD_INTERRUPT_ADDR_LAST;
> +}
> +
> +/* Map dev to context-entry, then do a paging-structures walk to do an iommu
> + * translation.
> + * @bus_num: The bus number
> + * @devfn: The devfn, which is the combined device and function number
> + * @is_write: The access is a write operation
> + * @entry: IOMMUTLBEntry that contain the addr to be translated and result
> + */
> +static void iommu_translate(IntelIOMMUState *s, int bus_num, int devfn,
> +                            hwaddr addr, bool is_write, IOMMUTLBEntry *entry)
> +{
> +    VTDContextEntry ce;
> +    uint64_t slpte;
> +    uint32_t level;
> +    uint64_t page_mask;
> +    uint16_t source_id = make_source_id(bus_num, devfn);
> +    int ret_fr;
> +    bool is_fpd_set = false;
> +
> +    /* Check if the request is in interrupt address range */
> +    if (is_interrupt_addr(addr)) {
> +        if (is_write) {
> +            /* FIXME: since we don't know the length of the access here, we
> +             * treat Non-DWORD length write requests without PASID as
> +             * interrupt requests, too. Without interrupt remapping support,
> +             * we just use 1:1 mapping.
> +             */
> +            VTD_DPRINTF(MMU, "write request to interrupt address "
> +                        "gpa 0x%"PRIx64, addr);
> +            entry->iova = addr & VTD_PAGE_MASK_4K;
> +            entry->translated_addr = addr & VTD_PAGE_MASK_4K;
> +            entry->addr_mask = ~VTD_PAGE_MASK_4K;
> +            entry->perm = IOMMU_WO;
> +            return;
> +        } else {
> +            VTD_DPRINTF(GENERAL, "error: read request from interrupt address "
> +                        "gpa 0x%"PRIx64, addr);
> +            vtd_report_dmar_fault(s, source_id, addr, VTD_FR_READ, is_write);
> +            return;
> +        }
> +    }
> +
> +    ret_fr = dev_to_context_entry(s, bus_num, devfn, &ce);
> +    is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
> +    if (ret_fr) {
> +        ret_fr = -ret_fr;
> +        if (is_fpd_set && is_qualified_fault(ret_fr)) {
> +            VTD_DPRINTF(FLOG, "fault processing is disabled for DMA requests "
> +                        "through this context-entry (with FPD Set)");
> +        } else {
> +            vtd_report_dmar_fault(s, source_id, addr, ret_fr, is_write);
> +        }
> +        return;
> +    }
> +
> +    ret_fr = gpa_to_slpte(&ce, addr, is_write, &slpte, &level);
> +    if (ret_fr) {
> +        ret_fr = -ret_fr;
> +        if (is_fpd_set && is_qualified_fault(ret_fr)) {
> +            VTD_DPRINTF(FLOG, "fault processing is disabled for DMA requests "
> +                        "through this context-entry (with FPD Set)");
> +        } else {
> +            vtd_report_dmar_fault(s, source_id, addr, ret_fr, is_write);
> +        }
> +        return;
> +    }
> +
> +    if (level == VTD_SL_PT_LEVEL) {
> +        /* 4-KB page */
> +        page_mask = VTD_PAGE_MASK_4K;
> +    } else if (level == VTD_SL_PDP_LEVEL) {
> +        /* 1-GB page */
> +        page_mask = VTD_PAGE_MASK_1G;
> +    } else {
> +        /* 2-MB page */
> +        page_mask = VTD_PAGE_MASK_2M;
> +    }
> +
> +    entry->iova = addr & page_mask;
> +    entry->translated_addr = get_slpte_addr(slpte) & page_mask;
> +    entry->addr_mask = ~page_mask;
> +    entry->perm = slpte & VTD_SL_RW_MASK;
> +}
> +
> +static void vtd_root_table_setup(IntelIOMMUState *s)
> +{
> +    s->root = get_quad_raw(s, DMAR_RTADDR_REG);
> +    s->root_extended = s->root & VTD_RTADDR_RTT;
> +    s->root &= VTD_RTADDR_ADDR_MASK;
> +
> +    VTD_DPRINTF(CSR, "root_table addr 0x%"PRIx64 " %s", s->root,
> +                (s->root_extended ? "(extended)" : ""));
> +}
> +
> +/* Context-cache invalidation
> + * Returns the Context Actual Invalidation Granularity.
> + * @val: the content of the CCMD_REG
> + */
> +static uint64_t vtd_context_cache_invalidate(IntelIOMMUState *s, uint64_t val)
> +{
> +    uint64_t caig;
> +    uint64_t type = val & VTD_CCMD_CIRG_MASK;
> +
> +    switch (type) {
> +    case VTD_CCMD_GLOBAL_INVL:
> +        VTD_DPRINTF(INV, "Global invalidation request");
> +        caig = VTD_CCMD_GLOBAL_INVL_A;
> +        break;
> +
> +    case VTD_CCMD_DOMAIN_INVL:
> +        VTD_DPRINTF(INV, "Domain-selective invalidation request");
> +        caig = VTD_CCMD_DOMAIN_INVL_A;
> +        break;
> +
> +    case VTD_CCMD_DEVICE_INVL:
> +        VTD_DPRINTF(INV, "Domain-selective invalidation request");
> +        caig = VTD_CCMD_DEVICE_INVL_A;
> +        break;
> +
> +    default:
> +        VTD_DPRINTF(GENERAL,
> +                    "error: wrong context-cache invalidation granularity");
> +        caig = 0;
> +    }
> +
> +    return caig;
> +}
> +
> +/* Flush IOTLB
> + * Returns the IOTLB Actual Invalidation Granularity.
> + * @val: the content of the IOTLB_REG
> + */
> +static uint64_t vtd_iotlb_flush(IntelIOMMUState *s, uint64_t val)
> +{
> +    uint64_t iaig;
> +    uint64_t type = val & VTD_TLB_FLUSH_GRANU_MASK;
> +
> +    switch (type) {
> +    case VTD_TLB_GLOBAL_FLUSH:
> +        VTD_DPRINTF(INV, "Global IOTLB flush");
> +        iaig = VTD_TLB_GLOBAL_FLUSH_A;
> +        break;
> +
> +    case VTD_TLB_DSI_FLUSH:
> +        VTD_DPRINTF(INV, "Domain-selective IOTLB flush");
> +        iaig = VTD_TLB_DSI_FLUSH_A;
> +        break;
> +
> +    case VTD_TLB_PSI_FLUSH:
> +        VTD_DPRINTF(INV, "Page-selective-within-domain IOTLB flush");
> +        iaig = VTD_TLB_PSI_FLUSH_A;
> +        break;
> +
> +    default:
> +        VTD_DPRINTF(GENERAL, "error: wrong iotlb flush granularity");
> +        iaig = 0;
> +    }
> +
> +    return iaig;
> +}
> +
> +/* Set Root Table Pointer */
> +static void handle_gcmd_srtp(IntelIOMMUState *s)
> +{
> +    VTD_DPRINTF(CSR, "set Root Table Pointer");
> +
> +    vtd_root_table_setup(s);
> +    /* Ok - report back to driver */
> +    set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_RTPS);
> +}
> +
> +/* Handle Translation Enable/Disable */
> +static void handle_gcmd_te(IntelIOMMUState *s, bool en)
> +{
> +    VTD_DPRINTF(CSR, "Translation Enable %s", (en ? "on" : "off"));
> +
> +    if (en) {
> +        s->dmar_enabled = true;
> +        /* Ok - report back to driver */
> +        set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_TES);
> +    } else {
> +        s->dmar_enabled = false;
> +
> +        /* Clear the index of Fault Recording Register */
> +        s->next_frcd_reg = 0;
> +        /* Ok - report back to driver */
> +        set_clear_mask_long(s, DMAR_GSTS_REG, VTD_GSTS_TES, 0);
> +    }
> +}
> +
> +/* Handle write to Global Command Register */
> +static void handle_gcmd_write(IntelIOMMUState *s)
> +{
> +    uint32_t status = get_long_raw(s, DMAR_GSTS_REG);
> +    uint32_t val = get_long_raw(s, DMAR_GCMD_REG);
> +    uint32_t changed = status ^ val;
> +
> +    VTD_DPRINTF(CSR, "value 0x%"PRIx32 " status 0x%"PRIx32, val, status);
> +    if (changed & VTD_GCMD_TE) {
> +        /* Translation enable/disable */
> +        handle_gcmd_te(s, val & VTD_GCMD_TE);
> +    }
> +    if (val & VTD_GCMD_SRTP) {
> +        /* Set/update the root-table pointer */
> +        handle_gcmd_srtp(s);
> +    }
> +}
> +
> +/* Handle write to Context Command Register */
> +static void handle_ccmd_write(IntelIOMMUState *s)
> +{
> +    uint64_t ret;
> +    uint64_t val = get_quad_raw(s, DMAR_CCMD_REG);
> +
> +    /* Context-cache invalidation request */
> +    if (val & VTD_CCMD_ICC) {
> +        ret = vtd_context_cache_invalidate(s, val);
> +
> +        /* Invalidation completed. Clear ICC and report the actual granularity */
> +        set_clear_mask_quad(s, DMAR_CCMD_REG, VTD_CCMD_ICC, 0ULL);
> +        ret = set_clear_mask_quad(s, DMAR_CCMD_REG, VTD_CCMD_CAIG_MASK, ret);
> +        VTD_DPRINTF(INV, "CCMD_REG write-back val: 0x%"PRIx64, ret);
> +    }
> +}
> +
> +/* Handle write to IOTLB Invalidation Register */
> +static void handle_iotlb_write(IntelIOMMUState *s)
> +{
> +    uint64_t ret;
> +    uint64_t val = get_quad_raw(s, DMAR_IOTLB_REG);
> +
> +    /* IOTLB invalidation request */
> +    if (val & VTD_TLB_IVT) {
> +        ret = vtd_iotlb_flush(s, val);
> +
> +        /* Invalidation completed. Clear IVT and report the actual granularity */
> +        set_clear_mask_quad(s, DMAR_IOTLB_REG, VTD_TLB_IVT, 0ULL);
> +        ret = set_clear_mask_quad(s, DMAR_IOTLB_REG,
> +                                  VTD_TLB_FLUSH_GRANU_MASK_A, ret);
> +        VTD_DPRINTF(INV, "IOTLB_REG write-back val: 0x%"PRIx64, ret);
> +    }
> +}
> +
> +static inline void handle_fsts_write(IntelIOMMUState *s)
> +{
> +    uint32_t fsts_reg = get_long_raw(s, DMAR_FSTS_REG);
> +    uint32_t fectl_reg = get_long_raw(s, DMAR_FECTL_REG);
> +    uint32_t status_fields = VTD_FSTS_PFO | VTD_FSTS_PPF | VTD_FSTS_IQE;
> +
> +    if ((fectl_reg & VTD_FECTL_IP) && !(fsts_reg & status_fields)) {
> +        set_clear_mask_long(s, DMAR_FECTL_REG, VTD_FECTL_IP, 0);
> +        VTD_DPRINTF(FLOG, "all pending interrupt conditions serviced, clear "
> +                    "IP field of FECTL_REG");
> +    }
> +}
> +
> +static inline void handle_fectl_write(IntelIOMMUState *s)
> +{
> +    uint32_t fectl_reg;
> +    /* When software clears the IM field, check the IP field. But do we
> +     * need to compare the old value and the new value to conclude that
> +     * software clears the IM field? Or just check if the IM field is zero?
> +     */
> +    fectl_reg = get_long_raw(s, DMAR_FECTL_REG);
> +    if ((fectl_reg & VTD_FECTL_IP) && !(fectl_reg & VTD_FECTL_IM)) {
> +        vtd_generate_interrupt(s, DMAR_FEADDR_REG, DMAR_FEDATA_REG);
> +        set_clear_mask_long(s, DMAR_FECTL_REG, VTD_FECTL_IP, 0);
> +        VTD_DPRINTF(FLOG, "IM field is cleared, generate "
> +                    "fault event interrupt");
> +    }
> +}
> +
> +static uint64_t vtd_mem_read(void *opaque, hwaddr addr, unsigned size)
> +{
> +    IntelIOMMUState *s = opaque;
> +    uint64_t val;
> +
> +    if (addr + size > DMAR_REG_SIZE) {
> +        VTD_DPRINTF(GENERAL, "error: addr outside region: max 0x%"PRIx64
> +                    ", got 0x%"PRIx64 " %d",
> +                    (uint64_t)DMAR_REG_SIZE, addr, size);
> +        return (uint64_t)-1;
> +    }
> +
> +    assert(size == 4 || size == 8);
> +
> +    switch (addr) {
> +    /* Root Table Address Register, 64-bit */
> +    case DMAR_RTADDR_REG:
> +        if (size == 4) {
> +            val = s->root & ((1ULL << 32) - 1);
> +        } else {
> +            val = s->root;
> +        }
> +        break;
> +
> +    case DMAR_RTADDR_REG_HI:
> +        assert(size == 4);
> +        val = s->root >> 32;
> +        break;
> +
> +    default:
> +        if (size == 4) {
> +            val = get_long(s, addr);
> +        } else {
> +            val = get_quad(s, addr);
> +        }
> +    }
> +
> +    VTD_DPRINTF(CSR, "addr 0x%"PRIx64 " size %d val 0x%"PRIx64,
> +                addr, size, val);
> +    return val;
> +}
> +
> +static void vtd_mem_write(void *opaque, hwaddr addr,
> +                          uint64_t val, unsigned size)
> +{
> +    IntelIOMMUState *s = opaque;
> +
> +    if (addr + size > DMAR_REG_SIZE) {
> +        VTD_DPRINTF(GENERAL, "error: addr outside region: max 0x%"PRIx64
> +                    ", got 0x%"PRIx64 " %d",
> +                    (uint64_t)DMAR_REG_SIZE, addr, size);
> +        return;
> +    }
> +
> +    assert(size == 4 || size == 8);
> +
> +    switch (addr) {
> +    /* Global Command Register, 32-bit */
> +    case DMAR_GCMD_REG:
> +        VTD_DPRINTF(CSR, "DMAR_GCMD_REG write addr 0x%"PRIx64
> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
> +        set_long(s, addr, val);
> +        handle_gcmd_write(s);
> +        break;
> +
> +    /* Context Command Register, 64-bit */
> +    case DMAR_CCMD_REG:
> +        VTD_DPRINTF(CSR, "DMAR_CCMD_REG write addr 0x%"PRIx64
> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
> +        if (size == 4) {
> +            set_long(s, addr, val);
> +        } else {
> +            set_quad(s, addr, val);
> +            handle_ccmd_write(s);
> +        }
> +        break;
> +
> +    case DMAR_CCMD_REG_HI:
> +        VTD_DPRINTF(CSR, "DMAR_CCMD_REG_HI write addr 0x%"PRIx64
> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
> +        assert(size == 4);
> +        set_long(s, addr, val);
> +        handle_ccmd_write(s);
> +        break;
> +
> +
> +    /* IOTLB Invalidation Register, 64-bit */
> +    case DMAR_IOTLB_REG:
> +        VTD_DPRINTF(INV, "DMAR_IOTLB_REG write addr 0x%"PRIx64
> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
> +        if (size == 4) {
> +            set_long(s, addr, val);
> +        } else {
> +            set_quad(s, addr, val);
> +            handle_iotlb_write(s);
> +        }
> +        break;
> +
> +    case DMAR_IOTLB_REG_HI:
> +        VTD_DPRINTF(INV, "DMAR_IOTLB_REG_HI write addr 0x%"PRIx64
> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
> +        assert(size == 4);
> +        set_long(s, addr, val);
> +        handle_iotlb_write(s);
> +        break;
> +
> +    /* Fault Status Register, 32-bit */
> +    case DMAR_FSTS_REG:
> +        VTD_DPRINTF(FLOG, "DMAR_FSTS_REG write addr 0x%"PRIx64
> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
> +        assert(size == 4);
> +        set_long(s, addr, val);
> +        handle_fsts_write(s);
> +        break;
> +
> +    /* Fault Event Control Register, 32-bit */
> +    case DMAR_FECTL_REG:
> +        VTD_DPRINTF(FLOG, "DMAR_FECTL_REG write addr 0x%"PRIx64
> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
> +        assert(size == 4);
> +        set_long(s, addr, val);
> +        handle_fectl_write(s);
> +        break;
> +
> +    /* Fault Event Data Register, 32-bit */
> +    case DMAR_FEDATA_REG:
> +        VTD_DPRINTF(FLOG, "DMAR_FEDATA_REG write addr 0x%"PRIx64
> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
> +        assert(size == 4);
> +        set_long(s, addr, val);
> +        break;
> +
> +    /* Fault Event Address Register, 32-bit */
> +    case DMAR_FEADDR_REG:
> +        VTD_DPRINTF(FLOG, "DMAR_FEADDR_REG write addr 0x%"PRIx64
> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
> +        assert(size == 4);
> +        set_long(s, addr, val);
> +        break;
> +
> +    /* Fault Event Upper Address Register, 32-bit */
> +    case DMAR_FEUADDR_REG:
> +        VTD_DPRINTF(FLOG, "DMAR_FEUADDR_REG write addr 0x%"PRIx64
> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
> +        assert(size == 4);
> +        set_long(s, addr, val);
> +        break;
> +
> +    /* Protected Memory Enable Register, 32-bit */
> +    case DMAR_PMEN_REG:
> +        VTD_DPRINTF(CSR, "DMAR_PMEN_REG write addr 0x%"PRIx64
> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
> +        assert(size == 4);
> +        set_long(s, addr, val);
> +        break;
> +
> +
> +    /* Root Table Address Register, 64-bit */
> +    case DMAR_RTADDR_REG:
> +        VTD_DPRINTF(CSR, "DMAR_RTADDR_REG write addr 0x%"PRIx64
> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
> +        if (size == 4) {
> +            set_long(s, addr, val);
> +        } else {
> +            set_quad(s, addr, val);
> +        }
> +        break;
> +
> +    case DMAR_RTADDR_REG_HI:
> +        VTD_DPRINTF(CSR, "DMAR_RTADDR_REG_HI write addr 0x%"PRIx64
> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
> +        assert(size == 4);
> +        set_long(s, addr, val);
> +        break;
> +
> +    /* Fault Recording Registers, 128-bit */
> +    case DMAR_FRCD_REG_0_0:
> +        VTD_DPRINTF(FLOG, "DMAR_FRCD_REG_0_0 write addr 0x%"PRIx64
> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
> +        if (size == 4) {
> +            set_long(s, addr, val);
> +        } else {
> +            set_quad(s, addr, val);
> +        }
> +        break;
> +
> +    case DMAR_FRCD_REG_0_1:
> +        VTD_DPRINTF(FLOG, "DMAR_FRCD_REG_0_1 write addr 0x%"PRIx64
> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
> +        assert(size == 4);
> +        set_long(s, addr, val);
> +        break;
> +
> +    case DMAR_FRCD_REG_0_2:
> +        VTD_DPRINTF(FLOG, "DMAR_FRCD_REG_0_2 write addr 0x%"PRIx64
> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
> +        if (size == 4) {
> +            set_long(s, addr, val);
> +        } else {
> +            set_quad(s, addr, val);
> +            /* May clear bit 127 (Fault), update PPF */
> +            update_fsts_ppf(s);
> +        }
> +        break;
> +
> +    case DMAR_FRCD_REG_0_3:
> +        VTD_DPRINTF(FLOG, "DMAR_FRCD_REG_0_3 write addr 0x%"PRIx64
> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
> +        assert(size == 4);
> +        set_long(s, addr, val);
> +        /* May clear bit 127 (Fault), update PPF */
> +        update_fsts_ppf(s);
> +        break;
> +
> +    default:
> +        VTD_DPRINTF(GENERAL, "error: unhandled reg write addr 0x%"PRIx64
> +                    ", size %d, val 0x%"PRIx64, addr, size, val);
> +        if (size == 4) {
> +            set_long(s, addr, val);
> +        } else {
> +            set_quad(s, addr, val);
> +        }
> +    }
> +
> +}
> +
> +static IOMMUTLBEntry vtd_iommu_translate(MemoryRegion *iommu, hwaddr addr,
> +                                         bool is_write)
> +{
> +    VTDAddressSpace *vtd_as = container_of(iommu, VTDAddressSpace, iommu);
> +    IntelIOMMUState *s = vtd_as->iommu_state;
> +    int bus_num = vtd_as->bus_num;
> +    int devfn = vtd_as->devfn;
> +    IOMMUTLBEntry ret = {
> +        .target_as = &address_space_memory,
> +        .iova = addr,
> +        .translated_addr = 0,
> +        .addr_mask = ~(hwaddr)0,
> +        .perm = IOMMU_NONE,
> +    };
> +
> +    if (!s->dmar_enabled) {
> +        /* DMAR disabled, passthrough, use 4K page */
> +        ret.iova = addr & VTD_PAGE_MASK_4K;
> +        ret.translated_addr = addr & VTD_PAGE_MASK_4K;
> +        ret.addr_mask = ~VTD_PAGE_MASK_4K;
> +        ret.perm = IOMMU_RW;
> +        return ret;
> +    }
> +
> +    iommu_translate(s, bus_num, devfn, addr, is_write, &ret);
> +
> +    VTD_DPRINTF(MMU,
> +                "bus %d slot %d func %d devfn %d gpa %"PRIx64 " hpa %"PRIx64,
> +                bus_num, VTD_PCI_SLOT(devfn), VTD_PCI_FUNC(devfn), devfn, addr,
> +                ret.translated_addr);
> +    return ret;
> +}
> +
> +static const VMStateDescription vtd_vmstate = {
> +    .name = "iommu_intel",
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .minimum_version_id_old = 1,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_UINT8_ARRAY(csr, IntelIOMMUState, DMAR_REG_SIZE),
> +        VMSTATE_END_OF_LIST()
> +    }
> +};
> +
> +static const MemoryRegionOps vtd_mem_ops = {
> +    .read = vtd_mem_read,
> +    .write = vtd_mem_write,
> +    .endianness = DEVICE_LITTLE_ENDIAN,
> +    .impl = {
> +        .min_access_size = 4,
> +        .max_access_size = 8,
> +    },
> +    .valid = {
> +        .min_access_size = 4,
> +        .max_access_size = 8,
> +    },
> +};
> +
> +static Property iommu_properties[] = {
> +    DEFINE_PROP_UINT32("version", IntelIOMMUState, version, 0),
> +    DEFINE_PROP_END_OF_LIST(),
> +};
> +
> +/* Do the real initialization. It is also called on reset, so be careful when
> + * adding new initialization code.
> + */
> +static void do_vtd_init(IntelIOMMUState *s)
> +{
> +    memset(s->csr, 0, DMAR_REG_SIZE);
> +    memset(s->wmask, 0, DMAR_REG_SIZE);
> +    memset(s->w1cmask, 0, DMAR_REG_SIZE);
> +    memset(s->womask, 0, DMAR_REG_SIZE);
> +
> +    s->iommu_ops.translate = vtd_iommu_translate;
> +    s->root = 0;
> +    s->root_extended = false;
> +    s->dmar_enabled = false;
> +    s->iq_head = 0;
> +    s->iq_tail = 0;
> +    s->iq = 0;
> +    s->iq_size = 0;
> +    s->qi_enabled = false;
> +    s->iq_last_desc_type = VTD_INV_DESC_NONE;
> +    s->next_frcd_reg = 0;
> +
> +    /* b.0:2 = 6: Number of domains supported: 64K using 16 bit ids
> +     * b.3   = 0: Advanced fault logging not supported
> +     * b.4   = 0: Required write buffer flushing not supported
> +     * b.5   = 0: Protected low memory region not supported
> +     * b.6   = 0: Protected high memory region not supported
> +     * b.8:12 = 2: SAGAW(Supported Adjusted Guest Address Widths), 39-bit,
> +     *             3-level page-table
> +     * b.16:21 = 38: MGAW(Maximum Guest Address Width) = 39
> +     * b.22 = 0: ZLR(Zero Length Read) zero length DMA read requests
> +     *           to write-only pages not supported
> +     * b.24:33 = 34: FRO(Fault-recording Register offset)
> +     * b.54 = 0: DWD(Write Draining), draining of write requests not supported
> +     * b.55 = 0: DRD(Read Draining), draining of read requests not supported
> +     */
> +    s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND | VTD_CAP_MGAW |
> +             VTD_CAP_SAGAW;
> +
> +    /* b.1 = 0: QI(Queued Invalidation support) not supported
> +     * b.2 = 0: DT(Device-TLB support) not supported
> +     * b.3 = 0: IR(Interrupt Remapping support) not supported
> +     * b.4 = 0: EIM(Extended Interrupt Mode) not supported
> +     * b.8:17 = 15: IRO(IOTLB Register Offset)
> +     * b.20:23 = 0: MHMV(Maximum Handle Mask Value) not valid
> +     */
> +    s->ecap = VTD_ECAP_IRO;
> +
> +    /* Define registers with default values and bit semantics */
> +    define_long(s, DMAR_VER_REG, 0x10UL, 0, 0);  /* set MAX = 1, RO */
> +    define_quad(s, DMAR_CAP_REG, s->cap, 0, 0);
> +    define_quad(s, DMAR_ECAP_REG, s->ecap, 0, 0);
> +    define_long(s, DMAR_GCMD_REG, 0, 0xff800000UL, 0);
> +    define_long_wo(s, DMAR_GCMD_REG, 0xff800000UL);
> +    define_long(s, DMAR_GSTS_REG, 0, 0, 0); /* All bits RO, default 0 */
> +    define_quad(s, DMAR_RTADDR_REG, 0, 0xfffffffffffff000ULL, 0);
> +    define_quad(s, DMAR_CCMD_REG, 0, 0xe0000003ffffffffULL, 0);
> +    define_quad_wo(s, DMAR_CCMD_REG, 0x3ffff0000ULL);
> +
> +    /* Advanced Fault Logging not supported */
> +    define_long(s, DMAR_FSTS_REG, 0, 0, 0x11UL);
> +    define_long(s, DMAR_FECTL_REG, 0x80000000UL, 0x80000000UL, 0);
> +    define_long(s, DMAR_FEDATA_REG, 0, 0x0000ffffUL, 0); /* 15:0 RW */
> +    define_long(s, DMAR_FEADDR_REG, 0, 0xfffffffcUL, 0); /* 31:2 RW */
> +
> +    /* Treated as RsvdZ when EIM in ECAP_REG is not supported
> +     * define_long(s, DMAR_FEUADDR_REG, 0, 0xffffffffUL, 0);
> +     */
> +    define_long(s, DMAR_FEUADDR_REG, 0, 0, 0);
> +
> +    /* Treated as RO for implementations that report the PLMR and PHMR fields
> +     * as Clear in the CAP_REG.
> +     * define_long(s, DMAR_PMEN_REG, 0, 0x80000000UL, 0);
> +     */
> +    define_long(s, DMAR_PMEN_REG, 0, 0, 0);
> +
> +    /* IOTLB registers */
> +    define_quad(s, DMAR_IOTLB_REG, 0, 0Xb003ffff00000000ULL, 0);
> +    define_quad(s, DMAR_IVA_REG, 0, 0xfffffffffffff07fULL, 0);
> +    define_quad_wo(s, DMAR_IVA_REG, 0xfffffffffffff07fULL);
> +
> +    /* Fault Recording Registers, 128-bit */
> +    define_quad(s, DMAR_FRCD_REG_0_0, 0, 0, 0);
> +    define_quad(s, DMAR_FRCD_REG_0_2, 0, 0, 0x8000000000000000ULL);
> +}
> +
> +/* Reset function of QOM.
> + * Note that address_spaces should not be reset here.
> + */
> +static void vtd_reset(DeviceState *dev)
> +{
> +    IntelIOMMUState *s = INTEL_IOMMU_DEVICE(dev);
> +
> +    VTD_DPRINTF(GENERAL, "");
> +    do_vtd_init(s);
> +}
> +
> +/* Initialization function of QOM */
> +static void vtd_realize(DeviceState *dev, Error **errp)
> +{
> +    IntelIOMMUState *s = INTEL_IOMMU_DEVICE(dev);
> +
> +    VTD_DPRINTF(GENERAL, "");
> +    memset(s->address_spaces, 0, sizeof(s->address_spaces));
> +    memory_region_init_io(&s->csrmem, OBJECT(s), &vtd_mem_ops, s,
> +                          "intel_iommu", DMAR_REG_SIZE);
> +    sysbus_init_mmio(SYS_BUS_DEVICE(s), &s->csrmem);
> +    do_vtd_init(s);
> +}
> +
> +static void vtd_class_init(ObjectClass *klass, void *data)
> +{
> +    DeviceClass *dc = DEVICE_CLASS(klass);
> +
> +    dc->reset = vtd_reset;
> +    dc->realize = vtd_realize;
> +    dc->vmsd = &vtd_vmstate;
> +    dc->props = iommu_properties;
> +}
> +
> +static const TypeInfo vtd_info = {
> +    .name          = TYPE_INTEL_IOMMU_DEVICE,
> +    .parent        = TYPE_SYS_BUS_DEVICE,
> +    .instance_size = sizeof(IntelIOMMUState),
> +    .class_init    = vtd_class_init,
> +};
> +
> +static void vtd_register_types(void)
> +{
> +    VTD_DPRINTF(GENERAL, "");
> +    type_register_static(&vtd_info);
> +}
> +
> +type_init(vtd_register_types)
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> new file mode 100644
> index 0000000..7bc679a
> --- /dev/null
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -0,0 +1,345 @@
> +/*
> + * QEMU emulation of an Intel IOMMU (VT-d)
> + *   (DMA Remapping device)
> + *
> + * Copyright (C) 2013 Knut Omang, Oracle <knut.omang@oracle.com>
> + * Copyright (C) 2014 Le Tan, <tamlokveer@gmail.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> +
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> +
> + * You should have received a copy of the GNU General Public License along
> + * with this program; if not, see <http://www.gnu.org/licenses/>.
> + *
> + * Lots of defines copied from kernel/include/linux/intel-iommu.h:
> + *   Copyright (C) 2006-2008 Intel Corporation
> + *   Author: Ashok Raj <ashok.raj@intel.com>
> + *   Author: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
> + *
> + */
> +
> +#ifndef HW_I386_INTEL_IOMMU_INTERNAL_H
> +#define HW_I386_INTEL_IOMMU_INTERNAL_H
> +#include "hw/i386/intel_iommu.h"
> +
> +/*
> + * Intel IOMMU register specification
> + */
> +#define DMAR_VER_REG    0x0 /* Arch version supported by this IOMMU */
> +#define DMAR_CAP_REG    0x8 /* Hardware supported capabilities */
> +#define DMAR_CAP_REG_HI 0xc /* High 32-bit of DMAR_CAP_REG */
> +#define DMAR_ECAP_REG   0x10    /* Extended capabilities supported */
> +#define DMAR_ECAP_REG_HI    0X14
> +#define DMAR_GCMD_REG   0x18    /* Global command register */
> +#define DMAR_GSTS_REG   0x1c    /* Global status register */
> +#define DMAR_RTADDR_REG 0x20    /* Root entry table */
> +#define DMAR_RTADDR_REG_HI  0X24
> +#define DMAR_CCMD_REG   0x28  /* Context command reg */
> +#define DMAR_CCMD_REG_HI    0x2c
> +#define DMAR_FSTS_REG   0x34  /* Fault Status register */
> +#define DMAR_FECTL_REG  0x38 /* Fault control register */
> +#define DMAR_FEDATA_REG 0x3c    /* Fault event interrupt data register */
> +#define DMAR_FEADDR_REG 0x40    /* Fault event interrupt addr register */
> +#define DMAR_FEUADDR_REG    0x44   /* Upper address register */
> +#define DMAR_AFLOG_REG  0x58 /* Advanced Fault control */
> +#define DMAR_AFLOG_REG_HI   0X5c
> +#define DMAR_PMEN_REG   0x64  /* Enable Protected Memory Region */
> +#define DMAR_PLMBASE_REG    0x68    /* PMRR Low addr */
> +#define DMAR_PLMLIMIT_REG 0x6c  /* PMRR low limit */
> +#define DMAR_PHMBASE_REG 0x70   /* pmrr high base addr */
> +#define DMAR_PHMBASE_REG_HI 0X74
> +#define DMAR_PHMLIMIT_REG 0x78  /* pmrr high limit */
> +#define DMAR_PHMLIMIT_REG_HI 0x7c
> +#define DMAR_IQH_REG    0x80   /* Invalidation queue head register */
> +#define DMAR_IQH_REG_HI 0X84
> +#define DMAR_IQT_REG    0x88   /* Invalidation queue tail register */
> +#define DMAR_IQT_REG_HI 0X8c
> +#define DMAR_IQ_SHIFT   4 /* Invalidation queue head/tail shift */
> +#define DMAR_IQA_REG    0x90   /* Invalidation queue addr register */
> +#define DMAR_IQA_REG_HI 0x94
> +#define DMAR_ICS_REG    0x9c   /* Invalidation complete status register */
> +#define DMAR_IRTA_REG   0xb8    /* Interrupt remapping table addr register */
> +#define DMAR_IRTA_REG_HI    0xbc
> +
> +#define DMAR_IECTL_REG  0xa0    /* Invalidation event control register */
> +#define DMAR_IEDATA_REG 0xa4    /* Invalidation event data register */
> +#define DMAR_IEADDR_REG 0xa8    /* Invalidation event address register */
> +#define DMAR_IEUADDR_REG 0xac    /* Invalidation event address register */
> +#define DMAR_PQH_REG    0xc0    /* Page request queue head register */
> +#define DMAR_PQH_REG_HI 0xc4
> +#define DMAR_PQT_REG    0xc8    /* Page request queue tail register*/
> +#define DMAR_PQT_REG_HI     0xcc
> +#define DMAR_PQA_REG    0xd0    /* Page request queue address register */
> +#define DMAR_PQA_REG_HI 0xd4
> +#define DMAR_PRS_REG    0xdc    /* Page request status register */
> +#define DMAR_PECTL_REG  0xe0    /* Page request event control register */
> +#define DMAR_PEDATA_REG 0xe4    /* Page request event data register */
> +#define DMAR_PEADDR_REG 0xe8    /* Page request event address register */
> +#define DMAR_PEUADDR_REG  0xec  /* Page event upper address register */
> +#define DMAR_MTRRCAP_REG 0x100  /* MTRR capability register */
> +#define DMAR_MTRRCAP_REG_HI 0x104
> +#define DMAR_MTRRDEF_REG 0x108  /* MTRR default type register */
> +#define DMAR_MTRRDEF_REG_HI 0x10c
> +
> +/* IOTLB */
> +#define DMAR_IOTLB_REG_OFFSET 0xf0  /* Offset to the IOTLB registers */
> +#define DMAR_IVA_REG DMAR_IOTLB_REG_OFFSET  /* Invalidate Address Register */
> +#define DMAR_IVA_REG_HI (DMAR_IVA_REG + 4)
> +/* IOTLB Invalidate Register */
> +#define DMAR_IOTLB_REG (DMAR_IOTLB_REG_OFFSET + 0x8)
> +#define DMAR_IOTLB_REG_HI (DMAR_IOTLB_REG + 4)
> +
> +/* FRCD */
> +#define DMAR_FRCD_REG_OFFSET 0x220 /* Offset to the Fault Recording Registers */
> +/* NOTICE: If you change the DMAR_FRCD_REG_NR, please remember to change the
> + * DMAR_REG_SIZE in include/hw/i386/intel_iommu.h.
> + * #define DMAR_REG_SIZE   (DMAR_FRCD_REG_OFFSET + 16 * DMAR_FRCD_REG_NR)
> + */
> +#define DMAR_FRCD_REG_NR 1ULL /* Num of Fault Recording Registers */
> +
> +#define DMAR_FRCD_REG_0_0    0x220 /* The 0th Fault Recording Register */
> +#define DMAR_FRCD_REG_0_1    0x224
> +#define DMAR_FRCD_REG_0_2    0x228
> +#define DMAR_FRCD_REG_0_3    0x22c
> +
> +/* Interrupt Address Range */
> +#define VTD_INTERRUPT_ADDR_FIRST    0xfee00000ULL
> +#define VTD_INTERRUPT_ADDR_LAST     0xfeefffffULL
> +
> +/* IOTLB_REG */
> +#define VTD_TLB_GLOBAL_FLUSH (1ULL << 60) /* Global invalidation */
> +#define VTD_TLB_DSI_FLUSH (2ULL << 60)  /* Domain-selective invalidation */
> +#define VTD_TLB_PSI_FLUSH (3ULL << 60)  /* Page-selective invalidation */
> +#define VTD_TLB_FLUSH_GRANU_MASK (3ULL << 60)
> +#define VTD_TLB_GLOBAL_FLUSH_A (1ULL << 57)
> +#define VTD_TLB_DSI_FLUSH_A (2ULL << 57)
> +#define VTD_TLB_PSI_FLUSH_A (3ULL << 57)
> +#define VTD_TLB_FLUSH_GRANU_MASK_A (3ULL << 57)
> +#define VTD_TLB_IVT (1ULL << 63)
> +
> +/* GCMD_REG */
> +#define VTD_GCMD_TE (1UL << 31)
> +#define VTD_GCMD_SRTP (1UL << 30)
> +#define VTD_GCMD_SFL (1UL << 29)
> +#define VTD_GCMD_EAFL (1UL << 28)
> +#define VTD_GCMD_WBF (1UL << 27)
> +#define VTD_GCMD_QIE (1UL << 26)
> +#define VTD_GCMD_IRE (1UL << 25)
> +#define VTD_GCMD_SIRTP (1UL << 24)
> +#define VTD_GCMD_CFI (1UL << 23)
> +
> +/* GSTS_REG */
> +#define VTD_GSTS_TES (1UL << 31)
> +#define VTD_GSTS_RTPS (1UL << 30)
> +#define VTD_GSTS_FLS (1UL << 29)
> +#define VTD_GSTS_AFLS (1UL << 28)
> +#define VTD_GSTS_WBFS (1UL << 27)
> +#define VTD_GSTS_QIES (1UL << 26)
> +#define VTD_GSTS_IRES (1UL << 25)
> +#define VTD_GSTS_IRTPS (1UL << 24)
> +#define VTD_GSTS_CFIS (1UL << 23)
> +
> +/* CCMD_REG */
> +#define VTD_CCMD_ICC (1ULL << 63)
> +#define VTD_CCMD_GLOBAL_INVL (1ULL << 61)
> +#define VTD_CCMD_DOMAIN_INVL (2ULL << 61)
> +#define VTD_CCMD_DEVICE_INVL (3ULL << 61)
> +#define VTD_CCMD_CIRG_MASK (3ULL << 61)
> +#define VTD_CCMD_GLOBAL_INVL_A (1ULL << 59)
> +#define VTD_CCMD_DOMAIN_INVL_A (2ULL << 59)
> +#define VTD_CCMD_DEVICE_INVL_A (3ULL << 59)
> +#define VTD_CCMD_CAIG_MASK (3ULL << 59)
> +
> +/* RTADDR_REG */
> +#define VTD_RTADDR_RTT (1ULL << 11)
> +#define VTD_RTADDR_ADDR_MASK (VTD_HAW_MASK ^ 0xfffULL)
> +
> +/* ECAP_REG */
> +#define VTD_ECAP_IRO (DMAR_IOTLB_REG_OFFSET << 4)  /* (offset >> 4) << 8 */
> +#define VTD_ECAP_QI  (1ULL << 1)
> +
> +/* CAP_REG */
> +#define VTD_CAP_FRO  (DMAR_FRCD_REG_OFFSET << 20) /* (offset >> 4) << 24 */
> +#define VTD_CAP_NFR  ((DMAR_FRCD_REG_NR - 1) << 40)
> +#define VTD_DOMAIN_ID_SHIFT     16  /* 16-bit domain id for 64K domains */
> +#define VTD_CAP_ND  (((VTD_DOMAIN_ID_SHIFT - 4) / 2) & 7ULL)
> +#define VTD_MGAW    39  /* Maximum Guest Address Width */
> +#define VTD_CAP_MGAW    (((VTD_MGAW - 1) & 0x3fULL) << 16)
> +
> +/* Supported Adjusted Guest Address Widths */
> +#define VTD_CAP_SAGAW_SHIFT (8)
> +#define VTD_CAP_SAGAW_MASK  (0x1fULL << VTD_CAP_SAGAW_SHIFT)
> + /* 39-bit AGAW, 3-level page-table */
> +#define VTD_CAP_SAGAW_39bit (0x2ULL << VTD_CAP_SAGAW_SHIFT)
> + /* 48-bit AGAW, 4-level page-table */
> +#define VTD_CAP_SAGAW_48bit (0x4ULL << VTD_CAP_SAGAW_SHIFT)
> +#define VTD_CAP_SAGAW       VTD_CAP_SAGAW_39bit
> +
> +/* IQT_REG */
> +#define VTD_IQT_QT(val)     (((val) >> 4) & 0x7fffULL)
> +
> +/* IQA_REG */
> +#define VTD_IQA_IQA_MASK    (VTD_HAW_MASK ^ 0xfffULL)
> +#define VTD_IQA_QS          (0x7ULL)
> +
> +/* IQH_REG */
> +#define VTD_IQH_QH_SHIFT    (4)
> +#define VTD_IQH_QH_MASK     (0x7fff0ULL)
> +
> +/* ICS_REG */
> +#define VTD_ICS_IWC         (1UL)
> +
> +/* IECTL_REG */
> +#define VTD_IECTL_IM        (1UL << 31)
> +#define VTD_IECTL_IP        (1UL << 30)
> +
> +/* FSTS_REG */
> +#define VTD_FSTS_FRI_MASK  (0xff00)
> +#define VTD_FSTS_FRI(val)  ((((uint32_t)(val)) << 8) & VTD_FSTS_FRI_MASK)
> +#define VTD_FSTS_IQE       (1UL << 4)
> +#define VTD_FSTS_PPF       (1UL << 1)
> +#define VTD_FSTS_PFO       (1UL)
> +
> +/* FECTL_REG */
> +#define VTD_FECTL_IM       (1UL << 31)
> +#define VTD_FECTL_IP       (1UL << 30)
> +
> +/* Fault Recording Register */
> +/* For the high 64-bit of 128-bit */
> +#define VTD_FRCD_F         (1ULL << 63)
> +#define VTD_FRCD_T         (1ULL << 62)
> +#define VTD_FRCD_FR(val)   (((val) & 0xffULL) << 32)
> +#define VTD_FRCD_SID_MASK   0xffffULL
> +#define VTD_FRCD_SID(val)  ((val) & VTD_FRCD_SID_MASK)
> +/* For the low 64-bit of 128-bit */
> +#define VTD_FRCD_FI(val)   ((val) & (((1ULL << VTD_MGAW) - 1) ^ 0xfffULL))
> +
> +/* DMA Remapping Fault Conditions */
> +typedef enum VTDFaultReason {
> +    /* Reserved for Advanced Fault logging. We use this to represent the case
> +     * with no fault event.
> +     */
> +    VTD_FR_RESERVED = 0,
> +    VTD_FR_ROOT_ENTRY_P = 1, /* The Present(P) field of root-entry is 0 */
> +    VTD_FR_CONTEXT_ENTRY_P, /* The Present(P) field of context-entry is 0 */
> +    VTD_FR_CONTEXT_ENTRY_INV, /* Invalid programming of a context-entry */
> +    VTD_FR_ADDR_BEYOND_MGAW, /* Input-address above (2^MGAW - 1) */
> +    VTD_FR_WRITE, /* No write permission */
> +    VTD_FR_READ, /* No read permission */
> +    /* Fail to access a second-level paging entry (not SL_PML4E) */
> +    VTD_FR_PAGING_ENTRY_INV,
> +    VTD_FR_ROOT_TABLE_INV, /* Fail to access a root-entry */
> +    VTD_FR_CONTEXT_TABLE_INV, /* Fail to access a context-entry */
> +    /* Non-zero reserved field in a present root-entry */
> +    VTD_FR_ROOT_ENTRY_RSVD,
> +    /* Non-zero reserved field in a present context-entry */
> +    VTD_FR_CONTEXT_ENTRY_RSVD,
> +    /* Non-zero reserved field in a second-level paging entry with at least
> +     * one of the Read(R), Write(W) or Execute(E) fields Set.
> +     */
> +    VTD_FR_PAGING_ENTRY_RSVD,
> +    /* Translation request or translated request explicitly blocked due to the
> +     * programming of the Translation Type (T) field in the present
> +     * context-entry.
> +     */
> +    VTD_FR_CONTEXT_ENTRY_TT,
> +    /* This is not a normal fault reason. We use this to indicate some faults
> +     * that are not referenced by the VT-d specification.
> +     * Fault events with this reason should not be recorded.
> +     */
> +    VTD_FR_RESERVED_ERR,
> +    /* Guard */
> +    VTD_FR_MAX,
> +} VTDFaultReason;
> +
> +
> +/* Masks for Queued Invalidation Descriptor */
> +#define VTD_INV_DESC_TYPE  (0xf)
> +#define VTD_INV_DESC_CC    (0x1) /* Context-cache Invalidate Descriptor */
> +#define VTD_INV_DESC_IOTLB (0x2)
> +#define VTD_INV_DESC_WAIT  (0x5) /* Invalidation Wait Descriptor */
> +#define VTD_INV_DESC_NONE  (0)   /* Not an Invalidate Descriptor */
> +
> +
> +/* Pagesize of VTD paging structures, including root and context tables */
> +#define VTD_PAGE_SHIFT      (12)
> +#define VTD_PAGE_SIZE       (1ULL << VTD_PAGE_SHIFT)
> +
> +#define VTD_PAGE_SHIFT_4K   (12)
> +#define VTD_PAGE_MASK_4K    (~((1ULL << VTD_PAGE_SHIFT_4K) - 1))
> +#define VTD_PAGE_SHIFT_2M   (21)
> +#define VTD_PAGE_MASK_2M    (~((1ULL << VTD_PAGE_SHIFT_2M) - 1))
> +#define VTD_PAGE_SHIFT_1G   (30)
> +#define VTD_PAGE_MASK_1G    (~((1ULL << VTD_PAGE_SHIFT_1G) - 1))
> +
> +/* Root-Entry
> + * 0: Present
> + * 1-11: Reserved
> + * 12-63: Context-table Pointer
> + * 64-127: Reserved
> + */
> +struct VTDRootEntry {
> +    uint64_t val;
> +    uint64_t rsvd;
> +};
> +typedef struct VTDRootEntry VTDRootEntry;
> +
> +/* Masks for struct VTDRootEntry */
> +#define VTD_ROOT_ENTRY_P (1ULL << 0)
> +#define VTD_ROOT_ENTRY_CTP  (~0xfffULL)
> +
> +#define VTD_ROOT_ENTRY_NR   (VTD_PAGE_SIZE / sizeof(VTDRootEntry))
> +#define VTD_ROOT_ENTRY_RSVD (0xffeULL | ~VTD_HAW_MASK)
> +
> +/* Context-Entry */
> +struct VTDContextEntry {
> +    uint64_t lo;
> +    uint64_t hi;
> +};
> +typedef struct VTDContextEntry VTDContextEntry;
> +
> +/* Masks for struct VTDContextEntry */
> +/* lo */
> +#define VTD_CONTEXT_ENTRY_P (1ULL << 0)
> +#define VTD_CONTEXT_ENTRY_FPD   (1ULL << 1) /* Fault Processing Disable */
> +#define VTD_CONTEXT_ENTRY_TT    (3ULL << 2) /* Translation Type */
> +#define VTD_CONTEXT_TT_MULTI_LEVEL  (0)
> +#define VTD_CONTEXT_TT_DEV_IOTLB    (1)
> +#define VTD_CONTEXT_TT_PASS_THROUGH (2)
> +/* Second Level Page Translation Pointer */
> +#define VTD_CONTEXT_ENTRY_SLPTPTR   (~0xfffULL)
> +#define VTD_CONTEXT_ENTRY_RSVD_LO   (0xff0ULL | ~VTD_HAW_MASK)
> +/* hi */
> +#define VTD_CONTEXT_ENTRY_AW    (7ULL) /* Adjusted guest-address-width */
> +#define VTD_CONTEXT_ENTRY_DID   (0xffffULL << 8)    /* Domain Identifier */
> +#define VTD_CONTEXT_ENTRY_RSVD_HI   (0xffffffffff000080ULL)
> +
> +#define VTD_CONTEXT_ENTRY_NR    (VTD_PAGE_SIZE / sizeof(VTDContextEntry))
> +
> +
> +/* Paging Structure common */
> +#define VTD_SL_PT_PAGE_SIZE_MASK   (1ULL << 7)
> +#define VTD_SL_LEVEL_BITS   9   /* Bits to decide the offset for each level */
> +
> +/* Second Level Paging Structure */
> +#define VTD_SL_PML4_LEVEL   4
> +#define VTD_SL_PDP_LEVEL    3
> +#define VTD_SL_PD_LEVEL     2
> +#define VTD_SL_PT_LEVEL     1
> +#define VTD_SL_PT_ENTRY_NR  512
> +
> +/* Masks for Second Level Paging Entry */
> +#define VTD_SL_RW_MASK              (3ULL)
> +#define VTD_SL_R                    (1ULL)
> +#define VTD_SL_W                    (1ULL << 1)
> +#define VTD_SL_PT_BASE_ADDR_MASK    (~(VTD_PAGE_SIZE - 1) & VTD_HAW_MASK)
> +#define VTD_SL_IGN_COM    (0xbff0000000000000ULL)
> +
> +#endif
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> new file mode 100644
> index 0000000..6601e62
> --- /dev/null
> +++ b/include/hw/i386/intel_iommu.h
> @@ -0,0 +1,90 @@
> +/*
> + * QEMU emulation of an Intel IOMMU (VT-d)
> + *   (DMA Remapping device)
> + *
> + * Copyright (C) 2013 Knut Omang, Oracle <knut.omang@oracle.com>
> + * Copyright (C) 2014 Le Tan, <tamlokveer@gmail.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> +
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> +
> + * You should have received a copy of the GNU General Public License along
> + * with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#ifndef INTEL_IOMMU_H
> +#define INTEL_IOMMU_H
> +#include "hw/qdev.h"
> +#include "sysemu/dma.h"
> +
> +#define TYPE_INTEL_IOMMU_DEVICE "intel-iommu"
> +#define INTEL_IOMMU_DEVICE(obj) \
> +     OBJECT_CHECK(IntelIOMMUState, (obj), TYPE_INTEL_IOMMU_DEVICE)
> +
> +/* DMAR Hardware Unit Definition address (IOMMU unit) */
> +#define Q35_HOST_BRIDGE_IOMMU_ADDR 0xfed90000ULL
> +
> +#define VTD_PCI_BUS_MAX 256
> +#define VTD_PCI_SLOT_MAX 32
> +#define VTD_PCI_FUNC_MAX 8
> +#define VTD_PCI_SLOT(devfn)         (((devfn) >> 3) & 0x1f)
> +#define VTD_PCI_FUNC(devfn)         ((devfn) & 0x07)
> +
> +#define DMAR_REG_SIZE   0x230
> +
> +/* FIXME: not sure how the host address width (haw) should be determined */
> +#define VTD_HOST_ADDRESS_WIDTH  39
> +#define VTD_HAW_MASK    ((1ULL << VTD_HOST_ADDRESS_WIDTH) - 1)
> +
> +typedef struct IntelIOMMUState IntelIOMMUState;
> +typedef struct VTDAddressSpace VTDAddressSpace;
> +
> +struct VTDAddressSpace {
> +    int bus_num;
> +    int devfn;
> +    AddressSpace as;
> +    MemoryRegion iommu;
> +    IntelIOMMUState *iommu_state;
> +};
> +
> +/* The iommu (DMAR) device state struct */
> +struct IntelIOMMUState {
> +    SysBusDevice busdev;
> +    MemoryRegion csrmem;
> +    uint8_t csr[DMAR_REG_SIZE];     /* register values */
> +    uint8_t wmask[DMAR_REG_SIZE];   /* R/W bytes */
> +    uint8_t w1cmask[DMAR_REG_SIZE]; /* RW1C(Write 1 to Clear) bytes */
> +    uint8_t womask[DMAR_REG_SIZE]; /* WO (write only - read returns 0) */
> +    uint32_t version;
> +
> +    dma_addr_t root;        /* Current root table pointer */
> +    bool root_extended;     /* Type of root table (extended or not) */
> +    bool dmar_enabled;      /* Set if DMA remapping is enabled */
> +
> +    uint16_t iq_head;       /* Current invalidation queue head */
> +    uint16_t iq_tail;       /* Current invalidation queue tail */
> +    dma_addr_t iq;          /* Current invalidation queue (IQ) pointer */
> +    uint16_t iq_size;       /* IQ Size in number of entries */
> +    bool qi_enabled;        /* Set if the QI is enabled */
> +    uint8_t iq_last_desc_type; /* The type of last completed descriptor */
> +
> +    /* The index of the Fault Recording Register to be used next.
> +     * Wraps around from N-1 to 0, where N is the number of FRCD_REG.
> +     */
> +    uint16_t next_frcd_reg;
> +
> +    uint64_t cap;           /* The value of Capability Register */
> +    uint64_t ecap;          /* The value of Extended Capability Register */
> +
> +    MemoryRegionIOMMUOps iommu_ops;
> +    VTDAddressSpace **address_spaces[VTD_PCI_BUS_MAX];
> +};
> +
> +#endif
> -- 
> 1.9.1

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v3 3/5] intel-iommu: add DMAR table to ACPI tables
  2014-08-11  7:05 ` [Qemu-devel] [PATCH v3 3/5] intel-iommu: add DMAR table to ACPI tables Le Tan
@ 2014-08-14 11:06   ` Michael S. Tsirkin
  2014-08-14 11:36     ` Jan Kiszka
  2014-08-14 11:37     ` Le Tan
  0 siblings, 2 replies; 34+ messages in thread
From: Michael S. Tsirkin @ 2014-08-14 11:06 UTC (permalink / raw)
  To: Le Tan
  Cc: Stefan Weil, Knut Omang, qemu-devel, Alex Williamson, Jan Kiszka,
	Anthony Liguori, Paolo Bonzini

On Mon, Aug 11, 2014 at 03:05:00PM +0800, Le Tan wrote:
> Expose the Intel IOMMU to the BIOS. If an object of TYPE_INTEL_IOMMU_DEVICE
> exists, add a DMAR table to the ACPI RSDT table. For now the DMAR table
> indicates that there is only one hardware unit, without INTR_REMAP capability,
> on the platform.
> 
> Signed-off-by: Le Tan <tamlokveer@gmail.com>

Could you add a unit test please?
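
Even something minimal would do -- e.g. a rough, untested sketch along these
lines (file name and test path are just placeholders), which boots q35 with
iommu=on and checks that the VT-d version register reads back as expected:

    /* tests/intel-iommu-test.c (sketch only) */
    #include <glib.h>
    #include "libqtest.h"

    #define VTD_BASE 0xfed90000  /* Q35_HOST_BRIDGE_IOMMU_ADDR */

    static void test_vtd_version(void)
    {
        qtest_start("-machine q35,iommu=on");
        /* DMAR_VER_REG is at offset 0 and is defined to read as 0x10 */
        g_assert_cmphex(readl(VTD_BASE), ==, 0x10);
        qtest_end();
    }

    int main(int argc, char **argv)
    {
        g_test_init(&argc, &argv, NULL);
        qtest_add_func("/intel-iommu/version", test_vtd_version);
        return g_test_run();
    }

A proper test would of course also want to check the generated DMAR table.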

> ---
>  hw/i386/acpi-build.c | 41 ++++++++++++++++++++++++++++++
>  hw/i386/acpi-defs.h  | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 111 insertions(+)
> 
> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> index 816c6d9..595f501 100644
> --- a/hw/i386/acpi-build.c
> +++ b/hw/i386/acpi-build.c
> @@ -47,6 +47,7 @@
>  #include "hw/i386/ich9.h"
>  #include "hw/pci/pci_bus.h"
>  #include "hw/pci-host/q35.h"
> +#include "hw/i386/intel_iommu.h"
>  
>  #include "hw/i386/q35-acpi-dsdt.hex"
>  #include "hw/i386/acpi-dsdt.hex"
> @@ -1350,6 +1351,31 @@ build_mcfg_q35(GArray *table_data, GArray *linker, AcpiMcfgInfo *info)
>  }
>  
>  static void
> +build_dmar_q35(GArray *table_data, GArray *linker)
> +{
> +    int dmar_start = table_data->len;
> +
> +    AcpiTableDmar *dmar;
> +    AcpiDmarHardwareUnit *drhd;
> +
> +    dmar = acpi_data_push(table_data, sizeof(*dmar));
> +    dmar->host_address_width = VTD_HOST_ADDRESS_WIDTH - 1;
> +    dmar->flags = 0;    /* No intr_remap for now */
> +
> +    /* DMAR Remapping Hardware Unit Definition structure */
> +    drhd = acpi_data_push(table_data, sizeof(*drhd));
> +    drhd->type = cpu_to_le16(ACPI_DMAR_TYPE_HARDWARE_UNIT);
> +    drhd->length = cpu_to_le16(sizeof(*drhd));   /* No device scope now */
> +    drhd->flags = ACPI_DMAR_INCLUDE_PCI_ALL;
> +    drhd->pci_segment = cpu_to_le16(0);
> +    drhd->address = cpu_to_le64(Q35_HOST_BRIDGE_IOMMU_ADDR);
> +
> +    build_header(linker, table_data, (void *)(table_data->data + dmar_start),
> +                 "DMAR", table_data->len - dmar_start, 1);
> +}
> +
> +
> +static void
>  build_dsdt(GArray *table_data, GArray *linker, AcpiMiscInfo *misc)
>  {
>      AcpiTableHeader *dsdt;
> @@ -1470,6 +1496,17 @@ static bool acpi_get_mcfg(AcpiMcfgInfo *mcfg)
>      return true;
>  }
>  
> +static bool acpi_has_iommu(void)
> +{
> +    bool ambiguous;
> +    Object *intel_iommu;
> +
> +    intel_iommu = object_resolve_path_type("", TYPE_INTEL_IOMMU_DEVICE,
> +                                           &ambiguous);
> +    return intel_iommu && !ambiguous;
> +}
> +
> +
>  static
>  void acpi_build(PcGuestInfo *guest_info, AcpiBuildTables *tables)
>  {
> @@ -1539,6 +1576,10 @@ void acpi_build(PcGuestInfo *guest_info, AcpiBuildTables *tables)
>          acpi_add_table(table_offsets, tables->table_data);
>          build_mcfg_q35(tables->table_data, tables->linker, &mcfg);
>      }
> +    if (acpi_has_iommu()) {
> +        acpi_add_table(table_offsets, tables->table_data);
> +        build_dmar_q35(tables->table_data, tables->linker);
> +    }
>  
>      /* Add tables supplied by user (if any) */
>      for (u = acpi_table_first(); u; u = acpi_table_next(u)) {
> diff --git a/hw/i386/acpi-defs.h b/hw/i386/acpi-defs.h
> index e93babb..9674825 100644
> --- a/hw/i386/acpi-defs.h
> +++ b/hw/i386/acpi-defs.h
> @@ -314,4 +314,74 @@ struct AcpiTableMcfg {
>  } QEMU_PACKED;
>  typedef struct AcpiTableMcfg AcpiTableMcfg;
>  
> +/* DMAR - DMA Remapping table r2.2 */
> +struct AcpiTableDmar {
> +    ACPI_TABLE_HEADER_DEF
> +    uint8_t host_address_width; /* Maximum DMA physical addressability */
> +    uint8_t flags;
> +    uint8_t reserved[10];
> +} QEMU_PACKED;
> +typedef struct AcpiTableDmar AcpiTableDmar;
> +
> +/* Masks for Flags field above */
> +#define ACPI_DMAR_INTR_REMAP    (1)
> +#define ACPI_DMAR_X2APIC_OPT_OUT    (2)
> +
> +/*
> + * DMAR sub-structures (Follow DMA Remapping table)
> + */
> +#define ACPI_DMAR_SUB_HEADER_DEF /* Common ACPI DMAR sub-structure header */\
> +    uint16_t type;  \
> +    uint16_t length;

Really necessary? cleaner to just open-code, it's
only used once ...
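
I.e. something like this (sketch only, folding the two header fields in):

    /* 0: Hardware Unit Definition */
    struct AcpiDmarHardwareUnit {
        uint16_t type;        /* ACPI_DMAR_TYPE_HARDWARE_UNIT */
        uint16_t length;
        uint8_t flags;
        uint8_t reserved;
        uint16_t pci_segment; /* The PCI Segment associated with this unit */
        uint64_t address;     /* Base address of remapping hardware register-set */
    } QEMU_PACKED;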

> +
> +/* Values for sub-structure type for DMAR */
> +enum {
> +    ACPI_DMAR_TYPE_HARDWARE_UNIT = 0,   /* DRHD */
> +    ACPI_DMAR_TYPE_RESERVED_MEMORY = 1, /* RMRR */
> +    ACPI_DMAR_TYPE_ATSR = 2,    /* ATSR */
> +    ACPI_DMAR_TYPE_HARDWARE_AFFINITY = 3,   /* RHSR */
> +    ACPI_DMAR_TYPE_ANDD = 4,    /* ANDD */
> +    ACPI_DMAR_TYPE_RESERVED = 5 /* Reserved for future use */
> +};
> +
> +/*
> + * Sub-structures for DMAR, correspond to Type in ACPI_DMAR_SUB_HEADER_DEF
> + */
> +
> +/* DMAR Device Scope structures */
> +struct AcpiDmarDeviceScope {
> +    uint8_t type;
> +    uint8_t length;
> +    uint16_t reserved;
> +    uint8_t enumeration_id;
> +    uint8_t start_bus_number;
> +    uint8_t path[0];
> +} QEMU_PACKED;
> +typedef struct AcpiDmarDeviceScope AcpiDmarDeviceScope;
> +
> +/* Values for type in struct AcpiDmarDeviceScope */
> +enum {
> +    ACPI_DMAR_SCOPE_TYPE_NOT_USED = 0,
> +    ACPI_DMAR_SCOPE_TYPE_ENDPOINT = 1,
> +    ACPI_DMAR_SCOPE_TYPE_BRIDGE = 2,
> +    ACPI_DMAR_SCOPE_TYPE_IOAPIC = 3,
> +    ACPI_DMAR_SCOPE_TYPE_HPET = 4,
> +    ACPI_DMAR_SCOPE_TYPE_ACPI = 5,
> +    ACPI_DMAR_SCOPE_TYPE_RESERVED = 6 /* Reserved for future use */
> +};
> +
> +/* 0: Hardware Unit Definition */
> +struct AcpiDmarHardwareUnit {
> +    ACPI_DMAR_SUB_HEADER_DEF
> +    uint8_t flags;
> +    uint8_t reserved;
> +    uint16_t pci_segment;   /* The PCI Segment associated with this unit */
> +    uint64_t address;   /* Base address of remapping hardware register-set */
> +} QEMU_PACKED;
> +typedef struct AcpiDmarHardwareUnit AcpiDmarHardwareUnit;
> +
> +/* Masks for Flags field above */
> +#define ACPI_DMAR_INCLUDE_PCI_ALL (1)
> +
> +

Don't add two empty lines in a row.
Same in other places.

>  #endif
> -- 
> 1.9.1

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v3 4/5] intel-iommu: add Intel IOMMU emulation to q35 and add a machine option "iommu" as a switch
  2014-08-11  7:05 ` [Qemu-devel] [PATCH v3 4/5] intel-iommu: add Intel IOMMU emulation to q35 and add a machine option "iommu" as a switch Le Tan
@ 2014-08-14 11:12   ` Michael S. Tsirkin
  2014-08-14 11:33     ` Le Tan
  0 siblings, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2014-08-14 11:12 UTC (permalink / raw)
  To: Le Tan
  Cc: Stefan Weil, Knut Omang, qemu-devel, Alex Williamson, Jan Kiszka,
	Anthony Liguori, Paolo Bonzini

On Mon, Aug 11, 2014 at 03:05:01PM +0800, Le Tan wrote:
> Add Intel IOMMU emulation to q35 chipset and expose it to the guest.
> 1. Add a machine option. Users can use "-machine iommu=on|off" in the command
> line to enable/disable Intel IOMMU. The default is off.
> 2. According to the machine option, q35 will initialize the Intel IOMMU and
> use pci_setup_iommu() to set up q35_host_dma_iommu() as the IOMMU function for
> the pci bus.
> 3. q35_host_dma_iommu() will return different address space according to the
> bus_num and devfn of the device.
> 
> Signed-off-by: Le Tan <tamlokveer@gmail.com>
> ---
>  hw/core/machine.c         | 27 +++++++++++++++++---
>  hw/pci-host/q35.c         | 64 +++++++++++++++++++++++++++++++++++++++++++----
>  include/hw/boards.h       |  1 +
>  include/hw/pci-host/q35.h |  2 ++
>  qemu-options.hx           |  5 +++-
>  vl.c                      |  4 +++
>  6 files changed, 94 insertions(+), 9 deletions(-)
> 
> diff --git a/hw/core/machine.c b/hw/core/machine.c
> index 7a66c57..f0046d6 100644
> --- a/hw/core/machine.c
> +++ b/hw/core/machine.c
> @@ -235,6 +235,20 @@ static void machine_set_firmware(Object *obj, const char *value, Error **errp)
>      ms->firmware = g_strdup(value);
>  }
>  
> +static bool machine_get_iommu(Object *obj, Error **errp)
> +{
> +    MachineState *ms = MACHINE(obj);
> +
> +    return ms->iommu;
> +}
> +
> +static void machine_set_iommu(Object *obj, bool value, Error **errp)
> +{
> +    MachineState *ms = MACHINE(obj);
> +
> +    ms->iommu = value;
> +}
> +
>  static void machine_initfn(Object *obj)
>  {
>      object_property_add_str(obj, "accel",
> @@ -270,10 +284,17 @@ static void machine_initfn(Object *obj)
>                               machine_set_dump_guest_core,
>                               NULL);
>      object_property_add_bool(obj, "mem-merge",
> -                             machine_get_mem_merge, machine_set_mem_merge, NULL);
> -    object_property_add_bool(obj, "usb", machine_get_usb, machine_set_usb, NULL);
> +                             machine_get_mem_merge,
> +                             machine_set_mem_merge, NULL);
> +    object_property_add_bool(obj, "usb",
> +                             machine_get_usb,
> +                             machine_set_usb, NULL);
>      object_property_add_str(obj, "firmware",
> -                            machine_get_firmware, machine_set_firmware, NULL);
> +                            machine_get_firmware,
> +                            machine_set_firmware, NULL);
> +    object_property_add_bool(obj, "iommu",
> +                             machine_get_iommu,
> +                             machine_set_iommu, NULL);
>  }
>  
>  static void machine_finalize(Object *obj)
> diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c
> index a0a3068..3342711 100644
> --- a/hw/pci-host/q35.c
> +++ b/hw/pci-host/q35.c
> @@ -347,6 +347,53 @@ static void mch_reset(DeviceState *qdev)
>      mch_update(mch);
>  }
>  
> +static AddressSpace *q35_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
> +{
> +    IntelIOMMUState *s = opaque;
> +    VTDAddressSpace **pvtd_as;
> +    VTDAddressSpace *vtd_as;
> +    int bus_num = pci_bus_num(bus);
> +
> +    assert(devfn >= 0);
> +
> +    pvtd_as = s->address_spaces[bus_num];
> +    if (!pvtd_as) {
> +        /* No corresponding free() */
> +        pvtd_as = g_malloc0(sizeof(VTDAddressSpace *) *
> +                            VTD_PCI_SLOT_MAX * VTD_PCI_FUNC_MAX);
> +        s->address_spaces[bus_num] = pvtd_as;
> +    }
> +
> +    vtd_as = *(pvtd_as + devfn);

pvtd_as[devfn] is cleaner.
In fact, you can then drop the vtd_as local variable and use pvtd_as[devfn]
directly.
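
Something like this (not compile-tested):

    pvtd_as = s->address_spaces[bus_num];
    if (!pvtd_as) {
        /* No corresponding free() */
        pvtd_as = g_malloc0(sizeof(VTDAddressSpace *) *
                            VTD_PCI_SLOT_MAX * VTD_PCI_FUNC_MAX);
        s->address_spaces[bus_num] = pvtd_as;
    }
    if (!pvtd_as[devfn]) {
        pvtd_as[devfn] = g_malloc0(sizeof(VTDAddressSpace));
        pvtd_as[devfn]->bus_num = bus_num;
        pvtd_as[devfn]->devfn = devfn;
        pvtd_as[devfn]->iommu_state = s;
        memory_region_init_iommu(&pvtd_as[devfn]->iommu, OBJECT(s),
                                 &s->iommu_ops, "intel_iommu", UINT64_MAX);
        address_space_init(&pvtd_as[devfn]->as, &pvtd_as[devfn]->iommu,
                           "intel_iommu");
    }
    return &pvtd_as[devfn]->as;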

> +    if (!vtd_as) {
> +        vtd_as = g_malloc0(sizeof(*vtd_as));
> +        *(pvtd_as + devfn) = vtd_as;
> +
> +        vtd_as->bus_num = bus_num;
> +        vtd_as->devfn = devfn;
> +        vtd_as->iommu_state = s;
> +        memory_region_init_iommu(&vtd_as->iommu, OBJECT(s), &s->iommu_ops,
> +                                 "intel_iommu", UINT64_MAX);
> +        address_space_init(&vtd_as->as, &vtd_as->iommu, "intel_iommu");
> +    }
> +
> +    return &vtd_as->as;
> +}
> +
> +static void mch_init_dmar(MCHPCIState *mch)
> +{
> +    PCIBus *pci_bus = PCI_BUS(qdev_get_parent_bus(DEVICE(mch)));
> +
> +    mch->iommu = INTEL_IOMMU_DEVICE(qdev_create(NULL, TYPE_INTEL_IOMMU_DEVICE));
> +    object_property_add_child(OBJECT(mch), "intel-iommu",
> +                              OBJECT(mch->iommu), NULL);
> +    qdev_init_nofail(DEVICE(mch->iommu));
> +    sysbus_mmio_map(SYS_BUS_DEVICE(mch->iommu), 0, Q35_HOST_BRIDGE_IOMMU_ADDR);
> +
> +    pci_setup_iommu(pci_bus, q35_host_dma_iommu, mch->iommu);
> +}
> +
> +
>  static int mch_init(PCIDevice *d)
>  {
>      int i;
> @@ -363,13 +410,20 @@ static int mch_init(PCIDevice *d)
>      memory_region_add_subregion_overlap(mch->system_memory, 0xa0000,
>                                          &mch->smram_region, 1);
>      memory_region_set_enabled(&mch->smram_region, false);
> -    init_pam(DEVICE(mch), mch->ram_memory, mch->system_memory, mch->pci_address_space,
> -             &mch->pam_regions[0], PAM_BIOS_BASE, PAM_BIOS_SIZE);
> +    init_pam(DEVICE(mch), mch->ram_memory, mch->system_memory,
> +             mch->pci_address_space, &mch->pam_regions[0], PAM_BIOS_BASE,
> +             PAM_BIOS_SIZE);
>      for (i = 0; i < 12; ++i) {
> -        init_pam(DEVICE(mch), mch->ram_memory, mch->system_memory, mch->pci_address_space,
> -                 &mch->pam_regions[i+1], PAM_EXPAN_BASE + i * PAM_EXPAN_SIZE,
> -                 PAM_EXPAN_SIZE);
> +        init_pam(DEVICE(mch), mch->ram_memory, mch->system_memory,
> +                 mch->pci_address_space, &mch->pam_regions[i+1],
> +                 PAM_EXPAN_BASE + i * PAM_EXPAN_SIZE, PAM_EXPAN_SIZE);

Random formatting changes above? Make it a separate patch please.

> +    }
> +
> +    /* Intel IOMMU (VT-d) */
> +    if (qemu_opt_get_bool(qemu_get_machine_opts(), "iommu", false)) {
> +        mch_init_dmar(mch);
>      }
> +
>      return 0;
>  }
>  
> diff --git a/include/hw/boards.h b/include/hw/boards.h
> index 605a970..dfb6718 100644
> --- a/include/hw/boards.h
> +++ b/include/hw/boards.h
> @@ -123,6 +123,7 @@ struct MachineState {
>      bool mem_merge;
>      bool usb;
>      char *firmware;
> +    bool iommu;
>  
>      ram_addr_t ram_size;
>      ram_addr_t maxram_size;
> diff --git a/include/hw/pci-host/q35.h b/include/hw/pci-host/q35.h
> index d9ee978..025d6e6 100644
> --- a/include/hw/pci-host/q35.h
> +++ b/include/hw/pci-host/q35.h
> @@ -33,6 +33,7 @@
>  #include "hw/acpi/acpi.h"
>  #include "hw/acpi/ich9.h"
>  #include "hw/pci-host/pam.h"
> +#include "hw/i386/intel_iommu.h"
>  
>  #define TYPE_Q35_HOST_DEVICE "q35-pcihost"
>  #define Q35_HOST_DEVICE(obj) \
> @@ -60,6 +61,7 @@ typedef struct MCHPCIState {
>      uint64_t pci_hole64_size;
>      PcGuestInfo *guest_info;
>      uint32_t short_root_bus;
> +    IntelIOMMUState *iommu;
>  } MCHPCIState;
>  
>  typedef struct Q35PCIHost {
> diff --git a/qemu-options.hx b/qemu-options.hx
> index 96516c1..7406a17 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -35,7 +35,8 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
>      "                kernel_irqchip=on|off controls accelerated irqchip support\n"
>      "                kvm_shadow_mem=size of KVM shadow MMU\n"
>      "                dump-guest-core=on|off include guest memory in a core dump (default=on)\n"
> -    "                mem-merge=on|off controls memory merge support (default: on)\n",
> +    "                mem-merge=on|off controls memory merge support (default: on)\n"
> +    "                iommu=on|off controls emulated Intel IOMMU (VT-d) support (default=off)\n",
>      QEMU_ARCH_ALL)
>  STEXI
>  @item -machine [type=]@var{name}[,prop=@var{value}[,...]]
> @@ -58,6 +59,8 @@ Include guest memory in a core dump. The default is on.
>  Enables or disables memory merge support. This feature, when supported by
>  the host, de-duplicates identical memory pages among VMs instances
>  (enabled by default).
> +@item iommu=on|off
> +Enables or disables emulated Intel IOMMU (VT-d) support. The default is off.
>  @end table
>  ETEXI
>  
> diff --git a/vl.c b/vl.c
> index a8029d5..2ab1643 100644
> --- a/vl.c
> +++ b/vl.c
> @@ -388,6 +388,10 @@ static QemuOptsList qemu_machine_opts = {
>              .name = PC_MACHINE_MAX_RAM_BELOW_4G,
>              .type = QEMU_OPT_SIZE,
>              .help = "maximum ram below the 4G boundary (32bit boundary)",
> +        },{
> +            .name = "iommu",
> +            .type = QEMU_OPT_BOOL,
> +            .help = "Set on/off to enable/disable Intel IOMMU (VT-d)",
>          },
>          { /* End of list */ }
>      },
> -- 
> 1.9.1

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v3 0/5] intel-iommu: introduce Intel IOMMU (VT-d) emulation to q35 chipset
  2014-08-11  7:04 [Qemu-devel] [PATCH v3 0/5] intel-iommu: introduce Intel IOMMU (VT-d) emulation to q35 chipset Le Tan
                   ` (4 preceding siblings ...)
  2014-08-11  7:05 ` [Qemu-devel] [PATCH v3 5/5] intel-iommu: add supports for queued invalidation interface Le Tan
@ 2014-08-14 11:15 ` Michael S. Tsirkin
  2014-08-14 12:10   ` Jan Kiszka
  5 siblings, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2014-08-14 11:15 UTC (permalink / raw)
  To: Le Tan
  Cc: Stefan Weil, Knut Omang, qemu-devel, Alex Williamson, Jan Kiszka,
	Anthony Liguori, Paolo Bonzini

On Mon, Aug 11, 2014 at 03:04:57PM +0800, Le Tan wrote:
> Hi,
> 
> These patches are intended to introduce Intel IOMMU (VT-d) emulation to q35
> chipset. The major job in these patches is to add support for emulating Intel
> IOMMU according to the VT-d specification, including basic responses to CSRs
> accesses, the logics of DMAR (DMA remapping) and DMA memory address
> translations.

Thanks!
Looks very good overall. I noted some coding style issues; I didn't
bother reporting each issue in every place where it appears, only once,
so please find and fix all instances of each issue.

> Features implemented for now are:
> 1. Response to important CSRs accesses;
> 2. DMAR (DMA remapping) without PASID support;
> 3. Primary fault logging;
> 4. Support both register-based and queued invalidation for IOTLB and context
>    cache invalidation;
> 5. Add DMAR table to ACPI tables to expose VT-d to BIOS;
> 6. Add "-machine iommu=on|off" option to enable/disable VT-d;
> 7. Only one DMAR unit for all the devices of PCI Segment 0.
> 
> Testing:
> 1. L1 guest with Linux with intel_iommu=on can interact with VT-d and boot
> smoothly, and there exists information about VT-d in the log of kernel;
> 2. Run L1 with VT-d, L2 guest with Linux can boot smoothly withou PCI device
> passthrough;
> 3. Run L1 with VT-d and "-soundhw ac97 (QEMU_AUDIO_DRV=alsa)", then assign the
> sound card to L2; L2 can boot smoothly with legacy PCI assignment and I can
> hear the music played in L2 from the host speakers;
> 4. Jailhouse hypervisor can run smoothly(tested by Jan).
> 5. Run L1 with VT-d and e1000 network card, then assign e1000 to L2; L2 will be
> STUCK when booting. This still remains unsolved now. As far as I know, I suppose
> that the L2 crashes when doing e1000_probe(). The QEMU of L1 will dump
> something with "KVM: entry failed, hardware error 0x0", and the KVM of host
> will print "nested_vmx_exit_handled failed vm entry 7". Unlike assigning the
> sound card, after being assigned to L2, there is no translation entry of e1000
> through VT-d, which I think means that e1000 doesn't issue any DMA access during
> the boot of L2. Sometimes the kernel of L2 will print "divide error" during
> booting. Maybe it results from the lack of reset mechanism.
> 6. VFIO is tested and is similar to legacy pci assignment.
> 
> Discussion:
> 1. There is one functionality called Zero-Length-Read (ZLR) which supports zero
> length DMA read requests to write-only pages. If the VT-d emulation supports
> ZLR, we need to know the exact length of one access. For now can QEMU express
> zero-length requests?
> 
> TODO:
> 1. Context cache and IOTLB cache;
> 2. Fix the bug of legacy PCI assignment;
> 
> Changes since v2:
> *address reviewing suggestions given by Jan
> -add support for primary fault logging
> -add support for queued invalidation
> 
> Changes since v1:
> *address reviewing suggestions given by Michael, Paolo, Stefan and Jan
> -split intel_iommu.h to include/hw/i386/intel_iommu.h and
>  hw/i386/intel_iommu_internal.h
> -change the copyright information
> -change D() to VTD_DPRINTF()
> -remove dead code
> -rename constant definitions with consistent prefix VTD_
> -rename some struct definitions according to QEMU standard
> -rename some CSRs access functions
> -use endian-safe functions to access CSRs
> -change machine option to "iommu=on|off"
> 
> Thanks very much!
> 
> Git trees:
> https://github.com/tamlok/qemu
> 
> Le Tan (5):
>   iommu: add is_write as a parameter to the translate function of
>     MemoryRegionIOMMUOps
>   intel-iommu: introduce Intel IOMMU (VT-d) emulation
>   intel-iommu: add DMAR table to ACPI tables
>   intel-iommu: add Intel IOMMU emulation to q35 and add a machine option
>     "iommu" as a switch
>   intel-iommu: add supports for queued invalidation interface
> 
>  exec.c                         |    2 +-
>  hw/alpha/typhoon.c             |    3 +-
>  hw/core/machine.c              |   27 +-
>  hw/i386/Makefile.objs          |    1 +
>  hw/i386/acpi-build.c           |   41 +
>  hw/i386/acpi-defs.h            |   70 ++
>  hw/i386/intel_iommu.c          | 1722 ++++++++++++++++++++++++++++++++++++++++
>  hw/i386/intel_iommu_internal.h |  358 +++++++++
>  hw/pci-host/apb.c              |    3 +-
>  hw/pci-host/q35.c              |   64 +-
>  hw/ppc/spapr_iommu.c           |    3 +-
>  include/exec/memory.h          |    2 +-
>  include/hw/boards.h            |    1 +
>  include/hw/i386/intel_iommu.h  |   90 +++
>  include/hw/pci-host/q35.h      |    2 +
>  qemu-options.hx                |    5 +-
>  vl.c                           |    4 +
>  17 files changed, 2384 insertions(+), 14 deletions(-)
>  create mode 100644 hw/i386/intel_iommu.c
>  create mode 100644 hw/i386/intel_iommu_internal.h
>  create mode 100644 include/hw/i386/intel_iommu.h
> 
> -- 
> 1.9.1


* Re: [Qemu-devel] [PATCH v3 4/5] intel-iommu: add Intel IOMMU emulation to q35 and add a machine option "iommu" as a switch
  2014-08-14 11:12   ` Michael S. Tsirkin
@ 2014-08-14 11:33     ` Le Tan
  2014-08-14 11:35       ` Michael S. Tsirkin
  0 siblings, 1 reply; 34+ messages in thread
From: Le Tan @ 2014-08-14 11:33 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stefan Weil, Knut Omang, qemu-devel, Alex Williamson, Jan Kiszka,
	Anthony Liguori, Paolo Bonzini

2014-08-14 19:12 GMT+08:00 Michael S. Tsirkin <mst@redhat.com>:
> On Mon, Aug 11, 2014 at 03:05:01PM +0800, Le Tan wrote:
>> Add Intel IOMMU emulation to q35 chipset and expose it to the guest.
>> 1. Add a machine option. Users can use "-machine iommu=on|off" in the command
>> line to enable/disable Intel IOMMU. The default is off.
>> 2. According to the machine option, q35 will initialize the Intel IOMMU and
>> use pci_setup_iommu() to setup q35_host_dma_iommu() as the IOMMU function for
>> the pci bus.
>> 3. q35_host_dma_iommu() will return different address space according to the
>> bus_num and devfn of the device.
>>
>> Signed-off-by: Le Tan <tamlokveer@gmail.com>
>> ---
>>  hw/core/machine.c         | 27 +++++++++++++++++---
>>  hw/pci-host/q35.c         | 64 +++++++++++++++++++++++++++++++++++++++++++----
>>  include/hw/boards.h       |  1 +
>>  include/hw/pci-host/q35.h |  2 ++
>>  qemu-options.hx           |  5 +++-
>>  vl.c                      |  4 +++
>>  6 files changed, 94 insertions(+), 9 deletions(-)
>>
>> diff --git a/hw/core/machine.c b/hw/core/machine.c
>> index 7a66c57..f0046d6 100644
>> --- a/hw/core/machine.c
>> +++ b/hw/core/machine.c
>> @@ -235,6 +235,20 @@ static void machine_set_firmware(Object *obj, const char *value, Error **errp)
>>      ms->firmware = g_strdup(value);
>>  }
>>
>> +static bool machine_get_iommu(Object *obj, Error **errp)
>> +{
>> +    MachineState *ms = MACHINE(obj);
>> +
>> +    return ms->iommu;
>> +}
>> +
>> +static void machine_set_iommu(Object *obj, bool value, Error **errp)
>> +{
>> +    MachineState *ms = MACHINE(obj);
>> +
>> +    ms->iommu = value;
>> +}
>> +
>>  static void machine_initfn(Object *obj)
>>  {
>>      object_property_add_str(obj, "accel",
>> @@ -270,10 +284,17 @@ static void machine_initfn(Object *obj)
>>                               machine_set_dump_guest_core,
>>                               NULL);
>>      object_property_add_bool(obj, "mem-merge",
>> -                             machine_get_mem_merge, machine_set_mem_merge, NULL);
>> -    object_property_add_bool(obj, "usb", machine_get_usb, machine_set_usb, NULL);
>> +                             machine_get_mem_merge,
>> +                             machine_set_mem_merge, NULL);
>> +    object_property_add_bool(obj, "usb",
>> +                             machine_get_usb,
>> +                             machine_set_usb, NULL);
>>      object_property_add_str(obj, "firmware",
>> -                            machine_get_firmware, machine_set_firmware, NULL);
>> +                            machine_get_firmware,
>> +                            machine_set_firmware, NULL);
>> +    object_property_add_bool(obj, "iommu",
>> +                             machine_get_iommu,
>> +                             machine_set_iommu, NULL);
>>  }
>>
>>  static void machine_finalize(Object *obj)
>> diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c
>> index a0a3068..3342711 100644
>> --- a/hw/pci-host/q35.c
>> +++ b/hw/pci-host/q35.c
>> @@ -347,6 +347,53 @@ static void mch_reset(DeviceState *qdev)
>>      mch_update(mch);
>>  }
>>
>> +static AddressSpace *q35_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
>> +{
>> +    IntelIOMMUState *s = opaque;
>> +    VTDAddressSpace **pvtd_as;
>> +    VTDAddressSpace *vtd_as;
>> +    int bus_num = pci_bus_num(bus);
>> +
>> +    assert(devfn >= 0);
>> +
>> +    pvtd_as = s->address_spaces[bus_num];
>> +    if (!pvtd_as) {
>> +        /* No corresponding free() */
>> +        pvtd_as = g_malloc0(sizeof(VTDAddressSpace *) *
>> +                            VTD_PCI_SLOT_MAX * VTD_PCI_FUNC_MAX);
>> +        s->address_spaces[bus_num] = pvtd_as;
>> +    }
>> +
>> +    vtd_as = *(pvtd_as + devfn);
>
> pvtd_as[devfn] is cleaner.
> In fact, you can then drop vtd_as local variable, use pvtd_as[devfn]
> directly.
>
>> +    if (!vtd_as) {
>> +        vtd_as = g_malloc0(sizeof(*vtd_as));
>> +        *(pvtd_as + devfn) = vtd_as;
>> +
>> +        vtd_as->bus_num = bus_num;
>> +        vtd_as->devfn = devfn;
>> +        vtd_as->iommu_state = s;
>> +        memory_region_init_iommu(&vtd_as->iommu, OBJECT(s), &s->iommu_ops,
>> +                                 "intel_iommu", UINT64_MAX);
>> +        address_space_init(&vtd_as->as, &vtd_as->iommu, "intel_iommu");
>> +    }
>> +
>> +    return &vtd_as->as;
>> +}
>> +
>> +static void mch_init_dmar(MCHPCIState *mch)
>> +{
>> +    PCIBus *pci_bus = PCI_BUS(qdev_get_parent_bus(DEVICE(mch)));
>> +
>> +    mch->iommu = INTEL_IOMMU_DEVICE(qdev_create(NULL, TYPE_INTEL_IOMMU_DEVICE));
>> +    object_property_add_child(OBJECT(mch), "intel-iommu",
>> +                              OBJECT(mch->iommu), NULL);
>> +    qdev_init_nofail(DEVICE(mch->iommu));
>> +    sysbus_mmio_map(SYS_BUS_DEVICE(mch->iommu), 0, Q35_HOST_BRIDGE_IOMMU_ADDR);
>> +
>> +    pci_setup_iommu(pci_bus, q35_host_dma_iommu, mch->iommu);
>> +}
>> +
>> +
>>  static int mch_init(PCIDevice *d)
>>  {
>>      int i;
>> @@ -363,13 +410,20 @@ static int mch_init(PCIDevice *d)
>>      memory_region_add_subregion_overlap(mch->system_memory, 0xa0000,
>>                                          &mch->smram_region, 1);
>>      memory_region_set_enabled(&mch->smram_region, false);
>> -    init_pam(DEVICE(mch), mch->ram_memory, mch->system_memory, mch->pci_address_space,
>> -             &mch->pam_regions[0], PAM_BIOS_BASE, PAM_BIOS_SIZE);
>> +    init_pam(DEVICE(mch), mch->ram_memory, mch->system_memory,
>> +             mch->pci_address_space, &mch->pam_regions[0], PAM_BIOS_BASE,
>> +             PAM_BIOS_SIZE);
>>      for (i = 0; i < 12; ++i) {
>> -        init_pam(DEVICE(mch), mch->ram_memory, mch->system_memory, mch->pci_address_space,
>> -                 &mch->pam_regions[i+1], PAM_EXPAN_BASE + i * PAM_EXPAN_SIZE,
>> -                 PAM_EXPAN_SIZE);
>> +        init_pam(DEVICE(mch), mch->ram_memory, mch->system_memory,
>> +                 mch->pci_address_space, &mch->pam_regions[i+1],
>> +                 PAM_EXPAN_BASE + i * PAM_EXPAN_SIZE, PAM_EXPAN_SIZE);
>
> Random formatting changes above? Make it a separate patch please.

Yes, I was told that I could fix some coding style issues around
places that my patch touches. If that is improper, I will just first
leave them out. :)
Thanks for your reviews!

Le

>> +    }
>> +
>> +    /* Intel IOMMU (VT-d) */
>> +    if (qemu_opt_get_bool(qemu_get_machine_opts(), "iommu", false)) {
>> +        mch_init_dmar(mch);
>>      }
>> +
>>      return 0;
>>  }
>>
>> diff --git a/include/hw/boards.h b/include/hw/boards.h
>> index 605a970..dfb6718 100644
>> --- a/include/hw/boards.h
>> +++ b/include/hw/boards.h
>> @@ -123,6 +123,7 @@ struct MachineState {
>>      bool mem_merge;
>>      bool usb;
>>      char *firmware;
>> +    bool iommu;
>>
>>      ram_addr_t ram_size;
>>      ram_addr_t maxram_size;
>> diff --git a/include/hw/pci-host/q35.h b/include/hw/pci-host/q35.h
>> index d9ee978..025d6e6 100644
>> --- a/include/hw/pci-host/q35.h
>> +++ b/include/hw/pci-host/q35.h
>> @@ -33,6 +33,7 @@
>>  #include "hw/acpi/acpi.h"
>>  #include "hw/acpi/ich9.h"
>>  #include "hw/pci-host/pam.h"
>> +#include "hw/i386/intel_iommu.h"
>>
>>  #define TYPE_Q35_HOST_DEVICE "q35-pcihost"
>>  #define Q35_HOST_DEVICE(obj) \
>> @@ -60,6 +61,7 @@ typedef struct MCHPCIState {
>>      uint64_t pci_hole64_size;
>>      PcGuestInfo *guest_info;
>>      uint32_t short_root_bus;
>> +    IntelIOMMUState *iommu;
>>  } MCHPCIState;
>>
>>  typedef struct Q35PCIHost {
>> diff --git a/qemu-options.hx b/qemu-options.hx
>> index 96516c1..7406a17 100644
>> --- a/qemu-options.hx
>> +++ b/qemu-options.hx
>> @@ -35,7 +35,8 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
>>      "                kernel_irqchip=on|off controls accelerated irqchip support\n"
>>      "                kvm_shadow_mem=size of KVM shadow MMU\n"
>>      "                dump-guest-core=on|off include guest memory in a core dump (default=on)\n"
>> -    "                mem-merge=on|off controls memory merge support (default: on)\n",
>> +    "                mem-merge=on|off controls memory merge support (default: on)\n"
>> +    "                iommu=on|off controls emulated Intel IOMMU (VT-d) support (default=off)\n",
>>      QEMU_ARCH_ALL)
>>  STEXI
>>  @item -machine [type=]@var{name}[,prop=@var{value}[,...]]
>> @@ -58,6 +59,8 @@ Include guest memory in a core dump. The default is on.
>>  Enables or disables memory merge support. This feature, when supported by
>>  the host, de-duplicates identical memory pages among VMs instances
>>  (enabled by default).
>> +@item iommu=on|off
>> +Enables or disables emulated Intel IOMMU (VT-d) support. The default is off.
>>  @end table
>>  ETEXI
>>
>> diff --git a/vl.c b/vl.c
>> index a8029d5..2ab1643 100644
>> --- a/vl.c
>> +++ b/vl.c
>> @@ -388,6 +388,10 @@ static QemuOptsList qemu_machine_opts = {
>>              .name = PC_MACHINE_MAX_RAM_BELOW_4G,
>>              .type = QEMU_OPT_SIZE,
>>              .help = "maximum ram below the 4G boundary (32bit boundary)",
>> +        },{
>> +            .name = "iommu",
>> +            .type = QEMU_OPT_BOOL,
>> +            .help = "Set on/off to enable/disable Intel IOMMU (VT-d)",
>>          },
>>          { /* End of list */ }
>>      },
>> --
>> 1.9.1


* Re: [Qemu-devel] [PATCH v3 4/5] intel-iommu: add Intel IOMMU emulation to q35 and add a machine option "iommu" as a switch
  2014-08-14 11:33     ` Le Tan
@ 2014-08-14 11:35       ` Michael S. Tsirkin
  2014-08-14 11:39         ` Le Tan
  0 siblings, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2014-08-14 11:35 UTC (permalink / raw)
  To: Le Tan
  Cc: Stefan Weil, Knut Omang, qemu-devel, Alex Williamson, Jan Kiszka,
	Anthony Liguori, Paolo Bonzini

On Thu, Aug 14, 2014 at 07:33:29PM +0800, Le Tan wrote:
> 2014-08-14 19:12 GMT+08:00 Michael S. Tsirkin <mst@redhat.com>:
> > On Mon, Aug 11, 2014 at 03:05:01PM +0800, Le Tan wrote:
> >> Add Intel IOMMU emulation to q35 chipset and expose it to the guest.
> >> 1. Add a machine option. Users can use "-machine iommu=on|off" in the command
> >> line to enable/disable Intel IOMMU. The default is off.
> >> 2. According to the machine option, q35 will initialize the Intel IOMMU and
> >> use pci_setup_iommu() to setup q35_host_dma_iommu() as the IOMMU function for
> >> the pci bus.
> >> 3. q35_host_dma_iommu() will return different address space according to the
> >> bus_num and devfn of the device.
> >>
> >> Signed-off-by: Le Tan <tamlokveer@gmail.com>
> >> ---
> >>  hw/core/machine.c         | 27 +++++++++++++++++---
> >>  hw/pci-host/q35.c         | 64 +++++++++++++++++++++++++++++++++++++++++++----
> >>  include/hw/boards.h       |  1 +
> >>  include/hw/pci-host/q35.h |  2 ++
> >>  qemu-options.hx           |  5 +++-
> >>  vl.c                      |  4 +++
> >>  6 files changed, 94 insertions(+), 9 deletions(-)
> >>
> >> diff --git a/hw/core/machine.c b/hw/core/machine.c
> >> index 7a66c57..f0046d6 100644
> >> --- a/hw/core/machine.c
> >> +++ b/hw/core/machine.c
> >> @@ -235,6 +235,20 @@ static void machine_set_firmware(Object *obj, const char *value, Error **errp)
> >>      ms->firmware = g_strdup(value);
> >>  }
> >>
> >> +static bool machine_get_iommu(Object *obj, Error **errp)
> >> +{
> >> +    MachineState *ms = MACHINE(obj);
> >> +
> >> +    return ms->iommu;
> >> +}
> >> +
> >> +static void machine_set_iommu(Object *obj, bool value, Error **errp)
> >> +{
> >> +    MachineState *ms = MACHINE(obj);
> >> +
> >> +    ms->iommu = value;
> >> +}
> >> +
> >>  static void machine_initfn(Object *obj)
> >>  {
> >>      object_property_add_str(obj, "accel",
> >> @@ -270,10 +284,17 @@ static void machine_initfn(Object *obj)
> >>                               machine_set_dump_guest_core,
> >>                               NULL);
> >>      object_property_add_bool(obj, "mem-merge",
> >> -                             machine_get_mem_merge, machine_set_mem_merge, NULL);
> >> -    object_property_add_bool(obj, "usb", machine_get_usb, machine_set_usb, NULL);
> >> +                             machine_get_mem_merge,
> >> +                             machine_set_mem_merge, NULL);
> >> +    object_property_add_bool(obj, "usb",
> >> +                             machine_get_usb,
> >> +                             machine_set_usb, NULL);
> >>      object_property_add_str(obj, "firmware",
> >> -                            machine_get_firmware, machine_set_firmware, NULL);
> >> +                            machine_get_firmware,
> >> +                            machine_set_firmware, NULL);
> >> +    object_property_add_bool(obj, "iommu",
> >> +                             machine_get_iommu,
> >> +                             machine_set_iommu, NULL);
> >>  }
> >>
> >>  static void machine_finalize(Object *obj)
> >> diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c
> >> index a0a3068..3342711 100644
> >> --- a/hw/pci-host/q35.c
> >> +++ b/hw/pci-host/q35.c
> >> @@ -347,6 +347,53 @@ static void mch_reset(DeviceState *qdev)
> >>      mch_update(mch);
> >>  }
> >>
> >> +static AddressSpace *q35_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
> >> +{
> >> +    IntelIOMMUState *s = opaque;
> >> +    VTDAddressSpace **pvtd_as;
> >> +    VTDAddressSpace *vtd_as;
> >> +    int bus_num = pci_bus_num(bus);
> >> +
> >> +    assert(devfn >= 0);
> >> +
> >> +    pvtd_as = s->address_spaces[bus_num];
> >> +    if (!pvtd_as) {
> >> +        /* No corresponding free() */
> >> +        pvtd_as = g_malloc0(sizeof(VTDAddressSpace *) *
> >> +                            VTD_PCI_SLOT_MAX * VTD_PCI_FUNC_MAX);
> >> +        s->address_spaces[bus_num] = pvtd_as;
> >> +    }
> >> +
> >> +    vtd_as = *(pvtd_as + devfn);
> >
> > pvtd_as[devfn] is cleaner.
> > In fact, you can then drop vtd_as local variable, use pvtd_as[devfn]
> > directly.
> >
> >> +    if (!vtd_as) {
> >> +        vtd_as = g_malloc0(sizeof(*vtd_as));
> >> +        *(pvtd_as + devfn) = vtd_as;
> >> +
> >> +        vtd_as->bus_num = bus_num;
> >> +        vtd_as->devfn = devfn;
> >> +        vtd_as->iommu_state = s;
> >> +        memory_region_init_iommu(&vtd_as->iommu, OBJECT(s), &s->iommu_ops,
> >> +                                 "intel_iommu", UINT64_MAX);
> >> +        address_space_init(&vtd_as->as, &vtd_as->iommu, "intel_iommu");
> >> +    }
> >> +
> >> +    return &vtd_as->as;
> >> +}
> >> +
> >> +static void mch_init_dmar(MCHPCIState *mch)
> >> +{
> >> +    PCIBus *pci_bus = PCI_BUS(qdev_get_parent_bus(DEVICE(mch)));
> >> +
> >> +    mch->iommu = INTEL_IOMMU_DEVICE(qdev_create(NULL, TYPE_INTEL_IOMMU_DEVICE));
> >> +    object_property_add_child(OBJECT(mch), "intel-iommu",
> >> +                              OBJECT(mch->iommu), NULL);
> >> +    qdev_init_nofail(DEVICE(mch->iommu));
> >> +    sysbus_mmio_map(SYS_BUS_DEVICE(mch->iommu), 0, Q35_HOST_BRIDGE_IOMMU_ADDR);
> >> +
> >> +    pci_setup_iommu(pci_bus, q35_host_dma_iommu, mch->iommu);
> >> +}
> >> +
> >> +
> >>  static int mch_init(PCIDevice *d)
> >>  {
> >>      int i;
> >> @@ -363,13 +410,20 @@ static int mch_init(PCIDevice *d)
> >>      memory_region_add_subregion_overlap(mch->system_memory, 0xa0000,
> >>                                          &mch->smram_region, 1);
> >>      memory_region_set_enabled(&mch->smram_region, false);
> >> -    init_pam(DEVICE(mch), mch->ram_memory, mch->system_memory, mch->pci_address_space,
> >> -             &mch->pam_regions[0], PAM_BIOS_BASE, PAM_BIOS_SIZE);
> >> +    init_pam(DEVICE(mch), mch->ram_memory, mch->system_memory,
> >> +             mch->pci_address_space, &mch->pam_regions[0], PAM_BIOS_BASE,
> >> +             PAM_BIOS_SIZE);
> >>      for (i = 0; i < 12; ++i) {
> >> -        init_pam(DEVICE(mch), mch->ram_memory, mch->system_memory, mch->pci_address_space,
> >> -                 &mch->pam_regions[i+1], PAM_EXPAN_BASE + i * PAM_EXPAN_SIZE,
> >> -                 PAM_EXPAN_SIZE);
> >> +        init_pam(DEVICE(mch), mch->ram_memory, mch->system_memory,
> >> +                 mch->pci_address_space, &mch->pam_regions[i+1],
> >> +                 PAM_EXPAN_BASE + i * PAM_EXPAN_SIZE, PAM_EXPAN_SIZE);
> >
> > Random formatting changes above? Make it a separate patch please.
> 
> Yes, I was told that I could fix some coding style issues around
> places that my patch touches. If that is improper, I will just first
> leave them out. :)
> Thanks for your reviews!
> 
> Le

Review's easier if they are split out in a separate patch.
No need to discard this, just split.

> >> +    }
> >> +
> >> +    /* Intel IOMMU (VT-d) */
> >> +    if (qemu_opt_get_bool(qemu_get_machine_opts(), "iommu", false)) {
> >> +        mch_init_dmar(mch);
> >>      }
> >> +
> >>      return 0;
> >>  }
> >>
> >> diff --git a/include/hw/boards.h b/include/hw/boards.h
> >> index 605a970..dfb6718 100644
> >> --- a/include/hw/boards.h
> >> +++ b/include/hw/boards.h
> >> @@ -123,6 +123,7 @@ struct MachineState {
> >>      bool mem_merge;
> >>      bool usb;
> >>      char *firmware;
> >> +    bool iommu;
> >>
> >>      ram_addr_t ram_size;
> >>      ram_addr_t maxram_size;
> >> diff --git a/include/hw/pci-host/q35.h b/include/hw/pci-host/q35.h
> >> index d9ee978..025d6e6 100644
> >> --- a/include/hw/pci-host/q35.h
> >> +++ b/include/hw/pci-host/q35.h
> >> @@ -33,6 +33,7 @@
> >>  #include "hw/acpi/acpi.h"
> >>  #include "hw/acpi/ich9.h"
> >>  #include "hw/pci-host/pam.h"
> >> +#include "hw/i386/intel_iommu.h"
> >>
> >>  #define TYPE_Q35_HOST_DEVICE "q35-pcihost"
> >>  #define Q35_HOST_DEVICE(obj) \
> >> @@ -60,6 +61,7 @@ typedef struct MCHPCIState {
> >>      uint64_t pci_hole64_size;
> >>      PcGuestInfo *guest_info;
> >>      uint32_t short_root_bus;
> >> +    IntelIOMMUState *iommu;
> >>  } MCHPCIState;
> >>
> >>  typedef struct Q35PCIHost {
> >> diff --git a/qemu-options.hx b/qemu-options.hx
> >> index 96516c1..7406a17 100644
> >> --- a/qemu-options.hx
> >> +++ b/qemu-options.hx
> >> @@ -35,7 +35,8 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
> >>      "                kernel_irqchip=on|off controls accelerated irqchip support\n"
> >>      "                kvm_shadow_mem=size of KVM shadow MMU\n"
> >>      "                dump-guest-core=on|off include guest memory in a core dump (default=on)\n"
> >> -    "                mem-merge=on|off controls memory merge support (default: on)\n",
> >> +    "                mem-merge=on|off controls memory merge support (default: on)\n"
> >> +    "                iommu=on|off controls emulated Intel IOMMU (VT-d) support (default=off)\n",
> >>      QEMU_ARCH_ALL)
> >>  STEXI
> >>  @item -machine [type=]@var{name}[,prop=@var{value}[,...]]
> >> @@ -58,6 +59,8 @@ Include guest memory in a core dump. The default is on.
> >>  Enables or disables memory merge support. This feature, when supported by
> >>  the host, de-duplicates identical memory pages among VMs instances
> >>  (enabled by default).
> >> +@item iommu=on|off
> >> +Enables or disables emulated Intel IOMMU (VT-d) support. The default is off.
> >>  @end table
> >>  ETEXI
> >>
> >> diff --git a/vl.c b/vl.c
> >> index a8029d5..2ab1643 100644
> >> --- a/vl.c
> >> +++ b/vl.c
> >> @@ -388,6 +388,10 @@ static QemuOptsList qemu_machine_opts = {
> >>              .name = PC_MACHINE_MAX_RAM_BELOW_4G,
> >>              .type = QEMU_OPT_SIZE,
> >>              .help = "maximum ram below the 4G boundary (32bit boundary)",
> >> +        },{
> >> +            .name = "iommu",
> >> +            .type = QEMU_OPT_BOOL,
> >> +            .help = "Set on/off to enable/disable Intel IOMMU (VT-d)",
> >>          },
> >>          { /* End of list */ }
> >>      },
> >> --
> >> 1.9.1


* Re: [Qemu-devel] [PATCH v3 3/5] intel-iommu: add DMAR table to ACPI tables
  2014-08-14 11:06   ` Michael S. Tsirkin
@ 2014-08-14 11:36     ` Jan Kiszka
  2014-08-14 11:43       ` Michael S. Tsirkin
  2014-08-14 11:44       ` Michael S. Tsirkin
  2014-08-14 11:37     ` Le Tan
  1 sibling, 2 replies; 34+ messages in thread
From: Jan Kiszka @ 2014-08-14 11:36 UTC (permalink / raw)
  To: Michael S. Tsirkin, Le Tan
  Cc: Stefan Weil, Knut Omang, qemu-devel, Alex Williamson,
	Anthony Liguori, Paolo Bonzini


On 2014-08-14 13:06, Michael S. Tsirkin wrote:
> On Mon, Aug 11, 2014 at 03:05:00PM +0800, Le Tan wrote:
>> Expose Intel IOMMU to the BIOS. If object of TYPE_INTEL_IOMMU_DEVICE exists,
>> add DMAR table to ACPI RSDT table. For now the DMAR table indicates that there
>> is only one hardware unit without INTR_REMAP capability on the platform.
>>
>> Signed-off-by: Le Tan <tamlokveer@gmail.com>
> 
> Could you add a unit test please?

While unit tests would really be helpful, I'm afraid that's not in reach
(GSoC is almost over). The good news is that we have pretty broad test
coverage with both Linux and also Jailhouse already.

Do you see unit tests as precondition for merging the series?

Jan




^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v3 3/5] intel-iommu: add DMAR table to ACPI tables
  2014-08-14 11:06   ` Michael S. Tsirkin
  2014-08-14 11:36     ` Jan Kiszka
@ 2014-08-14 11:37     ` Le Tan
  1 sibling, 0 replies; 34+ messages in thread
From: Le Tan @ 2014-08-14 11:37 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stefan Weil, Knut Omang, qemu-devel, Alex Williamson, Jan Kiszka,
	Anthony Liguori, Paolo Bonzini

Hi Michael,

2014-08-14 19:06 GMT+08:00 Michael S. Tsirkin <mst@redhat.com>:
> On Mon, Aug 11, 2014 at 03:05:00PM +0800, Le Tan wrote:
>> Expose Intel IOMMU to the BIOS. If object of TYPE_INTEL_IOMMU_DEVICE exists,
>> add DMAR table to ACPI RSDT table. For now the DMAR table indicates that there
>> is only one hardware unit without INTR_REMAP capability on the platform.
>>
>> Signed-off-by: Le Tan <tamlokveer@gmail.com>
>
> Could you add a unit test please?
>
>> ---
>>  hw/i386/acpi-build.c | 41 ++++++++++++++++++++++++++++++
>>  hw/i386/acpi-defs.h  | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 111 insertions(+)
>>
>> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
>> index 816c6d9..595f501 100644
>> --- a/hw/i386/acpi-build.c
>> +++ b/hw/i386/acpi-build.c
>> @@ -47,6 +47,7 @@
>>  #include "hw/i386/ich9.h"
>>  #include "hw/pci/pci_bus.h"
>>  #include "hw/pci-host/q35.h"
>> +#include "hw/i386/intel_iommu.h"
>>
>>  #include "hw/i386/q35-acpi-dsdt.hex"
>>  #include "hw/i386/acpi-dsdt.hex"
>> @@ -1350,6 +1351,31 @@ build_mcfg_q35(GArray *table_data, GArray *linker, AcpiMcfgInfo *info)
>>  }
>>
>>  static void
>> +build_dmar_q35(GArray *table_data, GArray *linker)
>> +{
>> +    int dmar_start = table_data->len;
>> +
>> +    AcpiTableDmar *dmar;
>> +    AcpiDmarHardwareUnit *drhd;
>> +
>> +    dmar = acpi_data_push(table_data, sizeof(*dmar));
>> +    dmar->host_address_width = VTD_HOST_ADDRESS_WIDTH - 1;
>> +    dmar->flags = 0;    /* No intr_remap for now */
>> +
>> +    /* DMAR Remapping Hardware Unit Definition structure */
>> +    drhd = acpi_data_push(table_data, sizeof(*drhd));
>> +    drhd->type = cpu_to_le16(ACPI_DMAR_TYPE_HARDWARE_UNIT);
>> +    drhd->length = cpu_to_le16(sizeof(*drhd));   /* No device scope now */
>> +    drhd->flags = ACPI_DMAR_INCLUDE_PCI_ALL;
>> +    drhd->pci_segment = cpu_to_le16(0);
>> +    drhd->address = cpu_to_le64(Q35_HOST_BRIDGE_IOMMU_ADDR);
>> +
>> +    build_header(linker, table_data, (void *)(table_data->data + dmar_start),
>> +                 "DMAR", table_data->len - dmar_start, 1);
>> +}
>> +
>> +
>> +static void
>>  build_dsdt(GArray *table_data, GArray *linker, AcpiMiscInfo *misc)
>>  {
>>      AcpiTableHeader *dsdt;
>> @@ -1470,6 +1496,17 @@ static bool acpi_get_mcfg(AcpiMcfgInfo *mcfg)
>>      return true;
>>  }
>>
>> +static bool acpi_has_iommu(void)
>> +{
>> +    bool ambiguous;
>> +    Object *intel_iommu;
>> +
>> +    intel_iommu = object_resolve_path_type("", TYPE_INTEL_IOMMU_DEVICE,
>> +                                           &ambiguous);
>> +    return intel_iommu && !ambiguous;
>> +}
>> +
>> +
>>  static
>>  void acpi_build(PcGuestInfo *guest_info, AcpiBuildTables *tables)
>>  {
>> @@ -1539,6 +1576,10 @@ void acpi_build(PcGuestInfo *guest_info, AcpiBuildTables *tables)
>>          acpi_add_table(table_offsets, tables->table_data);
>>          build_mcfg_q35(tables->table_data, tables->linker, &mcfg);
>>      }
>> +    if (acpi_has_iommu()) {
>> +        acpi_add_table(table_offsets, tables->table_data);
>> +        build_dmar_q35(tables->table_data, tables->linker);
>> +    }
>>
>>      /* Add tables supplied by user (if any) */
>>      for (u = acpi_table_first(); u; u = acpi_table_next(u)) {
>> diff --git a/hw/i386/acpi-defs.h b/hw/i386/acpi-defs.h
>> index e93babb..9674825 100644
>> --- a/hw/i386/acpi-defs.h
>> +++ b/hw/i386/acpi-defs.h
>> @@ -314,4 +314,74 @@ struct AcpiTableMcfg {
>>  } QEMU_PACKED;
>>  typedef struct AcpiTableMcfg AcpiTableMcfg;
>>
>> +/* DMAR - DMA Remapping table r2.2 */
>> +struct AcpiTableDmar {
>> +    ACPI_TABLE_HEADER_DEF
>> +    uint8_t host_address_width; /* Maximum DMA physical addressability */
>> +    uint8_t flags;
>> +    uint8_t reserved[10];
>> +} QEMU_PACKED;
>> +typedef struct AcpiTableDmar AcpiTableDmar;
>> +
>> +/* Masks for Flags field above */
>> +#define ACPI_DMAR_INTR_REMAP    (1)
>> +#define ACPI_DMAR_X2APIC_OPT_OUT    (2)
>> +
>> +/*
>> + * DMAR sub-structures (Follow DMA Remapping table)
>> + */
>> +#define ACPI_DMAR_SUB_HEADER_DEF /* Common ACPI DMAR sub-structure header */\
>> +    uint16_t type;  \
>> +    uint16_t length;
>
> Really necessary? cleaner to just open-code, it's
> only used once ...

For now it is just used once. I wrote this because I thought that it
is convenient to add other sub-tables in the future. To make it
cleaner, I will remove it later.
Thanks very much!

Le
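[Editor's note: a sketch of the open-coded variant discussed here, i.e. the two header fields folded back into their only user; an illustration, not necessarily the code that was eventually posted.]

    /* Hardware Unit Definition with the sub-structure header open-coded
     * instead of going through ACPI_DMAR_SUB_HEADER_DEF */
    struct AcpiDmarHardwareUnit {
        uint16_t type;          /* ACPI_DMAR_TYPE_HARDWARE_UNIT */
        uint16_t length;
        uint8_t flags;
        uint8_t reserved;
        uint16_t pci_segment;   /* The PCI Segment associated with this unit */
        uint64_t address;   /* Base address of remapping hardware register-set */
    } QEMU_PACKED;
    typedef struct AcpiDmarHardwareUnit AcpiDmarHardwareUnit;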

>> +
>> +/* Values for sub-structure type for DMAR */
>> +enum {
>> +    ACPI_DMAR_TYPE_HARDWARE_UNIT = 0,   /* DRHD */
>> +    ACPI_DMAR_TYPE_RESERVED_MEMORY = 1, /* RMRR */
>> +    ACPI_DMAR_TYPE_ATSR = 2,    /* ATSR */
>> +    ACPI_DMAR_TYPE_HARDWARE_AFFINITY = 3,   /* RHSR */
>> +    ACPI_DMAR_TYPE_ANDD = 4,    /* ANDD */
>> +    ACPI_DMAR_TYPE_RESERVED = 5 /* Reserved for furture use */
>> +};
>> +
>> +/*
>> + * Sub-structures for DMAR, correspond to Type in ACPI_DMAR_SUB_HEADER_DEF
>> + */
>> +
>> +/* DMAR Device Scope structures */
>> +struct AcpiDmarDeviceScope {
>> +    uint8_t type;
>> +    uint8_t length;
>> +    uint16_t reserved;
>> +    uint8_t enumeration_id;
>> +    uint8_t start_bus_number;
>> +    uint8_t path[0];
>> +} QEMU_PACKED;
>> +typedef struct AcpiDmarDeviceScope AcpiDmarDeviceScope;
>> +
>> +/* Values for type in struct AcpiDmarDeviceScope */
>> +enum {
>> +    ACPI_DMAR_SCOPE_TYPE_NOT_USED = 0,
>> +    ACPI_DMAR_SCOPE_TYPE_ENDPOINT = 1,
>> +    ACPI_DMAR_SCOPE_TYPE_BRIDGE = 2,
>> +    ACPI_DMAR_SCOPE_TYPE_IOAPIC = 3,
>> +    ACPI_DMAR_SCOPE_TYPE_HPET = 4,
>> +    ACPI_DMAR_SCOPE_TYPE_ACPI = 5,
>> +    ACPI_DMAR_SCOPE_TYPE_RESERVED = 6 /* Reserved for future use */
>> +};
>> +
>> +/* 0: Hardware Unit Definition */
>> +struct AcpiDmarHardwareUnit {
>> +    ACPI_DMAR_SUB_HEADER_DEF
>> +    uint8_t flags;
>> +    uint8_t reserved;
>> +    uint16_t pci_segment;   /* The PCI Segment associated with this unit */
>> +    uint64_t address;   /* Base address of remapping hardware register-set */
>> +} QEMU_PACKED;
>> +typedef struct AcpiDmarHardwareUnit AcpiDmarHardwareUnit;
>> +
>> +/* Masks for Flags field above */
>> +#define ACPI_DMAR_INCLUDE_PCI_ALL (1)
>> +
>> +
>
> Don't add two empty lines in a row.
> Same in other places.
>
>>  #endif
>> --
>> 1.9.1


* Re: [Qemu-devel] [PATCH v3 4/5] intel-iommu: add Intel IOMMU emulation to q35 and add a machine option "iommu" as a switch
  2014-08-14 11:35       ` Michael S. Tsirkin
@ 2014-08-14 11:39         ` Le Tan
  0 siblings, 0 replies; 34+ messages in thread
From: Le Tan @ 2014-08-14 11:39 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stefan Weil, Knut Omang, qemu-devel, Alex Williamson, Jan Kiszka,
	Anthony Liguori, Paolo Bonzini

2014-08-14 19:35 GMT+08:00 Michael S. Tsirkin <mst@redhat.com>:
> On Thu, Aug 14, 2014 at 07:33:29PM +0800, Le Tan wrote:
>> 2014-08-14 19:12 GMT+08:00 Michael S. Tsirkin <mst@redhat.com>:
>> > On Mon, Aug 11, 2014 at 03:05:01PM +0800, Le Tan wrote:
>> >> Add Intel IOMMU emulation to q35 chipset and expose it to the guest.
>> >> 1. Add a machine option. Users can use "-machine iommu=on|off" in the command
>> >> line to enable/disable Intel IOMMU. The default is off.
>> >> 2. According to the machine option, q35 will initialize the Intel IOMMU and
>> >> use pci_setup_iommu() to setup q35_host_dma_iommu() as the IOMMU function for
>> >> the pci bus.
>> >> 3. q35_host_dma_iommu() will return different address space according to the
>> >> bus_num and devfn of the device.
>> >>
>> >> Signed-off-by: Le Tan <tamlokveer@gmail.com>
>> >> ---
>> >>  hw/core/machine.c         | 27 +++++++++++++++++---
>> >>  hw/pci-host/q35.c         | 64 +++++++++++++++++++++++++++++++++++++++++++----
>> >>  include/hw/boards.h       |  1 +
>> >>  include/hw/pci-host/q35.h |  2 ++
>> >>  qemu-options.hx           |  5 +++-
>> >>  vl.c                      |  4 +++
>> >>  6 files changed, 94 insertions(+), 9 deletions(-)
>> >>
>> >> diff --git a/hw/core/machine.c b/hw/core/machine.c
>> >> index 7a66c57..f0046d6 100644
>> >> --- a/hw/core/machine.c
>> >> +++ b/hw/core/machine.c
>> >> @@ -235,6 +235,20 @@ static void machine_set_firmware(Object *obj, const char *value, Error **errp)
>> >>      ms->firmware = g_strdup(value);
>> >>  }
>> >>
>> >> +static bool machine_get_iommu(Object *obj, Error **errp)
>> >> +{
>> >> +    MachineState *ms = MACHINE(obj);
>> >> +
>> >> +    return ms->iommu;
>> >> +}
>> >> +
>> >> +static void machine_set_iommu(Object *obj, bool value, Error **errp)
>> >> +{
>> >> +    MachineState *ms = MACHINE(obj);
>> >> +
>> >> +    ms->iommu = value;
>> >> +}
>> >> +
>> >>  static void machine_initfn(Object *obj)
>> >>  {
>> >>      object_property_add_str(obj, "accel",
>> >> @@ -270,10 +284,17 @@ static void machine_initfn(Object *obj)
>> >>                               machine_set_dump_guest_core,
>> >>                               NULL);
>> >>      object_property_add_bool(obj, "mem-merge",
>> >> -                             machine_get_mem_merge, machine_set_mem_merge, NULL);
>> >> -    object_property_add_bool(obj, "usb", machine_get_usb, machine_set_usb, NULL);
>> >> +                             machine_get_mem_merge,
>> >> +                             machine_set_mem_merge, NULL);
>> >> +    object_property_add_bool(obj, "usb",
>> >> +                             machine_get_usb,
>> >> +                             machine_set_usb, NULL);
>> >>      object_property_add_str(obj, "firmware",
>> >> -                            machine_get_firmware, machine_set_firmware, NULL);
>> >> +                            machine_get_firmware,
>> >> +                            machine_set_firmware, NULL);
>> >> +    object_property_add_bool(obj, "iommu",
>> >> +                             machine_get_iommu,
>> >> +                             machine_set_iommu, NULL);
>> >>  }
>> >>
>> >>  static void machine_finalize(Object *obj)
>> >> diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c
>> >> index a0a3068..3342711 100644
>> >> --- a/hw/pci-host/q35.c
>> >> +++ b/hw/pci-host/q35.c
>> >> @@ -347,6 +347,53 @@ static void mch_reset(DeviceState *qdev)
>> >>      mch_update(mch);
>> >>  }
>> >>
>> >> +static AddressSpace *q35_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
>> >> +{
>> >> +    IntelIOMMUState *s = opaque;
>> >> +    VTDAddressSpace **pvtd_as;
>> >> +    VTDAddressSpace *vtd_as;
>> >> +    int bus_num = pci_bus_num(bus);
>> >> +
>> >> +    assert(devfn >= 0);
>> >> +
>> >> +    pvtd_as = s->address_spaces[bus_num];
>> >> +    if (!pvtd_as) {
>> >> +        /* No corresponding free() */
>> >> +        pvtd_as = g_malloc0(sizeof(VTDAddressSpace *) *
>> >> +                            VTD_PCI_SLOT_MAX * VTD_PCI_FUNC_MAX);
>> >> +        s->address_spaces[bus_num] = pvtd_as;
>> >> +    }
>> >> +
>> >> +    vtd_as = *(pvtd_as + devfn);
>> >
>> > pvtd_as[devfn] is cleaner.
>> > In fact, you can then drop vtd_as local variable, use pvtd_as[devfn]
>> > directly.
>> >
>> >> +    if (!vtd_as) {
>> >> +        vtd_as = g_malloc0(sizeof(*vtd_as));
>> >> +        *(pvtd_as + devfn) = vtd_as;
>> >> +
>> >> +        vtd_as->bus_num = bus_num;
>> >> +        vtd_as->devfn = devfn;
>> >> +        vtd_as->iommu_state = s;
>> >> +        memory_region_init_iommu(&vtd_as->iommu, OBJECT(s), &s->iommu_ops,
>> >> +                                 "intel_iommu", UINT64_MAX);
>> >> +        address_space_init(&vtd_as->as, &vtd_as->iommu, "intel_iommu");
>> >> +    }
>> >> +
>> >> +    return &vtd_as->as;
>> >> +}
>> >> +
>> >> +static void mch_init_dmar(MCHPCIState *mch)
>> >> +{
>> >> +    PCIBus *pci_bus = PCI_BUS(qdev_get_parent_bus(DEVICE(mch)));
>> >> +
>> >> +    mch->iommu = INTEL_IOMMU_DEVICE(qdev_create(NULL, TYPE_INTEL_IOMMU_DEVICE));
>> >> +    object_property_add_child(OBJECT(mch), "intel-iommu",
>> >> +                              OBJECT(mch->iommu), NULL);
>> >> +    qdev_init_nofail(DEVICE(mch->iommu));
>> >> +    sysbus_mmio_map(SYS_BUS_DEVICE(mch->iommu), 0, Q35_HOST_BRIDGE_IOMMU_ADDR);
>> >> +
>> >> +    pci_setup_iommu(pci_bus, q35_host_dma_iommu, mch->iommu);
>> >> +}
>> >> +
>> >> +
>> >>  static int mch_init(PCIDevice *d)
>> >>  {
>> >>      int i;
>> >> @@ -363,13 +410,20 @@ static int mch_init(PCIDevice *d)
>> >>      memory_region_add_subregion_overlap(mch->system_memory, 0xa0000,
>> >>                                          &mch->smram_region, 1);
>> >>      memory_region_set_enabled(&mch->smram_region, false);
>> >> -    init_pam(DEVICE(mch), mch->ram_memory, mch->system_memory, mch->pci_address_space,
>> >> -             &mch->pam_regions[0], PAM_BIOS_BASE, PAM_BIOS_SIZE);
>> >> +    init_pam(DEVICE(mch), mch->ram_memory, mch->system_memory,
>> >> +             mch->pci_address_space, &mch->pam_regions[0], PAM_BIOS_BASE,
>> >> +             PAM_BIOS_SIZE);
>> >>      for (i = 0; i < 12; ++i) {
>> >> -        init_pam(DEVICE(mch), mch->ram_memory, mch->system_memory, mch->pci_address_space,
>> >> -                 &mch->pam_regions[i+1], PAM_EXPAN_BASE + i * PAM_EXPAN_SIZE,
>> >> -                 PAM_EXPAN_SIZE);
>> >> +        init_pam(DEVICE(mch), mch->ram_memory, mch->system_memory,
>> >> +                 mch->pci_address_space, &mch->pam_regions[i+1],
>> >> +                 PAM_EXPAN_BASE + i * PAM_EXPAN_SIZE, PAM_EXPAN_SIZE);
>> >
>> > Random formatting changes above? Make it a separate patch please.
>>
>> Yes, I was told that I could fix some coding style issues around
>> places that my patch touches. If that is improper, I will just first
>> leave them out. :)
>> Thanks for your reviews!
>>
>> Le
>
> Review's easier if they are split out in a separate patch.
> No need to discard this, just split.

OK, got it.

Le

>> >> +    }
>> >> +
>> >> +    /* Intel IOMMU (VT-d) */
>> >> +    if (qemu_opt_get_bool(qemu_get_machine_opts(), "iommu", false)) {
>> >> +        mch_init_dmar(mch);
>> >>      }
>> >> +
>> >>      return 0;
>> >>  }
>> >>
>> >> diff --git a/include/hw/boards.h b/include/hw/boards.h
>> >> index 605a970..dfb6718 100644
>> >> --- a/include/hw/boards.h
>> >> +++ b/include/hw/boards.h
>> >> @@ -123,6 +123,7 @@ struct MachineState {
>> >>      bool mem_merge;
>> >>      bool usb;
>> >>      char *firmware;
>> >> +    bool iommu;
>> >>
>> >>      ram_addr_t ram_size;
>> >>      ram_addr_t maxram_size;
>> >> diff --git a/include/hw/pci-host/q35.h b/include/hw/pci-host/q35.h
>> >> index d9ee978..025d6e6 100644
>> >> --- a/include/hw/pci-host/q35.h
>> >> +++ b/include/hw/pci-host/q35.h
>> >> @@ -33,6 +33,7 @@
>> >>  #include "hw/acpi/acpi.h"
>> >>  #include "hw/acpi/ich9.h"
>> >>  #include "hw/pci-host/pam.h"
>> >> +#include "hw/i386/intel_iommu.h"
>> >>
>> >>  #define TYPE_Q35_HOST_DEVICE "q35-pcihost"
>> >>  #define Q35_HOST_DEVICE(obj) \
>> >> @@ -60,6 +61,7 @@ typedef struct MCHPCIState {
>> >>      uint64_t pci_hole64_size;
>> >>      PcGuestInfo *guest_info;
>> >>      uint32_t short_root_bus;
>> >> +    IntelIOMMUState *iommu;
>> >>  } MCHPCIState;
>> >>
>> >>  typedef struct Q35PCIHost {
>> >> diff --git a/qemu-options.hx b/qemu-options.hx
>> >> index 96516c1..7406a17 100644
>> >> --- a/qemu-options.hx
>> >> +++ b/qemu-options.hx
>> >> @@ -35,7 +35,8 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
>> >>      "                kernel_irqchip=on|off controls accelerated irqchip support\n"
>> >>      "                kvm_shadow_mem=size of KVM shadow MMU\n"
>> >>      "                dump-guest-core=on|off include guest memory in a core dump (default=on)\n"
>> >> -    "                mem-merge=on|off controls memory merge support (default: on)\n",
>> >> +    "                mem-merge=on|off controls memory merge support (default: on)\n"
>> >> +    "                iommu=on|off controls emulated Intel IOMMU (VT-d) support (default=off)\n",
>> >>      QEMU_ARCH_ALL)
>> >>  STEXI
>> >>  @item -machine [type=]@var{name}[,prop=@var{value}[,...]]
>> >> @@ -58,6 +59,8 @@ Include guest memory in a core dump. The default is on.
>> >>  Enables or disables memory merge support. This feature, when supported by
>> >>  the host, de-duplicates identical memory pages among VMs instances
>> >>  (enabled by default).
>> >> +@item iommu=on|off
>> >> +Enables or disables emulated Intel IOMMU (VT-d) support. The default is off.
>> >>  @end table
>> >>  ETEXI
>> >>
>> >> diff --git a/vl.c b/vl.c
>> >> index a8029d5..2ab1643 100644
>> >> --- a/vl.c
>> >> +++ b/vl.c
>> >> @@ -388,6 +388,10 @@ static QemuOptsList qemu_machine_opts = {
>> >>              .name = PC_MACHINE_MAX_RAM_BELOW_4G,
>> >>              .type = QEMU_OPT_SIZE,
>> >>              .help = "maximum ram below the 4G boundary (32bit boundary)",
>> >> +        },{
>> >> +            .name = "iommu",
>> >> +            .type = QEMU_OPT_BOOL,
>> >> +            .help = "Set on/off to enable/disable Intel IOMMU (VT-d)",
>> >>          },
>> >>          { /* End of list */ }
>> >>      },
>> >> --
>> >> 1.9.1


* Re: [Qemu-devel] [PATCH v3 3/5] intel-iommu: add DMAR table to ACPI tables
  2014-08-14 11:36     ` Jan Kiszka
@ 2014-08-14 11:43       ` Michael S. Tsirkin
  2014-08-14 11:51         ` Jan Kiszka
  2014-08-14 11:44       ` Michael S. Tsirkin
  1 sibling, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2014-08-14 11:43 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Stefan Weil, Knut Omang, Le Tan, qemu-devel, Alex Williamson,
	Anthony Liguori, Paolo Bonzini

On Thu, Aug 14, 2014 at 01:36:49PM +0200, Jan Kiszka wrote:
> On 2014-08-14 13:06, Michael S. Tsirkin wrote:
> > On Mon, Aug 11, 2014 at 03:05:00PM +0800, Le Tan wrote:
> >> Expose Intel IOMMU to the BIOS. If object of TYPE_INTEL_IOMMU_DEVICE exists,
> >> add DMAR table to ACPI RSDT table. For now the DMAR table indicates that there
> >> is only one hardware unit without INTR_REMAP capability on the platform.
> >>
> >> Signed-off-by: Le Tan <tamlokveer@gmail.com>
> > 
> > Could you add a unit test please?
> 
> While unit tests would really be helpful, I'm afraid that's not in reach
> (GSoC is almost over). The good news is that we have pretty broad test
> coverage with both Linux and also Jailhouse already.
> 
> Do you see unit tests as precondition for merging the series?
> 
> Jan
> 

Not a pre-requisite - it's very easy to add a unit test,
just add a case in test_acpi_tcg. So I can do it myself afterwards.
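[Editor's note: a rough sketch of what such a test case might look like. Only tests/bios-tables-test.c and the "iommu=on" machine option are taken from the thread; the test_data struct, test_acpi_one(), free_test_data() and MACHINE_Q35 are assumptions about the helpers in that file at the time and may not match it exactly.]

    /* Hedged sketch: helper names are assumptions, see note above. */
    static void test_acpi_q35_tcg_iommu(void)
    {
        test_data data;

        memset(&data, 0, sizeof(data));
        data.machine = MACHINE_Q35;
        test_acpi_one("-machine q35,accel=tcg,iommu=on", &data);
        free_test_data(&data);
    }

    /* registered from main() next to the existing cases, e.g.:
     * qtest_add_func("acpi/q35/tcg/iommu", test_acpi_q35_tcg_iommu); */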


* Re: [Qemu-devel] [PATCH v3 3/5] intel-iommu: add DMAR table to ACPI tables
  2014-08-14 11:36     ` Jan Kiszka
  2014-08-14 11:43       ` Michael S. Tsirkin
@ 2014-08-14 11:44       ` Michael S. Tsirkin
  1 sibling, 0 replies; 34+ messages in thread
From: Michael S. Tsirkin @ 2014-08-14 11:44 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Stefan Weil, Knut Omang, Le Tan, qemu-devel, Alex Williamson,
	Anthony Liguori, Paolo Bonzini

On Thu, Aug 14, 2014 at 01:36:49PM +0200, Jan Kiszka wrote:
> On 2014-08-14 13:06, Michael S. Tsirkin wrote:
> > On Mon, Aug 11, 2014 at 03:05:00PM +0800, Le Tan wrote:
> >> Expose Intel IOMMU to the BIOS. If object of TYPE_INTEL_IOMMU_DEVICE exists,
> >> add DMAR table to ACPI RSDT table. For now the DMAR table indicates that there
> >> is only one hardware unit without INTR_REMAP capability on the platform.
> >>
> >> Signed-off-by: Le Tan <tamlokveer@gmail.com>
> > 
> > Could you add a unit test please?
> 
> While unit tests would really be helpful, I'm afraid that's not in reach
> (GSoC is almost over).

How much time is left BTW?

> The good news is that we have pretty broad test
> coverage with both Linux and also Jailhouse already.
> 
> Do you see unit tests as precondition for merging the series?
> 
> Jan
> 
> 


* Re: [Qemu-devel] [PATCH v3 3/5] intel-iommu: add DMAR table to ACPI tables
  2014-08-14 11:43       ` Michael S. Tsirkin
@ 2014-08-14 11:51         ` Jan Kiszka
  2014-08-14 11:53           ` Le Tan
  0 siblings, 1 reply; 34+ messages in thread
From: Jan Kiszka @ 2014-08-14 11:51 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stefan Weil, Knut Omang, Le Tan, qemu-devel, Alex Williamson,
	Anthony Liguori, Paolo Bonzini


On 2014-08-14 13:43, Michael S. Tsirkin wrote:
> On Thu, Aug 14, 2014 at 01:36:49PM +0200, Jan Kiszka wrote:
>> On 2014-08-14 13:06, Michael S. Tsirkin wrote:
>>> On Mon, Aug 11, 2014 at 03:05:00PM +0800, Le Tan wrote:
>>>> Expose Intel IOMMU to the BIOS. If object of TYPE_INTEL_IOMMU_DEVICE exists,
>>>> add DMAR table to ACPI RSDT table. For now the DMAR table indicates that there
>>>> is only one hardware unit without INTR_REMAP capability on the platform.
>>>>
>>>> Signed-off-by: Le Tan <tamlokveer@gmail.com>
>>>
>>> Could you add a unit test please?
>>
>> While unit tests would really be helpful, I'm afraid that's not in reach
>> (GSoC is almost over). The good news is that we have pretty broad test
>> coverage with both Linux and also Jailhouse already.
>>
>> Do you see unit tests as precondition for merging the series?
>>
>> Jan
>>
> 
> Not a pre-requisite - it's very easy to add a unit test,
> just add a case in test_acpi_tcg. So I can do it myself afterwards.
> 

Ah, ok, that's tests/bios-tables-test.c. Maybe Le can have a look if
time is left. But hard pencils-down is already on the 18th.

Jan





* Re: [Qemu-devel] [PATCH v3 3/5] intel-iommu: add DMAR table to ACPI tables
  2014-08-14 11:51         ` Jan Kiszka
@ 2014-08-14 11:53           ` Le Tan
  2014-08-14 12:04             ` Michael S. Tsirkin
  0 siblings, 1 reply; 34+ messages in thread
From: Le Tan @ 2014-08-14 11:53 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Michael S. Tsirkin, Stefan Weil, Knut Omang, qemu-devel,
	Alex Williamson, Anthony Liguori, Paolo Bonzini

2014-08-14 19:51 GMT+08:00 Jan Kiszka <jan.kiszka@web.de>:
> On 2014-08-14 13:43, Michael S. Tsirkin wrote:
>> On Thu, Aug 14, 2014 at 01:36:49PM +0200, Jan Kiszka wrote:
>>> On 2014-08-14 13:06, Michael S. Tsirkin wrote:
>>>> On Mon, Aug 11, 2014 at 03:05:00PM +0800, Le Tan wrote:
>>>>> Expose Intel IOMMU to the BIOS. If object of TYPE_INTEL_IOMMU_DEVICE exists,
>>>>> add DMAR table to ACPI RSDT table. For now the DMAR table indicates that there
>>>>> is only one hardware unit without INTR_REMAP capability on the platform.
>>>>>
>>>>> Signed-off-by: Le Tan <tamlokveer@gmail.com>
>>>>
>>>> Could you add a unit test please?
>>>
>>> While unit tests would really be helpful, I'm afraid that's not in reach
>>> (GSoC is almost over). The good news is that we have pretty broad test
>>> coverage with both Linux and also Jailhouse already.
>>>
>>> Do you see unit tests as precondition for merging the series?
>>>
>>> Jan
>>>
>>
>> Not a pre-requisite - it's very easy to add a unit test,
>> just add a case in test_acpi_tcg. So I can do it myself afterwards.
>>
>
> Ah, ok, that's tests/bios-tables-test.c. Maybe Le can have a look if
> time is left. But hard pencils-down is already on the 18th.
>
> Jan
>
>

OK, I will attempt to add that. I have just pushed the IOTLB cache. I
am going to fix patches according to these reviews now.

Le


* Re: [Qemu-devel] [PATCH v3 3/5] intel-iommu: add DMAR table to ACPI tables
  2014-08-14 11:53           ` Le Tan
@ 2014-08-14 12:04             ` Michael S. Tsirkin
  0 siblings, 0 replies; 34+ messages in thread
From: Michael S. Tsirkin @ 2014-08-14 12:04 UTC (permalink / raw)
  To: Le Tan
  Cc: Stefan Weil, Knut Omang, qemu-devel, Alex Williamson, Jan Kiszka,
	Anthony Liguori, Paolo Bonzini

On Thu, Aug 14, 2014 at 07:53:32PM +0800, Le Tan wrote:
> 2014-08-14 19:51 GMT+08:00 Jan Kiszka <jan.kiszka@web.de>:
> > On 2014-08-14 13:43, Michael S. Tsirkin wrote:
> >> On Thu, Aug 14, 2014 at 01:36:49PM +0200, Jan Kiszka wrote:
> >>> On 2014-08-14 13:06, Michael S. Tsirkin wrote:
> >>>> On Mon, Aug 11, 2014 at 03:05:00PM +0800, Le Tan wrote:
> >>>>> Expose Intel IOMMU to the BIOS. If object of TYPE_INTEL_IOMMU_DEVICE exists,
> >>>>> add DMAR table to ACPI RSDT table. For now the DMAR table indicates that there
> >>>>> is only one hardware unit without INTR_REMAP capability on the platform.
> >>>>>
> >>>>> Signed-off-by: Le Tan <tamlokveer@gmail.com>
> >>>>
> >>>> Could you add a unit test please?
> >>>
> >>> While unit tests would really be helpful, I'm afraid that's not in reach
> >>> (GSoC is almost over). The good news is that we have pretty broad test
> >>> coverage with both Linux and also Jailhouse already.
> >>>
> >>> Do you see unit tests as precondition for merging the series?
> >>>
> >>> Jan
> >>>
> >>
> >> Not a pre-requisite - it's very easy to add a unit test,
> >> just add a case in test_acpi_tcg. So I can do it myself afterwards.
> >>
> >
> > Ah, ok, that's tests/bios-tables-test.c. Maybe Le can have a look if
> > time is left. But hard pencils-down is already on the 18th.
> >
> > Jan
> >
> >
> 
> OK, I will attempt to add that. I have just pushed the IOTLB cache. I
> am going to fix patches according to these reviews now.
> 
> Le

Yes, address comments and repost first; tests are nice
to have if you have the time.

-- 
MST


* Re: [Qemu-devel] [PATCH v3 0/5] intel-iommu: introduce Intel IOMMU (VT-d) emulation to q35 chipset
  2014-08-14 11:15 ` [Qemu-devel] [PATCH v3 0/5] intel-iommu: introduce Intel IOMMU (VT-d) emulation to q35 chipset Michael S. Tsirkin
@ 2014-08-14 12:10   ` Jan Kiszka
  2014-08-15  4:42     ` Knut Omang
  0 siblings, 1 reply; 34+ messages in thread
From: Jan Kiszka @ 2014-08-14 12:10 UTC (permalink / raw)
  To: Michael S. Tsirkin, Le Tan
  Cc: Stefan Weil, Knut Omang, qemu-devel, Alex Williamson,
	Anthony Liguori, Paolo Bonzini


On 2014-08-14 13:15, Michael S. Tsirkin wrote:
> On Mon, Aug 11, 2014 at 03:04:57PM +0800, Le Tan wrote:
>> Hi,
>>
>> These patches are intended to introduce Intel IOMMU (VT-d) emulation to q35
>> chipset. The major job in these patches is to add support for emulating Intel
>> IOMMU according to the VT-d specification, including basic responses to CSRs
>> accesses, the logics of DMAR (DMA remapping) and DMA memory address
>> translations.
> 
> Thanks!
> Looks very good overall, I noted some coding style issues - I didn't
> bother reporting each issue in every place where it appears - reported
> each issue once only, so please find and fix all instances of each
> issue.

BTW, because I was in urgent need for virtual test environment for
Jailhouse, I hacked interrupt remapping on top of Le's patches:

http://git.kiszka.org/?p=qemu.git;a=shortlog;h=refs/heads/queues/vtd-intremap

The approach likely needs further discussions and refinements but it
already allows me to work on top with our hypervisor, and also Linux.
You can see from the last commit that Le's work made it pretty easy to
build this on top.

Jan





* Re: [Qemu-devel] [PATCH v3 0/5] intel-iommu: introduce Intel IOMMU (VT-d) emulation to q35 chipset
  2014-08-14 12:10   ` Jan Kiszka
@ 2014-08-15  4:42     ` Knut Omang
  2014-08-15 11:15       ` Knut Omang
  0 siblings, 1 reply; 34+ messages in thread
From: Knut Omang @ 2014-08-15  4:42 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Michael S. Tsirkin, Stefan Weil, qemu-devel, Le Tan,
	Alex Williamson, Anthony Liguori, Paolo Bonzini

On Thu, 2014-08-14 at 14:10 +0200, Jan Kiszka wrote:
> On 2014-08-14 13:15, Michael S. Tsirkin wrote:
> > On Mon, Aug 11, 2014 at 03:04:57PM +0800, Le Tan wrote:
> >> Hi,
> >>
> >> These patches are intended to introduce Intel IOMMU (VT-d) emulation to q35
> >> chipset. The major job in these patches is to add support for emulating Intel
> >> IOMMU according to the VT-d specification, including basic responses to CSRs
> >> accesses, the logics of DMAR (DMA remapping) and DMA memory address
> >> translations.
> > 
> > Thanks!
> > Looks very good overall, I noted some coding style issues - I didn't
> > bother reporting each issue in every place where it appears - reported
> > each issue once only, so please find and fix all instances of each
> > issue.
> 
> BTW, because I was in urgent need for virtual test environment for
> Jailhouse, I hacked interrupt remapping on top of Le's patches:
> 
> http://git.kiszka.org/?p=qemu.git;a=shortlog;h=refs/heads/queues/vtd-intremap
> 
> The approach likely needs further discussions and refinements but it
> already allows me to work on top with our hypervisor, and also Linux.
> You can see from the last commit that Le's work made it pretty easy to
> build this on top.

Le, 

I have tried Jan's branch with my device setup which consists of a
minimal q35 setup, an ioh3420 root port (specified as -device
ioh3420,slot=0 ) and a pcie device plugged into that root port, which
gives the following lspci -t:

-[0000:00]-+-00.0
           +-01.0
           +-02.0
           +-03.0-[01]----00.0
           +-04.0
           +-1f.0
           +-1f.2
           \-1f.3

All seems to work beautifully (I see the ISA bridge happily receive
translations) until the first DMA from my device model (at 1:00.0)
arrives, at which point I get:

[ 1663.732413] dmar: DMAR:[DMA Write] Request device [00:03.0] fault addr fffa0000 
[ 1663.732413] DMAR:[fault reason 02] Present bit in context entry is clear

I would have expected request device 01:00.0 for this.
It is not clear to me yet if this is a weakness of the implementation of
ioh3420 or the iommu. Just wanted to let you know right away in case you
can shed some light on it or it is an easy fix.

The device uses pci_dma_rw with itself as device pointer.

Knut
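[Editor's note: a minimal sketch of the call path being described. mydev_write_result() and its arguments are made up; pci_dma_write() is the existing helper from include/hw/pci/pci.h, a thin wrapper around pci_dma_rw() with DMA_DIRECTION_FROM_DEVICE.]

    /* The access is issued against the DMA AddressSpace QEMU set up for this
     * PCIDevice (with this series, the one returned by q35_host_dma_iommu()),
     * so the requester ID the VT-d emulation sees depends on how that lookup
     * was keyed. */
    static void mydev_write_result(PCIDevice *dev, dma_addr_t guest_addr,
                                   const void *result, dma_addr_t len)
    {
        pci_dma_write(dev, guest_addr, result, len);
    }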


* Re: [Qemu-devel] [PATCH v3 0/5] intel-iommu: introduce Intel IOMMU (VT-d) emulation to q35 chipset
  2014-08-15  4:42     ` Knut Omang
@ 2014-08-15 11:15       ` Knut Omang
  2014-08-15 11:37         ` Le Tan
  0 siblings, 1 reply; 34+ messages in thread
From: Knut Omang @ 2014-08-15 11:15 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Michael S. Tsirkin, Stefan Weil, Le Tan, qemu-devel,
	Alex Williamson, Anthony Liguori, Paolo Bonzini

On Fri, 2014-08-15 at 06:42 +0200, Knut Omang wrote:
> On Thu, 2014-08-14 at 14:10 +0200, Jan Kiszka wrote:
> > On 2014-08-14 13:15, Michael S. Tsirkin wrote:
> > > On Mon, Aug 11, 2014 at 03:04:57PM +0800, Le Tan wrote:
> > >> Hi,
> > >>
> > >> These patches are intended to introduce Intel IOMMU (VT-d) emulation to q35
> > >> chipset. The major job in these patches is to add support for emulating Intel
> > >> IOMMU according to the VT-d specification, including basic responses to CSRs
> > >> accesses, the logics of DMAR (DMA remapping) and DMA memory address
> > >> translations.
> > > 
> > > Thanks!
> > > Looks very good overall, I noted some coding style issues - I didn't
> > > bother reporting each issue in every place where it appears - reported
> > > each issue once only, so please find and fix all instances of each
> > > issue.
> > 
> > BTW, because I was in urgent need for virtual test environment for
> > Jailhouse, I hacked interrupt remapping on top of Le's patches:
> > 
> > http://git.kiszka.org/?p=qemu.git;a=shortlog;h=refs/heads/queues/vtd-intremap
> > 
> > The approach likely needs further discussions and refinements but it
> > already allows me to work on top with our hypervisor, and also Linux.
> > You can see from the last commit that Le's work made it pretty easy to
> > build this on top.
> 
> Le, 
> 
> I have tried Jan's branch with my device setup which consists of a
> minimal q35 setup, an ioh3420 root port (specified as -device
> ioh3420,slot=0 ) and a pcie device plugged into that root port, which
> gives the following lspci -t:
> 
> -[0000:00]-+-00.0
>            +-01.0
>            +-02.0
>            +-03.0-[01]----00.0
>            +-04.0
>            +-1f.0
>            +-1f.2
>            \-1f.3
> 
> All seems to work beautifully (I see the ISA bridge happily receive
> translations) until the first DMA from my device model (at 1:00.0)
> arrives, at which point I get:
> 
> [ 1663.732413] dmar: DMAR:[DMA Write] Request device [00:03.0] fault addr fffa0000 
> [ 1663.732413] DMAR:[fault reason 02] Present bit in context entry is clear
> 
> I would have expected request device 01:00.0 for this.
> It is not clear to me yet if this is a weakness of the implementation of
> ioh3420 or the iommu. Just wanted to let you know right away in case you
> can shed some light to it or it is an easy fix,
> 
> The device uses pci_dma_rw with itself as device pointer.

To verify my hypothesis: with this rude hack my device now works much
better:

@@ -774,6 +780,8 @@ static void iommu_translate(VTDAddressSpace *vtd_as,
int bus_num, int devfn,
         is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
     } else {
         ret_fr = dev_to_context_entry(s, bus_num, devfn, &ce);
+        if (ret_fr)
+            ret_fr = dev_to_context_entry(s, 1, 0, &ce);
         is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
         if (ret_fr) {
             ret_fr = -ret_fr;

Looking at how things behave on real hardware, multiple devices often receive
overlapping DMA address ranges that map to different physical addresses.

So if I understand the way this works, every requester ID would also
need to have its own unique VTDAddressSpace, since each PCI
device/function sees its own DMA address space.
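For concreteness: the source ID the IOMMU has to key on is just
bus << 8 | devfn, so 00:03.0 and 01:00.0 really are different requesters.
A minimal sketch (the helper name is made up):

#include "hw/pci/pci.h"

/* Compose a PCI requester (source) ID from bus and devfn:
 * 00:03.0 -> 0x0018, 01:00.0 -> 0x0100. */
static inline uint16_t vtd_source_id(PCIBus *bus, int devfn)
{
    return (pci_bus_num(bus) << 8) | devfn;
}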

Knut

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v3 0/5] intel-iommu: introduce Intel IOMMU (VT-d) emulation to q35 chipset
  2014-08-15 11:15       ` Knut Omang
@ 2014-08-15 11:37         ` Le Tan
  2014-08-16  7:54           ` Knut Omang
  0 siblings, 1 reply; 34+ messages in thread
From: Le Tan @ 2014-08-15 11:37 UTC (permalink / raw)
  To: Knut Omang
  Cc: Michael S. Tsirkin, Stefan Weil, qemu-devel, Alex Williamson,
	Jan Kiszka, Anthony Liguori, Paolo Bonzini

Hi Knut,

2014-08-15 19:15 GMT+08:00 Knut Omang <knut.omang@oracle.com>:
> On Fri, 2014-08-15 at 06:42 +0200, Knut Omang wrote:
>> On Thu, 2014-08-14 at 14:10 +0200, Jan Kiszka wrote:
>> > On 2014-08-14 13:15, Michael S. Tsirkin wrote:
>> > > On Mon, Aug 11, 2014 at 03:04:57PM +0800, Le Tan wrote:
>> > >> Hi,
>> > >>
>> > >> These patches are intended to introduce Intel IOMMU (VT-d) emulation to q35
>> > >> chipset. The major job in these patches is to add support for emulating Intel
>> > >> IOMMU according to the VT-d specification, including basic responses to CSRs
>> > >> accesses, the logics of DMAR (DMA remapping) and DMA memory address
>> > >> translations.
>> > >
>> > > Thanks!
>> > > Looks very good overall, I noted some coding style issues - I didn't
>> > > bother reporting each issue in every place where it appears - reported
>> > > each issue once only, so please find and fix all instances of each
>> > > issue.
>> >
>> > BTW, because I was in urgent need for virtual test environment for
>> > Jailhouse, I hacked interrupt remapping on top of Le's patches:
>> >
>> > http://git.kiszka.org/?p=qemu.git;a=shortlog;h=refs/heads/queues/vtd-intremap
>> >
>> > The approach likely needs further discussions and refinements but it
>> > already allows me to work on top with our hypervisor, and also Linux.
>> > You can see from the last commit that Le's work made it pretty easy to
>> > build this on top.
>>
>> Le,
>>
>> I have tried Jan's branch with my device setup which consists of a
>> minimal q35 setup, an ioh3420 root port (specified as -device
>> ioh3420,slot=0 ) and a pcie device plugged into that root port, which
>> gives the following lscpi -t:
>>
>> -[0000:00]-+-00.0
>>            +-01.0
>>            +-02.0
>>            +-03.0-[01]----00.0
>>            +-04.0
>>            +-1f.0
>>            +-1f.2
>>            \-1f.3
>>
>> All seems to work beautifully (I see the ISA bridge happily receive
>> translations) until the first DMA from my device model (at 1:00.0)
>> arrives, at which point I get:
>>
>> [ 1663.732413] dmar: DMAR:[DMA Write] Request device [00:03.0] fault addr fffa0000
>> [ 1663.732413] DMAR:[fault reason 02] Present bit in context entry is clear
>>
>> I would have expected request device 01:00.0 for this.
>> It is not clear to me yet if this is a weakness of the implementation of
>> ioh3420 or the iommu. Just wanted to let you know right away in case you
>> can shed some light to it or it is an easy fix,
>>
>> The device uses pci_dma_rw with itself as device pointer.
>
> To verify my hypothesis: with this rude hack my device now works much
> better:
>
> @@ -774,6 +780,8 @@ static void iommu_translate(VTDAddressSpace *vtd_as,
> int bus_num, int devfn,
>          is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
>      } else {
>          ret_fr = dev_to_context_entry(s, bus_num, devfn, &ce);
> +        if (ret_fr)
> +            ret_fr = dev_to_context_entry(s, 1, 0, &ce);
>          is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
>          if (ret_fr) {
>              ret_fr = -ret_fr;
>
> Looking at how things look on hardware, multiple devices often receive
> overlapping DMA address ranges for different physical addresses.
>
> So if I understand the way this works, every requester ID would also
> need to have it's own unique VTDAddressSpace, as each pci
> device/function sees a unique DMA address space..

ioh3420 is a PCIe-to-PCIe bridge, right? In my opinion, each PCIe
device behind the PCIe-to-PCIe bridge can be assigned individually.
For now I added the VT-d to q35 by just adding it to the root PCI bus.
You can see it here in q35.c:
pci_setup_iommu(pci_bus, q35_host_dma_iommu, mch->iommu);
So if we add a PCIe-to-PCIe bridge, we may have to call
pci_setup_iommu() for that new bus as well. I don't know where to hook into
this yet. :) If you know the mechanism behind that, you can try to add
it for the new bus. (I will dive into this after the cleanup.)
What do you think?
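Something along these lines, maybe (only a sketch of the idea, not tested;
where exactly to call it is the open question, and q35_host_dma_iommu() is
declared here just for illustration):

#include "hw/pci/pci.h"
#include "hw/i386/intel_iommu.h"

AddressSpace *q35_host_dma_iommu(PCIBus *bus, void *opaque, int devfn);

/* Give a secondary bus behind a bridge the same IOMMU hook as the root bus. */
static void vtd_hook_secondary_bus(PCIBus *sec_bus, IntelIOMMUState *iommu)
{
    pci_setup_iommu(sec_bus, q35_host_dma_iommu, iommu);
}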
Thanks very much for your testing! :)

Le


> Knut
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v3 0/5] intel-iommu: introduce Intel IOMMU (VT-d) emulation to q35 chipset
  2014-08-15 11:37         ` Le Tan
@ 2014-08-16  7:54           ` Knut Omang
  2014-08-16  8:45             ` Jan Kiszka
  0 siblings, 1 reply; 34+ messages in thread
From: Knut Omang @ 2014-08-16  7:54 UTC (permalink / raw)
  To: Le Tan
  Cc: Michael S. Tsirkin, Stefan Weil, qemu-devel, Alex Williamson,
	Jan Kiszka, Anthony Liguori, Paolo Bonzini

On Fri, 2014-08-15 at 19:37 +0800, Le Tan wrote:
> Hi Knut,
> 
> 2014-08-15 19:15 GMT+08:00 Knut Omang <knut.omang@oracle.com>:
> > On Fri, 2014-08-15 at 06:42 +0200, Knut Omang wrote:
> >> On Thu, 2014-08-14 at 14:10 +0200, Jan Kiszka wrote:
> >> > On 2014-08-14 13:15, Michael S. Tsirkin wrote:
> >> > > On Mon, Aug 11, 2014 at 03:04:57PM +0800, Le Tan wrote:
> >> > >> Hi,
> >> > >>
> >> > >> These patches are intended to introduce Intel IOMMU (VT-d) emulation to q35
> >> > >> chipset. The major job in these patches is to add support for emulating Intel
> >> > >> IOMMU according to the VT-d specification, including basic responses to CSRs
> >> > >> accesses, the logics of DMAR (DMA remapping) and DMA memory address
> >> > >> translations.
> >> > >
> >> > > Thanks!
> >> > > Looks very good overall, I noted some coding style issues - I didn't
> >> > > bother reporting each issue in every place where it appears - reported
> >> > > each issue once only, so please find and fix all instances of each
> >> > > issue.
> >> >
> >> > BTW, because I was in urgent need for virtual test environment for
> >> > Jailhouse, I hacked interrupt remapping on top of Le's patches:
> >> >
> >> > http://git.kiszka.org/?p=qemu.git;a=shortlog;h=refs/heads/queues/vtd-intremap
> >> >
> >> > The approach likely needs further discussions and refinements but it
> >> > already allows me to work on top with our hypervisor, and also Linux.
> >> > You can see from the last commit that Le's work made it pretty easy to
> >> > build this on top.
> >>
> >> Le,
> >>
> >> I have tried Jan's branch with my device setup which consists of a
> >> minimal q35 setup, an ioh3420 root port (specified as -device
> >> ioh3420,slot=0 ) and a pcie device plugged into that root port, which
> >> gives the following lscpi -t:
> >>
> >> -[0000:00]-+-00.0
> >>            +-01.0
> >>            +-02.0
> >>            +-03.0-[01]----00.0
> >>            +-04.0
> >>            +-1f.0
> >>            +-1f.2
> >>            \-1f.3
> >>
> >> All seems to work beautifully (I see the ISA bridge happily receive
> >> translations) until the first DMA from my device model (at 1:00.0)
> >> arrives, at which point I get:
> >>
> >> [ 1663.732413] dmar: DMAR:[DMA Write] Request device [00:03.0] fault addr fffa0000
> >> [ 1663.732413] DMAR:[fault reason 02] Present bit in context entry is clear
> >>
> >> I would have expected request device 01:00.0 for this.
> >> It is not clear to me yet if this is a weakness of the implementation of
> >> ioh3420 or the iommu. Just wanted to let you know right away in case you
> >> can shed some light to it or it is an easy fix,
> >>
> >> The device uses pci_dma_rw with itself as device pointer.
> >
> > To verify my hypothesis: with this rude hack my device now works much
> > better:
> >
> > @@ -774,6 +780,8 @@ static void iommu_translate(VTDAddressSpace *vtd_as,
> > int bus_num, int devfn,
> >          is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
> >      } else {
> >          ret_fr = dev_to_context_entry(s, bus_num, devfn, &ce);
> > +        if (ret_fr)
> > +            ret_fr = dev_to_context_entry(s, 1, 0, &ce);
> >          is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
> >          if (ret_fr) {
> >              ret_fr = -ret_fr;
> >
> > Looking at how things look on hardware, multiple devices often receive
> > overlapping DMA address ranges for different physical addresses.
> >
> > So if I understand the way this works, every requester ID would also
> > need to have it's own unique VTDAddressSpace, as each pci
> > device/function sees a unique DMA address space..
> 
> ioh3420 is a pcie-to-pcie bridge, right? 

Yes.

> In my opinion, each pci-e
> device behind the pcie-to-pcie bridge can be assigned individually.
> For now I added the VT-d to q35 by just adding it to the root pci bus.
> You can see here in q35.c:
> pci_setup_iommu(pci_bus, q35_host_dma_iommu, mch->iommu);
> So if we add a pcie-to-pcie bridge, we may have to call the
> pci_setup_iommu() for that new bus. I don't know where to hook into
> this now. :) If you know the mechanism behind that, you can try to add
> that for the new bus. (I will dive into this after the clean up.)
> What do you think?

Thanks for the quick answer, that helped a lot!

Looking into the details here I realize it is slightly more complicated:
secondary buses are enumerated after device instantiation, as part of
the host PCI enumeration, so if I add a similar setup call in the bridge
setup, it will be called for a new device long before that device has received
its bus number from the OS (via config[PCI_SECONDARY_BUS]).

I agree that the lookup function for contexts needs to be as efficient
as possible, so the simple <busno,devfn> lookup key may be the best
solution, but then the address_spaces table cannot be populated with the
secondary bus entries before the bridge has been assigned a bus number other
than 0x00 or 0xff, e.g. along the lines of this:

diff --git a/hw/pci/pci_bridge.c b/hw/pci/pci_bridge.c
index 4becdc1..d9a8c23 100644
--- a/hw/pci/pci_bridge.c
+++ b/hw/pci/pci_bridge.c
@@ -265,6 +265,12 @@ void pci_bridge_write_config(PCIDevice *d,
         pci_bridge_update_mappings(s);
     }
 
+    if (ranges_overlap(address, len, PCI_SECONDARY_BUS, 1)) {
+        int bus_num = pci_bus_num(&s->sec_bus);
+        if (bus_num != 0xff && bus_num != 0x00)
+            <handle bus number change>
+    }
+
     newctl = pci_get_word(d->config + PCI_BRIDGE_CONTROL);
     if (~oldctl & newctl & PCI_BRIDGE_CTL_BUS_RESET) {
         /* Trigger hot reset on 0->1 transition. */

but it is getting complicated...
Thoughts?

Thanks,

Knut

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v3 0/5] intel-iommu: introduce Intel IOMMU (VT-d) emulation to q35 chipset
  2014-08-16  7:54           ` Knut Omang
@ 2014-08-16  8:45             ` Jan Kiszka
  2014-08-16  8:47               ` Jan Kiszka
  0 siblings, 1 reply; 34+ messages in thread
From: Jan Kiszka @ 2014-08-16  8:45 UTC (permalink / raw)
  To: Knut Omang, Le Tan
  Cc: Michael S. Tsirkin, Stefan Weil, qemu-devel, Alex Williamson,
	Anthony Liguori, Paolo Bonzini


On 2014-08-16 09:54, Knut Omang wrote:
> On Fri, 2014-08-15 at 19:37 +0800, Le Tan wrote:
>> Hi Knut,
>>
>> 2014-08-15 19:15 GMT+08:00 Knut Omang <knut.omang@oracle.com>:
>>> On Fri, 2014-08-15 at 06:42 +0200, Knut Omang wrote:
>>>> On Thu, 2014-08-14 at 14:10 +0200, Jan Kiszka wrote:
>>>>> On 2014-08-14 13:15, Michael S. Tsirkin wrote:
>>>>>> On Mon, Aug 11, 2014 at 03:04:57PM +0800, Le Tan wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> These patches are intended to introduce Intel IOMMU (VT-d) emulation to q35
>>>>>>> chipset. The major job in these patches is to add support for emulating Intel
>>>>>>> IOMMU according to the VT-d specification, including basic responses to CSRs
>>>>>>> accesses, the logics of DMAR (DMA remapping) and DMA memory address
>>>>>>> translations.
>>>>>>
>>>>>> Thanks!
>>>>>> Looks very good overall, I noted some coding style issues - I didn't
>>>>>> bother reporting each issue in every place where it appears - reported
>>>>>> each issue once only, so please find and fix all instances of each
>>>>>> issue.
>>>>>
>>>>> BTW, because I was in urgent need for virtual test environment for
>>>>> Jailhouse, I hacked interrupt remapping on top of Le's patches:
>>>>>
>>>>> http://git.kiszka.org/?p=qemu.git;a=shortlog;h=refs/heads/queues/vtd-intremap
>>>>>
>>>>> The approach likely needs further discussions and refinements but it
>>>>> already allows me to work on top with our hypervisor, and also Linux.
>>>>> You can see from the last commit that Le's work made it pretty easy to
>>>>> build this on top.
>>>>
>>>> Le,
>>>>
>>>> I have tried Jan's branch with my device setup which consists of a
>>>> minimal q35 setup, an ioh3420 root port (specified as -device
>>>> ioh3420,slot=0 ) and a pcie device plugged into that root port, which
>>>> gives the following lscpi -t:
>>>>
>>>> -[0000:00]-+-00.0
>>>>            +-01.0
>>>>            +-02.0
>>>>            +-03.0-[01]----00.0
>>>>            +-04.0
>>>>            +-1f.0
>>>>            +-1f.2
>>>>            \-1f.3
>>>>
>>>> All seems to work beautifully (I see the ISA bridge happily receive
>>>> translations) until the first DMA from my device model (at 1:00.0)
>>>> arrives, at which point I get:
>>>>
>>>> [ 1663.732413] dmar: DMAR:[DMA Write] Request device [00:03.0] fault addr fffa0000
>>>> [ 1663.732413] DMAR:[fault reason 02] Present bit in context entry is clear
>>>>
>>>> I would have expected request device 01:00.0 for this.
>>>> It is not clear to me yet if this is a weakness of the implementation of
>>>> ioh3420 or the iommu. Just wanted to let you know right away in case you
>>>> can shed some light to it or it is an easy fix,
>>>>
>>>> The device uses pci_dma_rw with itself as device pointer.
>>>
>>> To verify my hypothesis: with this rude hack my device now works much
>>> better:
>>>
>>> @@ -774,6 +780,8 @@ static void iommu_translate(VTDAddressSpace *vtd_as,
>>> int bus_num, int devfn,
>>>          is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
>>>      } else {
>>>          ret_fr = dev_to_context_entry(s, bus_num, devfn, &ce);
>>> +        if (ret_fr)
>>> +            ret_fr = dev_to_context_entry(s, 1, 0, &ce);
>>>          is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
>>>          if (ret_fr) {
>>>              ret_fr = -ret_fr;
>>>
>>> Looking at how things look on hardware, multiple devices often receive
>>> overlapping DMA address ranges for different physical addresses.
>>>
>>> So if I understand the way this works, every requester ID would also
>>> need to have it's own unique VTDAddressSpace, as each pci
>>> device/function sees a unique DMA address space..
>>
>> ioh3420 is a pcie-to-pcie bridge, right? 
> 
> Yes.
> 
>> In my opinion, each pci-e
>> device behind the pcie-to-pcie bridge can be assigned individually.
>> For now I added the VT-d to q35 by just adding it to the root pci bus.
>> You can see here in q35.c:
>> pci_setup_iommu(pci_bus, q35_host_dma_iommu, mch->iommu);
>> So if we add a pcie-to-pcie bridge, we may have to call the
>> pci_setup_iommu() for that new bus. I don't know where to hook into
>> this now. :) If you know the mechanism behind that, you can try to add
>> that for the new bus. (I will dive into this after the clean up.)
>> What do you think?
> 
> Thanks for the quick answer, that helped a lot!
> 
> Looking into the details here I realize it is slightly more complicated:
> secondary buses are enumerated after device instantiation, as part of
> the host PCI enumeration, so if I add a similar setup call in the bridge
> setup, it will be called for a new device long before it has received
> it's bus number from the OS (via config[PCI_SECONDARY_BUS] )
> 
> I agree that the lookup function for contexts needs to be as efficient
> as possible so the simple <busno,defvn> lookup key may be the best
> solution but then the address_spaces table cannot be populated with the
> secondary bus entries before it receives a nonzero != 255 bus number,
> eg. along the lines of this: 
> 
> diff --git a/hw/pci/pci_bridge.c b/hw/pci/pci_bridge.c
> index 4becdc1..d9a8c23 100644
> --- a/hw/pci/pci_bridge.c
> +++ b/hw/pci/pci_bridge.c
> @@ -265,6 +265,12 @@ void pci_bridge_write_config(PCIDevice *d,
>          pci_bridge_update_mappings(s);
>      }
>  
> +    if (ranges_overlap(address, len, PCI_SECONDARY_BUS, 1)) {
> +        int bus_num = pci_bus_num(&s->sec_bus);
> +        if (bus_num != 0xff && bus_num != 0x00)
> +            <handle bus number change>
> +    }
> +
>      newctl = pci_get_word(d->config + PCI_BRIDGE_CONTROL);
>      if (~oldctl & newctl & PCI_BRIDGE_CTL_BUS_RESET) {
>          /* Trigger hot reset on 0->1 transition. */
> 
> but it is getting complicated...
> Thoughts?

Point to the PCI bus from VTDAddressSpace instead of storing the bus_num
there?

Jan



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v3 0/5] intel-iommu: introduce Intel IOMMU (VT-d) emulation to q35 chipset
  2014-08-16  8:45             ` Jan Kiszka
@ 2014-08-16  8:47               ` Jan Kiszka
  2014-08-18 16:34                 ` Knut Omang
  0 siblings, 1 reply; 34+ messages in thread
From: Jan Kiszka @ 2014-08-16  8:47 UTC (permalink / raw)
  To: Knut Omang, Le Tan
  Cc: Michael S. Tsirkin, Stefan Weil, qemu-devel, Alex Williamson,
	Anthony Liguori, Paolo Bonzini


On 2014-08-16 10:45, Jan Kiszka wrote:
> On 2014-08-16 09:54, Knut Omang wrote:
>> On Fri, 2014-08-15 at 19:37 +0800, Le Tan wrote:
>>> Hi Knut,
>>>
>>> 2014-08-15 19:15 GMT+08:00 Knut Omang <knut.omang@oracle.com>:
>>>> On Fri, 2014-08-15 at 06:42 +0200, Knut Omang wrote:
>>>>> On Thu, 2014-08-14 at 14:10 +0200, Jan Kiszka wrote:
>>>>>> On 2014-08-14 13:15, Michael S. Tsirkin wrote:
>>>>>>> On Mon, Aug 11, 2014 at 03:04:57PM +0800, Le Tan wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> These patches are intended to introduce Intel IOMMU (VT-d) emulation to q35
>>>>>>>> chipset. The major job in these patches is to add support for emulating Intel
>>>>>>>> IOMMU according to the VT-d specification, including basic responses to CSRs
>>>>>>>> accesses, the logics of DMAR (DMA remapping) and DMA memory address
>>>>>>>> translations.
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Looks very good overall, I noted some coding style issues - I didn't
>>>>>>> bother reporting each issue in every place where it appears - reported
>>>>>>> each issue once only, so please find and fix all instances of each
>>>>>>> issue.
>>>>>>
>>>>>> BTW, because I was in urgent need for virtual test environment for
>>>>>> Jailhouse, I hacked interrupt remapping on top of Le's patches:
>>>>>>
>>>>>> http://git.kiszka.org/?p=qemu.git;a=shortlog;h=refs/heads/queues/vtd-intremap
>>>>>>
>>>>>> The approach likely needs further discussions and refinements but it
>>>>>> already allows me to work on top with our hypervisor, and also Linux.
>>>>>> You can see from the last commit that Le's work made it pretty easy to
>>>>>> build this on top.
>>>>>
>>>>> Le,
>>>>>
>>>>> I have tried Jan's branch with my device setup which consists of a
>>>>> minimal q35 setup, an ioh3420 root port (specified as -device
>>>>> ioh3420,slot=0 ) and a pcie device plugged into that root port, which
>>>>> gives the following lscpi -t:
>>>>>
>>>>> -[0000:00]-+-00.0
>>>>>            +-01.0
>>>>>            +-02.0
>>>>>            +-03.0-[01]----00.0
>>>>>            +-04.0
>>>>>            +-1f.0
>>>>>            +-1f.2
>>>>>            \-1f.3
>>>>>
>>>>> All seems to work beautifully (I see the ISA bridge happily receive
>>>>> translations) until the first DMA from my device model (at 1:00.0)
>>>>> arrives, at which point I get:
>>>>>
>>>>> [ 1663.732413] dmar: DMAR:[DMA Write] Request device [00:03.0] fault addr fffa0000
>>>>> [ 1663.732413] DMAR:[fault reason 02] Present bit in context entry is clear
>>>>>
>>>>> I would have expected request device 01:00.0 for this.
>>>>> It is not clear to me yet if this is a weakness of the implementation of
>>>>> ioh3420 or the iommu. Just wanted to let you know right away in case you
>>>>> can shed some light to it or it is an easy fix,
>>>>>
>>>>> The device uses pci_dma_rw with itself as device pointer.
>>>>
>>>> To verify my hypothesis: with this rude hack my device now works much
>>>> better:
>>>>
>>>> @@ -774,6 +780,8 @@ static void iommu_translate(VTDAddressSpace *vtd_as,
>>>> int bus_num, int devfn,
>>>>          is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
>>>>      } else {
>>>>          ret_fr = dev_to_context_entry(s, bus_num, devfn, &ce);
>>>> +        if (ret_fr)
>>>> +            ret_fr = dev_to_context_entry(s, 1, 0, &ce);
>>>>          is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
>>>>          if (ret_fr) {
>>>>              ret_fr = -ret_fr;
>>>>
>>>> Looking at how things look on hardware, multiple devices often receive
>>>> overlapping DMA address ranges for different physical addresses.
>>>>
>>>> So if I understand the way this works, every requester ID would also
>>>> need to have it's own unique VTDAddressSpace, as each pci
>>>> device/function sees a unique DMA address space..
>>>
>>> ioh3420 is a pcie-to-pcie bridge, right? 
>>
>> Yes.
>>
>>> In my opinion, each pci-e
>>> device behind the pcie-to-pcie bridge can be assigned individually.
>>> For now I added the VT-d to q35 by just adding it to the root pci bus.
>>> You can see here in q35.c:
>>> pci_setup_iommu(pci_bus, q35_host_dma_iommu, mch->iommu);
>>> So if we add a pcie-to-pcie bridge, we may have to call the
>>> pci_setup_iommu() for that new bus. I don't know where to hook into
>>> this now. :) If you know the mechanism behind that, you can try to add
>>> that for the new bus. (I will dive into this after the clean up.)
>>> What do you think?
>>
>> Thanks for the quick answer, that helped a lot!
>>
>> Looking into the details here I realize it is slightly more complicated:
>> secondary buses are enumerated after device instantiation, as part of
>> the host PCI enumeration, so if I add a similar setup call in the bridge
>> setup, it will be called for a new device long before it has received
>> it's bus number from the OS (via config[PCI_SECONDARY_BUS] )
>>
>> I agree that the lookup function for contexts needs to be as efficient
>> as possible so the simple <busno,defvn> lookup key may be the best
>> solution but then the address_spaces table cannot be populated with the
>> secondary bus entries before it receives a nonzero != 255 bus number,
>> eg. along the lines of this: 
>>
>> diff --git a/hw/pci/pci_bridge.c b/hw/pci/pci_bridge.c
>> index 4becdc1..d9a8c23 100644
>> --- a/hw/pci/pci_bridge.c
>> +++ b/hw/pci/pci_bridge.c
>> @@ -265,6 +265,12 @@ void pci_bridge_write_config(PCIDevice *d,
>>          pci_bridge_update_mappings(s);
>>      }
>>  
>> +    if (ranges_overlap(address, len, PCI_SECONDARY_BUS, 1)) {
>> +        int bus_num = pci_bus_num(&s->sec_bus);
>> +        if (bus_num != 0xff && bus_num != 0x00)
>> +            <handle bus number change>
>> +    }
>> +
>>      newctl = pci_get_word(d->config + PCI_BRIDGE_CONTROL);
>>      if (~oldctl & newctl & PCI_BRIDGE_CTL_BUS_RESET) {
>>          /* Trigger hot reset on 0->1 transition. */
>>
>> but it is getting complicated...
>> Thoughts?
> 
> Point to the PCI bus from VTDAddressSpace instead of storing the bus_num
> there?

Also, each PCIe bus should hold an array of VTDAddressSpaces, instead of
the IntelIOMMUState.

Jan




^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v3 0/5] intel-iommu: introduce Intel IOMMU (VT-d) emulation to q35 chipset
  2014-08-16  8:47               ` Jan Kiszka
@ 2014-08-18 16:34                 ` Knut Omang
  2014-08-18 18:50                   ` Jan Kiszka
  0 siblings, 1 reply; 34+ messages in thread
From: Knut Omang @ 2014-08-18 16:34 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Michael S. Tsirkin, Stefan Weil, qemu-devel, Le Tan,
	Alex Williamson, Anthony Liguori, Paolo Bonzini

On Sat, 2014-08-16 at 10:47 +0200, Jan Kiszka wrote:
> On 2014-08-16 10:45, Jan Kiszka wrote:
> > On 2014-08-16 09:54, Knut Omang wrote:
> >> On Fri, 2014-08-15 at 19:37 +0800, Le Tan wrote:
> >>> Hi Knut,
> >>>
> >>> 2014-08-15 19:15 GMT+08:00 Knut Omang <knut.omang@oracle.com>:
> >>>> On Fri, 2014-08-15 at 06:42 +0200, Knut Omang wrote:
> >>>>> On Thu, 2014-08-14 at 14:10 +0200, Jan Kiszka wrote:
> >>>>>> On 2014-08-14 13:15, Michael S. Tsirkin wrote:
> >>>>>>> On Mon, Aug 11, 2014 at 03:04:57PM +0800, Le Tan wrote:
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> These patches are intended to introduce Intel IOMMU (VT-d) emulation to q35
> >>>>>>>> chipset. The major job in these patches is to add support for emulating Intel
> >>>>>>>> IOMMU according to the VT-d specification, including basic responses to CSRs
> >>>>>>>> accesses, the logics of DMAR (DMA remapping) and DMA memory address
> >>>>>>>> translations.
> >>>>>>>
> >>>>>>> Thanks!
> >>>>>>> Looks very good overall, I noted some coding style issues - I didn't
> >>>>>>> bother reporting each issue in every place where it appears - reported
> >>>>>>> each issue once only, so please find and fix all instances of each
> >>>>>>> issue.
> >>>>>>
> >>>>>> BTW, because I was in urgent need for virtual test environment for
> >>>>>> Jailhouse, I hacked interrupt remapping on top of Le's patches:
> >>>>>>
> >>>>>> http://git.kiszka.org/?p=qemu.git;a=shortlog;h=refs/heads/queues/vtd-intremap
> >>>>>>
> >>>>>> The approach likely needs further discussions and refinements but it
> >>>>>> already allows me to work on top with our hypervisor, and also Linux.
> >>>>>> You can see from the last commit that Le's work made it pretty easy to
> >>>>>> build this on top.
> >>>>>
> >>>>> Le,
> >>>>>
> >>>>> I have tried Jan's branch with my device setup which consists of a
> >>>>> minimal q35 setup, an ioh3420 root port (specified as -device
> >>>>> ioh3420,slot=0 ) and a pcie device plugged into that root port, which
> >>>>> gives the following lscpi -t:
> >>>>>
> >>>>> -[0000:00]-+-00.0
> >>>>>            +-01.0
> >>>>>            +-02.0
> >>>>>            +-03.0-[01]----00.0
> >>>>>            +-04.0
> >>>>>            +-1f.0
> >>>>>            +-1f.2
> >>>>>            \-1f.3
> >>>>>
> >>>>> All seems to work beautifully (I see the ISA bridge happily receive
> >>>>> translations) until the first DMA from my device model (at 1:00.0)
> >>>>> arrives, at which point I get:
> >>>>>
> >>>>> [ 1663.732413] dmar: DMAR:[DMA Write] Request device [00:03.0] fault addr fffa0000
> >>>>> [ 1663.732413] DMAR:[fault reason 02] Present bit in context entry is clear
> >>>>>
> >>>>> I would have expected request device 01:00.0 for this.
> >>>>> It is not clear to me yet if this is a weakness of the implementation of
> >>>>> ioh3420 or the iommu. Just wanted to let you know right away in case you
> >>>>> can shed some light to it or it is an easy fix,
> >>>>>
> >>>>> The device uses pci_dma_rw with itself as device pointer.
> >>>>
> >>>> To verify my hypothesis: with this rude hack my device now works much
> >>>> better:
> >>>>
> >>>> @@ -774,6 +780,8 @@ static void iommu_translate(VTDAddressSpace *vtd_as,
> >>>> int bus_num, int devfn,
> >>>>          is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
> >>>>      } else {
> >>>>          ret_fr = dev_to_context_entry(s, bus_num, devfn, &ce);
> >>>> +        if (ret_fr)
> >>>> +            ret_fr = dev_to_context_entry(s, 1, 0, &ce);
> >>>>          is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
> >>>>          if (ret_fr) {
> >>>>              ret_fr = -ret_fr;
> >>>>
> >>>> Looking at how things look on hardware, multiple devices often receive
> >>>> overlapping DMA address ranges for different physical addresses.
> >>>>
> >>>> So if I understand the way this works, every requester ID would also
> >>>> need to have it's own unique VTDAddressSpace, as each pci
> >>>> device/function sees a unique DMA address space..
> >>>
> >>> ioh3420 is a pcie-to-pcie bridge, right? 
> >>
> >> Yes.
> >>
> >>> In my opinion, each pci-e
> >>> device behind the pcie-to-pcie bridge can be assigned individually.
> >>> For now I added the VT-d to q35 by just adding it to the root pci bus.
> >>> You can see here in q35.c:
> >>> pci_setup_iommu(pci_bus, q35_host_dma_iommu, mch->iommu);
> >>> So if we add a pcie-to-pcie bridge, we may have to call the
> >>> pci_setup_iommu() for that new bus. I don't know where to hook into
> >>> this now. :) If you know the mechanism behind that, you can try to add
> >>> that for the new bus. (I will dive into this after the clean up.)
> >>> What do you think?
> >>
> >> Thanks for the quick answer, that helped a lot!
> >>
> >> Looking into the details here I realize it is slightly more complicated:
> >> secondary buses are enumerated after device instantiation, as part of
> >> the host PCI enumeration, so if I add a similar setup call in the bridge
> >> setup, it will be called for a new device long before it has received
> >> it's bus number from the OS (via config[PCI_SECONDARY_BUS] )
> >>
> >> I agree that the lookup function for contexts needs to be as efficient
> >> as possible so the simple <busno,defvn> lookup key may be the best
> >> solution but then the address_spaces table cannot be populated with the
> >> secondary bus entries before it receives a nonzero != 255 bus number,
> >> eg. along the lines of this: 
> >>
> >> diff --git a/hw/pci/pci_bridge.c b/hw/pci/pci_bridge.c
> >> index 4becdc1..d9a8c23 100644
> >> --- a/hw/pci/pci_bridge.c
> >> +++ b/hw/pci/pci_bridge.c
> >> @@ -265,6 +265,12 @@ void pci_bridge_write_config(PCIDevice *d,
> >>          pci_bridge_update_mappings(s);
> >>      }
> >>  
> >> +    if (ranges_overlap(address, len, PCI_SECONDARY_BUS, 1)) {
> >> +        int bus_num = pci_bus_num(&s->sec_bus);
> >> +        if (bus_num != 0xff && bus_num != 0x00)
> >> +            <handle bus number change>
> >> +    }
> >> +
> >>      newctl = pci_get_word(d->config + PCI_BRIDGE_CONTROL);
> >>      if (~oldctl & newctl & PCI_BRIDGE_CTL_BUS_RESET) {
> >>          /* Trigger hot reset on 0->1 transition. */
> >>
> >> but it is getting complicated...
> >> Thoughts?
> > 
> > Point to the PCI bus from VTDAddressSpace instead of storing the bus_num
> > there?
> 
> Also, each PCIe bus should hold an array of VTDAddressSpaces, instead of
> the IntelIOMMUState.

Thanks - that got me going - after some playing around with the data
structures I ended up with these patches (based on top of Jan's
vtd-intremap branch):

https://github.com/knuto/qemu/tree/vtd_patches

In essence I ended up replacing the address_spaces[] array in
IntelIOMMUState with a dma_as pointer in PCIDevice and a QLIST linking
the VTDAddressSpace objects to serve vtd_context_device_invalidate, and then
using pci_bus_num() whenever a bus number is required, except for the
"special" bus needed by the interrupt remapping code.

To achieve this, I had to change the signature of the pci_setup_iommu
function (the first commit).

With this I am now seeing translations for the device behind the virtual
root port, but I think this should work equally well with other PCI
bridges.
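Roughly, the reworked per-device structure looks like this (simplified
sketch, not the literal code from the branch; field names are approximate):

#include "qemu/queue.h"
#include "exec/memory.h"
#include "hw/pci/pci.h"

typedef struct VTDAddressSpace {
    PCIBus *bus;           /* keep the bus, not the bus number, so that      */
    uint8_t devfn;         /* pci_bus_num(bus) is evaluated only when needed */
    AddressSpace as;       /* what the pci_dma_* helpers end up using        */
    QLIST_ENTRY(VTDAddressSpace) next;  /* for vtd_context_device_invalidate */
} VTDAddressSpace;

(PCIDevice then simply points at its own entry via dma_as.)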

Knut
 
> Jan
> 
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v3 0/5] intel-iommu: introduce Intel IOMMU (VT-d) emulation to q35 chipset
  2014-08-18 16:34                 ` Knut Omang
@ 2014-08-18 18:50                   ` Jan Kiszka
  2014-08-19  4:08                     ` Knut Omang
  0 siblings, 1 reply; 34+ messages in thread
From: Jan Kiszka @ 2014-08-18 18:50 UTC (permalink / raw)
  To: Knut Omang
  Cc: Michael S. Tsirkin, Stefan Weil, qemu-devel, Le Tan,
	Alex Williamson, Anthony Liguori, Paolo Bonzini


On 2014-08-18 18:34, Knut Omang wrote:
> On Sat, 2014-08-16 at 10:47 +0200, Jan Kiszka wrote:
>> On 2014-08-16 10:45, Jan Kiszka wrote:
>>> On 2014-08-16 09:54, Knut Omang wrote:
>>>> On Fri, 2014-08-15 at 19:37 +0800, Le Tan wrote:
>>>>> Hi Knut,
>>>>>
>>>>> 2014-08-15 19:15 GMT+08:00 Knut Omang <knut.omang@oracle.com>:
>>>>>> On Fri, 2014-08-15 at 06:42 +0200, Knut Omang wrote:
>>>>>>> On Thu, 2014-08-14 at 14:10 +0200, Jan Kiszka wrote:
>>>>>>>> On 2014-08-14 13:15, Michael S. Tsirkin wrote:
>>>>>>>>> On Mon, Aug 11, 2014 at 03:04:57PM +0800, Le Tan wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> These patches are intended to introduce Intel IOMMU (VT-d) emulation to q35
>>>>>>>>>> chipset. The major job in these patches is to add support for emulating Intel
>>>>>>>>>> IOMMU according to the VT-d specification, including basic responses to CSRs
>>>>>>>>>> accesses, the logics of DMAR (DMA remapping) and DMA memory address
>>>>>>>>>> translations.
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>> Looks very good overall, I noted some coding style issues - I didn't
>>>>>>>>> bother reporting each issue in every place where it appears - reported
>>>>>>>>> each issue once only, so please find and fix all instances of each
>>>>>>>>> issue.
>>>>>>>>
>>>>>>>> BTW, because I was in urgent need for virtual test environment for
>>>>>>>> Jailhouse, I hacked interrupt remapping on top of Le's patches:
>>>>>>>>
>>>>>>>> http://git.kiszka.org/?p=qemu.git;a=shortlog;h=refs/heads/queues/vtd-intremap
>>>>>>>>
>>>>>>>> The approach likely needs further discussions and refinements but it
>>>>>>>> already allows me to work on top with our hypervisor, and also Linux.
>>>>>>>> You can see from the last commit that Le's work made it pretty easy to
>>>>>>>> build this on top.
>>>>>>>
>>>>>>> Le,
>>>>>>>
>>>>>>> I have tried Jan's branch with my device setup which consists of a
>>>>>>> minimal q35 setup, an ioh3420 root port (specified as -device
>>>>>>> ioh3420,slot=0 ) and a pcie device plugged into that root port, which
>>>>>>> gives the following lscpi -t:
>>>>>>>
>>>>>>> -[0000:00]-+-00.0
>>>>>>>            +-01.0
>>>>>>>            +-02.0
>>>>>>>            +-03.0-[01]----00.0
>>>>>>>            +-04.0
>>>>>>>            +-1f.0
>>>>>>>            +-1f.2
>>>>>>>            \-1f.3
>>>>>>>
>>>>>>> All seems to work beautifully (I see the ISA bridge happily receive
>>>>>>> translations) until the first DMA from my device model (at 1:00.0)
>>>>>>> arrives, at which point I get:
>>>>>>>
>>>>>>> [ 1663.732413] dmar: DMAR:[DMA Write] Request device [00:03.0] fault addr fffa0000
>>>>>>> [ 1663.732413] DMAR:[fault reason 02] Present bit in context entry is clear
>>>>>>>
>>>>>>> I would have expected request device 01:00.0 for this.
>>>>>>> It is not clear to me yet if this is a weakness of the implementation of
>>>>>>> ioh3420 or the iommu. Just wanted to let you know right away in case you
>>>>>>> can shed some light to it or it is an easy fix,
>>>>>>>
>>>>>>> The device uses pci_dma_rw with itself as device pointer.
>>>>>>
>>>>>> To verify my hypothesis: with this rude hack my device now works much
>>>>>> better:
>>>>>>
>>>>>> @@ -774,6 +780,8 @@ static void iommu_translate(VTDAddressSpace *vtd_as,
>>>>>> int bus_num, int devfn,
>>>>>>          is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
>>>>>>      } else {
>>>>>>          ret_fr = dev_to_context_entry(s, bus_num, devfn, &ce);
>>>>>> +        if (ret_fr)
>>>>>> +            ret_fr = dev_to_context_entry(s, 1, 0, &ce);
>>>>>>          is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
>>>>>>          if (ret_fr) {
>>>>>>              ret_fr = -ret_fr;
>>>>>>
>>>>>> Looking at how things look on hardware, multiple devices often receive
>>>>>> overlapping DMA address ranges for different physical addresses.
>>>>>>
>>>>>> So if I understand the way this works, every requester ID would also
>>>>>> need to have it's own unique VTDAddressSpace, as each pci
>>>>>> device/function sees a unique DMA address space..
>>>>>
>>>>> ioh3420 is a pcie-to-pcie bridge, right? 
>>>>
>>>> Yes.
>>>>
>>>>> In my opinion, each pci-e
>>>>> device behind the pcie-to-pcie bridge can be assigned individually.
>>>>> For now I added the VT-d to q35 by just adding it to the root pci bus.
>>>>> You can see here in q35.c:
>>>>> pci_setup_iommu(pci_bus, q35_host_dma_iommu, mch->iommu);
>>>>> So if we add a pcie-to-pcie bridge, we may have to call the
>>>>> pci_setup_iommu() for that new bus. I don't know where to hook into
>>>>> this now. :) If you know the mechanism behind that, you can try to add
>>>>> that for the new bus. (I will dive into this after the clean up.)
>>>>> What do you think?
>>>>
>>>> Thanks for the quick answer, that helped a lot!
>>>>
>>>> Looking into the details here I realize it is slightly more complicated:
>>>> secondary buses are enumerated after device instantiation, as part of
>>>> the host PCI enumeration, so if I add a similar setup call in the bridge
>>>> setup, it will be called for a new device long before it has received
>>>> it's bus number from the OS (via config[PCI_SECONDARY_BUS] )
>>>>
>>>> I agree that the lookup function for contexts needs to be as efficient
>>>> as possible so the simple <busno,defvn> lookup key may be the best
>>>> solution but then the address_spaces table cannot be populated with the
>>>> secondary bus entries before it receives a nonzero != 255 bus number,
>>>> eg. along the lines of this: 
>>>>
>>>> diff --git a/hw/pci/pci_bridge.c b/hw/pci/pci_bridge.c
>>>> index 4becdc1..d9a8c23 100644
>>>> --- a/hw/pci/pci_bridge.c
>>>> +++ b/hw/pci/pci_bridge.c
>>>> @@ -265,6 +265,12 @@ void pci_bridge_write_config(PCIDevice *d,
>>>>          pci_bridge_update_mappings(s);
>>>>      }
>>>>  
>>>> +    if (ranges_overlap(address, len, PCI_SECONDARY_BUS, 1)) {
>>>> +        int bus_num = pci_bus_num(&s->sec_bus);
>>>> +        if (bus_num != 0xff && bus_num != 0x00)
>>>> +            <handle bus number change>
>>>> +    }
>>>> +
>>>>      newctl = pci_get_word(d->config + PCI_BRIDGE_CONTROL);
>>>>      if (~oldctl & newctl & PCI_BRIDGE_CTL_BUS_RESET) {
>>>>          /* Trigger hot reset on 0->1 transition. */
>>>>
>>>> but it is getting complicated...
>>>> Thoughts?
>>>
>>> Point to the PCI bus from VTDAddressSpace instead of storing the bus_num
>>> there?
>>
>> Also, each PCIe bus should hold an array of VTDAddressSpaces, instead of
>> the IntelIOMMUState.
> 
> Thanks - that got me going - after some playing around with the data
> structures I ended up with these patches (based on top of Jan's
> vtd-intremap branch):

Are you depending on interrupt remapping? If not, my patches are a bit
hacky and may cause their own issues if you are unlucky.

> 
> https://github.com/knuto/qemu/tree/vtd_patches
> 
> In essence I ended up replacing the address_spaces[] array in
> IntelIOMMUState with a pointer dma_as in PCIDevice and a QLIST linking
> the VTDAddressSpace objects to serve vtd_context_device_invalidate, then
> use pci_bus_num() whenever a bus number is requires, except for the 
> "special" bus needed by the interrupt remapping code.
> 
> To achieve this, I had to change the signature of the pci_setup_iommu
> function (the first commit).
> 
> With this I am now seeing translations for the device in the virtual
> root port, but I think this should work equally well with other PCI
> bridges.

Good to know that we are on the right path! Maybe this can be integrated
in the next posting round so that we do not need any bridge exceptions
due to incompatibility.

Jan




^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v3 0/5] intel-iommu: introduce Intel IOMMU (VT-d) emulation to q35 chipset
  2014-08-18 18:50                   ` Jan Kiszka
@ 2014-08-19  4:08                     ` Knut Omang
  2014-08-19  4:18                       ` Knut Omang
  2014-08-19  5:36                       ` Jan Kiszka
  0 siblings, 2 replies; 34+ messages in thread
From: Knut Omang @ 2014-08-19  4:08 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Michael S. Tsirkin, Stefan Weil, qemu-devel, Le Tan,
	Alex Williamson, Anthony Liguori, Paolo Bonzini

On Mon, 2014-08-18 at 20:50 +0200, Jan Kiszka wrote:
> On 2014-08-18 18:34, Knut Omang wrote:
> > On Sat, 2014-08-16 at 10:47 +0200, Jan Kiszka wrote:
> >> On 2014-08-16 10:45, Jan Kiszka wrote:
> >>> On 2014-08-16 09:54, Knut Omang wrote:
> >>>> On Fri, 2014-08-15 at 19:37 +0800, Le Tan wrote:
> >>>>> Hi Knut,
> >>>>>
> >>>>> 2014-08-15 19:15 GMT+08:00 Knut Omang <knut.omang@oracle.com>:
> >>>>>> On Fri, 2014-08-15 at 06:42 +0200, Knut Omang wrote:
> >>>>>>> On Thu, 2014-08-14 at 14:10 +0200, Jan Kiszka wrote:
> >>>>>>>> On 2014-08-14 13:15, Michael S. Tsirkin wrote:
> >>>>>>>>> On Mon, Aug 11, 2014 at 03:04:57PM +0800, Le Tan wrote:
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> These patches are intended to introduce Intel IOMMU (VT-d) emulation to q35
> >>>>>>>>>> chipset. The major job in these patches is to add support for emulating Intel
> >>>>>>>>>> IOMMU according to the VT-d specification, including basic responses to CSRs
> >>>>>>>>>> accesses, the logics of DMAR (DMA remapping) and DMA memory address
> >>>>>>>>>> translations.
> >>>>>>>>>
> >>>>>>>>> Thanks!
> >>>>>>>>> Looks very good overall, I noted some coding style issues - I didn't
> >>>>>>>>> bother reporting each issue in every place where it appears - reported
> >>>>>>>>> each issue once only, so please find and fix all instances of each
> >>>>>>>>> issue.
> >>>>>>>>
> >>>>>>>> BTW, because I was in urgent need for virtual test environment for
> >>>>>>>> Jailhouse, I hacked interrupt remapping on top of Le's patches:
> >>>>>>>>
> >>>>>>>> http://git.kiszka.org/?p=qemu.git;a=shortlog;h=refs/heads/queues/vtd-intremap
> >>>>>>>>
> >>>>>>>> The approach likely needs further discussions and refinements but it
> >>>>>>>> already allows me to work on top with our hypervisor, and also Linux.
> >>>>>>>> You can see from the last commit that Le's work made it pretty easy to
> >>>>>>>> build this on top.
> >>>>>>>
> >>>>>>> Le,
> >>>>>>>
> >>>>>>> I have tried Jan's branch with my device setup which consists of a
> >>>>>>> minimal q35 setup, an ioh3420 root port (specified as -device
> >>>>>>> ioh3420,slot=0 ) and a pcie device plugged into that root port, which
> >>>>>>> gives the following lscpi -t:
> >>>>>>>
> >>>>>>> -[0000:00]-+-00.0
> >>>>>>>            +-01.0
> >>>>>>>            +-02.0
> >>>>>>>            +-03.0-[01]----00.0
> >>>>>>>            +-04.0
> >>>>>>>            +-1f.0
> >>>>>>>            +-1f.2
> >>>>>>>            \-1f.3
> >>>>>>>
> >>>>>>> All seems to work beautifully (I see the ISA bridge happily receive
> >>>>>>> translations) until the first DMA from my device model (at 1:00.0)
> >>>>>>> arrives, at which point I get:
> >>>>>>>
> >>>>>>> [ 1663.732413] dmar: DMAR:[DMA Write] Request device [00:03.0] fault addr fffa0000
> >>>>>>> [ 1663.732413] DMAR:[fault reason 02] Present bit in context entry is clear
> >>>>>>>
> >>>>>>> I would have expected request device 01:00.0 for this.
> >>>>>>> It is not clear to me yet if this is a weakness of the implementation of
> >>>>>>> ioh3420 or the iommu. Just wanted to let you know right away in case you
> >>>>>>> can shed some light to it or it is an easy fix,
> >>>>>>>
> >>>>>>> The device uses pci_dma_rw with itself as device pointer.
> >>>>>>
> >>>>>> To verify my hypothesis: with this rude hack my device now works much
> >>>>>> better:
> >>>>>>
> >>>>>> @@ -774,6 +780,8 @@ static void iommu_translate(VTDAddressSpace *vtd_as,
> >>>>>> int bus_num, int devfn,
> >>>>>>          is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
> >>>>>>      } else {
> >>>>>>          ret_fr = dev_to_context_entry(s, bus_num, devfn, &ce);
> >>>>>> +        if (ret_fr)
> >>>>>> +            ret_fr = dev_to_context_entry(s, 1, 0, &ce);
> >>>>>>          is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
> >>>>>>          if (ret_fr) {
> >>>>>>              ret_fr = -ret_fr;
> >>>>>>
> >>>>>> Looking at how things look on hardware, multiple devices often receive
> >>>>>> overlapping DMA address ranges for different physical addresses.
> >>>>>>
> >>>>>> So if I understand the way this works, every requester ID would also
> >>>>>> need to have it's own unique VTDAddressSpace, as each pci
> >>>>>> device/function sees a unique DMA address space..
> >>>>>
> >>>>> ioh3420 is a pcie-to-pcie bridge, right? 
> >>>>
> >>>> Yes.
> >>>>
> >>>>> In my opinion, each pci-e
> >>>>> device behind the pcie-to-pcie bridge can be assigned individually.
> >>>>> For now I added the VT-d to q35 by just adding it to the root pci bus.
> >>>>> You can see here in q35.c:
> >>>>> pci_setup_iommu(pci_bus, q35_host_dma_iommu, mch->iommu);
> >>>>> So if we add a pcie-to-pcie bridge, we may have to call the
> >>>>> pci_setup_iommu() for that new bus. I don't know where to hook into
> >>>>> this now. :) If you know the mechanism behind that, you can try to add
> >>>>> that for the new bus. (I will dive into this after the clean up.)
> >>>>> What do you think?
> >>>>
> >>>> Thanks for the quick answer, that helped a lot!
> >>>>
> >>>> Looking into the details here I realize it is slightly more complicated:
> >>>> secondary buses are enumerated after device instantiation, as part of
> >>>> the host PCI enumeration, so if I add a similar setup call in the bridge
> >>>> setup, it will be called for a new device long before it has received
> >>>> it's bus number from the OS (via config[PCI_SECONDARY_BUS] )
> >>>>
> >>>> I agree that the lookup function for contexts needs to be as efficient
> >>>> as possible so the simple <busno,defvn> lookup key may be the best
> >>>> solution but then the address_spaces table cannot be populated with the
> >>>> secondary bus entries before it receives a nonzero != 255 bus number,
> >>>> eg. along the lines of this: 
> >>>>
> >>>> diff --git a/hw/pci/pci_bridge.c b/hw/pci/pci_bridge.c
> >>>> index 4becdc1..d9a8c23 100644
> >>>> --- a/hw/pci/pci_bridge.c
> >>>> +++ b/hw/pci/pci_bridge.c
> >>>> @@ -265,6 +265,12 @@ void pci_bridge_write_config(PCIDevice *d,
> >>>>          pci_bridge_update_mappings(s);
> >>>>      }
> >>>>  
> >>>> +    if (ranges_overlap(address, len, PCI_SECONDARY_BUS, 1)) {
> >>>> +        int bus_num = pci_bus_num(&s->sec_bus);
> >>>> +        if (bus_num != 0xff && bus_num != 0x00)
> >>>> +            <handle bus number change>
> >>>> +    }
> >>>> +
> >>>>      newctl = pci_get_word(d->config + PCI_BRIDGE_CONTROL);
> >>>>      if (~oldctl & newctl & PCI_BRIDGE_CTL_BUS_RESET) {
> >>>>          /* Trigger hot reset on 0->1 transition. */
> >>>>
> >>>> but it is getting complicated...
> >>>> Thoughts?
> >>>
> >>> Point to the PCI bus from VTDAddressSpace instead of storing the bus_num
> >>> there?
> >>
> >> Also, each PCIe bus should hold an array of VTDAddressSpaces, instead of
> >> the IntelIOMMUState.
> > 
> > Thanks - that got me going - after some playing around with the data
> > structures I ended up with these patches (based on top of Jan's
> > vtd-intremap branch):
> 
> Are you depending on interrupt remapping? If not, my patches are a bit
> hacky and may cause their own issues if you are unlucky.
 
It does not depend on it directly, but it does interpret a NULL PCIDevice
pointer as the special bus number (0xff) for non-PCI devices, so I have
tried to allow for that - I can rebase if so desired, but I would like to
see the interrupt remapping emerge as well ;-)

Knut

> > https://github.com/knuto/qemu/tree/vtd_patches
> > 
> > In essence I ended up replacing the address_spaces[] array in
> > IntelIOMMUState with a pointer dma_as in PCIDevice and a QLIST linking
> > the VTDAddressSpace objects to serve vtd_context_device_invalidate, then
> > use pci_bus_num() whenever a bus number is requires, except for the 
> > "special" bus needed by the interrupt remapping code.
> > 
> > To achieve this, I had to change the signature of the pci_setup_iommu
> > function (the first commit).
> > 
> > With this I am now seeing translations for the device in the virtual
> > root port, but I think this should work equally well with other PCI
> > bridges.
> 
> Good to know that we are on the right path! Maybe this can be integrated
> in the next posting round so that we do not need any bridge exceptions
> due to incompatibility.
> 
> Jan
> 
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v3 0/5] intel-iommu: introduce Intel IOMMU (VT-d) emulation to q35 chipset
  2014-08-19  4:08                     ` Knut Omang
@ 2014-08-19  4:18                       ` Knut Omang
  2014-08-19  5:36                       ` Jan Kiszka
  1 sibling, 0 replies; 34+ messages in thread
From: Knut Omang @ 2014-08-19  4:18 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Michael S. Tsirkin, Stefan Weil, qemu-devel, Le Tan,
	Alex Williamson, Anthony Liguori, Paolo Bonzini

On Tue, 2014-08-19 at 06:08 +0200, Knut Omang wrote:
> On Mon, 2014-08-18 at 20:50 +0200, Jan Kiszka wrote:
> > On 2014-08-18 18:34, Knut Omang wrote:
> > > On Sat, 2014-08-16 at 10:47 +0200, Jan Kiszka wrote:
> > >> On 2014-08-16 10:45, Jan Kiszka wrote:
> > >>> On 2014-08-16 09:54, Knut Omang wrote:
> > >>>> On Fri, 2014-08-15 at 19:37 +0800, Le Tan wrote:
> > >>>>> Hi Knut,
> > >>>>>
> > >>>>> 2014-08-15 19:15 GMT+08:00 Knut Omang <knut.omang@oracle.com>:
> > >>>>>> On Fri, 2014-08-15 at 06:42 +0200, Knut Omang wrote:
> > >>>>>>> On Thu, 2014-08-14 at 14:10 +0200, Jan Kiszka wrote:
> > >>>>>>>> On 2014-08-14 13:15, Michael S. Tsirkin wrote:
> > >>>>>>>>> On Mon, Aug 11, 2014 at 03:04:57PM +0800, Le Tan wrote:
> > >>>>>>>>>> Hi,
> > >>>>>>>>>>
> > >>>>>>>>>> These patches are intended to introduce Intel IOMMU (VT-d) emulation to q35
> > >>>>>>>>>> chipset. The major job in these patches is to add support for emulating Intel
> > >>>>>>>>>> IOMMU according to the VT-d specification, including basic responses to CSRs
> > >>>>>>>>>> accesses, the logics of DMAR (DMA remapping) and DMA memory address
> > >>>>>>>>>> translations.
> > >>>>>>>>>
> > >>>>>>>>> Thanks!
> > >>>>>>>>> Looks very good overall, I noted some coding style issues - I didn't
> > >>>>>>>>> bother reporting each issue in every place where it appears - reported
> > >>>>>>>>> each issue once only, so please find and fix all instances of each
> > >>>>>>>>> issue.
> > >>>>>>>>
> > >>>>>>>> BTW, because I was in urgent need for virtual test environment for
> > >>>>>>>> Jailhouse, I hacked interrupt remapping on top of Le's patches:
> > >>>>>>>>
> > >>>>>>>> http://git.kiszka.org/?p=qemu.git;a=shortlog;h=refs/heads/queues/vtd-intremap
> > >>>>>>>>
> > >>>>>>>> The approach likely needs further discussions and refinements but it
> > >>>>>>>> already allows me to work on top with our hypervisor, and also Linux.
> > >>>>>>>> You can see from the last commit that Le's work made it pretty easy to
> > >>>>>>>> build this on top.
> > >>>>>>>
> > >>>>>>> Le,
> > >>>>>>>
> > >>>>>>> I have tried Jan's branch with my device setup which consists of a
> > >>>>>>> minimal q35 setup, an ioh3420 root port (specified as -device
> > >>>>>>> ioh3420,slot=0 ) and a pcie device plugged into that root port, which
> > >>>>>>> gives the following lscpi -t:
> > >>>>>>>
> > >>>>>>> -[0000:00]-+-00.0
> > >>>>>>>            +-01.0
> > >>>>>>>            +-02.0
> > >>>>>>>            +-03.0-[01]----00.0
> > >>>>>>>            +-04.0
> > >>>>>>>            +-1f.0
> > >>>>>>>            +-1f.2
> > >>>>>>>            \-1f.3
> > >>>>>>>
> > >>>>>>> All seems to work beautifully (I see the ISA bridge happily receive
> > >>>>>>> translations) until the first DMA from my device model (at 1:00.0)
> > >>>>>>> arrives, at which point I get:
> > >>>>>>>
> > >>>>>>> [ 1663.732413] dmar: DMAR:[DMA Write] Request device [00:03.0] fault addr fffa0000
> > >>>>>>> [ 1663.732413] DMAR:[fault reason 02] Present bit in context entry is clear
> > >>>>>>>
> > >>>>>>> I would have expected request device 01:00.0 for this.
> > >>>>>>> It is not clear to me yet if this is a weakness of the implementation of
> > >>>>>>> ioh3420 or the iommu. Just wanted to let you know right away in case you
> > >>>>>>> can shed some light to it or it is an easy fix,
> > >>>>>>>
> > >>>>>>> The device uses pci_dma_rw with itself as device pointer.
> > >>>>>>
> > >>>>>> To verify my hypothesis: with this rude hack my device now works much
> > >>>>>> better:
> > >>>>>>
> > >>>>>> @@ -774,6 +780,8 @@ static void iommu_translate(VTDAddressSpace *vtd_as,
> > >>>>>> int bus_num, int devfn,
> > >>>>>>          is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
> > >>>>>>      } else {
> > >>>>>>          ret_fr = dev_to_context_entry(s, bus_num, devfn, &ce);
> > >>>>>> +        if (ret_fr)
> > >>>>>> +            ret_fr = dev_to_context_entry(s, 1, 0, &ce);
> > >>>>>>          is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
> > >>>>>>          if (ret_fr) {
> > >>>>>>              ret_fr = -ret_fr;
> > >>>>>>
> > >>>>>> Looking at how things look on hardware, multiple devices often receive
> > >>>>>> overlapping DMA address ranges for different physical addresses.
> > >>>>>>
> > >>>>>> So if I understand the way this works, every requester ID would also
> > >>>>>> need to have it's own unique VTDAddressSpace, as each pci
> > >>>>>> device/function sees a unique DMA address space..
> > >>>>>
> > >>>>> ioh3420 is a pcie-to-pcie bridge, right? 
> > >>>>
> > >>>> Yes.
> > >>>>
> > >>>>> In my opinion, each pci-e
> > >>>>> device behind the pcie-to-pcie bridge can be assigned individually.
> > >>>>> For now I added the VT-d to q35 by just adding it to the root pci bus.
> > >>>>> You can see here in q35.c:
> > >>>>> pci_setup_iommu(pci_bus, q35_host_dma_iommu, mch->iommu);
> > >>>>> So if we add a pcie-to-pcie bridge, we may have to call the
> > >>>>> pci_setup_iommu() for that new bus. I don't know where to hook into
> > >>>>> this now. :) If you know the mechanism behind that, you can try to add
> > >>>>> that for the new bus. (I will dive into this after the clean up.)
> > >>>>> What do you think?
> > >>>>
> > >>>> Thanks for the quick answer, that helped a lot!
> > >>>>
> > >>>> Looking into the details here I realize it is slightly more complicated:
> > >>>> secondary buses are enumerated after device instantiation, as part of
> > >>>> the host PCI enumeration, so if I add a similar setup call in the bridge
> > >>>> setup, it will be called for a new device long before it has received
> > >>>> it's bus number from the OS (via config[PCI_SECONDARY_BUS] )
> > >>>>
> > >>>> I agree that the lookup function for contexts needs to be as efficient
> > >>>> as possible so the simple <busno,defvn> lookup key may be the best
> > >>>> solution but then the address_spaces table cannot be populated with the
> > >>>> secondary bus entries before it receives a nonzero != 255 bus number,
> > >>>> eg. along the lines of this: 
> > >>>>
> > >>>> diff --git a/hw/pci/pci_bridge.c b/hw/pci/pci_bridge.c
> > >>>> index 4becdc1..d9a8c23 100644
> > >>>> --- a/hw/pci/pci_bridge.c
> > >>>> +++ b/hw/pci/pci_bridge.c
> > >>>> @@ -265,6 +265,12 @@ void pci_bridge_write_config(PCIDevice *d,
> > >>>>          pci_bridge_update_mappings(s);
> > >>>>      }
> > >>>>  
> > >>>> +    if (ranges_overlap(address, len, PCI_SECONDARY_BUS, 1)) {
> > >>>> +        int bus_num = pci_bus_num(&s->sec_bus);
> > >>>> +        if (bus_num != 0xff && bus_num != 0x00)
> > >>>> +            <handle bus number change>
> > >>>> +    }
> > >>>> +
> > >>>>      newctl = pci_get_word(d->config + PCI_BRIDGE_CONTROL);
> > >>>>      if (~oldctl & newctl & PCI_BRIDGE_CTL_BUS_RESET) {
> > >>>>          /* Trigger hot reset on 0->1 transition. */
> > >>>>
> > >>>> but it is getting complicated...
> > >>>> Thoughts?
> > >>>
> > >>> Point to the PCI bus from VTDAddressSpace instead of storing the bus_num
> > >>> there?
> > >>
> > >> Also, each PCIe bus should hold an array of VTDAddressSpaces, instead of
> > >> the IntelIOMMUState.
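
(A rough sketch of that suggestion, with assumed field and helper names:
VTDAddressSpace keeps a PCIBus pointer instead of a bus number, and the
number is only looked up when a source id is needed, so a later guest write
to PCI_SECONDARY_BUS is picked up automatically.)

/* Sketch: resolve the bus number lazily via pci_bus_num() instead of
 * caching it in the address space. */
static uint16_t vtd_as_source_id(VTDAddressSpace *vtd_as)
{
    return PCI_BUILD_BDF(pci_bus_num(vtd_as->bus), vtd_as->devfn);
}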
> > > 
> > > Thanks - that got me going - after some playing around with the data
> > > structures I ended up with these patches (based on top of Jan's
> > > vtd-intremap branch):
> > 
> > Are you depending on interrupt remapping? If not, my patches are a bit
> > hacky and may cause their own issues if you are unlucky.
>  
> It does not depend on it directly but interprets a NULL PCIDevice
> pointer as the special bus number (0xff) for non-PCI devices, so I have
> tried to account for that - I can rebase if so desired, but I would like
> to see the interrupt remapping emerge as well ;-)
> 
> Knut
> 
> > > https://github.com/knuto/qemu/tree/vtd_patches
> > > 
> > > In essence I ended up replacing the address_spaces[] array in
> > > IntelIOMMUState with a pointer dma_as in PCIDevice and a QLIST linking
> > > the VTDAddressSpace objects to serve vtd_context_device_invalidate, then
> > > use pci_bus_num() whenever a bus number is required, except for the
> > > "special" bus needed by the interrupt remapping code.
> > > 
> > > To achieve this, I had to change the signature of the pci_setup_iommu
> > > function (the first commit).
> > > 
> > > With this I am now seeing translations for the device in the virtual
> > > root port, but I think this should work equally well with other PCI
> > > bridges.
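
(Again only a sketch of the structure described above, not the code in the
branch: the VTDAddressSpace objects sit on a QLIST so that
vtd_context_device_invalidate can walk them, and pci_bus_num() supplies the
current bus number. The list head 'address_spaces', the link 'next' and the
bus/devfn fields are assumed names.)

/* Sketch: walk the per-device address spaces on a context-cache device
 * invalidation and match on the current (bus, devfn) source id. */
static void vtd_context_device_invalidate_sketch(IntelIOMMUState *s,
                                                 uint16_t source_id)
{
    VTDAddressSpace *vtd_as;

    QLIST_FOREACH(vtd_as, &s->address_spaces, next) {
        uint16_t sid = PCI_BUILD_BDF(pci_bus_num(vtd_as->bus), vtd_as->devfn);

        if (sid == source_id) {
            /* drop any cached context entry / per-device state here */
        }
    }
}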
> > 
> > Good to know that we are on the right path! Maybe this can be integrated
> > into the next posting round so that we do not need any bridge exceptions
> > due to incompatibility.
> >
> > Jan

Let's aim for that - as far as I understand the Linux iommu driver, the
kernel will try to set up mappings for everything (using identity
mappings for some devices), so any exceptions on the QEMU side would
probably have to be a hack.

Knut

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v3 0/5] intel-iommu: introduce Intel IOMMU (VT-d) emulation to q35 chipset
  2014-08-19  4:08                     ` Knut Omang
  2014-08-19  4:18                       ` Knut Omang
@ 2014-08-19  5:36                       ` Jan Kiszka
  1 sibling, 0 replies; 34+ messages in thread
From: Jan Kiszka @ 2014-08-19  5:36 UTC (permalink / raw)
  To: Knut Omang
  Cc: Michael S. Tsirkin, Stefan Weil, qemu-devel, Le Tan,
	Alex Williamson, Anthony Liguori, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 845 bytes --]

On 2014-08-19 06:08, Knut Omang wrote:
>> Are you depending on interrupt remapping? If not, my patches are a bit
>> hacky and may cause their own issues if you are unlucky.
>  
> It does not depend on it directly but interprets a NULL PCIDevice
> pointer as the special bus number (0xff) for non-PCI devices, so I have

Yeah, this won't map nicely onto future chipsets - unless we redefine
Q35_PSEUDO_XXX as VTD_PSEUDO_XXX, or so, and use the same values for all
of them. Or create an unconnected pseudo PCI bus for the chipset devices
that carry the pseudo bus number.

> tried to account for that - I can rebase if so desired, but I would like
> to see the interrupt remapping emerge as well ;-)

Good, that makes two of us then ;). Let's get the baseline ready, then we
can discuss extensions like interrupt remapping.

Jan



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2014-08-19  5:37 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-08-11  7:04 [Qemu-devel] [PATCH v3 0/5] intel-iommu: introduce Intel IOMMU (VT-d) emulation to q35 chipset Le Tan
2014-08-11  7:04 ` [Qemu-devel] [PATCH v3 1/5] iommu: add is_write as a parameter to the translate function of MemoryRegionIOMMUOps Le Tan
2014-08-11  7:04 ` [Qemu-devel] [PATCH v3 2/5] intel-iommu: introduce Intel IOMMU (VT-d) emulation Le Tan
2014-08-12  7:34   ` Jan Kiszka
2014-08-12  9:04     ` Le Tan
2014-08-14 11:03   ` Michael S. Tsirkin
2014-08-11  7:05 ` [Qemu-devel] [PATCH v3 3/5] intel-iommu: add DMAR table to ACPI tables Le Tan
2014-08-14 11:06   ` Michael S. Tsirkin
2014-08-14 11:36     ` Jan Kiszka
2014-08-14 11:43       ` Michael S. Tsirkin
2014-08-14 11:51         ` Jan Kiszka
2014-08-14 11:53           ` Le Tan
2014-08-14 12:04             ` Michael S. Tsirkin
2014-08-14 11:44       ` Michael S. Tsirkin
2014-08-14 11:37     ` Le Tan
2014-08-11  7:05 ` [Qemu-devel] [PATCH v3 4/5] intel-iommu: add Intel IOMMU emulation to q35 and add a machine option "iommu" as a switch Le Tan
2014-08-14 11:12   ` Michael S. Tsirkin
2014-08-14 11:33     ` Le Tan
2014-08-14 11:35       ` Michael S. Tsirkin
2014-08-14 11:39         ` Le Tan
2014-08-11  7:05 ` [Qemu-devel] [PATCH v3 5/5] intel-iommu: add supports for queued invalidation interface Le Tan
2014-08-14 11:15 ` [Qemu-devel] [PATCH v3 0/5] intel-iommu: introduce Intel IOMMU (VT-d) emulation to q35 chipset Michael S. Tsirkin
2014-08-14 12:10   ` Jan Kiszka
2014-08-15  4:42     ` Knut Omang
2014-08-15 11:15       ` Knut Omang
2014-08-15 11:37         ` Le Tan
2014-08-16  7:54           ` Knut Omang
2014-08-16  8:45             ` Jan Kiszka
2014-08-16  8:47               ` Jan Kiszka
2014-08-18 16:34                 ` Knut Omang
2014-08-18 18:50                   ` Jan Kiszka
2014-08-19  4:08                     ` Knut Omang
2014-08-19  4:18                       ` Knut Omang
2014-08-19  5:36                       ` Jan Kiszka

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.