* [Qemu-devel] [PULL 00/13] pci, pc fixes, features
@ 2014-09-02 15:07 Michael S. Tsirkin
  2014-09-02 15:07 ` [Qemu-devel] [PULL 01/13] iommu: add is_write as a parameter to the translate function of MemoryRegionIOMMUOps Michael S. Tsirkin
                   ` (13 more replies)
  0 siblings, 14 replies; 19+ messages in thread
From: Michael S. Tsirkin @ 2014-09-02 15:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: Peter Maydell, Anthony Liguori

The following changes since commit 187de915e8d06aaf82be206aebc551c82bf0670c:

  pcie: fix trailing whitespace (2014-08-25 00:16:07 +0200)

are available in the git repository at:

  git://git.kernel.org/pub/scm/virt/kvm/mst/qemu.git tags/for_upstream

for you to fetch changes up to aad4dce934649b3a398396fc2a76f215bb194ea4:

  vhost_net: start/stop guest notifiers properly (2014-09-02 17:33:37 +0300)

----------------------------------------------------------------
pci, pc fixes, features

A bunch of bugfixes - these are also appropriate for 2.1.1

Initial Intel IOMMU support.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

----------------------------------------------------------------
Gonglei (1):
      ioh3420: remove unused ioh3420_init() declaration

Jason Wang (1):
      vhost_net: start/stop guest notifiers properly

Knut Omang (1):
      pci: avoid losing config updates to MSI/MSIX cap regs

Le Tan (8):
      iommu: add is_write as a parameter to the translate function of MemoryRegionIOMMUOps
      intel-iommu: introduce Intel IOMMU (VT-d) emulation
      intel-iommu: add DMAR table to ACPI tables
      intel-iommu: add Intel IOMMU emulation to q35 and add a machine option "iommu" as a switch
      intel-iommu: fix coding style issues around in q35.c and machine.c
      intel-iommu: add supports for queued invalidation interface
      intel-iommu: add context-cache to cache context-entry
      intel-iommu: add IOTLB using hash table

Michael S. Tsirkin (2):
      vhost_net: cleanup start/stop condition
      virtio-net: don't run bh on vm stopped

 hw/i386/acpi-defs.h            |   40 +
 hw/i386/intel_iommu_internal.h |  389 ++++++++
 hw/pci-bridge/ioh3420.h        |    4 -
 include/exec/memory.h          |    2 +-
 include/hw/boards.h            |    1 +
 include/hw/i386/intel_iommu.h  |  120 +++
 include/hw/pci-host/q35.h      |    2 +
 exec.c                         |    2 +-
 hw/alpha/typhoon.c             |    3 +-
 hw/core/machine.c              |   27 +-
 hw/i386/acpi-build.c           |   39 +
 hw/i386/intel_iommu.c          | 1963 ++++++++++++++++++++++++++++++++++++++++
 hw/net/vhost_net.c             |   39 +-
 hw/net/virtio-net.c            |   14 +-
 hw/pci-host/apb.c              |    3 +-
 hw/pci-host/q35.c              |   58 +-
 hw/pci/pci.c                   |    7 +-
 hw/ppc/spapr_iommu.c           |    3 +-
 hw/virtio/vhost.c              |    2 -
 vl.c                           |    4 +
 hw/i386/Makefile.objs          |    1 +
 qemu-options.hx                |    5 +-
 22 files changed, 2683 insertions(+), 45 deletions(-)
 create mode 100644 hw/i386/intel_iommu_internal.h
 create mode 100644 include/hw/i386/intel_iommu.h
 create mode 100644 hw/i386/intel_iommu.c
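For reviewers wanting to try this out: the "iommu" machine option added by this series for q35 would presumably be switched on roughly as below (the memory size and disk image name are placeholders, not taken from the series):

```shell
# Enable the emulated VT-d IOMMU on the q35 machine type
qemu-system-x86_64 -machine q35,iommu=on -m 1G guest.img
```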


* [Qemu-devel] [PULL 01/13] iommu: add is_write as a parameter to the translate function of MemoryRegionIOMMUOps
  2014-09-02 15:07 [Qemu-devel] [PULL 00/13] pci, pc fixes, features Michael S. Tsirkin
@ 2014-09-02 15:07 ` Michael S. Tsirkin
  2014-09-02 15:07 ` [Qemu-devel] [PULL 02/13] intel-iommu: introduce Intel IOMMU (VT-d) emulation Michael S. Tsirkin
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 19+ messages in thread
From: Michael S. Tsirkin @ 2014-09-02 15:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Maydell, Stefan Weil, Mark Cave-Ayland, Alexander Graf,
	Le Tan, Michael Tokarev, qemu-ppc, Anthony Liguori,
	Paolo Bonzini, Richard Henderson

From: Le Tan <tamlokveer@gmail.com>

Add a bool parameter is_write to the translate function of
MemoryRegionIOMMUOps to indicate whether the access is a read or a write.
It can be used for correct fault reporting from within the callback.
Change the interfaces of related functions accordingly.

Signed-off-by: Le Tan <tamlokveer@gmail.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/exec/memory.h | 2 +-
 exec.c                | 2 +-
 hw/alpha/typhoon.c    | 3 ++-
 hw/pci-host/apb.c     | 3 ++-
 hw/ppc/spapr_iommu.c  | 3 ++-
 5 files changed, 8 insertions(+), 5 deletions(-)
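As a note for reviewers, the effect of the new parameter can be sketched outside QEMU. The types and names below (TLBEntry, toy_translate) are illustrative stand-ins, not QEMU's actual IOMMUTLBEntry or MemoryRegionIOMMUOps:

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative stand-ins for QEMU's hwaddr and IOMMUTLBEntry. */
typedef uint64_t hwaddr;

typedef struct {
    hwaddr translated_addr;
    hwaddr addr_mask;  /* e.g. 0xfff for a 4K page */
    unsigned perm;     /* bit 0: read allowed, bit 1: write allowed */
} TLBEntry;

/* With is_write available, a write to a read-only mapping can be reported
 * as a fault instead of silently returning a usable translation. */
static TLBEntry toy_translate(hwaddr addr, bool is_write)
{
    TLBEntry ret = {
        .translated_addr = addr & ~0xfffULL, /* identity-map the page */
        .addr_mask = 0xfff,
        .perm = 1,                           /* read-only mapping */
    };

    if (is_write && !(ret.perm & 2)) {
        fprintf(stderr, "fault: write to read-only page 0x%" PRIx64 "\n",
                (uint64_t)(addr & ~0xfffULL));
        ret.perm = 0;      /* no access */
        ret.addr_mask = 0;
    }
    return ret;
}
```

Without the extra argument, the callback would have to grant or deny the mapping without knowing the access direction, which is what made precise fault reporting impossible before this change.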

diff --git a/include/exec/memory.h b/include/exec/memory.h
index d165b27..ea381d6 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -129,7 +129,7 @@ typedef struct MemoryRegionIOMMUOps MemoryRegionIOMMUOps;
 
 struct MemoryRegionIOMMUOps {
     /* Return a TLB entry that contains a given address. */
-    IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr);
+    IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr, bool is_write);
 };
 
 typedef struct CoalescedMemoryRange CoalescedMemoryRange;
diff --git a/exec.c b/exec.c
index 5f9857c..5122a33 100644
--- a/exec.c
+++ b/exec.c
@@ -373,7 +373,7 @@ MemoryRegion *address_space_translate(AddressSpace *as, hwaddr addr,
             break;
         }
 
-        iotlb = mr->iommu_ops->translate(mr, addr);
+        iotlb = mr->iommu_ops->translate(mr, addr, is_write);
         addr = ((iotlb.translated_addr & ~iotlb.addr_mask)
                 | (addr & iotlb.addr_mask));
         len = MIN(len, (addr | iotlb.addr_mask) - addr + 1);
diff --git a/hw/alpha/typhoon.c b/hw/alpha/typhoon.c
index 67a1070..31947d9 100644
--- a/hw/alpha/typhoon.c
+++ b/hw/alpha/typhoon.c
@@ -660,7 +660,8 @@ static bool window_translate(TyphoonWindow *win, hwaddr addr,
 /* Handle PCI-to-system address translation.  */
 /* TODO: A translation failure here ought to set PCI error codes on the
    Pchip and generate a machine check interrupt.  */
-static IOMMUTLBEntry typhoon_translate_iommu(MemoryRegion *iommu, hwaddr addr)
+static IOMMUTLBEntry typhoon_translate_iommu(MemoryRegion *iommu, hwaddr addr,
+                                             bool is_write)
 {
     TyphoonPchip *pchip = container_of(iommu, TyphoonPchip, iommu);
     IOMMUTLBEntry ret;
diff --git a/hw/pci-host/apb.c b/hw/pci-host/apb.c
index 60bd81e..762ebdd 100644
--- a/hw/pci-host/apb.c
+++ b/hw/pci-host/apb.c
@@ -204,7 +204,8 @@ static AddressSpace *pbm_pci_dma_iommu(PCIBus *bus, void *opaque, int devfn)
     return &is->iommu_as;
 }
 
-static IOMMUTLBEntry pbm_translate_iommu(MemoryRegion *iommu, hwaddr addr)
+static IOMMUTLBEntry pbm_translate_iommu(MemoryRegion *iommu, hwaddr addr,
+                                         bool is_write)
 {
     IOMMUState *is = container_of(iommu, IOMMUState, iommu);
     hwaddr baseaddr, offset;
diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index f6e32a4..6c91d8e 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -59,7 +59,8 @@ static sPAPRTCETable *spapr_tce_find_by_liobn(uint32_t liobn)
     return NULL;
 }
 
-static IOMMUTLBEntry spapr_tce_translate_iommu(MemoryRegion *iommu, hwaddr addr)
+static IOMMUTLBEntry spapr_tce_translate_iommu(MemoryRegion *iommu, hwaddr addr,
+                                               bool is_write)
 {
     sPAPRTCETable *tcet = container_of(iommu, sPAPRTCETable, iommu);
     uint64_t tce;
-- 
MST


* [Qemu-devel] [PULL 02/13] intel-iommu: introduce Intel IOMMU (VT-d) emulation
  2014-09-02 15:07 [Qemu-devel] [PULL 00/13] pci, pc fixes, features Michael S. Tsirkin
  2014-09-02 15:07 ` [Qemu-devel] [PULL 01/13] iommu: add is_write as a parameter to the translate function of MemoryRegionIOMMUOps Michael S. Tsirkin
@ 2014-09-02 15:07 ` Michael S. Tsirkin
  2014-09-02 15:07 ` [Qemu-devel] [PULL 03/13] intel-iommu: add DMAR table to ACPI tables Michael S. Tsirkin
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 19+ messages in thread
From: Michael S. Tsirkin @ 2014-09-02 15:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: Peter Maydell, Le Tan, Anthony Liguori

From: Le Tan <tamlokveer@gmail.com>

Add support for emulating an Intel IOMMU according to the VT-d specification,
for the q35 chipset machine. Implement the logic for DMAR (DMA remapping)
without PASID support. The emulation supports register-based invalidation and
primary fault logging.

Signed-off-by: Le Tan <tamlokveer@gmail.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 hw/i386/intel_iommu_internal.h |  333 +++++++++++
 include/hw/i386/intel_iommu.h  |   89 +++
 hw/i386/intel_iommu.c          | 1257 ++++++++++++++++++++++++++++++++++++++++
 hw/i386/Makefile.objs          |    1 +
 4 files changed, 1680 insertions(+)
 create mode 100644 hw/i386/intel_iommu_internal.h
 create mode 100644 include/hw/i386/intel_iommu.h
 create mode 100644 hw/i386/intel_iommu.c
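One detail worth calling out for reviewers: the CSR emulation in this patch models plain R/W, read-only and RW1C (write-1-to-clear) bits with per-register masks. The update rule used by vtd_set_long()/vtd_set_quad() can be sketched in isolation (csr_write is an illustrative name, not a function from the patch):

```c
#include <stdint.h>

/* Sketch of the CSR write rule: bits set in wmask are software-writable,
 * bits set in w1cmask are cleared when software writes 1 to them, and all
 * other bits keep their old value. */
static uint32_t csr_write(uint32_t oldval, uint32_t val,
                          uint32_t wmask, uint32_t w1cmask)
{
    return ((oldval & ~wmask) | (val & wmask)) & ~(w1cmask & val);
}
```

Writing 1 to an RW1C status bit clears it, which is how, for example, fault bits in FSTS_REG are acknowledged by the guest.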

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
new file mode 100644
index 0000000..7ca034d
--- /dev/null
+++ b/hw/i386/intel_iommu_internal.h
@@ -0,0 +1,333 @@
+/*
+ * QEMU emulation of an Intel IOMMU (VT-d)
+ *   (DMA Remapping device)
+ *
+ * Copyright (C) 2013 Knut Omang, Oracle <knut.omang@oracle.com>
+ * Copyright (C) 2014 Le Tan, <tamlokveer@gmail.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ *
+ * Lots of defines copied from kernel/include/linux/intel-iommu.h:
+ *   Copyright (C) 2006-2008 Intel Corporation
+ *   Author: Ashok Raj <ashok.raj@intel.com>
+ *   Author: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
+ *
+ */
+
+#ifndef HW_I386_INTEL_IOMMU_INTERNAL_H
+#define HW_I386_INTEL_IOMMU_INTERNAL_H
+#include "hw/i386/intel_iommu.h"
+
+/*
+ * Intel IOMMU register specification
+ */
+#define DMAR_VER_REG            0x0  /* Arch version supported by this IOMMU */
+#define DMAR_CAP_REG            0x8  /* Hardware supported capabilities */
+#define DMAR_CAP_REG_HI         0xc  /* High 32-bit of DMAR_CAP_REG */
+#define DMAR_ECAP_REG           0x10 /* Extended capabilities supported */
+#define DMAR_ECAP_REG_HI        0x14
+#define DMAR_GCMD_REG           0x18 /* Global command */
+#define DMAR_GSTS_REG           0x1c /* Global status */
+#define DMAR_RTADDR_REG         0x20 /* Root entry table */
+#define DMAR_RTADDR_REG_HI      0x24
+#define DMAR_CCMD_REG           0x28 /* Context command */
+#define DMAR_CCMD_REG_HI        0x2c
+#define DMAR_FSTS_REG           0x34 /* Fault status */
+#define DMAR_FECTL_REG          0x38 /* Fault control */
+#define DMAR_FEDATA_REG         0x3c /* Fault event interrupt data */
+#define DMAR_FEADDR_REG         0x40 /* Fault event interrupt addr */
+#define DMAR_FEUADDR_REG        0x44 /* Upper address */
+#define DMAR_AFLOG_REG          0x58 /* Advanced fault control */
+#define DMAR_AFLOG_REG_HI       0x5c
+#define DMAR_PMEN_REG           0x64 /* Enable protected memory region */
+#define DMAR_PLMBASE_REG        0x68 /* PMRR low addr */
+#define DMAR_PLMLIMIT_REG       0x6c /* PMRR low limit */
+#define DMAR_PHMBASE_REG        0x70 /* PMRR high base addr */
+#define DMAR_PHMBASE_REG_HI     0x74
+#define DMAR_PHMLIMIT_REG       0x78 /* PMRR high limit */
+#define DMAR_PHMLIMIT_REG_HI    0x7c
+#define DMAR_IQH_REG            0x80 /* Invalidation queue head */
+#define DMAR_IQH_REG_HI         0x84
+#define DMAR_IQT_REG            0x88 /* Invalidation queue tail */
+#define DMAR_IQT_REG_HI         0x8c
+#define DMAR_IQA_REG            0x90 /* Invalidation queue addr */
+#define DMAR_IQA_REG_HI         0x94
+#define DMAR_ICS_REG            0x9c /* Invalidation complete status */
+#define DMAR_IRTA_REG           0xb8 /* Interrupt remapping table addr */
+#define DMAR_IRTA_REG_HI        0xbc
+#define DMAR_IECTL_REG          0xa0 /* Invalidation event control */
+#define DMAR_IEDATA_REG         0xa4 /* Invalidation event data */
+#define DMAR_IEADDR_REG         0xa8 /* Invalidation event address */
+#define DMAR_IEUADDR_REG        0xac /* Invalidation event address */
+#define DMAR_PQH_REG            0xc0 /* Page request queue head */
+#define DMAR_PQH_REG_HI         0xc4
+#define DMAR_PQT_REG            0xc8 /* Page request queue tail */
+#define DMAR_PQT_REG_HI         0xcc
+#define DMAR_PQA_REG            0xd0 /* Page request queue address */
+#define DMAR_PQA_REG_HI         0xd4
+#define DMAR_PRS_REG            0xdc /* Page request status */
+#define DMAR_PECTL_REG          0xe0 /* Page request event control */
+#define DMAR_PEDATA_REG         0xe4 /* Page request event data */
+#define DMAR_PEADDR_REG         0xe8 /* Page request event address */
+#define DMAR_PEUADDR_REG        0xec /* Page event upper address */
+#define DMAR_MTRRCAP_REG        0x100 /* MTRR capability */
+#define DMAR_MTRRCAP_REG_HI     0x104
+#define DMAR_MTRRDEF_REG        0x108 /* MTRR default type */
+#define DMAR_MTRRDEF_REG_HI     0x10c
+
+/* IOTLB registers */
+#define DMAR_IOTLB_REG_OFFSET   0xf0 /* Offset to the IOTLB registers */
+#define DMAR_IVA_REG            DMAR_IOTLB_REG_OFFSET /* Invalidate address */
+#define DMAR_IVA_REG_HI         (DMAR_IVA_REG + 4)
+/* IOTLB invalidate register */
+#define DMAR_IOTLB_REG          (DMAR_IOTLB_REG_OFFSET + 0x8)
+#define DMAR_IOTLB_REG_HI       (DMAR_IOTLB_REG + 4)
+
+/* FRCD */
+#define DMAR_FRCD_REG_OFFSET    0x220 /* Offset to the fault recording regs */
+/* NOTICE: If you change the DMAR_FRCD_REG_NR, please remember to change the
+ * DMAR_REG_SIZE in include/hw/i386/intel_iommu.h.
+ * #define DMAR_REG_SIZE   (DMAR_FRCD_REG_OFFSET + 16 * DMAR_FRCD_REG_NR)
+ */
+#define DMAR_FRCD_REG_NR        1ULL /* Num of fault recording regs */
+
+#define DMAR_FRCD_REG_0_0       0x220 /* The 0th fault recording register */
+#define DMAR_FRCD_REG_0_1       0x224
+#define DMAR_FRCD_REG_0_2       0x228
+#define DMAR_FRCD_REG_0_3       0x22c
+
+/* Interrupt Address Range */
+#define VTD_INTERRUPT_ADDR_FIRST    0xfee00000ULL
+#define VTD_INTERRUPT_ADDR_LAST     0xfeefffffULL
+
+/* IOTLB_REG */
+#define VTD_TLB_GLOBAL_FLUSH        (1ULL << 60) /* Global invalidation */
+#define VTD_TLB_DSI_FLUSH           (2ULL << 60) /* Domain-selective */
+#define VTD_TLB_PSI_FLUSH           (3ULL << 60) /* Page-selective */
+#define VTD_TLB_FLUSH_GRANU_MASK    (3ULL << 60)
+#define VTD_TLB_GLOBAL_FLUSH_A      (1ULL << 57)
+#define VTD_TLB_DSI_FLUSH_A         (2ULL << 57)
+#define VTD_TLB_PSI_FLUSH_A         (3ULL << 57)
+#define VTD_TLB_FLUSH_GRANU_MASK_A  (3ULL << 57)
+#define VTD_TLB_IVT                 (1ULL << 63)
+
+/* GCMD_REG */
+#define VTD_GCMD_TE                 (1UL << 31)
+#define VTD_GCMD_SRTP               (1UL << 30)
+#define VTD_GCMD_SFL                (1UL << 29)
+#define VTD_GCMD_EAFL               (1UL << 28)
+#define VTD_GCMD_WBF                (1UL << 27)
+#define VTD_GCMD_QIE                (1UL << 26)
+#define VTD_GCMD_IRE                (1UL << 25)
+#define VTD_GCMD_SIRTP              (1UL << 24)
+#define VTD_GCMD_CFI                (1UL << 23)
+
+/* GSTS_REG */
+#define VTD_GSTS_TES                (1UL << 31)
+#define VTD_GSTS_RTPS               (1UL << 30)
+#define VTD_GSTS_FLS                (1UL << 29)
+#define VTD_GSTS_AFLS               (1UL << 28)
+#define VTD_GSTS_WBFS               (1UL << 27)
+#define VTD_GSTS_QIES               (1UL << 26)
+#define VTD_GSTS_IRES               (1UL << 25)
+#define VTD_GSTS_IRTPS              (1UL << 24)
+#define VTD_GSTS_CFIS               (1UL << 23)
+
+/* CCMD_REG */
+#define VTD_CCMD_ICC                (1ULL << 63)
+#define VTD_CCMD_GLOBAL_INVL        (1ULL << 61)
+#define VTD_CCMD_DOMAIN_INVL        (2ULL << 61)
+#define VTD_CCMD_DEVICE_INVL        (3ULL << 61)
+#define VTD_CCMD_CIRG_MASK          (3ULL << 61)
+#define VTD_CCMD_GLOBAL_INVL_A      (1ULL << 59)
+#define VTD_CCMD_DOMAIN_INVL_A      (2ULL << 59)
+#define VTD_CCMD_DEVICE_INVL_A      (3ULL << 59)
+#define VTD_CCMD_CAIG_MASK          (3ULL << 59)
+
+/* RTADDR_REG */
+#define VTD_RTADDR_RTT              (1ULL << 11)
+#define VTD_RTADDR_ADDR_MASK        (VTD_HAW_MASK ^ 0xfffULL)
+
+/* ECAP_REG */
+/* (offset >> 4) << 8 */
+#define VTD_ECAP_IRO                (DMAR_IOTLB_REG_OFFSET << 4)
+#define VTD_ECAP_QI                 (1ULL << 1)
+
+/* CAP_REG */
+/* (offset >> 4) << 24 */
+#define VTD_CAP_FRO                 (DMAR_FRCD_REG_OFFSET << 20)
+#define VTD_CAP_NFR                 ((DMAR_FRCD_REG_NR - 1) << 40)
+#define VTD_DOMAIN_ID_SHIFT         16  /* 16-bit domain id for 64K domains */
+#define VTD_CAP_ND                  (((VTD_DOMAIN_ID_SHIFT - 4) / 2) & 7ULL)
+#define VTD_MGAW                    39  /* Maximum Guest Address Width */
+#define VTD_CAP_MGAW                (((VTD_MGAW - 1) & 0x3fULL) << 16)
+
+/* Supported Adjusted Guest Address Widths */
+#define VTD_CAP_SAGAW_SHIFT         8
+#define VTD_CAP_SAGAW_MASK          (0x1fULL << VTD_CAP_SAGAW_SHIFT)
+ /* 39-bit AGAW, 3-level page-table */
+#define VTD_CAP_SAGAW_39bit         (0x2ULL << VTD_CAP_SAGAW_SHIFT)
+ /* 48-bit AGAW, 4-level page-table */
+#define VTD_CAP_SAGAW_48bit         (0x4ULL << VTD_CAP_SAGAW_SHIFT)
+#define VTD_CAP_SAGAW               VTD_CAP_SAGAW_39bit
+
+/* IQT_REG */
+#define VTD_IQT_QT(val)             (((val) >> 4) & 0x7fffULL)
+
+/* IQA_REG */
+#define VTD_IQA_IQA_MASK            (VTD_HAW_MASK ^ 0xfffULL)
+#define VTD_IQA_QS                  0x7ULL
+
+/* IQH_REG */
+#define VTD_IQH_QH_SHIFT            4
+#define VTD_IQH_QH_MASK             0x7fff0ULL
+
+/* ICS_REG */
+#define VTD_ICS_IWC                 1UL
+
+/* IECTL_REG */
+#define VTD_IECTL_IM                (1UL << 31)
+#define VTD_IECTL_IP                (1UL << 30)
+
+/* FSTS_REG */
+#define VTD_FSTS_FRI_MASK       0xff00UL
+#define VTD_FSTS_FRI(val)       ((((uint32_t)(val)) << 8) & VTD_FSTS_FRI_MASK)
+#define VTD_FSTS_IQE            (1UL << 4)
+#define VTD_FSTS_PPF            (1UL << 1)
+#define VTD_FSTS_PFO            1UL
+
+/* FECTL_REG */
+#define VTD_FECTL_IM            (1UL << 31)
+#define VTD_FECTL_IP            (1UL << 30)
+
+/* Fault Recording Register */
+/* For the high 64-bit of 128-bit */
+#define VTD_FRCD_F              (1ULL << 63)
+#define VTD_FRCD_T              (1ULL << 62)
+#define VTD_FRCD_FR(val)        (((val) & 0xffULL) << 32)
+#define VTD_FRCD_SID_MASK       0xffffULL
+#define VTD_FRCD_SID(val)       ((val) & VTD_FRCD_SID_MASK)
+/* For the low 64-bit of 128-bit */
+#define VTD_FRCD_FI(val)        ((val) & (((1ULL << VTD_MGAW) - 1) ^ 0xfffULL))
+
+/* DMA Remapping Fault Conditions */
+typedef enum VTDFaultReason {
+    VTD_FR_RESERVED = 0,        /* Reserved for Advanced Fault logging */
+    VTD_FR_ROOT_ENTRY_P = 1,    /* The Present(P) field of root-entry is 0 */
+    VTD_FR_CONTEXT_ENTRY_P,     /* The Present(P) field of context-entry is 0 */
+    VTD_FR_CONTEXT_ENTRY_INV,   /* Invalid programming of a context-entry */
+    VTD_FR_ADDR_BEYOND_MGAW,    /* Input-address above (2^x-1) */
+    VTD_FR_WRITE,               /* No write permission */
+    VTD_FR_READ,                /* No read permission */
+    /* Fail to access a second-level paging entry (not SL_PML4E) */
+    VTD_FR_PAGING_ENTRY_INV,
+    VTD_FR_ROOT_TABLE_INV,      /* Fail to access a root-entry */
+    VTD_FR_CONTEXT_TABLE_INV,   /* Fail to access a context-entry */
+    /* Non-zero reserved field in a present root-entry */
+    VTD_FR_ROOT_ENTRY_RSVD,
+    /* Non-zero reserved field in a present context-entry */
+    VTD_FR_CONTEXT_ENTRY_RSVD,
+    /* Non-zero reserved field in a second-level paging entry with at least
+     * one of the Read(R), Write(W) or Execute(E) fields Set.
+     */
+    VTD_FR_PAGING_ENTRY_RSVD,
+    /* Translation request or translated request explicitly blocked due to the
+     * programming of the Translation Type (T) field in the present
+     * context-entry.
+     */
+    VTD_FR_CONTEXT_ENTRY_TT,
+    /* This is not a normal fault reason. We use this to indicate some faults
+     * that are not referenced by the VT-d specification.
+     * Fault event with such reason should not be recorded.
+     */
+    VTD_FR_RESERVED_ERR,
+    VTD_FR_MAX,                 /* Guard */
+} VTDFaultReason;
+
+/* Masks for Queued Invalidation Descriptor */
+#define VTD_INV_DESC_TYPE           0xf
+#define VTD_INV_DESC_CC             0x1 /* Context-cache Invalidate Desc */
+#define VTD_INV_DESC_IOTLB          0x2
+#define VTD_INV_DESC_WAIT           0x5 /* Invalidation Wait Descriptor */
+#define VTD_INV_DESC_NONE           0   /* Not an Invalidate Descriptor */
+
+/* Pagesize of VTD paging structures, including root and context tables */
+#define VTD_PAGE_SHIFT              12
+#define VTD_PAGE_SIZE               (1ULL << VTD_PAGE_SHIFT)
+
+#define VTD_PAGE_SHIFT_4K           12
+#define VTD_PAGE_MASK_4K            (~((1ULL << VTD_PAGE_SHIFT_4K) - 1))
+#define VTD_PAGE_SHIFT_2M           21
+#define VTD_PAGE_MASK_2M            (~((1ULL << VTD_PAGE_SHIFT_2M) - 1))
+#define VTD_PAGE_SHIFT_1G           30
+#define VTD_PAGE_MASK_1G            (~((1ULL << VTD_PAGE_SHIFT_1G) - 1))
+
+struct VTDRootEntry {
+    uint64_t val;
+    uint64_t rsvd;
+};
+typedef struct VTDRootEntry VTDRootEntry;
+
+/* Masks for struct VTDRootEntry */
+#define VTD_ROOT_ENTRY_P            1ULL
+#define VTD_ROOT_ENTRY_CTP          (~0xfffULL)
+
+#define VTD_ROOT_ENTRY_NR           (VTD_PAGE_SIZE / sizeof(VTDRootEntry))
+#define VTD_ROOT_ENTRY_RSVD         (0xffeULL | ~VTD_HAW_MASK)
+
+/* Context-Entry */
+struct VTDContextEntry {
+    uint64_t lo;
+    uint64_t hi;
+};
+typedef struct VTDContextEntry VTDContextEntry;
+
+/* Masks for struct VTDContextEntry */
+/* lo */
+#define VTD_CONTEXT_ENTRY_P         (1ULL << 0)
+#define VTD_CONTEXT_ENTRY_FPD       (1ULL << 1) /* Fault Processing Disable */
+#define VTD_CONTEXT_ENTRY_TT        (3ULL << 2) /* Translation Type */
+#define VTD_CONTEXT_TT_MULTI_LEVEL  0
+#define VTD_CONTEXT_TT_DEV_IOTLB    1
+#define VTD_CONTEXT_TT_PASS_THROUGH 2
+/* Second Level Page Translation Pointer */
+#define VTD_CONTEXT_ENTRY_SLPTPTR   (~0xfffULL)
+#define VTD_CONTEXT_ENTRY_RSVD_LO   (0xff0ULL | ~VTD_HAW_MASK)
+/* hi */
+#define VTD_CONTEXT_ENTRY_AW        7ULL /* Adjusted guest-address-width */
+#define VTD_CONTEXT_ENTRY_DID       (0xffffULL << 8) /* Domain Identifier */
+#define VTD_CONTEXT_ENTRY_RSVD_HI   0xffffffffff000080ULL
+
+#define VTD_CONTEXT_ENTRY_NR        (VTD_PAGE_SIZE / sizeof(VTDContextEntry))
+
+/* Paging Structure common */
+#define VTD_SL_PT_PAGE_SIZE_MASK    (1ULL << 7)
+/* Bits to decide the offset for each level */
+#define VTD_SL_LEVEL_BITS           9
+
+/* Second Level Paging Structure */
+#define VTD_SL_PML4_LEVEL           4
+#define VTD_SL_PDP_LEVEL            3
+#define VTD_SL_PD_LEVEL             2
+#define VTD_SL_PT_LEVEL             1
+#define VTD_SL_PT_ENTRY_NR          512
+
+/* Masks for Second Level Paging Entry */
+#define VTD_SL_RW_MASK              3ULL
+#define VTD_SL_R                    1ULL
+#define VTD_SL_W                    (1ULL << 1)
+#define VTD_SL_PT_BASE_ADDR_MASK    (~(VTD_PAGE_SIZE - 1) & VTD_HAW_MASK)
+#define VTD_SL_IGN_COM              0xbff0000000000000ULL
+
+#endif
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
new file mode 100644
index 0000000..fe1f1e9
--- /dev/null
+++ b/include/hw/i386/intel_iommu.h
@@ -0,0 +1,89 @@
+/*
+ * QEMU emulation of an Intel IOMMU (VT-d)
+ *   (DMA Remapping device)
+ *
+ * Copyright (C) 2013 Knut Omang, Oracle <knut.omang@oracle.com>
+ * Copyright (C) 2014 Le Tan, <tamlokveer@gmail.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef INTEL_IOMMU_H
+#define INTEL_IOMMU_H
+#include "hw/qdev.h"
+#include "sysemu/dma.h"
+
+#define TYPE_INTEL_IOMMU_DEVICE "intel-iommu"
+#define INTEL_IOMMU_DEVICE(obj) \
+     OBJECT_CHECK(IntelIOMMUState, (obj), TYPE_INTEL_IOMMU_DEVICE)
+
+/* DMAR Hardware Unit Definition address (IOMMU unit) */
+#define Q35_HOST_BRIDGE_IOMMU_ADDR  0xfed90000ULL
+
+#define VTD_PCI_BUS_MAX             256
+#define VTD_PCI_SLOT_MAX            32
+#define VTD_PCI_FUNC_MAX            8
+#define VTD_PCI_DEVFN_MAX           256
+#define VTD_PCI_SLOT(devfn)         (((devfn) >> 3) & 0x1f)
+#define VTD_PCI_FUNC(devfn)         ((devfn) & 0x07)
+
+#define DMAR_REG_SIZE               0x230
+#define VTD_HOST_ADDRESS_WIDTH      39
+#define VTD_HAW_MASK                ((1ULL << VTD_HOST_ADDRESS_WIDTH) - 1)
+
+typedef struct IntelIOMMUState IntelIOMMUState;
+typedef struct VTDAddressSpace VTDAddressSpace;
+
+struct VTDAddressSpace {
+    uint8_t bus_num;
+    uint8_t devfn;
+    AddressSpace as;
+    MemoryRegion iommu;
+    IntelIOMMUState *iommu_state;
+};
+
+/* The iommu (DMAR) device state struct */
+struct IntelIOMMUState {
+    SysBusDevice busdev;
+    MemoryRegion csrmem;
+    uint8_t csr[DMAR_REG_SIZE];     /* register values */
+    uint8_t wmask[DMAR_REG_SIZE];   /* R/W bytes */
+    uint8_t w1cmask[DMAR_REG_SIZE]; /* RW1C(Write 1 to Clear) bytes */
+    uint8_t womask[DMAR_REG_SIZE];  /* WO (write only - read returns 0) */
+    uint32_t version;
+
+    dma_addr_t root;                /* Current root table pointer */
+    bool root_extended;             /* Type of root table (extended or not) */
+    bool dmar_enabled;              /* Set if DMA remapping is enabled */
+
+    uint16_t iq_head;               /* Current invalidation queue head */
+    uint16_t iq_tail;               /* Current invalidation queue tail */
+    dma_addr_t iq;                  /* Current invalidation queue pointer */
+    uint16_t iq_size;               /* IQ Size in number of entries */
+    bool qi_enabled;                /* Set if the QI is enabled */
+    uint8_t iq_last_desc_type;      /* The type of last completed descriptor */
+
+    /* The index of the Fault Recording Register to be used next.
+     * Wraps around from N-1 to 0, where N is the number of FRCD_REG.
+     */
+    uint16_t next_frcd_reg;
+
+    uint64_t cap;                   /* The value of capability reg */
+    uint64_t ecap;                  /* The value of extended capability reg */
+
+    MemoryRegionIOMMUOps iommu_ops;
+    VTDAddressSpace **address_spaces[VTD_PCI_BUS_MAX];
+};
+
+#endif
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
new file mode 100644
index 0000000..8e67e04
--- /dev/null
+++ b/hw/i386/intel_iommu.c
@@ -0,0 +1,1257 @@
+/*
+ * QEMU emulation of an Intel IOMMU (VT-d)
+ *   (DMA Remapping device)
+ *
+ * Copyright (C) 2013 Knut Omang, Oracle <knut.omang@oracle.com>
+ * Copyright (C) 2014 Le Tan, <tamlokveer@gmail.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "hw/sysbus.h"
+#include "exec/address-spaces.h"
+#include "intel_iommu_internal.h"
+
+/*#define DEBUG_INTEL_IOMMU*/
+#ifdef DEBUG_INTEL_IOMMU
+enum {
+    DEBUG_GENERAL, DEBUG_CSR, DEBUG_INV, DEBUG_MMU, DEBUG_FLOG,
+};
+#define VTD_DBGBIT(x)   (1 << DEBUG_##x)
+static int vtd_dbgflags = VTD_DBGBIT(GENERAL) | VTD_DBGBIT(CSR);
+
+#define VTD_DPRINTF(what, fmt, ...) do { \
+    if (vtd_dbgflags & VTD_DBGBIT(what)) { \
+        fprintf(stderr, "(vtd)%s: " fmt "\n", __func__, \
+                ## __VA_ARGS__); } \
+    } while (0)
+#else
+#define VTD_DPRINTF(what, fmt, ...) do {} while (0)
+#endif
+
+static void vtd_define_quad(IntelIOMMUState *s, hwaddr addr, uint64_t val,
+                            uint64_t wmask, uint64_t w1cmask)
+{
+    stq_le_p(&s->csr[addr], val);
+    stq_le_p(&s->wmask[addr], wmask);
+    stq_le_p(&s->w1cmask[addr], w1cmask);
+}
+
+static void vtd_define_quad_wo(IntelIOMMUState *s, hwaddr addr, uint64_t mask)
+{
+    stq_le_p(&s->womask[addr], mask);
+}
+
+static void vtd_define_long(IntelIOMMUState *s, hwaddr addr, uint32_t val,
+                            uint32_t wmask, uint32_t w1cmask)
+{
+    stl_le_p(&s->csr[addr], val);
+    stl_le_p(&s->wmask[addr], wmask);
+    stl_le_p(&s->w1cmask[addr], w1cmask);
+}
+
+static void vtd_define_long_wo(IntelIOMMUState *s, hwaddr addr, uint32_t mask)
+{
+    stl_le_p(&s->womask[addr], mask);
+}
+
+/* "External" get/set operations */
+static void vtd_set_quad(IntelIOMMUState *s, hwaddr addr, uint64_t val)
+{
+    uint64_t oldval = ldq_le_p(&s->csr[addr]);
+    uint64_t wmask = ldq_le_p(&s->wmask[addr]);
+    uint64_t w1cmask = ldq_le_p(&s->w1cmask[addr]);
+    stq_le_p(&s->csr[addr],
+             ((oldval & ~wmask) | (val & wmask)) & ~(w1cmask & val));
+}
+
+static void vtd_set_long(IntelIOMMUState *s, hwaddr addr, uint32_t val)
+{
+    uint32_t oldval = ldl_le_p(&s->csr[addr]);
+    uint32_t wmask = ldl_le_p(&s->wmask[addr]);
+    uint32_t w1cmask = ldl_le_p(&s->w1cmask[addr]);
+    stl_le_p(&s->csr[addr],
+             ((oldval & ~wmask) | (val & wmask)) & ~(w1cmask & val));
+}
+
+static uint64_t vtd_get_quad(IntelIOMMUState *s, hwaddr addr)
+{
+    uint64_t val = ldq_le_p(&s->csr[addr]);
+    uint64_t womask = ldq_le_p(&s->womask[addr]);
+    return val & ~womask;
+}
+
+static uint32_t vtd_get_long(IntelIOMMUState *s, hwaddr addr)
+{
+    uint32_t val = ldl_le_p(&s->csr[addr]);
+    uint32_t womask = ldl_le_p(&s->womask[addr]);
+    return val & ~womask;
+}
+
+/* "Internal" get/set operations */
+static uint64_t vtd_get_quad_raw(IntelIOMMUState *s, hwaddr addr)
+{
+    return ldq_le_p(&s->csr[addr]);
+}
+
+static uint32_t vtd_get_long_raw(IntelIOMMUState *s, hwaddr addr)
+{
+    return ldl_le_p(&s->csr[addr]);
+}
+
+static void vtd_set_quad_raw(IntelIOMMUState *s, hwaddr addr, uint64_t val)
+{
+    stq_le_p(&s->csr[addr], val);
+}
+
+static uint32_t vtd_set_clear_mask_long(IntelIOMMUState *s, hwaddr addr,
+                                        uint32_t clear, uint32_t mask)
+{
+    uint32_t new_val = (ldl_le_p(&s->csr[addr]) & ~clear) | mask;
+    stl_le_p(&s->csr[addr], new_val);
+    return new_val;
+}
+
+static uint64_t vtd_set_clear_mask_quad(IntelIOMMUState *s, hwaddr addr,
+                                        uint64_t clear, uint64_t mask)
+{
+    uint64_t new_val = (ldq_le_p(&s->csr[addr]) & ~clear) | mask;
+    stq_le_p(&s->csr[addr], new_val);
+    return new_val;
+}
+
+/* Given the register offsets of the MSI message address and data, generate an
+ * interrupt via MSI.
+ */
+static void vtd_generate_interrupt(IntelIOMMUState *s, hwaddr mesg_addr_reg,
+                                   hwaddr mesg_data_reg)
+{
+    hwaddr addr;
+    uint32_t data;
+
+    assert(mesg_data_reg < DMAR_REG_SIZE);
+    assert(mesg_addr_reg < DMAR_REG_SIZE);
+
+    addr = vtd_get_long_raw(s, mesg_addr_reg);
+    data = vtd_get_long_raw(s, mesg_data_reg);
+
+    VTD_DPRINTF(FLOG, "msi: addr 0x%"PRIx64 " data 0x%"PRIx32, addr, data);
+    stl_le_phys(&address_space_memory, addr, data);
+}
+
+/* Generate a fault event to software via MSI if conditions are met.
+ * Note that @pre_fsts should be the value of FSTS_REG before any update.
+ */
+static void vtd_generate_fault_event(IntelIOMMUState *s, uint32_t pre_fsts)
+{
+    if (pre_fsts & VTD_FSTS_PPF || pre_fsts & VTD_FSTS_PFO ||
+        pre_fsts & VTD_FSTS_IQE) {
+        VTD_DPRINTF(FLOG, "there are previous interrupt conditions "
+                    "to be serviced by software, fault event is not generated "
+                    "(FSTS_REG 0x%"PRIx32 ")", pre_fsts);
+        return;
+    }
+    vtd_set_clear_mask_long(s, DMAR_FECTL_REG, 0, VTD_FECTL_IP);
+    if (vtd_get_long_raw(s, DMAR_FECTL_REG) & VTD_FECTL_IM) {
+        VTD_DPRINTF(FLOG, "Interrupt Mask set, fault event is not generated");
+    } else {
+        vtd_generate_interrupt(s, DMAR_FEADDR_REG, DMAR_FEDATA_REG);
+        vtd_set_clear_mask_long(s, DMAR_FECTL_REG, VTD_FECTL_IP, 0);
+    }
+}
+
+/* Check if the Fault (F) field of the Fault Recording Register referenced by
+ * @index is Set.
+ */
+static bool vtd_is_frcd_set(IntelIOMMUState *s, uint16_t index)
+{
+    /* Each reg is 128-bit */
+    hwaddr addr = DMAR_FRCD_REG_OFFSET + (((uint64_t)index) << 4);
+    addr += 8; /* Access the high 64-bit half */
+
+    assert(index < DMAR_FRCD_REG_NR);
+
+    return vtd_get_quad_raw(s, addr) & VTD_FRCD_F;
+}
+
+/* Update the PPF field of the Fault Status Register.
+ * Should be called whenever the F field of any fault recording register
+ * changes.
+ */
+static void vtd_update_fsts_ppf(IntelIOMMUState *s)
+{
+    uint32_t i;
+    uint32_t ppf_mask = 0;
+
+    for (i = 0; i < DMAR_FRCD_REG_NR; i++) {
+        if (vtd_is_frcd_set(s, i)) {
+            ppf_mask = VTD_FSTS_PPF;
+            break;
+        }
+    }
+    vtd_set_clear_mask_long(s, DMAR_FSTS_REG, VTD_FSTS_PPF, ppf_mask);
+    VTD_DPRINTF(FLOG, "set PPF of FSTS_REG to %d", ppf_mask ? 1 : 0);
+}
+
+static void vtd_set_frcd_and_update_ppf(IntelIOMMUState *s, uint16_t index)
+{
+    /* Each reg is 128-bit */
+    hwaddr addr = DMAR_FRCD_REG_OFFSET + (((uint64_t)index) << 4);
+    addr += 8; /* Access the high 64-bit half */
+
+    assert(index < DMAR_FRCD_REG_NR);
+
+    vtd_set_clear_mask_quad(s, addr, 0, VTD_FRCD_F);
+    vtd_update_fsts_ppf(s);
+}
+
+/* Do not set the F field here; it is set later to publish the fault */
+static void vtd_record_frcd(IntelIOMMUState *s, uint16_t index,
+                            uint16_t source_id, hwaddr addr,
+                            VTDFaultReason fault, bool is_write)
+{
+    uint64_t hi = 0, lo;
+    hwaddr frcd_reg_addr = DMAR_FRCD_REG_OFFSET + (((uint64_t)index) << 4);
+
+    assert(index < DMAR_FRCD_REG_NR);
+
+    lo = VTD_FRCD_FI(addr);
+    hi = VTD_FRCD_SID(source_id) | VTD_FRCD_FR(fault);
+    if (!is_write) {
+        hi |= VTD_FRCD_T;
+    }
+    vtd_set_quad_raw(s, frcd_reg_addr, lo);
+    vtd_set_quad_raw(s, frcd_reg_addr + 8, hi);
+    VTD_DPRINTF(FLOG, "record to FRCD_REG #%"PRIu16 ": hi 0x%"PRIx64
+                ", lo 0x%"PRIx64, index, hi, lo);
+}
+
+/* Try to collapse multiple pending faults from the same requester */
+static bool vtd_try_collapse_fault(IntelIOMMUState *s, uint16_t source_id)
+{
+    uint32_t i;
+    uint64_t frcd_reg;
+    hwaddr addr = DMAR_FRCD_REG_OFFSET + 8; /* The high 64-bit half */
+
+    for (i = 0; i < DMAR_FRCD_REG_NR; i++) {
+        frcd_reg = vtd_get_quad_raw(s, addr);
+        VTD_DPRINTF(FLOG, "frcd_reg #%d 0x%"PRIx64, i, frcd_reg);
+        if ((frcd_reg & VTD_FRCD_F) &&
+            ((frcd_reg & VTD_FRCD_SID_MASK) == source_id)) {
+            return true;
+        }
+        addr += 16; /* 128-bit for each */
+    }
+    return false;
+}
+
+/* Log and report a DMAR (address translation) fault to software */
+static void vtd_report_dmar_fault(IntelIOMMUState *s, uint16_t source_id,
+                                  hwaddr addr, VTDFaultReason fault,
+                                  bool is_write)
+{
+    uint32_t fsts_reg = vtd_get_long_raw(s, DMAR_FSTS_REG);
+
+    assert(fault < VTD_FR_MAX);
+
+    if (fault == VTD_FR_RESERVED_ERR) {
+        /* This is not a normal fault reason case. Drop it. */
+        return;
+    }
+    VTD_DPRINTF(FLOG, "sid 0x%"PRIx16 ", fault %d, addr 0x%"PRIx64
+                ", is_write %d", source_id, fault, addr, is_write);
+    if (fsts_reg & VTD_FSTS_PFO) {
+        VTD_DPRINTF(FLOG, "new fault is not recorded due to "
+                    "Primary Fault Overflow");
+        return;
+    }
+    if (vtd_try_collapse_fault(s, source_id)) {
+        VTD_DPRINTF(FLOG, "new fault is not recorded due to "
+                    "compression of faults");
+        return;
+    }
+    if (vtd_is_frcd_set(s, s->next_frcd_reg)) {
+        VTD_DPRINTF(FLOG, "Primary Fault Overflow and "
+                    "new fault is not recorded, set PFO field");
+        vtd_set_clear_mask_long(s, DMAR_FSTS_REG, 0, VTD_FSTS_PFO);
+        return;
+    }
+
+    vtd_record_frcd(s, s->next_frcd_reg, source_id, addr, fault, is_write);
+
+    if (fsts_reg & VTD_FSTS_PPF) {
+        VTD_DPRINTF(FLOG, "there are pending faults already, "
+                    "fault event is not generated");
+        vtd_set_frcd_and_update_ppf(s, s->next_frcd_reg);
+        s->next_frcd_reg++;
+        if (s->next_frcd_reg == DMAR_FRCD_REG_NR) {
+            s->next_frcd_reg = 0;
+        }
+    } else {
+        vtd_set_clear_mask_long(s, DMAR_FSTS_REG, VTD_FSTS_FRI_MASK,
+                                VTD_FSTS_FRI(s->next_frcd_reg));
+        vtd_set_frcd_and_update_ppf(s, s->next_frcd_reg); /* Will set PPF */
+        s->next_frcd_reg++;
+        if (s->next_frcd_reg == DMAR_FRCD_REG_NR) {
+            s->next_frcd_reg = 0;
+        }
+        /* This case actually causes the PPF to be Set.
+         * So generate a fault event (interrupt).
+         */
+        vtd_generate_fault_event(s, fsts_reg);
+    }
+}
+
+static inline bool vtd_root_entry_present(VTDRootEntry *root)
+{
+    return root->val & VTD_ROOT_ENTRY_P;
+}
+
+static int vtd_get_root_entry(IntelIOMMUState *s, uint8_t index,
+                              VTDRootEntry *re)
+{
+    dma_addr_t addr;
+
+    addr = s->root + index * sizeof(*re);
+    if (dma_memory_read(&address_space_memory, addr, re, sizeof(*re))) {
+        VTD_DPRINTF(GENERAL, "error: failed to access root-entry at 0x%"PRIx64
+                    " + %"PRIu8, s->root, index);
+        re->val = 0;
+        return -VTD_FR_ROOT_TABLE_INV;
+    }
+    re->val = le64_to_cpu(re->val);
+    return 0;
+}
+
+static inline bool vtd_context_entry_present(VTDContextEntry *context)
+{
+    return context->lo & VTD_CONTEXT_ENTRY_P;
+}
+
+static int vtd_get_context_entry_from_root(VTDRootEntry *root, uint8_t index,
+                                           VTDContextEntry *ce)
+{
+    dma_addr_t addr;
+
+    if (!vtd_root_entry_present(root)) {
+        VTD_DPRINTF(GENERAL, "error: root-entry is not present");
+        return -VTD_FR_ROOT_ENTRY_P;
+    }
+    addr = (root->val & VTD_ROOT_ENTRY_CTP) + index * sizeof(*ce);
+    if (dma_memory_read(&address_space_memory, addr, ce, sizeof(*ce))) {
+        VTD_DPRINTF(GENERAL, "error: failed to access context-entry at 0x%"PRIx64
+                    " + %"PRIu8,
+                    (uint64_t)(root->val & VTD_ROOT_ENTRY_CTP), index);
+        return -VTD_FR_CONTEXT_TABLE_INV;
+    }
+    ce->lo = le64_to_cpu(ce->lo);
+    ce->hi = le64_to_cpu(ce->hi);
+    return 0;
+}
+
+static inline dma_addr_t vtd_get_slpt_base_from_context(VTDContextEntry *ce)
+{
+    return ce->lo & VTD_CONTEXT_ENTRY_SLPTPTR;
+}
+
+/* The shift of an addr for a certain level of paging structure */
+static inline uint32_t vtd_slpt_level_shift(uint32_t level)
+{
+    return VTD_PAGE_SHIFT_4K + (level - 1) * VTD_SL_LEVEL_BITS;
+}
+
+static inline uint64_t vtd_get_slpte_addr(uint64_t slpte)
+{
+    return slpte & VTD_SL_PT_BASE_ADDR_MASK;
+}
+
+/* Whether the pte indicates the address of the page frame */
+static inline bool vtd_is_last_slpte(uint64_t slpte, uint32_t level)
+{
+    return level == VTD_SL_PT_LEVEL || (slpte & VTD_SL_PT_PAGE_SIZE_MASK);
+}
+
+/* Get the content of a slpte located in @base_addr[@index] */
+static uint64_t vtd_get_slpte(dma_addr_t base_addr, uint32_t index)
+{
+    uint64_t slpte;
+
+    assert(index < VTD_SL_PT_ENTRY_NR);
+
+    if (dma_memory_read(&address_space_memory,
+                        base_addr + index * sizeof(slpte), &slpte,
+                        sizeof(slpte))) {
+        slpte = (uint64_t)-1;
+        return slpte;
+    }
+    slpte = le64_to_cpu(slpte);
+    return slpte;
+}
+
+/* Given a gpa and a level of the paging structure, return the offset of the
+ * entry within that level.
+ */
+static inline uint32_t vtd_gpa_level_offset(uint64_t gpa, uint32_t level)
+{
+    return (gpa >> vtd_slpt_level_shift(level)) &
+            ((1ULL << VTD_SL_LEVEL_BITS) - 1);
+}
+
+/* Check Capability Register to see if the @level of page-table is supported */
+static inline bool vtd_is_level_supported(IntelIOMMUState *s, uint32_t level)
+{
+    return VTD_CAP_SAGAW_MASK & s->cap &
+           (1ULL << (level - 2 + VTD_CAP_SAGAW_SHIFT));
+}
+
+/* Get the page-table level that hardware should use for the second-level
+ * page-table walk from the Address Width field of context-entry.
+ */
+static inline uint32_t vtd_get_level_from_context_entry(VTDContextEntry *ce)
+{
+    return 2 + (ce->hi & VTD_CONTEXT_ENTRY_AW);
+}
+
+static inline uint32_t vtd_get_agaw_from_context_entry(VTDContextEntry *ce)
+{
+    return 30 + (ce->hi & VTD_CONTEXT_ENTRY_AW) * 9;
+}
+
+static const uint64_t vtd_paging_entry_rsvd_field[] = {
+    [0] = ~0ULL,
+    /* For non-large pages */
+    [1] = 0x800ULL | ~(VTD_HAW_MASK | VTD_SL_IGN_COM),
+    [2] = 0x800ULL | ~(VTD_HAW_MASK | VTD_SL_IGN_COM),
+    [3] = 0x800ULL | ~(VTD_HAW_MASK | VTD_SL_IGN_COM),
+    [4] = 0x880ULL | ~(VTD_HAW_MASK | VTD_SL_IGN_COM),
+    /* For large pages */
+    [5] = 0x800ULL | ~(VTD_HAW_MASK | VTD_SL_IGN_COM),
+    [6] = 0x1ff800ULL | ~(VTD_HAW_MASK | VTD_SL_IGN_COM),
+    [7] = 0x3ffff800ULL | ~(VTD_HAW_MASK | VTD_SL_IGN_COM),
+    [8] = 0x880ULL | ~(VTD_HAW_MASK | VTD_SL_IGN_COM),
+};
+
+static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
+{
+    if (slpte & VTD_SL_PT_PAGE_SIZE_MASK) {
+        /* Maybe large page */
+        return slpte & vtd_paging_entry_rsvd_field[level + 4];
+    } else {
+        return slpte & vtd_paging_entry_rsvd_field[level];
+    }
+}
+
+/* Given @gpa, fetch the relevant slpte into @slptep. @slpte_level is set to
+ * the last level of the walk, which can be used to decide the size of a
+ * large page.
+ */
+static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t gpa, bool is_write,
+                            uint64_t *slptep, uint32_t *slpte_level,
+                            bool *reads, bool *writes)
+{
+    dma_addr_t addr = vtd_get_slpt_base_from_context(ce);
+    uint32_t level = vtd_get_level_from_context_entry(ce);
+    uint32_t offset;
+    uint64_t slpte;
+    uint32_t ce_agaw = vtd_get_agaw_from_context_entry(ce);
+    uint64_t access_right_check;
+
+    /* Check if @gpa is above 2^X-1, where X is the minimum of MGAW in CAP_REG
+     * and AW in context-entry.
+     */
+    if (gpa & ~((1ULL << MIN(ce_agaw, VTD_MGAW)) - 1)) {
+        VTD_DPRINTF(GENERAL, "error: gpa 0x%"PRIx64 " exceeds limits", gpa);
+        return -VTD_FR_ADDR_BEYOND_MGAW;
+    }
+
+    /* FIXME: what is the Atomics request here? */
+    access_right_check = is_write ? VTD_SL_W : VTD_SL_R;
+
+    while (true) {
+        offset = vtd_gpa_level_offset(gpa, level);
+        slpte = vtd_get_slpte(addr, offset);
+
+        if (slpte == (uint64_t)-1) {
+            VTD_DPRINTF(GENERAL, "error: failed to access second-level paging "
+                        "entry at level %"PRIu32 " for gpa 0x%"PRIx64,
+                        level, gpa);
+            if (level == vtd_get_level_from_context_entry(ce)) {
+                /* Invalid programming of context-entry */
+                return -VTD_FR_CONTEXT_ENTRY_INV;
+            } else {
+                return -VTD_FR_PAGING_ENTRY_INV;
+            }
+        }
+        *reads = (*reads) && (slpte & VTD_SL_R);
+        *writes = (*writes) && (slpte & VTD_SL_W);
+        if (!(slpte & access_right_check)) {
+            VTD_DPRINTF(GENERAL, "error: lack of %s permission for "
+                        "gpa 0x%"PRIx64 " slpte 0x%"PRIx64,
+                        (is_write ? "write" : "read"), gpa, slpte);
+            return is_write ? -VTD_FR_WRITE : -VTD_FR_READ;
+        }
+        if (vtd_slpte_nonzero_rsvd(slpte, level)) {
+            VTD_DPRINTF(GENERAL, "error: non-zero reserved field in second "
+                        "level paging entry level %"PRIu32 " slpte 0x%"PRIx64,
+                        level, slpte);
+            return -VTD_FR_PAGING_ENTRY_RSVD;
+        }
+
+        if (vtd_is_last_slpte(slpte, level)) {
+            *slptep = slpte;
+            *slpte_level = level;
+            return 0;
+        }
+        addr = vtd_get_slpte_addr(slpte);
+        level--;
+    }
+}
+
+/* Map a device to its corresponding domain (context-entry) */
+static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
+                                    uint8_t devfn, VTDContextEntry *ce)
+{
+    VTDRootEntry re;
+    int ret_fr;
+
+    ret_fr = vtd_get_root_entry(s, bus_num, &re);
+    if (ret_fr) {
+        return ret_fr;
+    }
+
+    if (!vtd_root_entry_present(&re)) {
+        VTD_DPRINTF(GENERAL, "error: root-entry #%"PRIu8 " is not present",
+                    bus_num);
+        return -VTD_FR_ROOT_ENTRY_P;
+    } else if (re.rsvd || (re.val & VTD_ROOT_ENTRY_RSVD)) {
+        VTD_DPRINTF(GENERAL, "error: non-zero reserved field in root-entry "
+                    "hi 0x%"PRIx64 " lo 0x%"PRIx64, re.rsvd, re.val);
+        return -VTD_FR_ROOT_ENTRY_RSVD;
+    }
+
+    ret_fr = vtd_get_context_entry_from_root(&re, devfn, ce);
+    if (ret_fr) {
+        return ret_fr;
+    }
+
+    if (!vtd_context_entry_present(ce)) {
+        VTD_DPRINTF(GENERAL,
+                    "error: context-entry #%"PRIu8 " (bus #%"PRIu8 ") "
+                    "is not present", devfn, bus_num);
+        return -VTD_FR_CONTEXT_ENTRY_P;
+    } else if ((ce->hi & VTD_CONTEXT_ENTRY_RSVD_HI) ||
+               (ce->lo & VTD_CONTEXT_ENTRY_RSVD_LO)) {
+        VTD_DPRINTF(GENERAL,
+                    "error: non-zero reserved field in context-entry "
+                    "hi 0x%"PRIx64 " lo 0x%"PRIx64, ce->hi, ce->lo);
+        return -VTD_FR_CONTEXT_ENTRY_RSVD;
+    }
+    /* Check if the programming of context-entry is valid */
+    if (!vtd_is_level_supported(s, vtd_get_level_from_context_entry(ce))) {
+        VTD_DPRINTF(GENERAL, "error: unsupported Address Width value in "
+                    "context-entry hi 0x%"PRIx64 " lo 0x%"PRIx64,
+                    ce->hi, ce->lo);
+        return -VTD_FR_CONTEXT_ENTRY_INV;
+    } else if (ce->lo & VTD_CONTEXT_ENTRY_TT) {
+        VTD_DPRINTF(GENERAL, "error: unsupported Translation Type in "
+                    "context-entry hi 0x%"PRIx64 " lo 0x%"PRIx64,
+                    ce->hi, ce->lo);
+        return -VTD_FR_CONTEXT_ENTRY_INV;
+    }
+    return 0;
+}
+
+static inline uint16_t vtd_make_source_id(uint8_t bus_num, uint8_t devfn)
+{
+    return ((bus_num & 0xffUL) << 8) | (devfn & 0xffUL);
+}
+
+static const bool vtd_qualified_faults[] = {
+    [VTD_FR_RESERVED] = false,
+    [VTD_FR_ROOT_ENTRY_P] = false,
+    [VTD_FR_CONTEXT_ENTRY_P] = true,
+    [VTD_FR_CONTEXT_ENTRY_INV] = true,
+    [VTD_FR_ADDR_BEYOND_MGAW] = true,
+    [VTD_FR_WRITE] = true,
+    [VTD_FR_READ] = true,
+    [VTD_FR_PAGING_ENTRY_INV] = true,
+    [VTD_FR_ROOT_TABLE_INV] = false,
+    [VTD_FR_CONTEXT_TABLE_INV] = false,
+    [VTD_FR_ROOT_ENTRY_RSVD] = false,
+    [VTD_FR_PAGING_ENTRY_RSVD] = true,
+    [VTD_FR_CONTEXT_ENTRY_TT] = true,
+    [VTD_FR_RESERVED_ERR] = false,
+    [VTD_FR_MAX] = false,
+};
+
+/* Check whether a fault condition is "qualified": a qualified fault is
+ * reported to software only if the FPD field in the context-entry used to
+ * process the faulting request is Clear.
+ */
+static inline bool vtd_is_qualified_fault(VTDFaultReason fault)
+{
+    return vtd_qualified_faults[fault];
+}
+
+static inline bool vtd_is_interrupt_addr(hwaddr addr)
+{
+    return VTD_INTERRUPT_ADDR_FIRST <= addr && addr <= VTD_INTERRUPT_ADDR_LAST;
+}
+
+/* Map the device to its context-entry, then walk the paging structures to
+ * perform the IOMMU translation.
+ * @bus_num: The bus number
+ * @devfn: The combined device and function number
+ * @is_write: Whether the access is a write operation
+ * @entry: IOMMUTLBEntry containing the address to be translated and the result
+ */
+static void vtd_do_iommu_translate(IntelIOMMUState *s, uint8_t bus_num,
+                                   uint8_t devfn, hwaddr addr, bool is_write,
+                                   IOMMUTLBEntry *entry)
+{
+    VTDContextEntry ce;
+    uint64_t slpte;
+    uint32_t level;
+    uint16_t source_id = vtd_make_source_id(bus_num, devfn);
+    int ret_fr;
+    bool is_fpd_set = false;
+    bool reads = true;
+    bool writes = true;
+
+    /* Check if the request is in interrupt address range */
+    if (vtd_is_interrupt_addr(addr)) {
+        if (is_write) {
+            /* FIXME: since we don't know the length of the access here, we
+             * treat non-DWORD length write requests without PASID as
+             * interrupt requests, too. Without interrupt remapping support,
+             * we just use 1:1 mapping.
+             */
+            VTD_DPRINTF(MMU, "write request to interrupt address "
+                        "gpa 0x%"PRIx64, addr);
+            entry->iova = addr & VTD_PAGE_MASK_4K;
+            entry->translated_addr = addr & VTD_PAGE_MASK_4K;
+            entry->addr_mask = ~VTD_PAGE_MASK_4K;
+            entry->perm = IOMMU_WO;
+            return;
+        } else {
+            VTD_DPRINTF(GENERAL, "error: read request from interrupt address "
+                        "gpa 0x%"PRIx64, addr);
+            vtd_report_dmar_fault(s, source_id, addr, VTD_FR_READ, is_write);
+            return;
+        }
+    }
+
+    ret_fr = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
+    is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
+    if (ret_fr) {
+        ret_fr = -ret_fr;
+        if (is_fpd_set && vtd_is_qualified_fault(ret_fr)) {
+            VTD_DPRINTF(FLOG, "fault processing is disabled for DMA requests "
+                        "through this context-entry (with FPD Set)");
+        } else {
+            vtd_report_dmar_fault(s, source_id, addr, ret_fr, is_write);
+        }
+        return;
+    }
+
+    ret_fr = vtd_gpa_to_slpte(&ce, addr, is_write, &slpte, &level,
+                              &reads, &writes);
+    if (ret_fr) {
+        ret_fr = -ret_fr;
+        if (is_fpd_set && vtd_is_qualified_fault(ret_fr)) {
+            VTD_DPRINTF(FLOG, "fault processing is disabled for DMA requests "
+                        "through this context-entry (with FPD Set)");
+        } else {
+            vtd_report_dmar_fault(s, source_id, addr, ret_fr, is_write);
+        }
+        return;
+    }
+
+    entry->iova = addr & VTD_PAGE_MASK_4K;
+    entry->translated_addr = vtd_get_slpte_addr(slpte) & VTD_PAGE_MASK_4K;
+    entry->addr_mask = ~VTD_PAGE_MASK_4K;
+    entry->perm = (writes ? IOMMU_WO : 0) | (reads ? IOMMU_RO : 0);
+}
+
+static void vtd_root_table_setup(IntelIOMMUState *s)
+{
+    s->root = vtd_get_quad_raw(s, DMAR_RTADDR_REG);
+    s->root_extended = s->root & VTD_RTADDR_RTT;
+    s->root &= VTD_RTADDR_ADDR_MASK;
+
+    VTD_DPRINTF(CSR, "root_table addr 0x%"PRIx64 " %s", s->root,
+                (s->root_extended ? "(extended)" : ""));
+}
+
+/* Context-cache invalidation
+ * Returns the Context Actual Invalidation Granularity.
+ * @val: the content of the CCMD_REG
+ */
+static uint64_t vtd_context_cache_invalidate(IntelIOMMUState *s, uint64_t val)
+{
+    uint64_t caig;
+    uint64_t type = val & VTD_CCMD_CIRG_MASK;
+
+    switch (type) {
+    case VTD_CCMD_GLOBAL_INVL:
+        VTD_DPRINTF(INV, "Global invalidation request");
+        caig = VTD_CCMD_GLOBAL_INVL_A;
+        break;
+
+    case VTD_CCMD_DOMAIN_INVL:
+        VTD_DPRINTF(INV, "Domain-selective invalidation request");
+        caig = VTD_CCMD_DOMAIN_INVL_A;
+        break;
+
+    case VTD_CCMD_DEVICE_INVL:
+        VTD_DPRINTF(INV, "Device-selective invalidation request");
+        caig = VTD_CCMD_DEVICE_INVL_A;
+        break;
+
+    default:
+        VTD_DPRINTF(GENERAL,
+                    "error: wrong context-cache invalidation granularity");
+        caig = 0;
+    }
+    return caig;
+}
+
+/* Flush IOTLB
+ * Returns the IOTLB Actual Invalidation Granularity.
+ * @val: the content of the IOTLB_REG
+ */
+static uint64_t vtd_iotlb_flush(IntelIOMMUState *s, uint64_t val)
+{
+    uint64_t iaig;
+    uint64_t type = val & VTD_TLB_FLUSH_GRANU_MASK;
+
+    switch (type) {
+    case VTD_TLB_GLOBAL_FLUSH:
+        VTD_DPRINTF(INV, "Global IOTLB flush");
+        iaig = VTD_TLB_GLOBAL_FLUSH_A;
+        break;
+
+    case VTD_TLB_DSI_FLUSH:
+        VTD_DPRINTF(INV, "Domain-selective IOTLB flush");
+        iaig = VTD_TLB_DSI_FLUSH_A;
+        break;
+
+    case VTD_TLB_PSI_FLUSH:
+        VTD_DPRINTF(INV, "Page-selective-within-domain IOTLB flush");
+        iaig = VTD_TLB_PSI_FLUSH_A;
+        break;
+
+    default:
+        VTD_DPRINTF(GENERAL, "error: wrong iotlb flush granularity");
+        iaig = 0;
+    }
+    return iaig;
+}
+
+/* Set Root Table Pointer */
+static void vtd_handle_gcmd_srtp(IntelIOMMUState *s)
+{
+    VTD_DPRINTF(CSR, "set Root Table Pointer");
+
+    vtd_root_table_setup(s);
+    /* Ok - report back to driver */
+    vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_RTPS);
+}
+
+/* Handle Translation Enable/Disable */
+static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en)
+{
+    VTD_DPRINTF(CSR, "Translation Enable %s", (en ? "on" : "off"));
+
+    if (en) {
+        s->dmar_enabled = true;
+        /* Ok - report back to driver */
+        vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_TES);
+    } else {
+        s->dmar_enabled = false;
+
+        /* Clear the index of Fault Recording Register */
+        s->next_frcd_reg = 0;
+        /* Ok - report back to driver */
+        vtd_set_clear_mask_long(s, DMAR_GSTS_REG, VTD_GSTS_TES, 0);
+    }
+}
+
+/* Handle write to Global Command Register */
+static void vtd_handle_gcmd_write(IntelIOMMUState *s)
+{
+    uint32_t status = vtd_get_long_raw(s, DMAR_GSTS_REG);
+    uint32_t val = vtd_get_long_raw(s, DMAR_GCMD_REG);
+    uint32_t changed = status ^ val;
+
+    VTD_DPRINTF(CSR, "value 0x%"PRIx32 " status 0x%"PRIx32, val, status);
+    if (changed & VTD_GCMD_TE) {
+        /* Translation enable/disable */
+        vtd_handle_gcmd_te(s, val & VTD_GCMD_TE);
+    }
+    if (val & VTD_GCMD_SRTP) {
+        /* Set/update the root-table pointer */
+        vtd_handle_gcmd_srtp(s);
+    }
+}
+
+/* Handle write to Context Command Register */
+static void vtd_handle_ccmd_write(IntelIOMMUState *s)
+{
+    uint64_t ret;
+    uint64_t val = vtd_get_quad_raw(s, DMAR_CCMD_REG);
+
+    /* Context-cache invalidation request */
+    if (val & VTD_CCMD_ICC) {
+        ret = vtd_context_cache_invalidate(s, val);
+        /* Invalidation completed: clear ICC and report the
+         * actual invalidation granularity
+         */
+        vtd_set_clear_mask_quad(s, DMAR_CCMD_REG, VTD_CCMD_ICC, 0ULL);
+        ret = vtd_set_clear_mask_quad(s, DMAR_CCMD_REG, VTD_CCMD_CAIG_MASK,
+                                      ret);
+        VTD_DPRINTF(INV, "CCMD_REG write-back val: 0x%"PRIx64, ret);
+    }
+}
+
+/* Handle write to IOTLB Invalidation Register */
+static void vtd_handle_iotlb_write(IntelIOMMUState *s)
+{
+    uint64_t ret;
+    uint64_t val = vtd_get_quad_raw(s, DMAR_IOTLB_REG);
+
+    /* IOTLB invalidation request */
+    if (val & VTD_TLB_IVT) {
+        ret = vtd_iotlb_flush(s, val);
+        /* Invalidation completed: clear IVT and report the
+         * actual invalidation granularity
+         */
+        vtd_set_clear_mask_quad(s, DMAR_IOTLB_REG, VTD_TLB_IVT, 0ULL);
+        ret = vtd_set_clear_mask_quad(s, DMAR_IOTLB_REG,
+                                      VTD_TLB_FLUSH_GRANU_MASK_A, ret);
+        VTD_DPRINTF(INV, "IOTLB_REG write-back val: 0x%"PRIx64, ret);
+    }
+}
+
+static void vtd_handle_fsts_write(IntelIOMMUState *s)
+{
+    uint32_t fsts_reg = vtd_get_long_raw(s, DMAR_FSTS_REG);
+    uint32_t fectl_reg = vtd_get_long_raw(s, DMAR_FECTL_REG);
+    uint32_t status_fields = VTD_FSTS_PFO | VTD_FSTS_PPF | VTD_FSTS_IQE;
+
+    if ((fectl_reg & VTD_FECTL_IP) && !(fsts_reg & status_fields)) {
+        vtd_set_clear_mask_long(s, DMAR_FECTL_REG, VTD_FECTL_IP, 0);
+        VTD_DPRINTF(FLOG, "all pending interrupt conditions serviced, clear "
+                    "IP field of FECTL_REG");
+    }
+}
+
+static void vtd_handle_fectl_write(IntelIOMMUState *s)
+{
+    uint32_t fectl_reg;
+    /* FIXME: when software clears the IM field, check the IP field. But do we
+     * need to compare the old value and the new value to conclude that
+     * software clears the IM field? Or just check if the IM field is zero?
+     */
+    fectl_reg = vtd_get_long_raw(s, DMAR_FECTL_REG);
+    if ((fectl_reg & VTD_FECTL_IP) && !(fectl_reg & VTD_FECTL_IM)) {
+        vtd_generate_interrupt(s, DMAR_FEADDR_REG, DMAR_FEDATA_REG);
+        vtd_set_clear_mask_long(s, DMAR_FECTL_REG, VTD_FECTL_IP, 0);
+        VTD_DPRINTF(FLOG, "IM field is cleared, generate "
+                    "fault event interrupt");
+    }
+}
+
+static uint64_t vtd_mem_read(void *opaque, hwaddr addr, unsigned size)
+{
+    IntelIOMMUState *s = opaque;
+    uint64_t val;
+
+    if (addr + size > DMAR_REG_SIZE) {
+        VTD_DPRINTF(GENERAL, "error: addr outside region: max 0x%"PRIx64
+                    ", got 0x%"PRIx64 " %d",
+                    (uint64_t)DMAR_REG_SIZE, addr, size);
+        return (uint64_t)-1;
+    }
+
+    switch (addr) {
+    /* Root Table Address Register, 64-bit */
+    case DMAR_RTADDR_REG:
+        if (size == 4) {
+            val = s->root & ((1ULL << 32) - 1);
+        } else {
+            val = s->root;
+        }
+        break;
+
+    case DMAR_RTADDR_REG_HI:
+        assert(size == 4);
+        val = s->root >> 32;
+        break;
+
+    default:
+        if (size == 4) {
+            val = vtd_get_long(s, addr);
+        } else {
+            val = vtd_get_quad(s, addr);
+        }
+    }
+    VTD_DPRINTF(CSR, "addr 0x%"PRIx64 " size %d val 0x%"PRIx64,
+                addr, size, val);
+    return val;
+}
+
+static void vtd_mem_write(void *opaque, hwaddr addr,
+                          uint64_t val, unsigned size)
+{
+    IntelIOMMUState *s = opaque;
+
+    if (addr + size > DMAR_REG_SIZE) {
+        VTD_DPRINTF(GENERAL, "error: addr outside region: max 0x%"PRIx64
+                    ", got 0x%"PRIx64 " %d",
+                    (uint64_t)DMAR_REG_SIZE, addr, size);
+        return;
+    }
+
+    switch (addr) {
+    /* Global Command Register, 32-bit */
+    case DMAR_GCMD_REG:
+        VTD_DPRINTF(CSR, "DMAR_GCMD_REG write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        vtd_set_long(s, addr, val);
+        vtd_handle_gcmd_write(s);
+        break;
+
+    /* Context Command Register, 64-bit */
+    case DMAR_CCMD_REG:
+        VTD_DPRINTF(CSR, "DMAR_CCMD_REG write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        if (size == 4) {
+            vtd_set_long(s, addr, val);
+        } else {
+            vtd_set_quad(s, addr, val);
+            vtd_handle_ccmd_write(s);
+        }
+        break;
+
+    case DMAR_CCMD_REG_HI:
+        VTD_DPRINTF(CSR, "DMAR_CCMD_REG_HI write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        assert(size == 4);
+        vtd_set_long(s, addr, val);
+        vtd_handle_ccmd_write(s);
+        break;
+
+    /* IOTLB Invalidation Register, 64-bit */
+    case DMAR_IOTLB_REG:
+        VTD_DPRINTF(INV, "DMAR_IOTLB_REG write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        if (size == 4) {
+            vtd_set_long(s, addr, val);
+        } else {
+            vtd_set_quad(s, addr, val);
+            vtd_handle_iotlb_write(s);
+        }
+        break;
+
+    case DMAR_IOTLB_REG_HI:
+        VTD_DPRINTF(INV, "DMAR_IOTLB_REG_HI write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        assert(size == 4);
+        vtd_set_long(s, addr, val);
+        vtd_handle_iotlb_write(s);
+        break;
+
+    /* Fault Status Register, 32-bit */
+    case DMAR_FSTS_REG:
+        VTD_DPRINTF(FLOG, "DMAR_FSTS_REG write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        assert(size == 4);
+        vtd_set_long(s, addr, val);
+        vtd_handle_fsts_write(s);
+        break;
+
+    /* Fault Event Control Register, 32-bit */
+    case DMAR_FECTL_REG:
+        VTD_DPRINTF(FLOG, "DMAR_FECTL_REG write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        assert(size == 4);
+        vtd_set_long(s, addr, val);
+        vtd_handle_fectl_write(s);
+        break;
+
+    /* Fault Event Data Register, 32-bit */
+    case DMAR_FEDATA_REG:
+        VTD_DPRINTF(FLOG, "DMAR_FEDATA_REG write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        assert(size == 4);
+        vtd_set_long(s, addr, val);
+        break;
+
+    /* Fault Event Address Register, 32-bit */
+    case DMAR_FEADDR_REG:
+        VTD_DPRINTF(FLOG, "DMAR_FEADDR_REG write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        assert(size == 4);
+        vtd_set_long(s, addr, val);
+        break;
+
+    /* Fault Event Upper Address Register, 32-bit */
+    case DMAR_FEUADDR_REG:
+        VTD_DPRINTF(FLOG, "DMAR_FEUADDR_REG write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        assert(size == 4);
+        vtd_set_long(s, addr, val);
+        break;
+
+    /* Protected Memory Enable Register, 32-bit */
+    case DMAR_PMEN_REG:
+        VTD_DPRINTF(CSR, "DMAR_PMEN_REG write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        assert(size == 4);
+        vtd_set_long(s, addr, val);
+        break;
+
+    /* Root Table Address Register, 64-bit */
+    case DMAR_RTADDR_REG:
+        VTD_DPRINTF(CSR, "DMAR_RTADDR_REG write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        if (size == 4) {
+            vtd_set_long(s, addr, val);
+        } else {
+            vtd_set_quad(s, addr, val);
+        }
+        break;
+
+    case DMAR_RTADDR_REG_HI:
+        VTD_DPRINTF(CSR, "DMAR_RTADDR_REG_HI write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        assert(size == 4);
+        vtd_set_long(s, addr, val);
+        break;
+
+    /* Fault Recording Registers, 128-bit */
+    case DMAR_FRCD_REG_0_0:
+        VTD_DPRINTF(FLOG, "DMAR_FRCD_REG_0_0 write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        if (size == 4) {
+            vtd_set_long(s, addr, val);
+        } else {
+            vtd_set_quad(s, addr, val);
+        }
+        break;
+
+    case DMAR_FRCD_REG_0_1:
+        VTD_DPRINTF(FLOG, "DMAR_FRCD_REG_0_1 write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        assert(size == 4);
+        vtd_set_long(s, addr, val);
+        break;
+
+    case DMAR_FRCD_REG_0_2:
+        VTD_DPRINTF(FLOG, "DMAR_FRCD_REG_0_2 write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        if (size == 4) {
+            vtd_set_long(s, addr, val);
+        } else {
+            vtd_set_quad(s, addr, val);
+            /* May clear bit 127 (Fault), update PPF */
+            vtd_update_fsts_ppf(s);
+        }
+        break;
+
+    case DMAR_FRCD_REG_0_3:
+        VTD_DPRINTF(FLOG, "DMAR_FRCD_REG_0_3 write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        assert(size == 4);
+        vtd_set_long(s, addr, val);
+        /* May clear bit 127 (Fault), update PPF */
+        vtd_update_fsts_ppf(s);
+        break;
+
+    default:
+        VTD_DPRINTF(GENERAL, "error: unhandled reg write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        if (size == 4) {
+            vtd_set_long(s, addr, val);
+        } else {
+            vtd_set_quad(s, addr, val);
+        }
+    }
+}
+
+static IOMMUTLBEntry vtd_iommu_translate(MemoryRegion *iommu, hwaddr addr,
+                                         bool is_write)
+{
+    VTDAddressSpace *vtd_as = container_of(iommu, VTDAddressSpace, iommu);
+    IntelIOMMUState *s = vtd_as->iommu_state;
+    uint8_t bus_num = vtd_as->bus_num;
+    uint8_t devfn = vtd_as->devfn;
+    IOMMUTLBEntry ret = {
+        .target_as = &address_space_memory,
+        .iova = addr,
+        .translated_addr = 0,
+        .addr_mask = ~(hwaddr)0,
+        .perm = IOMMU_NONE,
+    };
+
+    if (!s->dmar_enabled) {
+        /* DMAR disabled, passthrough, use 4K page */
+        ret.iova = addr & VTD_PAGE_MASK_4K;
+        ret.translated_addr = addr & VTD_PAGE_MASK_4K;
+        ret.addr_mask = ~VTD_PAGE_MASK_4K;
+        ret.perm = IOMMU_RW;
+        return ret;
+    }
+
+    vtd_do_iommu_translate(s, bus_num, devfn, addr, is_write, &ret);
+
+    VTD_DPRINTF(MMU,
+                "bus %"PRIu8 " slot %"PRIu8 " func %"PRIu8 " devfn %"PRIu8
+                " gpa 0x%"PRIx64 " hpa 0x%"PRIx64, bus_num,
+                VTD_PCI_SLOT(devfn), VTD_PCI_FUNC(devfn), devfn, addr,
+                ret.translated_addr);
+    return ret;
+}
+
+static const VMStateDescription vtd_vmstate = {
+    .name = "iommu-intel",
+    .unmigratable = 1,
+};
+
+static const MemoryRegionOps vtd_mem_ops = {
+    .read = vtd_mem_read,
+    .write = vtd_mem_write,
+    .endianness = DEVICE_LITTLE_ENDIAN,
+    .impl = {
+        .min_access_size = 4,
+        .max_access_size = 8,
+    },
+    .valid = {
+        .min_access_size = 4,
+        .max_access_size = 8,
+    },
+};
+
+static Property vtd_properties[] = {
+    DEFINE_PROP_UINT32("version", IntelIOMMUState, version, 0),
+    DEFINE_PROP_END_OF_LIST(),
+};
+
+/* Do the initialization. It will also be called on reset, so take care
+ * when adding new initialization code.
+ */
+static void vtd_init(IntelIOMMUState *s)
+{
+    memset(s->csr, 0, DMAR_REG_SIZE);
+    memset(s->wmask, 0, DMAR_REG_SIZE);
+    memset(s->w1cmask, 0, DMAR_REG_SIZE);
+    memset(s->womask, 0, DMAR_REG_SIZE);
+
+    s->iommu_ops.translate = vtd_iommu_translate;
+    s->root = 0;
+    s->root_extended = false;
+    s->dmar_enabled = false;
+    s->iq_head = 0;
+    s->iq_tail = 0;
+    s->iq = 0;
+    s->iq_size = 0;
+    s->qi_enabled = false;
+    s->iq_last_desc_type = VTD_INV_DESC_NONE;
+    s->next_frcd_reg = 0;
+    s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND | VTD_CAP_MGAW |
+             VTD_CAP_SAGAW;
+    s->ecap = VTD_ECAP_IRO;
+
+    /* Define registers with default values and bit semantics */
+    vtd_define_long(s, DMAR_VER_REG, 0x10UL, 0, 0);
+    vtd_define_quad(s, DMAR_CAP_REG, s->cap, 0, 0);
+    vtd_define_quad(s, DMAR_ECAP_REG, s->ecap, 0, 0);
+    vtd_define_long(s, DMAR_GCMD_REG, 0, 0xff800000UL, 0);
+    vtd_define_long_wo(s, DMAR_GCMD_REG, 0xff800000UL);
+    vtd_define_long(s, DMAR_GSTS_REG, 0, 0, 0);
+    vtd_define_quad(s, DMAR_RTADDR_REG, 0, 0xfffffffffffff000ULL, 0);
+    vtd_define_quad(s, DMAR_CCMD_REG, 0, 0xe0000003ffffffffULL, 0);
+    vtd_define_quad_wo(s, DMAR_CCMD_REG, 0x3ffff0000ULL);
+
+    /* Advanced Fault Logging not supported */
+    vtd_define_long(s, DMAR_FSTS_REG, 0, 0, 0x11UL);
+    vtd_define_long(s, DMAR_FECTL_REG, 0x80000000UL, 0x80000000UL, 0);
+    vtd_define_long(s, DMAR_FEDATA_REG, 0, 0x0000ffffUL, 0);
+    vtd_define_long(s, DMAR_FEADDR_REG, 0, 0xfffffffcUL, 0);
+
+    /* Treated as RsvdZ when EIM in ECAP_REG is not supported
+     * vtd_define_long(s, DMAR_FEUADDR_REG, 0, 0xffffffffUL, 0);
+     */
+    vtd_define_long(s, DMAR_FEUADDR_REG, 0, 0, 0);
+
+    /* Treated as RO for implementations that report the PLMR and PHMR
+     * fields as Clear in CAP_REG.
+     * vtd_define_long(s, DMAR_PMEN_REG, 0, 0x80000000UL, 0);
+     */
+    vtd_define_long(s, DMAR_PMEN_REG, 0, 0, 0);
+
+    /* IOTLB registers */
+    vtd_define_quad(s, DMAR_IOTLB_REG, 0, 0xb003ffff00000000ULL, 0);
+    vtd_define_quad(s, DMAR_IVA_REG, 0, 0xfffffffffffff07fULL, 0);
+    vtd_define_quad_wo(s, DMAR_IVA_REG, 0xfffffffffffff07fULL);
+
+    /* Fault Recording Registers, 128-bit */
+    vtd_define_quad(s, DMAR_FRCD_REG_0_0, 0, 0, 0);
+    vtd_define_quad(s, DMAR_FRCD_REG_0_2, 0, 0, 0x8000000000000000ULL);
+}
+
+/* Do not reset address_spaces on reset, because devices will keep using
+ * the address space they were given at first (they won't ask the bus again).
+ */
+static void vtd_reset(DeviceState *dev)
+{
+    IntelIOMMUState *s = INTEL_IOMMU_DEVICE(dev);
+
+    VTD_DPRINTF(GENERAL, "");
+    vtd_init(s);
+}
+
+static void vtd_realize(DeviceState *dev, Error **errp)
+{
+    IntelIOMMUState *s = INTEL_IOMMU_DEVICE(dev);
+
+    VTD_DPRINTF(GENERAL, "");
+    memset(s->address_spaces, 0, sizeof(s->address_spaces));
+    memory_region_init_io(&s->csrmem, OBJECT(s), &vtd_mem_ops, s,
+                          "intel_iommu", DMAR_REG_SIZE);
+    sysbus_init_mmio(SYS_BUS_DEVICE(s), &s->csrmem);
+    vtd_init(s);
+}
+
+static void vtd_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+
+    dc->reset = vtd_reset;
+    dc->realize = vtd_realize;
+    dc->vmsd = &vtd_vmstate;
+    dc->props = vtd_properties;
+}
+
+static const TypeInfo vtd_info = {
+    .name          = TYPE_INTEL_IOMMU_DEVICE,
+    .parent        = TYPE_SYS_BUS_DEVICE,
+    .instance_size = sizeof(IntelIOMMUState),
+    .class_init    = vtd_class_init,
+};
+
+static void vtd_register_types(void)
+{
+    VTD_DPRINTF(GENERAL, "");
+    type_register_static(&vtd_info);
+}
+
+type_init(vtd_register_types)
diff --git a/hw/i386/Makefile.objs b/hw/i386/Makefile.objs
index 3688cf8..9d419ad 100644
--- a/hw/i386/Makefile.objs
+++ b/hw/i386/Makefile.objs
@@ -2,6 +2,7 @@ obj-$(CONFIG_KVM) += kvm/
 obj-y += multiboot.o smbios.o
 obj-y += pc.o pc_piix.o pc_q35.o
 obj-y += pc_sysfw.o
+obj-y += intel_iommu.o
 obj-$(CONFIG_XEN) += ../xenpv/ xen/
 
 obj-y += kvmvapic.o
-- 
MST

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [Qemu-devel] [PULL 03/13] intel-iommu: add DMAR table to ACPI tables
  2014-09-02 15:07 [Qemu-devel] [PULL 00/13] pci, pc fixes, features Michael S. Tsirkin
  2014-09-02 15:07 ` [Qemu-devel] [PULL 01/13] iommu: add is_write as a parameter to the translate function of MemoryRegionIOMMUOps Michael S. Tsirkin
  2014-09-02 15:07 ` [Qemu-devel] [PULL 02/13] intel-iommu: introduce Intel IOMMU (VT-d) emulation Michael S. Tsirkin
@ 2014-09-02 15:07 ` Michael S. Tsirkin
  2014-09-02 15:07 ` [Qemu-devel] [PULL 04/13] intel-iommu: add Intel IOMMU emulation to q35 and add a machine option "iommu" as a switch Michael S. Tsirkin
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 19+ messages in thread
From: Michael S. Tsirkin @ 2014-09-02 15:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: Peter Maydell, Le Tan, Anthony Liguori

From: Le Tan <tamlokveer@gmail.com>

Expose the Intel IOMMU to the BIOS. If an object of TYPE_INTEL_IOMMU_DEVICE
exists, add a DMAR table to the ACPI RSDT. For now the DMAR table indicates
that there is only one hardware unit, without INTR_REMAP capability, on the
platform.

Signed-off-by: Le Tan <tamlokveer@gmail.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 hw/i386/acpi-defs.h  | 40 ++++++++++++++++++++++++++++++++++++++++
 hw/i386/acpi-build.c | 39 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 79 insertions(+)

diff --git a/hw/i386/acpi-defs.h b/hw/i386/acpi-defs.h
index 1bc974a..c4468f8 100644
--- a/hw/i386/acpi-defs.h
+++ b/hw/i386/acpi-defs.h
@@ -325,4 +325,44 @@ struct Acpi20Tcpa {
 } QEMU_PACKED;
 typedef struct Acpi20Tcpa Acpi20Tcpa;
 
+/* DMAR - DMA Remapping table r2.2 */
+struct AcpiTableDmar {
+    ACPI_TABLE_HEADER_DEF
+    uint8_t host_address_width; /* Maximum DMA physical addressability */
+    uint8_t flags;
+    uint8_t reserved[10];
+} QEMU_PACKED;
+typedef struct AcpiTableDmar AcpiTableDmar;
+
+/* Masks for Flags field above */
+#define ACPI_DMAR_INTR_REMAP        1
+#define ACPI_DMAR_X2APIC_OPT_OUT    (1 << 1)
+
+/* Values for sub-structure type for DMAR */
+enum {
+    ACPI_DMAR_TYPE_HARDWARE_UNIT = 0,       /* DRHD */
+    ACPI_DMAR_TYPE_RESERVED_MEMORY = 1,     /* RMRR */
+    ACPI_DMAR_TYPE_ATSR = 2,                /* ATSR */
+    ACPI_DMAR_TYPE_HARDWARE_AFFINITY = 3,   /* RHSR */
+    ACPI_DMAR_TYPE_ANDD = 4,                /* ANDD */
+    ACPI_DMAR_TYPE_RESERVED = 5             /* Reserved for future use */
+};
+
+/*
+ * Sub-structures for DMAR
+ */
+/* Type 0: Hardware Unit Definition */
+struct AcpiDmarHardwareUnit {
+    uint16_t type;
+    uint16_t length;
+    uint8_t flags;
+    uint8_t reserved;
+    uint16_t pci_segment;   /* The PCI Segment associated with this unit */
+    uint64_t address;   /* Base address of remapping hardware register-set */
+} QEMU_PACKED;
+typedef struct AcpiDmarHardwareUnit AcpiDmarHardwareUnit;
+
+/* Masks for Flags field above */
+#define ACPI_DMAR_INCLUDE_PCI_ALL   1
+
 #endif
diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index 85e5834..3e7fba3 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -49,6 +49,7 @@
 #include "hw/i386/ich9.h"
 #include "hw/pci/pci_bus.h"
 #include "hw/pci-host/q35.h"
+#include "hw/i386/intel_iommu.h"
 
 #include "hw/i386/q35-acpi-dsdt.hex"
 #include "hw/i386/acpi-dsdt.hex"
@@ -1388,6 +1389,30 @@ build_mcfg_q35(GArray *table_data, GArray *linker, AcpiMcfgInfo *info)
 }
 
 static void
+build_dmar_q35(GArray *table_data, GArray *linker)
+{
+    int dmar_start = table_data->len;
+
+    AcpiTableDmar *dmar;
+    AcpiDmarHardwareUnit *drhd;
+
+    dmar = acpi_data_push(table_data, sizeof(*dmar));
+    dmar->host_address_width = VTD_HOST_ADDRESS_WIDTH - 1;
+    dmar->flags = 0;    /* No intr_remap for now */
+
+    /* DMAR Remapping Hardware Unit Definition structure */
+    drhd = acpi_data_push(table_data, sizeof(*drhd));
+    drhd->type = cpu_to_le16(ACPI_DMAR_TYPE_HARDWARE_UNIT);
+    drhd->length = cpu_to_le16(sizeof(*drhd));   /* No device scope now */
+    drhd->flags = ACPI_DMAR_INCLUDE_PCI_ALL;
+    drhd->pci_segment = cpu_to_le16(0);
+    drhd->address = cpu_to_le64(Q35_HOST_BRIDGE_IOMMU_ADDR);
+
+    build_header(linker, table_data, (void *)(table_data->data + dmar_start),
+                 "DMAR", table_data->len - dmar_start, 1);
+}
+
+static void
 build_dsdt(GArray *table_data, GArray *linker, AcpiMiscInfo *misc)
 {
     AcpiTableHeader *dsdt;
@@ -1508,6 +1533,16 @@ static bool acpi_get_mcfg(AcpiMcfgInfo *mcfg)
     return true;
 }
 
+static bool acpi_has_iommu(void)
+{
+    bool ambiguous;
+    Object *intel_iommu;
+
+    intel_iommu = object_resolve_path_type("", TYPE_INTEL_IOMMU_DEVICE,
+                                           &ambiguous);
+    return intel_iommu && !ambiguous;
+}
+
 static
 void acpi_build(PcGuestInfo *guest_info, AcpiBuildTables *tables)
 {
@@ -1584,6 +1619,10 @@ void acpi_build(PcGuestInfo *guest_info, AcpiBuildTables *tables)
         acpi_add_table(table_offsets, tables->table_data);
         build_mcfg_q35(tables->table_data, tables->linker, &mcfg);
     }
+    if (acpi_has_iommu()) {
+        acpi_add_table(table_offsets, tables->table_data);
+        build_dmar_q35(tables->table_data, tables->linker);
+    }
 
     /* Add tables supplied by user (if any) */
     for (u = acpi_table_first(); u; u = acpi_table_next(u)) {
-- 
MST

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [Qemu-devel] [PULL 04/13] intel-iommu: add Intel IOMMU emulation to q35 and add a machine option "iommu" as a switch
  2014-09-02 15:07 [Qemu-devel] [PULL 00/13] pci, pc fixes, features Michael S. Tsirkin
                   ` (2 preceding siblings ...)
  2014-09-02 15:07 ` [Qemu-devel] [PULL 03/13] intel-iommu: add DMAR table to ACPI tables Michael S. Tsirkin
@ 2014-09-02 15:07 ` Michael S. Tsirkin
  2014-09-02 15:07 ` [Qemu-devel] [PULL 05/13] intel-iommu: fix coding style issues around in q35.c and machine.c Michael S. Tsirkin
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 19+ messages in thread
From: Michael S. Tsirkin @ 2014-09-02 15:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Maydell, Stefan Hajnoczi, Marcel Apfelbaum,
	Michael Tokarev, Michael Roth, Le Tan, Alexander Graf,
	Anthony Liguori, Paolo Bonzini, Igor Mammedov,
	Andreas Färber

From: Le Tan <tamlokveer@gmail.com>

Add Intel IOMMU emulation to q35 chipset and expose it to the guest.
1. Add a machine option. Users can use "-machine iommu=on|off" in the command
line to enable/disable Intel IOMMU. The default is off.
2. According to the machine option, q35 will initialize the Intel IOMMU and
use pci_setup_iommu() to set up q35_host_dma_iommu() as the IOMMU function for
the PCI bus.
3. q35_host_dma_iommu() will return a different address space according to the
bus_num and devfn of the device.

Signed-off-by: Le Tan <tamlokveer@gmail.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/hw/boards.h       |  1 +
 include/hw/pci-host/q35.h |  2 ++
 hw/core/machine.c         | 17 +++++++++++++++++
 hw/pci-host/q35.c         | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 vl.c                      |  4 ++++
 qemu-options.hx           |  5 ++++-
 6 files changed, 74 insertions(+), 1 deletion(-)

diff --git a/include/hw/boards.h b/include/hw/boards.h
index 605a970..dfb6718 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -123,6 +123,7 @@ struct MachineState {
     bool mem_merge;
     bool usb;
     char *firmware;
+    bool iommu;
 
     ram_addr_t ram_size;
     ram_addr_t maxram_size;
diff --git a/include/hw/pci-host/q35.h b/include/hw/pci-host/q35.h
index d9ee978..025d6e6 100644
--- a/include/hw/pci-host/q35.h
+++ b/include/hw/pci-host/q35.h
@@ -33,6 +33,7 @@
 #include "hw/acpi/acpi.h"
 #include "hw/acpi/ich9.h"
 #include "hw/pci-host/pam.h"
+#include "hw/i386/intel_iommu.h"
 
 #define TYPE_Q35_HOST_DEVICE "q35-pcihost"
 #define Q35_HOST_DEVICE(obj) \
@@ -60,6 +61,7 @@ typedef struct MCHPCIState {
     uint64_t pci_hole64_size;
     PcGuestInfo *guest_info;
     uint32_t short_root_bus;
+    IntelIOMMUState *iommu;
 } MCHPCIState;
 
 typedef struct Q35PCIHost {
diff --git a/hw/core/machine.c b/hw/core/machine.c
index 7a66c57..0708de5 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -235,6 +235,20 @@ static void machine_set_firmware(Object *obj, const char *value, Error **errp)
     ms->firmware = g_strdup(value);
 }
 
+static bool machine_get_iommu(Object *obj, Error **errp)
+{
+    MachineState *ms = MACHINE(obj);
+
+    return ms->iommu;
+}
+
+static void machine_set_iommu(Object *obj, bool value, Error **errp)
+{
+    MachineState *ms = MACHINE(obj);
+
+    ms->iommu = value;
+}
+
 static void machine_initfn(Object *obj)
 {
     object_property_add_str(obj, "accel",
@@ -274,6 +288,9 @@ static void machine_initfn(Object *obj)
     object_property_add_bool(obj, "usb", machine_get_usb, machine_set_usb, NULL);
     object_property_add_str(obj, "firmware",
                             machine_get_firmware, machine_set_firmware, NULL);
+    object_property_add_bool(obj, "iommu",
+                             machine_get_iommu,
+                             machine_set_iommu, NULL);
 }
 
 static void machine_finalize(Object *obj)
diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c
index 37f228e..721cf5b 100644
--- a/hw/pci-host/q35.c
+++ b/hw/pci-host/q35.c
@@ -347,6 +347,48 @@ static void mch_reset(DeviceState *qdev)
     mch_update(mch);
 }
 
+static AddressSpace *q35_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
+{
+    IntelIOMMUState *s = opaque;
+    VTDAddressSpace **pvtd_as;
+    int bus_num = pci_bus_num(bus);
+
+    assert(0 <= bus_num && bus_num <= VTD_PCI_BUS_MAX);
+    assert(0 <= devfn && devfn <= VTD_PCI_DEVFN_MAX);
+
+    pvtd_as = s->address_spaces[bus_num];
+    if (!pvtd_as) {
+        /* No corresponding free() */
+        pvtd_as = g_malloc0(sizeof(VTDAddressSpace *) * VTD_PCI_DEVFN_MAX);
+        s->address_spaces[bus_num] = pvtd_as;
+    }
+    if (!pvtd_as[devfn]) {
+        pvtd_as[devfn] = g_malloc0(sizeof(VTDAddressSpace));
+
+        pvtd_as[devfn]->bus_num = (uint8_t)bus_num;
+        pvtd_as[devfn]->devfn = (uint8_t)devfn;
+        pvtd_as[devfn]->iommu_state = s;
+        memory_region_init_iommu(&pvtd_as[devfn]->iommu, OBJECT(s),
+                                 &s->iommu_ops, "intel_iommu", UINT64_MAX);
+        address_space_init(&pvtd_as[devfn]->as,
+                           &pvtd_as[devfn]->iommu, "intel_iommu");
+    }
+    return &pvtd_as[devfn]->as;
+}
+
+static void mch_init_dmar(MCHPCIState *mch)
+{
+    PCIBus *pci_bus = PCI_BUS(qdev_get_parent_bus(DEVICE(mch)));
+
+    mch->iommu = INTEL_IOMMU_DEVICE(qdev_create(NULL, TYPE_INTEL_IOMMU_DEVICE));
+    object_property_add_child(OBJECT(mch), "intel-iommu",
+                              OBJECT(mch->iommu), NULL);
+    qdev_init_nofail(DEVICE(mch->iommu));
+    sysbus_mmio_map(SYS_BUS_DEVICE(mch->iommu), 0, Q35_HOST_BRIDGE_IOMMU_ADDR);
+
+    pci_setup_iommu(pci_bus, q35_host_dma_iommu, mch->iommu);
+}
+
 static int mch_init(PCIDevice *d)
 {
     int i;
@@ -370,6 +412,10 @@ static int mch_init(PCIDevice *d)
                  &mch->pam_regions[i+1], PAM_EXPAN_BASE + i * PAM_EXPAN_SIZE,
                  PAM_EXPAN_SIZE);
     }
+    /* Intel IOMMU (VT-d) */
+    if (qemu_opt_get_bool(qemu_get_machine_opts(), "iommu", false)) {
+        mch_init_dmar(mch);
+    }
     return 0;
 }
 
diff --git a/vl.c b/vl.c
index b796c67..cca012a 100644
--- a/vl.c
+++ b/vl.c
@@ -388,6 +388,10 @@ static QemuOptsList qemu_machine_opts = {
             .name = PC_MACHINE_MAX_RAM_BELOW_4G,
             .type = QEMU_OPT_SIZE,
             .help = "maximum ram below the 4G boundary (32bit boundary)",
+        },{
+            .name = "iommu",
+            .type = QEMU_OPT_BOOL,
+            .help = "Set on/off to enable/disable Intel IOMMU (VT-d)",
         },
         { /* End of list */ }
     },
diff --git a/qemu-options.hx b/qemu-options.hx
index c573dd8..92b4da6 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -35,7 +35,8 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
     "                kernel_irqchip=on|off controls accelerated irqchip support\n"
     "                kvm_shadow_mem=size of KVM shadow MMU\n"
     "                dump-guest-core=on|off include guest memory in a core dump (default=on)\n"
-    "                mem-merge=on|off controls memory merge support (default: on)\n",
+    "                mem-merge=on|off controls memory merge support (default: on)\n"
+    "                iommu=on|off controls emulated Intel IOMMU (VT-d) support (default=off)\n",
     QEMU_ARCH_ALL)
 STEXI
 @item -machine [type=]@var{name}[,prop=@var{value}[,...]]
@@ -58,6 +59,8 @@ Include guest memory in a core dump. The default is on.
 Enables or disables memory merge support. This feature, when supported by
 the host, de-duplicates identical memory pages among VMs instances
 (enabled by default).
+@item iommu=on|off
+Enables or disables emulated Intel IOMMU (VT-d) support. The default is off.
 @end table
 ETEXI
 
-- 
MST

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [Qemu-devel] [PULL 05/13] intel-iommu: fix coding style issues around in q35.c and machine.c
  2014-09-02 15:07 [Qemu-devel] [PULL 00/13] pci, pc fixes, features Michael S. Tsirkin
                   ` (3 preceding siblings ...)
  2014-09-02 15:07 ` [Qemu-devel] [PULL 04/13] intel-iommu: add Intel IOMMU emulation to q35 and add a machine option "iommu" as a switch Michael S. Tsirkin
@ 2014-09-02 15:07 ` Michael S. Tsirkin
  2014-09-02 15:07 ` [Qemu-devel] [PULL 06/13] intel-iommu: add supports for queued invalidation interface Michael S. Tsirkin
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 19+ messages in thread
From: Michael S. Tsirkin @ 2014-09-02 15:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Maydell, Marcel Apfelbaum, Michael Roth, Le Tan,
	Anthony Liguori, Andreas Färber

From: Le Tan <tamlokveer@gmail.com>

Fix coding style issues in hw/pci-host/q35.c and hw/core/machine.c.

Signed-off-by: Le Tan <tamlokveer@gmail.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 hw/core/machine.c | 10 +++++++---
 hw/pci-host/q35.c | 11 ++++++-----
 2 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/hw/core/machine.c b/hw/core/machine.c
index 0708de5..f0046d6 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -284,10 +284,14 @@ static void machine_initfn(Object *obj)
                              machine_set_dump_guest_core,
                              NULL);
     object_property_add_bool(obj, "mem-merge",
-                             machine_get_mem_merge, machine_set_mem_merge, NULL);
-    object_property_add_bool(obj, "usb", machine_get_usb, machine_set_usb, NULL);
+                             machine_get_mem_merge,
+                             machine_set_mem_merge, NULL);
+    object_property_add_bool(obj, "usb",
+                             machine_get_usb,
+                             machine_set_usb, NULL);
     object_property_add_str(obj, "firmware",
-                            machine_get_firmware, machine_set_firmware, NULL);
+                            machine_get_firmware,
+                            machine_set_firmware, NULL);
     object_property_add_bool(obj, "iommu",
                              machine_get_iommu,
                              machine_set_iommu, NULL);
diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c
index 721cf5b..057cab6 100644
--- a/hw/pci-host/q35.c
+++ b/hw/pci-host/q35.c
@@ -405,12 +405,13 @@ static int mch_init(PCIDevice *d)
     memory_region_add_subregion_overlap(mch->system_memory, 0xa0000,
                                         &mch->smram_region, 1);
     memory_region_set_enabled(&mch->smram_region, false);
-    init_pam(DEVICE(mch), mch->ram_memory, mch->system_memory, mch->pci_address_space,
-             &mch->pam_regions[0], PAM_BIOS_BASE, PAM_BIOS_SIZE);
+    init_pam(DEVICE(mch), mch->ram_memory, mch->system_memory,
+             mch->pci_address_space, &mch->pam_regions[0],
+             PAM_BIOS_BASE, PAM_BIOS_SIZE);
     for (i = 0; i < 12; ++i) {
-        init_pam(DEVICE(mch), mch->ram_memory, mch->system_memory, mch->pci_address_space,
-                 &mch->pam_regions[i+1], PAM_EXPAN_BASE + i * PAM_EXPAN_SIZE,
-                 PAM_EXPAN_SIZE);
+        init_pam(DEVICE(mch), mch->ram_memory, mch->system_memory,
+                 mch->pci_address_space, &mch->pam_regions[i+1],
+                 PAM_EXPAN_BASE + i * PAM_EXPAN_SIZE, PAM_EXPAN_SIZE);
     }
     /* Intel IOMMU (VT-d) */
     if (qemu_opt_get_bool(qemu_get_machine_opts(), "iommu", false)) {
-- 
MST

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [Qemu-devel] [PULL 06/13] intel-iommu: add supports for queued invalidation interface
  2014-09-02 15:07 [Qemu-devel] [PULL 00/13] pci, pc fixes, features Michael S. Tsirkin
                   ` (4 preceding siblings ...)
  2014-09-02 15:07 ` [Qemu-devel] [PULL 05/13] intel-iommu: fix coding style issues around in q35.c and machine.c Michael S. Tsirkin
@ 2014-09-02 15:07 ` Michael S. Tsirkin
  2014-09-02 15:07 ` [Qemu-devel] [PULL 07/13] intel-iommu: add context-cache to cache context-entry Michael S. Tsirkin
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 19+ messages in thread
From: Michael S. Tsirkin @ 2014-09-02 15:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: Peter Maydell, Le Tan, Anthony Liguori

From: Le Tan <tamlokveer@gmail.com>

Add support for the queued invalidation interface, an expanded invalidation
interface with extended capabilities.

Signed-off-by: Le Tan <tamlokveer@gmail.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 hw/i386/intel_iommu_internal.h |  27 ++-
 hw/i386/intel_iommu.c          | 373 ++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 393 insertions(+), 7 deletions(-)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 7ca034d..cbcc8d1 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -255,12 +255,27 @@ typedef enum VTDFaultReason {
     VTD_FR_MAX,                 /* Guard */
 } VTDFaultReason;
 
-/* Masks for Queued Invalidation Descriptor */
-#define VTD_INV_DESC_TYPE           0xf
-#define VTD_INV_DESC_CC             0x1 /* Context-cache Invalidate Desc */
-#define VTD_INV_DESC_IOTLB          0x2
-#define VTD_INV_DESC_WAIT           0x5 /* Invalidation Wait Descriptor */
-#define VTD_INV_DESC_NONE           0   /* Not an Invalidate Descriptor */
+/* Queued Invalidation Descriptor */
+struct VTDInvDesc {
+    uint64_t lo;
+    uint64_t hi;
+};
+typedef struct VTDInvDesc VTDInvDesc;
+
+/* Masks for struct VTDInvDesc */
+#define VTD_INV_DESC_TYPE               0xf
+#define VTD_INV_DESC_CC                 0x1 /* Context-cache Invalidate Desc */
+#define VTD_INV_DESC_IOTLB              0x2
+#define VTD_INV_DESC_WAIT               0x5 /* Invalidation Wait Descriptor */
+#define VTD_INV_DESC_NONE               0   /* Not an Invalidate Descriptor */
+
+/* Masks for Invalidation Wait Descriptor */
+#define VTD_INV_DESC_WAIT_SW            (1ULL << 5)
+#define VTD_INV_DESC_WAIT_IF            (1ULL << 4)
+#define VTD_INV_DESC_WAIT_FN            (1ULL << 6)
+#define VTD_INV_DESC_WAIT_DATA_SHIFT    32
+#define VTD_INV_DESC_WAIT_RSVD_LO       0xffffff80ULL
+#define VTD_INV_DESC_WAIT_RSVD_HI       3ULL
 
 /* Pagesize of VTD paging structures, including root and context tables */
 #define VTD_PAGE_SHIFT              12
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 8e67e04..60dec4f 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -314,6 +314,41 @@ static void vtd_report_dmar_fault(IntelIOMMUState *s, uint16_t source_id,
     }
 }
 
+/* Handle the error conditions (Invalidation Queue Errors) of the queued
+ * invalidation interface.
+ */
+static void vtd_handle_inv_queue_error(IntelIOMMUState *s)
+{
+    uint32_t fsts_reg = vtd_get_long_raw(s, DMAR_FSTS_REG);
+
+    vtd_set_clear_mask_long(s, DMAR_FSTS_REG, 0, VTD_FSTS_IQE);
+    vtd_generate_fault_event(s, fsts_reg);
+}
+
+/* Set the IWC field and try to generate an invalidation completion interrupt */
+static void vtd_generate_completion_event(IntelIOMMUState *s)
+{
+    VTD_DPRINTF(INV, "completes an invalidation wait command with "
+                "Interrupt Flag");
+    if (vtd_get_long_raw(s, DMAR_ICS_REG) & VTD_ICS_IWC) {
+        VTD_DPRINTF(INV, "there is a previous interrupt condition to be "
+                    "serviced by software, "
+                    "new invalidation event is not generated");
+        return;
+    }
+    vtd_set_clear_mask_long(s, DMAR_ICS_REG, 0, VTD_ICS_IWC);
+    vtd_set_clear_mask_long(s, DMAR_IECTL_REG, 0, VTD_IECTL_IP);
+    if (vtd_get_long_raw(s, DMAR_IECTL_REG) & VTD_IECTL_IM) {
+        VTD_DPRINTF(INV, "IM field in IECTL_REG is set, new invalidation "
+                    "event is not generated");
+        return;
+    } else {
+        /* Generate the interrupt event */
+        vtd_generate_interrupt(s, DMAR_IEADDR_REG, DMAR_IEDATA_REG);
+        vtd_set_clear_mask_long(s, DMAR_IECTL_REG, VTD_IECTL_IP, 0);
+    }
+}
+
 static inline bool vtd_root_entry_present(VTDRootEntry *root)
 {
     return root->val & VTD_ROOT_ENTRY_P;
@@ -759,6 +794,54 @@ static uint64_t vtd_iotlb_flush(IntelIOMMUState *s, uint64_t val)
     return iaig;
 }
 
+static inline bool vtd_queued_inv_enable_check(IntelIOMMUState *s)
+{
+    return s->iq_tail == 0;
+}
+
+static inline bool vtd_queued_inv_disable_check(IntelIOMMUState *s)
+{
+    return s->qi_enabled && (s->iq_tail == s->iq_head) &&
+           (s->iq_last_desc_type == VTD_INV_DESC_WAIT);
+}
+
+static void vtd_handle_gcmd_qie(IntelIOMMUState *s, bool en)
+{
+    uint64_t iqa_val = vtd_get_quad_raw(s, DMAR_IQA_REG);
+
+    VTD_DPRINTF(INV, "Queued Invalidation Enable %s", (en ? "on" : "off"));
+    if (en) {
+        if (vtd_queued_inv_enable_check(s)) {
+            s->iq = iqa_val & VTD_IQA_IQA_MASK;
+            /* 2^(x+8) entries */
+            s->iq_size = 1UL << ((iqa_val & VTD_IQA_QS) + 8);
+            s->qi_enabled = true;
+            VTD_DPRINTF(INV, "DMAR_IQA_REG 0x%"PRIx64, iqa_val);
+            VTD_DPRINTF(INV, "Invalidation Queue addr 0x%"PRIx64 " size %d",
+                        s->iq, s->iq_size);
+            /* Ok - report back to driver */
+            vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_QIES);
+        } else {
+            VTD_DPRINTF(GENERAL, "error: can't enable Queued Invalidation: "
+                        "tail %"PRIu16, s->iq_tail);
+        }
+    } else {
+        if (vtd_queued_inv_disable_check(s)) {
+            /* disable Queued Invalidation */
+            vtd_set_quad_raw(s, DMAR_IQH_REG, 0);
+            s->iq_head = 0;
+            s->qi_enabled = false;
+            /* Ok - report back to driver */
+            vtd_set_clear_mask_long(s, DMAR_GSTS_REG, VTD_GSTS_QIES, 0);
+        } else {
+            VTD_DPRINTF(GENERAL, "error: can't disable Queued Invalidation: "
+                        "head %"PRIu16 ", tail %"PRIu16
+                        ", last_descriptor %"PRIu8,
+                        s->iq_head, s->iq_tail, s->iq_last_desc_type);
+        }
+    }
+}
+
 /* Set Root Table Pointer */
 static void vtd_handle_gcmd_srtp(IntelIOMMUState *s)
 {
@@ -804,6 +887,10 @@ static void vtd_handle_gcmd_write(IntelIOMMUState *s)
         /* Set/update the root-table pointer */
         vtd_handle_gcmd_srtp(s);
     }
+    if (changed & VTD_GCMD_QIE) {
+        /* Queued Invalidation Enable */
+        vtd_handle_gcmd_qie(s, val & VTD_GCMD_QIE);
+    }
 }
 
 /* Handle write to Context Command Register */
@@ -814,6 +901,11 @@ static void vtd_handle_ccmd_write(IntelIOMMUState *s)
 
     /* Context-cache invalidation request */
     if (val & VTD_CCMD_ICC) {
+        if (s->qi_enabled) {
+            VTD_DPRINTF(GENERAL, "error: Queued Invalidation enabled, "
+                        "should not use register-based invalidation");
+            return;
+        }
         ret = vtd_context_cache_invalidate(s, val);
         /* Invalidation completed. Change something to show */
         vtd_set_clear_mask_quad(s, DMAR_CCMD_REG, VTD_CCMD_ICC, 0ULL);
@@ -831,6 +923,11 @@ static void vtd_handle_iotlb_write(IntelIOMMUState *s)
 
     /* IOTLB invalidation request */
     if (val & VTD_TLB_IVT) {
+        if (s->qi_enabled) {
+            VTD_DPRINTF(GENERAL, "error: Queued Invalidation enabled, "
+                        "should not use register-based invalidation");
+            return;
+        }
         ret = vtd_iotlb_flush(s, val);
+        /* Invalidation completed, clear the IVT bit to signal completion */
         vtd_set_clear_mask_quad(s, DMAR_IOTLB_REG, VTD_TLB_IVT, 0ULL);
@@ -840,6 +937,146 @@ static void vtd_handle_iotlb_write(IntelIOMMUState *s)
     }
 }
 
+/* Fetch an Invalidation Descriptor from the Invalidation Queue */
+static bool vtd_get_inv_desc(dma_addr_t base_addr, uint32_t offset,
+                             VTDInvDesc *inv_desc)
+{
+    dma_addr_t addr = base_addr + offset * sizeof(*inv_desc);
+    if (dma_memory_read(&address_space_memory, addr, inv_desc,
+        sizeof(*inv_desc))) {
+        VTD_DPRINTF(GENERAL, "error: failed to fetch Invalidation Descriptor "
+                    "base_addr 0x%"PRIx64 " offset %"PRIu32, base_addr, offset);
+        inv_desc->lo = 0;
+        inv_desc->hi = 0;
+
+        return false;
+    }
+    inv_desc->lo = le64_to_cpu(inv_desc->lo);
+    inv_desc->hi = le64_to_cpu(inv_desc->hi);
+    return true;
+}
+
+static bool vtd_process_wait_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
+{
+    if ((inv_desc->hi & VTD_INV_DESC_WAIT_RSVD_HI) ||
+        (inv_desc->lo & VTD_INV_DESC_WAIT_RSVD_LO)) {
+        VTD_DPRINTF(GENERAL, "error: non-zero reserved field in Invalidation "
+                    "Wait Descriptor hi 0x%"PRIx64 " lo 0x%"PRIx64,
+                    inv_desc->hi, inv_desc->lo);
+        return false;
+    }
+    if (inv_desc->lo & VTD_INV_DESC_WAIT_SW) {
+        /* Status Write */
+        uint32_t status_data = (uint32_t)(inv_desc->lo >>
+                               VTD_INV_DESC_WAIT_DATA_SHIFT);
+
+        assert(!(inv_desc->lo & VTD_INV_DESC_WAIT_IF));
+
+        /* FIXME: need to be masked with HAW? */
+        dma_addr_t status_addr = inv_desc->hi;
+        VTD_DPRINTF(INV, "status data 0x%x, status addr 0x%"PRIx64,
+                    status_data, status_addr);
+        status_data = cpu_to_le32(status_data);
+        if (dma_memory_write(&address_space_memory, status_addr, &status_data,
+                             sizeof(status_data))) {
+            VTD_DPRINTF(GENERAL, "error: failed to perform a coherent write");
+            return false;
+        }
+    } else if (inv_desc->lo & VTD_INV_DESC_WAIT_IF) {
+        /* Interrupt flag */
+        VTD_DPRINTF(INV, "Invalidation Wait Descriptor interrupt completion");
+        vtd_generate_completion_event(s);
+    } else {
+        VTD_DPRINTF(GENERAL, "error: invalid Invalidation Wait Descriptor: "
+                    "hi 0x%"PRIx64 " lo 0x%"PRIx64, inv_desc->hi, inv_desc->lo);
+        return false;
+    }
+    return true;
+}
+
+static bool vtd_process_inv_desc(IntelIOMMUState *s)
+{
+    VTDInvDesc inv_desc;
+    uint8_t desc_type;
+
+    VTD_DPRINTF(INV, "iq head %"PRIu16, s->iq_head);
+    if (!vtd_get_inv_desc(s->iq, s->iq_head, &inv_desc)) {
+        s->iq_last_desc_type = VTD_INV_DESC_NONE;
+        return false;
+    }
+    desc_type = inv_desc.lo & VTD_INV_DESC_TYPE;
+    /* FIXME: should update at first or at last? */
+    s->iq_last_desc_type = desc_type;
+
+    switch (desc_type) {
+    case VTD_INV_DESC_CC:
+        VTD_DPRINTF(INV, "Context-cache Invalidate Descriptor hi 0x%"PRIx64
+                    " lo 0x%"PRIx64, inv_desc.hi, inv_desc.lo);
+        break;
+
+    case VTD_INV_DESC_IOTLB:
+        VTD_DPRINTF(INV, "IOTLB Invalidate Descriptor hi 0x%"PRIx64
+                    " lo 0x%"PRIx64, inv_desc.hi, inv_desc.lo);
+        break;
+
+    case VTD_INV_DESC_WAIT:
+        VTD_DPRINTF(INV, "Invalidation Wait Descriptor hi 0x%"PRIx64
+                    " lo 0x%"PRIx64, inv_desc.hi, inv_desc.lo);
+        if (!vtd_process_wait_desc(s, &inv_desc)) {
+            return false;
+        }
+        break;
+
+    default:
+        VTD_DPRINTF(GENERAL, "error: unknown Invalidation Descriptor type "
+                    "hi 0x%"PRIx64 " lo 0x%"PRIx64 " type %"PRIu8,
+                    inv_desc.hi, inv_desc.lo, desc_type);
+        return false;
+    }
+    s->iq_head++;
+    if (s->iq_head == s->iq_size) {
+        s->iq_head = 0;
+    }
+    return true;
+}
+
+/* Try to fetch and process more Invalidation Descriptors */
+static void vtd_fetch_inv_desc(IntelIOMMUState *s)
+{
+    VTD_DPRINTF(INV, "fetch Invalidation Descriptors");
+    if (s->iq_tail >= s->iq_size) {
+        /* Detects an invalid Tail pointer */
+        VTD_DPRINTF(GENERAL, "error: iq_tail is %"PRIu16
+                    " while iq_size is %"PRIu16, s->iq_tail, s->iq_size);
+        vtd_handle_inv_queue_error(s);
+        return;
+    }
+    while (s->iq_head != s->iq_tail) {
+        if (!vtd_process_inv_desc(s)) {
+            /* Invalidation Queue Errors */
+            vtd_handle_inv_queue_error(s);
+            break;
+        }
+        /* Must update the IQH_REG in time */
+        vtd_set_quad_raw(s, DMAR_IQH_REG,
+                         (((uint64_t)(s->iq_head)) << VTD_IQH_QH_SHIFT) &
+                         VTD_IQH_QH_MASK);
+    }
+}
+
+/* Handle write to Invalidation Queue Tail Register */
+static void vtd_handle_iqt_write(IntelIOMMUState *s)
+{
+    uint64_t val = vtd_get_quad_raw(s, DMAR_IQT_REG);
+
+    s->iq_tail = VTD_IQT_QT(val);
+    VTD_DPRINTF(INV, "set iq tail %"PRIu16, s->iq_tail);
+    if (s->qi_enabled && !(vtd_get_long_raw(s, DMAR_FSTS_REG) & VTD_FSTS_IQE)) {
+        /* Process Invalidation Queue here */
+        vtd_fetch_inv_desc(s);
+    }
+}
+
 static void vtd_handle_fsts_write(IntelIOMMUState *s)
 {
     uint32_t fsts_reg = vtd_get_long_raw(s, DMAR_FSTS_REG);
@@ -851,6 +1088,9 @@ static void vtd_handle_fsts_write(IntelIOMMUState *s)
         VTD_DPRINTF(FLOG, "all pending interrupt conditions serviced, clear "
                     "IP field of FECTL_REG");
     }
+    /* FIXME: when IQE is Clear, should we try to fetch some Invalidation
+     * Descriptors if there are any when Queued Invalidation is enabled?
+     */
 }
 
 static void vtd_handle_fectl_write(IntelIOMMUState *s)
@@ -869,6 +1109,34 @@ static void vtd_handle_fectl_write(IntelIOMMUState *s)
     }
 }
 
+static void vtd_handle_ics_write(IntelIOMMUState *s)
+{
+    uint32_t ics_reg = vtd_get_long_raw(s, DMAR_ICS_REG);
+    uint32_t iectl_reg = vtd_get_long_raw(s, DMAR_IECTL_REG);
+
+    if ((iectl_reg & VTD_IECTL_IP) && !(ics_reg & VTD_ICS_IWC)) {
+        vtd_set_clear_mask_long(s, DMAR_IECTL_REG, VTD_IECTL_IP, 0);
+        VTD_DPRINTF(INV, "pending completion interrupt condition serviced, "
+                    "clear IP field of IECTL_REG");
+    }
+}
+
+static void vtd_handle_iectl_write(IntelIOMMUState *s)
+{
+    uint32_t iectl_reg;
+    /* FIXME: when software clears the IM field, check the IP field. But do we
+     * need to compare the old value and the new value to conclude that
+     * software clears the IM field? Or just check if the IM field is zero?
+     */
+    iectl_reg = vtd_get_long_raw(s, DMAR_IECTL_REG);
+    if ((iectl_reg & VTD_IECTL_IP) && !(iectl_reg & VTD_IECTL_IM)) {
+        vtd_generate_interrupt(s, DMAR_IEADDR_REG, DMAR_IEDATA_REG);
+        vtd_set_clear_mask_long(s, DMAR_IECTL_REG, VTD_IECTL_IP, 0);
+        VTD_DPRINTF(INV, "IM field is cleared, generate "
+                    "invalidation event interrupt");
+    }
+}
+
 static uint64_t vtd_mem_read(void *opaque, hwaddr addr, unsigned size)
 {
     IntelIOMMUState *s = opaque;
@@ -896,6 +1164,19 @@ static uint64_t vtd_mem_read(void *opaque, hwaddr addr, unsigned size)
         val = s->root >> 32;
         break;
 
+    /* Invalidation Queue Address Register, 64-bit */
+    case DMAR_IQA_REG:
+        val = s->iq | (vtd_get_quad(s, DMAR_IQA_REG) & VTD_IQA_QS);
+        if (size == 4) {
+            val = val & ((1ULL << 32) - 1);
+        }
+        break;
+
+    case DMAR_IQA_REG_HI:
+        assert(size == 4);
+        val = s->iq >> 32;
+        break;
+
     default:
         if (size == 4) {
             val = vtd_get_long(s, addr);
@@ -1037,6 +1318,86 @@ static void vtd_mem_write(void *opaque, hwaddr addr,
         vtd_set_long(s, addr, val);
         break;
 
+    /* Invalidation Queue Tail Register, 64-bit */
+    case DMAR_IQT_REG:
+        VTD_DPRINTF(INV, "DMAR_IQT_REG write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        if (size == 4) {
+            vtd_set_long(s, addr, val);
+        } else {
+            vtd_set_quad(s, addr, val);
+        }
+        vtd_handle_iqt_write(s);
+        break;
+
+    case DMAR_IQT_REG_HI:
+        VTD_DPRINTF(INV, "DMAR_IQT_REG_HI write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        assert(size == 4);
+        vtd_set_long(s, addr, val);
+        /* 19:63 of IQT_REG is RsvdZ, do nothing here */
+        break;
+
+    /* Invalidation Queue Address Register, 64-bit */
+    case DMAR_IQA_REG:
+        VTD_DPRINTF(INV, "DMAR_IQA_REG write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        if (size == 4) {
+            vtd_set_long(s, addr, val);
+        } else {
+            vtd_set_quad(s, addr, val);
+        }
+        break;
+
+    case DMAR_IQA_REG_HI:
+        VTD_DPRINTF(INV, "DMAR_IQA_REG_HI write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        assert(size == 4);
+        vtd_set_long(s, addr, val);
+        break;
+
+    /* Invalidation Completion Status Register, 32-bit */
+    case DMAR_ICS_REG:
+        VTD_DPRINTF(INV, "DMAR_ICS_REG write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        assert(size == 4);
+        vtd_set_long(s, addr, val);
+        vtd_handle_ics_write(s);
+        break;
+
+    /* Invalidation Event Control Register, 32-bit */
+    case DMAR_IECTL_REG:
+        VTD_DPRINTF(INV, "DMAR_IECTL_REG write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        assert(size == 4);
+        vtd_set_long(s, addr, val);
+        vtd_handle_iectl_write(s);
+        break;
+
+    /* Invalidation Event Data Register, 32-bit */
+    case DMAR_IEDATA_REG:
+        VTD_DPRINTF(INV, "DMAR_IEDATA_REG write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        assert(size == 4);
+        vtd_set_long(s, addr, val);
+        break;
+
+    /* Invalidation Event Address Register, 32-bit */
+    case DMAR_IEADDR_REG:
+        VTD_DPRINTF(INV, "DMAR_IEADDR_REG write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        assert(size == 4);
+        vtd_set_long(s, addr, val);
+        break;
+
+    /* Invalidation Event Upper Address Register, 32-bit */
+    case DMAR_IEUADDR_REG:
+        VTD_DPRINTF(INV, "DMAR_IEUADDR_REG write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        assert(size == 4);
+        vtd_set_long(s, addr, val);
+        break;
+
     /* Fault Recording Registers, 128-bit */
     case DMAR_FRCD_REG_0_0:
         VTD_DPRINTF(FLOG, "DMAR_FRCD_REG_0_0 write addr 0x%"PRIx64
@@ -1168,7 +1529,7 @@ static void vtd_init(IntelIOMMUState *s)
     s->next_frcd_reg = 0;
     s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND | VTD_CAP_MGAW |
              VTD_CAP_SAGAW;
-    s->ecap = VTD_ECAP_IRO;
+    s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
 
     /* Define registers with default values and bit semantics */
     vtd_define_long(s, DMAR_VER_REG, 0x10UL, 0, 0);
@@ -1198,6 +1559,16 @@ static void vtd_init(IntelIOMMUState *s)
      */
     vtd_define_long(s, DMAR_PMEN_REG, 0, 0, 0);
 
+    vtd_define_quad(s, DMAR_IQH_REG, 0, 0, 0);
+    vtd_define_quad(s, DMAR_IQT_REG, 0, 0x7fff0ULL, 0);
+    vtd_define_quad(s, DMAR_IQA_REG, 0, 0xfffffffffffff007ULL, 0);
+    vtd_define_long(s, DMAR_ICS_REG, 0, 0, 0x1UL);
+    vtd_define_long(s, DMAR_IECTL_REG, 0x80000000UL, 0x80000000UL, 0);
+    vtd_define_long(s, DMAR_IEDATA_REG, 0, 0xffffffffUL, 0);
+    vtd_define_long(s, DMAR_IEADDR_REG, 0, 0xfffffffcUL, 0);
+    /* Treated as RsvdZ when EIM in ECAP_REG is not supported */
+    vtd_define_long(s, DMAR_IEUADDR_REG, 0, 0, 0);
+
     /* IOTLB registers */
     vtd_define_quad(s, DMAR_IOTLB_REG, 0, 0Xb003ffff00000000ULL, 0);
     vtd_define_quad(s, DMAR_IVA_REG, 0, 0xfffffffffffff07fULL, 0);
-- 
MST

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [Qemu-devel] [PULL 07/13] intel-iommu: add context-cache to cache context-entry
  2014-09-02 15:07 [Qemu-devel] [PULL 00/13] pci, pc fixes, features Michael S. Tsirkin
                   ` (5 preceding siblings ...)
  2014-09-02 15:07 ` [Qemu-devel] [PULL 06/13] intel-iommu: add supports for queued invalidation interface Michael S. Tsirkin
@ 2014-09-02 15:07 ` Michael S. Tsirkin
  2014-09-02 15:07 ` [Qemu-devel] [PULL 08/13] intel-iommu: add IOTLB using hash table Michael S. Tsirkin
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 19+ messages in thread
From: Michael S. Tsirkin @ 2014-09-02 15:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: Peter Maydell, Le Tan, Anthony Liguori

From: Le Tan <tamlokveer@gmail.com>

Add a context-cache to cache the context-entries encountered during page-walks.
Each VTDAddressSpace has a VTDContextCacheEntry member which represents an entry
in the context-cache. Since devices with different bus_num and devfn each have
their own VTDAddressSpace, this is a convenient way to reference the cached
entries.
Each VTDContextCacheEntry carries a context_cache_gen, and the cached entry is
valid only when its context_cache_gen equals IntelIOMMUState.context_cache_gen.

Signed-off-by: Le Tan <tamlokveer@gmail.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 hw/i386/intel_iommu_internal.h |  23 +++--
 include/hw/i386/intel_iommu.h  |  22 +++++
 hw/i386/intel_iommu.c          | 188 +++++++++++++++++++++++++++++++++++------
 hw/pci-host/q35.c              |   1 +
 4 files changed, 199 insertions(+), 35 deletions(-)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index cbcc8d1..30c318d 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -154,6 +154,9 @@
 #define VTD_CCMD_DOMAIN_INVL_A      (2ULL << 59)
 #define VTD_CCMD_DEVICE_INVL_A      (3ULL << 59)
 #define VTD_CCMD_CAIG_MASK          (3ULL << 59)
+#define VTD_CCMD_DID(val)           ((val) & VTD_DOMAIN_ID_MASK)
+#define VTD_CCMD_SID(val)           (((val) >> 16) & 0xffffULL)
+#define VTD_CCMD_FM(val)            (((val) >> 32) & 3ULL)
 
 /* RTADDR_REG */
 #define VTD_RTADDR_RTT              (1ULL << 11)
@@ -169,6 +172,7 @@
 #define VTD_CAP_FRO                 (DMAR_FRCD_REG_OFFSET << 20)
 #define VTD_CAP_NFR                 ((DMAR_FRCD_REG_NR - 1) << 40)
 #define VTD_DOMAIN_ID_SHIFT         16  /* 16-bit domain id for 64K domains */
+#define VTD_DOMAIN_ID_MASK          ((1UL << VTD_DOMAIN_ID_SHIFT) - 1)
 #define VTD_CAP_ND                  (((VTD_DOMAIN_ID_SHIFT - 4) / 2) & 7ULL)
 #define VTD_MGAW                    39  /* Maximum Guest Address Width */
 #define VTD_CAP_MGAW                (((VTD_MGAW - 1) & 0x3fULL) << 16)
@@ -255,6 +259,8 @@ typedef enum VTDFaultReason {
     VTD_FR_MAX,                 /* Guard */
 } VTDFaultReason;
 
+#define VTD_CONTEXT_CACHE_GEN_MAX       0xffffffffUL
+
 /* Queued Invalidation Descriptor */
 struct VTDInvDesc {
     uint64_t lo;
@@ -277,6 +283,16 @@ typedef struct VTDInvDesc VTDInvDesc;
 #define VTD_INV_DESC_WAIT_RSVD_LO       0Xffffff80ULL
 #define VTD_INV_DESC_WAIT_RSVD_HI       3ULL
 
+/* Masks for Context-cache Invalidation Descriptor */
+#define VTD_INV_DESC_CC_G               (3ULL << 4)
+#define VTD_INV_DESC_CC_GLOBAL          (1ULL << 4)
+#define VTD_INV_DESC_CC_DOMAIN          (2ULL << 4)
+#define VTD_INV_DESC_CC_DEVICE          (3ULL << 4)
+#define VTD_INV_DESC_CC_DID(val)        (((val) >> 16) & VTD_DOMAIN_ID_MASK)
+#define VTD_INV_DESC_CC_SID(val)        (((val) >> 32) & 0xffffUL)
+#define VTD_INV_DESC_CC_FM(val)         (((val) >> 48) & 3UL)
+#define VTD_INV_DESC_CC_RSVD            0xfffc00000000ffc0ULL
+
 /* Pagesize of VTD paging structures, including root and context tables */
 #define VTD_PAGE_SHIFT              12
 #define VTD_PAGE_SIZE               (1ULL << VTD_PAGE_SHIFT)
@@ -301,13 +317,6 @@ typedef struct VTDRootEntry VTDRootEntry;
 #define VTD_ROOT_ENTRY_NR           (VTD_PAGE_SIZE / sizeof(VTDRootEntry))
 #define VTD_ROOT_ENTRY_RSVD         (0xffeULL | ~VTD_HAW_MASK)
 
-/* Context-Entry */
-struct VTDContextEntry {
-    uint64_t lo;
-    uint64_t hi;
-};
-typedef struct VTDContextEntry VTDContextEntry;
-
 /* Masks for struct VTDContextEntry */
 /* lo */
 #define VTD_CONTEXT_ENTRY_P         (1ULL << 0)
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index fe1f1e9..d9a5215 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -37,20 +37,40 @@
 #define VTD_PCI_DEVFN_MAX           256
 #define VTD_PCI_SLOT(devfn)         (((devfn) >> 3) & 0x1f)
 #define VTD_PCI_FUNC(devfn)         ((devfn) & 0x07)
+#define VTD_SID_TO_BUS(sid)         (((sid) >> 8) & 0xff)
+#define VTD_SID_TO_DEVFN(sid)       ((sid) & 0xff)
 
 #define DMAR_REG_SIZE               0x230
 #define VTD_HOST_ADDRESS_WIDTH      39
 #define VTD_HAW_MASK                ((1ULL << VTD_HOST_ADDRESS_WIDTH) - 1)
 
+typedef struct VTDContextEntry VTDContextEntry;
+typedef struct VTDContextCacheEntry VTDContextCacheEntry;
 typedef struct IntelIOMMUState IntelIOMMUState;
 typedef struct VTDAddressSpace VTDAddressSpace;
 
+
+/* Context-Entry */
+struct VTDContextEntry {
+    uint64_t lo;
+    uint64_t hi;
+};
+
+struct VTDContextCacheEntry {
+    /* The cache entry is obsolete if
+     * context_cache_gen!=IntelIOMMUState.context_cache_gen
+     */
+    uint32_t context_cache_gen;
+    struct VTDContextEntry context_entry;
+};
+
 struct VTDAddressSpace {
     uint8_t bus_num;
     uint8_t devfn;
     AddressSpace as;
     MemoryRegion iommu;
     IntelIOMMUState *iommu_state;
+    VTDContextCacheEntry context_cache_entry;
 };
 
 /* The iommu (DMAR) device state struct */
@@ -82,6 +102,8 @@ struct IntelIOMMUState {
     uint64_t cap;                   /* The value of capability reg */
     uint64_t ecap;                  /* The value of extended capability reg */
 
+    uint32_t context_cache_gen;     /* Should be in [1,MAX] */
+
     MemoryRegionIOMMUOps iommu_ops;
     VTDAddressSpace **address_spaces[VTD_PCI_BUS_MAX];
 };
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 60dec4f..c514310 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -27,6 +27,7 @@
 #ifdef DEBUG_INTEL_IOMMU
 enum {
     DEBUG_GENERAL, DEBUG_CSR, DEBUG_INV, DEBUG_MMU, DEBUG_FLOG,
+    DEBUG_CACHE,
 };
 #define VTD_DBGBIT(x)   (1 << DEBUG_##x)
 static int vtd_dbgflags = VTD_DBGBIT(GENERAL) | VTD_DBGBIT(CSR);
@@ -131,6 +132,33 @@ static uint64_t vtd_set_clear_mask_quad(IntelIOMMUState *s, hwaddr addr,
     return new_val;
 }
 
+/* Reset all the gen of VTDAddressSpace to zero and set the gen of
+ * IntelIOMMUState to 1.
+ */
+static void vtd_reset_context_cache(IntelIOMMUState *s)
+{
+    VTDAddressSpace **pvtd_as;
+    VTDAddressSpace *vtd_as;
+    uint32_t bus_it;
+    uint32_t devfn_it;
+
+    VTD_DPRINTF(CACHE, "global context_cache_gen=1");
+    for (bus_it = 0; bus_it < VTD_PCI_BUS_MAX; ++bus_it) {
+        pvtd_as = s->address_spaces[bus_it];
+        if (!pvtd_as) {
+            continue;
+        }
+        for (devfn_it = 0; devfn_it < VTD_PCI_DEVFN_MAX; ++devfn_it) {
+            vtd_as = pvtd_as[devfn_it];
+            if (!vtd_as) {
+                continue;
+            }
+            vtd_as->context_cache_entry.context_cache_gen = 0;
+        }
+    }
+    s->context_cache_gen = 1;
+}
+
 /* Given the reg addr of both the message data and address, generate an
  * interrupt via MSI.
  */
@@ -651,11 +679,13 @@ static inline bool vtd_is_interrupt_addr(hwaddr addr)
  * @is_write: The access is a write operation
  * @entry: IOMMUTLBEntry that contain the addr to be translated and result
  */
-static void vtd_do_iommu_translate(IntelIOMMUState *s, uint8_t bus_num,
+static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, uint8_t bus_num,
                                    uint8_t devfn, hwaddr addr, bool is_write,
                                    IOMMUTLBEntry *entry)
 {
+    IntelIOMMUState *s = vtd_as->iommu_state;
     VTDContextEntry ce;
+    VTDContextCacheEntry *cc_entry = &vtd_as->context_cache_entry;
     uint64_t slpte;
     uint32_t level;
     uint16_t source_id = vtd_make_source_id(bus_num, devfn);
@@ -686,18 +716,35 @@ static void vtd_do_iommu_translate(IntelIOMMUState *s, uint8_t bus_num,
             return;
         }
     }
-
-    ret_fr = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
-    is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
-    if (ret_fr) {
-        ret_fr = -ret_fr;
-        if (is_fpd_set && vtd_is_qualified_fault(ret_fr)) {
-            VTD_DPRINTF(FLOG, "fault processing is disabled for DMA requests "
-                        "through this context-entry (with FPD Set)");
-        } else {
-            vtd_report_dmar_fault(s, source_id, addr, ret_fr, is_write);
+    /* Try to fetch context-entry from cache first */
+    if (cc_entry->context_cache_gen == s->context_cache_gen) {
+        VTD_DPRINTF(CACHE, "hit context-cache bus %d devfn %d "
+                    "(hi %"PRIx64 " lo %"PRIx64 " gen %"PRIu32 ")",
+                    bus_num, devfn, cc_entry->context_entry.hi,
+                    cc_entry->context_entry.lo, cc_entry->context_cache_gen);
+        ce = cc_entry->context_entry;
+        is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
+    } else {
+        ret_fr = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
+        is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
+        if (ret_fr) {
+            ret_fr = -ret_fr;
+            if (is_fpd_set && vtd_is_qualified_fault(ret_fr)) {
+                VTD_DPRINTF(FLOG, "fault processing is disabled for DMA "
+                            "requests through this context-entry "
+                            "(with FPD Set)");
+            } else {
+                vtd_report_dmar_fault(s, source_id, addr, ret_fr, is_write);
+            }
+            return;
         }
-        return;
+        /* Update context-cache */
+        VTD_DPRINTF(CACHE, "update context-cache bus %d devfn %d "
+                    "(hi %"PRIx64 " lo %"PRIx64 " gen %"PRIu32 "->%"PRIu32 ")",
+                    bus_num, devfn, ce.hi, ce.lo,
+                    cc_entry->context_cache_gen, s->context_cache_gen);
+        cc_entry->context_entry = ce;
+        cc_entry->context_cache_gen = s->context_cache_gen;
     }
 
     ret_fr = vtd_gpa_to_slpte(&ce, addr, is_write, &slpte, &level,
@@ -729,6 +776,57 @@ static void vtd_root_table_setup(IntelIOMMUState *s)
                 (s->root_extended ? "(extended)" : ""));
 }
 
+static void vtd_context_global_invalidate(IntelIOMMUState *s)
+{
+    s->context_cache_gen++;
+    if (s->context_cache_gen == VTD_CONTEXT_CACHE_GEN_MAX) {
+        vtd_reset_context_cache(s);
+    }
+}
+
+/* Do a context-cache device-selective invalidation.
+ * @func_mask: FM field after shifting
+ */
+static void vtd_context_device_invalidate(IntelIOMMUState *s,
+                                          uint16_t source_id,
+                                          uint16_t func_mask)
+{
+    uint16_t mask;
+    VTDAddressSpace **pvtd_as;
+    VTDAddressSpace *vtd_as;
+    uint16_t devfn;
+    uint16_t devfn_it;
+
+    switch (func_mask & 3) {
+    case 0:
+        mask = 0;   /* No bits in the SID field masked */
+        break;
+    case 1:
+        mask = 4;   /* Mask bit 2 in the SID field */
+        break;
+    case 2:
+        mask = 6;   /* Mask bit 2:1 in the SID field */
+        break;
+    case 3:
+        mask = 7;   /* Mask bit 2:0 in the SID field */
+        break;
+    }
+    VTD_DPRINTF(INV, "device-selective invalidation source 0x%"PRIx16
+                    " mask %"PRIu16, source_id, mask);
+    pvtd_as = s->address_spaces[VTD_SID_TO_BUS(source_id)];
+    if (pvtd_as) {
+        devfn = VTD_SID_TO_DEVFN(source_id);
+        for (devfn_it = 0; devfn_it < VTD_PCI_DEVFN_MAX; ++devfn_it) {
+            vtd_as = pvtd_as[devfn_it];
+            if (vtd_as && ((devfn_it & mask) == (devfn & mask))) {
+                VTD_DPRINTF(INV, "invalidate context-cache of devfn 0x%"PRIx16,
+                            devfn_it);
+                vtd_as->context_cache_entry.context_cache_gen = 0;
+            }
+        }
+    }
+}
+
 /* Context-cache invalidation
  * Returns the Context Actual Invalidation Granularity.
  * @val: the content of the CCMD_REG
@@ -739,24 +837,23 @@ static uint64_t vtd_context_cache_invalidate(IntelIOMMUState *s, uint64_t val)
     uint64_t type = val & VTD_CCMD_CIRG_MASK;
 
     switch (type) {
+    case VTD_CCMD_DOMAIN_INVL:
+        VTD_DPRINTF(INV, "domain-selective invalidation domain 0x%"PRIx16,
+                    (uint16_t)VTD_CCMD_DID(val));
+        /* Fall through */
     case VTD_CCMD_GLOBAL_INVL:
-        VTD_DPRINTF(INV, "Global invalidation request");
+        VTD_DPRINTF(INV, "global invalidation");
         caig = VTD_CCMD_GLOBAL_INVL_A;
-        break;
-
-    case VTD_CCMD_DOMAIN_INVL:
-        VTD_DPRINTF(INV, "Domain-selective invalidation request");
-        caig = VTD_CCMD_DOMAIN_INVL_A;
+        vtd_context_global_invalidate(s);
         break;
 
     case VTD_CCMD_DEVICE_INVL:
-        VTD_DPRINTF(INV, "Domain-selective invalidation request");
         caig = VTD_CCMD_DEVICE_INVL_A;
+        vtd_context_device_invalidate(s, VTD_CCMD_SID(val), VTD_CCMD_FM(val));
         break;
 
     default:
-        VTD_DPRINTF(GENERAL,
-                    "error: wrong context-cache invalidation granularity");
+        VTD_DPRINTF(GENERAL, "error: invalid granularity");
         caig = 0;
     }
     return caig;
@@ -994,6 +1091,38 @@ static bool vtd_process_wait_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
     return true;
 }
 
+static bool vtd_process_context_cache_desc(IntelIOMMUState *s,
+                                           VTDInvDesc *inv_desc)
+{
+    if ((inv_desc->lo & VTD_INV_DESC_CC_RSVD) || inv_desc->hi) {
+        VTD_DPRINTF(GENERAL, "error: non-zero reserved field in Context-cache "
+                    "Invalidate Descriptor");
+        return false;
+    }
+    switch (inv_desc->lo & VTD_INV_DESC_CC_G) {
+    case VTD_INV_DESC_CC_DOMAIN:
+        VTD_DPRINTF(INV, "domain-selective invalidation domain 0x%"PRIx16,
+                    (uint16_t)VTD_INV_DESC_CC_DID(inv_desc->lo));
+        /* Fall through */
+    case VTD_INV_DESC_CC_GLOBAL:
+        VTD_DPRINTF(INV, "global invalidation");
+        vtd_context_global_invalidate(s);
+        break;
+
+    case VTD_INV_DESC_CC_DEVICE:
+        vtd_context_device_invalidate(s, VTD_INV_DESC_CC_SID(inv_desc->lo),
+                                      VTD_INV_DESC_CC_FM(inv_desc->lo));
+        break;
+
+    default:
+        VTD_DPRINTF(GENERAL, "error: invalid granularity in Context-cache "
+                    "Invalidate Descriptor hi 0x%"PRIx64  " lo 0x%"PRIx64,
+                    inv_desc->hi, inv_desc->lo);
+        return false;
+    }
+    return true;
+}
+
 static bool vtd_process_inv_desc(IntelIOMMUState *s)
 {
     VTDInvDesc inv_desc;
@@ -1012,6 +1141,9 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
     case VTD_INV_DESC_CC:
         VTD_DPRINTF(INV, "Context-cache Invalidate Descriptor hi 0x%"PRIx64
                     " lo 0x%"PRIx64, inv_desc.hi, inv_desc.lo);
+        if (!vtd_process_context_cache_desc(s, &inv_desc)) {
+            return false;
+        }
         break;
 
     case VTD_INV_DESC_IOTLB:
@@ -1453,8 +1585,6 @@ static IOMMUTLBEntry vtd_iommu_translate(MemoryRegion *iommu, hwaddr addr,
 {
     VTDAddressSpace *vtd_as = container_of(iommu, VTDAddressSpace, iommu);
     IntelIOMMUState *s = vtd_as->iommu_state;
-    uint8_t bus_num = vtd_as->bus_num;
-    uint8_t devfn = vtd_as->devfn;
     IOMMUTLBEntry ret = {
         .target_as = &address_space_memory,
         .iova = addr,
@@ -1472,13 +1602,13 @@ static IOMMUTLBEntry vtd_iommu_translate(MemoryRegion *iommu, hwaddr addr,
         return ret;
     }
 
-    vtd_do_iommu_translate(s, bus_num, devfn, addr, is_write, &ret);
-
+    vtd_do_iommu_translate(vtd_as, vtd_as->bus_num, vtd_as->devfn, addr,
+                           is_write, &ret);
     VTD_DPRINTF(MMU,
                 "bus %"PRIu8 " slot %"PRIu8 " func %"PRIu8 " devfn %"PRIu8
-                " gpa 0x%"PRIx64 " hpa 0x%"PRIx64, bus_num,
-                VTD_PCI_SLOT(devfn), VTD_PCI_FUNC(devfn), devfn, addr,
-                ret.translated_addr);
+                " gpa 0x%"PRIx64 " hpa 0x%"PRIx64, vtd_as->bus_num,
+                VTD_PCI_SLOT(vtd_as->devfn), VTD_PCI_FUNC(vtd_as->devfn),
+                vtd_as->devfn, addr, ret.translated_addr);
     return ret;
 }
 
@@ -1531,6 +1661,8 @@ static void vtd_init(IntelIOMMUState *s)
              VTD_CAP_SAGAW;
     s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
 
+    vtd_reset_context_cache(s);
+
     /* Define registers with default values and bit semantics */
     vtd_define_long(s, DMAR_VER_REG, 0x10UL, 0, 0);
     vtd_define_quad(s, DMAR_CAP_REG, s->cap, 0, 0);
diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c
index 057cab6..b20bad8 100644
--- a/hw/pci-host/q35.c
+++ b/hw/pci-host/q35.c
@@ -368,6 +368,7 @@ static AddressSpace *q35_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
         pvtd_as[devfn]->bus_num = (uint8_t)bus_num;
         pvtd_as[devfn]->devfn = (uint8_t)devfn;
         pvtd_as[devfn]->iommu_state = s;
+        pvtd_as[devfn]->context_cache_entry.context_cache_gen = 0;
         memory_region_init_iommu(&pvtd_as[devfn]->iommu, OBJECT(s),
                                  &s->iommu_ops, "intel_iommu", UINT64_MAX);
         address_space_init(&pvtd_as[devfn]->as,
-- 
MST

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [Qemu-devel] [PULL 08/13] intel-iommu: add IOTLB using hash table
  2014-09-02 15:07 [Qemu-devel] [PULL 00/13] pci, pc fixes, features Michael S. Tsirkin
                   ` (6 preceding siblings ...)
  2014-09-02 15:07 ` [Qemu-devel] [PULL 07/13] intel-iommu: add context-cache to cache context-entry Michael S. Tsirkin
@ 2014-09-02 15:07 ` Michael S. Tsirkin
  2014-09-02 15:07 ` [Qemu-devel] [PULL 09/13] vhost_net: cleanup start/stop condition Michael S. Tsirkin
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 19+ messages in thread
From: Michael S. Tsirkin @ 2014-09-02 15:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: Peter Maydell, Le Tan, Anthony Liguori

From: Le Tan <tamlokveer@gmail.com>

Add an IOTLB to cache information about the translation of input-addresses. The
IOTLB uses a GHashTable as the cache. The key of the hash table is the bitwise
OR of the gfn and the source id, with the source id left-shifted past the gfn
bits.

Signed-off-by: Le Tan <tamlokveer@gmail.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 hw/i386/intel_iommu_internal.h |  34 ++++++-
 include/hw/i386/intel_iommu.h  |  11 ++-
 hw/i386/intel_iommu.c          | 213 ++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 251 insertions(+), 7 deletions(-)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 30c318d..ba288ab 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -111,6 +111,10 @@
 #define VTD_INTERRUPT_ADDR_FIRST    0xfee00000ULL
 #define VTD_INTERRUPT_ADDR_LAST     0xfeefffffULL
 
+/* The shift of source_id in the key of IOTLB hash table */
+#define VTD_IOTLB_SID_SHIFT         36
+#define VTD_IOTLB_MAX_SIZE          1024    /* Max size of the hash table */
+
 /* IOTLB_REG */
 #define VTD_TLB_GLOBAL_FLUSH        (1ULL << 60) /* Global invalidation */
 #define VTD_TLB_DSI_FLUSH           (2ULL << 60) /* Domain-selective */
@@ -121,6 +125,11 @@
 #define VTD_TLB_PSI_FLUSH_A         (3ULL << 57)
 #define VTD_TLB_FLUSH_GRANU_MASK_A  (3ULL << 57)
 #define VTD_TLB_IVT                 (1ULL << 63)
+#define VTD_TLB_DID(val)            (((val) >> 32) & VTD_DOMAIN_ID_MASK)
+
+/* IVA_REG */
+#define VTD_IVA_ADDR(val)       ((val) & ~0xfffULL & ((1ULL << VTD_MGAW) - 1))
+#define VTD_IVA_AM(val)         ((val) & 0x3fULL)
 
 /* GCMD_REG */
 #define VTD_GCMD_TE                 (1UL << 31)
@@ -176,6 +185,9 @@
 #define VTD_CAP_ND                  (((VTD_DOMAIN_ID_SHIFT - 4) / 2) & 7ULL)
 #define VTD_MGAW                    39  /* Maximum Guest Address Width */
 #define VTD_CAP_MGAW                (((VTD_MGAW - 1) & 0x3fULL) << 16)
+#define VTD_MAMV                    9ULL
+#define VTD_CAP_MAMV                (VTD_MAMV << 48)
+#define VTD_CAP_PSI                 (1ULL << 39)
 
 /* Supported Adjusted Guest Address Widths */
 #define VTD_CAP_SAGAW_SHIFT         8
@@ -293,6 +305,26 @@ typedef struct VTDInvDesc VTDInvDesc;
 #define VTD_INV_DESC_CC_FM(val)         (((val) >> 48) & 3UL)
 #define VTD_INV_DESC_CC_RSVD            0xfffc00000000ffc0ULL
 
+/* Masks for IOTLB Invalidate Descriptor */
+#define VTD_INV_DESC_IOTLB_G            (3ULL << 4)
+#define VTD_INV_DESC_IOTLB_GLOBAL       (1ULL << 4)
+#define VTD_INV_DESC_IOTLB_DOMAIN       (2ULL << 4)
+#define VTD_INV_DESC_IOTLB_PAGE         (3ULL << 4)
+#define VTD_INV_DESC_IOTLB_DID(val)     (((val) >> 16) & VTD_DOMAIN_ID_MASK)
+#define VTD_INV_DESC_IOTLB_ADDR(val)    ((val) & ~0xfffULL & \
+                                         ((1ULL << VTD_MGAW) - 1))
+#define VTD_INV_DESC_IOTLB_AM(val)      ((val) & 0x3fULL)
+#define VTD_INV_DESC_IOTLB_RSVD_LO      0xffffffff0000ff00ULL
+#define VTD_INV_DESC_IOTLB_RSVD_HI      0xf80ULL
+
+/* Information about page-selective IOTLB invalidate */
+struct VTDIOTLBPageInvInfo {
+    uint16_t domain_id;
+    uint64_t gfn;
+    uint8_t mask;
+};
+typedef struct VTDIOTLBPageInvInfo VTDIOTLBPageInvInfo;
+
 /* Pagesize of VTD paging structures, including root and context tables */
 #define VTD_PAGE_SHIFT              12
 #define VTD_PAGE_SIZE               (1ULL << VTD_PAGE_SHIFT)
@@ -330,7 +362,7 @@ typedef struct VTDRootEntry VTDRootEntry;
 #define VTD_CONTEXT_ENTRY_RSVD_LO   (0xff0ULL | ~VTD_HAW_MASK)
 /* hi */
 #define VTD_CONTEXT_ENTRY_AW        7ULL /* Adjusted guest-address-width */
-#define VTD_CONTEXT_ENTRY_DID       (0xffffULL << 8) /* Domain Identifier */
+#define VTD_CONTEXT_ENTRY_DID(val)  (((val) >> 8) & VTD_DOMAIN_ID_MASK)
 #define VTD_CONTEXT_ENTRY_RSVD_HI   0xffffffffff000080ULL
 
 #define VTD_CONTEXT_ENTRY_NR        (VTD_PAGE_SIZE / sizeof(VTDContextEntry))
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index d9a5215..f4701e1 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -48,7 +48,7 @@ typedef struct VTDContextEntry VTDContextEntry;
 typedef struct VTDContextCacheEntry VTDContextCacheEntry;
 typedef struct IntelIOMMUState IntelIOMMUState;
 typedef struct VTDAddressSpace VTDAddressSpace;
-
+typedef struct VTDIOTLBEntry VTDIOTLBEntry;
 
 /* Context-Entry */
 struct VTDContextEntry {
@@ -73,6 +73,14 @@ struct VTDAddressSpace {
     VTDContextCacheEntry context_cache_entry;
 };
 
+struct VTDIOTLBEntry {
+    uint64_t gfn;
+    uint16_t domain_id;
+    uint64_t slpte;
+    bool read_flags;
+    bool write_flags;
+};
+
 /* The iommu (DMAR) device state struct */
 struct IntelIOMMUState {
     SysBusDevice busdev;
@@ -103,6 +111,7 @@ struct IntelIOMMUState {
     uint64_t ecap;                  /* The value of extended capability reg */
 
     uint32_t context_cache_gen;     /* Should be in [1,MAX] */
+    GHashTable *iotlb;              /* IOTLB */
 
     MemoryRegionIOMMUOps iommu_ops;
     VTDAddressSpace **address_spaces[VTD_PCI_BUS_MAX];
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index c514310..0a4282a 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -132,6 +132,35 @@ static uint64_t vtd_set_clear_mask_quad(IntelIOMMUState *s, hwaddr addr,
     return new_val;
 }
 
+/* GHashTable functions */
+static gboolean vtd_uint64_equal(gconstpointer v1, gconstpointer v2)
+{
+    return *((const uint64_t *)v1) == *((const uint64_t *)v2);
+}
+
+static guint vtd_uint64_hash(gconstpointer v)
+{
+    return (guint)*(const uint64_t *)v;
+}
+
+static gboolean vtd_hash_remove_by_domain(gpointer key, gpointer value,
+                                          gpointer user_data)
+{
+    VTDIOTLBEntry *entry = (VTDIOTLBEntry *)value;
+    uint16_t domain_id = *(uint16_t *)user_data;
+    return entry->domain_id == domain_id;
+}
+
+static gboolean vtd_hash_remove_by_page(gpointer key, gpointer value,
+                                        gpointer user_data)
+{
+    VTDIOTLBEntry *entry = (VTDIOTLBEntry *)value;
+    VTDIOTLBPageInvInfo *info = (VTDIOTLBPageInvInfo *)user_data;
+    uint64_t gfn = info->gfn & info->mask;
+    return (entry->domain_id == info->domain_id) &&
+            ((entry->gfn & info->mask) == gfn);
+}
+
 /* Reset all the gen of VTDAddressSpace to zero and set the gen of
  * IntelIOMMUState to 1.
  */
@@ -159,6 +188,48 @@ static void vtd_reset_context_cache(IntelIOMMUState *s)
     s->context_cache_gen = 1;
 }
 
+static void vtd_reset_iotlb(IntelIOMMUState *s)
+{
+    assert(s->iotlb);
+    g_hash_table_remove_all(s->iotlb);
+}
+
+static VTDIOTLBEntry *vtd_lookup_iotlb(IntelIOMMUState *s, uint16_t source_id,
+                                       hwaddr addr)
+{
+    uint64_t key;
+
+    key = (addr >> VTD_PAGE_SHIFT_4K) |
+           ((uint64_t)(source_id) << VTD_IOTLB_SID_SHIFT);
+    return g_hash_table_lookup(s->iotlb, &key);
+
+}
+
+static void vtd_update_iotlb(IntelIOMMUState *s, uint16_t source_id,
+                             uint16_t domain_id, hwaddr addr, uint64_t slpte,
+                             bool read_flags, bool write_flags)
+{
+    VTDIOTLBEntry *entry = g_malloc(sizeof(*entry));
+    uint64_t *key = g_malloc(sizeof(*key));
+    uint64_t gfn = addr >> VTD_PAGE_SHIFT_4K;
+
+    VTD_DPRINTF(CACHE, "update iotlb sid 0x%"PRIx16 " gpa 0x%"PRIx64
+                " slpte 0x%"PRIx64 " did 0x%"PRIx16, source_id, addr, slpte,
+                domain_id);
+    if (g_hash_table_size(s->iotlb) >= VTD_IOTLB_MAX_SIZE) {
+        VTD_DPRINTF(CACHE, "iotlb exceeds size limit, forced to reset");
+        vtd_reset_iotlb(s);
+    }
+
+    entry->gfn = gfn;
+    entry->domain_id = domain_id;
+    entry->slpte = slpte;
+    entry->read_flags = read_flags;
+    entry->write_flags = write_flags;
+    *key = gfn | ((uint64_t)(source_id) << VTD_IOTLB_SID_SHIFT);
+    g_hash_table_replace(s->iotlb, key, entry);
+}
+
 /* Given the reg addr of both the message data and address, generate an
  * interrupt via MSI.
  */
@@ -693,6 +764,7 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, uint8_t bus_num,
     bool is_fpd_set = false;
     bool reads = true;
     bool writes = true;
+    VTDIOTLBEntry *iotlb_entry;
 
     /* Check if the request is in interrupt address range */
     if (vtd_is_interrupt_addr(addr)) {
@@ -716,6 +788,17 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, uint8_t bus_num,
             return;
         }
     }
+    /* Try to fetch slpte from IOTLB */
+    iotlb_entry = vtd_lookup_iotlb(s, source_id, addr);
+    if (iotlb_entry) {
+        VTD_DPRINTF(CACHE, "hit iotlb sid 0x%"PRIx16 " gpa 0x%"PRIx64
+                    " slpte 0x%"PRIx64 " did 0x%"PRIx16, source_id, addr,
+                    iotlb_entry->slpte, iotlb_entry->domain_id);
+        slpte = iotlb_entry->slpte;
+        reads = iotlb_entry->read_flags;
+        writes = iotlb_entry->write_flags;
+        goto out;
+    }
     /* Try to fetch context-entry from cache first */
     if (cc_entry->context_cache_gen == s->context_cache_gen) {
         VTD_DPRINTF(CACHE, "hit context-cache bus %d devfn %d "
@@ -760,6 +843,9 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, uint8_t bus_num,
         return;
     }
 
+    vtd_update_iotlb(s, source_id, VTD_CONTEXT_ENTRY_DID(ce.hi), addr, slpte,
+                     reads, writes);
+out:
     entry->iova = addr & VTD_PAGE_MASK_4K;
     entry->translated_addr = vtd_get_slpte_addr(slpte) & VTD_PAGE_MASK_4K;
     entry->addr_mask = ~VTD_PAGE_MASK_4K;
@@ -859,6 +945,29 @@ static uint64_t vtd_context_cache_invalidate(IntelIOMMUState *s, uint64_t val)
     return caig;
 }
 
+static void vtd_iotlb_global_invalidate(IntelIOMMUState *s)
+{
+    vtd_reset_iotlb(s);
+}
+
+static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
+{
+    g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_domain,
+                                &domain_id);
+}
+
+static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
+                                      hwaddr addr, uint8_t am)
+{
+    VTDIOTLBPageInvInfo info;
+
+    assert(am <= VTD_MAMV);
+    info.domain_id = domain_id;
+    info.gfn = addr >> VTD_PAGE_SHIFT_4K;
+    info.mask = ~((1 << am) - 1);
+    g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_page, &info);
+}
+
 /* Flush IOTLB
  * Returns the IOTLB Actual Invalidation Granularity.
  * @val: the content of the IOTLB_REG
@@ -867,25 +976,44 @@ static uint64_t vtd_iotlb_flush(IntelIOMMUState *s, uint64_t val)
 {
     uint64_t iaig;
     uint64_t type = val & VTD_TLB_FLUSH_GRANU_MASK;
+    uint16_t domain_id;
+    hwaddr addr;
+    uint8_t am;
 
     switch (type) {
     case VTD_TLB_GLOBAL_FLUSH:
-        VTD_DPRINTF(INV, "Global IOTLB flush");
+        VTD_DPRINTF(INV, "global invalidation");
         iaig = VTD_TLB_GLOBAL_FLUSH_A;
+        vtd_iotlb_global_invalidate(s);
         break;
 
     case VTD_TLB_DSI_FLUSH:
-        VTD_DPRINTF(INV, "Domain-selective IOTLB flush");
+        domain_id = VTD_TLB_DID(val);
+        VTD_DPRINTF(INV, "domain-selective invalidation domain 0x%"PRIx16,
+                    domain_id);
         iaig = VTD_TLB_DSI_FLUSH_A;
+        vtd_iotlb_domain_invalidate(s, domain_id);
         break;
 
     case VTD_TLB_PSI_FLUSH:
-        VTD_DPRINTF(INV, "Page-selective-within-domain IOTLB flush");
+        domain_id = VTD_TLB_DID(val);
+        addr = vtd_get_quad_raw(s, DMAR_IVA_REG);
+        am = VTD_IVA_AM(addr);
+        addr = VTD_IVA_ADDR(addr);
+        VTD_DPRINTF(INV, "page-selective invalidation domain 0x%"PRIx16
+                    " addr 0x%"PRIx64 " mask %"PRIu8, domain_id, addr, am);
+        if (am > VTD_MAMV) {
+            VTD_DPRINTF(GENERAL, "error: supported max address mask value is "
+                        "%"PRIu8, (uint8_t)VTD_MAMV);
+            iaig = 0;
+            break;
+        }
         iaig = VTD_TLB_PSI_FLUSH_A;
+        vtd_iotlb_page_invalidate(s, domain_id, addr, am);
         break;
 
     default:
-        VTD_DPRINTF(GENERAL, "error: wrong iotlb flush granularity");
+        VTD_DPRINTF(GENERAL, "error: invalid granularity");
         iaig = 0;
     }
     return iaig;
@@ -1123,6 +1251,56 @@ static bool vtd_process_context_cache_desc(IntelIOMMUState *s,
     return true;
 }
 
+static bool vtd_process_iotlb_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
+{
+    uint16_t domain_id;
+    uint8_t am;
+    hwaddr addr;
+
+    if ((inv_desc->lo & VTD_INV_DESC_IOTLB_RSVD_LO) ||
+        (inv_desc->hi & VTD_INV_DESC_IOTLB_RSVD_HI)) {
+        VTD_DPRINTF(GENERAL, "error: non-zero reserved field in IOTLB "
+                    "Invalidate Descriptor hi 0x%"PRIx64 " lo 0x%"PRIx64,
+                    inv_desc->hi, inv_desc->lo);
+        return false;
+    }
+
+    switch (inv_desc->lo & VTD_INV_DESC_IOTLB_G) {
+    case VTD_INV_DESC_IOTLB_GLOBAL:
+        VTD_DPRINTF(INV, "global invalidation");
+        vtd_iotlb_global_invalidate(s);
+        break;
+
+    case VTD_INV_DESC_IOTLB_DOMAIN:
+        domain_id = VTD_INV_DESC_IOTLB_DID(inv_desc->lo);
+        VTD_DPRINTF(INV, "domain-selective invalidation domain 0x%"PRIx16,
+                    domain_id);
+        vtd_iotlb_domain_invalidate(s, domain_id);
+        break;
+
+    case VTD_INV_DESC_IOTLB_PAGE:
+        domain_id = VTD_INV_DESC_IOTLB_DID(inv_desc->lo);
+        addr = VTD_INV_DESC_IOTLB_ADDR(inv_desc->hi);
+        am = VTD_INV_DESC_IOTLB_AM(inv_desc->hi);
+        VTD_DPRINTF(INV, "page-selective invalidation domain 0x%"PRIx16
+                    " addr 0x%"PRIx64 " mask %"PRIu8, domain_id, addr, am);
+        if (am > VTD_MAMV) {
+            VTD_DPRINTF(GENERAL, "error: supported max address mask value is "
+                        "%"PRIu8, (uint8_t)VTD_MAMV);
+            return false;
+        }
+        vtd_iotlb_page_invalidate(s, domain_id, addr, am);
+        break;
+
+    default:
+        VTD_DPRINTF(GENERAL, "error: invalid granularity in IOTLB Invalidate "
+                    "Descriptor hi 0x%"PRIx64 " lo 0x%"PRIx64,
+                    inv_desc->hi, inv_desc->lo);
+        return false;
+    }
+    return true;
+}
+
 static bool vtd_process_inv_desc(IntelIOMMUState *s)
 {
     VTDInvDesc inv_desc;
@@ -1149,6 +1327,9 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
     case VTD_INV_DESC_IOTLB:
         VTD_DPRINTF(INV, "IOTLB Invalidate Descriptor hi 0x%"PRIx64
                     " lo 0x%"PRIx64, inv_desc.hi, inv_desc.lo);
+        if (!vtd_process_iotlb_desc(s, &inv_desc)) {
+            return false;
+        }
         break;
 
     case VTD_INV_DESC_WAIT:
@@ -1382,6 +1563,24 @@ static void vtd_mem_write(void *opaque, hwaddr addr,
         vtd_handle_iotlb_write(s);
         break;
 
+    /* Invalidate Address Register, 64-bit */
+    case DMAR_IVA_REG:
+        VTD_DPRINTF(INV, "DMAR_IVA_REG write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        if (size == 4) {
+            vtd_set_long(s, addr, val);
+        } else {
+            vtd_set_quad(s, addr, val);
+        }
+        break;
+
+    case DMAR_IVA_REG_HI:
+        VTD_DPRINTF(INV, "DMAR_IVA_REG_HI write addr 0x%"PRIx64
+                    ", size %d, val 0x%"PRIx64, addr, size, val);
+        assert(size == 4);
+        vtd_set_long(s, addr, val);
+        break;
+
     /* Fault Status Register, 32-bit */
     case DMAR_FSTS_REG:
         VTD_DPRINTF(FLOG, "DMAR_FSTS_REG write addr 0x%"PRIx64
@@ -1658,10 +1857,11 @@ static void vtd_init(IntelIOMMUState *s)
     s->iq_last_desc_type = VTD_INV_DESC_NONE;
     s->next_frcd_reg = 0;
     s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND | VTD_CAP_MGAW |
-             VTD_CAP_SAGAW;
+             VTD_CAP_SAGAW | VTD_CAP_MAMV | VTD_CAP_PSI;
     s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
 
     vtd_reset_context_cache(s);
+    vtd_reset_iotlb(s);
 
     /* Define registers with default values and bit semantics */
     vtd_define_long(s, DMAR_VER_REG, 0x10UL, 0, 0);
@@ -1731,6 +1931,9 @@ static void vtd_realize(DeviceState *dev, Error **errp)
     memory_region_init_io(&s->csrmem, OBJECT(s), &vtd_mem_ops, s,
                           "intel_iommu", DMAR_REG_SIZE);
     sysbus_init_mmio(SYS_BUS_DEVICE(s), &s->csrmem);
+    /* No corresponding destroy */
+    s->iotlb = g_hash_table_new_full(vtd_uint64_hash, vtd_uint64_equal,
+                                     g_free, g_free);
     vtd_init(s);
 }
 
-- 
MST

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [Qemu-devel] [PULL 09/13] vhost_net: cleanup start/stop condition
  2014-09-02 15:07 [Qemu-devel] [PULL 00/13] pci, pc fixes, features Michael S. Tsirkin
                   ` (7 preceding siblings ...)
  2014-09-02 15:07 ` [Qemu-devel] [PULL 08/13] intel-iommu: add IOTLB using hash table Michael S. Tsirkin
@ 2014-09-02 15:07 ` Michael S. Tsirkin
  2014-09-02 15:07 ` [Qemu-devel] [PULL 10/13] ioh3420: remove unused ioh3420_init() declaration Michael S. Tsirkin
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 19+ messages in thread
From: Michael S. Tsirkin @ 2014-09-02 15:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: Peter Maydell, Amos Kong, Anthony Liguori

Checking vhost device internal state in vhost_net looks like
a layering violation, since vhost_net does not
set this flag: it is set and tested by vhost.c.
There also seems to be no reason to check it:
the caller in virtio-net uses its own flag,
vhost_started, to ensure vhost is started/stopped
as appropriate.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Amos Kong <akong@redhat.com>
---
 hw/net/vhost_net.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
index f87c798..9bbf2ee 100644
--- a/hw/net/vhost_net.c
+++ b/hw/net/vhost_net.c
@@ -195,10 +195,6 @@ static int vhost_net_start_one(struct vhost_net *net,
     struct vhost_vring_file file = { };
     int r;
 
-    if (net->dev.started) {
-        return 0;
-    }
-
     net->dev.nvqs = 2;
     net->dev.vqs = net->vqs;
     net->dev.vq_index = vq_index;
@@ -256,10 +252,6 @@ static void vhost_net_stop_one(struct vhost_net *net,
 {
     struct vhost_vring_file file = { .fd = -1 };
 
-    if (!net->dev.started) {
-        return;
-    }
-
     if (net->nc->info->type == NET_CLIENT_OPTIONS_KIND_TAP) {
         for (file.index = 0; file.index < net->dev.nvqs; ++file.index) {
             const VhostOps *vhost_ops = net->dev.vhost_ops;
-- 
MST


* [Qemu-devel] [PULL 10/13] ioh3420: remove unused ioh3420_init() declaration
  2014-09-02 15:07 [Qemu-devel] [PULL 00/13] pci, pc fixes, features Michael S. Tsirkin
                   ` (8 preceding siblings ...)
  2014-09-02 15:07 ` [Qemu-devel] [PULL 09/13] vhost_net: cleanup start/stop condition Michael S. Tsirkin
@ 2014-09-02 15:07 ` Michael S. Tsirkin
  2014-09-02 15:07 ` [Qemu-devel] [PULL 11/13] virtio-net: don't run bh on vm stopped Michael S. Tsirkin
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 19+ messages in thread
From: Michael S. Tsirkin @ 2014-09-02 15:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: Peter Maydell, Gonglei, Knut Omang, Anthony Liguori

From: Gonglei <arei.gonglei@huawei.com>

commit 0f9b1771ccc65873a8376c81200a437aa58c2f6d
    ioh3420: Remove obsoleted, unused ioh3420_init function
removed the implementation of ioh3420_init.

Drop the declaration from the header file as well.

Signed-off-by: Gonglei <arei.gonglei@huawei.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 hw/pci-bridge/ioh3420.h | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/hw/pci-bridge/ioh3420.h b/hw/pci-bridge/ioh3420.h
index 7776e5b..ea423cb 100644
--- a/hw/pci-bridge/ioh3420.h
+++ b/hw/pci-bridge/ioh3420.h
@@ -3,8 +3,4 @@
 
 #include "hw/pci/pcie_port.h"
 
-PCIESlot *ioh3420_init(PCIBus *bus, int devfn, bool multifunction,
-                       const char *bus_name, pci_map_irq_fn map_irq,
-                       uint8_t port, uint8_t chassis, uint16_t slot);
-
 #endif /* QEMU_IOH3420_H */
-- 
MST


* [Qemu-devel] [PULL 11/13] virtio-net: don't run bh on vm stopped
  2014-09-02 15:07 [Qemu-devel] [PULL 00/13] pci, pc fixes, features Michael S. Tsirkin
                   ` (9 preceding siblings ...)
  2014-09-02 15:07 ` [Qemu-devel] [PULL 10/13] ioh3420: remove unused ioh3420_init() declaration Michael S. Tsirkin
@ 2014-09-02 15:07 ` Michael S. Tsirkin
  2014-09-02 15:07 ` [Qemu-devel] [PULL 12/13] pci: avoid losing config updates to MSI/MSIX cap regs Michael S. Tsirkin
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 19+ messages in thread
From: Michael S. Tsirkin @ 2014-09-02 15:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: Peter Maydell, Stefan Hajnoczi, qemu-stable, Anthony Liguori

commit 783e7706937fe15523b609b545587a028a2bdd03
    virtio-net: stop/start bh when appropriate

is incomplete: the BH might execute within the same main loop iteration
but after vmstop, so in theory we might trigger an assertion.
I was unable to reproduce this in practice, but it seems clear enough
that the potential is there, so it is worth fixing.

Cc: qemu-stable@nongnu.org
Reported-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 hw/net/virtio-net.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 268eff9..365e266 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -1224,7 +1224,12 @@ static void virtio_net_tx_timer(void *opaque)
     VirtIONetQueue *q = opaque;
     VirtIONet *n = q->n;
     VirtIODevice *vdev = VIRTIO_DEVICE(n);
-    assert(vdev->vm_running);
+    /* This happens when device was stopped but BH wasn't. */
+    if (!vdev->vm_running) {
+        /* Make sure tx waiting is set, so we'll run when restarted. */
+        assert(q->tx_waiting);
+        return;
+    }
 
     q->tx_waiting = 0;
 
@@ -1244,7 +1249,12 @@ static void virtio_net_tx_bh(void *opaque)
     VirtIODevice *vdev = VIRTIO_DEVICE(n);
     int32_t ret;
 
-    assert(vdev->vm_running);
+    /* This happens when device was stopped but BH wasn't. */
+    if (!vdev->vm_running) {
+        /* Make sure tx waiting is set, so we'll run when restarted. */
+        assert(q->tx_waiting);
+        return;
+    }
 
     q->tx_waiting = 0;
 
-- 
MST


* [Qemu-devel] [PULL 12/13] pci: avoid losing config updates to MSI/MSIX cap regs
  2014-09-02 15:07 [Qemu-devel] [PULL 00/13] pci, pc fixes, features Michael S. Tsirkin
                   ` (10 preceding siblings ...)
  2014-09-02 15:07 ` [Qemu-devel] [PULL 11/13] virtio-net: don't run bh on vm stopped Michael S. Tsirkin
@ 2014-09-02 15:07 ` Michael S. Tsirkin
  2014-09-02 15:07 ` [Qemu-devel] [PULL 13/13] vhost_net: start/stop guest notifiers properly Michael S. Tsirkin
  2014-09-03 11:26 ` [Qemu-devel] [PULL 00/13] pci, pc fixes, features Michael S. Tsirkin
  13 siblings, 0 replies; 19+ messages in thread
From: Michael S. Tsirkin @ 2014-09-02 15:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: Peter Maydell, Knut Omang, qemu-stable, Anthony Liguori

From: Knut Omang <knut.omang@oracle.com>

Since
commit 95d658002401e2e47a5404298ebe9508846e8a39
    msi: Invoke msi/msix_write_config from PCI core
msix config writes are lost: the value written is always 0.

Fix pci_default_write_config to avoid this.
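The bug class fixed here is easy to reproduce in isolation. The sketch below (illustrative names, not QEMU's actual helpers) shows how the per-byte config-write loop consumes `val` by shifting it right one byte per iteration, so anything that reads `val` after the loop sees 0 for writes of up to 4 bytes; keeping the original in a separate `val_in`, as the patch does, preserves it for the MSI/MSI-X handlers:

```c
#include <stdint.h>

/* Model of pci_default_write_config's loop: it shifts val right by
 * 8 bits per byte written, destroying the value for later consumers.
 * Returns what a post-loop call like msi_write_config() would see. */
static uint32_t leftover_after_loop(uint32_t val_in, int l)
{
    uint32_t val = val_in;
    int i;

    for (i = 0; i < l; val >>= 8, ++i) {
        /* the per-byte config-space update would happen here,
         * consuming the low byte of val */
    }
    return val; /* 0 for any l >= 4: the original value is gone */
}
```

The fix is simply to pass the untouched `val_in` to the MSI/MSI-X write hooks instead of the shifted-out `val`.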

Cc: qemu-stable@nongnu.org
Signed-off-by: Knut Omang <knut.omang@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 hw/pci/pci.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index daeaeac..d1e9a2a 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -1146,9 +1146,10 @@ uint32_t pci_default_read_config(PCIDevice *d,
     return le32_to_cpu(val);
 }
 
-void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val, int l)
+void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val_in, int l)
 {
     int i, was_irq_disabled = pci_irq_disabled(d);
+    uint32_t val = val_in;
 
     for (i = 0; i < l; val >>= 8, ++i) {
         uint8_t wmask = d->wmask[addr + i];
@@ -1170,8 +1171,8 @@ void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val, int l)
                                     & PCI_COMMAND_MASTER);
     }
 
-    msi_write_config(d, addr, val, l);
-    msix_write_config(d, addr, val, l);
+    msi_write_config(d, addr, val_in, l);
+    msix_write_config(d, addr, val_in, l);
 }
 
 /***********************************************************/
-- 
MST


* [Qemu-devel] [PULL 13/13] vhost_net: start/stop guest notifiers properly
  2014-09-02 15:07 [Qemu-devel] [PULL 00/13] pci, pc fixes, features Michael S. Tsirkin
                   ` (11 preceding siblings ...)
  2014-09-02 15:07 ` [Qemu-devel] [PULL 12/13] pci: avoid losing config updates to MSI/MSIX cap regs Michael S. Tsirkin
@ 2014-09-02 15:07 ` Michael S. Tsirkin
  2014-09-02 15:47   ` William Dauchy
  2014-09-03 11:26 ` [Qemu-devel] [PULL 00/13] pci, pc fixes, features Michael S. Tsirkin
  13 siblings, 1 reply; 19+ messages in thread
From: Michael S. Tsirkin @ 2014-09-02 15:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Maydell, Jason Wang, qemu-stable, William Dauchy,
	Anthony Liguori, Zhangjie (HZ)

From: Jason Wang <jasowang@redhat.com>

commit a9f98bb5ebe6fb1869321dcc58e72041ae626ad8 vhost: multiqueue
support changed the order of stopping the device. Previously
vhost_dev_stop would disable backend and only afterwards, unset guest
notifiers. We now unset guest notifiers while vhost is still
active. This can lose interrupts causing guest networking to fail. In
particular, this has been observed during migration.

To adapt this, several other changes are needed:
- remove the hdev->started assertion in vhost.c since we may want to
start the guest notifiers before vhost starts and stop the guest
notifiers after vhost is stopped.
- introduce the vhost_net_set_vq_index() and call it before setting
guest notifiers. This is used to guarantee vhost_net has the correct
virtqueue index when setting guest notifiers.

Cc: qemu-stable@nongnu.org
Reported-by: "Zhangjie (HZ)" <zhangjie14@huawei.com>
Tested-by: William Dauchy <wdauchy@gmail.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 hw/net/vhost_net.c | 31 +++++++++++++++++++------------
 hw/virtio/vhost.c  |  2 --
 2 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
index 9bbf2ee..ba5d544 100644
--- a/hw/net/vhost_net.c
+++ b/hw/net/vhost_net.c
@@ -188,16 +188,19 @@ bool vhost_net_query(VHostNetState *net, VirtIODevice *dev)
     return vhost_dev_query(&net->dev, dev);
 }
 
+static void vhost_net_set_vq_index(struct vhost_net *net, int vq_index)
+{
+    net->dev.vq_index = vq_index;
+}
+
 static int vhost_net_start_one(struct vhost_net *net,
-                               VirtIODevice *dev,
-                               int vq_index)
+                               VirtIODevice *dev)
 {
     struct vhost_vring_file file = { };
     int r;
 
     net->dev.nvqs = 2;
     net->dev.vqs = net->vqs;
-    net->dev.vq_index = vq_index;
 
     r = vhost_dev_enable_notifiers(&net->dev, dev);
     if (r < 0) {
@@ -301,11 +304,7 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs,
     }
 
     for (i = 0; i < total_queues; i++) {
-        r = vhost_net_start_one(get_vhost_net(ncs[i].peer), dev, i * 2);
-
-        if (r < 0) {
-            goto err;
-        }
+        vhost_net_set_vq_index(get_vhost_net(ncs[i].peer), i * 2);
     }
 
     r = k->set_guest_notifiers(qbus->parent, total_queues * 2, true);
@@ -314,6 +313,14 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs,
         goto err;
     }
 
+    for (i = 0; i < total_queues; i++) {
+        r = vhost_net_start_one(get_vhost_net(ncs[i].peer), dev);
+
+        if (r < 0) {
+            goto err;
+        }
+    }
+
     return 0;
 
 err:
@@ -331,16 +338,16 @@ void vhost_net_stop(VirtIODevice *dev, NetClientState *ncs,
     VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(vbus);
     int i, r;
 
+    for (i = 0; i < total_queues; i++) {
+        vhost_net_stop_one(get_vhost_net(ncs[i].peer), dev);
+    }
+
     r = k->set_guest_notifiers(qbus->parent, total_queues * 2, false);
     if (r < 0) {
         fprintf(stderr, "vhost guest notifier cleanup failed: %d\n", r);
         fflush(stderr);
     }
     assert(r >= 0);
-
-    for (i = 0; i < total_queues; i++) {
-        vhost_net_stop_one(get_vhost_net(ncs[i].peer), dev);
-    }
 }
 
 void vhost_net_cleanup(struct vhost_net *net)
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index e55fe1c..5d7c40a 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -976,7 +976,6 @@ void vhost_dev_disable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev)
 bool vhost_virtqueue_pending(struct vhost_dev *hdev, int n)
 {
     struct vhost_virtqueue *vq = hdev->vqs + n - hdev->vq_index;
-    assert(hdev->started);
     assert(n >= hdev->vq_index && n < hdev->vq_index + hdev->nvqs);
     return event_notifier_test_and_clear(&vq->masked_notifier);
 }
@@ -988,7 +987,6 @@ void vhost_virtqueue_mask(struct vhost_dev *hdev, VirtIODevice *vdev, int n,
     struct VirtQueue *vvq = virtio_get_queue(vdev, n);
     int r, index = n - hdev->vq_index;
 
-    assert(hdev->started);
     assert(n >= hdev->vq_index && n < hdev->vq_index + hdev->nvqs);
 
     struct vhost_vring_file file = {
-- 
MST


* Re: [Qemu-devel] [PULL 13/13] vhost_net: start/stop guest notifiers properly
  2014-09-02 15:07 ` [Qemu-devel] [PULL 13/13] vhost_net: start/stop guest notifiers properly Michael S. Tsirkin
@ 2014-09-02 15:47   ` William Dauchy
  2014-09-02 16:01     ` Michael S. Tsirkin
  0 siblings, 1 reply; 19+ messages in thread
From: William Dauchy @ 2014-09-02 15:47 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Peter Maydell, Jason Wang, qemu-devel, qemu-stable,
	William Dauchy, Anthony Liguori, Zhangjie (HZ)


On Sep02 18:07, Michael S. Tsirkin wrote:
> From: Jason Wang <jasowang@redhat.com>
> 
> commit a9f98bb5ebe6fb1869321dcc58e72041ae626ad8 vhost: multiqueue
> support changed the order of stopping the device. Previously
> vhost_dev_stop would disable backend and only afterwards, unset guest
> notifiers. We now unset guest notifiers while vhost is still
> active. This can lose interrupts causing guest networking to fail. In
> particular, this has been observed during migration.
> 
> To adapt this, several other changes are needed:
> - remove the hdev->started assertion in vhost.c since we may want to
> start the guest notifiers before vhost starts and stop the guest
> notifiers after vhost is stopped.
> - introduce the vhost_net_set_vq_index() and call it before setting
> guest notifiers. This is used to guarantee vhost_net has the correct
> virtqueue index when setting guest notifiers.
> 
> Cc: qemu-stable@nongnu.org
> Reported-by: "Zhangjie (HZ)" <zhangjie14@huawei.com>
> Tested-by: William Dauchy <wdauchy@gmail.com>

please use:

Tested-by: William Dauchy <william@gandi.net>

> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

Thanks,
-- 
William



* Re: [Qemu-devel] [PULL 13/13] vhost_net: start/stop guest notifiers properly
  2014-09-02 15:47   ` William Dauchy
@ 2014-09-02 16:01     ` Michael S. Tsirkin
  0 siblings, 0 replies; 19+ messages in thread
From: Michael S. Tsirkin @ 2014-09-02 16:01 UTC (permalink / raw)
  To: William Dauchy
  Cc: Peter Maydell, Jason Wang, qemu-devel, qemu-stable,
	William Dauchy, Anthony Liguori, Zhangjie (HZ)

On Tue, Sep 02, 2014 at 05:47:54PM +0200, William Dauchy wrote:
> On Sep02 18:07, Michael S. Tsirkin wrote:
> > From: Jason Wang <jasowang@redhat.com>
> > 
> > commit a9f98bb5ebe6fb1869321dcc58e72041ae626ad8 vhost: multiqueue
> > support changed the order of stopping the device. Previously
> > vhost_dev_stop would disable backend and only afterwards, unset guest
> > notifiers. We now unset guest notifiers while vhost is still
> > active. This can lose interrupts causing guest networking to fail. In
> > particular, this has been observed during migration.
> > 
> > To adapt this, several other changes are needed:
> > - remove the hdev->started assertion in vhost.c since we may want to
> > start the guest notifiers before vhost starts and stop the guest
> > notifiers after vhost is stopped.
> > - introduce the vhost_net_set_vq_index() and call it before setting
> > guest notifiers. This is used to guarantee vhost_net has the correct
> > virtqueue index when setting guest notifiers.
> > 
> > Cc: qemu-stable@nongnu.org
> > Reported-by: "Zhangjie (HZ)" <zhangjie14@huawei.com>
> > Tested-by: William Dauchy <wdauchy@gmail.com>
> 
> please use:
> 
> Tested-by: William Dauchy <william@gandi.net>

It's a pull request so not easy to fix.
I'll do it like that next time.

> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> 
> Thanks,
> -- 
> William


* Re: [Qemu-devel] [PULL 00/13] pci, pc fixes, features
  2014-09-02 15:07 [Qemu-devel] [PULL 00/13] pci, pc fixes, features Michael S. Tsirkin
                   ` (12 preceding siblings ...)
  2014-09-02 15:07 ` [Qemu-devel] [PULL 13/13] vhost_net: start/stop guest notifiers properly Michael S. Tsirkin
@ 2014-09-03 11:26 ` Michael S. Tsirkin
  2014-09-04 11:11   ` Peter Maydell
  13 siblings, 1 reply; 19+ messages in thread
From: Michael S. Tsirkin @ 2014-09-03 11:26 UTC (permalink / raw)
  To: qemu-devel; +Cc: Peter Maydell, Anthony Liguori

On Tue, Sep 02, 2014 at 06:07:01PM +0300, Michael S. Tsirkin wrote:
> The following changes since commit 187de915e8d06aaf82be206aebc551c82bf0670c:
> 
>   pcie: fix trailing whitespace (2014-08-25 00:16:07 +0200)
> 
> are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/virt/kvm/mst/qemu.git tags/for_upstream
> 
> for you to fetch changes up to aad4dce934649b3a398396fc2a76f215bb194ea4:
> 
>   vhost_net: start/stop guest notifiers properly (2014-09-02 17:33:37 +0300)
> 
> ----------------------------------------------------------------
> pci, pc fixes, features
> 
> A bunch of bugfixes - these will make sense for 2.1.1
> 
> Initial Intel IOMMU support.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> 
> ----------------------------------------------------------------
> Gonglei (1):
>       ioh3420: remove unused ioh3420_init() declaration
> 
> Jason Wang (1):
>       vhost_net: start/stop guest notifiers properly
> 
> Knut Omang (1):
>       pci: avoid losing config updates to MSI/MSIX cap regs
> 
> Le Tan (8):
>       iommu: add is_write as a parameter to the translate function of MemoryRegionIOMMUOps
>       intel-iommu: introduce Intel IOMMU (VT-d) emulation
>       intel-iommu: add DMAR table to ACPI tables
>       intel-iommu: add Intel IOMMU emulation to q35 and add a machine option "iommu" as a switch
>       intel-iommu: fix coding style issues around in q35.c and machine.c
>       intel-iommu: add supports for queued invalidation interface
>       intel-iommu: add context-cache to cache context-entry
>       intel-iommu: add IOTLB using hash table
> 
> Michael S. Tsirkin (2):
>       vhost_net: cleanup start/stop condition

A problem was reported with this one.
I fixed it up, will send v2 pull.

>       virtio-net: don't run bh on vm stopped
> 
>  hw/i386/acpi-defs.h            |   40 +
>  hw/i386/intel_iommu_internal.h |  389 ++++++++
>  hw/pci-bridge/ioh3420.h        |    4 -
>  include/exec/memory.h          |    2 +-
>  include/hw/boards.h            |    1 +
>  include/hw/i386/intel_iommu.h  |  120 +++
>  include/hw/pci-host/q35.h      |    2 +
>  exec.c                         |    2 +-
>  hw/alpha/typhoon.c             |    3 +-
>  hw/core/machine.c              |   27 +-
>  hw/i386/acpi-build.c           |   39 +
>  hw/i386/intel_iommu.c          | 1963 ++++++++++++++++++++++++++++++++++++++++
>  hw/net/vhost_net.c             |   39 +-
>  hw/net/virtio-net.c            |   14 +-
>  hw/pci-host/apb.c              |    3 +-
>  hw/pci-host/q35.c              |   58 +-
>  hw/pci/pci.c                   |    7 +-
>  hw/ppc/spapr_iommu.c           |    3 +-
>  hw/virtio/vhost.c              |    2 -
>  vl.c                           |    4 +
>  hw/i386/Makefile.objs          |    1 +
>  qemu-options.hx                |    5 +-
>  22 files changed, 2683 insertions(+), 45 deletions(-)
>  create mode 100644 hw/i386/intel_iommu_internal.h
>  create mode 100644 include/hw/i386/intel_iommu.h
>  create mode 100644 hw/i386/intel_iommu.c
> 


* Re: [Qemu-devel] [PULL 00/13] pci, pc fixes, features
  2014-09-03 11:26 ` [Qemu-devel] [PULL 00/13] pci, pc fixes, features Michael S. Tsirkin
@ 2014-09-04 11:11   ` Peter Maydell
  2014-09-04 12:33     ` Peter Maydell
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Maydell @ 2014-09-04 11:11 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: QEMU Developers, Anthony Liguori

On 3 September 2014 12:26, Michael S. Tsirkin <mst@redhat.com> wrote:
> A problem was reported with this one.
> I fixed it up, will send v2 pull.

I accidentally merged the v1 :-(
Sorry about that; I'm going to merge in the v2 (it conflicts
in vhost_net.c but fairly trivially so I'll fix that up) and then
we should get to where we were intending to go. Apologies
for the error.

-- PMM


* Re: [Qemu-devel] [PULL 00/13] pci, pc fixes, features
  2014-09-04 11:11   ` Peter Maydell
@ 2014-09-04 12:33     ` Peter Maydell
  0 siblings, 0 replies; 19+ messages in thread
From: Peter Maydell @ 2014-09-04 12:33 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: QEMU Developers, Anthony Liguori

On 4 September 2014 12:11, Peter Maydell <peter.maydell@linaro.org> wrote:
> On 3 September 2014 12:26, Michael S. Tsirkin <mst@redhat.com> wrote:
>> A problem was reported with this one.
>> I fixed it up, will send v2 pull.
>
> I accidentally merged the v1 :-(
> Sorry about that; I'm going to merge in the v2 (it conflicts
> in vhost_net.c but fairly trivially so I'll fix that up) and then
> we should get to where we were intending to go. Apologies
> for the error.

As discussed on IRC, reverted the incorrect vhost_net
commit and then cleanly merged in the v2 pull.

Apologies again for this mess.

-- PMM


end of thread, other threads:[~2014-09-04 12:33 UTC | newest]

Thread overview: 19+ messages
2014-09-02 15:07 [Qemu-devel] [PULL 00/13] pci, pc fixes, features Michael S. Tsirkin
2014-09-02 15:07 ` [Qemu-devel] [PULL 01/13] iommu: add is_write as a parameter to the translate function of MemoryRegionIOMMUOps Michael S. Tsirkin
2014-09-02 15:07 ` [Qemu-devel] [PULL 02/13] intel-iommu: introduce Intel IOMMU (VT-d) emulation Michael S. Tsirkin
2014-09-02 15:07 ` [Qemu-devel] [PULL 03/13] intel-iommu: add DMAR table to ACPI tables Michael S. Tsirkin
2014-09-02 15:07 ` [Qemu-devel] [PULL 04/13] intel-iommu: add Intel IOMMU emulation to q35 and add a machine option "iommu" as a switch Michael S. Tsirkin
2014-09-02 15:07 ` [Qemu-devel] [PULL 05/13] intel-iommu: fix coding style issues around in q35.c and machine.c Michael S. Tsirkin
2014-09-02 15:07 ` [Qemu-devel] [PULL 06/13] intel-iommu: add supports for queued invalidation interface Michael S. Tsirkin
2014-09-02 15:07 ` [Qemu-devel] [PULL 07/13] intel-iommu: add context-cache to cache context-entry Michael S. Tsirkin
2014-09-02 15:07 ` [Qemu-devel] [PULL 08/13] intel-iommu: add IOTLB using hash table Michael S. Tsirkin
2014-09-02 15:07 ` [Qemu-devel] [PULL 09/13] vhost_net: cleanup start/stop condition Michael S. Tsirkin
2014-09-02 15:07 ` [Qemu-devel] [PULL 10/13] ioh3420: remove unused ioh3420_init() declaration Michael S. Tsirkin
2014-09-02 15:07 ` [Qemu-devel] [PULL 11/13] virtio-net: don't run bh on vm stopped Michael S. Tsirkin
2014-09-02 15:07 ` [Qemu-devel] [PULL 12/13] pci: avoid losing config updates to MSI/MSIX cap regs Michael S. Tsirkin
2014-09-02 15:07 ` [Qemu-devel] [PULL 13/13] vhost_net: start/stop guest notifiers properly Michael S. Tsirkin
2014-09-02 15:47   ` William Dauchy
2014-09-02 16:01     ` Michael S. Tsirkin
2014-09-03 11:26 ` [Qemu-devel] [PULL 00/13] pci, pc fixes, features Michael S. Tsirkin
2014-09-04 11:11   ` Peter Maydell
2014-09-04 12:33     ` Peter Maydell
