* [Qemu-devel] [PATCH 00/13] IOMMU infrastructure
@ 2012-05-10  4:48 Benjamin Herrenschmidt
  2012-05-10  4:48 ` [Qemu-devel] [PATCH 01/13] Better support for dma_addr_t variables Benjamin Herrenschmidt
                   ` (13 more replies)
  0 siblings, 14 replies; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-10  4:48 UTC (permalink / raw)
  To: qemu-devel

Hi folks !

This is a repost (& rebase on top of current HEAD) of
David and Eduard's iommu patch series, which provides the
necessary infrastructure for doing DMA through an iommu,
along with the SPAPR iommu implementation.

David is on vacation, so make sure to CC all comments
to me.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* [Qemu-devel] [PATCH 01/13] Better support for dma_addr_t variables
  2012-05-10  4:48 [Qemu-devel] [PATCH 00/13] IOMMU infrastructure Benjamin Herrenschmidt
@ 2012-05-10  4:48 ` Benjamin Herrenschmidt
  2012-05-10  4:48 ` [Qemu-devel] [PATCH 02/13] Implement cpu_physical_memory_zero() Benjamin Herrenschmidt
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-10  4:48 UTC (permalink / raw)
  To: qemu-devel; +Cc: David Gibson

From: David Gibson <david@gibson.dropbear.id.au>

A while back, we introduced the dma_addr_t type, which is supposed to
be used for bus visible memory addresses.  At present, this is an
alias for target_phys_addr_t, but this will change when we eventually
add support for guest visible IOMMUs.

There are some instances of target_phys_addr_t in the code now which
should really be dma_addr_t, but can't be trivially converted due to
missing features, which this patch adds:

 * We add DMA_ADDR_BITS, analogous to TARGET_PHYS_ADDR_BITS.  This is
   important where we need to make a compile-time (#if) decision based
   on the size of dma_addr_t.

 * We add a new helper macro to create device properties which take a
   dma_addr_t, currently an alias to DEFINE_PROP_TADDR().
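
For illustration, a hypothetical device using both additions might look
roughly like this (the device, field and property names here are made
up for the example, not part of this patch; DEFINE_PROP_DMAADDR comes
from the new "hw/qdev-dma.h"):

    /* hypothetical device state embedding a bus address */
    typedef struct MyDevState {
        PCIDevice dev;
        dma_addr_t ring_base;   /* bus address of a descriptor ring */
    } MyDevState;

    #if DMA_ADDR_BITS > 32
    /* compile-time decision based on the size of dma_addr_t */
    #endif

    static Property mydev_properties[] = {
        DEFINE_PROP_DMAADDR("ring-base", MyDevState, ring_base, 0),
        DEFINE_PROP_END_OF_LIST(),
    };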

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
 dma.h         |    1 +
 hw/qdev-dma.h |   12 ++++++++++++
 2 files changed, 13 insertions(+)
 create mode 100644 hw/qdev-dma.h

diff --git a/dma.h b/dma.h
index 8c1ec8f..fe08b72 100644
--- a/dma.h
+++ b/dma.h
@@ -31,6 +31,7 @@ struct QEMUSGList {
 #if defined(TARGET_PHYS_ADDR_BITS)
 typedef target_phys_addr_t dma_addr_t;
 
+#define DMA_ADDR_BITS TARGET_PHYS_ADDR_BITS
 #define DMA_ADDR_FMT TARGET_FMT_plx
 
 struct ScatterGatherEntry {
diff --git a/hw/qdev-dma.h b/hw/qdev-dma.h
new file mode 100644
index 0000000..f0ff558
--- /dev/null
+++ b/hw/qdev-dma.h
@@ -0,0 +1,12 @@
+/*
+ * Support for dma_addr_t typed properties
+ *
+ * Copyright (C) 2012 David Gibson, IBM Corporation.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+#include "qdev-addr.h"
+
+#define DEFINE_PROP_DMAADDR(_n, _s, _f, _d)                               \
+    DEFINE_PROP_TADDR(_n, _s, _f, _d)
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [Qemu-devel] [PATCH 02/13] Implement cpu_physical_memory_zero()
  2012-05-10  4:48 [Qemu-devel] [PATCH 00/13] IOMMU infrastructure Benjamin Herrenschmidt
  2012-05-10  4:48 ` [Qemu-devel] [PATCH 01/13] Better support for dma_addr_t variables Benjamin Herrenschmidt
@ 2012-05-10  4:48 ` Benjamin Herrenschmidt
  2012-05-15  0:42   ` Anthony Liguori
  2012-05-10  4:48 ` [Qemu-devel] [PATCH 03/13] iommu: Add universal DMA helper functions Benjamin Herrenschmidt
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-10  4:48 UTC (permalink / raw)
  To: qemu-devel; +Cc: David Gibson

From: David Gibson <david@gibson.dropbear.id.au>

This patch adds a cpu_physical_memory_zero() function.  This is equivalent to
calling cpu_physical_memory_write() with a buffer full of zeroes, but
avoids actually allocating such a buffer along the way.
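
As a rough usage sketch (the caller and the address below are
hypothetical, not part of this patch), a device model clearing a
guest-visible table would do something like:

    /* zero a 4 KiB table in guest memory without allocating a
     * temporary buffer of zeroes */
    target_phys_addr_t table_addr = 0x100000;   /* made-up address */
    cpu_physical_memory_zero(table_addr, 4096);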

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
 cpu-common.h |    1 +
 exec.c       |   53 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 54 insertions(+)

diff --git a/cpu-common.h b/cpu-common.h
index dca5175..146429c 100644
--- a/cpu-common.h
+++ b/cpu-common.h
@@ -53,6 +53,7 @@ void qemu_ram_set_idstr(ram_addr_t addr, const char *name, DeviceState *dev);
 
 void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
                             int len, int is_write);
+void cpu_physical_memory_zero(target_phys_addr_t addr, int len);
 static inline void cpu_physical_memory_read(target_phys_addr_t addr,
                                             void *buf, int len)
 {
diff --git a/exec.c b/exec.c
index 0607c9b..8511496 100644
--- a/exec.c
+++ b/exec.c
@@ -3581,6 +3581,59 @@ void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
     }
 }
 
+void cpu_physical_memory_zero(target_phys_addr_t addr, int len)
+{
+    int l;
+    uint8_t *ptr;
+    target_phys_addr_t page;
+    MemoryRegionSection *section;
+
+    while (len > 0) {
+        page = addr & TARGET_PAGE_MASK;
+        l = (page + TARGET_PAGE_SIZE) - addr;
+        if (l > len)
+            l = len;
+        section = phys_page_find(page >> TARGET_PAGE_BITS);
+
+        if (!memory_region_is_ram(section->mr)) {
+            target_phys_addr_t addr1;
+            addr1 = memory_region_section_addr(section, addr);
+            /* XXX: could force cpu_single_env to NULL to avoid
+               potential bugs */
+            if (l >= 4 && ((addr1 & 3) == 0)) {
+                /* 32 bit write access */
+                io_mem_write(section->mr, addr1, 0, 4);
+                l = 4;
+            } else if (l >= 2 && ((addr1 & 1) == 0)) {
+                /* 16 bit write access */
+                io_mem_write(section->mr, addr1, 0, 2);
+                l = 2;
+            } else {
+                /* 8 bit write access */
+                io_mem_write(section->mr, addr1, 0, 1);
+                l = 1;
+            }
+        } else if (!section->readonly) {
+            ram_addr_t addr1;
+            addr1 = memory_region_get_ram_addr(section->mr)
+                + memory_region_section_addr(section, addr);
+            /* RAM case */
+            ptr = qemu_get_ram_ptr(addr1);
+            memset(ptr, 0, l);
+            if (!cpu_physical_memory_is_dirty(addr1)) {
+                /* invalidate code */
+                tb_invalidate_phys_page_range(addr1, addr1 + l, 0);
+                /* set dirty bit */
+                cpu_physical_memory_set_dirty_flags(
+                    addr1, (0xff & ~CODE_DIRTY_FLAG));
+            }
+            qemu_put_ram_ptr(ptr);
+        }
+        len -= l;
+        addr += l;
+    }
+}
+
 /* used for ROM loading : can write in RAM and ROM */
 void cpu_physical_memory_write_rom(target_phys_addr_t addr,
                                    const uint8_t *buf, int len)
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [Qemu-devel] [PATCH 03/13] iommu: Add universal DMA helper functions
  2012-05-10  4:48 [Qemu-devel] [PATCH 00/13] IOMMU infrastructure Benjamin Herrenschmidt
  2012-05-10  4:48 ` [Qemu-devel] [PATCH 01/13] Better support for dma_addr_t variables Benjamin Herrenschmidt
  2012-05-10  4:48 ` [Qemu-devel] [PATCH 02/13] Implement cpu_physical_memory_zero() Benjamin Herrenschmidt
@ 2012-05-10  4:48 ` Benjamin Herrenschmidt
  2012-05-10  4:48 ` [Qemu-devel] [PATCH 04/13] usb-ohci: Use " Benjamin Herrenschmidt
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-10  4:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduard - Gabriel Munteanu, Richard Henderson, Michael S. Tsirkin,
	David Gibson

From: David Gibson <david@gibson.dropbear.id.au>

Not that long ago, every device implementation using DMA directly
accessed guest memory using cpu_physical_memory_*().  This meant that
adding support for a guest visible IOMMU would require changing every
one of these devices to go through IOMMU translation.

Shortly before qemu 1.0, I made a start on fixing this by providing
helper functions for PCI DMA.  These are currently just stubs which
call the direct access functions, but mean that an IOMMU can be
implemented in one place, rather than for every PCI device.

Clearly, this doesn't help for non-PCI devices, which could also be
IOMMU translated on some platforms.  It is also problematic for
devices which have both PCI and non-PCI versions (e.g. OHCI, AHCI) -
we cannot use the pci_dma_*() functions, because they assume the
presence of a PCIDevice, but we don't want to have to choose between
pci_dma_*() and cpu_physical_memory_*() every time we do a DMA in the
device code.

This patch takes the first step towards addressing both of these
problems by introducing new (stub) dma helper functions which can be
used for any DMA-capable device.

These dma functions take a DMAContext *, a pointer to a new
(currently empty) structure describing the DMA address space in which
the operation is to take place.  NULL indicates untranslated DMA
directly into guest physical address space.  The intention is that in
future non-NULL values will give information about any necessary IOMMU
translation.

DMA-using devices must obtain a DMAContext (or, potentially, contexts)
from their bus or platform.  For now this patch just converts the PCI
wrappers to be implemented in terms of the universal wrappers;
conversion of other drivers can take place over time.
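
To make the intended usage concrete, here is a minimal sketch of a
device reading and writing guest memory through the new helpers (the
function and variable names are hypothetical, not part of this patch):

    #include "dma.h"

    /* dma is the DMAContext * obtained from the device's bus or
     * platform; NULL means untranslated guest physical access */
    static void mydev_complete(DMAContext *dma, dma_addr_t desc_addr)
    {
        uint32_t status;

        /* 32-bit little-endian load through the IOMMU, if any */
        status = ldl_le_dma(dma, desc_addr);

        /* write back a completion bit */
        stl_le_dma(dma, desc_addr, status | 1);
    }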

Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Cc: Richard Henderson <rth@twiddle.net>

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
 dma.h         |  100 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/pci.h      |   21 ++++++------
 qemu-common.h |    1 +
 3 files changed, 113 insertions(+), 9 deletions(-)

diff --git a/dma.h b/dma.h
index fe08b72..96b5343 100644
--- a/dma.h
+++ b/dma.h
@@ -34,6 +34,106 @@ typedef target_phys_addr_t dma_addr_t;
 #define DMA_ADDR_BITS TARGET_PHYS_ADDR_BITS
 #define DMA_ADDR_FMT TARGET_FMT_plx
 
+/* Checks that the given range of addresses is valid for DMA.  This is
+ * useful for certain cases, but usually you should just use
+ * dma_memory_{read,write}() and check for errors */
+static inline bool dma_memory_valid(DMAContext *dma, dma_addr_t addr,
+                                    dma_addr_t len, DMADirection dir)
+{
+    /* Stub version, with no iommu we assume all bus addresses are valid */
+    return true;
+}
+
+static inline int dma_memory_rw(DMAContext *dma, dma_addr_t addr,
+                                void *buf, dma_addr_t len, DMADirection dir)
+{
+    /* Stub version when we have no iommu support */
+    cpu_physical_memory_rw(addr, buf, (target_phys_addr_t)len,
+                           dir == DMA_DIRECTION_FROM_DEVICE);
+    return 0;
+}
+
+static inline int dma_memory_read(DMAContext *dma, dma_addr_t addr,
+                                  void *buf, dma_addr_t len)
+{
+    return dma_memory_rw(dma, addr, buf, len, DMA_DIRECTION_TO_DEVICE);
+}
+
+static inline int dma_memory_write(DMAContext *dma, dma_addr_t addr,
+                                   const void *buf, dma_addr_t len)
+{
+    return dma_memory_rw(dma, addr, (void *)buf, len,
+                         DMA_DIRECTION_FROM_DEVICE);
+}
+
+static inline int dma_memory_zero(DMAContext *dma, dma_addr_t addr,
+                                  dma_addr_t len)
+{
+    /* Stub version when we have no iommu support */
+    cpu_physical_memory_zero(addr, len);
+    return 0;
+}
+
+static inline void *dma_memory_map(DMAContext *dma,
+                                   dma_addr_t addr, dma_addr_t *len,
+                                   DMADirection dir)
+{
+    target_phys_addr_t xlen = *len;
+    void *p;
+
+    p = cpu_physical_memory_map(addr, &xlen,
+                                dir == DMA_DIRECTION_FROM_DEVICE);
+    *len = xlen;
+    return p;
+}
+
+static inline void dma_memory_unmap(DMAContext *dma,
+                                    void *buffer, dma_addr_t len,
+                                    DMADirection dir, dma_addr_t access_len)
+{
+    return cpu_physical_memory_unmap(buffer, (target_phys_addr_t)len,
+                                     dir == DMA_DIRECTION_FROM_DEVICE,
+                                     access_len);
+}
+
+#define DEFINE_LDST_DMA(_lname, _sname, _bits, _end) \
+    static inline uint##_bits##_t ld##_lname##_##_end##_dma(DMAContext *dma, \
+                                                            dma_addr_t addr) \
+    {                                                                   \
+        uint##_bits##_t val;                                            \
+        dma_memory_read(dma, addr, &val, (_bits) / 8);                  \
+        return _end##_bits##_to_cpu(val);                               \
+    }                                                                   \
+    static inline void st##_sname##_##_end##_dma(DMAContext *dma,       \
+                                                 dma_addr_t addr,       \
+                                                 uint##_bits##_t val)   \
+    {                                                                   \
+        val = cpu_to_##_end##_bits(val);                                \
+        dma_memory_write(dma, addr, &val, (_bits) / 8);                 \
+    }
+
+static inline uint8_t ldub_dma(DMAContext *dma, dma_addr_t addr)
+{
+    uint8_t val;
+
+    dma_memory_read(dma, addr, &val, 1);
+    return val;
+}
+
+static inline void stb_dma(DMAContext *dma, dma_addr_t addr, uint8_t val)
+{
+    dma_memory_write(dma, addr, &val, 1);
+}
+
+DEFINE_LDST_DMA(uw, w, 16, le);
+DEFINE_LDST_DMA(l, l, 32, le);
+DEFINE_LDST_DMA(q, q, 64, le);
+DEFINE_LDST_DMA(uw, w, 16, be);
+DEFINE_LDST_DMA(l, l, 32, be);
+DEFINE_LDST_DMA(q, q, 64, be);
+
+#undef DEFINE_LDST_DMA
+
 struct ScatterGatherEntry {
     dma_addr_t base;
     dma_addr_t len;
diff --git a/hw/pci.h b/hw/pci.h
index 8d0aa49..f2fe6bf 100644
--- a/hw/pci.h
+++ b/hw/pci.h
@@ -550,10 +550,16 @@ static inline uint32_t pci_config_size(const PCIDevice *d)
 }
 
 /* DMA access functions */
+static inline DMAContext *pci_dma_context(PCIDevice *dev)
+{
+    /* Stub for when we have no PCI iommu support */
+    return NULL;
+}
+
 static inline int pci_dma_rw(PCIDevice *dev, dma_addr_t addr,
                              void *buf, dma_addr_t len, DMADirection dir)
 {
-    cpu_physical_memory_rw(addr, buf, len, dir == DMA_DIRECTION_FROM_DEVICE);
+    dma_memory_rw(pci_dma_context(dev), addr, buf, len, dir);
     return 0;
 }
 
@@ -573,12 +579,12 @@ static inline int pci_dma_write(PCIDevice *dev, dma_addr_t addr,
     static inline uint##_bits##_t ld##_l##_pci_dma(PCIDevice *dev,      \
                                                    dma_addr_t addr)     \
     {                                                                   \
-        return ld##_l##_phys(addr);                                     \
+        return ld##_l##_dma(pci_dma_context(dev), addr);                \
     }                                                                   \
     static inline void st##_s##_pci_dma(PCIDevice *dev,                 \
-                          dma_addr_t addr, uint##_bits##_t val)         \
+                                        dma_addr_t addr, uint##_bits##_t val) \
     {                                                                   \
-        st##_s##_phys(addr, val);                                       \
+        st##_s##_dma(pci_dma_context(dev), addr, val);                  \
     }
 
 PCI_DMA_DEFINE_LDST(ub, b, 8);
@@ -594,19 +600,16 @@ PCI_DMA_DEFINE_LDST(q_be, q_be, 64);
 static inline void *pci_dma_map(PCIDevice *dev, dma_addr_t addr,
                                 dma_addr_t *plen, DMADirection dir)
 {
-    target_phys_addr_t len = *plen;
     void *buf;
 
-    buf = cpu_physical_memory_map(addr, &len, dir == DMA_DIRECTION_FROM_DEVICE);
-    *plen = len;
+    buf = dma_memory_map(pci_dma_context(dev), addr, plen, dir);
     return buf;
 }
 
 static inline void pci_dma_unmap(PCIDevice *dev, void *buffer, dma_addr_t len,
                                  DMADirection dir, dma_addr_t access_len)
 {
-    cpu_physical_memory_unmap(buffer, len, dir == DMA_DIRECTION_FROM_DEVICE,
-                              access_len);
+    dma_memory_unmap(pci_dma_context(dev), buffer, len, dir, access_len);
 }
 
 static inline void pci_dma_sglist_init(QEMUSGList *qsg, PCIDevice *dev,
diff --git a/qemu-common.h b/qemu-common.h
index 50f659a..0c92a3f 100644
--- a/qemu-common.h
+++ b/qemu-common.h
@@ -259,6 +259,7 @@ typedef struct EventNotifier EventNotifier;
 typedef struct VirtIODevice VirtIODevice;
 typedef struct QEMUSGList QEMUSGList;
 typedef struct SHPCDevice SHPCDevice;
+typedef struct DMAContext DMAContext;
 
 typedef uint64_t pcibus_t;
 
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [Qemu-devel] [PATCH 04/13] usb-ohci: Use universal DMA helper functions
  2012-05-10  4:48 [Qemu-devel] [PATCH 00/13] IOMMU infrastructure Benjamin Herrenschmidt
                   ` (2 preceding siblings ...)
  2012-05-10  4:48 ` [Qemu-devel] [PATCH 03/13] iommu: Add universal DMA helper functions Benjamin Herrenschmidt
@ 2012-05-10  4:48 ` Benjamin Herrenschmidt
  2012-05-10  4:48 ` [Qemu-devel] [PATCH 05/13] iommu: Make sglists and dma_bdrv helpers use new universal DMA helpers Benjamin Herrenschmidt
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-10  4:48 UTC (permalink / raw)
  To: qemu-devel; +Cc: Michael S. Tsirkin, Gerd Hoffmann, David Gibson

From: David Gibson <david@gibson.dropbear.id.au>

The OHCI device emulation can provide both PCI and SysBus OHCI
implementations.  Because of this, it was not previously converted to
use the PCI DMA helper functions.

This patch converts it to use the new universal DMA helper functions.
In the PCI case, it obtains its DMAContext from pci_dma_context(); in
the SysBus case, it uses NULL - i.e. it assumes for now that there
will be no IOMMU translation for a SysBus OHCI.

Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Michael S. Tsirkin <mst@redhat.com>

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
 hw/usb/hcd-ohci.c |   93 +++++++++++++++++++++++++++++------------------------
 1 file changed, 51 insertions(+), 42 deletions(-)

diff --git a/hw/usb/hcd-ohci.c b/hw/usb/hcd-ohci.c
index 1a1cc88..844e7ed 100644
--- a/hw/usb/hcd-ohci.c
+++ b/hw/usb/hcd-ohci.c
@@ -31,7 +31,7 @@
 #include "hw/usb.h"
 #include "hw/pci.h"
 #include "hw/sysbus.h"
-#include "hw/qdev-addr.h"
+#include "hw/qdev-dma.h"
 
 //#define DEBUG_OHCI
 /* Dump packet contents.  */
@@ -62,6 +62,7 @@ typedef struct {
     USBBus bus;
     qemu_irq irq;
     MemoryRegion mem;
+    DMAContext *dma;
     int num_ports;
     const char *name;
 
@@ -104,7 +105,7 @@ typedef struct {
     uint32_t htest;
 
     /* SM501 local memory offset */
-    target_phys_addr_t localmem_base;
+    dma_addr_t localmem_base;
 
     /* Active packets.  */
     uint32_t old_ctl;
@@ -482,14 +483,14 @@ static void ohci_reset(void *opaque)
 
 /* Get an array of dwords from main memory */
 static inline int get_dwords(OHCIState *ohci,
-                             uint32_t addr, uint32_t *buf, int num)
+                             dma_addr_t addr, uint32_t *buf, int num)
 {
     int i;
 
     addr += ohci->localmem_base;
 
     for (i = 0; i < num; i++, buf++, addr += sizeof(*buf)) {
-        cpu_physical_memory_read(addr, buf, sizeof(*buf));
+        dma_memory_read(ohci->dma, addr, buf, sizeof(*buf));
         *buf = le32_to_cpu(*buf);
     }
 
@@ -498,7 +499,7 @@ static inline int get_dwords(OHCIState *ohci,
 
 /* Put an array of dwords in to main memory */
 static inline int put_dwords(OHCIState *ohci,
-                             uint32_t addr, uint32_t *buf, int num)
+                             dma_addr_t addr, uint32_t *buf, int num)
 {
     int i;
 
@@ -506,7 +507,7 @@ static inline int put_dwords(OHCIState *ohci,
 
     for (i = 0; i < num; i++, buf++, addr += sizeof(*buf)) {
         uint32_t tmp = cpu_to_le32(*buf);
-        cpu_physical_memory_write(addr, &tmp, sizeof(tmp));
+        dma_memory_write(ohci->dma, addr, &tmp, sizeof(tmp));
     }
 
     return 1;
@@ -514,14 +515,14 @@ static inline int put_dwords(OHCIState *ohci,
 
 /* Get an array of words from main memory */
 static inline int get_words(OHCIState *ohci,
-                            uint32_t addr, uint16_t *buf, int num)
+                            dma_addr_t addr, uint16_t *buf, int num)
 {
     int i;
 
     addr += ohci->localmem_base;
 
     for (i = 0; i < num; i++, buf++, addr += sizeof(*buf)) {
-        cpu_physical_memory_read(addr, buf, sizeof(*buf));
+        dma_memory_read(ohci->dma, addr, buf, sizeof(*buf));
         *buf = le16_to_cpu(*buf);
     }
 
@@ -530,7 +531,7 @@ static inline int get_words(OHCIState *ohci,
 
 /* Put an array of words in to main memory */
 static inline int put_words(OHCIState *ohci,
-                            uint32_t addr, uint16_t *buf, int num)
+                            dma_addr_t addr, uint16_t *buf, int num)
 {
     int i;
 
@@ -538,40 +539,40 @@ static inline int put_words(OHCIState *ohci,
 
     for (i = 0; i < num; i++, buf++, addr += sizeof(*buf)) {
         uint16_t tmp = cpu_to_le16(*buf);
-        cpu_physical_memory_write(addr, &tmp, sizeof(tmp));
+        dma_memory_write(ohci->dma, addr, &tmp, sizeof(tmp));
     }
 
     return 1;
 }
 
 static inline int ohci_read_ed(OHCIState *ohci,
-                               uint32_t addr, struct ohci_ed *ed)
+                               dma_addr_t addr, struct ohci_ed *ed)
 {
     return get_dwords(ohci, addr, (uint32_t *)ed, sizeof(*ed) >> 2);
 }
 
 static inline int ohci_read_td(OHCIState *ohci,
-                               uint32_t addr, struct ohci_td *td)
+                               dma_addr_t addr, struct ohci_td *td)
 {
     return get_dwords(ohci, addr, (uint32_t *)td, sizeof(*td) >> 2);
 }
 
 static inline int ohci_read_iso_td(OHCIState *ohci,
-                                   uint32_t addr, struct ohci_iso_td *td)
+                                   dma_addr_t addr, struct ohci_iso_td *td)
 {
     return (get_dwords(ohci, addr, (uint32_t *)td, 4) &&
             get_words(ohci, addr + 16, td->offset, 8));
 }
 
 static inline int ohci_read_hcca(OHCIState *ohci,
-                                 uint32_t addr, struct ohci_hcca *hcca)
+                                 dma_addr_t addr, struct ohci_hcca *hcca)
 {
-    cpu_physical_memory_read(addr + ohci->localmem_base, hcca, sizeof(*hcca));
+    dma_memory_read(ohci->dma, addr + ohci->localmem_base, hcca, sizeof(*hcca));
     return 1;
 }
 
 static inline int ohci_put_ed(OHCIState *ohci,
-                              uint32_t addr, struct ohci_ed *ed)
+                              dma_addr_t addr, struct ohci_ed *ed)
 {
     /* ed->tail is under control of the HCD.
      * Since just ed->head is changed by HC, just write back this
@@ -583,64 +584,63 @@ static inline int ohci_put_ed(OHCIState *ohci,
 }
 
 static inline int ohci_put_td(OHCIState *ohci,
-                              uint32_t addr, struct ohci_td *td)
+                              dma_addr_t addr, struct ohci_td *td)
 {
     return put_dwords(ohci, addr, (uint32_t *)td, sizeof(*td) >> 2);
 }
 
 static inline int ohci_put_iso_td(OHCIState *ohci,
-                                  uint32_t addr, struct ohci_iso_td *td)
+                                  dma_addr_t addr, struct ohci_iso_td *td)
 {
     return (put_dwords(ohci, addr, (uint32_t *)td, 4) &&
             put_words(ohci, addr + 16, td->offset, 8));
 }
 
 static inline int ohci_put_hcca(OHCIState *ohci,
-                                uint32_t addr, struct ohci_hcca *hcca)
+                                dma_addr_t addr, struct ohci_hcca *hcca)
 {
-    cpu_physical_memory_write(addr + ohci->localmem_base + HCCA_WRITEBACK_OFFSET,
-                              (char *)hcca + HCCA_WRITEBACK_OFFSET,
-                              HCCA_WRITEBACK_SIZE);
+    dma_memory_write(ohci->dma,
+                     addr + ohci->localmem_base + HCCA_WRITEBACK_OFFSET,
+                     (char *)hcca + HCCA_WRITEBACK_OFFSET,
+                     HCCA_WRITEBACK_SIZE);
     return 1;
 }
 
 /* Read/Write the contents of a TD from/to main memory.  */
 static void ohci_copy_td(OHCIState *ohci, struct ohci_td *td,
-                         uint8_t *buf, int len, int write)
+                         uint8_t *buf, int len, DMADirection dir)
 {
-    uint32_t ptr;
-    uint32_t n;
+    dma_addr_t ptr, n;
 
     ptr = td->cbp;
     n = 0x1000 - (ptr & 0xfff);
     if (n > len)
         n = len;
-    cpu_physical_memory_rw(ptr + ohci->localmem_base, buf, n, write);
+    dma_memory_rw(ohci->dma, ptr + ohci->localmem_base, buf, n, dir);
     if (n == len)
         return;
     ptr = td->be & ~0xfffu;
     buf += n;
-    cpu_physical_memory_rw(ptr + ohci->localmem_base, buf, len - n, write);
+    dma_memory_rw(ohci->dma, ptr + ohci->localmem_base, buf, len - n, dir);
 }
 
 /* Read/Write the contents of an ISO TD from/to main memory.  */
 static void ohci_copy_iso_td(OHCIState *ohci,
                              uint32_t start_addr, uint32_t end_addr,
-                             uint8_t *buf, int len, int write)
+                             uint8_t *buf, int len, DMADirection dir)
 {
-    uint32_t ptr;
-    uint32_t n;
+    dma_addr_t ptr, n;
 
     ptr = start_addr;
     n = 0x1000 - (ptr & 0xfff);
     if (n > len)
         n = len;
-    cpu_physical_memory_rw(ptr + ohci->localmem_base, buf, n, write);
+    dma_memory_rw(ohci->dma, ptr + ohci->localmem_base, buf, n, dir);
     if (n == len)
         return;
     ptr = end_addr & ~0xfffu;
     buf += n;
-    cpu_physical_memory_rw(ptr + ohci->localmem_base, buf, len - n, write);
+    dma_memory_rw(ohci->dma, ptr + ohci->localmem_base, buf, len - n, dir);
 }
 
 static void ohci_process_lists(OHCIState *ohci, int completion);
@@ -803,7 +803,8 @@ static int ohci_service_iso_td(OHCIState *ohci, struct ohci_ed *ed,
     }
 
     if (len && dir != OHCI_TD_DIR_IN) {
-        ohci_copy_iso_td(ohci, start_addr, end_addr, ohci->usb_buf, len, 0);
+        ohci_copy_iso_td(ohci, start_addr, end_addr, ohci->usb_buf, len,
+                         DMA_DIRECTION_TO_DEVICE);
     }
 
     if (completion) {
@@ -827,7 +828,8 @@ static int ohci_service_iso_td(OHCIState *ohci, struct ohci_ed *ed,
     /* Writeback */
     if (dir == OHCI_TD_DIR_IN && ret >= 0 && ret <= len) {
         /* IN transfer succeeded */
-        ohci_copy_iso_td(ohci, start_addr, end_addr, ohci->usb_buf, ret, 1);
+        ohci_copy_iso_td(ohci, start_addr, end_addr, ohci->usb_buf, ret,
+                         DMA_DIRECTION_FROM_DEVICE);
         OHCI_SET_BM(iso_td.offset[relative_frame_number], TD_PSW_CC,
                     OHCI_CC_NOERROR);
         OHCI_SET_BM(iso_td.offset[relative_frame_number], TD_PSW_SIZE, ret);
@@ -971,7 +973,8 @@ static int ohci_service_td(OHCIState *ohci, struct ohci_ed *ed)
                 pktlen = len;
             }
             if (!completion) {
-                ohci_copy_td(ohci, &td, ohci->usb_buf, pktlen, 0);
+                ohci_copy_td(ohci, &td, ohci->usb_buf, pktlen,
+                             DMA_DIRECTION_TO_DEVICE);
             }
         }
     }
@@ -1021,7 +1024,8 @@ static int ohci_service_td(OHCIState *ohci, struct ohci_ed *ed)
     }
     if (ret >= 0) {
         if (dir == OHCI_TD_DIR_IN) {
-            ohci_copy_td(ohci, &td, ohci->usb_buf, ret, 1);
+            ohci_copy_td(ohci, &td, ohci->usb_buf, ret,
+                         DMA_DIRECTION_FROM_DEVICE);
 #ifdef DEBUG_PACKET
             DPRINTF("  data:");
             for (i = 0; i < ret; i++)
@@ -1748,11 +1752,14 @@ static USBBusOps ohci_bus_ops = {
 };
 
 static int usb_ohci_init(OHCIState *ohci, DeviceState *dev,
-                         int num_ports, uint32_t localmem_base,
-                         char *masterbus, uint32_t firstport)
+                         int num_ports, dma_addr_t localmem_base,
+                         char *masterbus, uint32_t firstport,
+                         DMAContext *dma)
 {
     int i;
 
+    ohci->dma = dma;
+
     if (usb_frame_time == 0) {
 #ifdef OHCI_TIME_WARP
         usb_frame_time = get_ticks_per_sec();
@@ -1817,7 +1824,8 @@ static int usb_ohci_initfn_pci(struct PCIDevice *dev)
     ohci->pci_dev.config[PCI_INTERRUPT_PIN] = 0x01; /* interrupt pin A */
 
     if (usb_ohci_init(&ohci->state, &dev->qdev, ohci->num_ports, 0,
-                      ohci->masterbus, ohci->firstport) != 0) {
+                      ohci->masterbus, ohci->firstport,
+                      pci_dma_context(dev)) != 0) {
         return -1;
     }
     ohci->state.irq = ohci->pci_dev.irq[0];
@@ -1831,7 +1839,7 @@ typedef struct {
     SysBusDevice busdev;
     OHCIState ohci;
     uint32_t num_ports;
-    target_phys_addr_t dma_offset;
+    dma_addr_t dma_offset;
 } OHCISysBusState;
 
 static int ohci_init_pxa(SysBusDevice *dev)
@@ -1839,7 +1847,8 @@ static int ohci_init_pxa(SysBusDevice *dev)
     OHCISysBusState *s = FROM_SYSBUS(OHCISysBusState, dev);
 
     /* Cannot fail as we pass NULL for masterbus */
-    usb_ohci_init(&s->ohci, &dev->qdev, s->num_ports, s->dma_offset, NULL, 0);
+    usb_ohci_init(&s->ohci, &dev->qdev, s->num_ports, s->dma_offset, NULL, 0,
+                  NULL);
     sysbus_init_irq(dev, &s->ohci.irq);
     sysbus_init_mmio(dev, &s->ohci.mem);
 
@@ -1875,7 +1884,7 @@ static TypeInfo ohci_pci_info = {
 
 static Property ohci_sysbus_properties[] = {
     DEFINE_PROP_UINT32("num-ports", OHCISysBusState, num_ports, 3),
-    DEFINE_PROP_TADDR("dma-offset", OHCISysBusState, dma_offset, 3),
+    DEFINE_PROP_DMAADDR("dma-offset", OHCISysBusState, dma_offset, 3),
     DEFINE_PROP_END_OF_LIST(),
 };
 
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [Qemu-devel] [PATCH 05/13] iommu: Make sglists and dma_bdrv helpers use new universal DMA helpers
  2012-05-10  4:48 [Qemu-devel] [PATCH 00/13] IOMMU infrastructure Benjamin Herrenschmidt
                   ` (3 preceding siblings ...)
  2012-05-10  4:48 ` [Qemu-devel] [PATCH 04/13] usb-ohci: Use " Benjamin Herrenschmidt
@ 2012-05-10  4:48 ` Benjamin Herrenschmidt
  2012-05-10  4:49 ` [Qemu-devel] [PATCH 06/13] ide/ahci: Use universal DMA helper functions Benjamin Herrenschmidt
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-10  4:48 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Paolo Bonzini, Michael S. Tsirkin, David Gibson

From: David Gibson <david@gibson.dropbear.id.au>

dma-helpers.c contains a number of helper functions for doing
scatter/gather DMA, and various block device related DMA.  Currently,
these directly access guest memory using cpu_physical_memory_*(),
assuming no IOMMU translation.

This patch updates this code to use the new universal DMA helper
functions.  qemu_sglist_init() now takes a DMAContext * to describe
the DMA address space in which the scatter/gather will take place.

We minimally update the callers of qemu_sglist_init() to pass NULL
(i.e. no translation, same as current behaviour).  Some of those
callers should pass something else in some cases to allow proper IOMMU
translation in future, but that will be fixed in later patches.
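
As a sketch of the new calling convention (hypothetical device code,
not part of this patch), a PCI device building a scatter/gather list
now passes its context explicitly:

    QEMUSGList qsg;

    /* the context comes from the device's bus; pci_dma_sglist_init()
     * does exactly this for PCI devices */
    qemu_sglist_init(&qsg, 2, pci_dma_context(pci_dev));
    qemu_sglist_add(&qsg, desc_base, desc_len);
    qemu_sglist_add(&qsg, data_base, data_len);

    /* dma_buf_read()/dma_buf_write() and the dma_bdrv_*() helpers now
     * honour qsg.dma when mapping or copying */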

Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
 dma-helpers.c  |   24 ++++++++++++------------
 dma.h          |    3 ++-
 hw/ide/ahci.c  |    3 ++-
 hw/ide/macio.c |    4 ++--
 hw/pci.h       |    2 +-
 5 files changed, 19 insertions(+), 17 deletions(-)

diff --git a/dma-helpers.c b/dma-helpers.c
index 7971a89..2dc4691 100644
--- a/dma-helpers.c
+++ b/dma-helpers.c
@@ -10,12 +10,13 @@
 #include "dma.h"
 #include "trace.h"
 
-void qemu_sglist_init(QEMUSGList *qsg, int alloc_hint)
+void qemu_sglist_init(QEMUSGList *qsg, int alloc_hint, DMAContext *dma)
 {
     qsg->sg = g_malloc(alloc_hint * sizeof(ScatterGatherEntry));
     qsg->nsg = 0;
     qsg->nalloc = alloc_hint;
     qsg->size = 0;
+    qsg->dma = dma;
 }
 
 void qemu_sglist_add(QEMUSGList *qsg, dma_addr_t base, dma_addr_t len)
@@ -74,10 +75,9 @@ static void dma_bdrv_unmap(DMAAIOCB *dbs)
     int i;
 
     for (i = 0; i < dbs->iov.niov; ++i) {
-        cpu_physical_memory_unmap(dbs->iov.iov[i].iov_base,
-                                  dbs->iov.iov[i].iov_len,
-                                  dbs->dir != DMA_DIRECTION_TO_DEVICE,
-                                  dbs->iov.iov[i].iov_len);
+        dma_memory_unmap(dbs->sg->dma, dbs->iov.iov[i].iov_base,
+                         dbs->iov.iov[i].iov_len, dbs->dir,
+                         dbs->iov.iov[i].iov_len);
     }
     qemu_iovec_reset(&dbs->iov);
 }
@@ -106,7 +106,7 @@ static void dma_complete(DMAAIOCB *dbs, int ret)
 static void dma_bdrv_cb(void *opaque, int ret)
 {
     DMAAIOCB *dbs = (DMAAIOCB *)opaque;
-    target_phys_addr_t cur_addr, cur_len;
+    dma_addr_t cur_addr, cur_len;
     void *mem;
 
     trace_dma_bdrv_cb(dbs, ret);
@@ -123,8 +123,7 @@ static void dma_bdrv_cb(void *opaque, int ret)
     while (dbs->sg_cur_index < dbs->sg->nsg) {
         cur_addr = dbs->sg->sg[dbs->sg_cur_index].base + dbs->sg_cur_byte;
         cur_len = dbs->sg->sg[dbs->sg_cur_index].len - dbs->sg_cur_byte;
-        mem = cpu_physical_memory_map(cur_addr, &cur_len,
-                                      dbs->dir != DMA_DIRECTION_TO_DEVICE);
+        mem = dma_memory_map(dbs->sg->dma, cur_addr, &cur_len, dbs->dir);
         if (!mem)
             break;
         qemu_iovec_add(&dbs->iov, mem, cur_len);
@@ -209,7 +208,8 @@ BlockDriverAIOCB *dma_bdrv_write(BlockDriverState *bs,
 }
 
 
-static uint64_t dma_buf_rw(uint8_t *ptr, int32_t len, QEMUSGList *sg, bool to_dev)
+static uint64_t dma_buf_rw(uint8_t *ptr, int32_t len, QEMUSGList *sg,
+                           DMADirection dir)
 {
     uint64_t resid;
     int sg_cur_index;
@@ -220,7 +220,7 @@ static uint64_t dma_buf_rw(uint8_t *ptr, int32_t len, QEMUSGList *sg, bool to_de
     while (len > 0) {
         ScatterGatherEntry entry = sg->sg[sg_cur_index++];
         int32_t xfer = MIN(len, entry.len);
-        cpu_physical_memory_rw(entry.base, ptr, xfer, !to_dev);
+        dma_memory_rw(sg->dma, entry.base, ptr, xfer, dir);
         ptr += xfer;
         len -= xfer;
         resid -= xfer;
@@ -231,12 +231,12 @@ static uint64_t dma_buf_rw(uint8_t *ptr, int32_t len, QEMUSGList *sg, bool to_de
 
 uint64_t dma_buf_read(uint8_t *ptr, int32_t len, QEMUSGList *sg)
 {
-    return dma_buf_rw(ptr, len, sg, 0);
+    return dma_buf_rw(ptr, len, sg, DMA_DIRECTION_FROM_DEVICE);
 }
 
 uint64_t dma_buf_write(uint8_t *ptr, int32_t len, QEMUSGList *sg)
 {
-    return dma_buf_rw(ptr, len, sg, 1);
+    return dma_buf_rw(ptr, len, sg, DMA_DIRECTION_TO_DEVICE);
 }
 
 void dma_acct_start(BlockDriverState *bs, BlockAcctCookie *cookie,
diff --git a/dma.h b/dma.h
index 96b5343..876aea4 100644
--- a/dma.h
+++ b/dma.h
@@ -26,6 +26,7 @@ struct QEMUSGList {
     int nsg;
     int nalloc;
     size_t size;
+    DMAContext *dma;
 };
 
 #if defined(TARGET_PHYS_ADDR_BITS)
@@ -139,7 +140,7 @@ struct ScatterGatherEntry {
     dma_addr_t len;
 };
 
-void qemu_sglist_init(QEMUSGList *qsg, int alloc_hint);
+void qemu_sglist_init(QEMUSGList *qsg, int alloc_hint, DMAContext *dma);
 void qemu_sglist_add(QEMUSGList *qsg, dma_addr_t base, dma_addr_t len);
 void qemu_sglist_destroy(QEMUSGList *qsg);
 #endif
diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
index a883a92..96d8f62 100644
--- a/hw/ide/ahci.c
+++ b/hw/ide/ahci.c
@@ -667,7 +667,8 @@ static int ahci_populate_sglist(AHCIDevice *ad, QEMUSGList *sglist)
     if (sglist_alloc_hint > 0) {
         AHCI_SG *tbl = (AHCI_SG *)prdt;
 
-        qemu_sglist_init(sglist, sglist_alloc_hint);
+        /* FIXME: pass the correct DMAContext */
+        qemu_sglist_init(sglist, sglist_alloc_hint, NULL);
         for (i = 0; i < sglist_alloc_hint; i++) {
             /* flags_size is zero-based */
             qemu_sglist_add(sglist, le64_to_cpu(tbl[i].addr),
diff --git a/hw/ide/macio.c b/hw/ide/macio.c
index 7b38d9e..848cb31 100644
--- a/hw/ide/macio.c
+++ b/hw/ide/macio.c
@@ -76,7 +76,7 @@ static void pmac_ide_atapi_transfer_cb(void *opaque, int ret)
 
     s->io_buffer_size = io->len;
 
-    qemu_sglist_init(&s->sg, io->len / MACIO_PAGE_SIZE + 1);
+    qemu_sglist_init(&s->sg, io->len / MACIO_PAGE_SIZE + 1, NULL);
     qemu_sglist_add(&s->sg, io->addr, io->len);
     io->addr += io->len;
     io->len = 0;
@@ -133,7 +133,7 @@ static void pmac_ide_transfer_cb(void *opaque, int ret)
     s->io_buffer_index = 0;
     s->io_buffer_size = io->len;
 
-    qemu_sglist_init(&s->sg, io->len / MACIO_PAGE_SIZE + 1);
+    qemu_sglist_init(&s->sg, io->len / MACIO_PAGE_SIZE + 1, NULL);
     qemu_sglist_add(&s->sg, io->addr, io->len);
     io->addr += io->len;
     io->len = 0;
diff --git a/hw/pci.h b/hw/pci.h
index f2fe6bf..7e36c53 100644
--- a/hw/pci.h
+++ b/hw/pci.h
@@ -615,7 +615,7 @@ static inline void pci_dma_unmap(PCIDevice *dev, void *buffer, dma_addr_t len,
 static inline void pci_dma_sglist_init(QEMUSGList *qsg, PCIDevice *dev,
                                        int alloc_hint)
 {
-    qemu_sglist_init(qsg, alloc_hint);
+    qemu_sglist_init(qsg, alloc_hint, pci_dma_context(dev));
 }
 
 extern const VMStateDescription vmstate_pci_device;
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [Qemu-devel] [PATCH 06/13] ide/ahci: Use universal DMA helper functions
  2012-05-10  4:48 [Qemu-devel] [PATCH 00/13] IOMMU infrastructure Benjamin Herrenschmidt
                   ` (4 preceding siblings ...)
  2012-05-10  4:48 ` [Qemu-devel] [PATCH 05/13] iommu: Make sglists and dma_bdrv helpers use new universal DMA helpers Benjamin Herrenschmidt
@ 2012-05-10  4:49 ` Benjamin Herrenschmidt
  2012-05-21  1:51   ` [Qemu-devel] [PATCH 06/13 - UPDATED] " Benjamin Herrenschmidt
  2012-05-10  4:49 ` [Qemu-devel] [PATCH 07/13] usb: Convert usb_packet_{map, unmap} to universal DMA helpers Benjamin Herrenschmidt
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-10  4:49 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Michael S. Tsirkin, David Gibson

From: David Gibson <david@gibson.dropbear.id.au>

The AHCI device can provide both PCI and SysBus AHCI device
emulations.  For this reason, it wasn't previously converted to use
the pci_dma_*() helper functions.  Now that we have universal DMA
helper functions, this converts AHCI to use them.

The DMAContext is obtained from pci_dma_context() in the PCI case and
set to NULL in the SysBus case (i.e. we assume for now that a SysBus
AHCI has no IOMMU translation).

Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Michael S. Tsirkin <mst@redhat.com>

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
 hw/ide/ahci.c |   34 ++++++++++++++++++++--------------
 hw/ide/ahci.h |    3 ++-
 hw/ide/ich.c  |    2 +-
 3 files changed, 23 insertions(+), 16 deletions(-)

diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
index 96d8f62..89b572f 100644
--- a/hw/ide/ahci.c
+++ b/hw/ide/ahci.c
@@ -588,7 +588,7 @@ static void ahci_write_fis_d2h(AHCIDevice *ad, uint8_t *cmd_fis)
     AHCIPortRegs *pr = &ad->port_regs;
     uint8_t *d2h_fis;
     int i;
-    target_phys_addr_t cmd_len = 0x80;
+    dma_addr_t cmd_len = 0x80;
     int cmd_mapped = 0;
 
     if (!ad->res_fis || !(pr->cmd & PORT_CMD_FIS_RX)) {
@@ -598,7 +598,8 @@ static void ahci_write_fis_d2h(AHCIDevice *ad, uint8_t *cmd_fis)
     if (!cmd_fis) {
         /* map cmd_fis */
         uint64_t tbl_addr = le64_to_cpu(ad->cur_cmd->tbl_addr);
-        cmd_fis = cpu_physical_memory_map(tbl_addr, &cmd_len, 0);
+        cmd_fis = dma_memory_map(ad->hba->dma, tbl_addr, &cmd_len,
+                                 DMA_DIRECTION_TO_DEVICE);
         cmd_mapped = 1;
     }
 
@@ -630,7 +631,8 @@ static void ahci_write_fis_d2h(AHCIDevice *ad, uint8_t *cmd_fis)
     ahci_trigger_irq(ad->hba, ad, PORT_IRQ_D2H_REG_FIS);
 
     if (cmd_mapped) {
-        cpu_physical_memory_unmap(cmd_fis, cmd_len, 0, cmd_len);
+        dma_memory_unmap(ad->hba->dma, cmd_fis, cmd_len,
+                         DMA_DIRECTION_TO_DEVICE, cmd_len);
     }
 }
 
@@ -640,8 +642,8 @@ static int ahci_populate_sglist(AHCIDevice *ad, QEMUSGList *sglist)
     uint32_t opts = le32_to_cpu(cmd->opts);
     uint64_t prdt_addr = le64_to_cpu(cmd->tbl_addr) + 0x80;
     int sglist_alloc_hint = opts >> AHCI_CMD_HDR_PRDT_LEN;
-    target_phys_addr_t prdt_len = (sglist_alloc_hint * sizeof(AHCI_SG));
-    target_phys_addr_t real_prdt_len = prdt_len;
+    dma_addr_t prdt_len = (sglist_alloc_hint * sizeof(AHCI_SG));
+    dma_addr_t real_prdt_len = prdt_len;
     uint8_t *prdt;
     int i;
     int r = 0;
@@ -652,7 +654,8 @@ static int ahci_populate_sglist(AHCIDevice *ad, QEMUSGList *sglist)
     }
 
     /* map PRDT */
-    if (!(prdt = cpu_physical_memory_map(prdt_addr, &prdt_len, 0))){
+    if (!(prdt = dma_memory_map(ad->hba->dma, prdt_addr, &prdt_len,
+                                DMA_DIRECTION_TO_DEVICE))){
         DPRINTF(ad->port_no, "map failed\n");
         return -1;
     }
@@ -667,8 +670,7 @@ static int ahci_populate_sglist(AHCIDevice *ad, QEMUSGList *sglist)
     if (sglist_alloc_hint > 0) {
         AHCI_SG *tbl = (AHCI_SG *)prdt;
 
-        /* FIXME: pass the correct DMAContext */
-        qemu_sglist_init(sglist, sglist_alloc_hint, NULL);
+        qemu_sglist_init(sglist, sglist_alloc_hint, ad->hba->dma);
         for (i = 0; i < sglist_alloc_hint; i++) {
             /* flags_size is zero-based */
             qemu_sglist_add(sglist, le64_to_cpu(tbl[i].addr),
@@ -677,7 +679,8 @@ static int ahci_populate_sglist(AHCIDevice *ad, QEMUSGList *sglist)
     }
 
 out:
-    cpu_physical_memory_unmap(prdt, prdt_len, 0, prdt_len);
+    dma_memory_unmap(ad->hba->dma, prdt, prdt_len,
+                     DMA_DIRECTION_TO_DEVICE, prdt_len);
     return r;
 }
 
@@ -787,7 +790,7 @@ static int handle_cmd(AHCIState *s, int port, int slot)
     uint64_t tbl_addr;
     AHCICmdHdr *cmd;
     uint8_t *cmd_fis;
-    target_phys_addr_t cmd_len;
+    dma_addr_t cmd_len;
 
     if (s->dev[port].port.ifs[0].status & (BUSY_STAT|DRQ_STAT)) {
         /* Engine currently busy, try again later */
@@ -809,7 +812,8 @@ static int handle_cmd(AHCIState *s, int port, int slot)
     tbl_addr = le64_to_cpu(cmd->tbl_addr);
 
     cmd_len = 0x80;
-    cmd_fis = cpu_physical_memory_map(tbl_addr, &cmd_len, 1);
+    cmd_fis = dma_memory_map(s->dma, tbl_addr, &cmd_len,
+                             DMA_DIRECTION_FROM_DEVICE);
 
     if (!cmd_fis) {
         DPRINTF(port, "error: guest passed us an invalid cmd fis\n");
@@ -935,7 +939,8 @@ static int handle_cmd(AHCIState *s, int port, int slot)
     }
 
 out:
-    cpu_physical_memory_unmap(cmd_fis, cmd_len, 1, cmd_len);
+    dma_memory_unmap(s->dma, cmd_fis, cmd_len, DMA_DIRECTION_FROM_DEVICE,
+                     cmd_len);
 
     if (s->dev[port].port.ifs[0].status & (BUSY_STAT|DRQ_STAT)) {
         /* async command, complete later */
@@ -1115,11 +1120,12 @@ static const IDEDMAOps ahci_dma_ops = {
     .reset = ahci_dma_reset,
 };
 
-void ahci_init(AHCIState *s, DeviceState *qdev, int ports)
+void ahci_init(AHCIState *s, DeviceState *qdev, DMAContext *dma, int ports)
 {
     qemu_irq *irqs;
     int i;
 
+    s->dma = dma;
     s->ports = ports;
     s->dev = g_malloc0(sizeof(AHCIDevice) * ports);
     ahci_reg_init(s);
@@ -1182,7 +1188,7 @@ static const VMStateDescription vmstate_sysbus_ahci = {
 static int sysbus_ahci_init(SysBusDevice *dev)
 {
     SysbusAHCIState *s = FROM_SYSBUS(SysbusAHCIState, dev);
-    ahci_init(&s->ahci, &dev->qdev, s->num_ports);
+    ahci_init(&s->ahci, &dev->qdev, NULL, s->num_ports);
 
     sysbus_init_mmio(dev, &s->ahci.mem);
     sysbus_init_irq(dev, &s->ahci.irq);
diff --git a/hw/ide/ahci.h b/hw/ide/ahci.h
index b223d2c..af8c6ef 100644
--- a/hw/ide/ahci.h
+++ b/hw/ide/ahci.h
@@ -299,6 +299,7 @@ typedef struct AHCIState {
     uint32_t idp_index;     /* Current IDP index */
     int ports;
     qemu_irq irq;
+    DMAContext *dma;
 } AHCIState;
 
 typedef struct AHCIPCIState {
@@ -329,7 +330,7 @@ typedef struct NCQFrame {
     uint8_t reserved10;
 } QEMU_PACKED NCQFrame;
 
-void ahci_init(AHCIState *s, DeviceState *qdev, int ports);
+void ahci_init(AHCIState *s, DeviceState *qdev, DMAContext *dma, int ports);
 void ahci_uninit(AHCIState *s);
 
 void ahci_reset(void *opaque);
diff --git a/hw/ide/ich.c b/hw/ide/ich.c
index 560ae37..5354e13 100644
--- a/hw/ide/ich.c
+++ b/hw/ide/ich.c
@@ -91,7 +91,7 @@ static int pci_ich9_ahci_init(PCIDevice *dev)
     uint8_t *sata_cap;
     d = DO_UPCAST(struct AHCIPCIState, card, dev);
 
-    ahci_init(&d->ahci, &dev->qdev, 6);
+    ahci_init(&d->ahci, &dev->qdev, pci_dma_context(dev), 6);
 
     pci_config_set_prog_interface(d->card.config, AHCI_PROGMODE_MAJOR_REV_1);
 
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [Qemu-devel] [PATCH 07/13] usb: Convert usb_packet_{map, unmap} to universal DMA helpers
  2012-05-10  4:48 [Qemu-devel] [PATCH 00/13] IOMMU infrastructure Benjamin Herrenschmidt
                   ` (5 preceding siblings ...)
  2012-05-10  4:49 ` [Qemu-devel] [PATCH 06/13] ide/ahci: Use universal DMA helper functions Benjamin Herrenschmidt
@ 2012-05-10  4:49 ` Benjamin Herrenschmidt
  2012-05-10  4:49 ` [Qemu-devel] [PATCH 08/13] iommu: Introduce IOMMU emulation infrastructure Benjamin Herrenschmidt
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-10  4:49 UTC (permalink / raw)
  To: qemu-devel; +Cc: David Gibson

From: David Gibson <david@gibson.dropbear.id.au>

The USB UHCI and EHCI drivers were converted some time ago to use the
pci_dma_*() helper functions.  However, this conversion was not complete
because in some places both these drivers do DMA via the usb_packet_map()
function in usb-libhw.c.  That function directly used
cpu_physical_memory_map().

Now that the sglist code uses DMA wrappers properly, we can convert the
functions in usb-libhw.c, thus completing the conversion of UHCI and EHCI
to use the DMA wrappers.

Note that usb_packet_map() invokes dma_memory_map() with a NULL invalidate
callback function.  When IOMMU support is added, this will mean that
usb_packet_map() and the corresponding usb_packet_unmap() must be called in
close proximity without dropping the qemu device lock - otherwise the guest
might invalidate IOMMU mappings while they are still in use by the device
code.
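
In code, the constraint boils down to keeping the map/unmap pair tight
around the transfer, roughly as the HCDs do (sketch only; the real
call sites are in the diff below):

    usb_packet_map(&p, &sgl);          /* maps via sgl->dma            */
    ret = usb_handle_packet(dev, &p);  /* use the mapping immediately  */
    usb_packet_unmap(&p, &sgl);        /* same sgl supplies the context */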

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
 hw/usb.h          |    2 +-
 hw/usb/hcd-ehci.c |    4 ++--
 hw/usb/hcd-uhci.c |    2 +-
 hw/usb/libhw.c    |   21 +++++++++++----------
 4 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/hw/usb.h b/hw/usb.h
index ae7ccda..db2512c 100644
--- a/hw/usb.h
+++ b/hw/usb.h
@@ -345,7 +345,7 @@ void usb_packet_check_state(USBPacket *p, USBPacketState expected);
 void usb_packet_setup(USBPacket *p, int pid, USBEndpoint *ep);
 void usb_packet_addbuf(USBPacket *p, void *ptr, size_t len);
 int usb_packet_map(USBPacket *p, QEMUSGList *sgl);
-void usb_packet_unmap(USBPacket *p);
+void usb_packet_unmap(USBPacket *p, QEMUSGList *sgl);
 void usb_packet_copy(USBPacket *p, void *ptr, size_t bytes);
 void usb_packet_skip(USBPacket *p, size_t bytes);
 void usb_packet_cleanup(USBPacket *p);
diff --git a/hw/usb/hcd-ehci.c b/hw/usb/hcd-ehci.c
index 4ff4d40..601f59c 100644
--- a/hw/usb/hcd-ehci.c
+++ b/hw/usb/hcd-ehci.c
@@ -1335,7 +1335,7 @@ static void ehci_execute_complete(EHCIQueue *q)
         set_field(&q->qh.token, q->tbytes, QTD_TOKEN_TBYTES);
     }
     ehci_finish_transfer(q, q->usb_status);
-    usb_packet_unmap(&q->packet);
+    usb_packet_unmap(&q->packet, &q->sgl);
 
     q->qh.token ^= QTD_TOKEN_DTOGGLE;
     q->qh.token &= ~QTD_TOKEN_ACTIVE;
@@ -1456,7 +1456,7 @@ static int ehci_process_itd(EHCIState *ehci,
                 usb_packet_map(&ehci->ipacket, &ehci->isgl);
                 ret = usb_handle_packet(dev, &ehci->ipacket);
                 assert(ret != USB_RET_ASYNC);
-                usb_packet_unmap(&ehci->ipacket);
+                usb_packet_unmap(&ehci->ipacket, &ehci->isgl);
             } else {
                 DPRINTF("ISOCH: attempt to addess non-iso endpoint\n");
                 ret = USB_RET_NAK;
diff --git a/hw/usb/hcd-uhci.c b/hw/usb/hcd-uhci.c
index 9e211a0..726bdf3 100644
--- a/hw/usb/hcd-uhci.c
+++ b/hw/usb/hcd-uhci.c
@@ -874,7 +874,7 @@ static int uhci_handle_td(UHCIState *s, uint32_t addr, UHCI_TD *td,
 
 done:
     len = uhci_complete_td(s, td, async, int_mask);
-    usb_packet_unmap(&async->packet);
+    usb_packet_unmap(&async->packet, &async->sgl);
     uhci_async_free(async);
     return len;
 }
diff --git a/hw/usb/libhw.c b/hw/usb/libhw.c
index 2462351..c0de30e 100644
--- a/hw/usb/libhw.c
+++ b/hw/usb/libhw.c
@@ -26,15 +26,15 @@
 
 int usb_packet_map(USBPacket *p, QEMUSGList *sgl)
 {
-    int is_write = (p->pid == USB_TOKEN_IN);
-    target_phys_addr_t len;
+    DMADirection dir = (p->pid == USB_TOKEN_IN) ?
+        DMA_DIRECTION_FROM_DEVICE : DMA_DIRECTION_TO_DEVICE;
+    dma_addr_t len;
     void *mem;
     int i;
 
     for (i = 0; i < sgl->nsg; i++) {
         len = sgl->sg[i].len;
-        mem = cpu_physical_memory_map(sgl->sg[i].base, &len,
-                                      is_write);
+        mem = dma_memory_map(sgl->dma, sgl->sg[i].base, &len, dir);
         if (!mem) {
             goto err;
         }
@@ -46,18 +46,19 @@ int usb_packet_map(USBPacket *p, QEMUSGList *sgl)
     return 0;
 
 err:
-    usb_packet_unmap(p);
+    usb_packet_unmap(p, sgl);
     return -1;
 }
 
-void usb_packet_unmap(USBPacket *p)
+void usb_packet_unmap(USBPacket *p, QEMUSGList *sgl)
 {
-    int is_write = (p->pid == USB_TOKEN_IN);
+    DMADirection dir = (p->pid == USB_TOKEN_IN) ?
+        DMA_DIRECTION_FROM_DEVICE : DMA_DIRECTION_TO_DEVICE;
     int i;
 
     for (i = 0; i < p->iov.niov; i++) {
-        cpu_physical_memory_unmap(p->iov.iov[i].iov_base,
-                                  p->iov.iov[i].iov_len, is_write,
-                                  p->iov.iov[i].iov_len);
+        dma_memory_unmap(sgl->dma, p->iov.iov[i].iov_base,
+                         p->iov.iov[i].iov_len, dir,
+                         p->iov.iov[i].iov_len);
     }
 }
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [Qemu-devel] [PATCH 08/13] iommu: Introduce IOMMU emulation infrastructure
  2012-05-10  4:48 [Qemu-devel] [PATCH 00/13] IOMMU infrastructure Benjamin Herrenschmidt
                   ` (6 preceding siblings ...)
  2012-05-10  4:49 ` [Qemu-devel] [PATCH 07/13] usb: Convert usb_packet_{map, unmap} to universal DMA helpers Benjamin Herrenschmidt
@ 2012-05-10  4:49 ` Benjamin Herrenschmidt
  2012-05-15  0:49   ` Anthony Liguori
  2012-05-10  4:49 ` [Qemu-devel] [PATCH 09/13] iommu: Add facility to cancel in-use dma memory maps Benjamin Herrenschmidt
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-10  4:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduard - Gabriel Munteanu, Richard Henderson, Michael S. Tsirkin,
	David Gibson

From: David Gibson <david@gibson.dropbear.id.au>

This patch adds the basic infrastructure necessary to emulate an IOMMU
visible to the guest.  The DMAContext structure is extended with
information and a callback describing the translation, and the various
DMA functions used by devices will now perform IOMMU translation using
this callback.
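
To illustrate the shape of the interface, here is a minimal sketch of
an identity-mapping translate callback wired into a DMAContext (the
1 GiB window is made up; a real IOMMU would walk its translation
tables here):

    static int my_iommu_translate(DMAContext *dma, dma_addr_t addr,
                                  target_phys_addr_t *paddr,
                                  target_phys_addr_t *len,
                                  DMADirection dir)
    {
        if (addr >= 0x40000000ULL) {
            return -1;                   /* outside the window: fault */
        }
        *paddr = addr;                   /* identity translation */
        *len = 0x40000000ULL - addr;     /* valid up to window end */
        return 0;
    }

    /* wiring it up */
    DMAContext my_dma;
    dma_context_init(&my_dma, my_iommu_translate);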

Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>

Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
 dma-helpers.c |  214 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 dma.h         |  108 ++++++++++++++++++++++-------
 hw/qdev-dma.h |    4 +-
 3 files changed, 299 insertions(+), 27 deletions(-)

diff --git a/dma-helpers.c b/dma-helpers.c
index 2dc4691..09591ef 100644
--- a/dma-helpers.c
+++ b/dma-helpers.c
@@ -9,6 +9,10 @@
 
 #include "dma.h"
 #include "trace.h"
+#include "range.h"
+#include "qemu-thread.h"
+
+/* #define DEBUG_IOMMU */
 
 void qemu_sglist_init(QEMUSGList *qsg, int alloc_hint, DMAContext *dma)
 {
@@ -244,3 +248,213 @@ void dma_acct_start(BlockDriverState *bs, BlockAcctCookie *cookie,
 {
     bdrv_acct_start(bs, cookie, sg->size, type);
 }
+
+bool iommu_dma_memory_valid(DMAContext *dma, dma_addr_t addr, dma_addr_t len,
+                            DMADirection dir)
+{
+    target_phys_addr_t paddr, plen;
+
+#ifdef DEBUG_IOMMU
+    fprintf(stderr, "dma_memory_check context=%p addr=0x" DMA_ADDR_FMT
+            " len=0x" DMA_ADDR_FMT " dir=%d\n", dma, addr, len, dir);
+#endif
+
+    while (len) {
+        if (dma->translate(dma, addr, &paddr, &plen, dir) != 0) {
+            return false;
+        }
+
+        /* The translation might be valid for larger regions. */
+        if (plen > len) {
+            plen = len;
+        }
+
+        len -= plen;
+        addr += plen;
+    }
+
+    return true;
+}
+
+int iommu_dma_memory_rw(DMAContext *dma, dma_addr_t addr,
+                        void *buf, dma_addr_t len, DMADirection dir)
+{
+    target_phys_addr_t paddr, plen;
+    int err;
+
+#ifdef DEBUG_IOMMU
+    fprintf(stderr, "dma_memory_rw context=%p addr=0x" DMA_ADDR_FMT " len=0x"
+            DMA_ADDR_FMT " dir=%d\n", dma, addr, len, dir);
+#endif
+
+    while (len) {
+        err = dma->translate(dma, addr, &paddr, &plen, dir);
+        if (err) {
+            return -1;
+        }
+
+        /* The translation might be valid for larger regions. */
+        if (plen > len) {
+            plen = len;
+        }
+
+        cpu_physical_memory_rw(paddr, buf, plen,
+                               dir == DMA_DIRECTION_FROM_DEVICE);
+
+        len -= plen;
+        addr += plen;
+        buf += plen;
+    }
+
+    return 0;
+}
+
+int iommu_dma_memory_zero(DMAContext *dma, dma_addr_t addr, dma_addr_t len)
+{
+    target_phys_addr_t paddr, plen;
+    int err;
+
+#ifdef DEBUG_IOMMU
+    fprintf(stderr, "dma_memory_zero context=%p addr=0x" DMA_ADDR_FMT
+            " len=0x" DMA_ADDR_FMT "\n", dma, addr, len);
+#endif
+
+    while (len) {
+        err = dma->translate(dma, addr, &paddr, &plen,
+                             DMA_DIRECTION_FROM_DEVICE);
+        if (err) {
+            return err;
+        }
+
+        /* The translation might be valid for larger regions. */
+        if (plen > len) {
+            plen = len;
+        }
+
+        cpu_physical_memory_zero(paddr, plen);
+
+        len -= plen;
+        addr += plen;
+    }
+
+    return 0;
+}
+
+typedef struct {
+    unsigned long count;
+    QemuCond cond;
+} DMAInvalidationState;
+
+typedef struct DMAMemoryMap DMAMemoryMap;
+struct DMAMemoryMap {
+    dma_addr_t              addr;
+    size_t                  len;
+    void                    *buf;
+
+    DMAInvalidationState    *invalidate;
+    QLIST_ENTRY(DMAMemoryMap) list;
+};
+
+void dma_context_init(DMAContext *dma, DMATranslateFunc fn)
+{
+#ifdef DEBUG_IOMMU
+    fprintf(stderr, "dma_context_init(%p, %p)\n", dma, fn);
+#endif
+    dma->translate = fn;
+    QLIST_INIT(&dma->memory_maps);
+}
+
+void *iommu_dma_memory_map(DMAContext *dma, dma_addr_t addr, dma_addr_t *len,
+                           DMADirection dir)
+{
+    int err;
+    target_phys_addr_t paddr, plen;
+    void *buf;
+    DMAMemoryMap *map;
+
+    plen = *len;
+    err = dma->translate(dma, addr, &paddr, &plen, dir);
+    if (err) {
+        return NULL;
+    }
+
+    /*
+     * If this is true, the virtual region is contiguous,
+     * but the translated physical region isn't. We just
+     * clamp *len, much like cpu_physical_memory_map() does.
+     */
+    if (plen < *len) {
+        *len = plen;
+    }
+
+    buf = cpu_physical_memory_map(paddr, &plen,
+                                  dir == DMA_DIRECTION_FROM_DEVICE);
+    *len = plen;
+
+    /* We treat maps as remote TLBs to cope with stuff like AIO. */
+    map = g_malloc(sizeof(DMAMemoryMap));
+    map->addr = addr;
+    map->len = *len;
+    map->buf = buf;
+    map->invalidate = NULL;
+
+    QLIST_INSERT_HEAD(&dma->memory_maps, map, list);
+
+    return buf;
+}
+
+void iommu_dma_memory_unmap(DMAContext *dma, void *buffer, dma_addr_t len,
+                            DMADirection dir, dma_addr_t access_len)
+{
+    DMAMemoryMap *map;
+
+    cpu_physical_memory_unmap(buffer, len,
+                              dir == DMA_DIRECTION_FROM_DEVICE,
+                              access_len);
+
+    QLIST_FOREACH(map, &dma->memory_maps, list) {
+        if ((map->buf == buffer) && (map->len == len)) {
+            QLIST_REMOVE(map, list);
+
+            if (map->invalidate) {
+                /* If this mapping was invalidated */
+                if (--map->invalidate->count == 0) {
+                    /* And we're the last mapping invalidated at the time */
+                    /* Then wake up whoever was waiting for the
+                     * invalidation to complete */
+                    qemu_cond_signal(&map->invalidate->cond);
+                }
+            }
+
+            g_free(map);
+            return;
+        }
+    }
+
+
+    /* unmap called on a buffer that wasn't mapped */
+    assert(false);
+}
+
+extern QemuMutex qemu_global_mutex;
+
+void iommu_wait_for_invalidated_maps(DMAContext *dma,
+                                     dma_addr_t addr, dma_addr_t len)
+{
+    DMAMemoryMap *map;
+    DMAInvalidationState is;
+
+    is.count = 0;
+    qemu_cond_init(&is.cond);
+
+    QLIST_FOREACH(map, &dma->memory_maps, list) {
+        if (ranges_overlap(addr, len, map->addr, map->len)) {
+            is.count++;
+            map->invalidate = &is;
+        }
+    }
+
+    while (is.count) {
+        qemu_cond_wait(&is.cond, &qemu_global_mutex);
+    }
+    assert(is.count == 0);
+}
diff --git a/dma.h b/dma.h
index 876aea4..b57d72f 100644
--- a/dma.h
+++ b/dma.h
@@ -14,6 +14,7 @@
 #include "hw/hw.h"
 #include "block.h"
 
+typedef struct DMAContext DMAContext;
 typedef struct ScatterGatherEntry ScatterGatherEntry;
 
 typedef enum {
@@ -30,28 +31,64 @@ struct QEMUSGList {
 };
 
 #if defined(TARGET_PHYS_ADDR_BITS)
-typedef target_phys_addr_t dma_addr_t;
 
-#define DMA_ADDR_BITS TARGET_PHYS_ADDR_BITS
-#define DMA_ADDR_FMT TARGET_FMT_plx
+/*
+ * When an IOMMU is present, bus addresses become distinct from
+ * CPU/memory physical addresses and may be a different size.  Because
+ * the IOVA size depends more on the bus than on the platform, we more
+ * or less have to treat these as 64-bit always to cover all (or at
+ * least most) cases.
+ */
+typedef uint64_t dma_addr_t;
+
+#define DMA_ADDR_BITS 64
+#define DMA_ADDR_FMT "%" PRIx64
+
+typedef int DMATranslateFunc(DMAContext *dma,
+                             dma_addr_t addr,
+                             target_phys_addr_t *paddr,
+                             target_phys_addr_t *len,
+                             DMADirection dir);
+
+typedef struct DMAContext {
+    DMATranslateFunc *translate;
+    QLIST_HEAD(memory_maps, DMAMemoryMap) memory_maps;
+} DMAContext;
+
+static inline bool dma_has_iommu(DMAContext *dma)
+{
+    return !!dma;
+}
 
 /* Checks that the given range of addresses is valid for DMA.  This is
  * useful for certain cases, but usually you should just use
  * dma_memory_{read,write}() and check for errors */
-static inline bool dma_memory_valid(DMAContext *dma, dma_addr_t addr,
-                                    dma_addr_t len, DMADirection dir)
+bool iommu_dma_memory_valid(DMAContext *dma, dma_addr_t addr, dma_addr_t len,
+                            DMADirection dir);
+static inline bool dma_memory_valid(DMAContext *dma,
+                                    dma_addr_t addr, dma_addr_t len,
+                                    DMADirection dir)
 {
-    /* Stub version, with no iommu we assume all bus addresses are valid */
-    return true;
+    if (!dma_has_iommu(dma)) {
+        return true;
+    } else {
+        return iommu_dma_memory_valid(dma, addr, len, dir);
+    }
 }
 
+int iommu_dma_memory_rw(DMAContext *dma, dma_addr_t addr,
+                        void *buf, dma_addr_t len, DMADirection dir);
 static inline int dma_memory_rw(DMAContext *dma, dma_addr_t addr,
                                 void *buf, dma_addr_t len, DMADirection dir)
 {
-    /* Stub version when we have no iommu support */
-    cpu_physical_memory_rw(addr, buf, (target_phys_addr_t)len,
-                           dir == DMA_DIRECTION_FROM_DEVICE);
-    return 0;
+    if (!dma_has_iommu(dma)) {
+        /* Fast-path for no IOMMU */
+        cpu_physical_memory_rw(addr, buf, len,
+                               dir == DMA_DIRECTION_FROM_DEVICE);
+        return 0;
+    } else {
+        return iommu_dma_memory_rw(dma, addr, buf, len, dir);
+    }
 }
 
 static inline int dma_memory_read(DMAContext *dma, dma_addr_t addr,
@@ -67,34 +104,53 @@ static inline int dma_memory_write(DMAContext *dma, dma_addr_t addr,
                          DMA_DIRECTION_FROM_DEVICE);
 }
 
+int iommu_dma_memory_zero(DMAContext *dma, dma_addr_t addr, dma_addr_t len);
 static inline int dma_memory_zero(DMAContext *dma, dma_addr_t addr,
                                   dma_addr_t len)
 {
-    /* Stub version when we have no iommu support */
-    cpu_physical_memory_zero(addr, len);
-    return 0;
+    if (!dma_has_iommu(dma)) {
+        /* Fast-path for no IOMMU */
+        cpu_physical_memory_zero(addr, len);
+        return 0;
+    } else {
+        return iommu_dma_memory_zero(dma, addr, len);
+    }
 }
 
+void *iommu_dma_memory_map(DMAContext *dma,
+                           dma_addr_t addr, dma_addr_t *len,
+                           DMADirection dir);
 static inline void *dma_memory_map(DMAContext *dma,
                                    dma_addr_t addr, dma_addr_t *len,
                                    DMADirection dir)
 {
-    target_phys_addr_t xlen = *len;
-    void *p;
-
-    p = cpu_physical_memory_map(addr, &xlen,
-                                dir == DMA_DIRECTION_FROM_DEVICE);
-    *len = xlen;
-    return p;
+    if (!dma_has_iommu(dma)) {
+        target_phys_addr_t xlen = *len;
+        void *p;
+
+        p = cpu_physical_memory_map(addr, &xlen,
+                                    dir == DMA_DIRECTION_FROM_DEVICE);
+        *len = xlen;
+        return p;
+    } else {
+        return iommu_dma_memory_map(dma, addr, len, dir);
+    }
 }
 
+void iommu_dma_memory_unmap(DMAContext *dma,
+                            void *buffer, dma_addr_t len,
+                            DMADirection dir, dma_addr_t access_len);
 static inline void dma_memory_unmap(DMAContext *dma,
                                     void *buffer, dma_addr_t len,
                                     DMADirection dir, dma_addr_t access_len)
 {
-    return cpu_physical_memory_unmap(buffer, (target_phys_addr_t)len,
-                                     dir == DMA_DIRECTION_FROM_DEVICE,
-                                     access_len);
+    if (!dma_has_iommu(dma)) {
+        return cpu_physical_memory_unmap(buffer, (target_phys_addr_t)len,
+                                         dir == DMA_DIRECTION_FROM_DEVICE,
+                                         access_len);
+    } else {
+        iommu_dma_memory_unmap(dma, buffer, len, dir, access_len);
+    }
 }
 
 #define DEFINE_LDST_DMA(_lname, _sname, _bits, _end) \
@@ -135,6 +191,10 @@ DEFINE_LDST_DMA(q, q, 64, be);
 
 #undef DEFINE_LDST_DMA
 
+void dma_context_init(DMAContext *dma, DMATranslateFunc fn);
+void iommu_wait_for_invalidated_maps(DMAContext *dma,
+                                     dma_addr_t addr, dma_addr_t len);
+
 struct ScatterGatherEntry {
     dma_addr_t base;
     dma_addr_t len;
diff --git a/hw/qdev-dma.h b/hw/qdev-dma.h
index f0ff558..6812735 100644
--- a/hw/qdev-dma.h
+++ b/hw/qdev-dma.h
@@ -6,7 +6,5 @@
  * This work is licensed under the terms of the GNU GPL, version 2 or later.
  * See the COPYING file in the top-level directory.
  */
-#include "qdev-addr.h"
-
 #define DEFINE_PROP_DMAADDR(_n, _s, _f, _d)                               \
-    DEFINE_PROP_TADDR(_n, _s, _f, _d)
+    DEFINE_PROP_HEX64(_n, _s, _f, _d)
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [Qemu-devel] [PATCH 09/13] iommu: Add facility to cancel in-use dma memory maps
  2012-05-10  4:48 [Qemu-devel] [PATCH 00/13] IOMMU infrastructure Benjamin Herrenschmidt
                   ` (7 preceding siblings ...)
  2012-05-10  4:49 ` [Qemu-devel] [PATCH 08/13] iommu: Introduce IOMMU emulation infrastructure Benjamin Herrenschmidt
@ 2012-05-10  4:49 ` Benjamin Herrenschmidt
  2012-05-10  4:49 ` [Qemu-devel] [PATCH 10/13] pseries: Convert sPAPR TCEs to use generic IOMMU infrastructure Benjamin Herrenschmidt
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-10  4:49 UTC (permalink / raw)
  To: qemu-devel; +Cc: David Gibson

From: David Gibson <david@gibson.dropbear.id.au>

One new complication raised by IOMMU support, compared with handling
DMA directly to physical addresses, is the dma_memory_map() case
(replacing cpu_physical_memory_map()): the IOMMU translations for the
IOVAs covered by such a map may be invalidated or changed while the map
is still active.  This should never happen with correct guest software,
but we do need to handle buggy guests.  This case might also occur
during handovers between different guest software stages if the
handover protocols aren't fully seamless.

Presently, we handle this by having the IOMMU driver use a helper to
wait (blocking the initiating CPU thread) for any such mappings to go
away before completing the IOMMU update operation.  This is correct,
because maps are transient in all existing cases, but it's possible
that the delay could be quite long.
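
For illustration only (this is a rough sketch, not part of the diffs in
this series; the function name is hypothetical, but the helper is the
one this series introduces), an IOMMU implementation that changes a
previously valid translation would do something along these lines; the
sPAPR H_PUT_TCE handler converted later in the series follows this
pattern:

#include "dma.h"

/* Hypothetical example: rewrite or clear one page-sized translation,
 * then block until any active dma_memory_map() mappings covering it
 * have been unmapped. */
static void example_invalidate_page(DMAContext *dma, dma_addr_t iova,
                                    dma_addr_t page_size)
{
    /* ... update or clear the translation table entry here ... */

    iommu_wait_for_invalidated_maps(dma, iova, page_size);
}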

This patch adds infrastructure to reduce such delays, by
(optionally) signalling drivers holding maps with a callback when
their map is invalidated from under them.  From this callback they
should cancel in-progress DMAs using the map.  They should then unmap
the buffer ASAP, although this need not be synchronous with the
callback.  This will allow the invalidated maps to disappear faster,
unblocking the CPU thread triggering the invalidation.

In addition, this adds a user of the infrastructure in the bdrv code -
the most common and potentially longest-lived user of maps.
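
For illustration, here is a rough sketch of the driver side; it is not
part of the diff below, and the example_* names are hypothetical, but
the map/cancel API is the one added here.  A device model maps a guest
buffer and supplies a callback so the map can be torn down early if its
translation is invalidated:

#include "dma.h"

/* Hypothetical device emulation code; dma_bdrv_cancel_cb() in the
 * diff below does the same thing for the block layer. */
static void example_cancel_cb(void *opaque)
{
    /* Cancel any in-flight I/O that uses the mapping, then unmap it
     * as soon as possible (not necessarily from this callback). */
}

static void *example_map(DMAContext *dma, dma_addr_t addr, dma_addr_t *len,
                         void *dev_state)
{
    return dma_memory_map_with_cancel(dma, example_cancel_cb, dev_state,
                                      addr, len, DMA_DIRECTION_TO_DEVICE);
}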

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
 dma-helpers.c |   52 ++++++++++++++++++++++++++++++++--------------------
 dma.h         |   20 ++++++++++++++++----
 2 files changed, 48 insertions(+), 24 deletions(-)

diff --git a/dma-helpers.c b/dma-helpers.c
index 09591ef..36fa963 100644
--- a/dma-helpers.c
+++ b/dma-helpers.c
@@ -107,6 +107,28 @@ static void dma_complete(DMAAIOCB *dbs, int ret)
     }
 }
 
+static void dma_aio_cancel(BlockDriverAIOCB *acb)
+{
+    DMAAIOCB *dbs = container_of(acb, DMAAIOCB, common);
+
+    trace_dma_aio_cancel(dbs);
+
+    if (dbs->acb) {
+        BlockDriverAIOCB *acb = dbs->acb;
+        dbs->acb = NULL;
+        dbs->in_cancel = true;
+        bdrv_aio_cancel(acb);
+        dbs->in_cancel = false;
+    }
+    dbs->common.cb = NULL;
+    dma_complete(dbs, 0);
+}
+
+static void dma_bdrv_cancel_cb(void *opaque)
+{
+    dma_aio_cancel(&((DMAAIOCB *)opaque)->common);
+}
+
 static void dma_bdrv_cb(void *opaque, int ret)
 {
     DMAAIOCB *dbs = (DMAAIOCB *)opaque;
@@ -127,7 +149,8 @@ static void dma_bdrv_cb(void *opaque, int ret)
     while (dbs->sg_cur_index < dbs->sg->nsg) {
         cur_addr = dbs->sg->sg[dbs->sg_cur_index].base + dbs->sg_cur_byte;
         cur_len = dbs->sg->sg[dbs->sg_cur_index].len - dbs->sg_cur_byte;
-        mem = dma_memory_map(dbs->sg->dma, cur_addr, &cur_len, dbs->dir);
+        mem = dma_memory_map_with_cancel(dbs->sg->dma, dma_bdrv_cancel_cb, dbs,
+                                         cur_addr, &cur_len, dbs->dir);
         if (!mem)
             break;
         qemu_iovec_add(&dbs->iov, mem, cur_len);
@@ -149,23 +172,6 @@ static void dma_bdrv_cb(void *opaque, int ret)
     assert(dbs->acb);
 }
 
-static void dma_aio_cancel(BlockDriverAIOCB *acb)
-{
-    DMAAIOCB *dbs = container_of(acb, DMAAIOCB, common);
-
-    trace_dma_aio_cancel(dbs);
-
-    if (dbs->acb) {
-        BlockDriverAIOCB *acb = dbs->acb;
-        dbs->acb = NULL;
-        dbs->in_cancel = true;
-        bdrv_aio_cancel(acb);
-        dbs->in_cancel = false;
-    }
-    dbs->common.cb = NULL;
-    dma_complete(dbs, 0);
-}
-
 static AIOPool dma_aio_pool = {
     .aiocb_size         = sizeof(DMAAIOCB),
     .cancel             = dma_aio_cancel,
@@ -350,6 +356,8 @@ struct DMAMemoryMap {
     dma_addr_t              addr;
     size_t                  len;
     void                    *buf;
+    DMACancelMapFunc        *cancel;
+    void                    *cancel_opaque;
 
     DMAInvalidationState    *invalidate;
     QLIST_ENTRY(DMAMemoryMap) list;
@@ -364,7 +372,9 @@ void dma_context_init(DMAContext *dma, DMATranslateFunc fn)
     QLIST_INIT(&dma->memory_maps);
 }
 
-void *iommu_dma_memory_map(DMAContext *dma, dma_addr_t addr, dma_addr_t *len,
+void *iommu_dma_memory_map(DMAContext *dma,
+                           DMACancelMapFunc cb, void *cb_opaque,
+                           dma_addr_t addr, dma_addr_t *len,
                            DMADirection dir)
 {
     int err;
@@ -397,6 +407,8 @@ void *iommu_dma_memory_map(DMAContext *dma, dma_addr_t addr, dma_addr_t *len,
     map->len = *len;
     map->buf = buf;
     map->invalidate = NULL;
+    map->cancel = cb;
+    map->cancel_opaque = cb_opaque;
 
     QLIST_INSERT_HEAD(&dma->memory_maps, map, list);
 
@@ -430,7 +442,6 @@ void iommu_dma_memory_unmap(DMAContext *dma, void *buffer, dma_addr_t len,
         }
     }
 
-
     /* unmap called on a buffer that wasn't mapped */
     assert(false);
 }
@@ -450,6 +461,7 @@ void iommu_wait_for_invalidated_maps(DMAContext *dma,
         if (ranges_overlap(addr, len, map->addr, map->len)) {
             is.count++;
             map->invalidate = &is;
+            if (map->cancel) {
+                map->cancel(map->cancel_opaque);
+            }
         }
     }
 
diff --git a/dma.h b/dma.h
index b57d72f..51914c6 100644
--- a/dma.h
+++ b/dma.h
@@ -60,6 +60,8 @@ static inline bool dma_has_iommu(DMAContext *dma)
     return !!dma;
 }
 
+typedef void DMACancelMapFunc(void *);
+
 /* Checks that the given range of addresses is valid for DMA.  This is
  * useful for certain cases, but usually you should just use
  * dma_memory_{read,write}() and check for errors */
@@ -118,11 +120,15 @@ static inline int dma_memory_zero(DMAContext *dma, dma_addr_t addr,
 }
 
 void *iommu_dma_memory_map(DMAContext *dma,
+                           DMACancelMapFunc *cb, void *opaque,
                            dma_addr_t addr, dma_addr_t *len,
                            DMADirection dir);
-static inline void *dma_memory_map(DMAContext *dma,
-                                   dma_addr_t addr, dma_addr_t *len,
-                                   DMADirection dir)
+static inline void *dma_memory_map_with_cancel(DMAContext *dma,
+                                               DMACancelMapFunc *cb,
+                                               void *opaque,
+                                               dma_addr_t addr,
+                                               dma_addr_t *len,
+                                               DMADirection dir)
 {
     if (!dma_has_iommu(dma)) {
         target_phys_addr_t xlen = *len;
@@ -133,9 +139,15 @@ static inline void *dma_memory_map(DMAContext *dma,
         *len = xlen;
         return p;
     } else {
-        return iommu_dma_memory_map(dma, addr, len, dir);
+        return iommu_dma_memory_map(dma, cb, opaque, addr, len, dir);
     }
 }
+static inline void *dma_memory_map(DMAContext *dma,
+                                   dma_addr_t addr, dma_addr_t *len,
+                                   DMADirection dir)
+{
+    return dma_memory_map_with_cancel(dma, NULL, NULL, addr, len, dir);
+}
 
 void iommu_dma_memory_unmap(DMAContext *dma,
                             void *buffer, dma_addr_t len,
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [Qemu-devel] [PATCH 10/13] pseries: Convert sPAPR TCEs to use generic IOMMU infrastructure
  2012-05-10  4:48 [Qemu-devel] [PATCH 00/13] IOMMU infrastructure Benjamin Herrenschmidt
                   ` (8 preceding siblings ...)
  2012-05-10  4:49 ` [Qemu-devel] [PATCH 09/13] iommu: Add facility to cancel in-use dma memory maps Benjamin Herrenschmidt
@ 2012-05-10  4:49 ` Benjamin Herrenschmidt
  2012-05-10  4:49 ` [Qemu-devel] [PATCH 11/13] iommu: Allow PCI to use " Benjamin Herrenschmidt
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-10  4:49 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alex Graf, David Gibson

From: David Gibson <david@gibson.dropbear.id.au>

The pseries platform already contains an IOMMU implementation, since it is
essential for the platform's paravirtualized VIO devices.  This IOMMU
support is currently built into the implementation of the VIO "bus" and
the various VIO devices.

This patch converts this code to make use of the new common IOMMU
infrastructure.
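
As a rough sketch (for illustration only, not part of the diff below,
with a hypothetical function name), a converted VIO device now does DMA
through the generic wrappers, using the DMAContext backed by its TCE
table instead of the old spapr_tce_dma_*() helpers:

#include "hw/spapr.h"
#include "hw/spapr_vio.h"

/* Hypothetical fragment: dev->dma is set up at init time with
 * spapr_tce_new_dma_context(), so a read simply goes through the
 * generic DMA layer and honours the device's TCE table. */
static int example_vio_read(VIOsPAPRDevice *dev, uint64_t ioba,
                            void *buf, uint32_t size)
{
    return spapr_vio_dma_read(dev, ioba, buf, size);
}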

Cc: Alex Graf <agraf@suse.de>

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
 Makefile.target  |    2 +-
 hw/spapr.c       |    3 +
 hw/spapr.h       |   16 ++++
 hw/spapr_iommu.c |  248 +++++++++++++++++++++++++++++++++++++++++++++++
 hw/spapr_llan.c  |   63 ++++++------
 hw/spapr_vio.c   |  281 ++++--------------------------------------------------
 hw/spapr_vio.h   |   73 +++++++-------
 hw/spapr_vscsi.c |   26 ++---
 hw/spapr_vty.c   |    2 +-
 target-ppc/kvm.c |    4 +-
 10 files changed, 375 insertions(+), 343 deletions(-)
 create mode 100644 hw/spapr_iommu.c

diff --git a/Makefile.target b/Makefile.target
index 1582904..4a9f74e 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -241,7 +241,7 @@ obj-ppc-y += ppc_oldworld.o
 # NewWorld PowerMac
 obj-ppc-y += ppc_newworld.o
 # IBM pSeries (sPAPR)
-obj-ppc-$(CONFIG_PSERIES) += spapr.o spapr_hcall.o spapr_rtas.o spapr_vio.o
+obj-ppc-$(CONFIG_PSERIES) += spapr.o spapr_hcall.o spapr_rtas.o spapr_vio.o spapr_iommu.o
 obj-ppc-$(CONFIG_PSERIES) += xics.o spapr_vty.o spapr_llan.o spapr_vscsi.o
 obj-ppc-$(CONFIG_PSERIES) += spapr_pci.o device-hotplug.o pci-hotplug.o
 # PowerPC 4xx boards
diff --git a/hw/spapr.c b/hw/spapr.c
index cca20f9..1c9597b 100644
--- a/hw/spapr.c
+++ b/hw/spapr.c
@@ -626,6 +626,9 @@ static void ppc_spapr_init(ram_addr_t ram_size,
     spapr->icp = xics_system_init(XICS_IRQS);
     spapr->next_irq = 16;
 
+    /* Set up IOMMU */
+    spapr_iommu_init();
+
     /* Set up VIO bus */
     spapr->vio_bus = spapr_vio_bus_init();
 
diff --git a/hw/spapr.h b/hw/spapr.h
index 654a7a8..df3e8b1 100644
--- a/hw/spapr.h
+++ b/hw/spapr.h
@@ -319,4 +319,20 @@ target_ulong spapr_rtas_call(sPAPREnvironment *spapr,
 int spapr_rtas_device_tree_setup(void *fdt, target_phys_addr_t rtas_addr,
                                  target_phys_addr_t rtas_size);
 
+#define SPAPR_TCE_PAGE_SHIFT   12
+#define SPAPR_TCE_PAGE_SIZE    (1ULL << SPAPR_TCE_PAGE_SHIFT)
+#define SPAPR_TCE_PAGE_MASK    (SPAPR_TCE_PAGE_SIZE - 1)
+
+typedef struct sPAPRTCE {
+    uint64_t tce;
+} sPAPRTCE;
+
+#define SPAPR_VIO_BASE_LIOBN    0x00000000
+
+void spapr_iommu_init(void);
+DMAContext *spapr_tce_new_dma_context(uint32_t liobn, size_t window_size);
+void spapr_tce_free(DMAContext *dma);
+int spapr_dma_dt(void *fdt, int node_off, const char *propname,
+                 DMAContext *dma);
+
 #endif /* !defined (__HW_SPAPR_H__) */
diff --git a/hw/spapr_iommu.c b/hw/spapr_iommu.c
new file mode 100644
index 0000000..87ed09c
--- /dev/null
+++ b/hw/spapr_iommu.c
@@ -0,0 +1,248 @@
+/*
+ * QEMU sPAPR IOMMU (TCE) code
+ *
+ * Copyright (c) 2010 David Gibson, IBM Corporation <dwg@au1.ibm.com>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+#include "hw.h"
+#include "kvm.h"
+#include "qdev.h"
+#include "kvm_ppc.h"
+#include "dma.h"
+
+#include "hw/spapr.h"
+
+#include <libfdt.h>
+
+/* #define DEBUG_TCE */
+
+enum sPAPRTCEAccess {
+    SPAPR_TCE_FAULT = 0,
+    SPAPR_TCE_RO = 1,
+    SPAPR_TCE_WO = 2,
+    SPAPR_TCE_RW = 3,
+};
+
+typedef struct sPAPRTCETable sPAPRTCETable;
+
+struct sPAPRTCETable {
+    DMAContext dma;
+    uint32_t liobn;
+    uint32_t window_size;
+    sPAPRTCE *table;
+    int fd;
+    QLIST_ENTRY(sPAPRTCETable) list;
+};
+
+
+QLIST_HEAD(spapr_tce_tables, sPAPRTCETable) spapr_tce_tables;
+
+static sPAPRTCETable *spapr_tce_find_by_liobn(uint32_t liobn)
+{
+    sPAPRTCETable *tcet;
+
+    QLIST_FOREACH(tcet, &spapr_tce_tables, list) {
+        if (tcet->liobn == liobn) {
+            return tcet;
+        }
+    }
+
+    return NULL;
+}
+
+static int spapr_tce_translate(DMAContext *dma,
+                               dma_addr_t addr,
+                               target_phys_addr_t *paddr,
+                               target_phys_addr_t *len,
+                               DMADirection dir)
+{
+    sPAPRTCETable *tcet = DO_UPCAST(sPAPRTCETable, dma, dma);
+    enum sPAPRTCEAccess access = (dir == DMA_DIRECTION_FROM_DEVICE)
+        ? SPAPR_TCE_WO : SPAPR_TCE_RO;
+    uint64_t tce;
+
+#ifdef DEBUG_TCE
+    fprintf(stderr, "spapr_tce_translate liobn=0x%" PRIx32 " addr=0x"
+            DMA_ADDR_FMT "\n", tcet->liobn, addr);
+#endif
+
+    /* Check if we are in bounds */
+    if (addr >= tcet->window_size) {
+#ifdef DEBUG_TCE
+        fprintf(stderr, "spapr_tce_translate out of bounds\n");
+#endif
+        return -EFAULT;
+    }
+
+    tce = tcet->table[addr >> SPAPR_TCE_PAGE_SHIFT].tce;
+
+    /* Check TCE */
+    if (!(tce & access)) {
+        return -EPERM;
+    }
+
+    /* How much until the end of the page? */
+    *len = ((~addr) & SPAPR_TCE_PAGE_MASK) + 1;
+
+    /* Translate */
+    *paddr = (tce & ~SPAPR_TCE_PAGE_MASK) |
+        (addr & SPAPR_TCE_PAGE_MASK);
+
+#ifdef DEBUG_TCE
+    fprintf(stderr, " ->  *paddr=0x" TARGET_FMT_plx ", *len=0x"
+            TARGET_FMT_plx "\n", *paddr, *len);
+#endif
+
+    return 0;
+}
+
+DMAContext *spapr_tce_new_dma_context(uint32_t liobn, size_t window_size)
+{
+    sPAPRTCETable *tcet;
+
+    if (!window_size) {
+        return NULL;
+    }
+
+    tcet = g_malloc0(sizeof(*tcet));
+    dma_context_init(&tcet->dma, spapr_tce_translate);
+
+    tcet->liobn = liobn;
+    tcet->window_size = window_size;
+
+    if (kvm_enabled()) {
+        tcet->table = kvmppc_create_spapr_tce(liobn,
+                                              window_size,
+                                              &tcet->fd);
+    }
+
+    if (!tcet->table) {
+        size_t table_size = (window_size >> SPAPR_TCE_PAGE_SHIFT)
+            * sizeof(sPAPRTCE);
+        tcet->table = g_malloc0(table_size);
+    }
+
+#ifdef DEBUG_TCE
+    fprintf(stderr, "spapr_iommu: New TCE table, liobn=0x%x, context @ %p, "
+            "table @ %p, fd=%d\n", liobn, &tcet->dma, tcet->table, tcet->fd);
+#endif
+
+    QLIST_INSERT_HEAD(&spapr_tce_tables, tcet, list);
+
+    return &tcet->dma;
+}
+
+void spapr_tce_free(DMAContext *dma)
+{
+
+    if (dma) {
+        sPAPRTCETable *tcet = DO_UPCAST(sPAPRTCETable, dma, dma);
+
+        QLIST_REMOVE(tcet, list);
+
+        if (!kvm_enabled() ||
+            (kvmppc_remove_spapr_tce(tcet->table, tcet->fd,
+                                     tcet->window_size) != 0)) {
+            g_free(tcet->table);
+        }
+
+        g_free(tcet);
+    }
+}
+
+
+static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
+                              target_ulong opcode, target_ulong *args)
+{
+    target_ulong liobn = args[0];
+    target_ulong ioba = args[1];
+    target_ulong tce = args[2];
+    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
+    sPAPRTCE *tcep;
+    target_ulong oldtce;
+
+    if (liobn & 0xFFFFFFFF00000000ULL) {
+        hcall_dprintf("spapr_vio_put_tce on out-of-bounds LIOBN "
+                      TARGET_FMT_lx "\n", liobn);
+        return H_PARAMETER;
+    }
+    if (!tcet) {
+        hcall_dprintf("spapr_vio_put_tce on non-existent LIOBN "
+                      TARGET_FMT_lx "\n", liobn);
+        return H_PARAMETER;
+    }
+
+    ioba &= ~(SPAPR_TCE_PAGE_SIZE - 1);
+
+#ifdef DEBUG_TCE
+    fprintf(stderr, "spapr_vio_put_tce on liobn=" TARGET_FMT_lx /*%s*/
+            "  ioba 0x" TARGET_FMT_lx "  TCE 0x" TARGET_FMT_lx "\n",
+            liobn, /*dev->qdev.id, */ioba, tce);
+#endif
+
+    if (ioba >= tcet->window_size) {
+        hcall_dprintf("spapr_vio_put_tce on out-of-bounds IOBA 0x"
+                      TARGET_FMT_lx "\n", ioba);
+        return H_PARAMETER;
+    }
+
+    tcep = tcet->table + (ioba >> SPAPR_TCE_PAGE_SHIFT);
+    oldtce = tcep->tce;
+    tcep->tce = tce;
+
+    if (oldtce != 0) {
+        iommu_wait_for_invalidated_maps(&tcet->dma, ioba, SPAPR_TCE_PAGE_SIZE);
+    }
+
+    return H_SUCCESS;
+}
+
+void spapr_iommu_init(void)
+{
+    QLIST_INIT(&spapr_tce_tables);
+
+    /* hcall-tce */
+    spapr_register_hypercall(H_PUT_TCE, h_put_tce);
+}
+
+int spapr_dma_dt(void *fdt, int node_off, const char *propname,
+                 DMAContext *dma)
+{
+    if (dma) {
+        sPAPRTCETable *tcet = DO_UPCAST(sPAPRTCETable, dma, dma);
+        uint32_t dma_prop[] = {cpu_to_be32(tcet->liobn),
+                               0, 0,
+                               0, cpu_to_be32(tcet->window_size)};
+        int ret;
+
+        ret = fdt_setprop_cell(fdt, node_off, "ibm,#dma-address-cells", 2);
+        if (ret < 0) {
+            return ret;
+        }
+
+        ret = fdt_setprop_cell(fdt, node_off, "ibm,#dma-size-cells", 2);
+        if (ret < 0) {
+            return ret;
+        }
+
+        ret = fdt_setprop(fdt, node_off, propname, dma_prop,
+                          sizeof(dma_prop));
+        if (ret < 0) {
+            return ret;
+        }
+    }
+
+    return 0;
+}
diff --git a/hw/spapr_llan.c b/hw/spapr_llan.c
index 8313043..2dcbcea 100644
--- a/hw/spapr_llan.c
+++ b/hw/spapr_llan.c
@@ -71,7 +71,7 @@ typedef uint64_t vlan_bd_t;
 #define VLAN_RXQ_BD_OFF      0
 #define VLAN_FILTER_BD_OFF   8
 #define VLAN_RX_BDS_OFF      16
-#define VLAN_MAX_BUFS        ((SPAPR_VIO_TCE_PAGE_SIZE - VLAN_RX_BDS_OFF) / 8)
+#define VLAN_MAX_BUFS        ((SPAPR_TCE_PAGE_SIZE - VLAN_RX_BDS_OFF) / 8)
 
 typedef struct VIOsPAPRVLANDevice {
     VIOsPAPRDevice sdev;
@@ -95,7 +95,7 @@ static ssize_t spapr_vlan_receive(VLANClientState *nc, const uint8_t *buf,
 {
     VIOsPAPRDevice *sdev = DO_UPCAST(NICState, nc, nc)->opaque;
     VIOsPAPRVLANDevice *dev = (VIOsPAPRVLANDevice *)sdev;
-    vlan_bd_t rxq_bd = ldq_tce(sdev, dev->buf_list + VLAN_RXQ_BD_OFF);
+    vlan_bd_t rxq_bd = vio_ldq(sdev, dev->buf_list + VLAN_RXQ_BD_OFF);
     vlan_bd_t bd;
     int buf_ptr = dev->use_buf_ptr;
     uint64_t handle;
@@ -114,11 +114,11 @@ static ssize_t spapr_vlan_receive(VLANClientState *nc, const uint8_t *buf,
 
     do {
         buf_ptr += 8;
-        if (buf_ptr >= SPAPR_VIO_TCE_PAGE_SIZE) {
+        if (buf_ptr >= SPAPR_TCE_PAGE_SIZE) {
             buf_ptr = VLAN_RX_BDS_OFF;
         }
 
-        bd = ldq_tce(sdev, dev->buf_list + buf_ptr);
+        bd = vio_ldq(sdev, dev->buf_list + buf_ptr);
         dprintf("use_buf_ptr=%d bd=0x%016llx\n",
                 buf_ptr, (unsigned long long)bd);
     } while ((!(bd & VLAN_BD_VALID) || (VLAN_BD_LEN(bd) < (size + 8)))
@@ -132,12 +132,12 @@ static ssize_t spapr_vlan_receive(VLANClientState *nc, const uint8_t *buf,
     /* Remove the buffer from the pool */
     dev->rx_bufs--;
     dev->use_buf_ptr = buf_ptr;
-    stq_tce(sdev, dev->buf_list + dev->use_buf_ptr, 0);
+    vio_stq(sdev, dev->buf_list + dev->use_buf_ptr, 0);
 
     dprintf("Found buffer: ptr=%d num=%d\n", dev->use_buf_ptr, dev->rx_bufs);
 
     /* Transfer the packet data */
-    if (spapr_tce_dma_write(sdev, VLAN_BD_ADDR(bd) + 8, buf, size) < 0) {
+    if (spapr_vio_dma_write(sdev, VLAN_BD_ADDR(bd) + 8, buf, size) < 0) {
         return -1;
     }
 
@@ -149,23 +149,23 @@ static ssize_t spapr_vlan_receive(VLANClientState *nc, const uint8_t *buf,
         control ^= VLAN_RXQC_TOGGLE;
     }
 
-    handle = ldq_tce(sdev, VLAN_BD_ADDR(bd));
-    stq_tce(sdev, VLAN_BD_ADDR(rxq_bd) + dev->rxq_ptr + 8, handle);
-    stw_tce(sdev, VLAN_BD_ADDR(rxq_bd) + dev->rxq_ptr + 4, size);
-    sth_tce(sdev, VLAN_BD_ADDR(rxq_bd) + dev->rxq_ptr + 2, 8);
-    stb_tce(sdev, VLAN_BD_ADDR(rxq_bd) + dev->rxq_ptr, control);
+    handle = vio_ldq(sdev, VLAN_BD_ADDR(bd));
+    vio_stq(sdev, VLAN_BD_ADDR(rxq_bd) + dev->rxq_ptr + 8, handle);
+    vio_stl(sdev, VLAN_BD_ADDR(rxq_bd) + dev->rxq_ptr + 4, size);
+    vio_sth(sdev, VLAN_BD_ADDR(rxq_bd) + dev->rxq_ptr + 2, 8);
+    vio_stb(sdev, VLAN_BD_ADDR(rxq_bd) + dev->rxq_ptr, control);
 
     dprintf("wrote rxq entry (ptr=0x%llx): 0x%016llx 0x%016llx\n",
             (unsigned long long)dev->rxq_ptr,
-            (unsigned long long)ldq_tce(sdev, VLAN_BD_ADDR(rxq_bd) +
+            (unsigned long long)vio_ldq(sdev, VLAN_BD_ADDR(rxq_bd) +
                                         dev->rxq_ptr),
-            (unsigned long long)ldq_tce(sdev, VLAN_BD_ADDR(rxq_bd) +
+            (unsigned long long)vio_ldq(sdev, VLAN_BD_ADDR(rxq_bd) +
                                         dev->rxq_ptr + 8));
 
     dev->rxq_ptr += 16;
     if (dev->rxq_ptr >= VLAN_BD_LEN(rxq_bd)) {
         dev->rxq_ptr = 0;
-        stq_tce(sdev, dev->buf_list + VLAN_RXQ_BD_OFF, rxq_bd ^ VLAN_BD_TOGGLE);
+        vio_stq(sdev, dev->buf_list + VLAN_RXQ_BD_OFF, rxq_bd ^ VLAN_BD_TOGGLE);
     }
 
     if (sdev->signal_state & 1) {
@@ -254,8 +254,10 @@ static int check_bd(VIOsPAPRVLANDevice *dev, vlan_bd_t bd,
         return -1;
     }
 
-    if (spapr_vio_check_tces(&dev->sdev, VLAN_BD_ADDR(bd),
-                             VLAN_BD_LEN(bd), SPAPR_TCE_RW) != 0) {
+    if (!spapr_vio_dma_valid(&dev->sdev, VLAN_BD_ADDR(bd),
+                             VLAN_BD_LEN(bd), DMA_DIRECTION_FROM_DEVICE)
+        || !spapr_vio_dma_valid(&dev->sdev, VLAN_BD_ADDR(bd),
+                                VLAN_BD_LEN(bd), DMA_DIRECTION_TO_DEVICE)) {
         return -1;
     }
 
@@ -285,14 +287,14 @@ static target_ulong h_register_logical_lan(CPUPPCState *env,
         return H_RESOURCE;
     }
 
-    if (check_bd(dev, VLAN_VALID_BD(buf_list, SPAPR_VIO_TCE_PAGE_SIZE),
-                 SPAPR_VIO_TCE_PAGE_SIZE) < 0) {
+    if (check_bd(dev, VLAN_VALID_BD(buf_list, SPAPR_TCE_PAGE_SIZE),
+                 SPAPR_TCE_PAGE_SIZE) < 0) {
         hcall_dprintf("Bad buf_list 0x" TARGET_FMT_lx "\n", buf_list);
         return H_PARAMETER;
     }
 
-    filter_list_bd = VLAN_VALID_BD(filter_list, SPAPR_VIO_TCE_PAGE_SIZE);
-    if (check_bd(dev, filter_list_bd, SPAPR_VIO_TCE_PAGE_SIZE) < 0) {
+    filter_list_bd = VLAN_VALID_BD(filter_list, SPAPR_TCE_PAGE_SIZE);
+    if (check_bd(dev, filter_list_bd, SPAPR_TCE_PAGE_SIZE) < 0) {
         hcall_dprintf("Bad filter_list 0x" TARGET_FMT_lx "\n", filter_list);
         return H_PARAMETER;
     }
@@ -309,17 +311,17 @@ static target_ulong h_register_logical_lan(CPUPPCState *env,
     rec_queue &= ~VLAN_BD_TOGGLE;
 
     /* Initialize the buffer list */
-    stq_tce(sdev, buf_list, rec_queue);
-    stq_tce(sdev, buf_list + 8, filter_list_bd);
-    spapr_tce_dma_zero(sdev, buf_list + VLAN_RX_BDS_OFF,
-                       SPAPR_VIO_TCE_PAGE_SIZE - VLAN_RX_BDS_OFF);
+    vio_stq(sdev, buf_list, rec_queue);
+    vio_stq(sdev, buf_list + 8, filter_list_bd);
+    spapr_vio_dma_zero(sdev, buf_list + VLAN_RX_BDS_OFF,
+                       SPAPR_TCE_PAGE_SIZE - VLAN_RX_BDS_OFF);
     dev->add_buf_ptr = VLAN_RX_BDS_OFF - 8;
     dev->use_buf_ptr = VLAN_RX_BDS_OFF - 8;
     dev->rx_bufs = 0;
     dev->rxq_ptr = 0;
 
     /* Initialize the receive queue */
-    spapr_tce_dma_zero(sdev, VLAN_BD_ADDR(rec_queue), VLAN_BD_LEN(rec_queue));
+    spapr_vio_dma_zero(sdev, VLAN_BD_ADDR(rec_queue), VLAN_BD_LEN(rec_queue));
 
     dev->isopen = 1;
     return H_SUCCESS;
@@ -378,14 +380,14 @@ static target_ulong h_add_logical_lan_buffer(CPUPPCState *env,
 
     do {
         dev->add_buf_ptr += 8;
-        if (dev->add_buf_ptr >= SPAPR_VIO_TCE_PAGE_SIZE) {
+        if (dev->add_buf_ptr >= SPAPR_TCE_PAGE_SIZE) {
             dev->add_buf_ptr = VLAN_RX_BDS_OFF;
         }
 
-        bd = ldq_tce(sdev, dev->buf_list + dev->add_buf_ptr);
+        bd = vio_ldq(sdev, dev->buf_list + dev->add_buf_ptr);
     } while (bd & VLAN_BD_VALID);
 
-    stq_tce(sdev, dev->buf_list + dev->add_buf_ptr, buf);
+    vio_stq(sdev, dev->buf_list + dev->add_buf_ptr, buf);
 
     dev->rx_bufs++;
 
@@ -451,7 +453,7 @@ static target_ulong h_send_logical_lan(CPUPPCState *env, sPAPREnvironment *spapr
     lbuf = alloca(total_len);
     p = lbuf;
     for (i = 0; i < nbufs; i++) {
-        ret = spapr_tce_dma_read(sdev, VLAN_BD_ADDR(bufs[i]),
+        ret = spapr_vio_dma_read(sdev, VLAN_BD_ADDR(bufs[i]),
                                  p, VLAN_BD_LEN(bufs[i]));
         if (ret < 0) {
             return ret;
@@ -479,7 +481,7 @@ static target_ulong h_multicast_ctrl(CPUPPCState *env, sPAPREnvironment *spapr,
 }
 
 static Property spapr_vlan_properties[] = {
-    DEFINE_SPAPR_PROPERTIES(VIOsPAPRVLANDevice, sdev, 0x10000000),
+    DEFINE_SPAPR_PROPERTIES(VIOsPAPRVLANDevice, sdev),
     DEFINE_NIC_PROPERTIES(VIOsPAPRVLANDevice, nicconf),
     DEFINE_PROP_END_OF_LIST(),
 };
@@ -497,6 +499,7 @@ static void spapr_vlan_class_init(ObjectClass *klass, void *data)
     k->dt_compatible = "IBM,l-lan";
     k->signal_mask = 0x1;
     dc->props = spapr_vlan_properties;
+    k->rtce_window_size = 0x10000000;
 }
 
 static TypeInfo spapr_vlan_info = {
diff --git a/hw/spapr_vio.c b/hw/spapr_vio.c
index 315ab80..9a0a6e9 100644
--- a/hw/spapr_vio.c
+++ b/hw/spapr_vio.c
@@ -39,7 +39,6 @@
 #endif /* CONFIG_FDT */
 
 /* #define DEBUG_SPAPR */
-/* #define DEBUG_TCE */
 
 #ifdef DEBUG_SPAPR
 #define dprintf(fmt, ...) \
@@ -141,26 +140,9 @@ static int vio_make_devnode(VIOsPAPRDevice *dev,
         }
     }
 
-    if (dev->rtce_window_size) {
-        uint32_t dma_prop[] = {cpu_to_be32(dev->reg),
-                               0, 0,
-                               0, cpu_to_be32(dev->rtce_window_size)};
-
-        ret = fdt_setprop_cell(fdt, node_off, "ibm,#dma-address-cells", 2);
-        if (ret < 0) {
-            return ret;
-        }
-
-        ret = fdt_setprop_cell(fdt, node_off, "ibm,#dma-size-cells", 2);
-        if (ret < 0) {
-            return ret;
-        }
-
-        ret = fdt_setprop(fdt, node_off, "ibm,my-dma-window", dma_prop,
-                          sizeof(dma_prop));
-        if (ret < 0) {
-            return ret;
-        }
+    ret = spapr_dma_dt(fdt, node_off, "ibm,my-dma-window", dev->dma);
+    if (ret < 0) {
+        return ret;
     }
 
     if (pc->devnode) {
@@ -175,232 +157,6 @@ static int vio_make_devnode(VIOsPAPRDevice *dev,
 #endif /* CONFIG_FDT */
 
 /*
- * RTCE handling
- */
-
-static void rtce_init(VIOsPAPRDevice *dev)
-{
-    size_t size = (dev->rtce_window_size >> SPAPR_VIO_TCE_PAGE_SHIFT)
-        * sizeof(VIOsPAPR_RTCE);
-
-    if (size) {
-        dev->rtce_table = kvmppc_create_spapr_tce(dev->reg,
-                                                  dev->rtce_window_size,
-                                                  &dev->kvmtce_fd);
-
-        if (!dev->rtce_table) {
-            dev->rtce_table = g_malloc0(size);
-        }
-    }
-}
-
-static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
-                              target_ulong opcode, target_ulong *args)
-{
-    target_ulong liobn = args[0];
-    target_ulong ioba = args[1];
-    target_ulong tce = args[2];
-    VIOsPAPRDevice *dev = spapr_vio_find_by_reg(spapr->vio_bus, liobn);
-    VIOsPAPR_RTCE *rtce;
-
-    if (!dev) {
-        hcall_dprintf("LIOBN 0x" TARGET_FMT_lx " does not exist\n", liobn);
-        return H_PARAMETER;
-    }
-
-    ioba &= ~(SPAPR_VIO_TCE_PAGE_SIZE - 1);
-
-#ifdef DEBUG_TCE
-    fprintf(stderr, "spapr_vio_put_tce on %s  ioba 0x" TARGET_FMT_lx
-            "  TCE 0x" TARGET_FMT_lx "\n", dev->qdev.id, ioba, tce);
-#endif
-
-    if (ioba >= dev->rtce_window_size) {
-        hcall_dprintf("Out-of-bounds IOBA 0x" TARGET_FMT_lx "\n", ioba);
-        return H_PARAMETER;
-    }
-
-    rtce = dev->rtce_table + (ioba >> SPAPR_VIO_TCE_PAGE_SHIFT);
-    rtce->tce = tce;
-
-    return H_SUCCESS;
-}
-
-int spapr_vio_check_tces(VIOsPAPRDevice *dev, target_ulong ioba,
-                         target_ulong len, enum VIOsPAPR_TCEAccess access)
-{
-    int start, end, i;
-
-    start = ioba >> SPAPR_VIO_TCE_PAGE_SHIFT;
-    end = (ioba + len - 1) >> SPAPR_VIO_TCE_PAGE_SHIFT;
-
-    for (i = start; i <= end; i++) {
-        if ((dev->rtce_table[i].tce & access) != access) {
-#ifdef DEBUG_TCE
-            fprintf(stderr, "FAIL on %d\n", i);
-#endif
-            return -1;
-        }
-    }
-
-    return 0;
-}
-
-int spapr_tce_dma_write(VIOsPAPRDevice *dev, uint64_t taddr, const void *buf,
-                        uint32_t size)
-{
-#ifdef DEBUG_TCE
-    fprintf(stderr, "spapr_tce_dma_write taddr=0x%llx size=0x%x\n",
-            (unsigned long long)taddr, size);
-#endif
-
-    /* Check for bypass */
-    if (dev->flags & VIO_PAPR_FLAG_DMA_BYPASS) {
-        cpu_physical_memory_write(taddr, buf, size);
-        return 0;
-    }
-
-    while (size) {
-        uint64_t tce;
-        uint32_t lsize;
-        uint64_t txaddr;
-
-        /* Check if we are in bound */
-        if (taddr >= dev->rtce_window_size) {
-#ifdef DEBUG_TCE
-            fprintf(stderr, "spapr_tce_dma_write out of bounds\n");
-#endif
-            return H_DEST_PARM;
-        }
-        tce = dev->rtce_table[taddr >> SPAPR_VIO_TCE_PAGE_SHIFT].tce;
-
-        /* How much til end of page ? */
-        lsize = MIN(size, ((~taddr) & SPAPR_VIO_TCE_PAGE_MASK) + 1);
-
-        /* Check TCE */
-        if (!(tce & 2)) {
-            return H_DEST_PARM;
-        }
-
-        /* Translate */
-        txaddr = (tce & ~SPAPR_VIO_TCE_PAGE_MASK) |
-            (taddr & SPAPR_VIO_TCE_PAGE_MASK);
-
-#ifdef DEBUG_TCE
-        fprintf(stderr, " -> write to txaddr=0x%llx, size=0x%x\n",
-                (unsigned long long)txaddr, lsize);
-#endif
-
-        /* Do it */
-        cpu_physical_memory_write(txaddr, buf, lsize);
-        buf += lsize;
-        taddr += lsize;
-        size -= lsize;
-    }
-    return 0;
-}
-
-int spapr_tce_dma_zero(VIOsPAPRDevice *dev, uint64_t taddr, uint32_t size)
-{
-    /* FIXME: allocating a temp buffer is nasty, but just stepping
-     * through writing zeroes is awkward.  This will do for now. */
-    uint8_t zeroes[size];
-
-#ifdef DEBUG_TCE
-    fprintf(stderr, "spapr_tce_dma_zero taddr=0x%llx size=0x%x\n",
-            (unsigned long long)taddr, size);
-#endif
-
-    memset(zeroes, 0, size);
-    return spapr_tce_dma_write(dev, taddr, zeroes, size);
-}
-
-void stb_tce(VIOsPAPRDevice *dev, uint64_t taddr, uint8_t val)
-{
-    spapr_tce_dma_write(dev, taddr, &val, sizeof(val));
-}
-
-void sth_tce(VIOsPAPRDevice *dev, uint64_t taddr, uint16_t val)
-{
-    val = tswap16(val);
-    spapr_tce_dma_write(dev, taddr, &val, sizeof(val));
-}
-
-
-void stw_tce(VIOsPAPRDevice *dev, uint64_t taddr, uint32_t val)
-{
-    val = tswap32(val);
-    spapr_tce_dma_write(dev, taddr, &val, sizeof(val));
-}
-
-void stq_tce(VIOsPAPRDevice *dev, uint64_t taddr, uint64_t val)
-{
-    val = tswap64(val);
-    spapr_tce_dma_write(dev, taddr, &val, sizeof(val));
-}
-
-int spapr_tce_dma_read(VIOsPAPRDevice *dev, uint64_t taddr, void *buf,
-                       uint32_t size)
-{
-#ifdef DEBUG_TCE
-    fprintf(stderr, "spapr_tce_dma_write taddr=0x%llx size=0x%x\n",
-            (unsigned long long)taddr, size);
-#endif
-
-    /* Check for bypass */
-    if (dev->flags & VIO_PAPR_FLAG_DMA_BYPASS) {
-        cpu_physical_memory_read(taddr, buf, size);
-        return 0;
-    }
-
-    while (size) {
-        uint64_t tce;
-        uint32_t lsize;
-        uint64_t txaddr;
-
-        /* Check if we are in bound */
-        if (taddr >= dev->rtce_window_size) {
-#ifdef DEBUG_TCE
-            fprintf(stderr, "spapr_tce_dma_read out of bounds\n");
-#endif
-            return H_DEST_PARM;
-        }
-        tce = dev->rtce_table[taddr >> SPAPR_VIO_TCE_PAGE_SHIFT].tce;
-
-        /* How much til end of page ? */
-        lsize = MIN(size, ((~taddr) & SPAPR_VIO_TCE_PAGE_MASK) + 1);
-
-        /* Check TCE */
-        if (!(tce & 1)) {
-            return H_DEST_PARM;
-        }
-
-        /* Translate */
-        txaddr = (tce & ~SPAPR_VIO_TCE_PAGE_MASK) |
-            (taddr & SPAPR_VIO_TCE_PAGE_MASK);
-
-#ifdef DEBUG_TCE
-        fprintf(stderr, " -> write to txaddr=0x%llx, size=0x%x\n",
-                (unsigned long long)txaddr, lsize);
-#endif
-        /* Do it */
-        cpu_physical_memory_read(txaddr, buf, lsize);
-        buf += lsize;
-        taddr += lsize;
-        size -= lsize;
-    }
-    return H_SUCCESS;
-}
-
-uint64_t ldq_tce(VIOsPAPRDevice *dev, uint64_t taddr)
-{
-    uint64_t val;
-
-    spapr_tce_dma_read(dev, taddr, &val, sizeof(val));
-    return tswap64(val);
-}
-
-/*
  * CRQ handling
  */
 static target_ulong h_reg_crq(CPUPPCState *env, sPAPREnvironment *spapr,
@@ -524,7 +280,7 @@ int spapr_vio_send_crq(VIOsPAPRDevice *dev, uint8_t *crq)
     }
 
     /* Maybe do a fast path for KVM just writing to the pages */
-    rc = spapr_tce_dma_read(dev, dev->crq.qladdr + dev->crq.qnext, &byte, 1);
+    rc = spapr_vio_dma_read(dev, dev->crq.qladdr + dev->crq.qnext, &byte, 1);
     if (rc) {
         return rc;
     }
@@ -532,7 +288,7 @@ int spapr_vio_send_crq(VIOsPAPRDevice *dev, uint8_t *crq)
         return 1;
     }
 
-    rc = spapr_tce_dma_write(dev, dev->crq.qladdr + dev->crq.qnext + 8,
+    rc = spapr_vio_dma_write(dev, dev->crq.qladdr + dev->crq.qnext + 8,
                              &crq[8], 8);
     if (rc) {
         return rc;
@@ -540,7 +296,7 @@ int spapr_vio_send_crq(VIOsPAPRDevice *dev, uint8_t *crq)
 
     kvmppc_eieio();
 
-    rc = spapr_tce_dma_write(dev, dev->crq.qladdr + dev->crq.qnext, crq, 8);
+    rc = spapr_vio_dma_write(dev, dev->crq.qladdr + dev->crq.qnext, crq, 8);
     if (rc) {
         return rc;
     }
@@ -558,13 +314,13 @@ int spapr_vio_send_crq(VIOsPAPRDevice *dev, uint8_t *crq)
 
 static void spapr_vio_quiesce_one(VIOsPAPRDevice *dev)
 {
-    dev->flags &= ~VIO_PAPR_FLAG_DMA_BYPASS;
+    VIOsPAPRDeviceClass *pc = VIO_SPAPR_DEVICE_GET_CLASS(dev);
+    uint32_t liobn = SPAPR_VIO_BASE_LIOBN | dev->reg;
 
-    if (dev->rtce_table) {
-        size_t size = (dev->rtce_window_size >> SPAPR_VIO_TCE_PAGE_SHIFT)
-            * sizeof(VIOsPAPR_RTCE);
-        memset(dev->rtce_table, 0, size);
+    if (dev->dma) {
+        spapr_tce_free(dev->dma);
     }
+    dev->dma = spapr_tce_new_dma_context(liobn, pc->rtce_window_size);
 
     dev->crq.qladdr = 0;
     dev->crq.qsize = 0;
@@ -591,9 +347,13 @@ static void rtas_set_tce_bypass(sPAPREnvironment *spapr, uint32_t token,
         return;
     }
     if (enable) {
-        dev->flags |= VIO_PAPR_FLAG_DMA_BYPASS;
+        spapr_tce_free(dev->dma);
+        dev->dma = NULL;
     } else {
-        dev->flags &= ~VIO_PAPR_FLAG_DMA_BYPASS;
+        VIOsPAPRDeviceClass *pc = VIO_SPAPR_DEVICE_GET_CLASS(dev);
+        uint32_t liobn = SPAPR_VIO_BASE_LIOBN | dev->reg;
+
+        dev->dma = spapr_tce_new_dma_context(liobn, pc->rtce_window_size);
     }
 
     rtas_st(rets, 0, 0);
@@ -660,6 +420,7 @@ static int spapr_vio_busdev_init(DeviceState *qdev)
 {
     VIOsPAPRDevice *dev = (VIOsPAPRDevice *)qdev;
     VIOsPAPRDeviceClass *pc = VIO_SPAPR_DEVICE_GET_CLASS(dev);
+    uint32_t liobn;
     char *id;
 
     if (dev->reg != -1) {
@@ -701,7 +462,8 @@ static int spapr_vio_busdev_init(DeviceState *qdev)
         return -1;
     }
 
-    rtce_init(dev);
+    liobn = SPAPR_VIO_BASE_LIOBN | dev->reg;
+    dev->dma = spapr_tce_new_dma_context(liobn, pc->rtce_window_size);
 
     return pc->init(dev);
 }
@@ -749,9 +511,6 @@ VIOsPAPRBus *spapr_vio_bus_init(void)
     /* hcall-vio */
     spapr_register_hypercall(H_VIO_SIGNAL, h_vio_signal);
 
-    /* hcall-tce */
-    spapr_register_hypercall(H_PUT_TCE, h_put_tce);
-
     /* hcall-crq */
     spapr_register_hypercall(H_REG_CRQ, h_reg_crq);
     spapr_register_hypercall(H_FREE_CRQ, h_free_crq);
diff --git a/hw/spapr_vio.h b/hw/spapr_vio.h
index 87816e4..db1698a 100644
--- a/hw/spapr_vio.h
+++ b/hw/spapr_vio.h
@@ -21,16 +21,7 @@
  * License along with this library; if not, see <http://www.gnu.org/licenses/>.
  */
 
-#define SPAPR_VIO_TCE_PAGE_SHIFT   12
-#define SPAPR_VIO_TCE_PAGE_SIZE    (1ULL << SPAPR_VIO_TCE_PAGE_SHIFT)
-#define SPAPR_VIO_TCE_PAGE_MASK    (SPAPR_VIO_TCE_PAGE_SIZE - 1)
-
-enum VIOsPAPR_TCEAccess {
-    SPAPR_TCE_FAULT = 0,
-    SPAPR_TCE_RO = 1,
-    SPAPR_TCE_WO = 2,
-    SPAPR_TCE_RW = 3,
-};
+#include "dma.h"
 
 #define TYPE_VIO_SPAPR_DEVICE "vio-spapr-device"
 #define VIO_SPAPR_DEVICE(obj) \
@@ -42,10 +33,6 @@ enum VIOsPAPR_TCEAccess {
 
 struct VIOsPAPRDevice;
 
-typedef struct VIOsPAPR_RTCE {
-    uint64_t tce;
-} VIOsPAPR_RTCE;
-
 typedef struct VIOsPAPR_CRQ {
     uint64_t qladdr;
     uint32_t qsize;
@@ -61,6 +48,7 @@ typedef struct VIOsPAPRDeviceClass {
 
     const char *dt_name, *dt_type, *dt_compatible;
     target_ulong signal_mask;
+    uint32_t rtce_window_size;
     int (*init)(VIOsPAPRDevice *dev);
     void (*reset)(VIOsPAPRDevice *dev);
     int (*devnode)(VIOsPAPRDevice *dev, void *fdt, int node_off);
@@ -70,20 +58,15 @@ struct VIOsPAPRDevice {
     DeviceState qdev;
     uint32_t reg;
     uint32_t flags;
-#define VIO_PAPR_FLAG_DMA_BYPASS        0x1
     qemu_irq qirq;
     uint32_t vio_irq_num;
     target_ulong signal_state;
-    uint32_t rtce_window_size;
-    VIOsPAPR_RTCE *rtce_table;
-    int kvmtce_fd;
     VIOsPAPR_CRQ crq;
+    DMAContext *dma;
 };
 
-#define DEFINE_SPAPR_PROPERTIES(type, field, default_dma_window)       \
-        DEFINE_PROP_UINT32("reg", type, field.reg, -1),                \
-        DEFINE_PROP_UINT32("dma-window", type, field.rtce_window_size, \
-                           default_dma_window)
+#define DEFINE_SPAPR_PROPERTIES(type, field)           \
+        DEFINE_PROP_UINT32("reg", type, field.reg, -1)
 
 struct VIOsPAPRBus {
     BusState bus;
@@ -99,20 +82,38 @@ extern int spapr_populate_chosen_stdout(void *fdt, VIOsPAPRBus *bus);
 
 extern int spapr_vio_signal(VIOsPAPRDevice *dev, target_ulong mode);
 
-int spapr_vio_check_tces(VIOsPAPRDevice *dev, target_ulong ioba,
-                         target_ulong len,
-                         enum VIOsPAPR_TCEAccess access);
-
-int spapr_tce_dma_read(VIOsPAPRDevice *dev, uint64_t taddr,
-                       void *buf, uint32_t size);
-int spapr_tce_dma_write(VIOsPAPRDevice *dev, uint64_t taddr,
-                        const void *buf, uint32_t size);
-int spapr_tce_dma_zero(VIOsPAPRDevice *dev, uint64_t taddr, uint32_t size);
-void stb_tce(VIOsPAPRDevice *dev, uint64_t taddr, uint8_t val);
-void sth_tce(VIOsPAPRDevice *dev, uint64_t taddr, uint16_t val);
-void stw_tce(VIOsPAPRDevice *dev, uint64_t taddr, uint32_t val);
-void stq_tce(VIOsPAPRDevice *dev, uint64_t taddr, uint64_t val);
-uint64_t ldq_tce(VIOsPAPRDevice *dev, uint64_t taddr);
+static inline bool spapr_vio_dma_valid(VIOsPAPRDevice *dev, uint64_t taddr,
+                                       uint32_t size, DMADirection dir)
+{
+    return dma_memory_valid(dev->dma, taddr, size, dir);
+}
+
+static inline int spapr_vio_dma_read(VIOsPAPRDevice *dev, uint64_t taddr,
+                                     void *buf, uint32_t size)
+{
+    return (dma_memory_read(dev->dma, taddr, buf, size) != 0) ?
+        H_DEST_PARM : H_SUCCESS;
+}
+
+static inline int spapr_vio_dma_write(VIOsPAPRDevice *dev, uint64_t taddr,
+                                      const void *buf, uint32_t size)
+{
+    return (dma_memory_write(dev->dma, taddr, buf, size) != 0) ?
+        H_DEST_PARM : H_SUCCESS;
+}
+
+static inline int spapr_vio_dma_zero(VIOsPAPRDevice *dev, uint64_t taddr,
+                                     uint32_t size)
+{
+    return (dma_memory_zero(dev->dma, taddr, size) != 0) ?
+        H_DEST_PARM : H_SUCCESS;
+}
+
+#define vio_stb(_dev, _addr, _val) (stb_dma((_dev)->dma, (_addr), (_val)))
+#define vio_sth(_dev, _addr, _val) (stw_be_dma((_dev)->dma, (_addr), (_val)))
+#define vio_stl(_dev, _addr, _val) (stl_be_dma((_dev)->dma, (_addr), (_val)))
+#define vio_stq(_dev, _addr, _val) (stq_be_dma((_dev)->dma, (_addr), (_val)))
+#define vio_ldq(_dev, _addr) (ldq_be_dma((_dev)->dma, (_addr)))
 
 int spapr_vio_send_crq(VIOsPAPRDevice *dev, uint8_t *crq);
 
diff --git a/hw/spapr_vscsi.c b/hw/spapr_vscsi.c
index 037867a..d2fe3e5 100644
--- a/hw/spapr_vscsi.c
+++ b/hw/spapr_vscsi.c
@@ -165,7 +165,7 @@ static int vscsi_send_iu(VSCSIState *s, vscsi_req *req,
     long rc, rc1;
 
     /* First copy the SRP */
-    rc = spapr_tce_dma_write(&s->vdev, req->crq.s.IU_data_ptr,
+    rc = spapr_vio_dma_write(&s->vdev, req->crq.s.IU_data_ptr,
                              &req->iu, length);
     if (rc) {
         fprintf(stderr, "vscsi_send_iu: DMA write failure !\n");
@@ -281,9 +281,9 @@ static int vscsi_srp_direct_data(VSCSIState *s, vscsi_req *req,
     llen = MIN(len, md->len);
     if (llen) {
         if (req->writing) { /* writing = to device = reading from memory */
-            rc = spapr_tce_dma_read(&s->vdev, md->va, buf, llen);
+            rc = spapr_vio_dma_read(&s->vdev, md->va, buf, llen);
         } else {
-            rc = spapr_tce_dma_write(&s->vdev, md->va, buf, llen);
+            rc = spapr_vio_dma_write(&s->vdev, md->va, buf, llen);
         }
     }
     md->len -= llen;
@@ -329,10 +329,11 @@ static int vscsi_srp_indirect_data(VSCSIState *s, vscsi_req *req,
             md = req->cur_desc = &req->ext_desc;
             dprintf("VSCSI:   Reading desc from 0x%llx\n",
                     (unsigned long long)td->va);
-            rc = spapr_tce_dma_read(&s->vdev, td->va, md,
+            rc = spapr_vio_dma_read(&s->vdev, td->va, md,
                                     sizeof(struct srp_direct_buf));
             if (rc) {
-                dprintf("VSCSI: tce_dma_read -> %d reading ext_desc\n", rc);
+                dprintf("VSCSI: spapr_vio_dma_read -> %d reading ext_desc\n",
+                        rc);
                 break;
             }
             vscsi_swap_desc(md);
@@ -345,12 +346,12 @@ static int vscsi_srp_indirect_data(VSCSIState *s, vscsi_req *req,
         /* Perform transfer */
         llen = MIN(len, md->len);
         if (req->writing) { /* writing = to device = reading from memory */
-            rc = spapr_tce_dma_read(&s->vdev, md->va, buf, llen);
+            rc = spapr_vio_dma_read(&s->vdev, md->va, buf, llen);
         } else {
-            rc = spapr_tce_dma_write(&s->vdev, md->va, buf, llen);
+            rc = spapr_vio_dma_write(&s->vdev, md->va, buf, llen);
         }
         if (rc) {
-            dprintf("VSCSI: tce_dma_r/w(%d) -> %d\n", req->writing, rc);
+            dprintf("VSCSI: spapr_vio_dma_r/w(%d) -> %d\n", req->writing, rc);
             break;
         }
         dprintf("VSCSI:     data: %02x %02x %02x %02x...\n",
@@ -728,7 +729,7 @@ static int vscsi_send_adapter_info(VSCSIState *s, vscsi_req *req)
     sinfo = &req->iu.mad.adapter_info;
 
 #if 0 /* What for ? */
-    rc = spapr_tce_dma_read(&s->vdev, be64_to_cpu(sinfo->buffer),
+    rc = spapr_vio_dma_read(&s->vdev, be64_to_cpu(sinfo->buffer),
                             &info, be16_to_cpu(sinfo->common.length));
     if (rc) {
         fprintf(stderr, "vscsi_send_adapter_info: DMA read failure !\n");
@@ -742,7 +743,7 @@ static int vscsi_send_adapter_info(VSCSIState *s, vscsi_req *req)
     info.os_type = cpu_to_be32(2);
     info.port_max_txu[0] = cpu_to_be32(VSCSI_MAX_SECTORS << 9);
 
-    rc = spapr_tce_dma_write(&s->vdev, be64_to_cpu(sinfo->buffer),
+    rc = spapr_vio_dma_write(&s->vdev, be64_to_cpu(sinfo->buffer),
                              &info, be16_to_cpu(sinfo->common.length));
     if (rc)  {
         fprintf(stderr, "vscsi_send_adapter_info: DMA write failure !\n");
@@ -804,7 +805,7 @@ static void vscsi_got_payload(VSCSIState *s, vscsi_crq *crq)
     }
 
     /* XXX Handle failure differently ? */
-    if (spapr_tce_dma_read(&s->vdev, crq->s.IU_data_ptr, &req->iu,
+    if (spapr_vio_dma_read(&s->vdev, crq->s.IU_data_ptr, &req->iu,
                            crq->s.IU_length)) {
         fprintf(stderr, "vscsi_got_payload: DMA read failure !\n");
         g_free(req);
@@ -945,7 +946,7 @@ static int spapr_vscsi_devnode(VIOsPAPRDevice *dev, void *fdt, int node_off)
 }
 
 static Property spapr_vscsi_properties[] = {
-    DEFINE_SPAPR_PROPERTIES(VSCSIState, vdev, 0x10000000),
+    DEFINE_SPAPR_PROPERTIES(VSCSIState, vdev),
     DEFINE_PROP_END_OF_LIST(),
 };
 
@@ -962,6 +963,7 @@ static void spapr_vscsi_class_init(ObjectClass *klass, void *data)
     k->dt_compatible = "IBM,v-scsi";
     k->signal_mask = 0x00000001;
     dc->props = spapr_vscsi_properties;
+    k->rtce_window_size = 0x10000000;
 }
 
 static TypeInfo spapr_vscsi_info = {
diff --git a/hw/spapr_vty.c b/hw/spapr_vty.c
index c9674f3..f0b27b1 100644
--- a/hw/spapr_vty.c
+++ b/hw/spapr_vty.c
@@ -133,7 +133,7 @@ void spapr_vty_create(VIOsPAPRBus *bus, CharDriverState *chardev)
 }
 
 static Property spapr_vty_properties[] = {
-    DEFINE_SPAPR_PROPERTIES(VIOsPAPRVTYDevice, sdev, 0),
+    DEFINE_SPAPR_PROPERTIES(VIOsPAPRVTYDevice, sdev),
     DEFINE_PROP_CHR("chardev", VIOsPAPRVTYDevice, chardev),
     DEFINE_PROP_END_OF_LIST(),
 };
diff --git a/target-ppc/kvm.c b/target-ppc/kvm.c
index c09cc39..0ab7630 100644
--- a/target-ppc/kvm.c
+++ b/target-ppc/kvm.c
@@ -859,7 +859,7 @@ void *kvmppc_create_spapr_tce(uint32_t liobn, uint32_t window_size, int *pfd)
         return NULL;
     }
 
-    len = (window_size / SPAPR_VIO_TCE_PAGE_SIZE) * sizeof(VIOsPAPR_RTCE);
+    len = (window_size / SPAPR_TCE_PAGE_SIZE) * sizeof(sPAPRTCE);
     /* FIXME: round this up to page size */
 
     table = mmap(NULL, len, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
@@ -882,7 +882,7 @@ int kvmppc_remove_spapr_tce(void *table, int fd, uint32_t window_size)
         return -1;
     }
 
-    len = (window_size / SPAPR_VIO_TCE_PAGE_SIZE)*sizeof(VIOsPAPR_RTCE);
+    len = (window_size / SPAPR_TCE_PAGE_SIZE)*sizeof(sPAPRTCE);
     if ((munmap(table, len) < 0) ||
         (close(fd) < 0)) {
         fprintf(stderr, "KVM: Unexpected error removing TCE table: %s",
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [Qemu-devel] [PATCH 11/13] iommu: Allow PCI to use IOMMU infrastructure
  2012-05-10  4:48 [Qemu-devel] [PATCH 00/13] IOMMU infrastructure Benjamin Herrenschmidt
                   ` (9 preceding siblings ...)
  2012-05-10  4:49 ` [Qemu-devel] [PATCH 10/13] pseries: Convert sPAPR TCEs to use generic IOMMU infrastructure Benjamin Herrenschmidt
@ 2012-05-10  4:49 ` Benjamin Herrenschmidt
  2012-05-10  4:49 ` [Qemu-devel] [PATCH 12/13] pseries: Implement IOMMU and DMA for PAPR PCI devices Benjamin Herrenschmidt
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-10  4:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduard - Gabriel Munteanu, Richard Henderson, Michael S. Tsirkin,
	David Gibson

From: David Gibson <david@gibson.dropbear.id.au>

This patch adds some hooks to let PCI devices and busses use the new IOMMU
infrastructure.  When IOMMU support is enabled, each PCI device now
contains a DMAContext * which is used by the pci_dma_*() wrapper functions.

By default, the contexts are initialized to NULL, assuming no IOMMU.
However the platform or host bridge code which sets up the PCI bus can use
pci_setup_iommu() to set a function which will determine the correct
DMAContext for a given PCI device.
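
As a rough, purely illustrative sketch of the intended usage (the host
bridge type and callback name below are made up; only pci_setup_iommu()
and the PCIDMAContextFunc signature come from this patch), a platform
would wire this up along these lines:

    /* Hypothetical host bridge code -- MyPHBState and
     * my_phb_dma_context_fn() are illustrative names only. */
    static DMAContext *my_phb_dma_context_fn(PCIBus *bus, void *opaque,
                                             int devfn)
    {
        MyPHBState *phb = opaque;

        /* All devices behind this bridge share the bridge's context */
        return phb->dma;
    }

    /* ...in the host bridge init code, once the PCIBus exists: */
    pci_setup_iommu(bus, my_phb_dma_context_fn, phb);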

Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
 hw/pci.c           |    9 +++++++++
 hw/pci.h           |    9 +++++++--
 hw/pci_internals.h |    2 ++
 3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/hw/pci.c b/hw/pci.c
index b706e69..8901c01 100644
--- a/hw/pci.c
+++ b/hw/pci.c
@@ -761,6 +761,9 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev, PCIBus *bus,
         return NULL;
     }
     pci_dev->bus = bus;
+    if (bus->dma_context_fn) {
+        pci_dev->dma = bus->dma_context_fn(bus, bus->dma_context_opaque, devfn);
+    }
     pci_dev->devfn = devfn;
     pstrcpy(pci_dev->name, sizeof(pci_dev->name), name);
     pci_dev->irq_state = 0;
@@ -2004,6 +2007,12 @@ static void pci_device_class_init(ObjectClass *klass, void *data)
     k->bus_info = &pci_bus_info;
 }
 
+void pci_setup_iommu(PCIBus *bus, PCIDMAContextFunc fn, void *opaque)
+{
+    bus->dma_context_fn = fn;
+    bus->dma_context_opaque = opaque;
+}
+
 static TypeInfo pci_device_type_info = {
     .name = TYPE_PCI_DEVICE,
     .parent = TYPE_DEVICE,
diff --git a/hw/pci.h b/hw/pci.h
index 7e36c53..8c96438 100644
--- a/hw/pci.h
+++ b/hw/pci.h
@@ -175,6 +175,7 @@ typedef struct PCIDeviceClass {
 
 struct PCIDevice {
     DeviceState qdev;
+
     /* PCI config space */
     uint8_t *config;
 
@@ -196,6 +197,7 @@ struct PCIDevice {
     uint32_t devfn;
     char name[64];
     PCIIORegion io_regions[PCI_NUM_REGIONS];
+    DMAContext *dma;
 
     /* do not access the following fields */
     PCIConfigReadFunc *config_read;
@@ -316,6 +318,10 @@ int pci_read_devaddr(Monitor *mon, const char *addr, int *domp, int *busp,
 
 void pci_device_deassert_intx(PCIDevice *dev);
 
+typedef DMAContext *(*PCIDMAContextFunc)(PCIBus *, void *, int);
+
+void pci_setup_iommu(PCIBus *bus, PCIDMAContextFunc fn, void *opaque);
+
 static inline void
 pci_set_byte(uint8_t *config, uint8_t val)
 {
@@ -552,8 +558,7 @@ static inline uint32_t pci_config_size(const PCIDevice *d)
 /* DMA access functions */
 static inline DMAContext *pci_dma_context(PCIDevice *dev)
 {
-    /* Stub for when we have no PCI iommu support */
-    return NULL;
+    return dev->dma;
 }
 
 static inline int pci_dma_rw(PCIDevice *dev, dma_addr_t addr,
diff --git a/hw/pci_internals.h b/hw/pci_internals.h
index 96690b7..983594b 100644
--- a/hw/pci_internals.h
+++ b/hw/pci_internals.h
@@ -16,6 +16,8 @@ extern struct BusInfo pci_bus_info;
 
 struct PCIBus {
     BusState qbus;
+    PCIDMAContextFunc dma_context_fn;
+    void *dma_context_opaque;
     uint8_t devfn_min;
     pci_set_irq_fn set_irq;
     pci_map_irq_fn map_irq;
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [Qemu-devel] [PATCH 12/13] pseries: Implement IOMMU and DMA for PAPR PCI devices
  2012-05-10  4:48 [Qemu-devel] [PATCH 00/13] IOMMU infrastructure Benjamin Herrenschmidt
                   ` (10 preceding siblings ...)
  2012-05-10  4:49 ` [Qemu-devel] [PATCH 11/13] iommu: Allow PCI to use " Benjamin Herrenschmidt
@ 2012-05-10  4:49 ` Benjamin Herrenschmidt
  2012-05-10  4:49 ` [Qemu-devel] [PATCH 13/13] iommu: Add a memory barrier to DMA RW function Benjamin Herrenschmidt
  2012-05-15  0:52 ` [Qemu-devel] [PATCH 00/13] IOMMU infrastructure Anthony Liguori
  13 siblings, 0 replies; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-10  4:49 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alexey Kardashevskiy, Alex Graf, David Gibson

From: David Gibson <david@gibson.dropbear.id.au>

Currently the pseries machine emulation does not support DMA for emulated
PCI devices, because the PAPR spec always requires a (guest visible,
paravirtualized) IOMMU which was not implemented.  Now that we have
infrastructure for IOMMU emulation, we can correct this and allow PCI DMA
for pseries.

With the existing PAPR IOMMU code used for VIO devices, this is almost
trivial. We use a single DMAContext for each (virtual) PCI host bridge,
which is the usual configuration on real PAPR machines (which often have
_many_ PCI host bridges).

Cc: Alex Graf <agraf@suse.de>

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
 hw/spapr.h       |    1 +
 hw/spapr_iommu.c |   52 ++++++++++++++++++++++++++++------------------------
 hw/spapr_pci.c   |   15 +++++++++++++++
 hw/spapr_pci.h   |    1 +
 4 files changed, 45 insertions(+), 24 deletions(-)

diff --git a/hw/spapr.h b/hw/spapr.h
index df3e8b1..7c497aa 100644
--- a/hw/spapr.h
+++ b/hw/spapr.h
@@ -328,6 +328,7 @@ typedef struct sPAPRTCE {
 } sPAPRTCE;
 
 #define SPAPR_VIO_BASE_LIOBN    0x00000000
+#define SPAPR_PCI_BASE_LIOBN    0x80000000
 
 void spapr_iommu_init(void);
 DMAContext *spapr_tce_new_dma_context(uint32_t liobn, size_t window_size);
diff --git a/hw/spapr_iommu.c b/hw/spapr_iommu.c
index 87ed09c..79c2a06 100644
--- a/hw/spapr_iommu.c
+++ b/hw/spapr_iommu.c
@@ -162,6 +162,28 @@ void spapr_tce_free(DMAContext *dma)
     }
 }
 
+static target_ulong put_tce_emu(sPAPRTCETable *tcet, target_ulong ioba,
+                                target_ulong tce)
+{
+    sPAPRTCE *tcep;
+    target_ulong oldtce;
+
+    if (ioba >= tcet->window_size) {
+        hcall_dprintf("spapr_vio_put_tce on out-of-boards IOBA 0x"
+                      TARGET_FMT_lx "\n", ioba);
+        return H_PARAMETER;
+    }
+
+    tcep = tcet->table + (ioba >> SPAPR_TCE_PAGE_SHIFT);
+    oldtce = tcep->tce;
+    tcep->tce = tce;
+
+    if (oldtce != 0) {
+        iommu_wait_for_invalidated_maps(&tcet->dma, ioba, SPAPR_TCE_PAGE_SIZE);
+    }
+
+    return H_SUCCESS;
+}
 
 static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
                               target_ulong opcode, target_ulong *args)
@@ -170,43 +192,25 @@ static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
     target_ulong ioba = args[1];
     target_ulong tce = args[2];
     sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
-    sPAPRTCE *tcep;
-    target_ulong oldtce;
 
     if (liobn & 0xFFFFFFFF00000000ULL) {
         hcall_dprintf("spapr_vio_put_tce on out-of-boundsw LIOBN "
                       TARGET_FMT_lx "\n", liobn);
         return H_PARAMETER;
     }
-    if (!tcet) {
-        hcall_dprintf("spapr_vio_put_tce on non-existent LIOBN "
-                      TARGET_FMT_lx "\n", liobn);
-        return H_PARAMETER;
-    }
 
     ioba &= ~(SPAPR_TCE_PAGE_SIZE - 1);
 
+    if (tcet) {
+        return put_tce_emu(tcet, ioba, tce);
+    }
 #ifdef DEBUG_TCE
-    fprintf(stderr, "spapr_vio_put_tce on liobn=" TARGET_FMT_lx /*%s*/
+    fprintf(stderr, "%s on liobn=" TARGET_FMT_lx /*%s*/
             "  ioba 0x" TARGET_FMT_lx "  TCE 0x" TARGET_FMT_lx "\n",
-            liobn, /*dev->qdev.id, */ioba, tce);
+            __func__, liobn, /*dev->qdev.id, */ioba, tce);
 #endif
 
-    if (ioba >= tcet->window_size) {
-        hcall_dprintf("spapr_vio_put_tce on out-of-boards IOBA 0x"
-                      TARGET_FMT_lx "\n", ioba);
-        return H_PARAMETER;
-    }
-
-    tcep = tcet->table + (ioba >> SPAPR_TCE_PAGE_SHIFT);
-    oldtce = tcep->tce;
-    tcep->tce = tce;
-
-    if (oldtce != 0) {
-        iommu_wait_for_invalidated_maps(&tcet->dma, ioba, SPAPR_TCE_PAGE_SIZE);
-    }
-
-    return H_SUCCESS;
+    return H_PARAMETER;
 }
 
 void spapr_iommu_init(void)
diff --git a/hw/spapr_pci.c b/hw/spapr_pci.c
index 25b400a..7b9973c 100644
--- a/hw/spapr_pci.c
+++ b/hw/spapr_pci.c
@@ -265,12 +265,21 @@ static const MemoryRegionOps spapr_io_ops = {
 /*
  * PHB PCI device
  */
+static DMAContext *spapr_pci_dma_context_fn(PCIBus *bus, void *opaque,
+                                            int devfn)
+{
+    sPAPRPHBState *phb = opaque;
+
+    return phb->dma;
+}
+
 static int spapr_phb_init(SysBusDevice *s)
 {
     sPAPRPHBState *phb = FROM_SYSBUS(sPAPRPHBState, s);
     char *namebuf;
     int i;
     PCIBus *bus;
+    uint32_t liobn;
 
     phb->dtbusname = g_strdup_printf("pci@%" PRIx64, phb->buid);
     namebuf = alloca(strlen(phb->dtbusname) + 32);
@@ -311,6 +320,10 @@ static int spapr_phb_init(SysBusDevice *s)
                            PCI_DEVFN(0, 0), PCI_NUM_PINS);
     phb->host_state.bus = bus;
 
+    liobn = SPAPR_PCI_BASE_LIOBN | (pci_find_domain(bus) << 16);
+    phb->dma = spapr_tce_new_dma_context(liobn, 0x40000000);
+    pci_setup_iommu(bus, spapr_pci_dma_context_fn, phb);
+
     QLIST_INSERT_HEAD(&spapr->phbs, phb, list);
 
     /* Initialize the LSI table */
@@ -471,6 +484,8 @@ int spapr_populate_pci_devices(sPAPRPHBState *phb,
     _FDT(fdt_setprop(fdt, bus_off, "interrupt-map", &interrupt_map,
                      sizeof(interrupt_map)));
 
+    spapr_dma_dt(fdt, bus_off, "ibm,dma-window", phb->dma);
+
     return 0;
 }
 
diff --git a/hw/spapr_pci.h b/hw/spapr_pci.h
index f54c2e8..d9e46e2 100644
--- a/hw/spapr_pci.h
+++ b/hw/spapr_pci.h
@@ -38,6 +38,7 @@ typedef struct sPAPRPHBState {
     MemoryRegion memspace, iospace;
     target_phys_addr_t mem_win_addr, mem_win_size, io_win_addr, io_win_size;
     MemoryRegion memwindow, iowindow;
+    DMAContext *dma;
 
     struct {
         uint32_t dt_irq;
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [Qemu-devel] [PATCH 13/13] iommu: Add a memory barrier to DMA RW function
  2012-05-10  4:48 [Qemu-devel] [PATCH 00/13] IOMMU infrastructure Benjamin Herrenschmidt
                   ` (11 preceding siblings ...)
  2012-05-10  4:49 ` [Qemu-devel] [PATCH 12/13] pseries: Implement IOMMU and DMA for PAPR PCI devices Benjamin Herrenschmidt
@ 2012-05-10  4:49 ` Benjamin Herrenschmidt
  2012-05-15  0:52   ` Anthony Liguori
  2012-05-15  0:52 ` [Qemu-devel] [PATCH 00/13] IOMMU infrastructure Anthony Liguori
  13 siblings, 1 reply; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-10  4:49 UTC (permalink / raw)
  To: qemu-devel; +Cc: David Gibson

From: David Gibson <david@gibson.dropbear.id.au>

The emulated devices can run simultaneously with the guest, so
we need to be careful with the ordering of loads and stores done by
them to the guest system memory, which need to be observed in
the right order by the guest operating system.

The simplest way to address that for now is to stick a memory
barrier in the main DMA read/write function of the iommu layer;
this will hopefully make everything using that layer "just work".

We don't emulate devices supporting the relaxed ordering PCIe
feature, nor do we want to look at doing more fine-grained
barriers for now, as that could quickly become too complex and not
worth the cost.

Note that this will not help devices using the map/unmap APIs;
those will need to use explicit barriers, similar to what
virtio does.
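
As a purely illustrative sketch of the kind of explicit barrier meant
here (the completion helpers are made-up names; the dma_memory_map()/
dma_memory_unmap() wrappers are the ones from this series, and the
barrier is the same one used below):

    /* Hypothetical device completion path using map/unmap. */
    dma_addr_t maplen = len;
    void *buf = dma_memory_map(dma, addr, &maplen, DMA_DIRECTION_FROM_DEVICE);

    if (buf) {
        fill_completion(buf, maplen);        /* device writes its results    */
        dma_memory_unmap(dma, buf, maplen,
                         DMA_DIRECTION_FROM_DEVICE, maplen);
        __sync_synchronize();                /* order those writes ...       */
        notify_guest(dev);                   /* ... before telling the guest */
    }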

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
 dma-helpers.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/dma-helpers.c b/dma-helpers.c
index 36fa963..4350cdf 100644
--- a/dma-helpers.c
+++ b/dma-helpers.c
@@ -312,6 +312,9 @@ int iommu_dma_memory_rw(DMAContext *dma, dma_addr_t addr,
         buf += plen;
     }
 
+    /* HACK: full memory barrier here */
+    __sync_synchronize();
+
     return 0;
 }
 
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH 02/13] Implement cpu_physical_memory_zero()
  2012-05-10  4:48 ` [Qemu-devel] [PATCH 02/13] Implement cpu_physical_memory_zero() Benjamin Herrenschmidt
@ 2012-05-15  0:42   ` Anthony Liguori
  2012-05-15  1:23     ` David Gibson
  0 siblings, 1 reply; 89+ messages in thread
From: Anthony Liguori @ 2012-05-15  0:42 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: qemu-devel, David Gibson

On 05/09/2012 11:48 PM, Benjamin Herrenschmidt wrote:
> From: David Gibson<david@gibson.dropbear.id.au>
>
> This patch adds cpu_physical_memory_zero() function.  This is equivalent to
> calling cpu_physical_memory_write() with a buffer full of zeroes, but
> avoids actually allocating such a buffer along the way.
>
> Signed-off-by: David Gibson<david@gibson.dropbear.id.au>
> Signed-off-by: Benjamin Herrenschmidt<benh@kernel.crashing.org>
> ---
>   cpu-common.h |    1 +
>   exec.c       |   53 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>   2 files changed, 54 insertions(+)
>
> diff --git a/cpu-common.h b/cpu-common.h
> index dca5175..146429c 100644
> --- a/cpu-common.h
> +++ b/cpu-common.h
> @@ -53,6 +53,7 @@ void qemu_ram_set_idstr(ram_addr_t addr, const char *name, DeviceState *dev);
>
>   void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
>                               int len, int is_write);
> +void cpu_physical_memory_zero(target_phys_addr_t addr, int len);
>   static inline void cpu_physical_memory_read(target_phys_addr_t addr,
>                                               void *buf, int len)
>   {
> diff --git a/exec.c b/exec.c
> index 0607c9b..8511496 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -3581,6 +3581,59 @@ void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
>       }
>   }
>
> +void cpu_physical_memory_zero(target_phys_addr_t addr, int len)
> +{

I'd think a memset() like interface would be better but...

We should definitely implement this function in terms of 
cpu_physical_memory_write instead of open coding the logic again.

Regards,

Anthony Liguori

> +    int l;
> +    uint8_t *ptr;
> +    target_phys_addr_t page;
> +    MemoryRegionSection *section;
> +
> +    while (len>  0) {
> +        page = addr&  TARGET_PAGE_MASK;
> +        l = (page + TARGET_PAGE_SIZE) - addr;
> +        if (l>  len)
> +            l = len;
> +        section = phys_page_find(page>>  TARGET_PAGE_BITS);
> +
> +        if (!memory_region_is_ram(section->mr)) {
> +            target_phys_addr_t addr1;
> +            addr1 = memory_region_section_addr(section, addr);
> +            /* XXX: could force cpu_single_env to NULL to avoid
> +               potential bugs */
> +            if (l>= 4&&  ((addr1&  3) == 0)) {
> +                /* 32 bit write access */
> +                io_mem_write(section->mr, addr1, 0, 4);
> +                l = 4;
> +            } else if (l>= 2&&  ((addr1&  1) == 0)) {
> +                /* 16 bit write access */
> +                io_mem_write(section->mr, addr1, 0, 2);
> +                l = 2;
> +            } else {
> +                /* 8 bit write access */
> +                io_mem_write(section->mr, addr1, 0, 1);
> +                l = 1;
> +            }
> +        } else if (!section->readonly) {
> +            ram_addr_t addr1;
> +            addr1 = memory_region_get_ram_addr(section->mr)
> +                + memory_region_section_addr(section, addr);
> +            /* RAM case */
> +            ptr = qemu_get_ram_ptr(addr1);
> +            memset(ptr, 0, l);
> +            if (!cpu_physical_memory_is_dirty(addr1)) {
> +                /* invalidate code */
> +                tb_invalidate_phys_page_range(addr1, addr1 + l, 0);
> +                /* set dirty bit */
> +                cpu_physical_memory_set_dirty_flags(
> +                    addr1, (0xff&  ~CODE_DIRTY_FLAG));
> +            }
> +            qemu_put_ram_ptr(ptr);
> +        }
> +        len -= l;
> +        addr += l;
> +    }
> +}
> +
>   /* used for ROM loading : can write in RAM and ROM */
>   void cpu_physical_memory_write_rom(target_phys_addr_t addr,
>                                      const uint8_t *buf, int len)

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH 08/13] iommu: Introduce IOMMU emulation infrastructure
  2012-05-10  4:49 ` [Qemu-devel] [PATCH 08/13] iommu: Introduce IOMMU emulation infrastructure Benjamin Herrenschmidt
@ 2012-05-15  0:49   ` Anthony Liguori
  2012-05-15  1:42     ` David Gibson
  0 siblings, 1 reply; 89+ messages in thread
From: Anthony Liguori @ 2012-05-15  0:49 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Richard Henderson, Michael S. Tsirkin, qemu-devel, David Gibson,
	Eduard - Gabriel Munteanu

On 05/09/2012 11:49 PM, Benjamin Herrenschmidt wrote:
> From: David Gibson<david@gibson.dropbear.id.au>
>
> This patch adds the basic infrastructure necessary to emulate an IOMMU
> visible to the guest.  The DMAContext structure is extended with
> information and a callback describing the translation, and the various
> DMA functions used by devices will now perform IOMMU translation using
> this callback.
>
> Cc: Michael S. Tsirkin<mst@redhat.com>
> Cc: Richard Henderson<rth@twiddle.net>
>
> Signed-off-by: Eduard - Gabriel Munteanu<eduard.munteanu@linux360.ro>
> Signed-off-by: David Gibson<david@gibson.dropbear.id.au>
> Signed-off-by: Benjamin Herrenschmidt<benh@kernel.crashing.org>
> ---
>   dma-helpers.c |  214 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>   dma.h         |  108 ++++++++++++++++++++++-------
>   hw/qdev-dma.h |    4 +-
>   3 files changed, 299 insertions(+), 27 deletions(-)
>
> diff --git a/dma-helpers.c b/dma-helpers.c
> index 2dc4691..09591ef 100644
> --- a/dma-helpers.c
> +++ b/dma-helpers.c
> @@ -9,6 +9,10 @@
>
>   #include "dma.h"
>   #include "trace.h"
> +#include "range.h"
> +#include "qemu-thread.h"
> +
> +/* #define DEBUG_IOMMU */
>
>   void qemu_sglist_init(QEMUSGList *qsg, int alloc_hint, DMAContext *dma)
>   {
> @@ -244,3 +248,213 @@ void dma_acct_start(BlockDriverState *bs, BlockAcctCookie *cookie,
>   {
>       bdrv_acct_start(bs, cookie, sg->size, type);
>   }
> +
> +bool iommu_dma_memory_valid(DMAContext *dma, dma_addr_t addr, dma_addr_t len,
> +                            DMADirection dir)
> +{
> +    target_phys_addr_t paddr, plen;
> +
> +#ifdef DEBUG_IOMMU
> +    fprintf(stderr, "dma_memory_check context=%p addr=0x" DMA_ADDR_FMT
> +            " len=0x" DMA_ADDR_FMT " dir=%d\n", dma, addr, len, dir);
> +#endif
> +
> +    while (len) {
> +        if (dma->translate(dma, addr,&paddr,&plen, dir) != 0) {
> +            return false;
> +        }
> +
> +        /* The translation might be valid for larger regions. */
> +        if (plen>  len) {
> +            plen = len;
> +        }
> +
> +        len -= plen;
> +        addr += plen;
> +    }
> +
> +    return true;
> +}
> +
> +int iommu_dma_memory_rw(DMAContext *dma, dma_addr_t addr,
> +                        void *buf, dma_addr_t len, DMADirection dir)
> +{
> +    target_phys_addr_t paddr, plen;
> +    int err;
> +
> +#ifdef DEBUG_IOMMU
> +    fprintf(stderr, "dma_memory_rw context=%p addr=0x" DMA_ADDR_FMT " len=0x"
> +            DMA_ADDR_FMT " dir=%d\n", dma, addr, len, dir);
> +#endif
> +
> +    while (len) {
> +        err = dma->translate(dma, addr,&paddr,&plen, dir);
> +        if (err) {
> +            return -1;
> +        }
> +
> +        /* The translation might be valid for larger regions. */
> +        if (plen>  len) {
> +            plen = len;
> +        }
> +
> +        cpu_physical_memory_rw(paddr, buf, plen,
> +                               dir == DMA_DIRECTION_FROM_DEVICE);
> +
> +        len -= plen;
> +        addr += plen;
> +        buf += plen;
> +    }
> +
> +    return 0;
> +}
> +
> +int iommu_dma_memory_zero(DMAContext *dma, dma_addr_t addr, dma_addr_t len)
> +{
> +    target_phys_addr_t paddr, plen;
> +    int err;
> +
> +#ifdef DEBUG_IOMMU
> +    fprintf(stderr, "dma_memory_zero context=%p addr=0x" DMA_ADDR_FMT
> +            " len=0x" DMA_ADDR_FMT "\n", dma, addr, len);
> +#endif
> +
> +    while (len) {
> +        err = dma->translate(dma, addr,&paddr,&plen,
> +                             DMA_DIRECTION_FROM_DEVICE);
> +        if (err) {
> +            return err;
> +        }
> +
> +        /* The translation might be valid for larger regions. */
> +        if (plen>  len) {
> +            plen = len;
> +        }
> +
> +        cpu_physical_memory_zero(paddr, plen);
> +
> +        len -= plen;
> +        addr += plen;
> +    }
> +
> +    return 0;
> +}
> +
> +typedef struct {
> +    unsigned long count;
> +    QemuCond cond;
> +} DMAInvalidationState;
> +
> +typedef struct DMAMemoryMap DMAMemoryMap;
> +struct DMAMemoryMap {
> +    dma_addr_t              addr;
> +    size_t                  len;
> +    void                    *buf;
> +
> +    DMAInvalidationState    *invalidate;
> +    QLIST_ENTRY(DMAMemoryMap) list;
> +};
> +
> +void dma_context_init(DMAContext *dma, DMATranslateFunc fn)
> +{
> +#ifdef DEBUG_IOMMU
> +    fprintf(stderr, "dma_context_init(%p, %p)\n", dma, fn);
> +#endif
> +    dma->translate = fn;
> +    QLIST_INIT(&dma->memory_maps);
> +}
> +
> +void *iommu_dma_memory_map(DMAContext *dma, dma_addr_t addr, dma_addr_t *len,
> +                           DMADirection dir)
> +{
> +    int err;
> +    target_phys_addr_t paddr, plen;
> +    void *buf;
> +    DMAMemoryMap *map;
> +
> +    plen = *len;
> +    err = dma->translate(dma, addr,&paddr,&plen, dir);
> +    if (err) {
> +        return NULL;
> +    }
> +
> +    /*
> +     * If this is true, the virtual region is contiguous,
> +     * but the translated physical region isn't. We just
> +     * clamp *len, much like cpu_physical_memory_map() does.
> +     */
> +    if (plen<  *len) {
> +        *len = plen;
> +    }
> +
> +    buf = cpu_physical_memory_map(paddr,&plen,
> +                                  dir == DMA_DIRECTION_FROM_DEVICE);
> +    *len = plen;
> +
> +    /* We treat maps as remote TLBs to cope with stuff like AIO. */
> +    map = g_malloc(sizeof(DMAMemoryMap));
> +    map->addr = addr;
> +    map->len = *len;
> +    map->buf = buf;
> +    map->invalidate = NULL;
> +
> +    QLIST_INSERT_HEAD(&dma->memory_maps, map, list);
> +
> +    return buf;
> +}
> +
> +void iommu_dma_memory_unmap(DMAContext *dma, void *buffer, dma_addr_t len,
> +                            DMADirection dir, dma_addr_t access_len)
> +{
> +    DMAMemoryMap *map;
> +
> +    cpu_physical_memory_unmap(buffer, len,
> +                              dir == DMA_DIRECTION_FROM_DEVICE,
> +                              access_len);
> +
> +    QLIST_FOREACH(map,&dma->memory_maps, list) {
> +        if ((map->buf == buffer)&&  (map->len == len)) {
> +            QLIST_REMOVE(map, list);
> +
> +            if (map->invalidate) {
> +                /* If this mapping was invalidated */
> +                if (--map->invalidate->count == 0) {
> +                    /* And we're the last mapping invalidated at the time */
> +                    /* Then wake up whoever was waiting for the
> +                     * invalidation to complete */
> +                    qemu_cond_signal(&map->invalidate->cond);
> +                }
> +            }
> +
> +            free(map);
> +        }
> +    }
> +
> +
> +    /* unmap called on a buffer that wasn't mapped */
> +    assert(false);
> +}
> +
> +extern QemuMutex qemu_global_mutex;
> +
> +void iommu_wait_for_invalidated_maps(DMAContext *dma,
> +                                     dma_addr_t addr, dma_addr_t len)
> +{
> +    DMAMemoryMap *map;
> +    DMAInvalidationState is;
> +
> +    is.count = 0;
> +    qemu_cond_init(&is.cond);
> +
> +    QLIST_FOREACH(map,&dma->memory_maps, list) {
> +        if (ranges_overlap(addr, len, map->addr, map->len)) {
> +            is.count++;
> +            map->invalidate =&is;
> +        }
> +    }
> +
> +    if (is.count) {
> +        qemu_cond_wait(&is.cond,&qemu_global_mutex);
> +    }
> +    assert(is.count == 0);
> +}

I don't get what's going on here but I don't think it can possibly be right. 
What is the purpose of this function?

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH 13/13] iommu: Add a memory barrier to DMA RW function
  2012-05-10  4:49 ` [Qemu-devel] [PATCH 13/13] iommu: Add a memory barrier to DMA RW function Benjamin Herrenschmidt
@ 2012-05-15  0:52   ` Anthony Liguori
  2012-05-15  1:11     ` Benjamin Herrenschmidt
  2012-05-15  1:44     ` David Gibson
  0 siblings, 2 replies; 89+ messages in thread
From: Anthony Liguori @ 2012-05-15  0:52 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: qemu-devel, David Gibson

On 05/09/2012 11:49 PM, Benjamin Herrenschmidt wrote:
> From: David Gibson<david@gibson.dropbear.id.au>
>
> The emulated devices can run simultaneously with the guest, so
> we need to be careful with ordering of load and stores done by
> them to the guest system memory, which need to be observed in
> the right order by the guest operating system.
>
> The simplest way for now to address that is to stick a memory
> barrier in the main DMA read/write function of the iommu layer,
> this will make everything using that layer hopefully "just work".
>
> We don't emulate devices supporting the relaxed ordering PCIe
> feature nor do we want to look at doing more fine grained
> barriers for now as it could quickly become too complex and not
> worth the cost.
>
> Note that this will not help devices using the map/unmap APIs,
> those will need to use explicit barriers, similar to what
> virtio does.
>
> Signed-off-by: David Gibson<david@gibson.dropbear.id.au>
> Signed-off-by: Benjamin Herrenschmidt<benh@kernel.crashing.org>
> ---
>   dma-helpers.c |    3 +++
>   1 file changed, 3 insertions(+)
>
> diff --git a/dma-helpers.c b/dma-helpers.c
> index 36fa963..4350cdf 100644
> --- a/dma-helpers.c
> +++ b/dma-helpers.c
> @@ -312,6 +312,9 @@ int iommu_dma_memory_rw(DMAContext *dma, dma_addr_t addr,
>           buf += plen;
>       }
>
> +    /* HACK: full memory barrier here */
> +    __sync_synchronize();

I thought you were going to limit this to the TCE iommu?

Regards,

Anthony Liguori

>       return 0;
>   }
>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH 00/13] IOMMU infrastructure
  2012-05-10  4:48 [Qemu-devel] [PATCH 00/13] IOMMU infrastructure Benjamin Herrenschmidt
                   ` (12 preceding siblings ...)
  2012-05-10  4:49 ` [Qemu-devel] [PATCH 13/13] iommu: Add a memory barrier to DMA RW function Benjamin Herrenschmidt
@ 2012-05-15  0:52 ` Anthony Liguori
  13 siblings, 0 replies; 89+ messages in thread
From: Anthony Liguori @ 2012-05-15  0:52 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: qemu-devel

On 05/09/2012 11:48 PM, Benjamin Herrenschmidt wrote:
> Hi folks !
>
> This is a repose (&  rebase on top of current HEAD) of
> David and Eduard iommu patch series which provides the
> necessary infrastructure for doing DMA through an iommu,
> along with the SPAPR iommu implementation.
>
> David is on vacation, so make sure to CC all comments
> to me.

Looks mostly good.  Just limited comments on a few of the patches.

Regards,

Anthony Liguori

>
>
>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH 13/13] iommu: Add a memory barrier to DMA RW function
  2012-05-15  0:52   ` Anthony Liguori
@ 2012-05-15  1:11     ` Benjamin Herrenschmidt
  2012-05-15  1:44     ` David Gibson
  1 sibling, 0 replies; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-15  1:11 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: qemu-devel, David Gibson

On Mon, 2012-05-14 at 19:52 -0500, Anthony Liguori wrote:

> >
> > diff --git a/dma-helpers.c b/dma-helpers.c
> > index 36fa963..4350cdf 100644
> > --- a/dma-helpers.c
> > +++ b/dma-helpers.c
> > @@ -312,6 +312,9 @@ int iommu_dma_memory_rw(DMAContext *dma, dma_addr_t addr,
> >           buf += plen;
> >       }
> >
> > +    /* HACK: full memory barrier here */
> > +    __sync_synchronize();
> 
> I thought you were going to limit this to the TCE iommu?

I can, I suppose, but technically speaking this isn't an attribute of the
iommu... in fact, from a model standpoint, it should be in the pci_*
accessors since PCI transactions are ordered (at least to some extent,
let's not get into relaxed ordering etc... at this stage).

It was just easier to stick it in the above function for now, which
handles all known cases... I'm happy to move it to the TCE backend if
you prefer for now but I can see that problem hitting other
architectures such as ARM or even on powerpc, hitting emulated PCI on
Alex "usermode" KVM using the Mac99 machine model which doesn't use TCEs
etc....
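
For illustration only, the alternative being described would look
roughly like this in hw/pci.h (not what the patch does; it reuses the
same brute-force barrier as patch 13, and a finer grained primitive
could be substituted later):

    static inline int pci_dma_rw(PCIDevice *dev, dma_addr_t addr,
                                 void *buf, dma_addr_t len, DMADirection dir)
    {
        int ret = dma_memory_rw(pci_dma_context(dev), addr, buf, len, dir);

        /* PCI transactions must be observed in order by the guest */
        __sync_synchronize();
        return ret;
    }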

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH 02/13] Implement cpu_physical_memory_zero()
  2012-05-15  0:42   ` Anthony Liguori
@ 2012-05-15  1:23     ` David Gibson
  2012-05-15  2:03       ` Anthony Liguori
  0 siblings, 1 reply; 89+ messages in thread
From: David Gibson @ 2012-05-15  1:23 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: qemu-devel

On Mon, May 14, 2012 at 07:42:00PM -0500, Anthony Liguori wrote:
> On 05/09/2012 11:48 PM, Benjamin Herrenschmidt wrote:
> >From: David Gibson<david@gibson.dropbear.id.au>
[snip]
> >@@ -3581,6 +3581,59 @@ void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
> >      }
> >  }
> >
> >+void cpu_physical_memory_zero(target_phys_addr_t addr, int len)
> >+{
> 
> I'd think a memset() like interface would be better but...

I can work with that.

> We should definitely implement this function in terms of
> cpu_physical_memory_write instead of open coding the logic again.

Hrm.  Having solved merge conflicts several times by recopying the
cpu_physical_memory_rw() logic, I can certainly see the attraction in
that.  However, the point of this function is *not* to have to
allocate a temporary buffer, and I don't really see how to combine the
logic without that.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH 08/13] iommu: Introduce IOMMU emulation infrastructure
  2012-05-15  0:49   ` Anthony Liguori
@ 2012-05-15  1:42     ` David Gibson
  2012-05-15  2:03       ` Anthony Liguori
  0 siblings, 1 reply; 89+ messages in thread
From: David Gibson @ 2012-05-15  1:42 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Richard Henderson, Michael S. Tsirkin, qemu-devel,
	Eduard - Gabriel Munteanu

On Mon, May 14, 2012 at 07:49:16PM -0500, Anthony Liguori wrote:
[snip]
> >+void iommu_wait_for_invalidated_maps(DMAContext *dma,
> >+                                     dma_addr_t addr, dma_addr_t len)
> >+{
> >+    DMAMemoryMap *map;
> >+    DMAInvalidationState is;
> >+
> >+    is.count = 0;
> >+    qemu_cond_init(&is.cond);
> >+
> >+    QLIST_FOREACH(map,&dma->memory_maps, list) {
> >+        if (ranges_overlap(addr, len, map->addr, map->len)) {
> >+            is.count++;
> >+            map->invalidate =&is;
> >+        }
> >+    }
> >+
> >+    if (is.count) {
> >+        qemu_cond_wait(&is.cond,&qemu_global_mutex);
> >+    }
> >+    assert(is.count == 0);
> >+}
> 
> I don't get what's going on here but I don't think it can possibly
> be right. What is the purpose of this function?

So.  This is a function to be used by individual iommu
implementations.  When IOMMU mappings are updated on real hardware,
there may be some lag in the effect, particularly for in-flight DMAs,
due to shadow TLBs or other things.  But generally, there will be some
way to synchronize the IOMMU that once completed will ensure that no
further DMA access to the old translations may occur.  For the sPAPR
TCE MMU, this actually happens after every PUT_TCE hcall.

In our software implementation this is a problem if existing drivers
have done a dma_memory_map() and haven't yet done a
dma_memory_unmap(): they will have a real pointer to the translated
memory which can't be intercepted.  However, memory maps are supposed
to be transient, so this helper function invalidates memory maps based
on an IOVA address range, and blocks until they expire.  This function
would be called from CPU thread context, the dma_memory_unmap() would
come from the IO thread (in the only existing case from AIO completion
callbacks in the block IO code).

This gives the IOMMU implementation a way of blocking the CPU
initiating a sync operation until it really is safe to assume that no
further DMA operations may hit the invalidated mappings.  Note that if
we actually hit the blocking path here, that almost certainly
indicates the guest has done something wrong, or at least unusual -
DMA devices should be stopped before removing their IOMMU mappings
from under them.  However, one of the points of the IOMMU is the
ability to be able to forcibly stop DMAs, so we do need to implement
this behaviour for that case.

With difficulty, I've traced through qemu's difficult-to-follow thread
synchronization logic and I'm about 75% convinced this works correctly
with it.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH 13/13] iommu: Add a memory barrier to DMA RW function
  2012-05-15  0:52   ` Anthony Liguori
  2012-05-15  1:11     ` Benjamin Herrenschmidt
@ 2012-05-15  1:44     ` David Gibson
  2012-05-16  4:35       ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 89+ messages in thread
From: David Gibson @ 2012-05-15  1:44 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: qemu-devel

On Mon, May 14, 2012 at 07:52:15PM -0500, Anthony Liguori wrote:
> On 05/09/2012 11:49 PM, Benjamin Herrenschmidt wrote:
> >From: David Gibson<david@gibson.dropbear.id.au>
> >
> >The emulated devices can run simultaneously with the guest, so
> >we need to be careful with ordering of load and stores done by
> >them to the guest system memory, which need to be observed in
> >the right order by the guest operating system.
> >
> >The simplest way for now to address that is to stick a memory
> >barrier in the main DMA read/write function of the iommu layer,
> >this will make everything using that layer hopefully "just work".
> >
> >We don't emulate devices supporting the relaxed ordering PCIe
> >feature nor do we want to look at doing more fine grained
> >barriers for now as it could quickly become too complex and not
> >worth the cost.
> >
> >Note that this will not help devices using the map/unmap APIs,
> >those will need to use explicit barriers, similar to what
> >virtio does.
> >
> >Signed-off-by: David Gibson<david@gibson.dropbear.id.au>
> >Signed-off-by: Benjamin Herrenschmidt<benh@kernel.crashing.org>
> >---
> >  dma-helpers.c |    3 +++
> >  1 file changed, 3 insertions(+)
> >
> >diff --git a/dma-helpers.c b/dma-helpers.c
> >index 36fa963..4350cdf 100644
> >--- a/dma-helpers.c
> >+++ b/dma-helpers.c
> >@@ -312,6 +312,9 @@ int iommu_dma_memory_rw(DMAContext *dma, dma_addr_t addr,
> >          buf += plen;
> >      }
> >
> >+    /* HACK: full memory barrier here */
> >+    __sync_synchronize();
> 
> I thought you were going to limit this to the TCE iommu?

So, it wasn't my intention to send this one with the rest, but I
forgot to explain that to Ben when he resent.  As the comment
suggests, this is a hack we have been using internally to see if
certain ordering problems were what we thought they were.  If that
turned out to be the case (and it now looks like it is), we need to
work out where to correctly place this barrier.  As Ben says, this
should probably really be in the PCI accessors, and we should use the
finer grained primitives from qemu-barrier.h rather than the brute
force __sync_synchronize().

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH 08/13] iommu: Introduce IOMMU emulation infrastructure
  2012-05-15  1:42     ` David Gibson
@ 2012-05-15  2:03       ` Anthony Liguori
  2012-05-15  2:32         ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 89+ messages in thread
From: Anthony Liguori @ 2012-05-15  2:03 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, qemu-devel, Eduard - Gabriel Munteanu,
	Richard Henderson, Michael S. Tsirkin

On 05/14/2012 08:42 PM, David Gibson wrote:
> On Mon, May 14, 2012 at 07:49:16PM -0500, Anthony Liguori wrote:
> [snip]
>>> +void iommu_wait_for_invalidated_maps(DMAContext *dma,
>>> +                                     dma_addr_t addr, dma_addr_t len)
>>> +{
>>> +    DMAMemoryMap *map;
>>> +    DMAInvalidationState is;
>>> +
>>> +    is.count = 0;
>>> +    qemu_cond_init(&is.cond);
>>> +
>>> +    QLIST_FOREACH(map,&dma->memory_maps, list) {
>>> +        if (ranges_overlap(addr, len, map->addr, map->len)) {
>>> +            is.count++;
>>> +            map->invalidate =&is;
>>> +        }
>>> +    }
>>> +
>>> +    if (is.count) {
>>> +        qemu_cond_wait(&is.cond,&qemu_global_mutex);
>>> +    }
>>> +    assert(is.count == 0);
>>> +}
>>
>> I don't get what's going on here but I don't think it can possibly
>> be right. What is the purpose of this function?
>
> So.  This is a function to be used by individual iommu
> implementations.  When IOMMU mappings are updated on real hardware,
> there may be some lag in th effect, particularly for in-flight DMAs,
> due to shadow TLBs or other things.  But generally, there will be some
> way to synchronize the IOMMU that once completed will ensure that no
> further DMA access to the old translations may occur.  For the sPAPR
> TCE MMU, this actually happens after every PUT_TCE hcall.
>
> In our software implementation this is a problem if existing drivers
> have done a dma_memory_map() and haven't yet done a
> dma_memory_unmap(): they will have a real pointer to the translated
> memory which can't be intercepted.  However, memory maps are supposed
> to be transient, so this helper function invalidates memory maps based
> on an IOVA address range, and blocks until they expire.  This function
> would be called from CPU thread context, the dma_memory_unmap() would
> come from the IO thread (in the only existing case from AIO completion
> callbacks in the block IO code).
>
> This gives the IOMMU implementation a way of blocking the CPU
> initiating a sync operation until it really is safe to assume that no
> further DMA operations may hit the invalidated mappings.  Note that if
> we actually hit the blocking path here, that almost certainly
> indicates the guest has done something wrong, or at least unusual -

So the CPU thread runs in lock-step with the I/O thread.  Dropping the CPU 
thread lock to let the I/O thread run is a dangerous thing to do in a place like 
this.

Also, I think you'd effectively block the CPU until pending DMA operations 
complete?  This could be many, many milliseconds, no?  That's going to make
guests very upset.

Regards,

Anthony Liguori

> DMA devices should be stopped before removing their IOMMU mappings
> from under them.  However, one of the points of the IOMMU is the
> ability to be able to forcibly stop DMAs, so we do need to implement
> this behaviour for that case.
>
> With difficulty, I've traced through qemu's difficult-to-follow thread
> synchronization logic and I'm about 75% convinced this works correctly
> with it.
>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH 02/13] Implement cpu_physical_memory_zero()
  2012-05-15  1:23     ` David Gibson
@ 2012-05-15  2:03       ` Anthony Liguori
  0 siblings, 0 replies; 89+ messages in thread
From: Anthony Liguori @ 2012-05-15  2:03 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, qemu-devel, David Gibson

On 05/14/2012 08:23 PM, David Gibson wrote:
> On Mon, May 14, 2012 at 07:42:00PM -0500, Anthony Liguori wrote:
>> On 05/09/2012 11:48 PM, Benjamin Herrenschmidt wrote:
>>> From: David Gibson<david@gibson.dropbear.id.au>
> [snip]
>>> @@ -3581,6 +3581,59 @@ void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
>>>       }
>>>   }
>>>
>>> +void cpu_physical_memory_zero(target_phys_addr_t addr, int len)
>>> +{
>>
>> I'd think a memset() like interface would be better but...
>
> I can work with that.
>
>> We should definitely implement this function in terms of
>> cpu_physical_memory_write instead of open coding the logic again.
>
> Hrm.  Having solved merge conflicts several times by recopying the
> cpu_physical_memory_rw() logic, I can certainly see the attraction in
> that.  However, the point of this function is *not* to have to
> allocate a temporary buffer, and I don't really see how to combine the
> logic without that.

Just use a fixed buffer (uint8_t buffer[512]) and call cpu_physical_memory_rw 
multiple times with an offset.
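
A minimal sketch of that suggestion (illustrative only, not code from
the series):

    /* Zero guest-physical memory by streaming a fixed all-zero chunk
     * through cpu_physical_memory_write(). */
    void cpu_physical_memory_zero(target_phys_addr_t addr, int len)
    {
        static const uint8_t zeroes[512];

        while (len > 0) {
            int l = len > (int)sizeof(zeroes) ? (int)sizeof(zeroes) : len;

            cpu_physical_memory_write(addr, zeroes, l);
            addr += l;
            len -= l;
        }
    }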

Regards,

Anthony Liguori

>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH 08/13] iommu: Introduce IOMMU emulation infrastructure
  2012-05-15  2:03       ` Anthony Liguori
@ 2012-05-15  2:32         ` Benjamin Herrenschmidt
  2012-05-15  2:50           ` Anthony Liguori
  0 siblings, 1 reply; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-15  2:32 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Richard Henderson, Michael S. Tsirkin, qemu-devel,
	Eduard - Gabriel Munteanu

On Mon, 2012-05-14 at 21:03 -0500, Anthony Liguori wrote:
> So the CPU thread runs in lock-step with the I/O thread.  Dropping the CPU 
> thread lock to let the I/O thread run is a dangerous thing to do in a place like 
> this.
> 
> Also, I think you'd effectively block the CPU until pending DMA operations 
> complete?  This could be many, many, milliseconds, no?  That's going to make 
> guests very upset.

Do you see any other option ?

IE. When the guest invalidates iommu translations, it must have a way to
synchronize with anything that may have used such translations. IE. It
must have a way to guarantee that

 - The translation is no longer used
 - Any physical address obtained as a result of looking at
   the translation table/tree/... is no longer used

This is true regardless of the iommu model. For the PAPR TCE model, we
need to provide such synchronization whenever we clear a TCE entry, but
if I was to emulate the HW iommu of a G5 for example, I would have to
provide similar guarantees in my emulation of MMIO accesses to the iommu
TLB invalidate register.

This is a problem with devices using the map/unmap calls. This is going
to be even more of a problem as we start being more threaded (which from
discussions I've read here or there seems to be the goal) since
synchronizing with the IO thread isn't going to be enough.

As long as the actual data transfer through the iommu is under control
of the iommu code, then the iommu implementation can use whatever
locking it needs to ensure this synchronization.

But map/unmap defeats that.

David's approach may not be the best long term, but provided it's not
totally broken (I don't know qemu locking well enough to judge how
dangerous it is) then it might be a "good enough" first step until we
come up with something better ?

The normal case will be that no map exists, ie, it will almost always be
a guest programming error to remove an iommu mapping while a device is
actively using it, so having this case be slow is probably a non-issue.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH 08/13] iommu: Introduce IOMMU emulation infrastructure
  2012-05-15  2:32         ` Benjamin Herrenschmidt
@ 2012-05-15  2:50           ` Anthony Liguori
  2012-05-15  3:02             ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 89+ messages in thread
From: Anthony Liguori @ 2012-05-15  2:50 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Richard Henderson, Michael S. Tsirkin, qemu-devel,
	Eduard - Gabriel Munteanu

On 05/14/2012 09:32 PM, Benjamin Herrenschmidt wrote:
> On Mon, 2012-05-14 at 21:03 -0500, Anthony Liguori wrote:
>> So the CPU thread runs in lock-step with the I/O thread.  Dropping the CPU
>> thread lock to let the I/O thread run is a dangerous thing to do in a place like
>> this.
>>
>> Also, I think you'd effectively block the CPU until pending DMA operations
>> complete?  This could be many, many, milliseconds, no?  That's going to make
>> guests very upset.
>
> Do you see any other option ?

Yes, ignore it.

I have a hard time believing software depends on changing DMA translation 
mid-way through a transaction.

>
> IE. When the guest invalidates iommu translations, it must have a way to
> synchronize with anything that may have used such translations. IE. It
> must have a way to guarantee that
>
>   - The translation is no longer used

You can certainly prevent future uses of the translation.  We're only talking 
about pending mapped requests.  My assertion is that they don't matter.

>   - Any physical address obtained as a result of looking at
>     the translation table/tree/... is no longer used

Why does this need to be guaranteed?  How can software depend on this in a 
meaningful way?

> This is true regardless of the iommu model. For the PAPR TCE model, we
> need to provide such synchronization whenever we clear a TCE entry, but
> if I was to emulate the HW iommu of a G5 for example, I would have to
> provide similar guarantees in my emulation of MMIO accesses to the iommu
> TLB invalidate register.
>
> This is a problem with devices using the map/unmap calls. This is going
> to be even more of a problem as we start being more threaded (which from
> discussions I've read here or there seem to be the goal) since
> synchronizing with the IO thread isn't going to be enough.
>
> As long as the actual data transfer through the iommu is under control
> of the iommu code, then the iommu implementation can use whatever
> locking it needs to ensure this synchronization.
>
> But map/unmap defeats that.
>
> David's approach may not be the best long term, but provided it's not
> totally broken (I don't know qemu locking well enough to judge how
> dangerous it is) then it might be a "good enough" first step until we
> come up with something better ?

No, it's definitely not good enough.  Dropping the global mutex in random places 
is asking for worlds of hurt.

If this is really important, then we need some sort of cancellation API to go 
along with map/unmap although I doubt that's really possible.

MMIO/PIO operations cannot block.

Regards,

Anthony Liguori

>
> The normal case will be that no map exist, ie, it will almost always be
> a guest programming error to remove an iommu mapping while a device is
> actively using it, so having this case be slow is probably a non-issue.
>
> Cheers,
> Ben.
>
>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH 08/13] iommu: Introduce IOMMU emulation infrastructure
  2012-05-15  2:50           ` Anthony Liguori
@ 2012-05-15  3:02             ` Benjamin Herrenschmidt
  2012-05-15 14:02               ` Anthony Liguori
  0 siblings, 1 reply; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-15  3:02 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Richard Henderson, Michael S. Tsirkin, qemu-devel,
	Eduard - Gabriel Munteanu

On Mon, 2012-05-14 at 21:50 -0500, Anthony Liguori wrote:
> On 05/14/2012 09:32 PM, Benjamin Herrenschmidt wrote:
> > On Mon, 2012-05-14 at 21:03 -0500, Anthony Liguori wrote:
> >> So the CPU thread runs in lock-step with the I/O thread.  Dropping the CPU
> >> thread lock to let the I/O thread run is a dangerous thing to do in a place like
> >> this.
> >>
> >> Also, I think you'd effectively block the CPU until pending DMA operations
> >> complete?  This could be many, many, milliseconds, no?  That's going to make
> >> guests very upset.
> >
> > Do you see any other option ?
> 
> Yes, ignore it.
> 
> I have a hard time believing software depends on changing DMA translation 
> mid-way through a transaction.

It's a correctness issue. It won't happen in normal circumstances but it
can, and thus should be handled gracefully.

Cases where that matters are unloading of a (broken) driver, kexec/kdump
from one guest to another etc... all involve potentially clearing all
iommu tables while a driver might have left a device DMA'ing. The
expectation is that the device will get target aborts from the iommu
until the situation gets "cleaned up" in SW.

> > IE. When the guest invalidates iommu translations, it must have a way to
> > synchronize with anything that may have used such translations. IE. It
> > must have a way to guarantee that
> >
> >   - The translation is no longer used
> 
> You can certainly prevent future uses of the translation.  We're only talking 
> about pending mapped requests.  My assertion is that they don't matter.
> 
> >   - Any physical address obtained as a result of looking at
> >     the translation table/tree/... is no longer used
> 
> Why does this need to be guaranteed?  How can software depend on this in a 
> meaningful way?

The same as TLB invalidations :-)

In real HW, this is a property of the HW itself, ie, whatever MMIO is
used to invalidate the HW TLB provides a way to ensure (usually by
reading back) that any request pending in the iommu pipeline has either
been completed or canned.

When we start having page fault capable iommus, this will be even more
important, as faults will be part of the non-error case.

For now, it's strictly error handling but it still should be done
correctly.

> > This is true regardless of the iommu model. For the PAPR TCE model, we
> > need to provide such synchronization whenever we clear a TCE entry, but
> > if I was to emulate the HW iommu of a G5 for example, I would have to
> > provide similar guarantees in my emulation of MMIO accesses to the iommu
> > TLB invalidate register.
> >
> > This is a problem with devices using the map/unmap calls. This is going
> > to be even more of a problem as we start being more threaded (which from
> > discussions I've read here or there seem to be the goal) since
> > synchronizing with the IO thread isn't going to be enough.
> >
> > As long as the actual data transfer through the iommu is under control
> > of the iommu code, then the iommu implementation can use whatever
> > locking it needs to ensure this synchronization.
> >
> > But map/unmap defeats that.
> >
> > David's approach may not be the best long term, but provided it's not
> > totally broken (I don't know qemu locking well enough to judge how
> > dangerous it is) then it might be a "good enough" first step until we
> > come up with something better ?
> 
> No, it's definitely not good enough.  Dropping the global mutex in random places 
> is asking for worlds of hurt.
> 
> If this is really important, then we need some sort of cancellation API to go 
> along with map/unmap although I doubt that's really possible.
> 
> MMIO/PIO operations cannot block.

Well, there's a truckload of cases in real HW where an MMIO/PIO read is
used to synchronize some sort of HW operation.... I suppose nothing that
involves blocking at this stage in qemu but I would be careful with your
expectations here... writes are usually pipelined but blocking on a read
response does make a lot of sense.

In any case, for the problem at hand, I can just drop the wait for now
and maybe just print a warning if I see an existing map.

We still need some kind of either locking or barrier to simply ensure
that the updates to the TCE table are visible to other processors but
that can be done in the backend.

But I wouldn't just forget about the issue, it's going to come back and
bite...

Cheers,
Ben.

> Regards,
> 
> Anthony Liguori
> 
> >
> > The normal case will be that no map exist, ie, it will almost always be
> > a guest programming error to remove an iommu mapping while a device is
> > actively using it, so having this case be slow is probably a non-issue.
> >
> > Cheers,
> > Ben.
> >
> >

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH 08/13] iommu: Introduce IOMMU emulation infrastructure
  2012-05-15  3:02             ` Benjamin Herrenschmidt
@ 2012-05-15 14:02               ` Anthony Liguori
  2012-05-15 21:55                 ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 89+ messages in thread
From: Anthony Liguori @ 2012-05-15 14:02 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alex Williamson, Richard Henderson, Michael S. Tsirkin,
	qemu-devel, Eduard - Gabriel Munteanu

On 05/14/2012 10:02 PM, Benjamin Herrenschmidt wrote:
> On Mon, 2012-05-14 at 21:50 -0500, Anthony Liguori wrote:
>> On 05/14/2012 09:32 PM, Benjamin Herrenschmidt wrote:
>>> On Mon, 2012-05-14 at 21:03 -0500, Anthony Liguori wrote:
>>>> So the CPU thread runs in lock-step with the I/O thread.  Dropping the CPU
>>>> thread lock to let the I/O thread run is a dangerous thing to do in a place like
>>>> this.
>>>>
>>>> Also, I think you'd effectively block the CPU until pending DMA operations
>>>> complete?  This could be many, many, milliseconds, no?  That's going to make
>>>> guests very upset.
>>>
>>> Do you see any other option ?
>>
>> Yes, ignore it.
>>
>> I have a hard time believing software depends on changing DMA translation
>> mid-way through a transaction.
>
> It's a correctness issue. It won't happen in normal circumstances but it
> can, and thus should be handled gracefully.

I think the crux of your argument is that upon a change to the translation 
table, the operation acts as a barrier such that the exact moment it returns, 
you're guaranteed that no DMAs are in flight with the old translation mapping.

That's not my understanding of at least VT-d and I have a hard time believing 
it's true for other IOMMUs as that kind of synchronization seems like it would 
be very expensive to implement in hardware.

Rather, when the IOTLB is flushed, I believe the only guarantee that you have is 
that future IOTLB lookups will return the new mapping.  But that doesn't mean 
that there isn't a request in flight that uses the old mapping.

I will grant you that PCI transactions are typically much smaller than QEMU 
transactions such that we may continue to use the old mappings for much longer 
than real hardware would.  But I think that still puts us well within the realm 
of correctness.

> Cases where that matter are unloading of a (broken) driver, kexec/kdump
> from one guest to another etc... all involve potentially clearing all
> iommu tables while a driver might have left a device DMA'ing. The
> expectation is that the device will get target aborts from the iommu
> until the situation gets "cleaned up" in SW.

Yes, this would be worse in QEMU than on bare metal because we essentially have 
a much larger translation TLB.  But as I said above, I think we're well within 
the specified behavior here.

>> Why does this need to be guaranteed?  How can software depend on this in a
>> meaningful way?
>
> The same as TLB invalidations :-)
>
> In real HW, this is a property of the HW itself, ie, whatever MMIO is
> used to invalidate the HW TLB provides a way to ensure (usually by
> reading back) that any request pending in the iommu pipeline has either
> been completed or canned.

Can you point to a spec that says this?  This doesn't match my understanding.

> When we start having page fault capable iommu's this will be even more
> important as faults will be be part of the non-error case.

We can revisit this discussion after every PCI device is changed to cope with a 
page fault capable IOMMU ;-)

>>> David's approach may not be the best long term, but provided it's not
>>> totally broken (I don't know qemu locking well enough to judge how
>>> dangerous it is) then it might be a "good enough" first step until we
>>> come up with something better ?
>>
>> No, it's definitely not good enough.  Dropping the global mutex in random places
>> is asking for worlds of hurt.
>>
>> If this is really important, then we need some sort of cancellation API to go
>> along with map/unmap although I doubt that's really possible.
>>
>> MMIO/PIO operations cannot block.
>
> Well, there's a truckload of cases in real HW where an MMIO/PIO read is
> used to synchronize some sort of HW operation.... I suppose nothing that
> involves blocking at this stage in qemu but I would be careful with your
> expectations here... writes are usually pipelined but blocking on a read
> response does make a lot of sense.

Blocking on an MMIO/PIO request effectively freezes a CPU.  All sorts of badness 
results from that.  Best case scenario, you trigger soft lockup warnings.

> In any case, for the problem at hand, I can just drop the wait for now
> and maybe just print a warning if I see an existing map.
>
> We still need some kind of either locking or barrier to simply ensure
> that the updates to the TCE table are visible to other processors but
> that can be done in the backend.
>
> But I wouldn't just forget about the issue, it's going to come back and
> bite...

I think working out the exact semantics of what we need to do is absolutely 
important.  But I think you're taking an overly conservative approach to what we 
need to provide here.

Regards,

Anthony Liguori

>
> Cheers,
> Ben.
>
>> Regards,
>>
>> Anthony Liguori
>>
>>>
>>> The normal case will be that no map exists, ie, it will almost always be
>>> a guest programming error to remove an iommu mapping while a device is
>>> actively using it, so having this case be slow is probably a non-issue.
>>>
>>> Cheers,
>>> Ben.
>>>
>>>
>
>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH 08/13] iommu: Introduce IOMMU emulation infrastructure
  2012-05-15 14:02               ` Anthony Liguori
@ 2012-05-15 21:55                 ` Benjamin Herrenschmidt
  2012-05-15 22:02                   ` Anthony Liguori
  0 siblings, 1 reply; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-15 21:55 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Alex Williamson, Richard Henderson, Michael S. Tsirkin,
	qemu-devel, Eduard - Gabriel Munteanu

On Tue, 2012-05-15 at 09:02 -0500, Anthony Liguori wrote:

> I think the crux of your argument is that upon a change to the translation 
> table, the operation acts as a barrier such that the exact moment it returns, 
> you're guaranteed that no DMAs are in flight with the old translation mapping.

Not when the translation is changed in memory but whenever the
translation cache are invalidated or whatever other mechanism the HW
provides to do that synchronization. On PAPR, this guarantee is provided
by the H_PUT_TCE hypervisor call which we use to manipulate
translations.

[ Note that for performance reasons, it might end up being very
impractical to provide that guarantee since it prevents us from handling
H_PUT_TCE entirely in kernel real mode like we do today... we'll have to
figure out what we want to do here for the TCE backend implementation,
maybe have qemu mark "in use" translations and cause exits when those
are modified ... ]

> That's not my understanding of at least VT-d and I have a hard time believing 
> it's true for other IOMMUs as that kind of synchronization seems like it would 
> be very expensive to implement in hardware.

How so ? It's perfectly standard stuff ... it's usually part of the TLB
flushing op.

> Rather, when the IOTLB is flushed, I believe the only guarantee that you have is 
> that future IOTLB lookups will return the new mapping.  But that doesn't mean 
> that there isn't a request in flight that uses the old mapping.

I would be very surprised if that was the case :-)

I don't think any sane HW implementation would fail to provide full
synchronization with invalidations. That's how MMUs operate and I don't
see any reason why an iommu shouldn't be held to the same standards.

If it didn't, you'd have a nice host attack... have a guest doing
pass-through start a very long transaction and immediately commit
suicide. KVM starts reclaiming the pages, they go back to the host,
might be re-used immediately ... while still being DMAed to.

> I will grant you that PCI transactions are typically much smaller than QEMU 
> transactions such that we may continue to use the old mappings for much longer 
> than real hardware would.  But I think that still puts us well within the realm 
> of correctness.

No, a "random amount of time after invalidation" is not and will never
be correct. On large SMP machines, the time between a page being freed
and that page being re-used can be very small. The memory being re-used
by something like kexec can happen almost immediately while qemu is
blocked on an AIO that takes milliseconds ... etc....

At least, because this is an emulated iommu, qemu only writes to virtual
addresses mapping the guest space, so this isn't a host attack (unlike
with a real HW iommu however where the lack of such synchronization
would definitely be, as I described earlier).

> > Cases where that matter are unloading of a (broken) driver, kexec/kdump
> > from one guest to another etc... all involve potentially clearing all
> > iommu tables while a driver might have left a device DMA'ing. The
> > expectation is that the device will get target aborts from the iommu
> > until the situation gets "cleaned up" in SW.
> 
> Yes, this would be worse in QEMU than on bare metal because we essentially have 
> a much larger translation TLB.  But as I said above, I think we're well within 
> the specified behavior here.

No :-)

> >> Why does this need to be guaranteed?  How can software depend on this in a
> >> meaningful way?
> >
> > The same as TLB invalidations :-)
> >
> > In real HW, this is a property of the HW itself, ie, whatever MMIO is
> > used to invalidate the HW TLB provides a way to ensure (usually by
> > reading back) that any request pending in the iommu pipeline has either
> > been completed or canned.
> 
> Can you point to a spec that says this?  This doesn't match my understanding.

Apart from common sense? I'd have to dig to get you actual specs, but
it should be plainly obvious that you need that sort of sync or you simply
cannot trust your iommu to do virtualization.

> > When we start having page fault capable iommu's this will be even more
> > important as faults will be part of the non-error case.
> 
> We can revisit this discussion after every PCI device is changed to cope with a 
> page fault capable IOMMU ;-)

Heh, well, the point is that it's still part of the base iommu model; page
faulting is just going to make the problem worse.

> >>> David's approach may not be the best long term, but provided it's not
> >>> totally broken (I don't know qemu locking well enough to judge how
> >>> dangerous it is) then it might be a "good enough" first step until we
> >>> come up with something better ?
> >>
> >> No, it's definitely not good enough.  Dropping the global mutex in random places
> >> is asking for worlds of hurt.
> >>
> >> If this is really important, then we need some sort of cancellation API to go
> >> along with map/unmap although I doubt that's really possible.
> >>
> >> MMIO/PIO operations cannot block.
> >
> > Well, there's a truckload of cases in real HW where an MMIO/PIO read is
> > used to synchronize some sort of HW operation.... I suppose nothing that
> > involves blocking at this stage in qemu but I would be careful with your
> > expectations here... writes are usually pipelined but blocking on a read
> > response does make a lot of sense.
> 
> Blocking on an MMIO/PIO request effectively freezes a CPU.  All sorts of badness 
> results from that.  Best case scenario, you trigger soft lockup warnings.

Well, that's exactly what happens in HW on PIO accesses and MMIO reads
waiting for a reply...

> > In any case, for the problem at hand, I can just drop the wait for now
> > and maybe just print a warning if I see an existing map.
> >
> > We still need some kind of either locking or barrier to simply ensure
> > that the updates to the TCE table are visible to other processors but
> > that can be done in the backend.
> >
> > But I wouldn't just forget about the issue, it's going to come back and
> > bite...
> 
> I think working out the exact semantics of what we need to do is absolutely 
> important.  But I think you're taking an overly conservative approach to what we 
> need to provide here.

I'm happy to have the patches merged without that for now; it will get
us going with USB emulation etc. (which we need for graphics), but we do
need to sort this out eventually.

I'll re-submit without it.

Cheers,
Ben.

> Regards,
> 
> Anthony Liguori
> 
> >
> > Cheers,
> > Ben.
> >
> >> Regards,
> >>
> >> Anthony Liguori
> >>
> >>>
> >>> The normal case will be that no map exists, ie, it will almost always be
> >>> a guest programming error to remove an iommu mapping while a device is
> >>> actively using it, so having this case be slow is probably a non-issue.
> >>>
> >>> Cheers,
> >>> Ben.
> >>>
> >>>
> >
> >

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH 08/13] iommu: Introduce IOMMU emulation infrastructure
  2012-05-15 21:55                 ` Benjamin Herrenschmidt
@ 2012-05-15 22:02                   ` Anthony Liguori
  2012-05-15 23:08                     ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 89+ messages in thread
From: Anthony Liguori @ 2012-05-15 22:02 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alex Williamson, Richard Henderson, Michael S. Tsirkin,
	qemu-devel, Eduard - Gabriel Munteanu

On 05/15/2012 04:55 PM, Benjamin Herrenschmidt wrote:
> On Tue, 2012-05-15 at 09:02 -0500, Anthony Liguori wrote:
>
>> I think the crux of your argument is that upon a change to the translation
>> table, the operation acts as a barrier such that the exact moment it returns,
>> you're guaranteed that no DMAs are in flight with the old translation mapping.
>
> Not when the translation is changed in memory but whenever the
> translation cache are invalidated or whatever other mechanism the HW
> provides to do that synchronization. On PAPR, this guarantee is provided
> by the H_PUT_TCE hypervisor call which we use to manipulate
> translations.

So this is from the VT-d spec:

"6.2.1 Register Based Invalidation Interface
The register based invalidations provides a synchronous hardware interface for 
invalidations.  Software is expected to write to the IOTLB registers to submit 
invalidation command and may poll on these registers to check for invalidation 
completion. For optimal performance, hardware implementations are recommended to 
complete an invalidation request with minimal latency"

This makes perfect sense.  You write to an MMIO location to request invalidation 
and then *poll* on a separate register for completion.

It's not a single MMIO operation that has an indefinite return duration.
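
For illustration, the write-then-poll pattern described above looks roughly
like the sketch below.  The register offsets, bit names and MMIO helpers are
made up for the example; they are not the actual VT-d programming interface.

    /* Illustrative sketch only: submit an invalidation command, then poll
     * a status bit until the hardware reports completion.  All names here
     * are hypothetical, not real VT-d registers. */
    #include <stdint.h>

    extern uint64_t iommu_reg_read(unsigned reg);              /* assumed MMIO read  */
    extern void iommu_reg_write(unsigned reg, uint64_t val);   /* assumed MMIO write */

    #define INV_CMD_REG     0x00                /* made-up register offsets  */
    #define INV_STATUS_REG  0x08
    #define INV_PENDING     (1ull << 63)        /* made-up "in progress" bit */

    static void iotlb_invalidate_sync(uint64_t cmd)
    {
        iommu_reg_write(INV_CMD_REG, cmd | INV_PENDING);
        /* Completion is detected by polling, not by the write itself blocking */
        while (iommu_reg_read(INV_STATUS_REG) & INV_PENDING) {
            /* spin (a real driver would relax/back off here) */
        }
    }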

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH 08/13] iommu: Introduce IOMMU emulation infrastructure
  2012-05-15 22:02                   ` Anthony Liguori
@ 2012-05-15 23:08                     ` Benjamin Herrenschmidt
  2012-05-15 23:58                       ` Anthony Liguori
  0 siblings, 1 reply; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-15 23:08 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Alex Williamson, Richard Henderson, Michael S. Tsirkin,
	qemu-devel, Eduard - Gabriel Munteanu

On Tue, 2012-05-15 at 17:02 -0500, Anthony Liguori wrote:
> 
> "6.2.1 Register Based Invalidation Interface
> The register based invalidations provides a synchronous hardware interface for 
> invalidations.  Software is expected to write to the IOTLB registers to submit 
> invalidation command and may poll on these registers to check for invalidation 
> completion. For optimal performance, hardware implementations are recommended to 
> complete an invalidation request with minimal latency"
> 
> This makes perfect sense.  You write to an MMIO location to request invalidation 
> and then *poll* on a separate register for completion.
> 
> It's not a single MMIO operation that has an indefinite return duration.

Sure, it's an implementation detail; I never meant that it had to be a
single blocking register access. All I said is that the HW must provide
such a mechanism, and that it is typically used synchronously by the
operating system. Polling for completion is a perfectly legit way to do it;
that's how we do it on the Apple G5 "DART" iommu as well.

The fact that MMIO operations can block is orthogonal; it is possible,
however, especially with ancient PIO devices.

In our case (TCEs) it's a hypervisor call, not an MMIO op, so to some
extent it's even more likely to do "blocking" things.

It would have been possible to implement a "busy" return status with the
guest having to try again, unfortunately that's not how Linux has
implemented it, so we are stuck with the current semantics.

Now, if you think that dropping the lock isn't good, what do you reckon
I should do ?

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH 08/13] iommu: Introduce IOMMU emulation infrastructure
  2012-05-15 23:08                     ` Benjamin Herrenschmidt
@ 2012-05-15 23:58                       ` Anthony Liguori
  2012-05-16  0:41                         ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 89+ messages in thread
From: Anthony Liguori @ 2012-05-15 23:58 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alex Williamson, Richard Henderson, Michael S. Tsirkin,
	qemu-devel, Eduard - Gabriel Munteanu

On 05/15/2012 06:08 PM, Benjamin Herrenschmidt wrote:
> On Tue, 2012-05-15 at 17:02 -0500, Anthony Liguori wrote:
>>
>> "6.2.1 Register Based Invalidation Interface
>> The register based invalidations provides a synchronous hardware interface for
>> invalidations.  Software is expected to write to the IOTLB registers to submit
>> invalidation command and may poll on these registers to check for invalidation
>> completion. For optimal performance, hardware implementations are recommended to
>> complete an invalidation request with minimal latency"
>>
>> This makes perfect sense.  You write to an MMIO location to request invalidation
>> and then *poll* on a separate register for completion.
>>
>> It's not a single MMIO operation that has an indefinite return duration.
>
> Sure, it's an implementation detail, I never meant that it had to be a
> single blocking register access, all I said is that the HW must provide
> such a mechanism that is typically used synchronously by the operating
> system. Polling for completion is a perfectly legit way to do it, that's
> how we do it on the Apple G5 "DART" iommu as well.
>
> The fact that MMIO operations can block is orthogonal, it is possible
> however, especially with ancient PIO devices.

Even ancient PIO devices really don't block indefinitely.

> In our case (TCEs) it's a hypervisor call, not an MMIO op, so to some
> extent it's even more likely to do "blocking" things.

Yes, so I think the right thing to do is not model hypercalls for sPAPR as 
synchronous calls but rather as asynchronous calls.  Obviously, simple ones can
use a synchronous implementation...

This is a matter of setting hlt=1 before dispatching the hypercall and passing a 
continuation to the call that, when executed, prepares the CPUState for the
hypercall return and then sets hlt=0 to resume the CPU.
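
A rough sketch of that shape, assuming QEMU's CPUState with 'halted' and
'gpr[]'; start_async_work() and the continuation here are made-up names,
not existing code:

    /* Illustrative only: park the VCPU for an asynchronous hypercall and
     * resume it from a completion continuation. */
    static void hcall_complete(void *opaque, target_ulong retval)
    {
        CPUState *env = opaque;

        env->gpr[3] = retval;      /* prepare the hypercall return value */
        env->halted = 0;           /* un-halt the VCPU, guest resumes    */
    }

    static void hcall_dispatch_async(CPUState *env, target_ulong opcode,
                                     target_ulong *args)
    {
        env->halted = 1;                        /* stop the VCPU until done  */
        start_async_work(opcode, args,          /* assumed helper that later */
                         hcall_complete, env);  /* runs the continuation     */
    }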

> It would have been possible to implement a "busy" return status with the
> guest having to try again, unfortunately that's not how Linux has
> implemented it, so we are stuck with the current semantics.
>
> Now, if you think that dropping the lock isn't good, what do you reckon
> I should do ?

Add a reference count to dma map calls and a flush_pending flag.  If 
flush_pending && ref > 0, return NULL for all map calls.

Decrement ref on unmap and if ref = 0 and flush_pending, clear flush_pending. 
You could add a flush_notifier too for this event.

dma_flush() sets flush_pending if ref > 0.  Your TCE flush hypercall would 
register for flush notifications and squirrel away the hypercall completion 
continuation.
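
A minimal sketch of that scheme, with made-up structure and function names
rather than existing QEMU APIs:

    #include <stdbool.h>

    /* Hypothetical bookkeeping hung off the IOMMU/DMA context. */
    typedef struct DMAFlushState {
        int map_refcount;                   /* outstanding map() regions       */
        bool flush_pending;                 /* flush waiting for refcount == 0 */
        void (*flush_notify)(void *opaque); /* continuation run once drained   */
        void *flush_opaque;
    } DMAFlushState;

    /* dma map path: refuse new mappings while a flush is waiting. */
    static bool dma_map_begin(DMAFlushState *s)
    {
        if (s->flush_pending) {
            return false;                   /* caller gets NULL / must retry */
        }
        s->map_refcount++;
        return true;
    }

    /* dma unmap path: the last unmap completes a pending flush. */
    static void dma_map_end(DMAFlushState *s)
    {
        if (--s->map_refcount == 0 && s->flush_pending) {
            s->flush_pending = false;
            if (s->flush_notify) {
                s->flush_notify(s->flush_opaque);
            }
        }
    }

    /* TCE flush hypercall: complete immediately, or squirrel away the
     * hypercall completion continuation until all mappings are gone. */
    static void dma_flush(DMAFlushState *s, void (*notify)(void *), void *opaque)
    {
        if (s->map_refcount > 0) {
            s->flush_pending = true;
            s->flush_notify = notify;
            s->flush_opaque = opaque;
        } else {
            notify(opaque);
        }
    }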

VT-d actually has a concept of an invalidation completion queue which delivers 
interrupt based notification of invalidation completion events.  The above 
flush_notify would be the natural way to support this since in this case, there 
is no VCPU event that's directly involved in the completion event.

Regards,

Anthony Liguori

> Cheers,
> Ben.
>
>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH 08/13] iommu: Introduce IOMMU emulation infrastructure
  2012-05-15 23:58                       ` Anthony Liguori
@ 2012-05-16  0:41                         ` Benjamin Herrenschmidt
  2012-05-16  0:54                           ` Anthony Liguori
  0 siblings, 1 reply; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-16  0:41 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Alex Williamson, Richard Henderson, Michael S. Tsirkin,
	qemu-devel, Eduard - Gabriel Munteanu

On Tue, 2012-05-15 at 18:58 -0500, Anthony Liguori wrote:

> Even ancient PIO devices really don't block indefinitely.
> 
> > In our case (TCEs) it's a hypervisor call, not an MMIO op, so to some
> > extent it's even more likely to do "blocking" things.
> 
> Yes, so I think the right thing to do is not model hypercalls for sPAPR as 
> synchronous calls but rather as asynchronous calls.  Obviously, simple ones can 
> use a synchronous implementation...
> 
> This is a matter of setting hlt=1 before dispatching the hypercall and passing a 
> continuation to the call that, when executed, prepares the CPUState for the 
> hypercall return and then sets hlt=0 to resume the CPU.

Is there any reason not to set that hlt after the dispatch ? IE. from
within the hypercall, for the very few that want to do asynchronous
completion, do something like spapr_hcall_suspend() before returning ?

> > It would have been possible to implement a "busy" return status with the
> > guest having to try again, unfortunately that's not how Linux has
> > implemented it, so we are stuck with the current semantics.
> >
> > Now, if you think that dropping the lock isn't good, what do you reckon
> > I should do ?
> 
> Add a reference count to dma map calls and a flush_pending flag.  If 
> flush_pending && ref > 0, return NULL for all map calls.
> 
> Decrement ref on unmap and if ref = 0 and flush_pending, clear flush_pending. 
> You could add a flush_notifier too for this event.
> 
> dma_flush() sets flush_pending if ref > 0.  Your TCE flush hypercall would 
> register for flush notifications and squirrel away the hypercall completion 
> continuation.

Ok, I'll look into it, thanks. Any good example to look at for how that
continuation stuff works ?

>> VT-d actually has a concept of an invalidation completion queue which delivers 
> interrupt based notification of invalidation completion events.  The above 
> flush_notify would be the natural way to support this since in this case, there 
> is no VCPU event that's directly involved in the completion event.

Cheers,
Ben.

> Regards,
> 
> Anthony Liguori
> 
> > Cheers,
> > Ben.
> >
> >

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH 08/13] iommu: Introduce IOMMU emulation infrastructure
  2012-05-16  0:41                         ` Benjamin Herrenschmidt
@ 2012-05-16  0:54                           ` Anthony Liguori
  2012-05-16  1:20                             ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 89+ messages in thread
From: Anthony Liguori @ 2012-05-16  0:54 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alex Williamson, Richard Henderson, Michael S. Tsirkin,
	qemu-devel, Eduard - Gabriel Munteanu

On 05/15/2012 07:41 PM, Benjamin Herrenschmidt wrote:
> On Tue, 2012-05-15 at 18:58 -0500, Anthony Liguori wrote:
>
>> Even ancient PIO devices really don't block indefinitely.
>>
>>> In our case (TCEs) it's a hypervisor call, not an MMIO op, so to some
>>> extent it's even more likely to do "blocking" things.
>>
>> Yes, so I think the right thing to do is not model hypercalls for sPAPR as
>> synchronous calls but rather as asynchronous calls.  Obviously, simple ones can
>> use a synchronous implementation...
>>
>> This is a matter of setting hlt=1 before dispatching the hypercall and passing a
>> continuation to the call that, when executed, prepares the CPUState for the
>> hypercall return and then sets hlt=0 to resume the CPU.
>
> Is there any reason not to set that hlt after the dispatch ? IE. from
> within the hypercall, for the very few that want to do asynchronous
> completion, do something like spapr_hcall_suspend() before returning ?

You certainly could do that but it may get a little weird dealing with the 
return path.  You'd have to return something like -EWOULDBLOCK and make sure you 
handle that in the dispatch code appropriately.

>>> It would have been possible to implement a "busy" return status with the
>>> guest having to try again, unfortunately that's not how Linux has
>>> implemented it, so we are stuck with the current semantics.
>>>
>>> Now, if you think that dropping the lock isn't good, what do you reckon
>>> I should do ?
>>
>> Add a reference count to dma map calls and a flush_pending flag.  If
>> flush_pending&&  ref>  0, return NULL for all map calls.
>>
>> Decrement ref on unmap and if ref = 0 and flush_pending, clear flush_pending.
>> You could add a flush_notifier too for this event.
>>
>> dma_flush() sets flush_pending if ref>  0.  Your TCE flush hypercall would
>> register for flush notifications and squirrel away the hypercall completion
>> continuation.
>
> Ok, I'll look into it, thanks. Any good example to look at for how that
> continuation stuff works ?

Just a callback and an opaque.  You could look at the AIOCB's in the block layer.

Regards,

Anthony Liguori

>> VT-d actually has a concept of an invalidation completion queue which delivers
>> interrupt based notification of invalidation completion events.  The above
>> flush_notify would be the natural way to support this since in this case, there
>> is no VCPU event that's directly involved in the completion event.
>
> Cheers,
> Ben.
>
>> Regards,
>>
>> Anthony Liguori
>>
>>> Cheers,
>>> Ben.
>>>
>>>
>
>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH 08/13] iommu: Introduce IOMMU emulation infrastructure
  2012-05-16  0:54                           ` Anthony Liguori
@ 2012-05-16  1:20                             ` Benjamin Herrenschmidt
  2012-05-16 19:36                               ` Anthony Liguori
  0 siblings, 1 reply; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-16  1:20 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Alex Williamson, Richard Henderson, Michael S. Tsirkin,
	qemu-devel, Eduard - Gabriel Munteanu

On Tue, 2012-05-15 at 19:54 -0500, Anthony Liguori wrote:
> 
> You certainly could do that but it may get a little weird dealing with the 
> return path.  You'd have to return something like -EWOULDBLOCK and make sure you 
> handle that in the dispatch code appropriately.

Hrm, our implementation of kvm_arch_handle_exit() always returns "1"
after a hypercall, forcing kvm_cpu_exec() to return to the thread
loop, so we should be ok to just set env->halted and return, no?

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH 13/13] iommu: Add a memory barrier to DMA RW function
  2012-05-15  1:44     ` David Gibson
@ 2012-05-16  4:35       ` Benjamin Herrenschmidt
  2012-05-16  5:51         ` David Gibson
  2012-05-16 19:39         ` Anthony Liguori
  0 siblings, 2 replies; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-16  4:35 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-devel, Anthony Liguori


> > >
> > >+    /* HACK: full memory barrier here */
> > >+    __sync_synchronize();
> > 
> > I thought you were going to limit this to the TCE iommu?
> 
> So, it wasn't my intention to send this one with the rest, but I
> forgot to explain that to Ben when he resent.  As the comment
> suggests, this is a hack we have been using internally to see if
> certain ordering problems were what we thought they were.  If that
> turned out to be the case (and it now looks like it is), we need to
> work out where to correctly place this barrier.  As Ben says, this
> should probably really be in the PCI accessors, and we should use the
> finer grained primitives from qemu-barrier.h rather than the brute
> force __sync_synchronize().

Well, I knew you didn't intend to send them but I still think that's the
right patch for now :-)

So we -could- put it in the PCI accessors ... but that would mean fixing
all drivers to actually use them. For example, ide/ahci or usb/ohci
don't and they aren't the only one.

In the end, I don't think there's anything we care about which would not
benefit from ensuring that the DMAs it does appear in the order they
were issued to the guest kernel. Most busses provide that guarantee to
some extent and while some busses do have the ability to explicitly
request relaxed ordering I don't think this is the case with anything we
care about emulating at this stage (and we can always make that separate
accessors or flags to add to the direction for example).

So by putting the barrier right in the dma_* accessor we kill all the
birds with one stone without having to audit all drivers for use of the
right accessors and all bus types.

Also while the goal of using more targeted barriers might be worthwhile
in the long run, it's not totally trivial because we do want to order
store vs. subsequent loads in all cases and load vs. loads, and we don't
want to have to keep track of what the previous access was, so at this
stage it's simply easier to just use a full barrier.

So my suggestion is to see if that patch introduces a measurable
performance regression anywhere we care about (ie on x86) and if not,
just go for it, it will solve a very real problem and we can ponder ways
to do it better as a second step if it's worthwhile.

Anthony, how do you usually benchmark these things ? Any chance you can
run a few tests to see if there's any visible loss ?

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH 13/13] iommu: Add a memory barrier to DMA RW function
  2012-05-16  4:35       ` Benjamin Herrenschmidt
@ 2012-05-16  5:51         ` David Gibson
  2012-05-16 19:39         ` Anthony Liguori
  1 sibling, 0 replies; 89+ messages in thread
From: David Gibson @ 2012-05-16  5:51 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: qemu-devel, Anthony Liguori

On Wed, May 16, 2012 at 02:35:38PM +1000, Benjamin Herrenschmidt wrote:
> 
> > > >
> > > >+    /* HACK: full memory barrier here */
> > > >+    __sync_synchronize();
> > > 
> > > I thought you were going to limit this to the TCE iommu?
> > 
> > So, it wasn't my intention to send this one with the rest, but I
> > forgot to explain that to Ben when he resent.  As the comment
> > suggests, this is a hack we have been using internally to see if
> > certain ordering problems were what we thought they were.  If that
> > turned out to be the case (and it now looks like it is), we need to
> > work out where to correctly place this barrier.  As Ben says, this
> > should probably really be in the PCI accessors, and we should use the
> > finer grained primitives from qemu-barrier.h rather than the brute
> > force __sync_synchronize().
> 
> Well, I knew you didn't intend to send them but I still think that's the
> right patch for now :-)
> 
> So we -could- put it in the PCI accessors ... but that would mean fixing
> all drivers to actually use them. For example, ide/ahci or usb/ohci
> don't and they aren't the only one.

Uh, right.  So, in fact, I didn't mean the PCI accessors precisely, I
think this should go in the general DMA accessors.  However, it should
go in the wrappers in the header file - we want the barrier even in
the non-IOMMU case.  And we should use the finer grained (and arch
tailored) barriers from qemu-barrier.h rather than
__sync_synchronize().


Actually other patches in my DMA series should fix AHCI and OHCI to
use the PCI accessors.

> In the end, I don't think there's anything we care about which would not
> benefit from ensuring that the DMAs it does appear in the order they
> were issued to the guest kernel. Most busses provide that guarantee to
> some extent and while some busses do have the ability to explicitly
> request relaxed ordering I don't think this is the case with anything we
> care about emulating at this stage (and we can always make that separate
> accessors or flags to add to the direction for example).
> 
> So by putting the barrier right in the dma_* accessor we kill all the
> birds with one stone without having to audit all drivers for use of the
> right accessors and all bus types.
> 
> Also while the goal of using more targeted barriers might be worthwhile
> in the long run, it's not totally trivial because we do want to order
> store vs. subsequent loads in all cases and load vs. loads, and we don't
> want to have to keep track of what the previous access was, so at this
> stage it's simply easier to just use a full barrier.

True, we need more than a wmb() here.  But it's not necessarily as
strong as __sync_synchronize() (e.g. on POWER I think we only need
'eieio', not 'sync').  More importantly, I believe qemu does build for
some platforms with crappy old gcc versions which don't have
__sync_synchronize(), so we want the ifdefs in qemu-barrier.h for that
reason (in many cases mb() from there will translate into
__sync_synchronize() anyway).
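
If the weaker ordering really were sufficient, the PPC definition would just
be a different flavour of the macro, e.g. (hypothetical, and not what the
patch later in this thread ends up using):

    /* Hypothetical relaxed PPC variant; the posted patch uses "sync" instead */
    #define dma_mb()    asm volatile("eieio" ::: "memory")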

> So my suggestion is to see if that patch introduces a measurable
> performance regression anywhere we care about (ie on x86) and if not,
> just go for it, it will solve a very real problem and we can ponder ways
> to do it better as a second step if it's worthwhile.
> 
> Anthony, how do you usually benchmark these things ? Any chance you can
> run a few tests to see if there's any visible loss ?
> 
> Cheers,
> Ben.
> 
> 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH 08/13] iommu: Introduce IOMMU emulation infrastructure
  2012-05-16  1:20                             ` Benjamin Herrenschmidt
@ 2012-05-16 19:36                               ` Anthony Liguori
  0 siblings, 0 replies; 89+ messages in thread
From: Anthony Liguori @ 2012-05-16 19:36 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alex Williamson, Richard Henderson, Michael S. Tsirkin,
	qemu-devel, Eduard - Gabriel Munteanu

On 05/15/2012 08:20 PM, Benjamin Herrenschmidt wrote:
> On Tue, 2012-05-15 at 19:54 -0500, Anthony Liguori wrote:
>>
>> You certainly could do that but it may get a little weird dealing with the
>> return path.  You'd have to return something like -EWOULDBLOCK and make sure you
>> handle that in the dispatch code appropriately.
>
> Hrm, our implementation of kvm_arch_handle_exit() always returns "1"
> after a hypercall, forcing kvm_cpu_exec() to return to the thread
> loop, so we should be ok to just set env->halted and return, no?

I meant that if you wanted to have a synchronous hypercall function to dispatch, 
and then later call "hypercall_finish()", you would need a way to return an error 
from the synchronous hypercall to indicate that something else will eventually 
call "hypercall_finish()".  The sync hypercall would need to return something 
like -EWOULDBLOCK.

Setting env->halted=1 ought to be enough to delay returning to the guest, 
although I'd have to go through the code to verify.

Regards,

Anthony Liguori

>
> Cheers,
> Ben.
>
>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH 13/13] iommu: Add a memory barrier to DMA RW function
  2012-05-16  4:35       ` Benjamin Herrenschmidt
  2012-05-16  5:51         ` David Gibson
@ 2012-05-16 19:39         ` Anthony Liguori
  2012-05-16 21:10           ` Benjamin Herrenschmidt
  2012-05-17  0:07           ` Benjamin Herrenschmidt
  1 sibling, 2 replies; 89+ messages in thread
From: Anthony Liguori @ 2012-05-16 19:39 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Michael S. Tsirkin, qemu-devel, David Gibson

On 05/15/2012 11:35 PM, Benjamin Herrenschmidt wrote:
>
>>>>
>>>> +    /* HACK: full memory barrier here */
>>>> +    __sync_synchronize();
>>>
>>> I thought you were going to limit this to the TCE iommu?
>>
>> So, it wasn't my intention to send this one with the rest, but I
>> forgot to explain that to Ben when he resent.  As the comment
>> suggests, this is a hack we have been using internally to see if
>> certain ordering problems were what we thought they were.  If that
>> turned out to be the case (and it now looks like it is), we need to
>> work out where to correctly place this barrier.  As Ben says, this
>> should probably really be in the PCI accessors, and we should use the
>> finer grained primitives from qemu-barrier.h rather than the brute
>> force __sync_synchronize().
>
> Well, I knew you didn't intend to send them but I still think that's the
> right patch for now :-)
>
> So we -could- put it in the PCI accessors ... but that would mean fixing
> all drivers to actually use them. For example, ide/ahci or usb/ohci
> don't and they aren't the only one.
>
> In the end, I don't think there's anything we care about which would not
> benefit from ensuring that the DMAs it does appear in the order they
> were issued to the guest kernel. Most busses provide that guarantee to
> some extent and while some busses do have the ability to explicitly
> request relaxed ordering I don't think this is the case with anything we
> care about emulating at this stage (and we can always make that separate
> accessors or flags to add to the direction for example).

I must confess, I have no idea what PCI et al guarantee with respect to 
ordering.  What's nasty about this patch is that you're not just ordering wrt 
device writes/reads, but also with the other VCPUs.  I don't suspect this would 
be prohibitively expensive but it still worries me.

> So by putting the barrier right in the dma_* accessor we kill all the
> birds with one stone without having to audit all drivers for use of the
> right accessors and all bus types.
>
> Also while the goal of using more targeted barriers might be worthwhile
> in the long run, it's not totally trivial because we do want to order
> store vs. subsequent loads in all cases and load vs. loads, and we don't
> want to have to keep track of what the previous access was, so at this
> stage it's simply easier to just use a full barrier.
>
> So my suggestion is to see if that patch introduces a measurable
> performance regression anywhere we care about (ie on x86) and if not,
> just go for it, it will solve a very real problem and we can ponder ways
> to do it better as a second step if it's worthwhile.
>
> Anthony, how do you usually benchmark these things ? Any chance you can
> run a few tests to see if there's any visible loss ?

My concern would really be limited to virtio ring processing. It all depends on 
where you place the barriers in the end.

I really don't want to just conservatively stick barriers everywhere either. 
I'd like to have a specific ordering guarantee and then implement that and deal 
with the performance consequences.

I also wonder if the "fix" that you see from this is papering around a bigger 
problem.  Can you explain the ohci problem that led you to do this in the first 
place?

Regards,

Anthony Liguori

>
> Cheers,
> Ben.
>
>
>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH 13/13] iommu: Add a memory barrier to DMA RW function
  2012-05-16 19:39         ` Anthony Liguori
@ 2012-05-16 21:10           ` Benjamin Herrenschmidt
  2012-05-16 21:12             ` Benjamin Herrenschmidt
  2012-05-17  0:07           ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-16 21:10 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Michael S. Tsirkin, qemu-devel, David Gibson

On Wed, 2012-05-16 at 14:39 -0500, Anthony Liguori wrote:

> I must confess, I have no idea what PCI et al guarantee with respect to 
> ordering.  What's nasty about this patch is that you're not just ordering wrt 
> device writes/reads, but also with the other VCPUs.  I don't suspect this would 
> be prohibitively expensive but it still worries me.

So the precise ordering rules of various busses can vary slightly.

We could try to get as precise & fine grained as those busses are in HW
or ... it's my belief that it makes sense to simply guarantee that the
DMA accesses done by emulated devices always appear to other VCPUs in
the order they were done by the device emulation code.

IE. If we can prove that the cost of doing so is negligible, then it's
also the simplest approach since just sticking that one barrier here
will provide that ordering guarantee (at least for anything using the
dma_* accessors).

Ordering problems can be really sneaky & nasty to debug and so I'm
really tempted to use that big hammer approach here, provided there is
no problematic performance loss.

> > So by putting the barrier right in the dma_* accessor we kill all the
> > birds with one stone without having to audit all drivers for use of the
> > right accessors and all bus types.
> >
> > Also while the goal of using more targeted barriers might be worthwhile
> > in the long run, it's not totally trivial because we do want to order
> > store vs. subsequent loads in all cases and load vs. loads, and we don't
> > want to have to keep track of what the previous access was, so at this
> > stage it's simply easier to just use a full barrier.
> >
> > So my suggestion is to see if that patch introduces a measurable
> > performance regression anywhere we care about (ie on x86) and if not,
> > just go for it, it will solve a very real problem and we can ponder ways
> > to do it better as a second step if it's worthwhile.
> >
> > Anthony, how do you usually benchmark these things ? Any chance you can
> > run a few tests to see if there's any visible loss ?
> 
> My concern would really be limited to virtio ring processing. It all depends on 
> where you place the barriers in the end.

So virtio doesn't use the dma_* interface since it bypasses the iommu
(on purpose).

> I really don't want to just conservatively stick barriers everywhere either. 
> I'd like to have a specific ordering guarantee and then implement that and deal 
> with the performance consequences.

Well, my idea is to provide a well defined ordering semantic of all DMA
accesses issued by a device :-) IE. All DMAs done by the device
emulation appear to other VCPUs in the order they were issued by the
emulation code. IE. Making the storage accesses visible in the right
order to "other VCPUs" is the whole point of the exercise.

This is well defined, though a bit broad and possibly broader than
strictly necessary but it's a cost/benefit game here. If the cost is low
enough, the benefit is that it's going to be safe, we won't have subtle
cases of things passing each other etc... and it's also simpler to
implement and maintain since it's basically one barrier in the right
place.

I have a long experience with dealing with ordering issues on large SMP
systems and believe me, anything "fine grained" is really really hard to
generally get right, and the resulting bugs are really nasty to track
down and even identify. So I have a strong bias toward the big hammer
approach that is guaranteed to avoid the problem for anything using the
right DMA accessors.

> I also wonder if the "fix" that you see from this is papering around a bigger 
> problem.  Can you explain the ohci problem that led you to do this in the first 
> place?

Well, we did an audit of OHCI and we discovered several bugs there which
have been fixed since then, mostly cases where the emulated device would
incorrectly read/modify/write entire data structures in guest memory
rather than just updating the fields it's supposed to update, causing
simultaneous updates of other fields by the guest driver to be lost.

The result was that we still had an occasional mild instability where
every now and then the host would seem to get errors or miss
completions, which the barrier appeared to fix.

On the other hand, we -know- that not having the barrier is incorrect so
that was enough for me to be happy about the diagnosis.

IE. The OHCI -will- update fields that must be visible in the right
order by the host driver (such as a link pointer in a TD followed by the
done list pointer pointing to that TD), and we know that POWER cpus are
very good at shooting stores out of order; with missing TDs on completion
being one of our symptoms, I think we pretty much nailed it.
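
To make the ordering requirement concrete, the done-list update is roughly
the pattern below; the names and helpers are made up for illustration, this
is not the actual hw/usb/ohci code:

    #include <stdint.h>

    extern void dma_write32(uint64_t guest_addr, uint32_t val); /* assumed DMA store */
    extern void dma_mb(void);                                   /* assumed barrier   */

    /* Retire a TD: link it to the old done-list head, then publish it as the
     * new head.  Without the barrier a weakly ordered host CPU can make the
     * guest see the new head before the link, so the driver misses TDs. */
    void retire_td(uint64_t td_addr, uint64_t done_head_addr, uint32_t old_head)
    {
        dma_write32(td_addr, old_head);                  /* NextTD link        */
        dma_mb();                                        /* link visible first */
        dma_write32(done_head_addr, (uint32_t)td_addr);  /* HccaDoneHead       */
    }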

I'm going to try on some fast x86 using things like AHCI to see if I can
show a performance issue.

Cheers,
Ben.

> Regards,
> 
> Anthony Liguori
> 
> >
> > Cheers,
> > Ben.
> >
> >
> >

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH 13/13] iommu: Add a memory barrier to DMA RW function
  2012-05-16 21:10           ` Benjamin Herrenschmidt
@ 2012-05-16 21:12             ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-16 21:12 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Michael S. Tsirkin, qemu-devel, David Gibson

On Thu, 2012-05-17 at 07:10 +1000, Benjamin Herrenschmidt wrote:

> I have a long experience with dealing with ordering issues on large SMP
> systems and believe me, anything "fine grained" is really really hard to
> generally get right, and the resulting bugs are really nasty to track
> down and even identify. So I have a strong bias toward the big hammer
> approach that is guaranteed to avoid the problem for anything using the
> right DMA accessors.

BTW, this is going to also hurt SMP ARMs ... ARM is getting increasingly
out of order as well.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH 13/13] iommu: Add a memory barrier to DMA RW function
  2012-05-16 19:39         ` Anthony Liguori
  2012-05-16 21:10           ` Benjamin Herrenschmidt
@ 2012-05-17  0:07           ` Benjamin Herrenschmidt
  2012-05-17  0:24             ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-17  0:07 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Michael S. Tsirkin, qemu-devel, David Gibson

So followup ....

For those on the list: Anthony and I had a chat and we agree that a
better thing to do is to have all cpu_physical_memory_* accesses to be
ordered in program order from the perspective of the VCPUs. Devices that
have performance critical accesses and want to do home made ordering can
use map/unmap.

Now looking at the code, however, there seems to be a lot of duplication,
ie cpu_physical_memory_rw() is an obvious choice to add a barrier but
what about all of the ldl_*, ldq_* etc... ? In fact there's about 45
different ways code can dig into guest memory, should they all be made
ordered ?

At this point, it might be easier to just stick a barrier in
qemu_get_ram_ptr() which seems to be called by everybody however that
means that things like cpu_physical_memory_rw() will end up hitting the
barrier for every page. It's safe but it might be a performance hit
(measurable ? I can give it a try, probably not).

Or we can just sprinkle the barrier everywhere, mostly it's going to be
in exec.c, all the "ram" cases in ld*_* and st*_*.

Also, should I make the barrier conditional to kvm_enabled() ? IE. It's
pointless in full emulation and might actually be a performance hit on
something already quite slow...
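
If it were made conditional, the idea would be something like the sketch
below (assuming the existing kvm_enabled() and smp_mb(); whether the extra
branch is worth it is exactly the open question):

    /* Hypothetical: only pay for the barrier when device emulation can
     * actually race with guest VCPUs, i.e. when running under KVM. */
    static inline void dma_mb(void)
    {
        if (kvm_enabled()) {
            smp_mb();
        }
    }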

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH 13/13] iommu: Add a memory barrier to DMA RW function
  2012-05-17  0:07           ` Benjamin Herrenschmidt
@ 2012-05-17  0:24             ` Benjamin Herrenschmidt
  2012-05-17  0:52               ` [Qemu-devel] [RFC/PATCH] Add a memory barrier to guest memory access functions Benjamin Herrenschmidt
  2012-05-18  6:53               ` [Qemu-devel] [PATCH 13/13] iommu: Add a memory barrier to DMA RW function Paolo Bonzini
  0 siblings, 2 replies; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-17  0:24 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Michael S. Tsirkin, qemu-devel, David Gibson

On Thu, 2012-05-17 at 10:07 +1000, Benjamin Herrenschmidt wrote:

 .../...

> Also, should I make the barrier conditional to kvm_enabled() ? IE. It's
> pointless in full emulation and might actually be a performance hit on
> something already quite slow...

Finally ... something like smp_mb() in qemu will turn into a lock op or
an mfence on x86, ie not a nop.

That means overhead from today's implementation, which leads to the
question ... is today implementation correct ? IE. Is a barrier needed
on x86 as well or not ?

If not (I'm trying to figure out why exactly x86 has a barrier in
the first place and when it's in order), then I might add a new barrier
type in qemu-barrier.h, something like dma_mb(), and define it as a nop
on x86, a lwsync or sync (still thinking about it) on ppc, and
__sync_synchronize() on unknown archs.

Any x86 guru around cares to explain me what exactly is the x86 memory
model and when does it need barriers ?

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* [Qemu-devel] [RFC/PATCH] Add a memory barrier to guest memory access functions
  2012-05-17  0:24             ` Benjamin Herrenschmidt
@ 2012-05-17  0:52               ` Benjamin Herrenschmidt
  2012-05-17  2:28                 ` Anthony Liguori
  2012-05-18  6:53               ` [Qemu-devel] [PATCH 13/13] iommu: Add a memory barrier to DMA RW function Paolo Bonzini
  1 sibling, 1 reply; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-17  0:52 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Michael S. Tsirkin, qemu-devel, David Gibson

The emulated devices can run simultaneously with the guest, so
we need to be careful with ordering of load and stores done by
them to the guest system memory, which need to be observed in
the right order by the guest operating system.

In order to avoid unnecessary overhead on i386, we define a new
barrier dma_mb() which is a full barrier on powerpc and a nop
on i386 and x86_64 (see the comment I added in the code).

This barrier is then added to qemu_get_ram_ptr(), which is easier
than sprinkling it into all the functions that provide guest
access, which are all more or less open coded.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---

Discussion: So the other option is to do it only in
cpu_physical_memory_rw() and leave the responsibility to use explicit
barriers to the callers of ld*/st* accessors. For example virtio already
does it in a few places explicitly.

 exec.c         |   11 +++++++++++
 qemu-barrier.h |   29 ++++++++++++++++++++++++++---
 2 files changed, 37 insertions(+), 3 deletions(-)

diff --git a/exec.c b/exec.c
index 40cf52d..fc857b6 100644
--- a/exec.c
+++ b/exec.c
@@ -25,6 +25,7 @@
 #endif
 
 #include "qemu-common.h"
+#include "qemu-barrier.h"
 #include "cpu.h"
 #include "tcg.h"
 #include "hw/hw.h"
@@ -2794,6 +2795,9 @@ void *qemu_get_ram_ptr(ram_addr_t addr)
 {
     RAMBlock *block;
 
+    /* We ensure ordering for all DMA transactions */
+    dma_mb();
+
     QLIST_FOREACH(block, &ram_list.blocks, next) {
         if (addr - block->offset < block->length) {
             /* Move this entry to to start of the list.  */
@@ -2830,6 +2834,9 @@ void *qemu_safe_ram_ptr(ram_addr_t addr)
 {
     RAMBlock *block;
 
+    /* We ensure ordering for all DMA transactions */
+    dma_mb();
+
     QLIST_FOREACH(block, &ram_list.blocks, next) {
         if (addr - block->offset < block->length) {
             if (xen_enabled()) {
@@ -2861,6 +2868,10 @@ void *qemu_ram_ptr_length(ram_addr_t addr, ram_addr_t *size)
     if (*size == 0) {
         return NULL;
     }
+
+    /* We ensure ordering for all DMA transactions */
+    dma_mb();
+
     if (xen_enabled()) {
         return xen_map_cache(addr, *size, 1);
     } else {
diff --git a/qemu-barrier.h b/qemu-barrier.h
index 7e11197..8c62683 100644
--- a/qemu-barrier.h
+++ b/qemu-barrier.h
@@ -23,7 +23,21 @@
 #define smp_mb() __sync_synchronize()
 #else
 #define smp_mb() asm volatile("lock; addl $0,0(%%esp) " ::: "memory")
-#endif
+ #endif
+
+/*
+ * DMA barrier is used to order accesses from qemu devices to
+ * guest memory, in order to make them appear to the guest in
+ * program order.
+ *
+ * We assume that we never use non-temporal accesses for such
+ * DMA and so don't need anything other than a compiler barrier
+ *
+ * If some devices use weakly ordered SSE load/store instructions
+ * then those devices will be responsible for using the appropriate
+ * barriers as well.
+ */
+#define dma_mb()    barrier()
 
 #elif defined(__x86_64__)
 
@@ -31,6 +45,9 @@
 #define smp_rmb()   barrier()
 #define smp_mb() asm volatile("mfence" ::: "memory")
 
+/* Same comment as i386 */
+#define dma_mb()    barrier()
+
 #elif defined(_ARCH_PPC)
 
 /*
@@ -46,8 +63,13 @@
 #define smp_rmb()   asm volatile("sync" ::: "memory")
 #endif
 
-#define smp_mb()   asm volatile("sync" ::: "memory")
+#define smp_mb()    asm volatile("sync" ::: "memory")
 
+/*
+ * We use a full barrier for DMA which encompasses the full
+ * requirements of the PCI ordering model.
+ */
+#define dma_mb()    smp_mb()
 #else
 
 /*
@@ -57,8 +79,9 @@
  * be overkill for wmb() and rmb().
  */
 #define smp_wmb()   __sync_synchronize()
-#define smp_mb()   __sync_synchronize()
+#define smp_mb()    __sync_synchronize()
 #define smp_rmb()   __sync_synchronize()
+#define dma_mb()    __sync_synchronize()
 
 #endif
 

^ permalink raw reply related	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC/PATCH] Add a memory barrier to guest memory access functions
  2012-05-17  0:52               ` [Qemu-devel] [RFC/PATCH] Add a memory barrier to guest memory access functions Benjamin Herrenschmidt
@ 2012-05-17  2:28                 ` Anthony Liguori
  2012-05-17  2:44                   ` Benjamin Herrenschmidt
  2012-05-17  3:35                   ` David Gibson
  0 siblings, 2 replies; 89+ messages in thread
From: Anthony Liguori @ 2012-05-17  2:28 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Michael S. Tsirkin, qemu-devel, David Gibson

On 05/16/2012 07:52 PM, Benjamin Herrenschmidt wrote:
> The emulated devices can run simultaneously with the guest, so
> we need to be careful with ordering of load and stores done by
> them to the guest system memory, which need to be observed in
> the right order by the guest operating system.
>
> In order to avoid unnecessary overhead on i386, we define a new
> barrier dma_mb() which is a full barrier on powerpc and a nop
> on i386 and x86_64 (see the comment I added in the code).
>
> This barrier is then added to qemu_get_ram_ptr() which is easier
> than sprinkling into all the functions that provide guest
> access and are all more or less open coded.
>
> Signed-off-by: Benjamin Herrenschmidt<benh@kernel.crashing.org>
> ---
>
> Discussion: So the other option is to do it only in
> cpu_physical_memory_rw() and leave the responsibility to use explicit
> barriers to the callers of ld*/st* accessors. For example virtio already
> does it in a few places explicitly.
>
>   exec.c         |   11 +++++++++++
>   qemu-barrier.h |   29 ++++++++++++++++++++++++++---
>   2 files changed, 37 insertions(+), 3 deletions(-)
>
> diff --git a/exec.c b/exec.c
> index 40cf52d..fc857b6 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -25,6 +25,7 @@
>   #endif
>
>   #include "qemu-common.h"
> +#include "qemu-barrier.h"
>   #include "cpu.h"
>   #include "tcg.h"
>   #include "hw/hw.h"
> @@ -2794,6 +2795,9 @@ void *qemu_get_ram_ptr(ram_addr_t addr)
>   {
>       RAMBlock *block;
>
> +    /* We ensure ordering for all DMA transactions */
> +    dma_mb();
> +

I get being conservative, but I don't think this makes a lot of sense.  There 
are cases where the return of this function is cached (like the VGA ram area). 
I think it would make more sense if you explicitly put a barrier after write 
operations.

Regards,

Anthony Liguori

>       QLIST_FOREACH(block,&ram_list.blocks, next) {
>           if (addr - block->offset<  block->length) {
>               /* Move this entry to to start of the list.  */
> @@ -2830,6 +2834,9 @@ void *qemu_safe_ram_ptr(ram_addr_t addr)
>   {
>       RAMBlock *block;
>
> +    /* We ensure ordering for all DMA transactions */
> +    dma_mb();
> +
>       QLIST_FOREACH(block,&ram_list.blocks, next) {
>           if (addr - block->offset<  block->length) {
>               if (xen_enabled()) {
> @@ -2861,6 +2868,10 @@ void *qemu_ram_ptr_length(ram_addr_t addr, ram_addr_t *size)
>       if (*size == 0) {
>           return NULL;
>       }
> +
> +    /* We ensure ordering for all DMA transactions */
> +    dma_mb();
> +
>       if (xen_enabled()) {
>           return xen_map_cache(addr, *size, 1);
>       } else {
> diff --git a/qemu-barrier.h b/qemu-barrier.h
> index 7e11197..8c62683 100644
> --- a/qemu-barrier.h
> +++ b/qemu-barrier.h
> @@ -23,7 +23,21 @@
>   #define smp_mb() __sync_synchronize()
>   #else
>   #define smp_mb() asm volatile("lock; addl $0,0(%%esp) " ::: "memory")
> -#endif
> + #endif
> +
> +/*
> + * DMA barrier is used to order accesses from qemu devices to
> + * guest memory, in order to make them appear to the guest in
> + * program order.
> + *
> + * We assume that we never use non-temporal accesses for such
> + * DMA and so don't need anything other than a compiler barrier
> + *
> + * If some devices use weakly ordered SSE load/store instructions
> + * then those devices will be responsible for using the appropriate
> + * barriers as well.
> + */
> +#define dma_mb()    barrier()
>
>   #elif defined(__x86_64__)
>
> @@ -31,6 +45,9 @@
>   #define smp_rmb()   barrier()
>   #define smp_mb() asm volatile("mfence" ::: "memory")
>
> +/* Same comment as i386 */
> +#define dma_mb()    barrier()
> +
>   #elif defined(_ARCH_PPC)
>
>   /*
> @@ -46,8 +63,13 @@
>   #define smp_rmb()   asm volatile("sync" ::: "memory")
>   #endif
>
> -#define smp_mb()   asm volatile("sync" ::: "memory")
> +#define smp_mb()    asm volatile("sync" ::: "memory")
>
> +/*
> + * We use a full barrier for DMA which encompasses the full
> + * requirements of the PCI ordering model.
> + */
> +#define dma_mb()    smp_mb()
>   #else
>
>   /*
> @@ -57,8 +79,9 @@
>    * be overkill for wmb() and rmb().
>    */
>   #define smp_wmb()   __sync_synchronize()
> -#define smp_mb()   __sync_synchronize()
> +#define smp_mb()    __sync_synchronize()
>   #define smp_rmb()   __sync_synchronize()
> +#define dma_mb()    __sync_synchronize()
>
>   #endif
>
>
>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC/PATCH] Add a memory barrier to guest memory access functions
  2012-05-17  2:28                 ` Anthony Liguori
@ 2012-05-17  2:44                   ` Benjamin Herrenschmidt
  2012-05-17 22:09                     ` Anthony Liguori
  2012-05-17  3:35                   ` David Gibson
  1 sibling, 1 reply; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-17  2:44 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Michael S. Tsirkin, qemu-devel, David Gibson

On Wed, 2012-05-16 at 21:28 -0500, Anthony Liguori wrote:

> > @@ -2794,6 +2795,9 @@ void *qemu_get_ram_ptr(ram_addr_t addr)
> >   {
> >       RAMBlock *block;
> >
> > +    /* We ensure ordering for all DMA transactions */
> > +    dma_mb();
> > +
> 
> I get being conservative, but I don't think this makes a lot of sense.  There 
> are cases where the return of this function is cached (like the VGA ram area). 
> I think it would make more sense if you explicitly put a barrier after write 
> operations.

Well, it depends ... sure, something that caches the result is akin to
map/unmap and is responsible for doing its own barriers between accesses;
however, as a whole this means that an entire map/unmap section is
ordered with respect to surrounding accesses, which is actually not a bad idea.

Anyway, I'll post a different patch that adds the barrier more
selectively to:

 - cpu_physical_memory_rw  (that's the obvious main one)
 - cpu_physical_memory_write_rom (probably overkill but
   not a fast path so no big deal)
 - ld*_* and st*_* (or do you think these should require
   explicit barriers in the callers ?)

Note that with the above, cpu_physical_memory_map and unmap will
imply a barrier when using bounce buffers; it would make sense to also
provide the same barrier when not.

That does actually make sense for the same reason explained above:
when those are used for a DMA transfer via async IO, it guarantees
ordering of the "block" vs. surrounding accesses even if accesses within
the actual map/unmap region are not ordered vs. each other.

Any objection ?

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC/PATCH] Add a memory barrier to guest memory access functions
  2012-05-17  2:28                 ` Anthony Liguori
  2012-05-17  2:44                   ` Benjamin Herrenschmidt
@ 2012-05-17  3:35                   ` David Gibson
  1 sibling, 0 replies; 89+ messages in thread
From: David Gibson @ 2012-05-17  3:35 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: qemu-devel, Michael S. Tsirkin

On Wed, May 16, 2012 at 09:28:39PM -0500, Anthony Liguori wrote:
> On 05/16/2012 07:52 PM, Benjamin Herrenschmidt wrote:
[snip]
> >@@ -2794,6 +2795,9 @@ void *qemu_get_ram_ptr(ram_addr_t addr)
> >  {
> >      RAMBlock *block;
> >
> >+    /* We ensure ordering for all DMA transactions */
> >+    dma_mb();
> >+
> 
> I get being conservative, but I don't think this makes a lot of
> sense.  There are cases where the return of this function is cached
> (like the VGA ram area). I think it would make more sense if you
> explicitly put a barrier after write operations.

I tend to agree.  I think the barriers should be in
cpu_physical_memory_rw() and the st*_phys() functions.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC/PATCH] Add a memory barrier to guest memory access functions
  2012-05-17  2:44                   ` Benjamin Herrenschmidt
@ 2012-05-17 22:09                     ` Anthony Liguori
  2012-05-18  1:04                       ` David Gibson
  2012-05-18  1:16                       ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 89+ messages in thread
From: Anthony Liguori @ 2012-05-17 22:09 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Michael S. Tsirkin, qemu-devel, David Gibson

On 05/16/2012 09:44 PM, Benjamin Herrenschmidt wrote:
> On Wed, 2012-05-16 at 21:28 -0500, Anthony Liguori wrote:
>
>>> @@ -2794,6 +2795,9 @@ void *qemu_get_ram_ptr(ram_addr_t addr)
>>>    {
>>>        RAMBlock *block;
>>>
>>> +    /* We ensure ordering for all DMA transactions */
>>> +    dma_mb();
>>> +
>>
>> I get being conservative, but I don't think this makes a lot of sense.  There
>> are cases where the return of this function is cached (like the VGA ram area).
>> I think it would make more sense if you explicitly put a barrier after write
>> operations.
>
> Well, it depends ... sure something that caches the result is akin to
> map/unmap and responsible for doing its own barriers between accesses,
> however as a whole, this means that an entire map/unmap section is
> ordered surrounding accesses which is actually not a bad idea.
>
> Anyway, I'll post a different patch that adds the barrier more
> selectively to:
>
>   - cpu_physical_memory_rw  (that's the obvious main one)
>   - cpu_physical_memory_write_rom (probably overkill but
>     not a fast path so no big deal)
>   - ld*_* and st*_* (or do you think these should require
>     explicit barriers in the callers ?)

ld/st should not ever be used by device emulation because they use a concept of 
"target endianness" that doesn't exist for devices.

So no, I don't think you need to put barriers there as they should only be used 
by VCPU emulation code.

> Note that with the above, cpu_physical_memory_map and unmap will
> imply a barrier when using bounce buffers, it would make sense to also
> provide the same barrier when not.
>
> That does actually make sense for the same reason explained above,
> ie, when those are used for a DMA transfer via async IO, that guarantees
> ordering of the "block" vs. surrounding accesses even if accesses within
> the actual map/unmap region are not ordered vs. each other.
>
> Any objection ?

I think so.  I'd like to see a better comment about barrier usage at the top of 
the file or something like that too.

Regards,

Anthony Liguori

>
> Cheers,
> Ben.
>
>
>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC/PATCH] Add a memory barrier to guest memory access functions
  2012-05-17 22:09                     ` Anthony Liguori
@ 2012-05-18  1:04                       ` David Gibson
  2012-05-18  1:16                       ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 89+ messages in thread
From: David Gibson @ 2012-05-18  1:04 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: qemu-devel, Michael S. Tsirkin

On Thu, May 17, 2012 at 05:09:22PM -0500, Anthony Liguori wrote:
> On 05/16/2012 09:44 PM, Benjamin Herrenschmidt wrote:
> >On Wed, 2012-05-16 at 21:28 -0500, Anthony Liguori wrote:
> >
> >>>@@ -2794,6 +2795,9 @@ void *qemu_get_ram_ptr(ram_addr_t addr)
> >>>   {
> >>>       RAMBlock *block;
> >>>
> >>>+    /* We ensure ordering for all DMA transactions */
> >>>+    dma_mb();
> >>>+
> >>
> >>I get being conservative, but I don't think this makes a lot of sense.  There
> >>are cases where the return of this function is cached (like the VGA ram area).
> >>I think it would make more sense if you explicitly put a barrier after write
> >>operations.
> >
> >Well, it depends ... sure something that caches the result is akin to
> >map/unmap and responsible for doing its own barriers between accesses,
> >however as a whole, this means that an entire map/unmap section is
> >ordered surrounding accesses which is actually not a bad idea.
> >
> >Anyway, I'll post a different patch that adds the barrier more
> >selectively to:
> >
> >  - cpu_physical_memory_rw  (that's the obvious main one)
> >  - cpu_physical_memory_write_rom (probably overkill but
> >    not a fast path so no big deal)
> >  - ld*_* and st*_* (or do you think these should require
> >    explicit barriers in the callers ?)
> 
> ld/st should not ever be used by device emulation because they use a
> concept of "target endianness" that doesn't exist for devices.

Ah, but there are the explicit endian versions, which were used
routinely until I replaced a lot of them with ld*_dma().

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC/PATCH] Add a memory barrier to guest memory access functions
  2012-05-17 22:09                     ` Anthony Liguori
  2012-05-18  1:04                       ` David Gibson
@ 2012-05-18  1:16                       ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-18  1:16 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Michael S. Tsirkin, qemu-devel, David Gibson

On Thu, 2012-05-17 at 17:09 -0500, Anthony Liguori wrote:

> ld/st should not ever be used by device emulation because they use a concept of 
> "target endianness" that doesn't exist for devices.

Hrm, there's a bit of both, some of them even have explicit endianness
arguments and some of them are definitely used by bits of hw/*

virtio core uses them directly as well but then virtio also does
explicit barriers (though I need to double check whether it does enough
of them in all the right places).

> So no, I don't think you need to put barriers there as they should only be used 
> by VCPU emulation code.

I'm ok to leave them alone for now, I'll give a quick look at the
various callers.

> > Note that with the above, cpu_physical_memory_map and unmap will
> > imply a barrier when using bounce buffers, it would make sense to also
> > provide the same barrier when not.
> >
> > That does actually make sense for the same reason explained above,
> > ie, when those are used for a DMA transfer via async IO, that guarantees
> > ordering of the "block" vs. surrounding accesses even if accesses within
> > the actual map/unmap region are not ordered vs. each other.
> >
> > Any objection ?
> 
> I think so.  I'd like to see a better comment about barrier usage at the top of 
> the file or something like that too.

Ok.

Cheers,
Ben.

> Regards,
> 
> Anthony Liguori
> 
> >
> > Cheers,
> > Ben.
> >
> >
> >

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH 13/13] iommu: Add a memory barrier to DMA RW function
  2012-05-17  0:24             ` Benjamin Herrenschmidt
  2012-05-17  0:52               ` [Qemu-devel] [RFC/PATCH] Add a memory barrier to guest memory access functions Benjamin Herrenschmidt
@ 2012-05-18  6:53               ` Paolo Bonzini
  2012-05-18  8:18                 ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 89+ messages in thread
From: Paolo Bonzini @ 2012-05-18  6:53 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: David Gibson, qemu-devel, Anthony Liguori, Michael S. Tsirkin

On 17/05/2012 02:24, Benjamin Herrenschmidt wrote:
>> > Also, should I make the barrier conditional to kvm_enabled() ? IE. It's
>> > pointless in full emulation and might actually be a performance hit on
>> > something already quite slow...
> Finally ... something like smp_mb() in qemu will turn into a lock op or
> an mfence on x86, ie not a nop.
> 
> That means overhead from today's implementation, which leads to the
> question ... is today's implementation correct ? IE. Is a barrier needed
> on x86 as well or not ?

It depends on what semantics you attach to dma_mb.  In my opinion,
having a separate barrier for DMA is wrong, because you want the same
semantics on all architectures.

The x86 requirements are roughly as follows:

1) it never needs explicit rmb and wmb (as long as you don't use
non-temporal stores etc.);

2) synchronized operations have an implicit mb before and after (unlike
LL/SC on PowerPC).

3) everywhere else, you need an mb.

So, on x86 you have more or less an implicit wmb after each write and an
implicit rmb before a read.  This explains why these kinds of bugs are
very hard to see on x86 (or often impossible to see).  Adding these in
cpu_physical_memory_rw has the advantage that x86 performance is not
affected, but it would not cover the case of a device model doing a DMA
read after a DMA write.  Then the device model would need to issue a
smp_mb manually, on all architectures.  I think this is too brittle.
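
To make the write-then-read case concrete, here is a rough sketch of a
hypothetical device model (the function, state and addresses are made
up, and it assumes the usual qemu headers; only the barrier placement
matters):

/* Hypothetical device model: write a completion status to guest
 * memory, then read back a guest-owned index.  With only implicit
 * per-direction barriers, the read could still pass the older write,
 * so the device model itself would have to add a full barrier. */
static void fake_dev_complete(uint64_t status_addr, uint64_t index_addr)
{
    uint32_t status = cpu_to_le32(1);
    uint32_t guest_index;

    cpu_physical_memory_write(status_addr, (uint8_t *)&status,
                              sizeof(status));

    smp_mb();    /* the manual barrier I would like to avoid */

    cpu_physical_memory_read(index_addr, (uint8_t *)&guest_index,
                             sizeof(guest_index));
    (void)le32_to_cpu(guest_index);
}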

> If not (I'm trying to figure out why exactly does x86 have a barrier in
> the first place and when it's in order), then I might add a new barrier
> type in qemu-barriers.h, something like dma_mb(), and define it as a nop
> on x86, a lwsync or sync (still thinking about it) on ppc, and
> __sync_synchronize() on unknown archs.

I don't think it is correct to think of this in terms of low-level
operations such as sync/lwsync.  Rather, I think what we want is
sequentially consistent accesses; it's heavyweight, but you cannot go
wrong.  http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html says this
is how you do it:

x86 -> Load Seq_Cst:	mov                or mfence; mov
       Store Seq Cst:	mov; mfence        or mov

ARM -> Load Seq Cst: 	ldr; dmb           or dmb; ldr; dmb
       Store Seq Cst: 	dmb; str; dmb      or dmb; str

PPC -> Load Seq Cst: 	sync; ld; cmp; bc; isync
       Store Seq Cst: 	sync; st

       where cmp; bc; isync can be replaced by sync.

and says "As far as the memory model is concerned, the ARM processor is
broadly similar to PowerPC, differing mainly in having a DMB barrier
(analogous to the PowerPC sync in its programmer-observable behaviour
for normal memory) and no analogue of the PowerPC lwsync".

So one of the two ARM mappings, with smp_mb instead of dmb, is what we
want in cpu_physical_memory_rw.  Device models that want to do better
can just use cpu_physical_memory_map, or we can add a
cpu_physical_memory_rw_direct for them.

Paolo

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH 13/13] iommu: Add a memory barrier to DMA RW function
  2012-05-18  6:53               ` [Qemu-devel] [PATCH 13/13] iommu: Add a memory barrier to DMA RW function Paolo Bonzini
@ 2012-05-18  8:18                 ` Benjamin Herrenschmidt
  2012-05-18  8:57                   ` Paolo Bonzini
  0 siblings, 1 reply; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-18  8:18 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: David Gibson, qemu-devel, Anthony Liguori, Michael S. Tsirkin

On Fri, 2012-05-18 at 08:53 +0200, Paolo Bonzini wrote:

> It depends on what semantics you attach to dma_mb.  In my opinion,
> having a separate barrier for DMA is wrong, because you want the same
> semantics on all architectures.
> 
> The x86 requirements are roughly as follows:
> 
> 1) it never needs explicit rmb and wmb (as long as you don't use
> non-temporal stores etc.);
> 
> 2) synchronized operations have an implicit mb before and after (unlike
> LL/SC on PowerPC).
> 
> 3) everywhere else, you need an mb.
> 
> So, on x86 you have more or less an implicit wmb after each write and an
> implicit rmb before a read.  This explains why these kind of bugs are
> very hard to see on x86 (or often impossible to see).

So what you mean is that on x86, a read can pass a write (and
vice-versa) ? Interesting.... I didn't know that.

>   Adding these in
> cpu_physical_memory_rw has the advantage that x86 performance is not
> affected, but it would not cover the case of a device model doing a DMA
> read after a DMA write.  Then the device model would need to issue a
> smp_mb manually, on all architectures.  I think this is too brittle.

Ok. Agreed.

I'll do a new patch on monday (I'm off for the week-end) that does that,
using the existing smp_mb in cpu_physical_memory_rw().

I'm still tempted to add barriers in map and unmap as well in the case
where they don't bounce to provide consistent semantics here, ie, all
accesses done between the map and unmap are ordered vs all previous and
subsequent accesses. Ok with that ?

I will not add barriers to the various ld*/st* variants.

> > If not (I'm trying to figure out why exactly does x86 have a barrier in
> > the first place and when it's in order), then I might add a new barrier
> > type in qemu-barriers.h, something like dma_mb(), and define it as a nop
> > on x86, a lwsync or sync (still thinking about it) on ppc, and
> > __sync_synchronize() on unknown archs.
> 
> I don't think it is correct to think of this in terms of low-level
> operations such as sync/lwsync.  Rather, I think what we want is
> sequentially consistent accesses; it's heavyweight, but you cannot go
> wrong.  http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html says this
> is how you do it:
> 
> x86 -> Load Seq_Cst:	mov                or mfence; mov
>        Store Seq Cst:	mov; mfence        or mov
> 
> ARM -> Load Seq Cst: 	ldr; dmb           or dmb; ldr; dmb
>        Store Seq Cst: 	dmb; str; dmb      or dmb; str
> 
> PPC -> Load Seq Cst: 	sync; ld; cmp; bc; isync
>        Store Seq Cst: 	sync; st
> 
>        where cmp; bc; isync can be replaced by sync.

Hrm, the cmp/bc/isync can be -very- expensive, we use a variant of that
using twi to enforce complete execution of reads in our readX()
accessors in the kernel but I don't think I want to do that in qemu.

The full sync should provide all the synchronization we need; the read
trick is really only meant in the kernel to enforce timings (ie, for a
read followed by a delay, it makes sure that the delay only starts
"counting" after the read has completed, which is useful when talking to
real HW which might have specific timing requirements).
 
> and says "As far as the memory model is concerned, the ARM processor is
> broadly similar to PowerPC, differing mainly in having a DMB barrier
> (analogous to the PowerPC sync in its programmer-observable behaviour
> for normal memory) and no analogue of the PowerPC lwsync".
> 
> So one of the two ARM mappings, with smp_mb instead of dmb, is what we
> want in cpu_physical_memory_rw.  Device models that want to do better
> can just use cpu_physical_memory_map, or we can add a
> cpu_physical_memory_rw_direct for them.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH 13/13] iommu: Add a memory barrier to DMA RW function
  2012-05-18  8:18                 ` Benjamin Herrenschmidt
@ 2012-05-18  8:57                   ` Paolo Bonzini
  2012-05-18 22:26                     ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 89+ messages in thread
From: Paolo Bonzini @ 2012-05-18  8:57 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: David Gibson, qemu-devel, Anthony Liguori, Michael S. Tsirkin

On 18/05/2012 10:18, Benjamin Herrenschmidt wrote:
> On Fri, 2012-05-18 at 08:53 +0200, Paolo Bonzini wrote:
> 
>> It depends on what semantics you attach to dma_mb.  In my opinion,
>> having a separate barrier for DMA is wrong, because you want the same
>> semantics on all architectures.
>>
>> The x86 requirements are roughly as follows:
>>
>> 1) it never needs explicit rmb and wmb (as long as you don't use
>> non-temporal stores etc.);
>>
>> 2) synchronized operations have an implicit mb before and after (unlike
>> LL/SC on PowerPC).
>>
>> 3) everywhere else, you need an mb.
>>
>> So, on x86 you have more or less an implicit wmb after each write and an
>> implicit rmb before a read.  This explains why these kind of bugs are
>> very hard to see on x86 (or often impossible to see).
> 
> So what you mean is that on x86, a read can pass a write (and
> vice-versa) ? Interesting.... I didn't know that.

I have to look it up every time.  It takes several pages in the x86
manuals, but the important part is this:

0) Reads are not moved before older reads, writes are not moved before
older writes

1) Writes are not moved before older reads

2) Reads may be moved before older writes to different locations, but
not before older writes to the same location.

3) Intra-processor forwarding is allowed. While a store is temporarily
held in a processor's store buffer, it can satisfy the processor's own
loads.
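
If it helps, point 2 is easy to see with a tiny user-space litmus test
(plain C with pthreads, nothing QEMU-specific; on most x86 machines it
can report a non-zero count, although that is of course timing
dependent and not guaranteed):

#include <pthread.h>
#include <stdio.h>

static volatile int x, y;
static int r0, r1;

static void *t0(void *arg)
{
    x = 1;
    r0 = y;    /* may be satisfied before the store to x is visible */
    return NULL;
}

static void *t1(void *arg)
{
    y = 1;
    r1 = x;
    return NULL;
}

int main(void)
{
    int i, hits = 0;

    for (i = 0; i < 100000; i++) {
        pthread_t a, b;

        x = y = 0;
        pthread_create(&a, NULL, t0, NULL);
        pthread_create(&b, NULL, t1, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        if (r0 == 0 && r1 == 0) {
            hits++;    /* both reads passed the older writes */
        }
    }
    printf("r0 == r1 == 0 observed %d times\n", hits);
    return 0;
}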

> I'm still tempted to add barriers in map and unmap as well in the case
> where they don't bounce to provide consistent semantics here, ie, all
> accesses done between the map and unmap are ordered vs all previous and
> subsequent accesses. Ok with that ? [...]
> I will not add barriers to the various ld*/st* variants.

In theory you would need a memory barrier before the first ld/st and one
after the last... considering virtio uses map/unmap, what about leaving
map/unmap and ld*_phys/st*_phys as the high performance unsafe API?
Then you can add barriers around ld*_pci_dma/st*_pci_dma.

>> x86 -> Load Seq_Cst:	 mov                or mfence; mov
>>        Store Seq Cst: mov; mfence        or mov
>>
>> ARM -> Load Seq Cst:  ldr; dmb           or dmb; ldr; dmb
>>        Store Seq Cst: dmb; str; dmb      or dmb; str
>>
>> PPC -> Load Seq Cst:  sync; ld; cmp; bc; isync
>>        Store Seq Cst: sync; st
>>
>>        where cmp; bc; isync can be replaced by sync.
> 
> Hrm, the cmp/bc/isync can be -very- expensive, we use a variant of that
> using twi to enforce complete execution of reads in our readX()
> accessors in the kernel but I don't think I want to do that in qemu.

Ah, ok, thanks for explaining what cmp;bc;isync really is. :)

> The full sync should provide all the synchronization we need

You mean "sync; ld; sync" for load and "sync; st" for store?  That would
do, yes.

Paolo

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH 13/13] iommu: Add a memory barrier to DMA RW function
  2012-05-18  8:57                   ` Paolo Bonzini
@ 2012-05-18 22:26                     ` Benjamin Herrenschmidt
  2012-05-19  7:24                       ` Paolo Bonzini
  0 siblings, 1 reply; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-18 22:26 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: David Gibson, qemu-devel, Anthony Liguori, Michael S. Tsirkin

On Fri, 2012-05-18 at 10:57 +0200, Paolo Bonzini wrote:

> > I'm still tempted to add barriers in map and unmap as well in the case
> > where they don't bounce to provide consistent semantics here, ie, all
> > accesses done between the map and unmap are ordered vs all previous and
> > subsequent accesses. Ok with that ? [...]
> > I will not add barriers to the various ld*/st* variants.
> 
> In theory you would need a memory barrier before the first ld/st and one
> after the last... considering virtio uses map/unmap, what about leaving
> map/unmap and ld*_phys/st*_phys as the high performance unsafe API?
> Then you can add barriers around ld*_pci_dma/st*_pci_dma.

So no, my idea is to make anybody using ld_* and st_*  (non _dma)
responsible for their own barriers. The _dma are implemented in terms of
cpu_physical_memory_rw so should inherit the barriers.

As for map/unmap, there's an inconsistency since when it falls back to
bounce buffering, it will get implicit barriers. My idea was to put a
barrier before always, see below.

> >> x86 -> Load Seq_Cst:	 mov                or mfence; mov
> >>        Store Seq Cst: mov; mfence        or mov
> >>
> >> ARM -> Load Seq Cst:  ldr; dmb           or dmb; ldr; dmb
> >>        Store Seq Cst: dmb; str; dmb      or dmb; str
> >>
> >> PPC -> Load Seq Cst:  sync; ld; cmp; bc; isync
> >>        Store Seq Cst: sync; st
> >>
> >>        where cmp; bc; isync can be replaced by sync.
> > 
> > Hrm, the cmp/bc/isync can be -very- expensive, we use a variant of that
> > using twi to enforce complete execution of reads in our readX()
> > accessors in the kernel but I don't think I want to do that in qemu.
> 
> Ah, ok, thanks for explaining what cmp;bc;isync really is. :)
> 
> > The full sync should provide all the synchronization we need
> 
> You mean "sync; ld; sync" for load and "sync; st" for store?  That would
> do, yes.

No, just "sync; ld"

That should be enough. Only if the device needs additional
synchronization against the guest directly accessing mapped device
memory would it need more, and that's something the device can deal with
explicitly if ever needed.

IE. If I put a barrier "before" in cpu_physical_memory_rw I ensure
ordering vs all previous accesses. Anything using the low level ld/st
accessors is responsible for its own barriers; virtio for example,
since it intentionally bypasses the dma/iommu stuff.

As for map/unmap, the idea is to add a barrier in map() as well in the
non-bounce case (maybe not unmap, keep the "before" semantic). This
keeps the semantic of map/unmap as a "whole" being ordered which makes
sense. IE. They are "high performance" in that there is no barrier
between individual accesses within the map/unmap sequence itself which
is good, the device is responsible for that if needed, but ordering the
whole block vs. previous accesses makes sense.

That means that most users (like block devices) don't actually need to
bother, ie they use map/unmap for AIO, the barrier in map provides
synchronization with previous descriptor accesses and the barrier in
cpu_physical_memory_rw orders the transfer vs. subsequent descriptor
updates. (Assuming the transfer contains actual CPU stores which it can,
if it ends up being real DMA under the hood then it's already ordered by
the host kernel driver).
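
Something like this is what I have in mind (a rough sketch only; the
device state, its dma field and the transfer helper are made up, and it
assumes the dma_memory_*() accessors from the iommu series):

static void foo_do_read(FooState *s, dma_addr_t buf_addr,
                        dma_addr_t status_addr, dma_addr_t len)
{
    dma_addr_t maplen = len;
    uint32_t done = cpu_to_le32(1);
    void *buf;

    /* The barrier in map orders the whole transfer after the earlier
     * reads of the descriptor that gave us buf_addr/len. */
    buf = dma_memory_map(s->dma, buf_addr, &maplen, DMA_DIRECTION_FROM_DEVICE);
    if (!buf) {
        return;
    }
    if (maplen < len) {
        dma_memory_unmap(s->dma, buf, maplen, DMA_DIRECTION_FROM_DEVICE, 0);
        return;
    }

    foo_transfer(s, buf, maplen);   /* stand-in for the (possibly async)
                                     * data transfer into guest memory */

    dma_memory_unmap(s->dma, buf, maplen, DMA_DIRECTION_FROM_DEVICE, maplen);

    /* This goes through cpu_physical_memory_rw, so its barrier orders
     * the completion update after the transfer above; the device needs
     * no explicit smp_mb() of its own. */
    dma_memory_write(s->dma, status_addr, &done, sizeof(done));
}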

Anyway, I'll post a patch on monday my time and we can discuss further.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH 13/13] iommu: Add a memory barrier to DMA RW function
  2012-05-18 22:26                     ` Benjamin Herrenschmidt
@ 2012-05-19  7:24                       ` Paolo Bonzini
  2012-05-20 21:36                         ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 89+ messages in thread
From: Paolo Bonzini @ 2012-05-19  7:24 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Michael S. Tsirkin, qemu-devel, Anthony Liguori, David Gibson

On 19/05/2012 00:26, Benjamin Herrenschmidt wrote:
>> In theory you would need a memory barrier before the first ld/st and one
>> after the last... considering virtio uses map/unmap, what about leaving
>> map/unmap and ld*_phys/st*_phys as the high performance unsafe API?
>> Then you can add barriers around ld*_pci_dma/st*_pci_dma.
> 
> So no, my idea is to make anybody using ld_* and st_*  (non _dma)
> responsible for their own barriers. The _dma are implemented in term of
> cpu_physical_memory_rw so should inherit the barriers.

Yeah, after these patches they are.

> As for map/unmap, there's an inconsistency since when it falls back to
> bounce buffering, it will get implicit barriers. My idea was to put a
> barrier before always, see blow.

The bounce buffering case is never hit in practice.  Your reasoning
about adding a barrier before always makes sense, but probably it's
better to add (a) a variant of map with no barrier; (b) a variant that
takes an sglist that would add only one barrier.

I agree that a barrier in unmap is not needed.

>>> The full sync should provide all the synchronization we need
>>
>> You mean "sync; ld; sync" for load and "sync; st" for store?  That would
>> do, yes.
> 
> No, just sync,ld
> 
> IE. If I put a barrier "before" in cpu_physical_memory_rw I ensure
> ordering vs all previous accesses. 

Ok.

I guess the C11/C++ guys required an isync barrier after either loads or
stores, because they need to order the load/store vs. code accessing
other memory.  This is not needed in QEMU because all guest accesses go
through cpu_physical_memory_rw (or have their own barriers).

Paolo

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH 13/13] iommu: Add a memory barrier to DMA RW function
  2012-05-19  7:24                       ` Paolo Bonzini
@ 2012-05-20 21:36                         ` Benjamin Herrenschmidt
  2012-05-21  1:56                           ` [Qemu-devel] [PATCH] Add a memory barrier to guest memory access functions Benjamin Herrenschmidt
  0 siblings, 1 reply; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-20 21:36 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Michael S. Tsirkin, qemu-devel, Anthony Liguori, David Gibson

On Sat, 2012-05-19 at 09:24 +0200, Paolo Bonzini wrote:

> I guess the C11/C++ guys required an isync barrier after either loads or
> stores, because they need to order the load/store vs. code accessing
> other memory.  This is not needed in QEMU because all guest accesses go
> through cpu_physical_memory_rw (or has its own barriers).

I am not sure, I don't quite see what it buys them really. I'd have to
ask Paul McKenney, he probably knows :-)

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* [Qemu-devel] [PATCH 06/13 - UPDATED] ide/ahci: Use universal DMA helper functions
  2012-05-10  4:49 ` [Qemu-devel] [PATCH 06/13] ide/ahci: Use universal DMA helper functions Benjamin Herrenschmidt
@ 2012-05-21  1:51   ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-21  1:51 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Michael S. Tsirkin, David Gibson

The AHCI device can provide both PCI and SysBus AHCI device
emulations.  For this reason, it wasn't previously converted to use
the pci_dma_*() helper functions.  Now that we have universal DMA
helper functions, this converts AHCI to use them.

The DMAContext is obtained from pci_dma_context() in the PCI case and
set to NULL in the SysBus case (i.e. we assume for now that a SysBus
AHCI has no IOMMU translation).

Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Michael S. Tsirkin <mst@redhat.com>

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---

The previous version missed a couple of conversions, possibly
stuff that was added since the patch was originally written. 

 hw/ide/ahci.c |   51 +++++++++++++++++++++++++++++----------------------
 hw/ide/ahci.h |    3 ++-
 hw/ide/ich.c  |    2 +-
 3 files changed, 32 insertions(+), 24 deletions(-)

diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
index 96d8f62..895a756 100644
--- a/hw/ide/ahci.c
+++ b/hw/ide/ahci.c
@@ -172,17 +172,18 @@ static void ahci_trigger_irq(AHCIState *s, AHCIDevice *d,
     ahci_check_irq(s);
 }
 
-static void map_page(uint8_t **ptr, uint64_t addr, uint32_t wanted)
+static void map_page(AHCIState *s, uint8_t **ptr, uint64_t addr,
+                     uint32_t wanted)
 {
     target_phys_addr_t len = wanted;
 
     if (*ptr) {
-        cpu_physical_memory_unmap(*ptr, len, 1, len);
+        dma_memory_unmap(s->dma, *ptr, len, DMA_DIRECTION_FROM_DEVICE, len);
     }
 
-    *ptr = cpu_physical_memory_map(addr, &len, 1);
+    *ptr = dma_memory_map(s->dma, addr, &len, DMA_DIRECTION_FROM_DEVICE);
     if (len < wanted) {
-        cpu_physical_memory_unmap(*ptr, len, 1, len);
+        dma_memory_unmap(s->dma, *ptr, len, DMA_DIRECTION_FROM_DEVICE, len);
         *ptr = NULL;
     }
 }
@@ -195,24 +196,24 @@ static void  ahci_port_write(AHCIState *s, int port, int offset, uint32_t val)
     switch (offset) {
         case PORT_LST_ADDR:
             pr->lst_addr = val;
-            map_page(&s->dev[port].lst,
+            map_page(s, &s->dev[port].lst,
                      ((uint64_t)pr->lst_addr_hi << 32) | pr->lst_addr, 1024);
             s->dev[port].cur_cmd = NULL;
             break;
         case PORT_LST_ADDR_HI:
             pr->lst_addr_hi = val;
-            map_page(&s->dev[port].lst,
+            map_page(s, &s->dev[port].lst,
                      ((uint64_t)pr->lst_addr_hi << 32) | pr->lst_addr, 1024);
             s->dev[port].cur_cmd = NULL;
             break;
         case PORT_FIS_ADDR:
             pr->fis_addr = val;
-            map_page(&s->dev[port].res_fis,
+            map_page(s, &s->dev[port].res_fis,
                      ((uint64_t)pr->fis_addr_hi << 32) | pr->fis_addr, 256);
             break;
         case PORT_FIS_ADDR_HI:
             pr->fis_addr_hi = val;
-            map_page(&s->dev[port].res_fis,
+            map_page(s, &s->dev[port].res_fis,
                      ((uint64_t)pr->fis_addr_hi << 32) | pr->fis_addr, 256);
             break;
         case PORT_IRQ_STAT:
@@ -588,7 +589,7 @@ static void ahci_write_fis_d2h(AHCIDevice *ad, uint8_t *cmd_fis)
     AHCIPortRegs *pr = &ad->port_regs;
     uint8_t *d2h_fis;
     int i;
-    target_phys_addr_t cmd_len = 0x80;
+    dma_addr_t cmd_len = 0x80;
     int cmd_mapped = 0;
 
     if (!ad->res_fis || !(pr->cmd & PORT_CMD_FIS_RX)) {
@@ -598,7 +599,8 @@ static void ahci_write_fis_d2h(AHCIDevice *ad, uint8_t *cmd_fis)
     if (!cmd_fis) {
         /* map cmd_fis */
         uint64_t tbl_addr = le64_to_cpu(ad->cur_cmd->tbl_addr);
-        cmd_fis = cpu_physical_memory_map(tbl_addr, &cmd_len, 0);
+        cmd_fis = dma_memory_map(ad->hba->dma, tbl_addr, &cmd_len,
+                                 DMA_DIRECTION_TO_DEVICE);
         cmd_mapped = 1;
     }
 
@@ -630,7 +632,8 @@ static void ahci_write_fis_d2h(AHCIDevice *ad, uint8_t *cmd_fis)
     ahci_trigger_irq(ad->hba, ad, PORT_IRQ_D2H_REG_FIS);
 
     if (cmd_mapped) {
-        cpu_physical_memory_unmap(cmd_fis, cmd_len, 0, cmd_len);
+        dma_memory_unmap(ad->hba->dma, cmd_fis, cmd_len,
+                         DMA_DIRECTION_TO_DEVICE, cmd_len);
     }
 }
 
@@ -640,8 +643,8 @@ static int ahci_populate_sglist(AHCIDevice *ad, QEMUSGList *sglist)
     uint32_t opts = le32_to_cpu(cmd->opts);
     uint64_t prdt_addr = le64_to_cpu(cmd->tbl_addr) + 0x80;
     int sglist_alloc_hint = opts >> AHCI_CMD_HDR_PRDT_LEN;
-    target_phys_addr_t prdt_len = (sglist_alloc_hint * sizeof(AHCI_SG));
-    target_phys_addr_t real_prdt_len = prdt_len;
+    dma_addr_t prdt_len = (sglist_alloc_hint * sizeof(AHCI_SG));
+    dma_addr_t real_prdt_len = prdt_len;
     uint8_t *prdt;
     int i;
     int r = 0;
@@ -652,7 +655,8 @@ static int ahci_populate_sglist(AHCIDevice *ad, QEMUSGList *sglist)
     }
 
     /* map PRDT */
-    if (!(prdt = cpu_physical_memory_map(prdt_addr, &prdt_len, 0))){
+    if (!(prdt = dma_memory_map(ad->hba->dma, prdt_addr, &prdt_len,
+                                DMA_DIRECTION_TO_DEVICE))){
         DPRINTF(ad->port_no, "map failed\n");
         return -1;
     }
@@ -667,8 +671,7 @@ static int ahci_populate_sglist(AHCIDevice *ad, QEMUSGList *sglist)
     if (sglist_alloc_hint > 0) {
         AHCI_SG *tbl = (AHCI_SG *)prdt;
 
-        /* FIXME: pass the correct DMAContext */
-        qemu_sglist_init(sglist, sglist_alloc_hint, NULL);
+        qemu_sglist_init(sglist, sglist_alloc_hint, ad->hba->dma);
         for (i = 0; i < sglist_alloc_hint; i++) {
             /* flags_size is zero-based */
             qemu_sglist_add(sglist, le64_to_cpu(tbl[i].addr),
@@ -677,7 +680,8 @@ static int ahci_populate_sglist(AHCIDevice *ad, QEMUSGList *sglist)
     }
 
 out:
-    cpu_physical_memory_unmap(prdt, prdt_len, 0, prdt_len);
+    dma_memory_unmap(ad->hba->dma, prdt, prdt_len,
+                     DMA_DIRECTION_TO_DEVICE, prdt_len);
     return r;
 }
 
@@ -787,7 +791,7 @@ static int handle_cmd(AHCIState *s, int port, int slot)
     uint64_t tbl_addr;
     AHCICmdHdr *cmd;
     uint8_t *cmd_fis;
-    target_phys_addr_t cmd_len;
+    dma_addr_t cmd_len;
 
     if (s->dev[port].port.ifs[0].status & (BUSY_STAT|DRQ_STAT)) {
         /* Engine currently busy, try again later */
@@ -809,7 +813,8 @@ static int handle_cmd(AHCIState *s, int port, int slot)
     tbl_addr = le64_to_cpu(cmd->tbl_addr);
 
     cmd_len = 0x80;
-    cmd_fis = cpu_physical_memory_map(tbl_addr, &cmd_len, 1);
+    cmd_fis = dma_memory_map(s->dma, tbl_addr, &cmd_len,
+                             DMA_DIRECTION_FROM_DEVICE);
 
     if (!cmd_fis) {
         DPRINTF(port, "error: guest passed us an invalid cmd fis\n");
@@ -935,7 +940,8 @@ static int handle_cmd(AHCIState *s, int port, int slot)
     }
 
 out:
-    cpu_physical_memory_unmap(cmd_fis, cmd_len, 1, cmd_len);
+    dma_memory_unmap(s->dma, cmd_fis, cmd_len, DMA_DIRECTION_FROM_DEVICE,
+                     cmd_len);
 
     if (s->dev[port].port.ifs[0].status & (BUSY_STAT|DRQ_STAT)) {
         /* async command, complete later */
@@ -1115,11 +1121,12 @@ static const IDEDMAOps ahci_dma_ops = {
     .reset = ahci_dma_reset,
 };
 
-void ahci_init(AHCIState *s, DeviceState *qdev, int ports)
+void ahci_init(AHCIState *s, DeviceState *qdev, DMAContext *dma, int ports)
 {
     qemu_irq *irqs;
     int i;
 
+    s->dma = dma;
     s->ports = ports;
     s->dev = g_malloc0(sizeof(AHCIDevice) * ports);
     ahci_reg_init(s);
@@ -1182,7 +1189,7 @@ static const VMStateDescription vmstate_sysbus_ahci = {
 static int sysbus_ahci_init(SysBusDevice *dev)
 {
     SysbusAHCIState *s = FROM_SYSBUS(SysbusAHCIState, dev);
-    ahci_init(&s->ahci, &dev->qdev, s->num_ports);
+    ahci_init(&s->ahci, &dev->qdev, NULL, s->num_ports);
 
     sysbus_init_mmio(dev, &s->ahci.mem);
     sysbus_init_irq(dev, &s->ahci.irq);
diff --git a/hw/ide/ahci.h b/hw/ide/ahci.h
index b223d2c..af8c6ef 100644
--- a/hw/ide/ahci.h
+++ b/hw/ide/ahci.h
@@ -299,6 +299,7 @@ typedef struct AHCIState {
     uint32_t idp_index;     /* Current IDP index */
     int ports;
     qemu_irq irq;
+    DMAContext *dma;
 } AHCIState;
 
 typedef struct AHCIPCIState {
@@ -329,7 +330,7 @@ typedef struct NCQFrame {
     uint8_t reserved10;
 } QEMU_PACKED NCQFrame;
 
-void ahci_init(AHCIState *s, DeviceState *qdev, int ports);
+void ahci_init(AHCIState *s, DeviceState *qdev, DMAContext *dma, int ports);
 void ahci_uninit(AHCIState *s);
 
 void ahci_reset(void *opaque);
diff --git a/hw/ide/ich.c b/hw/ide/ich.c
index 560ae37..5354e13 100644
--- a/hw/ide/ich.c
+++ b/hw/ide/ich.c
@@ -91,7 +91,7 @@ static int pci_ich9_ahci_init(PCIDevice *dev)
     uint8_t *sata_cap;
     d = DO_UPCAST(struct AHCIPCIState, card, dev);
 
-    ahci_init(&d->ahci, &dev->qdev, 6);
+    ahci_init(&d->ahci, &dev->qdev, pci_dma_context(dev), 6);
 
     pci_config_set_prog_interface(d->card.config, AHCI_PROGMODE_MAJOR_REV_1);
 
-- 
1.7.7.6

^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [Qemu-devel] [PATCH] Add a memory barrier to guest memory access functions
  2012-05-20 21:36                         ` Benjamin Herrenschmidt
@ 2012-05-21  1:56                           ` Benjamin Herrenschmidt
  2012-05-21  8:11                             ` Paolo Bonzini
  0 siblings, 1 reply; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-21  1:56 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: David Gibson, qemu-devel, Anthony Liguori, Michael S. Tsirkin

The emulated devices can run simultaneously with the guest, so
we need to be careful with ordering of load and stores done by
them to the guest system memory, which need to be observed in
the right order by the guest operating system.

This adds barriers to some standard guest memory access functions
along with a comment explaining the semantics to exec.c

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---

So after the discussion with Paolo, I removed the specific accessors,
just used a normal smp_mb() in only two places, cpu_physical_memory_rw
and cpu_physical_memory_map.

I don't see an obvious need to provide a "relaxed" variant of the
latter at this stage; a quick grep seems to show that most cases
where it's used are either not performance sensitive or ones where
the barrier makes sense, but feel free to prove me wrong :-)

If we really want that, my suggestion would be to change the "is_write"
flag into a proper bitmask of direction and relaxed attribute (which we
can use for more attributes in the future if needed). 

Also, we probably want an smp_mb() when shooting MSIs (not LSIs, those
are not ordered, that's why the guest driver needs to do an MMIO read
after an LSI, but MSIs are). I haven't looked at that yet, we can do
it from a separate patch if needed.

 exec.c |   37 +++++++++++++++++++++++++++++++++++++
 1 files changed, 37 insertions(+), 0 deletions(-)

diff --git a/exec.c b/exec.c
index 363ec98..997dbb0 100644
--- a/exec.c
+++ b/exec.c
@@ -16,6 +16,34 @@
  * You should have received a copy of the GNU Lesser General Public
  * License along with this library; if not, see <http://www.gnu.org/licenses/>.
  */
+
+/*
+ * Note on memory barriers usage:
+ *
+ * In order for emulated devices "DMA" operations to appear
+ * with a consistent ordering to the guest, we provide some
+ * amount of ordering guarantees:
+ *
+ * cpu_physical_memory_rw() (and all functions using it) is
+ * ordered vs. all previous accesses (it begins with a full
+ * memory barrier)
+ *
+ * This includes all the new dma_* accessors.
+ *
+ * The old variants of ld* and st* that have not been converted
+ * to dma_ are not ordered. Users are responsible for using their
+ * own ordering.
+ *
+ * cpu_physical_memory_map() provides a memory barrier vs. all
+ * previous accesses. There is no implicit barrier on unmap.
+ * If ordering is required between accesses within the map/unmap
+ * sequence, then it needs to be done explicitly.
+ *
+ * This means that a typical block driver using map/unmap across
+ * the transfer of a block followed by dma_ writes to signal
+ * completion or interrupt shouldn't require the addition of
+ * explicit barriers.
+ */ 
 #include "config.h"
 #ifdef _WIN32
 #include <windows.h>
@@ -25,6 +53,7 @@
 #endif
 
 #include "qemu-common.h"
+#include "qemu-barrier.h"
 #include "cpu.h"
 #include "tcg.h"
 #include "hw/hw.h"
@@ -3516,6 +3545,10 @@ void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
     target_phys_addr_t page;
     MemoryRegionSection *section;
 
+    /* Provides ordering vs. previous accesses, see comments
+     * at the top of this file */
+    smp_mb();
+
     while (len > 0) {
         page = addr & TARGET_PAGE_MASK;
         l = (page + TARGET_PAGE_SIZE) - addr;
@@ -3713,6 +3746,10 @@ void *cpu_physical_memory_map(target_phys_addr_t addr,
     ram_addr_t rlen;
     void *ret;
 
+    /* Provides ordering vs. previous accesses, see comments
+     * at the top of this file */
+    smp_mb();
+
     while (len > 0) {
         page = addr & TARGET_PAGE_MASK;
         l = (page + TARGET_PAGE_SIZE) - addr;

^ permalink raw reply related	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH] Add a memory barrier to guest memory access functions
  2012-05-21  1:56                           ` [Qemu-devel] [PATCH] Add a memory barrier to guest memory access functions Benjamin Herrenschmidt
@ 2012-05-21  8:11                             ` Paolo Bonzini
  2012-05-21  8:31                               ` Michael S. Tsirkin
  0 siblings, 1 reply; 89+ messages in thread
From: Paolo Bonzini @ 2012-05-21  8:11 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: David Gibson, qemu-devel, Anthony Liguori, Michael S. Tsirkin

On 21/05/2012 03:56, Benjamin Herrenschmidt wrote:
> I don't see an obvious need to provide a "relaxed" variant of the
> later at this stage, a quick grep doesn't seem to show that most cases
> where it's used are either not performance sensitive or the barrier
> makes sense, but feel free to prove me wrong :-)

The only problem here is that you have useless memory barriers when
calling cpu_physical_memory_map in a loop (see virtqueue_map_sg).

Paolo

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH] Add a memory barrier to guest memory access functions
  2012-05-21  8:11                             ` Paolo Bonzini
@ 2012-05-21  8:31                               ` Michael S. Tsirkin
  2012-05-21  8:58                                 ` Benjamin Herrenschmidt
  2012-05-21 22:18                                 ` [Qemu-devel] [PATCH] Add a memory barrier to guest memory access functions Anthony Liguori
  0 siblings, 2 replies; 89+ messages in thread
From: Michael S. Tsirkin @ 2012-05-21  8:31 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel, Anthony Liguori, David Gibson

On Mon, May 21, 2012 at 10:11:06AM +0200, Paolo Bonzini wrote:
> On 21/05/2012 03:56, Benjamin Herrenschmidt wrote:
> > I don't see an obvious need to provide a "relaxed" variant of the
> > later at this stage, a quick grep doesn't seem to show that most cases
> > where it's used are either not performance sensitive or the barrier
> > makes sense, but feel free to prove me wrong :-)
> 
> The only problem here is that you have useless memory barriers when
> calling cpu_physical_memory_map in a loop (see virtqueue_map_sg).
> 
> Paolo

More than that. smp_mb is pretty expensive. You
often can do just smp_wmb and smp_rmb and that is
very cheap.
Many operations run in the vcpu context
or start when guest exits to host and work
is bounced from there and thus no barrier is needed
here.

Example? start_xmit in e1000. Executed in vcpu context
so no barrier is needed.

virtio of course is another example since it does its own
barriers. But even without that, virtio_blk_handle_output
runs in vcpu context.

But more importantly, this hack just sweeps the
dirt under the carpet. Understanding the interaction
with guest drivers is important anyway. So
I really don't see why we don't audit devices
and add proper barriers.

-- 
MST

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH] Add a memory barrier to guest memory access functions
  2012-05-21  8:31                               ` Michael S. Tsirkin
@ 2012-05-21  8:58                                 ` Benjamin Herrenschmidt
  2012-05-21  9:07                                   ` Benjamin Herrenschmidt
  2012-05-21 22:18                                 ` [Qemu-devel] [PATCH] Add a memory barrier to guest memory access functions Anthony Liguori
  1 sibling, 1 reply; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-21  8:58 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Paolo Bonzini, qemu-devel, Anthony Liguori, David Gibson

On Mon, 2012-05-21 at 11:31 +0300, Michael S. Tsirkin wrote:
> On Mon, May 21, 2012 at 10:11:06AM +0200, Paolo Bonzini wrote:
> > On 21/05/2012 03:56, Benjamin Herrenschmidt wrote:
> > > I don't see an obvious need to provide a "relaxed" variant of the
> > > later at this stage, a quick grep doesn't seem to show that most cases
> > > where it's used are either not performance sensitive or the barrier
> > > makes sense, but feel free to prove me wrong :-)
> > 
> > The only problem here is that you have useless memory barriers when
> > calling cpu_physical_memory_map in a loop (see virtqueue_map_sg).
> > 
> > Paolo
> 
> More than that. smp_mb is pretty expensive. You
> often can do just smp_wmb and smp_rmb and that is
> very cheap.

Except that you mostly don't know at that level what you can or cannot
do, it depends on the caller. We should have the standard accessors do
it the "safe" way and have performance sensitive stuff do map/unmap, at
least that's the result of the discussions with Anthony.

If we can address the virtqueue_map_sg problem, I think we should be
good, I'll look at it tomorrow. Maybe the right way for now is to remove
the barrier I added to "map" and only leave the one in _rw

> Many operations run in the vcpu context
> or start when guest exits to host and work
> is bounced from there and thus no barrier is needed
> here.

But we don't always know it unless we start sprinkling the drivers.

> Example? start_xmit in e1000. Executed in vcpu context
> so no barrier is needed.

If it's performance sensitive, it can always use map/unmap and
hand-tuned barriers. I suspect the cost of the barrier is drowned in the
cost of the exit tho, isn't it ?

Also, while it doesn't need barriers for the read it does of the
descriptor, it still needs one barrier for the write back to it.

> virtio of course is another example since it does its own
> barriers. But even without that, virtio_blk_handle_output
> runs in vcpu context.
> 
> But more importantly, this hack just sweeps the
> dirt under the carpet. Understanding the interaction
> with guest drivers is important anyway. So
> I really don't see why don't we audit devices
> and add proper barriers.

Because my experience is that you'll never get them all right and will
keep getting more breakage as devices are added. This has always been
the case kernel-wise and I don't think qemu will do any better.

That's why Linus always insisted that our basic MMIO accessors provide
full ordering vs. each other and vs. accesses to main memory (DMA),
limiting the extent of the barriers needed in drivers, for example.

It's a small price to pay considering the risk and how nasty those bugs
are to debug. And performance critical drivers are few and can be hand
tuned.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH] Add a memory barrier to guest memory access functions
  2012-05-21  8:58                                 ` Benjamin Herrenschmidt
@ 2012-05-21  9:07                                   ` Benjamin Herrenschmidt
  2012-05-21  9:16                                     ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-21  9:07 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Paolo Bonzini, qemu-devel, Anthony Liguori, David Gibson

On Mon, 2012-05-21 at 18:58 +1000, Benjamin Herrenschmidt wrote:

> Except that you mostly don't know at that level what you can or cannot
> do, it depends on the caller. We should have the standard accessors do
> it the "safe" way and have performance sensitive stuff do map/unmap, at
> least that's the result of the discussions with Anthony.
> 
> If we can address the virtqueue_map_sg problem, I think we should be
> good, I'll look at it tomorrow. Maybe the right way for now is to remove
> the barrier I added to "map" and only leave the one in _rw

One thing that might alleviate some of your concerns would possibly be
to "remember" in a global (to be replaced with a thread var eventually)
the last transfer direction and use a simple test to choose the barrier,
ie, store + store -> wmb, load + load -> rmb, other -> mb.
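
Something like this (just a sketch, the names are made up and the
global would eventually become per-thread as mentioned):

typedef enum {
    DMA_LAST_NONE,
    DMA_LAST_READ,
    DMA_LAST_WRITE,
} DMALastDir;

static DMALastDir dma_last_dir = DMA_LAST_NONE;

static inline void dma_barrier(DMALastDir next)
{
    if (dma_last_dir == DMA_LAST_WRITE && next == DMA_LAST_WRITE) {
        smp_wmb();              /* store followed by store */
    } else if (dma_last_dir == DMA_LAST_READ && next == DMA_LAST_READ) {
        smp_rmb();              /* load followed by load */
    } else {
        smp_mb();               /* mixed or unknown: full barrier */
    }
    dma_last_dir = next;
}

cpu_physical_memory_rw() would then call
dma_barrier(is_write ? DMA_LAST_WRITE : DMA_LAST_READ) instead of a
plain smp_mb().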

But first I'd be curious if some x86 folks could actually measure the
impact of the patch as I proposed it. That would give us an idea of how
bad the performance problem is and how far we need to go to address it.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH] Add a memory barrier to guest memory access functions
  2012-05-21  9:07                                   ` Benjamin Herrenschmidt
@ 2012-05-21  9:16                                     ` Benjamin Herrenschmidt
  2012-05-21  9:34                                       ` Michael S. Tsirkin
  0 siblings, 1 reply; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-21  9:16 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Paolo Bonzini, qemu-devel, Anthony Liguori, David Gibson

On Mon, 2012-05-21 at 19:07 +1000, Benjamin Herrenschmidt wrote:

> One thing that might alleviate some of your concerns would possibly be
> to "remember" in a global (to be replaced with a thread var eventually)
> the last transfer direction and use a simple test to chose the barrier,
> ie, store + store -> wmb, load + load -> rmb, other -> mb.
> 
> But first I'd be curious if some x86 folks could actually measure the
> impact of the patch as I proposed it. That would give us an idea of how
> bad the performance problem is and how far we need to go to address it.

Another option.... go back to something more like the original patch,
ie, put the barrier in the new dma_* accessors (and provide a
non-barrier one while at it) rather than the low level cpu_physical_*
accessor.

That makes it a lot easier for selected drivers to be converted to avoid
the barrier in things like code running in the vcpu context. It also
means that virtio doesn't get any added barriers which is what we want
as well.

IE. Have something along these lines (based on the accessors added by the
iommu series) (using __ kernel-style, feel free to provide a better
naming)

static inline int __dma_memory_rw( ... args ... )
{
    if (!dma_has_iommu(dma)) {
        /* Fast-path for no IOMMU */
        cpu_physical_memory_rw( ... args ...);
        return 0;
    } else {
        return iommu_dma_memory_rw( ... args ...);
    }
}

static inline int dma_memory_rw( ... args ... )
{
	smp_mb(); /* Or use finer grained as discussed earlier */

	return __dma_memory_rw( ... args ... );
}

And corresponding __dma_memory_read/__dma_memory_write (again, feel
free to suggest a more "qemu'ish" naming if you don't like __, it's
a kernel habit, not sure what you guys do in qemu land).
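
For instance (still just a sketch, assuming __dma_memory_rw() ends up
taking (dma, addr, buf, len, dir)):

static inline int __dma_memory_read(DMAContext *dma, dma_addr_t addr,
                                    void *buf, dma_addr_t len)
{
    /* the device reads guest memory: data flows to the device */
    return __dma_memory_rw(dma, addr, buf, len, DMA_DIRECTION_TO_DEVICE);
}

static inline int dma_memory_read(DMAContext *dma, dma_addr_t addr,
                                  void *buf, dma_addr_t len)
{
    smp_mb();
    return __dma_memory_read(dma, addr, buf, len);
}

... and the same shape for __dma_memory_write / dma_memory_write with
DMA_DIRECTION_FROM_DEVICE.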

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH] Add a memory barrier to guest memory access functions
  2012-05-21  9:16                                     ` Benjamin Herrenschmidt
@ 2012-05-21  9:34                                       ` Michael S. Tsirkin
  2012-05-21  9:53                                         ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 89+ messages in thread
From: Michael S. Tsirkin @ 2012-05-21  9:34 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Paolo Bonzini, qemu-devel, Anthony Liguori, David Gibson

On Mon, May 21, 2012 at 07:16:27PM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2012-05-21 at 19:07 +1000, Benjamin Herrenschmidt wrote:
> 
> > One thing that might alleviate some of your concerns would possibly be
> > to "remember" in a global (to be replaced with a thread var eventually)
> > the last transfer direction and use a simple test to chose the barrier,
> > ie, store + store -> wmb, load + load -> rmb, other -> mb.

But how do you know the guest did a store?

> > 
> > But first I'd be curious if some x86 folks could actually measure the
> > impact of the patch as I proposed it. That would give us an idea of how
> > bad the performance problem is and how far we need to go to address it.
> 
> Another option.... go back to something more like the original patch,
> ie, put the barrier in the new dma_* accessors (and provide a
> non-barrier one while at it) rather than the low level cpu_physical_*
> accessor.
> 
> That makes it a lot easier for selected driver to be converted to avoid
> the barrier in thing like code running in the vcpu context. It also
> means that virtio doesn't get any added barriers which is what we want
> as well.
> 
> IE. Have something along the lines (based on the accessors added by the
> iommu series) (using __ kernel-style, feel free to provide a better
> naming)
> 
> static inline int __dma_memory_rw( ... args ... )
> {
>     if (!dma_has_iommu(dma)) {
>         /* Fast-path for no IOMMU */
>         cpu_physical_memory_rw( ... args ...);
>         return 0;
>     } else {
>         return iommu_dma_memory_rw( ... args ...);
>     }
> }
> 
> static inline int dma_memory_rw( ... args ... )
> {
> 	smp_mb(); /* Or use finer grained as discussied earlier */
> 
> 	return __dma_memory_rw( ... args ... )

Heh. But don't we need an mb afterwards too?

> }
> 
> And corresponding __dma_memory_read/__dma_memory_write (again, feel
> free to suggest a more "qemu'ish" naming if you don't like __, it's
> a kernel habit, not sure what you guys do in qemu land).
> 
> Cheers,
> Ben.

And my preference is to first convert everyone to __ variants and
carefully switch devices to the barrier version after a bit of
consideration.

-- 
MST

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH] Add a memory barrier to guest memory access functions
  2012-05-21  9:34                                       ` Michael S. Tsirkin
@ 2012-05-21  9:53                                         ` Benjamin Herrenschmidt
  2012-05-21 10:31                                           ` Michael S. Tsirkin
  0 siblings, 1 reply; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-21  9:53 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Paolo Bonzini, qemu-devel, Anthony Liguori, David Gibson

On Mon, 2012-05-21 at 12:34 +0300, Michael S. Tsirkin wrote:
> On Mon, May 21, 2012 at 07:16:27PM +1000, Benjamin Herrenschmidt wrote:
> > On Mon, 2012-05-21 at 19:07 +1000, Benjamin Herrenschmidt wrote:
> > 
> > > One thing that might alleviate some of your concerns would possibly be
> > > to "remember" in a global (to be replaced with a thread var eventually)
> > > the last transfer direction and use a simple test to chose the barrier,
> > > ie, store + store -> wmb, load + load -> rmb, other -> mb.
> 
> But how do you know guest did a store?

This isn't vs. guest access, but vs DMA access, ie we are ordering DMA
accesses vs. each other. The guest is still responsible for doing its own
side of the barriers as usual.

> > > But first I'd be curious if some x86 folks could actually measure the
> > > impact of the patch as I proposed it. That would give us an idea of how
> > > bad the performance problem is and how far we need to go to address it.
> > 
> > Another option.... go back to something more like the original patch,
> > ie, put the barrier in the new dma_* accessors (and provide a
> > non-barrier one while at it) rather than the low level cpu_physical_*
> > accessor.
> > 
> > That makes it a lot easier for selected driver to be converted to avoid
> > the barrier in thing like code running in the vcpu context. It also
> > means that virtio doesn't get any added barriers which is what we want
> > as well.
> > 
> > IE. Have something along the lines (based on the accessors added by the
> > iommu series) (using __ kernel-style, feel free to provide a better
> > naming)
> > 
> > static inline int __dma_memory_rw( ... args ... )
> > {
> >     if (!dma_has_iommu(dma)) {
> >         /* Fast-path for no IOMMU */
> >         cpu_physical_memory_rw( ... args ...);
> >         return 0;
> >     } else {
> >         return iommu_dma_memory_rw( ... args ...);
> >     }
> > }
> > 
> > static inline int dma_memory_rw( ... args ... )
> > {
> > 	smp_mb(); /* Or use finer grained as discussied earlier */
> > 
> > 	return __dma_memory_rw( ... args ... )
> 
> Heh. But don't we need an mb afterwards too?

Not really, no, but we can discuss the fine points; I'm pretty sure
one-before is enough as long as we ensure MSIs are properly ordered.

> > }
> > 
> > And corresponding __dma_memory_read/__dma_memory_write (again, feel
> > free to suggest a more "qemu'ish" naming if you don't like __, it's
> > a kernel habit, not sure what you guys do in qemu land).
> > 
> > Cheers,
> > Ben.
> 
> And my preference is to first convert everyone to __ variants and
> carefully switch devices to the barrier version after a bit of
> consideration.

I very strongly disagree. This is exactly the wrong approach. In pretty
much -all- cases the ordered versions are going to be safer, since they
basically provide the similar ordering semantics to what a PCI bus would
provide.

IE. Just making the default accessors ordered means that all devices
written with the assumption that the guest will see accesses in the
order they are written in the emulated device will be correct, which
means pretty much all of them (well, almost).

 --> It actually fixes a real bug that affects almost all devices
     that do DMA today in qemu

Then, fine-tuning performance critical ones by selectively removing
barriers allows improving performance where it would be otherwise
harmed.

So on that I will not compromise.

However, I think it might be better to leave the barrier in the dma
accessor since that's how you also get iommu transparency etc... so it's
not a bad place to put them, and leave the cpu_physical_* for use by
lower level device drivers which are thus responsible also for dealing
with ordering if they have to.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH] Add a memory barrier to guest memory access functions
  2012-05-21  9:53                                         ` Benjamin Herrenschmidt
@ 2012-05-21 10:31                                           ` Michael S. Tsirkin
  2012-05-21 11:45                                             ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 89+ messages in thread
From: Michael S. Tsirkin @ 2012-05-21 10:31 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Paolo Bonzini, qemu-devel, Anthony Liguori, David Gibson

On Mon, May 21, 2012 at 07:53:23PM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2012-05-21 at 12:34 +0300, Michael S. Tsirkin wrote:
> > On Mon, May 21, 2012 at 07:16:27PM +1000, Benjamin Herrenschmidt wrote:
> > > On Mon, 2012-05-21 at 19:07 +1000, Benjamin Herrenschmidt wrote:
> > > 
> > > > One thing that might alleviate some of your concerns would possibly be
> > > > to "remember" in a global (to be replaced with a thread var eventually)
> > > > the last transfer direction and use a simple test to chose the barrier,
> > > > ie, store + store -> wmb, load + load -> rmb, other -> mb.
> > 
> > But how do you know guest did a store?
> 
> This isn't vs. guest access, but vs DMA access, ie we are ordering DMA
> accesses vs. each other. The guest is still responsible for doing its own
> side of the barriers as usual.
> > > > But first I'd be curious if some x86 folks could actually measure the
> > > > impact of the patch as I proposed it. That would give us an idea of how
> > > > bad the performance problem is and how far we need to go to address it.
> > > 
> > > Another option.... go back to something more like the original patch,
> > > ie, put the barrier in the new dma_* accessors (and provide a
> > > non-barrier one while at it) rather than the low level cpu_physical_*
> > > accessor.
> > > 
> > > That makes it a lot easier for selected driver to be converted to avoid
> > > the barrier in thing like code running in the vcpu context. It also
> > > means that virtio doesn't get any added barriers which is what we want
> > > as well.
> > > 
> > > IE. Have something along the lines (based on the accessors added by the
> > > iommu series) (using __ kernel-style, feel free to provide a better
> > > naming)
> > > 
> > > static inline int __dma_memory_rw( ... args ... )
> > > {
> > >     if (!dma_has_iommu(dma)) {
> > >         /* Fast-path for no IOMMU */
> > >         cpu_physical_memory_rw( ... args ...);
> > >         return 0;
> > >     } else {
> > >         return iommu_dma_memory_rw( ... args ...);
> > >     }
> > > }
> > > 
> > > static inline int dma_memory_rw( ... args ... )
> > > {
> > > 	smp_mb(); /* Or use finer grained as discussied earlier */
> > > 
> > > 	return __dma_memory_rw( ... args ... )
> > 
> > Heh. But don't we need an mb afterwards too?
> 
> Not really no, but we can discuss the fine point, I'm pretty sure
> one-before is enough as long as we ensure MSIs are properly ordered.

Hmm. MSI injection causes IPI. So that does an SMP
barrier I think. But see below about the use of
write-combining in guest.

> > > }
> > > 
> > > And corresponding __dma_memory_read/__dma_memory_write (again, feel
> > > free to suggest a more "qemu'ish" naming if you don't like __, it's
> > > a kernel habit, not sure what you guys do in qemu land).
> > > 
> > > Cheers,
> > > Ben.
> > 
> > And my preference is to first convert everyone to __ variants and
> > carefully switch devices to the barrier version after a bit of
> > consideration.
> 
> I very strongly disagree. This is exactly the wrong approach. In pretty
> much -all- cases the ordered versions are going to be safer, since they
> basically provide ordering semantics similar to what a PCI bus would
> provide.
> 
> IE. Just making the default accessors ordered means that all devices
> written with the assumption that the guest will see accesses in the
> order they are written in the emulated device will be correct, which
> means pretty much all of them (well, almost).
> 
>  --> It actually fixes a real bug that affects almost all devices
>      that do DMA today in qemu

In theory fine but practical examples that affect x86?
We might want to at least document some of them.

wmb and rmb are nops so there's no bug in practice.
So the only actual rule which might be violated by qemu is that
read flushes out writes.
It's unlikely you will find real examples where this matters
but I'm interested to hear otherwise.


I also note that guests do use write-combining e.g. for vga.
One wonders whether stronger barriers are needed
because of that?


> Then, fine-tuning performance critical ones by selectively removing
> barriers allows improving performance where it would be otherwise
> harmed.

So that breaks attempts to bisect performance regressions.
Not good.

> So on that I will not compromise.
> 
> However, I think it might be better to leave the barrier in the dma
> accessor since that's how you also get iommu transparency etc... so it's
> not a bad place to put them, and leave the cpu_physical_* for use by
> lower level device drivers which are thus responsible also for dealing
> with ordering if they have to.
> 
> Cheers,
> Ben.

You claim to understand what matters for all devices I doubt that.

Why don't we add safe APIs, then go over devices and switch over?
I counted 97 pci_dma_ accesses.
33 in rtl, 32 in eepro100, 12 in lsi, 7 in e1000.

Let maintainers make a decision where does speed matter.


-- 
MST

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH] Add a memory barrier to guest memory access functions
  2012-05-21 10:31                                           ` Michael S. Tsirkin
@ 2012-05-21 11:45                                             ` Benjamin Herrenschmidt
  2012-05-21 12:18                                               ` Michael S. Tsirkin
  0 siblings, 1 reply; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-21 11:45 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Paolo Bonzini, qemu-devel, Anthony Liguori, David Gibson

On Mon, 2012-05-21 at 13:31 +0300, Michael S. Tsirkin wrote:

> > IE. Just making the default accessors ordered means that all devices
> > written with the assumption that the guest will see accesses in the
> > order they are written in the emulated device will be correct, which
> > means pretty much all of them (well, almost).
> > 
> >  --> It actually fixes a real bug that affects almost all devices
> >      that do DMA today in qemu
> 
> In theory fine but practical examples that affect x86?
> We might want to at least document some of them.

x86 I don't know, I suspect mostly none that have actually been hit but
I could be wrong, I'm not familiar enough with it.

I have practical examples that affect power though :-) And I'm
reasonably confident they'll affect ARM as soon as people start doing
serious SMP on it etc...

IE. The code is provably wrong without barriers.

> wmb and rmb are nops so there's no bug in practice.
> So the only actual rule which might be violated by qemu is that
> read flushes out writes.
> It's unlikely you will find real examples where this matters
> but I'm interested to hear otherwise.

wmb and rmb are not nops on powerpc and arm to name a couple.

mb is more than just "read flush writes" (besides it's not a statement
about flushing, it's a statement about ordering. whether it has a
flushing side effect on x86 is a separate issue, it doesn't on power for
example).

Real flushing out writes matters very much in real life in two very
different contexts that tend to not affect emulation in qemu as much.

One is flushing write in the opposite direction (or rather, having the
read response queued up behind those writes) which is critical to
ensuring proper completion of DMAs after an LSI from a guest driver
perspective on real HW typically.

The other classic case is to flush posted MMIO writes in order to ensure
that a subsequent delay is respected.

Most of those don't actually matter when doing emulation. Besides a
barrier won't provide you the second guarantee, you need a nastier
construct at least on some architectures like power.

However, we do need to ensure that read and writes are properly ordered
vs. each other (regardless of any "flush" semantic) or things could go
very wrong on OO architectures (here too, my understanding on x86 is
limited).

> I also note that guests do use write-combining e.g. for vga.
> One wonders whether stronger barriers are needed
> because of that?


What I'm trying to address here is really to ensure that load and stores
issued by qemu emulated devices "appear" in the right order in respect
to guest driver code running simultaneously on different CPUs (both P
and V in this context).

I've somewhat purposefully for now left alone the problem of ordering
"across" transfer directions which in the case of PCI (and most "sane"
busses) means read flushing writes in the other direction. (Here too
it's not really a flush, it's just an ordering statement between the
write and the read response).

If you want to look at that other problem more closely, it breaks down
into two parts:

 - Guest writes vs. qemu reads. My gut feeling is that this should take
care of itself for the most part, tho you might want to bracket PIO
operations with barriers for extra safety. But we might want to dig
more.

 - qemu writes vs. guest reads. Here too, this only matters as long as
the guest can observe the cat inside the box. This means the guest gets
to "observe" the writes vs. some other event that might need to be
synchronized with the read, such as an interrupt. Here my gut feeling is
that we might be safer by having a barrier before any interrupt get shot
to the guest (and a syscall/ioctl is not a memory barrier, though at
least with kvm on power, a guest entry is so we should be fine) but here
too, it might be a good idea to dig more.

However, both of these can (and probably should) be treated separately
and are rather unlikely to cause problems in practice. In most cases
these have to do with flushing non coherent DMA buffers in intermediary
bridges, and while those could be considered as akin to a store queue
that hasn't yet reached coherency on a core, in practice I think we are
fine.

So let's go back to the problem at hand, which is to make sure that load
and stores done by emulated devices get observed by the guest code in
the right order.

My preference is to do it in the dma_* accessors (with possibly
providing __relaxed versions), which means "low level" stuff using cpu_*
directly such as virtio doesn't pay the price (or rather is responsible
for doing its own barriers, which is good since typically that's our
real perf sensitive stuff).

Anthony earlier preferred all the way down into cpu_physical_* which is
what my latest round of patches did. But I can understand that this
makes people uncomfortable.

In any case, we yet have to get some measurements of the actual cost of
those barriers on x86. I can try to give it a go on my laptop next week,
but I'm not an x86 expert and don't have that much different x86 HW
around (such as large SMP machines where the cost might differ or
different generations of x86 processors etc...).

> > Then, fine-tuning performance critical ones by selectively removing
> > barriers allows improving performance where it would be otherwise
> > harmed.
> 
> So that breaks attempts to bisect performance regressions.
> Not good.

Disagreed. In all cases safety trumps performance. Almost all our
devices were written without any thought given to ordering, so they
basically can and should be considered as all broken. Since thinking
about ordering is something that, by experience, very few programmers can
do and get right, the default should absolutely be fully ordered.

Performance regressions aren't a big deal to bisect in that case: If
there's a regression for a given driver and it points to this specific
patch adding the barriers then we know precisely where the regression
come from, and we can get some insight about how this specific driver
can be improved to use more relaxed accessors.

I don't see the problem.

One thing that might be worth looking at is if indeed mb() is so much
more costly than just wmb/rmb, in which circumstances we could have some
smarts in the accessors to make them skip the full mb based on knowledge
of previous access direction, though here too I would be tempted to only
do that if absolutely necessary (ie if we can't instead just fix the
sensitive driver to use explicitly relaxed accessors).

> > So on that I will not compromise.
> > 
> > However, I think it might be better to leave the barrier in the dma
> > accessor since that's how you also get iommu transparency etc... so it's
> > not a bad place to put them, and leave the cpu_physical_* for use by
> > lower level device drivers which are thus responsible also for dealing
> > with ordering if they have to.
> > 
> > Cheers,
> > Ben.
> 
> You claim to understand what matters for all devices I doubt that.

It's pretty obvious that anything that does DMA using a classic
descriptor + buffers structure is broken without appropriate ordering.
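
To make that concrete, here is a minimal sketch of the pattern. It is
not code from any existing qemu device: every name below is made up
for illustration, and the barrier helper is only a placeholder for the
real per-arch macro.

/* Illustrative sketch only, not existing qemu code. */
#include <stdint.h>
#include <string.h>

#define FAKE_DESC_DONE 0x1          /* hypothetical "owned by driver" bit */

typedef struct {
    uint64_t buf_addr;              /* guest physical address of the buffer */
    uint32_t len;
    uint32_t status;
} FakeRxDesc;

/* Placeholder for the real store barrier (lwsync on power, dmb on ARM,
 * essentially free on x86).  A full fence is used here only to keep
 * the sketch self-contained. */
static inline void fake_smp_wmb(void)
{
    __sync_synchronize();
}

/* Placeholder for the DMA write accessor (pci_dma_write & friends). */
static void fake_dma_write(uint8_t *guest_mem, uint64_t addr,
                           const void *buf, size_t len)
{
    memcpy(guest_mem + addr, buf, len);
}

/* Device model completing one receive buffer. */
static void fake_complete_rx(uint8_t *guest_mem, uint64_t desc_addr,
                             FakeRxDesc *desc,
                             const void *payload, uint32_t len)
{
    /* 1. Write the payload into the guest buffer. */
    fake_dma_write(guest_mem, desc->buf_addr, payload, len);

    /* 2. Order the payload stores before the ownership flip.  Without
     *    this, a vcpu running on another host CPU can see the DONE bit
     *    while still reading stale payload bytes. */
    fake_smp_wmb();

    /* 3. Hand the descriptor back to the guest driver. */
    desc->len = len;
    desc->status |= FAKE_DESC_DONE;
    fake_dma_write(guest_mem, desc_addr, desc, sizeof(*desc));
}

The guest driver does the mirror image (check the DONE bit, then its
own rmb, then read the payload), which is why the device side cannot
skip its half of the pairing.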

And yes, I claim to have a fairly good idea of the problem, but I don't
think throwing credentials around is going to be helpful.
 
> Why don't we add safe APIs, then go over devices and switch over?
> I counted 97 pci_dma_ accesses.
> 33 in rtl, 32 in eepro100, 12 in lsi, 7 in e1000.
> 
> Let maintainers make a decision where does speed matter.

No. Let's fix the general bug first. Then let people who know the
individual drivers intimately and understand their access patterns make
the call as to when things can/should be relaxed.

Ben.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH] Add a memory barrier to guest memory access functions
  2012-05-21 11:45                                             ` Benjamin Herrenschmidt
@ 2012-05-21 12:18                                               ` Michael S. Tsirkin
  2012-05-21 15:16                                                 ` Paolo Bonzini
  2012-05-21 21:58                                                 ` [Qemu-devel] [PATCH] Add a memory barrier to guest memory access function Benjamin Herrenschmidt
  0 siblings, 2 replies; 89+ messages in thread
From: Michael S. Tsirkin @ 2012-05-21 12:18 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Paolo Bonzini, qemu-devel, Anthony Liguori, David Gibson

On Mon, May 21, 2012 at 09:45:58PM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2012-05-21 at 13:31 +0300, Michael S. Tsirkin wrote:
> 
> > > IE. Just making the default accessors ordered means that all devices
> > > written with the assumption that the guest will see accesses in the
> > > order they are written in the emulated device will be correct, which
> > > means pretty much all of them (well, almost).
> > > 
> > >  --> It actually fixes a real bug that affects almost all devices
> > >      that do DMA today in qemu
> > 
> > In theory fine but practical examples that affect x86?
> > We might want to at least document some of them.
> 
> x86 I don't know, I suspect mostly none that have actually been hit but
> I could be wrong, I'm not familiar enough with it.
> 
> I have practical examples that affect power though :-) And I'm
> reasonably confident they'll affect ARM as soon as people start doing
> serious SMP on it etc...
> 
> IE. The code is provably wrong without barriers.
> 
> > wmb and rmb are nops so there's no bug in practice.
> > So the only actual rule which might be violated by qemu is that
> > read flushes out writes.
> > It's unlikely you will find real examples where this matters
> > but I'm interested to hear otherwise.
> 
> wmb and rmb are not nops on powerpc and arm to name a couple.
> 
> mb is more than just "read flush writes" (besides it's not a statement
> about flushing, it's a statement about ordering. whether it has a
> flushing side effect on x86 is a separate issue, it doesn't on power for
> example).

I referred to reads not bypassing writes on PCI.
This is the real argument in my eyes: that we
should behave the way real hardware does.

> Real flushing out writes matters very much in real life in two very
> different contexts that tend to not affect emulation in qemu as much.
> 
> One is flushing write in the opposite direction (or rather, having the
> read response queued up behind those writes) which is critical to
> ensuring proper completion of DMAs after an LSI from a guest driver
> perspective on real HW typically.
> 
> The other classic case is to flush posted MMIO writes in order to ensure
> that a subsequent delay is respected.
> 
> Most of those don't actually matter when doing emulation. Besides a
> barrier won't provide you the second guarantee, you need a nastier
> construct at least on some architectures like power.

Exactly. This is what I was saying too.

> However, we do need to ensure that read and writes are properly ordered
> vs. each other (regardless of any "flush" semantic) or things could go
> very wrong on OO architectures (here too, my understanding on x86 is
> limited).

Right. Here's a compromise:
- add smp_rmb() on any DMA read
- add smp_wmb() on any DMA write
This is almost zero cost on x86 at least.
So we are not regressing existing setups.
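
As a purely illustrative sketch (the context type and the unordered
helper below stand in for the accessors from the iommu series and are
not existing qemu API), that compromise would look roughly like this:

#include <stdint.h>

typedef uint64_t fake_dma_addr_t;
typedef struct FakeDMAContext FakeDMAContext;

/* Placeholders for the kernel-style barrier macros; real code would
 * use the per-arch smp_rmb()/smp_wmb(). */
#define fake_smp_rmb() __sync_synchronize()
#define fake_smp_wmb() __sync_synchronize()

/* Placeholder for the unordered __dma_memory_rw() sketched earlier in
 * the thread; a real version would take the iommu or direct path. */
static int fake_dma_rw_relaxed(FakeDMAContext *dma, fake_dma_addr_t addr,
                               void *buf, fake_dma_addr_t len, int is_write)
{
    (void)dma; (void)addr; (void)buf; (void)len; (void)is_write;
    return 0;
}

static inline int fake_dma_read_ordered(FakeDMAContext *dma,
                                        fake_dma_addr_t addr,
                                        void *buf, fake_dma_addr_t len)
{
    /* Orders this read against earlier DMA reads; nearly free on x86,
     * a real fence on power/ARM. */
    fake_smp_rmb();
    return fake_dma_rw_relaxed(dma, addr, buf, len, 0);
}

static inline int fake_dma_write_ordered(FakeDMAContext *dma,
                                         fake_dma_addr_t addr,
                                         void *buf, fake_dma_addr_t len)
{
    /* Orders this write against earlier DMA writes.  Nothing here
     * orders a read against an earlier write, which is the
     * read-after-write case discussed just below. */
    fake_smp_wmb();
    return fake_dma_rw_relaxed(dma, addr, buf, len, 1);
}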

Are there any places where devices do read after write?
My preferred way is to find them and do pci_dma_flush() invoking
smp_mb(). If there is such a case it's likely on datapath anyway
so we do care.

But I can also live with a global flag "latest_dma_read"
and on read we could do
	if (unlikely(latest_dma_read))
		smp_mb();

if you really insist on it
though I do think it's inelegant.
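
Purely for illustration, combining that flag with the earlier
store/store -> wmb, load/load -> rmb, other -> mb selection could look
like the sketch below. Every name is made up, the barrier macros are
placeholders for the real per-arch ones, and the flag would have to
become per-thread:

#include <stdbool.h>

#define fake_smp_mb()  __sync_synchronize()
#define fake_smp_wmb() __sync_synchronize()   /* real one is cheaper */
#define fake_smp_rmb() __sync_synchronize()   /* real one is cheaper */

/* Would need to be per-thread (or otherwise protected) in real life. */
static bool fake_last_dma_was_write;

/* Pick the cheapest barrier that still orders this access against the
 * previous one: store->store only needs wmb, load->load only needs
 * rmb, everything else takes the expensive full mb. */
static inline void fake_dma_barrier(bool is_write)
{
    if (is_write && fake_last_dma_was_write) {
        fake_smp_wmb();
    } else if (!is_write && !fake_last_dma_was_write) {
        fake_smp_rmb();
    } else {
        fake_smp_mb();
    }
    fake_last_dma_was_write = is_write;
}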

> > I also note that guests do use write-combining e.g. for vga.
> > One wonders whether stronger barriers are needed
> > because of that?
> 
> 
> What I'm trying to address here is really to ensure that load and stores
> issued by qemu emulated devices "appear" in the right order in respect
> to guest driver code running simultaneously on different CPUs (both P
> and V in this context).
> 
> I've somewhat purposefully for now left alone the problem of ordering
> "across" transfer directions which in the case of PCI (and most "sane"
> busses) means read flushing writes in the other direction. (Here too
> it's not really a flush, it's just an ordering statement between the
> write and the read response).
> 
> If you want to look at that other problem more closely, it breaks down
> into two parts:
> 
>  - Guest writes vs. qemu reads. My gut feeling is that this should take
> care of itself for the most part, tho you might want to bracket PIO
> operations with barriers for extra safety. But we might want to dig
> more.
> 
>  - qemu writes vs. guest reads. Here too, this only matters as long as
> the guest can observe the cat inside the box. This means the guest gets
> to "observe" the writes vs. some other event that might need to be
> synchronized with the read, such as an interrupt. Here my gut feeling is
> that we might be safer by having a barrier before any interrupt get shot
> to the guest (and a syscall/ioctl is not a memory barrier, though at
> least with kvm on power, a guest entry is so we should be fine) but here
> too, it might be a good idea to dig more.
> 
> However, both of these can (and probably should) be treated separately
> and are rather unlikely to cause problems in practice. In most cases
> these have to do with flushing non coherent DMA buffers in intermediary
> bridges, and while those could be considered as akin to a store queue
> that hasn't yet reached coherency on a core, in practice I think we are
> fine.
> 
> So let's go back to the problem at hand, which is to make sure that load
> and stores done by emulated devices get observed by the guest code in
> the right order.
> 
> My preference is to do it in the dma_* accessors (with possibly
> providing __relaxed versions), which means "low level" stuff using cpu_*
> directly such as virtio doesn't pay the price (or rather is responsible
> for doing its own barriers, which is good since typically that's our
> real perf sensitive stuff).
> 
> Anthony earlier preferred all the way down into cpu_physical_* which is
> what my latest round of patches did. But I can understand that this
> makes people uncomfortable.
> 
> In any case, we yet have to get some measurements of the actual cost of
> those barriers on x86. I can try to give it a go on my laptop next week,
> but I'm not an x86 expert and don't have that much different x86 HW
> around (such as large SMP machines where the cost might differ or
> different generations of x86 processors etc...).
> 
> > > Then, fine-tuning performance critical ones by selectively removing
> > > barriers allows improving performance where it would be otherwise
> > > harmed.
> > 
> > So that breaks attempts to bisect performance regressions.
> > Not good.
> 
> Disagreed. In all cases safety trumps performance.

You said above x86 is unaffected. This is portability, not safety.

> Almost all our
> devices were written without any thought given to ordering, so they
> basically can and should be considered as all broken.

Problem is, a lot of code is likely broken even after you sprinkle
barriers around. For example qemu might write A then B where guest driver
expects to see B written before A.

> Since thinking
> > about ordering is something that, by experience, very few programmers can
> do and get right, the default should absolutely be fully ordered.

Give it bus ordering. That is not fully ordered.

> Performance regressions aren't a big deal to bisect in that case: If
> there's a regression for a given driver and it points to this specific
> patch adding the barriers then we know precisely where the regression
> come from, and we can get some insight about how this specific driver
> can be improved to use more relaxed accessors.
> 
> I don't see the problem.
> 
> One thing that might be worth looking at is if indeed mb() is so much
> more costly than just wmb/rmb, in which circumstances we could have some
> smarts in the accessors to make them skip the full mb based on knowledge
> of previous access direction, though here too I would be tempted to only
> do that if absolutely necessary (ie if we can't instead just fix the
> sensitive driver to use explicitly relaxed accessors).

We did this in virtio and yes it is measurable.
branches are pretty cheap though.

> > > So on that I will not compromise.
> > > 
> > > However, I think it might be better to leave the barrier in the dma
> > > accessor since that's how you also get iommu transparency etc... so it's
> > > not a bad place to put them, and leave the cpu_physical_* for use by
> > > lower level device drivers which are thus responsible also for dealing
> > > with ordering if they have to.
> > > 
> > > Cheers,
> > > Ben.
> > 
> > You claim to understand what matters for all devices I doubt that.
> 
> It's pretty obvious that anything that does DMA using a classic
> descriptor + buffers structure is broken without appropriate ordering.
> 
> And yes, I claim to have a fairly good idea of the problem, but I don't
> think throwing credentials around is going to be helpful.
>  
> > Why don't we add safe APIs, then go over devices and switch over?
> > I counted 97 pci_dma_ accesses.
> > 33 in rtl, 32 in eepro100, 12 in lsi, 7 in e1000.
> > 
> > Let maintainers make a decision where does speed matter.
> 
> No. Let's fix the general bug first. Then let people who know the
> individual drivers intimately and understand their access patterns make
> the call as to when things can/should be relaxed.
> 
> Ben.

As a maintainer of a device, if you send me a patch I can review.
If you change core APIs creating performance regressions
I don't even know what to review without wasting time debugging
and bisecting.

According to what you said you want to fix kvm on powerpc.
Good. Find a way that looks non intrusive on x86 please.

-- 
MST

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH] Add a memory barrier to guest memory access functions
  2012-05-21 12:18                                               ` Michael S. Tsirkin
@ 2012-05-21 15:16                                                 ` Paolo Bonzini
  2012-05-21 21:58                                                 ` [Qemu-devel] [PATCH] Add a memory barrier to guest memory access function Benjamin Herrenschmidt
  1 sibling, 0 replies; 89+ messages in thread
From: Paolo Bonzini @ 2012-05-21 15:16 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: qemu-devel, Anthony Liguori, David Gibson

Il 21/05/2012 14:18, Michael S. Tsirkin ha scritto:
>> > Almost all our
>> > devices were written without any thought given to ordering, so they
>> > basically can and should be considered as all broken.
> Problem is, a lot of code is likely broken even after you sprinkle
> barriers around. For example qemu might write A then B where guest driver
> expects to see B written before A.

This would be a bug in the guest driver, and usually relatively easy to
reproduce.  The specs (I know of UHCI) should be very precise on this
for obvious reasons.

Paolo

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH] Add a memory barrier to guest memory access function
  2012-05-21 12:18                                               ` Michael S. Tsirkin
  2012-05-21 15:16                                                 ` Paolo Bonzini
@ 2012-05-21 21:58                                                 ` Benjamin Herrenschmidt
  2012-05-21 22:22                                                   ` Michael S. Tsirkin
  2012-05-22  4:19                                                   ` Rusty Russell
  1 sibling, 2 replies; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-21 21:58 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Paolo Bonzini, qemu-devel, Anthony Liguori, David Gibson

On Mon, 2012-05-21 at 15:18 +0300, Michael S. Tsirkin wrote:

> > mb is more than just "read flush writes" (besides it's not a statement
> > about flushing, it's a statement about ordering. whether it has a
> > flushing side effect on x86 is a separate issue, it doesn't on power for
> > example).
> 
> I referred to reads not bypassing writes on PCI.

Again, from which originator ? From a given initiator, nothing bypasses
anything, so the right thing to do here is a full mb(). However, I
suspect what you are talking about here is read -responses- not
bypassing writes in the direction of the response (ie, the "flushing"
semantic of reads) which is a different matter. Also don't forget that
this is only a semantic of PCI, not of the system fabric, ie, a device
DMA read doesn't flush a CPU write that is still in that CPU store
queue.

> This is the real argument in my eyes: that we
> should behave the way real hardware does.

But that doesn't really make much sense since we don't actually have a
non-coherent bus sitting in the middle :-)

However we should as much as possible be observed to behave as such, I
agree, though I don't think we need to bother too much about timings
since we don't really have a way to enforce the immediate visibility of
stores within the coherent domain without a bunch of arch specific very
very heavy hammers which we really don't want to wield at this point.

> > Real flushing out writes matters very much in real life in two very
> > different contexts that tend to not affect emulation in qemu as much.
> > 
> > One is flushing write in the opposite direction (or rather, having the
> > read response queued up behind those writes) which is critical to
> > ensuring proper completion of DMAs after an LSI from a guest driver
> > perspective on real HW typically.
> > 
> > The other classic case is to flush posted MMIO writes in order to ensure
> > that a subsequent delay is respected.
> > 
> > Most of those don't actually matter when doing emulation. Besides a
> > barrier won't provide you the second guarantee, you need a nastier
> > construct at least on some architectures like power.
> 
> Exactly. This is what I was saying too.

Right and I'm reasonably sure that none of those above is our problem. 

As I said, at this point, what I want to sort out is purely the
observable ordering of DMA transactions. The side effect of reads in one
direction on writes in the other direction is an orthogonal problem
which as I wrote above is probably not hurting us.

> > However, we do need to ensure that read and writes are properly ordered
> > vs. each other (regardless of any "flush" semantic) or things could go
> > very wrong on OO architectures (here too, my understanding on x86 is
> > limited).
> 
> Right. Here's a compromise:
> - add smp_rmb() on any DMA read
> - add smp_wmb() on any DMA write
> This is almost zero cost on x86 at least.
> So we are not regressing existing setups.

And it's not correct. With that setup, DMA writes can pass DMA reads
(and vice-versa) which doesn't correspond to the guarantees of the PCI
spec. The question I suppose is whether this is a problem in practice...

> Are there any places where devices do read after write?

It's possible, things like update of a descriptor followed by reading of
the next one, etc...  I don't have an example hot in mind right now of
a device that would be hurt but I'm a bit nervous as this would be a
violation of the PCI guaranteed ordering.

> My preferred way is to find them and do pci_dma_flush() invoking
> smp_mb(). If there is such a case it's likely on datapath anyway
> so we do care.
> 
> But I can also live with a global flag "latest_dma_read"
> and on read we could do
> 	if (unlikely(latest_dma_read))
> 		smp_mb();
> 
> if you really insist on it
> though I do think it's inelegant.

Again, why do you object to simply making the default accessors fully
ordered ? Do you think it will make a measurable difference in most cases ?

Shouldn't we measure it first ?

> You said above x86 is unaffected. This is portability, not safety.

x86 is unaffected by the missing wmb/rmb, it might not be unaffected by
the missing ordering between loads and stores, I don't know, as I said,
I don't fully know the x86 memory model.

In any case, opposing "portability" to "safety" the way you do it means
you are making assumptions that basically "qemu is written for x86 and
nothing else matters".

If that's your point of view, so be it and be clear about it, but I will
disagree :-) And while I can understand that powerpc might not be
considered as the most important arch around at this point in time,
these problems are going to affect ARM as well.

> > Almost all our
> > devices were written without any thought given to ordering, so they
> > basically can and should be considered as all broken.
> 
> Problem is, a lot of code is likely broken even after you sprinkle
> barriers around. For example qemu might write A then B where guest driver
> expects to see B written before A.

No, this is totally unrelated bugs, nothing to do with barriers. You are
mixing up two completely different problems and using one as an excuse
to not fix the other one :-)

A device with the above problem would be broken today on x86 regardless.

> > Since thinking
> > > about ordering is something that, by experience, very few programmers can
> > do and get right, the default should absolutely be fully ordered.
> 
> Give it bus ordering. That is not fully ordered.

It pretty much is actually, look at your PCI spec :-)

> > Performance regressions aren't a big deal to bisect in that case: If
> > there's a regression for a given driver and it points to this specific
> > patch adding the barriers then we know precisely where the regression
> > come from, and we can get some insight about how this specific driver
> > can be improved to use more relaxed accessors.
> > 
> > I don't see the problem.
> > 
> > One thing that might be worth looking at is if indeed mb() is so much
> > more costly than just wmb/rmb, in which circumstances we could have some
> > smarts in the accessors to make them skip the full mb based on knowledge
> > of previous access direction, though here too I would be tempted to only
> > do that if absolutely necessary (ie if we can't instead just fix the
> > sensitive driver to use explicitly relaxed accessors).
> 
> > We did this in virtio and yes it is measurable.

You did it in virtio in a very hot spot on a performance critical
driver. My argument is that:

 - We can do it in a way that doesn't affect virtio at all (by using the
dma accessors instead of cpu_*)

 - Only few drivers have that kind of performance criticality and they
can be easily hand fixed.

> branches are pretty cheap though.

Depends, not always but yes, cheaper than barriers in many cases.

> > > > So on that I will not compromise.
> > > > 
> > > > However, I think it might be better to leave the barrier in the dma
> > > > accessor since that's how you also get iommu transparency etc... so it's
> > > > not a bad place to put them, and leave the cpu_physical_* for use by
> > > > lower level device drivers which are thus responsible also for dealing
> > > > with ordering if they have to.
> > > > 
> > > > Cheers,
> > > > Ben.
> > > 
> > > You claim to understand what matters for all devices I doubt that.
> > 
> > It's pretty obvious that anything that does DMA using a classic
> > descriptor + buffers structure is broken without appropriate ordering.
> > 
> > And yes, I claim to have a fairly good idea of the problem, but I don't
> > think throwing credentials around is going to be helpful.
> >  
> > > Why don't we add safe APIs, then go over devices and switch over?
> > > I counted 97 pci_dma_ accesses.
> > > 33 in rtl, 32 in eepro100, 12 in lsi, 7 in e1000.
> > > 
> > > Let maintainers make a decision where does speed matter.
> > 
> > No. Let's fix the general bug first. Then let people who know the
> > individual drivers intimately and understand their access patterns make
> > the call as to when things can/should be relaxed.
> > 
> > Ben.
> 
> As a maintainer of a device, if you send me a patch I can review.
> If you change core APIs creating performance regressions
> I don't even know what to review without wasting time debugging
> and bisecting.

Well, if you don't know then you ask on the list and others (such as
myself or a certain mst) who happens to know those issues will help that
lone clueless maintainer, seriously it's not like it's hard, and a lot
easier than just keeping everything broken and hoping we get to audit
them all properly.

> According to what you said you want to fix kvm on powerpc.
> Good. Find a way that looks non intrusive on x86 please.

And ARM. And any other OO arch.

Ben.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH] Add a memory barrier to guest memory access functions
  2012-05-21  8:31                               ` Michael S. Tsirkin
  2012-05-21  8:58                                 ` Benjamin Herrenschmidt
@ 2012-05-21 22:18                                 ` Anthony Liguori
  2012-05-21 22:26                                   ` Benjamin Herrenschmidt
  2012-05-21 22:37                                   ` [Qemu-devel] [PATCH] Add a memory barrier to guest memory access functions Michael S. Tsirkin
  1 sibling, 2 replies; 89+ messages in thread
From: Anthony Liguori @ 2012-05-21 22:18 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Paolo Bonzini, David Gibson, qemu-devel

On 05/21/2012 03:31 AM, Michael S. Tsirkin wrote:
> More than that. smp_mb is pretty expensive. You
> often can do just smp_wmb and smp_rmb and that is
> very cheap.
> Many operations run in the vcpu context
> or start when guest exits to host and work
> is bounced from there and thus no barrier is needed
> here.
>
> Example? start_xmit in e1000. Executed in vcpu context
> so no barrier is needed.
>
> virtio of course is another example since it does its own
> barriers. But even without that, virtio_blk_handle_output
> runs in vcpu context.
>
> But more importantly, this hack just sweeps the
> dirt under the carpet. Understanding the interaction
> with guest drivers is important anyway. So

But this isn't what this series is about.

This series is only attempting to make sure that writes are ordered with respect 
to other writes in main memory.

It's based on the assumption that write ordering is well defined (and typically 
strict) on most busses including PCI.  I have not confirmed this myself but I 
trust that Ben has.

So the only problem trying to be solved here is to make sure that if a write A 
is issued by the device model while it's on PCPU 0, if PCPU 1 does a write B to 
another location, and then the device model runs on PCPU 2 and does a read of 
both A and B, it will only see the new value of B if it sees the new value of A.

Whether the driver on VCPU 0 (which may be on any PCPU) also sees the write 
ordering is irrelevant.

If you want to avoid taking a barrier on every write, we can make use of map() 
and issue explicit barriers (as virtio does).
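
As a rough sketch of that pattern (the request layout and the function
name are made up; it assumes qemu's existing cpu_physical_memory_map()
and _unmap(), stb_phys() and the smp_wmb() from qemu-barrier.h, so it
is meant to sit inside qemu with the usual headers):

#include <stdint.h>
#include <string.h>

static void fake_complete_request(target_phys_addr_t buf_gpa,
                                  target_phys_addr_t status_gpa,
                                  const uint8_t *data, size_t len)
{
    target_phys_addr_t maplen = len;
    void *buf = cpu_physical_memory_map(buf_gpa, &maplen, 1 /* is_write */);

    if (!buf || maplen < len) {
        /* bounce-buffer / error handling elided from the sketch */
        if (buf) {
            cpu_physical_memory_unmap(buf, maplen, 1, 0);
        }
        return;
    }

    memcpy(buf, data, len);                       /* fill the payload */
    cpu_physical_memory_unmap(buf, maplen, 1, len);

    /* One explicit barrier placed by the device model itself, instead
     * of one hidden inside every accessor call. */
    smp_wmb();

    /* Publish the completion status only after the payload is visible. */
    stb_phys(status_gpa, 1);
}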

Regards,

Anthony Liguori

> I really don't see why don't we audit devices
> and add proper barriers.
>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH] Add a memory barrier to guest memory access function
  2012-05-21 21:58                                                 ` [Qemu-devel] [PATCH] Add a memory barrier to guest memory access function Benjamin Herrenschmidt
@ 2012-05-21 22:22                                                   ` Michael S. Tsirkin
  2012-05-21 22:56                                                     ` Benjamin Herrenschmidt
  2012-05-22  0:00                                                     ` Benjamin Herrenschmidt
  2012-05-22  4:19                                                   ` Rusty Russell
  1 sibling, 2 replies; 89+ messages in thread
From: Michael S. Tsirkin @ 2012-05-21 22:22 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Paolo Bonzini, qemu-devel, Anthony Liguori, David Gibson

On Tue, May 22, 2012 at 07:58:17AM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2012-05-21 at 15:18 +0300, Michael S. Tsirkin wrote:
> 
> > > mb is more than just "read flush writes" (besides it's not a statement
> > > about flushing, it's a statement about ordering. whether it has a
> > > flushing side effect on x86 is a separate issue, it doesn't on power for
> > > example).
> > 
> > I referred to reads not bypassing writes on PCI.
> 
> Again, from which originator ? From a given initiator, nothing bypasses
> anything, so the right thing to do here is a full mb(). However, I
> suspect what you are talking about here is read -responses- not
> bypassing writes in the direction of the response (ie, the "flushing"
> semantic of reads) which is a different matter.

No. My spec says:
A3, A4
A Posted Request must be able to pass Non-Posted Requests to avoid
deadlocks.


> Also don't forget that
> this is only a semantic of PCI, not of the system fabric, ie, a device
> DMA read doesn't flush a CPU write that is still in that CPU store
> queue.

We need to emulate what hardware does IMO.
So if usb has different rules it needs different barriers.


> > This is the real argument in my eyes: that we
> > should behave the way real hardware does.
> 
> But that doesn't really make much sense since we don't actually have a
> non-coherent bus sitting in the middle :-)
> 
> However we should as much as possible be observed to behave as such, I
> agree, though I don't think we need to bother too much about timings
> since we don't really have a way to enforce the immediate visibility of
> stores within the coherent domain without a bunch of arch specific very
> very heavy hammers which we really don't want to wield at this point.
> 
> > > Real flushing out writes matters very much in real life in two very
> > > different contexts that tend to not affect emulation in qemu as much.
> > > 
> > > One is flushing write in the opposite direction (or rather, having the
> > > read response queued up behind those writes) which is critical to
> > > ensuring proper completion of DMAs after an LSI from a guest driver
> > > perspective on real HW typically.
> > > 
> > > The other classic case is to flush posted MMIO writes in order to ensure
> > > that a subsequent delay is respected.
> > > 
> > > Most of those don't actually matter when doing emulation. Besides a
> > > barrier won't provide you the second guarantee, you need a nastier
> > > construct at least on some architectures like power.
> > 
> > Exactly. This is what I was saying too.
> 
> Right and I'm reasonably sure that none of those above is our problem. 
> 
> As I said, at this point, what I want to sort out is purely the
> observable ordering of DMA transactions. The side effect of reads in one
> direction on writes in the other direction is an orthogonal problem
> which as I wrote above is probably not hurting us.
> 
> > > However, we do need to ensure that read and writes are properly ordered
> > > vs. each other (regardless of any "flush" semantic) or things could go
> > > very wrong on OO architectures (here too, my understanding on x86 is
> > > limited).
> > 
> > Right. Here's a compromise:
> > - add smp_rmb() on any DMA read
> > - add smp_wmb() on any DMA write
> > This is almost zero cost on x86 at least.
> > So we are not regressing existing setups.
> 
> And it's not correct. With that setup, DMA writes can pass DMA reads
> (and vice-versa) which doesn't correspond to the guarantees of the PCI
> spec.

Cite the spec please. Express spec matches this at least.

> The question I suppose is whether this is a problem in practice...
> 
> > Are there any places where devices do read after write?
> 
> It's possible, things like update of a descriptor followed by reading of
> the next one, etc...  I don't have an example hot in mind right now of
> a device that would be hurt but I'm a bit nervous as this would be a
> violation of the PCI guaranteed ordering.
> 
> > My preferred way is to find them and do pci_dma_flush() invoking
> > smp_mb(). If there is such a case it's likely on datapath anyway
> > so we do care.
> > 
> > But I can also live with a global flag "latest_dma_read"
> > and on read we could do
> > 	if (unlikely(latest_dma_read))
> > 		smp_mb();
> > 
> > if you really insist on it
> > though I do think it's inelegant.
> 
> Again, why do you object to simply making the default accessors fully
> ordered ? Do you think it will make a measurable difference in most cases ?
> 
> Shouldn't we measure it first ?

It's a lot of work. We measured the effect for virtio in
the past. I don't think we need to redo it.

> > You said above x86 is unaffected. This is portability, not safety.
> 
> x86 is unaffected by the missing wmb/rmb, it might not be unaffected by
> the missing ordering between loads and stores, I don't know, as I said,
> I don't fully know the x86 memory model.

You don't need to understand it. Assume memory-barriers.h is correct.

> In any case, opposing "portability" to "safety" the way you do it means
> you are making assumptions that basically "qemu is written for x86 and
> nothing else matters".

No. But find a way to fix power without hurting working setups.

> If that's your point of view, so be it and be clear about it, but I will
> disagree :-) And while I can understand that powerpc might not be
> considered as the most important arch around at this point in time,
> these problems are going to affect ARM as well.
> 
> > > Almost all our
> > > devices were written without any thought given to ordering, so they
> > > basically can and should be considered as all broken.
> > 
> > Problem is, a lot of code is likely broken even after you sprinkle
> > barriers around. For example qemu might write A then B where guest driver
> > expects to see B written before A.
> 
> No, this is totally unrelated bugs, nothing to do with barriers. You are
> mixing up two completely different problems and using one as an excuse
> to not fix the other one :-)
> 
> A device with the above problem would be broken today on x86 regardless.
> 
> > > Since thinking
> > > about ordering is something that, by experience, very few programmer can
> > > do and get right, the default should absolutely be fully ordered.
> > 
> > Give it bus ordering. That is not fully ordered.
> 
> It pretty much is actually, look at your PCI spec :-)

I looked. 2.4.1.  Transaction Ordering Rules

> > > Performance regressions aren't a big deal to bisect in that case: If
> > > there's a regression for a given driver and it points to this specific
> > > patch adding the barriers then we know precisely where the regression
> > > come from, and we can get some insight about how this specific driver
> > > can be improved to use more relaxed accessors.
> > > 
> > > I don't see the problem.
> > > 
> > > One thing that might be worth looking at is if indeed mb() is so much
> > > more costly than just wmb/rmb, in which circumstances we could have some
> > > smarts in the accessors to make them skip the full mb based on knowledge
> > > of previous access direction, though here too I would be tempted to only
> > > do that if absolutely necessary (ie if we can't instead just fix the
> > > sensitive driver to use explicitly relaxed accessors).
> > 
> > We did this in virtio and yes it is measurable.
> 
> You did it in virtio in a very hot spot on a performance critical
> driver. My argument is that:
> 
>  - We can do it in a way that doesn't affect virtio at all (by using the
> dma accessors instead of cpu_*)
> 
>  - Only few drivers have that kind of performance criticality and they
> can be easily hand fixed.
> 
> > branches are pretty cheap though.
> 
> Depends, not always but yes, cheaper than barriers in many cases.
> 
> > > > > So on that I will not compromise.
> > > > > 
> > > > > However, I think it might be better to leave the barrier in the dma
> > > > > accessor since that's how you also get iommu transparency etc... so it's
> > > > > not a bad place to put them, and leave the cpu_physical_* for use by
> > > > > lower level device drivers which are thus responsible also for dealing
> > > > > with ordering if they have to.
> > > > > 
> > > > > Cheers,
> > > > > Ben.
> > > > 
> > > > You claim to understand what matters for all devices I doubt that.
> > > 
> > > It's pretty obvious that anything that does DMA using a classic
> > > descriptor + buffers structure is broken without appropriate ordering.
> > > 
> > > And yes, I claim to have a fairly good idea of the problem, but I don't
> > > think throwing credentials around is going to be helpful.
> > >  
> > > > Why don't we add safe APIs, then go over devices and switch over?
> > > > I counted 97 pci_dma_ accesses.
> > > > 33 in rtl, 32 in eepro100, 12 in lsi, 7 in e1000.
> > > > 
> > > > Let maintainers make a decision where does speed matter.
> > > 
> > > No. Let's fix the general bug first. Then let people who know the
> > > individual drivers intimately and understand their access patterns make
> > > the call as to when things can/should be relaxed.
> > > 
> > > Ben.
> > 
> > As a maintainer of a device, if you send me a patch I can review.
> > If you change core APIs creating performance regressions
> > I don't even know what to review without wasting time debugging
> > and bisecting.
> 
> Well, if you don't know then you ask on the list and others (such as
> myself or a certain mst) who happens to know those issues will help that
> lone clueless maintainer, seriously it's not like it's hard, and a lot
> easier than just keeping everything broken and hoping we get to audit
> them all properly.
> 
> > According to what you said you want to fix kvm on powerpc.
> > Good. Find a way that looks non intrusive on x86 please.
> 
> And ARM. And any other OO arch.
> 
> Ben.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH] Add a memory barrier to guest memory access functions
  2012-05-21 22:18                                 ` [Qemu-devel] [PATCH] Add a memory barrier to guest memory access functions Anthony Liguori
@ 2012-05-21 22:26                                   ` Benjamin Herrenschmidt
  2012-05-21 22:31                                     ` Anthony Liguori
  2012-05-21 22:37                                   ` [Qemu-devel] [PATCH] Add a memory barrier to guest memory access functions Michael S. Tsirkin
  1 sibling, 1 reply; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-21 22:26 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Paolo Bonzini, David Gibson, qemu-devel, Michael S. Tsirkin

On Mon, 2012-05-21 at 17:18 -0500, Anthony Liguori wrote:
> But this isn't what this series is about.
> 
> This series is only attempting to make sure that writes are ordered
> with respect 
> to other writes in main memory.

Actually, it applies to both reads and writes. They can't pass each
other either and that can be fairly important.

It's in fact the main contention point because if it was only writes we
could just use wmb and be done with it (that's a nop on x86).

Because we are trying to order everything (and specifically store
followed by a load), we need a full barrier which is more expensive on
x86.

Ben.

> It's based on the assumption that write ordering is well defined (and
> typically 
> strict) on most busses including PCI.  I have not confirmed this
> myself but I 
> trust that Ben has.
> 
> So the only problem trying to be solved here is to make sure that if a
> write A 
> is issued by the device model while it's on PCPU 0, if PCPU 1 does a
> write B to 
> another location, and then the device model runs on PCPU 2 and does a
> read of 
> both A and B, it will only see the new value of B if it sees the 
> new value of A.
> 
> Whether the driver on VCPU 0 (which may be on any PCPU) also sees the
> write 
> ordering is irrelevant.
> 
> If you want to avoid taking a barrier on every write, we can make use
> of map() 
> and issue explicit barriers (as virtio does).
> 
> 

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH] Add a memory barrier to guest memory access functions
  2012-05-21 22:26                                   ` Benjamin Herrenschmidt
@ 2012-05-21 22:31                                     ` Anthony Liguori
  2012-05-21 22:44                                       ` Michael S. Tsirkin
  0 siblings, 1 reply; 89+ messages in thread
From: Anthony Liguori @ 2012-05-21 22:31 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Paolo Bonzini, David Gibson, qemu-devel, Michael S. Tsirkin

On 05/21/2012 05:26 PM, Benjamin Herrenschmidt wrote:
> On Mon, 2012-05-21 at 17:18 -0500, Anthony Liguori wrote:
>> But this isn't what this series is about.
>>
>> This series is only attempting to make sure that writes are ordered
>> with respect
>> to other writes in main memory.
>
> Actually, it applies to both reads and writes. They can't pass each
> other either and that can be fairly important.

That's fine but that's a detail of the bus.

> It's in fact the main contention point because if it was only writes we
> could just use wmb and be done with it (that's a nop on x86).
>
> Because we are trying to order everything (and specifically store
> followed by a load), we need a full barrier which is more expensive on
> x86.

I think the thing to do is make the barrier implemented in the dma API and allow 
it to be overridden by the bus.  The default implementation should be a full 
barrier.

If we can establish that the bus guarantees a weaker ordering guarantee, a bus 
could override the default implementation and do something weaker.
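
Purely as an illustration of the shape this could take (every name
below is made up, and the extra indirection is exactly what Michael
objects to a few messages further down):

/* Purely illustrative; all names made up. */
typedef struct FakeDMABus FakeDMABus;

struct FakeDMABus {
    /* NULL means "use the safe default", i.e. a full barrier. */
    void (*barrier)(FakeDMABus *bus);
};

#define fake_smp_mb() __sync_synchronize()

static inline void fake_dma_barrier(FakeDMABus *bus)
{
    if (bus && bus->barrier) {
        bus->barrier(bus);       /* bus-specific, possibly weaker */
    } else {
        fake_smp_mb();           /* default: full ordering */
    }
}

One alternative, closer to what Michael asks for below, would be a
per-bus flags word checked inline rather than a callback.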

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH] Add a memory barrier to guest memory access functions
  2012-05-21 22:18                                 ` [Qemu-devel] [PATCH] Add a memory barrier to guest memory access functions Anthony Liguori
  2012-05-21 22:26                                   ` Benjamin Herrenschmidt
@ 2012-05-21 22:37                                   ` Michael S. Tsirkin
  1 sibling, 0 replies; 89+ messages in thread
From: Michael S. Tsirkin @ 2012-05-21 22:37 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Paolo Bonzini, David Gibson, qemu-devel

On Mon, May 21, 2012 at 05:18:21PM -0500, Anthony Liguori wrote:
> On 05/21/2012 03:31 AM, Michael S. Tsirkin wrote:
> >More than that. smp_mb is pretty expensive. You
> >often can do just smp_wmb and smp_rmb and that is
> >very cheap.
> >Many operations run in the vcpu context
> >or start when guest exits to host and work
> >is bounced from there and thus no barrier is needed
> >here.
> >
> >Example? start_xmit in e1000. Executed in vcpu context
> >so no barrier is needed.
> >
> >virtio of course is another example since it does its own
> >barriers. But even without that, virtio_blk_handle_output
> >runs in vcpu context.
> >
> >But more importantly, this hack just sweeps the
> >dirt under the carpet. Understanding the interaction
> >with guest drivers is important anyway. So
> 
> But this isn't what this series is about.
> 
> This series is only attempting to make sure that writes are ordered
> with respect to other writes in main memory.

Then it should use smp_wmb() not smp_mb().
I would be 100% fine with that.

> It's based on the assumption that write ordering is well defined
> (and typically strict) on most busses including PCI.  I have not
> confirmed this myself but I trust that Ben has.
> 
> So the only problem trying to be solved here is to make sure that if
> a write A is issued by the device model while it's on PCPU 0, if
> PCPU 1 does a write B to another location, and then the device model
> runs on PCPU 2 and does a read of both A and B, it will only see the
> new value of B if it sees the new value of A.
> 
> Whether the driver on VCPU 0 (which may be on any PCPU) also sees
> the write ordering is irrelevant.
> 
> If you want to avoid taking a barrier on every write, we can make
> use of map() and issue explicit barriers (as virtio does).
> 
> Regards,
> 
> Anthony Liguori
> 
> >I really don't see why don't we audit devices
> >and add proper barriers.
> >

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH] Add a memory barrier to guest memory access functions
  2012-05-21 22:31                                     ` Anthony Liguori
@ 2012-05-21 22:44                                       ` Michael S. Tsirkin
  2012-05-21 23:02                                         ` Benjamin Herrenschmidt
  2012-05-22  4:34                                         ` [Qemu-devel] [PATCH] Add a memory barrier to DMA functions Benjamin Herrenschmidt
  0 siblings, 2 replies; 89+ messages in thread
From: Michael S. Tsirkin @ 2012-05-21 22:44 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: David Gibson, qemu-devel, Paolo Bonzini

On Mon, May 21, 2012 at 05:31:06PM -0500, Anthony Liguori wrote:
> On 05/21/2012 05:26 PM, Benjamin Herrenschmidt wrote:
> >On Mon, 2012-05-21 at 17:18 -0500, Anthony Liguori wrote:
> >>But this isn't what this series is about.
> >>
> >>This series is only attempting to make sure that writes are ordered
> >>with respect
> >>to other writes in main memory.
> >
> >Actually, it applies to both reads and writes. They can't pass each
> >other either and that can be fairly important.
> 
> That's fine but that's a detail of the bus.
> 
> >It's in fact the main contention point because if it was only writes we
> >could just use wmb and be done with it (that's a nop on x86).
> >
> >Because we are trying to order everything (and specifically store
> >followed by a load), we need a full barrier which is more expensive on
> >x86.
> 
> I think the thing to do is make the barrier implemented in the dma
> API and allow it to be overridden by the bus.  The default
> implementation should be a full barrier.

I think what's called for is what Ben proposed: track
last transaction and use the appropriate barrier.

> If we can establish that the bus guarantees a weaker ordering
> guarantee, a bus could override the default implementation and do
> something weaker.
> 
> Regards,
> 
> Anthony Liguori

OK. Just not another level of indirect function callbacks please.  Make
it a library so each bus can do the right thing.  There are not so many
buses.

-- 
MST

^ permalink raw reply	[flat|nested] 89+ messages in thread
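
As a sketch of the "track the last transaction" idea above: the full
barrier would only be paid on a device-write-followed-by-device-read
transition. The last_dir field is hypothetical (it does not exist in
the posted patches), the smp_*() helpers are the ones already discussed
in this thread, and whether the lighter barriers are really sufficient
would still need to be argued per architecture.

/* Sketch only.  'last_dir' would be a new per-context field recording
 * the direction of the previous transfer.  A full smp_mb() is issued
 * only when a device read follows a device write (the store->load
 * case that x86 can reorder); other transitions get the cheaper
 * smp_wmb()/smp_rmb(), which are nops on x86.
 */
static inline void dma_barrier_tracked(DMAContext *dma, DMADirection dir)
{
    bool is_write, last_was_write;

    if (!dma) {
        smp_mb();           /* no context to track state in: stay safe */
        return;
    }
    is_write = (dir == DMA_DIRECTION_FROM_DEVICE);
    last_was_write = (dma->last_dir == DMA_DIRECTION_FROM_DEVICE);

    if (!is_write && last_was_write) {
        smp_mb();
    } else if (is_write) {
        smp_wmb();
    } else {
        smp_rmb();
    }
    dma->last_dir = dir;
}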

* Re: [Qemu-devel] [PATCH] Add a memory barrier to guest memory access function
  2012-05-21 22:22                                                   ` Michael S. Tsirkin
@ 2012-05-21 22:56                                                     ` Benjamin Herrenschmidt
  2012-05-22  5:11                                                       ` Michael S. Tsirkin
  2012-05-22  0:00                                                     ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-21 22:56 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Paolo Bonzini, qemu-devel, Anthony Liguori, David Gibson

On Tue, 2012-05-22 at 01:22 +0300, Michael S. Tsirkin wrote:
> > Again, from which originator ? From a given initiator, nothing
> > bypasses anything, so the right thing to do here is a full mb().
> > However, I suspect what you are talking about here is read
> > -responses- not bypassing writes in the direction of the response
> > (ie, the "flushing" semantic of reads) which is a different matter.
> 
> No. My spec says:
> A3, A4
> A Posted Request must be able to pass Non-Posted Requests to avoid
> deadlocks.

Right, a read + write can become write + read at the target, I forgot
about that, or you can deadlock due to the flush semantics, but a write
+ read must remain in order or am I missing something ?

And write + read afaik is typically the one that x86 can re-order
without a barrier isn't it ?

> > Also don't forget that
> > this is only a semantic of PCI, not of the system fabric, ie, a
> > device DMA read doesn't flush a CPU write that is still in that CPU
> > store queue.
> 
> We need to emulate what hardware does IMO.
> So if usb has different rules it needs a different barriers.

Who talks about USB here ? Whatever rules USB has only matter between
the USB device and the USB controller emulation, USB doesn't sit
directly on the memory bus doing DMA, it all goes through the HCI, which
adheres to the ordering rules of whatever bus it sits on.

Here, sanity must prevail :-) I suggest ordering by default...

> > And it's not correct. With that setup, DMA writes can pass DMA reads
> > (and vice-versa) which doesn't correspond to the guarantees of the
> > PCI spec.
> 
> Cite the spec please. Express spec matches this at least.

Sure, see above. Yes, I did forget that a read + write could be
re-ordered on PCI but that isn't the case for a write + read, or am I
reading the table sideways ?

> It's a lot of work. We measured the effect for virtio in
> the past. I don't think we need to redo it.

virtio is specifically our high-performance case, and what I'm proposing
doesn't affect it.

> > > You said above x86 is unaffected. This is portability, not safety.
> > 
> > x86 is unaffected by the missing wmb/rmb, it might not be unaffected
> > by the missing ordering between loads and stores, I don't know, as I
> > said, I don't fully know the x86 memory model.
> 
> You don't need to understand it. Assume memory-barriers.h is correct.

In which case we still need a full mb() unless we can convince ourselves
that the ordering between a write and a subsequent read can be relaxed
safely and I'm really not sure about it.

> > In any case, opposing "portability" to "safety" the way you do it
> > means you are making assumptions that basically "qemu is written for
> > x86 and nothing else matters".
> 
> No. But find a way to fix power without hurting working setups.

And ARM ;-)

Arguably x86 is wrong too anyway, at least from a strict interpretation
of the spec (and unless I missed something).

> > If that's your point of view, so be it and be clear about it, but I
> > will disagree :-) And while I can understand that powerpc might not
> > be considered as the most important arch around at this point in
> > time, these problems are going to affect ARM as well.
> > 
> > > > Almost all our
> > > > devices were written without any thought given to ordering, so
> they
> > > > basically can and should be considered as all broken.
> > > 
> > > Problem is, a lot of code is likely broken even after you sprinkle
> > > barriers around. For example qemu might write A then B where guest
> driver
> > > expects to see B written before A.
> > 
> > No, this is totally unrelated bugs, nothing to do with barriers. You
> are
> > mixing up two completely different problems and using one as an
> excuse
> > to not fix the other one :-)
> > 
> > A device with the above problem would be broken today on x86
> regardless.
> > 
> > > > Since thinking
> > > > about ordering is something that, by experience, very few
> programmer can
> > > > do and get right, the default should absolutely be fully
> ordered.
> > > 
> > > Give it bus ordering. That is not fully ordered.
> > 
> > It pretty much is actually, look at your PCI spec :-)
> 
> I looked. 2.4.1.  Transaction Ordering Rules
> 
> > > > Performance regressions aren't a big deal to bisect in that
> case: If
> > > > there's a regression for a given driver and it points to this
> specific
> > > > patch adding the barriers then we know precisely where the
> regression
> > > > come from, and we can get some insight about how this specific
> driver
> > > > can be improved to use more relaxed accessors.
> > > > 
> > > > I don't see the problem.
> > > > 
> > > > One thing that might be worth looking at is if indeed mb() is so
> much
> > > > more costly than just wmb/rmb, in which circumstances we could
> have some
> > > > smarts in the accessors to make them skip the full mb based on
> knowledge
> > > > of previous access direction, though here too I would be tempted
> to only
> > > > do that if absolutely necessary (ie if we can't instead just fix
> the
> > > > sensitive driver to use explicitly relaxed accessors).
> > > 
> > > We did this in virtio and yes it is measureable.
> > 
> > You did it in virtio in a very hot spot on a performance critical
> > driver. My argument is that:
> > 
> >  - We can do it in a way that doesn't affect virtio at all (by using
> the
> > dma accessors instead of cpu_*)
> > 
> >  - Only few drivers have that kind of performance criticality and
> they
> > can be easily hand fixed.
> > 
> > > branches are pretty cheap though.
> > 
> > Depends, not always but yes, cheaper than barriers in many cases.
> > 
> > > > > > So on that I will not compromise.
> > > > > > 
> > > > > > However, I think it might be better to leave the barrier in
> the dma
> > > > > > accessor since that's how you also get iommu transparency
> etc... so it's
> > > > > > not a bad place to put them, and leave the cpu_physical_*
> for use by
> > > > > > lower level device drivers which are thus responsible also
> for dealing
> > > > > > with ordering if they have to.
> > > > > > 
> > > > > > Cheers,
> > > > > > Ben.
> > > > > 
> > > > > You claim to understand what matters for all devices I doubt
> that.
> > > > 
> > > > It's pretty obvious that anything that does DMA using a classic
> > > > descriptor + buffers structure is broken without appropriate
> ordering.
> > > > 
> > > > And yes, I claim to have a fairly good idea of the problem, but
> I don't
> > > > think throwing credentials around is going to be helpful.
> > > >  
> > > > > Why don't we add safe APIs, then go over devices and switch
> over?
> > > > > I counted 97 pci_dma_ accesses.
> > > > > 33 in rtl, 32 in eepro100, 12 in lsi, 7 in e1000.
> > > > > 
> > > > > Let maintainers make a decision where does speed matter.
> > > > 
> > > > No. Let's fix the general bug first. Then let's people who know
> the
> > > > individual drivers intimately and understand their access
> patterns make
> > > > the call as to when things can/should be relaxed.
> > > > 
> > > > Ben.
> > > 
> > > As a maintainer of a device, if you send me a patch I can review.
> > > If you change core APIs creating performance regressions
> > > I don't even know what to review without wasting time debugging
> > > and bisecting.
> > 
> > Well, if you don't know then you ask on the list and others (such as
> > myself or a certain mst) who happens to know those issues will help
> that
> > lone clueless maintainer, seriously it's not like it's hard, and a
> lot
> > easier than just keeping everything broken and hoping we get to
> audit
> > them all properly.
> > 
> > > According to what you said you want to fix kvm on powerpc.
> > > Good. Find a way that looks non intrusive on x86 please.
> > 
> > And ARM. And any other OO arch.
> > 
> > Ben.
> 
> 
> 

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH] Add a memory barrier to guest memory access functions
  2012-05-21 22:44                                       ` Michael S. Tsirkin
@ 2012-05-21 23:02                                         ` Benjamin Herrenschmidt
  2012-05-22  4:34                                         ` [Qemu-devel] [PATCH] Add a memory barrier to DMA functions Benjamin Herrenschmidt
  1 sibling, 0 replies; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-21 23:02 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Paolo Bonzini, qemu-devel, Anthony Liguori, David Gibson

On Tue, 2012-05-22 at 01:44 +0300, Michael S. Tsirkin wrote:
> 
> OK. Just not another level of indirect function callbacks please.  Make
> it a library so each bus can do the right thing.  There are not so many
> buses. 

I think we are a long way from having to deal with subtly different
ordering rules; at this stage I'm keen on just sticking it in the inline
dma_ accessor, period.

We can always have flags in the dma context to indicate the sort of
ordering, avoids indirect pointer calls which can be costly (often
mispredicted).

Ben.

^ permalink raw reply	[flat|nested] 89+ messages in thread
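
A sketch of what "flags in the dma context" could look like; the
DMA_CONTEXT_WEAK_ORDERING flag and the flags field are invented names,
purely to show that no indirect function call is needed:

/* Sketch only.  A bus that can legitimately promise weaker ordering
 * would set a flag in the DMAContext it hands to its devices; the
 * barrier helper then picks the barrier with a plain (and usually
 * well-predicted) test rather than an indirect call.
 */
static inline void dma_barrier_flags(DMAContext *dma, DMADirection dir)
{
    if (dma && (dma->flags & DMA_CONTEXT_WEAK_ORDERING)) {
        smp_wmb();          /* bus only needs store-store ordering */
    } else {
        smp_mb();           /* default: full ordering */
    }
}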

* Re: [Qemu-devel] [PATCH] Add a memory barrier to guest memory access function
  2012-05-21 22:22                                                   ` Michael S. Tsirkin
  2012-05-21 22:56                                                     ` Benjamin Herrenschmidt
@ 2012-05-22  0:00                                                     ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-22  0:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Paolo Bonzini, qemu-devel, Anthony Liguori, David Gibson

On Tue, 2012-05-22 at 01:22 +0300, Michael S. Tsirkin wrote:
> > Again, from which originator ? From a given initiator, nothing
> > bypasses anything, so the right thing to do here is a full mb().
> > However, I suspect what you are talking about here is read
> > -responses- not bypassing writes in the direction of the response
> > (ie, the "flushing" semantic of reads) which is a different matter.
> 
> No. My spec says:
> A3, A4
> A Posted Request must be able to pass Non-Posted Requests to avoid
> deadlocks. 

An additional note about that one: It only applies obviously if the
initiator doesn't wait for the read response before shooting the write.

Most devices actually do wait. Here too, we would have to explicitly
work out what the right semantics are on an individual device basis.

Ben.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH] Add a memory barrier to guest memory access function
  2012-05-21 21:58                                                 ` [Qemu-devel] [PATCH] Add a memory barrier to guest memory access function Benjamin Herrenschmidt
  2012-05-21 22:22                                                   ` Michael S. Tsirkin
@ 2012-05-22  4:19                                                   ` Rusty Russell
  1 sibling, 0 replies; 89+ messages in thread
From: Rusty Russell @ 2012-05-22  4:19 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Michael S. Tsirkin
  Cc: Paolo Bonzini, qemu-devel, Anthony Liguori, David Gibson

On Tue, 22 May 2012 07:58:17 +1000, Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> On Mon, 2012-05-21 at 15:18 +0300, Michael S. Tsirkin wrote:
> > But I can also live with a global flag "latest_dma_read"
> > and on read we could do
> > 	if (unlikely(latest_dma_read))
> > 		smp_mb();
> > 
> > if you really insist on it
> > though I do think it's inelegant.
> 
> Again, why do you object on simply making the default accessors fully
> ordered ? Do you think it will be a measurable different in most cases ?
> 
> Shouldn't we measure it first ?

Yes.  It seems clear to me that qemu's default DMA operations should be
strictly ordered.  It's just far easier to get right.

After that, we can get tricky with conditional barriers, and we can get
tricky with using special unordered variants in critical drivers, but I
really don't want to be chasing subtle SMP ordering problems in
production.

If you're working on ARM or PPC, "it works for x86" makes it *worse*,
not better, since you have fewer users to find bugs.

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* [Qemu-devel] [PATCH] Add a memory barrier to DMA functions
  2012-05-21 22:44                                       ` Michael S. Tsirkin
  2012-05-21 23:02                                         ` Benjamin Herrenschmidt
@ 2012-05-22  4:34                                         ` Benjamin Herrenschmidt
  2012-05-22  4:51                                           ` Benjamin Herrenschmidt
  2012-05-22  7:17                                           ` Benjamin Herrenschmidt
  1 sibling, 2 replies; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-22  4:34 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Paolo Bonzini, Michael S. Tsirkin, qemu-devel, David Gibson

The emulated devices can run simultaneously with the guest, so
we need to be careful with the ordering of loads and stores done
by them to the guest system memory, which need to be observed in
the right order by the guest operating system.

This adds a barrier call to the basic DMA read/write ops. It is
currently implemented as a smp_mb(), but could later be improved
for more fine-grained control of barriers.

Additionally, a _relaxed() variant of the accessors is provided
to easily convert devices that are performance sensitive and
would be negatively impacted by the change.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---

So here's the latest try :-) I've kept it simple, I don't
add anything to map/unmap at this stage, so we might still
have a problem with drivers that do that a lot without any
explicit barrier (ahci ?). My preference is to also add
barriers to map/unmap by default but we can discuss it.

Note that I've put the barrier in an inline "helper" so
we can nag about the type of barriers, or try to be smart,
or use flags in the DMAContext or whatever we want reasonably
easily as we have a single spot to modify.

I'm now going to see if I can measure a performance hit on
my x86 laptop, but if somebody who has existing x86 guest setups
wants to help, that would be much welcome :-)

 dma.h |   54 ++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 files changed, 52 insertions(+), 2 deletions(-)

diff --git a/dma.h b/dma.h
index f1fcb71..0d57e50 100644
--- a/dma.h
+++ b/dma.h
@@ -13,6 +13,7 @@
 #include <stdio.h>
 #include "hw/hw.h"
 #include "block.h"
+#include "kvm.h"
 
 typedef struct DMAContext DMAContext;
 typedef struct ScatterGatherEntry ScatterGatherEntry;
@@ -70,6 +71,30 @@ typedef struct DMAContext {
     DMAUnmapFunc *unmap;
 } DMAContext;
 
+static inline void dma_barrier(DMAContext *dma, DMADirection dir)
+{
+    /*
+     * This is called before DMA read and write operations
+     * unless the _relaxed form is used and is responsible
+     * for providing some sane ordering of accesses vs
+     * concurrently running VCPUs.
+     *
+     * Users of map(), unmap() or lower level st/ld_*
+     * operations are responsible for providing their own
+     * ordering via barriers.
+     *
+     * This primitive implementation does a simple smp_mb()
+     * before each operation which provides pretty much full
+     * ordering.
+     *
+     * A smarter implementation can be devised if needed to
+     * use lighter barriers based on the direction of the
+     * transfer, the DMA context, etc...
+     */
+    if (kvm_enabled())
+        smp_mb();
+}
+
 static inline bool dma_has_iommu(DMAContext *dma)
 {
     return !!dma;
@@ -93,8 +118,9 @@ static inline bool dma_memory_valid(DMAContext *dma,
 
 int iommu_dma_memory_rw(DMAContext *dma, dma_addr_t addr,
                         void *buf, dma_addr_t len, DMADirection dir);
-static inline int dma_memory_rw(DMAContext *dma, dma_addr_t addr,
-                                void *buf, dma_addr_t len, DMADirection dir)
+static inline int dma_memory_rw_relaxed(DMAContext *dma, dma_addr_t addr,
+                                        void *buf, dma_addr_t len,
+                                        DMADirection dir)
 {
     if (!dma_has_iommu(dma)) {
         /* Fast-path for no IOMMU */
@@ -106,6 +132,28 @@ static inline int dma_memory_rw(DMAContext *dma, dma_addr_t addr,
     }
 }
 
+static inline int dma_memory_read_relaxed(DMAContext *dma, dma_addr_t addr,
+                                          void *buf, dma_addr_t len)
+{
+    return dma_memory_rw_relaxed(dma, addr, buf, len, DMA_DIRECTION_TO_DEVICE);
+}
+
+static inline int dma_memory_write_relaxed(DMAContext *dma, dma_addr_t addr,
+                                           const void *buf, dma_addr_t len)
+{
+    return dma_memory_rw_relaxed(dma, addr, (void *)buf, len,
+                                 DMA_DIRECTION_FROM_DEVICE);
+}
+
+static inline int dma_memory_rw(DMAContext *dma, dma_addr_t addr,
+                                void *buf, dma_addr_t len,
+                                DMADirection dir)
+{
+    dma_barrier(dma, dir);
+
+    return dma_memory_rw_relaxed(dma, addr, buf, len, dir);
+}
+
 static inline int dma_memory_read(DMAContext *dma, dma_addr_t addr,
                                   void *buf, dma_addr_t len)
 {
@@ -124,6 +172,8 @@ int iommu_dma_memory_set(DMAContext *dma, dma_addr_t addr, uint8_t c,
 static inline int dma_memory_set(DMAContext *dma, dma_addr_t addr,
                                  uint8_t c, dma_addr_t len)
 {
+    dma_barrier(dma, DMA_DIRECTION_FROM_DEVICE);
+
     if (!dma_has_iommu(dma)) {
         /* Fast-path for no IOMMU */
         cpu_physical_memory_set(addr, c, len);

^ permalink raw reply related	[flat|nested] 89+ messages in thread
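
For completeness, this is roughly how a performance-critical device
could use the _relaxed variants from the patch above, keeping a single
explicit barrier where its descriptor protocol needs one. The ring
layout and function name are invented for the example (and endianness
handling is omitted); only the accessors come from the patch:

/* Sketch only.  Ordinary devices keep using dma_memory_read() and get
 * the implicit barrier.  A hot-path device that already reasons about
 * ordering reads the ring index and the descriptor entry with the
 * _relaxed accessors and places one explicit read barrier between
 * them.
 */
static int demo_fetch_descriptor(DMAContext *dma, dma_addr_t ring_base,
                                 void *desc, dma_addr_t desc_len)
{
    uint16_t idx;

    dma_memory_read_relaxed(dma, ring_base, &idx, sizeof(idx));
    smp_rmb();   /* don't read the descriptor entry before the index */
    return dma_memory_read_relaxed(dma, ring_base + 8 + idx * desc_len,
                                   desc, desc_len);
}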

* Re: [Qemu-devel] [PATCH] Add a memory barrier to DMA functions
  2012-05-22  4:34                                         ` [Qemu-devel] [PATCH] Add a memory barrier to DMA functions Benjamin Herrenschmidt
@ 2012-05-22  4:51                                           ` Benjamin Herrenschmidt
  2012-05-22  7:17                                           ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-22  4:51 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Paolo Bonzini, David Gibson, qemu-devel, Michael S. Tsirkin

On Tue, 2012-05-22 at 14:34 +1000, Benjamin Herrenschmidt wrote:
> So here's the latest try :-) I've kept it simple, I don't
> add anything to map/unmap at this stage, so we might still
> have a problem with drivers that do that a lot without any
> explicit barrier (ahci ?). My preference is to also add
> barriers to map/unmap by default but we can discuss it.
> 
> Note that I've put the barrier in an inline "helper" so
> we can nag about the type of barriers, or try to be smart,
> or use flags in the DMAContext or whatever we want reasonably
> easily as we have a single spot to modify.
> 
> I'm now going to see if I can measure a performance hit on
> my x86 laptop, but if somebody who has existing x86 guest setups
> wants to help, that would be much welcome :-) 

Also an idea I had to make it easier & avoid a clutter of
_relaxed variants of everything:

Can we change DMADirection to be a DMAAttributes or DMAFlags,
and basically have more than just the direction in there ?

That way we can easily have "relaxed" be a flag, and thus
apply to most of the accessors without adding a bunch of
different variants (especially since we already have that
"cancel" variant for map, I don't want to make 2 more
for _relaxed).

If I do that change I'd also like to change the enum so that
read and write are distinct bits, and so we can set both for
bidirectional. This makes more sense if we ever have to play
with cache issues etc... and might help us use more optimal
barriers. It might also fit better with IOMMUs that provide
distinct read vs. write permissions.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 89+ messages in thread
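
To illustrate what that flags-based replacement could look like (again,
all names here are invented for the example, not part of any posted
patch):

/* Sketch only.  Direction becomes two independent permission bits, so
 * a bidirectional mapping is expressible, and "relaxed" is just one
 * more flag instead of a parallel family of _relaxed accessors.
 */
typedef enum {
    DMA_ATTR_READ    = (1 << 0),   /* device reads guest memory  */
    DMA_ATTR_WRITE   = (1 << 1),   /* device writes guest memory */
    DMA_ATTR_RELAXED = (1 << 2),   /* caller does its own barriers */
} DMAAttributes;

static inline void dma_barrier_attrs(DMAContext *dma, DMAAttributes attrs)
{
    if (attrs & DMA_ATTR_RELAXED) {
        return;                    /* device handles its own ordering */
    }
    smp_mb();                      /* conservative default */
}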

* Re: [Qemu-devel] [PATCH] Add a memory barrier to guest memory access function
  2012-05-21 22:56                                                     ` Benjamin Herrenschmidt
@ 2012-05-22  5:11                                                       ` Michael S. Tsirkin
  0 siblings, 0 replies; 89+ messages in thread
From: Michael S. Tsirkin @ 2012-05-22  5:11 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Paolo Bonzini, qemu-devel, Anthony Liguori, David Gibson

On Tue, May 22, 2012 at 08:56:12AM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2012-05-22 at 01:22 +0300, Michael S. Tsirkin wrote:
> > > Again, from which originator ? From a given initiator, nothing
> > > bypasses anything, so the right thing to do here is a full mb().
> > > However, I suspect what you are talking about here is read
> > > -responses- not bypassing writes in the direction of the response
> > > (ie, the "flushing" semantic of reads) which is a different matter.
> > 
> > No. My spec says:
> > A3, A4
> > A Posted Request must be able to pass Non-Posted Requests to avoid
> > deadlocks.
> 
> Right, a read + write can become write + read at the target, I forgot
> about that, or you can deadlock due to the flush semantics, but a write
> + read must remain in order or am I missing something ?

Exactly.

> And write + read afaik is typically the one that x86 can re-order
> without a barrier isn't it ?

AFAIK without a barrier, x86 can reorder them however you initiate them.

-- 
MST

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH] Add a memory barrier to DMA functions
  2012-05-22  4:34                                         ` [Qemu-devel] [PATCH] Add a memory barrier to DMA functions Benjamin Herrenschmidt
  2012-05-22  4:51                                           ` Benjamin Herrenschmidt
@ 2012-05-22  7:17                                           ` Benjamin Herrenschmidt
  2012-05-22 11:14                                             ` Michael S. Tsirkin
  1 sibling, 1 reply; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-22  7:17 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Paolo Bonzini, Rusty Russell, Michael S. Tsirkin, qemu-devel,
	David Gibson

On Tue, 2012-05-22 at 14:34 +1000, Benjamin Herrenschmidt wrote:
> The emulated devices can run simultaneously with the guest, so
> we need to be careful with ordering of load and stores done by
> them to the guest system memory, which need to be observed in
> the right order by the guest operating system.
> 
> This adds a barrier call to the basic DMA read/write ops which
> is currently implemented as a smp_mb(), but could be later
> improved for more fine grained control of barriers.
> 
> Additionally, a _relaxed() variant of the accessors is provided
> to easily convert devices who would be performance sensitive
> and negatively impacted by the change.
> 
> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> ---

(Note to Rusty: The number I told you on ST is wrong, see below)

So I tried to do some performance measurements with that patch using
netperf on an x86 laptop (x220 with core i7).

It's a bit tricky. For example, if I just create a tap interface,
give it a local IP on the laptop and a different IP on the guest,
(ie talking to a netserver on the host basically from the guest
via tap), the performance is pretty poor and the numbers seem
useless with and without the barrier.

So I did tests involving talking to a server on our gigabit network
instead.

The baseline is the laptop without kvm talking to the server. The
TCP_STREAM test results are:

(The "*" at the beginning of the lines is something I added to
 distinguish multi-line results on some tests)

MIGRATED TCP STREAM TEST
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

* 87380  16384  16384    10.02     933.02   
* 87380  16384  16384    10.03     908.64   
* 87380  16384  16384    10.03     926.78   
* 87380  16384  16384    10.03     919.73   

It's a bit noisy, ideally I should do a point-to-point setup to
an otherwise idle machine, here I'm getting some general lab network
noise but it gives us a pretty good baseline to begin with.

I have not managed to get any sensible result out of UDP_STREAM in
that configuration (ie just the host laptop) for some reason; the
numbers look insane :

MIGRATED UDP STREAM TEST
Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

*229376   65507   10.00      270468      0    14173.84
 126976           10.00          44              2.31
*229376   65507   10.00      266526      0    13967.32
 126976           10.00          41              2.15

So we don't have a good comparison baseline but we can still compare
KVM against itself with and without the barrier.

Now KVM. This is x86_64 running an ubuntu precise guest (I had the
ISO lying around) and using the default setup, which appears to be
an emulated e1000. I've done some tests with slirp just to see how
bad it was and it's bad enough to be irrelevant. The numbers have
thus been done using a tap interface bridged to the host ethernet
(which happens to also be some kind of e1000).

For each test I've done 3 series of numbers:

 - Without the barrier added
 - With the barrier added to dma_memory_rw
 - With the barrier added to dma_memory_rw -and- dma_memory_map

First TCP_STREAM. The numbers are a bit noisy, I suspect somebody
was hammering the server machine while I was doing one of the tests,
but here's what I got when it appeared to have stabilized:

MIGRATED TCP STREAM TEST
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

no barrier
* 87380 16384   16384    10.01    880.31
* 87380 16384   16384    10.01    876.73
* 87380 16384   16384    10.01    880.73
* 87380 16384   16384    10.01    878.63
barrier
* 87380 16384   16384    10.01    869.39
* 87380 16384   16384    10.01    864.99
* 87380 16384   16384    10.01    886.13
* 87380 16384   16384    10.01    872.90
barrier + map
* 87380 16384   16384    10.01    867.45
* 87380 16384   16384    10.01    868.51
* 87380 16384   16384    10.01    888.94
* 87380 16384   16384    10.01    888.19

As far as I can tell, it's all in the noise. I was about to concede a
small (1 % ?) loss to the barrier until I ran the last 2 tests and
then I stopped caring :-)

With UDP_STREAM, we get something like that:
MIGRATED UDP STREAM TEST
Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

no barrier
*229376   65507   10.00        5208     0      272.92
 126976           10.00        5208            272.92
*229376   65507   10.00        5447     0      285.44
 126976           10.00        5447            285.44
*229376   65507   10.00        5119     0      268.22
 126976           10.00        5119            268.22
barrier
*229376   65507   10.00        5326     0      279.06
 126976           10.00        5326            279.06
*229376   65507   10.00        5072     0      265.75
 126976           10.00        5072            265.75
*229376   65507   10.00        5282     0      276.78
 126976           10.00        5282            276.78
barrier + map
*229376   65507   10.00        5512     0      288.83
 126976           10.00        5512            288.83
*229376   65507   10.00        5571     0      291.94
 126976           10.00        5571            291.94
*229376   65507   10.00        5195     0      272.23
 126976           10.00        5195            272.23

So I think here too we're in the noise. In fact, that makes me want to
stick the barrier in map() as well (though see my other email about
using a flag to implement "relaxed" to avoid an explosion of accessors).

Now, I suspect somebody needs to re-run those tests on HW that is known
to be more sensitive to memory barriers; it could be that my SB i7 in
64-bit mode is just the best-case scenario and that some old core1 or 2
using a 32-bit lock instruction will suck a lot more.

In any case, it looks like the performance loss is minimal if measurable
at all, and in case there's a real concern on a given driver we can
always fix -that- driver to use more relaxed accessors.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH] Add a memory barrier to DMA functions
  2012-05-22  7:17                                           ` Benjamin Herrenschmidt
@ 2012-05-22 11:14                                             ` Michael S. Tsirkin
  2012-05-22 11:41                                               ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 89+ messages in thread
From: Michael S. Tsirkin @ 2012-05-22 11:14 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Paolo Bonzini, Rusty Russell, qemu-devel, Anthony Liguori, David Gibson

On Tue, May 22, 2012 at 05:17:39PM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2012-05-22 at 14:34 +1000, Benjamin Herrenschmidt wrote:
> > The emulated devices can run simultaneously with the guest, so
> > we need to be careful with ordering of load and stores done by
> > them to the guest system memory, which need to be observed in
> > the right order by the guest operating system.
> > 
> > This adds a barrier call to the basic DMA read/write ops which
> > is currently implemented as a smp_mb(), but could be later
> > improved for more fine grained control of barriers.
> > 
> > Additionally, a _relaxed() variant of the accessors is provided
> > to easily convert devices who would be performance sensitive
> > and negatively impacted by the change.
> > 
> > Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> > ---
> 
> (Note to Rusty: The number I told you on ST is wrong, see below)
> 
> So I tried to do some performance measurements with that patch using
> netperf on an x86 laptop (x220 with core i7).
> 
> It's a bit tricky. For example, if I just create a tap interface,
> give it a local IP on the laptop and a different IP on the guest,
> (ie talking to a netserver on the host basically from the guest
> via tap), the performance is pretty poor and the numbers seem
> useless with and without the barrier.
> 
> So I did tests involving talking to a server on our gigabit network
> instead.
> 
> The baseline is the laptop without kvm talking to the server. The
> TCP_STREAM test results are:

It's not a good test. The thing most affecting throughput results is how
much CPU your guest gets. So as a minimum you need to measure CPU
utilization on the host and divide by that.

-- 
MST

^ permalink raw reply	[flat|nested] 89+ messages in thread
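
One way to express the normalization being asked for, with purely
hypothetical numbers: netperf can report CPU utilization itself (the
-c/-C options, if memory serves), and the figure to compare is then
throughput per unit of host CPU. For instance, 880 Mbit/s at 40% host
CPU works out to 22 Mbit/s per percent of CPU, which would actually be
better than 900 Mbit/s at 50% (18 Mbit/s per percent), even though the
raw throughput is lower.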

* Re: [Qemu-devel] [PATCH] Add a memory barrier to DMA functions
  2012-05-22 11:14                                             ` Michael S. Tsirkin
@ 2012-05-22 11:41                                               ` Benjamin Herrenschmidt
  2012-05-22 12:03                                                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-22 11:41 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Paolo Bonzini, Rusty Russell, qemu-devel, Anthony Liguori, David Gibson

On Tue, 2012-05-22 at 14:14 +0300, Michael S. Tsirkin wrote:
> > The baseline is the laptop without kvm talking to the server. The
> > TCP_STREAM test results are:
> 
> It's not a good test. The thing most affecting throughput results is
> how
> much CPU does you guest get. So as a minumum you need to measure CPU
> utilization on the host and divide by that. 

The simple fact that we don't reach the baseline while in qemu seems to
be a reasonably good indication that we tend to be CPU bound already so
it's not -that- relevant. It would be if we were saturating the network.

But yes, I can try to do more tests tomorrow; it would be nice if you
could contribute a proper test protocol (or even test on some machines)
since you seem to be familiar with those measurements (and I have very
limited access to x86 gear ... basically just my laptop).

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH] Add a memory barrier to DMA functions
  2012-05-22 11:41                                               ` Benjamin Herrenschmidt
@ 2012-05-22 12:03                                                 ` Michael S. Tsirkin
  2012-05-22 21:24                                                   ` Benjamin Herrenschmidt
  2012-05-22 21:40                                                   ` Anthony Liguori
  0 siblings, 2 replies; 89+ messages in thread
From: Michael S. Tsirkin @ 2012-05-22 12:03 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Paolo Bonzini, Rusty Russell, qemu-devel, Anthony Liguori, David Gibson

On Tue, May 22, 2012 at 09:41:41PM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2012-05-22 at 14:14 +0300, Michael S. Tsirkin wrote:
> > > The baseline is the laptop without kvm talking to the server. The
> > > TCP_STREAM test results are:
> > 
> > It's not a good test. The thing most affecting throughput results is
> > how
> > much CPU does you guest get. So as a minumum you need to measure CPU
> > utilization on the host and divide by that. 
> 
> The simple fact that we don't reach the baseline while in qemu seems to
> be a reasonably good indication that we tend to be CPU bound already so
> it's not -that- relevant. It would be if we were saturating the network.
> 
> But yes, I can try to do more tests tomorrow, it would be nice if you
> could contribute a proper test protocol (or even test on some machines)
> since you seem to be familiar with those measurements (and I have a very
> limited access to x86 gear ... basically just my laptop).
> 
> Cheers,
> Ben.

I have a sense of deja vu. Amos sent perf results when you argued about
exactly the same issue in guest virtio. The delta was small but
measurable. At the moment I have no free time or free hardware
to redo the same work all over again. It's a well-known fact that
an actual memory barrier is slow on x86 CPUs. You can't see
it with the network on your laptop? Write a microbenchmark.

-- 
MST

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH] Add a memory barrier to DMA functions
  2012-05-22 12:03                                                 ` Michael S. Tsirkin
@ 2012-05-22 21:24                                                   ` Benjamin Herrenschmidt
  2012-05-22 21:40                                                   ` Anthony Liguori
  1 sibling, 0 replies; 89+ messages in thread
From: Benjamin Herrenschmidt @ 2012-05-22 21:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Paolo Bonzini, Rusty Russell, qemu-devel, Anthony Liguori, David Gibson

On Tue, 2012-05-22 at 15:03 +0300, Michael S. Tsirkin wrote:
> I have a deja vu. Amos sent perf results when you argued about
> exactly the same issue in guest virtio. Delta was small but
> measureable. At the moment I have no free time or free hardware
> to redo the same work all over again. It's a well known fact that
> actual memory barrier is slow on x86 CPUs. You can't see
> it with network on your laptop? Write a microbenchmark. 

Or just screw it, it's small enough, for emulated devices we can swallow
it, and move on with life, since virtio is unaffected, and we can always
fine-tune e1000 if performance of that one is critical.

Ben.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [PATCH] Add a memory barrier to DMA functions
  2012-05-22 12:03                                                 ` Michael S. Tsirkin
  2012-05-22 21:24                                                   ` Benjamin Herrenschmidt
@ 2012-05-22 21:40                                                   ` Anthony Liguori
  1 sibling, 0 replies; 89+ messages in thread
From: Anthony Liguori @ 2012-05-22 21:40 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Rusty Russell, David Gibson, qemu-devel, Paolo Bonzini

On 05/22/2012 07:03 AM, Michael S. Tsirkin wrote:
> On Tue, May 22, 2012 at 09:41:41PM +1000, Benjamin Herrenschmidt wrote:
>> On Tue, 2012-05-22 at 14:14 +0300, Michael S. Tsirkin wrote:
>>>> The baseline is the laptop without kvm talking to the server. The
>>>> TCP_STREAM test results are:
>>>
>>> It's not a good test. The thing most affecting throughput results is
>>> how
>>> much CPU does you guest get. So as a minumum you need to measure CPU
>>> utilization on the host and divide by that.
>>
>> The simple fact that we don't reach the baseline while in qemu seems to
>> be a reasonably good indication that we tend to be CPU bound already so
>> it's not -that- relevant. It would be if we were saturating the network.
>>
>> But yes, I can try to do more tests tomorrow, it would be nice if you
>> could contribute a proper test protocol (or even test on some machines)
>> since you seem to be familiar with those measurements (and I have a very
>> limited access to x86 gear ... basically just my laptop).
>>
>> Cheers,
>> Ben.
>
> I have a deja vu. Amos sent perf results when you argued about
> exactly the same issue in guest virtio. Delta was small but
> measureable. At the moment I have no free time or free hardware
> to redo the same work all over again. It's a well known fact that
> actual memory barrier is slow on x86 CPUs. You can't see
> it with network on your laptop? Write a microbenchmark.

The latest patch doesn't put a barrier in map().  I have an extremely hard time 
believing anything that uses _rw is going to be performance sensitive.

There is a correctness problem here.  I think it's important that we fix that 
and then we can focus on improving performance later.

Regards,

Anthony Liguori

>

^ permalink raw reply	[flat|nested] 89+ messages in thread

end of thread, other threads:[~2012-05-22 21:41 UTC | newest]

Thread overview: 89+ messages
2012-05-10  4:48 [Qemu-devel] [PATCH 00/13] IOMMU infrastructure Benjamin Herrenschmidt
2012-05-10  4:48 ` [Qemu-devel] [PATCH 01/13] Better support for dma_addr_t variables Benjamin Herrenschmidt
2012-05-10  4:48 ` [Qemu-devel] [PATCH 02/13] Implement cpu_physical_memory_zero() Benjamin Herrenschmidt
2012-05-15  0:42   ` Anthony Liguori
2012-05-15  1:23     ` David Gibson
2012-05-15  2:03       ` Anthony Liguori
2012-05-10  4:48 ` [Qemu-devel] [PATCH 03/13] iommu: Add universal DMA helper functions Benjamin Herrenschmidt
2012-05-10  4:48 ` [Qemu-devel] [PATCH 04/13] usb-ohci: Use " Benjamin Herrenschmidt
2012-05-10  4:48 ` [Qemu-devel] [PATCH 05/13] iommu: Make sglists and dma_bdrv helpers use new universal DMA helpers Benjamin Herrenschmidt
2012-05-10  4:49 ` [Qemu-devel] [PATCH 06/13] ide/ahci: Use universal DMA helper functions Benjamin Herrenschmidt
2012-05-21  1:51   ` [Qemu-devel] [PATCH 06/13 - UPDATED] " Benjamin Herrenschmidt
2012-05-10  4:49 ` [Qemu-devel] [PATCH 07/13] usb: Convert usb_packet_{map, unmap} to universal DMA helpers Benjamin Herrenschmidt
2012-05-10  4:49 ` [Qemu-devel] [PATCH 08/13] iommu: Introduce IOMMU emulation infrastructure Benjamin Herrenschmidt
2012-05-15  0:49   ` Anthony Liguori
2012-05-15  1:42     ` David Gibson
2012-05-15  2:03       ` Anthony Liguori
2012-05-15  2:32         ` Benjamin Herrenschmidt
2012-05-15  2:50           ` Anthony Liguori
2012-05-15  3:02             ` Benjamin Herrenschmidt
2012-05-15 14:02               ` Anthony Liguori
2012-05-15 21:55                 ` Benjamin Herrenschmidt
2012-05-15 22:02                   ` Anthony Liguori
2012-05-15 23:08                     ` Benjamin Herrenschmidt
2012-05-15 23:58                       ` Anthony Liguori
2012-05-16  0:41                         ` Benjamin Herrenschmidt
2012-05-16  0:54                           ` Anthony Liguori
2012-05-16  1:20                             ` Benjamin Herrenschmidt
2012-05-16 19:36                               ` Anthony Liguori
2012-05-10  4:49 ` [Qemu-devel] [PATCH 09/13] iommu: Add facility to cancel in-use dma memory maps Benjamin Herrenschmidt
2012-05-10  4:49 ` [Qemu-devel] [PATCH 10/13] pseries: Convert sPAPR TCEs to use generic IOMMU infrastructure Benjamin Herrenschmidt
2012-05-10  4:49 ` [Qemu-devel] [PATCH 11/13] iommu: Allow PCI to use " Benjamin Herrenschmidt
2012-05-10  4:49 ` [Qemu-devel] [PATCH 12/13] pseries: Implement IOMMU and DMA for PAPR PCI devices Benjamin Herrenschmidt
2012-05-10  4:49 ` [Qemu-devel] [PATCH 13/13] iommu: Add a memory barrier to DMA RW function Benjamin Herrenschmidt
2012-05-15  0:52   ` Anthony Liguori
2012-05-15  1:11     ` Benjamin Herrenschmidt
2012-05-15  1:44     ` David Gibson
2012-05-16  4:35       ` Benjamin Herrenschmidt
2012-05-16  5:51         ` David Gibson
2012-05-16 19:39         ` Anthony Liguori
2012-05-16 21:10           ` Benjamin Herrenschmidt
2012-05-16 21:12             ` Benjamin Herrenschmidt
2012-05-17  0:07           ` Benjamin Herrenschmidt
2012-05-17  0:24             ` Benjamin Herrenschmidt
2012-05-17  0:52               ` [Qemu-devel] [RFC/PATCH] Add a memory barrier to guest memory access functions Benjamin Herrenschmidt
2012-05-17  2:28                 ` Anthony Liguori
2012-05-17  2:44                   ` Benjamin Herrenschmidt
2012-05-17 22:09                     ` Anthony Liguori
2012-05-18  1:04                       ` David Gibson
2012-05-18  1:16                       ` Benjamin Herrenschmidt
2012-05-17  3:35                   ` David Gibson
2012-05-18  6:53               ` [Qemu-devel] [PATCH 13/13] iommu: Add a memory barrier to DMA RW function Paolo Bonzini
2012-05-18  8:18                 ` Benjamin Herrenschmidt
2012-05-18  8:57                   ` Paolo Bonzini
2012-05-18 22:26                     ` Benjamin Herrenschmidt
2012-05-19  7:24                       ` Paolo Bonzini
2012-05-20 21:36                         ` Benjamin Herrenschmidt
2012-05-21  1:56                           ` [Qemu-devel] [PATCH] Add a memory barrier to guest memory access functions Benjamin Herrenschmidt
2012-05-21  8:11                             ` Paolo Bonzini
2012-05-21  8:31                               ` Michael S. Tsirkin
2012-05-21  8:58                                 ` Benjamin Herrenschmidt
2012-05-21  9:07                                   ` Benjamin Herrenschmidt
2012-05-21  9:16                                     ` Benjamin Herrenschmidt
2012-05-21  9:34                                       ` Michael S. Tsirkin
2012-05-21  9:53                                         ` Benjamin Herrenschmidt
2012-05-21 10:31                                           ` Michael S. Tsirkin
2012-05-21 11:45                                             ` Benjamin Herrenschmidt
2012-05-21 12:18                                               ` Michael S. Tsirkin
2012-05-21 15:16                                                 ` Paolo Bonzini
2012-05-21 21:58                                                 ` [Qemu-devel] [PATCH] Add a memory barrier to guest memory access function Benjamin Herrenschmidt
2012-05-21 22:22                                                   ` Michael S. Tsirkin
2012-05-21 22:56                                                     ` Benjamin Herrenschmidt
2012-05-22  5:11                                                       ` Michael S. Tsirkin
2012-05-22  0:00                                                     ` Benjamin Herrenschmidt
2012-05-22  4:19                                                   ` Rusty Russell
2012-05-21 22:18                                 ` [Qemu-devel] [PATCH] Add a memory barrier to guest memory access functions Anthony Liguori
2012-05-21 22:26                                   ` Benjamin Herrenschmidt
2012-05-21 22:31                                     ` Anthony Liguori
2012-05-21 22:44                                       ` Michael S. Tsirkin
2012-05-21 23:02                                         ` Benjamin Herrenschmidt
2012-05-22  4:34                                         ` [Qemu-devel] [PATCH] Add a memory barrier to DMA functions Benjamin Herrenschmidt
2012-05-22  4:51                                           ` Benjamin Herrenschmidt
2012-05-22  7:17                                           ` Benjamin Herrenschmidt
2012-05-22 11:14                                             ` Michael S. Tsirkin
2012-05-22 11:41                                               ` Benjamin Herrenschmidt
2012-05-22 12:03                                                 ` Michael S. Tsirkin
2012-05-22 21:24                                                   ` Benjamin Herrenschmidt
2012-05-22 21:40                                                   ` Anthony Liguori
2012-05-21 22:37                                   ` [Qemu-devel] [PATCH] Add a memory barrier to guest memory access functions Michael S. Tsirkin
2012-05-15  0:52 ` [Qemu-devel] [PATCH 00/13] IOMMU infrastructure Anthony Liguori
