* [Qemu-devel] [PULL 00/28] acpi.pci,pc,memory core fixes
@ 2013-12-11 18:30 Michael S. Tsirkin
  2013-12-11 18:30 ` [Qemu-devel] [PULL 01/28] hw: Pass QEMUMachine to its init() method Michael S. Tsirkin
                   ` (27 more replies)
  0 siblings, 28 replies; 74+ messages in thread
From: Michael S. Tsirkin @ 2013-12-11 18:30 UTC (permalink / raw)
  To: qemu-devel, Anthony Liguori
  Cc: agraf, pingfank, marcel.a, mst, armbru, qemu-stable, kraxel,
	pbonzini, imammedo, qemulist, lcapitulino, afaerber

The following changes since commit 8f84271da83c0e9f92aa7c1c2d0d3875bf0a5cb8:

  target-mips: Use macro ARRAY_SIZE where possible (2013-12-09 16:44:04 +0100)

are available in the git repository at:

  git://git.kernel.org/pub/scm/virt/kvm/mst/qemu.git tags/for_anthony

for you to fetch changes up to 511161027a0ecab6e12107128adeb8a884c5bcbe:

  pc: use macro for HPET type (2013-12-11 20:11:10 +0200)

----------------------------------------------------------------
acpi.pci,pc,memory core fixes

Most notably, this includes changes to exec to support
full 64-bit addresses.

This also flushes out patches that got queued during the 1.7 freeze.
There are new tests, and a bunch of bug fixes all over the place.
There are also some changes mostly useful for downstreams.

I'm also listing myself as pc co-maintainer. I'm doing this reluctantly,
but this seems to be necessary to make sure patches are not lost or delayed too
much, and posting the MAINTAINERS patch did not seem to make anyone else
volunteer.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

----------------------------------------------------------------
Gerd Hoffmann (1):
  pci: fix pci bridge fw path

Liu Ping Fan (2):
  hpet: inverse polarity when pin above ISA_NUM_IRQS
  hpet: enable to entitle more irq pins for hpet

Marcel Apfelbaum (5):
  acpi unit-test: verify signature and checksum
  memory.c: bugfix - ref counting mismatch in memory_region_find
  exec: separate sections and nodes per address space
  acpi unit-test: load and check facs table
  acpi unit-test: adjust the test data structure for better handling

Markus Armbruster (2):
  hw: Pass QEMUMachine to its init() method
  smbios: Set system manufacturer, product & version by default

Michael S. Tsirkin (14):
  pc: map PCI address space as catchall region for not mapped addresses
  acpi-test: basic acpi unit-test
  MAINTAINERS: update X86 machine entry
  pci: fix address space size for bridge
  spapr_pci: s/INT64_MAX/UINT64_MAX/
  exec: replace leaf with skip
  exec: extend skip field to 6 bit, page entry to 32 bit
  exec: pass hw address to phys_page_find
  exec: memory radix tree page level compression
  exec: reduce L2_PAGE_SIZE
  acpi: strip compiler info in built-in DSDT
  ACPI DSDT: Make control method `IQCR` serialized
  hpet: fix build with CONFIG_HPET off
  pc: use macro for HPET type

Paolo Bonzini (4):
  qtest: split configuration of qtest accelerator and chardev
  pc: s/INT64_MAX/UINT64_MAX/
  split definitions for exec.c and translate-all.c radix trees
  exec: make address spaces 64-bit wide

 MAINTAINERS                         |  18 +-
 exec.c                              | 278 +++++++++++++++----------
 hw/i386/acpi-build.c                |   8 +-
 hw/i386/acpi-dsdt.dsl               |   2 +-
 hw/i386/acpi-dsdt.hex.generated     |   4 +-
 hw/i386/pc.c                        |  39 ++--
 hw/i386/pc_piix.c                   |  31 ++-
 hw/i386/pc_q35.c                    |  34 +++-
 hw/i386/q35-acpi-dsdt.dsl           |   2 +-
 hw/i386/q35-acpi-dsdt.hex.generated |   4 +-
 hw/i386/smbios.c                    |  14 ++
 hw/pci-host/piix.c                  |  26 +--
 hw/pci-host/q35.c                   |  27 +--
 hw/pci/pci.c                        |   2 +-
 hw/pci/pci_bridge.c                 |   2 +-
 hw/ppc/spapr_pci.c                  |   2 +-
 hw/timer/hpet.c                     |  29 ++-
 include/hw/boards.h                 |   7 +-
 include/hw/i386/pc.h                |  38 ++--
 include/hw/i386/smbios.h            |   2 +
 include/hw/pci-host/q35.h           |   2 -
 include/hw/timer/hpet.h             |  10 +-
 include/sysemu/qtest.h              |  25 +--
 memory.c                            |   1 +
 qtest.c                             |  20 +-
 tests/Makefile                      |   2 +
 tests/acpi-test.c                   | 394 ++++++++++++++++++++++++++++++++++++
 translate-all.c                     |  32 +--
 translate-all.h                     |   7 -
 vl.c                                |  11 +-
 30 files changed, 798 insertions(+), 275 deletions(-)
 create mode 100644 tests/acpi-test.c

-- 
MST

* [Qemu-devel] [PULL 01/28] hw: Pass QEMUMachine to its init() method
  2013-12-11 18:30 [Qemu-devel] [PULL 00/28] acpi.pci,pc,memory core fixes Michael S. Tsirkin
@ 2013-12-11 18:30 ` Michael S. Tsirkin
  2013-12-11 18:30 ` [Qemu-devel] [PULL 02/28] pc: map PCI address space as catchall region for not mapped addresses Michael S. Tsirkin
                   ` (26 subsequent siblings)
  27 siblings, 0 replies; 74+ messages in thread
From: Michael S. Tsirkin @ 2013-12-11 18:30 UTC (permalink / raw)
  To: qemu-devel
  Cc: Anthony Liguori, Markus Armbruster, Eduardo Habkost,
	Andreas Färber

From: Markus Armbruster <armbru@redhat.com>

Put it in QEMUMachineInitArgs, so I don't have to touch every board.
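
As a sketch of what this enables (illustrative only, not code from this
patch; the board name and body are made up):

    #include "hw/boards.h"

    /* A board's init() can now reach its QEMUMachine through the init
     * args instead of going through a global or a lookup by name. */
    static void example_board_init(QEMUMachineInitArgs *args)
    {
        const QEMUMachine *machine = args->machine;   /* new field */

        /* e.g. key compatibility behaviour off the machine definition */
        if (machine->hw_version) {
            /* ... */
        }
    }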

Reviewed-by: Andreas Färber <afaerber@suse.de>
Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/hw/boards.h | 7 +++++--
 vl.c                | 3 ++-
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/hw/boards.h b/include/hw/boards.h
index 5a7ae9f..2151460 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -6,7 +6,10 @@
 #include "sysemu/blockdev.h"
 #include "hw/qdev.h"
 
+typedef struct QEMUMachine QEMUMachine;
+
 typedef struct QEMUMachineInitArgs {
+    const QEMUMachine *machine;
     ram_addr_t ram_size;
     const char *boot_order;
     const char *kernel_filename;
@@ -21,7 +24,7 @@ typedef void QEMUMachineResetFunc(void);
 
 typedef void QEMUMachineHotAddCPUFunc(const int64_t id, Error **errp);
 
-typedef struct QEMUMachine {
+struct QEMUMachine {
     const char *name;
     const char *alias;
     const char *desc;
@@ -43,7 +46,7 @@ typedef struct QEMUMachine {
     GlobalProperty *compat_props;
     struct QEMUMachine *next;
     const char *hw_version;
-} QEMUMachine;
+};
 
 int qemu_register_machine(QEMUMachine *m);
 QEMUMachine *find_default_machine(void);
diff --git a/vl.c b/vl.c
index b0399de..29e566f 100644
--- a/vl.c
+++ b/vl.c
@@ -4239,7 +4239,8 @@ int main(int argc, char **argv, char **envp)
 
     qdev_machine_init();
 
-    QEMUMachineInitArgs args = { .ram_size = ram_size,
+    QEMUMachineInitArgs args = { .machine = machine,
+                                 .ram_size = ram_size,
                                  .boot_order = boot_order,
                                  .kernel_filename = kernel_filename,
                                  .kernel_cmdline = kernel_cmdline,
-- 
MST

* [Qemu-devel] [PULL 02/28] pc: map PCI address space as catchall region for not mapped addresses
  2013-12-11 18:30 [Qemu-devel] [PULL 00/28] acpi.pci,pc,memory core fixes Michael S. Tsirkin
  2013-12-11 18:30 ` [Qemu-devel] [PULL 01/28] hw: Pass QEMUMachine to its init() method Michael S. Tsirkin
@ 2013-12-11 18:30 ` Michael S. Tsirkin
  2013-12-11 18:30 ` [Qemu-devel] [PULL 03/28] qtest: split configuration of qtest accelerator and chardev Michael S. Tsirkin
                   ` (25 subsequent siblings)
  27 siblings, 0 replies; 74+ messages in thread
From: Michael S. Tsirkin @ 2013-12-11 18:30 UTC (permalink / raw)
  To: qemu-devel; +Cc: Igor Mammedov, Anthony Liguori

With the help of negative memory region priority, the PCI address space
is mapped underneath the RAM regions, effectively catching every access
to addresses not mapped by any other region.
This simplifies mapping the PCI address space into the system address space.
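
As a rough sketch of the resulting layout (illustrative only; the
function and variable names below are made up, only the memory API
calls are real):

    #include "exec/memory.h"

    /* Layer a PCI catch-all underneath RAM. */
    static void map_pci_catchall(MemoryRegion *sysmem,
                                 MemoryRegion *ram_below_4g,
                                 MemoryRegion *pci_address_space)
    {
        /* RAM at the default priority (0) ... */
        memory_region_add_subregion(sysmem, 0, ram_below_4g);
        /* ... and the whole PCI address space underneath at priority -1,
         * so accesses that no other subregion claims fall through to PCI. */
        memory_region_add_subregion_overlap(sysmem, 0,
                                            pci_address_space, -1);
    }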

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
---
 include/hw/i386/pc.h      | 14 ++------------
 include/hw/pci-host/q35.h |  2 --
 hw/i386/pc.c              | 20 ++++++--------------
 hw/i386/pc_piix.c         |  2 --
 hw/pci-host/piix.c        | 26 ++++----------------------
 hw/pci-host/q35.c         | 27 +++++----------------------
 6 files changed, 17 insertions(+), 74 deletions(-)

diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index 09652fb..8ea1a98 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -128,17 +128,9 @@ PcGuestInfo *pc_guest_info_init(ram_addr_t below_4g_mem_size,
 #define PCI_HOST_PROP_PCI_HOLE64_SIZE  "pci-hole64-size"
 #define DEFAULT_PCI_HOLE64_SIZE (~0x0ULL)
 
-static inline uint64_t pci_host_get_hole64_size(uint64_t pci_hole64_size)
-{
-    if (pci_hole64_size == DEFAULT_PCI_HOLE64_SIZE) {
-        return 1ULL << 62;
-    } else {
-        return pci_hole64_size;
-    }
-}
 
-void pc_init_pci64_hole(PcPciInfo *pci_info, uint64_t pci_hole64_start,
-                        uint64_t pci_hole64_size);
+void pc_pci_as_mapping_init(Object *owner, MemoryRegion *system_memory,
+                            MemoryRegion *pci_address_space);
 
 FWCfgState *pc_memory_init(MemoryRegion *system_memory,
                            const char *kernel_filename,
@@ -187,8 +179,6 @@ PCIBus *i440fx_init(PCII440FXState **pi440fx_state, int *piix_devfn,
                     MemoryRegion *address_space_mem,
                     MemoryRegion *address_space_io,
                     ram_addr_t ram_size,
-                    hwaddr pci_hole_start,
-                    hwaddr pci_hole_size,
                     ram_addr_t above_4g_mem_size,
                     MemoryRegion *pci_memory,
                     MemoryRegion *ram_memory);
diff --git a/include/hw/pci-host/q35.h b/include/hw/pci-host/q35.h
index 309065f..d0355b7 100644
--- a/include/hw/pci-host/q35.h
+++ b/include/hw/pci-host/q35.h
@@ -53,8 +53,6 @@ typedef struct MCHPCIState {
     MemoryRegion *address_space_io;
     PAMMemoryRegion pam_regions[13];
     MemoryRegion smram_region;
-    MemoryRegion pci_hole;
-    MemoryRegion pci_hole_64bit;
     PcPciInfo pci_info;
     uint8_t smm_enabled;
     ram_addr_t below_4g_mem_size;
diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 12c436e..6c82ada 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -1093,21 +1093,13 @@ PcGuestInfo *pc_guest_info_init(ram_addr_t below_4g_mem_size,
     return guest_info;
 }
 
-void pc_init_pci64_hole(PcPciInfo *pci_info, uint64_t pci_hole64_start,
-                        uint64_t pci_hole64_size)
+/* setup pci memory address space mapping into system address space */
+void pc_pci_as_mapping_init(Object *owner, MemoryRegion *system_memory,
+                            MemoryRegion *pci_address_space)
 {
-    if ((sizeof(hwaddr) == 4) || (!pci_hole64_size)) {
-        return;
-    }
-    /*
-     * BIOS does not set MTRR entries for the 64 bit window, so no need to
-     * align address to power of two.  Align address at 1G, this makes sure
-     * it can be exactly covered with a PAT entry even when using huge
-     * pages.
-     */
-    pci_info->w64.begin = ROUND_UP(pci_hole64_start, 0x1ULL << 30);
-    pci_info->w64.end = pci_info->w64.begin + pci_hole64_size;
-    assert(pci_info->w64.begin <= pci_info->w64.end);
+    /* Set to lower priority than RAM */
+    memory_region_add_subregion_overlap(system_memory, 0x0,
+                                        pci_address_space, -1);
 }
 
 void pc_acpi_init(const char *default_dsdt)
diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
index ab56285..636f59f 100644
--- a/hw/i386/pc_piix.c
+++ b/hw/i386/pc_piix.c
@@ -149,8 +149,6 @@ static void pc_init1(QEMUMachineInitArgs *args,
     if (pci_enabled) {
         pci_bus = i440fx_init(&i440fx_state, &piix3_devfn, &isa_bus, gsi,
                               system_memory, system_io, args->ram_size,
-                              below_4g_mem_size,
-                              0x100000000ULL - below_4g_mem_size,
                               above_4g_mem_size,
                               pci_memory, ram_memory);
     } else {
diff --git a/hw/pci-host/piix.c b/hw/pci-host/piix.c
index edc974e..63be7f6 100644
--- a/hw/pci-host/piix.c
+++ b/hw/pci-host/piix.c
@@ -103,8 +103,6 @@ struct PCII440FXState {
     MemoryRegion *system_memory;
     MemoryRegion *pci_address_space;
     MemoryRegion *ram_memory;
-    MemoryRegion pci_hole;
-    MemoryRegion pci_hole_64bit;
     PAMMemoryRegion pam_regions[13];
     MemoryRegion smram_region;
     uint8_t smm_enabled;
@@ -313,8 +311,6 @@ PCIBus *i440fx_init(PCII440FXState **pi440fx_state,
                     MemoryRegion *address_space_mem,
                     MemoryRegion *address_space_io,
                     ram_addr_t ram_size,
-                    hwaddr pci_hole_start,
-                    hwaddr pci_hole_size,
                     ram_addr_t above_4g_mem_size,
                     MemoryRegion *pci_address_space,
                     MemoryRegion *ram_memory)
@@ -327,7 +323,6 @@ PCIBus *i440fx_init(PCII440FXState **pi440fx_state,
     PCII440FXState *f;
     unsigned i;
     I440FXState *i440fx;
-    uint64_t pci_hole64_size;
 
     dev = qdev_create(NULL, TYPE_I440FX_PCI_HOST_BRIDGE);
     s = PCI_HOST_BRIDGE(dev);
@@ -355,23 +350,10 @@ PCIBus *i440fx_init(PCII440FXState **pi440fx_state,
         i440fx->pci_info.w32.begin = 0xe0000000;
     }
 
-    memory_region_init_alias(&f->pci_hole, OBJECT(d), "pci-hole", f->pci_address_space,
-                             pci_hole_start, pci_hole_size);
-    memory_region_add_subregion(f->system_memory, pci_hole_start, &f->pci_hole);
-
-    pci_hole64_size = pci_host_get_hole64_size(i440fx->pci_hole64_size);
-
-    pc_init_pci64_hole(&i440fx->pci_info, 0x100000000ULL + above_4g_mem_size,
-                       pci_hole64_size);
-    memory_region_init_alias(&f->pci_hole_64bit, OBJECT(d), "pci-hole64",
-                             f->pci_address_space,
-                             i440fx->pci_info.w64.begin,
-                             pci_hole64_size);
-    if (pci_hole64_size) {
-        memory_region_add_subregion(f->system_memory,
-                                    i440fx->pci_info.w64.begin,
-                                    &f->pci_hole_64bit);
-    }
+    /* setup pci memory mapping */
+    pc_pci_as_mapping_init(OBJECT(f), f->system_memory,
+                           f->pci_address_space);
+
     memory_region_init_alias(&f->smram_region, OBJECT(d), "smram-region",
                              f->pci_address_space, 0xa0000, 0x20000);
     memory_region_add_subregion_overlap(f->system_memory, 0xa0000,
diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c
index c043998..81c8240 100644
--- a/hw/pci-host/q35.c
+++ b/hw/pci-host/q35.c
@@ -356,28 +356,11 @@ static int mch_init(PCIDevice *d)
 {
     int i;
     MCHPCIState *mch = MCH_PCI_DEVICE(d);
-    uint64_t pci_hole64_size;
-
-    /* setup pci memory regions */
-    memory_region_init_alias(&mch->pci_hole, OBJECT(mch), "pci-hole",
-                             mch->pci_address_space,
-                             mch->below_4g_mem_size,
-                             0x100000000ULL - mch->below_4g_mem_size);
-    memory_region_add_subregion(mch->system_memory, mch->below_4g_mem_size,
-                                &mch->pci_hole);
-
-    pci_hole64_size = pci_host_get_hole64_size(mch->pci_hole64_size);
-    pc_init_pci64_hole(&mch->pci_info, 0x100000000ULL + mch->above_4g_mem_size,
-                       pci_hole64_size);
-    memory_region_init_alias(&mch->pci_hole_64bit, OBJECT(mch), "pci-hole64",
-                             mch->pci_address_space,
-                             mch->pci_info.w64.begin,
-                             pci_hole64_size);
-    if (pci_hole64_size) {
-        memory_region_add_subregion(mch->system_memory,
-                                    mch->pci_info.w64.begin,
-                                    &mch->pci_hole_64bit);
-    }
+
+    /* setup pci memory mapping */
+    pc_pci_as_mapping_init(OBJECT(mch), mch->system_memory,
+                           mch->pci_address_space);
+
     /* smram */
     cpu_smm_register(&mch_set_smm, mch);
     memory_region_init_alias(&mch->smram_region, OBJECT(mch), "smram-region",
-- 
MST

* [Qemu-devel] [PULL 03/28] qtest: split configuration of qtest accelerator and chardev
  2013-12-11 18:30 [Qemu-devel] [PULL 00/28] acpi.pci,pc,memory core fixes Michael S. Tsirkin
  2013-12-11 18:30 ` [Qemu-devel] [PULL 01/28] hw: Pass QEMUMachine to its init() method Michael S. Tsirkin
  2013-12-11 18:30 ` [Qemu-devel] [PULL 02/28] pc: map PCI address space as catchall region for not mapped addresses Michael S. Tsirkin
@ 2013-12-11 18:30 ` Michael S. Tsirkin
  2013-12-11 18:30 ` [Qemu-devel] [PULL 04/28] acpi-test: basic acpi unit-test Michael S. Tsirkin
                   ` (24 subsequent siblings)
  27 siblings, 0 replies; 74+ messages in thread
From: Michael S. Tsirkin @ 2013-12-11 18:30 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paolo Bonzini, Anthony Liguori

From: Paolo Bonzini <pbonzini@redhat.com>

qtest uses the icount infrastructure to implement a test-driven vm_clock.  This
however is not necessary when using -qtest as a "probe" together with a normal
TCG-, KVM- or Xen-based virtual machine.  Hence, split out the call to
configure_icount into a new function that is called only for "-machine
accel=qtest"; and disable those commands when running with an accelerator
other than qtest.

This also fixes an assertion failure with "qemu-system-x86_64 -machine
accel=qtest" but no -qtest option.  This is a valid case, albeit somewhat
weird; nothing will happen in the VM but you'll still be able to
interact with the monitor or the GUI.

Now that qtest_init is not limited to an int(void) function, change
global variables that are not used outside qtest_init to arguments.

And finally, clean up useless parts of include/sysemu/qtest.h.  The file
is not used at all for user-only emulation, and qtest is not available
on Win32 due to its usage of sigwait.
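
In other words (a usage sketch, not from the patch; the socket path is
made up): plain "-machine accel=qtest" now just switches the clock to the
test-driven mode, while "-qtest unix:/tmp/qtest.sock,server" on top of
accel=tcg or accel=kvm only attaches the control chardev, with the
clock_step/clock_set commands disabled because qtest is not the
accelerator.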

Reported-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/sysemu/qtest.h | 25 +++++--------------------
 qtest.c                | 20 ++++++++++----------
 vl.c                   |  8 +++++---
 3 files changed, 20 insertions(+), 33 deletions(-)

diff --git a/include/sysemu/qtest.h b/include/sysemu/qtest.h
index 9a0c6b3..112a661 100644
--- a/include/sysemu/qtest.h
+++ b/include/sysemu/qtest.h
@@ -16,38 +16,23 @@
 
 #include "qemu-common.h"
 
-#if !defined(CONFIG_USER_ONLY)
 extern bool qtest_allowed;
-extern const char *qtest_chrdev;
-extern const char *qtest_log;
 
 static inline bool qtest_enabled(void)
 {
     return qtest_allowed;
 }
 
+int qtest_init_accel(void);
+void qtest_init(const char *qtest_chrdev, const char *qtest_log);
+
 static inline int qtest_available(void)
 {
+#ifdef CONFIG_POSIX
     return 1;
-}
-
-int qtest_init(void);
 #else
-static inline bool qtest_enabled(void)
-{
-    return false;
-}
-
-static inline int qtest_available(void)
-{
-    return 0;
-}
-
-static inline int qtest_init(void)
-{
     return 0;
-}
-
 #endif
+}
 
 #endif
diff --git a/qtest.c b/qtest.c
index 584c707..dcf1301 100644
--- a/qtest.c
+++ b/qtest.c
@@ -22,8 +22,6 @@
 
 #define MAX_IRQ 256
 
-const char *qtest_chrdev;
-const char *qtest_log;
 bool qtest_allowed;
 
 static DeviceState *irq_intercept_dev;
@@ -406,7 +404,7 @@ static void qtest_process_command(CharDriverState *chr, gchar **words)
 
         qtest_send_prefix(chr);
         qtest_send(chr, "OK\n");
-    } else if (strcmp(words[0], "clock_step") == 0) {
+    } else if (qtest_enabled() && strcmp(words[0], "clock_step") == 0) {
         int64_t ns;
 
         if (words[1]) {
@@ -417,7 +415,7 @@ static void qtest_process_command(CharDriverState *chr, gchar **words)
         qtest_clock_warp(qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + ns);
         qtest_send_prefix(chr);
         qtest_send(chr, "OK %"PRIi64"\n", (int64_t)qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL));
-    } else if (strcmp(words[0], "clock_set") == 0) {
+    } else if (qtest_enabled() && strcmp(words[0], "clock_set") == 0) {
         int64_t ns;
 
         g_assert(words[1]);
@@ -502,13 +500,17 @@ static void qtest_event(void *opaque, int event)
     }
 }
 
-int qtest_init(void)
+int qtest_init_accel(void)
 {
-    CharDriverState *chr;
+    configure_icount("0");
 
-    g_assert(qtest_chrdev != NULL);
+    return 0;
+}
+
+void qtest_init(const char *qtest_chrdev, const char *qtest_log)
+{
+    CharDriverState *chr;
 
-    configure_icount("0");
     chr = qemu_chr_new("qtest", qtest_chrdev, NULL);
 
     qemu_chr_add_handlers(chr, qtest_can_read, qtest_read, qtest_event, chr);
@@ -525,6 +527,4 @@ int qtest_init(void)
     }
 
     qtest_chr = chr;
-
-    return 0;
 }
diff --git a/vl.c b/vl.c
index 29e566f..60dbbcb 100644
--- a/vl.c
+++ b/vl.c
@@ -2624,7 +2624,7 @@ static struct {
     { "tcg", "tcg", tcg_available, tcg_init, &tcg_allowed },
     { "xen", "Xen", xen_available, xen_init, &xen_allowed },
     { "kvm", "KVM", kvm_available, kvm_init, &kvm_allowed },
-    { "qtest", "QTest", qtest_available, qtest_init, &qtest_allowed },
+    { "qtest", "QTest", qtest_available, qtest_init_accel, &qtest_allowed },
 };
 
 static int configure_accelerator(void)
@@ -2836,6 +2836,8 @@ int main(int argc, char **argv, char **envp)
     QEMUMachine *machine;
     const char *cpu_model;
     const char *vga_model = "none";
+    const char *qtest_chrdev = NULL;
+    const char *qtest_log = NULL;
     const char *pid_file = NULL;
     const char *incoming = NULL;
 #ifdef CONFIG_VNC
@@ -4043,8 +4045,8 @@ int main(int argc, char **argv, char **envp)
 
     configure_accelerator();
 
-    if (!qtest_enabled() && qtest_chrdev) {
-        qtest_init();
+    if (qtest_chrdev) {
+        qtest_init(qtest_chrdev, qtest_log);
     }
 
     machine_opts = qemu_get_machine_opts();
-- 
MST

* [Qemu-devel] [PULL 04/28] acpi-test: basic acpi unit-test
  2013-12-11 18:30 [Qemu-devel] [PULL 00/28] acpi.pci,pc,memory core fixes Michael S. Tsirkin
                   ` (2 preceding siblings ...)
  2013-12-11 18:30 ` [Qemu-devel] [PULL 03/28] qtest: split configuration of qtest accelerator and chardev Michael S. Tsirkin
@ 2013-12-11 18:30 ` Michael S. Tsirkin
  2013-12-11 18:30 ` [Qemu-devel] [PULL 05/28] MAINTAINERS: update X86 machine entry Michael S. Tsirkin
                   ` (23 subsequent siblings)
  27 siblings, 0 replies; 74+ messages in thread
From: Michael S. Tsirkin @ 2013-12-11 18:30 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Andreas Färber, Markus Armbruster

We run the BIOS and boot a minimal boot sector that immediately halts.
Then we poke at memory to find the ACPI tables.

This only checks that the RSDP is there.
More checks will be added later.
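
For reference (not part of the patch text itself): the test is wired into
the i386 qtest set, so it should run under the usual "make
check-qtest-i386" target, and it can also be run directly by pointing the
QTEST_QEMU_BINARY environment variable at a built qemu-system-x86_64;
exact paths depend on the build tree.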

Cc: Andreas Färber <afaerber@suse.de>
Cc: Markus Armbruster <armbru@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 tests/acpi-test.c | 135 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/Makefile    |   2 +
 2 files changed, 137 insertions(+)
 create mode 100644 tests/acpi-test.c

diff --git a/tests/acpi-test.c b/tests/acpi-test.c
new file mode 100644
index 0000000..468c4f5
--- /dev/null
+++ b/tests/acpi-test.c
@@ -0,0 +1,135 @@
+/*
+ * ACPI test cases.
+ *
+ * Copyright (c) 2013 Red Hat Inc.
+ *
+ * Authors:
+ *  Michael S. Tsirkin <mst@redhat.com>,
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include <string.h>
+#include <stdio.h>
+#include <glib.h>
+#include "libqtest.h"
+
+typedef struct {
+    const char *args;
+    uint64_t expected_boot;
+    uint64_t expected_reboot;
+} boot_order_test;
+
+#define LOW(x) ((x) & 0xff)
+#define HIGH(x) ((x) >> 8)
+
+#define SIGNATURE 0xdead
+#define SIGNATURE_OFFSET 0x10
+#define BOOT_SECTOR_ADDRESS 0x7c00
+
+/* Boot sector code: write SIGNATURE into memory,
+ * then halt.
+ */
+static uint8_t boot_sector[0x200] = {
+    /* 7c00: mov $0xdead,%ax */
+    [0x00] = 0xb8,
+    [0x01] = LOW(SIGNATURE),
+    [0x02] = HIGH(SIGNATURE),
+    /* 7c03:  mov %ax,0x7c10 */
+    [0x03] = 0xa3,
+    [0x04] = LOW(BOOT_SECTOR_ADDRESS + SIGNATURE_OFFSET),
+    [0x05] = HIGH(BOOT_SECTOR_ADDRESS + SIGNATURE_OFFSET),
+    /* 7c06: cli */
+    [0x06] = 0xfa,
+    /* 7c07: hlt */
+    [0x07] = 0xf4,
+    /* 7c08: jmp 0x7c07=0x7c0a-3 */
+    [0x08] = 0xeb,
+    [0x09] = LOW(-3),
+    /* We mov 0xdead here: set value to make debugging easier */
+    [SIGNATURE_OFFSET] = LOW(0xface),
+    [SIGNATURE_OFFSET + 1] = HIGH(0xface),
+    /* End of boot sector marker */
+    [0x1FE] = 0x55,
+    [0x1FF] = 0xAA,
+};
+
+static const char *disk = "tests/acpi-test-disk.raw";
+
+static void test_acpi_one(const char *params)
+{
+    char *args;
+    uint8_t signature_low;
+    uint8_t signature_high;
+    uint16_t signature;
+    int i;
+    uint32_t off;
+
+
+    args = g_strdup_printf("-net none -display none %s %s",
+                           params ? params : "", disk);
+    qtest_start(args);
+
+   /* Wait at most 1 minute */
+#define TEST_DELAY (1 * G_USEC_PER_SEC / 10)
+#define TEST_CYCLES MAX((60 * G_USEC_PER_SEC / TEST_DELAY), 1)
+
+    /* Poll until code has run and modified memory.  Once it has we know BIOS
+     * initialization is done.  TODO: check that IP reached the halt
+     * instruction.
+     */
+    for (i = 0; i < TEST_CYCLES; ++i) {
+        signature_low = readb(BOOT_SECTOR_ADDRESS + SIGNATURE_OFFSET);
+        signature_high = readb(BOOT_SECTOR_ADDRESS + SIGNATURE_OFFSET + 1);
+        signature = (signature_high << 8) | signature_low;
+        if (signature == SIGNATURE) {
+            break;
+        }
+        g_usleep(TEST_DELAY);
+    }
+    g_assert_cmphex(signature, ==, SIGNATURE);
+
+    /* OK, now find RSDP */
+    for (off = 0xf0000; off < 0x100000; off += 0x10)
+    {
+        uint8_t sig[] = "RSD PTR ";
+        int i;
+
+        for (i = 0; i < sizeof sig - 1; ++i) {
+            sig[i] = readb(off + i);
+        }
+
+        if (!memcmp(sig, "RSD PTR ", sizeof sig)) {
+            break;
+        }
+    }
+
+    g_assert_cmphex(off, <, 0x100000);
+
+    qtest_quit(global_qtest);
+    g_free(args);
+}
+
+static void test_acpi_tcg(void)
+{
+    /* Supplying -machine accel argument overrides the default (qtest).
+     * This is to make guest actually run.
+     */
+    test_acpi_one("-machine accel=tcg");
+}
+
+int main(int argc, char *argv[])
+{
+    const char *arch = qtest_get_arch();
+    FILE *f = fopen(disk, "w");
+    fwrite(boot_sector, 1, sizeof boot_sector, f);
+    fclose(f);
+
+    g_test_init(&argc, &argv, NULL);
+
+    if (strcmp(arch, "i386") == 0 || strcmp(arch, "x86_64") == 0) {
+        qtest_add_func("acpi/tcg", test_acpi_tcg);
+    }
+    return g_test_run();
+}
diff --git a/tests/Makefile b/tests/Makefile
index 379cdd9..8d25878 100644
--- a/tests/Makefile
+++ b/tests/Makefile
@@ -64,6 +64,7 @@ check-qtest-i386-y += tests/ide-test$(EXESUF)
 check-qtest-i386-y += tests/hd-geo-test$(EXESUF)
 gcov-files-i386-y += hw/hd-geometry.c
 check-qtest-i386-y += tests/boot-order-test$(EXESUF)
+check-qtest-i386-y += tests/acpi-test$(EXESUF)
 check-qtest-i386-y += tests/rtc-test$(EXESUF)
 check-qtest-i386-y += tests/i440fx-test$(EXESUF)
 check-qtest-i386-y += tests/fw_cfg-test$(EXESUF)
@@ -198,6 +199,7 @@ tests/fdc-test$(EXESUF): tests/fdc-test.o
 tests/ide-test$(EXESUF): tests/ide-test.o $(libqos-pc-obj-y)
 tests/hd-geo-test$(EXESUF): tests/hd-geo-test.o
 tests/boot-order-test$(EXESUF): tests/boot-order-test.o $(libqos-obj-y)
+tests/acpi-test$(EXESUF): tests/acpi-test.o $(libqos-obj-y)
 tests/tmp105-test$(EXESUF): tests/tmp105-test.o $(libqos-omap-obj-y)
 tests/i440fx-test$(EXESUF): tests/i440fx-test.o $(libqos-pc-obj-y)
 tests/fw_cfg-test$(EXESUF): tests/fw_cfg-test.o $(libqos-pc-obj-y)
-- 
MST

* [Qemu-devel] [PULL 05/28] MAINTAINERS: update X86 machine entry
  2013-12-11 18:30 [Qemu-devel] [PULL 00/28] acpi.pci,pc,memory core fixes Michael S. Tsirkin
                   ` (3 preceding siblings ...)
  2013-12-11 18:30 ` [Qemu-devel] [PULL 04/28] acpi-test: basic acpi unit-test Michael S. Tsirkin
@ 2013-12-11 18:30 ` Michael S. Tsirkin
  2013-12-11 18:30 ` [Qemu-devel] [PULL 06/28] pci: fix address space size for bridge Michael S. Tsirkin
                   ` (22 subsequent siblings)
  27 siblings, 0 replies; 74+ messages in thread
From: Michael S. Tsirkin @ 2013-12-11 18:30 UTC (permalink / raw)
  To: qemu-devel

Add a bunch of missing files, and add myself as maintainer.  Since I'm
hacking on these anyway, it will be helpful if people Cc me on patches.
Anthony gets to review everything anyway ...

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 MAINTAINERS | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 3e61ac8..e250d72 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -500,9 +500,23 @@ X86 Machines
 ------------
 PC
 M: Anthony Liguori <aliguori@amazon.com>
+M: Michael S. Tsirkin <mst@redhat.com>
 S: Supported
-F: hw/i386/pc.[ch]
-F: hw/i386/pc_piix.c
+F: include/hw/i386/
+F: hw/i386/
+F: hw/pci-host/piix.c
+F: hw/pci-host/q35.c
+F: hw/pci-host/pam.c
+F: include/hw/pci-host/q35.h
+F: include/hw/pci-host/pam.h
+F: hw/isa/piix4.c
+F: hw/isa/lpc_ich9.c
+F: hw/i2c/smbus_ich9.c
+F: hw/acpi/piix4.c
+F: hw/acpi/ich9.c
+F: include/hw/acpi/ich9.h
+F: include/hw/acpi/piix.h
+
 
 Xtensa Machines
 ---------------
-- 
MST

* [Qemu-devel] [PULL 06/28] pci: fix address space size for bridge
  2013-12-11 18:30 [Qemu-devel] [PULL 00/28] acpi.pci,pc,memory core fixes Michael S. Tsirkin
                   ` (4 preceding siblings ...)
  2013-12-11 18:30 ` [Qemu-devel] [PULL 05/28] MAINTAINERS: update X86 machine entry Michael S. Tsirkin
@ 2013-12-11 18:30 ` Michael S. Tsirkin
  2013-12-11 18:30 ` [Qemu-devel] [PULL 07/28] pc: s/INT64_MAX/UINT64_MAX/ Michael S. Tsirkin
                   ` (21 subsequent siblings)
  27 siblings, 0 replies; 74+ messages in thread
From: Michael S. Tsirkin @ 2013-12-11 18:30 UTC (permalink / raw)
  To: qemu-devel

The address space size for a bridge should be the full 64 bits,
so we should use UINT64_MAX, not INT64_MAX, as its size.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 hw/pci/pci_bridge.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/pci/pci_bridge.c b/hw/pci/pci_bridge.c
index 290abab..f72872e 100644
--- a/hw/pci/pci_bridge.c
+++ b/hw/pci/pci_bridge.c
@@ -372,7 +372,7 @@ int pci_bridge_initfn(PCIDevice *dev, const char *typename)
     sec_bus->parent_dev = dev;
     sec_bus->map_irq = br->map_irq ? br->map_irq : pci_swizzle_map_irq_fn;
     sec_bus->address_space_mem = &br->address_space_mem;
-    memory_region_init(&br->address_space_mem, OBJECT(br), "pci_bridge_pci", INT64_MAX);
+    memory_region_init(&br->address_space_mem, OBJECT(br), "pci_bridge_pci", UINT64_MAX);
     sec_bus->address_space_io = &br->address_space_io;
     memory_region_init(&br->address_space_io, OBJECT(br), "pci_bridge_io", 65536);
     br->windows = pci_bridge_region_init(br);
-- 
MST

* [Qemu-devel] [PULL 07/28] pc: s/INT64_MAX/UINT64_MAX/
  2013-12-11 18:30 [Qemu-devel] [PULL 00/28] acpi.pci,pc,memory core fixes Michael S. Tsirkin
                   ` (5 preceding siblings ...)
  2013-12-11 18:30 ` [Qemu-devel] [PULL 06/28] pci: fix address space size for bridge Michael S. Tsirkin
@ 2013-12-11 18:30 ` Michael S. Tsirkin
  2013-12-11 18:30 ` [Qemu-devel] [PULL 08/28] spapr_pci: s/INT64_MAX/UINT64_MAX/ Michael S. Tsirkin
                   ` (20 subsequent siblings)
  27 siblings, 0 replies; 74+ messages in thread
From: Michael S. Tsirkin @ 2013-12-11 18:30 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paolo Bonzini, Anthony Liguori, Luiz Capitulino

From: Paolo Bonzini <pbonzini@redhat.com>

It doesn't make sense for a region to be INT64_MAX in size:
the memory core uses UINT64_MAX as a special value meaning
"all 64 bits", and that is what was meant here.

While this should never affect the PC system, which at the moment always
has a size below 63 bits, it makes us hit all kinds of corner-case bugs
with sub-pages, so users are probably better off if we just use
UINT64_MAX instead.
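
To make the difference concrete (numbers spelled out here for
illustration): INT64_MAX is 0x7fffffffffffffff, i.e. a region of
2^63 - 1 bytes, an odd size whose end is not page aligned and therefore
drags in the sub-page machinery, while UINT64_MAX is the value the
memory core treats as "cover the entire 64-bit address space".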

Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
Tested-by: Luiz Capitulino <lcapitulino@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 hw/i386/pc_piix.c | 2 +-
 hw/i386/pc_q35.c  | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
index 636f59f..646b65f 100644
--- a/hw/i386/pc_piix.c
+++ b/hw/i386/pc_piix.c
@@ -114,7 +114,7 @@ static void pc_init1(QEMUMachineInitArgs *args,
 
     if (pci_enabled) {
         pci_memory = g_new(MemoryRegion, 1);
-        memory_region_init(pci_memory, NULL, "pci", INT64_MAX);
+        memory_region_init(pci_memory, NULL, "pci", UINT64_MAX);
         rom_memory = pci_memory;
     } else {
         pci_memory = NULL;
diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
index 97aa842..4c47026 100644
--- a/hw/i386/pc_q35.c
+++ b/hw/i386/pc_q35.c
@@ -101,7 +101,7 @@ static void pc_q35_init(QEMUMachineInitArgs *args)
     /* pci enabled */
     if (pci_enabled) {
         pci_memory = g_new(MemoryRegion, 1);
-        memory_region_init(pci_memory, NULL, "pci", INT64_MAX);
+        memory_region_init(pci_memory, NULL, "pci", UINT64_MAX);
         rom_memory = pci_memory;
     } else {
         pci_memory = NULL;
-- 
MST

* [Qemu-devel] [PULL 08/28] spapr_pci: s/INT64_MAX/UINT64_MAX/
  2013-12-11 18:30 [Qemu-devel] [PULL 00/28] acpi.pci,pc,memory core fixes Michael S. Tsirkin
                   ` (6 preceding siblings ...)
  2013-12-11 18:30 ` [Qemu-devel] [PULL 07/28] pc: s/INT64_MAX/UINT64_MAX/ Michael S. Tsirkin
@ 2013-12-11 18:30 ` Michael S. Tsirkin
  2013-12-11 18:30 ` [Qemu-devel] [PULL 09/28] split definitions for exec.c and translate-all.c radix trees Michael S. Tsirkin
                   ` (19 subsequent siblings)
  27 siblings, 0 replies; 74+ messages in thread
From: Michael S. Tsirkin @ 2013-12-11 18:30 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-ppc, Alexander Graf

It doesn't make sense for a region to be INT64_MAX in size:
the memory core uses UINT64_MAX as a special value meaning
"all 64 bits", and that is what was meant here.

While this should never affect the spapr system, which at the moment always
has a size below 63 bits, it makes us hit all kinds of corner-case bugs
with sub-pages, so users are probably better off if we just use
UINT64_MAX instead.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Alexander Graf <agraf@suse.de>
---
 hw/ppc/spapr_pci.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index edb4cb0..2beedd4 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -555,7 +555,7 @@ static int spapr_phb_init(SysBusDevice *s)
 
     /* Initialize memory regions */
     sprintf(namebuf, "%s.mmio", sphb->dtbusname);
-    memory_region_init(&sphb->memspace, OBJECT(sphb), namebuf, INT64_MAX);
+    memory_region_init(&sphb->memspace, OBJECT(sphb), namebuf, UINT64_MAX);
 
     sprintf(namebuf, "%s.mmio-alias", sphb->dtbusname);
     memory_region_init_alias(&sphb->memwindow, OBJECT(sphb),
-- 
MST

* [Qemu-devel] [PULL 09/28] split definitions for exec.c and translate-all.c radix trees
  2013-12-11 18:30 [Qemu-devel] [PULL 00/28] acpi.pci,pc,memory core fixes Michael S. Tsirkin
                   ` (7 preceding siblings ...)
  2013-12-11 18:30 ` [Qemu-devel] [PULL 08/28] spapr_pci: s/INT64_MAX/UINT64_MAX/ Michael S. Tsirkin
@ 2013-12-11 18:30 ` Michael S. Tsirkin
  2013-12-11 18:30 ` [Qemu-devel] [PULL 10/28] exec: replace leaf with skip Michael S. Tsirkin
                   ` (18 subsequent siblings)
  27 siblings, 0 replies; 74+ messages in thread
From: Michael S. Tsirkin @ 2013-12-11 18:30 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paolo Bonzini

From: Paolo Bonzini <pbonzini@redhat.com>

The exec.c and translate-all.c radix trees are quite different, and
the exec.c one in particular is not limited to the CPU: it can also be
used by devices that do DMA, and in that case the address space is not
limited to TARGET_PHYS_ADDR_SPACE_BITS bits.

We want to make exec.c's radix trees 64-bit wide.  As a first step,
stop sharing the constants between exec.c and translate-all.c.
exec.c gets P_L2_* constants, translate-all.c gets V_L2_*, for
consistency with the existing V_L1_* symbols.  Though actually
in the softmmu case translate-all.c is also indexed by physical
addresses...

This patch has no semantic change.
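
For scale (worked out here, it is not in the commit message): with the
usual 4 KiB target pages and the 10-bit P_L2_BITS defined in this patch,
a full 64-bit physical address space needs
P_L2_LEVELS = ((64 - 12 - 1) / 10) + 1 = 6 levels of tables.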

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 translate-all.h |  7 -------
 exec.c          | 29 +++++++++++++++++++++--------
 translate-all.c | 32 ++++++++++++++++++--------------
 3 files changed, 39 insertions(+), 29 deletions(-)

diff --git a/translate-all.h b/translate-all.h
index 5c38819..f7e5932 100644
--- a/translate-all.h
+++ b/translate-all.h
@@ -19,13 +19,6 @@
 #ifndef TRANSLATE_ALL_H
 #define TRANSLATE_ALL_H
 
-/* Size of the L2 (and L3, etc) page tables.  */
-#define L2_BITS 10
-#define L2_SIZE (1 << L2_BITS)
-
-#define P_L2_LEVELS \
-    (((TARGET_PHYS_ADDR_SPACE_BITS - TARGET_PAGE_BITS - 1) / L2_BITS) + 1)
-
 /* translate-all.c */
 void tb_invalidate_phys_page_fast(tb_page_addr_t start, int len);
 void cpu_unlink_tb(CPUState *cpu);
diff --git a/exec.c b/exec.c
index f4b9ef2..060f3f3 100644
--- a/exec.c
+++ b/exec.c
@@ -88,7 +88,15 @@ struct PhysPageEntry {
     uint16_t ptr : 15;
 };
 
-typedef PhysPageEntry Node[L2_SIZE];
+/* Size of the L2 (and L3, etc) page tables.  */
+#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
+
+#define P_L2_BITS 10
+#define P_L2_SIZE (1 << P_L2_BITS)
+
+#define P_L2_LEVELS (((ADDR_SPACE_BITS - TARGET_PAGE_BITS - 1) / P_L2_BITS) + 1)
+
+typedef PhysPageEntry Node[P_L2_SIZE];
 
 struct AddressSpaceDispatch {
     /* This is a multi-level map on the physical address space.
@@ -155,7 +163,7 @@ static uint16_t phys_map_node_alloc(void)
     ret = next_map.nodes_nb++;
     assert(ret != PHYS_MAP_NODE_NIL);
     assert(ret != next_map.nodes_nb_alloc);
-    for (i = 0; i < L2_SIZE; ++i) {
+    for (i = 0; i < P_L2_SIZE; ++i) {
         next_map.nodes[ret][i].is_leaf = 0;
         next_map.nodes[ret][i].ptr = PHYS_MAP_NODE_NIL;
     }
@@ -168,13 +176,13 @@ static void phys_page_set_level(PhysPageEntry *lp, hwaddr *index,
 {
     PhysPageEntry *p;
     int i;
-    hwaddr step = (hwaddr)1 << (level * L2_BITS);
+    hwaddr step = (hwaddr)1 << (level * P_L2_BITS);
 
     if (!lp->is_leaf && lp->ptr == PHYS_MAP_NODE_NIL) {
         lp->ptr = phys_map_node_alloc();
         p = next_map.nodes[lp->ptr];
         if (level == 0) {
-            for (i = 0; i < L2_SIZE; i++) {
+            for (i = 0; i < P_L2_SIZE; i++) {
                 p[i].is_leaf = 1;
                 p[i].ptr = PHYS_SECTION_UNASSIGNED;
             }
@@ -182,9 +190,9 @@ static void phys_page_set_level(PhysPageEntry *lp, hwaddr *index,
     } else {
         p = next_map.nodes[lp->ptr];
     }
-    lp = &p[(*index >> (level * L2_BITS)) & (L2_SIZE - 1)];
+    lp = &p[(*index >> (level * P_L2_BITS)) & (P_L2_SIZE - 1)];
 
-    while (*nb && lp < &p[L2_SIZE]) {
+    while (*nb && lp < &p[P_L2_SIZE]) {
         if ((*index & (step - 1)) == 0 && *nb >= step) {
             lp->is_leaf = true;
             lp->ptr = leaf;
@@ -218,7 +226,7 @@ static MemoryRegionSection *phys_page_find(PhysPageEntry lp, hwaddr index,
             return &sections[PHYS_SECTION_UNASSIGNED];
         }
         p = nodes[lp.ptr];
-        lp = p[(index >> (i * L2_BITS)) & (L2_SIZE - 1)];
+        lp = p[(index >> (i * P_L2_BITS)) & (P_L2_SIZE - 1)];
     }
     return &sections[lp.ptr];
 }
@@ -1778,7 +1786,12 @@ void address_space_destroy_dispatch(AddressSpace *as)
 static void memory_map_init(void)
 {
     system_memory = g_malloc(sizeof(*system_memory));
-    memory_region_init(system_memory, NULL, "system", INT64_MAX);
+
+    assert(ADDR_SPACE_BITS <= 64);
+
+    memory_region_init(system_memory, NULL, "system",
+                       ADDR_SPACE_BITS == 64 ?
+                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
     address_space_init(&address_space_memory, system_memory, "memory");
 
     system_io = g_malloc(sizeof(*system_io));
diff --git a/translate-all.c b/translate-all.c
index aeda54d..1c63d78 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -96,12 +96,16 @@ typedef struct PageDesc {
 # define L1_MAP_ADDR_SPACE_BITS  TARGET_VIRT_ADDR_SPACE_BITS
 #endif
 
+/* Size of the L2 (and L3, etc) page tables.  */
+#define V_L2_BITS 10
+#define V_L2_SIZE (1 << V_L2_BITS)
+
 /* The bits remaining after N lower levels of page tables.  */
 #define V_L1_BITS_REM \
-    ((L1_MAP_ADDR_SPACE_BITS - TARGET_PAGE_BITS) % L2_BITS)
+    ((L1_MAP_ADDR_SPACE_BITS - TARGET_PAGE_BITS) % V_L2_BITS)
 
 #if V_L1_BITS_REM < 4
-#define V_L1_BITS  (V_L1_BITS_REM + L2_BITS)
+#define V_L1_BITS  (V_L1_BITS_REM + V_L2_BITS)
 #else
 #define V_L1_BITS  V_L1_BITS_REM
 #endif
@@ -395,18 +399,18 @@ static PageDesc *page_find_alloc(tb_page_addr_t index, int alloc)
     lp = l1_map + ((index >> V_L1_SHIFT) & (V_L1_SIZE - 1));
 
     /* Level 2..N-1.  */
-    for (i = V_L1_SHIFT / L2_BITS - 1; i > 0; i--) {
+    for (i = V_L1_SHIFT / V_L2_BITS - 1; i > 0; i--) {
         void **p = *lp;
 
         if (p == NULL) {
             if (!alloc) {
                 return NULL;
             }
-            ALLOC(p, sizeof(void *) * L2_SIZE);
+            ALLOC(p, sizeof(void *) * V_L2_SIZE);
             *lp = p;
         }
 
-        lp = p + ((index >> (i * L2_BITS)) & (L2_SIZE - 1));
+        lp = p + ((index >> (i * V_L2_BITS)) & (V_L2_SIZE - 1));
     }
 
     pd = *lp;
@@ -414,13 +418,13 @@ static PageDesc *page_find_alloc(tb_page_addr_t index, int alloc)
         if (!alloc) {
             return NULL;
         }
-        ALLOC(pd, sizeof(PageDesc) * L2_SIZE);
+        ALLOC(pd, sizeof(PageDesc) * V_L2_SIZE);
         *lp = pd;
     }
 
 #undef ALLOC
 
-    return pd + (index & (L2_SIZE - 1));
+    return pd + (index & (V_L2_SIZE - 1));
 }
 
 static inline PageDesc *page_find(tb_page_addr_t index)
@@ -655,14 +659,14 @@ static void page_flush_tb_1(int level, void **lp)
     if (level == 0) {
         PageDesc *pd = *lp;
 
-        for (i = 0; i < L2_SIZE; ++i) {
+        for (i = 0; i < V_L2_SIZE; ++i) {
             pd[i].first_tb = NULL;
             invalidate_page_bitmap(pd + i);
         }
     } else {
         void **pp = *lp;
 
-        for (i = 0; i < L2_SIZE; ++i) {
+        for (i = 0; i < V_L2_SIZE; ++i) {
             page_flush_tb_1(level - 1, pp + i);
         }
     }
@@ -673,7 +677,7 @@ static void page_flush_tb(void)
     int i;
 
     for (i = 0; i < V_L1_SIZE; i++) {
-        page_flush_tb_1(V_L1_SHIFT / L2_BITS - 1, l1_map + i);
+        page_flush_tb_1(V_L1_SHIFT / V_L2_BITS - 1, l1_map + i);
     }
 }
 
@@ -1600,7 +1604,7 @@ static int walk_memory_regions_1(struct walk_memory_regions_data *data,
     if (level == 0) {
         PageDesc *pd = *lp;
 
-        for (i = 0; i < L2_SIZE; ++i) {
+        for (i = 0; i < V_L2_SIZE; ++i) {
             int prot = pd[i].flags;
 
             pa = base | (i << TARGET_PAGE_BITS);
@@ -1614,9 +1618,9 @@ static int walk_memory_regions_1(struct walk_memory_regions_data *data,
     } else {
         void **pp = *lp;
 
-        for (i = 0; i < L2_SIZE; ++i) {
+        for (i = 0; i < V_L2_SIZE; ++i) {
             pa = base | ((abi_ulong)i <<
-                (TARGET_PAGE_BITS + L2_BITS * level));
+                (TARGET_PAGE_BITS + V_L2_BITS * level));
             rc = walk_memory_regions_1(data, pa, level - 1, pp + i);
             if (rc != 0) {
                 return rc;
@@ -1639,7 +1643,7 @@ int walk_memory_regions(void *priv, walk_memory_regions_fn fn)
 
     for (i = 0; i < V_L1_SIZE; i++) {
         int rc = walk_memory_regions_1(&data, (abi_ulong)i << V_L1_SHIFT,
-                                       V_L1_SHIFT / L2_BITS - 1, l1_map + i);
+                                       V_L1_SHIFT / V_L2_BITS - 1, l1_map + i);
 
         if (rc != 0) {
             return rc;
-- 
MST

* [Qemu-devel] [PULL 10/28] exec: replace leaf with skip
  2013-12-11 18:30 [Qemu-devel] [PULL 00/28] acpi.pci,pc,memory core fixes Michael S. Tsirkin
                   ` (8 preceding siblings ...)
  2013-12-11 18:30 ` [Qemu-devel] [PULL 09/28] split definitions for exec.c and translate-all.c radix trees Michael S. Tsirkin
@ 2013-12-11 18:30 ` Michael S. Tsirkin
  2013-12-11 18:30 ` [Qemu-devel] [PULL 11/28] exec: extend skip field to 6 bit, page entry to 32 bit Michael S. Tsirkin
                   ` (17 subsequent siblings)
  27 siblings, 0 replies; 74+ messages in thread
From: Michael S. Tsirkin @ 2013-12-11 18:30 UTC (permalink / raw)
  To: qemu-devel

In preparation for dynamic radix tree depth support, rename the is_leaf
field to skip, telling us how many bits to skip to the next level.
It is set to 0 for a leaf.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 exec.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/exec.c b/exec.c
index 060f3f3..e3e5bc0 100644
--- a/exec.c
+++ b/exec.c
@@ -83,8 +83,9 @@ int use_icount;
 typedef struct PhysPageEntry PhysPageEntry;
 
 struct PhysPageEntry {
-    uint16_t is_leaf : 1;
-     /* index into phys_sections (is_leaf) or phys_map_nodes (!is_leaf) */
+    /* How many bits skip to next level (in units of L2_SIZE). 0 for a leaf. */
+    uint16_t skip : 1;
+     /* index into phys_sections (!skip) or phys_map_nodes (skip) */
     uint16_t ptr : 15;
 };
 
@@ -164,7 +165,7 @@ static uint16_t phys_map_node_alloc(void)
     assert(ret != PHYS_MAP_NODE_NIL);
     assert(ret != next_map.nodes_nb_alloc);
     for (i = 0; i < P_L2_SIZE; ++i) {
-        next_map.nodes[ret][i].is_leaf = 0;
+        next_map.nodes[ret][i].skip = 1;
         next_map.nodes[ret][i].ptr = PHYS_MAP_NODE_NIL;
     }
     return ret;
@@ -178,12 +179,12 @@ static void phys_page_set_level(PhysPageEntry *lp, hwaddr *index,
     int i;
     hwaddr step = (hwaddr)1 << (level * P_L2_BITS);
 
-    if (!lp->is_leaf && lp->ptr == PHYS_MAP_NODE_NIL) {
+    if (lp->skip && lp->ptr == PHYS_MAP_NODE_NIL) {
         lp->ptr = phys_map_node_alloc();
         p = next_map.nodes[lp->ptr];
         if (level == 0) {
             for (i = 0; i < P_L2_SIZE; i++) {
-                p[i].is_leaf = 1;
+                p[i].skip = 0;
                 p[i].ptr = PHYS_SECTION_UNASSIGNED;
             }
         }
@@ -194,7 +195,7 @@ static void phys_page_set_level(PhysPageEntry *lp, hwaddr *index,
 
     while (*nb && lp < &p[P_L2_SIZE]) {
         if ((*index & (step - 1)) == 0 && *nb >= step) {
-            lp->is_leaf = true;
+            lp->skip = 0;
             lp->ptr = leaf;
             *index += step;
             *nb -= step;
@@ -221,7 +222,7 @@ static MemoryRegionSection *phys_page_find(PhysPageEntry lp, hwaddr index,
     PhysPageEntry *p;
     int i;
 
-    for (i = P_L2_LEVELS - 1; i >= 0 && !lp.is_leaf; i--) {
+    for (i = P_L2_LEVELS; lp.skip && (i -= lp.skip) >= 0;) {
         if (lp.ptr == PHYS_MAP_NODE_NIL) {
             return &sections[PHYS_SECTION_UNASSIGNED];
         }
@@ -1681,7 +1682,7 @@ static void mem_begin(MemoryListener *listener)
     AddressSpace *as = container_of(listener, AddressSpace, dispatch_listener);
     AddressSpaceDispatch *d = g_new(AddressSpaceDispatch, 1);
 
-    d->phys_map  = (PhysPageEntry) { .ptr = PHYS_MAP_NODE_NIL, .is_leaf = 0 };
+    d->phys_map  = (PhysPageEntry) { .ptr = PHYS_MAP_NODE_NIL, .skip = 1 };
     d->as = as;
     as->next_dispatch = d;
 }
-- 
MST

* [Qemu-devel] [PULL 11/28] exec: extend skip field to 6 bit, page entry to 32 bit
  2013-12-11 18:30 [Qemu-devel] [PULL 00/28] acpi.pci,pc,memory core fixes Michael S. Tsirkin
                   ` (9 preceding siblings ...)
  2013-12-11 18:30 ` [Qemu-devel] [PULL 10/28] exec: replace leaf with skip Michael S. Tsirkin
@ 2013-12-11 18:30 ` Michael S. Tsirkin
  2013-12-11 18:30 ` [Qemu-devel] [PULL 12/28] exec: pass hw address to phys_page_find Michael S. Tsirkin
                   ` (16 subsequent siblings)
  27 siblings, 0 replies; 74+ messages in thread
From: Michael S. Tsirkin @ 2013-12-11 18:30 UTC (permalink / raw)
  To: qemu-devel

Extend skip to 6 bits. As the page entry no longer fits in 16 bits
anyway, extend it to 32 bits.
This doubles the node map memory requirements, but follow-up
patches will recover this memory.
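
Spelled out as a standalone comparison (the Old/New struct names below
are made up; the field layouts match the code):

    #include <stdint.h>

    /* Before this patch: 2 bytes per entry. */
    struct OldPhysPageEntry { uint16_t skip : 1; uint16_t ptr : 15; };

    /* After this patch: 4 bytes per entry. */
    struct NewPhysPageEntry { uint32_t skip : 6; uint32_t ptr : 26; };

    /* A node holds P_L2_SIZE (1 << 10) entries, so each node grows
     * from 2 KiB to 4 KiB, which is where the doubling comes from. */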

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 exec.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/exec.c b/exec.c
index e3e5bc0..154ae97 100644
--- a/exec.c
+++ b/exec.c
@@ -84,11 +84,13 @@ typedef struct PhysPageEntry PhysPageEntry;
 
 struct PhysPageEntry {
     /* How many bits skip to next level (in units of L2_SIZE). 0 for a leaf. */
-    uint16_t skip : 1;
+    uint32_t skip : 6;
      /* index into phys_sections (!skip) or phys_map_nodes (skip) */
-    uint16_t ptr : 15;
+    uint32_t ptr : 26;
 };
 
+#define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
+
 /* Size of the L2 (and L3, etc) page tables.  */
 #define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
 
@@ -134,8 +136,6 @@ typedef struct PhysPageMap {
 static PhysPageMap *prev_map;
 static PhysPageMap next_map;
 
-#define PHYS_MAP_NODE_NIL (((uint16_t)~0) >> 1)
-
 static void io_mem_init(void);
 static void memory_map_init(void);
 
@@ -156,10 +156,10 @@ static void phys_map_node_reserve(unsigned nodes)
     }
 }
 
-static uint16_t phys_map_node_alloc(void)
+static uint32_t phys_map_node_alloc(void)
 {
     unsigned i;
-    uint16_t ret;
+    uint32_t ret;
 
     ret = next_map.nodes_nb++;
     assert(ret != PHYS_MAP_NODE_NIL);
-- 
MST

* [Qemu-devel] [PULL 12/28] exec: pass hw address to phys_page_find
  2013-12-11 18:30 [Qemu-devel] [PULL 00/28] acpi.pci,pc,memory core fixes Michael S. Tsirkin
                   ` (10 preceding siblings ...)
  2013-12-11 18:30 ` [Qemu-devel] [PULL 11/28] exec: extend skip field to 6 bit, page entry to 32 bit Michael S. Tsirkin
@ 2013-12-11 18:30 ` Michael S. Tsirkin
  2013-12-11 18:30 ` [Qemu-devel] [PULL 13/28] exec: memory radix tree page level compression Michael S. Tsirkin
                   ` (15 subsequent siblings)
  27 siblings, 0 replies; 74+ messages in thread
From: Michael S. Tsirkin @ 2013-12-11 18:30 UTC (permalink / raw)
  To: qemu-devel

Callers always shift by target page bits, so let's just do this
internally.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 exec.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/exec.c b/exec.c
index 154ae97..b528dad 100644
--- a/exec.c
+++ b/exec.c
@@ -216,10 +216,11 @@ static void phys_page_set(AddressSpaceDispatch *d,
     phys_page_set_level(&d->phys_map, &index, &nb, leaf, P_L2_LEVELS - 1);
 }
 
-static MemoryRegionSection *phys_page_find(PhysPageEntry lp, hwaddr index,
+static MemoryRegionSection *phys_page_find(PhysPageEntry lp, hwaddr addr,
                                            Node *nodes, MemoryRegionSection *sections)
 {
     PhysPageEntry *p;
+    hwaddr index = addr >> TARGET_PAGE_BITS;
     int i;
 
     for (i = P_L2_LEVELS; lp.skip && (i -= lp.skip) >= 0;) {
@@ -245,8 +246,7 @@ static MemoryRegionSection *address_space_lookup_region(AddressSpaceDispatch *d,
     MemoryRegionSection *section;
     subpage_t *subpage;
 
-    section = phys_page_find(d->phys_map, addr >> TARGET_PAGE_BITS,
-                             d->nodes, d->sections);
+    section = phys_page_find(d->phys_map, addr, d->nodes, d->sections);
     if (resolve_subpage && section->mr->subpage) {
         subpage = container_of(section->mr, subpage_t, iomem);
         section = &d->sections[subpage->sub_section[SUBPAGE_IDX(addr)]];
@@ -802,7 +802,7 @@ static void register_subpage(AddressSpaceDispatch *d, MemoryRegionSection *secti
     subpage_t *subpage;
     hwaddr base = section->offset_within_address_space
         & TARGET_PAGE_MASK;
-    MemoryRegionSection *existing = phys_page_find(d->phys_map, base >> TARGET_PAGE_BITS,
+    MemoryRegionSection *existing = phys_page_find(d->phys_map, base,
                                                    next_map.nodes, next_map.sections);
     MemoryRegionSection subsection = {
         .offset_within_address_space = base,
-- 
MST

* [Qemu-devel] [PULL 13/28] exec: memory radix tree page level compression
  2013-12-11 18:30 [Qemu-devel] [PULL 00/28] acpi.pci,pc,memory core fixes Michael S. Tsirkin
                   ` (11 preceding siblings ...)
  2013-12-11 18:30 ` [Qemu-devel] [PULL 12/28] exec: pass hw address to phys_page_find Michael S. Tsirkin
@ 2013-12-11 18:30 ` Michael S. Tsirkin
  2013-12-11 18:30 ` [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide Michael S. Tsirkin
                   ` (14 subsequent siblings)
  27 siblings, 0 replies; 74+ messages in thread
From: Michael S. Tsirkin @ 2013-12-11 18:30 UTC (permalink / raw)
  To: qemu-devel

At the moment, the memory radix tree is already variable width, but it
can only skip the low bits of the address.

This is efficient if we have huge memory regions, but inefficient if we
are only using a tiny portion of the address space.

After we have built up the map, detect configurations where a single L2
entry is valid.

We then speed up the lookup by skipping one or more levels.
If any levels were skipped, an address that should be unassigned might
land in a valid section instead of erroring out. We handle this by
checking that the address is in range of the resulting section.
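
As a concrete illustration (numbers assumed: 4 KiB pages and 10-bit
levels, i.e. six levels for a 64-bit space): a guest that only populates
the low 4 GiB leaves address bits 32 and up at zero, so each of the top
four table levels has exactly one valid entry; after compression the
root entry's skip jumps straight over them and a lookup only walks the
populated bottom levels.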

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 exec.c | 75 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 74 insertions(+), 1 deletion(-)

diff --git a/exec.c b/exec.c
index b528dad..7e5ce93 100644
--- a/exec.c
+++ b/exec.c
@@ -51,6 +51,8 @@
 
 #include "exec/memory-internal.h"
 
+#include "qemu/range.h"
+
 //#define DEBUG_SUBPAGE
 
 #if !defined(CONFIG_USER_ONLY)
@@ -216,6 +218,68 @@ static void phys_page_set(AddressSpaceDispatch *d,
     phys_page_set_level(&d->phys_map, &index, &nb, leaf, P_L2_LEVELS - 1);
 }
 
+/* Compact a non leaf page entry. Simply detect that the entry has a single child,
+ * and update our entry so we can skip it and go directly to the destination.
+ */
+static void phys_page_compact(PhysPageEntry *lp, Node *nodes, unsigned long *compacted)
+{
+    unsigned valid_ptr = P_L2_SIZE;
+    int valid = 0;
+    PhysPageEntry *p;
+    int i;
+
+    if (lp->ptr == PHYS_MAP_NODE_NIL) {
+        return;
+    }
+
+    p = nodes[lp->ptr];
+    for (i = 0; i < P_L2_SIZE; i++) {
+        if (p[i].ptr == PHYS_MAP_NODE_NIL) {
+            continue;
+        }
+
+        valid_ptr = i;
+        valid++;
+        if (p[i].skip) {
+            phys_page_compact(&p[i], nodes, compacted);
+        }
+    }
+
+    /* We can only compress if there's only one child. */
+    if (valid != 1) {
+        return;
+    }
+
+    assert(valid_ptr < P_L2_SIZE);
+
+    /* Don't compress if it won't fit in the # of bits we have. */
+    if (lp->skip + p[valid_ptr].skip >= (1 << 3)) {
+        return;
+    }
+
+    lp->ptr = p[valid_ptr].ptr;
+    if (!p[valid_ptr].skip) {
+        /* If our only child is a leaf, make this a leaf. */
+        /* By design, we should have made this node a leaf to begin with so we
+         * should never reach here.
+         * But since it's so simple to handle this, let's do it just in case we
+         * change this rule.
+         */
+        lp->skip = 0;
+    } else {
+        lp->skip += p[valid_ptr].skip;
+    }
+}
+
+static void phys_page_compact_all(AddressSpaceDispatch *d, int nodes_nb)
+{
+    DECLARE_BITMAP(compacted, nodes_nb);
+
+    if (d->phys_map.skip) {
+        phys_page_compact(&d->phys_map, d->nodes, compacted);
+    }
+}
+
 static MemoryRegionSection *phys_page_find(PhysPageEntry lp, hwaddr addr,
                                            Node *nodes, MemoryRegionSection *sections)
 {
@@ -230,7 +294,14 @@ static MemoryRegionSection *phys_page_find(PhysPageEntry lp, hwaddr addr,
         p = nodes[lp.ptr];
         lp = p[(index >> (i * P_L2_BITS)) & (P_L2_SIZE - 1)];
     }
-    return &sections[lp.ptr];
+
+    if (sections[lp.ptr].size.hi ||
+        range_covers_byte(sections[lp.ptr].offset_within_address_space,
+                          sections[lp.ptr].size.lo, addr)) {
+        return &sections[lp.ptr];
+    } else {
+        return &sections[PHYS_SECTION_UNASSIGNED];
+    }
 }
 
 bool memory_region_is_unassigned(MemoryRegion *mr)
@@ -1696,6 +1767,8 @@ static void mem_commit(MemoryListener *listener)
     next->nodes = next_map.nodes;
     next->sections = next_map.sections;
 
+    phys_page_compact_all(next, next_map.nodes_nb);
+
     as->dispatch = next;
     g_free(cur);
 }
-- 
MST

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2013-12-11 18:30 [Qemu-devel] [PULL 00/28] acpi.pci,pc,memory core fixes Michael S. Tsirkin
                   ` (12 preceding siblings ...)
  2013-12-11 18:30 ` [Qemu-devel] [PULL 13/28] exec: memory radix tree page level compression Michael S. Tsirkin
@ 2013-12-11 18:30 ` Michael S. Tsirkin
  2014-01-09 17:24   ` Alex Williamson
  2013-12-11 18:30 ` [Qemu-devel] [PULL 15/28] exec: reduce L2_PAGE_SIZE Michael S. Tsirkin
                   ` (13 subsequent siblings)
  27 siblings, 1 reply; 74+ messages in thread
From: Michael S. Tsirkin @ 2013-12-11 18:30 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paolo Bonzini, Luiz Capitulino

From: Paolo Bonzini <pbonzini@redhat.com>

As an alternative to commit 818f86b (exec: limit system memory
size, 2013-11-04) let's just make all address spaces 64-bit wide.
This eliminates problems with phys_page_find ignoring bits above
TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
consequently messing up the computations.

In Luiz's reported crash, at startup gdb attempts to read from address
0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
is the newly introduced master abort region, which is as big as the PCI
address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
not 2^64.  But we get it anyway because phys_page_find ignores the upper
bits of the physical address.  In address_space_translate_internal then

    diff = int128_sub(section->mr->size, int128_make64(addr));
    *plen = int128_get64(int128_min(diff, int128_make64(*plen)));

diff becomes negative, and int128_get64 booms.
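
Spelling out the arithmetic from the report (numbers illustrative, as
described above):

    size = 2^63 - 1                        (master abort region, due to the typo)
    addr = 0xffffffffffffffe6 = 2^64 - 26  (gdb's read)
    diff = size - addr = (2^63 - 1) - (2^64 - 26) = 25 - 2^63  < 0

A negative diff no longer fits in a plain 64-bit length, which is where
int128_get64 booms.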

The size of the PCI address space region should be fixed anyway.

Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 exec.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/exec.c b/exec.c
index 7e5ce93..f907f5f 100644
--- a/exec.c
+++ b/exec.c
@@ -94,7 +94,7 @@ struct PhysPageEntry {
 #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
 
 /* Size of the L2 (and L3, etc) page tables.  */
-#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
+#define ADDR_SPACE_BITS 64
 
 #define P_L2_BITS 10
 #define P_L2_SIZE (1 << P_L2_BITS)
@@ -1861,11 +1861,7 @@ static void memory_map_init(void)
 {
     system_memory = g_malloc(sizeof(*system_memory));
 
-    assert(ADDR_SPACE_BITS <= 64);
-
-    memory_region_init(system_memory, NULL, "system",
-                       ADDR_SPACE_BITS == 64 ?
-                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
+    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
     address_space_init(&address_space_memory, system_memory, "memory");
 
     system_io = g_malloc(sizeof(*system_io));
-- 
MST

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [Qemu-devel] [PULL 15/28] exec: reduce L2_PAGE_SIZE
  2013-12-11 18:30 [Qemu-devel] [PULL 00/28] acpi.pci,pc,memory core fixes Michael S. Tsirkin
                   ` (13 preceding siblings ...)
  2013-12-11 18:30 ` [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide Michael S. Tsirkin
@ 2013-12-11 18:30 ` Michael S. Tsirkin
  2013-12-11 18:30 ` [Qemu-devel] [PULL 16/28] smbios: Set system manufacturer, product & version by default Michael S. Tsirkin
                   ` (12 subsequent siblings)
  27 siblings, 0 replies; 74+ messages in thread
From: Michael S. Tsirkin @ 2013-12-11 18:30 UTC (permalink / raw)
  To: qemu-devel

With the single exception of ppc with 16M pages,
we get the same number of levels
with L2_PAGE_SIZE = 10 as with L2_PAGE_SIZE = 9.

By doing this we reduce the memory footprint of a single level in the
node memory map by 2x, without any runtime overhead.
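
For reference, plugging the usual page sizes into the P_L2_LEVELS
formula below (16M pages taken as TARGET_PAGE_BITS=24):

    4K pages (TARGET_PAGE_BITS=12):
        P_L2_BITS=10: ((64 - 12 - 1) / 10) + 1 = 6 levels
        P_L2_BITS=9:  ((64 - 12 - 1) / 9)  + 1 = 6 levels
    16M pages (TARGET_PAGE_BITS=24):
        P_L2_BITS=10: ((64 - 24 - 1) / 10) + 1 = 4 levels
        P_L2_BITS=9:  ((64 - 24 - 1) / 9)  + 1 = 5 levels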

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 exec.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/exec.c b/exec.c
index f907f5f..67a073c 100644
--- a/exec.c
+++ b/exec.c
@@ -96,7 +96,7 @@ struct PhysPageEntry {
 /* Size of the L2 (and L3, etc) page tables.  */
 #define ADDR_SPACE_BITS 64
 
-#define P_L2_BITS 10
+#define P_L2_BITS 9
 #define P_L2_SIZE (1 << P_L2_BITS)
 
 #define P_L2_LEVELS (((ADDR_SPACE_BITS - TARGET_PAGE_BITS - 1) / P_L2_BITS) + 1)
-- 
MST

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [Qemu-devel] [PULL 16/28] smbios: Set system manufacturer, product & version by default
  2013-12-11 18:30 [Qemu-devel] [PULL 00/28] acpi.pci,pc,memory core fixes Michael S. Tsirkin
                   ` (14 preceding siblings ...)
  2013-12-11 18:30 ` [Qemu-devel] [PULL 15/28] exec: reduce L2_PAGE_SIZE Michael S. Tsirkin
@ 2013-12-11 18:30 ` Michael S. Tsirkin
  2013-12-11 18:31 ` [Qemu-devel] [PULL 17/28] acpi unit-test: verify signature and checksum Michael S. Tsirkin
                   ` (11 subsequent siblings)
  27 siblings, 0 replies; 74+ messages in thread
From: Michael S. Tsirkin @ 2013-12-11 18:30 UTC (permalink / raw)
  To: qemu-devel; +Cc: Markus Armbruster, Anthony Liguori, Eduardo Habkost

From: Markus Armbruster <armbru@redhat.com>

Currently, we get SeaBIOS defaults: manufacturer Bochs, product Bochs,
no version.  That's the best SeaBIOS can do, but we can provide better
defaults: manufacturer QEMU, product & version taken from QEMUMachine
desc and name.
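
For example, on a current i440FX machine the guest would then see
something along these lines (the version string is the machine name,
e.g. pc-i440fx-2.0, so it varies with the machine type used):

    Manufacturer: QEMU
    Product Name: Standard PC (i440FX + PIIX, 1996)
    Version:      pc-i440fx-2.0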

Take care to do this only for new machine types, of course.

Note: Michael Tsirkin doesn't trust us to keep values of QEMUMachine member
product stable in the future.  Use copies instead, and in a way that
makes it obvious that they're guest ABI.

Note that we can be trusted to keep values of member name, because
that has always been ABI.

Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/hw/i386/smbios.h |  2 ++
 hw/i386/pc_piix.c        | 24 +++++++++++++++++++++++-
 hw/i386/pc_q35.c         | 20 ++++++++++++++++++++
 hw/i386/smbios.c         | 14 ++++++++++++++
 4 files changed, 59 insertions(+), 1 deletion(-)

diff --git a/include/hw/i386/smbios.h b/include/hw/i386/smbios.h
index b08ec71..18fb970 100644
--- a/include/hw/i386/smbios.h
+++ b/include/hw/i386/smbios.h
@@ -16,6 +16,8 @@
 #include "qemu/option.h"
 
 void smbios_entry_add(QemuOpts *opts);
+void smbios_set_type1_defaults(const char *manufacturer,
+                               const char *product, const char *version);
 uint8_t *smbios_get_table(size_t *length);
 
 /*
diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
index 646b65f..9fc3b11 100644
--- a/hw/i386/pc_piix.c
+++ b/hw/i386/pc_piix.c
@@ -28,6 +28,7 @@
 #include "hw/loader.h"
 #include "hw/i386/pc.h"
 #include "hw/i386/apic.h"
+#include "hw/i386/smbios.h"
 #include "hw/pci/pci.h"
 #include "hw/pci/pci_ids.h"
 #include "hw/usb.h"
@@ -59,6 +60,7 @@ static const int ide_irq[MAX_IDE_BUS] = { 14, 15 };
 
 static bool has_pci_info;
 static bool has_acpi_build = true;
+static bool smbios_type1_defaults = true;
 
 /* PC hardware initialisation */
 static void pc_init1(QEMUMachineInitArgs *args,
@@ -128,6 +130,12 @@ static void pc_init1(QEMUMachineInitArgs *args,
     guest_info->has_pci_info = has_pci_info;
     guest_info->isapc_ram_fw = !pci_enabled;
 
+    if (smbios_type1_defaults) {
+        /* These values are guest ABI, do not change */
+        smbios_set_type1_defaults("QEMU", "Standard PC (i440FX + PIIX, 1996)",
+                                  args->machine->name);
+    }
+
     /* allocate ram and load rom/bios */
     if (!xen_enabled()) {
         fw_cfg = pc_memory_init(system_memory,
@@ -233,8 +241,14 @@ static void pc_init_pci(QEMUMachineInitArgs *args)
     pc_init1(args, 1, 1);
 }
 
+static void pc_compat_1_7(QEMUMachineInitArgs *args)
+{
+    smbios_type1_defaults = false;
+}
+
 static void pc_compat_1_6(QEMUMachineInitArgs *args)
 {
+    pc_compat_1_7(args);
     has_pci_info = false;
     rom_file_in_ram = false;
     has_acpi_build = false;
@@ -265,6 +279,12 @@ static void pc_compat_1_2(QEMUMachineInitArgs *args)
     disable_kvm_pv_eoi();
 }
 
+static void pc_init_pci_1_7(QEMUMachineInitArgs *args)
+{
+    pc_compat_1_7(args);
+    pc_init_pci(args);
+}
+
 static void pc_init_pci_1_6(QEMUMachineInitArgs *args)
 {
     pc_compat_1_6(args);
@@ -301,6 +321,7 @@ static void pc_init_pci_no_kvmclock(QEMUMachineInitArgs *args)
 {
     has_pci_info = false;
     has_acpi_build = false;
+    smbios_type1_defaults = false;
     disable_kvm_pv_eoi();
     enable_compat_apic_id_mode();
     pc_init1(args, 1, 0);
@@ -310,6 +331,7 @@ static void pc_init_isa(QEMUMachineInitArgs *args)
 {
     has_pci_info = false;
     has_acpi_build = false;
+    smbios_type1_defaults = false;
     if (!args->cpu_model) {
         args->cpu_model = "486";
     }
@@ -354,7 +376,7 @@ static QEMUMachine pc_i440fx_machine_v2_0 = {
 static QEMUMachine pc_i440fx_machine_v1_7 = {
     PC_I440FX_1_7_MACHINE_OPTIONS,
     .name = "pc-i440fx-1.7",
-    .init = pc_init_pci,
+    .init = pc_init_pci_1_7,
 };
 
 #define PC_I440FX_1_6_MACHINE_OPTIONS PC_I440FX_MACHINE_OPTIONS
diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
index 4c47026..b4e39f0 100644
--- a/hw/i386/pc_q35.c
+++ b/hw/i386/pc_q35.c
@@ -39,6 +39,7 @@
 #include "hw/pci-host/q35.h"
 #include "exec/address-spaces.h"
 #include "hw/i386/ich9.h"
+#include "hw/i386/smbios.h"
 #include "hw/ide/pci.h"
 #include "hw/ide/ahci.h"
 #include "hw/usb.h"
@@ -49,6 +50,7 @@
 
 static bool has_pci_info;
 static bool has_acpi_build = true;
+static bool smbios_type1_defaults = true;
 
 /* PC hardware initialisation */
 static void pc_q35_init(QEMUMachineInitArgs *args)
@@ -113,6 +115,12 @@ static void pc_q35_init(QEMUMachineInitArgs *args)
     guest_info->isapc_ram_fw = false;
     guest_info->has_acpi_build = has_acpi_build;
 
+    if (smbios_type1_defaults) {
+        /* These values are guest ABI, do not change */
+        smbios_set_type1_defaults("QEMU", "Standard PC (Q35 + ICH9, 2009)",
+                                  args->machine->name);
+    }
+
     /* allocate ram and load rom/bios */
     if (!xen_enabled()) {
         pc_memory_init(get_system_memory(),
@@ -217,8 +225,14 @@ static void pc_q35_init(QEMUMachineInitArgs *args)
     }
 }
 
+static void pc_compat_1_7(QEMUMachineInitArgs *args)
+{
+    smbios_type1_defaults = false;
+}
+
 static void pc_compat_1_6(QEMUMachineInitArgs *args)
 {
+    pc_compat_1_7(args);
     has_pci_info = false;
     rom_file_in_ram = false;
     has_acpi_build = false;
@@ -236,6 +250,12 @@ static void pc_compat_1_4(QEMUMachineInitArgs *args)
     x86_cpu_compat_set_features("Westmere", FEAT_1_ECX, 0, CPUID_EXT_PCLMULQDQ);
 }
 
+static void pc_q35_init_1_7(QEMUMachineInitArgs *args)
+{
+    pc_compat_1_7(args);
+    pc_q35_init(args);
+}
+
 static void pc_q35_init_1_6(QEMUMachineInitArgs *args)
 {
     pc_compat_1_6(args);
diff --git a/hw/i386/smbios.c b/hw/i386/smbios.c
index d3f1ee6..e8f41ad 100644
--- a/hw/i386/smbios.c
+++ b/hw/i386/smbios.c
@@ -256,6 +256,20 @@ static void smbios_build_type_1_fields(void)
     }
 }
 
+void smbios_set_type1_defaults(const char *manufacturer,
+                               const char *product, const char *version)
+{
+    if (!type1.manufacturer) {
+        type1.manufacturer = manufacturer;
+    }
+    if (!type1.product) {
+        type1.product = product;
+    }
+    if (!type1.version) {
+        type1.version = version;
+    }
+}
+
 uint8_t *smbios_get_table(size_t *length)
 {
     if (!smbios_immutable) {
-- 
MST

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [Qemu-devel] [PULL 17/28] acpi unit-test: verify signature and checksum
  2013-12-11 18:30 [Qemu-devel] [PULL 00/28] acpi.pci,pc,memory core fixes Michael S. Tsirkin
                   ` (15 preceding siblings ...)
  2013-12-11 18:30 ` [Qemu-devel] [PULL 16/28] smbios: Set system manufacturer, product & version by default Michael S. Tsirkin
@ 2013-12-11 18:31 ` Michael S. Tsirkin
  2013-12-11 18:31 ` [Qemu-devel] [PULL 18/28] acpi: strip compiler info in built-in DSDT Michael S. Tsirkin
                   ` (10 subsequent siblings)
  27 siblings, 0 replies; 74+ messages in thread
From: Michael S. Tsirkin @ 2013-12-11 18:31 UTC (permalink / raw)
  To: qemu-devel; +Cc: Marcel Apfelbaum

From: Marcel Apfelbaum <marcel.a@redhat.com>

Read all ACPI tables from the guest - this will be useful for further
unit tests.

Follow pointers between ACPI tables checking signature and format for
correctness.  Verify checksum for all tables.
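
The checksum rule being verified is the standard ACPI one (stated here
for reference; the test's acpi_checksum() below computes the left-hand
side):

    (sum of all bytes in the table, header and payload included) % 256 == 0

i.e. the checksum byte is chosen by the table producer as
(256 - sum of the other bytes) & 0xff, so summing everything and
asserting the result is zero is sufficient.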

Signed-off-by: Marcel Apfelbaum <marcel.a@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 tests/acpi-test.c | 272 ++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 252 insertions(+), 20 deletions(-)

diff --git a/tests/acpi-test.c b/tests/acpi-test.c
index 468c4f5..d6ff66f 100644
--- a/tests/acpi-test.c
+++ b/tests/acpi-test.c
@@ -13,13 +13,28 @@
 #include <string.h>
 #include <stdio.h>
 #include <glib.h>
+#include "qemu-common.h"
 #include "libqtest.h"
+#include "qemu/compiler.h"
+#include "hw/i386/acpi-defs.h"
 
+/* DSDT and SSDTs format */
 typedef struct {
-    const char *args;
-    uint64_t expected_boot;
-    uint64_t expected_reboot;
-} boot_order_test;
+    AcpiTableHeader header;
+    uint8_t *aml;
+    int aml_len;
+} AcpiSdtTable;
+
+typedef struct {
+    uint32_t rsdp_addr;
+    AcpiRsdpDescriptor rsdp_table;
+    AcpiRsdtDescriptorRev1 rsdt_table;
+    AcpiFadtDescriptorRev1 fadt_table;
+    uint32_t *rsdt_tables_addr;
+    int rsdt_tables_nr;
+    AcpiSdtTable dsdt_table;
+    AcpiSdtTable *ssdt_tables;
+} test_data;
 
 #define LOW(x) ((x) & 0xff)
 #define HIGH(x) ((x) >> 8)
@@ -28,6 +43,51 @@ typedef struct {
 #define SIGNATURE_OFFSET 0x10
 #define BOOT_SECTOR_ADDRESS 0x7c00
 
+#define ACPI_READ_FIELD(field, addr)           \
+    do {                                       \
+        switch (sizeof(field)) {               \
+        case 1:                                \
+            field = readb(addr);               \
+            break;                             \
+        case 2:                                \
+            field = le16_to_cpu(readw(addr));  \
+            break;                             \
+        case 4:                                \
+            field = le32_to_cpu(readl(addr));  \
+            break;                             \
+        case 8:                                \
+            field = le64_to_cpu(readq(addr));  \
+            break;                             \
+        default:                               \
+            g_assert(false);                   \
+        }                                      \
+        addr += sizeof(field);                  \
+    } while (0);
+
+#define ACPI_READ_ARRAY_PTR(arr, length, addr)  \
+    do {                                        \
+        int idx;                                \
+        for (idx = 0; idx < length; ++idx) {    \
+            ACPI_READ_FIELD(arr[idx], addr);    \
+        }                                       \
+    } while (0);
+
+#define ACPI_READ_ARRAY(arr, addr)                               \
+    ACPI_READ_ARRAY_PTR(arr, sizeof(arr)/sizeof(arr[0]), addr)
+
+#define ACPI_READ_TABLE_HEADER(table, addr)                      \
+    do {                                                         \
+        ACPI_READ_FIELD((table)->signature, addr);               \
+        ACPI_READ_FIELD((table)->length, addr);                  \
+        ACPI_READ_FIELD((table)->revision, addr);                \
+        ACPI_READ_FIELD((table)->checksum, addr);                \
+        ACPI_READ_ARRAY((table)->oem_id, addr);                  \
+        ACPI_READ_ARRAY((table)->oem_table_id, addr);            \
+        ACPI_READ_FIELD((table)->oem_revision, addr);            \
+        ACPI_READ_ARRAY((table)->asl_compiler_id, addr);         \
+        ACPI_READ_FIELD((table)->asl_compiler_revision, addr);   \
+    } while (0);
+
 /* Boot sector code: write SIGNATURE into memory,
  * then halt.
  */
@@ -57,6 +117,181 @@ static uint8_t boot_sector[0x200] = {
 
 static const char *disk = "tests/acpi-test-disk.raw";
 
+static uint8_t acpi_checksum(const uint8_t *data, int len)
+{
+    int i;
+    uint8_t sum = 0;
+
+    for (i = 0; i < len; i++) {
+        sum += data[i];
+    }
+
+    return sum;
+}
+
+static void test_acpi_rsdp_address(test_data *data)
+{
+    uint32_t off;
+
+    /* OK, now find RSDP */
+    for (off = 0xf0000; off < 0x100000; off += 0x10) {
+        uint8_t sig[] = "RSD PTR ";
+        int i;
+
+        for (i = 0; i < sizeof sig - 1; ++i) {
+            sig[i] = readb(off + i);
+        }
+
+        if (!memcmp(sig, "RSD PTR ", sizeof sig)) {
+            break;
+        }
+    }
+
+    g_assert_cmphex(off, <, 0x100000);
+    data->rsdp_addr = off;
+}
+
+static void test_acpi_rsdp_table(test_data *data)
+{
+    AcpiRsdpDescriptor *rsdp_table = &data->rsdp_table;
+    uint32_t addr = data->rsdp_addr;
+
+    ACPI_READ_FIELD(rsdp_table->signature, addr);
+    g_assert_cmphex(rsdp_table->signature, ==, ACPI_RSDP_SIGNATURE);
+
+    ACPI_READ_FIELD(rsdp_table->checksum, addr);
+    ACPI_READ_ARRAY(rsdp_table->oem_id, addr);
+    ACPI_READ_FIELD(rsdp_table->revision, addr);
+    ACPI_READ_FIELD(rsdp_table->rsdt_physical_address, addr);
+    ACPI_READ_FIELD(rsdp_table->length, addr);
+
+    /* rsdp checksum is not for the whole table, but for the first 20 bytes */
+    g_assert(!acpi_checksum((uint8_t *)rsdp_table, 20));
+}
+
+static void test_acpi_rsdt_table(test_data *data)
+{
+    AcpiRsdtDescriptorRev1 *rsdt_table = &data->rsdt_table;
+    uint32_t addr = data->rsdp_table.rsdt_physical_address;
+    uint32_t *tables;
+    int tables_nr;
+    uint8_t checksum;
+
+    /* read the header */
+    ACPI_READ_TABLE_HEADER(rsdt_table, addr);
+    g_assert_cmphex(rsdt_table->signature, ==, ACPI_RSDT_SIGNATURE);
+
+    /* compute the table entries in rsdt */
+    tables_nr = (rsdt_table->length - sizeof(AcpiRsdtDescriptorRev1)) /
+                sizeof(uint32_t);
+    g_assert_cmpint(tables_nr, >, 0);
+
+    /* get the addresses of the tables pointed by rsdt */
+    tables = g_new0(uint32_t, tables_nr);
+    ACPI_READ_ARRAY_PTR(tables, tables_nr, addr);
+
+    checksum = acpi_checksum((uint8_t *)rsdt_table, rsdt_table->length) +
+               acpi_checksum((uint8_t *)tables, tables_nr * sizeof(uint32_t));
+    g_assert(!checksum);
+
+   /* SSDT tables after FADT */
+    data->rsdt_tables_addr = tables;
+    data->rsdt_tables_nr = tables_nr;
+}
+
+static void test_acpi_fadt_table(test_data *data)
+{
+    AcpiFadtDescriptorRev1 *fadt_table = &data->fadt_table;
+    uint32_t addr;
+
+    /* FADT table comes first */
+    addr = data->rsdt_tables_addr[0];
+    ACPI_READ_TABLE_HEADER(fadt_table, addr);
+
+    ACPI_READ_FIELD(fadt_table->firmware_ctrl, addr);
+    ACPI_READ_FIELD(fadt_table->dsdt, addr);
+    ACPI_READ_FIELD(fadt_table->model, addr);
+    ACPI_READ_FIELD(fadt_table->reserved1, addr);
+    ACPI_READ_FIELD(fadt_table->sci_int, addr);
+    ACPI_READ_FIELD(fadt_table->smi_cmd, addr);
+    ACPI_READ_FIELD(fadt_table->acpi_enable, addr);
+    ACPI_READ_FIELD(fadt_table->acpi_disable, addr);
+    ACPI_READ_FIELD(fadt_table->S4bios_req, addr);
+    ACPI_READ_FIELD(fadt_table->reserved2, addr);
+    ACPI_READ_FIELD(fadt_table->pm1a_evt_blk, addr);
+    ACPI_READ_FIELD(fadt_table->pm1b_evt_blk, addr);
+    ACPI_READ_FIELD(fadt_table->pm1a_cnt_blk, addr);
+    ACPI_READ_FIELD(fadt_table->pm1b_cnt_blk, addr);
+    ACPI_READ_FIELD(fadt_table->pm2_cnt_blk, addr);
+    ACPI_READ_FIELD(fadt_table->pm_tmr_blk, addr);
+    ACPI_READ_FIELD(fadt_table->gpe0_blk, addr);
+    ACPI_READ_FIELD(fadt_table->gpe1_blk, addr);
+    ACPI_READ_FIELD(fadt_table->pm1_evt_len, addr);
+    ACPI_READ_FIELD(fadt_table->pm1_cnt_len, addr);
+    ACPI_READ_FIELD(fadt_table->pm2_cnt_len, addr);
+    ACPI_READ_FIELD(fadt_table->pm_tmr_len, addr);
+    ACPI_READ_FIELD(fadt_table->gpe0_blk_len, addr);
+    ACPI_READ_FIELD(fadt_table->gpe1_blk_len, addr);
+    ACPI_READ_FIELD(fadt_table->gpe1_base, addr);
+    ACPI_READ_FIELD(fadt_table->reserved3, addr);
+    ACPI_READ_FIELD(fadt_table->plvl2_lat, addr);
+    ACPI_READ_FIELD(fadt_table->plvl3_lat, addr);
+    ACPI_READ_FIELD(fadt_table->flush_size, addr);
+    ACPI_READ_FIELD(fadt_table->flush_stride, addr);
+    ACPI_READ_FIELD(fadt_table->duty_offset, addr);
+    ACPI_READ_FIELD(fadt_table->duty_width, addr);
+    ACPI_READ_FIELD(fadt_table->day_alrm, addr);
+    ACPI_READ_FIELD(fadt_table->mon_alrm, addr);
+    ACPI_READ_FIELD(fadt_table->century, addr);
+    ACPI_READ_FIELD(fadt_table->reserved4, addr);
+    ACPI_READ_FIELD(fadt_table->reserved4a, addr);
+    ACPI_READ_FIELD(fadt_table->reserved4b, addr);
+    ACPI_READ_FIELD(fadt_table->flags, addr);
+
+    g_assert_cmphex(fadt_table->signature, ==, ACPI_FACP_SIGNATURE);
+    g_assert(!acpi_checksum((uint8_t *)fadt_table, fadt_table->length));
+}
+
+static void test_dst_table(AcpiSdtTable *sdt_table, uint32_t addr)
+{
+    uint8_t checksum;
+
+    ACPI_READ_TABLE_HEADER(&sdt_table->header, addr);
+
+    sdt_table->aml_len = sdt_table->header.length - sizeof(AcpiTableHeader);
+    sdt_table->aml = g_malloc0(sdt_table->aml_len);
+    ACPI_READ_ARRAY_PTR(sdt_table->aml, sdt_table->aml_len, addr);
+
+    checksum = acpi_checksum((uint8_t *)sdt_table, sizeof(AcpiTableHeader)) +
+               acpi_checksum(sdt_table->aml, sdt_table->aml_len);
+    g_assert(!checksum);
+}
+
+static void test_acpi_dsdt_table(test_data *data)
+{
+    AcpiSdtTable *dsdt_table = &data->dsdt_table;
+    uint32_t addr = data->fadt_table.dsdt;
+
+    test_dst_table(dsdt_table, addr);
+    g_assert_cmphex(dsdt_table->header.signature, ==, ACPI_DSDT_SIGNATURE);
+}
+
+static void test_acpi_ssdt_tables(test_data *data)
+{
+    AcpiSdtTable *ssdt_tables;
+    int ssdt_tables_nr = data->rsdt_tables_nr - 1; /* fadt is first */
+    int i;
+
+    ssdt_tables = g_new0(AcpiSdtTable, ssdt_tables_nr);
+    for (i = 0; i < ssdt_tables_nr; i++) {
+        AcpiSdtTable *ssdt_table = &ssdt_tables[i];
+        uint32_t addr = data->rsdt_tables_addr[i + 1]; /* fadt is first */
+
+        test_dst_table(ssdt_table, addr);
+    }
+    data->ssdt_tables = ssdt_tables;
+}
+
 static void test_acpi_one(const char *params)
 {
     char *args;
@@ -64,9 +299,9 @@ static void test_acpi_one(const char *params)
     uint8_t signature_high;
     uint16_t signature;
     int i;
-    uint32_t off;
-
+    test_data data;
 
+    memset(&data, 0, sizeof(data));
     args = g_strdup_printf("-net none -display none %s %s",
                            params ? params : "", disk);
     qtest_start(args);
@@ -90,22 +325,19 @@ static void test_acpi_one(const char *params)
     }
     g_assert_cmphex(signature, ==, SIGNATURE);
 
-    /* OK, now find RSDP */
-    for (off = 0xf0000; off < 0x100000; off += 0x10)
-    {
-        uint8_t sig[] = "RSD PTR ";
-        int i;
-
-        for (i = 0; i < sizeof sig - 1; ++i) {
-            sig[i] = readb(off + i);
-        }
+    test_acpi_rsdp_address(&data);
+    test_acpi_rsdp_table(&data);
+    test_acpi_rsdt_table(&data);
+    test_acpi_fadt_table(&data);
+    test_acpi_dsdt_table(&data);
+    test_acpi_ssdt_tables(&data);
 
-        if (!memcmp(sig, "RSD PTR ", sizeof sig)) {
-            break;
-        }
+    g_free(data.rsdt_tables_addr);
+    for (i = 0; i < (data.rsdt_tables_nr - 1); ++i) {
+        g_free(data.ssdt_tables[i].aml);
     }
-
-    g_assert_cmphex(off, <, 0x100000);
+    g_free(data.ssdt_tables);
+    g_free(data.dsdt_table.aml);
 
     qtest_quit(global_qtest);
     g_free(args);
-- 
MST

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [Qemu-devel] [PULL 18/28] acpi: strip compiler info in built-in DSDT
  2013-12-11 18:30 [Qemu-devel] [PULL 00/28] acpi.pci,pc,memory core fixes Michael S. Tsirkin
                   ` (16 preceding siblings ...)
  2013-12-11 18:31 ` [Qemu-devel] [PULL 17/28] acpi unit-test: verify signature and checksum Michael S. Tsirkin
@ 2013-12-11 18:31 ` Michael S. Tsirkin
  2013-12-11 18:31 ` [Qemu-devel] [PULL 19/28] ACPI DSDT: Make control method `IQCR` serialized Michael S. Tsirkin
                   ` (9 subsequent siblings)
  27 siblings, 0 replies; 74+ messages in thread
From: Michael S. Tsirkin @ 2013-12-11 18:31 UTC (permalink / raw)
  To: qemu-devel; +Cc: Anthony Liguori, Marcel Apfelbaum

IASL stores its revision in each table header it generates.
That's not nice since guests will see a change each time they move
between hypervisors.  We generally fill our own info for tables, but we
(and seabios) forgot to do this for the built-in DSDT.

Modifications in DSDT table:
 OEM ID:            "BXPC" -> "BOCHS "
 OEM Table ID:      "BXDSDT" -> "BXPCDSDT"
 Compiler ID:       "INTL" -> "BXPC"
 Compiler Version:  0x20130823 -> 0x00000001

Tested-by: Marcel Apfelbaum <marcel.a@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 hw/i386/acpi-build.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index befc39f..48312f5 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -924,10 +924,16 @@ build_mcfg_q35(GArray *table_data, GArray *linker, AcpiMcfgInfo *info)
 static void
 build_dsdt(GArray *table_data, GArray *linker, AcpiMiscInfo *misc)
 {
-    void *dsdt;
+    AcpiTableHeader *dsdt;
+
     assert(misc->dsdt_code && misc->dsdt_size);
+
     dsdt = acpi_data_push(table_data, misc->dsdt_size);
     memcpy(dsdt, misc->dsdt_code, misc->dsdt_size);
+
+    memset(dsdt, 0, sizeof *dsdt);
+    build_header(linker, table_data, dsdt, ACPI_DSDT_SIGNATURE,
+                 misc->dsdt_size, 1);
 }
 
 /* Build final rsdt table */
-- 
MST

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [Qemu-devel] [PULL 19/28] ACPI DSDT: Make control method `IQCR` serialized
  2013-12-11 18:30 [Qemu-devel] [PULL 00/28] acpi.pci,pc,memory core fixes Michael S. Tsirkin
                   ` (17 preceding siblings ...)
  2013-12-11 18:31 ` [Qemu-devel] [PULL 18/28] acpi: strip compiler info in built-in DSDT Michael S. Tsirkin
@ 2013-12-11 18:31 ` Michael S. Tsirkin
  2013-12-11 18:31 ` [Qemu-devel] [PULL 20/28] pci: fix pci bridge fw path Michael S. Tsirkin
                   ` (8 subsequent siblings)
  27 siblings, 0 replies; 74+ messages in thread
From: Michael S. Tsirkin @ 2013-12-11 18:31 UTC (permalink / raw)
  To: qemu-devel; +Cc: Anthony Liguori, Marcel Apfelbaum

Forward-port the following commit from seabios:

commit 995bbeef78b338370f426bf8d0399038c3fa259c
Author: Paul Menzel <paulepanter@users.sourceforge.net>
Date:   Thu Oct 3 11:30:52 2013 +0200

    The ASL Optimizing Compiler version 20130823-32 [Sep 11 2013] issues the
    following warning.

            $ make
            […]
              Compiling IASL out/src/fw/acpi-dsdt.hex
            out/src/fw/acpi-dsdt.dsl.i    360:         Method(IQCR, 1, NotSerialized) {
            Remark   2120 -                                     ^ Control Method should be made Serialized (due to creation of named objects within)
            […]
            ASL Input:     out/src/fw/acpi-dsdt.dsl.i - 475 lines, 19181 bytes, 316 keywords
            AML Output:    out/src/fw/acpi-dsdt.aml - 4407 bytes, 159 named objects, 157 executable opcodes
            Listing File:  out/src/fw/acpi-dsdt.lst - 143715 bytes
            Hex Dump:      out/src/fw/acpi-dsdt.hex - 41661 bytes

            Compilation complete. 0 Errors, 0 Warnings, 1 Remarks, 246 Optimizations
            […]

    After changing the parameter from `NotSerialized` to `Serialized`, the
    remark is indeed gone and there is no size change.

    The remark was added in ACPICA version 20130517 [1] and gives the
    following explanation.

            If a thread blocks within the method for any reason, and another thread
            enters the method, the method will fail because an attempt will be
            made to create the same (named) object twice.

            In this case, issue a remark that the method should be marked
            serialized. ACPICA BZ 909.

    [1] https://github.com/acpica/acpica/commit/ba84d0fc18ba910a47a3f71c68a43543c06e6831

    Signed-off-by: Paul Menzel <paulepanter@users.sourceforge.net>

Reported-by: Marcel Apfelbaum <marcel.a@redhat.com>
Tested-by: Marcel Apfelbaum <marcel.a@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 hw/i386/acpi-dsdt.dsl               | 2 +-
 hw/i386/acpi-dsdt.hex.generated     | 4 ++--
 hw/i386/q35-acpi-dsdt.dsl           | 2 +-
 hw/i386/q35-acpi-dsdt.hex.generated | 4 ++--
 4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/hw/i386/acpi-dsdt.dsl b/hw/i386/acpi-dsdt.dsl
index 90efce0..a377424 100644
--- a/hw/i386/acpi-dsdt.dsl
+++ b/hw/i386/acpi-dsdt.dsl
@@ -235,7 +235,7 @@ DefinitionBlock (
             }
             Return (0x0B)
         }
-        Method(IQCR, 1, NotSerialized) {
+        Method(IQCR, 1, Serialized) {
             // _CRS method - get current settings
             Name(PRR0, ResourceTemplate() {
                 Interrupt(, Level, ActiveHigh, Shared) { 0 }
diff --git a/hw/i386/acpi-dsdt.hex.generated b/hw/i386/acpi-dsdt.hex.generated
index 2c01107..f8bd4ea 100644
--- a/hw/i386/acpi-dsdt.hex.generated
+++ b/hw/i386/acpi-dsdt.hex.generated
@@ -8,7 +8,7 @@ static unsigned char AcpiDsdtAmlCode[] = {
 0x0,
 0x0,
 0x1,
-0xe0,
+0xd8,
 0x42,
 0x58,
 0x50,
@@ -3379,7 +3379,7 @@ static unsigned char AcpiDsdtAmlCode[] = {
 0x51,
 0x43,
 0x52,
-0x1,
+0x9,
 0x8,
 0x50,
 0x52,
diff --git a/hw/i386/q35-acpi-dsdt.dsl b/hw/i386/q35-acpi-dsdt.dsl
index 21c89b0..575c5d7 100644
--- a/hw/i386/q35-acpi-dsdt.dsl
+++ b/hw/i386/q35-acpi-dsdt.dsl
@@ -333,7 +333,7 @@ DefinitionBlock (
             }
             Return (0x0B)
         }
-        Method(IQCR, 1, NotSerialized) {
+        Method(IQCR, 1, Serialized) {
             // _CRS method - get current settings
             Name(PRR0, ResourceTemplate() {
                 Interrupt(, Level, ActiveHigh, Shared) { 0 }
diff --git a/hw/i386/q35-acpi-dsdt.hex.generated b/hw/i386/q35-acpi-dsdt.hex.generated
index 32c16ff..111ad3e 100644
--- a/hw/i386/q35-acpi-dsdt.hex.generated
+++ b/hw/i386/q35-acpi-dsdt.hex.generated
@@ -8,7 +8,7 @@ static unsigned char Q35AcpiDsdtAmlCode[] = {
 0x0,
 0x0,
 0x1,
-0x6,
+0xfe,
 0x42,
 0x58,
 0x50,
@@ -5338,7 +5338,7 @@ static unsigned char Q35AcpiDsdtAmlCode[] = {
 0x51,
 0x43,
 0x52,
-0x1,
+0x9,
 0x8,
 0x50,
 0x52,
-- 
MST

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [Qemu-devel] [PULL 20/28] pci: fix pci bridge fw path
  2013-12-11 18:30 [Qemu-devel] [PULL 00/28] acpi.pci,pc,memory core fixes Michael S. Tsirkin
                   ` (18 preceding siblings ...)
  2013-12-11 18:31 ` [Qemu-devel] [PULL 19/28] ACPI DSDT: Make control method `IQCR` serialized Michael S. Tsirkin
@ 2013-12-11 18:31 ` Michael S. Tsirkin
  2013-12-11 18:31 ` [Qemu-devel] [PULL 21/28] hpet: inverse polarity when pin above ISA_NUM_IRQS Michael S. Tsirkin
                   ` (7 subsequent siblings)
  27 siblings, 0 replies; 74+ messages in thread
From: Michael S. Tsirkin @ 2013-12-11 18:31 UTC (permalink / raw)
  To: qemu-devel; +Cc: Gerd Hoffmann

From: Gerd Hoffmann <kraxel@redhat.com>

qemu uses "pci" as name for pci bridges in the firmware device path.
seabios expects "pci-bridge".  Result is that bootorder is broken for
devices behind pci bridges.

Some googling suggests that "pci-bridge" is the correct one.  At least
PPC-based Apple machines are using this.  See question "How do I boot
from a device attached to a PCI card" here:
	http://www.netbsd.org/ports/macppc/faq.html

So let's change qemu to use "pci-bridge" too.

Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 hw/pci/pci.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 49eca95..82c11ec 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -1330,7 +1330,7 @@ static const pci_class_desc pci_class_descriptions[] =
     { 0x0601, "ISA bridge", "isa"},
     { 0x0602, "EISA bridge", "eisa"},
     { 0x0603, "MC bridge", "mca"},
-    { 0x0604, "PCI bridge", "pci"},
+    { 0x0604, "PCI bridge", "pci-bridge"},
     { 0x0605, "PCMCIA bridge", "pcmcia"},
     { 0x0606, "NUBUS bridge", "nubus"},
     { 0x0607, "CARDBUS bridge", "cardbus"},
-- 
MST

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [Qemu-devel] [PULL 21/28] hpet: inverse polarity when pin above ISA_NUM_IRQS
  2013-12-11 18:30 [Qemu-devel] [PULL 00/28] acpi.pci,pc,memory core fixes Michael S. Tsirkin
                   ` (19 preceding siblings ...)
  2013-12-11 18:31 ` [Qemu-devel] [PULL 20/28] pci: fix pci bridge fw path Michael S. Tsirkin
@ 2013-12-11 18:31 ` Michael S. Tsirkin
  2013-12-11 18:31 ` [Qemu-devel] [PULL 22/28] hpet: enable to entitle more irq pins for hpet Michael S. Tsirkin
                   ` (6 subsequent siblings)
  27 siblings, 0 replies; 74+ messages in thread
From: Michael S. Tsirkin @ 2013-12-11 18:31 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paolo Bonzini, Liu Ping Fan, Liu Ping Fan

From: Liu Ping Fan <qemulist@gmail.com>

According to the hpet spec, the hpet irq is active high. But according
to the ICH spec, there is an inversion before the input of the ioapic,
so the OS will expect this IRQ line to be active low. (On bare metal,
if the OS driver claims the line as active high, spurious irqs are
generated.)

We fold the emulation of this inversion into the hpet logic.
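
A minimal sketch of the fold (illustrative only: the helper name is
made up, while ISA_NUM_IRQS and qemu_set_irq() are the existing QEMU
primitives; the patch itself open-codes this in update_irq() below):

    /* Drive an hpet interrupt pin, hiding the ICH inversion for
     * PIRQ-routed pins (route >= ISA_NUM_IRQS): there, the level seen
     * by the ioapic is the opposite of the hpet's logical level.
     */
    static void hpet_drive_pin(qemu_irq *irqs, int route, int asserted)
    {
        int level = (route >= ISA_NUM_IRQS) ? !asserted : asserted;
        qemu_set_irq(irqs[route], level);
    }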

Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 hw/timer/hpet.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/hw/timer/hpet.c b/hw/timer/hpet.c
index 2eb75ea..0aee2c1 100644
--- a/hw/timer/hpet.c
+++ b/hw/timer/hpet.c
@@ -198,13 +198,23 @@ static void update_irq(struct HPETTimer *timer, int set)
     if (!set || !timer_enabled(timer) || !hpet_enabled(timer->state)) {
         s->isr &= ~mask;
         if (!timer_fsb_route(timer)) {
-            qemu_irq_lower(s->irqs[route]);
+            /* fold the ICH PIRQ# pin's internal inversion logic into hpet */
+            if (route >= ISA_NUM_IRQS) {
+                qemu_irq_raise(s->irqs[route]);
+            } else {
+                qemu_irq_lower(s->irqs[route]);
+            }
         }
     } else if (timer_fsb_route(timer)) {
         stl_le_phys(timer->fsb >> 32, timer->fsb & 0xffffffff);
     } else if (timer->config & HPET_TN_TYPE_LEVEL) {
         s->isr |= mask;
-        qemu_irq_raise(s->irqs[route]);
+        /* fold the ICH PIRQ# pin's internal inversion logic into hpet */
+        if (route >= ISA_NUM_IRQS) {
+            qemu_irq_lower(s->irqs[route]);
+        } else {
+            qemu_irq_raise(s->irqs[route]);
+        }
     } else {
         s->isr &= ~mask;
         qemu_irq_pulse(s->irqs[route]);
-- 
MST

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [Qemu-devel] [PULL 22/28] hpet: enable to entitle more irq pins for hpet
  2013-12-11 18:30 [Qemu-devel] [PULL 00/28] acpi.pci,pc,memory core fixes Michael S. Tsirkin
                   ` (20 preceding siblings ...)
  2013-12-11 18:31 ` [Qemu-devel] [PULL 21/28] hpet: inverse polarity when pin above ISA_NUM_IRQS Michael S. Tsirkin
@ 2013-12-11 18:31 ` Michael S. Tsirkin
  2013-12-11 18:31 ` [Qemu-devel] [PULL 23/28] memory.c: bugfix - ref counting mismatch in memory_region_find Michael S. Tsirkin
                   ` (5 subsequent siblings)
  27 siblings, 0 replies; 74+ messages in thread
From: Michael S. Tsirkin @ 2013-12-11 18:31 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paolo Bonzini, Liu Ping Fan, Liu Ping Fan, Anthony Liguori

From: Liu Ping Fan <qemulist@gmail.com>

Owing to differences in hardware design, piix and q35 need different
compat handling, so we let them diverge.

On q35, IRQ2/8 can be reserved for hpet timers 0/1, and pins 16~23 can
be assigned to the hpet as the guest chooses. We introduce an intcap
property to express this.

Taking the compat requirements of piix and q35 into account, we end up
with the following values for intcap: for piix, the hpet's intcap is
hard coded to IRQ2; for pc-q35-1.7 and earlier, we use IRQ2 for compat
reasons; otherwise IRQ2, IRQ8 and IRQ16~23 are allowed.
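
Decoding the intcap values used in the hunks below (bit N set means
IRQ N may be routed to the hpet, matching the Tn_INT_ROUTE_CAP field
it is copied into):

    0x000004 = 1 << 2                  -> IRQ2 only
                                          (piix; also q35 <= 1.7 via the
                                           compat property value "4")
    0xff0104 = 0xff0000 | 0x100 | 0x4  -> IRQ16~23, IRQ8 and IRQ2
                                          (newer q35 machines)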

Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/hw/i386/pc.h | 24 +++++++++++++++++++++++-
 hw/i386/pc.c         | 19 ++++++++++++++++---
 hw/i386/pc_piix.c    |  3 ++-
 hw/i386/pc_q35.c     | 12 ++++++++----
 hw/timer/hpet.c      |  9 +++++++--
 5 files changed, 56 insertions(+), 11 deletions(-)

diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index 8ea1a98..24eb3de 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -13,6 +13,8 @@
 #include "sysemu/sysemu.h"
 #include "hw/pci/pci.h"
 
+#define HPET_INTCAP "hpet-intcap"
+
 /* PC-style peripherals (also used by other machines).  */
 
 typedef struct PcPciInfo {
@@ -146,7 +148,8 @@ DeviceState *pc_vga_init(ISABus *isa_bus, PCIBus *pci_bus);
 void pc_basic_device_init(ISABus *isa_bus, qemu_irq *gsi,
                           ISADevice **rtc_state,
                           ISADevice **floppy,
-                          bool no_vmport);
+                          bool no_vmport,
+                          uint32 hpet_irqs);
 void pc_init_ne2k_isa(ISABus *bus, NICInfo *nd);
 void pc_cmos_init(ram_addr_t ram_size, ram_addr_t above_4g_mem_size,
                   const char *boot_device,
@@ -236,6 +239,25 @@ uint16_t pvpanic_port(void);
 
 int e820_add_entry(uint64_t, uint64_t, uint32_t);
 
+#define PC_Q35_COMPAT_1_7 \
+        {\
+            .driver   = "hpet",\
+            .property = HPET_INTCAP,\
+            .value    = stringify(4),\
+        }
+
+#define PC_Q35_COMPAT_1_6 \
+        PC_COMPAT_1_6, \
+        PC_Q35_COMPAT_1_7
+
+#define PC_Q35_COMPAT_1_5 \
+        PC_COMPAT_1_5, \
+        PC_Q35_COMPAT_1_6
+
+#define PC_Q35_COMPAT_1_4 \
+        PC_COMPAT_1_4, \
+        PC_Q35_COMPAT_1_5
+
 #define PC_COMPAT_1_6 \
         {\
             .driver   = "e1000",\
diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 6c82ada..8353d10 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -1253,7 +1253,8 @@ static const MemoryRegionOps ioportF0_io_ops = {
 void pc_basic_device_init(ISABus *isa_bus, qemu_irq *gsi,
                           ISADevice **rtc_state,
                           ISADevice **floppy,
-                          bool no_vmport)
+                          bool no_vmport,
+                          uint32 hpet_irqs)
 {
     int i;
     DriveInfo *fd[MAX_FD];
@@ -1280,9 +1281,21 @@ void pc_basic_device_init(ISABus *isa_bus, qemu_irq *gsi,
      * when the HPET wants to take over. Thus we have to disable the latter.
      */
     if (!no_hpet && (!kvm_irqchip_in_kernel() || kvm_has_pit_state2())) {
-        hpet = sysbus_try_create_simple("hpet", HPET_BASE, NULL);
-
+        /* In order to set property, here not using sysbus_try_create_simple */
+        hpet = qdev_try_create(NULL, "hpet");
         if (hpet) {
+            /* For pc-piix-*, hpet's intcap is always IRQ2. For pc-q35-1.7
+             * and earlier, use IRQ2 for compat. Otherwise, use IRQ16~23,
+             * IRQ8 and IRQ2.
+             */
+            uint8_t compat = object_property_get_int(OBJECT(hpet),
+                    HPET_INTCAP, NULL);
+            if (!compat) {
+                qdev_prop_set_uint32(hpet, HPET_INTCAP, hpet_irqs);
+            }
+            qdev_init_nofail(hpet);
+            sysbus_mmio_map(SYS_BUS_DEVICE(hpet), 0, HPET_BASE);
+
             for (i = 0; i < GSI_NUM_PINS; i++) {
                 sysbus_connect_irq(SYS_BUS_DEVICE(hpet), i, gsi[i]);
             }
diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
index 9fc3b11..4e0dae7 100644
--- a/hw/i386/pc_piix.c
+++ b/hw/i386/pc_piix.c
@@ -189,7 +189,8 @@ static void pc_init1(QEMUMachineInitArgs *args,
     pc_vga_init(isa_bus, pci_enabled ? pci_bus : NULL);
 
     /* init basic PC hardware */
-    pc_basic_device_init(isa_bus, gsi, &rtc_state, &floppy, xen_enabled());
+    pc_basic_device_init(isa_bus, gsi, &rtc_state, &floppy, xen_enabled(),
+        0x4);
 
     pc_nic_init(isa_bus, pci_bus);
 
diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
index b4e39f0..07f38ff 100644
--- a/hw/i386/pc_q35.c
+++ b/hw/i386/pc_q35.c
@@ -190,7 +190,7 @@ static void pc_q35_init(QEMUMachineInitArgs *args)
     pc_register_ferr_irq(gsi[13]);
 
     /* init basic PC hardware */
-    pc_basic_device_init(isa_bus, gsi, &rtc_state, &floppy, false);
+    pc_basic_device_init(isa_bus, gsi, &rtc_state, &floppy, false, 0xff0104);
 
     /* connect pm stuff to lpc */
     ich9_lpc_pm_init(lpc);
@@ -295,7 +295,11 @@ static QEMUMachine pc_q35_machine_v2_0 = {
 static QEMUMachine pc_q35_machine_v1_7 = {
     PC_Q35_1_7_MACHINE_OPTIONS,
     .name = "pc-q35-1.7",
-    .init = pc_q35_init,
+    .init = pc_q35_init_1_7,
+    .compat_props = (GlobalProperty[]) {
+        PC_Q35_COMPAT_1_7,
+        { /* end of list */ }
+    },
 };
 
 #define PC_Q35_1_6_MACHINE_OPTIONS PC_Q35_MACHINE_OPTIONS
@@ -305,7 +309,7 @@ static QEMUMachine pc_q35_machine_v1_6 = {
     .name = "pc-q35-1.6",
     .init = pc_q35_init_1_6,
     .compat_props = (GlobalProperty[]) {
-        PC_COMPAT_1_6,
+        PC_Q35_COMPAT_1_6,
         { /* end of list */ }
     },
 };
@@ -315,7 +319,7 @@ static QEMUMachine pc_q35_machine_v1_5 = {
     .name = "pc-q35-1.5",
     .init = pc_q35_init_1_5,
     .compat_props = (GlobalProperty[]) {
-        PC_COMPAT_1_5,
+        PC_Q35_COMPAT_1_5,
         { /* end of list */ }
     },
 };
diff --git a/hw/timer/hpet.c b/hw/timer/hpet.c
index 0aee2c1..0ec440e 100644
--- a/hw/timer/hpet.c
+++ b/hw/timer/hpet.c
@@ -73,6 +73,7 @@ typedef struct HPETState {
     uint8_t rtc_irq_level;
     qemu_irq pit_enabled;
     uint8_t num_timers;
+    uint32_t intcap;
     HPETTimer timer[HPET_MAX_TIMERS];
 
     /* Memory-mapped, software visible registers */
@@ -663,8 +664,8 @@ static void hpet_reset(DeviceState *d)
         if (s->flags & (1 << HPET_MSI_SUPPORT)) {
             timer->config |= HPET_TN_FSB_CAP;
         }
-        /* advertise availability of ioapic inti2 */
-        timer->config |=  0x00000004ULL << 32;
+        /* advertise availability of ioapic int */
+        timer->config |=  (uint64_t)s->intcap << 32;
         timer->period = 0ULL;
         timer->wrap_flag = 0;
     }
@@ -713,6 +714,9 @@ static void hpet_realize(DeviceState *dev, Error **errp)
     int i;
     HPETTimer *timer;
 
+    if (!s->intcap) {
+        error_printf("Hpet's intcap not initialized.\n");
+    }
     if (hpet_cfg.count == UINT8_MAX) {
         /* first instance */
         hpet_cfg.count = 0;
@@ -753,6 +757,7 @@ static void hpet_realize(DeviceState *dev, Error **errp)
 static Property hpet_device_properties[] = {
     DEFINE_PROP_UINT8("timers", HPETState, num_timers, HPET_MIN_TIMERS),
     DEFINE_PROP_BIT("msi", HPETState, flags, HPET_MSI_SUPPORT, false),
+    DEFINE_PROP_UINT32(HPET_INTCAP, HPETState, intcap, 0),
     DEFINE_PROP_END_OF_LIST(),
 };
 
-- 
MST

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [Qemu-devel] [PULL 23/28] memory.c: bugfix - ref counting mismatch in memory_region_find
  2013-12-11 18:30 [Qemu-devel] [PULL 00/28] acpi.pci,pc,memory core fixes Michael S. Tsirkin
                   ` (21 preceding siblings ...)
  2013-12-11 18:31 ` [Qemu-devel] [PULL 22/28] hpet: enable to entitle more irq pins for hpet Michael S. Tsirkin
@ 2013-12-11 18:31 ` Michael S. Tsirkin
  2013-12-11 18:31 ` [Qemu-devel] [PULL 24/28] exec: separate sections and nodes per address space Michael S. Tsirkin
                   ` (4 subsequent siblings)
  27 siblings, 0 replies; 74+ messages in thread
From: Michael S. Tsirkin @ 2013-12-11 18:31 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paolo Bonzini, qemu-stable, Marcel Apfelbaum

From: Marcel Apfelbaum <marcel.a@redhat.com>

'address_space_get_flatview' gets a reference to a FlatView.
If the flatview lookup fails, the code returns without
"unreferencing" the view.

Cc: qemu-stable@nongnu.org

Signed-off-by: Marcel Apfelbaum <marcel.a@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 memory.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/memory.c b/memory.c
index 28f6449..7764314 100644
--- a/memory.c
+++ b/memory.c
@@ -1596,6 +1596,7 @@ MemoryRegionSection memory_region_find(MemoryRegion *mr,
     view = address_space_get_flatview(as);
     fr = flatview_lookup(view, range);
     if (!fr) {
+        flatview_unref(view);
         return ret;
     }
 
-- 
MST

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [Qemu-devel] [PULL 24/28] exec: separate sections and nodes per address space
  2013-12-11 18:30 [Qemu-devel] [PULL 00/28] acpi.pci,pc,memory core fixes Michael S. Tsirkin
                   ` (22 preceding siblings ...)
  2013-12-11 18:31 ` [Qemu-devel] [PULL 23/28] memory.c: bugfix - ref counting mismatch in memory_region_find Michael S. Tsirkin
@ 2013-12-11 18:31 ` Michael S. Tsirkin
  2013-12-11 18:31 ` [Qemu-devel] [PULL 25/28] acpi unit-test: load and check facs table Michael S. Tsirkin
                   ` (3 subsequent siblings)
  27 siblings, 0 replies; 74+ messages in thread
From: Michael S. Tsirkin @ 2013-12-11 18:31 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paolo Bonzini, Marcel Apfelbaum

From: Marcel Apfelbaum <marcel.a@redhat.com>

Every address space has its own nodes and sections, but they all use
the same global arrays of nodes/sections.

This limits the number of devices that can be attached to the guest to
about 20-30. It happens because:
 - The sections array is limited to 2^12 entries.
 - The main memory has at least 100 sections.
 - Each device address space is actually an alias to
   main memory, multiplying its number of nodes/sections.

Remove the limitation by using separate arrays of
nodes and sections for each address space.
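
Rough arithmetic behind the 20-30 figure (illustrative; the 2^12 cap
is the assert(map->sections_nb < TARGET_PAGE_SIZE) visible below, with
TARGET_PAGE_SIZE = 4096 on x86):

    4096 total section slots
    ~100+ sections for each copy of main memory
    each per-device DMA address space aliases main memory
    => roughly 4096 / ~130 = ~30 address spaces before the assert fires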

Signed-off-by: Marcel Apfelbaum <marcel.a@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 exec.c | 155 ++++++++++++++++++++++++++++-------------------------------------
 1 file changed, 66 insertions(+), 89 deletions(-)

diff --git a/exec.c b/exec.c
index 67a073c..00526d1 100644
--- a/exec.c
+++ b/exec.c
@@ -103,13 +103,21 @@ struct PhysPageEntry {
 
 typedef PhysPageEntry Node[P_L2_SIZE];
 
+typedef struct PhysPageMap {
+    unsigned sections_nb;
+    unsigned sections_nb_alloc;
+    unsigned nodes_nb;
+    unsigned nodes_nb_alloc;
+    Node *nodes;
+    MemoryRegionSection *sections;
+} PhysPageMap;
+
 struct AddressSpaceDispatch {
     /* This is a multi-level map on the physical address space.
      * The bottom level has pointers to MemoryRegionSections.
      */
     PhysPageEntry phys_map;
-    Node *nodes;
-    MemoryRegionSection *sections;
+    PhysPageMap map;
     AddressSpace *as;
 };
 
@@ -126,18 +134,6 @@ typedef struct subpage_t {
 #define PHYS_SECTION_ROM 2
 #define PHYS_SECTION_WATCH 3
 
-typedef struct PhysPageMap {
-    unsigned sections_nb;
-    unsigned sections_nb_alloc;
-    unsigned nodes_nb;
-    unsigned nodes_nb_alloc;
-    Node *nodes;
-    MemoryRegionSection *sections;
-} PhysPageMap;
-
-static PhysPageMap *prev_map;
-static PhysPageMap next_map;
-
 static void io_mem_init(void);
 static void memory_map_init(void);
 
@@ -146,35 +142,32 @@ static MemoryRegion io_mem_watch;
 
 #if !defined(CONFIG_USER_ONLY)
 
-static void phys_map_node_reserve(unsigned nodes)
+static void phys_map_node_reserve(PhysPageMap *map, unsigned nodes)
 {
-    if (next_map.nodes_nb + nodes > next_map.nodes_nb_alloc) {
-        next_map.nodes_nb_alloc = MAX(next_map.nodes_nb_alloc * 2,
-                                            16);
-        next_map.nodes_nb_alloc = MAX(next_map.nodes_nb_alloc,
-                                      next_map.nodes_nb + nodes);
-        next_map.nodes = g_renew(Node, next_map.nodes,
-                                 next_map.nodes_nb_alloc);
+    if (map->nodes_nb + nodes > map->nodes_nb_alloc) {
+        map->nodes_nb_alloc = MAX(map->nodes_nb_alloc * 2, 16);
+        map->nodes_nb_alloc = MAX(map->nodes_nb_alloc, map->nodes_nb + nodes);
+        map->nodes = g_renew(Node, map->nodes, map->nodes_nb_alloc);
     }
 }
 
-static uint32_t phys_map_node_alloc(void)
+static uint32_t phys_map_node_alloc(PhysPageMap *map)
 {
     unsigned i;
     uint32_t ret;
 
-    ret = next_map.nodes_nb++;
+    ret = map->nodes_nb++;
     assert(ret != PHYS_MAP_NODE_NIL);
-    assert(ret != next_map.nodes_nb_alloc);
+    assert(ret != map->nodes_nb_alloc);
     for (i = 0; i < P_L2_SIZE; ++i) {
-        next_map.nodes[ret][i].skip = 1;
-        next_map.nodes[ret][i].ptr = PHYS_MAP_NODE_NIL;
+        map->nodes[ret][i].skip = 1;
+        map->nodes[ret][i].ptr = PHYS_MAP_NODE_NIL;
     }
     return ret;
 }
 
-static void phys_page_set_level(PhysPageEntry *lp, hwaddr *index,
-                                hwaddr *nb, uint16_t leaf,
+static void phys_page_set_level(PhysPageMap *map, PhysPageEntry *lp,
+                                hwaddr *index, hwaddr *nb, uint16_t leaf,
                                 int level)
 {
     PhysPageEntry *p;
@@ -182,8 +175,8 @@ static void phys_page_set_level(PhysPageEntry *lp, hwaddr *index,
     hwaddr step = (hwaddr)1 << (level * P_L2_BITS);
 
     if (lp->skip && lp->ptr == PHYS_MAP_NODE_NIL) {
-        lp->ptr = phys_map_node_alloc();
-        p = next_map.nodes[lp->ptr];
+        lp->ptr = phys_map_node_alloc(map);
+        p = map->nodes[lp->ptr];
         if (level == 0) {
             for (i = 0; i < P_L2_SIZE; i++) {
                 p[i].skip = 0;
@@ -191,7 +184,7 @@ static void phys_page_set_level(PhysPageEntry *lp, hwaddr *index,
             }
         }
     } else {
-        p = next_map.nodes[lp->ptr];
+        p = map->nodes[lp->ptr];
     }
     lp = &p[(*index >> (level * P_L2_BITS)) & (P_L2_SIZE - 1)];
 
@@ -202,7 +195,7 @@ static void phys_page_set_level(PhysPageEntry *lp, hwaddr *index,
             *index += step;
             *nb -= step;
         } else {
-            phys_page_set_level(lp, index, nb, leaf, level - 1);
+            phys_page_set_level(map, lp, index, nb, leaf, level - 1);
         }
         ++lp;
     }
@@ -213,9 +206,9 @@ static void phys_page_set(AddressSpaceDispatch *d,
                           uint16_t leaf)
 {
     /* Wildly overreserve - it doesn't matter much. */
-    phys_map_node_reserve(3 * P_L2_LEVELS);
+    phys_map_node_reserve(&d->map, 3 * P_L2_LEVELS);
 
-    phys_page_set_level(&d->phys_map, &index, &nb, leaf, P_L2_LEVELS - 1);
+    phys_page_set_level(&d->map, &d->phys_map, &index, &nb, leaf, P_L2_LEVELS - 1);
 }
 
 /* Compact a non leaf page entry. Simply detect that the entry has a single child,
@@ -276,7 +269,7 @@ static void phys_page_compact_all(AddressSpaceDispatch *d, int nodes_nb)
     DECLARE_BITMAP(compacted, nodes_nb);
 
     if (d->phys_map.skip) {
-        phys_page_compact(&d->phys_map, d->nodes, compacted);
+        phys_page_compact(&d->phys_map, d->map.nodes, compacted);
     }
 }
 
@@ -317,10 +310,10 @@ static MemoryRegionSection *address_space_lookup_region(AddressSpaceDispatch *d,
     MemoryRegionSection *section;
     subpage_t *subpage;
 
-    section = phys_page_find(d->phys_map, addr, d->nodes, d->sections);
+    section = phys_page_find(d->phys_map, addr, d->map.nodes, d->map.sections);
     if (resolve_subpage && section->mr->subpage) {
         subpage = container_of(section->mr, subpage_t, iomem);
-        section = &d->sections[subpage->sub_section[SUBPAGE_IDX(addr)]];
+        section = &d->map.sections[subpage->sub_section[SUBPAGE_IDX(addr)]];
     }
     return section;
 }
@@ -788,7 +781,7 @@ hwaddr memory_region_section_get_iotlb(CPUArchState *env,
             iotlb |= PHYS_SECTION_ROM;
         }
     } else {
-        iotlb = section - address_space_memory.dispatch->sections;
+        iotlb = section - address_space_memory.dispatch->map.sections;
         iotlb += xlat;
     }
 
@@ -827,23 +820,23 @@ void phys_mem_set_alloc(void *(*alloc)(size_t))
     phys_mem_alloc = alloc;
 }
 
-static uint16_t phys_section_add(MemoryRegionSection *section)
+static uint16_t phys_section_add(PhysPageMap *map,
+                                 MemoryRegionSection *section)
 {
     /* The physical section number is ORed with a page-aligned
      * pointer to produce the iotlb entries.  Thus it should
      * never overflow into the page-aligned value.
      */
-    assert(next_map.sections_nb < TARGET_PAGE_SIZE);
+    assert(map->sections_nb < TARGET_PAGE_SIZE);
 
-    if (next_map.sections_nb == next_map.sections_nb_alloc) {
-        next_map.sections_nb_alloc = MAX(next_map.sections_nb_alloc * 2,
-                                         16);
-        next_map.sections = g_renew(MemoryRegionSection, next_map.sections,
-                                    next_map.sections_nb_alloc);
+    if (map->sections_nb == map->sections_nb_alloc) {
+        map->sections_nb_alloc = MAX(map->sections_nb_alloc * 2, 16);
+        map->sections = g_renew(MemoryRegionSection, map->sections,
+                                map->sections_nb_alloc);
     }
-    next_map.sections[next_map.sections_nb] = *section;
+    map->sections[map->sections_nb] = *section;
     memory_region_ref(section->mr);
-    return next_map.sections_nb++;
+    return map->sections_nb++;
 }
 
 static void phys_section_destroy(MemoryRegion *mr)
@@ -865,7 +858,6 @@ static void phys_sections_free(PhysPageMap *map)
     }
     g_free(map->sections);
     g_free(map->nodes);
-    g_free(map);
 }
 
 static void register_subpage(AddressSpaceDispatch *d, MemoryRegionSection *section)
@@ -874,7 +866,7 @@ static void register_subpage(AddressSpaceDispatch *d, MemoryRegionSection *secti
     hwaddr base = section->offset_within_address_space
         & TARGET_PAGE_MASK;
     MemoryRegionSection *existing = phys_page_find(d->phys_map, base,
-                                                   next_map.nodes, next_map.sections);
+                                                   d->map.nodes, d->map.sections);
     MemoryRegionSection subsection = {
         .offset_within_address_space = base,
         .size = int128_make64(TARGET_PAGE_SIZE),
@@ -887,13 +879,14 @@ static void register_subpage(AddressSpaceDispatch *d, MemoryRegionSection *secti
         subpage = subpage_init(d->as, base);
         subsection.mr = &subpage->iomem;
         phys_page_set(d, base >> TARGET_PAGE_BITS, 1,
-                      phys_section_add(&subsection));
+                      phys_section_add(&d->map, &subsection));
     } else {
         subpage = container_of(existing->mr, subpage_t, iomem);
     }
     start = section->offset_within_address_space & ~TARGET_PAGE_MASK;
     end = start + int128_get64(section->size) - 1;
-    subpage_register(subpage, start, end, phys_section_add(section));
+    subpage_register(subpage, start, end,
+                     phys_section_add(&d->map, section));
 }
 
 
@@ -901,7 +894,7 @@ static void register_multipage(AddressSpaceDispatch *d,
                                MemoryRegionSection *section)
 {
     hwaddr start_addr = section->offset_within_address_space;
-    uint16_t section_index = phys_section_add(section);
+    uint16_t section_index = phys_section_add(&d->map, section);
     uint64_t num_pages = int128_get64(int128_rshift(section->size,
                                                     TARGET_PAGE_BITS));
 
@@ -1720,7 +1713,7 @@ static subpage_t *subpage_init(AddressSpace *as, hwaddr base)
     return mmio;
 }
 
-static uint16_t dummy_section(MemoryRegion *mr)
+static uint16_t dummy_section(PhysPageMap *map, MemoryRegion *mr)
 {
     MemoryRegionSection section = {
         .mr = mr,
@@ -1729,12 +1722,13 @@ static uint16_t dummy_section(MemoryRegion *mr)
         .size = int128_2_64(),
     };
 
-    return phys_section_add(&section);
+    return phys_section_add(map, &section);
 }
 
 MemoryRegion *iotlb_to_region(hwaddr index)
 {
-    return address_space_memory.dispatch->sections[index & ~TARGET_PAGE_MASK].mr;
+    return address_space_memory.dispatch->map.sections[
+           index & ~TARGET_PAGE_MASK].mr;
 }
 
 static void io_mem_init(void)
@@ -1751,7 +1745,17 @@ static void io_mem_init(void)
 static void mem_begin(MemoryListener *listener)
 {
     AddressSpace *as = container_of(listener, AddressSpace, dispatch_listener);
-    AddressSpaceDispatch *d = g_new(AddressSpaceDispatch, 1);
+    AddressSpaceDispatch *d = g_new0(AddressSpaceDispatch, 1);
+    uint16_t n;
+
+    n = dummy_section(&d->map, &io_mem_unassigned);
+    assert(n == PHYS_SECTION_UNASSIGNED);
+    n = dummy_section(&d->map, &io_mem_notdirty);
+    assert(n == PHYS_SECTION_NOTDIRTY);
+    n = dummy_section(&d->map, &io_mem_rom);
+    assert(n == PHYS_SECTION_ROM);
+    n = dummy_section(&d->map, &io_mem_watch);
+    assert(n == PHYS_SECTION_WATCH);
 
     d->phys_map  = (PhysPageEntry) { .ptr = PHYS_MAP_NODE_NIL, .skip = 1 };
     d->as = as;
@@ -1764,39 +1768,14 @@ static void mem_commit(MemoryListener *listener)
     AddressSpaceDispatch *cur = as->dispatch;
     AddressSpaceDispatch *next = as->next_dispatch;
 
-    next->nodes = next_map.nodes;
-    next->sections = next_map.sections;
-
-    phys_page_compact_all(next, next_map.nodes_nb);
+    phys_page_compact_all(next, next->map.nodes_nb);
 
     as->dispatch = next;
-    g_free(cur);
-}
-
-static void core_begin(MemoryListener *listener)
-{
-    uint16_t n;
 
-    prev_map = g_new(PhysPageMap, 1);
-    *prev_map = next_map;
-
-    memset(&next_map, 0, sizeof(next_map));
-    n = dummy_section(&io_mem_unassigned);
-    assert(n == PHYS_SECTION_UNASSIGNED);
-    n = dummy_section(&io_mem_notdirty);
-    assert(n == PHYS_SECTION_NOTDIRTY);
-    n = dummy_section(&io_mem_rom);
-    assert(n == PHYS_SECTION_ROM);
-    n = dummy_section(&io_mem_watch);
-    assert(n == PHYS_SECTION_WATCH);
-}
-
-/* This listener's commit run after the other AddressSpaceDispatch listeners'.
- * All AddressSpaceDispatch instances have switched to the next map.
- */
-static void core_commit(MemoryListener *listener)
-{
-    phys_sections_free(prev_map);
+    if (cur) {
+        phys_sections_free(&cur->map);
+        g_free(cur);
+    }
 }
 
 static void tcg_commit(MemoryListener *listener)
@@ -1824,8 +1803,6 @@ static void core_log_global_stop(MemoryListener *listener)
 }
 
 static MemoryListener core_memory_listener = {
-    .begin = core_begin,
-    .commit = core_commit,
     .log_global_start = core_log_global_start,
     .log_global_stop = core_log_global_stop,
     .priority = 1,
-- 
MST

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [Qemu-devel] [PULL 25/28] acpi unit-test: load and check facs table
  2013-12-11 18:30 [Qemu-devel] [PULL 00/28] acpi.pci,pc,memory core fixes Michael S. Tsirkin
                   ` (23 preceding siblings ...)
  2013-12-11 18:31 ` [Qemu-devel] [PULL 24/28] exec: separate sections and nodes per address space Michael S. Tsirkin
@ 2013-12-11 18:31 ` Michael S. Tsirkin
  2013-12-11 18:31 ` [Qemu-devel] [PULL 26/28] acpi unit-test: adjust the test data structure for better handling Michael S. Tsirkin
                   ` (2 subsequent siblings)
  27 siblings, 0 replies; 74+ messages in thread
From: Michael S. Tsirkin @ 2013-12-11 18:31 UTC (permalink / raw)
  To: qemu-devel; +Cc: Marcel Apfelbaum

From: Marcel Apfelbaum <marcel.a@redhat.com>

The FACS table does not have a checksum, so we can at least
check the signature (i.e. that the table is present).
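
(For reference, a minimal sketch of the kind of check being added; the value
is just the ASCII bytes 'FACS' read as a little-endian 32-bit integer, and the
helper name below is hypothetical, not part of the patch:)

#include <assert.h>
#include <stdint.h>

/* "FACS" as a little-endian 32-bit value: 'F'=0x46 'A'=0x41 'C'=0x43 'S'=0x53 */
#define FACS_SIG 0x53434146u

/* hypothetical helper: verify the signature read back from guest memory */
static void check_facs_signature(uint32_t sig)
{
    assert(sig == FACS_SIG);
}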

Signed-off-by: Marcel Apfelbaum <marcel.a@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 tests/acpi-test.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/tests/acpi-test.c b/tests/acpi-test.c
index d6ff66f..43775cd 100644
--- a/tests/acpi-test.c
+++ b/tests/acpi-test.c
@@ -30,6 +30,7 @@ typedef struct {
     AcpiRsdpDescriptor rsdp_table;
     AcpiRsdtDescriptorRev1 rsdt_table;
     AcpiFadtDescriptorRev1 fadt_table;
+    AcpiFacsDescriptorRev1 facs_table;
     uint32_t *rsdt_tables_addr;
     int rsdt_tables_nr;
     AcpiSdtTable dsdt_table;
@@ -252,6 +253,22 @@ static void test_acpi_fadt_table(test_data *data)
     g_assert(!acpi_checksum((uint8_t *)fadt_table, fadt_table->length));
 }
 
+static void test_acpi_facs_table(test_data *data)
+{
+    AcpiFacsDescriptorRev1 *facs_table = &data->facs_table;
+    uint32_t addr = data->fadt_table.firmware_ctrl;
+
+    ACPI_READ_FIELD(facs_table->signature, addr);
+    ACPI_READ_FIELD(facs_table->length, addr);
+    ACPI_READ_FIELD(facs_table->hardware_signature, addr);
+    ACPI_READ_FIELD(facs_table->firmware_waking_vector, addr);
+    ACPI_READ_FIELD(facs_table->global_lock, addr);
+    ACPI_READ_FIELD(facs_table->flags, addr);
+    ACPI_READ_ARRAY(facs_table->resverved3, addr);
+
+    g_assert_cmphex(facs_table->signature, ==, ACPI_FACS_SIGNATURE);
+}
+
 static void test_dst_table(AcpiSdtTable *sdt_table, uint32_t addr)
 {
     uint8_t checksum;
@@ -329,6 +346,7 @@ static void test_acpi_one(const char *params)
     test_acpi_rsdp_table(&data);
     test_acpi_rsdt_table(&data);
     test_acpi_fadt_table(&data);
+    test_acpi_facs_table(data);
     test_acpi_dsdt_table(&data);
     test_acpi_ssdt_tables(&data);
 
-- 
MST

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [Qemu-devel] [PULL 26/28] acpi unit-test: adjust the test data structure for better handling
  2013-12-11 18:30 [Qemu-devel] [PULL 00/28] acpi.pci,pc,memory core fixes Michael S. Tsirkin
                   ` (24 preceding siblings ...)
  2013-12-11 18:31 ` [Qemu-devel] [PULL 25/28] acpi unit-test: load and check facs table Michael S. Tsirkin
@ 2013-12-11 18:31 ` Michael S. Tsirkin
  2013-12-11 18:31 ` [Qemu-devel] [PULL 27/28] hpet: fix build with CONFIG_HPET off Michael S. Tsirkin
  2013-12-11 18:31 ` [Qemu-devel] [PULL 28/28] pc: use macro for HPET type Michael S. Tsirkin
  27 siblings, 0 replies; 74+ messages in thread
From: Michael S. Tsirkin @ 2013-12-11 18:31 UTC (permalink / raw)
  To: qemu-devel; +Cc: Marcel Apfelbaum

From: Marcel Apfelbaum <marcel.a@redhat.com>

Ensure that more than one instance of test_data may exist
at a given time. This will help to compare different
acpi table versions.
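
As a rough illustration of the intended usage (the second machine-type string
and the field compared are assumptions for the example, not part of this patch):

/* sketch: two independent test_data instances, e.g. for different machine types */
static void test_acpi_compare_versions(void)
{
    test_data data_a, data_b;

    test_acpi_one("-machine accel=tcg", &data_a);
    test_acpi_one("-machine pc-i440fx-1.7,accel=tcg", &data_b);

    /* with separate instances the loaded tables can be compared directly */
    g_assert_cmpuint(data_a.fadt_table.length, ==, data_b.fadt_table.length);

    free_test_data(&data_a);
    free_test_data(&data_b);
}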

Signed-off-by: Marcel Apfelbaum <marcel.a@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 tests/acpi-test.c | 55 ++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 32 insertions(+), 23 deletions(-)

diff --git a/tests/acpi-test.c b/tests/acpi-test.c
index 43775cd..ca83b1d 100644
--- a/tests/acpi-test.c
+++ b/tests/acpi-test.c
@@ -34,7 +34,7 @@ typedef struct {
     uint32_t *rsdt_tables_addr;
     int rsdt_tables_nr;
     AcpiSdtTable dsdt_table;
-    AcpiSdtTable *ssdt_tables;
+    GArray *ssdt_tables;
 } test_data;
 
 #define LOW(x) ((x) & 0xff)
@@ -118,6 +118,18 @@ static uint8_t boot_sector[0x200] = {
 
 static const char *disk = "tests/acpi-test-disk.raw";
 
+static void free_test_data(test_data *data)
+{
+    int i;
+
+    g_free(data->rsdt_tables_addr);
+    for (i = 0; i < data->ssdt_tables->len; ++i) {
+        g_free(g_array_index(data->ssdt_tables, AcpiSdtTable, i).aml);
+    }
+    g_array_free(data->ssdt_tables, false);
+    g_free(data->dsdt_table.aml);
+}
+
 static uint8_t acpi_checksum(const uint8_t *data, int len)
 {
     int i;
@@ -295,30 +307,30 @@ static void test_acpi_dsdt_table(test_data *data)
 
 static void test_acpi_ssdt_tables(test_data *data)
 {
-    AcpiSdtTable *ssdt_tables;
+    GArray *ssdt_tables;
     int ssdt_tables_nr = data->rsdt_tables_nr - 1; /* fadt is first */
     int i;
 
-    ssdt_tables = g_new0(AcpiSdtTable, ssdt_tables_nr);
+    ssdt_tables = g_array_sized_new(false, true, sizeof(AcpiSdtTable),
+                                    ssdt_tables_nr);
     for (i = 0; i < ssdt_tables_nr; i++) {
-        AcpiSdtTable *ssdt_table = &ssdt_tables[i];
+        AcpiSdtTable ssdt_table;
         uint32_t addr = data->rsdt_tables_addr[i + 1]; /* fadt is first */
-
-        test_dst_table(ssdt_table, addr);
+        test_dst_table(&ssdt_table, addr);
+        g_array_append_val(ssdt_tables, ssdt_table);
     }
     data->ssdt_tables = ssdt_tables;
 }
 
-static void test_acpi_one(const char *params)
+static void test_acpi_one(const char *params, test_data *data)
 {
     char *args;
     uint8_t signature_low;
     uint8_t signature_high;
     uint16_t signature;
     int i;
-    test_data data;
 
-    memset(&data, 0, sizeof(data));
+    memset(data, 0, sizeof(*data));
     args = g_strdup_printf("-net none -display none %s %s",
                            params ? params : "", disk);
     qtest_start(args);
@@ -342,20 +354,13 @@ static void test_acpi_one(const char *params)
     }
     g_assert_cmphex(signature, ==, SIGNATURE);
 
-    test_acpi_rsdp_address(&data);
-    test_acpi_rsdp_table(&data);
-    test_acpi_rsdt_table(&data);
-    test_acpi_fadt_table(&data);
+    test_acpi_rsdp_address(data);
+    test_acpi_rsdp_table(data);
+    test_acpi_rsdt_table(data);
+    test_acpi_fadt_table(data);
     test_acpi_facs_table(data);
-    test_acpi_dsdt_table(&data);
-    test_acpi_ssdt_tables(&data);
-
-    g_free(data.rsdt_tables_addr);
-    for (i = 0; i < (data.rsdt_tables_nr - 1); ++i) {
-        g_free(data.ssdt_tables[i].aml);
-    }
-    g_free(data.ssdt_tables);
-    g_free(data.dsdt_table.aml);
+    test_acpi_dsdt_table(data);
+    test_acpi_ssdt_tables(data);
 
     qtest_quit(global_qtest);
     g_free(args);
@@ -363,10 +368,14 @@ static void test_acpi_one(const char *params)
 
 static void test_acpi_tcg(void)
 {
+    test_data data;
+
     /* Supplying -machine accel argument overrides the default (qtest).
      * This is to make guest actually run.
      */
-    test_acpi_one("-machine accel=tcg");
+    test_acpi_one("-machine accel=tcg", &data);
+
+    free_test_data(&data);
 }
 
 int main(int argc, char *argv[])
-- 
MST

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [Qemu-devel] [PULL 27/28] hpet: fix build with CONFIG_HPET off
  2013-12-11 18:30 [Qemu-devel] [PULL 00/28] acpi.pci,pc,memory core fixes Michael S. Tsirkin
                   ` (25 preceding siblings ...)
  2013-12-11 18:31 ` [Qemu-devel] [PULL 26/28] acpi unit-test: adjust the test data structure for better handling Michael S. Tsirkin
@ 2013-12-11 18:31 ` Michael S. Tsirkin
  2013-12-11 18:31 ` [Qemu-devel] [PULL 28/28] pc: use macro for HPET type Michael S. Tsirkin
  27 siblings, 0 replies; 74+ messages in thread
From: Michael S. Tsirkin @ 2013-12-11 18:31 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-stable

Make hpet_find() inline so that we don't need to build hpet.c
just to check whether an HPET is present.

Fixes a link error with CONFIG_HPET off.
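
A sketch of why inlining is enough (the caller below is hypothetical; the point
is that the presence check now resolves to an object_resolve_path_type() call
in the caller's own translation unit, so hpet.o need not be linked in):

#include "hw/timer/hpet.h"   /* provides the inline hpet_find() */

/* hypothetical caller, e.g. on the ACPI/fw_cfg setup path */
static bool maybe_describe_hpet(void)
{
    if (!hpet_find()) {
        return false;        /* no TYPE_HPET device has been created */
    }
    /* ... describe the HPET to the guest ... */
    return true;
}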

Cc: qemu-stable@nongnu.org
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/hw/timer/hpet.h | 10 +++++++++-
 hw/timer/hpet.c         |  6 ------
 2 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/include/hw/timer/hpet.h b/include/hw/timer/hpet.h
index ab44bd3..773953b 100644
--- a/include/hw/timer/hpet.h
+++ b/include/hw/timer/hpet.h
@@ -13,6 +13,8 @@
 #ifndef QEMU_HPET_EMUL_H
 #define QEMU_HPET_EMUL_H
 
+#include "qom/object.h"
+
 #define HPET_BASE               0xfed00000
 #define HPET_CLK_PERIOD         10000000ULL /* 10000000 femtoseconds == 10ns*/
 
@@ -72,5 +74,11 @@ struct hpet_fw_config
 
 extern struct hpet_fw_config hpet_cfg;
 
-bool hpet_find(void);
+#define TYPE_HPET "hpet"
+
+static inline bool hpet_find(void)
+{
+    return object_resolve_path_type("", TYPE_HPET, NULL);
+}
+
 #endif
diff --git a/hw/timer/hpet.c b/hw/timer/hpet.c
index 0ec440e..bb3bf98 100644
--- a/hw/timer/hpet.c
+++ b/hw/timer/hpet.c
@@ -42,7 +42,6 @@
 
 #define HPET_MSI_SUPPORT        0
 
-#define TYPE_HPET "hpet"
 #define HPET(obj) OBJECT_CHECK(HPETState, (obj), TYPE_HPET)
 
 struct HPETState;
@@ -772,11 +771,6 @@ static void hpet_device_class_init(ObjectClass *klass, void *data)
     dc->props = hpet_device_properties;
 }
 
-bool hpet_find(void)
-{
-    return object_resolve_path_type("", TYPE_HPET, NULL);
-}
-
 static const TypeInfo hpet_device_info = {
     .name          = TYPE_HPET,
     .parent        = TYPE_SYS_BUS_DEVICE,
-- 
MST

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [Qemu-devel] [PULL 28/28] pc: use macro for HPET type
  2013-12-11 18:30 [Qemu-devel] [PULL 00/28] acpi.pci,pc,memory core fixes Michael S. Tsirkin
                   ` (26 preceding siblings ...)
  2013-12-11 18:31 ` [Qemu-devel] [PULL 27/28] hpet: fix build with CONFIG_HPET off Michael S. Tsirkin
@ 2013-12-11 18:31 ` Michael S. Tsirkin
  27 siblings, 0 replies; 74+ messages in thread
From: Michael S. Tsirkin @ 2013-12-11 18:31 UTC (permalink / raw)
  To: qemu-devel; +Cc: Anthony Liguori

Avoid hard-coding the "hpet" type string.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 hw/i386/pc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 8353d10..3cd8f38 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -1282,7 +1282,7 @@ void pc_basic_device_init(ISABus *isa_bus, qemu_irq *gsi,
      */
     if (!no_hpet && (!kvm_irqchip_in_kernel() || kvm_has_pit_state2())) {
         /* In order to set property, here not using sysbus_try_create_simple */
-        hpet = qdev_try_create(NULL, "hpet");
+        hpet = qdev_try_create(NULL, TYPE_HPET);
         if (hpet) {
             /* For pc-piix-*, hpet's intcap is always IRQ2. For pc-q35-1.7
              * and earlier, use IRQ2 for compat. Otherwise, use IRQ16~23,
-- 
MST

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2013-12-11 18:30 ` [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide Michael S. Tsirkin
@ 2014-01-09 17:24   ` Alex Williamson
  2014-01-09 18:00     ` Michael S. Tsirkin
  2014-01-20 16:20     ` Mike Day
  0 siblings, 2 replies; 74+ messages in thread
From: Alex Williamson @ 2014-01-09 17:24 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Paolo Bonzini, qemu-devel, Luiz Capitulino

On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> From: Paolo Bonzini <pbonzini@redhat.com>
> 
> As an alternative to commit 818f86b (exec: limit system memory
> size, 2013-11-04) let's just make all address spaces 64-bit wide.
> This eliminates problems with phys_page_find ignoring bits above
> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> consequently messing up the computations.
> 
> In Luiz's reported crash, at startup gdb attempts to read from address
> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> is the newly introduced master abort region, which is as big as the PCI
> address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> not 2^64.  But we get it anyway because phys_page_find ignores the upper
> bits of the physical address.  In address_space_translate_internal then
> 
>     diff = int128_sub(section->mr->size, int128_make64(addr));
>     *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> 
> diff becomes negative, and int128_get64 booms.
> 
> The size of the PCI address space region should be fixed anyway.
> 
> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  exec.c | 8 ++------
>  1 file changed, 2 insertions(+), 6 deletions(-)
> 
> diff --git a/exec.c b/exec.c
> index 7e5ce93..f907f5f 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -94,7 +94,7 @@ struct PhysPageEntry {
>  #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
>  
>  /* Size of the L2 (and L3, etc) page tables.  */
> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> +#define ADDR_SPACE_BITS 64
>  
>  #define P_L2_BITS 10
>  #define P_L2_SIZE (1 << P_L2_BITS)
> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
>  {
>      system_memory = g_malloc(sizeof(*system_memory));
>  
> -    assert(ADDR_SPACE_BITS <= 64);
> -
> -    memory_region_init(system_memory, NULL, "system",
> -                       ADDR_SPACE_BITS == 64 ?
> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
>      address_space_init(&address_space_memory, system_memory, "memory");
>  
>      system_io = g_malloc(sizeof(*system_io));

This seems to have some unexpected consequences around sizing 64bit PCI
BARs that I'm not sure how to handle.  After this patch I get vfio
traces like this:

vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
(save lower 32bits of BAR)
vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
(write mask to BAR)
vfio: region_del febe0000 - febe3fff
(memory region gets unmapped)
vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
(read size mask)
vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
(restore BAR)
vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
(memory region re-mapped)
vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
(save upper 32bits of BAR)
vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
(write mask to BAR)
vfio: region_del febe0000 - febe3fff
(memory region gets unmapped)
vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
(memory region gets re-mapped with new address)
qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
(iommu barfs because it can only handle 48bit physical addresses)
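
(For context, the trace above is the standard BAR sizing sequence as seen from
the device: save the BAR, write all 1s, read back the size mask, restore.  A
hedged sketch from the guest's point of view, with generic config-space
accessors assumed rather than any QEMU or vfio API:)

#include <stdint.h>

uint32_t pci_cfg_read32(int bdf, int off);          /* assumed accessors */
void     pci_cfg_write32(int bdf, int off, uint32_t val);

/* size the low dword of a memory BAR; low 4 bits are flag bits */
static uint32_t pci_bar_size_low(int bdf, int bar_off)
{
    uint32_t saved = pci_cfg_read32(bdf, bar_off);
    uint32_t mask;

    pci_cfg_write32(bdf, bar_off, 0xffffffff);
    mask = pci_cfg_read32(bdf, bar_off);            /* e.g. 0xffffc004 above */
    pci_cfg_write32(bdf, bar_off, saved);

    return ~(mask & ~0xfU) + 1;                     /* 0xffffc004 -> 0x4000 */
}

/* for a 64-bit BAR the same dance is repeated on the upper dword at
 * bar_off + 4, which is the step that transiently leaves the BAR at
 * 0xfffffffffebe0000 in the trace above
 */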

Prior to this change, there was no re-map with the fffffffffebe0000
address, presumably because it was beyond the address space of the PCI
window.  This address is clearly not in a PCI MMIO space, so why are we
allowing it to be realized in the system address space at this location?
Thanks,

Alex

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2014-01-09 17:24   ` Alex Williamson
@ 2014-01-09 18:00     ` Michael S. Tsirkin
  2014-01-09 18:47       ` Alex Williamson
  2014-01-20 16:20     ` Mike Day
  1 sibling, 1 reply; 74+ messages in thread
From: Michael S. Tsirkin @ 2014-01-09 18:00 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Paolo Bonzini, qemu-devel, Luiz Capitulino

On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > From: Paolo Bonzini <pbonzini@redhat.com>
> > 
> > As an alternative to commit 818f86b (exec: limit system memory
> > size, 2013-11-04) let's just make all address spaces 64-bit wide.
> > This eliminates problems with phys_page_find ignoring bits above
> > TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> > consequently messing up the computations.
> > 
> > In Luiz's reported crash, at startup gdb attempts to read from address
> > 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> > is the newly introduced master abort region, which is as big as the PCI
> > address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> > not 2^64.  But we get it anyway because phys_page_find ignores the upper
> > bits of the physical address.  In address_space_translate_internal then
> > 
> >     diff = int128_sub(section->mr->size, int128_make64(addr));
> >     *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> > 
> > diff becomes negative, and int128_get64 booms.
> > 
> > The size of the PCI address space region should be fixed anyway.
> > 
> > Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> >  exec.c | 8 ++------
> >  1 file changed, 2 insertions(+), 6 deletions(-)
> > 
> > diff --git a/exec.c b/exec.c
> > index 7e5ce93..f907f5f 100644
> > --- a/exec.c
> > +++ b/exec.c
> > @@ -94,7 +94,7 @@ struct PhysPageEntry {
> >  #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> >  
> >  /* Size of the L2 (and L3, etc) page tables.  */
> > -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> > +#define ADDR_SPACE_BITS 64
> >  
> >  #define P_L2_BITS 10
> >  #define P_L2_SIZE (1 << P_L2_BITS)
> > @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> >  {
> >      system_memory = g_malloc(sizeof(*system_memory));
> >  
> > -    assert(ADDR_SPACE_BITS <= 64);
> > -
> > -    memory_region_init(system_memory, NULL, "system",
> > -                       ADDR_SPACE_BITS == 64 ?
> > -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> > +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> >      address_space_init(&address_space_memory, system_memory, "memory");
> >  
> >      system_io = g_malloc(sizeof(*system_io));
> 
> This seems to have some unexpected consequences around sizing 64bit PCI
> BARs that I'm not sure how to handle.

BARs are often disabled during sizing. Maybe you
don't detect the BAR being disabled?

>  After this patch I get vfio
> traces like this:
> 
> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> (save lower 32bits of BAR)
> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> (write mask to BAR)
> vfio: region_del febe0000 - febe3fff
> (memory region gets unmapped)
> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> (read size mask)
> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> (restore BAR)
> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> (memory region re-mapped)
> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> (save upper 32bits of BAR)
> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> (write mask to BAR)
> vfio: region_del febe0000 - febe3fff
> (memory region gets unmapped)
> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> (memory region gets re-mapped with new address)
> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> (iommu barfs because it can only handle 48bit physical addresses)
> 

Why are you trying to program BAR addresses for dma in the iommu?

> Prior to this change, there was no re-map with the fffffffffebe0000
> address, presumably because it was beyond the address space of the PCI
> window.  This address is clearly not in a PCI MMIO space, so why are we
> allowing it to be realized in the system address space at this location?
> Thanks,
> 
> Alex

Why do you think it is not in PCI MMIO space?
True, the CPU can't access this address, but other pci devices can.

-- 
MST

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2014-01-09 18:00     ` Michael S. Tsirkin
@ 2014-01-09 18:47       ` Alex Williamson
  2014-01-09 19:03         ` Alex Williamson
  0 siblings, 1 reply; 74+ messages in thread
From: Alex Williamson @ 2014-01-09 18:47 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Paolo Bonzini, qemu-devel, Luiz Capitulino

On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> > On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > > From: Paolo Bonzini <pbonzini@redhat.com>
> > > 
> > > As an alternative to commit 818f86b (exec: limit system memory
> > > size, 2013-11-04) let's just make all address spaces 64-bit wide.
> > > This eliminates problems with phys_page_find ignoring bits above
> > > TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> > > consequently messing up the computations.
> > > 
> > > In Luiz's reported crash, at startup gdb attempts to read from address
> > > 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> > > is the newly introduced master abort region, which is as big as the PCI
> > > address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> > > not 2^64.  But we get it anyway because phys_page_find ignores the upper
> > > bits of the physical address.  In address_space_translate_internal then
> > > 
> > >     diff = int128_sub(section->mr->size, int128_make64(addr));
> > >     *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> > > 
> > > diff becomes negative, and int128_get64 booms.
> > > 
> > > The size of the PCI address space region should be fixed anyway.
> > > 
> > > Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> > > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > ---
> > >  exec.c | 8 ++------
> > >  1 file changed, 2 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/exec.c b/exec.c
> > > index 7e5ce93..f907f5f 100644
> > > --- a/exec.c
> > > +++ b/exec.c
> > > @@ -94,7 +94,7 @@ struct PhysPageEntry {
> > >  #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> > >  
> > >  /* Size of the L2 (and L3, etc) page tables.  */
> > > -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> > > +#define ADDR_SPACE_BITS 64
> > >  
> > >  #define P_L2_BITS 10
> > >  #define P_L2_SIZE (1 << P_L2_BITS)
> > > @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> > >  {
> > >      system_memory = g_malloc(sizeof(*system_memory));
> > >  
> > > -    assert(ADDR_SPACE_BITS <= 64);
> > > -
> > > -    memory_region_init(system_memory, NULL, "system",
> > > -                       ADDR_SPACE_BITS == 64 ?
> > > -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> > > +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> > >      address_space_init(&address_space_memory, system_memory, "memory");
> > >  
> > >      system_io = g_malloc(sizeof(*system_io));
> > 
> > This seems to have some unexpected consequences around sizing 64bit PCI
> > BARs that I'm not sure how to handle.
> 
> BARs are often disabled during sizing. Maybe you
> don't detect BAR being disabled?

See the trace below, the BARs are not disabled.  QEMU pci-core is doing
the sizing and memory region updates for the BARs, vfio is just a
pass-through here.

> >  After this patch I get vfio
> > traces like this:
> > 
> > vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > (save lower 32bits of BAR)
> > vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > (write mask to BAR)
> > vfio: region_del febe0000 - febe3fff
> > (memory region gets unmapped)
> > vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > (read size mask)
> > vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > (restore BAR)
> > vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > (memory region re-mapped)
> > vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > (save upper 32bits of BAR)
> > vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > (write mask to BAR)
> > vfio: region_del febe0000 - febe3fff
> > (memory region gets unmapped)
> > vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > (memory region gets re-mapped with new address)
> > qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > (iommu barfs because it can only handle 48bit physical addresses)
> > 
> 
> Why are you trying to program BAR addresses for dma in the iommu?

Two reasons: first, I can't tell the difference between RAM and MMIO.
Second, it enables peer-to-peer DMA between devices, which is something
that we might be able to take advantage of with GPU passthrough.

> > Prior to this change, there was no re-map with the fffffffffebe0000
> > address, presumably because it was beyond the address space of the PCI
> > window.  This address is clearly not in a PCI MMIO space, so why are we
> > allowing it to be realized in the system address space at this location?
> > Thanks,
> > 
> > Alex
> 
> Why do you think it is not in PCI MMIO space?
> True, CPU can't access this address but other pci devices can.

What happens on real hardware when an address like this is programmed to
a device?  The CPU doesn't have the physical bits to access it.  I have
serious doubts that another PCI device would be able to access it
either.  Maybe in some limited scenario where the devices are on the
same conventional PCI bus.  In the typical case, PCI addresses are
always limited by some kind of aperture, whether that's explicit in
bridge windows or implicit in hardware design (and perhaps made explicit
in ACPI).  Even if I wanted to filter these out as noise in vfio, how
would I do it in a way that still allows real 64bit MMIO to be
programmed?  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2014-01-09 18:47       ` Alex Williamson
@ 2014-01-09 19:03         ` Alex Williamson
  2014-01-09 21:56           ` Michael S. Tsirkin
  0 siblings, 1 reply; 74+ messages in thread
From: Alex Williamson @ 2014-01-09 19:03 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Paolo Bonzini, qemu-devel, Luiz Capitulino

On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> > On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> > > On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > > > From: Paolo Bonzini <pbonzini@redhat.com>
> > > > 
> > > > As an alternative to commit 818f86b (exec: limit system memory
> > > > size, 2013-11-04) let's just make all address spaces 64-bit wide.
> > > > This eliminates problems with phys_page_find ignoring bits above
> > > > TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> > > > consequently messing up the computations.
> > > > 
> > > > In Luiz's reported crash, at startup gdb attempts to read from address
> > > > 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> > > > is the newly introduced master abort region, which is as big as the PCI
> > > > address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> > > > not 2^64.  But we get it anyway because phys_page_find ignores the upper
> > > > bits of the physical address.  In address_space_translate_internal then
> > > > 
> > > >     diff = int128_sub(section->mr->size, int128_make64(addr));
> > > >     *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> > > > 
> > > > diff becomes negative, and int128_get64 booms.
> > > > 
> > > > The size of the PCI address space region should be fixed anyway.
> > > > 
> > > > Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> > > > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > ---
> > > >  exec.c | 8 ++------
> > > >  1 file changed, 2 insertions(+), 6 deletions(-)
> > > > 
> > > > diff --git a/exec.c b/exec.c
> > > > index 7e5ce93..f907f5f 100644
> > > > --- a/exec.c
> > > > +++ b/exec.c
> > > > @@ -94,7 +94,7 @@ struct PhysPageEntry {
> > > >  #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> > > >  
> > > >  /* Size of the L2 (and L3, etc) page tables.  */
> > > > -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> > > > +#define ADDR_SPACE_BITS 64
> > > >  
> > > >  #define P_L2_BITS 10
> > > >  #define P_L2_SIZE (1 << P_L2_BITS)
> > > > @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> > > >  {
> > > >      system_memory = g_malloc(sizeof(*system_memory));
> > > >  
> > > > -    assert(ADDR_SPACE_BITS <= 64);
> > > > -
> > > > -    memory_region_init(system_memory, NULL, "system",
> > > > -                       ADDR_SPACE_BITS == 64 ?
> > > > -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> > > > +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> > > >      address_space_init(&address_space_memory, system_memory, "memory");
> > > >  
> > > >      system_io = g_malloc(sizeof(*system_io));
> > > 
> > > This seems to have some unexpected consequences around sizing 64bit PCI
> > > BARs that I'm not sure how to handle.
> > 
> > BARs are often disabled during sizing. Maybe you
> > don't detect BAR being disabled?
> 
> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> the sizing and memory region updates for the BARs, vfio is just a
> pass-through here.

Sorry, not in the trace below, but yes the sizing seems to be happening
while I/O & memory are enabled in the command register.  Thanks,

Alex

> > >  After this patch I get vfio
> > > traces like this:
> > > 
> > > vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > > (save lower 32bits of BAR)
> > > vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > > (write mask to BAR)
> > > vfio: region_del febe0000 - febe3fff
> > > (memory region gets unmapped)
> > > vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > > (read size mask)
> > > vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > > (restore BAR)
> > > vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > > (memory region re-mapped)
> > > vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > > (save upper 32bits of BAR)
> > > vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > > (write mask to BAR)
> > > vfio: region_del febe0000 - febe3fff
> > > (memory region gets unmapped)
> > > vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > > (memory region gets re-mapped with new address)
> > > qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > > (iommu barfs because it can only handle 48bit physical addresses)
> > > 
> > 
> > Why are you trying to program BAR addresses for dma in the iommu?
> 
> Two reasons, first I can't tell the difference between RAM and MMIO.
> Second, it enables peer-to-peer DMA between devices, which is something
> that we might be able to take advantage of with GPU passthrough.
> 
> > > Prior to this change, there was no re-map with the fffffffffebe0000
> > > address, presumably because it was beyond the address space of the PCI
> > > window.  This address is clearly not in a PCI MMIO space, so why are we
> > > allowing it to be realized in the system address space at this location?
> > > Thanks,
> > > 
> > > Alex
> > 
> > Why do you think it is not in PCI MMIO space?
> > True, CPU can't access this address but other pci devices can.
> 
> What happens on real hardware when an address like this is programmed to
> a device?  The CPU doesn't have the physical bits to access it.  I have
> serious doubts that another PCI device would be able to access it
> either.  Maybe in some limited scenario where the devices are on the
> same conventional PCI bus.  In the typical case, PCI addresses are
> always limited by some kind of aperture, whether that's explicit in
> bridge windows or implicit in hardware design (and perhaps made explicit
> in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> would I do it in a way that still allows real 64bit MMIO to be
> programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> 
> Alex

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2014-01-09 19:03         ` Alex Williamson
@ 2014-01-09 21:56           ` Michael S. Tsirkin
  2014-01-09 22:42             ` Alex Williamson
  0 siblings, 1 reply; 74+ messages in thread
From: Michael S. Tsirkin @ 2014-01-09 21:56 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Paolo Bonzini, qemu-devel, Luiz Capitulino

On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> > On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> > > On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> > > > On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > > > > From: Paolo Bonzini <pbonzini@redhat.com>
> > > > > 
> > > > > As an alternative to commit 818f86b (exec: limit system memory
> > > > > size, 2013-11-04) let's just make all address spaces 64-bit wide.
> > > > > This eliminates problems with phys_page_find ignoring bits above
> > > > > TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> > > > > consequently messing up the computations.
> > > > > 
> > > > > In Luiz's reported crash, at startup gdb attempts to read from address
> > > > > 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> > > > > is the newly introduced master abort region, which is as big as the PCI
> > > > > address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> > > > > not 2^64.  But we get it anyway because phys_page_find ignores the upper
> > > > > bits of the physical address.  In address_space_translate_internal then
> > > > > 
> > > > >     diff = int128_sub(section->mr->size, int128_make64(addr));
> > > > >     *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> > > > > 
> > > > > diff becomes negative, and int128_get64 booms.
> > > > > 
> > > > > The size of the PCI address space region should be fixed anyway.
> > > > > 
> > > > > Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> > > > > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > > > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > > ---
> > > > >  exec.c | 8 ++------
> > > > >  1 file changed, 2 insertions(+), 6 deletions(-)
> > > > > 
> > > > > diff --git a/exec.c b/exec.c
> > > > > index 7e5ce93..f907f5f 100644
> > > > > --- a/exec.c
> > > > > +++ b/exec.c
> > > > > @@ -94,7 +94,7 @@ struct PhysPageEntry {
> > > > >  #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> > > > >  
> > > > >  /* Size of the L2 (and L3, etc) page tables.  */
> > > > > -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> > > > > +#define ADDR_SPACE_BITS 64
> > > > >  
> > > > >  #define P_L2_BITS 10
> > > > >  #define P_L2_SIZE (1 << P_L2_BITS)
> > > > > @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> > > > >  {
> > > > >      system_memory = g_malloc(sizeof(*system_memory));
> > > > >  
> > > > > -    assert(ADDR_SPACE_BITS <= 64);
> > > > > -
> > > > > -    memory_region_init(system_memory, NULL, "system",
> > > > > -                       ADDR_SPACE_BITS == 64 ?
> > > > > -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> > > > > +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> > > > >      address_space_init(&address_space_memory, system_memory, "memory");
> > > > >  
> > > > >      system_io = g_malloc(sizeof(*system_io));
> > > > 
> > > > This seems to have some unexpected consequences around sizing 64bit PCI
> > > > BARs that I'm not sure how to handle.
> > > 
> > > BARs are often disabled during sizing. Maybe you
> > > don't detect BAR being disabled?
> > 
> > See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> > the sizing an memory region updates for the BARs, vfio is just a
> > pass-through here.
> 
> Sorry, not in the trace below, but yes the sizing seems to be happening
> while I/O & memory are enabled in the command register.  Thanks,
> 
> Alex

OK then from QEMU POV this BAR value is not special at all.

> > > >  After this patch I get vfio
> > > > traces like this:
> > > > 
> > > > vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > > > (save lower 32bits of BAR)
> > > > vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > > > (write mask to BAR)
> > > > vfio: region_del febe0000 - febe3fff
> > > > (memory region gets unmapped)
> > > > vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > > > (read size mask)
> > > > vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > > > (restore BAR)
> > > > vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > > > (memory region re-mapped)
> > > > vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > > > (save upper 32bits of BAR)
> > > > vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > > > (write mask to BAR)
> > > > vfio: region_del febe0000 - febe3fff
> > > > (memory region gets unmapped)
> > > > vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > > > (memory region gets re-mapped with new address)
> > > > qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > > > (iommu barfs because it can only handle 48bit physical addresses)
> > > > 
> > > 
> > > Why are you trying to program BAR addresses for dma in the iommu?
> > 
> > Two reasons, first I can't tell the difference between RAM and MMIO.

Why can't you? Generally the memory core lets you find out easily.
But in this case it's the vfio device itself that is being sized, so you
know for sure it's MMIO.
Maybe you will have the same issue if there's another device with a 64 bit
BAR though, like ivshmem?

> > Second, it enables peer-to-peer DMA between devices, which is something
> > that we might be able to take advantage of with GPU passthrough.
> > 
> > > > Prior to this change, there was no re-map with the fffffffffebe0000
> > > > address, presumably because it was beyond the address space of the PCI
> > > > window.  This address is clearly not in a PCI MMIO space, so why are we
> > > > allowing it to be realized in the system address space at this location?
> > > > Thanks,
> > > > 
> > > > Alex
> > > 
> > > Why do you think it is not in PCI MMIO space?
> > > True, CPU can't access this address but other pci devices can.
> > 
> > What happens on real hardware when an address like this is programmed to
> > a device?  The CPU doesn't have the physical bits to access it.  I have
> > serious doubts that another PCI device would be able to access it
> > either.  Maybe in some limited scenario where the devices are on the
> > same conventional PCI bus.  In the typical case, PCI addresses are
> > always limited by some kind of aperture, whether that's explicit in
> > bridge windows or implicit in hardware design (and perhaps made explicit
> > in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> > would I do it in a way that still allows real 64bit MMIO to be
> > programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> > 
> > Alex
> 

AFAIK PCI doesn't have that knowledge as such. The PCI spec is explicit that
full 64 bit addresses must be allowed, and hardware validation
test suites normally check that this actually works
if it happens.

Yes, if there's a bridge somewhere on the path, that bridge's
windows would protect you, but pci already does this filtering:
if you see this address in the memory map, it means
your virtual device is on the root bus.

So I think it's the other way around: if VFIO requires specific
address ranges to be assigned to devices, it should give this
info to qemu and qemu can give this to the guest.
Then anything outside that range can be ignored by VFIO.

-- 
MST

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2014-01-09 21:56           ` Michael S. Tsirkin
@ 2014-01-09 22:42             ` Alex Williamson
  2014-01-10 12:55               ` Michael S. Tsirkin
  0 siblings, 1 reply; 74+ messages in thread
From: Alex Williamson @ 2014-01-09 22:42 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Paolo Bonzini, qemu-devel, Luiz Capitulino

On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> > On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> > > On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> > > > On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> > > > > On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > > > > > From: Paolo Bonzini <pbonzini@redhat.com>
> > > > > > 
> > > > > > As an alternative to commit 818f86b (exec: limit system memory
> > > > > > size, 2013-11-04) let's just make all address spaces 64-bit wide.
> > > > > > This eliminates problems with phys_page_find ignoring bits above
> > > > > > TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> > > > > > consequently messing up the computations.
> > > > > > 
> > > > > > In Luiz's reported crash, at startup gdb attempts to read from address
> > > > > > 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> > > > > > is the newly introduced master abort region, which is as big as the PCI
> > > > > > address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> > > > > > not 2^64.  But we get it anyway because phys_page_find ignores the upper
> > > > > > bits of the physical address.  In address_space_translate_internal then
> > > > > > 
> > > > > >     diff = int128_sub(section->mr->size, int128_make64(addr));
> > > > > >     *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> > > > > > 
> > > > > > diff becomes negative, and int128_get64 booms.
> > > > > > 
> > > > > > The size of the PCI address space region should be fixed anyway.
> > > > > > 
> > > > > > Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> > > > > > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > > > > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > > > ---
> > > > > >  exec.c | 8 ++------
> > > > > >  1 file changed, 2 insertions(+), 6 deletions(-)
> > > > > > 
> > > > > > diff --git a/exec.c b/exec.c
> > > > > > index 7e5ce93..f907f5f 100644
> > > > > > --- a/exec.c
> > > > > > +++ b/exec.c
> > > > > > @@ -94,7 +94,7 @@ struct PhysPageEntry {
> > > > > >  #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> > > > > >  
> > > > > >  /* Size of the L2 (and L3, etc) page tables.  */
> > > > > > -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> > > > > > +#define ADDR_SPACE_BITS 64
> > > > > >  
> > > > > >  #define P_L2_BITS 10
> > > > > >  #define P_L2_SIZE (1 << P_L2_BITS)
> > > > > > @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> > > > > >  {
> > > > > >      system_memory = g_malloc(sizeof(*system_memory));
> > > > > >  
> > > > > > -    assert(ADDR_SPACE_BITS <= 64);
> > > > > > -
> > > > > > -    memory_region_init(system_memory, NULL, "system",
> > > > > > -                       ADDR_SPACE_BITS == 64 ?
> > > > > > -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> > > > > > +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> > > > > >      address_space_init(&address_space_memory, system_memory, "memory");
> > > > > >  
> > > > > >      system_io = g_malloc(sizeof(*system_io));
> > > > > 
> > > > > This seems to have some unexpected consequences around sizing 64bit PCI
> > > > > BARs that I'm not sure how to handle.
> > > > 
> > > > BARs are often disabled during sizing. Maybe you
> > > > don't detect BAR being disabled?
> > > 
> > > See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> > > the sizing and memory region updates for the BARs, vfio is just a
> > > pass-through here.
> > 
> > Sorry, not in the trace below, but yes the sizing seems to be happening
> > while I/O & memory are enabled in the command register.  Thanks,
> > 
> > Alex
> 
> OK then from QEMU POV this BAR value is not special at all.

Unfortunately

> > > > >  After this patch I get vfio
> > > > > traces like this:
> > > > > 
> > > > > vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > > > > (save lower 32bits of BAR)
> > > > > vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > > > > (write mask to BAR)
> > > > > vfio: region_del febe0000 - febe3fff
> > > > > (memory region gets unmapped)
> > > > > vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > > > > (read size mask)
> > > > > vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > > > > (restore BAR)
> > > > > vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > > > > (memory region re-mapped)
> > > > > vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > > > > (save upper 32bits of BAR)
> > > > > vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > > > > (write mask to BAR)
> > > > > vfio: region_del febe0000 - febe3fff
> > > > > (memory region gets unmapped)
> > > > > vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > > > > (memory region gets re-mapped with new address)
> > > > > qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > > > > (iommu barfs because it can only handle 48bit physical addresses)
> > > > > 
> > > > 
> > > > Why are you trying to program BAR addresses for dma in the iommu?
> > > 
> > > Two reasons, first I can't tell the difference between RAM and MMIO.
> 
> Why can't you? Generally memory core let you find out easily.

My MemoryListener is set up for &address_space_memory and I then filter
out anything that's not memory_region_is_ram().  This still gets
through, so how do I easily find out?
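
(A rough sketch of the filtering described here, not the actual hw/vfio code;
note that a BAR which vfio itself has mmap'd and registered with
memory_region_init_ram_ptr() still counts as "ram" to the memory core, which
is presumably why such sections get through:)

#include "exec/memory.h"

static void sketch_region_add(MemoryListener *listener,
                              MemoryRegionSection *section)
{
    if (!memory_region_is_ram(section->mr)) {
        return;                         /* skip pure MMIO sections */
    }
    /* RAM-backed sections -- including mmap'd BARs -- end up here and are
     * handed to the IOMMU, e.g. via vfio_dma_map()
     */
}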

> But in this case it's vfio device itself that is sized so for sure you
> know it's MMIO.

How so?  I have a MemoryListener as described above and pass everything
through to the IOMMU.  I suppose I could look through all the
VFIODevices and check if the MemoryRegion matches, but that seems really
ugly.

> Maybe you will have same issue if there's another device with a 64 bit
> bar though, like ivshmem?

Perhaps; I suspect I'll see anything that registers its BAR
MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.

> > > Second, it enables peer-to-peer DMA between devices, which is something
> > > that we might be able to take advantage of with GPU passthrough.
> > > 
> > > > > Prior to this change, there was no re-map with the fffffffffebe0000
> > > > > address, presumably because it was beyond the address space of the PCI
> > > > > window.  This address is clearly not in a PCI MMIO space, so why are we
> > > > > allowing it to be realized in the system address space at this location?
> > > > > Thanks,
> > > > > 
> > > > > Alex
> > > > 
> > > > Why do you think it is not in PCI MMIO space?
> > > > True, CPU can't access this address but other pci devices can.
> > > 
> > > What happens on real hardware when an address like this is programmed to
> > > a device?  The CPU doesn't have the physical bits to access it.  I have
> > > serious doubts that another PCI device would be able to access it
> > > either.  Maybe in some limited scenario where the devices are on the
> > > same conventional PCI bus.  In the typical case, PCI addresses are
> > > always limited by some kind of aperture, whether that's explicit in
> > > bridge windows or implicit in hardware design (and perhaps made explicit
> > > in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> > > would I do it in a way that still allows real 64bit MMIO to be
> > > programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> > > 
> > > Alex
> > 
> 
> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
> full 64 bit addresses must be allowed and hardware validation
> test suites normally check that it actually does work
> if it happens.

Sure, PCI devices themselves, but the chipset typically has defined
routing; that's more what I'm referring to.  There are generally only
fixed address windows for RAM vs MMIO.

> Yes, if there's a bridge somewhere on the path that bridge's
> windows would protect you, but pci already does this filtering:
> if you see this address in the memory map this means
> your virtual device is on root bus.
> 
> So I think it's the other way around: if VFIO requires specific
> address ranges to be assigned to devices, it should give this
> info to qemu and qemu can give this to guest.
> Then anything outside that range can be ignored by VFIO.

Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
currently no way to find out the address width of the IOMMU.  We've been
getting by because it's safely close enough to the CPU address width to
not be a concern until we start exposing things at the top of the 64bit
address space.  Maybe I can safely ignore anything above
TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,

Alex
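
(A hedged sketch of the stop-gap filter mentioned above -- skip anything beyond
the target's physical address width; the helper name and the exact cutoff are
assumptions, and this is not meant as the actual vfio change:)

#include "exec/memory.h"

static bool sketch_section_reachable(MemoryRegionSection *section)
{
    /* assumes the section size fits in 64 bits, true for RAM and BAR sections */
    hwaddr end = section->offset_within_address_space +
                 int128_get64(section->size);

    return end <= (1ULL << TARGET_PHYS_ADDR_SPACE_BITS);
}

A listener could then bail out early for sections where this returns false,
which would drop the 0xfffffffffebe0000 mapping above while leaving ordinary
64-bit MMIO below the CPU's address width alone.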

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2014-01-09 22:42             ` Alex Williamson
@ 2014-01-10 12:55               ` Michael S. Tsirkin
  2014-01-10 15:31                 ` Alex Williamson
  0 siblings, 1 reply; 74+ messages in thread
From: Michael S. Tsirkin @ 2014-01-10 12:55 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Paolo Bonzini, qemu-devel, Luiz Capitulino

On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> > On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> > > On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> > > > On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> > > > > On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> > > > > > On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > > > > > > From: Paolo Bonzini <pbonzini@redhat.com>
> > > > > > > 
> > > > > > > As an alternative to commit 818f86b (exec: limit system memory
> > > > > > > size, 2013-11-04) let's just make all address spaces 64-bit wide.
> > > > > > > This eliminates problems with phys_page_find ignoring bits above
> > > > > > > TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> > > > > > > consequently messing up the computations.
> > > > > > > 
> > > > > > > In Luiz's reported crash, at startup gdb attempts to read from address
> > > > > > > 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> > > > > > > is the newly introduced master abort region, which is as big as the PCI
> > > > > > > address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> > > > > > > not 2^64.  But we get it anyway because phys_page_find ignores the upper
> > > > > > > bits of the physical address.  In address_space_translate_internal then
> > > > > > > 
> > > > > > >     diff = int128_sub(section->mr->size, int128_make64(addr));
> > > > > > >     *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> > > > > > > 
> > > > > > > diff becomes negative, and int128_get64 booms.
> > > > > > > 
> > > > > > > The size of the PCI address space region should be fixed anyway.
> > > > > > > 
> > > > > > > Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> > > > > > > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > > > > > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > ---
> > > > > > >  exec.c | 8 ++------
> > > > > > >  1 file changed, 2 insertions(+), 6 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/exec.c b/exec.c
> > > > > > > index 7e5ce93..f907f5f 100644
> > > > > > > --- a/exec.c
> > > > > > > +++ b/exec.c
> > > > > > > @@ -94,7 +94,7 @@ struct PhysPageEntry {
> > > > > > >  #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> > > > > > >  
> > > > > > >  /* Size of the L2 (and L3, etc) page tables.  */
> > > > > > > -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> > > > > > > +#define ADDR_SPACE_BITS 64
> > > > > > >  
> > > > > > >  #define P_L2_BITS 10
> > > > > > >  #define P_L2_SIZE (1 << P_L2_BITS)
> > > > > > > @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> > > > > > >  {
> > > > > > >      system_memory = g_malloc(sizeof(*system_memory));
> > > > > > >  
> > > > > > > -    assert(ADDR_SPACE_BITS <= 64);
> > > > > > > -
> > > > > > > -    memory_region_init(system_memory, NULL, "system",
> > > > > > > -                       ADDR_SPACE_BITS == 64 ?
> > > > > > > -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> > > > > > > +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> > > > > > >      address_space_init(&address_space_memory, system_memory, "memory");
> > > > > > >  
> > > > > > >      system_io = g_malloc(sizeof(*system_io));
> > > > > > 
> > > > > > This seems to have some unexpected consequences around sizing 64bit PCI
> > > > > > BARs that I'm not sure how to handle.
> > > > > 
> > > > > BARs are often disabled during sizing. Maybe you
> > > > > don't detect BAR being disabled?
> > > > 
> > > > See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> > > > the sizing and memory region updates for the BARs, vfio is just a
> > > > pass-through here.
> > > 
> > > Sorry, not in the trace below, but yes the sizing seems to be happening
> > > while I/O & memory are enabled in the command register.  Thanks,
> > > 
> > > Alex
> > 
> > OK then from QEMU POV this BAR value is not special at all.
> 
> Unfortunately
> 
> > > > > >  After this patch I get vfio
> > > > > > traces like this:
> > > > > > 
> > > > > > vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > > > > > (save lower 32bits of BAR)
> > > > > > vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > > > > > (write mask to BAR)
> > > > > > vfio: region_del febe0000 - febe3fff
> > > > > > (memory region gets unmapped)
> > > > > > vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > > > > > (read size mask)
> > > > > > vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > > > > > (restore BAR)
> > > > > > vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > > > > > (memory region re-mapped)
> > > > > > vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > > > > > (save upper 32bits of BAR)
> > > > > > vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > > > > > (write mask to BAR)
> > > > > > vfio: region_del febe0000 - febe3fff
> > > > > > (memory region gets unmapped)
> > > > > > vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > > > > > (memory region gets re-mapped with new address)
> > > > > > qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > > > > > (iommu barfs because it can only handle 48bit physical addresses)
> > > > > > 
> > > > > 
> > > > > Why are you trying to program BAR addresses for dma in the iommu?
> > > > 
> > > > Two reasons, first I can't tell the difference between RAM and MMIO.
> > 
> > Why can't you? Generally memory core let you find out easily.
> 
> My MemoryListener is setup for &address_space_memory and I then filter
> out anything that's not memory_region_is_ram().  This still gets
> through, so how do I easily find out?
> 
> > But in this case it's vfio device itself that is sized so for sure you
> > know it's MMIO.
> 
> How so?  I have a MemoryListener as described above and pass everything
> through to the IOMMU.  I suppose I could look through all the
> VFIODevices and check if the MemoryRegion matches, but that seems really
> ugly.
> 
> > Maybe you will have same issue if there's another device with a 64 bit
> > bar though, like ivshmem?
> 
> Perhaps, I suspect I'll see anything that registers their BAR
> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.

Must be a 64 bit BAR to trigger the issue though.

> > > > Second, it enables peer-to-peer DMA between devices, which is something
> > > > that we might be able to take advantage of with GPU passthrough.
> > > > 
> > > > > > Prior to this change, there was no re-map with the fffffffffebe0000
> > > > > > address, presumably because it was beyond the address space of the PCI
> > > > > > window.  This address is clearly not in a PCI MMIO space, so why are we
> > > > > > allowing it to be realized in the system address space at this location?
> > > > > > Thanks,
> > > > > > 
> > > > > > Alex
> > > > > 
> > > > > Why do you think it is not in PCI MMIO space?
> > > > > True, CPU can't access this address but other pci devices can.
> > > > 
> > > > What happens on real hardware when an address like this is programmed to
> > > > a device?  The CPU doesn't have the physical bits to access it.  I have
> > > > serious doubts that another PCI device would be able to access it
> > > > either.  Maybe in some limited scenario where the devices are on the
> > > > same conventional PCI bus.  In the typical case, PCI addresses are
> > > > always limited by some kind of aperture, whether that's explicit in
> > > > bridge windows or implicit in hardware design (and perhaps made explicit
> > > > in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> > > > would I do it in a way that still allows real 64bit MMIO to be
> > > > programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> > > > 
> > > > Alex
> > > 
> > 
> > AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
> > full 64 bit addresses must be allowed and hardware validation
> > test suites normally check that it actually does work
> > if it happens.
> 
> Sure, PCI devices themselves, but the chipset typically has defined
> routing, that's more what I'm referring to.  There are generally only
> fixed address windows for RAM vs MMIO.

The physical chipset? Likely - in the presence of an IOMMU.
Without that, devices can talk to each other without going
through the chipset, and the bridge spec is very explicit that
full 64 bit addressing must be supported.

So as long as we don't emulate an IOMMU,
the guest will normally think it's okay to use any address.

> > Yes, if there's a bridge somewhere on the path that bridge's
> > windows would protect you, but pci already does this filtering:
> > if you see this address in the memory map this means
> > your virtual device is on root bus.
> > 
> > So I think it's the other way around: if VFIO requires specific
> > address ranges to be assigned to devices, it should give this
> > info to qemu and qemu can give this to guest.
> > Then anything outside that range can be ignored by VFIO.
> 
> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
> currently no way to find out the address width of the IOMMU.  We've been
> getting by because it's safely close enough to the CPU address width to
> not be a concern until we start exposing things at the top of the 64bit
> address space.  Maybe I can safely ignore anything above
> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> 
> Alex

I think it's not related to the target CPU at all - it's a host limitation.
So just make up your own constant, maybe depending on the host architecture.
Long term, add an ioctl to query it.

Also, we can add a fwcfg interface to tell the BIOS that it should avoid
placing BARs above some address.

Since it's a vfio limitation, I think it should be a vfio API, along the
lines of vfio_get_addr_space_bits(void).
(Is this true btw? legacy assignment doesn't have this problem?)
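
Roughly something along these lines - just a sketch to show the shape,
all the names here are made up and 48 is only a guess for common x86
hosts:

/* Hypothetical helper: how many IOVA bits the host IOMMU can map.
 * Short term this is a per-host-architecture guess; long term it
 * would be filled in from an ioctl querying the host IOMMU driver.
 */
static unsigned int vfio_get_addr_space_bits(void)
{
#if defined(__x86_64__) || defined(__i386__)
    return 48;    /* assumed typical VT-d/AMD-Vi limit */
#else
    return 64;    /* assume no limit until we know better */
#endif
}

/* In the MemoryListener callbacks, sections the host IOMMU cannot
 * map would then be skipped instead of failing the DMA map. */
static bool vfio_section_out_of_range(MemoryRegionSection *section)
{
    unsigned int bits = vfio_get_addr_space_bits();

    return bits < 64 &&
           section->offset_within_address_space >= (1ULL << bits);
}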

Does something like this make sense to you?

-- 
MST

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2014-01-10 12:55               ` Michael S. Tsirkin
@ 2014-01-10 15:31                 ` Alex Williamson
  2014-01-12  7:54                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 74+ messages in thread
From: Alex Williamson @ 2014-01-10 15:31 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Paolo Bonzini, qemu-devel, Luiz Capitulino

On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> > On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> > > On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> > > > On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> > > > > On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> > > > > > On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> > > > > > > On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > > > > > > > From: Paolo Bonzini <pbonzini@redhat.com>
> > > > > > > > 
> > > > > > > > As an alternative to commit 818f86b (exec: limit system memory
> > > > > > > > size, 2013-11-04) let's just make all address spaces 64-bit wide.
> > > > > > > > This eliminates problems with phys_page_find ignoring bits above
> > > > > > > > TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> > > > > > > > consequently messing up the computations.
> > > > > > > > 
> > > > > > > > In Luiz's reported crash, at startup gdb attempts to read from address
> > > > > > > > 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> > > > > > > > is the newly introduced master abort region, which is as big as the PCI
> > > > > > > > address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> > > > > > > > not 2^64.  But we get it anyway because phys_page_find ignores the upper
> > > > > > > > bits of the physical address.  In address_space_translate_internal then
> > > > > > > > 
> > > > > > > >     diff = int128_sub(section->mr->size, int128_make64(addr));
> > > > > > > >     *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> > > > > > > > 
> > > > > > > > diff becomes negative, and int128_get64 booms.
> > > > > > > > 
> > > > > > > > The size of the PCI address space region should be fixed anyway.
> > > > > > > > 
> > > > > > > > Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> > > > > > > > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > > > > > > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > ---
> > > > > > > >  exec.c | 8 ++------
> > > > > > > >  1 file changed, 2 insertions(+), 6 deletions(-)
> > > > > > > > 
> > > > > > > > diff --git a/exec.c b/exec.c
> > > > > > > > index 7e5ce93..f907f5f 100644
> > > > > > > > --- a/exec.c
> > > > > > > > +++ b/exec.c
> > > > > > > > @@ -94,7 +94,7 @@ struct PhysPageEntry {
> > > > > > > >  #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> > > > > > > >  
> > > > > > > >  /* Size of the L2 (and L3, etc) page tables.  */
> > > > > > > > -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> > > > > > > > +#define ADDR_SPACE_BITS 64
> > > > > > > >  
> > > > > > > >  #define P_L2_BITS 10
> > > > > > > >  #define P_L2_SIZE (1 << P_L2_BITS)
> > > > > > > > @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> > > > > > > >  {
> > > > > > > >      system_memory = g_malloc(sizeof(*system_memory));
> > > > > > > >  
> > > > > > > > -    assert(ADDR_SPACE_BITS <= 64);
> > > > > > > > -
> > > > > > > > -    memory_region_init(system_memory, NULL, "system",
> > > > > > > > -                       ADDR_SPACE_BITS == 64 ?
> > > > > > > > -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> > > > > > > > +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> > > > > > > >      address_space_init(&address_space_memory, system_memory, "memory");
> > > > > > > >  
> > > > > > > >      system_io = g_malloc(sizeof(*system_io));
> > > > > > > 
> > > > > > > This seems to have some unexpected consequences around sizing 64bit PCI
> > > > > > > BARs that I'm not sure how to handle.
> > > > > > 
> > > > > > BARs are often disabled during sizing. Maybe you
> > > > > > don't detect BAR being disabled?
> > > > > 
> > > > > See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> > > > > the sizing and memory region updates for the BARs, vfio is just a
> > > > > pass-through here.
> > > > 
> > > > Sorry, not in the trace below, but yes the sizing seems to be happening
> > > > while I/O & memory are enabled in the command register.  Thanks,
> > > > 
> > > > Alex
> > > 
> > > OK then from QEMU POV this BAR value is not special at all.
> > 
> > Unfortunately
> > 
> > > > > > >  After this patch I get vfio
> > > > > > > traces like this:
> > > > > > > 
> > > > > > > vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > > > > > > (save lower 32bits of BAR)
> > > > > > > vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > > > > > > (write mask to BAR)
> > > > > > > vfio: region_del febe0000 - febe3fff
> > > > > > > (memory region gets unmapped)
> > > > > > > vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > > > > > > (read size mask)
> > > > > > > vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > > > > > > (restore BAR)
> > > > > > > vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > > > > > > (memory region re-mapped)
> > > > > > > vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > > > > > > (save upper 32bits of BAR)
> > > > > > > vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > > > > > > (write mask to BAR)
> > > > > > > vfio: region_del febe0000 - febe3fff
> > > > > > > (memory region gets unmapped)
> > > > > > > vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > > > > > > (memory region gets re-mapped with new address)
> > > > > > > qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > > > > > > (iommu barfs because it can only handle 48bit physical addresses)
> > > > > > > 
> > > > > > 
> > > > > > Why are you trying to program BAR addresses for dma in the iommu?
> > > > > 
> > > > > Two reasons, first I can't tell the difference between RAM and MMIO.
> > > 
> > > Why can't you? Generally memory core let you find out easily.
> > 
> > My MemoryListener is setup for &address_space_memory and I then filter
> > out anything that's not memory_region_is_ram().  This still gets
> > through, so how do I easily find out?
> > 
> > > But in this case it's vfio device itself that is sized so for sure you
> > > know it's MMIO.
> > 
> > How so?  I have a MemoryListener as described above and pass everything
> > through to the IOMMU.  I suppose I could look through all the
> > VFIODevices and check if the MemoryRegion matches, but that seems really
> > ugly.
> > 
> > > Maybe you will have same issue if there's another device with a 64 bit
> > > bar though, like ivshmem?
> > 
> > Perhaps, I suspect I'll see anything that registers their BAR
> > MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
> 
> Must be a 64 bit BAR to trigger the issue though.
> 
> > > > > Second, it enables peer-to-peer DMA between devices, which is something
> > > > > that we might be able to take advantage of with GPU passthrough.
> > > > > 
> > > > > > > Prior to this change, there was no re-map with the fffffffffebe0000
> > > > > > > address, presumably because it was beyond the address space of the PCI
> > > > > > > window.  This address is clearly not in a PCI MMIO space, so why are we
> > > > > > > allowing it to be realized in the system address space at this location?
> > > > > > > Thanks,
> > > > > > > 
> > > > > > > Alex
> > > > > > 
> > > > > > Why do you think it is not in PCI MMIO space?
> > > > > > True, CPU can't access this address but other pci devices can.
> > > > > 
> > > > > What happens on real hardware when an address like this is programmed to
> > > > > a device?  The CPU doesn't have the physical bits to access it.  I have
> > > > > serious doubts that another PCI device would be able to access it
> > > > > either.  Maybe in some limited scenario where the devices are on the
> > > > > same conventional PCI bus.  In the typical case, PCI addresses are
> > > > > always limited by some kind of aperture, whether that's explicit in
> > > > > bridge windows or implicit in hardware design (and perhaps made explicit
> > > > > in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> > > > > would I do it in a way that still allows real 64bit MMIO to be
> > > > > programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> > > > > 
> > > > > Alex
> > > > 
> > > 
> > > AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
> > > full 64 bit addresses must be allowed and hardware validation
> > > test suites normally check that it actually does work
> > > if it happens.
> > 
> > Sure, PCI devices themselves, but the chipset typically has defined
> > routing, that's more what I'm referring to.  There are generally only
> > fixed address windows for RAM vs MMIO.
> 
> The physical chipset? Likely - in the presence of IOMMU.
> Without that, devices can talk to each other without going
> through chipset, and bridge spec is very explicit that
> full 64 bit addressing must be supported.
> 
> So as long as we don't emulate an IOMMU,
> guest will normally think it's okay to use any address.
> 
> > > Yes, if there's a bridge somewhere on the path that bridge's
> > > windows would protect you, but pci already does this filtering:
> > > if you see this address in the memory map this means
> > > your virtual device is on root bus.
> > > 
> > > So I think it's the other way around: if VFIO requires specific
> > > address ranges to be assigned to devices, it should give this
> > > info to qemu and qemu can give this to guest.
> > > Then anything outside that range can be ignored by VFIO.
> > 
> > Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
> > currently no way to find out the address width of the IOMMU.  We've been
> > getting by because it's safely close enough to the CPU address width to
> > not be a concern until we start exposing things at the top of the 64bit
> > address space.  Maybe I can safely ignore anything above
> > TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> > 
> > Alex
> 
> I think it's not related to target CPU at all - it's a host limitation.
> So just make up your own constant, maybe depending on host architecture.
> Long term add an ioctl to query it.

It's a hardware limitation which I'd imagine has some loose ties to the
physical address bits of the CPU.

> Also, we can add a fwcfg interface to tell bios that it should avoid
> placing BARs above some address.

That doesn't help this case; it's a spurious mapping caused by sizing
the BARs while they are enabled.  We may still want such a thing to feed
into building the ACPI tables though.

> Since it's a vfio limitation I think it should be a vfio API, along the
> lines of vfio_get_addr_space_bits(void).
> (Is this true btw? legacy assignment doesn't have this problem?)

It's an IOMMU hardware limitation; legacy assignment has the same
problem.  It looks like legacy will abort() in QEMU for the failed
mapping, and I'm planning to tighten vfio to also kill the VM for failed
mappings.  In the short term, I think I'll ignore any mappings above
TARGET_PHYS_ADDR_SPACE_BITS.  Long term, vfio already has an IOMMU info
ioctl that we could use to return this information, but we'll need to
figure out how to get it out of the IOMMU driver first.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2014-01-10 15:31                 ` Alex Williamson
@ 2014-01-12  7:54                   ` Michael S. Tsirkin
  2014-01-12 15:03                     ` Alexander Graf
  2014-01-14 13:50                     ` Mike Day
  0 siblings, 2 replies; 74+ messages in thread
From: Michael S. Tsirkin @ 2014-01-12  7:54 UTC (permalink / raw)
  To: Alex Williamson
  Cc: peter.maydell, aik, agraf, qemu-devel, Paolo Bonzini,
	Luiz Capitulino, david

On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> > On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> > > On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> > > > On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> > > > > On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> > > > > > On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> > > > > > > On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> > > > > > > > On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > > > > > > > > From: Paolo Bonzini <pbonzini@redhat.com>
> > > > > > > > > 
> > > > > > > > > As an alternative to commit 818f86b (exec: limit system memory
> > > > > > > > > size, 2013-11-04) let's just make all address spaces 64-bit wide.
> > > > > > > > > This eliminates problems with phys_page_find ignoring bits above
> > > > > > > > > TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> > > > > > > > > consequently messing up the computations.
> > > > > > > > > 
> > > > > > > > > In Luiz's reported crash, at startup gdb attempts to read from address
> > > > > > > > > 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> > > > > > > > > is the newly introduced master abort region, which is as big as the PCI
> > > > > > > > > address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> > > > > > > > > not 2^64.  But we get it anyway because phys_page_find ignores the upper
> > > > > > > > > bits of the physical address.  In address_space_translate_internal then
> > > > > > > > > 
> > > > > > > > >     diff = int128_sub(section->mr->size, int128_make64(addr));
> > > > > > > > >     *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> > > > > > > > > 
> > > > > > > > > diff becomes negative, and int128_get64 booms.
> > > > > > > > > 
> > > > > > > > > The size of the PCI address space region should be fixed anyway.
> > > > > > > > > 
> > > > > > > > > Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> > > > > > > > > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > > > > > > > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > ---
> > > > > > > > >  exec.c | 8 ++------
> > > > > > > > >  1 file changed, 2 insertions(+), 6 deletions(-)
> > > > > > > > > 
> > > > > > > > > diff --git a/exec.c b/exec.c
> > > > > > > > > index 7e5ce93..f907f5f 100644
> > > > > > > > > --- a/exec.c
> > > > > > > > > +++ b/exec.c
> > > > > > > > > @@ -94,7 +94,7 @@ struct PhysPageEntry {
> > > > > > > > >  #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> > > > > > > > >  
> > > > > > > > >  /* Size of the L2 (and L3, etc) page tables.  */
> > > > > > > > > -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> > > > > > > > > +#define ADDR_SPACE_BITS 64
> > > > > > > > >  
> > > > > > > > >  #define P_L2_BITS 10
> > > > > > > > >  #define P_L2_SIZE (1 << P_L2_BITS)
> > > > > > > > > @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> > > > > > > > >  {
> > > > > > > > >      system_memory = g_malloc(sizeof(*system_memory));
> > > > > > > > >  
> > > > > > > > > -    assert(ADDR_SPACE_BITS <= 64);
> > > > > > > > > -
> > > > > > > > > -    memory_region_init(system_memory, NULL, "system",
> > > > > > > > > -                       ADDR_SPACE_BITS == 64 ?
> > > > > > > > > -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> > > > > > > > > +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> > > > > > > > >      address_space_init(&address_space_memory, system_memory, "memory");
> > > > > > > > >  
> > > > > > > > >      system_io = g_malloc(sizeof(*system_io));
> > > > > > > > 
> > > > > > > > This seems to have some unexpected consequences around sizing 64bit PCI
> > > > > > > > BARs that I'm not sure how to handle.
> > > > > > > 
> > > > > > > BARs are often disabled during sizing. Maybe you
> > > > > > > don't detect BAR being disabled?
> > > > > > 
> > > > > > See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> > > > > > the sizing and memory region updates for the BARs, vfio is just a
> > > > > > pass-through here.
> > > > > 
> > > > > Sorry, not in the trace below, but yes the sizing seems to be happening
> > > > > while I/O & memory are enabled in the command register.  Thanks,
> > > > > 
> > > > > Alex
> > > > 
> > > > OK then from QEMU POV this BAR value is not special at all.
> > > 
> > > Unfortunately
> > > 
> > > > > > > >  After this patch I get vfio
> > > > > > > > traces like this:
> > > > > > > > 
> > > > > > > > vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > > > > > > > (save lower 32bits of BAR)
> > > > > > > > vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > > > > > > > (write mask to BAR)
> > > > > > > > vfio: region_del febe0000 - febe3fff
> > > > > > > > (memory region gets unmapped)
> > > > > > > > vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > > > > > > > (read size mask)
> > > > > > > > vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > > > > > > > (restore BAR)
> > > > > > > > vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > > > > > > > (memory region re-mapped)
> > > > > > > > vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > > > > > > > (save upper 32bits of BAR)
> > > > > > > > vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > > > > > > > (write mask to BAR)
> > > > > > > > vfio: region_del febe0000 - febe3fff
> > > > > > > > (memory region gets unmapped)
> > > > > > > > vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > > > > > > > (memory region gets re-mapped with new address)
> > > > > > > > qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > > > > > > > (iommu barfs because it can only handle 48bit physical addresses)
> > > > > > > > 
> > > > > > > 
> > > > > > > Why are you trying to program BAR addresses for dma in the iommu?
> > > > > > 
> > > > > > Two reasons, first I can't tell the difference between RAM and MMIO.
> > > > 
> > > > Why can't you? Generally memory core let you find out easily.
> > > 
> > > My MemoryListener is setup for &address_space_memory and I then filter
> > > out anything that's not memory_region_is_ram().  This still gets
> > > through, so how do I easily find out?
> > > 
> > > > But in this case it's vfio device itself that is sized so for sure you
> > > > know it's MMIO.
> > > 
> > > How so?  I have a MemoryListener as described above and pass everything
> > > through to the IOMMU.  I suppose I could look through all the
> > > VFIODevices and check if the MemoryRegion matches, but that seems really
> > > ugly.
> > > 
> > > > Maybe you will have same issue if there's another device with a 64 bit
> > > > bar though, like ivshmem?
> > > 
> > > Perhaps, I suspect I'll see anything that registers their BAR
> > > MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
> > 
> > Must be a 64 bit BAR to trigger the issue though.
> > 
> > > > > > Second, it enables peer-to-peer DMA between devices, which is something
> > > > > > that we might be able to take advantage of with GPU passthrough.
> > > > > > 
> > > > > > > > Prior to this change, there was no re-map with the fffffffffebe0000
> > > > > > > > address, presumably because it was beyond the address space of the PCI
> > > > > > > > window.  This address is clearly not in a PCI MMIO space, so why are we
> > > > > > > > allowing it to be realized in the system address space at this location?
> > > > > > > > Thanks,
> > > > > > > > 
> > > > > > > > Alex
> > > > > > > 
> > > > > > > Why do you think it is not in PCI MMIO space?
> > > > > > > True, CPU can't access this address but other pci devices can.
> > > > > > 
> > > > > > What happens on real hardware when an address like this is programmed to
> > > > > > a device?  The CPU doesn't have the physical bits to access it.  I have
> > > > > > serious doubts that another PCI device would be able to access it
> > > > > > either.  Maybe in some limited scenario where the devices are on the
> > > > > > same conventional PCI bus.  In the typical case, PCI addresses are
> > > > > > always limited by some kind of aperture, whether that's explicit in
> > > > > > bridge windows or implicit in hardware design (and perhaps made explicit
> > > > > > in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> > > > > > would I do it in a way that still allows real 64bit MMIO to be
> > > > > > programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> > > > > > 
> > > > > > Alex
> > > > > 
> > > > 
> > > > AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
> > > > full 64 bit addresses must be allowed and hardware validation
> > > > test suites normally check that it actually does work
> > > > if it happens.
> > > 
> > > Sure, PCI devices themselves, but the chipset typically has defined
> > > routing, that's more what I'm referring to.  There are generally only
> > > fixed address windows for RAM vs MMIO.
> > 
> > The physical chipset? Likely - in the presence of IOMMU.
> > Without that, devices can talk to each other without going
> > through chipset, and bridge spec is very explicit that
> > full 64 bit addressing must be supported.
> > 
> > So as long as we don't emulate an IOMMU,
> > guest will normally think it's okay to use any address.
> > 
> > > > Yes, if there's a bridge somewhere on the path that bridge's
> > > > windows would protect you, but pci already does this filtering:
> > > > if you see this address in the memory map this means
> > > > your virtual device is on root bus.
> > > > 
> > > > So I think it's the other way around: if VFIO requires specific
> > > > address ranges to be assigned to devices, it should give this
> > > > info to qemu and qemu can give this to guest.
> > > > Then anything outside that range can be ignored by VFIO.
> > > 
> > > Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
> > > currently no way to find out the address width of the IOMMU.  We've been
> > > getting by because it's safely close enough to the CPU address width to
> > > not be a concern until we start exposing things at the top of the 64bit
> > > address space.  Maybe I can safely ignore anything above
> > > TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> > > 
> > > Alex
> > 
> > I think it's not related to target CPU at all - it's a host limitation.
> > So just make up your own constant, maybe depending on host architecture.
> > Long term add an ioctl to query it.
> 
> It's a hardware limitation which I'd imagine has some loose ties to the
> physical address bits of the CPU.
> 
> > Also, we can add a fwcfg interface to tell bios that it should avoid
> > placing BARs above some address.
> 
> That doesn't help this case, it's a spurious mapping caused by sizing
> the BARs with them enabled.  We may still want such a thing to feed into
> building ACPI tables though.

Well, the point is that if you want the BIOS to avoid
specific addresses, you need to tell it what to avoid.
But neither the BIOS nor ACPI actually covers the range above
2^48 ATM, so it's not a high priority.

> > Since it's a vfio limitation I think it should be a vfio API, along the
> > lines of vfio_get_addr_space_bits(void).
> > (Is this true btw? legacy assignment doesn't have this problem?)
> 
> It's an IOMMU hardware limitation, legacy assignment has the same
> problem.  It looks like legacy will abort() in QEMU for the failed
> mapping and I'm planning to tighten vfio to also kill the VM for failed
> mappings.  In the short term, I think I'll ignore any mappings above
> TARGET_PHYS_ADDR_SPACE_BITS,

That seems very wrong.  It will still fail on an x86 host if we are
emulating a CPU with full 64 bit addressing.  The limitation is on the
host side; there's no real reason to tie it to the target.

> long term vfio already has an IOMMU info
> ioctl that we could use to return this information, but we'll need to
> figure out how to get it out of the IOMMU driver first.
>  Thanks,
> 
> Alex

Short term, just assume 48 bits on x86.

We need to figure out what the limitation is on ppc and arm -
maybe there's none and they can address the full 64 bit range.

Cc'ing some people who might know about these platforms.
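
For the long-term ioctl, one possible shape - purely illustrative, the
ADDR_BITS flag and field below do not exist in the real VFIO UAPI:

/* Sketch of an extension to the existing VFIO_IOMMU_GET_INFO ioctl. */
struct vfio_iommu_type1_info {
        __u32   argsz;
        __u32   flags;
#define VFIO_IOMMU_INFO_PGSIZES   (1 << 0)      /* existing */
#define VFIO_IOMMU_INFO_ADDR_BITS (1 << 1)      /* hypothetical */
        __u64   iova_pgsizes;
        __u32   addr_space_bits;                /* hypothetical */
        __u32   __resv;
};

/* Userspace side, e.g. when qemu sets up the vfio container: */
struct vfio_iommu_type1_info info = { .argsz = sizeof(info) };
unsigned int bits = 48;                         /* fallback guess */

if (ioctl(container_fd, VFIO_IOMMU_GET_INFO, &info) == 0 &&
    (info.flags & VFIO_IOMMU_INFO_ADDR_BITS)) {
        bits = info.addr_space_bits;
}

Anything the memory listener sees at or above 1ULL << bits could then be
dropped up front.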

-- 
MST

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2014-01-12  7:54                   ` Michael S. Tsirkin
@ 2014-01-12 15:03                     ` Alexander Graf
  2014-01-13 21:39                       ` Alex Williamson
  2014-01-14 13:50                     ` Mike Day
  1 sibling, 1 reply; 74+ messages in thread
From: Alexander Graf @ 2014-01-12 15:03 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Peter Maydell, Alexey Kardashevskiy, QEMU Developers,
	Luiz Capitulino, Alex Williamson, Paolo Bonzini, David Gibson


On 12.01.2014, at 08:54, Michael S. Tsirkin <mst@redhat.com> wrote:

> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
>> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
>>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
>>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
>>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
>>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
>>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
>>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
>>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
>>>>>>>>>> From: Paolo Bonzini <pbonzini@redhat.com>
>>>>>>>>>> 
>>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
>>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
>>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
>>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
>>>>>>>>>> consequently messing up the computations.
>>>>>>>>>> 
>>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
>>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
>>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
>>>>>>>>>> address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
>>>>>>>>>> not 2^64.  But we get it anyway because phys_page_find ignores the upper
>>>>>>>>>> bits of the physical address.  In address_space_translate_internal then
>>>>>>>>>> 
>>>>>>>>>>    diff = int128_sub(section->mr->size, int128_make64(addr));
>>>>>>>>>>    *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
>>>>>>>>>> 
>>>>>>>>>> diff becomes negative, and int128_get64 booms.
>>>>>>>>>> 
>>>>>>>>>> The size of the PCI address space region should be fixed anyway.
>>>>>>>>>> 
>>>>>>>>>> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
>>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>>>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>>>>>>>>>> ---
>>>>>>>>>> exec.c | 8 ++------
>>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
>>>>>>>>>> 
>>>>>>>>>> diff --git a/exec.c b/exec.c
>>>>>>>>>> index 7e5ce93..f907f5f 100644
>>>>>>>>>> --- a/exec.c
>>>>>>>>>> +++ b/exec.c
>>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
>>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
>>>>>>>>>> 
>>>>>>>>>> /* Size of the L2 (and L3, etc) page tables.  */
>>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
>>>>>>>>>> +#define ADDR_SPACE_BITS 64
>>>>>>>>>> 
>>>>>>>>>> #define P_L2_BITS 10
>>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
>>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
>>>>>>>>>> {
>>>>>>>>>>     system_memory = g_malloc(sizeof(*system_memory));
>>>>>>>>>> 
>>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
>>>>>>>>>> -
>>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
>>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
>>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
>>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
>>>>>>>>>>     address_space_init(&address_space_memory, system_memory, "memory");
>>>>>>>>>> 
>>>>>>>>>>     system_io = g_malloc(sizeof(*system_io));
>>>>>>>>> 
>>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
>>>>>>>>> BARs that I'm not sure how to handle.
>>>>>>>> 
>>>>>>>> BARs are often disabled during sizing. Maybe you
>>>>>>>> don't detect BAR being disabled?
>>>>>>> 
>>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
>>>>>>> the sizing and memory region updates for the BARs, vfio is just a
>>>>>>> pass-through here.
>>>>>> 
>>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
>>>>>> while I/O & memory are enabled in the command register.  Thanks,
>>>>>> 
>>>>>> Alex
>>>>> 
>>>>> OK then from QEMU POV this BAR value is not special at all.
>>>> 
>>>> Unfortunately
>>>> 
>>>>>>>>> After this patch I get vfio
>>>>>>>>> traces like this:
>>>>>>>>> 
>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
>>>>>>>>> (save lower 32bits of BAR)
>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
>>>>>>>>> (write mask to BAR)
>>>>>>>>> vfio: region_del febe0000 - febe3fff
>>>>>>>>> (memory region gets unmapped)
>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
>>>>>>>>> (read size mask)
>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
>>>>>>>>> (restore BAR)
>>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
>>>>>>>>> (memory region re-mapped)
>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
>>>>>>>>> (save upper 32bits of BAR)
>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
>>>>>>>>> (write mask to BAR)
>>>>>>>>> vfio: region_del febe0000 - febe3fff
>>>>>>>>> (memory region gets unmapped)
>>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
>>>>>>>>> (memory region gets re-mapped with new address)
>>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
>>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
>>>>>>> 
>>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
>>>>> 
>>>>> Why can't you? Generally memory core let you find out easily.
>>>> 
>>>> My MemoryListener is setup for &address_space_memory and I then filter
>>>> out anything that's not memory_region_is_ram().  This still gets
>>>> through, so how do I easily find out?
>>>> 
>>>>> But in this case it's vfio device itself that is sized so for sure you
>>>>> know it's MMIO.
>>>> 
>>>> How so?  I have a MemoryListener as described above and pass everything
>>>> through to the IOMMU.  I suppose I could look through all the
>>>> VFIODevices and check if the MemoryRegion matches, but that seems really
>>>> ugly.
>>>> 
>>>>> Maybe you will have same issue if there's another device with a 64 bit
>>>>> bar though, like ivshmem?
>>>> 
>>>> Perhaps, I suspect I'll see anything that registers their BAR
>>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
>>> 
>>> Must be a 64 bit BAR to trigger the issue though.
>>> 
>>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
>>>>>>> that we might be able to take advantage of with GPU passthrough.
>>>>>>> 
>>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
>>>>>>>>> address, presumably because it was beyond the address space of the PCI
>>>>>>>>> window.  This address is clearly not in a PCI MMIO space, so why are we
>>>>>>>>> allowing it to be realized in the system address space at this location?
>>>>>>>>> Thanks,
>>>>>>>>> 
>>>>>>>>> Alex
>>>>>>>> 
>>>>>>>> Why do you think it is not in PCI MMIO space?
>>>>>>>> True, CPU can't access this address but other pci devices can.
>>>>>>> 
>>>>>>> What happens on real hardware when an address like this is programmed to
>>>>>>> a device?  The CPU doesn't have the physical bits to access it.  I have
>>>>>>> serious doubts that another PCI device would be able to access it
>>>>>>> either.  Maybe in some limited scenario where the devices are on the
>>>>>>> same conventional PCI bus.  In the typical case, PCI addresses are
>>>>>>> always limited by some kind of aperture, whether that's explicit in
>>>>>>> bridge windows or implicit in hardware design (and perhaps made explicit
>>>>>>> in ACPI).  Even if I wanted to filter these out as noise in vfio, how
>>>>>>> would I do it in a way that still allows real 64bit MMIO to be
>>>>>>> programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
>>>>>>> 
>>>>>>> Alex
>>>>>> 
>>>>> 
>>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
>>>>> full 64 bit addresses must be allowed and hardware validation
>>>>> test suites normally check that it actually does work
>>>>> if it happens.
>>>> 
>>>> Sure, PCI devices themselves, but the chipset typically has defined
>>>> routing, that's more what I'm referring to.  There are generally only
>>>> fixed address windows for RAM vs MMIO.
>>> 
>>> The physical chipset? Likely - in the presence of IOMMU.
>>> Without that, devices can talk to each other without going
>>> through chipset, and bridge spec is very explicit that
>>> full 64 bit addressing must be supported.
>>> 
>>> So as long as we don't emulate an IOMMU,
>>> guest will normally think it's okay to use any address.
>>> 
>>>>> Yes, if there's a bridge somewhere on the path that bridge's
>>>>> windows would protect you, but pci already does this filtering:
>>>>> if you see this address in the memory map this means
>>>>> your virtual device is on root bus.
>>>>> 
>>>>> So I think it's the other way around: if VFIO requires specific
>>>>> address ranges to be assigned to devices, it should give this
>>>>> info to qemu and qemu can give this to guest.
>>>>> Then anything outside that range can be ignored by VFIO.
>>>> 
>>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
>>>> currently no way to find out the address width of the IOMMU.  We've been
>>>> getting by because it's safely close enough to the CPU address width to
>>>> not be a concern until we start exposing things at the top of the 64bit
>>>> address space.  Maybe I can safely ignore anything above
>>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
>>>> 
>>>> Alex
>>> 
>>> I think it's not related to target CPU at all - it's a host limitation.
>>> So just make up your own constant, maybe depending on host architecture.
>>> Long term add an ioctl to query it.
>> 
>> It's a hardware limitation which I'd imagine has some loose ties to the
>> physical address bits of the CPU.
>> 
>>> Also, we can add a fwcfg interface to tell bios that it should avoid
>>> placing BARs above some address.
>> 
>> That doesn't help this case, it's a spurious mapping caused by sizing
>> the BARs with them enabled.  We may still want such a thing to feed into
>> building ACPI tables though.
> 
> Well the point is that if you want BIOS to avoid
> specific addresses, you need to tell it what to avoid.
> But neither BIOS nor ACPI actually cover the range above
> 2^48 ATM so it's not a high priority.
> 
>>> Since it's a vfio limitation I think it should be a vfio API, along the
>>> lines of vfio_get_addr_space_bits(void).
>>> (Is this true btw? legacy assignment doesn't have this problem?)
>> 
>> It's an IOMMU hardware limitation, legacy assignment has the same
>> problem.  It looks like legacy will abort() in QEMU for the failed
>> mapping and I'm planning to tighten vfio to also kill the VM for failed
>> mappings.  In the short term, I think I'll ignore any mappings above
>> TARGET_PHYS_ADDR_SPACE_BITS,
> 
> That seems very wrong. It will still fail on an x86 host if we are
> emulating a CPU with full 64 bit addressing. The limitation is on the
> host side there's no real reason to tie it to the target.
> 
>> long term vfio already has an IOMMU info
>> ioctl that we could use to return this information, but we'll need to
>> figure out how to get it out of the IOMMU driver first.
>> Thanks,
>> 
>> Alex
> 
> Short term, just assume 48 bits on x86.
> 
> We need to figure out what's the limitation on ppc and arm -
> maybe there's none and it can address full 64 bit range.

IIUC, on PPC and ARM you always have BAR windows into which things can get mapped, unlike x86, where the full physical address range can be overlaid by BARs.

Or did I misunderstand the question?


Alex

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2014-01-12 15:03                     ` Alexander Graf
@ 2014-01-13 21:39                       ` Alex Williamson
  2014-01-13 21:48                         ` Alexander Graf
  2014-01-14 12:21                         ` Michael S. Tsirkin
  0 siblings, 2 replies; 74+ messages in thread
From: Alex Williamson @ 2014-01-13 21:39 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Peter Maydell, Michael S. Tsirkin, Alexey Kardashevskiy,
	QEMU Developers, Luiz Capitulino, Paolo Bonzini, David Gibson

On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
> On 12.01.2014, at 08:54, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> > On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> >> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> >>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> >>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> >>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> >>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> >>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> >>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> >>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> >>>>>>>>>> From: Paolo Bonzini <pbonzini@redhat.com>
> >>>>>>>>>> 
> >>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
> >>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
> >>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
> >>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> >>>>>>>>>> consequently messing up the computations.
> >>>>>>>>>> 
> >>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
> >>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> >>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
> >>>>>>>>>> address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> >>>>>>>>>> not 2^64.  But we get it anyway because phys_page_find ignores the upper
> >>>>>>>>>> bits of the physical address.  In address_space_translate_internal then
> >>>>>>>>>> 
> >>>>>>>>>>    diff = int128_sub(section->mr->size, int128_make64(addr));
> >>>>>>>>>>    *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> >>>>>>>>>> 
> >>>>>>>>>> diff becomes negative, and int128_get64 booms.
> >>>>>>>>>> 
> >>>>>>>>>> The size of the PCI address space region should be fixed anyway.
> >>>>>>>>>> 
> >>>>>>>>>> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> >>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> >>>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >>>>>>>>>> ---
> >>>>>>>>>> exec.c | 8 ++------
> >>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
> >>>>>>>>>> 
> >>>>>>>>>> diff --git a/exec.c b/exec.c
> >>>>>>>>>> index 7e5ce93..f907f5f 100644
> >>>>>>>>>> --- a/exec.c
> >>>>>>>>>> +++ b/exec.c
> >>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
> >>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> >>>>>>>>>> 
> >>>>>>>>>> /* Size of the L2 (and L3, etc) page tables.  */
> >>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> >>>>>>>>>> +#define ADDR_SPACE_BITS 64
> >>>>>>>>>> 
> >>>>>>>>>> #define P_L2_BITS 10
> >>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
> >>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> >>>>>>>>>> {
> >>>>>>>>>>     system_memory = g_malloc(sizeof(*system_memory));
> >>>>>>>>>> 
> >>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
> >>>>>>>>>> -
> >>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
> >>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
> >>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> >>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> >>>>>>>>>>     address_space_init(&address_space_memory, system_memory, "memory");
> >>>>>>>>>> 
> >>>>>>>>>>     system_io = g_malloc(sizeof(*system_io));
> >>>>>>>>> 
> >>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
> >>>>>>>>> BARs that I'm not sure how to handle.
> >>>>>>>> 
> >>>>>>>> BARs are often disabled during sizing. Maybe you
> >>>>>>>> don't detect BAR being disabled?
> >>>>>>> 
> >>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> >>>>>>> the sizing and memory region updates for the BARs, vfio is just a
> >>>>>>> pass-through here.
> >>>>>> 
> >>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
> >>>>>> while I/O & memory are enabled in the command register.  Thanks,
> >>>>>> 
> >>>>>> Alex
> >>>>> 
> >>>>> OK then from QEMU POV this BAR value is not special at all.
> >>>> 
> >>>> Unfortunately
> >>>> 
> >>>>>>>>> After this patch I get vfio
> >>>>>>>>> traces like this:
> >>>>>>>>> 
> >>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> >>>>>>>>> (save lower 32bits of BAR)
> >>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> >>>>>>>>> (write mask to BAR)
> >>>>>>>>> vfio: region_del febe0000 - febe3fff
> >>>>>>>>> (memory region gets unmapped)
> >>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> >>>>>>>>> (read size mask)
> >>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> >>>>>>>>> (restore BAR)
> >>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> >>>>>>>>> (memory region re-mapped)
> >>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> >>>>>>>>> (save upper 32bits of BAR)
> >>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> >>>>>>>>> (write mask to BAR)
> >>>>>>>>> vfio: region_del febe0000 - febe3fff
> >>>>>>>>> (memory region gets unmapped)
> >>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> >>>>>>>>> (memory region gets re-mapped with new address)
> >>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> >>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
> >>>>>>>>> 
> >>>>>>>> 
> >>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
> >>>>>>> 
> >>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
> >>>>> 
> >>>>> Why can't you? Generally memory core let you find out easily.
> >>>> 
> >>>> My MemoryListener is setup for &address_space_memory and I then filter
> >>>> out anything that's not memory_region_is_ram().  This still gets
> >>>> through, so how do I easily find out?
> >>>> 
> >>>>> But in this case it's vfio device itself that is sized so for sure you
> >>>>> know it's MMIO.
> >>>> 
> >>>> How so?  I have a MemoryListener as described above and pass everything
> >>>> through to the IOMMU.  I suppose I could look through all the
> >>>> VFIODevices and check if the MemoryRegion matches, but that seems really
> >>>> ugly.
> >>>> 
> >>>>> Maybe you will have same issue if there's another device with a 64 bit
> >>>>> bar though, like ivshmem?
> >>>> 
> >>>> Perhaps, I suspect I'll see anything that registers their BAR
> >>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
> >>> 
> >>> Must be a 64 bit BAR to trigger the issue though.
> >>> 
> >>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
> >>>>>>> that we might be able to take advantage of with GPU passthrough.
> >>>>>>> 
> >>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
> >>>>>>>>> address, presumably because it was beyond the address space of the PCI
> >>>>>>>>> window.  This address is clearly not in a PCI MMIO space, so why are we
> >>>>>>>>> allowing it to be realized in the system address space at this location?
> >>>>>>>>> Thanks,
> >>>>>>>>> 
> >>>>>>>>> Alex
> >>>>>>>> 
> >>>>>>>> Why do you think it is not in PCI MMIO space?
> >>>>>>>> True, CPU can't access this address but other pci devices can.
> >>>>>>> 
> >>>>>>> What happens on real hardware when an address like this is programmed to
> >>>>>>> a device?  The CPU doesn't have the physical bits to access it.  I have
> >>>>>>> serious doubts that another PCI device would be able to access it
> >>>>>>> either.  Maybe in some limited scenario where the devices are on the
> >>>>>>> same conventional PCI bus.  In the typical case, PCI addresses are
> >>>>>>> always limited by some kind of aperture, whether that's explicit in
> >>>>>>> bridge windows or implicit in hardware design (and perhaps made explicit
> >>>>>>> in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> >>>>>>> would I do it in a way that still allows real 64bit MMIO to be
> >>>>>>> programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> >>>>>>> 
> >>>>>>> Alex
> >>>>>> 
> >>>>> 
> >>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
> >>>>> full 64 bit addresses must be allowed and hardware validation
> >>>>> test suites normally check that it actually does work
> >>>>> if it happens.
> >>>> 
> >>>> Sure, PCI devices themselves, but the chipset typically has defined
> >>>> routing, that's more what I'm referring to.  There are generally only
> >>>> fixed address windows for RAM vs MMIO.
> >>> 
> >>> The physical chipset? Likely - in the presence of IOMMU.
> >>> Without that, devices can talk to each other without going
> >>> through chipset, and bridge spec is very explicit that
> >>> full 64 bit addressing must be supported.
> >>> 
> >>> So as long as we don't emulate an IOMMU,
> >>> guest will normally think it's okay to use any address.
> >>> 
> >>>>> Yes, if there's a bridge somewhere on the path that bridge's
> >>>>> windows would protect you, but pci already does this filtering:
> >>>>> if you see this address in the memory map this means
> >>>>> your virtual device is on root bus.
> >>>>> 
> >>>>> So I think it's the other way around: if VFIO requires specific
> >>>>> address ranges to be assigned to devices, it should give this
> >>>>> info to qemu and qemu can give this to guest.
> >>>>> Then anything outside that range can be ignored by VFIO.
> >>>> 
> >>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
> >>>> currently no way to find out the address width of the IOMMU.  We've been
> >>>> getting by because it's safely close enough to the CPU address width to
> >>>> not be a concern until we start exposing things at the top of the 64bit
> >>>> address space.  Maybe I can safely ignore anything above
> >>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> >>>> 
> >>>> Alex
> >>> 
> >>> I think it's not related to target CPU at all - it's a host limitation.
> >>> So just make up your own constant, maybe depending on host architecture.
> >>> Long term add an ioctl to query it.
> >> 
> >> It's a hardware limitation which I'd imagine has some loose ties to the
> >> physical address bits of the CPU.
> >> 
> >>> Also, we can add a fwcfg interface to tell bios that it should avoid
> >>> placing BARs above some address.
> >> 
> >> That doesn't help this case, it's a spurious mapping caused by sizing
> >> the BARs with them enabled.  We may still want such a thing to feed into
> >> building ACPI tables though.
> > 
> > Well the point is that if you want BIOS to avoid
> > specific addresses, you need to tell it what to avoid.
> > But neither BIOS nor ACPI actually cover the range above
> > 2^48 ATM so it's not a high priority.
> > 
> >>> Since it's a vfio limitation I think it should be a vfio API, along the
> >>> lines of vfio_get_addr_space_bits(void).
> >>> (Is this true btw? legacy assignment doesn't have this problem?)
> >> 
> >> It's an IOMMU hardware limitation, legacy assignment has the same
> >> problem.  It looks like legacy will abort() in QEMU for the failed
> >> mapping and I'm planning to tighten vfio to also kill the VM for failed
> >> mappings.  In the short term, I think I'll ignore any mappings above
> >> TARGET_PHYS_ADDR_SPACE_BITS,
> > 
> > That seems very wrong. It will still fail on an x86 host if we are
> > emulating a CPU with full 64 bit addressing. The limitation is on the
> > host side there's no real reason to tie it to the target.

I doubt vfio would be the only thing broken in that case.

> >> long term vfio already has an IOMMU info
> >> ioctl that we could use to return this information, but we'll need to
> >> figure out how to get it out of the IOMMU driver first.
> >> Thanks,
> >> 
> >> Alex
> > 
> > Short term, just assume 48 bits on x86.

I hate to pick an arbitrary value since we have a very specific mapping
we're trying to avoid.  Perhaps a better option is to skip anything
where:

        MemoryRegionSection.offset_within_address_space >
        ~MemoryRegionSection.offset_within_address_space
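
To make that concrete, here is a small standalone sketch (plain C, not the
actual vfio code; the helper name skip_section and the printed values are
only for illustration) showing how the comparison sorts the two addresses
from the trace:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* x > ~x is true exactly when bit 63 of x is set, i.e. the section
     * lives in the upper half of the 64-bit address space. */
    static bool skip_section(uint64_t offset_within_address_space)
    {
        return offset_within_address_space > ~offset_within_address_space;
    }

    int main(void)
    {
        uint64_t normal_bar = 0xfebe0000ULL;              /* from the trace */
        uint64_t sizing_artifact = 0xfffffffffebe0000ULL; /* from the trace */

        printf("%#llx -> %s\n", (unsigned long long)normal_bar,
               skip_section(normal_bar) ? "skip" : "map");
        printf("%#llx -> %s\n", (unsigned long long)sizing_artifact,
               skip_section(sizing_artifact) ? "skip" : "map");
        return 0;
    }

The real mapping is passed through and the spurious one from the sizing
step is dropped, without hard-coding any host-specific address width.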

> > We need to figure out what's the limitation on ppc and arm -
> > maybe there's none and it can address full 64 bit range.
> 
> IIUC on PPC and ARM you always have BAR windows where things can get mapped into. Unlike x86 where the full physical address range can be overlaid by BARs.
> 
> Or did I misunderstand the question?

Sounds right, if either BAR mappings outside the window will not be
realized in the memory space or the IOMMU has a full 64bit address
space, there's no problem.  Here we have an intermediate step in the BAR
sizing producing a stray mapping that the IOMMU hardware can't handle.
Even if we could handle it, it's not clear that we want to.  On AMD-Vi
the IOMMU page tables can grow to six levels deep.  A stray mapping like
this then causes space and time overhead until the tables are pruned
back down.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2014-01-13 21:39                       ` Alex Williamson
@ 2014-01-13 21:48                         ` Alexander Graf
  2014-01-13 22:48                           ` Alex Williamson
  2014-01-14  8:18                           ` Michael S. Tsirkin
  2014-01-14 12:21                         ` Michael S. Tsirkin
  1 sibling, 2 replies; 74+ messages in thread
From: Alexander Graf @ 2014-01-13 21:48 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Peter Maydell, Michael S. Tsirkin, Alexey Kardashevskiy,
	QEMU Developers, Luiz Capitulino, Paolo Bonzini, David Gibson



> Am 13.01.2014 um 22:39 schrieb Alex Williamson <alex.williamson@redhat.com>:
> 
>> On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
>>> On 12.01.2014, at 08:54, Michael S. Tsirkin <mst@redhat.com> wrote:
>>> 
>>>> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
>>>>> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
>>>>>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
>>>>>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
>>>>>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
>>>>>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
>>>>>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
>>>>>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
>>>>>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
>>>>>>>>>>>> From: Paolo Bonzini <pbonzini@redhat.com>
>>>>>>>>>>>> 
>>>>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
>>>>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
>>>>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
>>>>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
>>>>>>>>>>>> consequently messing up the computations.
>>>>>>>>>>>> 
>>>>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
>>>>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
>>>>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
>>>>>>>>>>>> address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
>>>>>>>>>>>> not 2^64.  But we get it anyway because phys_page_find ignores the upper
>>>>>>>>>>>> bits of the physical address.  In address_space_translate_internal then
>>>>>>>>>>>> 
>>>>>>>>>>>>   diff = int128_sub(section->mr->size, int128_make64(addr));
>>>>>>>>>>>>   *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
>>>>>>>>>>>> 
>>>>>>>>>>>> diff becomes negative, and int128_get64 booms.
>>>>>>>>>>>> 
>>>>>>>>>>>> The size of the PCI address space region should be fixed anyway.
>>>>>>>>>>>> 
>>>>>>>>>>>> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
>>>>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>>>>>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>>>>>>>>>>>> ---
>>>>>>>>>>>> exec.c | 8 ++------
>>>>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
>>>>>>>>>>>> 
>>>>>>>>>>>> diff --git a/exec.c b/exec.c
>>>>>>>>>>>> index 7e5ce93..f907f5f 100644
>>>>>>>>>>>> --- a/exec.c
>>>>>>>>>>>> +++ b/exec.c
>>>>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
>>>>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
>>>>>>>>>>>> 
>>>>>>>>>>>> /* Size of the L2 (and L3, etc) page tables.  */
>>>>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
>>>>>>>>>>>> +#define ADDR_SPACE_BITS 64
>>>>>>>>>>>> 
>>>>>>>>>>>> #define P_L2_BITS 10
>>>>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
>>>>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
>>>>>>>>>>>> {
>>>>>>>>>>>>    system_memory = g_malloc(sizeof(*system_memory));
>>>>>>>>>>>> 
>>>>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
>>>>>>>>>>>> -
>>>>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
>>>>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
>>>>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
>>>>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
>>>>>>>>>>>>    address_space_init(&address_space_memory, system_memory, "memory");
>>>>>>>>>>>> 
>>>>>>>>>>>>    system_io = g_malloc(sizeof(*system_io));
>>>>>>>>>>> 
>>>>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
>>>>>>>>>>> BARs that I'm not sure how to handle.
>>>>>>>>>> 
>>>>>>>>>> BARs are often disabled during sizing. Maybe you
>>>>>>>>>> don't detect BAR being disabled?
>>>>>>>>> 
>>>>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
>>>>>>>>> the sizing an memory region updates for the BARs, vfio is just a
>>>>>>>>> pass-through here.
>>>>>>>> 
>>>>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
>>>>>>>> while I/O & memory are enabled int he command register.  Thanks,
>>>>>>>> 
>>>>>>>> Alex
>>>>>>> 
>>>>>>> OK then from QEMU POV this BAR value is not special at all.
>>>>>> 
>>>>>> Unfortunately
>>>>>> 
>>>>>>>>>>> After this patch I get vfio
>>>>>>>>>>> traces like this:
>>>>>>>>>>> 
>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
>>>>>>>>>>> (save lower 32bits of BAR)
>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
>>>>>>>>>>> (write mask to BAR)
>>>>>>>>>>> vfio: region_del febe0000 - febe3fff
>>>>>>>>>>> (memory region gets unmapped)
>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
>>>>>>>>>>> (read size mask)
>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
>>>>>>>>>>> (restore BAR)
>>>>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
>>>>>>>>>>> (memory region re-mapped)
>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
>>>>>>>>>>> (save upper 32bits of BAR)
>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
>>>>>>>>>>> (write mask to BAR)
>>>>>>>>>>> vfio: region_del febe0000 - febe3fff
>>>>>>>>>>> (memory region gets unmapped)
>>>>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
>>>>>>>>>>> (memory region gets re-mapped with new address)
>>>>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
>>>>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
>>>>>>>>>> 
>>>>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
>>>>>>>>> 
>>>>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
>>>>>>> 
>>>>>>> Why can't you? Generally memory core let you find out easily.
>>>>>> 
>>>>>> My MemoryListener is setup for &address_space_memory and I then filter
>>>>>> out anything that's not memory_region_is_ram().  This still gets
>>>>>> through, so how do I easily find out?
>>>>>> 
>>>>>>> But in this case it's vfio device itself that is sized so for sure you
>>>>>>> know it's MMIO.
>>>>>> 
>>>>>> How so?  I have a MemoryListener as described above and pass everything
>>>>>> through to the IOMMU.  I suppose I could look through all the
>>>>>> VFIODevices and check if the MemoryRegion matches, but that seems really
>>>>>> ugly.
>>>>>> 
>>>>>>> Maybe you will have same issue if there's another device with a 64 bit
>>>>>>> bar though, like ivshmem?
>>>>>> 
>>>>>> Perhaps, I suspect I'll see anything that registers their BAR
>>>>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
>>>>> 
>>>>> Must be a 64 bit BAR to trigger the issue though.
>>>>> 
>>>>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
>>>>>>>>> that we might be able to take advantage of with GPU passthrough.
>>>>>>>>> 
>>>>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
>>>>>>>>>>> address, presumably because it was beyond the address space of the PCI
>>>>>>>>>>> window.  This address is clearly not in a PCI MMIO space, so why are we
>>>>>>>>>>> allowing it to be realized in the system address space at this location?
>>>>>>>>>>> Thanks,
>>>>>>>>>>> 
>>>>>>>>>>> Alex
>>>>>>>>>> 
>>>>>>>>>> Why do you think it is not in PCI MMIO space?
>>>>>>>>>> True, CPU can't access this address but other pci devices can.
>>>>>>>>> 
>>>>>>>>> What happens on real hardware when an address like this is programmed to
>>>>>>>>> a device?  The CPU doesn't have the physical bits to access it.  I have
>>>>>>>>> serious doubts that another PCI device would be able to access it
>>>>>>>>> either.  Maybe in some limited scenario where the devices are on the
>>>>>>>>> same conventional PCI bus.  In the typical case, PCI addresses are
>>>>>>>>> always limited by some kind of aperture, whether that's explicit in
>>>>>>>>> bridge windows or implicit in hardware design (and perhaps made explicit
>>>>>>>>> in ACPI).  Even if I wanted to filter these out as noise in vfio, how
>>>>>>>>> would I do it in a way that still allows real 64bit MMIO to be
>>>>>>>>> programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
>>>>>>>>> 
>>>>>>>>> Alex
>>>>>>> 
>>>>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
>>>>>>> full 64 bit addresses must be allowed and hardware validation
>>>>>>> test suites normally check that it actually does work
>>>>>>> if it happens.
>>>>>> 
>>>>>> Sure, PCI devices themselves, but the chipset typically has defined
>>>>>> routing, that's more what I'm referring to.  There are generally only
>>>>>> fixed address windows for RAM vs MMIO.
>>>>> 
>>>>> The physical chipset? Likely - in the presence of IOMMU.
>>>>> Without that, devices can talk to each other without going
>>>>> through chipset, and bridge spec is very explicit that
>>>>> full 64 bit addressing must be supported.
>>>>> 
>>>>> So as long as we don't emulate an IOMMU,
>>>>> guest will normally think it's okay to use any address.
>>>>> 
>>>>>>> Yes, if there's a bridge somewhere on the path that bridge's
>>>>>>> windows would protect you, but pci already does this filtering:
>>>>>>> if you see this address in the memory map this means
>>>>>>> your virtual device is on root bus.
>>>>>>> 
>>>>>>> So I think it's the other way around: if VFIO requires specific
>>>>>>> address ranges to be assigned to devices, it should give this
>>>>>>> info to qemu and qemu can give this to guest.
>>>>>>> Then anything outside that range can be ignored by VFIO.
>>>>>> 
>>>>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
>>>>>> currently no way to find out the address width of the IOMMU.  We've been
>>>>>> getting by because it's safely close enough to the CPU address width to
>>>>>> not be a concern until we start exposing things at the top of the 64bit
>>>>>> address space.  Maybe I can safely ignore anything above
>>>>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
>>>>>> 
>>>>>> Alex
>>>>> 
>>>>> I think it's not related to target CPU at all - it's a host limitation.
>>>>> So just make up your own constant, maybe depending on host architecture.
>>>>> Long term add an ioctl to query it.
>>>> 
>>>> It's a hardware limitation which I'd imagine has some loose ties to the
>>>> physical address bits of the CPU.
>>>> 
>>>>> Also, we can add a fwcfg interface to tell bios that it should avoid
>>>>> placing BARs above some address.
>>>> 
>>>> That doesn't help this case, it's a spurious mapping caused by sizing
>>>> the BARs with them enabled.  We may still want such a thing to feed into
>>>> building ACPI tables though.
>>> 
>>> Well the point is that if you want BIOS to avoid
>>> specific addresses, you need to tell it what to avoid.
>>> But neither BIOS nor ACPI actually cover the range above
>>> 2^48 ATM so it's not a high priority.
>>> 
>>>>> Since it's a vfio limitation I think it should be a vfio API, along the
>>>>> lines of vfio_get_addr_space_bits(void).
>>>>> (Is this true btw? legacy assignment doesn't have this problem?)
>>>> 
>>>> It's an IOMMU hardware limitation, legacy assignment has the same
>>>> problem.  It looks like legacy will abort() in QEMU for the failed
>>>> mapping and I'm planning to tighten vfio to also kill the VM for failed
>>>> mappings.  In the short term, I think I'll ignore any mappings above
>>>> TARGET_PHYS_ADDR_SPACE_BITS,
>>> 
>>> That seems very wrong. It will still fail on an x86 host if we are
>>> emulating a CPU with full 64 bit addressing. The limitation is on the
>>> host side there's no real reason to tie it to the target.
> 
> I doubt vfio would be the only thing broken in that case.
> 
>>>> long term vfio already has an IOMMU info
>>>> ioctl that we could use to return this information, but we'll need to
>>>> figure out how to get it out of the IOMMU driver first.
>>>> Thanks,
>>>> 
>>>> Alex
>>> 
>>> Short term, just assume 48 bits on x86.
> 
> I hate to pick an arbitrary value since we have a very specific mapping
> we're trying to avoid.  Perhaps a better option is to skip anything
> where:
> 
>        MemoryRegionSection.offset_within_address_space >
>        ~MemoryRegionSection.offset_within_address_space
> 
>>> We need to figure out what's the limitation on ppc and arm -
>>> maybe there's none and it can address full 64 bit range.
>> 
> >> IIUC on PPC and ARM you always have BAR windows where things can get mapped into. Unlike x86 where the full physical address range can be overlaid by BARs.
>> 
>> Or did I misunderstand the question?
> 
> Sounds right, if either BAR mappings outside the window will not be
> realized in the memory space or the IOMMU has a full 64bit address
> space, there's no problem.  Here we have an intermediate step in the BAR
> sizing producing a stray mapping that the IOMMU hardware can't handle.
> Even if we could handle it, it's not clear that we want to.  On AMD-Vi
> > the IOMMU page tables can grow to six levels deep.  A stray mapping like
> this then causes space and time overhead until the tables are pruned
> back down.  Thanks,

I thought sizing is hard-defined as a write of all 1s (-1)? Can't we check
for that one special case and treat it as "not mapped, but tell the guest the size in config space"?
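
Something along these lines is what I have in mind (only a sketch with a
made-up helper name, not QEMU's actual BAR handling):

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical check, not QEMU code: treat a config-space write of all
     * 1s to a BAR register as a sizing probe, leave the BAR unmapped, and
     * still return the size mask when the register is read back. */
    static bool bar_write_is_sizing_probe(uint32_t written_value)
    {
        return written_value == UINT32_MAX;
    }

Whether a per-register check like this can also cover 64-bit BARs, whose
two halves are written separately, is the obvious question.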

Alex

> 
> Alex
> 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2014-01-13 21:48                         ` Alexander Graf
@ 2014-01-13 22:48                           ` Alex Williamson
  2014-01-14 10:24                             ` Avi Kivity
  2014-01-14 12:07                             ` Michael S. Tsirkin
  2014-01-14  8:18                           ` Michael S. Tsirkin
  1 sibling, 2 replies; 74+ messages in thread
From: Alex Williamson @ 2014-01-13 22:48 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Peter Maydell, Michael S. Tsirkin, Alexey Kardashevskiy,
	QEMU Developers, Luiz Capitulino, Paolo Bonzini, David Gibson

On Mon, 2014-01-13 at 22:48 +0100, Alexander Graf wrote:
> 
> > Am 13.01.2014 um 22:39 schrieb Alex Williamson <alex.williamson@redhat.com>:
> > 
> >> On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
> >>> On 12.01.2014, at 08:54, Michael S. Tsirkin <mst@redhat.com> wrote:
> >>> 
> >>>> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> >>>>> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> >>>>>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> >>>>>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> >>>>>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> >>>>>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> >>>>>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> >>>>>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> >>>>>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> >>>>>>>>>>>> From: Paolo Bonzini <pbonzini@redhat.com>
> >>>>>>>>>>>> 
> >>>>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
> >>>>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
> >>>>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
> >>>>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> >>>>>>>>>>>> consequently messing up the computations.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
> >>>>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> >>>>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
> >>>>>>>>>>>> address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> >>>>>>>>>>>> not 2^64.  But we get it anyway because phys_page_find ignores the upper
> >>>>>>>>>>>> bits of the physical address.  In address_space_translate_internal then
> >>>>>>>>>>>> 
> >>>>>>>>>>>>   diff = int128_sub(section->mr->size, int128_make64(addr));
> >>>>>>>>>>>>   *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> >>>>>>>>>>>> 
> >>>>>>>>>>>> diff becomes negative, and int128_get64 booms.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> The size of the PCI address space region should be fixed anyway.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> >>>>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> >>>>>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >>>>>>>>>>>> ---
> >>>>>>>>>>>> exec.c | 8 ++------
> >>>>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
> >>>>>>>>>>>> 
> >>>>>>>>>>>> diff --git a/exec.c b/exec.c
> >>>>>>>>>>>> index 7e5ce93..f907f5f 100644
> >>>>>>>>>>>> --- a/exec.c
> >>>>>>>>>>>> +++ b/exec.c
> >>>>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
> >>>>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> >>>>>>>>>>>> 
> >>>>>>>>>>>> /* Size of the L2 (and L3, etc) page tables.  */
> >>>>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> >>>>>>>>>>>> +#define ADDR_SPACE_BITS 64
> >>>>>>>>>>>> 
> >>>>>>>>>>>> #define P_L2_BITS 10
> >>>>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
> >>>>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> >>>>>>>>>>>> {
> >>>>>>>>>>>>    system_memory = g_malloc(sizeof(*system_memory));
> >>>>>>>>>>>> 
> >>>>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
> >>>>>>>>>>>> -
> >>>>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
> >>>>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
> >>>>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> >>>>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> >>>>>>>>>>>>    address_space_init(&address_space_memory, system_memory, "memory");
> >>>>>>>>>>>> 
> >>>>>>>>>>>>    system_io = g_malloc(sizeof(*system_io));
> >>>>>>>>>>> 
> >>>>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
> >>>>>>>>>>> BARs that I'm not sure how to handle.
> >>>>>>>>>> 
> >>>>>>>>>> BARs are often disabled during sizing. Maybe you
> >>>>>>>>>> don't detect BAR being disabled?
> >>>>>>>>> 
> >>>>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> >>>>>>>>> the sizing an memory region updates for the BARs, vfio is just a
> >>>>>>>>> pass-through here.
> >>>>>>>> 
> >>>>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
> >>>>>>>> while I/O & memory are enabled int he command register.  Thanks,
> >>>>>>>> 
> >>>>>>>> Alex
> >>>>>>> 
> >>>>>>> OK then from QEMU POV this BAR value is not special at all.
> >>>>>> 
> >>>>>> Unfortunately
> >>>>>> 
> >>>>>>>>>>> After this patch I get vfio
> >>>>>>>>>>> traces like this:
> >>>>>>>>>>> 
> >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> >>>>>>>>>>> (save lower 32bits of BAR)
> >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> >>>>>>>>>>> (write mask to BAR)
> >>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> >>>>>>>>>>> (memory region gets unmapped)
> >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> >>>>>>>>>>> (read size mask)
> >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> >>>>>>>>>>> (restore BAR)
> >>>>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> >>>>>>>>>>> (memory region re-mapped)
> >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> >>>>>>>>>>> (save upper 32bits of BAR)
> >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> >>>>>>>>>>> (write mask to BAR)
> >>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> >>>>>>>>>>> (memory region gets unmapped)
> >>>>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> >>>>>>>>>>> (memory region gets re-mapped with new address)
> >>>>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> >>>>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
> >>>>>>>>>> 
> >>>>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
> >>>>>>>>> 
> >>>>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
> >>>>>>> 
> >>>>>>> Why can't you? Generally memory core let you find out easily.
> >>>>>> 
> >>>>>> My MemoryListener is setup for &address_space_memory and I then filter
> >>>>>> out anything that's not memory_region_is_ram().  This still gets
> >>>>>> through, so how do I easily find out?
> >>>>>> 
> >>>>>>> But in this case it's vfio device itself that is sized so for sure you
> >>>>>>> know it's MMIO.
> >>>>>> 
> >>>>>> How so?  I have a MemoryListener as described above and pass everything
> >>>>>> through to the IOMMU.  I suppose I could look through all the
> >>>>>> VFIODevices and check if the MemoryRegion matches, but that seems really
> >>>>>> ugly.
> >>>>>> 
> >>>>>>> Maybe you will have same issue if there's another device with a 64 bit
> >>>>>>> bar though, like ivshmem?
> >>>>>> 
> >>>>>> Perhaps, I suspect I'll see anything that registers their BAR
> >>>>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
> >>>>> 
> >>>>> Must be a 64 bit BAR to trigger the issue though.
> >>>>> 
> >>>>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
> >>>>>>>>> that we might be able to take advantage of with GPU passthrough.
> >>>>>>>>> 
> >>>>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
> >>>>>>>>>>> address, presumably because it was beyond the address space of the PCI
> >>>>>>>>>>> window.  This address is clearly not in a PCI MMIO space, so why are we
> >>>>>>>>>>> allowing it to be realized in the system address space at this location?
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> 
> >>>>>>>>>>> Alex
> >>>>>>>>>> 
> >>>>>>>>>> Why do you think it is not in PCI MMIO space?
> >>>>>>>>>> True, CPU can't access this address but other pci devices can.
> >>>>>>>>> 
> >>>>>>>>> What happens on real hardware when an address like this is programmed to
> >>>>>>>>> a device?  The CPU doesn't have the physical bits to access it.  I have
> >>>>>>>>> serious doubts that another PCI device would be able to access it
> >>>>>>>>> either.  Maybe in some limited scenario where the devices are on the
> >>>>>>>>> same conventional PCI bus.  In the typical case, PCI addresses are
> >>>>>>>>> always limited by some kind of aperture, whether that's explicit in
> >>>>>>>>> bridge windows or implicit in hardware design (and perhaps made explicit
> >>>>>>>>> in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> >>>>>>>>> would I do it in a way that still allows real 64bit MMIO to be
> >>>>>>>>> programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> >>>>>>>>> 
> >>>>>>>>> Alex
> >>>>>>> 
> >>>>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
> >>>>>>> full 64 bit addresses must be allowed and hardware validation
> >>>>>>> test suites normally check that it actually does work
> >>>>>>> if it happens.
> >>>>>> 
> >>>>>> Sure, PCI devices themselves, but the chipset typically has defined
> >>>>>> routing, that's more what I'm referring to.  There are generally only
> >>>>>> fixed address windows for RAM vs MMIO.
> >>>>> 
> >>>>> The physical chipset? Likely - in the presence of IOMMU.
> >>>>> Without that, devices can talk to each other without going
> >>>>> through chipset, and bridge spec is very explicit that
> >>>>> full 64 bit addressing must be supported.
> >>>>> 
> >>>>> So as long as we don't emulate an IOMMU,
> >>>>> guest will normally think it's okay to use any address.
> >>>>> 
> >>>>>>> Yes, if there's a bridge somewhere on the path that bridge's
> >>>>>>> windows would protect you, but pci already does this filtering:
> >>>>>>> if you see this address in the memory map this means
> >>>>>>> your virtual device is on root bus.
> >>>>>>> 
> >>>>>>> So I think it's the other way around: if VFIO requires specific
> >>>>>>> address ranges to be assigned to devices, it should give this
> >>>>>>> info to qemu and qemu can give this to guest.
> >>>>>>> Then anything outside that range can be ignored by VFIO.
> >>>>>> 
> >>>>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
> >>>>>> currently no way to find out the address width of the IOMMU.  We've been
> >>>>>> getting by because it's safely close enough to the CPU address width to
> >>>>>> not be a concern until we start exposing things at the top of the 64bit
> >>>>>> address space.  Maybe I can safely ignore anything above
> >>>>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> >>>>>> 
> >>>>>> Alex
> >>>>> 
> >>>>> I think it's not related to target CPU at all - it's a host limitation.
> >>>>> So just make up your own constant, maybe depending on host architecture.
> >>>>> Long term add an ioctl to query it.
> >>>> 
> >>>> It's a hardware limitation which I'd imagine has some loose ties to the
> >>>> physical address bits of the CPU.
> >>>> 
> >>>>> Also, we can add a fwcfg interface to tell bios that it should avoid
> >>>>> placing BARs above some address.
> >>>> 
> >>>> That doesn't help this case, it's a spurious mapping caused by sizing
> >>>> the BARs with them enabled.  We may still want such a thing to feed into
> >>>> building ACPI tables though.
> >>> 
> >>> Well the point is that if you want BIOS to avoid
> >>> specific addresses, you need to tell it what to avoid.
> >>> But neither BIOS nor ACPI actually cover the range above
> >>> 2^48 ATM so it's not a high priority.
> >>> 
> >>>>> Since it's a vfio limitation I think it should be a vfio API, along the
> >>>>> lines of vfio_get_addr_space_bits(void).
> >>>>> (Is this true btw? legacy assignment doesn't have this problem?)
> >>>> 
> >>>> It's an IOMMU hardware limitation, legacy assignment has the same
> >>>> problem.  It looks like legacy will abort() in QEMU for the failed
> >>>> mapping and I'm planning to tighten vfio to also kill the VM for failed
> >>>> mappings.  In the short term, I think I'll ignore any mappings above
> >>>> TARGET_PHYS_ADDR_SPACE_BITS,
> >>> 
> >>> That seems very wrong. It will still fail on an x86 host if we are
> >>> emulating a CPU with full 64 bit addressing. The limitation is on the
> >>> host side there's no real reason to tie it to the target.
> > 
> > I doubt vfio would be the only thing broken in that case.
> > 
> >>>> long term vfio already has an IOMMU info
> >>>> ioctl that we could use to return this information, but we'll need to
> >>>> figure out how to get it out of the IOMMU driver first.
> >>>> Thanks,
> >>>> 
> >>>> Alex
> >>> 
> >>> Short term, just assume 48 bits on x86.
> > 
> > I hate to pick an arbitrary value since we have a very specific mapping
> > we're trying to avoid.  Perhaps a better option is to skip anything
> > where:
> > 
> >        MemoryRegionSection.offset_within_address_space >
> >        ~MemoryRegionSection.offset_within_address_space
> > 
> >>> We need to figure out what's the limitation on ppc and arm -
> >>> maybe there's none and it can address full 64 bit range.
> >> 
> >> IIUC on PPC and ARM you always have BAR windows where things can get mapped into. Unlike x86 where the full physical address range can be overlaid by BARs.
> >> 
> >> Or did I misunderstand the question?
> > 
> > Sounds right, if either BAR mappings outside the window will not be
> > realized in the memory space or the IOMMU has a full 64bit address
> > space, there's no problem.  Here we have an intermediate step in the BAR
> > sizing producing a stray mapping that the IOMMU hardware can't handle.
> > Even if we could handle it, it's not clear that we want to.  On AMD-Vi
> > the IOMMU page tables can grow to six levels deep.  A stray mapping like
> > this then causes space and time overhead until the tables are pruned
> > back down.  Thanks,
> 
> I thought sizing is hard-defined as a write of all 1s (-1)? Can't we check
> for that one special case and treat it as "not mapped, but tell the guest the size in config space"?

PCI doesn't want to handle this as anything special to differentiate a
sizing mask from a valid BAR address.  I agree though, I'd prefer to
never see a spurious address like this in my MemoryListener.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2014-01-13 21:48                         ` Alexander Graf
  2014-01-13 22:48                           ` Alex Williamson
@ 2014-01-14  8:18                           ` Michael S. Tsirkin
  2014-01-14  9:20                             ` Alexander Graf
  1 sibling, 1 reply; 74+ messages in thread
From: Michael S. Tsirkin @ 2014-01-14  8:18 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Peter Maydell, Alexey Kardashevskiy, QEMU Developers,
	Luiz Capitulino, Alex Williamson, Paolo Bonzini, David Gibson

On Mon, Jan 13, 2014 at 10:48:21PM +0100, Alexander Graf wrote:
> 
> 
> > Am 13.01.2014 um 22:39 schrieb Alex Williamson <alex.williamson@redhat.com>:
> > 
> >> On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
> >>> On 12.01.2014, at 08:54, Michael S. Tsirkin <mst@redhat.com> wrote:
> >>> 
> >>>> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> >>>>> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> >>>>>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> >>>>>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> >>>>>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> >>>>>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> >>>>>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> >>>>>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> >>>>>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> >>>>>>>>>>>> From: Paolo Bonzini <pbonzini@redhat.com>
> >>>>>>>>>>>> 
> >>>>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
> >>>>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
> >>>>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
> >>>>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> >>>>>>>>>>>> consequently messing up the computations.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
> >>>>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> >>>>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
> >>>>>>>>>>>> address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> >>>>>>>>>>>> not 2^64.  But we get it anyway because phys_page_find ignores the upper
> >>>>>>>>>>>> bits of the physical address.  In address_space_translate_internal then
> >>>>>>>>>>>> 
> >>>>>>>>>>>>   diff = int128_sub(section->mr->size, int128_make64(addr));
> >>>>>>>>>>>>   *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> >>>>>>>>>>>> 
> >>>>>>>>>>>> diff becomes negative, and int128_get64 booms.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> The size of the PCI address space region should be fixed anyway.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> >>>>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> >>>>>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >>>>>>>>>>>> ---
> >>>>>>>>>>>> exec.c | 8 ++------
> >>>>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
> >>>>>>>>>>>> 
> >>>>>>>>>>>> diff --git a/exec.c b/exec.c
> >>>>>>>>>>>> index 7e5ce93..f907f5f 100644
> >>>>>>>>>>>> --- a/exec.c
> >>>>>>>>>>>> +++ b/exec.c
> >>>>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
> >>>>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> >>>>>>>>>>>> 
> >>>>>>>>>>>> /* Size of the L2 (and L3, etc) page tables.  */
> >>>>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> >>>>>>>>>>>> +#define ADDR_SPACE_BITS 64
> >>>>>>>>>>>> 
> >>>>>>>>>>>> #define P_L2_BITS 10
> >>>>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
> >>>>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> >>>>>>>>>>>> {
> >>>>>>>>>>>>    system_memory = g_malloc(sizeof(*system_memory));
> >>>>>>>>>>>> 
> >>>>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
> >>>>>>>>>>>> -
> >>>>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
> >>>>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
> >>>>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> >>>>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> >>>>>>>>>>>>    address_space_init(&address_space_memory, system_memory, "memory");
> >>>>>>>>>>>> 
> >>>>>>>>>>>>    system_io = g_malloc(sizeof(*system_io));
> >>>>>>>>>>> 
> >>>>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
> >>>>>>>>>>> BARs that I'm not sure how to handle.
> >>>>>>>>>> 
> >>>>>>>>>> BARs are often disabled during sizing. Maybe you
> >>>>>>>>>> don't detect BAR being disabled?
> >>>>>>>>> 
> >>>>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> >>>>>>>>> the sizing an memory region updates for the BARs, vfio is just a
> >>>>>>>>> pass-through here.
> >>>>>>>> 
> >>>>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
> >>>>>>>> while I/O & memory are enabled int he command register.  Thanks,
> >>>>>>>> 
> >>>>>>>> Alex
> >>>>>>> 
> >>>>>>> OK then from QEMU POV this BAR value is not special at all.
> >>>>>> 
> >>>>>> Unfortunately
> >>>>>> 
> >>>>>>>>>>> After this patch I get vfio
> >>>>>>>>>>> traces like this:
> >>>>>>>>>>> 
> >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> >>>>>>>>>>> (save lower 32bits of BAR)
> >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> >>>>>>>>>>> (write mask to BAR)
> >>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> >>>>>>>>>>> (memory region gets unmapped)
> >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> >>>>>>>>>>> (read size mask)
> >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> >>>>>>>>>>> (restore BAR)
> >>>>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> >>>>>>>>>>> (memory region re-mapped)
> >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> >>>>>>>>>>> (save upper 32bits of BAR)
> >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> >>>>>>>>>>> (write mask to BAR)
> >>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> >>>>>>>>>>> (memory region gets unmapped)
> >>>>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> >>>>>>>>>>> (memory region gets re-mapped with new address)
> >>>>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> >>>>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
> >>>>>>>>>> 
> >>>>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
> >>>>>>>>> 
> >>>>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
> >>>>>>> 
> >>>>>>> Why can't you? Generally memory core let you find out easily.
> >>>>>> 
> >>>>>> My MemoryListener is setup for &address_space_memory and I then filter
> >>>>>> out anything that's not memory_region_is_ram().  This still gets
> >>>>>> through, so how do I easily find out?
> >>>>>> 
> >>>>>>> But in this case it's vfio device itself that is sized so for sure you
> >>>>>>> know it's MMIO.
> >>>>>> 
> >>>>>> How so?  I have a MemoryListener as described above and pass everything
> >>>>>> through to the IOMMU.  I suppose I could look through all the
> >>>>>> VFIODevices and check if the MemoryRegion matches, but that seems really
> >>>>>> ugly.
> >>>>>> 
> >>>>>>> Maybe you will have same issue if there's another device with a 64 bit
> >>>>>>> bar though, like ivshmem?
> >>>>>> 
> >>>>>> Perhaps, I suspect I'll see anything that registers their BAR
> >>>>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
> >>>>> 
> >>>>> Must be a 64 bit BAR to trigger the issue though.
> >>>>> 
> >>>>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
> >>>>>>>>> that we might be able to take advantage of with GPU passthrough.
> >>>>>>>>> 
> >>>>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
> >>>>>>>>>>> address, presumably because it was beyond the address space of the PCI
> >>>>>>>>>>> window.  This address is clearly not in a PCI MMIO space, so why are we
> >>>>>>>>>>> allowing it to be realized in the system address space at this location?
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> 
> >>>>>>>>>>> Alex
> >>>>>>>>>> 
> >>>>>>>>>> Why do you think it is not in PCI MMIO space?
> >>>>>>>>>> True, CPU can't access this address but other pci devices can.
> >>>>>>>>> 
> >>>>>>>>> What happens on real hardware when an address like this is programmed to
> >>>>>>>>> a device?  The CPU doesn't have the physical bits to access it.  I have
> >>>>>>>>> serious doubts that another PCI device would be able to access it
> >>>>>>>>> either.  Maybe in some limited scenario where the devices are on the
> >>>>>>>>> same conventional PCI bus.  In the typical case, PCI addresses are
> >>>>>>>>> always limited by some kind of aperture, whether that's explicit in
> >>>>>>>>> bridge windows or implicit in hardware design (and perhaps made explicit
> >>>>>>>>> in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> >>>>>>>>> would I do it in a way that still allows real 64bit MMIO to be
> >>>>>>>>> programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> >>>>>>>>> 
> >>>>>>>>> Alex
> >>>>>>> 
> >>>>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
> >>>>>>> full 64 bit addresses must be allowed and hardware validation
> >>>>>>> test suites normally check that it actually does work
> >>>>>>> if it happens.
> >>>>>> 
> >>>>>> Sure, PCI devices themselves, but the chipset typically has defined
> >>>>>> routing, that's more what I'm referring to.  There are generally only
> >>>>>> fixed address windows for RAM vs MMIO.
> >>>>> 
> >>>>> The physical chipset? Likely - in the presence of IOMMU.
> >>>>> Without that, devices can talk to each other without going
> >>>>> through chipset, and bridge spec is very explicit that
> >>>>> full 64 bit addressing must be supported.
> >>>>> 
> >>>>> So as long as we don't emulate an IOMMU,
> >>>>> guest will normally think it's okay to use any address.
> >>>>> 
> >>>>>>> Yes, if there's a bridge somewhere on the path that bridge's
> >>>>>>> windows would protect you, but pci already does this filtering:
> >>>>>>> if you see this address in the memory map this means
> >>>>>>> your virtual device is on root bus.
> >>>>>>> 
> >>>>>>> So I think it's the other way around: if VFIO requires specific
> >>>>>>> address ranges to be assigned to devices, it should give this
> >>>>>>> info to qemu and qemu can give this to guest.
> >>>>>>> Then anything outside that range can be ignored by VFIO.
> >>>>>> 
> >>>>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
> >>>>>> currently no way to find out the address width of the IOMMU.  We've been
> >>>>>> getting by because it's safely close enough to the CPU address width to
> >>>>>> not be a concern until we start exposing things at the top of the 64bit
> >>>>>> address space.  Maybe I can safely ignore anything above
> >>>>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> >>>>>> 
> >>>>>> Alex
> >>>>> 
> >>>>> I think it's not related to target CPU at all - it's a host limitation.
> >>>>> So just make up your own constant, maybe depending on host architecture.
> >>>>> Long term add an ioctl to query it.
> >>>> 
> >>>> It's a hardware limitation which I'd imagine has some loose ties to the
> >>>> physical address bits of the CPU.
> >>>> 
> >>>>> Also, we can add a fwcfg interface to tell bios that it should avoid
> >>>>> placing BARs above some address.
> >>>> 
> >>>> That doesn't help this case, it's a spurious mapping caused by sizing
> >>>> the BARs with them enabled.  We may still want such a thing to feed into
> >>>> building ACPI tables though.
> >>> 
> >>> Well the point is that if you want BIOS to avoid
> >>> specific addresses, you need to tell it what to avoid.
> >>> But neither BIOS nor ACPI actually cover the range above
> >>> 2^48 ATM so it's not a high priority.
> >>> 
> >>>>> Since it's a vfio limitation I think it should be a vfio API, along the
> >>>>> lines of vfio_get_addr_space_bits(void).
> >>>>> (Is this true btw? legacy assignment doesn't have this problem?)
> >>>> 
> >>>> It's an IOMMU hardware limitation, legacy assignment has the same
> >>>> problem.  It looks like legacy will abort() in QEMU for the failed
> >>>> mapping and I'm planning to tighten vfio to also kill the VM for failed
> >>>> mappings.  In the short term, I think I'll ignore any mappings above
> >>>> TARGET_PHYS_ADDR_SPACE_BITS,
> >>> 
> >>> That seems very wrong. It will still fail on an x86 host if we are
> >>> emulating a CPU with full 64 bit addressing. The limitation is on the
> >>> host side there's no real reason to tie it to the target.
> > 
> > I doubt vfio would be the only thing broken in that case.
> > 
> >>>> long term vfio already has an IOMMU info
> >>>> ioctl that we could use to return this information, but we'll need to
> >>>> figure out how to get it out of the IOMMU driver first.
> >>>> Thanks,
> >>>> 
> >>>> Alex
> >>> 
> >>> Short term, just assume 48 bits on x86.
> > 
> > I hate to pick an arbitrary value since we have a very specific mapping
> > we're trying to avoid.  Perhaps a better option is to skip anything
> > where:
> > 
> >        MemoryRegionSection.offset_within_address_space >
> >        ~MemoryRegionSection.offset_within_address_space
> > 
> >>> We need to figure out what's the limitation on ppc and arm -
> >>> maybe there's none and it can address full 64 bit range.
> >> 
> >> IIUC on PPC and ARM you always have BAR windows where things can get mapped into. Unlike x86 where the full physical address range can be overlaid by BARs.
> >> 
> >> Or did I misunderstand the question?
> > 
> > Sounds right, if either BAR mappings outside the window will not be
> > realized in the memory space or the IOMMU has a full 64bit address
> > space, there's no problem.  Here we have an intermediate step in the BAR
> > sizing producing a stray mapping that the IOMMU hardware can't handle.
> > Even if we could handle it, it's not clear that we want to.  On AMD-Vi
> > the IOMMU page tables can grow to six levels deep.  A stray mapping like
> > this then causes space and time overhead until the tables are pruned
> > back down.  Thanks,
> 
> I thought sizing is hard-defined as a write of all 1s (-1)? Can't we check
> for that one special case and treat it as "not mapped, but tell the guest the size in config space"?
> 
> Alex

We already have a work-around like this, and it works for 32-bit BARs
or after software has written the full 64-bit register:
    /* Reject a zero base, a wrapped range, or a range ending at the very
     * top of the address space (the all-ones value a sizing write leaves). */
    if (last_addr <= new_addr || new_addr == 0 ||
        last_addr == PCI_BAR_UNMAPPED) {
        return PCI_BAR_UNMAPPED;
    }

    /* A 32-bit memory BAR must not run up to or past the top of the 32-bit space. */
    if (!(type & PCI_BASE_ADDRESS_MEM_TYPE_64) && last_addr >= UINT32_MAX) {
        return PCI_BAR_UNMAPPED;
    }


But for 64-bit BARs the software writes all 1s to the high 32-bit
register before writing the low register (see trace above).
This makes it impossible to distinguish between setting the BAR to
0xfffffffffebe0000 and this intermediate sizing step.
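
To spell out why the two cases collide (a standalone sketch, not QEMU's PCI
code; the helper name and the simple flag masking are illustrations only):

    #include <stdint.h>
    #include <stdio.h>

    /* Combine the two config-space dwords of a 64-bit memory BAR and strip
     * the low flag bits (the 0x4 in 0xfebe0004 marks it as a 64-bit BAR). */
    static uint64_t bar64_addr(uint32_t lo, uint32_t hi)
    {
        return (((uint64_t)hi << 32) | lo) & ~0xfULL;
    }

    int main(void)
    {
        /* Intermediate sizing state from the trace: all 1s just written to
         * the high dword, the original value still in the low dword. */
        uint64_t sizing = bar64_addr(0xfebe0004, 0xffffffff);

        /* A guest that really programs the BAR to 0xfffffffffebe0000 leaves
         * exactly the same register contents behind. */
        uint64_t real = bar64_addr(0xfebe0004, 0xffffffff);

        printf("sizing step: %#llx\n", (unsigned long long)sizing);
        printf("real setup:  %#llx\n", (unsigned long long)real);
        return 0;
    }

Both print 0xfffffffffebe0000, so at this point the PCI core has nothing to
distinguish the sizing step from a legitimate high mapping.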


> > 
> > Alex
> > 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2014-01-14  8:18                           ` Michael S. Tsirkin
@ 2014-01-14  9:20                             ` Alexander Graf
  2014-01-14  9:31                               ` Peter Maydell
                                                 ` (2 more replies)
  0 siblings, 3 replies; 74+ messages in thread
From: Alexander Graf @ 2014-01-14  9:20 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Peter Maydell, Alexey Kardashevskiy, QEMU Developers,
	Luiz Capitulino, Alex Williamson, Paolo Bonzini, David Gibson


On 14.01.2014, at 09:18, Michael S. Tsirkin <mst@redhat.com> wrote:

> On Mon, Jan 13, 2014 at 10:48:21PM +0100, Alexander Graf wrote:
>> 
>> 
>>> Am 13.01.2014 um 22:39 schrieb Alex Williamson <alex.williamson@redhat.com>:
>>> 
>>>> On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
>>>>> On 12.01.2014, at 08:54, Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>> 
>>>>>> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
>>>>>>> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
>>>>>>>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
>>>>>>>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
>>>>>>>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
>>>>>>>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
>>>>>>>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
>>>>>>>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
>>>>>>>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
>>>>>>>>>>>>>> From: Paolo Bonzini <pbonzini@redhat.com>
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
>>>>>>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
>>>>>>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
>>>>>>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
>>>>>>>>>>>>>> consequently messing up the computations.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
>>>>>>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
>>>>>>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
>>>>>>>>>>>>>> address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
>>>>>>>>>>>>>> not 2^64.  But we get it anyway because phys_page_find ignores the upper
>>>>>>>>>>>>>> bits of the physical address.  In address_space_translate_internal then
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>  diff = int128_sub(section->mr->size, int128_make64(addr));
>>>>>>>>>>>>>>  *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> diff becomes negative, and int128_get64 booms.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The size of the PCI address space region should be fixed anyway.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
>>>>>>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>>>>>>>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>> exec.c | 8 ++------
>>>>>>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> diff --git a/exec.c b/exec.c
>>>>>>>>>>>>>> index 7e5ce93..f907f5f 100644
>>>>>>>>>>>>>> --- a/exec.c
>>>>>>>>>>>>>> +++ b/exec.c
>>>>>>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
>>>>>>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> /* Size of the L2 (and L3, etc) page tables.  */
>>>>>>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
>>>>>>>>>>>>>> +#define ADDR_SPACE_BITS 64
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> #define P_L2_BITS 10
>>>>>>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
>>>>>>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>   system_memory = g_malloc(sizeof(*system_memory));
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
>>>>>>>>>>>>>> -
>>>>>>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
>>>>>>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
>>>>>>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
>>>>>>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
>>>>>>>>>>>>>>   address_space_init(&address_space_memory, system_memory, "memory");
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>   system_io = g_malloc(sizeof(*system_io));
>>>>>>>>>>>>> 
>>>>>>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
>>>>>>>>>>>>> BARs that I'm not sure how to handle.
>>>>>>>>>>>> 
>>>>>>>>>>>> BARs are often disabled during sizing. Maybe you
>>>>>>>>>>>> don't detect BAR being disabled?
>>>>>>>>>>> 
>>>>>>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
>>>>>>>>>>> the sizing an memory region updates for the BARs, vfio is just a
>>>>>>>>>>> pass-through here.
>>>>>>>>>> 
>>>>>>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
>>>>>>>>>> while I/O & memory are enabled int he command register.  Thanks,
>>>>>>>>>> 
>>>>>>>>>> Alex
>>>>>>>>> 
>>>>>>>>> OK then from QEMU POV this BAR value is not special at all.
>>>>>>>> 
>>>>>>>> Unfortunately
>>>>>>>> 
>>>>>>>>>>>>> After this patch I get vfio
>>>>>>>>>>>>> traces like this:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
>>>>>>>>>>>>> (save lower 32bits of BAR)
>>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
>>>>>>>>>>>>> (write mask to BAR)
>>>>>>>>>>>>> vfio: region_del febe0000 - febe3fff
>>>>>>>>>>>>> (memory region gets unmapped)
>>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
>>>>>>>>>>>>> (read size mask)
>>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
>>>>>>>>>>>>> (restore BAR)
>>>>>>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
>>>>>>>>>>>>> (memory region re-mapped)
>>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
>>>>>>>>>>>>> (save upper 32bits of BAR)
>>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
>>>>>>>>>>>>> (write mask to BAR)
>>>>>>>>>>>>> vfio: region_del febe0000 - febe3fff
>>>>>>>>>>>>> (memory region gets unmapped)
>>>>>>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
>>>>>>>>>>>>> (memory region gets re-mapped with new address)
>>>>>>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
>>>>>>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
>>>>>>>>>>>> 
>>>>>>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
>>>>>>>>>>> 
>>>>>>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
>>>>>>>>> 
>>>>>>>>> Why can't you? Generally memory core let you find out easily.
>>>>>>>> 
>>>>>>>> My MemoryListener is setup for &address_space_memory and I then filter
>>>>>>>> out anything that's not memory_region_is_ram().  This still gets
>>>>>>>> through, so how do I easily find out?
>>>>>>>> 
>>>>>>>>> But in this case it's vfio device itself that is sized so for sure you
>>>>>>>>> know it's MMIO.
>>>>>>>> 
>>>>>>>> How so?  I have a MemoryListener as described above and pass everything
>>>>>>>> through to the IOMMU.  I suppose I could look through all the
>>>>>>>> VFIODevices and check if the MemoryRegion matches, but that seems really
>>>>>>>> ugly.
>>>>>>>> 
>>>>>>>>> Maybe you will have same issue if there's another device with a 64 bit
>>>>>>>>> bar though, like ivshmem?
>>>>>>>> 
>>>>>>>> Perhaps, I suspect I'll see anything that registers their BAR
>>>>>>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
>>>>>>> 
>>>>>>> Must be a 64 bit BAR to trigger the issue though.
>>>>>>> 
>>>>>>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
>>>>>>>>>>> that we might be able to take advantage of with GPU passthrough.
>>>>>>>>>>> 
>>>>>>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
>>>>>>>>>>>>> address, presumably because it was beyond the address space of the PCI
>>>>>>>>>>>>> window.  This address is clearly not in a PCI MMIO space, so why are we
>>>>>>>>>>>>> allowing it to be realized in the system address space at this location?
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Alex
>>>>>>>>>>>> 
>>>>>>>>>>>> Why do you think it is not in PCI MMIO space?
>>>>>>>>>>>> True, CPU can't access this address but other pci devices can.
>>>>>>>>>>> 
>>>>>>>>>>> What happens on real hardware when an address like this is programmed to
>>>>>>>>>>> a device?  The CPU doesn't have the physical bits to access it.  I have
>>>>>>>>>>> serious doubts that another PCI device would be able to access it
>>>>>>>>>>> either.  Maybe in some limited scenario where the devices are on the
>>>>>>>>>>> same conventional PCI bus.  In the typical case, PCI addresses are
>>>>>>>>>>> always limited by some kind of aperture, whether that's explicit in
>>>>>>>>>>> bridge windows or implicit in hardware design (and perhaps made explicit
>>>>>>>>>>> in ACPI).  Even if I wanted to filter these out as noise in vfio, how
>>>>>>>>>>> would I do it in a way that still allows real 64bit MMIO to be
>>>>>>>>>>> programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
>>>>>>>>>>> 
>>>>>>>>>>> Alex
>>>>>>>>> 
>>>>>>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
>>>>>>>>> full 64 bit addresses must be allowed and hardware validation
>>>>>>>>> test suites normally check that it actually does work
>>>>>>>>> if it happens.
>>>>>>>> 
>>>>>>>> Sure, PCI devices themselves, but the chipset typically has defined
>>>>>>>> routing, that's more what I'm referring to.  There are generally only
>>>>>>>> fixed address windows for RAM vs MMIO.
>>>>>>> 
>>>>>>> The physical chipset? Likely - in the presence of IOMMU.
>>>>>>> Without that, devices can talk to each other without going
>>>>>>> through chipset, and bridge spec is very explicit that
>>>>>>> full 64 bit addressing must be supported.
>>>>>>> 
>>>>>>> So as long as we don't emulate an IOMMU,
>>>>>>> guest will normally think it's okay to use any address.
>>>>>>> 
>>>>>>>>> Yes, if there's a bridge somewhere on the path that bridge's
>>>>>>>>> windows would protect you, but pci already does this filtering:
>>>>>>>>> if you see this address in the memory map this means
>>>>>>>>> your virtual device is on root bus.
>>>>>>>>> 
>>>>>>>>> So I think it's the other way around: if VFIO requires specific
>>>>>>>>> address ranges to be assigned to devices, it should give this
>>>>>>>>> info to qemu and qemu can give this to guest.
>>>>>>>>> Then anything outside that range can be ignored by VFIO.
>>>>>>>> 
>>>>>>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
>>>>>>>> currently no way to find out the address width of the IOMMU.  We've been
>>>>>>>> getting by because it's safely close enough to the CPU address width to
>>>>>>>> not be a concern until we start exposing things at the top of the 64bit
>>>>>>>> address space.  Maybe I can safely ignore anything above
>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
>>>>>>>> 
>>>>>>>> Alex
>>>>>>> 
>>>>>>> I think it's not related to target CPU at all - it's a host limitation.
>>>>>>> So just make up your own constant, maybe depending on host architecture.
>>>>>>> Long term add an ioctl to query it.
>>>>>> 
>>>>>> It's a hardware limitation which I'd imagine has some loose ties to the
>>>>>> physical address bits of the CPU.
>>>>>> 
>>>>>>> Also, we can add a fwcfg interface to tell bios that it should avoid
>>>>>>> placing BARs above some address.
>>>>>> 
>>>>>> That doesn't help this case, it's a spurious mapping caused by sizing
>>>>>> the BARs with them enabled.  We may still want such a thing to feed into
>>>>>> building ACPI tables though.
>>>>> 
>>>>> Well the point is that if you want BIOS to avoid
>>>>> specific addresses, you need to tell it what to avoid.
>>>>> But neither BIOS nor ACPI actually cover the range above
>>>>> 2^48 ATM so it's not a high priority.
>>>>> 
>>>>>>> Since it's a vfio limitation I think it should be a vfio API, along the
>>>>>>> lines of vfio_get_addr_space_bits(void).
>>>>>>> (Is this true btw? legacy assignment doesn't have this problem?)
>>>>>> 
>>>>>> It's an IOMMU hardware limitation, legacy assignment has the same
>>>>>> problem.  It looks like legacy will abort() in QEMU for the failed
>>>>>> mapping and I'm planning to tighten vfio to also kill the VM for failed
>>>>>> mappings.  In the short term, I think I'll ignore any mappings above
>>>>>> TARGET_PHYS_ADDR_SPACE_BITS,
>>>>> 
>>>>> That seems very wrong. It will still fail on an x86 host if we are
>>>>> emulating a CPU with full 64 bit addressing. The limitation is on the
>>>>> host side; there's no real reason to tie it to the target.
>>> 
>>> I doubt vfio would be the only thing broken in that case.
>>> 
>>>>>> long term vfio already has an IOMMU info
>>>>>> ioctl that we could use to return this information, but we'll need to
>>>>>> figure out how to get it out of the IOMMU driver first.
>>>>>> Thanks,
>>>>>> 
>>>>>> Alex
>>>>> 
>>>>> Short term, just assume 48 bits on x86.
>>> 
>>> I hate to pick an arbitrary value since we have a very specific mapping
>>> we're trying to avoid.  Perhaps a better option is to skip anything
>>> where:
>>> 
>>>       MemoryRegionSection.offset_within_address_space >
>>>       ~MemoryRegionSection.offset_within_address_space
>>> 
>>>>> We need to figure out what's the limitation on ppc and arm -
>>>>> maybe there's none and it can address full 64 bit range.
>>>> 
>>>> IIUC on PPC and ARM you always have BAR windows where things can get mapped into, unlike x86 where the full physical address range can be overlaid by BARs.
>>>> 
>>>> Or did I misunderstand the question?
>>> 
>>> Sounds right, if either BAR mappings outside the window will not be
>>> realized in the memory space or the IOMMU has a full 64bit address
>>> space, there's no problem.  Here we have an intermediate step in the BAR
>>> sizing producing a stray mapping that the IOMMU hardware can't handle.
>>> Even if we could handle it, it's not clear that we want to.  On AMD-Vi
>>> the IOMMU page tables can grow to 6 levels deep.  A stray mapping like
>>> this then causes space and time overhead until the tables are pruned
>>> back down.  Thanks,
>> 
>> I thought sizing is hard-defined as a write of
>> -1? Can't we check for that one special case and treat it as "not mapped, but tell the guest the size in config space"?
>> 
>> Alex
> 
> We already have a work-around like this and it works for 32 bit BARs
> or after software writes the full 64 bit register:
>    if (last_addr <= new_addr || new_addr == 0 ||
>        last_addr == PCI_BAR_UNMAPPED) {
>        return PCI_BAR_UNMAPPED;
>    }
> 
>    if  (!(type & PCI_BASE_ADDRESS_MEM_TYPE_64) && last_addr >= UINT32_MAX) {
>        return PCI_BAR_UNMAPPED;
>    }
> 
> 
> But for 64 bit BARs this software writes all 1's
> in the high 32 bit register before writing in the low register
> (see trace above).
> This makes it impossible to distinguish between
> setting bar at fffffffffebe0000 and this intermediate sizing step.

Well, at least according to the AMD manual there's only support for 52 bits of physical address space:

	• Long Mode—This mode is unique to the AMD64 architecture. This mode supports up to 4 petabytes of physical-address space using 52-bit physical addresses.

Intel seems to agree:

	• CPUID.80000008H:EAX[7:0] reports the physical-address width supported by the processor. (For processors that do not support CPUID function 80000008H, the width is generally 36 if CPUID.01H:EDX.PAE [bit 6] = 1 and 32 otherwise.) This width is referred to as MAXPHYADDR. MAXPHYADDR is at most 52.

Of course there's potential for future extensions to allow for more bits, but at least the current generation x86_64 (and x86) specification clearly only supports 52 bits of physical address space. And non-x86(_64) don't care about bigger address spaces either because they use BAR windows which are very unlikely to grow bigger than 52 bits ;).
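
For what it's worth, a minimal sketch of reading MAXPHYADDR on the host, assuming an x86 build and a compiler that ships <cpuid.h>; this is only an illustration, not code from the series:

#include <cpuid.h>
#include <stdio.h>

/* Query CPUID leaf 0x80000008: EAX[7:0] is MAXPHYADDR, at most 52. */
static unsigned int host_phys_bits(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx)) {
        return eax & 0xff;
    }
    /* Leaf not supported: fall back per the SDM note quoted above. */
    return 36;
}

int main(void)
{
    printf("host physical address bits: %u\n", host_phys_bits());
    return 0;
}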


Alex

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2014-01-14  9:20                             ` Alexander Graf
@ 2014-01-14  9:31                               ` Peter Maydell
  2014-01-14 10:28                               ` Michael S. Tsirkin
  2014-01-14 10:43                               ` Michael S. Tsirkin
  2 siblings, 0 replies; 74+ messages in thread
From: Peter Maydell @ 2014-01-14  9:31 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Michael S. Tsirkin, Alexey Kardashevskiy, QEMU Developers,
	Luiz Capitulino, Alex Williamson, Paolo Bonzini, David Gibson

On 14 January 2014 09:20, Alexander Graf <agraf@suse.de> wrote:
> Of course there's potential for future extensions to allow for more
> bits in the future, but at least the current generation x86_64 (and x86)
> specification clearly only supports 52 bits of physical address space.
> And non-x86(_64) don't care about bigger address spaces either
> because they use BAR windows which are very unlikely to grow
> bigger than 52 bits ;).

There's no reason you couldn't do an ARM (most likely AArch64)
system which dealt with PCI BARs the same way as x86 rather
than having a fixed window in the memory map; I wouldn't be
surprised if some of the server designs took that route. However
the architecture specifies a 48 bit maximum physical address.

With some of the BAR-window design PCI controllers I think it's
theoretically possible to configure the controller so that the
window shows the very top part of PCI address space and then
configure all your device BARs with very high PCI addresses.
In that case the BAR MemoryRegions would get mapped in
at very high addresses in the PCI memory address space
MemoryRegion container, and at more usual small addresses
in the system AddressSpace.

thanks
-- PMM

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2014-01-13 22:48                           ` Alex Williamson
@ 2014-01-14 10:24                             ` Avi Kivity
  2014-01-14 11:50                               ` Michael S. Tsirkin
  2014-01-14 15:36                               ` Alex Williamson
  2014-01-14 12:07                             ` Michael S. Tsirkin
  1 sibling, 2 replies; 74+ messages in thread
From: Avi Kivity @ 2014-01-14 10:24 UTC (permalink / raw)
  To: Alex Williamson, Alexander Graf
  Cc: Peter Maydell, Michael S. Tsirkin, Alexey Kardashevskiy,
	QEMU Developers, Luiz Capitulino, Paolo Bonzini, David Gibson

On 01/14/2014 12:48 AM, Alex Williamson wrote:
> On Mon, 2014-01-13 at 22:48 +0100, Alexander Graf wrote:
>>> Am 13.01.2014 um 22:39 schrieb Alex Williamson <alex.williamson@redhat.com>:
>>>
>>>> On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
>>>>> On 12.01.2014, at 08:54, Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>>
>>>>>> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
>>>>>>> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
>>>>>>>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
>>>>>>>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
>>>>>>>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
>>>>>>>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
>>>>>>>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
>>>>>>>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
>>>>>>>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
>>>>>>>>>>>>>> From: Paolo Bonzini <pbonzini@redhat.com>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
>>>>>>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
>>>>>>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
>>>>>>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
>>>>>>>>>>>>>> consequently messing up the computations.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
>>>>>>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
>>>>>>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
>>>>>>>>>>>>>> address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
>>>>>>>>>>>>>> not 2^64.  But we get it anyway because phys_page_find ignores the upper
>>>>>>>>>>>>>> bits of the physical address.  In address_space_translate_internal then
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    diff = int128_sub(section->mr->size, int128_make64(addr));
>>>>>>>>>>>>>>    *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> diff becomes negative, and int128_get64 booms.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The size of the PCI address space region should be fixed anyway.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
>>>>>>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>>>>>>>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>> exec.c | 8 ++------
>>>>>>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> diff --git a/exec.c b/exec.c
>>>>>>>>>>>>>> index 7e5ce93..f907f5f 100644
>>>>>>>>>>>>>> --- a/exec.c
>>>>>>>>>>>>>> +++ b/exec.c
>>>>>>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
>>>>>>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> /* Size of the L2 (and L3, etc) page tables.  */
>>>>>>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
>>>>>>>>>>>>>> +#define ADDR_SPACE_BITS 64
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> #define P_L2_BITS 10
>>>>>>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
>>>>>>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>     system_memory = g_malloc(sizeof(*system_memory));
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
>>>>>>>>>>>>>> -
>>>>>>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
>>>>>>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
>>>>>>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
>>>>>>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
>>>>>>>>>>>>>>     address_space_init(&address_space_memory, system_memory, "memory");
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>     system_io = g_malloc(sizeof(*system_io));
>>>>>>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
>>>>>>>>>>>>> BARs that I'm not sure how to handle.
>>>>>>>>>>>> BARs are often disabled during sizing. Maybe you
>>>>>>>>>>>> don't detect BAR being disabled?
>>>>>>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
>>>>>>>>>>> the sizing and memory region updates for the BARs, vfio is just a
>>>>>>>>>>> pass-through here.
>>>>>>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
>>>>>>>>>> while I/O & memory are enabled in the command register.  Thanks,
>>>>>>>>>>
>>>>>>>>>> Alex
>>>>>>>>> OK then from QEMU POV this BAR value is not special at all.
>>>>>>>> Unfortunately
>>>>>>>>
>>>>>>>>>>>>> After this patch I get vfio
>>>>>>>>>>>>> traces like this:
>>>>>>>>>>>>>
>>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
>>>>>>>>>>>>> (save lower 32bits of BAR)
>>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
>>>>>>>>>>>>> (write mask to BAR)
>>>>>>>>>>>>> vfio: region_del febe0000 - febe3fff
>>>>>>>>>>>>> (memory region gets unmapped)
>>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
>>>>>>>>>>>>> (read size mask)
>>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
>>>>>>>>>>>>> (restore BAR)
>>>>>>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
>>>>>>>>>>>>> (memory region re-mapped)
>>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
>>>>>>>>>>>>> (save upper 32bits of BAR)
>>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
>>>>>>>>>>>>> (write mask to BAR)
>>>>>>>>>>>>> vfio: region_del febe0000 - febe3fff
>>>>>>>>>>>>> (memory region gets unmapped)
>>>>>>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
>>>>>>>>>>>>> (memory region gets re-mapped with new address)
>>>>>>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
>>>>>>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
>>>>>>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
>>>>>>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
>>>>>>>>> Why can't you? Generally the memory core lets you find out easily.
>>>>>>>> My MemoryListener is setup for &address_space_memory and I then filter
>>>>>>>> out anything that's not memory_region_is_ram().  This still gets
>>>>>>>> through, so how do I easily find out?
>>>>>>>>
>>>>>>>>> But in this case it's vfio device itself that is sized so for sure you
>>>>>>>>> know it's MMIO.
>>>>>>>> How so?  I have a MemoryListener as described above and pass everything
>>>>>>>> through to the IOMMU.  I suppose I could look through all the
>>>>>>>> VFIODevices and check if the MemoryRegion matches, but that seems really
>>>>>>>> ugly.
>>>>>>>>
>>>>>>>>> Maybe you will have same issue if there's another device with a 64 bit
>>>>>>>>> bar though, like ivshmem?
>>>>>>>> Perhaps, I suspect I'll see anything that registers their BAR
>>>>>>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
>>>>>>> Must be a 64 bit BAR to trigger the issue though.
>>>>>>>
>>>>>>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
>>>>>>>>>>> that we might be able to take advantage of with GPU passthrough.
>>>>>>>>>>>
>>>>>>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
>>>>>>>>>>>>> address, presumably because it was beyond the address space of the PCI
>>>>>>>>>>>>> window.  This address is clearly not in a PCI MMIO space, so why are we
>>>>>>>>>>>>> allowing it to be realized in the system address space at this location?
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Alex
>>>>>>>>>>>> Why do you think it is not in PCI MMIO space?
>>>>>>>>>>>> True, CPU can't access this address but other pci devices can.
>>>>>>>>>>> What happens on real hardware when an address like this is programmed to
>>>>>>>>>>> a device?  The CPU doesn't have the physical bits to access it.  I have
>>>>>>>>>>> serious doubts that another PCI device would be able to access it
>>>>>>>>>>> either.  Maybe in some limited scenario where the devices are on the
>>>>>>>>>>> same conventional PCI bus.  In the typical case, PCI addresses are
>>>>>>>>>>> always limited by some kind of aperture, whether that's explicit in
>>>>>>>>>>> bridge windows or implicit in hardware design (and perhaps made explicit
>>>>>>>>>>> in ACPI).  Even if I wanted to filter these out as noise in vfio, how
>>>>>>>>>>> would I do it in a way that still allows real 64bit MMIO to be
>>>>>>>>>>> programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Alex
>>>>>>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
>>>>>>>>> full 64 bit addresses must be allowed and hardware validation
>>>>>>>>> test suites normally check that it actually does work
>>>>>>>>> if it happens.
>>>>>>>> Sure, PCI devices themselves, but the chipset typically has defined
>>>>>>>> routing, that's more what I'm referring to.  There are generally only
>>>>>>>> fixed address windows for RAM vs MMIO.
>>>>>>> The physical chipset? Likely - in the presence of IOMMU.
>>>>>>> Without that, devices can talk to each other without going
>>>>>>> through chipset, and bridge spec is very explicit that
>>>>>>> full 64 bit addressing must be supported.
>>>>>>>
>>>>>>> So as long as we don't emulate an IOMMU,
>>>>>>> guest will normally think it's okay to use any address.
>>>>>>>
>>>>>>>>> Yes, if there's a bridge somewhere on the path that bridge's
>>>>>>>>> windows would protect you, but pci already does this filtering:
>>>>>>>>> if you see this address in the memory map this means
>>>>>>>>> your virtual device is on root bus.
>>>>>>>>>
>>>>>>>>> So I think it's the other way around: if VFIO requires specific
>>>>>>>>> address ranges to be assigned to devices, it should give this
>>>>>>>>> info to qemu and qemu can give this to guest.
>>>>>>>>> Then anything outside that range can be ignored by VFIO.
>>>>>>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
>>>>>>>> currently no way to find out the address width of the IOMMU.  We've been
>>>>>>>> getting by because it's safely close enough to the CPU address width to
>>>>>>>> not be a concern until we start exposing things at the top of the 64bit
>>>>>>>> address space.  Maybe I can safely ignore anything above
>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
>>>>>>>>
>>>>>>>> Alex
>>>>>>> I think it's not related to target CPU at all - it's a host limitation.
>>>>>>> So just make up your own constant, maybe depending on host architecture.
>>>>>>> Long term add an ioctl to query it.
>>>>>> It's a hardware limitation which I'd imagine has some loose ties to the
>>>>>> physical address bits of the CPU.
>>>>>>
>>>>>>> Also, we can add a fwcfg interface to tell bios that it should avoid
>>>>>>> placing BARs above some address.
>>>>>> That doesn't help this case, it's a spurious mapping caused by sizing
>>>>>> the BARs with them enabled.  We may still want such a thing to feed into
>>>>>> building ACPI tables though.
>>>>> Well the point is that if you want BIOS to avoid
>>>>> specific addresses, you need to tell it what to avoid.
>>>>> But neither BIOS nor ACPI actually cover the range above
>>>>> 2^48 ATM so it's not a high priority.
>>>>>
>>>>>>> Since it's a vfio limitation I think it should be a vfio API, along the
>>>>>>> lines of vfio_get_addr_space_bits(void).
>>>>>>> (Is this true btw? legacy assignment doesn't have this problem?)
>>>>>> It's an IOMMU hardware limitation, legacy assignment has the same
>>>>>> problem.  It looks like legacy will abort() in QEMU for the failed
>>>>>> mapping and I'm planning to tighten vfio to also kill the VM for failed
>>>>>> mappings.  In the short term, I think I'll ignore any mappings above
>>>>>> TARGET_PHYS_ADDR_SPACE_BITS,
>>>>> That seems very wrong. It will still fail on an x86 host if we are
>>>>> emulating a CPU with full 64 bit addressing. The limitation is on the
>>>>> host side; there's no real reason to tie it to the target.
>>> I doubt vfio would be the only thing broken in that case.
>>>
>>>>>> long term vfio already has an IOMMU info
>>>>>> ioctl that we could use to return this information, but we'll need to
>>>>>> figure out how to get it out of the IOMMU driver first.
>>>>>> Thanks,
>>>>>>
>>>>>> Alex
>>>>> Short term, just assume 48 bits on x86.
>>> I hate to pick an arbitrary value since we have a very specific mapping
>>> we're trying to avoid.  Perhaps a better option is to skip anything
>>> where:
>>>
>>>         MemoryRegionSection.offset_within_address_space >
>>>         ~MemoryRegionSection.offset_within_address_space
>>>
>>>>> We need to figure out what's the limitation on ppc and arm -
>>>>> maybe there's none and it can address full 64 bit range.
>>>> IIUC on PPC and ARM you always have BAR windows where things can get mapped into, unlike x86 where the full physical address range can be overlaid by BARs.
>>>>
>>>> Or did I misunderstand the question?
>>> Sounds right, if either BAR mappings outside the window will not be
>>> realized in the memory space or the IOMMU has a full 64bit address
>>> space, there's no problem.  Here we have an intermediate step in the BAR
>>> sizing producing a stray mapping that the IOMMU hardware can't handle.
>>> Even if we could handle it, it's not clear that we want to.  On AMD-Vi
>>> the IOMMU page tables can grow to 6 levels deep.  A stray mapping like
>>> this then causes space and time overhead until the tables are pruned
>>> back down.  Thanks,
>> I thought sizing is hard-defined as a write of
>> -1? Can't we check for that one special case and treat it as "not mapped, but tell the guest the size in config space"?
> PCI doesn't want to handle this as anything special to differentiate a
> sizing mask from a valid BAR address.  I agree though, I'd prefer to
> never see a spurious address like this in my MemoryListener.
>
>

Can't you just ignore regions that cannot be mapped?  Oh, and teach the 
bios and/or linux to disable memory access while sizing.
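
For the first part, a rough sketch of what "ignore regions that cannot be mapped" could look like in the vfio listener (the types are simplified and the 48-bit constant is only a stand-in for whatever limit the host IOMMU really has):

#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the host IOMMU address width; the thread above discusses
 * querying this from vfio instead of hard-coding it. */
#define HOST_IOMMU_ADDR_BITS 48

/* Return false for sections the host IOMMU cannot address, such as the
 * transient fffffffffebe0000 mapping that shows up while BARs are sized. */
static bool vfio_section_is_mappable(uint64_t offset, uint64_t size)
{
    const uint64_t limit = 1ULL << HOST_IOMMU_ADDR_BITS;

    return offset < limit && size <= limit - offset;
}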

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2014-01-14  9:20                             ` Alexander Graf
  2014-01-14  9:31                               ` Peter Maydell
@ 2014-01-14 10:28                               ` Michael S. Tsirkin
  2014-01-14 10:43                               ` Michael S. Tsirkin
  2 siblings, 0 replies; 74+ messages in thread
From: Michael S. Tsirkin @ 2014-01-14 10:28 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Peter Maydell, Alexey Kardashevskiy, QEMU Developers,
	Luiz Capitulino, Alex Williamson, Paolo Bonzini, David Gibson

On Tue, Jan 14, 2014 at 10:20:57AM +0100, Alexander Graf wrote:
> 
> On 14.01.2014, at 09:18, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> > On Mon, Jan 13, 2014 at 10:48:21PM +0100, Alexander Graf wrote:
> >> 
> >> 
> >>> Am 13.01.2014 um 22:39 schrieb Alex Williamson <alex.williamson@redhat.com>:
> >>> 
> >>>> On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
> >>>>> On 12.01.2014, at 08:54, Michael S. Tsirkin <mst@redhat.com> wrote:
> >>>>> 
> >>>>>> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> >>>>>>> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> >>>>>>>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> >>>>>>>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> >>>>>>>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> >>>>>>>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> >>>>>>>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> >>>>>>>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> >>>>>>>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> >>>>>>>>>>>>>> From: Paolo Bonzini <pbonzini@redhat.com>
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
> >>>>>>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
> >>>>>>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
> >>>>>>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> >>>>>>>>>>>>>> consequently messing up the computations.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
> >>>>>>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> >>>>>>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
> >>>>>>>>>>>>>> address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> >>>>>>>>>>>>>> not 2^64.  But we get it anyway because phys_page_find ignores the upper
> >>>>>>>>>>>>>> bits of the physical address.  In address_space_translate_internal then
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>  diff = int128_sub(section->mr->size, int128_make64(addr));
> >>>>>>>>>>>>>>  *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> diff becomes negative, and int128_get64 booms.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> The size of the PCI address space region should be fixed anyway.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> >>>>>>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> >>>>>>>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >>>>>>>>>>>>>> ---
> >>>>>>>>>>>>>> exec.c | 8 ++------
> >>>>>>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> diff --git a/exec.c b/exec.c
> >>>>>>>>>>>>>> index 7e5ce93..f907f5f 100644
> >>>>>>>>>>>>>> --- a/exec.c
> >>>>>>>>>>>>>> +++ b/exec.c
> >>>>>>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
> >>>>>>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> /* Size of the L2 (and L3, etc) page tables.  */
> >>>>>>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> >>>>>>>>>>>>>> +#define ADDR_SPACE_BITS 64
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> #define P_L2_BITS 10
> >>>>>>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
> >>>>>>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> >>>>>>>>>>>>>> {
> >>>>>>>>>>>>>>   system_memory = g_malloc(sizeof(*system_memory));
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
> >>>>>>>>>>>>>> -
> >>>>>>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
> >>>>>>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
> >>>>>>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> >>>>>>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> >>>>>>>>>>>>>>   address_space_init(&address_space_memory, system_memory, "memory");
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>   system_io = g_malloc(sizeof(*system_io));
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
> >>>>>>>>>>>>> BARs that I'm not sure how to handle.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> BARs are often disabled during sizing. Maybe you
> >>>>>>>>>>>> don't detect BAR being disabled?
> >>>>>>>>>>> 
> >>>>>>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> >>>>>>>>>>> the sizing and memory region updates for the BARs, vfio is just a
> >>>>>>>>>>> pass-through here.
> >>>>>>>>>> 
> >>>>>>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
> >>>>>>>>>> while I/O & memory are enabled in the command register.  Thanks,
> >>>>>>>>>> 
> >>>>>>>>>> Alex
> >>>>>>>>> 
> >>>>>>>>> OK then from QEMU POV this BAR value is not special at all.
> >>>>>>>> 
> >>>>>>>> Unfortunately
> >>>>>>>> 
> >>>>>>>>>>>>> After this patch I get vfio
> >>>>>>>>>>>>> traces like this:
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> >>>>>>>>>>>>> (save lower 32bits of BAR)
> >>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> >>>>>>>>>>>>> (write mask to BAR)
> >>>>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> >>>>>>>>>>>>> (memory region gets unmapped)
> >>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> >>>>>>>>>>>>> (read size mask)
> >>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> >>>>>>>>>>>>> (restore BAR)
> >>>>>>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> >>>>>>>>>>>>> (memory region re-mapped)
> >>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> >>>>>>>>>>>>> (save upper 32bits of BAR)
> >>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> >>>>>>>>>>>>> (write mask to BAR)
> >>>>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> >>>>>>>>>>>>> (memory region gets unmapped)
> >>>>>>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> >>>>>>>>>>>>> (memory region gets re-mapped with new address)
> >>>>>>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> >>>>>>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
> >>>>>>>>>>>> 
> >>>>>>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
> >>>>>>>>>>> 
> >>>>>>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
> >>>>>>>>> 
> >>>>>>>>> Why can't you? Generally the memory core lets you find out easily.
> >>>>>>>> 
> >>>>>>>> My MemoryListener is setup for &address_space_memory and I then filter
> >>>>>>>> out anything that's not memory_region_is_ram().  This still gets
> >>>>>>>> through, so how do I easily find out?
> >>>>>>>> 
> >>>>>>>>> But in this case it's vfio device itself that is sized so for sure you
> >>>>>>>>> know it's MMIO.
> >>>>>>>> 
> >>>>>>>> How so?  I have a MemoryListener as described above and pass everything
> >>>>>>>> through to the IOMMU.  I suppose I could look through all the
> >>>>>>>> VFIODevices and check if the MemoryRegion matches, but that seems really
> >>>>>>>> ugly.
> >>>>>>>> 
> >>>>>>>>> Maybe you will have same issue if there's another device with a 64 bit
> >>>>>>>>> bar though, like ivshmem?
> >>>>>>>> 
> >>>>>>>> Perhaps, I suspect I'll see anything that registers their BAR
> >>>>>>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
> >>>>>>> 
> >>>>>>> Must be a 64 bit BAR to trigger the issue though.
> >>>>>>> 
> >>>>>>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
> >>>>>>>>>>> that we might be able to take advantage of with GPU passthrough.
> >>>>>>>>>>> 
> >>>>>>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
> >>>>>>>>>>>>> address, presumably because it was beyond the address space of the PCI
> >>>>>>>>>>>>> window.  This address is clearly not in a PCI MMIO space, so why are we
> >>>>>>>>>>>>> allowing it to be realized in the system address space at this location?
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> Alex
> >>>>>>>>>>>> 
> >>>>>>>>>>>> Why do you think it is not in PCI MMIO space?
> >>>>>>>>>>>> True, CPU can't access this address but other pci devices can.
> >>>>>>>>>>> 
> >>>>>>>>>>> What happens on real hardware when an address like this is programmed to
> >>>>>>>>>>> a device?  The CPU doesn't have the physical bits to access it.  I have
> >>>>>>>>>>> serious doubts that another PCI device would be able to access it
> >>>>>>>>>>> either.  Maybe in some limited scenario where the devices are on the
> >>>>>>>>>>> same conventional PCI bus.  In the typical case, PCI addresses are
> >>>>>>>>>>> always limited by some kind of aperture, whether that's explicit in
> >>>>>>>>>>> bridge windows or implicit in hardware design (and perhaps made explicit
> >>>>>>>>>>> in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> >>>>>>>>>>> would I do it in a way that still allows real 64bit MMIO to be
> >>>>>>>>>>> programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> >>>>>>>>>>> 
> >>>>>>>>>>> Alex
> >>>>>>>>> 
> >>>>>>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
> >>>>>>>>> full 64 bit addresses must be allowed and hardware validation
> >>>>>>>>> test suites normally check that it actually does work
> >>>>>>>>> if it happens.
> >>>>>>>> 
> >>>>>>>> Sure, PCI devices themselves, but the chipset typically has defined
> >>>>>>>> routing, that's more what I'm referring to.  There are generally only
> >>>>>>>> fixed address windows for RAM vs MMIO.
> >>>>>>> 
> >>>>>>> The physical chipset? Likely - in the presence of IOMMU.
> >>>>>>> Without that, devices can talk to each other without going
> >>>>>>> through chipset, and bridge spec is very explicit that
> >>>>>>> full 64 bit addressing must be supported.
> >>>>>>> 
> >>>>>>> So as long as we don't emulate an IOMMU,
> >>>>>>> guest will normally think it's okay to use any address.
> >>>>>>> 
> >>>>>>>>> Yes, if there's a bridge somewhere on the path that bridge's
> >>>>>>>>> windows would protect you, but pci already does this filtering:
> >>>>>>>>> if you see this address in the memory map this means
> >>>>>>>>> your virtual device is on root bus.
> >>>>>>>>> 
> >>>>>>>>> So I think it's the other way around: if VFIO requires specific
> >>>>>>>>> address ranges to be assigned to devices, it should give this
> >>>>>>>>> info to qemu and qemu can give this to guest.
> >>>>>>>>> Then anything outside that range can be ignored by VFIO.
> >>>>>>>> 
> >>>>>>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
> >>>>>>>> currently no way to find out the address width of the IOMMU.  We've been
> >>>>>>>> getting by because it's safely close enough to the CPU address width to
> >>>>>>>> not be a concern until we start exposing things at the top of the 64bit
> >>>>>>>> address space.  Maybe I can safely ignore anything above
> >>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> >>>>>>>> 
> >>>>>>>> Alex
> >>>>>>> 
> >>>>>>> I think it's not related to target CPU at all - it's a host limitation.
> >>>>>>> So just make up your own constant, maybe depending on host architecture.
> >>>>>>> Long term add an ioctl to query it.
> >>>>>> 
> >>>>>> It's a hardware limitation which I'd imagine has some loose ties to the
> >>>>>> physical address bits of the CPU.
> >>>>>> 
> >>>>>>> Also, we can add a fwcfg interface to tell bios that it should avoid
> >>>>>>> placing BARs above some address.
> >>>>>> 
> >>>>>> That doesn't help this case, it's a spurious mapping caused by sizing
> >>>>>> the BARs with them enabled.  We may still want such a thing to feed into
> >>>>>> building ACPI tables though.
> >>>>> 
> >>>>> Well the point is that if you want BIOS to avoid
> >>>>> specific addresses, you need to tell it what to avoid.
> >>>>> But neither BIOS nor ACPI actually cover the range above
> >>>>> 2^48 ATM so it's not a high priority.
> >>>>> 
> >>>>>>> Since it's a vfio limitation I think it should be a vfio API, along the
> >>>>>>> lines of vfio_get_addr_space_bits(void).
> >>>>>>> (Is this true btw? legacy assignment doesn't have this problem?)
> >>>>>> 
> >>>>>> It's an IOMMU hardware limitation, legacy assignment has the same
> >>>>>> problem.  It looks like legacy will abort() in QEMU for the failed
> >>>>>> mapping and I'm planning to tighten vfio to also kill the VM for failed
> >>>>>> mappings.  In the short term, I think I'll ignore any mappings above
> >>>>>> TARGET_PHYS_ADDR_SPACE_BITS,
> >>>>> 
> >>>>> That seems very wrong. It will still fail on an x86 host if we are
> >>>>> emulating a CPU with full 64 bit addressing. The limitation is on the
> >>>>> host side; there's no real reason to tie it to the target.
> >>> 
> >>> I doubt vfio would be the only thing broken in that case.
> >>> 
> >>>>>> long term vfio already has an IOMMU info
> >>>>>> ioctl that we could use to return this information, but we'll need to
> >>>>>> figure out how to get it out of the IOMMU driver first.
> >>>>>> Thanks,
> >>>>>> 
> >>>>>> Alex
> >>>>> 
> >>>>> Short term, just assume 48 bits on x86.
> >>> 
> >>> I hate to pick an arbitrary value since we have a very specific mapping
> >>> we're trying to avoid.  Perhaps a better option is to skip anything
> >>> where:
> >>> 
> >>>       MemoryRegionSection.offset_within_address_space >
> >>>       ~MemoryRegionSection.offset_within_address_space
> >>> 
> >>>>> We need to figure out what's the limitation on ppc and arm -
> >>>>> maybe there's none and it can address full 64 bit range.
> >>>> 
> >>>> IIUC on PPC and ARM you always have BAR windows where things can get mapped into, unlike x86 where the full physical address range can be overlaid by BARs.
> >>>> 
> >>>> Or did I misunderstand the question?
> >>> 
> >>> Sounds right, if either BAR mappings outside the window will not be
> >>> realized in the memory space or the IOMMU has a full 64bit address
> >>> space, there's no problem.  Here we have an intermediate step in the BAR
> >>> sizing producing a stray mapping that the IOMMU hardware can't handle.
> >>> Even if we could handle it, it's not clear that we want to.  On AMD-Vi
> >>> the IOMMU page tables can grow to 6 levels deep.  A stray mapping like
> >>> this then causes space and time overhead until the tables are pruned
> >>> back down.  Thanks,
> >> 
> >> I thought sizing is hard-defined as a write of
> >> -1? Can't we check for that one special case and treat it as "not mapped, but tell the guest the size in config space"?
> >> 
> >> Alex
> > 
> > We already have a work-around like this and it works for 32 bit BARs
> > or after software writes the full 64 bit register:
> >    if (last_addr <= new_addr || new_addr == 0 ||
> >        last_addr == PCI_BAR_UNMAPPED) {
> >        return PCI_BAR_UNMAPPED;
> >    }
> > 
> >    if  (!(type & PCI_BASE_ADDRESS_MEM_TYPE_64) && last_addr >= UINT32_MAX) {
> >        return PCI_BAR_UNMAPPED;
> >    }
> > 
> > 
> > But for 64 bit BARs this software writes all 1's
> > in the high 32 bit register before writing in the low register
> > (see trace above).
> > This makes it impossible to distinguish between
> > setting bar at fffffffffebe0000 and this intermediate sizing step.
> 
> Well, at least according to the AMD manual there's only support for 52 bits of physical address space:
> 
> 	• Long Mode—This mode is unique to the AMD64 architecture. This mode supports up to 4 petabytes of physical-address space using 52-bit physical addresses.
> 
> Intel seems to agree:
> 
> 	• CPUID.80000008H:EAX[7:0] reports the physical-address width supported by the processor. (For processors that do not support CPUID function 80000008H, the width is generally 36 if CPUID.01H:EDX.PAE [bit 6] = 1 and 32 otherwise.) This width is referred to as MAXPHYADDR. MAXPHYADDR is at most 52.
> 
> Of course there's potential for future extensions to allow for more bits, but at least the current generation x86_64 (and x86) specification clearly only supports 52 bits of physical address space. And non-x86(_64) don't care about bigger address spaces either because they use BAR windows which are very unlikely to grow bigger than 52 bits ;).
> 
> 
> Alex

Yes, but that's from the CPU's point of view.
I think that devices can still access each other's BARs
using full 64 bit addresses.

-- 
MST

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2014-01-14  9:20                             ` Alexander Graf
  2014-01-14  9:31                               ` Peter Maydell
  2014-01-14 10:28                               ` Michael S. Tsirkin
@ 2014-01-14 10:43                               ` Michael S. Tsirkin
  2 siblings, 0 replies; 74+ messages in thread
From: Michael S. Tsirkin @ 2014-01-14 10:43 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Peter Maydell, Alexey Kardashevskiy, QEMU Developers,
	Luiz Capitulino, Alex Williamson, Paolo Bonzini, David Gibson

On Tue, Jan 14, 2014 at 10:20:57AM +0100, Alexander Graf wrote:
> 
> On 14.01.2014, at 09:18, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> > On Mon, Jan 13, 2014 at 10:48:21PM +0100, Alexander Graf wrote:
> >> 
> >> 
> >>> Am 13.01.2014 um 22:39 schrieb Alex Williamson <alex.williamson@redhat.com>:
> >>> 
> >>>> On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
> >>>>> On 12.01.2014, at 08:54, Michael S. Tsirkin <mst@redhat.com> wrote:
> >>>>> 
> >>>>>> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> >>>>>>> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> >>>>>>>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> >>>>>>>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> >>>>>>>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> >>>>>>>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> >>>>>>>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> >>>>>>>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> >>>>>>>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> >>>>>>>>>>>>>> From: Paolo Bonzini <pbonzini@redhat.com>
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
> >>>>>>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
> >>>>>>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
> >>>>>>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> >>>>>>>>>>>>>> consequently messing up the computations.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
> >>>>>>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> >>>>>>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
> >>>>>>>>>>>>>> address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> >>>>>>>>>>>>>> not 2^64.  But we get it anyway because phys_page_find ignores the upper
> >>>>>>>>>>>>>> bits of the physical address.  In address_space_translate_internal then
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>  diff = int128_sub(section->mr->size, int128_make64(addr));
> >>>>>>>>>>>>>>  *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> diff becomes negative, and int128_get64 booms.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> The size of the PCI address space region should be fixed anyway.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> >>>>>>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> >>>>>>>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >>>>>>>>>>>>>> ---
> >>>>>>>>>>>>>> exec.c | 8 ++------
> >>>>>>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> diff --git a/exec.c b/exec.c
> >>>>>>>>>>>>>> index 7e5ce93..f907f5f 100644
> >>>>>>>>>>>>>> --- a/exec.c
> >>>>>>>>>>>>>> +++ b/exec.c
> >>>>>>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
> >>>>>>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> /* Size of the L2 (and L3, etc) page tables.  */
> >>>>>>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> >>>>>>>>>>>>>> +#define ADDR_SPACE_BITS 64
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> #define P_L2_BITS 10
> >>>>>>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
> >>>>>>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> >>>>>>>>>>>>>> {
> >>>>>>>>>>>>>>   system_memory = g_malloc(sizeof(*system_memory));
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
> >>>>>>>>>>>>>> -
> >>>>>>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
> >>>>>>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
> >>>>>>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> >>>>>>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> >>>>>>>>>>>>>>   address_space_init(&address_space_memory, system_memory, "memory");
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>   system_io = g_malloc(sizeof(*system_io));
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
> >>>>>>>>>>>>> BARs that I'm not sure how to handle.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> BARs are often disabled during sizing. Maybe you
> >>>>>>>>>>>> don't detect BAR being disabled?
> >>>>>>>>>>> 
> >>>>>>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> >>>>>>>>>>> the sizing and memory region updates for the BARs, vfio is just a
> >>>>>>>>>>> pass-through here.
> >>>>>>>>>> 
> >>>>>>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
> >>>>>>>>>> while I/O & memory are enabled in the command register.  Thanks,
> >>>>>>>>>> 
> >>>>>>>>>> Alex
> >>>>>>>>> 
> >>>>>>>>> OK then from QEMU POV this BAR value is not special at all.
> >>>>>>>> 
> >>>>>>>> Unfortunately
> >>>>>>>> 
> >>>>>>>>>>>>> After this patch I get vfio
> >>>>>>>>>>>>> traces like this:
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> >>>>>>>>>>>>> (save lower 32bits of BAR)
> >>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> >>>>>>>>>>>>> (write mask to BAR)
> >>>>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> >>>>>>>>>>>>> (memory region gets unmapped)
> >>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> >>>>>>>>>>>>> (read size mask)
> >>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> >>>>>>>>>>>>> (restore BAR)
> >>>>>>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> >>>>>>>>>>>>> (memory region re-mapped)
> >>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> >>>>>>>>>>>>> (save upper 32bits of BAR)
> >>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> >>>>>>>>>>>>> (write mask to BAR)
> >>>>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> >>>>>>>>>>>>> (memory region gets unmapped)
> >>>>>>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> >>>>>>>>>>>>> (memory region gets re-mapped with new address)
> >>>>>>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> >>>>>>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
> >>>>>>>>>>>> 
> >>>>>>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
> >>>>>>>>>>> 
> >>>>>>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
> >>>>>>>>> 
> >>>>>>>>> Why can't you? Generally the memory core lets you find out easily.
> >>>>>>>> 
> >>>>>>>> My MemoryListener is setup for &address_space_memory and I then filter
> >>>>>>>> out anything that's not memory_region_is_ram().  This still gets
> >>>>>>>> through, so how do I easily find out?
> >>>>>>>> 
> >>>>>>>>> But in this case it's vfio device itself that is sized so for sure you
> >>>>>>>>> know it's MMIO.
> >>>>>>>> 
> >>>>>>>> How so?  I have a MemoryListener as described above and pass everything
> >>>>>>>> through to the IOMMU.  I suppose I could look through all the
> >>>>>>>> VFIODevices and check if the MemoryRegion matches, but that seems really
> >>>>>>>> ugly.
> >>>>>>>> 
> >>>>>>>>> Maybe you will have same issue if there's another device with a 64 bit
> >>>>>>>>> bar though, like ivshmem?
> >>>>>>>> 
> >>>>>>>> Perhaps, I suspect I'll see anything that registers their BAR
> >>>>>>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
> >>>>>>> 
> >>>>>>> Must be a 64 bit BAR to trigger the issue though.
> >>>>>>> 
> >>>>>>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
> >>>>>>>>>>> that we might be able to take advantage of with GPU passthrough.
> >>>>>>>>>>> 
> >>>>>>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
> >>>>>>>>>>>>> address, presumably because it was beyond the address space of the PCI
> >>>>>>>>>>>>> window.  This address is clearly not in a PCI MMIO space, so why are we
> >>>>>>>>>>>>> allowing it to be realized in the system address space at this location?
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> Alex
> >>>>>>>>>>>> 
> >>>>>>>>>>>> Why do you think it is not in PCI MMIO space?
> >>>>>>>>>>>> True, CPU can't access this address but other pci devices can.
> >>>>>>>>>>> 
> >>>>>>>>>>> What happens on real hardware when an address like this is programmed to
> >>>>>>>>>>> a device?  The CPU doesn't have the physical bits to access it.  I have
> >>>>>>>>>>> serious doubts that another PCI device would be able to access it
> >>>>>>>>>>> either.  Maybe in some limited scenario where the devices are on the
> >>>>>>>>>>> same conventional PCI bus.  In the typical case, PCI addresses are
> >>>>>>>>>>> always limited by some kind of aperture, whether that's explicit in
> >>>>>>>>>>> bridge windows or implicit in hardware design (and perhaps made explicit
> >>>>>>>>>>> in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> >>>>>>>>>>> would I do it in a way that still allows real 64bit MMIO to be
> >>>>>>>>>>> programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> >>>>>>>>>>> 
> >>>>>>>>>>> Alex
> >>>>>>>>> 
> >>>>>>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
> >>>>>>>>> full 64 bit addresses must be allowed and hardware validation
> >>>>>>>>> test suites normally check that it actually does work
> >>>>>>>>> if it happens.
> >>>>>>>> 
> >>>>>>>> Sure, PCI devices themselves, but the chipset typically has defined
> >>>>>>>> routing, that's more what I'm referring to.  There are generally only
> >>>>>>>> fixed address windows for RAM vs MMIO.
> >>>>>>> 
> >>>>>>> The physical chipset? Likely - in the presence of IOMMU.
> >>>>>>> Without that, devices can talk to each other without going
> >>>>>>> through chipset, and bridge spec is very explicit that
> >>>>>>> full 64 bit addressing must be supported.
> >>>>>>> 
> >>>>>>> So as long as we don't emulate an IOMMU,
> >>>>>>> guest will normally think it's okay to use any address.
> >>>>>>> 
> >>>>>>>>> Yes, if there's a bridge somewhere on the path that bridge's
> >>>>>>>>> windows would protect you, but pci already does this filtering:
> >>>>>>>>> if you see this address in the memory map this means
> >>>>>>>>> your virtual device is on root bus.
> >>>>>>>>> 
> >>>>>>>>> So I think it's the other way around: if VFIO requires specific
> >>>>>>>>> address ranges to be assigned to devices, it should give this
> >>>>>>>>> info to qemu and qemu can give this to guest.
> >>>>>>>>> Then anything outside that range can be ignored by VFIO.
> >>>>>>>> 
> >>>>>>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
> >>>>>>>> currently no way to find out the address width of the IOMMU.  We've been
> >>>>>>>> getting by because it's safely close enough to the CPU address width to
> >>>>>>>> not be a concern until we start exposing things at the top of the 64bit
> >>>>>>>> address space.  Maybe I can safely ignore anything above
> >>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> >>>>>>>> 
> >>>>>>>> Alex
> >>>>>>> 
> >>>>>>> I think it's not related to target CPU at all - it's a host limitation.
> >>>>>>> So just make up your own constant, maybe depending on host architecture.
> >>>>>>> Long term add an ioctl to query it.
> >>>>>> 
> >>>>>> It's a hardware limitation which I'd imagine has some loose ties to the
> >>>>>> physical address bits of the CPU.
> >>>>>> 
> >>>>>>> Also, we can add a fwcfg interface to tell bios that it should avoid
> >>>>>>> placing BARs above some address.
> >>>>>> 
> >>>>>> That doesn't help this case, it's a spurious mapping caused by sizing
> >>>>>> the BARs with them enabled.  We may still want such a thing to feed into
> >>>>>> building ACPI tables though.
> >>>>> 
> >>>>> Well the point is that if you want BIOS to avoid
> >>>>> specific addresses, you need to tell it what to avoid.
> >>>>> But neither BIOS nor ACPI actually cover the range above
> >>>>> 2^48 ATM so it's not a high priority.
> >>>>> 
> >>>>>>> Since it's a vfio limitation I think it should be a vfio API, along the
> >>>>>>> lines of vfio_get_addr_space_bits(void).
> >>>>>>> (Is this true btw? legacy assignment doesn't have this problem?)
> >>>>>> 
> >>>>>> It's an IOMMU hardware limitation, legacy assignment has the same
> >>>>>> problem.  It looks like legacy will abort() in QEMU for the failed
> >>>>>> mapping and I'm planning to tighten vfio to also kill the VM for failed
> >>>>>> mappings.  In the short term, I think I'll ignore any mappings above
> >>>>>> TARGET_PHYS_ADDR_SPACE_BITS,
> >>>>> 
> >>>>> That seems very wrong. It will still fail on an x86 host if we are
> >>>>> emulating a CPU with full 64 bit addressing. The limitation is on the
> >>>>> host side; there's no real reason to tie it to the target.
> >>> 
> >>> I doubt vfio would be the only thing broken in that case.
> >>> 
> >>>>>> long term vfio already has an IOMMU info
> >>>>>> ioctl that we could use to return this information, but we'll need to
> >>>>>> figure out how to get it out of the IOMMU driver first.
> >>>>>> Thanks,
> >>>>>> 
> >>>>>> Alex
> >>>>> 
> >>>>> Short term, just assume 48 bits on x86.
> >>> 
> >>> I hate to pick an arbitrary value since we have a very specific mapping
> >>> we're trying to avoid.  Perhaps a better option is to skip anything
> >>> where:
> >>> 
> >>>       MemoryRegionSection.offset_within_address_space >
> >>>       ~MemoryRegionSection.offset_within_address_space
> >>> 
> >>>>> We need to figure out what's the limitation on ppc and arm -
> >>>>> maybe there's none and it can address full 64 bit range.
> >>>> 
> >>>> IIUC on PPC and ARM you always have BAR windows where things can get mapped into, unlike x86 where the full physical address range can be overlaid by BARs.
> >>>> 
> >>>> Or did I misunderstand the question?
> >>> 
> >>> Sounds right, if either BAR mappings outside the window will not be
> >>> realized in the memory space or the IOMMU has a full 64bit address
> >>> space, there's no problem.  Here we have an intermediate step in the BAR
> >>> sizing producing a stray mapping that the IOMMU hardware can't handle.
> >>> Even if we could handle it, it's not clear that we want to.  On AMD-Vi
> >>> the IOMMU page tables can grow to 6 levels deep.  A stray mapping like
> >>> this then causes space and time overhead until the tables are pruned
> >>> back down.  Thanks,
> >> 
> >> I thought sizing is hard-defined as a write of
> >> -1? Can't we check for that one special case and treat it as "not mapped, but tell the guest the size in config space"?
> >> 
> >> Alex
> > 
> > We already have a work-around like this and it works for 32 bit BARs
> > or after software writes the full 64 bit register:
> >    if (last_addr <= new_addr || new_addr == 0 ||
> >        last_addr == PCI_BAR_UNMAPPED) {
> >        return PCI_BAR_UNMAPPED;
> >    }
> > 
> >    if  (!(type & PCI_BASE_ADDRESS_MEM_TYPE_64) && last_addr >= UINT32_MAX) {
> >        return PCI_BAR_UNMAPPED;
> >    }
> > 
> > 
> > But for 64 bit BARs this software writes all 1's
> > in the high 32 bit register before writing in the low register
> > (see trace above).
> > This makes it impossible to distinguish between
> > setting bar at fffffffffebe0000 and this intermediate sizing step.
> 
> Well, at least according to the AMD manual there's only support for 52 bits of physical address space:
> 
> 	• Long Mode—This mode is unique to the AMD64 architecture. This mode supports up to 4 petabytes of physical-address space using 52-bit physical addresses.
> 
> Intel seems to agree:
> 
> 	• CPUID.80000008H:EAX[7:0] reports the physical-address width supported by the processor. (For processors that do not support CPUID function 80000008H, the width is generally 36 if CPUID.01H:EDX.PAE [bit 6] = 1 and 32 otherwise.) This width is referred to as MAXPHYADDR. MAXPHYADDR is at most 52.
> 
> Of course there's potential for future extensions to allow for more bits, but at least the current-generation x86_64 (and x86) specification clearly supports only 52 bits of physical address space. And non-x86(_64) platforms don't care about bigger address spaces either, because they use BAR windows which are very unlikely to grow beyond 52 bits ;).
> 
> 
> Alex


I guess we could limit PCI memory to 52 bits easily enough.
But Alex here says the IOMMU is limited to 48 bits, so that
still won't be good enough in all cases, even if it helps
in this specific case.

We really need to figure out the host limitations
by querying vfio.
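
As a rough sketch of what that could look like (vfio_get_addr_space_bits()
is hypothetical here, e.g. backed by a future extension of the IOMMU info
ioctl; the MemoryRegionSection fields are the ones the listener already
sees):

    #include "exec/memory.h"

    static bool vfio_listener_skip_section(MemoryRegionSection *section,
                                           unsigned host_iommu_bits)
    {
        hwaddr start = section->offset_within_address_space;
        hwaddr last = start + int128_get64(section->size) - 1;

        if (host_iommu_bits >= 64) {
            return false;           /* host can translate the full range */
        }
        /* Skip anything the host IOMMU could not translate anyway. */
        return (last >> host_iommu_bits) != 0;
    }

With a 48-bit answer from the host this drops the stray
fffffffffebe0000 sizing map while still passing through real 64 bit
BARs placed below 2^48.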

-- 
MST

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2014-01-14 10:24                             ` Avi Kivity
@ 2014-01-14 11:50                               ` Michael S. Tsirkin
  2014-01-14 15:36                               ` Alex Williamson
  1 sibling, 0 replies; 74+ messages in thread
From: Michael S. Tsirkin @ 2014-01-14 11:50 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Peter Maydell, QEMU Developers, Alexey Kardashevskiy,
	Alexander Graf, Luiz Capitulino, Alex Williamson, Paolo Bonzini,
	David Gibson

On Tue, Jan 14, 2014 at 12:24:24PM +0200, Avi Kivity wrote:
> On 01/14/2014 12:48 AM, Alex Williamson wrote:
> >On Mon, 2014-01-13 at 22:48 +0100, Alexander Graf wrote:
> >>>Am 13.01.2014 um 22:39 schrieb Alex Williamson <alex.williamson@redhat.com>:
> >>>
> >>>>On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
> >>>>>On 12.01.2014, at 08:54, Michael S. Tsirkin <mst@redhat.com> wrote:
> >>>>>
> >>>>>>On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> >>>>>>>On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> >>>>>>>>On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> >>>>>>>>>On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> >>>>>>>>>>On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> >>>>>>>>>>>On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> >>>>>>>>>>>>On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> >>>>>>>>>>>>>On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> >>>>>>>>>>>>>>On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> >>>>>>>>>>>>>>From: Paolo Bonzini <pbonzini@redhat.com>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>As an alternative to commit 818f86b (exec: limit system memory
> >>>>>>>>>>>>>>size, 2013-11-04) let's just make all address spaces 64-bit wide.
> >>>>>>>>>>>>>>This eliminates problems with phys_page_find ignoring bits above
> >>>>>>>>>>>>>>TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> >>>>>>>>>>>>>>consequently messing up the computations.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>In Luiz's reported crash, at startup gdb attempts to read from address
> >>>>>>>>>>>>>>0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> >>>>>>>>>>>>>>is the newly introduced master abort region, which is as big as the PCI
> >>>>>>>>>>>>>>address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> >>>>>>>>>>>>>>not 2^64.  But we get it anyway because phys_page_find ignores the upper
> >>>>>>>>>>>>>>bits of the physical address.  In address_space_translate_internal then
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>   diff = int128_sub(section->mr->size, int128_make64(addr));
> >>>>>>>>>>>>>>   *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>diff becomes negative, and int128_get64 booms.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>The size of the PCI address space region should be fixed anyway.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> >>>>>>>>>>>>>>Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> >>>>>>>>>>>>>>Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >>>>>>>>>>>>>>---
> >>>>>>>>>>>>>>exec.c | 8 ++------
> >>>>>>>>>>>>>>1 file changed, 2 insertions(+), 6 deletions(-)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>diff --git a/exec.c b/exec.c
> >>>>>>>>>>>>>>index 7e5ce93..f907f5f 100644
> >>>>>>>>>>>>>>--- a/exec.c
> >>>>>>>>>>>>>>+++ b/exec.c
> >>>>>>>>>>>>>>@@ -94,7 +94,7 @@ struct PhysPageEntry {
> >>>>>>>>>>>>>>#define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>/* Size of the L2 (and L3, etc) page tables.  */
> >>>>>>>>>>>>>>-#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> >>>>>>>>>>>>>>+#define ADDR_SPACE_BITS 64
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>#define P_L2_BITS 10
> >>>>>>>>>>>>>>#define P_L2_SIZE (1 << P_L2_BITS)
> >>>>>>>>>>>>>>@@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> >>>>>>>>>>>>>>{
> >>>>>>>>>>>>>>    system_memory = g_malloc(sizeof(*system_memory));
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>-    assert(ADDR_SPACE_BITS <= 64);
> >>>>>>>>>>>>>>-
> >>>>>>>>>>>>>>-    memory_region_init(system_memory, NULL, "system",
> >>>>>>>>>>>>>>-                       ADDR_SPACE_BITS == 64 ?
> >>>>>>>>>>>>>>-                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> >>>>>>>>>>>>>>+    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> >>>>>>>>>>>>>>    address_space_init(&address_space_memory, system_memory, "memory");
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>    system_io = g_malloc(sizeof(*system_io));
> >>>>>>>>>>>>>This seems to have some unexpected consequences around sizing 64bit PCI
> >>>>>>>>>>>>>BARs that I'm not sure how to handle.
> >>>>>>>>>>>>BARs are often disabled during sizing. Maybe you
> >>>>>>>>>>>>don't detect BAR being disabled?
> >>>>>>>>>>>See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> >>>>>>>>>>>the sizing and memory region updates for the BARs, vfio is just a
> >>>>>>>>>>>pass-through here.
> >>>>>>>>>>Sorry, not in the trace below, but yes the sizing seems to be happening
> >>>>>>>>>>while I/O & memory are enabled in the command register.  Thanks,
> >>>>>>>>>>
> >>>>>>>>>>Alex
> >>>>>>>>>OK then from QEMU POV this BAR value is not special at all.
> >>>>>>>>Unfortunately
> >>>>>>>>
> >>>>>>>>>>>>>After this patch I get vfio
> >>>>>>>>>>>>>traces like this:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> >>>>>>>>>>>>>(save lower 32bits of BAR)
> >>>>>>>>>>>>>vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> >>>>>>>>>>>>>(write mask to BAR)
> >>>>>>>>>>>>>vfio: region_del febe0000 - febe3fff
> >>>>>>>>>>>>>(memory region gets unmapped)
> >>>>>>>>>>>>>vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> >>>>>>>>>>>>>(read size mask)
> >>>>>>>>>>>>>vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> >>>>>>>>>>>>>(restore BAR)
> >>>>>>>>>>>>>vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> >>>>>>>>>>>>>(memory region re-mapped)
> >>>>>>>>>>>>>vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> >>>>>>>>>>>>>(save upper 32bits of BAR)
> >>>>>>>>>>>>>vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> >>>>>>>>>>>>>(write mask to BAR)
> >>>>>>>>>>>>>vfio: region_del febe0000 - febe3fff
> >>>>>>>>>>>>>(memory region gets unmapped)
> >>>>>>>>>>>>>vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> >>>>>>>>>>>>>(memory region gets re-mapped with new address)
> >>>>>>>>>>>>>qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> >>>>>>>>>>>>>(iommu barfs because it can only handle 48bit physical addresses)
> >>>>>>>>>>>>Why are you trying to program BAR addresses for dma in the iommu?
> >>>>>>>>>>>Two reasons, first I can't tell the difference between RAM and MMIO.
> >>>>>>>>>Why can't you? Generally memory core let you find out easily.
> >>>>>>>>My MemoryListener is setup for &address_space_memory and I then filter
> >>>>>>>>out anything that's not memory_region_is_ram().  This still gets
> >>>>>>>>through, so how do I easily find out?
> >>>>>>>>
> >>>>>>>>>But in this case it's vfio device itself that is sized so for sure you
> >>>>>>>>>know it's MMIO.
> >>>>>>>>How so?  I have a MemoryListener as described above and pass everything
> >>>>>>>>through to the IOMMU.  I suppose I could look through all the
> >>>>>>>>VFIODevices and check if the MemoryRegion matches, but that seems really
> >>>>>>>>ugly.
> >>>>>>>>
> >>>>>>>>>Maybe you will have same issue if there's another device with a 64 bit
> >>>>>>>>>bar though, like ivshmem?
> >>>>>>>>Perhaps, I suspect I'll see anything that registers their BAR
> >>>>>>>>MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
> >>>>>>>Must be a 64 bit BAR to trigger the issue though.
> >>>>>>>
> >>>>>>>>>>>Second, it enables peer-to-peer DMA between devices, which is something
> >>>>>>>>>>>that we might be able to take advantage of with GPU passthrough.
> >>>>>>>>>>>
> >>>>>>>>>>>>>Prior to this change, there was no re-map with the fffffffffebe0000
> >>>>>>>>>>>>>address, presumably because it was beyond the address space of the PCI
> >>>>>>>>>>>>>window.  This address is clearly not in a PCI MMIO space, so why are we
> >>>>>>>>>>>>>allowing it to be realized in the system address space at this location?
> >>>>>>>>>>>>>Thanks,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>Alex
> >>>>>>>>>>>>Why do you think it is not in PCI MMIO space?
> >>>>>>>>>>>>True, CPU can't access this address but other pci devices can.
> >>>>>>>>>>>What happens on real hardware when an address like this is programmed to
> >>>>>>>>>>>a device?  The CPU doesn't have the physical bits to access it.  I have
> >>>>>>>>>>>serious doubts that another PCI device would be able to access it
> >>>>>>>>>>>either.  Maybe in some limited scenario where the devices are on the
> >>>>>>>>>>>same conventional PCI bus.  In the typical case, PCI addresses are
> >>>>>>>>>>>always limited by some kind of aperture, whether that's explicit in
> >>>>>>>>>>>bridge windows or implicit in hardware design (and perhaps made explicit
> >>>>>>>>>>>in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> >>>>>>>>>>>would I do it in a way that still allows real 64bit MMIO to be
> >>>>>>>>>>>programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> >>>>>>>>>>>
> >>>>>>>>>>>Alex
> >>>>>>>>>AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
> >>>>>>>>>full 64 bit addresses must be allowed and hardware validation
> >>>>>>>>>test suites normally check that it actually does work
> >>>>>>>>>if it happens.
> >>>>>>>>Sure, PCI devices themselves, but the chipset typically has defined
> >>>>>>>>routing, that's more what I'm referring to.  There are generally only
> >>>>>>>>fixed address windows for RAM vs MMIO.
> >>>>>>>The physical chipset? Likely - in the presence of IOMMU.
> >>>>>>>Without that, devices can talk to each other without going
> >>>>>>>through chipset, and bridge spec is very explicit that
> >>>>>>>full 64 bit addressing must be supported.
> >>>>>>>
> >>>>>>>So as long as we don't emulate an IOMMU,
> >>>>>>>guest will normally think it's okay to use any address.
> >>>>>>>
> >>>>>>>>>Yes, if there's a bridge somewhere on the path that bridge's
> >>>>>>>>>windows would protect you, but pci already does this filtering:
> >>>>>>>>>if you see this address in the memory map this means
> >>>>>>>>>your virtual device is on root bus.
> >>>>>>>>>
> >>>>>>>>>So I think it's the other way around: if VFIO requires specific
> >>>>>>>>>address ranges to be assigned to devices, it should give this
> >>>>>>>>>info to qemu and qemu can give this to guest.
> >>>>>>>>>Then anything outside that range can be ignored by VFIO.
> >>>>>>>>Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
> >>>>>>>>currently no way to find out the address width of the IOMMU.  We've been
> >>>>>>>>getting by because it's safely close enough to the CPU address width to
> >>>>>>>>not be a concern until we start exposing things at the top of the 64bit
> >>>>>>>>address space.  Maybe I can safely ignore anything above
> >>>>>>>>TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> >>>>>>>>
> >>>>>>>>Alex
> >>>>>>>I think it's not related to target CPU at all - it's a host limitation.
> >>>>>>>So just make up your own constant, maybe depending on host architecture.
> >>>>>>>Long term add an ioctl to query it.
> >>>>>>It's a hardware limitation which I'd imagine has some loose ties to the
> >>>>>>physical address bits of the CPU.
> >>>>>>
> >>>>>>>Also, we can add a fwcfg interface to tell bios that it should avoid
> >>>>>>>placing BARs above some address.
> >>>>>>That doesn't help this case, it's a spurious mapping caused by sizing
> >>>>>>the BARs with them enabled.  We may still want such a thing to feed into
> >>>>>>building ACPI tables though.
> >>>>>Well the point is that if you want BIOS to avoid
> >>>>>specific addresses, you need to tell it what to avoid.
> >>>>>But neither BIOS nor ACPI actually cover the range above
> >>>>>2^48 ATM so it's not a high priority.
> >>>>>
> >>>>>>>Since it's a vfio limitation I think it should be a vfio API, along the
> >>>>>>>lines of vfio_get_addr_space_bits(void).
> >>>>>>>(Is this true btw? legacy assignment doesn't have this problem?)
> >>>>>>It's an IOMMU hardware limitation, legacy assignment has the same
> >>>>>>problem.  It looks like legacy will abort() in QEMU for the failed
> >>>>>>mapping and I'm planning to tighten vfio to also kill the VM for failed
> >>>>>>mappings.  In the short term, I think I'll ignore any mappings above
> >>>>>>TARGET_PHYS_ADDR_SPACE_BITS,
> >>>>>That seems very wrong. It will still fail on an x86 host if we are
> >>>>>emulating a CPU with full 64 bit addressing. The limitation is on the
> >>>>>host side; there's no real reason to tie it to the target.
> >>>I doubt vfio would be the only thing broken in that case.
> >>>
> >>>>>>long term vfio already has an IOMMU info
> >>>>>>ioctl that we could use to return this information, but we'll need to
> >>>>>>figure out how to get it out of the IOMMU driver first.
> >>>>>>Thanks,
> >>>>>>
> >>>>>>Alex
> >>>>>Short term, just assume 48 bits on x86.
> >>>I hate to pick an arbitrary value since we have a very specific mapping
> >>>we're trying to avoid.  Perhaps a better option is to skip anything
> >>>where:
> >>>
> >>>        MemoryRegionSection.offset_within_address_space >
> >>>        ~MemoryRegionSection.offset_within_address_space
> >>>
> >>>>>We need to figure out what's the limitation on ppc and arm -
> >>>>>maybe there's none and it can address full 64 bit range.
> >>>>IIUC on PPC and ARM you always have BAR windows where things can get mapped into. Unlike x86 where the full physical address range can be overlaid by BARs.
> >>>>
> >>>>Or did I misunderstand the question?
> >>>Sounds right, if either BAR mappings outside the window will not be
> >>>realized in the memory space or the IOMMU has a full 64bit address
> >>>space, there's no problem.  Here we have an intermediate step in the BAR
> >>>sizing producing a stray mapping that the IOMMU hardware can't handle.
> >>>Even if we could handle it, it's not clear that we want to.  On AMD-Vi
> >>>the IOMMU page tables can grow to 6 levels deep.  A stray mapping like
> >>>this then causes space and time overhead until the tables are pruned
> >>>back down.  Thanks,
> >>I thought sizing is hard-defined as setting the BAR to
> >>-1? Can't we check for that one special case and treat it as "not mapped, but tell the guest the size in config space"?
> >PCI doesn't want to handle this as anything special to differentiate a
> >sizing mask from a valid BAR address.  I agree though, I'd prefer to
> >never see a spurious address like this in my MemoryListener.
> >
> >
> 
> Can't you just ignore regions that cannot be mapped?  Oh, and teach
> the bios and/or linux to disable memory access while sizing.


I know Linux won't disable memory access while sizing, because
there are some broken devices where you can't re-enable it afterwards.

It should be harmless to set a BAR to any silly value as long
as you are careful not to access it.
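
To make the sequence concrete, here is a stand-alone toy of the sizing
steps from the trace (cfg_read32/cfg_write32 and the 16KB-BAR model are
made up for illustration, this is not QEMU or guest code):

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Toy config space: offsets 0x10/0x14 hold one 64 bit memory BAR. */
    static uint32_t cfg[0x40 / 4] = { [0x10 / 4] = 0xfebe0004 };

    static uint32_t cfg_read32(unsigned off)
    {
        return cfg[off / 4];
    }

    static void cfg_write32(unsigned off, uint32_t val)
    {
        if (off == 0x10) {
            /* 16KB BAR: address bits 13:4 hardwired to 0, bits 3:0 are type. */
            cfg[off / 4] = (val & 0xffffc000) | 0x4;
        } else {
            cfg[off / 4] = val;
        }
        printf("BAR now decodes at 0x%016" PRIx64 "\n",
               ((uint64_t)cfg[0x14 / 4] << 32) | (cfg[0x10 / 4] & ~0xfu));
    }

    int main(void)
    {
        uint32_t lo = cfg_read32(0x10); /* save lower half: 0xfebe0004 */
        cfg_write32(0x10, 0xffffffff);  /* size lower half, reads back 0xffffc004 */
        cfg_write32(0x10, lo);          /* restore lower half */

        uint32_t hi = cfg_read32(0x14); /* save upper half: 0 */
        cfg_write32(0x14, 0xffffffff);  /* BAR transiently at 0xfffffffffebe0000 */
        cfg_write32(0x14, hi);          /* restore upper half */
        return 0;
    }

The second-to-last write is the step that, with memory decode left
enabled, momentarily puts the BAR at fffffffffebe0000 and triggers the
region_add/vfio_dma_map failure in the trace.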


-- 
MST

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2014-01-13 22:48                           ` Alex Williamson
  2014-01-14 10:24                             ` Avi Kivity
@ 2014-01-14 12:07                             ` Michael S. Tsirkin
  2014-01-14 15:57                               ` Alex Williamson
  1 sibling, 1 reply; 74+ messages in thread
From: Michael S. Tsirkin @ 2014-01-14 12:07 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Peter Maydell, QEMU Developers, Alexey Kardashevskiy, Jason Wang,
	Alexander Graf, Luiz Capitulino, Paolo Bonzini, David Gibson

On Mon, Jan 13, 2014 at 03:48:11PM -0700, Alex Williamson wrote:
> On Mon, 2014-01-13 at 22:48 +0100, Alexander Graf wrote:
> > 
> > > Am 13.01.2014 um 22:39 schrieb Alex Williamson <alex.williamson@redhat.com>:
> > > 
> > >> On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
> > >>> On 12.01.2014, at 08:54, Michael S. Tsirkin <mst@redhat.com> wrote:
> > >>> 
> > >>>> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> > >>>>> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> > >>>>>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> > >>>>>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> > >>>>>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> > >>>>>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> > >>>>>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> > >>>>>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> > >>>>>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > >>>>>>>>>>>> From: Paolo Bonzini <pbonzini@redhat.com>
> > >>>>>>>>>>>> 
> > >>>>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
> > >>>>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
> > >>>>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
> > >>>>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> > >>>>>>>>>>>> consequently messing up the computations.
> > >>>>>>>>>>>> 
> > >>>>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
> > >>>>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> > >>>>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
> > >>>>>>>>>>>> address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> > >>>>>>>>>>>> not 2^64.  But we get it anyway because phys_page_find ignores the upper
> > >>>>>>>>>>>> bits of the physical address.  In address_space_translate_internal then
> > >>>>>>>>>>>> 
> > >>>>>>>>>>>>   diff = int128_sub(section->mr->size, int128_make64(addr));
> > >>>>>>>>>>>>   *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> > >>>>>>>>>>>> 
> > >>>>>>>>>>>> diff becomes negative, and int128_get64 booms.
> > >>>>>>>>>>>> 
> > >>>>>>>>>>>> The size of the PCI address space region should be fixed anyway.
> > >>>>>>>>>>>> 
> > >>>>>>>>>>>> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> > >>>>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > >>>>>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > >>>>>>>>>>>> ---
> > >>>>>>>>>>>> exec.c | 8 ++------
> > >>>>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
> > >>>>>>>>>>>> 
> > >>>>>>>>>>>> diff --git a/exec.c b/exec.c
> > >>>>>>>>>>>> index 7e5ce93..f907f5f 100644
> > >>>>>>>>>>>> --- a/exec.c
> > >>>>>>>>>>>> +++ b/exec.c
> > >>>>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
> > >>>>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> > >>>>>>>>>>>> 
> > >>>>>>>>>>>> /* Size of the L2 (and L3, etc) page tables.  */
> > >>>>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> > >>>>>>>>>>>> +#define ADDR_SPACE_BITS 64
> > >>>>>>>>>>>> 
> > >>>>>>>>>>>> #define P_L2_BITS 10
> > >>>>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
> > >>>>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> > >>>>>>>>>>>> {
> > >>>>>>>>>>>>    system_memory = g_malloc(sizeof(*system_memory));
> > >>>>>>>>>>>> 
> > >>>>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
> > >>>>>>>>>>>> -
> > >>>>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
> > >>>>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
> > >>>>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> > >>>>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> > >>>>>>>>>>>>    address_space_init(&address_space_memory, system_memory, "memory");
> > >>>>>>>>>>>> 
> > >>>>>>>>>>>>    system_io = g_malloc(sizeof(*system_io));
> > >>>>>>>>>>> 
> > >>>>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
> > >>>>>>>>>>> BARs that I'm not sure how to handle.
> > >>>>>>>>>> 
> > >>>>>>>>>> BARs are often disabled during sizing. Maybe you
> > >>>>>>>>>> don't detect BAR being disabled?
> > >>>>>>>>> 
> > >>>>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> > >>>>>>>>> the sizing and memory region updates for the BARs, vfio is just a
> > >>>>>>>>> pass-through here.
> > >>>>>>>> 
> > >>>>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
> > >>>>>>>> while I/O & memory are enabled in the command register.  Thanks,
> > >>>>>>>> 
> > >>>>>>>> Alex
> > >>>>>>> 
> > >>>>>>> OK then from QEMU POV this BAR value is not special at all.
> > >>>>>> 
> > >>>>>> Unfortunately
> > >>>>>> 
> > >>>>>>>>>>> After this patch I get vfio
> > >>>>>>>>>>> traces like this:
> > >>>>>>>>>>> 
> > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > >>>>>>>>>>> (save lower 32bits of BAR)
> > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > >>>>>>>>>>> (write mask to BAR)
> > >>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> > >>>>>>>>>>> (memory region gets unmapped)
> > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > >>>>>>>>>>> (read size mask)
> > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > >>>>>>>>>>> (restore BAR)
> > >>>>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > >>>>>>>>>>> (memory region re-mapped)
> > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > >>>>>>>>>>> (save upper 32bits of BAR)
> > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > >>>>>>>>>>> (write mask to BAR)
> > >>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> > >>>>>>>>>>> (memory region gets unmapped)
> > >>>>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > >>>>>>>>>>> (memory region gets re-mapped with new address)
> > >>>>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > >>>>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
> > >>>>>>>>>> 
> > >>>>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
> > >>>>>>>>> 
> > >>>>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
> > >>>>>>> 
> > >>>>>>> Why can't you? Generally memory core let you find out easily.
> > >>>>>> 
> > >>>>>> My MemoryListener is setup for &address_space_memory and I then filter
> > >>>>>> out anything that's not memory_region_is_ram().  This still gets
> > >>>>>> through, so how do I easily find out?
> > >>>>>> 
> > >>>>>>> But in this case it's vfio device itself that is sized so for sure you
> > >>>>>>> know it's MMIO.
> > >>>>>> 
> > >>>>>> How so?  I have a MemoryListener as described above and pass everything
> > >>>>>> through to the IOMMU.  I suppose I could look through all the
> > >>>>>> VFIODevices and check if the MemoryRegion matches, but that seems really
> > >>>>>> ugly.
> > >>>>>> 
> > >>>>>>> Maybe you will have same issue if there's another device with a 64 bit
> > >>>>>>> bar though, like ivshmem?
> > >>>>>> 
> > >>>>>> Perhaps, I suspect I'll see anything that registers their BAR
> > >>>>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
> > >>>>> 
> > >>>>> Must be a 64 bit BAR to trigger the issue though.
> > >>>>> 
> > >>>>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
> > >>>>>>>>> that we might be able to take advantage of with GPU passthrough.
> > >>>>>>>>> 
> > >>>>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
> > >>>>>>>>>>> address, presumably because it was beyond the address space of the PCI
> > >>>>>>>>>>> window.  This address is clearly not in a PCI MMIO space, so why are we
> > >>>>>>>>>>> allowing it to be realized in the system address space at this location?
> > >>>>>>>>>>> Thanks,
> > >>>>>>>>>>> 
> > >>>>>>>>>>> Alex
> > >>>>>>>>>> 
> > >>>>>>>>>> Why do you think it is not in PCI MMIO space?
> > >>>>>>>>>> True, CPU can't access this address but other pci devices can.
> > >>>>>>>>> 
> > >>>>>>>>> What happens on real hardware when an address like this is programmed to
> > >>>>>>>>> a device?  The CPU doesn't have the physical bits to access it.  I have
> > >>>>>>>>> serious doubts that another PCI device would be able to access it
> > >>>>>>>>> either.  Maybe in some limited scenario where the devices are on the
> > >>>>>>>>> same conventional PCI bus.  In the typical case, PCI addresses are
> > >>>>>>>>> always limited by some kind of aperture, whether that's explicit in
> > >>>>>>>>> bridge windows or implicit in hardware design (and perhaps made explicit
> > >>>>>>>>> in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> > >>>>>>>>> would I do it in a way that still allows real 64bit MMIO to be
> > >>>>>>>>> programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> > >>>>>>>>> 
> > >>>>>>>>> Alex
> > >>>>>>> 
> > >>>>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
> > >>>>>>> full 64 bit addresses must be allowed and hardware validation
> > >>>>>>> test suites normally check that it actually does work
> > >>>>>>> if it happens.
> > >>>>>> 
> > >>>>>> Sure, PCI devices themselves, but the chipset typically has defined
> > >>>>>> routing, that's more what I'm referring to.  There are generally only
> > >>>>>> fixed address windows for RAM vs MMIO.
> > >>>>> 
> > >>>>> The physical chipset? Likely - in the presence of IOMMU.
> > >>>>> Without that, devices can talk to each other without going
> > >>>>> through chipset, and bridge spec is very explicit that
> > >>>>> full 64 bit addressing must be supported.
> > >>>>> 
> > >>>>> So as long as we don't emulate an IOMMU,
> > >>>>> guest will normally think it's okay to use any address.
> > >>>>> 
> > >>>>>>> Yes, if there's a bridge somewhere on the path that bridge's
> > >>>>>>> windows would protect you, but pci already does this filtering:
> > >>>>>>> if you see this address in the memory map this means
> > >>>>>>> your virtual device is on root bus.
> > >>>>>>> 
> > >>>>>>> So I think it's the other way around: if VFIO requires specific
> > >>>>>>> address ranges to be assigned to devices, it should give this
> > >>>>>>> info to qemu and qemu can give this to guest.
> > >>>>>>> Then anything outside that range can be ignored by VFIO.
> > >>>>>> 
> > >>>>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
> > >>>>>> currently no way to find out the address width of the IOMMU.  We've been
> > >>>>>> getting by because it's safely close enough to the CPU address width to
> > >>>>>> not be a concern until we start exposing things at the top of the 64bit
> > >>>>>> address space.  Maybe I can safely ignore anything above
> > >>>>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> > >>>>>> 
> > >>>>>> Alex
> > >>>>> 
> > >>>>> I think it's not related to target CPU at all - it's a host limitation.
> > >>>>> So just make up your own constant, maybe depending on host architecture.
> > >>>>> Long term add an ioctl to query it.
> > >>>> 
> > >>>> It's a hardware limitation which I'd imagine has some loose ties to the
> > >>>> physical address bits of the CPU.
> > >>>> 
> > >>>>> Also, we can add a fwcfg interface to tell bios that it should avoid
> > >>>>> placing BARs above some address.
> > >>>> 
> > >>>> That doesn't help this case, it's a spurious mapping caused by sizing
> > >>>> the BARs with them enabled.  We may still want such a thing to feed into
> > >>>> building ACPI tables though.
> > >>> 
> > >>> Well the point is that if you want BIOS to avoid
> > >>> specific addresses, you need to tell it what to avoid.
> > >>> But neither BIOS nor ACPI actually cover the range above
> > >>> 2^48 ATM so it's not a high priority.
> > >>> 
> > >>>>> Since it's a vfio limitation I think it should be a vfio API, along the
> > >>>>> lines of vfio_get_addr_space_bits(void).
> > >>>>> (Is this true btw? legacy assignment doesn't have this problem?)
> > >>>> 
> > >>>> It's an IOMMU hardware limitation, legacy assignment has the same
> > >>>> problem.  It looks like legacy will abort() in QEMU for the failed
> > >>>> mapping and I'm planning to tighten vfio to also kill the VM for failed
> > >>>> mappings.  In the short term, I think I'll ignore any mappings above
> > >>>> TARGET_PHYS_ADDR_SPACE_BITS,
> > >>> 
> > >>> That seems very wrong. It will still fail on an x86 host if we are
> > >>> emulating a CPU with full 64 bit addressing. The limitation is on the
> > >>> host side; there's no real reason to tie it to the target.
> > > 
> > > I doubt vfio would be the only thing broken in that case.
> > > 
> > >>>> long term vfio already has an IOMMU info
> > >>>> ioctl that we could use to return this information, but we'll need to
> > >>>> figure out how to get it out of the IOMMU driver first.
> > >>>> Thanks,
> > >>>> 
> > >>>> Alex
> > >>> 
> > >>> Short term, just assume 48 bits on x86.
> > > 
> > > I hate to pick an arbitrary value since we have a very specific mapping
> > > we're trying to avoid.  Perhaps a better option is to skip anything
> > > where:
> > > 
> > >        MemoryRegionSection.offset_within_address_space >
> > >        ~MemoryRegionSection.offset_within_address_space
> > > 
> > >>> We need to figure out what's the limitation on ppc and arm -
> > >>> maybe there's none and it can address full 64 bit range.
> > >> 
> > >> IIUC on PPC and ARM you always have BAR windows where things can get mapped into. Unlike x86 where the full physical address range can be overlaid by BARs.
> > >> 
> > >> Or did I misunderstand the question?
> > > 
> > > Sounds right, if either BAR mappings outside the window will not be
> > > realized in the memory space or the IOMMU has a full 64bit address
> > > space, there's no problem.  Here we have an intermediate step in the BAR
> > > sizing producing a stray mapping that the IOMMU hardware can't handle.
> > > Even if we could handle it, it's not clear that we want to.  On AMD-Vi
> > > the IOMMU page tables can grow to 6 levels deep.  A stray mapping like
> > > this then causes space and time overhead until the tables are pruned
> > > back down.  Thanks,
> > 
> > I thought sizing is hard-defined as setting the BAR to
> > -1? Can't we check for that one special case and treat it as "not mapped, but tell the guest the size in config space"?
> 
> PCI doesn't want to handle this as anything special to differentiate a
> sizing mask from a valid BAR address.  I agree though, I'd prefer to
> never see a spurious address like this in my MemoryListener.

It's more a "can't" than a "doesn't want to": it's a 64 bit BAR, so it is
not set to all-ones atomically.

Also, while it doesn't address this fully (the same issue can happen
e.g. with ivshmem), do you think we should somehow distinguish BARs
mapped from vfio / device assignment in qemu?

In particular, even when the BAR has sane addresses:
a device really cannot DMA into its own BAR; that's a spec violation,
so in theory anything can happen, including crashing the system.
I don't know what happens in practice, but
if you are programming the IOMMU to forward transactions back to
the device that originated them, you are not doing it any favors.

I also note that if someone tries zero-copy transmit out of such an
address, get_user_pages will fail.
I think this means tun zero-copy transmit needs to fall back
to copy-from-user when get_user_pages fails.

Jason, what's your thinking on this?

-- 
MST

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2014-01-13 21:39                       ` Alex Williamson
  2014-01-13 21:48                         ` Alexander Graf
@ 2014-01-14 12:21                         ` Michael S. Tsirkin
  2014-01-14 15:49                           ` Alex Williamson
  1 sibling, 1 reply; 74+ messages in thread
From: Michael S. Tsirkin @ 2014-01-14 12:21 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Peter Maydell, QEMU Developers, Alexey Kardashevskiy,
	Alexander Graf, Luiz Capitulino, Paolo Bonzini, David Gibson

On Mon, Jan 13, 2014 at 02:39:04PM -0700, Alex Williamson wrote:
> On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
> > On 12.01.2014, at 08:54, Michael S. Tsirkin <mst@redhat.com> wrote:
> > 
> > > On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> > >> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> > >>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> > >>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> > >>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> > >>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> > >>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> > >>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> > >>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > >>>>>>>>>> From: Paolo Bonzini <pbonzini@redhat.com>
> > >>>>>>>>>> 
> > >>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
> > >>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
> > >>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
> > >>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> > >>>>>>>>>> consequently messing up the computations.
> > >>>>>>>>>> 
> > >>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
> > >>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> > >>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
> > >>>>>>>>>> address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> > >>>>>>>>>> not 2^64.  But we get it anyway because phys_page_find ignores the upper
> > >>>>>>>>>> bits of the physical address.  In address_space_translate_internal then
> > >>>>>>>>>> 
> > >>>>>>>>>>    diff = int128_sub(section->mr->size, int128_make64(addr));
> > >>>>>>>>>>    *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> > >>>>>>>>>> 
> > >>>>>>>>>> diff becomes negative, and int128_get64 booms.
> > >>>>>>>>>> 
> > >>>>>>>>>> The size of the PCI address space region should be fixed anyway.
> > >>>>>>>>>> 
> > >>>>>>>>>> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> > >>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > >>>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > >>>>>>>>>> ---
> > >>>>>>>>>> exec.c | 8 ++------
> > >>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
> > >>>>>>>>>> 
> > >>>>>>>>>> diff --git a/exec.c b/exec.c
> > >>>>>>>>>> index 7e5ce93..f907f5f 100644
> > >>>>>>>>>> --- a/exec.c
> > >>>>>>>>>> +++ b/exec.c
> > >>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
> > >>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> > >>>>>>>>>> 
> > >>>>>>>>>> /* Size of the L2 (and L3, etc) page tables.  */
> > >>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> > >>>>>>>>>> +#define ADDR_SPACE_BITS 64
> > >>>>>>>>>> 
> > >>>>>>>>>> #define P_L2_BITS 10
> > >>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
> > >>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> > >>>>>>>>>> {
> > >>>>>>>>>>     system_memory = g_malloc(sizeof(*system_memory));
> > >>>>>>>>>> 
> > >>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
> > >>>>>>>>>> -
> > >>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
> > >>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
> > >>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> > >>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> > >>>>>>>>>>     address_space_init(&address_space_memory, system_memory, "memory");
> > >>>>>>>>>> 
> > >>>>>>>>>>     system_io = g_malloc(sizeof(*system_io));
> > >>>>>>>>> 
> > >>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
> > >>>>>>>>> BARs that I'm not sure how to handle.
> > >>>>>>>> 
> > >>>>>>>> BARs are often disabled during sizing. Maybe you
> > >>>>>>>> don't detect BAR being disabled?
> > >>>>>>> 
> > >>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> > >>>>>>> the sizing and memory region updates for the BARs, vfio is just a
> > >>>>>>> pass-through here.
> > >>>>>> 
> > >>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
> > >>>>>> while I/O & memory are enabled in the command register.  Thanks,
> > >>>>>> 
> > >>>>>> Alex
> > >>>>> 
> > >>>>> OK then from QEMU POV this BAR value is not special at all.
> > >>>> 
> > >>>> Unfortunately
> > >>>> 
> > >>>>>>>>> After this patch I get vfio
> > >>>>>>>>> traces like this:
> > >>>>>>>>> 
> > >>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > >>>>>>>>> (save lower 32bits of BAR)
> > >>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > >>>>>>>>> (write mask to BAR)
> > >>>>>>>>> vfio: region_del febe0000 - febe3fff
> > >>>>>>>>> (memory region gets unmapped)
> > >>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > >>>>>>>>> (read size mask)
> > >>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > >>>>>>>>> (restore BAR)
> > >>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > >>>>>>>>> (memory region re-mapped)
> > >>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > >>>>>>>>> (save upper 32bits of BAR)
> > >>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > >>>>>>>>> (write mask to BAR)
> > >>>>>>>>> vfio: region_del febe0000 - febe3fff
> > >>>>>>>>> (memory region gets unmapped)
> > >>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > >>>>>>>>> (memory region gets re-mapped with new address)
> > >>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > >>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
> > >>>>>>>>> 
> > >>>>>>>> 
> > >>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
> > >>>>>>> 
> > >>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
> > >>>>> 
> > >>>>> Why can't you? Generally memory core let you find out easily.
> > >>>> 
> > >>>> My MemoryListener is setup for &address_space_memory and I then filter
> > >>>> out anything that's not memory_region_is_ram().  This still gets
> > >>>> through, so how do I easily find out?
> > >>>> 
> > >>>>> But in this case it's vfio device itself that is sized so for sure you
> > >>>>> know it's MMIO.
> > >>>> 
> > >>>> How so?  I have a MemoryListener as described above and pass everything
> > >>>> through to the IOMMU.  I suppose I could look through all the
> > >>>> VFIODevices and check if the MemoryRegion matches, but that seems really
> > >>>> ugly.
> > >>>> 
> > >>>>> Maybe you will have same issue if there's another device with a 64 bit
> > >>>>> bar though, like ivshmem?
> > >>>> 
> > >>>> Perhaps, I suspect I'll see anything that registers their BAR
> > >>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
> > >>> 
> > >>> Must be a 64 bit BAR to trigger the issue though.
> > >>> 
> > >>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
> > >>>>>>> that we might be able to take advantage of with GPU passthrough.
> > >>>>>>> 
> > >>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
> > >>>>>>>>> address, presumably because it was beyond the address space of the PCI
> > >>>>>>>>> window.  This address is clearly not in a PCI MMIO space, so why are we
> > >>>>>>>>> allowing it to be realized in the system address space at this location?
> > >>>>>>>>> Thanks,
> > >>>>>>>>> 
> > >>>>>>>>> Alex
> > >>>>>>>> 
> > >>>>>>>> Why do you think it is not in PCI MMIO space?
> > >>>>>>>> True, CPU can't access this address but other pci devices can.
> > >>>>>>> 
> > >>>>>>> What happens on real hardware when an address like this is programmed to
> > >>>>>>> a device?  The CPU doesn't have the physical bits to access it.  I have
> > >>>>>>> serious doubts that another PCI device would be able to access it
> > >>>>>>> either.  Maybe in some limited scenario where the devices are on the
> > >>>>>>> same conventional PCI bus.  In the typical case, PCI addresses are
> > >>>>>>> always limited by some kind of aperture, whether that's explicit in
> > >>>>>>> bridge windows or implicit in hardware design (and perhaps made explicit
> > >>>>>>> in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> > >>>>>>> would I do it in a way that still allows real 64bit MMIO to be
> > >>>>>>> programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> > >>>>>>> 
> > >>>>>>> Alex
> > >>>>>> 
> > >>>>> 
> > >>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
> > >>>>> full 64 bit addresses must be allowed and hardware validation
> > >>>>> test suites normally check that it actually does work
> > >>>>> if it happens.
> > >>>> 
> > >>>> Sure, PCI devices themselves, but the chipset typically has defined
> > >>>> routing, that's more what I'm referring to.  There are generally only
> > >>>> fixed address windows for RAM vs MMIO.
> > >>> 
> > >>> The physical chipset? Likely - in the presence of IOMMU.
> > >>> Without that, devices can talk to each other without going
> > >>> through chipset, and bridge spec is very explicit that
> > >>> full 64 bit addressing must be supported.
> > >>> 
> > >>> So as long as we don't emulate an IOMMU,
> > >>> guest will normally think it's okay to use any address.
> > >>> 
> > >>>>> Yes, if there's a bridge somewhere on the path that bridge's
> > >>>>> windows would protect you, but pci already does this filtering:
> > >>>>> if you see this address in the memory map this means
> > >>>>> your virtual device is on root bus.
> > >>>>> 
> > >>>>> So I think it's the other way around: if VFIO requires specific
> > >>>>> address ranges to be assigned to devices, it should give this
> > >>>>> info to qemu and qemu can give this to guest.
> > >>>>> Then anything outside that range can be ignored by VFIO.
> > >>>> 
> > >>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
> > >>>> currently no way to find out the address width of the IOMMU.  We've been
> > >>>> getting by because it's safely close enough to the CPU address width to
> > >>>> not be a concern until we start exposing things at the top of the 64bit
> > >>>> address space.  Maybe I can safely ignore anything above
> > >>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> > >>>> 
> > >>>> Alex
> > >>> 
> > >>> I think it's not related to target CPU at all - it's a host limitation.
> > >>> So just make up your own constant, maybe depending on host architecture.
> > >>> Long term add an ioctl to query it.
> > >> 
> > >> It's a hardware limitation which I'd imagine has some loose ties to the
> > >> physical address bits of the CPU.
> > >> 
> > >>> Also, we can add a fwcfg interface to tell bios that it should avoid
> > >>> placing BARs above some address.
> > >> 
> > >> That doesn't help this case, it's a spurious mapping caused by sizing
> > >> the BARs with them enabled.  We may still want such a thing to feed into
> > >> building ACPI tables though.
> > > 
> > > Well the point is that if you want BIOS to avoid
> > > specific addresses, you need to tell it what to avoid.
> > > But neither BIOS nor ACPI actually cover the range above
> > > 2^48 ATM so it's not a high priority.
> > > 
> > >>> Since it's a vfio limitation I think it should be a vfio API, along the
> > >>> lines of vfio_get_addr_space_bits(void).
> > >>> (Is this true btw? legacy assignment doesn't have this problem?)
> > >> 
> > >> It's an IOMMU hardware limitation, legacy assignment has the same
> > >> problem.  It looks like legacy will abort() in QEMU for the failed
> > >> mapping and I'm planning to tighten vfio to also kill the VM for failed
> > >> mappings.  In the short term, I think I'll ignore any mappings above
> > >> TARGET_PHYS_ADDR_SPACE_BITS,
> > > 
> > > That seems very wrong. It will still fail on an x86 host if we are
> > > emulating a CPU with full 64 bit addressing. The limitation is on the
> > > host side; there's no real reason to tie it to the target.
> 
> I doubt vfio would be the only thing broken in that case.

A bit cryptic.
target-s390x/cpu.h:#define TARGET_PHYS_ADDR_SPACE_BITS 64
So qemu does emulate at least one fully 64-bit CPU.

It's possible that something limits the PCI BAR address
there; it might or might not be architectural.

> > >> long term vfio already has an IOMMU info
> > >> ioctl that we could use to return this information, but we'll need to
> > >> figure out how to get it out of the IOMMU driver first.
> > >> Thanks,
> > >> 
> > >> Alex
> > > 
> > > Short term, just assume 48 bits on x86.
> 
> I hate to pick an arbitrary value since we have a very specific mapping
> we're trying to avoid.

Well, it's not a specific mapping really.

Any mapping outside the host IOMMU's range would not work.
Guests happen to trigger it while sizing, but again,
they are allowed to write anything into BARs, really.

>  Perhaps a better option is to skip anything
> where:
> 
>         MemoryRegionSection.offset_within_address_space >
>         ~MemoryRegionSection.offset_within_address_space


This merely checks that the high bit is 1, doesn't it?

So this equivalently assumes 63 bits on x86; if you prefer
63 over 48, that's fine with me.
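
For what it's worth, the equivalence is easy to check in isolation
(plain C, nothing QEMU specific):

    #include <assert.h>
    #include <stdint.h>

    /* Alex's proposed filter: offset > ~offset, i.e. bit 63 is set. */
    static int skip_section(uint64_t offset)
    {
        return offset > ~offset;
    }

    int main(void)
    {
        assert(skip_section(0xfffffffffebe0000ULL));  /* stray sizing map */
        assert(!skip_section(0x00000000febe0000ULL)); /* the real BAR */
        assert(skip_section(1ULL << 63));             /* lowest skipped address */
        assert(!skip_section((1ULL << 63) - 1));      /* highest kept address */
        return 0;
    }

So the filter keeps everything below 2^63 and drops the top half of the
64 bit space, nothing more.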




> > > We need to figure out what's the limitation on ppc and arm -
> > > maybe there's none and it can address full 64 bit range.
> > 
> > IIUC on PPC and ARM you always have BAR windows where things can get mapped into. Unlike x86 where the full physical address range can be overlaid by BARs.
> > 
> > Or did I misunderstand the question?
> 
> Sounds right, if either BAR mappings outside the window will not be
> realized in the memory space or the IOMMU has a full 64bit address
> space, there's no problem.  Here we have an intermediate step in the BAR
> sizing producing a stray mapping that the IOMMU hardware can't handle.
> Even if we could handle it, it's not clear that we want to.  On AMD-Vi
> the IOMMU page tables can grow to 6 levels deep.  A stray mapping like
> this then causes space and time overhead until the tables are pruned
> back down.  Thanks,
> 
> Alex

In the common case of a single VFIO device per IOMMU, you really should not
add that device's own BARs to the IOMMU mappings at all. That's not a complete fix,
but it addresses the overhead concern that you mention here.
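
A rough sketch of that idea, with made-up bookkeeping (the AssignedDev
list and its bar_mr array are illustrative, not the real vfio
structures): the listener compares each section's MemoryRegion against
the BARs of the devices it is mapping for and skips matches.

    #include <stdbool.h>
    #include "exec/memory.h"
    #include "qemu/queue.h"

    /* Hypothetical bookkeeping: each assigned device records its BAR regions. */
    typedef struct AssignedDev {
        MemoryRegion *bar_mr[6];
        QLIST_ENTRY(AssignedDev) next;
    } AssignedDev;

    typedef QLIST_HEAD(, AssignedDev) AssignedDevList;

    static bool section_is_own_bar(AssignedDevList *devs,
                                   MemoryRegionSection *section)
    {
        AssignedDev *dev;
        int i;

        QLIST_FOREACH(dev, devs, next) {
            for (i = 0; i < 6; i++) {
                if (dev->bar_mr[i] && dev->bar_mr[i] == section->mr) {
                    return true;  /* don't DMA-map a device's own BAR */
                }
            }
        }
        return false;
    }

That only covers the single-device-per-IOMMU case; with several devices
behind one IOMMU the peer BARs still need to be mapped if peer-to-peer
DMA is wanted.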

-- 
MST

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2014-01-12  7:54                   ` Michael S. Tsirkin
  2014-01-12 15:03                     ` Alexander Graf
@ 2014-01-14 13:50                     ` Mike Day
  2014-01-14 14:05                       ` Michael S. Tsirkin
  1 sibling, 1 reply; 74+ messages in thread
From: Mike Day @ 2014-01-14 13:50 UTC (permalink / raw)
  To: Michael S. Tsirkin, Alex Williamson
  Cc: peter.maydell, qemu-devel, aik, agraf, Luiz Capitulino,
	Paolo Bonzini, david


"Michael S. Tsirkin" <mst@redhat.com> writes:

> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:

> Short term, just assume 48 bits on x86.
>
> We need to figure out what's the limitation on ppc and arm -
> maybe there's none and it can address full 64 bit range.
>
> Cc some people who might know about these platforms.

The document you need is here: 

http://goo.gl/fJYxdN

"PCI Bus Binding To: IEEE Std 1275-1994"

The short answer is that Power (OpenFirmware-to-PCI) supports both MMIO
and Memory mappings for BARs.

Also, both 32-bit and 64-bit BARs are required to be supported. It is
legal to construct a 64-bit BAR by masking all the high bits to
zero. Presumably it would be OK to mask the 16 high bits to zero as
well, constructing a 48-bit address.
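
In other words, a 64 bit BAR whose top 16 bits are hardwired to zero
would turn the sizing value into a 48 bit address (tiny illustration,
not tied to any real device):

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* Only the low 48 bits of the programmed value are decoded. */
        uint64_t bar  = 0xfffffffffebe0000ULL;
        uint64_t mask = (1ULL << 48) - 1;

        printf("0x%016" PRIx64 " -> 0x%016" PRIx64 "\n", bar, bar & mask);
        return 0;
    }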

Mike

-- 
Mike Day | "Endurance is a Virtue"

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2014-01-14 13:50                     ` Mike Day
@ 2014-01-14 14:05                       ` Michael S. Tsirkin
  2014-01-14 15:01                         ` Mike Day
  2014-01-15  0:48                         ` Alexey Kardashevskiy
  0 siblings, 2 replies; 74+ messages in thread
From: Michael S. Tsirkin @ 2014-01-14 14:05 UTC (permalink / raw)
  To: Mike Day
  Cc: peter.maydell, aik, agraf, qemu-devel, Alex Williamson,
	Paolo Bonzini, Luiz Capitulino, david

On Tue, Jan 14, 2014 at 08:50:54AM -0500, Mike Day wrote:
> 
> "Michael S. Tsirkin" <mst@redhat.com> writes:
> 
> > On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> 
> > Short term, just assume 48 bits on x86.
> >
> > We need to figure out what's the limitation on ppc and arm -
> > maybe there's none and it can address full 64 bit range.
> >
> > Cc some people who might know about these platforms.
> 
> The document you need is here: 
> 
> http://goo.gl/fJYxdN
> 
> "PCI Bus Binding To: IEEE Std 1275-1994"
> 
> The short answer is that Power (OpenFirmware-to-PCI) supports both MMIO
> and Memory mappings for BARs.
> 
> Also, both 32-bit and 64-bit BARs are required to be supported. It is
> legal to construct a 64-bit BAR by masking all the high bits to
> zero. Presumably it would be OK to mask the 16 high bits to zero as
> well, constructing a 48-bit address.
> 
> Mike
> 
> -- 
> Mike Day | "Endurance is a Virtue"

The question was whether addresses such as
0xfffffffffec00000 can be a valid BAR value on these
platforms, and whether such an address is accessible to the CPU and
to other PCI devices.



-- 
MST

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2014-01-14 14:05                       ` Michael S. Tsirkin
@ 2014-01-14 15:01                         ` Mike Day
  2014-01-15  0:48                         ` Alexey Kardashevskiy
  1 sibling, 0 replies; 74+ messages in thread
From: Mike Day @ 2014-01-14 15:01 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: peter.maydell, Alexey Kardashevskiy, Alexander Graf, qemu-devel,
	Alex Williamson, Paolo Bonzini, Luiz Capitulino, David Gibson

On Tue, Jan 14, 2014 at 9:05 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Tue, Jan 14, 2014 at 08:50:54AM -0500, Mike Day wrote:

>>
>> Also, both 32-bit and 64-bit BARs are required to be supported. It is
>> legal to construct a 64-bit BAR by masking all the high bits to
>> zero. Presumably it would be OK to mask the 16 high bits to zero as
>> well, constructing a 48-bit address.

> The question was whether addresses such as
> 0xfffffffffec00000 can be a valid BAR value on these
> platforms, whether it's accessible to the CPU and
> to other PCI devices.

The answer has to be no, at least for Linux. Linux uses the high bit of
the page table address as state to indicate a huge page, and uses
48-bit addresses. Each PCI device is different, but right now Power7
supports 16TB of RAM, so I don't think the PCI bridge would necessarily
decode the high 16 bits of the memory address. For two PCI devices to
communicate with each other using 64-bit addresses, they both need to
support 64-bit memory in the same address range, which is possible.
All this info is subject to correction by Paul Mackerras or Alexey …

Mike

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2014-01-14 10:24                             ` Avi Kivity
  2014-01-14 11:50                               ` Michael S. Tsirkin
@ 2014-01-14 15:36                               ` Alex Williamson
  2014-01-14 16:20                                 ` Michael S. Tsirkin
  1 sibling, 1 reply; 74+ messages in thread
From: Alex Williamson @ 2014-01-14 15:36 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Peter Maydell, Michael S. Tsirkin, Alexey Kardashevskiy,
	QEMU Developers, Luiz Capitulino, Alexander Graf, Paolo Bonzini,
	David Gibson

On Tue, 2014-01-14 at 12:24 +0200, Avi Kivity wrote:
> On 01/14/2014 12:48 AM, Alex Williamson wrote:
> > On Mon, 2014-01-13 at 22:48 +0100, Alexander Graf wrote:
> >>> Am 13.01.2014 um 22:39 schrieb Alex Williamson <alex.williamson@redhat.com>:
> >>>
> >>>> On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
> >>>>> On 12.01.2014, at 08:54, Michael S. Tsirkin <mst@redhat.com> wrote:
> >>>>>
> >>>>>> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> >>>>>>> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> >>>>>>>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> >>>>>>>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> >>>>>>>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> >>>>>>>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> >>>>>>>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> >>>>>>>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> >>>>>>>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> >>>>>>>>>>>>>> From: Paolo Bonzini <pbonzini@redhat.com>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
> >>>>>>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
> >>>>>>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
> >>>>>>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> >>>>>>>>>>>>>> consequently messing up the computations.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
> >>>>>>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> >>>>>>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
> >>>>>>>>>>>>>> address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> >>>>>>>>>>>>>> not 2^64.  But we get it anyway because phys_page_find ignores the upper
> >>>>>>>>>>>>>> bits of the physical address.  In address_space_translate_internal then
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>    diff = int128_sub(section->mr->size, int128_make64(addr));
> >>>>>>>>>>>>>>    *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> diff becomes negative, and int128_get64 booms.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The size of the PCI address space region should be fixed anyway.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> >>>>>>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> >>>>>>>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >>>>>>>>>>>>>> ---
> >>>>>>>>>>>>>> exec.c | 8 ++------
> >>>>>>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> diff --git a/exec.c b/exec.c
> >>>>>>>>>>>>>> index 7e5ce93..f907f5f 100644
> >>>>>>>>>>>>>> --- a/exec.c
> >>>>>>>>>>>>>> +++ b/exec.c
> >>>>>>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
> >>>>>>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> /* Size of the L2 (and L3, etc) page tables.  */
> >>>>>>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> >>>>>>>>>>>>>> +#define ADDR_SPACE_BITS 64
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> #define P_L2_BITS 10
> >>>>>>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
> >>>>>>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> >>>>>>>>>>>>>> {
> >>>>>>>>>>>>>>     system_memory = g_malloc(sizeof(*system_memory));
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
> >>>>>>>>>>>>>> -
> >>>>>>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
> >>>>>>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
> >>>>>>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> >>>>>>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> >>>>>>>>>>>>>>     address_space_init(&address_space_memory, system_memory, "memory");
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>     system_io = g_malloc(sizeof(*system_io));
> >>>>>>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
> >>>>>>>>>>>>> BARs that I'm not sure how to handle.
> >>>>>>>>>>>> BARs are often disabled during sizing. Maybe you
> >>>>>>>>>>>> don't detect BAR being disabled?
> >>>>>>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> >>>>>>>>>>> the sizing and memory region updates for the BARs, vfio is just a
> >>>>>>>>>>> pass-through here.
> >>>>>>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
> >>>>>>>>>> while I/O & memory are enabled in the command register.  Thanks,
> >>>>>>>>>>
> >>>>>>>>>> Alex
> >>>>>>>>> OK then from QEMU POV this BAR value is not special at all.
> >>>>>>>> Unfortunately
> >>>>>>>>
> >>>>>>>>>>>>> After this patch I get vfio
> >>>>>>>>>>>>> traces like this:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> >>>>>>>>>>>>> (save lower 32bits of BAR)
> >>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> >>>>>>>>>>>>> (write mask to BAR)
> >>>>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> >>>>>>>>>>>>> (memory region gets unmapped)
> >>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> >>>>>>>>>>>>> (read size mask)
> >>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> >>>>>>>>>>>>> (restore BAR)
> >>>>>>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> >>>>>>>>>>>>> (memory region re-mapped)
> >>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> >>>>>>>>>>>>> (save upper 32bits of BAR)
> >>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> >>>>>>>>>>>>> (write mask to BAR)
> >>>>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> >>>>>>>>>>>>> (memory region gets unmapped)
> >>>>>>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> >>>>>>>>>>>>> (memory region gets re-mapped with new address)
> >>>>>>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> >>>>>>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
> >>>>>>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
> >>>>>>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
> >>>>>>>>> Why can't you? Generally memory core let you find out easily.
> >>>>>>>> My MemoryListener is setup for &address_space_memory and I then filter
> >>>>>>>> out anything that's not memory_region_is_ram().  This still gets
> >>>>>>>> through, so how do I easily find out?
> >>>>>>>>
> >>>>>>>>> But in this case it's vfio device itself that is sized so for sure you
> >>>>>>>>> know it's MMIO.
> >>>>>>>> How so?  I have a MemoryListener as described above and pass everything
> >>>>>>>> through to the IOMMU.  I suppose I could look through all the
> >>>>>>>> VFIODevices and check if the MemoryRegion matches, but that seems really
> >>>>>>>> ugly.
> >>>>>>>>
> >>>>>>>>> Maybe you will have same issue if there's another device with a 64 bit
> >>>>>>>>> bar though, like ivshmem?
> >>>>>>>> Perhaps, I suspect I'll see anything that registers their BAR
> >>>>>>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
> >>>>>>> Must be a 64 bit BAR to trigger the issue though.
> >>>>>>>
> >>>>>>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
> >>>>>>>>>>> that we might be able to take advantage of with GPU passthrough.
> >>>>>>>>>>>
> >>>>>>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
> >>>>>>>>>>>>> address, presumably because it was beyond the address space of the PCI
> >>>>>>>>>>>>> window.  This address is clearly not in a PCI MMIO space, so why are we
> >>>>>>>>>>>>> allowing it to be realized in the system address space at this location?
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Alex
> >>>>>>>>>>>> Why do you think it is not in PCI MMIO space?
> >>>>>>>>>>>> True, CPU can't access this address but other pci devices can.
> >>>>>>>>>>> What happens on real hardware when an address like this is programmed to
> >>>>>>>>>>> a device?  The CPU doesn't have the physical bits to access it.  I have
> >>>>>>>>>>> serious doubts that another PCI device would be able to access it
> >>>>>>>>>>> either.  Maybe in some limited scenario where the devices are on the
> >>>>>>>>>>> same conventional PCI bus.  In the typical case, PCI addresses are
> >>>>>>>>>>> always limited by some kind of aperture, whether that's explicit in
> >>>>>>>>>>> bridge windows or implicit in hardware design (and perhaps made explicit
> >>>>>>>>>>> in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> >>>>>>>>>>> would I do it in a way that still allows real 64bit MMIO to be
> >>>>>>>>>>> programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> >>>>>>>>>>>
> >>>>>>>>>>> Alex
> >>>>>>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
> >>>>>>>>> full 64 bit addresses must be allowed and hardware validation
> >>>>>>>>> test suites normally check that it actually does work
> >>>>>>>>> if it happens.
> >>>>>>>> Sure, PCI devices themselves, but the chipset typically has defined
> >>>>>>>> routing, that's more what I'm referring to.  There are generally only
> >>>>>>>> fixed address windows for RAM vs MMIO.
> >>>>>>> The physical chipset? Likely - in the presence of IOMMU.
> >>>>>>> Without that, devices can talk to each other without going
> >>>>>>> through chipset, and bridge spec is very explicit that
> >>>>>>> full 64 bit addressing must be supported.
> >>>>>>>
> >>>>>>> So as long as we don't emulate an IOMMU,
> >>>>>>> guest will normally think it's okay to use any address.
> >>>>>>>
> >>>>>>>>> Yes, if there's a bridge somewhere on the path that bridge's
> >>>>>>>>> windows would protect you, but pci already does this filtering:
> >>>>>>>>> if you see this address in the memory map this means
> >>>>>>>>> your virtual device is on root bus.
> >>>>>>>>>
> >>>>>>>>> So I think it's the other way around: if VFIO requires specific
> >>>>>>>>> address ranges to be assigned to devices, it should give this
> >>>>>>>>> info to qemu and qemu can give this to guest.
> >>>>>>>>> Then anything outside that range can be ignored by VFIO.
> >>>>>>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
> >>>>>>>> currently no way to find out the address width of the IOMMU.  We've been
> >>>>>>>> getting by because it's safely close enough to the CPU address width to
> >>>>>>>> not be a concern until we start exposing things at the top of the 64bit
> >>>>>>>> address space.  Maybe I can safely ignore anything above
> >>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> >>>>>>>>
> >>>>>>>> Alex
> >>>>>>> I think it's not related to target CPU at all - it's a host limitation.
> >>>>>>> So just make up your own constant, maybe depending on host architecture.
> >>>>>>> Long term add an ioctl to query it.
> >>>>>> It's a hardware limitation which I'd imagine has some loose ties to the
> >>>>>> physical address bits of the CPU.
> >>>>>>
> >>>>>>> Also, we can add a fwcfg interface to tell bios that it should avoid
> >>>>>>> placing BARs above some address.
> >>>>>> That doesn't help this case, it's a spurious mapping caused by sizing
> >>>>>> the BARs with them enabled.  We may still want such a thing to feed into
> >>>>>> building ACPI tables though.
> >>>>> Well the point is that if you want BIOS to avoid
> >>>>> specific addresses, you need to tell it what to avoid.
> >>>>> But neither BIOS nor ACPI actually cover the range above
> >>>>> 2^48 ATM so it's not a high priority.
> >>>>>
> >>>>>>> Since it's a vfio limitation I think it should be a vfio API, along the
> >>>>>>> lines of vfio_get_addr_space_bits(void).
> >>>>>>> (Is this true btw? legacy assignment doesn't have this problem?)
> >>>>>> It's an IOMMU hardware limitation, legacy assignment has the same
> >>>>>> problem.  It looks like legacy will abort() in QEMU for the failed
> >>>>>> mapping and I'm planning to tighten vfio to also kill the VM for failed
> >>>>>> mappings.  In the short term, I think I'll ignore any mappings above
> >>>>>> TARGET_PHYS_ADDR_SPACE_BITS,
> >>>>> That seems very wrong. It will still fail on an x86 host if we are
> >>>>> emulating a CPU with full 64 bit addressing. The limitation is on the
> >>>>> host side; there's no real reason to tie it to the target.
> >>> I doubt vfio would be the only thing broken in that case.
> >>>
> >>>>>> long term vfio already has an IOMMU info
> >>>>>> ioctl that we could use to return this information, but we'll need to
> >>>>>> figure out how to get it out of the IOMMU driver first.
> >>>>>> Thanks,
> >>>>>>
> >>>>>> Alex
> >>>>> Short term, just assume 48 bits on x86.
> >>> I hate to pick an arbitrary value since we have a very specific mapping
> >>> we're trying to avoid.  Perhaps a better option is to skip anything
> >>> where:
> >>>
> >>>         MemoryRegionSection.offset_within_address_space >
> >>>         ~MemoryRegionSection.offset_within_address_space
> >>>
> >>>>> We need to figure out what's the limitation on ppc and arm -
> >>>>> maybe there's none and it can address full 64 bit range.
> >>>> IIUC on PPC and ARM you always have BAR windows where things can get mapped into. Unlike x86 where the full physical address range can be overlaid by BARs.
> >>>>
> >>>> Or did I misunderstand the question?
> >>> Sounds right, if either BAR mappings outside the window will not be
> >>> realized in the memory space or the IOMMU has a full 64bit address
> >>> space, there's no problem.  Here we have an intermediate step in the BAR
> >>> sizing producing a stray mapping that the IOMMU hardware can't handle.
> >>> Even if we could handle it, it's not clear that we want to.  On AMD-Vi
> >>> the IOMMU page tables can grow to 6 levels deep.  A stray mapping like
> >>> this then causes space and time overhead until the tables are pruned
> >>> back down.  Thanks,
> >> I thought sizing is hard-defined as setting the BAR to
> >> -1? Can't we check for that one special case and treat it as "not mapped, but tell the guest the size in config space"?
> > PCI doesn't want to handle this as anything special to differentiate a
> > sizing mask from a valid BAR address.  I agree though, I'd prefer to
> > never see a spurious address like this in my MemoryListener.
> >
> >
> 
> Can't you just ignore regions that cannot be mapped?  Oh, and teach the 
> bios and/or linux to disable memory access while sizing.

Actually I think we need to be more stringent about DMA mapping
failures.  If a chunk of guest RAM fails to map, then we can lose data
when the device attempts to DMA a packet into it.  How do we know which
regions we can ignore and which we can't?  Whether or not the CPU can
access a region is a pretty good hint as to whether we can ignore it.
Thanks,

Alex
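
PS - for concreteness, the kind of filter I have in mind in vfio's
MemoryListener is sketched below.  This is only an illustration:
vfio_listener_skipped_section() stands in for whatever the real listener
code ends up doing, and the 48-bit value is an assumed host IOMMU width,
not something we can query today.

    static bool vfio_listener_skipped_section(MemoryRegionSection *section)
    {
        /* Assumed host IOMMU limit; 48 bits is typical for x86 today. */
        const unsigned int hw_bits = 48;
        hwaddr start = section->offset_within_address_space;

        /*
         * Skip anything that is not guest RAM, plus anything the host
         * hardware cannot address.  With hw_bits == 63 the second test
         * reduces to the "offset > ~offset" check discussed above.
         */
        return !memory_region_is_ram(section->mr) ||
               start >= (1ULL << hw_bits);
    }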

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2014-01-14 12:21                         ` Michael S. Tsirkin
@ 2014-01-14 15:49                           ` Alex Williamson
  2014-01-14 16:07                             ` Michael S. Tsirkin
  2014-01-14 17:49                             ` Mike Day
  0 siblings, 2 replies; 74+ messages in thread
From: Alex Williamson @ 2014-01-14 15:49 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Peter Maydell, QEMU Developers, Alexey Kardashevskiy,
	Alexander Graf, Luiz Capitulino, Paolo Bonzini, David Gibson

On Tue, 2014-01-14 at 14:21 +0200, Michael S. Tsirkin wrote:
> On Mon, Jan 13, 2014 at 02:39:04PM -0700, Alex Williamson wrote:
> > On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
> > > On 12.01.2014, at 08:54, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > 
> > > > On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> > > >> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> > > >>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> > > >>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> > > >>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> > > >>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> > > >>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> > > >>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> > > >>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > > >>>>>>>>>> From: Paolo Bonzini <pbonzini@redhat.com>
> > > >>>>>>>>>> 
> > > >>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
> > > >>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
> > > >>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
> > > >>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> > > >>>>>>>>>> consequently messing up the computations.
> > > >>>>>>>>>> 
> > > >>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
> > > >>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> > > >>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
> > > >>>>>>>>>> address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> > > >>>>>>>>>> not 2^64.  But we get it anyway because phys_page_find ignores the upper
> > > >>>>>>>>>> bits of the physical address.  In address_space_translate_internal then
> > > >>>>>>>>>> 
> > > >>>>>>>>>>    diff = int128_sub(section->mr->size, int128_make64(addr));
> > > >>>>>>>>>>    *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> > > >>>>>>>>>> 
> > > >>>>>>>>>> diff becomes negative, and int128_get64 booms.
> > > >>>>>>>>>> 
> > > >>>>>>>>>> The size of the PCI address space region should be fixed anyway.
> > > >>>>>>>>>> 
> > > >>>>>>>>>> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> > > >>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > > >>>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > >>>>>>>>>> ---
> > > >>>>>>>>>> exec.c | 8 ++------
> > > >>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
> > > >>>>>>>>>> 
> > > >>>>>>>>>> diff --git a/exec.c b/exec.c
> > > >>>>>>>>>> index 7e5ce93..f907f5f 100644
> > > >>>>>>>>>> --- a/exec.c
> > > >>>>>>>>>> +++ b/exec.c
> > > >>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
> > > >>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> > > >>>>>>>>>> 
> > > >>>>>>>>>> /* Size of the L2 (and L3, etc) page tables.  */
> > > >>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> > > >>>>>>>>>> +#define ADDR_SPACE_BITS 64
> > > >>>>>>>>>> 
> > > >>>>>>>>>> #define P_L2_BITS 10
> > > >>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
> > > >>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> > > >>>>>>>>>> {
> > > >>>>>>>>>>     system_memory = g_malloc(sizeof(*system_memory));
> > > >>>>>>>>>> 
> > > >>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
> > > >>>>>>>>>> -
> > > >>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
> > > >>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
> > > >>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> > > >>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> > > >>>>>>>>>>     address_space_init(&address_space_memory, system_memory, "memory");
> > > >>>>>>>>>> 
> > > >>>>>>>>>>     system_io = g_malloc(sizeof(*system_io));
> > > >>>>>>>>> 
> > > >>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
> > > >>>>>>>>> BARs that I'm not sure how to handle.
> > > >>>>>>>> 
> > > >>>>>>>> BARs are often disabled during sizing. Maybe you
> > > >>>>>>>> don't detect BAR being disabled?
> > > >>>>>>> 
> > > >>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> > > >>>>>>> the sizing and memory region updates for the BARs, vfio is just a
> > > >>>>>>> pass-through here.
> > > >>>>>> 
> > > >>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
> > > >>>>>> while I/O & memory are enabled in the command register.  Thanks,
> > > >>>>>> 
> > > >>>>>> Alex
> > > >>>>> 
> > > >>>>> OK then from QEMU POV this BAR value is not special at all.
> > > >>>> 
> > > >>>> Unfortunately
> > > >>>> 
> > > >>>>>>>>> After this patch I get vfio
> > > >>>>>>>>> traces like this:
> > > >>>>>>>>> 
> > > >>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > > >>>>>>>>> (save lower 32bits of BAR)
> > > >>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > > >>>>>>>>> (write mask to BAR)
> > > >>>>>>>>> vfio: region_del febe0000 - febe3fff
> > > >>>>>>>>> (memory region gets unmapped)
> > > >>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > > >>>>>>>>> (read size mask)
> > > >>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > > >>>>>>>>> (restore BAR)
> > > >>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > > >>>>>>>>> (memory region re-mapped)
> > > >>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > > >>>>>>>>> (save upper 32bits of BAR)
> > > >>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > > >>>>>>>>> (write mask to BAR)
> > > >>>>>>>>> vfio: region_del febe0000 - febe3fff
> > > >>>>>>>>> (memory region gets unmapped)
> > > >>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > > >>>>>>>>> (memory region gets re-mapped with new address)
> > > >>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > > >>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
> > > >>>>>>>>> 
> > > >>>>>>>> 
> > > >>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
> > > >>>>>>> 
> > > >>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
> > > >>>>> 
> > > >>>>> Why can't you? Generally memory core let you find out easily.
> > > >>>> 
> > > >>>> My MemoryListener is setup for &address_space_memory and I then filter
> > > >>>> out anything that's not memory_region_is_ram().  This still gets
> > > >>>> through, so how do I easily find out?
> > > >>>> 
> > > >>>>> But in this case it's vfio device itself that is sized so for sure you
> > > >>>>> know it's MMIO.
> > > >>>> 
> > > >>>> How so?  I have a MemoryListener as described above and pass everything
> > > >>>> through to the IOMMU.  I suppose I could look through all the
> > > >>>> VFIODevices and check if the MemoryRegion matches, but that seems really
> > > >>>> ugly.
> > > >>>> 
> > > >>>>> Maybe you will have same issue if there's another device with a 64 bit
> > > >>>>> bar though, like ivshmem?
> > > >>>> 
> > > >>>> Perhaps, I suspect I'll see anything that registers their BAR
> > > >>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
> > > >>> 
> > > >>> Must be a 64 bit BAR to trigger the issue though.
> > > >>> 
> > > >>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
> > > >>>>>>> that we might be able to take advantage of with GPU passthrough.
> > > >>>>>>> 
> > > >>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
> > > >>>>>>>>> address, presumably because it was beyond the address space of the PCI
> > > >>>>>>>>> window.  This address is clearly not in a PCI MMIO space, so why are we
> > > >>>>>>>>> allowing it to be realized in the system address space at this location?
> > > >>>>>>>>> Thanks,
> > > >>>>>>>>> 
> > > >>>>>>>>> Alex
> > > >>>>>>>> 
> > > >>>>>>>> Why do you think it is not in PCI MMIO space?
> > > >>>>>>>> True, CPU can't access this address but other pci devices can.
> > > >>>>>>> 
> > > >>>>>>> What happens on real hardware when an address like this is programmed to
> > > >>>>>>> a device?  The CPU doesn't have the physical bits to access it.  I have
> > > >>>>>>> serious doubts that another PCI device would be able to access it
> > > >>>>>>> either.  Maybe in some limited scenario where the devices are on the
> > > >>>>>>> same conventional PCI bus.  In the typical case, PCI addresses are
> > > >>>>>>> always limited by some kind of aperture, whether that's explicit in
> > > >>>>>>> bridge windows or implicit in hardware design (and perhaps made explicit
> > > >>>>>>> in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> > > >>>>>>> would I do it in a way that still allows real 64bit MMIO to be
> > > >>>>>>> programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> > > >>>>>>> 
> > > >>>>>>> Alex
> > > >>>>>> 
> > > >>>>> 
> > > >>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
> > > >>>>> full 64 bit addresses must be allowed and hardware validation
> > > >>>>> test suites normally check that it actually does work
> > > >>>>> if it happens.
> > > >>>> 
> > > >>>> Sure, PCI devices themselves, but the chipset typically has defined
> > > >>>> routing, that's more what I'm referring to.  There are generally only
> > > >>>> fixed address windows for RAM vs MMIO.
> > > >>> 
> > > >>> The physical chipset? Likely - in the presence of IOMMU.
> > > >>> Without that, devices can talk to each other without going
> > > >>> through chipset, and bridge spec is very explicit that
> > > >>> full 64 bit addressing must be supported.
> > > >>> 
> > > >>> So as long as we don't emulate an IOMMU,
> > > >>> guest will normally think it's okay to use any address.
> > > >>> 
> > > >>>>> Yes, if there's a bridge somewhere on the path that bridge's
> > > >>>>> windows would protect you, but pci already does this filtering:
> > > >>>>> if you see this address in the memory map this means
> > > >>>>> your virtual device is on root bus.
> > > >>>>> 
> > > >>>>> So I think it's the other way around: if VFIO requires specific
> > > >>>>> address ranges to be assigned to devices, it should give this
> > > >>>>> info to qemu and qemu can give this to guest.
> > > >>>>> Then anything outside that range can be ignored by VFIO.
> > > >>>> 
> > > >>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
> > > >>>> currently no way to find out the address width of the IOMMU.  We've been
> > > >>>> getting by because it's safely close enough to the CPU address width to
> > > >>>> not be a concern until we start exposing things at the top of the 64bit
> > > >>>> address space.  Maybe I can safely ignore anything above
> > > >>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> > > >>>> 
> > > >>>> Alex
> > > >>> 
> > > >>> I think it's not related to target CPU at all - it's a host limitation.
> > > >>> So just make up your own constant, maybe depending on host architecture.
> > > >>> Long term add an ioctl to query it.
> > > >> 
> > > >> It's a hardware limitation which I'd imagine has some loose ties to the
> > > >> physical address bits of the CPU.
> > > >> 
> > > >>> Also, we can add a fwcfg interface to tell bios that it should avoid
> > > >>> placing BARs above some address.
> > > >> 
> > > >> That doesn't help this case, it's a spurious mapping caused by sizing
> > > >> the BARs with them enabled.  We may still want such a thing to feed into
> > > >> building ACPI tables though.
> > > > 
> > > > Well the point is that if you want BIOS to avoid
> > > > specific addresses, you need to tell it what to avoid.
> > > > But neither BIOS nor ACPI actually cover the range above
> > > > 2^48 ATM so it's not a high priority.
> > > > 
> > > >>> Since it's a vfio limitation I think it should be a vfio API, along the
> > > >>> lines of vfio_get_addr_space_bits(void).
> > > >>> (Is this true btw? legacy assignment doesn't have this problem?)
> > > >> 
> > > >> It's an IOMMU hardware limitation, legacy assignment has the same
> > > >> problem.  It looks like legacy will abort() in QEMU for the failed
> > > >> mapping and I'm planning to tighten vfio to also kill the VM for failed
> > > >> mappings.  In the short term, I think I'll ignore any mappings above
> > > >> TARGET_PHYS_ADDR_SPACE_BITS,
> > > > 
> > > > That seems very wrong. It will still fail on an x86 host if we are
> > > > emulating a CPU with full 64 bit addressing. The limitation is on the
> > > > host side; there's no real reason to tie it to the target.
> > 
> > I doubt vfio would be the only thing broken in that case.
> 
> A bit cryptic.
> target-s390x/cpu.h:#define TARGET_PHYS_ADDR_SPACE_BITS 64
> So qemu does emulate at least one full 64-bit CPU.
> 
> It's possible that something limits PCI BAR address
> there, it might or might not be architectural.
> 
> > > >> long term vfio already has an IOMMU info
> > > >> ioctl that we could use to return this information, but we'll need to
> > > >> figure out how to get it out of the IOMMU driver first.
> > > >> Thanks,
> > > >> 
> > > >> Alex
> > > > 
> > > > Short term, just assume 48 bits on x86.
> > 
> > I hate to pick an arbitrary value since we have a very specific mapping
> > we're trying to avoid.
> 
> Well it's not a specific mapping really.
> 
> Any mapping outside what the host IOMMU can address would not work.
> Guests happen to trigger it while sizing, but again,
> they are allowed to write anything into BARs really.
> 
> >  Perhaps a better option is to skip anything
> > where:
> > 
> >         MemoryRegionSection.offset_within_address_space >
> >         ~MemoryRegionSection.offset_within_address_space
> 
> 
> This merely checks that the high bit is 1, doesn't it?
> 
> So this is equivalent to assuming 63 bits on x86; if you prefer
> 63 rather than 48, that's fine with me.
> 
> 
> 
> 
> > > > We need to figure out what's the limitation on ppc and arm -
> > > > maybe there's none and it can address full 64 bit range.
> > > 
> > > IIUC on PPC and ARM you always have BAR windows where things can get mapped into. Unlike x86 where the full physical address range can be overlaid by BARs.
> > > 
> > > Or did I misunderstand the question?
> > 
> > Sounds right, if either BAR mappings outside the window will not be
> > realized in the memory space or the IOMMU has a full 64bit address
> > space, there's no problem.  Here we have an intermediate step in the BAR
> > sizing producing a stray mapping that the IOMMU hardware can't handle.
> > Even if we could handle it, it's not clear that we want to.  On AMD-Vi
> > the IOMMU page tables can grow to 6 levels deep.  A stray mapping like
> > this then causes space and time overhead until the tables are pruned
> > back down.  Thanks,
> > 
> > Alex
> 
> In the common case of a single VFIO device per IOMMU, you really should not
> map its own BARs through the IOMMU. That's not a complete fix,
> but it addresses the overhead concern that you mention here.

That seems like a big assumption.  We now have support for assigned GPUs
which can be paired to do SLI.  One way they might do SLI is via
peer-to-peer DMA.  We can enable that by mapping device BARs through the
IOMMU.  So it seems quite valid to want to map these.
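
To make that concrete: we already issue VFIO_IOMMU_MAP_DMA for every
chunk of guest RAM, and mapping a BAR for peer-to-peer is the same call
with the host virtual address pointing at the mmap of the BAR instead.
A rough sketch (container_fd, bar_mmap, guest_phys and bar_size are
placeholders, not the real structures we keep):

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /* Map the mmap of a device BAR at its guest-physical address so that
     * peer devices behind the same IOMMU domain can DMA to it. */
    static int map_bar_for_p2p(int container_fd, void *bar_mmap,
                               uint64_t guest_phys, uint64_t bar_size)
    {
        struct vfio_iommu_type1_dma_map map = {
            .argsz = sizeof(map),
            .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
            .vaddr = (uintptr_t)bar_mmap, /* host virtual address of the mmap */
            .iova  = guest_phys,          /* where the guest programmed the BAR */
            .size  = bar_size,
        };

        return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
    }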

If we choose not to map them, how do we distinguish them from guest RAM?
There's no MemoryRegion flag that I'm aware of to distinguish a ram_ptr
that points to a chunk of guest memory from one that points to the mmap
of a device BAR.  I think I'd need to explicitly walk all of the vfio
devices and try to match the MemoryRegion pointer to one of my devices.
That only solves the problem for vfio devices and not ivshmem devices or
pci-assign devices.
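
For what it's worth, the walk would look something like this -- the
group list, VFIOGroup/VFIODevice layout and bars[] field are stand-ins
for whatever vfio actually keeps, and it only ever answers the question
for vfio's own devices:

    /*
     * Hypothetical helper: return true if 'mr' backs one of our own
     * device BARs.  Says nothing about ivshmem or pci-assign.
     */
    static bool vfio_owns_memory_region(MemoryRegion *mr)
    {
        VFIOGroup *group;
        VFIODevice *vdev;
        int i;

        QLIST_FOREACH(group, &vfio_group_list, next) {
            QLIST_FOREACH(vdev, &group->device_list, next) {
                for (i = 0; i < PCI_NUM_REGIONS - 1; i++) {
                    if (mr == &vdev->bars[i].mem) {
                        return true;
                    }
                }
            }
        }
        return false;
    }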

Another idea I was trying to implement is to enable the mmap
MemoryRegion lazily on first access.  That way we would ignore these
spurious bogus mappings because they never get accessed.  Two problems
though: first, how/where to disable the mmap MemoryRegion (modifying the
memory map from within MemoryListener.region_del seems to do bad
things); second, we can't handle the case of a BAR only being accessed
via peer-to-peer (which seems unlikely).  Perhaps the nail in the coffin
again is that it only solves the problem for vfio devices; spurious
mappings from other devices backed by ram_ptr will still fault.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2014-01-14 12:07                             ` Michael S. Tsirkin
@ 2014-01-14 15:57                               ` Alex Williamson
  2014-01-14 16:03                                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 74+ messages in thread
From: Alex Williamson @ 2014-01-14 15:57 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Peter Maydell, QEMU Developers, Alexey Kardashevskiy, Jason Wang,
	Alexander Graf, Luiz Capitulino, Paolo Bonzini, David Gibson

On Tue, 2014-01-14 at 14:07 +0200, Michael S. Tsirkin wrote:
> On Mon, Jan 13, 2014 at 03:48:11PM -0700, Alex Williamson wrote:
> > On Mon, 2014-01-13 at 22:48 +0100, Alexander Graf wrote:
> > > 
> > > > Am 13.01.2014 um 22:39 schrieb Alex Williamson <alex.williamson@redhat.com>:
> > > > 
> > > >> On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
> > > >>> On 12.01.2014, at 08:54, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > >>> 
> > > >>>> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> > > >>>>> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> > > >>>>>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> > > >>>>>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> > > >>>>>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> > > >>>>>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> > > >>>>>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> > > >>>>>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> > > >>>>>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > > >>>>>>>>>>>> From: Paolo Bonzini <pbonzini@redhat.com>
> > > >>>>>>>>>>>> 
> > > >>>>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
> > > >>>>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
> > > >>>>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
> > > >>>>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> > > >>>>>>>>>>>> consequently messing up the computations.
> > > >>>>>>>>>>>> 
> > > >>>>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
> > > >>>>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> > > >>>>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
> > > >>>>>>>>>>>> address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> > > >>>>>>>>>>>> not 2^64.  But we get it anyway because phys_page_find ignores the upper
> > > >>>>>>>>>>>> bits of the physical address.  In address_space_translate_internal then
> > > >>>>>>>>>>>> 
> > > >>>>>>>>>>>>   diff = int128_sub(section->mr->size, int128_make64(addr));
> > > >>>>>>>>>>>>   *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> > > >>>>>>>>>>>> 
> > > >>>>>>>>>>>> diff becomes negative, and int128_get64 booms.
> > > >>>>>>>>>>>> 
> > > >>>>>>>>>>>> The size of the PCI address space region should be fixed anyway.
> > > >>>>>>>>>>>> 
> > > >>>>>>>>>>>> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> > > >>>>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > > >>>>>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > >>>>>>>>>>>> ---
> > > >>>>>>>>>>>> exec.c | 8 ++------
> > > >>>>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
> > > >>>>>>>>>>>> 
> > > >>>>>>>>>>>> diff --git a/exec.c b/exec.c
> > > >>>>>>>>>>>> index 7e5ce93..f907f5f 100644
> > > >>>>>>>>>>>> --- a/exec.c
> > > >>>>>>>>>>>> +++ b/exec.c
> > > >>>>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
> > > >>>>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> > > >>>>>>>>>>>> 
> > > >>>>>>>>>>>> /* Size of the L2 (and L3, etc) page tables.  */
> > > >>>>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> > > >>>>>>>>>>>> +#define ADDR_SPACE_BITS 64
> > > >>>>>>>>>>>> 
> > > >>>>>>>>>>>> #define P_L2_BITS 10
> > > >>>>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
> > > >>>>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> > > >>>>>>>>>>>> {
> > > >>>>>>>>>>>>    system_memory = g_malloc(sizeof(*system_memory));
> > > >>>>>>>>>>>> 
> > > >>>>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
> > > >>>>>>>>>>>> -
> > > >>>>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
> > > >>>>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
> > > >>>>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> > > >>>>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> > > >>>>>>>>>>>>    address_space_init(&address_space_memory, system_memory, "memory");
> > > >>>>>>>>>>>> 
> > > >>>>>>>>>>>>    system_io = g_malloc(sizeof(*system_io));
> > > >>>>>>>>>>> 
> > > >>>>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
> > > >>>>>>>>>>> BARs that I'm not sure how to handle.
> > > >>>>>>>>>> 
> > > >>>>>>>>>> BARs are often disabled during sizing. Maybe you
> > > >>>>>>>>>> don't detect BAR being disabled?
> > > >>>>>>>>> 
> > > >>>>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> > > >>>>>>>>> the sizing and memory region updates for the BARs, vfio is just a
> > > >>>>>>>>> pass-through here.
> > > >>>>>>>> 
> > > >>>>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
> > > >>>>>>>> while I/O & memory are enabled in the command register.  Thanks,
> > > >>>>>>>> 
> > > >>>>>>>> Alex
> > > >>>>>>> 
> > > >>>>>>> OK then from QEMU POV this BAR value is not special at all.
> > > >>>>>> 
> > > >>>>>> Unfortunately
> > > >>>>>> 
> > > >>>>>>>>>>> After this patch I get vfio
> > > >>>>>>>>>>> traces like this:
> > > >>>>>>>>>>> 
> > > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > > >>>>>>>>>>> (save lower 32bits of BAR)
> > > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > > >>>>>>>>>>> (write mask to BAR)
> > > >>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> > > >>>>>>>>>>> (memory region gets unmapped)
> > > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > > >>>>>>>>>>> (read size mask)
> > > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > > >>>>>>>>>>> (restore BAR)
> > > >>>>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > > >>>>>>>>>>> (memory region re-mapped)
> > > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > > >>>>>>>>>>> (save upper 32bits of BAR)
> > > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > > >>>>>>>>>>> (write mask to BAR)
> > > >>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> > > >>>>>>>>>>> (memory region gets unmapped)
> > > >>>>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > > >>>>>>>>>>> (memory region gets re-mapped with new address)
> > > >>>>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > > >>>>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
> > > >>>>>>>>>> 
> > > >>>>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
> > > >>>>>>>>> 
> > > >>>>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
> > > >>>>>>> 
> > > >>>>>>> Why can't you? Generally memory core let you find out easily.
> > > >>>>>> 
> > > >>>>>> My MemoryListener is setup for &address_space_memory and I then filter
> > > >>>>>> out anything that's not memory_region_is_ram().  This still gets
> > > >>>>>> through, so how do I easily find out?
> > > >>>>>> 
> > > >>>>>>> But in this case it's vfio device itself that is sized so for sure you
> > > >>>>>>> know it's MMIO.
> > > >>>>>> 
> > > >>>>>> How so?  I have a MemoryListener as described above and pass everything
> > > >>>>>> through to the IOMMU.  I suppose I could look through all the
> > > >>>>>> VFIODevices and check if the MemoryRegion matches, but that seems really
> > > >>>>>> ugly.
> > > >>>>>> 
> > > >>>>>>> Maybe you will have same issue if there's another device with a 64 bit
> > > >>>>>>> bar though, like ivshmem?
> > > >>>>>> 
> > > >>>>>> Perhaps, I suspect I'll see anything that registers their BAR
> > > >>>>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
> > > >>>>> 
> > > >>>>> Must be a 64 bit BAR to trigger the issue though.
> > > >>>>> 
> > > >>>>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
> > > >>>>>>>>> that we might be able to take advantage of with GPU passthrough.
> > > >>>>>>>>> 
> > > >>>>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
> > > >>>>>>>>>>> address, presumably because it was beyond the address space of the PCI
> > > >>>>>>>>>>> window.  This address is clearly not in a PCI MMIO space, so why are we
> > > >>>>>>>>>>> allowing it to be realized in the system address space at this location?
> > > >>>>>>>>>>> Thanks,
> > > >>>>>>>>>>> 
> > > >>>>>>>>>>> Alex
> > > >>>>>>>>>> 
> > > >>>>>>>>>> Why do you think it is not in PCI MMIO space?
> > > >>>>>>>>>> True, CPU can't access this address but other pci devices can.
> > > >>>>>>>>> 
> > > >>>>>>>>> What happens on real hardware when an address like this is programmed to
> > > >>>>>>>>> a device?  The CPU doesn't have the physical bits to access it.  I have
> > > >>>>>>>>> serious doubts that another PCI device would be able to access it
> > > >>>>>>>>> either.  Maybe in some limited scenario where the devices are on the
> > > >>>>>>>>> same conventional PCI bus.  In the typical case, PCI addresses are
> > > >>>>>>>>> always limited by some kind of aperture, whether that's explicit in
> > > >>>>>>>>> bridge windows or implicit in hardware design (and perhaps made explicit
> > > >>>>>>>>> in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> > > >>>>>>>>> would I do it in a way that still allows real 64bit MMIO to be
> > > >>>>>>>>> programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> > > >>>>>>>>> 
> > > >>>>>>>>> Alex
> > > >>>>>>> 
> > > >>>>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
> > > >>>>>>> full 64 bit addresses must be allowed and hardware validation
> > > >>>>>>> test suites normally check that it actually does work
> > > >>>>>>> if it happens.
> > > >>>>>> 
> > > >>>>>> Sure, PCI devices themselves, but the chipset typically has defined
> > > >>>>>> routing, that's more what I'm referring to.  There are generally only
> > > >>>>>> fixed address windows for RAM vs MMIO.
> > > >>>>> 
> > > >>>>> The physical chipset? Likely - in the presence of IOMMU.
> > > >>>>> Without that, devices can talk to each other without going
> > > >>>>> through chipset, and bridge spec is very explicit that
> > > >>>>> full 64 bit addressing must be supported.
> > > >>>>> 
> > > >>>>> So as long as we don't emulate an IOMMU,
> > > >>>>> guest will normally think it's okay to use any address.
> > > >>>>> 
> > > >>>>>>> Yes, if there's a bridge somewhere on the path that bridge's
> > > >>>>>>> windows would protect you, but pci already does this filtering:
> > > >>>>>>> if you see this address in the memory map this means
> > > >>>>>>> your virtual device is on root bus.
> > > >>>>>>> 
> > > >>>>>>> So I think it's the other way around: if VFIO requires specific
> > > >>>>>>> address ranges to be assigned to devices, it should give this
> > > >>>>>>> info to qemu and qemu can give this to guest.
> > > >>>>>>> Then anything outside that range can be ignored by VFIO.
> > > >>>>>> 
> > > >>>>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
> > > >>>>>> currently no way to find out the address width of the IOMMU.  We've been
> > > >>>>>> getting by because it's safely close enough to the CPU address width to
> > > >>>>>> not be a concern until we start exposing things at the top of the 64bit
> > > >>>>>> address space.  Maybe I can safely ignore anything above
> > > >>>>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> > > >>>>>> 
> > > >>>>>> Alex
> > > >>>>> 
> > > >>>>> I think it's not related to target CPU at all - it's a host limitation.
> > > >>>>> So just make up your own constant, maybe depending on host architecture.
> > > >>>>> Long term add an ioctl to query it.
> > > >>>> 
> > > >>>> It's a hardware limitation which I'd imagine has some loose ties to the
> > > >>>> physical address bits of the CPU.
> > > >>>> 
> > > >>>>> Also, we can add a fwcfg interface to tell bios that it should avoid
> > > >>>>> placing BARs above some address.
> > > >>>> 
> > > >>>> That doesn't help this case, it's a spurious mapping caused by sizing
> > > >>>> the BARs with them enabled.  We may still want such a thing to feed into
> > > >>>> building ACPI tables though.
> > > >>> 
> > > >>> Well the point is that if you want BIOS to avoid
> > > >>> specific addresses, you need to tell it what to avoid.
> > > >>> But neither BIOS nor ACPI actually cover the range above
> > > >>> 2^48 ATM so it's not a high priority.
> > > >>> 
> > > >>>>> Since it's a vfio limitation I think it should be a vfio API, along the
> > > >>>>> lines of vfio_get_addr_space_bits(void).
> > > >>>>> (Is this true btw? legacy assignment doesn't have this problem?)
> > > >>>> 
> > > >>>> It's an IOMMU hardware limitation, legacy assignment has the same
> > > >>>> problem.  It looks like legacy will abort() in QEMU for the failed
> > > >>>> mapping and I'm planning to tighten vfio to also kill the VM for failed
> > > >>>> mappings.  In the short term, I think I'll ignore any mappings above
> > > >>>> TARGET_PHYS_ADDR_SPACE_BITS,
> > > >>> 
> > > >>> That seems very wrong. It will still fail on an x86 host if we are
> > > >>> emulating a CPU with full 64 bit addressing. The limitation is on the
> > > >>> host side; there's no real reason to tie it to the target.
> > > > 
> > > > I doubt vfio would be the only thing broken in that case.
> > > > 
> > > >>>> long term vfio already has an IOMMU info
> > > >>>> ioctl that we could use to return this information, but we'll need to
> > > >>>> figure out how to get it out of the IOMMU driver first.
> > > >>>> Thanks,
> > > >>>> 
> > > >>>> Alex
> > > >>> 
> > > >>> Short term, just assume 48 bits on x86.
> > > > 
> > > > I hate to pick an arbitrary value since we have a very specific mapping
> > > > we're trying to avoid.  Perhaps a better option is to skip anything
> > > > where:
> > > > 
> > > >        MemoryRegionSection.offset_within_address_space >
> > > >        ~MemoryRegionSection.offset_within_address_space
> > > > 
> > > >>> We need to figure out what's the limitation on ppc and arm -
> > > >>> maybe there's none and it can address full 64 bit range.
> > > >> 
> > > >> IIUC on PPC and ARM you always have BAR windows where things can get mapped into. Unlike x86 where the full physical address range can be overlaid by BARs.
> > > >> 
> > > >> Or did I misunderstand the question?
> > > > 
> > > > Sounds right, if either BAR mappings outside the window will not be
> > > > realized in the memory space or the IOMMU has a full 64bit address
> > > > space, there's no problem.  Here we have an intermediate step in the BAR
> > > > sizing producing a stray mapping that the IOMMU hardware can't handle.
> > > > Even if we could handle it, it's not clear that we want to.  On AMD-Vi
> > > > the IOMMU page tables can grow to 6 levels deep.  A stray mapping like
> > > > this then causes space and time overhead until the tables are pruned
> > > > back down.  Thanks,
> > > 
> > > I thought sizing is hard-defined as setting the BAR to
> > > -1? Can't we check for that one special case and treat it as "not mapped, but tell the guest the size in config space"?
> > 
> > PCI doesn't want to handle this as anything special to differentiate a
> > sizing mask from a valid BAR address.  I agree though, I'd prefer to
> > never see a spurious address like this in my MemoryListener.
> 
> It's more a "can't" than a "doesn't want to": it's a 64 bit BAR, so it's
> not set to all ones atomically.
> 
> Also, while it doesn't address this fully (the same issue can happen
> e.g. with ivshmem), do you think we should somehow distinguish, in qemu,
> BARs mapped by vfio / device assignment?
> 
> In particular, even when the address is sane:
> a device really cannot DMA into its own BAR, that's a spec violation,
> so in theory it can do anything including crashing the system.
> I don't know what happens in practice, but
> if you are programming the IOMMU to forward transactions back to the
> device that originated them, you are not doing it any favors.

I might concede that peer-to-peer is more trouble than it's worth if I
had a convenient way to ignore MMIO mappings in my MemoryListener, but I
don't.  Self-DMA is really not the intent of doing the mapping, but
peer-to-peer does have merit.
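
To spell out what "convenient" would mean: if the memory API grew a
predicate for device-backed RAM -- say memory_region_is_ram_device(),
which is purely hypothetical here -- the listener filter would collapse
to something like:

    /*
     * memory_region_is_ram_device() is invented for illustration; it
     * would flag RAM regions that are really the mmap of a device BAR
     * rather than guest memory.
     */
    static bool vfio_section_is_mappable(MemoryRegionSection *section)
    {
        return memory_region_is_ram(section->mr) &&
               !memory_region_is_ram_device(section->mr);
    }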

> I also note that if someone tries zero-copy transmit out of such an
> address, get_user_pages will fail.
> I think this means tun zero-copy transmit needs to fall back
> to copy-from-user on get_user_pages failure.
> 
> Jason, what's your thinking on this?
> 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2014-01-14 15:57                               ` Alex Williamson
@ 2014-01-14 16:03                                 ` Michael S. Tsirkin
  2014-01-14 16:15                                   ` Alex Williamson
  0 siblings, 1 reply; 74+ messages in thread
From: Michael S. Tsirkin @ 2014-01-14 16:03 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Peter Maydell, QEMU Developers, Alexey Kardashevskiy, Jason Wang,
	Alexander Graf, Luiz Capitulino, Paolo Bonzini, David Gibson

On Tue, Jan 14, 2014 at 08:57:58AM -0700, Alex Williamson wrote:
> On Tue, 2014-01-14 at 14:07 +0200, Michael S. Tsirkin wrote:
> > On Mon, Jan 13, 2014 at 03:48:11PM -0700, Alex Williamson wrote:
> > > On Mon, 2014-01-13 at 22:48 +0100, Alexander Graf wrote:
> > > > 
> > > > > Am 13.01.2014 um 22:39 schrieb Alex Williamson <alex.williamson@redhat.com>:
> > > > > 
> > > > >> On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
> > > > >>> On 12.01.2014, at 08:54, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > >>> 
> > > > >>>> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> > > > >>>>> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> > > > >>>>>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> > > > >>>>>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> > > > >>>>>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> > > > >>>>>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> > > > >>>>>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> > > > >>>>>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> > > > >>>>>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > > > >>>>>>>>>>>> From: Paolo Bonzini <pbonzini@redhat.com>
> > > > >>>>>>>>>>>> 
> > > > >>>>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
> > > > >>>>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
> > > > >>>>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
> > > > >>>>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> > > > >>>>>>>>>>>> consequently messing up the computations.
> > > > >>>>>>>>>>>> 
> > > > >>>>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
> > > > >>>>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> > > > >>>>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
> > > > >>>>>>>>>>>> address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> > > > >>>>>>>>>>>> not 2^64.  But we get it anyway because phys_page_find ignores the upper
> > > > >>>>>>>>>>>> bits of the physical address.  In address_space_translate_internal then
> > > > >>>>>>>>>>>> 
> > > > >>>>>>>>>>>>   diff = int128_sub(section->mr->size, int128_make64(addr));
> > > > >>>>>>>>>>>>   *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> > > > >>>>>>>>>>>> 
> > > > >>>>>>>>>>>> diff becomes negative, and int128_get64 booms.
> > > > >>>>>>>>>>>> 
> > > > >>>>>>>>>>>> The size of the PCI address space region should be fixed anyway.
> > > > >>>>>>>>>>>> 
> > > > >>>>>>>>>>>> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> > > > >>>>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > > > >>>>>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > >>>>>>>>>>>> ---
> > > > >>>>>>>>>>>> exec.c | 8 ++------
> > > > >>>>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
> > > > >>>>>>>>>>>> 
> > > > >>>>>>>>>>>> diff --git a/exec.c b/exec.c
> > > > >>>>>>>>>>>> index 7e5ce93..f907f5f 100644
> > > > >>>>>>>>>>>> --- a/exec.c
> > > > >>>>>>>>>>>> +++ b/exec.c
> > > > >>>>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
> > > > >>>>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> > > > >>>>>>>>>>>> 
> > > > >>>>>>>>>>>> /* Size of the L2 (and L3, etc) page tables.  */
> > > > >>>>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> > > > >>>>>>>>>>>> +#define ADDR_SPACE_BITS 64
> > > > >>>>>>>>>>>> 
> > > > >>>>>>>>>>>> #define P_L2_BITS 10
> > > > >>>>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
> > > > >>>>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> > > > >>>>>>>>>>>> {
> > > > >>>>>>>>>>>>    system_memory = g_malloc(sizeof(*system_memory));
> > > > >>>>>>>>>>>> 
> > > > >>>>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
> > > > >>>>>>>>>>>> -
> > > > >>>>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
> > > > >>>>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
> > > > >>>>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> > > > >>>>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> > > > >>>>>>>>>>>>    address_space_init(&address_space_memory, system_memory, "memory");
> > > > >>>>>>>>>>>> 
> > > > >>>>>>>>>>>>    system_io = g_malloc(sizeof(*system_io));
> > > > >>>>>>>>>>> 
> > > > >>>>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
> > > > >>>>>>>>>>> BARs that I'm not sure how to handle.
> > > > >>>>>>>>>> 
> > > > >>>>>>>>>> BARs are often disabled during sizing. Maybe you
> > > > >>>>>>>>>> don't detect BAR being disabled?
> > > > >>>>>>>>> 
> > > > >>>>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> > > > >>>>>>>>> the sizing and memory region updates for the BARs, vfio is just a
> > > > >>>>>>>>> pass-through here.
> > > > >>>>>>>> 
> > > > >>>>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
> > > > >>>>>>>> while I/O & memory are enabled in the command register.  Thanks,
> > > > >>>>>>>> 
> > > > >>>>>>>> Alex
> > > > >>>>>>> 
> > > > >>>>>>> OK then from QEMU POV this BAR value is not special at all.
> > > > >>>>>> 
> > > > >>>>>> Unfortunately
> > > > >>>>>> 
> > > > >>>>>>>>>>> After this patch I get vfio
> > > > >>>>>>>>>>> traces like this:
> > > > >>>>>>>>>>> 
> > > > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > > > >>>>>>>>>>> (save lower 32bits of BAR)
> > > > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > > > >>>>>>>>>>> (write mask to BAR)
> > > > >>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> > > > >>>>>>>>>>> (memory region gets unmapped)
> > > > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > > > >>>>>>>>>>> (read size mask)
> > > > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > > > >>>>>>>>>>> (restore BAR)
> > > > >>>>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > > > >>>>>>>>>>> (memory region re-mapped)
> > > > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > > > >>>>>>>>>>> (save upper 32bits of BAR)
> > > > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > > > >>>>>>>>>>> (write mask to BAR)
> > > > >>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> > > > >>>>>>>>>>> (memory region gets unmapped)
> > > > >>>>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > > > >>>>>>>>>>> (memory region gets re-mapped with new address)
> > > > >>>>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > > > >>>>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
> > > > >>>>>>>>>> 
> > > > >>>>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
> > > > >>>>>>>>> 
> > > > >>>>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
> > > > >>>>>>> 
> > > > >>>>>>> Why can't you? Generally memory core let you find out easily.
> > > > >>>>>> 
> > > > >>>>>> My MemoryListener is setup for &address_space_memory and I then filter
> > > > >>>>>> out anything that's not memory_region_is_ram().  This still gets
> > > > >>>>>> through, so how do I easily find out?
> > > > >>>>>> 
> > > > >>>>>>> But in this case it's vfio device itself that is sized so for sure you
> > > > >>>>>>> know it's MMIO.
> > > > >>>>>> 
> > > > >>>>>> How so?  I have a MemoryListener as described above and pass everything
> > > > >>>>>> through to the IOMMU.  I suppose I could look through all the
> > > > >>>>>> VFIODevices and check if the MemoryRegion matches, but that seems really
> > > > >>>>>> ugly.
> > > > >>>>>> 
> > > > >>>>>>> Maybe you will have same issue if there's another device with a 64 bit
> > > > >>>>>>> bar though, like ivshmem?
> > > > >>>>>> 
> > > > >>>>>> Perhaps, I suspect I'll see anything that registers their BAR
> > > > >>>>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
> > > > >>>>> 
> > > > >>>>> Must be a 64 bit BAR to trigger the issue though.
> > > > >>>>> 
> > > > >>>>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
> > > > >>>>>>>>> that we might be able to take advantage of with GPU passthrough.
> > > > >>>>>>>>> 
> > > > >>>>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
> > > > >>>>>>>>>>> address, presumably because it was beyond the address space of the PCI
> > > > >>>>>>>>>>> window.  This address is clearly not in a PCI MMIO space, so why are we
> > > > >>>>>>>>>>> allowing it to be realized in the system address space at this location?
> > > > >>>>>>>>>>> Thanks,
> > > > >>>>>>>>>>> 
> > > > >>>>>>>>>>> Alex
> > > > >>>>>>>>>> 
> > > > >>>>>>>>>> Why do you think it is not in PCI MMIO space?
> > > > >>>>>>>>>> True, CPU can't access this address but other pci devices can.
> > > > >>>>>>>>> 
> > > > >>>>>>>>> What happens on real hardware when an address like this is programmed to
> > > > >>>>>>>>> a device?  The CPU doesn't have the physical bits to access it.  I have
> > > > >>>>>>>>> serious doubts that another PCI device would be able to access it
> > > > >>>>>>>>> either.  Maybe in some limited scenario where the devices are on the
> > > > >>>>>>>>> same conventional PCI bus.  In the typical case, PCI addresses are
> > > > >>>>>>>>> always limited by some kind of aperture, whether that's explicit in
> > > > >>>>>>>>> bridge windows or implicit in hardware design (and perhaps made explicit
> > > > >>>>>>>>> in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> > > > >>>>>>>>> would I do it in a way that still allows real 64bit MMIO to be
> > > > >>>>>>>>> programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> > > > >>>>>>>>> 
> > > > >>>>>>>>> Alex
> > > > >>>>>>> 
> > > > >>>>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
> > > > >>>>>>> full 64 bit addresses must be allowed and hardware validation
> > > > >>>>>>> test suites normally check that it actually does work
> > > > >>>>>>> if it happens.
> > > > >>>>>> 
> > > > >>>>>> Sure, PCI devices themselves, but the chipset typically has defined
> > > > >>>>>> routing, that's more what I'm referring to.  There are generally only
> > > > >>>>>> fixed address windows for RAM vs MMIO.
> > > > >>>>> 
> > > > >>>>> The physical chipset? Likely - in the presence of IOMMU.
> > > > >>>>> Without that, devices can talk to each other without going
> > > > >>>>> through chipset, and bridge spec is very explicit that
> > > > >>>>> full 64 bit addressing must be supported.
> > > > >>>>> 
> > > > >>>>> So as long as we don't emulate an IOMMU,
> > > > >>>>> guest will normally think it's okay to use any address.
> > > > >>>>> 
> > > > >>>>>>> Yes, if there's a bridge somewhere on the path that bridge's
> > > > >>>>>>> windows would protect you, but pci already does this filtering:
> > > > >>>>>>> if you see this address in the memory map this means
> > > > >>>>>>> your virtual device is on root bus.
> > > > >>>>>>> 
> > > > >>>>>>> So I think it's the other way around: if VFIO requires specific
> > > > >>>>>>> address ranges to be assigned to devices, it should give this
> > > > >>>>>>> info to qemu and qemu can give this to guest.
> > > > >>>>>>> Then anything outside that range can be ignored by VFIO.
> > > > >>>>>> 
> > > > >>>>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
> > > > >>>>>> currently no way to find out the address width of the IOMMU.  We've been
> > > > >>>>>> getting by because it's safely close enough to the CPU address width to
> > > > >>>>>> not be a concern until we start exposing things at the top of the 64bit
> > > > >>>>>> address space.  Maybe I can safely ignore anything above
> > > > >>>>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> > > > >>>>>> 
> > > > >>>>>> Alex
> > > > >>>>> 
> > > > >>>>> I think it's not related to target CPU at all - it's a host limitation.
> > > > >>>>> So just make up your own constant, maybe depending on host architecture.
> > > > >>>>> Long term add an ioctl to query it.
> > > > >>>> 
> > > > >>>> It's a hardware limitation which I'd imagine has some loose ties to the
> > > > >>>> physical address bits of the CPU.
> > > > >>>> 
> > > > >>>>> Also, we can add a fwcfg interface to tell bios that it should avoid
> > > > >>>>> placing BARs above some address.
> > > > >>>> 
> > > > >>>> That doesn't help this case, it's a spurious mapping caused by sizing
> > > > >>>> the BARs with them enabled.  We may still want such a thing to feed into
> > > > >>>> building ACPI tables though.
> > > > >>> 
> > > > >>> Well the point is that if you want BIOS to avoid
> > > > >>> specific addresses, you need to tell it what to avoid.
> > > > >>> But neither BIOS nor ACPI actually cover the range above
> > > > >>> 2^48 ATM so it's not a high priority.
> > > > >>> 
> > > > >>>>> Since it's a vfio limitation I think it should be a vfio API, along the
> > > > >>>>> lines of vfio_get_addr_space_bits(void).
> > > > >>>>> (Is this true btw? legacy assignment doesn't have this problem?)
> > > > >>>> 
> > > > >>>> It's an IOMMU hardware limitation, legacy assignment has the same
> > > > >>>> problem.  It looks like legacy will abort() in QEMU for the failed
> > > > >>>> mapping and I'm planning to tighten vfio to also kill the VM for failed
> > > > >>>> mappings.  In the short term, I think I'll ignore any mappings above
> > > > >>>> TARGET_PHYS_ADDR_SPACE_BITS,
> > > > >>> 
> > > > >>> That seems very wrong. It will still fail on an x86 host if we are
> > > > >>> emulating a CPU with full 64 bit addressing. The limitation is on the
> > > > >>> host side; there's no real reason to tie it to the target.
> > > > > 
> > > > > I doubt vfio would be the only thing broken in that case.
> > > > > 
> > > > >>>> long term vfio already has an IOMMU info
> > > > >>>> ioctl that we could use to return this information, but we'll need to
> > > > >>>> figure out how to get it out of the IOMMU driver first.
> > > > >>>> Thanks,
> > > > >>>> 
> > > > >>>> Alex
> > > > >>> 
> > > > >>> Short term, just assume 48 bits on x86.
> > > > > 
> > > > > I hate to pick an arbitrary value since we have a very specific mapping
> > > > > we're trying to avoid.  Perhaps a better option is to skip anything
> > > > > where:
> > > > > 
> > > > >        MemoryRegionSection.offset_within_address_space >
> > > > >        ~MemoryRegionSection.offset_within_address_space
> > > > > 
> > > > >>> We need to figure out what's the limitation on ppc and arm -
> > > > >>> maybe there's none and it can address full 64 bit range.
> > > > >> 
> > > > >> IIUC on PPC and ARM you always have BAR windows where things can get mapped into. Unlike x86 where the full physical address range can be overlaid by BARs.
> > > > >> 
> > > > >> Or did I misunderstand the question?
> > > > > 
> > > > > Sounds right, if either BAR mappings outside the window will not be
> > > > > realized in the memory space or the IOMMU has a full 64bit address
> > > > > space, there's no problem.  Here we have an intermediate step in the BAR
> > > > > sizing producing a stray mapping that the IOMMU hardware can't handle.
> > > > > Even if we could handle it, it's not clear that we want to.  On AMD-Vi
> > > > > the IOMMU page tables can grow to 6 levels deep.  A stray mapping like
> > > > > this then causes space and time overhead until the tables are pruned
> > > > > back down.  Thanks,
> > > > 
> > > > I thought sizing is hard-defined as setting the BAR to
> > > > -1? Can't we check for that one special case and treat it as "not mapped, but tell the guest the size in config space"?
> > > 
> > > PCI doesn't want to handle this as anything special to differentiate a
> > > sizing mask from a valid BAR address.  I agree though, I'd prefer to
> > > never see a spurious address like this in my MemoryListener.
> > 
> > It's more a can't than doesn't want to: it's a 64 bit BAR, it's not
> > set to all ones atomically.
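For illustration, a rough sketch of the usual firmware sizing sequence for a
64-bit memory BAR (the pci_config_read32/write32 accessors are hypothetical
helpers, not QEMU or SeaBIOS functions).  Because the two halves are written
separately, the device is briefly programmed with all ones in the upper half
only, which is exactly the intermediate fffffffffebe0000 mapping in the trace:

    #include <stdint.h>

    /* Hypothetical config-space accessors, for the sketch only. */
    uint32_t pci_config_read32(int bdf, int off);
    void pci_config_write32(int bdf, int off, uint32_t val);

    static uint64_t size_64bit_mem_bar(int bdf, int bar)
    {
        uint32_t lo = pci_config_read32(bdf, bar);      /* save low half  */
        uint32_t hi = pci_config_read32(bdf, bar + 4);  /* save high half */

        /* Size the low half, then restore it. */
        pci_config_write32(bdf, bar, 0xffffffff);
        uint32_t size_lo = pci_config_read32(bdf, bar);
        pci_config_write32(bdf, bar, lo);

        /* Size the high half.  Between this write and the restore below,
         * the BAR reads back as 0xffffffff,febe0000 in the trace above. */
        pci_config_write32(bdf, bar + 4, 0xffffffff);
        uint32_t size_hi = pci_config_read32(bdf, bar + 4);
        pci_config_write32(bdf, bar + 4, hi);

        /* Mask the low flag bits and compute the size from the mask. */
        return ~(((uint64_t)size_hi << 32) | (size_lo & ~0xfULL)) + 1;
    }
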
> > 
> > Also, while it doesn't address this fully (same issue can happen
> > e.g. with ivshmem), do you think we should distinguish these BARs mapped
> > from vfio / device assignment in qemu somehow?
> > 
> > In particular, even when it has sane addresses:
> > device really can not DMA into its own BAR, that's a spec violation
> > so in theory can do anything including crashing the system.
> > I don't know what happens in practice but
> > if you are programming IOMMU to forward transactions back to
> > device that originated it, you are not doing it any favors.
> 
> I might concede that peer-to-peer is more trouble than it's worth if I
> had a convenient way to ignore MMIO mappings in my MemoryListener, but I
> don't.

Well for VFIO devices you are creating these mappings so we surely
can find a way for you to check that.
Doesn't each segment point back at the memory region that created it?
Then you can just check that.
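Something like this minimal sketch, say; vfio_is_own_bar() is a hypothetical
helper that compares the section's MemoryRegion against the regions vfio
registered for the device's own BARs:

    #include "exec/memory.h"

    bool vfio_is_own_bar(MemoryRegion *mr);     /* hypothetical helper */

    static void listener_region_add(MemoryListener *listener,
                                    MemoryRegionSection *section)
    {
        if (!memory_region_is_ram(section->mr)) {
            return;                     /* not RAM-backed at all: skip */
        }
        if (vfio_is_own_bar(section->mr)) {
            return;                     /* mmap of one of our BARs: skip */
        }
        /* ... otherwise hand the range to vfio_dma_map() as today ... */
    }
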

>  Self-DMA is really not the intent of doing the mapping, but
> peer-to-peer does have merit.
> 
> > I also note that if someone tries zero copy transmit out of such an
> > address, get user pages will fail.
> > I think this means tun zero copy transmit needs to fall-back
> > on copy from user on get user pages failure.
> > 
> > Jason, what's your thinking on this?
> > 
> 
> 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2014-01-14 15:49                           ` Alex Williamson
@ 2014-01-14 16:07                             ` Michael S. Tsirkin
  2014-01-14 17:49                             ` Mike Day
  1 sibling, 0 replies; 74+ messages in thread
From: Michael S. Tsirkin @ 2014-01-14 16:07 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Peter Maydell, QEMU Developers, Alexey Kardashevskiy,
	Alexander Graf, Luiz Capitulino, Paolo Bonzini, David Gibson

On Tue, Jan 14, 2014 at 08:49:39AM -0700, Alex Williamson wrote:
> On Tue, 2014-01-14 at 14:21 +0200, Michael S. Tsirkin wrote:
> > On Mon, Jan 13, 2014 at 02:39:04PM -0700, Alex Williamson wrote:
> > > On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
> > > > On 12.01.2014, at 08:54, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > 
> > > > > On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> > > > >> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> > > > >>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> > > > >>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> > > > >>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> > > > >>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> > > > >>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> > > > >>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> > > > >>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > > > >>>>>>>>>> From: Paolo Bonzini <pbonzini@redhat.com>
> > > > >>>>>>>>>> 
> > > > >>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
> > > > >>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
> > > > >>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
> > > > >>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> > > > >>>>>>>>>> consequently messing up the computations.
> > > > >>>>>>>>>> 
> > > > >>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
> > > > >>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> > > > >>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
> > > > >>>>>>>>>> address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> > > > >>>>>>>>>> not 2^64.  But we get it anyway because phys_page_find ignores the upper
> > > > >>>>>>>>>> bits of the physical address.  In address_space_translate_internal then
> > > > >>>>>>>>>> 
> > > > >>>>>>>>>>    diff = int128_sub(section->mr->size, int128_make64(addr));
> > > > >>>>>>>>>>    *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> > > > >>>>>>>>>> 
> > > > >>>>>>>>>> diff becomes negative, and int128_get64 booms.
> > > > >>>>>>>>>> 
> > > > >>>>>>>>>> The size of the PCI address space region should be fixed anyway.
> > > > >>>>>>>>>> 
> > > > >>>>>>>>>> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> > > > >>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > > > >>>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > >>>>>>>>>> ---
> > > > >>>>>>>>>> exec.c | 8 ++------
> > > > >>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
> > > > >>>>>>>>>> 
> > > > >>>>>>>>>> diff --git a/exec.c b/exec.c
> > > > >>>>>>>>>> index 7e5ce93..f907f5f 100644
> > > > >>>>>>>>>> --- a/exec.c
> > > > >>>>>>>>>> +++ b/exec.c
> > > > >>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
> > > > >>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> > > > >>>>>>>>>> 
> > > > >>>>>>>>>> /* Size of the L2 (and L3, etc) page tables.  */
> > > > >>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> > > > >>>>>>>>>> +#define ADDR_SPACE_BITS 64
> > > > >>>>>>>>>> 
> > > > >>>>>>>>>> #define P_L2_BITS 10
> > > > >>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
> > > > >>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> > > > >>>>>>>>>> {
> > > > >>>>>>>>>>     system_memory = g_malloc(sizeof(*system_memory));
> > > > >>>>>>>>>> 
> > > > >>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
> > > > >>>>>>>>>> -
> > > > >>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
> > > > >>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
> > > > >>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> > > > >>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> > > > >>>>>>>>>>     address_space_init(&address_space_memory, system_memory, "memory");
> > > > >>>>>>>>>> 
> > > > >>>>>>>>>>     system_io = g_malloc(sizeof(*system_io));
> > > > >>>>>>>>> 
> > > > >>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
> > > > >>>>>>>>> BARs that I'm not sure how to handle.
> > > > >>>>>>>> 
> > > > >>>>>>>> BARs are often disabled during sizing. Maybe you
> > > > >>>>>>>> don't detect BAR being disabled?
> > > > >>>>>>> 
> > > > >>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> > > > >>>>>>> the sizing and memory region updates for the BARs, vfio is just a
> > > > >>>>>>> pass-through here.
> > > > >>>>>> 
> > > > >>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
> > > > >>>>>> while I/O & memory are enabled in the command register.  Thanks,
> > > > >>>>>> 
> > > > >>>>>> Alex
> > > > >>>>> 
> > > > >>>>> OK then from QEMU POV this BAR value is not special at all.
> > > > >>>> 
> > > > >>>> Unfortunately
> > > > >>>> 
> > > > >>>>>>>>> After this patch I get vfio
> > > > >>>>>>>>> traces like this:
> > > > >>>>>>>>> 
> > > > >>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > > > >>>>>>>>> (save lower 32bits of BAR)
> > > > >>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > > > >>>>>>>>> (write mask to BAR)
> > > > >>>>>>>>> vfio: region_del febe0000 - febe3fff
> > > > >>>>>>>>> (memory region gets unmapped)
> > > > >>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > > > >>>>>>>>> (read size mask)
> > > > >>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > > > >>>>>>>>> (restore BAR)
> > > > >>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > > > >>>>>>>>> (memory region re-mapped)
> > > > >>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > > > >>>>>>>>> (save upper 32bits of BAR)
> > > > >>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > > > >>>>>>>>> (write mask to BAR)
> > > > >>>>>>>>> vfio: region_del febe0000 - febe3fff
> > > > >>>>>>>>> (memory region gets unmapped)
> > > > >>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > > > >>>>>>>>> (memory region gets re-mapped with new address)
> > > > >>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > > > >>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
> > > > >>>>>>>>> 
> > > > >>>>>>>> 
> > > > >>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
> > > > >>>>>>> 
> > > > >>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
> > > > >>>>> 
> > > > >>>>> Why can't you? Generally memory core let you find out easily.
> > > > >>>> 
> > > > >>>> My MemoryListener is set up for &address_space_memory and I then filter
> > > > >>>> out anything that's not memory_region_is_ram().  This still gets
> > > > >>>> through, so how do I easily find out?
> > > > >>>> 
> > > > >>>>> But in this case it's vfio device itself that is sized so for sure you
> > > > >>>>> know it's MMIO.
> > > > >>>> 
> > > > >>>> How so?  I have a MemoryListener as described above and pass everything
> > > > >>>> through to the IOMMU.  I suppose I could look through all the
> > > > >>>> VFIODevices and check if the MemoryRegion matches, but that seems really
> > > > >>>> ugly.
> > > > >>>> 
> > > > >>>>> Maybe you will have same issue if there's another device with a 64 bit
> > > > >>>>> bar though, like ivshmem?
> > > > >>>> 
> > > > >>>> Perhaps, I suspect I'll see anything that registers their BAR
> > > > >>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
> > > > >>> 
> > > > >>> Must be a 64 bit BAR to trigger the issue though.
> > > > >>> 
> > > > >>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
> > > > >>>>>>> that we might be able to take advantage of with GPU passthrough.
> > > > >>>>>>> 
> > > > >>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
> > > > >>>>>>>>> address, presumably because it was beyond the address space of the PCI
> > > > >>>>>>>>> window.  This address is clearly not in a PCI MMIO space, so why are we
> > > > >>>>>>>>> allowing it to be realized in the system address space at this location?
> > > > >>>>>>>>> Thanks,
> > > > >>>>>>>>> 
> > > > >>>>>>>>> Alex
> > > > >>>>>>>> 
> > > > >>>>>>>> Why do you think it is not in PCI MMIO space?
> > > > >>>>>>>> True, CPU can't access this address but other pci devices can.
> > > > >>>>>>> 
> > > > >>>>>>> What happens on real hardware when an address like this is programmed to
> > > > >>>>>>> a device?  The CPU doesn't have the physical bits to access it.  I have
> > > > >>>>>>> serious doubts that another PCI device would be able to access it
> > > > >>>>>>> either.  Maybe in some limited scenario where the devices are on the
> > > > >>>>>>> same conventional PCI bus.  In the typical case, PCI addresses are
> > > > >>>>>>> always limited by some kind of aperture, whether that's explicit in
> > > > >>>>>>> bridge windows or implicit in hardware design (and perhaps made explicit
> > > > >>>>>>> in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> > > > >>>>>>> would I do it in a way that still allows real 64bit MMIO to be
> > > > >>>>>>> programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> > > > >>>>>>> 
> > > > >>>>>>> Alex
> > > > >>>>>> 
> > > > >>>>> 
> > > > >>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
> > > > >>>>> full 64 bit addresses must be allowed and hardware validation
> > > > >>>>> test suites normally check that it actually does work
> > > > >>>>> if it happens.
> > > > >>>> 
> > > > >>>> Sure, PCI devices themselves, but the chipset typically has defined
> > > > >>>> routing, that's more what I'm referring to.  There are generally only
> > > > >>>> fixed address windows for RAM vs MMIO.
> > > > >>> 
> > > > >>> The physical chipset? Likely - in the presence of IOMMU.
> > > > >>> Without that, devices can talk to each other without going
> > > > >>> through chipset, and bridge spec is very explicit that
> > > > >>> full 64 bit addressing must be supported.
> > > > >>> 
> > > > >>> So as long as we don't emulate an IOMMU,
> > > > >>> guest will normally think it's okay to use any address.
> > > > >>> 
> > > > >>>>> Yes, if there's a bridge somewhere on the path that bridge's
> > > > >>>>> windows would protect you, but pci already does this filtering:
> > > > >>>>> if you see this address in the memory map this means
> > > > >>>>> your virtual device is on root bus.
> > > > >>>>> 
> > > > >>>>> So I think it's the other way around: if VFIO requires specific
> > > > >>>>> address ranges to be assigned to devices, it should give this
> > > > >>>>> info to qemu and qemu can give this to guest.
> > > > >>>>> Then anything outside that range can be ignored by VFIO.
> > > > >>>> 
> > > > >>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
> > > > >>>> currently no way to find out the address width of the IOMMU.  We've been
> > > > >>>> getting by because it's safely close enough to the CPU address width to
> > > > >>>> not be a concern until we start exposing things at the top of the 64bit
> > > > >>>> address space.  Maybe I can safely ignore anything above
> > > > >>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> > > > >>>> 
> > > > >>>> Alex
> > > > >>> 
> > > > >>> I think it's not related to target CPU at all - it's a host limitation.
> > > > >>> So just make up your own constant, maybe depending on host architecture.
> > > > >>> Long term add an ioctl to query it.
> > > > >> 
> > > > >> It's a hardware limitation which I'd imagine has some loose ties to the
> > > > >> physical address bits of the CPU.
> > > > >> 
> > > > >>> Also, we can add a fwcfg interface to tell bios that it should avoid
> > > > >>> placing BARs above some address.
> > > > >> 
> > > > >> That doesn't help this case, it's a spurious mapping caused by sizing
> > > > >> the BARs with them enabled.  We may still want such a thing to feed into
> > > > >> building ACPI tables though.
> > > > > 
> > > > > Well the point is that if you want BIOS to avoid
> > > > > specific addresses, you need to tell it what to avoid.
> > > > > But neither BIOS nor ACPI actually cover the range above
> > > > > 2^48 ATM so it's not a high priority.
> > > > > 
> > > > >>> Since it's a vfio limitation I think it should be a vfio API, along the
> > > > >>> lines of vfio_get_addr_space_bits(void).
> > > > >>> (Is this true btw? legacy assignment doesn't have this problem?)
> > > > >> 
> > > > >> It's an IOMMU hardware limitation, legacy assignment has the same
> > > > >> problem.  It looks like legacy will abort() in QEMU for the failed
> > > > >> mapping and I'm planning to tighten vfio to also kill the VM for failed
> > > > >> mappings.  In the short term, I think I'll ignore any mappings above
> > > > >> TARGET_PHYS_ADDR_SPACE_BITS,
> > > > > 
> > > > > That seems very wrong. It will still fail on an x86 host if we are
> > > > > emulating a CPU with full 64 bit addressing. The limitation is on the
> > > > > host side; there's no real reason to tie it to the target.
> > > 
> > > I doubt vfio would be the only thing broken in that case.
> > 
> > A bit cryptic.
> > target-s390x/cpu.h:#define TARGET_PHYS_ADDR_SPACE_BITS 64
> > So qemu does emulate at least one full-64 bit CPU.
> > 
> > It's possible that something limits PCI BAR address
> > there, it might or might not be architectural.
> > 
> > > > >> long term vfio already has an IOMMU info
> > > > >> ioctl that we could use to return this information, but we'll need to
> > > > >> figure out how to get it out of the IOMMU driver first.
> > > > >> Thanks,
> > > > >> 
> > > > >> Alex
> > > > > 
> > > > > Short term, just assume 48 bits on x86.
> > > 
> > > I hate to pick an arbitrary value since we have a very specific mapping
> > > we're trying to avoid.
> > 
> > Well it's not a specific mapping really.
> > 
> > Any mapping outside host IOMMU would not work.
> > guests happen to trigger it while sizing but again
> > they are allowed to write anything into BARs really.
> > 
> > >  Perhaps a better option is to skip anything
> > > where:
> > > 
> > >         MemoryRegionSection.offset_within_address_space >
> > >         ~MemoryRegionSection.offset_within_address_space
> > 
> > 
> > This merely checks that high bit is 1, doesn't it?
> > 
> > So this equivalently assumes 63 bits on x86, if you prefer
> > 63 and not 48, that's fine with me.
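(For an unsigned 64-bit hwaddr the proposed test reduces to a single bit
check; a trivial sketch:)

    #include <stdbool.h>
    #include <stdint.h>

    /* addr and ~addr always differ, and whichever of the two has bit 63
     * set is the larger, so "addr > ~addr" is the same as testing bit 63.
     */
    static bool above_63bit_limit(uint64_t addr)
    {
        return addr & (1ULL << 63);     /* equivalent to: addr > ~addr */
    }
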
> > 
> > 
> > 
> > 
> > > > > We need to figure out what's the limitation on ppc and arm -
> > > > > maybe there's none and it can address full 64 bit range.
> > > > 
> > > > IIUC on PPC and ARM you always have BAR windows where things can get mapped into. Unlike x86 where the full physical address range can be overlaid by BARs.
> > > > 
> > > > Or did I misunderstand the question?
> > > 
> > > Sounds right, if either BAR mappings outside the window will not be
> > > realized in the memory space or the IOMMU has a full 64bit address
> > > space, there's no problem.  Here we have an intermediate step in the BAR
> > > sizing producing a stray mapping that the IOMMU hardware can't handle.
> > > Even if we could handle it, it's not clear that we want to.  On AMD-Vi
> > > the IOMMU page tables can grow to 6 levels deep.  A stray mapping like
> > > this then causes space and time overhead until the tables are pruned
> > > back down.  Thanks,
> > > 
> > > Alex
> > 
> > In the common case of a single VFIO device per IOMMU, you really should not
> > add its own BARs in the IOMMU. That's not a complete fix
> > but it addresses the overhead concern that you mention here.
> 
> That seems like a big assumption.  We now have support for assigned GPUs
> which can be paired to do SLI.  One way they might do SLI is via
> peer-to-peer DMA.  We can enable that by mapping device BARs through the
> IOMMU.  So it seems quite valid to want to map these.

Absolutely. But then the question is how do we know the guest isn't
intentionally mapping these at addresses inaccessible to the guest VCPU?

> If we choose not to map them, how do we distinguish them from guest RAM?
> There's no MemoryRegion flag that I'm aware of to distinguish a ram_ptr
> that points to a chunk of guest memory from one that points to the mmap
> of a device BAR.

We could invent one, it's not hard.
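A sketch of what I mean, purely hypothetical at this point; nothing like
this exists in the memory API today:

    /* Hypothetical: tag RAM-backed MemoryRegions that are really mmaps of
     * device BARs, so DMA listeners can tell them apart from guest RAM.
     */
    bool memory_region_is_ram_device(MemoryRegion *mr);  /* does not exist yet */

    static bool should_dma_map(MemoryRegionSection *section)
    {
        return memory_region_is_ram(section->mr) &&
               !memory_region_is_ram_device(section->mr);
    }
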

> I think I'd need to explicitly walk all of the vfio
> device and try to match the MemoryRegion pointer to one of my devices.
> That only solves the problem for vfio devices and not ivshmem devices or
> pci-assign devices.

Not sure what's meant by "the problem". I merely say that the iommu
should not loop back transactions from a device to itself.

> Another idea I was trying to implement is that we can enable the mmap
> MemoryRegion lazily on first access.  That means we would ignore these
> spurious bogus mappings because they never get accessed.  Two problems
> though, first how/where to disable the mmap MemoryRegion (modifying the
> memory map from within MemoryListener.region_del seems to do bad
> things), second we can't handle the case of a BAR only being accessed
> via peer-to-peer (which seems unlikely).  Perhaps the nail in the coffin
> again is that it only solves the problem for vfio devices, spurious
> mappings from other devices backed by ram_ptr will still fault.  Thanks,
> 
> Alex

In the end, I think it's an iommu limitation. If you agree, we should
handle it as such: expose an API telling QEMU what the limitation is,
and we'll do our best to make sure guests don't use it for valid BARs,
using ACPI etc.
You will then be able to assume anything outside the valid range
can be skipped.
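As a sketch of the short-term shape of that API (the ioctl plumbing to
actually query the IOMMU driver does not exist yet, so the constant below is
only a placeholder):

    static unsigned int vfio_get_addr_space_bits(void)
    {
        /* Long term: query the host IOMMU driver, e.g. via an extended
         * VFIO_IOMMU_GET_INFO.  Short term: a host-side constant. */
        return 48;
    }

    static bool vfio_listener_skip_section(MemoryRegionSection *section)
    {
        hwaddr limit = 1ULL << vfio_get_addr_space_bits();

        return section->offset_within_address_space >= limit;
    }
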

-- 
MST

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2014-01-14 16:03                                 ` Michael S. Tsirkin
@ 2014-01-14 16:15                                   ` Alex Williamson
  2014-01-14 16:18                                     ` Michael S. Tsirkin
  0 siblings, 1 reply; 74+ messages in thread
From: Alex Williamson @ 2014-01-14 16:15 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Peter Maydell, QEMU Developers, Alexey Kardashevskiy, Jason Wang,
	Alexander Graf, Luiz Capitulino, Paolo Bonzini, David Gibson

On Tue, 2014-01-14 at 18:03 +0200, Michael S. Tsirkin wrote:
> On Tue, Jan 14, 2014 at 08:57:58AM -0700, Alex Williamson wrote:
> > On Tue, 2014-01-14 at 14:07 +0200, Michael S. Tsirkin wrote:
> > > On Mon, Jan 13, 2014 at 03:48:11PM -0700, Alex Williamson wrote:
> > > > On Mon, 2014-01-13 at 22:48 +0100, Alexander Graf wrote:
> > > > > 
> > > > > > Am 13.01.2014 um 22:39 schrieb Alex Williamson <alex.williamson@redhat.com>:
> > > > > > 
> > > > > >> On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
> > > > > >>> On 12.01.2014, at 08:54, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > > >>> 
> > > > > >>>> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> > > > > >>>>> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> > > > > >>>>>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> > > > > >>>>>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> > > > > >>>>>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> > > > > >>>>>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> > > > > >>>>>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> > > > > >>>>>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> > > > > >>>>>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > > > > >>>>>>>>>>>> From: Paolo Bonzini <pbonzini@redhat.com>
> > > > > >>>>>>>>>>>> 
> > > > > >>>>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
> > > > > >>>>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
> > > > > >>>>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
> > > > > >>>>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> > > > > >>>>>>>>>>>> consequently messing up the computations.
> > > > > >>>>>>>>>>>> 
> > > > > >>>>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
> > > > > >>>>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> > > > > >>>>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
> > > > > >>>>>>>>>>>> address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> > > > > >>>>>>>>>>>> not 2^64.  But we get it anyway because phys_page_find ignores the upper
> > > > > >>>>>>>>>>>> bits of the physical address.  In address_space_translate_internal then
> > > > > >>>>>>>>>>>> 
> > > > > >>>>>>>>>>>>   diff = int128_sub(section->mr->size, int128_make64(addr));
> > > > > >>>>>>>>>>>>   *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> > > > > >>>>>>>>>>>> 
> > > > > >>>>>>>>>>>> diff becomes negative, and int128_get64 booms.
> > > > > >>>>>>>>>>>> 
> > > > > >>>>>>>>>>>> The size of the PCI address space region should be fixed anyway.
> > > > > >>>>>>>>>>>> 
> > > > > >>>>>>>>>>>> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> > > > > >>>>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > > > > >>>>>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > > >>>>>>>>>>>> ---
> > > > > >>>>>>>>>>>> exec.c | 8 ++------
> > > > > >>>>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
> > > > > >>>>>>>>>>>> 
> > > > > >>>>>>>>>>>> diff --git a/exec.c b/exec.c
> > > > > >>>>>>>>>>>> index 7e5ce93..f907f5f 100644
> > > > > >>>>>>>>>>>> --- a/exec.c
> > > > > >>>>>>>>>>>> +++ b/exec.c
> > > > > >>>>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
> > > > > >>>>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> > > > > >>>>>>>>>>>> 
> > > > > >>>>>>>>>>>> /* Size of the L2 (and L3, etc) page tables.  */
> > > > > >>>>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> > > > > >>>>>>>>>>>> +#define ADDR_SPACE_BITS 64
> > > > > >>>>>>>>>>>> 
> > > > > >>>>>>>>>>>> #define P_L2_BITS 10
> > > > > >>>>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
> > > > > >>>>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> > > > > >>>>>>>>>>>> {
> > > > > >>>>>>>>>>>>    system_memory = g_malloc(sizeof(*system_memory));
> > > > > >>>>>>>>>>>> 
> > > > > >>>>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
> > > > > >>>>>>>>>>>> -
> > > > > >>>>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
> > > > > >>>>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
> > > > > >>>>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> > > > > >>>>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> > > > > >>>>>>>>>>>>    address_space_init(&address_space_memory, system_memory, "memory");
> > > > > >>>>>>>>>>>> 
> > > > > >>>>>>>>>>>>    system_io = g_malloc(sizeof(*system_io));
> > > > > >>>>>>>>>>> 
> > > > > >>>>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
> > > > > >>>>>>>>>>> BARs that I'm not sure how to handle.
> > > > > >>>>>>>>>> 
> > > > > >>>>>>>>>> BARs are often disabled during sizing. Maybe you
> > > > > >>>>>>>>>> don't detect BAR being disabled?
> > > > > >>>>>>>>> 
> > > > > >>>>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> > > > > >>>>>>>>> the sizing and memory region updates for the BARs, vfio is just a
> > > > > >>>>>>>>> pass-through here.
> > > > > >>>>>>>> 
> > > > > >>>>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
> > > > > >>>>>>>> while I/O & memory are enabled in the command register.  Thanks,
> > > > > >>>>>>>> 
> > > > > >>>>>>>> Alex
> > > > > >>>>>>> 
> > > > > >>>>>>> OK then from QEMU POV this BAR value is not special at all.
> > > > > >>>>>> 
> > > > > >>>>>> Unfortunately
> > > > > >>>>>> 
> > > > > >>>>>>>>>>> After this patch I get vfio
> > > > > >>>>>>>>>>> traces like this:
> > > > > >>>>>>>>>>> 
> > > > > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > > > > >>>>>>>>>>> (save lower 32bits of BAR)
> > > > > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > > > > >>>>>>>>>>> (write mask to BAR)
> > > > > >>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> > > > > >>>>>>>>>>> (memory region gets unmapped)
> > > > > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > > > > >>>>>>>>>>> (read size mask)
> > > > > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > > > > >>>>>>>>>>> (restore BAR)
> > > > > >>>>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > > > > >>>>>>>>>>> (memory region re-mapped)
> > > > > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > > > > >>>>>>>>>>> (save upper 32bits of BAR)
> > > > > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > > > > >>>>>>>>>>> (write mask to BAR)
> > > > > >>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> > > > > >>>>>>>>>>> (memory region gets unmapped)
> > > > > >>>>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > > > > >>>>>>>>>>> (memory region gets re-mapped with new address)
> > > > > >>>>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > > > > >>>>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
> > > > > >>>>>>>>>> 
> > > > > >>>>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
> > > > > >>>>>>>>> 
> > > > > >>>>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
> > > > > >>>>>>> 
> > > > > >>>>>>> Why can't you? Generally memory core let you find out easily.
> > > > > >>>>>> 
> > > > > >>>>>> My MemoryListener is set up for &address_space_memory and I then filter
> > > > > >>>>>> out anything that's not memory_region_is_ram().  This still gets
> > > > > >>>>>> through, so how do I easily find out?
> > > > > >>>>>> 
> > > > > >>>>>>> But in this case it's vfio device itself that is sized so for sure you
> > > > > >>>>>>> know it's MMIO.
> > > > > >>>>>> 
> > > > > >>>>>> How so?  I have a MemoryListener as described above and pass everything
> > > > > >>>>>> through to the IOMMU.  I suppose I could look through all the
> > > > > >>>>>> VFIODevices and check if the MemoryRegion matches, but that seems really
> > > > > >>>>>> ugly.
> > > > > >>>>>> 
> > > > > >>>>>>> Maybe you will have same issue if there's another device with a 64 bit
> > > > > >>>>>>> bar though, like ivshmem?
> > > > > >>>>>> 
> > > > > >>>>>> Perhaps, I suspect I'll see anything that registers their BAR
> > > > > >>>>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
> > > > > >>>>> 
> > > > > >>>>> Must be a 64 bit BAR to trigger the issue though.
> > > > > >>>>> 
> > > > > >>>>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
> > > > > >>>>>>>>> that we might be able to take advantage of with GPU passthrough.
> > > > > >>>>>>>>> 
> > > > > >>>>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
> > > > > >>>>>>>>>>> address, presumably because it was beyond the address space of the PCI
> > > > > >>>>>>>>>>> window.  This address is clearly not in a PCI MMIO space, so why are we
> > > > > >>>>>>>>>>> allowing it to be realized in the system address space at this location?
> > > > > >>>>>>>>>>> Thanks,
> > > > > >>>>>>>>>>> 
> > > > > >>>>>>>>>>> Alex
> > > > > >>>>>>>>>> 
> > > > > >>>>>>>>>> Why do you think it is not in PCI MMIO space?
> > > > > >>>>>>>>>> True, CPU can't access this address but other pci devices can.
> > > > > >>>>>>>>> 
> > > > > >>>>>>>>> What happens on real hardware when an address like this is programmed to
> > > > > >>>>>>>>> a device?  The CPU doesn't have the physical bits to access it.  I have
> > > > > >>>>>>>>> serious doubts that another PCI device would be able to access it
> > > > > >>>>>>>>> either.  Maybe in some limited scenario where the devices are on the
> > > > > >>>>>>>>> same conventional PCI bus.  In the typical case, PCI addresses are
> > > > > >>>>>>>>> always limited by some kind of aperture, whether that's explicit in
> > > > > >>>>>>>>> bridge windows or implicit in hardware design (and perhaps made explicit
> > > > > >>>>>>>>> in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> > > > > >>>>>>>>> would I do it in a way that still allows real 64bit MMIO to be
> > > > > >>>>>>>>> programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> > > > > >>>>>>>>> 
> > > > > >>>>>>>>> Alex
> > > > > >>>>>>> 
> > > > > >>>>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
> > > > > >>>>>>> full 64 bit addresses must be allowed and hardware validation
> > > > > >>>>>>> test suites normally check that it actually does work
> > > > > >>>>>>> if it happens.
> > > > > >>>>>> 
> > > > > >>>>>> Sure, PCI devices themselves, but the chipset typically has defined
> > > > > >>>>>> routing, that's more what I'm referring to.  There are generally only
> > > > > >>>>>> fixed address windows for RAM vs MMIO.
> > > > > >>>>> 
> > > > > >>>>> The physical chipset? Likely - in the presence of IOMMU.
> > > > > >>>>> Without that, devices can talk to each other without going
> > > > > >>>>> through chipset, and bridge spec is very explicit that
> > > > > >>>>> full 64 bit addressing must be supported.
> > > > > >>>>> 
> > > > > >>>>> So as long as we don't emulate an IOMMU,
> > > > > >>>>> guest will normally think it's okay to use any address.
> > > > > >>>>> 
> > > > > >>>>>>> Yes, if there's a bridge somewhere on the path that bridge's
> > > > > >>>>>>> windows would protect you, but pci already does this filtering:
> > > > > >>>>>>> if you see this address in the memory map this means
> > > > > >>>>>>> your virtual device is on root bus.
> > > > > >>>>>>> 
> > > > > >>>>>>> So I think it's the other way around: if VFIO requires specific
> > > > > >>>>>>> address ranges to be assigned to devices, it should give this
> > > > > >>>>>>> info to qemu and qemu can give this to guest.
> > > > > >>>>>>> Then anything outside that range can be ignored by VFIO.
> > > > > >>>>>> 
> > > > > >>>>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
> > > > > >>>>>> currently no way to find out the address width of the IOMMU.  We've been
> > > > > >>>>>> getting by because it's safely close enough to the CPU address width to
> > > > > >>>>>> not be a concern until we start exposing things at the top of the 64bit
> > > > > >>>>>> address space.  Maybe I can safely ignore anything above
> > > > > >>>>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> > > > > >>>>>> 
> > > > > >>>>>> Alex
> > > > > >>>>> 
> > > > > >>>>> I think it's not related to target CPU at all - it's a host limitation.
> > > > > >>>>> So just make up your own constant, maybe depending on host architecture.
> > > > > >>>>> Long term add an ioctl to query it.
> > > > > >>>> 
> > > > > >>>> It's a hardware limitation which I'd imagine has some loose ties to the
> > > > > >>>> physical address bits of the CPU.
> > > > > >>>> 
> > > > > >>>>> Also, we can add a fwcfg interface to tell bios that it should avoid
> > > > > >>>>> placing BARs above some address.
> > > > > >>>> 
> > > > > >>>> That doesn't help this case, it's a spurious mapping caused by sizing
> > > > > >>>> the BARs with them enabled.  We may still want such a thing to feed into
> > > > > >>>> building ACPI tables though.
> > > > > >>> 
> > > > > >>> Well the point is that if you want BIOS to avoid
> > > > > >>> specific addresses, you need to tell it what to avoid.
> > > > > >>> But neither BIOS nor ACPI actually cover the range above
> > > > > >>> 2^48 ATM so it's not a high priority.
> > > > > >>> 
> > > > > >>>>> Since it's a vfio limitation I think it should be a vfio API, along the
> > > > > >>>>> lines of vfio_get_addr_space_bits(void).
> > > > > >>>>> (Is this true btw? legacy assignment doesn't have this problem?)
> > > > > >>>> 
> > > > > >>>> It's an IOMMU hardware limitation, legacy assignment has the same
> > > > > >>>> problem.  It looks like legacy will abort() in QEMU for the failed
> > > > > >>>> mapping and I'm planning to tighten vfio to also kill the VM for failed
> > > > > >>>> mappings.  In the short term, I think I'll ignore any mappings above
> > > > > >>>> TARGET_PHYS_ADDR_SPACE_BITS,
> > > > > >>> 
> > > > > >>> That seems very wrong. It will still fail on an x86 host if we are
> > > > > >>> emulating a CPU with full 64 bit addressing. The limitation is on the
> > > > > >>> host side; there's no real reason to tie it to the target.
> > > > > > 
> > > > > > I doubt vfio would be the only thing broken in that case.
> > > > > > 
> > > > > >>>> long term vfio already has an IOMMU info
> > > > > >>>> ioctl that we could use to return this information, but we'll need to
> > > > > >>>> figure out how to get it out of the IOMMU driver first.
> > > > > >>>> Thanks,
> > > > > >>>> 
> > > > > >>>> Alex
> > > > > >>> 
> > > > > >>> Short term, just assume 48 bits on x86.
> > > > > > 
> > > > > > I hate to pick an arbitrary value since we have a very specific mapping
> > > > > > we're trying to avoid.  Perhaps a better option is to skip anything
> > > > > > where:
> > > > > > 
> > > > > >        MemoryRegionSection.offset_within_address_space >
> > > > > >        ~MemoryRegionSection.offset_within_address_space
> > > > > > 
> > > > > >>> We need to figure out what's the limitation on ppc and arm -
> > > > > >>> maybe there's none and it can address full 64 bit range.
> > > > > >> 
> > > > > >> IIUC on PPC and ARM you always have BAR windows where things can get mapped into. Unlike x86 where the full physical address range can be overlaid by BARs.
> > > > > >> 
> > > > > >> Or did I misunderstand the question?
> > > > > > 
> > > > > > Sounds right, if either BAR mappings outside the window will not be
> > > > > > realized in the memory space or the IOMMU has a full 64bit address
> > > > > > space, there's no problem.  Here we have an intermediate step in the BAR
> > > > > > sizing producing a stray mapping that the IOMMU hardware can't handle.
> > > > > > Even if we could handle it, it's not clear that we want to.  On AMD-Vi
> > > > > > the IOMMU page tables can grow to 6 levels deep.  A stray mapping like
> > > > > > this then causes space and time overhead until the tables are pruned
> > > > > > back down.  Thanks,
> > > > > 
> > > > > I thought sizing is hard-defined as setting the BAR to
> > > > > -1? Can't we check for that one special case and treat it as "not mapped, but tell the guest the size in config space"?
> > > > 
> > > > PCI doesn't want to handle this as anything special to differentiate a
> > > > sizing mask from a valid BAR address.  I agree though, I'd prefer to
> > > > never see a spurious address like this in my MemoryListener.
> > > 
> > > It's more a can't than doesn't want to: it's a 64 bit BAR, it's not
> > > set to all ones atomically.
> > > 
> > > Also, while it doesn't address this fully (same issue can happen
> > > e.g. with ivshmem), do you think we should distinguish these BARs mapped
> > > from vfio / device assignment in qemu somehow?
> > > 
> > > In particular, even when it has sane addresses:
> > > device really can not DMA into its own BAR, that's a spec violation
> > > so in theory can do anything including crashing the system.
> > > I don't know what happens in practice but
> > > if you are programming IOMMU to forward transactions back to
> > > device that originated it, you are not doing it any favors.
> > 
> > I might concede that peer-to-peer is more trouble than it's worth if I
> > had a convenient way to ignore MMIO mappings in my MemoryListener, but I
> > don't.
> 
> Well for VFIO devices you are creating these mappings so we surely
> can find a way for you to check that.
> Doesn't each segment point back at the memory region that created it?
> Then you can just check that.

It's a fairly heavy-weight search and it only avoids vfio devices, so it
feels like it's just delaying a real solution.
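(i.e. something like the walk below on every region_add callback; structure
and field names are schematic, not the actual vfio code:)

    /* Schematic only: compare the section's MemoryRegion against every
     * BAR region of every device in the container.
     */
    static bool region_is_vfio_bar(VFIOContainer *container, MemoryRegion *mr)
    {
        VFIOGroup *group;
        VFIODevice *vdev;
        int i;

        QLIST_FOREACH(group, &container->group_list, container_next) {
            QLIST_FOREACH(vdev, &group->device_list, next) {
                for (i = 0; i < 6; i++) {               /* the six BARs */
                    if (mr == &vdev->bars[i].mem) {
                        return true;
                    }
                }
            }
        }
        return false;
    }
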

> >  Self-DMA is really not the intent of doing the mapping, but
> > peer-to-peer does have merit.
> > 
> > > I also note that if someone tries zero copy transmit out of such an
> > > address, get user pages will fail.
> > > I think this means tun zero copy transmit needs to fall-back
> > > on copy from user on get user pages failure.
> > > 
> > > Jason, what's your thinking on this?
> > > 
> > 
> > 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2014-01-14 16:15                                   ` Alex Williamson
@ 2014-01-14 16:18                                     ` Michael S. Tsirkin
  2014-01-14 16:39                                       ` Alex Williamson
  0 siblings, 1 reply; 74+ messages in thread
From: Michael S. Tsirkin @ 2014-01-14 16:18 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Peter Maydell, QEMU Developers, Alexey Kardashevskiy, Jason Wang,
	Alexander Graf, Luiz Capitulino, Paolo Bonzini, David Gibson

On Tue, Jan 14, 2014 at 09:15:14AM -0700, Alex Williamson wrote:
> On Tue, 2014-01-14 at 18:03 +0200, Michael S. Tsirkin wrote:
> > On Tue, Jan 14, 2014 at 08:57:58AM -0700, Alex Williamson wrote:
> > > On Tue, 2014-01-14 at 14:07 +0200, Michael S. Tsirkin wrote:
> > > > On Mon, Jan 13, 2014 at 03:48:11PM -0700, Alex Williamson wrote:
> > > > > On Mon, 2014-01-13 at 22:48 +0100, Alexander Graf wrote:
> > > > > > 
> > > > > > > Am 13.01.2014 um 22:39 schrieb Alex Williamson <alex.williamson@redhat.com>:
> > > > > > > 
> > > > > > >> On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
> > > > > > >>> On 12.01.2014, at 08:54, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > > > >>> 
> > > > > > >>>> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> > > > > > >>>>> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> > > > > > >>>>>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> > > > > > >>>>>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> > > > > > >>>>>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> > > > > > >>>>>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> > > > > > >>>>>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> > > > > > >>>>>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> > > > > > >>>>>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > > > > > >>>>>>>>>>>> From: Paolo Bonzini <pbonzini@redhat.com>
> > > > > > >>>>>>>>>>>> 
> > > > > > >>>>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
> > > > > > >>>>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
> > > > > > >>>>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
> > > > > > >>>>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> > > > > > >>>>>>>>>>>> consequently messing up the computations.
> > > > > > >>>>>>>>>>>> 
> > > > > > >>>>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
> > > > > > >>>>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> > > > > > >>>>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
> > > > > > >>>>>>>>>>>> address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> > > > > > >>>>>>>>>>>> not 2^64.  But we get it anyway because phys_page_find ignores the upper
> > > > > > >>>>>>>>>>>> bits of the physical address.  In address_space_translate_internal then
> > > > > > >>>>>>>>>>>> 
> > > > > > >>>>>>>>>>>>   diff = int128_sub(section->mr->size, int128_make64(addr));
> > > > > > >>>>>>>>>>>>   *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> > > > > > >>>>>>>>>>>> 
> > > > > > >>>>>>>>>>>> diff becomes negative, and int128_get64 booms.
> > > > > > >>>>>>>>>>>> 
> > > > > > >>>>>>>>>>>> The size of the PCI address space region should be fixed anyway.
> > > > > > >>>>>>>>>>>> 
> > > > > > >>>>>>>>>>>> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> > > > > > >>>>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > > > > > >>>>>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > > > >>>>>>>>>>>> ---
> > > > > > >>>>>>>>>>>> exec.c | 8 ++------
> > > > > > >>>>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
> > > > > > >>>>>>>>>>>> 
> > > > > > >>>>>>>>>>>> diff --git a/exec.c b/exec.c
> > > > > > >>>>>>>>>>>> index 7e5ce93..f907f5f 100644
> > > > > > >>>>>>>>>>>> --- a/exec.c
> > > > > > >>>>>>>>>>>> +++ b/exec.c
> > > > > > >>>>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
> > > > > > >>>>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> > > > > > >>>>>>>>>>>> 
> > > > > > >>>>>>>>>>>> /* Size of the L2 (and L3, etc) page tables.  */
> > > > > > >>>>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> > > > > > >>>>>>>>>>>> +#define ADDR_SPACE_BITS 64
> > > > > > >>>>>>>>>>>> 
> > > > > > >>>>>>>>>>>> #define P_L2_BITS 10
> > > > > > >>>>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
> > > > > > >>>>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> > > > > > >>>>>>>>>>>> {
> > > > > > >>>>>>>>>>>>    system_memory = g_malloc(sizeof(*system_memory));
> > > > > > >>>>>>>>>>>> 
> > > > > > >>>>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
> > > > > > >>>>>>>>>>>> -
> > > > > > >>>>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
> > > > > > >>>>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
> > > > > > >>>>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> > > > > > >>>>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> > > > > > >>>>>>>>>>>>    address_space_init(&address_space_memory, system_memory, "memory");
> > > > > > >>>>>>>>>>>> 
> > > > > > >>>>>>>>>>>>    system_io = g_malloc(sizeof(*system_io));
> > > > > > >>>>>>>>>>> 
> > > > > > >>>>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
> > > > > > >>>>>>>>>>> BARs that I'm not sure how to handle.
> > > > > > >>>>>>>>>> 
> > > > > > >>>>>>>>>> BARs are often disabled during sizing. Maybe you
> > > > > > >>>>>>>>>> don't detect BAR being disabled?
> > > > > > >>>>>>>>> 
> > > > > > >>>>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> > > > > > >>>>>>>>> the sizing and memory region updates for the BARs, vfio is just a
> > > > > > >>>>>>>>> pass-through here.
> > > > > > >>>>>>>> 
> > > > > > >>>>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
> > > > > > >>>>>>>> while I/O & memory are enabled in the command register.  Thanks,
> > > > > > >>>>>>>> 
> > > > > > >>>>>>>> Alex
> > > > > > >>>>>>> 
> > > > > > >>>>>>> OK then from QEMU POV this BAR value is not special at all.
> > > > > > >>>>>> 
> > > > > > >>>>>> Unfortunately
> > > > > > >>>>>> 
> > > > > > >>>>>>>>>>> After this patch I get vfio
> > > > > > >>>>>>>>>>> traces like this:
> > > > > > >>>>>>>>>>> 
> > > > > > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > > > > > >>>>>>>>>>> (save lower 32bits of BAR)
> > > > > > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > > > > > >>>>>>>>>>> (write mask to BAR)
> > > > > > >>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> > > > > > >>>>>>>>>>> (memory region gets unmapped)
> > > > > > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > > > > > >>>>>>>>>>> (read size mask)
> > > > > > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > > > > > >>>>>>>>>>> (restore BAR)
> > > > > > >>>>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > > > > > >>>>>>>>>>> (memory region re-mapped)
> > > > > > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > > > > > >>>>>>>>>>> (save upper 32bits of BAR)
> > > > > > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > > > > > >>>>>>>>>>> (write mask to BAR)
> > > > > > >>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> > > > > > >>>>>>>>>>> (memory region gets unmapped)
> > > > > > >>>>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > > > > > >>>>>>>>>>> (memory region gets re-mapped with new address)
> > > > > > >>>>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > > > > > >>>>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
> > > > > > >>>>>>>>>> 
> > > > > > >>>>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
> > > > > > >>>>>>>>> 
> > > > > > >>>>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
> > > > > > >>>>>>> 
> > > > > > >>>>>>> Why can't you? Generally memory core let you find out easily.
> > > > > > >>>>>> 
> > > > > > >>>>>> My MemoryListener is set up for &address_space_memory and I then filter
> > > > > > >>>>>> out anything that's not memory_region_is_ram().  This still gets
> > > > > > >>>>>> through, so how do I easily find out?
> > > > > > >>>>>> 
> > > > > > >>>>>>> But in this case it's vfio device itself that is sized so for sure you
> > > > > > >>>>>>> know it's MMIO.
> > > > > > >>>>>> 
> > > > > > >>>>>> How so?  I have a MemoryListener as described above and pass everything
> > > > > > >>>>>> through to the IOMMU.  I suppose I could look through all the
> > > > > > >>>>>> VFIODevices and check if the MemoryRegion matches, but that seems really
> > > > > > >>>>>> ugly.
> > > > > > >>>>>> 
> > > > > > >>>>>>> Maybe you will have same issue if there's another device with a 64 bit
> > > > > > >>>>>>> bar though, like ivshmem?
> > > > > > >>>>>> 
> > > > > > >>>>>> Perhaps, I suspect I'll see anything that registers their BAR
> > > > > > >>>>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
> > > > > > >>>>> 
> > > > > > >>>>> Must be a 64 bit BAR to trigger the issue though.
> > > > > > >>>>> 
> > > > > > >>>>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
> > > > > > >>>>>>>>> that we might be able to take advantage of with GPU passthrough.
> > > > > > >>>>>>>>> 
> > > > > > >>>>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
> > > > > > >>>>>>>>>>> address, presumably because it was beyond the address space of the PCI
> > > > > > >>>>>>>>>>> window.  This address is clearly not in a PCI MMIO space, so why are we
> > > > > > >>>>>>>>>>> allowing it to be realized in the system address space at this location?
> > > > > > >>>>>>>>>>> Thanks,
> > > > > > >>>>>>>>>>> 
> > > > > > >>>>>>>>>>> Alex
> > > > > > >>>>>>>>>> 
> > > > > > >>>>>>>>>> Why do you think it is not in PCI MMIO space?
> > > > > > >>>>>>>>>> True, CPU can't access this address but other pci devices can.
> > > > > > >>>>>>>>> 
> > > > > > >>>>>>>>> What happens on real hardware when an address like this is programmed to
> > > > > > >>>>>>>>> a device?  The CPU doesn't have the physical bits to access it.  I have
> > > > > > >>>>>>>>> serious doubts that another PCI device would be able to access it
> > > > > > >>>>>>>>> either.  Maybe in some limited scenario where the devices are on the
> > > > > > >>>>>>>>> same conventional PCI bus.  In the typical case, PCI addresses are
> > > > > > >>>>>>>>> always limited by some kind of aperture, whether that's explicit in
> > > > > > >>>>>>>>> bridge windows or implicit in hardware design (and perhaps made explicit
> > > > > > >>>>>>>>> in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> > > > > > >>>>>>>>> would I do it in a way that still allows real 64bit MMIO to be
> > > > > > >>>>>>>>> programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> > > > > > >>>>>>>>> 
> > > > > > >>>>>>>>> Alex
> > > > > > >>>>>>> 
> > > > > > >>>>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
> > > > > > >>>>>>> full 64 bit addresses must be allowed and hardware validation
> > > > > > >>>>>>> test suites normally check that it actually does work
> > > > > > >>>>>>> if it happens.
> > > > > > >>>>>> 
> > > > > > >>>>>> Sure, PCI devices themselves, but the chipset typically has defined
> > > > > > >>>>>> routing, that's more what I'm referring to.  There are generally only
> > > > > > >>>>>> fixed address windows for RAM vs MMIO.
> > > > > > >>>>> 
> > > > > > >>>>> The physical chipset? Likely - in the presence of IOMMU.
> > > > > > >>>>> Without that, devices can talk to each other without going
> > > > > > >>>>> through chipset, and bridge spec is very explicit that
> > > > > > >>>>> full 64 bit addressing must be supported.
> > > > > > >>>>> 
> > > > > > >>>>> So as long as we don't emulate an IOMMU,
> > > > > > >>>>> guest will normally think it's okay to use any address.
> > > > > > >>>>> 
> > > > > > >>>>>>> Yes, if there's a bridge somewhere on the path that bridge's
> > > > > > >>>>>>> windows would protect you, but pci already does this filtering:
> > > > > > >>>>>>> if you see this address in the memory map this means
> > > > > > >>>>>>> your virtual device is on root bus.
> > > > > > >>>>>>> 
> > > > > > >>>>>>> So I think it's the other way around: if VFIO requires specific
> > > > > > >>>>>>> address ranges to be assigned to devices, it should give this
> > > > > > >>>>>>> info to qemu and qemu can give this to guest.
> > > > > > >>>>>>> Then anything outside that range can be ignored by VFIO.
> > > > > > >>>>>> 
> > > > > > >>>>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
> > > > > > >>>>>> currently no way to find out the address width of the IOMMU.  We've been
> > > > > > >>>>>> getting by because it's safely close enough to the CPU address width to
> > > > > > >>>>>> not be a concern until we start exposing things at the top of the 64bit
> > > > > > >>>>>> address space.  Maybe I can safely ignore anything above
> > > > > > >>>>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> > > > > > >>>>>> 
> > > > > > >>>>>> Alex
> > > > > > >>>>> 
> > > > > > >>>>> I think it's not related to target CPU at all - it's a host limitation.
> > > > > > >>>>> So just make up your own constant, maybe depending on host architecture.
> > > > > > >>>>> Long term add an ioctl to query it.
> > > > > > >>>> 
> > > > > > >>>> It's a hardware limitation which I'd imagine has some loose ties to the
> > > > > > >>>> physical address bits of the CPU.
> > > > > > >>>> 
> > > > > > >>>>> Also, we can add a fwcfg interface to tell bios that it should avoid
> > > > > > >>>>> placing BARs above some address.
> > > > > > >>>> 
> > > > > > >>>> That doesn't help this case, it's a spurious mapping caused by sizing
> > > > > > >>>> the BARs with them enabled.  We may still want such a thing to feed into
> > > > > > >>>> building ACPI tables though.
> > > > > > >>> 
> > > > > > >>> Well the point is that if you want BIOS to avoid
> > > > > > >>> specific addresses, you need to tell it what to avoid.
> > > > > > >>> But neither BIOS nor ACPI actually cover the range above
> > > > > > >>> 2^48 ATM so it's not a high priority.
> > > > > > >>> 
> > > > > > >>>>> Since it's a vfio limitation I think it should be a vfio API, along the
> > > > > > >>>>> lines of vfio_get_addr_space_bits(void).
> > > > > > >>>>> (Is this true btw? legacy assignment doesn't have this problem?)
> > > > > > >>>> 
> > > > > > >>>> It's an IOMMU hardware limitation, legacy assignment has the same
> > > > > > >>>> problem.  It looks like legacy will abort() in QEMU for the failed
> > > > > > >>>> mapping and I'm planning to tighten vfio to also kill the VM for failed
> > > > > > >>>> mappings.  In the short term, I think I'll ignore any mappings above
> > > > > > >>>> TARGET_PHYS_ADDR_SPACE_BITS,
> > > > > > >>> 
> > > > > > >>> That seems very wrong. It will still fail on an x86 host if we are
> > > > > > >>> emulating a CPU with full 64 bit addressing. The limitation is on the
> > > > > > >>> host side; there's no real reason to tie it to the target.
> > > > > > > 
> > > > > > > I doubt vfio would be the only thing broken in that case.
> > > > > > > 
> > > > > > >>>> long term vfio already has an IOMMU info
> > > > > > >>>> ioctl that we could use to return this information, but we'll need to
> > > > > > >>>> figure out how to get it out of the IOMMU driver first.
> > > > > > >>>> Thanks,
> > > > > > >>>> 
> > > > > > >>>> Alex
> > > > > > >>> 
> > > > > > >>> Short term, just assume 48 bits on x86.
> > > > > > > 
> > > > > > > I hate to pick an arbitrary value since we have a very specific mapping
> > > > > > > we're trying to avoid.  Perhaps a better option is to skip anything
> > > > > > > where:
> > > > > > > 
> > > > > > >        MemoryRegionSection.offset_within_address_space >
> > > > > > >        ~MemoryRegionSection.offset_within_address_space
> > > > > > > 
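
The comparison above reduces to "bit 63 of the address is set": for an
unsigned 64-bit x, x > ~x holds exactly when x >= 2^63.  A tiny
standalone sketch of the proposed test, using a plain uint64_t rather
than the actual hwaddr/MemoryRegionSection types:

#include <stdbool.h>
#include <stdint.h>

/* Sketch of the proposed skip test: for unsigned 64-bit values,
 * x > ~x is true exactly when the top bit is set. */
static bool offset_in_top_half_of_64bit_space(uint64_t offset)
{
    return offset > ~offset;   /* same as: offset & (1ULL << 63) */
}
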
> > > > > > >>> We need to figure out what's the limitation on ppc and arm -
> > > > > > >>> maybe there's none and it can address full 64 bit range.
> > > > > > >> 
> > > > > > >> IIUC on PPC and ARM you always have BAR windows where things can get mapped into, unlike x86 where the full physical address range can be overlaid by BARs.
> > > > > > >> 
> > > > > > >> Or did I misunderstand the question?
> > > > > > > 
> > > > > > > Sounds right, if either BAR mappings outside the window will not be
> > > > > > > realized in the memory space or the IOMMU has a full 64bit address
> > > > > > > space, there's no problem.  Here we have an intermediate step in the BAR
> > > > > > > sizing producing a stray mapping that the IOMMU hardware can't handle.
> > > > > > > Even if we could handle it, it's not clear that we want to.  On AMD-Vi
> > > > > > > the IOMMU page tables can grow to 6 levels deep.  A stray mapping like
> > > > > > > this then causes space and time overhead until the tables are pruned
> > > > > > > back down.  Thanks,
> > > > > > 
> > > > > > I thought sizing is hard defined as a set to
> > > > > > -1? Can't we check for that one special case and treat it as "not mapped, but tell the guest the size in config space"?
> > > > > 
> > > > > PCI doesn't want to handle this as anything special to differentiate a
> > > > > sizing mask from a valid BAR address.  I agree though, I'd prefer to
> > > > > never see a spurious address like this in my MemoryListener.
> > > > 
> > > > It's more a can't than doesn't want to: it's a 64 bit BAR, it's not
> > > > set to all ones atomically.
> > > > 
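
To make that intermediate state concrete: with the lower dword already
restored to 0xfebe0004 and the upper dword still holding the all-ones
sizing value, the 64-bit BAR briefly decodes to the fffffffffebe0000
address seen in the trace.  A small standalone sketch of that
arithmetic, using the values from the trace (masking off the low flag
bits as usual for a memory BAR):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t bar_lo = 0xfebe0004;   /* lower dword restored after sizing */
    uint32_t bar_hi = 0xffffffff;   /* upper dword still holds the mask  */

    /* Drop the low BAR flag bits (type/prefetch) to get the address. */
    uint64_t addr = ((uint64_t)bar_hi << 32) | (bar_lo & ~0xfULL);

    printf("intermediate BAR address: 0x%" PRIx64 "\n", addr);
    /* prints 0xfffffffffebe0000, matching the remapped address in the trace */
    return 0;
}
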
> > > > Also, while it doesn't address this fully (same issue can happen
> > > > e.g. with ivshmem), do you think we should distinguish these BARs mapped
> > > > from vfio / device assignment in qemu somehow?
> > > > 
> > > > In particular, even when it has sane addresses:
> > > > device really cannot DMA into its own BAR; that's a spec violation,
> > > > so in theory it can do anything, including crashing the system.
> > > > I don't know what happens in practice but
> > > > if you are programming IOMMU to forward transactions back to
> > > > device that originated it, you are not doing it any favors.
> > > 
> > > I might concede that peer-to-peer is more trouble than it's worth if I
> > > had a convenient way to ignore MMIO mappings in my MemoryListener, but I
> > > don't.
> > 
> > Well for VFIO devices you are creating these mappings so we surely
> > can find a way for you to check that.
> > Doesn't each segment point back at the memory region that created it?
> > Then you can just check that.
> 
> It's a fairly heavy-weight search and it only avoids vfio devices, so it
> feels like it's just delaying a real solution.

Well there are several problems.

That a device gets its own BAR programmed
as a valid target in the IOMMU is in my opinion a separate bug,
and for *that* it's a real solution.
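
To illustrate the kind of check being discussed, here is a minimal
sketch of a region_add hook that looks at the section's backing memory
region and skips anything that is not guest RAM; the names here
(sketch_region_add, map_section_for_dma) are placeholders, not the
actual vfio code:

#include "exec/memory.h"

/* Hypothetical stand-in for the real DMA-map call (which would issue
 * the vfio DMA-map ioctl). */
static void map_section_for_dma(hwaddr iova, uint64_t size, void *vaddr)
{
    (void)iova; (void)size; (void)vaddr;
}

/* Sketch only: skip sections that are not guest RAM (device BARs,
 * including the transient sizing mapping discussed above) before
 * touching the IOMMU. */
static void sketch_region_add(MemoryListener *listener,
                              MemoryRegionSection *section)
{
    void *vaddr;

    if (!memory_region_is_ram(section->mr)) {
        /* Not guest RAM: ignore it, at the cost of peer-to-peer DMA. */
        return;
    }

    vaddr = (uint8_t *)memory_region_get_ram_ptr(section->mr)
            + section->offset_within_region;
    map_section_for_dma(section->offset_within_address_space,
                        int128_get64(section->size), vaddr);
}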

> > >  Self-DMA is really not the intent of doing the mapping, but
> > > peer-to-peer does have merit.
> > > 
> > > > I also note that if someone tries zero copy transmit out of such an
> > > > address, get user pages will fail.
> > > > I think this means tun zero copy transmit needs to fall back
> > > > on copy from user on get user pages failure.
> > > > 
> > > > Jason, what's your thinking on this?
> > > > 
> > > 
> > > 
> 
> 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2014-01-14 15:36                               ` Alex Williamson
@ 2014-01-14 16:20                                 ` Michael S. Tsirkin
  0 siblings, 0 replies; 74+ messages in thread
From: Michael S. Tsirkin @ 2014-01-14 16:20 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Peter Maydell, Avi Kivity, QEMU Developers, Alexey Kardashevskiy,
	Alexander Graf, Luiz Capitulino, Paolo Bonzini, David Gibson

On Tue, Jan 14, 2014 at 08:36:27AM -0700, Alex Williamson wrote:
> On Tue, 2014-01-14 at 12:24 +0200, Avi Kivity wrote:
> > On 01/14/2014 12:48 AM, Alex Williamson wrote:
> > > On Mon, 2014-01-13 at 22:48 +0100, Alexander Graf wrote:
> > >>> Am 13.01.2014 um 22:39 schrieb Alex Williamson <alex.williamson@redhat.com>:
> > >>>
> > >>>> On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
> > >>>>> On 12.01.2014, at 08:54, Michael S. Tsirkin <mst@redhat.com> wrote:
> > >>>>>
> > >>>>>> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> > >>>>>>> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> > >>>>>>>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> > >>>>>>>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> > >>>>>>>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> > >>>>>>>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> > >>>>>>>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> > >>>>>>>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> > >>>>>>>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > >>>>>>>>>>>>>> From: Paolo Bonzini <pbonzini@redhat.com>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
> > >>>>>>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
> > >>>>>>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
> > >>>>>>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> > >>>>>>>>>>>>>> consequently messing up the computations.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
> > >>>>>>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> > >>>>>>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
> > >>>>>>>>>>>>>> address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> > >>>>>>>>>>>>>> not 2^64.  But we get it anyway because phys_page_find ignores the upper
> > >>>>>>>>>>>>>> bits of the physical address.  In address_space_translate_internal then
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>    diff = int128_sub(section->mr->size, int128_make64(addr));
> > >>>>>>>>>>>>>>    *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> diff becomes negative, and int128_get64 booms.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> The size of the PCI address space region should be fixed anyway.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> > >>>>>>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > >>>>>>>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > >>>>>>>>>>>>>> ---
> > >>>>>>>>>>>>>> exec.c | 8 ++------
> > >>>>>>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> diff --git a/exec.c b/exec.c
> > >>>>>>>>>>>>>> index 7e5ce93..f907f5f 100644
> > >>>>>>>>>>>>>> --- a/exec.c
> > >>>>>>>>>>>>>> +++ b/exec.c
> > >>>>>>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
> > >>>>>>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> /* Size of the L2 (and L3, etc) page tables.  */
> > >>>>>>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> > >>>>>>>>>>>>>> +#define ADDR_SPACE_BITS 64
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> #define P_L2_BITS 10
> > >>>>>>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
> > >>>>>>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> > >>>>>>>>>>>>>> {
> > >>>>>>>>>>>>>>     system_memory = g_malloc(sizeof(*system_memory));
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
> > >>>>>>>>>>>>>> -
> > >>>>>>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
> > >>>>>>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
> > >>>>>>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> > >>>>>>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> > >>>>>>>>>>>>>>     address_space_init(&address_space_memory, system_memory, "memory");
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>     system_io = g_malloc(sizeof(*system_io));
> > >>>>>>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
> > >>>>>>>>>>>>> BARs that I'm not sure how to handle.
> > >>>>>>>>>>>> BARs are often disabled during sizing. Maybe you
> > >>>>>>>>>>>> don't detect BAR being disabled?
> > >>>>>>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> > >>>>>>>>>>> the sizing an memory region updates for the BARs, vfio is just a
> > >>>>>>>>>>> pass-through here.
> > >>>>>>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
> > >>>>>>>>>> while I/O & memory are enabled int he command register.  Thanks,
> > >>>>>>>>>>
> > >>>>>>>>>> Alex
> > >>>>>>>>> OK then from QEMU POV this BAR value is not special at all.
> > >>>>>>>> Unfortunately
> > >>>>>>>>
> > >>>>>>>>>>>>> After this patch I get vfio
> > >>>>>>>>>>>>> traces like this:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > >>>>>>>>>>>>> (save lower 32bits of BAR)
> > >>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > >>>>>>>>>>>>> (write mask to BAR)
> > >>>>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> > >>>>>>>>>>>>> (memory region gets unmapped)
> > >>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > >>>>>>>>>>>>> (read size mask)
> > >>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > >>>>>>>>>>>>> (restore BAR)
> > >>>>>>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > >>>>>>>>>>>>> (memory region re-mapped)
> > >>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > >>>>>>>>>>>>> (save upper 32bits of BAR)
> > >>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > >>>>>>>>>>>>> (write mask to BAR)
> > >>>>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> > >>>>>>>>>>>>> (memory region gets unmapped)
> > >>>>>>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > >>>>>>>>>>>>> (memory region gets re-mapped with new address)
> > >>>>>>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > >>>>>>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
> > >>>>>>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
> > >>>>>>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
> > >>>>>>>>> Why can't you? Generally memory core let you find out easily.
> > >>>>>>>> My MemoryListener is setup for &address_space_memory and I then filter
> > >>>>>>>> out anything that's not memory_region_is_ram().  This still gets
> > >>>>>>>> through, so how do I easily find out?
> > >>>>>>>>
> > >>>>>>>>> But in this case it's vfio device itself that is sized so for sure you
> > >>>>>>>>> know it's MMIO.
> > >>>>>>>> How so?  I have a MemoryListener as described above and pass everything
> > >>>>>>>> through to the IOMMU.  I suppose I could look through all the
> > >>>>>>>> VFIODevices and check if the MemoryRegion matches, but that seems really
> > >>>>>>>> ugly.
> > >>>>>>>>
> > >>>>>>>>> Maybe you will have same issue if there's another device with a 64 bit
> > >>>>>>>>> bar though, like ivshmem?
> > >>>>>>>> Perhaps, I suspect I'll see anything that registers their BAR
> > >>>>>>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
> > >>>>>>> Must be a 64 bit BAR to trigger the issue though.
> > >>>>>>>
> > >>>>>>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
> > >>>>>>>>>>> that we might be able to take advantage of with GPU passthrough.
> > >>>>>>>>>>>
> > >>>>>>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
> > >>>>>>>>>>>>> address, presumably because it was beyond the address space of the PCI
> > >>>>>>>>>>>>> window.  This address is clearly not in a PCI MMIO space, so why are we
> > >>>>>>>>>>>>> allowing it to be realized in the system address space at this location?
> > >>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Alex
> > >>>>>>>>>>>> Why do you think it is not in PCI MMIO space?
> > >>>>>>>>>>>> True, CPU can't access this address but other pci devices can.
> > >>>>>>>>>>> What happens on real hardware when an address like this is programmed to
> > >>>>>>>>>>> a device?  The CPU doesn't have the physical bits to access it.  I have
> > >>>>>>>>>>> serious doubts that another PCI device would be able to access it
> > >>>>>>>>>>> either.  Maybe in some limited scenario where the devices are on the
> > >>>>>>>>>>> same conventional PCI bus.  In the typical case, PCI addresses are
> > >>>>>>>>>>> always limited by some kind of aperture, whether that's explicit in
> > >>>>>>>>>>> bridge windows or implicit in hardware design (and perhaps made explicit
> > >>>>>>>>>>> in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> > >>>>>>>>>>> would I do it in a way that still allows real 64bit MMIO to be
> > >>>>>>>>>>> programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> > >>>>>>>>>>>
> > >>>>>>>>>>> Alex
> > >>>>>>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
> > >>>>>>>>> full 64 bit addresses must be allowed and hardware validation
> > >>>>>>>>> test suites normally check that it actually does work
> > >>>>>>>>> if it happens.
> > >>>>>>>> Sure, PCI devices themselves, but the chipset typically has defined
> > >>>>>>>> routing, that's more what I'm referring to.  There are generally only
> > >>>>>>>> fixed address windows for RAM vs MMIO.
> > >>>>>>> The physical chipset? Likely - in the presence of IOMMU.
> > >>>>>>> Without that, devices can talk to each other without going
> > >>>>>>> through chipset, and bridge spec is very explicit that
> > >>>>>>> full 64 bit addressing must be supported.
> > >>>>>>>
> > >>>>>>> So as long as we don't emulate an IOMMU,
> > >>>>>>> guest will normally think it's okay to use any address.
> > >>>>>>>
> > >>>>>>>>> Yes, if there's a bridge somewhere on the path that bridge's
> > >>>>>>>>> windows would protect you, but pci already does this filtering:
> > >>>>>>>>> if you see this address in the memory map this means
> > >>>>>>>>> your virtual device is on root bus.
> > >>>>>>>>>
> > >>>>>>>>> So I think it's the other way around: if VFIO requires specific
> > >>>>>>>>> address ranges to be assigned to devices, it should give this
> > >>>>>>>>> info to qemu and qemu can give this to guest.
> > >>>>>>>>> Then anything outside that range can be ignored by VFIO.
> > >>>>>>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
> > >>>>>>>> currently no way to find out the address width of the IOMMU.  We've been
> > >>>>>>>> getting by because it's safely close enough to the CPU address width to
> > >>>>>>>> not be a concern until we start exposing things at the top of the 64bit
> > >>>>>>>> address space.  Maybe I can safely ignore anything above
> > >>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> > >>>>>>>>
> > >>>>>>>> Alex
> > >>>>>>> I think it's not related to target CPU at all - it's a host limitation.
> > >>>>>>> So just make up your own constant, maybe depending on host architecture.
> > >>>>>>> Long term add an ioctl to query it.
> > >>>>>> It's a hardware limitation which I'd imagine has some loose ties to the
> > >>>>>> physical address bits of the CPU.
> > >>>>>>
> > >>>>>>> Also, we can add a fwcfg interface to tell bios that it should avoid
> > >>>>>>> placing BARs above some address.
> > >>>>>> That doesn't help this case, it's a spurious mapping caused by sizing
> > >>>>>> the BARs with them enabled.  We may still want such a thing to feed into
> > >>>>>> building ACPI tables though.
> > >>>>> Well the point is that if you want BIOS to avoid
> > >>>>> specific addresses, you need to tell it what to avoid.
> > >>>>> But neither BIOS nor ACPI actually cover the range above
> > >>>>> 2^48 ATM so it's not a high priority.
> > >>>>>
> > >>>>>>> Since it's a vfio limitation I think it should be a vfio API, along the
> > >>>>>>> lines of vfio_get_addr_space_bits(void).
> > >>>>>>> (Is this true btw? legacy assignment doesn't have this problem?)
> > >>>>>> It's an IOMMU hardware limitation, legacy assignment has the same
> > >>>>>> problem.  It looks like legacy will abort() in QEMU for the failed
> > >>>>>> mapping and I'm planning to tighten vfio to also kill the VM for failed
> > >>>>>> mappings.  In the short term, I think I'll ignore any mappings above
> > >>>>>> TARGET_PHYS_ADDR_SPACE_BITS,
> > >>>>> That seems very wrong. It will still fail on an x86 host if we are
> > >>>>> emulating a CPU with full 64 bit addressing. The limitation is on the
> > >>>>> host side; there's no real reason to tie it to the target.
> > >>> I doubt vfio would be the only thing broken in that case.
> > >>>
> > >>>>>> long term vfio already has an IOMMU info
> > >>>>>> ioctl that we could use to return this information, but we'll need to
> > >>>>>> figure out how to get it out of the IOMMU driver first.
> > >>>>>> Thanks,
> > >>>>>>
> > >>>>>> Alex
> > >>>>> Short term, just assume 48 bits on x86.
> > >>> I hate to pick an arbitrary value since we have a very specific mapping
> > >>> we're trying to avoid.  Perhaps a better option is to skip anything
> > >>> where:
> > >>>
> > >>>         MemoryRegionSection.offset_within_address_space >
> > >>>         ~MemoryRegionSection.offset_within_address_space
> > >>>
> > >>>>> We need to figure out what's the limitation on ppc and arm -
> > >>>>> maybe there's none and it can address full 64 bit range.
> > >>>> IIUC on PPC and ARM you always have BAR windows where things can get mapped into, unlike x86 where the full physical address range can be overlaid by BARs.
> > >>>>
> > >>>> Or did I misunderstand the question?
> > >>> Sounds right, if either BAR mappings outside the window will not be
> > >>> realized in the memory space or the IOMMU has a full 64bit address
> > >>> space, there's no problem.  Here we have an intermediate step in the BAR
> > >>> sizing producing a stray mapping that the IOMMU hardware can't handle.
> > >>> Even if we could handle it, it's not clear that we want to.  On AMD-Vi
> > >>> the IOMMU page tables can grow to 6 levels deep.  A stray mapping like
> > >>> this then causes space and time overhead until the tables are pruned
> > >>> back down.  Thanks,
> > >> I thought sizing is hard defined as a set to
> > >> -1? Can't we check for that one special case and treat it as "not mapped, but tell the guest the size in config space"?
> > > PCI doesn't want to handle this as anything special to differentiate a
> > > sizing mask from a valid BAR address.  I agree though, I'd prefer to
> > > never see a spurious address like this in my MemoryListener.
> > >
> > >
> > 
> > Can't you just ignore regions that cannot be mapped?  Oh, and teach the 
> > bios and/or linux to disable memory access while sizing.
> 
> Actually I think we need to be more stringent about DMA mapping
> failures.  If a chunk of guest RAM fails to map then we can lose data if
> the device attempts to DMA a packet into it.  How do we know which
> regions we can ignore and which we can't?  Whether or not the CPU can
> access it is a pretty good hint that we can ignore it.  Thanks,
> 
> Alex

Go ahead and use that as a hint if you prefer, but for targets whose
CPU physical address bits exceed what the host IOMMU supports, this
might not be enough to actually keep things from breaking.
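
A minimal sketch of what such host-side filtering could look like;
host_iommu_bits() is a made-up helper standing in for the query ioctl
that does not exist yet, and the 48-bit value is only an assumed x86
default, deliberately unrelated to the target CPU:

#include <stdbool.h>
#include <stdint.h>

/* Assumption: 48-bit host IOMMU, hard-coded until a query interface
 * exists (values up to 63 only; 64 would need special-casing). */
static unsigned host_iommu_bits(void)
{
    return 48;
}

/* Sketch: accept a DMA mapping only if it fits below the host IOMMU
 * address limit, checked without overflowing. */
static bool dma_mapping_fits(uint64_t iova, uint64_t size)
{
    uint64_t limit = (1ULL << host_iommu_bits()) - 1;

    return size != 0 && iova <= limit && size - 1 <= limit - iova;
}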

-- 
MST

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2014-01-14 16:18                                     ` Michael S. Tsirkin
@ 2014-01-14 16:39                                       ` Alex Williamson
  2014-01-14 16:45                                         ` Michael S. Tsirkin
  0 siblings, 1 reply; 74+ messages in thread
From: Alex Williamson @ 2014-01-14 16:39 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Peter Maydell, QEMU Developers, Alexey Kardashevskiy, Jason Wang,
	Alexander Graf, Luiz Capitulino, Paolo Bonzini, David Gibson

On Tue, 2014-01-14 at 18:18 +0200, Michael S. Tsirkin wrote:
> On Tue, Jan 14, 2014 at 09:15:14AM -0700, Alex Williamson wrote:
> > On Tue, 2014-01-14 at 18:03 +0200, Michael S. Tsirkin wrote:
> > > On Tue, Jan 14, 2014 at 08:57:58AM -0700, Alex Williamson wrote:
> > > > On Tue, 2014-01-14 at 14:07 +0200, Michael S. Tsirkin wrote:
> > > > > On Mon, Jan 13, 2014 at 03:48:11PM -0700, Alex Williamson wrote:
> > > > > > On Mon, 2014-01-13 at 22:48 +0100, Alexander Graf wrote:
> > > > > > > 
> > > > > > > > Am 13.01.2014 um 22:39 schrieb Alex Williamson <alex.williamson@redhat.com>:
> > > > > > > > 
> > > > > > > >> On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
> > > > > > > >>> On 12.01.2014, at 08:54, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > > > > >>> 
> > > > > > > >>>> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> > > > > > > >>>>> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> > > > > > > >>>>>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> > > > > > > >>>>>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> > > > > > > >>>>>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> > > > > > > >>>>>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> > > > > > > >>>>>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> > > > > > > >>>>>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> > > > > > > >>>>>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > > > > > > >>>>>>>>>>>> From: Paolo Bonzini <pbonzini@redhat.com>
> > > > > > > >>>>>>>>>>>> 
> > > > > > > >>>>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
> > > > > > > >>>>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
> > > > > > > >>>>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
> > > > > > > >>>>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> > > > > > > >>>>>>>>>>>> consequently messing up the computations.
> > > > > > > >>>>>>>>>>>> 
> > > > > > > >>>>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
> > > > > > > >>>>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> > > > > > > >>>>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
> > > > > > > >>>>>>>>>>>> address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> > > > > > > >>>>>>>>>>>> not 2^64.  But we get it anyway because phys_page_find ignores the upper
> > > > > > > >>>>>>>>>>>> bits of the physical address.  In address_space_translate_internal then
> > > > > > > >>>>>>>>>>>> 
> > > > > > > >>>>>>>>>>>>   diff = int128_sub(section->mr->size, int128_make64(addr));
> > > > > > > >>>>>>>>>>>>   *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> > > > > > > >>>>>>>>>>>> 
> > > > > > > >>>>>>>>>>>> diff becomes negative, and int128_get64 booms.
> > > > > > > >>>>>>>>>>>> 
> > > > > > > >>>>>>>>>>>> The size of the PCI address space region should be fixed anyway.
> > > > > > > >>>>>>>>>>>> 
> > > > > > > >>>>>>>>>>>> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> > > > > > > >>>>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > > > > > > >>>>>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > >>>>>>>>>>>> ---
> > > > > > > >>>>>>>>>>>> exec.c | 8 ++------
> > > > > > > >>>>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
> > > > > > > >>>>>>>>>>>> 
> > > > > > > >>>>>>>>>>>> diff --git a/exec.c b/exec.c
> > > > > > > >>>>>>>>>>>> index 7e5ce93..f907f5f 100644
> > > > > > > >>>>>>>>>>>> --- a/exec.c
> > > > > > > >>>>>>>>>>>> +++ b/exec.c
> > > > > > > >>>>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
> > > > > > > >>>>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> > > > > > > >>>>>>>>>>>> 
> > > > > > > >>>>>>>>>>>> /* Size of the L2 (and L3, etc) page tables.  */
> > > > > > > >>>>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> > > > > > > >>>>>>>>>>>> +#define ADDR_SPACE_BITS 64
> > > > > > > >>>>>>>>>>>> 
> > > > > > > >>>>>>>>>>>> #define P_L2_BITS 10
> > > > > > > >>>>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
> > > > > > > >>>>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> > > > > > > >>>>>>>>>>>> {
> > > > > > > >>>>>>>>>>>>    system_memory = g_malloc(sizeof(*system_memory));
> > > > > > > >>>>>>>>>>>> 
> > > > > > > >>>>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
> > > > > > > >>>>>>>>>>>> -
> > > > > > > >>>>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
> > > > > > > >>>>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
> > > > > > > >>>>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> > > > > > > >>>>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> > > > > > > >>>>>>>>>>>>    address_space_init(&address_space_memory, system_memory, "memory");
> > > > > > > >>>>>>>>>>>> 
> > > > > > > >>>>>>>>>>>>    system_io = g_malloc(sizeof(*system_io));
> > > > > > > >>>>>>>>>>> 
> > > > > > > >>>>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
> > > > > > > >>>>>>>>>>> BARs that I'm not sure how to handle.
> > > > > > > >>>>>>>>>> 
> > > > > > > >>>>>>>>>> BARs are often disabled during sizing. Maybe you
> > > > > > > >>>>>>>>>> don't detect BAR being disabled?
> > > > > > > >>>>>>>>> 
> > > > > > > >>>>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> > > > > > > >>>>>>>>> the sizing an memory region updates for the BARs, vfio is just a
> > > > > > > >>>>>>>>> pass-through here.
> > > > > > > >>>>>>>> 
> > > > > > > >>>>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
> > > > > > > >>>>>>>> while I/O & memory are enabled int he command register.  Thanks,
> > > > > > > >>>>>>>> 
> > > > > > > >>>>>>>> Alex
> > > > > > > >>>>>>> 
> > > > > > > >>>>>>> OK then from QEMU POV this BAR value is not special at all.
> > > > > > > >>>>>> 
> > > > > > > >>>>>> Unfortunately
> > > > > > > >>>>>> 
> > > > > > > >>>>>>>>>>> After this patch I get vfio
> > > > > > > >>>>>>>>>>> traces like this:
> > > > > > > >>>>>>>>>>> 
> > > > > > > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > > > > > > >>>>>>>>>>> (save lower 32bits of BAR)
> > > > > > > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > > > > > > >>>>>>>>>>> (write mask to BAR)
> > > > > > > >>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> > > > > > > >>>>>>>>>>> (memory region gets unmapped)
> > > > > > > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > > > > > > >>>>>>>>>>> (read size mask)
> > > > > > > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > > > > > > >>>>>>>>>>> (restore BAR)
> > > > > > > >>>>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > > > > > > >>>>>>>>>>> (memory region re-mapped)
> > > > > > > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > > > > > > >>>>>>>>>>> (save upper 32bits of BAR)
> > > > > > > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > > > > > > >>>>>>>>>>> (write mask to BAR)
> > > > > > > >>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> > > > > > > >>>>>>>>>>> (memory region gets unmapped)
> > > > > > > >>>>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > > > > > > >>>>>>>>>>> (memory region gets re-mapped with new address)
> > > > > > > >>>>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > > > > > > >>>>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
> > > > > > > >>>>>>>>>> 
> > > > > > > >>>>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
> > > > > > > >>>>>>>>> 
> > > > > > > >>>>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
> > > > > > > >>>>>>> 
> > > > > > > >>>>>>> Why can't you? Generally memory core let you find out easily.
> > > > > > > >>>>>> 
> > > > > > > >>>>>> My MemoryListener is setup for &address_space_memory and I then filter
> > > > > > > >>>>>> out anything that's not memory_region_is_ram().  This still gets
> > > > > > > >>>>>> through, so how do I easily find out?
> > > > > > > >>>>>> 
> > > > > > > >>>>>>> But in this case it's vfio device itself that is sized so for sure you
> > > > > > > >>>>>>> know it's MMIO.
> > > > > > > >>>>>> 
> > > > > > > >>>>>> How so?  I have a MemoryListener as described above and pass everything
> > > > > > > >>>>>> through to the IOMMU.  I suppose I could look through all the
> > > > > > > >>>>>> VFIODevices and check if the MemoryRegion matches, but that seems really
> > > > > > > >>>>>> ugly.
> > > > > > > >>>>>> 
> > > > > > > >>>>>>> Maybe you will have same issue if there's another device with a 64 bit
> > > > > > > >>>>>>> bar though, like ivshmem?
> > > > > > > >>>>>> 
> > > > > > > >>>>>> Perhaps, I suspect I'll see anything that registers their BAR
> > > > > > > >>>>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
> > > > > > > >>>>> 
> > > > > > > >>>>> Must be a 64 bit BAR to trigger the issue though.
> > > > > > > >>>>> 
> > > > > > > >>>>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
> > > > > > > >>>>>>>>> that we might be able to take advantage of with GPU passthrough.
> > > > > > > >>>>>>>>> 
> > > > > > > >>>>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
> > > > > > > >>>>>>>>>>> address, presumably because it was beyond the address space of the PCI
> > > > > > > >>>>>>>>>>> window.  This address is clearly not in a PCI MMIO space, so why are we
> > > > > > > >>>>>>>>>>> allowing it to be realized in the system address space at this location?
> > > > > > > >>>>>>>>>>> Thanks,
> > > > > > > >>>>>>>>>>> 
> > > > > > > >>>>>>>>>>> Alex
> > > > > > > >>>>>>>>>> 
> > > > > > > >>>>>>>>>> Why do you think it is not in PCI MMIO space?
> > > > > > > >>>>>>>>>> True, CPU can't access this address but other pci devices can.
> > > > > > > >>>>>>>>> 
> > > > > > > >>>>>>>>> What happens on real hardware when an address like this is programmed to
> > > > > > > >>>>>>>>> a device?  The CPU doesn't have the physical bits to access it.  I have
> > > > > > > >>>>>>>>> serious doubts that another PCI device would be able to access it
> > > > > > > >>>>>>>>> either.  Maybe in some limited scenario where the devices are on the
> > > > > > > >>>>>>>>> same conventional PCI bus.  In the typical case, PCI addresses are
> > > > > > > >>>>>>>>> always limited by some kind of aperture, whether that's explicit in
> > > > > > > >>>>>>>>> bridge windows or implicit in hardware design (and perhaps made explicit
> > > > > > > >>>>>>>>> in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> > > > > > > >>>>>>>>> would I do it in a way that still allows real 64bit MMIO to be
> > > > > > > >>>>>>>>> programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> > > > > > > >>>>>>>>> 
> > > > > > > >>>>>>>>> Alex
> > > > > > > >>>>>>> 
> > > > > > > >>>>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
> > > > > > > >>>>>>> full 64 bit addresses must be allowed and hardware validation
> > > > > > > >>>>>>> test suites normally check that it actually does work
> > > > > > > >>>>>>> if it happens.
> > > > > > > >>>>>> 
> > > > > > > >>>>>> Sure, PCI devices themselves, but the chipset typically has defined
> > > > > > > >>>>>> routing, that's more what I'm referring to.  There are generally only
> > > > > > > >>>>>> fixed address windows for RAM vs MMIO.
> > > > > > > >>>>> 
> > > > > > > >>>>> The physical chipset? Likely - in the presence of IOMMU.
> > > > > > > >>>>> Without that, devices can talk to each other without going
> > > > > > > >>>>> through chipset, and bridge spec is very explicit that
> > > > > > > >>>>> full 64 bit addressing must be supported.
> > > > > > > >>>>> 
> > > > > > > >>>>> So as long as we don't emulate an IOMMU,
> > > > > > > >>>>> guest will normally think it's okay to use any address.
> > > > > > > >>>>> 
> > > > > > > >>>>>>> Yes, if there's a bridge somewhere on the path that bridge's
> > > > > > > >>>>>>> windows would protect you, but pci already does this filtering:
> > > > > > > >>>>>>> if you see this address in the memory map this means
> > > > > > > >>>>>>> your virtual device is on root bus.
> > > > > > > >>>>>>> 
> > > > > > > >>>>>>> So I think it's the other way around: if VFIO requires specific
> > > > > > > >>>>>>> address ranges to be assigned to devices, it should give this
> > > > > > > >>>>>>> info to qemu and qemu can give this to guest.
> > > > > > > >>>>>>> Then anything outside that range can be ignored by VFIO.
> > > > > > > >>>>>> 
> > > > > > > >>>>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
> > > > > > > >>>>>> currently no way to find out the address width of the IOMMU.  We've been
> > > > > > > >>>>>> getting by because it's safely close enough to the CPU address width to
> > > > > > > >>>>>> not be a concern until we start exposing things at the top of the 64bit
> > > > > > > >>>>>> address space.  Maybe I can safely ignore anything above
> > > > > > > >>>>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> > > > > > > >>>>>> 
> > > > > > > >>>>>> Alex
> > > > > > > >>>>> 
> > > > > > > >>>>> I think it's not related to target CPU at all - it's a host limitation.
> > > > > > > >>>>> So just make up your own constant, maybe depending on host architecture.
> > > > > > > >>>>> Long term add an ioctl to query it.
> > > > > > > >>>> 
> > > > > > > >>>> It's a hardware limitation which I'd imagine has some loose ties to the
> > > > > > > >>>> physical address bits of the CPU.
> > > > > > > >>>> 
> > > > > > > >>>>> Also, we can add a fwcfg interface to tell bios that it should avoid
> > > > > > > >>>>> placing BARs above some address.
> > > > > > > >>>> 
> > > > > > > >>>> That doesn't help this case, it's a spurious mapping caused by sizing
> > > > > > > >>>> the BARs with them enabled.  We may still want such a thing to feed into
> > > > > > > >>>> building ACPI tables though.
> > > > > > > >>> 
> > > > > > > >>> Well the point is that if you want BIOS to avoid
> > > > > > > >>> specific addresses, you need to tell it what to avoid.
> > > > > > > >>> But neither BIOS nor ACPI actually cover the range above
> > > > > > > >>> 2^48 ATM so it's not a high priority.
> > > > > > > >>> 
> > > > > > > >>>>> Since it's a vfio limitation I think it should be a vfio API, along the
> > > > > > > >>>>> lines of vfio_get_addr_space_bits(void).
> > > > > > > >>>>> (Is this true btw? legacy assignment doesn't have this problem?)
> > > > > > > >>>> 
> > > > > > > >>>> It's an IOMMU hardware limitation, legacy assignment has the same
> > > > > > > >>>> problem.  It looks like legacy will abort() in QEMU for the failed
> > > > > > > >>>> mapping and I'm planning to tighten vfio to also kill the VM for failed
> > > > > > > >>>> mappings.  In the short term, I think I'll ignore any mappings above
> > > > > > > >>>> TARGET_PHYS_ADDR_SPACE_BITS,
> > > > > > > >>> 
> > > > > > > >>> That seems very wrong. It will still fail on an x86 host if we are
> > > > > > > >>> emulating a CPU with full 64 bit addressing. The limitation is on the
> > > > > > > >>> host side; there's no real reason to tie it to the target.
> > > > > > > > 
> > > > > > > > I doubt vfio would be the only thing broken in that case.
> > > > > > > > 
> > > > > > > >>>> long term vfio already has an IOMMU info
> > > > > > > >>>> ioctl that we could use to return this information, but we'll need to
> > > > > > > >>>> figure out how to get it out of the IOMMU driver first.
> > > > > > > >>>> Thanks,
> > > > > > > >>>> 
> > > > > > > >>>> Alex
> > > > > > > >>> 
> > > > > > > >>> Short term, just assume 48 bits on x86.
> > > > > > > > 
> > > > > > > > I hate to pick an arbitrary value since we have a very specific mapping
> > > > > > > > we're trying to avoid.  Perhaps a better option is to skip anything
> > > > > > > > where:
> > > > > > > > 
> > > > > > > >        MemoryRegionSection.offset_within_address_space >
> > > > > > > >        ~MemoryRegionSection.offset_within_address_space
> > > > > > > > 
> > > > > > > >>> We need to figure out what's the limitation on ppc and arm -
> > > > > > > >>> maybe there's none and it can address full 64 bit range.
> > > > > > > >> 
> > > > > > > >> IIUC on PPC and ARM you always have BAR windows where things can get mapped into, unlike x86 where the full physical address range can be overlaid by BARs.
> > > > > > > >> 
> > > > > > > >> Or did I misunderstand the question?
> > > > > > > > 
> > > > > > > > Sounds right, if either BAR mappings outside the window will not be
> > > > > > > > realized in the memory space or the IOMMU has a full 64bit address
> > > > > > > > space, there's no problem.  Here we have an intermediate step in the BAR
> > > > > > > > sizing producing a stray mapping that the IOMMU hardware can't handle.
> > > > > > > > Even if we could handle it, it's not clear that we want to.  On AMD-Vi
> > > > > > > > the IOMMU page tables can grow to 6 levels deep.  A stray mapping like
> > > > > > > > this then causes space and time overhead until the tables are pruned
> > > > > > > > back down.  Thanks,
> > > > > > > 
> > > > > > > I thought sizing is hard defined as a set to
> > > > > > > -1? Can't we check for that one special case and treat it as "not mapped, but tell the guest the size in config space"?
> > > > > > 
> > > > > > PCI doesn't want to handle this as anything special to differentiate a
> > > > > > sizing mask from a valid BAR address.  I agree though, I'd prefer to
> > > > > > never see a spurious address like this in my MemoryListener.
> > > > > 
> > > > > It's more a can't than doesn't want to: it's a 64 bit BAR, it's not
> > > > > set to all ones atomically.
> > > > > 
> > > > > Also, while it doesn't address this fully (same issue can happen
> > > > > e.g. with ivshmem), do you think we should distinguish these BARs mapped
> > > > > from vfio / device assignment in qemu somehow?
> > > > > 
> > > > > In particular, even when it has sane addresses:
> > > > > device really cannot DMA into its own BAR; that's a spec violation,
> > > > > so in theory it can do anything, including crashing the system.
> > > > > I don't know what happens in practice but
> > > > > if you are programming IOMMU to forward transactions back to
> > > > > device that originated it, you are not doing it any favors.
> > > > 
> > > > I might concede that peer-to-peer is more trouble than it's worth if I
> > > > had a convenient way to ignore MMIO mappings in my MemoryListener, but I
> > > > don't.
> > > 
> > > Well for VFIO devices you are creating these mappings so we surely
> > > can find a way for you to check that.
> > > Doesn't each segment point back at the memory region that created it?
> > > Then you can just check that.
> > 
> > It's a fairly heavy-weight search and it only avoids vfio devices, so it
> > feels like it's just delaying a real solution.
> 
> Well there are several problems.
> 
> That a device gets its own BAR programmed
> as a valid target in the IOMMU is in my opinion a separate bug,
> and for *that* it's a real solution.

Except the side-effect of that solution is that it also disables
peer-to-peer since we do not use separate IOMMU domains per device.  In
fact, we can't guarantee that it's possible to use separate IOMMU
domains per device.  So, the cure is worse than the disease.
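
A simplified model of that constraint, with purely illustrative
structure names (this is not the actual vfio code): all assigned
devices in a container share one set of IOMMU mappings, so leaving a
BAR out of the shared map to avoid self-DMA also removes it as a
peer-to-peer target for every other device.

#include <stdint.h>

/* Illustrative model only: one IOMMU domain shared by all assigned
 * devices means one shared set of DMA mappings. */
struct shared_iommu_domain {
    struct dma_mapping *mappings;        /* visible to every device below */
};

struct assigned_device {
    struct shared_iommu_domain *domain;  /* shared, not per device */
    uint64_t bar_start, bar_size;        /* the device's own MMIO BAR */
};

/* Skipping bar_start..bar_start+bar_size-1 in the shared domain avoids
 * the self-DMA case for this device, but also makes the BAR unreachable
 * by DMA from every other device in the same domain. */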

> > > >  Self-DMA is really not the intent of doing the mapping, but
> > > > peer-to-peer does have merit.
> > > > 
> > > > > I also note that if someone tries zero copy transmit out of such an
> > > > > address, get user pages will fail.
> > > > > I think this means tun zero copy transmit needs to fall back
> > > > > on copy from user on get user pages failure.
> > > > > 
> > > > > Jason, what's your thinking on this?
> > > > > 
> > > > 
> > > > 
> > 
> > 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2014-01-14 16:39                                       ` Alex Williamson
@ 2014-01-14 16:45                                         ` Michael S. Tsirkin
  0 siblings, 0 replies; 74+ messages in thread
From: Michael S. Tsirkin @ 2014-01-14 16:45 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Peter Maydell, QEMU Developers, Alexey Kardashevskiy, Jason Wang,
	Alexander Graf, Luiz Capitulino, Paolo Bonzini, David Gibson

On Tue, Jan 14, 2014 at 09:39:24AM -0700, Alex Williamson wrote:
> On Tue, 2014-01-14 at 18:18 +0200, Michael S. Tsirkin wrote:
> > On Tue, Jan 14, 2014 at 09:15:14AM -0700, Alex Williamson wrote:
> > > On Tue, 2014-01-14 at 18:03 +0200, Michael S. Tsirkin wrote:
> > > > On Tue, Jan 14, 2014 at 08:57:58AM -0700, Alex Williamson wrote:
> > > > > On Tue, 2014-01-14 at 14:07 +0200, Michael S. Tsirkin wrote:
> > > > > > On Mon, Jan 13, 2014 at 03:48:11PM -0700, Alex Williamson wrote:
> > > > > > > On Mon, 2014-01-13 at 22:48 +0100, Alexander Graf wrote:
> > > > > > > > 
> > > > > > > > > Am 13.01.2014 um 22:39 schrieb Alex Williamson <alex.williamson@redhat.com>:
> > > > > > > > > 
> > > > > > > > >> On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
> > > > > > > > >>> On 12.01.2014, at 08:54, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > > > > > >>> 
> > > > > > > > >>>> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> > > > > > > > >>>>> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> > > > > > > > >>>>>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> > > > > > > > >>>>>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> > > > > > > > >>>>>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> > > > > > > > >>>>>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> > > > > > > > >>>>>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> > > > > > > > >>>>>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> > > > > > > > >>>>>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > > > > > > > >>>>>>>>>>>> From: Paolo Bonzini <pbonzini@redhat.com>
> > > > > > > > >>>>>>>>>>>> 
> > > > > > > > >>>>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
> > > > > > > > >>>>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
> > > > > > > > >>>>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
> > > > > > > > >>>>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> > > > > > > > >>>>>>>>>>>> consequently messing up the computations.
> > > > > > > > >>>>>>>>>>>> 
> > > > > > > > >>>>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
> > > > > > > > >>>>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> > > > > > > > >>>>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
> > > > > > > > >>>>>>>>>>>> address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> > > > > > > > >>>>>>>>>>>> not 2^64.  But we get it anyway because phys_page_find ignores the upper
> > > > > > > > >>>>>>>>>>>> bits of the physical address.  In address_space_translate_internal then
> > > > > > > > >>>>>>>>>>>> 
> > > > > > > > >>>>>>>>>>>>   diff = int128_sub(section->mr->size, int128_make64(addr));
> > > > > > > > >>>>>>>>>>>>   *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> > > > > > > > >>>>>>>>>>>> 
> > > > > > > > >>>>>>>>>>>> diff becomes negative, and int128_get64 booms.
> > > > > > > > >>>>>>>>>>>> 
> > > > > > > > >>>>>>>>>>>> The size of the PCI address space region should be fixed anyway.
> > > > > > > > >>>>>>>>>>>> 
> > > > > > > > >>>>>>>>>>>> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> > > > > > > > >>>>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > > > > > > > >>>>>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > >>>>>>>>>>>> ---
> > > > > > > > >>>>>>>>>>>> exec.c | 8 ++------
> > > > > > > > >>>>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
> > > > > > > > >>>>>>>>>>>> 
> > > > > > > > >>>>>>>>>>>> diff --git a/exec.c b/exec.c
> > > > > > > > >>>>>>>>>>>> index 7e5ce93..f907f5f 100644
> > > > > > > > >>>>>>>>>>>> --- a/exec.c
> > > > > > > > >>>>>>>>>>>> +++ b/exec.c
> > > > > > > > >>>>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
> > > > > > > > >>>>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> > > > > > > > >>>>>>>>>>>> 
> > > > > > > > >>>>>>>>>>>> /* Size of the L2 (and L3, etc) page tables.  */
> > > > > > > > >>>>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> > > > > > > > >>>>>>>>>>>> +#define ADDR_SPACE_BITS 64
> > > > > > > > >>>>>>>>>>>> 
> > > > > > > > >>>>>>>>>>>> #define P_L2_BITS 10
> > > > > > > > >>>>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
> > > > > > > > >>>>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> > > > > > > > >>>>>>>>>>>> {
> > > > > > > > >>>>>>>>>>>>    system_memory = g_malloc(sizeof(*system_memory));
> > > > > > > > >>>>>>>>>>>> 
> > > > > > > > >>>>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
> > > > > > > > >>>>>>>>>>>> -
> > > > > > > > >>>>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
> > > > > > > > >>>>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
> > > > > > > > >>>>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> > > > > > > > >>>>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> > > > > > > > >>>>>>>>>>>>    address_space_init(&address_space_memory, system_memory, "memory");
> > > > > > > > >>>>>>>>>>>> 
> > > > > > > > >>>>>>>>>>>>    system_io = g_malloc(sizeof(*system_io));
> > > > > > > > >>>>>>>>>>> 
> > > > > > > > >>>>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
> > > > > > > > >>>>>>>>>>> BARs that I'm not sure how to handle.
> > > > > > > > >>>>>>>>>> 
> > > > > > > > >>>>>>>>>> BARs are often disabled during sizing. Maybe you
> > > > > > > > >>>>>>>>>> don't detect BAR being disabled?
> > > > > > > > >>>>>>>>> 
> > > > > > > > >>>>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> > > > > > > > >>>>>>>>> the sizing an memory region updates for the BARs, vfio is just a
> > > > > > > > >>>>>>>>> pass-through here.
> > > > > > > > >>>>>>>> 
> > > > > > > > >>>>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
> > > > > > > > >>>>>>>> while I/O & memory are enabled int he command register.  Thanks,
> > > > > > > > >>>>>>>> 
> > > > > > > > >>>>>>>> Alex
> > > > > > > > >>>>>>> 
> > > > > > > > >>>>>>> OK then from QEMU POV this BAR value is not special at all.
> > > > > > > > >>>>>> 
> > > > > > > > >>>>>> Unfortunately
> > > > > > > > >>>>>> 
> > > > > > > > >>>>>>>>>>> After this patch I get vfio
> > > > > > > > >>>>>>>>>>> traces like this:
> > > > > > > > >>>>>>>>>>> 
> > > > > > > > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > > > > > > > >>>>>>>>>>> (save lower 32bits of BAR)
> > > > > > > > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > > > > > > > >>>>>>>>>>> (write mask to BAR)
> > > > > > > > >>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> > > > > > > > >>>>>>>>>>> (memory region gets unmapped)
> > > > > > > > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > > > > > > > >>>>>>>>>>> (read size mask)
> > > > > > > > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > > > > > > > >>>>>>>>>>> (restore BAR)
> > > > > > > > >>>>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > > > > > > > >>>>>>>>>>> (memory region re-mapped)
> > > > > > > > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > > > > > > > >>>>>>>>>>> (save upper 32bits of BAR)
> > > > > > > > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > > > > > > > >>>>>>>>>>> (write mask to BAR)
> > > > > > > > >>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> > > > > > > > >>>>>>>>>>> (memory region gets unmapped)
> > > > > > > > >>>>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > > > > > > > >>>>>>>>>>> (memory region gets re-mapped with new address)
> > > > > > > > >>>>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > > > > > > > >>>>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
> > > > > > > > >>>>>>>>>> 
> > > > > > > > >>>>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
> > > > > > > > >>>>>>>>> 
> > > > > > > > >>>>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
> > > > > > > > >>>>>>> 
> > > > > > > > >>>>>>> Why can't you? Generally memory core let you find out easily.
> > > > > > > > >>>>>> 
> > > > > > > > >>>>>> My MemoryListener is setup for &address_space_memory and I then filter
> > > > > > > > >>>>>> out anything that's not memory_region_is_ram().  This still gets
> > > > > > > > >>>>>> through, so how do I easily find out?
> > > > > > > > >>>>>> 
> > > > > > > > >>>>>>> But in this case it's vfio device itself that is sized so for sure you
> > > > > > > > >>>>>>> know it's MMIO.
> > > > > > > > >>>>>> 
> > > > > > > > >>>>>> How so?  I have a MemoryListener as described above and pass everything
> > > > > > > > >>>>>> through to the IOMMU.  I suppose I could look through all the
> > > > > > > > >>>>>> VFIODevices and check if the MemoryRegion matches, but that seems really
> > > > > > > > >>>>>> ugly.
> > > > > > > > >>>>>> 
> > > > > > > > >>>>>>> Maybe you will have same issue if there's another device with a 64 bit
> > > > > > > > >>>>>>> bar though, like ivshmem?
> > > > > > > > >>>>>> 
> > > > > > > > >>>>>> Perhaps, I suspect I'll see anything that registers their BAR
> > > > > > > > >>>>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
> > > > > > > > >>>>> 
> > > > > > > > >>>>> Must be a 64 bit BAR to trigger the issue though.
> > > > > > > > >>>>> 
> > > > > > > > >>>>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
> > > > > > > > >>>>>>>>> that we might be able to take advantage of with GPU passthrough.
> > > > > > > > >>>>>>>>> 
> > > > > > > > >>>>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
> > > > > > > > >>>>>>>>>>> address, presumably because it was beyond the address space of the PCI
> > > > > > > > >>>>>>>>>>> window.  This address is clearly not in a PCI MMIO space, so why are we
> > > > > > > > >>>>>>>>>>> allowing it to be realized in the system address space at this location?
> > > > > > > > >>>>>>>>>>> Thanks,
> > > > > > > > >>>>>>>>>>> 
> > > > > > > > >>>>>>>>>>> Alex
> > > > > > > > >>>>>>>>>> 
> > > > > > > > >>>>>>>>>> Why do you think it is not in PCI MMIO space?
> > > > > > > > >>>>>>>>>> True, CPU can't access this address but other pci devices can.
> > > > > > > > >>>>>>>>> 
> > > > > > > > >>>>>>>>> What happens on real hardware when an address like this is programmed to
> > > > > > > > >>>>>>>>> a device?  The CPU doesn't have the physical bits to access it.  I have
> > > > > > > > >>>>>>>>> serious doubts that another PCI device would be able to access it
> > > > > > > > >>>>>>>>> either.  Maybe in some limited scenario where the devices are on the
> > > > > > > > >>>>>>>>> same conventional PCI bus.  In the typical case, PCI addresses are
> > > > > > > > >>>>>>>>> always limited by some kind of aperture, whether that's explicit in
> > > > > > > > >>>>>>>>> bridge windows or implicit in hardware design (and perhaps made explicit
> > > > > > > > >>>>>>>>> in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> > > > > > > > >>>>>>>>> would I do it in a way that still allows real 64bit MMIO to be
> > > > > > > > >>>>>>>>> programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> > > > > > > > >>>>>>>>> 
> > > > > > > > >>>>>>>>> Alex
> > > > > > > > >>>>>>> 
> > > > > > > > >>>>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
> > > > > > > > >>>>>>> full 64 bit addresses must be allowed and hardware validation
> > > > > > > > >>>>>>> test suites normally check that it actually does work
> > > > > > > > >>>>>>> if it happens.
> > > > > > > > >>>>>> 
> > > > > > > > >>>>>> Sure, PCI devices themselves, but the chipset typically has defined
> > > > > > > > >>>>>> routing, that's more what I'm referring to.  There are generally only
> > > > > > > > >>>>>> fixed address windows for RAM vs MMIO.
> > > > > > > > >>>>> 
> > > > > > > > >>>>> The physical chipset? Likely - in the presence of IOMMU.
> > > > > > > > >>>>> Without that, devices can talk to each other without going
> > > > > > > > >>>>> through chipset, and bridge spec is very explicit that
> > > > > > > > >>>>> full 64 bit addressing must be supported.
> > > > > > > > >>>>> 
> > > > > > > > >>>>> So as long as we don't emulate an IOMMU,
> > > > > > > > >>>>> guest will normally think it's okay to use any address.
> > > > > > > > >>>>> 
> > > > > > > > >>>>>>> Yes, if there's a bridge somewhere on the path that bridge's
> > > > > > > > >>>>>>> windows would protect you, but pci already does this filtering:
> > > > > > > > >>>>>>> if you see this address in the memory map this means
> > > > > > > > >>>>>>> your virtual device is on root bus.
> > > > > > > > >>>>>>> 
> > > > > > > > >>>>>>> So I think it's the other way around: if VFIO requires specific
> > > > > > > > >>>>>>> address ranges to be assigned to devices, it should give this
> > > > > > > > >>>>>>> info to qemu and qemu can give this to guest.
> > > > > > > > >>>>>>> Then anything outside that range can be ignored by VFIO.
> > > > > > > > >>>>>> 
> > > > > > > > >>>>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
> > > > > > > > >>>>>> currently no way to find out the address width of the IOMMU.  We've been
> > > > > > > > >>>>>> getting by because it's safely close enough to the CPU address width to
> > > > > > > > >>>>>> not be a concern until we start exposing things at the top of the 64bit
> > > > > > > > >>>>>> address space.  Maybe I can safely ignore anything above
> > > > > > > > >>>>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> > > > > > > > >>>>>> 
> > > > > > > > >>>>>> Alex
> > > > > > > > >>>>> 
> > > > > > > > >>>>> I think it's not related to target CPU at all - it's a host limitation.
> > > > > > > > >>>>> So just make up your own constant, maybe depending on host architecture.
> > > > > > > > >>>>> Long term add an ioctl to query it.
> > > > > > > > >>>> 
> > > > > > > > >>>> It's a hardware limitation which I'd imagine has some loose ties to the
> > > > > > > > >>>> physical address bits of the CPU.
> > > > > > > > >>>> 
> > > > > > > > >>>>> Also, we can add a fwcfg interface to tell bios that it should avoid
> > > > > > > > >>>>> placing BARs above some address.
> > > > > > > > >>>> 
> > > > > > > > >>>> That doesn't help this case, it's a spurious mapping caused by sizing
> > > > > > > > >>>> the BARs with them enabled.  We may still want such a thing to feed into
> > > > > > > > >>>> building ACPI tables though.
> > > > > > > > >>> 
> > > > > > > > >>> Well the point is that if you want BIOS to avoid
> > > > > > > > >>> specific addresses, you need to tell it what to avoid.
> > > > > > > > >>> But neither BIOS nor ACPI actually cover the range above
> > > > > > > > >>> 2^48 ATM so it's not a high priority.
> > > > > > > > >>> 
> > > > > > > > >>>>> Since it's a vfio limitation I think it should be a vfio API, along the
> > > > > > > > >>>>> lines of vfio_get_addr_space_bits(void).
> > > > > > > > >>>>> (Is this true btw? legacy assignment doesn't have this problem?)
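A minimal sketch of what such a helper might look like (the 48-bit constant
and the non-x86 fallback are assumptions; a real implementation would query
the kernel/IOMMU driver rather than hard-code anything):

#include <stdint.h>

/* Hypothetical helper along the lines of the vfio_get_addr_space_bits()
 * suggested above: report how many physical address bits the host IOMMU
 * can map.  48 is merely a common value for x86 IOMMUs; the right answer
 * eventually has to come from the kernel. */
static int vfio_get_addr_space_bits(void)
{
#if defined(__x86_64__) || defined(__i386__)
    return 48;
#else
    return 64;   /* assume no practical limit on other hosts */
#endif
}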
> > > > > > > > >>>> 
> > > > > > > > >>>> It's an IOMMU hardware limitation, legacy assignment has the same
> > > > > > > > >>>> problem.  It looks like legacy will abort() in QEMU for the failed
> > > > > > > > >>>> mapping and I'm planning to tighten vfio to also kill the VM for failed
> > > > > > > > >>>> mappings.  In the short term, I think I'll ignore any mappings above
> > > > > > > > >>>> TARGET_PHYS_ADDR_SPACE_BITS,
> > > > > > > > >>> 
> > > > > > > > >>> That seems very wrong. It will still fail on an x86 host if we are
> > > > > > > > >>> emulating a CPU with full 64 bit addressing. The limitation is on the
> > > > > > > > >>> host side; there's no real reason to tie it to the target.
> > > > > > > > > 
> > > > > > > > > I doubt vfio would be the only thing broken in that case.
> > > > > > > > > 
> > > > > > > > >>>> long term vfio already has an IOMMU info
> > > > > > > > >>>> ioctl that we could use to return this information, but we'll need to
> > > > > > > > >>>> figure out how to get it out of the IOMMU driver first.
> > > > > > > > >>>> Thanks,
> > > > > > > > >>>> 
> > > > > > > > >>>> Alex
> > > > > > > > >>> 
> > > > > > > > >>> Short term, just assume 48 bits on x86.
> > > > > > > > > 
> > > > > > > > > I hate to pick an arbitrary value since we have a very specific mapping
> > > > > > > > > we're trying to avoid.  Perhaps a better option is to skip anything
> > > > > > > > > where:
> > > > > > > > > 
> > > > > > > > >        MemoryRegionSection.offset_within_address_space >
> > > > > > > > >        ~MemoryRegionSection.offset_within_address_space
> > > > > > > > > 
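A standalone sketch of that check (the helper name is made up; in QEMU it
would be fed section->offset_within_address_space from the MemoryListener
callback):

#include <stdbool.h>
#include <stdint.h>

/* An address lies in the top half of the 64-bit space exactly when it is
 * greater than its own bitwise complement, i.e. when bit 63 is set. */
static inline bool addr_in_top_half(uint64_t start)
{
    return start > ~start;
}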
> > > > > > > > >>> We need to figure out what's the limitation on ppc and arm -
> > > > > > > > >>> maybe there's none and it can address full 64 bit range.
> > > > > > > > >> 
> > > > > > > > >> IIUC on PPC and ARM you always have BAR windows into which things can get mapped, unlike x86 where the full physical address range can be overlaid by BARs.
> > > > > > > > >> 
> > > > > > > > >> Or did I misunderstand the question?
> > > > > > > > > 
> > > > > > > > > Sounds right, if either BAR mappings outside the window will not be
> > > > > > > > > realized in the memory space or the IOMMU has a full 64bit address
> > > > > > > > > space, there's no problem.  Here we have an intermediate step in the BAR
> > > > > > > > > sizing producing a stray mapping that the IOMMU hardware can't handle.
> > > > > > > > > Even if we could handle it, it's not clear that we want to.  On AMD-Vi
> > > > > > > > > the IOMMU page tables can grow to 6 levels deep.  A stray mapping like
> > > > > > > > > this then causes space and time overhead until the tables are pruned
> > > > > > > > > back down.  Thanks,
> > > > > > > > 
> > > > > > > > I thought sizing is hard-defined as a write of
> > > > > > > > -1? Can't we check for that one special case and treat it as "not mapped, but tell the guest the size in config space"?
> > > > > > > 
> > > > > > > PCI doesn't want to handle this as anything special to differentiate a
> > > > > > > sizing mask from a valid BAR address.  I agree though, I'd prefer to
> > > > > > > never see a spurious address like this in my MemoryListener.
> > > > > > 
> > > > > > It's more a can't than doesn't want to: it's a 64 bit BAR, it's not
> > > > > > set to all ones atomically.
> > > > > > 
> > > > > > Also, while it doesn't address this fully (same issue can happen
> > > > > > e.g. with ivshmem), do you think we should distinguish these BARs mapped
> > > > > > from vfio / device assignment in qemu somehow?
> > > > > > 
> > > > > > In particular, even when it has sane addresses:
> > > > > > a device really cannot DMA into its own BAR; that's a spec violation,
> > > > > > so in theory it can do anything, including crashing the system.
> > > > > > I don't know what happens in practice but
> > > > > > if you are programming IOMMU to forward transactions back to
> > > > > > device that originated it, you are not doing it any favors.
> > > > > 
> > > > > I might concede that peer-to-peer is more trouble than it's worth if I
> > > > > had a convenient way to ignore MMIO mappings in my MemoryListener, but I
> > > > > don't.
> > > > 
> > > > Well for VFIO devices you are creating these mappings so we surely
> > > > can find a way for you to check that.
> > > > Doesn't each segment point back at the memory region that created it?
> > > > Then you can just check that.
> > > 
> > > It's a fairly heavy-weight search and it only avoids vfio devices, so it
> > > feels like it's just delaying a real solution.
> > 
> > Well there are several problems.
> > 
> > That a device gets its own BAR programmed
> > as a valid target in the IOMMU is in my opinion a separate bug,
> > and for *that* it's a real solution.
> 
> Except the side-effect of that solution is that it also disables
> peer-to-peer since we do not use separate IOMMU domains per device.  In
> fact, we can't guarantee that it's possible to use separate IOMMU
> domains per device.

Interesting. I guess we can make it work if there's a single
device; this will cover many users, though not all of them.

>  So, the cure is worse than the disease.

Worth checking what's worse. Want to try making the device DMA
into its own BAR and see what crashes? It's a spec violation
so all bets are off, but we can try it and see on at least some systems.

> > > > >  Self-DMA is really not the intent of doing the mapping, but
> > > > > peer-to-peer does have merit.
> > > > > 
> > > > > > I also note that if someone tries zero copy transmit out of such an
> > > > > > address, get user pages will fail.
> > > > > > I think this means tun zero copy transmit needs to fall-back
> > > > > > on copy from user on get user pages failure.
> > > > > > 
> > > > > > Jason, what's your thinking on this?
> > > > > > 
> > > > > 
> > > > > 
> > > 
> > > 
> 
> 


* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2014-01-14 15:49                           ` Alex Williamson
  2014-01-14 16:07                             ` Michael S. Tsirkin
@ 2014-01-14 17:49                             ` Mike Day
  2014-01-14 17:55                               ` Mike Day
  1 sibling, 1 reply; 74+ messages in thread
From: Mike Day @ 2014-01-14 17:49 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Peter Maydell, QEMU Developers, Michael S. Tsirkin,
	Alexey Kardashevskiy, Alexander Graf, Luiz Capitulino,
	Paolo Bonzini, David Gibson

>> > > >>>>>>>
>> > > >>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000

> If we choose not to map them, how do we distinguish them from guest RAM?
> There's no MemoryRegion flag that I'm aware of to distinguish a ram_ptr
> that points to a chunk of guest memory from one that points to the mmap
> of a device BAR.  I think I'd need to explicitly walk all of the vfio
> device and try to match the MemoryRegion pointer to one of my devices.
> That only solves the problem for vfio devices and not ivshmem devices or
> pci-assign devices.
>

I don't know if this will save you doing your memory region search or
not. But a BAR that ends with the low bit set is MMIO, and BAR that
ends with the low bit clear is RAM. So the address above is RAM as was
pointed out earlier in the thread. If you got an ambitious address in
the future you could test the low bit. But MMIO is deprecated
according to http://wiki.osdev.org/PCI so you probably won't see it,
at least for 64-bit addresses.

Mike


* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2014-01-14 17:49                             ` Mike Day
@ 2014-01-14 17:55                               ` Mike Day
  2014-01-14 18:05                                 ` Alex Williamson
  0 siblings, 1 reply; 74+ messages in thread
From: Mike Day @ 2014-01-14 17:55 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Peter Maydell, QEMU Developers, Michael S. Tsirkin,
	Alexey Kardashevskiy, Alexander Graf, Luiz Capitulino,
	Paolo Bonzini, David Gibson

On Tue, Jan 14, 2014 at 12:49 PM, Mike Day <ncmike@ncultra.org> wrote:
>>> > > >>>>>>>
>>> > > >>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
>
>> If we choose not to map them, how do we distinguish them from guest RAM?
>> There's no MemoryRegion flag that I'm aware of to distinguish a ram_ptr
>> that points to a chunk of guest memory from one that points to the mmap
>> of a device BAR.  I think I'd need to explicitly walk all of the vfio
>> device and try to match the MemoryRegion pointer to one of my devices.
>> That only solves the problem for vfio devices and not ivshmem devices or
>> pci-assign devices.
>>
>
> I don't know if this will save you doing your memory region search or
> not. But a BAR that ends with the low bit set is MMIO, and BAR that
> ends with the low bit clear is RAM. So the address above is RAM as was
> pointed out earlier in the thread. If you got an ambitious address in
> the future you could test the low bit. But MMIO is deprecated
> according to http://wiki.osdev.org/PCI so you probably won't see it,
> at least for 64-bit addresses.

s/ambitious/ambiguous/

The address above has already been masked. What you need to do is read
the BAR. If the value from the BAR ends in '1', it's MMIO. If it ends in
'10', it's RAM. If it ends in '0n', it's disabled. The first thing that
the PCI software does after reading the BAR is mask off the two low
bits.

Mike


* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2014-01-14 17:55                               ` Mike Day
@ 2014-01-14 18:05                                 ` Alex Williamson
  2014-01-14 18:20                                   ` Mike Day
  0 siblings, 1 reply; 74+ messages in thread
From: Alex Williamson @ 2014-01-14 18:05 UTC (permalink / raw)
  To: Mike Day
  Cc: Peter Maydell, QEMU Developers, Michael S. Tsirkin,
	Alexey Kardashevskiy, Alexander Graf, Luiz Capitulino,
	Paolo Bonzini, David Gibson

On Tue, 2014-01-14 at 12:55 -0500, Mike Day wrote:
> On Tue, Jan 14, 2014 at 12:49 PM, Mike Day <ncmike@ncultra.org> wrote:
> >>> > > >>>>>>>
> >>> > > >>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
> >
> >> If we choose not to map them, how do we distinguish them from guest RAM?
> >> There's no MemoryRegion flag that I'm aware of to distinguish a ram_ptr
> >> that points to a chunk of guest memory from one that points to the mmap
> >> of a device BAR.  I think I'd need to explicitly walk all of the vfio
> >> device and try to match the MemoryRegion pointer to one of my devices.
> >> That only solves the problem for vfio devices and not ivshmem devices or
> >> pci-assign devices.
> >>
> >
> > I don't know if this will save you doing your memory region search or
> > not. But a BAR that ends with the low bit set is MMIO, and BAR that
> > ends with the low bit clear is RAM. So the address above is RAM as was
> > pointed out earlier in the thread. If you got an ambitious address in
> > the future you could test the low bit. But MMIO is deprecated
> > according to http://wiki.osdev.org/PCI so you probably won't see it,
> > at least for 64-bit addresses.
> 
> s/ambitious/ambiguous/
> 
> The address above has already been masked. What you need to do is read
> the BAR. If the value from the BAR ends in '1', it's MMIO. If it ends in
> '10', it's RAM. If it ends in '0n', it's disabled. The first thing that
> the PCI software does after reading the BAR is mask off the two low
> bits.

Are you perhaps confusing MMIO and I/O port?  I/O port cannot be mmap'd
on x86, so it can't be directly mapped.  It also doesn't come through
the address_space_memory filter.  I/O port is deprecated, or at least
discouraged, MMIO is not.  Thanks,

Alex
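For reference, a small decoder sketch for the BAR low bits as the PCI spec
defines them: bit 0 selects I/O space vs. memory space, and for memory BARs
bits [2:1] give the type while bit 3 marks prefetchable (plain C, not QEMU
code):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Decode the low flag bits of a raw (unmasked) PCI BAR value. */
static void decode_bar(uint64_t bar)
{
    if (bar & 0x1) {
        printf("I/O port BAR, base 0x%" PRIx64 "\n",
               (uint64_t)(bar & ~0x3ULL));
    } else {
        int is_64bit = ((bar >> 1) & 0x3) == 0x2;
        int prefetch = (bar >> 3) & 0x1;
        printf("memory BAR, %s-bit%s, base 0x%" PRIx64 "\n",
               is_64bit ? "64" : "32",
               prefetch ? ", prefetchable" : "",
               (uint64_t)(bar & ~0xfULL));
    }
}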


* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2014-01-14 18:05                                 ` Alex Williamson
@ 2014-01-14 18:20                                   ` Mike Day
  0 siblings, 0 replies; 74+ messages in thread
From: Mike Day @ 2014-01-14 18:20 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Peter Maydell, QEMU Developers, Michael S. Tsirkin,
	Alexey Kardashevskiy, Alexander Graf, Luiz Capitulino,
	Paolo Bonzini, David Gibson

>>
>> The address above has already been masked. What you need to do is read
>> the BAR. If the value from the BAR ends in '1', it's MMIO. If it ends in
>> '10', it's RAM. If it ends in '0n', it's disabled. The first thing that
>> the PCI software does after reading the BAR is mask off the two low
>> bits.
>
> Are you perhaps confusing MMIO and I/O port?  I/O port cannot be mmap'd
> on x86, so it can't be directly mapped.  It also doesn't come through
> the address_space_memory filter.  I/O port is deprecated, or at least
> discouraged, MMIO is not.  Thanks,

You're right, sorry I missed that. It doesn't solve the problem.

Mike


* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2014-01-14 14:05                       ` Michael S. Tsirkin
  2014-01-14 15:01                         ` Mike Day
@ 2014-01-15  0:48                         ` Alexey Kardashevskiy
  1 sibling, 0 replies; 74+ messages in thread
From: Alexey Kardashevskiy @ 2014-01-15  0:48 UTC (permalink / raw)
  To: Michael S. Tsirkin, Mike Day
  Cc: peter.maydell, qemu-devel, agraf, Luiz Capitulino,
	Alex Williamson, Paolo Bonzini, david

On 01/15/2014 01:05 AM, Michael S. Tsirkin wrote:
> On Tue, Jan 14, 2014 at 08:50:54AM -0500, Mike Day wrote:
>>
>> "Michael S. Tsirkin" <mst@redhat.com> writes:
>>
>>> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
>>
>>> Short term, just assume 48 bits on x86.
>>>
>>> We need to figure out what's the limitation on ppc and arm -
>>> maybe there's none and it can address full 64 bit range.
>>>
>>> Cc some people who might know about these platforms.
>>
>> The document you need is here: 
>>
>> http://goo.gl/fJYxdN
>>
>> "PCI Bus Binding To: IEEE Std 1275-1994"
>>
>> The short answer is that Power (OpenFirmware-to-PCI) supports both MMIO
>> and Memory mappings for BARs.
>>
>> Also, both 32-bit and 64-bit BARs are required to be supported. It is
>> legal to construct a 64-bit BAR by masking all the high bits to
>> zero. Presumably it would be OK to mask the 16 high bits to zero as
>> well, constructing a 48-bit address.
>>
>> Mike
>>
>> -- 
>> Mike Day | "Endurance is a Virtue"
> 
> The question was whether addresses such as 
> 0xfffffffffec00000 can be a valid BAR value on these
> platforms, whether it's accessible to the CPU and
> to other PCI devices.


On ppc64, the guest address is limited to 60 bits (to Alex: even the PA from
the HPT has the same limit), but there is no actual limit for PCI bus
addresses. The actual hardware has some limits (less than 60 bits, but
close); however, since we do not emulate any real PHB in qemu-spapr and do
para-virtualization, we do not have to put limits there, and BARs like
0xfffffffffec00000 should be allowed (though we do not really expect them to
be that big).


-- 
Alexey


* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2014-01-09 17:24   ` Alex Williamson
  2014-01-09 18:00     ` Michael S. Tsirkin
@ 2014-01-20 16:20     ` Mike Day
  2014-01-20 16:45       ` Alex Williamson
  1 sibling, 1 reply; 74+ messages in thread
From: Mike Day @ 2014-01-20 16:20 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Paolo Bonzini, Luiz Capitulino, qemu-devel, Michael S. Tsirkin

Do you know which device is writing to the BAR below? From the trace
it appears it should be restoring the memory address to the BAR after
writing all 1s to the BAR and reading back the contents (the protocol
for finding the length of the BAR's memory region).

On Thu, Jan 9, 2014 at 12:24 PM, Alex Williamson
<alex.williamson@redhat.com> wrote:
> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
>> From: Paolo Bonzini <pbonzini@redhat.com>
> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> (save lower 32bits of BAR)
> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> (write mask to BAR)

Here the device should restore the memory address (original contents)
to the BAR.

> vfio: region_del febe0000 - febe3fff
> (memory region gets unmapped)
> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> (read size mask)
> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> (restore BAR)
> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> (memory region re-mapped)
> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> (save upper 32bits of BAR)
> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> (write mask to BAR)

and here ...

> vfio: region_del febe0000 - febe3fff
> (memory region gets unmapped)
> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> (memory region gets re-mapped with new address)
> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> (iommu barfs because it can only handle 48bit physical addresses)

I looked around some but I couldn't find an obvious culprit. Could it
be that the BAR is getting unmapped automatically due to
x-intx-mmap-timeout-ms before the device has a chance to finish
restoring the correct value to the BAR?

Mike


* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2014-01-20 16:20     ` Mike Day
@ 2014-01-20 16:45       ` Alex Williamson
  2014-01-20 17:04         ` Michael S. Tsirkin
  0 siblings, 1 reply; 74+ messages in thread
From: Alex Williamson @ 2014-01-20 16:45 UTC (permalink / raw)
  To: Mike Day; +Cc: Paolo Bonzini, Luiz Capitulino, qemu-devel, Michael S. Tsirkin

On Mon, 2014-01-20 at 11:20 -0500, Mike Day wrote:
> Do you know which device is writing to the BAR below? From the trace
> it appears it should be restoring the memory address to the BAR after
> writing all 1s to the BAR and reading back the contents. (the protocol
> for finding the length of the bar memory.)

The guest itself is writing the BARs.  This is a standard sizing
operation by the guest.

> On Thu, Jan 9, 2014 at 12:24 PM, Alex Williamson
> <alex.williamson@redhat.com> wrote:
> > On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> >> From: Paolo Bonzini <pbonzini@redhat.com>
> > vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > (save lower 32bits of BAR)
> > vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > (write mask to BAR)
> 
> Here the device should restore the memory address (original contents)
> to the BAR.

Sorry if it's not clear, the trace here is what the vfio-pci driver
sees.  We're just observing the sizing operation of the guest, therefore
we see:

1) orig = read()
2) write(0xffffffff)
3) size_mask = read()
4) write(orig)

We're only at step 2)
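A sketch of that sequence from the guest's point of view; the config-space
accessors are hypothetical stand-ins for whatever the guest OS provides:

#include <stdint.h>

/* Hypothetical config-space accessors. */
extern uint32_t pci_cfg_read32(uint32_t bdf, int offset);
extern void pci_cfg_write32(uint32_t bdf, int offset, uint32_t value);

/* Standard sizing of one 32-bit BAR register, matching steps 1)-4) above.
 * For a 64-bit BAR the guest repeats this for the upper half at offset + 4,
 * which is where the transient 0xfffffffffebe0000 mapping comes from. */
static uint32_t bar_size(uint32_t bdf, int bar_offset)
{
    uint32_t orig = pci_cfg_read32(bdf, bar_offset);   /* 1) save           */
    pci_cfg_write32(bdf, bar_offset, 0xffffffff);      /* 2) write all 1s   */
    uint32_t mask = pci_cfg_read32(bdf, bar_offset);   /* 3) read size mask */
    pci_cfg_write32(bdf, bar_offset, orig);            /* 4) restore        */

    /* Memory BAR: drop the low flag bits, invert, add one.
     * e.g. mask 0xffffc004 -> size 0x4000, as in the trace above. */
    return ~(mask & ~0xfU) + 1;
}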

> > vfio: region_del febe0000 - febe3fff
> > (memory region gets unmapped)
> > vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > (read size mask)

step 3)

> > vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > (restore BAR)

step 4)

> > vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > (memory region re-mapped)
> > vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > (save upper 32bits of BAR)
> > vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > (write mask to BAR)
> 
> and here ...

This is the same as above to the next BAR, which is the upper 32bits of
the 64bit BAR.

> > vfio: region_del febe0000 - febe3fff
> > (memory region gets unmapped)
> > vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > (memory region gets re-mapped with new address)
> > qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > (iommu barfs because it can only handle 48bit physical addresses)
> 
> I looked around some but I couldn't find an obvious culprit. Could it
> be that the BAR is getting unmapped automatically due to
> x-intx-mmap-timeout-ms before the device has a chance to finish
> restoring the correct value to the BAR?

No, this is simply the guest sizing the BAR, this is not an internally
generated operation.  The INTx emulation isn't used here as KVM
acceleration is enabled.  That also only toggles the enable setting on
the mmap'd MemoryRegion, it doesn't change the address it's mapped to.
Thanks,

Alex


* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2014-01-20 16:45       ` Alex Williamson
@ 2014-01-20 17:04         ` Michael S. Tsirkin
  2014-01-20 17:16           ` Alex Williamson
  0 siblings, 1 reply; 74+ messages in thread
From: Michael S. Tsirkin @ 2014-01-20 17:04 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Mike Day, Paolo Bonzini, qemu-devel, Luiz Capitulino

On Mon, Jan 20, 2014 at 09:45:25AM -0700, Alex Williamson wrote:
> On Mon, 2014-01-20 at 11:20 -0500, Mike Day wrote:
> > Do you know which device is writing to the BAR below? From the trace
> > it appears it should be restoring the memory address to the BAR after
> > writing all 1s to the BAR and reading back the contents. (the protocol
> > for finding the length of the bar memory.)
> 
> > The guest itself is writing the BARs.  This is a standard sizing
> operation by the guest.

The question is: maybe device memory should be disabled?
Does Windows do this too (sizing while memory is enabled)?


> > On Thu, Jan 9, 2014 at 12:24 PM, Alex Williamson
> > <alex.williamson@redhat.com> wrote:
> > > On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > >> From: Paolo Bonzini <pbonzini@redhat.com>
> > > vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > > (save lower 32bits of BAR)
> > > vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > > (write mask to BAR)
> > 
> > Here the device should restore the memory address (original contents)
> > to the BAR.
> 
> Sorry if it's not clear, the trace here is what the vfio-pci driver
> sees.  We're just observing the sizing operation of the guest, therefore
> we see:
> 
> 1) orig = read()
> 2) write(0xffffffff)
> 3) size_mask = read()
> 4) write(orig)
> 
> We're only at step 2)
> 
> > > vfio: region_del febe0000 - febe3fff
> > > (memory region gets unmapped)
> > > vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > > (read size mask)
> 
> step 3)
> 
> > > vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > > (restore BAR)
> 
> step 4)
> 
> > > vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > > (memory region re-mapped)
> > > vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > > (save upper 32bits of BAR)
> > > vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > > (write mask to BAR)
> > 
> > and here ...
> 
> This is the same as above to the next BAR, which is the upper 32bits of
> the 64bit BAR.
> 
> > > vfio: region_del febe0000 - febe3fff
> > > (memory region gets unmapped)
> > > vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > > (memory region gets re-mapped with new address)
> > > qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > > (iommu barfs because it can only handle 48bit physical addresses)
> > 
> > I looked around some but I couldn't find an obvious culprit. Could it
> > be that the BAR is getting unmapped automatically due to
> > x-intx-mmap-timeout-ms before the device has a chance to finish
> > restoring the correct value to the BAR?
> 
> No, this is simply the guest sizing the BAR, this is not an internally
> generated operation.  The INTx emulation isn't used here as KVM
> acceleration is enabled.  That also only toggles the enable setting on
> the mmap'd MemoryRegion, it doesn't change the address it's mapped to.
> Thanks,
> 
> Alex


* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2014-01-20 17:04         ` Michael S. Tsirkin
@ 2014-01-20 17:16           ` Alex Williamson
  2014-01-20 20:37             ` Michael S. Tsirkin
  0 siblings, 1 reply; 74+ messages in thread
From: Alex Williamson @ 2014-01-20 17:16 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Mike Day, Paolo Bonzini, qemu-devel, Luiz Capitulino

On Mon, 2014-01-20 at 19:04 +0200, Michael S. Tsirkin wrote:
> On Mon, Jan 20, 2014 at 09:45:25AM -0700, Alex Williamson wrote:
> > On Mon, 2014-01-20 at 11:20 -0500, Mike Day wrote:
> > > Do you know which device is writing to the BAR below? From the trace
> > > it appears it should be restoring the memory address to the BAR after
> > > writing all 1s to the BAR and reading back the contents. (the protocol
> > > for finding the length of the bar memory.)
> > 
> > The guest itself is writing the BARs.  This is a standard sizing
> > operation by the guest.
> 
> Question is maybe device memory should be disabled?
> Does windows do this too (sizing when memory enabled)?

Per the spec I would have expected memory & I/O to be disabled on the
device during a sizing operation, but that's not the case here.  I
thought you were the one that said Linux doesn't do this because some
devices don't properly re-enable.  I'm not sure how it would change our
approach to this to know whether Windows behaves the same since sizing
while disabled is not an issue and we apparently need to support sizing
while enabled regardless.  Thanks,

Alex

> > > On Thu, Jan 9, 2014 at 12:24 PM, Alex Williamson
> > > <alex.williamson@redhat.com> wrote:
> > > > On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > > >> From: Paolo Bonzini <pbonzini@redhat.com>
> > > > vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > > > (save lower 32bits of BAR)
> > > > vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > > > (write mask to BAR)
> > > 
> > > Here the device should restore the memory address (original contents)
> > > to the BAR.
> > 
> > Sorry if it's not clear, the trace here is what the vfio-pci driver
> > sees.  We're just observing the sizing operation of the guest, therefore
> > we see:
> > 
> > 1) orig = read()
> > 2) write(0xffffffff)
> > 3) size_mask = read()
> > 4) write(orig)
> > 
> > We're only at step 2)
> > 
> > > > vfio: region_del febe0000 - febe3fff
> > > > (memory region gets unmapped)
> > > > vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > > > (read size mask)
> > 
> > step 3)
> > 
> > > > vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > > > (restore BAR)
> > 
> > step 4)
> > 
> > > > vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > > > (memory region re-mapped)
> > > > vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > > > (save upper 32bits of BAR)
> > > > vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > > > (write mask to BAR)
> > > 
> > > and here ...
> > 
> > This is the same as above to the next BAR, which is the upper 32bits of
> > the 64bit BAR.
> > 
> > > > vfio: region_del febe0000 - febe3fff
> > > > (memory region gets unmapped)
> > > > vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > > > (memory region gets re-mapped with new address)
> > > > qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > > > (iommu barfs because it can only handle 48bit physical addresses)
> > > 
> > > I looked around some but I couldn't find an obvious culprit. Could it
> > > be that the BAR is getting unmapped automatically due to
> > > x-intx-mmap-timeout-ms before the device has a chance to finish
> > > restoring the correct value to the BAR?
> > 
> > No, this is simply the guest sizing the BAR, this is not an internally
> > generated operation.  The INTx emulation isn't used here as KVM
> > acceleration is enabled.  That also only toggles the enable setting on
> > the mmap'd MemoryRegion, it doesn't change the address it's mapped to.
> > Thanks,
> > 
> > Alex


* Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
  2014-01-20 17:16           ` Alex Williamson
@ 2014-01-20 20:37             ` Michael S. Tsirkin
  0 siblings, 0 replies; 74+ messages in thread
From: Michael S. Tsirkin @ 2014-01-20 20:37 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Mike Day, Paolo Bonzini, qemu-devel, Luiz Capitulino

On Mon, Jan 20, 2014 at 10:16:01AM -0700, Alex Williamson wrote:
> On Mon, 2014-01-20 at 19:04 +0200, Michael S. Tsirkin wrote:
> > On Mon, Jan 20, 2014 at 09:45:25AM -0700, Alex Williamson wrote:
> > > On Mon, 2014-01-20 at 11:20 -0500, Mike Day wrote:
> > > > Do you know which device is writing to the BAR below? From the trace
> > > > it appears it should be restoring the memory address to the BAR after
> > > > writing all 1s to the BAR and reading back the contents. (the protocol
> > > > for finding the length of the bar memory.)
> > > 
> > > The guest itself is writing the BARs.  This is a standard sizing
> > > operation by the guest.
> > 
> > Question is maybe device memory should be disabled?
> > Does windows do this too (sizing when memory enabled)?
> 
> Per the spec I would have expected memory & I/O to be disabled on the
> device during a sizing operation, but that's not the case here.  I
> thought you were the one that said Linux doesn't do this because some
> devices don't properly re-enable.

Yes. But maybe we can white-list devices or something.
I'm guessing modern express devices are all sane
and let you disable/enable memory any number
of times.

> I'm not sure how it would change our
> approach to this to know whether Windows behaves the same since sizing
> while disabled is not an issue and we apparently need to support sizing
> while enabled regardless.  Thanks,
> 
> Alex

I'm talking about changing Linux here.
If Windows is already doing this, that gives us more
hope that it will actually work.
Yes, we need the work-around in qemu regardless.


> > > > On Thu, Jan 9, 2014 at 12:24 PM, Alex Williamson
> > > > <alex.williamson@redhat.com> wrote:
> > > > > On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > > > >> From: Paolo Bonzini <pbonzini@redhat.com>
> > > > > vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > > > > (save lower 32bits of BAR)
> > > > > vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > > > > (write mask to BAR)
> > > > 
> > > > Here the device should restore the memory address (original contents)
> > > > to the BAR.
> > > 
> > > Sorry if it's not clear, the trace here is what the vfio-pci driver
> > > sees.  We're just observing the sizing operation of the guest, therefore
> > > we see:
> > > 
> > > 1) orig = read()
> > > 2) write(0xffffffff)
> > > 3) size_mask = read()
> > > 4) write(orig)
> > > 
> > > We're only at step 2)
> > > 
> > > > > vfio: region_del febe0000 - febe3fff
> > > > > (memory region gets unmapped)
> > > > > vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > > > > (read size mask)
> > > 
> > > step 3)
> > > 
> > > > > vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > > > > (restore BAR)
> > > 
> > > step 4)
> > > 
> > > > > vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > > > > (memory region re-mapped)
> > > > > vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > > > > (save upper 32bits of BAR)
> > > > > vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > > > > (write mask to BAR)
> > > > 
> > > > and here ...
> > > 
> > > This is the same as above to the next BAR, which is the upper 32bits of
> > > the 64bit BAR.
> > > 
> > > > > vfio: region_del febe0000 - febe3fff
> > > > > (memory region gets unmapped)
> > > > > vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > > > > (memory region gets re-mapped with new address)
> > > > > qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > > > > (iommu barfs because it can only handle 48bit physical addresses)
> > > > 
> > > > I looked around some but I couldn't find an obvious culprit. Could it
> > > > be that the BAR is getting unmapped automatically due to
> > > > x-intx-mmap-timeout-ms before the device has a chance to finish
> > > > restoring the correct value to the BAR?
> > > 
> > > No, this is simply the guest sizing the BAR, this is not an internally
> > > generated operation.  The INTx emulation isn't used here as KVM
> > > acceleration is enabled.  That also only toggles the enable setting on
> > > the mmap'd MemoryRegion, it doesn't change the address it's mapped to.
> > > Thanks,
> > > 
> > > Alex
> 
> 


end of thread, other threads:[~2014-01-20 20:32 UTC | newest]

Thread overview: 74+ messages
2013-12-11 18:30 [Qemu-devel] [PULL 00/28] acpi.pci,pc,memory core fixes Michael S. Tsirkin
2013-12-11 18:30 ` [Qemu-devel] [PULL 01/28] hw: Pass QEMUMachine to its init() method Michael S. Tsirkin
2013-12-11 18:30 ` [Qemu-devel] [PULL 02/28] pc: map PCI address space as catchall region for not mapped addresses Michael S. Tsirkin
2013-12-11 18:30 ` [Qemu-devel] [PULL 03/28] qtest: split configuration of qtest accelerator and chardev Michael S. Tsirkin
2013-12-11 18:30 ` [Qemu-devel] [PULL 04/28] acpi-test: basic acpi unit-test Michael S. Tsirkin
2013-12-11 18:30 ` [Qemu-devel] [PULL 05/28] MAINTAINERS: update X86 machine entry Michael S. Tsirkin
2013-12-11 18:30 ` [Qemu-devel] [PULL 06/28] pci: fix address space size for bridge Michael S. Tsirkin
2013-12-11 18:30 ` [Qemu-devel] [PULL 07/28] pc: s/INT64_MAX/UINT64_MAX/ Michael S. Tsirkin
2013-12-11 18:30 ` [Qemu-devel] [PULL 08/28] spapr_pci: s/INT64_MAX/UINT64_MAX/ Michael S. Tsirkin
2013-12-11 18:30 ` [Qemu-devel] [PULL 09/28] split definitions for exec.c and translate-all.c radix trees Michael S. Tsirkin
2013-12-11 18:30 ` [Qemu-devel] [PULL 10/28] exec: replace leaf with skip Michael S. Tsirkin
2013-12-11 18:30 ` [Qemu-devel] [PULL 11/28] exec: extend skip field to 6 bit, page entry to 32 bit Michael S. Tsirkin
2013-12-11 18:30 ` [Qemu-devel] [PULL 12/28] exec: pass hw address to phys_page_find Michael S. Tsirkin
2013-12-11 18:30 ` [Qemu-devel] [PULL 13/28] exec: memory radix tree page level compression Michael S. Tsirkin
2013-12-11 18:30 ` [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide Michael S. Tsirkin
2014-01-09 17:24   ` Alex Williamson
2014-01-09 18:00     ` Michael S. Tsirkin
2014-01-09 18:47       ` Alex Williamson
2014-01-09 19:03         ` Alex Williamson
2014-01-09 21:56           ` Michael S. Tsirkin
2014-01-09 22:42             ` Alex Williamson
2014-01-10 12:55               ` Michael S. Tsirkin
2014-01-10 15:31                 ` Alex Williamson
2014-01-12  7:54                   ` Michael S. Tsirkin
2014-01-12 15:03                     ` Alexander Graf
2014-01-13 21:39                       ` Alex Williamson
2014-01-13 21:48                         ` Alexander Graf
2014-01-13 22:48                           ` Alex Williamson
2014-01-14 10:24                             ` Avi Kivity
2014-01-14 11:50                               ` Michael S. Tsirkin
2014-01-14 15:36                               ` Alex Williamson
2014-01-14 16:20                                 ` Michael S. Tsirkin
2014-01-14 12:07                             ` Michael S. Tsirkin
2014-01-14 15:57                               ` Alex Williamson
2014-01-14 16:03                                 ` Michael S. Tsirkin
2014-01-14 16:15                                   ` Alex Williamson
2014-01-14 16:18                                     ` Michael S. Tsirkin
2014-01-14 16:39                                       ` Alex Williamson
2014-01-14 16:45                                         ` Michael S. Tsirkin
2014-01-14  8:18                           ` Michael S. Tsirkin
2014-01-14  9:20                             ` Alexander Graf
2014-01-14  9:31                               ` Peter Maydell
2014-01-14 10:28                               ` Michael S. Tsirkin
2014-01-14 10:43                               ` Michael S. Tsirkin
2014-01-14 12:21                         ` Michael S. Tsirkin
2014-01-14 15:49                           ` Alex Williamson
2014-01-14 16:07                             ` Michael S. Tsirkin
2014-01-14 17:49                             ` Mike Day
2014-01-14 17:55                               ` Mike Day
2014-01-14 18:05                                 ` Alex Williamson
2014-01-14 18:20                                   ` Mike Day
2014-01-14 13:50                     ` Mike Day
2014-01-14 14:05                       ` Michael S. Tsirkin
2014-01-14 15:01                         ` Mike Day
2014-01-15  0:48                         ` Alexey Kardashevskiy
2014-01-20 16:20     ` Mike Day
2014-01-20 16:45       ` Alex Williamson
2014-01-20 17:04         ` Michael S. Tsirkin
2014-01-20 17:16           ` Alex Williamson
2014-01-20 20:37             ` Michael S. Tsirkin
2013-12-11 18:30 ` [Qemu-devel] [PULL 15/28] exec: reduce L2_PAGE_SIZE Michael S. Tsirkin
2013-12-11 18:30 ` [Qemu-devel] [PULL 16/28] smbios: Set system manufacturer, product & version by default Michael S. Tsirkin
2013-12-11 18:31 ` [Qemu-devel] [PULL 17/28] acpi unit-test: verify signature and checksum Michael S. Tsirkin
2013-12-11 18:31 ` [Qemu-devel] [PULL 18/28] acpi: strip compiler info in built-in DSDT Michael S. Tsirkin
2013-12-11 18:31 ` [Qemu-devel] [PULL 19/28] ACPI DSDT: Make control method `IQCR` serialized Michael S. Tsirkin
2013-12-11 18:31 ` [Qemu-devel] [PULL 20/28] pci: fix pci bridge fw path Michael S. Tsirkin
2013-12-11 18:31 ` [Qemu-devel] [PULL 21/28] hpet: inverse polarity when pin above ISA_NUM_IRQS Michael S. Tsirkin
2013-12-11 18:31 ` [Qemu-devel] [PULL 22/28] hpet: enable to entitle more irq pins for hpet Michael S. Tsirkin
2013-12-11 18:31 ` [Qemu-devel] [PULL 23/28] memory.c: bugfix - ref counting mismatch in memory_region_find Michael S. Tsirkin
2013-12-11 18:31 ` [Qemu-devel] [PULL 24/28] exec: separate sections and nodes per address space Michael S. Tsirkin
2013-12-11 18:31 ` [Qemu-devel] [PULL 25/28] acpi unit-test: load and check facs table Michael S. Tsirkin
2013-12-11 18:31 ` [Qemu-devel] [PULL 26/28] acpi unit-test: adjust the test data structure for better handling Michael S. Tsirkin
2013-12-11 18:31 ` [Qemu-devel] [PULL 27/28] hpet: fix build with CONFIG_HPET off Michael S. Tsirkin
2013-12-11 18:31 ` [Qemu-devel] [PULL 28/28] pc: use macro for HPET type Michael S. Tsirkin
