* [Qemu-devel] [PATCH qemu v10 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW)
@ 2015-07-06  2:10 Alexey Kardashevskiy
  2015-07-06  2:10 ` [Qemu-devel] [PATCH qemu v10 01/14] linux-headers: Update to 4.2-rc1 Alexey Kardashevskiy
                   ` (15 more replies)
  0 siblings, 16 replies; 71+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-06  2:10 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Michael Roth, Gavin Shan, Alex Williamson,
	qemu-ppc, David Gibson


(cut-n-paste from kernel patchset)

Each Partitionable Endpoint (IOMMU group) has an address range on a PCI bus
where devices are allowed to do DMA. These ranges are called DMA windows.
By default, there is a single DMA window, 1 or 2GB big, mapped at zero
on a PCI bus.

PAPR defines a DDW RTAS API which allows pseries guests
to query the hypervisor about DDW support and capabilities (page size mask
for now). A pseries guest may request additional DMA windows (on top of
the default one) using this RTAS API.
The existing pseries Linux guests request an additional window as big as
the guest RAM and map the entire window, which effectively creates
a direct mapping of the guest memory to a PCI bus.

This patchset reworks PPC64 IOMMU code and adds necessary structures
to support big windows.

Once a Linux guest discovers the presence of DDW, it does:
1. query hypervisor about number of available windows and page size masks;
2. create a window with the biggest possible page size (today 4K/64K/16M);
3. map the entire guest RAM via H_PUT_TCE* hypercalls;
4. switch dma_ops to direct_dma_ops on the selected PE.

Once this is done, H_PUT_TCE is not called anymore for 64bit devices and
the guest does not waste time on DMA map/unmap operations.
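
To illustrate step 2 and the sizing behind step 3, here is a rough sketch
in C. It is not taken from the guest kernel or from this patchset: the
helper names are made up, and it assumes that bit N set in the page size
mask means pages of (1 << N) bytes are supported.

```c
#include <stdint.h>

/*
 * Hypothetical helper: return the shift of the biggest page size present
 * in the mask, or -1 if the mask is empty.
 * Assumption: bit N set means pages of (1 << N) bytes are supported.
 */
static int ddw_biggest_page_shift(uint64_t pgsizes)
{
    int shift = -1;

    for (int i = 0; i < 64; i++) {
        if (pgsizes & (1ULL << i)) {
            shift = i;
        }
    }
    return shift;
}

/*
 * Hypothetical helper: number of TCEs needed to cover window_size bytes
 * with pages of (1 << page_shift) bytes, rounding up.
 */
static uint64_t ddw_nb_tces(uint64_t window_size, int page_shift)
{
    uint64_t page_size = 1ULL << page_shift;

    return (window_size + page_size - 1) >> page_shift;
}
```

With a 4K/64K/16M mask the first helper picks shift 24, so a window
covering 4GB of guest RAM needs 256 TCEs instead of the 1048576 that
4K pages would take.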

Note that 32bit devices won't use DDW and will keep using the default
DMA window so KVM optimizations will be required (to be posted later).

This patchset adds DDW support for pseries. It requires host kernel
changes, which are available in the current upstream kernel.

This patchset is based on git://github.com/dgibson/qemu.git spapr-next branch.

Please comment. Thanks!

Changes:
v10:
* reworked "spapr_pci: Enable vfio-pci hotplug"
* added "vfio: Unregister IOMMU notifiers when container is destroyed"
* updated kernel header update with a tag

v9:
* removed "vfio: spapr: Move SPAPR-related code to a separate file"
* rebased on top of current dwg/spapr-next
* moved hw/vfio/* related patches to the end of the patchset
* included kernel headers update
* reworked "spapr_pci: Enable vfio-pci hotplug" a lot

v8:
* reworked unreferencing in "spapr_iommu: Introduce "enabled" state for TCE table"
* added clean-up patch "spapr_iommu: Remove vfio_accel flag from sPAPRTCETable"
* rebased on latest spapr-next

v7:
* bunch of cleanups, renames after David+Thomas+Michael review
* patches are reorganized and those which do not need the host kernel headers
update are put first and can be pulled if these are good enough :)

v6:
* spapr-pci-vfio-host-bridge is now a synonym of spapr-pci-host-bridge -
same PHB can host emulated and VFIO devices
* changed patches order
* lot of small changes

v5:
* TCE tables got "enabled" state and are persistent, i.e. not recreated
every reboot
* added v2 of SPAPR_TCE_IOMMU
* fixed migration for emulated PHB with enabled DDW
* huge pile of other changes

v4:
* reimplemented the whole thing
* machine reset and ddw-reset RTAS call both remove all TCE tables and
create the default one
* IOMMU group id is not needed to use VFIO PHB anymore, multiple groups
are supported on the same VFIO container and virtual PHB

v3:
* removed "reset" from API now
* reworked machine versions
* applied multiple comments
* includes David's machine QOM rework as this patchset adds a new machine type

v2:
* tested on emulated PHB
* removed "ddw" machine property, now it is PHB property
* disabled by default
* defined "pseries-2.2" machine which enables DDW by default
* fixed reset() and reference counting




Alexey Kardashevskiy (14):
  linux-headers: Update to 4.2-rc1
  vmstate: Define VARRAY with VMS_ALLOC
  spapr_pci: Convert finish_realize() to
    dma_capabilities_update()+dma_init_window()
  spapr_iommu: Move table allocation to helpers
  spapr_iommu: Introduce "enabled" state for TCE table
  spapr_iommu: Remove vfio_accel flag from sPAPRTCETable
  spapr_iommu: Add root memory region
  spapr_pci: Do complete reset of DMA config when resetting PHB
  spapr_vfio_pci: Remove redundant spapr-pci-vfio-host-bridge
  spapr_pci: Enable vfio-pci hotplug
  spapr_pci_vfio: Enable multiple groups per container
  vfio: Unregister IOMMU notifiers when container is destroyed
  vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)

 hw/ppc/Makefile.objs                            |   3 +
 hw/ppc/spapr.c                                  |   5 +
 hw/ppc/spapr_iommu.c                            | 207 +++++++++++++---
 hw/ppc/spapr_pci.c                              | 293 +++++++++++++++++------
 hw/ppc/spapr_pci_vfio.c                         | 191 ++++++++-------
 hw/ppc/spapr_rtas_ddw.c                         | 300 ++++++++++++++++++++++++
 hw/ppc/spapr_vio.c                              |   9 +-
 hw/vfio/common.c                                | 139 +++++++++--
 include/hw/pci-host/spapr.h                     |  50 +++-
 include/hw/ppc/spapr.h                          |  33 ++-
 include/hw/vfio/vfio-common.h                   |   3 +
 include/hw/vfio/vfio.h                          |   2 +-
 include/migration/vmstate.h                     |  10 +
 include/standard-headers/linux/input.h          |  10 +-
 include/standard-headers/linux/virtio_balloon.h |   1 +
 include/standard-headers/linux/virtio_gpu.h     |   2 +
 linux-headers/asm-x86/hyperv.h                  |  11 +
 linux-headers/linux/kvm.h                       |   2 +-
 linux-headers/linux/vfio.h                      | 102 +++++++-
 linux-headers/linux/virtio_pci.h                | 192 ---------------
 trace-events                                    |  11 +-
 21 files changed, 1132 insertions(+), 444 deletions(-)
 create mode 100644 hw/ppc/spapr_rtas_ddw.c
 delete mode 100644 linux-headers/linux/virtio_pci.h

-- 
2.4.0.rc3.8.gfb3e7d5

* [Qemu-devel] [PATCH qemu v10 01/14] linux-headers: Update to 4.2-rc1
  2015-07-06  2:10 [Qemu-devel] [PATCH qemu v10 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
@ 2015-07-06  2:10 ` Alexey Kardashevskiy
  2015-07-06 11:18   ` Paolo Bonzini
  2015-07-06  2:10 ` [Qemu-devel] [PATCH qemu v10 02/14] vmstate: Define VARRAY with VMS_ALLOC Alexey Kardashevskiy
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 71+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-06  2:10 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, Alexey Kardashevskiy, Michael Roth,
	Gavin Shan, Alex Williamson, qemu-ppc, Paolo Bonzini,
	David Gibson

This updates linux-headers against master 4.2-rc1 (commit
d770e558e21961ad6cfdf0ff7df0eb5d7d4f0754). This is the output of
running ./scripts/update-linux-headers.sh.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---

This is for DDW support on sPAPR.
---
 include/standard-headers/linux/input.h          |  10 +-
 include/standard-headers/linux/virtio_balloon.h |   1 +
 include/standard-headers/linux/virtio_gpu.h     |   2 +
 linux-headers/asm-x86/hyperv.h                  |  11 ++
 linux-headers/linux/kvm.h                       |   2 +-
 linux-headers/linux/vfio.h                      | 102 ++++++++++++-
 linux-headers/linux/virtio_pci.h                | 192 ------------------------
 7 files changed, 121 insertions(+), 199 deletions(-)
 delete mode 100644 linux-headers/linux/virtio_pci.h

diff --git a/include/standard-headers/linux/input.h b/include/standard-headers/linux/input.h
index b94d365..a459dd2 100644
--- a/include/standard-headers/linux/input.h
+++ b/include/standard-headers/linux/input.h
@@ -367,7 +367,8 @@ struct input_keymap_entry {
 #define KEY_MSDOS		151
 #define KEY_COFFEE		152	/* AL Terminal Lock/Screensaver */
 #define KEY_SCREENLOCK		KEY_COFFEE
-#define KEY_DIRECTION		153
+#define KEY_ROTATE_DISPLAY	153	/* Display orientation for e.g. tablets */
+#define KEY_DIRECTION		KEY_ROTATE_DISPLAY
 #define KEY_CYCLEWINDOWS	154
 #define KEY_MAIL		155
 #define KEY_BOOKMARKS		156	/* AC Bookmarks */
@@ -700,6 +701,10 @@ struct input_keymap_entry {
 #define KEY_NUMERIC_9		0x209
 #define KEY_NUMERIC_STAR	0x20a
 #define KEY_NUMERIC_POUND	0x20b
+#define KEY_NUMERIC_A		0x20c	/* Phone key A - HUT Telephony 0xb9 */
+#define KEY_NUMERIC_B		0x20d
+#define KEY_NUMERIC_C		0x20e
+#define KEY_NUMERIC_D		0x20f
 
 #define KEY_CAMERA_FOCUS	0x210
 #define KEY_WPS_BUTTON		0x211	/* WiFi Protected Setup key */
@@ -971,7 +976,8 @@ struct input_keymap_entry {
  */
 #define MT_TOOL_FINGER		0
 #define MT_TOOL_PEN		1
-#define MT_TOOL_MAX		1
+#define MT_TOOL_PALM		2
+#define MT_TOOL_MAX		2
 
 /*
  * Values describing the status of a force-feedback effect
diff --git a/include/standard-headers/linux/virtio_balloon.h b/include/standard-headers/linux/virtio_balloon.h
index 88ada1d..2e2a6dc 100644
--- a/include/standard-headers/linux/virtio_balloon.h
+++ b/include/standard-headers/linux/virtio_balloon.h
@@ -26,6 +26,7 @@
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE. */
 #include "standard-headers/linux/types.h"
+#include "standard-headers/linux/virtio_types.h"
 #include "standard-headers/linux/virtio_ids.h"
 #include "standard-headers/linux/virtio_config.h"
 
diff --git a/include/standard-headers/linux/virtio_gpu.h b/include/standard-headers/linux/virtio_gpu.h
index cfcfb46..72ef815 100644
--- a/include/standard-headers/linux/virtio_gpu.h
+++ b/include/standard-headers/linux/virtio_gpu.h
@@ -38,6 +38,8 @@
 #ifndef VIRTIO_GPU_HW_H
 #define VIRTIO_GPU_HW_H
 
+#include "standard-headers/linux/types.h"
+
 enum virtio_gpu_ctrl_type {
 	VIRTIO_GPU_UNDEFINED = 0,
 
diff --git a/linux-headers/asm-x86/hyperv.h b/linux-headers/asm-x86/hyperv.h
index ce6068d..8fba544 100644
--- a/linux-headers/asm-x86/hyperv.h
+++ b/linux-headers/asm-x86/hyperv.h
@@ -199,6 +199,17 @@
 #define HV_X64_MSR_STIMER3_CONFIG		0x400000B6
 #define HV_X64_MSR_STIMER3_COUNT		0x400000B7
 
+/* Hyper-V guest crash notification MSR's */
+#define HV_X64_MSR_CRASH_P0			0x40000100
+#define HV_X64_MSR_CRASH_P1			0x40000101
+#define HV_X64_MSR_CRASH_P2			0x40000102
+#define HV_X64_MSR_CRASH_P3			0x40000103
+#define HV_X64_MSR_CRASH_P4			0x40000104
+#define HV_X64_MSR_CRASH_CTL			0x40000105
+#define HV_X64_MSR_CRASH_CTL_NOTIFY		(1ULL << 63)
+#define HV_X64_MSR_CRASH_PARAMS		\
+		(1 + (HV_X64_MSR_CRASH_P4 - HV_X64_MSR_CRASH_P0))
+
 #define HV_X64_MSR_HYPERCALL_ENABLE		0x00000001
 #define HV_X64_MSR_HYPERCALL_PAGE_ADDRESS_SHIFT	12
 #define HV_X64_MSR_HYPERCALL_PAGE_ADDRESS_MASK	\
diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index fad9e5c..3bac873 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -897,7 +897,7 @@ struct kvm_xen_hvm_config {
  *
  * KVM_IRQFD_FLAG_RESAMPLE indicates resamplefd is valid and specifies
  * the irqfd to operate in resampling mode for level triggered interrupt
- * emlation.  See Documentation/virtual/kvm/api.txt.
+ * emulation.  See Documentation/virtual/kvm/api.txt.
  */
 #define KVM_IRQFD_FLAG_RESAMPLE (1 << 1)
 
diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index 0508d0b..aa276bc 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -36,6 +36,8 @@
 /* Two-stage IOMMU */
 #define VFIO_TYPE1_NESTING_IOMMU	6	/* Implies v2 */
 
+#define VFIO_SPAPR_TCE_v2_IOMMU		7
+
 /*
  * The IOCTL interface is designed for extensibility by embedding the
  * structure length (argsz) and flags into structures passed between
@@ -443,6 +445,23 @@ struct vfio_iommu_type1_dma_unmap {
 /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
 
 /*
+ * The SPAPR TCE DDW info struct provides the information about
+ * the details of Dynamic DMA window capability.
+ *
+ * @pgsizes contains a page size bitmask, 4K/64K/16M are supported.
+ * @max_dynamic_windows_supported tells the maximum number of windows
+ * which the platform can create.
+ * @levels tells the maximum number of levels in multi-level IOMMU tables;
+ * this allows splitting a table into smaller chunks which reduces
+ * the amount of physically contiguous memory required for the table.
+ */
+struct vfio_iommu_spapr_tce_ddw_info {
+	__u64 pgsizes;			/* Bitmap of supported page sizes */
+	__u32 max_dynamic_windows_supported;
+	__u32 levels;
+};
+
+/*
  * The SPAPR TCE info struct provides the information about the PCI bus
  * address ranges available for DMA, these values are programmed into
  * the hardware so the guest has to know that information.
@@ -452,14 +471,17 @@ struct vfio_iommu_type1_dma_unmap {
  * addresses too so the window works as a filter rather than an offset
  * for IOVA addresses.
  *
- * A flag will need to be added if other page sizes are supported,
- * so as defined here, it is always 4k.
+ * Flags supported:
+ * - VFIO_IOMMU_SPAPR_INFO_DDW: informs the userspace that dynamic DMA windows
+ *   (DDW) support is present. @ddw is only supported when DDW is present.
  */
 struct vfio_iommu_spapr_tce_info {
 	__u32 argsz;
-	__u32 flags;			/* reserved for future use */
+	__u32 flags;
+#define VFIO_IOMMU_SPAPR_INFO_DDW	(1 << 0)	/* DDW supported */
 	__u32 dma32_window_start;	/* 32 bit window start (bytes) */
 	__u32 dma32_window_size;	/* 32 bit window size (bytes) */
+	struct vfio_iommu_spapr_tce_ddw_info ddw;
 };
 
 #define VFIO_IOMMU_SPAPR_TCE_GET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
@@ -470,12 +492,23 @@ struct vfio_iommu_spapr_tce_info {
  * - unfreeze IO/DMA for frozen PE;
  * - read PE state;
  * - reset PE;
- * - configure PE.
+ * - configure PE;
+ * - inject EEH error.
  */
+struct vfio_eeh_pe_err {
+	__u32 type;
+	__u32 func;
+	__u64 addr;
+	__u64 mask;
+};
+
 struct vfio_eeh_pe_op {
 	__u32 argsz;
 	__u32 flags;
 	__u32 op;
+	union {
+		struct vfio_eeh_pe_err err;
+	};
 };
 
 #define VFIO_EEH_PE_DISABLE		0	/* Disable EEH functionality */
@@ -492,9 +525,70 @@ struct vfio_eeh_pe_op {
 #define VFIO_EEH_PE_RESET_HOT		6	/* Assert hot reset          */
 #define VFIO_EEH_PE_RESET_FUNDAMENTAL	7	/* Assert fundamental reset  */
 #define VFIO_EEH_PE_CONFIGURE		8	/* PE configuration          */
+#define VFIO_EEH_PE_INJECT_ERR		9	/* Inject EEH error          */
 
 #define VFIO_EEH_PE_OP			_IO(VFIO_TYPE, VFIO_BASE + 21)
 
+/**
+ * VFIO_IOMMU_SPAPR_REGISTER_MEMORY - _IOW(VFIO_TYPE, VFIO_BASE + 17, struct vfio_iommu_spapr_register_memory)
+ *
+ * Registers user space memory where DMA is allowed. It pins
+ * user pages and does the locked memory accounting so
+ * subsequent VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA calls
+ * get faster.
+ */
+struct vfio_iommu_spapr_register_memory {
+	__u32	argsz;
+	__u32	flags;
+	__u64	vaddr;				/* Process virtual address */
+	__u64	size;				/* Size of mapping (bytes) */
+};
+#define VFIO_IOMMU_SPAPR_REGISTER_MEMORY	_IO(VFIO_TYPE, VFIO_BASE + 17)
+
+/**
+ * VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY - _IOW(VFIO_TYPE, VFIO_BASE + 18, struct vfio_iommu_spapr_register_memory)
+ *
+ * Unregisters user space memory registered with
+ * VFIO_IOMMU_SPAPR_REGISTER_MEMORY.
+ * Uses vfio_iommu_spapr_register_memory for parameters.
+ */
+#define VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY	_IO(VFIO_TYPE, VFIO_BASE + 18)
+
+/**
+ * VFIO_IOMMU_SPAPR_TCE_CREATE - _IOWR(VFIO_TYPE, VFIO_BASE + 19, struct vfio_iommu_spapr_tce_create)
+ *
+ * Creates an additional TCE table and programs it (sets a new DMA window)
+ * to every IOMMU group in the container. It receives page shift, window
+ * size and number of levels in the TCE table being created.
+ *
+ * It allocates and returns an offset on a PCI bus of the new DMA window.
+ */
+struct vfio_iommu_spapr_tce_create {
+	__u32 argsz;
+	__u32 flags;
+	/* in */
+	__u32 page_shift;
+	__u64 window_size;
+	__u32 levels;
+	/* out */
+	__u64 start_addr;
+};
+#define VFIO_IOMMU_SPAPR_TCE_CREATE	_IO(VFIO_TYPE, VFIO_BASE + 19)
+
+/**
+ * VFIO_IOMMU_SPAPR_TCE_REMOVE - _IOW(VFIO_TYPE, VFIO_BASE + 20, struct vfio_iommu_spapr_tce_remove)
+ *
+ * Unprograms a TCE table from all groups in the container and destroys it.
+ * It receives a PCI bus offset as a window id.
+ */
+struct vfio_iommu_spapr_tce_remove {
+	__u32 argsz;
+	__u32 flags;
+	/* in */
+	__u64 start_addr;
+};
+#define VFIO_IOMMU_SPAPR_TCE_REMOVE	_IO(VFIO_TYPE, VFIO_BASE + 20)
+
 /* ***************************************************************** */
 
 #endif /* VFIO_H */
diff --git a/linux-headers/linux/virtio_pci.h b/linux-headers/linux/virtio_pci.h
deleted file mode 100644
index 92624e5..0000000
--- a/linux-headers/linux/virtio_pci.h
+++ /dev/null
@@ -1,192 +0,0 @@
-/*
- * Virtio PCI driver
- *
- * This module allows virtio devices to be used over a virtual PCI device.
- * This can be used with QEMU based VMMs like KVM or Xen.
- *
- * Copyright IBM Corp. 2007
- *
- * Authors:
- *  Anthony Liguori  <aliguori@us.ibm.com>
- *
- * This header is BSD licensed so anyone can use the definitions to implement
- * compatible drivers/servers.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions
- * are met:
- * 1. Redistributions of source code must retain the above copyright
- *    notice, this list of conditions and the following disclaimer.
- * 2. Redistributions in binary form must reproduce the above copyright
- *    notice, this list of conditions and the following disclaimer in the
- *    documentation and/or other materials provided with the distribution.
- * 3. Neither the name of IBM nor the names of its contributors
- *    may be used to endorse or promote products derived from this software
- *    without specific prior written permission.
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS'' AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
- * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
- * ARE DISCLAIMED.  IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
- * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
- * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
- * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
- * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
- * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
- * SUCH DAMAGE.
- */
-
-#ifndef _LINUX_VIRTIO_PCI_H
-#define _LINUX_VIRTIO_PCI_H
-
-#include <linux/types.h>
-
-#ifndef VIRTIO_PCI_NO_LEGACY
-
-/* A 32-bit r/o bitmask of the features supported by the host */
-#define VIRTIO_PCI_HOST_FEATURES	0
-
-/* A 32-bit r/w bitmask of features activated by the guest */
-#define VIRTIO_PCI_GUEST_FEATURES	4
-
-/* A 32-bit r/w PFN for the currently selected queue */
-#define VIRTIO_PCI_QUEUE_PFN		8
-
-/* A 16-bit r/o queue size for the currently selected queue */
-#define VIRTIO_PCI_QUEUE_NUM		12
-
-/* A 16-bit r/w queue selector */
-#define VIRTIO_PCI_QUEUE_SEL		14
-
-/* A 16-bit r/w queue notifier */
-#define VIRTIO_PCI_QUEUE_NOTIFY		16
-
-/* An 8-bit device status register.  */
-#define VIRTIO_PCI_STATUS		18
-
-/* An 8-bit r/o interrupt status register.  Reading the value will return the
- * current contents of the ISR and will also clear it.  This is effectively
- * a read-and-acknowledge. */
-#define VIRTIO_PCI_ISR			19
-
-/* MSI-X registers: only enabled if MSI-X is enabled. */
-/* A 16-bit vector for configuration changes. */
-#define VIRTIO_MSI_CONFIG_VECTOR        20
-/* A 16-bit vector for selected queue notifications. */
-#define VIRTIO_MSI_QUEUE_VECTOR         22
-
-/* The remaining space is defined by each driver as the per-driver
- * configuration space */
-#define VIRTIO_PCI_CONFIG_OFF(msix_enabled)	((msix_enabled) ? 24 : 20)
-/* Deprecated: please use VIRTIO_PCI_CONFIG_OFF instead */
-#define VIRTIO_PCI_CONFIG(dev)	VIRTIO_PCI_CONFIG_OFF((dev)->msix_enabled)
-
-/* Virtio ABI version, this must match exactly */
-#define VIRTIO_PCI_ABI_VERSION		0
-
-/* How many bits to shift physical queue address written to QUEUE_PFN.
- * 12 is historical, and due to x86 page size. */
-#define VIRTIO_PCI_QUEUE_ADDR_SHIFT	12
-
-/* The alignment to use between consumer and producer parts of vring.
- * x86 pagesize again. */
-#define VIRTIO_PCI_VRING_ALIGN		4096
-
-#endif /* VIRTIO_PCI_NO_LEGACY */
-
-/* The bit of the ISR which indicates a device configuration change. */
-#define VIRTIO_PCI_ISR_CONFIG		0x2
-/* Vector value used to disable MSI for queue */
-#define VIRTIO_MSI_NO_VECTOR            0xffff
-
-#ifndef VIRTIO_PCI_NO_MODERN
-
-/* IDs for different capabilities.  Must all exist. */
-
-/* Common configuration */
-#define VIRTIO_PCI_CAP_COMMON_CFG	1
-/* Notifications */
-#define VIRTIO_PCI_CAP_NOTIFY_CFG	2
-/* ISR access */
-#define VIRTIO_PCI_CAP_ISR_CFG		3
-/* Device specific confiuration */
-#define VIRTIO_PCI_CAP_DEVICE_CFG	4
-
-/* This is the PCI capability header: */
-struct virtio_pci_cap {
-	__u8 cap_vndr;		/* Generic PCI field: PCI_CAP_ID_VNDR */
-	__u8 cap_next;		/* Generic PCI field: next ptr. */
-	__u8 cap_len;		/* Generic PCI field: capability length */
-	__u8 cfg_type;		/* Identifies the structure. */
-	__u8 bar;		/* Where to find it. */
-	__u8 padding[3];	/* Pad to full dword. */
-	__le32 offset;		/* Offset within bar. */
-	__le32 length;		/* Length of the structure, in bytes. */
-};
-
-struct virtio_pci_notify_cap {
-	struct virtio_pci_cap cap;
-	__le32 notify_off_multiplier;	/* Multiplier for queue_notify_off. */
-};
-
-/* Fields in VIRTIO_PCI_CAP_COMMON_CFG: */
-struct virtio_pci_common_cfg {
-	/* About the whole device. */
-	__le32 device_feature_select;	/* read-write */
-	__le32 device_feature;		/* read-only */
-	__le32 guest_feature_select;	/* read-write */
-	__le32 guest_feature;		/* read-write */
-	__le16 msix_config;		/* read-write */
-	__le16 num_queues;		/* read-only */
-	__u8 device_status;		/* read-write */
-	__u8 config_generation;		/* read-only */
-
-	/* About a specific virtqueue. */
-	__le16 queue_select;		/* read-write */
-	__le16 queue_size;		/* read-write, power of 2. */
-	__le16 queue_msix_vector;	/* read-write */
-	__le16 queue_enable;		/* read-write */
-	__le16 queue_notify_off;	/* read-only */
-	__le32 queue_desc_lo;		/* read-write */
-	__le32 queue_desc_hi;		/* read-write */
-	__le32 queue_avail_lo;		/* read-write */
-	__le32 queue_avail_hi;		/* read-write */
-	__le32 queue_used_lo;		/* read-write */
-	__le32 queue_used_hi;		/* read-write */
-};
-
-/* Macro versions of offsets for the Old Timers! */
-#define VIRTIO_PCI_CAP_VNDR		0
-#define VIRTIO_PCI_CAP_NEXT		1
-#define VIRTIO_PCI_CAP_LEN		2
-#define VIRTIO_PCI_CAP_CFG_TYPE		3
-#define VIRTIO_PCI_CAP_BAR		4
-#define VIRTIO_PCI_CAP_OFFSET		8
-#define VIRTIO_PCI_CAP_LENGTH		12
-
-#define VIRTIO_PCI_NOTIFY_CAP_MULT	16
-
-
-#define VIRTIO_PCI_COMMON_DFSELECT	0
-#define VIRTIO_PCI_COMMON_DF		4
-#define VIRTIO_PCI_COMMON_GFSELECT	8
-#define VIRTIO_PCI_COMMON_GF		12
-#define VIRTIO_PCI_COMMON_MSIX		16
-#define VIRTIO_PCI_COMMON_NUMQ		18
-#define VIRTIO_PCI_COMMON_STATUS	20
-#define VIRTIO_PCI_COMMON_CFGGENERATION	21
-#define VIRTIO_PCI_COMMON_Q_SELECT	22
-#define VIRTIO_PCI_COMMON_Q_SIZE	24
-#define VIRTIO_PCI_COMMON_Q_MSIX	26
-#define VIRTIO_PCI_COMMON_Q_ENABLE	28
-#define VIRTIO_PCI_COMMON_Q_NOFF	30
-#define VIRTIO_PCI_COMMON_Q_DESCLO	32
-#define VIRTIO_PCI_COMMON_Q_DESCHI	36
-#define VIRTIO_PCI_COMMON_Q_AVAILLO	40
-#define VIRTIO_PCI_COMMON_Q_AVAILHI	44
-#define VIRTIO_PCI_COMMON_Q_USEDLO	48
-#define VIRTIO_PCI_COMMON_Q_USEDHI	52
-
-#endif /* VIRTIO_PCI_NO_MODERN */
-
-#endif
-- 
2.4.0.rc3.8.gfb3e7d5

* [Qemu-devel] [PATCH qemu v10 02/14] vmstate: Define VARRAY with VMS_ALLOC
  2015-07-06  2:10 [Qemu-devel] [PATCH qemu v10 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
  2015-07-06  2:10 ` [Qemu-devel] [PATCH qemu v10 01/14] linux-headers: Update to 4.2-rc1 Alexey Kardashevskiy
@ 2015-07-06  2:10 ` Alexey Kardashevskiy
  2015-07-06 14:21   ` Thomas Huth
  2015-07-06  2:10 ` [Qemu-devel] [PATCH qemu v10 03/14] spapr_pci: Convert finish_realize() to dma_capabilities_update()+dma_init_window() Alexey Kardashevskiy
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 71+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-06  2:10 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Michael Roth, Gavin Shan, Alex Williamson,
	qemu-ppc, David Gibson

This allows dynamic allocation for migrating arrays.

The existing VMSTATE_VARRAY_UINT32 requires the array to be
pre-allocated; however, there are cases when the size is not known in
advance and there is no real need to enforce it.

This defines another variant of VMSTATE_VARRAY_UINT32 with the VMS_ALLOC
flag, which tells the receiving side to allocate memory for the array
before receiving the data.

The first user of it is a dynamic DMA window, whose existence and size
are totally dynamic.
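
A rough standalone model of what "allocate before receiving" means on
the destination side, for illustration only. It does not use the QEMU
vmstate code; the struct, the function name and the stream layout (a
host-endian uint32_t count followed by that many uint64_t entries) are
all assumptions made for this sketch.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical device state: the array pointer starts out as NULL. */
struct table_state {
    uint32_t nb_table;   /* element count, received before the array */
    uint64_t *table;     /* allocated by the receiver, not in advance */
};

/*
 * Hypothetical loader. Reads the count first, then allocates the array
 * and only after that copies the entries in, which is the order of
 * operations the VMS_ALLOC flag asks for.
 * Returns 0 on success, -1 on a short stream or allocation failure.
 */
static int load_varray_alloc(struct table_state *s,
                             const uint8_t *stream, size_t len)
{
    uint32_t n;

    if (len < sizeof(n)) {
        return -1;
    }
    memcpy(&n, stream, sizeof(n));
    if ((len - sizeof(n)) / sizeof(uint64_t) < n) {
        return -1;
    }

    /* Allocate at least one element so that n == 0 still succeeds. */
    s->table = calloc(n ? n : 1, sizeof(uint64_t));
    if (!s->table) {
        return -1;
    }
    memcpy(s->table, stream + sizeof(n), (size_t)n * sizeof(uint64_t));
    s->nb_table = n;
    return 0;
}
```

The caller owns the result and releases it with free() once the state
is torn down.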

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 include/migration/vmstate.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index 0695d7c..5881d9f 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -295,6 +295,16 @@ extern const VMStateInfo vmstate_info_bitmap;
     .offset     = vmstate_offset_pointer(_state, _field, _type),     \
 }
 
+#define VMSTATE_VARRAY_UINT32_ALLOC(_field, _state, _field_num, _version, _info, _type) {\
+    .name       = (stringify(_field)),                               \
+    .version_id = (_version),                                        \
+    .num_offset = vmstate_offset_value(_state, _field_num, uint32_t),\
+    .info       = &(_info),                                          \
+    .size       = sizeof(_type),                                     \
+    .flags      = VMS_VARRAY_UINT32|VMS_POINTER|VMS_ALLOC,           \
+    .offset     = vmstate_offset_pointer(_state, _field, _type),     \
+}
+
 #define VMSTATE_VARRAY_UINT16_UNSAFE(_field, _state, _field_num, _version, _info, _type) {\
     .name       = (stringify(_field)),                               \
     .version_id = (_version),                                        \
-- 
2.4.0.rc3.8.gfb3e7d5

* [Qemu-devel] [PATCH qemu v10 03/14] spapr_pci: Convert finish_realize() to dma_capabilities_update()+dma_init_window()
  2015-07-06  2:10 [Qemu-devel] [PATCH qemu v10 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
  2015-07-06  2:10 ` [Qemu-devel] [PATCH qemu v10 01/14] linux-headers: Update to 4.2-rc1 Alexey Kardashevskiy
  2015-07-06  2:10 ` [Qemu-devel] [PATCH qemu v10 02/14] vmstate: Define VARRAY with VMS_ALLOC Alexey Kardashevskiy
@ 2015-07-06  2:10 ` Alexey Kardashevskiy
  2015-07-06 16:41   ` Laurent Vivier
  2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 04/14] spapr_iommu: Move table allocation to helpers Alexey Kardashevskiy
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 71+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-06  2:10 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Michael Roth, Gavin Shan, Alex Williamson,
	qemu-ppc, David Gibson

This reworks finish_realize(), which used to finalize DMA setup on the
assumption that it would not change later.

The new callbacks support various window parameters such as page and
window sizes. They return an error code rather than using Error**.

This is a mechanical change so no change in behaviour is expected.
This is a part of getting rid of the spapr-pci-vfio-host-bridge type.
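
For readers who want the shape of the two callbacks without the QEMU
types, here is a minimal standalone model. Everything in it is a mock:
struct phb, struct phb_class, the emulated_* functions and the 1GB
default size are invented for this sketch and are not the sPAPRPHBState
or sPAPRPHBClass definitions. Unlike the patch, the sketch also checks
the return code of each callback.

```c
#include <stdint.h>

struct phb;

/* Mock of a class with the two callbacks; both return 0 or an error. */
struct phb_class {
    int (*dma_capabilities_update)(struct phb *phb);
    int (*dma_init_window)(struct phb *phb, uint32_t liobn,
                           uint32_t page_shift, uint64_t window_size);
};

/* Mock of the per-PHB state touched by the callbacks. */
struct phb {
    const struct phb_class *klass;
    uint64_t dma32_window_start;
    uint64_t dma32_window_size;
    uint64_t nb_table;            /* TCE count of the default window */
};

/* Illustrative default only: a 1GB window mapped at zero. */
#define MOCK_DMA32_SIZE (1ULL << 30)

static int emulated_capabilities_update(struct phb *phb)
{
    phb->dma32_window_start = 0;
    phb->dma32_window_size = MOCK_DMA32_SIZE;
    return 0;
}

static int emulated_init_window(struct phb *phb, uint32_t liobn,
                                uint32_t page_shift, uint64_t window_size)
{
    (void)liobn;                  /* unused in this mock */
    if (page_shift >= 64 || window_size == 0) {
        return -1;
    }
    phb->nb_table = window_size >> page_shift;
    return 0;
}

static const struct phb_class emulated_phb_class = {
    .dma_capabilities_update = emulated_capabilities_update,
    .dma_init_window = emulated_init_window,
};

/* Realize-time sequence: query capabilities, then create the window. */
static int phb_setup_default_window(struct phb *phb, uint32_t liobn,
                                    uint32_t page_shift)
{
    int ret = phb->klass->dma_capabilities_update(phb);

    if (ret) {
        return ret;
    }
    return phb->klass->dma_init_window(phb, liobn, page_shift,
                                       phb->dma32_window_size);
}
```

A VFIO flavour would plug different functions into the same two slots,
so the realize path does not need to know which kind of PHB it is
dealing with.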

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
Changes:
v8:
* moved spapr_phb_dma_capabilities_update() higher to avoid forward
declaration in following patches and keep DMA code together (i.e. next
to spapr_pci_dma_iommu())
---
 hw/ppc/spapr_pci.c          | 59 ++++++++++++++++++++++++++-------------------
 hw/ppc/spapr_pci_vfio.c     | 53 ++++++++++++++++------------------------
 include/hw/pci-host/spapr.h |  8 +++++-
 3 files changed, 62 insertions(+), 58 deletions(-)

diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index a8f79d8..c1ca13d 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -808,6 +808,28 @@ static char *spapr_phb_get_loc_code(sPAPRPHBState *sphb, PCIDevice *pdev)
     return buf;
 }
 
+static int spapr_phb_dma_capabilities_update(sPAPRPHBState *sphb)
+{
+    sphb->dma32_window_start = 0;
+    sphb->dma32_window_size = SPAPR_PCI_DMA32_SIZE;
+
+    return 0;
+}
+
+static int spapr_phb_dma_init_window(sPAPRPHBState *sphb,
+                                     uint32_t liobn, uint32_t page_shift,
+                                     uint64_t window_size)
+{
+    uint64_t bus_offset = sphb->dma32_window_start;
+    sPAPRTCETable *tcet;
+
+    tcet = spapr_tce_new_table(DEVICE(sphb), liobn, bus_offset, page_shift,
+                               window_size >> page_shift,
+                               false);
+
+    return tcet ? 0 : -1;
+}
+
 /* Macros to operate with address in OF binding to PCI */
 #define b_x(x, p, l)    (((x) & ((1<<(l))-1)) << (p))
 #define b_n(x)          b_x((x), 31, 1) /* 0 if relocatable */
@@ -1220,6 +1242,7 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
     int i;
     PCIBus *bus;
     uint64_t msi_window_size = 4096;
+    sPAPRTCETable *tcet;
 
     if (sphb->index != (uint32_t)-1) {
         hwaddr windows_base;
@@ -1369,33 +1392,18 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
         }
     }
 
-    if (!info->finish_realize) {
-        error_setg(errp, "finish_realize not defined");
-        return;
-    }
-
-    info->finish_realize(sphb, errp);
-
-    sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
-}
-
-static void spapr_phb_finish_realize(sPAPRPHBState *sphb, Error **errp)
-{
-    sPAPRTCETable *tcet;
-    uint32_t nb_table;
-
-    nb_table = SPAPR_PCI_DMA32_SIZE >> SPAPR_TCE_PAGE_SHIFT;
-    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn,
-                               0, SPAPR_TCE_PAGE_SHIFT, nb_table, false);
+    info->dma_capabilities_update(sphb);
+    info->dma_init_window(sphb, sphb->dma_liobn, SPAPR_TCE_PAGE_SHIFT,
+                          sphb->dma32_window_size);
+    tcet = spapr_tce_find_by_liobn(sphb->dma_liobn);
     if (!tcet) {
-        error_setg(errp, "Unable to create TCE table for %s",
-                   sphb->dtbusname);
-        return ;
+        error_setg(errp, "failed to create TCE table");
+        return;
     }
-
-    /* Register default 32bit DMA window */
-    memory_region_add_subregion(&sphb->iommu_root, 0,
+    memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
                                 spapr_tce_get_iommu(tcet));
+
+    sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
 }
 
 static int spapr_phb_children_reset(Object *child, void *opaque)
@@ -1543,9 +1551,10 @@ static void spapr_phb_class_init(ObjectClass *klass, void *data)
     dc->vmsd = &vmstate_spapr_pci;
     set_bit(DEVICE_CATEGORY_BRIDGE, dc->categories);
     dc->cannot_instantiate_with_device_add_yet = false;
-    spc->finish_realize = spapr_phb_finish_realize;
     hp->plug = spapr_phb_hot_plug_child;
     hp->unplug = spapr_phb_hot_unplug_child;
+    spc->dma_capabilities_update = spapr_phb_dma_capabilities_update;
+    spc->dma_init_window = spapr_phb_dma_init_window;
 }
 
 static const TypeInfo spapr_phb_info = {
diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
index cca45ed..6e3e17b 100644
--- a/hw/ppc/spapr_pci_vfio.c
+++ b/hw/ppc/spapr_pci_vfio.c
@@ -28,48 +28,36 @@ static Property spapr_phb_vfio_properties[] = {
     DEFINE_PROP_END_OF_LIST(),
 };
 
-static void spapr_phb_vfio_finish_realize(sPAPRPHBState *sphb, Error **errp)
+static int spapr_phb_vfio_dma_capabilities_update(sPAPRPHBState *sphb)
 {
     sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
     struct vfio_iommu_spapr_tce_info info = { .argsz = sizeof(info) };
     int ret;
-    sPAPRTCETable *tcet;
-    uint32_t liobn = svphb->phb.dma_liobn;
 
-    if (svphb->iommugroupid == -1) {
-        error_setg(errp, "Wrong IOMMU group ID %d", svphb->iommugroupid);
-        return;
-    }
-
-    ret = vfio_container_ioctl(&svphb->phb.iommu_as, svphb->iommugroupid,
-                               VFIO_CHECK_EXTENSION,
-                               (void *) VFIO_SPAPR_TCE_IOMMU);
-    if (ret != 1) {
-        error_setg_errno(errp, -ret,
-                         "spapr-vfio: SPAPR extension is not supported");
-        return;
-    }
-
-    ret = vfio_container_ioctl(&svphb->phb.iommu_as, svphb->iommugroupid,
+    ret = vfio_container_ioctl(&sphb->iommu_as, svphb->iommugroupid,
                                VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
     if (ret) {
-        error_setg_errno(errp, -ret,
-                         "spapr-vfio: get info from container failed");
-        return;
+        return ret;
     }
 
-    tcet = spapr_tce_new_table(DEVICE(sphb), liobn, info.dma32_window_start,
-                               SPAPR_TCE_PAGE_SHIFT,
-                               info.dma32_window_size >> SPAPR_TCE_PAGE_SHIFT,
+    sphb->dma32_window_start = info.dma32_window_start;
+    sphb->dma32_window_size = info.dma32_window_size;
+
+    return ret;
+}
+
+static int spapr_phb_vfio_dma_init_window(sPAPRPHBState *sphb,
+                                          uint32_t liobn, uint32_t page_shift,
+                                          uint64_t window_size)
+{
+    uint64_t bus_offset = sphb->dma32_window_start;
+    sPAPRTCETable *tcet;
+
+    tcet = spapr_tce_new_table(DEVICE(sphb), liobn, bus_offset, page_shift,
+                               window_size >> page_shift,
                                true);
-    if (!tcet) {
-        error_setg(errp, "spapr-vfio: failed to create VFIO TCE table");
-        return;
-    }
 
-    /* Register default 32bit DMA window */
-    memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
-                                spapr_tce_get_iommu(tcet));
+    return tcet ? 0 : -1;
 }
 
 static void spapr_phb_vfio_eeh_reenable(sPAPRPHBVFIOState *svphb)
@@ -257,7 +245,8 @@ static void spapr_phb_vfio_class_init(ObjectClass *klass, void *data)
 
     dc->props = spapr_phb_vfio_properties;
     dc->reset = spapr_phb_vfio_reset;
-    spc->finish_realize = spapr_phb_vfio_finish_realize;
+    spc->dma_capabilities_update = spapr_phb_vfio_dma_capabilities_update;
+    spc->dma_init_window = spapr_phb_vfio_dma_init_window;
     spc->eeh_set_option = spapr_phb_vfio_eeh_set_option;
     spc->eeh_get_state = spapr_phb_vfio_eeh_get_state;
     spc->eeh_reset = spapr_phb_vfio_eeh_reset;
diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
index 5322b56..b6d5719 100644
--- a/include/hw/pci-host/spapr.h
+++ b/include/hw/pci-host/spapr.h
@@ -48,7 +48,10 @@ typedef struct sPAPRPHBVFIOState sPAPRPHBVFIOState;
 struct sPAPRPHBClass {
     PCIHostBridgeClass parent_class;
 
-    void (*finish_realize)(sPAPRPHBState *sphb, Error **errp);
+    int (*dma_capabilities_update)(sPAPRPHBState *sphb);
+    int (*dma_init_window)(sPAPRPHBState *sphb,
+                           uint32_t liobn, uint32_t page_shift,
+                           uint64_t window_size);
     int (*eeh_set_option)(sPAPRPHBState *sphb, unsigned int addr, int option);
     int (*eeh_get_state)(sPAPRPHBState *sphb, int *state);
     int (*eeh_reset)(sPAPRPHBState *sphb, int option);
@@ -90,6 +93,9 @@ struct sPAPRPHBState {
     int32_t msi_devs_num;
     spapr_pci_msi_mig *msi_devs;
 
+    uint32_t dma32_window_start;
+    uint32_t dma32_window_size;
+
     QLIST_ENTRY(sPAPRPHBState) list;
 };
 
-- 
2.4.0.rc3.8.gfb3e7d5


* [Qemu-devel] [PATCH qemu v10 04/14] spapr_iommu: Move table allocation to helpers
  2015-07-06  2:10 [Qemu-devel] [PATCH qemu v10 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (2 preceding siblings ...)
  2015-07-06  2:10 ` [Qemu-devel] [PATCH qemu v10 03/14] spapr_pci: Convert finish_realize() to dma_capabilities_update()+dma_init_window() Alexey Kardashevskiy
@ 2015-07-06  2:11 ` Alexey Kardashevskiy
  2015-07-06 15:14   ` Thomas Huth
  2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 05/14] spapr_iommu: Introduce "enabled" state for TCE table Alexey Kardashevskiy
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 71+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-06  2:11 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Michael Roth, Gavin Shan, Alex Williamson,
	qemu-ppc, David Gibson

At the moment the presence of vfio-pci devices on a bus affects the way
the guest view table is allocated. If there is no vfio-pci on a PHB
and the host kernel supports KVM acceleration of H_PUT_TCE, a table
is allocated in KVM. However, if there is vfio-pci and we do not yet
have KVM acceleration for these, the table has to be allocated by
userspace. At the moment the table is allocated once at boot time
but the next patches will reallocate it.

This moves kvmppc_create_spapr_tce/g_malloc0 and their counterparts
to helpers.
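
For illustration only (not part of the patch), the "try the accelerated
allocation first, fall back to userspace" pattern can be sketched in
standalone C. Everything here is an assumption made for the sketch:
accel_alloc() is a hypothetical stand-in for the in-kernel path and always
fails, and calloc()/free() stand in for g_malloc0()/g_free().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for an in-kernel (accelerated) allocator.
 * It always fails here, so the userspace fallback path is exercised.
 */
static uint64_t *accel_alloc(uint32_t liobn, uint64_t window_size, int *fd)
{
    (void)liobn;
    (void)window_size;
    (void)fd;
    return NULL;
}

/* True when the window size has no bits set above bit 31 (below 4GB). */
static bool window_fits_32bit(uint32_t nb_table, uint32_t page_shift)
{
    /* Widen before shifting so the size is computed in 64 bits. */
    uint64_t window_size = (uint64_t)nb_table << page_shift;

    return !(window_size >> 32);
}

/* Allocate a zeroed table; *fd is -1 when the table lives in userspace. */
uint64_t *table_alloc(uint32_t liobn, uint32_t nb_table,
                      uint32_t page_shift, int *fd, bool accel_enabled)
{
    uint64_t *table = NULL;

    assert(fd != NULL && nb_table > 0 && page_shift < 64);

    if (accel_enabled && window_fits_32bit(nb_table, page_shift)) {
        table = accel_alloc(liobn, (uint64_t)nb_table << page_shift, fd);
    }
    if (!table) {
        /* Fallback: plain zeroed memory owned by userspace. */
        *fd = -1;
        table = calloc(nb_table, sizeof(uint64_t));
    }
    return table;
}

/* Counterpart of table_alloc(); only userspace tables are freed here. */
void table_free(uint64_t *table, int fd)
{
    if (fd < 0) {
        free(table);
    }
}
```

With 4K pages, a 2GB window (0x80000 entries) passes the 32-bit check
while a 4GB window (0x100000 entries) does not, so the latter would always
take the userspace path in this sketch.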

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/ppc/spapr_iommu.c | 58 +++++++++++++++++++++++++++++++++++-----------------
 trace-events         |  2 +-
 2 files changed, 40 insertions(+), 20 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index f61504e..0cf5010 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -74,6 +74,37 @@ static IOMMUAccessFlags spapr_tce_iommu_access_flags(uint64_t tce)
     }
 }
 
+static uint64_t *spapr_tce_alloc_table(uint32_t liobn,
+                                       uint32_t nb_table,
+                                       uint32_t page_shift,
+                                       int *fd,
+                                       bool vfio_accel)
+{
+    uint64_t *table = NULL;
+    uint64_t window_size = (uint64_t)nb_table << page_shift;
+
+    if (kvm_enabled() && !(window_size >> 32)) {
+        table = kvmppc_create_spapr_tce(liobn, window_size, fd, vfio_accel);
+    }
+
+    if (!table) {
+        *fd = -1;
+        table = g_malloc0(nb_table * sizeof(uint64_t));
+    }
+
+    trace_spapr_iommu_alloc_table(liobn, table, *fd);
+
+    return table;
+}
+
+static void spapr_tce_free_table(uint64_t *table, int fd, uint32_t nb_table)
+{
+    if (!kvm_enabled() ||
+        (kvmppc_remove_spapr_tce(table, fd, nb_table) != 0)) {
+        g_free(table);
+    }
+}
+
 /* Called from RCU critical section */
 static IOMMUTLBEntry spapr_tce_translate_iommu(MemoryRegion *iommu, hwaddr addr,
                                                bool is_write)
@@ -140,21 +171,13 @@ static MemoryRegionIOMMUOps spapr_iommu_ops = {
 static int spapr_tce_table_realize(DeviceState *dev)
 {
     sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
-    uint64_t window_size = (uint64_t)tcet->nb_table << tcet->page_shift;
 
-    if (kvm_enabled() && !(window_size >> 32)) {
-        tcet->table = kvmppc_create_spapr_tce(tcet->liobn,
-                                              window_size,
-                                              &tcet->fd,
-                                              tcet->vfio_accel);
-    }
-
-    if (!tcet->table) {
-        size_t table_size = tcet->nb_table * sizeof(uint64_t);
-        tcet->table = g_malloc0(table_size);
-    }
-
-    trace_spapr_iommu_new_table(tcet->liobn, tcet, tcet->table, tcet->fd);
+    tcet->fd = -1;
+    tcet->table = spapr_tce_alloc_table(tcet->liobn,
+                                        tcet->nb_table,
+                                        tcet->page_shift,
+                                        &tcet->fd,
+                                        tcet->vfio_accel);
 
     memory_region_init_iommu(&tcet->iommu, OBJECT(dev), &spapr_iommu_ops,
                              "iommu-spapr",
@@ -208,11 +231,8 @@ static void spapr_tce_table_unrealize(DeviceState *dev, Error **errp)
 
     QLIST_REMOVE(tcet, list);
 
-    if (!kvm_enabled() ||
-        (kvmppc_remove_spapr_tce(tcet->table, tcet->fd,
-                                 tcet->nb_table) != 0)) {
-        g_free(tcet->table);
-    }
+    spapr_tce_free_table(tcet->table, tcet->fd, tcet->nb_table);
+    tcet->fd = -1;
 }
 
 MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet)
diff --git a/trace-events b/trace-events
index 52b7efa..a93af9a 100644
--- a/trace-events
+++ b/trace-events
@@ -1362,7 +1362,7 @@ spapr_iommu_pci_get(uint64_t liobn, uint64_t ioba, uint64_t ret, uint64_t tce) "
 spapr_iommu_pci_indirect(uint64_t liobn, uint64_t ioba, uint64_t tce, uint64_t iobaN, uint64_t tceN, uint64_t ret) "liobn=%"PRIx64" ioba=0x%"PRIx64" tcelist=0x%"PRIx64" iobaN=0x%"PRIx64" tceN=0x%"PRIx64" ret=%"PRId64
 spapr_iommu_pci_stuff(uint64_t liobn, uint64_t ioba, uint64_t tce_value, uint64_t npages, uint64_t ret) "liobn=%"PRIx64" ioba=0x%"PRIx64" tcevalue=0x%"PRIx64" npages=%"PRId64" ret=%"PRId64
 spapr_iommu_xlate(uint64_t liobn, uint64_t ioba, uint64_t tce, unsigned perm, unsigned pgsize) "liobn=%"PRIx64" 0x%"PRIx64" -> 0x%"PRIx64" perm=%u mask=%x"
-spapr_iommu_new_table(uint64_t liobn, void *tcet, void *table, int fd) "liobn=%"PRIx64" tcet=%p table=%p fd=%d"
+spapr_iommu_alloc_table(uint64_t liobn, void *table, int fd) "liobn=%"PRIx64" table=%p fd=%d"
 
 # hw/ppc/ppc.c
 ppc_tb_adjust(uint64_t offs1, uint64_t offs2, int64_t diff, int64_t seconds) "adjusted from 0x%"PRIx64" to 0x%"PRIx64", diff %"PRId64" (%"PRId64"s)"
-- 
2.4.0.rc3.8.gfb3e7d5


* [Qemu-devel] [PATCH qemu v10 05/14] spapr_iommu: Introduce "enabled" state for TCE table
  2015-07-06  2:10 [Qemu-devel] [PATCH qemu v10 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (3 preceding siblings ...)
  2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 04/14] spapr_iommu: Move table allocation to helpers Alexey Kardashevskiy
@ 2015-07-06  2:11 ` Alexey Kardashevskiy
  2015-07-06 10:07   ` David Gibson
  2015-07-06 17:04   ` Thomas Huth
  2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 06/14] spapr_iommu: Remove vfio_accel flag from sPAPRTCETable Alexey Kardashevskiy
                   ` (10 subsequent siblings)
  15 siblings, 2 replies; 71+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-06  2:11 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Michael Roth, Gavin Shan, Alex Williamson,
	qemu-ppc, David Gibson

Currently TCE tables are created once at start and their size never
changes. We are going to change that by introducing Dynamic DMA windows
support, where the DMA configuration may change during guest execution.

This changes spapr_tce_new_table() to create an empty stub object. Only
the LIOBN is assigned at creation time. It will still be called once,
when the owner object (VIO or PHB) is created.

This introduces an "enabled" state for TCE table objects with two
helper functions - spapr_tce_table_enable()/spapr_tce_table_disable().
spapr_tce_table_enable() receives TCE table parameters and allocates
a guest view of the TCE table (in the user space or KVM).
spapr_tce_table_disable() disposes the table.
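
For illustration only (not part of the patch), the stub-then-enable
lifecycle can be sketched as a standalone toy. ToyTable and the toy_*()
helpers are hypothetical names; calloc()/free() stand in for the real
table allocation, and the window size stands in for the memory region
size that the patch adjusts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy stand-in for a TCE table object; field names mirror the patch. */
typedef struct ToyTable {
    bool enabled;
    uint32_t liobn;
    uint32_t nb_table;
    uint32_t page_shift;
    uint64_t bus_offset;
    uint64_t *table;
} ToyTable;

/* Create an empty stub: only the LIOBN is known at this point. */
void toy_table_init(ToyTable *t, uint32_t liobn)
{
    assert(t != NULL);
    *t = (ToyTable){ .liobn = liobn };
}

/* Store the window parameters and allocate the table; no-op if enabled. */
void toy_table_enable(ToyTable *t, uint64_t bus_offset,
                      uint32_t page_shift, uint32_t nb_table)
{
    assert(t != NULL && page_shift < 64);

    if (t->enabled || !nb_table) {
        return;
    }
    t->table = calloc(nb_table, sizeof(uint64_t));
    if (!t->table) {
        return;
    }
    t->bus_offset = bus_offset;
    t->page_shift = page_shift;
    t->nb_table = nb_table;
    t->enabled = true;
}

/* Dispose of the table and clear the parameters; the LIOBN survives. */
void toy_table_disable(ToyTable *t)
{
    assert(t != NULL);

    if (!t->enabled) {
        return;
    }
    free(t->table);
    t->table = NULL;
    t->bus_offset = 0;
    t->page_shift = 0;
    t->nb_table = 0;
    t->enabled = false;
}

/* A disabled table exposes a zero-sized window. */
uint64_t toy_table_window_size(const ToyTable *t)
{
    return t->enabled ? (uint64_t)t->nb_table << t->page_shift : 0;
}
```

Both toy_table_enable() and toy_table_disable() are idempotent in this
sketch, so a reset path can call disable followed by enable without first
checking the current state.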

Follow up patches will disable+enable tables on reset (system reset
or DDW reset).

No visible change in behaviour is expected except the actual table
will be reallocated every reset. We might optimize this later.

The other way to implement this would be to dynamically create/remove
the TCE table QOM objects but this would make migration impossible
as migration expects all QOM objects to exist at the receiver
so we have to have TCE table objects created when migration begins.

spapr_tce_table_do_enable() is separated from spapr_tce_table_enable()
as later it will be called at the sPAPRTCETable post-migration stage when
it has all the properties set after the migration.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v9 (no changes really):
* IOMMU regions are referenced by their parent which is the PHB root region,
there is no need for explicit unparenting so ignore first note from v8 changelog.

v8:
* add missing unparent_object() to spapr_tce_table_unrealize() (parenting
is made by memory_region_init_iommu)
* tcet->iommu is alive as long as sPAPRTCETable is,
memory_region_set_size() is used to enable/disable MR

v7:
* s'tmp[64]'tmp[32]' as we need less than 64 bytes and more than 16 bytes
and 32 is the closest power-of-two (it just looks nicer to have power-of-two
values)
* updated commit log about having spapr_tce_table_do_enable() split
from spapr_tce_table_enable()

v6:
* got rid of set_props()
---
 hw/ppc/spapr_iommu.c    | 79 +++++++++++++++++++++++++++++++++++--------------
 hw/ppc/spapr_pci.c      | 17 +++++++----
 hw/ppc/spapr_pci_vfio.c | 10 +++----
 hw/ppc/spapr_vio.c      |  9 +++---
 include/hw/ppc/spapr.h  | 11 +++----
 5 files changed, 82 insertions(+), 44 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 0cf5010..fbca136 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -173,15 +173,9 @@ static int spapr_tce_table_realize(DeviceState *dev)
     sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
 
     tcet->fd = -1;
-    tcet->table = spapr_tce_alloc_table(tcet->liobn,
-                                        tcet->nb_table,
-                                        tcet->page_shift,
-                                        &tcet->fd,
-                                        tcet->vfio_accel);
 
     memory_region_init_iommu(&tcet->iommu, OBJECT(dev), &spapr_iommu_ops,
-                             "iommu-spapr",
-                             (uint64_t)tcet->nb_table << tcet->page_shift);
+                             "iommu-spapr", 0);
 
     QLIST_INSERT_HEAD(&spapr_tce_tables, tcet, list);
 
@@ -191,14 +185,10 @@ static int spapr_tce_table_realize(DeviceState *dev)
     return 0;
 }
 
-sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
-                                   uint64_t bus_offset,
-                                   uint32_t page_shift,
-                                   uint32_t nb_table,
-                                   bool vfio_accel)
+sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn)
 {
     sPAPRTCETable *tcet;
-    char tmp[64];
+    char tmp[32];
 
     if (spapr_tce_find_by_liobn(liobn)) {
         fprintf(stderr, "Attempted to create TCE table with duplicate"
@@ -206,16 +196,8 @@ sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
         return NULL;
     }
 
-    if (!nb_table) {
-        return NULL;
-    }
-
     tcet = SPAPR_TCE_TABLE(object_new(TYPE_SPAPR_TCE_TABLE));
     tcet->liobn = liobn;
-    tcet->bus_offset = bus_offset;
-    tcet->page_shift = page_shift;
-    tcet->nb_table = nb_table;
-    tcet->vfio_accel = vfio_accel;
 
     snprintf(tmp, sizeof(tmp), "tce-table-%x", liobn);
     object_property_add_child(OBJECT(owner), tmp, OBJECT(tcet), NULL);
@@ -225,14 +207,65 @@ sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
     return tcet;
 }
 
+static void spapr_tce_table_do_enable(sPAPRTCETable *tcet)
+{
+    if (!tcet->nb_table) {
+        return;
+    }
+
+    tcet->table = spapr_tce_alloc_table(tcet->liobn,
+                                        tcet->nb_table,
+                                        tcet->page_shift,
+                                        &tcet->fd,
+                                        tcet->vfio_accel);
+
+    memory_region_set_size(&tcet->iommu,
+                           (uint64_t)tcet->nb_table << tcet->page_shift);
+
+    tcet->enabled = true;
+}
+
+void spapr_tce_table_enable(sPAPRTCETable *tcet,
+                            uint64_t bus_offset, uint32_t page_shift,
+                            uint32_t nb_table, bool vfio_accel)
+{
+    if (tcet->enabled) {
+        return;
+    }
+
+    tcet->bus_offset = bus_offset;
+    tcet->page_shift = page_shift;
+    tcet->nb_table = nb_table;
+    tcet->vfio_accel = vfio_accel;
+
+    spapr_tce_table_do_enable(tcet);
+}
+
+void spapr_tce_table_disable(sPAPRTCETable *tcet)
+{
+    if (!tcet->enabled) {
+        return;
+    }
+
+    memory_region_set_size(&tcet->iommu, 0);
+
+    spapr_tce_free_table(tcet->table, tcet->fd, tcet->nb_table);
+    tcet->fd = -1;
+    tcet->table = NULL;
+    tcet->enabled = false;
+    tcet->bus_offset = 0;
+    tcet->page_shift = 0;
+    tcet->nb_table = 0;
+    tcet->vfio_accel = false;
+}
+
 static void spapr_tce_table_unrealize(DeviceState *dev, Error **errp)
 {
     sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
 
     QLIST_REMOVE(tcet, list);
 
-    spapr_tce_free_table(tcet->table, tcet->fd, tcet->nb_table);
-    tcet->fd = -1;
+    spapr_tce_table_disable(tcet);
 }
 
 MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet)
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index c1ca13d..3ddd72f 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -821,13 +821,12 @@ static int spapr_phb_dma_init_window(sPAPRPHBState *sphb,
                                      uint64_t window_size)
 {
     uint64_t bus_offset = sphb->dma32_window_start;
-    sPAPRTCETable *tcet;
+    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
 
-    tcet = spapr_tce_new_table(DEVICE(sphb), liobn, bus_offset, page_shift,
-                               window_size >> page_shift,
-                               false);
-
-    return tcet ? 0 : -1;
+    spapr_tce_table_enable(tcet, bus_offset, page_shift,
+                           window_size >> page_shift,
+                           false);
+    return 0;
 }
 
 /* Macros to operate with address in OF binding to PCI */
@@ -1392,6 +1391,12 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
         }
     }
 
+    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
+    if (!tcet) {
+            error_setg(errp, "failed to create TCE table");
+            return;
+    }
+
     info->dma_capabilities_update(sphb);
     info->dma_init_window(sphb, sphb->dma_liobn, SPAPR_TCE_PAGE_SHIFT,
                           sphb->dma32_window_size);
diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
index 6e3e17b..69d85ab 100644
--- a/hw/ppc/spapr_pci_vfio.c
+++ b/hw/ppc/spapr_pci_vfio.c
@@ -51,13 +51,13 @@ static int spapr_phb_vfio_dma_init_window(sPAPRPHBState *sphb,
                                           uint64_t window_size)
 {
     uint64_t bus_offset = sphb->dma32_window_start;
-    sPAPRTCETable *tcet;
+    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
 
-    tcet = spapr_tce_new_table(DEVICE(sphb), liobn, bus_offset, page_shift,
-                               window_size >> page_shift,
-                               true);
+    spapr_tce_table_enable(tcet, bus_offset, page_shift,
+                           window_size >> page_shift,
+                           true);
 
-    return tcet ? 0 : -1;
+    return 0;
 }
 
 static void spapr_phb_vfio_eeh_reenable(sPAPRPHBVFIOState *svphb)
diff --git a/hw/ppc/spapr_vio.c b/hw/ppc/spapr_vio.c
index c51eb8e..912fa06 100644
--- a/hw/ppc/spapr_vio.c
+++ b/hw/ppc/spapr_vio.c
@@ -479,11 +479,10 @@ static void spapr_vio_busdev_realize(DeviceState *qdev, Error **errp)
         memory_region_add_subregion_overlap(&dev->mrroot, 0, &dev->mrbypass, 1);
         address_space_init(&dev->as, &dev->mrroot, qdev->id);
 
-        dev->tcet = spapr_tce_new_table(qdev, liobn,
-                                        0,
-                                        SPAPR_TCE_PAGE_SHIFT,
-                                        pc->rtce_window_size >>
-                                        SPAPR_TCE_PAGE_SHIFT, false);
+        dev->tcet = spapr_tce_new_table(qdev, liobn);
+        spapr_tce_table_enable(dev->tcet, 0, SPAPR_TCE_PAGE_SHIFT,
+                               pc->rtce_window_size >> SPAPR_TCE_PAGE_SHIFT,
+                               false);
         dev->tcet->vdev = dev;
         memory_region_add_subregion_overlap(&dev->mrroot, 0,
                                             spapr_tce_get_iommu(dev->tcet), 2);
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 91a61ab..ed68c95 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -552,6 +552,7 @@ typedef struct sPAPRTCETable sPAPRTCETable;
 
 struct sPAPRTCETable {
     DeviceState parent;
+    bool enabled;
     uint32_t liobn;
     uint32_t nb_table;
     uint64_t bus_offset;
@@ -578,11 +579,11 @@ void spapr_events_init(sPAPRMachineState *sm);
 void spapr_events_fdt_skel(void *fdt, uint32_t epow_irq);
 int spapr_h_cas_compose_response(sPAPRMachineState *sm,
                                  target_ulong addr, target_ulong size);
-sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
-                                   uint64_t bus_offset,
-                                   uint32_t page_shift,
-                                   uint32_t nb_table,
-                                   bool vfio_accel);
+sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn);
+void spapr_tce_table_enable(sPAPRTCETable *tcet,
+                            uint64_t bus_offset, uint32_t page_shift,
+                            uint32_t nb_table, bool vfio_accel);
+void spapr_tce_table_disable(sPAPRTCETable *tcet);
 MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet);
 int spapr_dma_dt(void *fdt, int node_off, const char *propname,
                  uint32_t liobn, uint64_t window, uint32_t size);
-- 
2.4.0.rc3.8.gfb3e7d5


* [Qemu-devel] [PATCH qemu v10 06/14] spapr_iommu: Remove vfio_accel flag from sPAPRTCETable
  2015-07-06  2:10 [Qemu-devel] [PATCH qemu v10 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (4 preceding siblings ...)
  2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 05/14] spapr_iommu: Introduce "enabled" state for TCE table Alexey Kardashevskiy
@ 2015-07-06  2:11 ` Alexey Kardashevskiy
  2015-07-06 16:45   ` Laurent Vivier
  2015-07-06 17:11   ` Thomas Huth
  2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 07/14] spapr_iommu: Add root memory region Alexey Kardashevskiy
                   ` (9 subsequent siblings)
  15 siblings, 2 replies; 71+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-06  2:11 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Michael Roth, Gavin Shan, Alex Williamson,
	qemu-ppc, David Gibson

sPAPRTCETable has a vfio_accel flag which is passed to
kvmppc_create_spapr_tce() and controls whether to create a guest view
table in KVM as this depends on the host kernel's ability to accelerate
H_PUT_TCE for VFIO devices. We used to set this flag when
sPAPRTCETable was created in spapr_tce_new_table() and
use it when the table was allocated in spapr_tce_table_realize().

Now we explicitly enable/disable DMA windows via spapr_tce_table_enable()
and spapr_tce_table_disable() and can pass this flag directly without
caching it in sPAPRTCETable.

This removes the flag. This should cause no behavioural change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
Changes:
v8:
* new to patchset, this is cleanup
---
 hw/ppc/spapr_iommu.c   | 8 +++-----
 include/hw/ppc/spapr.h | 1 -
 2 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index fbca136..1378a7a 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -207,7 +207,7 @@ sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn)
     return tcet;
 }
 
-static void spapr_tce_table_do_enable(sPAPRTCETable *tcet)
+static void spapr_tce_table_do_enable(sPAPRTCETable *tcet, bool vfio_accel)
 {
     if (!tcet->nb_table) {
         return;
@@ -217,7 +217,7 @@ static void spapr_tce_table_do_enable(sPAPRTCETable *tcet)
                                         tcet->nb_table,
                                         tcet->page_shift,
                                         &tcet->fd,
-                                        tcet->vfio_accel);
+                                        vfio_accel);
 
     memory_region_set_size(&tcet->iommu,
                            (uint64_t)tcet->nb_table << tcet->page_shift);
@@ -236,9 +236,8 @@ void spapr_tce_table_enable(sPAPRTCETable *tcet,
     tcet->bus_offset = bus_offset;
     tcet->page_shift = page_shift;
     tcet->nb_table = nb_table;
-    tcet->vfio_accel = vfio_accel;
 
-    spapr_tce_table_do_enable(tcet);
+    spapr_tce_table_do_enable(tcet, vfio_accel);
 }
 
 void spapr_tce_table_disable(sPAPRTCETable *tcet)
@@ -256,7 +255,6 @@ void spapr_tce_table_disable(sPAPRTCETable *tcet)
     tcet->bus_offset = 0;
     tcet->page_shift = 0;
     tcet->nb_table = 0;
-    tcet->vfio_accel = false;
 }
 
 static void spapr_tce_table_unrealize(DeviceState *dev, Error **errp)
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index ed68c95..1da0ade 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -559,7 +559,6 @@ struct sPAPRTCETable {
     uint32_t page_shift;
     uint64_t *table;
     bool bypass;
-    bool vfio_accel;
     int fd;
     MemoryRegion iommu;
     struct VIOsPAPRDevice *vdev; /* for @bypass migration compatibility only */
-- 
2.4.0.rc3.8.gfb3e7d5


* [Qemu-devel] [PATCH qemu v10 07/14] spapr_iommu: Add root memory region
  2015-07-06  2:10 [Qemu-devel] [PATCH qemu v10 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (5 preceding siblings ...)
  2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 06/14] spapr_iommu: Remove vfio_accel flag from sPAPRTCETable Alexey Kardashevskiy
@ 2015-07-06  2:11 ` Alexey Kardashevskiy
  2015-07-06 19:15   ` Thomas Huth
  2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 08/14] spapr_pci: Do complete reset of DMA config when resetting PHB Alexey Kardashevskiy
                   ` (8 subsequent siblings)
  15 siblings, 1 reply; 71+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-06  2:11 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Michael Roth, Gavin Shan, Alex Williamson,
	qemu-ppc, David Gibson

We are going to have multiple DMA windows at different offsets on
a PCI bus. For the sake of migration, we will pre-create as many TCE
table objects as there are windows supported.
So we need a way to map windows dynamically onto a PCI bus
when migration of a table is completed but at this stage a TCE table
object does not have access to a PHB to ask it to map a DMA window
backed by the just-migrated TCE table.

This adds a "root" memory region (UINT64_MAX long) to the TCE object.
This new region is mapped on a PCI bus with overlapping enabled as
there will be one root MR per TCE table, each of them mapped at 0.
The actual IOMMU memory region is a subregion of the root region and
a TCE table enables/disables this subregion and maps it at
the specific offset inside the root MR, which is a 1:1 mapping of
the PCI address space.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/ppc/spapr_iommu.c   | 13 ++++++++++---
 hw/ppc/spapr_pci.c     |  2 +-
 include/hw/ppc/spapr.h |  2 +-
 3 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 1378a7a..45c00d8 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -171,11 +171,16 @@ static MemoryRegionIOMMUOps spapr_iommu_ops = {
 static int spapr_tce_table_realize(DeviceState *dev)
 {
     sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
+    Object *tcetobj = OBJECT(tcet);
+    char tmp[32];
 
     tcet->fd = -1;
 
-    memory_region_init_iommu(&tcet->iommu, OBJECT(dev), &spapr_iommu_ops,
-                             "iommu-spapr", 0);
+    snprintf(tmp, sizeof(tmp), "tce-root-%x", tcet->liobn);
+    memory_region_init(&tcet->root, tcetobj, tmp, UINT64_MAX);
+
+    snprintf(tmp, sizeof(tmp), "tce-iommu-%x", tcet->liobn);
+    memory_region_init_iommu(&tcet->iommu, tcetobj, &spapr_iommu_ops, tmp, 0);
 
     QLIST_INSERT_HEAD(&spapr_tce_tables, tcet, list);
 
@@ -221,6 +226,7 @@ static void spapr_tce_table_do_enable(sPAPRTCETable *tcet, bool vfio_accel)
 
     memory_region_set_size(&tcet->iommu,
                            (uint64_t)tcet->nb_table << tcet->page_shift);
+    memory_region_add_subregion(&tcet->root, tcet->bus_offset, &tcet->iommu);
 
     tcet->enabled = true;
 }
@@ -246,6 +252,7 @@ void spapr_tce_table_disable(sPAPRTCETable *tcet)
         return;
     }
 
+    memory_region_del_subregion(&tcet->root, &tcet->iommu);
     memory_region_set_size(&tcet->iommu, 0);
 
     spapr_tce_free_table(tcet->table, tcet->fd, tcet->nb_table);
@@ -268,7 +275,7 @@ static void spapr_tce_table_unrealize(DeviceState *dev, Error **errp)
 
 MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet)
 {
-    return &tcet->iommu;
+    return &tcet->root;
 }
 
 static void spapr_tce_reset(DeviceState *dev)
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 3ddd72f..e27ca15 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -1405,7 +1405,7 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
         error_setg(errp, "failed to create TCE table");
         return;
     }
-    memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
+    memory_region_add_subregion(&sphb->iommu_root, 0,
                                 spapr_tce_get_iommu(tcet));
 
     sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 1da0ade..e32e787 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -560,7 +560,7 @@ struct sPAPRTCETable {
     uint64_t *table;
     bool bypass;
     int fd;
-    MemoryRegion iommu;
+    MemoryRegion root, iommu;
     struct VIOsPAPRDevice *vdev; /* for @bypass migration compatibility only */
     QLIST_ENTRY(sPAPRTCETable) list;
 };
-- 
2.4.0.rc3.8.gfb3e7d5


* [Qemu-devel] [PATCH qemu v10 08/14] spapr_pci: Do complete reset of DMA config when resetting PHB
  2015-07-06  2:10 [Qemu-devel] [PATCH qemu v10 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (6 preceding siblings ...)
  2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 07/14] spapr_iommu: Add root memory region Alexey Kardashevskiy
@ 2015-07-06  2:11 ` Alexey Kardashevskiy
  2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 09/14] spapr_vfio_pci: Remove redundant spapr-pci-vfio-host-bridge Alexey Kardashevskiy
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 71+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-06  2:11 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Michael Roth, Gavin Shan, Alex Williamson,
	qemu-ppc, David Gibson

On a system reset, DMA configuration has to be reset too. At the moment
it clears the table content. This is enough for the single table case
but with DDW, we will also have to disable all DMA windows except
the default one. Furthermore, according to sPAPR, if the guest removed
the default window and created a huge one at the same zero offset on
a PCI bus, the reset handler has to recreate the default window with
the default properties (2GB big, 4K pages).

This reworks SPAPR PHB code to disable the existing DMA window on reset
and then configure and enable the default window.
Without DDW that means that the same window will be disabled and then
enabled with no other change in behaviour.

This changes the table creation to do it in one place in PHB (VFIO PHB
just inherits the behaviour from PHB). The actual table allocation is
done from the reset handler and this is where dma_init_window() is called.

This disables all DMA windows on a PHB reset. It does not make any
difference now as there is just one DMA window but it will later with DDW
patches.
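
The reset sequence described above can be sketched outside of QEMU as
follows. This is only an illustration under stated assumptions: struct
dma_window, struct phb, window_disable(), window_enable() and
phb_dma_reset() are hypothetical stand-ins for sPAPRTCETable,
sPAPRPHBState and the spapr_tce_table_*() helpers, MAX_WINDOWS is set to 2
only to show the loop (the patch defines SPAPR_PCI_DMA_MAX_WINDOWS as 1),
and the default size is the SPAPR_PCI_DMA32_SIZE value from the header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are not the real QEMU definitions. */
#define MAX_WINDOWS        2              /* the patch itself uses 1 */
#define DEFAULT_PAGE_SHIFT 12             /* 4K pages */
#define DEFAULT_WIN_SIZE   0x40000000ULL  /* SPAPR_PCI_DMA32_SIZE */

struct dma_window {
    bool enabled;
    uint64_t bus_offset;
    uint32_t page_shift;
    uint64_t nb_entries;
};

struct phb {
    struct dma_window win[MAX_WINDOWS];
};

/* Models spapr_tce_table_disable(): the window stops translating. */
static void window_disable(struct dma_window *w)
{
    w->enabled = false;
    w->nb_entries = 0;
}

/* Models spapr_tce_table_enable() with the parameters used on reset. */
static void window_enable(struct dma_window *w, uint64_t bus_offset,
                          uint32_t page_shift, uint64_t size)
{
    w->enabled = true;
    w->bus_offset = bus_offset;
    w->page_shift = page_shift;
    w->nb_entries = size >> page_shift;
}

/* Disable every window, then recreate window 0 with default properties. */
void phb_dma_reset(struct phb *p)
{
    size_t i;

    for (i = 0; i < MAX_WINDOWS; ++i) {
        window_disable(&p->win[i]);
    }
    window_enable(&p->win[0], 0, DEFAULT_PAGE_SHIFT, DEFAULT_WIN_SIZE);
}
```

Whatever the guest configured before the reset (for example a huge window
with 16M pages at offset zero), window 0 comes back with 4K pages and all
other windows stay disabled.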

This makes spapr_phb_dma_reset() and spapr_phb_dma_remove_window() public
as these will be used in DDW RTAS "ibm,reset-pe-dma-window" and
"ibm,remove-pe-dma-window" handlers later; the handlers will reside in
hw/ppc/spapr_rtas_ddw.c.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
Changes:
v9:
* as spapr_phb_vfio_reset() became not empty, this does not remove it but
adds spapr_phb_dma_reset() call
* added SPAPR_PCI_DMA_MAX_WINDOWS (was in
"spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)")
* object_child_foreach() is replaced with explicit loop over DMA windows
as later in the patchset we will be doing same loop and there the order
will matter (small windows should be enumerated first)

v7:
* s'finish_realize'dma_init_window' in the commit log
* added details (initial clause about reuse was there :) )
why exactly spapr_phb_dma_remove_window is public
---
 hw/ppc/spapr_pci.c          | 42 +++++++++++++++++++++++++++++++++---------
 hw/ppc/spapr_pci_vfio.c     |  4 ++++
 include/hw/pci-host/spapr.h |  5 +++++
 3 files changed, 42 insertions(+), 9 deletions(-)

diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index e27ca15..00816b3 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -829,6 +829,35 @@ static int spapr_phb_dma_init_window(sPAPRPHBState *sphb,
     return 0;
 }
 
+int spapr_phb_dma_remove_window(sPAPRPHBState *sphb,
+                                sPAPRTCETable *tcet)
+{
+    spapr_tce_table_disable(tcet);
+
+    return 0;
+}
+
+int spapr_phb_dma_reset(sPAPRPHBState *sphb)
+{
+    int i;
+    sPAPRTCETable *tcet;
+    sPAPRPHBClass *spc = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(sphb);
+
+    spc->dma_capabilities_update(sphb); /* Refresh @has_vfio status */
+
+    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
+        tcet = spapr_tce_find_by_liobn(SPAPR_PCI_LIOBN(sphb->index, i));
+        if (tcet) {
+            spapr_phb_dma_remove_window(sphb, tcet);
+        }
+    }
+
+    spc->dma_init_window(sphb, SPAPR_PCI_LIOBN(sphb->index, 0),
+                         SPAPR_TCE_PAGE_SHIFT, sphb->dma32_window_size);
+
+    return 0;
+}
+
 /* Macros to operate with address in OF binding to PCI */
 #define b_x(x, p, l)    (((x) & ((1<<(l))-1)) << (p))
 #define b_n(x)          b_x((x), 31, 1) /* 0 if relocatable */
@@ -1236,7 +1265,6 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
     SysBusDevice *s = SYS_BUS_DEVICE(dev);
     sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(s);
     PCIHostState *phb = PCI_HOST_BRIDGE(s);
-    sPAPRPHBClass *info = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(s);
     char *namebuf;
     int i;
     PCIBus *bus;
@@ -1397,14 +1425,6 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
             return;
     }
 
-    info->dma_capabilities_update(sphb);
-    info->dma_init_window(sphb, sphb->dma_liobn, SPAPR_TCE_PAGE_SHIFT,
-                          sphb->dma32_window_size);
-    tcet = spapr_tce_find_by_liobn(sphb->dma_liobn);
-    if (!tcet) {
-        error_setg(errp, "failed to create TCE table");
-        return;
-    }
     memory_region_add_subregion(&sphb->iommu_root, 0,
                                 spapr_tce_get_iommu(tcet));
 
@@ -1424,6 +1444,10 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
 
 static void spapr_phb_reset(DeviceState *qdev)
 {
+    sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(qdev);
+
+    spapr_phb_dma_reset(sphb);
+
     /* Reset the IOMMU state */
     object_child_foreach(OBJECT(qdev), spapr_phb_children_reset, NULL);
 }
diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
index 69d85ab..cf5483a 100644
--- a/hw/ppc/spapr_pci_vfio.c
+++ b/hw/ppc/spapr_pci_vfio.c
@@ -73,6 +73,10 @@ static void spapr_phb_vfio_eeh_reenable(sPAPRPHBVFIOState *svphb)
 
 static void spapr_phb_vfio_reset(DeviceState *qdev)
 {
+    sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(qdev);
+
+    spapr_phb_dma_reset(sphb);
+
     /*
      * The PE might be in frozen state. To reenable the EEH
      * functionality on it will clean the frozen state, which
diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
index b6d5719..fff868e 100644
--- a/include/hw/pci-host/spapr.h
+++ b/include/hw/pci-host/spapr.h
@@ -123,6 +123,8 @@ struct sPAPRPHBVFIOState {
 
 #define SPAPR_PCI_DMA32_SIZE         0x40000000
 
+#define SPAPR_PCI_DMA_MAX_WINDOWS    1
+
 static inline qemu_irq spapr_phb_lsi_qirq(struct sPAPRPHBState *phb, int pin)
 {
     sPAPRMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
@@ -143,5 +145,8 @@ void spapr_pci_rtas_init(void);
 sPAPRPHBState *spapr_pci_find_phb(sPAPRMachineState *spapr, uint64_t buid);
 PCIDevice *spapr_pci_find_dev(sPAPRMachineState *spapr, uint64_t buid,
                               uint32_t config_addr);
+int spapr_phb_dma_remove_window(sPAPRPHBState *sphb,
+                                sPAPRTCETable *tcet);
+int spapr_phb_dma_reset(sPAPRPHBState *sphb);
 
 #endif /* __HW_SPAPR_PCI_H__ */
-- 
2.4.0.rc3.8.gfb3e7d5


* [Qemu-devel] [PATCH qemu v10 09/14] spapr_vfio_pci: Remove redundant spapr-pci-vfio-host-bridge
  2015-07-06  2:10 [Qemu-devel] [PATCH qemu v10 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (7 preceding siblings ...)
  2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 08/14] spapr_pci: Do complete reset of DMA config when resetting PHB Alexey Kardashevskiy
@ 2015-07-06  2:11 ` Alexey Kardashevskiy
  2015-07-06 21:13   ` Thomas Huth
  2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 10/14] spapr_pci: Enable vfio-pci hotplug Alexey Kardashevskiy
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 71+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-06  2:11 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Michael Roth, Gavin Shan, Alex Williamson,
	qemu-ppc, David Gibson

sPAPRTCETable is handling 2 TCE tables already:

1) guest view of the TCE table - emulated devices use only this table;

2) hardware IOMMU table - VFIO PCI devices use it for actual work but
it does not replace 1) and it is not visible to the guest.
The initialization of this table is driven by vfio-pci device,
DMA map/unmap requests are handled via MemoryListener so there is very
little to do in spapr-pci-vfio-host-bridge.

This moves VFIO bits to the generic spapr-pci-host-bridge which allows
putting emulated and VFIO devices on the same PHB. It is still possible
to create multiple PHBs and avoid sharing PHB resources for emulated and
VFIO devices.

If there is no VFIO-PCI device attached, no special ioctls will be called.
If there are some VFIO-PCI devices attached, PHB may refuse to attach
another VFIO-PCI device if a VFIO container on the host kernel side
does not support container sharing.

This changes spapr-pci-host-bridge to support properties of
spapr-pci-vfio-host-bridge. This makes spapr-pci-vfio-host-bridge type
equal to spapr-pci-host-bridge except it has an additional "iommu"
property for backward compatibility reasons.

This moves PCI device lookup from spapr_phb_vfio_eeh_set_option() to
rtas_ibm_set_eeh_option() as we need to know if the device is "vfio-pci"
and decide whether to call spapr_phb_vfio_eeh_set_option() or not.
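
A minimal sketch of how the EEH RTAS handlers are gated after this change,
assuming the same check as in the diff (PHB not found or no VFIO on it
means a parameter error). struct phb, vfio_eeh_configure() and
rtas_configure_pe() are hypothetical names used only for illustration, and
the return codes are illustrative values rather than the RTAS_OUT_*
constants themselves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative return codes; stand-ins for RTAS_OUT_* in QEMU. */
enum { OUT_SUCCESS = 0, OUT_PARAM_ERROR = -3 };

/* Hypothetical stand-in for sPAPRPHBState. */
struct phb {
    bool has_vfio;   /* refreshed by the DMA capabilities update */
    int eeh_calls;   /* counts calls that reached the VFIO backend */
};

/* Models spapr_phb_vfio_eeh_configure(); the real one issues an ioctl. */
static int vfio_eeh_configure(struct phb *p)
{
    p->eeh_calls++;
    return OUT_SUCCESS;
}

/* Direct call replaces the former class callback (spc->eeh_configure). */
int rtas_configure_pe(struct phb *p)
{
    if (!p || !p->has_vfio) {
        return OUT_PARAM_ERROR;
    }
    return vfio_eeh_configure(p);
}
```

With this shape a PHB that only has emulated devices never reaches the VFIO
backend, which is the "no special ioctls" behaviour described above.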

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
Changes:
v9:
* s'iommugroupid shall not be used'iommugroupid is deprecated and will be ignored'
in error log

v8:
* call spapr_phb_vfio_eeh_set_option() on vfio-pci devices only (reported by Gavin)
---
 hw/ppc/spapr_pci.c          | 82 +++++++++++++++----------------------------
 hw/ppc/spapr_pci_vfio.c     | 85 +++++++++------------------------------------
 include/hw/pci-host/spapr.h | 25 ++++++-------
 3 files changed, 55 insertions(+), 137 deletions(-)

diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 00816b3..76c988f 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -428,7 +428,6 @@ static void rtas_ibm_set_eeh_option(PowerPCCPU *cpu,
                                     target_ulong rets)
 {
     sPAPRPHBState *sphb;
-    sPAPRPHBClass *spc;
     PCIDevice *pdev;
     uint32_t addr, option;
     uint64_t buid;
@@ -443,7 +442,7 @@ static void rtas_ibm_set_eeh_option(PowerPCCPU *cpu,
     option = rtas_ld(args, 3);
 
     sphb = spapr_pci_find_phb(spapr, buid);
-    if (!sphb) {
+    if (!sphb || !sphb->has_vfio) {
         goto param_error_exit;
     }
 
@@ -453,12 +452,7 @@ static void rtas_ibm_set_eeh_option(PowerPCCPU *cpu,
         goto param_error_exit;
     }
 
-    spc = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(sphb);
-    if (!spc->eeh_set_option) {
-        goto param_error_exit;
-    }
-
-    ret = spc->eeh_set_option(sphb, addr, option);
+    ret = spapr_phb_vfio_eeh_set_option(sphb, pdev, option);
     rtas_st(rets, 0, ret);
     return;
 
@@ -473,7 +467,6 @@ static void rtas_ibm_get_config_addr_info2(PowerPCCPU *cpu,
                                            target_ulong rets)
 {
     sPAPRPHBState *sphb;
-    sPAPRPHBClass *spc;
     PCIDevice *pdev;
     uint32_t addr, option;
     uint64_t buid;
@@ -484,12 +477,7 @@ static void rtas_ibm_get_config_addr_info2(PowerPCCPU *cpu,
 
     buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
     sphb = spapr_pci_find_phb(spapr, buid);
-    if (!sphb) {
-        goto param_error_exit;
-    }
-
-    spc = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(sphb);
-    if (!spc->eeh_set_option) {
+    if (!sphb || !sphb->has_vfio) {
         goto param_error_exit;
     }
 
@@ -529,7 +517,6 @@ static void rtas_ibm_read_slot_reset_state2(PowerPCCPU *cpu,
                                             target_ulong rets)
 {
     sPAPRPHBState *sphb;
-    sPAPRPHBClass *spc;
     uint64_t buid;
     int state, ret;
 
@@ -539,16 +526,11 @@ static void rtas_ibm_read_slot_reset_state2(PowerPCCPU *cpu,
 
     buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
     sphb = spapr_pci_find_phb(spapr, buid);
-    if (!sphb) {
+    if (!sphb || !sphb->has_vfio) {
         goto param_error_exit;
     }
 
-    spc = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(sphb);
-    if (!spc->eeh_get_state) {
-        goto param_error_exit;
-    }
-
-    ret = spc->eeh_get_state(sphb, &state);
+    ret = spapr_phb_vfio_eeh_get_state(sphb, &state);
     rtas_st(rets, 0, ret);
     if (ret != RTAS_OUT_SUCCESS) {
         return;
@@ -573,7 +555,6 @@ static void rtas_ibm_set_slot_reset(PowerPCCPU *cpu,
                                     target_ulong rets)
 {
     sPAPRPHBState *sphb;
-    sPAPRPHBClass *spc;
     uint32_t option;
     uint64_t buid;
     int ret;
@@ -585,16 +566,11 @@ static void rtas_ibm_set_slot_reset(PowerPCCPU *cpu,
     buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
     option = rtas_ld(args, 3);
     sphb = spapr_pci_find_phb(spapr, buid);
-    if (!sphb) {
+    if (!sphb || !sphb->has_vfio) {
         goto param_error_exit;
     }
 
-    spc = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(sphb);
-    if (!spc->eeh_reset) {
-        goto param_error_exit;
-    }
-
-    ret = spc->eeh_reset(sphb, option);
+    ret = spapr_phb_vfio_eeh_reset(sphb, option);
     rtas_st(rets, 0, ret);
     return;
 
@@ -609,7 +585,6 @@ static void rtas_ibm_configure_pe(PowerPCCPU *cpu,
                                   target_ulong rets)
 {
     sPAPRPHBState *sphb;
-    sPAPRPHBClass *spc;
     uint64_t buid;
     int ret;
 
@@ -619,16 +594,11 @@ static void rtas_ibm_configure_pe(PowerPCCPU *cpu,
 
     buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
     sphb = spapr_pci_find_phb(spapr, buid);
-    if (!sphb) {
+    if (!sphb || !sphb->has_vfio) {
         goto param_error_exit;
     }
 
-    spc = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(sphb);
-    if (!spc->eeh_configure) {
-        goto param_error_exit;
-    }
-
-    ret = spc->eeh_configure(sphb);
+    ret = spapr_phb_vfio_eeh_configure(sphb);
     rtas_st(rets, 0, ret);
     return;
 
@@ -644,7 +614,6 @@ static void rtas_ibm_slot_error_detail(PowerPCCPU *cpu,
                                        target_ulong rets)
 {
     sPAPRPHBState *sphb;
-    sPAPRPHBClass *spc;
     int option;
     uint64_t buid;
 
@@ -654,12 +623,7 @@ static void rtas_ibm_slot_error_detail(PowerPCCPU *cpu,
 
     buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
     sphb = spapr_pci_find_phb(spapr, buid);
-    if (!sphb) {
-        goto param_error_exit;
-    }
-
-    spc = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(sphb);
-    if (!spc->eeh_set_option) {
+    if (!sphb || !sphb->has_vfio) {
         goto param_error_exit;
     }
 
@@ -810,9 +774,14 @@ static char *spapr_phb_get_loc_code(sPAPRPHBState *sphb, PCIDevice *pdev)
 
 static int spapr_phb_dma_capabilities_update(sPAPRPHBState *sphb)
 {
+    int ret;
+
     sphb->dma32_window_start = 0;
     sphb->dma32_window_size = SPAPR_PCI_DMA32_SIZE;
 
+    ret = spapr_phb_vfio_dma_capabilities_update(sphb);
+    sphb->has_vfio = (ret == 0);
+
     return 0;
 }
 
@@ -825,7 +794,8 @@ static int spapr_phb_dma_init_window(sPAPRPHBState *sphb,
 
     spapr_tce_table_enable(tcet, bus_offset, page_shift,
                            window_size >> page_shift,
-                           false);
+                           sphb->has_vfio);
+
     return 0;
 }
 
@@ -841,9 +811,8 @@ int spapr_phb_dma_reset(sPAPRPHBState *sphb)
 {
     int i;
     sPAPRTCETable *tcet;
-    sPAPRPHBClass *spc = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(sphb);
 
-    spc->dma_capabilities_update(sphb); /* Refresh @has_vfio status */
+    spapr_phb_dma_capabilities_update(sphb); /* Refresh @has_vfio status */
 
     for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
         tcet = spapr_tce_find_by_liobn(SPAPR_PCI_LIOBN(sphb->index, i));
@@ -852,8 +821,8 @@ int spapr_phb_dma_reset(sPAPRPHBState *sphb)
         }
     }
 
-    spc->dma_init_window(sphb, SPAPR_PCI_LIOBN(sphb->index, 0),
-                         SPAPR_TCE_PAGE_SHIFT, sphb->dma32_window_size);
+    spapr_phb_dma_init_window(sphb, SPAPR_PCI_LIOBN(sphb->index, 0),
+                              SPAPR_TCE_PAGE_SHIFT, sphb->dma32_window_size);
 
     return 0;
 }
@@ -1271,6 +1240,11 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
     uint64_t msi_window_size = 4096;
     sPAPRTCETable *tcet;
 
+    if ((sphb->iommugroupid != -1) &&
+        object_dynamic_cast(OBJECT(sphb), TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE)) {
+        error_report("Warning: iommugroupid is deprecated and will be ignored");
+    }
+
     if (sphb->index != (uint32_t)-1) {
         hwaddr windows_base;
 
@@ -1446,6 +1420,9 @@ static void spapr_phb_reset(DeviceState *qdev)
 {
     sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(qdev);
 
+    if (sphb->has_vfio) {
+        spapr_phb_vfio_eeh_reenable(sphb);
+    }
     spapr_phb_dma_reset(sphb);
 
     /* Reset the IOMMU state */
@@ -1570,7 +1547,6 @@ static void spapr_phb_class_init(ObjectClass *klass, void *data)
 {
     PCIHostBridgeClass *hc = PCI_HOST_BRIDGE_CLASS(klass);
     DeviceClass *dc = DEVICE_CLASS(klass);
-    sPAPRPHBClass *spc = SPAPR_PCI_HOST_BRIDGE_CLASS(klass);
     HotplugHandlerClass *hp = HOTPLUG_HANDLER_CLASS(klass);
 
     hc->root_bus_path = spapr_phb_root_bus_path;
@@ -1582,8 +1558,6 @@ static void spapr_phb_class_init(ObjectClass *klass, void *data)
     dc->cannot_instantiate_with_device_add_yet = false;
     hp->plug = spapr_phb_hot_plug_child;
     hp->unplug = spapr_phb_hot_unplug_child;
-    spc->dma_capabilities_update = spapr_phb_dma_capabilities_update;
-    spc->dma_init_window = spapr_phb_dma_init_window;
 }
 
 static const TypeInfo spapr_phb_info = {
diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
index cf5483a..04ca4cf 100644
--- a/hw/ppc/spapr_pci_vfio.c
+++ b/hw/ppc/spapr_pci_vfio.c
@@ -24,17 +24,16 @@
 #include "hw/vfio/vfio.h"
 
 static Property spapr_phb_vfio_properties[] = {
-    DEFINE_PROP_INT32("iommu", sPAPRPHBVFIOState, iommugroupid, -1),
+    DEFINE_PROP_INT32("iommu", sPAPRPHBState, iommugroupid, -1),
     DEFINE_PROP_END_OF_LIST(),
 };
 
-static int spapr_phb_vfio_dma_capabilities_update(sPAPRPHBState *sphb)
+int spapr_phb_vfio_dma_capabilities_update(sPAPRPHBState *sphb)
 {
-    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
     struct vfio_iommu_spapr_tce_info info = { .argsz = sizeof(info) };
     int ret;
 
-    ret = vfio_container_ioctl(&sphb->iommu_as, svphb->iommugroupid,
+    ret = vfio_container_ioctl(&sphb->iommu_as, sphb->iommugroupid,
                                VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
     if (ret) {
         return ret;
@@ -46,50 +45,27 @@ static int spapr_phb_vfio_dma_capabilities_update(sPAPRPHBState *sphb)
     return ret;
 }
 
-static int spapr_phb_vfio_dma_init_window(sPAPRPHBState *sphb,
-                                          uint32_t liobn, uint32_t page_shift,
-                                          uint64_t window_size)
-{
-    uint64_t bus_offset = sphb->dma32_window_start;
-    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
 
-    spapr_tce_table_enable(tcet, bus_offset, page_shift,
-                           window_size >> page_shift,
-                           true);
-
-    return 0;
-}
-
-static void spapr_phb_vfio_eeh_reenable(sPAPRPHBVFIOState *svphb)
+void spapr_phb_vfio_eeh_reenable(sPAPRPHBState *sphb)
 {
     struct vfio_eeh_pe_op op = {
         .argsz = sizeof(op),
         .op    = VFIO_EEH_PE_ENABLE
     };
 
-    vfio_container_ioctl(&svphb->phb.iommu_as,
-                         svphb->iommugroupid, VFIO_EEH_PE_OP, &op);
-}
-
-static void spapr_phb_vfio_reset(DeviceState *qdev)
-{
-    sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(qdev);
-
-    spapr_phb_dma_reset(sphb);
-
     /*
      * The PE might be in frozen state. To reenable the EEH
      * functionality on it will clean the frozen state, which
      * ensures that the contained PCI devices will work properly
      * after reboot.
      */
-    spapr_phb_vfio_eeh_reenable(SPAPR_PCI_VFIO_HOST_BRIDGE(qdev));
+    vfio_container_ioctl(&sphb->iommu_as,
+                         sphb->iommugroupid, VFIO_EEH_PE_OP, &op);
 }
 
-static int spapr_phb_vfio_eeh_set_option(sPAPRPHBState *sphb,
-                                         unsigned int addr, int option)
+int spapr_phb_vfio_eeh_set_option(sPAPRPHBState *sphb,
+                                  PCIDevice *pdev, int option)
 {
-    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
     struct vfio_eeh_pe_op op = { .argsz = sizeof(op) };
     int ret;
 
@@ -97,25 +73,9 @@ static int spapr_phb_vfio_eeh_set_option(sPAPRPHBState *sphb,
     case RTAS_EEH_DISABLE:
         op.op = VFIO_EEH_PE_DISABLE;
         break;
-    case RTAS_EEH_ENABLE: {
-        PCIHostState *phb;
-        PCIDevice *pdev;
-
-        /*
-         * The EEH functionality is enabled on basis of PCI device,
-         * instead of PE. We need check the validity of the PCI
-         * device address.
-         */
-        phb = PCI_HOST_BRIDGE(sphb);
-        pdev = pci_find_device(phb->bus,
-                               (addr >> 16) & 0xFF, (addr >> 8) & 0xFF);
-        if (!pdev) {
-            return RTAS_OUT_PARAM_ERROR;
-        }
-
+    case RTAS_EEH_ENABLE:
         op.op = VFIO_EEH_PE_ENABLE;
         break;
-    }
     case RTAS_EEH_THAW_IO:
         op.op = VFIO_EEH_PE_UNFREEZE_IO;
         break;
@@ -126,7 +86,7 @@ static int spapr_phb_vfio_eeh_set_option(sPAPRPHBState *sphb,
         return RTAS_OUT_PARAM_ERROR;
     }
 
-    ret = vfio_container_ioctl(&svphb->phb.iommu_as, svphb->iommugroupid,
+    ret = vfio_container_ioctl(&sphb->iommu_as, sphb->iommugroupid,
                                VFIO_EEH_PE_OP, &op);
     if (ret < 0) {
         return RTAS_OUT_HW_ERROR;
@@ -135,14 +95,13 @@ static int spapr_phb_vfio_eeh_set_option(sPAPRPHBState *sphb,
     return RTAS_OUT_SUCCESS;
 }
 
-static int spapr_phb_vfio_eeh_get_state(sPAPRPHBState *sphb, int *state)
+int spapr_phb_vfio_eeh_get_state(sPAPRPHBState *sphb, int *state)
 {
-    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
     struct vfio_eeh_pe_op op = { .argsz = sizeof(op) };
     int ret;
 
     op.op = VFIO_EEH_PE_GET_STATE;
-    ret = vfio_container_ioctl(&svphb->phb.iommu_as, svphb->iommugroupid,
+    ret = vfio_container_ioctl(&sphb->iommu_as, sphb->iommugroupid,
                                VFIO_EEH_PE_OP, &op);
     if (ret < 0) {
         return RTAS_OUT_PARAM_ERROR;
@@ -195,9 +154,8 @@ static void spapr_phb_vfio_eeh_pre_reset(sPAPRPHBState *sphb)
        pci_for_each_bus(phb->bus, spapr_phb_vfio_eeh_clear_bus_msix, NULL);
 }
 
-static int spapr_phb_vfio_eeh_reset(sPAPRPHBState *sphb, int option)
+int spapr_phb_vfio_eeh_reset(sPAPRPHBState *sphb, int option)
 {
-    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
     struct vfio_eeh_pe_op op = { .argsz = sizeof(op) };
     int ret;
 
@@ -217,7 +175,7 @@ static int spapr_phb_vfio_eeh_reset(sPAPRPHBState *sphb, int option)
         return RTAS_OUT_PARAM_ERROR;
     }
 
-    ret = vfio_container_ioctl(&svphb->phb.iommu_as, svphb->iommugroupid,
+    ret = vfio_container_ioctl(&sphb->iommu_as, sphb->iommugroupid,
                                VFIO_EEH_PE_OP, &op);
     if (ret < 0) {
         return RTAS_OUT_HW_ERROR;
@@ -226,14 +184,13 @@ static int spapr_phb_vfio_eeh_reset(sPAPRPHBState *sphb, int option)
     return RTAS_OUT_SUCCESS;
 }
 
-static int spapr_phb_vfio_eeh_configure(sPAPRPHBState *sphb)
+int spapr_phb_vfio_eeh_configure(sPAPRPHBState *sphb)
 {
-    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
     struct vfio_eeh_pe_op op = { .argsz = sizeof(op) };
     int ret;
 
     op.op = VFIO_EEH_PE_CONFIGURE;
-    ret = vfio_container_ioctl(&svphb->phb.iommu_as, svphb->iommugroupid,
+    ret = vfio_container_ioctl(&sphb->iommu_as, sphb->iommugroupid,
                                VFIO_EEH_PE_OP, &op);
     if (ret < 0) {
         return RTAS_OUT_PARAM_ERROR;
@@ -245,22 +202,14 @@ static int spapr_phb_vfio_eeh_configure(sPAPRPHBState *sphb)
 static void spapr_phb_vfio_class_init(ObjectClass *klass, void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(klass);
-    sPAPRPHBClass *spc = SPAPR_PCI_HOST_BRIDGE_CLASS(klass);
 
     dc->props = spapr_phb_vfio_properties;
-    dc->reset = spapr_phb_vfio_reset;
-    spc->dma_capabilities_update = spapr_phb_vfio_dma_capabilities_update;
-    spc->dma_init_window = spapr_phb_vfio_dma_init_window;
-    spc->eeh_set_option = spapr_phb_vfio_eeh_set_option;
-    spc->eeh_get_state = spapr_phb_vfio_eeh_get_state;
-    spc->eeh_reset = spapr_phb_vfio_eeh_reset;
-    spc->eeh_configure = spapr_phb_vfio_eeh_configure;
 }
 
 static const TypeInfo spapr_phb_vfio_info = {
     .name          = TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE,
     .parent        = TYPE_SPAPR_PCI_HOST_BRIDGE,
-    .instance_size = sizeof(sPAPRPHBVFIOState),
+    .instance_size = sizeof(sPAPRPHBState),
     .class_init    = spapr_phb_vfio_class_init,
     .class_size    = sizeof(sPAPRPHBClass),
 };
diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
index fff868e..bf66315 100644
--- a/include/hw/pci-host/spapr.h
+++ b/include/hw/pci-host/spapr.h
@@ -47,15 +47,6 @@ typedef struct sPAPRPHBVFIOState sPAPRPHBVFIOState;
 
 struct sPAPRPHBClass {
     PCIHostBridgeClass parent_class;
-
-    int (*dma_capabilities_update)(sPAPRPHBState *sphb);
-    int (*dma_init_window)(sPAPRPHBState *sphb,
-                           uint32_t liobn, uint32_t page_shift,
-                           uint64_t window_size);
-    int (*eeh_set_option)(sPAPRPHBState *sphb, unsigned int addr, int option);
-    int (*eeh_get_state)(sPAPRPHBState *sphb, int *state);
-    int (*eeh_reset)(sPAPRPHBState *sphb, int option);
-    int (*eeh_configure)(sPAPRPHBState *sphb);
 };
 
 typedef struct spapr_pci_msi {
@@ -95,16 +86,12 @@ struct sPAPRPHBState {
 
     uint32_t dma32_window_start;
     uint32_t dma32_window_size;
+    bool has_vfio;
+    int32_t iommugroupid; /* obsolete */
 
     QLIST_ENTRY(sPAPRPHBState) list;
 };
 
-struct sPAPRPHBVFIOState {
-    sPAPRPHBState phb;
-
-    int32_t iommugroupid;
-};
-
 #define SPAPR_PCI_MAX_INDEX          255
 
 #define SPAPR_PCI_BASE_BUID          0x800000020000000ULL
@@ -149,4 +136,12 @@ int spapr_phb_dma_remove_window(sPAPRPHBState *sphb,
                                 sPAPRTCETable *tcet);
 int spapr_phb_dma_reset(sPAPRPHBState *sphb);
 
+int spapr_phb_vfio_dma_capabilities_update(sPAPRPHBState *sphb);
+int spapr_phb_vfio_eeh_set_option(sPAPRPHBState *sphb,
+                                  PCIDevice *pdev, int option);
+int spapr_phb_vfio_eeh_get_state(sPAPRPHBState *sphb, int *state);
+int spapr_phb_vfio_eeh_reset(sPAPRPHBState *sphb, int option);
+int spapr_phb_vfio_eeh_configure(sPAPRPHBState *sphb);
+void spapr_phb_vfio_eeh_reenable(sPAPRPHBState *sphb);
+
 #endif /* __HW_SPAPR_PCI_H__ */
-- 
2.4.0.rc3.8.gfb3e7d5


* [Qemu-devel] [PATCH qemu v10 10/14] spapr_pci: Enable vfio-pci hotplug
  2015-07-06  2:10 [Qemu-devel] [PATCH qemu v10 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (8 preceding siblings ...)
  2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 09/14] spapr_vfio_pci: Remove redundant spapr-pci-vfio-host-bridge Alexey Kardashevskiy
@ 2015-07-06  2:11 ` Alexey Kardashevskiy
  2015-07-06 10:27   ` David Gibson
                     ` (2 more replies)
  2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 11/14] spapr_pci_vfio: Enable multiple groups per container Alexey Kardashevskiy
                   ` (5 subsequent siblings)
  15 siblings, 3 replies; 71+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-06  2:11 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Michael Roth, Gavin Shan, Alex Williamson,
	qemu-ppc, David Gibson

sPAPR IOMMU is managing two copies of a TCE table:
1) a guest view of the table - this is what emulated devices use and
this is where H_GET_TCE reads from;
2) a hardware TCE table - only present if there is at least one vfio-pci
device on a PHB; it is updated via a memory listener on a PHB address
space which forwards map/unmap requests to vfio-pci IOMMU host driver.

At the moment the presence of vfio-pci devices on a bus affects the way
the guest view table is allocated. If there is no vfio-pci on a PHB
and the host kernel supports KVM acceleration of H_PUT_TCE, a table
is allocated in KVM. However, if there is vfio-pci and we do not yet
support KVM acceleration for these, the table has to be allocated
by the userspace.
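
The allocation choice can be sketched as a single predicate. It mirrors
the condition in spapr_tce_alloc_table() as modified by this patch;
table_goes_to_kvm() is a hypothetical helper name, and the real function
falls back to a userspace allocation whenever the KVM path is not taken or
fails.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true when the guest view table would be
 * requested from KVM, false when it has to live in QEMU (userspace).
 */
bool table_goes_to_kvm(bool kvm_enabled, bool force_userspace,
                       uint32_t nb_table, uint32_t page_shift)
{
    uint64_t window_size = (uint64_t)nb_table << page_shift;

    /* Windows of 4GB and more are not handed to KVM here. */
    return kvm_enabled && !force_userspace && !(window_size >> 32);
}
```

The force_userspace argument is what lets the hotplug path request a
userspace table even though KVM is available.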

When a vfio-pci device is hotplugged and there were no vfio-pci devices
already, the guest view table could have been allocated by KVM which
means that H_PUT_TCE is handled by the host kernel and since we
do not support vfio-pci in KVM, the hardware table will not be updated.

This reallocates the guest view table in QEMU if the first vfio-pci
device has just been plugged. spapr_tce_realloc_userspace() handles this.

This replays all the mappings to make sure that the tables are in sync.
This will not have a visible effect though as for a new device
the guest kernel will allocate-and-map new addresses and therefore
existing mappings from emulated devices will not be used by vfio-pci
devices.
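
The reallocate-and-replay step can be sketched like this. It is a
simplified model, not the QEMU code: struct tce_table, put_tce() and
tce_realloc_and_replay() are hypothetical names, calloc() stands in for
the userspace table allocation, and put_tce() only stores the entry,
whereas put_tce_emu() in QEMU also notifies the IOMMU listeners (which is
what brings the hardware table in sync).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for sPAPRTCETable. */
struct tce_table {
    uint64_t *table;
    uint32_t nb_table;
    uint32_t page_shift;
    uint64_t bus_offset;
};

/* Store one TCE for the given I/O bus address; -1 if out of the window. */
static int put_tce(struct tce_table *t, uint64_t ioba, uint64_t tce)
{
    uint64_t index;

    if (ioba < t->bus_offset) {
        return -1;
    }
    index = (ioba - t->bus_offset) >> t->page_shift;
    if (index >= t->nb_table) {
        return -1;
    }
    t->table[index] = tce;
    return 0;
}

/* Allocate a fresh table, replay the old entries into it, free the old. */
int tce_realloc_and_replay(struct tce_table *t)
{
    uint64_t *old = t->table;
    uint64_t *fresh = calloc(t->nb_table, sizeof(*fresh));
    uint64_t ioba = t->bus_offset;
    uint64_t pgsz = 1ULL << t->page_shift;
    uint32_t i;
    int ret = 0;

    if (!fresh) {
        return -1;
    }
    t->table = fresh;
    for (i = 0; i < t->nb_table; ++i, ioba += pgsz) {
        ret = put_tce(t, ioba, old[i]);
        if (ret) {
            break;
        }
    }
    free(old);
    return ret;
}
```

Replaying from the old table into the new one keeps every existing
mapping at the same I/O bus address.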

This adds calls to spapr_phb_dma_capabilities_update() in PCI hotplug
hooks.
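
What the hotplug hook decides per DMA window can be summarised as below.
This is a sketch of the decision in spapr_phb_hotplug_dma_sync() from this
patch; enum sync_action and hotplug_sync_action() are hypothetical names,
and fd >= 0 is taken to mean "table currently allocated by KVM", as in the
diff.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the three possible outcomes. */
enum sync_action {
    SYNC_NONE,               /* nothing to do for this window */
    SYNC_REPLAY,             /* userspace table already: replay mappings */
    SYNC_REALLOC_AND_REPLAY  /* KVM table: move to userspace, then replay */
};

enum sync_action hotplug_sync_action(bool had_vfio, bool has_vfio,
                                     bool window_enabled, int fd)
{
    /* Only the transition "no VFIO" -> "has VFIO" requires any work. */
    if (had_vfio || !has_vfio || !window_enabled) {
        return SYNC_NONE;
    }
    return fd >= 0 ? SYNC_REALLOC_AND_REPLAY : SYNC_REPLAY;
}
```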

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v10:
* removed unnecessary memory_region_del_subregion() and
memory_region_add_subregion() as
"vfio: Unregister IOMMU notifiers when container is destroyed" removes
notifiers in a more correct way

v9:
* spapr_phb_hotplug_dma_sync() enumerates TCE tables explicitly rather than
via object_child_foreach()
* spapr_phb_hotplug_dma_sync() does memory_region_del_subregion() +
memory_region_add_subregion() as otherwise vfio_listener_region_del() is not
called and we end up with vfio_iommu_map_notify registered twice (comments welcome!)
if we do hotplug+hotunplug+hotplug of the same device.
* moved spapr_phb_hotplug_dma_sync() on unplug event to rcu as before calling
spapr_phb_hotplug_dma_sync(), we need VFIO to release the container, otherwise
spapr_phb_dma_capabilities_update() will decide that the PHB still has VFIO device.
Actual VFIO PCI device release happens from rcu and since we add ours later,
it gets executed later and we are good.
---
 hw/ppc/spapr_iommu.c        | 51 ++++++++++++++++++++++++++++++++++++++++++---
 hw/ppc/spapr_pci.c          | 47 +++++++++++++++++++++++++++++++++++++++++
 include/hw/pci-host/spapr.h |  1 +
 include/hw/ppc/spapr.h      |  2 ++
 trace-events                |  2 ++
 5 files changed, 100 insertions(+), 3 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 45c00d8..2d99c3b 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -78,12 +78,13 @@ static uint64_t *spapr_tce_alloc_table(uint32_t liobn,
                                        uint32_t nb_table,
                                        uint32_t page_shift,
                                        int *fd,
-                                       bool vfio_accel)
+                                       bool vfio_accel,
+                                       bool force_userspace)
 {
     uint64_t *table = NULL;
     uint64_t window_size = (uint64_t)nb_table << page_shift;
 
-    if (kvm_enabled() && !(window_size >> 32)) {
+    if (kvm_enabled() && !force_userspace && !(window_size >> 32)) {
         table = kvmppc_create_spapr_tce(liobn, window_size, fd, vfio_accel);
     }
 
@@ -222,7 +223,8 @@ static void spapr_tce_table_do_enable(sPAPRTCETable *tcet, bool vfio_accel)
                                         tcet->nb_table,
                                         tcet->page_shift,
                                         &tcet->fd,
-                                        vfio_accel);
+                                        vfio_accel,
+                                        false);
 
     memory_region_set_size(&tcet->iommu,
                            (uint64_t)tcet->nb_table << tcet->page_shift);
@@ -495,6 +497,49 @@ int spapr_dma_dt(void *fdt, int node_off, const char *propname,
     return 0;
 }
 
+static int spapr_tce_do_replay(sPAPRTCETable *tcet, uint64_t *table)
+{
+    target_ulong ioba = tcet->bus_offset, pgsz = (1ULL << tcet->page_shift);
+    long i, ret = 0;
+
+    for (i = 0; i < tcet->nb_table; ++i, ioba += pgsz) {
+        ret = put_tce_emu(tcet, ioba, table[i]);
+        if (ret) {
+            break;
+        }
+    }
+
+    return ret;
+}
+
+int spapr_tce_replay(sPAPRTCETable *tcet)
+{
+    return spapr_tce_do_replay(tcet, tcet->table);
+}
+
+int spapr_tce_realloc_userspace(sPAPRTCETable *tcet, bool replay)
+{
+    int ret = 0, oldfd;
+    uint64_t *oldtable;
+
+    oldtable = tcet->table;
+    oldfd = tcet->fd;
+    tcet->table = spapr_tce_alloc_table(tcet->liobn,
+                                        tcet->nb_table,
+                                        tcet->page_shift,
+                                        &tcet->fd,
+                                        false,
+                                        true); /* force_userspace */
+
+    if (replay) {
+        ret = spapr_tce_do_replay(tcet, oldtable);
+    }
+
+    spapr_tce_free_table(oldtable, oldfd, tcet->nb_table);
+
+    return ret;
+}
+
 int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
                       sPAPRTCETable *tcet)
 {
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 76c988f..d1fa157 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -827,6 +827,43 @@ int spapr_phb_dma_reset(sPAPRPHBState *sphb)
     return 0;
 }
 
+static int spapr_phb_hotplug_dma_sync(sPAPRPHBState *sphb)
+{
+    int ret = 0, i;
+    bool had_vfio = sphb->has_vfio;
+    sPAPRTCETable *tcet;
+
+    spapr_phb_dma_capabilities_update(sphb);
+
+    if (!had_vfio && sphb->has_vfio) {
+        for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
+            tcet = spapr_tce_find_by_liobn(SPAPR_PCI_LIOBN(sphb->index, i));
+            if (!tcet || !tcet->enabled) {
+                continue;
+            }
+            if (tcet->fd >= 0) {
+                /*
+                 * We got first vfio-pci device on accelerated table.
+                 * VFIO acceleration is not possible.
+                 * Reallocate table in userspace and replay mappings.
+                 */
+                ret = spapr_tce_realloc_userspace(tcet, true);
+                trace_spapr_pci_dma_realloc_update(tcet->liobn, ret);
+            } else {
+                /* There was no acceleration, so just replay mappings. */
+                ret = spapr_tce_replay(tcet);
+                trace_spapr_pci_dma_update(tcet->liobn, ret);
+            }
+            if (ret) {
+                break;
+            }
+        }
+        return ret;
+    }
+
+    return 0;
+}
+
 /* Macros to operate with address in OF binding to PCI */
 #define b_x(x, p, l)    (((x) & ((1<<(l))-1)) << (p))
 #define b_n(x)          b_x((x), 31, 1) /* 0 if relocatable */
@@ -1106,6 +1143,7 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
             error_setg(errp, "Failed to create pci child device tree node");
             goto out;
         }
+        spapr_phb_hotplug_dma_sync(phb);
     }
 
     drck->attach(drc, DEVICE(pdev),
@@ -1116,6 +1154,12 @@ out:
     }
 }
 
+static void spapr_phb_remove_sync_dma(struct rcu_head *head)
+{
+    sPAPRPHBState *sphb = container_of(head, sPAPRPHBState, rcu);
+    spapr_phb_hotplug_dma_sync(sphb);
+}
+
 static void spapr_phb_remove_pci_device_cb(DeviceState *dev, void *opaque)
 {
     /* some version guests do not wait for completion of a device
@@ -1130,6 +1174,9 @@ static void spapr_phb_remove_pci_device_cb(DeviceState *dev, void *opaque)
      */
     pci_device_reset(PCI_DEVICE(dev));
     object_unparent(OBJECT(dev));
+
+    /* Actual VFIO device release happens from RCU so postpone DMA update */
+    call_rcu1(&((sPAPRPHBState *)opaque)->rcu, spapr_phb_remove_sync_dma);
 }
 
 static void spapr_phb_remove_pci_device(sPAPRDRConnector *drc,
diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
index bf66315..8b007aa 100644
--- a/include/hw/pci-host/spapr.h
+++ b/include/hw/pci-host/spapr.h
@@ -61,6 +61,7 @@ typedef struct spapr_pci_msi_mig {
 
 struct sPAPRPHBState {
     PCIHostState parent_obj;
+    struct rcu_head rcu;
 
     uint32_t index;
     uint64_t buid;
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index e32e787..4645f16 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -588,6 +588,8 @@ int spapr_dma_dt(void *fdt, int node_off, const char *propname,
                  uint32_t liobn, uint64_t window, uint32_t size);
 int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
                       sPAPRTCETable *tcet);
+int spapr_tce_replay(sPAPRTCETable *tcet);
+int spapr_tce_realloc_userspace(sPAPRTCETable *tcet, bool replay);
 void spapr_pci_switch_vga(bool big_endian);
 void spapr_hotplug_req_add_event(sPAPRDRConnector *drc);
 void spapr_hotplug_req_remove_event(sPAPRDRConnector *drc);
diff --git a/trace-events b/trace-events
index a93af9a..a994019 100644
--- a/trace-events
+++ b/trace-events
@@ -1300,6 +1300,8 @@ spapr_pci_rtas_ibm_query_interrupt_source_number(unsigned ioa, unsigned intr) "q
 spapr_pci_msi_write(uint64_t addr, uint64_t data, uint32_t dt_irq) "@%"PRIx64"<=%"PRIx64" IRQ %u"
 spapr_pci_lsi_set(const char *busname, int pin, uint32_t irq) "%s PIN%d IRQ %u"
 spapr_pci_msi_retry(unsigned config_addr, unsigned req_num, unsigned max_irqs) "Guest device at %x asked %u, have only %u"
+spapr_pci_dma_update(uint64_t liobn, long ret) "liobn=%"PRIx64" ret=%ld"
+spapr_pci_dma_realloc_update(uint64_t liobn, long ret) "liobn=%"PRIx64" ret=%ld"
 
 # hw/pci/pci.c
 pci_update_mappings_del(void *d, uint32_t bus, uint32_t func, uint32_t slot, int bar, uint64_t addr, uint64_t size) "d=%p %02x:%02x.%x %d,%#"PRIx64"+%#"PRIx64
-- 
2.4.0.rc3.8.gfb3e7d5

* [Qemu-devel] [PATCH qemu v10 11/14] spapr_pci_vfio: Enable multiple groups per container
  2015-07-06  2:10 [Qemu-devel] [PATCH qemu v10 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (9 preceding siblings ...)
  2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 10/14] spapr_pci: Enable vfio-pci hotplug Alexey Kardashevskiy
@ 2015-07-06  2:11 ` Alexey Kardashevskiy
  2015-07-07  7:02   ` Thomas Huth
  2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 12/14] vfio: Unregister IOMMU notifiers when container is destroyed Alexey Kardashevskiy
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 71+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-06  2:11 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Michael Roth, Gavin Shan, Alex Williamson,
	qemu-ppc, David Gibson

This enables multiple IOMMU groups in one VFIO container which means
that multiple devices from different groups can share the same IOMMU
table (or tables if DDW).

This removes the group id argument from vfio_container_ioctl(). Host
kernel support is required for this; a host kernel without that support
allows only one group per container. The PHB's "iommuid" property
is ignored. The ioctl is called for every container attached to
the address space. At the moment there is just one container anyway.

If there is no container attached to the address space,
vfio_container_do_ioctl() returns -1.
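
The per-container dispatch described above can be sketched in isolation.
This is only an illustration under stated assumptions, not the QEMU code:
struct container, container_op and for_each_container_op() are
hypothetical stand-ins for VFIOContainer, ioctl() and
vfio_container_do_ioctl(), and the two example operations exist only to
exercise the loop.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a VFIO container attached to an address space. */
struct container {
    int fd;
    struct container *next;
};

/* Hypothetical operation: returns 0 on success or -1 with errno set. */
typedef int (*container_op)(int fd, void *param);

/*
 * Apply op to every container on the list.
 * Returns -1 if the list is empty, -errno on the first failure,
 * otherwise the result of the last call.
 */
static int for_each_container_op(struct container *head,
                                 container_op op, void *param)
{
    int ret = -1;
    struct container *c;

    for (c = head; c != NULL; c = c->next) {
        ret = op(c->fd, param);
        if (ret < 0) {
            /* Read errno right away, before anything can overwrite it. */
            return -errno;
        }
    }
    return ret;
}

/* Example operation (illustration only): counts how often it is called. */
static int op_count(int fd, void *param)
{
    (void)fd;
    ++*(int *)param;
    return 0;
}

/* Example operation (illustration only): always fails with ENOTTY. */
static int op_fail(int fd, void *param)
{
    (void)fd;
    (void)param;
    errno = ENOTTY;
    return -1;
}
```

With an empty list the helper returns -1, which mirrors the "no container
attached" case above; a failing operation stops the walk at the first
container that reports an error.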

This removes casts to sPAPRPHBVFIOState as none of sPAPRPHBVFIOState
members is accessed here.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/ppc/spapr_pci_vfio.c | 17 ++++++-----------
 hw/vfio/common.c        | 20 ++++++--------------
 include/hw/vfio/vfio.h  |  2 +-
 3 files changed, 13 insertions(+), 26 deletions(-)

diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
index 04ca4cf..fe7d7d1 100644
--- a/hw/ppc/spapr_pci_vfio.c
+++ b/hw/ppc/spapr_pci_vfio.c
@@ -33,7 +33,7 @@ int spapr_phb_vfio_dma_capabilities_update(sPAPRPHBState *sphb)
     struct vfio_iommu_spapr_tce_info info = { .argsz = sizeof(info) };
     int ret;
 
-    ret = vfio_container_ioctl(&sphb->iommu_as, sphb->iommugroupid,
+    ret = vfio_container_ioctl(&sphb->iommu_as,
                                VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
     if (ret) {
         return ret;
@@ -59,8 +59,7 @@ void spapr_phb_vfio_eeh_reenable(sPAPRPHBState *sphb)
      * ensures that the contained PCI devices will work properly
      * after reboot.
      */
-    vfio_container_ioctl(&sphb->iommu_as,
-                         sphb->iommugroupid, VFIO_EEH_PE_OP, &op);
+    vfio_container_ioctl(&sphb->iommu_as, VFIO_EEH_PE_OP, &op);
 }
 
 int spapr_phb_vfio_eeh_set_option(sPAPRPHBState *sphb,
@@ -86,8 +85,7 @@ int spapr_phb_vfio_eeh_set_option(sPAPRPHBState *sphb,
         return RTAS_OUT_PARAM_ERROR;
     }
 
-    ret = vfio_container_ioctl(&sphb->iommu_as, sphb->iommugroupid,
-                               VFIO_EEH_PE_OP, &op);
+    ret = vfio_container_ioctl(&sphb->iommu_as, VFIO_EEH_PE_OP, &op);
     if (ret < 0) {
         return RTAS_OUT_HW_ERROR;
     }
@@ -101,8 +99,7 @@ int spapr_phb_vfio_eeh_get_state(sPAPRPHBState *sphb, int *state)
     int ret;
 
     op.op = VFIO_EEH_PE_GET_STATE;
-    ret = vfio_container_ioctl(&sphb->iommu_as, sphb->iommugroupid,
-                               VFIO_EEH_PE_OP, &op);
+    ret = vfio_container_ioctl(&sphb->iommu_as, VFIO_EEH_PE_OP, &op);
     if (ret < 0) {
         return RTAS_OUT_PARAM_ERROR;
     }
@@ -175,8 +172,7 @@ int spapr_phb_vfio_eeh_reset(sPAPRPHBState *sphb, int option)
         return RTAS_OUT_PARAM_ERROR;
     }
 
-    ret = vfio_container_ioctl(&sphb->iommu_as, sphb->iommugroupid,
-                               VFIO_EEH_PE_OP, &op);
+    ret = vfio_container_ioctl(&sphb->iommu_as, VFIO_EEH_PE_OP, &op);
     if (ret < 0) {
         return RTAS_OUT_HW_ERROR;
     }
@@ -190,8 +186,7 @@ int spapr_phb_vfio_eeh_configure(sPAPRPHBState *sphb)
     int ret;
 
     op.op = VFIO_EEH_PE_CONFIGURE;
-    ret = vfio_container_ioctl(&sphb->iommu_as, sphb->iommugroupid,
-                               VFIO_EEH_PE_OP, &op);
+    ret = vfio_container_ioctl(&sphb->iommu_as, VFIO_EEH_PE_OP, &op);
     if (ret < 0) {
         return RTAS_OUT_PARAM_ERROR;
     }
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index b1045da..89ef37b 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -918,34 +918,26 @@ void vfio_put_base_device(VFIODevice *vbasedev)
     close(vbasedev->fd);
 }
 
-static int vfio_container_do_ioctl(AddressSpace *as, int32_t groupid,
+static int vfio_container_do_ioctl(AddressSpace *as,
                                    int req, void *param)
 {
-    VFIOGroup *group;
     VFIOContainer *container;
     int ret = -1;
+    VFIOAddressSpace *space = vfio_get_address_space(as);
 
-    group = vfio_get_group(groupid, as);
-    if (!group) {
-        error_report("vfio: group %d not registered", groupid);
-        return ret;
-    }
-
-    container = group->container;
-    if (group->container) {
+    QLIST_FOREACH(container, &space->containers, next) {
         ret = ioctl(container->fd, req, param);
         if (ret < 0) {
             error_report("vfio: failed to ioctl %d to container: ret=%d, %s",
                          _IOC_NR(req) - VFIO_BASE, ret, strerror(errno));
+            return -errno;
         }
     }
 
-    vfio_put_group(group);
-
     return ret;
 }
 
-int vfio_container_ioctl(AddressSpace *as, int32_t groupid,
+int vfio_container_ioctl(AddressSpace *as,
                          int req, void *param)
 {
     /* We allow only certain ioctls to the container */
@@ -960,5 +952,5 @@ int vfio_container_ioctl(AddressSpace *as, int32_t groupid,
         return -1;
     }
 
-    return vfio_container_do_ioctl(as, groupid, req, param);
+    return vfio_container_do_ioctl(as, req, param);
 }
diff --git a/include/hw/vfio/vfio.h b/include/hw/vfio/vfio.h
index 0b26cd8..76b5744 100644
--- a/include/hw/vfio/vfio.h
+++ b/include/hw/vfio/vfio.h
@@ -3,7 +3,7 @@
 
 #include "qemu/typedefs.h"
 
-extern int vfio_container_ioctl(AddressSpace *as, int32_t groupid,
+extern int vfio_container_ioctl(AddressSpace *as,
                                 int req, void *param);
 
 #endif
-- 
2.4.0.rc3.8.gfb3e7d5

* [Qemu-devel] [PATCH qemu v10 12/14] vfio: Unregister IOMMU notifiers when container is destroyed
  2015-07-06  2:10 [Qemu-devel] [PATCH qemu v10 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (10 preceding siblings ...)
  2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 11/14] spapr_pci_vfio: Enable multiple groups per container Alexey Kardashevskiy
@ 2015-07-06  2:11 ` Alexey Kardashevskiy
  2015-07-06 10:33   ` David Gibson
  2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 13/14] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering) Alexey Kardashevskiy
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 71+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-06  2:11 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Michael Roth, Gavin Shan, Alex Williamson,
	qemu-ppc, David Gibson

On systems with a guest-visible IOMMU, adding a new memory region to a
PCI bus calls vfio_listener_region_add() for every DMA window. This
installs a notifier for IOMMU memory regions. The notifier is supposed
to be removed by vfio_listener_region_del(). However, on a mixed PHB
(emulated + VFIO devices), when the last VFIO device is unplugged and
the container is destroyed, all existing DMA windows stay alive together
with their notifiers, which remain on a linked list whose head was in
the destroyed container.

This unregisters the IOMMU memory region notifiers when a container is
destroyed.
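
The teardown pattern can be shown with a small standalone sketch. It
assumes a plain singly linked list instead of QEMU's QLIST macros;
struct notifier_node, notifier_add() and notifier_release_all() are
hypothetical names, and "unregistering" is modelled as clearing a flag
owned by the DMA window. The point it illustrates is that the next
pointer has to be saved before a node is freed.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for VFIOGuestIOMMU: one notifier per DMA window. */
struct notifier_node {
    int *registered;                /* flag owned by the DMA window */
    struct notifier_node *next;
};

/* Register a notifier for a window and push it onto the list. */
static struct notifier_node *notifier_add(struct notifier_node *head,
                                          int *registered)
{
    struct notifier_node *n = malloc(sizeof(*n));

    if (!n) {
        return head;                /* allocation failed, list unchanged */
    }
    *registered = 1;
    n->registered = registered;
    n->next = head;
    return n;
}

/*
 * Unregister and free every notifier on the list.
 * The next pointer is saved before the node is freed, which is what a
 * "safe" list iteration provides. Returns the number of nodes released.
 */
static int notifier_release_all(struct notifier_node **head)
{
    int released = 0;
    struct notifier_node *n = *head, *tmp;

    while (n) {
        tmp = n->next;
        *n->registered = 0;         /* models unregistering the notifier */
        free(n);
        n = tmp;
        ++released;
    }
    *head = NULL;
    return released;
}
```

After notifier_release_all() the windows no longer point at freed list
nodes, which is the state the container teardown needs to reach.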

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v10:
* new to the patchset
---
 hw/vfio/common.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 89ef37b..8eacfd7 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -772,11 +772,19 @@ static void vfio_disconnect_container(VFIOGroup *group)
 
     if (QLIST_EMPTY(&container->group_list)) {
         VFIOAddressSpace *space = container->space;
+        VFIOGuestIOMMU *giommu, *tmp;
 
         if (container->iommu_data.release) {
             container->iommu_data.release(container);
         }
         QLIST_REMOVE(container, next);
+
+        QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
+            memory_region_unregister_iommu_notifier(&giommu->n);
+            QLIST_REMOVE(giommu, giommu_next);
+            g_free(giommu);
+        }
+
         trace_vfio_disconnect_container(container->fd);
         close(container->fd);
         g_free(container);
-- 
2.4.0.rc3.8.gfb3e7d5

* [Qemu-devel] [PATCH qemu v10 13/14] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2015-07-06  2:10 [Qemu-devel] [PATCH qemu v10 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (11 preceding siblings ...)
  2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 12/14] vfio: Unregister IOMMU notifiers when container is destroyed Alexey Kardashevskiy
@ 2015-07-06  2:11 ` Alexey Kardashevskiy
  2015-07-06 13:42   ` Alex Williamson
  2015-07-07  7:23   ` Thomas Huth
  2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 14/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) Alexey Kardashevskiy
                   ` (2 subsequent siblings)
  15 siblings, 2 replies; 71+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-06  2:11 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Michael Roth, Gavin Shan, Alex Williamson,
	qemu-ppc, David Gibson

This makes use of the new "memory registering" feature. The idea is
to give userspace the ability to notify the host kernel about pages
which are going to be used for DMA. With this information, the host
kernel can pin them all once per user process, do the locked pages
accounting once, and not spend time doing that at run time, where
failures cannot always be handled nicely.

This adds a guest RAM memory listener which notifies a VFIO container
about memory which needs to be pinned/unpinned. VFIO MMIO regions
(i.e. "skip dump" regions) are skipped.
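
The filtering and sizing done by the listener can be sketched without
any QEMU or VFIO dependency. The types and the build_reg_request()
helper below are hypothetical; the sketch also assumes a single page
size for both the alignment check and the rounding, which is a
simplification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, cut-down view of a memory region section. */
struct ram_section {
    bool is_ram;            /* backed by guest RAM */
    bool skip_dump;         /* e.g. a VFIO MMIO region */
    uint64_t host_base;     /* host virtual address of the region start */
    uint64_t offset;        /* offset of the section within the region */
    uint64_t size;          /* section size in bytes */
};

/* Hypothetical request: address and size of the range to register. */
struct reg_request {
    uint64_t vaddr;
    uint64_t size;
};

/* Round value up to a multiple of align; align must be a power of two. */
static uint64_t round_up_pow2(uint64_t value, uint64_t align)
{
    return (value + align - 1) & ~(align - 1);
}

/*
 * Decide whether a section should be registered for DMA.
 * Returns true and fills req for page-aligned RAM sections; returns
 * false for non-RAM, "skip dump" and unaligned sections.
 */
static bool build_reg_request(const struct ram_section *s,
                              uint64_t page_size, struct reg_request *req)
{
    if (!s->is_ram || s->skip_dump) {
        return false;
    }
    if (s->offset & (page_size - 1)) {
        return false;       /* unaligned section, nothing is registered */
    }
    req->vaddr = s->host_base + s->offset;
    req->size = round_up_pow2(s->size, page_size);
    return true;
}
```

The same request would serve for both registering and unregistering,
since the two operations differ only in the request code passed to the
container.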

The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
not call it when v2 is detected and enabled.

This does not change the guest visible interface.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
Changes:
v9:
* since there is no more SPAPR-specific data in container::iommu_data,
the memory preregistration fields are common and potentially can be used
by other architectures

v7:
* in vfio_spapr_ram_listener_region_del(), do unref() after ioctl()
* s'ramlistener'register_listener'

v6:
* fixed commit log (s/guest/userspace/), added note about no guest visible
change
* fixed error checking if ram registration failed
* added alignment check for section->offset_within_region

v5:
* simplified the patch
* added trace points
* added round_up() for the size
* SPAPR IOMMU v2 used
---
 hw/vfio/common.c              | 109 ++++++++++++++++++++++++++++++++++++++----
 include/hw/vfio/vfio-common.h |   3 ++
 trace-events                  |   1 +
 3 files changed, 104 insertions(+), 9 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 8eacfd7..0c7ba8c 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -488,6 +488,76 @@ static void vfio_listener_release(VFIOContainer *container)
     memory_listener_unregister(&container->iommu_data.type1.listener);
 }
 
+static void vfio_ram_do_region(VFIOContainer *container,
+                              MemoryRegionSection *section, unsigned long req)
+{
+    int ret;
+    struct vfio_iommu_spapr_register_memory reg = { .argsz = sizeof(reg) };
+
+    if (!memory_region_is_ram(section->mr) ||
+        memory_region_is_skip_dump(section->mr)) {
+        return;
+    }
+
+    if (unlikely((section->offset_within_region & (getpagesize() - 1)))) {
+        error_report("%s received unaligned region", __func__);
+        return;
+    }
+
+    reg.vaddr = (__u64) memory_region_get_ram_ptr(section->mr) +
+        section->offset_within_region;
+    reg.size = ROUND_UP(int128_get64(section->size), TARGET_PAGE_SIZE);
+
+    ret = ioctl(container->fd, req, &reg);
+    trace_vfio_ram_register(_IOC_NR(req) - VFIO_BASE, reg.vaddr, reg.size,
+            ret ? -errno : 0);
+    if (!ret) {
+        return;
+    }
+
+    /*
+     * On the initfn path, store the first error in the container so we
+     * can gracefully fail.  Runtime, there's not much we can do other
+     * than throw a hardware error.
+     */
+    if (!container->iommu_data.ram_reg_initialized) {
+        if (!container->iommu_data.ram_reg_error) {
+            container->iommu_data.ram_reg_error = -errno;
+        }
+    } else {
+        hw_error("vfio: RAM registering failed, unable to continue");
+    }
+}
+
+static void vfio_ram_listener_region_add(MemoryListener *listener,
+                                         MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer,
+                                            iommu_data.register_listener);
+    memory_region_ref(section->mr);
+    vfio_ram_do_region(container, section, VFIO_IOMMU_SPAPR_REGISTER_MEMORY);
+}
+
+static void vfio_ram_listener_region_del(MemoryListener *listener,
+                                         MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer,
+                                            iommu_data.register_listener);
+    vfio_ram_do_region(container, section, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY);
+    memory_region_unref(section->mr);
+}
+
+static const MemoryListener vfio_ram_memory_listener = {
+    .region_add = vfio_ram_listener_region_add,
+    .region_del = vfio_ram_listener_region_del,
+};
+
+static void vfio_spapr_listener_release_v2(VFIOContainer *container)
+{
+    memory_listener_unregister(&container->iommu_data.register_listener);
+    vfio_listener_release(container);
+}
+
 int vfio_mmap_region(Object *obj, VFIORegion *region,
                      MemoryRegion *mem, MemoryRegion *submem,
                      void **map, size_t size, off_t offset,
@@ -698,14 +768,18 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
 
         container->iommu_data.type1.initialized = true;
 
-    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
+    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
+               ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
+        bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU);
+
         ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
         if (ret) {
             error_report("vfio: failed to set group container: %m");
             ret = -errno;
             goto free_container_exit;
         }
-        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
+        ret = ioctl(fd, VFIO_SET_IOMMU,
+                v2 ? VFIO_SPAPR_TCE_v2_IOMMU : VFIO_SPAPR_TCE_IOMMU);
         if (ret) {
             error_report("vfio: failed to set iommu for container: %m");
             ret = -errno;
@@ -717,19 +791,36 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
          * when container fd is closed so we do not call it explicitly
          * in this file.
          */
-        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
-        if (ret) {
-            error_report("vfio: failed to enable container: %m");
-            ret = -errno;
-            goto free_container_exit;
+        if (!v2) {
+            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
+            if (ret) {
+                error_report("vfio: failed to enable container: %m");
+                ret = -errno;
+                goto free_container_exit;
+            }
         }
 
         container->iommu_data.type1.listener = vfio_memory_listener;
-        container->iommu_data.release = vfio_listener_release;
-
         memory_listener_register(&container->iommu_data.type1.listener,
                                  container->space->as);
 
+        if (!v2) {
+            container->iommu_data.release = vfio_listener_release;
+        } else {
+            container->iommu_data.release = vfio_spapr_listener_release_v2;
+            container->iommu_data.register_listener =
+                    vfio_ram_memory_listener;
+            memory_listener_register(&container->iommu_data.register_listener,
+                                     &address_space_memory);
+
+            if (container->iommu_data.ram_reg_error) {
+                error_report("vfio: RAM memory listener initialization failed for container");
+                goto listener_release_exit;
+            }
+
+            container->iommu_data.ram_reg_initialized = true;
+        }
+
     } else {
         error_report("vfio: No available IOMMU models");
         ret = -EINVAL;
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 59a321d..b132248 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -79,6 +79,9 @@ typedef struct VFIOContainer {
             VFIOType1 type1;
         };
         void (*release)(struct VFIOContainer *);
+        MemoryListener register_listener;
+        int ram_reg_error;
+        bool ram_reg_initialized;
     } iommu_data;
     QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
     QLIST_HEAD(, VFIOGroup) group_list;
diff --git a/trace-events b/trace-events
index a994019..b300e94 100644
--- a/trace-events
+++ b/trace-events
@@ -1584,6 +1584,7 @@ vfio_disconnect_container(int fd) "close container->fd=%d"
 vfio_put_group(int fd) "close group->fd=%d"
 vfio_get_device(const char * name, unsigned int flags, unsigned int num_regions, unsigned int num_irqs) "Device %s flags: %u, regions: %u, irqs: %u"
 vfio_put_base_device(int fd) "close vdev->fd=%d"
+vfio_ram_register(int req, uint64_t va, uint64_t size, int ret) "req=%d va=%"PRIx64" size=%"PRIx64" ret=%d"
 
 # hw/vfio/platform.c
 vfio_platform_populate_regions(int region_index, unsigned long flag, unsigned long size, int fd, unsigned long offset) "- region %d flags = 0x%lx, size = 0x%lx, fd= %d, offset = 0x%lx"
-- 
2.4.0.rc3.8.gfb3e7d5

* [Qemu-devel] [PATCH qemu v10 14/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2015-07-06  2:10 [Qemu-devel] [PATCH qemu v10 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (12 preceding siblings ...)
  2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 13/14] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering) Alexey Kardashevskiy
@ 2015-07-06  2:11 ` Alexey Kardashevskiy
  2015-07-06 11:06   ` David Gibson
                     ` (2 more replies)
  2015-07-06 11:13 ` [Qemu-devel] [PATCH qemu v10 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) David Gibson
  2015-07-06 15:54 ` Thomas Huth
  15 siblings, 3 replies; 71+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-06  2:11 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Michael Roth, Gavin Shan, Alex Williamson,
	qemu-ppc, David Gibson

This adds support for the Dynamic DMA Windows (DDW) option defined by
the SPAPR specification, which allows having additional DMA window(s).

This implements DDW for emulated and VFIO devices. As all TCE root regions
are mapped at 0 and are 64bit long (and actual tables are child regions),
this replaces memory_region_add_subregion() with _overlap() to make
the QEMU memory API happy.

This reserves RTAS token numbers for DDW calls.

This implements helpers to interact with VFIO kernel interface.

This changes the TCE table migration descriptor to support dynamic
tables as from now on, PHB will create as many stub TCE table objects
as PHB can possibly support but not all of them might be initialized at
the time of migration because DDW might or might not be requested by
the guest.
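
The restore step for such a dynamic table can be sketched as follows.
This is a simplified model, not the QEMU migration code: struct
tce_table and tce_post_load() are hypothetical, plain calloc()/free()
stand in for QEMU's allocators, and the sketch assumes the staging
buffer has already been filled from the migration stream.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, cut-down TCE table object. */
struct tce_table {
    bool enabled;           /* was the window created by the guest? */
    uint32_t nb_table;      /* number of TCE entries */
    uint64_t *table;        /* live table used for translation */
    uint64_t *migtable;     /* staging buffer filled during migration */
};

/*
 * Finish loading one table on the destination.
 * A stub table which the guest never enabled has nothing to restore.
 * An enabled table is allocated if needed, filled from the staging
 * buffer, and the staging buffer is released.
 * Returns 0 on success, -1 if the live table cannot be allocated.
 */
static int tce_post_load(struct tce_table *t)
{
    if (t->enabled) {
        if (!t->table) {
            t->table = calloc(t->nb_table, sizeof(t->table[0]));
            if (!t->table) {
                return -1;
            }
        }
        memcpy(t->table, t->migtable,
               t->nb_table * sizeof(t->table[0]));
    }
    /* The staging buffer is dropped in both cases; free(NULL) is fine. */
    free(t->migtable);
    t->migtable = NULL;
    return 0;
}
```

Keeping the migrated entries in a separate staging buffer lets the
destination decide after loading whether a live table has to be created
at all.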

The "ddw" property is enabled by default on a PHB but for compatibility
the pseries-2.3 machine and older disable it.

This implements DDW for VFIO. The host kernel support is required.
This adds a "levels" property to PHB to control the number of levels
in the actual TCE table allocated by the host kernel, 0 is the default
value to tell QEMU to calculate the correct value. Current hardware
supports up to 5 levels.

Existing Linux guests try to create one additional huge DMA window
with 64K or 16MB pages and map the entire guest RAM into it. If this
succeeds, the guest switches to dma_direct_ops and never calls TCE
hypercalls (H_PUT_TCE,...) again. This enables VFIO devices to use the
entire RAM and not waste time on map/unmap later. This adds a
"dma64_win_addr" property which is a bus address for the 64bit window
and is set by default to 0x800.0000.0000.0000, as this is what modern
POWER8 hardware uses and it allows having emulated and VFIO devices on
the same bus.
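
The window geometry can be worked through with a short sketch. It uses
the page size mask (4K, 64K, 16M) and the power-of-two window size that
this patch sets in spapr_phb_dma_capabilities_update(); the helper
names are hypothetical, and it assumes that bit N of the mask stands
for a page size of 2^N bytes and that the guest picks the biggest one.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Smallest power of two that is >= v.
 * Returns 0 if the result does not fit in 64 bits.
 */
static uint64_t pow2ceil_u64(uint64_t v)
{
    uint64_t p = 1;

    while (p < v) {
        if (p > UINT64_MAX / 2) {
            return 0;
        }
        p <<= 1;
    }
    return p;
}

/*
 * Biggest page shift advertised in the mask.
 * Returns 0 for an empty mask.
 */
static unsigned biggest_page_shift(uint64_t mask)
{
    unsigned shift = 0, i;

    for (i = 0; i < 64; ++i) {
        if (mask & (1ULL << i)) {
            shift = i;
        }
    }
    return shift;
}

/* Number of TCE entries needed to map a window covering all of RAM. */
static uint64_t tce_entries(uint64_t ram_size, unsigned page_shift)
{
    return pow2ceil_u64(ram_size) >> page_shift;
}
```

For a 3GB guest the window is rounded up to 4GB, so 16MB pages need 256
TCE entries while 64K pages need 65536, which is why the bigger page
size is preferred when the mask allows it.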

This adds 4 RTAS handlers:
* ibm,query-pe-dma-window
* ibm,create-pe-dma-window
* ibm,remove-pe-dma-window
* ibm,reset-pe-dma-window
These are registered from type_init() callback.

These RTAS handlers are implemented in a separate file to avoid polluting
spapr_iommu.c with PCI.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v10:
* added dma64_win_addr property to PHB
* removed redundant check for "!migtable" in spapr_tce_table_post_load()

v9:
* fixed default 64bit window start (from mdroth)
* fixed type cast in dma window update code (from mdroth)
* spapr_phb_dma_update() now can fail and cause hotplug failure if
hardware TCE table cannot be mapped to the same bus address as the emulated one

v7:
* fixed uninitialized variables

v6:
* rework as there is no more special device for VFIO PHB

v5:
* total rework
* enabled for machines >2.3
* fixed migration
* merged rtas handlers here

v4:
* reset handler is back in generalized form

v3:
* removed reset
* windows_num is now 1 or bigger rather than 0-based value and it is only
changed in PHB code, not in RTAS
* added page mask check in create()
* added SPAPR_PCI_DDW_MAX_WINDOWS to track how many windows are already
created

v2:
* tested on hacked emulated E1000
* implemented DDW reset on the PHB reset
* spapr_pci_ddw_remove/spapr_pci_ddw_reset are public for reuse by VFIO
---
 hw/ppc/Makefile.objs        |   3 +
 hw/ppc/spapr.c              |   5 +
 hw/ppc/spapr_iommu.c        |  32 ++++-
 hw/ppc/spapr_pci.c          | 110 ++++++++++++++--
 hw/ppc/spapr_pci_vfio.c     |  88 +++++++++++++
 hw/ppc/spapr_rtas_ddw.c     | 300 ++++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/common.c            |   2 +
 include/hw/pci-host/spapr.h |  21 +++-
 include/hw/ppc/spapr.h      |  17 ++-
 trace-events                |   6 +
 10 files changed, 568 insertions(+), 16 deletions(-)
 create mode 100644 hw/ppc/spapr_rtas_ddw.c

diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
index c8ab06e..0b2ff6d 100644
--- a/hw/ppc/Makefile.objs
+++ b/hw/ppc/Makefile.objs
@@ -7,6 +7,9 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o
 ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
 obj-y += spapr_pci_vfio.o
 endif
+ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES), yy)
+obj-y += spapr_rtas_ddw.o
+endif
 # PowerPC 4xx boards
 obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
 obj-y += ppc4xx_pci.o
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 5ca817c..d50d50b 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1860,6 +1860,11 @@ static const TypeInfo spapr_machine_info = {
             .driver   = "spapr-pci-host-bridge",\
             .property = "dynamic-reconfiguration",\
             .value    = "off",\
+        },\
+        {\
+            .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
+            .property = "ddw",\
+            .value    = stringify(off),\
         },
 
 #define SPAPR_COMPAT_2_2 \
diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 2d99c3b..b54c3d8 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -136,6 +136,15 @@ static IOMMUTLBEntry spapr_tce_translate_iommu(MemoryRegion *iommu, hwaddr addr,
     return ret;
 }
 
+static void spapr_tce_table_pre_save(void *opaque)
+{
+    sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
+
+    tcet->migtable = tcet->table;
+}
+
+static void spapr_tce_table_do_enable(sPAPRTCETable *tcet, bool vfio_accel);
+
 static int spapr_tce_table_post_load(void *opaque, int version_id)
 {
     sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
@@ -144,22 +153,39 @@ static int spapr_tce_table_post_load(void *opaque, int version_id)
         spapr_vio_set_bypass(tcet->vdev, tcet->bypass);
     }
 
+    if (tcet->enabled) {
+        if (!tcet->table) {
+            tcet->enabled = false;
+            /* VFIO does not migrate so pass vfio_accel == false */
+            spapr_tce_table_do_enable(tcet, false);
+        }
+        memcpy(tcet->table, tcet->migtable,
+               tcet->nb_table * sizeof(tcet->table[0]));
+        g_free(tcet->migtable);
+        tcet->migtable = NULL;
+    }
+
     return 0;
 }
 
 static const VMStateDescription vmstate_spapr_tce_table = {
     .name = "spapr_iommu",
-    .version_id = 2,
+    .version_id = 3,
     .minimum_version_id = 2,
+    .pre_save = spapr_tce_table_pre_save,
     .post_load = spapr_tce_table_post_load,
     .fields      = (VMStateField []) {
         /* Sanity check */
         VMSTATE_UINT32_EQUAL(liobn, sPAPRTCETable),
-        VMSTATE_UINT32_EQUAL(nb_table, sPAPRTCETable),
 
         /* IOMMU state */
+        VMSTATE_BOOL_V(enabled, sPAPRTCETable, 3),
+        VMSTATE_UINT64_V(bus_offset, sPAPRTCETable, 3),
+        VMSTATE_UINT32_V(page_shift, sPAPRTCETable, 3),
+        VMSTATE_UINT32(nb_table, sPAPRTCETable),
         VMSTATE_BOOL(bypass, sPAPRTCETable),
-        VMSTATE_VARRAY_UINT32(table, sPAPRTCETable, nb_table, 0, vmstate_info_uint64, uint64_t),
+        VMSTATE_VARRAY_UINT32_ALLOC(migtable, sPAPRTCETable, nb_table, 0,
+                                    vmstate_info_uint64, uint64_t),
 
         VMSTATE_END_OF_LIST()
     },
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index d1fa157..b7113b5 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -778,6 +778,9 @@ static int spapr_phb_dma_capabilities_update(sPAPRPHBState *sphb)
 
     sphb->dma32_window_start = 0;
     sphb->dma32_window_size = SPAPR_PCI_DMA32_SIZE;
+    sphb->windows_supported = SPAPR_PCI_DMA_MAX_WINDOWS;
+    sphb->page_size_mask = (1ULL << 12) | (1ULL << 16) | (1ULL << 24);
+    sphb->dma64_window_size = pow2ceil(ram_size);
 
     ret = spapr_phb_vfio_dma_capabilities_update(sphb);
     sphb->has_vfio = (ret == 0);
@@ -785,12 +788,35 @@ static int spapr_phb_dma_capabilities_update(sPAPRPHBState *sphb)
     return 0;
 }
 
-static int spapr_phb_dma_init_window(sPAPRPHBState *sphb,
-                                     uint32_t liobn, uint32_t page_shift,
-                                     uint64_t window_size)
+int spapr_phb_dma_init_window(sPAPRPHBState *sphb,
+                              uint32_t liobn, uint32_t page_shift,
+                              uint64_t window_size)
 {
     uint64_t bus_offset = sphb->dma32_window_start;
     sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
+    int ret;
+
+    if (SPAPR_PCI_DMA_WINDOW_NUM(liobn) && !sphb->ddw_enabled) {
+        return -1;
+    }
+
+    if (sphb->ddw_enabled) {
+        if (sphb->has_vfio) {
+            ret = spapr_phb_vfio_dma_init_window(sphb,
+                                                 page_shift, window_size,
+                                                 &bus_offset);
+            if (ret) {
+                return ret;
+            }
+        } else if (SPAPR_PCI_DMA_WINDOW_NUM(liobn)) {
+            /*
+             * There is no VFIO so we choose a huge window address.
+             * If VFIO is added later, spapr_phb_dma_update() will fail
+             * and cause hotplug failure.
+             */
+            bus_offset = sphb->dma64_window_start;
+        }
+    }
 
     spapr_tce_table_enable(tcet, bus_offset, page_shift,
                            window_size >> page_shift,
@@ -802,9 +828,14 @@ static int spapr_phb_dma_init_window(sPAPRPHBState *sphb,
 int spapr_phb_dma_remove_window(sPAPRPHBState *sphb,
                                 sPAPRTCETable *tcet)
 {
+    int ret = 0;
+
+    if (sphb->has_vfio && sphb->ddw_enabled) {
+        ret = spapr_phb_vfio_dma_remove_window(sphb, tcet);
+    }
     spapr_tce_table_disable(tcet);
 
-    return 0;
+    return ret;
 }
 
 int spapr_phb_dma_reset(sPAPRPHBState *sphb)
@@ -832,15 +863,46 @@ static int spapr_phb_hotplug_dma_sync(sPAPRPHBState *sphb)
     int ret = 0, i;
     bool had_vfio = sphb->has_vfio;
     sPAPRTCETable *tcet;
+    uint64_t bus_offset = 0;
 
     spapr_phb_dma_capabilities_update(sphb);
 
+    /*
+     * PHB got first VFIO device or lost last VFIO device;
+     * If it is the last VFIO device, we do not need windows anymore so
+     * remove them.
+     * If it is the first VFIO device, we have to remove them as
+     * we cannot request a specific window from the host kernel so we
+     * remove all windows and recreate them later if necessary.
+     */
+    if (had_vfio != sphb->has_vfio) {
+        for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
+            tcet = spapr_tce_find_by_liobn(SPAPR_PCI_LIOBN(sphb->index, i));
+            if (!tcet) {
+                continue;
+            }
+            spapr_phb_vfio_dma_remove_window(sphb, tcet);
+        }
+    }
+
     if (!had_vfio && sphb->has_vfio) {
         for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
             tcet = spapr_tce_find_by_liobn(SPAPR_PCI_LIOBN(sphb->index, i));
             if (!tcet || !tcet->enabled) {
                 continue;
             }
+            ret = spapr_phb_vfio_dma_init_window(sphb,
+                                                 tcet->page_shift,
+                                                 (uint64_t)tcet->nb_table <<
+                                                 tcet->page_shift,
+                                                 &bus_offset);
+            if (ret) {
+                break;
+            }
+            if (bus_offset != tcet->bus_offset) {
+                ret = -EFAULT;
+                break;
+            }
             if (tcet->fd >= 0) {
                 /*
                  * We got first vfio-pci device on accelerated table.
@@ -1143,7 +1205,10 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
             error_setg(errp, "Failed to create pci child device tree node");
             goto out;
         }
-        spapr_phb_hotplug_dma_sync(phb);
+        if (spapr_phb_hotplug_dma_sync(phb)) {
+            error_setg(errp, "Failed to create DMA window(s)");
+            goto out;
+        }
     }
 
     drck->attach(drc, DEVICE(pdev),
@@ -1440,15 +1505,17 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
         }
     }
 
-    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
-    if (!tcet) {
-            error_setg(errp, "failed to create TCE table");
+    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
+        tcet = spapr_tce_new_table(DEVICE(sphb),
+                                   SPAPR_PCI_LIOBN(sphb->index, i));
+        if (!tcet) {
+            error_setg(errp, "spapr_tce_new_table failed");
             return;
+        }
+        memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
+                                            spapr_tce_get_iommu(tcet), 0);
     }
 
-    memory_region_add_subregion(&sphb->iommu_root, 0,
-                                spapr_tce_get_iommu(tcet));
-
     sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
 }
 
@@ -1486,8 +1553,12 @@ static Property spapr_phb_properties[] = {
     DEFINE_PROP_UINT64("io_win_addr", sPAPRPHBState, io_win_addr, -1),
     DEFINE_PROP_UINT64("io_win_size", sPAPRPHBState, io_win_size,
                        SPAPR_PCI_IO_WIN_SIZE),
+    DEFINE_PROP_UINT64("dma64_win_addr", sPAPRPHBState, dma64_window_start,
+                       SPAPR_PCI_DMA64_START),
     DEFINE_PROP_BOOL("dynamic-reconfiguration", sPAPRPHBState, dr_enabled,
                      true),
+    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
+    DEFINE_PROP_UINT8("levels", sPAPRPHBState, levels, 0),
     DEFINE_PROP_END_OF_LIST(),
 };
 
@@ -1746,6 +1817,15 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
     uint32_t interrupt_map_mask[] = {
         cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
     uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
+    uint32_t ddw_applicable[] = {
+        cpu_to_be32(RTAS_IBM_QUERY_PE_DMA_WINDOW),
+        cpu_to_be32(RTAS_IBM_CREATE_PE_DMA_WINDOW),
+        cpu_to_be32(RTAS_IBM_REMOVE_PE_DMA_WINDOW)
+    };
+    uint32_t ddw_extensions[] = {
+        cpu_to_be32(1),
+        cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
+    };
     sPAPRTCETable *tcet;
     PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;
     sPAPRFDT s_fdt;
@@ -1770,6 +1850,14 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
     _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
     _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
 
+    /* Dynamic DMA window */
+    if (phb->ddw_enabled) {
+        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
+                         sizeof(ddw_applicable)));
+        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
+                         &ddw_extensions, sizeof(ddw_extensions)));
+    }
+
     /* Build the interrupt-map, this must matches what is done
      * in pci_spapr_map_irq
      */
diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
index fe7d7d1..54089a0 100644
--- a/hw/ppc/spapr_pci_vfio.c
+++ b/hw/ppc/spapr_pci_vfio.c
@@ -22,6 +22,7 @@
 #include "hw/pci/msix.h"
 #include "linux/vfio.h"
 #include "hw/vfio/vfio.h"
+#include "trace.h"
 
 static Property spapr_phb_vfio_properties[] = {
     DEFINE_PROP_INT32("iommu", sPAPRPHBState, iommugroupid, -1),
@@ -42,6 +43,93 @@ int spapr_phb_vfio_dma_capabilities_update(sPAPRPHBState *sphb)
     sphb->dma32_window_start = info.dma32_window_start;
     sphb->dma32_window_size = info.dma32_window_size;
 
+    if (sphb->ddw_enabled && (info.flags & VFIO_IOMMU_SPAPR_INFO_DDW)) {
+        sphb->windows_supported = info.ddw.max_dynamic_windows_supported;
+        sphb->page_size_mask = info.ddw.pgsizes;
+        sphb->dma64_window_size = pow2ceil(ram_size);
+        sphb->max_levels = info.ddw.levels;
+    } else {
+        /* If VFIO_IOMMU_SPAPR_INFO_DDW is not set, disable DDW */
+        sphb->ddw_enabled = false;
+    }
+
+    return ret;
+}
+
+static int spapr_phb_vfio_levels(uint32_t entries)
+{
+    unsigned pages = (entries * sizeof(uint64_t)) / getpagesize();
+    int levels;
+
+    if (pages <= 64) {
+        levels = 1;
+    } else if (pages <= 64*64) {
+        levels = 2;
+    } else if (pages <= 64*64*64) {
+        levels = 3;
+    } else {
+        levels = 4;
+    }
+
+    return levels;
+}
+
+int spapr_phb_vfio_dma_init_window(sPAPRPHBState *sphb,
+                                   uint32_t page_shift,
+                                   uint64_t window_size,
+                                   uint64_t *bus_offset)
+{
+    int ret;
+    struct vfio_iommu_spapr_tce_create create = {
+        .argsz = sizeof(create),
+        .page_shift = page_shift,
+        .window_size = window_size,
+        .levels = sphb->levels,
+        .start_addr = 0,
+    };
+
+    /*
+     * Dynamic windows are supported, that means that there is no
+     * pre-created window and we have to create one.
+     */
+    if (!create.levels) {
+        create.levels = spapr_phb_vfio_levels(create.window_size >>
+                                              page_shift);
+    }
+
+    if (create.levels > sphb->max_levels) {
+        return -EINVAL;
+    }
+
+    ret = vfio_container_ioctl(&sphb->iommu_as,
+                               VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
+    if (ret) {
+        return ret;
+    }
+    *bus_offset = create.start_addr;
+
+    trace_spapr_pci_vfio_init_window(page_shift, window_size, *bus_offset);
+
+    return 0;
+}
+
+int spapr_phb_vfio_dma_remove_window(sPAPRPHBState *sphb,
+                                     sPAPRTCETable *tcet)
+{
+    struct vfio_iommu_spapr_tce_remove remove = {
+        .argsz = sizeof(remove),
+        .start_addr = tcet->bus_offset
+    };
+    int ret;
+
+    ret = vfio_container_ioctl(&sphb->iommu_as,
+                               VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
+    if (ret) {
+        return ret;
+    }
+
+    trace_spapr_pci_vfio_remove_window(tcet->bus_offset);
+
     return ret;
 }
 
diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
new file mode 100644
index 0000000..7539c6a
--- /dev/null
+++ b/hw/ppc/spapr_rtas_ddw.c
@@ -0,0 +1,300 @@
+/*
+ * QEMU sPAPR Dynamic DMA windows support
+ *
+ * Copyright (c) 2014 Alexey Kardashevskiy, IBM Corporation.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License,
+ *  or (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/error-report.h"
+#include "hw/ppc/spapr.h"
+#include "hw/pci-host/spapr.h"
+#include "trace.h"
+
+static int spapr_phb_get_active_win_num_cb(Object *child, void *opaque)
+{
+    sPAPRTCETable *tcet;
+
+    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
+    if (tcet && tcet->enabled) {
+        ++*(unsigned *)opaque;
+    }
+    return 0;
+}
+
+static unsigned spapr_phb_get_active_win_num(sPAPRPHBState *sphb)
+{
+    unsigned ret = 0;
+
+    object_child_foreach(OBJECT(sphb), spapr_phb_get_active_win_num_cb, &ret);
+
+    return ret;
+}
+
+static int spapr_phb_get_free_liobn_cb(Object *child, void *opaque)
+{
+    sPAPRTCETable *tcet;
+
+    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
+    if (tcet && !tcet->enabled) {
+        *(uint32_t *)opaque = tcet->liobn;
+        return 1;
+    }
+    return 0;
+}
+
+static unsigned spapr_phb_get_free_liobn(sPAPRPHBState *sphb)
+{
+    uint32_t liobn = 0;
+
+    object_child_foreach(OBJECT(sphb), spapr_phb_get_free_liobn_cb, &liobn);
+
+    return liobn;
+}
+
+static uint32_t spapr_query_mask(struct ppc_one_seg_page_size *sps,
+                                 uint64_t page_mask)
+{
+    int i, j;
+    uint32_t mask = 0;
+    const struct { int shift; uint32_t mask; } masks[] = {
+        { 12, RTAS_DDW_PGSIZE_4K },
+        { 16, RTAS_DDW_PGSIZE_64K },
+        { 24, RTAS_DDW_PGSIZE_16M },
+        { 25, RTAS_DDW_PGSIZE_32M },
+        { 26, RTAS_DDW_PGSIZE_64M },
+        { 27, RTAS_DDW_PGSIZE_128M },
+        { 28, RTAS_DDW_PGSIZE_256M },
+        { 34, RTAS_DDW_PGSIZE_16G },
+    };
+
+    for (i = 0; i < PPC_PAGE_SIZES_MAX_SZ; i++) {
+        for (j = 0; j < ARRAY_SIZE(masks); ++j) {
+            if ((sps[i].page_shift == masks[j].shift) &&
+                    (page_mask & (1ULL << masks[j].shift))) {
+                mask |= masks[j].mask;
+            }
+        }
+    }
+
+    return mask;
+}
+
+static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
+                                         sPAPRMachineState *spapr,
+                                         uint32_t token, uint32_t nargs,
+                                         target_ulong args,
+                                         uint32_t nret, target_ulong rets)
+{
+    CPUPPCState *env = &cpu->env;
+    sPAPRPHBState *sphb;
+    uint64_t buid;
+    uint32_t avail, addr, pgmask = 0;
+    unsigned current;
+
+    if ((nargs != 3) || (nret != 5)) {
+        goto param_error_exit;
+    }
+
+    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
+    addr = rtas_ld(args, 0);
+    sphb = spapr_pci_find_phb(spapr, buid);
+    if (!sphb || !sphb->ddw_enabled) {
+        goto param_error_exit;
+    }
+
+    current = spapr_phb_get_active_win_num(sphb);
+    avail = (sphb->windows_supported > current) ?
+            (sphb->windows_supported - current) : 0;
+
+    /* Work out supported page masks */
+    pgmask = spapr_query_mask(env->sps.sps, sphb->page_size_mask);
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+    rtas_st(rets, 1, avail);
+
+    /*
+     * This is "Largest contiguous block of TCEs allocated specifically
+     * for (that is, are reserved for) this PE".
+     * Return the maximum number as if all RAM was mapped in 4K pages.
+     */
+    rtas_st(rets, 2, sphb->dma64_window_size >> SPAPR_TCE_PAGE_SHIFT);
+    rtas_st(rets, 3, pgmask);
+    rtas_st(rets, 4, 0); /* DMA migration mask, not supported */
+
+    trace_spapr_iommu_ddw_query(buid, addr, avail, sphb->dma64_window_size,
+                                pgmask);
+    return;
+
+param_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+}
+
+static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
+                                          sPAPRMachineState *spapr,
+                                          uint32_t token, uint32_t nargs,
+                                          target_ulong args,
+                                          uint32_t nret, target_ulong rets)
+{
+    sPAPRPHBState *sphb;
+    sPAPRTCETable *tcet = NULL;
+    uint32_t addr, page_shift, window_shift, liobn;
+    uint64_t buid;
+    long ret;
+
+    if ((nargs != 5) || (nret != 4)) {
+        goto param_error_exit;
+    }
+
+    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
+    addr = rtas_ld(args, 0);
+    sphb = spapr_pci_find_phb(spapr, buid);
+    if (!sphb || !sphb->ddw_enabled) {
+        goto param_error_exit;
+    }
+
+    page_shift = rtas_ld(args, 3);
+    window_shift = rtas_ld(args, 4);
+    liobn = spapr_phb_get_free_liobn(sphb);
+
+    if (!liobn || !(sphb->page_size_mask & (1ULL << page_shift))) {
+        goto hw_error_exit;
+    }
+
+    ret = spapr_phb_dma_init_window(sphb, liobn, page_shift,
+                                    1ULL << window_shift);
+    tcet = spapr_tce_find_by_liobn(liobn);
+    trace_spapr_iommu_ddw_create(buid, addr, 1ULL << page_shift,
+                                 1ULL << window_shift,
+                                 tcet ? tcet->bus_offset : 0xbaadf00d,
+                                 liobn, ret);
+    if (ret || !tcet) {
+        goto hw_error_exit;
+    }
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+    rtas_st(rets, 1, liobn);
+    rtas_st(rets, 2, tcet->bus_offset >> 32);
+    rtas_st(rets, 3, tcet->bus_offset & ((uint32_t) -1));
+
+    return;
+
+hw_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
+    return;
+
+param_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+}
+
+static void rtas_ibm_remove_pe_dma_window(PowerPCCPU *cpu,
+                                          sPAPRMachineState *spapr,
+                                          uint32_t token, uint32_t nargs,
+                                          target_ulong args,
+                                          uint32_t nret, target_ulong rets)
+{
+    sPAPRPHBState *sphb;
+    sPAPRTCETable *tcet;
+    uint32_t liobn;
+    long ret;
+
+    if ((nargs != 1) || (nret != 1)) {
+        goto param_error_exit;
+    }
+
+    liobn = rtas_ld(args, 0);
+    tcet = spapr_tce_find_by_liobn(liobn);
+    if (!tcet) {
+        goto param_error_exit;
+    }
+
+    sphb = SPAPR_PCI_HOST_BRIDGE(OBJECT(tcet)->parent);
+    if (!sphb || !sphb->ddw_enabled) {
+        goto param_error_exit;
+    }
+
+    ret = spapr_phb_dma_remove_window(sphb, tcet);
+    trace_spapr_iommu_ddw_remove(liobn, ret);
+    if (ret) {
+        goto hw_error_exit;
+    }
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+    return;
+
+hw_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
+    return;
+
+param_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+}
+
+static void rtas_ibm_reset_pe_dma_window(PowerPCCPU *cpu,
+                                         sPAPRMachineState *spapr,
+                                         uint32_t token, uint32_t nargs,
+                                         target_ulong args,
+                                         uint32_t nret, target_ulong rets)
+{
+    sPAPRPHBState *sphb;
+    uint64_t buid;
+    uint32_t addr;
+    long ret;
+
+    if ((nargs != 3) || (nret != 1)) {
+        goto param_error_exit;
+    }
+
+    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
+    addr = rtas_ld(args, 0);
+    sphb = spapr_pci_find_phb(spapr, buid);
+    if (!sphb || !sphb->ddw_enabled) {
+        goto param_error_exit;
+    }
+
+    ret = spapr_phb_dma_reset(sphb);
+    trace_spapr_iommu_ddw_reset(buid, addr, ret);
+    if (ret) {
+        goto hw_error_exit;
+    }
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+
+    return;
+
+hw_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
+    return;
+
+param_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+}
+
+static void spapr_rtas_ddw_init(void)
+{
+    spapr_rtas_register(RTAS_IBM_QUERY_PE_DMA_WINDOW,
+                        "ibm,query-pe-dma-window",
+                        rtas_ibm_query_pe_dma_window);
+    spapr_rtas_register(RTAS_IBM_CREATE_PE_DMA_WINDOW,
+                        "ibm,create-pe-dma-window",
+                        rtas_ibm_create_pe_dma_window);
+    spapr_rtas_register(RTAS_IBM_REMOVE_PE_DMA_WINDOW,
+                        "ibm,remove-pe-dma-window",
+                        rtas_ibm_remove_pe_dma_window);
+    spapr_rtas_register(RTAS_IBM_RESET_PE_DMA_WINDOW,
+                        "ibm,reset-pe-dma-window",
+                        rtas_ibm_reset_pe_dma_window);
+}
+
+type_init(spapr_rtas_ddw_init)
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 0c7ba8c..b6bbd43 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1044,6 +1044,8 @@ int vfio_container_ioctl(AddressSpace *as,
     case VFIO_CHECK_EXTENSION:
     case VFIO_IOMMU_SPAPR_TCE_GET_INFO:
     case VFIO_EEH_PE_OP:
+    case VFIO_IOMMU_SPAPR_TCE_CREATE:
+    case VFIO_IOMMU_SPAPR_TCE_REMOVE:
         break;
     default:
         /* Return an error on unknown requests */
diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
index 8b007aa..911fa27 100644
--- a/include/hw/pci-host/spapr.h
+++ b/include/hw/pci-host/spapr.h
@@ -89,6 +89,13 @@ struct sPAPRPHBState {
     uint32_t dma32_window_size;
     bool has_vfio;
     int32_t iommugroupid; /* obsolete */
+    bool ddw_enabled;
+    uint32_t windows_supported;
+    uint64_t page_size_mask;
+    uint64_t dma64_window_start;
+    uint64_t dma64_window_size;
+    uint8_t max_levels;
+    uint8_t levels;
 
     QLIST_ENTRY(sPAPRPHBState) list;
 };
@@ -111,7 +118,10 @@ struct sPAPRPHBState {
 
 #define SPAPR_PCI_DMA32_SIZE         0x40000000
 
-#define SPAPR_PCI_DMA_MAX_WINDOWS    1
+#define SPAPR_PCI_DMA_MAX_WINDOWS    2
+
+/* Default 64bit dynamic window offset */
+#define SPAPR_PCI_DMA64_START        0x800000000000000ULL
 
 static inline qemu_irq spapr_phb_lsi_qirq(struct sPAPRPHBState *phb, int pin)
 {
@@ -133,11 +143,20 @@ void spapr_pci_rtas_init(void);
 sPAPRPHBState *spapr_pci_find_phb(sPAPRMachineState *spapr, uint64_t buid);
 PCIDevice *spapr_pci_find_dev(sPAPRMachineState *spapr, uint64_t buid,
                               uint32_t config_addr);
+int spapr_phb_dma_init_window(sPAPRPHBState *sphb,
+                              uint32_t liobn, uint32_t page_shift,
+                              uint64_t window_size);
 int spapr_phb_dma_remove_window(sPAPRPHBState *sphb,
                                 sPAPRTCETable *tcet);
 int spapr_phb_dma_reset(sPAPRPHBState *sphb);
 
 int spapr_phb_vfio_dma_capabilities_update(sPAPRPHBState *sphb);
+int spapr_phb_vfio_dma_init_window(sPAPRPHBState *sphb,
+                                   uint32_t page_shift,
+                                   uint64_t window_size,
+                                   uint64_t *bus_offset);
+int spapr_phb_vfio_dma_remove_window(sPAPRPHBState *sphb,
+                                     sPAPRTCETable *tcet);
 int spapr_phb_vfio_eeh_set_option(sPAPRPHBState *sphb,
                                   PCIDevice *pdev, int option);
 int spapr_phb_vfio_eeh_get_state(sPAPRPHBState *sphb, int *state);
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 4645f16..5a58785 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -416,6 +416,16 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
 #define RTAS_OUT_NOT_SUPPORTED      -3
 #define RTAS_OUT_NOT_AUTHORIZED     -9002
 
+/* DDW pagesize mask values from ibm,query-pe-dma-window */
+#define RTAS_DDW_PGSIZE_4K       0x01
+#define RTAS_DDW_PGSIZE_64K      0x02
+#define RTAS_DDW_PGSIZE_16M      0x04
+#define RTAS_DDW_PGSIZE_32M      0x08
+#define RTAS_DDW_PGSIZE_64M      0x10
+#define RTAS_DDW_PGSIZE_128M     0x20
+#define RTAS_DDW_PGSIZE_256M     0x40
+#define RTAS_DDW_PGSIZE_16G      0x80
+
 /* RTAS tokens */
 #define RTAS_TOKEN_BASE      0x2000
 
@@ -457,8 +467,12 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
 #define RTAS_IBM_SET_SLOT_RESET                 (RTAS_TOKEN_BASE + 0x23)
 #define RTAS_IBM_CONFIGURE_PE                   (RTAS_TOKEN_BASE + 0x24)
 #define RTAS_IBM_SLOT_ERROR_DETAIL              (RTAS_TOKEN_BASE + 0x25)
+#define RTAS_IBM_QUERY_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x26)
+#define RTAS_IBM_CREATE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x27)
+#define RTAS_IBM_REMOVE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x28)
+#define RTAS_IBM_RESET_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x29)
 
-#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x26)
+#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x2A)
 
 /* RTAS ibm,get-system-parameter token values */
 #define RTAS_SYSPARM_SPLPAR_CHARACTERISTICS      20
@@ -558,6 +572,7 @@ struct sPAPRTCETable {
     uint64_t bus_offset;
     uint32_t page_shift;
     uint64_t *table;
+    uint64_t *migtable;
     bool bypass;
     int fd;
     MemoryRegion root, iommu;
diff --git a/trace-events b/trace-events
index b300e94..a1234dd 100644
--- a/trace-events
+++ b/trace-events
@@ -1302,6 +1302,8 @@ spapr_pci_lsi_set(const char *busname, int pin, uint32_t irq) "%s PIN%d IRQ %u"
 spapr_pci_msi_retry(unsigned config_addr, unsigned req_num, unsigned max_irqs) "Guest device at %x asked %u, have only %u"
 spapr_pci_dma_update(uint64_t liobn, long ret) "liobn=%"PRIx64" ret=%ld"
 spapr_pci_dma_realloc_update(uint64_t liobn, long ret) "liobn=%"PRIx64" tcet=%ld"
+spapr_pci_vfio_init_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x winsize=0x%"PRIx64" offset=0x%"PRIx64
+spapr_pci_vfio_remove_window(uint64_t off) "offset=%"PRIx64
 
 # hw/pci/pci.c
 pci_update_mappings_del(void *d, uint32_t bus, uint32_t func, uint32_t slot, int bar, uint64_t addr, uint64_t size) "d=%p %02x:%02x.%x %d,%#"PRIx64"+%#"PRIx64
@@ -1365,6 +1367,10 @@ spapr_iommu_pci_indirect(uint64_t liobn, uint64_t ioba, uint64_t tce, uint64_t i
 spapr_iommu_pci_stuff(uint64_t liobn, uint64_t ioba, uint64_t tce_value, uint64_t npages, uint64_t ret) "liobn=%"PRIx64" ioba=0x%"PRIx64" tcevalue=0x%"PRIx64" npages=%"PRId64" ret=%"PRId64
 spapr_iommu_xlate(uint64_t liobn, uint64_t ioba, uint64_t tce, unsigned perm, unsigned pgsize) "liobn=%"PRIx64" 0x%"PRIx64" -> 0x%"PRIx64" perm=%u mask=%x"
 spapr_iommu_alloc_table(uint64_t liobn, void *table, int fd) "liobn=%"PRIx64" table=%p fd=%d"
+spapr_iommu_ddw_query(uint64_t buid, uint32_t cfgaddr, unsigned wa, uint64_t win_size, uint32_t pgmask) "buid=%"PRIx64" addr=%"PRIx32", %u windows available, max window size=%"PRIx64", mask=%"PRIx32
+spapr_iommu_ddw_create(uint64_t buid, uint32_t cfgaddr, unsigned long long pg_size, unsigned long long req_size, uint64_t start, uint32_t liobn, long ret) "buid=%"PRIx64" addr=%"PRIx32", page size=0x%llx, requested=0x%llx, start addr=%"PRIx64", liobn=%"PRIx32", ret = %ld"
+spapr_iommu_ddw_remove(uint32_t liobn, long ret) "liobn=%"PRIx32", ret = %ld"
+spapr_iommu_ddw_reset(uint64_t buid, uint32_t cfgaddr, long ret) "buid=%"PRIx64" addr=%"PRIx32", ret = %ld"
 
 # hw/ppc/ppc.c
 ppc_tb_adjust(uint64_t offs1, uint64_t offs2, int64_t diff, int64_t seconds) "adjusted from 0x%"PRIx64" to 0x%"PRIx64", diff %"PRId64" (%"PRId64"s)"
-- 
2.4.0.rc3.8.gfb3e7d5
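
For readers following spapr_phb_vfio_levels() in the patch above, here is a
minimal standalone sketch of the same threshold logic. It assumes the host
page size is passed in explicitly instead of being read with getpagesize();
the name sketch_tce_levels() and its parameters are hypothetical and are not
part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch: pick the number of TCE table
 * levels from the window geometry, using the same thresholds as
 * spapr_phb_vfio_levels() (a factor of 64 host pages per level step).
 */
static int sketch_tce_levels(uint64_t window_size, uint32_t page_shift,
                             uint64_t host_page_size)
{
    uint64_t entries, pages;

    assert(page_shift < 64 && host_page_size != 0);

    /* One 64-bit TCE per IOMMU page in the window */
    entries = window_size >> page_shift;
    /* Number of host pages needed to hold all TCEs */
    pages = (entries * sizeof(uint64_t)) / host_page_size;

    if (pages <= 64) {
        return 1;
    } else if (pages <= 64 * 64) {
        return 2;
    } else if (pages <= 64 * 64 * 64) {
        return 3;
    }
    return 4;
}
```

With a 64K host page size, a 2GB window of 4K IOMMU pages needs 4MB of TCEs
(64 host pages) and stays at one level, while a 1TB window of 4K pages would
need three levels under these thresholds.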


* Re: [Qemu-devel] [PATCH qemu v10 05/14] spapr_iommu: Introduce "enabled" state for TCE table
  2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 05/14] spapr_iommu: Introduce "enabled" state for TCE table Alexey Kardashevskiy
@ 2015-07-06 10:07   ` David Gibson
  2015-07-06 17:04   ` Thomas Huth
  1 sibling, 0 replies; 71+ messages in thread
From: David Gibson @ 2015-07-06 10:07 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Michael Roth, Alex Williamson, qemu-ppc, qemu-devel, Gavin Shan

On Mon, Jul 06, 2015 at 12:11:01PM +1000, Alexey Kardashevskiy wrote:
> Currently TCE tables are created once at start and their size never
> changes. We are going to change that by introducing Dynamic DMA windows
> support, where the DMA configuration may change during guest execution.
> 
> This changes spapr_tce_new_table() to create an empty stub object. Only
> LIOBN is assigned by the time of creation. It still will be called once
> at the owner object (VIO or PHB) creation.
> 
> This introduces an "enabled" state for TCE table objects with two
> helper functions - spapr_tce_table_enable()/spapr_tce_table_disable().
> spapr_tce_table_enable() receives TCE table parameters and allocates
> a guest view of the TCE table (in the user space or KVM).
> spapr_tce_table_disable() disposes the table.
> 
> Follow up patches will disable+enable tables on reset (system reset
> or DDW reset).
> 
> No visible change in behaviour is expected except the actual table
> will be reallocated every reset. We might optimize this later.
> 
> The other way to implement this would be dynamically create/remove
> the TCE table QOM objects but this would make migration impossible
> as migration expects all QOM objects to exist at the receiver
> so we have to have TCE table objects created when migration begins.
> 
> spapr_tce_table_do_enable() is separated from spapr_tce_table_enable()
> as later it will be called at the sPAPRTCETable post-migration stage when
> it has all the properties set after the migration.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
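
The enabled/disabled lifecycle described in the quoted commit message can be
sketched as below. This is only an illustration under stated assumptions: the
struct and the sketch_* functions are hypothetical, the table is always
allocated in user space with calloc(), and the KVM-backed allocation path
mentioned above is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a TCE table object; not the QEMU structure. */
struct sketch_tce_table {
    uint32_t liobn;       /* assigned once, at creation */
    bool enabled;
    uint64_t bus_offset;
    uint32_t page_shift;
    uint32_t nb_table;
    uint64_t *table;      /* guest view, present only while enabled */
};

/* Creation: an empty stub that only knows its LIOBN. */
static void sketch_table_init(struct sketch_tce_table *t, uint32_t liobn)
{
    assert(t != NULL);
    *t = (struct sketch_tce_table){ .liobn = liobn };
}

/* Enable: receive the window parameters and allocate the guest view. */
static int sketch_table_enable(struct sketch_tce_table *t, uint64_t bus_offset,
                               uint32_t page_shift, uint32_t nb_table)
{
    uint64_t *table;

    if (t->enabled || nb_table == 0) {
        return -1;
    }
    table = calloc(nb_table, sizeof(*table));
    if (!table) {
        return -1;
    }
    t->bus_offset = bus_offset;
    t->page_shift = page_shift;
    t->nb_table = nb_table;
    t->table = table;
    t->enabled = true;
    return 0;
}

/* Disable: dispose of the table but keep the object and its LIOBN. */
static void sketch_table_disable(struct sketch_tce_table *t)
{
    if (!t->enabled) {
        return;
    }
    free(t->table);
    t->table = NULL;
    t->bus_offset = 0;
    t->page_shift = 0;
    t->nb_table = 0;
    t->enabled = false;
}
```

Keeping the object alive across disable/enable is what lets the same stub be
reused on reset, which matches the migration argument in the commit message.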

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [Qemu-devel] [PATCH qemu v10 10/14] spapr_pci: Enable vfio-pci hotplug
  2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 10/14] spapr_pci: Enable vfio-pci hotplug Alexey Kardashevskiy
@ 2015-07-06 10:27   ` David Gibson
  2015-07-06 21:31   ` Thomas Huth
  2015-07-10 21:33   ` Michael Roth
  2 siblings, 0 replies; 71+ messages in thread
From: David Gibson @ 2015-07-06 10:27 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Michael Roth, Alex Williamson, qemu-ppc, qemu-devel, Gavin Shan

On Mon, Jul 06, 2015 at 12:11:06PM +1000, Alexey Kardashevskiy wrote:
> sPAPR IOMMU is managing two copies of a TCE table:
> 1) a guest view of the table - this is what emulated devices use and
> this is where H_GET_TCE reads from;
> 2) a hardware TCE table - only present if there is at least one vfio-pci
> device on a PHB; it is updated via a memory listener on a PHB address
> space which forwards map/unmap requests to vfio-pci IOMMU host driver.
> 
> At the moment, the presence of vfio-pci devices on a bus affects the way
> the guest view table is allocated. If there is no vfio-pci on a PHB
> and the host kernel supports KVM acceleration of H_PUT_TCE, a table
> is allocated in KVM. However, if there is vfio-pci and we do not yet
> support KVM acceleration for these, the table has to be allocated
> by the userspace.
> 
> When a vfio-pci device is hotplugged and there were no vfio-pci devices
> before, the guest view table could have been allocated by KVM, which
> means that H_PUT_TCE is handled by the host kernel and, since we
> do not support vfio-pci in KVM, the hardware table will not be updated.
> 
> This reallocates the guest view table in QEMU if the first vfio-pci
> device has just been plugged. spapr_tce_realloc_userspace() handles this.
> 
> This replays all the mappings to make sure that the tables are in sync.
> This will not have a visible effect though as for a new device
> the guest kernel will allocate-and-map new addresses and therefore
> existing mappings from emulated devices will not be used by vfio-pci
> devices.
> 
> This adds calls to spapr_phb_dma_capabilities_update() in PCI hotplug
> hooks.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
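
The replay step in the quoted commit message (walking the guest view and
pushing every live mapping to the hardware table) might look roughly like the
sketch below. Everything named sketch_* is hypothetical, and treating the low
two bits of a TCE as its read/write permission bits is an assumption made for
this illustration; the real code forwards mappings through the memory
listener instead of a plain callback.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption for this sketch: low two bits of a TCE are read/write perms. */
#define SKETCH_TCE_PERM_MASK 0x3ULL

/* Hypothetical "hardware side" callback: receives one mapping at a time. */
typedef void (*sketch_map_fn)(void *opaque, uint64_t ioba, uint64_t tce);

/* Walk the guest view and replay every entry that grants some access. */
static size_t sketch_replay(const uint64_t *table, uint32_t nb_table,
                            uint64_t bus_offset, uint32_t page_shift,
                            sketch_map_fn map, void *opaque)
{
    size_t replayed = 0;
    uint32_t i;

    assert(table != NULL && map != NULL && page_shift < 64);

    for (i = 0; i < nb_table; i++) {
        if ((table[i] & SKETCH_TCE_PERM_MASK) == 0) {
            continue; /* unmapped entry, nothing to replay */
        }
        map(opaque, bus_offset + ((uint64_t)i << page_shift), table[i]);
        replayed++;
    }
    return replayed;
}

/* Example callback that only records what it was asked to map. */
struct sketch_hw_log {
    size_t count;
    uint64_t last_ioba;
};

static void sketch_log_map(void *opaque, uint64_t ioba, uint64_t tce)
{
    struct sketch_hw_log *log = opaque;

    (void)tce;
    log->count++;
    log->last_ioba = ioba;
}
```

Because only entries with permission bits set are forwarded, an empty guest
view replays nothing, which is consistent with the note above that the replay
has no visible effect for a freshly plugged device.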

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [Qemu-devel] [PATCH qemu v10 12/14] vfio: Unregister IOMMU notifiers when container is destroyed
  2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 12/14] vfio: Unregister IOMMU notifiers when container is destroyed Alexey Kardashevskiy
@ 2015-07-06 10:33   ` David Gibson
  2015-07-06 12:49     ` Alex Williamson
  0 siblings, 1 reply; 71+ messages in thread
From: David Gibson @ 2015-07-06 10:33 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Michael Roth, Alex Williamson, qemu-ppc, qemu-devel, Gavin Shan

On Mon, Jul 06, 2015 at 12:11:08PM +1000, Alexey Kardashevskiy wrote:
> On systems with guest visible IOMMU, adding a new memory region onto
> PCI bus calls vfio_listener_region_add() for every DMA window. This
> installs a notifier for IOMMU memory regions. The notifier is supposed
> to be removed by vfio_listener_region_del(); however, in the case of a mixed
> PHB (emulated + VFIO devices), when the last VFIO device is unplugged and
> the container gets destroyed, all existing DMA windows stay alive together
> with their notifiers, which are on a linked list whose head was in
> the destroyed container.
> 
> This unregisters IOMMU memory region notifier when a container is
> destroyed.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

Alex,

I think this is correct, but you've probably got a better
understanding of it.  Will you take this through your tree?


> ---
> Changes:
> v10:
> * new to the patchset
> ---
>  hw/vfio/common.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 89ef37b..8eacfd7 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -772,11 +772,19 @@ static void vfio_disconnect_container(VFIOGroup *group)
>  
>      if (QLIST_EMPTY(&container->group_list)) {
>          VFIOAddressSpace *space = container->space;
> +        VFIOGuestIOMMU *giommu, *tmp;
>  
>          if (container->iommu_data.release) {
>              container->iommu_data.release(container);
>          }
>          QLIST_REMOVE(container, next);
> +
> +        QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
> +            memory_region_unregister_iommu_notifier(&giommu->n);
> +            QLIST_REMOVE(giommu, giommu_next);
> +            g_free(giommu);
> +        }
> +
>          trace_vfio_disconnect_container(container->fd);
>          close(container->fd);
>          g_free(container);
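
For reference, the loop added above frees each list element while walking
the list, which is why it uses the _SAFE iterator. Below is a minimal
standalone sketch of that pattern in plain C. It does not use QEMU's
queue.h; struct notifier_node, node_push() and node_free_all() are
hypothetical names made up for this illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for an element on a notifier list. */
struct notifier_node {
    int id;
    struct notifier_node *next;
};

/* Prepend a node and return the new head. */
static inline struct notifier_node *node_push(struct notifier_node *head,
                                              int id)
{
    struct notifier_node *n = malloc(sizeof(*n));

    assert(n != NULL);
    n->id = id;
    n->next = head;
    return n;
}

/*
 * Free every node. The successor is saved in tmp before the current
 * node is freed, so the walk never reads freed memory. Returns the
 * number of nodes released and leaves *head empty.
 */
static inline unsigned node_free_all(struct notifier_node **head)
{
    struct notifier_node *n = *head, *tmp;
    unsigned freed = 0;

    while (n) {
        tmp = n->next;
        free(n);
        freed++;
        n = tmp;
    }
    *head = NULL;
    return freed;
}
```

A plain (non-safe) walk would read n->next after free(n), which is the
failure the temporary pointer avoids.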

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH qemu v10 14/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 14/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) Alexey Kardashevskiy
@ 2015-07-06 11:06   ` David Gibson
  2015-07-06 11:27     ` Alexey Kardashevskiy
  2015-07-07  9:46     ` Alexey Kardashevskiy
  2015-07-07  4:58   ` David Gibson
  2015-07-07  9:33   ` Thomas Huth
  2 siblings, 2 replies; 71+ messages in thread
From: David Gibson @ 2015-07-06 11:06 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Michael Roth, Alex Williamson, qemu-ppc, qemu-devel, Gavin Shan

On Mon, Jul 06, 2015 at 12:11:10PM +1000, Alexey Kardashevskiy wrote:
> This adds support for the Dynamic DMA Windows (DDW) option defined by
> the SPAPR specification, which allows additional DMA window(s).
> 
> This implements DDW for emulated and VFIO devices. As all TCE root regions
> are mapped at 0 and 64bit long (and actual tables are child regions),
> this replaces memory_region_add_subregion() with _overlap() to make
> QEMU memory API happy.
> 
> This reserves RTAS token numbers for DDW calls.
> 
> This implements helpers to interact with VFIO kernel interface.
> 
> This changes the TCE table migration descriptor to support dynamic
> tables as from now on, PHB will create as many stub TCE table objects
> as PHB can possibly support but not all of them might be initialized at
> the time of migration because DDW might or might not be requested by
> the guest.
> 
> The "ddw" property is enabled by default on a PHB but for compatibility
> the pseries-2.3 machine and older disable it.
> 
> This implements DDW for VFIO. The host kernel support is required.
> This adds a "levels" property to PHB to control the number of levels
> in the actual TCE table allocated by the host kernel, 0 is the default
> value to tell QEMU to calculate the correct value. Current hardware
> supports up to 5 levels.
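
For illustration, a minimal standalone sketch of how the number of levels
could be derived from the window size when "levels" is left at 0. It
mirrors the thresholds of spapr_phb_vfio_levels() in this patch. The name
tce_levels_for() and the explicit host_page_size parameter are assumptions
of this sketch (the patch calls getpagesize() and takes a uint32_t entry
count), chosen so the result is deterministic.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of TCE table levels for a window with
 * the given number of 8-byte TCE entries, using the same 64/64^2/64^3
 * page-count thresholds as spapr_phb_vfio_levels().
 */
static inline int tce_levels_for(uint64_t entries, uint64_t host_page_size)
{
    uint64_t pages;

    assert(host_page_size != 0);
    pages = (entries * sizeof(uint64_t)) / host_page_size;

    if (pages <= 64ULL) {
        return 1;
    } else if (pages <= 64ULL * 64) {
        return 2;
    } else if (pages <= 64ULL * 64 * 64) {
        return 3;
    }
    return 4;
}

static inline void tce_levels_example(void)
{
    /* 2GB window, 4K IOMMU pages, 64K host pages: 2^19 entries, 64 pages */
    assert(tce_levels_for(1ULL << 19, 65536) == 1);
    /* 1TB window, 64K IOMMU pages, 64K host pages: 2^24 entries, 2048 pages */
    assert(tce_levels_for(1ULL << 24, 65536) == 2);
}
```

Note that this sketch, like the function in the patch, never returns more
than 4.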
> 
> Existing Linux guests try to create one additional huge DMA window
> with 64K or 16MB pages and map the entire guest RAM to it. If this succeeds,
> the guest switches to dma_direct_ops and never calls TCE hypercalls
> (H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
> and not waste time on map/unmap later. This adds a "dma64_win_addr"
> property which is a bus address for the 64bit window and by default
> set to 0x800.0000.0000.0000 as this is what the modern POWER8 hardware
> uses and this allows having emulated and VFIO devices on the same bus.
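
As a sketch of the page size check involved here: the create handler in
this patch tests the requested page shift against the PHB's page size mask
(4K, 64K and 16M for the emulated case). The helper name
ddw_page_shift_supported() and the DDW_EMULATED_PGSIZES macro are
hypothetical, and the guard against shifts of 64 or more is an addition of
this sketch rather than something the patch does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same bits the patch sets in page_size_mask for emulated PHBs. */
#define DDW_EMULATED_PGSIZES ((1ULL << 12) | (1ULL << 16) | (1ULL << 24))

/* Hypothetical helper: is 1 << page_shift one of the supported sizes? */
static inline bool ddw_page_shift_supported(uint64_t page_size_mask,
                                            uint32_t page_shift)
{
    if (page_shift >= 64) {
        /* Shifting a 64-bit value by 64 or more is undefined in C. */
        return false;
    }
    return (page_size_mask & (1ULL << page_shift)) != 0;
}

static inline void ddw_page_shift_example(void)
{
    assert(ddw_page_shift_supported(DDW_EMULATED_PGSIZES, 16));  /* 64K */
    assert(!ddw_page_shift_supported(DDW_EMULATED_PGSIZES, 21)); /* 2M */
}
```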
> 
> This adds 4 RTAS handlers:
> * ibm,query-pe-dma-window
> * ibm,create-pe-dma-window
> * ibm,remove-pe-dma-window
> * ibm,reset-pe-dma-window
> These are registered from type_init() callback.
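
These handlers exchange 64-bit values (the BUID on input, the window bus
address on output) as pairs of 32-bit RTAS cells. A minimal sketch of that
packing follows; the helper names are hypothetical, as the patch open-codes
the same arithmetic around rtas_ld() and rtas_st().

```c
#include <assert.h>
#include <stdint.h>

/* Combine two 32-bit cells (high cell first) into one 64-bit value. */
static inline uint64_t cells_to_u64(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Split a 64-bit value into its high and low 32-bit cells. */
static inline uint32_t u64_hi_cell(uint64_t v)
{
    return (uint32_t)(v >> 32);
}

static inline uint32_t u64_lo_cell(uint64_t v)
{
    return (uint32_t)(v & 0xffffffffULL);
}

static inline void cells_example(void)
{
    /* The default 64bit window start in this patch is 1 << 59. */
    uint64_t start = 1ULL << 59;

    assert(u64_hi_cell(start) == 0x08000000u);
    assert(u64_lo_cell(start) == 0);
    assert(cells_to_u64(u64_hi_cell(start), u64_lo_cell(start)) == start);
}
```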
> 
> These RTAS handlers are implemented in a separate file to avoid polluting
> spapr_iommu.c with PCI.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v10:
> * added dma64_win_addr property to PHB
> * removed redundant check for "!migtable" in spapr_tce_table_post_load()
> 
> v9:
> * fixed default 64bit window start (from mdroth)
> * fixed type cast in dma window update code (from mdroth)
> * spapr_phb_dma_update() now can fail and cause hotplug failure if
> hardware TCE table cannot be mapped to the same bus address as the emulated one
> 
> v7:
> * fixed uninitialized variables
> 
> v6:
> * rework as there is no more special device for VFIO PHB
> 
> v5:
> * total rework
> * enabled for machines >2.3
> * fixed migration
> * merged rtas handlers here
> 
> v4:
> * reset handler is back in generalized form
> 
> v3:
> * removed reset
> * windows_num is now 1 or bigger rather than 0-based value and it is only
> changed in PHB code, not in RTAS
> * added page mask check in create()
> * added SPAPR_PCI_DDW_MAX_WINDOWS to track how many windows are already
> created
> 
> v2:
> * tested on hacked emulated E1000
> * implemented DDW reset on the PHB reset
> * spapr_pci_ddw_remove/spapr_pci_ddw_reset are public for reuse by VFIO
> ---
>  hw/ppc/Makefile.objs        |   3 +
>  hw/ppc/spapr.c              |   5 +
>  hw/ppc/spapr_iommu.c        |  32 ++++-
>  hw/ppc/spapr_pci.c          | 110 ++++++++++++++--
>  hw/ppc/spapr_pci_vfio.c     |  88 +++++++++++++
>  hw/ppc/spapr_rtas_ddw.c     | 300 ++++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/common.c            |   2 +
>  include/hw/pci-host/spapr.h |  21 +++-
>  include/hw/ppc/spapr.h      |  17 ++-
>  trace-events                |   6 +
>  10 files changed, 568 insertions(+), 16 deletions(-)
>  create mode 100644 hw/ppc/spapr_rtas_ddw.c
> 
> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> index c8ab06e..0b2ff6d 100644
> --- a/hw/ppc/Makefile.objs
> +++ b/hw/ppc/Makefile.objs
> @@ -7,6 +7,9 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o
>  ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
>  obj-y += spapr_pci_vfio.o
>  endif
> +ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES), yy)
> +obj-y += spapr_rtas_ddw.o
> +endif
>  # PowerPC 4xx boards
>  obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
>  obj-y += ppc4xx_pci.o
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 5ca817c..d50d50b 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -1860,6 +1860,11 @@ static const TypeInfo spapr_machine_info = {
>              .driver   = "spapr-pci-host-bridge",\
>              .property = "dynamic-reconfiguration",\
>              .value    = "off",\
> +        },\
> +        {\
> +            .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
> +            .property = "ddw",\
> +            .value    = stringify(off),\
>          },
>  
>  #define SPAPR_COMPAT_2_2 \
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index 2d99c3b..b54c3d8 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -136,6 +136,15 @@ static IOMMUTLBEntry spapr_tce_translate_iommu(MemoryRegion *iommu, hwaddr addr,
>      return ret;
>  }
>  
> +static void spapr_tce_table_pre_save(void *opaque)
> +{
> +    sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
> +
> +    tcet->migtable = tcet->table;
> +}
> +
> +static void spapr_tce_table_do_enable(sPAPRTCETable *tcet, bool vfio_accel);
> +
>  static int spapr_tce_table_post_load(void *opaque, int version_id)
>  {
>      sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
> @@ -144,22 +153,39 @@ static int spapr_tce_table_post_load(void *opaque, int version_id)
>          spapr_vio_set_bypass(tcet->vdev, tcet->bypass);
>      }
>  
> +    if (tcet->enabled) {
> +        if (!tcet->table) {
> +            tcet->enabled = false;
> +            /* VFIO does not migrate so pass vfio_accel == false */
> +            spapr_tce_table_do_enable(tcet, false);
> +        }
> +        memcpy(tcet->table, tcet->migtable,
> +               tcet->nb_table * sizeof(tcet->table[0]));
> +        free(tcet->migtable);
> +        tcet->migtable = NULL;
> +    }
> +
>      return 0;
>  }
>  
>  static const VMStateDescription vmstate_spapr_tce_table = {
>      .name = "spapr_iommu",
> -    .version_id = 2,
> +    .version_id = 3,
>      .minimum_version_id = 2,
> +    .pre_save = spapr_tce_table_pre_save,
>      .post_load = spapr_tce_table_post_load,
>      .fields      = (VMStateField []) {
>          /* Sanity check */
>          VMSTATE_UINT32_EQUAL(liobn, sPAPRTCETable),
> -        VMSTATE_UINT32_EQUAL(nb_table, sPAPRTCETable),
>  
>          /* IOMMU state */
> +        VMSTATE_BOOL_V(enabled, sPAPRTCETable, 3),
> +        VMSTATE_UINT64_V(bus_offset, sPAPRTCETable, 3),
> +        VMSTATE_UINT32_V(page_shift, sPAPRTCETable, 3),
> +        VMSTATE_UINT32(nb_table, sPAPRTCETable),
>          VMSTATE_BOOL(bypass, sPAPRTCETable),
> -        VMSTATE_VARRAY_UINT32(table, sPAPRTCETable, nb_table, 0, vmstate_info_uint64, uint64_t),
> +        VMSTATE_VARRAY_UINT32_ALLOC(migtable, sPAPRTCETable, nb_table, 0,
> +                                    vmstate_info_uint64, uint64_t),
>  
>          VMSTATE_END_OF_LIST()
>      },
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index d1fa157..b7113b5 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -778,6 +778,9 @@ static int spapr_phb_dma_capabilities_update(sPAPRPHBState *sphb)
>  
>      sphb->dma32_window_start = 0;
>      sphb->dma32_window_size = SPAPR_PCI_DMA32_SIZE;
> +    sphb->windows_supported = SPAPR_PCI_DMA_MAX_WINDOWS;
> +    sphb->page_size_mask = (1ULL << 12) | (1ULL << 16) | (1ULL << 24);
> +    sphb->dma64_window_size = pow2ceil(ram_size);
>  
>      ret = spapr_phb_vfio_dma_capabilities_update(sphb);
>      sphb->has_vfio = (ret == 0);
> @@ -785,12 +788,35 @@ static int spapr_phb_dma_capabilities_update(sPAPRPHBState *sphb)
>      return 0;
>  }
>  
> -static int spapr_phb_dma_init_window(sPAPRPHBState *sphb,
> -                                     uint32_t liobn, uint32_t page_shift,
> -                                     uint64_t window_size)
> +int spapr_phb_dma_init_window(sPAPRPHBState *sphb,
> +                              uint32_t liobn, uint32_t page_shift,
> +                              uint64_t window_size)
>  {
>      uint64_t bus_offset = sphb->dma32_window_start;
>      sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
> +    int ret;
> +
> +    if (SPAPR_PCI_DMA_WINDOW_NUM(liobn) && !sphb->ddw_enabled) {
> +        return -1;
> +    }
> +
> +    if (sphb->ddw_enabled) {
> +        if (sphb->has_vfio) {
> +            ret = spapr_phb_vfio_dma_init_window(sphb,
> +                                                 page_shift, window_size,
> +                                                 &bus_offset);
> +            if (ret) {
> +                return ret;
> +            }
> +        } else if (SPAPR_PCI_DMA_WINDOW_NUM(liobn)) {
> +            /*
> +             * There is no VFIO so we choose a huge window address.
> +             * If VFIO is added later, spapr_phb_dma_update() will fail
> +             * and cause hotplug failure.
> +             */
> +            bus_offset = sphb->dma64_window_start;
> +        }
> +    }
>  
>      spapr_tce_table_enable(tcet, bus_offset, page_shift,
>                             window_size >> page_shift,
> @@ -802,9 +828,14 @@ static int spapr_phb_dma_init_window(sPAPRPHBState *sphb,
>  int spapr_phb_dma_remove_window(sPAPRPHBState *sphb,
>                                  sPAPRTCETable *tcet)
>  {
> +    int ret = 0;
> +
> +    if (sphb->has_vfio && sphb->ddw_enabled) {
> +        ret = spapr_phb_vfio_dma_remove_window(sphb, tcet);
> +    }
>      spapr_tce_table_disable(tcet);
>  
> -    return 0;
> +    return ret;
>  }
>  
>  int spapr_phb_dma_reset(sPAPRPHBState *sphb)
> @@ -832,15 +863,46 @@ static int spapr_phb_hotplug_dma_sync(sPAPRPHBState *sphb)
>      int ret = 0, i;
>      bool had_vfio = sphb->has_vfio;
>      sPAPRTCETable *tcet;
> +    uint64_t bus_offset = 0;
>  
>      spapr_phb_dma_capabilities_update(sphb);
>  
> +    /*
> +     * PHB got first VFIO device or lost last VFIO device;
> +     * If it is the last VFIO device, we do not need windows anymore so
> +     * remove them.
> +     * If it is the first VFIO device, we have to remove them as
> +     * we cannot request a specific window from the host kernel so we
> +     * remove all windows and recreate them later if necessary.

Am I right in thinking that there never should be (VFIO enabled)
windows when the first VFIO device is added though?

If you're removing the windows when VFIO devices are removed, then any
windows created while !has_vfio shouldn't result in a window being
requested from the kernel..?

> +     */
> +    if (had_vfio !=  sphb->has_vfio) {
> +        for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
> +            tcet = spapr_tce_find_by_liobn(SPAPR_PCI_LIOBN(sphb->index, i));
> +            if (!tcet) {
> +                continue;
> +            }
> +            spapr_phb_vfio_dma_remove_window(sphb, tcet);
> +        }
> +    }
> +
>      if (!had_vfio && sphb->has_vfio) {
>          for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
>              tcet = spapr_tce_find_by_liobn(SPAPR_PCI_LIOBN(sphb->index, i));
>              if (!tcet || !tcet->enabled) {
>                  continue;
>              }
> +            ret = spapr_phb_vfio_dma_init_window(sphb,
> +                                                 tcet->page_shift,
> +                                                 (uint64_t)tcet->nb_table <<
> +                                                 tcet->page_shift,
> +                                                 &bus_offset);
> +            if (ret) {
> +                break;
> +            }
> +            if (bus_offset != tcet->bus_offset) {
> +                ret = -EFAULT;
> +                break;
> +            }
>              if (tcet->fd >= 0) {
>                  /*
>                   * We got first vfio-pci device on accelerated table.
> @@ -1143,7 +1205,10 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
>              error_setg(errp, "Failed to create pci child device tree node");
>              goto out;
>          }
> -        spapr_phb_hotplug_dma_sync(phb);
> +        if (spapr_phb_hotplug_dma_sync(phb)) {
> +            error_setg(errp, "Failed to create DMA window(s)");
> +            goto out;
> +        }
>      }
>  
>      drck->attach(drc, DEVICE(pdev),
> @@ -1440,15 +1505,17 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>          }
>      }
>  
> -    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
> -    if (!tcet) {
> -            error_setg(errp, "failed to create TCE table");
> +    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
> +        tcet = spapr_tce_new_table(DEVICE(sphb),
> +                                   SPAPR_PCI_LIOBN(sphb->index, i));
> +        if (!tcet) {
> +            error_setg(errp, "spapr_tce_new_table failed");
>              return;
> +        }
> +        memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
> +                                            spapr_tce_get_iommu(tcet), 0);
>      }
>  
> -    memory_region_add_subregion(&sphb->iommu_root, 0,
> -                                spapr_tce_get_iommu(tcet));
> -
>      sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
>  }
>  
> @@ -1486,8 +1553,12 @@ static Property spapr_phb_properties[] = {
>      DEFINE_PROP_UINT64("io_win_addr", sPAPRPHBState, io_win_addr, -1),
>      DEFINE_PROP_UINT64("io_win_size", sPAPRPHBState, io_win_size,
>                         SPAPR_PCI_IO_WIN_SIZE),
> +    DEFINE_PROP_UINT64("dma64_win_addr", sPAPRPHBState, dma64_window_start,
> +                       SPAPR_PCI_DMA64_START),
>      DEFINE_PROP_BOOL("dynamic-reconfiguration", sPAPRPHBState, dr_enabled,
>                       true),
> +    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
> +    DEFINE_PROP_UINT8("levels", sPAPRPHBState, levels, 0),
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> @@ -1746,6 +1817,15 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>      uint32_t interrupt_map_mask[] = {
>          cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
>      uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
> +    uint32_t ddw_applicable[] = {
> +        cpu_to_be32(RTAS_IBM_QUERY_PE_DMA_WINDOW),
> +        cpu_to_be32(RTAS_IBM_CREATE_PE_DMA_WINDOW),
> +        cpu_to_be32(RTAS_IBM_REMOVE_PE_DMA_WINDOW)
> +    };
> +    uint32_t ddw_extensions[] = {
> +        cpu_to_be32(1),
> +        cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
> +    };
>      sPAPRTCETable *tcet;
>      PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;
>      sPAPRFDT s_fdt;
> @@ -1770,6 +1850,14 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
>  
> +    /* Dynamic DMA window */
> +    if (phb->ddw_enabled) {
> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
> +                         sizeof(ddw_applicable)));
> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
> +                         &ddw_extensions, sizeof(ddw_extensions)));
> +    }
> +
>      /* Build the interrupt-map, this must matches what is done
>       * in pci_spapr_map_irq
>       */
> diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
> index fe7d7d1..54089a0 100644
> --- a/hw/ppc/spapr_pci_vfio.c
> +++ b/hw/ppc/spapr_pci_vfio.c
> @@ -22,6 +22,7 @@
>  #include "hw/pci/msix.h"
>  #include "linux/vfio.h"
>  #include "hw/vfio/vfio.h"
> +#include "trace.h"
>  
>  static Property spapr_phb_vfio_properties[] = {
>      DEFINE_PROP_INT32("iommu", sPAPRPHBState, iommugroupid, -1),
> @@ -42,6 +43,93 @@ int spapr_phb_vfio_dma_capabilities_update(sPAPRPHBState *sphb)
>      sphb->dma32_window_start = info.dma32_window_start;
>      sphb->dma32_window_size = info.dma32_window_size;
>  
> +    if (sphb->ddw_enabled && (info.flags & VFIO_IOMMU_SPAPR_INFO_DDW)) {
> +        sphb->windows_supported = info.ddw.max_dynamic_windows_supported;
> +        sphb->page_size_mask = info.ddw.pgsizes;
> +        sphb->dma64_window_size = pow2ceil(ram_size);
> +        sphb->max_levels = info.ddw.levels;
> +    } else {
> +        /* If VFIO_IOMMU_INFO_DDW is not set, disable DDW */
> +        sphb->ddw_enabled = false;
> +    }
> +
> +    return ret;
> +}
> +
> +static int spapr_phb_vfio_levels(uint32_t entries)
> +{
> +    unsigned pages = (entries * sizeof(uint64_t)) / getpagesize();
> +    int levels;
> +
> +    if (pages <= 64) {
> +        levels = 1;
> +    } else if (pages <= 64*64) {
> +        levels = 2;
> +    } else if (pages <= 64*64*64) {
> +        levels = 3;
> +    } else {
> +        levels = 4;
> +    }
> +
> +    return levels;
> +}
> +
> +int spapr_phb_vfio_dma_init_window(sPAPRPHBState *sphb,
> +                                   uint32_t page_shift,
> +                                   uint64_t window_size,
> +                                   uint64_t *bus_offset)
> +{
> +    int ret;
> +    struct vfio_iommu_spapr_tce_create create = {
> +        .argsz = sizeof(create),
> +        .page_shift = page_shift,
> +        .window_size = window_size,
> +        .levels = sphb->levels,
> +        .start_addr = 0,
> +    };
> +
> +    /*
> +     * Dynamic windows are supported, that means that there is no
> +     * pre-created window and we have to create one.
> +     */
> +    if (!create.levels) {
> +        create.levels = spapr_phb_vfio_levels(create.window_size >>
> +                                              page_shift);
> +    }
> +
> +    if (create.levels > sphb->max_levels) {
> +        return -EINVAL;
> +    }
> +
> +    ret = vfio_container_ioctl(&sphb->iommu_as,
> +                               VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
> +    if (ret) {
> +        return ret;
> +    }
> +    *bus_offset = create.start_addr;
> +
> +    trace_spapr_pci_vfio_init_window(page_shift, window_size, *bus_offset);
> +
> +    return 0;
> +}
> +
> +int spapr_phb_vfio_dma_remove_window(sPAPRPHBState *sphb,
> +                                            sPAPRTCETable *tcet)
> +{
> +    struct vfio_iommu_spapr_tce_remove remove = {
> +        .argsz = sizeof(remove),
> +        .start_addr = tcet->bus_offset
> +    };
> +    int ret;
> +
> +    ret = vfio_container_ioctl(&sphb->iommu_as,
> +                               VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    trace_spapr_pci_vfio_remove_window(tcet->bus_offset);
> +
>      return ret;
>  }
>  
> diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
> new file mode 100644
> index 0000000..7539c6a
> --- /dev/null
> +++ b/hw/ppc/spapr_rtas_ddw.c
> @@ -0,0 +1,300 @@
> +/*
> + * QEMU sPAPR Dynamic DMA windows support
> + *
> + * Copyright (c) 2014 Alexey Kardashevskiy, IBM Corporation.
> + *
> + *  This program is free software; you can redistribute it and/or modify
> + *  it under the terms of the GNU General Public License as published by
> + *  the Free Software Foundation; either version 2 of the License,
> + *  or (at your option) any later version.
> + *
> + *  This program is distributed in the hope that it will be useful,
> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + *  GNU General Public License for more details.
> + *
> + *  You should have received a copy of the GNU General Public License
> + *  along with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "qemu/error-report.h"
> +#include "hw/ppc/spapr.h"
> +#include "hw/pci-host/spapr.h"
> +#include "trace.h"
> +
> +static int spapr_phb_get_active_win_num_cb(Object *child, void *opaque)
> +{
> +    sPAPRTCETable *tcet;
> +
> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
> +    if (tcet && tcet->enabled) {
> +        ++*(unsigned *)opaque;
> +    }
> +    return 0;
> +}
> +
> +static unsigned spapr_phb_get_active_win_num(sPAPRPHBState *sphb)
> +{
> +    unsigned ret = 0;
> +
> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_active_win_num_cb, &ret);
> +
> +    return ret;
> +}
> +
> +static int spapr_phb_get_free_liobn_cb(Object *child, void *opaque)
> +{
> +    sPAPRTCETable *tcet;
> +
> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
> +    if (tcet && !tcet->enabled) {
> +        *(uint32_t *)opaque = tcet->liobn;
> +        return 1;
> +    }
> +    return 0;
> +}
> +
> +static unsigned spapr_phb_get_free_liobn(sPAPRPHBState *sphb)
> +{
> +    uint32_t liobn = 0;
> +
> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_free_liobn_cb, &liobn);
> +
> +    return liobn;
> +}
> +
> +static uint32_t spapr_query_mask(struct ppc_one_seg_page_size *sps,
> +                                 uint64_t page_mask)
> +{
> +    int i, j;
> +    uint32_t mask = 0;
> +    const struct { int shift; uint32_t mask; } masks[] = {
> +        { 12, RTAS_DDW_PGSIZE_4K },
> +        { 16, RTAS_DDW_PGSIZE_64K },
> +        { 24, RTAS_DDW_PGSIZE_16M },
> +        { 25, RTAS_DDW_PGSIZE_32M },
> +        { 26, RTAS_DDW_PGSIZE_64M },
> +        { 27, RTAS_DDW_PGSIZE_128M },
> +        { 28, RTAS_DDW_PGSIZE_256M },
> +        { 34, RTAS_DDW_PGSIZE_16G },
> +    };
> +
> +    for (i = 0; i < PPC_PAGE_SIZES_MAX_SZ; i++) {
> +        for (j = 0; j < ARRAY_SIZE(masks); ++j) {
> +            if ((sps[i].page_shift == masks[j].shift) &&
> +                    (page_mask & (1ULL << masks[j].shift))) {
> +                mask |= masks[j].mask;
> +            }
> +        }
> +    }
> +
> +    return mask;
> +}
> +
> +static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
> +                                         sPAPRMachineState *spapr,
> +                                         uint32_t token, uint32_t nargs,
> +                                         target_ulong args,
> +                                         uint32_t nret, target_ulong rets)
> +{
> +    CPUPPCState *env = &cpu->env;
> +    sPAPRPHBState *sphb;
> +    uint64_t buid;
> +    uint32_t avail, addr, pgmask = 0;
> +    unsigned current;
> +
> +    if ((nargs != 3) || (nret != 5)) {
> +        goto param_error_exit;
> +    }
> +
> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> +    addr = rtas_ld(args, 0);
> +    sphb = spapr_pci_find_phb(spapr, buid);
> +    if (!sphb || !sphb->ddw_enabled) {
> +        goto param_error_exit;
> +    }
> +
> +    current = spapr_phb_get_active_win_num(sphb);
> +    avail = (sphb->windows_supported > current) ?
> +            (sphb->windows_supported - current) : 0;
> +
> +    /* Work out supported page masks */
> +    pgmask = spapr_query_mask(env->sps.sps, sphb->page_size_mask);
> +
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +    rtas_st(rets, 1, avail);
> +
> +    /*
> +     * This is "Largest contiguous block of TCEs allocated specifically
> +     * for (that is, are reserved for) this PE".
> +     * Return the maximum number as all RAM was in 4K pages.
> +     */
> +    rtas_st(rets, 2, sphb->dma64_window_size >> SPAPR_TCE_PAGE_SHIFT);
> +    rtas_st(rets, 3, pgmask);
> +    rtas_st(rets, 4, 0); /* DMA migration mask, not supported */
> +
> +    trace_spapr_iommu_ddw_query(buid, addr, avail, sphb->dma64_window_size,
> +                                pgmask);
> +    return;
> +
> +param_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> +}
> +
> +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
> +                                          sPAPRMachineState *spapr,
> +                                          uint32_t token, uint32_t nargs,
> +                                          target_ulong args,
> +                                          uint32_t nret, target_ulong rets)
> +{
> +    sPAPRPHBState *sphb;
> +    sPAPRTCETable *tcet = NULL;
> +    uint32_t addr, page_shift, window_shift, liobn;
> +    uint64_t buid;
> +    long ret;
> +
> +    if ((nargs != 5) || (nret != 4)) {
> +        goto param_error_exit;
> +    }
> +
> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> +    addr = rtas_ld(args, 0);
> +    sphb = spapr_pci_find_phb(spapr, buid);
> +    if (!sphb || !sphb->ddw_enabled) {
> +        goto param_error_exit;
> +    }
> +
> +    page_shift = rtas_ld(args, 3);
> +    window_shift = rtas_ld(args, 4);
> +    liobn = spapr_phb_get_free_liobn(sphb);
> +
> +    if (!liobn || !(sphb->page_size_mask & (1ULL << page_shift))) {
> +        goto hw_error_exit;
> +    }
> +
> +    ret = spapr_phb_dma_init_window(sphb, liobn, page_shift,
> +                                    1ULL << window_shift);
> +    tcet = spapr_tce_find_by_liobn(liobn);
> +    trace_spapr_iommu_ddw_create(buid, addr, 1ULL << page_shift,
> +                                 1ULL << window_shift,
> +                                 tcet ? tcet->bus_offset : 0xbaadf00d,
> +                                 liobn, ret);
> +    if (ret || !tcet) {
> +        goto hw_error_exit;
> +    }
> +
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +    rtas_st(rets, 1, liobn);
> +    rtas_st(rets, 2, tcet->bus_offset >> 32);
> +    rtas_st(rets, 3, tcet->bus_offset & ((uint32_t) -1));
> +
> +    return;
> +
> +hw_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
> +    return;
> +
> +param_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> +}
> +
> +static void rtas_ibm_remove_pe_dma_window(PowerPCCPU *cpu,
> +                                          sPAPRMachineState *spapr,
> +                                          uint32_t token, uint32_t nargs,
> +                                          target_ulong args,
> +                                          uint32_t nret, target_ulong rets)
> +{
> +    sPAPRPHBState *sphb;
> +    sPAPRTCETable *tcet;
> +    uint32_t liobn;
> +    long ret;
> +
> +    if ((nargs != 1) || (nret != 1)) {
> +        goto param_error_exit;
> +    }
> +
> +    liobn = rtas_ld(args, 0);
> +    tcet = spapr_tce_find_by_liobn(liobn);
> +    if (!tcet) {
> +        goto param_error_exit;
> +    }
> +
> +    sphb = SPAPR_PCI_HOST_BRIDGE(OBJECT(tcet)->parent);
> +    if (!sphb || !sphb->ddw_enabled) {
> +        goto param_error_exit;
> +    }
> +
> +    ret = spapr_phb_dma_remove_window(sphb, tcet);
> +    trace_spapr_iommu_ddw_remove(liobn, ret);
> +    if (ret) {
> +        goto hw_error_exit;
> +    }
> +
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +    return;
> +
> +hw_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
> +    return;
> +
> +param_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> +}
> +
> +static void rtas_ibm_reset_pe_dma_window(PowerPCCPU *cpu,
> +                                         sPAPRMachineState *spapr,
> +                                         uint32_t token, uint32_t nargs,
> +                                         target_ulong args,
> +                                         uint32_t nret, target_ulong rets)
> +{
> +    sPAPRPHBState *sphb;
> +    uint64_t buid;
> +    uint32_t addr;
> +    long ret;
> +
> +    if ((nargs != 3) || (nret != 1)) {
> +        goto param_error_exit;
> +    }
> +
> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> +    addr = rtas_ld(args, 0);
> +    sphb = spapr_pci_find_phb(spapr, buid);
> +    if (!sphb || !sphb->ddw_enabled) {
> +        goto param_error_exit;
> +    }
> +
> +    ret = spapr_phb_dma_reset(sphb);
> +    trace_spapr_iommu_ddw_reset(buid, addr, ret);
> +    if (ret) {
> +        goto hw_error_exit;
> +    }
> +
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +
> +    return;
> +
> +hw_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
> +    return;
> +
> +param_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> +}
> +
> +static void spapr_rtas_ddw_init(void)
> +{
> +    spapr_rtas_register(RTAS_IBM_QUERY_PE_DMA_WINDOW,
> +                        "ibm,query-pe-dma-window",
> +                        rtas_ibm_query_pe_dma_window);
> +    spapr_rtas_register(RTAS_IBM_CREATE_PE_DMA_WINDOW,
> +                        "ibm,create-pe-dma-window",
> +                        rtas_ibm_create_pe_dma_window);
> +    spapr_rtas_register(RTAS_IBM_REMOVE_PE_DMA_WINDOW,
> +                        "ibm,remove-pe-dma-window",
> +                        rtas_ibm_remove_pe_dma_window);
> +    spapr_rtas_register(RTAS_IBM_RESET_PE_DMA_WINDOW,
> +                        "ibm,reset-pe-dma-window",
> +                        rtas_ibm_reset_pe_dma_window);
> +}
> +
> +type_init(spapr_rtas_ddw_init)
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 0c7ba8c..b6bbd43 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -1044,6 +1044,8 @@ int vfio_container_ioctl(AddressSpace *as,
>      case VFIO_CHECK_EXTENSION:
>      case VFIO_IOMMU_SPAPR_TCE_GET_INFO:
>      case VFIO_EEH_PE_OP:
> +    case VFIO_IOMMU_SPAPR_TCE_CREATE:
> +    case VFIO_IOMMU_SPAPR_TCE_REMOVE:
>          break;
>      default:
>          /* Return an error on unknown requests */
> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
> index 8b007aa..911fa27 100644
> --- a/include/hw/pci-host/spapr.h
> +++ b/include/hw/pci-host/spapr.h
> @@ -89,6 +89,13 @@ struct sPAPRPHBState {
>      uint32_t dma32_window_size;
>      bool has_vfio;
>      int32_t iommugroupid; /* obsolete */
> +    bool ddw_enabled;
> +    uint32_t windows_supported;
> +    uint64_t page_size_mask;
> +    uint64_t dma64_window_start;
> +    uint64_t dma64_window_size;
> +    uint8_t max_levels;
> +    uint8_t levels;
>  
>      QLIST_ENTRY(sPAPRPHBState) list;
>  };
> @@ -111,7 +118,10 @@ struct sPAPRPHBState {
>  
>  #define SPAPR_PCI_DMA32_SIZE         0x40000000
>  
> -#define SPAPR_PCI_DMA_MAX_WINDOWS    1
> +#define SPAPR_PCI_DMA_MAX_WINDOWS    2
> +
> +/* Default 64bit dynamic window offset */
> +#define SPAPR_PCI_DMA64_START        0x800000000000000ULL
>  
>  static inline qemu_irq spapr_phb_lsi_qirq(struct sPAPRPHBState *phb, int pin)
>  {
> @@ -133,11 +143,20 @@ void spapr_pci_rtas_init(void);
>  sPAPRPHBState *spapr_pci_find_phb(sPAPRMachineState *spapr, uint64_t buid);
>  PCIDevice *spapr_pci_find_dev(sPAPRMachineState *spapr, uint64_t buid,
>                                uint32_t config_addr);
> +int spapr_phb_dma_init_window(sPAPRPHBState *sphb,
> +                              uint32_t liobn, uint32_t page_shift,
> +                              uint64_t window_size);
>  int spapr_phb_dma_remove_window(sPAPRPHBState *sphb,
>                                  sPAPRTCETable *tcet);
>  int spapr_phb_dma_reset(sPAPRPHBState *sphb);
>  
>  int spapr_phb_vfio_dma_capabilities_update(sPAPRPHBState *sphb);
> +int spapr_phb_vfio_dma_init_window(sPAPRPHBState *sphb,
> +                                   uint32_t page_shift,
> +                                   uint64_t window_size,
> +                                   uint64_t *bus_offset);
> +int spapr_phb_vfio_dma_remove_window(sPAPRPHBState *sphb,
> +                                     sPAPRTCETable *tcet);
>  int spapr_phb_vfio_eeh_set_option(sPAPRPHBState *sphb,
>                                    PCIDevice *pdev, int option);
>  int spapr_phb_vfio_eeh_get_state(sPAPRPHBState *sphb, int *state);
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 4645f16..5a58785 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -416,6 +416,16 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
>  #define RTAS_OUT_NOT_SUPPORTED      -3
>  #define RTAS_OUT_NOT_AUTHORIZED     -9002
>  
> +/* DDW pagesize mask values from ibm,query-pe-dma-window */
> +#define RTAS_DDW_PGSIZE_4K       0x01
> +#define RTAS_DDW_PGSIZE_64K      0x02
> +#define RTAS_DDW_PGSIZE_16M      0x04
> +#define RTAS_DDW_PGSIZE_32M      0x08
> +#define RTAS_DDW_PGSIZE_64M      0x10
> +#define RTAS_DDW_PGSIZE_128M     0x20
> +#define RTAS_DDW_PGSIZE_256M     0x40
> +#define RTAS_DDW_PGSIZE_16G      0x80
> +
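
To illustrate how a consumer might use these mask bits: the cover letter says the guest creates a window with the biggest possible page size. A minimal sketch of that selection follows, assuming the bit-to-size mapping implied by the macro names above. The helper ddw_biggest_page_shift is hypothetical and is not what the guest kernel or this patch actually implements.

```c
#include <stddef.h>
#include <stdint.h>

/* Mask bits copied from the hunk above. */
#define RTAS_DDW_PGSIZE_4K       0x01
#define RTAS_DDW_PGSIZE_64K      0x02
#define RTAS_DDW_PGSIZE_16M      0x04
#define RTAS_DDW_PGSIZE_32M      0x08
#define RTAS_DDW_PGSIZE_64M      0x10
#define RTAS_DDW_PGSIZE_128M     0x20
#define RTAS_DDW_PGSIZE_256M     0x40
#define RTAS_DDW_PGSIZE_16G      0x80

/*
 * Hypothetical helper: return the page shift of the largest page size
 * advertised in @mask, or 0 if no known bit is set.
 */
static inline uint32_t ddw_biggest_page_shift(uint32_t mask)
{
    /* Ordered from largest to smallest so the first match wins. */
    static const struct {
        uint32_t bit;
        uint32_t shift;
    } sizes[] = {
        { RTAS_DDW_PGSIZE_16G,  34 },
        { RTAS_DDW_PGSIZE_256M, 28 },
        { RTAS_DDW_PGSIZE_128M, 27 },
        { RTAS_DDW_PGSIZE_64M,  26 },
        { RTAS_DDW_PGSIZE_32M,  25 },
        { RTAS_DDW_PGSIZE_16M,  24 },
        { RTAS_DDW_PGSIZE_64K,  16 },
        { RTAS_DDW_PGSIZE_4K,   12 },
    };
    size_t i;

    for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
        if (mask & sizes[i].bit) {
            return sizes[i].shift;
        }
    }
    return 0;
}
```

With the 4K/64K/16M mask mentioned in the cover letter (0x07) this returns 24, i.e. 16MB pages.
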
>  /* RTAS tokens */
>  #define RTAS_TOKEN_BASE      0x2000
>  
> @@ -457,8 +467,12 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
>  #define RTAS_IBM_SET_SLOT_RESET                 (RTAS_TOKEN_BASE + 0x23)
>  #define RTAS_IBM_CONFIGURE_PE                   (RTAS_TOKEN_BASE + 0x24)
>  #define RTAS_IBM_SLOT_ERROR_DETAIL              (RTAS_TOKEN_BASE + 0x25)
> +#define RTAS_IBM_QUERY_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x26)
> +#define RTAS_IBM_CREATE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x27)
> +#define RTAS_IBM_REMOVE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x28)
> +#define RTAS_IBM_RESET_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x29)
>  
> -#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x26)
> +#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x2A)
>  
>  /* RTAS ibm,get-system-parameter token values */
>  #define RTAS_SYSPARM_SPLPAR_CHARACTERISTICS      20
> @@ -558,6 +572,7 @@ struct sPAPRTCETable {
>      uint64_t bus_offset;
>      uint32_t page_shift;
>      uint64_t *table;
> +    uint64_t *migtable;
>      bool bypass;
>      int fd;
>      MemoryRegion root, iommu;
> diff --git a/trace-events b/trace-events
> index b300e94..a1234dd 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1302,6 +1302,8 @@ spapr_pci_lsi_set(const char *busname, int pin, uint32_t irq) "%s PIN%d IRQ %u"
>  spapr_pci_msi_retry(unsigned config_addr, unsigned req_num, unsigned max_irqs) "Guest device at %x asked %u, have only %u"
>  spapr_pci_dma_update(uint64_t liobn, long ret) "liobn=%"PRIx64" ret=%ld"
>  spapr_pci_dma_realloc_update(uint64_t liobn, long ret) "liobn=%"PRIx64" tcet=%ld"
> +spapr_pci_vfio_init_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x winsize=0x%"PRIx64" offset=0x%"PRIx64
> +spapr_pci_vfio_remove_window(uint64_t off) "offset=%"PRIx64
>  
>  # hw/pci/pci.c
>  pci_update_mappings_del(void *d, uint32_t bus, uint32_t func, uint32_t slot, int bar, uint64_t addr, uint64_t size) "d=%p %02x:%02x.%x %d,%#"PRIx64"+%#"PRIx64
> @@ -1365,6 +1367,10 @@ spapr_iommu_pci_indirect(uint64_t liobn, uint64_t ioba, uint64_t tce, uint64_t i
>  spapr_iommu_pci_stuff(uint64_t liobn, uint64_t ioba, uint64_t tce_value, uint64_t npages, uint64_t ret) "liobn=%"PRIx64" ioba=0x%"PRIx64" tcevalue=0x%"PRIx64" npages=%"PRId64" ret=%"PRId64
>  spapr_iommu_xlate(uint64_t liobn, uint64_t ioba, uint64_t tce, unsigned perm, unsigned pgsize) "liobn=%"PRIx64" 0x%"PRIx64" -> 0x%"PRIx64" perm=%u mask=%x"
>  spapr_iommu_alloc_table(uint64_t liobn, void *table, int fd) "liobn=%"PRIx64" table=%p fd=%d"
> +spapr_iommu_ddw_query(uint64_t buid, uint32_t cfgaddr, unsigned wa, uint64_t win_size, uint32_t pgmask) "buid=%"PRIx64" addr=%"PRIx32", %u windows available, max window size=%"PRIx64", mask=%"PRIx32
> +spapr_iommu_ddw_create(uint64_t buid, uint32_t cfgaddr, unsigned long long pg_size, unsigned long long req_size, uint64_t start, uint32_t liobn, long ret) "buid=%"PRIx64" addr=%"PRIx32", page size=0x%llx, requested=0x%llx, start addr=%"PRIx64", liobn=%"PRIx32", ret = %ld"
> +spapr_iommu_ddw_remove(uint32_t liobn, long ret) "liobn=%"PRIx32", ret = %ld"
> +spapr_iommu_ddw_reset(uint64_t buid, uint32_t cfgaddr, long ret) "buid=%"PRIx64" addr=%"PRIx32", ret = %ld"
>  
>  # hw/ppc/ppc.c
>  ppc_tb_adjust(uint64_t offs1, uint64_t offs2, int64_t diff, int64_t seconds) "adjusted from 0x%"PRIx64" to 0x%"PRIx64", diff %"PRId64" (%"PRId64"s)"

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH qemu v10 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW)
  2015-07-06  2:10 [Qemu-devel] [PATCH qemu v10 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (13 preceding siblings ...)
  2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 14/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) Alexey Kardashevskiy
@ 2015-07-06 11:13 ` David Gibson
  2015-07-06 15:54 ` Thomas Huth
  15 siblings, 0 replies; 71+ messages in thread
From: David Gibson @ 2015-07-06 11:13 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Michael Roth, Alex Williamson, qemu-ppc, qemu-devel, Gavin Shan

On Mon, Jul 06, 2015 at 12:10:56PM +1000, Alexey Kardashevskiy wrote:
> 
> (cut-n-paste from kernel patchset)
> 
> Each Partitionable Endpoint (IOMMU group) has an address range on a PCI bus
> where devices are allowed to do DMA. These ranges are called DMA windows.
> By default, there is a single DMA window, 1 or 2GB big, mapped at zero
> on a PCI bus.
> 
> PAPR defines a DDW RTAS API which allows pseries guests
> to query the hypervisor about DDW support and capabilities (page size mask
> for now). A pseries guest may request additional (to the default)
> DMA windows using this RTAS API.
> The existing pseries Linux guests request an additional window as big as
> the guest RAM and map the entire guest window, which effectively creates
> a direct mapping of the guest memory to a PCI bus.
> 
> This patchset reworks PPC64 IOMMU code and adds necessary structures
> to support big windows.
> 
> Once a Linux guest discovers the presence of DDW, it does:
> 1. query hypervisor about number of available windows and page size masks;
> 2. create a window with the biggest possible page size (today 4K/64K/16M);
> 3. map the entire guest RAM via H_PUT_TCE* hypercalls;
> 4. switch dma_ops to dma_direct_ops on the selected PE.
> 
> Once this is done, H_PUT_TCE is not called anymore for 64bit devices and
> the guest does not waste time on DMA map/unmap operations.
> 
> Note that 32bit devices won't use DDW and will keep using the default
> DMA window so KVM optimizations will be required (to be posted later).
> 
> This patchset adds DDW support for pseries. The required host kernel changes
> are available in current upstream.
> 
> This patchset is based on git://github.com/dgibson/qemu.git spapr-next branch.
> 
> Please comment. Thanks!

I've applied this to my "spapr-dev" branch.  Here's what needs to
happen before I move it into spapr-next (which is what I'll be pushing
to Alex Graf).

 * For you and Gavin to test it to see that DDW and EEH work properly
   together
 * Some word from Alex W on how he wants to go about merging 12-13/14
 * Some indication about who should be merging 2/14
 * Review from at least one more person - I've looked at so many
   versions of the ddw patches I no longer trust that I've got it all
   straight in my head

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH qemu v10 01/14] linux-headers: Update to 4.2-rc1
  2015-07-06  2:10 ` [Qemu-devel] [PATCH qemu v10 01/14] linux-headers: Update to 4.2-rc1 Alexey Kardashevskiy
@ 2015-07-06 11:18   ` Paolo Bonzini
  0 siblings, 0 replies; 71+ messages in thread
From: Paolo Bonzini @ 2015-07-06 11:18 UTC (permalink / raw)
  To: Alexey Kardashevskiy, qemu-devel
  Cc: Michael S. Tsirkin, Michael Roth, Gavin Shan, Alex Williamson,
	qemu-ppc, David Gibson



On 06/07/2015 04:10, Alexey Kardashevskiy wrote:
> This updates linux-headers against master 4.2-rc1 (commit
> d770e558e21961ad6cfdf0ff7df0eb5d7d4f0754). This is the result of
> ./scripts/update-linux-headers.sh work.
> 
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> 
> This is for DDW support on sPAPR.
> ---
>  include/standard-headers/linux/input.h          |  10 +-
>  include/standard-headers/linux/virtio_balloon.h |   1 +
>  include/standard-headers/linux/virtio_gpu.h     |   2 +
>  linux-headers/asm-x86/hyperv.h                  |  11 ++
>  linux-headers/linux/kvm.h                       |   2 +-
>  linux-headers/linux/vfio.h                      | 102 ++++++++++++-
>  linux-headers/linux/virtio_pci.h                | 192 ------------------------
>  7 files changed, 121 insertions(+), 199 deletions(-)
>  delete mode 100644 linux-headers/linux/virtio_pci.h
> 
> diff --git a/include/standard-headers/linux/input.h b/include/standard-headers/linux/input.h
> index b94d365..a459dd2 100644
> --- a/include/standard-headers/linux/input.h
> +++ b/include/standard-headers/linux/input.h
> @@ -367,7 +367,8 @@ struct input_keymap_entry {
>  #define KEY_MSDOS		151
>  #define KEY_COFFEE		152	/* AL Terminal Lock/Screensaver */
>  #define KEY_SCREENLOCK		KEY_COFFEE
> -#define KEY_DIRECTION		153
> +#define KEY_ROTATE_DISPLAY	153	/* Display orientation for e.g. tablets */
> +#define KEY_DIRECTION		KEY_ROTATE_DISPLAY
>  #define KEY_CYCLEWINDOWS	154
>  #define KEY_MAIL		155
>  #define KEY_BOOKMARKS		156	/* AC Bookmarks */
> @@ -700,6 +701,10 @@ struct input_keymap_entry {
>  #define KEY_NUMERIC_9		0x209
>  #define KEY_NUMERIC_STAR	0x20a
>  #define KEY_NUMERIC_POUND	0x20b
> +#define KEY_NUMERIC_A		0x20c	/* Phone key A - HUT Telephony 0xb9 */
> +#define KEY_NUMERIC_B		0x20d
> +#define KEY_NUMERIC_C		0x20e
> +#define KEY_NUMERIC_D		0x20f
>  
>  #define KEY_CAMERA_FOCUS	0x210
>  #define KEY_WPS_BUTTON		0x211	/* WiFi Protected Setup key */
> @@ -971,7 +976,8 @@ struct input_keymap_entry {
>   */
>  #define MT_TOOL_FINGER		0
>  #define MT_TOOL_PEN		1
> -#define MT_TOOL_MAX		1
> +#define MT_TOOL_PALM		2
> +#define MT_TOOL_MAX		2
>  
>  /*
>   * Values describing the status of a force-feedback effect
> diff --git a/include/standard-headers/linux/virtio_balloon.h b/include/standard-headers/linux/virtio_balloon.h
> index 88ada1d..2e2a6dc 100644
> --- a/include/standard-headers/linux/virtio_balloon.h
> +++ b/include/standard-headers/linux/virtio_balloon.h
> @@ -26,6 +26,7 @@
>   * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
>   * SUCH DAMAGE. */
>  #include "standard-headers/linux/types.h"
> +#include "standard-headers/linux/virtio_types.h"
>  #include "standard-headers/linux/virtio_ids.h"
>  #include "standard-headers/linux/virtio_config.h"
>  
> diff --git a/include/standard-headers/linux/virtio_gpu.h b/include/standard-headers/linux/virtio_gpu.h
> index cfcfb46..72ef815 100644
> --- a/include/standard-headers/linux/virtio_gpu.h
> +++ b/include/standard-headers/linux/virtio_gpu.h
> @@ -38,6 +38,8 @@
>  #ifndef VIRTIO_GPU_HW_H
>  #define VIRTIO_GPU_HW_H
>  
> +#include "standard-headers/linux/types.h"
> +
>  enum virtio_gpu_ctrl_type {
>  	VIRTIO_GPU_UNDEFINED = 0,
>  
> diff --git a/linux-headers/asm-x86/hyperv.h b/linux-headers/asm-x86/hyperv.h
> index ce6068d..8fba544 100644
> --- a/linux-headers/asm-x86/hyperv.h
> +++ b/linux-headers/asm-x86/hyperv.h
> @@ -199,6 +199,17 @@
>  #define HV_X64_MSR_STIMER3_CONFIG		0x400000B6
>  #define HV_X64_MSR_STIMER3_COUNT		0x400000B7
>  
> +/* Hyper-V guest crash notification MSR's */
> +#define HV_X64_MSR_CRASH_P0			0x40000100
> +#define HV_X64_MSR_CRASH_P1			0x40000101
> +#define HV_X64_MSR_CRASH_P2			0x40000102
> +#define HV_X64_MSR_CRASH_P3			0x40000103
> +#define HV_X64_MSR_CRASH_P4			0x40000104
> +#define HV_X64_MSR_CRASH_CTL			0x40000105
> +#define HV_X64_MSR_CRASH_CTL_NOTIFY		(1ULL << 63)
> +#define HV_X64_MSR_CRASH_PARAMS		\
> +		(1 + (HV_X64_MSR_CRASH_P4 - HV_X64_MSR_CRASH_P0))
> +
>  #define HV_X64_MSR_HYPERCALL_ENABLE		0x00000001
>  #define HV_X64_MSR_HYPERCALL_PAGE_ADDRESS_SHIFT	12
>  #define HV_X64_MSR_HYPERCALL_PAGE_ADDRESS_MASK	\
> diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
> index fad9e5c..3bac873 100644
> --- a/linux-headers/linux/kvm.h
> +++ b/linux-headers/linux/kvm.h
> @@ -897,7 +897,7 @@ struct kvm_xen_hvm_config {
>   *
>   * KVM_IRQFD_FLAG_RESAMPLE indicates resamplefd is valid and specifies
>   * the irqfd to operate in resampling mode for level triggered interrupt
> - * emlation.  See Documentation/virtual/kvm/api.txt.
> + * emulation.  See Documentation/virtual/kvm/api.txt.
>   */
>  #define KVM_IRQFD_FLAG_RESAMPLE (1 << 1)
>  
> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> index 0508d0b..aa276bc 100644
> --- a/linux-headers/linux/vfio.h
> +++ b/linux-headers/linux/vfio.h
> @@ -36,6 +36,8 @@
>  /* Two-stage IOMMU */
>  #define VFIO_TYPE1_NESTING_IOMMU	6	/* Implies v2 */
>  
> +#define VFIO_SPAPR_TCE_v2_IOMMU		7
> +
>  /*
>   * The IOCTL interface is designed for extensibility by embedding the
>   * structure length (argsz) and flags into structures passed between
> @@ -443,6 +445,23 @@ struct vfio_iommu_type1_dma_unmap {
>  /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
>  
>  /*
> + * The SPAPR TCE DDW info struct provides the information about
> + * the details of Dynamic DMA window capability.
> + *
> + * @pgsizes contains a page size bitmask, 4K/64K/16M are supported.
> + * @max_dynamic_windows_supported tells the maximum number of windows
> + * which the platform can create.
> + * @levels tells the maximum number of levels in multi-level IOMMU tables;
> + * this allows splitting a table into smaller chunks which reduces
> + * the amount of physically contiguous memory required for the table.
> + */
> +struct vfio_iommu_spapr_tce_ddw_info {
> +	__u64 pgsizes;			/* Bitmap of supported page sizes */
> +	__u32 max_dynamic_windows_supported;
> +	__u32 levels;
> +};
> +
> +/*
>   * The SPAPR TCE info struct provides the information about the PCI bus
>   * address ranges available for DMA, these values are programmed into
>   * the hardware so the guest has to know that information.
> @@ -452,14 +471,17 @@ struct vfio_iommu_type1_dma_unmap {
>   * addresses too so the window works as a filter rather than an offset
>   * for IOVA addresses.
>   *
> - * A flag will need to be added if other page sizes are supported,
> - * so as defined here, it is always 4k.
> + * Flags supported:
> + * - VFIO_IOMMU_SPAPR_INFO_DDW: informs the userspace that dynamic DMA windows
> + *   (DDW) support is present. @ddw is only supported when DDW is present.
>   */
>  struct vfio_iommu_spapr_tce_info {
>  	__u32 argsz;
> -	__u32 flags;			/* reserved for future use */
> +	__u32 flags;
> +#define VFIO_IOMMU_SPAPR_INFO_DDW	(1 << 0)	/* DDW supported */
>  	__u32 dma32_window_start;	/* 32 bit window start (bytes) */
>  	__u32 dma32_window_size;	/* 32 bit window size (bytes) */
> +	struct vfio_iommu_spapr_tce_ddw_info ddw;
>  };
>  
>  #define VFIO_IOMMU_SPAPR_TCE_GET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
> @@ -470,12 +492,23 @@ struct vfio_iommu_spapr_tce_info {
>   * - unfreeze IO/DMA for frozen PE;
>   * - read PE state;
>   * - reset PE;
> - * - configure PE.
> + * - configure PE;
> + * - inject EEH error.
>   */
> +struct vfio_eeh_pe_err {
> +	__u32 type;
> +	__u32 func;
> +	__u64 addr;
> +	__u64 mask;
> +};
> +
>  struct vfio_eeh_pe_op {
>  	__u32 argsz;
>  	__u32 flags;
>  	__u32 op;
> +	union {
> +		struct vfio_eeh_pe_err err;
> +	};
>  };
>  
>  #define VFIO_EEH_PE_DISABLE		0	/* Disable EEH functionality */
> @@ -492,9 +525,70 @@ struct vfio_eeh_pe_op {
>  #define VFIO_EEH_PE_RESET_HOT		6	/* Assert hot reset          */
>  #define VFIO_EEH_PE_RESET_FUNDAMENTAL	7	/* Assert fundamental reset  */
>  #define VFIO_EEH_PE_CONFIGURE		8	/* PE configuration          */
> +#define VFIO_EEH_PE_INJECT_ERR		9	/* Inject EEH error          */
>  
>  #define VFIO_EEH_PE_OP			_IO(VFIO_TYPE, VFIO_BASE + 21)
>  
> +/**
> + * VFIO_IOMMU_SPAPR_REGISTER_MEMORY - _IOW(VFIO_TYPE, VFIO_BASE + 17, struct vfio_iommu_spapr_register_memory)
> + *
> + * Registers user space memory where DMA is allowed. It pins
> + * user pages and does the locked memory accounting so
> + * subsequent VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA calls
> + * get faster.
> + */
> +struct vfio_iommu_spapr_register_memory {
> +	__u32	argsz;
> +	__u32	flags;
> +	__u64	vaddr;				/* Process virtual address */
> +	__u64	size;				/* Size of mapping (bytes) */
> +};
> +#define VFIO_IOMMU_SPAPR_REGISTER_MEMORY	_IO(VFIO_TYPE, VFIO_BASE + 17)
> +
> +/**
> + * VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY - _IOW(VFIO_TYPE, VFIO_BASE + 18, struct vfio_iommu_spapr_register_memory)
> + *
> + * Unregisters user space memory registered with
> + * VFIO_IOMMU_SPAPR_REGISTER_MEMORY.
> + * Uses vfio_iommu_spapr_register_memory for parameters.
> + */
> +#define VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY	_IO(VFIO_TYPE, VFIO_BASE + 18)
> +
> +/**
> + * VFIO_IOMMU_SPAPR_TCE_CREATE - _IOWR(VFIO_TYPE, VFIO_BASE + 19, struct vfio_iommu_spapr_tce_create)
> + *
> + * Creates an additional TCE table and programs it (sets a new DMA window)
> + * to every IOMMU group in the container. It receives page shift, window
> + * size and number of levels in the TCE table being created.
> + *
> + * It allocates and returns an offset on a PCI bus of the new DMA window.
> + */
> +struct vfio_iommu_spapr_tce_create {
> +	__u32 argsz;
> +	__u32 flags;
> +	/* in */
> +	__u32 page_shift;
> +	__u64 window_size;
> +	__u32 levels;
> +	/* out */
> +	__u64 start_addr;
> +};
> +#define VFIO_IOMMU_SPAPR_TCE_CREATE	_IO(VFIO_TYPE, VFIO_BASE + 19)
> +
> +/**
> + * VFIO_IOMMU_SPAPR_TCE_REMOVE - _IOW(VFIO_TYPE, VFIO_BASE + 20, struct vfio_iommu_spapr_tce_remove)
> + *
> + * Unprograms a TCE table from all groups in the container and destroys it.
> + * It receives a PCI bus offset as a window id.
> + */
> +struct vfio_iommu_spapr_tce_remove {
> +	__u32 argsz;
> +	__u32 flags;
> +	/* in */
> +	__u64 start_addr;
> +};
> +#define VFIO_IOMMU_SPAPR_TCE_REMOVE	_IO(VFIO_TYPE, VFIO_BASE + 20)
> +
>  /* ***************************************************************** */
>  
>  #endif /* VFIO_H */
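
As a rough guide to what the page_shift and window_size inputs of VFIO_IOMMU_SPAPR_TCE_CREATE imply, here is a minimal sketch of the table size arithmetic. It assumes one 64-bit TCE per IOMMU page and a flat, single-level table. The levels field is not modelled, and both helper names are hypothetical.

```c
#include <stdint.h>

/*
 * Hypothetical helper: number of TCE entries needed to cover
 * @window_size bytes with pages of (1 << @page_shift) bytes.
 * The caller must pass a page_shift below 64.
 */
static inline uint64_t ddw_tce_count(uint32_t page_shift, uint64_t window_size)
{
    return window_size >> page_shift;
}

/*
 * Hypothetical helper: bytes needed for a flat (single-level) table,
 * assuming each TCE is one 64-bit word.
 */
static inline uint64_t ddw_tce_table_bytes(uint32_t page_shift,
                                           uint64_t window_size)
{
    return ddw_tce_count(page_shift, window_size) * sizeof(uint64_t);
}
```

Under these assumptions a 4GB window with 64K pages needs 65536 entries (512KB of table), while the same window with 16MB pages needs only 256 entries (2KB).
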
> diff --git a/linux-headers/linux/virtio_pci.h b/linux-headers/linux/virtio_pci.h
> deleted file mode 100644
> index 92624e5..0000000
> --- a/linux-headers/linux/virtio_pci.h
> +++ /dev/null
> @@ -1,192 +0,0 @@
> -/*
> - * Virtio PCI driver
> - *
> - * This module allows virtio devices to be used over a virtual PCI device.
> - * This can be used with QEMU based VMMs like KVM or Xen.
> - *
> - * Copyright IBM Corp. 2007
> - *
> - * Authors:
> - *  Anthony Liguori  <aliguori@us.ibm.com>
> - *
> - * This header is BSD licensed so anyone can use the definitions to implement
> - * compatible drivers/servers.
> - *
> - * Redistribution and use in source and binary forms, with or without
> - * modification, are permitted provided that the following conditions
> - * are met:
> - * 1. Redistributions of source code must retain the above copyright
> - *    notice, this list of conditions and the following disclaimer.
> - * 2. Redistributions in binary form must reproduce the above copyright
> - *    notice, this list of conditions and the following disclaimer in the
> - *    documentation and/or other materials provided with the distribution.
> - * 3. Neither the name of IBM nor the names of its contributors
> - *    may be used to endorse or promote products derived from this software
> - *    without specific prior written permission.
> - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS'' AND
> - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
> - * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
> - * ARE DISCLAIMED.  IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
> - * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
> - * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
> - * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> - * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
> - * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
> - * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> - * SUCH DAMAGE.
> - */
> -
> -#ifndef _LINUX_VIRTIO_PCI_H
> -#define _LINUX_VIRTIO_PCI_H
> -
> -#include <linux/types.h>
> -
> -#ifndef VIRTIO_PCI_NO_LEGACY
> -
> -/* A 32-bit r/o bitmask of the features supported by the host */
> -#define VIRTIO_PCI_HOST_FEATURES	0
> -
> -/* A 32-bit r/w bitmask of features activated by the guest */
> -#define VIRTIO_PCI_GUEST_FEATURES	4
> -
> -/* A 32-bit r/w PFN for the currently selected queue */
> -#define VIRTIO_PCI_QUEUE_PFN		8
> -
> -/* A 16-bit r/o queue size for the currently selected queue */
> -#define VIRTIO_PCI_QUEUE_NUM		12
> -
> -/* A 16-bit r/w queue selector */
> -#define VIRTIO_PCI_QUEUE_SEL		14
> -
> -/* A 16-bit r/w queue notifier */
> -#define VIRTIO_PCI_QUEUE_NOTIFY		16
> -
> -/* An 8-bit device status register.  */
> -#define VIRTIO_PCI_STATUS		18
> -
> -/* An 8-bit r/o interrupt status register.  Reading the value will return the
> - * current contents of the ISR and will also clear it.  This is effectively
> - * a read-and-acknowledge. */
> -#define VIRTIO_PCI_ISR			19
> -
> -/* MSI-X registers: only enabled if MSI-X is enabled. */
> -/* A 16-bit vector for configuration changes. */
> -#define VIRTIO_MSI_CONFIG_VECTOR        20
> -/* A 16-bit vector for selected queue notifications. */
> -#define VIRTIO_MSI_QUEUE_VECTOR         22
> -
> -/* The remaining space is defined by each driver as the per-driver
> - * configuration space */
> -#define VIRTIO_PCI_CONFIG_OFF(msix_enabled)	((msix_enabled) ? 24 : 20)
> -/* Deprecated: please use VIRTIO_PCI_CONFIG_OFF instead */
> -#define VIRTIO_PCI_CONFIG(dev)	VIRTIO_PCI_CONFIG_OFF((dev)->msix_enabled)
> -
> -/* Virtio ABI version, this must match exactly */
> -#define VIRTIO_PCI_ABI_VERSION		0
> -
> -/* How many bits to shift physical queue address written to QUEUE_PFN.
> - * 12 is historical, and due to x86 page size. */
> -#define VIRTIO_PCI_QUEUE_ADDR_SHIFT	12
> -
> -/* The alignment to use between consumer and producer parts of vring.
> - * x86 pagesize again. */
> -#define VIRTIO_PCI_VRING_ALIGN		4096
> -
> -#endif /* VIRTIO_PCI_NO_LEGACY */
> -
> -/* The bit of the ISR which indicates a device configuration change. */
> -#define VIRTIO_PCI_ISR_CONFIG		0x2
> -/* Vector value used to disable MSI for queue */
> -#define VIRTIO_MSI_NO_VECTOR            0xffff
> -
> -#ifndef VIRTIO_PCI_NO_MODERN
> -
> -/* IDs for different capabilities.  Must all exist. */
> -
> -/* Common configuration */
> -#define VIRTIO_PCI_CAP_COMMON_CFG	1
> -/* Notifications */
> -#define VIRTIO_PCI_CAP_NOTIFY_CFG	2
> -/* ISR access */
> -#define VIRTIO_PCI_CAP_ISR_CFG		3
> -/* Device specific confiuration */
> -#define VIRTIO_PCI_CAP_DEVICE_CFG	4
> -
> -/* This is the PCI capability header: */
> -struct virtio_pci_cap {
> -	__u8 cap_vndr;		/* Generic PCI field: PCI_CAP_ID_VNDR */
> -	__u8 cap_next;		/* Generic PCI field: next ptr. */
> -	__u8 cap_len;		/* Generic PCI field: capability length */
> -	__u8 cfg_type;		/* Identifies the structure. */
> -	__u8 bar;		/* Where to find it. */
> -	__u8 padding[3];	/* Pad to full dword. */
> -	__le32 offset;		/* Offset within bar. */
> -	__le32 length;		/* Length of the structure, in bytes. */
> -};
> -
> -struct virtio_pci_notify_cap {
> -	struct virtio_pci_cap cap;
> -	__le32 notify_off_multiplier;	/* Multiplier for queue_notify_off. */
> -};
> -
> -/* Fields in VIRTIO_PCI_CAP_COMMON_CFG: */
> -struct virtio_pci_common_cfg {
> -	/* About the whole device. */
> -	__le32 device_feature_select;	/* read-write */
> -	__le32 device_feature;		/* read-only */
> -	__le32 guest_feature_select;	/* read-write */
> -	__le32 guest_feature;		/* read-write */
> -	__le16 msix_config;		/* read-write */
> -	__le16 num_queues;		/* read-only */
> -	__u8 device_status;		/* read-write */
> -	__u8 config_generation;		/* read-only */
> -
> -	/* About a specific virtqueue. */
> -	__le16 queue_select;		/* read-write */
> -	__le16 queue_size;		/* read-write, power of 2. */
> -	__le16 queue_msix_vector;	/* read-write */
> -	__le16 queue_enable;		/* read-write */
> -	__le16 queue_notify_off;	/* read-only */
> -	__le32 queue_desc_lo;		/* read-write */
> -	__le32 queue_desc_hi;		/* read-write */
> -	__le32 queue_avail_lo;		/* read-write */
> -	__le32 queue_avail_hi;		/* read-write */
> -	__le32 queue_used_lo;		/* read-write */
> -	__le32 queue_used_hi;		/* read-write */
> -};
> -
> -/* Macro versions of offsets for the Old Timers! */
> -#define VIRTIO_PCI_CAP_VNDR		0
> -#define VIRTIO_PCI_CAP_NEXT		1
> -#define VIRTIO_PCI_CAP_LEN		2
> -#define VIRTIO_PCI_CAP_CFG_TYPE		3
> -#define VIRTIO_PCI_CAP_BAR		4
> -#define VIRTIO_PCI_CAP_OFFSET		8
> -#define VIRTIO_PCI_CAP_LENGTH		12
> -
> -#define VIRTIO_PCI_NOTIFY_CAP_MULT	16
> -
> -
> -#define VIRTIO_PCI_COMMON_DFSELECT	0
> -#define VIRTIO_PCI_COMMON_DF		4
> -#define VIRTIO_PCI_COMMON_GFSELECT	8
> -#define VIRTIO_PCI_COMMON_GF		12
> -#define VIRTIO_PCI_COMMON_MSIX		16
> -#define VIRTIO_PCI_COMMON_NUMQ		18
> -#define VIRTIO_PCI_COMMON_STATUS	20
> -#define VIRTIO_PCI_COMMON_CFGGENERATION	21
> -#define VIRTIO_PCI_COMMON_Q_SELECT	22
> -#define VIRTIO_PCI_COMMON_Q_SIZE	24
> -#define VIRTIO_PCI_COMMON_Q_MSIX	26
> -#define VIRTIO_PCI_COMMON_Q_ENABLE	28
> -#define VIRTIO_PCI_COMMON_Q_NOFF	30
> -#define VIRTIO_PCI_COMMON_Q_DESCLO	32
> -#define VIRTIO_PCI_COMMON_Q_DESCHI	36
> -#define VIRTIO_PCI_COMMON_Q_AVAILLO	40
> -#define VIRTIO_PCI_COMMON_Q_AVAILHI	44
> -#define VIRTIO_PCI_COMMON_Q_USEDLO	48
> -#define VIRTIO_PCI_COMMON_Q_USEDHI	52
> -
> -#endif /* VIRTIO_PCI_NO_MODERN */
> -
> -#endif
> 

I'm adding this patch to my final pull request for 2.4.

Paolo

* Re: [Qemu-devel] [PATCH qemu v10 14/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2015-07-06 11:06   ` David Gibson
@ 2015-07-06 11:27     ` Alexey Kardashevskiy
  2015-07-07  9:46     ` Alexey Kardashevskiy
  1 sibling, 0 replies; 71+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-06 11:27 UTC (permalink / raw)
  To: David Gibson
  Cc: Michael Roth, Alex Williamson, qemu-ppc, qemu-devel, Gavin Shan

On 07/06/2015 09:06 PM, David Gibson wrote:
> On Mon, Jul 06, 2015 at 12:11:10PM +1000, Alexey Kardashevskiy wrote:
>> This adds support for the Dynamic DMA Windows (DDW) option defined by
>> the SPAPR specification, which allows having additional DMA window(s).
>>
>> This implements DDW for emulated and VFIO devices. As all TCE root regions
>> are mapped at 0 and 64bit long (and actual tables are child regions),
>> this replaces memory_region_add_subregion() with _overlap() to make
>> QEMU memory API happy.
>>
>> This reserves RTAS token numbers for DDW calls.
>>
>> This implements helpers to interact with VFIO kernel interface.
>>
>> This changes the TCE table migration descriptor to support dynamic
>> tables as from now on, PHB will create as many stub TCE table objects
>> as PHB can possibly support but not all of them might be initialized at
>> the time of migration because DDW might or might not be requested by
>> the guest.
>>
>> The "ddw" property is enabled by default on a PHB but for compatibility
>> the pseries-2.3 machine and older disable it.
>>
>> This implements DDW for VFIO. The host kernel support is required.
>> This adds a "levels" property to PHB to control the number of levels
>> in the actual TCE table allocated by the host kernel, 0 is the default
>> value to tell QEMU to calculate the correct value. Current hardware
>> supports up to 5 levels.
>>
>> Existing Linux guests try to create one additional huge DMA window
>> with 64K or 16MB pages and map the entire guest RAM to it. If this succeeds,
>> the guest switches to dma_direct_ops and never calls TCE hypercalls
>> (H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
>> and not waste time on map/unmap later. This adds a "dma64_win_addr"
>> property which is a bus address for the 64bit window and by default
>> set to 0x800.0000.0000.0000 as this is what the modern POWER8 hardware
>> uses and this allows having emulated and VFIO devices on the same bus.
>>
>> This adds 4 RTAS handlers:
>> * ibm,query-pe-dma-window
>> * ibm,create-pe-dma-window
>> * ibm,remove-pe-dma-window
>> * ibm,reset-pe-dma-window
>> These are registered from type_init() callback.
>>
>> These RTAS handlers are implemented in a separate file to avoid polluting
>> spapr_iommu.c with PCI.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>> Changes:
>> v10:
>> * added dma64_win_addr property to PHB
>> * removed redundant check for "!migtable" in spapr_tce_table_post_load()
>>
>> v9:
>> * fixed default 64bit window start (from mdroth)
>> * fixed type cast in dma window update code (from mdroth)
>> * spapr_phb_dma_update() now can fail and cause hotplug failure if
>> hardware TCE table cannot be mapped to the same bus address as the emulated one
>>
>> v7:
>> * fixed uninitialized variables
>>
>> v6:
>> * rework as there is no more special device for VFIO PHB
>>
>> v5:
>> * total rework
>> * enabled for machines >2.3
>> * fixed migration
>> * merged rtas handlers here
>>
>> v4:
>> * reset handler is back in generalized form
>>
>> v3:
>> * removed reset
>> * windows_num is now 1 or bigger rather than 0-based value and it is only
>> changed in PHB code, not in RTAS
>> * added page mask check in create()
>> * added SPAPR_PCI_DDW_MAX_WINDOWS to track how many windows are already
>> created
>>
>> v2:
>> * tested on hacked emulated E1000
>> * implemented DDW reset on the PHB reset
>> * spapr_pci_ddw_remove/spapr_pci_ddw_reset are public for reuse by VFIO
>> ---
>>   hw/ppc/Makefile.objs        |   3 +
>>   hw/ppc/spapr.c              |   5 +
>>   hw/ppc/spapr_iommu.c        |  32 ++++-
>>   hw/ppc/spapr_pci.c          | 110 ++++++++++++++--
>>   hw/ppc/spapr_pci_vfio.c     |  88 +++++++++++++
>>   hw/ppc/spapr_rtas_ddw.c     | 300 ++++++++++++++++++++++++++++++++++++++++++++
>>   hw/vfio/common.c            |   2 +
>>   include/hw/pci-host/spapr.h |  21 +++-
>>   include/hw/ppc/spapr.h      |  17 ++-
>>   trace-events                |   6 +
>>   10 files changed, 568 insertions(+), 16 deletions(-)
>>   create mode 100644 hw/ppc/spapr_rtas_ddw.c
>>
>> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
>> index c8ab06e..0b2ff6d 100644
>> --- a/hw/ppc/Makefile.objs
>> +++ b/hw/ppc/Makefile.objs
>> @@ -7,6 +7,9 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o
>>   ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
>>   obj-y += spapr_pci_vfio.o
>>   endif
>> +ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES), yy)
>> +obj-y += spapr_rtas_ddw.o
>> +endif
>>   # PowerPC 4xx boards
>>   obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
>>   obj-y += ppc4xx_pci.o
>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>> index 5ca817c..d50d50b 100644
>> --- a/hw/ppc/spapr.c
>> +++ b/hw/ppc/spapr.c
>> @@ -1860,6 +1860,11 @@ static const TypeInfo spapr_machine_info = {
>>               .driver   = "spapr-pci-host-bridge",\
>>               .property = "dynamic-reconfiguration",\
>>               .value    = "off",\
>> +        },\
>> +        {\
>> +            .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
>> +            .property = "ddw",\
>> +            .value    = stringify(off),\
>>           },
>>
>>   #define SPAPR_COMPAT_2_2 \
>> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
>> index 2d99c3b..b54c3d8 100644
>> --- a/hw/ppc/spapr_iommu.c
>> +++ b/hw/ppc/spapr_iommu.c
>> @@ -136,6 +136,15 @@ static IOMMUTLBEntry spapr_tce_translate_iommu(MemoryRegion *iommu, hwaddr addr,
>>       return ret;
>>   }
>>
>> +static void spapr_tce_table_pre_save(void *opaque)
>> +{
>> +    sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
>> +
>> +    tcet->migtable = tcet->table;
>> +}
>> +
>> +static void spapr_tce_table_do_enable(sPAPRTCETable *tcet, bool vfio_accel);
>> +
>>   static int spapr_tce_table_post_load(void *opaque, int version_id)
>>   {
>>       sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
>> @@ -144,22 +153,39 @@ static int spapr_tce_table_post_load(void *opaque, int version_id)
>>           spapr_vio_set_bypass(tcet->vdev, tcet->bypass);
>>       }
>>
>> +    if (tcet->enabled) {
>> +        if (!tcet->table) {
>> +            tcet->enabled = false;
>> +            /* VFIO does not migrate so pass vfio_accel == false */
>> +            spapr_tce_table_do_enable(tcet, false);
>> +        }
>> +        memcpy(tcet->table, tcet->migtable,
>> +               tcet->nb_table * sizeof(tcet->table[0]));
>> +        free(tcet->migtable);
>> +        tcet->migtable = NULL;
>> +    }
>> +
>>       return 0;
>>   }
>>
>>   static const VMStateDescription vmstate_spapr_tce_table = {
>>       .name = "spapr_iommu",
>> -    .version_id = 2,
>> +    .version_id = 3,
>>       .minimum_version_id = 2,
>> +    .pre_save = spapr_tce_table_pre_save,
>>       .post_load = spapr_tce_table_post_load,
>>       .fields      = (VMStateField []) {
>>           /* Sanity check */
>>           VMSTATE_UINT32_EQUAL(liobn, sPAPRTCETable),
>> -        VMSTATE_UINT32_EQUAL(nb_table, sPAPRTCETable),
>>
>>           /* IOMMU state */
>> +        VMSTATE_BOOL_V(enabled, sPAPRTCETable, 3),
>> +        VMSTATE_UINT64_V(bus_offset, sPAPRTCETable, 3),
>> +        VMSTATE_UINT32_V(page_shift, sPAPRTCETable, 3),
>> +        VMSTATE_UINT32(nb_table, sPAPRTCETable),
>>           VMSTATE_BOOL(bypass, sPAPRTCETable),
>> -        VMSTATE_VARRAY_UINT32(table, sPAPRTCETable, nb_table, 0, vmstate_info_uint64, uint64_t),
>> +        VMSTATE_VARRAY_UINT32_ALLOC(migtable, sPAPRTCETable, nb_table, 0,
>> +                                    vmstate_info_uint64, uint64_t),
>>
>>           VMSTATE_END_OF_LIST()
>>       },
>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>> index d1fa157..b7113b5 100644
>> --- a/hw/ppc/spapr_pci.c
>> +++ b/hw/ppc/spapr_pci.c
>> @@ -778,6 +778,9 @@ static int spapr_phb_dma_capabilities_update(sPAPRPHBState *sphb)
>>
>>       sphb->dma32_window_start = 0;
>>       sphb->dma32_window_size = SPAPR_PCI_DMA32_SIZE;
>> +    sphb->windows_supported = SPAPR_PCI_DMA_MAX_WINDOWS;
>> +    sphb->page_size_mask = (1ULL << 12) | (1ULL << 16) | (1ULL << 24);
>> +    sphb->dma64_window_size = pow2ceil(ram_size);
>>
>>       ret = spapr_phb_vfio_dma_capabilities_update(sphb);
>>       sphb->has_vfio = (ret == 0);
>> @@ -785,12 +788,35 @@ static int spapr_phb_dma_capabilities_update(sPAPRPHBState *sphb)
>>       return 0;
>>   }
>>
>> -static int spapr_phb_dma_init_window(sPAPRPHBState *sphb,
>> -                                     uint32_t liobn, uint32_t page_shift,
>> -                                     uint64_t window_size)
>> +int spapr_phb_dma_init_window(sPAPRPHBState *sphb,
>> +                              uint32_t liobn, uint32_t page_shift,
>> +                              uint64_t window_size)
>>   {
>>       uint64_t bus_offset = sphb->dma32_window_start;
>>       sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
>> +    int ret;
>> +
>> +    if (SPAPR_PCI_DMA_WINDOW_NUM(liobn) && !sphb->ddw_enabled) {
>> +        return -1;
>> +    }
>> +
>> +    if (sphb->ddw_enabled) {
>> +        if (sphb->has_vfio) {
>> +            ret = spapr_phb_vfio_dma_init_window(sphb,
>> +                                                 page_shift, window_size,
>> +                                                 &bus_offset);
>> +            if (ret) {
>> +                return ret;
>> +            }
>> +        } else if (SPAPR_PCI_DMA_WINDOW_NUM(liobn)) {
>> +            /*
>> +             * There is no VFIO so we choose a huge window address.
>> +             * If VFIO is added later, spapr_phb_dma_update() will fail
>> +             * and cause hotplug failure.
>> +             */
>> +            bus_offset = sphb->dma64_window_start;
>> +        }
>> +    }
>>
>>       spapr_tce_table_enable(tcet, bus_offset, page_shift,
>>                              window_size >> page_shift,
>> @@ -802,9 +828,14 @@ static int spapr_phb_dma_init_window(sPAPRPHBState *sphb,
>>   int spapr_phb_dma_remove_window(sPAPRPHBState *sphb,
>>                                   sPAPRTCETable *tcet)
>>   {
>> +    int ret = 0;
>> +
>> +    if (sphb->has_vfio && sphb->ddw_enabled) {
>> +        ret = spapr_phb_vfio_dma_remove_window(sphb, tcet);
>> +    }
>>       spapr_tce_table_disable(tcet);
>>
>> -    return 0;
>> +    return ret;
>>   }
>>
>>   int spapr_phb_dma_reset(sPAPRPHBState *sphb)
>> @@ -832,15 +863,46 @@ static int spapr_phb_hotplug_dma_sync(sPAPRPHBState *sphb)
>>       int ret = 0, i;
>>       bool had_vfio = sphb->has_vfio;
>>       sPAPRTCETable *tcet;
>> +    uint64_t bus_offset = 0;
>>
>>       spapr_phb_dma_capabilities_update(sphb);
>>
>> +    /*
>> +     * PHB got first VFIO device or lost last VFIO device;
>> +     * If it is the last VFIO device, we do not need windows anymore so
>> +     * remove them.
>> +     * If it is the first VFIO device, we have to remove them as
>> +     * we cannot request a specific window from the host kernel so we
>> +     * remove all windows and recreate them later if necessary.
>
> Am I right in thinking that there never should be (VFIO enabled)
> windows when the first VFIO device is added though?
>
> If you're removing the windows when VFIO devices are removed, and any
> windows created while !has_vfio shouldn't result in a window being
> requested from the kernel..?

Yes, I do not need this chunk. I realized that when I posted the patchset :(




-- 
Alexey

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v10 12/14] vfio: Unregister IOMMU notifiers when container is destroyed
  2015-07-06 10:33   ` David Gibson
@ 2015-07-06 12:49     ` Alex Williamson
  2015-07-06 12:59       ` Alexey Kardashevskiy
  0 siblings, 1 reply; 71+ messages in thread
From: Alex Williamson @ 2015-07-06 12:49 UTC (permalink / raw)
  To: David Gibson
  Cc: Alexey Kardashevskiy, Michael Roth, qemu-ppc, qemu-devel, Gavin Shan

On Mon, 2015-07-06 at 20:33 +1000, David Gibson wrote:
> On Mon, Jul 06, 2015 at 12:11:08PM +1000, Alexey Kardashevskiy wrote:
> > On systems with guest visible IOMMU, adding a new memory region onto
> > PCI bus calls vfio_listener_region_add() for every DMA window. This
> > installs a notifier for IOMMU memory regions. The notifier is supposed
> > to be removed by vfio_listener_region_del(), however in the case of mixed
> > PHB (emulated + VFIO devices) when last VFIO device is unplugged and
> > > container gets destroyed, all existing DMA windows stay alive together
> > > with the notifiers, which are on the linked list whose head was in
> > the destroyed container.
> > 
> > This unregisters IOMMU memory region notifier when a container is
> > destroyed.
> > 
> > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> 
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> 
> Alex,
> 
> I think this is correct, but you've probably got a better
> understanding of it.  Will you take this through your tree?

Yes, confusingly this patch was sent twice yesterday, once in this
series and once separately.  AFAICT they're identical, so I'll add your
R-b and add the patch to my pull request for 2.4-rc0.  Thanks,

Alex

> > ---
> > Changes:
> > v10:
> > * new to the patchset
> > ---
> >  hw/vfio/common.c | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > index 89ef37b..8eacfd7 100644
> > --- a/hw/vfio/common.c
> > +++ b/hw/vfio/common.c
> > @@ -772,11 +772,19 @@ static void vfio_disconnect_container(VFIOGroup *group)
> >  
> >      if (QLIST_EMPTY(&container->group_list)) {
> >          VFIOAddressSpace *space = container->space;
> > +        VFIOGuestIOMMU *giommu, *tmp;
> >  
> >          if (container->iommu_data.release) {
> >              container->iommu_data.release(container);
> >          }
> >          QLIST_REMOVE(container, next);
> > +
> > +        QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
> > +            memory_region_unregister_iommu_notifier(&giommu->n);
> > +            QLIST_REMOVE(giommu, giommu_next);
> > +            g_free(giommu);
> > +        }
> > +
> >          trace_vfio_disconnect_container(container->fd);
> >          close(container->fd);
> >          g_free(container);
> 
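
For illustration only, a minimal sketch of the teardown pattern the hunk
above relies on: the next pointer is saved before the current element is
freed, which is what QLIST_FOREACH_SAFE provides. This is not the QEMU
code; struct demo_giommu, struct demo_notifier and demo_release_all()
are hypothetical stand-ins, and clearing "registered" stands in for
memory_region_unregister_iommu_notifier().

```c
#include <stdlib.h>

/* Hypothetical stand-in for the notifier embedded in VFIOGuestIOMMU. */
struct demo_notifier {
    int registered;
};

/* Hypothetical stand-in for VFIOGuestIOMMU on a singly linked list. */
struct demo_giommu {
    struct demo_notifier n;
    struct demo_giommu *next;
};

/*
 * Unregister and free every element, returning how many were released.
 * The next pointer is read before free() so the walk never touches
 * freed memory.
 */
int demo_release_all(struct demo_giommu **head)
{
    struct demo_giommu *g = *head;
    struct demo_giommu *tmp;
    int count = 0;

    while (g) {
        tmp = g->next;          /* save the successor first */
        g->n.registered = 0;    /* stands in for the unregister call */
        free(g);
        g = tmp;
        count++;
    }
    *head = NULL;               /* the list head no longer dangles */
    return count;
}
```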

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v10 12/14] vfio: Unregister IOMMU notifiers when container is destroyed
  2015-07-06 12:49     ` Alex Williamson
@ 2015-07-06 12:59       ` Alexey Kardashevskiy
  2015-07-06 13:45         ` Alex Williamson
  0 siblings, 1 reply; 71+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-06 12:59 UTC (permalink / raw)
  To: Alex Williamson, David Gibson
  Cc: Michael Roth, qemu-ppc, qemu-devel, Gavin Shan

On 07/06/2015 10:49 PM, Alex Williamson wrote:
> On Mon, 2015-07-06 at 20:33 +1000, David Gibson wrote:
>> On Mon, Jul 06, 2015 at 12:11:08PM +1000, Alexey Kardashevskiy wrote:
>>> On systems with guest visible IOMMU, adding a new memory region onto
>>> PCI bus calls vfio_listener_region_add() for every DMA window. This
>>> installs a notifier for IOMMU memory regions. The notifier is supposed
>>> to be removed by vfio_listener_region_del(), however in the case of mixed
>>> PHB (emulated + VFIO devices) when last VFIO device is unplugged and
>>> container gets destroyed, all existing DMA windows stay alive together
>>> with the notifiers, which are on the linked list whose head was in
>>> the destroyed container.
>>>
>>> This unregisters IOMMU memory region notifier when a container is
>>> destroyed.
>>>
>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>
>> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>>
>> Alex,
>>
>> I think this is correct, but you've probably got a better
>> understanding of it.  Will you take this through your tree?
>
> Yes, confusingly this patch was sent twice yesterday, once in this
> series and once separately.  AFAICT they're identical, so I'll add your
> R-b and add the patch to my pull request for 2.4-rc0.  Thanks,

Yes, these are identical, sorry for the confusion. btw what was the right
thing to do with this patch?


>
> Alex
>
>>> ---
>>> Changes:
>>> v10:
>>> * new to the patchset
>>> ---
>>>   hw/vfio/common.c | 8 ++++++++
>>>   1 file changed, 8 insertions(+)
>>>
>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>> index 89ef37b..8eacfd7 100644
>>> --- a/hw/vfio/common.c
>>> +++ b/hw/vfio/common.c
>>> @@ -772,11 +772,19 @@ static void vfio_disconnect_container(VFIOGroup *group)
>>>
>>>       if (QLIST_EMPTY(&container->group_list)) {
>>>           VFIOAddressSpace *space = container->space;
>>> +        VFIOGuestIOMMU *giommu, *tmp;
>>>
>>>           if (container->iommu_data.release) {
>>>               container->iommu_data.release(container);
>>>           }
>>>           QLIST_REMOVE(container, next);
>>> +
>>> +        QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
>>> +            memory_region_unregister_iommu_notifier(&giommu->n);
>>> +            QLIST_REMOVE(giommu, giommu_next);
>>> +            g_free(giommu);
>>> +        }
>>> +
>>>           trace_vfio_disconnect_container(container->fd);
>>>           close(container->fd);
>>>           g_free(container);
>>
>
>
>


-- 
Alexey

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v10 13/14] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 13/14] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering) Alexey Kardashevskiy
@ 2015-07-06 13:42   ` Alex Williamson
  2015-07-06 15:34     ` Alexey Kardashevskiy
  2015-07-07  7:23   ` Thomas Huth
  1 sibling, 1 reply; 71+ messages in thread
From: Alex Williamson @ 2015-07-06 13:42 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Michael Roth, qemu-ppc, qemu-devel, Gavin Shan, David Gibson

On Mon, 2015-07-06 at 12:11 +1000, Alexey Kardashevskiy wrote:
> This makes use of the new "memory registering" feature. The idea is
> to provide the userspace ability to notify the host kernel about pages
> which are going to be used for DMA. Having this information, the host
> kernel can pin them all once per user process, do locked pages
> accounting (once) and not spend time on doing that in real time with
> possible failures which cannot be handled nicely in some cases.
> 
> This adds a guest RAM memory listener which notifies a VFIO container
> about memory which needs to be pinned/unpinned. VFIO MMIO regions
> (i.e. "skip dump" regions) are skipped.
> 
> The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
> are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
> not call it when v2 is detected and enabled.
> 
> This does not change the guest visible interface.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> Changes:
> v9:
> * since there is no more SPAPR-specific data in container::iommu_data,
> the memory preregistration fields are common and potentially can be used
> by other architectures
> 
> v7:
> * in vfio_spapr_ram_listener_region_del(), do unref() after ioctl()
> * s'ramlistener'register_listener'
> 
> v6:
> * fixed commit log (s/guest/userspace/), added note about no guest visible
> change
> * fixed error checking if ram registration failed
> * added alignment check for section->offset_within_region
> 
> v5:
> * simplified the patch
> * added trace points
> * added round_up() for the size
> * SPAPR IOMMU v2 used
> ---
>  hw/vfio/common.c              | 109 ++++++++++++++++++++++++++++++++++++++----
>  include/hw/vfio/vfio-common.h |   3 ++
>  trace-events                  |   1 +
>  3 files changed, 104 insertions(+), 9 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 8eacfd7..0c7ba8c 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -488,6 +488,76 @@ static void vfio_listener_release(VFIOContainer *container)
>      memory_listener_unregister(&container->iommu_data.type1.listener);
>  }
>  
> +static void vfio_ram_do_region(VFIOContainer *container,
> +                              MemoryRegionSection *section, unsigned long req)
> +{
> +    int ret;
> +    struct vfio_iommu_spapr_register_memory reg = { .argsz = sizeof(reg) };

This function is not as general as the name would imply, it's spapr
specific due to this.  How about vfio_spapr_register_memory() with a
bool parameter toggling register vs unregister so we're not passing an
arbitrary ioctl number?

> +
> +    if (!memory_region_is_ram(section->mr) ||
> +        memory_region_is_skip_dump(section->mr)) {
> +        return;
> +    }
> +
> +    if (unlikely((section->offset_within_region & (getpagesize() - 1)))) {

s/getpagesize()/qemu_real_host_page_size/?
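
As a rough illustration of why the choice of page size matters for this
check (a sketch only, not QEMU code): an offset that is aligned to a 4K
page is not necessarily aligned to a 64K page, so the mask has to be
built from the page size the host actually uses. demo_is_aligned() and
demo_host_page_size() are hypothetical helpers; the latter uses POSIX
sysconf() where QEMU would use its own cached value.

```c
#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

/* True if offset is a multiple of page_size (page_size: power of two). */
bool demo_is_aligned(uint64_t offset, uint64_t page_size)
{
    return (offset & (page_size - 1)) == 0;
}

/* Host page size via POSIX sysconf(); falls back to 4K on failure. */
uint64_t demo_host_page_size(void)
{
    long sz = sysconf(_SC_PAGESIZE);

    return sz > 0 ? (uint64_t)sz : 4096;
}
```

With these helpers, demo_is_aligned(4096, 4096) holds while
demo_is_aligned(4096, 65536) does not.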

> +        error_report("%s received unaligned region", __func__);
> +        return;
> +    }
> +
> +    reg.vaddr = (__u64) memory_region_get_ram_ptr(section->mr) +
> +        section->offset_within_region;
> +    reg.size = ROUND_UP(int128_get64(section->size), TARGET_PAGE_SIZE);
> +
> +    ret = ioctl(container->fd, req, &reg);
> +    trace_vfio_ram_register(_IOC_NR(req) - VFIO_BASE, reg.vaddr, reg.size,
> +            ret ? -errno : 0);
> +    if (!ret) {
> +        return;
> +    }
> +
> +    /*
> +     * On the initfn path, store the first error in the container so we
> +     * can gracefully fail.  Runtime, there's not much we can do other
> +     * than throw a hardware error.
> +     */
> +    if (!container->iommu_data.ram_reg_initialized) {
> +        if (!container->iommu_data.ram_reg_error) {
> +            container->iommu_data.ram_reg_error = -errno;
> +        }
> +    } else {
> +        hw_error("vfio: RAM registering failed, unable to continue");
> +    }

I'd rather see:

if (ret) {
  if (!container...) {
    ...
  } else {
    ...
  }
}

Exiting early on success and otherwise falling into error handling is a
strange code flow.
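
To make the two suggestions above concrete, here is a self-contained
sketch combining the bool register/unregister parameter with the
"if (ret) { ... }" flow. It is not a drop-in replacement: struct
demo_container, demo_ioctl() and demo_spapr_register_memory() are
hypothetical stand-ins, the ioctl is simulated, and a counter replaces
hw_error().

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the relevant VFIOContainer fields. */
struct demo_container {
    int ram_reg_error;          /* first error seen on the init path */
    bool ram_reg_initialized;   /* true once initialization finished */
    int fatal_errors;           /* counts what would be hw_error() calls */
};

/* Simulated ioctl: returns 0, or -1 with errno set to fail_with. */
static int demo_ioctl(bool do_register, int fail_with)
{
    (void)do_register;
    if (fail_with) {
        errno = fail_with;
        return -1;
    }
    return 0;
}

/* Register (true) or unregister (false); errors are handled in one branch. */
void demo_spapr_register_memory(struct demo_container *c,
                                bool do_register, int fail_with)
{
    int ret = demo_ioctl(do_register, fail_with);

    if (ret) {
        if (!c->ram_reg_initialized) {
            /* Init path: remember only the first error. */
            if (!c->ram_reg_error) {
                c->ram_reg_error = -errno;
            }
        } else {
            /* Runtime path: nothing sensible left to do. */
            fprintf(stderr, "RAM %s failed\n",
                    do_register ? "registering" : "unregistering");
            c->fatal_errors++;
        }
    }
}
```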

> +}
> +
> +static void vfio_ram_listener_region_add(MemoryListener *listener,
> +                                         MemoryRegionSection *section)
> +{
> +    VFIOContainer *container = container_of(listener, VFIOContainer,
> +                                            iommu_data.register_listener);
> +    memory_region_ref(section->mr);
> +    vfio_ram_do_region(container, section, VFIO_IOMMU_SPAPR_REGISTER_MEMORY);

vfio_spapr_register_memory(container, section, true);

> +}
> +
> +static void vfio_ram_listener_region_del(MemoryListener *listener,
> +                                         MemoryRegionSection *section)
> +{
> +    VFIOContainer *container = container_of(listener, VFIOContainer,
> +                                            iommu_data.register_listener);
> +    vfio_ram_do_region(container, section, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY);

vfio_spapr_register_memory(container, section, false);

> +    memory_region_unref(section->mr);
> +}
> +
> +static const MemoryListener vfio_ram_memory_listener = {
> +    .region_add = vfio_ram_listener_region_add,
> +    .region_del = vfio_ram_listener_region_del,
> +};

These are all spapr specific, please reflect that in the name;
vfio_spapr_v2_memory_listener, vfio_spapr_v2_listener_add/del.

Actually, can't we determine what type of IOMMU we have and make the
existing MemoryListener handle either type1 or spapr or spapr-v2?

> +
> +static void vfio_spapr_listener_release_v2(VFIOContainer *container)
> +{
> +    memory_listener_unregister(&container->iommu_data.register_listener);
> +    vfio_listener_release(container);
> +}
> +
>  int vfio_mmap_region(Object *obj, VFIORegion *region,
>                       MemoryRegion *mem, MemoryRegion *submem,
>                       void **map, size_t size, off_t offset,
> @@ -698,14 +768,18 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>  
>          container->iommu_data.type1.initialized = true;
>  
> -    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
> +               ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
> +        bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU);
> +
>          ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
>          if (ret) {
>              error_report("vfio: failed to set group container: %m");
>              ret = -errno;
>              goto free_container_exit;
>          }
> -        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
> +        ret = ioctl(fd, VFIO_SET_IOMMU,
> +                v2 ? VFIO_SPAPR_TCE_v2_IOMMU : VFIO_SPAPR_TCE_IOMMU);
>          if (ret) {
>              error_report("vfio: failed to set iommu for container: %m");
>              ret = -errno;
> @@ -717,19 +791,36 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>           * when container fd is closed so we do not call it explicitly
>           * in this file.
>           */
> -        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> -        if (ret) {
> -            error_report("vfio: failed to enable container: %m");
> -            ret = -errno;
> -            goto free_container_exit;
> +        if (!v2) {
> +            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> +            if (ret) {
> +                error_report("vfio: failed to enable container: %m");
> +                ret = -errno;
> +                goto free_container_exit;
> +            }
>          }
>  
>          container->iommu_data.type1.listener = vfio_memory_listener;
> -        container->iommu_data.release = vfio_listener_release;
> -
>          memory_listener_register(&container->iommu_data.type1.listener,
>                                   container->space->as);
>  
> +        if (!v2) {
> +            container->iommu_data.release = vfio_listener_release;
> +        } else {
> +            container->iommu_data.release = vfio_spapr_listener_release_v2;
> +            container->iommu_data.register_listener =
> +                    vfio_ram_memory_listener;
> +            memory_listener_register(&container->iommu_data.register_listener,
> +                                     &address_space_memory);
> +
> +            if (container->iommu_data.ram_reg_error) {
> +                error_report("vfio: RAM memory listener initialization failed for container");
> +                goto listener_release_exit;
> +            }
> +
> +            container->iommu_data.ram_reg_initialized = true;
> +        }
> +
>      } else {
>          error_report("vfio: No available IOMMU models");
>          ret = -EINVAL;
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 59a321d..b132248 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -79,6 +79,9 @@ typedef struct VFIOContainer {
>              VFIOType1 type1;
>          };
>          void (*release)(struct VFIOContainer *);
> +        MemoryListener register_listener;
> +        int ram_reg_error;
> +        bool ram_reg_initialized;

Isn't this exactly what the union above is for?
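
One possible shape for that, as a sketch under assumptions rather than a
proposal for the actual layout: a type tag selects which union member is
valid, and the preregistration state lives in its own member. All
demo_* names are hypothetical. Since the v2 path in this patch also uses
the type1 listener, a real v2 member would have to carry whatever it
shares with type1, which this sketch leaves out.

```c
#include <stdbool.h>

/* Hypothetical tag naming the IOMMU backend in use. */
enum demo_iommu_type {
    DEMO_TYPE1,
    DEMO_SPAPR,
    DEMO_SPAPR_V2
};

/* Hypothetical per-backend state. */
struct demo_type1 {
    bool initialized;
};

struct demo_spapr_v2 {
    int ram_reg_error;
    bool ram_reg_initialized;
};

/* Only the member matching "type" may be accessed. */
struct demo_iommu_data {
    enum demo_iommu_type type;
    union {
        struct demo_type1 type1;
        struct demo_spapr_v2 spapr_v2;
    } u;
};

/* Returns the stored registration error, or 0 for other backends. */
int demo_first_error(const struct demo_iommu_data *d)
{
    return d->type == DEMO_SPAPR_V2 ? d->u.spapr_v2.ram_reg_error : 0;
}
```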

>      } iommu_data;
>      QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
>      QLIST_HEAD(, VFIOGroup) group_list;
> diff --git a/trace-events b/trace-events
> index a994019..b300e94 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1584,6 +1584,7 @@ vfio_disconnect_container(int fd) "close container->fd=%d"
>  vfio_put_group(int fd) "close group->fd=%d"
>  vfio_get_device(const char * name, unsigned int flags, unsigned int num_regions, unsigned int num_irqs) "Device %s flags: %u, regions: %u, irqs: %u"
>  vfio_put_base_device(int fd) "close vdev->fd=%d"
> +vfio_ram_register(int req, uint64_t va, uint64_t size, int ret) "req=%d va=%"PRIx64" size=%"PRIx64" ret=%d"
>  
>  # hw/vfio/platform.c
>  vfio_platform_populate_regions(int region_index, unsigned long flag, unsigned long size, int fd, unsigned long offset) "- region %d flags = 0x%lx, size = 0x%lx, fd= %d, offset = 0x%lx"

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v10 12/14] vfio: Unregister IOMMU notifiers when container is destroyed
  2015-07-06 12:59       ` Alexey Kardashevskiy
@ 2015-07-06 13:45         ` Alex Williamson
  0 siblings, 0 replies; 71+ messages in thread
From: Alex Williamson @ 2015-07-06 13:45 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, qemu-ppc, Michael Roth, Gavin Shan, David Gibson

On Mon, 2015-07-06 at 22:59 +1000, Alexey Kardashevskiy wrote:
> On 07/06/2015 10:49 PM, Alex Williamson wrote:
> > On Mon, 2015-07-06 at 20:33 +1000, David Gibson wrote:
> >> On Mon, Jul 06, 2015 at 12:11:08PM +1000, Alexey Kardashevskiy wrote:
> >>> On systems with guest visible IOMMU, adding a new memory region onto
> >>> PCI bus calls vfio_listener_region_add() for every DMA window. This
> >>> installs a notifier for IOMMU memory regions. The notifier is supposed
> >>> to be removed by vfio_listener_region_del(), however in the case of mixed
> >>> PHB (emulated + VFIO devices) when last VFIO device is unplugged and
> >>> container gets destroyed, all existing DMA windows stay alive together
> >>> with the notifiers, which are on the linked list whose head was in
> >>> the destroyed container.
> >>>
> >>> This unregisters IOMMU memory region notifier when a container is
> >>> destroyed.
> >>>
> >>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>
> >> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> >>
> >> Alex,
> >>
> >> I think this is correct, but you've probably got a better
> >> understanding of it.  Will you take this through your tree?
> >
> > Yes, confusingly this patch was sent twice yesterday, once in this
> > series and once separately.  AFAICT they're identical, so I'll add your
> > R-b and add the patch to my pull request for 2.4-rc0.  Thanks,
> 
> Yes, these are identical, sorry for the confusion. btw what was the right
> thing to do with this patch?

The patch stands on its own and doesn't conflict or contribute
specifically to this series.  It should be left on its own.  The usual
practice should be to separate patches for different subsystems into
series that stand on their own, not to bundle everything together across
subsystems for convenience.  Thanks,

Alex

> >>> ---
> >>> Changes:
> >>> v10:
> >>> * new to the patchset
> >>> ---
> >>>   hw/vfio/common.c | 8 ++++++++
> >>>   1 file changed, 8 insertions(+)
> >>>
> >>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >>> index 89ef37b..8eacfd7 100644
> >>> --- a/hw/vfio/common.c
> >>> +++ b/hw/vfio/common.c
> >>> @@ -772,11 +772,19 @@ static void vfio_disconnect_container(VFIOGroup *group)
> >>>
> >>>       if (QLIST_EMPTY(&container->group_list)) {
> >>>           VFIOAddressSpace *space = container->space;
> >>> +        VFIOGuestIOMMU *giommu, *tmp;
> >>>
> >>>           if (container->iommu_data.release) {
> >>>               container->iommu_data.release(container);
> >>>           }
> >>>           QLIST_REMOVE(container, next);
> >>> +
> >>> +        QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
> >>> +            memory_region_unregister_iommu_notifier(&giommu->n);
> >>> +            QLIST_REMOVE(giommu, giommu_next);
> >>> +            g_free(giommu);
> >>> +        }
> >>> +
> >>>           trace_vfio_disconnect_container(container->fd);
> >>>           close(container->fd);
> >>>           g_free(container);
> >>
> >
> >
> >
> 
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v10 02/14] vmstate: Define VARRAY with VMS_ALLOC
  2015-07-06  2:10 ` [Qemu-devel] [PATCH qemu v10 02/14] vmstate: Define VARRAY with VMS_ALLOC Alexey Kardashevskiy
@ 2015-07-06 14:21   ` Thomas Huth
  0 siblings, 0 replies; 71+ messages in thread
From: Thomas Huth @ 2015-07-06 14:21 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Michael Roth, qemu-devel, Gavin Shan, Alex Williamson, qemu-ppc,
	David Gibson

On Mon,  6 Jul 2015 12:10:58 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> This allows dynamic allocation for migrating arrays.
> 
> Already existing VMSTATE_VARRAY_UINT32 requires an array to be
> pre-allocated, however there are cases when the size is not known in
> advance and there is no real need to enforce it.
> 
> This defines another variant of VMSTATE_VARRAY_UINT32 with VMS_ALLOC
> flag which tells the receiving side to allocate memory for the array
> before receiving the data.
> 
> The first user of it is a dynamic DMA window whose existence and size
> are totally dynamic.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  include/migration/vmstate.h | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
> index 0695d7c..5881d9f 100644
> --- a/include/migration/vmstate.h
> +++ b/include/migration/vmstate.h
> @@ -295,6 +295,16 @@ extern const VMStateInfo vmstate_info_bitmap;
>      .offset     = vmstate_offset_pointer(_state, _field, _type),     \
>  }
>  
> +#define VMSTATE_VARRAY_UINT32_ALLOC(_field, _state, _field_num, _version, _info, _type) {\
> +    .name       = (stringify(_field)),                               \
> +    .version_id = (_version),                                        \
> +    .num_offset = vmstate_offset_value(_state, _field_num, uint32_t),\
> +    .info       = &(_info),                                          \
> +    .size       = sizeof(_type),                                     \
> +    .flags      = VMS_VARRAY_UINT32|VMS_POINTER|VMS_ALLOC,           \
> +    .offset     = vmstate_offset_pointer(_state, _field, _type),     \
> +}
> +
>  #define VMSTATE_VARRAY_UINT16_UNSAFE(_field, _state, _field_num, _version, _info, _type) {\
>      .name       = (stringify(_field)),                               \
>      .version_id = (_version),                                        \

Reviewed-by: Thomas Huth <thuth@redhat.com>
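
For readers unfamiliar with the idea, a simplified sketch of what
"allocate on the receiving side" means. This is not the vmstate
implementation: struct demo_state and demo_load_varray_alloc() are
hypothetical, and the stream layout (a 32-bit count followed by the
64-bit entries, in host byte order) is an assumption made for the
example.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical device state: the array does not exist until loaded. */
struct demo_state {
    uint32_t nb_table;
    uint64_t *table;
};

/*
 * Load a variable-length array from "stream". The element count comes
 * from the stream itself, so the buffer is allocated here rather than
 * by the device before migration starts. Returns 0 on success, -1 on
 * a short stream or allocation failure.
 */
int demo_load_varray_alloc(struct demo_state *s,
                           const uint8_t *stream, size_t len)
{
    uint32_t n;
    uint64_t need;
    uint64_t *t;

    if (len < sizeof(n)) {
        return -1;
    }
    memcpy(&n, stream, sizeof(n));

    need = (uint64_t)n * sizeof(uint64_t);
    if (len - sizeof(n) < need) {
        return -1;
    }

    t = calloc(n ? n : 1, sizeof(uint64_t));
    if (!t) {
        return -1;
    }
    memcpy(t, stream + sizeof(n), (size_t)need);

    free(s->table);     /* drop any previous array */
    s->table = t;
    s->nb_table = n;
    return 0;
}
```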

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v10 04/14] spapr_iommu: Move table allocation to helpers
  2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 04/14] spapr_iommu: Move table allocation to helpers Alexey Kardashevskiy
@ 2015-07-06 15:14   ` Thomas Huth
  2015-07-06 15:43     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 71+ messages in thread
From: Thomas Huth @ 2015-07-06 15:14 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Michael Roth, qemu-devel, Gavin Shan, Alex Williamson, qemu-ppc,
	David Gibson

On Mon,  6 Jul 2015 12:11:00 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> At the moment the presence of vfio-pci devices on a bus affects the way
> the guest view table is allocated. If there is no vfio-pci on a PHB
> and the host kernel supports KVM acceleration of H_PUT_TCE, a table
> is allocated in KVM. However, if there is vfio-pci and we do not yet have
> KVM acceleration for these, the table has to be allocated by
> the userspace. At the moment the table is allocated once at boot time
> but next patches will reallocate it.
> 
> This moves kvmppc_create_spapr_tce/g_malloc0 and their counterparts
> to helpers.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  hw/ppc/spapr_iommu.c | 58 +++++++++++++++++++++++++++++++++++-----------------
>  trace-events         |  2 +-
>  2 files changed, 40 insertions(+), 20 deletions(-)
> 
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index f61504e..0cf5010 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -74,6 +74,37 @@ static IOMMUAccessFlags spapr_tce_iommu_access_flags(uint64_t tce)
>      }
>  }
>  
> +static uint64_t *spapr_tce_alloc_table(uint32_t liobn,
> +                                       uint32_t nb_table,
> +                                       uint32_t page_shift,
> +                                       int *fd,
> +                                       bool vfio_accel)
> +{
> +    uint64_t *table = NULL;
> +    uint64_t window_size = (uint64_t)nb_table << page_shift;
> +
> +    if (kvm_enabled() && !(window_size >> 32)) {
> +        table = kvmppc_create_spapr_tce(liobn, window_size, fd, vfio_accel);
> +    }
> +
> +    if (!table) {
> +        *fd = -1;
> +        table = g_malloc0(nb_table * sizeof(uint64_t));
> +    }
> +
> +    trace_spapr_iommu_alloc_table(liobn, table, *fd);
> +
> +    return table;
> +}
> +
> +static void spapr_tce_free_table(uint64_t *table, int fd, uint32_t nb_table)
> +{
> +    if (!kvm_enabled() ||
> +        (kvmppc_remove_spapr_tce(table, fd, nb_table) != 0)) {
> +        g_free(table);
> +    }
> +}
> +
>  /* Called from RCU critical section */
>  static IOMMUTLBEntry spapr_tce_translate_iommu(MemoryRegion *iommu, hwaddr addr,
>                                                 bool is_write)
> @@ -140,21 +171,13 @@ static MemoryRegionIOMMUOps spapr_iommu_ops = {
>  static int spapr_tce_table_realize(DeviceState *dev)
>  {
>      sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
> -    uint64_t window_size = (uint64_t)tcet->nb_table << tcet->page_shift;
>  
> -    if (kvm_enabled() && !(window_size >> 32)) {
> -        tcet->table = kvmppc_create_spapr_tce(tcet->liobn,
> -                                              window_size,
> -                                              &tcet->fd,
> -                                              tcet->vfio_accel);
> -    }
> -
> -    if (!tcet->table) {
> -        size_t table_size = tcet->nb_table * sizeof(uint64_t);
> -        tcet->table = g_malloc0(table_size);
> -    }
> -
> -    trace_spapr_iommu_new_table(tcet->liobn, tcet, tcet->table, tcet->fd);
> +    tcet->fd = -1;

As far as I can see, spapr_tce_alloc_table() always initializes the fd
to -1 in case the allocation failed, so you can drop the above line, I
think.

> +    tcet->table = spapr_tce_alloc_table(tcet->liobn,
> +                                        tcet->nb_table,
> +                                        tcet->page_shift,
> +                                        &tcet->fd,
> +                                        tcet->vfio_accel);
>  
>      memory_region_init_iommu(&tcet->iommu, OBJECT(dev), &spapr_iommu_ops,
>                               "iommu-spapr",

Apart from the nit above, the patch looks fine to me.

 Thomas


* Re: [Qemu-devel] [PATCH qemu v10 13/14] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2015-07-06 13:42   ` Alex Williamson
@ 2015-07-06 15:34     ` Alexey Kardashevskiy
  2015-07-06 16:13       ` Alex Williamson
  0 siblings, 1 reply; 71+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-06 15:34 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Michael Roth, qemu-ppc, qemu-devel, Gavin Shan, David Gibson

On 07/06/2015 11:42 PM, Alex Williamson wrote:
> On Mon, 2015-07-06 at 12:11 +1000, Alexey Kardashevskiy wrote:
>> This makes use of the new "memory registering" feature. The idea is
>> to provide the userspace ability to notify the host kernel about pages
>> which are going to be used for DMA. Having this information, the host
>> kernel can pin them all once per user process, do locked pages
>> accounting (once) and not spend time on doing that in real time with
>> possible failures which cannot be handled nicely in some cases.
>>
>> This adds a guest RAM memory listener which notifies a VFIO container
>> about memory which needs to be pinned/unpinned. VFIO MMIO regions
>> (i.e. "skip dump" regions) are skipped.
>>
>> The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
>> are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
>> not call it when v2 is detected and enabled.
>>
>> This does not change the guest visible interface.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>> ---
>> Changes:
>> v9:
>> * since there is no more SPAPR-specific data in container::iommu_data,
>> the memory preregistration fields are common and potentially can be used
>> by other architectures
>>
>> v7:
>> * in vfio_spapr_ram_listener_region_del(), do unref() after ioctl()
>> * s'ramlistener'register_listener'
>>
>> v6:
>> * fixed commit log (s/guest/userspace/), added note about no guest visible
>> change
>> * fixed error checking if ram registration failed
>> * added alignment check for section->offset_within_region
>>
>> v5:
>> * simplified the patch
>> * added trace points
>> * added round_up() for the size
>> * SPAPR IOMMU v2 used
>> ---
>>   hw/vfio/common.c              | 109 ++++++++++++++++++++++++++++++++++++++----
>>   include/hw/vfio/vfio-common.h |   3 ++
>>   trace-events                  |   1 +
>>   3 files changed, 104 insertions(+), 9 deletions(-)
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 8eacfd7..0c7ba8c 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -488,6 +488,76 @@ static void vfio_listener_release(VFIOContainer *container)
>>       memory_listener_unregister(&container->iommu_data.type1.listener);
>>   }
>>
>> +static void vfio_ram_do_region(VFIOContainer *container,
>> +                              MemoryRegionSection *section, unsigned long req)
>> +{
>> +    int ret;
>> +    struct vfio_iommu_spapr_register_memory reg = { .argsz = sizeof(reg) };
>
> This function is not as general as the name would imply, it's spapr
> specific due to this.  How about vfio_spapr_register_memory() with a
> bool parameter toggling register vs unregister so we're not passing an
> arbitrary ioctl number?

Ok. Although I am quite often asked not to do such a thing and rather add 2 
helpers (reg/unreg, do/undo, etc) instead and reuse common bits.


>> +
>> +    if (!memory_region_is_ram(section->mr) ||
>> +        memory_region_is_skip_dump(section->mr)) {
>> +        return;
>> +    }
>> +
>> +    if (unlikely((section->offset_within_region & (getpagesize() - 1)))) {
>
> s/getpagesize()/qemu_real_host_page_size/?


Oh, right, I guess it reached upstream now.
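
The alignment test and the size rounding in the quoted hunk are both
power-of-two mask arithmetic. A minimal standalone sketch follows;
demo_is_page_aligned() and demo_round_up() are hypothetical names, and the
page size is passed in explicitly instead of being read from the host.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if offset is a multiple of page_size (page_size must be a power of two). */
static bool demo_is_page_aligned(uint64_t offset, uint64_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return (offset & (page_size - 1)) == 0;
}

/* Round size up to the next multiple of page_size (a power of two). */
static uint64_t demo_round_up(uint64_t size, uint64_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return (size + page_size - 1) & ~(page_size - 1);
}
```

An offset that is aligned for a 4K page is not necessarily aligned for a 64K
page, so the result depends on which page size is passed in.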


>> +        error_report("%s received unaligned region", __func__);
>> +        return;
>> +    }
>> +
>> +    reg.vaddr = (__u64) memory_region_get_ram_ptr(section->mr) +
>> +        section->offset_within_region;
>> +    reg.size = ROUND_UP(int128_get64(section->size), TARGET_PAGE_SIZE);
>> +
>> +    ret = ioctl(container->fd, req, &reg);
>> +    trace_vfio_ram_register(_IOC_NR(req) - VFIO_BASE, reg.vaddr, reg.size,
>> +            ret ? -errno : 0);
>> +    if (!ret) {
>> +        return;
>> +    }
>> +
>> +    /*
>> +     * On the initfn path, store the first error in the container so we
>> +     * can gracefully fail.  Runtime, there's not much we can do other
>> +     * than throw a hardware error.
>> +     */
>> +    if (!container->iommu_data.ram_reg_initialized) {
>> +        if (!container->iommu_data.ram_reg_error) {
>> +            container->iommu_data.ram_reg_error = -errno;
>> +        }
>> +    } else {
>> +        hw_error("vfio: RAM registering failed, unable to continue");
>> +    }
>
> I'd rather see:
>
> if (ret) {
>    if (!container...) {
>      ...
>    } else {
>      ...
>    }
> }
>
> Exiting early on success and otherwise falling into error handling is a
> strange code flow.

Ok... vfio_dma_map() does not follow this rule so I thought it is not that 
strict :)


>> +}
>> +
>> +static void vfio_ram_listener_region_add(MemoryListener *listener,
>> +                                         MemoryRegionSection *section)
>> +{
>> +    VFIOContainer *container = container_of(listener, VFIOContainer,
>> +                                            iommu_data.register_listener);
>> +    memory_region_ref(section->mr);
>> +    vfio_ram_do_region(container, section, VFIO_IOMMU_SPAPR_REGISTER_MEMORY);
>
> vfio_spapr_register_memory(container, section, true);
>
>> +}
>> +
>> +static void vfio_ram_listener_region_del(MemoryListener *listener,
>> +                                         MemoryRegionSection *section)
>> +{
>> +    VFIOContainer *container = container_of(listener, VFIOContainer,
>> +                                            iommu_data.register_listener);
>> +    vfio_ram_do_region(container, section, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY);
>
> vfio_spapr_register_memory(container, section, false);
>
>> +    memory_region_unref(section->mr);
>> +}
>> +
>> +static const MemoryListener vfio_ram_memory_listener = {
>> +    .region_add = vfio_ram_listener_region_add,
>> +    .region_del = vfio_ram_listener_region_del,
>> +};
>
> These are all spapr specific, please reflect that in the name;
> vfio_spapr_v2_memory_listener, vfio_spapr_v2_listener_add/del.

ok.


> Actually, can't we determine what type of IOMMU we have and make the
> existing MemoryListener handle either type1 or spapr or spapr-v2?


Sorry, I do not follow you here. How? The existing listener listens on PCI 
address space (at least, on pseries), new one listens on RAM address space 
(address_space_memory). What am I missing?


>
>> +
>> +static void vfio_spapr_listener_release_v2(VFIOContainer *container)
>> +{
>> +    memory_listener_unregister(&container->iommu_data.register_listener);
>> +    vfio_listener_release(container);
>> +}
>> +
>>   int vfio_mmap_region(Object *obj, VFIORegion *region,
>>                        MemoryRegion *mem, MemoryRegion *submem,
>>                        void **map, size_t size, off_t offset,
>> @@ -698,14 +768,18 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>
>>           container->iommu_data.type1.initialized = true;
>>
>> -    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
>> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
>> +               ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
>> +        bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU);
>> +
>>           ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
>>           if (ret) {
>>               error_report("vfio: failed to set group container: %m");
>>               ret = -errno;
>>               goto free_container_exit;
>>           }
>> -        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
>> +        ret = ioctl(fd, VFIO_SET_IOMMU,
>> +                v2 ? VFIO_SPAPR_TCE_v2_IOMMU : VFIO_SPAPR_TCE_IOMMU);
>>           if (ret) {
>>               error_report("vfio: failed to set iommu for container: %m");
>>               ret = -errno;
>> @@ -717,19 +791,36 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>            * when container fd is closed so we do not call it explicitly
>>            * in this file.
>>            */
>> -        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
>> -        if (ret) {
>> -            error_report("vfio: failed to enable container: %m");
>> -            ret = -errno;
>> -            goto free_container_exit;
>> +        if (!v2) {
>> +            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
>> +            if (ret) {
>> +                error_report("vfio: failed to enable container: %m");
>> +                ret = -errno;
>> +                goto free_container_exit;
>> +            }
>>           }
>>
>>           container->iommu_data.type1.listener = vfio_memory_listener;
>> -        container->iommu_data.release = vfio_listener_release;
>> -
>>           memory_listener_register(&container->iommu_data.type1.listener,
>>                                    container->space->as);
>>
>> +        if (!v2) {
>> +            container->iommu_data.release = vfio_listener_release;
>> +        } else {
>> +            container->iommu_data.release = vfio_spapr_listener_release_v2;
>> +            container->iommu_data.register_listener =
>> +                    vfio_ram_memory_listener;
>> +            memory_listener_register(&container->iommu_data.register_listener,
>> +                                     &address_space_memory);
>> +
>> +            if (container->iommu_data.ram_reg_error) {
>> +                error_report("vfio: RAM memory listener initialization failed for container");
>> +                goto listener_release_exit;
>> +            }
>> +
>> +            container->iommu_data.ram_reg_initialized = true;
>> +        }
>> +
>>       } else {
>>           error_report("vfio: No available IOMMU models");
>>           ret = -EINVAL;
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index 59a321d..b132248 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -79,6 +79,9 @@ typedef struct VFIOContainer {
>>               VFIOType1 type1;
>>           };
>>           void (*release)(struct VFIOContainer *);
>> +        MemoryListener register_listener;
>> +        int ram_reg_error;
>> +        bool ram_reg_initialized;
>
> Isn't this exactly what the union above is for?

This is a different listener on a different address space and I do not
really feel like sharing these _error/_initialized fields between unrelated
listeners. Should I?



>>       } iommu_data;
>>       QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
>>       QLIST_HEAD(, VFIOGroup) group_list;
>> diff --git a/trace-events b/trace-events
>> index a994019..b300e94 100644
>> --- a/trace-events
>> +++ b/trace-events
>> @@ -1584,6 +1584,7 @@ vfio_disconnect_container(int fd) "close container->fd=%d"
>>   vfio_put_group(int fd) "close group->fd=%d"
>>   vfio_get_device(const char * name, unsigned int flags, unsigned int num_regions, unsigned int num_irqs) "Device %s flags: %u, regions: %u, irqs: %u"
>>   vfio_put_base_device(int fd) "close vdev->fd=%d"
>> +vfio_ram_register(int req, uint64_t va, uint64_t size, int ret) "req=%d va=%"PRIx64" size=%"PRIx64" ret=%d"
>>
>>   # hw/vfio/platform.c
>>   vfio_platform_populate_regions(int region_index, unsigned long flag, unsigned long size, int fd, unsigned long offset) "- region %d flags = 0x%lx, size = 0x%lx, fd= %d, offset = 0x%lx"
>
>
>


-- 
Alexey


* Re: [Qemu-devel] [PATCH qemu v10 04/14] spapr_iommu: Move table allocation to helpers
  2015-07-06 15:14   ` Thomas Huth
@ 2015-07-06 15:43     ` Alexey Kardashevskiy
  0 siblings, 0 replies; 71+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-06 15:43 UTC (permalink / raw)
  To: Thomas Huth
  Cc: Michael Roth, qemu-devel, Gavin Shan, Alex Williamson, qemu-ppc,
	David Gibson

On 07/07/2015 01:14 AM, Thomas Huth wrote:
> On Mon,  6 Jul 2015 12:11:00 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>
>> At the moment the presence of vfio-pci devices on a bus affects the way
>> the guest view table is allocated. If there is no vfio-pci on a PHB
>> and the host kernel supports KVM acceleration of H_PUT_TCE, a table
>> is allocated in KVM. However, if there is vfio-pci and we do not yet have
>> KVM acceleration for these, the table has to be allocated by
>> the userspace. At the moment the table is allocated once at boot time
>> but next patches will reallocate it.
>>
>> This moves kvmppc_create_spapr_tce/g_malloc0 and their counterparts
>> to helpers.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>> ---
>>   hw/ppc/spapr_iommu.c | 58 +++++++++++++++++++++++++++++++++++-----------------
>>   trace-events         |  2 +-
>>   2 files changed, 40 insertions(+), 20 deletions(-)
>>
>> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
>> index f61504e..0cf5010 100644
>> --- a/hw/ppc/spapr_iommu.c
>> +++ b/hw/ppc/spapr_iommu.c
>> @@ -74,6 +74,37 @@ static IOMMUAccessFlags spapr_tce_iommu_access_flags(uint64_t tce)
>>       }
>>   }
>>
>> +static uint64_t *spapr_tce_alloc_table(uint32_t liobn,
>> +                                       uint32_t nb_table,
>> +                                       uint32_t page_shift,
>> +                                       int *fd,
>> +                                       bool vfio_accel)
>> +{
>> +    uint64_t *table = NULL;
>> +    uint64_t window_size = (uint64_t)nb_table << page_shift;
>> +
>> +    if (kvm_enabled() && !(window_size >> 32)) {
>> +        table = kvmppc_create_spapr_tce(liobn, window_size, fd, vfio_accel);
>> +    }
>> +
>> +    if (!table) {
>> +        *fd = -1;
>> +        table = g_malloc0(nb_table * sizeof(uint64_t));
>> +    }
>> +
>> +    trace_spapr_iommu_alloc_table(liobn, table, *fd);
>> +
>> +    return table;
>> +}
>> +
>> +static void spapr_tce_free_table(uint64_t *table, int fd, uint32_t nb_table)
>> +{
>> +    if (!kvm_enabled() ||
>> +        (kvmppc_remove_spapr_tce(table, fd, nb_table) != 0)) {
>> +        g_free(table);
>> +    }
>> +}
>> +
>>   /* Called from RCU critical section */
>>   static IOMMUTLBEntry spapr_tce_translate_iommu(MemoryRegion *iommu, hwaddr addr,
>>                                                  bool is_write)
>> @@ -140,21 +171,13 @@ static MemoryRegionIOMMUOps spapr_iommu_ops = {
>>   static int spapr_tce_table_realize(DeviceState *dev)
>>   {
>>       sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
>> -    uint64_t window_size = (uint64_t)tcet->nb_table << tcet->page_shift;
>>
>> -    if (kvm_enabled() && !(window_size >> 32)) {
>> -        tcet->table = kvmppc_create_spapr_tce(tcet->liobn,
>> -                                              window_size,
>> -                                              &tcet->fd,
>> -                                              tcet->vfio_accel);
>> -    }
>> -
>> -    if (!tcet->table) {
>> -        size_t table_size = tcet->nb_table * sizeof(uint64_t);
>> -        tcet->table = g_malloc0(table_size);
>> -    }
>> -
>> -    trace_spapr_iommu_new_table(tcet->liobn, tcet, tcet->table, tcet->fd);
>> +    tcet->fd = -1;
>
> As far as I can see, spapr_tce_alloc_table() always initializes the fd
> to -1 in case the allocation failed, so you can drop the above line, I
> think.

Later in the patchset I remove spapr_tce_alloc_table() so it is safer to 
initialize it here, I guess.
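
The point under discussion can be shown with a small standalone sketch of the
"try the accelerated path, then fall back" pattern. This is not the QEMU
code: demo_accel_alloc() is a hypothetical stand-in for the in-kernel
allocator and is hard-wired to fail so that the fallback path runs.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for an accelerated (in-kernel) allocator.
 * It always fails here so that the fallback path is exercised, and it
 * deliberately leaves *fd untouched on failure.
 */
static uint64_t *demo_accel_alloc(uint32_t nb_table, int *fd)
{
    (void)nb_table;
    (void)fd;
    return NULL;
}

/* Try the accelerated path first, then fall back to a zeroed heap table. */
static uint64_t *demo_alloc_table(uint32_t nb_table, int *fd)
{
    uint64_t *table;

    assert(fd != NULL);
    table = demo_accel_alloc(nb_table, fd);
    if (!table) {
        *fd = -1;   /* no kernel-side table, so no descriptor */
        table = calloc(nb_table, sizeof(uint64_t));
    }
    return table;
}
```

Because the helper resets the descriptor on the fallback path, an extra
assignment in the caller is redundant while the helper exists, but it keeps
the caller correct if the helper is later removed or changed.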


>> +    tcet->table = spapr_tce_alloc_table(tcet->liobn,
>> +                                        tcet->nb_table,
>> +                                        tcet->page_shift,
>> +                                        &tcet->fd,
>> +                                        tcet->vfio_accel);
>>
>>       memory_region_init_iommu(&tcet->iommu, OBJECT(dev), &spapr_iommu_ops,
>>                                "iommu-spapr",
>
> Apart from the nit above, the patch looks fine to me.
>
>   Thomas
>


-- 
Alexey


* Re: [Qemu-devel] [PATCH qemu v10 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW)
  2015-07-06  2:10 [Qemu-devel] [PATCH qemu v10 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (14 preceding siblings ...)
  2015-07-06 11:13 ` [Qemu-devel] [PATCH qemu v10 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) David Gibson
@ 2015-07-06 15:54 ` Thomas Huth
  2015-07-06 16:07   ` Alexey Kardashevskiy
  2015-07-08  4:34   ` David Gibson
  15 siblings, 2 replies; 71+ messages in thread
From: Thomas Huth @ 2015-07-06 15:54 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Michael Roth, qemu-devel, Gavin Shan, Alex Williamson, qemu-ppc,
	David Gibson

On Mon,  6 Jul 2015 12:10:56 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
...
> 
> This patchset adds DDW support for pseries. The host kernel changes are
> required, available in the current upstream.
> 
> This patchset is based on git://github.com/dgibson/qemu.git spapr-next branch.
> 
> Please comment. Thanks!

 Alexey,

I'm sorry, but it looks like this patch set badly fails to link when
compiling for a non-Linux target:

  LINK  ppc64-softmmu/qemu-system-ppc64.exe
hw/ppc/spapr_pci.o: In function `spapr_phb_dma_capabilities_update':
/home/thuth/devel/qemu/hw/ppc/spapr_pci.c:785: undefined reference to `spapr_phb_vfio_dma_capabilities_update'
hw/ppc/spapr_pci.o: In function `rtas_ibm_configure_pe':
/home/thuth/devel/qemu/hw/ppc/spapr_pci.c:601: undefined reference to `spapr_phb_vfio_eeh_configure'
hw/ppc/spapr_pci.o: In function `rtas_ibm_set_slot_reset':
/home/thuth/devel/qemu/hw/ppc/spapr_pci.c:573: undefined reference to `spapr_phb_vfio_eeh_reset'
hw/ppc/spapr_pci.o: In function `rtas_ibm_read_slot_reset_state2':
/home/thuth/devel/qemu/hw/ppc/spapr_pci.c:533: undefined reference to `spapr_phb_vfio_eeh_get_state'
hw/ppc/spapr_pci.o: In function `rtas_ibm_set_eeh_option':
/home/thuth/devel/qemu/hw/ppc/spapr_pci.c:455: undefined reference to `spapr_phb_vfio_eeh_set_option'
hw/ppc/spapr_pci.o: In function `spapr_phb_hotplug_dma_sync':
/home/thuth/devel/qemu/hw/ppc/spapr_pci.c:884: undefined reference to `spapr_phb_vfio_dma_remove_window'
/home/thuth/devel/qemu/hw/ppc/spapr_pci.c:894: undefined reference to `spapr_phb_vfio_dma_init_window'
hw/ppc/spapr_pci.o: In function `spapr_phb_dma_init_window':
/home/thuth/devel/qemu/hw/ppc/spapr_pci.c:805: undefined reference to `spapr_phb_vfio_dma_init_window'
hw/ppc/spapr_pci.o: In function `spapr_phb_dma_remove_window':
/home/thuth/devel/qemu/hw/ppc/spapr_pci.c:834: undefined reference to `spapr_phb_vfio_dma_remove_window'
hw/ppc/spapr_pci.o: In function `spapr_phb_reset':
/home/thuth/devel/qemu/hw/ppc/spapr_pci.c:1538: undefined reference to `spapr_phb_vfio_eeh_reenable'
collect2: error: ld returned 1 exit status

Please make sure that this series also works if either CONFIG_LINUX
or CONFIG_PCI is not enabled!

 Thomas


* Re: [Qemu-devel] [PATCH qemu v10 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW)
  2015-07-06 15:54 ` Thomas Huth
@ 2015-07-06 16:07   ` Alexey Kardashevskiy
  2015-07-06 16:13     ` Thomas Huth
  2015-07-08  4:34   ` David Gibson
  1 sibling, 1 reply; 71+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-06 16:07 UTC (permalink / raw)
  To: Thomas Huth
  Cc: Michael Roth, qemu-devel, Gavin Shan, Alex Williamson, qemu-ppc,
	David Gibson

On 07/07/2015 01:54 AM, Thomas Huth wrote:
> On Mon,  6 Jul 2015 12:10:56 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> ...
>>
>> This patchset adds DDW support for pseries. The host kernel changes are
>> required, available in the current upstream.
>>
>> This patchset is based on git://github.com/dgibson/qemu.git spapr-next branch.
>>
>> Please comment. Thanks!
>
>   Alexey,
>
> I'm sorry, but it looks like this patch set badly fails to link when
> compiling for a non-Linux target:
>
>    LINK  ppc64-softmmu/qemu-system-ppc64.exe
> hw/ppc/spapr_pci.o: In function `spapr_phb_dma_capabilities_update':
> /home/thuth/devel/qemu/hw/ppc/spapr_pci.c:785: undefined reference to `spapr_phb_vfio_dma_capabilities_update'
> hw/ppc/spapr_pci.o: In function `rtas_ibm_configure_pe':
> /home/thuth/devel/qemu/hw/ppc/spapr_pci.c:601: undefined reference to `spapr_phb_vfio_eeh_configure'
> hw/ppc/spapr_pci.o: In function `rtas_ibm_set_slot_reset':
> /home/thuth/devel/qemu/hw/ppc/spapr_pci.c:573: undefined reference to `spapr_phb_vfio_eeh_reset'
> hw/ppc/spapr_pci.o: In function `rtas_ibm_read_slot_reset_state2':
> /home/thuth/devel/qemu/hw/ppc/spapr_pci.c:533: undefined reference to `spapr_phb_vfio_eeh_get_state'
> hw/ppc/spapr_pci.o: In function `rtas_ibm_set_eeh_option':
> /home/thuth/devel/qemu/hw/ppc/spapr_pci.c:455: undefined reference to `spapr_phb_vfio_eeh_set_option'
> hw/ppc/spapr_pci.o: In function `spapr_phb_hotplug_dma_sync':
> /home/thuth/devel/qemu/hw/ppc/spapr_pci.c:884: undefined reference to `spapr_phb_vfio_dma_remove_window'
> /home/thuth/devel/qemu/hw/ppc/spapr_pci.c:894: undefined reference to `spapr_phb_vfio_dma_init_window'
> hw/ppc/spapr_pci.o: In function `spapr_phb_dma_init_window':
> /home/thuth/devel/qemu/hw/ppc/spapr_pci.c:805: undefined reference to `spapr_phb_vfio_dma_init_window'
> hw/ppc/spapr_pci.o: In function `spapr_phb_dma_remove_window':
> /home/thuth/devel/qemu/hw/ppc/spapr_pci.c:834: undefined reference to `spapr_phb_vfio_dma_remove_window'
> hw/ppc/spapr_pci.o: In function `spapr_phb_reset':
> /home/thuth/devel/qemu/hw/ppc/spapr_pci.c:1538: undefined reference to `spapr_phb_vfio_eeh_reenable'
> collect2: error: ld returned 1 exit status
>
> Please make sure that this series also works if either CONFIG_LINUX
> or CONFIG_PCI is not enabled!


Oh. How exactly did you configure qemu to get this?



-- 
Alexey


* Re: [Qemu-devel] [PATCH qemu v10 13/14] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2015-07-06 15:34     ` Alexey Kardashevskiy
@ 2015-07-06 16:13       ` Alex Williamson
  2015-07-07  0:29         ` David Gibson
  2015-07-07 12:11         ` Alexey Kardashevskiy
  0 siblings, 2 replies; 71+ messages in thread
From: Alex Williamson @ 2015-07-06 16:13 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Michael Roth, qemu-ppc, qemu-devel, Gavin Shan, David Gibson

On Tue, 2015-07-07 at 01:34 +1000, Alexey Kardashevskiy wrote:
> On 07/06/2015 11:42 PM, Alex Williamson wrote:
> > On Mon, 2015-07-06 at 12:11 +1000, Alexey Kardashevskiy wrote:
> >> This makes use of the new "memory registering" feature. The idea is
> >> to provide the userspace ability to notify the host kernel about pages
> >> which are going to be used for DMA. Having this information, the host
> >> kernel can pin them all once per user process, do locked pages
> >> accounting (once) and not spend time on doing that in real time with
> >> possible failures which cannot be handled nicely in some cases.
> >>
> >> This adds a guest RAM memory listener which notifies a VFIO container
> >> about memory which needs to be pinned/unpinned. VFIO MMIO regions
> >> (i.e. "skip dump" regions) are skipped.
> >>
> >> The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
> >> are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
> >> not call it when v2 is detected and enabled.
> >>
> >> This does not change the guest visible interface.
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> >> ---
> >> Changes:
> >> v9:
> >> * since there is no more SPAPR-specific data in container::iommu_data,
> >> the memory preregistration fields are common and potentially can be used
> >> by other architectures
> >>
> >> v7:
> >> * in vfio_spapr_ram_listener_region_del(), do unref() after ioctl()
> >> * s'ramlistener'register_listener'
> >>
> >> v6:
> >> * fixed commit log (s/guest/userspace/), added note about no guest visible
> >> change
> >> * fixed error checking if ram registration failed
> >> * added alignment check for section->offset_within_region
> >>
> >> v5:
> >> * simplified the patch
> >> * added trace points
> >> * added round_up() for the size
> >> * SPAPR IOMMU v2 used
> >> ---
> >>   hw/vfio/common.c              | 109 ++++++++++++++++++++++++++++++++++++++----
> >>   include/hw/vfio/vfio-common.h |   3 ++
> >>   trace-events                  |   1 +
> >>   3 files changed, 104 insertions(+), 9 deletions(-)
> >>
> >> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >> index 8eacfd7..0c7ba8c 100644
> >> --- a/hw/vfio/common.c
> >> +++ b/hw/vfio/common.c
> >> @@ -488,6 +488,76 @@ static void vfio_listener_release(VFIOContainer *container)
> >>       memory_listener_unregister(&container->iommu_data.type1.listener);
> >>   }
> >>
> >> +static void vfio_ram_do_region(VFIOContainer *container,
> >> +                              MemoryRegionSection *section, unsigned long req)
> >> +{
> >> +    int ret;
> >> +    struct vfio_iommu_spapr_register_memory reg = { .argsz = sizeof(reg) };
> >
> > This function is not as general as the name would imply, it's spapr
> > specific due to this.  How about vfio_spapr_register_memory() with a
> > bool parameter toggling register vs unregister so we're not passing an
> > arbitrary ioctl number?
> 
> Ok. Although I am quite often asked not to do such a thing and rather add 2 
> helpers (reg/unreg, do/undo, etc) instead and reuse common bits.

I'm not a fan of functions that do the reverse process based on a bool
arg either, but I dislike them less than passing an arbitrary ioctl
number for a parameter.  The former is ugly, but the latter is difficult
to use and difficult to maintain, because an unsupported ioctl being
passed to the function would be hard to spot later.

> >> +
> >> +    if (!memory_region_is_ram(section->mr) ||
> >> +        memory_region_is_skip_dump(section->mr)) {
> >> +        return;
> >> +    }
> >> +
> >> +    if (unlikely((section->offset_within_region & (getpagesize() - 1)))) {
> >
> > s/getpagesize()/qemu_real_host_page_size/?
> 
> 
> Oh, right, I guess it reached upstream now.
> 
> 
> >> +        error_report("%s received unaligned region", __func__);
> >> +        return;
> >> +    }
> >> +
> >> +    reg.vaddr = (__u64) memory_region_get_ram_ptr(section->mr) +
> >> +        section->offset_within_region;
> >> +    reg.size = ROUND_UP(int128_get64(section->size), TARGET_PAGE_SIZE);
> >> +
> >> +    ret = ioctl(container->fd, req, &reg);
> >> +    trace_vfio_ram_register(_IOC_NR(req) - VFIO_BASE, reg.vaddr, reg.size,
> >> +            ret ? -errno : 0);
> >> +    if (!ret) {
> >> +        return;
> >> +    }
> >> +
> >> +    /*
> >> +     * On the initfn path, store the first error in the container so we
> >> +     * can gracefully fail.  Runtime, there's not much we can do other
> >> +     * than throw a hardware error.
> >> +     */
> >> +    if (!container->iommu_data.ram_reg_initialized) {
> >> +        if (!container->iommu_data.ram_reg_error) {
> >> +            container->iommu_data.ram_reg_error = -errno;
> >> +        }
> >> +    } else {
> >> +        hw_error("vfio: RAM registering failed, unable to continue");
> >> +    }
> >
> > I'd rather see:
> >
> > if (ret) {
> >    if (!container...) {
> >      ...
> >    } else {
> >      ...
> >    }
> > }
> >
> > Exiting early on success and otherwise falling into error handling is a
> > strange code flow.
> 
> Ok... vfio_dma_map() does not follow this rule so I thought it is not that 
> strict :)

It would be nice to clean it up there too.

> >> +}
> >> +
> >> +static void vfio_ram_listener_region_add(MemoryListener *listener,
> >> +                                         MemoryRegionSection *section)
> >> +{
> >> +    VFIOContainer *container = container_of(listener, VFIOContainer,
> >> +                                            iommu_data.register_listener);
> >> +    memory_region_ref(section->mr);
> >> +    vfio_ram_do_region(container, section, VFIO_IOMMU_SPAPR_REGISTER_MEMORY);
> >
> > vfio_spapr_register_memory(container, section, true);
> >
> >> +}
> >> +
> >> +static void vfio_ram_listener_region_del(MemoryListener *listener,
> >> +                                         MemoryRegionSection *section)
> >> +{
> >> +    VFIOContainer *container = container_of(listener, VFIOContainer,
> >> +                                            iommu_data.register_listener);
> >> +    vfio_ram_do_region(container, section, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY);
> >
> > vfio_spapr_register_memory(container, section, false);
> >
> >> +    memory_region_unref(section->mr);
> >> +}
> >> +
> >> +static const MemoryListener vfio_ram_memory_listener = {
> >> +    .region_add = vfio_ram_listener_region_add,
> >> +    .region_del = vfio_ram_listener_region_del,
> >> +};
> >
> > These are all spapr specific, please reflect that in the name;
> > vfio_spapr_v2_memory_listener, vfio_spapr_v2_listener_add/del.
> 
> ok.
> 
> 
> > Actually, can't we determine what type of IOMMU we have and make the
> > existing MemoryListener handle either type1 or spapr or spapr-v2?
> 
> 
> Sorry, I do not follow you here. How? The existing listener listens on PCI 
> address space (at least, on pseries), new one listens on RAM address space 
> (address_space_memory). What am I missing?

Isn't that simply a difference of the address space the listener is
attached to?  Type1 maps RAM, spapr-v1 maps guest IOMMU space and these
are already both handled by the same listener.

> >> +
> >> +static void vfio_spapr_listener_release_v2(VFIOContainer *container)
> >> +{
> >> +    memory_listener_unregister(&container->iommu_data.register_listener);
> >> +    vfio_listener_release(container);
> >> +}
> >> +
> >>   int vfio_mmap_region(Object *obj, VFIORegion *region,
> >>                        MemoryRegion *mem, MemoryRegion *submem,
> >>                        void **map, size_t size, off_t offset,
> >> @@ -698,14 +768,18 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> >>
> >>           container->iommu_data.type1.initialized = true;
> >>
> >> -    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
> >> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
> >> +               ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
> >> +        bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU);
> >> +
> >>           ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
> >>           if (ret) {
> >>               error_report("vfio: failed to set group container: %m");
> >>               ret = -errno;
> >>               goto free_container_exit;
> >>           }
> >> -        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
> >> +        ret = ioctl(fd, VFIO_SET_IOMMU,
> >> +                v2 ? VFIO_SPAPR_TCE_v2_IOMMU : VFIO_SPAPR_TCE_IOMMU);
> >>           if (ret) {
> >>               error_report("vfio: failed to set iommu for container: %m");
> >>               ret = -errno;
> >> @@ -717,19 +791,36 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> >>            * when container fd is closed so we do not call it explicitly
> >>            * in this file.
> >>            */
> >> -        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> >> -        if (ret) {
> >> -            error_report("vfio: failed to enable container: %m");
> >> -            ret = -errno;
> >> -            goto free_container_exit;
> >> +        if (!v2) {
> >> +            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> >> +            if (ret) {
> >> +                error_report("vfio: failed to enable container: %m");
> >> +                ret = -errno;
> >> +                goto free_container_exit;
> >> +            }
> >>           }
> >>
> >>           container->iommu_data.type1.listener = vfio_memory_listener;
> >> -        container->iommu_data.release = vfio_listener_release;
> >> -
> >>           memory_listener_register(&container->iommu_data.type1.listener,
> >>                                    container->space->as);
> >>
> >> +        if (!v2) {
> >> +            container->iommu_data.release = vfio_listener_release;
> >> +        } else {
> >> +            container->iommu_data.release = vfio_spapr_listener_release_v2;
> >> +            container->iommu_data.register_listener =
> >> +                    vfio_ram_memory_listener;
> >> +            memory_listener_register(&container->iommu_data.register_listener,
> >> +                                     &address_space_memory);
> >> +
> >> +            if (container->iommu_data.ram_reg_error) {
> >> +                error_report("vfio: RAM memory listener initialization failed for container");
> >> +                goto listener_release_exit;
> >> +            }
> >> +
> >> +            container->iommu_data.ram_reg_initialized = true;
> >> +        }
> >> +
> >>       } else {
> >>           error_report("vfio: No available IOMMU models");
> >>           ret = -EINVAL;
> >> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> >> index 59a321d..b132248 100644
> >> --- a/include/hw/vfio/vfio-common.h
> >> +++ b/include/hw/vfio/vfio-common.h
> >> @@ -79,6 +79,9 @@ typedef struct VFIOContainer {
> >>               VFIOType1 type1;
> >>           };
> >>           void (*release)(struct VFIOContainer *);
> >> +        MemoryListener register_listener;
> >> +        int ram_reg_error;
> >> +        bool ram_reg_initialized;
> >
> > Isn't this exactly what the union above is for?
> 
> This is a different listener on a different address space and I do not 
> really feel like sharing these _error/_initialized between unrelated listeners, 
> should I?

How are they not related?  They're specific to the spapr-v2 IOMMU
backend, which is exactly what that union was intended to abstract.
Thanks,

Alex
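
For illustration only, here is a rough sketch of what keeping the spapr-v2
state inside the union could look like. All type names below are stand-ins
defined locally so the snippet compiles on its own; VFIOSpaprV2 and
VFIOContainerSketch in particular are assumed names, not something from the
posted patch or from QEMU.

```c
#include <stdbool.h>

/* Stand-in for QEMU's MemoryListener, only here so the sketch compiles. */
typedef struct MemoryListener {
    int priority;
} MemoryListener;

/* Stand-in for the existing type1 backend state. */
typedef struct VFIOType1 {
    MemoryListener listener;
    int error;
    bool initialized;
} VFIOType1;

/* Assumed name: state that only the spapr-v2 backend needs. */
typedef struct VFIOSpaprV2 {
    MemoryListener listener;          /* guest IOMMU (PCI) address space */
    MemoryListener register_listener; /* RAM pre-registration listener */
    int ram_reg_error;
    bool ram_reg_initialized;
} VFIOSpaprV2;

typedef struct VFIOContainerSketch {
    int fd;
    struct {
        /* Exactly one member is live, chosen by the IOMMU backend in use. */
        union {
            VFIOType1 type1;
            VFIOSpaprV2 spapr_v2;
        };
        void (*release)(struct VFIOContainerSketch *);
    } iommu_data;
} VFIOContainerSketch;
```

With this layout the backend-specific fields do not grow the common part of
the container, and the first listener of either backend sits at the same
offset inside the union.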

> >>       } iommu_data;
> >>       QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
> >>       QLIST_HEAD(, VFIOGroup) group_list;
> >> diff --git a/trace-events b/trace-events
> >> index a994019..b300e94 100644
> >> --- a/trace-events
> >> +++ b/trace-events
> >> @@ -1584,6 +1584,7 @@ vfio_disconnect_container(int fd) "close container->fd=%d"
> >>   vfio_put_group(int fd) "close group->fd=%d"
> >>   vfio_get_device(const char * name, unsigned int flags, unsigned int num_regions, unsigned int num_irqs) "Device %s flags: %u, regions: %u, irqs: %u"
> >>   vfio_put_base_device(int fd) "close vdev->fd=%d"
> >> +vfio_ram_register(int req, uint64_t va, uint64_t size, int ret) "req=%d va=%"PRIx64" size=%"PRIx64" ret=%d"
> >>
> >>   # hw/vfio/platform.c
> >>   vfio_platform_populate_regions(int region_index, unsigned long flag, unsigned long size, int fd, unsigned long offset) "- region %d flags = 0x%lx, size = 0x%lx, fd= %d, offset = 0x%lx"
> >
> >
> >
> 
> 


* Re: [Qemu-devel] [PATCH qemu v10 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW)
  2015-07-06 16:07   ` Alexey Kardashevskiy
@ 2015-07-06 16:13     ` Thomas Huth
  0 siblings, 0 replies; 71+ messages in thread
From: Thomas Huth @ 2015-07-06 16:13 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, Michael Roth, Gavin Shan, Alex Williamson, qemu-ppc,
	David Gibson

On Tue, 7 Jul 2015 02:07:36 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 07/07/2015 01:54 AM, Thomas Huth wrote:
> > On Mon,  6 Jul 2015 12:10:56 +1000
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > ...
> >>
> >> This patchset adds DDW support for pseries. The host kernel changes are
> >> required, available in the current upstream.
> >>
> >> This patchset is based on git://github.com/dgibson/qemu.git spapr-next branch.
> >>
> >> Please comment. Thanks!
> >
> >   Alexey,
> >
> > I'm sorry, but it looks like this patch set badly fails to link when
> > compiling for a non-Linux target:
> >
> >    LINK  ppc64-softmmu/qemu-system-ppc64.exe
> > hw/ppc/spapr_pci.o: In function `spapr_phb_dma_capabilities_update':
> > /home/thuth/devel/qemu/hw/ppc/spapr_pci.c:785: undefined reference to `spapr_phb_vfio_dma_capabilities_update'
> > hw/ppc/spapr_pci.o: In function `rtas_ibm_configure_pe':
> > /home/thuth/devel/qemu/hw/ppc/spapr_pci.c:601: undefined reference to `spapr_phb_vfio_eeh_configure'
> > hw/ppc/spapr_pci.o: In function `rtas_ibm_set_slot_reset':
> > /home/thuth/devel/qemu/hw/ppc/spapr_pci.c:573: undefined reference to `spapr_phb_vfio_eeh_reset'
> > hw/ppc/spapr_pci.o: In function `rtas_ibm_read_slot_reset_state2':
> > /home/thuth/devel/qemu/hw/ppc/spapr_pci.c:533: undefined reference to `spapr_phb_vfio_eeh_get_state'
> > hw/ppc/spapr_pci.o: In function `rtas_ibm_set_eeh_option':
> > /home/thuth/devel/qemu/hw/ppc/spapr_pci.c:455: undefined reference to `spapr_phb_vfio_eeh_set_option'
> > hw/ppc/spapr_pci.o: In function `spapr_phb_hotplug_dma_sync':
> > /home/thuth/devel/qemu/hw/ppc/spapr_pci.c:884: undefined reference to `spapr_phb_vfio_dma_remove_window'
> > /home/thuth/devel/qemu/hw/ppc/spapr_pci.c:894: undefined reference to `spapr_phb_vfio_dma_init_window'
> > hw/ppc/spapr_pci.o: In function `spapr_phb_dma_init_window':
> > /home/thuth/devel/qemu/hw/ppc/spapr_pci.c:805: undefined reference to `spapr_phb_vfio_dma_init_window'
> > hw/ppc/spapr_pci.o: In function `spapr_phb_dma_remove_window':
> > /home/thuth/devel/qemu/hw/ppc/spapr_pci.c:834: undefined reference to `spapr_phb_vfio_dma_remove_window'
> > hw/ppc/spapr_pci.o: In function `spapr_phb_reset':
> > /home/thuth/devel/qemu/hw/ppc/spapr_pci.c:1538: undefined reference to `spapr_phb_vfio_eeh_reenable'
> > collect2: error: ld returned 1 exit status
> >
> > Please make sure that this series also works if either CONFIG_LINUX
> > or CONFIG_PCI are not enabled!
> 
> 
> Oh. How exactly did you configure qemu to get this?

I installed the mingw64 cross compiler packages (and likely some
additional library packages like mingw64-zlib, -pixman, etc.), and then
ran configure like this:

configure --cc=x86_64-w64-mingw32-gcc --cross-prefix=x86_64-w64-mingw32- --target-list=ppc64-softmmu

 Thomas
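
One common way to avoid this kind of undefined reference is to provide stub
versions of the VFIO helpers when the Linux-only object file is not built.
The sketch below is only an illustration of that pattern, not the fix that
was applied: the single-argument signature is taken from patch 03/14 and may
differ in the later patches that produce the link error, and returning
-ENOSYS from the stub is an assumption.

```c
#include <errno.h>

/* Opaque here; the real definition lives in include/hw/pci-host/spapr.h. */
typedef struct sPAPRPHBState sPAPRPHBState;

#ifdef CONFIG_LINUX
/* Real implementation, built from hw/ppc/spapr_pci_vfio.c on Linux hosts. */
int spapr_phb_vfio_dma_capabilities_update(sPAPRPHBState *sphb);
#else
/* Stub for hosts without VFIO: report "not supported" instead of failing
 * to link. */
static inline int spapr_phb_vfio_dma_capabilities_update(sPAPRPHBState *sphb)
{
    (void)sphb; /* unused in the stub */
    return -ENOSYS;
}
#endif
```

The same pattern would have to be repeated for each of the symbols listed in
the linker output.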


* Re: [Qemu-devel] [PATCH qemu v10 03/14] spapr_pci: Convert finish_realize() to dma_capabilities_update()+dma_init_window()
  2015-07-06  2:10 ` [Qemu-devel] [PATCH qemu v10 03/14] spapr_pci: Convert finish_realize() to dma_capabilities_update()+dma_init_window() Alexey Kardashevskiy
@ 2015-07-06 16:41   ` Laurent Vivier
  2015-07-07  0:28     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 71+ messages in thread
From: Laurent Vivier @ 2015-07-06 16:41 UTC (permalink / raw)
  To: Alexey Kardashevskiy, qemu-devel
  Cc: Alex Williamson, David Gibson, qemu-ppc, Michael Roth, Gavin Shan



On 06/07/2015 04:10, Alexey Kardashevskiy wrote:
> This reworks finish_realize() which used to finalize DMA setup with
> an assumption that it will not change later.
> 
> The new callbacks support various window parameters such as page and
> window sizes. The new callbacks return an error code rather than Error**.
> 
> This is a mechanical change so no change in behaviour is expected.
> This is a part of getting rid of spapr-pci-vfio-host-bridge type.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> Changes:
> v8:
> * moved spapr_phb_dma_capabilities_update() higher to avoid forward
> declaration in following patches and keep DMA code together (i.e. next
> to spapr_pci_dma_iommu())
> ---
>  hw/ppc/spapr_pci.c          | 59 ++++++++++++++++++++++++++-------------------
>  hw/ppc/spapr_pci_vfio.c     | 53 ++++++++++++++++------------------------
>  include/hw/pci-host/spapr.h |  8 +++++-
>  3 files changed, 62 insertions(+), 58 deletions(-)
> 
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index a8f79d8..c1ca13d 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -808,6 +808,28 @@ static char *spapr_phb_get_loc_code(sPAPRPHBState *sphb, PCIDevice *pdev)
>      return buf;
>  }
>  
> +static int spapr_phb_dma_capabilities_update(sPAPRPHBState *sphb)
> +{
> +    sphb->dma32_window_start = 0;
> +    sphb->dma32_window_size = SPAPR_PCI_DMA32_SIZE;
> +
> +    return 0;
> +}
> +
> +static int spapr_phb_dma_init_window(sPAPRPHBState *sphb,
> +                                     uint32_t liobn, uint32_t page_shift,
> +                                     uint64_t window_size)
> +{
> +    uint64_t bus_offset = sphb->dma32_window_start;
> +    sPAPRTCETable *tcet;
> +
> +    tcet = spapr_tce_new_table(DEVICE(sphb), liobn, bus_offset, page_shift,
> +                               window_size >> page_shift,
> +                               false);
> +
> +    return tcet ? 0 : -1;
> +}
> +
>  /* Macros to operate with address in OF binding to PCI */
>  #define b_x(x, p, l)    (((x) & ((1<<(l))-1)) << (p))
>  #define b_n(x)          b_x((x), 31, 1) /* 0 if relocatable */
> @@ -1220,6 +1242,7 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>      int i;
>      PCIBus *bus;
>      uint64_t msi_window_size = 4096;
> +    sPAPRTCETable *tcet;
>  
>      if (sphb->index != (uint32_t)-1) {
>          hwaddr windows_base;
> @@ -1369,33 +1392,18 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>          }
>      }
>  
> -    if (!info->finish_realize) {
> -        error_setg(errp, "finish_realize not defined");
> -        return;
> -    }
> -
> -    info->finish_realize(sphb, errp);
> -
> -    sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
> -}
> -
> -static void spapr_phb_finish_realize(sPAPRPHBState *sphb, Error **errp)
> -{
> -    sPAPRTCETable *tcet;
> -    uint32_t nb_table;
> -
> -    nb_table = SPAPR_PCI_DMA32_SIZE >> SPAPR_TCE_PAGE_SHIFT;
> -    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn,
> -                               0, SPAPR_TCE_PAGE_SHIFT, nb_table, false);
> +    info->dma_capabilities_update(sphb);
> +    info->dma_init_window(sphb, sphb->dma_liobn, SPAPR_TCE_PAGE_SHIFT,
> +                          sphb->dma32_window_size);
> +    tcet = spapr_tce_find_by_liobn(sphb->dma_liobn);
>      if (!tcet) {
> -        error_setg(errp, "Unable to create TCE table for %s",
> -                   sphb->dtbusname);
> -        return ;
> +        error_setg(errp, "failed to create TCE table");
> +        return;
>      }
> -
> -    /* Register default 32bit DMA window */
> -    memory_region_add_subregion(&sphb->iommu_root, 0,
> +    memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
>                                  spapr_tce_get_iommu(tcet));
> +
> +    sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
>  }
>  
>  static int spapr_phb_children_reset(Object *child, void *opaque)
> @@ -1543,9 +1551,10 @@ static void spapr_phb_class_init(ObjectClass *klass, void *data)
>      dc->vmsd = &vmstate_spapr_pci;
>      set_bit(DEVICE_CATEGORY_BRIDGE, dc->categories);
>      dc->cannot_instantiate_with_device_add_yet = false;
> -    spc->finish_realize = spapr_phb_finish_realize;
>      hp->plug = spapr_phb_hot_plug_child;
>      hp->unplug = spapr_phb_hot_unplug_child;
> +    spc->dma_capabilities_update = spapr_phb_dma_capabilities_update;
> +    spc->dma_init_window = spapr_phb_dma_init_window;
>  }
>  
>  static const TypeInfo spapr_phb_info = {
> diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
> index cca45ed..6e3e17b 100644
> --- a/hw/ppc/spapr_pci_vfio.c
> +++ b/hw/ppc/spapr_pci_vfio.c
> @@ -28,48 +28,36 @@ static Property spapr_phb_vfio_properties[] = {
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> -static void spapr_phb_vfio_finish_realize(sPAPRPHBState *sphb, Error **errp)
> +static int spapr_phb_vfio_dma_capabilities_update(sPAPRPHBState *sphb)
>  {
>      sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
>      struct vfio_iommu_spapr_tce_info info = { .argsz = sizeof(info) };
>      int ret;
> -    sPAPRTCETable *tcet;
> -    uint32_t liobn = svphb->phb.dma_liobn;
>  
> -    if (svphb->iommugroupid == -1) {
> -        error_setg(errp, "Wrong IOMMU group ID %d", svphb->iommugroupid);
> -        return;
> -    }
> -
> -    ret = vfio_container_ioctl(&svphb->phb.iommu_as, svphb->iommugroupid,
> -                               VFIO_CHECK_EXTENSION,
> -                               (void *) VFIO_SPAPR_TCE_IOMMU);


This ioctl() disappears completely; was it useless?
[ but I guess the following one will fail in exactly the same way ]


> -    if (ret != 1) {
> -        error_setg_errno(errp, -ret,
> -                         "spapr-vfio: SPAPR extension is not supported");
> -        return;
> -    }
> -
> -    ret = vfio_container_ioctl(&svphb->phb.iommu_as, svphb->iommugroupid,
> +    ret = vfio_container_ioctl(&sphb->iommu_as, svphb->iommugroupid,
>                                 VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
>      if (ret) {
> -        error_setg_errno(errp, -ret,
> -                         "spapr-vfio: get info from container failed");
> -        return;
> +        return ret;
>      }
>  
> -    tcet = spapr_tce_new_table(DEVICE(sphb), liobn, info.dma32_window_start,
> -                               SPAPR_TCE_PAGE_SHIFT,
> -                               info.dma32_window_size >> SPAPR_TCE_PAGE_SHIFT,
> +    sphb->dma32_window_start = info.dma32_window_start;
> +    sphb->dma32_window_size = info.dma32_window_size;
> +
> +    return ret;
> +}
> +
> +static int spapr_phb_vfio_dma_init_window(sPAPRPHBState *sphb,
> +                                          uint32_t liobn, uint32_t page_shift,
> +                                          uint64_t window_size)
> +{
> +    uint64_t bus_offset = sphb->dma32_window_start;
> +    sPAPRTCETable *tcet;
> +
> +    tcet = spapr_tce_new_table(DEVICE(sphb), liobn, bus_offset, page_shift,
> +                               window_size >> page_shift,
>                                 true);
> -    if (!tcet) {
> -        error_setg(errp, "spapr-vfio: failed to create VFIO TCE table");
> -        return;
> -    }
>  
> -    /* Register default 32bit DMA window */
> -    memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
> -                                spapr_tce_get_iommu(tcet));
> +    return tcet ? 0 : -1;
>  }
>  
>  static void spapr_phb_vfio_eeh_reenable(sPAPRPHBVFIOState *svphb)
> @@ -257,7 +245,8 @@ static void spapr_phb_vfio_class_init(ObjectClass *klass, void *data)
>  
>      dc->props = spapr_phb_vfio_properties;
>      dc->reset = spapr_phb_vfio_reset;
> -    spc->finish_realize = spapr_phb_vfio_finish_realize;
> +    spc->dma_capabilities_update = spapr_phb_vfio_dma_capabilities_update;
> +    spc->dma_init_window = spapr_phb_vfio_dma_init_window;
>      spc->eeh_set_option = spapr_phb_vfio_eeh_set_option;
>      spc->eeh_get_state = spapr_phb_vfio_eeh_get_state;
>      spc->eeh_reset = spapr_phb_vfio_eeh_reset;
> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
> index 5322b56..b6d5719 100644
> --- a/include/hw/pci-host/spapr.h
> +++ b/include/hw/pci-host/spapr.h
> @@ -48,7 +48,10 @@ typedef struct sPAPRPHBVFIOState sPAPRPHBVFIOState;
>  struct sPAPRPHBClass {
>      PCIHostBridgeClass parent_class;
>  
> -    void (*finish_realize)(sPAPRPHBState *sphb, Error **errp);
> +    int (*dma_capabilities_update)(sPAPRPHBState *sphb);
> +    int (*dma_init_window)(sPAPRPHBState *sphb,
> +                           uint32_t liobn, uint32_t page_shift,
> +                           uint64_t window_size);
>      int (*eeh_set_option)(sPAPRPHBState *sphb, unsigned int addr, int option);
>      int (*eeh_get_state)(sPAPRPHBState *sphb, int *state);
>      int (*eeh_reset)(sPAPRPHBState *sphb, int option);
> @@ -90,6 +93,9 @@ struct sPAPRPHBState {
>      int32_t msi_devs_num;
>      spapr_pci_msi_mig *msi_devs;
>  
> +    uint32_t dma32_window_start;
> +    uint32_t dma32_window_size;
> +
>      QLIST_ENTRY(sPAPRPHBState) list;
>  };
>  
> 

Really nice work.
Reviewed-by: Laurent Vivier <lvivier@redhat.com>


* Re: [Qemu-devel] [PATCH qemu v10 06/14] spapr_iommu: Remove vfio_accel flag from sPAPRTCETable
  2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 06/14] spapr_iommu: Remove vfio_accel flag from sPAPRTCETable Alexey Kardashevskiy
@ 2015-07-06 16:45   ` Laurent Vivier
  2015-07-06 17:11   ` Thomas Huth
  1 sibling, 0 replies; 71+ messages in thread
From: Laurent Vivier @ 2015-07-06 16:45 UTC (permalink / raw)
  To: Alexey Kardashevskiy, qemu-devel
  Cc: Alex Williamson, David Gibson, qemu-ppc, Michael Roth, Gavin Shan



On 06/07/2015 04:11, Alexey Kardashevskiy wrote:
> sPAPRTCETable has a vfio_accel flag which is passed to
> kvmppc_create_spapr_tce() and controls whether to create a guest view
> table in KVM as this depends on the host kernel ability to accelerate
> H_PUT_TCE for VFIO devices. We would set this flag at the moment
> when sPAPRTCETable is created in spapr_tce_new_table() and
> use when the table is allocated in spapr_tce_table_realize().
> 
> Now we explicitly enable/disable DMA windows via spapr_tce_table_enable()
> and spapr_tce_table_disable() and can pass this flag directly without
> caching it in sPAPRTCETable.
> 
> This removes the flag. This should cause no behavioural change.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> Changes:
> v8:
> * new to patchset, this is cleanup
> ---
>  hw/ppc/spapr_iommu.c   | 8 +++-----
>  include/hw/ppc/spapr.h | 1 -
>  2 files changed, 3 insertions(+), 6 deletions(-)


Reviewed-by: Laurent Vivier <lvivier@redhat.com>


* Re: [Qemu-devel] [PATCH qemu v10 05/14] spapr_iommu: Introduce "enabled" state for TCE table
  2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 05/14] spapr_iommu: Introduce "enabled" state for TCE table Alexey Kardashevskiy
  2015-07-06 10:07   ` David Gibson
@ 2015-07-06 17:04   ` Thomas Huth
  1 sibling, 0 replies; 71+ messages in thread
From: Thomas Huth @ 2015-07-06 17:04 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Michael Roth, qemu-devel, Gavin Shan, Alex Williamson, qemu-ppc,
	David Gibson

On Mon,  6 Jul 2015 12:11:01 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> Currently TCE tables are created once at start and their size never
> changes. We are going to change that by introducing Dynamic DMA windows
> support, where the DMA configuration may change during guest execution.
> 
> This changes spapr_tce_new_table() to create an empty stub object. Only
> LIOBN is assigned by the time of creation. It still will be called once
> at the owner object (VIO or PHB) creation.
> 
> This introduces an "enabled" state for TCE table objects with two
> helper functions - spapr_tce_table_enable()/spapr_tce_table_disable().
> spapr_tce_table_enable() receives TCE table parameters and allocates
> a guest view of the TCE table (in the user space or KVM).
> spapr_tce_table_disable() disposes the table.
> 
> Follow up patches will disable+enable tables on reset (system reset
> or DDW reset).
> 
> No visible change in behaviour is expected except the actual table
> will be reallocated every reset. We might optimize this later.
> 
> The other way to implement this would be to dynamically create/remove
> the TCE table QOM objects but this would make migration impossible
> as migration expects all QOM objects to exist at the receiver
> so we have to have TCE table objects created when migration begins.
> 
> spapr_tce_table_do_enable() is separated from spapr_tce_table_enable()
> as later it will be called at the sPAPRTCETable post-migration stage when
> it has all the properties set after the migration.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
...
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index c1ca13d..3ddd72f 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -821,13 +821,12 @@ static int spapr_phb_dma_init_window(sPAPRPHBState *sphb,
>                                       uint64_t window_size)
>  {
>      uint64_t bus_offset = sphb->dma32_window_start;
> -    sPAPRTCETable *tcet;
> +    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
>  
> -    tcet = spapr_tce_new_table(DEVICE(sphb), liobn, bus_offset, page_shift,
> -                               window_size >> page_shift,
> -                               false);
> -
> -    return tcet ? 0 : -1;
> +    spapr_tce_table_enable(tcet, bus_offset, page_shift,
> +                           window_size >> page_shift,
> +                           false);
> +    return 0;
>  }

Would it be possible that this function is called with a window_size
where window_size >> page_shift results in 0?
For example triggered by a guest with a bad value for the RTAS call in
rtas_ibm_create_pe_dma_window() ?
In that case, the enablement would fail, but you'd still return 0 for
success here.
==> Maybe add a check for window_size >> page_shift == 0 and return an
error code in that case?
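
A minimal sketch of such a check, written as a standalone helper so it can be
read in isolation. The helper name and the -1 return value are hypothetical;
the extra page_shift guard is an addition of this sketch, since shifting a
64-bit value by 64 or more is undefined in C.

```c
#include <stdint.h>

/*
 * Hypothetical helper: compute the number of TCE entries for a window and
 * reject sizes that would yield an empty table. Returns 0 on success and
 * stores the entry count in *nb_table; returns -1 and leaves *nb_table
 * untouched otherwise.
 */
static int dma_window_nb_table(uint64_t window_size, uint32_t page_shift,
                               uint64_t *nb_table)
{
    uint64_t n;

    if (page_shift >= 64) {
        return -1; /* the shift below would be undefined */
    }

    n = window_size >> page_shift;
    if (n == 0) {
        return -1; /* window smaller than one page */
    }

    *nb_table = n;
    return 0;
}
```

spapr_phb_dma_init_window() could then propagate the error instead of
unconditionally returning 0.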

 Thomas


* Re: [Qemu-devel] [PATCH qemu v10 06/14] spapr_iommu: Remove vfio_accel flag from sPAPRTCETable
  2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 06/14] spapr_iommu: Remove vfio_accel flag from sPAPRTCETable Alexey Kardashevskiy
  2015-07-06 16:45   ` Laurent Vivier
@ 2015-07-06 17:11   ` Thomas Huth
  1 sibling, 0 replies; 71+ messages in thread
From: Thomas Huth @ 2015-07-06 17:11 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Michael Roth, qemu-devel, Gavin Shan, Alex Williamson, qemu-ppc,
	David Gibson

On Mon,  6 Jul 2015 12:11:02 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> sPAPRTCETable has a vfio_accel flag which is passed to
> kvmppc_create_spapr_tce() and controls whether to create a guest view
> table in KVM as this depends on the host kernel ability to accelerate
> H_PUT_TCE for VFIO devices. We would set this flag at the moment
> when sPAPRTCETable is created in spapr_tce_new_table() and
> use when the table is allocated in spapr_tce_table_realize().
> 
> Now we explicitly enable/disable DMA windows via spapr_tce_table_enable()
> and spapr_tce_table_disable() and can pass this flag directly without
> caching it in sPAPRTCETable.
> 
> This removes the flag. This should cause no behavioural change.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> Changes:
> v8:
> * new to patchset, this is cleanup
> ---
>  hw/ppc/spapr_iommu.c   | 8 +++-----
>  include/hw/ppc/spapr.h | 1 -
>  2 files changed, 3 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index fbca136..1378a7a 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -207,7 +207,7 @@ sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn)
>      return tcet;
>  }
>  
> -static void spapr_tce_table_do_enable(sPAPRTCETable *tcet)
> +static void spapr_tce_table_do_enable(sPAPRTCETable *tcet, bool vfio_accel)
>  {
>      if (!tcet->nb_table) {
>          return;
> @@ -217,7 +217,7 @@ static void spapr_tce_table_do_enable(sPAPRTCETable *tcet)
>                                          tcet->nb_table,
>                                          tcet->page_shift,
>                                          &tcet->fd,
> -                                        tcet->vfio_accel);
> +                                        vfio_accel);
>  
>      memory_region_set_size(&tcet->iommu,
>                             (uint64_t)tcet->nb_table << tcet->page_shift);
> @@ -236,9 +236,8 @@ void spapr_tce_table_enable(sPAPRTCETable *tcet,
>      tcet->bus_offset = bus_offset;
>      tcet->page_shift = page_shift;
>      tcet->nb_table = nb_table;
> -    tcet->vfio_accel = vfio_accel;
>  
> -    spapr_tce_table_do_enable(tcet);
> +    spapr_tce_table_do_enable(tcet, vfio_accel);
>  }
>  
>  void spapr_tce_table_disable(sPAPRTCETable *tcet)
> @@ -256,7 +255,6 @@ void spapr_tce_table_disable(sPAPRTCETable *tcet)
>      tcet->bus_offset = 0;
>      tcet->page_shift = 0;
>      tcet->nb_table = 0;
> -    tcet->vfio_accel = false;
>  }
>  
>  static void spapr_tce_table_unrealize(DeviceState *dev, Error **errp)
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index ed68c95..1da0ade 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -559,7 +559,6 @@ struct sPAPRTCETable {
>      uint32_t page_shift;
>      uint64_t *table;
>      bool bypass;
> -    bool vfio_accel;
>      int fd;
>      MemoryRegion iommu;
>      struct VIOsPAPRDevice *vdev; /* for @bypass migration compatibility only */

Reviewed-by: Thomas Huth <thuth@redhat.com>


* Re: [Qemu-devel] [PATCH qemu v10 07/14] spapr_iommu: Add root memory region
  2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 07/14] spapr_iommu: Add root memory region Alexey Kardashevskiy
@ 2015-07-06 19:15   ` Thomas Huth
  0 siblings, 0 replies; 71+ messages in thread
From: Thomas Huth @ 2015-07-06 19:15 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Michael Roth, qemu-devel, Gavin Shan, Alex Williamson, qemu-ppc,
	David Gibson

On Mon,  6 Jul 2015 12:11:03 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> We are going to have multiple DMA windows at different offsets on
> a PCI bus. For the sake of migration, we will have as many TCE table
> objects pre-created as there are windows supported.
> So we need a way to map windows dynamically onto a PCI bus
> when migration of a table is completed but at this stage a TCE table
> object does not have access to a PHB to ask it to map a DMA window
> backed by just migrated TCE table.
> 
> This adds a "root" memory region (UINT64_MAX long) to the TCE object.
> This new region is mapped on a PCI bus with enabled overlapping as
> there will be one root MR per TCE table, each of them mapped at 0.
> The actual IOMMU memory region is a subregion of the root region and
> a TCE table enables/disables this subregion and maps it at
> the specific offset inside the root MR which is 1:1 mapping of
> a PCI address space.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  hw/ppc/spapr_iommu.c   | 13 ++++++++++---
>  hw/ppc/spapr_pci.c     |  2 +-
>  include/hw/ppc/spapr.h |  2 +-
>  3 files changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index 1378a7a..45c00d8 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -171,11 +171,16 @@ static MemoryRegionIOMMUOps spapr_iommu_ops = {
>  static int spapr_tce_table_realize(DeviceState *dev)
>  {
>      sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
> +    Object *tcetobj = OBJECT(tcet);
> +    char tmp[32];
>  
>      tcet->fd = -1;
>  
> -    memory_region_init_iommu(&tcet->iommu, OBJECT(dev), &spapr_iommu_ops,
> -                             "iommu-spapr", 0);
> +    snprintf(tmp, sizeof(tmp), "tce-root-%x", tcet->liobn);
> +    memory_region_init(&tcet->root, tcetobj, tmp, UINT64_MAX);
> +
> +    snprintf(tmp, sizeof(tmp), "tce-iommu-%x", tcet->liobn);
> +    memory_region_init_iommu(&tcet->iommu, tcetobj, &spapr_iommu_ops, tmp, 0);
>  
>      QLIST_INSERT_HEAD(&spapr_tce_tables, tcet, list);
>  
> @@ -221,6 +226,7 @@ static void spapr_tce_table_do_enable(sPAPRTCETable *tcet, bool vfio_accel)
>  
>      memory_region_set_size(&tcet->iommu,
>                             (uint64_t)tcet->nb_table << tcet->page_shift);
> +    memory_region_add_subregion(&tcet->root, tcet->bus_offset, &tcet->iommu);
>  
>      tcet->enabled = true;
>  }
> @@ -246,6 +252,7 @@ void spapr_tce_table_disable(sPAPRTCETable *tcet)
>          return;
>      }
>  
> +    memory_region_del_subregion(&tcet->root, &tcet->iommu);
>      memory_region_set_size(&tcet->iommu, 0);
>  
>      spapr_tce_free_table(tcet->table, tcet->fd, tcet->nb_table);
> @@ -268,7 +275,7 @@ static void spapr_tce_table_unrealize(DeviceState *dev, Error **errp)
>  
>  MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet)
>  {
> -    return &tcet->iommu;
> +    return &tcet->root;
>  }
>  
>  static void spapr_tce_reset(DeviceState *dev)
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index 3ddd72f..e27ca15 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -1405,7 +1405,7 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>          error_setg(errp, "failed to create TCE table");
>          return;
>      }
> -    memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
> +    memory_region_add_subregion(&sphb->iommu_root, 0,
>                                  spapr_tce_get_iommu(tcet));
>  
>      sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 1da0ade..e32e787 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -560,7 +560,7 @@ struct sPAPRTCETable {
>      uint64_t *table;
>      bool bypass;
>      int fd;
> -    MemoryRegion iommu;
> +    MemoryRegion root, iommu;
>      struct VIOsPAPRDevice *vdev; /* for @bypass migration compatibility only */
>      QLIST_ENTRY(sPAPRTCETable) list;
>  };

Reviewed-by: Thomas Huth <thuth@redhat.com>
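
For anyone reading along, here is how I understand the resulting layout, as a
minimal standalone sketch rather than actual QEMU code. The Window type and
both helpers are made up for illustration; only the field names and the size
computation (nb_table << page_shift) come from the patch. It assumes the root
MR sits at 0 on the PCI bus and that the IOMMU subregion starts at bus_offset
inside it, so a PCI address hits the window only when it lies in
[bus_offset, bus_offset + size).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one DMA window; not a QEMU structure. */
typedef struct {
    uint64_t bus_offset;  /* offset of the IOMMU subregion inside the root MR */
    uint32_t page_shift;  /* log2 of the IOMMU page size */
    uint32_t nb_table;    /* number of TCE entries */
    bool enabled;         /* subregion currently added to the root MR */
} Window;

/* Size of the IOMMU subregion, computed as in spapr_tce_table_do_enable(). */
static uint64_t window_size(const Window *w)
{
    assert(w->page_shift < 64);
    return (uint64_t)w->nb_table << w->page_shift;
}

/*
 * The root MR is a 1:1 view of the PCI address space, so a PCI address
 * belongs to the window iff it falls inside the subregion. On a hit,
 * store the index of the TCE entry covering the address.
 */
static bool window_lookup(const Window *w, uint64_t pci_addr, uint64_t *index)
{
    if (!w->enabled || pci_addr < w->bus_offset ||
        pci_addr - w->bus_offset >= window_size(w)) {
        return false;
    }
    *index = (pci_addr - w->bus_offset) >> w->page_shift;
    return true;
}
```

With one root MR per table, two windows can both be mapped at 0 on the bus
without their IOMMU subregions colliding, as long as their bus_offset ranges
do not overlap.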

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v10 09/14] spapr_vfio_pci: Remove redundant spapr-pci-vfio-host-bridge
  2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 09/14] spapr_vfio_pci: Remove redundant spapr-pci-vfio-host-bridge Alexey Kardashevskiy
@ 2015-07-06 21:13   ` Thomas Huth
  0 siblings, 0 replies; 71+ messages in thread
From: Thomas Huth @ 2015-07-06 21:13 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Michael Roth, qemu-devel, Gavin Shan, Alex Williamson, qemu-ppc,
	David Gibson

On Mon,  6 Jul 2015 12:11:05 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> sPAPRTCETable is handling 2 TCE tables already:
> 
> 1) guest view of the TCE table - emulated devices use only this table;
> 
> 2) hardware IOMMU table - VFIO PCI devices use it for actual work but
> it does not replace 1) and it is not visible to the guest.
> The initialization of this table is driven by vfio-pci device,
> DMA map/unmap requests are handled via MemoryListener so there is very
> little to do in spapr-pci-vfio-host-bridge.
> 
> This moves VFIO bits to the generic spapr-pci-host-bridge which allows
> putting emulated and VFIO devices on the same PHB. It is still possible
> to create multiple PHBs and avoid sharing PHB resources for emulated and
> VFIO devices.
> 
> If there is no VFIO-PCI device attached, no special ioctls will be called.
> If there are some VFIO-PCI devices attached, PHB may refuse to attach
> another VFIO-PCI device if a VFIO container on the host kernel side
> does not support container sharing.
> 
> This changes spapr-pci-host-bridge to support properties of
> spapr-pci-vfio-host-bridge. This makes spapr-pci-vfio-host-bridge type
> equal to spapr-pci-host-bridge except it has an additional "iommu"
> property for backward compatibility reasons.
> 
> This moves PCI device lookup from spapr_phb_vfio_eeh_set_option() to
> rtas_ibm_set_eeh_option() as we need to know if the device is "vfio-pci"
> and decide whether to call spapr_phb_vfio_eeh_set_option() or not.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> Changes:
> v9:
> * s'iommugroupid shall not be used'iommugroupid is deprecated and will be ignored'
> in error log
> 
> v8:
> * call spapr_phb_vfio_eeh_set_option() on vfio-pci devices only (reported by Gavin)
> ---
>  hw/ppc/spapr_pci.c          | 82 +++++++++++++++----------------------------
>  hw/ppc/spapr_pci_vfio.c     | 85 +++++++++------------------------------------
>  include/hw/pci-host/spapr.h | 25 ++++++-------
>  3 files changed, 55 insertions(+), 137 deletions(-)
> 
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index 00816b3..76c988f 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
...
> @@ -841,9 +811,8 @@ int spapr_phb_dma_reset(sPAPRPHBState *sphb)
>  {
>      int i;
>      sPAPRTCETable *tcet;
> -    sPAPRPHBClass *spc = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(sphb);
>  
> -    spc->dma_capabilities_update(sphb); /* Refresh @has_vfio status */
> +    spapr_phb_dma_capabilities_update(sphb); /* Refresh @has_vfio status */
>  
>      for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
>          tcet = spapr_tce_find_by_liobn(SPAPR_PCI_LIOBN(sphb->index, i));
> @@ -852,8 +821,8 @@ int spapr_phb_dma_reset(sPAPRPHBState *sphb)
>          }
>      }
>  
> -    spc->dma_init_window(sphb, SPAPR_PCI_LIOBN(sphb->index, 0),
> -                         SPAPR_TCE_PAGE_SHIFT, sphb->dma32_window_size);
> +    spapr_phb_dma_init_window(sphb, SPAPR_PCI_LIOBN(sphb->index, 0),
> +                              SPAPR_TCE_PAGE_SHIFT, sphb->dma32_window_size);
>  
>      return 0;
>  }
> @@ -1271,6 +1240,11 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>      uint64_t msi_window_size = 4096;
>      sPAPRTCETable *tcet;
>  
> +    if ((sphb->iommugroupid != -1) &&

Too many brackets for my taste...
... but apart from that, the patch looks good to me.

Reviewed-by: Thomas Huth <thuth@redhat.com>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v10 10/14] spapr_pci: Enable vfio-pci hotplug
  2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 10/14] spapr_pci: Enable vfio-pci hotplug Alexey Kardashevskiy
  2015-07-06 10:27   ` David Gibson
@ 2015-07-06 21:31   ` Thomas Huth
  2015-07-07  9:28     ` Alexey Kardashevskiy
  2015-07-10 21:33   ` Michael Roth
  2 siblings, 1 reply; 71+ messages in thread
From: Thomas Huth @ 2015-07-06 21:31 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Michael Roth, qemu-devel, Gavin Shan, Alex Williamson, qemu-ppc,
	David Gibson

On Mon,  6 Jul 2015 12:11:06 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> sPAPR IOMMU is managing two copies of a TCE table:
> 1) a guest view of the table - this is what emulated devices use and
> this is where H_GET_TCE reads from;
> 2) a hardware TCE table - only present if there is at least one vfio-pci
> device on a PHB; it is updated via a memory listener on a PHB address
> space which forwards map/unmap requests to vfio-pci IOMMU host driver.
> 
> At the moment, the presence of vfio-pci devices on a bus affects the way
> the guest view table is allocated. If there is no vfio-pci on a PHB
> and the host kernel supports KVM acceleration of H_PUT_TCE, a table
> is allocated in KVM. However, if there is vfio-pci and we do not yet
> support KVM acceleration for these, the table has to be allocated
> by the userspace.
> 
> When vfio-pci device is hotplugged and there were no vfio-pci devices
> already, the guest view table could have been allocated by KVM which
> means that H_PUT_TCE is handled by the host kernel and since we
> do not support vfio-pci in KVM, the hardware table will not be updated.
> 
> This reallocates the guest view table in QEMU if the first vfio-pci
> device has just been plugged. spapr_tce_realloc_userspace() handles this.

I wonder whether it would help to improve the readability of the code
later if you put the description of the function into the code instead
of the commit message?

> This replays all the mappings to make sure that the tables are in sync.
> This will not have a visible effect though as for a new device
> the guest kernel will allocate-and-map new addresses and therefore
> existing mappings from emulated devices will not be used by vfio-pci
> devices.
> 
> This adds calls to spapr_phb_dma_capabilities_update() in PCI hotplug
> hooks.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
...
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index 76c988f..d1fa157 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -827,6 +827,43 @@ int spapr_phb_dma_reset(sPAPRPHBState *sphb)
>      return 0;
>  }
>  
> +static int spapr_phb_hotplug_dma_sync(sPAPRPHBState *sphb)
> +{
> +    int ret = 0, i;
> +    bool had_vfio = sphb->has_vfio;
> +    sPAPRTCETable *tcet;
> +
> +    spapr_phb_dma_capabilities_update(sphb);
> +
> +    if (!had_vfio && sphb->has_vfio) {

    if (had_vfio || !sphb->has_vfio) {
        return 0;
    }

... and then you can save one level of indentation for the following
for-loop.
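
To make the suggestion concrete, here is a standalone sketch of the flattened
control flow. It is not the real function: the Phb and Table types and the
helpers are hypothetical stand-ins for sPAPRPHBState, sPAPRTCETable,
spapr_phb_dma_capabilities_update(), spapr_tce_realloc_userspace() and
spapr_tce_replay(), and tracing is left out. The early return is just the
negation of the original condition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WINDOWS 2   /* stand-in for SPAPR_PCI_DMA_MAX_WINDOWS */

/* Hypothetical stand-in for sPAPRTCETable; only the fields used below. */
typedef struct {
    bool enabled;
    int fd;             /* >= 0 means the table was allocated in KVM */
    int reallocated;    /* counters so the effect can be observed */
    int replayed;
} Table;

/* Hypothetical stand-in for sPAPRPHBState. */
typedef struct {
    bool has_vfio;
    bool vfio_present;  /* what the capabilities update will discover */
    Table *windows[MAX_WINDOWS];
} Phb;

static void capabilities_update(Phb *phb) { phb->has_vfio = phb->vfio_present; }
static int realloc_userspace(Table *t) { t->fd = -1; t->reallocated++; return 0; }
static int replay(Table *t) { t->replayed++; return 0; }

static int hotplug_dma_sync(Phb *phb)
{
    int ret = 0, i;
    bool had_vfio;

    assert(phb != NULL);
    had_vfio = phb->has_vfio;
    capabilities_update(phb);

    /* Nothing to do unless the first vfio-pci device has just appeared. */
    if (had_vfio || !phb->has_vfio) {
        return 0;
    }

    for (i = 0; i < MAX_WINDOWS; ++i) {
        Table *t = phb->windows[i];

        if (!t || !t->enabled) {
            continue;
        }
        /* Accelerated table: reallocate in userspace; otherwise just replay. */
        ret = (t->fd >= 0) ? realloc_userspace(t) : replay(t);
        if (ret) {
            break;
        }
    }
    return ret;
}
```

The loop body then sits one level shallower and the trailing "return 0" after
the if-block goes away.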

> +        for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
> +            tcet = spapr_tce_find_by_liobn(SPAPR_PCI_LIOBN(sphb->index, i));
> +            if (!tcet || !tcet->enabled) {
> +                continue;
> +            }
> +            if (tcet->fd >= 0) {
> +                /*
> +                 * We got first vfio-pci device on accelerated table.
> +                 * VFIO acceleration is not possible.
> +                 * Reallocate table in userspace and replay mappings.
> +                 */
> +                ret = spapr_tce_realloc_userspace(tcet, true);
> +                trace_spapr_pci_dma_realloc_update(tcet->liobn, ret);
> +            } else {
> +                /* There was no acceleration, so just replay mappings. */
> +                ret = spapr_tce_replay(tcet);
> +                trace_spapr_pci_dma_update(tcet->liobn, ret);
> +            }
> +            if (ret) {
> +                break;
> +            }
> +        }
> +        return ret;
> +    }
> +
> +    return 0;
> +}
> +
>  /* Macros to operate with address in OF binding to PCI */
>  #define b_x(x, p, l)    (((x) & ((1<<(l))-1)) << (p))
>  #define b_n(x)          b_x((x), 31, 1) /* 0 if relocatable */
...
> @@ -1130,6 +1174,9 @@ static void spapr_phb_remove_pci_device_cb(DeviceState *dev, void *opaque)
>       */
>      pci_device_reset(PCI_DEVICE(dev));
>      object_unparent(OBJECT(dev));
> +
> +    /* Actual VFIO device release happens from RCU so postpone DMA update */
> +    call_rcu1(&((sPAPRPHBState *)opaque)->rcu, spapr_phb_remove_sync_dma);

Too many brackets again for my taste ;-)

>  }
>  

 Thomas

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v10 03/14] spapr_pci: Convert finish_realize() to dma_capabilities_update()+dma_init_window()
  2015-07-06 16:41   ` Laurent Vivier
@ 2015-07-07  0:28     ` Alexey Kardashevskiy
  0 siblings, 0 replies; 71+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-07  0:28 UTC (permalink / raw)
  To: Laurent Vivier, qemu-devel
  Cc: Alex Williamson, David Gibson, qemu-ppc, Michael Roth, Gavin Shan

On 07/07/2015 02:41 AM, Laurent Vivier wrote:
>
>
> On 06/07/2015 04:10, Alexey Kardashevskiy wrote:
>> This reworks finish_realize() which used to finalize DMA setup with
>> an assumption that it will not change later.
>>
>> New callbacks support various window parameters such as page and
>> window sizes. The new callbacks return an error code rather than Error**.
>>
>> This is a mechanical change so no change in behaviour is expected.
>> This is a part of getting rid of spapr-pci-vfio-host-bridge type.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>> ---
>> Changes:
>> v8:
>> * moved spapr_phb_dma_capabilities_update() higher to avoid forward
>> declaration in following patches and keep DMA code together (i.e. next
>> to spapr_pci_dma_iommu())
>> ---
>>   hw/ppc/spapr_pci.c          | 59 ++++++++++++++++++++++++++-------------------
>>   hw/ppc/spapr_pci_vfio.c     | 53 ++++++++++++++++------------------------
>>   include/hw/pci-host/spapr.h |  8 +++++-
>>   3 files changed, 62 insertions(+), 58 deletions(-)
>>
>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>> index a8f79d8..c1ca13d 100644
>> --- a/hw/ppc/spapr_pci.c
>> +++ b/hw/ppc/spapr_pci.c
>> @@ -808,6 +808,28 @@ static char *spapr_phb_get_loc_code(sPAPRPHBState *sphb, PCIDevice *pdev)
>>       return buf;
>>   }
>>
>> +static int spapr_phb_dma_capabilities_update(sPAPRPHBState *sphb)
>> +{
>> +    sphb->dma32_window_start = 0;
>> +    sphb->dma32_window_size = SPAPR_PCI_DMA32_SIZE;
>> +
>> +    return 0;
>> +}
>> +
>> +static int spapr_phb_dma_init_window(sPAPRPHBState *sphb,
>> +                                     uint32_t liobn, uint32_t page_shift,
>> +                                     uint64_t window_size)
>> +{
>> +    uint64_t bus_offset = sphb->dma32_window_start;
>> +    sPAPRTCETable *tcet;
>> +
>> +    tcet = spapr_tce_new_table(DEVICE(sphb), liobn, bus_offset, page_shift,
>> +                               window_size >> page_shift,
>> +                               false);
>> +
>> +    return tcet ? 0 : -1;
>> +}
>> +
>>   /* Macros to operate with address in OF binding to PCI */
>>   #define b_x(x, p, l)    (((x) & ((1<<(l))-1)) << (p))
>>   #define b_n(x)          b_x((x), 31, 1) /* 0 if relocatable */
>> @@ -1220,6 +1242,7 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>       int i;
>>       PCIBus *bus;
>>       uint64_t msi_window_size = 4096;
>> +    sPAPRTCETable *tcet;
>>
>>       if (sphb->index != (uint32_t)-1) {
>>           hwaddr windows_base;
>> @@ -1369,33 +1392,18 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>           }
>>       }
>>
>> -    if (!info->finish_realize) {
>> -        error_setg(errp, "finish_realize not defined");
>> -        return;
>> -    }
>> -
>> -    info->finish_realize(sphb, errp);
>> -
>> -    sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
>> -}
>> -
>> -static void spapr_phb_finish_realize(sPAPRPHBState *sphb, Error **errp)
>> -{
>> -    sPAPRTCETable *tcet;
>> -    uint32_t nb_table;
>> -
>> -    nb_table = SPAPR_PCI_DMA32_SIZE >> SPAPR_TCE_PAGE_SHIFT;
>> -    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn,
>> -                               0, SPAPR_TCE_PAGE_SHIFT, nb_table, false);
>> +    info->dma_capabilities_update(sphb);
>> +    info->dma_init_window(sphb, sphb->dma_liobn, SPAPR_TCE_PAGE_SHIFT,
>> +                          sphb->dma32_window_size);
>> +    tcet = spapr_tce_find_by_liobn(sphb->dma_liobn);
>>       if (!tcet) {
>> -        error_setg(errp, "Unable to create TCE table for %s",
>> -                   sphb->dtbusname);
>> -        return ;
>> +        error_setg(errp, "failed to create TCE table");
>> +        return;
>>       }
>> -
>> -    /* Register default 32bit DMA window */
>> -    memory_region_add_subregion(&sphb->iommu_root, 0,
>> +    memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
>>                                   spapr_tce_get_iommu(tcet));
>> +
>> +    sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
>>   }
>>
>>   static int spapr_phb_children_reset(Object *child, void *opaque)
>> @@ -1543,9 +1551,10 @@ static void spapr_phb_class_init(ObjectClass *klass, void *data)
>>       dc->vmsd = &vmstate_spapr_pci;
>>       set_bit(DEVICE_CATEGORY_BRIDGE, dc->categories);
>>       dc->cannot_instantiate_with_device_add_yet = false;
>> -    spc->finish_realize = spapr_phb_finish_realize;
>>       hp->plug = spapr_phb_hot_plug_child;
>>       hp->unplug = spapr_phb_hot_unplug_child;
>> +    spc->dma_capabilities_update = spapr_phb_dma_capabilities_update;
>> +    spc->dma_init_window = spapr_phb_dma_init_window;
>>   }
>>
>>   static const TypeInfo spapr_phb_info = {
>> diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
>> index cca45ed..6e3e17b 100644
>> --- a/hw/ppc/spapr_pci_vfio.c
>> +++ b/hw/ppc/spapr_pci_vfio.c
>> @@ -28,48 +28,36 @@ static Property spapr_phb_vfio_properties[] = {
>>       DEFINE_PROP_END_OF_LIST(),
>>   };
>>
>> -static void spapr_phb_vfio_finish_realize(sPAPRPHBState *sphb, Error **errp)
>> +static int spapr_phb_vfio_dma_capabilities_update(sPAPRPHBState *sphb)
>>   {
>>       sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
>>       struct vfio_iommu_spapr_tce_info info = { .argsz = sizeof(info) };
>>       int ret;
>> -    sPAPRTCETable *tcet;
>> -    uint32_t liobn = svphb->phb.dma_liobn;
>>
>> -    if (svphb->iommugroupid == -1) {
>> -        error_setg(errp, "Wrong IOMMU group ID %d", svphb->iommugroupid);
>> -        return;
>> -    }
>> -
>> -    ret = vfio_container_ioctl(&svphb->phb.iommu_as, svphb->iommugroupid,
>> -                               VFIO_CHECK_EXTENSION,
>> -                               (void *) VFIO_SPAPR_TCE_IOMMU);
>
>
> This ioctl() disappears completely, was it useless ?


It is called in hw/vfio/common.c and if it fails there, we never get here 
so there is no point in calling it here too.





-- 
Alexey

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v10 13/14] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2015-07-06 16:13       ` Alex Williamson
@ 2015-07-07  0:29         ` David Gibson
  2015-07-07  0:36           ` Alexey Kardashevskiy
  2015-07-07 12:11         ` Alexey Kardashevskiy
  1 sibling, 1 reply; 71+ messages in thread
From: David Gibson @ 2015-07-07  0:29 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, Michael Roth, qemu-ppc, qemu-devel, Gavin Shan

[-- Attachment #1: Type: text/plain, Size: 8472 bytes --]

On Mon, Jul 06, 2015 at 10:13:07AM -0600, Alex Williamson wrote:
> On Tue, 2015-07-07 at 01:34 +1000, Alexey Kardashevskiy wrote:
> > On 07/06/2015 11:42 PM, Alex Williamson wrote:
> > > On Mon, 2015-07-06 at 12:11 +1000, Alexey Kardashevskiy wrote:
> > >> This makes use of the new "memory registering" feature. The idea is
> > >> to provide the userspace ability to notify the host kernel about pages
> > >> which are going to be used for DMA. Having this information, the host
> > >> kernel can pin them all once per user process, do locked pages
> > >> accounting (once) and not spend time on doing that in real time with
> > >> possible failures which cannot be handled nicely in some cases.
> > >>
> > >> This adds a guest RAM memory listener which notifies a VFIO container
> > >> about memory which needs to be pinned/unpinned. VFIO MMIO regions
> > >> (i.e. "skip dump" regions) are skipped.
> > >>
> > >> The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
> > >> are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
> > >> not call it when v2 is detected and enabled.
> > >>
> > >> This does not change the guest visible interface.
> > >>
> > >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > >> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> > >> ---
> > >> Changes:
> > >> v9:
> > >> * since there is no more SPAPR-specific data in container::iommu_data,
> > >> the memory preregistration fields are common and potentially can be used
> > >> by other architectures
> > >>
> > >> v7:
> > >> * in vfio_spapr_ram_listener_region_del(), do unref() after ioctl()
> > >> * s'ramlistener'register_listener'
> > >>
> > >> v6:
> > >> * fixed commit log (s/guest/userspace/), added note about no guest visible
> > >> change
> > >> * fixed error checking if ram registration failed
> > >> * added alignment check for section->offset_within_region
> > >>
> > >> v5:
> > >> * simplified the patch
> > >> * added trace points
> > >> * added round_up() for the size
> > >> * SPAPR IOMMU v2 used
> > >> ---
> > >>   hw/vfio/common.c              | 109 ++++++++++++++++++++++++++++++++++++++----
> > >>   include/hw/vfio/vfio-common.h |   3 ++
> > >>   trace-events                  |   1 +
> > >>   3 files changed, 104 insertions(+), 9 deletions(-)
> > >>
> > >> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > >> index 8eacfd7..0c7ba8c 100644
> > >> --- a/hw/vfio/common.c
> > >> +++ b/hw/vfio/common.c
> > >> @@ -488,6 +488,76 @@ static void vfio_listener_release(VFIOContainer *container)
> > >>       memory_listener_unregister(&container->iommu_data.type1.listener);
> > >>   }
> > >>
> > >> +static void vfio_ram_do_region(VFIOContainer *container,
> > >> +                              MemoryRegionSection *section, unsigned long req)
> > >> +{
> > >> +    int ret;
> > >> +    struct vfio_iommu_spapr_register_memory reg = { .argsz = sizeof(reg) };
> > >
> > > This function is not as general as the name would imply, it's spapr
> > > specific due to this.  How about vfio_spapr_register_memory() with a
> > > bool parameter toggling register vs unregister so we're not passing an
> > > arbitrary ioctl number?
> > 
> > Ok. Although I am quite often asked not to do such a thing and rather add 2 
> > helpers (reg/unreg, do/undo, etc) instead and reuse common bits.
> 
> I'm not a fan of functions that do the reverse process based on a bool
> arg either, but I dislike them less than passing an arbitrary ioctl
> number for a parameter.  The former is ugly, but the latter is difficult
> to use and difficult to maintain because it would be subtle later to
> spot an unsupported ioctl being passed to the function.
> 
> > >> +
> > >> +    if (!memory_region_is_ram(section->mr) ||
> > >> +        memory_region_is_skip_dump(section->mr)) {
> > >> +        return;
> > >> +    }
> > >> +
> > >> +    if (unlikely((section->offset_within_region & (getpagesize() - 1)))) {
> > >
> > > s/getpagesize()/qemu_real_host_page_size/?
> > 
> > 
> > Oh, right, I guess it reached upstream now.
> > 
> > 
> > >> +        error_report("%s received unaligned region", __func__);
> > >> +        return;
> > >> +    }
> > >> +
> > >> +    reg.vaddr = (__u64) memory_region_get_ram_ptr(section->mr) +
> > >> +        section->offset_within_region;
> > >> +    reg.size = ROUND_UP(int128_get64(section->size), TARGET_PAGE_SIZE);
> > >> +
> > >> +    ret = ioctl(container->fd, req, &reg);
> > >> +    trace_vfio_ram_register(_IOC_NR(req) - VFIO_BASE, reg.vaddr, reg.size,
> > >> +            ret ? -errno : 0);
> > >> +    if (!ret) {
> > >> +        return;
> > >> +    }
> > >> +
> > >> +    /*
> > >> +     * On the initfn path, store the first error in the container so we
> > >> +     * can gracefully fail.  Runtime, there's not much we can do other
> > >> +     * than throw a hardware error.
> > >> +     */
> > >> +    if (!container->iommu_data.ram_reg_initialized) {
> > >> +        if (!container->iommu_data.ram_reg_error) {
> > >> +            container->iommu_data.ram_reg_error = -errno;
> > >> +        }
> > >> +    } else {
> > >> +        hw_error("vfio: RAM registering failed, unable to continue");
> > >> +    }
> > >
> > > I'd rather see:
> > >
> > > if (ret) {
> > >    if (!container...) {
> > >      ...
> > >    } else {
> > >      ...
> > >    }
> > > }
> > >
> > > Exiting early on success and otherwise falling into error handling is a
> > > strange code flow.
> > 
> > Ok... vfio_dma_map() does not follow this rule so I thought it is not that 
> > strict :)
> 
> It would be nice to clean it up there too.
> 
> > >> +}
> > >> +
> > >> +static void vfio_ram_listener_region_add(MemoryListener *listener,
> > >> +                                         MemoryRegionSection *section)
> > >> +{
> > >> +    VFIOContainer *container = container_of(listener, VFIOContainer,
> > >> +                                            iommu_data.register_listener);
> > >> +    memory_region_ref(section->mr);
> > >> +    vfio_ram_do_region(container, section, VFIO_IOMMU_SPAPR_REGISTER_MEMORY);
> > >
> > > vfio_spapr_register_memory(container, section, true);
> > >
> > >> +}
> > >> +
> > >> +static void vfio_ram_listener_region_del(MemoryListener *listener,
> > >> +                                         MemoryRegionSection *section)
> > >> +{
> > >> +    VFIOContainer *container = container_of(listener, VFIOContainer,
> > >> +                                            iommu_data.register_listener);
> > >> +    vfio_ram_do_region(container, section, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY);
> > >
> > > vfio_spapr_register_memory(container, section, false);
> > >
> > >> +    memory_region_unref(section->mr);
> > >> +}
> > >> +
> > >> +static const MemoryListener vfio_ram_memory_listener = {
> > >> +    .region_add = vfio_ram_listener_region_add,
> > >> +    .region_del = vfio_ram_listener_region_del,
> > >> +};
> > >
> > > These are all spapr specific, please reflect that in the name;
> > > vfio_spapr_v2_memory_listener, vfio_spapr_v2_listener_add/del.
> > 
> > ok.
> > 
> > 
> > > Actually, can't we determine what type of IOMMU we have and make the
> > > existing MemoryListener handle either type1 or spapr or spapr-v2?
> > 
> > 
> > Sorry, I do not follow you here. How? The existing listener listens on PCI 
> > address space (at least, on pseries), new one listens on RAM address space 
> > (address_space_memory). What do I miss?
> 
> Isn't that simply a difference of the address space the listener is
> attached to?  Type1 maps RAM, spapr-v1 maps guest IOMMU space and these
> are already both handled by the same listener.

I think what you're missing is that the spapr code now needs to listen
on *both* the RAM and PCI address spaces.  On RAM so it can do the
preregistration, and on PCI so it can do the actual IOMMU mappings.

What might make sense, although it might be better as a later cleanup,
is to bake into the common code the idea of two listeners - one for
new RAM regions, one for new PCI mappings, with the actual actions for
each case dependent on the IOMMU type.
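
Very roughly, and only as an untested standalone sketch: the enum names, the
ListenerOps struct and the table below are invented for illustration and do
not exist in hw/vfio/common.c. The per-type actions are a simplification of
what was said in this thread (Type1 maps RAM, spapr v1 maps guest IOMMU
space, spapr v2 preregisters RAM and maps guest IOMMU space).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical IOMMU types; the real code keys off the VFIO extension in use. */
typedef enum { IOMMU_TYPE1, IOMMU_SPAPR_V1, IOMMU_SPAPR_V2, IOMMU_NR } IommuType;

/* What a listener callback would do for a newly added section. */
typedef enum { ACT_NONE, ACT_MAP, ACT_PREREGISTER } Action;

typedef struct {
    Action on_ram;  /* listener attached to the RAM address space */
    Action on_pci;  /* listener attached to the PCI (guest IOMMU) address space */
} ListenerOps;

/* One row per IOMMU type: the same two listeners, different actions. */
static const ListenerOps ops_table[IOMMU_NR] = {
    [IOMMU_TYPE1]    = { .on_ram = ACT_MAP,         .on_pci = ACT_NONE },
    [IOMMU_SPAPR_V1] = { .on_ram = ACT_NONE,        .on_pci = ACT_MAP  },
    [IOMMU_SPAPR_V2] = { .on_ram = ACT_PREREGISTER, .on_pci = ACT_MAP  },
};

/* Action to take when the RAM listener sees a new region. */
static Action ram_region_added(IommuType type)
{
    assert(type < IOMMU_NR);
    return ops_table[type].on_ram;
}

/* Action to take when the PCI listener sees a new mapping. */
static Action pci_mapping_added(IommuType type)
{
    assert(type < IOMMU_NR);
    return ops_table[type].on_pci;
}
```

The point is only that the common code would own both listeners, and the
IOMMU type would select what each callback does.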

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v10 13/14] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2015-07-07  0:29         ` David Gibson
@ 2015-07-07  0:36           ` Alexey Kardashevskiy
  0 siblings, 0 replies; 71+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-07  0:36 UTC (permalink / raw)
  To: David Gibson, Alex Williamson
  Cc: Michael Roth, qemu-ppc, qemu-devel, Gavin Shan

On 07/07/2015 10:29 AM, David Gibson wrote:
> On Mon, Jul 06, 2015 at 10:13:07AM -0600, Alex Williamson wrote:
>> On Tue, 2015-07-07 at 01:34 +1000, Alexey Kardashevskiy wrote:
>>> On 07/06/2015 11:42 PM, Alex Williamson wrote:
>>>> On Mon, 2015-07-06 at 12:11 +1000, Alexey Kardashevskiy wrote:
>>>>> This makes use of the new "memory registering" feature. The idea is
>>>>> to provide the userspace ability to notify the host kernel about pages
>>>>> which are going to be used for DMA. Having this information, the host
>>>>> kernel can pin them all once per user process, do locked pages
>>>>> accounting (once) and not spend time on doing that in real time with
>>>>> possible failures which cannot be handled nicely in some cases.
>>>>>
>>>>> This adds a guest RAM memory listener which notifies a VFIO container
>>>>> about memory which needs to be pinned/unpinned. VFIO MMIO regions
>>>>> (i.e. "skip dump" regions) are skipped.
>>>>>
>>>>> The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
>>>>> are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
>>>>> not call it when v2 is detected and enabled.
>>>>>
>>>>> This does not change the guest visible interface.
>>>>>
>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>>>>> ---
>>>>> Changes:
>>>>> v9:
>>>>> * since there is no more SPAPR-specific data in container::iommu_data,
>>>>> the memory preregistration fields are common and potentially can be used
>>>>> by other architectures
>>>>>
>>>>> v7:
>>>>> * in vfio_spapr_ram_listener_region_del(), do unref() after ioctl()
>>>>> * s'ramlistener'register_listener'
>>>>>
>>>>> v6:
>>>>> * fixed commit log (s/guest/userspace/), added note about no guest visible
>>>>> change
>>>>> * fixed error checking if ram registration failed
>>>>> * added alignment check for section->offset_within_region
>>>>>
>>>>> v5:
>>>>> * simplified the patch
>>>>> * added trace points
>>>>> * added round_up() for the size
>>>>> * SPAPR IOMMU v2 used
>>>>> ---
>>>>>    hw/vfio/common.c              | 109 ++++++++++++++++++++++++++++++++++++++----
>>>>>    include/hw/vfio/vfio-common.h |   3 ++
>>>>>    trace-events                  |   1 +
>>>>>    3 files changed, 104 insertions(+), 9 deletions(-)
>>>>>
>>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>>> index 8eacfd7..0c7ba8c 100644
>>>>> --- a/hw/vfio/common.c
>>>>> +++ b/hw/vfio/common.c
>>>>> @@ -488,6 +488,76 @@ static void vfio_listener_release(VFIOContainer *container)
>>>>>        memory_listener_unregister(&container->iommu_data.type1.listener);
>>>>>    }
>>>>>
>>>>> +static void vfio_ram_do_region(VFIOContainer *container,
>>>>> +                              MemoryRegionSection *section, unsigned long req)
>>>>> +{
>>>>> +    int ret;
>>>>> +    struct vfio_iommu_spapr_register_memory reg = { .argsz = sizeof(reg) };
>>>>
>>>> This function is not as general as the name would imply, it's spapr
>>>> specific due to this.  How about vfio_spapr_register_memory() with a
>>>> bool parameter toggling register vs unregister so we're not passing an
>>>> arbitrary ioctl number?
>>>
>>> Ok. Although I am quite often asked not to do such a thing and rather add 2
>>> helpers (reg/unreg, do/undo, etc) instead and reuse common bits.
>>
>> I'm not a fan of functions that do the reverse process based on a bool
>> arg either, but I dislike them less than passing an arbitrary ioctl
>> number for a parameter.  The former is ugly, but the latter is difficult
>> to use and difficult to maintain because it would be subtle later to
>> spot an unsupported ioctl being passed to the function.
>>
>>>>> +
>>>>> +    if (!memory_region_is_ram(section->mr) ||
>>>>> +        memory_region_is_skip_dump(section->mr)) {
>>>>> +        return;
>>>>> +    }
>>>>> +
>>>>> +    if (unlikely((section->offset_within_region & (getpagesize() - 1)))) {
>>>>
>>>> s/getpagesize()/qemu_real_host_page_size/?
>>>
>>>
>>> Oh, right, I guess it reached upstream now.
>>>
>>>
>>>>> +        error_report("%s received unaligned region", __func__);
>>>>> +        return;
>>>>> +    }
>>>>> +
>>>>> +    reg.vaddr = (__u64) memory_region_get_ram_ptr(section->mr) +
>>>>> +        section->offset_within_region;
>>>>> +    reg.size = ROUND_UP(int128_get64(section->size), TARGET_PAGE_SIZE);
>>>>> +
>>>>> +    ret = ioctl(container->fd, req, &reg);
>>>>> +    trace_vfio_ram_register(_IOC_NR(req) - VFIO_BASE, reg.vaddr, reg.size,
>>>>> +            ret ? -errno : 0);
>>>>> +    if (!ret) {
>>>>> +        return;
>>>>> +    }
>>>>> +
>>>>> +    /*
>>>>> +     * On the initfn path, store the first error in the container so we
>>>>> +     * can gracefully fail.  Runtime, there's not much we can do other
>>>>> +     * than throw a hardware error.
>>>>> +     */
>>>>> +    if (!container->iommu_data.ram_reg_initialized) {
>>>>> +        if (!container->iommu_data.ram_reg_error) {
>>>>> +            container->iommu_data.ram_reg_error = -errno;
>>>>> +        }
>>>>> +    } else {
>>>>> +        hw_error("vfio: RAM registering failed, unable to continue");
>>>>> +    }
>>>>
>>>> I'd rather see:
>>>>
>>>> if (ret) {
>>>>     if (!container...) {
>>>>       ...
>>>>     } else {
>>>>       ...
>>>>     }
>>>> }
>>>>
>>>> Exiting early on success and otherwise falling into error handling is a
>>>> strange code flow.
>>>
>>> Ok... vfio_dma_map() does not follow this rule so I thought it is not that
>>> strict :)
>>
>> It would be nice to clean it up there too.
>>
>>>>> +}
>>>>> +
>>>>> +static void vfio_ram_listener_region_add(MemoryListener *listener,
>>>>> +                                         MemoryRegionSection *section)
>>>>> +{
>>>>> +    VFIOContainer *container = container_of(listener, VFIOContainer,
>>>>> +                                            iommu_data.register_listener);
>>>>> +    memory_region_ref(section->mr);
>>>>> +    vfio_ram_do_region(container, section, VFIO_IOMMU_SPAPR_REGISTER_MEMORY);
>>>>
>>>> vfio_spapr_register_memory(container, section, true);
>>>>
>>>>> +}
>>>>> +
>>>>> +static void vfio_ram_listener_region_del(MemoryListener *listener,
>>>>> +                                         MemoryRegionSection *section)
>>>>> +{
>>>>> +    VFIOContainer *container = container_of(listener, VFIOContainer,
>>>>> +                                            iommu_data.register_listener);
>>>>> +    vfio_ram_do_region(container, section, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY);
>>>>
>>>> vfio_spapr_register_memory(container, section, false);
>>>>
>>>>> +    memory_region_unref(section->mr);
>>>>> +}
>>>>> +
>>>>> +static const MemoryListener vfio_ram_memory_listener = {
>>>>> +    .region_add = vfio_ram_listener_region_add,
>>>>> +    .region_del = vfio_ram_listener_region_del,
>>>>> +};
>>>>
>>>> These are all spapr specific, please reflect that in the name;
>>>> vfio_spapr_v2_memory_listener, vfio_spapr_v2_listener_add/del.
>>>
>>> ok.
>>>
>>>
>>>> Actually, can't we determine what type of IOMMU we have and make the
>>>> existing MemoryListener handle either type1 or spapr or spapr-v2?
>>>
>>>
>>> Sorry, I do not follow you here. How? The existing listener listens on PCI
>>> address space (at least, on pseries), new one listens on RAM address space
>>> (address_space_memory). What do I miss?
>>
>> Isn't that simply a difference of the address space the listener is
>> attached to?  Type1 maps RAM, spapr-v1 maps guest IOMMU space and these
>> are already both handled by the same listener.
>
> I think what you're missing is that the spapr code now needs to listen
> on *both* the RAM and PCI address spaces.  On RAM so it can do the
> preregistration, and on PCI so it can do the actual IOMMU mappings.

We had a chat with Alex. On x86 this listener already listens on RAM (as
the fallback result of pci_device_iommu_address_space()); it is SPAPR that
does not. So now the plan is to keep using the same listener for both RAM
and PCI but without the filter, and do the filtering in the callbacks.
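
To make "filtering in the callbacks" concrete, here is a minimal sketch of
how one callback could decide what to do per section. It is only one possible
reading of the plan above: the types, enum values and the function name
region_add_action() are stand-ins invented for illustration and are not QEMU
definitions.

```c
#include <stdbool.h>

/* Stand-in types for illustration only; these are not QEMU definitions. */
enum iommu_type { IOMMU_TYPE1, IOMMU_SPAPR_V1, IOMMU_SPAPR_V2 };

struct section {
    bool is_ram;    /* backed by guest RAM */
    bool is_iommu;  /* guest-visible IOMMU region on the PCI address space */
    bool skip_dump; /* e.g. VFIO MMIO, never mapped or registered */
};

enum action { ACT_IGNORE, ACT_DMA_MAP, ACT_PREREGISTER, ACT_IOMMU_NOTIFY };

/*
 * Single region_add-style decision for all IOMMU types.  The listener is
 * attached without an address-space filter, so the filtering happens here.
 */
static enum action region_add_action(enum iommu_type type,
                                     const struct section *s)
{
    if (s->is_iommu) {
        return ACT_IOMMU_NOTIFY;   /* mappings follow guest TCE updates */
    }
    if (!s->is_ram || s->skip_dump) {
        return ACT_IGNORE;
    }
    switch (type) {
    case IOMMU_TYPE1:
        return ACT_DMA_MAP;        /* type1 maps RAM directly */
    case IOMMU_SPAPR_V2:
        return ACT_PREREGISTER;    /* v2 preregisters RAM for pinning */
    default:
        return ACT_IGNORE;         /* spapr v1 does nothing for plain RAM */
    }
}
```

In this sketch the real region_add/region_del callbacks would call such a
helper first and then issue the matching ioctl or register the IOMMU notifier.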


> What might make sense, although it might be better as a later cleanup
> is to bake into the common code the idea of two listeners - one for
> new RAM regions, one for new PCI mappings, with the actual actions for
> each case dependent on the IOMMU type.

Maybe one day. Do we have any other arch with a guest-visible IOMMU coming?
Preferably x86?


-- 
Alexey


* Re: [Qemu-devel] [PATCH qemu v10 14/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 14/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) Alexey Kardashevskiy
  2015-07-06 11:06   ` David Gibson
@ 2015-07-07  4:58   ` David Gibson
  2015-07-07  9:33   ` Thomas Huth
  2 siblings, 0 replies; 71+ messages in thread
From: David Gibson @ 2015-07-07  4:58 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Michael Roth, Alex Williamson, qemu-ppc, qemu-devel, Gavin Shan


On Mon, Jul 06, 2015 at 12:11:10PM +1000, Alexey Kardashevskiy wrote:
> This adds support for Dynamic DMA Windows (DDW) option defined by
> the SPAPR specification which allows to have additional DMA window(s)
> 
> This implements DDW for emulated and VFIO devices. As all TCE root regions
> are mapped at 0 and 64bit long (and actual tables are child regions),
> this replaces memory_region_add_subregion() with _overlap() to make
> QEMU memory API happy.
> 
> This reserves RTAS token numbers for DDW calls.
> 
> This implements helpers to interact with VFIO kernel interface.
> 
> This changes the TCE table migration descriptor to support dynamic
> tables as from now on, PHB will create as many stub TCE table objects
> as PHB can possibly support but not all of them might be initialized at
> the time of migration because DDW might or might not be requested by
> the guest.
> 
> The "ddw" property is enabled by default on a PHB but for compatibility
> the pseries-2.3 machine and older disable it.
> 
> This implements DDW for VFIO. The host kernel support is required.
> This adds a "levels" property to PHB to control the number of levels
> in the actual TCE table allocated by the host kernel, 0 is the default
> value to tell QEMU to calculate the correct value. Current hardware
> supports up to 5 levels.
> 
> The existing linux guests try creating one additional huge DMA window
> with 64K or 16MB pages and map the entire guest RAM to. If succeeded,
> the guest switches to dma_direct_ops and never calls TCE hypercalls
> (H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
> and not waste time on map/unmap later. This adds a "dma64_win_addr"
> property which is a bus address for the 64bit window and by default
> set to 0x800.0000.0000.0000 as this is what the modern POWER8 hardware
> uses and this allows having emulated and VFIO devices on the same bus.
> 
> This adds 4 RTAS handlers:
> * ibm,query-pe-dma-window
> * ibm,create-pe-dma-window
> * ibm,remove-pe-dma-window
> * ibm,reset-pe-dma-window
> These are registered from type_init() callback.
> 
> These RTAS handlers are implemented in a separate file to avoid polluting
> spapr_iommu.c with PCI.


> diff --git a/trace-events b/trace-events
> index b300e94..a1234dd 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1302,6 +1302,8 @@ spapr_pci_lsi_set(const char *busname, int pin, uint32_t irq) "%s PIN%d IRQ %u"
>  spapr_pci_msi_retry(unsigned config_addr, unsigned req_num, unsigned max_irqs) "Guest device at %x asked %u, have only %u"
>  spapr_pci_dma_update(uint64_t liobn, long ret) "liobn=%"PRIx64" ret=%ld"
>  spapr_pci_dma_realloc_update(uint64_t liobn, long ret) "liobn=%"PRIx64" tcet=%ld"
> +spapr_pci_vfio_init_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x winsize=0x%"PRIx64" offset=0x%"PRIx64
> +spapr_pci_vfio_remove_window(uint64_t off) "offset=%"PRIx64
>  
>  # hw/pci/pci.c
>  pci_update_mappings_del(void *d, uint32_t bus, uint32_t func, uint32_t slot, int bar, uint64_t addr, uint64_t size) "d=%p %02x:%02x.%x %d,%#"PRIx64"+%#"PRIx64
> @@ -1365,6 +1367,10 @@ spapr_iommu_pci_indirect(uint64_t liobn, uint64_t ioba, uint64_t tce, uint64_t i
>  spapr_iommu_pci_stuff(uint64_t liobn, uint64_t ioba, uint64_t tce_value, uint64_t npages, uint64_t ret) "liobn=%"PRIx64" ioba=0x%"PRIx64" tcevalue=0x%"PRIx64" npages=%"PRId64" ret=%"PRId64
>  spapr_iommu_xlate(uint64_t liobn, uint64_t ioba, uint64_t tce, unsigned perm, unsigned pgsize) "liobn=%"PRIx64" 0x%"PRIx64" -> 0x%"PRIx64" perm=%u mask=%x"
>  spapr_iommu_alloc_table(uint64_t liobn, void *table, int fd) "liobn=%"PRIx64" table=%p fd=%d"
> +spapr_iommu_ddw_query(uint64_t buid, uint32_t cfgaddr, unsigned wa, uint64_t win_size, uint32_t pgmask) "buid=%"PRIx64" addr=%"PRIx32", %u windows available, max window size=%"PRIx64", mask=%"PRIx32

Turns out the dtrace trace backend barfs on the "long long" here :(

$ ./configure --target-list=ppc64-softmmu --enable-trace-backends=dtrace
[...]
$ make
  GEN   config-host.h
  GEN   trace/generated-tracers.h
  GEN   trace/generated-tracers-dtrace.dtrace
  GEN   trace/generated-tracers-dtrace.h
Warning: /bin/dtrace:trace/generated-tracers-dtrace.dtrace:2212: syntax	error near:
probe spapr_iommu_ddw_query

Warning: Proceeding as if --no-pyparsing was given.

  GEN   trace/generated-tcg-tracers.h
  GEN   trace/generated-helpers-wrappers.h
  GEN   trace/generated-helpers.h
  CC    trace/generated-events.o
  GEN   trace/generated-tracers-dtrace.o
Warning: /bin/dtrace:trace/generated-tracers-dtrace.dtrace:2212: syntax error near:
probe spapr_iommu_ddw_query

Warning: Proceeding as if --no-pyparsing was given.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [PATCH qemu v10 11/14] spapr_pci_vfio: Enable multiple groups per container
  2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 11/14] spapr_pci_vfio: Enable multiple groups per container Alexey Kardashevskiy
@ 2015-07-07  7:02   ` Thomas Huth
  0 siblings, 0 replies; 71+ messages in thread
From: Thomas Huth @ 2015-07-07  7:02 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Michael Roth, qemu-devel, Gavin Shan, Alex Williamson, qemu-ppc,
	David Gibson

On Mon,  6 Jul 2015 12:11:07 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> This enables multiple IOMMU groups in one VFIO container which means
> that multiple devices from different groups can share the same IOMMU
> table (or tables if DDW).
> 
> This removes a group id from vfio_container_ioctl(). The kernel support
> is required for this; if the host kernel does not have the support,
> it will allow only one group per container. The PHB's "iommuid" property
> is ignored. The ioctl is called for every container attached to
> the address space. At the moment there is just one container anyway.
> 
> If there is no container attached to the address space,
> vfio_container_do_ioctl() returns -1.
> 
> This removes casts to sPAPRPHBVFIOState as none of sPAPRPHBVFIOState
> members is accessed here.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  hw/ppc/spapr_pci_vfio.c | 17 ++++++-----------
>  hw/vfio/common.c        | 20 ++++++--------------
>  include/hw/vfio/vfio.h  |  2 +-
>  3 files changed, 13 insertions(+), 26 deletions(-)
...
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index b1045da..89ef37b 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -918,34 +918,26 @@ void vfio_put_base_device(VFIODevice *vbasedev)
>      close(vbasedev->fd);
>  }
>  
> -static int vfio_container_do_ioctl(AddressSpace *as, int32_t groupid,
> +static int vfio_container_do_ioctl(AddressSpace *as,
>                                     int req, void *param)
>  {
> -    VFIOGroup *group;
>      VFIOContainer *container;
>      int ret = -1;
> +    VFIOAddressSpace *space = vfio_get_address_space(as);
>  
> -    group = vfio_get_group(groupid, as);
> -    if (!group) {
> -        error_report("vfio: group %d not registered", groupid);
> -        return ret;
> -    }
> -
> -    container = group->container;
> -    if (group->container) {
> +    QLIST_FOREACH(container, &space->containers, next) {
>          ret = ioctl(container->fd, req, param);
>          if (ret < 0) {
>              error_report("vfio: failed to ioctl %d to container: ret=%d, %s",
>                           _IOC_NR(req) - VFIO_BASE, ret, strerror(errno));
> +            return -errno;
>          }
>      }
>  
> -    vfio_put_group(group);
> -
>      return ret;
>  }
>  
> -int vfio_container_ioctl(AddressSpace *as, int32_t groupid,
> +int vfio_container_ioctl(AddressSpace *as,
>                           int req, void *param)

You could easily fit that into one line now.

>  {
>      /* We allow only certain ioctls to the container */
> @@ -960,5 +952,5 @@ int vfio_container_ioctl(AddressSpace *as, int32_t groupid,
>          return -1;
>      }
>  
> -    return vfio_container_do_ioctl(as, groupid, req, param);
> +    return vfio_container_do_ioctl(as, req, param);
>  }
> diff --git a/include/hw/vfio/vfio.h b/include/hw/vfio/vfio.h
> index 0b26cd8..76b5744 100644
> --- a/include/hw/vfio/vfio.h
> +++ b/include/hw/vfio/vfio.h
> @@ -3,7 +3,7 @@
>  
>  #include "qemu/typedefs.h"
>  
> -extern int vfio_container_ioctl(AddressSpace *as, int32_t groupid,
> +extern int vfio_container_ioctl(AddressSpace *as,
>                                  int req, void *param);

Ditto.

Apart from the two cosmetic nits, patch looks fine to me:

Reviewed-by: Thomas Huth <thuth@redhat.com>


* Re: [Qemu-devel] [PATCH qemu v10 13/14] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 13/14] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering) Alexey Kardashevskiy
  2015-07-06 13:42   ` Alex Williamson
@ 2015-07-07  7:23   ` Thomas Huth
  2015-07-07 10:05     ` Alexey Kardashevskiy
  1 sibling, 1 reply; 71+ messages in thread
From: Thomas Huth @ 2015-07-07  7:23 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Michael Roth, qemu-devel, Gavin Shan, Alex Williamson, qemu-ppc,
	David Gibson

On Mon,  6 Jul 2015 12:11:09 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> This makes use of the new "memory registering" feature. The idea is
> to provide the userspace ability to notify the host kernel about pages
> which are going to be used for DMA. Having this information, the host
> kernel can pin them all once per user process, do locked pages
> accounting (once) and not spent time on doing that in real time with
> possible failures which cannot be handled nicely in some cases.
> 
> This adds a guest RAM memory listener which notifies a VFIO container
> about memory which needs to be pinned/unpinned. VFIO MMIO regions
> (i.e. "skip dump" regions) are skipped.
> 
> The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
> are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
> not call it when v2 is detected and enabled.
> 
> This does not change the guest visible interface.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> Changes:
> v9:
> * since there is no more SPAPR-specific data in container::iommu_data,
> the memory preregistration fields are common and potentially can be used
> by other architectures
> 
> v7:
> * in vfio_spapr_ram_listener_region_del(), do unref() after ioctl()
> * s'ramlistener'register_listener'
> 
> v6:
> * fixed commit log (s/guest/userspace/), added note about no guest visible
> change
> * fixed error checking if ram registration failed
> * added alignment check for section->offset_within_region
> 
> v5:
> * simplified the patch
> * added trace points
> * added round_up() for the size
> * SPAPR IOMMU v2 used
> ---
>  hw/vfio/common.c              | 109 ++++++++++++++++++++++++++++++++++++++----
>  include/hw/vfio/vfio-common.h |   3 ++
>  trace-events                  |   1 +
>  3 files changed, 104 insertions(+), 9 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 8eacfd7..0c7ba8c 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -488,6 +488,76 @@ static void vfio_listener_release(VFIOContainer *container)
>      memory_listener_unregister(&container->iommu_data.type1.listener);
>  }
>  
> +static void vfio_ram_do_region(VFIOContainer *container,
> +                              MemoryRegionSection *section, unsigned long req)
> +{
> +    int ret;
> +    struct vfio_iommu_spapr_register_memory reg = { .argsz = sizeof(reg) };
> +
> +    if (!memory_region_is_ram(section->mr) ||
> +        memory_region_is_skip_dump(section->mr)) {
> +        return;
> +    }
> +
> +    if (unlikely((section->offset_within_region & (getpagesize() - 1)))) {
> +        error_report("%s received unaligned region", __func__);
> +        return;
> +    }
> +
> +    reg.vaddr = (__u64) memory_region_get_ram_ptr(section->mr) +

We're in userspace here ... I think it would be better to use uint64_t
instead of the kernel-type __u64.
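
A minimal sketch of the userspace-typed conversion, assuming the target field
is a 64-bit unsigned integer. The helper name ptr_to_u64() is hypothetical and
not something from the patch; going through uintptr_t is my addition, since it
keeps the pointer-to-integer cast well defined on 32-bit hosts too.

```c
#include <stdint.h>

/*
 * Hypothetical helper: convert a host pointer into the 64-bit integer
 * that an ioctl argument struct carries.  The intermediate uintptr_t
 * cast avoids converting a pointer directly to a wider integer type.
 */
static inline uint64_t ptr_to_u64(const void *p)
{
    return (uint64_t)(uintptr_t)p;
}
```

With such a helper, the assignment would add section->offset_within_region to
the converted pointer instead of casting with __u64.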

> +        section->offset_within_region;
> +    reg.size = ROUND_UP(int128_get64(section->size), TARGET_PAGE_SIZE);
> +
> +    ret = ioctl(container->fd, req, &reg);
> +    trace_vfio_ram_register(_IOC_NR(req) - VFIO_BASE, reg.vaddr, reg.size,
> +            ret ? -errno : 0);
> +    if (!ret) {
> +        return;
> +    }
> +
> +    /*
> +     * On the initfn path, store the first error in the container so we
> +     * can gracefully fail.  Runtime, there's not much we can do other
> +     * than throw a hardware error.
> +     */
> +    if (!container->iommu_data.ram_reg_initialized) {
> +        if (!container->iommu_data.ram_reg_error) {
> +            container->iommu_data.ram_reg_error = -errno;
> +        }
> +    } else {
> +        hw_error("vfio: RAM registering failed, unable to continue");
> +    }
> +}
> +
> +static void vfio_ram_listener_region_add(MemoryListener *listener,
> +                                         MemoryRegionSection *section)
> +{
> +    VFIOContainer *container = container_of(listener, VFIOContainer,
> +                                            iommu_data.register_listener);
> +    memory_region_ref(section->mr);
> +    vfio_ram_do_region(container, section, VFIO_IOMMU_SPAPR_REGISTER_MEMORY);
> +}
> +
> +static void vfio_ram_listener_region_del(MemoryListener *listener,
> +                                         MemoryRegionSection *section)
> +{
> +    VFIOContainer *container = container_of(listener, VFIOContainer,
> +                                            iommu_data.register_listener);
> +    vfio_ram_do_region(container, section, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY);
> +    memory_region_unref(section->mr);
> +}
> +
> +static const MemoryListener vfio_ram_memory_listener = {
> +    .region_add = vfio_ram_listener_region_add,
> +    .region_del = vfio_ram_listener_region_del,
> +};
> +
> +static void vfio_spapr_listener_release_v2(VFIOContainer *container)
> +{
> +    memory_listener_unregister(&container->iommu_data.register_listener);
> +    vfio_listener_release(container);
> +}
> +
>  int vfio_mmap_region(Object *obj, VFIORegion *region,
>                       MemoryRegion *mem, MemoryRegion *submem,
>                       void **map, size_t size, off_t offset,
> @@ -698,14 +768,18 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>  
>          container->iommu_data.type1.initialized = true;
>  
> -    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
> +               ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
> +        bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU);

That "!!" sounds somewhat wrong here. I think you either want to check
for "ioctl() == 1" (because only in this case you can be sure that v2
is supported), or you can simply omit the "!!" because you're 100% sure
that the ioctl only returns 0 or 1 (and never a negative error code).
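
To illustrate the difference, here is a small sketch that takes the ioctl
return value as a plain int. Both function names are hypothetical; the only
assumption is the usual convention that a failing ioctl() returns -1.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: treat the extension as supported only when the
 * check returned exactly 1.  Both 0 (not supported) and -1 (the call
 * itself failed) yield false.
 */
static bool ext_supported(int ioctl_ret)
{
    return ioctl_ret == 1;
}

/* What "!!" does instead: any non-zero value, including -1, becomes true. */
static bool ext_supported_double_bang(int ioctl_ret)
{
    return !!ioctl_ret;
}
```

So with "!!", a failed check would select the v2 code path even though v2
support was never confirmed.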

>          ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
>          if (ret) {
>              error_report("vfio: failed to set group container: %m");
>              ret = -errno;
>              goto free_container_exit;
>          }
> -        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
> +        ret = ioctl(fd, VFIO_SET_IOMMU,
> +                v2 ? VFIO_SPAPR_TCE_v2_IOMMU : VFIO_SPAPR_TCE_IOMMU);
>          if (ret) {
>              error_report("vfio: failed to set iommu for container: %m");
>              ret = -errno;
> @@ -717,19 +791,36 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>           * when container fd is closed so we do not call it explicitly
>           * in this file.
>           */
> -        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> -        if (ret) {
> -            error_report("vfio: failed to enable container: %m");
> -            ret = -errno;
> -            goto free_container_exit;
> +        if (!v2) {
> +            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> +            if (ret) {
> +                error_report("vfio: failed to enable container: %m");
> +                ret = -errno;
> +                goto free_container_exit;
> +            }
>          }
>  
>          container->iommu_data.type1.listener = vfio_memory_listener;
> -        container->iommu_data.release = vfio_listener_release;
> -
>          memory_listener_register(&container->iommu_data.type1.listener,
>                                   container->space->as);
>  
> +        if (!v2) {
> +            container->iommu_data.release = vfio_listener_release;
> +        } else {
> +            container->iommu_data.release = vfio_spapr_listener_release_v2;
> +            container->iommu_data.register_listener =
> +                    vfio_ram_memory_listener;
> +            memory_listener_register(&container->iommu_data.register_listener,
> +                                     &address_space_memory);
> +
> +            if (container->iommu_data.ram_reg_error) {
> +                error_report("vfio: RAM memory listener initialization failed for container");

Line > 80 columns?

> +                goto listener_release_exit;
> +            }
> +
> +            container->iommu_data.ram_reg_initialized = true;
> +        }
> +

 Thomas


* Re: [Qemu-devel] [PATCH qemu v10 10/14] spapr_pci: Enable vfio-pci hotplug
  2015-07-06 21:31   ` Thomas Huth
@ 2015-07-07  9:28     ` Alexey Kardashevskiy
  0 siblings, 0 replies; 71+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-07  9:28 UTC (permalink / raw)
  To: Thomas Huth
  Cc: Michael Roth, qemu-devel, Gavin Shan, Alex Williamson, qemu-ppc,
	David Gibson

On 07/07/2015 07:31 AM, Thomas Huth wrote:
> On Mon,  6 Jul 2015 12:11:06 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>
>> sPAPR IOMMU is managing two copies of an TCE table:
>> 1) a guest view of the table - this is what emulated devices use and
>> this is where H_GET_TCE reads from;
>> 2) a hardware TCE table - only present if there is at least one vfio-pci
>> device on a PHB; it is updated via a memory listener on a PHB address
>> space which forwards map/unmap requests to vfio-pci IOMMU host driver.
>>
>> At the moment presence of vfio-pci devices on a bus affect the way
>> the guest view table is allocated. If there is no vfio-pci on a PHB
>> and the host kernel supports KVM acceleration of H_PUT_TCE, a table
>> is allocated in KVM. However, if there is vfio-pci and we do yet not
>> support KVM acceleration for these, the table has to be allocated
>> by the userspace.
>>
>> When vfio-pci device is hotplugged and there were no vfio-pci devices
>> already, the guest view table could have been allocated by KVM which
>> means that H_PUT_TCE is handled by the host kernel and since we
>> do not support vfio-pci in KVM, the hardware table will not be updated.
>>
>> This reallocates the guest view table in QEMU if the first vfio-pci
>> device has just been plugged. spapr_tce_realloc_userspace() handles this.
>
> I wonder whether it would help to improve the readability of the code
> later if you put the description of the function into the code instead
> of the commit message?


Not sure I understood how much of this commit log you'd like to see in the 
code. The function has some comments already...

>
>> This replays all the mappings to make sure that the tables are in sync.
>> This will not have a visible effect though as for a new device
>> the guest kernel will allocate-and-map new addresses and therefore
>> existing mappings from emulated devices will not be used by vfio-pci
>> devices.
>>
>> This adds calls to spapr_phb_dma_capabilities_update() in PCI hotplug
>> hooks.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
> ...
>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>> index 76c988f..d1fa157 100644
>> --- a/hw/ppc/spapr_pci.c
>> +++ b/hw/ppc/spapr_pci.c
>> @@ -827,6 +827,43 @@ int spapr_phb_dma_reset(sPAPRPHBState *sphb)
>>       return 0;
>>   }
>>
>> +static int spapr_phb_hotplug_dma_sync(sPAPRPHBState *sphb)
>> +{
>> +    int ret = 0, i;
>> +    bool had_vfio = sphb->has_vfio;
>> +    sPAPRTCETable *tcet;
>> +
>> +    spapr_phb_dma_capabilities_update(sphb);
>> +
>> +    if (!had_vfio && sphb->has_vfio) {
>
>      if (had_vfio || !sphb->has_vfio) {
>          return 0;
>      }
>
> ... and then you can save one level of indentation for the following
> for-loop.

Right. I was going to add another chunk later with "if", "had_vfio" and
"sphb->has_vfio", which is why the indentation is there. I'll remove it.

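For reference, the suggested early return is just the negation of the
original condition. A minimal sketch of the guard in isolation follows; the
helper name first_vfio_just_plugged() is hypothetical and not part of the
patch.

```c
#include <stdbool.h>

/*
 * Hypothetical helper, not part of the patch: true only when the first
 * vfio-pci device has just appeared on the PHB.
 */
static bool first_vfio_just_plugged(bool had_vfio, bool has_vfio)
{
    if (had_vfio || !has_vfio) {
        return false;   /* nothing changed, or still no vfio-pci */
    }
    return true;        /* same condition as !had_vfio && has_vfio */
}
```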

>> +        for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
>> +            tcet = spapr_tce_find_by_liobn(SPAPR_PCI_LIOBN(sphb->index, i));
>> +            if (!tcet || !tcet->enabled) {
>> +                continue;
>> +            }
>> +            if (tcet->fd >= 0) {
>> +                /*
>> +                 * We got first vfio-pci device on accelerated table.
>> +                 * VFIO acceleration is not possible.
>> +                 * Reallocate table in userspace and replay mappings.
>> +                 */
>> +                ret = spapr_tce_realloc_userspace(tcet, true);
>> +                trace_spapr_pci_dma_realloc_update(tcet->liobn, ret);
>> +            } else {
>> +                /* There was no acceleration, so just replay mappings. */
>> +                ret = spapr_tce_replay(tcet);
>> +                trace_spapr_pci_dma_update(tcet->liobn, ret);
>> +            }
>> +            if (ret) {
>> +                break;
>> +            }
>> +        }
>> +        return ret;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>>   /* Macros to operate with address in OF binding to PCI */
>>   #define b_x(x, p, l)    (((x) & ((1<<(l))-1)) << (p))
>>   #define b_n(x)          b_x((x), 31, 1) /* 0 if relocatable */
> ...
>> @@ -1130,6 +1174,9 @@ static void spapr_phb_remove_pci_device_cb(DeviceState *dev, void *opaque)
>>        */
>>       pci_device_reset(PCI_DEVICE(dev));
>>       object_unparent(OBJECT(dev));
>> +
>> +    /* Actual VFIO device release happens from RCU so postpone DMA update */
>> +    call_rcu1(&((sPAPRPHBState *)opaque)->rcu, spapr_phb_remove_sync_dma);
>
> Too much brackets again for my taste ;-)


Never too much! ;)


>
>>   }
>>
>
>   Thomas
>
>


-- 
Alexey


* Re: [Qemu-devel] [PATCH qemu v10 14/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 14/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) Alexey Kardashevskiy
  2015-07-06 11:06   ` David Gibson
  2015-07-07  4:58   ` David Gibson
@ 2015-07-07  9:33   ` Thomas Huth
  2015-07-07 10:43     ` Alexey Kardashevskiy
  2 siblings, 1 reply; 71+ messages in thread
From: Thomas Huth @ 2015-07-07  9:33 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Michael Roth, qemu-devel, Gavin Shan, Alex Williamson, qemu-ppc,
	David Gibson

On Mon,  6 Jul 2015 12:11:10 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> This adds support for Dynamic DMA Windows (DDW) option defined by
> the SPAPR specification which allows to have additional DMA window(s)
> 
> This implements DDW for emulated and VFIO devices. As all TCE root regions
> are mapped at 0 and 64bit long (and actual tables are child regions),
> this replaces memory_region_add_subregion() with _overlap() to make
> QEMU memory API happy.
> 
> This reserves RTAS token numbers for DDW calls.
> 
> This implements helpers to interact with VFIO kernel interface.
> 
> This changes the TCE table migration descriptor to support dynamic
> tables as from now on, PHB will create as many stub TCE table objects
> as PHB can possibly support but not all of them might be initialized at
> the time of migration because DDW might or might not be requested by
> the guest.
> 
> The "ddw" property is enabled by default on a PHB but for compatibility
> the pseries-2.3 machine and older disable it.
> 
> This implements DDW for VFIO. The host kernel support is required.
> This adds a "levels" property to PHB to control the number of levels
> in the actual TCE table allocated by the host kernel, 0 is the default
> value to tell QEMU to calculate the correct value. Current hardware
> supports up to 5 levels.
> 
> The existing linux guests try creating one additional huge DMA window
> with 64K or 16MB pages and map the entire guest RAM to. If succeeded,
> the guest switches to dma_direct_ops and never calls TCE hypercalls
> (H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
> and not waste time on map/unmap later. This adds a "dma64_win_addr"
> property which is a bus address for the 64bit window and by default
> set to 0x800.0000.0000.0000 as this is what the modern POWER8 hardware
> uses and this allows having emulated and VFIO devices on the same bus.
> 
> This adds 4 RTAS handlers:
> * ibm,query-pe-dma-window
> * ibm,create-pe-dma-window
> * ibm,remove-pe-dma-window
> * ibm,reset-pe-dma-window
> These are registered from type_init() callback.
> 
> These RTAS handlers are implemented in a separate file to avoid polluting
> spapr_iommu.c with PCI.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
...
> diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
> new file mode 100644
> index 0000000..7539c6a
> --- /dev/null
> +++ b/hw/ppc/spapr_rtas_ddw.c
> @@ -0,0 +1,300 @@
> +/*
> + * QEMU sPAPR Dynamic DMA windows support
> + *
> + * Copyright (c) 2014 Alexey Kardashevskiy, IBM Corporation.

Happy new year?

> + *  This program is free software; you can redistribute it and/or modify
> + *  it under the terms of the GNU General Public License as published by
> + *  the Free Software Foundation; either version 2 of the License,
> + *  or (at your option) any later version.
> + *
> + *  This program is distributed in the hope that it will be useful,
> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + *  GNU General Public License for more details.
> + *
> + *  You should have received a copy of the GNU General Public License
> + *  along with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "qemu/error-report.h"
> +#include "hw/ppc/spapr.h"
> +#include "hw/pci-host/spapr.h"
> +#include "trace.h"
> +
> +static int spapr_phb_get_active_win_num_cb(Object *child, void *opaque)
> +{
> +    sPAPRTCETable *tcet;
> +
> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
> +    if (tcet && tcet->enabled) {
> +        ++*(unsigned *)opaque;
> +    }
> +    return 0;
> +}
> +
> +static unsigned spapr_phb_get_active_win_num(sPAPRPHBState *sphb)
> +{
> +    unsigned ret = 0;
> +
> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_active_win_num_cb, &ret);
> +
> +    return ret;
> +}
> +
> +static int spapr_phb_get_free_liobn_cb(Object *child, void *opaque)
> +{
> +    sPAPRTCETable *tcet;
> +
> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
> +    if (tcet && !tcet->enabled) {
> +        *(uint32_t *)opaque = tcet->liobn;
> +        return 1;
> +    }
> +    return 0;
> +}
> +
> +static unsigned spapr_phb_get_free_liobn(sPAPRPHBState *sphb)
> +{
> +    uint32_t liobn = 0;
> +
> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_free_liobn_cb, &liobn);
> +
> +    return liobn;
> +}
> +
> +static uint32_t spapr_query_mask(struct ppc_one_seg_page_size *sps,
> +                                 uint64_t page_mask)
> +{
> +    int i, j;
> +    uint32_t mask = 0;
> +    const struct { int shift; uint32_t mask; } masks[] = {
> +        { 12, RTAS_DDW_PGSIZE_4K },
> +        { 16, RTAS_DDW_PGSIZE_64K },
> +        { 24, RTAS_DDW_PGSIZE_16M },
> +        { 25, RTAS_DDW_PGSIZE_32M },
> +        { 26, RTAS_DDW_PGSIZE_64M },
> +        { 27, RTAS_DDW_PGSIZE_128M },
> +        { 28, RTAS_DDW_PGSIZE_256M },
> +        { 34, RTAS_DDW_PGSIZE_16G },
> +    };
> +
> +    for (i = 0; i < PPC_PAGE_SIZES_MAX_SZ; i++) {
> +        for (j = 0; j < ARRAY_SIZE(masks); ++j) {
> +            if ((sps[i].page_shift == masks[j].shift) &&
> +                    (page_mask & (1ULL << masks[j].shift))) {
> +                mask |= masks[j].mask;
> +            }
> +        }
> +    }
> +
> +    return mask;
> +}
> +
> +static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
> +                                         sPAPRMachineState *spapr,
> +                                         uint32_t token, uint32_t nargs,
> +                                         target_ulong args,
> +                                         uint32_t nret, target_ulong rets)
> +{
> +    CPUPPCState *env = &cpu->env;
> +    sPAPRPHBState *sphb;
> +    uint64_t buid;
> +    uint32_t avail, addr, pgmask = 0;
> +    unsigned current;
> +
> +    if ((nargs != 3) || (nret != 5)) {
> +        goto param_error_exit;
> +    }
> +
> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> +    addr = rtas_ld(args, 0);
> +    sphb = spapr_pci_find_phb(spapr, buid);
> +    if (!sphb || !sphb->ddw_enabled) {
> +        goto param_error_exit;
> +    }
> +
> +    current = spapr_phb_get_active_win_num(sphb);
> +    avail = (sphb->windows_supported > current) ?
> +            (sphb->windows_supported - current) : 0;
> +
> +    /* Work out supported page masks */
> +    pgmask = spapr_query_mask(env->sps.sps, sphb->page_size_mask);
> +
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +    rtas_st(rets, 1, avail);
> +
> +    /*
> +     * This is "Largest contiguous block of TCEs allocated specifically
> +     * for (that is, are reserved for) this PE".
> +     * Return the maximum number as all RAM was in 4K pages.
> +     */
> +    rtas_st(rets, 2, sphb->dma64_window_size >> SPAPR_TCE_PAGE_SHIFT);
> +    rtas_st(rets, 3, pgmask);
> +    rtas_st(rets, 4, 0); /* DMA migration mask, not supported */
> +
> +    trace_spapr_iommu_ddw_query(buid, addr, avail, sphb->dma64_window_size,
> +                                pgmask);
> +    return;
> +
> +param_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> +}
> +
> +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
> +                                          sPAPRMachineState *spapr,
> +                                          uint32_t token, uint32_t nargs,
> +                                          target_ulong args,
> +                                          uint32_t nret, target_ulong rets)
> +{
> +    sPAPRPHBState *sphb;
> +    sPAPRTCETable *tcet = NULL;
> +    uint32_t addr, page_shift, window_shift, liobn;
> +    uint64_t buid;
> +    long ret;
> +
> +    if ((nargs != 5) || (nret != 4)) {

Pascal bracket style again :-(

> +        goto param_error_exit;
> +    }
> +
> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> +    addr = rtas_ld(args, 0);
> +    sphb = spapr_pci_find_phb(spapr, buid);
> +    if (!sphb || !sphb->ddw_enabled) {
> +        goto param_error_exit;
> +    }
> +
> +    page_shift = rtas_ld(args, 3);
> +    window_shift = rtas_ld(args, 4);
> +    liobn = spapr_phb_get_free_liobn(sphb);
> +
> +    if (!liobn || !(sphb->page_size_mask & (1ULL << page_shift))) {
> +        goto hw_error_exit;
> +    }
> +
> +    ret = spapr_phb_dma_init_window(sphb, liobn, page_shift,
> +                                    1ULL << window_shift);

As already mentioned in a comment on another patch in this series, I
think it would be better to do some sanity checks on the window_shift
value, too.
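
A minimal sketch of the kind of check meant here, not taken from the
patch. The helper name ddw_shifts_valid() and the max_window_size
parameter are hypothetical; the latter stands in for whatever per-PHB
limit applies (for example sphb->dma64_window_size). Both shifts come
straight from the guest via rtas_ld(), and shifting 1ULL by 64 or more
is undefined in C, so the range check comes first:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true when the guest-supplied shifts
 * describe a window that is safe to create.
 */
static bool ddw_shifts_valid(uint32_t page_shift, uint32_t window_shift,
                             uint64_t max_window_size)
{
    /* 1ULL << n is undefined for n >= 64, so reject that first. */
    if (page_shift >= 64 || window_shift >= 64) {
        return false;
    }
    /* A window must be able to hold at least one page. */
    if (window_shift < page_shift) {
        return false;
    }
    /* Do not accept a window larger than the assumed limit. */
    return (1ULL << window_shift) <= max_window_size;
}
```

With something like this, the handler could bail out to param_error_exit
before calling spapr_phb_dma_init_window(); whether a parameter error or
a hardware error is the right RTAS return code is a separate question.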

> +    tcet = spapr_tce_find_by_liobn(liobn);
> +    trace_spapr_iommu_ddw_create(buid, addr, 1ULL << page_shift,
> +                                 1ULL << window_shift,
> +                                 tcet ? tcet->bus_offset : 0xbaadf00d,
> +                                 liobn, ret);
> +    if (ret || !tcet) {
> +        goto hw_error_exit;
> +    }
> +
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +    rtas_st(rets, 1, liobn);
> +    rtas_st(rets, 2, tcet->bus_offset >> 32);
> +    rtas_st(rets, 3, tcet->bus_offset & ((uint32_t) -1));

Why don't you simply use 0xffffffff instead of ((uint32_t) -1) ?
That's shorter and much easier to understand at first glance than
calculating the type-cast in your brain ;-)
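
For illustration, a small sketch of splitting a 64-bit bus offset into
the two 32-bit RTAS return cells with an explicit mask. The helper names
bus_offset_hi() and bus_offset_lo() are made up for this example and do
not exist in the patch:

```c
#include <stdint.h>

/* Upper 32 bits of the bus offset (hypothetical helper). */
static uint32_t bus_offset_hi(uint64_t bus_offset)
{
    return (uint32_t)(bus_offset >> 32);
}

/* Lower 32 bits, masked with a literal instead of ((uint32_t) -1). */
static uint32_t bus_offset_lo(uint64_t bus_offset)
{
    return (uint32_t)(bus_offset & 0xffffffff);
}
```

Both spellings of the mask have the same value, so this is purely a
readability change.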

> +
> +    return;
> +
> +hw_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
> +    return;
> +
> +param_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> +}
> +
> +static void rtas_ibm_remove_pe_dma_window(PowerPCCPU *cpu,
> +                                          sPAPRMachineState *spapr,
> +                                          uint32_t token, uint32_t nargs,
> +                                          target_ulong args,
> +                                          uint32_t nret, target_ulong rets)
> +{
> +    sPAPRPHBState *sphb;
> +    sPAPRTCETable *tcet;
> +    uint32_t liobn;
> +    long ret;
> +
> +    if ((nargs != 1) || (nret != 1)) {

 (╯°□°)╯︵ ┻━┻

> +        goto param_error_exit;
> +    }
> +
> +    liobn = rtas_ld(args, 0);
> +    tcet = spapr_tce_find_by_liobn(liobn);
> +    if (!tcet) {
> +        goto param_error_exit;
> +    }
> +
> +    sphb = SPAPR_PCI_HOST_BRIDGE(OBJECT(tcet)->parent);
> +    if (!sphb || !sphb->ddw_enabled) {
> +        goto param_error_exit;
> +    }
> +
> +    ret = spapr_phb_dma_remove_window(sphb, tcet);
> +    trace_spapr_iommu_ddw_remove(liobn, ret);
> +    if (ret) {
> +        goto hw_error_exit;
> +    }
> +
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +    return;
> +
> +hw_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
> +    return;
> +
> +param_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> +}
> +
> +static void rtas_ibm_reset_pe_dma_window(PowerPCCPU *cpu,
> +                                         sPAPRMachineState *spapr,
> +                                         uint32_t token, uint32_t nargs,
> +                                         target_ulong args,
> +                                         uint32_t nret, target_ulong rets)
> +{
> +    sPAPRPHBState *sphb;
> +    uint64_t buid;
> +    uint32_t addr;
> +    long ret;
> +
> +    if ((nargs != 3) || (nret != 1)) {

 ┬─┬ ︵ /(.□. \)

> +        goto param_error_exit;
> +    }
> +
> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> +    addr = rtas_ld(args, 0);
> +    sphb = spapr_pci_find_phb(spapr, buid);
> +    if (!sphb || !sphb->ddw_enabled) {
> +        goto param_error_exit;
> +    }
> +
> +    ret = spapr_phb_dma_reset(sphb);
> +    trace_spapr_iommu_ddw_reset(buid, addr, ret);
> +    if (ret) {
> +        goto hw_error_exit;
> +    }
> +
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +
> +    return;
> +
> +hw_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
> +    return;
> +
> +param_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> +}
> +
> +static void spapr_rtas_ddw_init(void)
> +{
> +    spapr_rtas_register(RTAS_IBM_QUERY_PE_DMA_WINDOW,
> +                        "ibm,query-pe-dma-window",
> +                        rtas_ibm_query_pe_dma_window);
> +    spapr_rtas_register(RTAS_IBM_CREATE_PE_DMA_WINDOW,
> +                        "ibm,create-pe-dma-window",
> +                        rtas_ibm_create_pe_dma_window);
> +    spapr_rtas_register(RTAS_IBM_REMOVE_PE_DMA_WINDOW,
> +                        "ibm,remove-pe-dma-window",
> +                        rtas_ibm_remove_pe_dma_window);
> +    spapr_rtas_register(RTAS_IBM_RESET_PE_DMA_WINDOW,
> +                        "ibm,reset-pe-dma-window",
> +                        rtas_ibm_reset_pe_dma_window);
> +}
> +
> +type_init(spapr_rtas_ddw_init)

 Thomas


* Re: [Qemu-devel] [PATCH qemu v10 14/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2015-07-06 11:06   ` David Gibson
  2015-07-06 11:27     ` Alexey Kardashevskiy
@ 2015-07-07  9:46     ` Alexey Kardashevskiy
  1 sibling, 0 replies; 71+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-07  9:46 UTC (permalink / raw)
  To: David Gibson
  Cc: Michael Roth, Alex Williamson, qemu-ppc, qemu-devel, Gavin Shan

On 07/06/2015 09:06 PM, David Gibson wrote:
> On Mon, Jul 06, 2015 at 12:11:10PM +1000, Alexey Kardashevskiy wrote:
>> This adds support for Dynamic DMA Windows (DDW) option defined by
>> the SPAPR specification which allows to have additional DMA window(s)
>>
>> This implements DDW for emulated and VFIO devices. As all TCE root regions
>> are mapped at 0 and 64bit long (and actual tables are child regions),
>> this replaces memory_region_add_subregion() with _overlap() to make
>> QEMU memory API happy.
>>
>> This reserves RTAS token numbers for DDW calls.
>>
>> This implements helpers to interact with VFIO kernel interface.
>>
>> This changes the TCE table migration descriptor to support dynamic
>> tables as from now on, PHB will create as many stub TCE table objects
>> as PHB can possibly support but not all of them might be initialized at
>> the time of migration because DDW might or might not be requested by
>> the guest.
>>
>> The "ddw" property is enabled by default on a PHB but for compatibility
>> the pseries-2.3 machine and older disable it.
>>
>> This implements DDW for VFIO. The host kernel support is required.
>> This adds a "levels" property to PHB to control the number of levels
>> in the actual TCE table allocated by the host kernel, 0 is the default
>> value to tell QEMU to calculate the correct value. Current hardware
>> supports up to 5 levels.
>>
>> The existing linux guests try creating one additional huge DMA window
>> with 64K or 16MB pages and map the entire guest RAM to. If succeeded,
>> the guest switches to dma_direct_ops and never calls TCE hypercalls
>> (H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
>> and not waste time on map/unmap later. This adds a "dma64_win_addr"
>> property which is a bus address for the 64bit window and by default
>> set to 0x800.0000.0000.0000 as this is what the modern POWER8 hardware
>> uses and this allows having emulated and VFIO devices on the same bus.
>>
>> This adds 4 RTAS handlers:
>> * ibm,query-pe-dma-window
>> * ibm,create-pe-dma-window
>> * ibm,remove-pe-dma-window
>> * ibm,reset-pe-dma-window
>> These are registered from type_init() callback.
>>
>> These RTAS handlers are implemented in a separate file to avoid polluting
>> spapr_iommu.c with PCI.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>> Changes:
>> v10:
>> * added dma64_win_addr property to PHB
>> * removed redundand check for "!migtable" in spapr_tce_table_post_load()
>>
>> v9:
>> * fixed default 64bit window start (from mdroth)
>> * fixed type cast in dma window update code (from mdroth)
>> * spapr_phb_dma_update() now can fail and cause hotplug failure if
>> hardware TCE table cannot be mapped to the same bus address as the emulated one
>>
>> v7:
>> * fixed uninitialized variables
>>
>> v6:
>> * rework as there is no more special device for VFIO PHB
>>
>> v5:
>> * total rework
>> * enabled for machines >2.3
>> * fixed migration
>> * merged rtas handlers here
>>
>> v4:
>> * reset handler is back in generalized form
>>
>> v3:
>> * removed reset
>> * windows_num is now 1 or bigger rather than 0-based value and it is only
>> changed in PHB code, not in RTAS
>> * added page mask check in create()
>> * added SPAPR_PCI_DDW_MAX_WINDOWS to track how many windows are already
>> created
>>
>> v2:
>> * tested on hacked emulated E1000
>> * implemented DDW reset on the PHB reset
>> * spapr_pci_ddw_remove/spapr_pci_ddw_reset are public for reuse by VFIO
>> ---
>>   hw/ppc/Makefile.objs        |   3 +
>>   hw/ppc/spapr.c              |   5 +
>>   hw/ppc/spapr_iommu.c        |  32 ++++-
>>   hw/ppc/spapr_pci.c          | 110 ++++++++++++++--
>>   hw/ppc/spapr_pci_vfio.c     |  88 +++++++++++++
>>   hw/ppc/spapr_rtas_ddw.c     | 300 ++++++++++++++++++++++++++++++++++++++++++++
>>   hw/vfio/common.c            |   2 +
>>   include/hw/pci-host/spapr.h |  21 +++-
>>   include/hw/ppc/spapr.h      |  17 ++-
>>   trace-events                |   6 +
>>   10 files changed, 568 insertions(+), 16 deletions(-)
>>   create mode 100644 hw/ppc/spapr_rtas_ddw.c
>>
>> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
>> index c8ab06e..0b2ff6d 100644
>> --- a/hw/ppc/Makefile.objs
>> +++ b/hw/ppc/Makefile.objs
>> @@ -7,6 +7,9 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o
>>   ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
>>   obj-y += spapr_pci_vfio.o
>>   endif
>> +ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES), yy)
>> +obj-y += spapr_rtas_ddw.o
>> +endif
>>   # PowerPC 4xx boards
>>   obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
>>   obj-y += ppc4xx_pci.o
>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>> index 5ca817c..d50d50b 100644
>> --- a/hw/ppc/spapr.c
>> +++ b/hw/ppc/spapr.c
>> @@ -1860,6 +1860,11 @@ static const TypeInfo spapr_machine_info = {
>>               .driver   = "spapr-pci-host-bridge",\
>>               .property = "dynamic-reconfiguration",\
>>               .value    = "off",\
>> +        },\
>> +        {\
>> +            .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
>> +            .property = "ddw",\
>> +            .value    = stringify(off),\
>>           },
>>
>>   #define SPAPR_COMPAT_2_2 \
>> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
>> index 2d99c3b..b54c3d8 100644
>> --- a/hw/ppc/spapr_iommu.c
>> +++ b/hw/ppc/spapr_iommu.c
>> @@ -136,6 +136,15 @@ static IOMMUTLBEntry spapr_tce_translate_iommu(MemoryRegion *iommu, hwaddr addr,
>>       return ret;
>>   }
>>
>> +static void spapr_tce_table_pre_save(void *opaque)
>> +{
>> +    sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
>> +
>> +    tcet->migtable = tcet->table;
>> +}
>> +
>> +static void spapr_tce_table_do_enable(sPAPRTCETable *tcet, bool vfio_accel);
>> +
>>   static int spapr_tce_table_post_load(void *opaque, int version_id)
>>   {
>>       sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
>> @@ -144,22 +153,39 @@ static int spapr_tce_table_post_load(void *opaque, int version_id)
>>           spapr_vio_set_bypass(tcet->vdev, tcet->bypass);
>>       }
>>
>> +    if (tcet->enabled) {
>> +        if (!tcet->table) {
>> +            tcet->enabled = false;
>> +            /* VFIO does not migrate so pass vfio_accel == false */
>> +            spapr_tce_table_do_enable(tcet, false);
>> +        }
>> +        memcpy(tcet->table, tcet->migtable,
>> +               tcet->nb_table * sizeof(tcet->table[0]));
>> +        free(tcet->migtable);
>> +        tcet->migtable = NULL;
>> +    }
>> +
>>       return 0;
>>   }
>>
>>   static const VMStateDescription vmstate_spapr_tce_table = {
>>       .name = "spapr_iommu",
>> -    .version_id = 2,
>> +    .version_id = 3,
>>       .minimum_version_id = 2,
>> +    .pre_save = spapr_tce_table_pre_save,
>>       .post_load = spapr_tce_table_post_load,
>>       .fields      = (VMStateField []) {
>>           /* Sanity check */
>>           VMSTATE_UINT32_EQUAL(liobn, sPAPRTCETable),
>> -        VMSTATE_UINT32_EQUAL(nb_table, sPAPRTCETable),
>>
>>           /* IOMMU state */
>> +        VMSTATE_BOOL_V(enabled, sPAPRTCETable, 3),
>> +        VMSTATE_UINT64_V(bus_offset, sPAPRTCETable, 3),
>> +        VMSTATE_UINT32_V(page_shift, sPAPRTCETable, 3),
>> +        VMSTATE_UINT32(nb_table, sPAPRTCETable),
>>           VMSTATE_BOOL(bypass, sPAPRTCETable),
>> -        VMSTATE_VARRAY_UINT32(table, sPAPRTCETable, nb_table, 0, vmstate_info_uint64, uint64_t),
>> +        VMSTATE_VARRAY_UINT32_ALLOC(migtable, sPAPRTCETable, nb_table, 0,
>> +                                    vmstate_info_uint64, uint64_t),
>>
>>           VMSTATE_END_OF_LIST()
>>       },
>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>> index d1fa157..b7113b5 100644
>> --- a/hw/ppc/spapr_pci.c
>> +++ b/hw/ppc/spapr_pci.c
>> @@ -778,6 +778,9 @@ static int spapr_phb_dma_capabilities_update(sPAPRPHBState *sphb)
>>
>>       sphb->dma32_window_start = 0;
>>       sphb->dma32_window_size = SPAPR_PCI_DMA32_SIZE;
>> +    sphb->windows_supported = SPAPR_PCI_DMA_MAX_WINDOWS;
>> +    sphb->page_size_mask = (1ULL << 12) | (1ULL << 16) | (1ULL << 24);
>> +    sphb->dma64_window_size = pow2ceil(ram_size);
>>
>>       ret = spapr_phb_vfio_dma_capabilities_update(sphb);
>>       sphb->has_vfio = (ret == 0);
>> @@ -785,12 +788,35 @@ static int spapr_phb_dma_capabilities_update(sPAPRPHBState *sphb)
>>       return 0;
>>   }
>>
>> -static int spapr_phb_dma_init_window(sPAPRPHBState *sphb,
>> -                                     uint32_t liobn, uint32_t page_shift,
>> -                                     uint64_t window_size)
>> +int spapr_phb_dma_init_window(sPAPRPHBState *sphb,
>> +                              uint32_t liobn, uint32_t page_shift,
>> +                              uint64_t window_size)
>>   {
>>       uint64_t bus_offset = sphb->dma32_window_start;
>>       sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
>> +    int ret;
>> +
>> +    if (SPAPR_PCI_DMA_WINDOW_NUM(liobn) && !sphb->ddw_enabled) {
>> +        return -1;
>> +    }
>> +
>> +    if (sphb->ddw_enabled) {
>> +        if (sphb->has_vfio) {
>> +            ret = spapr_phb_vfio_dma_init_window(sphb,
>> +                                                 page_shift, window_size,
>> +                                                 &bus_offset);
>> +            if (ret) {
>> +                return ret;
>> +            }
>> +        } else if (SPAPR_PCI_DMA_WINDOW_NUM(liobn)) {
>> +            /*
>> +             * There is no VFIO so we choose a huge window address.
>> +             * If VFIO is added later, spapr_phb_dma_update() will fail
>> +             * and cause hotplug failure.
>> +             */
>> +            bus_offset = sphb->dma64_window_start;
>> +        }
>> +    }
>>
>>       spapr_tce_table_enable(tcet, bus_offset, page_shift,
>>                              window_size >> page_shift,
>> @@ -802,9 +828,14 @@ static int spapr_phb_dma_init_window(sPAPRPHBState *sphb,
>>   int spapr_phb_dma_remove_window(sPAPRPHBState *sphb,
>>                                   sPAPRTCETable *tcet)
>>   {
>> +    int ret = 0;
>> +
>> +    if (sphb->has_vfio && sphb->ddw_enabled) {
>> +        ret = spapr_phb_vfio_dma_remove_window(sphb, tcet);
>> +    }
>>       spapr_tce_table_disable(tcet);
>>
>> -    return 0;
>> +    return ret;
>>   }
>>
>>   int spapr_phb_dma_reset(sPAPRPHBState *sphb)
>> @@ -832,15 +863,46 @@ static int spapr_phb_hotplug_dma_sync(sPAPRPHBState *sphb)
>>       int ret = 0, i;
>>       bool had_vfio = sphb->has_vfio;
>>       sPAPRTCETable *tcet;
>> +    uint64_t bus_offset = 0;
>>
>>       spapr_phb_dma_capabilities_update(sphb);
>>
>> +    /*
>> +     * PHB got first VFIO device or lost last VFIO device;
>> +     * If it is the last VFIO device, we do not need windows anymore so
>> +     * remove them.
>> +     * If it is the first VFIO device, we have to remove them as
>> +     * we cannot request a specific window from the host kernel so we
>> +     * remove all windows and recreate them later if necessary.
>
> Am I right in thinking that there never should be (VFIO enabled)
> windows when the first VFIO device is added though?


Actually, there should be a 32bit window already created in the container.
And the PHB may have no 32bit window at the moment of hotplug: it may have
been removed and replaced with a 64bit window. That does not happen with
modern guests and is not supported by old guests anyway, but it may still
be the case for other OSes.


> If you're removing the windows when VFIO devices are removed, and any
> windows created while !has_vfio shouldn't result in the kernel being


"shouldn't result in the window" may be?

> requested from the kernel..?

Either way, putting (i.e. releasing) a container should do the right job.



>
>> +     */
>> +    if (had_vfio !=  sphb->has_vfio) {
>> +        for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
>> +            tcet = spapr_tce_find_by_liobn(SPAPR_PCI_LIOBN(sphb->index, i));
>> +            if (!tcet) {
>> +                continue;
>> +            }
>> +            spapr_phb_vfio_dma_remove_window(sphb, tcet);
>> +        }
>> +    }
>> +
>>       if (!had_vfio && sphb->has_vfio) {
>>           for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
>>               tcet = spapr_tce_find_by_liobn(SPAPR_PCI_LIOBN(sphb->index, i));
>>               if (!tcet || !tcet->enabled) {
>>                   continue;
>>               }
>> +            ret = spapr_phb_vfio_dma_init_window(sphb,
>> +                                                 tcet->page_shift,
>> +                                                 (uint64_t)tcet->nb_table <<
>> +                                                 tcet->page_shift,
>> +                                                 &bus_offset);
>> +            if (ret) {
>> +                break;
>> +            }
>> +            if (bus_offset != tcet->bus_offset) {
>> +                ret = -EFAULT;
>> +                break;
>> +            }
>>               if (tcet->fd >= 0) {
>>                   /*
>>                    * We got first vfio-pci device on accelerated table.
>> @@ -1143,7 +1205,10 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
>>               error_setg(errp, "Failed to create pci child device tree node");
>>               goto out;
>>           }
>> -        spapr_phb_hotplug_dma_sync(phb);
>> +        if (spapr_phb_hotplug_dma_sync(phb)) {
>> +            error_setg(errp, "Failed to create DMA window(s)");
>> +            goto out;
>> +        }
>>       }
>>


-- 
Alexey


* Re: [Qemu-devel] [PATCH qemu v10 13/14] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2015-07-07  7:23   ` Thomas Huth
@ 2015-07-07 10:05     ` Alexey Kardashevskiy
  2015-07-07 10:21       ` Thomas Huth
  0 siblings, 1 reply; 71+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-07 10:05 UTC (permalink / raw)
  To: Thomas Huth
  Cc: Michael Roth, qemu-devel, Gavin Shan, Alex Williamson, qemu-ppc,
	David Gibson

On 07/07/2015 05:23 PM, Thomas Huth wrote:
> On Mon,  6 Jul 2015 12:11:09 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>
>> This makes use of the new "memory registering" feature. The idea is
>> to provide the userspace ability to notify the host kernel about pages
>> which are going to be used for DMA. Having this information, the host
>> kernel can pin them all once per user process, do locked pages
>> accounting (once) and not spent time on doing that in real time with
>> possible failures which cannot be handled nicely in some cases.
>>
>> This adds a guest RAM memory listener which notifies a VFIO container
>> about memory which needs to be pinned/unpinned. VFIO MMIO regions
>> (i.e. "skip dump" regions) are skipped.
>>
>> The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
>> are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
>> not call it when v2 is detected and enabled.
>>
>> This does not change the guest visible interface.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>> ---
>> Changes:
>> v9:
>> * since there is no more SPAPR-specific data in container::iommu_data,
>> the memory preregistration fields are common and potentially can be used
>> by other architectures
>>
>> v7:
>> * in vfio_spapr_ram_listener_region_del(), do unref() after ioctl()
>> * s'ramlistener'register_listener'
>>
>> v6:
>> * fixed commit log (s/guest/userspace/), added note about no guest visible
>> change
>> * fixed error checking if ram registration failed
>> * added alignment check for section->offset_within_region
>>
>> v5:
>> * simplified the patch
>> * added trace points
>> * added round_up() for the size
>> * SPAPR IOMMU v2 used
>> ---
>>   hw/vfio/common.c              | 109 ++++++++++++++++++++++++++++++++++++++----
>>   include/hw/vfio/vfio-common.h |   3 ++
>>   trace-events                  |   1 +
>>   3 files changed, 104 insertions(+), 9 deletions(-)
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 8eacfd7..0c7ba8c 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -488,6 +488,76 @@ static void vfio_listener_release(VFIOContainer *container)
>>       memory_listener_unregister(&container->iommu_data.type1.listener);
>>   }
>>
>> +static void vfio_ram_do_region(VFIOContainer *container,
>> +                              MemoryRegionSection *section, unsigned long req)
>> +{
>> +    int ret;
>> +    struct vfio_iommu_spapr_register_memory reg = { .argsz = sizeof(reg) };
>> +
>> +    if (!memory_region_is_ram(section->mr) ||
>> +        memory_region_is_skip_dump(section->mr)) {
>> +        return;
>> +    }
>> +
>> +    if (unlikely((section->offset_within_region & (getpagesize() - 1)))) {
>> +        error_report("%s received unaligned region", __func__);
>> +        return;
>> +    }
>> +
>> +    reg.vaddr = (__u64) memory_region_get_ram_ptr(section->mr) +
>
> We're in userspace here ... I think it would be better to use uint64_t
> instead of the kernel-type __u64.


We are calling the kernel here; @reg is a kernel-defined struct.


>
>> +        section->offset_within_region;
>> +    reg.size = ROUND_UP(int128_get64(section->size), TARGET_PAGE_SIZE);
>> +
>> +    ret = ioctl(container->fd, req, &reg);
>> +    trace_vfio_ram_register(_IOC_NR(req) - VFIO_BASE, reg.vaddr, reg.size,
>> +            ret ? -errno : 0);
>> +    if (!ret) {
>> +        return;
>> +    }
>> +
>> +    /*
>> +     * On the initfn path, store the first error in the container so we
>> +     * can gracefully fail.  Runtime, there's not much we can do other
>> +     * than throw a hardware error.
>> +     */
>> +    if (!container->iommu_data.ram_reg_initialized) {
>> +        if (!container->iommu_data.ram_reg_error) {
>> +            container->iommu_data.ram_reg_error = -errno;
>> +        }
>> +    } else {
>> +        hw_error("vfio: RAM registering failed, unable to continue");
>> +    }
>> +}
>> +
>> +static void vfio_ram_listener_region_add(MemoryListener *listener,
>> +                                         MemoryRegionSection *section)
>> +{
>> +    VFIOContainer *container = container_of(listener, VFIOContainer,
>> +                                            iommu_data.register_listener);
>> +    memory_region_ref(section->mr);
>> +    vfio_ram_do_region(container, section, VFIO_IOMMU_SPAPR_REGISTER_MEMORY);
>> +}
>> +
>> +static void vfio_ram_listener_region_del(MemoryListener *listener,
>> +                                         MemoryRegionSection *section)
>> +{
>> +    VFIOContainer *container = container_of(listener, VFIOContainer,
>> +                                            iommu_data.register_listener);
>> +    vfio_ram_do_region(container, section, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY);
>> +    memory_region_unref(section->mr);
>> +}
>> +
>> +static const MemoryListener vfio_ram_memory_listener = {
>> +    .region_add = vfio_ram_listener_region_add,
>> +    .region_del = vfio_ram_listener_region_del,
>> +};
>> +
>> +static void vfio_spapr_listener_release_v2(VFIOContainer *container)
>> +{
>> +    memory_listener_unregister(&container->iommu_data.register_listener);
>> +    vfio_listener_release(container);
>> +}
>> +
>>   int vfio_mmap_region(Object *obj, VFIORegion *region,
>>                        MemoryRegion *mem, MemoryRegion *submem,
>>                        void **map, size_t size, off_t offset,
>> @@ -698,14 +768,18 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>
>>           container->iommu_data.type1.initialized = true;
>>
>> -    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
>> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
>> +               ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
>> +        bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU);
>
> That "!!" sounds somewhat wrong here. I think you either want to check
> for "ioctl() == 1" (because only in this case you can be sure that v2
> is supported), or you can simply omit the "!!" because you're 100% sure
> that the ioctl only returns 0 or 1 (and never a negative error code).


The host kernel does not return an error on these ioctls; it returns 0 or
1. And "!!" is shorter than "(bool)". VFIO_CHECK_EXTENSION for Type1
already does exactly the same.



>>           ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
>>           if (ret) {
>>               error_report("vfio: failed to set group container: %m");
>>               ret = -errno;
>>               goto free_container_exit;
>>           }
>> -        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
>> +        ret = ioctl(fd, VFIO_SET_IOMMU,
>> +                v2 ? VFIO_SPAPR_TCE_v2_IOMMU : VFIO_SPAPR_TCE_IOMMU);
>>           if (ret) {
>>               error_report("vfio: failed to set iommu for container: %m");
>>               ret = -errno;
>> @@ -717,19 +791,36 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>            * when container fd is closed so we do not call it explicitly
>>            * in this file.
>>            */
>> -        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
>> -        if (ret) {
>> -            error_report("vfio: failed to enable container: %m");
>> -            ret = -errno;
>> -            goto free_container_exit;
>> +        if (!v2) {
>> +            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
>> +            if (ret) {
>> +                error_report("vfio: failed to enable container: %m");
>> +                ret = -errno;
>> +                goto free_container_exit;
>> +            }
>>           }
>>
>>           container->iommu_data.type1.listener = vfio_memory_listener;
>> -        container->iommu_data.release = vfio_listener_release;
>> -
>>           memory_listener_register(&container->iommu_data.type1.listener,
>>                                    container->space->as);
>>
>> +        if (!v2) {
>> +            container->iommu_data.release = vfio_listener_release;
>> +        } else {
>> +            container->iommu_data.release = vfio_spapr_listener_release_v2;
>> +            container->iommu_data.register_listener =
>> +                    vfio_ram_memory_listener;
>> +            memory_listener_register(&container->iommu_data.register_listener,
>> +                                     &address_space_memory);
>> +
>> +            if (container->iommu_data.ram_reg_error) {
>> +                error_report("vfio: RAM memory listener initialization failed for container");
>
> Line > 80 columns?

AFAIK, user-visible strings are an exception in both QEMU and the kernel.


>
>> +                goto listener_release_exit;
>> +            }
>> +
>> +            container->iommu_data.ram_reg_initialized = true;
>> +        }
>> +
>
>   Thomas
>


-- 
Alexey


* Re: [Qemu-devel] [PATCH qemu v10 13/14] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2015-07-07 10:05     ` Alexey Kardashevskiy
@ 2015-07-07 10:21       ` Thomas Huth
  2015-07-07 11:05         ` Alexey Kardashevskiy
  0 siblings, 1 reply; 71+ messages in thread
From: Thomas Huth @ 2015-07-07 10:21 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Michael Roth, qemu-devel, Gavin Shan, Alex Williamson, qemu-ppc,
	David Gibson

On Tue, 7 Jul 2015 20:05:25 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 07/07/2015 05:23 PM, Thomas Huth wrote:
> > On Mon,  6 Jul 2015 12:11:09 +1000
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
...
> >> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >> index 8eacfd7..0c7ba8c 100644
> >> --- a/hw/vfio/common.c
> >> +++ b/hw/vfio/common.c
> >> @@ -488,6 +488,76 @@ static void vfio_listener_release(VFIOContainer *container)
> >>       memory_listener_unregister(&container->iommu_data.type1.listener);
> >>   }
> >>
> >> +static void vfio_ram_do_region(VFIOContainer *container,
> >> +                              MemoryRegionSection *section, unsigned long req)
> >> +{
> >> +    int ret;
> >> +    struct vfio_iommu_spapr_register_memory reg = { .argsz = sizeof(reg) };
> >> +
> >> +    if (!memory_region_is_ram(section->mr) ||
> >> +        memory_region_is_skip_dump(section->mr)) {
> >> +        return;
> >> +    }
> >> +
> >> +    if (unlikely((section->offset_within_region & (getpagesize() - 1)))) {
> >> +        error_report("%s received unaligned region", __func__);
> >> +        return;
> >> +    }
> >> +
> >> +    reg.vaddr = (__u64) memory_region_get_ram_ptr(section->mr) +
> >
> > We're in userspace here ... I think it would be better to use uint64_t
> > instead of the kernel-type __u64.
> 
> We are calling a kernel here - @reg is a kernel-defined struct.

If you grep for __u64 in the QEMU sources, you'll see that hardly
anybody is using this type - even if calling ioctls. So for
consistency, I'd really suggest to use uint64_t here.
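
To make the type question concrete, here is a minimal sketch of storing a
userspace pointer in a 64-bit field using only <stdint.h> types. The struct
`fake_reg` and the helper `ptr_to_u64()` are hypothetical stand-ins for
illustration; they are not the real vfio_iommu_spapr_register_memory layout
or an existing QEMU helper.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a kernel-defined ioctl argument struct. */
struct fake_reg {
    uint32_t argsz;
    uint32_t flags;
    uint64_t vaddr;   /* a kernel header would declare this field as __u64 */
    uint64_t size;
};

/*
 * Convert a pointer to a 64-bit integer. Going through uintptr_t keeps
 * the conversion free of size-mismatch warnings on 32-bit hosts, where a
 * direct pointer-to-uint64_t cast changes the width.
 */
static inline uint64_t ptr_to_u64(const void *p)
{
    return (uint64_t)(uintptr_t)p;
}
```

Since __u64 and uint64_t are both unsigned 64-bit integers on the hosts in
question, the cast on the right-hand side can use either name without
changing what ends up in the struct field.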

> >> @@ -698,14 +768,18 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> >>
> >>           container->iommu_data.type1.initialized = true;
> >>
> >> -    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
> >> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
> >> +               ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
> >> +        bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU);
> >
> > That "!!" sounds somewhat wrong here. I think you either want to check
> > for "ioctl() == 1" (because only in this case you can be sure that v2
> > is supported), or you can simply omit the "!!" because you're 100% sure
> > that the ioctl only returns 0 or 1 (and never a negative error code).
> 
> 
> The host kernel does not return an error on these ioctls, it returns 0 or 
> 1. And "!!" is shorter than "(bool)". VFIO_CHECK_EXTENSION for Type1 does 
> exactly the same already.

Simply using nothing instead is even shorter than using "!!". The
compiler is smart enough to convert from 0 and 1 to bool.
"!!" is IMHO quite ugly and should only be used when it is really
necessary.
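
A small sketch of the difference being discussed, assuming only standard C99
bool semantics. The helper names `ext_nonzero()` and `ext_supported()` are
made up for this example; `rc` stands for whatever the ioctl returned.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Plain conversion to bool: any non-zero value becomes true, so this is
 * equivalent to !!rc or (rc != 0). A negative error code also maps to true.
 */
static bool ext_nonzero(int rc)
{
    return rc;
}

/* Strict check: only an explicit 1 counts as "extension supported". */
static bool ext_supported(int rc)
{
    return rc == 1;
}
```

The two helpers agree for 0 and 1 and differ only when the call returns
something else, for example -1 on failure.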

> >> @@ -717,19 +791,36 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> >>            * when container fd is closed so we do not call it explicitly
> >>            * in this file.
> >>            */
> >> -        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> >> -        if (ret) {
> >> -            error_report("vfio: failed to enable container: %m");
> >> -            ret = -errno;
> >> -            goto free_container_exit;
> >> +        if (!v2) {
> >> +            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> >> +            if (ret) {
> >> +                error_report("vfio: failed to enable container: %m");
> >> +                ret = -errno;
> >> +                goto free_container_exit;
> >> +            }
> >>           }
> >>
> >>           container->iommu_data.type1.listener = vfio_memory_listener;
> >> -        container->iommu_data.release = vfio_listener_release;
> >> -
> >>           memory_listener_register(&container->iommu_data.type1.listener,
> >>                                    container->space->as);
> >>
> >> +        if (!v2) {
> >> +            container->iommu_data.release = vfio_listener_release;
> >> +        } else {
> >> +            container->iommu_data.release = vfio_spapr_listener_release_v2;
> >> +            container->iommu_data.register_listener =
> >> +                    vfio_ram_memory_listener;
> >> +            memory_listener_register(&container->iommu_data.register_listener,
> >> +                                     &address_space_memory);
> >> +
> >> +            if (container->iommu_data.ram_reg_error) {
> >> +                error_report("vfio: RAM memory listener initialization failed for container");
> >
> > Line > 80 columns?
> 
> afaik user visible strings are an exception in QEMU and kernel.

You're right for the kernel, but AFAIK QEMU (currently still) has a
hard limit at 80 columns.

 Thomas


* Re: [Qemu-devel] [PATCH qemu v10 14/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2015-07-07  9:33   ` Thomas Huth
@ 2015-07-07 10:43     ` Alexey Kardashevskiy
  2015-07-07 11:35       ` Thomas Huth
  0 siblings, 1 reply; 71+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-07 10:43 UTC (permalink / raw)
  To: Thomas Huth
  Cc: Michael Roth, qemu-devel, Gavin Shan, Alex Williamson, qemu-ppc,
	David Gibson

On 07/07/2015 07:33 PM, Thomas Huth wrote:
> On Mon,  6 Jul 2015 12:11:10 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>
>> This adds support for Dynamic DMA Windows (DDW) option defined by
>> the SPAPR specification which allows to have additional DMA window(s)
>>
>> This implements DDW for emulated and VFIO devices. As all TCE root regions
>> are mapped at 0 and 64bit long (and actual tables are child regions),
>> this replaces memory_region_add_subregion() with _overlap() to make
>> QEMU memory API happy.
>>
>> This reserves RTAS token numbers for DDW calls.
>>
>> This implements helpers to interact with VFIO kernel interface.
>>
>> This changes the TCE table migration descriptor to support dynamic
>> tables as from now on, PHB will create as many stub TCE table objects
>> as PHB can possibly support but not all of them might be initialized at
>> the time of migration because DDW might or might not be requested by
>> the guest.
>>
>> The "ddw" property is enabled by default on a PHB but for compatibility
>> the pseries-2.3 machine and older disable it.
>>
>> This implements DDW for VFIO. The host kernel support is required.
>> This adds a "levels" property to PHB to control the number of levels
>> in the actual TCE table allocated by the host kernel, 0 is the default
>> value to tell QEMU to calculate the correct value. Current hardware
>> supports up to 5 levels.
>>
>> The existing linux guests try creating one additional huge DMA window
>> with 64K or 16MB pages and map the entire guest RAM to. If succeeded,
>> the guest switches to dma_direct_ops and never calls TCE hypercalls
>> (H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
>> and not waste time on map/unmap later. This adds a "dma64_win_addr"
>> property which is a bus address for the 64bit window and by default
>> set to 0x800.0000.0000.0000 as this is what the modern POWER8 hardware
>> uses and this allows having emulated and VFIO devices on the same bus.
>>
>> This adds 4 RTAS handlers:
>> * ibm,query-pe-dma-window
>> * ibm,create-pe-dma-window
>> * ibm,remove-pe-dma-window
>> * ibm,reset-pe-dma-window
>> These are registered from type_init() callback.
>>
>> These RTAS handlers are implemented in a separate file to avoid polluting
>> spapr_iommu.c with PCI.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
> ...
>> diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
>> new file mode 100644
>> index 0000000..7539c6a
>> --- /dev/null
>> +++ b/hw/ppc/spapr_rtas_ddw.c
>> @@ -0,0 +1,300 @@
>> +/*
>> + * QEMU sPAPR Dynamic DMA windows support
>> + *
>> + * Copyright (c) 2014 Alexey Kardashevskiy, IBM Corporation.
>
> Happy new year?
>
>> + *  This program is free software; you can redistribute it and/or modify
>> + *  it under the terms of the GNU General Public License as published by
>> + *  the Free Software Foundation; either version 2 of the License,
>> + *  or (at your option) any later version.
>> + *
>> + *  This program is distributed in the hope that it will be useful,
>> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + *  GNU General Public License for more details.
>> + *
>> + *  You should have received a copy of the GNU General Public License
>> + *  along with this program; if not, see <http://www.gnu.org/licenses/>.
>> + */
>> +
>> +#include "qemu/error-report.h"
>> +#include "hw/ppc/spapr.h"
>> +#include "hw/pci-host/spapr.h"
>> +#include "trace.h"
>> +
>> +static int spapr_phb_get_active_win_num_cb(Object *child, void *opaque)
>> +{
>> +    sPAPRTCETable *tcet;
>> +
>> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
>> +    if (tcet && tcet->enabled) {
>> +        ++*(unsigned *)opaque;
>> +    }
>> +    return 0;
>> +}
>> +
>> +static unsigned spapr_phb_get_active_win_num(sPAPRPHBState *sphb)
>> +{
>> +    unsigned ret = 0;
>> +
>> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_active_win_num_cb, &ret);
>> +
>> +    return ret;
>> +}
>> +
>> +static int spapr_phb_get_free_liobn_cb(Object *child, void *opaque)
>> +{
>> +    sPAPRTCETable *tcet;
>> +
>> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
>> +    if (tcet && !tcet->enabled) {
>> +        *(uint32_t *)opaque = tcet->liobn;
>> +        return 1;
>> +    }
>> +    return 0;
>> +}
>> +
>> +static unsigned spapr_phb_get_free_liobn(sPAPRPHBState *sphb)
>> +{
>> +    uint32_t liobn = 0;
>> +
>> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_free_liobn_cb, &liobn);
>> +
>> +    return liobn;
>> +}
>> +
>> +static uint32_t spapr_query_mask(struct ppc_one_seg_page_size *sps,
>> +                                 uint64_t page_mask)
>> +{
>> +    int i, j;
>> +    uint32_t mask = 0;
>> +    const struct { int shift; uint32_t mask; } masks[] = {
>> +        { 12, RTAS_DDW_PGSIZE_4K },
>> +        { 16, RTAS_DDW_PGSIZE_64K },
>> +        { 24, RTAS_DDW_PGSIZE_16M },
>> +        { 25, RTAS_DDW_PGSIZE_32M },
>> +        { 26, RTAS_DDW_PGSIZE_64M },
>> +        { 27, RTAS_DDW_PGSIZE_128M },
>> +        { 28, RTAS_DDW_PGSIZE_256M },
>> +        { 34, RTAS_DDW_PGSIZE_16G },
>> +    };
>> +
>> +    for (i = 0; i < PPC_PAGE_SIZES_MAX_SZ; i++) {
>> +        for (j = 0; j < ARRAY_SIZE(masks); ++j) {
>> +            if ((sps[i].page_shift == masks[j].shift) &&
>> +                    (page_mask & (1ULL << masks[j].shift))) {
>> +                mask |= masks[j].mask;
>> +            }
>> +        }
>> +    }
>> +
>> +    return mask;
>> +}
>> +
>> +static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
>> +                                         sPAPRMachineState *spapr,
>> +                                         uint32_t token, uint32_t nargs,
>> +                                         target_ulong args,
>> +                                         uint32_t nret, target_ulong rets)
>> +{
>> +    CPUPPCState *env = &cpu->env;
>> +    sPAPRPHBState *sphb;
>> +    uint64_t buid;
>> +    uint32_t avail, addr, pgmask = 0;
>> +    unsigned current;
>> +
>> +    if ((nargs != 3) || (nret != 5)) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>> +    addr = rtas_ld(args, 0);
>> +    sphb = spapr_pci_find_phb(spapr, buid);
>> +    if (!sphb || !sphb->ddw_enabled) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    current = spapr_phb_get_active_win_num(sphb);
>> +    avail = (sphb->windows_supported > current) ?
>> +            (sphb->windows_supported - current) : 0;
>> +
>> +    /* Work out supported page masks */
>> +    pgmask = spapr_query_mask(env->sps.sps, sphb->page_size_mask);
>> +
>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>> +    rtas_st(rets, 1, avail);
>> +
>> +    /*
>> +     * This is "Largest contiguous block of TCEs allocated specifically
>> +     * for (that is, are reserved for) this PE".
>> +     * Return the maximum number as all RAM was in 4K pages.
>> +     */
>> +    rtas_st(rets, 2, sphb->dma64_window_size >> SPAPR_TCE_PAGE_SHIFT);
>> +    rtas_st(rets, 3, pgmask);
>> +    rtas_st(rets, 4, 0); /* DMA migration mask, not supported */
>> +
>> +    trace_spapr_iommu_ddw_query(buid, addr, avail, sphb->dma64_window_size,
>> +                                pgmask);
>> +    return;
>> +
>> +param_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
>> +}
>> +
>> +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
>> +                                          sPAPRMachineState *spapr,
>> +                                          uint32_t token, uint32_t nargs,
>> +                                          target_ulong args,
>> +                                          uint32_t nret, target_ulong rets)
>> +{
>> +    sPAPRPHBState *sphb;
>> +    sPAPRTCETable *tcet = NULL;
>> +    uint32_t addr, page_shift, window_shift, liobn;
>> +    uint64_t buid;
>> +    long ret;
>> +
>> +    if ((nargs != 5) || (nret != 4)) {
>
> Pascal bracket style again :-(


Am I breaking any code design guideline here?


>
>> +        goto param_error_exit;
>> +    }
>> +
>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);


But here braces are ok? :-/


>> +    addr = rtas_ld(args, 0);
>> +    sphb = spapr_pci_find_phb(spapr, buid);
>> +    if (!sphb || !sphb->ddw_enabled) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    page_shift = rtas_ld(args, 3);
>> +    window_shift = rtas_ld(args, 4);
>> +    liobn = spapr_phb_get_free_liobn(sphb);
>> +
>> +    if (!liobn || !(sphb->page_size_mask & (1ULL << page_shift))) {
>> +        goto hw_error_exit;
>> +    }
>> +
>> +    ret = spapr_phb_dma_init_window(sphb, liobn, page_shift,
>> +                                    1ULL << window_shift);
>
> As already mentioned in a comment to another patch in this series, I
> think it maybe might be better to do some sanity checks on the
> window_shift value, too?


Well, as you suggested, I added a check to spapr_phb_dma_init_window() 
which makes this code return RTAS_OUT_HW_ERROR. Or I can add this here:

if (window_shift < page_shift) {
     goto param_error_exit;
}

and RTAS handler will return RTAS_OUT_PARAM_ERROR.
SPAPR does not say what the correct response is in this case...
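
For illustration, a minimal sketch of what such a sanity check could look
like as a standalone helper. `ddw_window_size()` is a hypothetical name, and
the upper bound of 64 is an assumption that follows from C shift rules
rather than from SPAPR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: validate the guest-supplied shifts and, if they
 * are sane, store the window size in bytes in *size.
 */
static bool ddw_window_size(uint32_t page_shift, uint32_t window_shift,
                            uint64_t *size)
{
    /* Shifting a 64-bit value by 64 or more is undefined behaviour in C. */
    if (window_shift >= 64 || window_shift < page_shift) {
        return false;
    }
    *size = 1ULL << window_shift;
    return true;
}
```

A caller in the RTAS handler would jump to the chosen error label when the
helper returns false, before the shift value is ever used.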


>
>> +    tcet = spapr_tce_find_by_liobn(liobn);
>> +    trace_spapr_iommu_ddw_create(buid, addr, 1ULL << page_shift,
>> +                                 1ULL << window_shift,
>> +                                 tcet ? tcet->bus_offset : 0xbaadf00d,
>> +                                 liobn, ret);
>> +    if (ret || !tcet) {
>> +        goto hw_error_exit;
>> +    }
>> +
>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>> +    rtas_st(rets, 1, liobn);
>> +    rtas_st(rets, 2, tcet->bus_offset >> 32);
>> +    rtas_st(rets, 3, tcet->bus_offset & ((uint32_t) -1));
>
> Why don't you simply use 0xffffffff instead of ((uint32_t) -1) ?
> That's shorter and much easier to understand at a first glance than
> calculating the type-cast in your brain ;-)


At a first glance I cannot tell if there are 7 or 8 or 9 "f"s in 
0xffffffff. I may accidentally add/remove one "f" and nobody will notice. 
Such typecast of (-1) is quite typical.


>
>> +
>> +    return;
>> +
>> +hw_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
>> +    return;
>> +
>> +param_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
>> +}
>> +
>> +static void rtas_ibm_remove_pe_dma_window(PowerPCCPU *cpu,
>> +                                          sPAPRMachineState *spapr,
>> +                                          uint32_t token, uint32_t nargs,
>> +                                          target_ulong args,
>> +                                          uint32_t nret, target_ulong rets)
>> +{
>> +    sPAPRPHBState *sphb;
>> +    sPAPRTCETable *tcet;
>> +    uint32_t liobn;
>> +    long ret;
>> +
>> +    if ((nargs != 1) || (nret != 1)) {
>
>   (╯°□°)╯︵ ┻━┻
>
>> +        goto param_error_exit;
>> +    }
>> +
>> +    liobn = rtas_ld(args, 0);
>> +    tcet = spapr_tce_find_by_liobn(liobn);
>> +    if (!tcet) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    sphb = SPAPR_PCI_HOST_BRIDGE(OBJECT(tcet)->parent);
>> +    if (!sphb || !sphb->ddw_enabled) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    ret = spapr_phb_dma_remove_window(sphb, tcet);
>> +    trace_spapr_iommu_ddw_remove(liobn, ret);
>> +    if (ret) {
>> +        goto hw_error_exit;
>> +    }
>> +
>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>> +    return;
>> +
>> +hw_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
>> +    return;
>> +
>> +param_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
>> +}
>> +
>> +static void rtas_ibm_reset_pe_dma_window(PowerPCCPU *cpu,
>> +                                         sPAPRMachineState *spapr,
>> +                                         uint32_t token, uint32_t nargs,
>> +                                         target_ulong args,
>> +                                         uint32_t nret, target_ulong rets)
>> +{
>> +    sPAPRPHBState *sphb;
>> +    uint64_t buid;
>> +    uint32_t addr;
>> +    long ret;
>> +
>> +    if ((nargs != 3) || (nret != 1)) {
>
>   ┬─┬ ︵ /(.□. \)
>
>> +        goto param_error_exit;
>> +    }
>> +
>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>> +    addr = rtas_ld(args, 0);
>> +    sphb = spapr_pci_find_phb(spapr, buid);
>> +    if (!sphb || !sphb->ddw_enabled) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    ret = spapr_phb_dma_reset(sphb);
>> +    trace_spapr_iommu_ddw_reset(buid, addr, ret);
>> +    if (ret) {
>> +        goto hw_error_exit;
>> +    }
>> +
>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>> +
>> +    return;
>> +
>> +hw_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
>> +    return;
>> +
>> +param_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
>> +}
>> +
>> +static void spapr_rtas_ddw_init(void)
>> +{
>> +    spapr_rtas_register(RTAS_IBM_QUERY_PE_DMA_WINDOW,
>> +                        "ibm,query-pe-dma-window",
>> +                        rtas_ibm_query_pe_dma_window);
>> +    spapr_rtas_register(RTAS_IBM_CREATE_PE_DMA_WINDOW,
>> +                        "ibm,create-pe-dma-window",
>> +                        rtas_ibm_create_pe_dma_window);
>> +    spapr_rtas_register(RTAS_IBM_REMOVE_PE_DMA_WINDOW,
>> +                        "ibm,remove-pe-dma-window",
>> +                        rtas_ibm_remove_pe_dma_window);
>> +    spapr_rtas_register(RTAS_IBM_RESET_PE_DMA_WINDOW,
>> +                        "ibm,reset-pe-dma-window",
>> +                        rtas_ibm_reset_pe_dma_window);
>> +}
>> +
>> +type_init(spapr_rtas_ddw_init)
>
>   Thomas
>


-- 
Alexey


* Re: [Qemu-devel] [PATCH qemu v10 13/14] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2015-07-07 10:21       ` Thomas Huth
@ 2015-07-07 11:05         ` Alexey Kardashevskiy
  2015-07-08  4:30           ` David Gibson
  0 siblings, 1 reply; 71+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-07 11:05 UTC (permalink / raw)
  To: Thomas Huth
  Cc: Michael Roth, qemu-devel, Gavin Shan, Alex Williamson, qemu-ppc,
	David Gibson

On 07/07/2015 08:21 PM, Thomas Huth wrote:
> On Tue, 7 Jul 2015 20:05:25 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>
>> On 07/07/2015 05:23 PM, Thomas Huth wrote:
>>> On Mon,  6 Jul 2015 12:11:09 +1000
>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> ...
>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>> index 8eacfd7..0c7ba8c 100644
>>>> --- a/hw/vfio/common.c
>>>> +++ b/hw/vfio/common.c
>>>> @@ -488,6 +488,76 @@ static void vfio_listener_release(VFIOContainer *container)
>>>>        memory_listener_unregister(&container->iommu_data.type1.listener);
>>>>    }
>>>>
>>>> +static void vfio_ram_do_region(VFIOContainer *container,
>>>> +                              MemoryRegionSection *section, unsigned long req)
>>>> +{
>>>> +    int ret;
>>>> +    struct vfio_iommu_spapr_register_memory reg = { .argsz = sizeof(reg) };
>>>> +
>>>> +    if (!memory_region_is_ram(section->mr) ||
>>>> +        memory_region_is_skip_dump(section->mr)) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    if (unlikely((section->offset_within_region & (getpagesize() - 1)))) {
>>>> +        error_report("%s received unaligned region", __func__);
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    reg.vaddr = (__u64) memory_region_get_ram_ptr(section->mr) +
>>>
>>> We're in userspace here ... I think it would be better to use uint64_t
>>> instead of the kernel-type __u64.
>>
>> We are calling a kernel here - @reg is a kernel-defined struct.
>
> If you grep for __u64 in the QEMU sources, you'll see that hardly
> anybody is using this type - even if calling ioctls. So for
> consistency, I'd really suggest to use uint64_t here.



I am not using it, I am packing data to a struct. So does vfio_dma_map() 
already.



>>>> @@ -698,14 +768,18 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>>>
>>>>            container->iommu_data.type1.initialized = true;
>>>>
>>>> -    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
>>>> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
>>>> +               ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
>>>> +        bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU);
>>>
>>> That "!!" sounds somewhat wrong here. I think you either want to check
>>> for "ioctl() == 1" (because only in this case you can be sure that v2
>>> is supported), or you can simply omit the "!!" because you're 100% sure
>>> that the ioctl only returns 0 or 1 (and never a negative error code).
>>
>>
>> The host kernel does not return an error on these ioctls, it returns 0 or
>> 1. And "!!" is shorter than "(bool)". VFIO_CHECK_EXTENSION for Type1 does
>> exactly the same already.
>
> Simply using nothing instead is even shorter than using "!!". The
> compiler is smart enough to convert from 0 and 1 to bool.
> "!!" is IMHO quite ugly and should only be used when it is really
> necessary.


imho it is not but either way I'd rather follow the existing style, 
especially if I do literally the same thing (checking IOMMU version). 
Unless the original author tells me to convert all the existing occurrences 
of "!!" to "!=0" (or something like this) before I post new ones.

Alex, should I get rid of "!!"s in the patch?


>
>>>> @@ -717,19 +791,36 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>>>             * when container fd is closed so we do not call it explicitly
>>>>             * in this file.
>>>>             */
>>>> -        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
>>>> -        if (ret) {
>>>> -            error_report("vfio: failed to enable container: %m");
>>>> -            ret = -errno;
>>>> -            goto free_container_exit;
>>>> +        if (!v2) {
>>>> +            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
>>>> +            if (ret) {
>>>> +                error_report("vfio: failed to enable container: %m");
>>>> +                ret = -errno;
>>>> +                goto free_container_exit;
>>>> +            }
>>>>            }
>>>>
>>>>            container->iommu_data.type1.listener = vfio_memory_listener;
>>>> -        container->iommu_data.release = vfio_listener_release;
>>>> -
>>>>            memory_listener_register(&container->iommu_data.type1.listener,
>>>>                                     container->space->as);
>>>>
>>>> +        if (!v2) {
>>>> +            container->iommu_data.release = vfio_listener_release;
>>>> +        } else {
>>>> +            container->iommu_data.release = vfio_spapr_listener_release_v2;
>>>> +            container->iommu_data.register_listener =
>>>> +                    vfio_ram_memory_listener;
>>>> +            memory_listener_register(&container->iommu_data.register_listener,
>>>> +                                     &address_space_memory);
>>>> +
>>>> +            if (container->iommu_data.ram_reg_error) {
>>>> +                error_report("vfio: RAM memory listener initialization failed for container");
>>>
>>> Line > 80 columns?
>>
>> afaik user visible strings are an exception in QEMU and kernel.
>
> You're right for the kernel, but AFAIK QEMU (currently still) has a
> hard limit at 80 columns.

This is not an error, this is a warning, and in fact nobody is enforcing this 
(and this is a good thing) and for example VFIO already has longer lines.



-- 
Alexey


* Re: [Qemu-devel] [PATCH qemu v10 14/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2015-07-07 10:43     ` Alexey Kardashevskiy
@ 2015-07-07 11:35       ` Thomas Huth
  2015-07-07 11:53         ` Alexey Kardashevskiy
  0 siblings, 1 reply; 71+ messages in thread
From: Thomas Huth @ 2015-07-07 11:35 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Michael Roth, qemu-devel, Gavin Shan, Alex Williamson, qemu-ppc,
	David Gibson

On Tue, 7 Jul 2015 20:43:44 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 07/07/2015 07:33 PM, Thomas Huth wrote:
> > On Mon,  6 Jul 2015 12:11:10 +1000
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
...
> >> +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
> >> +                                          sPAPRMachineState *spapr,
> >> +                                          uint32_t token, uint32_t nargs,
> >> +                                          target_ulong args,
> >> +                                          uint32_t nret, target_ulong rets)
> >> +{
> >> +    sPAPRPHBState *sphb;
> >> +    sPAPRTCETable *tcet = NULL;
> >> +    uint32_t addr, page_shift, window_shift, liobn;
> >> +    uint64_t buid;
> >> +    long ret;
> >> +
> >> +    if ((nargs != 5) || (nret != 4)) {
> >
> > Pascal bracket style again :-(
> 
> 
> Am I breaking any code design guideline here?

No, but my Pascal allergy causes me to sneeze here ;-)

> >> +        goto param_error_exit;
> >> +    }
> >> +
> >> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> 
> But here braces are ok? :-/

You could remove them, too. But I did not need to sneeze here.

> >> +    addr = rtas_ld(args, 0);
> >> +    sphb = spapr_pci_find_phb(spapr, buid);
> >> +    if (!sphb || !sphb->ddw_enabled) {
> >> +        goto param_error_exit;
> >> +    }
> >> +
> >> +    page_shift = rtas_ld(args, 3);
> >> +    window_shift = rtas_ld(args, 4);
> >> +    liobn = spapr_phb_get_free_liobn(sphb);
> >> +
> >> +    if (!liobn || !(sphb->page_size_mask & (1ULL << page_shift))) {
> >> +        goto hw_error_exit;
> >> +    }
> >> +
> >> +    ret = spapr_phb_dma_init_window(sphb, liobn, page_shift,
> >> +                                    1ULL << window_shift);
> >
> > As already mentioned in a comment to another patch in this series, I
> > think it maybe might be better to do some sanity checks on the
> > window_shift value, too?
> 
> 
> Well, as you suggested, I added a check to spapr_phb_dma_init_window() 
> which makes this code return RTAS_OUT_HW_ERROR. Or I can add this here:
> 
> if (window_shift < page_shift) {
>      goto param_error_exit;
> }
> 
> and RTAS handler will return RTAS_OUT_PARAM_ERROR.
> SPAPR does not say what the correct response is in this case...

Both error codes sound ok for me here, so do whatever you think is best.

> >> +
> >> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> >> +    rtas_st(rets, 1, liobn);
> >> +    rtas_st(rets, 2, tcet->bus_offset >> 32);
> >> +    rtas_st(rets, 3, tcet->bus_offset & ((uint32_t) -1));
> >
> > Why don't you simply use 0xffffffff instead of ((uint32_t) -1) ?
> > That's shorter and much easier to understand at a first glance than
> > calculating the type-cast in your brain ;-)
> 
> 
> At a first glance I cannot tell if there are 7 or 8 or 9 "f"s in 
> 0xffffffff. I may accidentally add/remove one "f" and nobody will notice. 
> Such typecast of (-1) is quite typical.

But IMHO it's ugly to use it to mask a value to the lower 32 bits this
way. At least I had to read this twice to understand what you're
trying to achieve here. So if you don't like the 0xffffffff, what about
simply using:

    rtas_st(rets, 3, (uint32_t)tcet->bus_offset);

?
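
To compare the variants side by side, here is a minimal sketch that splits
a 64-bit bus offset into two 32-bit halves. The helper names are made up for
this example, and the sample value in the usage note is only the default
window address mentioned in the commit message plus an arbitrary low part.

```c
#include <assert.h>
#include <stdint.h>

/* Upper 32 bits of a 64-bit value. */
static uint32_t hi32(uint64_t v)
{
    return (uint32_t)(v >> 32);
}

/* Lower 32 bits, written with the cast-of-minus-one mask. */
static uint32_t lo32_minus_one(uint64_t v)
{
    return v & ((uint32_t) -1);
}

/* Lower 32 bits, written with a literal mask. */
static uint32_t lo32_literal(uint64_t v)
{
    return v & 0xffffffffULL;
}

/* Lower 32 bits, written as a plain truncating cast. */
static uint32_t lo32_cast(uint64_t v)
{
    return (uint32_t)v;
}
```

For an input such as 0x0800000012345678, all three lo32 helpers return
0x12345678 and hi32() returns 0x08000000, so the choice between them is
about readability and compiler warnings, not about the result.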

 Thomas


* Re: [Qemu-devel] [PATCH qemu v10 14/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2015-07-07 11:35       ` Thomas Huth
@ 2015-07-07 11:53         ` Alexey Kardashevskiy
  0 siblings, 0 replies; 71+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-07 11:53 UTC (permalink / raw)
  To: Thomas Huth
  Cc: Michael Roth, qemu-devel, Gavin Shan, Alex Williamson, qemu-ppc,
	David Gibson

On 07/07/2015 09:35 PM, Thomas Huth wrote:
> On Tue, 7 Jul 2015 20:43:44 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>
>> On 07/07/2015 07:33 PM, Thomas Huth wrote:
>>> On Mon,  6 Jul 2015 12:11:10 +1000
>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> ...
>>>> +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
>>>> +                                          sPAPRMachineState *spapr,
>>>> +                                          uint32_t token, uint32_t nargs,
>>>> +                                          target_ulong args,
>>>> +                                          uint32_t nret, target_ulong rets)
>>>> +{
>>>> +    sPAPRPHBState *sphb;
>>>> +    sPAPRTCETable *tcet = NULL;
>>>> +    uint32_t addr, page_shift, window_shift, liobn;
>>>> +    uint64_t buid;
>>>> +    long ret;
>>>> +
>>>> +    if ((nargs != 5) || (nret != 4)) {
>>>
>>> Pascal bracket style again :-(
>>
>>
>> Am I breaking any code design guideline here?
>
> No, but my Pascal allergy causes me to sneeze here ;-)

I feel cold when I do not see braces in cases like this ;)


>>>> +        goto param_error_exit;
>>>> +    }
>>>> +
>>>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>>
>> But here braces are ok? :-/
>
> You could remove them, too. But I did not need to sneeze here.

:)


>>>> +    addr = rtas_ld(args, 0);
>>>> +    sphb = spapr_pci_find_phb(spapr, buid);
>>>> +    if (!sphb || !sphb->ddw_enabled) {
>>>> +        goto param_error_exit;
>>>> +    }
>>>> +
>>>> +    page_shift = rtas_ld(args, 3);
>>>> +    window_shift = rtas_ld(args, 4);
>>>> +    liobn = spapr_phb_get_free_liobn(sphb);
>>>> +
>>>> +    if (!liobn || !(sphb->page_size_mask & (1ULL << page_shift))) {
>>>> +        goto hw_error_exit;
>>>> +    }
>>>> +
>>>> +    ret = spapr_phb_dma_init_window(sphb, liobn, page_shift,
>>>> +                                    1ULL << window_shift);
>>>
>>> As already mentioned in a comment to another patch in this series, I
>>> think it maybe might be better to do some sanity checks on the
>>> window_shift value, too?
>>
>>
>> Well, as you suggested, I added a check to spapr_phb_dma_init_window()
>> which makes this code return RTAS_OUT_HW_ERROR. Or I can add this here:
>>
>> if (window_shift < page_shift) {
>>       goto param_error_exit;
>> }
>>
>> and RTAS handler will return RTAS_OUT_PARAM_ERROR.
>> SPAPR does not say what the correct response is in this case...
>
> Both error codes sound ok for me here, so do whatever you think is best.


RTAS_OUT_PARAM_ERROR it is then.


>>>> +
>>>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>>>> +    rtas_st(rets, 1, liobn);
>>>> +    rtas_st(rets, 2, tcet->bus_offset >> 32);
>>>> +    rtas_st(rets, 3, tcet->bus_offset & ((uint32_t) -1));
>>>
>>> Why don't you simply use 0xffffffff instead of ((uint32_t) -1) ?
>>> That's shorter and much easier to understand at a first glance than
>>> calculating the type-cast in your brain ;-)
>>
>>
>> At a first glance I cannot tell if there are 7 or 8 or 9 "f"s in
>> 0xffffffff. I may accidentally add/remove one "f" and nobody will notice.
>> Such typecast of (-1) is quite typical.
>
> But IMHO it's ugly to use it to mask a value to the lower 32 bits this
> way. At least I had to read this twice to understand what you're
> trying to achieve here. So if you don't like the 0xffffffff, what about
> simply using:
>
>      rtas_st(rets, 3, (uint32_t)tcet->bus_offset);
>
> ?

I believe there are compilers which will warn me that I am losing the upper 
32 bits.



-- 
Alexey


* Re: [Qemu-devel] [PATCH qemu v10 13/14] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2015-07-06 16:13       ` Alex Williamson
  2015-07-07  0:29         ` David Gibson
@ 2015-07-07 12:11         ` Alexey Kardashevskiy
  2015-07-07 16:24           ` Alex Williamson
  1 sibling, 1 reply; 71+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-07 12:11 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Michael Roth, qemu-ppc, qemu-devel, Gavin Shan, David Gibson

On 07/07/2015 02:13 AM, Alex Williamson wrote:
> On Tue, 2015-07-07 at 01:34 +1000, Alexey Kardashevskiy wrote:
>> On 07/06/2015 11:42 PM, Alex Williamson wrote:
>>> On Mon, 2015-07-06 at 12:11 +1000, Alexey Kardashevskiy wrote:
>>>> This makes use of the new "memory registering" feature. The idea is
>>>> to provide the userspace ability to notify the host kernel about pages
>>>> which are going to be used for DMA. Having this information, the host
>>>> kernel can pin them all once per user process, do locked pages
>>>> accounting (once) and not spent time on doing that in real time with
>>>> possible failures which cannot be handled nicely in some cases.
>>>>
>>>> This adds a guest RAM memory listener which notifies a VFIO container
>>>> about memory which needs to be pinned/unpinned. VFIO MMIO regions
>>>> (i.e. "skip dump" regions) are skipped.
>>>>
>>>> The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
>>>> are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
>>>> not call it when v2 is detected and enabled.
>>>>
>>>> This does not change the guest visible interface.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>>>> ---
>>>> Changes:
>>>> v9:
>>>> * since there is no more SPAPR-specific data in container::iommu_data,
>>>> the memory preregistration fields are common and potentially can be used
>>>> by other architectures
>>>>
>>>> v7:
>>>> * in vfio_spapr_ram_listener_region_del(), do unref() after ioctl()
>>>> * s'ramlistener'register_listener'
>>>>
>>>> v6:
>>>> * fixed commit log (s/guest/userspace/), added note about no guest visible
>>>> change
>>>> * fixed error checking if ram registration failed
>>>> * added alignment check for section->offset_within_region
>>>>
>>>> v5:
>>>> * simplified the patch
>>>> * added trace points
>>>> * added round_up() for the size
>>>> * SPAPR IOMMU v2 used
>>>> ---
>>>>    hw/vfio/common.c              | 109 ++++++++++++++++++++++++++++++++++++++----
>>>>    include/hw/vfio/vfio-common.h |   3 ++
>>>>    trace-events                  |   1 +
>>>>    3 files changed, 104 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>> index 8eacfd7..0c7ba8c 100644
>>>> --- a/hw/vfio/common.c
>>>> +++ b/hw/vfio/common.c
>>>> @@ -488,6 +488,76 @@ static void vfio_listener_release(VFIOContainer *container)
>>>>        memory_listener_unregister(&container->iommu_data.type1.listener);
>>>>    }
>>>>
>>>> +static void vfio_ram_do_region(VFIOContainer *container,
>>>> +                              MemoryRegionSection *section, unsigned long req)
>>>> +{
>>>> +    int ret;
>>>> +    struct vfio_iommu_spapr_register_memory reg = { .argsz = sizeof(reg) };
>>>
>>> This function is not as general as the name would imply, it's spapr
>>> specific due to this.  How about vfio_spapr_register_memory() with a
>>> bool parameter toggling register vs unregister so we're not passing an
>>> arbitrary ioctl number?
>>
>> Ok. Although I am quite often asked not to do such a thing and rather add 2
>> helpers (reg/unreg, do/undo, etc) instead and reuse common bits.
>
> I'm not a fan of functions that do the reverse process based on a bool
> arg either, but I dislike them less than passing an arbitrary ioctl
> number for a parameter.  The former is ugly, but the latter is difficult
> to use and difficult to maintain because it would be subtle later to
> spot an unsupported ioctl being passed to the function.
>
>>>> +
>>>> +    if (!memory_region_is_ram(section->mr) ||
>>>> +        memory_region_is_skip_dump(section->mr)) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    if (unlikely((section->offset_within_region & (getpagesize() - 1)))) {
>>>
>>> s/getpagesize()/qemu_real_host_page_size/?
>>
>>
>> Oh, right, I guess it reached upstream now.
>>
>>
>>>> +        error_report("%s received unaligned region", __func__);
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    reg.vaddr = (__u64) memory_region_get_ram_ptr(section->mr) +
>>>> +        section->offset_within_region;
>>>> +    reg.size = ROUND_UP(int128_get64(section->size), TARGET_PAGE_SIZE);
>>>> +
>>>> +    ret = ioctl(container->fd, req, &reg);
>>>> +    trace_vfio_ram_register(_IOC_NR(req) - VFIO_BASE, reg.vaddr, reg.size,
>>>> +            ret ? -errno : 0);
>>>> +    if (!ret) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    /*
>>>> +     * On the initfn path, store the first error in the container so we
>>>> +     * can gracefully fail.  Runtime, there's not much we can do other
>>>> +     * than throw a hardware error.
>>>> +     */
>>>> +    if (!container->iommu_data.ram_reg_initialized) {
>>>> +        if (!container->iommu_data.ram_reg_error) {
>>>> +            container->iommu_data.ram_reg_error = -errno;
>>>> +        }
>>>> +    } else {
>>>> +        hw_error("vfio: RAM registering failed, unable to continue");
>>>> +    }
>>>
>>> I'd rather see:
>>>
>>> if (ret) {
>>>     if (!container...) {
>>>       ...
>>>     } else {
>>>       ...
>>>     }
>>> }
>>>
>>> Exiting early on success and otherwise falling into error handling is a
>>> strange code flow.
>>
>> Ok... vfio_dma_map() does not follow this rule so I thought it is not that
>> strict :)
>
> It would be nice to clean it up there too.
>
>>>> +}
>>>> +
>>>> +static void vfio_ram_listener_region_add(MemoryListener *listener,
>>>> +                                         MemoryRegionSection *section)
>>>> +{
>>>> +    VFIOContainer *container = container_of(listener, VFIOContainer,
>>>> +                                            iommu_data.register_listener);
>>>> +    memory_region_ref(section->mr);
>>>> +    vfio_ram_do_region(container, section, VFIO_IOMMU_SPAPR_REGISTER_MEMORY);
>>>
>>> vfio_spapr_register_memory(container, section, true);
>>>
>>>> +}
>>>> +
>>>> +static void vfio_ram_listener_region_del(MemoryListener *listener,
>>>> +                                         MemoryRegionSection *section)
>>>> +{
>>>> +    VFIOContainer *container = container_of(listener, VFIOContainer,
>>>> +                                            iommu_data.register_listener);
>>>> +    vfio_ram_do_region(container, section, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY);
>>>
>>> vfio_spapr_register_memory(container, section, false);
>>>
>>>> +    memory_region_unref(section->mr);
>>>> +}
>>>> +
>>>> +static const MemoryListener vfio_ram_memory_listener = {
>>>> +    .region_add = vfio_ram_listener_region_add,
>>>> +    .region_del = vfio_ram_listener_region_del,
>>>> +};
>>>
>>> These are all spapr specific, please reflect that in the name;
>>> vfio_spapr_v2_memory_listener, vfio_spapr_v2_listener_add/del.
>>
>> ok.
>>
>>
>>> Actually, can't we determine what type of IOMMU we have and make the
>>> existing MemoryListener handle either type1 or spapr or spapr-v2?
>>
>>
>> Sorry, I do not follow you here. How? The existing listener listens on PCI
>> address space (at least, on pseries), new one listens on RAM address space
>> (address_space_memory). What do I miss?
>
> Isn't that simply a difference of the address space the listener is
> attached to?  Type1 maps RAM, spapr-v1 maps guest IOMMU space and these
> are already both handled by the same listener.


Ok, I tried merging the two listeners and realized that the PCI listener works
with TARGET_PAGE_SIZE granularity (which is 4K; it should really be using
the IOMMU page size, which is not easily available there, but that is a
different story), while the RAM listener works with
qemu_real_host_page_size granularity (64K in my case). So, depending on the
address space type, vfio_listener_region_add() would have to use different
page sizes. I like the idea of merging less now...



-- 
Alexey
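
To make the granularity difference concrete, here is a small sketch of
rounding a section size up to each page size. The macro and function names
are hypothetical, and the 4K and 64K values are simply the ones mentioned in
this message, not values queried from a host.

```c
#include <stdint.h>

/* Hypothetical round-up macro for this sketch; d must be nonzero. */
#define SKETCH_ROUND_UP(n, d) ((((n) + (d) - 1) / (d)) * (d))

/* Round a section size up to a whole number of pages of the given size. */
static uint64_t sketch_round_to_page(uint64_t size, uint64_t page_size)
{
    return SKETCH_ROUND_UP(size, page_size);
}
```

A section of 0x11000 bytes is already a multiple of 4K and stays 0x11000 at
that granularity, but it grows to 0x20000 when rounded to 64K pages, so the
two listeners would register different ranges for the same section.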

* Re: [Qemu-devel] [PATCH qemu v10 13/14] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2015-07-07 12:11         ` Alexey Kardashevskiy
@ 2015-07-07 16:24           ` Alex Williamson
  2015-07-08  6:26             ` Alexey Kardashevskiy
  0 siblings, 1 reply; 71+ messages in thread
From: Alex Williamson @ 2015-07-07 16:24 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Michael Roth, qemu-ppc, qemu-devel, Gavin Shan, David Gibson

On Tue, 2015-07-07 at 22:11 +1000, Alexey Kardashevskiy wrote:
> On 07/07/2015 02:13 AM, Alex Williamson wrote:
> > On Tue, 2015-07-07 at 01:34 +1000, Alexey Kardashevskiy wrote:
> >> On 07/06/2015 11:42 PM, Alex Williamson wrote:
> >>> On Mon, 2015-07-06 at 12:11 +1000, Alexey Kardashevskiy wrote:
> >>>> This makes use of the new "memory registering" feature. The idea is
> >>>> to provide the userspace ability to notify the host kernel about pages
> >>>> which are going to be used for DMA. Having this information, the host
> >>>> kernel can pin them all once per user process, do locked pages
> >>>> accounting (once) and not spent time on doing that in real time with
> >>>> possible failures which cannot be handled nicely in some cases.
> >>>>
> >>>> This adds a guest RAM memory listener which notifies a VFIO container
> >>>> about memory which needs to be pinned/unpinned. VFIO MMIO regions
> >>>> (i.e. "skip dump" regions) are skipped.
> >>>>
> >>>> The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
> >>>> are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
> >>>> not call it when v2 is detected and enabled.
> >>>>
> >>>> This does not change the guest visible interface.
> >>>>
> >>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> >>>> ---
> >>>> Changes:
> >>>> v9:
> >>>> * since there is no more SPAPR-specific data in container::iommu_data,
> >>>> the memory preregistration fields are common and potentially can be used
> >>>> by other architectures
> >>>>
> >>>> v7:
> >>>> * in vfio_spapr_ram_listener_region_del(), do unref() after ioctl()
> >>>> * s'ramlistener'register_listener'
> >>>>
> >>>> v6:
> >>>> * fixed commit log (s/guest/userspace/), added note about no guest visible
> >>>> change
> >>>> * fixed error checking if ram registration failed
> >>>> * added alignment check for section->offset_within_region
> >>>>
> >>>> v5:
> >>>> * simplified the patch
> >>>> * added trace points
> >>>> * added round_up() for the size
> >>>> * SPAPR IOMMU v2 used
> >>>> ---
> >>>>    hw/vfio/common.c              | 109 ++++++++++++++++++++++++++++++++++++++----
> >>>>    include/hw/vfio/vfio-common.h |   3 ++
> >>>>    trace-events                  |   1 +
> >>>>    3 files changed, 104 insertions(+), 9 deletions(-)
> >>>>
> >>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >>>> index 8eacfd7..0c7ba8c 100644
> >>>> --- a/hw/vfio/common.c
> >>>> +++ b/hw/vfio/common.c
> >>>> @@ -488,6 +488,76 @@ static void vfio_listener_release(VFIOContainer *container)
> >>>>        memory_listener_unregister(&container->iommu_data.type1.listener);
> >>>>    }
> >>>>
> >>>> +static void vfio_ram_do_region(VFIOContainer *container,
> >>>> +                              MemoryRegionSection *section, unsigned long req)
> >>>> +{
> >>>> +    int ret;
> >>>> +    struct vfio_iommu_spapr_register_memory reg = { .argsz = sizeof(reg) };
> >>>
> >>> This function is not as general as the name would imply, it's spapr
> >>> specific due to this.  How about vfio_spapr_register_memory() with a
> >>> bool parameter toggling register vs unregister so we're not passing an
> >>> arbitrary ioctl number?
> >>
> >> Ok. Although I am quite often asked not to do such a thing and rather add 2
> >> helpers (reg/unreg, do/undo, etc) instead and reuse common bits.
> >
> > I'm not a fan of functions that do the reverse process based on a bool
> > arg either, but I dislike them less than passing an arbitrary ioctl
> > number for a parameter.  The former is ugly, but the latter is difficult
> > to use and difficult to maintain because it would be subtle later to
> > spot an unsupported ioctl being passed to the function.
> >
> >>>> +
> >>>> +    if (!memory_region_is_ram(section->mr) ||
> >>>> +        memory_region_is_skip_dump(section->mr)) {
> >>>> +        return;
> >>>> +    }
> >>>> +
> >>>> +    if (unlikely((section->offset_within_region & (getpagesize() - 1)))) {
> >>>
> >>> s/getpagesize()/qemu_real_host_page_size/?
> >>
> >>
> >> Oh, right, I guess it reached upstream now.
> >>
> >>
> >>>> +        error_report("%s received unaligned region", __func__);
> >>>> +        return;
> >>>> +    }
> >>>> +
> >>>> +    reg.vaddr = (__u64) memory_region_get_ram_ptr(section->mr) +
> >>>> +        section->offset_within_region;
> >>>> +    reg.size = ROUND_UP(int128_get64(section->size), TARGET_PAGE_SIZE);
> >>>> +
> >>>> +    ret = ioctl(container->fd, req, &reg);
> >>>> +    trace_vfio_ram_register(_IOC_NR(req) - VFIO_BASE, reg.vaddr, reg.size,
> >>>> +            ret ? -errno : 0);
> >>>> +    if (!ret) {
> >>>> +        return;
> >>>> +    }
> >>>> +
> >>>> +    /*
> >>>> +     * On the initfn path, store the first error in the container so we
> >>>> +     * can gracefully fail.  Runtime, there's not much we can do other
> >>>> +     * than throw a hardware error.
> >>>> +     */
> >>>> +    if (!container->iommu_data.ram_reg_initialized) {
> >>>> +        if (!container->iommu_data.ram_reg_error) {
> >>>> +            container->iommu_data.ram_reg_error = -errno;
> >>>> +        }
> >>>> +    } else {
> >>>> +        hw_error("vfio: RAM registering failed, unable to continue");
> >>>> +    }
> >>>
> >>> I'd rather see:
> >>>
> >>> if (ret) {
> >>>     if (!container...) {
> >>>       ...
> >>>     } else {
> >>>       ...
> >>>     }
> >>> }
> >>>
> >>> Exiting early on success and otherwise falling into error handling is a
> >>> strange code flow.
> >>
> >> Ok... vfio_dma_map() does not follow this rule so I thought it is not that
> >> strict :)
> >
> > It would be nice to clean it up there too.
> >
> >>>> +}
> >>>> +
> >>>> +static void vfio_ram_listener_region_add(MemoryListener *listener,
> >>>> +                                         MemoryRegionSection *section)
> >>>> +{
> >>>> +    VFIOContainer *container = container_of(listener, VFIOContainer,
> >>>> +                                            iommu_data.register_listener);
> >>>> +    memory_region_ref(section->mr);
> >>>> +    vfio_ram_do_region(container, section, VFIO_IOMMU_SPAPR_REGISTER_MEMORY);
> >>>
> >>> vfio_spapr_register_memory(container, section, true);
> >>>
> >>>> +}
> >>>> +
> >>>> +static void vfio_ram_listener_region_del(MemoryListener *listener,
> >>>> +                                         MemoryRegionSection *section)
> >>>> +{
> >>>> +    VFIOContainer *container = container_of(listener, VFIOContainer,
> >>>> +                                            iommu_data.register_listener);
> >>>> +    vfio_ram_do_region(container, section, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY);
> >>>
> >>> vfio_spapr_register_memory(container, section, false);
> >>>
> >>>> +    memory_region_unref(section->mr);
> >>>> +}
> >>>> +
> >>>> +static const MemoryListener vfio_ram_memory_listener = {
> >>>> +    .region_add = vfio_ram_listener_region_add,
> >>>> +    .region_del = vfio_ram_listener_region_del,
> >>>> +};
> >>>
> >>> These are all spapr specific, please reflect that in the name;
> >>> vfio_spapr_v2_memory_listener, vfio_spapr_v2_listener_add/del.
> >>
> >> ok.
> >>
> >>
> >>> Actually, can't we determine what type of IOMMU we have and make the
> >>> existing MemoryListener handle either type1 or spapr or spapr-v2?
> >>
> >>
> >> Sorry, I do not follow you here. How? The existing listener listens on PCI
> >> address space (at least, on pseries), new one listens on RAM address space
> >> (address_space_memory). What do I miss?
> >
> > Isn't that simply a difference of the address space the listener is
> > attached to?  Type1 maps RAM, spapr-v1 maps guest IOMMU space and these
> > are already both handled by the same listener.
> 
> 
> Ok, I tried merging 2 listeners and realized that the PCI listener works 
> with TARGET_PAGE_SIZE granularity (which is 4K and actually it should be 
> using an IOMMU page size which is not easily available there but this is a 
> different story) and RAM listener with the qemu_real_host_page_size 
> granularity (64K for my case) so depending on the address space type, 
> vfio_listener_region_add() will have to use different page sizes. I like 
> the idea of merging less now...

Sounds like you're already solving something that needs to be fixed for
both.  The type1 VFIO_IOMMU_GET_INFO ioctl does actually give us a
bitmap of supported iommu page sizes.  It's really all but useless for
anything except determining the minimum page size.  For the most part we
just assume that it's the same as the host page size, so those existing
checks could actually change to host page alignment pretty safely.  I
think we both actually want pages that are both host and target aligned,
don't we?  What would you do on a 64k host if the guest tried to map a
region that only had 4k alignment?  Anyway, if that's the only problem,
it looks more like an opportunity than a barrier.  Thanks,

Alex
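
As a sketch of the "both host and target aligned" condition raised above: for
power-of-two page sizes, alignment to the larger of the two implies alignment
to the smaller one, so a single mask is enough. The helper is hypothetical,
and the page sizes are passed in as parameters rather than read from QEMU.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true when addr is aligned to both the host and the
 * target page size. Both sizes are assumed to be powers of two.
 */
static bool sketch_host_and_target_aligned(uint64_t addr,
                                           uint64_t host_page_size,
                                           uint64_t target_page_size)
{
    uint64_t align = host_page_size > target_page_size ? host_page_size
                                                       : target_page_size;

    return (addr & (align - 1)) == 0;
}
```

With a 64K host and a 4K target, a region that is only 4K aligned fails this
check, which is the case asked about above.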

* Re: [Qemu-devel] [PATCH qemu v10 13/14] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2015-07-07 11:05         ` Alexey Kardashevskiy
@ 2015-07-08  4:30           ` David Gibson
  2015-07-08  6:24             ` Thomas Huth
                               ` (2 more replies)
  0 siblings, 3 replies; 71+ messages in thread
From: David Gibson @ 2015-07-08  4:30 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Thomas Huth, Michael Roth, qemu-devel, Gavin Shan,
	Alex Williamson, qemu-ppc

On Tue, Jul 07, 2015 at 09:05:02PM +1000, Alexey Kardashevskiy wrote:
> On 07/07/2015 08:21 PM, Thomas Huth wrote:
> >On Tue, 7 Jul 2015 20:05:25 +1000
> >Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >
> >>On 07/07/2015 05:23 PM, Thomas Huth wrote:
> >>>On Mon,  6 Jul 2015 12:11:09 +1000
> >>>Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >...
> >>>>diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >>>>index 8eacfd7..0c7ba8c 100644
> >>>>--- a/hw/vfio/common.c
> >>>>+++ b/hw/vfio/common.c
> >>>>@@ -488,6 +488,76 @@ static void vfio_listener_release(VFIOContainer *container)
> >>>>       memory_listener_unregister(&container->iommu_data.type1.listener);
> >>>>   }
> >>>>
> >>>>+static void vfio_ram_do_region(VFIOContainer *container,
> >>>>+                              MemoryRegionSection *section, unsigned long req)
> >>>>+{
> >>>>+    int ret;
> >>>>+    struct vfio_iommu_spapr_register_memory reg = { .argsz = sizeof(reg) };
> >>>>+
> >>>>+    if (!memory_region_is_ram(section->mr) ||
> >>>>+        memory_region_is_skip_dump(section->mr)) {
> >>>>+        return;
> >>>>+    }
> >>>>+
> >>>>+    if (unlikely((section->offset_within_region & (getpagesize() - 1)))) {
> >>>>+        error_report("%s received unaligned region", __func__);
> >>>>+        return;
> >>>>+    }
> >>>>+
> >>>>+    reg.vaddr = (__u64) memory_region_get_ram_ptr(section->mr) +
> >>>
> >>>We're in userspace here ... I think it would be better to use uint64_t
> >>>instead of the kernel-type __u64.
> >>
> >>We are calling a kernel here - @reg is a kernel-defined struct.
> >
> >If you grep for __u64 in the QEMU sources, you'll see that hardly
> >anybody is using this type - even if calling ioctls. So for
> >consistency, I'd really suggest to use uint64_t here.
> 
> I am not using it, I am packing data to a struct. So does vfio_dma_map()
> already.

__u64 is just an alias typedef used by the kernel in uapi headers for
64-bit integers.  You should use uint64_t here.
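
A minimal sketch of filling a 64-bit address field from a host pointer using
only standard types. The struct below is a stand-in invented for this example
(the real layout lives in the kernel's VFIO uapi header), and the helper name
is hypothetical. Converting through uintptr_t first avoids the "cast from
pointer to integer of different size" warning that GCC would otherwise give
on hosts where pointers are narrower than 64 bits.

```c
#include <stdint.h>

/* Stand-in for a kernel-defined request struct; fields are illustrative. */
struct sketch_register_memory {
    uint64_t vaddr;  /* start of the region in the process address space */
    uint64_t size;   /* length of the region in bytes */
};

/* Hypothetical helper: describe [ram_base + offset, +size) in the struct. */
static void sketch_fill_region(struct sketch_register_memory *reg,
                               void *ram_base, uint64_t offset, uint64_t size)
{
    /* Pointer -> uintptr_t -> uint64_t, then add the offset as an integer. */
    reg->vaddr = (uint64_t)(uintptr_t)ram_base + offset;
    reg->size = size;
}
```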

> >>>>@@ -698,14 +768,18 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> >>>>
> >>>>           container->iommu_data.type1.initialized = true;
> >>>>
> >>>>-    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
> >>>>+    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
> >>>>+               ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
> >>>>+        bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU);
> >>>
> >>>That "!!" sounds somewhat wrong here. I think you either want to check
> >>>for "ioctl() == 1" (because only in this case you can be sure that v2
> >>>is supported), or you can simply omit the "!!" because you're 100% sure
> >>>that the ioctl only returns 0 or 1 (and never a negative error code).
> >>
> >>
> >>The host kernel does not return an error on these ioctls, it returns 0 or
> >>1. And "!!" is shorter than "(bool)". VFIO_CHECK_EXTENSION for Type1 does
> >>exactly the same already.
> >
> >Simply using nothing instead is even shorter than using "!!". The
> >compiler is smart enough to convert from 0 and 1 to bool.
> >"!!" is IMHO quite ugly and should only be used when it is really
> >necessary.
> 
> 
> imho it is not but either way I'd rather follow the existing style,
> especially if I do literally the same thing (checking IOMMU version). Unless
> the original author tells me to convert all the existing occurrences of "!!"
> to "!=0" (or something like this) before I post new ones.
> 
> Alex, should I get rid of "!!"s in the patch?

I think !! is the lesser evil here.  The trouble is that in C "bool"
is not a first-class datatype, but just a typedef for some integer
type.  Which means that, confusingly, (bool)2 != (bool)1.  So using
the !! trick to force a value to be either 0 or 1 when assigning it to
a bool variable is probably a good idea.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH qemu v10 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW)
  2015-07-06 15:54 ` Thomas Huth
  2015-07-06 16:07   ` Alexey Kardashevskiy
@ 2015-07-08  4:34   ` David Gibson
  1 sibling, 0 replies; 71+ messages in thread
From: David Gibson @ 2015-07-08  4:34 UTC (permalink / raw)
  To: Thomas Huth
  Cc: Michael Roth, Alexey Kardashevskiy, qemu-devel, Gavin Shan,
	Alex Williamson, qemu-ppc

On Mon, Jul 06, 2015 at 05:54:56PM +0200, Thomas Huth wrote:
> On Mon,  6 Jul 2015 12:10:56 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> ...
> > 
> > This patchset adds DDW support for pseries. The host kernel changes are
> > required, available in the current upstream.
> > 
> > This patchset is based on git://github.com/dgibson/qemu.git spapr-next branch.
> > 
> > Please comment. Thanks!
> 
>  Alexey,
> 
> I'm sorry, but it looks like this patch set badly fails to link when
> compiling for a non-Linux target:
> 
>   LINK  ppc64-softmmu/qemu-system-ppc64.exe
> hw/ppc/spapr_pci.o: In function `spapr_phb_dma_capabilities_update':
> /home/thuth/devel/qemu/hw/ppc/spapr_pci.c:785: undefined reference to `spapr_phb_vfio_dma_capabilities_update'
> hw/ppc/spapr_pci.o: In function `rtas_ibm_configure_pe':
> /home/thuth/devel/qemu/hw/ppc/spapr_pci.c:601: undefined reference to `spapr_phb_vfio_eeh_configure'
> hw/ppc/spapr_pci.o: In function `rtas_ibm_set_slot_reset':
> /home/thuth/devel/qemu/hw/ppc/spapr_pci.c:573: undefined reference to `spapr_phb_vfio_eeh_reset'
> hw/ppc/spapr_pci.o: In function `rtas_ibm_read_slot_reset_state2':
> /home/thuth/devel/qemu/hw/ppc/spapr_pci.c:533: undefined reference to `spapr_phb_vfio_eeh_get_state'
> hw/ppc/spapr_pci.o: In function `rtas_ibm_set_eeh_option':
> /home/thuth/devel/qemu/hw/ppc/spapr_pci.c:455: undefined reference to `spapr_phb_vfio_eeh_set_option'
> hw/ppc/spapr_pci.o: In function `spapr_phb_hotplug_dma_sync':
> /home/thuth/devel/qemu/hw/ppc/spapr_pci.c:884: undefined reference to `spapr_phb_vfio_dma_remove_window'
> /home/thuth/devel/qemu/hw/ppc/spapr_pci.c:894: undefined reference to `spapr_phb_vfio_dma_init_window'
> hw/ppc/spapr_pci.o: In function `spapr_phb_dma_init_window':
> /home/thuth/devel/qemu/hw/ppc/spapr_pci.c:805: undefined reference to `spapr_phb_vfio_dma_init_window'
> hw/ppc/spapr_pci.o: In function `spapr_phb_dma_remove_window':
> /home/thuth/devel/qemu/hw/ppc/spapr_pci.c:834: undefined reference to `spapr_phb_vfio_dma_remove_window'
> hw/ppc/spapr_pci.o: In function `spapr_phb_reset':
> /home/thuth/devel/qemu/hw/ppc/spapr_pci.c:1538: undefined reference to `spapr_phb_vfio_eeh_reenable'
> collect2: error: ld returned 1 exit status
> 
> Please make sure that this series also works if either CONFIG_LINUX
> or CONFIG_PCI are not enabled!

I don't think !CONFIG_PCI really needs handling - I think having
CONFIG_PSERIES enabled without CONFIG_PCI is simply a configuration
error - another thing to handle when and if we ever encode config
dependencies in qemu.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
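
One common way to avoid link failures like the one quoted above is to give
every backend-only entry point a fallback when the backend is not built. The
sketch below shows the pattern with invented names: SKETCH_CONFIG_VFIO,
sketch_vfio_eeh_reset(), struct sketch_phb and the error value are all
hypothetical, since the real function signatures are not shown in this
thread.

```c
#include <stddef.h>

/* Hypothetical error value returned when the backend is compiled out. */
#define SKETCH_ERR_UNSUPPORTED (-1)

struct sketch_phb;  /* opaque stand-in for the host bridge state */

#ifdef SKETCH_CONFIG_VFIO
/* Real implementation would live in a file built only with the backend. */
int sketch_vfio_eeh_reset(struct sketch_phb *phb, int option);
#else
/* Fallback stub: keeps callers linking on builds without the backend. */
static inline int sketch_vfio_eeh_reset(struct sketch_phb *phb, int option)
{
    (void)phb;
    (void)option;
    return SKETCH_ERR_UNSUPPORTED;
}
#endif

/* A caller in common code needs no #ifdef of its own. */
static int sketch_rtas_set_slot_reset(struct sketch_phb *phb, int option)
{
    return sketch_vfio_eeh_reset(phb, option);
}
```

The caller stays free of conditional compilation; only the header that
declares the entry points has to know whether the backend exists.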


* Re: [Qemu-devel] [PATCH qemu v10 13/14] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2015-07-08  4:30           ` David Gibson
@ 2015-07-08  6:24             ` Thomas Huth
  2015-07-08  6:50               ` David Gibson
  2015-07-08  7:07             ` Alexey Kardashevskiy
  2015-07-08 14:47             ` Alex Williamson
  2 siblings, 1 reply; 71+ messages in thread
From: Thomas Huth @ 2015-07-08  6:24 UTC (permalink / raw)
  To: David Gibson
  Cc: Michael Roth, Alexey Kardashevskiy, qemu-devel, Gavin Shan,
	Alex Williamson, qemu-ppc

On Wed, 8 Jul 2015 14:30:29 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Tue, Jul 07, 2015 at 09:05:02PM +1000, Alexey Kardashevskiy wrote:
> > On 07/07/2015 08:21 PM, Thomas Huth wrote:
> > >On Tue, 7 Jul 2015 20:05:25 +1000
> > >Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > >
> > >>On 07/07/2015 05:23 PM, Thomas Huth wrote:
> > >>>On Mon,  6 Jul 2015 12:11:09 +1000
> > >>>Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
...
> > >>>>@@ -698,14 +768,18 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> > >>>>
> > >>>>           container->iommu_data.type1.initialized = true;
> > >>>>
> > >>>>-    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
> > >>>>+    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
> > >>>>+               ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
> > >>>>+        bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU);
> > >>>
> > >>>That "!!" sounds somewhat wrong here. I think you either want to check
> > >>>for "ioctl() == 1" (because only in this case you can be sure that v2
> > >>>is supported), or you can simply omit the "!!" because you're 100% sure
> > >>>that the ioctl only returns 0 or 1 (and never a negative error code).
> > >>
> > >>
> > >>The host kernel does not return an error on these ioctls, it returns 0 or
> > >>1. And "!!" is shorter than "(bool)". VFIO_CHECK_EXTENSION for Type1 does
> > >>exactly the same already.
> > >
> > >Simply using nothing instead is even shorter than using "!!". The
> > >compiler is smart enough to convert from 0 and 1 to bool.
> > >"!!" is IMHO quite ugly and should only be used when it is really
> > >necessary.
> > 
> > 
> > imho it is not but either way I'd rather follow the existing style,
> > especially if I do literally the same thing (checking IOMMU version). Unless
> > the original author tells me to convert all the existing occurrences of "!!"
> > to "!=0" (or something like this) before I post new ones.
> > 
> > Alex, should I get rid of "!!"s in the patch?
> 
> I think !! is the lesser evil here.  The trouble is that in C "bool"
> is not a first-class datatype, but just a typedef for some integer
> type.  Which means that, confusingly, (bool)2 != (bool)1.  So using
> the !! trick to force a value to be either 0 or 1 when assigning it to
> a bool variable is probably a good idea.

That was maybe the case > 15 years ago, but since C99, there is a
proper bool type in C, as far as I know. But I am also not an expert
here... However, I tried the following small test program:

#include <stdio.h>
#include <stdbool.h>

int main(void)
{
	bool a = 1;
	bool b = 2;	/* any nonzero value converts to true, i.e. 1 */
	printf("a=%i b=%i\n", a, b);
	return 0;
}

... and indeed, it prints out "a=1 b=1" here, so the "2" got properly
changed to "true" :-)

Anyway, that was already too much bike-shed painting now, if you want to
keep the "!!", then keep it, that's fine for me, too.

 Thomas
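
Building on the test program above, a short sketch of how the three spellings
discussed in this thread treat a possible negative return value. The function
names are hypothetical. With a C99 bool, any nonzero value (including -1)
converts to true, so only an explicit comparison separates "extension
present" from "call failed".

```c
#include <stdbool.h>

/* Hypothetical helpers; ret stands for the value returned by an ioctl(). */

/* Plain assignment: C99 converts any nonzero int to true. */
static bool sketch_ext_plain(int ret)
{
    return ret;
}

/* Double negation: same result as the plain conversion for a C99 bool. */
static bool sketch_ext_double_neg(int ret)
{
    return !!ret;
}

/* Strict check: only a positive return counts as "supported". */
static bool sketch_ext_strict(int ret)
{
    return ret > 0;
}
```

If the ioctl can only return 0 or 1, all three are equivalent; they differ
only when a negative error code is possible.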


* Re: [Qemu-devel] [PATCH qemu v10 13/14] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2015-07-07 16:24           ` Alex Williamson
@ 2015-07-08  6:26             ` Alexey Kardashevskiy
  2015-07-08 14:51               ` Alex Williamson
  0 siblings, 1 reply; 71+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-08  6:26 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Michael Roth, qemu-ppc, qemu-devel, Gavin Shan, David Gibson

On 07/08/2015 02:24 AM, Alex Williamson wrote:
> On Tue, 2015-07-07 at 22:11 +1000, Alexey Kardashevskiy wrote:
>> On 07/07/2015 02:13 AM, Alex Williamson wrote:
>>> On Tue, 2015-07-07 at 01:34 +1000, Alexey Kardashevskiy wrote:
>>>> On 07/06/2015 11:42 PM, Alex Williamson wrote:
>>>>> On Mon, 2015-07-06 at 12:11 +1000, Alexey Kardashevskiy wrote:
>>>>>> This makes use of the new "memory registering" feature. The idea is
>>>>>> to provide the userspace ability to notify the host kernel about pages
>>>>>> which are going to be used for DMA. Having this information, the host
>>>>>> kernel can pin them all once per user process, do locked pages
>>>>>> accounting (once) and not spent time on doing that in real time with
>>>>>> possible failures which cannot be handled nicely in some cases.
>>>>>>
>>>>>> This adds a guest RAM memory listener which notifies a VFIO container
>>>>>> about memory which needs to be pinned/unpinned. VFIO MMIO regions
>>>>>> (i.e. "skip dump" regions) are skipped.
>>>>>>
>>>>>> The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
>>>>>> are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
>>>>>> not call it when v2 is detected and enabled.
>>>>>>
>>>>>> This does not change the guest visible interface.
>>>>>>
>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>>>>>> ---
>>>>>> Changes:
>>>>>> v9:
>>>>>> * since there is no more SPAPR-specific data in container::iommu_data,
>>>>>> the memory preregistration fields are common and potentially can be used
>>>>>> by other architectures
>>>>>>
>>>>>> v7:
>>>>>> * in vfio_spapr_ram_listener_region_del(), do unref() after ioctl()
>>>>>> * s'ramlistener'register_listener'
>>>>>>
>>>>>> v6:
>>>>>> * fixed commit log (s/guest/userspace/), added note about no guest visible
>>>>>> change
>>>>>> * fixed error checking if ram registration failed
>>>>>> * added alignment check for section->offset_within_region
>>>>>>
>>>>>> v5:
>>>>>> * simplified the patch
>>>>>> * added trace points
>>>>>> * added round_up() for the size
>>>>>> * SPAPR IOMMU v2 used
>>>>>> ---
>>>>>>     hw/vfio/common.c              | 109 ++++++++++++++++++++++++++++++++++++++----
>>>>>>     include/hw/vfio/vfio-common.h |   3 ++
>>>>>>     trace-events                  |   1 +
>>>>>>     3 files changed, 104 insertions(+), 9 deletions(-)
>>>>>>
>>>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>>>> index 8eacfd7..0c7ba8c 100644
>>>>>> --- a/hw/vfio/common.c
>>>>>> +++ b/hw/vfio/common.c
>>>>>> @@ -488,6 +488,76 @@ static void vfio_listener_release(VFIOContainer *container)
>>>>>>         memory_listener_unregister(&container->iommu_data.type1.listener);
>>>>>>     }
>>>>>>
>>>>>> +static void vfio_ram_do_region(VFIOContainer *container,
>>>>>> +                              MemoryRegionSection *section, unsigned long req)
>>>>>> +{
>>>>>> +    int ret;
>>>>>> +    struct vfio_iommu_spapr_register_memory reg = { .argsz = sizeof(reg) };
>>>>>
>>>>> This function is not as general as the name would imply, it's spapr
>>>>> specific due to this.  How about vfio_spapr_register_memory() with a
>>>>> bool parameter toggling register vs unregister so we're not passing an
>>>>> arbitrary ioctl number?
>>>>
>>>> Ok. Although I am quite often asked not to do such a thing and rather add 2
>>>> helpers (reg/unreg, do/undo, etc) instead and reuse common bits.
>>>
>>> I'm not a fan of functions that do the reverse process based on a bool
>>> arg either, but I dislike them less than passing an arbitrary ioctl
>>> number for a parameter.  The former is ugly, but the latter is difficult
>>> to use and difficult to maintain because it would be subtle later to
>>> spot an unsupported ioctl being passed to the function.
>>>
>>>>>> +
>>>>>> +    if (!memory_region_is_ram(section->mr) ||
>>>>>> +        memory_region_is_skip_dump(section->mr)) {
>>>>>> +        return;
>>>>>> +    }
>>>>>> +
>>>>>> +    if (unlikely((section->offset_within_region & (getpagesize() - 1)))) {
>>>>>
>>>>> s/getpagesize()/qemu_real_host_page_size/?
>>>>
>>>>
>>>> Oh, right, I guess it reached upstream now.
>>>>
>>>>
>>>>>> +        error_report("%s received unaligned region", __func__);
>>>>>> +        return;
>>>>>> +    }
>>>>>> +
>>>>>> +    reg.vaddr = (__u64) memory_region_get_ram_ptr(section->mr) +
>>>>>> +        section->offset_within_region;
>>>>>> +    reg.size = ROUND_UP(int128_get64(section->size), TARGET_PAGE_SIZE);
>>>>>> +
>>>>>> +    ret = ioctl(container->fd, req, &reg);
>>>>>> +    trace_vfio_ram_register(_IOC_NR(req) - VFIO_BASE, reg.vaddr, reg.size,
>>>>>> +            ret ? -errno : 0);
>>>>>> +    if (!ret) {
>>>>>> +        return;
>>>>>> +    }
>>>>>> +
>>>>>> +    /*
>>>>>> +     * On the initfn path, store the first error in the container so we
>>>>>> +     * can gracefully fail.  Runtime, there's not much we can do other
>>>>>> +     * than throw a hardware error.
>>>>>> +     */
>>>>>> +    if (!container->iommu_data.ram_reg_initialized) {
>>>>>> +        if (!container->iommu_data.ram_reg_error) {
>>>>>> +            container->iommu_data.ram_reg_error = -errno;
>>>>>> +        }
>>>>>> +    } else {
>>>>>> +        hw_error("vfio: RAM registering failed, unable to continue");
>>>>>> +    }
>>>>>
>>>>> I'd rather see:
>>>>>
>>>>> if (ret) {
>>>>>      if (!container...) {
>>>>>        ...
>>>>>      } else {
>>>>>        ...
>>>>>      }
>>>>> }
>>>>>
>>>>> Exiting early on success and otherwise falling into error handling is a
>>>>> strange code flow.
>>>>
>>>> Ok... vfio_dma_map() does not follow this rule so I thought it is not that
>>>> strict :)
>>>
>>> It would be nice to clean it up there too.
>>>
>>>>>> +}
>>>>>> +
>>>>>> +static void vfio_ram_listener_region_add(MemoryListener *listener,
>>>>>> +                                         MemoryRegionSection *section)
>>>>>> +{
>>>>>> +    VFIOContainer *container = container_of(listener, VFIOContainer,
>>>>>> +                                            iommu_data.register_listener);
>>>>>> +    memory_region_ref(section->mr);
>>>>>> +    vfio_ram_do_region(container, section, VFIO_IOMMU_SPAPR_REGISTER_MEMORY);
>>>>>
>>>>> vfio_spapr_register_memory(container, section, true);
>>>>>
>>>>>> +}
>>>>>> +
>>>>>> +static void vfio_ram_listener_region_del(MemoryListener *listener,
>>>>>> +                                         MemoryRegionSection *section)
>>>>>> +{
>>>>>> +    VFIOContainer *container = container_of(listener, VFIOContainer,
>>>>>> +                                            iommu_data.register_listener);
>>>>>> +    vfio_ram_do_region(container, section, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY);
>>>>>
>>>>> vfio_spapr_register_memory(container, section, false);
>>>>>
>>>>>> +    memory_region_unref(section->mr);
>>>>>> +}
>>>>>> +
>>>>>> +static const MemoryListener vfio_ram_memory_listener = {
>>>>>> +    .region_add = vfio_ram_listener_region_add,
>>>>>> +    .region_del = vfio_ram_listener_region_del,
>>>>>> +};
>>>>>
>>>>> These are all spapr specific, please reflect that in the name;
>>>>> vfio_spapr_v2_memory_listener, vfio_spapr_v2_listener_add/del.
>>>>
>>>> ok.
>>>>
>>>>
>>>>> Actually, can't we determine what type of IOMMU we have and make the
>>>>> existing MemoryListener handle either type1 or spapr or spapr-v2?
>>>>
>>>>
>>>> Sorry, I do not follow you here. How? The existing listener listens on PCI
>>>> address space (at least, on pseries), new one listens on RAM address space
>>>> (address_space_memory). What do I miss?
>>>
>>> Isn't that simply a difference of the address space the listener is
>>> attached to?  Type1 maps RAM, spapr-v1 maps guest IOMMU space and these
>>> are already both handled by the same listener.
>>
>>
>> Ok, I tried merging 2 listeners and realized that the PCI listener works
>> with TARGET_PAGE_SIZE granularity (which is 4K and actually it should be
>> using an IOMMU page size which is not easily available there but this is a
>> different story) and RAM listener with the qemu_real_host_page_size
>> granularity (64K for my case) so depending on the address space type,
>> vfio_listener_region_add() will have to use different page sizes. I like
>> the idea of merging less now...
>
> Sounds like you're already solving something that needs to be fixed for
> both.  The type1 VFIO_IOMMU_GET_INFO ioctl does actually give us a
> bitmap of supported iommu page sizes.  It's really all but useless for
> anything except determining the minimum page size.

btw what sizes can really come from there?

>  For the most part we
> just assume that it's the same as the host page size, so those existing
> checks could actually change to host page alignment pretty safely.  I
> think we both actually want pages that are both host and target aligned,
> don't we?  What would you do on a 64k host if the guest tried to map a
> region that only had 4k alignment?

I will get_user_pages_fast(va & PAGE_MASK) and then write
(gpa_to_hpa(va & PAGE_MASK) | (va & ~PAGE_MASK)) to the table. This is what we
do now, as our typical host uses 64K pages and the default 32-bit window always
uses 4K pages (regardless of the guest page size).
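
The address arithmetic in the reply above can be sketched in plain C. This is
only an illustration under assumptions: a 64K host page, a 4K IOMMU page, and
the hypothetical helper names host_page_base() and tce_target(). It is not the
host kernel code, and the physical address of the pinned page is simply passed
in rather than looked up.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed sizes for illustration: 64K host pages, 4K IOMMU (TCE) pages. */
#define HOST_PAGE_SIZE  UINT64_C(0x10000)
#define HOST_PAGE_MASK  (~(HOST_PAGE_SIZE - 1))
#define IOMMU_PAGE_SIZE UINT64_C(0x1000)

/* Base of the host page containing va: this is the page that gets pinned. */
static inline uint64_t host_page_base(uint64_t va)
{
    return va & HOST_PAGE_MASK;
}

/*
 * Address to store in the table for a 4K-aligned va, given the physical
 * base of the pinned host page (hypothetical input; real code would take
 * it from the pinned page itself).
 */
static inline uint64_t tce_target(uint64_t va, uint64_t host_page_hpa)
{
    assert((host_page_hpa & ~HOST_PAGE_MASK) == 0); /* 64K-aligned base */
    assert((va & (IOMMU_PAGE_SIZE - 1)) == 0);      /* 4K-aligned va */
    return host_page_hpa | (va & ~HOST_PAGE_MASK);
}
```

For va = 0x12345000 the pinned host page starts at 0x12340000; if that page
sits at physical 0xabcd0000, the table entry points at 0xabcd5000. The whole
64K page is pinned, but the entry covers only the 4K piece that was mapped.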

>  Anyway, if that's the only problem,
> it looks more like an opportunity than a barrier.

Oh. Ok :)



-- 
Alexey


* Re: [Qemu-devel] [PATCH qemu v10 13/14] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2015-07-08  6:24             ` Thomas Huth
@ 2015-07-08  6:50               ` David Gibson
  0 siblings, 0 replies; 71+ messages in thread
From: David Gibson @ 2015-07-08  6:50 UTC (permalink / raw)
  To: Thomas Huth
  Cc: Michael Roth, Alexey Kardashevskiy, qemu-devel, Gavin Shan,
	Alex Williamson, qemu-ppc


On Wed, Jul 08, 2015 at 08:24:56AM +0200, Thomas Huth wrote:
> On Wed, 8 Jul 2015 14:30:29 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Tue, Jul 07, 2015 at 09:05:02PM +1000, Alexey Kardashevskiy wrote:
> > > On 07/07/2015 08:21 PM, Thomas Huth wrote:
> > > >On Tue, 7 Jul 2015 20:05:25 +1000
> > > >Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > > >
> > > >>On 07/07/2015 05:23 PM, Thomas Huth wrote:
> > > >>>On Mon,  6 Jul 2015 12:11:09 +1000
> > > >>>Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> ...
> > > >>>>@@ -698,14 +768,18 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> > > >>>>
> > > >>>>           container->iommu_data.type1.initialized = true;
> > > >>>>
> > > >>>>-    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
> > > >>>>+    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
> > > >>>>+               ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
> > > >>>>+        bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU);
> > > >>>
> > > >>>That "!!" sounds somewhat wrong here. I think you either want to check
> > > >>>for "ioctl() == 1" (because only in this case you can be sure that v2
> > > >>>is supported), or you can simply omit the "!!" because you're 100% sure
> > > >>>that the ioctl only returns 0 or 1 (and never a negative error code).
> > > >>
> > > >>
> > > >>The host kernel does not return an error on these ioctls, it returns 0 or
> > > >>1. And "!!" is shorter than "(bool)". VFIO_CHECK_EXTENSION for Type1 does
> > > >>exactly the same already.
> > > >
> > > >Simply using nothing instead is even shorter than using "!!". The
> > > >compiler is smart enough to convert from 0 and 1 to bool.
> > > >"!!" is IMHO quite ugly and should only be used when it is really
> > > >necessary.
> > > 
> > > 
> > > imho it is not but either way I'd rather follow the existing style,
> > > especially if I do literally the same thing (checking IOMMU version). Unless
> > > the original author tells me to convert all the existing occurences of "!!"
> > > to "!=0" (or something like this) before I post new ones.
> > > 
> > > Alex, should I get rid of "!!"s in the patch?
> > 
> > I think !! is the lesser evil here.  The trouble is that in C "bool"
> > is not a first-class datatype, but just a typedef for some integer
> > type.  Which means that, confusingly, (bool)2 != (bool)1.  So using
> > the !! trick to force a value to be either 0 or 1 when assigning it to
> > a bool variable is probably a good idea.
> 
> That was maybe the case > 15 years ago, but since C99, there is a
> proper bool type in C, as far as I know. But I am also not an expert
> here... However, I tried the following small test program:
> 
> #include <stdio.h>
> #include <stdbool.h>
> 
> int main()
> {
> 	bool a = 1;
> 	bool b = 2;
> 	printf("a=%i b=%i\n", a, b);
> 	return 0;
> }
> 
> ... and indeed, it prints out "a=1 b=1" here, so the "2" got properly
> changed to "true" :-)

Huh.  I had thought that C99 merely required that there be the
stdbool.h header declaring the bool type, rather than defining it as a
true first class type.

I'm very glad to be wrong.
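
The difference being discussed can be shown side by side. The sketch below is
an illustration only: legacy_bool is a hypothetical typedef standing in for the
pre-C99 "bool is just an integer" style, and the function names are made up for
this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical pre-C99 style: "bool" as a plain integer typedef. */
typedef int legacy_bool;

/* C99 bool (_Bool): any nonzero value converts to 1. */
static inline int as_c99_bool(int v)
{
    bool b = v;
    return b;
}

/* Integer typedef: the value is stored unchanged. */
static inline int as_legacy_bool(int v)
{
    legacy_bool b = v;
    return b;
}

/* The "!!" idiom normalizes to 0 or 1 regardless of the target type. */
static inline int as_normalized(int v)
{
    legacy_bool b = !!v;
    return b;
}

/* Self-check of the conversions above; aborts if an expectation fails. */
static inline void check_bool_conversions(void)
{
    assert(as_c99_bool(2) == 1);
    assert(as_legacy_bool(2) == 2);
    assert(as_normalized(2) == 1);
    assert(as_c99_bool(-1) == 1);   /* a negative error code is also "true" */
    assert(as_normalized(-1) == 1);
}
```

Note that neither the C99 conversion nor "!!" distinguishes a return value of 1
from a negative error code; only an explicit "== 1" comparison does that.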

> Anyway, that was already too much bike-shed painting now, if you want to
> keep the "!!", then keep it, that's fine for me, too.

But bike-shedding is a qemu tradition! ;-/

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [PATCH qemu v10 13/14] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2015-07-08  4:30           ` David Gibson
  2015-07-08  6:24             ` Thomas Huth
@ 2015-07-08  7:07             ` Alexey Kardashevskiy
  2015-07-08 14:47             ` Alex Williamson
  2 siblings, 0 replies; 71+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-08  7:07 UTC (permalink / raw)
  To: David Gibson
  Cc: Thomas Huth, Michael Roth, qemu-devel, Gavin Shan,
	Alex Williamson, qemu-ppc

On 07/08/2015 02:30 PM, David Gibson wrote:
> On Tue, Jul 07, 2015 at 09:05:02PM +1000, Alexey Kardashevskiy wrote:
>> On 07/07/2015 08:21 PM, Thomas Huth wrote:
>>> On Tue, 7 Jul 2015 20:05:25 +1000
>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>
>>>> On 07/07/2015 05:23 PM, Thomas Huth wrote:
>>>>> On Mon,  6 Jul 2015 12:11:09 +1000
>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>> ...
>>>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>>>> index 8eacfd7..0c7ba8c 100644
>>>>>> --- a/hw/vfio/common.c
>>>>>> +++ b/hw/vfio/common.c
>>>>>> @@ -488,6 +488,76 @@ static void vfio_listener_release(VFIOContainer *container)
>>>>>>        memory_listener_unregister(&container->iommu_data.type1.listener);
>>>>>>    }
>>>>>>
>>>>>> +static void vfio_ram_do_region(VFIOContainer *container,
>>>>>> +                              MemoryRegionSection *section, unsigned long req)
>>>>>> +{
>>>>>> +    int ret;
>>>>>> +    struct vfio_iommu_spapr_register_memory reg = { .argsz = sizeof(reg) };
>>>>>> +
>>>>>> +    if (!memory_region_is_ram(section->mr) ||
>>>>>> +        memory_region_is_skip_dump(section->mr)) {
>>>>>> +        return;
>>>>>> +    }
>>>>>> +
>>>>>> +    if (unlikely((section->offset_within_region & (getpagesize() - 1)))) {
>>>>>> +        error_report("%s received unaligned region", __func__);
>>>>>> +        return;
>>>>>> +    }
>>>>>> +
>>>>>> +    reg.vaddr = (__u64) memory_region_get_ram_ptr(section->mr) +
>>>>>
>>>>> We're in usespace here ... I think it would be better to use uint64_t
>>>>> instead of the kernel-type __u64.
>>>>
>>>> We are calling a kernel here - @reg is a kernel-defined struct.
>>>
>>> If you grep for __u64 in the QEMU sources, you'll see that hardly
>>> anybody is using this type - even if calling ioctls. So for
>>> consistency, I'd really suggest to use uint64_t here.
>>
>> I am not using it, I am packing data to a struct. So does vfio_dma_map()
>> already.
>
> __u64 is just an alias typedef used by the kernel in uapi headers for
> 64-bit integers.  You should use uint64_t here.


Out of curiosity - I do not mind, but I still fail to see the point.
reg::vaddr has a very specific type - __u64 - so why should I cast to
something which I am not going to pass to the kernel and which will be
cast to __u64 anyway?

Is there some guideline about these uapi types? I'll read and shut up :)
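
For reference, a minimal sketch of the cast in question, using only standard
types. The struct below is a hypothetical stand-in for a kernel-defined ioctl
argument, and ptr_to_u64() and example_fill() are made-up names. Since uint64_t
and __u64 are both unsigned 64-bit integer types, the value stored is the same
whichever name appears in the cast; going through uintptr_t additionally avoids
pointer-size warnings on 32-bit hosts.

```c
#include <assert.h>
#include <stdint.h>

/* The conversion below assumes a pointer fits in 64 bits. */
static_assert(sizeof(uintptr_t) <= sizeof(uint64_t),
              "pointer must fit in 64 bits");

/* Hypothetical stand-in for a kernel-defined ioctl argument struct. */
struct example_register_memory {
    uint32_t argsz;
    uint32_t flags;
    uint64_t vaddr;
    uint64_t size;
};

/* Convert a userspace pointer to a 64-bit integer without size warnings. */
static inline uint64_t ptr_to_u64(const void *p)
{
    return (uint64_t)(uintptr_t)p;
}

/* Fill the argument struct for a buffer starting at base + offset. */
static inline struct example_register_memory
example_fill(const void *base, uint64_t offset, uint64_t size)
{
    struct example_register_memory reg = { .argsz = sizeof(reg) };

    reg.vaddr = ptr_to_u64(base) + offset;
    reg.size = size;
    return reg;
}
```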


-- 
Alexey


* Re: [Qemu-devel] [PATCH qemu v10 13/14] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2015-07-08  4:30           ` David Gibson
  2015-07-08  6:24             ` Thomas Huth
  2015-07-08  7:07             ` Alexey Kardashevskiy
@ 2015-07-08 14:47             ` Alex Williamson
  2 siblings, 0 replies; 71+ messages in thread
From: Alex Williamson @ 2015-07-08 14:47 UTC (permalink / raw)
  To: David Gibson
  Cc: Thomas Huth, Michael Roth, Alexey Kardashevskiy, qemu-devel,
	Gavin Shan, qemu-ppc

On Wed, 2015-07-08 at 14:30 +1000, David Gibson wrote:
> On Tue, Jul 07, 2015 at 09:05:02PM +1000, Alexey Kardashevskiy wrote:
> > On 07/07/2015 08:21 PM, Thomas Huth wrote:
> > >On Tue, 7 Jul 2015 20:05:25 +1000
> > >Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > >
> > >>On 07/07/2015 05:23 PM, Thomas Huth wrote:
> > >>>On Mon,  6 Jul 2015 12:11:09 +1000
> > >>>Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > >...
> > >>>>diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > >>>>index 8eacfd7..0c7ba8c 100644
> > >>>>--- a/hw/vfio/common.c
> > >>>>+++ b/hw/vfio/common.c
> > >>>>@@ -488,6 +488,76 @@ static void vfio_listener_release(VFIOContainer *container)
> > >>>>       memory_listener_unregister(&container->iommu_data.type1.listener);
> > >>>>   }
> > >>>>
> > >>>>+static void vfio_ram_do_region(VFIOContainer *container,
> > >>>>+                              MemoryRegionSection *section, unsigned long req)
> > >>>>+{
> > >>>>+    int ret;
> > >>>>+    struct vfio_iommu_spapr_register_memory reg = { .argsz = sizeof(reg) };
> > >>>>+
> > >>>>+    if (!memory_region_is_ram(section->mr) ||
> > >>>>+        memory_region_is_skip_dump(section->mr)) {
> > >>>>+        return;
> > >>>>+    }
> > >>>>+
> > >>>>+    if (unlikely((section->offset_within_region & (getpagesize() - 1)))) {
> > >>>>+        error_report("%s received unaligned region", __func__);
> > >>>>+        return;
> > >>>>+    }
> > >>>>+
> > >>>>+    reg.vaddr = (__u64) memory_region_get_ram_ptr(section->mr) +
> > >>>
> > >>>We're in usespace here ... I think it would be better to use uint64_t
> > >>>instead of the kernel-type __u64.
> > >>
> > >>We are calling a kernel here - @reg is a kernel-defined struct.
> > >
> > >If you grep for __u64 in the QEMU sources, you'll see that hardly
> > >anybody is using this type - even if calling ioctls. So for
> > >consistency, I'd really suggest to use uint64_t here.
> > 
> > I am not using it, I am packing data to a struct. So does vfio_dma_map()
> > already.
> 
> __u64 is just an alias typedef used by the kernel in uapi headers for
> 64-bit integers.  You should use uint64_t here.
> 
> > >>>>@@ -698,14 +768,18 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> > >>>>
> > >>>>           container->iommu_data.type1.initialized = true;
> > >>>>
> > >>>>-    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
> > >>>>+    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
> > >>>>+               ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
> > >>>>+        bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU);
> > >>>
> > >>>That "!!" sounds somewhat wrong here. I think you either want to check
> > >>>for "ioctl() == 1" (because only in this case you can be sure that v2
> > >>>is supported), or you can simply omit the "!!" because you're 100% sure
> > >>>that the ioctl only returns 0 or 1 (and never a negative error code).
> > >>
> > >>
> > >>The host kernel does not return an error on these ioctls, it returns 0 or
> > >>1. And "!!" is shorter than "(bool)". VFIO_CHECK_EXTENSION for Type1 does
> > >>exactly the same already.
> > >
> > >Simply using nothing instead is even shorter than using "!!". The
> > >compiler is smart enough to convert from 0 and 1 to bool.
> > >"!!" is IMHO quite ugly and should only be used when it is really
> > >necessary.
> > 
> > 
> > imho it is not but either way I'd rather follow the existing style,
> > especially if I do literally the same thing (checking IOMMU version). Unless
> > the original author tells me to convert all the existing occurences of "!!"
> > to "!=0" (or something like this) before I post new ones.
> > 
> > Alex, should I get rid of "!!"s in the patch?
> 
> I think !! is the lesser evil here.  The trouble is that in C "bool"
> is not a first-class datatype, but just a typedef for some integer
> type.  Which means that, confusingly, (bool)2 != (bool)1.  So using
> the !! trick to force a value to be either 0 or 1 when assigning it to
> a bool variable is probably a good idea.

I agree that it shouldn't be necessary, but we do it elsewhere and it
doesn't bother me to do it here too.  Thanks,

Alex


* Re: [Qemu-devel] [PATCH qemu v10 13/14] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2015-07-08  6:26             ` Alexey Kardashevskiy
@ 2015-07-08 14:51               ` Alex Williamson
  0 siblings, 0 replies; 71+ messages in thread
From: Alex Williamson @ 2015-07-08 14:51 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Michael Roth, qemu-ppc, qemu-devel, Gavin Shan, David Gibson

On Wed, 2015-07-08 at 16:26 +1000, Alexey Kardashevskiy wrote:
> On 07/08/2015 02:24 AM, Alex Williamson wrote:
> > On Tue, 2015-07-07 at 22:11 +1000, Alexey Kardashevskiy wrote:
> >> On 07/07/2015 02:13 AM, Alex Williamson wrote:
> >>> On Tue, 2015-07-07 at 01:34 +1000, Alexey Kardashevskiy wrote:
> >>>> On 07/06/2015 11:42 PM, Alex Williamson wrote:
> >>>>> On Mon, 2015-07-06 at 12:11 +1000, Alexey Kardashevskiy wrote:
> >>>>>> This makes use of the new "memory registering" feature. The idea is
> >>>>>> to provide the userspace ability to notify the host kernel about pages
> >>>>>> which are going to be used for DMA. Having this information, the host
> >>>>>> kernel can pin them all once per user process, do locked pages
> >>>>>> accounting (once) and not spent time on doing that in real time with
> >>>>>> possible failures which cannot be handled nicely in some cases.
> >>>>>>
> >>>>>> This adds a guest RAM memory listener which notifies a VFIO container
> >>>>>> about memory which needs to be pinned/unpinned. VFIO MMIO regions
> >>>>>> (i.e. "skip dump" regions) are skipped.
> >>>>>>
> >>>>>> The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
> >>>>>> are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
> >>>>>> not call it when v2 is detected and enabled.
> >>>>>>
> >>>>>> This does not change the guest visible interface.
> >>>>>>
> >>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>>>> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> >>>>>> ---
> >>>>>> Changes:
> >>>>>> v9:
> >>>>>> * since there is no more SPAPR-specific data in container::iommu_data,
> >>>>>> the memory preregistration fields are common and potentially can be used
> >>>>>> by other architectures
> >>>>>>
> >>>>>> v7:
> >>>>>> * in vfio_spapr_ram_listener_region_del(), do unref() after ioctl()
> >>>>>> * s'ramlistener'register_listener'
> >>>>>>
> >>>>>> v6:
> >>>>>> * fixed commit log (s/guest/userspace/), added note about no guest visible
> >>>>>> change
> >>>>>> * fixed error checking if ram registration failed
> >>>>>> * added alignment check for section->offset_within_region
> >>>>>>
> >>>>>> v5:
> >>>>>> * simplified the patch
> >>>>>> * added trace points
> >>>>>> * added round_up() for the size
> >>>>>> * SPAPR IOMMU v2 used
> >>>>>> ---
> >>>>>>     hw/vfio/common.c              | 109 ++++++++++++++++++++++++++++++++++++++----
> >>>>>>     include/hw/vfio/vfio-common.h |   3 ++
> >>>>>>     trace-events                  |   1 +
> >>>>>>     3 files changed, 104 insertions(+), 9 deletions(-)
> >>>>>>
> >>>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >>>>>> index 8eacfd7..0c7ba8c 100644
> >>>>>> --- a/hw/vfio/common.c
> >>>>>> +++ b/hw/vfio/common.c
> >>>>>> @@ -488,6 +488,76 @@ static void vfio_listener_release(VFIOContainer *container)
> >>>>>>         memory_listener_unregister(&container->iommu_data.type1.listener);
> >>>>>>     }
> >>>>>>
> >>>>>> +static void vfio_ram_do_region(VFIOContainer *container,
> >>>>>> +                              MemoryRegionSection *section, unsigned long req)
> >>>>>> +{
> >>>>>> +    int ret;
> >>>>>> +    struct vfio_iommu_spapr_register_memory reg = { .argsz = sizeof(reg) };
> >>>>>
> >>>>> This function is not as general as the name would imply, it's spapr
> >>>>> specific due to this.  How about vfio_spapr_register_memory() with a
> >>>>> bool parameter toggling register vs unregister so we're not passing an
> >>>>> arbitrary ioctl number?
> >>>>
> >>>> Ok. Although I am quite often asked not to do such a thing and rather add 2
> >>>> helpers (reg/unreg, do/undo, etc) instead and reuse common bits.
> >>>
> >>> I'm not a fan of functions that do the reverse process based on a bool
> >>> arg either, but I dislike them less than passing an arbitrary ioctl
> >>> number for a parameter.  The former is ugly, but the latter is difficult
> >>> to use and difficult to maintain because it would be subtle later to
> >>> spot an unsupported ioctl being passed to the function.
> >>>
> >>>>>> +
> >>>>>> +    if (!memory_region_is_ram(section->mr) ||
> >>>>>> +        memory_region_is_skip_dump(section->mr)) {
> >>>>>> +        return;
> >>>>>> +    }
> >>>>>> +
> >>>>>> +    if (unlikely((section->offset_within_region & (getpagesize() - 1)))) {
> >>>>>
> >>>>> s/getpagesize()/qemu_real_host_page_size/?
> >>>>
> >>>>
> >>>> Oh, right, I guess it reached upstream now.
> >>>>
> >>>>
> >>>>>> +        error_report("%s received unaligned region", __func__);
> >>>>>> +        return;
> >>>>>> +    }
> >>>>>> +
> >>>>>> +    reg.vaddr = (__u64) memory_region_get_ram_ptr(section->mr) +
> >>>>>> +        section->offset_within_region;
> >>>>>> +    reg.size = ROUND_UP(int128_get64(section->size), TARGET_PAGE_SIZE);
> >>>>>> +
> >>>>>> +    ret = ioctl(container->fd, req, &reg);
> >>>>>> +    trace_vfio_ram_register(_IOC_NR(req) - VFIO_BASE, reg.vaddr, reg.size,
> >>>>>> +            ret ? -errno : 0);
> >>>>>> +    if (!ret) {
> >>>>>> +        return;
> >>>>>> +    }
> >>>>>> +
> >>>>>> +    /*
> >>>>>> +     * On the initfn path, store the first error in the container so we
> >>>>>> +     * can gracefully fail.  Runtime, there's not much we can do other
> >>>>>> +     * than throw a hardware error.
> >>>>>> +     */
> >>>>>> +    if (!container->iommu_data.ram_reg_initialized) {
> >>>>>> +        if (!container->iommu_data.ram_reg_error) {
> >>>>>> +            container->iommu_data.ram_reg_error = -errno;
> >>>>>> +        }
> >>>>>> +    } else {
> >>>>>> +        hw_error("vfio: RAM registering failed, unable to continue");
> >>>>>> +    }
> >>>>>
> >>>>> I'd rather see:
> >>>>>
> >>>>> if (ret) {
> >>>>>      if (!container...) {
> >>>>>        ...
> >>>>>      } else {
> >>>>>        ...
> >>>>>      }
> >>>>> }
> >>>>>
> >>>>> Exiting early on success and otherwise falling into error handling is a
> >>>>> strange code flow.
> >>>>
> >>>> Ok... vfio_dma_map() does not follow this rule so I thought it is not that
> >>>> strict :)
> >>>
> >>> It would be nice to clean it up there too.
> >>>
> >>>>>> +}
> >>>>>> +
> >>>>>> +static void vfio_ram_listener_region_add(MemoryListener *listener,
> >>>>>> +                                         MemoryRegionSection *section)
> >>>>>> +{
> >>>>>> +    VFIOContainer *container = container_of(listener, VFIOContainer,
> >>>>>> +                                            iommu_data.register_listener);
> >>>>>> +    memory_region_ref(section->mr);
> >>>>>> +    vfio_ram_do_region(container, section, VFIO_IOMMU_SPAPR_REGISTER_MEMORY);
> >>>>>
> >>>>> vfio_spapr_register_memory(container, section, true);
> >>>>>
> >>>>>> +}
> >>>>>> +
> >>>>>> +static void vfio_ram_listener_region_del(MemoryListener *listener,
> >>>>>> +                                         MemoryRegionSection *section)
> >>>>>> +{
> >>>>>> +    VFIOContainer *container = container_of(listener, VFIOContainer,
> >>>>>> +                                            iommu_data.register_listener);
> >>>>>> +    vfio_ram_do_region(container, section, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY);
> >>>>>
> >>>>> vfio_spapr_register_memory(container, section, false);
> >>>>>
> >>>>>> +    memory_region_unref(section->mr);
> >>>>>> +}
> >>>>>> +
> >>>>>> +static const MemoryListener vfio_ram_memory_listener = {
> >>>>>> +    .region_add = vfio_ram_listener_region_add,
> >>>>>> +    .region_del = vfio_ram_listener_region_del,
> >>>>>> +};
> >>>>>
> >>>>> These are all spapr specific, please reflect that in the name;
> >>>>> vfio_spapr_v2_memory_listener, vfio_spapr_v2_listener_add/del.
> >>>>
> >>>> ok.
> >>>>
> >>>>
> >>>>> Actually, can't we determine what type of IOMMU we have and make the
> >>>>> existing MemoryListener handle either type1 or spapr or spapr-v2?
> >>>>
> >>>>
> >>>> Sorry, I do not follow you here. How? The existing listener listens on PCI
> >>>> address space (at least, on pseries), new one listens on RAM address space
> >>>> (address_space_memory). What do I miss?
> >>>
> >>> Isn't that simply a difference of the address space the listener is
> >>> attached to?  Type1 maps RAM, spapr-v1 maps guest IOMMU space and these
> >>> are already both handled by the same listener.
> >>
> >>
> >> Ok, I tried merging 2 listeners and realized that the PCI listener works
> >> with TARGET_PAGE_SIZE granularity (which is 4K and actually it should be
> >> using an IOMMU page size which is not easily available there but this is a
> >> different story) and RAM listener with the qemu_real_host_page_size
> >> granularity (64K for my case) so depending on the address space type,
> >> vfio_listener_region_add() will have to use different page sizes. I like
> >> the idea of merging less now...
> >
> > Sounds like you're already solving something that needs to be fixed for
> > both.  The type1 VFIO_IOMMU_GET_INFO ioctl does actually give us a
> > bitmap of supported iommu page sizes.  It's really all but useless for
> > anything except determining the minimum page size.
> 
> btw what sizes can really come from there?

I think it was originally intended to be a bitmap of native IOMMU page
sizes.  I think AMD-Vi still does this, and reports essentially
PAGE_MASK minus a few bits that the hardware doesn't support for
whatever reason.  Not to be outdone, VT-d reports PAGE_MASK even though
its hardware supports only a small set of discrete page sizes.  I think the
theory there was that software can break down any mapping into supported
sizes.  The result is that we have no idea whether a bit in the bitmap
means native support or not, so we ignore it and assume host page size
is the minimum alignment.
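
As an illustration of "only useful for the minimum page size", here is a small
sketch that extracts the smallest advertised size from such a bitmap and checks
alignment against it. It assumes the usual convention that bit N set means a
page size of 2^N bytes is supported; the function names are hypothetical and
this is not QEMU code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Smallest page size in a page-size bitmap (bit N set => 2^N bytes).
 * Isolates the lowest set bit; returns 0 for an empty bitmap.
 */
static inline uint64_t min_iommu_pgsize(uint64_t pgsizes)
{
    return pgsizes & (~pgsizes + 1);
}

/* True if addr is aligned to pgsize, which must be a power of two. */
static inline int is_pgsize_aligned(uint64_t addr, uint64_t pgsize)
{
    assert(pgsize != 0 && (pgsize & (pgsize - 1)) == 0);
    return (addr & (pgsize - 1)) == 0;
}
```

A PAGE_MASK-style bitmap for 4K pages (all bits from bit 12 upwards set) and a
bitmap listing only discrete sizes such as 4K, 2M and 1G both yield 0x1000 as
the minimum, which is why the remaining bits carry little usable information.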

> >  For the most part we
> > just assume that it's the same as the host page size, so those existing
> > checks could actually change to host page alignment pretty safely.  I
> > think we both actually want pages that are both host and target aligned,
> > don't we?  What would you do on a 64k host if the guest tried to map a
> > region that only had 4k alignment?
> 
> I will get_user_pages_fast(va & PAGE_MASK) and then write 
> (gpa_to_hpa(va&PAGE_MASK)|(va & ~PAGE_MASK)) to the table, this is what we 
> do now as our typical host uses 64k pages and default 32bit window always 
> uses 4K (irrelevant to the guest page size).

Ok, so the windowed IOMMU with native 4k pages prevents you from
allowing access to more than the guest mapped.

> >  Anyway, if that's the only problem,
> > it looks more like an opportunity than a barrier.
> 
> Oh. Ok :)

Sorry ;)  Thanks,

Alex


* Re: [Qemu-devel] [PATCH qemu v10 10/14] spapr_pci: Enable vfio-pci hotplug
  2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 10/14] spapr_pci: Enable vfio-pci hotplug Alexey Kardashevskiy
  2015-07-06 10:27   ` David Gibson
  2015-07-06 21:31   ` Thomas Huth
@ 2015-07-10 21:33   ` Michael Roth
  2015-07-12  4:59     ` Alexey Kardashevskiy
  2 siblings, 1 reply; 71+ messages in thread
From: Michael Roth @ 2015-07-10 21:33 UTC (permalink / raw)
  To: Alexey Kardashevskiy, qemu-devel
  Cc: Alex Williamson, qemu-ppc, Gavin Shan, David Gibson

Quoting Alexey Kardashevskiy (2015-07-05 21:11:06)
> sPAPR IOMMU is managing two copies of an TCE table:
> 1) a guest view of the table - this is what emulated devices use and
> this is where H_GET_TCE reads from;
> 2) a hardware TCE table - only present if there is at least one vfio-pci
> device on a PHB; it is updated via a memory listener on a PHB address
> space which forwards map/unmap requests to vfio-pci IOMMU host driver.
> 
> At the moment presence of vfio-pci devices on a bus affect the way
> the guest view table is allocated. If there is no vfio-pci on a PHB
> and the host kernel supports KVM acceleration of H_PUT_TCE, a table
> is allocated in KVM. However, if there is vfio-pci and we do yet not
> support KVM acceleration for these, the table has to be allocated
> by the userspace.
> 
> When vfio-pci device is hotplugged and there were no vfio-pci devices
> already, the guest view table could have been allocated by KVM which
> means that H_PUT_TCE is handled by the host kernel and since we
> do not support vfio-pci in KVM, the hardware table will not be updated.
> 
> This reallocates the guest view table in QEMU if the first vfio-pci
> device has just been plugged. spapr_tce_realloc_userspace() handles this.
> 
> This replays all the mappings to make sure that the tables are in sync.
> This will not have a visible effect though as for a new device
> the guest kernel will allocate-and-map new addresses and therefore
> existing mappings from emulated devices will not be used by vfio-pci
> devices.
> 
> This adds calls to spapr_phb_dma_capabilities_update() in PCI hotplug
> hooks.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v10:
> * removed unnecessary  memory_region_del_subregion() and
> memory_region_add_subregion() as
> "vfio: Unregister IOMMU notifiers when container is destroyed" removes
> notifiers in a more correct way
> 
> v9:
> * spapr_phb_hotplug_dma_sync() enumerates TCE tables explicitly rather than
> via object_child_foreach()
> * spapr_phb_hotplug_dma_sync() does memory_region_del_subregion() +
> memory_region_add_subregion() as otherwise vfio_listener_region_del() is not
> called and we end up with vfio_iommu_map_notify registered twice (comments welcome!)
> if we do hotplug+hotunplug+hotplug of the same device.
> * moved spapr_phb_hotplug_dma_sync() on unplug event to rcu as before calling
> spapr_phb_hotplug_dma_sync(), we need VFIO to release the container, otherwise
> spapr_phb_dma_capabilities_update() will decide that the PHB still has VFIO device.
> Actual VFIO PCI device release happens from rcu and since we add ours later,
> it gets executed later and we are good.
> ---
>  hw/ppc/spapr_iommu.c        | 51 ++++++++++++++++++++++++++++++++++++++++++---
>  hw/ppc/spapr_pci.c          | 47 +++++++++++++++++++++++++++++++++++++++++
>  include/hw/pci-host/spapr.h |  1 +
>  include/hw/ppc/spapr.h      |  2 ++
>  trace-events                |  2 ++
>  5 files changed, 100 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index 45c00d8..2d99c3b 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -78,12 +78,13 @@ static uint64_t *spapr_tce_alloc_table(uint32_t liobn,
>                                         uint32_t nb_table,
>                                         uint32_t page_shift,
>                                         int *fd,
> -                                       bool vfio_accel)
> +                                       bool vfio_accel,
> +                                       bool force_userspace)
>  {
>      uint64_t *table = NULL;
>      uint64_t window_size = (uint64_t)nb_table << page_shift;
> 
> -    if (kvm_enabled() && !(window_size >> 32)) {
> +    if (kvm_enabled() && !force_userspace && !(window_size >> 32)) {
>          table = kvmppc_create_spapr_tce(liobn, window_size, fd, vfio_accel);
>      }
> 
> @@ -222,7 +223,8 @@ static void spapr_tce_table_do_enable(sPAPRTCETable *tcet, bool vfio_accel)
>                                          tcet->nb_table,
>                                          tcet->page_shift,
>                                          &tcet->fd,
> -                                        vfio_accel);
> +                                        vfio_accel,
> +                                        false);
> 
>      memory_region_set_size(&tcet->iommu,
>                             (uint64_t)tcet->nb_table << tcet->page_shift);
> @@ -495,6 +497,49 @@ int spapr_dma_dt(void *fdt, int node_off, const char *propname,
>      return 0;
>  }
> 
> +static int spapr_tce_do_replay(sPAPRTCETable *tcet, uint64_t *table)
> +{
> +    target_ulong ioba = tcet->bus_offset, pgsz = (1ULL << tcet->page_shift);
> +    long i, ret = 0;
> +
> +    for (i = 0; i < tcet->nb_table; ++i, ioba += pgsz) {
> +        ret = put_tce_emu(tcet, ioba, table[i]);
> +        if (ret) {
> +            break;
> +        }
> +    }
> +
> +    return ret;
> +}
> +
> +int spapr_tce_replay(sPAPRTCETable *tcet)
> +{
> +    return spapr_tce_do_replay(tcet, tcet->table);
> +}
> +
> +int spapr_tce_realloc_userspace(sPAPRTCETable *tcet, bool replay)
> +{
> +    int ret = 0, oldfd;
> +    uint64_t *oldtable;
> +
> +    oldtable = tcet->table;
> +    oldfd = tcet->fd;
> +    tcet->table = spapr_tce_alloc_table(tcet->liobn,
> +                                        tcet->nb_table,
> +                                        tcet->page_shift,
> +                                        &tcet->fd,
> +                                        false,
> +                                        true); /* force_userspace */
> +
> +    if (replay) {
> +        ret = spapr_tce_do_replay(tcet, oldtable);
> +    }
> +
> +    spapr_tce_free_table(oldtable, oldfd, tcet->nb_table);
> +
> +    return ret;
> +}
> +
>  int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
>                        sPAPRTCETable *tcet)
>  {
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index 76c988f..d1fa157 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -827,6 +827,43 @@ int spapr_phb_dma_reset(sPAPRPHBState *sphb)
>      return 0;
>  }
> 
> +static int spapr_phb_hotplug_dma_sync(sPAPRPHBState *sphb)
> +{
> +    int ret = 0, i;
> +    bool had_vfio = sphb->has_vfio;
> +    sPAPRTCETable *tcet;
> +
> +    spapr_phb_dma_capabilities_update(sphb);

So, in the unplug case, we update caps, but has_vfio = false so we don't do
anything else below.

Does that mean our KVM-accelerated TCE table won't get restored until reboot?
Would it make sense to re-enable it here?

> +
> +    if (!had_vfio && sphb->has_vfio) {
> +        for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
> +            tcet = spapr_tce_find_by_liobn(SPAPR_PCI_LIOBN(sphb->index, i));
> +            if (!tcet || !tcet->enabled) {
> +                continue;
> +            }
> +            if (tcet->fd >= 0) {
> +                /*
> +                 * We got first vfio-pci device on accelerated table.
> +                 * VFIO acceleration is not possible.
> +                 * Reallocate table in userspace and replay mappings.
> +                 */
> +                ret = spapr_tce_realloc_userspace(tcet, true);
> +                trace_spapr_pci_dma_realloc_update(tcet->liobn, ret);
> +            } else {
> +                /* There was no acceleration, so just replay mappings. */
> +                ret = spapr_tce_replay(tcet);
> +                trace_spapr_pci_dma_update(tcet->liobn, ret);
> +            }
> +            if (ret) {
> +                break;
> +            }
> +        }
> +        return ret;
> +    }
> +
> +    return 0;
> +}
> +
>  /* Macros to operate with address in OF binding to PCI */
>  #define b_x(x, p, l)    (((x) & ((1<<(l))-1)) << (p))
>  #define b_n(x)          b_x((x), 31, 1) /* 0 if relocatable */
> @@ -1106,6 +1143,7 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
>              error_setg(errp, "Failed to create pci child device tree node");
>              goto out;
>          }
> +        spapr_phb_hotplug_dma_sync(phb);
>      }
> 
>      drck->attach(drc, DEVICE(pdev),
> @@ -1116,6 +1154,12 @@ out:
>      }
>  }
> 
> +static void spapr_phb_remove_sync_dma(struct rcu_head *head)
> +{
> +    sPAPRPHBState *sphb = container_of(head, sPAPRPHBState, rcu);
> +    spapr_phb_hotplug_dma_sync(sphb);
> +}
> +
>  static void spapr_phb_remove_pci_device_cb(DeviceState *dev, void *opaque)
>  {
>      /* some version guests do not wait for completion of a device
> @@ -1130,6 +1174,9 @@ static void spapr_phb_remove_pci_device_cb(DeviceState *dev, void *opaque)
>       */
>      pci_device_reset(PCI_DEVICE(dev));
>      object_unparent(OBJECT(dev));
> +
> +    /* Actual VFIO device release happens from RCU so postpone DMA update */
> +    call_rcu1(&((sPAPRPHBState *)opaque)->rcu, spapr_phb_remove_sync_dma);

Hmm... can't think of any reason this wouldn't work, but would be nice
if there was something a bit more straightforward...

When the device is actually finalized, it does:

  static void vfio_instance_finalize(Object *obj)
  {
      PCIDevice *pci_dev = PCI_DEVICE(obj);
      VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pci_dev);
      VFIOGroup *group = vdev->vbasedev.group;

      ...
  
      vfio_put_device(vdev);
      vfio_put_group(group);
  }

When all the groups are removed from a VFIO container, there's a
call to container->iommu_data.release(container). This is the
event we really care about, not so much the fact that a device
got released.

Right now all it does is remove the memory listener, but maybe it
makes sense to allow an additional callback/opaque to register for
the event. Not sure what the best way to do that is though...

And, kind of a separate topic, but if we could do something
similar for the initial group attach, we could drop *all* the
plug/unplug hooks, and the hooks themselves could drop all
the !had_vfio / has_vfio logic/probing, since that would then
be clear from the context.
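
To make the idea a bit more concrete, here is a rough, standalone sketch of
what such a first-attach/last-detach notifier could look like. Every name in
it (Container, container_set_notifier(), FakePHB, ...) is made up for
illustration; none of it is existing QEMU or VFIO API, and the real
VFIOContainer bookkeeping is certainly more involved than a bare counter:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout; this is not existing QEMU/VFIO API. */
typedef struct Container Container;
typedef void (*ContainerEventFn)(Container *c, int attached, void *opaque);

struct Container {
    int nr_groups;            /* groups currently attached */
    ContainerEventFn notify;  /* optional first-attach/last-detach hook */
    void *opaque;             /* handed back to the hook, e.g. the PHB */
};

static void container_set_notifier(Container *c, ContainerEventFn fn,
                                   void *opaque)
{
    c->notify = fn;
    c->opaque = opaque;
}

/* Would be called where a group joins the container. */
static void container_get_group(Container *c)
{
    if (c->nr_groups++ == 0 && c->notify) {
        c->notify(c, 1, c->opaque);   /* first group: VFIO appeared */
    }
}

/* Would be called where vfio_put_group() drops a group. */
static void container_put_group(Container *c)
{
    assert(c->nr_groups > 0);
    if (--c->nr_groups == 0 && c->notify) {
        c->notify(c, 0, c->opaque);   /* last group: VFIO is gone */
    }
}

/* Stand-in for the PHB side: records has_vfio and counts transitions. */
typedef struct {
    int has_vfio;
    int transitions;
} FakePHB;

static void fake_phb_vfio_event(Container *c, int attached, void *opaque)
{
    FakePHB *phb = opaque;

    (void)c;
    phb->has_vfio = attached;
    phb->transitions++;
}
```

With something along these lines the PHB would register
fake_phb_vfio_event() once and be told about the 0->1 and 1->0 transitions
directly, wherever the final put happens to run (main loop or RCU thread),
instead of re-probing has_vfio from each plug/unplug hook.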

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v10 10/14] spapr_pci: Enable vfio-pci hotplug
  2015-07-10 21:33   ` Michael Roth
@ 2015-07-12  4:59     ` Alexey Kardashevskiy
  2015-07-12 14:41       ` Michael Roth
  0 siblings, 1 reply; 71+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-12  4:59 UTC (permalink / raw)
  To: Michael Roth, qemu-devel
  Cc: Alex Williamson, qemu-ppc, Gavin Shan, David Gibson

On 07/11/2015 07:33 AM, Michael Roth wrote:
> Quoting Alexey Kardashevskiy (2015-07-05 21:11:06)
>> sPAPR IOMMU manages two copies of a TCE table:
>> 1) a guest view of the table - this is what emulated devices use and
>> this is where H_GET_TCE reads from;
>> 2) a hardware TCE table - only present if there is at least one vfio-pci
>> device on a PHB; it is updated via a memory listener on a PHB address
>> space which forwards map/unmap requests to vfio-pci IOMMU host driver.
>>
>> At the moment the presence of vfio-pci devices on a bus affects the way
>> the guest view table is allocated. If there is no vfio-pci on a PHB
>> and the host kernel supports KVM acceleration of H_PUT_TCE, a table
>> is allocated in KVM. However, if there is vfio-pci and we do not yet
>> support KVM acceleration for these, the table has to be allocated
>> by userspace.
>>
>> When vfio-pci device is hotplugged and there were no vfio-pci devices
>> already, the guest view table could have been allocated by KVM which
>> means that H_PUT_TCE is handled by the host kernel and since we
>> do not support vfio-pci in KVM, the hardware table will not be updated.
>>
>> This reallocates the guest view table in QEMU if the first vfio-pci
>> device has just been plugged. spapr_tce_realloc_userspace() handles this.
>>
>> This replays all the mappings to make sure that the tables are in sync.
>> This will not have a visible effect though as for a new device
>> the guest kernel will allocate-and-map new addresses and therefore
>> existing mappings from emulated devices will not be used by vfio-pci
>> devices.
>>
>> This adds calls to spapr_phb_dma_capabilities_update() in PCI hotplug
>> hooks.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>> Changes:
>> v10:
>> * removed unnecessary  memory_region_del_subregion() and
>> memory_region_add_subregion() as
>> "vfio: Unregister IOMMU notifiers when container is destroyed" removes
>> notifiers in a more correct way
>>
>> v9:
>> * spapr_phb_hotplug_dma_sync() enumerates TCE tables explicitly rather than
>> via object_child_foreach()
>> * spapr_phb_hotplug_dma_sync() does memory_region_del_subregion() +
>> memory_region_add_subregion() as otherwise vfio_listener_region_del() is not
>> called and we end up with vfio_iommu_map_notify registered twice (comments welcome!)
>> if we do hotplug+hotunplug+hotplug of the same device.
>> * moved spapr_phb_hotplug_dma_sync() on unplug event to rcu as before calling
>> spapr_phb_hotplug_dma_sync(), we need VFIO to release the container, otherwise
>> spapr_phb_dma_capabilities_update() will decide that the PHB still has VFIO device.
>> Actual VFIO PCI device release happens from rcu and since we add ours later,
>> it gets executed later and we are good.
>> ---
>>   hw/ppc/spapr_iommu.c        | 51 ++++++++++++++++++++++++++++++++++++++++++---
>>   hw/ppc/spapr_pci.c          | 47 +++++++++++++++++++++++++++++++++++++++++
>>   include/hw/pci-host/spapr.h |  1 +
>>   include/hw/ppc/spapr.h      |  2 ++
>>   trace-events                |  2 ++
>>   5 files changed, 100 insertions(+), 3 deletions(-)
>>
>> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
>> index 45c00d8..2d99c3b 100644
>> --- a/hw/ppc/spapr_iommu.c
>> +++ b/hw/ppc/spapr_iommu.c
>> @@ -78,12 +78,13 @@ static uint64_t *spapr_tce_alloc_table(uint32_t liobn,
>>                                          uint32_t nb_table,
>>                                          uint32_t page_shift,
>>                                          int *fd,
>> -                                       bool vfio_accel)
>> +                                       bool vfio_accel,
>> +                                       bool force_userspace)
>>   {
>>       uint64_t *table = NULL;
>>       uint64_t window_size = (uint64_t)nb_table << page_shift;
>>
>> -    if (kvm_enabled() && !(window_size >> 32)) {
>> +    if (kvm_enabled() && !force_userspace && !(window_size >> 32)) {
>>           table = kvmppc_create_spapr_tce(liobn, window_size, fd, vfio_accel);
>>       }
>>
>> @@ -222,7 +223,8 @@ static void spapr_tce_table_do_enable(sPAPRTCETable *tcet, bool vfio_accel)
>>                                           tcet->nb_table,
>>                                           tcet->page_shift,
>>                                           &tcet->fd,
>> -                                        vfio_accel);
>> +                                        vfio_accel,
>> +                                        false);
>>
>>       memory_region_set_size(&tcet->iommu,
>>                              (uint64_t)tcet->nb_table << tcet->page_shift);
>> @@ -495,6 +497,49 @@ int spapr_dma_dt(void *fdt, int node_off, const char *propname,
>>       return 0;
>>   }
>>
>> +static int spapr_tce_do_replay(sPAPRTCETable *tcet, uint64_t *table)
>> +{
>> +    target_ulong ioba = tcet->bus_offset, pgsz = (1ULL << tcet->page_shift);
>> +    long i, ret = 0;
>> +
>> +    for (i = 0; i < tcet->nb_table; ++i, ioba += pgsz) {
>> +        ret = put_tce_emu(tcet, ioba, table[i]);
>> +        if (ret) {
>> +            break;
>> +        }
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +int spapr_tce_replay(sPAPRTCETable *tcet)
>> +{
>> +    return spapr_tce_do_replay(tcet, tcet->table);
>> +}
>> +
>> +int spapr_tce_realloc_userspace(sPAPRTCETable *tcet, bool replay)
>> +{
>> +    int ret = 0, oldfd;
>> +    uint64_t *oldtable;
>> +
>> +    oldtable = tcet->table;
>> +    oldfd = tcet->fd;
>> +    tcet->table = spapr_tce_alloc_table(tcet->liobn,
>> +                                        tcet->nb_table,
>> +                                        tcet->page_shift,
>> +                                        &tcet->fd,
>> +                                        false,
>> +                                        true); /* force_userspace */
>> +
>> +    if (replay) {
>> +        ret = spapr_tce_do_replay(tcet, oldtable);
>> +    }
>> +
>> +    spapr_tce_free_table(oldtable, oldfd, tcet->nb_table);
>> +
>> +    return ret;
>> +}
>> +
>>   int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
>>                         sPAPRTCETable *tcet)
>>   {
>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>> index 76c988f..d1fa157 100644
>> --- a/hw/ppc/spapr_pci.c
>> +++ b/hw/ppc/spapr_pci.c
>> @@ -827,6 +827,43 @@ int spapr_phb_dma_reset(sPAPRPHBState *sphb)
>>       return 0;
>>   }
>>
>> +static int spapr_phb_hotplug_dma_sync(sPAPRPHBState *sphb)
>> +{
>> +    int ret = 0, i;
>> +    bool had_vfio = sphb->has_vfio;
>> +    sPAPRTCETable *tcet;
>> +
>> +    spapr_phb_dma_capabilities_update(sphb);
>
> So, in the unplug case, we update caps, but has_vfio = false so we don't do
> anything else below.

Yes.


> Does that mean our KVM-accelerated TCE table won't get restored until reboot?
> Would it make sense to re-enable it here?

No, it should be re-enabled, as the DMA config is completely reset during
machine reset by "[PATCH qemu v10 08/14] spapr_pci: Do complete reset of
DMA config when resetting PHB".



>> +
>> +    if (!had_vfio && sphb->has_vfio) {
>> +        for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
>> +            tcet = spapr_tce_find_by_liobn(SPAPR_PCI_LIOBN(sphb->index, i));
>> +            if (!tcet || !tcet->enabled) {
>> +                continue;
>> +            }
>> +            if (tcet->fd >= 0) {
>> +                /*
>> +                 * We got first vfio-pci device on accelerated table.
>> +                 * VFIO acceleration is not possible.
>> +                 * Reallocate table in userspace and replay mappings.
>> +                 */
>> +                ret = spapr_tce_realloc_userspace(tcet, true);
>> +                trace_spapr_pci_dma_realloc_update(tcet->liobn, ret);
>> +            } else {
>> +                /* There was no acceleration, so just replay mappings. */
>> +                ret = spapr_tce_replay(tcet);
>> +                trace_spapr_pci_dma_update(tcet->liobn, ret);
>> +            }
>> +            if (ret) {
>> +                break;
>> +            }
>> +        }
>> +        return ret;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>>   /* Macros to operate with address in OF binding to PCI */
>>   #define b_x(x, p, l)    (((x) & ((1<<(l))-1)) << (p))
>>   #define b_n(x)          b_x((x), 31, 1) /* 0 if relocatable */
>> @@ -1106,6 +1143,7 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
>>               error_setg(errp, "Failed to create pci child device tree node");
>>               goto out;
>>           }
>> +        spapr_phb_hotplug_dma_sync(phb);
>>       }
>>
>>       drck->attach(drc, DEVICE(pdev),
>> @@ -1116,6 +1154,12 @@ out:
>>       }
>>   }
>>
>> +static void spapr_phb_remove_sync_dma(struct rcu_head *head)
>> +{
>> +    sPAPRPHBState *sphb = container_of(head, sPAPRPHBState, rcu);
>> +    spapr_phb_hotplug_dma_sync(sphb);
>> +}
>> +
>>   static void spapr_phb_remove_pci_device_cb(DeviceState *dev, void *opaque)
>>   {
>>       /* some version guests do not wait for completion of a device
>> @@ -1130,6 +1174,9 @@ static void spapr_phb_remove_pci_device_cb(DeviceState *dev, void *opaque)
>>        */
>>       pci_device_reset(PCI_DEVICE(dev));
>>       object_unparent(OBJECT(dev));
>> +
>> +    /* Actual VFIO device release happens from RCU so postpone DMA update */
>> +    call_rcu1(&((sPAPRPHBState *)opaque)->rcu, spapr_phb_remove_sync_dma);
>
> Hmm... can't think of any reason this wouldn't work, but would be nice
> if there was something a bit more straightforward...
>
> When the device is actually finalized, it does:

The problem is with "when". I looked in gdb: vfio_instance_finalize()
is called from an RCU handler, because the last reference is dropped only
when some memory region is removed, and that removal is postponed to RCU.

If object_unparent(OBJECT(dev)) did call vfio_put_group() on the same 
stack, I would not need this call_rcu1.

>    static void vfio_instance_finalize(Object *obj)
>    {
>        PCIDevice *pci_dev = PCI_DEVICE(obj);
>        VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pci_dev);
>        VFIOGroup *group = vdev->vbasedev.group;
>
>        ...
>
>        vfio_put_device(vdev);
>        vfio_put_group(group);
>    }
>
> When all the groups are removed from a VFIO container, there's a
> call to container->iommu_data.release(container). This is the
> event we really care about, not so much the fact that a device
> got released.
>
> Right now all it does is remove the memory listener, but maybe it
> makes sense to allow an additional callback/opaque to register for
> the event. Not sure what the best way to do that is though...

In this context I care rather about the container's fd being closed, so
that VFIO_IOMMU_SPAPR_TCE_GET_INFO fails in my dma-sync; this is how I know
that there are no more VFIO devices.


> And, kind of a separate topic, but if we could do something
> similar for the initial group attach,  we could drop *all* the
> plug/unplug hooks, and the hooks themselves could drop all
> the !had_vfio / has_vfio logic/probing, since that would then
> be clear from the context.

Drop all hooks? HotplugHandlerClass hooks? Can you do that? :) Aren't they
what HMP calls on "device_add"?


-- 
Alexey

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v10 10/14] spapr_pci: Enable vfio-pci hotplug
  2015-07-12  4:59     ` Alexey Kardashevskiy
@ 2015-07-12 14:41       ` Michael Roth
  2015-07-13  1:10         ` David Gibson
  2015-07-13  7:06         ` Alexey Kardashevskiy
  0 siblings, 2 replies; 71+ messages in thread
From: Michael Roth @ 2015-07-12 14:41 UTC (permalink / raw)
  To: Alexey Kardashevskiy, qemu-devel
  Cc: Alex Williamson, qemu-ppc, Gavin Shan, David Gibson

Quoting Alexey Kardashevskiy (2015-07-11 23:59:45)
> On 07/11/2015 07:33 AM, Michael Roth wrote:
> > Quoting Alexey Kardashevskiy (2015-07-05 21:11:06)
> >> sPAPR IOMMU manages two copies of a TCE table:
> >> 1) a guest view of the table - this is what emulated devices use and
> >> this is where H_GET_TCE reads from;
> >> 2) a hardware TCE table - only present if there is at least one vfio-pci
> >> device on a PHB; it is updated via a memory listener on a PHB address
> >> space which forwards map/unmap requests to vfio-pci IOMMU host driver.
> >>
> >> At the moment the presence of vfio-pci devices on a bus affects the way
> >> the guest view table is allocated. If there is no vfio-pci on a PHB
> >> and the host kernel supports KVM acceleration of H_PUT_TCE, a table
> >> is allocated in KVM. However, if there is vfio-pci and we do not yet
> >> support KVM acceleration for these, the table has to be allocated
> >> by userspace.
> >>
> >> When vfio-pci device is hotplugged and there were no vfio-pci devices
> >> already, the guest view table could have been allocated by KVM which
> >> means that H_PUT_TCE is handled by the host kernel and since we
> >> do not support vfio-pci in KVM, the hardware table will not be updated.
> >>
> >> This reallocates the guest view table in QEMU if the first vfio-pci
> >> device has just been plugged. spapr_tce_realloc_userspace() handles this.
> >>
> >> This replays all the mappings to make sure that the tables are in sync.
> >> This will not have a visible effect though as for a new device
> >> the guest kernel will allocate-and-map new addresses and therefore
> >> existing mappings from emulated devices will not be used by vfio-pci
> >> devices.
> >>
> >> This adds calls to spapr_phb_dma_capabilities_update() in PCI hotplug
> >> hooks.
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> ---
> >> Changes:
> >> v10:
> >> * removed unnecessary  memory_region_del_subregion() and
> >> memory_region_add_subregion() as
> >> "vfio: Unregister IOMMU notifiers when container is destroyed" removes
> >> notifiers in a more correct way
> >>
> >> v9:
> >> * spapr_phb_hotplug_dma_sync() enumerates TCE tables explicitly rather than
> >> via object_child_foreach()
> >> * spapr_phb_hotplug_dma_sync() does memory_region_del_subregion() +
> >> memory_region_add_subregion() as otherwise vfio_listener_region_del() is not
> >> called and we end up with vfio_iommu_map_notify registered twice (comments welcome!)
> >> if we do hotplug+hotunplug+hotplug of the same device.
> >> * moved spapr_phb_hotplug_dma_sync() on unplug event to rcu as before calling
> >> spapr_phb_hotplug_dma_sync(), we need VFIO to release the container, otherwise
> >> spapr_phb_dma_capabilities_update() will decide that the PHB still has VFIO device.
> >> Actual VFIO PCI device release happens from rcu and since we add ours later,
> >> it gets executed later and we are good.
> >> ---
> >>   hw/ppc/spapr_iommu.c        | 51 ++++++++++++++++++++++++++++++++++++++++++---
> >>   hw/ppc/spapr_pci.c          | 47 +++++++++++++++++++++++++++++++++++++++++
> >>   include/hw/pci-host/spapr.h |  1 +
> >>   include/hw/ppc/spapr.h      |  2 ++
> >>   trace-events                |  2 ++
> >>   5 files changed, 100 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> >> index 45c00d8..2d99c3b 100644
> >> --- a/hw/ppc/spapr_iommu.c
> >> +++ b/hw/ppc/spapr_iommu.c
> >> @@ -78,12 +78,13 @@ static uint64_t *spapr_tce_alloc_table(uint32_t liobn,
> >>                                          uint32_t nb_table,
> >>                                          uint32_t page_shift,
> >>                                          int *fd,
> >> -                                       bool vfio_accel)
> >> +                                       bool vfio_accel,
> >> +                                       bool force_userspace)
> >>   {
> >>       uint64_t *table = NULL;
> >>       uint64_t window_size = (uint64_t)nb_table << page_shift;
> >>
> >> -    if (kvm_enabled() && !(window_size >> 32)) {
> >> +    if (kvm_enabled() && !force_userspace && !(window_size >> 32)) {
> >>           table = kvmppc_create_spapr_tce(liobn, window_size, fd, vfio_accel);
> >>       }
> >>
> >> @@ -222,7 +223,8 @@ static void spapr_tce_table_do_enable(sPAPRTCETable *tcet, bool vfio_accel)
> >>                                           tcet->nb_table,
> >>                                           tcet->page_shift,
> >>                                           &tcet->fd,
> >> -                                        vfio_accel);
> >> +                                        vfio_accel,
> >> +                                        false);
> >>
> >>       memory_region_set_size(&tcet->iommu,
> >>                              (uint64_t)tcet->nb_table << tcet->page_shift);
> >> @@ -495,6 +497,49 @@ int spapr_dma_dt(void *fdt, int node_off, const char *propname,
> >>       return 0;
> >>   }
> >>
> >> +static int spapr_tce_do_replay(sPAPRTCETable *tcet, uint64_t *table)
> >> +{
> >> +    target_ulong ioba = tcet->bus_offset, pgsz = (1ULL << tcet->page_shift);
> >> +    long i, ret = 0;
> >> +
> >> +    for (i = 0; i < tcet->nb_table; ++i, ioba += pgsz) {
> >> +        ret = put_tce_emu(tcet, ioba, table[i]);
> >> +        if (ret) {
> >> +            break;
> >> +        }
> >> +    }
> >> +
> >> +    return ret;
> >> +}
> >> +
> >> +int spapr_tce_replay(sPAPRTCETable *tcet)
> >> +{
> >> +    return spapr_tce_do_replay(tcet, tcet->table);
> >> +}
> >> +
> >> +int spapr_tce_realloc_userspace(sPAPRTCETable *tcet, bool replay)
> >> +{
> >> +    int ret = 0, oldfd;
> >> +    uint64_t *oldtable;
> >> +
> >> +    oldtable = tcet->table;
> >> +    oldfd = tcet->fd;
> >> +    tcet->table = spapr_tce_alloc_table(tcet->liobn,
> >> +                                        tcet->nb_table,
> >> +                                        tcet->page_shift,
> >> +                                        &tcet->fd,
> >> +                                        false,
> >> +                                        true); /* force_userspace */
> >> +
> >> +    if (replay) {
> >> +        ret = spapr_tce_do_replay(tcet, oldtable);
> >> +    }
> >> +
> >> +    spapr_tce_free_table(oldtable, oldfd, tcet->nb_table);
> >> +
> >> +    return ret;
> >> +}
> >> +
> >>   int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
> >>                         sPAPRTCETable *tcet)
> >>   {
> >> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> >> index 76c988f..d1fa157 100644
> >> --- a/hw/ppc/spapr_pci.c
> >> +++ b/hw/ppc/spapr_pci.c
> >> @@ -827,6 +827,43 @@ int spapr_phb_dma_reset(sPAPRPHBState *sphb)
> >>       return 0;
> >>   }
> >>
> >> +static int spapr_phb_hotplug_dma_sync(sPAPRPHBState *sphb)
> >> +{
> >> +    int ret = 0, i;
> >> +    bool had_vfio = sphb->has_vfio;
> >> +    sPAPRTCETable *tcet;
> >> +
> >> +    spapr_phb_dma_capabilities_update(sphb);
> >
> > So, in the unplug case, we update caps, but has_vfio = false so we don't do
> > anything else below.
> 
> Yes.
> 
> 
> > Does that mean our KVM-accelerated TCE table won't get restored until reboot?
> > Would it make sense to re-enable it here?
> 
> No, it should be re-enabled, as the DMA config is completely reset during
> machine reset by "[PATCH qemu v10 08/14] spapr_pci: Do complete reset of
> DMA config when resetting PHB".

We don't get a PHB-level reset for PCI hotplug though, so it wouldn't
get re-enabled till guest system reset. I'm not sure how big a deal that
is performance-wise, but it seems a little unexpected.
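
To spell out the asymmetry, here is a minimal standalone sketch of the
has_vfio transitions. The enum and function names are invented for
illustration and are not part of the posted patch. The patch acts only on the
"first device appeared" edge; the "last device went away" edge is the case I
am asking about, and whether a live table can safely be moved back into KVM
at that point is an open question, not something this sketch answers:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical classification of a has_vfio change on a PHB. */
typedef enum {
    DMA_SYNC_NONE,          /* has_vfio did not change */
    DMA_SYNC_TO_USERSPACE,  /* first vfio-pci device appeared */
    DMA_SYNC_TO_KVM         /* last vfio-pci device went away */
} DmaSyncAction;

static DmaSyncAction dma_sync_action(bool had_vfio, bool has_vfio)
{
    if (!had_vfio && has_vfio) {
        /* The edge the patch handles: realloc in userspace and replay. */
        return DMA_SYNC_TO_USERSPACE;
    }
    if (had_vfio && !has_vfio) {
        /* The edge currently ignored until the next machine reset. */
        return DMA_SYNC_TO_KVM;
    }
    return DMA_SYNC_NONE;
}
```

In spapr_phb_hotplug_dma_sync() as posted, only the first branch has a
counterpart; the second one falls through to "return 0".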

> 
> 
> 
> >> +
> >> +    if (!had_vfio && sphb->has_vfio) {
> >> +        for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
> >> +            tcet = spapr_tce_find_by_liobn(SPAPR_PCI_LIOBN(sphb->index, i));
> >> +            if (!tcet || !tcet->enabled) {
> >> +                continue;
> >> +            }
> >> +            if (tcet->fd >= 0) {
> >> +                /*
> >> +                 * We got first vfio-pci device on accelerated table.
> >> +                 * VFIO acceleration is not possible.
> >> +                 * Reallocate table in userspace and replay mappings.
> >> +                 */
> >> +                ret = spapr_tce_realloc_userspace(tcet, true);
> >> +                trace_spapr_pci_dma_realloc_update(tcet->liobn, ret);
> >> +            } else {
> >> +                /* There was no acceleration, so just replay mappings. */
> >> +                ret = spapr_tce_replay(tcet);
> >> +                trace_spapr_pci_dma_update(tcet->liobn, ret);
> >> +            }
> >> +            if (ret) {
> >> +                break;
> >> +            }
> >> +        }
> >> +        return ret;
> >> +    }
> >> +
> >> +    return 0;
> >> +}
> >> +
> >>   /* Macros to operate with address in OF binding to PCI */
> >>   #define b_x(x, p, l)    (((x) & ((1<<(l))-1)) << (p))
> >>   #define b_n(x)          b_x((x), 31, 1) /* 0 if relocatable */
> >> @@ -1106,6 +1143,7 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
> >>               error_setg(errp, "Failed to create pci child device tree node");
> >>               goto out;
> >>           }
> >> +        spapr_phb_hotplug_dma_sync(phb);
> >>       }
> >>
> >>       drck->attach(drc, DEVICE(pdev),
> >> @@ -1116,6 +1154,12 @@ out:
> >>       }
> >>   }
> >>
> >> +static void spapr_phb_remove_sync_dma(struct rcu_head *head)
> >> +{
> >> +    sPAPRPHBState *sphb = container_of(head, sPAPRPHBState, rcu);
> >> +    spapr_phb_hotplug_dma_sync(sphb);
> >> +}
> >> +
> >>   static void spapr_phb_remove_pci_device_cb(DeviceState *dev, void *opaque)
> >>   {
> >>       /* some version guests do not wait for completion of a device
> >> @@ -1130,6 +1174,9 @@ static void spapr_phb_remove_pci_device_cb(DeviceState *dev, void *opaque)
> >>        */
> >>       pci_device_reset(PCI_DEVICE(dev));
> >>       object_unparent(OBJECT(dev));
> >> +
> >> +    /* Actual VFIO device release happens from RCU so postpone DMA update */
> >> +    call_rcu1(&((sPAPRPHBState *)opaque)->rcu, spapr_phb_remove_sync_dma);
> >
> > Hmm... can't think of any reason this wouldn't work, but would be nice
> > if there was something a bit more straightforward...
> >
> > When the device is actually finalized, it does:
> 
> The problem is with "when". I looked in gdb: vfio_instance_finalize()
> is called from an RCU handler, because the last reference is dropped only
> when some memory region is removed, and that removal is postponed to RCU.
> 
> If object_unparent(OBJECT(dev)) did call vfio_put_group() on the same 
> stack, I would not need this call_rcu1.

Right, object_unparent() has no guaruntee of immediately finalizing it,
but you *do* have the guaruntee that
vfio_instance_finalize()->vfio_put_group() will only be called once the
device is actually finalized, regardless of whether or not it's kicked
off by the RCU thread. So it seems more straightforward to hook into
that rather than needing to employ internal knowledge of
object_unparent().
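
To illustrate the ordering argument, here is a minimal sketch, assuming a
toy refcounted object rather than the real QOM code. All names (Obj,
obj_unparent, dev_finalize, count_sync) are hypothetical stand-ins, not
QEMU APIs. The point is only that a hook placed in the finalizer fires
exactly when the last reference goes away, no matter who drops it:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature object model; none of these names are QEMU APIs. */
typedef struct Obj Obj;
struct Obj {
    int refcount;
    void (*finalize)(Obj *obj);          /* runs once, when refcount hits 0 */
    void (*finalize_hook)(void *opaque); /* what a PHB would register */
    void *hook_opaque;
};

static void obj_unref(Obj *obj)
{
    assert(obj->refcount > 0);
    if (--obj->refcount == 0 && obj->finalize) {
        obj->finalize(obj);
    }
}

/* Stand-in for object_unparent(): drops only the parent's reference. */
static void obj_unparent(Obj *obj)
{
    obj_unref(obj);
}

/* Stand-in for the device finalizer: the place where the hook would fire. */
static void dev_finalize(Obj *obj)
{
    if (obj->finalize_hook) {
        obj->finalize_hook(obj->hook_opaque);
    }
}

/* Example hook: counts how many times a "DMA sync" would have been run. */
static void count_sync(void *opaque)
{
    ++*(int *)opaque;
}
```

With a second reference still held (say, by a memory region that is
released later), obj_unparent() alone does not trigger the hook; the
later obj_unref() does, and the caller needs no knowledge of which
thread dropped the last reference.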

> 
> >    static void vfio_instance_finalize(Object *obj)
> >    {
> >        PCIDevice *pci_dev = PCI_DEVICE(obj);
> >        VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pci_dev);
> >        VFIOGroup *group = vdev->vbasedev.group;
> >
> >        ...
> >
> >        vfio_put_device(vdev);
> >        vfio_put_group(group);
> >    }
> >
> > When all the groups are removed from a VFIO container, there's a
> > call to container->iommu_data.release(container). This is the
> > event we really care about, not so much the fact that a device
> > got released.
> >
> > Right now all it does is remove the memory listener, but maybe it
> > makes sense to allow an additional callback/opaque to register for
> > the event. Not sure what the best way to do that is though...
> 
> In this context I rather care about container's fd being closed so 
> VFIO_IOMMU_SPAPR_TCE_GET_INFO would fail in my dma-sync and this way I know 
> that there is no more VFIO devices.

VFIO container getting closed also corresponds to the last group being
removed though. Even if it didn't, I think VFIO_IOMMU_SPAPR_TCE_GET_INFO
would fail unless at least one iommu group was attached to the
container? So knowing when the first/last group is removed seems to
be the real main event.

> 
> 
> > And, kind of a separate topic, but if we could do something
> > similar for the initial group attach,  we could drop *all* the
> > plug/unplug hooks, and the hooks themselves could drop all
> > the !had_vfio / has_vfio logic/probing, since that would then
> > be clear from the context.
> 
> Drop all hooks? HotplugHandlerClass hooks? Can you do that? :) Are not they 
> what HMP calls on "device_add"?

I mean all the places we call into code that ends up doing:
  spapr_phb_dma_capabilities_update(sphb);
  /* do something special if has_vfio changed */

We currently have one in PCI plug, PCI unplug, and PHB reset. If we
hooked into vfio_put_group(), we could drop each of those hooks. PHB
reset would still have the special case of restoring default 32-bit
window config, but it wouldn't need to care about has_vfio status
anymore, all that code could be handled by
vfio_put_group/vfio_get_group callbacks.

I wouldn't hold up the series for it, but I think it would greatly
simplify tracking has_vfio changes.

But I do think spapr_phb_hotplug_dma_sync() should have some logic
squashed in for re-enabling TCE acceleration on
(had_vfio && !has_vfio).
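
Roughly, the extra branch could look like the sketch below. This is a
standalone model under stated assumptions, not a patch: PhbDmaModel,
Backing, SyncAction and phb_dma_sync_model() are made-up names, and the
"switch table backing" steps are reduced to changing an enum so the
transitions are easy to follow:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model types; these are not the QEMU structures. */
typedef enum { BACKING_KVM, BACKING_USERSPACE } Backing;
typedef enum {
    SYNC_NONE,          /* has_vfio did not change, nothing to do */
    SYNC_REPLAY,        /* first VFIO device, table already in userspace */
    SYNC_TO_USERSPACE,  /* first VFIO device, table moved out of KVM */
    SYNC_TO_KVM         /* last VFIO device gone, acceleration restored */
} SyncAction;

typedef struct {
    bool has_vfio;      /* cached state, as in the patch */
    bool kvm_accel_ok;  /* assumption: host can accelerate H_PUT_TCE */
    Backing backing;
} PhbDmaModel;

static SyncAction phb_dma_sync_model(PhbDmaModel *phb, bool vfio_now)
{
    bool had_vfio = phb->has_vfio;

    phb->has_vfio = vfio_now;

    if (!had_vfio && phb->has_vfio) {
        /* Existing behaviour: realloc in userspace, or just replay. */
        if (phb->backing == BACKING_KVM) {
            phb->backing = BACKING_USERSPACE;
            return SYNC_TO_USERSPACE;
        }
        return SYNC_REPLAY;
    }

    if (had_vfio && !phb->has_vfio) {
        /* Suggested addition: go back to the accelerated table. */
        assert(phb->backing == BACKING_USERSPACE);
        if (phb->kvm_accel_ok) {
            phb->backing = BACKING_KVM;
            return SYNC_TO_KVM;
        }
    }

    return SYNC_NONE;
}
```

In the real code the SYNC_TO_KVM step would presumably need a
counterpart to spapr_tce_realloc_userspace() that replays the mappings
into the KVM-allocated table; that helper does not exist in this series.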

> 
> 
> -- 
> Alexey
> 

* Re: [Qemu-devel] [PATCH qemu v10 10/14] spapr_pci: Enable vfio-pci hotplug
  2015-07-12 14:41       ` Michael Roth
@ 2015-07-13  1:10         ` David Gibson
  2015-07-13  7:06         ` Alexey Kardashevskiy
  1 sibling, 0 replies; 71+ messages in thread
From: David Gibson @ 2015-07-13  1:10 UTC (permalink / raw)
  To: Michael Roth
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, qemu-devel, Gavin Shan

On Sun, Jul 12, 2015 at 09:41:27AM -0500, Michael Roth wrote:
> Quoting Alexey Kardashevskiy (2015-07-11 23:59:45)
> > On 07/11/2015 07:33 AM, Michael Roth wrote:
> > > Quoting Alexey Kardashevskiy (2015-07-05 21:11:06)
> > >> sPAPR IOMMU is managing two copies of an TCE table:
> > >> 1) a guest view of the table - this is what emulated devices use and
> > >> this is where H_GET_TCE reads from;
> > >> 2) a hardware TCE table - only present if there is at least one vfio-pci
> > >> device on a PHB; it is updated via a memory listener on a PHB address
> > >> space which forwards map/unmap requests to vfio-pci IOMMU host driver.
> > >>
> > >> At the moment presence of vfio-pci devices on a bus affect the way
> > >> the guest view table is allocated. If there is no vfio-pci on a PHB
> > >> and the host kernel supports KVM acceleration of H_PUT_TCE, a table
> > >> is allocated in KVM. However, if there is vfio-pci and we do yet not
> > >> support KVM acceleration for these, the table has to be allocated
> > >> by the userspace.
> > >>
> > >> When vfio-pci device is hotplugged and there were no vfio-pci devices
> > >> already, the guest view table could have been allocated by KVM which
> > >> means that H_PUT_TCE is handled by the host kernel and since we
> > >> do not support vfio-pci in KVM, the hardware table will not be updated.
> > >>
> > >> This reallocates the guest view table in QEMU if the first vfio-pci
> > >> device has just been plugged. spapr_tce_realloc_userspace() handles this.
> > >>
> > >> This replays all the mappings to make sure that the tables are in sync.
> > >> This will not have a visible effect though as for a new device
> > >> the guest kernel will allocate-and-map new addresses and therefore
> > >> existing mappings from emulated devices will not be used by vfio-pci
> > >> devices.
> > >>
> > >> This adds calls to spapr_phb_dma_capabilities_update() in PCI hotplug
> > >> hooks.
> > >>
> > >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > >> ---
> > >> Changes:
> > >> v10:
> > >> * removed unnecessary  memory_region_del_subregion() and
> > >> memory_region_add_subregion() as
> > >> "vfio: Unregister IOMMU notifiers when container is destroyed" removes
> > >> notifiers in a more correct way
> > >>
> > >> v9:
> > >> * spapr_phb_hotplug_dma_sync() enumerates TCE tables explicitly rather than
> > >> via object_child_foreach()
> > >> * spapr_phb_hotplug_dma_sync() does memory_region_del_subregion() +
> > >> memory_region_add_subregion() as otherwise vfio_listener_region_del() is not
> > >> called and we end up with vfio_iommu_map_notify registered twice (comments welcome!)
> > >> if we do hotplug+hotunplug+hotplug of the same device.
> > >> * moved spapr_phb_hotplug_dma_sync() on unplug event to rcu as before calling
> > >> spapr_phb_hotplug_dma_sync(), we need VFIO to release the container, otherwise
> > >> spapr_phb_dma_capabilities_update() will decide that the PHB still has VFIO device.
> > >> Actual VFIO PCI device release happens from rcu and since we add ours later,
> > >> it gets executed later and we are good.
> > >> ---
> > >>   hw/ppc/spapr_iommu.c        | 51 ++++++++++++++++++++++++++++++++++++++++++---
> > >>   hw/ppc/spapr_pci.c          | 47 +++++++++++++++++++++++++++++++++++++++++
> > >>   include/hw/pci-host/spapr.h |  1 +
> > >>   include/hw/ppc/spapr.h      |  2 ++
> > >>   trace-events                |  2 ++
> > >>   5 files changed, 100 insertions(+), 3 deletions(-)
> > >>
> > >> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> > >> index 45c00d8..2d99c3b 100644
> > >> --- a/hw/ppc/spapr_iommu.c
> > >> +++ b/hw/ppc/spapr_iommu.c
> > >> @@ -78,12 +78,13 @@ static uint64_t *spapr_tce_alloc_table(uint32_t liobn,
> > >>                                          uint32_t nb_table,
> > >>                                          uint32_t page_shift,
> > >>                                          int *fd,
> > >> -                                       bool vfio_accel)
> > >> +                                       bool vfio_accel,
> > >> +                                       bool force_userspace)
> > >>   {
> > >>       uint64_t *table = NULL;
> > >>       uint64_t window_size = (uint64_t)nb_table << page_shift;
> > >>
> > >> -    if (kvm_enabled() && !(window_size >> 32)) {
> > >> +    if (kvm_enabled() && !force_userspace && !(window_size >> 32)) {
> > >>           table = kvmppc_create_spapr_tce(liobn, window_size, fd, vfio_accel);
> > >>       }
> > >>
> > >> @@ -222,7 +223,8 @@ static void spapr_tce_table_do_enable(sPAPRTCETable *tcet, bool vfio_accel)
> > >>                                           tcet->nb_table,
> > >>                                           tcet->page_shift,
> > >>                                           &tcet->fd,
> > >> -                                        vfio_accel);
> > >> +                                        vfio_accel,
> > >> +                                        false);
> > >>
> > >>       memory_region_set_size(&tcet->iommu,
> > >>                              (uint64_t)tcet->nb_table << tcet->page_shift);
> > >> @@ -495,6 +497,49 @@ int spapr_dma_dt(void *fdt, int node_off, const char *propname,
> > >>       return 0;
> > >>   }
> > >>
> > >> +static int spapr_tce_do_replay(sPAPRTCETable *tcet, uint64_t *table)
> > >> +{
> > >> +    target_ulong ioba = tcet->bus_offset, pgsz = (1ULL << tcet->page_shift);
> > >> +    long i, ret = 0;
> > >> +
> > >> +    for (i = 0; i < tcet->nb_table; ++i, ioba += pgsz) {
> > >> +        ret = put_tce_emu(tcet, ioba, table[i]);
> > >> +        if (ret) {
> > >> +            break;
> > >> +        }
> > >> +    }
> > >> +
> > >> +    return ret;
> > >> +}
> > >> +
> > >> +int spapr_tce_replay(sPAPRTCETable *tcet)
> > >> +{
> > >> +    return spapr_tce_do_replay(tcet, tcet->table);
> > >> +}
> > >> +
> > >> +int spapr_tce_realloc_userspace(sPAPRTCETable *tcet, bool replay)
> > >> +{
> > >> +    int ret = 0, oldfd;
> > >> +    uint64_t *oldtable;
> > >> +
> > >> +    oldtable = tcet->table;
> > >> +    oldfd = tcet->fd;
> > >> +    tcet->table = spapr_tce_alloc_table(tcet->liobn,
> > >> +                                        tcet->nb_table,
> > >> +                                        tcet->page_shift,
> > >> +                                        &tcet->fd,
> > >> +                                        false,
> > >> +                                        true); /* force_userspace */
> > >> +
> > >> +    if (replay) {
> > >> +        ret = spapr_tce_do_replay(tcet, oldtable);
> > >> +    }
> > >> +
> > >> +    spapr_tce_free_table(oldtable, oldfd, tcet->nb_table);
> > >> +
> > >> +    return ret;
> > >> +}
> > >> +
> > >>   int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
> > >>                         sPAPRTCETable *tcet)
> > >>   {
> > >> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> > >> index 76c988f..d1fa157 100644
> > >> --- a/hw/ppc/spapr_pci.c
> > >> +++ b/hw/ppc/spapr_pci.c
> > >> @@ -827,6 +827,43 @@ int spapr_phb_dma_reset(sPAPRPHBState *sphb)
> > >>       return 0;
> > >>   }
> > >>
> > >> +static int spapr_phb_hotplug_dma_sync(sPAPRPHBState *sphb)
> > >> +{
> > >> +    int ret = 0, i;
> > >> +    bool had_vfio = sphb->has_vfio;
> > >> +    sPAPRTCETable *tcet;
> > >> +
> > >> +    spapr_phb_dma_capabilities_update(sphb);
> > >
> > > So, in the unplug case, we update caps, but has_vfio = false so we don't do
> > > anything else below.
> > 
> > Yes.
> > 
> > 
> > > Does that mean our KVM-accelerated TCE table won't get restored until reboot?
> > > Would it make sense to re-enable it here?
> > 
> > No, it should be reenabled as DMA config is completely reset during the
> > machine reset by "[PATCH qemu v10 08/14] spapr_pci: Do complete reset of 
> > DMA config when resetting PHB"
> 
> We don't get a PHB-level reset for PCI hotplug though, so it wouldn't
> get re-enabled till guest system reset. I'm not sure how big a deal that
> is performance-wise, but it seems a little unexpected.

Right.  I'm not particularly fussed by this though.  If you really
want to optimize performance, you're still better off segregating vfio
devices and emulated devices onto different PHBs.

It seems to me if you plug a vfio device onto a bus once, you might
well do so again in the future, which would mean ejecting all the
kernel accelerated emulated devices again.

> > >> +
> > >> +    if (!had_vfio && sphb->has_vfio) {
> > >> +        for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
> > >> +            tcet = spapr_tce_find_by_liobn(SPAPR_PCI_LIOBN(sphb->index, i));
> > >> +            if (!tcet || !tcet->enabled) {
> > >> +                continue;
> > >> +            }
> > >> +            if (tcet->fd >= 0) {
> > >> +                /*
> > >> +                 * We got first vfio-pci device on accelerated table.
> > >> +                 * VFIO acceleration is not possible.
> > >> +                 * Reallocate table in userspace and replay mappings.
> > >> +                 */
> > >> +                ret = spapr_tce_realloc_userspace(tcet, true);
> > >> +                trace_spapr_pci_dma_realloc_update(tcet->liobn, ret);
> > >> +            } else {
> > >> +                /* There was no acceleration, so just replay mappings. */
> > >> +                ret = spapr_tce_replay(tcet);
> > >> +                trace_spapr_pci_dma_update(tcet->liobn, ret);
> > >> +            }
> > >> +            if (ret) {
> > >> +                break;
> > >> +            }
> > >> +        }
> > >> +        return ret;
> > >> +    }
> > >> +
> > >> +    return 0;
> > >> +}
> > >> +
> > >>   /* Macros to operate with address in OF binding to PCI */
> > >>   #define b_x(x, p, l)    (((x) & ((1<<(l))-1)) << (p))
> > >>   #define b_n(x)          b_x((x), 31, 1) /* 0 if relocatable */
> > >> @@ -1106,6 +1143,7 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
> > >>               error_setg(errp, "Failed to create pci child device tree node");
> > >>               goto out;
> > >>           }
> > >> +        spapr_phb_hotplug_dma_sync(phb);
> > >>       }
> > >>
> > >>       drck->attach(drc, DEVICE(pdev),
> > >> @@ -1116,6 +1154,12 @@ out:
> > >>       }
> > >>   }
> > >>
> > >> +static void spapr_phb_remove_sync_dma(struct rcu_head *head)
> > >> +{
> > >> +    sPAPRPHBState *sphb = container_of(head, sPAPRPHBState, rcu);
> > >> +    spapr_phb_hotplug_dma_sync(sphb);
> > >> +}
> > >> +
> > >>   static void spapr_phb_remove_pci_device_cb(DeviceState *dev, void *opaque)
> > >>   {
> > >>       /* some version guests do not wait for completion of a device
> > >> @@ -1130,6 +1174,9 @@ static void spapr_phb_remove_pci_device_cb(DeviceState *dev, void *opaque)
> > >>        */
> > >>       pci_device_reset(PCI_DEVICE(dev));
> > >>       object_unparent(OBJECT(dev));
> > >> +
> > >> +    /* Actual VFIO device release happens from RCU so postpone DMA update */
> > >> +    call_rcu1(&((sPAPRPHBState *)opaque)->rcu, spapr_phb_remove_sync_dma);
> > >
> > > Hmm... can't think of any reason this wouldn't work, but would be nice
> > > if there was something a bit more straightforward...
> > >
> > > When the device is actually finalized, it does:
> > 
> > The problem is with "when". I looked at gdb, this vfio_instance_finalize() 
> > is called from an RCU handler because the last reference is dropped because 
> > of some memory region was removed and this was postponed to RCU.
> > 
> > If object_unparent(OBJECT(dev)) did call vfio_put_group() on the same 
> > stack, I would not need this call_rcu1.
> 
> Right, object_unparent() has no guarantee of immediately finalizing it,
> but you *do* have the guarantee that
> vfio_instance_finalize()->vfio_put_group() will only be called once the
> device is actually finalized, regardless of whether or not it's kicked
> off by the RCU thread. So it seems more straightforward to hook into
> that rather than needing to employ internal knowledge of
> object_unparent().
> 
> > 
> > >    static void vfio_instance_finalize(Object *obj)
> > >    {
> > >        PCIDevice *pci_dev = PCI_DEVICE(obj);
> > >        VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pci_dev);
> > >        VFIOGroup *group = vdev->vbasedev.group;
> > >
> > >        ...
> > >
> > >        vfio_put_device(vdev);
> > >        vfio_put_group(group);
> > >    }
> > >
> > > When all the groups are removed from a VFIO container, there's a
> > > call to container->iommu_data.release(container). This is the
> > > event we really care about, not so much the fact that a device
> > > got released.
> > >
> > > Right now all it does is remove the memory listener, but maybe it
> > > makes sense to allow an additional callback/opaque to register for
> > > the event. Not sure what the best way to do that is though...
> > 
> > In this context I rather care about container's fd being closed so 
> > VFIO_IOMMU_SPAPR_TCE_GET_INFO would fail in my dma-sync and this way I know 
> > that there is no more VFIO devices.
> 
> VFIO container getting closed also corresponds to the last group being
> removed though. Even if it didn't, I think VFIO_IOMMU_SPAPR_TCE_GET_INFO
> would fail unless at least one iommu group was attached to the
> container? So knowing when the first/last group is removed seems to
> be the real main event.

So, I think what you're basically suggesting here is that the
"vfio-active" state of the PHB should be tied to the presence of an
active container associated with the PHB, rather than whether there
are actually vfio devices present on the bus.

The container approach does seem more conceptually correct to me, as
well as possibly simplifying the implementation a bit.
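
As a rough sketch of what tying the state to the container might look
like, assuming a simple group count with a notifier on the 0 <-> 1
edges. ContainerModel, PhbVfioState and the function names are
hypothetical; the real VFIO code has no such callback today:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a VFIO container; not the QEMU structures. */
typedef struct {
    int nr_groups;
    void (*active_changed)(void *opaque, bool active);
    void *opaque;
} ContainerModel;

/* Stand-in for the attach side: notify only on the first group. */
static void container_get_group(ContainerModel *c)
{
    if (c->nr_groups++ == 0 && c->active_changed) {
        c->active_changed(c->opaque, true);
    }
}

/* Stand-in for the release side: notify only on the last group. */
static void container_put_group(ContainerModel *c)
{
    assert(c->nr_groups > 0);
    if (--c->nr_groups == 0 && c->active_changed) {
        c->active_changed(c->opaque, false);
    }
}

/* What the PHB side would keep instead of probing on every plug/unplug. */
typedef struct {
    bool vfio_active;
    int transitions;
} PhbVfioState;

static void phb_vfio_active_changed(void *opaque, bool active)
{
    PhbVfioState *phb = opaque;

    phb->vfio_active = active;
    phb->transitions++;
}
```

With something along these lines the PHB would only hear about the two
transitions it cares about, and plugging a second device into an
already-active container would not trigger any DMA resync.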

> > > And, kind of a separate topic, but if we could do something
> > > similar for the initial group attach,  we could drop *all* the
> > > plug/unplug hooks, and the hooks themselves could drop all
> > > the !had_vfio / has_vfio logic/probing, since that would then
> > > be clear from the context.
> > 
> > Drop all hooks? HotplugHandlerClass hooks? Can you do that? :) Are not they 
> > what HMP calls on "device_add"?
> 
> I mean all the places we call into code that ends up doing:
>   spapr_phb_dma_capabilities_update(sphb);
>   /* do something special if has_vfio changed */
> 
> We currently have one in PCI plug, PCI unplug, and PHB reset. If we
> hooked into vfio_put_group(), we could drop each of those hooks. PHB
> reset would still have the special case of restoring default 32-bit
> window config, but it wouldn't need to care about has_vfio status
> anymore, all that code could be handled by
> vfio_put_group/vfio_get_group callbacks.
> 
> I wouldn't hold up the series for it, but I think it would greatly
> simplify tracking has_vfio changes.
> 
> But I do think spapr_phb_hotplug_dma_sync() should have some logic
> squashed in for re-enabling TCE acceleration on
> (had_vfio && !has_vfio).
> 
> > 
> > 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [Qemu-devel] [PATCH qemu v10 10/14] spapr_pci: Enable vfio-pci hotplug
  2015-07-12 14:41       ` Michael Roth
  2015-07-13  1:10         ` David Gibson
@ 2015-07-13  7:06         ` Alexey Kardashevskiy
  1 sibling, 0 replies; 71+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-13  7:06 UTC (permalink / raw)
  To: Michael Roth, qemu-devel
  Cc: Alex Williamson, qemu-ppc, Gavin Shan, David Gibson

On 07/13/2015 12:41 AM, Michael Roth wrote:
> Quoting Alexey Kardashevskiy (2015-07-11 23:59:45)
>> On 07/11/2015 07:33 AM, Michael Roth wrote:
>>> Quoting Alexey Kardashevskiy (2015-07-05 21:11:06)
>>>> sPAPR IOMMU is managing two copies of an TCE table:
>>>> 1) a guest view of the table - this is what emulated devices use and
>>>> this is where H_GET_TCE reads from;
>>>> 2) a hardware TCE table - only present if there is at least one vfio-pci
>>>> device on a PHB; it is updated via a memory listener on a PHB address
>>>> space which forwards map/unmap requests to vfio-pci IOMMU host driver.
>>>>
>>>> At the moment presence of vfio-pci devices on a bus affect the way
>>>> the guest view table is allocated. If there is no vfio-pci on a PHB
>>>> and the host kernel supports KVM acceleration of H_PUT_TCE, a table
>>>> is allocated in KVM. However, if there is vfio-pci and we do yet not
>>>> support KVM acceleration for these, the table has to be allocated
>>>> by the userspace.
>>>>
>>>> When vfio-pci device is hotplugged and there were no vfio-pci devices
>>>> already, the guest view table could have been allocated by KVM which
>>>> means that H_PUT_TCE is handled by the host kernel and since we
>>>> do not support vfio-pci in KVM, the hardware table will not be updated.
>>>>
>>>> This reallocates the guest view table in QEMU if the first vfio-pci
>>>> device has just been plugged. spapr_tce_realloc_userspace() handles this.
>>>>
>>>> This replays all the mappings to make sure that the tables are in sync.
>>>> This will not have a visible effect though as for a new device
>>>> the guest kernel will allocate-and-map new addresses and therefore
>>>> existing mappings from emulated devices will not be used by vfio-pci
>>>> devices.
>>>>
>>>> This adds calls to spapr_phb_dma_capabilities_update() in PCI hotplug
>>>> hooks.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> ---
>>>> Changes:
>>>> v10:
>>>> * removed unnecessary  memory_region_del_subregion() and
>>>> memory_region_add_subregion() as
>>>> "vfio: Unregister IOMMU notifiers when container is destroyed" removes
>>>> notifiers in a more correct way
>>>>
>>>> v9:
>>>> * spapr_phb_hotplug_dma_sync() enumerates TCE tables explicitly rather than
>>>> via object_child_foreach()
>>>> * spapr_phb_hotplug_dma_sync() does memory_region_del_subregion() +
>>>> memory_region_add_subregion() as otherwise vfio_listener_region_del() is not
>>>> called and we end up with vfio_iommu_map_notify registered twice (comments welcome!)
>>>> if we do hotplug+hotunplug+hotplug of the same device.
>>>> * moved spapr_phb_hotplug_dma_sync() on unplug event to rcu as before calling
>>>> spapr_phb_hotplug_dma_sync(), we need VFIO to release the container, otherwise
>>>> spapr_phb_dma_capabilities_update() will decide that the PHB still has VFIO device.
>>>> Actual VFIO PCI device release happens from rcu and since we add ours later,
>>>> it gets executed later and we are good.
>>>> ---
>>>>    hw/ppc/spapr_iommu.c        | 51 ++++++++++++++++++++++++++++++++++++++++++---
>>>>    hw/ppc/spapr_pci.c          | 47 +++++++++++++++++++++++++++++++++++++++++
>>>>    include/hw/pci-host/spapr.h |  1 +
>>>>    include/hw/ppc/spapr.h      |  2 ++
>>>>    trace-events                |  2 ++
>>>>    5 files changed, 100 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
>>>> index 45c00d8..2d99c3b 100644
>>>> --- a/hw/ppc/spapr_iommu.c
>>>> +++ b/hw/ppc/spapr_iommu.c
>>>> @@ -78,12 +78,13 @@ static uint64_t *spapr_tce_alloc_table(uint32_t liobn,
>>>>                                           uint32_t nb_table,
>>>>                                           uint32_t page_shift,
>>>>                                           int *fd,
>>>> -                                       bool vfio_accel)
>>>> +                                       bool vfio_accel,
>>>> +                                       bool force_userspace)
>>>>    {
>>>>        uint64_t *table = NULL;
>>>>        uint64_t window_size = (uint64_t)nb_table << page_shift;
>>>>
>>>> -    if (kvm_enabled() && !(window_size >> 32)) {
>>>> +    if (kvm_enabled() && !force_userspace && !(window_size >> 32)) {
>>>>            table = kvmppc_create_spapr_tce(liobn, window_size, fd, vfio_accel);
>>>>        }
>>>>
>>>> @@ -222,7 +223,8 @@ static void spapr_tce_table_do_enable(sPAPRTCETable *tcet, bool vfio_accel)
>>>>                                            tcet->nb_table,
>>>>                                            tcet->page_shift,
>>>>                                            &tcet->fd,
>>>> -                                        vfio_accel);
>>>> +                                        vfio_accel,
>>>> +                                        false);
>>>>
>>>>        memory_region_set_size(&tcet->iommu,
>>>>                               (uint64_t)tcet->nb_table << tcet->page_shift);
>>>> @@ -495,6 +497,49 @@ int spapr_dma_dt(void *fdt, int node_off, const char *propname,
>>>>        return 0;
>>>>    }
>>>>
>>>> +static int spapr_tce_do_replay(sPAPRTCETable *tcet, uint64_t *table)
>>>> +{
>>>> +    target_ulong ioba = tcet->bus_offset, pgsz = (1ULL << tcet->page_shift);
>>>> +    long i, ret = 0;
>>>> +
>>>> +    for (i = 0; i < tcet->nb_table; ++i, ioba += pgsz) {
>>>> +        ret = put_tce_emu(tcet, ioba, table[i]);
>>>> +        if (ret) {
>>>> +            break;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +int spapr_tce_replay(sPAPRTCETable *tcet)
>>>> +{
>>>> +    return spapr_tce_do_replay(tcet, tcet->table);
>>>> +}
>>>> +
>>>> +int spapr_tce_realloc_userspace(sPAPRTCETable *tcet, bool replay)
>>>> +{
>>>> +    int ret = 0, oldfd;
>>>> +    uint64_t *oldtable;
>>>> +
>>>> +    oldtable = tcet->table;
>>>> +    oldfd = tcet->fd;
>>>> +    tcet->table = spapr_tce_alloc_table(tcet->liobn,
>>>> +                                        tcet->nb_table,
>>>> +                                        tcet->page_shift,
>>>> +                                        &tcet->fd,
>>>> +                                        false,
>>>> +                                        true); /* force_userspace */
>>>> +
>>>> +    if (replay) {
>>>> +        ret = spapr_tce_do_replay(tcet, oldtable);
>>>> +    }
>>>> +
>>>> +    spapr_tce_free_table(oldtable, oldfd, tcet->nb_table);
>>>> +
>>>> +    return ret;
>>>> +}
>>>> +
>>>>    int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
>>>>                          sPAPRTCETable *tcet)
>>>>    {
>>>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>>>> index 76c988f..d1fa157 100644
>>>> --- a/hw/ppc/spapr_pci.c
>>>> +++ b/hw/ppc/spapr_pci.c
>>>> @@ -827,6 +827,43 @@ int spapr_phb_dma_reset(sPAPRPHBState *sphb)
>>>>        return 0;
>>>>    }
>>>>
>>>> +static int spapr_phb_hotplug_dma_sync(sPAPRPHBState *sphb)
>>>> +{
>>>> +    int ret = 0, i;
>>>> +    bool had_vfio = sphb->has_vfio;
>>>> +    sPAPRTCETable *tcet;
>>>> +
>>>> +    spapr_phb_dma_capabilities_update(sphb);
>>>
>>> So, in the unplug case, we update caps, but has_vfio = false so we don't do
>>> anything else below.
>>
>> Yes.
>>
>>
>>> Does that mean our KVM-accelerated TCE table won't get restored until reboot?
>>> Would it make sense to re-enable it here?
>>
>> No, it should be reenabled as DMA config is completely reset during the
>> machine reset by "[PATCH qemu v10 08/14] spapr_pci: Do complete reset of
>> DMA config when resetting PHB"
>
> We don't get a PHB-level reset for PCI hotplug though, so it wouldn't
> get re-enabled till guest system reset.

Yes, this is what I said :)


> I'm not sure how big a deal that
> is performance-wise, but it seems a little unexpected.

True...



>>>> +
>>>> +    if (!had_vfio && sphb->has_vfio) {
>>>> +        for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
>>>> +            tcet = spapr_tce_find_by_liobn(SPAPR_PCI_LIOBN(sphb->index, i));
>>>> +            if (!tcet || !tcet->enabled) {
>>>> +                continue;
>>>> +            }
>>>> +            if (tcet->fd >= 0) {
>>>> +                /*
>>>> +                 * We got first vfio-pci device on accelerated table.
>>>> +                 * VFIO acceleration is not possible.
>>>> +                 * Reallocate table in userspace and replay mappings.
>>>> +                 */
>>>> +                ret = spapr_tce_realloc_userspace(tcet, true);
>>>> +                trace_spapr_pci_dma_realloc_update(tcet->liobn, ret);
>>>> +            } else {
>>>> +                /* There was no acceleration, so just replay mappings. */
>>>> +                ret = spapr_tce_replay(tcet);
>>>> +                trace_spapr_pci_dma_update(tcet->liobn, ret);
>>>> +            }
>>>> +            if (ret) {
>>>> +                break;
>>>> +            }
>>>> +        }
>>>> +        return ret;
>>>> +    }
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>>    /* Macros to operate with address in OF binding to PCI */
>>>>    #define b_x(x, p, l)    (((x) & ((1<<(l))-1)) << (p))
>>>>    #define b_n(x)          b_x((x), 31, 1) /* 0 if relocatable */
>>>> @@ -1106,6 +1143,7 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
>>>>                error_setg(errp, "Failed to create pci child device tree node");
>>>>                goto out;
>>>>            }
>>>> +        spapr_phb_hotplug_dma_sync(phb);
>>>>        }
>>>>
>>>>        drck->attach(drc, DEVICE(pdev),
>>>> @@ -1116,6 +1154,12 @@ out:
>>>>        }
>>>>    }
>>>>
>>>> +static void spapr_phb_remove_sync_dma(struct rcu_head *head)
>>>> +{
>>>> +    sPAPRPHBState *sphb = container_of(head, sPAPRPHBState, rcu);
>>>> +    spapr_phb_hotplug_dma_sync(sphb);
>>>> +}
>>>> +
>>>>    static void spapr_phb_remove_pci_device_cb(DeviceState *dev, void *opaque)
>>>>    {
>>>>        /* some version guests do not wait for completion of a device
>>>> @@ -1130,6 +1174,9 @@ static void spapr_phb_remove_pci_device_cb(DeviceState *dev, void *opaque)
>>>>         */
>>>>        pci_device_reset(PCI_DEVICE(dev));
>>>>        object_unparent(OBJECT(dev));
>>>> +
>>>> +    /* Actual VFIO device release happens from RCU so postpone DMA update */
>>>> +    call_rcu1(&((sPAPRPHBState *)opaque)->rcu, spapr_phb_remove_sync_dma);
>>>
>>> Hmm... can't think of any reason this wouldn't work, but would be nice
>>> if there was something a bit more straightforward...
>>>
>>> When the device is actually finalized, it does:
>>
>> The problem is with "when". I looked at gdb, this vfio_instance_finalize()
>> is called from an RCU handler because the last reference is dropped because
>> of some memory region was removed and this was postponed to RCU.
>>
>> If object_unparent(OBJECT(dev)) did call vfio_put_group() on the same
>> stack, I would not need this call_rcu1.
>
> Right, object_unparent() has no guarantee of immediately finalizing it,
> but you *do* have the guarantee that
> vfio_instance_finalize()->vfio_put_group() will only be called once the
> device is actually finalized, regardless of whether or not it's kicked
> off by the RCU thread. So it seems more straightforward to hook into
> that rather than needing to employ internal knowledge of
> object_unparent().


Right... I'll try adding a hook and see what review I'll receive.



>>
>>>     static void vfio_instance_finalize(Object *obj)
>>>     {
>>>         PCIDevice *pci_dev = PCI_DEVICE(obj);
>>>         VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pci_dev);
>>>         VFIOGroup *group = vdev->vbasedev.group;
>>>
>>>         ...
>>>
>>>         vfio_put_device(vdev);
>>>         vfio_put_group(group);
>>>     }
>>>
>>> When all the groups are removed from a VFIO container, there's a
>>> call to container->iommu_data.release(container). This is the
>>> event we really care about, not so much the fact that a device
>>> got released.
>>>
>>> Right now all it does is remove the memory listener, but maybe it
>>> makes sense to allow an additional callback/opaque to register for
>>> the event. Not sure what the best way to do that is though...
>>
>> In this context I rather care about container's fd being closed so
>> VFIO_IOMMU_SPAPR_TCE_GET_INFO would fail in my dma-sync and this way I know
>> that there is no more VFIO devices.
>
> VFIO container getting closed also corresponds to the last group being
> removed though. Even if it didn't, I think VFIO_IOMMU_SPAPR_TCE_GET_INFO
> would fail unless at least one iommu group was attached to the
> container? So knowing when the first/last group is removed seems to
> be the real main event.

Right. This is still VFIO knowledge which PHB does not have access to. We 
will need these new hooks.



>>> And, kind of a separate topic, but if we could do something
>>> similar for the initial group attach,  we could drop *all* the
>>> plug/unplug hooks, and the hooks themselves could drop all
>>> the !had_vfio / has_vfio logic/probing, since that would then
>>> be clear from the context.
>>
>> Drop all hooks? HotplugHandlerClass hooks? Can you do that? :) Are not they
>> what HMP calls on "device_add"?
>
> I mean all the places we call into code that ends up doing:
>    spapr_phb_dma_capabilities_update(sphb);
>    /* do something special if has_vfio changed */
>
> We currently have one in PCI plug, PCI unplug, and PHB reset. If we
> hooked into vfio_put_group(), we could drop each of those hooks. PHB
> reset would still have the special case of restoring default 32-bit
> window config, but it wouldn't need to care about has_vfio status
> anymore, all that code could be handled by
> vfio_put_group/vfio_get_group callbacks.
>
> I wouldn't hold up the series for it, but I think it would greatly
> simplify tracking has_vfio changes.

It is now 2 series and I can try what you suggest.


> But I do think spapr_phb_hotplug_dma_sync() should have some logic
> squashed in for re-enabling TCE acceleration on
> (had_vfio && !has_vfio).
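
The transition check could be sketched roughly as below. This is a
standalone illustration only: DmaSyncAction and dma_sync_action() are
hypothetical names, and turning acceleration off on the opposite
transition (!had_vfio && has_vfio) is an assumption added for symmetry,
not something stated in this thread.

```c
#include <stdbool.h>

/* Hypothetical result type: what a dma-sync step should do. */
typedef enum DmaSyncAction {
    DMA_SYNC_NONE,           /* has_vfio did not change */
    DMA_SYNC_DISABLE_ACCEL,  /* assumption: first VFIO device appeared */
    DMA_SYNC_ENABLE_ACCEL,   /* last VFIO device went away */
} DmaSyncAction;

/* Decide what to do from the previous and current has_vfio state. */
static DmaSyncAction dma_sync_action(bool had_vfio, bool has_vfio)
{
    if (had_vfio && !has_vfio) {
        return DMA_SYNC_ENABLE_ACCEL;
    }
    if (!had_vfio && has_vfio) {
        return DMA_SYNC_DISABLE_ACCEL;
    }
    return DMA_SYNC_NONE;
}
```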




-- 
Alexey


end of thread, other threads:[~2015-07-13  7:07 UTC | newest]

Thread overview: 71+ messages
2015-07-06  2:10 [Qemu-devel] [PATCH qemu v10 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
2015-07-06  2:10 ` [Qemu-devel] [PATCH qemu v10 01/14] linux-headers: Update to 4.2-rc1 Alexey Kardashevskiy
2015-07-06 11:18   ` Paolo Bonzini
2015-07-06  2:10 ` [Qemu-devel] [PATCH qemu v10 02/14] vmstate: Define VARRAY with VMS_ALLOC Alexey Kardashevskiy
2015-07-06 14:21   ` Thomas Huth
2015-07-06  2:10 ` [Qemu-devel] [PATCH qemu v10 03/14] spapr_pci: Convert finish_realize() to dma_capabilities_update()+dma_init_window() Alexey Kardashevskiy
2015-07-06 16:41   ` Laurent Vivier
2015-07-07  0:28     ` Alexey Kardashevskiy
2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 04/14] spapr_iommu: Move table allocation to helpers Alexey Kardashevskiy
2015-07-06 15:14   ` Thomas Huth
2015-07-06 15:43     ` Alexey Kardashevskiy
2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 05/14] spapr_iommu: Introduce "enabled" state for TCE table Alexey Kardashevskiy
2015-07-06 10:07   ` David Gibson
2015-07-06 17:04   ` Thomas Huth
2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 06/14] spapr_iommu: Remove vfio_accel flag from sPAPRTCETable Alexey Kardashevskiy
2015-07-06 16:45   ` Laurent Vivier
2015-07-06 17:11   ` Thomas Huth
2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 07/14] spapr_iommu: Add root memory region Alexey Kardashevskiy
2015-07-06 19:15   ` Thomas Huth
2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 08/14] spapr_pci: Do complete reset of DMA config when resetting PHB Alexey Kardashevskiy
2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 09/14] spapr_vfio_pci: Remove redundant spapr-pci-vfio-host-bridge Alexey Kardashevskiy
2015-07-06 21:13   ` Thomas Huth
2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 10/14] spapr_pci: Enable vfio-pci hotplug Alexey Kardashevskiy
2015-07-06 10:27   ` David Gibson
2015-07-06 21:31   ` Thomas Huth
2015-07-07  9:28     ` Alexey Kardashevskiy
2015-07-10 21:33   ` Michael Roth
2015-07-12  4:59     ` Alexey Kardashevskiy
2015-07-12 14:41       ` Michael Roth
2015-07-13  1:10         ` David Gibson
2015-07-13  7:06         ` Alexey Kardashevskiy
2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 11/14] spapr_pci_vfio: Enable multiple groups per container Alexey Kardashevskiy
2015-07-07  7:02   ` Thomas Huth
2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 12/14] vfio: Unregister IOMMU notifiers when container is destroyed Alexey Kardashevskiy
2015-07-06 10:33   ` David Gibson
2015-07-06 12:49     ` Alex Williamson
2015-07-06 12:59       ` Alexey Kardashevskiy
2015-07-06 13:45         ` Alex Williamson
2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 13/14] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering) Alexey Kardashevskiy
2015-07-06 13:42   ` Alex Williamson
2015-07-06 15:34     ` Alexey Kardashevskiy
2015-07-06 16:13       ` Alex Williamson
2015-07-07  0:29         ` David Gibson
2015-07-07  0:36           ` Alexey Kardashevskiy
2015-07-07 12:11         ` Alexey Kardashevskiy
2015-07-07 16:24           ` Alex Williamson
2015-07-08  6:26             ` Alexey Kardashevskiy
2015-07-08 14:51               ` Alex Williamson
2015-07-07  7:23   ` Thomas Huth
2015-07-07 10:05     ` Alexey Kardashevskiy
2015-07-07 10:21       ` Thomas Huth
2015-07-07 11:05         ` Alexey Kardashevskiy
2015-07-08  4:30           ` David Gibson
2015-07-08  6:24             ` Thomas Huth
2015-07-08  6:50               ` David Gibson
2015-07-08  7:07             ` Alexey Kardashevskiy
2015-07-08 14:47             ` Alex Williamson
2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 14/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) Alexey Kardashevskiy
2015-07-06 11:06   ` David Gibson
2015-07-06 11:27     ` Alexey Kardashevskiy
2015-07-07  9:46     ` Alexey Kardashevskiy
2015-07-07  4:58   ` David Gibson
2015-07-07  9:33   ` Thomas Huth
2015-07-07 10:43     ` Alexey Kardashevskiy
2015-07-07 11:35       ` Thomas Huth
2015-07-07 11:53         ` Alexey Kardashevskiy
2015-07-06 11:13 ` [Qemu-devel] [PATCH qemu v10 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) David Gibson
2015-07-06 15:54 ` Thomas Huth
2015-07-06 16:07   ` Alexey Kardashevskiy
2015-07-06 16:13     ` Thomas Huth
2015-07-08  4:34   ` David Gibson
