linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/9][RFC] stackable dma_ops for x86
@ 2008-09-22 18:21 Joerg Roedel
  2008-09-22 18:21 ` [PATCH 1/9] x86/iommu: add necessary types for stackable dma_ops Joerg Roedel
                   ` (10 more replies)
  0 siblings, 11 replies; 44+ messages in thread
From: Joerg Roedel @ 2008-09-22 18:21 UTC (permalink / raw)
  To: linux-kernel, kvm, iommu
  Cc: David Woodhouse, Muli Ben-Yehuda, Amit Shah, Ingo Molnar,
	FUJITA Tomonori

Hi,

this patch series implements stackable dma_ops on x86. This is useful to
be able to fall back to a different dma_ops implementation if one can
not handle a particular device (as necessary for example with
paravirtualized device passthrough or if a hardware IOMMU only handles a
subset of available devices).
The patch series is split to change the different iommu implementations
in different patches. This should make it easier for the specific
maintainers to review the part that changes their code.


Thanks for review and comments,

Joerg

diffstat:

 arch/x86/kernel/amd_iommu.c      |    5 ++-
 arch/x86/kernel/pci-calgary_64.c |   21 ++++++------
 arch/x86/kernel/pci-dma.c        |   63 +++++++++++++++++++++++++++++++++++++-
 arch/x86/kernel/pci-gart_64.c    |    9 +++++
 arch/x86/kernel/pci-nommu.c      |   12 +++++++
 arch/x86/kernel/pci-swiotlb_64.c |   16 +++++++++-
 drivers/pci/intel-iommu.c        |    8 +++++
 include/asm-x86/device.h         |    6 ++--
 include/asm-x86/dma-mapping.h    |   43 +++++++++++++++++++++----
 include/asm-x86/swiotlb.h        |    1 +
 10 files changed, 160 insertions(+), 24 deletions(-)




^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 1/9] x86/iommu: add necessary types for stackable dma_ops
  2008-09-22 18:21 [PATCH 0/9][RFC] stackable dma_ops for x86 Joerg Roedel
@ 2008-09-22 18:21 ` Joerg Roedel
  2008-09-22 18:21 ` [PATCH 2/9] x86/iommu: add stackable dma_ops registration interface Joerg Roedel
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 44+ messages in thread
From: Joerg Roedel @ 2008-09-22 18:21 UTC (permalink / raw)
  To: linux-kernel, kvm, iommu
  Cc: David Woodhouse, Muli Ben-Yehuda, Amit Shah, Ingo Molnar,
	FUJITA Tomonori, Joerg Roedel

This patch extends the x86 dma_ops structure so that we can queue it
into a list. It also adds a device_supported callback. It can be used to
find out if a registered dma_ops implementation can handle a given
device. Further it adds an enum with dma_ops implementation types
possible in the future.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
---
 include/asm-x86/dma-mapping.h |   26 ++++++++++++++++++++++++++
 1 files changed, 26 insertions(+), 0 deletions(-)

diff --git a/include/asm-x86/dma-mapping.h b/include/asm-x86/dma-mapping.h
index f408e6d..59d1101 100644
--- a/include/asm-x86/dma-mapping.h
+++ b/include/asm-x86/dma-mapping.h
@@ -17,6 +17,8 @@ extern struct device x86_dma_fallback_dev;
 extern int panic_on_overflow;
 
 struct dma_mapping_ops {
+	struct		list_head list; /* for stacking dma_ops */
+
 	int             (*mapping_error)(struct device *dev,
 					 dma_addr_t dma_addr);
 	void*           (*alloc_coherent)(struct device *dev, size_t size,
@@ -51,9 +53,33 @@ struct dma_mapping_ops {
 				struct scatterlist *sg, int nents,
 				int direction);
 	int             (*dma_supported)(struct device *hwdev, u64 mask);
+	int		(*device_supported)(struct device *hwdev);
 	int		is_phys;
 };
 
+/*
+ * This are the supported types of dma_ops implementations on x86. The
+ * different types mean different priority:
+ *
+ *   DMA_OPS_TYPE_PV   - paravirtualized dma_ops implementation
+ *   DMA_OPS_TYPE_HW   - dma_ops implementation using some kind of hardware
+ *                       support
+ *   DMA_OPS_TYPE_SOFT - a software only dma_ops implementation
+ *
+ * When a device issues its first request to the DMA layer the available
+ * implementations are asked if they support the device by calling the specific
+ * dma_supported callback. The implementations are checked in order, first the
+ * DMA_OPS_TYPE_PV, then the DMA_OPS_TYPE_HW and at the end the
+ * DMA_OPS_TYPE_SOFT implementations. Within the implementation types they are
+ * called in registration order.
+ */
+enum dma_ops_driver_type {
+	DMA_OPS_TYPE_PV,
+	DMA_OPS_TYPE_HW,
+	DMA_OPS_TYPE_SOFT,
+	DMA_OPS_TYPE_MAX,
+};
+
 extern struct dma_mapping_ops *dma_ops;
 
 static inline struct dma_mapping_ops *get_dma_ops(struct device *dev)
-- 
1.5.6.4



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 2/9] x86/iommu: add stackable dma_ops registration interface
  2008-09-22 18:21 [PATCH 0/9][RFC] stackable dma_ops for x86 Joerg Roedel
  2008-09-22 18:21 ` [PATCH 1/9] x86/iommu: add necessary types for stackable dma_ops Joerg Roedel
@ 2008-09-22 18:21 ` Joerg Roedel
  2008-09-22 18:21 ` [PATCH 3/9] x86/iommu: change PCI-NOMMU to use dma_ops register interface Joerg Roedel
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 44+ messages in thread
From: Joerg Roedel @ 2008-09-22 18:21 UTC (permalink / raw)
  To: linux-kernel, kvm, iommu
  Cc: David Woodhouse, Muli Ben-Yehuda, Amit Shah, Ingo Molnar,
	FUJITA Tomonori, Joerg Roedel

This patch adds the function x86_register_dma_ops which drivers can use
to register their specific dma_ops implementations. The patch also adds
the data structures necessary for registration and the initializtion of
them.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
---
 arch/x86/kernel/pci-dma.c     |   31 +++++++++++++++++++++++++++++++
 include/asm-x86/dma-mapping.h |    3 +++
 2 files changed, 34 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index d2f2c01..c783b73 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -47,6 +47,11 @@ EXPORT_SYMBOL(iommu_bio_merge);
 dma_addr_t bad_dma_address __read_mostly = 0;
 EXPORT_SYMBOL(bad_dma_address);
 
+/* The lists for stacking dma_ops and its lock */
+static DEFINE_RWLOCK(dma_ops_list_lock);
+
+static struct list_head dma_ops_list[DMA_OPS_TYPE_MAX];
+
 /* Dummy device used for NULL arguments (normally ISA). Better would
    be probably a smaller DMA mask, but this is bug-to-bug compatible
    to older i386. */
@@ -57,6 +62,26 @@ struct device x86_dma_fallback_dev = {
 };
 EXPORT_SYMBOL(x86_dma_fallback_dev);
 
+static void __init init_dma_ops_list(void)
+{
+	int i;
+
+	for (i = 0; i < DMA_OPS_TYPE_MAX; ++i)
+		INIT_LIST_HEAD(&dma_ops_list[i]);
+}
+
+void x86_register_dma_ops(struct dma_mapping_ops *ops,
+			  enum dma_ops_driver_type type)
+{
+	unsigned long flags;
+
+	BUG_ON(type < DMA_OPS_TYPE_PV || type >= DMA_OPS_TYPE_MAX);
+
+	write_lock_irqsave(&dma_ops_list_lock, flags);
+	list_add_tail(&ops->list, &dma_ops_list[type]);
+	write_unlock_irqrestore(&dma_ops_list_lock, flags);
+}
+
 int dma_set_mask(struct device *dev, u64 mask)
 {
 	if (!dev->dma_mask || !dma_supported(dev, mask))
@@ -257,6 +282,12 @@ EXPORT_SYMBOL(dma_supported);
 
 static int __init pci_iommu_init(void)
 {
+	/*
+	 * initialize dma_ops_list so that drivers can register their
+	 * implementations
+	 */
+	init_dma_ops_list();
+
 	calgary_iommu_init();
 
 	intel_iommu_init();
diff --git a/include/asm-x86/dma-mapping.h b/include/asm-x86/dma-mapping.h
index 59d1101..128786b 100644
--- a/include/asm-x86/dma-mapping.h
+++ b/include/asm-x86/dma-mapping.h
@@ -80,6 +80,9 @@ enum dma_ops_driver_type {
 	DMA_OPS_TYPE_MAX,
 };
 
+extern void x86_register_dma_ops(struct dma_mapping_ops *ops,
+				 enum dma_ops_driver_type type);
+
 extern struct dma_mapping_ops *dma_ops;
 
 static inline struct dma_mapping_ops *get_dma_ops(struct device *dev)
-- 
1.5.6.4



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 3/9] x86/iommu: change PCI-NOMMU to use dma_ops register interface
  2008-09-22 18:21 [PATCH 0/9][RFC] stackable dma_ops for x86 Joerg Roedel
  2008-09-22 18:21 ` [PATCH 1/9] x86/iommu: add necessary types for stackable dma_ops Joerg Roedel
  2008-09-22 18:21 ` [PATCH 2/9] x86/iommu: add stackable dma_ops registration interface Joerg Roedel
@ 2008-09-22 18:21 ` Joerg Roedel
  2008-09-22 18:21 ` [PATCH 4/9] x86/iommu: change SWIOTLB " Joerg Roedel
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 44+ messages in thread
From: Joerg Roedel @ 2008-09-22 18:21 UTC (permalink / raw)
  To: linux-kernel, kvm, iommu
  Cc: David Woodhouse, Muli Ben-Yehuda, Amit Shah, Ingo Molnar,
	FUJITA Tomonori, Joerg Roedel

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
---
 arch/x86/kernel/pci-nommu.c |   12 ++++++++++++
 1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/pci-nommu.c b/arch/x86/kernel/pci-nommu.c
index 1c1c98a..0cbecb9 100644
--- a/arch/x86/kernel/pci-nommu.c
+++ b/arch/x86/kernel/pci-nommu.c
@@ -115,16 +115,28 @@ static void nommu_free_coherent(struct device *dev, size_t size, void *vaddr,
 	free_pages((unsigned long)vaddr, get_order(size));
 }
 
+static int nommu_device_supported(struct device *dev)
+{
+	return 1;
+}
+
 struct dma_mapping_ops nommu_dma_ops = {
 	.alloc_coherent = nommu_alloc_coherent,
 	.free_coherent = nommu_free_coherent,
 	.map_single = nommu_map_single,
 	.map_sg = nommu_map_sg,
+	.device_supported = nommu_device_supported,
 	.is_phys = 1,
 };
 
 void __init no_iommu_init(void)
 {
+	/*
+	 * we always want nommu to be the last fallback if no
+	 * other dma_ops implementation feels responsible
+	 */
+	x86_register_dma_ops(&nommu_dma_ops, DMA_OPS_TYPE_SOFT);
+
 	if (dma_ops)
 		return;
 
-- 
1.5.6.4



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 4/9] x86/iommu: change SWIOTLB to use dma_ops register interface
  2008-09-22 18:21 [PATCH 0/9][RFC] stackable dma_ops for x86 Joerg Roedel
                   ` (2 preceding siblings ...)
  2008-09-22 18:21 ` [PATCH 3/9] x86/iommu: change PCI-NOMMU to use dma_ops register interface Joerg Roedel
@ 2008-09-22 18:21 ` Joerg Roedel
  2008-09-22 18:21 ` [PATCH 5/9] x86/iommu: change GART " Joerg Roedel
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 44+ messages in thread
From: Joerg Roedel @ 2008-09-22 18:21 UTC (permalink / raw)
  To: linux-kernel, kvm, iommu
  Cc: David Woodhouse, Muli Ben-Yehuda, Amit Shah, Ingo Molnar,
	FUJITA Tomonori, Joerg Roedel

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
---
 arch/x86/kernel/pci-dma.c        |    6 +++++-
 arch/x86/kernel/pci-swiotlb_64.c |   16 +++++++++++++++-
 include/asm-x86/swiotlb.h        |    1 +
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index c783b73..b990fb6 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -156,7 +156,7 @@ void __init pci_iommu_alloc(void)
 
 	amd_iommu_detect();
 
-	pci_swiotlb_init();
+	pci_swiotlb_detect();
 }
 
 unsigned long iommu_num_pages(unsigned long addr, unsigned long len)
@@ -296,6 +296,10 @@ static int __init pci_iommu_init(void)
 
 	gart_iommu_init();
 
+#ifdef CONFIG_X86_64
+	pci_swiotlb_init();
+#endif
+
 	no_iommu_init();
 	return 0;
 }
diff --git a/arch/x86/kernel/pci-swiotlb_64.c b/arch/x86/kernel/pci-swiotlb_64.c
index c4ce033..26747e0 100644
--- a/arch/x86/kernel/pci-swiotlb_64.c
+++ b/arch/x86/kernel/pci-swiotlb_64.c
@@ -18,6 +18,12 @@ swiotlb_map_single_phys(struct device *hwdev, phys_addr_t paddr, size_t size,
 	return swiotlb_map_single(hwdev, phys_to_virt(paddr), size, direction);
 }
 
+static int
+swiotlb_device_supported(struct device *hwdev)
+{
+	return 1;
+}
+
 struct dma_mapping_ops swiotlb_dma_ops = {
 	.mapping_error = swiotlb_dma_mapping_error,
 	.alloc_coherent = swiotlb_alloc_coherent,
@@ -33,9 +39,10 @@ struct dma_mapping_ops swiotlb_dma_ops = {
 	.map_sg = swiotlb_map_sg,
 	.unmap_sg = swiotlb_unmap_sg,
 	.dma_supported = NULL,
+	.device_supported = swiotlb_device_supported,
 };
 
-void __init pci_swiotlb_init(void)
+void __init pci_swiotlb_detect(void)
 {
 	/* don't initialize swiotlb if iommu=off (no_iommu=1) */
 	if (!iommu_detected && !no_iommu && max_pfn > MAX_DMA32_PFN)
@@ -45,6 +52,13 @@ void __init pci_swiotlb_init(void)
 	if (swiotlb) {
 		printk(KERN_INFO "PCI-DMA: Using software bounce buffering for IO (SWIOTLB)\n");
 		swiotlb_init();
+	}
+}
+
+void __init pci_swiotlb_init(void)
+{
+	if (swiotlb) {
+		x86_register_dma_ops(&swiotlb_dma_ops, DMA_OPS_TYPE_SOFT);
 		dma_ops = &swiotlb_dma_ops;
 	}
 }
diff --git a/include/asm-x86/swiotlb.h b/include/asm-x86/swiotlb.h
index 2730b35..20dfcb0 100644
--- a/include/asm-x86/swiotlb.h
+++ b/include/asm-x86/swiotlb.h
@@ -45,6 +45,7 @@ extern int swiotlb_force;
 
 #ifdef CONFIG_SWIOTLB
 extern int swiotlb;
+extern void pci_swiotlb_detect(void);
 extern void pci_swiotlb_init(void);
 #else
 #define swiotlb 0
-- 
1.5.6.4



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 5/9] x86/iommu: change GART to use dma_ops register interface
  2008-09-22 18:21 [PATCH 0/9][RFC] stackable dma_ops for x86 Joerg Roedel
                   ` (3 preceding siblings ...)
  2008-09-22 18:21 ` [PATCH 4/9] x86/iommu: change SWIOTLB " Joerg Roedel
@ 2008-09-22 18:21 ` Joerg Roedel
  2008-09-22 18:21 ` [PATCH 6/9] x86/iommu: change Calgary " Joerg Roedel
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 44+ messages in thread
From: Joerg Roedel @ 2008-09-22 18:21 UTC (permalink / raw)
  To: linux-kernel, kvm, iommu
  Cc: David Woodhouse, Muli Ben-Yehuda, Amit Shah, Ingo Molnar,
	FUJITA Tomonori, Joerg Roedel

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
---
 arch/x86/kernel/pci-gart_64.c |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/pci-gart_64.c b/arch/x86/kernel/pci-gart_64.c
index 508ef47..c5891c9 100644
--- a/arch/x86/kernel/pci-gart_64.c
+++ b/arch/x86/kernel/pci-gart_64.c
@@ -533,6 +533,12 @@ gart_free_coherent(struct device *dev, size_t size, void *vaddr,
 	free_pages((unsigned long)vaddr, get_order(size));
 }
 
+static int gart_device_supported(struct device *dev)
+{
+	/* GART remaps _everything_ (including CPU memory accesses) */
+	return 1;
+}
+
 static int no_agp;
 
 static __init unsigned long check_iommu_size(unsigned long aper, u64 aper_size)
@@ -736,6 +742,7 @@ static struct dma_mapping_ops gart_dma_ops = {
 	.unmap_sg			= gart_unmap_sg,
 	.alloc_coherent			= gart_alloc_coherent,
 	.free_coherent			= gart_free_coherent,
+	.device_supported		= gart_device_supported,
 };
 
 void gart_iommu_shutdown(void)
@@ -874,6 +881,8 @@ void __init gart_iommu_init(void)
 		iommu_gatt_base[i] = gart_unmapped_entry;
 
 	flush_gart();
+
+	x86_register_dma_ops(&gart_dma_ops, DMA_OPS_TYPE_HW);
 	dma_ops = &gart_dma_ops;
 }
 
-- 
1.5.6.4



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 6/9] x86/iommu: change Calgary to use dma_ops register interface
  2008-09-22 18:21 [PATCH 0/9][RFC] stackable dma_ops for x86 Joerg Roedel
                   ` (4 preceding siblings ...)
  2008-09-22 18:21 ` [PATCH 5/9] x86/iommu: change GART " Joerg Roedel
@ 2008-09-22 18:21 ` Joerg Roedel
  2008-09-23 14:37   ` Muli Ben-Yehuda
  2008-09-30 11:10   ` Ingo Molnar
  2008-09-22 18:21 ` [PATCH 7/9] x86/iommu: change AMD IOMMU " Joerg Roedel
                   ` (4 subsequent siblings)
  10 siblings, 2 replies; 44+ messages in thread
From: Joerg Roedel @ 2008-09-22 18:21 UTC (permalink / raw)
  To: linux-kernel, kvm, iommu
  Cc: David Woodhouse, Muli Ben-Yehuda, Amit Shah, Ingo Molnar,
	FUJITA Tomonori, Joerg Roedel

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
---
 arch/x86/kernel/pci-calgary_64.c |   21 +++++++++++----------
 1 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/pci-calgary_64.c b/arch/x86/kernel/pci-calgary_64.c
index fe7695e..e2f2f60 100644
--- a/arch/x86/kernel/pci-calgary_64.c
+++ b/arch/x86/kernel/pci-calgary_64.c
@@ -525,6 +525,15 @@ static void calgary_free_coherent(struct device *dev, size_t size,
 	free_pages((unsigned long)vaddr, get_order(size));
 }
 
+static int calgary_device_supported(struct device *dev)
+{
+	struct iommu_table *tbl;
+
+	tbl = find_iommu_table(dev);
+
+	return translation_enabled(tbl);
+}
+
 static struct dma_mapping_ops calgary_dma_ops = {
 	.alloc_coherent = calgary_alloc_coherent,
 	.free_coherent = calgary_free_coherent,
@@ -532,6 +541,7 @@ static struct dma_mapping_ops calgary_dma_ops = {
 	.unmap_single = calgary_unmap_single,
 	.map_sg = calgary_map_sg,
 	.unmap_sg = calgary_unmap_sg,
+	.device_supported = calgary_device_supported,
 };
 
 static inline void __iomem * busno_to_bbar(unsigned char num)
@@ -1223,16 +1233,6 @@ static int __init calgary_init(void)
 			goto error;
 	} while (1);
 
-	dev = NULL;
-	for_each_pci_dev(dev) {
-		struct iommu_table *tbl;
-
-		tbl = find_iommu_table(&dev->dev);
-
-		if (translation_enabled(tbl))
-			dev->dev.archdata.dma_ops = &calgary_dma_ops;
-	}
-
 	return ret;
 
 error:
@@ -1535,6 +1535,7 @@ int __init calgary_iommu_init(void)
 
 	force_iommu = 1;
 	bad_dma_address = 0x0;
+	x86_register_dma_ops(&calgary_dma_ops, DMA_OPS_TYPE_HW);
 	/* dma_ops is set to swiotlb or nommu */
 	if (!dma_ops)
 		dma_ops = &nommu_dma_ops;
-- 
1.5.6.4



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 7/9] x86/iommu: change AMD IOMMU to use dma_ops register interface
  2008-09-22 18:21 [PATCH 0/9][RFC] stackable dma_ops for x86 Joerg Roedel
                   ` (5 preceding siblings ...)
  2008-09-22 18:21 ` [PATCH 6/9] x86/iommu: change Calgary " Joerg Roedel
@ 2008-09-22 18:21 ` Joerg Roedel
  2008-09-22 18:21 ` [PATCH 8/9] x86/iommu: change Intel " Joerg Roedel
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 44+ messages in thread
From: Joerg Roedel @ 2008-09-22 18:21 UTC (permalink / raw)
  To: linux-kernel, kvm, iommu
  Cc: David Woodhouse, Muli Ben-Yehuda, Amit Shah, Ingo Molnar,
	FUJITA Tomonori, Joerg Roedel

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
---
 arch/x86/kernel/amd_iommu.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/amd_iommu.c b/arch/x86/kernel/amd_iommu.c
index 6f7b974..7aa9824 100644
--- a/arch/x86/kernel/amd_iommu.c
+++ b/arch/x86/kernel/amd_iommu.c
@@ -1258,7 +1258,7 @@ free_mem:
  * This function is called by the DMA layer to find out if we can handle a
  * particular device. It is part of the dma_ops.
  */
-static int amd_iommu_dma_supported(struct device *dev, u64 mask)
+static int amd_iommu_device_supported(struct device *dev)
 {
 	u16 bdf;
 	struct pci_dev *pcidev;
@@ -1320,7 +1320,7 @@ static struct dma_mapping_ops amd_iommu_dma_ops = {
 	.unmap_single = unmap_single,
 	.map_sg = map_sg,
 	.unmap_sg = unmap_sg,
-	.dma_supported = amd_iommu_dma_supported,
+	.device_supported = amd_iommu_device_supported,
 };
 
 /*
@@ -1362,6 +1362,7 @@ int __init amd_iommu_init_dma_ops(void)
 #endif
 
 	/* Make the driver finally visible to the drivers */
+	x86_register_dma_ops(&amd_iommu_dma_ops, DMA_OPS_TYPE_HW);
 	dma_ops = &amd_iommu_dma_ops;
 
 	return 0;
-- 
1.5.6.4



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 8/9] x86/iommu: change Intel IOMMU to use dma_ops register interface
  2008-09-22 18:21 [PATCH 0/9][RFC] stackable dma_ops for x86 Joerg Roedel
                   ` (6 preceding siblings ...)
  2008-09-22 18:21 ` [PATCH 7/9] x86/iommu: change AMD IOMMU " Joerg Roedel
@ 2008-09-22 18:21 ` Joerg Roedel
  2008-09-22 18:21 ` [PATCH 9/9] x86/iommu: use dma_ops_list in get_dma_ops Joerg Roedel
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 44+ messages in thread
From: Joerg Roedel @ 2008-09-22 18:21 UTC (permalink / raw)
  To: linux-kernel, kvm, iommu
  Cc: David Woodhouse, Muli Ben-Yehuda, Amit Shah, Ingo Molnar,
	FUJITA Tomonori, Joerg Roedel

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
---
 drivers/pci/intel-iommu.c |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
index 6c4c1c3..283e65f 100644
--- a/drivers/pci/intel-iommu.c
+++ b/drivers/pci/intel-iommu.c
@@ -2255,6 +2255,12 @@ static int intel_map_sg(struct device *hwdev, struct scatterlist *sglist,
 	return nelems;
 }
 
+static int intel_device_supported(struct device *dev)
+{
+	/* FIXME: is this correct? */
+	return  dev && dev->bus == &pci_bus_type;
+}
+
 static struct dma_mapping_ops intel_dma_ops = {
 	.alloc_coherent = intel_alloc_coherent,
 	.free_coherent = intel_free_coherent,
@@ -2262,6 +2268,7 @@ static struct dma_mapping_ops intel_dma_ops = {
 	.unmap_single = intel_unmap_single,
 	.map_sg = intel_map_sg,
 	.unmap_sg = intel_unmap_sg,
+	.device_supported = intel_device_supported,
 };
 
 static inline int iommu_domain_cache_init(void)
@@ -2449,6 +2456,7 @@ int __init intel_iommu_init(void)
 
 	init_timer(&unmap_timer);
 	force_iommu = 1;
+	x86_register_dma_ops(&intel_dma_ops, DMA_OPS_TYPE_HW);
 	dma_ops = &intel_dma_ops;
 	return 0;
 }
-- 
1.5.6.4



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 9/9] x86/iommu: use dma_ops_list in get_dma_ops
  2008-09-22 18:21 [PATCH 0/9][RFC] stackable dma_ops for x86 Joerg Roedel
                   ` (7 preceding siblings ...)
  2008-09-22 18:21 ` [PATCH 8/9] x86/iommu: change Intel " Joerg Roedel
@ 2008-09-22 18:21 ` Joerg Roedel
  2008-09-26  7:56   ` Amit Shah
  2008-09-28 14:21   ` FUJITA Tomonori
  2008-09-22 18:36 ` [PATCH 0/9][RFC] stackable dma_ops for x86 Arjan van de Ven
  2008-09-28 14:21 ` FUJITA Tomonori
  10 siblings, 2 replies; 44+ messages in thread
From: Joerg Roedel @ 2008-09-22 18:21 UTC (permalink / raw)
  To: linux-kernel, kvm, iommu
  Cc: David Woodhouse, Muli Ben-Yehuda, Amit Shah, Ingo Molnar,
	FUJITA Tomonori, Joerg Roedel

This patch enables stackable dma_ops on x86. To do this, it also enables
the per-device dma_ops on i386.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
---
 arch/x86/kernel/pci-dma.c     |   26 ++++++++++++++++++++++++++
 include/asm-x86/device.h      |    6 +++---
 include/asm-x86/dma-mapping.h |   14 +++++++-------
 3 files changed, 36 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index b990fb6..2e517c2 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -82,6 +82,32 @@ void x86_register_dma_ops(struct dma_mapping_ops *ops,
 	write_unlock_irqrestore(&dma_ops_list_lock, flags);
 }
 
+struct dma_mapping_ops *find_dma_ops_for_device(struct device *dev)
+{
+	int i;
+	unsigned long flags;
+	struct dma_mapping_ops *entry, *ops = NULL;
+
+	read_lock_irqsave(&dma_ops_list_lock, flags);
+
+	for (i = 0; i < DMA_OPS_TYPE_MAX; ++i)
+		list_for_each_entry(entry, &dma_ops_list[i], list) {
+			if (!entry->device_supported)
+				continue;
+			if (entry->device_supported(dev)) {
+				ops = entry;
+				goto out;
+			}
+		}
+out:
+	read_unlock_irqrestore(&dma_ops_list_lock, flags);
+
+	BUG_ON(ops == NULL);
+
+	return ops;
+}
+EXPORT_SYMBOL(find_dma_ops_for_device);
+
 int dma_set_mask(struct device *dev, u64 mask)
 {
 	if (!dev->dma_mask || !dma_supported(dev, mask))
diff --git a/include/asm-x86/device.h b/include/asm-x86/device.h
index 3c034f4..9ed12e3 100644
--- a/include/asm-x86/device.h
+++ b/include/asm-x86/device.h
@@ -5,9 +5,9 @@ struct dev_archdata {
 #ifdef CONFIG_ACPI
 	void	*acpi_handle;
 #endif
-#ifdef CONFIG_X86_64
-struct dma_mapping_ops *dma_ops;
-#endif
+
+	struct dma_mapping_ops *dma_ops;
+
 #ifdef CONFIG_DMAR
 	void *iommu; /* hook for IOMMU specific extension */
 #endif
diff --git a/include/asm-x86/dma-mapping.h b/include/asm-x86/dma-mapping.h
index 128786b..83d6d7f 100644
--- a/include/asm-x86/dma-mapping.h
+++ b/include/asm-x86/dma-mapping.h
@@ -82,19 +82,19 @@ enum dma_ops_driver_type {
 
 extern void x86_register_dma_ops(struct dma_mapping_ops *ops,
 				 enum dma_ops_driver_type type);
+extern struct dma_mapping_ops *find_dma_ops_for_device(struct device *dev);
 
 extern struct dma_mapping_ops *dma_ops;
 
 static inline struct dma_mapping_ops *get_dma_ops(struct device *dev)
 {
-#ifdef CONFIG_X86_32
-	return dma_ops;
-#else
-	if (unlikely(!dev) || !dev->archdata.dma_ops)
+	if (unlikely(!dev))
 		return dma_ops;
-	else
-		return dev->archdata.dma_ops;
-#endif
+
+	if (unlikely(!dev->archdata.dma_ops))
+		dev->archdata.dma_ops = find_dma_ops_for_device(dev);
+
+	return dev->archdata.dma_ops;
 }
 
 /* Make sure we keep the same behaviour */
-- 
1.5.6.4



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/9][RFC] stackable dma_ops for x86
  2008-09-22 18:21 [PATCH 0/9][RFC] stackable dma_ops for x86 Joerg Roedel
                   ` (8 preceding siblings ...)
  2008-09-22 18:21 ` [PATCH 9/9] x86/iommu: use dma_ops_list in get_dma_ops Joerg Roedel
@ 2008-09-22 18:36 ` Arjan van de Ven
  2008-09-22 18:39   ` Joerg Roedel
  2008-09-28 14:21 ` FUJITA Tomonori
  10 siblings, 1 reply; 44+ messages in thread
From: Arjan van de Ven @ 2008-09-22 18:36 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: linux-kernel, kvm, iommu, David Woodhouse, Muli Ben-Yehuda,
	Amit Shah, Ingo Molnar, FUJITA Tomonori

On Mon, 22 Sep 2008 20:21:12 +0200
Joerg Roedel <joerg.roedel@amd.com> wrote:

> Hi,
> 
> this patch series implements stackable dma_ops on x86. This is useful
> to be able to fall back to a different dma_ops implementation if one
> can not handle a particular device (as necessary for example with
> paravirtualized device passthrough or if a hardware IOMMU only
> handles a subset of available devices).

isn't the right answer here to have a per device DMA ops instead ?


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/9][RFC] stackable dma_ops for x86
  2008-09-22 18:36 ` [PATCH 0/9][RFC] stackable dma_ops for x86 Arjan van de Ven
@ 2008-09-22 18:39   ` Joerg Roedel
  2008-09-23  2:41     ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 44+ messages in thread
From: Joerg Roedel @ 2008-09-22 18:39 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: linux-kernel, kvm, iommu, David Woodhouse, Muli Ben-Yehuda,
	Amit Shah, Ingo Molnar, FUJITA Tomonori

On Mon, Sep 22, 2008 at 11:36:19AM -0700, Arjan van de Ven wrote:
> On Mon, 22 Sep 2008 20:21:12 +0200
> Joerg Roedel <joerg.roedel@amd.com> wrote:
> 
> > Hi,
> > 
> > this patch series implements stackable dma_ops on x86. This is useful
> > to be able to fall back to a different dma_ops implementation if one
> > can not handle a particular device (as necessary for example with
> > paravirtualized device passthrough or if a hardware IOMMU only
> > handles a subset of available devices).
> 
> isn't the right answer here to have a per device DMA ops instead ?

Its implemented using the per-device dma-ops already there. With this
patches there is a list of available dma_ops implementations which are
asked in a particular order if they can handle the device. The first
implementation which returns true is assigned to the device as the
per-device dma_ops structure.

(Hmm, maybe the name stackable is misleading, is "dma_ops multiplexing"
 better?)

Joerg

-- 
           |           AMD Saxony Limited Liability Company & Co. KG
 Operating |         Wilschdorfer Landstr. 101, 01109 Dresden, Germany
 System    |                  Register Court Dresden: HRA 4896
 Research  |              General Partner authorized to represent:
 Center    |             AMD Saxony LLC (Wilmington, Delaware, US)
           | General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/9][RFC] stackable dma_ops for x86
  2008-09-22 18:39   ` Joerg Roedel
@ 2008-09-23  2:41     ` Jeremy Fitzhardinge
  2008-09-23  2:50       ` Arjan van de Ven
  0 siblings, 1 reply; 44+ messages in thread
From: Jeremy Fitzhardinge @ 2008-09-23  2:41 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Arjan van de Ven, linux-kernel, kvm, iommu, David Woodhouse,
	Muli Ben-Yehuda, Amit Shah, Ingo Molnar, FUJITA Tomonori

Joerg Roedel wrote:
> Its implemented using the per-device dma-ops already there. With this
> patches there is a list of available dma_ops implementations which are
> asked in a particular order if they can handle the device. The first
> implementation which returns true is assigned to the device as the
> per-device dma_ops structure.
>
> (Hmm, maybe the name stackable is misleading, is "dma_ops multiplexing"
>  better?)

Is per-device the right level?  Wouldn't per-bus make more sense?  How
does a dma_ops implementation "know" whether it can handle a particular
device?

(I haven't had a chance to read the patches yet.)

    J

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/9][RFC] stackable dma_ops for x86
  2008-09-23  2:41     ` Jeremy Fitzhardinge
@ 2008-09-23  2:50       ` Arjan van de Ven
  0 siblings, 0 replies; 44+ messages in thread
From: Arjan van de Ven @ 2008-09-23  2:50 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Joerg Roedel, linux-kernel, kvm, iommu, David Woodhouse,
	Muli Ben-Yehuda, Amit Shah, Ingo Molnar, FUJITA Tomonori

On Mon, 22 Sep 2008 19:41:28 -0700
Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Joerg Roedel wrote:
> > Its implemented using the per-device dma-ops already there. With
> > this patches there is a list of available dma_ops implementations
> > which are asked in a particular order if they can handle the
> > device. The first implementation which returns true is assigned to
> > the device as the per-device dma_ops structure.
> >
> > (Hmm, maybe the name stackable is misleading, is "dma_ops
> > multiplexing" better?)
> 
> Is per-device the right level?  Wouldn't per-bus make more sense? 

not really; all DMA functions get a device as argument already anyway;
just going to bus makes no sense there.

Even if you set it the same for the whole bus almost all of the time...
the APIs just work per device.

(and device assignment clearly is per device as well)


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 6/9] x86/iommu: change Calgary to use dma_ops register interface
  2008-09-22 18:21 ` [PATCH 6/9] x86/iommu: change Calgary " Joerg Roedel
@ 2008-09-23 14:37   ` Muli Ben-Yehuda
  2008-09-30 11:10   ` Ingo Molnar
  1 sibling, 0 replies; 44+ messages in thread
From: Muli Ben-Yehuda @ 2008-09-23 14:37 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: linux-kernel, kvm, iommu, David Woodhouse, Amit Shah,
	Ingo Molnar, FUJITA Tomonori

On Mon, Sep 22, 2008 at 08:21:18PM +0200, Joerg Roedel wrote:
> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
> ---
>  arch/x86/kernel/pci-calgary_64.c |   21 +++++++++++----------
>  1 files changed, 11 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/kernel/pci-calgary_64.c b/arch/x86/kernel/pci-calgary_64.c
> index fe7695e..e2f2f60 100644
> --- a/arch/x86/kernel/pci-calgary_64.c
> +++ b/arch/x86/kernel/pci-calgary_64.c
> @@ -525,6 +525,15 @@ static void calgary_free_coherent(struct device *dev, size_t size,
>  	free_pages((unsigned long)vaddr, get_order(size));
>  }
>  
> +static int calgary_device_supported(struct device *dev)
> +{
> +	struct iommu_table *tbl;
> +
> +	tbl = find_iommu_table(dev);
> +
> +	return translation_enabled(tbl);
> +}
> +

Whole patchset and this patch in particular looks very good. Will ack
after I get a chance to test it.

Cheers,
Muli
-- 
The First Workshop on I/O Virtualization (WIOV '08)
Dec 2008, San Diego, CA, http://www.usenix.org/wiov08/
                     and
SYSTOR 2009---The Israeli Experimental Systems Conference
http://www.haifa.il.ibm.com/conferences/systor2009/

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 9/9] x86/iommu: use dma_ops_list in get_dma_ops
  2008-09-22 18:21 ` [PATCH 9/9] x86/iommu: use dma_ops_list in get_dma_ops Joerg Roedel
@ 2008-09-26  7:56   ` Amit Shah
  2008-09-26  8:59     ` Joerg Roedel
  2008-09-26 11:00     ` Joerg Roedel
  2008-09-28 14:21   ` FUJITA Tomonori
  1 sibling, 2 replies; 44+ messages in thread
From: Amit Shah @ 2008-09-26  7:56 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: linux-kernel, kvm, iommu, David Woodhouse, Muli Ben-Yehuda,
	Ingo Molnar, FUJITA Tomonori

* On Monday 22 Sep 2008 23:51:21 Joerg Roedel wrote:
> This patch enables stackable dma_ops on x86. To do this, it also enables
> the per-device dma_ops on i386.
>
> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
> ---
>  arch/x86/kernel/pci-dma.c     |   26 ++++++++++++++++++++++++++
>  include/asm-x86/device.h      |    6 +++---
>  include/asm-x86/dma-mapping.h |   14 +++++++-------
>  3 files changed, 36 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
> index b990fb6..2e517c2 100644
> --- a/arch/x86/kernel/pci-dma.c
> +++ b/arch/x86/kernel/pci-dma.c
> @@ -82,6 +82,32 @@ void x86_register_dma_ops(struct dma_mapping_ops *ops,
>  	write_unlock_irqrestore(&dma_ops_list_lock, flags);
>  }
>
> +struct dma_mapping_ops *find_dma_ops_for_device(struct device *dev)
> +{
> +	int i;
> +	unsigned long flags;
> +	struct dma_mapping_ops *entry, *ops = NULL;
> +
> +	read_lock_irqsave(&dma_ops_list_lock, flags);
> +
> +	for (i = 0; i < DMA_OPS_TYPE_MAX; ++i)
> +		list_for_each_entry(entry, &dma_ops_list[i], list) {
> +			if (!entry->device_supported)
> +				continue;
> +			if (entry->device_supported(dev)) {
> +				ops = entry;
> +				goto out;
> +			}
> +		}
> +out:
> +	read_unlock_irqrestore(&dma_ops_list_lock, flags);

For PVDMA, we want the "native" dma_ops to succeed first, eg, nommu, and then 
do our "PV DMA", which is just translating gpa  to hpa and then program the 
hardware. This isn't being done here. This can be done by extending the 
return type:

DMA_DEV_NOT_SUPPORTED
DMA_DEV_HANDLED
DMA_DEV_PASS

Where NOT_SUPPORTED means we should look for the next one in the chain 
(current return value 0), DEV_HANDLED means the dma operation has been 
handled successfully (current return value 1) and DEV_PASS means fall back to 
the next layer and then return back.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 9/9] x86/iommu: use dma_ops_list in get_dma_ops
  2008-09-26  7:56   ` Amit Shah
@ 2008-09-26  8:59     ` Joerg Roedel
  2008-09-26 10:49       ` Amit Shah
  2008-09-26 11:00     ` Joerg Roedel
  1 sibling, 1 reply; 44+ messages in thread
From: Joerg Roedel @ 2008-09-26  8:59 UTC (permalink / raw)
  To: Amit Shah
  Cc: linux-kernel, kvm, iommu, David Woodhouse, Muli Ben-Yehuda,
	Ingo Molnar, FUJITA Tomonori

On Fri, Sep 26, 2008 at 01:26:19PM +0530, Amit Shah wrote:
> * On Monday 22 Sep 2008 23:51:21 Joerg Roedel wrote:
> > This patch enables stackable dma_ops on x86. To do this, it also enables
> > the per-device dma_ops on i386.
> >
> > Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
> > ---
> >  arch/x86/kernel/pci-dma.c     |   26 ++++++++++++++++++++++++++
> >  include/asm-x86/device.h      |    6 +++---
> >  include/asm-x86/dma-mapping.h |   14 +++++++-------
> >  3 files changed, 36 insertions(+), 10 deletions(-)
> >
> > diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
> > index b990fb6..2e517c2 100644
> > --- a/arch/x86/kernel/pci-dma.c
> > +++ b/arch/x86/kernel/pci-dma.c
> > @@ -82,6 +82,32 @@ void x86_register_dma_ops(struct dma_mapping_ops *ops,
> >  	write_unlock_irqrestore(&dma_ops_list_lock, flags);
> >  }
> >
> > +struct dma_mapping_ops *find_dma_ops_for_device(struct device *dev)
> > +{
> > +	int i;
> > +	unsigned long flags;
> > +	struct dma_mapping_ops *entry, *ops = NULL;
> > +
> > +	read_lock_irqsave(&dma_ops_list_lock, flags);
> > +
> > +	for (i = 0; i < DMA_OPS_TYPE_MAX; ++i)
> > +		list_for_each_entry(entry, &dma_ops_list[i], list) {
> > +			if (!entry->device_supported)
> > +				continue;
> > +			if (entry->device_supported(dev)) {
> > +				ops = entry;
> > +				goto out;
> > +			}
> > +		}
> > +out:
> > +	read_unlock_irqrestore(&dma_ops_list_lock, flags);
> 
> For PVDMA, we want the "native" dma_ops to succeed first, eg, nommu, and then 
> do our "PV DMA", which is just translating gpa  to hpa and then program the 
> hardware. This isn't being done here. This can be done by extending the 
> return type:
> 
> DMA_DEV_NOT_SUPPORTED
> DMA_DEV_HANDLED
> DMA_DEV_PASS
> 
> Where NOT_SUPPORTED means we should look for the next one in the chain 
> (current return value 0), DEV_HANDLED means the dma operation has been 
> handled successfully (current return value 1) and DEV_PASS means fall back to 
> the next layer and then return back.

I am not sure I fully understand what you mean? Why do we need to call
nommu handlers first for PVDMA devices?
I think that PVDMA devices must always be handled by a pv-dma_ops
implementation. So it makes more sense for me to assign the the dma_ops
of this implementation to the per-device dma_ops structure when we do
the first call to the dma api. So we pay this overhead of finding out
who is responsible only once and not at every call to the dma api.

Joerg

-- 
           |           AMD Saxony Limited Liability Company & Co. KG
 Operating |         Wilschdorfer Landstr. 101, 01109 Dresden, Germany
 System    |                  Register Court Dresden: HRA 4896
 Research  |              General Partner authorized to represent:
 Center    |             AMD Saxony LLC (Wilmington, Delaware, US)
           | General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 9/9] x86/iommu: use dma_ops_list in get_dma_ops
  2008-09-26  8:59     ` Joerg Roedel
@ 2008-09-26 10:49       ` Amit Shah
  2008-09-26 12:32         ` Joerg Roedel
  0 siblings, 1 reply; 44+ messages in thread
From: Amit Shah @ 2008-09-26 10:49 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: linux-kernel, kvm, iommu, David Woodhouse, Muli Ben-Yehuda,
	Ingo Molnar, FUJITA Tomonori

* On Friday 26 Sep 2008 14:29:24 Joerg Roedel wrote:
> On Fri, Sep 26, 2008 at 01:26:19PM +0530, Amit Shah wrote:
> > * On Monday 22 Sep 2008 23:51:21 Joerg Roedel wrote:
> > > This patch enables stackable dma_ops on x86. To do this, it also
> > > enables the per-device dma_ops on i386.
> > >
> > > Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
> > > ---
> > >  arch/x86/kernel/pci-dma.c     |   26 ++++++++++++++++++++++++++
> > >  include/asm-x86/device.h      |    6 +++---
> > >  include/asm-x86/dma-mapping.h |   14 +++++++-------
> > >  3 files changed, 36 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
> > > index b990fb6..2e517c2 100644
> > > --- a/arch/x86/kernel/pci-dma.c
> > > +++ b/arch/x86/kernel/pci-dma.c
> > > @@ -82,6 +82,32 @@ void x86_register_dma_ops(struct dma_mapping_ops
> > > *ops, write_unlock_irqrestore(&dma_ops_list_lock, flags);
> > >  }
> > >
> > > +struct dma_mapping_ops *find_dma_ops_for_device(struct device *dev)
> > > +{
> > > +	int i;
> > > +	unsigned long flags;
> > > +	struct dma_mapping_ops *entry, *ops = NULL;
> > > +
> > > +	read_lock_irqsave(&dma_ops_list_lock, flags);
> > > +
> > > +	for (i = 0; i < DMA_OPS_TYPE_MAX; ++i)
> > > +		list_for_each_entry(entry, &dma_ops_list[i], list) {
> > > +			if (!entry->device_supported)
> > > +				continue;
> > > +			if (entry->device_supported(dev)) {
> > > +				ops = entry;
> > > +				goto out;
> > > +			}
> > > +		}
> > > +out:
> > > +	read_unlock_irqrestore(&dma_ops_list_lock, flags);
> >
> > For PVDMA, we want the "native" dma_ops to succeed first, eg, nommu, and
> > then do our "PV DMA", which is just translating gpa  to hpa and then
> > program the hardware. This isn't being done here. This can be done by
> > extending the return type:
> >
> > DMA_DEV_NOT_SUPPORTED
> > DMA_DEV_HANDLED
> > DMA_DEV_PASS
> >
> > Where NOT_SUPPORTED means we should look for the next one in the chain
> > (current return value 0), DEV_HANDLED means the dma operation has been
> > handled successfully (current return value 1) and DEV_PASS means fall
> > back to the next layer and then return back.
>
> I am not sure I fully understand what you mean? Why do we need to call
> nommu handlers first for PVDMA devices?

For the usual dma_alloc_coherent, dma_map_single, etc. routines. They return 
the gpa to the driver. We want to intercept this gpa and convert it to the 
hpa before passing on the value to the driver. So our dma_alloc_coherent will 
assume the real underlying alloc_coherent has succeeded and then make a 
hypercall.

The PV dma_ops routines won't do the usual allocation, etc. that's already 
done elsewhere.

> I think that PVDMA devices must always be handled by a pv-dma_ops
> implementation. So it makes more sense for me to assign the the dma_ops
> of this implementation to the per-device dma_ops structure when we do
> the first call to the dma api. So we pay this overhead of finding out
> who is responsible only once and not at every call to the dma api.
>
> Joerg



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 9/9] x86/iommu: use dma_ops_list in get_dma_ops
  2008-09-26  7:56   ` Amit Shah
  2008-09-26  8:59     ` Joerg Roedel
@ 2008-09-26 11:00     ` Joerg Roedel
  1 sibling, 0 replies; 44+ messages in thread
From: Joerg Roedel @ 2008-09-26 11:00 UTC (permalink / raw)
  To: Amit Shah
  Cc: linux-kernel, kvm, iommu, David Woodhouse, Muli Ben-Yehuda,
	Ingo Molnar, FUJITA Tomonori

On Fri, Sep 26, 2008 at 01:26:19PM +0530, Amit Shah wrote:
> * On Monday 22 Sep 2008 23:51:21 Joerg Roedel wrote:
> > This patch enables stackable dma_ops on x86. To do this, it also enables
> > the per-device dma_ops on i386.
> >
> > Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
> > ---
> >  arch/x86/kernel/pci-dma.c     |   26 ++++++++++++++++++++++++++
> >  include/asm-x86/device.h      |    6 +++---
> >  include/asm-x86/dma-mapping.h |   14 +++++++-------
> >  3 files changed, 36 insertions(+), 10 deletions(-)
> >
> > diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
> > index b990fb6..2e517c2 100644
> > --- a/arch/x86/kernel/pci-dma.c
> > +++ b/arch/x86/kernel/pci-dma.c
> > @@ -82,6 +82,32 @@ void x86_register_dma_ops(struct dma_mapping_ops *ops,
> >  	write_unlock_irqrestore(&dma_ops_list_lock, flags);
> >  }
> >
> > +struct dma_mapping_ops *find_dma_ops_for_device(struct device *dev)
> > +{
> > +	int i;
> > +	unsigned long flags;
> > +	struct dma_mapping_ops *entry, *ops = NULL;
> > +
> > +	read_lock_irqsave(&dma_ops_list_lock, flags);
> > +
> > +	for (i = 0; i < DMA_OPS_TYPE_MAX; ++i)
> > +		list_for_each_entry(entry, &dma_ops_list[i], list) {
> > +			if (!entry->device_supported)
> > +				continue;
> > +			if (entry->device_supported(dev)) {
> > +				ops = entry;
> > +				goto out;
> > +			}
> > +		}
> > +out:
> > +	read_unlock_irqrestore(&dma_ops_list_lock, flags);
> 
> For PVDMA, we want the "native" dma_ops to succeed first, eg, nommu, and then 
> do our "PV DMA", which is just translating gpa  to hpa and then program the 

Btw, how will dma_masks be handled when PV DMA just translates gpa to
hpa?

Joerg

-- 
           |           AMD Saxony Limited Liability Company & Co. KG
 Operating |         Wilschdorfer Landstr. 101, 01109 Dresden, Germany
 System    |                  Register Court Dresden: HRA 4896
 Research  |              General Partner authorized to represent:
 Center    |             AMD Saxony LLC (Wilmington, Delaware, US)
           | General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 9/9] x86/iommu: use dma_ops_list in get_dma_ops
  2008-09-26 10:49       ` Amit Shah
@ 2008-09-26 12:32         ` Joerg Roedel
  2008-09-27  0:13           ` Muli Ben-Yehuda
  0 siblings, 1 reply; 44+ messages in thread
From: Joerg Roedel @ 2008-09-26 12:32 UTC (permalink / raw)
  To: Amit Shah
  Cc: linux-kernel, kvm, iommu, David Woodhouse, Muli Ben-Yehuda,
	Ingo Molnar, FUJITA Tomonori

On Fri, Sep 26, 2008 at 04:19:51PM +0530, Amit Shah wrote:
> * On Friday 26 Sep 2008 14:29:24 Joerg Roedel wrote:
> > On Fri, Sep 26, 2008 at 01:26:19PM +0530, Amit Shah wrote:
> > > * On Monday 22 Sep 2008 23:51:21 Joerg Roedel wrote:
> > > > This patch enables stackable dma_ops on x86. To do this, it also
> > > > enables the per-device dma_ops on i386.
> > > >
> > > > Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
> > > > ---
> > > >  arch/x86/kernel/pci-dma.c     |   26 ++++++++++++++++++++++++++
> > > >  include/asm-x86/device.h      |    6 +++---
> > > >  include/asm-x86/dma-mapping.h |   14 +++++++-------
> > > >  3 files changed, 36 insertions(+), 10 deletions(-)
> > > >
> > > > diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
> > > > index b990fb6..2e517c2 100644
> > > > --- a/arch/x86/kernel/pci-dma.c
> > > > +++ b/arch/x86/kernel/pci-dma.c
> > > > @@ -82,6 +82,32 @@ void x86_register_dma_ops(struct dma_mapping_ops
> > > > *ops, write_unlock_irqrestore(&dma_ops_list_lock, flags);
> > > >  }
> > > >
> > > > +struct dma_mapping_ops *find_dma_ops_for_device(struct device *dev)
> > > > +{
> > > > +	int i;
> > > > +	unsigned long flags;
> > > > +	struct dma_mapping_ops *entry, *ops = NULL;
> > > > +
> > > > +	read_lock_irqsave(&dma_ops_list_lock, flags);
> > > > +
> > > > +	for (i = 0; i < DMA_OPS_TYPE_MAX; ++i)
> > > > +		list_for_each_entry(entry, &dma_ops_list[i], list) {
> > > > +			if (!entry->device_supported)
> > > > +				continue;
> > > > +			if (entry->device_supported(dev)) {
> > > > +				ops = entry;
> > > > +				goto out;
> > > > +			}
> > > > +		}
> > > > +out:
> > > > +	read_unlock_irqrestore(&dma_ops_list_lock, flags);
> > >
> > > For PVDMA, we want the "native" dma_ops to succeed first, eg, nommu, and
> > > then do our "PV DMA", which is just translating gpa  to hpa and then
> > > program the hardware. This isn't being done here. This can be done by
> > > extending the return type:
> > >
> > > DMA_DEV_NOT_SUPPORTED
> > > DMA_DEV_HANDLED
> > > DMA_DEV_PASS
> > >
> > > Where NOT_SUPPORTED means we should look for the next one in the chain
> > > (current return value 0), DEV_HANDLED means the dma operation has been
> > > handled successfully (current return value 1) and DEV_PASS means fall
> > > back to the next layer and then return back.
> >
> > I am not sure I fully understand what you mean? Why do we need to call
> > nommu handlers first for PVDMA devices?
> 
> For the usual dma_alloc_coherent, dma_map_single, etc. routines. They return 
> the gpa to the driver. We want to intercept this gpa and convert it to the 
> hpa before passing on the value to the driver. So our dma_alloc_coherent will 
> assume the real underlying alloc_coherent has succeeded and then make a 
> hypercall.
> 
> The PV dma_ops routines won't do the usual allocation, etc. that's already 
> done elsewhere.

Ok, the allocation only matters for dma_alloc_coherent. Fujita
introduced a generic software-based dma_alloc_coherent recently which
you can use for that. I think implementing PVDMA into an own dma_ops
backend and multiplex it using my patches introduces less overhead than
an additional layer over the current dma_ops implementation.

Another two questions to your approach: What happens if a
dma_alloc_coherent allocation crosses page boundarys and the gpa's are
not contiguous in host memory? How will dma masks be handled?

Joerg

-- 
           |           AMD Saxony Limited Liability Company & Co. KG
 Operating |         Wilschdorfer Landstr. 101, 01109 Dresden, Germany
 System    |                  Register Court Dresden: HRA 4896
 Research  |              General Partner authorized to represent:
 Center    |             AMD Saxony LLC (Wilmington, Delaware, US)
           | General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 9/9] x86/iommu: use dma_ops_list in get_dma_ops
  2008-09-26 12:32         ` Joerg Roedel
@ 2008-09-27  0:13           ` Muli Ben-Yehuda
  2008-09-28 19:13             ` Joerg Roedel
  0 siblings, 1 reply; 44+ messages in thread
From: Muli Ben-Yehuda @ 2008-09-27  0:13 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Amit Shah, linux-kernel, kvm, iommu, David Woodhouse,
	Ingo Molnar, FUJITA Tomonori

On Fri, Sep 26, 2008 at 02:32:43PM +0200, Joerg Roedel wrote:

> Ok, the allocation only matters for dma_alloc_coherent. Fujita
> introduced a generic software-based dma_alloc_coherent recently
> which you can use for that. I think implementing PVDMA into an own
> dma_ops backend and multiplex it using my patches introduces less
> overhead than an additional layer over the current dma_ops
> implementation.

I'm not sure what you have in mind, but I agree with Amit that
conceptually pvdma should be called after the guest's "native" dma_ops
have done their thing. This is not just for nommu, consider a guest
that is using an (emulated) hardware IOMMU, or that wants to use
swiotlb. We can't replicate their functionality in the pv_dma_ops
layer, we have to let them run first and then pass deal with whatever
we get back.

> Another two questions to your approach: What happens if a
> dma_alloc_coherent allocation crosses page boundarys and the gpa's
> are not contiguous in host memory? How will dma masks be handled?

That's a very good question. The host will need to be aware of a
device's DMA capabilities in order to return I/O addresses (which
could be hpa's if you don't have an IOMMU) that satisfy them. That's
quite a pain.

Cheers,
Muli
-- 
The First Workshop on I/O Virtualization (WIOV '08)
Dec 2008, San Diego, CA, http://www.usenix.org/wiov08/
                      xxx
SYSTOR 2009---The Israeli Experimental Systems Conference
http://www.haifa.il.ibm.com/conferences/systor2009/

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 9/9] x86/iommu: use dma_ops_list in get_dma_ops
  2008-09-22 18:21 ` [PATCH 9/9] x86/iommu: use dma_ops_list in get_dma_ops Joerg Roedel
  2008-09-26  7:56   ` Amit Shah
@ 2008-09-28 14:21   ` FUJITA Tomonori
  2008-09-28 18:44     ` Joerg Roedel
  1 sibling, 1 reply; 44+ messages in thread
From: FUJITA Tomonori @ 2008-09-28 14:21 UTC (permalink / raw)
  To: joerg.roedel
  Cc: linux-kernel, kvm, iommu, dwmw2, muli, amit.shah, mingo, fujita.tomonori

On Mon, 22 Sep 2008 20:21:21 +0200
Joerg Roedel <joerg.roedel@amd.com> wrote:

> This patch enables stackable dma_ops on x86. To do this, it also enables
> the per-device dma_ops on i386.
> 
> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
> ---
>  arch/x86/kernel/pci-dma.c     |   26 ++++++++++++++++++++++++++
>  include/asm-x86/device.h      |    6 +++---
>  include/asm-x86/dma-mapping.h |   14 +++++++-------
>  3 files changed, 36 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
> index b990fb6..2e517c2 100644
> --- a/arch/x86/kernel/pci-dma.c
> +++ b/arch/x86/kernel/pci-dma.c
> @@ -82,6 +82,32 @@ void x86_register_dma_ops(struct dma_mapping_ops *ops,
>  	write_unlock_irqrestore(&dma_ops_list_lock, flags);
>  }
>  
> +struct dma_mapping_ops *find_dma_ops_for_device(struct device *dev)
> +{
> +	int i;
> +	unsigned long flags;
> +	struct dma_mapping_ops *entry, *ops = NULL;
> +
> +	read_lock_irqsave(&dma_ops_list_lock, flags);
> +
> +	for (i = 0; i < DMA_OPS_TYPE_MAX; ++i)
> +		list_for_each_entry(entry, &dma_ops_list[i], list) {
> +			if (!entry->device_supported)
> +				continue;
> +			if (entry->device_supported(dev)) {
> +				ops = entry;
> +				goto out;
> +			}
> +		}
> +out:
> +	read_unlock_irqrestore(&dma_ops_list_lock, flags);

Hmm, every time we call dma_sg/map_single, we call
read_lock_irqsave(&dma_ops_list_lock, flags). It's likely that we see
notable performance drop?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/9][RFC] stackable dma_ops for x86
  2008-09-22 18:21 [PATCH 0/9][RFC] stackable dma_ops for x86 Joerg Roedel
                   ` (9 preceding siblings ...)
  2008-09-22 18:36 ` [PATCH 0/9][RFC] stackable dma_ops for x86 Arjan van de Ven
@ 2008-09-28 14:21 ` FUJITA Tomonori
  2008-09-28 18:49   ` Joerg Roedel
  10 siblings, 1 reply; 44+ messages in thread
From: FUJITA Tomonori @ 2008-09-28 14:21 UTC (permalink / raw)
  To: joerg.roedel
  Cc: linux-kernel, kvm, iommu, dwmw2, muli, amit.shah, mingo, fujita.tomonori

On Mon, 22 Sep 2008 20:21:12 +0200
Joerg Roedel <joerg.roedel@amd.com> wrote:

> Hi,
> 
> this patch series implements stackable dma_ops on x86. This is useful to
> be able to fall back to a different dma_ops implementation if one can
> not handle a particular device (as necessary for example with
> paravirtualized device passthrough or if a hardware IOMMU only handles a
> subset of available devices).

We already handle the latter. This patchset is more flexible but
seems to incur more overheads.

This feature will be used for only paravirtualized device passthrough?
If so, I feel that there is more simpler (and specific) solutions for
it.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 9/9] x86/iommu: use dma_ops_list in get_dma_ops
  2008-09-28 14:21   ` FUJITA Tomonori
@ 2008-09-28 18:44     ` Joerg Roedel
  2008-09-29  9:25       ` Muli Ben-Yehuda
  0 siblings, 1 reply; 44+ messages in thread
From: Joerg Roedel @ 2008-09-28 18:44 UTC (permalink / raw)
  To: FUJITA Tomonori
  Cc: joerg.roedel, linux-kernel, kvm, iommu, dwmw2, muli, amit.shah, mingo

On Sun, Sep 28, 2008 at 11:21:23PM +0900, FUJITA Tomonori wrote:
> > +struct dma_mapping_ops *find_dma_ops_for_device(struct device *dev)
> > +{
> > +	int i;
> > +	unsigned long flags;
> > +	struct dma_mapping_ops *entry, *ops = NULL;
> > +
> > +	read_lock_irqsave(&dma_ops_list_lock, flags);
> > +
> > +	for (i = 0; i < DMA_OPS_TYPE_MAX; ++i)
> > +		list_for_each_entry(entry, &dma_ops_list[i], list) {
> > +			if (!entry->device_supported)
> > +				continue;
> > +			if (entry->device_supported(dev)) {
> > +				ops = entry;
> > +				goto out;
> > +			}
> > +		}
> > +out:
> > +	read_unlock_irqrestore(&dma_ops_list_lock, flags);
> 
> Hmm, every time we call dma_sg/map_single, we call
> read_lock_irqsave(&dma_ops_list_lock, flags). It's likely that we see
> notable performance drop?

Hmm, we should only call find_dma_ops_for_device() the first time a
dma api call is done (look into get_dma_ops). But I also thought about
how this lock can be avoided. In the real world it should not be
necessary because the dma_ops list is initialized before dma api calls
are done. But since there is now a register function which can be called
its safer this way. What do you think, are we still safe enough without
this lock?

Joerg


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/9][RFC] stackable dma_ops for x86
  2008-09-28 14:21 ` FUJITA Tomonori
@ 2008-09-28 18:49   ` Joerg Roedel
  2008-09-29 13:16     ` FUJITA Tomonori
  0 siblings, 1 reply; 44+ messages in thread
From: Joerg Roedel @ 2008-09-28 18:49 UTC (permalink / raw)
  To: FUJITA Tomonori
  Cc: joerg.roedel, linux-kernel, kvm, iommu, dwmw2, muli, amit.shah, mingo

On Sun, Sep 28, 2008 at 11:21:26PM +0900, FUJITA Tomonori wrote:
> On Mon, 22 Sep 2008 20:21:12 +0200
> Joerg Roedel <joerg.roedel@amd.com> wrote:
> 
> > Hi,
> > 
> > this patch series implements stackable dma_ops on x86. This is useful to
> > be able to fall back to a different dma_ops implementation if one can
> > not handle a particular device (as necessary for example with
> > paravirtualized device passthrough or if a hardware IOMMU only handles a
> > subset of available devices).
> 
> We already handle the latter. This patchset is more flexible but
> seems to incur more overheads.
> 
> This feature will be used for only paravirtualized device passthrough?
> If so, I feel that there is more simpler (and specific) solutions for
> it.

Its not only for device passthrough. It handles also the cases where a
hardware IOMMU does not handle all devices in the system (like in some
Calgary systems but also possible with AMD IOMMU). With this patchset we
can handle these cases in a generic way without hacking it into the
hardware drivers (these hacks are also in the AMD IOMMU code and I plan
to remove them in the case this patchset will be accepted).

Joerg


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 9/9] x86/iommu: use dma_ops_list in get_dma_ops
  2008-09-27  0:13           ` Muli Ben-Yehuda
@ 2008-09-28 19:13             ` Joerg Roedel
  2008-09-29  9:30               ` Muli Ben-Yehuda
  0 siblings, 1 reply; 44+ messages in thread
From: Joerg Roedel @ 2008-09-28 19:13 UTC (permalink / raw)
  To: Muli Ben-Yehuda
  Cc: Joerg Roedel, Amit Shah, linux-kernel, kvm, iommu,
	David Woodhouse, Ingo Molnar, FUJITA Tomonori

On Sat, Sep 27, 2008 at 03:13:21AM +0300, Muli Ben-Yehuda wrote:
> On Fri, Sep 26, 2008 at 02:32:43PM +0200, Joerg Roedel wrote:
> 
> > Ok, the allocation only matters for dma_alloc_coherent. Fujita
> > introduced a generic software-based dma_alloc_coherent recently
> > which you can use for that. I think implementing PVDMA into an own
> > dma_ops backend and multiplex it using my patches introduces less
> > overhead than an additional layer over the current dma_ops
> > implementation.
> 
> I'm not sure what you have in mind, but I agree with Amit that
> conceptually pvdma should be called after the guest's "native" dma_ops
> have done their thing. This is not just for nommu, consider a guest
> that is using an (emulated) hardware IOMMU, or that wants to use
> swiotlb. We can't replicate their functionality in the pv_dma_ops
> layer, we have to let them run first and then pass deal with whatever
> we get back.

I have something in mind what I discussed with Amit at the last KVM
forum. The idea was not ready at the event but meanwhile it has matured
a bit.
I think we should try to build a paravirtualized IOMMU for KVM guests.
It should work this way: We reserve a configurable amount of contiguous
guest physical memory and map it dma contiguous using some kind of
hardware IOMMU. This is possible with all hardare IOMMUs we have in the
field by now, also Calgary and GART. The guest does dma_coherent
allocations from this memory directly and is done. For map_single and map_sg
the guest can do bounce buffering. We avoid nearly all pvdma hypercalls
with this approach, keep guest swapping working and solve also the
problems with device dma_masks and guest memory that is not contigous on
the host side.
For systems without any kind of hardware IOMMU we can extend the
interface to support bounce buffering between host and guest (in this
case we can not avoid the hypercalls). This means that the host
reserves the memory for the DMA transaction (also recognizing the
dma_mask) and copies it from/to the guest directly upon the dma_*_sync
calls.
This is what I have in mind and want to propose. Maybe we can discuss
these ideas here. I think since there are many systems out there with
some kind of hardware IOMMUs (every 64bit AMD processor has a GART) we
should really consider this approach.

> > Another two questions to your approach: What happens if a
> > dma_alloc_coherent allocation crosses page boundarys and the gpa's
> > are not contiguous in host memory? How will dma masks be handled?
> 
> That's a very good question. The host will need to be aware of a
> device's DMA capabilities in order to return I/O addresses (which
> could be hpa's if you don't have an IOMMU) that satisfy them. That's
> quite a pain.

True. And I fear we don't get a simple and clean interface with this
approach.

Joerg


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 9/9] x86/iommu: use dma_ops_list in get_dma_ops
  2008-09-28 18:44     ` Joerg Roedel
@ 2008-09-29  9:25       ` Muli Ben-Yehuda
  2008-09-29  9:29         ` Joerg Roedel
  0 siblings, 1 reply; 44+ messages in thread
From: Muli Ben-Yehuda @ 2008-09-29  9:25 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: FUJITA Tomonori, joerg.roedel, linux-kernel, kvm, iommu, dwmw2,
	amit.shah, mingo

On Sun, Sep 28, 2008 at 08:44:24PM +0200, Joerg Roedel wrote:

> Hmm, we should only call find_dma_ops_for_device() the first time a
> dma api call is done (look into get_dma_ops). But I also thought
> about how this lock can be avoided. In the real world it should not
> be necessary because the dma_ops list is initialized before dma api
> calls are done. But since there is now a register function which can
> be called its safer this way. What do you think, are we still safe
> enough without this lock?

We could be, if we add a check to the register function that verifies
it isn't being called after DMAs have started. Something like:

in register:

  if (dma_started)
    yell loudly

before PCI device initialization and after IOMMU initialization:
  dma_started = true

Cheers,
Muli
-- 
The First Workshop on I/O Virtualization (WIOV '08)
Dec 2008, San Diego, CA, http://www.usenix.org/wiov08/
                      xxx
SYSTOR 2009---The Israeli Experimental Systems Conference
http://www.haifa.il.ibm.com/conferences/systor2009/

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 9/9] x86/iommu: use dma_ops_list in get_dma_ops
  2008-09-29  9:25       ` Muli Ben-Yehuda
@ 2008-09-29  9:29         ` Joerg Roedel
  0 siblings, 0 replies; 44+ messages in thread
From: Joerg Roedel @ 2008-09-29  9:29 UTC (permalink / raw)
  To: Muli Ben-Yehuda
  Cc: FUJITA Tomonori, joerg.roedel, linux-kernel, kvm, iommu, dwmw2,
	amit.shah, mingo

On Mon, Sep 29, 2008 at 12:25:49PM +0300, Muli Ben-Yehuda wrote:
> On Sun, Sep 28, 2008 at 08:44:24PM +0200, Joerg Roedel wrote:
> 
> > Hmm, we should only call find_dma_ops_for_device() the first time a
> > dma api call is done (look into get_dma_ops). But I also thought
> > about how this lock can be avoided. In the real world it should not
> > be necessary because the dma_ops list is initialized before dma api
> > calls are done. But since there is now a register function which can
> > be called its safer this way. What do you think, are we still safe
> > enough without this lock?
> 
> We could be, if we add a check to the register function that verifies
> it isn't being called after DMAs have started. Something like:
> 
> in register:
> 
>   if (dma_started)
>     yell loudly
> 
> before PCI device initialization and after IOMMU initialization:
>   dma_started = true

Good idea. I will change this in my patchset, thanks.

Joerg


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 9/9] x86/iommu: use dma_ops_list in get_dma_ops
  2008-09-28 19:13             ` Joerg Roedel
@ 2008-09-29  9:30               ` Muli Ben-Yehuda
  2008-09-29  9:36                 ` Joerg Roedel
  0 siblings, 1 reply; 44+ messages in thread
From: Muli Ben-Yehuda @ 2008-09-29  9:30 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Joerg Roedel, Amit Shah, linux-kernel, kvm, iommu,
	David Woodhouse, Ingo Molnar, FUJITA Tomonori

On Sun, Sep 28, 2008 at 09:13:33PM +0200, Joerg Roedel wrote:

> I think we should try to build a paravirtualized IOMMU for KVM
> guests.  It should work this way: We reserve a configurable amount
> of contiguous guest physical memory and map it dma contiguous using
> some kind of hardware IOMMU. This is possible with all hardare
> IOMMUs we have in the field by now, also Calgary and GART. The guest
> does dma_coherent allocations from this memory directly and is done.
> For map_single and map_sg 
> the guest can do bounce buffering. We avoid nearly all pvdma hypercalls
> with this approach, keep guest swapping working and solve also the
> problems with device dma_masks and guest memory that is not contigous on
> the host side.

I'm not sure I follow, but if I understand correctly with this
approach the guest could only DMA into buffers that fall within the
range you allocated for DMA and mapped. Isn't that a pretty nasty
limitation?  The guest would need to bounce-bufer every frame that
happened to not fall inside that range, with the resulting loss of
performance.

Cheers,
Muli
-- 
The First Workshop on I/O Virtualization (WIOV '08)
Dec 2008, San Diego, CA, http://www.usenix.org/wiov08/
                      xxx
SYSTOR 2009---The Israeli Experimental Systems Conference
http://www.haifa.il.ibm.com/conferences/systor2009/

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 9/9] x86/iommu: use dma_ops_list in get_dma_ops
  2008-09-29  9:30               ` Muli Ben-Yehuda
@ 2008-09-29  9:36                 ` Joerg Roedel
  2008-09-29 13:16                   ` FUJITA Tomonori
  0 siblings, 1 reply; 44+ messages in thread
From: Joerg Roedel @ 2008-09-29  9:36 UTC (permalink / raw)
  To: Muli Ben-Yehuda
  Cc: Joerg Roedel, Amit Shah, linux-kernel, kvm, iommu,
	David Woodhouse, Ingo Molnar, FUJITA Tomonori

On Mon, Sep 29, 2008 at 12:30:44PM +0300, Muli Ben-Yehuda wrote:
> On Sun, Sep 28, 2008 at 09:13:33PM +0200, Joerg Roedel wrote:
> 
> > I think we should try to build a paravirtualized IOMMU for KVM
> > guests.  It should work this way: We reserve a configurable amount
> > of contiguous guest physical memory and map it dma contiguous using
> > some kind of hardware IOMMU. This is possible with all hardare
> > IOMMUs we have in the field by now, also Calgary and GART. The guest
> > does dma_coherent allocations from this memory directly and is done.
> > For map_single and map_sg 
> > the guest can do bounce buffering. We avoid nearly all pvdma hypercalls
> > with this approach, keep guest swapping working and solve also the
> > problems with device dma_masks and guest memory that is not contigous on
> > the host side.
> 
> I'm not sure I follow, but if I understand correctly with this
> approach the guest could only DMA into buffers that fall within the
> range you allocated for DMA and mapped. Isn't that a pretty nasty
> limitation?  The guest would need to bounce-bufer every frame that
> happened to not fall inside that range, with the resulting loss of
> performance.

The bounce buffering is needed for map_single/map_sg allocations. For
dma_alloc_coherent we can directly allocate from that range. The
performance loss of the bounce buffering may be lower than the
hypercalls we need as the alternative (we need hypercalls for map, unmap
and sync).

Joerg


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/9][RFC] stackable dma_ops for x86
  2008-09-28 18:49   ` Joerg Roedel
@ 2008-09-29 13:16     ` FUJITA Tomonori
  2008-09-29 13:26       ` Joerg Roedel
  0 siblings, 1 reply; 44+ messages in thread
From: FUJITA Tomonori @ 2008-09-29 13:16 UTC (permalink / raw)
  To: joro
  Cc: fujita.tomonori, joerg.roedel, linux-kernel, kvm, iommu, dwmw2,
	muli, amit.shah, mingo

On Sun, 28 Sep 2008 20:49:26 +0200
Joerg Roedel <joro@8bytes.org> wrote:

> On Sun, Sep 28, 2008 at 11:21:26PM +0900, FUJITA Tomonori wrote:
> > On Mon, 22 Sep 2008 20:21:12 +0200
> > Joerg Roedel <joerg.roedel@amd.com> wrote:
> > 
> > > Hi,
> > > 
> > > this patch series implements stackable dma_ops on x86. This is useful to
> > > be able to fall back to a different dma_ops implementation if one can
> > > not handle a particular device (as necessary for example with
> > > paravirtualized device passthrough or if a hardware IOMMU only handles a
> > > subset of available devices).
> > 
> > We already handle the latter. This patchset is more flexible but
> > seems to incur more overheads.
> > 
> > This feature will be used for only paravirtualized device passthrough?
> > If so, I feel that there is more simpler (and specific) solutions for
> > it.
> 
> Its not only for device passthrough. It handles also the cases where a
> hardware IOMMU does not handle all devices in the system (like in some
> Calgary systems but also possible with AMD IOMMU). With this patchset we

I know that. As I wrote in the previous mail, we already solved that
problem with per-device-dma-ops.

My question is what unsolved problems this patchset can fix?


This patchset is named "stackable dma_ops" but it's different from
what we discussed as "stackable dma_ops". This patchset provides
IOMMUs a generic mechanism to set up "stackable dma_ops". But this
patchset doesn't solve the problem that a hardware IOMMU does not
handle all devices (it was already solved with per-device-dma-ops).

If paravirtualized device passthrough still needs to call multiple
dma_ops, then this patchset doesn't solve that issue.


> can handle these cases in a generic way without hacking it into the
> hardware drivers (these hacks are also in the AMD IOMMU code and I plan
> to remove them in the case this patchset will be accepted).

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 9/9] x86/iommu: use dma_ops_list in get_dma_ops
  2008-09-29  9:36                 ` Joerg Roedel
@ 2008-09-29 13:16                   ` FUJITA Tomonori
  2008-09-29 13:33                     ` Joerg Roedel
  0 siblings, 1 reply; 44+ messages in thread
From: FUJITA Tomonori @ 2008-09-29 13:16 UTC (permalink / raw)
  To: joro
  Cc: muli, joerg.roedel, amit.shah, linux-kernel, kvm, iommu, dwmw2,
	mingo, fujita.tomonori

On Mon, 29 Sep 2008 11:36:52 +0200
Joerg Roedel <joro@8bytes.org> wrote:

> On Mon, Sep 29, 2008 at 12:30:44PM +0300, Muli Ben-Yehuda wrote:
> > On Sun, Sep 28, 2008 at 09:13:33PM +0200, Joerg Roedel wrote:
> > 
> > > I think we should try to build a paravirtualized IOMMU for KVM
> > > guests.  It should work this way: We reserve a configurable amount
> > > of contiguous guest physical memory and map it dma contiguous using
> > > some kind of hardware IOMMU. This is possible with all hardare
> > > IOMMUs we have in the field by now, also Calgary and GART. The guest
> > > does dma_coherent allocations from this memory directly and is done.
> > > For map_single and map_sg 
> > > the guest can do bounce buffering. We avoid nearly all pvdma hypercalls
> > > with this approach, keep guest swapping working and solve also the
> > > problems with device dma_masks and guest memory that is not contigous on
> > > the host side.
> > 
> > I'm not sure I follow, but if I understand correctly with this
> > approach the guest could only DMA into buffers that fall within the
> > range you allocated for DMA and mapped. Isn't that a pretty nasty
> > limitation?  The guest would need to bounce-bufer every frame that
> > happened to not fall inside that range, with the resulting loss of
> > performance.
> 
> The bounce buffering is needed for map_single/map_sg allocations. For
> dma_alloc_coherent we can directly allocate from that range. The
> performance loss of the bounce buffering may be lower than the
> hypercalls we need as the alternative (we need hypercalls for map, unmap
> and sync).

Nobody cares about the performance of dma_alloc_coherent. Only the
performance of map_single/map_sg matters.

I'm not sure how expensive the hypercalls are, but they are more
expensive than bounce buffering coping lots of data for every I/Os?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/9][RFC] stackable dma_ops for x86
  2008-09-29 13:16     ` FUJITA Tomonori
@ 2008-09-29 13:26       ` Joerg Roedel
  2008-09-29 13:42         ` FUJITA Tomonori
  0 siblings, 1 reply; 44+ messages in thread
From: Joerg Roedel @ 2008-09-29 13:26 UTC (permalink / raw)
  To: FUJITA Tomonori
  Cc: joro, linux-kernel, kvm, iommu, dwmw2, muli, amit.shah, mingo

On Mon, Sep 29, 2008 at 10:16:39PM +0900, FUJITA Tomonori wrote:
> On Sun, 28 Sep 2008 20:49:26 +0200
> Joerg Roedel <joro@8bytes.org> wrote:
> 
> > On Sun, Sep 28, 2008 at 11:21:26PM +0900, FUJITA Tomonori wrote:
> > > On Mon, 22 Sep 2008 20:21:12 +0200
> > > Joerg Roedel <joerg.roedel@amd.com> wrote:
> > > 
> > > > Hi,
> > > > 
> > > > this patch series implements stackable dma_ops on x86. This is useful to
> > > > be able to fall back to a different dma_ops implementation if one can
> > > > not handle a particular device (as necessary for example with
> > > > paravirtualized device passthrough or if a hardware IOMMU only handles a
> > > > subset of available devices).
> > > 
> > > We already handle the latter. This patchset is more flexible but
> > > seems to incur more overheads.
> > > 
> > > This feature will be used for only paravirtualized device passthrough?
> > > If so, I feel that there is more simpler (and specific) solutions for
> > > it.
> > 
> > Its not only for device passthrough. It handles also the cases where a
> > hardware IOMMU does not handle all devices in the system (like in some
> > Calgary systems but also possible with AMD IOMMU). With this patchset we
> 
> I know that. As I wrote in the previous mail, we already solved that
> problem with per-device-dma-ops.
> 
> My question is what unsolved problems this patchset can fix?
> 
> 
> This patchset is named "stackable dma_ops" but it's different from
> what we discussed as "stackable dma_ops". This patchset provides
> IOMMUs a generic mechanism to set up "stackable dma_ops". But this
> patchset doesn't solve the problem that a hardware IOMMU does not
> handle all devices (it was already solved with per-device-dma-ops).
> 
> If paravirtualized device passthrough still needs to call multiple
> dma_ops, then this patchset doesn't solve that issue.

Ok, the name "stackable" is misleading and was a bad choice. I will
rename it to "multiplexing". This should make it more clear what it is.
Like you pointed out, the problems are solved with per-device dma_ops,
but in the current implementation it needs special hacks in the IOMMU
drivers to use these per-device dma_ops.
I see this patchset as a continuation of the per-device dma_ops idea. It
moves the per-device handling out of the specific drivers to a common
place. So we can avoid or remove special hacks in the IOMMU drivers.

Joerg

-- 
           |           AMD Saxony Limited Liability Company & Co. KG
 Operating |         Wilschdorfer Landstr. 101, 01109 Dresden, Germany
 System    |                  Register Court Dresden: HRA 4896
 Research  |              General Partner authorized to represent:
 Center    |             AMD Saxony LLC (Wilmington, Delaware, US)
           | General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 9/9] x86/iommu: use dma_ops_list in get_dma_ops
  2008-09-29 13:16                   ` FUJITA Tomonori
@ 2008-09-29 13:33                     ` Joerg Roedel
  2008-09-30 19:44                       ` Muli Ben-Yehuda
  0 siblings, 1 reply; 44+ messages in thread
From: Joerg Roedel @ 2008-09-29 13:33 UTC (permalink / raw)
  To: FUJITA Tomonori
  Cc: joro, muli, amit.shah, linux-kernel, kvm, iommu, dwmw2, mingo

On Mon, Sep 29, 2008 at 10:16:44PM +0900, FUJITA Tomonori wrote:
> On Mon, 29 Sep 2008 11:36:52 +0200
> Joerg Roedel <joro@8bytes.org> wrote:
> 
> > On Mon, Sep 29, 2008 at 12:30:44PM +0300, Muli Ben-Yehuda wrote:
> > > On Sun, Sep 28, 2008 at 09:13:33PM +0200, Joerg Roedel wrote:
> > > 
> > > > I think we should try to build a paravirtualized IOMMU for KVM
> > > > guests.  It should work this way: We reserve a configurable amount
> > > > of contiguous guest physical memory and map it dma contiguous using
> > > > some kind of hardware IOMMU. This is possible with all hardare
> > > > IOMMUs we have in the field by now, also Calgary and GART. The guest
> > > > does dma_coherent allocations from this memory directly and is done.
> > > > For map_single and map_sg 
> > > > the guest can do bounce buffering. We avoid nearly all pvdma hypercalls
> > > > with this approach, keep guest swapping working and solve also the
> > > > problems with device dma_masks and guest memory that is not contigous on
> > > > the host side.
> > > 
> > > I'm not sure I follow, but if I understand correctly with this
> > > approach the guest could only DMA into buffers that fall within the
> > > range you allocated for DMA and mapped. Isn't that a pretty nasty
> > > limitation?  The guest would need to bounce-bufer every frame that
> > > happened to not fall inside that range, with the resulting loss of
> > > performance.
> > 
> > The bounce buffering is needed for map_single/map_sg allocations. For
> > dma_alloc_coherent we can directly allocate from that range. The
> > performance loss of the bounce buffering may be lower than the
> > hypercalls we need as the alternative (we need hypercalls for map, unmap
> > and sync).
> 
> Nobody cares about the performance of dma_alloc_coherent. Only the
> performance of map_single/map_sg matters.
>
> I'm not sure how expensive the hypercalls are, but they are more
> expensive than bounce buffering coping lots of data for every I/Os?

I don't think that we can avoid bounce buffering into the guests at all
(with and without my idea of a paravirtualized IOMMU) when we want to
handle dma_masks and requests that cross guest physical pages properly.

With mapping/unmapping through hypercalls we add the world-switch
overhead to the copy-overhead. We can't avoid this when we have no
hardware support at all. But already with older IOMMUs like Calgary and
GART we can at least avoid the world-switch. And since, for example,
every 64 bit capable AMD processor has a GART we can make use of it.

Joerg

-- 
           |           AMD Saxony Limited Liability Company & Co. KG
 Operating |         Wilschdorfer Landstr. 101, 01109 Dresden, Germany
 System    |                  Register Court Dresden: HRA 4896
 Research  |              General Partner authorized to represent:
 Center    |             AMD Saxony LLC (Wilmington, Delaware, US)
           | General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/9][RFC] stackable dma_ops for x86
  2008-09-29 13:26       ` Joerg Roedel
@ 2008-09-29 13:42         ` FUJITA Tomonori
  2008-09-29 13:51           ` Joerg Roedel
  0 siblings, 1 reply; 44+ messages in thread
From: FUJITA Tomonori @ 2008-09-29 13:42 UTC (permalink / raw)
  To: joerg.roedel
  Cc: fujita.tomonori, joro, linux-kernel, kvm, iommu, dwmw2, muli,
	amit.shah, mingo

On Mon, 29 Sep 2008 15:26:47 +0200
Joerg Roedel <joerg.roedel@amd.com> wrote:

> On Mon, Sep 29, 2008 at 10:16:39PM +0900, FUJITA Tomonori wrote:
> > On Sun, 28 Sep 2008 20:49:26 +0200
> > Joerg Roedel <joro@8bytes.org> wrote:
> > 
> > > On Sun, Sep 28, 2008 at 11:21:26PM +0900, FUJITA Tomonori wrote:
> > > > On Mon, 22 Sep 2008 20:21:12 +0200
> > > > Joerg Roedel <joerg.roedel@amd.com> wrote:
> > > > 
> > > > > Hi,
> > > > > 
> > > > > this patch series implements stackable dma_ops on x86. This is useful to
> > > > > be able to fall back to a different dma_ops implementation if one can
> > > > > not handle a particular device (as necessary for example with
> > > > > paravirtualized device passthrough or if a hardware IOMMU only handles a
> > > > > subset of available devices).
> > > > 
> > > > We already handle the latter. This patchset is more flexible but
> > > > seems to incur more overheads.
> > > > 
> > > > This feature will be used for only paravirtualized device passthrough?
> > > > If so, I feel that there is more simpler (and specific) solutions for
> > > > it.
> > > 
> > > Its not only for device passthrough. It handles also the cases where a
> > > hardware IOMMU does not handle all devices in the system (like in some
> > > Calgary systems but also possible with AMD IOMMU). With this patchset we
> > 
> > I know that. As I wrote in the previous mail, we already solved that
> > problem with per-device-dma-ops.
> > 
> > My question is what unsolved problems this patchset can fix?
> > 
> > 
> > This patchset is named "stackable dma_ops" but it's different from
> > what we discussed as "stackable dma_ops". This patchset provides
> > IOMMUs a generic mechanism to set up "stackable dma_ops". But this
> > patchset doesn't solve the problem that a hardware IOMMU does not
> > handle all devices (it was already solved with per-device-dma-ops).
> > 
> > If paravirtualized device passthrough still needs to call multiple
> > dma_ops, then this patchset doesn't solve that issue.
> 
> Ok, the name "stackable" is misleading and was a bad choice. I will
> rename it to "multiplexing". This should make it more clear what it is.
> Like you pointed out, the problems are solved with per-device dma_ops,
> but in the current implementation it needs special hacks in the IOMMU
> drivers to use these per-device dma_ops.
> I see this patchset as a continuation of the per-device dma_ops idea. It
> moves the per-device handling out of the specific drivers to a common
> place. So we can avoid or remove special hacks in the IOMMU drivers.

Basically, I'm not against this patchset. It simplify Calgary and AMD
IOMMUs code to set up per-device-dma-ops (though it makes dma_ops a
bit complicated).

But it doesn't solve any problems including the paravirtualized device
passthrough. When I wrote per-device-dma-ops, I expected that KVM
people want more changes (such as stackable dma_ops) to dma_ops for
the paravirtualized device passthrough. I'd like to hear what they
want first.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/9][RFC] stackable dma_ops for x86
  2008-09-29 13:42         ` FUJITA Tomonori
@ 2008-09-29 13:51           ` Joerg Roedel
  0 siblings, 0 replies; 44+ messages in thread
From: Joerg Roedel @ 2008-09-29 13:51 UTC (permalink / raw)
  To: FUJITA Tomonori
  Cc: joro, linux-kernel, kvm, iommu, dwmw2, muli, amit.shah, mingo

On Mon, Sep 29, 2008 at 10:42:37PM +0900, FUJITA Tomonori wrote:
> On Mon, 29 Sep 2008 15:26:47 +0200
> Joerg Roedel <joerg.roedel@amd.com> wrote:
> 
> > On Mon, Sep 29, 2008 at 10:16:39PM +0900, FUJITA Tomonori wrote:
> > > On Sun, 28 Sep 2008 20:49:26 +0200
> > > Joerg Roedel <joro@8bytes.org> wrote:
> > > 
> > > > On Sun, Sep 28, 2008 at 11:21:26PM +0900, FUJITA Tomonori wrote:
> > > > > On Mon, 22 Sep 2008 20:21:12 +0200
> > > > > Joerg Roedel <joerg.roedel@amd.com> wrote:
> > > > > 
> > > > > > Hi,
> > > > > > 
> > > > > > this patch series implements stackable dma_ops on x86. This is useful to
> > > > > > be able to fall back to a different dma_ops implementation if one can
> > > > > > not handle a particular device (as necessary for example with
> > > > > > paravirtualized device passthrough or if a hardware IOMMU only handles a
> > > > > > subset of available devices).
> > > > > 
> > > > > We already handle the latter. This patchset is more flexible but
> > > > > seems to incur more overheads.
> > > > > 
> > > > > This feature will be used for only paravirtualized device passthrough?
> > > > > If so, I feel that there is more simpler (and specific) solutions for
> > > > > it.
> > > > 
> > > > Its not only for device passthrough. It handles also the cases where a
> > > > hardware IOMMU does not handle all devices in the system (like in some
> > > > Calgary systems but also possible with AMD IOMMU). With this patchset we
> > > 
> > > I know that. As I wrote in the previous mail, we already solved that
> > > problem with per-device-dma-ops.
> > > 
> > > My question is what unsolved problems this patchset can fix?
> > > 
> > > 
> > > This patchset is named "stackable dma_ops" but it's different from
> > > what we discussed as "stackable dma_ops". This patchset provides
> > > IOMMUs a generic mechanism to set up "stackable dma_ops". But this
> > > patchset doesn't solve the problem that a hardware IOMMU does not
> > > handle all devices (it was already solved with per-device-dma-ops).
> > > 
> > > If paravirtualized device passthrough still needs to call multiple
> > > dma_ops, then this patchset doesn't solve that issue.
> > 
> > Ok, the name "stackable" is misleading and was a bad choice. I will
> > rename it to "multiplexing". This should make it more clear what it is.
> > Like you pointed out, the problems are solved with per-device dma_ops,
> > but in the current implementation it needs special hacks in the IOMMU
> > drivers to use these per-device dma_ops.
> > I see this patchset as a continuation of the per-device dma_ops idea. It
> > moves the per-device handling out of the specific drivers to a common
> > place. So we can avoid or remove special hacks in the IOMMU drivers.
> 
> Basically, I'm not against this patchset. It simplify Calgary and AMD
> IOMMUs code to set up per-device-dma-ops (though it makes dma_ops a
> bit complicated).

Yes. But mind that this patchset adds complexity to one point (at
dma_ops initialization) while we can avoid and remove it
at many other places (in the dma_ops drivers).

> But it doesn't solve any problems including the paravirtualized device
> passthrough. When I wrote per-device-dma-ops, I expected that KVM
> people want more changes (such as stackable dma_ops) to dma_ops for
> the paravirtualized device passthrough. I'd like to hear what they
> want first.

Sure. Therefore this patchset is RFC and I cc'ed them.

Joerg

-- 
           |           AMD Saxony Limited Liability Company & Co. KG
 Operating |         Wilschdorfer Landstr. 101, 01109 Dresden, Germany
 System    |                  Register Court Dresden: HRA 4896
 Research  |              General Partner authorized to represent:
 Center    |             AMD Saxony LLC (Wilmington, Delaware, US)
           | General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 6/9] x86/iommu: change Calgary to use dma_ops register interface
  2008-09-22 18:21 ` [PATCH 6/9] x86/iommu: change Calgary " Joerg Roedel
  2008-09-23 14:37   ` Muli Ben-Yehuda
@ 2008-09-30 11:10   ` Ingo Molnar
  2008-09-30 11:58     ` Joerg Roedel
  2008-09-30 13:30     ` Muli Ben-Yehuda
  1 sibling, 2 replies; 44+ messages in thread
From: Ingo Molnar @ 2008-09-30 11:10 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: linux-kernel, kvm, iommu, David Woodhouse, Muli Ben-Yehuda,
	Amit Shah, Ingo Molnar, FUJITA Tomonori


* Joerg Roedel <joerg.roedel@amd.com> wrote:

> +static int calgary_device_supported(struct device *dev)
> +{
> +	struct iommu_table *tbl;
> +
> +	tbl = find_iommu_table(dev);
> +
> +	return translation_enabled(tbl);
> +}

i guess this could be written as a simple:

> +static int calgary_device_supported(struct device *dev)
> +{
> +	return translation_enabled(find_iommu_table(dev));
> +}

?

	Ingo

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 6/9] x86/iommu: change Calgary to use dma_ops register interface
  2008-09-30 11:10   ` Ingo Molnar
@ 2008-09-30 11:58     ` Joerg Roedel
  2008-09-30 13:30     ` Muli Ben-Yehuda
  1 sibling, 0 replies; 44+ messages in thread
From: Joerg Roedel @ 2008-09-30 11:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, kvm, iommu, David Woodhouse, Muli Ben-Yehuda,
	Amit Shah, Ingo Molnar, FUJITA Tomonori

On Tue, Sep 30, 2008 at 01:10:28PM +0200, Ingo Molnar wrote:
> 
> * Joerg Roedel <joerg.roedel@amd.com> wrote:
> 
> > +static int calgary_device_supported(struct device *dev)
> > +{
> > +	struct iommu_table *tbl;
> > +
> > +	tbl = find_iommu_table(dev);
> > +
> > +	return translation_enabled(tbl);
> > +}
> 
> i guess this could be written as a simple:
> 
> > +static int calgary_device_supported(struct device *dev)
> > +{
> > +	return translation_enabled(find_iommu_table(dev));
> > +}
> 
> ?

Yes. I will change it.

Joerg

-- 
           |           AMD Saxony Limited Liability Company & Co. KG
 Operating |         Wilschdorfer Landstr. 101, 01109 Dresden, Germany
 System    |                  Register Court Dresden: HRA 4896
 Research  |              General Partner authorized to represent:
 Center    |             AMD Saxony LLC (Wilmington, Delaware, US)
           | General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 6/9] x86/iommu: change Calgary to use dma_ops register interface
  2008-09-30 11:10   ` Ingo Molnar
  2008-09-30 11:58     ` Joerg Roedel
@ 2008-09-30 13:30     ` Muli Ben-Yehuda
  2008-09-30 15:38       ` Ingo Molnar
  1 sibling, 1 reply; 44+ messages in thread
From: Muli Ben-Yehuda @ 2008-09-30 13:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Joerg Roedel, linux-kernel, kvm, iommu, David Woodhouse,
	Amit Shah, Ingo Molnar, FUJITA Tomonori

On Tue, Sep 30, 2008 at 01:10:28PM +0200, Ingo Molnar wrote:
> 
> * Joerg Roedel <joerg.roedel@amd.com> wrote:
> 
> > +static int calgary_device_supported(struct device *dev)
> > +{
> > +	struct iommu_table *tbl;
> > +
> > +	tbl = find_iommu_table(dev);
> > +
> > +	return translation_enabled(tbl);
> > +}
> 
> i guess this could be written as a simple:
> 
> > +static int calgary_device_supported(struct device *dev)
> > +{
> > +	return translation_enabled(find_iommu_table(dev));
> > +}

Sure, but I prefer the explicit form since it lends itself to easier
debugging (oops line numbers, adding printks, etc.).

Cheers,
Muli
-- 
The First Workshop on I/O Virtualization (WIOV '08)
Dec 2008, San Diego, CA, http://www.usenix.org/wiov08/
                      xxx
SYSTOR 2009---The Israeli Experimental Systems Conference
http://www.haifa.il.ibm.com/conferences/systor2009/

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 6/9] x86/iommu: change Calgary to use dma_ops register interface
  2008-09-30 13:30     ` Muli Ben-Yehuda
@ 2008-09-30 15:38       ` Ingo Molnar
  2008-09-30 18:33         ` Muli Ben-Yehuda
  0 siblings, 1 reply; 44+ messages in thread
From: Ingo Molnar @ 2008-09-30 15:38 UTC (permalink / raw)
  To: Muli Ben-Yehuda
  Cc: Joerg Roedel, linux-kernel, kvm, iommu, David Woodhouse,
	Amit Shah, Ingo Molnar, FUJITA Tomonori


* Muli Ben-Yehuda <muli@il.ibm.com> wrote:

> > > +static int calgary_device_supported(struct device *dev)
> > > +{
> > > +	return translation_enabled(find_iommu_table(dev));
> > > +}
> 
> Sure, but I prefer the explicit form since it lends itself to easier
> debugging (oops line numbers, adding printks, etc.).

we never do that for simple stuff like this. The kernel would be twice 
as large if we did. An oops is easy enough to decode and an oops does 
not come with a line number.

	Ingo

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 6/9] x86/iommu: change Calgary to use dma_ops register interface
  2008-09-30 15:38       ` Ingo Molnar
@ 2008-09-30 18:33         ` Muli Ben-Yehuda
  0 siblings, 0 replies; 44+ messages in thread
From: Muli Ben-Yehuda @ 2008-09-30 18:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Joerg Roedel, linux-kernel, kvm, iommu, David Woodhouse,
	Amit Shah, Ingo Molnar, FUJITA Tomonori

On Tue, Sep 30, 2008 at 05:38:15PM +0200, Ingo Molnar wrote:
> 
> * Muli Ben-Yehuda <muli@il.ibm.com> wrote:
> 
> > > > +static int calgary_device_supported(struct device *dev)
> > > > +{
> > > > +	return translation_enabled(find_iommu_table(dev));
> > > > +}
> > 
> > Sure, but I prefer the explicit form since it lends itself to easier
> > debugging (oops line numbers, adding printks, etc.).
> 
> we never do that for simple stuff like this. The kernel would be
> twice as large if we did.

I assume you are talking about the source size, because the generated
code size should be exactly the same.

In any case, arguing where to place the semi-colon is a waste of
time. Do as you see fit.

Cheers,
Muli
-- 
The First Workshop on I/O Virtualization (WIOV '08)
Dec 2008, San Diego, CA, http://www.usenix.org/wiov08/
                      xxx
SYSTOR 2009---The Israeli Experimental Systems Conference
http://www.haifa.il.ibm.com/conferences/systor2009/

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 9/9] x86/iommu: use dma_ops_list in get_dma_ops
  2008-09-29 13:33                     ` Joerg Roedel
@ 2008-09-30 19:44                       ` Muli Ben-Yehuda
  2008-10-01  7:19                         ` Joerg Roedel
  0 siblings, 1 reply; 44+ messages in thread
From: Muli Ben-Yehuda @ 2008-09-30 19:44 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: FUJITA Tomonori, joro, amit.shah, linux-kernel, kvm, iommu, dwmw2, mingo

On Mon, Sep 29, 2008 at 03:33:11PM +0200, Joerg Roedel wrote:

> > Nobody cares about the performance of dma_alloc_coherent. Only the
> > performance of map_single/map_sg matters.
> >
> > I'm not sure how expensive the hypercalls are, but they are more
> > expensive than bounce buffering coping lots of data for every
> > I/Os?
> 
> I don't think that we can avoid bounce buffering into the guests at
> all (with and without my idea of a paravirtualized IOMMU) when we
> want to handle dma_masks and requests that cross guest physical
> pages properly.

It might be possible to have a per-device slow or fast path, where the
fast path is for devices which have no DMA limitations (high-end
devices generally don't) and the slow path is for devices which do.

> With mapping/unmapping through hypercalls we add the world-switch
> overhead to the copy-overhead. We can't avoid this when we have no
> hardware support at all. But already with older IOMMUs like Calgary
> and GART we can at least avoid the world-switch. And since, for
> example, every 64 bit capable AMD processor has a GART we can make
> use of it.

It should be possible to reduce the number and overhead of hypercalls
to the point where their cost is immaterial. I think that's
fundamentally a better approach.

Cheers,
Muli
-- 
The First Workshop on I/O Virtualization (WIOV '08)
Dec 2008, San Diego, CA, http://www.usenix.org/wiov08/
                      xxx
SYSTOR 2009---The Israeli Experimental Systems Conference
http://www.haifa.il.ibm.com/conferences/systor2009/

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 9/9] x86/iommu: use dma_ops_list in get_dma_ops
  2008-09-30 19:44                       ` Muli Ben-Yehuda
@ 2008-10-01  7:19                         ` Joerg Roedel
  2008-10-03  8:38                           ` Muli Ben-Yehuda
  0 siblings, 1 reply; 44+ messages in thread
From: Joerg Roedel @ 2008-10-01  7:19 UTC (permalink / raw)
  To: Muli Ben-Yehuda
  Cc: FUJITA Tomonori, joro, amit.shah, linux-kernel, kvm, iommu, dwmw2, mingo

On Tue, Sep 30, 2008 at 10:44:01PM +0300, Muli Ben-Yehuda wrote:
> On Mon, Sep 29, 2008 at 03:33:11PM +0200, Joerg Roedel wrote:
> 
> > > Nobody cares about the performance of dma_alloc_coherent. Only the
> > > performance of map_single/map_sg matters.
> > >
> > > I'm not sure how expensive the hypercalls are, but they are more
> > > expensive than bounce buffering coping lots of data for every
> > > I/Os?
> > 
> > I don't think that we can avoid bounce buffering into the guests at
> > all (with and without my idea of a paravirtualized IOMMU) when we
> > want to handle dma_masks and requests that cross guest physical
> > pages properly.
> 
> It might be possible to have a per-device slow or fast path, where the
> fast path is for devices which have no DMA limitations (high-end
> devices generally don't) and the slow path is for devices which do.

This solves the problem with the DMA masks. But what happens to requests
that cross guest page boundarys?

> > With mapping/unmapping through hypercalls we add the world-switch
> > overhead to the copy-overhead. We can't avoid this when we have no
> > hardware support at all. But already with older IOMMUs like Calgary
> > and GART we can at least avoid the world-switch. And since, for
> > example, every 64 bit capable AMD processor has a GART we can make
> > use of it.
> 
> It should be possible to reduce the number and overhead of hypercalls
> to the point where their cost is immaterial. I think that's
> fundamentally a better approach.

Ok, we can queue map_sg allocations together an queue them into one
hypercall. But I remember a paper from you where you wrote that most
allocations are mapping only one area. Are there other ways to optimize
this? I must say that reducing the number of hypercalls was important
while thinking about my idea. If there are better ways I am all ears to
hear from them.

Joerg

-- 
           |           AMD Saxony Limited Liability Company & Co. KG
 Operating |         Wilschdorfer Landstr. 101, 01109 Dresden, Germany
 System    |                  Register Court Dresden: HRA 4896
 Research  |              General Partner authorized to represent:
 Center    |             AMD Saxony LLC (Wilmington, Delaware, US)
           | General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 9/9] x86/iommu: use dma_ops_list in get_dma_ops
  2008-10-01  7:19                         ` Joerg Roedel
@ 2008-10-03  8:38                           ` Muli Ben-Yehuda
  0 siblings, 0 replies; 44+ messages in thread
From: Muli Ben-Yehuda @ 2008-10-03  8:38 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: FUJITA Tomonori, joro, amit.shah, linux-kernel, kvm, iommu,
	dwmw2, mingo, Ben-Ami Yassour1

On Wed, Oct 01, 2008 at 09:19:56AM +0200, Joerg Roedel wrote:

> > It might be possible to have a per-device slow or fast path, where
> > the fast path is for devices which have no DMA limitations
> > (high-end devices generally don't) and the slow path is for
> > devices which do.
> 
> This solves the problem with the DMA masks. But what happens to
> requests that cross guest page boundarys?

I'm not sure I follow. If a buffer is contiguous in the guest space,
it will remain contiguous (i.e., be mapped contiguously) in the IOMMU
I/O address space, even if each I/O PTE ends up mapping a different
physical frame.

> > > With mapping/unmapping through hypercalls we add the
> > > world-switch overhead to the copy-overhead. We can't avoid this
> > > when we have no hardware support at all. But already with older
> > > IOMMUs like Calgary and GART we can at least avoid the
> > > world-switch. And since, for example, every 64 bit capable AMD
> > > processor has a GART we can make use of it.
> > 
> > It should be possible to reduce the number and overhead of
> > hypercalls to the point where their cost is immaterial. I think
> > that's fundamentally a better approach.
> 
> Ok, we can queue map_sg allocations together an queue them into one
> hypercall. But I remember a paper from you where you wrote that most
> allocations are mapping only one area.

I'm afraid that bit of the paper was poorly done (mea culpa). As far
as I can recall, the majority of dma_alloc_coherent + scatter-gather
list *element* mappings only map a single frame, but we didn't look at
the time at the average length of a scatter gather list and the
frequency of sg list mappings vs. single page mappings. If the length
and frequency are high enough, and you map entire sg lists in a single
hcall or a single batch of hcalls, it might have a nice boost.

> Are there other ways to optimize this? I must say that reducing the
> number of hypercalls was important while thinking about my idea. If
> there are better ways I am all ears to hear from them.

There were a number of ideas mentioned in our paper (for example,
switching drivers from the streaming DMA API to the persistent DMA
API, which will be a big help to the scheme you propose), and Willman,
Rixner and Cox also had some input to the problem[1]. Unfortunately no
implementations exist yet AFAIK.

[1] "Protection Strategies for Direct Access to Virtualized I/O
Devices", by Paul Willmann, Scott Rixner and Alan L. Cox, USENIX '08.

Cheers,
Muli
-- 
The First Workshop on I/O Virtualization (WIOV '08)
Dec 2008, San Diego, CA, http://www.usenix.org/wiov08/
                       <->
SYSTOR 2009---The Israeli Experimental Systems Conference
http://www.haifa.il.ibm.com/conferences/systor2009/

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2008-10-03  8:40 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-09-22 18:21 [PATCH 0/9][RFC] stackable dma_ops for x86 Joerg Roedel
2008-09-22 18:21 ` [PATCH 1/9] x86/iommu: add necessary types for stackable dma_ops Joerg Roedel
2008-09-22 18:21 ` [PATCH 2/9] x86/iommu: add stackable dma_ops registration interface Joerg Roedel
2008-09-22 18:21 ` [PATCH 3/9] x86/iommu: change PCI-NOMMU to use dma_ops register interface Joerg Roedel
2008-09-22 18:21 ` [PATCH 4/9] x86/iommu: change SWIOTLB " Joerg Roedel
2008-09-22 18:21 ` [PATCH 5/9] x86/iommu: change GART " Joerg Roedel
2008-09-22 18:21 ` [PATCH 6/9] x86/iommu: change Calgary " Joerg Roedel
2008-09-23 14:37   ` Muli Ben-Yehuda
2008-09-30 11:10   ` Ingo Molnar
2008-09-30 11:58     ` Joerg Roedel
2008-09-30 13:30     ` Muli Ben-Yehuda
2008-09-30 15:38       ` Ingo Molnar
2008-09-30 18:33         ` Muli Ben-Yehuda
2008-09-22 18:21 ` [PATCH 7/9] x86/iommu: change AMD IOMMU " Joerg Roedel
2008-09-22 18:21 ` [PATCH 8/9] x86/iommu: change Intel " Joerg Roedel
2008-09-22 18:21 ` [PATCH 9/9] x86/iommu: use dma_ops_list in get_dma_ops Joerg Roedel
2008-09-26  7:56   ` Amit Shah
2008-09-26  8:59     ` Joerg Roedel
2008-09-26 10:49       ` Amit Shah
2008-09-26 12:32         ` Joerg Roedel
2008-09-27  0:13           ` Muli Ben-Yehuda
2008-09-28 19:13             ` Joerg Roedel
2008-09-29  9:30               ` Muli Ben-Yehuda
2008-09-29  9:36                 ` Joerg Roedel
2008-09-29 13:16                   ` FUJITA Tomonori
2008-09-29 13:33                     ` Joerg Roedel
2008-09-30 19:44                       ` Muli Ben-Yehuda
2008-10-01  7:19                         ` Joerg Roedel
2008-10-03  8:38                           ` Muli Ben-Yehuda
2008-09-26 11:00     ` Joerg Roedel
2008-09-28 14:21   ` FUJITA Tomonori
2008-09-28 18:44     ` Joerg Roedel
2008-09-29  9:25       ` Muli Ben-Yehuda
2008-09-29  9:29         ` Joerg Roedel
2008-09-22 18:36 ` [PATCH 0/9][RFC] stackable dma_ops for x86 Arjan van de Ven
2008-09-22 18:39   ` Joerg Roedel
2008-09-23  2:41     ` Jeremy Fitzhardinge
2008-09-23  2:50       ` Arjan van de Ven
2008-09-28 14:21 ` FUJITA Tomonori
2008-09-28 18:49   ` Joerg Roedel
2008-09-29 13:16     ` FUJITA Tomonori
2008-09-29 13:26       ` Joerg Roedel
2008-09-29 13:42         ` FUJITA Tomonori
2008-09-29 13:51           ` Joerg Roedel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).