linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking
@ 2015-12-13 21:28 Alexander Duyck
  2015-12-13 21:28 ` [RFC PATCH 1/3] swiotlb: Fold static unmap and sync calls into calling functions Alexander Duyck
                   ` (4 more replies)
  0 siblings, 5 replies; 35+ messages in thread
From: Alexander Duyck @ 2015-12-13 21:28 UTC (permalink / raw)
  To: kvm, linux-pci, x86, linux-kernel, alexander.duyck, qemu-devel
  Cc: tianyu.lan, yang.zhang.wz, mst, konrad.wilk, dgilbert, agraf,
	alex.williamson

This patch set is meant to be the guest side code for a proof of concept
involving leaving pass-through devices in the guest during the warm-up
phase of guest live migration.  In order to accomplish this I have added a
new function called dma_mark_dirty that will mark the pages associated with
the DMA transaction as dirty in the case of either an unmap or a
sync_.*_for_cpu where the DMA direction is either DMA_FROM_DEVICE or
DMA_BIDIRECTIONAL.  The pass-through device must still be removed before
the stop-and-copy phase; however, allowing the device to be present should
significantly improve the performance of the guest during the warm-up
period.

The current implementation is very preliminary and there are a number of
items still missing.  Specifically, in order to make this a more complete
solution we need to support:
1.  Notifying the hypervisor that drivers are dirtying received DMA pages.
2.  Bypassing page dirtying when it is not needed.

The two mechanisms referenced above would likely require coordination with
QEMU and as such are open to discussion.  I haven't attempted to address
them as I am not sure there is a consensus as of yet.  My personal
preference would be to add a vendor-specific configuration block to the
emulated pci-bridge interfaces created by QEMU that would allow us to
essentially extend shpc to support guest live migration with pass-through
devices.

The functionality in this patch set is currently disabled by default.  To
enable it you can select "SWIOTLB page dirtying" from the "Processor type
and features" menu.

---

Alexander Duyck (3):
      swiotlb: Fold static unmap and sync calls into calling functions
      xen/swiotlb: Fold static unmap and sync calls into calling functions
      x86: Create dma_mark_dirty to dirty pages used for DMA by VM guest


 arch/arm/include/asm/dma-mapping.h       |    3 +
 arch/arm64/include/asm/dma-mapping.h     |    5 +-
 arch/ia64/include/asm/dma.h              |    1 
 arch/mips/include/asm/dma-mapping.h      |    1 
 arch/powerpc/include/asm/swiotlb.h       |    1 
 arch/tile/include/asm/dma-mapping.h      |    1 
 arch/unicore32/include/asm/dma-mapping.h |    1 
 arch/x86/Kconfig                         |   11 ++++
 arch/x86/include/asm/swiotlb.h           |   26 ++++++++
 drivers/xen/swiotlb-xen.c                |   92 +++++++++++++-----------------
 lib/swiotlb.c                            |   83 ++++++++++++---------------
 11 files changed, 123 insertions(+), 102 deletions(-)

--

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH 1/3] swiotlb: Fold static unmap and sync calls into calling functions
  2015-12-13 21:28 [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking Alexander Duyck
@ 2015-12-13 21:28 ` Alexander Duyck
  2015-12-13 21:28 ` [RFC PATCH 2/3] xen/swiotlb: " Alexander Duyck
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 35+ messages in thread
From: Alexander Duyck @ 2015-12-13 21:28 UTC (permalink / raw)
  To: kvm, linux-pci, x86, linux-kernel, alexander.duyck, qemu-devel
  Cc: tianyu.lan, yang.zhang.wz, mst, konrad.wilk, dgilbert, agraf,
	alex.williamson

This change essentially does two things.  The first is to fold the
swiotlb_unmap and swiotlb_sync helpers into their callers.  The goal here
is twofold: it reduces execution time and improves performance, since we
no longer call through as many functions, and it lets us split up the sync
functionality, as the dma_mark_clean portion of the sync call is only
needed for dma_sync_*_for_cpu; we do not actually want to mark the page as
clean when syncing for the device.

The second change is to move dma_mark_clean inside the if statement instead
of using an early return in the sync and unmap paths.  By doing this we
make room to also add a dma_mark_dirty call at the same point later.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
---
 lib/swiotlb.c |   81 +++++++++++++++++++++++----------------------------------
 1 file changed, 33 insertions(+), 48 deletions(-)

diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index 76f29ecba8f4..384ac06217b2 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -781,8 +781,9 @@ EXPORT_SYMBOL_GPL(swiotlb_map_page);
  * After this call, reads by the cpu to the buffer are guaranteed to see
  * whatever the device wrote there.
  */
-static void unmap_single(struct device *hwdev, dma_addr_t dev_addr,
-			 size_t size, enum dma_data_direction dir)
+void swiotlb_unmap_page(struct device *hwdev, dma_addr_t dev_addr,
+			size_t size, enum dma_data_direction dir,
+			struct dma_attrs *attrs)
 {
 	phys_addr_t paddr = dma_to_phys(hwdev, dev_addr);
 
@@ -793,23 +794,14 @@ static void unmap_single(struct device *hwdev, dma_addr_t dev_addr,
 		return;
 	}
 
-	if (dir != DMA_FROM_DEVICE)
-		return;
-
 	/*
 	 * phys_to_virt doesn't work with hihgmem page but we could
 	 * call dma_mark_clean() with hihgmem page here. However, we
 	 * are fine since dma_mark_clean() is null on POWERPC. We can
 	 * make dma_mark_clean() take a physical address if necessary.
 	 */
-	dma_mark_clean(phys_to_virt(paddr), size);
-}
-
-void swiotlb_unmap_page(struct device *hwdev, dma_addr_t dev_addr,
-			size_t size, enum dma_data_direction dir,
-			struct dma_attrs *attrs)
-{
-	unmap_single(hwdev, dev_addr, size, dir);
+	if (dir == DMA_FROM_DEVICE)
+		dma_mark_clean(phys_to_virt(paddr), size);
 }
 EXPORT_SYMBOL_GPL(swiotlb_unmap_page);
 
@@ -823,31 +815,21 @@ EXPORT_SYMBOL_GPL(swiotlb_unmap_page);
  * address back to the card, you must first perform a
  * swiotlb_dma_sync_for_device, and then the device again owns the buffer
  */
-static void
-swiotlb_sync_single(struct device *hwdev, dma_addr_t dev_addr,
-		    size_t size, enum dma_data_direction dir,
-		    enum dma_sync_target target)
+void
+swiotlb_sync_single_for_cpu(struct device *hwdev, dma_addr_t dev_addr,
+			    size_t size, enum dma_data_direction dir)
 {
 	phys_addr_t paddr = dma_to_phys(hwdev, dev_addr);
 
 	BUG_ON(dir == DMA_NONE);
 
 	if (is_swiotlb_buffer(paddr)) {
-		swiotlb_tbl_sync_single(hwdev, paddr, size, dir, target);
+		swiotlb_tbl_sync_single(hwdev, paddr, size, dir, SYNC_FOR_CPU);
 		return;
 	}
 
-	if (dir != DMA_FROM_DEVICE)
-		return;
-
-	dma_mark_clean(phys_to_virt(paddr), size);
-}
-
-void
-swiotlb_sync_single_for_cpu(struct device *hwdev, dma_addr_t dev_addr,
-			    size_t size, enum dma_data_direction dir)
-{
-	swiotlb_sync_single(hwdev, dev_addr, size, dir, SYNC_FOR_CPU);
+	if (dir == DMA_FROM_DEVICE)
+		dma_mark_clean(phys_to_virt(paddr), size);
 }
 EXPORT_SYMBOL(swiotlb_sync_single_for_cpu);
 
@@ -855,7 +837,14 @@ void
 swiotlb_sync_single_for_device(struct device *hwdev, dma_addr_t dev_addr,
 			       size_t size, enum dma_data_direction dir)
 {
-	swiotlb_sync_single(hwdev, dev_addr, size, dir, SYNC_FOR_DEVICE);
+	phys_addr_t paddr = dma_to_phys(hwdev, dev_addr);
+
+	BUG_ON(dir == DMA_NONE);
+
+	if (!is_swiotlb_buffer(paddr))
+		return;
+
+	swiotlb_tbl_sync_single(hwdev, paddr, size, dir, SYNC_FOR_DEVICE);
 }
 EXPORT_SYMBOL(swiotlb_sync_single_for_device);
 
@@ -929,10 +918,9 @@ swiotlb_unmap_sg_attrs(struct device *hwdev, struct scatterlist *sgl,
 	struct scatterlist *sg;
 	int i;
 
-	BUG_ON(dir == DMA_NONE);
-
 	for_each_sg(sgl, sg, nelems, i)
-		unmap_single(hwdev, sg->dma_address, sg_dma_len(sg), dir);
+		swiotlb_unmap_page(hwdev, sg->dma_address, sg_dma_len(sg),
+				   dir, attrs);
 
 }
 EXPORT_SYMBOL(swiotlb_unmap_sg_attrs);
@@ -952,32 +940,29 @@ EXPORT_SYMBOL(swiotlb_unmap_sg);
  * The same as swiotlb_sync_single_* but for a scatter-gather list, same rules
  * and usage.
  */
-static void
-swiotlb_sync_sg(struct device *hwdev, struct scatterlist *sgl,
-		int nelems, enum dma_data_direction dir,
-		enum dma_sync_target target)
+void
+swiotlb_sync_sg_for_cpu(struct device *hwdev, struct scatterlist *sgl,
+			int nelems, enum dma_data_direction dir)
 {
 	struct scatterlist *sg;
 	int i;
 
 	for_each_sg(sgl, sg, nelems, i)
-		swiotlb_sync_single(hwdev, sg->dma_address,
-				    sg_dma_len(sg), dir, target);
-}
-
-void
-swiotlb_sync_sg_for_cpu(struct device *hwdev, struct scatterlist *sg,
-			int nelems, enum dma_data_direction dir)
-{
-	swiotlb_sync_sg(hwdev, sg, nelems, dir, SYNC_FOR_CPU);
+		swiotlb_sync_single_for_cpu(hwdev, sg->dma_address,
+					    sg_dma_len(sg), dir);
 }
 EXPORT_SYMBOL(swiotlb_sync_sg_for_cpu);
 
 void
-swiotlb_sync_sg_for_device(struct device *hwdev, struct scatterlist *sg,
+swiotlb_sync_sg_for_device(struct device *hwdev, struct scatterlist *sgl,
 			   int nelems, enum dma_data_direction dir)
 {
-	swiotlb_sync_sg(hwdev, sg, nelems, dir, SYNC_FOR_DEVICE);
+	struct scatterlist *sg;
+	int i;
+
+	for_each_sg(sgl, sg, nelems, i)
+		swiotlb_sync_single_for_device(hwdev, sg->dma_address,
+					       sg_dma_len(sg), dir);
 }
 EXPORT_SYMBOL(swiotlb_sync_sg_for_device);
 



* [RFC PATCH 2/3] xen/swiotlb: Fold static unmap and sync calls into calling functions
  2015-12-13 21:28 [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking Alexander Duyck
  2015-12-13 21:28 ` [RFC PATCH 1/3] swiotlb: Fold static unmap and sync calls into calling functions Alexander Duyck
@ 2015-12-13 21:28 ` Alexander Duyck
  2015-12-13 21:28 ` [RFC PATCH 3/3] x86: Create dma_mark_dirty to dirty pages used for DMA by VM guest Alexander Duyck
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 35+ messages in thread
From: Alexander Duyck @ 2015-12-13 21:28 UTC (permalink / raw)
  To: kvm, linux-pci, x86, linux-kernel, alexander.duyck, qemu-devel
  Cc: tianyu.lan, yang.zhang.wz, mst, konrad.wilk, dgilbert, agraf,
	alex.williamson

This change essentially does two things.  The first is to fold the
swiotlb_unmap and swiotlb_sync helpers into their callers.  The goal here
is twofold: it reduces execution time and improves performance, since we
no longer call through as many functions, and it lets us split up the sync
functionality, as the dma_mark_clean portion of the sync call is only
needed for dma_sync_*_for_cpu; we do not actually want to mark the page as
clean when syncing for the device.

The second change is to move dma_mark_clean inside the if statement instead
of using an early return in the sync and unmap paths.  By doing this we
make room to also add a dma_mark_dirty call at the same point later.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
---
 drivers/xen/swiotlb-xen.c |   90 ++++++++++++++++++---------------------------
 1 file changed, 36 insertions(+), 54 deletions(-)

diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 7399782c0998..2154c70e47da 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -432,9 +432,9 @@ EXPORT_SYMBOL_GPL(xen_swiotlb_map_page);
  * After this call, reads by the cpu to the buffer are guaranteed to see
  * whatever the device wrote there.
  */
-static void xen_unmap_single(struct device *hwdev, dma_addr_t dev_addr,
-			     size_t size, enum dma_data_direction dir,
-				 struct dma_attrs *attrs)
+void xen_swiotlb_unmap_page(struct device *hwdev, dma_addr_t dev_addr,
+			    size_t size, enum dma_data_direction dir,
+			    struct dma_attrs *attrs)
 {
 	phys_addr_t paddr = xen_bus_to_phys(dev_addr);
 
@@ -448,23 +448,14 @@ static void xen_unmap_single(struct device *hwdev, dma_addr_t dev_addr,
 		return;
 	}
 
-	if (dir != DMA_FROM_DEVICE)
-		return;
-
 	/*
 	 * phys_to_virt doesn't work with hihgmem page but we could
 	 * call dma_mark_clean() with hihgmem page here. However, we
 	 * are fine since dma_mark_clean() is null on POWERPC. We can
 	 * make dma_mark_clean() take a physical address if necessary.
 	 */
-	dma_mark_clean(phys_to_virt(paddr), size);
-}
-
-void xen_swiotlb_unmap_page(struct device *hwdev, dma_addr_t dev_addr,
-			    size_t size, enum dma_data_direction dir,
-			    struct dma_attrs *attrs)
-{
-	xen_unmap_single(hwdev, dev_addr, size, dir, attrs);
+	if (dir == DMA_FROM_DEVICE)
+		dma_mark_clean(phys_to_virt(paddr), size);
 }
 EXPORT_SYMBOL_GPL(xen_swiotlb_unmap_page);
 
@@ -478,36 +469,22 @@ EXPORT_SYMBOL_GPL(xen_swiotlb_unmap_page);
  * address back to the card, you must first perform a
  * xen_swiotlb_dma_sync_for_device, and then the device again owns the buffer
  */
-static void
-xen_swiotlb_sync_single(struct device *hwdev, dma_addr_t dev_addr,
-			size_t size, enum dma_data_direction dir,
-			enum dma_sync_target target)
+void
+xen_swiotlb_sync_single_for_cpu(struct device *hwdev, dma_addr_t dev_addr,
+				size_t size, enum dma_data_direction dir)
 {
 	phys_addr_t paddr = xen_bus_to_phys(dev_addr);
 
 	BUG_ON(dir == DMA_NONE);
 
-	if (target == SYNC_FOR_CPU)
-		xen_dma_sync_single_for_cpu(hwdev, dev_addr, size, dir);
+	xen_dma_sync_single_for_cpu(hwdev, dev_addr, size, dir);
 
 	/* NOTE: We use dev_addr here, not paddr! */
 	if (is_xen_swiotlb_buffer(dev_addr))
-		swiotlb_tbl_sync_single(hwdev, paddr, size, dir, target);
-
-	if (target == SYNC_FOR_DEVICE)
-		xen_dma_sync_single_for_device(hwdev, dev_addr, size, dir);
-
-	if (dir != DMA_FROM_DEVICE)
-		return;
+		swiotlb_tbl_sync_single(hwdev, paddr, size, dir, SYNC_FOR_CPU);
 
-	dma_mark_clean(phys_to_virt(paddr), size);
-}
-
-void
-xen_swiotlb_sync_single_for_cpu(struct device *hwdev, dma_addr_t dev_addr,
-				size_t size, enum dma_data_direction dir)
-{
-	xen_swiotlb_sync_single(hwdev, dev_addr, size, dir, SYNC_FOR_CPU);
+	if (dir == DMA_FROM_DEVICE)
+		dma_mark_clean(phys_to_virt(paddr), size);
 }
 EXPORT_SYMBOL_GPL(xen_swiotlb_sync_single_for_cpu);
 
@@ -515,7 +492,16 @@ void
 xen_swiotlb_sync_single_for_device(struct device *hwdev, dma_addr_t dev_addr,
 				   size_t size, enum dma_data_direction dir)
 {
-	xen_swiotlb_sync_single(hwdev, dev_addr, size, dir, SYNC_FOR_DEVICE);
+	phys_addr_t paddr = xen_bus_to_phys(dev_addr);
+
+	BUG_ON(dir == DMA_NONE);
+
+	/* NOTE: We use dev_addr here, not paddr! */
+	if (is_xen_swiotlb_buffer(dev_addr))
+		swiotlb_tbl_sync_single(hwdev, paddr, size, dir,
+					SYNC_FOR_DEVICE);
+
+	xen_dma_sync_single_for_device(hwdev, dev_addr, size, dir);
 }
 EXPORT_SYMBOL_GPL(xen_swiotlb_sync_single_for_device);
 
@@ -604,10 +590,9 @@ xen_swiotlb_unmap_sg_attrs(struct device *hwdev, struct scatterlist *sgl,
 	struct scatterlist *sg;
 	int i;
 
-	BUG_ON(dir == DMA_NONE);
-
 	for_each_sg(sgl, sg, nelems, i)
-		xen_unmap_single(hwdev, sg->dma_address, sg_dma_len(sg), dir, attrs);
+		xen_swiotlb_unmap_page(hwdev, sg->dma_address, sg_dma_len(sg),
+				       dir, attrs);
 
 }
 EXPORT_SYMBOL_GPL(xen_swiotlb_unmap_sg_attrs);
@@ -619,32 +604,29 @@ EXPORT_SYMBOL_GPL(xen_swiotlb_unmap_sg_attrs);
  * The same as swiotlb_sync_single_* but for a scatter-gather list, same rules
  * and usage.
  */
-static void
-xen_swiotlb_sync_sg(struct device *hwdev, struct scatterlist *sgl,
-		    int nelems, enum dma_data_direction dir,
-		    enum dma_sync_target target)
+void
+xen_swiotlb_sync_sg_for_cpu(struct device *hwdev, struct scatterlist *sgl,
+			    int nelems, enum dma_data_direction dir)
 {
 	struct scatterlist *sg;
 	int i;
 
 	for_each_sg(sgl, sg, nelems, i)
-		xen_swiotlb_sync_single(hwdev, sg->dma_address,
-					sg_dma_len(sg), dir, target);
-}
-
-void
-xen_swiotlb_sync_sg_for_cpu(struct device *hwdev, struct scatterlist *sg,
-			    int nelems, enum dma_data_direction dir)
-{
-	xen_swiotlb_sync_sg(hwdev, sg, nelems, dir, SYNC_FOR_CPU);
+		xen_swiotlb_sync_single_for_cpu(hwdev, sg->dma_address,
+						sg_dma_len(sg), dir);
 }
 EXPORT_SYMBOL_GPL(xen_swiotlb_sync_sg_for_cpu);
 
 void
-xen_swiotlb_sync_sg_for_device(struct device *hwdev, struct scatterlist *sg,
+xen_swiotlb_sync_sg_for_device(struct device *hwdev, struct scatterlist *sgl,
 			       int nelems, enum dma_data_direction dir)
 {
-	xen_swiotlb_sync_sg(hwdev, sg, nelems, dir, SYNC_FOR_DEVICE);
+	struct scatterlist *sg;
+	int i;
+
+	for_each_sg(sgl, sg, nelems, i)
+		xen_swiotlb_sync_single_for_device(hwdev, sg->dma_address,
+						   sg_dma_len(sg), dir);
 }
 EXPORT_SYMBOL_GPL(xen_swiotlb_sync_sg_for_device);
 



* [RFC PATCH 3/3] x86: Create dma_mark_dirty to dirty pages used for DMA by VM guest
  2015-12-13 21:28 [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking Alexander Duyck
  2015-12-13 21:28 ` [RFC PATCH 1/3] swiotlb: Fold static unmap and sync calls into calling functions Alexander Duyck
  2015-12-13 21:28 ` [RFC PATCH 2/3] xen/swiotlb: " Alexander Duyck
@ 2015-12-13 21:28 ` Alexander Duyck
  2015-12-14 14:00   ` Michael S. Tsirkin
  2015-12-14  2:27 ` [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking Yang Zhang
       [not found] ` <20160104204104.GB17427@char.us.oracle.com>
  4 siblings, 1 reply; 35+ messages in thread
From: Alexander Duyck @ 2015-12-13 21:28 UTC (permalink / raw)
  To: kvm, linux-pci, x86, linux-kernel, alexander.duyck, qemu-devel
  Cc: tianyu.lan, yang.zhang.wz, mst, konrad.wilk, dgilbert, agraf,
	alex.williamson

This patch is meant to provide the guest with a way of flagging DMA pages
as dirty to the host when using a direct-assigned device within a guest.
The advantage of this approach is that it is fairly simple; however, it
currently has a significant impact on device performance in all the
scenarios where it isn't needed.

As such this is really meant only as a proof of concept and to get the ball
rolling in terms of figuring out how best to approach the issue of dirty
page tracking for a guest that is using a direct-assigned device.  In
addition, with just this patch it should be possible to modify current
migration approaches so that the hot-remove of the device is delayed from
before the start of the migration until the period just before the final
stop-and-copy.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
---
 arch/arm/include/asm/dma-mapping.h       |    3 ++-
 arch/arm64/include/asm/dma-mapping.h     |    5 ++---
 arch/ia64/include/asm/dma.h              |    1 +
 arch/mips/include/asm/dma-mapping.h      |    1 +
 arch/powerpc/include/asm/swiotlb.h       |    1 +
 arch/tile/include/asm/dma-mapping.h      |    1 +
 arch/unicore32/include/asm/dma-mapping.h |    1 +
 arch/x86/Kconfig                         |   11 +++++++++++
 arch/x86/include/asm/swiotlb.h           |   26 ++++++++++++++++++++++++++
 drivers/xen/swiotlb-xen.c                |    6 ++++++
 lib/swiotlb.c                            |    6 ++++++
 11 files changed, 58 insertions(+), 4 deletions(-)

diff --git a/arch/arm/include/asm/dma-mapping.h b/arch/arm/include/asm/dma-mapping.h
index ccb3aa64640d..1962d7b471c7 100644
--- a/arch/arm/include/asm/dma-mapping.h
+++ b/arch/arm/include/asm/dma-mapping.h
@@ -167,7 +167,8 @@ static inline bool dma_capable(struct device *dev, dma_addr_t addr, size_t size)
 	return 1;
 }
 
-static inline void dma_mark_clean(void *addr, size_t size) { }
+static inline void dma_mark_clean(void *addr, size_t size) {}
+static inline void dma_mark_dirty(void *addr, size_t size) {}
 
 extern int arm_dma_set_mask(struct device *dev, u64 dma_mask);
 
diff --git a/arch/arm64/include/asm/dma-mapping.h b/arch/arm64/include/asm/dma-mapping.h
index 61e08f360e31..8d24fe11c8a3 100644
--- a/arch/arm64/include/asm/dma-mapping.h
+++ b/arch/arm64/include/asm/dma-mapping.h
@@ -84,9 +84,8 @@ static inline bool dma_capable(struct device *dev, dma_addr_t addr, size_t size)
 	return addr + size - 1 <= *dev->dma_mask;
 }
 
-static inline void dma_mark_clean(void *addr, size_t size)
-{
-}
+static inline void dma_mark_clean(void *addr, size_t size) {}
+static inline void dma_mark_dirty(void *addr, size_t size) {}
 
 #endif	/* __KERNEL__ */
 #endif	/* __ASM_DMA_MAPPING_H */
diff --git a/arch/ia64/include/asm/dma.h b/arch/ia64/include/asm/dma.h
index 4d97f60f1ef5..d92ebeb2758e 100644
--- a/arch/ia64/include/asm/dma.h
+++ b/arch/ia64/include/asm/dma.h
@@ -20,5 +20,6 @@ extern unsigned long MAX_DMA_ADDRESS;
 #define free_dma(x)
 
 void dma_mark_clean(void *addr, size_t size);
+static inline void dma_mark_dirty(void *addr, size_t size) {}
 
 #endif /* _ASM_IA64_DMA_H */
diff --git a/arch/mips/include/asm/dma-mapping.h b/arch/mips/include/asm/dma-mapping.h
index e604f760c4a0..567f6e03e337 100644
--- a/arch/mips/include/asm/dma-mapping.h
+++ b/arch/mips/include/asm/dma-mapping.h
@@ -28,6 +28,7 @@ static inline bool dma_capable(struct device *dev, dma_addr_t addr, size_t size)
 }
 
 static inline void dma_mark_clean(void *addr, size_t size) {}
+static inline void dma_mark_dirty(void *addr, size_t size) {}
 
 #include <asm-generic/dma-mapping-common.h>
 
diff --git a/arch/powerpc/include/asm/swiotlb.h b/arch/powerpc/include/asm/swiotlb.h
index de99d6e29430..b694e8399e28 100644
--- a/arch/powerpc/include/asm/swiotlb.h
+++ b/arch/powerpc/include/asm/swiotlb.h
@@ -16,6 +16,7 @@
 extern struct dma_map_ops swiotlb_dma_ops;
 
 static inline void dma_mark_clean(void *addr, size_t size) {}
+static inline void dma_mark_dirty(void *addr, size_t size) {}
 
 extern unsigned int ppc_swiotlb_enable;
 int __init swiotlb_setup_bus_notifier(void);
diff --git a/arch/tile/include/asm/dma-mapping.h b/arch/tile/include/asm/dma-mapping.h
index 96ac6cce4a32..79953f09e938 100644
--- a/arch/tile/include/asm/dma-mapping.h
+++ b/arch/tile/include/asm/dma-mapping.h
@@ -58,6 +58,7 @@ static inline phys_addr_t dma_to_phys(struct device *dev, dma_addr_t daddr)
 }
 
 static inline void dma_mark_clean(void *addr, size_t size) {}
+static inline void dma_mark_dirty(void *addr, size_t size) {}
 
 static inline void set_dma_ops(struct device *dev, struct dma_map_ops *ops)
 {
diff --git a/arch/unicore32/include/asm/dma-mapping.h b/arch/unicore32/include/asm/dma-mapping.h
index 8140e053ccd3..b9d357ab122d 100644
--- a/arch/unicore32/include/asm/dma-mapping.h
+++ b/arch/unicore32/include/asm/dma-mapping.h
@@ -49,6 +49,7 @@ static inline phys_addr_t dma_to_phys(struct device *dev, dma_addr_t daddr)
 }
 
 static inline void dma_mark_clean(void *addr, size_t size) {}
+static inline void dma_mark_dirty(void *addr, size_t size) {}
 
 static inline void dma_cache_sync(struct device *dev, void *vaddr,
 		size_t size, enum dma_data_direction direction)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index db3622f22b61..f0b09156d7d8 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -841,6 +841,17 @@ config SWIOTLB
 	  with more than 3 GB of memory.
 	  If unsure, say Y.
 
+config SWIOTLB_PAGE_DIRTYING
+	bool "SWIOTLB page dirtying"
+	depends on SWIOTLB
+	default n
+	---help---
+	  SWIOTLB page dirtying support provides a means for the guest to
+	  trigger write faults on pages which received DMA from the device
+	  without changing the data contained within.  By doing this the
+	  guest can then support migration assuming the device and any
+	  remaining pages are unmapped prior to the CPU itself being halted.
+
 config IOMMU_HELPER
 	def_bool y
 	depends on CALGARY_IOMMU || GART_IOMMU || SWIOTLB || AMD_IOMMU
diff --git a/arch/x86/include/asm/swiotlb.h b/arch/x86/include/asm/swiotlb.h
index ab05d73e2bb7..7f9f2e76d081 100644
--- a/arch/x86/include/asm/swiotlb.h
+++ b/arch/x86/include/asm/swiotlb.h
@@ -29,6 +29,32 @@ static inline void pci_swiotlb_late_init(void)
 
 static inline void dma_mark_clean(void *addr, size_t size) {}
 
+/*
+ * Make certain that the pages get marked as dirty
+ * now that the device has completed the DMA transaction.
+ *
+ * Without this we run the risk of a guest migration missing
+ * the pages that the device has written to as they are not
+ * tracked as a part of the dirty page tracking.
+ */
+static inline void dma_mark_dirty(void *addr, size_t size)
+{
+#ifdef CONFIG_SWIOTLB_PAGE_DIRTYING
+	unsigned long pg_addr, start;
+
+	start = (unsigned long)addr;
+	pg_addr = PAGE_ALIGN(start + size);
+	start &= ~(sizeof(atomic_t) - 1);
+
+	/* trigger a write fault on each page, excluding first page */
+	while ((pg_addr -= PAGE_SIZE) > start)
+		atomic_add(0, (atomic_t *)pg_addr);
+
+	/* trigger a write fault on first word of DMA */
+	atomic_add(0, (atomic_t *)start);
+#endif /* CONFIG_SWIOTLB_PAGE_DIRTYING */
+}
+
 extern void *x86_swiotlb_alloc_coherent(struct device *hwdev, size_t size,
 					dma_addr_t *dma_handle, gfp_t flags,
 					struct dma_attrs *attrs);
diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 2154c70e47da..1533b3eefb67 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -456,6 +456,9 @@ void xen_swiotlb_unmap_page(struct device *hwdev, dma_addr_t dev_addr,
 	 */
 	if (dir == DMA_FROM_DEVICE)
 		dma_mark_clean(phys_to_virt(paddr), size);
+
+	if (dir != DMA_TO_DEVICE)
+		dma_mark_dirty(phys_to_virt(paddr), size);
 }
 EXPORT_SYMBOL_GPL(xen_swiotlb_unmap_page);
 
@@ -485,6 +488,9 @@ xen_swiotlb_sync_single_for_cpu(struct device *hwdev, dma_addr_t dev_addr,
 
 	if (dir == DMA_FROM_DEVICE)
 		dma_mark_clean(phys_to_virt(paddr), size);
+
+	if (dir != DMA_TO_DEVICE)
+		dma_mark_dirty(phys_to_virt(paddr), size);
 }
 EXPORT_SYMBOL_GPL(xen_swiotlb_sync_single_for_cpu);
 
diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index 384ac06217b2..4223d6c54724 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -802,6 +802,9 @@ void swiotlb_unmap_page(struct device *hwdev, dma_addr_t dev_addr,
 	 */
 	if (dir == DMA_FROM_DEVICE)
 		dma_mark_clean(phys_to_virt(paddr), size);
+
+	if (dir != DMA_TO_DEVICE)
+		dma_mark_dirty(phys_to_virt(paddr), size);
 }
 EXPORT_SYMBOL_GPL(swiotlb_unmap_page);
 
@@ -830,6 +833,9 @@ swiotlb_sync_single_for_cpu(struct device *hwdev, dma_addr_t dev_addr,
 
 	if (dir == DMA_FROM_DEVICE)
 		dma_mark_clean(phys_to_virt(paddr), size);
+
+	if (dir != DMA_TO_DEVICE)
+		dma_mark_dirty(phys_to_virt(paddr), size);
 }
 EXPORT_SYMBOL(swiotlb_sync_single_for_cpu);
 



* Re: [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking
  2015-12-13 21:28 [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking Alexander Duyck
                   ` (2 preceding siblings ...)
  2015-12-13 21:28 ` [RFC PATCH 3/3] x86: Create dma_mark_dirty to dirty pages used for DMA by VM guest Alexander Duyck
@ 2015-12-14  2:27 ` Yang Zhang
  2015-12-14  4:54   ` Alexander Duyck
       [not found] ` <20160104204104.GB17427@char.us.oracle.com>
  4 siblings, 1 reply; 35+ messages in thread
From: Yang Zhang @ 2015-12-14  2:27 UTC (permalink / raw)
  To: Alexander Duyck, kvm, linux-pci, x86, linux-kernel,
	alexander.duyck, qemu-devel
  Cc: tianyu.lan, mst, konrad.wilk, dgilbert, agraf, alex.williamson

On 2015/12/14 5:28, Alexander Duyck wrote:
> This patch set is meant to be the guest side code for a proof of concept
> involving leaving pass-through devices in the guest during the warm-up
> phase of guest live migration.  In order to accomplish this I have added a
> new function called dma_mark_dirty that will mark the pages associated with
> the DMA transaction as dirty in the case of either an unmap or a
> sync_.*_for_cpu where the DMA direction is either DMA_FROM_DEVICE or
> DMA_BIDIRECTIONAL.  The pass-through device must still be removed before
> the stop-and-copy phase, however allowing the device to be present should
> significantly improve the performance of the guest during the warm-up
> period.
>
> This current implementation is very preliminary and there are number of
> items still missing.  Specifically in order to make this a more complete
> solution we need to support:
> 1.  Notifying hypervisor that drivers are dirtying DMA pages received
> 2.  Bypassing page dirtying when it is not needed.
>

Shouldn't the current log-dirty mechanism already cover them?


> The two mechanisms referenced above would likely require coordination with
> QEMU and as such are open to discussion.  I haven't attempted to address
> them as I am not sure there is a consensus as of yet.  My personal
> preference would be to add a vendor-specific configuration block to the
> emulated pci-bridge interfaces created by QEMU that would allow us to
> essentially extend shpc to support guest live migration with pass-through
> devices.
>
> The functionality in this patch set is currently disabled by default.  To
> enable it you can select "SWIOTLB page dirtying" from the "Processor type
> and features" menu.

Only SWIOTLB is supported?

>
> ---
>
> Alexander Duyck (3):
>        swiotlb: Fold static unmap and sync calls into calling functions
>        xen/swiotlb: Fold static unmap and sync calls into calling functions
>        x86: Create dma_mark_dirty to dirty pages used for DMA by VM guest
>
>
>   arch/arm/include/asm/dma-mapping.h       |    3 +
>   arch/arm64/include/asm/dma-mapping.h     |    5 +-
>   arch/ia64/include/asm/dma.h              |    1
>   arch/mips/include/asm/dma-mapping.h      |    1
>   arch/powerpc/include/asm/swiotlb.h       |    1
>   arch/tile/include/asm/dma-mapping.h      |    1
>   arch/unicore32/include/asm/dma-mapping.h |    1
>   arch/x86/Kconfig                         |   11 ++++
>   arch/x86/include/asm/swiotlb.h           |   26 ++++++++
>   drivers/xen/swiotlb-xen.c                |   92 +++++++++++++-----------------
>   lib/swiotlb.c                            |   83 ++++++++++++---------------
>   11 files changed, 123 insertions(+), 102 deletions(-)
>
> --
>


-- 
best regards
yang


* Re: [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking
  2015-12-14  2:27 ` [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking Yang Zhang
@ 2015-12-14  4:54   ` Alexander Duyck
  2015-12-14  5:22     ` Yang Zhang
  0 siblings, 1 reply; 35+ messages in thread
From: Alexander Duyck @ 2015-12-14  4:54 UTC (permalink / raw)
  To: Yang Zhang
  Cc: Alexander Duyck, kvm, linux-pci, x86, linux-kernel, qemu-devel,
	Lan Tianyu, Michael S. Tsirkin, konrad.wilk,
	Dr. David Alan Gilbert, Alexander Graf, Alex Williamson

On Sun, Dec 13, 2015 at 6:27 PM, Yang Zhang <yang.zhang.wz@gmail.com> wrote:
> On 2015/12/14 5:28, Alexander Duyck wrote:
>>
>> This patch set is meant to be the guest side code for a proof of concept
>> involving leaving pass-through devices in the guest during the warm-up
>> phase of guest live migration.  In order to accomplish this I have added a
>> new function called dma_mark_dirty that will mark the pages associated
>> with
>> the DMA transaction as dirty in the case of either an unmap or a
>> sync_.*_for_cpu where the DMA direction is either DMA_FROM_DEVICE or
>> DMA_BIDIRECTIONAL.  The pass-through device must still be removed before
>> the stop-and-copy phase, however allowing the device to be present should
>> significantly improve the performance of the guest during the warm-up
>> period.
>>
>> This current implementation is very preliminary and there are number of
>> items still missing.  Specifically in order to make this a more complete
>> solution we need to support:
>> 1.  Notifying hypervisor that drivers are dirtying DMA pages received
>> 2.  Bypassing page dirtying when it is not needed.
>>
>
> Shouldn't current log dirty mechanism already cover them?

The guest currently has no way of knowing that the hypervisor is doing
dirty page logging, and the dirty logging mechanism has no way of
tracking device DMA accesses.  This change is meant to bridge the two
so that the guest device driver will force the SWIOTLB DMA API to mark
pages written to by the device as dirty.
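The trick behind that marking can be sketched as follows (a userspace model of the approach, not the kernel patch itself; `PAGE_SIZE`, `PAGE_ALIGN` and the function name are stand-ins): a zero-value atomic add is a read-modify-write, so it generates a write access that the hypervisor's dirty-page logging observes, while leaving the DMA data itself untouched.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096UL
#define PAGE_ALIGN(x) (((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

/* Userspace model of the proposed dma_mark_dirty(): touch each page of
 * the buffer with a zero-value atomic add.  The data is unchanged, but
 * each access is a write from the MMU's point of view, so the
 * hypervisor's dirty-page logging will pick those pages up. */
static void mark_dirty(void *addr, size_t size)
{
	uintptr_t start = (uintptr_t)addr;
	uintptr_t pg_addr = PAGE_ALIGN(start + size);

	start &= ~(uintptr_t)(sizeof(atomic_int) - 1);

	/* write-fault every page boundary inside the buffer ... */
	while ((pg_addr -= PAGE_SIZE) > start)
		atomic_fetch_add((atomic_int *)pg_addr, 0);

	/* ... and the first word of the buffer itself */
	atomic_fetch_add((atomic_int *)start, 0);
}
```

The key property is that the buffer contents are bit-for-bit identical before and after the call; only the page-table dirty state changes.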

>> The two mechanisms referenced above would likely require coordination with
>> QEMU and as such are open to discussion.  I haven't attempted to address
>> them as I am not sure there is a consensus as of yet.  My personal
>> preference would be to add a vendor-specific configuration block to the
>> emulated pci-bridge interfaces created by QEMU that would allow us to
>> essentially extend shpc to support guest live migration with pass-through
>> devices.
>>
>> The functionality in this patch set is currently disabled by default.  To
>> enable it you can select "SWIOTLB page dirtying" from the "Processor type
>> and features" menu.
>
>
> Only SWIOTLB is supported?

Yes.  For right now this only supports SWIOTLB.  The assumption is
that SWIOTLB is in use in most cases where an IOMMU is not present.
If an IOMMU is present in a virtualized guest, then the IOMMU itself
can most likely provide a separate mechanism for dirty page tracking.

- Alex

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking
  2015-12-14  4:54   ` Alexander Duyck
@ 2015-12-14  5:22     ` Yang Zhang
  2015-12-14  5:46       ` Alexander Duyck
  0 siblings, 1 reply; 35+ messages in thread
From: Yang Zhang @ 2015-12-14  5:22 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alexander Duyck, kvm, linux-pci, x86, linux-kernel, qemu-devel,
	Lan Tianyu, Michael S. Tsirkin, konrad.wilk,
	Dr. David Alan Gilbert, Alexander Graf, Alex Williamson

On 2015/12/14 12:54, Alexander Duyck wrote:
> On Sun, Dec 13, 2015 at 6:27 PM, Yang Zhang <yang.zhang.wz@gmail.com> wrote:
>> On 2015/12/14 5:28, Alexander Duyck wrote:
>>>
>>> This patch set is meant to be the guest side code for a proof of concept
>>> involving leaving pass-through devices in the guest during the warm-up
>>> phase of guest live migration.  In order to accomplish this I have added a
>>> new function called dma_mark_dirty that will mark the pages associated
>>> with
>>> the DMA transaction as dirty in the case of either an unmap or a
>>> sync_.*_for_cpu where the DMA direction is either DMA_FROM_DEVICE or
>>> DMA_BIDIRECTIONAL.  The pass-through device must still be removed before
>>> the stop-and-copy phase, however allowing the device to be present should
>>> significantly improve the performance of the guest during the warm-up
>>> period.
>>>
>>> This current implementation is very preliminary and there are number of
>>> items still missing.  Specifically in order to make this a more complete
>>> solution we need to support:
>>> 1.  Notifying hypervisor that drivers are dirtying DMA pages received
>>> 2.  Bypassing page dirtying when it is not needed.
>>>
>>
>> Shouldn't current log dirty mechanism already cover them?
>
> The guest has no way of currently knowing that the hypervisor is doing
> dirty page logging, and the log dirty mechanism currently has no way
> of tracking device DMA accesses.  This change is meant to bridge the
> two so that the guest device driver will force the SWIOTLB DMA API to
> mark pages written to by the device as dirty.

OK. This is what we call the "dummy write mechanism". Actually, this is 
just a workaround until the IOMMU dirty bit is ready. Eventually, we 
need to change over to using the hardware dirty bit. Besides, we may 
still lose data if DMA happens during, or just before, the 
stop-and-copy phase.

>
>>> The two mechanisms referenced above would likely require coordination with
>>> QEMU and as such are open to discussion.  I haven't attempted to address
>>> them as I am not sure there is a consensus as of yet.  My personal
>>> preference would be to add a vendor-specific configuration block to the
>>> emulated pci-bridge interfaces created by QEMU that would allow us to
>>> essentially extend shpc to support guest live migration with pass-through
>>> devices.
>>>
>>> The functionality in this patch set is currently disabled by default.  To
>>> enable it you can select "SWIOTLB page dirtying" from the "Processor type
>>> and features" menu.
>>
>>
>> Only SWIOTLB is supported?
>
> Yes.  For right now this only supports SWIOTLB.  The assumption here
> is that SWIOTLB is in use for most cases where an IOMMU is not
> present.  If an IOMMU is present in a virtualized guest then most
> likely the IOMMU might be able to provide a separate mechanism for
> dirty page tracking.
>
> - Alex
>


-- 
best regards
yang

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking
  2015-12-14  5:22     ` Yang Zhang
@ 2015-12-14  5:46       ` Alexander Duyck
  2015-12-14  7:20         ` Yang Zhang
  0 siblings, 1 reply; 35+ messages in thread
From: Alexander Duyck @ 2015-12-14  5:46 UTC (permalink / raw)
  To: Yang Zhang
  Cc: Alexander Duyck, kvm, linux-pci, x86, linux-kernel, qemu-devel,
	Lan Tianyu, Michael S. Tsirkin, konrad.wilk,
	Dr. David Alan Gilbert, Alexander Graf, Alex Williamson

On Sun, Dec 13, 2015 at 9:22 PM, Yang Zhang <yang.zhang.wz@gmail.com> wrote:
> On 2015/12/14 12:54, Alexander Duyck wrote:
>>
>> On Sun, Dec 13, 2015 at 6:27 PM, Yang Zhang <yang.zhang.wz@gmail.com>
>> wrote:
>>>
>>> On 2015/12/14 5:28, Alexander Duyck wrote:
>>>>
>>>>
>>>> This patch set is meant to be the guest side code for a proof of concept
>>>> involving leaving pass-through devices in the guest during the warm-up
>>>> phase of guest live migration.  In order to accomplish this I have added
>>>> a
>>>> new function called dma_mark_dirty that will mark the pages associated
>>>> with
>>>> the DMA transaction as dirty in the case of either an unmap or a
>>>> sync_.*_for_cpu where the DMA direction is either DMA_FROM_DEVICE or
>>>> DMA_BIDIRECTIONAL.  The pass-through device must still be removed before
>>>> the stop-and-copy phase, however allowing the device to be present
>>>> should
>>>> significantly improve the performance of the guest during the warm-up
>>>> period.
>>>>
>>>> This current implementation is very preliminary and there are number of
>>>> items still missing.  Specifically in order to make this a more complete
>>>> solution we need to support:
>>>> 1.  Notifying hypervisor that drivers are dirtying DMA pages received
>>>> 2.  Bypassing page dirtying when it is not needed.
>>>>
>>>
>>> Shouldn't current log dirty mechanism already cover them?
>>
>>
>> The guest has no way of currently knowing that the hypervisor is doing
>> dirty page logging, and the log dirty mechanism currently has no way
>> of tracking device DMA accesses.  This change is meant to bridge the
>> two so that the guest device driver will force the SWIOTLB DMA API to
>> mark pages written to by the device as dirty.
>
>
> OK. This is what we called "dummy write mechanism". Actually, this is just a
> workaround before iommu dirty bit ready. Eventually, we need to change to
> use the hardware dirty bit. Besides, we may still lost the data if dma
> happens during/just before stop and copy phase.

Right, this is a "dummy write mechanism" in order to allow for dirty
entry tracking.  It only works completely if we force the hardware to
quiesce via a hot-plug event before we reach the stop-and-copy phase
of the migration.

The IOMMU dirty bit approach is likely going to involve a significant
number of challenges.  Looking over the driver and the data sheet, it
appears the current implementation uses a form of huge pages in the
IOMMU, so we would need to tear that down and replace it with 4K pages
if we don't want to dirty large regions with each DMA transaction, and
I'm not sure that is something we can change while DMA is active to
the affected regions.  In addition, the data sheet notes that the page
table entries are stored in a translation cache, and to sync things up
you have to invalidate those entries.  I'm not sure what the total
overhead would be for invalidating something like half a million 4K
pages to migrate a guest with just 2G of RAM, but I would expect it to
be fairly expensive, given that IOMMU accesses aren't known for being
fast when invalidating DMA on the host.
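For reference, the arithmetic behind the "half million" figure (a trivial helper, purely illustrative and not part of the patch): a 2G guest needs 2^31 / 2^12 = 524,288 4K entries, so the estimate is right on the mark.

```c
#include <assert.h>

/* Number of 4K IOMMU page-table entries needed to cover a given guest
 * RAM size, i.e. the worst-case count of entries to invalidate. */
static unsigned long pages_for(unsigned long bytes)
{
	return bytes / (4UL << 10);
}
```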

- Alex

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking
  2015-12-14  5:46       ` Alexander Duyck
@ 2015-12-14  7:20         ` Yang Zhang
  2015-12-14 14:02           ` Michael S. Tsirkin
  0 siblings, 1 reply; 35+ messages in thread
From: Yang Zhang @ 2015-12-14  7:20 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alexander Duyck, kvm, linux-pci, x86, linux-kernel, qemu-devel,
	Lan Tianyu, Michael S. Tsirkin, konrad.wilk,
	Dr. David Alan Gilbert, Alexander Graf, Alex Williamson

On 2015/12/14 13:46, Alexander Duyck wrote:
> On Sun, Dec 13, 2015 at 9:22 PM, Yang Zhang <yang.zhang.wz@gmail.com> wrote:
>> On 2015/12/14 12:54, Alexander Duyck wrote:
>>>
>>> On Sun, Dec 13, 2015 at 6:27 PM, Yang Zhang <yang.zhang.wz@gmail.com>
>>> wrote:
>>>>
>>>> On 2015/12/14 5:28, Alexander Duyck wrote:
>>>>>
>>>>>
>>>>> This patch set is meant to be the guest side code for a proof of concept
>>>>> involving leaving pass-through devices in the guest during the warm-up
>>>>> phase of guest live migration.  In order to accomplish this I have added
>>>>> a
>>>>> new function called dma_mark_dirty that will mark the pages associated
>>>>> with
>>>>> the DMA transaction as dirty in the case of either an unmap or a
>>>>> sync_.*_for_cpu where the DMA direction is either DMA_FROM_DEVICE or
>>>>> DMA_BIDIRECTIONAL.  The pass-through device must still be removed before
>>>>> the stop-and-copy phase, however allowing the device to be present
>>>>> should
>>>>> significantly improve the performance of the guest during the warm-up
>>>>> period.
>>>>>
>>>>> This current implementation is very preliminary and there are number of
>>>>> items still missing.  Specifically in order to make this a more complete
>>>>> solution we need to support:
>>>>> 1.  Notifying hypervisor that drivers are dirtying DMA pages received
>>>>> 2.  Bypassing page dirtying when it is not needed.
>>>>>
>>>>
>>>> Shouldn't current log dirty mechanism already cover them?
>>>
>>>
>>> The guest has no way of currently knowing that the hypervisor is doing
>>> dirty page logging, and the log dirty mechanism currently has no way
>>> of tracking device DMA accesses.  This change is meant to bridge the
>>> two so that the guest device driver will force the SWIOTLB DMA API to
>>> mark pages written to by the device as dirty.
>>
>>
>> OK. This is what we called "dummy write mechanism". Actually, this is just a
>> workaround before iommu dirty bit ready. Eventually, we need to change to
>> use the hardware dirty bit. Besides, we may still lost the data if dma
>> happens during/just before stop and copy phase.
>
> Right, this is a "dummy write mechanism" in order to allow for entry
> tracking.  This only works completely if we force the hardware to
> quiesce via a hot-plug event before we reach the stop-and-copy phase
> of the migration.
>
> The IOMMU dirty bit approach is likely going to have a significant
> number of challenges involved.  Looking over the driver and the data
> sheet it looks like the current implementation is using a form of huge
> pages in the IOMMU, as such we will need to tear that down and replace
> it with 4K pages if we don't want to dirty large regions with each DMA

Yes, we need to split the huge pages into 4K pages to get a 
fine-grained dirty range.

> transaction, and I'm not sure that is something we can change while
> DMA is active to the affected regions.  In addition the data sheet

Which changes do you mean?

> references the fact that the page table entries are stored in a
> translation cache and in order to sync things up you have to
> invalidate the entries.  I'm not sure what the total overhead would be
> for invalidating something like a half million 4K pages to migrate a
> guest with just 2G of RAM, but I would think that might be a bit

Do you mean the cost of submitting the flush request, or the 
performance impact due to IOTLB misses? For the former, we have 
domain-selective invalidation. For the latter, the overhead should be 
acceptable since live migration shouldn't last too long.

> expensive given the fact that IOMMU accesses aren't known for being
> incredibly fast when invalidating DMA on the host.
>
> - Alex
>


-- 
best regards
yang

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 3/3] x86: Create dma_mark_dirty to dirty pages used for DMA by VM guest
  2015-12-13 21:28 ` [RFC PATCH 3/3] x86: Create dma_mark_dirty to dirty pages used for DMA by VM guest Alexander Duyck
@ 2015-12-14 14:00   ` Michael S. Tsirkin
  2015-12-14 16:34     ` Alexander Duyck
  0 siblings, 1 reply; 35+ messages in thread
From: Michael S. Tsirkin @ 2015-12-14 14:00 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: kvm, linux-pci, x86, linux-kernel, alexander.duyck, qemu-devel,
	tianyu.lan, yang.zhang.wz, konrad.wilk, dgilbert, agraf,
	alex.williamson

On Sun, Dec 13, 2015 at 01:28:31PM -0800, Alexander Duyck wrote:
> This patch is meant to provide the guest with a way of flagging DMA pages
> as being dirty to the host when using a direct-assign device within a
> guest.  The advantage to this approach is that it is fairly simple; however,
> it currently has a significant impact on device performance in all the
> scenarios where it won't be needed.
> 
> As such this is really meant only as a proof of concept and to get the ball
> rolling in terms of figuring out how best to approach the issue of dirty
> page tracking for a guest that is using a direct assigned device.  In
> addition with just this patch it should be possible to modify current
> migration approaches so that instead of having to hot-remove the device
> before starting the migration this can instead be delayed until the period
> before the final stop and copy.
> 
> Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
> ---
>  arch/arm/include/asm/dma-mapping.h       |    3 ++-
>  arch/arm64/include/asm/dma-mapping.h     |    5 ++---
>  arch/ia64/include/asm/dma.h              |    1 +
>  arch/mips/include/asm/dma-mapping.h      |    1 +
>  arch/powerpc/include/asm/swiotlb.h       |    1 +
>  arch/tile/include/asm/dma-mapping.h      |    1 +
>  arch/unicore32/include/asm/dma-mapping.h |    1 +
>  arch/x86/Kconfig                         |   11 +++++++++++
>  arch/x86/include/asm/swiotlb.h           |   26 ++++++++++++++++++++++++++
>  drivers/xen/swiotlb-xen.c                |    6 ++++++
>  lib/swiotlb.c                            |    6 ++++++
>  11 files changed, 58 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/arm/include/asm/dma-mapping.h b/arch/arm/include/asm/dma-mapping.h
> index ccb3aa64640d..1962d7b471c7 100644
> --- a/arch/arm/include/asm/dma-mapping.h
> +++ b/arch/arm/include/asm/dma-mapping.h
> @@ -167,7 +167,8 @@ static inline bool dma_capable(struct device *dev, dma_addr_t addr, size_t size)
>  	return 1;
>  }
>  
> -static inline void dma_mark_clean(void *addr, size_t size) { }
> +static inline void dma_mark_clean(void *addr, size_t size) {}
> +static inline void dma_mark_dirty(void *addr, size_t size) {}
>  
>  extern int arm_dma_set_mask(struct device *dev, u64 dma_mask);
>  
> diff --git a/arch/arm64/include/asm/dma-mapping.h b/arch/arm64/include/asm/dma-mapping.h
> index 61e08f360e31..8d24fe11c8a3 100644
> --- a/arch/arm64/include/asm/dma-mapping.h
> +++ b/arch/arm64/include/asm/dma-mapping.h
> @@ -84,9 +84,8 @@ static inline bool dma_capable(struct device *dev, dma_addr_t addr, size_t size)
>  	return addr + size - 1 <= *dev->dma_mask;
>  }
>  
> -static inline void dma_mark_clean(void *addr, size_t size)
> -{
> -}
> +static inline void dma_mark_clean(void *addr, size_t size) {}
> +static inline void dma_mark_dirty(void *addr, size_t size) {}
>  
>  #endif	/* __KERNEL__ */
>  #endif	/* __ASM_DMA_MAPPING_H */
> diff --git a/arch/ia64/include/asm/dma.h b/arch/ia64/include/asm/dma.h
> index 4d97f60f1ef5..d92ebeb2758e 100644
> --- a/arch/ia64/include/asm/dma.h
> +++ b/arch/ia64/include/asm/dma.h
> @@ -20,5 +20,6 @@ extern unsigned long MAX_DMA_ADDRESS;
>  #define free_dma(x)
>  
>  void dma_mark_clean(void *addr, size_t size);
> +static inline void dma_mark_dirty(void *addr, size_t size) {}
>  
>  #endif /* _ASM_IA64_DMA_H */
> diff --git a/arch/mips/include/asm/dma-mapping.h b/arch/mips/include/asm/dma-mapping.h
> index e604f760c4a0..567f6e03e337 100644
> --- a/arch/mips/include/asm/dma-mapping.h
> +++ b/arch/mips/include/asm/dma-mapping.h
> @@ -28,6 +28,7 @@ static inline bool dma_capable(struct device *dev, dma_addr_t addr, size_t size)
>  }
>  
>  static inline void dma_mark_clean(void *addr, size_t size) {}
> +static inline void dma_mark_dirty(void *addr, size_t size) {}
>  
>  #include <asm-generic/dma-mapping-common.h>
>  
> diff --git a/arch/powerpc/include/asm/swiotlb.h b/arch/powerpc/include/asm/swiotlb.h
> index de99d6e29430..b694e8399e28 100644
> --- a/arch/powerpc/include/asm/swiotlb.h
> +++ b/arch/powerpc/include/asm/swiotlb.h
> @@ -16,6 +16,7 @@
>  extern struct dma_map_ops swiotlb_dma_ops;
>  
>  static inline void dma_mark_clean(void *addr, size_t size) {}
> +static inline void dma_mark_dirty(void *addr, size_t size) {}
>  
>  extern unsigned int ppc_swiotlb_enable;
>  int __init swiotlb_setup_bus_notifier(void);
> diff --git a/arch/tile/include/asm/dma-mapping.h b/arch/tile/include/asm/dma-mapping.h
> index 96ac6cce4a32..79953f09e938 100644
> --- a/arch/tile/include/asm/dma-mapping.h
> +++ b/arch/tile/include/asm/dma-mapping.h
> @@ -58,6 +58,7 @@ static inline phys_addr_t dma_to_phys(struct device *dev, dma_addr_t daddr)
>  }
>  
>  static inline void dma_mark_clean(void *addr, size_t size) {}
> +static inline void dma_mark_dirty(void *addr, size_t size) {}
>  
>  static inline void set_dma_ops(struct device *dev, struct dma_map_ops *ops)
>  {
> diff --git a/arch/unicore32/include/asm/dma-mapping.h b/arch/unicore32/include/asm/dma-mapping.h
> index 8140e053ccd3..b9d357ab122d 100644
> --- a/arch/unicore32/include/asm/dma-mapping.h
> +++ b/arch/unicore32/include/asm/dma-mapping.h
> @@ -49,6 +49,7 @@ static inline phys_addr_t dma_to_phys(struct device *dev, dma_addr_t daddr)
>  }
>  
>  static inline void dma_mark_clean(void *addr, size_t size) {}
> +static inline void dma_mark_dirty(void *addr, size_t size) {}
>  
>  static inline void dma_cache_sync(struct device *dev, void *vaddr,
>  		size_t size, enum dma_data_direction direction)
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index db3622f22b61..f0b09156d7d8 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -841,6 +841,17 @@ config SWIOTLB
>  	  with more than 3 GB of memory.
>  	  If unsure, say Y.
>  
> +config SWIOTLB_PAGE_DIRTYING
> +	bool "SWIOTLB page dirtying"
> +	depends on SWIOTLB
> +	default n
> +	---help---
> +	  SWIOTLB page dirtying support provides a means for the guest to
> +	  trigger write faults on pages which received DMA from the device
> +	  without changing the data contained within.  By doing this the
> +	  guest can then support migration assuming the device and any
> +	  remaining pages are unmapped prior to the CPU itself being halted.
> +
>  config IOMMU_HELPER
>  	def_bool y
>  	depends on CALGARY_IOMMU || GART_IOMMU || SWIOTLB || AMD_IOMMU
> diff --git a/arch/x86/include/asm/swiotlb.h b/arch/x86/include/asm/swiotlb.h
> index ab05d73e2bb7..7f9f2e76d081 100644
> --- a/arch/x86/include/asm/swiotlb.h
> +++ b/arch/x86/include/asm/swiotlb.h
> @@ -29,6 +29,32 @@ static inline void pci_swiotlb_late_init(void)
>  
>  static inline void dma_mark_clean(void *addr, size_t size) {}
>  
> +/*
> + * Make certain that the pages get marked as dirty
> + * now that the device has completed the DMA transaction.
> + *
> + * Without this we run the risk of a guest migration missing
> + * the pages that the device has written to as they are not
> + * tracked as a part of the dirty page tracking.
> + */
> +static inline void dma_mark_dirty(void *addr, size_t size)
> +{
> +#ifdef CONFIG_SWIOTLB_PAGE_DIRTYING

I like where this is going.  However, as distributions don't like
shipping multiple kernels, I think we also need a way to configure
this at runtime, even if it is enabled at build time.

How about
- mark dirty is enabled at boot if requested (e.g. by kernel command line)
- mark dirty can later be disabled/enabled by sysctl

(Enabling at runtime might be a bit tricky as it has to
 sync with all CPUs - use e.g. RCU for this?).

This way a distro can use a guest agent to keep dirtying disabled
until just before migration starts.
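A rough userspace model of that switch (all names here are hypothetical; the real kernel version would more likely use a boot parameter plus a static key, with RCU or an explicit sync making sure every CPU observes a change before migration begins):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical runtime switch: in a real kernel this would be a static
 * key flipped by a sysctl, not a plain atomic flag.  The acquire/release
 * pairing models the "all CPUs must see the change" requirement. */
static atomic_bool swiotlb_dirtying_enabled;

static void swiotlb_dirtying_set(bool on)
{
	atomic_store_explicit(&swiotlb_dirtying_enabled, on,
			      memory_order_release);
}

/* Checked on every unmap/sync; cheap when dirtying is off. */
static bool dma_dirtying_needed(void)
{
	return atomic_load_explicit(&swiotlb_dirtying_enabled,
				    memory_order_acquire);
}
```

With this shape, the dummy writes cost nothing until the guest agent flips the flag ahead of a migration.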

> +	unsigned long pg_addr, start;
> +
> +	start = (unsigned long)addr;
> +	pg_addr = PAGE_ALIGN(start + size);
> +	start &= ~(sizeof(atomic_t) - 1);
> +
> +	/* trigger a write fault on each page, excluding first page */
> +	while ((pg_addr -= PAGE_SIZE) > start)
> +		atomic_add(0, (atomic_t *)pg_addr);
> +
> +	/* trigger a write fault on first word of DMA */
> +	atomic_add(0, (atomic_t *)start);

start might not be aligned correctly for a cast to atomic_t.
The dummy write is harmless for any memory, so I think you should
just do it on the first byte of every page, including the first one.
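Following that suggestion, the loop might look like this (again a userspace model with hypothetical names; the returned page count exists only so the sketch can be checked, and the kernel version would return void):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE 4096UL
#define PAGE_MASK (~(PAGE_SIZE - 1))

/* Variant per the review comment: do the zero-value atomic add on the
 * first (aligned) word of every page the buffer touches, first page
 * included, instead of special-casing the possibly-unaligned start.
 * In the kernel the page base is always mapped for a DMA buffer; the
 * userspace test below uses a page-aligned allocation for the same
 * reason. */
static size_t mark_dirty_all_pages(void *addr, size_t size)
{
	uintptr_t first = (uintptr_t)addr & PAGE_MASK;
	uintptr_t last = ((uintptr_t)addr + size - 1) & PAGE_MASK;
	size_t pages = 0;
	uintptr_t pg;

	for (pg = first; pg <= last; pg += PAGE_SIZE, pages++)
		atomic_fetch_add((atomic_int *)pg, 0);

	return pages;
}
```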


> +#endif /* CONFIG_SWIOTLB_PAGE_DIRTYING */
> +}
> +
>  extern void *x86_swiotlb_alloc_coherent(struct device *hwdev, size_t size,
>  					dma_addr_t *dma_handle, gfp_t flags,
>  					struct dma_attrs *attrs);
> diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
> index 2154c70e47da..1533b3eefb67 100644
> --- a/drivers/xen/swiotlb-xen.c
> +++ b/drivers/xen/swiotlb-xen.c
> @@ -456,6 +456,9 @@ void xen_swiotlb_unmap_page(struct device *hwdev, dma_addr_t dev_addr,
>  	 */
>  	if (dir == DMA_FROM_DEVICE)
>  		dma_mark_clean(phys_to_virt(paddr), size);
> +
> +	if (dir != DMA_TO_DEVICE)
> +		dma_mark_dirty(phys_to_virt(paddr), size);
>  }
>  EXPORT_SYMBOL_GPL(xen_swiotlb_unmap_page);
>  
> @@ -485,6 +488,9 @@ xen_swiotlb_sync_single_for_cpu(struct device *hwdev, dma_addr_t dev_addr,
>  
>  	if (dir == DMA_FROM_DEVICE)
>  		dma_mark_clean(phys_to_virt(paddr), size);
> +
> +	if (dir != DMA_TO_DEVICE)
> +		dma_mark_dirty(phys_to_virt(paddr), size);
>  }
>  EXPORT_SYMBOL_GPL(xen_swiotlb_sync_single_for_cpu);
>  
> diff --git a/lib/swiotlb.c b/lib/swiotlb.c
> index 384ac06217b2..4223d6c54724 100644
> --- a/lib/swiotlb.c
> +++ b/lib/swiotlb.c
> @@ -802,6 +802,9 @@ void swiotlb_unmap_page(struct device *hwdev, dma_addr_t dev_addr,
>  	 */
>  	if (dir == DMA_FROM_DEVICE)
>  		dma_mark_clean(phys_to_virt(paddr), size);
> +
> +	if (dir != DMA_TO_DEVICE)
> +		dma_mark_dirty(phys_to_virt(paddr), size);
>  }
>  EXPORT_SYMBOL_GPL(swiotlb_unmap_page);
>  
> @@ -830,6 +833,9 @@ swiotlb_sync_single_for_cpu(struct device *hwdev, dma_addr_t dev_addr,
>  
>  	if (dir == DMA_FROM_DEVICE)
>  		dma_mark_clean(phys_to_virt(paddr), size);
> +
> +	if (dir != DMA_TO_DEVICE)
> +		dma_mark_dirty(phys_to_virt(paddr), size);
>  }
>  EXPORT_SYMBOL(swiotlb_sync_single_for_cpu);
>  

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking
  2015-12-14  7:20         ` Yang Zhang
@ 2015-12-14 14:02           ` Michael S. Tsirkin
  0 siblings, 0 replies; 35+ messages in thread
From: Michael S. Tsirkin @ 2015-12-14 14:02 UTC (permalink / raw)
  To: Yang Zhang
  Cc: Alexander Duyck, Alexander Duyck, kvm, linux-pci, x86,
	linux-kernel, qemu-devel, Lan Tianyu, konrad.wilk,
	Dr. David Alan Gilbert, Alexander Graf, Alex Williamson

On Mon, Dec 14, 2015 at 03:20:26PM +0800, Yang Zhang wrote:
> On 2015/12/14 13:46, Alexander Duyck wrote:
> >On Sun, Dec 13, 2015 at 9:22 PM, Yang Zhang <yang.zhang.wz@gmail.com> wrote:
> >>On 2015/12/14 12:54, Alexander Duyck wrote:
> >>>
> >>>On Sun, Dec 13, 2015 at 6:27 PM, Yang Zhang <yang.zhang.wz@gmail.com>
> >>>wrote:
> >>>>
> >>>>On 2015/12/14 5:28, Alexander Duyck wrote:
> >>>>>
> >>>>>
> >>>>>This patch set is meant to be the guest side code for a proof of concept
> >>>>>involving leaving pass-through devices in the guest during the warm-up
> >>>>>phase of guest live migration.  In order to accomplish this I have added
> >>>>>a
> >>>>>new function called dma_mark_dirty that will mark the pages associated
> >>>>>with
> >>>>>the DMA transaction as dirty in the case of either an unmap or a
> >>>>>sync_.*_for_cpu where the DMA direction is either DMA_FROM_DEVICE or
> >>>>>DMA_BIDIRECTIONAL.  The pass-through device must still be removed before
> >>>>>the stop-and-copy phase, however allowing the device to be present
> >>>>>should
> >>>>>significantly improve the performance of the guest during the warm-up
> >>>>>period.
> >>>>>
> >>>>>This current implementation is very preliminary and there are number of
> >>>>>items still missing.  Specifically in order to make this a more complete
> >>>>>solution we need to support:
> >>>>>1.  Notifying hypervisor that drivers are dirtying DMA pages received
> >>>>>2.  Bypassing page dirtying when it is not needed.
> >>>>>
> >>>>
> >>>>Shouldn't current log dirty mechanism already cover them?
> >>>
> >>>
> >>>The guest has no way of currently knowing that the hypervisor is doing
> >>>dirty page logging, and the log dirty mechanism currently has no way
> >>>of tracking device DMA accesses.  This change is meant to bridge the
> >>>two so that the guest device driver will force the SWIOTLB DMA API to
> >>>mark pages written to by the device as dirty.
> >>
> >>
> >>OK. This is what we called "dummy write mechanism". Actually, this is just a
> >>workaround before iommu dirty bit ready. Eventually, we need to change to
> >>use the hardware dirty bit. Besides, we may still lost the data if dma
> >>happens during/just before stop and copy phase.
> >
> >Right, this is a "dummy write mechanism" in order to allow for entry
> >tracking.  This only works completely if we force the hardware to
> >quiesce via a hot-plug event before we reach the stop-and-copy phase
> >of the migration.
> >
> >The IOMMU dirty bit approach is likely going to have a significant
> >number of challenges involved.  Looking over the driver and the data
> >sheet it looks like the current implementation is using a form of huge
> >pages in the IOMMU, as such we will need to tear that down and replace
> >it with 4K pages if we don't want to dirty large regions with each DMA
> 
> Yes, we need to split the huge page into small pages to get the small dirty
> range.
> 
> >transaction, and I'm not sure that is something we can change while
> >DMA is active to the affected regions.  In addition the data sheet
> 
> what changes do you mean?
> 
> >references the fact that the page table entries are stored in a
> >translation cache and in order to sync things up you have to
> >invalidate the entries.  I'm not sure what the total overhead would be
> >for invalidating something like a half million 4K pages to migrate a
> >guest with just 2G of RAM, but I would think that might be a bit
> 
> Do you mean the cost of submit the flush request or the performance
> impaction due to IOTLB miss? For the former, we have domain-selective
> invalidation. For the latter, it would be acceptable since live migration
> shouldn't last too long.

That's a pretty weak argument - if migration time is short and speed
does not matter during migration, then all this work is pointless;
temporarily switching to a virtual card would be preferable.

> >expensive given the fact that IOMMU accesses aren't known for being
> >incredibly fast when invalidating DMA on the host.
> >
> >- Alex
> >
> 
> 
> -- 
> best regards
> yang

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 3/3] x86: Create dma_mark_dirty to dirty pages used for DMA by VM guest
  2015-12-14 14:00   ` Michael S. Tsirkin
@ 2015-12-14 16:34     ` Alexander Duyck
  2015-12-14 17:20       ` Michael S. Tsirkin
  0 siblings, 1 reply; 35+ messages in thread
From: Alexander Duyck @ 2015-12-14 16:34 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, kvm, linux-pci, x86, linux-kernel, qemu-devel,
	Lan Tianyu, Yang Zhang, konrad.wilk, Dr. David Alan Gilbert,
	Alexander Graf, Alex Williamson

On Mon, Dec 14, 2015 at 6:00 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Sun, Dec 13, 2015 at 01:28:31PM -0800, Alexander Duyck wrote:
>> This patch is meant to provide the guest with a way of flagging DMA pages
>> as being dirty to the host when using a direct-assign device within a
>> guest.  The advantage to this approach is that it is fairly simple; however,
>> it currently has a significant impact on device performance in all the
>> scenarios where it won't be needed.
>>
>> As such this is really meant only as a proof of concept and to get the ball
>> rolling in terms of figuring out how best to approach the issue of dirty
>> page tracking for a guest that is using a direct assigned device.  In
>> addition with just this patch it should be possible to modify current
>> migration approaches so that instead of having to hot-remove the device
>> before starting the migration this can instead be delayed until the period
>> before the final stop and copy.
>>
>> Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
>> ---
>>  arch/arm/include/asm/dma-mapping.h       |    3 ++-
>>  arch/arm64/include/asm/dma-mapping.h     |    5 ++---
>>  arch/ia64/include/asm/dma.h              |    1 +
>>  arch/mips/include/asm/dma-mapping.h      |    1 +
>>  arch/powerpc/include/asm/swiotlb.h       |    1 +
>>  arch/tile/include/asm/dma-mapping.h      |    1 +
>>  arch/unicore32/include/asm/dma-mapping.h |    1 +
>>  arch/x86/Kconfig                         |   11 +++++++++++
>>  arch/x86/include/asm/swiotlb.h           |   26 ++++++++++++++++++++++++++
>>  drivers/xen/swiotlb-xen.c                |    6 ++++++
>>  lib/swiotlb.c                            |    6 ++++++
>>  11 files changed, 58 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/arm/include/asm/dma-mapping.h b/arch/arm/include/asm/dma-mapping.h
>> index ccb3aa64640d..1962d7b471c7 100644
>> --- a/arch/arm/include/asm/dma-mapping.h
>> +++ b/arch/arm/include/asm/dma-mapping.h
>> @@ -167,7 +167,8 @@ static inline bool dma_capable(struct device *dev, dma_addr_t addr, size_t size)
>>       return 1;
>>  }
>>
>> -static inline void dma_mark_clean(void *addr, size_t size) { }
>> +static inline void dma_mark_clean(void *addr, size_t size) {}
>> +static inline void dma_mark_dirty(void *addr, size_t size) {}
>>
>>  extern int arm_dma_set_mask(struct device *dev, u64 dma_mask);
>>
>> diff --git a/arch/arm64/include/asm/dma-mapping.h b/arch/arm64/include/asm/dma-mapping.h
>> index 61e08f360e31..8d24fe11c8a3 100644
>> --- a/arch/arm64/include/asm/dma-mapping.h
>> +++ b/arch/arm64/include/asm/dma-mapping.h
>> @@ -84,9 +84,8 @@ static inline bool dma_capable(struct device *dev, dma_addr_t addr, size_t size)
>>       return addr + size - 1 <= *dev->dma_mask;
>>  }
>>
>> -static inline void dma_mark_clean(void *addr, size_t size)
>> -{
>> -}
>> +static inline void dma_mark_clean(void *addr, size_t size) {}
>> +static inline void dma_mark_dirty(void *addr, size_t size) {}
>>
>>  #endif       /* __KERNEL__ */
>>  #endif       /* __ASM_DMA_MAPPING_H */
>> diff --git a/arch/ia64/include/asm/dma.h b/arch/ia64/include/asm/dma.h
>> index 4d97f60f1ef5..d92ebeb2758e 100644
>> --- a/arch/ia64/include/asm/dma.h
>> +++ b/arch/ia64/include/asm/dma.h
>> @@ -20,5 +20,6 @@ extern unsigned long MAX_DMA_ADDRESS;
>>  #define free_dma(x)
>>
>>  void dma_mark_clean(void *addr, size_t size);
>> +static inline void dma_mark_dirty(void *addr, size_t size) {}
>>
>>  #endif /* _ASM_IA64_DMA_H */
>> diff --git a/arch/mips/include/asm/dma-mapping.h b/arch/mips/include/asm/dma-mapping.h
>> index e604f760c4a0..567f6e03e337 100644
>> --- a/arch/mips/include/asm/dma-mapping.h
>> +++ b/arch/mips/include/asm/dma-mapping.h
>> @@ -28,6 +28,7 @@ static inline bool dma_capable(struct device *dev, dma_addr_t addr, size_t size)
>>  }
>>
>>  static inline void dma_mark_clean(void *addr, size_t size) {}
>> +static inline void dma_mark_dirty(void *addr, size_t size) {}
>>
>>  #include <asm-generic/dma-mapping-common.h>
>>
>> diff --git a/arch/powerpc/include/asm/swiotlb.h b/arch/powerpc/include/asm/swiotlb.h
>> index de99d6e29430..b694e8399e28 100644
>> --- a/arch/powerpc/include/asm/swiotlb.h
>> +++ b/arch/powerpc/include/asm/swiotlb.h
>> @@ -16,6 +16,7 @@
>>  extern struct dma_map_ops swiotlb_dma_ops;
>>
>>  static inline void dma_mark_clean(void *addr, size_t size) {}
>> +static inline void dma_mark_dirty(void *addr, size_t size) {}
>>
>>  extern unsigned int ppc_swiotlb_enable;
>>  int __init swiotlb_setup_bus_notifier(void);
>> diff --git a/arch/tile/include/asm/dma-mapping.h b/arch/tile/include/asm/dma-mapping.h
>> index 96ac6cce4a32..79953f09e938 100644
>> --- a/arch/tile/include/asm/dma-mapping.h
>> +++ b/arch/tile/include/asm/dma-mapping.h
>> @@ -58,6 +58,7 @@ static inline phys_addr_t dma_to_phys(struct device *dev, dma_addr_t daddr)
>>  }
>>
>>  static inline void dma_mark_clean(void *addr, size_t size) {}
>> +static inline void dma_mark_dirty(void *addr, size_t size) {}
>>
>>  static inline void set_dma_ops(struct device *dev, struct dma_map_ops *ops)
>>  {
>> diff --git a/arch/unicore32/include/asm/dma-mapping.h b/arch/unicore32/include/asm/dma-mapping.h
>> index 8140e053ccd3..b9d357ab122d 100644
>> --- a/arch/unicore32/include/asm/dma-mapping.h
>> +++ b/arch/unicore32/include/asm/dma-mapping.h
>> @@ -49,6 +49,7 @@ static inline phys_addr_t dma_to_phys(struct device *dev, dma_addr_t daddr)
>>  }
>>
>>  static inline void dma_mark_clean(void *addr, size_t size) {}
>> +static inline void dma_mark_dirty(void *addr, size_t size) {}
>>
>>  static inline void dma_cache_sync(struct device *dev, void *vaddr,
>>               size_t size, enum dma_data_direction direction)
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index db3622f22b61..f0b09156d7d8 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -841,6 +841,17 @@ config SWIOTLB
>>         with more than 3 GB of memory.
>>         If unsure, say Y.
>>
>> +config SWIOTLB_PAGE_DIRTYING
>> +     bool "SWIOTLB page dirtying"
>> +     depends on SWIOTLB
>> +     default n
>> +     ---help---
>> +       SWIOTLB page dirtying support provides a means for the guest to
>> +       trigger write faults on pages which received DMA from the device
>> +       without changing the data contained within.  By doing this the
>> +       guest can then support migration assuming the device and any
>> +       remaining pages are unmapped prior to the CPU itself being halted.
>> +
>>  config IOMMU_HELPER
>>       def_bool y
>>       depends on CALGARY_IOMMU || GART_IOMMU || SWIOTLB || AMD_IOMMU
>> diff --git a/arch/x86/include/asm/swiotlb.h b/arch/x86/include/asm/swiotlb.h
>> index ab05d73e2bb7..7f9f2e76d081 100644
>> --- a/arch/x86/include/asm/swiotlb.h
>> +++ b/arch/x86/include/asm/swiotlb.h
>> @@ -29,6 +29,32 @@ static inline void pci_swiotlb_late_init(void)
>>
>>  static inline void dma_mark_clean(void *addr, size_t size) {}
>>
>> +/*
>> + * Make certain that the pages get marked as dirty
>> + * now that the device has completed the DMA transaction.
>> + *
>> + * Without this we run the risk of a guest migration missing
>> + * the pages that the device has written to as they are not
>> + * tracked as a part of the dirty page tracking.
>> + */
>> +static inline void dma_mark_dirty(void *addr, size_t size)
>> +{
>> +#ifdef CONFIG_SWIOTLB_PAGE_DIRTYING
>
> I like where this is going. However
> as distributions don't like shipping multiple kernels,
> I think we also need a way to configure this
> at runtime, even if enabled at build time.

Agreed.  Like I said in the cover page this is just needed until we can
come up with a way to limit the scope.  Then we could probably default
this to Y and distributions can have it enabled by default.

> How about
> - mark dirty is enabled at boot if requested (e.g. by kernel command line)
> - mark dirty can later be disabled/enabled by sysctl
>
> (Enabling at runtime might be a bit tricky as it has to
>  sync with all CPUs - use e.g. RCU for this?).

I was considering RCU but I am still not sure it is the best way to go
since all we essentially need to do is swap a couple of function
pointers.  I was thinking of making use of the dma_ops pointer
contained in dev_archdata.  If I were to create two dma_ops setups,
one with standard swiotlb and one with a dirty-page version of the
unmap and sync calls, then it is just a matter of assigning a
pointer to enable the DMA page dirtying and clearing the pointer to
disable it.  An alternative might be to just add a device-specific
flag and then pass the device to the dma_mark_dirty function.  I'm
still debating the possible options.

> This way distro can use a guest agent to disable
> dirtying until before migration starts.

Right.  For a v2 version I would definitely want to have some way to
limit the scope of this.  My main reason for putting this out here is
to start altering the course of discussions since it seems like we
weren't getting anywhere with the ixgbevf migration changes that were
being proposed.

>> +     unsigned long pg_addr, start;
>> +
>> +     start = (unsigned long)addr;
>> +     pg_addr = PAGE_ALIGN(start + size);
>> +     start &= ~(sizeof(atomic_t) - 1);
>> +
>> +     /* trigger a write fault on each page, excluding first page */
>> +     while ((pg_addr -= PAGE_SIZE) > start)
>> +             atomic_add(0, (atomic_t *)pg_addr);
>> +
>> +     /* trigger a write fault on first word of DMA */
>> +     atomic_add(0, (atomic_t *)start);
>
> start might not be aligned correctly for a cast to atomic_t.
> It's harmless to do this for any memory, so I think you should
> just do this for 1st byte of all pages including the first one.

You may not have noticed it but I actually aligned start in the line
after pg_addr.  However instead of aligning to the start of the next
atomic_t I just masked off the lower bits so that we start at the
DWORD that contains the first byte of the starting address.  The
assumption here is that I cannot trigger any sort of fault since if I
have access to a given byte within a DWORD I will have access to the
entire DWORD.  I coded this up so that the spots where we touch the
memory should match up with addresses provided by the hardware to
perform the DMA over the PCI bus.

Also I intentionally ran from highest address to lowest since that way
we don't risk pushing the first cache line of the DMA buffer out of
the L1 cache due to the PAGE_SIZE stride.

- Alex

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 3/3] x86: Create dma_mark_dirty to dirty pages used for DMA by VM guest
  2015-12-14 16:34     ` Alexander Duyck
@ 2015-12-14 17:20       ` Michael S. Tsirkin
  2015-12-14 17:59         ` Alexander Duyck
  0 siblings, 1 reply; 35+ messages in thread
From: Michael S. Tsirkin @ 2015-12-14 17:20 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alexander Duyck, kvm, linux-pci, x86, linux-kernel, qemu-devel,
	Lan Tianyu, Yang Zhang, konrad.wilk, Dr. David Alan Gilbert,
	Alexander Graf, Alex Williamson

On Mon, Dec 14, 2015 at 08:34:00AM -0800, Alexander Duyck wrote:
> > This way distro can use a guest agent to disable
> > dirtying until before migration starts.
> 
> Right.  For a v2 version I would definitely want to have some way to
> limit the scope of this.  My main reason for putting this out here is
> to start altering the course of discussions since it seems like were
> weren't getting anywhere with the ixgbevf migration changes that were
> being proposed.

Absolutely, thanks for working on this.

> >> +     unsigned long pg_addr, start;
> >> +
> >> +     start = (unsigned long)addr;
> >> +     pg_addr = PAGE_ALIGN(start + size);
> >> +     start &= ~(sizeof(atomic_t) - 1);
> >> +
> >> +     /* trigger a write fault on each page, excluding first page */
> >> +     while ((pg_addr -= PAGE_SIZE) > start)
> >> +             atomic_add(0, (atomic_t *)pg_addr);
> >> +
> >> +     /* trigger a write fault on first word of DMA */
> >> +     atomic_add(0, (atomic_t *)start);
> >
> > start might not be aligned correctly for a cast to atomic_t.
> > It's harmless to do this for any memory, so I think you should
> > just do this for 1st byte of all pages including the first one.
> 
> You may not have noticed it but I actually aligned start in the line
> after pg_addr.

Yes you did. alignof would make it a bit more noticeable.

>  However instead of aligning to the start of the next
> atomic_t I just masked off the lower bits so that we start at the
> DWORD that contains the first byte of the starting address.  The
> assumption here is that I cannot trigger any sort of fault since if I
> have access to a given byte within a DWORD I will have access to the
> entire DWORD.

I'm curious where this comes from.  Isn't it true that access is
normally controlled at page granularity, so you can touch the beginning
of the page just as well?

>  I coded this up so that the spots where we touch the
> memory should match up with addresses provided by the hardware to
> perform the DMA over the PCI bus.

Yes but there's no requirement to do it like this from
virt POV. You just need to touch each page.

> Also I intentionally ran from highest address to lowest since that way
> we don't risk pushing the first cache line of the DMA buffer out of
> the L1 cache due to the PAGE_SIZE stride.
> 
> - Alex

Interesting. How does order of access help with this?

By the way, if you are into these micro-optimizations you might want to
limit prefetch; to this end you want to access the last line of the
page.  And it's probably worth benchmarking a bit rather than doing it all
just based on theory; otherwise keep the code simple in v1.

-- 
MST


* Re: [RFC PATCH 3/3] x86: Create dma_mark_dirty to dirty pages used for DMA by VM guest
  2015-12-14 17:20       ` Michael S. Tsirkin
@ 2015-12-14 17:59         ` Alexander Duyck
  2015-12-14 20:52           ` Michael S. Tsirkin
  0 siblings, 1 reply; 35+ messages in thread
From: Alexander Duyck @ 2015-12-14 17:59 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, kvm, linux-pci, x86, linux-kernel, qemu-devel,
	Lan Tianyu, Yang Zhang, konrad.wilk, Dr. David Alan Gilbert,
	Alexander Graf, Alex Williamson

On Mon, Dec 14, 2015 at 9:20 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Mon, Dec 14, 2015 at 08:34:00AM -0800, Alexander Duyck wrote:
>> > This way distro can use a guest agent to disable
>> > dirtying until before migration starts.
>>
>> Right.  For a v2 version I would definitely want to have some way to
>> limit the scope of this.  My main reason for putting this out here is
>> to start altering the course of discussions since it seems like were
>> weren't getting anywhere with the ixgbevf migration changes that were
>> being proposed.
>
> Absolutely, thanks for working on this.
>
>> >> +     unsigned long pg_addr, start;
>> >> +
>> >> +     start = (unsigned long)addr;
>> >> +     pg_addr = PAGE_ALIGN(start + size);
>> >> +     start &= ~(sizeof(atomic_t) - 1);
>> >> +
>> >> +     /* trigger a write fault on each page, excluding first page */
>> >> +     while ((pg_addr -= PAGE_SIZE) > start)
>> >> +             atomic_add(0, (atomic_t *)pg_addr);
>> >> +
>> >> +     /* trigger a write fault on first word of DMA */
>> >> +     atomic_add(0, (atomic_t *)start);
>> >
>> > start might not be aligned correctly for a cast to atomic_t.
>> > It's harmless to do this for any memory, so I think you should
>> > just do this for 1st byte of all pages including the first one.
>>
>> You may not have noticed it but I actually aligned start in the line
>> after pg_addr.
>
> Yes you did. alignof would make it a bit more noticeable.
>
>>  However instead of aligning to the start of the next
>> atomic_t I just masked off the lower bits so that we start at the
>> DWORD that contains the first byte of the starting address.  The
>> assumption here is that I cannot trigger any sort of fault since if I
>> have access to a given byte within a DWORD I will have access to the
>> entire DWORD.
>
> I'm curious where does this come from.  Isn't it true that access is
> controlled at page granularity normally, so you can touch beginning of
> page just as well?

Yeah, I am pretty sure it probably is page granularity.  However my
thought was to try and stick to the start of the DMA as the last
access.  That way we don't pull in any more cache lines than we need
to in order to dirty the pages.  Usually the start of the DMA region
will contain some sort of headers or something that needs to be
accessed with the highest priority so I wanted to make certain that we
were forcing usable data into the L1 cache rather than just the first
cache line of the page where the DMA started.  If however the start of
a DMA was the start of the page there is nothing there to prevent
that.

>>  I coded this up so that the spots where we touch the
>> memory should match up with addresses provided by the hardware to
>> perform the DMA over the PCI bus.
>
> Yes but there's no requirement to do it like this from
> virt POV. You just need to touch each page.

I know, but at the same time if we match up with the DMA then it is
more likely that we avoid grabbing unneeded cache lines.  In the case
of most drivers the data for headers and start is at the start of the
DMA.  So if we dirty the cache line associated with the start of the
DMA it will be pulled into the L1 cache and there is a greater chance
that it may already be prefetched as well.

>> Also I intentionally ran from highest address to lowest since that way
>> we don't risk pushing the first cache line of the DMA buffer out of
>> the L1 cache due to the PAGE_SIZE stride.
>
> Interesting. How does order of access help with this?

If you use a PAGE_SIZE stride you will start evicting things from the
L1 cache after something like 8 accesses on an x86 processor, as most
of the recent ones have a 32K 8-way associative L1 cache.  So if I go
from back to front then I evict the stuff that would likely be in the
data portion of a buffer instead of the headers, which are usually
located at the front.

> By the way, if you are into these micro-optimizations you might want to
> limit prefetch, to this end you want to access the last line of the
> page.  And it's probably worth benchmarking a bit and not doing it all just
> based on theory, keep code simple in v1 otherwise.

My main goal for now is functional code over high performance code.
That is why I have kept this code fairly simple.  I might have done
some optimization but it was as much about the optimization as keeping
the code simple.  For example by using the start of the page instead
of the end I could easily do the comparison against start and avoid
doing more than one write per page.

The issue for me doing performance testing is that I don't have
anything that uses DMA blocks that are actually big enough to make use
of the PAGE_SIZE stride.  That is why the PAGE_SIZE stride portion is
mostly just theoretical.  I just have a few NICs and most of them only
allocate 1 page or so for DMA buffers.  What little benchmarking I
have done with netperf only showed a ~1% CPU penalty for the page
dirtying code.  For setups where we did more with the DMA such as
small packet handling I would expect that value to increase, but I
still wouldn't expect to see a penalty of much more than ~5% most
likely since there are still a number of other items that are calling
atomic operations as well such as the code for releasing pages.

- Alex


* Re: [RFC PATCH 3/3] x86: Create dma_mark_dirty to dirty pages used for DMA by VM guest
  2015-12-14 17:59         ` Alexander Duyck
@ 2015-12-14 20:52           ` Michael S. Tsirkin
  2015-12-14 22:32             ` Alexander Duyck
  0 siblings, 1 reply; 35+ messages in thread
From: Michael S. Tsirkin @ 2015-12-14 20:52 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alexander Duyck, kvm, linux-pci, x86, linux-kernel, qemu-devel,
	Lan Tianyu, Yang Zhang, konrad.wilk, Dr. David Alan Gilbert,
	Alexander Graf, Alex Williamson

On Mon, Dec 14, 2015 at 09:59:13AM -0800, Alexander Duyck wrote:
> On Mon, Dec 14, 2015 at 9:20 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Mon, Dec 14, 2015 at 08:34:00AM -0800, Alexander Duyck wrote:
> >> > This way distro can use a guest agent to disable
> >> > dirtying until before migration starts.
> >>
> >> Right.  For a v2 version I would definitely want to have some way to
> >> limit the scope of this.  My main reason for putting this out here is
> >> to start altering the course of discussions since it seems like were
> >> weren't getting anywhere with the ixgbevf migration changes that were
> >> being proposed.
> >
> > Absolutely, thanks for working on this.
> >
> >> >> +     unsigned long pg_addr, start;
> >> >> +
> >> >> +     start = (unsigned long)addr;
> >> >> +     pg_addr = PAGE_ALIGN(start + size);
> >> >> +     start &= ~(sizeof(atomic_t) - 1);
> >> >> +
> >> >> +     /* trigger a write fault on each page, excluding first page */
> >> >> +     while ((pg_addr -= PAGE_SIZE) > start)
> >> >> +             atomic_add(0, (atomic_t *)pg_addr);
> >> >> +
> >> >> +     /* trigger a write fault on first word of DMA */
> >> >> +     atomic_add(0, (atomic_t *)start);

Actually, I have second thoughts about using atomic_add here,
especially for _sync.

Many architectures do

#define ATOMIC_OP_RETURN(op, c_op)                                      \
static inline int atomic_##op##_return(int i, atomic_t *v)              \
{                                                                       \
        unsigned long flags;                                            \
        int ret;                                                        \
                                                                        \
        raw_local_irq_save(flags);                                      \
        ret = (v->counter = v->counter c_op i);                         \
        raw_local_irq_restore(flags);                                   \
                                                                        \
        return ret;                                                     \
}

and this is not safe if device is still doing DMA to/from
this memory.

Generally, atomic_t is there for SMP effects, not for sync
with devices.

This is why I said you should do
	cmpxchg(pg_addr, 0xdead, 0xdead); 

Yes, we probably never actually want to run m68k within a VM,
but let's not misuse interfaces like this.


> >> >
> >> > start might not be aligned correctly for a cast to atomic_t.
> >> > It's harmless to do this for any memory, so I think you should
> >> > just do this for 1st byte of all pages including the first one.
> >>
> >> You may not have noticed it but I actually aligned start in the line
> >> after pg_addr.
> >
> > Yes you did. alignof would make it a bit more noticeable.
> >
> >>  However instead of aligning to the start of the next
> >> atomic_t I just masked off the lower bits so that we start at the
> >> DWORD that contains the first byte of the starting address.  The
> >> assumption here is that I cannot trigger any sort of fault since if I
> >> have access to a given byte within a DWORD I will have access to the
> >> entire DWORD.
> >
> > I'm curious where does this come from.  Isn't it true that access is
> > controlled at page granularity normally, so you can touch beginning of
> > page just as well?
> 
> Yeah, I am pretty sure it probably is page granularity.  However my
> thought was to try and stick to the start of the DMA as the last
> access.  That way we don't pull in any more cache lines than we need
> to in order to dirty the pages.  Usually the start of the DMA region
> will contain some sort of headers or something that needs to be
> accessed with the highest priority so I wanted to make certain that we
> were forcing usable data into the L1 cache rather than just the first
> cache line of the page where the DMA started.  If however the start of
> a DMA was the start of the page there is nothing there to prevent
> that.

OK, maybe this helps. You should document all these tricks
in code comments.

> >>  I coded this up so that the spots where we touch the
> >> memory should match up with addresses provided by the hardware to
> >> perform the DMA over the PCI bus.
> >
> > Yes but there's no requirement to do it like this from
> > virt POV. You just need to touch each page.
> 
> I know, but at the same time if we match up with the DMA then it is
> more likely that we avoid grabbing unneeded cache lines.  In the case
> of most drivers the data for headers and start is at the start of the
> DMA.  So if we dirty the cache line associated with the start of the
> DMA it will be pulled into the L1 cache and there is a greater chance
> that it may already be prefetched as well.
> 
> >> Also I intentionally ran from highest address to lowest since that way
> >> we don't risk pushing the first cache line of the DMA buffer out of
> >> the L1 cache due to the PAGE_SIZE stride.
> >
> > Interesting. How does order of access help with this?
> 
> If you use a PAGE_SIZE stride you will start evicting things from L1
> cache after something like 8 accesses on an x86 processor as most of
> the recent ones have a 32K 8 way associative L1 cache.  So if I go
> from back to front then I evict the stuff that would likely be in the
> data portion of a buffer instead of headers which are usually located
> at the front.

I see, interesting.

> > By the way, if you are into these micro-optimizations you might want to
> > limit prefetch, to this end you want to access the last line of the
> > page.  And it's probably worth benchmarking a bit and not doing it all just
> > based on theory, keep code simple in v1 otherwise.
> 
> My main goal for now is functional code over high performance code.
> That is why I have kept this code fairly simple.  I might have done
> some optimization but it was as much about the optimization as keeping
> the code simple.

Well you were trying to avoid putting extra stress on
the cache, and it seems clear to me that prefetch
is not your friend here. So
-             atomic_add(0, (atomic_t *)pg_addr);
+             atomic_add(0, (atomic_t *)(pg_addr + PAGE_SIZE - sizeof(atomic_t)));
(or whatever we change atomic_t to) is probably a win.

>  For example by using the start of the page instead
> of the end I could easily do the comparison against start and avoid
> doing more than one write per page.

That's probably worth fixing, we don't want two atomics
if we can help it.

-     while ((pg_addr -= PAGE_SIZE) > start)
+     while ((pg_addr -= PAGE_SIZE) >= PAGE_ALIGN(start + PAGE_SIZE))

will do it with no fuss.

> The issue for me doing performance testing is that I don't have
> anything that uses DMA blocks that are actually big enough to make use
> of the PAGE_SIZE stride.  That is why the PAGE_SIZE stride portion is
> mostly just theoretical.  I just have a few NICs and most of them only
> allocate 1 page or so for DMA buffers.  What little benchmarking I
> have done with netperf only showed a ~1% CPU penalty for the page
> dirtying code.  For setups where we did more with the DMA such as
> small packet handling I would expect that value to increase, but I
> still wouldn't expect to see a penalty of much more than ~5% most
> likely since there are still a number of other items that are calling
> atomic operations as well such as the code for releasing pages.
> 
> - Alex


* Re: [RFC PATCH 3/3] x86: Create dma_mark_dirty to dirty pages used for DMA by VM guest
  2015-12-14 20:52           ` Michael S. Tsirkin
@ 2015-12-14 22:32             ` Alexander Duyck
  0 siblings, 0 replies; 35+ messages in thread
From: Alexander Duyck @ 2015-12-14 22:32 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, kvm, linux-pci, x86, linux-kernel, qemu-devel,
	Lan Tianyu, Yang Zhang, konrad.wilk, Dr. David Alan Gilbert,
	Alexander Graf, Alex Williamson

On Mon, Dec 14, 2015 at 12:52 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Mon, Dec 14, 2015 at 09:59:13AM -0800, Alexander Duyck wrote:
>> On Mon, Dec 14, 2015 at 9:20 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > On Mon, Dec 14, 2015 at 08:34:00AM -0800, Alexander Duyck wrote:
>> >> > This way distro can use a guest agent to disable
>> >> > dirtying until before migration starts.
>> >>
>> >> Right.  For a v2 version I would definitely want to have some way to
>> >> limit the scope of this.  My main reason for putting this out here is
>> >> to start altering the course of discussions since it seems like were
>> >> weren't getting anywhere with the ixgbevf migration changes that were
>> >> being proposed.
>> >
>> > Absolutely, thanks for working on this.
>> >
>> >> >> +     unsigned long pg_addr, start;
>> >> >> +
>> >> >> +     start = (unsigned long)addr;
>> >> >> +     pg_addr = PAGE_ALIGN(start + size);
>> >> >> +     start &= ~(sizeof(atomic_t) - 1);
>> >> >> +
>> >> >> +     /* trigger a write fault on each page, excluding first page */
>> >> >> +     while ((pg_addr -= PAGE_SIZE) > start)
>> >> >> +             atomic_add(0, (atomic_t *)pg_addr);
>> >> >> +
>> >> >> +     /* trigger a write fault on first word of DMA */
>> >> >> +     atomic_add(0, (atomic_t *)start);
>
> Actually, I have second thoughts about using atomic_add here,
> especially for _sync.
>
> Many architectures do
>
> #define ATOMIC_OP_RETURN(op, c_op)                                      \
> static inline int atomic_##op##_return(int i, atomic_t *v)              \
> {                                                                       \
>         unsigned long flags;                                            \
>         int ret;                                                        \
>                                                                         \
>         raw_local_irq_save(flags);                                      \
>         ret = (v->counter = v->counter c_op i);                         \
>         raw_local_irq_restore(flags);                                   \
>                                                                         \
>         return ret;                                                     \
> }
>
> and this is not safe if device is still doing DMA to/from
> this memory.
>
> Generally, atomic_t is there for SMP effects, not for sync
> with devices.
>
> This is why I said you should do
>         cmpxchg(pg_addr, 0xdead, 0xdead);
>
> Yes, we probably never actually want to run m68k within a VM,
> but let's not misuse interfaces like this.

Right now this implementation is for x86 only.  Any other architecture
currently reports dma_mark_dirty as an empty inline function.  The
reason why I chose the atomic_add for x86 is simply because it is
guaranteed to dirty the cache line with relatively few instructions and
operands, as all I have to have is the pointer and 0.

For the m68k we could implement it as a cmpxchg instead.  The general
thought here is that each architecture is probably going to have to do
it a little bit differently.

>> >> >
>> >> > start might not be aligned correctly for a cast to atomic_t.
>> >> > It's harmless to do this for any memory, so I think you should
>> >> > just do this for 1st byte of all pages including the first one.
>> >>
>> >> You may not have noticed it but I actually aligned start in the line
>> >> after pg_addr.
>> >
>> > Yes you did. alignof would make it a bit more noticeable.
>> >
>> >>  However instead of aligning to the start of the next
>> >> atomic_t I just masked off the lower bits so that we start at the
>> >> DWORD that contains the first byte of the starting address.  The
>> >> assumption here is that I cannot trigger any sort of fault since if I
>> >> have access to a given byte within a DWORD I will have access to the
>> >> entire DWORD.
>> >
>> > I'm curious where does this come from.  Isn't it true that access is
>> > controlled at page granularity normally, so you can touch beginning of
>> > page just as well?
>>
>> Yeah, I am pretty sure it probably is page granularity.  However my
>> thought was to try and stick to the start of the DMA as the last
>> access.  That way we don't pull in any more cache lines than we need
>> to in order to dirty the pages.  Usually the start of the DMA region
>> will contain some sort of headers or something that needs to be
>> accessed with the highest priority so I wanted to make certain that we
>> were forcing usable data into the L1 cache rather than just the first
>> cache line of the page where the DMA started.  If however the start of
>> a DMA was the start of the page there is nothing there to prevent
>> that.
>
> OK, maybe this helps. You should document all these tricks
> in code comments.

I'll try to get that taken care of for v2.

>> >>  I coded this up so that the spots where we touch the
>> >> memory should match up with addresses provided by the hardware to
>> >> perform the DMA over the PCI bus.
>> >
>> > Yes but there's no requirement to do it like this from
>> > virt POV. You just need to touch each page.
>>
>> I know, but at the same time if we match up with the DMA then it is
>> more likely that we avoid grabbing unneeded cache lines.  In the case
>> of most drivers the data for headers and start is at the start of the
>> DMA.  So if we dirty the cache line associated with the start of the
>> DMA it will be pulled into the L1 cache and there is a greater chance
>> that it may already be prefetched as well.
>>
>> >> Also I intentionally ran from highest address to lowest since that way
>> >> we don't risk pushing the first cache line of the DMA buffer out of
>> >> the L1 cache due to the PAGE_SIZE stride.
>> >
>> > Interesting. How does order of access help with this?
>>
>> If you use a PAGE_SIZE stride you will start evicting things from L1
>> cache after something like 8 accesses on an x86 processor as most of
>> the recent ones have a 32K 8 way associative L1 cache.  So if I go
>> from back to front then I evict the stuff that would likely be in the
>> data portion of a buffer instead of headers which are usually located
>> at the front.
>
> I see, interesting.
>
>> > By the way, if you are into these micro-optimizations you might want to
>> > limit prefetch, to this end you want to access the last line of the
>> > page.  And it's probably worth benchmarking a bit and not doing it all just
>> > based on theory, keep code simple in v1 otherwise.
>>
>> My main goal for now is functional code over high performance code.
>> That is why I have kept this code fairly simple.  I might have done
>> some optimization but it was as much about the optimization as keeping
>> the code simple.
>
> Well you were trying to avoid putting extra stress on
> the cache, and it seems clear to me that prefetch
> is not your friend here. So
> -             atomic_add(0, (atomic_t *)pg_addr);
> +             atomic_add(0, (atomic_t *)(pg_addr + PAGE_SIZE - sizeof(atomic_t));
> (or whatever we change atomic_t to) is probably a win.

What is the advantage to writing to the last field in the page versus
the first?  I think that is the part I am not getting.

>>  For example by using the start of the page instead
>> of the end I could easily do the comparison against start and avoid
>> doing more than one write per page.
>
> That's probably worth fixing, we don't want two atomics
> if we can help it.
>
> -     while ((pg_addr -= PAGE_SIZE) > start)
> +     while ((pg_addr -= PAGE_SIZE) >= PAGE_ALIGN(start + PAGE_SIZE))
>
> will do it with no fuss.

I'm still not seeing what the gain here is.  It just seems like it is
making things more complicated.

The main goal of keeping things inside the DMA region is to avoid
unnecessary cache bouncing.  Reaching out and dirtying cache lines that
we aren't actually using seems really wasteful.  If, for example, a page
were split between two CPUs, each doing DMA on its own half, I wouldn't
want both devices dirtying the same cache line; I would rather have them
each mark a separate cache line in order to avoid cache thrash.  By
having the start aligned with the start of the DMA, and all of the other
entries aligned with the starts of the pages contained within the DMA,
we can avoid that since devices generally work with at least cache
aligned buffers.

- Alex

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking
       [not found] ` <20160104204104.GB17427@char.us.oracle.com>
@ 2016-01-05  3:11   ` Alexander Duyck
  2016-01-05  9:40     ` Michael S. Tsirkin
  0 siblings, 1 reply; 35+ messages in thread
From: Alexander Duyck @ 2016-01-05  3:11 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Alexander Duyck, kvm, linux-pci, x86, linux-kernel, qemu-devel,
	Lan Tianyu, Yang Zhang, Michael S. Tsirkin,
	Dr. David Alan Gilbert, Alexander Graf, Alex Williamson

On Mon, Jan 4, 2016 at 12:41 PM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
> On Sun, Dec 13, 2015 at 01:28:09PM -0800, Alexander Duyck wrote:
>> This patch set is meant to be the guest side code for a proof of concept
>> involving leaving pass-through devices in the guest during the warm-up
>> phase of guest live migration.  In order to accomplish this I have added a
>
> What does that mean? 'warm-up-phase'?

It is the first phase in a pre-copy migration.
https://en.wikipedia.org/wiki/Live_migration

Basically in this phase all the memory is marked as dirty and then
copied.  Any memory that changes gets marked as dirty as well.
Currently DMA circumvents this as the user space dirty page tracking
isn't able to track DMA.

>> new function called dma_mark_dirty that will mark the pages associated with
>> the DMA transaction as dirty in the case of either an unmap or a
>> sync_.*_for_cpu where the DMA direction is either DMA_FROM_DEVICE or
>> DMA_BIDIRECTIONAL.  The pass-through device must still be removed before
>> the stop-and-copy phase, however allowing the device to be present should
>> significantly improve the performance of the guest during the warm-up
>> period.
>
> .. if the warm-up phase is short I presume? If the warm-up phase takes
> a long time (busy guest that is of 1TB size) it wouldn't help much as the
> tracking of these DMA's may be quite long?
>
>>
>> This current implementation is very preliminary and there are a number of
>> items still missing.  Specifically in order to make this a more complete
>> solution we need to support:
>> 1.  Notifying hypervisor that drivers are dirtying DMA pages received
>
> .. And somehow giving the hypervisor the GPFN so it can retain the PFN in
> the VT-d as long as possible.

Yes, what has happened is that the host went through and marked all
memory as read-only.  So trying to do any operation that requires
write access triggers a page fault which is then used by the host to
track pages that were dirtied.

>> 2.  Bypassing page dirtying when it is not needed.
>
> How would this work with the device doing DMA operations _after_ the migration?
> That is, the driver submits a DMA READ, migrates away, the device is unplugged,
> the VT-d context is torn down - the device does the DMA READ and gets a VT-d error...
>
> and what then? How should the device on the other host replay the DMA READ?

The device has to quiesce before the migration can occur.  We cannot
have any DMA mappings still open when we reach the stop-and-copy phase
of the migration.  The solution I have proposed here works for
streaming mappings but doesn't solve the case for things like
dma_alloc_coherent where a bidirectional mapping is maintained between
the CPU and the device.

>>
>> The two mechanisms referenced above would likely require coordination with
>> QEMU and as such are open to discussion.  I haven't attempted to address
>> them as I am not sure there is a consensus as of yet.  My personal
>> preference would be to add a vendor-specific configuration block to the
>> emulated pci-bridge interfaces created by QEMU that would allow us to
>> essentially extend shpc to support guest live migration with pass-through
>> devices.
>
> shpc?

That is kind of what I was thinking.  We basically need some mechanism
to allow for the host to ask the device to quiesce.  It has been
proposed to possibly even look at something like an ACPI interface
since I know ACPI is used by QEMU to manage hot-plug in the standard
case.

- Alex

* Re: [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking
  2016-01-05  3:11   ` Alexander Duyck
@ 2016-01-05  9:40     ` Michael S. Tsirkin
  2016-01-05 10:01       ` Dr. David Alan Gilbert
  2016-01-05 16:18       ` Alexander Duyck
  0 siblings, 2 replies; 35+ messages in thread
From: Michael S. Tsirkin @ 2016-01-05  9:40 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Konrad Rzeszutek Wilk, Alexander Duyck, kvm, linux-pci, x86,
	linux-kernel, qemu-devel, Lan Tianyu, Yang Zhang,
	Dr. David Alan Gilbert, Alexander Graf, Alex Williamson

On Mon, Jan 04, 2016 at 07:11:25PM -0800, Alexander Duyck wrote:
> >> The two mechanisms referenced above would likely require coordination with
> >> QEMU and as such are open to discussion.  I haven't attempted to address
> >> them as I am not sure there is a consensus as of yet.  My personal
> >> preference would be to add a vendor-specific configuration block to the
> >> emulated pci-bridge interfaces created by QEMU that would allow us to
> >> essentially extend shpc to support guest live migration with pass-through
> >> devices.
> >
> > shpc?
> 
> That is kind of what I was thinking.  We basically need some mechanism
> to allow for the host to ask the device to quiesce.  It has been
> proposed to possibly even look at something like an ACPI interface
> since I know ACPI is used by QEMU to manage hot-plug in the standard
> case.
> 
> - Alex


Start by using hot-unplug for this!

Really use your patch guest side, and write host side
to allow starting migration with the device, but
defer completing it.

So

1.- host tells guest to start tracking memory writes
2.- guest acks
3.- migration starts
4.- most memory is migrated
5.- host tells guest to eject device
6.- guest acks
7.- stop vm and migrate rest of state


It will already be a win since hot unplug after migration starts and
most memory has been migrated is better than hot unplug before migration
starts.

Then measure downtime and profile. Then we can look at ways
to quiesce device faster which really means step 5 is replaced
with "host tells guest to quiesce device and dirty (or just unmap!)
all memory mapped for write by device".

-- 
MST

* Re: [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking
  2016-01-05  9:40     ` Michael S. Tsirkin
@ 2016-01-05 10:01       ` Dr. David Alan Gilbert
  2016-01-05 10:35         ` Michael S. Tsirkin
  2016-01-05 16:18       ` Alexander Duyck
  1 sibling, 1 reply; 35+ messages in thread
From: Dr. David Alan Gilbert @ 2016-01-05 10:01 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, Konrad Rzeszutek Wilk, Alexander Duyck, kvm,
	linux-pci, x86, linux-kernel, qemu-devel, Lan Tianyu, Yang Zhang,
	Alexander Graf, Alex Williamson

* Michael S. Tsirkin (mst@redhat.com) wrote:
> On Mon, Jan 04, 2016 at 07:11:25PM -0800, Alexander Duyck wrote:
> > >> The two mechanisms referenced above would likely require coordination with
> > >> QEMU and as such are open to discussion.  I haven't attempted to address
> > >> them as I am not sure there is a consensus as of yet.  My personal
> > >> preference would be to add a vendor-specific configuration block to the
> > >> emulated pci-bridge interfaces created by QEMU that would allow us to
> > >> essentially extend shpc to support guest live migration with pass-through
> > >> devices.
> > >
> > > shpc?
> > 
> > That is kind of what I was thinking.  We basically need some mechanism
> > to allow for the host to ask the device to quiesce.  It has been
> > proposed to possibly even look at something like an ACPI interface
> > since I know ACPI is used by QEMU to manage hot-plug in the standard
> > case.
> > 
> > - Alex
> 
> 
> Start by using hot-unplug for this!
> 
> Really use your patch guest side, and write host side
> to allow starting migration with the device, but
> defer completing it.
> 
> So
> 
> 1.- host tells guest to start tracking memory writes
> 2.- guest acks
> 3.- migration starts
> 4.- most memory is migrated
> 5.- host tells guest to eject device
> 6.- guest acks
> 7.- stop vm and migrate rest of state
> 
> 
> It will already be a win since hot unplug after migration starts and
> most memory has been migrated is better than hot unplug before migration
> starts.
> 
> Then measure downtime and profile. Then we can look at ways
> to quiesce device faster which really means step 5 is replaced
> with "host tells guest to quiesce device and dirty (or just unmap!)
> all memory mapped for write by device".


Doing a hot-unplug is going to upset the guest's network stack's view
of the world; that's something we don't want to change.

Dave

> 
> -- 
> MST
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking
  2016-01-05 10:01       ` Dr. David Alan Gilbert
@ 2016-01-05 10:35         ` Michael S. Tsirkin
  2016-01-05 10:45           ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 35+ messages in thread
From: Michael S. Tsirkin @ 2016-01-05 10:35 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Alexander Duyck, Konrad Rzeszutek Wilk, Alexander Duyck, kvm,
	linux-pci, x86, linux-kernel, qemu-devel, Lan Tianyu, Yang Zhang,
	Alexander Graf, Alex Williamson

On Tue, Jan 05, 2016 at 10:01:04AM +0000, Dr. David Alan Gilbert wrote:
> * Michael S. Tsirkin (mst@redhat.com) wrote:
> > On Mon, Jan 04, 2016 at 07:11:25PM -0800, Alexander Duyck wrote:
> > > >> The two mechanisms referenced above would likely require coordination with
> > > >> QEMU and as such are open to discussion.  I haven't attempted to address
> > > >> them as I am not sure there is a consensus as of yet.  My personal
> > > >> preference would be to add a vendor-specific configuration block to the
> > > >> emulated pci-bridge interfaces created by QEMU that would allow us to
> > > >> essentially extend shpc to support guest live migration with pass-through
> > > >> devices.
> > > >
> > > > shpc?
> > > 
> > > That is kind of what I was thinking.  We basically need some mechanism
> > > to allow for the host to ask the device to quiesce.  It has been
> > > proposed to possibly even look at something like an ACPI interface
> > > since I know ACPI is used by QEMU to manage hot-plug in the standard
> > > case.
> > > 
> > > - Alex
> > 
> > 
> > Start by using hot-unplug for this!
> > 
> > Really use your patch guest side, and write host side
> > to allow starting migration with the device, but
> > defer completing it.
> > 
> > So
> > 
> > 1.- host tells guest to start tracking memory writes
> > 2.- guest acks
> > 3.- migration starts
> > 4.- most memory is migrated
> > 5.- host tells guest to eject device
> > 6.- guest acks
> > 7.- stop vm and migrate rest of state
> > 
> > 
> > It will already be a win since hot unplug after migration starts and
> > most memory has been migrated is better than hot unplug before migration
> > starts.
> > 
> > Then measure downtime and profile. Then we can look at ways
> > to quiesce device faster which really means step 5 is replaced
> > with "host tells guest to quiesce device and dirty (or just unmap!)
> > all memory mapped for write by device".
> 
> 
> Doing a hot-unplug is going to upset the guests network stacks view
> of the world; that's something we don't want to change.
> 
> Dave

It might but if you store the IP and restore it quickly
after migration e.g. using guest agent, as opposed to DHCP,
then it won't.

It allows calming the device down in a generic way,
specific drivers can then implement the fast quiesce.

> > 
> > -- 
> > MST
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking
  2016-01-05 10:35         ` Michael S. Tsirkin
@ 2016-01-05 10:45           ` Dr. David Alan Gilbert
  2016-01-05 10:59             ` Michael S. Tsirkin
  2016-01-05 11:05             ` Michael S. Tsirkin
  0 siblings, 2 replies; 35+ messages in thread
From: Dr. David Alan Gilbert @ 2016-01-05 10:45 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, Konrad Rzeszutek Wilk, Alexander Duyck, kvm,
	linux-pci, x86, linux-kernel, qemu-devel, Lan Tianyu, Yang Zhang,
	Alexander Graf, Alex Williamson

* Michael S. Tsirkin (mst@redhat.com) wrote:
> On Tue, Jan 05, 2016 at 10:01:04AM +0000, Dr. David Alan Gilbert wrote:
> > * Michael S. Tsirkin (mst@redhat.com) wrote:
> > > On Mon, Jan 04, 2016 at 07:11:25PM -0800, Alexander Duyck wrote:
> > > > >> The two mechanisms referenced above would likely require coordination with
> > > > >> QEMU and as such are open to discussion.  I haven't attempted to address
> > > > >> them as I am not sure there is a consensus as of yet.  My personal
> > > > >> preference would be to add a vendor-specific configuration block to the
> > > > >> emulated pci-bridge interfaces created by QEMU that would allow us to
> > > > >> essentially extend shpc to support guest live migration with pass-through
> > > > >> devices.
> > > > >
> > > > > shpc?
> > > > 
> > > > That is kind of what I was thinking.  We basically need some mechanism
> > > > to allow for the host to ask the device to quiesce.  It has been
> > > > proposed to possibly even look at something like an ACPI interface
> > > > since I know ACPI is used by QEMU to manage hot-plug in the standard
> > > > case.
> > > > 
> > > > - Alex
> > > 
> > > 
> > > Start by using hot-unplug for this!
> > > 
> > > Really use your patch guest side, and write host side
> > > to allow starting migration with the device, but
> > > defer completing it.
> > > 
> > > So
> > > 
> > > 1.- host tells guest to start tracking memory writes
> > > 2.- guest acks
> > > 3.- migration starts
> > > 4.- most memory is migrated
> > > 5.- host tells guest to eject device
> > > 6.- guest acks
> > > 7.- stop vm and migrate rest of state
> > > 
> > > 
> > > It will already be a win since hot unplug after migration starts and
> > > most memory has been migrated is better than hot unplug before migration
> > > starts.
> > > 
> > > Then measure downtime and profile. Then we can look at ways
> > > to quiesce device faster which really means step 5 is replaced
> > > with "host tells guest to quiesce device and dirty (or just unmap!)
> > > all memory mapped for write by device".
> > 
> > 
> > Doing a hot-unplug is going to upset the guests network stacks view
> > of the world; that's something we don't want to change.
> > 
> > Dave
> 
> It might but if you store the IP and restore it quickly
> after migration e.g. using guest agent, as opposed to DHCP,
> then it won't.

I thought if you hot-unplug then it will lose any outstanding connections
on that device.

> It allows calming the device down in a generic way,
> specific drivers can then implement the fast quiesce.

Except that if it breaks the guest networking it's useless.

Dave

> 
> > > 
> > > -- 
> > > MST
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking
  2016-01-05 10:45           ` Dr. David Alan Gilbert
@ 2016-01-05 10:59             ` Michael S. Tsirkin
  2016-01-05 11:03               ` Dr. David Alan Gilbert
  2016-01-05 11:06               ` Michael S. Tsirkin
  2016-01-05 11:05             ` Michael S. Tsirkin
  1 sibling, 2 replies; 35+ messages in thread
From: Michael S. Tsirkin @ 2016-01-05 10:59 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Alexander Duyck, Konrad Rzeszutek Wilk, Alexander Duyck, kvm,
	linux-pci, x86, linux-kernel, qemu-devel, Lan Tianyu, Yang Zhang,
	Alexander Graf, Alex Williamson

On Tue, Jan 05, 2016 at 10:45:25AM +0000, Dr. David Alan Gilbert wrote:
> * Michael S. Tsirkin (mst@redhat.com) wrote:
> > On Tue, Jan 05, 2016 at 10:01:04AM +0000, Dr. David Alan Gilbert wrote:
> > > * Michael S. Tsirkin (mst@redhat.com) wrote:
> > > > On Mon, Jan 04, 2016 at 07:11:25PM -0800, Alexander Duyck wrote:
> > > > > >> The two mechanisms referenced above would likely require coordination with
> > > > > >> QEMU and as such are open to discussion.  I haven't attempted to address
> > > > > >> them as I am not sure there is a consensus as of yet.  My personal
> > > > > >> preference would be to add a vendor-specific configuration block to the
> > > > > >> emulated pci-bridge interfaces created by QEMU that would allow us to
> > > > > >> essentially extend shpc to support guest live migration with pass-through
> > > > > >> devices.
> > > > > >
> > > > > > shpc?
> > > > > 
> > > > > That is kind of what I was thinking.  We basically need some mechanism
> > > > > to allow for the host to ask the device to quiesce.  It has been
> > > > > proposed to possibly even look at something like an ACPI interface
> > > > > since I know ACPI is used by QEMU to manage hot-plug in the standard
> > > > > case.
> > > > > 
> > > > > - Alex
> > > > 
> > > > 
> > > > Start by using hot-unplug for this!
> > > > 
> > > > Really use your patch guest side, and write host side
> > > > to allow starting migration with the device, but
> > > > defer completing it.
> > > > 
> > > > So
> > > > 
> > > > 1.- host tells guest to start tracking memory writes
> > > > 2.- guest acks
> > > > 3.- migration starts
> > > > 4.- most memory is migrated
> > > > 5.- host tells guest to eject device
> > > > 6.- guest acks
> > > > 7.- stop vm and migrate rest of state
> > > > 
> > > > 
> > > > It will already be a win since hot unplug after migration starts and
> > > > most memory has been migrated is better than hot unplug before migration
> > > > starts.
> > > > 
> > > > Then measure downtime and profile. Then we can look at ways
> > > > to quiesce device faster which really means step 5 is replaced
> > > > with "host tells guest to quiesce device and dirty (or just unmap!)
> > > > all memory mapped for write by device".
> > > 
> > > 
> > > Doing a hot-unplug is going to upset the guests network stacks view
> > > of the world; that's something we don't want to change.
> > > 
> > > Dave
> > 
> > It might but if you store the IP and restore it quickly
> > after migration e.g. using guest agent, as opposed to DHCP,
> > then it won't.
> 
> I thought if you hot-unplug then it will lose any outstanding connections
> on that device.
> 
> > It allows calming the device down in a generic way,
> > specific drivers can then implement the fast quiesce.
> 
> Except that if it breaks the guest networking it's useless.
> 
> Dave

Is hot unplug useless then?

> > 
> > > > 
> > > > -- 
> > > > MST
> > > --
> > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking
  2016-01-05 10:59             ` Michael S. Tsirkin
@ 2016-01-05 11:03               ` Dr. David Alan Gilbert
  2016-01-05 11:11                 ` Michael S. Tsirkin
  2016-01-05 11:06               ` Michael S. Tsirkin
  1 sibling, 1 reply; 35+ messages in thread
From: Dr. David Alan Gilbert @ 2016-01-05 11:03 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, Konrad Rzeszutek Wilk, Alexander Duyck, kvm,
	linux-pci, x86, linux-kernel, qemu-devel, Lan Tianyu, Yang Zhang,
	Alexander Graf, Alex Williamson

* Michael S. Tsirkin (mst@redhat.com) wrote:
> On Tue, Jan 05, 2016 at 10:45:25AM +0000, Dr. David Alan Gilbert wrote:
> > * Michael S. Tsirkin (mst@redhat.com) wrote:
> > > On Tue, Jan 05, 2016 at 10:01:04AM +0000, Dr. David Alan Gilbert wrote:
> > > > * Michael S. Tsirkin (mst@redhat.com) wrote:
> > > > > On Mon, Jan 04, 2016 at 07:11:25PM -0800, Alexander Duyck wrote:
> > > > > > >> The two mechanisms referenced above would likely require coordination with
> > > > > > >> QEMU and as such are open to discussion.  I haven't attempted to address
> > > > > > >> them as I am not sure there is a consensus as of yet.  My personal
> > > > > > >> preference would be to add a vendor-specific configuration block to the
> > > > > > >> emulated pci-bridge interfaces created by QEMU that would allow us to
> > > > > > >> essentially extend shpc to support guest live migration with pass-through
> > > > > > >> devices.
> > > > > > >
> > > > > > > shpc?
> > > > > > 
> > > > > > That is kind of what I was thinking.  We basically need some mechanism
> > > > > > to allow for the host to ask the device to quiesce.  It has been
> > > > > > proposed to possibly even look at something like an ACPI interface
> > > > > > since I know ACPI is used by QEMU to manage hot-plug in the standard
> > > > > > case.
> > > > > > 
> > > > > > - Alex
> > > > > 
> > > > > 
> > > > > Start by using hot-unplug for this!
> > > > > 
> > > > > Really use your patch guest side, and write host side
> > > > > to allow starting migration with the device, but
> > > > > defer completing it.
> > > > > 
> > > > > So
> > > > > 
> > > > > 1.- host tells guest to start tracking memory writes
> > > > > 2.- guest acks
> > > > > 3.- migration starts
> > > > > 4.- most memory is migrated
> > > > > 5.- host tells guest to eject device
> > > > > 6.- guest acks
> > > > > 7.- stop vm and migrate rest of state
> > > > > 
> > > > > 
> > > > > It will already be a win since hot unplug after migration starts and
> > > > > most memory has been migrated is better than hot unplug before migration
> > > > > starts.
> > > > > 
> > > > > Then measure downtime and profile. Then we can look at ways
> > > > > to quiesce device faster which really means step 5 is replaced
> > > > > with "host tells guest to quiesce device and dirty (or just unmap!)
> > > > > all memory mapped for write by device".
> > > > 
> > > > 
> > > > Doing a hot-unplug is going to upset the guests network stacks view
> > > > of the world; that's something we don't want to change.
> > > > 
> > > > Dave
> > > 
> > > It might but if you store the IP and restore it quickly
> > > after migration e.g. using guest agent, as opposed to DHCP,
> > > then it won't.
> > 
> > I thought if you hot-unplug then it will lose any outstanding connections
> > on that device.
> > 
> > > It allows calming the device down in a generic way,
> > > specific drivers can then implement the fast quiesce.
> > 
> > Except that if it breaks the guest networking it's useless.
> > 
> > Dave
> 
> Is hot unplug useless then?

As a migration hack, yes, unless it's paired with a second network device
as a redundant route.
To do what's being suggested here, it has to be done at the device level
and not be visible to the networking stack.

Dave

> 
> > > 
> > > > > 
> > > > > -- 
> > > > > MST
> > > > --
> > > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking
  2016-01-05 10:45           ` Dr. David Alan Gilbert
  2016-01-05 10:59             ` Michael S. Tsirkin
@ 2016-01-05 11:05             ` Michael S. Tsirkin
  2016-01-05 12:43               ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 35+ messages in thread
From: Michael S. Tsirkin @ 2016-01-05 11:05 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Alexander Duyck, Konrad Rzeszutek Wilk, Alexander Duyck, kvm,
	linux-pci, x86, linux-kernel, qemu-devel, Lan Tianyu, Yang Zhang,
	Alexander Graf, Alex Williamson

On Tue, Jan 05, 2016 at 10:45:25AM +0000, Dr. David Alan Gilbert wrote:
> * Michael S. Tsirkin (mst@redhat.com) wrote:
> > On Tue, Jan 05, 2016 at 10:01:04AM +0000, Dr. David Alan Gilbert wrote:
> > > * Michael S. Tsirkin (mst@redhat.com) wrote:
> > > > On Mon, Jan 04, 2016 at 07:11:25PM -0800, Alexander Duyck wrote:
> > > > > >> The two mechanisms referenced above would likely require coordination with
> > > > > >> QEMU and as such are open to discussion.  I haven't attempted to address
> > > > > >> them as I am not sure there is a consensus as of yet.  My personal
> > > > > >> preference would be to add a vendor-specific configuration block to the
> > > > > >> emulated pci-bridge interfaces created by QEMU that would allow us to
> > > > > >> essentially extend shpc to support guest live migration with pass-through
> > > > > >> devices.
> > > > > >
> > > > > > shpc?
> > > > > 
> > > > > That is kind of what I was thinking.  We basically need some mechanism
> > > > > to allow for the host to ask the device to quiesce.  It has been
> > > > > proposed to possibly even look at something like an ACPI interface
> > > > > since I know ACPI is used by QEMU to manage hot-plug in the standard
> > > > > case.
> > > > > 
> > > > > - Alex
> > > > 
> > > > 
> > > > Start by using hot-unplug for this!
> > > > 
> > > > Really use your patch guest side, and write host side
> > > > to allow starting migration with the device, but
> > > > defer completing it.
> > > > 
> > > > So
> > > > 
> > > > 1.- host tells guest to start tracking memory writes
> > > > 2.- guest acks
> > > > 3.- migration starts
> > > > 4.- most memory is migrated
> > > > 5.- host tells guest to eject device
> > > > 6.- guest acks
> > > > 7.- stop vm and migrate rest of state
> > > > 
> > > > 
> > > > It will already be a win since hot unplug after migration starts and
> > > > most memory has been migrated is better than hot unplug before migration
> > > > starts.
> > > > 
> > > > Then measure downtime and profile. Then we can look at ways
> > > > to quiesce device faster which really means step 5 is replaced
> > > > with "host tells guest to quiesce device and dirty (or just unmap!)
> > > > all memory mapped for write by device".
> > > 
> > > 
> > > Doing a hot-unplug is going to upset the guests network stacks view
> > > of the world; that's something we don't want to change.
> > > 
> > > Dave
> > 
> > It might but if you store the IP and restore it quickly
> > after migration e.g. using guest agent, as opposed to DHCP,
> > then it won't.
> 
> I thought if you hot-unplug then it will lose any outstanding connections
> on that device.

Which connections and which device?  TCP connections and an ethernet
device?  These are on different layers so of course you don't lose them.
Just do not change the IP address.

Some guests send a signal to applications to close connections
when all links go down. One can work around this
in a variety of ways.

> > It allows calming the device down in a generic way,
> > specific drivers can then implement the fast quiesce.
> 
> Except that if it breaks the guest networking it's useless.
> 
> Dave
> 
> > 
> > > > 
> > > > -- 
> > > > MST
> > > --
> > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking
  2016-01-05 10:59             ` Michael S. Tsirkin
  2016-01-05 11:03               ` Dr. David Alan Gilbert
@ 2016-01-05 11:06               ` Michael S. Tsirkin
  1 sibling, 0 replies; 35+ messages in thread
From: Michael S. Tsirkin @ 2016-01-05 11:06 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Alexander Duyck, Konrad Rzeszutek Wilk, Alexander Duyck, kvm,
	linux-pci, x86, linux-kernel, qemu-devel, Lan Tianyu, Yang Zhang,
	Alexander Graf, Alex Williamson

On Tue, Jan 05, 2016 at 12:59:54PM +0200, Michael S. Tsirkin wrote:
> On Tue, Jan 05, 2016 at 10:45:25AM +0000, Dr. David Alan Gilbert wrote:
> > * Michael S. Tsirkin (mst@redhat.com) wrote:
> > > On Tue, Jan 05, 2016 at 10:01:04AM +0000, Dr. David Alan Gilbert wrote:
> > > > * Michael S. Tsirkin (mst@redhat.com) wrote:
> > > > > On Mon, Jan 04, 2016 at 07:11:25PM -0800, Alexander Duyck wrote:
> > > > > > >> The two mechanisms referenced above would likely require coordination with
> > > > > > >> QEMU and as such are open to discussion.  I haven't attempted to address
> > > > > > >> them as I am not sure there is a consensus as of yet.  My personal
> > > > > > >> preference would be to add a vendor-specific configuration block to the
> > > > > > >> emulated pci-bridge interfaces created by QEMU that would allow us to
> > > > > > >> essentially extend shpc to support guest live migration with pass-through
> > > > > > >> devices.
> > > > > > >
> > > > > > > shpc?
> > > > > > 
> > > > > > That is kind of what I was thinking.  We basically need some mechanism
> > > > > > to allow for the host to ask the device to quiesce.  It has been
> > > > > > proposed to possibly even look at something like an ACPI interface
> > > > > > since I know ACPI is used by QEMU to manage hot-plug in the standard
> > > > > > case.
> > > > > > 
> > > > > > - Alex
> > > > > 
> > > > > 
> > > > > Start by using hot-unplug for this!
> > > > > 
> > > > > Really use your patch guest side, and write host side
> > > > > to allow starting migration with the device, but
> > > > > defer completing it.
> > > > > 
> > > > > So
> > > > > 
> > > > > 1.- host tells guest to start tracking memory writes
> > > > > 2.- guest acks
> > > > > 3.- migration starts
> > > > > 4.- most memory is migrated
> > > > > 5.- host tells guest to eject device
> > > > > 6.- guest acks
> > > > > 7.- stop vm and migrate rest of state
> > > > > 
> > > > > 
> > > > > It will already be a win since hot unplug after migration starts and
> > > > > most memory has been migrated is better than hot unplug before migration
> > > > > starts.
> > > > > 
> > > > > Then measure downtime and profile. Then we can look at ways
> > > > > to quiesce device faster which really means step 5 is replaced
> > > > > with "host tells guest to quiesce device and dirty (or just unmap!)
> > > > > all memory mapped for write by device".
> > > > 
> > > > 
> > > > Doing a hot-unplug is going to upset the guests network stacks view
> > > > of the world; that's something we don't want to change.
> > > > 
> > > > Dave
> > > 
> > > It might but if you store the IP and restore it quickly
> > > after migration e.g. using guest agent, as opposed to DHCP,
> > > then it won't.
> > 
> > I thought if you hot-unplug then it will lose any outstanding connections
> > on that device.
> > 
> > > It allows calming the device down in a generic way,
> > > specific drivers can then implement the fast quiesce.
> > 
> > Except that if it breaks the guest networking it's useless.
> > 
> > Dave
> 
> Is hot unplug useless then?

Actually I misunderstood the question; unplug does not
have to break guest networking.

> > > 
> > > > > 
> > > > > -- 
> > > > > MST
> > > > --
> > > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking
  2016-01-05 11:03               ` Dr. David Alan Gilbert
@ 2016-01-05 11:11                 ` Michael S. Tsirkin
  0 siblings, 0 replies; 35+ messages in thread
From: Michael S. Tsirkin @ 2016-01-05 11:11 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Alexander Duyck, Konrad Rzeszutek Wilk, Alexander Duyck, kvm,
	linux-pci, x86, linux-kernel, qemu-devel, Lan Tianyu, Yang Zhang,
	Alexander Graf, Alex Williamson

On Tue, Jan 05, 2016 at 11:03:38AM +0000, Dr. David Alan Gilbert wrote:
> * Michael S. Tsirkin (mst@redhat.com) wrote:
> > On Tue, Jan 05, 2016 at 10:45:25AM +0000, Dr. David Alan Gilbert wrote:
> > > * Michael S. Tsirkin (mst@redhat.com) wrote:
> > > > On Tue, Jan 05, 2016 at 10:01:04AM +0000, Dr. David Alan Gilbert wrote:
> > > > > * Michael S. Tsirkin (mst@redhat.com) wrote:
> > > > > > On Mon, Jan 04, 2016 at 07:11:25PM -0800, Alexander Duyck wrote:
> > > > > > > >> The two mechanisms referenced above would likely require coordination with
> > > > > > > >> QEMU and as such are open to discussion.  I haven't attempted to address
> > > > > > > >> them as I am not sure there is a consensus as of yet.  My personal
> > > > > > > >> preference would be to add a vendor-specific configuration block to the
> > > > > > > >> emulated pci-bridge interfaces created by QEMU that would allow us to
> > > > > > > >> essentially extend shpc to support guest live migration with pass-through
> > > > > > > >> devices.
> > > > > > > >
> > > > > > > > shpc?
> > > > > > > 
> > > > > > > That is kind of what I was thinking.  We basically need some mechanism
> > > > > > > to allow for the host to ask the device to quiesce.  It has been
> > > > > > > proposed to possibly even look at something like an ACPI interface
> > > > > > > since I know ACPI is used by QEMU to manage hot-plug in the standard
> > > > > > > case.
> > > > > > > 
> > > > > > > - Alex
> > > > > > 
> > > > > > 
> > > > > > Start by using hot-unplug for this!
> > > > > > 
> > > > > > Really use your patch guest side, and write host side
> > > > > > to allow starting migration with the device, but
> > > > > > defer completing it.
> > > > > > 
> > > > > > So
> > > > > > 
> > > > > > 1.- host tells guest to start tracking memory writes
> > > > > > 2.- guest acks
> > > > > > 3.- migration starts
> > > > > > 4.- most memory is migrated
> > > > > > 5.- host tells guest to eject device
> > > > > > 6.- guest acks
> > > > > > 7.- stop vm and migrate rest of state
> > > > > > 
> > > > > > 
> > > > > > It will already be a win since hot unplug after migration starts and
> > > > > > most memory has been migrated is better than hot unplug before migration
> > > > > > starts.
> > > > > > 
> > > > > > Then measure downtime and profile. Then we can look at ways
> > > > > > to quiesce device faster which really means step 5 is replaced
> > > > > > with "host tells guest to quiesce device and dirty (or just unmap!)
> > > > > > all memory mapped for write by device".
> > > > > 
> > > > > 
> > > > > Doing a hot-unplug is going to upset the guest's network stack's view
> > > > > of the world; that's something we don't want to change.
> > > > > 
> > > > > Dave
> > > > 
> > > > It might but if you store the IP and restore it quickly
> > > > after migration e.g. using guest agent, as opposed to DHCP,
> > > > then it won't.
> > > 
> > > I thought if you hot-unplug then it will lose any outstanding connections
> > > on that device.
> > > 
> > > > It allows calming the device down in a generic way,
> > > > specific drivers can then implement the fast quiesce.
> > > 
> > > Except that if it breaks the guest networking it's useless.
> > > 
> > > Dave
> > 
> > Is hot unplug useless then?
> 
> As a migration hack, yes,

That's based on the premise that it breaks connections, but it does not
have to.

> unless it's paired with a second network device
> as a redundant route.

You can do this too.

But this is not a must at all.

> To do what's being suggested here it's got to be done at the device level
> and not visible to the networking stack.
> 
> Dave

Need for this was never demonstrated.

> > 
> > > > 
> > > > > > 
> > > > > > -- 
> > > > > > MST
> > > > > --
> > > > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > > --
> > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking
  2016-01-05 11:05             ` Michael S. Tsirkin
@ 2016-01-05 12:43               ` Dr. David Alan Gilbert
  2016-01-05 13:16                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 35+ messages in thread
From: Dr. David Alan Gilbert @ 2016-01-05 12:43 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, Konrad Rzeszutek Wilk, Alexander Duyck, kvm,
	linux-pci, x86, linux-kernel, qemu-devel, Lan Tianyu, Yang Zhang,
	Alexander Graf, Alex Williamson

* Michael S. Tsirkin (mst@redhat.com) wrote:
> On Tue, Jan 05, 2016 at 10:45:25AM +0000, Dr. David Alan Gilbert wrote:
> > * Michael S. Tsirkin (mst@redhat.com) wrote:
> > > On Tue, Jan 05, 2016 at 10:01:04AM +0000, Dr. David Alan Gilbert wrote:
> > > > * Michael S. Tsirkin (mst@redhat.com) wrote:
> > > > > On Mon, Jan 04, 2016 at 07:11:25PM -0800, Alexander Duyck wrote:
> > > > > > >> The two mechanisms referenced above would likely require coordination with
> > > > > > >> QEMU and as such are open to discussion.  I haven't attempted to address
> > > > > > >> them as I am not sure there is a consensus as of yet.  My personal
> > > > > > >> preference would be to add a vendor-specific configuration block to the
> > > > > > >> emulated pci-bridge interfaces created by QEMU that would allow us to
> > > > > > >> essentially extend shpc to support guest live migration with pass-through
> > > > > > >> devices.
> > > > > > >
> > > > > > > shpc?
> > > > > > 
> > > > > > That is kind of what I was thinking.  We basically need some mechanism
> > > > > > to allow for the host to ask the device to quiesce.  It has been
> > > > > > proposed to possibly even look at something like an ACPI interface
> > > > > > since I know ACPI is used by QEMU to manage hot-plug in the standard
> > > > > > case.
> > > > > > 
> > > > > > - Alex
> > > > > 
> > > > > 
> > > > > Start by using hot-unplug for this!
> > > > > 
> > > > > Really use your patch guest side, and write host side
> > > > > to allow starting migration with the device, but
> > > > > defer completing it.
> > > > > 
> > > > > So
> > > > > 
> > > > > 1.- host tells guest to start tracking memory writes
> > > > > 2.- guest acks
> > > > > 3.- migration starts
> > > > > 4.- most memory is migrated
> > > > > 5.- host tells guest to eject device
> > > > > 6.- guest acks
> > > > > 7.- stop vm and migrate rest of state
> > > > > 
> > > > > 
> > > > > It will already be a win since hot unplug after migration starts and
> > > > > most memory has been migrated is better than hot unplug before migration
> > > > > starts.
> > > > > 
> > > > > Then measure downtime and profile. Then we can look at ways
> > > > > to quiesce device faster which really means step 5 is replaced
> > > > > with "host tells guest to quiesce device and dirty (or just unmap!)
> > > > > all memory mapped for write by device".
> > > > 
> > > > 
> > > > Doing a hot-unplug is going to upset the guest's network stack's view
> > > > of the world; that's something we don't want to change.
> > > > 
> > > > Dave
> > > 
> > > It might but if you store the IP and restore it quickly
> > > after migration e.g. using guest agent, as opposed to DHCP,
> > > then it won't.
> > 
> > I thought if you hot-unplug then it will lose any outstanding connections
> > on that device.
> 
> Which connections and which device?  TCP connections and an ethernet
> device?  These are on different layers so of course you don't lose them.
> Just do not change the IP address.
> 
> Some guests send a signal to applications to close connections
> when all links go down. One can work around this
> in a variety of ways.

So, OK, I was surprised that a simple connection didn't go down when
I tested and just removed the network card; I'd thought stuff was more
aggressive when there was no route.
But as you say, some stuff does close connections when the links go down/away
so we do need to work around that; and any new outgoing connections get
a 'no route to host'.  So I'm still nervous about what will break.

Dave

> 
> > > It allows calming the device down in a generic way,
> > > specific drivers can then implement the fast quiesce.
> > 
> > Except that if it breaks the guest networking it's useless.
> > 
> > Dave
> > 
> > > 
> > > > > 
> > > > > -- 
> > > > > MST
> > > > --
> > > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking
  2016-01-05 12:43               ` Dr. David Alan Gilbert
@ 2016-01-05 13:16                 ` Michael S. Tsirkin
  0 siblings, 0 replies; 35+ messages in thread
From: Michael S. Tsirkin @ 2016-01-05 13:16 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Alexander Duyck, Konrad Rzeszutek Wilk, Alexander Duyck, kvm,
	linux-pci, x86, linux-kernel, qemu-devel, Lan Tianyu, Yang Zhang,
	Alexander Graf, Alex Williamson

On Tue, Jan 05, 2016 at 12:43:03PM +0000, Dr. David Alan Gilbert wrote:
> * Michael S. Tsirkin (mst@redhat.com) wrote:
> > On Tue, Jan 05, 2016 at 10:45:25AM +0000, Dr. David Alan Gilbert wrote:
> > > * Michael S. Tsirkin (mst@redhat.com) wrote:
> > > > On Tue, Jan 05, 2016 at 10:01:04AM +0000, Dr. David Alan Gilbert wrote:
> > > > > * Michael S. Tsirkin (mst@redhat.com) wrote:
> > > > > > On Mon, Jan 04, 2016 at 07:11:25PM -0800, Alexander Duyck wrote:
> > > > > > > >> The two mechanisms referenced above would likely require coordination with
> > > > > > > >> QEMU and as such are open to discussion.  I haven't attempted to address
> > > > > > > >> them as I am not sure there is a consensus as of yet.  My personal
> > > > > > > >> preference would be to add a vendor-specific configuration block to the
> > > > > > > >> emulated pci-bridge interfaces created by QEMU that would allow us to
> > > > > > > >> essentially extend shpc to support guest live migration with pass-through
> > > > > > > >> devices.
> > > > > > > >
> > > > > > > > shpc?
> > > > > > > 
> > > > > > > That is kind of what I was thinking.  We basically need some mechanism
> > > > > > > to allow for the host to ask the device to quiesce.  It has been
> > > > > > > proposed to possibly even look at something like an ACPI interface
> > > > > > > since I know ACPI is used by QEMU to manage hot-plug in the standard
> > > > > > > case.
> > > > > > > 
> > > > > > > - Alex
> > > > > > 
> > > > > > 
> > > > > > Start by using hot-unplug for this!
> > > > > > 
> > > > > > Really use your patch guest side, and write host side
> > > > > > to allow starting migration with the device, but
> > > > > > defer completing it.
> > > > > > 
> > > > > > So
> > > > > > 
> > > > > > 1.- host tells guest to start tracking memory writes
> > > > > > 2.- guest acks
> > > > > > 3.- migration starts
> > > > > > 4.- most memory is migrated
> > > > > > 5.- host tells guest to eject device
> > > > > > 6.- guest acks
> > > > > > 7.- stop vm and migrate rest of state
> > > > > > 
> > > > > > 
> > > > > > It will already be a win since hot unplug after migration starts and
> > > > > > most memory has been migrated is better than hot unplug before migration
> > > > > > starts.
> > > > > > 
> > > > > > Then measure downtime and profile. Then we can look at ways
> > > > > > to quiesce device faster which really means step 5 is replaced
> > > > > > with "host tells guest to quiesce device and dirty (or just unmap!)
> > > > > > all memory mapped for write by device".
> > > > > 
> > > > > 
> > > > > Doing a hot-unplug is going to upset the guest's network stack's view
> > > > > of the world; that's something we don't want to change.
> > > > > 
> > > > > Dave
> > > > 
> > > > It might but if you store the IP and restore it quickly
> > > > after migration e.g. using guest agent, as opposed to DHCP,
> > > > then it won't.
> > > 
> > > I thought if you hot-unplug then it will lose any outstanding connections
> > > on that device.
> > 
> > Which connections and which device?  TCP connections and an ethernet
> > device?  These are on different layers so of course you don't lose them.
> > Just do not change the IP address.
> > 
> > Some guests send a signal to applications to close connections
> > when all links go down. One can work around this
> > in a variety of ways.
> 
> So, OK, I was surprised that a simple connection didn't go down when
> I tested and just removed the network card; I'd thought stuff was more
> aggressive when there was no route.
> But as you say, some stuff does close connections when the links go down/away
> so we do need to work around that; and any new outgoing connections get
> a 'no route to host'.


You can create a dummy device in guest for the duration of migration.
Use guest agent to move IP address there and that should be enough to trick most guests.


>  So I'm still nervous about what will break.
> 
> Dave

I'm not saying nothing breaks.  Far from it.  For example, some NAT
or firewall implementations keep state per interface and these might
lose state (if using NAT/stateful firewall within guest).


So yes it *would* be useful to teach guests, for example, that a device
is "not dead, just resting" and that another device will shortly come
and take its place.


But the simple setup is already useful and worth supporting, and merging
things gradually will help this project finally get off the ground.


> > 
> > > > It allows calming the device down in a generic way,
> > > > specific drivers can then implement the fast quiesce.
> > > 
> > > Except that if it breaks the guest networking it's useless.
> > > 
> > > Dave
> > > 
> > > > 
> > > > > > 
> > > > > > -- 
> > > > > > MST
> > > > > --
> > > > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > > --
> > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking
  2016-01-05  9:40     ` Michael S. Tsirkin
  2016-01-05 10:01       ` Dr. David Alan Gilbert
@ 2016-01-05 16:18       ` Alexander Duyck
  2016-06-06  9:18         ` Re: [Qemu-devel] " Zhou Jie
  1 sibling, 1 reply; 35+ messages in thread
From: Alexander Duyck @ 2016-01-05 16:18 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Konrad Rzeszutek Wilk, Alexander Duyck, kvm, linux-pci, x86,
	linux-kernel, qemu-devel, Lan Tianyu, Yang Zhang,
	Dr. David Alan Gilbert, Alexander Graf, Alex Williamson

On Tue, Jan 5, 2016 at 1:40 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Mon, Jan 04, 2016 at 07:11:25PM -0800, Alexander Duyck wrote:
>> >> The two mechanisms referenced above would likely require coordination with
>> >> QEMU and as such are open to discussion.  I haven't attempted to address
>> >> them as I am not sure there is a consensus as of yet.  My personal
>> >> preference would be to add a vendor-specific configuration block to the
>> >> emulated pci-bridge interfaces created by QEMU that would allow us to
>> >> essentially extend shpc to support guest live migration with pass-through
>> >> devices.
>> >
>> > shpc?
>>
>> That is kind of what I was thinking.  We basically need some mechanism
>> to allow for the host to ask the device to quiesce.  It has been
>> proposed to possibly even look at something like an ACPI interface
>> since I know ACPI is used by QEMU to manage hot-plug in the standard
>> case.
>>
>> - Alex
>
>
> Start by using hot-unplug for this!
>
> Really use your patch guest side, and write host side
> to allow starting migration with the device, but
> defer completing it.

Yeah, I'm fully on board with this idea, though I'm not really working
on this right now since last I knew the folks on this thread from
Intel were working on it.  My patches were mostly meant to be a nudge
in this direction so that we could get away from the driver specific
code.

> So
>
> 1.- host tells guest to start tracking memory writes
> 2.- guest acks
> 3.- migration starts
> 4.- most memory is migrated
> 5.- host tells guest to eject device
> 6.- guest acks
> 7.- stop vm and migrate rest of state
>

Sounds about right.  The only way this differs from what I see as the
final solution for this is that instead of fully ejecting the device
in step 5 the driver would instead pause the device and give the host
something like 10 seconds to stop the VM and resume with the same
device connected if it is available.  We would probably also need to
look at a solution that would force the device to be ejected or abort
prior to starting the migration if it doesn't give us the ack in step
2.

> It will already be a win since hot unplug after migration starts and
> most memory has been migrated is better than hot unplug before migration
> starts.

Right.  Generally, the longer the VF can be maintained as a part of the
guest, the more the network performance improves versus using a
purely virtual interface.

> Then measure downtime and profile. Then we can look at ways
> to quiesce device faster which really means step 5 is replaced
> with "host tells guest to quiesce device and dirty (or just unmap!)
> all memory mapped for write by device".

Step 5 will be the spot where we really need to start modifying
drivers.  Specifically we probably need to go through and clean-up
things so that we can reduce as many of the delays in the driver
suspend/resume path as possible.  I suspect there is quite a bit that
can be done there that would probably also improve boot and shutdown
times since those are also impacted by the devices.

- Alex

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Re: [Qemu-devel] [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking
  2016-01-05 16:18       ` Alexander Duyck
@ 2016-06-06  9:18         ` Zhou Jie
  2016-06-06 16:04           ` Alex Duyck
  0 siblings, 1 reply; 35+ messages in thread
From: Zhou Jie @ 2016-06-06  9:18 UTC (permalink / raw)
  To: Alexander Duyck, Michael S. Tsirkin
  Cc: Lan Tianyu, Yang Zhang, Alex Williamson, kvm,
	Konrad Rzeszutek Wilk, linux-pci, x86, linux-kernel, qemu-devel,
	Alexander Graf, Alexander Duyck, Dr. David Alan Gilbert

Hi Alex,

On 2016/1/6 0:18, Alexander Duyck wrote:
> On Tue, Jan 5, 2016 at 1:40 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> On Mon, Jan 04, 2016 at 07:11:25PM -0800, Alexander Duyck wrote:
>>>>> The two mechanisms referenced above would likely require coordination with
>>>>> QEMU and as such are open to discussion.  I haven't attempted to address
>>>>> them as I am not sure there is a consensus as of yet.  My personal
>>>>> preference would be to add a vendor-specific configuration block to the
>>>>> emulated pci-bridge interfaces created by QEMU that would allow us to
>>>>> essentially extend shpc to support guest live migration with pass-through
>>>>> devices.
>>>>
>>>> shpc?
>>>
>>> That is kind of what I was thinking.  We basically need some mechanism
>>> to allow for the host to ask the device to quiesce.  It has been
>>> proposed to possibly even look at something like an ACPI interface
>>> since I know ACPI is used by QEMU to manage hot-plug in the standard
>>> case.
>>>
>>> - Alex
>>
>>
>> Start by using hot-unplug for this!
>>
>> Really use your patch guest side, and write host side
>> to allow starting migration with the device, but
>> defer completing it.
>
> Yeah, I'm fully on board with this idea, though I'm not really working
> on this right now since last I knew the folks on this thread from
> Intel were working on it.  My patches were mostly meant to be a nudge
> in this direction so that we could get away from the driver specific
> code.

I have seen your email about live migration.

I summarize the idea you proposed as follows.
1. Extend swiotlb to allow for a page dirtying functionality.
2. Use the PCI Express capability to implement a PCI bridge to act
    as a bridge between direct assigned devices and the host bridge.
3. Use an ACPI event or extend the shpc driver to support device pause.
Is it right?

Will you implement the patches for live migration?

Sincerely,
Zhou Jie


>
>> So
>>
>> 1.- host tells guest to start tracking memory writes
>> 2.- guest acks
>> 3.- migration starts
>> 4.- most memory is migrated
>> 5.- host tells guest to eject device
>> 6.- guest acks
>> 7.- stop vm and migrate rest of state
>>
>
> Sounds about right.  The only way this differs from what I see as the
> final solution for this is that instead of fully ejecting the device
> in step 5 the driver would instead pause the device and give the host
> something like 10 seconds to stop the VM and resume with the same
> device connected if it is available.  We would probably also need to
> look at a solution that would force the device to be ejected or abort
> prior to starting the migration if it doesn't give us the ack in step
> 2.
>
>> It will already be a win since hot unplug after migration starts and
>> most memory has been migrated is better than hot unplug before migration
>> starts.
>
> Right.  Generally the longer the VF can be maintained as a part of the
> guest the longer the network performance is improved versus using a
> purely virtual interface.
>
>> Then measure downtime and profile. Then we can look at ways
>> to quiesce device faster which really means step 5 is replaced
>> with "host tells guest to quiesce device and dirty (or just unmap!)
>> all memory mapped for write by device".
>
> Step 5 will be the spot where we really need to start modifying
> drivers.  Specifically we probably need to go through and clean-up
> things so that we can reduce as many of the delays in the driver
> suspend/resume path as possible.  I suspect there is quite a bit that
> can be done there that would probably also improve boot and shutdown
> times since those are also impacted by the devices.
>
> - Alex
>
>
>
> .
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Re: [Qemu-devel] [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking
  2016-06-06  9:18         ` Re: [Qemu-devel] " Zhou Jie
@ 2016-06-06 16:04           ` Alex Duyck
  2016-06-09 10:14             ` Zhou Jie
  0 siblings, 1 reply; 35+ messages in thread
From: Alex Duyck @ 2016-06-06 16:04 UTC (permalink / raw)
  To: Zhou Jie
  Cc: Alexander Duyck, Michael S. Tsirkin, Lan Tianyu, Yang Zhang,
	Alex Williamson, kvm, Konrad Rzeszutek Wilk, linux-pci, x86,
	linux-kernel, qemu-devel, Alexander Graf, Dr. David Alan Gilbert

On Mon, Jun 6, 2016 at 2:18 AM, Zhou Jie <zhoujie2011@cn.fujitsu.com> wrote:
> Hi Alex,
>
>
> On 2016/1/6 0:18, Alexander Duyck wrote:
>>
>> On Tue, Jan 5, 2016 at 1:40 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>>>
>>> On Mon, Jan 04, 2016 at 07:11:25PM -0800, Alexander Duyck wrote:
>>>>>>
>>>>>> The two mechanisms referenced above would likely require coordination
>>>>>> with
>>>>>> QEMU and as such are open to discussion.  I haven't attempted to
>>>>>> address
>>>>>> them as I am not sure there is a consensus as of yet.  My personal
>>>>>> preference would be to add a vendor-specific configuration block to
>>>>>> the
>>>>>> emulated pci-bridge interfaces created by QEMU that would allow us to
>>>>>> essentially extend shpc to support guest live migration with
>>>>>> pass-through
>>>>>> devices.
>>>>>
>>>>>
>>>>> shpc?
>>>>
>>>>
>>>> That is kind of what I was thinking.  We basically need some mechanism
>>>> to allow for the host to ask the device to quiesce.  It has been
>>>> proposed to possibly even look at something like an ACPI interface
>>>> since I know ACPI is used by QEMU to manage hot-plug in the standard
>>>> case.
>>>>
>>>> - Alex
>>>
>>>
>>>
>>> Start by using hot-unplug for this!
>>>
>>> Really use your patch guest side, and write host side
>>> to allow starting migration with the device, but
>>> defer completing it.
>>
>>
>> Yeah, I'm fully on board with this idea, though I'm not really working
>> on this right now since last I knew the folks on this thread from
>> Intel were working on it.  My patches were mostly meant to be a nudge
>> in this direction so that we could get away from the driver specific
>> code.
>
>
> I have seen your email about live migration.
>
> I summarize the idea you proposed as follows.
> 1. Extend swiotlb to allow for a page dirtying functionality.
> 2. Use the PCI Express capability to implement a PCI bridge to act
>    as a bridge between direct assigned devices and the host bridge.
> 3. Use an ACPI event or extend the shpc driver to support device pause.
> Is it right?
>
> Will you implement the patches for live migration?

That is pretty much the heart of the proposal I had.  I submitted an
RFC as a proof-of-concept for item 1 in the hopes that someone else
might try tackling items 2 and 3 but I haven't seen any updates since
then.  The trick is to find a way to make it so that item 1 doesn't
slow down standard SWIOTLB when you are not migrating a VM. If nothing
else we would probably just need to add a static key that we could
default to false unless there is a PCI bridge indicating we are
starting a migration.

I haven't had time to really work on this though. In addition I am not
that familiar with QEMU and the internals of live migration so pieces
2 and 3 would take me some additional time to work on.

- Alex

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking
  2016-06-06 16:04           ` Alex Duyck
@ 2016-06-09 10:14             ` Zhou Jie
  2016-06-09 15:39               ` Alexander Duyck
  0 siblings, 1 reply; 35+ messages in thread
From: Zhou Jie @ 2016-06-09 10:14 UTC (permalink / raw)
  To: Alex Duyck, Michael S. Tsirkin
  Cc: Alexander Duyck, Lan Tianyu, Yang Zhang, Alex Williamson, kvm,
	Konrad Rzeszutek Wilk, linux-pci, x86, linux-kernel, qemu-devel,
	Alexander Graf, Dr. David Alan Gilbert, Izumi,
	Taku/泉 拓

TO Alex
TO Michael

    In your solution you add an emulated PCI bridge to act as
    a bridge between direct assigned devices and the host bridge.
    Do you mean putting all direct assigned devices under
    one emulated PCI bridge?
    If yes, this may bring some problems.

    We are writing a patchset to support aer feature in qemu.
    When assigning a vfio device with AER enabled, we must check whether
    the device supports a host bus reset (ie. hot reset) as this may be
    used by the guest OS in order to recover the device from an AER
    error.
    QEMU must therefore have the ability to perform a physical
    host bus reset using the existing vfio APIs in response to a virtual
    bus reset in the VM.
    A physical bus reset affects all of the devices on the host bus.
    Therefore all physical devices affected by a bus reset must be
    configured on the same virtual bus in the VM.
    And no device unaffected by the bus reset may
    be configured on the same virtual bus.

    http://lists.nongnu.org/archive/html/qemu-devel/2016-05/msg02989.html

Sincerely,
Zhou Jie

On 2016/6/7 0:04, Alex Duyck wrote:
> On Mon, Jun 6, 2016 at 2:18 AM, Zhou Jie <zhoujie2011@cn.fujitsu.com> wrote:
>> Hi Alex,
>>
>>
>> On 2016/1/6 0:18, Alexander Duyck wrote:
>>>
>>> On Tue, Jan 5, 2016 at 1:40 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>
>>>> On Mon, Jan 04, 2016 at 07:11:25PM -0800, Alexander Duyck wrote:
>>>>>>>
>>>>>>> The two mechanisms referenced above would likely require coordination
>>>>>>> with
>>>>>>> QEMU and as such are open to discussion.  I haven't attempted to
>>>>>>> address
>>>>>>> them as I am not sure there is a consensus as of yet.  My personal
>>>>>>> preference would be to add a vendor-specific configuration block to
>>>>>>> the
>>>>>>> emulated pci-bridge interfaces created by QEMU that would allow us to
>>>>>>> essentially extend shpc to support guest live migration with
>>>>>>> pass-through
>>>>>>> devices.
>>>>>>
>>>>>>
>>>>>> shpc?
>>>>>
>>>>>
>>>>> That is kind of what I was thinking.  We basically need some mechanism
>>>>> to allow for the host to ask the device to quiesce.  It has been
>>>>> proposed to possibly even look at something like an ACPI interface
>>>>> since I know ACPI is used by QEMU to manage hot-plug in the standard
>>>>> case.
>>>>>
>>>>> - Alex
>>>>
>>>>
>>>>
>>>> Start by using hot-unplug for this!
>>>>
>>>> Really use your patch guest side, and write host side
>>>> to allow starting migration with the device, but
>>>> defer completing it.
>>>
>>>
>>> Yeah, I'm fully on board with this idea, though I'm not really working
>>> on this right now since last I knew the folks on this thread from
>>> Intel were working on it.  My patches were mostly meant to be a nudge
>>> in this direction so that we could get away from the driver specific
>>> code.
>>
>>
>> I have seen your email about live migration.
>>
>> I summarize the idea you proposed as follows.
>> 1. Extend swiotlb to allow for a page dirtying functionality.
>> 2. Use the PCI Express capability to implement a PCI bridge to act
>>    as a bridge between direct assigned devices and the host bridge.
>> 3. Use an ACPI event or extend the shpc driver to support device pause.
>> Is it right?
>>
>> Will you implement the patches for live migration?
>
> That is pretty much the heart of the proposal I had.  I submitted an
> RFC as a proof-of-concept for item 1 in the hopes that someone else
> might try tackling items 2 and 3 but I haven't seen any updates since
> then.  The trick is to find a way to make it so that item 1 doesn't
> slow down standard SWIOTLB when you are not migrating a VM. If nothing
> else we would probably just need to add a static key that we could
> default to false unless there is a PCI bridge indicating we are
> starting a migration.
>
> I haven't had time to really work on this though. In addition I am not
> that familiar with QEMU and the internals of live migration so pieces
> 2 and 3 would take me some additional time to work on.
>
> - Alex
>
>
> .
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking
  2016-06-09 10:14             ` Zhou Jie
@ 2016-06-09 15:39               ` Alexander Duyck
  2016-06-12  3:03                 ` Zhou Jie
  0 siblings, 1 reply; 35+ messages in thread
From: Alexander Duyck @ 2016-06-09 15:39 UTC (permalink / raw)
  To: Zhou Jie
  Cc: Alex Duyck, Michael S. Tsirkin, Lan Tianyu, Yang Zhang,
	Alex Williamson, kvm, Konrad Rzeszutek Wilk, linux-pci,
	the arch/x86 maintainers, linux-kernel, qemu-devel,
	Alexander Graf, Dr. David Alan Gilbert, Izumi,
	Taku/泉 拓

On Thu, Jun 9, 2016 at 3:14 AM, Zhou Jie <zhoujie2011@cn.fujitsu.com> wrote:
> TO Alex
> TO Michael
>
>    In your solution you add an emulated PCI bridge to act as
>    a bridge between direct assigned devices and the host bridge.
>    Do you mean putting all direct assigned devices under
>    one emulated PCI bridge?
>    If yes, this may bring some problems.
>
>    We are writing a patchset to support aer feature in qemu.
>    When assigning a vfio device with AER enabled, we must check whether
>    the device supports a host bus reset (ie. hot reset) as this may be
>    used by the guest OS in order to recover the device from an AER
>    error.
>    QEMU must therefore have the ability to perform a physical
>    host bus reset using the existing vfio APIs in response to a virtual
>    bus reset in the VM.
>    A physical bus reset affects all of the devices on the host bus.
>    Therefore all physical devices affected by a bus reset must be
>    configured on the same virtual bus in the VM.
>    And no devices unaffected by the bus reset,
>    be configured on the same virtual bus.
>
>    http://lists.nongnu.org/archive/html/qemu-devel/2016-05/msg02989.html
>
> Sincerely,
> Zhou Jie

That makes sense, but I don't think you have to worry much about this
at this point, at least on my side, as this was mostly just theory and
I haven't had a chance to put any of it into practice yet.

My idea has been evolving on this for a while.  One thought I had is
that we may want to have something like an emulated IOMMU, and if
possible we would want to split it up over multiple domains just so we
can be certain that the virtual interfaces and the physical ones
exist in separate domains.  Regarding your concerns, perhaps what
we could do is put each assigned device into its own domain to prevent
the devices from affecting each other.  To that end we could probably
break things up so that each device effectively lives in its own PCIe
slot in the emulated system.  Then when we start a migration of the
guest, the assigned device domains would have to be tracked for unmap
and sync calls when the direction is from the device.

I will keep your concerns in mind in the future when I get some time
to look at exploring this solution further.

- Alex

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking
  2016-06-09 15:39               ` Alexander Duyck
@ 2016-06-12  3:03                 ` Zhou Jie
  2016-06-13  1:28                   ` Alexander Duyck
  0 siblings, 1 reply; 35+ messages in thread
From: Zhou Jie @ 2016-06-12  3:03 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alex Duyck, Michael S. Tsirkin, Lan Tianyu, Yang Zhang,
	Alex Williamson, kvm, Konrad Rzeszutek Wilk, linux-pci,
	the arch/x86 maintainers, linux-kernel, qemu-devel,
	Alexander Graf, Dr. David Alan Gilbert, Izumi,
	Taku/泉 拓

Hi, Alex

On 2016/6/9 23:39, Alexander Duyck wrote:
> On Thu, Jun 9, 2016 at 3:14 AM, Zhou Jie <zhoujie2011@cn.fujitsu.com> wrote:
>> TO Alex
>> TO Michael
>>
>>    In your solution you add an emulated PCI bridge to act as
>>    a bridge between directly assigned devices and the host bridge.
>>    Do you mean putting all directly assigned devices behind
>>    one emulated PCI bridge?
>>    If so, this may bring some problems.
>>
>>    We are writing a patchset to support the AER feature in QEMU.
>>    When assigning a vfio device with AER enabled, we must check whether
>>    the device supports a host bus reset (i.e. hot reset), as this may be
>>    used by the guest OS in order to recover the device from an AER
>>    error.
>>    QEMU must therefore have the ability to perform a physical
>>    host bus reset using the existing vfio APIs in response to a virtual
>>    bus reset in the VM.
>>    A physical bus reset affects all of the devices on the host bus.
>>    Therefore all physical devices affected by a bus reset must be
>>    configured on the same virtual bus in the VM,
>>    and no devices unaffected by the bus reset may be
>>    configured on that virtual bus.
>>
>>    http://lists.nongnu.org/archive/html/qemu-devel/2016-05/msg02989.html
>>
>> Sincerely,
>> Zhou Jie
>
> That makes sense, but I don't think you have to worry much about this
> at this point, at least on my side, as this was mostly just theory and
> I haven't had a chance to put any of it into practice yet.
>
> My idea has been evolving on this for a while.  One thought I had is
> that we may want to have something like an emulated IOMMU, and if
> possible we would want to split it up over multiple domains just so we
> can be certain that the virtual interfaces and the physical ones
> exist in separate domains.  Regarding your concerns, perhaps what
> we could do is put each assigned device into its own domain to prevent
> the devices from affecting each other.  To that end we could probably
> break things up so that each device effectively lives in its own PCIe
> slot in the emulated system.  Then when we start a migration of the
> guest, the assigned device domains would have to be tracked for unmap
> and sync calls when the direction is from the device.
>
> I will keep your concerns in mind in the future when I get some time
> to look at exploring this solution further.
>
> - Alex

I am thinking about the practicalities of migrating a passthrough
device.

In your solution, you use a vendor-specific configuration space to
negotiate with the guest.
If you put each assigned device into its own domain,
how can QEMU negotiate with the guest?
Would you add the vendor-specific configuration space to every PCI bus
to which a passthrough device is assigned?

Sincerely
Zhou Jie

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking
  2016-06-12  3:03                 ` Zhou Jie
@ 2016-06-13  1:28                   ` Alexander Duyck
  0 siblings, 0 replies; 35+ messages in thread
From: Alexander Duyck @ 2016-06-13  1:28 UTC (permalink / raw)
  To: Zhou Jie
  Cc: Alex Duyck, Michael S. Tsirkin, Lan Tianyu, Yang Zhang,
	Alex Williamson, kvm, Konrad Rzeszutek Wilk, linux-pci,
	the arch/x86 maintainers, linux-kernel, qemu-devel,
	Alexander Graf, Dr. David Alan Gilbert, Izumi,
	Taku/泉 拓

On Sat, Jun 11, 2016 at 8:03 PM, Zhou Jie <zhoujie2011@cn.fujitsu.com> wrote:
> Hi, Alex
>
>
> On 2016/6/9 23:39, Alexander Duyck wrote:
>>
>> On Thu, Jun 9, 2016 at 3:14 AM, Zhou Jie <zhoujie2011@cn.fujitsu.com>
>> wrote:
>>>
>>> TO Alex
>>> TO Michael
>>>
>>>    In your solution you add an emulated PCI bridge to act as
>>>    a bridge between directly assigned devices and the host bridge.
>>>    Do you mean putting all directly assigned devices behind
>>>    one emulated PCI bridge?
>>>    If so, this may bring some problems.
>>>
>>>    We are writing a patchset to support the AER feature in QEMU.
>>>    When assigning a vfio device with AER enabled, we must check whether
>>>    the device supports a host bus reset (i.e. hot reset), as this may be
>>>    used by the guest OS in order to recover the device from an AER
>>>    error.
>>>    QEMU must therefore have the ability to perform a physical
>>>    host bus reset using the existing vfio APIs in response to a virtual
>>>    bus reset in the VM.
>>>    A physical bus reset affects all of the devices on the host bus.
>>>    Therefore all physical devices affected by a bus reset must be
>>>    configured on the same virtual bus in the VM,
>>>    and no devices unaffected by the bus reset may be
>>>    configured on that virtual bus.
>>>
>>>    http://lists.nongnu.org/archive/html/qemu-devel/2016-05/msg02989.html
>>>
>>> Sincerely,
>>> Zhou Jie
>>
>>
>> That makes sense, but I don't think you have to worry much about this
>> at this point, at least on my side, as this was mostly just theory and
>> I haven't had a chance to put any of it into practice yet.
>>
>> My idea has been evolving on this for a while.  One thought I had is
>> that we may want to have something like an emulated IOMMU, and if
>> possible we would want to split it up over multiple domains just so we
>> can be certain that the virtual interfaces and the physical ones
>> exist in separate domains.  Regarding your concerns, perhaps what
>> we could do is put each assigned device into its own domain to prevent
>> the devices from affecting each other.  To that end we could probably
>> break things up so that each device effectively lives in its own PCIe
>> slot in the emulated system.  Then when we start a migration of the
>> guest, the assigned device domains would have to be tracked for unmap
>> and sync calls when the direction is from the device.
>>
>> I will keep your concerns in mind in the future when I get some time
>> to look at exploring this solution further.
>>
>> - Alex
>
>
> I am thinking about the practicalities of migrating a passthrough
> device.
>
> In your solution, you use a vendor-specific configuration space to
> negotiate with the guest.
> If you put each assigned device into its own domain,
> how can QEMU negotiate with the guest?
> Would you add the vendor-specific configuration space to every PCI bus
> to which a passthrough device is assigned?

This is kind of the direction I was thinking of heading in, so yes.
Basically, in my mind we should be emulating a PCIe hierarchy if we
want to support device assignment.  That way we can already make use
of things like hot-plug and AER natively.  So if we have a root port
assigned to each assigned device, we should be able to place some
extra logic there to handle things like this.

- Alex

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2016-06-13  1:28 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-12-13 21:28 [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking Alexander Duyck
2015-12-13 21:28 ` [RFC PATCH 1/3] swiotlb: Fold static unmap and sync calls into calling functions Alexander Duyck
2015-12-13 21:28 ` [RFC PATCH 2/3] xen/swiotlb: " Alexander Duyck
2015-12-13 21:28 ` [RFC PATCH 3/3] x86: Create dma_mark_dirty to dirty pages used for DMA by VM guest Alexander Duyck
2015-12-14 14:00   ` Michael S. Tsirkin
2015-12-14 16:34     ` Alexander Duyck
2015-12-14 17:20       ` Michael S. Tsirkin
2015-12-14 17:59         ` Alexander Duyck
2015-12-14 20:52           ` Michael S. Tsirkin
2015-12-14 22:32             ` Alexander Duyck
2015-12-14  2:27 ` [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking Yang Zhang
2015-12-14  4:54   ` Alexander Duyck
2015-12-14  5:22     ` Yang Zhang
2015-12-14  5:46       ` Alexander Duyck
2015-12-14  7:20         ` Yang Zhang
2015-12-14 14:02           ` Michael S. Tsirkin
     [not found] ` <20160104204104.GB17427@char.us.oracle.com>
2016-01-05  3:11   ` Alexander Duyck
2016-01-05  9:40     ` Michael S. Tsirkin
2016-01-05 10:01       ` Dr. David Alan Gilbert
2016-01-05 10:35         ` Michael S. Tsirkin
2016-01-05 10:45           ` Dr. David Alan Gilbert
2016-01-05 10:59             ` Michael S. Tsirkin
2016-01-05 11:03               ` Dr. David Alan Gilbert
2016-01-05 11:11                 ` Michael S. Tsirkin
2016-01-05 11:06               ` Michael S. Tsirkin
2016-01-05 11:05             ` Michael S. Tsirkin
2016-01-05 12:43               ` Dr. David Alan Gilbert
2016-01-05 13:16                 ` Michael S. Tsirkin
2016-01-05 16:18       ` Alexander Duyck
2016-06-06  9:18         ` Re: [Qemu-devel] " Zhou Jie
2016-06-06 16:04           ` Alex Duyck
2016-06-09 10:14             ` Zhou Jie
2016-06-09 15:39               ` Alexander Duyck
2016-06-12  3:03                 ` Zhou Jie
2016-06-13  1:28                   ` Alexander Duyck

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).